Colfax Research, 11 May 2016, Guide to Automatic Vectorization with Intel AVX-512 Instructions in Knights Landing Processors, here.
This publication is part of a developer guide focusing on the new features in 2nd generation Intel® Xeon Phi™ processors code-named Knights Landing (KNL). In this document, we focus on the new vector instruction set introduced in Knights Landing processors, Intel® Advanced Vector Extensions 512 (Intel® AVX-512). The discussion includes:
Introduction to vector instructions in general,
The structure and specifics of AVX-512, and
Practical usage tips: checking if a processor has support for various features, compilation process and compiler arguments, and pros and cons of explicit and automatic vectorization using the Intel® C++ Compiler and the GNU Compiler Collection.
Capabilities of Intel AVX-512 in Intel Xeon Scalable Processors (Skylake), here.
This paper reviews the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set and answers two critical questions:
- How do Intel® Xeon® Scalable processors based on the Skylake architecture (2017) compare to their predecessors based on Broadwell due to AVX-512?
- How are Intel Xeon processors based on Skylake different from their alternative, Intel® Xeon Phi™ processors with the Knights Landing architecture, which also feature AVX-512?
We address these questions from the programmer’s perspective by demonstrating C language code of microkernels benefitting from AVX-512. For each example, we dig deeper and analyze the compilation practices, resultant assembly, and optimization reports.
In addition to code studies, the paper contains performance measurements for a synthetic benchmark with guidelines on estimating peak performance. In conclusion, we outline the workloads and application domains that can benefit from the new features of AVX-512 instructions.
GNU, Fast Scatter Gather IO
13.6 Fast Scatter-Gather I/O
Some applications may need to read or write data to multiple buffers, which are separated in memory. Although this can be done easily enough with multiple calls to write, it is inefficient because there is overhead associated with each kernel call.
Instead, many platforms provide special high-speed primitives to perform these scatter-gather operations in a single kernel call. The GNU C Library will provide an emulation on any system that lacks these primitives, so they are not a portability threat. They are defined in sys/uio.h.
These functions are controlled with arrays of iovec structures, which describe the location and size of each buffer.
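As a minimal sketch of the gather-write case described above, the following C fragment fills a two-element array of iovec structures and hands both buffers to the kernel in one writev call. The helper name write_two_buffers and its header/body parameters are assumptions made for illustration and do not come from the GNU manual; only struct iovec, its iov_base and iov_len members, and writev are the library's own.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper (not part of the GNU C Library): writes two
   separate NUL-terminated buffers to fd with a single writev call.
   Returns the number of bytes written, or -1 on error. */
ssize_t write_two_buffers(int fd, const char *header, const char *body)
{
    struct iovec iov[2];

    /* Each iovec entry describes the location and size of one buffer. */
    iov[0].iov_base = (void *)header;   /* writev only reads from the buffer */
    iov[0].iov_len  = strlen(header);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len  = strlen(body);

    /* One kernel call gathers both buffers, in array order. */
    return writev(fd, iov, 2);
}
```

A caller might invoke write_two_buffers(STDOUT_FILENO, "hello, ", "world\n"). Like write, writev may transfer fewer bytes than requested, so production code would normally compare the return value against the total length and retry the remainder.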
Anderson, Malik, Gregg, 19 Jan 2016, HiPEAC, Prague, Automatic Vectorization of Interleaved Data Revisited, here.
Mike Shiffman, 17 Apr 2018, Farsight Security Blog, Google Protocol Buffer Deserialization The Hard Way, here.