Agner’s CPU Blog, “Future instruction set: AVX-512”.
The size of the vector registers is extended from 256 bits (YMM registers) to 512 bits (ZMM registers). There is room for further extensions to at least 1024 bits (what will they be called?)
The number of vector registers is doubled to 32 registers in 64-bit mode. There will still be only 8 vector registers in 32-bit mode.
Eight new mask registers k0 – k7 allow masked and conditional operations. Most vector instructions can be masked so that they operate only on selected vector elements, while the remaining elements are either left unchanged or zeroed. This replaces the use of vector registers as masks.
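A scalar sketch of what the two masking modes mean, element by element. This is plain C++ emulating the semantics described above, not actual AVX-512 intrinsics; the function name `masked_add` and the 4-element width are illustrative choices:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar emulation of AVX-512 masking semantics for an element-wise add.
// Each mask bit selects whether the corresponding element receives the
// computed result (bit set) or is handled per the masking mode (bit clear):
// merge-masking leaves the destination element unchanged, zero-masking
// clears it to zero.
template <std::size_t N>
std::array<int, N> masked_add(std::array<int, N> dst,
                              const std::array<int, N>& a,
                              const std::array<int, N>& b,
                              std::uint32_t mask, bool zeroing) {
    for (std::size_t i = 0; i < N; ++i) {
        if (mask & (1u << i))
            dst[i] = a[i] + b[i];   // element selected: compute result
        else if (zeroing)
            dst[i] = 0;             // zero-masking: clear the element
        // else merge-masking: dst[i] keeps its previous value
    }
    return dst;
}
```

With mask 0b0101 only elements 0 and 2 are computed; merge-masking keeps the old destination values in elements 1 and 3, zero-masking clears them.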
Most vector instructions with a memory operand have an option for broadcasting a scalar operand.
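The effect of an embedded broadcast, again as a scalar C++ sketch rather than real intrinsics (`add_broadcast` is a hypothetical name): a single scalar read from the memory operand is replicated across all vector elements before the operation.

```cpp
#include <array>
#include <cstddef>

// Sketch of embedded broadcast: one scalar loaded from memory is
// replicated to every lane, then the element-wise operation is applied.
template <std::size_t N>
std::array<float, N> add_broadcast(const std::array<float, N>& v,
                                   const float* scalar /* memory operand */) {
    std::array<float, N> r{};
    for (std::size_t i = 0; i < N; ++i)
        r[i] = v[i] + *scalar;   // *scalar broadcast to every element
    return r;
}
```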
Floating point vector instructions have options for specifying the rounding mode and for suppressing exceptions.
There is a new addressing mode called compressed displacement. Where an instruction has a memory operand with a pointer and an 8-bit sign-extended displacement, the displacement is multiplied by the size of the operand. This makes it possible to address a larger interval with a single displacement byte, as long as the memory operands are properly aligned. In some cases this makes the instructions smaller, compensating for the longer prefix.
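The scaling rule is simple arithmetic: the encoded 8-bit displacement is multiplied by the operand size before use. A sketch (the helper name `effective_disp` is illustrative):

```cpp
#include <cstdint>

// Compressed displacement (disp8*N): the encoded signed 8-bit displacement
// is scaled by the memory operand size N to form the effective displacement.
std::int64_t effective_disp(std::int8_t disp8, int operand_size) {
    return static_cast<std::int64_t>(disp8) * operand_size;
}
```

With 64-byte (512-bit) operands, the one-byte displacement thus covers −8192 to +8128 bytes in 64-byte steps, instead of −128 to +127.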
More than 100 new instructions are added.
The 512-bit registers can do vector operations on 32-bit and 64-bit signed and unsigned integers and single and double precision floats, but unfortunately not on 8-bit and 16-bit integers.
Optimization manuals updated, 4 Sep. Heed the words of Professor Fog: “Note that these manuals are not for beginners.”
The optimization manuals at www.agner.org/optimize/#manuals have now been updated. The most important additions are:
- AMD Piledriver and Jaguar processors are now described in the microarchitecture manual and the instruction tables.
- Intel Ivy Bridge and Haswell processors are now described in the microarchitecture manual and the instruction tables.
- The micro-op cache of Intel processors is analyzed in more detail.
- The assembly manual has more information on the AVX2 instruction set.
- The C++ manual describes the use of my vector classes for writing parallel code.
Some interesting test results for the newly tested processors:
- Supports the new AVX2 instruction set, which adds 256-bit integer vectors and gather instructions
- Supports fused multiply-and-add instructions of the FMA3 type
- The cache port width is doubled to 256 bits: it can do two reads and one write per clock cycle.
- Cache bank conflicts have been removed
- The read and write buffers, register files, reorder buffer, and reservation station are all bigger than in previous processors
- There are more execution units and one more execution port than on previous processors. This makes a throughput of four instructions per clock cycle quite realistic in many cases.
- The throughput for not-taken branches is doubled to two not-taken branches per clock cycle, including fused branch instructions. The throughput for taken branches is largely unchanged.
- There are two execution units for floating point multiplication and fused multiply-and-add, but only one for floating point addition. This design appears suboptimal, since floating point code typically contains more additions than multiplications. But at least it enables Intel to boast a floating point performance of 32 FLOPS per clock cycle (two 256-bit FMA units × 8 single-precision elements × 2 operations per fused multiply-add).
- The fused multiply-and-add operation is the first case in the history of Intel processors of micro-ops having more than two input dependencies. Other instructions with more than two input dependencies are still split into two micro-ops, though. AMD processors don’t have this limitation.
- The delays for moving data between different execution units are in many cases smaller than on previous Intel processors.
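What makes a fused multiply-add different from a multiply followed by an add is that a·b + c is computed with a single rounding at the end. Portable C++ can demonstrate this via `std::fma` (which is required by the standard to round once) without needing FMA3 hardware:

```cpp
#include <cfloat>
#include <cmath>

// Residual of x*x that a plain multiplication rounds away. The result is
// nonzero only because std::fma computes x*x - p exactly and rounds once;
// computing x*x - p with separate operations would always give zero.
double mul_residual(double x) {
    double p = x * x;             // product rounded to double precision
    return std::fma(x, x, -p);    // exact x*x minus p, then one rounding
}
```

For x = 1 + 2⁻⁵², the exact square is 1 + 2⁻⁵¹ + 2⁻¹⁰⁴; the plain product rounds the 2⁻¹⁰⁴ term away, and `mul_residual` recovers it. This single-rounding property is what makes FMA useful for extended-precision tricks, not just for speed.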