LShift, Optimizing loops in C for higher numerical throughput and for fun, here. The loop hackery goes from ~5 clock cycles per FLOP to about 0.5 cycles per FLOP. The article asserts that GCC cannot optimize the sum-of-squares loop where ICC can. Commenters point out that less loop hackery is required in Fortran: the absence of aliasing makes the Fortran compiler's job easier. All the loop hackery targets Sandy Bridge, so another factor of two in performance is harvestable on Haswell by getting down to 0.25 cycles per FLOP. Haswell has two FMA floating-point execution units per core, and each can issue a fused multiply-add on every clock.
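For readers who have not seen this kind of loop hackery, here is a minimal sketch of the general idea. It is not the article's code: the function names and the unroll factor of 4 are assumptions chosen for illustration. The naive loop carries one accumulator through every iteration, and a compiler that follows strict floating-point rules may not reorder those additions, which is a likely reason GCC leaves such a reduction alone unless given a flag such as -ffast-math. Splitting the sum across independent accumulators does the reordering by hand.

```c
#include <assert.h>
#include <stddef.h>

/* Naive sum of squares: a single accumulator forms one long
   dependency chain, so each add has to wait for the previous one. */
float sum_squares_naive(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * x[i];
    return acc;
}

/* Hand-unrolled variant with four independent accumulators.
   The factor 4 is an arbitrary illustrative choice, not a tuned value.
   The partial sums are combined in a different order than in the naive
   loop, so results can differ in the last bits for general inputs. */
float sum_squares_unrolled(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    size_t i = 0;

    /* Main loop: four independent multiply-add chains per iteration. */
    for (; i + 4 <= n; i += 4) {
        a0 += x[i]     * x[i];
        a1 += x[i + 1] * x[i + 1];
        a2 += x[i + 2] * x[i + 2];
        a3 += x[i + 3] * x[i + 3];
    }

    float acc = (a0 + a1) + (a2 + a3);

    /* Tail loop: leftover elements when n is not a multiple of 4. */
    for (; i < n; i++)
        acc += x[i] * x[i];

    return acc;
}
```

Both functions return the same value for inputs whose squares and partial sums are exactly representable in float. With independent accumulators the compiler is also free to map the four chains onto SIMD lanes without changing the arithmetic the source code asks for.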
C code I used for testing all the above loops is here. To rule out memory bandwidth issues as much as possible, I ran the tests on a bunch of vectors small enough to fit into L1 cache. Throughputs for a single core:

             SSE               AVX
naive:        1733.4 MFLOPS     1696.6 MFLOPS   // 1xCPU_CLOCK barrier for scalar instructions
horizontal:   5963.6 MFLOPS     9419.8 MFLOPS   // 4xCPU_CLOCK and 8xCPU_CLOCK for SSE and AVX
unrolled:    11264.8 MFLOPS    11496.6 MFLOPS
cached:      14253.7 MFLOPS    15086.5 MFLOPS
final:       17985.4 MFLOPS    18210.4 MFLOPS   // Both SSE and AVX settle at around 10xCPU_CLOCK
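The "xCPU_CLOCK" remarks can be checked against the table with simple arithmetic. The sketch below assumes, following the first remark, that the naive loop runs at roughly one FLOP per clock, so dividing any row by the naive row approximates FLOPs per clock. The function name is hypothetical and the clock frequency itself is not given in the source, so only ratios are computed.

```c
#include <assert.h>

/* Throughput of a variant relative to a baseline measurement.
   Assumption: if the baseline (the naive loop) sits at about one FLOP
   per clock, this ratio approximates FLOPs per clock for the variant. */
double relative_throughput(double mflops, double baseline_mflops)
{
    assert(baseline_mflops > 0.0);
    return mflops / baseline_mflops;
}
```

Calling relative_throughput(17985.4, 1733.4) for the SSE column gives about 10.4, and relative_throughput(18210.4, 1696.6) for the AVX column gives about 10.7, in line with the "around 10xCPU_CLOCK" remark on the final row.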