Pink Iguana

Loop Hackery


LShift, Optimizing loops in C for higher numerical throughput and for fun, here. Loop hackery that goes from ~5 clock cycles per FLOP to about 0.5 cycles per FLOP. The post asserts that GCC cannot optimize the sum-of-squares loop where ICC can. Comments point out that less loop hackery is required in Fortran: the absence of pointer aliasing makes the Fortran compiler's job easier. All the loop hackery targets Sandy Bridge, so there is another factor of two in performance harvestable on Haswell by getting down to 0.25 cycles per FLOP. Haswell gives you two FMA floating-point execution units, both usable on every clock.
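To make the loop hackery concrete, here is a minimal sketch of one such transformation on a sum-of-squares loop. It is not the LShift code: the function names and the choice of four accumulators are assumptions made for illustration. The idea is that a single accumulator forces every add to wait for the previous one, while several independent accumulators let the floating-point units overlap work.

```c
#include <stddef.h>

/* Naive version: one accumulator, so each add depends on the previous add. */
float sum_squares_naive(const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * x[i];
    return acc;
}

/* Unrolled version (illustrative): four independent accumulators break the
 * single dependency chain. The unroll factor of 4 is an assumption, not a
 * tuned value. */
float sum_squares_unrolled(const float *x, size_t n)
{
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    size_t i = 0;

    /* Main loop: process four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        a0 += x[i]     * x[i];
        a1 += x[i + 1] * x[i + 1];
        a2 += x[i + 2] * x[i + 2];
        a3 += x[i + 3] * x[i + 3];
    }

    /* Tail loop: handle the remaining 0 to 3 elements. */
    for (; i < n; i++)
        a0 += x[i] * x[i];

    /* Combine the partial sums. */
    return (a0 + a1) + (a2 + a3);
}
```

Note that the unrolled version adds the terms in a different order, so for general inputs its result can differ from the naive one in the last bits. A compiler is not allowed to make that reordering on its own under strict floating-point rules, which is one reason this kind of rewrite ends up being done by hand.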

The C code I used for testing all the above loops is here. To rule out memory bandwidth issues as much as possible, I ran the tests on a bunch of vectors small enough to fit into L1 cache. Throughputs for a single core:

                SSE              AVX
naive:       1733.4 MFLOPS    1696.6 MFLOPS    // 1xCPU_CLOCK barrier for scalar instructions
horizontal:  5963.6 MFLOPS    9419.8 MFLOPS    // 4xCPU_CLOCK and 8xCPU_CLOCK for SSE and AVX
unrolled:   11264.8 MFLOPS   11496.6 MFLOPS
cached:     14253.7 MFLOPS   15086.5 MFLOPS
final:      17985.4 MFLOPS   18210.4 MFLOPS    // Both SSE and AVX settle at around 10xCPU_CLOCK
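The multiples of CPU_CLOCK in the comments come from dividing throughput by clock frequency. A small sketch of that arithmetic follows. The helper names are hypothetical, and the clock frequency of the test machine is not stated here, so any clock value passed in is an assumption.

```c
/* Cycles spent per floating-point operation, given the clock in MHz and the
 * measured throughput in MFLOPS. Both names are hypothetical helpers. */
double cycles_per_flop(double clock_mhz, double mflops)
{
    return clock_mhz / mflops;
}

/* Ratio of two measured throughputs, e.g. final vs. naive. */
double speedup(double baseline_mflops, double optimized_mflops)
{
    return optimized_mflops / baseline_mflops;
}
```

For example, `speedup(1733.4, 17985.4)` comes out at about 10.4 for the SSE column, in line with the "around 10xCPU_CLOCK" comment if the naive loop sits at roughly one FLOP per clock.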
