altechnative, Choice of Compilers – Part 1: x86, 2010, here. The xbit labs reviewer’s comment that the hype behind AVX2 has to be tempered by the reality that programmers will not rewrite their code to use AVX2 is astute. Curiously, if this observation is true, it means that a bunch of Wall Street analytics are nowhere near tracking Moore’s Law. Recall that if you worked in technology between 1950 and 2013, all you had to do was bet everything on a once-in-a-thousand-year phenomenon called Moore’s Law and things would pretty much work out for you. I remember one luminary getting up in front of an auditorium a couple of years ago and explaining that on his visit to the Computer Museum in Boston he learned that Moore’s Law did not work anymore. Probably he was confusing it with Dennard’s MOSFET scaling, here.
So, what do the results seem to say? Looking at the directly comparable results, PGCC’s performance is consistently about 12-17% behind GCC. It even almost catches up on the Pentium 4, being only 3% behind on that platform. Both GCC and PGCC seem to suffer when moving to a faster Pentium 4. Their performance drops by around 30% despite an increase in clock speed of 13%. This is a rather unimpressive result. ICC’s performance leaves GCC’s and PGCC’s performance in the realm of embarrassing. Not only does it outperform GCC and PGCC by between 2.63x-7.22x and 3.14x-7.47x respectively, but its performance increases on the Pentium 4 instead of decreasing as it does with GCC and PGCC. And not only does it increase, it even increases by 20% relative to the clock speed, despite the Pentium 4 Celeron having a much smaller Level 2 cache than the Pentium III in this test (128KB vs. 512KB). This is a truly impressive result. Not only does it show that ICC is a very good compiler, but it also shows that a lot of the perception of the Pentium 4’s poor performance comes from poorly performing compilers, rather than from badly designed hardware.
On the Core 2, PGCC appears to close the gap from being 7.5x slower on the Pentium 4 down to a marginally less embarrassing result of being only 6.2x slower than ICC. GCC remains safely ahead of PGCC in performance, but its lag behind ICC has increased (5.14x slower) compared to the Athlon XP and Pentium III results (2.63x and 3.44x respectively).
Sun’s compiler seems to be somewhat lacking in performance. Its results are roughly on par with GCC. Pathscale’s compiler, however, manages to beat GCC to 2nd place, behind ICC. This is quite a respectable achievement considering that parts of the PathCC compiler suite are based on GCC.
Looking at the logged compiler remarks, it is clear that ICC is gaining a massive advantage from knowing how to vectorize loops. Since the code in question uses only single precision floating point numbers, full vectorization performance is achieved on all of the tested processors. Just about all inner loops in the code vectorize. Even though Portland explain that PGCC currently lacks vectorizable versions of more complex math functions (e.g. sinf()), the compiler also failed to vectorize much simpler operations in loops, like addition and multiplication. Instead it chose to unroll the loops. The response from PGI support is that unrolling loops with small iteration counts is deemed to be faster than vectorizing them. This appears to be wrong, since both GCC and ICC vectorized these simple loops and outperformed PGCC.
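To make the kind of loop under discussion concrete, here is a minimal sketch in C. It is not taken from the benchmarked code; the function name scale_and_add and its parameters are made up for illustration. It shows a single precision inner loop that uses only multiplication and addition, which is the sort of loop an auto-vectorizing compiler would normally be expected to handle. The restrict qualifiers (C99) are an assumption added here so the compiler knows the arrays do not overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical inner loop: out[i] = a[i] * scale + b[i], all in single
 * precision. There are no function calls and no dependencies between
 * iterations, so each iteration can be computed independently. */
void scale_and_add(const float *restrict a, const float *restrict b,
                   float *restrict out, float scale, size_t n)
{
    assert(a && b && out);
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * scale + b[i];
}
```

Whether a given compiler vectorizes or unrolls such a loop depends on the compiler, its version and the options used. With recent GCC releases, for example, building at -O3 and adding -fopt-info-vec should report which loops were vectorized; the compilers and versions tested above may use different switches.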
Diagnostic output about loop vectorization did not seem to be available from SunCC, as the only similar switches seemed to be related to OpenMP parallelization which was not used in this test.
For those that don’t know what vectorization is all about – you may have heard of SIMD: Single Instruction Multiple Data. The idea is that instead of processing scalars (single values) one at a time, you can process whole vectors (i.e. arrays) of scalars simultaneously, provided the operations you are doing on them are the same. So, instead of processing four 32-bit floats one at a time, you pack them into a vector (if you have them in an array, or they are appropriately aligned in memory) and process the vector in parallel. x86 processors have had this capability for integer operations since the Pentium MMX, and for floating point operations since the Pentium III. Provided that the compiler knows how to convert loop operations on arrays into vector operations, the speed-up can be massive.
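As a rough illustration of the idea, the sketch below adds two float arrays twice: once one element at a time, and once four elements at a time using SSE intrinsics from xmmintrin.h. The function names add_scalar and add_simd are hypothetical, and the hand-written intrinsics only approximate what a vectorizing compiler might generate from the scalar loop. The SSE path is guarded by the __SSE__ macro, so on a target without SSE the second function simply falls back to the scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics: __m128 holds four 32-bit floats */
#endif

/* Scalar version: one addition per loop iteration. */
void add_scalar(const float *a, const float *b, float *out, size_t n)
{
    assert(a && b && out);
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Vector version: four additions per instruction where SSE is available. */
void add_simd(const float *a, const float *b, float *out, size_t n)
{
    assert(a && b && out);
    size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));  /* 4 sums at once */
    }
#endif
    for (; i < n; i++)  /* leftover elements, or all of them without SSE */
        out[i] = a[i] + b[i];
}
```

Both functions produce the same results. The unaligned load and store variants are used here so the sketch works with any array address; aligned data would allow the aligned variants to be used instead.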