Kamil Rocki and Martin Burtscher, HPC Wire, The Future of Accelerator Programming, here. I think it is a safe bet that hacking SIMD code to competitive performance levels is not going to get easier. Maybe that is a good thing?
To predict how accelerator programming might develop from this point forward, it may be helpful to study how other acceleration hardware has evolved in the past. For example, some early high-end PCs contained an extra chip, called a co-processor, to accelerate floating-point (FP) calculations. Later, this co-processor was combined with the CPU on the same die and is now fully integrated with the integer processing core. Only separate FP registers and ALUs remain. The much more recently added SIMD support (including MMX, SSE, AltiVec, and AVX) did not start out on a separate chip but is now also fully integrated in the core. Just like the floating-point instructions, SIMD instructions operate on separate registers and ALUs.
Interestingly, the programmer’s view of these two types of instructions is surprisingly different. The floating-point operations and data types have been standardized long ago (IEEE 754) and are now ubiquitous. They are directly available in high-level languages through normal arithmetic operations and built-in 32-bit single-precision and 64-bit double-precision types. In contrast, no standard exists for SIMD instructions, and their existence is largely hidden from programmers. It is left to the compiler to ‘vectorize’ the code and employ these instructions. Developers wishing to use SIMD instructions explicitly have to resort to compiler-specific macros that are not portable.
Since GPUs and MICs obtain their high performance through SIMD-like execution, we believe accelerators are more likely to track the evolution of SIMD- than FP-instruction support. Another similarity to SIMD, and a key factor that made CUDA so successful, is that CUDA hides the SIMD aspect of the GPU hardware and allows the programmer to think in terms of individual threads operating on scalar data elements rather than warps operating on vectors. Hence, accelerators will undoubtedly also be moved onto the CPU chip, but we surmise that their code will not be seamlessly interwoven with CPU code nor will the accelerators’ hardware-supported data types be made explicitly available to the programmers.