Markus Puschel, How to Write Fast Numerical Code, Spring 2014, here. Not bad.
Bordawkar et.al., IBM, Believe it or Not! Multi -core CPUs Can Match GPU Perfromance for FLOP-intensive Application! here.
We evaluated the performance of a real-world image processing application on the latest GPU and commodity multi-core processors using a wide variety of parallelization techniques. A pthreads-based version of the application running on a dual quad-core Intel Xeon system was able to match nVidia 285 GPU performance. Using fully automatic compiler-driven auto-parallelization and optimization, a single Power7 processor was able to achieve performance better than that on the nVidia 285 GPU. This is a compelling productivity result, given the effort required to develop an equivalent high- performance CUDA implementation. Our results also conclusively demonstrate that, under certain conditions, it is possible for a program running on a multi-core processor to match or even beat the performance of an equivalent GPU program, even for a FLOP-intensive structured application. In future, we plan to compare performance of such applications on upcoming GPU architectures from AMD and nVidia, e.g., nVidia Fermi.