Noah Clemons, Intel, Intel Math Kernel Library Perspectives and Latest Advances, here. It really covers the proposed transition to Phi, hard. The slides were presented in April 2013 to the Linux Foundation folks. The second half of the slides demonstrates which compiler switches to flip to control numerical reproducibility and operation execution order. If you are on the buy side and going to the trouble and expense of optimizing for a given platform, I am not exactly sure why you care about this. You care about the correct answer, the stability of your computation, testability, and the known numerical error bounds you can preserve through optimization. Unless you are selling the code to other folks to execute on their unspecified machines, why do you care, beyond having relevant unit tests, about detailed backward-compatible numerical reproducibility? But the greatest thing I have seen so far this week is the Intel Math Kernel Library Link Line Advisor, here.
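To make the reproducibility point concrete: floating point addition is not associative, so anything that changes operation execution order (a compiler switch, a different vector width, a different thread count) can change the last bits of a result. The sketch below is plain Python rather than anything from the slides, and the function names are made up for illustration. It also shows the kind of tolerance-based unit test that is usually enough when you only care about a correct answer within a known error bound.

```python
import math

def sum_in_order(xs):
    """Accumulate left to right, the way a simple scalar loop would."""
    total = 0.0
    for x in xs:
        total += x
    return total

def sum_reversed(xs):
    """Same values, opposite accumulation order (stands in for a reordered reduction)."""
    return sum_in_order(list(reversed(xs)))

values = [0.1, 0.2, 0.3]
forward = sum_in_order(values)    # (0.1 + 0.2) + 0.3
backward = sum_reversed(values)   # (0.3 + 0.2) + 0.1

# The two orders disagree in the last bit, so a bitwise comparison fails ...
bitwise_equal = (forward == backward)

# ... while a unit test with an explicit relative tolerance passes.
within_tolerance = math.isclose(forward, backward, rel_tol=1e-12)
```

A test written against `within_tolerance` survives a change in execution order; a test written against `bitwise_equal` needs the reproducibility switches.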
Intel Math Kernel Library (VML) Performance and Accuracy Data, here. The LA (low accuracy mode) double precision vector exponential is good to 1.98*ulp and runs in 3.65 clocks per element on an i5-4670 at 2.3GHz. CdfNorm runs in 7.9 clocks. That’s what you get out of the box without thinking about anything. Just for perspective, there are shops running on average 5 cycles for a double precision floating point multiply, and you can just wait/invest another 2.9 cycles and get the cumulative normal distribution function evaluated to 1.46*ulp. You can see why the Eigen 3 folks made sure to hook up to VML. I probably haven’t seen VML benchmarks since v10.2. I’m guessing all floating point optimization roads lead to Intel Moscow at this point. If you look at the plots of cycles per element versus vector length, the knee of the performance curve looks to be at vectors between 10 and 20 elements long. So, if you are algo trading interest rate swaps (or anything else), that’s the performance sweet spot to aim for. Turn off Hyper-Threading and don’t boost the clock speed by selectively shutting down cores; just run all the cores at the fastest uniform clock speed available and you will be reasonably competitive. If you are running Monte Carlo you can take the vector lengths out to somewhere between 100 and 1000.
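For anyone unsure what a figure like 1.98*ulp measures, here is a minimal sketch of an ULP error check. It does not call VML; it uses Python's own `math.exp` as a stand-in for the function under test and the standard `decimal` module as the high-precision reference. The helper names `ulp_error` and `max_ulp_error_exp` are hypothetical, and `math.ulp` needs Python 3.9 or later.

```python
import math
from decimal import Decimal, getcontext

# 50 significant digits is far more than a double carries, so the
# Decimal result can serve as the "true" value.
getcontext().prec = 50

def ulp_error(computed: float, reference: Decimal) -> float:
    """Absolute error of `computed`, in units of one ULP at the reference value."""
    one_ulp = Decimal(math.ulp(float(reference)))  # spacing of doubles near the reference
    return float(abs(Decimal(computed) - reference) / one_ulp)

def max_ulp_error_exp(xs):
    """Worst ULP error of math.exp over the sample points `xs`."""
    return max(ulp_error(math.exp(x), Decimal(x).exp()) for x in xs)

# Sample points on [-5, 5]; a real accuracy report would sweep far more inputs.
samples = [i / 10 for i in range(-50, 51)]
worst = max_ulp_error_exp(samples)
```

Swapping `math.exp` for a wrapped vector routine would give the same kind of number the accuracy tables report, under the assumption that the tables measure maximum error against a higher-precision reference.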
Intel® Math Kernel Library 11.1
List of VML Functions
|Trigonometric||Hyperbolic||Power, Root||Exponential, Logarithmic||Arithmetic||Rounding||Special|