But my data does not arrive in vectorized quantities. How is Vectorized Black Scholes performance relevant to my problem?
Let’s assume your Black Scholes inputs arrive on a session/queue at consecutive microsecond arrival times 0, 1, ..., n. Assume that, processed as scalars on one core (using Dr. Johnson’s No Simpsons scalar code), each input takes 200 milliseconds of computation. The scalar computations then finish at 200ms, ~400ms, ..., ~(n+1)*200ms. Now, instead of using Dr. Johnson’s code, let’s use that Intel code on one core. We will dynamically aggregate all the Black Scholes inputs arriving on the queue and process them all at once. The input that arrives at 0 mics has to wait for the input arriving at n mics before the Intel code runs, because we need to know the size of the vector at execution time. So when do the vectorized computations finish? There is only 200ms of computation to do for the first input, so it will be done at 200ms + n mics, which is n mics slower than the scalar result above. When does the second input complete? Also at 200ms + n mics, which is ~200ms faster than the scalar result. Unless n is too large, the nth input will be ready at 200ms + n mics as well, instead of at ~(n+1)*200ms. As you would expect, vectorized code performance is superior to scalar code performance in the average case for this queuing example.
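To make the arithmetic in the queuing example concrete, here is a minimal Python sketch of the two completion schedules on one core. The function names and the choice n = 10 are hypothetical, picked only for illustration. The sketch also bakes in the same simplification as the example above: one vectorized call over the whole batch is assumed to cost the same 200ms as a single scalar evaluation, which only holds while n is not too large.

```python
COMPUTE_US = 200_000  # 200 ms per evaluation, expressed in microseconds


def scalar_finish_times_us(arrivals_us, compute_us=COMPUTE_US):
    """One core, inputs processed one at a time in arrival order."""
    finish = []
    core_free_at = 0
    for t in arrivals_us:
        start = max(t, core_free_at)   # wait for the input and for the core
        core_free_at = start + compute_us
        finish.append(core_free_at)
    return finish


def batched_finish_times_us(arrivals_us, compute_us=COMPUTE_US):
    """One core, wait for the last arrival, then run one vectorized call.

    Assumption: the vectorized call costs the same as one scalar call.
    """
    done = max(arrivals_us) + compute_us
    return [done] * len(arrivals_us)


n = 10                             # hypothetical batch: arrivals at 0, 1, ..., n mics
arrivals = list(range(n + 1))
scalar = scalar_finish_times_us(arrivals)
batched = batched_finish_times_us(arrivals)

print("first input :", scalar[0], "vs", batched[0])    # batched is n mics slower
print("second input:", scalar[1], "vs", batched[1])    # batched is ~200 ms faster
print("mean finish :", sum(scalar) / len(scalar), "vs", sum(batched) / len(batched))
```

Only the very first input pays for batching (n mics of waiting). Every later input skips the scalar queue entirely, which is why the average favors the vectorized run.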
But cores are cheap. I will just buy many cores and use Dr. Johnson’s No Simpsons scalar code. Why should I vectorize in that case?
As long as the differences in the input arrival times are small relative to the computation time, the vectorized code will have better average case performance than simply massively distributing the scalar code, up to a point. Just run the Intel vector code on all the cores you have. The best case for vectorized code performance is when the input arrival times are more or less simultaneous. The optimal vector length does decrease as the input inter-arrival time increases, though. If you can get enough cores that no input ever queues for processing (200ms / 1 mic = 200K cores in this example), then there is no obvious opportunity to vectorize profitably. The vectorized code can be run as scalar code or on vectors of varying length, depending on how you want to handle the queue processing.
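The trade-off against "just buy more cores" can be sketched the same way. The Python below is an illustrative model, not a benchmark. The helper names and the 1,000-input workload are hypothetical, and the vectorized call is again assumed to cost the same 200ms as one scalar evaluation. Scalar inputs are handed to whichever core frees up first.

```python
import heapq
import math

COMPUTE_US = 200_000  # 200 ms per evaluation, in microseconds


def cores_for_no_queuing(compute_us, inter_arrival_us):
    """Cores needed so that a free core is always waiting for the next input."""
    return math.ceil(compute_us / inter_arrival_us)


def scalar_mean_finish_us(arrivals_us, cores, compute_us=COMPUTE_US):
    """Mean finish time when scalar work is spread over `cores` cores."""
    free_at = [0] * cores              # min-heap of times each core becomes free
    total = 0
    for t in arrivals_us:
        start = max(t, heapq.heappop(free_at))
        finish = start + compute_us
        heapq.heappush(free_at, finish)
        total += finish
    return total / len(arrivals_us)


def batched_mean_finish_us(arrivals_us, compute_us=COMPUTE_US):
    """Mean finish time for one vectorized call on one core (assumed 200 ms)."""
    return max(arrivals_us) + compute_us


arrivals = list(range(1000))           # hypothetical: 1,000 inputs, 1 mic apart
print("cores for no queuing:", cores_for_no_queuing(COMPUTE_US, 1))
for cores in (1, 10, 100, 1000):
    print(cores, "scalar cores:", scalar_mean_finish_us(arrivals, cores))
print("1 vector core:", batched_mean_finish_us(arrivals))
```

In this model the scalar code only catches up with the single vectorized core once there is a core per outstanding input. At that point nothing queues and each input finishes 200ms after it arrives, so batching has nothing left to win.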