Black Scholes 2014
Here is the Black Scholes equation. In 2014, competitive code on a single off-the-shelf microprocessor can execute this computation 150 to 200 million times a second. Who does this matter to? If you are running an algorithmic trading engine or smart order router (SOR) and Black Scholes is in your inner loop, between processing market data/order book and executing via the exchange gateway/SOR, then you probably do not want to run noncompetitive code. There are folks who say microseconds count. If you are running a Monte Carlo simulator with Black Scholes code evaluating the randomly generated scenarios across a large inventory, you probably need competitive code just to get the simulation to stabilize and converge.
Anyway here is the Black Scholes closed form equation for a call option:
call price = S*N(d1) - K*exp(-r*(T - t))*N(d2)
d1 = (ln(S/K) + (r + sigma**2 * 0.5)*(T - t)) / (sigma*sqrt(T - t))
d2 = d1 - sigma*sqrt(T - t)
• N() is the cumulative distribution function of the standard normal distribution
• T – t is the time to maturity
• S is the spot price of the underlying asset
• K is the strike price
• r is the risk free rate (annual rate, expressed in terms of continuous compounding)
• sigma is the volatility of returns of the underlying asset
What is competitive performance for Black Scholes valuation in late 2013? How do you generate a reasonable estimate, so you know when to stop optimizing/tinkering? The most straightforward way to get a competitive execution time estimate is to count the required multiplies and additions. We could do that directly from the equations above, making some assumptions about common subexpression elimination and caching. But it is easier for now to simply analyze Intel’s code for Black Scholes (below). Once we know the number of multiplies and additions (or equivalently the instruction cycle count) we can estimate how many execution cycles we will need on a given microprocessor executing this code. Since we can look up the clock speed of the given microprocessor, we can back into an estimate of the time per Black Scholes valuation. The last time we did this analysis, in 2009, here, we found that a 2007 vintage microprocessor (IBM POWER 6) needed 170 cycles for the valuation and input sensitivities. That was about 36 nanoseconds per valuation, or about 30 million Black Scholes valuations per second. What has changed in competitive Black Scholes performance in the past five years?
Here is Intel code implementing the Black Scholes equation valuation.
/* Note: j, sig_2, HALF, INV_SQRT2, the temporary arrays (Div, Log, tr, tss,
   tss05, mtr, Exp, InvSqrt, w1, w2) and the DIV/LOG/EXP/INVSQRT/ERF vector
   macros are not declared in this excerpt. */
void BlackScholesFormula( int nopt, tfloat r, tfloat sig,
                          tfloat s0[], tfloat x[], tfloat t[],
                          tfloat vcall[], tfloat vput[] )
{
    vmlSetMode( VML_EP );

    DIV(s0, x, Div);                 /* Div = s0 / x       */
    LOG(Div, Log);                   /* Log = ln(s0 / x)   */

    for ( j = 0; j < nopt; j++ ) {   // loop 1
        tr [j]   = t[j] * r;
        tss[j]   = t[j] * sig_2;
        tss05[j] = tss[j] * HALF;
        mtr[j]   = -tr[j];           /* minus r*t, for the discount factor */
    }

    EXP(mtr, Exp);                   /* Exp = exp(-r*t)            */
    INVSQRT(tss, InvSqrt);           /* InvSqrt = 1/(sig*sqrt(t))  */

    for ( j = 0; j < nopt; j++ ) {   // loop 2
        w1[j] = (Log[j] + tr[j] + tss05[j]) * InvSqrt[j] * INV_SQRT2;
        w2[j] = (Log[j] + tr[j] - tss05[j]) * InvSqrt[j] * INV_SQRT2;
    }

    ERF(w1, w1);
    ERF(w2, w2);

    for ( j = 0; j < nopt; j++ ) {   // loop 3
        w1[j] = HALF + HALF * w1[j]; /* N(d1) */
        w2[j] = HALF + HALF * w2[j]; /* N(d2) */
        vcall[j] = s0[j] * w1[j] - x[j] * Exp[j] * w2[j];
        vput[j]  = vcall[j] - s0[j] + x[j] * Exp[j];
    }
}
Let’s estimate the theoretical competitive performance of this code. Notice a couple of things in the Intel code that shave off a few cycles. The risk free rate and the volatility are assumed to be scalars; they don’t vary across the portfolio of call options. On the other hand, the code always computes both a put and a call option price, as opposed to a single portfolio position. We will assume double precision, although single precision would be significantly faster. We will assume LA rather than EP for VML execution (the listing itself sets VML_EP). We will add some cycles to the estimates coming from this code to account for input sensitivities, but otherwise leave the code as is. That will make these estimates directly comparable to the 2009 estimates. We need some current cycle counts from Intel VML, here. We use the counts labeled Intel® Core™ i5-4670T Processor, base clocked at 2.3 GHz with turbo to 3.3 GHz.
Code          Cycles/Element   Accuracy (ULP)
----------    --------------   --------------
Div()         3.02             3.10
Log()         6.16             0.80
loop 1        2                -
Exp()         3.65             1.98
InvSqrt()     3.70             1.42
loop 2        3                -
ERF() x 2     6.05*2           1.33
loop 3        4                -
----------    --------------   --------------
Total         37.63
The sensitivities drop out from differentiating the Black Scholes formula. We have lots of common subexpressions:
delta = N(d1),
gamma = N'(d1)/(S*sigma*sqrt(t)),
vega = S*N'(d1)*sqrt(t),
theta = -(S*N'(d1)*sigma)/(2*sqrt(t)) - r*K*exp(-r*t)*N(d2), and
rho = K*t*exp(-r*t)*N(d2)
where N'() is the standard normal density and t here is the time to maturity.
This looks like 6 or maybe 7 cycles of computation; remember you have two FMA execution units running on each clock tick. I don’t see a way to argue away the 3 cycles for the divide in the gamma. Let’s say 44 cycles all in for this estimate on a single core. So a full Black Scholes valuation is executed about every 13 nanoseconds @ 3.3 GHz and about every 19 nanoseconds @ 2.3 GHz, if there is no L1/L2 cache pressure on the core. This code is not going to generate many extra cache misses.
The i5-4670T has four cores. Let’s assume you can trivially use three of them @ 2.3 GHz while the OS uses the fourth. You can get to 150 million full Black Scholes valuations per second on one i5-4670T; if you can somehow get to execute on that fourth core you could crack 200 million per second. It does not look like Turbo Boost is going to get you 4 cores running at 3.3 GHz. Maybe overclocking the 4670 will let you crack 200 million full Black Scholes valuations per second. I’d estimate competitive performance in late 2013 is 150 to 200 million full Black Scholes valuations per second on a single i5-4670T at $213 a pop, up from 30 million per second in 2009. More or less what you would expect from code tracking Moore’s Law.
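The arithmetic behind these estimates is short enough to write down. Here is a small sketch in C, with helper names of our own choosing, that converts a cycle count and a clock speed into time per valuation and valuations per second. It assumes perfect scaling across cores and no cache or memory stalls, which is the same back-of-envelope assumption used in the text.

```c
#include <assert.h>

/* Nanoseconds per valuation: cycles divided by clock speed in GHz
 * (1 GHz = 1 cycle per nanosecond). */
double ns_per_valuation(double cycles, double ghz)
{
    assert(cycles > 0.0 && ghz > 0.0);
    return cycles / ghz;
}

/* Valuations per second across several cores, assuming each core runs
 * independently at the given clock with no cache pressure. */
double valuations_per_second(double cycles, double ghz, int cores)
{
    assert(cycles > 0.0 && ghz > 0.0 && cores > 0);
    return cores * ghz * 1e9 / cycles;
}
```

With 44 cycles per valuation, ns_per_valuation gives about 13.3 ns at 3.3 GHz and about 19.1 ns at 2.3 GHz. valuations_per_second(44, 2.3, 3) is about 157 million, and with all four cores at 2.3 GHz it is about 209 million, which is where the 150 to 200 million range comes from.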