Black Scholes Evaluation on a Contemporary Microprocessor
We present a simple analysis of the expected Black Scholes runtime performance for competitive code running on a contemporary microprocessor in 2009. The Black Scholes formula used here is the closed form solution of the Black Scholes PDE for a European Call option. This BS formula should run in about 36 nanoseconds per valuation with a full set of risk outputs. A portfolio of 100,000 European Call options is estimated to run in about 4 milliseconds on a single core of a single contemporary microprocessor. To look at it another way, there is no more than 4 milliseconds of actual work for a single 4.7GHz core to perform in evaluating BS on a portfolio of 100,000 Calls. The analysis scales: a Monte Carlo with 50,000 simulated points per Call across the same portfolio, using the BS marks, should complete in a little under 200 seconds on a single contemporary microprocessor core.
Problem
Let’s look at the estimated time to run the Black Scholes formula for 100,000 European Calls. For simple mark to market the Black Scholes formula requires five doubles as input for each Call.
| Input Variable | Definition |
| --- | --- |
| t | Time to maturity |
| S | Spot |
| K | Strike |
| r | Risk free rate |
| sigma | Vol of log returns |
So for 100,000 Calls, each requiring 40 bytes of input (5 inputs * 8 bytes) we get 4MB of inputs.
The code implementing the Black Scholes formula evaluates a closed form expression depending on the five inputs and produces a series of six outputs.
| Output Variable | Definition |
| --- | --- |
| mtm | Mark to market |
| delta | Derivative wrt S |
| gamma | 2nd derivative wrt S |
| vega | Derivative wrt sigma |
| theta | Derivative wrt time to maturity |
| rho | Derivative wrt risk free rate |
So for 100,000 Calls we are going to generate 48 bytes of output per Call (6 outputs * 8 bytes), or 4.8 MB. Just by virtue of the input and output sizes being small, we know the BS evaluation can plausibly run with the floating point unit of the target microprocessor highly utilized. The two additional pieces of information we need are the specifics of the target microprocessor and the computational form of the Black Scholes formula. In the next two sections we assemble that information. This is not an IO bound computation, agreed?
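As a minimal sketch of the data layout assumed above (the struct names bs_input and bs_output are hypothetical, just to make the byte counts concrete):

    #include <assert.h>

    /* Per-Call inputs: 5 doubles = 40 bytes. */
    typedef struct {
        double t;     /* time to maturity   */
        double S;     /* spot               */
        double K;     /* strike             */
        double r;     /* risk free rate     */
        double sigma; /* vol of log returns */
    } bs_input;

    /* Per-Call outputs: 6 doubles = 48 bytes. */
    typedef struct {
        double mtm, delta, gamma, vega, theta, rho;
    } bs_output;

    int main(void) {
        /* 100,000 Calls: 4 MB of inputs and 4.8 MB of outputs. */
        assert(sizeof(bs_input)  == 40);
        assert(sizeof(bs_output) == 48);
        return 0;
    }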
Microprocessor Background
IBM published a specification for the 64 bit POWER6 in May 2007, see [1]. The processor is clocked at 4.7 GHz with 128KB of L1 cache and a 4MB private Level 2 cache per core (8MB total). The MASS library provides scalar and vectorized math functions and even clock cycle performance estimates, see [2]. For evaluating Black Scholes formulas there are several vector functions we are interested in, including those in the table below.
| Vector Function | Clocks per Element | Description |
| --- | --- | --- |
| vexp | 11.74 | exponential |
| vdiv | 7.88 | divide |
| vlog | 12.34 | log |
| vsqrt | 11.14 | square root |
The takeaway here is that for input vectors of length 1000 these functions cost only 2 to 3 nanoseconds per element at 4.7 GHz. So now you can probably guess where we are going with this analysis. We are going to look at the specific form of the Black Scholes equations and estimate:
1. The length of time to get the inputs on average
2. The length of time to perform the evaluation of a vectorized Black Scholes code and compute each of the outputs
3. The length of time needed to write the outputs on average
We have 4MB of inputs and 4.8MB of outputs, so we are going to make the handwaving assumption that all the inputs and outputs fit in the 8MB L2 cache. Recognize we could simply adjust the number of Calls so all the inputs and outputs strictly fit in the L2 cache and run multiple batches without substantially changing the average performance, given the rather crude assumptions we are employing.
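To make the cache-fit handwaving concrete (using 1 MB = 10^6 bytes, as in the sizing above), the full portfolio slightly overflows the L2, which is exactly why the batching caveat is there:

$$100{,}000 \times (40 + 48)\ \text{bytes} = 8.8\ \text{MB} > 8\ \text{MB}, \qquad \frac{8\ \text{MB}}{88\ \text{bytes per Call}} \approx 90{,}000\ \text{Calls per batch}.$$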
Patterson gives us an estimate of 200 clocks to get to DRAM from a 2006 vintage microprocessor [4]. Notice in Patterson's presentation he observes it is frequently better to recompute values than read them from memory (or a file); sort of an important point if you are coding floating point these days. Intel lists the following for the Pentium M:
| To Where | Latency (Clocks) |
| --- | --- |
| Register | 1 |
| L1 | 3 |
| L2 | 14 |
| Main Memory | 240 |
So, once we know the details of the computational form for Black Scholes we will have most of the data we need to make an estimate of the BS runtime on a portfolio of 100K Calls.
Black Scholes Formula Background
Black Scholes in computational form looks like this:
MTM(t, S, K, r, sigma) = S * N(d1) - K*exp(-r*t)*N(d2)
where
d1 = (ln(S/K) + (r + sigma*sigma/2)*t)/(sigma*sqrt(t)) and
d2 = d1 - sigma*sqrt(t).
Then see [6], from Abramowitz and Stegun, to get the computational form of the cumulative normal distribution function (the approximation below is for x >= 0; negative x is handled by symmetry in the code further down)
N(x) = 1 - n(x)*(b1*y + b2*y*y + b3*y*y*y + b4*y*y*y*y + b5*y*y*y*y*y)
where
n(x) = exp(-x*x/2)/sqrt(2*pi)
and y is actually a function of x so
y(x) = 1/(1 + 0.2316419*x)
and that is the whole computational story. Very roughly this cannot take long to run: about 26 multiplies, divides, and adds, a couple of exponential and square root valuations, and a natural log valuation, all driven by five input variables. Think in terms of the number of clocks in a microsecond. In one millionth of a second a contemporary microprocessor executes somewhere between 3 and 4.7 thousand operations (read adds or multiplies). In a scalar function each multiply may take 4 cycles, but in a vector pipeline you can get a multiply result per clock. When you write this code you are simply trying to value a handful of expensive operations (exp, divide, ln, sqrt, where expensive means 10-20 cycles) and a couple dozen arithmetic ops that retire once a cycle in the pipeline. If you are not familiar with math library code, just recall broadly that Chebyshev approximations will typically give you a reasonably low residual error fit for a modest degree polynomial. You already know you will not be needing even one millionth of a second to execute this BS valuation. Now you just want to know how many million BS evaluations your code could run in a second. Let's finish the rough analysis.
Runtime Estimates
All you need to see is that you can vectorize the entire BS formula for each valuation; you only have five doubles as inputs. Now just count the operations (multiplies and additions) and look up the expensive ones (the exps, divides, lns, square roots) in the MASS library table above. All the intermediate results and constants fit in the L1 cache for a batch of 1000 Calls (yes, handwaving, but very plausible). We are also going to assume the dependencies within the computation are insignificant. Loosely, we assume that, at a batch size of 1000, there is always computation that can be submitted to the execution pipeline, so there are no bubbles of idle processor time (less plausible; a completely accurate runtime accounting would require more detailed analysis). Finally, we are not accounting for the time to load from the L1 cache, instead assuming we always have register file access times (still less plausible, but probably not a big deal).
Let's assume a batch size of 1000 Calls so we match up with the MASS library performance data, and let's also assume all the inputs are in the L2 cache to start. Getting 1000 input vectors, or 40KB, into the 128KB L1 cache costs the 14 cycle latency on each of the L1 cache misses; let's estimate 1000 cycles of latency (the computation stalls waiting for inputs from L2) over 1000 input vectors of 5 doubles (let's assume we do not know how to prefetch at all). Let's be conservative and say that inputs require one clock per valuation since they are buffered. Let's argue that the writes will be overlapped with computation (some kind of write back cache) and cost no more than another clock per valuation of latency. So we are left with: the time for a single BS valuation is 2 clocks plus the length of time to perform the evaluation of a vectorized Black Scholes code and compute each of the outputs. A sketch of this batch structure appears below.
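As a minimal sketch of the batching structure assumed here (the function and array names are hypothetical; the inputs are laid out structure-of-arrays so a vectorized build could hand whole batches to the MASS routines vexp, vdiv, vlog, and vsqrt, while plain libm calls and an erfc-based N() stand in below to keep the sketch self-contained):

    #include <math.h>
    #include <stddef.h>

    #define N_CALLS 100000   /* portfolio size           */
    #define BATCH   1000     /* matches the MASS timings */

    /* One contiguous vector per input, which is what vector math routines want. */
    static double t_[N_CALLS], S_[N_CALLS], K_[N_CALLS], r_[N_CALLS], sig_[N_CALLS];
    static double mtm_[N_CALLS];

    /* Standard normal CDF via erfc from libm; the article's polynomial N(x)
       would be the vectorized replacement. */
    static double N(double x) { return 0.5 * erfc(-x / sqrt(2.0)); }

    /* Evaluate one batch of n Calls starting at offset lo. A vectorized build
       would replace the scalar log/exp/sqrt/divide calls below with one MASS
       vector call per line over the whole batch. */
    static void bs_batch(size_t lo, size_t n)
    {
        for (size_t i = lo; i < lo + n; ++i) {
            double sqrt_t = sqrt(t_[i]);                         /* vsqrt       */
            double d1 = (log(S_[i] / K_[i])                      /* vlog, vdiv  */
                         + (r_[i] + 0.5 * sig_[i] * sig_[i]) * t_[i])
                        / (sig_[i] * sqrt_t);
            double d2 = d1 - sig_[i] * sqrt_t;
            double disc = exp(-r_[i] * t_[i]);                   /* vexp        */
            mtm_[i] = S_[i] * N(d1) - K_[i] * disc * N(d2);
        }
    }

    /* Walk the portfolio in L2-resident batches of 1000; fill the input
       arrays first, then call this. */
    void price_portfolio(void)
    {
        for (size_t lo = 0; lo < N_CALLS; lo += BATCH)
            bs_batch(lo, BATCH);
    }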
From the bottom up then: the CDF coded up looks like this in scalar form. (This code abuses the notation for t: in the code it is just a temp variable, while in the text it is the time to maturity, sorry.)
    /* Cumulative normal distribution, Abramowitz & Stegun polynomial form.
       Requires <math.h> for exp(). */
    double N(const double x)
    {
        const double b1 =  0.319381530;
        const double b2 = -0.356563782;
        const double b3 =  1.781477937;
        const double b4 = -1.821255978;
        const double b5 =  1.330274429;
        const double p  =  0.2316419;
        const double c  =  0.39894228;   /* 1/sqrt(2*pi) */

        if (x >= 0.0) {
            double t = 1.0 / (1.0 + p * x);
            return 1.0 - c * exp(-x * x / 2.0) * t *
                   (t * (t * (t * (t * b5 + b4) + b3) + b2) + b1);
        }
        else {                           /* use the symmetry N(x) = 1 - N(-x) */
            double t = 1.0 / (1.0 - p * x);
            return c * exp(-x * x / 2.0) * t *
                   (t * (t * (t * (t * b5 + b4) + b3) + b2) + b1);
        }
    }
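A quick sanity check of this scalar N(), compiled together with the function above (the main() here is just an illustrative harness; the expected values are standard normal CDF facts):

    #include <stdio.h>

    double N(const double x); /* the Abramowitz & Stegun approximation above */

    int main(void)
    {
        /* The approximation is accurate to roughly 1e-7 in absolute error. */
        printf("N(0)     = %.7f (expect 0.5000000)\n",  N(0.0));
        printf("N(1.96)  = %.7f (expect ~0.9750021)\n", N(1.96));
        printf("N(-1.96) = %.7f (expect ~0.0249979)\n", N(-1.96));
        return 0;
    }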
With no heroism whatsoever the cumulative normal distribution function will cost:
a cycle for the Boolean on the if statement and
double t = 1.0 / ( 1.0 + p * x );
8 cycles for this vector divide and an additional cycle for the fused multiply-add in the denominator – a total of 9 clocks.
return (1.0 – c * exp( -x * x / 2.0 ) * t *
( t *( t * ( t * ( t * b5 + b4 ) + b3 ) + b2 ) + b1 ));
12 cycles for the vector exponential plus 5 cycles for the Horner's rule form of the polynomial with fused multiply-adds plus at most 5 cycles for the remaining operations – a total of 23 clocks. Let's round up and say the entire cumulative normal distribution function will cost 35 clocks. This particular scalar code will not get there, but a suitably vectorized and equivalent code would hit the performance mark of 35 clocks per element.
How much will d1 and d2 cost?
d1 = (ln(S/K) + (r + sigma*sigma/2)*t)/(sigma*sqrt(t))
d2 = d1 - sigma*sqrt(t)
Well, d2 will cost one additional cycle after computing d1, since sigma*sqrt(t) is already in hand. The vector log in d1 costs 13 cycles. S/K will cost 8 cycles using vdiv(). The sqrt(t) will cost 12 cycles. The reciprocal of sigma*sqrt(t) will cost another 8 cycles. All the remaining operations in d1 cost, say, 5 cycles. The total cost for computing d1 and d2 is less than 50 cycles.
So computing the mark starts from the 50 cycles needed to get d1 and d2.
MTM(t, S, K, r, sigma) = S * N(d1) - K*exp(-r*t)*N(d2)
The two evaluations of N() will cost 35 cycles apiece, or 70 cycles. The vector exponential will cost 12 cycles. All the remaining ops take no more than 5 cycles. The total count for the mark is less than 140 cycles, or perhaps 30 nanoseconds.
But wait, what about delta, gamma, and the rest of the greeks? By differentiating the BS formula appropriately we get:
delta = N(d1),
gamma = n(d1)/(S*sigma*sqrt(t)),
vega = S*n(d1)*sqrt(t),
theta = -(S*n(d1)*sigma)/(2*sqrt(t)) - r*K*exp(-r*t)*N(d2),
and
rho = K*t*exp(-r*t)*N(d2)
These are all built from subexpressions that were already computed in the evaluation of the MTM; let's call it 5 cycles per greek, or 25 cycles for all the greeks (a little handwaving, but there is not much new computation in the greeks, right?). A scalar reference sketch putting the mark and the greeks together follows.
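As a minimal scalar reference sketch of the full valuation (the name bs_greeks is hypothetical; it assumes the N() above, and a production version would be vectorized over batches as discussed):

    #include <math.h>

    double N(const double x); /* cumulative normal from the code above */

    /* Standard normal density n(x) = exp(-x*x/2)/sqrt(2*pi). */
    static double n_pdf(double x)
    {
        return exp(-0.5 * x * x) * 0.3989422804014327; /* 1/sqrt(2*pi) */
    }

    /* Mark and greeks for a European Call; every greek reuses d1, d2,
       n(d1), N(d1), N(d2), exp(-r*t), and sqrt(t) from the mark. */
    void bs_greeks(double t, double S, double K, double r, double sigma,
                   double *mtm, double *delta, double *gamma,
                   double *vega, double *theta, double *rho)
    {
        double sqrt_t = sqrt(t);
        double d1 = (log(S / K) + (r + 0.5 * sigma * sigma) * t) / (sigma * sqrt_t);
        double d2 = d1 - sigma * sqrt_t;
        double disc = exp(-r * t);
        double Nd1 = N(d1), Nd2 = N(d2), nd1 = n_pdf(d1);

        *mtm   = S * Nd1 - K * disc * Nd2;
        *delta = Nd1;
        *gamma = nd1 / (S * sigma * sqrt_t);
        *vega  = S * nd1 * sqrt_t;
        *theta = -(S * nd1 * sigma) / (2.0 * sqrt_t) - r * K * disc * Nd2;
        *rho   = K * t * disc * Nd2;
    }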
So the whole evaluation of a vectorized Black Scholes code, computing each of the outputs, takes 165 cycles, plus the 2 cycles for getting the inputs and outputs to and from the L2 cache; call it 170 cycles, or 36 nanoseconds. So for 100,000 Calls you expect the computation to finish in 3.6 milliseconds, i.e., in practice the dominant part of a contemporary BS computation will be reading the trades and market data from the database. By the time the option trade portfolio SQL query returns, the BS batch is done (assuming somewhat aggressive buffered input overlapping with BS computation) on a single core of a single processor purchased new in the last 24 months. In terms of Black Scholes valuations per second, close to 30 million per second is competitive in 2009.
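Spelling out the headline arithmetic at the 4.7 GHz clock:

$$\frac{170\ \text{cycles}}{4.7\ \text{GHz}} \approx 36\ \text{ns}, \qquad 100{,}000 \times 36\ \text{ns} \approx 3.6\ \text{ms}, \qquad \frac{1}{36\ \text{ns}} \approx 28\ \text{million valuations per second}.$$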
Conclusions
This BS formula should run in about 170 clocks per valuation, call it 36 nanoseconds per valuation with a full set of risk outputs, or 3.6 milliseconds for the portfolio of 100,000 Calls on a single core of a single processor. We anticipated vectorizing the computation a bit but assumed nothing executing in parallel, i.e., we only need one core to achieve these runtime performance targets. Note, the analysis doesn't look much different if the portfolio is composed of both Puts and Calls. Also notice we have assumed nothing about the composition of the series of inputs, such as finding sequences of evaluations where t is constant, etc. So the same analysis trick will basically work for as many inputs as you can throw at it. Say you have 50,000 Monte Carlo paths giving you variable t, r, S, and sigma with K constant: 50,000 paths for each of the 100,000 Calls at 36 nanoseconds per valuation is about 180 seconds, so the evaluations will finish in a little under 200 seconds on a single core of a single microprocessor without any further optimization (I think it is straightforward to pull the 200 second MC BS performance estimate down with a little further analysis). Finally, you can always just run this MC BS code on a grid or multicore and parallelize until you meet your elapsed time goals.
The IBM POWER is a fairly nice architecture to work with for floating point computations, but the methodology used here would be applicable to any contemporary microprocessor that seriously supports floating point computation. There is nothing special about using IBM POWER for the analysis other than it is easy to do with publicly available references. Obviously the headline performance estimate is going to depend on the target microprocessor core details, but a ballpark performance figure between 30 and 100 nanoseconds per BS valuation seems reasonable. That is between 150 and 500 operations per BS evaluation, with a large enough L1 cache so that there are not many expensive cache misses.
References
[1] http://www-03.ibm.com/press/us/en/presskit/21546.wss
[3] http://www.fz-juelich.de/jsc/datapool/jump/JUMP-Tuning-Guide-for-HPC-Applications-on-POWER6.pdf
[4] http://www.eecs.berkeley.edu/BEARS/presentations/06/Patterson.ppt
[5] http://lwn.net/Articles/252125/
[6] http://www.sitmo.com/doc/Calculating_the_Cumulative_Normal_Distribution