Pink Iguana


More Black Scholes


Pink Iguana, Vectorized Black Scholes Code, here. I forgot that Intel published their Black Scholes code and we talked about it a little last June. I like the Intel code format better than just serially going through intrinsic function calls; it is a little easier to see what is going on. You can see the code is vectorized: the blocking variable int nopt is threaded through all the loops, so there are always plenty of independent operations to execute in parallel. Presumably the vector-length blocking is available through a global variable to those intrinsic function calls DIV(), LOG(), etc. Once you get the compiler to unroll your loops and keep the Fused Multiply Add execution units busy, you basically know the code should execute close to the theoretical max speed. That means you can simply count up the adds and multiplies in each of your loops and function calls to get a runtime estimate for the code. If you do not know how many adds and multiplies an inverse square root requires, Intel or the intrinsics library provider will typically publish cycle counts for each of their intrinsic functions. If you know how many floating point execution units the microprocessor has, you can estimate how many fused multiply add cycles and how many intrinsic cycles your code will need, and from that the theoretical max execution speed. This code is not going to miss in L1 very much, so you will not see much of a cache effect on the runtime. Given a capable and well-matched compiler, microprocessor, and intrinsics library, the entire performance game here depends on how large the variable nopt can be. If nopt can be large, the execution performance will be nearly optimal. If nopt cannot be very large, this code will not retire the floating point operations at the advertised rates, as in Hager's plot of Mflops versus vector length.

One other thing to note when comparing this code to Dr. Johnson’s is range checking of the inputs. Johnson’s code has the advantage that it can put conditionals in the execution path with performance effectively unaltered. If you put conditionals in tightly vectorized code, you end up with bubbles in the execution pipeline, which degrades performance. These contemporary microprocessors are little superscalar, superpipelined parallel machines, and competitively performing code has to take advantage of that. The Intel code has to assume that all the inputs have been checked beforehand, so that neither the answer accuracy nor the performance optimality has to absorb a divide-by-zero exception or a branch (even one predicted correctly), for example. Sometimes, for vectorized code execution, folks are content if they know there is a bound on the number of exceptions resulting from bad data. Then they can write an exception handler and amortize the performance penalty of the infrequent exception processing over the balance of the code execution.

void BlackScholesFormula( int nopt, tfloat r, tfloat sig,
                          tfloat s0[], tfloat x[], tfloat t[],
                          tfloat vcall[], tfloat vput[] )
{
    int j;

    vmlSetMode( VML_EP );

    DIV(s0, x, Div);
    LOG(Div, Log);

    for ( j = 0; j < nopt; j++ ) {
        tr[j]    = t[j] * r;
        tss[j]   = t[j] * sig_2;
        tss05[j] = tss[j] * HALF;
        mtr[j]   = -tr[j];
    }

    EXP(mtr, Exp);
    INVSQRT(tss, InvSqrt);

    for ( j = 0; j < nopt; j++ ) {
        w1[j] = (Log[j] + tr[j] + tss05[j]) * InvSqrt[j] * INV_SQRT2;
        w2[j] = (Log[j] + tr[j] - tss05[j]) * InvSqrt[j] * INV_SQRT2;
    }

    ERF(w1, w1);
    ERF(w2, w2);

    for ( j = 0; j < nopt; j++ ) {
        w1[j] = HALF + HALF * w1[j];
        w2[j] = HALF + HALF * w2[j];

        vcall[j] = s0[j] * w1[j] - x[j] * Exp[j] * w2[j];
        vput[j]  = vcall[j] - s0[j] + x[j] * Exp[j];
    }
}


