
More Black Scholes

Shuo Li, Intel, Case Study: Computing Black-Scholes with Intel Advanced Vector Extensions, here. They like EP (VML's lowest-accuracy Enhanced Performance mode), which makes the finance folk nervous. Nice paper, though. They show 279MM Black-Scholes evaluations per second using EP. The estimates of 150MM-200MM evals/sec in double precision are consistent with this benchmark number and the posted VML cycle counts. The error upper bounds are much stronger in double and single precision than in EP (as used in this benchmark).

After taking care of memory alignment, we are ready to choose a vectorization method for our Black-Scholes implementation. To minimize the amount of work and maintain portability, we take the user-initiated, semiautomatic approach: the programmer tells the compiler that a loop must be vectorized by marking it with #pragma simd. The compiler does everything it can to vectorize the loop and generates an error if it cannot vectorize a loop so marked. This differs from the older model, in which the programmer merely provides a hint via #pragma ivdep and the compiler decides whether vectorized code would outperform the serial version, emitting vector code only when it expects a win. With #pragma simd, it is the programmer's responsibility to ensure that vectorization overhead does not exceed the speedup gained.

for (int i = 0; i < NUM_ITERATIONS; i++)
#pragma simd
    for (int opt = 0; opt < OPT_N; opt++)
    {
        float CallVal = 0.0f, PutVal = 0.0f;
        float T = OptionYears[opt];
        float X = OptionStrike[opt];
        float S = StockPrice[opt];
        float sqrtT = sqrtf(T);
        float d1 = (logf(S / X) + (RISKFREE + 0.5f * VOLATILITY * VOLATILITY) * T)
                 / (VOLATILITY * sqrtT);
        float d2 = d1 - VOLATILITY * sqrtT;
        float CNDD1 = HALF + HALF * erff(INV_SQRT2 * d1);
        float CNDD2 = HALF + HALF * erff(INV_SQRT2 * d2);
        float expRT = expf(-RISKFREE * T);
        CallVal += S * CNDD1 - X * expRT * CNDD2;
        PutVal  += CallVal + X * expRT - S;  /* put-call parity: P = C + X*exp(-rT) - S */
        CallResult[opt] = CallVal;
        PutResult[opt]  = PutVal;
    }

Parallel version runs at 0.27955 billion options per second:
Completed pricing 1152 million options in 4.945128 seconds.
Figure 6: Example of user-initiated vectorization

Vectorized code takes advantage of SIMD instructions in modern microprocessors and delivers additional application performance without requiring higher frequency or more cores. In our case, vectorization delivered a 7.15X speedup out of an 8X theoretical maximum (eight single-precision floats per 256-bit AVX register).


