Vectorization


Looking at von Leitner’s example of vectorization, a reader could come away with the mistaken impression that the stakes are small. For some codes that is true: vectorization is just one of many optimizations that compilers are getting increasingly good at, like inlining, strength reduction, or tail-call elimination. After all, the example shows a vectorized implementation of memset(). How often is memset() in the critical performance path of code that is important to you? I’ll guess: not all that often.

int zero(char* array) {
    unsigned long i;
    for (i = 0; i < 1024; ++i)
        array[i] = 0;
    return 0;
}
For FinQuant codes, vectorization typically plays a large role in determining the competitive performance of the optimized code. Vectorization may be what makes it possible to get the full P&L and risk valuation for the desk’s trade inventory posted with the end-of-day marks before the traders leave for the day. Vectorized code may be the only way to run enough paths to get a particular Firm Risk Monte Carlo to converge. Vectorized code may be important for your algorithmic trading execution in a particularly competitive market.

Code optimization via vectorization does come at a cost. In many FinQuant applications the contemporary compiler is simply not able to find the available vectorization automatically; it has to be exposed by the programmer. It is not that von Leitner is wrong in the direction of his argument. But the argument, in the case of important FinQuant codes and applications, is a little too general.

Take a look at the first part of the Intel Black-Scholes code. The big difference between scalar code and vectorized code is the vector-length and loop-index parameter nopt. If nopt = 1 when BlackScholesFormula() is called, the code executes as scalar code. The floating-point execution pipeline can then only schedule the operations within a single iteration of the for loop, and if the compiler cannot find enough instruction-level parallelism (ILP) there, each floating-point operation costs closer to 4 to 5 cycles rather than 0.5 or 0.25 cycles (on an average-throughput basis).

void BlackScholesFormula( int nopt, tfloat r, tfloat sig,
                          tfloat s0[], tfloat x[], tfloat t[],
                          tfloat vcall[], tfloat vput[] )
{
    vmlSetMode( VML_EP );
    DIV(s0, x, Div); LOG(Div, Log);
    for ( j = 0; j < nopt; j++ ) {
        tr[j]    = t[j] * r;
        tss[j]   = t[j] * sig_2;
        tss05[j] = tss[j] * HALF;
        mtr[j]   = -tr[j];
        /* ... */
    }
    /* ... */
}
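The latency-versus-throughput cost of the scalar case can be seen in miniature with a hedged sketch (the function names here are illustrative, not from the Intel code): a single dependent chain of adds pays the full add latency on every iteration, while four independent partial sums give the pipeline work to overlap. A vectorizing compiler exploits the same independence across the lanes of a register.

```c
/* One dependent chain: iteration i+1 cannot start its add until
 * iteration i's add retires, so each iteration costs full add latency. */
double sum_chain(const double *x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += x[i];
    return s;
}

/* Four independent chains: adds into the four accumulators can be in
 * flight simultaneously, moving cost toward throughput, not latency. */
double sum_ilp(const double *x, int n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; ++i)   /* remainder */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```

Both functions compute the same sum; only the shape of the dependence chain differs, which is exactly the property the floating-point scheduler cares about.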

If nopt > 1, then the compiler and the programmer have something to work with in terms of optimization and vectorization. This is pretty much the whole game for the floating-point performance improvements known as SSE, AVX, AVX2, and AVX-512. In other words, in a contemporary world where you cannot get more clock cycles out of commodity microprocessors, the only way the performance of a large collection of important FinQuant code is going to keep improving with Moore’s Law is to expose ILP to the compiler.

In some cases it is not so difficult to find the ILP. Monte Carlo simulation provides an enormous amount of it. Many portfolio valuation applications have lots of easy ILP because the individual trades in the portfolio require independent valuation. But for streaming computations, finding the ILP is not something the compiler is generally going to figure out just because you give it the right switches. Moreover, if you don’t find the available ILP, for many applications that performance is simply lost; there is no other way to recover it. Parallelization across on-chip cores is slower; off-chip cores are typically a non-starter; coprocessors on the other side of the bus cannot recover the data-transfer time; and the overhead of even multiple threads can leave the floating-point execution units idle for relatively long stretches. If the competitive computational bottleneck is the floating-point execution pipeline, then exposing ILP is almost a forced move.
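Monte Carlo’s ILP comes from path independence, and exposing it is largely a matter of loop ordering. A hypothetical sketch (the deterministic "shock" array stands in for random normals so the example is self-contained): stepping all paths at each time step makes the inner loop dependence-free and vectorizable, whereas stepping one path through all its time steps is a serial chain.

```c
/* Hypothetical GBM-style update over npaths independent paths.
 * z holds one pre-generated shock per (step, path); a real simulation
 * would draw normals here. The inner loop over paths carries no
 * dependence between iterations, so it is the loop to vectorize. */
void step_all_paths(int npaths, int nsteps, double drift,
                    double *restrict s, const double *restrict z) {
    for (int t = 0; t < nsteps; ++t)
        for (int p = 0; p < npaths; ++p)   /* independent across p */
            s[p] *= 1.0 + drift + z[t * npaths + p];
}
```

The time loop stays outer and serial, as it must; the parallelism lives entirely across paths.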

In the Intel code, and generally in any vectorized code, the key to exposing ILP to the compiler is to assemble all the inputs required by the floating-point execution unit so that nopt is as large as possible (looking for the knee in the performance curve). To see why, look at Hager’s daxpy performance plot of MFlops versus vector length. Vector length is the known size of nopt at the time BlackScholesFormula() is called. The larger nopt is known to be (i.e., the more inputs are available), the more MFlops your code gets. The steep slope of the Blaze daxpy curve is very attractive, meaning it can be worthwhile to rearrange your code to get the execution performance Blaze is advertising. How different is daxpy from the Intel loop example we started with? Not all that much; daxpy is a little more aggressive because it can schedule the fused multiply-add execution units to complete a floating-point add and multiply on each clock cycle. Hager has many vectorized-performance template plots you can use to refine performance estimates.
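For reference, the daxpy kernel behind Hager’s plot is just the BLAS operation y ← a·x + y; a plain C sketch of it looks like this. Each iteration is one fused multiply-add’s worth of work with no cross-iteration dependence, which is why its MFlops curve climbs so steeply with vector length until the data leaves cache.

```c
/* daxpy: y <- a*x + y over n elements. restrict tells the compiler
 * x and y do not alias, so the loop vectorizes and maps directly
 * onto FMA units. */
void daxpy(int n, double a, const double *restrict x, double *restrict y) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```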

Von Leitner’s presentation is great, but it could be misleading for some FinQuant floating-point codes. You, as the programmer, need to think through how to make nopt hit the performance sweet spot for your code’s execution. In general, compilers are not just going to figure it out for you, at least not today.

