Pink Iguana


Write Latency for HFT Logging

Jones, Evan, Write Latency and Block Size, here. Don't know him; like the data; I'll read the blog for a bit. Let's play I'm-not-going-to-pay-a-lot-for-this-muffler. I'm thinking, just as a target: why would I willingly allocate more than 10 microseconds per message for all logging (local, prime broker, regulatory, venue, etc., etc.) of cash equity HFT messages, with off-the-shelf hardware and no substantial mods to the Linux kernel? When you look under the covers, or you know more than I do today, maybe it has to be 20 or 30 microseconds, but for this game you have to explain (account for the time spent on) anything above 10 microseconds.
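Just to make that budget concrete, here is a minimal sketch (mine, not from the Jones post; the file name and record format are made up) of timing a single buffered log write against the 10 microsecond budget with clock_gettime(). Note a plain write() only lands the record in the page cache; the durability question (fsync, block size) is exactly what the Jones data is about.

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    const char rec[] = "NEW ORDER id=42 px=10.01 qty=100\n";   /* made-up record  */
    int fd = open("hft.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    ssize_t n = write(fd, rec, sizeof rec - 1);   /* page cache only, no fsync */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    long ns = (t1.tv_sec - t0.tv_sec) * 1000000000L + (t1.tv_nsec - t0.tv_nsec);
    printf("wrote %zd bytes in %ld ns (budget: 10000 ns per message)\n", n, ns);
    close(fd);
    return 0;
}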

Honeyman, Linux NFS Client Write Performance, 2001, here. This is one of those deals, when honey speaks you listen even if it was ten years ago, but it can get messy, witness, here. Don’t ask why. Never explain, never complain; Fun People; cwtvomk (Come wipe the vomit off my keyboard), that’s just honey warming up, respect.

Smith, Greg,  Tuning Linux for low PostgreSQL latency, here.

Linux Dev Center, Low latency in the Linux Kernel, here.

Torvalds, email archive, here.

Westnet, The Linux Page Cache and pdflush: Theory of Operation and Tuning for Write-Heavy Loads, here.


Xilinx References Apr 2012

Xilinx, High Performance Computing Using FPGAs, Sep 2010, here.

The shift to multicore CPUs forces application developers to adopt a parallel programming model to exploit CPU performance. Even using the newest multicore architectures, it is unclear whether the performance growth expected by the HPC end user can be delivered, especially when running the most data- and compute- intensive applications. CPU-based systems augmented with hardware accelerators as co-processors are emerging as an alternative to CPU-only systems. This has opened up opportunities for accelerators like Graphics Processing Units (GPUs), FPGAs, and other accelerator technologies to advance HPC to previously unattainable performance levels.

I buy the argument to a degree. As the number of cores per chip grows, the easy pipelining and parallelization opportunities will diminish, and the argument gets stronger the more cores per chip there are. At 8 cores or fewer per general purpose chip it's still sort of a futuristic, theoretical argument; more than a few programmers can figure out how to code up a 4 to 8 stage pipeline for their application without massive automated assistance. But the FPGA opportunity does exist.

The convergence of storage and Ethernet networking is driving the adoption of 40G and 100G Ethernet in data centers. Traditionally, data is brought into the processor memory space via a PCIe network interface card. However, there is a mismatch of bandwidth between PCIe (x8, Gen3) versus the Ethernet 40G and 100G protocols; with this bandwidth mismatch, PCIe (x8, Gen3) NICs cannot support Ethernet 40G and 100G protocols. This mismatch creates the opportunity for the QPI protocol to be used in networking systems. This adoption of QPI in networking and storage is in addition to HPC.

I buy the FPGA application in the NIC space. I want my NIC to go directly to L3 pinned pages, yessir I do, 100G please.

Xilinx FPGAs double their device density from one generation to the next. Peak performance of FPGAs and processors can be estimated to show the impact of doubling the performance on FPGAs [Ref 6], [Ref 7]. This doubling of capacity directly results in increased FPGA compute capabilities.

The idea proposed here is that you want to be on the exponentially increasing density curve for the FPGAs in lieu of clock speed increases you are never going to see again. Sort of a complicated bet to make for mortals, maybe.

I like how they do the comparisons though. They say here is our Virtex-n basketball player and here is  the best NBA Basketball player … and they show you crusty old Mike Bibby 2012. Then they say watch as the Virtex-n basketball player takes Mike Bibby down low in the post, and notice the Virtex-n basketball player is still growing exponentially. So you can imagine how much better he will do against Mike Bibby in the post next year. Finally they say that Mike Bibby was chosen as the best NBA player for this comparison by his father Henry, who was also a great NBA player.

FPGAs tend to consume power in tens of watts, compared to other multicores and GPUs that tend to consume power in hundreds of watts. One primary reason for lower power consumption in FPGAs is that the applications typically operate between 100–300 MHz on FPGAs compared to applications on high-performance processors executing between 2–3 GHz.

Silly making-lemonade-out-of-lemons argument; the minute I can have my FPGAs clocked at 3 GHz I throw away the 300 MHz FPGAs, no?

Intel, An Introduction to the Intel QuickPath Interconnect, QPI, Jan 2009, here.

Xilinx Research Labs/NCSA, FPGA HPC – The road beyond processors, Jul 2007, here. Need more current references, but I keep hearing the same themes in arguments for FPGA HPC, so let's think about this for a bit:

FPGAs have an opening because you are not getting any more clocks from microprocessor fab shrinks: OK.

Power density: meh. Lots of FinQuant code can run on a handful of cores. The Low Latency HFT folks cannot really afford many L2 misses. The NSA boys are talking about supercomputers for crypto not binary protocol parsing.

Microprocessors have all their functions hardened in silicon, you pay for them whether you use them or not, and you can't use that silicon for something else: Meh, I don't really care if I use all the silicon on my 300 USD microprocessor as long as the code is running close to optimal on the parts of the silicon useful to my application. It would be nice if I got more runtime performance for my 300 USD, no doubt. This point is like saying Advil is bad because you don't always need to finish the bottle after you blow out your ankle. Yeah, I understand the silicon real estate is the most expensive in the world.

Benchmarks: Black Scholes in 18 msec on an FPGA (110 MHz Virtex-4), 203x faster than an Opteron at 2.2 GHz: You cannot be serious! 3.7 microseconds per Black Scholes evaluation was competitive performance at the turn of the century. The relative speedup slides and quotations make me nervous. Oh, Celoxica provided the data – hey, Black Scholes runs in 36 nanoseconds on a single core of a dual core off-the-shelf general purpose microprocessor from 2007. So the Virtex-4 does 1M Black Scholes evaluations in 18 milliseconds, flat to competitive code on a dual core general purpose off-the-shelf microprocessor in 2007.

Make it easy for the users to use this hardware and get "enough of a performance" increase to be useful: meh, it's for applications that do not need to go fast, for now (2007)?

Do not try to be the fastest thing around when being as fast with less power is sufficient: meh, really do not care so much about the power thing

FPGA: different operations map to different silicon, which allows massive pipelining and lots of parallelism: OK. So why bother with the previous two points?

Eggers/ U. Washington, CHiMPS, here. Eggers is reasonable.

There have been (at least) two hindrances to the widespread adoption of FPGAs by scientific application developers: having to code in a hardware description language, such as Verilog (with its accompanying hardware-based programming model) and poor FPGA memory performance for random memory accesses. CHiMPS, our C-to-FPGA synthesis compiler, solves both problems with one memory architecture, the many-cache memory model.

Many-cache organizes the small, distributed memories on an FPGA into application-specific caches, each targeting a particular data structure or region of memory in an application and each customized for the particular memory operations that access it.

CHiMPS provides all the traditional benefits we expect from caching. To reduce cache latency, CHiMPS duplicates the caches, so that they’re physically located near the hardware logic blocks that access them. To increase memory bandwidth, CHiMPS banks the caches to match the memory parallelism in the code. To increase task-level parallelism, CHiMPS duplicates caches (and their computation blocks) through loop unrolling and tiling. Despite the lack of FPGA support for cache coherency, CHiMPS facilitates data sharing among FPGA caches and between the FPGA and its CPU through a simple flushing of cached values. And in addition, to harness the potential of the massively parallel computation offered by FPGAs, CHiMPS compiles to a spatial dataflow execution model, and then provides a mechanism to order dependent memory operations to retain C memory ordering semantics.

CHiMPS’s compiler analyses automatically generate the caches from C source. The solution allows scientific programmers to retain their familiar programming environment and memory model, and at the same time provides performance that is on average 7.8x greater and power that is one fourth that of a CPU executing the same source code. The CHiMPS work has been published in the International Symposium on Computer Architecture (ISCA, 2009), the International Conference on Field Programmable Logic and Applications (FPL, 2008), and High-Performance Reconfigurable Computing Technology and Applications (HPRCTA, 2008), where it received the Best Paper Award.

Agner Fog

SD Times, Fog Around Intel Compilers, here.

Agner Fog is a computer science professor at the University of Copenhagen‘s college of engineering. As he puts it, “I have done research on microprocessors and optimized code for more than 12 years. My motivation is to make code compatible, especially when it pretends to be.”

Fog has written a number of blog entries about Intel’s compilers and how they treat competing processors. In November, AMD and Intel settled, and Fog has written up a magnificent analysis of the agreement.

If you have any interest in compilers, and in Intel’s compilers, you should definitely read his paragraph-by-paragraph read through.

Fog, Agner, Software Optimization Resources, here. I was reading Fog's Optimizing Software in C++ (here) this morning. It's a runtime optimization guide for Windows, Linux, and Mac. I have seen it before and perhaps been remiss in not commenting more fully. Without the benefit of trying out many of Fog's code samples and directives against current versions of ICC and GCC I cannot be certain, but based on what I have optimized in the recent past, his body of work looks very legitimate and exhaustive. You ask, how exhaustive? Let's start with the copyright; it's got a succession plan:

This series of five manuals is copyrighted by Agner Fog. Public distribution and mirroring is not allowed. Non-public distribution to a limited audience for educational purposes is allowed. The code examples in these manuals can be used without restrictions. A GNU Free Documentation License shall automatically come into force when I die. See http://www.gnu.org/copyleft/fdl.html.

Professor Fog is laying out code optimization paths for 4 different compilers on 3 different operating systems. I will not and cannot check out/verify all the scenarios presented because I possess the attention span of a squirrel compared to Professor Fog.  He also provides a page on random number generators, here,  which seems legit to the extent that he points you to Matsumoto’s Mersenne Twister RNG page, here. The random number references do not appear to be as comprehensive as the C++ runtime optimization references. But  this looks to be a case of:

We’re not worthy

in a most complimentary way to Professor Fog.

Travelling Salesman

Slashdot, Travelling Salesman, Thriller Set in a World Where P=NP, here. Watch the trailer – it is the best thing you will see this week. Premieres 16 Jun. From the trailer it looks like Tom Cruise meets Mr. T to work a polynomial runtime algorithm to take over the world. Wonder what language the algorithm is coded up in? Please, please have it be standard ml of new jersey running on Xcode/Mountain Lion cross compiled to run on Amazon Elastic Compute Cloud. If these really are literally four of the smartest guys on earth you know right away there is going to be a problem checking out the right patch version from CVS so they know what to build and hide from the bad government drill-your-head guys. Maybe there are a bunch of grad students in the next room, with pizza and iPad2s or something. I’ll start you off with the relevant dialog:

Government Bad Guy: Hit Control-C now.

Smartest in the World 1: I can’t …

President: NOW! Abort!

Smartest in the World 2: What’s Control See?

We will see if they got this right. If there is already a TSP movie rolling to distribution in Jun 2012, the HFT movie cannot be far behind, right? Movie website is here. GLL deconstructs here.

JP Morgan’s London Whale needs Maxeler’s FPGA Supercomputer to run Risk?

Bloomberg, JPMorgan Trader Iksil Fuels Prop-Trading Debate With Bets, here. London Whale needs some P from Series 9 Investment Grade CDX. Can Bloomberg and some prominent officials help fix the situation?

Zerohedge, From Bruno Iksil's Personal Profile: Enjoys "Walking Over Water" And Being "Humble", here. Oh, so the London Whale is shorting tranches of Series 9 CDX in $100bn quantities. That starts to make sense given all the hysteria. So they make the purchases at the corporate level to hedge SCDO exposure that is not actively traded by the desk. Maybe Bruno is the one who needs the Maxeler FPGA Supercomputer Credit batch at JPM (see Credit Derivatives, Flynn's Architectural Case for Maxeler in 2012?, and Street FP Due Diligence 3: Epic Yet?) to run in 238 seconds. How do you tag that? London Whale needs Mammoth Supercomputer to Stay Afloat? Too much, right? I still suspect that the entire JPM credit batch (as described) completes in less than a minute on a low-end Mac Pro, even with the Gaussian Copula positions. Sort of more like "Bruno uses iPhone to Track Purchases." (see Business Insider, Financial Post, Wall Street Journal blog, Sober Look, New York Times Dealbook, Financial Times Alphaville, blogrunner)

Zerohedge, 31 Dec 2011 Notional Amt. of Derivative Contracts Top 25 Comm. Banks, here. Didn’t MS carry derivative inventory in the past?

Low Latency

Low Latency trading is High Frequency trading with Remote Procedure Calls. Fully automated trading strategies in electronic market making, liquidity detection, cross-asset arbitrage, short-term statistical arbitrage, and volatility arbitrage meet unavoidable signal propagation delay and “The End of Geography.” How does one analyze algorithmic trading systems searching for arbitrage between geographically remote markets and exchanges where the interprocess communication latency is large relative to the pre-trade analytics execution time? Given that two markets are connected by the lowest latency network connection available, the focus questions of this survey are:

1. What is the best Low Latency trading system to employ?   and

2. Where are the premium Low Latency trading opportunities?

Academic market microstructure studies, commercially successful automated markets, and technology advances in fast switching and long distance signal transmission have recently converged to make Low Latency proprietary trading commercially viable. Academic advances in market microstructure set the stage for profitable High Frequency trading. According to O’Hara, the term market microstructure refers to the study of the process and outcomes of exchanging assets under a specific set of rules. Lo and MacKinlay’s 2001 book A Non-Random Walk Down Wall Street popularized the notion that US Equity prices did not follow a Random Walk.  In 1985, Glosten and Milgrom provided a model for the adverse selection cost of trading. Hasbrouck and Saar have surveyed 2007-8 NASDAQ trading tapes to characterize the nature of contemporary Low Latency trading.  In 2007, O’Hara published a comprehensive overview, Market Microstructure Theory, surveying market microstructure with particular attention to the role of Bayesian hypothesis testing in developing pre-trade analytics.

Electronic trading has evolved in almost all liquid Equity, FX, and Fixed Income global markets including Cash Equity, Futures, Treasuries, FX, Interest Rate Swaps, and Credit. In Equities, FX, and US Treasury bond markets some electronic platforms have grown to take significant market share; in other markets, not so much. Stoll, in Electronic Trading in Stock Markets, surveyed the recent history of Cash Equity trading automation, noting that several electronic exchanges have received significant indirect governmental assistance in securing market share on the back of Reg NMS. In 2008, Mizrach and Neely, The Microstructure of the U.S. Treasury Market, surveyed the transition to electronic trading in several Fixed Income markets including secondary Treasury bond trading. High Frequency trading has had solid press coverage over the past year (see London Review of Books, Donald MacKenzie, "How to Make Money In Microseconds"; or CNBC, John Carney, "Yes, High Frequency Traders will Take Over Swap Trading"), perhaps most notably post the 6 May 10 Flash Crash (see NYT, Graham Bowley, "Lone $4.1 Billion Sale Led to 'Flash Crash' in May").

The Barksdale company Spread Networks is offering bandwidth on an 825-mile straight-line dark fiber run between Carteret, New Jersey and Chicago's South Loop. The fiber reportedly has an end-to-end latency of 13.3 milliseconds (assume round trip), down from the previous best of 16.3 ms (see Forbes 9 Sep 10). Considering that physical signal propagation limits, with single-mode fiber transmitting at roughly two thirds the speed of light combined with the straight 825-mile path, give a theoretical round trip latency of approximately 13.2+ ms, a detailed analysis of the expected commercial side effects is warranted.
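The 13.2+ ms figure is just distance over signal speed in glass; a quick sketch of the arithmetic (assuming a perfectly straight 825-mile path and propagation at roughly two thirds of c):

#include <stdio.h>

int main(void)
{
    const double meters     = 825.0 * 1609.344;        /* 825 miles ~ 1.33e6 m */
    const double v_fiber    = 299792458.0 * 2.0 / 3.0; /* ~2.0e8 m/s in fiber  */
    const double one_way_ms = meters / v_fiber * 1e3;  /* ~6.6 ms one way      */
    printf("one way %.2f ms, round trip %.2f ms\n", one_way_ms, 2.0 * one_way_ms);
    return 0;
}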

In High Frequency trading, a profitable proprietary desk strategy is to become a non-contractually obligated and unregulated dealer/market maker. Avoiding regulation means avoiding potentially onerous margin requirements compensating a clearinghouse or exchange for risk. A typical market maker strategy is to simultaneously post bid offer orders to earn spread while adjusting spreads to fit the market. Low Latency trading simply adds arbitrage strategies between geographically remote regions or markets to the High Frequency strategies. Capital constrained Low Latency trading desks run risk-neutral matched books i.e., the corresponding trading systems book assets and liabilities with perfectly equivalent contractual obligations and maturities. Any asset/liability mismatches are insignificant, fleeting/transient, or easily correctable at a macro level post detection. If Low Latency trading desks run matched books the Front-to-Back process and infrastructure costs (i.e., the costs of maintaining inventory, valuation analytics, daily risk and P&L reporting) remain small and hedging costs are minimal. As the trading desk seeks additional revenue opportunities deeper in the Capital Markets, they can trade directional risk strategies including: statistical arbitrage, mean reversion, pairs trading, and trend following (ala Princeton-Newport, Citadel, or Renaissance). These directional strategies require more capital for collateralizing/funding inventory positions and Front-to-Back operations. One exotics Rates trader put it well, in saying when he prints a trade, it has a certain amount of P (profit) in it; the goal for the rest of the year is to realize as much inception P as possible while hedging as cheaply as possible.

We argue that in the absence of potential network latency improvement (i.e., assuming an optimal low latency fiber connection or microwave/wireless), the only remaining way to gain competitive advantage, available to a Low Latency prop desk, is by reducing computational latency.

Low Latency versus High Throughput

Let's think about the non-batch version of Black Scholes where the average vector length is less than 1024. Here is our old Black Scholes benchmark against a Power microprocessor (below). We will redo the analysis with contemporary Intel silicon, ICC, and MKL. Folks at Tabb Group in The Value of a Millisecond: Finding the Optimal Speed of a Trading Infrastructure, here, say a 5 millisecond delay could cost $4 million. If the colocation network latency is under control, it's probably a good thing to make every fraction of a millisecond spent parsing the limit order book as close to optimal as you can.

Black Scholes Evaluation on a Contemporary Microprocessor (2009)

We present a simple analysis on the expected Black Scholes runtime performance via competitive code running on a contemporary microprocessor in 2009. The Black Scholes formula used here is a closed form solution for the Black Scholes PDE for a European Call option. This BS formula should run about 36 nanoseconds per valuation with a full set of risk outputs. A portfolio of 100,000 European Call options is estimated to run in about 4 milliseconds on a single core of a single contemporary microprocessor. To look at it another way, there is no more than 4 milliseconds of actual work for a single 4.7GHz core to perform in evaluating BS on a portfolio of 100,000 Calls.   The analysis scales; for 50,000 simulated points in a Monte Carlo using the BS marks we expect runtime to complete in 200 seconds on a single contemporary microprocessor core.

Problem

Let’s look at the estimated time to run the Black Scholes formula for 100,000 European Calls. For simple mark to market the Black Scholes formula requires five doubles as input for each Call.

Input Variable   Definition
t                Time to maturity
S                Spot
K                Strike
r                Risk-free rate
sigma            Vol of log returns

So for 100,000 Calls, each requiring 40 bytes of input (5 inputs * 8 bytes) we get 4MB of inputs.

The code implementing the Black Scholes formula evaluates a closed form expression depending on the five inputs and produces a series of six outputs.

Output Variable   Definition
mtm               Mark to market
delta             Derivative wrt S
gamma             2nd derivative wrt S
vega              Derivative wrt sigma
theta             Derivative wrt time
rho               Derivative wrt risk-free rate

So for 100,000 Calls we are going to generate 48 bytes of output per Call or 4.8 MB. Just by virtue of the input and output size being small we know the BS evaluation can possibly run with the floating point unit of the target microprocessor highly utilized.  The two additional pieces of information we need are the specifics about the target microprocessor and the computational form of the Black Scholes formula.  In the next two sections we assemble that information. This is not an IO bound computation, agreed?

 

Microprocessor Background

IBM published a specification for the 64-bit POWER6 in May 2007, see [1]. The processor is clocked at 4.7 GHz with 128 KB L1 and 4 MB private L2 cache (8 MB total). The MASS library provides scalar and vectorized math functions and even clock-cycle performance estimates, see [2]. For evaluating Black Scholes formulas there are several vector functions we are interested in, including those in the table below.

Vector Function   Clocks   Description
vexp              11.74    exponential
vdiv              7.88     divide
vlog              12.34    log
vsqrt             11.14    square root

The takeaway here is that for input vectors of length 1000 these functions require 2 to 3 nanoseconds per valuation. So now you can probably guess where we are going with this analysis. We are going to look at the specific form of the Black Scholes equations and estimate:

1. The length of time to get the inputs on average

2. The length of time to perform the evaluation of a vectorized Black Scholes code and compute each of the outputs

3. The length of time needed to write the outputs on average

We have 4 MB of inputs and 4.8 MB of outputs, so we are going to make the handwaving assumption that all the inputs and outputs fit in the 8 MB L2 cache. Recognize we could simply adjust the number of Calls so all the inputs and outputs strictly fit in the L2 cache and run multiple batches without substantially changing the average performance, given the rather crude assumptions we are employing.

Patterson gives us an estimate of 200 clocks to get to DRAM from a 2006-vintage microprocessor. Notice in Patterson's presentation he observes it is frequently better to recompute values than read them from memory (or a file); sort of an important point if you are coding floating point these days. Intel lists the following latencies for the Pentium M:

To Where      Clocks
Register      1
L1            3
L2            14
Main Memory   240

So, once we know the details of the computational form for Black Scholes we will have most of the data we need to make an estimate of the BS runtime on a portfolio of 100K Calls.

 

Black Scholes Formula Background

Black Scholes in computational form looks like this:

MTM(t, S, K, r, sigma) =  S * N(d1) – K*exp(-r*t)* N(d2)

 

where

d1 = (ln(S/K) + (r + sigma*sigma/2)*t)/(sigma*sqrt(t)) and

 

d2 = d1 - sigma*sqrt(t).

 

Then see [6] from Abramowitz and Stegun to get the computational form of the cumulative normal distribution function

N(x) = 1 – n(x)*(b1*y + b2*y*y+ b3*y*y*y + b4*y*y*y*y + b5*y*y*y*y*y)

 

where

 

n(x) = (exp(-x*x/2))/sqrt(2*pi)

 

 

and y is actually a function of x so

y(x) = 1/(1 + 0.2316419*x)

and that is the whole computational story. Very roughly, this cannot take long to run: 26 multiplies, divides, and adds, a couple of exponential and square root evaluations, and a natural log evaluation, and it only takes five input variables. Think in terms of the number of clocks in a microsecond. In one millionth of a second a contemporary microprocessor executes somewhere between 3 and 4.7 thousand operations (read adds or multiplies). In a scalar function each multiply may take 4 cycles, but in a vector pipeline you can simply get a multiply result per clock. When you write this code you are simply trying to evaluate a handful of expensive operations (exp, divide, ln, sqrt, where expensive means 10-20 cycles) and a couple dozen arithmetic ops that retire once a cycle in the pipeline. If you are not familiar with math library code, just recall broadly that Chebyshev approximations will typically give you a reasonably low residual error fit for a modest degree polynomial. You already know you will not be needing even one millionth of a second to execute this BS valuation. Now you just want to know how many million BS evaluations your code could run in a second. Let's finish the rough analysis.
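Before counting cycles, here is the closed form above as straight-line C (my own glue code, not from any reference; N() is the Abramowitz and Stegun approximation coded in the next section), mostly so the handful of expensive operations is easy to see:

#include <math.h>

double N(double x);   /* A&S approximation, listed below */

/* European Call mark from the five inputs: one log, one sqrt, one divide,
   one exp, two N() calls, and a couple dozen multiplies and adds. */
double bs_call_mtm(double t, double S, double K, double r, double sigma)
{
    double srt = sigma * sqrt(t);
    double d1  = (log(S / K) + (r + sigma * sigma / 2.0) * t) / srt;
    double d2  = d1 - srt;
    return S * N(d1) - K * exp(-r * t) * N(d2);
}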

Runtime Estimates

All you need to see is that you can vectorize the entire BS formula for each valuation; you only have five doubles as inputs. Now just count the operations (multiplies and additions) and look up the expensive ones (the exps, divides, lns, square roots) in the MASS library table above – all the intermediate results and constants fit in the L1 cache for the batch of 1000 Calls (yes, handwaving, but very plausible). We are also going to assume the dependencies within the computation are insignificant. Loosely, we assume that at a batch size of 1000 there is always computation that can be submitted to the execution pipeline, so there are no bubbles of idle processor time (less plausible, requires more detailed analysis for completely accurate runtime accounting). Finally, we are not accounting for the time to load from the L1 cache, instead assuming we always have register-file access times (still less plausible but probably not a big deal).

Let's assume a batch size of 1000 Calls so we match up with the MASS library performance data, and let's also assume all the inputs are in the L2 cache to start. The time to get 1000 input vectors, or 40 KB, into the 128 KB L1 cache costs the 14-cycle latency on each of the L1 cache misses; let's estimate 1000 cycles of latency (the computation stops waiting for inputs from L2) over 1000 input vectors of 5 doubles (let's assume we do not know how to prefetch at all). Let's be conservative and say that inputs require one clock per valuation since they are buffered. Let's argue that the writes will be overlapped with computation (some kind of write-back cache) and cost no more than another clock per valuation of latency. So we are left with: the time for a single BS valuation is 2 clocks plus the length of time to perform the evaluation of a vectorized Black Scholes code and compute each of the outputs.

From the bottom up then; the CDF coded up looks like this in scalar form. (This code will further abuse the notation for t – it is just a temp variable in the code, while it is the time to maturity in the text, sorry.)

#include <math.h>

double N(const double x)
{
    /* Abramowitz and Stegun polynomial approximation to the standard normal
       CDF; c = 0.39894228 is 1/sqrt(2*pi). */
    const double b1 =  0.319381530;
    const double b2 = -0.356563782;
    const double b3 =  1.781477937;
    const double b4 = -1.821255978;
    const double b5 =  1.330274429;
    const double p  =  0.2316419;
    const double c  =  0.39894228;

    if (x >= 0.0) {
        double t = 1.0 / ( 1.0 + p * x );
        return (1.0 - c * exp( -x * x / 2.0 ) * t *
                ( t *( t * ( t * ( t * b5 + b4 ) + b3 ) + b2 ) + b1 ));
    }
    else {
        /* Symmetry: N(x) = 1 - N(-x) handles the negative tail. */
        double t = 1.0 / ( 1.0 - p * x );
        return ( c * exp( -x * x / 2.0 ) * t *
                ( t *( t * ( t * ( t * b5 + b4 ) + b3 ) + b2 ) + b1 ));
    }
}

With no heroism whatsoever the cumulative normal distribution function will cost:

a cycle for the Boolean on the if statement and

double t = 1.0 / ( 1.0 + p * x );

8 cycles for this vector divide and an additional cycle for the fused multiply-add in the denominator – a total of 9 clocks.

return (1.0 - c * exp( -x * x / 2.0 ) * t *

( t *( t * ( t * ( t * b5 + b4 ) + b3 ) + b2 ) + b1 ));

12 cycles for the vector exponential, plus 5 cycles for the Horner's rule form of the polynomial with fused multiply-adds, plus at most 5 cycles for the remaining operations, for a total of 23 clocks. Let's round up and say the entire cumulative normal distribution function will cost 35 clocks. This particular scalar code will not get there, but a suitably vectorized and equivalent code would hit the performance mark of 35 clocks per element.

How much will d1 and d2 cost?

d1 = (ln(S/K) + (r + sigma*sigma/2)*t)/(sigma*sqrt(t))

 

d2 = d1 - sigma*sqrt(t)

 

Well, d2 will cost one additional cycle after computing d1. The vector log in d1 costs 13 cycles. S/K will cost 8 cycles using vdiv(). The sqrt(t) will cost 12 cycles. The reciprocal of sigma*sqrt(t) will cost another 8 cycles. All the remaining operations in d1 cost, say, 5 cycles. The total cost for computing d1 and d2 is less than 50 cycles.

So then, computing the mark starts with the 50 cycles to get d1 and d2.

MTM(t, S, K, r, sigma) =  S * N(d1) – K*exp(-r*t)* N(d2)

The two evaluations of N() will cost 35 cycles apiece, or 70 cycles. The vector exponential will cost 12 cycles. All the remaining ops take no more than 5 cycles. The total count for the mark is less than 140 cycles, or perhaps 30 nanoseconds.

But wait, what about the delta, gamma, and the rest of the greeks? By differentiating the BS formula appropriately we get:

 

delta = N(d1),

 

gamma = n(d1)/(S*sigma*sqrt(t)),

 

vega  = S*n(d1) * sqrt(t),

 

theta = -(S*n(d1)*sigma)/(2*sqrt(t)) - r*K*exp(-r*t)*N(d2),

and

rho =  K*t*exp(-r*t)*N(d2)

These are all subexpressions that were already computed in the evaluation of the MTM; let's call it 5 cycles per greek, 25 cycles for all the greeks (a little handwaving, but not much new computation in the greeks, right?).
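As a sketch of that reuse (my own code, under the same assumptions; d1, d2, N(d1), N(d2) and the discount factor are taken as already computed for the mark), the incremental work per greek really is a handful of multiplies and divides, plus one extra exponential for n(d1):

#include <math.h>

typedef struct { double mtm, delta, gamma, vega, theta, rho; } bs_out;

/* Standard normal density; 0.39894228 = 1/sqrt(2*pi), as in the N() code. */
static double npdf(double x) { return 0.39894228 * exp(-x * x / 2.0); }

void bs_call_outputs(double t, double S, double K, double r, double sigma,
                     double d1, double d2, double Nd1, double Nd2,
                     double disc /* exp(-r*t), reused from the mark */,
                     bs_out *o)
{
    double srt = sigma * sqrt(t);   /* reused from the d1/d2 computation */
    double nd1 = npdf(d1);

    o->mtm   = S * Nd1 - K * disc * Nd2;
    o->delta = Nd1;
    o->gamma = nd1 / (S * srt);
    o->vega  = S * nd1 * sqrt(t);
    o->theta = -(S * nd1 * sigma) / (2.0 * sqrt(t)) - r * K * disc * Nd2;
    o->rho   = K * t * disc * Nd2;
}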

So the whole evaluation of a vectorized Black Scholes code, computing each of the outputs, takes 165 cycles plus the 2 cycles for getting the inputs and outputs to and from the L2 cache; call it 170 cycles or 36 nanoseconds. So for 100,000 Calls you expect the computation to finish in 3.6 milliseconds, i.e., in practice the dominant part of a contemporary BS computation will be reading the trades and market data from the database. By the time the option trade portfolio SQL query returns, the BS batch is done (assuming somewhat aggressive buffered input overlapping with BS computation) on a single core of a single processor purchased new in the last 24 months. In terms of Black Scholes valuations per second, close to 30 million per second is competitive in 2009.
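A sketch of the batch structure the estimate assumes (mine; the struct-of-arrays layout and names are made up, and the scalar loop is a stand-in for a kernel that would actually call the vector math routines vexp, vlog, vdiv, and vsqrt over each 1000-element batch):

#define NCALLS 100000
#define BATCH  1000

double bs_call_mtm(double t, double S, double K, double r, double sigma);  /* sketch above */

/* Struct-of-arrays layout so each 1000-element batch is contiguous for the
   vectorized math calls; greeks elided for brevity. */
typedef struct { double t[NCALLS], S[NCALLS], K[NCALLS], r[NCALLS], sigma[NCALLS]; } bs_inputs;
typedef struct { double mtm[NCALLS]; } bs_outputs;

void bs_portfolio(const bs_inputs *in, bs_outputs *out)
{
    /* 100 batches of 1000 Calls: ~36 ns per Call, ~36 us per batch, ~3.6 ms overall. */
    for (int i0 = 0; i0 < NCALLS; i0 += BATCH)
        for (int i = i0; i < i0 + BATCH; i++)
            out->mtm[i] = bs_call_mtm(in->t[i], in->S[i], in->K[i],
                                      in->r[i], in->sigma[i]);
}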

Conclusions

This BS formula should run in about 170 clocks per valuation – call it 36 nanoseconds per valuation with a full set of risk outputs – or 3.6 milliseconds for the portfolio of 100,000 Calls on a single core of a single processor. We anticipated vectorizing the computation a bit but assumed nothing executing in parallel, i.e., we only need one core to achieve these runtime performance targets. Note, the analysis doesn't look much different if the portfolio is composed of both Puts and Calls. Also notice we have assumed nothing about the composition of the series of inputs, such as finding sequences of evaluations where t is constant, etc. So the same analysis trick will basically work for as many inputs as you can throw at it. Say you have 50,000 Monte Carlo paths giving you variable t, r, S, and sigma with K constant. The evaluations will finish in a little under 200 seconds on a single core of a single microprocessor without any further optimization (I think it is straightforward to pull the 200 second MC BS performance estimate down with a little further analysis). Finally, you can always just run this MC BS code on a grid or multicore and parallelize until you meet your elapsed time goals.

The IBM POWER is a fairly nice architecture to work with for floating point computations, but the methodology used here would be applicable to any contemporary microprocessor that seriously supports floating point computation. There is nothing special about using IBM POWER for the analysis other than it is easy to do with publicly available references. Obviously the headline performance estimate is going to depend on the target microprocessor core details, but a ballpark performance figure between 30-100 nanoseconds per BS valuation seems reasonable. That is between 150 and 500 operations per BS evaluation with a large enough L1 cache so that there are not many expensive cache misses.

References

[1] http://www-03.ibm.com/press/us/en/presskit/21546.wss

[2] http://www-01.ibm.com/support/docview.wss?rs=2021&context=SSVKBV&uid=swg27005374&loc=en_US&cs=utf-8&lang=en

[3] http://www.fz-juelich.de/jsc/datapool/jump/JUMP-Tuning-Guide-for-HPC-Applications-on-POWER6.pdf

[4] http://www.eecs.berkeley.edu/BEARS/presentations/06/Patterson.ppt

[5] http://lwn.net/Articles/252125/

[6] http://www.sitmo.com/doc/Calculating_the_Cumulative_Normal_Distribution

ArXiv paper – Black Swans and Fast Computers

“Society’s drive toward ever faster socio-technical systems, means that there is an urgent need to understand the threat from ‘black swan’ extreme events that might emerge. On 6 May 2010, it took just five minutes for a spontaneous mix of human and machine interactions in the global trading cyberspace to generate an unprecedented system-wide Flash Crash. However, little is known about what lies ahead in the crucial sub-second regime where humans become unable to respond or intervene sufficiently quickly. Here we analyze a set of 18,520 ultrafast black swan events that we have uncovered in stock-price movements between 2006 and 2011. We provide empirical evidence for, and an accompanying theory of, an abrupt system-wide transition from a mixed human-machine phase to a new all-machine phase characterized by frequent black swan events with ultrafast durations (<650ms for crashes, <950ms for spikes). Our theory quantifies the systemic fluctuations in these two distinct phases in terms of the diversity of the system’s internal ecology and the amount of global information being processed. Our finding that the ten most susceptible entities are major international banks, hints at a hidden relationship between these ultrafast ‘fractures’ and the slow ‘breaking’ of the global financial system post-2006. More generally, our work provides tools to help predict and mitigate the systemic risk developing in any complex socio-technical system that attempts to operate at, or beyond, the limits of human response times.”

Here.

Street Credit FP Due Diligence

The thread we started to determine if it made sense to rewrite an optimized Credit derivative P&L and Risk batch (here) can be wrapped up. It is not worth the time to optimize the code. If you need fast Gaussian Copula computing, go make a deal with Maxeler; they have already got one, you see? If you need a fast contemporary credit derivative batch (CDS, CDX, ITRX) you can probably just optimize your conventional serial code and be competitive in 3-6 months. We have shown how to optimize the code in these posts, if you do not already know how to proceed. It would make sense to evaluate Maxeler and the dataflow architecture idea, but check your wallet if someone starts doing that "equivalent to 12,000 x86 cores" dance for a given computation, unless you are reasonably familiar with the code. We have no problem with the concept that Maxeler may be significantly faster, even for CDS valuation and risk, than a tightly optimized off-the-shelf code (MJ Flynn is a hitter with a record in computation circles), but we haven't seen the argument yet (and we doubt it is a simple argument). Moreover, why does dataflow pay off architecturally now for Flynn when it did not pay off for Arvind for like 20 years ending in Monsoon?

From a Credit trading perspective, realtime or otherwise, there is nothing to see here, just move on.

From a macro risk perspective it is unambiguously a good thing to have a fast view of the aggregate risk across single name, index, and correlation inventory. Getting solid analytic numbers for an old piece of quant code is notoriously hard. No one with actual market background will check market convention assumptions, trade representations, or the valuations. If some enterprising company has a tied-out version of the old quant code, get it from them.

But there is more to this story. Suppose the broker dealer cannot bring themselves to separate the correlation book from the credit book, arguing that the correlation book is managed to zero risk and standard reserves are held while the trades roll off. Then, in order to run second order risks like counterparty valuation adjustment, fast correlation risk becomes important. Unless you get the correlation book running at speed, your second order risk Monte Carlo simulation batch performance will be dominated by the correlation book performance (assuming every other product book batch demonstrates competitive performance and better turnaround time compared to the correlation book). Hello, Maxeler.

Mac Pro+2s: 1 Supercomputer+FPGAs+4m: 0

Apart from saving over a million dollars in infrastructure cost annually, why bother running your credit derivative P&L/Risk batch as Mac Pro+2s rather than as Supercomputer+FPGAs+4m?

A. Rates, Mortgage, FX, Commodities P&L/Risk batches

It’s only the credit derivative guys in the Computerworld UK article that had this P&L and Risk batch performance problem in 2008 and now it is fixed?

B. Moore’s Law

“The number of transistors incorporated in a chip will approximately double every 24 months.”  —Gordon Moore, Intel Co-Founder

More of a prediction than a law really, but it has predicted accurately for 40 years and is expected to hold for another decade. In technology, if you do nothing more than "ride the wave" of Moore's Law you are doing well; however, this observation's simplicity can be deceiving. If the plan is for the credit derivative batch to "ride the wave" with thousands of grid CPUs, FPGAs, and associated staff, you may have a challenge when scaling to larger problems, like counterparty valuation adjustment (CVA), that Moore's Law will not automatically address for you. For example, Moore's Law doesn't get you power and machine room space for 10,000 grid CPUs for the CVA simulation. That is closer to "not riding the wave."

C. Simplicity

Supercomputers and FPGAs, in addition to GPUs, CUDA, CPU grids, and even dataflow architectures, are all reasonable technology bets that could very well be the best way to ride the Moore's Law wave through 2020. Using some or all of these technologies together in 2011 to run such a modest computation so slowly should elicit some follow-up questions.