Home » Posts tagged 'Numerical Opt'

# Tag Archives: Numerical Opt

## Vectorized Black Scholes Code from Intel

**Intel Corporation**, Overview of Vector Mathematics in Intel Math Kernel Library, 2012, here. I assume these are the Moscow Intel folks. We can back out the number cycles per Black Scholes element on a single core it’s probably 20 to 30 percent better than the 2007 cycle counts on the XLC/MASS POWER6 (it was like 170 cycles all in).

void BlackScholesFormula( int nopt, tfloat r, tfloat sig,tfloat s0[], tfloat x[], tfloat t[], tfloat vcall[], tfloat vput[] )

{

vmlSetMode( VML_EP );

DIV(s0, x, Div); LOG(Div, Log);

for ( j = 0; j < nopt; j++ ) {

tr [j] = t[j] * r;

tss[j] = t[j] * sig_2;

tss05[j] = tss[j] * HALF;

mtr[j] = -tr[j];

}

EXP(mtr, Exp); INVSQRT(tss, InvSqrt);

for ( j = 0; j < nopt; j++ ) {

w1[j] =(Log[j] + tr[j] + tss05[j]) * InvSqrt[j] *INV_SQRT2;

w2[j] =(Log[j] + tr[j] – tss05[j]) * InvSqrt[j] *INV_SQRT2;

}

ERF(w1, w1); ERF(w2, w2);

for ( j = 0; j < nopt; j++ ) {

w1[j] = HALF + HALF * w1[j];

w2[j] = HALF + HALF * w2[j];

vcall[j] = s0[j] * w1[j] – x[j] * Exp[j] * w2[j];

vput[j] = vcall[j] – s0[j] + x[j] * Exp[j];

}

}

## Abstraction is the Natural Enemy of Performance

**Michael Feldman**, HPC Wire, HPC Programming in the Age of Multicore: One Man’s View, here. Wellein sounds solid on programming for performance.

At this June’s International Supercomputing Conference (ISC’13) in Leipzig, Germany, Wellein will be delivering a keynote titled,

Fooling the Masses with Performance Results: Old Classics & Some New Ideas. The presentation was inspired by David H. Bailey 1991 classic articleTwelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers, and is intended to follow the humorous tone of Bailey’s original work.

## JP Morgan’s London Whale needs Maxeler’s FPGA Supercomputer to run Risk?

**Bloomberg**, JPMorgan Trader Iksil Fuels Prop-Trading Debate With Bets, here. London Whale needs some P from Series 9 Investment Grade CDX. Can Bloomberg and some prominent officials help fix the situation?

**Zerohedge**, From Bruno Iksil’s Personal Profile: Enjoys “Walking Over Water” And Being “Humble”, here. Oh, so the London Whale is shorting tranches of series 9 CDX in $100bn quantities.That starts to make sense given all the hysteria. So they make the purchases at the corporate-level to hedge SCDO exposure that is not actively traded by the desk. Maybe Bruno is the one who needs the the Maxeler FPGA Supercomputer Credit batch at JPM (see Credit Derivatives, Flynn’s Architectural Case for Maxeler in 2012?, and Street FP Due Diligence 3: Epic Yet?) to run in 238 seconds. How do you tag that? **London Whale needs Mammoth Supercomputer to Stay Afloat**? Too much, right? I still suspect that the entire JPM credit batch (as described) completes in less than a minute on a low-end Mac Pro even with the Gaussian Copula positions. Sort of more like “Bruno uses iPhone to Track Purchases.” (see Business Insider, Financial Post, Wall Street Journal blog, Sober Look, New York Times Dealbook, Financial Times Alphaville, blogrunner)

**Zerohedge**, 31 Dec 2011 Notional Amt. of Derivative Contracts Top 25 Comm. Banks, here. Didn’t MS carry derivative inventory in the past?

## Why Read the Heroes?

A small informal effort like Pink Iguana needs to lean heavily on curation for a specific audience. How narrow is the audience? Take the entire massive EcoFin community of DeLong, Wilmott, and Mankiw then subtract most of the folks who: don’t care if a 22nm semiconductor fab is competitive in 2012, haven’t compiled their code –O3 recently, or are sort of meh to the idea that there is a RDMA transport to L3. Those folks remaining might be the Pink Iguana audience if they also like: buying credit protection from AIG stories, P=NP speculation, and IEEE754. It’s the far side of the long tail.

So why curate for such a specific audience? Despite The End of Blogging, in 2012 there are remarkable and reasonably frequent publication streams from Gowers, Lipton, and Tao. The thing that is different in the last five years is the public availability of unfiltered, authoritative, and lucid commentary on specific topics. The keys are unfiltered, authoritative, and lucid. DeLong, Mankiw, Cowen, and Krugman run similarly authoritative and lucid publication streams that are more informed by their partisan backgrounds than Gowers, Lipton, and Tao. Intel, NVIDIA, and IBM have authoritative and lucid information as well, but they also have a day job to do. If folks like Gowers, Lipton, and Tao are regularly publishing there might be more, right? You just have to go look around, and maybe you figure out how something (e.g., ETP Arbitrage, Credit Derivatives, HFT, a specific floating point computation) actually works. So, Wisty curates on Pink Iguana.

Why are these folks in the Pink Iguana Hall of Heroes (listed below the Blogroll) and why should you read the Heroes?

**A Credit Trader** hasn’t published since 2009, he went to do other stuff, but wow what got published there was magnificent. Read Getchen Morgenson at NYT, for example this, then read The AIG Fiasco or Bond-CDS Negative Basis or How to Lose a Billion Dollars on a Trade, it is like a teenage lucidity head rush.

**Avellaneda** – 2010 Quant of the Year posts regularly from his NYU faculty page and covers Research and market commentary, Stochastic Calculus, PDEs for Finance, Risk and Portfolio Management.

**Bookstaber** – Author of the book A Demon of Our Own Design, ran Firm Risk at Salomon back in the day, and now is Senior Policy Advisor at the SEC. See Physics Envy in Finance or Human Complexity: The Strategic game of ? and ?

**DeLong** – Even with the constant bitching about the press and Team Republican plus the liveblogging of World War 2, I have never seen a better EcoFin website, see DeLong and Summers: Fiscal Policy in a Depressed Economy or Econ 191: Spring 2012. DeLong’s blog really is the model for curation and commentary to a large audience.

**Gowers** – Rouse Ball chair, Cambridge U, Fields Medal 1998, see ICM 2010 or Finding Cantor’s proof that there are transcendental numbers, and he was piqued to comment Re: Steig Larsson, or perhaps the translator Reg Keeling in Wiles meets his match. So, Salander’s picture perfect memory, capacity to defeat armed motorcycle gangs in hand-to-hand combat, and assorted other superpowers pass without comment but she thinks she has a proof of Fermat, you gotta call a mathematician to check yourself before you wreck yourself. Gowers is on the Heroes list forever, check.

**Kahan** – doesn’t publish so much anymore but he is the Edgar Allen Poe of floating point computations gone wrong horror stories, and they are all here. He did IEEE 754 floating point standard and won a Turing Award. When and if he has something to say, I will probably want to listen, see How Java’s Floating-Point Hurts Everyone Everywhere and Desperately Needed Remedies for the Undebuggability of Large Floating-Point Computations in Science and Engineering.

**Lipton** has a gloriously unique perspective presented in Godel’s Lost Letter. He provides the descriptive narrative for algorithm complexity in a public conversation typically dominated by proofs and expositions of computational models. If algorithm complexity was professional sports, its kind of like Lipton figured out there should be color commentators broadcasting live from the game. Top posts include: Interdisciplinary Research – Challenges, The Letterman Top Ten list of why P = NP is impossible, and The Singularity Is Here In Chess; its John Madden, Dick Vitale, and Andres Cantor meet Kurt Godel, John von Neumann, and Andrey Kolmogorov in the best possible way.

**Tao** saved the consistency of Peano Arithmetic as described here after winning a Fields Medal in 2006, see The Black Scholes Equation and Hilbert’s Fifth Problem and related topics.

**Tufte** is “the guy” for the visual display of quantitative information. He has been the guy at least since the early 1980s and does not really publish the same way as Gowers, Lipton, or Tao. Tufte kind of figured out his publication flow before the internet, so you buy his books and if you want to know what he is thinking about now, you go to his course. He has stuff on line, lots of it, for example see his notebooks, or about ET. The Tufte course attendance is sort of mandatory, not sure but I think that’s in Dodd-Frank Title VII, so just do it before they find out.

## Credit Derivatives

So, a major broker-dealer gets a 2011 industry award for running their overnight Risk and P&L credit derivative batch though a 1000 node CPU grid and some FPGAs in 238 seconds (in 2008 the same computation and same broker-dealer but presumably different CDS inventory took 8 hours, a great success). Then some blog posting claims that this same credit derivative batch could be run with some optimized C++ code on a $2500 Mac Pro in under 2 seconds IEEE 754 double precision tied out to the precision of the inputs. What’s going on? Does the credit derivative batch require a $1,000,000 CPU grid, FPGAs, and 238 seconds or one $2,500 MacPro, some optimized code, and 2 seconds?

It’s the MacPro + 2 seconds very likely, let’s think how this could be wrong:

A. The disclosed data is materially wrong or we have misinterpreted it egregiously. The unusual event in all this is the public disclosure of the broker-dealer’s runtime experience and the code they were running. It is exceedingly rare to see such a public disclosure. That said, the 8 hour production credit derivative batch at a broker-dealer in 2008 is not the least-bit surprising. The disclosure itself tells you the production code was once bloated enough to be optimized from 8 hours to 4 minutes. You think they nailed the code optimization on an absolute basis when the 4 minutes result was enough to get a 2011 industry award, really? The part about the production infrastructure being a supercomputer with thousands of grid CPUs and FPGAs, while consistent with other production processes we have seen and heard of running on the Street, is the part you really have to hope is not true.

B. The several hundred thousand position credit derivative batch could be Synthetic CDO tranches and standard tranches and require slightly more complicated computation than the batch of default swaps assumed in the previous analysis. But all the correlation traders (folks that trade CDOs and tranches) we know in 2011 were back to trading vanilla credit derivatives and bonds. The credit derivative batch with active risk flowing through in 2011 is the default swap batch (you can include index protection like CDX and ITRX in this batch as well). Who is going to spend three years improving the overnight process on a correlation credit derivative book that is managed part time by a junior trader with instructions to take on zero risk? No one.

C. The ISDA code just analyzed is not likely to be the same as the 2011 award winning production credit derivative batch code. In fact, we know portions of the production credit derivative batch were translated into FPGA circuitry, so the code is real different, right? Well over the last decade of CDS trading most of the broker-dealers evolved to the same quantitative methodology for valuing default swaps. Standards for converting upfront fees to spreads (ISDA) and unwind fees (Bloomberg’s CDSW screen) have influenced broker-dealer CDS valuation methodology. We do not personally know exactly what quantitative analytics each one of the of the broker-dealers runs in 2011, but Jarrow-Turnbull default arrival and Brent’s method for credit curve cooking covers a non-trivial subset of the broker-dealers. The ISDA code is not likely to be vastly different from the production code the other broker-dealers use in terms of quantitative method. Of course, in any shop there could be some undisclosed quantitative tweaks included in production and the MacPro + 2 seconds analysis case would be exposed in that event.

D. The computational performance analysis just presented could be flawed. We have only thought about this since seeing the Computerworld UK article and spent a couple weekends working out the estimate details. We could have made a mistake or missed something. But even if we are off by a factor of 100 in our estimates (we are not) its still $2500 + MacPro + 200 seconds versus $1,000,000 + 1000 CPU+ FPGAs+238 seconds.

## Street FP Due Diligence 3: Epic yet?

**I am not gonna pay a lot for this muffler**

What is left to get a runtime estimate for the credit derivative batch on an off-the-shelf computer? We need a cooked credit curve (bootstraped hazard rate term structure) and to estimate the computational cost of the IOUs from the previous valuation step. Let’s account for the clock cycles needed to cook a USD credit curve out to 10 years with par CDS quotes at 1Y, 3Y, 5Y, 7Y, 10Y and a previously cooked USD Libor. Then we will address the additional computational cost of the valuation IOUs. The final post in this series will cover the additional computational cost of perturbational risk (multiply single curve cooking estimates by 20-30x corresponding to the number of distinct perturbed curves required for risk, ditto valuation estimates) as well as the cost of valuation for non-standard inventory CDS trades (trades where the paydates are not standard and therefore introduce an interpolation computation charge).

**Credit Curve Cooking:**

Assuming we are given a cooked USDLibor curve we can see from the valuation code the credit curve cooking only needs to take the par CDS quotes for the value date and return a set of assignments through JpmcdsForwardZeroPrice() for just the 20 paydates in the expected case (5Y CDS paying quarterly) or both the 20 paydates and the 130 numerical integration grid points (biweekly grid). So, if we derive estimates on the convergence speed of the one dimensional root finder, Brent’s method, and then get estimates on the cost of the Valuation IOUs we will have a reasonably complete floating point performance picture. We will assume away the cost of USD libor curve cooking and USDLibor interpolation since the computation cost is small and can be amortized across all the credits cooked against USDLibor, in any case. So, Valuation IOUs 2, 6, and 9 (see below) are assumed to cost less than one clock cycle of floating point execution. They are just variable assignments with some cost in cache pressure.

Here is the ISDA credit curve cooking code. Notice they use Brent’s method and the convergence criteria looks like zero to ten decimal paces. Given that you are cooking a credit curve that was cooked the previous business day as well you can assume the root is bracketed to 10bps or three decimal places. Let’s not even bother to account for potential compute time efficiency the bootstrapping valuations could provide. Assume every Brent function valuation cost as much as a 5Y CDS valuation.

/* we work with a continuously compounded curve since that is faster –

but we will convert to annual compounded since that is traditional */

cdsCurve = JpmcdsMakeTCurve (today,

endDates,

couponRates,

nbDate,

JPMCDS_CONTINUOUS_BASIS,

JPMCDS_ACT_365F);

if (cdsCurve == NULL) goto done;

context.discountCurve = discountCurve;

context.cdsCurve = cdsCurve;

context.recoveryRate = recoveryRate;

context.stepinDate = stepinDate;

context.cashSettleDate = cashSettleDate;

for (i = 0; i < nbDate; ++i)

{

double guess;

double spread;

guess = couponRates[i] / (1.0 – recoveryRate);

cl = JpmcdsCdsContingentLegMake (MAX(today, startDate),

endDates[i],

1.0,

protectStart);

if (cl == NULL) goto done;

fl = JpmcdsCdsFeeLegMake(startDate,

endDates[i],

payAccOnDefault,

couponInterval,

stubType,

1.0,

couponRates[i],

paymentDCC,

badDayConv,

calendar,

protectStart);

if (fl == NULL) goto done;

context.i = i;

context.cl = cl;

context.fl = fl;

if (JpmcdsRootFindBrent ((TObjectFunc)cdsBootstrapPointFunction,

(void*) &context,

0.0, /* boundLo */

1e10, /* boundHi */

100, /* numIterations */

guess,

0.0005, /* initialXstep */

0, /* initialFDeriv */

1e-10, /* xacc */

1e-10, /* facc */

&spread) != SUCCESS)

{

JpmcdsErrMsg (“%s: Could not add CDS maturity %s spread %.2fbp\n”,

routine,

JpmcdsFormatDate(endDates[i]),

1e4 * couponRates[i]);

goto done;

}

cdsCurve->fArray[i].fRate = spread;

Assume Brent on average converges superlinearly like the secant method (the exponent is something like 1.6, see Convergence of Secant Method). Assume also that one iteration of Brent’s method costs one function evaluation (w. the IOUs computation time added) and some nominal extra clocks say +20%. Given the accuracy of the initial guess, the smooth objective function, and the expected superlinear convergence five iterations (1.6**5) should be good enough on average. With par CDS quotes at 1Y, 3Y, 5Y, 7Y, and 10Y to fit we expect to need 20 CDS valuations. Thinking ahead to perturbational risk (in the next post), we can obviously bundle many perturbations into a single massive vectorized Brent’s method call and drive the cooking cost massively lower. Let’s ignore this obvious optimization for now. What do we have so far?

CDS credit curve cooking estimates are then as follows:

Expected case = 1.2 * (20 * 150 + 20 * valuation IOU) clock cycles (quarterly paydates=gridpoints)

Worst case = 1.2 * (20 * 810 + 20 * valuation IOU) clock cycles (biweekly numerical integration gridpoints)

Expected case is running at 3600 cycles plus 24x the IOU cycle count. In other words, one microsecond plus some IOU accounting.

**Valuation IOU Runtime Accounting:**

With subtopic titles like “Valuation IOU Runtime Accounting” I am certain to cut my blog reading population in half – the price of a lonely optimization post. Recall our list of valuation IOUs. We assumed that some Rates process will exogenously produce some suitably cooked and interpolated USD Lidor (discCurve) at the cost of some L2 cache pressure to us. Note we assume on the code rewrite we vectorize and unroll loops anywhere possible in the code. So in our accounting we take the cycle counts from Intel’s vectorized math library (MKL) for double precision. Vector divides cost 5.4 cycles per element. Vector exponentials cost 7.9 cycles per element. Vector log costs 9.9 cycles and vector reciprocals cost 3.36 cycles per element. The vector lengths for these cycle counts are probably close to 1024 so perhaps the cycle counts are slightly optimistic.

Valuation IOU Accounting list

- survival [paydates] = JpmcdsForwardZeroPrice(SpreadCurve,…) 4*paydate clocks
- discount[paydates] = JpmcdsForwardZeroPrice(discCurve,…) 0 assumed away
- vector1[paydates] = notional*couponRate*accTime[paydates] paydate clocks
- vector2[paydates] = survival[paydates]*discount[paydates] paydate clocks
- s[gridpts] = JpmcdsForwardZeroPrice(SpreadCurve,…) 4*paydate clocks
- df[gridpts] = JpmcdsForwardZeroPrice(discCurve,…) 0 assumed away
- t[gridpts] grid time deltas 0 assume away
- lambda[gridpts] = log(s[gridpts]/s[gridpts])/t[gridpts] 15*gridpts clocks
- fwdRate[gridpts] = log(df[gridpts]/df[gridpts])/t[gridpts] 0 assumed away
- lambdafwdRates[gridpts] = lambda[gridpts] * fwdRates[gridpts] gridpts clocks
- vector3[gridpts0 = exp(-t[gridpts]*lambdafwdRates[gridpts]) 10 * gridpts clocks
- vector4[gridpts] = s[gridpts] * df[gridpts] gridpts clocks
- vector5[gridpts] = reciprocal (lambdafwdRates[gridpts]) 4 * gripts clocks
- t0[gridpts], t1[gridpts] coefficients for accrued interpolation 0 assume away
- thisPv[gridpts] 4 * gridpts clocks

Looks like the IOU cost in aggregate is 10 * paydate clocks or 200 clock cycles plus 35 * gridpts clocks (ugh!). So expected case assumes gridpoints = quarterly paydates so all in 45 * paydates clock cycles or 900 clocks. Worst case 200 clocks + 35 * 130 clocks or 4750 clocks.

All in then, CDS credit curve cooking estimates are as follows:

Expected case = 1.2 * (20 * 150 + 20 * 900) clock cycles (quarterly paydates=gridpoints)

Worst case = 1.2 * (20 * 810 + 20 * 4750) clock cycles (biweekly numerical integration gridpoints)

Expected case looks like 25,200 clocks or 7 microseconds. Worst case looks like 133,440 cycles or 37 microseconds. Ok, we tentatively thought 20 microseconds worst case but recall we left a chunk of optimizations unaccounted for in the interests of brevity. But the worst case is about two times more expensive than we naively expected.

From here it is easy, full risk will be 20x to 30x the single curve cooking estimate even if we are totally apathetic and clueless. Expected case is 7 * 20 microseconds call it 140 microseconds and worst case is 37 * 30 microseconds call it 1.11 milliseconds. Since we are going to argue that non-standard CDS interpolation can be stored with the trade description the incremental cycle count for non-standard deals is de minimus since the computational cost is amortized over the life of the position (5Y on average). Expected case then in our target 10K credit curves 100K CDS looks like 10 K * 140 microseconds + 100K * 20*42 nanoseconds or one and a half a seconds on a single core of a single microprocessor. If I have to do the Computerworld UK credit batch on the cheapest Mac Pro I can order from the Apple store today, the Quad core 2.8 GHz Nehalem for 2,499 USD (not gonna take the Add 1,200USD to get the 3.33 GHz Westmere). It’ll cost me 5.8 seconds on a single core but fortunately I purchased four cores (of raw power according to the Apple website) so the Computerworld UK credit derivative batch still runs in a second and change for 2,499 USD + shipping.

## Street FP Due Diligence 2: Looking Epic maybe

The first step to getting a better estimate on the off-the-shelf runtime for credit curve cooking is to get a sense of how long the valuation of a quoted SNAC CDS (standardized default swaps, no custom cashflows) requires. Getting a SNAC CDS valuation runtime estimate is the subject of this post. The outline of the argument is to take the SNAC valuation runtime estimate and cook the credit curve (bootstrap the hazard rate term structure) from the quoted SNAC default swaps with terms 1Y, 3Y, and 5Y (see O’Kane and Turnbull pgs 12-14). Once we have a cooked credit curve we can run valuation and risk for the default swap inventory dependent on the given cooked credit curve. For a given credit, the serial computation runtime for a single CDS position valuation is an order of magnitude smaller than the credit curve cooking computation. Take a look at O’Kane and Turnbull pgs5-10 to review the Jarrow-Turnbull reduced form model for valuing the contingent leg and the fee leg. See the FINCAD document describing the modeling assumptions behind the ISDA CDS Standard Model, here. Recognize also that the ISDA code doesn’t have all the calendars and currency convention data you would expect in a complete production Credit P&L but it’s OK as a proxy code for getting estimates.

Vanilla default swaps have two legs a Premium Leg and a Contingent Leg, the default swap PV (present value) is the difference between the Premium Leg PV and the Contingent Leg PV. If counterparty A is long protection in a default swap they are paying premium to get the protection of the contingent leg, as well as being short the credit. We will assume the average maturity of a default swap is 5Y. We will proceed purely with static analysis of the code with no peeking at the underlying mathematics for better optimizations – nothing but the code.

**Valuing the Premium Leg**

The basic computational task with the Premium Leg is to discount each of the scheduled premium cash flows off the risky curve accounting for both the time value of money and the probability that the premium may never be paid due to termination of the default swap. Typically the main computational part of valuing the Premium Leg is valuing the accrual on default. The actual discounting of the premium cashflows to be paid on the quarterly paydates is straightforward and computationally inexpensive. Accruing on default is a standard convention and refers to the portion of the premium owed by the protection buyer in the event that a default occurs between premium paydates ( as oppose to default arriving exactly on a paydate). The protection buyer owes the premium accrued up to the date that the relevant default is officially recognized. For computational efficiency we are going to push the accrued on default computation over to the Contingent Leg computation, because they are so similar. This leaves about 20-30 cycles of computation for a 5Y default swap plus an IOU to account for the more computationally expensive accrual on default valuation. Here is the relevant code from ISDA from inside a loop for each paydate:

amount = notional * couponRate * accTime;

survival = JpmcdsForwardZeroPrice(spreadCurve, today, accEndDate + obsOffset);

discount = JpmcdsForwardZeroPrice(discCurve, today, payDate);

myPv = amount * survival * discount;

Notice that the variables notional, couponRate, and n-1 of n assignments of accTime are known at compile time. The calls to the function JpmcdsForwardZeroPrice() for the spreadCurve and this discCurve are simply interpolations from the cooked curves where the interpolation parameters (at least n-1 out of n) are known at compile time. Price them as assignments and assume that the cache size will not be swamped with 20 or so doubles for the quarterly paying CDS. The product survival * discount per paydate is known at curve cooking time. The trade off is between consuming a cycle per paydate versus adding to the cache load. Let’s put it in the cache for now and assume the curve cooker will compute the product survival*discount. So there is slightly more than a cycle (call it 1.5 cycles) per quarterly paydate w. no cache pressure or wait state penalties. We will call it 30 cycles for a 5Y quarterly paying average default swap.

**Valuing the Contingent Leg**

We need to account for the clock cycles required to:

- PV the default swap payout given the expected default arrival over the term of the CDS contract.
- PV the accrued premium owed by the protection buyer to the protection seller in the event the default arrives between premium cash flow paydates (IOU from the payleg).

Again static code analysis only, no peeking at the mathematics. Here is the relevant ISDA code for 1. integrating the value of the terminal CDS contingent payoff over the term of the CDS contract.

s1 = JpmcdsForwardZeroPrice(spreadCurve, today, startDate);

df1 = JpmcdsForwardZeroPrice(discCurve, today, MAX(today, startDate));

loss = 1.0 – recoveryRate;

for (i = 1; i < tl->fNumItems; ++i)

{

double lambda;

double fwdRate;

double thisPv;

s0 = s1;

df0 = df1;

s1 = JpmcdsForwardZeroPrice(spreadCurve, today, tl->fArray[i]);

df1 = JpmcdsForwardZeroPrice(discCurve, today, tl->fArray[i]);

t = (double)(tl->fArray[i] – tl->fArray[i-1])/365.0;

lambda = log(s0/s1)/t;

fwdRate = log(df0/df1)/t;

thisPv = loss * lambda / (lambda + fwdRate) *

(1.0 – exp(-(lambda + fwdRate) * t)) * s0 * df0;

myPv += thisPv;

}

Again, as in the case of the pay leg, all the calls to JpmcdsForwardZeroPrice() are interpolations known at credit curve cooking time so we will account for them as variable assignments and assign the computational cost of interpolation to the curve cooker. The time consuming computation here and in the code below (for accrued on default) depends on the resolution of the time grid (fNumItems) to get the discrete summation to converge to the continuous integral value. If the integration time grid has M points and r is the 5Y swap rate (the risk free rate USD currently 114 bps 19Jan12) then O’Kane and Turnbull (pg10) show the percentage error in the discreet approximation is:

r/2*M

In production P&L batches I have seen M=26 (biweekly integration grid points) but based on current levels M=4 (quarterly cashflow paydates) would bring the error to within the bid ask spread. Let’s assume M=26 worst cast and M=4 expected case for performance approximations.

Notice that the expensive floating point operations in these loops are the math.h functions log() and exp() and divides. We will push the log and exp calls to the curve cooker since the grid points for the integration as well as all the interpolations are known at curve cooking time. The variable lambda is known at credit curve cooking time and the variable fwdRate is known at Libor curve cooking time. Similarly all the values of t are known at compile time so we are not even going to multiply by the reciprocal of 365 inside the loop. We will also book the computational cost of the reciprocal 1.0/(lambda + fwdRate) to the curve cooker. So, no divides and no math.h calls in the loop we cost it at 4 fused add multiply cycles after vectorizing, loop unrolling, and optimization. In the expected case, loop 1 cost us 4 cycles * (4*5) grid points or 80 cycles. In the worst case, loop 1 cost us 4 cycles * (26*5) grid points or 520 clock cycles.

Here is the relevant ISDA code for 2, a very similar loop compared to the PV of the contingent payoff loop, right?

for (i = 1; i < tl->fNumItems; ++i)

{

double lambda;

double fwdRate;

double thisPv;

double t0;

double t1;

double lambdafwdRate;

if(tl->fArray[i] <= stepinDate) continue;

s1 = JpmcdsForwardZeroPrice(spreadCurve, today, tl->fArray[i]);

df1 = JpmcdsForwardZeroPrice(discCurve, today, tl->fArray[i]);

t0 = (double)(subStartDate + 0.5 – startDate)/365.0;

t1 = (double)(tl->fArray[i] + 0.5- startDate)/365.0;

t = t1-t0;

lambda = log(s0/s1)/t;

fwdRate = log(df0/df1)/t;

lambdafwdRate = lambda + fwdRate + 1.0e-50;

thisPv = lambda * accRate * s0 * df0 * (

(t0 + 1.0/(lambdafwdRate))/(lambdafwdRate) –

(t1 + 1.0/(lambdafwdRate))/(lambdafwdRate) *

s1/s0 * df1/df0);

myPv += thisPv;

s0 = s1;

df0 = df1;

subStartDate = tl->fArray[i];

}

We will treat both the loops simultaneously assume that the loops will be fused. The same analysis applies to this loop: laqmbda, fwdRate, lambdafwdRate, and all the interpolations and ratios are known at curve cooking time so they will be accounted for in the curve cooker not in the CDS valuation. Net, 2 fused multiply cycles will get the accrued on default value per grid point. Expected case = 40 cycles, worst case an additional 260 cycles.

CDS valuation estimates are then as follows

Expected case = 30 + 80 + 40 = 150 clock cycles, 42 ns, 23MM valuations /second

Worst case = 30 + 520 + 260 = 810 clock cycles, 225 ns, 4.5 MM valuations/second

In another post we will go through the curve cooking accounting and deal with the cache traffic we are creating with allocating computation to the curve cooker. Informally, we think cooking is an order of magnitude more expensive than valuation so we are kind of thinking under 10 – 20 microseconds to cook a curve on a single off-the-shelf core in either expected or worst case. 50K curves cooked per second seems plausible – let’s see how it goes the cache penalties/fp pipeline bubbles could catch up with us.

## Street Credit FP Due Diligence

**Question**:

Over/Under: 12 seconds of floating point computation; P&L and risk for several 100K default swaps; off the shelf servers in 2012 – running standard ISDA CDS model code; 5 to 6 decimal place tie out?

Even for non-SNAC positions I’m thinking under, perhaps epically under depending on the hardware budget. Ideally you run code to determine this and we do not have the optimized code in hand, but we can write it. Before writing the code you want to run the due diligence to see if writing the code is worth it. Let’s see.

**History**:

We recall from previous 2009 estimates ( see Totally Serial) for unoptimized scalar code cooking of par and perturbed curves certainly costs less 7 milliseconds and valuation of a default swap’s contingent and fee legs take about 40 microseconds . This is of course assuming that the default swaps have custom cashflows and are not SNAC default swaps (standard instruments with IMM date cashflow schedules) in which case the default swap valuation execution time massively collapses and the credit curve cooking time can be significantly reduced as well.

**Methodology**:

Let’s assume 100K OTC (non SNAC default swaps) and 10K credit curves and we can scale to fit the actual curve and deal counts with more accurate estimates. Lets further assume that the 10K credit curves are independent (there is no known geometric or arithmetic structure to the traders marks). So, you actually have to cook each of the 10K credit curves. For unoptimized code then we expect about 7 ms*10K + 0.04ms * 100K or ~80 seconds floating point computation elapsed time on a single contemporary microprocessor core. Naively, you would guess this benchmark optimization would be looking for about 20-30 seconds (performance improvement 3x-4x over unoptimized) on a single x86 core. So assuming we buy a $3000 6-8 core microprocessor in 2012 the 100K/10K batch optimized fp code executes in a few seconds, expected case. Variance in the estimate will depend on some assumptions like the average number of iterations for convergence with Brent, the average term of the default swaps, the number of perturbed credit curves needed for full risk, the scaling across multiple on-chip cores among other things.

Amdahl’s Law directs the due diligence to first focus on the curve cooking so that’s what we will do in addition to rechecking the previous estimates ( I think they were very conservative) in the outlined argument. Lets count the operations in the ISDA code to nail this estimate down. Then we will estimate what the L1 and L2 cache efficiency we can plausibly expect to get a target due diligence number. What do we look at in the ISDA src distribution? Bunch of src files we don’t care about (DK) from the perspective of getting an estimate of the floating point computation time. The other annotated file names (in bold) contain source we may need to study.

- badday.c DK
- bsearch.c DK
- buscache.c DK
- busday.c DK
- cashflow.c DK
**cds.c computes contingent leg, fee leg, vanilla CDS PV****cdsbootstrap.c Brent root finder for credit curve cooking****cdsone.c****converts upfront to flat spread**- cerror.c DK
- cfileio.c DK
- cfinanci.cpp DK
- cmemory.c DK
**contingentleg.c computes contingent leg and one period integral**- convert.c DK
- cx.c DK
- cxbsearch.c DK
- cxdatelist.c DK
**cxzerocurve.c calc zero price for given start and maturity**- date_sup.c DK
- dateadj.c DK
- dateconv.c DK
- datelist.c DK
- dtlist.c DK
**feeleg.c calc PV of single fee, fee leg, accrued**- fltrate.c DK
**gtozc.c looks like curve cooking****interpc.c interpolation of rates**- ldate.c DK
**linterpc.c curve interpolation**- lintrpl.c DK
- lprintf.c DK
- lscanf.c DK
**rtbrent.c root finding for curve cooking****schedule.c possibly DK – cash flow schedule generation****streamcf.c calculate cashflows**- strutil.c DK
**stub.c stub payments****tcurve.c discounting****timeline.c not sure maybe DK**- version.c DK
- yearfrac.c DK
**zcall.c zero curve****zcswap.c add strip of swap to zcurve**- zcswdate.c DK
**zcswutil.c cashflow lists for swap instruments****zerocurve.c build zero curve from mm and swaps****zr2coup.c calculates the par swap rate****zr2fwd.c calculates zero coupon forward rate**

We should be able to pound this due diligence through with a little effort, let’s see how it goes. Maybe I am missing something? It’ll take a couple postings I expect, but it will be interesting.

## Maxeler does Dataflow

Many companies are trying novel chip technologies to accelerate specialized kinds of computing jobs. But there’s signs that something a bit more radical, backed by technology vendors such as Maxeler Technologies, is also winning converts.

The London-based company was formed to commercialize a technology known by the phrase dataflow, a concept discussed since the 1970s that represents a sharp break from the fundamental design used in most computers and microprocessor chips inside them.

Clark at Wall Street Journal, here. Odd lead in – bringing 1970s concepts to you today – does it run APL? Alright I’m gonna bite, lets see what the FPGA technology bet is with Maxeler. Wonder what the dataflow folks Arvind , Culler, and Iannucci are doing these days? Oh, Arvind’s doing Mobile phone ecosystems for Nokia; Culler’s doing Electricity generation in Berkeley; Iannucci’s doing some digital radio stuff at RAI. Arvind’s 2005 ISCA talk slides from his home page seems like a good place to start. Lets figure out why dataflow is going to be a good technology bet in 2012 for pricing several 100K default swaps off a bunch of cooked credit curves.

## Algorithm Acceleration versus Moore’s Law

Two blog posts contrasting the macro performance improvements in numerical optimization see Noam Nisan and John Cook, It probably makes sense to bookmark Nisan’s blog Turing’s Invisible Hand long term even though it is now a composite posting location for others than Nisan – see how it goes. I like Cook’s R in action post, here, as well.