Pink Iguana


Tag Archives: Floating Point

Vectorized Black Scholes Code from Intel

Intel Corporation, Overview of Vector Mathematics in Intel Math Kernel Library, 2012, here. I assume these are the Moscow Intel folks. We can back out the number of cycles per Black-Scholes element on a single core; it’s probably 20 to 30 percent better than the 2007 cycle counts on the XLC/MASS POWER6 (it was like 170 cycles all in).

void BlackScholesFormula( int nopt, tfloat r, tfloat sig,
                          tfloat s0[], tfloat x[], tfloat t[],
                          tfloat vcall[], tfloat vput[] )
{
    int j;

    /* EP = lower-accuracy, enhanced-performance VML mode */
    vmlSetMode( VML_EP );

    /* DIV, LOG, EXP, INVSQRT and ERF presumably wrap the VML vector functions
       (vdDiv, vdLn, vdExp, vdInvSqrt, vdErf in double precision); the work
       arrays and constants (Div, Log, tr, tss, tss05, mtr, Exp, InvSqrt,
       w1, w2, sig_2, HALF, INV_SQRT2) are declared elsewhere in the Intel
       sample. */
    DIV(s0, x, Div); LOG(Div, Log);

    for ( j = 0; j < nopt; j++ ) {
        tr[j]    = t[j] * r;
        tss[j]   = t[j] * sig_2;
        tss05[j] = tss[j] * HALF;
        mtr[j]   = -tr[j];
    }

    EXP(mtr, Exp); INVSQRT(tss, InvSqrt);

    for ( j = 0; j < nopt; j++ ) {
        w1[j] = (Log[j] + tr[j] + tss05[j]) * InvSqrt[j] * INV_SQRT2;
        w2[j] = (Log[j] + tr[j] - tss05[j]) * InvSqrt[j] * INV_SQRT2;
    }

    ERF(w1, w1); ERF(w2, w2);

    for ( j = 0; j < nopt; j++ ) {
        w1[j] = HALF + HALF * w1[j];
        w2[j] = HALF + HALF * w2[j];
        vcall[j] = s0[j] * w1[j] - x[j] * Exp[j] * w2[j];
        vput[j]  = vcall[j] - s0[j] + x[j] * Exp[j];
    }
}
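To back the cycles-per-element number out on a given box, a timing harness along these lines would do. This is my sketch, not Intel’s: it assumes tfloat is double, that the sample’s internal work arrays are sized for NOPT elements, that __rdtsc ticks at the nominal core clock (pin the frequency for a clean number), and that everything is linked against MKL.

/* Timing-harness sketch (not from the Intel paper) to back out cycles per
   Black-Scholes element under the assumptions stated above. */
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>            /* __rdtsc */

typedef double tfloat;            /* assumed; the Intel sample typedefs this */

void BlackScholesFormula(int nopt, tfloat r, tfloat sig,
                         tfloat s0[], tfloat x[], tfloat t[],
                         tfloat vcall[], tfloat vput[]);

int main(void)
{
    enum { NOPT = 1 << 16, REPS = 100 };
    tfloat *s0    = malloc(NOPT * sizeof *s0);
    tfloat *x     = malloc(NOPT * sizeof *x);
    tfloat *t     = malloc(NOPT * sizeof *t);
    tfloat *vcall = malloc(NOPT * sizeof *vcall);
    tfloat *vput  = malloc(NOPT * sizeof *vput);

    for (int j = 0; j < NOPT; j++) {   /* plausible spot/strike/maturity inputs */
        s0[j] = 90.0 + (j % 21);       /* spots 90..110     */
        x[j]  = 100.0;                 /* strike            */
        t[j]  = 0.25 + 0.25 * (j % 4); /* 3M..1Y maturities */
    }

    BlackScholesFormula(NOPT, 0.02, 0.30, s0, x, t, vcall, vput);  /* warm-up */

    unsigned long long c0 = __rdtsc();
    for (int rep = 0; rep < REPS; rep++)
        BlackScholesFormula(NOPT, 0.02, 0.30, s0, x, t, vcall, vput);
    unsigned long long c1 = __rdtsc();

    printf("~%.1f cycles per Black-Scholes element\n",
           (double)(c1 - c0) / REPS / NOPT);
    return 0;
}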


The Federal Reserve and FPGA/C Paranoia

Roger Lowenstein, NYT, The Federal Reserve’s Framers Would Be Shocked, here.

ONE hundred years ago today, President Woodrow Wilson went before Congress and demanded that it “act now” to create the Federal Reserve System. His proposal set off a fierce debate. One of the plan’s most strident critics, Representative Charles A. Lindbergh Sr., the father of the aviator, predicted that the Federal Reserve Act would establish “the most gigantic trust on earth,” and that the Fed would become an economic dictator or, as he put it, an “invisible government by the money power.”

Tan, Boland, and Constantinides, FPGA Paranoia: Testing numerical properties of FPGA floating point IP-cores, here. If you go to a future where your code execution is not in the economic sweet spot driving microprocessor and system board design, maybe you have to resort to FPGA hackery to continue to scale with Moore’s Law. The Maxeler folks did impress on me that floating point performance has always been a wide spectrum of possible engineering solutions. Vector math double precision ICC/MKL hacking on commodity processors is somewhere in the middle of the spectrum of stuff you can do.

Aside from exhaustive testing across every possible input value, the only way to confirm that hardware performs exactly as specified is to perform some sort of formal verification of the hardware implementation [12–14]. However, such a process typically requires a lot of time and effort, which is unacceptable, especially when new algorithms that save time and area are being developed all the time. As such a more tractable approach is to create a test suite that identifies corner cases that are most likely to cause errors in the output, and test the hardware against these cases, alongside further random tests, to ensure it works as desired. In terms of floating point verification, the TestFloat [15], the IeeeCC754 test suite [16] and Paranoia [17] are three such test suites. In this project, we have chosen to focus on the latter, for it is the most well-known, as can be seen by its various translations – from its original BASIC program [17] into various languages, including Pascal [18] and C [19]. The popularity of Paranoia has ensured it is still used on modern CPU hardware to show flaws in many floating point arithmetic implementations [20], as well as to benchmark GPU floating point arithmetic, and for this reason, we wish to create the same level of scrutiny towards FPGA IP-cores.

Thos Sumner and David Gay, A C version of Kahan’s Floating Point Test “Paranoia”, here. Interesting if there is a vectorized math and FMA version somewhere.

static char *hist[] = {
	"The program attempts to discriminate among",
	"   FLAWs, like lack of a sticky bit,",
	"   Serious DEFECTs, like lack of a guard digit, and",
	"   FAILUREs, like 2+2 == 5 .",
	"Failures may confound subsequent diagnoses.\n",
	"The diagnostic capabilities of this program go beyond an earlier",
	"program called `MACHAR', which can be found at the end of the",
	"book  `Software Manual for the Elementary Functions' (1980) by",
	"W. J. Cody and W. Waite. Although both programs try to discover",
	"the Radix, Precision and range (over/underflow thresholds)",
	"of the arithmetic, this program tries to cope with a wider variety",
	"of pathologies, and to say how well the arithmetic is implemented.",
	"\nThe program is based upon a conventional radix representation for",
	"floating-point numbers, but also allows logarithmic encoding",
	"as used by certain early WANG machines.\n",
	"BASIC version of this program (C) 1983 by Prof. W. M. Kahan;",
	"see source comments for more history.",
	0};

C code on FPGAs

Nadav Rotem, C to Verilog, here.

C-to-Verilog is a free and open sourced on-line C to Verilog compiler. You can copy-and-paste your existing C code and our on-line compiler will synthesize it into optimized verilog.
For additional information on how to use our website to create FPGA designs, watch the screencast below.

Andrew Putnam, Susan Eggers, et al., Performance and Power of Cache-Based Reconfigurable Computing, here. Did not see these slides at Susan Eggers’ website. Note Prassanna Sundararajan’s name is on the slides – he must know something about 2008 CHiMPS.

This paper describes CHiMPS, a C-based accelerator compiler for hybrid CPU-FPGA computing platforms. CHiMPS’s goal is to facilitate FPGA programming for high performance computing developers. It inputs generic ANSI C code and automatically generates VHDL blocks for an FPGA. The accelerator architecture is customized with multiple caches that are tuned to the application. Speedups of 2.8x to 36.9x (geometric mean 6.7x) are achieved on a variety of HPC benchmarks with minimal source code changes.

Looking at the CHiMPS paper: they have a Black-Scholes benchmark but only quote relative performance between x86/gcc and CHiMPS/FPGA. It sure looks like the FPGA performance here is compared against naive x86 code. Recall Pink Iguana published Black Scholes in 36 nanoseconds on 2007 commodity hardware using XLC. So unless CHiMPS was being misleading here, I would assume they had a 2008 FPGA that did Black-Scholes evals in 2.05 (36/17.6) nanoseconds. I’m gonna go ahead and give a time-travel-2008-shenanigans yellow card, because I don’t believe there was enough non-ILP parallelism in the basic Black-Scholes formula evaluation given the FPGA clock speeds circa 2008 (something like 550 MHz); see Xilinx References Apr 2012.

Speedup numbers do not fully convey the performance potential of the CHiMPS accelerators, because performance is often highly dependent on the input data size and the computation-communication ratio. To illustrate this, Figure 2 depicts Black-Scholes’s sensitivity to input size. For very small numbers of stocks, using CHiMPS is detrimental to performance, because the data transfer time to the FPGA is greater than executing on the CPU. Performance breaks even around 24 stocks, and continues to increase until it peaks at a speedup of 17.6 (700 stocks). Speedup then decreases since the data for more stocks no longer fits in the FPGA caches, and the CPU’s larger cache and memory prefetcher keep the CPU execution time steady.

Check in with Agner Fog

Agner’s CPU Blog, New C++ Vector Class Library, here. I’m interested in the AVX2 side of this.

Great news. I have made a new vector class library that makes it easier to use the vector instruction sets from SSE2 to AVX and AVX2. It’s a C++ library that defines a lot of vector classes, functions and operators. Adding two vectors is as simple as writing a + sign instead of using assembly code or intrinsic functions. This is useful where the compiler doesn’t vectorize your code automatically. The resulting code has no extra overhead when compiled with an optimizing compiler. This library has many more features than Intel’s vector classes:

Features

  • vectors of 8, 16, 32 and 64-bit integers, signed and unsigned
  • vectors of single and double precision floating point numbers
  • total vector size 128 or 256 bits
  • defines almost all common operators
  • boolean operations and branches on vector elements
  • defines many arithmetic functions
  • permute, blend and table-lookup functions
  • fast integer division
  • many mathematical functions (requires external library)
  • can build code for different instruction sets from the same source code
  • CPU dispatching to utilize higher instruction sets when available
  • uses metaprogramming (including preprocessing directives and templates) to find the best implementation for the selected instruction set and parameter values of a given operator or function

Xilinx References Apr 2012

Xilinx, High Performance Computing Using FPGAs, Sep 2010, here.

The shift to multicore CPUs forces application developers to adopt a parallel programming model to exploit CPU performance. Even using the newest multicore architectures, it is unclear whether the performance growth expected by the HPC end user can be delivered, especially when running the most data- and compute-intensive applications. CPU-based systems augmented with hardware accelerators as co-processors are emerging as an alternative to CPU-only systems. This has opened up opportunities for accelerators like Graphics Processing Units (GPUs), FPGAs, and other accelerator technologies to advance HPC to previously unattainable performance levels.

I buy the argument to a degree. As the number of cores per chip grows, the easy pipelining and parallelization opportunities will diminish, so the argument is stronger the more cores there are per chip. At 8 cores or under per general-purpose chip it’s sort of a futuristic theoretical argument: more than a few programmers can figure out how to code up a 4-to-8-stage pipeline for their application without massive automated assistance. But the FPGA opportunity does exist.

The convergence of storage and Ethernet networking is driving the adoption of 40G and 100G Ethernet in data centers. Traditionally, data is brought into the processor memory space via a PCIe network interface card. However, there is a mismatch of bandwidth between PCIe (x8, Gen3) versus the Ethernet 40G and 100G protocols; with this bandwidth mismatch, PCIe (x8, Gen3) NICs cannot support Ethernet 40G and 100G protocols. This mismatch creates the opportunity for the QPI protocol to be used in networking systems. This adoption of QPI in networking and storage is in addition to HPC.

I buy the FPGA application in the NIC space. I want my NIC to go directly to L3 pinned pages, yessir I do, 100G please.

Xilinx FPGAs double their device density from one generation to the next. Peak performance of FPGAs and processors can be estimated to show the impact of doubling the performance on FPGAs [Ref 6], [Ref 7]. This doubling of capacity directly results in increased FPGA compute capabilities.

The idea proposed here is that you want to be on the exponentially increasing density curve for the FPGAs in lieu of clock speed increases you are never going to see again. Sort of a complicated bet to make for mortals, maybe.

I like how they do the comparisons though. They say here is our Virtex-n basketball player and here is the best NBA basketball player … and they show you crusty old Mike Bibby 2012. Then they say watch as the Virtex-n basketball player takes Mike Bibby down low in the post, and notice the Virtex-n basketball player is still growing exponentially. So you can imagine how much better he will do against Mike Bibby in the post next year. Finally they say that Mike Bibby was chosen as the best NBA player for this comparison by his father Henry, who was also a great NBA player.

FPGAs tend to consume power in tens of watts, compared to other multicores and GPUs that tend to consume power in hundreds of watts. One primary reason for lower power consumption in FPGAs is that the applications typically operate between 100–300 MHz on FPGAs compared to applications on high-performance processors executing between 2–3 GHz.

Silly making-lemonade-out-of-lemons argument: the minute I can have my FPGAs clocked at 3 GHz, I throw away the 300 MHz FPGAs, no?

Intel, An Introduction to the Intel QuickPath Interconnect, QPI, Jan 2009, here.

Xilinx Research Labs/NCSA, FPGA HPC – The road beyond processors, Jul 2007, here. Need more current references, but I keep hearing the same themes in arguments for FPGA HPC, so let’s think about this for a bit:

FPGAs have an opening because you are not getting any more clocks from microprocessor fab shrinks: OK.

Power density: meh. Lots of FinQuant code can run on a handful of cores. The Low Latency HFT folks cannot really afford many L2 misses. The NSA boys are talking about supercomputers for crypto not binary protocol parsing.

Microprocessors have all their functions hardened in silicon, you pay for them whether you use them or not, and you can’t use that silicon for something else: meh, I don’t really care if I use all the silicon on my 300 USD microprocessor as long as the code is running close to optimal on the parts of the silicon useful to my application. It would be nice if I got more runtime performance for my 300 USD, no doubt. This point is like saying Advil is bad because you don’t always need to finish the bottle after you blow out your ankle. Yeah, I understand the silicon real estate is the most expensive in the world.

Benchmarks: Black Scholes 18 msec on an FPGA @ 110 MHz Virtex-4, 203x faster than an Opteron – 2.2 GHz: You Cannot be Serious! That implies roughly 3.7 microseconds per Black-Scholes evaluation on the Opteron (18 msec × 203 over 1M evaluations), which was competitive performance at the turn of the century. The relative speedup slides and quotations make me nervous. Oh, Celoxica provided the data – hey, Black Scholes in 36 Nanoseconds on a single core of a dual-core off-the-shelf general-purpose microprocessor from 2007. So the Virtex-4 does 1M Black-Scholes evaluations in 18 milliseconds, flat to competitive code on a dual-core general-purpose off-the-shelf microprocessor in 2007.

Make it easy for the users to use this hardware and get “enough of a performance” increase to be useful: meh, it’s for applications that do not need to go fast, for now (2007)?

Do not try to be the fastest thing around when being as fast with less power is sufficient: meh, really do not care so much about the power thing.

FPGA: different operations map to different silicon, which allows massive pipelining and lots of parallelism: OK. So why bother with the previous two points?

Eggers/U. Washington, CHiMPS, here. Eggers is reasonable.

There have been (at least) two hindrances to the widespread adoption of FPGAs by scientific application developers: having to code in a hardware description language, such as Verilog (with its accompanying hardware-based programming model) and poor FPGA memory performance for random memory accesses. CHiMPS, our C-to-FPGA synthesis compiler, solves both problems with one memory architecture, the many-cache memory model.

Many-cache organizes the small, distributed memories on an FPGA into application-specific caches, each targeting a particular data structure or region of memory in an application and each customized for the particular memory operations that access it.

CHiMPS provides all the traditional benefits we expect from caching. To reduce cache latency, CHiMPS duplicates the caches, so that they’re physically located near the hardware logic blocks that access them. To increase memory bandwidth, CHiMPS banks the caches to match the memory parallelism in the code. To increase task-level parallelism, CHiMPS duplicates caches (and their computation blocks) through loop unrolling and tiling. Despite the lack of FPGA support for cache coherency, CHiMPS facilitates data sharing among FPGA caches and between the FPGA and its CPU through a simple flushing of cached values. And in addition, to harness the potential of the massively parallel computation offered by FPGAs, CHiMPS compiles to a spatial dataflow execution model, and then provides a mechanism to order dependent memory operations to retain C memory ordering semantics.

CHiMPS’s compiler analyses automatically generate the caches from C source. The solution allows scientific programmers to retain their familiar programming environment and memory model, and at the same time provides performance that is on average 7.8x greater and power that is one fourth that of a CPU executing the same source code. The CHiMPS work has been published in the International Symposium on Computer Architecture (ISCA, 2009), the International Conference on Field Programmable Logic and Applications (FPL, 2008), and High-Performance Reconfigurable Computing Technology and Applications (HPRCTA, 2008), where it received the Best Paper Award.

Why Read the Heroes?

A small informal effort like Pink Iguana needs to lean heavily on curation for a specific audience. How narrow is the audience? Take the entire massive EcoFin community of DeLong, Wilmott, and Mankiw, then subtract most of the folks who: don’t care if a 22nm semiconductor fab is competitive in 2012, haven’t compiled their code -O3 recently, or are sort of meh to the idea that there is an RDMA transport to L3. Those folks remaining might be the Pink Iguana audience if they also like: buying credit protection from AIG stories, P=NP speculation, and IEEE 754. It’s the far side of the long tail.

So why curate for such a specific audience? Despite The End of Blogging, in 2012 there are remarkable and reasonably frequent publication streams from Gowers, Lipton, and Tao. The thing that is different in the last five years is the public availability of unfiltered, authoritative, and lucid commentary on specific topics. The keys are unfiltered, authoritative, and lucid. DeLong, Mankiw, Cowen, and Krugman run similarly authoritative and lucid publication streams that are more informed by their partisan backgrounds than those of Gowers, Lipton, and Tao. Intel, NVIDIA, and IBM have authoritative and lucid information as well, but they also have a day job to do. If folks like Gowers, Lipton, and Tao are regularly publishing, there might be more, right? You just have to go look around, and maybe you figure out how something (e.g., ETP Arbitrage, Credit Derivatives, HFT, a specific floating point computation) actually works. So, Wisty curates on Pink Iguana.

Why are these folks in the Pink Iguana Hall of Heroes (listed below the Blogroll) and why should you read the Heroes?

A Credit Trader hasn’t published since 2009; he went to do other stuff, but wow, what got published there was magnificent. Read Gretchen Morgenson at NYT, for example this, then read The AIG Fiasco or Bond-CDS Negative Basis or How to Lose a Billion Dollars on a Trade; it is like a teenage lucidity head rush.

Avellaneda – 2010 Quant of the Year, posts regularly from his NYU faculty page and covers research and market commentary, Stochastic Calculus, PDEs for Finance, and Risk and Portfolio Management.

Bookstaber – Author of the book A Demon of Our Own Design, ran Firm Risk at Salomon back in the day, and now is Senior Policy Advisor at the SEC. See Physics Envy in Finance or Human Complexity: The Strategic game of ? and ?

DeLong – Even with the constant bitching about the press and Team Republican plus the liveblogging of World War 2, I have never seen a better EcoFin website, see DeLong and Summers: Fiscal Policy in a Depressed Economy or Econ 191: Spring 2012. DeLong’s blog really is the model for curation and commentary to a large audience.

Gowers – Rouse Ball chair, Cambridge U, Fields Medal 1998, see ICM 2010 or Finding Cantor’s proof that there are transcendental numbers, and he was piqued to comment Re: Stieg Larsson, or perhaps the translator Reg Keeland, in Wiles meets his match. So, Salander’s picture-perfect memory, capacity to defeat armed motorcycle gangs in hand-to-hand combat, and assorted other superpowers pass without comment, but when she thinks she has a proof of Fermat, you gotta call a mathematician to check yourself before you wreck yourself. Gowers is on the Heroes list forever, check.

Kahan – doesn’t publish so much anymore, but he is the Edgar Allan Poe of floating-point-computations-gone-wrong horror stories, and they are all here. He did the IEEE 754 floating point standard and won a Turing Award. When and if he has something to say, I will probably want to listen, see How Java’s Floating-Point Hurts Everyone Everywhere and Desperately Needed Remedies for the Undebuggability of Large Floating-Point Computations in Science and Engineering.

Lipton has a gloriously unique perspective presented in Gödel’s Lost Letter. He provides the descriptive narrative for algorithm complexity in a public conversation typically dominated by proofs and expositions of computational models. If algorithm complexity were professional sports, it’s kind of like Lipton figured out there should be color commentators broadcasting live from the game. Top posts include: Interdisciplinary Research – Challenges, The Letterman Top Ten list of why P = NP is impossible, and The Singularity Is Here In Chess; it’s John Madden, Dick Vitale, and Andrés Cantor meet Kurt Gödel, John von Neumann, and Andrey Kolmogorov in the best possible way.

Tao saved the consistency of Peano Arithmetic as described here after winning a Fields Medal in 2006, see The Black Scholes Equation and Hilbert’s Fifth Problem and related topics.

Tufte is “the guy” for the visual display of quantitative information. He has been the guy at least since the early 1980s and does not really publish the same way as Gowers, Lipton, or Tao. Tufte kind of figured out his publication flow before the internet, so you buy his books and if you want to know what he is thinking about now, you go to his course. He has stuff on line, lots of it, for example see his notebooks, or about ET. The Tufte course attendance is sort of mandatory, not sure but I think that’s in Dodd-Frank Title VII, so just do it before they find out.

Kahan Talk

Desperately Needed Remedies for the Undebuggability of Large Floating-Point Computations in Science and Engineering 

(for ICME, Stanford University, 13 Oct. 2011), here. Fabulous read. I was trying to remember all the nightmare horror stories of folks who decided double precision was too much bother for their floating point computations. Kahan is the Edgar Allan Poe of botched-floating-point-computation storytelling. A bunch of those stories are here, and they are … horrifying. Get the children out of the room before opening the pdf file. I recall the first chapters in Higham’s book were recounted like this as well.

Check this out (from Kahan’s presentation) …

“Jean-Michel Muller’s (Counter)Example

His program implements a discrete dynamical system whose state at time N is the row [x[N], x[N+1]]. Starting with x[0] := 2 and x[1] := –4, the sequence x[2], x[3], x[4], …, x[N+1], … is generated by a recurrence x[N+1] := 111 - (1130 - 3000/x[N-1])/x[N] for N = 1, 2, 3, … in turn. An interesting challenge is the computation of, say, x[50] using the Floating-Point hardware’s arithmetic in any commercial computer or calculator, new or old.

They all get x[50] ≈ 100. The correct value is x[50] ≈ 6.0001465345614.”

Go ahead, put it in a spreadsheet; it takes no more than a few seconds to put the recurrence in Excel, we’ll wait.
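Or, if Excel is not handy, a minimal C sketch of the same recurrence (mine, not Kahan’s) in IEEE 754 double precision; rounding error knocks the iterates off the exact trajectory and they are then attracted to the fixed point at 100 instead of the true value near 6:

/* Jean-Michel Muller's recurrence in double precision.  In exact arithmetic
   x[50] ~ 6.0001465345614, but rounded binary arithmetic lands on ~100. */
#include <stdio.h>

int main(void)
{
    double xprev = 2.0;    /* x[0] */
    double xcur  = -4.0;   /* x[1] */
    for (int n = 1; n < 50; n++) {
        double xnext = 111.0 - (1130.0 - 3000.0 / xprev) / xcur;
        xprev = xcur;
        xcur  = xnext;
    }
    printf("x[50] = %.10f\n", xcur);   /* prints ~100, not ~6.0001465... */
    return 0;
}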

ARE YOU NOT ENTERTAINED!

Maxeler Clippings re JP Morgan Dec11

Wall Street Journal, here, Maxeler Makes Waves with Dataflow Design

HPCWire, here, J.P. Morgan Deploys Maxeler Dataflow Supercomputer for Fixed Income Trading

Peter Cherasia, Head of Markets Strategies at J.P. Morgan, commented: “With the new Maxeler technology, J.P. Morgan’s trading businesses can now compute orders of magnitude more quickly, making it possible to improve our understanding and control of the profile of our complex trading risk.”

insideHPC, here, JP Morgan Fires Up Maxeler FPGA Super

Compute Scotland, here, Near realtime: JP Morgan & Maxeler

Credit Derivatives

So, a major broker-dealer gets a 2011 industry award for running their overnight Risk and P&L credit derivative batch through a 1000-node CPU grid and some FPGAs in 238 seconds (in 2008 the same computation at the same broker-dealer, but presumably with different CDS inventory, took 8 hours, a great success). Then some blog posting claims that this same credit derivative batch could be run with some optimized C++ code on a $2,500 Mac Pro in under 2 seconds, in IEEE 754 double precision, tied out to the precision of the inputs. What’s going on? Does the credit derivative batch require a $1,000,000 CPU grid, FPGAs, and 238 seconds, or one $2,500 Mac Pro, some optimized code, and 2 seconds?

It’s the Mac Pro + 2 seconds, very likely; let’s think about how this could be wrong:

A. The disclosed data is materially wrong or we have misinterpreted it egregiously. The unusual event in all this is the public disclosure of the broker-dealer’s runtime experience and the code they were running. It is exceedingly rare to see such a public disclosure. That said, the 8-hour production credit derivative batch at a broker-dealer in 2008 is not the least bit surprising. The disclosure itself tells you the production code was once bloated enough to be optimized from 8 hours to 4 minutes. You think they nailed the code optimization on an absolute basis when the 4-minute result was enough to get a 2011 industry award, really? The part about the production infrastructure being a supercomputer with thousands of grid CPUs and FPGAs, while consistent with other production processes we have seen and heard of running on the Street, is the part you really have to hope is not true.

B. The several hundred thousand position credit derivative batch could be Synthetic CDO tranches and standard tranches and require slightly more complicated computation than the batch of default swaps assumed in the previous analysis.  But all the correlation traders (folks that trade CDOs and tranches) we know in 2011 were back to trading vanilla credit derivatives and bonds. The credit derivative batch with active risk flowing through in 2011 is the default swap batch (you can include index protection like CDX and ITRX in this batch as well). Who is going to spend three years improving the overnight process on a correlation credit derivative book that is managed part time by a junior trader with instructions to take on zero risk? No one.

C. The ISDA code just analyzed is not likely to be the same as the 2011 award-winning production credit derivative batch code. In fact, we know portions of the production credit derivative batch were translated into FPGA circuitry, so the code is really different, right? Well, over the last decade of CDS trading most of the broker-dealers evolved to the same quantitative methodology for valuing default swaps. Standards for converting upfront fees to spreads (ISDA) and unwind fees (Bloomberg’s CDSW screen) have influenced broker-dealer CDS valuation methodology. We do not personally know exactly what quantitative analytics each one of the broker-dealers runs in 2011, but Jarrow-Turnbull default arrival and Brent’s method for credit curve cooking cover a non-trivial subset of the broker-dealers. The ISDA code is not likely to be vastly different from the production code the other broker-dealers use in terms of quantitative method. Of course, in any shop there could be some undisclosed quantitative tweaks included in production, and the Mac Pro + 2 seconds case would be exposed in that event.

D. The computational performance analysis just presented could be flawed. We have only thought about this since seeing the Computerworld UK article and spent a couple of weekends working out the estimate details. We could have made a mistake or missed something. But even if we are off by a factor of 100 in our estimates (we are not), it’s still $2,500 Mac Pro + 200 seconds versus $1,000,000 + 1000-CPU grid + FPGAs + 238 seconds.

Street FP Due Diligence 3: Epic yet?

I am not gonna pay a lot for this muffler

What is left to get a runtime estimate for the credit derivative batch on an off-the-shelf computer? We need a cooked credit curve (bootstrapped hazard rate term structure) and to estimate the computational cost of the IOUs from the previous valuation step. Let’s account for the clock cycles needed to cook a USD credit curve out to 10 years with par CDS quotes at 1Y, 3Y, 5Y, 7Y, 10Y and a previously cooked USD Libor. Then we will address the additional computational cost of the valuation IOUs. The final post in this series will cover the additional computational cost of perturbational risk (multiply single-curve cooking estimates by 20-30x, corresponding to the number of distinct perturbed curves required for risk; ditto valuation estimates) as well as the cost of valuation for non-standard inventory CDS trades (trades where the paydates are not standard and therefore introduce an interpolation computation charge).

Credit Curve Cooking:

Assuming we are given a cooked USD Libor curve, we can see from the valuation code that the credit curve cooking only needs to take the par CDS quotes for the value date and return a set of assignments through JpmcdsForwardZeroPrice() for just the 20 paydates in the expected case (5Y CDS paying quarterly), or both the 20 paydates and the 130 numerical integration grid points (biweekly grid). So, if we derive estimates on the convergence speed of the one-dimensional root finder, Brent’s method, and then get estimates on the cost of the valuation IOUs, we will have a reasonably complete floating point performance picture. We will assume away the cost of USD Libor curve cooking and USD Libor interpolation since the computation cost is small and can be amortized across all the credits cooked against USD Libor, in any case. So, valuation IOUs 2, 6, and 9 (see below) are assumed to cost less than one clock cycle of floating point execution. They are just variable assignments with some cost in cache pressure.

Here is the ISDA credit curve cooking code. Notice they use Brent’s method and the convergence criterion looks like zero to ten decimal places. Given that you are cooking a credit curve that was cooked the previous business day as well, you can assume the root is bracketed to 10 bps, or three decimal places. Let’s not even bother to account for potential compute-time efficiency the bootstrapping valuations could provide. Assume every Brent function evaluation costs as much as a 5Y CDS valuation.

/* we work with a continuously compounded curve since that is faster -
   but we will convert to annual compounded since that is traditional */
cdsCurve = JpmcdsMakeTCurve (today,
                             endDates,
                             couponRates,
                             nbDate,
                             JPMCDS_CONTINUOUS_BASIS,
                             JPMCDS_ACT_365F);
if (cdsCurve == NULL)
    goto done;

context.discountCurve  = discountCurve;
context.cdsCurve       = cdsCurve;
context.recoveryRate   = recoveryRate;
context.stepinDate     = stepinDate;
context.cashSettleDate = cashSettleDate;

for (i = 0; i < nbDate; ++i)
{
    double guess;
    double spread;

    guess = couponRates[i] / (1.0 - recoveryRate);

    cl = JpmcdsCdsContingentLegMake (MAX(today, startDate),
                                     endDates[i],
                                     1.0,
                                     protectStart);
    if (cl == NULL)
        goto done;

    fl = JpmcdsCdsFeeLegMake (startDate,
                              endDates[i],
                              payAccOnDefault,
                              couponInterval,
                              stubType,
                              1.0,
                              couponRates[i],
                              paymentDCC,
                              badDayConv,
                              calendar,
                              protectStart);
    if (fl == NULL)
        goto done;

    context.i  = i;
    context.cl = cl;
    context.fl = fl;

    if (JpmcdsRootFindBrent ((TObjectFunc)cdsBootstrapPointFunction,
                             (void*) &context,
                             0.0,    /* boundLo */
                             1e10,   /* boundHi */
                             100,    /* numIterations */
                             guess,
                             0.0005, /* initialXstep */
                             0,      /* initialFDeriv */
                             1e-10,  /* xacc */
                             1e-10,  /* facc */
                             &spread) != SUCCESS)
    {
        JpmcdsErrMsg ("%s: Could not add CDS maturity %s spread %.2fbp\n",
                      routine,
                      JpmcdsFormatDate(endDates[i]),
                      1e4 * couponRates[i]);
        goto done;
    }

    cdsCurve->fArray[i].fRate = spread;
}

Assume Brent on average converges superlinearly like the secant method (the exponent is something like 1.6, see Convergence of Secant Method). Assume also that one iteration of Brent’s method costs one function evaluation (with the IOU computation time added) and some nominal extra clocks, say +20%. Given the accuracy of the initial guess, the smooth objective function, and the expected superlinear convergence, five iterations (1.6**5 ≈ 10, roughly the ten decimal places of the convergence criterion) should be good enough on average. With par CDS quotes at 1Y, 3Y, 5Y, 7Y, and 10Y to fit, we expect to need 20 CDS valuations. Thinking ahead to perturbational risk (in the next post), we can obviously bundle many perturbations into a single massive vectorized Brent’s method call and drive the cooking cost massively lower. Let’s ignore this obvious optimization for now. What do we have so far?

CDS credit curve cooking estimates are then as follows:

Expected case = 1.2 * (20 * 150 + 20 * valuation IOU) clock cycles (quarterly paydates=gridpoints)

Worst case = 1.2 * (20 * 810 + 20 * valuation IOU) clock cycles (biweekly numerical integration gridpoints)

Expected case is running at 3600 cycles plus 24x the IOU cycle count. In other words, one microsecond plus some IOU accounting.

Valuation IOU Runtime Accounting:

With subtopic titles like “Valuation IOU Runtime Accounting” I am certain to cut my blog reading population in half – the price of a lonely optimization post. Recall our list of valuation IOUs. We assumed that some Rates process will exogenously produce a suitably cooked and interpolated USD Libor curve (discCurve) at the cost of some L2 cache pressure to us. Note we assume that on the code rewrite we vectorize and unroll loops anywhere possible. So in our accounting we take the cycle counts from Intel’s vectorized math library (MKL) for double precision. Vector divides cost 5.4 cycles per element. Vector exponentials cost 7.9 cycles per element. Vector log costs 9.9 cycles and vector reciprocals cost 3.36 cycles per element. The vector lengths for these cycle counts are probably close to 1024, so perhaps the cycle counts are slightly optimistic.

Valuation IOU Accounting list

  1. survival[paydates] = JpmcdsForwardZeroPrice(SpreadCurve, …)        4 * paydates clocks
  2. discount[paydates] = JpmcdsForwardZeroPrice(discCurve, …)          0, assumed away
  3. vector1[paydates] = notional * couponRate * accTime[paydates]      paydates clocks
  4. vector2[paydates] = survival[paydates] * discount[paydates]        paydates clocks
  5. s[gridpts] = JpmcdsForwardZeroPrice(SpreadCurve, …)                4 * paydates clocks
  6. df[gridpts] = JpmcdsForwardZeroPrice(discCurve, …)                 0, assumed away
  7. t[gridpts] grid time deltas                                        0, assumed away
  8. lambda[gridpts] = log(s[i-1]/s[i])/t[i] over the grid              15 * gridpts clocks
  9. fwdRate[gridpts] = log(df[i-1]/df[i])/t[i] over the grid           0, assumed away
  10. lambdafwdRates[gridpts] = lambda[gridpts] * fwdRate[gridpts]      gridpts clocks
  11. vector3[gridpts] = exp(-t[gridpts] * lambdafwdRates[gridpts])     10 * gridpts clocks
  12. vector4[gridpts] = s[gridpts] * df[gridpts]                       gridpts clocks
  13. vector5[gridpts] = reciprocal(lambdafwdRates[gridpts])            4 * gridpts clocks
  14. t0[gridpts], t1[gridpts] coefficients for accrued interpolation   0, assumed away
  15. thisPv[gridpts]                                                   4 * gridpts clocks

Looks like the IOU cost in aggregate is 10 * paydates clocks, or 200 clock cycles, plus 35 * gridpts clocks (ugh!). The expected case assumes gridpoints = quarterly paydates, so all in it is 45 * paydates clock cycles, or 900 clocks. Worst case is 200 clocks + 35 * 130 clocks, or 4750 clocks.
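For the skeptical reader, a tiny tally sketch (mine, not from the post) that just re-adds the clock charges from the IOU list above:

/* Re-adds the IOU clock charges from the list above.
   paydates items: 1 (4x), 3 (1x), 4 (1x), 5 (4x)                -> 10 * paydates
   gridpts items: 8 (15x), 10 (1x), 11 (10x), 12 (1x), 13 (4x), 15 (4x) -> 35 * gridpts */
#include <stdio.h>

static long iou_clocks(long paydates, long gridpts)
{
    return 10 * paydates + 35 * gridpts;
}

int main(void)
{
    printf("expected case (gridpts = paydates = 20): %ld clocks\n",
           iou_clocks(20, 20));    /* 200 + 700  = 900 clocks  */
    printf("worst case    (gridpts = 130)          : %ld clocks\n",
           iou_clocks(20, 130));   /* 200 + 4550 = 4750 clocks */
    return 0;
}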

All in then, CDS credit curve cooking estimates are as follows:

Expected case = 1.2 * (20 * 150 + 20 * 900) clock cycles (quarterly paydates=gridpoints)

Worst case = 1.2 * (20 * 810 + 20 * 4750) clock cycles (biweekly numerical integration gridpoints)

Expected case looks like 25,200 clocks, or 7 microseconds. Worst case looks like 133,440 cycles, or 37 microseconds. OK, we tentatively thought 20 microseconds worst case, but recall we left a chunk of optimizations unaccounted for in the interests of brevity. Still, the worst case is about two times more expensive than we naively expected.

From here it is easy: full risk will be 20x to 30x the single-curve cooking estimate even if we are totally apathetic and clueless. Expected case is 7 * 20 microseconds, call it 140 microseconds, and worst case is 37 * 30 microseconds, call it 1.11 milliseconds. Since we are going to argue that non-standard CDS interpolation can be stored with the trade description, the incremental cycle count for non-standard deals is de minimis since the computational cost is amortized over the life of the position (5Y on average). The expected case then, for our target 10K credit curves and 100K CDS, looks like 10K * 140 microseconds + 100K * 20 * 42 nanoseconds, or one and a half seconds on a single core of a single microprocessor. If I have to do the Computerworld UK credit batch on the cheapest Mac Pro I can order from the Apple store today, that’s the quad-core 2.8 GHz Nehalem for 2,499 USD (not gonna take the add-1,200-USD option to get the 3.33 GHz Westmere). It’ll cost me 5.8 seconds on a single core, but fortunately I purchased four cores (of raw power, according to the Apple website), so the Computerworld UK credit derivative batch still runs in a second and change for 2,499 USD + shipping.
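For the record, a small back-of-the-envelope calculator (a sketch of mine, not from the post) that strings the estimates above together; the 3.6 GHz core clock is an assumption backed out of the 25,200-clock ≈ 7-microsecond figure, and the 20 × 42 ns per-CDS charge for risk valuation is taken as quoted:

/* Reproduces the post's arithmetic: single-curve cooking -> full risk -> batch. */
#include <stdio.h>

int main(void)
{
    const double clock_hz    = 3.6e9; /* assumed clock implied by 25,200 clocks ~ 7 us */
    const double brent_evals = 20.0;  /* CDS valuations per curve cook */
    const double overhead    = 1.2;   /* +20% Brent bookkeeping */

    /* per-valuation clocks: 150 (expected) / 810 (worst), plus IOU clocks 900 / 4750 */
    double cook_expected = overhead * (brent_evals * 150.0 + brent_evals * 900.0);
    double cook_worst    = overhead * (brent_evals * 810.0 + brent_evals * 4750.0);
    printf("cook expected: %.0f clocks = %.1f us\n", cook_expected, 1e6 * cook_expected / clock_hz);
    printf("cook worst   : %.0f clocks = %.1f us\n", cook_worst,    1e6 * cook_worst    / clock_hz);

    /* full perturbational risk: 20x (expected) to 30x (worst) the single-curve cook */
    double risk_expected = 20.0 * cook_expected / clock_hz;   /* seconds per curve */
    double risk_worst    = 30.0 * cook_worst    / clock_hz;
    printf("risk expected: %.0f us per curve\n", 1e6 * risk_expected);
    printf("risk worst   : %.2f ms per curve\n", 1e3 * risk_worst);

    /* batch: 10K credit curves plus 100K CDS at 20 x 42 ns each for valuation under risk */
    double batch = 10e3 * risk_expected + 100e3 * 20.0 * 42e-9;
    printf("expected batch: %.2f s on one core\n", batch);    /* ~1.5 s */
    return 0;
}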