Assume you are approached for an opinion on the computational performance of piece of software that computes the mark to market, risk, and P&L explanatories for a randomly selected 5 year vanilla single name credit default swap. Lets say the software completes computation in 1 second on a contemporary microprocessor in 2009. Assuming the computation output accurate, correct, and complete what benchmarks can you use to get a ballpark sense of how good the performance of 1 second average elapsed time really is.

Lets do a back of the envelope estimate of all the floating point multiplies and adds required for this credit default swap processing on a single microprocessor core. Patterson gives you that a contemporary microprocessor takes 4 clocks for a floating point multiply and 200 clocks to go to DRAM (an L1 and L2 cache miss). Lets assume we are working with a 65 nm Intel Pentium running at 3.6 GHz and that a floating point add on this core takes 2 clocks. So if my actual CDS computation is about half floating point adds and half fp multiplies then in one second I get over a billion 1,000,000,000 operations theoretically to compute required outputs. If I need to use fp divides, they are expensive at 20 clocks per divide and exponentials (vectorized) cost any where from 10 to 30 clocks depending on the accuracy requirements. Just for round numbers lets go with the billion fp ops per second as an average.

At inception a 5 year OTC vanilla default swap has 20 premium payment dates that are actually fixed to be IMM dates . The net present value of these scheduled premium payments is determined by discounting each by the risky discount factor determined by the cooked credit curve. The net present value of the other swap leg, the protection leg, is determined by integrating the product of default arrival probability and the default payout amount over the term of the premium leg. Since the hazard rate is typically piecewise constant (and the recovery rate constant over the future of the swap) the integral can be closely approximated by a weighted summation of expected protection values at each of the premium cashflow dates, the 20 IMM dates for a 5 year swap. Assuming all fp is double precision then each of the 20 cash flow dates requires 8 bytes so the entire NPV sum requires 160 bytes and certainly fits in the L1 cache (probably in the register file as well). So, lets estimate that all the mtm, deltas, gammas, explanatory evaluations for the single vanilla credit default swap requires 100 mtm evaluations. I don’t think its likely that there are even 30 separate mtm evaluations required, so 100 seems conservative. Thus 100 times we need to sum 20 discounted cashflows and 20 expected payoffs on default. Lets assume the code has not been optimized so you need 300 clocks for each valuation, times 100 evaluations, you need 30,000 clocks to do the valuations if the computation is hitting the L1 cache. Lets assume we don’t actually hit the L1 cache all the time and performance on the evals requires an extra 90,000 clocks for memory waits, for a total of 120,000 clocks. That accounts for 120 microseconds (correction: its 40 microseconds but certainly less than 120 microseconds so the remaining estimates should hold) of compute time so the balance of the time must be spent cooking curves.

Lets assume this code has to cook the underlying credit curve once for each of the 100 evaluations above. Again it seems unlikely that the par curve and various perturbed curves for computing the first and second divided differences and the scenario curves will amount to even 30 cooked curves. We cost it at 100 cooked curves to be conservative. Lets assume the bootstrap process has to fit five annual par spreads (although typically only 3 of the five are marked). Furthermore lets assume that the fp operation count of fitting the 5Y spread is the same as the 4Y, 3Y, 2Y, and 1Y spreads (although they obviously require fewer fp ops). Fixing the 5Y point involves some root finding algorithm for a fairly smooth and well behaved function. Presumably you can use Brent or even simple bisection at the cost of say 5 full mtm evaluations at 300 clocks a piece. These 1500 clocks need to be executed at each of the terms 1Y ( I know not really) , 2Y, 3Y, 4Y, and 5Y for a cost of 7500 clocks. For 100 cooked curves that is 750,000 clocks. Lets assume the L1 cache miss induces a factor of 10 memory wait state penalty so the cost is 7,500,000 clocks.

Now we are talking about using some clock, so the curve cooking + the evaluation comes to 7,500,000 + 120,000 or 7,620,000 clocks . Out of the 3.6 billion clocks available per second we have managed to use less than 0.3% of them to retire the required operations with tons of wasted clocks to buffer our estimations. Through this back of the envelope estimation we can find about 3 milliseconds of work to complete the whole computation end to end on a contemporary microprocessor. So that one second elapsed time performance we started off evaluating isn’t looking so much like A-list performance. Its missing normal performance by roughly a factor of 300. It appears to perform like competitive code running on a microprocessor from the year Titanic won the Oscar for best movie (1997)

BTW to eval 100 OTC trades – full mtm, risk, and P&L explanatories on the same credit curve looks like about 20 milliseconds using these estimates. The ball park time for the time required to run mtm, risk and explanatories on all the non terminated credit default swaps existing worldwide today: Assume 50,000 different curves and 10,000,000 non terminated CDS then its 50,000* 7.5 ms + 10,000,000* 0.12 ms about 26.25 minutes. Lets call it an hour because there are some OTC CDS with terms greater than 5Y and we need to leave some more buffer for general programmer apathy/cluelessness. But it would take a code rewrite; you’d have to get one of those 65 nm Intel microprocessors; all the trades from DTCC; and all the marks from MarkIT. On the other hand you could run it on your home computer after work, before dinner, and if you cannot afford a new computer but you do all the other stuff on your old PC-clunker; I’m thinking 2.5 hours and you’re done with the world’s overnight credit batch in time to catch the end of South Park w. the kids, I’m totally serial.

Brilliant.

[…] in computational runtime optimization in quantitative finance in 2012 as well. Previously, in Totally Serial, we have shown that the 2009 global credit derivative inventory of 10MM default swap positions […]

[…] recall from previous 2009 estimates ( see Totally Serial) for unoptimized scalar code cooking of par and perturbed curves certainly costs less 7 […]