Albanese et al., Jan 2011, Coherent Global Market Simulations For Counterparty Credit Risk, here. CVA floating point hacks.
To explain how greater performance can be achieved by doing more calculations, let us recall a few traits of the silicon economics that shape microchip and board designs. This landscape has recently undergone a radical shift.
It used to be that:
(i) Computing capabilities were limited by the ability of ALUs to execute floating point and integer arithmetic
(ii) Memory was expensive and a scarce resource
(iii) Most algorithms were single-threaded and parallelism was best brokered transparently by middleware layers dispatching jobs to large grid farms
(iv) Code was best written in native C++ optimized in such a way to speed up the execution of a great variety of bespoke algorithms.
Although these practices are still widespread in the transition period we are living through, the underlying technology has now shifted quite radically.
(a) Nowadays, it is relatively cheap to populate microchips with highly capable ALU cores. The 8-socket CPU boards of the emerging generation carry as many as 80 hyperthreading-capable cores in the case of Intel, or 96 cores in the case of AMD. Even more extreme ALU counts are seen in the GPU space, where the AMD Firepro GPUs have 1600 cores and nVidia Fermis have 512.
(b) Memory is relatively cheap and readily available up to terabyte scale, thus enabling single node technology for portfolio processing as a viable alternative to grid computing.
(c) The clock frequency and bandwidths of data paths are not keeping pace with the compute power of ALUs and the massive memory available, rendering the memory bottleneck tighter than ever within the bounds of cost-effective designs.
(d) Vastly different microchip architectures have emerged: SIMD multiprocessors with up to 16-32 data registers located in discrete GPU parts, as in the nVidia Fermi and ATI Firestream; the multicore MIMD designs on CPU boards by Intel and AMD; and the emerging MIMD-SIMD hybrid fusion architectures, the Intel Sandy Bridge and AMD Bulldozer.
(e) MIMD and SIMD designs are characterized by radically different threading models: SSE2/SSE3/AVX vector primitives dominate on CPUs, while GPUs use the lightweight, no-frills threading models of CUDA/OpenCL.
(f) Cache hierarchies for MIMD architectures are complex and involve up to 2 MB per core. GPUs instead are nearly cacheless except for a modest amount of shared memory located on individual SIMD microprocessors.
(g) On the programming language side, we see the merit of a bifurcation away from catch-all C++ coding. On the one hand, the variety of architectures motivates a revival of interest in low-level optimization of basic building-block algorithms. On the other hand, the complexity of multi-threaded orchestration in shared memory designs using large scale in-memory processing motivates the use of higher level languages. Features such as garbage collection, managed thread pools and support for service oriented architectures are in fact essential for complexity management.