David Bindel, 2001, CS 279 Annotated Course Bibliography, here. Rereading this – very nice summary.
This annotated bibliography is a complement to a set of course notes for CS 279, System Support for Scientific Computation, taught in Spring 2001 by W. Kahan at UC Berkeley. It is meant to be representative, not comprehensive, and includes pointers to survey works and online bibliographies which cover the literature in various areas in more depth. In the hope of making the bibliography simpler to use, the notes are partitioned by topic. The categorization is far from precise, but where a reference provides pertinent coverage of multiple topics, I have tried to provide appropriate cross-references in the section text.
This bibliography is based largely on the bibliography assembled by Judy Hu for the Spring 1992 version of this course. Other major sources for annotated entries include the bibliography of the Apple Numerics Manual and bibliographies from the papers of W. Kahan. Annotations attributable to those sources are clearly labeled.
Muller et al., 2009, Handbook of Floating-Point Arithmetic, here. Just started reading through it this morning. Seems to have some prospect of covering roundoff analysis on contemporary microprocessors.
3.5.2 Fused multiply-add
The IBM Power/PowerPC, HP/Intel IA-64, and HAL/Fujitsu SPARC64 VI instruction sets define a fused multiply-add (FMA) instruction, which performs the operation a × b + c with only one rounding error with respect to the exact result (see Section 2.8, page 51). This is actually a family of instructions that includes useful variations such as fused multiply-subtract.
These operations are compatible with the FMA defined by IEEE 754-2008. As far as this operator is concerned, IEEE 754-2008 standardized already existing practice.
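To make the "only one rounding error" point concrete, here is a minimal sketch of my own (not from the Handbook). It emulates a fused multiply-add in software by forming a × b + c exactly with rational arithmetic and rounding once, then compares that with the ordinary two-rounding evaluation. The names `fma_emulated` and `mul_then_add` are hypothetical, and the sketch assumes IEEE 754 double precision with round-to-nearest and no overflow.

```python
from fractions import Fraction


def fma_emulated(a: float, b: float, c: float) -> float:
    """Emulate an FMA: form a*b + c exactly, then round once to a double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)  # no rounding here
    return float(exact)  # the single rounding (to nearest)


def mul_then_add(a: float, b: float, c: float) -> float:
    """Ordinary evaluation: the product is rounded, then the sum is rounded."""
    return a * b + c


# Inputs chosen so that the exact product, 1 - 2**-54, is not a double.
a = 1.0 + 2.0**-27
b = 1.0 - 2.0**-27
c = -1.0

fused = fma_emulated(a, b, c)    # -2**-54 (about -5.55e-17)
unfused = mul_then_add(a, b, c)  # 0.0: the product was rounded to 1.0 first
```

In the unfused version the tiny term in the product is lost before the addition cancels the leading 1, so the result has no correct digits; the single rounding of the fused version returns the exact answer here.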
The processors implementing these instruction sets (the IBM POWER family, PowerPC processors from various vendors, the HP/Intel Itanium family for IA-64) provide hardware FMA operators, with latencies comparable to classical + and × operators. For illustration, the FMA latency is 4 cycles in Itanium2, 7 cycles on Power6, and both processors are capable of launching 2 FMA operations at each cycle.
There should soon be FMA hardware in the processors implementing the IA-32 instruction set: they are defined in the SSE5 extensions announced by AMD and in the AVX extensions announced by Intel.
Kahan, 1996, The Improbability of Probabilistic Error Analysis for Numerical Computations, here. Treasure
Roundoff in Floating-Point Arithmetic:
Suppose the program asks the computer to calculate W := X·Y + Z ;
what the computer actually calculates is w = ( (x·y)·(1 + ß) + z )·(1 + μ)
in which ß and μ stand for rounding errors, tiny values for which we know a priori bounds like, say,
| ß | < 2^-53 , | μ | < 2^-53 ; 2^-53 ≈ 10^-16 . ( These bounds suit Double Precision ( REAL*8 ) on most computers nowadays.)
The simplest model of roundoff assumes that nothing more can be known about ß and μ .
The simplest probabilistic model of roundoff assumes that ß and μ are independent random variates distributed Uniformly between their bounds ±2^-53 . Both models merely approximate the truth.
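As a quick check of the model w = ( (x·y)·(1 + ß) + z )·(1 + μ), the sketch below (my own, not from Kahan's paper) recovers ß and μ exactly for actual double-precision computations and compares them with the bound 2^-53. The helper `rounding_errors`, the sampling range [1, 2], and the seed are arbitrary choices of mine. It assumes IEEE 754 doubles with round-to-nearest, and the positive inputs keep every intermediate result away from zero and from underflow.

```python
from fractions import Fraction
import random

U = Fraction(1, 2**53)  # the a priori bound 2^-53 for double precision


def rounding_errors(x: float, y: float, z: float):
    """Return (w, beta, mu) such that w = ((x*y)*(1 + beta) + z)*(1 + mu) exactly."""
    p = x * y  # rounded product
    beta = Fraction(p) / (Fraction(x) * Fraction(y)) - 1
    w = p + z  # rounded sum
    mu = Fraction(w) / (Fraction(p) + Fraction(z)) - 1
    return w, beta, mu


random.seed(0)  # arbitrary seed, only for repeatability
samples = []
for _ in range(1000):
    x, y, z = (random.uniform(1.0, 2.0) for _ in range(3))
    samples.append((x, y, z) + rounding_errors(x, y, z))

# Largest observed magnitudes of the two rounding errors.
worst_beta = max(abs(s[4]) for s in samples)
worst_mu = max(abs(s[5]) for s in samples)
```

Both `worst_beta` and `worst_mu` should come out below `U`. Note that for any given x, y, z the values of ß and μ are fully determined by the data; nothing in the computation is random.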