Gabriele Paoloni, Intel, How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures, Sep 2010.
The purpose of this document is to provide software developers with precise methods to measure the clock cycles required to execute specific C code in a Linux environment running on a generic Intel architecture processor. These methods can be very useful in a CPU-benchmarking context, in a code-optimization context, and also in an OS-tuning context. In all these cases, the developer is interested in knowing exactly how many clock cycles elapse while executing code.
At the time of this writing, the best description of how to benchmark code execution can be found in . Unfortunately, many problems were encountered while using this method. This paper describes the problems and proposes two separate solutions.
Intel, Intel 64 and IA-32 Architectures Software Developer’s Manual, v2B Instruction Set Manual.
Intel, Using the RDTSC Instruction for Performance Monitoring, 1997.
Programmers are often puzzled by the erratic performance numbers they see when using the RDTSC (read time-stamp counter) instruction to monitor performance. This erratic behavior is not an instruction flaw; the RDTSC instruction will always give back a proper cycle count. The variations appear because there are many things that happen inside the system, invisible to the application programmer, that can affect the cycle count. This document will go over all the system events that can affect the cycle count returned by the RDTSC instruction, and will show how to minimize the effects of these events.