Memcpy Speed

From John D. McCalpin, “Single-Threaded memory performance for dual socket Xeon E5-* systems”:

There are several factors at play here:

1. Idiom Substitution:
You have to be very careful to avoid having the cachebench memcpy kernel replaced with a library routine. The library routine is usually fast, but it is often not the fastest option, and because its assembly code is not easy to locate, its performance is much harder to interpret. With the STREAM benchmark I usually add the “-ffreestanding” compiler option to tell the compiler not to make memcpy substitutions, and then I double-check the assembly code to be sure. (Note that STREAM Copy counts both read and write traffic, so the values are twice as big as the memcpy results above; see the discussion at http://www.cs.virginia.edu/stream/ref.html#counting )
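As a minimal sketch of the kind of loop at issue, the kernel below is a plain element-by-element copy that an optimizing compiler may recognize as a memcpy idiom and replace with a library call. The function name `copy_kernel` and the use of `double` elements are illustrative assumptions, not taken from cachebench or STREAM.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical copy kernel in the style of a benchmark copy loop.
 * The `restrict` qualifiers promise that the arrays do not overlap,
 * which is what makes the loop equivalent to memcpy and therefore a
 * candidate for idiom substitution by the compiler. */
void copy_kernel(double *restrict dst, const double *restrict src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

One way to check what was actually generated is to compile to assembly, for example with something like `cc -O2 -ffreestanding -S copy.c` (the file name is hypothetical), and look for a call to memcpy in the output. Whether a substitution happens depends on the compiler, its version, and the options used, so the inspection step is the part that matters.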

2. Memory Latency:
Single-threaded memory bandwidth is concurrency-limited on these systems. Each of these Intel cores can handle 10 outstanding L1 Data Cache misses, so (in the absence of L2 hardware prefetching) your read bandwidth is going to be limited to 10 cache lines per memory latency. I don’t have exactly the same set of processors, but I do have some very similar processors that show the following memory latencies:
Dual-Socket Xeon E5-2680: 79 ns (running at max Turbo speed of 3.1 GHz) “Sandy Bridge EP”
Dual-Socket Xeon X5680: 69 ns (running at nominal 3.33 GHz) “Westmere EP”
Xeon E3-1270: 53.6 ns (running at nominal 3.4 GHz) “Sandy Bridge” (with “client” uncore)
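The "10 cache lines per memory latency" limit can be turned into a number with a little arithmetic. The sketch below assumes a 64-byte cache line, which is not stated in the text above, and the function name is hypothetical. It gives only the bound that applies without L2 hardware prefetching.

```c
#include <assert.h>

/* Upper bound on read bandwidth when at most `outstanding_misses`
 * cache lines can be in flight at once: bytes in flight divided by
 * the memory latency. Bytes per nanosecond is numerically equal to
 * GB/s (10^9 bytes per second).
 * Assumption: the caller supplies the line size (64 bytes here). */
double concurrency_limited_bw_gbs(int outstanding_misses,
                                  int line_bytes,
                                  double latency_ns)
{
    assert(outstanding_misses > 0 && line_bytes > 0 && latency_ns > 0.0);
    return (double)outstanding_misses * (double)line_bytes / latency_ns;
}
```

With 10 misses and 64-byte lines, the latencies listed above (79 ns, 69 ns, 53.6 ns) give bounds of roughly 8.1, 9.3, and 11.9 GB/s respectively. Measured bandwidth can exceed these figures when hardware prefetching adds concurrency beyond the 10 L1 misses.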

3. Streaming Stores:
Depending on how the code is compiled, it may or may not contain non-temporal (“streaming”) stores. Streaming stores reduce the overall memory traffic by eliminating the read of the target cache lines before they are overwritten. This provides a large performance boost in the multicore case, but for a single thread streaming stores often reduce performance because they cannot be prefetched. (Prefetching store targets reduces the length of time that the store transactions hold on to the L1 Data Cache Line Fill Buffers, and so improves overall throughput.) The performance of streaming stores for a single thread differs significantly across Intel processors, but the details are not easy to investigate.
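To make the term concrete, here is a hedged sketch of what a copy loop with explicit streaming stores can look like when written by hand with SSE2 intrinsics. It is an illustration only, not the code the Intel compiler generates. The function name `copy_streaming` is hypothetical, and on targets without SSE2 the sketch falls back to an ordinary byte loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Copy n bytes from src to dst (non-overlapping), using non-temporal
 * 16-byte stores for the bulk of the data where SSE2 is available. */
void copy_streaming(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    assert(n == 0 || (dst != NULL && src != NULL));
#if defined(__SSE2__)
    /* _mm_stream_si128 requires a 16-byte-aligned destination, so
     * copy single bytes until the destination pointer is aligned. */
    while (n > 0 && ((uintptr_t)d & 15u) != 0) {
        *d++ = *s++;
        n--;
    }
    /* Bulk copy: unaligned load, then a store that bypasses the
     * cache instead of first reading the target line into it. */
    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_stream_si128((__m128i *)d, v);
        d += 16;
        s += 16;
        n -= 16;
    }
    /* Order the streaming stores before any later stores. */
    _mm_sfence();
#endif
    /* Remaining tail bytes (or everything, without SSE2). */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
}
```

In practice the compiler usually decides whether to emit streaming stores based on its options and heuristics, so a hand-written version like this is mainly useful for controlled comparisons against the ordinary-store case.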

Here are some numbers from the STREAM benchmark to show how these factors interact. All are single-threaded STREAM Copy values run with processor pinning and (where applicable) enforced NUMA memory affinity. They were compiled with various versions of the Intel C compiler (versions 11 through 13, though there is very little difference in performance for this test):
| System | With streaming stores | Without streaming stores | With memcpy substitution |
|---|---|---|---|
| Dual-Socket Xeon E5-2680 | 7528 MB/s | 12545 MB/s | 8640 MB/s |
| Dual-Socket Xeon X5680 | 8140 MB/s | 10215 MB/s | ??? |
| Single-Socket Xeon E3-1270 | 17950 MB/s | 11970 MB/s | ??? |

Recall that these numbers should be about twice the values that cachebench uses, so the 12545 MB/s (STREAM Copy) on the Xeon E5-2680 is only about 6% higher than the 5896 MB/s (Cachebench memcpy) on the Xeon E5-2670.
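The comparison above can be checked with a short calculation. The helper names below are hypothetical. The only inputs are the two figures quoted in the text and the factor of two that comes from STREAM Copy counting both the bytes read and the bytes written.

```c
#include <assert.h>

/* STREAM Copy counts bytes read plus bytes written, while a
 * memcpy-style figure counts only the bytes copied, so halve the
 * STREAM value before comparing the two. */
double stream_copy_to_memcpy_mbs(double stream_copy_mbs)
{
    assert(stream_copy_mbs >= 0.0);
    return stream_copy_mbs / 2.0;
}

/* Percentage by which `a` exceeds `b`. */
double percent_higher(double a, double b)
{
    assert(b > 0.0);
    return (a / b - 1.0) * 100.0;
}
```

Halving 12545 MB/s gives 6272.5 MB/s, and `percent_higher(6272.5, 5896.0)` comes out at a little over 6%, which matches the figure quoted above.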
