Pink Iguana


Category Archives: Architecture


Timothy Prickett Morgan, HPCwire, Intel ‘Haswell’ Xeon E5s Aimed Squarely at HPC, here.

The Xeon E5-2600 v2 processors topped out at 12 cores per die, but the new Haswell chips, which are implemented in the same 22 nanometer processes as their predecessors, can have as many as 18 cores per die. So that is up to a 50 percent increase in the number of cores, which drives per-system performance and which allows HPC customers to cram a lot more performance into the same space. The two top-bin parts with high core counts, the E5-2699 v3 with 18 cores and the E5-2698 v3 with 16 cores, run at 2.3 GHz and have 45 MB and 40 MB of L3 cache across those cores, respectively. They also run a little hot – in part due to the integration of the voltage regulator on the die on all Haswell Xeon E5s. And what also makes these two top-end Haswell Xeon E5 processors unique in Intel chip history is that they do not come with an official list price. These are not customized chips and will be made available to all customers, but they are, according to an Intel spokesperson, “unique offerings that fall outside of our traditional, publically available 2S product offering” and were created for HPC, virtualization, and cloud customers looking for maximum performance. The 18-core chip is obviously the most interesting one for a lot of HPC workloads that can use threads and are not as sensitive to clock speeds, and it is also the one that Intel is using to show off relative performance compared to prior Xeon E5s.


The L1 and L2 cache memory bandwidth on the Haswell cores has been doubled up, and one of the reasons why is that the second generation of Advanced Vector Extensions (AVX) integer and floating point math units have a lot more performance than the AVX1 units etched into the prior Sandy Bridge and Ivy Bridge cores. The AVX1 unit had eight 256-bit registers for floating point (four for AVX add and four for AVX multiply) and could double peak floating point operations per second (flops) in these two chips compared to their many Xeon predecessors that have had 128-bit SSE math units to 8 flops per clock. The Haswell core has 256-bit registers with its AVX2 unit, based on two Fused Multiply Add units, which doubles the peak performance to 16 flops per clock at double precision and 32 flops per clock at single precision.
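The flops-per-clock figures in that passage fall out of simple lane counting. Here is a minimal sketch of the arithmetic in Python, assuming two floating point pipes per core in every generation (one add plus one multiply for SSE and AVX1, two FMA units for AVX2) and counting a fused multiply-add as two flops. The function name and parameters are made up for illustration.

```python
def peak_flops_per_clock(vector_bits, element_bits, units, flops_per_lane):
    """Peak flops per core per clock for a SIMD unit.

    vector_bits    -- register width (128 for SSE, 256 for AVX/AVX2)
    element_bits   -- 64 for double precision, 32 for single precision
    units          -- floating point pipes that can issue each clock (assumed 2)
    flops_per_lane -- 1 for a plain add or multiply, 2 for a fused multiply-add
    """
    lanes = vector_bits // element_bits
    return lanes * units * flops_per_lane


# Assumption: SSE and AVX1 cores issue one add and one multiply per clock.
sse_dp = peak_flops_per_clock(128, 64, units=2, flops_per_lane=1)
avx1_dp = peak_flops_per_clock(256, 64, units=2, flops_per_lane=1)
# Haswell: two FMA units, each lane doing a multiply and an add per clock.
avx2_dp = peak_flops_per_clock(256, 64, units=2, flops_per_lane=2)
avx2_sp = peak_flops_per_clock(256, 32, units=2, flops_per_lane=2)

print("SSE  double:", sse_dp)
print("AVX1 double:", avx1_dp)
print("AVX2 double:", avx2_dp)
print("AVX2 single:", avx2_sp)
```

The AVX1 and AVX2 results line up with the 8, 16, and 32 flops per clock quoted above. These are peak numbers only; code that cannot pair every multiply with an add gets no benefit from FMA.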

Kmiecik explained to HPCwire that the new FMA instructions in the AVX2 unit would potentially increase performance for structural analysis, computational fluid dynamics, and electromagnetic field and cosmology simulations. The AVX2 feature also supports full 256-bit wide integer calculations, rather than the 128-bit width for the prior AVX1, which will be useful to accelerate image and signal processing, genomics, and cryptographic workloads.

The trick is adding up the effects of the increased core counts, better single-threaded performance, memory bandwidth, and AVX2 math units as a whole running real workloads. And here is what the initial test results look like on a variety of HPC applications:

The tests above compare a two-socket server with the 18-core Haswell Xeon E5-2699 v3, which runs at 2.3 GHz, against a machine using the 12-core Xeon E5-2697 v2 processor, which runs at 2.7 GHz.
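For a sense of the ceiling on that comparison, here is a back-of-the-envelope peak calculation in Python. It assumes the base clocks and core counts quoted above, 16 double precision flops per clock for the Haswell core and 8 for the Ivy Bridge core, and it ignores turbo, any lower clock under heavy AVX load, and memory bandwidth. The helper name is made up for illustration.

```python
def peak_gflops_per_socket(cores, base_ghz, flops_per_clock):
    """Theoretical peak double precision GFLOPS for one socket."""
    return cores * base_ghz * flops_per_clock


# Figures from the quoted article; flops per clock are per-core peak values.
haswell = peak_gflops_per_socket(cores=18, base_ghz=2.3, flops_per_clock=16)
ivy_bridge = peak_gflops_per_socket(cores=12, base_ghz=2.7, flops_per_clock=8)
ratio = haswell / ivy_bridge

print(f"E5-2699 v3: {haswell:.1f} GFLOPS peak per socket")
print(f"E5-2697 v2: {ivy_bridge:.1f} GFLOPS peak per socket")
print(f"Paper ratio: {ratio:.2f}x")
```

On paper that is roughly a 2.6x gap per socket, and the ratio is the same for the two-socket machines. Real applications should be expected to land well below it, since few workloads are pure FMA streams running out of cache.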


Dell Broadwell Laptops Announced for EOY Delivery

Agam Shah, PC World, Dell laptops listed with Intel’s unreleased Broadwell PC chip, here. 14nm for the Holidays.

It’s been a long wait for mainstream PCs that use Intel Core processors based on the Broadwell architecture, but Dell has listed laptops that could ship with the unreleased chips early next year.

Models of Dell’s Latitude 14 3000 (14-inch screen) and Latitude 15 3000 (15-inch screen) will have “5th Generation Intel Celeron, Core up to i7 processors,” according to specification sheets of the products provided by the PC maker.

The documents list the laptops as having life cycles between January 2015 and April 2016, indicating the products could start shipping early next year.

Dell is among the first PC makers to reveal laptops with fifth-generation Core processors, which are targeted at mainstream and business PCs. Shipments of PCs with those chips were pushed to early next year partly due to manufacturing issues on Intel’s new 14-nanometer process.

Getting Non-Compulsory Misses in L2? You’re doing it wrong.

Joel Hruska, ExtremeTech, How L1 and L2 CPU caches work, and why they’re an essential part of modern chips, here. Seriously, if you need someone to tell you not to touch DRAM, or L4, or L3, you should just stop in the middle of Wall Street and think about what you are doing.

This chart from Anandtech’s Haswell review is useful because it actually illustrates the performance impact of adding a huge (128MB) L4 cache as well as the conventional L1/L2/L3 structures. Each stair step represents a new level of cache. The red line is the chip with an L4 — note that for large file sizes, it’s still almost twice as fast as the other two Intel chips.
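The stair steps can be read as a lookup: each step starts where the working set stops fitting in the previous level. Here is a rough Python sketch of that lookup. The 32 KiB L1 data cache and 256 KiB L2 are the per-core Haswell sizes, the 128 MiB L4 is from the quote, and the 6 MiB L3 is an assumption for illustration (L3 size varies by part). It ignores associativity, sharing between cores, and prefetching.

```python
KIB = 1024
MIB = 1024 * KIB

# (level name, capacity in bytes), smallest and fastest first.
# The L3 capacity is an assumed value; it differs from chip to chip.
WITH_L4 = [("L1", 32 * KIB), ("L2", 256 * KIB), ("L3", 6 * MIB), ("L4", 128 * MIB)]
WITHOUT_L4 = WITH_L4[:-1]


def smallest_level_holding(working_set_bytes, levels):
    """Return the first (fastest) level whose capacity covers the working set."""
    for name, capacity in levels:
        if working_set_bytes <= capacity:
            return name
    return "DRAM"


for size in (16 * KIB, 128 * KIB, 2 * MIB, 64 * MIB, 512 * MIB):
    print(
        f"{size // KIB:>7} KiB:",
        smallest_level_holding(size, WITH_L4),
        "with L4,",
        smallest_level_holding(size, WITHOUT_L4),
        "without",
    )
```

The interesting row is the 64 MiB one: with the L4 the working set is still on the package, without it every miss goes to DRAM, which is the gap the red line shows.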


Steve Dent, Engadget, Review roundup: Intel’s 8-core Haswell-E is the fastest desktop CPU ever, here. Reviewer doesn’t seem to recognize there are poor people who will be compiling and running code on Sandy Bridge for the rest of 2014 and a big chunk of 2015. I know, ewwww. I like the part where I might get a 4GHz clock.

Since it was teased in March, enthusiasts have been itching to see how Intel’s 8-core Haswell Extreme Edition processor (the i7-5960X) performs. It has now launched (along with two other Haswell-E models) and the reviews are in. Yes, it’s the world’s fastest desktop CPU — but the general consensus is “it could have been better.” Why? Because Intel recently launched a “Devil’s Canyon” CPU for $340 with a base clock speed of 4.0GHz that can easily be overclocked to 4.4GHz. The $1,000 Extreme Edition chip, on the other hand, has a base clock of 3.0GHz and max turbo speed of 3.5GHz. Since clock speeds are often more important to gamers than multiple cores, that might disappoint many a Battlefield 4 player. On the other hand, with DDR4 support and eight cores (Intel’s highest count ever on the desktop), the chip should excel at pro tasks like 4K video processing and 3D rendering. Given the price tag, that might be the only market that can afford it. Here’s what the experts think.
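The clocks versus cores trade-off in that excerpt is an Amdahl's law question. Here is a minimal Python sketch, assuming base clocks only, equal per-clock performance on both chips, and four cores for the Devil's Canyon part (the quote gives its clock but not its core count). The function name is made up for illustration.

```python
def relative_throughput(ghz, cores, parallel_fraction):
    """Amdahl-style throughput: clock speed divided by normalized run time.

    The serial part of the work runs on one core; the parallel part is
    split evenly across all cores. Units are arbitrary, only ratios matter.
    """
    serial_fraction = 1.0 - parallel_fraction
    return ghz / (serial_fraction + parallel_fraction / cores)


for p in (0.50, 0.80, 0.95):
    haswell_e = relative_throughput(ghz=3.0, cores=8, parallel_fraction=p)
    devils_canyon = relative_throughput(ghz=4.0, cores=4, parallel_fraction=p)
    print(f"parallel {p:.0%}: Haswell-E {haswell_e:.2f}, Devil's Canyon {devils_canyon:.2f}")
```

Under these assumptions the two chips break even when about 80 percent of the work runs in parallel. Below that the $340 part wins on clock speed; above it the eight cores pull ahead, which matches the gaming versus rendering split the reviewers describe.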

The Cloud Wins

Jason Perlow, ZDNet, Why I ditched my servers for the Cloud, here

Look, messing around with hypervisors is fun at work, but let’s face it, I really don’t want to run my own infrastructure anymore.

Xeon Servers

Lee Bell, The Inquirer, Boston previews upcoming Intel Xeon workstations and servers, here.

For servers, Boston has prepared a 1U rack-mount “pizza box” system, the Boston Value 360p. This is a two-socket server with twin 10Gbps Ethernet ports, support for 64GB of memory and 12Gbps SAS RAID. It can also be configured with NVM Express (NVMe) SSDs connected to the PCI Express bus rather than a standard drive interface.

ISA Showdown

Joel Hruska, ExtremeTech, The Final ISA showdown: Is ARM, x86, or MIPS intrinsically more power efficient? here. No link to the paper?

One of the canards that’s regularly trotted out in discussions of ARM vs. x86 processors is the idea that ARM chips are intrinsically more power efficient thanks to fundamental differences in the ISA (instruction set architecture). A new research paper examines these claims using a variety of ARM cores as well as a Loongson MIPS microprocessor, Intel’s Atom and Sandy Bridge microarchitectures, and AMD’s Bobcat.

Skylake 2015

Joel Hruska, ExtremeTech, Intel’s 14nm puzzle: As Skylake details leak, everybody asks – is the chip coming in 2015 or not? here. Get some AVX3.2 juice in 2015 maybe.

Last week, we covered news of a new leaked Intel document that claimed to show the company’s 14nm Broadwell chips pushing back well into 2015. While new ultra-low power silicon will be available by the end of the year, this paper predicted that 14nm Broadwell-H processors — the more mainstream silicon, in other words — won’t ship until Q2 2015. Given this, we predicted that Broadwell’s successor, Skylake, would slip as well, possibly into early 2016. Now fresh documents show the opposite, with Skylake still coming next year.

TSX Nerfed

Simon Sharwood, The Register, Intel disables hot new TSX tech in early Broadwells and Haswells, here.

One of Intel’s new ways to make software go faster is called Transactional Synchronization Extensions (TSX), an innovation that gives developers fine control over how multi-threaded code uses a CPU’s resources.

See here for the Aug14 Intel Xeon E3-1200 v3 spec update.

Broadwell Core M

Ryan Smith, AnandTech, Intel Broadwell Architecture Preview: A Glimpse into Core M, here. Floating point on your iPad running at Tiger Noodles is going to smoke JPM/Maxeler’s million dollar FPGA supercomputer for credit derivative valuation. However, the barrier to entry is high: you would have to learn to compile a program on an x86.

Of course efficiency increases can only take you so far, so along with the above changes Intel is also making some more fundamental improvements to Broadwell’s math performance. Both multiplication and division are receiving a performance boost thanks to performance improvements in their respective hardware. Floating point multiplication is seeing a sizable reduction in instruction latency from 5 cycles to 3 cycles, and meanwhile division performance is being improved by the use of an even larger Radix-1024 (10bit) divider. Even vector operations will see some improvements here, with Broadwell implementing a faster version of the vector Gather instruction.
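To put the latency and radix numbers in context, here is a small Python sketch. It assumes a chain of floating point multiplies where each one needs the previous result, so latency rather than throughput sets the pace, and it treats a radix-R divider as retiring log2(R) quotient bits per step. Both helpers are made up for illustration, and real pipelines add setup and rounding overhead that this ignores.

```python
def dependent_chain_cycles(n_ops, latency_cycles):
    """Cycles for n_ops operations when each waits on the previous result."""
    return n_ops * latency_cycles


def quotient_bits_per_step(radix):
    """Quotient bits retired per divider step; radix must be a power of two."""
    if radix < 2 or radix & (radix - 1):
        raise ValueError("radix must be a power of two")
    return radix.bit_length() - 1


before = dependent_chain_cycles(1000, latency_cycles=5)  # 5-cycle multiply
after = dependent_chain_cycles(1000, latency_cycles=3)   # 3-cycle multiply on Broadwell

print("cycles before:", before)
print("cycles after: ", after)
print("latency-bound speedup:", round(before / after, 2))
print("bits per step at radix 1024:", quotient_bits_per_step(1024))
```

The multiply win only shows up in latency-bound code such as long reductions or recurrences. Independent multiplies were already pipelined, so throughput-bound loops should see little change from this alone.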