Pink Iguana

Tag Archives: HPC

Evaluation of Vectorizing Compilers

Maleki, An Evaluation of Vectorizing Compilers, here. If you need the FP juice, just code directly to the intrinsics. Your code will look like that Intel Black Scholes code.

Abstract—Most of today’s processors include vector units that have been designed to speedup single threaded programs. Although vector instructions can deliver high performance, writing vector code in assembly language or using intrinsics in high level languages is a time consuming and error-prone task. The alternative is to automate the process of vectorization by using vectorizing compilers.

This paper evaluates how well compilers vectorize a synthetic benchmark consisting of 151 loops, two applications from Petascale Application Collaboration Teams (PACT), and eight applications from Media Bench II. We evaluated three compilers: GCC (version 4.7.0), ICC (version 12.0) and XLC (version 11.01). Our results show that despite all the work done in vectorization in the last 40 years 45-71% of the loops in the synthetic benchmark and only a few loops from the real applications are vectorized by the compilers we evaluated.


Haswell i7-4771 Available for Pre-order

Tarun Iyer, Tom’s Hardware, Intel Core i7-4771 Quad-Core CPU Available for Pre-Order, here. DynaRack this. I think even if there were Ivy Bridge servers available, the AVX2 on the Haswell would leave the theoretical Ivy Bridge servers far behind. We are looking at 6-9 months where competing low latency pre-trade analytics is hopelessly slow and noncompetitive, even though it would be faster than the current Sandy Bridge dinosaurs in the colo. This is how the server manufacturers are going to roll for the foreseeable future: SLOW. The delicious part is that for a big chunk of the low latency trading analytics there will be no really significant difference in how fast the code runs on Sandy Bridge versus Haswell. You need to recode to AVX2 and rebuild your code to see the FP juice. Pretty sure gcc will not see the juice without extraordinary measures, and most of these folks are doing smart guy stuff like OCaml and Erlang, which are totally oblivious to the realities of keeping a floating point pipeline full with two FMA units to schedule, as we have already covered.

Intel’s upcoming Core i7-4771 processor (BX80646I74771) has made an appearance and offers four processing cores, a clock rate of 3.5 GHz that can be Turbo-Boosted to 3.9 GHz and an 8 MB L3 Cache.
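
A rough sketch of where the FP juice comes from, as back-of-envelope peak arithmetic. The per-cycle figures are assumptions taken from the commonly cited microarchitecture numbers, not from the article: Sandy Bridge and Ivy Bridge issue one 256-bit AVX multiply and one 256-bit add per cycle (8 double precision flops per cycle per core), while Haswell's two 256-bit FMA units give 16. The function name and the choice of the 3.5 GHz base clock are illustrative only.

```python
# Back-of-envelope peak double precision throughput per chip.
# Assumed per-core, per-cycle figures (not from the article):
#   Sandy/Ivy Bridge: 1 AVX mul + 1 AVX add, 4 doubles each -> 8 flops/cycle
#   Haswell: 2 FMA units x 4 doubles x 2 flops (mul + add)  -> 16 flops/cycle
DP_FLOPS_PER_CYCLE = {"sandy_bridge": 8, "ivy_bridge": 8, "haswell": 16}


def peak_gflops(uarch: str, cores: int, ghz: float) -> float:
    """Theoretical peak GFLOPS = cores * clock in GHz * flops per cycle."""
    return cores * ghz * DP_FLOPS_PER_CYCLE[uarch]


# Core i7-4771 figures quoted above: 4 cores at a 3.5 GHz base clock.
haswell = peak_gflops("haswell", cores=4, ghz=3.5)
# Hypothetical part with the same clock and core count but no FMA.
no_fma = peak_gflops("ivy_bridge", cores=4, ghz=3.5)

print(f"Haswell peak: {haswell:.0f} GFLOPS, without FMA: {no_fma:.0f} GFLOPS")
print(f"ratio: {haswell / no_fma:.1f}x, but only if the code actually issues FMAs")
```

The doubling is a ceiling. Code built for SSE or plain AVX never touches the FMA units, which is why a rebuild alone may show little difference between Sandy Bridge and Haswell.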

Fooling the Masses with Performance Results

Gerhard Wellein and Georg Hager, ISC13, Fooling the Masses with Performance Results: Old Classics & Some New Ideas, here. Yahtzee! I have seen all these. Nice work. I have even seen “we need a supercomputer to run the overnight build for the C++ framework because it takes so long.”

Slow Computing 101
1. Do not use high compiler optimization levels or the latest compiler versions, because of numerical stability
2. Use fancy C++/JAVA/Python/… frameworks – they are much more maintainable and flexible
3. Scalability is still bad? Parallelize short loops with OpenMP and earn some extra bonus for a scalable hybrid code.
Time to solution?
“If I had a bigger machine, I could get the solution as fast as you want. This is of course due to the superior scalability of my code which is ready to scale on exaflop machines…..”

Network, Intel, Fulcrum, Infiniband, and Cray …Yahtzee!

Michael Feldman, HPCWire, Intel Weaves Strategy To Put Interconnect Fabrics On Chip, here. Boom goes the RDMA connection. All the Pluto Switch has to guarantee is that the packets get delivered in the same order as originally sent. The interleaving of the serial streams is important but probably not a huge deal.

Making the interconnect logic a first-class citizen on the processor, rather than just an I/O device would be a huge paradigm shift for the server market. If successfully executed at Intel, other chip vendors will be forced to follow suit. (AMD is likely already conjuring up something similar with the fabric technology it acquired from SeaMicro.) Meanwhile makers of discrete NICs and host adapter silicon will have to rethink their strategy, perhaps allying themselves with other chipmakers to offer competitive products.

Godson, ARM v Intel, Wilmott Quant Blogs, Pony Physics, and the ECB.

Bill Dally, EE Times, Q&A: Nvidia’s Dally on 3-D ICs, China, cloud computing, here. A more complete report of the interview that HPC Wire recently summarized. EE Times may make you register for this but it is otherwise free. On China’s microprocessors:

Five years ago Godson was laughable. Now it’s competent but not state of the art. If they continue, I would expect them to be matching the West in three to five years and then pulling ahead. Quite frankly, this country is not investing as much in R&D in these strategic areas. It’s a question of government investment in research. In computing it’s slowed to a trickle.

If we want to have a pipeline of innovations that can fuel competitive products going forward, the government needs to invest in stuff beyond the horizon of what companies will reasonably invest in. The fundamental research lifts all the boats.

IB Times, ARM, Intel Battle Heats Up, here. Microprocessor server HPC is a side show, with Intel holding 94+% share.

Low-power processor maker ARM Holdings PLC (Nasdaq: ARMH) is stepping up the rhetoric against chip rival Intel Corp. (Nasdaq: INTC), saying it expects to take more of Intel‘s share in the notebook personal-computer market than Intel can take from it in the smartphone market.

Pablo Triana’s Blog, here. Wilmott hosts several quant blogs for Derman, Triana, Taleb, Das, etc. that occasionally heat up, here. For example, this from Apr 2011:

I notice that Emanuel Derman is about to release his new book. The tome seems to deal with how the failings of finance theory can impact the world. This sounds very close to what my Lecturing Birds attempted to do. There are big differences though.

For one, Derman knows much more than I do about the subject matter.

He is also a better writer.

But I suspect that there is an area where I may have a slight comparative advantage. I am an amateur, a dilettante, a stranger in a strange land. Derman is a pro in the field. While he is way more open and honest than most other pros in this debate, he may not want to be more open and honest than necessary. In other words, he probably can’t or doesn’t want to be a denunciator. He can’t or doesn’t want to be too critical or too cynical. I, on the other hand, was able to be stringently accusatorial because I had no allegiance but to the evidence I unearthed and what such findings dictated me to conclude. Derman can highlight VaR’s weaknesses but he might not want to call for its banning. Derman can talk about BSM’s flaws, but he might not want to embrace Taleb-Haug. Derman can denounce the unrealism of models but he might not want to lead a campaign against the (possibly impractical, probably lethal) modelling of finance.

My Little Pony Physics, YouTube, here. Going viral via Tosh.0.

Weisenthal, Business Insider, Everyone Agrees: The ECB Is About To Make The Biggest Decision In Its History, here.

All the politicians in Greece (even the mainstream ones) have said they want to renegotiate the bailout agreement.

If the rest of Europe doesn’t back down and agree to this, then the ECB will have to make a huge decision.

From JPMorgan

Unless Greece chooses to leave the Euro area, which we doubt will happen, a Greek exit will require the rest of the region to push the country out. The mechanism for this will be the ECB excluding the Greek central bank from TARGET2, the regional payments and settlement system. Although this might look like a technical decision about monetary plumbing, the ECB will elevate this to the Euro area heads of state. It will be the most important political decision since EMU’s launch.

Intel Server CPUs Shipping

HPC Wire, Intel Rolls Out New Server CPUs, here.

Since the E5-4600 supports the Advanced Vector Extensions (AVX), courtesy of the Sandy Bridge microarchitecture, the new chip can do floating point operations at twice the clip of its pre-AVX predecessors. According to Intel, a four-socket server outfitted with E5-4650 CPUs can deliver 602 gigaflops on Linpack, which is nearly twice the flops that can be achieved with the top-of-the-line E7 technology. That makes this chip a fairly obvious replacement for the E7 when the application domain is scientific computing.
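
To put the 602 gigaflops in context, here is a hedged sketch comparing it against theoretical peak. The 602 GFLOPS and the four sockets come from the quote; the 8 cores per socket, the 2.7 GHz base clock, and the 8 double precision flops per cycle for AVX on Sandy Bridge are assumptions from the commonly published E5-4650 specs, so treat the resulting efficiency as approximate.

```python
# Linpack efficiency = measured GFLOPS / theoretical peak GFLOPS.
SOCKETS = 4                 # from the quote: four-socket server
MEASURED_GFLOPS = 602.0     # from the quote: Linpack result

# Assumed E5-4650 figures (not stated in the quote):
CORES_PER_SOCKET = 8
BASE_GHZ = 2.7
DP_FLOPS_PER_CYCLE = 8      # AVX: 4-wide multiply + 4-wide add per cycle

peak = SOCKETS * CORES_PER_SOCKET * BASE_GHZ * DP_FLOPS_PER_CYCLE
efficiency = MEASURED_GFLOPS / peak

print(f"theoretical peak: {peak:.1f} GFLOPS")
print(f"Linpack efficiency: {efficiency:.0%}")
```

Under those assumptions the box is running at roughly 87% of peak, which is about what a well tuned Linpack is expected to reach.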

Picking a FinQuant Platform

Extreme Tech, Ivy Bridge: Intel’s killing blow on AMD, here. Let’s look at this again. For the FinQuant application space I’d estimate somewhere between 50% and 85% of what you care about in selecting a Linux server is the current, expected future, and realized future feature size of the fab producing the server’s microprocessors and chips. There are lots of other important variables: system and microprocessor architecture, programming languages, network transmission lines, compilers, operating systems, file systems, databases, etc., and each alone can make or break a FinQuant app, but they are all tails. The microprocessor fab feature size is the dog; it effectively determines how well my FinQuant infrastructure scales with Moore’s Law. The comparative technology priority has not always been this way. There used to be different instruction set architectures, networks were slower than DRAMs, and memories were small, all requiring evaluation in addition to the shrinking microprocessor fab feature size. In all likelihood the comparative technology priorities will change in the future as well.

Right now, Intel Sandy Bridge is 32nm, Ivy Bridge is 22nm, AMD Opteron is 28nm, Xilinx is 28nm, Achronix is 22nm, and the microprocessor market share main event is between AMD and Intel over design wins in mobile systems. Server-side, Intel holds 95% market share to AMD’s 5%. Intel tries to set expectations of 22nm by 2013 and 14nm by 2014, here for example, while showing the Chandler, Ariz. 14nm fab construction, here. Recall that things do not always move so smoothly for Intel; think about the relatively recent 8M Sandy Bridge support chip recall and Itanium. On the other hand, AMD’s problems appear to be a shade worse than Intel’s; witness ars technica on server market share here, The Register here, Extreme Tech here. I don’t know how much these websites are owned or comped by Intel, but if I am holding a bunch of Opteron server-side exposure, it is probably safe to argue that it’s time to think about a hedge.

All things being equal, if I am aggressively setting up an HPC FinQuant infrastructure play now, I kind of want to be production ready with 22nm silicon by the end of 2012, looking to set up a smooth infrastructure transition to 14nm in 2014.
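
To see why the feature size is the dog, here is a minimal sketch of the idealized scaling arithmetic. It assumes transistor area shrinks with the square of the feature size, which is the textbook approximation and not a claim about any vendor's actual process; node names are partly marketing labels, so read the outputs as rough upper bounds. The function name is made up for illustration.

```python
def density_gain(old_nm: float, new_nm: float) -> float:
    """Idealized density gain, assuming area scales with feature size squared."""
    return (old_nm / new_nm) ** 2


# The two transitions discussed above: Sandy Bridge -> Ivy Bridge -> 14nm.
for old, new in [(32, 22), (22, 14)]:
    gain = density_gain(old, new)
    print(f"{old}nm -> {new}nm: ~{gain:.1f}x transistors per unit area")
```

Each shrink roughly doubles the transistor budget per unit area, so being one node behind means working with about half the silicon for the same die size.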

Anand Tech, Intel’s Ivy Bridge Architecture Exposed, here. Not sure how much I care about the integrated GPU for server side FinQuant apps unless the AVX2 is somehow related to the GPU.

ars, Transactional memory going mainstream with Intel Haswell, here.

phoronix, Compilers Mature For Intel Sandy/Ivy Bridge, Prep For Haswell, here. Wow Treasure.

tom’s hardware, AMD Steals Market Share From Intel, here. The interesting fight is in mobile from the market’s perspective. Servers not so interesting.

Xilinx References Apr 2012

Xilinx, High Performance Computing Using FPGAs, Sep 2010, here.

The shift to multicore CPUs forces application developers to adopt a parallel programming model to exploit CPU performance. Even using the newest multicore architectures, it is unclear whether the performance growth expected by the HPC end user can be delivered, especially when running the most data- and compute- intensive applications. CPU-based systems augmented with hardware accelerators as co-processors are emerging as an alternative to CPU-only systems. This has opened up opportunities for accelerators like Graphics Processing Units (GPUs), FPGAs, and other accelerator technologies to advance HPC to previously unattainable performance levels.

I buy the argument to a degree. As the number of cores per chip grows, the easy pipelining and parallelization opportunities will diminish. The argument is stronger if there are more cores per chip. At 8 cores or under per general purpose chip, it’s sort of a futuristic theoretical argument. More than a few programmers can figure out how to code up a 4 to 8 stage pipeline for their application without massive automated assistance. But the FPGA opportunity does exist.

The convergence of storage and Ethernet networking is driving the adoption of 40G and 100G Ethernet in data centers. Traditionally, data is brought into the processor memory space via a PCIe network interface card. However, there is a mismatch of bandwidth between PCIe (x8, Gen3) versus the Ethernet 40G and 100G protocols; with this bandwidth mismatch, PCIe (x8, Gen3) NICs cannot support Ethernet 40G and 100G protocols. This mismatch creates the opportunity for the QPI protocol to be used in networking systems. This adoption of QPI in networking and storage is in addition to HPC.

I buy the FPGA application in the NIC space. I want my NIC to go directly to L3 pinned pages, yessir I do, 100G please.
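
A quick sanity check on the bandwidth mismatch, as a sketch. The 8 GT/s per lane and 128b/130b encoding are the standard PCIe Gen3 figures; the calculation is raw link rate per direction and deliberately ignores packet header and flow control overhead, so real payload throughput is lower than what it prints.

```python
# Raw PCIe Gen3 link rate versus Ethernet line rates, per direction.
GEN3_GT_PER_S = 8.0      # giga-transfers per second per lane (PCIe 3.0)
ENCODING = 128 / 130     # 128b/130b line encoding efficiency


def pcie_gen3_gbps(lanes: int) -> float:
    """Raw bandwidth in Gb/s per direction, before packet overhead."""
    return lanes * GEN3_GT_PER_S * ENCODING


x8 = pcie_gen3_gbps(8)
for ethernet_gbps in (40, 100):
    verdict = "fits" if x8 >= ethernet_gbps else "falls short"
    print(f"PCIe Gen3 x8 at {x8:.1f} Gb/s vs {ethernet_gbps}G Ethernet: {verdict}")
```

On raw link rate alone an x8 Gen3 slot clears a single 40G port and falls well short of 100G. Protocol overhead and dual-port cards narrow the 40G margin, which this sketch does not model.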

Xilinx FPGAs double their device density from one generation to the next. Peak performance of FPGAs and processors can be estimated to show the impact of doubling the performance on FPGAs [Ref 6], [Ref 7]. This doubling of capacity directly results in increased FPGA compute capabilities.

The idea proposed here is that you want to be on the exponentially increasing density curve for the FPGAs in lieu of clock speed increases you are never going to see again. Sort of a complicated bet to make for mortals, maybe.

I like how they do the comparisons though. They say here is our Virtex-n basketball player and here is  the best NBA Basketball player … and they show you crusty old Mike Bibby 2012. Then they say watch as the Virtex-n basketball player takes Mike Bibby down low in the post, and notice the Virtex-n basketball player is still growing exponentially. So you can imagine how much better he will do against Mike Bibby in the post next year. Finally they say that Mike Bibby was chosen as the best NBA player for this comparison by his father Henry, who was also a great NBA player.

FPGAs tend to consume power in tens of watts, compared to other multicores and GPUs that tend to consume power in hundreds of watts. One primary reason for lower power consumption in FPGAs is that the applications typically operate between 100–300 MHz on FPGAs compared to applications on high-performance processors executing between 2–3 GHz.

Silly making-lemonade-out-of-lemons argument: the minute I can have my FPGAs clocked at 3 GHz I throw away the 300 MHz FPGAs, no?

Intel, An Introduction to the Intel QuickPath Interconnect, QPI, Jan 2009, here.

Xilinx Research Labs/NCSA, FPGA HPC – The road beyond processors, Jul 2007, here. Need more current references, but I keep hearing the same themes in arguments for FPGA HPC, so let’s think about this for a bit:

FPGAs have an opening because you are not getting any more clocks from microprocessor fab shrinks: OK.

Power density: meh. Lots of FinQuant code can run on a handful of cores. The Low Latency HFT folks cannot really afford many L2 misses. The NSA boys are talking about supercomputers for crypto not binary protocol parsing.

Microprocessors have all functions hardened in silicon, you pay for them whether you use them or not, and you can’t use that silicon for something else: Meh, don’t really care if I use all the silicon on my 300 USD microprocessor as long as the code is running close to optimal on the parts of the silicon useful to my application. It would be nice if I got more runtime performance for my 300 USD, no doubt. This point is like saying Advil is bad because you don’t always need to finish the bottle after you blow out your ankle. Yeah, I understand the silicon real estate is the most expensive in the world.

Benchmarks: Black Scholes 18 msec FPGA @ 110 MHz Virtex-4, 203x faster than Opteron – 2.2 GHz: You Cannot be Serious! 3.7 microseconds per Black Scholes evaluation was competitive performance at the turn of the century. The relative speedup slides and quotations make me nervous. Oh, Celoxica provided the data. Hey, Black Scholes ran in 36 nanoseconds on a single core of a dual core off-the-shelf general purpose microprocessor in 2007. So the Virtex-4, doing 1M Black Scholes evaluations in 18 milliseconds, is flat to competitive code on a dual core general purpose off-the-shelf microprocessor in 2007.
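
The arithmetic behind that reaction can be checked in a few lines. All inputs are the figures quoted above (18 ms, 1M evaluations, 203x, 36 ns on one core); the one added assumption, flagged in the code, is that the tuned CPU code scales perfectly across both cores.

```python
# Figures as quoted from the slides and the commentary above.
FPGA_TOTAL_S = 18e-3     # Virtex-4: 18 ms for the whole run
N_EVALS = 1_000_000      # 1M Black Scholes evaluations
SPEEDUP = 203            # claimed FPGA speedup over the 2.2 GHz Opteron
CPU_2007_NS = 36.0       # per evaluation on one core, tuned code
CORES = 2                # assumption: perfect scaling across both cores

# Time per evaluation on the FPGA, in nanoseconds.
fpga_ns = FPGA_TOTAL_S / N_EVALS * 1e9
# Opteron baseline implied by the 203x claim, in microseconds.
opteron_us = fpga_ns * SPEEDUP / 1e3
# Tuned 2007 CPU code running 1M evaluations on both cores, in milliseconds.
dual_core_ms = CPU_2007_NS * N_EVALS / CORES / 1e6

print(f"FPGA: {fpga_ns:.0f} ns per evaluation")
print(f"implied Opteron baseline: {opteron_us:.2f} us per evaluation")
print(f"tuned dual core CPU: {dual_core_ms:.0f} ms for 1M evaluations")
```

The implied baseline of about 3.7 microseconds per evaluation is the giveaway: the 203x is measured against slow CPU code, and against tuned code the two land on the same 18 ms.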

Make it easy for the users to use this hardware and get “enough of a performance” increase to be useful: meh, it’s for applications that do not need to go fast, for now (2007)?

Do not try to be the fastest thing around when being as fast with less power is sufficient: meh, really do not care so much about the power thing

FPGA: Different operations map to different silicon allows massive pipelining; lots of parallelism: OK. So, why bother with the previous two points?

Eggers/ U. Washington, CHiMPS, here. Eggers is reasonable.

There have been (at least) two hindrances to the widespread adoption of FPGAs by scientific application developers: having to code in a hardware description language, such as Verilog (with its accompanying hardware-based programming model) and poor FPGA memory performance for random memory accesses. CHiMPS, our C-to-FPGA synthesis compiler, solves both problems with one memory architecture, the many-cache memory model.

Many-cache organizes the small, distributed memories on an FPGA into application-specific caches, each targeting a particular data structure or region of memory in an application and each customized for the particular memory operations that access it.

CHiMPS provides all the traditional benefits we expect from caching. To reduce cache latency, CHiMPS duplicates the caches, so that they’re physically located near the hardware logic blocks that access them. To increase memory bandwidth, CHiMPS banks the caches to match the memory parallelism in the code. To increase task-level parallelism, CHiMPS duplicates caches (and their computation blocks) through loop unrolling and tiling. Despite the lack of FPGA support for cache coherency, CHiMPS facilitates data sharing among FPGA caches and between the FPGA and its CPU through a simple flushing of cached values. And in addition, to harness the potential of the massively parallel computation offered by FPGAs, CHiMPS compiles to a spatial dataflow execution model, and then provides a mechanism to order dependent memory operations to retain C memory ordering semantics.

CHiMPS’s compiler analyses automatically generate the caches from C source. The solution allows scientific programmers to retain their familiar programming environment and memory model, and at the same time provides performance that is on average 7.8x greater and power that is one fourth that of a CPU executing the same source code. The CHiMPS work has been published in the International Symposium on Computer Architecture (ISCA, 2009), the International Conference on Field Programmable Logic and Applications (FPL, 2008), and High-Performance Reconfigurable Computing Technology and Applications (HPRCTA, 2008), where it received the Best Paper Award.

Intel @ 22nm

Intel, Intel’s Revolutionary 22 nm Transistor Technology, Mark Bohr and Kaizad Mistry, May 2011, here.

EE Times, Intel exec says fabless model ‘collapsing’, here. Bohr is the guy from the 22nm presentation (above).

It’s the beginning of the end for the fabless model according to Mark Bohr, the man I think of as Mr. Process Technology at Intel.

Bohr claims TSMC’s recent announcement it will serve just one flavor of 20 nm process technology is an admission of failure. The Taiwan fab giant apparently cannot make at its next major node the kind of 3-D transistors needed to mitigate leakage current, Bohr said.

“Qualcomm won’t be able to use that [20 nm] process,” Bohr told me in an impromptu discussion at yesterday’s press event where Intel announced its Ivy Bridge CPUs made in its tri-gate 22 nm process. “The foundry model is collapsing,” he told me.

Of course Intel would like the world to believe that only it can create the complex semiconductor technology the world needs. Not TSMC that serves competitors like Qualcomm or GlobalFoundries that makes chips for Intel’s archrival AMD.


But Bohr stretches the point too far when he says the foundries and fabless companies won’t be able to follow where Intel is going. I have heard top TSMC and GlobalFoundries R&D managers make a good case that 3-D transistors won’t be needed until the 14 nm generation. For its part, TSMC said at 20 nm there is not enough wiggle room to create significant variations for high performance versus low power processes.

Anand Tech, Ivy Bridge posts, here. Motherboard and laptop implementation commentary.

Intel, Haswell New Instruction Description Now Available, June 2011, here.

Intel just released public details on the next generation of the x86 architecture. Arriving first in our 2013 Intel microarchitecture codename “Haswell”, the new instructions accelerate a broad category of applications and usage models. Download the full Intel® Advanced Vector Extensions Programming Reference (319433-011).

These build upon the instructions coming in Intel® microarchitecture code name Ivy Bridge, including the digital random number generator, half-float (float16) accelerators, and extend the Intel® Advanced Vector extensions (Intel® AVX) that launched in 2011.

AVX2 integer data types expanded to 256-bit SIMD; Bit manipulation instructions; Gather; Any-to-Any permutes; Vector-Vector shifts; Floating point Multiply Accumulate.

Agner Fog

SD Times, Fog Around Intel Compilers, here.

Agner Fog is a computer science professor at the University of Copenhagen‘s college of engineering. As he puts it, “I have done research on microprocessors and optimized code for more than 12 years. My motivation is to make code compatible, especially when it pretends to be.”

Fog has written a number of blog entries about Intel’s compilers and how they treat competing processors. In November, AMD and Intel settled, and Fog has written up a magnificent analysis of the agreement.

If you have any interest in compilers, and in Intel’s compilers, you should definitely read his paragraph-by-paragraph read through.

Fog, Agner, Software Optimization Resources, here. I was reading Fog’s Optimizing Software in C++ (here) this morning. It’s a runtime optimization guide for Windows, Linux, and Mac. I have seen it before and perhaps been remiss in not commenting more fully. Without the benefit of trying out many of Fog’s code samples and directives against current versions of ICC and GCC I cannot be certain, but based on what I have optimized in the recent past, his body of work looks very legitimate and exhaustive. You ask, how exhaustive? Let’s start with the copyright; it’s got a succession plan:

This series of five manuals is copyrighted by Agner Fog. Public distribution and mirroring is not allowed. Non-public distribution to a limited audience for educational purposes is allowed. The code examples in these manuals can be used without restrictions. A GNU Free Documentation License shall automatically come into force when I die. See

Professor Fog is laying out code optimization paths for 4 different compilers on 3 different operating systems. I will not and cannot check out/verify all the scenarios presented because I possess the attention span of a squirrel compared to Professor Fog.  He also provides a page on random number generators, here,  which seems legit to the extent that he points you to Matsumoto’s Mersenne Twister RNG page, here. The random number references do not appear to be as comprehensive as the C++ runtime optimization references. But  this looks to be a case of:

We’re not worthy

in a most complimentary way to Professor Fog.