Pink Iguana

Category Archives: Code

Zeroing Buffers is Hard

Colin Percival, Daemonic Dispatches, Zeroing buffers is insufficient, here

Now, some parts of the stack are easy to zero (assuming a cooperative compiler): The parts which contain objects which we have declared explicitly. Sensitive data may be stored in other places on the stack, however: Compilers are free to make copies of data, rearranging it for faster access. One of the worst culprits in this regard is GCC: Because its register allocator does not apply any backpressure to the common subexpression elimination routines, GCC can decide to load values from memory into “registers”, only to end up spilling those values onto the stack when it discovers that it does not have enough physical registers (this is one of the reasons why gcc -O3 sometimes produces slower code than gcc -O2). Even without register allocation bugs, however, all compilers will store temporary values on the stack from time to time, and there is no legal way to sanitize these from within C. (I know that at least one developer, when confronted by this problem, decided to sanitize his stack by zeroing until he triggered a page fault — but that is an extreme solution, and is both non-portable and very clear C “undefined behaviour”.)

One might expect that the situation with sensitive data left behind in registers is less problematic, since registers are liable to be reused more quickly; but in fact this can be even worse. Consider the “XMM” registers on the x86 architecture: They will only be used by the SSE family of instructions, which is not widely used in most applications — so once a value is stored in one of those registers, it may remain there for a long time. One of the rare instances those registers are used by cryptographic code, however, is for AES computations, using the “AESNI” instruction set.

It gets worse. Nearly every AES implementation using AESNI will leave two values in registers: The final block of output, and the final round key. For encryption operations these aren’t catastrophic things to leak (the final block of output is ciphertext, and the final AES round key, while theoretically dangerous, is not enough on its own to permit an attack on AES), but the situation is very different for decryption operations: The final block of output is plaintext, and the final AES round key is the AES key itself (or the first 128 bits of the key for AES-192 and AES-256). I am absolutely certain that there is software out there which inadvertently keeps an AES key sitting in an XMM register long after it has been wiped from memory. As with “anonymous” temporary space allocated on the stack, there is no way to sanitize the complete CPU register set from within portable C code, which should probably come as no surprise, since C, being designed to be a portable language, is deliberately agnostic about the registers and even the instruction set of the target machine.

Let me say that again: It is impossible to safely implement any cryptosystem providing forward secrecy in C.
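For the "easy" part of the quote above (objects we have declared explicitly), a minimal sketch of the usual workaround looks like the following. The names secure_zero and checksum_and_wipe are hypothetical and do not come from Percival's post. The sketch assumes that the compiler preserves stores made through a volatile-qualified pointer, which mainstream compilers do in practice. It does nothing about the copies on the stack or in registers that the quote is actually worried about.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the quoted post): overwrite a buffer
 * through a volatile pointer so the stores are not treated as dead
 * and removed, as a plain memset() before return may be. */
static void secure_zero(void *buf, size_t len)
{
    volatile unsigned char *p = (volatile unsigned char *)buf;
    while (len--)
        *p++ = 0;
}

/* Hypothetical usage: copy a secret into a named local buffer, use it,
 * then wipe that one named copy. Any temporaries the compiler created
 * along the way (spills, registers) are NOT touched by this. */
static unsigned checksum_and_wipe(const char *secret)
{
    unsigned char key[32] = {0};
    unsigned sum = 0;
    size_t n = strlen(secret);

    if (n > sizeof key)
        n = sizeof key;
    memcpy(key, secret, n);
    for (size_t i = 0; i < sizeof key; i++)
        sum += key[i];

    secure_zero(key, sizeof key);   /* wipe the declared object only */
    return sum;
}
```

Where the platform offers one, a library routine such as explicit_bzero (glibc, the BSDs) or the optional C11 memset_s is probably a better choice than a hand-rolled loop; either way the limitation described in the quote still applies.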


Check In with that VP of Really Fast Erlang HFT Code

Matt Levine, Bloomberg View, Goldman Sachs Just Says ‘Vice President’ to be Polite, here. And they say the guy from Duck Dynasty doesn’t have a clue what he is talking about on Fox. Well, he ain’t alone, nosuh, not alone.

My basic view of Sergey Aleynikov (the former Goldman Sachs programmer who left for a high-frequency trading firm, took some code on his way out the door, was arrested by the FBI at Goldman’s instigation, was convicted of theft and sentenced to eight years in prison, was released after about a year when an appeals court ruled that he hadn’t committed a crime, and was then charged with the theft again in state court just out of prosecutorial spite) is that Goldman has been unnecessarily mean to him and that the very least it can do would be to pay his (more than $2.3 million!) legal bills for the absurd criminal cases it put him through.

Intel pumps code renovation show – This Old Code

Nicole Hemsoth, HPCWire, New Degrees of Parallelism, Old Programming Planes, here. Intel pumping vectorization. 

Exploiting the capabilities of HPC hardware is now more a matter of pushing into deeper levels of parallelism versus adding more cores or overclocking. What this means is that the time is right for a revolution in programming. The question is whether that revolution should be one that torches the landscape or that handles things “diplomatically” with the existing infrastructure.

While some argue for a “rip and replace” approach to rethinking code for the new era of computational capability, others, including Intel’s Director of Software, James Reinders, are advocating approaches that blend the old and new—that preserve the order of existing programming models while still permitting major leaps ahead for parallelism.

To these ends, Reinders described the latest release of Intel’s Parallel Studio XE 2015 for us this week, pointing to the addition of new explicit vector programming capabilities as well as the many features inside OpenMP 4.0, which is a significant part of the new release.
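As a rough illustration of what "explicit vector programming" with OpenMP 4.0 can look like, here is a minimal sketch in C. The saxpy kernel is a hypothetical example and is not taken from the article or from Intel's materials. The simd construct is part of OpenMP 4.0; a compiler that is not asked to honor it just ignores the pragma and compiles an ordinary scalar loop.

```c
#include <stddef.h>

/* Hypothetical kernel: y[i] = a * x[i] + y[i].
 * 'restrict' promises the compiler that x and y do not overlap;
 * '#pragma omp simd' (OpenMP 4.0) asks it to vectorize the loop
 * rather than leaving that decision to the auto-vectorizer. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

With GCC, the pragma is honored under -fopenmp or -fopenmp-simd; other compilers have their own switches. This keeps the existing loop and programming model intact, which is the "diplomatic" approach the article describes.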

Living Stingy, Why “This Old House” is Evil, here. Of course if you have x86 code you can only run it on Intel processors so Bob Vila wouldn’t have the same problem.

This television program has done more to ruin the psyche of American homeowners than any one single thing. These two are indirectly responsible for the economic meltdown of 2009, I believe.


Michael Wolfe, HPCWire, Compilers and More: MPI+X, here.


At ISC’14, there was intense and continuing interest in the choice of a standard approach for programming the next generation HPC systems. While not guaranteed, many of these systems are likely to be large clusters of nodes with multicore CPUs and some sort of attached accelerators. A standard programming approach is necessary to convince developers, and particularly ISVs, to start adoption now in preparation for this coming generation of systems. John Barr raised the same question in an article at Scientific Computing World from a more philosophical point of view. Here I address this question from a deeper technical perspective.

HPC programming is currently dominated by either a flat model with MPI across nodes as well as cores within a node, or a hybrid model with MPI across the nodes and OpenMP shared memory parallelism across the cores in a node. The advantage of flat MPI is a simpler programming model, only one level of parallelism and only one API. The disadvantage is it doesn’t take advantage of the shared data across the ranks on the same node, requiring message and buffer management across all ranks. MPI+OpenMP roughly inverts those advantages and disadvantages.

The reason MPI and MPI+OpenMP have worked so well over the past 20 years now is that most HPC systems are roughly isomorphic, with some differences in instruction set, node topology, and performance profiles. The system is a network of nodes, the nodes have one or more processors, the processors have one or more cores. The cores on a node share virtual and physical memory, with hardware cache coherence to make shared memory programming relatively safe. There are some outliers, like the big SGI shared memory systems, which have some programming model and performance advantages for certain applications.


A comparison of programming languages in economics

Tyler Cowen, Marginal Revolution, A comparison of programming languages in economics, here. Adorable, see also here.

Our first result is that C++ and Fortran still maintain a considerable speed advantage with respect to all other alternatives. For example, these compiled languages are between 2.10 and 2.69 times faster than Java, around 10 times faster than Matlab, and around 48 times faster than the Pypy implementation of Python. Second, C++ compilers have advanced enough that, contrary to the situation in the 1990s, C++ code runs slightly faster (5-7 percent) than Fortran code. The many other strengths of C++11 in terms of capabilities (full object orientation, template meta-programming, lambda functions, large user base) make it an attractive language for graduate students to learn. On the other hand, Fortran 2008 is simple and compact (and, thus, relatively easy to learn) and it can take advantage of large amounts of legacy code. Third, even for our very simple code, there are noticeable differences among compilers. We find speed improvements of more than 100 percent between different executables of the same underlying code (and using equivalent optimization compilation flags). While the open-source GCC compilers are superior in a Mac/Unix/Linux environment (for which they have been explicitly developed) to the Intel compilers, they do less well in a Windows machine. The deterioration in performance of the Clang compiler was expected given that the goal of the LLVM behind it is to minimize compilation time and executable file sizes, both important goals when developing general-use applications but often (but not always!) less relevant for numerical computation. Our fourth result is that Java imposes a speed penalty of 110 to 169 percent. Given the similarity between Java and C++ syntax, there does not seem to be an obvious advantage for choosing Java unless portability across platforms or the wide availability of Java programmers is an important factor.
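The "2.10 to 2.69 times faster" figure and the "110 to 169 percent" penalty in the quote are the same measurement stated two ways. A small sketch of the conversion, using a hypothetical helper name that is not from the paper:

```c
/* Hypothetical helper: turn "ratio times slower than the baseline"
 * into a percentage speed penalty relative to that baseline.
 * A ratio of 1.0 (same speed) is a 0 percent penalty. */
double penalty_percent(double ratio)
{
    return (ratio - 1.0) * 100.0;
}
```

Feeding it the ratios from the quote, 2.10 and 2.69, gives roughly 110 and 169 percent.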

They went and flipped the -O3 switch:

Our Mac machine had an Intel Core i7 @2.3 GHz processor, with 4 physical cores, and 16 GB of RAM. It ran OS X 10.9.2. Our Windows machine had an Intel Core i7-3770 CPU @3.40GHz processor, with 4 physical cores, hyperthreading, and 12 GB of RAM. It ran Windows 7, Ultimate-SP1. The compilation flags were:
1. GCC compiler (Mac): g++ -o testc -O3 RBC_CPP.cpp
2. GCC compiler (Windows): g++ -Wl,--stack,4000000, -o testc -O3 RBC_CPP.cpp
3. Clang compiler: clang++ -o testclang -O3 RBC_CPP.cpp
4. Intel compiler: icpc -o testc -O3 RBC_CPP.cpp
5. Visual C: cl /F 4000000 /o testvcpp /O2 RBC_CPP.cpp
6. GCC compiler: gfortran -o testf -O3 RBC_F90.f90
7. Intel compiler: ifortran -o testf -O3 RBC_F90.f90
8. javac and run as java RBC_Java -XX:+AggressiveOpts.

Caml at Jane Street

Joab Jackson, IT World, You won’t believe what programming language this Wall Street firm uses, here. Language is probably super good at issuing AVX2 instructions on Haswell.

Trading firm Jane Street says Caml has given it a powerful set of tools for building large programs that have to run quickly and without errors.

“A huge amount of day-to-day programming is case analysis. Getting your programs right is really hard, and any tool you can get from the system to help catch errors is helpful,” said Yaron Minsky, head of the technology group at Jane Street, speaking Friday at the QCon developer conference in New York.

Jane Street is a proprietary trading firm that is the world’s largest industrial user of Caml and OCaml, the object-oriented version of Caml.

All of Jane Street’s trading and ancillary systems use Caml, with the exception of some C code for low-level system interfaces and some Visual Basic script powering analyst spreadsheets. All in all, Caml code handles about US$20 billion of trades every business day at Jane Street.

How to build a program using Intel MKL

Princeton Research Computing, How to build a program using Intel MKL, here. Nice, don’t click on the HPC hardware though. They have Westmere chips in there, yucky.

It is relatively simple to compile and link a C, C++, or Fortran program that makes use of the Intel MKL (Math Kernel Library), especially when using the Intel compilers.
Begin by determining the correct link parameters for your situation at the Intel MKL Link Line Advisor page. Select the options as follows:
Intel product:
    Intel MKL 10.2 (if using Intel Compiler 11.x)
    Intel MKL 10.3 (if using Intel Compiler 12.x)
    Intel Composer XE 2013 (if using Intel Compiler 13.x)
OS: Linux
Processor architecture: Intel 64
Compiler:
    Intel Fortran
    Intel C/C++
Dynamic or static linking: Dynamic (recommended)
Interface layer: LP64
Sequential or multi-threaded layer:
    Sequential (standard option)
    Multi-threaded (only if doing multi-threading within each process using OpenMP)
The other options usually do not need to be specified.
Execute: module load intel
Type ‘module list’ to see which version’s environment was set up. If you wish to use a different Intel compiler version, then type ‘module avail’ to see your choices, and then ‘module purge’ followed by a ‘module load’ command specifying the desired version.
Since your selected Intel compiler and MKL environments have now been set up, there is no need to specify the -I (compilation) and -L (link) options as specified by the Link Line Advisor page. Instead, just append the other recommended link line options to your icc or ifort command invocation: e.g.,
icc -o prog prog.c -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm
Runtime threading configuration:
If you selected the sequential layer option above, Intel MKL does not use the OpenMP threading library, and therefore there is no need to set the OMP_NUM_THREADS environment variable.
If you specified the multi-threaded layer, Intel MKL does use OpenMP under the hood, and you must specify in your PBS script (to be submitted with qsub) the number of threads to be created for each process. Please follow the relevant instructions by clicking here.


Intel, An Introduction to Vectorization with the Intel C++ Compiler, here. Nice references and standard material. Links on the jump.

Additional Reading and Community

A Guide to Vectorization with Intel® C++ Compilers, Mario Deilmann, Kiefer Kuah, Martyn Corden, Mark Sabahi, all from Intel.

Vectorization with the Intel® Compilers (Part 1), A.J.C. Bik, Intel. Intel Software Network Knowledge Base; search for the title in the keyword search. This article offers good bibliographical references.

The Software Vectorization Handbook. Applying Multimedia Extensions for Maximum Performance, A.J.C. Bik, Intel Press, June 2004, for a detailed discussion of how to vectorize code using the Intel® compiler.

Vectorization: Writing C/C++ code in VECTOR Format, Mukkaysh Srivastav, Computational Research Laboratories (CRL), Pune, India. Intel Software Network Knowledge Base; search for the title in the keyword search.

Intel® Cilk™ Plus Introductory Information. Overviews, videos, getting started guide, documentation, white papers and a link to the community.

Elemental functions: Writing data parallel code in C/C++ using Intel® Cilk™ Plus, Robert Geva, Intel Corporation.

Intel® C++ Composer XE documentation. Includes documentation for the Intel® C++ Compiler.

Intel Software Network. Search for topics such as “Parallel Programming” in the “Communities” menu, or “Software Forums” or Knowledge Base in the “Forums and Support” menu.

Requirements for Vectorizable Loops, Martyn Corden, Intel Corporation.

The Software Optimization Cookbook, Second Edition, High-Performance Recipes for IA-32 Platforms, by Richard Gerber, Aart J.C. Bik, Kevin B. Smith and Xinmin Tian, Intel Press.


Tabb Forum, Webinar, here.

Financial services and capital markets firms are massively unprepared when it comes to this new age of high performance computing. The combined growth of regulatory mandates and bigger data drivers means that legacy computing infrastructure and single-threaded software in these domains are under increasing pressure to perform. Why? Most software today is not developed to access the sizzling performance capabilities of new hardware architectures. In this TabbFORUM webinar, a panel of specialists will discuss the history and current choices for parallel hardware as well as the critical performance impacts of parallelized software for many use cases to compete effectively along the road ahead in global markets.


Robert Geva, Principal Engineer, Intel

Peter Lankford, Founder and Director, Securities Technology Analysis Center

Ben Young, Senior Software Engineer, SunGard

E. Paul Rowady, Principal & Head of Data and Analytics Research, TABB Group


Markus Püschel, How to Write Fast Numerical Code, Spring 2014, here. Not bad.

Bordawekar, IBM, Believe it or Not! Multi-core CPUs Can Match GPU Performance for FLOP-intensive Application!, here.

We evaluated the performance of a real-world image processing application on the latest GPU and commodity multi-core processors using a wide variety of parallelization techniques. A pthreads-based version of the application running on a dual quad-core Intel Xeon system was able to match nVidia 285 GPU performance. Using fully automatic compiler-driven auto-parallelization and optimization, a single Power7 processor was able to achieve performance better than that on the nVidia 285 GPU. This is a compelling productivity result, given the effort required to develop an equivalent high-performance CUDA implementation. Our results also conclusively demonstrate that, under certain conditions, it is possible for a program running on a multi-core processor to match or even beat the performance of an equivalent GPU program, even for a FLOP-intensive structured application. In future, we plan to compare performance of such applications on upcoming GPU architectures from AMD and nVidia, e.g., nVidia Fermi.