Georg Hager, Georg Hager's Blog: "Blaze library version 1.4 released".
Blaze is an open-source, high-performance C++ math library for dense and sparse arithmetic based on Smart Expression Templates. Right on time for SC'13, the fifth release of the Blaze library is now available. Blaze 1.4 introduces subvector and submatrix views, which in combination with row and column views provide great flexibility for accessing vector and matrix data.
Blaze is unique in that it does not simply rely on compiler magic but utilizes highly tuned libraries whenever feasible. In many cases, Blaze can thus achieve "best possible performance" as defined by suitable performance models.
Check out the details and download the library at the developer site: http://code.google.com/p/blaze-lib/
Google, blaze-lib project; daxpy benchmark. Sandy Bridge/AVX, meh; where are the Haswell numbers? The height and slope of the curve for vectors of length less than 100 are particularly interesting for real code. That Blaze FLOPs slope is pretty attractive for code that cannot assume nice long 1K vectors.
The following selected benchmarks give an impression of the performance of the Blaze library. In these benchmarks, Blaze is compared to the following third party libraries:
- Blitz++, version 0.9
- Boost uBLAS, version 1.46
- GMM++, version 4.1
- Armadillo, version 2.4.2
- MTL4, version 4.0.8368
- Eigen3, version 3.1.0alpha1
- Intel MKL, version 10.3 (update 8)
The benchmark system is an Intel Xeon E3-1280 (“Sandy Bridge”) CPU at 3.5 GHz base frequency with 8 MByte of shared L3 cache. Due to the “Turbo Mode” feature the processor can increase the clock speed by up to 400 MHz, depending on load and temperature. Since we use a single core only in our benchmarks, the CPU ran continuously at 3.9 GHz.
The maximum achievable memory bandwidth (as measured by the STREAM benchmark) is about 18.5 GByte/s. In contrast to other x86 processors, this limit can be hit by a single thread if the code is strongly memory bound. Each core has a theoretical peak performance of eight flops per cycle in double precision (DP) using AVX ("Advanced Vector Extensions") vector instructions. A single core of the Xeon CPU can execute one AVX add and one AVX multiply operation per cycle. Full in-cache performance can only be achieved with SIMD-vectorized code. This includes loads and stores, which exist in full-width (AVX) vectorized, half-width (SSE) vectorized, and "scalar" variants. A maximum of one 256-bit wide AVX load and one 128-bit wide store can be sustained per cycle. 256-bit wide AVX stores thus have a two-cycle throughput.
The GNU g++ 4.6.1 compiler was used with the following compiler flags:

```
g++ -Wall -Wshadow -Woverloaded-virtual -ansi -pedantic -O3 -mavx -DNDEBUG -DMTL_HAS_BLAS -DEIGEN_USE_BLAS ...
```