Thomas Willhalm, Intel, Processing Arrays of Bits with Intel Advanced Vector Extensions 2, here. If you want the strategies that Bodek describes to go fast, you are probably looking at some bit hackery using AVX2 in the algo execution. You will need some way to eliminate unpredictable branches, or convert them to something more computationally uniform and parallel, while staying compact in memory. Solve Bodek's issue by open-sourcing industrial-strength optimized code to run all the orders so the buy-side folks' orders don't get tagged and run over. It's not like you're going to get the genie back into the bottle, right? Presumably you could even do this at a larger prime-broker shop for client execution, sort of the way those RBC folks were chatting about a couple of months ago. Don't see why you cannot completely level the playing field for Bodek's strategies in calendar 2014 and score some internalization flow before releasing the exhaust on the exchange/dark-pool liquidity. Maybe just do the same thing JPM did with CDS valuation: release standard ALGO/SOR optimized code through an equity/equity-option/ETF/fixed-income e-trading broker-dealer consortium for 2015, targeting AVX-512/Broadwell/Skylake. Intel would presumably love the consortium or even be part of it (call it The Round Table, you know ... Knights of the Round Table ... hahaha). Seems like there are several options to solve this at a big shop or even some smaller shops; the commodity infrastructure is just not that expensive, and you just need a dev group who are on top of their game. I knew a couple of guys from MIPT (the Russian MIT) in London who could crush this problem (but you will have to convince them to use AVX2 instead of Java :)) or convince a couple of smart Ptown undergrads to do this as a summer project before they go run full court @ appnexus.
With Intel® AVX2 there are two new instructions that address exactly these problems:
Intel® AVX2 introduces an instruction that gathers values from memory into a vector register. The base address is given as an argument and the offsets of the values are provided in another vector register. Additionally, a scaling factor is given for greater flexibility in the size of the array elements. The instruction that gathers double-words using double-words as indexes is called VPGATHERDD. The corresponding intrinsic is:
__m256i _mm256_i32gather_epi32 (int const * base, __m256i index, const int scale);

Please refer to the Intel® Advanced Vector Extensions Programming Reference for a more precise definition. The programming reference also describes gather for other data types as well as the usage of gather with masks.
Vector shifts have been available since Intel® Streaming SIMD Extensions 2 (Intel® SSE2). However, all data elements in the vector register were always shifted by the same number of bits. What is new with Intel® AVX2 is the ability to provide a vector register for a variable shift-count per data element. For logical right shifts, the instruction is called VPSRLVD and the corresponding intrinsic is:
__m256i _mm256_srlv_epi32 (__m256i m, __m256i count);

Other variants in data type and shift direction are again listed in the Intel® Advanced Vector Extensions Programming Reference.