David Kanter, Real World Technologies, Knights Landing Details, here. Knights Landing seems to be a pure throughput play, i.e., your compile-time vectors need to be 100 to 1000 elements long to get in the game. So Monte Carlo risk folks on the Street should get a big performance pop by moving to Knights with few complications. The clock frequency step down to 1+ GHz should force the ALGO folks into a higher frequency hybrid solution; again, the situation unfolding moves in a direction facilitating Open Server/DynaRack in a big way. The Throughput folks are going to get their own Knights++ or GPU solution, good for double precision floating point with some programming constraints. But your code needs to have long vectors with dimensions known at compile time; otherwise, I’m pretty sure you are running the performance equivalent of Johnson’s code (not necessarily an ALGO show stopper; I know a kick-ass Stat Arb guy who has basically run Johnson’s code for years). If your vectors are more like 8-20 elements long and you are latency sensitive, the Knights++ are not going to help very much. If your concern is competitive performance, you are dropping a factor approaching 3 in performance/latency just to get started with Knights. If your concern is $/FLOP, FLOPS/Watt, or FLOPS/die mm², those are different problems. The market is fragmenting and ALGO AVX folks are getting isolated. Competitive ALGO folks, you need an AVX horse in this race: and it’s not Knights++, it’s not the Ivy Bridge Server chip that you are still waiting for in 1Q14, it’s not the “world’s fastest pc for HFT,” it’s gonna be something like Open Server fast prototyping DynaRack/NucRack.
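To make the compile-time point concrete, here is a minimal C++ sketch. The names `dot_fixed` and `dot_dynamic` are made up for illustration and come from neither Kanter's article nor any library. In the first version the trip count is a template constant the compiler can see; in the second it only shows up at run time. Whether either loop actually gets vectorized depends on your compiler and flags, so treat this as an illustration of the distinction, not a benchmark.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Length known at compile time: N is a constant, so the compiler can plan
// unrolling and vector width for the whole loop up front.
template <std::size_t N>
double dot_fixed(const std::array<double, N>& a, const std::array<double, N>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < N; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Length known only at run time: same arithmetic, but the trip count is
// data dependent, so the compiler has to generate code for any length.
double dot_dynamic(const std::vector<double>& a, const std::vector<double>& b) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Both functions return the same number for the same inputs; the only difference is how much the compiler knows about the loop before the code runs.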
In fact, now there is an advantage to being well capitalized in ALGO trading. It is not really the case that everybody has to use the same off-the-shelf systems for their low latency computing infrastructure. In certain cases (read: low latency trading) the playing field is no longer level. Competitors may have to use the same off-the-shelf components to build their systems, but that’s where the similarity ends. Using commodity parts and Open Server you can build, test, and deploy a DynaRack that could net your Firm/Consortium a microprocessor generation performance advantage in all your ALGO executions running totally conventional vanilla C++ code compiled through ICC/AVX/MKL (no Java to FPGA bets required). Moreover, the sunk cost in engineering to start up will be amortized over a couple more generations of silicon (Broadwell and Skylake), across asset classes, e.g., Equity and Fixed Income ALGO trading (SEF ramp up on the back of Dodd-Frank), and across regional markets (NY, Chicago, London, TK, HK, Singapore, etc.). The Open Server racks might not be as simple as the Facebook servers, but they’re not rocket science either. You just need a reliable fast turnaround that puts Haswell chips in COLO production for your ALGO/AVX code execution while the other guy is running a boatload of Sandy Bridge servers. What did that guy say? The internet is made of fast. It’s like Bodek observed, they will be making their systems faster so they can pay you more money. Reminds me of an old poem Grandpa Emilio Lazardo Wisty used to recite to us little kids all those many years ago every New Year’s Day after the family brunch. It started like this:
Big is back,
This approach makes perfect sense given the dual-issue nature of the core and would substantially simplify the fabric and interconnect design. To deliver nearly 2-3× higher performance in KNL at similar frequency, a naïve approach requires at least twice the number of cores and a significantly more complex interconnect. Doubling the per-core FLOP/s decreases the pressure on the interconnect, which is critical. For throughput architectures the biggest challenge is not computation, but the data movement and attendant behavior such as coherency, so increasing the complexity of the core to simplify the interconnect is a logical step.
Turning to the memory hierarchy in the KNL core, there is relatively little information available, but it is quite possible to sketch out the most likely configuration. AVX-512 is a load-op instruction set that can source one operand from memory for each operation (typically assumed to be a fused multiply-add or FMA). This means that the KNL core must have a 512-bit (64B) wide load pipeline for each vector unit. In contrast, the Silvermont memory cluster contains separate and dedicated load and store pipelines that are only 16B wide. To ensure performance compatibility, the KNL cache will be at least 32KB and 8-way associative with an aggregate read bandwidth of 128B/cycle. Given the 1-1.5GHz target frequency, this translates to 128-192GB/s, which is slightly lower than the L1D bandwidth in Haswell.
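The bandwidth arithmetic in that paragraph can be checked with a few lines of C++. This is a sketch under one assumption that the excerpt only implies: two vector units per core, each needing a 64B load per cycle, which is what makes the aggregate 128B/cycle. The function names are made up for illustration.

```cpp
// Bytes the L1D must deliver per cycle: one full-width load per vector unit.
// Assumption: the caller supplies the number of vector units (2 here,
// inferred from the 128B/cycle aggregate in the text).
constexpr int l1_bytes_per_cycle(int load_width_bytes, int vector_units) {
    return load_width_bytes * vector_units;
}

// bytes/cycle * cycles/ns (i.e. GHz) = bytes/ns = GB/s
constexpr double bandwidth_gbs(int bytes_per_cycle, double clock_ghz) {
    return bytes_per_cycle * clock_ghz;
}

// A 512-bit load is 64 bytes; two of them per cycle gives 128B/cycle.
static_assert(l1_bytes_per_cycle(64, 2) == 128, "two 64B loads per cycle");
```

Plugging in the 1.0 and 1.5 GHz endpoints, `bandwidth_gbs(128, 1.0)` and `bandwidth_gbs(128, 1.5)` give 128 and 192 GB/s, matching the range quoted above.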
To summarize, most of the rumors concerning the Knights Landing core are correct, although we can add a fair bit of information regarding the actual memory pipeline and L1D cache design. Each core will offer 32 FLOPs/clock and 128B/cycle of bandwidth from the L1 data cache. To hit the stated performance goals, it is likely that Knights Landing will target 1.3-1.4GHz, but a more cautious estimate is 1.1-1.5GHz.
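For what it's worth, the 32 FLOPs/clock figure falls out of simple counting if you assume double precision, two 512-bit vector units per core, and one FMA (counted as two FLOPs) per lane per cycle. Those assumptions are mine, read off the dual-issue and 128B/cycle remarks in the excerpt, and the function names below are made up for illustration.

```cpp
// Peak FLOPs per clock for one core:
// lanes per vector * 2 FLOPs per FMA * number of vector units.
constexpr int flops_per_clock(int vector_bits, int element_bits, int vector_units) {
    return (vector_bits / element_bits) * 2 * vector_units;
}

// FLOPs/clock * GHz = GFLOP/s for a single core.
constexpr double core_gflops(int flops_per_clk, double clock_ghz) {
    return flops_per_clk * clock_ghz;
}

// 512-bit vectors of 64-bit doubles = 8 lanes; 8 * 2 * 2 = 32.
static_assert(flops_per_clock(512, 64, 2) == 32, "matches the 32 FLOPs/clock in the text");
```

At the cautious 1.1-1.5 GHz range that works out to roughly 35 to 48 peak GFLOP/s per core, before you multiply by however many cores end up on the die.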
Computational Efficiency for CPUs and GPUs in 2012, here. Kanter is solid.