Pink Iguana

Tag Archives: Intel

TSX Nerfed

Simon Sharwood, The Register, Intel disables hot new TSX tech in early Broadwells and Haswells, here.

One of Intel’s new ways to make software go faster is called Transactional Synchronization Extensions (TSX), an innovation that gives developers fine control over how multi-threaded code uses a CPU’s resources.

See here for the August 2014 Intel Xeon E3-1200 v3 spec update.

Evaluation of Vectorizing Compilers

Maleki et al., An Evaluation of Vectorizing Compilers, here. If you need the FP juice, just code directly to the intrinsics. Your code will look like that Intel Black Scholes code.

Abstract—Most of today’s processors include vector units that have been designed to speedup single threaded programs. Although vector instructions can deliver high performance, writing vector code in assembly language or using intrinsics in high level languages is a time consuming and error-prone task. The alternative is to automate the process of vectorization by using vectorizing compilers.

This paper evaluates how well compilers vectorize a synthetic benchmark consisting of 151 loops, two applications from Petascale Application Collaboration Teams (PACT), and eight applications from Media Bench II. We evaluated three compilers: GCC (version 4.7.0), ICC (version 12.0) and XLC (version 11.01). Our results show that despite all the work done in vectorization in the last 40 years 45-71% of the loops in the synthetic benchmark and only a few loops from the real applications are vectorized by the compilers we evaluated.
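To make the abstract concrete, here is a minimal C sketch of the two ends of the spectrum. The function names are hypothetical and the loops are not taken from the paper's benchmark; the first has independent iterations, which is the kind of loop auto-vectorizers typically handle, while the second carries a dependence from one iteration to the next and cannot be vectorized as written.

```c
#include <stddef.h>

/* Independent iterations: y[i] depends only on x[i] and y[i].
   'restrict' promises the arrays do not overlap, which is the kind of
   guarantee an auto-vectorizer generally needs before it will act. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Loop-carried dependence: iteration i reads the value written by
   iteration i-1, so a straightforward SIMD rewrite is not legal. */
void prefix_sum(size_t n, float *x)
{
    for (size_t i = 1; i < n; i++)
        x[i] += x[i - 1];
}
```

Whether a given compiler actually vectorizes the first loop depends on its version and flags, so check the compiler's vectorization report rather than assuming.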

Haswell i7-4771 Available for Pre-order

Tarun Iyer, Tom’s Hardware, Intel Core i7-4771 Quad-Core CPU Available for Pre-Order, here. DynaRack this. I think even if there were Ivy Bridge servers available, the AVX2 on the Haswell will leave the theoretical Ivy Bridge servers far behind. We are looking at 6-9 months where competing low latency pre-trade analytics is hopelessly slow and noncompetitive, even though they would be faster than the current Sandy Bridge dinosaurs in the colo. This is how the server manufacturers are going to roll for the foreseeable future: SLOW. The delicious part is that for a big chunk of the low latency trading analytics there will be no really significant difference in how fast their code runs on Sandy Bridge versus Haswell. You need to recode to AVX2 and rebuild your code to see the FP juice. Pretty sure gcc will not see the juice without extraordinary measures, and most of these folks are doing smart guy stuff like OCaml and Erlang, which are totally oblivious to the realities of keeping a floating point pipeline full with two FMA units to schedule, as we have already covered.
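As a rough illustration of what "recode and rebuild" means, here is a sketch of a polynomial evaluation (Horner's rule) that uses 256-bit fused multiply-add intrinsics when the build enables them (for example with -mavx2 -mfma on GCC or Clang) and falls back to scalar code otherwise. The function name poly_eval and its signature are made up for this example; the intrinsics themselves are the standard ones from immintrin.h.

```c
#include <stddef.h>
#if defined(__AVX2__) && defined(__FMA__)
#include <immintrin.h>
#endif

/* Evaluate c[0] + c[1]*x + ... + c[deg]*x^deg for every x[i]. */
void poly_eval(size_t n, const double *x, double *y, const double *c, int deg)
{
    size_t i = 0;
#if defined(__AVX2__) && defined(__FMA__)
    /* Four doubles per 256-bit register, one FMA per Horner step. */
    for (; i + 4 <= n; i += 4) {
        __m256d vx  = _mm256_loadu_pd(x + i);
        __m256d acc = _mm256_set1_pd(c[deg]);
        for (int k = deg - 1; k >= 0; k--)
            acc = _mm256_fmadd_pd(acc, vx, _mm256_set1_pd(c[k]));
        _mm256_storeu_pd(y + i, acc);
    }
#endif
    /* Scalar path: the whole array without AVX2/FMA, otherwise the tail. */
    for (; i < n; i++) {
        double acc = c[deg];
        for (int k = deg - 1; k >= 0; k--)
            acc = acc * x[i] + c[k];
        y[i] = acc;
    }
}
```

Built without those flags, the same source compiles to the scalar loop only, which is the point of the post: existing binaries see none of the new FP throughput. Note also that a single Horner chain is latency-bound; keeping two FMA units busy would take several independent vectors in flight, which this sketch does not attempt.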

Intel’s upcoming Core i7-4771 processor (BX80646I74771) has made an appearance and offers four processing cores, a clock rate of 3.5 GHz that can be Turbo-Boosted to 3.9 GHz and an 8 MB L3 Cache.

Time for DynaRack

Tiernan Ray, Barron’s, The PC Market Isn’t Dead Yet, but It’s barely Breathing, here. There has to be a bunch of top-shelf motherboard and rack designers you can pick up in this decline.

Last October, to the bemusement of many, we featured a tombstone on our cover with the words “R.I.P. PC.” So far, it may not be death for the personal computer, but it is surely a very unhealthy state of affairs.

Personal-computer shipments in the first half of this year were down 11%. IDC called the first quarter the worst in all the years it has tracked PC shipments. Gartner, meanwhile, pointed out that the second quarter made it five quarters in a row of year-over-year declines, the longest period of decline on record.

At 76 million units shipped last quarter, we’re back to the level of PC shipments last seen in early 2008, heading into recession.

Vectorized Black Scholes Code from Intel

Intel Corporation, Overview of Vector Mathematics in Intel Math Kernel Library, 2012, here. I assume these are the Moscow Intel folks. We can back out the number of cycles per Black Scholes element on a single core; it’s probably 20 to 30 percent better than the 2007 cycle counts on the XLC/MASS POWER6 (it was like 170 cycles all in).

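void BlackScholesFormula( int nopt, tfloat r, tfloat sig, tfloat s0[], tfloat x[], tfloat t[], tfloat vcall[], tfloat vput[] )
{
    /* Excerpt: the declarations of j, sig_2 (presumably sig * sig) and of the
       work arrays Div, Log, tr, tss, tss05, mtr, Exp, InvSqrt, w1, w2 are not
       shown in the source. */

    vmlSetMode( VML_EP );    /* VML enhanced-performance accuracy mode */

    DIV(s0, x, Div); LOG(Div, Log);    /* Log = ln(s0 / x) */

    for ( j = 0; j < nopt; j++ ) {
        tr [j] = t[j] * r;
        tss[j] = t[j] * sig_2;
        tss05[j] = tss[j] * HALF;
        mtr[j] = -tr[j];
    }

    EXP(mtr, Exp); INVSQRT(tss, InvSqrt);    /* discount factor, 1 / (sig * sqrt(t)) */

    for ( j = 0; j < nopt; j++ ) {
        w1[j] = (Log[j] + tr[j] + tss05[j]) * InvSqrt[j] * INV_SQRT2;    /* d1 / sqrt(2) */
        w2[j] = (Log[j] + tr[j] - tss05[j]) * InvSqrt[j] * INV_SQRT2;    /* d2 / sqrt(2) */
    }

    ERF(w1, w1); ERF(w2, w2);

    for ( j = 0; j < nopt; j++ ) {
        w1[j] = HALF + HALF * w1[j];    /* N(d1) */
        w2[j] = HALF + HALF * w2[j];    /* N(d2) */
        vcall[j] = s0[j] * w1[j] - x[j] * Exp[j] * w2[j];
        vput[j] = vcall[j] - s0[j] + x[j] * Exp[j];    /* put-call parity */
    }
}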

Intel Xeon Phi Drags Pfister From Skyrim

Greg Pfister, The Perils of Parallel, Intel Xeon Phi Announcement (& me), here. Check out the Intel links at the end of the post.

Number one is their choice as to the first product. The one initially out of the blocks is, not a lower-performance version, but rather the high end of the current generation: The one that costs more ($2649) and has high performance on double precision floating point. Intel says it’s doing so because that’s what its customers want. This makes it extremely clear that “customers” means the big accounts – national labs, large enterprises – buying lots of them, as opposed to, say, Prof. Joe with his NSF grant or Sub-Department Kacklefoo wanting to try it out. Clearly, somebody wants to see significant revenue right now out of this project after so many years. They have had a reasonably-sized pre-product version out there for a while, now, so it has seen some trial use. At national labs and (maybe?) large enterprises.

I fear we have closed in on an incrementally-improving era of computing, at least on the hardware and processing side, requiring inhuman levels of effort to push the ball forward another picometer. Just as well I’m not hanging my living on it any more.

Apple is a mess, or is great, but they don’t want Intel chips for their desktops much longer.

Henry Blodget, Business Insider, EX-APPLE ENGINEER: ‘Almost Everything That Apple Does That Involves The Internet Is A Mess’, here. This whole "Akio Morita is to Steve Jobs as Sony is to Apple" thing is going to go on for a while. Yarow has a post on the same topic, here.

John Gruber, Daring Fireball, Seriously, Apple is Doomed, here. Gruber could go toe to toe with Blodget on this.

Long-term, the verdict is out. Jobs has only been gone for a year. Apple has yet to do a Big New Thing without him. The retention of talent remains their biggest risk, and Forstall’s departure highlights that. But in terms of innovation-without-Jobs so far, I’d say going from the original slow chunky iPad in April 2010 to the retina super-fast iPad 4 and svelte iPad Mini today is a pretty brisk clip. Two and a half years later Apple offers two very different iPads that both completely blow the original one away — and the original one is now almost universally hailed as a landmark innovation in the history of personal computing.

Cringely, While the Intel board was firing Paul Otellini they should have fired themselves, too, here. Apple is ditching Intel for its desktops so Otellini gets fired. The Haswell bet seems like less of a sure thing, now.

The company was too busy fighting AMD to notice the rise of mobile. And while the pundits are correctly saying ARM-this and ARM-that in their analysis of the Intel mobile debacle, the source of the successor technology is less important than the fact that the two largest high-end mobile manufacturers of all — Apple and Samsung — are making their own processors. They will never be Intel customers again.

Intel WiFi Inside and 7 Gbps wireless docking

Sean Gallagher, Ars Technica, Intel researchers put WiFi inside—the processor, that is, here. The futuristic view of supercomputers looking like smoking hairy golfballs just got a shave. Maybe just looking for smoking golf balls (hot 3D silicon) now.

At the Intel Developer Forum in San Francisco, Intel Chief Technology Officer Justin Rattner unveiled a pair of technologies coming out of Intel Labs that will overcome many of the size and power limits that have stood in the way of integrating radio technology more tightly with computers and other digital devices. The first, what Intel calls the “Moore’s Law Radio,” is a complete WiFi transceiver on a 32-nanometer scale silicon chip; the second, called Rosepoint, is a complete system-on-a-chip that integrates two Atom processor cores with a digital WiFi transceiver.

Lucas Mearian, Computerworld, Intel demos 7Gbps wireless docking, here.

Intel on Thursday demonstrated multi-gigabit wireless docking technology that affords speeds of up to 7Gbps, 10 times the rate of the fastest Wi-Fi networks based on the IEEE 802.11n standard.

At its annual Intel Developers Forum, the chip maker demonstrated Wireless Gigabit (WiGig) docking technology using an ultrabook. The company said WiGig is on track to becoming the most important next-generation multi-gigabit wireless technology.

Intel CTO Justin Rattner said there will come a day when an ultrabook or tablet can be dropped anywhere on a desk and automatically connect to a display monitor and peripherals.

Network, Intel, Fulcrum, Infiniband, and Cray …Yahtzee!

Michael Feldman, HPCWire, Intel Weaves Strategy To Put Interconnect Fabrics On Chip, here. Boom goes the RDMA connection. All the Pluto Switch has to guarantee is that the packets get delivered in the same order as originally sent. The interleaving of the serial streams is important but probably not a huge deal.

Making the interconnect logic a first-class citizen on the processor, rather than just an I/O device would be a huge paradigm shift for the server market. If successfully executed at Intel, other chip vendors will be forced to follow suit. (AMD is likely already conjuring up something similar with the fabric technology it acquired from SeaMicro.) Meanwhile makers of discrete NICs and host adapter silicon will have to rethink their strategy, perhaps allying themselves with other chipmakers to offer competitive products.

Vectorizing AVX/SSE Links

Walking Randomly, Vectorizing code to take advantage of modern CPUs, here.

I’ve been playing with AVX vectorisation on Sandy Bridge CPUs off and on for a while now and thought that I’d write up a little of what I’ve discovered.  The basic idea of vectorisation is that each core in a modern CPU can operate on multiple values (i.e. a vector) simultaneously per instruction cycle.

Sandy bridge (and the newer Ivy Bridge) processors have 256bit wide vector units which means that each CORE can perform certain operations on up to eight 32-bit floats or four 64-bit doubles per clock cycle.  So, on a quad core you have 4 vector units (one per core) and could operate on up to 16 doubles or 32 floats per clock cycle.
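The arithmetic in that paragraph is just register width divided by element width, times the number of cores. A tiny sketch (the helper name lanes_per_cycle is made up for this example, and it keeps the post's simplifying assumption of one full-width vector operation per core per cycle):

```c
/* Elements processed per clock across all cores, assuming each core
   issues one full-width vector operation per cycle. */
unsigned lanes_per_cycle(unsigned vector_bits, unsigned element_bits,
                         unsigned cores)
{
    return (vector_bits / element_bits) * cores;
}
```

With 256-bit vectors, lanes_per_cycle(256, 32, 1) gives the eight floats per core, and lanes_per_cycle(256, 64, 4) and lanes_per_cycle(256, 32, 4) give the quad-core figures of 16 doubles and 32 floats.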

This all sounds great so how does a programmer actually make use of this neat hardware trick?  There are many routes:-

Intel, Intel SPMD Program Compiler, here. Wow.

ispc is a compiler for a variant of the C programming language, with extensions for “single program, multiple data” (SPMD) programming. Under the SPMD model, the programmer writes a program that generally appears to be a regular serial program, though the execution model is actually that a number of program instances execute in parallel on the hardware. (See the ispc documentation for more details and examples that illustrate this concept.)
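The execution model can be emulated in plain C to see what "a number of program instances" means. This is not ispc syntax and not how ispc generates code; it is a sketch in which the names program_body, spmd_saxpy, and PROGRAM_COUNT are made up, with the inner loop standing in for the gang of instances that ispc would map onto SIMD lanes.

```c
#include <stddef.h>

#define PROGRAM_COUNT 8   /* gang width; 8 matches an 8-wide AVX float unit */

/* The "program" as the programmer sees it: serial-looking code that
   handles a single element. */
static float program_body(float a, float x, float y)
{
    return a * x + y;
}

/* Emulated SPMD execution: each gang runs PROGRAM_COUNT instances of
   program_body side by side; instances that fall past the end of the
   data are masked off. */
void spmd_saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t base = 0; base < n; base += PROGRAM_COUNT) {
        for (size_t program_index = 0; program_index < PROGRAM_COUNT;
             program_index++) {
            size_t i = base + program_index;
            if (i < n)   /* execution mask for the final, partial gang */
                y[i] = program_body(a, x[i], y[i]);
        }
    }
}
```

In real ispc the inner loop disappears: the compiler turns the program body into vector instructions so that all instances in a gang execute together, one per SIMD lane.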

ispc compiles a C-based SPMD programming language to run on the SIMD units of CPUs and the Intel Xeon Phi™ architecture; it frequently provides a 3x or more speedup on CPUs with 4-wide vector SSE units and 5x-6x on CPUs with 8-wide AVX vector units, without any of the difficulty of writing intrinsics code. Parallelization across multiple cores is also supported by ispc, making it possible to write programs that achieve performance improvement that scales by both number of cores and vector unit size.

There are a few key principles in the design of ispc:

  • To build a small set of extensions to the C language that would deliver excellent performance to performance-oriented programmers who want to run SPMD programs on the CPU.
  • To provide a thin abstraction layer between the programmer and the hardware—in particular, to have an execution and data model where the programmer can cleanly reason about the mapping of their source program to compiled assembly language and the underlying hardware.
  • To make it possible to harness the computational power of SIMD vector units without the extremely low-programmer-productivity activity of directly writing intrinsics.
  • To explore opportunities from close coupling between C/C++ application code and SPMD ispc code running on the same processor—to have lightweight function calls between the two languages and to share data directly via pointers without copying or reformatting.

ispc is an open source compiler with a BSD license. It uses the remarkable LLVM Compiler Infrastructure for back-end code generation and optimization and is hosted on github. It supports Windows, Mac, and Linux, with both x86 and x86-64 targets. It currently supports the SSE2, SSE4, AVX1, AVX2, and Xeon Phi “Knight’s Corner” instruction sets.