To push the Secret Garden Leave Project I want to run the Intel Black Scholes code, but I do not have the MKL library on my machine here. This highlights for me that in Street FinQuant you really want ICC/MKL, just so your programmers don’t have to spend time running around to find suitable backward-compatible alternatives, if you actually need competitive code execution performance. But since we have no money we have to think, right? Probably the best thing to do is simply rewrite vectorized versions of exp(), log(), invsqrt(), div(), and erf() as placeholders to get the Intel code running. That’ll cost a day or two to get the right Chebyshev approximations up and running. If there are MKL-like libs out there for gcc I would use them, maybe the AMD ACML? We will see. Nope: click on the AMD link and AMD will support a variety of compilers and a couple of platforms, as long as they are Fortran compilers. I did not know the Quake III history of the inverse square root function; too much time looking at Street FinQuant and you miss what is really happening.
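As a rough idea of what one of those placeholder approximations could look like, here is a minimal scalar sketch of a Chebyshev fit plus Clenshaw evaluation. It is not the Intel code and not vectorized; the names `cheby_fit` and `cheby_eval`, the interval, and the number of terms are all my own assumptions for illustration.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper: Chebyshev coefficients of f on [a, b],
// obtained by sampling f at the n Chebyshev nodes of the interval.
std::vector<double> cheby_fit(double (*f)(double), double a, double b, int n) {
    const double pi = 3.14159265358979323846;
    std::vector<double> fx(n), c(n);
    for (int k = 0; k < n; ++k) {
        double t = std::cos(pi * (k + 0.5) / n);       // node in [-1, 1]
        fx[k] = f(0.5 * (b - a) * t + 0.5 * (b + a));  // node mapped to [a, b]
    }
    for (int j = 0; j < n; ++j) {
        double s = 0.0;
        for (int k = 0; k < n; ++k)
            s += fx[k] * std::cos(pi * j * (k + 0.5) / n);
        c[j] = 2.0 * s / n;
    }
    return c;
}

// Hypothetical helper: Clenshaw recurrence for
// f(x) ~ c[0]/2 + sum_{j>=1} c[j] * T_j(t), with t the image of x in [-1, 1].
double cheby_eval(const std::vector<double>& c, double a, double b, double x) {
    double t = (2.0 * x - a - b) / (b - a);
    double d1 = 0.0, d2 = 0.0;
    for (int j = static_cast<int>(c.size()) - 1; j >= 1; --j) {
        double tmp = 2.0 * t * d1 - d2 + c[j];
        d2 = d1;
        d1 = tmp;
    }
    return t * d1 - d2 + 0.5 * c[0];
}
```

Fitting `exp` on [-1, 1] with a dozen terms, e.g. `cheby_fit([](double x) { return std::exp(x); }, -1.0, 1.0, 12)`, should already agree with `std::exp` to many digits on that interval; a real replacement would also need range reduction and a SIMD evaluation loop, which this sketch leaves out.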

**grunttheperson**, Simple SSE and SSE2 (and now NEON) optimized sin, cos, log, and exp, here.

The story

I have spent quite a while looking for a simple (but fast) SSE version of some basic transcendental functions (sines and exponentials). On the Mac, you get vsinf and friends (in the Accelerate framework), which are nice (there is a ppc version and an intel version, Apple rox) but closed-source and restricted to macOS.

Both Intel and AMD have some sort of vector math library with SIMD sines and cosines, but

- Intel MKL is not free (neither as beer, nor as speech)
- AMD ACML is free, but no source is available. Moreover, the vector functions are only available on 64-bit OSes!
- Would you trust the Intel MKL to run at full speed on AMD hardware?

Some time ago, I found out about the Intel Approximate Math library. This one is completely free and open-source, and it provides SSE and SSE2 versions of many functions. But it has two drawbacks:

- It is written as inline assembly, MASM style. The source is very targeted at MSVC/ICC, so it is painful to use with gcc.
- As the name implies, it is approximate. And, well, I had no use for a sine which has garbage in the last ten bits.

However, it served as a great source of inspiration for the sin_ps, cos_ps, exp_ps and log_ps provided below.

**Tuomas Tonteri**, A practical guide to SSE SIMD with C++, 2009, here.

Current personal computer CPUs have the capability for up to four times faster single precision floating point calculations when utilizing SSE instructions. Unfortunately, the learning curve is steep and good documentation on the subject is scarce. In fact, most of what I could find was endless reference manuals listing available instructions and short tutorials, with little discussion of SSE and general SIMD design concerns. The exception to this was Intel’s optimization reference manual, but it is very low level in nature, and its examples are mostly in x86 assembly. Additionally, Intel’s manual is not at liberty to discuss SSE’s weak and strong points without bias, as it must help to sell CPUs.

The very first problem with SSE is how to access the instructions without resorting to writing x86 assembly or buying Intel’s libraries. To this end, some C/C++ compilers come with so-called SSE intrinsics headers. These provide a set of C functions, which give almost direct access to the vectorized instructions and the needed packed data types. Unfortunately, coding with the C intrinsics is very inconvenient and results in unreadable code. I make the case here that this can be solved with C++ operator overloading capabilities without sacrificing performance. Additionally, each version of SSE is accessed by a different intrinsics header, and the correct selection and detection should be handled by the wrapping C++ class.
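To make the operator overloading idea concrete, here is a minimal sketch of such a wrapper. It is not the class from the guide: the name `Vec4f` and the `saxpy` helper are hypothetical, it covers only baseline SSE on x86/x86-64, and it uses unaligned loads and stores to keep the example simple.

```cpp
#include <xmmintrin.h>  // baseline SSE intrinsics (x86 / x86-64 only)

// Hypothetical wrapper around one SSE register holding four floats.
struct Vec4f {
    __m128 v;
    Vec4f() : v(_mm_setzero_ps()) {}
    explicit Vec4f(float s) : v(_mm_set1_ps(s)) {}  // broadcast one scalar
    Vec4f(__m128 m) : v(m) {}
    // Unaligned load/store: no alignment requirement on the pointer.
    static Vec4f load(const float* p) { return Vec4f(_mm_loadu_ps(p)); }
    void store(float* p) const { _mm_storeu_ps(p, v); }
};

// Each operator maps onto a single packed SSE instruction.
inline Vec4f operator+(Vec4f a, Vec4f b) { return _mm_add_ps(a.v, b.v); }
inline Vec4f operator-(Vec4f a, Vec4f b) { return _mm_sub_ps(a.v, b.v); }
inline Vec4f operator*(Vec4f a, Vec4f b) { return _mm_mul_ps(a.v, b.v); }
inline Vec4f operator/(Vec4f a, Vec4f b) { return _mm_div_ps(a.v, b.v); }

// y[i] = a * x[i] + y[i], four lanes at a time.
// Assumption: n is a multiple of 4 (no scalar tail loop in this sketch).
inline void saxpy(float a, const float* x, float* y, int n) {
    const Vec4f va(a);
    for (int i = 0; i < n; i += 4) {
        Vec4f r = va * Vec4f::load(x + i) + Vec4f::load(y + i);
        r.store(y + i);
    }
}
```

The loop body reads like ordinary scalar arithmetic, which is the readability argument; whether it really costs nothing compared with raw intrinsics depends on the compiler inlining the wrapper, so it is worth checking the generated assembly.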

The second problem is that converting algorithms to effectively use even width-four SIMD, as used by SSE, is most of the time a very nontrivial task. In fact, depending on the problem domain, vectorization is often not worth the trouble for the possible benefit. However, in some cases it is the difference between rendering an image at 60 frames per second versus 15 frames per second, or running a scientific calculation in a week instead of a month.

This guide addresses both of the above-mentioned problems. Several algorithms will be transformed into a SIMD design, and the practical difficulties that arise will be discussed. A convenient way to access the SSE extensions with the C++ operator overloading capabilities will be demonstrated. Performance benefits will be determined by benchmarking and by evaluating the compiler’s instruction output.

**Wikipedia**, Fast inverse square root, here. Makes sense that this would be Kahan’s code.

The source code for *Quake III* was not released until QuakeCon 2005, but copies of the fast inverse square root code appeared on Usenet and other forums as early as 2002 or 2003.[1] Initial speculation pointed to John Carmack as the probable author of the code, but he demurred and suggested it was written by Terje Mathisen, an accomplished assembly programmer who had previously helped id Software with *Quake* optimization. Mathisen had written an implementation of a similar bit of code in the late 1990s, but the original authors proved to be much further back in the history of 3D computer graphics with Gary Tarolli’s implementation for the SGI Indigo as a possible earliest known use. Rys Sommefeldt concluded that the original algorithm was devised by Greg Walsh at Ardent Computer in consultation with Cleve Moler of MATLAB fame.[19] Cleve Moler learned about this trick from code written by Velvel Kahan and K.C. Ng at Berkeley around 1986 (see the comment section at the end of fdlibm code for sqrt[20]).[21] Jim Blinn also demonstrated a simple approximation of the inverse square root in a 1997 column for *IEEE Computer Graphics and Applications*.[22]