Pink Iguana

AVX 512 and Scatter-Gather

Colfax Research, 11 May 2016, Guide to Automatic Vectorization with Intel AVX-512 Instructions in Knights Landing Processors, here.

This publication is part of a developer guide focusing on the new features in 2nd generation Intel® Xeon Phi™ processors code-named Knights Landing (KNL). In this document, we focus on the new vector instruction set introduced in Knights Landing processors, Intel® Advanced Vector Extensions 512 (Intel® AVX-512). The discussion includes:

  • Introduction to vector instructions in general,

  • The structure and specifics of AVX-512, and

  • Practical usage tips: checking if a processor has support for various features, compilation process and compiler arguments, and pros and cons of explicit and automatic vectorization using the Intel® C++ Compiler and the GNU Compiler Collection.
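As a rough illustration of the access pattern behind the heading, here is a minimal C sketch of a gather loop and a scatter loop. It is not taken from the Colfax guide; the function names are hypothetical. Whether a compiler turns these loops into AVX-512 gather or scatter instructions depends on the compiler, the optimization level, and the target flags (for example GCC with `-O3 -mavx512f`), so treat the vectorization itself as an assumption to check in the optimization report.

```c
#include <stddef.h>

/* Gather: out[i] = table[idx[i]]. The indexed load is the access pattern
   that vector gather instructions target. The restrict qualifiers tell
   the compiler the arrays do not overlap, which helps auto-vectorization. */
void gather_f32(float *restrict out, const float *restrict table,
                const int *restrict idx, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = table[idx[i]];
}

/* Scatter: table[idx[i]] = in[i]. If idx contains duplicates, the result
   depends on iteration order, which is one reason a compiler may decline
   to vectorize this loop without further hints. */
void scatter_f32(float *restrict table, const float *restrict in,
                 const int *restrict idx, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        table[idx[i]] = in[i];
}
```

Both functions are plain portable C, so they run correctly with or without vector hardware; only the speed differs.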

Capabilities of Intel AVX-512 in Intel Xeon Scalable Processors (Skylake), here.

This paper reviews the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instruction set and answers two critical questions:

  1. How do Intel® Xeon® Scalable processors based on the Skylake architecture (2017) compare to their predecessors based on Broadwell due to AVX-512?
  2. How are Intel Xeon processors based on Skylake different from their alternative, Intel® Xeon Phi™ processors with the Knights Landing architecture, which also feature AVX-512?

We address these questions from the programmer’s perspective by demonstrating C language code of microkernels benefitting from AVX-512. For each example, we dig deeper and analyze the compilation practices, resultant assembly, and optimization reports.

In addition to code studies, the paper contains performance measurements for a synthetic benchmark with guidelines on estimating peak performance. In conclusion, we outline the workloads and application domains that can benefit from the new features of AVX-512 instructions.

GNU, Fast Scatter Gather IO

13.6 Fast Scatter-Gather I/O

Some applications may need to read or write data to multiple buffers, which are separated in memory. Although this can be done easily enough with multiple calls to read and write, it is inefficient because there is overhead associated with each kernel call. Instead, many platforms provide special high-speed primitives to perform these scatter-gather operations in a single kernel call. The GNU C Library will provide an emulation on any system that lacks these primitives, so they are not a portability threat. They are defined in sys/uio.h. These functions are controlled with arrays of iovec structures, which describe the location and size of each buffer.
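A minimal sketch of the two calls the manual section goes on to describe, `writev` and `readv`, might look like this. The wrapper names `write_two` and `read_two` are hypothetical, and error handling is left to the caller, who should check for a return value of -1.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Write a header and a body, held in separate buffers, with one writev
   call. Returns the number of bytes written, or -1 on error. */
ssize_t write_two(int fd, const char *header, const char *body)
{
    struct iovec iov[2];
    iov[0].iov_base = (void *)header;  /* iov_base is void *, so drop const */
    iov[0].iov_len  = strlen(header);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len  = strlen(body);
    return writev(fd, iov, 2);
}

/* Read into two separate buffers with one readv call. The buffers are
   filled in array order: a is filled completely before b receives data. */
ssize_t read_two(int fd, void *a, size_t a_len, void *b, size_t b_len)
{
    struct iovec iov[2];
    iov[0].iov_base = a;
    iov[0].iov_len  = a_len;
    iov[1].iov_base = b;
    iov[1].iov_len  = b_len;
    return readv(fd, iov, 2);
}
```

Each `iovec` entry is one (pointer, length) pair, so adding a third buffer is just a third array element and a count of 3.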

Anderson, Malik, Gregg, 19 Jan 2016, HiPEAC, Prague, Automatic Vectorization of Interleaved Data Revisited, here.

Mike Schiffman, 17 Apr 2018, Farsight Security Blog, Google Protocol Buffer Deserialization The Hard Way, here.




Check in w. Reversing the Biosphere

Veronique Greenwood, 19 Apr 2018, Quanta, How many genes do cells need? Maybe almost all of them, here.

By knocking out genes three at a time, scientists have painstakingly deduced the web of genetic interactions that keeps a cell alive. Researchers long ago identified essential genes that yeast cells can’t live without, but new work, which appears today in Science, shows that looking only at those gives a skewed picture of what makes cells tick: Many genes that are inessential on their own become crucial as others disappear. The result implies that the true minimum number of genes that yeast — and perhaps, by extension, other complex organisms — need to survive and thrive may be surprisingly large.

About 20 years ago, Charles Boone and Brenda Andrews decided to do something slightly nuts. The yeast biologists, both professors at the University of Toronto, set out to systematically destroy or impair the genes in yeast, two by two, to get a sense of how the genes functionally connected to one another. Only about 1,000 of the 6,000 genes in the yeast genome, or roughly 17 percent, are considered essential for life: If a single one of them is missing, the organism dies. But it seemed that many other genes whose individual absence was not enough to spell the end might, if destroyed in tandem, sicken or kill the yeast. Those genes were likely to do the same kind of job in the cell, the biologists reasoned, or to be involved in the same process; losing both meant the yeast could no longer compensate.

Queensland University of Technology, 2 May 2018, Math sheds light on how living cells ‘think’, here.

Queensland University of Technology (QUT) researcher Dr. Robyn Araujo has developed new mathematics to solve a longstanding mystery of how the incredibly complex biological networks within cells can adapt and reset themselves after exposure to a new stimulus.

Her findings, published in Nature Communications, provide a new level of understanding of cellular communication and cellular ‘cognition’, and have potential application in a variety of areas, including new targeted cancer therapies and drug resistance.


Carrie Arnold, 2 May 2018, Quanta, Cells talk in a language that looks like viruses, here.

For cells, communication is a matter of life and death. The ability to tell other members of your species — or other parts of the body — that food supplies are running low or that an invading pathogen is near can be the difference between survival and extinction. Scientists have known for decades that cells can secrete chemicals into their surroundings, releasing a free-floating message for all to read. More recently, however, scientists discovered that cells could package their molecular information in what are known as extracellular vesicles. Like notes passed by children in class, the information packaged in an extracellular vesicle is folded and delivered to the recipient.

The past five years have seen an explosion of research into extracellular vesicles. As scientists uncovered the secrets about how the vesicles are made, how they package their information and how they’re released, it became clear that there are powerful similarities between vesicles and viruses.

A small group of researchers, led by Leonid Margolis, a Russian-born virologist at the National Institute of Child Health and Human Development (NICHD), and Robert Gallo, the HIV pioneer at the University of Maryland School of Medicine, has proposed that this similarity is more than mere coincidence. It’s not just that viruses appear to hijack the cellular pathways used to make extracellular vesicles for their own production — or that cells have also taken on some viral components to use in their vesicles. Extracellular vesicles and viruses, Margolis argues, are part of a continuum of membranous particles produced by cells. Between these two extremes are lipid-lined sacs filled with a variety of genetic material and proteins — some from hosts, some from viruses — that cells can use to send messages to one another.

Megan Molteni, 3 May 2018, Wired, Biology will be the next great computing platform, here.

In some ways, Synthego looks like any other Silicon Valley startup. Inside its beige business park facilities, a five-minute drive from Facebook HQ, rows of nondescript black server racks whir and blink and vent. But inside the metal shelving, the company isn’t pushing around ones and zeros to keep the internet running. It’s making molecules to rewrite the code of life.

Crispr, the powerful gene-editing tool, is revolutionizing the speed and scope with which scientists can modify the DNA of organisms, including human cells. So many people want to use it—from academic researchers to agtech companies to biopharma firms—that new companies are popping up to staunch the demand. Companies like Synthego, which is using a combination of software engineering and hardware automation to become the Amazon of genome engineering. And Inscripta, which wants to be the Apple. And Twist Bioscience, which could be the Intel.

Carol Lynn Curchoe, 20 May 2018, Medium, Top 10 Crispiest CRISPR Applications, here.

Watch my full CRISPR address at The Oxford Union here.

There is a heady and hysterical goldrush to CRISPR ALL THE THINGS. And with good reason. These are not your grandpa’s GMOs.

“Second-generation” genome-editing tools can now precisely convert a single base into another without the need for a double-strand break or incorporating a gene from another organism. At the drop of a “nickase,” C can be converted to T, and A to G, generating a STOP codon and abolishing the need for complex knockout strategies. (Review CRISPR fundamentals here.)

Like the immeasurable heaven of the Laniakea supercluster, the applications of CRISPR seem to know no bounds. But, the most exciting applications for CRISPR have little to do with gene editing. At the rate of CRISPR publications (1000s per year), you may forgive yourself for not being able to stay up on the literature.

I have compiled some of my favorite (for about a minute) CRISPR applications. The breathless future of CRISPR means these will likely be overturned faster than an ubiquitinated protein.



Dhawal Shah, 27 Apr 2018, Quartz, Here are 300 free Ivy League university courses you can take online right now, here. Math and Programming listings are super light.

Computer Science (23 courses)

CS50’s Introduction to Computer Science
Harvard University

Algorithms, Part I
Princeton University

Algorithms, Part II
Princeton University

Machine Learning for Data Science and Analytics
Columbia University

Bitcoin and Cryptocurrency Technologies
Princeton University

Artificial Intelligence (AI)
Columbia University

Reinforcement Learning
Brown University

Computer Architecture
Princeton University

Machine Learning
Georgia Institute of Technology

Enabling Technologies for Data Science and Analytics: The Internet of Things
Columbia University

Machine Learning
Columbia University

Analysis of Algorithms
Princeton University

Networks Illustrated: Principles without Calculus
Princeton University

Machine Learning: Unsupervised Learning
Brown University

CS50’s Computer Science for Business Professionals
Harvard University

CS50’s AP® Computer Science Principles
Harvard University

HI-FIVE: Health Informatics For Innovation, Value & Enrichment (Administrative/IT Perspective)
Columbia University

Animation and CGI Motion
Columbia University

Networks: Friends, Money, and Bytes
Princeton University

CS50’s Understanding Technology
Harvard University

Data Structures and Software Design
University of Pennsylvania

Algorithm Design and Analysis
University of Pennsylvania

Computer Science: Algorithms, Theory, and Machines
Princeton University

Data Science (21 courses)

Statistical Thinking for Data Science and Analytics
Columbia University

Statistics and R
Harvard University

Introduction to Spreadsheets and Models
University of Pennsylvania

People Analytics
University of Pennsylvania

High-Dimensional Data Analysis
Harvard University

Introduction to Bioconductor: Annotation and Analysis of Genomes and Genomic Assays
Harvard University

Data Science: R Basics
Harvard University

Case Studies in Functional Genomics
Harvard University

Causal Diagrams: Draw Your Assumptions Before Your Conclusions
Harvard University

Big Data and Education
Columbia University

Principles, Statistical and Computational Tools for Reproducible Science
Harvard University

Data Science: Inference and Modeling
Harvard University

Data Science: Visualization
Harvard University

High-performance Computing for Reproducible Genomics
Harvard University

Data Science: Linear Regression
Harvard University

Data Science: Capstone
Harvard University

Data Science: Wrangling
Harvard University

Data Science: Machine Learning
Harvard University

Data Science: Productivity Tools
Harvard University

Data Science: Probability
Harvard University

Data, Models and Decisions in Business Analytics
Columbia University

Shabbir Ahmed, starts 19 Aug 2018, edX, Deterministic Optimization, here.

Course Syllabus

Week 1

  • Module 1: Introduction
  • Module 2: Illustration of the Optimization Problems

Week 2

  • Module 3: Review of Mathematical Concepts
  • Module 4: Convexity

Week 3

  • Module 5: Outcomes of Optimization
  • Module 6: Optimality Certificates

Week 4

  • Module 7: Unconstrained Optimization: Derivative Based
  • Module 8: Unconstrained Optimization: Derivative Free

Week 5

  • Module 9: Linear Optimization Modeling – Network Flow Problems
  • Module 10: Linear Optimization Modeling – Electricity Markets

Week 6

  • Module 11: Linear Optimization Modeling – Decision-Making Under Uncertainty
  • Module 12: Linear Optimization Modeling – Handling Nonlinearity

Week 7

  • Module 13: Geometric Aspects of Linear Optimization
  • Module 14: Algebraic Aspect of Linear Optimization


Week 8

  • Module 15: Simplex Method in a Nutshell
  • Module 16: Further Development of Simplex Method

Week 9

  • Module 17: Linear Programming Duality
  • Module 18: Robust Optimization

Week 10

  • Module 19: Nonlinear Optimization Modeling – Approximation and Fitting
  • Module 20: Nonlinear Optimization Modeling – Statistical Estimation

Week 11

  • Module 21: Convex Conic Programming – Introduction
  • Module 22: Second-Order Conic Programming – Examples

Week 12

  • Module 23: Second-Order Conic Programming – Advanced Modeling
  • Module 24: Semi-definite Programming – Advanced Modeling

Week 13

  • Module 25: Discrete Optimization: Introduction
  • Module 26: Discrete Optimization: Modeling with binary variables – 1

Week 14

  • Module 27: Discrete Optimization: Modeling with binary variables – 2
  • Module 28: Discrete Optimization: Modeling exercises

Week 15

  • Module 29: Discrete Optimization: Linear programming relaxation

  • Module 30: Discrete Optimization: Solution methods

Sidney R. Coleman, Harvard University Department of Physics,  Physics 253: Quantum Field Theory, here.

Professor Coleman’s wit and teaching style is legendary and, despite all that may have changed in the 35 years since these lectures were recorded, many students today are excited at the prospect of being able to view them and experience Sidney’s particular genius second-hand.

Fearless Girl

Camila Domonoske, 24 Apr 2018, NPR, “‘Fearless Girl’ Statue will face down Stock Exchange, not ‘Charging Bull,’” here. This reeks of Michael Lewis-style cluelessness. You could defend the Charging Bull placement, in a pinch. NYSE not so much. Nothing important happened there for 15 years, figuratively or literally. It’s like protesting billboards in Times Square: the message gets lost in the abstraction. Let’s find three better places for the Fearless Girl: The White House is almost always an inspiring and meaningful protest to someone; St. Patrick’s Cathedral is ok; The New Jersey Transit waiting room in Penn Station Hellmouth is good too.

Intel 10nm delayed 2019, L1 & L2, and LLVM

Paul Alcorn, 26 Apr 2018, Tom’s Hardware, Intel’s 10nm Is Broken, Delayed Until 2019, here.

Intel announced its financial results today, and although it posted yet another record quarter, the company unveiled serious production problems with its 10nm process. As a result, Intel announced that it is shipping yet more 14nm iterations this year. They’ll come as Whiskey Lake processors destined for the desktop and Cascade Lake Xeons for the data center.

The 10nm Problems

Overall, Intel had a stellar quarter, but it originally promised that it would deliver the 10nm process back in 2015. After several delays, the company assured that it would deliver 10nm processors to market in 2017. That was further refined to the second half of this year.

On the earnings call today, Intel announced that it had delayed high-volume 10nm production to an unspecified time in 2019. Meanwhile, its competitors, like TSMC, are beginning high volume manufacturing of 7nm alternatives.

Recent semiconductor node naming conventions aren’t based on traditional measurements, so they’re more of a marketing exercise than a science-based metric. That means that TSMC’s 7nm isn’t entirely on par with Intel’s 10nm process. However, continued process node shrinks at other fabs show that other companies are successfully outmaneuvering the production challenges of smaller lithographies.

Intel’s CEO Brian Krzanich repeatedly pressed the point that the company is shipping Cannon Lake in low volume, but the company hasn’t pointed to specific customers or products. And we’ve asked. As we pointed out earlier this year, the delay may seem a minor matter, but Intel has sold processors based on the underlying Skylake microarchitecture since 2015, and it’s been stuck at the 14nm process since 2014. That means Intel is on the fourth (or fifth) iteration of the same process, which has hampered its ability to bring new microarchitectures to market. That doesn’t bode well for a company that regularly claims its process node technology is three years ahead of its competitors.

Krzanich explained that the company “bit off a little too much on this thing” by increasing 10nm density 2.7X over the 14nm node. By comparison, Intel increased density by only 2.4X when it moved to 14nm. Although the difference may be small, Krzanich pointed out that the industry average for density improvements is only 1.5-2X per node transition. Because of the production difficulties with 10nm, Intel has revised its density target back to 2.4X for the transition to the 7nm node. Intel will also lean more on heterogeneous architectures with its EMIB technology (which we covered here).

Joel Hruska, 17 May 2018, ExtremeTech, How L1 and L2 caches work, and why they’re an essential part of modern chips, here.

This chart shows the relationship between an L1 cache with a constant hit rate, but a larger L2 cache. Note that the total hit rate goes up sharply as the size of the L2 increases. A larger, slower, cheaper L2 can provide all the benefits of a large L1, but without the die size and power consumption penalty. Most modern L1 caches have hit rates far above the theoretical 50 percent shown here; Intel and AMD both typically field cache hit rates of 95 percent or higher.
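The arithmetic behind that paragraph can be sketched in a few lines of C. This is a back-of-the-envelope model, not something from the article: the function names are hypothetical, the L2 hit rate is assumed to be measured only over accesses that already missed in L1, and any latencies passed in are the caller's own illustrative numbers.

```c
/* Fraction of all accesses served by either L1 or L2. l2_local_hit is
   the L2 hit rate counted only over accesses that missed in L1. */
double total_hit_rate(double l1_hit, double l2_local_hit)
{
    return l1_hit + (1.0 - l1_hit) * l2_local_hit;
}

/* Average access time in cycles for a two-level cache in front of main
   memory. Every access pays l1_cycles; L1 misses also pay l2_cycles;
   L2 misses also pay mem_cycles. */
double avg_access_cycles(double l1_hit, double l2_local_hit,
                         double l1_cycles, double l2_cycles,
                         double mem_cycles)
{
    double l1_miss_cost = l2_cycles + (1.0 - l2_local_hit) * mem_cycles;
    return l1_cycles + (1.0 - l1_hit) * l1_miss_cost;
}
```

With a 50 percent L1 and an L2 that catches 80 percent of the L1 misses, `total_hit_rate(0.5, 0.8)` comes out at about 0.9, which is the kind of jump the chart describes as the L2 grows.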

The next important topic is the set-associativity. Every CPU contains a specific type of RAM called tag RAM. The tag RAM is a record of all the memory locations that can map to any given block of cache. If a cache is fully associative, it means that any block of RAM data can be stored in any block of cache. The advantage of such a system is that the hit rate is high, but the search time is extremely long — the CPU has to look through its entire cache to find out if the data is present before searching main memory.
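To make the tag lookup concrete, here is a minimal sketch of how an address might be split into tag, set index, and line offset. The struct and function names are hypothetical, and the line size and set count are parameters chosen by the caller, not values from the article.

```c
#include <stdint.h>

/* The three fields a cache derives from a memory address. */
typedef struct {
    uint64_t tag;     /* compared against the tag RAM entries of one set */
    uint64_t set;     /* selects which set to search */
    uint64_t offset;  /* byte position inside the cache line */
} cache_addr;

/* Split addr for a cache with the given line size (bytes) and number of
   sets. Both parameters must be nonzero. */
cache_addr split_address(uint64_t addr, uint64_t line_size, uint64_t num_sets)
{
    cache_addr r;
    uint64_t line_number = addr / line_size;
    r.offset = addr % line_size;
    r.set    = line_number % num_sets;
    r.tag    = line_number / num_sets;
    return r;
}
```

Passing `num_sets = 1` models the fully associative case: every address maps to set 0 and the tag is the whole line number, so a lookup has to compare against every tag in the cache, which is the long search the paragraph describes.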

LLVM, The LLVM Compiler Infrastructure, here. If you open source Reversing the Biosphere it would look like this, no?

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies. Despite its name, LLVM has little to do with traditional virtual machines. The name “LLVM” itself is not an acronym; it is the full name of the project.

LLVM began as a research project at the University of Illinois, with the goal of providing a modern, SSA-based compilation strategy capable of supporting both static and dynamic compilation of arbitrary programming languages. Since then, LLVM has grown to be an umbrella project consisting of a number of subprojects, many of which are being used in production by a wide variety of commercial and open source projects as well as being widely used in academic research. Code in the LLVM project is licensed under the “UIUC” BSD-Style license.

The Lehman Trilogy

Stefano Massini, 4 Jul 2018, National Theatre, The Lehman Trilogy, here. The part of Dick Fuld will be played by Rachel Dolezal’s psychotherapist.

The story of a family and a company that changed the world, told in three parts on a single evening.

Sam Mendes directs Simon Russell Beale, Adam Godley and Ben Miles who play the Lehman Brothers, their sons and grandsons.

On a cold September morning in 1844 a young man from Bavaria stands on a New York dockside. Dreaming of a new life in the new world. He is joined by his two brothers and an American epic begins.

163 years later, the firm they establish – Lehman Brothers – spectacularly collapses into bankruptcy, and triggers the largest financial crisis in history.

Stepanov, Vectorized Decoding, and Merton

Alex Stepanov, 6 Oct 2016, collected papers, here.

Information retrieval

  • Alexander A. Stepanov, Anil R. Gangolli, Daniel E. Rose, Ryan J. Ernst, and Paramjit S. Oberoi: SIMD-Based Decoding of Posting Lists. ACM Conference on Information and Knowledge Management (CIKM 2011), October 24–28, 2011, Glasgow, Scotland, UK.

  • Alexander A. Stepanov, Anil R. Gangolli, Daniel E. Rose, Ryan J. Ernst, and Paramjit S. Oberoi: SIMD-Based Decoding of Posting Lists. A9 Technical Report A9TR-2011-01, revision 2, June 2014, 30 pages. Appendix includes C++ code. PDF

Plaisance, Kurz, Lemire, June 2015, 1st International Symposium on Web AlGorithms, Vectorized VByte Decoding, here.

We consider the ubiquitous technique of VByte compression, which represents each integer as a variable length sequence of bytes. The low 7 bits of each byte encode a portion of the integer, and the high bit of each byte is reserved as a continuation flag. This flag is set to 1 for all bytes except the last, and the decoding of each integer is complete when a byte with a high bit of 0 is encountered. VByte decoding can be a performance bottleneck especially when the unpredictable lengths of the encoded integers cause frequent branch mispredictions. Previous attempts to accelerate VByte decoding using SIMD vector instructions have been disappointing, prodding search engines such as Google to use more complicated but faster-to-decode formats for performance-critical code. Our decoder (MASKED VBYTE) is 2 to 4 times faster than a conventional scalar VByte decoder, making the format once again competitive with regard to speed.
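The format in the abstract is small enough to sketch. What follows is a conventional scalar encoder and decoder, which is the baseline the paper measures against, not the MASKED VBYTE decoder itself. The function names are hypothetical, and the sketch assumes the 7-bit groups are stored least significant first and that the input is well formed (at most 5 bytes per 32-bit value).

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one 32-bit value into out (needs room for up to 5 bytes).
   Every byte except the last has its high bit set to 1.
   Returns the number of bytes written. */
size_t vbyte_encode(uint32_t value, uint8_t *out)
{
    size_t n = 0;
    while (value >= 0x80) {
        out[n++] = (uint8_t)((value & 0x7F) | 0x80);  /* more bytes follow */
        value >>= 7;
    }
    out[n++] = (uint8_t)value;                        /* high bit 0: last byte */
    return n;
}

/* Decode one value from in, store it in *value, and return the number of
   bytes consumed. The loop exit depends on the data, which is the source
   of the branch mispredictions the abstract mentions. */
size_t vbyte_decode(const uint8_t *in, uint32_t *value)
{
    uint32_t v = 0;
    unsigned shift = 0;
    size_t n = 0;
    uint8_t byte;
    do {
        byte = in[n++];
        v |= (uint32_t)(byte & 0x7F) << shift;
        shift += 7;
    } while (byte & 0x80);
    *value = v;
    return n;
}
```

Under these assumptions the value 300 encodes to the two bytes 0xAC 0x02, and any value below 128 fits in a single byte.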

Daniel Lemire and Leonid Boytsov, 10 Sep 2012, arXiv, Decoding billions of integers per second through vectorization, here.

In many important applications — such as search engines and relational database systems — data is stored in the form of arrays of integers. Encoding and, most importantly, decoding of these arrays consumes considerable CPU time. Therefore, substantial effort has been made to reduce costs associated with compression and decompression. In particular, researchers have exploited the superscalar nature of modern processors and SIMD instructions. Nevertheless, we introduce a novel vectorized scheme called SIMD-BP128 that improves over previously proposed vectorized approaches. It is nearly twice as fast as the previously fastest schemes on desktop processors (varint-G8IU and PFOR). At the same time, SIMD-BP128 saves up to 2 bits per integer. For even better compression, we propose another new vectorized scheme (SIMD-FastPFOR) that has a compression ratio within 10% of a state-of-the-art scheme (Simple-8b) while being two times faster during decoding.

Peter Carr, Oct 2006, Bloomberg, Harvard’s Financial Scientist, here.  Merton had a model at IDL for buy and hold portfolio optimization for retirees. Problem was retirees don’t have enough money to optimize so IDL didn’t fly. Merton’s papers are here.

One major difference between quantitative finance today and the field 30 or 35 years ago, according to Merton, is that rapidly increasing computing power has taken away much of the importance of intuition. Merton recalls his graduate students’ needing to spend considerable time looking up possible solutions to functions in books he kept in his office.