Intel recently revealed more information about its upcoming Larrabee chip, apparently intended to provide GPU functionality while also being suitable for general-purpose numerical calculations.
How is one (especially one keenly interested in numerical computing) to make sense of Larrabee and gain some insight into its likely impact on the market?
From my point of view, Larrabee is a predictable innovation given the larger forces at work in the market. The first of these forces is the meteoric growth in the number of gates possible per die; as a result, each new generation of general-purpose chips can absorb something that was previously too big, complex, or different to go on-die. (Precedents: floating-point co-processors, I/O interfaces, multiple cores, and memory controllers.) This allows AMD and Intel to contemplate extending their x86 cores with GPU features, with the prospect of much higher performance.
The second force is the ability of a general-purpose software environment to appeal to a bigger market, which often means that the x86 (really, x86-64) instruction set architecture (ISA) has a major advantage over non-x86 ISAs, because it has more software available for it. (Precedent: even at the very high end, Cray has shifted its OS focus from the special-purpose Catamount to the general-purpose Linux, specifically for this reason.) The market rule is that, in the case of an approximate tie, the more-general-purpose component usually wins.
In this context, the GPUs have grown up as distinct chips, with performance unattainable from x86 chips, but with significant ease-of-use shortcomings relative to x86 chips as well. The GPU vendors want to improve their ease of use to reach a broader market; note Rob Farber's description of NVIDIA's CUDA evolution. Intel and AMD want to extend their already-general-purpose chips with GPU-like performance. I think of this simplistically via the diagram in Figure 1.

Figure 1. Performance / Programmability trade-offs of x86, GPU, and Larrabee chips
So, how could an intelligent observer tell how this is likely to play out? I believe two key metrics will tell us almost all we need to know.
- Sustainable memory bandwidth: Much of the superior performance that today's GPUs deliver compared to x86 sockets is due to their superior bandwidth to memory (100+ GB/s versus ~10 GB/s in the latest Xeon sockets). When Larrabee comes out in 2009-2010, will it deliver 80-90% of the memory bandwidth of the then-available GPUs, or 20-30%? (I.e., where does Larrabee fall on the vertical axis?)
- Vector/threaded performance from nearly-vanilla C++/C/Fortran code: The fact that Larrabee is based on the x86 instruction set will be small comfort to application developers if, to get good performance, they have to restructure their code as extensively as the GPU programming interfaces require. Intel's challenge is to get GPU-like performance with only slight changes to the original source code. (I.e., where does Larrabee wind up on the horizontal axis?) Note that some people might argue for no source changes, but I don't believe that's practical, given the typical state of applications vis-à-vis exposing vector/threaded constructs, and the state of commercial compilers.
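To make the first metric a little more concrete, here is a rough back-of-the-envelope sketch of how memory bandwidth alone bounds a simple streaming kernel. Only the 100 GB/s and ~10 GB/s figures and the 80-90% / 20-30% ranges come from the discussion above; the triad kernel, the array length, the 85% and 25% midpoints, and all function names are my own assumptions for illustration, and real hardware will do worse than these ideal bounds.

```python
# Back-of-the-envelope sketch: how memory bandwidth bounds a streaming kernel.
# Only the 100 and 10 GB/s figures and the 80-90% / 20-30% ranges come from
# the text; the kernel, array size, and function names are assumptions.

BYTES_PER_DOUBLE = 8


def triad_seconds(n_elements, bandwidth_gb_per_s):
    """Lower bound on the time for a[i] = b[i] + s * c[i] over n_elements doubles.

    Counts two reads and one write per element and ignores caches and
    write-allocate traffic, so real runs will be slower.
    """
    bytes_moved = 3 * BYTES_PER_DOUBLE * n_elements
    return bytes_moved / (bandwidth_gb_per_s * 1e9)


def triad_gflops_bound(bandwidth_gb_per_s):
    """Upper bound on GFLOP/s for the triad: 2 flops per 24 bytes moved."""
    flops_per_byte = 2 / (3 * BYTES_PER_DOUBLE)
    return flops_per_byte * bandwidth_gb_per_s


GPU_GB_PER_S = 100.0   # "100+ GB/s" from the text
XEON_GB_PER_S = 10.0   # "~10" from the text

scenarios = {
    "GPU": GPU_GB_PER_S,
    "Xeon socket": XEON_GB_PER_S,
    "Larrabee at 85% of GPU (hypothetical)": 0.85 * GPU_GB_PER_S,
    "Larrabee at 25% of GPU (hypothetical)": 0.25 * GPU_GB_PER_S,
}

if __name__ == "__main__":
    n = 100_000_000  # illustrative array length
    for name, bw in scenarios.items():
        print(f"{name}: {triad_seconds(n, bw) * 1e3:.1f} ms, "
              f"at most {triad_gflops_bound(bw):.2f} GFLOP/s")
```

For a kernel like this, which does almost no arithmetic per byte moved, the bound scales directly with bandwidth, which is why the 80-90% versus 20-30% question matters so much for numerical codes.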
And one further issue that appears to be overlooked (**) in the excitement to get many-core GPUs running fast: What's a model for parallel execution on these chips that will be stable over the next several years? For the x86 CPUs and, according to IBM, for its Cell also, OpenMP is the right on-chip parallel programming interface. But will this persist as a chip-wide parallel model? I claim not, as its effective implementations to date depend on cache coherence. Today the multicore x86 sockets all support chip-wide cache coherence. As the number of cores per chip grows exponentially, this won't be practical. (Witness what happened at the system level, where even SGI's heroic efforts to support system-wide cache coherence have had to retreat in the face of astronomical core counts.) I believe what's likely to happen is that some balance of technical cost and difficulty against economic value will result in a cache-coherent building block of (say) 32 cores. Then those building blocks will be replicated on a die, with some non-globally-coherent interconnect. (Chip architects from major vendors privately confirm this approach.) At that point, parallel models will need to cope with distributed memory at least somewhat differently, and today's OpenMP will not be sufficient. IBM's support of OpenMP on Cell charts a new path, supporting OpenMP without hardware cache coherence; the success of that path is still undetermined. Developers of GPU- or Larrabee-based programs, at least developers who don't want to rewrite their codes in three years, should be asking hard questions about the parallel models being exposed for these chips.
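A toy model may help show why a fixed-size coherent building block changes the programming picture as core counts grow. The block size of 32 is the "(say) 32 cores" figure from above; the function names and the die sizes are hypothetical, and this sketch says nothing about how any real chip will be organized.

```python
# Toy model of a die built from cache-coherent blocks of cores.
# The block size of 32 is the "(say) 32 cores" figure from the text;
# the function names and die sizes are assumptions for illustration.

CORES_PER_BLOCK = 32


def coherence_block(core_id, block_size=CORES_PER_BLOCK):
    """Index of the cache-coherent block that a core belongs to."""
    return core_id // block_size


def can_share_coherently(core_a, core_b, block_size=CORES_PER_BLOCK):
    """True if two cores sit in the same block and so see coherent memory."""
    return coherence_block(core_a, block_size) == coherence_block(core_b, block_size)


def coherent_pair_fraction(total_cores, block_size=CORES_PER_BLOCK):
    """Fraction of distinct core pairs that share a coherent block."""
    if total_cores < 2 or total_cores % block_size != 0:
        raise ValueError("total_cores must be a multiple of block_size and at least 2")
    blocks = total_cores // block_size
    pairs_within_blocks = blocks * (block_size * (block_size - 1) // 2)
    all_pairs = total_cores * (total_cores - 1) // 2
    return pairs_within_blocks / all_pairs


if __name__ == "__main__":
    for cores in (32, 128, 512):
        print(f"{cores} cores: {coherent_pair_fraction(cores):.1%} of core pairs are coherent")
```

In this model, every pair of cores outside a common block would need some explicit mechanism to exchange data, and the share of such pairs grows as blocks are replicated. That is the sense in which a parallel model built on chip-wide coherence would stop covering the whole die.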
I believe it will be extremely difficult technically for Intel to deliver a Larrabee that achieves 80-90% of the performance of its contemporary GPUs with much higher programmability, but I think it's possible, and if pulled off it would be a major change to the technical computing landscape.
Your thoughts?
(**) Though see an excellent discussion of related issues in the last 2 pages of a conversation with Kurt Akeley and Pat Hanrahan.