[The title is something I first heard from Peter Hill, a Cray Research colleague.]
Graphics processing units, or GPUs, have recently generated a lot of excitement among scientific and engineering users, with the promise of much higher performance at very low cost compared to general-purpose processors (e.g., Opteron or Xeon x86-64 processors). In this posting we’ll take a look at the details of why GPUs could be valuable in some domains, the difficulties in making them widely usable, and how they might be usable by desktop very high level language (VHLL) users.
First, a definition. What types of processors will we consider GPUs? The NVIDIA GeForce 8000 series and the ATI Radeon are the original GPUs, designed as graphics coprocessors in desktop or laptop systems. The IBM/Sony/Toshiba Cell Broadband Engine was created for very similar roles in games and has similar attributes, so I think of it as a GPU as well. These chips all get their performance from numerous (100 or more) small, simple cores that can process independent streams of data. The means of coordinating the cores and the means of gaining access to data in non-local memories differ among the types of GPUs.
The appeal of GPUs
On October 4th, I attended a workshop at Northeastern University on general-purpose processing on GPUs, often known as GPGPUs, and got some of the details about why people are excited about GPGPUs. It basically comes down to performance and cost. People are seeing that tuned kernels can deliver 10X (and sometimes close to 100X) the performance of the fastest Opteron/Xeon chips, and this performance comes from a mass-market part that costs on the order of $500. So that’s the intuitive appeal.
Peeling a layer off the onion, the peak computational rate for the NVIDIA GeForce 8800 is over 500GFLOPS (compared to a few dozen GFLOPS for the fastest Xeon or Opteron sockets). That computation rate is accompanied by off-chip memory bandwidth of >80GB/s, compared with 6-20GB/s from Xeon/Opteron. So at a gross level, the fundamentals are present to get excellent performance. The price and ubiquity of GPUs play in here strongly as well, as the GPU vendors hope that their cheapness will mean that lots of people will play around with programming them and come up with better approaches that make them more widely usable. Note, of course, that these amazing numbers are for the algorithms that match GPUs well; other algorithms, like the pointer-chasing common in databases, will not see any benefit from GPUs.
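To make those ratios concrete, here is a back-of-envelope sketch in Python. It uses the peak GPU figures quoted above; the CPU figures (30 GFLOPS, 10 GB/s) are assumed round values picked from within "a few dozen" and "6-20GB/s", not measurements. The bound it computes is a simple roofline-style estimate, and the function names are invented for illustration.

```python
# Rough estimate of how peak compute and off-chip bandwidth combine.
# All numbers are peak figures, so real kernels will land below these bounds.

def flops_per_byte(peak_gflops, bandwidth_gb_s):
    """Flops a kernel must do per byte fetched before compute is the limit."""
    return peak_gflops / bandwidth_gb_s

def attainable_gflops(peak_gflops, bandwidth_gb_s, kernel_flops_per_byte):
    """Roofline-style bound: the lower of the compute and bandwidth limits."""
    return min(peak_gflops, bandwidth_gb_s * kernel_flops_per_byte)

# GeForce 8800 figures quoted in the text.
GPU_GFLOPS, GPU_GB_S = 500.0, 80.0
# ASSUMPTION: illustrative CPU socket, chosen from the ranges in the text.
CPU_GFLOPS, CPU_GB_S = 30.0, 10.0

gpu_balance = flops_per_byte(GPU_GFLOPS, GPU_GB_S)
cpu_balance = flops_per_byte(CPU_GFLOPS, CPU_GB_S)

# A streaming update y = a*x + y in single precision does 2 flops for every
# 12 bytes moved (read x, read y, write y), so it is bandwidth-bound on both.
saxpy_intensity = 2.0 / 12.0
gpu_saxpy = attainable_gflops(GPU_GFLOPS, GPU_GB_S, saxpy_intensity)
cpu_saxpy = attainable_gflops(CPU_GFLOPS, CPU_GB_S, saxpy_intensity)

print(f"GPU needs {gpu_balance:.2f} flops/byte, CPU needs {cpu_balance:.2f}")
print(f"Bandwidth-bound speedup estimate: {gpu_saxpy / cpu_saxpy:.1f}x")
```

Under these assumptions a bandwidth-bound kernel gains roughly the ratio of the two memory bandwidths, while a kernel with enough arithmetic per byte can approach the ratio of the peak compute rates.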
Does a potential of 10X better performance make you consider changing your programming tools?
The “Issues” with GPUs
With these amazing performance advantages, why haven’t GPUs already taken over the computing world? Well, there are a few minor (sic) issues with programmability.
- You can’t just take your existing C/Fortran program and recompile it for a GPU. The chips that do have compilers don’t take the standard languages as input; rather, there are necessary extensions to address the hardware widgets that can yield high performance.
- Often the chips aren’t even efficiently programmable from C and require delving into assembler (though there are some hybrid approaches [http://corepy.org/]). The GPUs are optimized to process long streams of data with exactly the same operations, much as the vector processors of the classic Cray systems worked, and you may remember what the strengths and weaknesses of those processors were. Namely, if you could arrange your code as streams of vectors, great, but there were certain constructs (e.g., data-dependent If tests inside loops) that could inhibit vectorization and lose an order of magnitude of performance. The GPUs present the low-level programmer with a similar situation, but the instruction sets and compilers do not yet appear to have the maturity of those Cray systems.
- Beyond the operations themselves, the programmer has to be intimately familiar with the memory hierarchy within the chip. This is independent of whether you’re programming in C or assembler; you still have to ensure that high-use data winds up in the high-bandwidth scratch memory of the chip. Several experienced GPU programmers have told me that this is a level of optimization that requires a PhD-level programmer for success.
- The programming interfaces within the GPUs are not (yet) standard. NVIDIA is strongly promoting its CUDA interface, which is C with extensions for both hardware widgets and communication/synchronization among the many cores on the chip. IBM’s Michael Gschwind claimed at the Northeastern workshop that the Cell BE’s use of the OpenMP interface for parallelism made it standard and open. He further stated he had no concerns about the scalability of OpenMP for future chip versions which presumably will have many more cores. I’m not familiar with the details of ATI’s programming approach.
In my view, the difficulty and lack of standardization of GPU programming will mean that three groups will program them directly: scientific library developers from the chip vendors, academic researchers wanting to understand the suitability of new computing architectures, and programmers working on projects where the increased performance is worth the extra effort (e.g., high-volume games, national security uses). Given that most of us don’t fall in those categories, how could we expect to make use of GPUs? I believe it will be mostly via the “library model”; i.e., we will expect that one of those three pioneering groups has created a routine that’s relevant for the computation we want to do, and we’ll be able to use it somehow. But what might that look like? If I have an existing VHLL program, I want to make one or two changes at the top of the program, denoting that I want to use GPU routines for this code, and perhaps the number of GPUs. I’d like to remain blissfully ignorant of other details (the vendor and model number of the GPU, exactly which routines exist for that GPU, how data gets to the GPU and back, etc.) unless I want to delve into them for some reason. To make this more explicit: if I have a call to fft in my code, and that function happens to be implemented on the GPU in my system, I don’t have to make any changes to the fft call itself, just the few changes at the top to request GPU use.
Does this agree with how you need to use GPUs? Are you more likely to dig in to lower levels of programming to get the potential performance gain, or remain in the VHLL until those tools make it practical to use GPUs?
Note that this won’t mean that GPUs are uniformly valuable, even with a simple interface like this. Like lots of configurations in parallel computing, there is an issue of how much data we’d need to move from the general-purpose processor’s cache or memory to the GPU’s cache or memory compared to how much work we’re doing with the data. As it stands today, this is a non-trivial issue. If AMD delivers on its goal of integrating general-purpose and graphics processors on the same die with the same memory, this issue may be reduced, at least for that configuration, making GPUs more usable.
But integration is not really the issue; it’s really the relative bandwidth between local memory (to the core, either graphics or general-purpose) and global memory (accessible by all cores). The Cell processor is already physically integrated, but the relative bandwidths are very different, making sharing of data between graphics and general-purpose cores an expensive operation that a clever programmer will avoid. As long as this memory-residency/performance issue persists, efficient use of GPUs (or other special-purpose processors) will depend on smart scheduling of which operations of which sizes are best done on the graphics versus the general-purpose cores. What piece of software will do that scheduling? It doesn’t seem reasonable to have every programmer recreate this code, depending as it does on hardware-specific performance details that change with every chip generation. It seems like each function (fft in the example above) needs an uber-routine that decides which core-specific routine to call.
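One way such an uber-routine might decide is with a simple cost model: estimate compute time plus data-transfer time on each device and pick the smaller. The Python sketch below is only an illustration under stated assumptions. The device rates are made-up round numbers, the names Device, estimated_seconds, choose_device, and matmul_cost are hypothetical, and it uses a dense matrix multiply rather than fft because the multiply's work-to-data ratio grows with size, which makes the crossover easy to see.

```python
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    gflops: float         # sustained compute rate in GFLOP/s
    transfer_gb_s: float  # rate for moving operands to and from the device

def estimated_seconds(device, flops, bytes_moved):
    """Compute time plus the time to ship the data to the device and back."""
    compute = flops / (device.gflops * 1e9)
    transfer = bytes_moved / (device.transfer_gb_s * 1e9)
    return compute + transfer

def choose_device(devices, flops, bytes_moved):
    """Return the device with the lowest estimated total time."""
    return min(devices, key=lambda d: estimated_seconds(d, flops, bytes_moved))

def matmul_cost(n):
    """Flops and bytes for an n x n single-precision matrix multiply."""
    flops = 2.0 * n ** 3       # one multiply and one add per inner-loop step
    bytes_moved = 3 * n * n * 4  # two inputs sent, one result returned
    return flops, bytes_moved

# ASSUMPTION: illustrative rates only. The data starts in CPU memory, so the
# CPU pays no transfer cost (infinite transfer rate gives zero transfer time).
CPU = Device("cpu", gflops=20.0, transfer_gb_s=float("inf"))
GPU = Device("gpu", gflops=200.0, transfer_gb_s=2.0)

for n in (32, 1024):
    flops, nbytes = matmul_cost(n)
    best = choose_device([CPU, GPU], flops, nbytes)
    print(f"n={n}: run on {best.name}")
```

With these assumed rates the small multiply stays on the general-purpose core because the transfer dominates, while the large one moves to the GPU. The rates are exactly the hardware-specific details that change with every chip generation, which is why this logic belongs in the library rather than in every user's program.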
How would you expect to deal with this scheduling issue? Would you address it yourself, or expect to have it addressed by somebody else’s software (and if so, whose)?
The future of general-purpose v. graphics processors
One possibility is the integration of general-purpose and graphics processors. Note, of course, that one of the primary differentiators of today’s GPUs is their higher off-chip bandwidth. When the general-purpose and graphics processors are on the same die, they’ll both have access to the same memory bandwidth. If that tracks the general-purpose sockets, GPU users won’t be happy. If that tracks GPUs, GPU users will be happy and general-purpose core users will be ecstatic. Bandwidth is expensive to add to sockets, so there are economic reasons to reduce it. It’s not clear to me how this will play out.
Another possibility, in my view, is that the processor architects figure out how to include the unique GPU widgets in the general-purpose instruction sets, and in the next several years the general-purpose processors take over much of what is today viewed as GPU work. Remember, the processor cores themselves on today’s sockets are a relatively small fraction of the die area, so increasing that by 10-20% to include stream constructs or vector registers would not be a huge overall cost. (The memory bandwidth issue from the prior paragraph still applies.)
As I think about this from the perspective of VHLL users who want excellent performance but refuse to spend their days rewriting their code for the latest hot hardware innovation, it seems like we need some insulation from the vagaries of those various innovations. Yes, we want to be able to get their benefits, when they’re appropriate for our codes, but not at the cost of extensive changes.
How often are you willing to make significant changes to your program(s) to exploit significant hardware innovations? Every month? Every year? Every 5 years?
Summary: GPUs provide amazing performance benefits for some key computational kernels, but they are beasts to program and their use from VHLL programs today is non-trivial. Most VHLL users will probably want to be insulated from these hardware details, yet still have a way to get the performance benefit.
COMMENTS
You didn't mention RapidMind (http://www.rapidmind.net/) for programming GPUs. I have not spent enough time to get to know it well but it looks promising.
What is the story with the IBM/Sony/Toshiba Cell Broadband Engine?
It got a lot of press a year ago, but I haven't seen anything lately.
Is it an important innovation?
Ah, I neglected to mention RapidMind. I've also heard encouraging comments about it but haven't had time to look into it.
As for the Cell BE, having heard a couple of detailed presentations about using it for non-graphics apps, it sounds as though it's similar to the NVIDIA/ATI GPUs, but has some hardware differences that a performance-driven programmer will have to take into account. The summary I gleaned is that it has the same high-potential hard-to-program profile as the GPUs. People who have heard the roadmap for future Cell processors are apparently excited, but I haven't heard it first-hand.
I too was wondering why RapidMind was not mentioned in the context of this article. We are attempting to make decisions that minimize code rewrites when moving from one multi-core target to another, and it seems that RapidMind's concepts squarely address this issue via a high-level C++-like framework. We are currently contemplating whether RapidMind can incorporate a CUDA backend and achieve significant performance gains (or reasonably near so) compared to direct CUDA.
The claim that OpenMP might be generally suited for many-core machines might not be true. The basic reason is that the CPU-CPU communication will be far more efficient than CPU-Mem communication. Some references: http://www.gotw.ca/publications/concurrency-ddj.htm
I received feedback on this GPU posting that the Cell is significantly different from the NVIDIA and ATI GPUs, and lumping them all together is a mistake. Among the points raised: the Cell is a significant step lower in peak FLOPS and memory bandwidth (~3X) and some other points about programmability I have not had time to confirm or refute. For completeness' sake, I wanted to leave a note that my conclusions may have been inaccurate.