Star-P Blog
    

Parallel Lounge: Parallel Computing Blog for Engineers, Scientists, Analysts

Multicore: Why all the Hubbub?

Multicore chips have received tons of attention recently from industry pundits. What is multicore, and why should you, as a scientist or engineer who uses computing, care about it?

Multicore refers to the physical placement of more than one core on a single chip. Note that terminology can be confusing here, as hardware people tend to refer to the chip as a “processor” while software people tend to refer to the core as the “processor”. I will use the term core to refer to what we all called a “processor” ten years ago (i.e., the thing that can execute a program, has registers and memory access, etc.) and the term socket to refer to the whole chip. Multicore has happened because the hardware architects have run out of ways to use the exponential growth in transistors per unit space (made famous as Moore’s Law) to make a single core faster, and so have just put in multiple instances of a fast core on a single die. [The Landscape of Parallel Computing Research: A View From Berkeley, Multicore Programming Primer]

Given that the growth in transistors doesn’t appear to be slowing down and that the number of transistors in a core appears to have reached its asymptote, those extra transistors will be consumed by more cores, with the number of cores growing exponentially over the next several years. If we tag 2006 as the widespread advent of general-purpose multicore (dual-core) chips and extrapolate with Moore’s Law (a factor of 4 every 3 years), that says 2009 will bring 8-core sockets and 2012 32-core sockets. And those cores may not be getting much faster from generation to generation, unlike the last several years, so delivered performance improvements will depend heavily on making use of those extra cores. In a computing world where parallelism was a niche technology until recently, this is a Big Deal.
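
The extrapolation above is easy to reproduce. Here is a minimal sketch in Python, assuming only the figures given in the text (dual-core in 2006, a factor of 4 every 3 years); the function name and its parameters are illustrative, not part of any tool.

```python
def projected_cores(year, base_year=2006, base_cores=2, factor=4, period_years=3):
    """Extrapolate cores per socket, assuming growth by `factor` every `period_years`."""
    periods = (year - base_year) / period_years
    return round(base_cores * factor ** periods)


if __name__ == "__main__":
    # Print the projection for the three years discussed in the text.
    for year in (2006, 2009, 2012):
        print(year, projected_cores(year))
```

With these assumptions the function gives 8 cores for 2009 and 32 for 2012, matching the numbers quoted above. It is a straight-line extrapolation, not a forecast of any vendor's roadmap.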

Multicore chips come in several flavors. Intel and AMD build general-purpose multicore sockets [Intel quad core, AMD quad core] whose cores execute the x86-64 instruction set and today are cache coherent across the socket. A number of other vendors build multicore chips with other instruction sets, sometimes for special purposes, and they have already pushed core counts higher.

  • SiCortex, for instance, builds a low-power 6-core chip today based on the MIPS instruction set.
  • Sun’s UltraSPARC T2 processor has 8 cores, each capable of running 8 threads quasi-simultaneously for a total of 64 threads on a socket.
  • The IBM/Sony/Toshiba Cell processor has one general-purpose core accelerated by 8 GPU-like cores.
  • Tilera just announced its 64-core TILE64™ Tile Processor.
  • The NVIDIA Tesla™ graphics processing unit (GPU) has 128 cores on a socket.

These special-purpose sockets often do not provide cache coherence across the whole chip, which makes them simpler to design and less power-hungry, and this probably gives a hint for where future general-purpose sockets will go as well. (Other hardware innovations like transactional memory [http://www.theregister.co.uk/2007/08/21/sun_transactional_memory_rock/] may improve the primitives used for synchronization, but it’s far from clear that those will be simple enough for use by a typical scientist or engineer who doesn’t know much about parallelism.)

All of these cores represent lots of computing that can potentially be done on a chip, with potentially being the operative word. To make a single application run faster on a multicore chip, the application has to be structured to take advantage of multiple cores. Outside of the rarefied world of high performance computing (HPC) and enterprise applications like commercial databases, not many applications are ready to run in parallel. Further, lots of the attention for multicore chips has focused on the peak performance of the cores aggregated together. For instance, system vendors using AMD’s Barcelona quad-core socket focus on the peak speed for floating-point ops, 16 GFLOPS with a 2GHz clock.

Like all good marketing organizations, the hardware vendors are trying to focus our attention on the improved attributes of their new products while obscuring the not-so-improved attributes. Here lies one of the major pitfalls of the multicore sockets – they typically don’t have enough memory bandwidth to support the high FLOP rates of the cores. The hardware vendors aren’t venal in this change; they’re just doing what’s practical. Adding computation (more cores) is relatively straightforward and inexpensive. Adding bandwidth (more and faster off-chip pins) is difficult and expensive. Until somebody comes up with the next great idea in computers, we will see continued confirmation of Anant Agrawal’s mantra “Computation is cheap, [off-chip] communication is expensive.”
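
To make the bandwidth pitfall concrete, here is a rough back-of-the-envelope sketch in Python. The 16 GFLOPS peak is the figure quoted above; the 10 GB/s memory bandwidth and the choice of kernel (y = a*x + y on 8-byte values, which does 2 floating-point ops for every 24 bytes moved) are illustrative assumptions, not measurements of any real socket.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_per_s, flops_per_byte):
    """Upper bound on delivered GFLOPS: the lower of the compute and memory limits."""
    return min(peak_gflops, bandwidth_gb_per_s * flops_per_byte)


PEAK_GFLOPS = 16.0               # peak figure quoted in the text
ASSUMED_BANDWIDTH_GB_S = 10.0    # hypothetical off-chip bandwidth
AXPY_FLOPS_PER_BYTE = 2 / 24     # read x, read y, write y (8 bytes each); 2 flops

if __name__ == "__main__":
    bound = attainable_gflops(PEAK_GFLOPS, ASSUMED_BANDWIDTH_GB_S, AXPY_FLOPS_PER_BYTE)
    print(f"bandwidth-limited bound: {bound:.2f} GFLOPS "
          f"({100 * bound / PEAK_GFLOPS:.1f}% of peak)")
```

Under these assumed numbers a streaming kernel is limited by memory traffic to well under 1 GFLOPS, a small fraction of the advertised peak. Kernels that reuse data from cache have a higher flops-per-byte ratio and fare much better.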

     Is this description consistent with your view of the future hardware landscape? Have I missed anything important?

The impact of these chips on you as an engineer or scientist mainly depends on how you do your computing today.
  • If you do most of your work on the desktop, you have probably used or written serial applications and may still be doing so. Changing from that world to an 8-core-per-socket world by 2009 is a huge shift, whose effects are hard to overstate. One way to look at it: if you run a serial application on an 8-core socket, you’ll be running at only 12.5% of the potential speed of your chip. Not many disciplines or industries are so uncompetitive that one can afford to give away a factor of 8 and survive.
  • If you do most of your work on HPC systems, you’ve probably been using or writing parallel applications, so multicore may not be such a drastic change from what you’re used to. But the physical realities of the new chips will have a major qualitative effect on programming HPC systems built from them. First, the number of available cores will stress the scaling of the algorithms in most existing parallel applications. (That is, your nicely performing 32-core program in 2005 will need to be running just as well on 1,024 cores by 2012 to keep pace with Moore’s Law; not many people know how to do that.) Second, the performance characteristics of the multicore chips will require parallel programs to schedule remote communication much the way that I/O-intensive workloads schedule I/O today: careful identification of data that can be pre-fetched, with the hope that sufficient pre-fetching will result in data being available soon enough to avoid excessive waiting for it.
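
Both points can be put into numbers with a short Python sketch. The first function is just the one-eighth arithmetic from the desktop case. The second uses Amdahl's Law, a standard textbook model that this post does not itself invoke, so treat it as an added assumption; the 99% parallel fraction in the example is hypothetical.

```python
def serial_fraction_of_peak(cores):
    """Share of a socket's potential that a purely serial program can use."""
    return 1.0 / cores


def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's Law: speedup when only `parallel_fraction` of the work scales."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)


if __name__ == "__main__":
    print(f"serial code on 8 cores uses {serial_fraction_of_peak(8):.1%} of the chip")
    # Hypothetical program that is 99% parallel, at the two scales from the text.
    for cores in (32, 1024):
        speedup = amdahl_speedup(0.99, cores)
        print(f"{cores:5d} cores: speedup {speedup:6.1f}, "
              f"efficiency {speedup / cores:.1%}")
```

Under this model, a program that looks efficient at 32 cores loses most of its efficiency at 1,024 unless the serial fraction shrinks or the problem grows with the machine.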

So why would we as an industry put up with all this grief from multicore chips? Simple – we want programs to run way faster, and they offer the best near-term path to get there.

So I’ve covered the basics of multicore chips and their tremendous potential and notable pitfalls. For most of us, we have little alternative but to figure out how to use these chips to their utmost. Let’s look at how one might program these chips effectively.

SEA-CHANGE FOR DESKTOP PROGRAMMERS
If you run applications you get from other organizations on your desktop, your fate is in their hands. You may want to understand what those suppliers are doing about multicore, so that you can look for alternatives if they’re not responding to this technology shift.

If you write your own applications for the desktop, however, you’ll be facing the challenge of multicore yourself, and so probably have some new requirements for the programming tools you use. Assuming you’re not already conversant with parallel programming, you’ll probably want a gentle initiation. Unfortunately, “gentle” and “parallel programming” are rarely used in the same sentence without a “not”. :) [Note that with Star-P, we strive to do much better than prior approaches.]

For the rest of this discussion, we’ll assume you’re programming in a very high-level language (VHLL), such as Python, MATLAB® , Mathematica, R, etc. First, you’ll want an easy way to identify the parts of your program that consume all the time. Then you’ll want straightforward mechanisms that allow you to change only a few places in your program to get it to run in parallel, using all the cores in your desktop computer. And you’ll need to debug any errors you make along the way and look at the resulting performance to know whether it’s sufficient.
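
As an illustration of that workflow, here is a minimal sketch using only the Python standard library (it is not Star-P, whose interface is not shown here). The function `slow_kernel` is a hypothetical stand-in for whatever part of your program dominates the run time: step one profiles the serial code to find it, and step two changes a single loop into a process-pool map.

```python
import cProfile
import io
import pstats
from concurrent.futures import ProcessPoolExecutor


def slow_kernel(n):
    """Hypothetical hot spot: a CPU-bound sum of squares below n."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_serial(sizes):
    """Original serial version: one call after another on a single core."""
    return [slow_kernel(n) for n in sizes]


def run_parallel(sizes, workers=None):
    """Same work, with the loop replaced by a map over worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(slow_kernel, sizes))


def profile_report(func, *args):
    """Profile func(*args); return the top entries by cumulative time as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    buffer = io.StringIO()
    pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
    return buffer.getvalue()


if __name__ == "__main__":
    sizes = [200_000] * 8
    print(profile_report(run_serial, sizes))          # step 1: where does the time go?
    assert run_parallel(sizes) == run_serial(sizes)   # step 2: same answer, many cores
```

The only place the program changed is the loop inside `run_parallel`. Comparing its result with the serial version is a cheap first check while debugging.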

Popping up a level, you’ll want to know that the work you’re doing for parallelism will be sustainable to the next generations of chips that have even more cores, and that you can further refine your program to expose more parallelism rather than starting from scratch. And, given the work you’re putting into going parallel, you may want the flexibility to run larger problems that don’t fit on your desktop, without having to redo all your work.

    - If you’re a VHLL programmer, do you understand parallelism well enough to see how you could structure your application to expose the parallelism, or would you need help?
    - What type of information would you expect a tool to give you?


SCALE CHANGE FOR HPC PROGRAMMERS
If you’re already doing parallel programming for HPC systems, let’s assume that, like most people, you’re using the Message Passing Interface (MPI) with C++/C or Fortran. For you, multicore primarily means changes in scale and performance balance. By scale, I mean that the number of cores you need to be able to use effectively, to keep pace with Moore’s Law, will grow exponentially. Compounding this effect is the relative reduction in bandwidth that has happened with early multicore chips and which we expect to continue.

Some of my colleagues at ISC and I believe this will have dramatic impacts on how people will do detailed parallel programming. Whereas a 2005-era parallel program may have been able to survive with a simple sequence of steps, all done cooperatively by the executing cores, a 2009- or 2012-era parallel program will have to be much more aware of the scheduling of computation and communication between the cores to tolerate the inter-chip latencies and yield good performance. In many cases this will result in a single program step from the 2005-era program being decomposed into smaller parts, which will then be scheduled with prefetching, overlapping of computation and communication, and synchronization. Implementing these types of needed optimizations will be difficult using a low-level interface like MPI, and we believe that you’ll want new abstractions to make this a tractable problem.
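
The shape of such a restructured step can be sketched in a few lines of Python. This is not MPI code: `fetch_block` and `compute` are hypothetical stand-ins for remote communication and local work, and a background thread plays the role of the communication layer. What matters is the pattern: start fetching block i+1 before computing on block i, and synchronize only when the data is actually needed.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_block(index):
    """Hypothetical stand-in for remote communication: block `index` after a delay."""
    time.sleep(0.01)  # simulated latency; sleeping releases the GIL, as real I/O does
    return list(range(index * 4, index * 4 + 4))


def compute(block):
    """Hypothetical stand-in for the local computation on one block."""
    return sum(x * x for x in block)


def process_without_prefetch(num_blocks):
    """Simple sequence of steps: fetch, then compute, one block at a time."""
    return [compute(fetch_block(i)) for i in range(num_blocks)]


def process_with_prefetch(num_blocks):
    """Compute on block i while block i+1 is already being fetched."""
    results = []
    if num_blocks == 0:
        return results
    with ThreadPoolExecutor(max_workers=1) as fetcher:
        pending = fetcher.submit(fetch_block, 0)              # start the first fetch
        for i in range(num_blocks):
            block = pending.result()                          # synchronize on block i
            if i + 1 < num_blocks:
                pending = fetcher.submit(fetch_block, i + 1)  # prefetch the next block
            results.append(compute(block))                    # overlaps with that fetch
    return results


if __name__ == "__main__":
    assert process_with_prefetch(8) == process_without_prefetch(8)
```

The computation here is trivially small, so the time saved is negligible. In a real program the benefit depends on having enough computation per block to hide the communication latency.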

    - If you’re an MPI programmer, are you intending to continue with it or are you considering shifting to new interfaces to cope with multicore?

While Star-P already runs on current multicore x86-64 and Itanium chips, we’re currently in the midst of designing changes to Star-P that will address multicore chips more robustly. These are the issues we see facing you as a programmer for multicore systems; we’re using them as requirements for our design work.

    - Does this match what you think you’ll want from your parallel programming tools?
    - If you had a very-potent-but-not-quite-magic wand, what would you want those tools to do?


[Next time: Expressing parallelism in Star-P.]

Posted by Steve Reinhardt

COMMENTS

While I generally agree with the content of this post, something that will make life easier, not harder, is that problem sizes are growing along with the number of cores here.

For example, weather and climate problems are going to larger (finer) grid sizes, and more detailed physics is required. That increases the number of operations per point, and so the amount of work that can be partitioned across the available cores increases along with the cores. Gustafson's Law (weak scaling) is very relevant here.

The interesting thing (to me) is to see how the networking (especially) and memory (to a lesser degree) scale with the compute cycles. Raw networking performance improves with time, but as the number of MPI processes increases, the requirements imposed on the hardware also increase in a problem-dependent way. Striking a reasonable balance as the scale increases is non-trivial.

posted on Thursday, September 27, 2007 at 9:25 AM by Douglas Pase


Doug's comment about problem sizes getting bigger, and so more work per core, is accurate, in my view. I would be curious if anybody has any data showing how problem sizes will be growing over the next few years. It seems we're going to go through a steep increase in the number of cores over the next few years, and it seems possible that the rate of increase in cores will be faster than the rate of increase in problem size. Again, as Doug mentions, this would be problem-dependent, but it seems possible. Thoughts?

posted on Thursday, September 27, 2007 at 10:22 AM by Steve Reinhardt


It seems to me that different industries tackle the same problem differently. I just moved a radar application from a multi-processor architecture to an IBM CBE-based evaluation platform. This was extremely easy, and I did not even consider reading any SDK documentation for this platform. The reason for this is rather simple: the radar application is programmed in a 5GL programming environment which is aware of the multi-processor/core architectures. Therefore the programming environment acts as a multi-processor/core compiler.

In areas where the application is very stringent and symmetric (e.g. masses of matrix multiplies) the task of manual programming might be achievable. But in areas where the computations are enormous, but not as stringent and symmetric (e.g. high resolution object recognition in satellite pictures) the burden is completely on the human programmer, if he is not using a programming environment which operates like a multiprocessor/core compiler.

On an application level, I just want to describe how the application shall be distributed, and I want the tool to resolve all the message queue/shared memory communication, code generation, executable generation and distribution of the data and code to each processor/core. This enables me to concentrate on the "real" problem of parallel programming, which should be the algorithmic issues: resolving the load balance of computation and transfer networks.

All of this happened in a similar manner in the 1970s, when bit-slice processors promised higher processing capabilities. Later on the industry realised that programmers cannot handle this complexity themselves. Modern compilers for single CPUs eased this problem, and the parallel programming of multiple execution units with very long command words disappeared.

Therefore, for me it is not a question of a new "low level programming interface". I would frame the question more like this: "Am I programming a great number of single processors (with the result of resolving all the communication manually), or do I program an application on a processor (which may have multiple cores) with a multi-core compiler?"

posted on Thursday, September 27, 2007 at 1:23 PM by Andreas Bogner


HPC systems have many compute nodes connected by very expensive high speed networks but it's not efficient to arbitrarily allocate MPI tasks across the nodes or across different CPU chips within nodes. Since core counts are increasing much faster than multi-chip CPU nodes, performance is improved with a careful placement of tasks (task mapping) within these high core count CPU chips. Use the on-chip coherent cache to avoid sending MPI messages between CPU chips in a node and across an already overloaded and slow memory interconnect. The on-chip coherent cache is easily 2 orders of magnitude faster. Know the task mapping across multi-chip multi-core compute nodes. The MPI library can keep a running count of the number of bytes sent to each core and node. Examining the task mapping data gives the most efficient placement of tasks to each core for the next time the program is run. It's funny that we buy the most expensive and fastest inter-node network and then strive to not use it.

posted on Friday, September 28, 2007 at 2:43 PM by Don Dossa


Exactly, and what kind of parallelism are we dealing with? If SMP, all these cores will be dancing on one memory port and the memory wall will be steep, which suggests another technology will be needed, namely 3D, for providing multiple wide pathways into the memory for each of the cores to chew on. This is not something that can be accommodated by conventional packaging. On MPI, this also entails an overhead in message packing and routing. In big machines this can easily reach microseconds of latency. Of course for small ones the network latency will be less. I see nothing in the cards yet that will crush this microsecond latency in hardware, though solutions are out there. You just have to look.

posted on Saturday, September 29, 2007 at 7:31 PM by John McDonald

