Parallel Lounge: Parallel Computing Blog for Engineers, Scientists, Analysts

Current Posts |  RSS Feed

What is the Cost of Performance?

Posted by David Gibson



As data sets grow, and algorithms grow increasingly complex, there’s a need by engineers, scientists and analysts to increase performance. The first step is often to re-write their algorithm – originally coded in MATLAB® or another very high level language (VHLL) – into a lower level language, C, C++, or Fortran. A typical project may take several months, and result in a 5-10X performance gain on a typical workstation (Option 1 below).

Option 1: Port to C++

  • ~6 person-months (~$120,000)
  • 5-10X gain in performance on a single processor
  • Calendar time to solution: 6 months
  • Does not scale beyond a single processor without further work

In Option 1, it should be noted that this serial programming effort does not scale “for free” beyond a single processor. So while the 5-10X gain is a perfectly acceptable target for many projects, those needing more performance will need to take a different approach. To increase performance further, one can turn to clustered servers. Of course, this typically involves some degree of parallel programming, with the relatively low-level paradigm of message passing (MPI or OpenMP). Here’s some data from a recent survey we carried out, asking 25 organizations about their MPI-based development efforts; presented below are the histograms of team size, and project length. While parallel programming projects vary widely, it is common to see teams of several engineers working 1-2 years.







So, let’s consider a fairly typical example, when the required computing power outstrips a single desktop, and the decision is made to develop an MPI-based application running on HPC clusters.

Option 2: Port to C++, with message passing (MPI)
  • Total incremental investment: $1,000,000
  • ~48 person-months (~$900,000)
    • 60X gain on a 128-core server (~$100,000)
    • Calendar time to solution: 12-18 months
  • Scales with more hardware, if higher performance is desired

Recently, a new programming paradigm has become available: Using existing VHLL code developed on a desktop, but extended to HPC server clusters with the Star-P software platform. This approach eliminates the C/MPI programming, and instead requires some incremental coding in the familiar MATLAB environment, leveraging much of the application’s existing code base. One can learn the handful of tags and commands within several days, and within a short number of weeks typical codes can be parallelized to run on the cluster. For a number of reasons, the processor utilization and compute efficiency may not be as high as on a “hand-tuned” custom MPI code, so a somewhat larger server may be necessary. (Fortunately, hardware is cheap, and getting cheaper.)
Here’s how the numbers come out for the typical case:

Option 3: Star-P extends MATLAB® to HPC Servers w/o MPI
  • Total incremental investment: $270,000
    • 1 person-month (~$20,000)
    • 60X gain on a 256-core server (~$200,000)
    • Star-P license ($50,000)
  • Calendar time to solution: 1 month
  • Scales with more hardware, if higher performance is desired

So this new programming model offers us the flexibility to trade off labor cost and time savings versus hardware costs. Because many projects are constrained by calendar time and available technical resources, a solution such as Star-P offers a way to radically transform the “cost of performance” equation.

Furthermore, this assessment only covers the short-term costs of the parallel port. In fact, most software costs are in maintenance of a code over time. In that case, the VHLL benefits of Star-P (faster and hence cheaper software development) will continue to pay off time after time, whereas the MPI-based approach will continue to cost substantially more.

I am curious to hear your feedback on the argument laid out here, and how it may relate to your projects:
  • What do you do to increase performance for codes written in MATLAB®, Python, R, and other VHLLs?
  • How long do your parallel ports take, with what size team?
  • What are your thoughts on the notion of trading hardware efficiency for calendar time and labor costs?



Article has 0 comments. Click To Read/Write Comments

Performance Tuning 101: Client-Server Communications and Vectorization

Posted by Roope Astala



Star-P is a client-server system. The client, running on your desktop, tells the high-performance server what to do, and the server spends its time happily crunching numbers.  The server doesn’t like to be micromanaged: it works the best when it can work on large chunks of data all on its own. It is "expensive" for client and server to communicate: after all, the client-server connection is often the slowest link in the hardware.

To illustrate the point, I’m showing the graph output of Star-P’s ppperf profiling tool.  The X-axis is time, and the Y-axis indicates the activity spent on the client, on the network, and on the server, for a given code segment. Going from left to right, it shows the communications activity - middle graph, in red - when a 72 MB matrix is sent from client to server. Then, the bottom (blue) graph shows the server activity increase in order to operate on the matrix. As you can see it takes more time to transmit the data than to do the actual computation.



In this first installment of Performance Tuning 101 series, we’ll discuss ways to reduce client-server communications traffic. Let’s first create some data using the *p tag: for those unfamiliar with Star-P syntax, it causes arrays to be created on the parallel server and distributed over CPUs.

x = rand(1,n*p);y=zeros(1,n*p);

By creating the data this way - directly on the server – we avoid data traffic through the client-server bottleneck.

Now, let’s do some number-crunching on these arrays to show the basic technique to avoid client-server traffic: vectorization. Compare element-wise multiplication in a for-loop

for idx=1:n
     y(idx) = 2*x(idx);
end


to a vectorized multiplication

y = 2*x;

In the for-loop, a command is sent from client to server in each iteration. This totals 1000 client-server calls, slowing down the code. Not only is the vectorized code simpler, but it requires only one client-server call.



Vectorization not only reduces client-server communications - it also lets you perform parallel operations on large arrays. Consider the following code that could be used to model 3-D random walks:

X=zeros(3,15000,15000);
for idx1=1:15000
     y = randn(3,15000);
     for idx2=1:15000
         x(1,idx1) = x(1,idx1) + y(1,idx2);
         x(2,idx1) = x(2,idx1) + y(2,idx2);
         x(3,idx1) = x(3,idx1) + y(3,idx2);
     end
end


This code took about 105 seconds to run in serial using regular MATLAB®. Just imagine how long it would take on a client-server platform! However, we can replace the for-loops by vectorized and parallelized operations:

X = sum(randn(3,15000,15000*p))

The speed-up is dramatic, this computation took only 4.5 seconds to run using 8 CPUs. In a sense we are trading memory for speed: we avoid the for-loop by constructing an extra 675 million element, 5.4 gigabyte array. This is where Star-P’s distributed arrays become very useful: a regular desktop might not have enough memory, so you’re forced to use the slower for-loops, whereas on a parallel server with large memory, you are free to restructure your code to use vectorized constructs.

So far, we have learned a couple of important lessons:

• Client-server communications is often the "weakest link" of a client-server system.
• When possible, avoid data traffic between client and server.
• Use vectorization.

In our next installments, we’ll take a look what’s happening on the server: what impact do your application types, data sizes and parallel server architectures have on performance tuning.



Article has 0 comments. Click To Read/Write Comments

Types of Parallelism: Extensions to Task Parallelism

Posted by Steve Reinhardt



Last time I mentioned that there were several dimensions in which the current ppeval mechanism could be generalized and become more useful.  We’ve thought about stencil operations, that is, operations that use a specific set of elements of an array repeatedly, and how to extend task parallelism to support them.  [NOTE:  These are just ideas, and do not exist in the current Star-P and may never exist in a future Star-P either.  The new directives in the example below are in blue to denote “blue sky” thinking.]

The following figure shows two example stencil operations on a 2D rectangular grid (in blue). The first, in pink, is known as a five-point stencil, and involves an element of the array and its north, south, east, and west neighbors.


The MATLAB® code below expresses a simple averaging operation, where the element itself is weighted as heavily as the sum of the four neighboring elements.

newx(i,j)= 4*x(i,j) + x(i-1,j) + x(i,j-1) + x(i,j+1)+ x(i+1,j)


Those neighboring elements are sometimes known as a halo around the central element. Typically the averaging or weighting operation would be applied to each element in the array, with some special handling around the boundary to avoid indexing out of bounds. The second example, in green, is known as a nine-point stencil, and additionally involves the northwest, northeast, southeast, and southwest neighbors. The code uses the MATLAB blkproc function, which implements stencil operations itself via the supplied function fn.

newy = blkproc(x,[3 3],fn)

with fn defined in this case as:

     function z = fn(x)
     z = sum(x(:))+7*x(2,2);

You can mentally extend these stencils to their 3D analogues as well.

  • Are stencils important to your codes?
  • Does this description of stencils cover your usage, or are there other angles that I’ve missed?
The very fact of the stencil, or neighborhood, of elements is what requires some extra thought when it comes to parallelizing this construct. For parallelism to accelerate the operation, some of the elements of the array will need to reside on the memory associated with each processor core. When that core gets to the boundary elements, it will need to use neighboring elements that reside on another core. How does it get access to those elements, and how does it know that it is getting the most recent values of those elements? These are the hard points in parallelizing this construct.

The following example illustrates both these hard points. It’s a perhaps-nonsensical extension of the MATLAB conv2 routine to include an extra argument that denotes the number of convolution steps to take. conv2 calculates the convolution of each element of its first argument a with the neighborhood represented by its second argument b. A simplified version of this code, that doesn’t scrupulously account for all end-cases, is shown following.

     function z = conv2(a,b,k)
     [an am] = size(a);
     [bn bm] = size(b);
     bn = fix(bn/2); bm = fix(bm/2);
     oldz = a;
     for ii=1:k
         for i=1:an
              for j=1:am %ignore out-of-bounds refs to oldz
                   z(i,j) = oldz(i-bn:i+bn,j-bm:j+bm).*b;
              end
         end
         oldz = z;
     end


Annotating stencils for parallel execution
We want the i and j loops to run in parallel, but the ii loop cannot, as it depends on the results of the prior convolution. What is the minimum we could add to this code sequence to make it run correctly and fast, in parallel? For simplicity of expression, let’s assume we have directives that can express what we need, instead of having to rewrite the code itself. It’s clear we need something to denote that the i/j loops should run in parallel, but that would be true even if there were no stencil operation inside them. The added wrinkle of the stencil is that we need some way to ensure that the “oldz = z” line results in each core getting the latest value of z from its neighbors before proceeding to the next iteration of the ii loop. One could imagine an update directive that would provide this functionality. The resulting code might look something like (as-yet-unimplemented directives in blue):

     function z = conv2(a,b,k)
     [an am] = size(a);
     [bn bm] = size(b);
     bn = fix(bn/2); bm = fix(bm/2);
     oldz = a;
    
%%$starp tparallel bcast(b), halo(oldz,[bn bm])
     for ii=1:k
         for i=1:an
              for j=1:am %ignore out-of-bounds refs to oldz
                   z(i,j) = oldz(i-bn:i+bn,j-bm:j+bm).*b;
              end
         end
         oldz = z;
    
%%$starp update(oldz)
     end
     %%$starp tparallel end


Note that in this specific case, the Star-P version of the conv2 routine would have this parallelism implemented internally, so you as a user wouldn’t have to think about it. But obviously you might have your own code where you need to implement something similar.
  • Do you see this as a reasonable interface for changing your code to specify parallelism?
  • Does this overlook angles that need to be addressed?
Summary: We discussed one means of extending the current ppeval mechanism for a specific type of fixed-pattern computations.



Article has 0 comments. Click To Read/Write Comments

Here's what you missed at Supercomputing 2007...

Posted by Ilya Mirman



Last week's SC07 show in Reno, NV was a great time to connect with friends, customers, partners, and other folks interested in developing parallel applications for high performance computers.

ISC's Virtual Tour

In addition to our booth, we participated in several interesting sessions, such as the ClearSpeed User Forum, HPC Challenge, Cray's booth, Parallel MATLAB® session, and others.  In case you missed something of interest, we've created a Virtual Tour of our SC07 activities.  Here, you will findthe various presentations, product demonstrations, news announcements, customer videos, and awards.

P.S.: here's Ronnie setting up the booth, Aquil demonstrating an image processing app, and me with Austin Powers (told me later he prefers Python over other Very High Level Languages).

   




Article has 0 comments. Click To Read/Write Comments

Python and Star-P win the HPC Challenge at SC|07

Posted by Ilya Mirman



This week at the SC07 conference, we won the HPC Challenge.  Judged by a committee led by Jack Dongarra (University of Tennessee, and Oak Ridge National Lab) and Jeremy Kepner (MIT Lincoln Lab), the goal of the competition is to focus the HPC community’s attention on developing a broad set of HPC hardware and HPC software capabilities that are necessary to productively use HPC systems.

Our entry this year was done in the Python language, extended to a parallel environment with  Star-P.  We  won in the category of "Most Productivity" - based 50% on performance, and 50% on code elegance, clarity, and size.

Python is a high-level, dynamically typed, multi-paradigm (procedural, objectoriented and functional) interpreted language created by Guido Van Russom. Natively, Python does not have data types and containers such as matrices and lacks linear algebra. and signal processing functions. Instead, these are added to the language through Python extension modules. Currently, the de facto Python module for numerical computing is NumPy authored by Travis Oliphant and others.

The Star-P package in Python is an extension module that can be imported just like any of the modules in the Python standard library. The syntax and semantics in Star-P Python closely model those in NumPy.

The basic premise of the model is to maintain compatibility in syntax with serial codes written using the NumPy module. In most cases, the user must not be burdened with having to think “in parallel”, keep track of distributions or worry about which portions of the code runs in serial and which in parallel. This allows users with a large existing serial application to port it to run in parallel with the least amount of effort.

The tests were run at a high performance cluster at the San Diego Supercomputing Center. The cluster consists of 32 nodes, where each node contains one quad-core Intel Xeon 5140 processor and 8 GB of memory, for a total of 128 cores and 256 GB of memory. The interconnect is IP over Infiniband.

Here are the scaling results for 3 of benchmarks.  The full submission and presentation can be found here.



Article has 1 comments. Click To Read/Write Comments

Interactive Discovery on Cray Supercomputers? Yes!!

Posted by Ilya Mirman



Greetings from the SC07 Conference in Reno, NV!  The show kicks off later today, and is shaping up to be pretty exciting.

One thing we are psyched about is our collaboration with Cray, where we are working together to enable Star-P to run on Cray's XT4 supercomputers.  (Cray announced this earlier this morning.)  The cool thing here is that it enables a whole new way to run applications on a Cray system, and makes Cray an extension of the scientist's desktop.  Imagine coding an algorithm in a desktop tool such as Python, MATLAB®, R, and others, and then running the code - with potentially enormous data sets - on a Cray system, without coding a line of C++, Fortran, and MPI.

We got some work ahead of us to deliver this to early adopters in Q1 of 2008, but already have some pieces of Star-P's port to Cray working.  One thing that makes it a pretty reasonable effort is that it's built on standard x86-64 chipset, running Linux.

Initial Porting Results
As a proof of concept, we tweaked a couple things in Star-P and compiled it for the Cray, and ran a couple toy problems on (perhaps the world's smallest) Cray supercomputer :)  - a 32-core XT4.  The neat thing here is that it worked, and scaled nicely.  For this test, we took 64 matrices of 3 different sizes (1000 x 1000, 1500 x 1500, and 2000 x 2000), and found the 2 most correlated vectors in each matrix, all done from Python (one of the very high-level languages Star-P supports).

We ran each problem on 1, 2, 4, 8, 16, and 32 processors of the XT4 system.  Looks like pretty good scaling for a 1st run.  The timing curve for the smallest problem (1000 x 1000 matrix size) starts to lean over slightly at 32 processors, likely due to the communication overhead associated with distributing the relatively small data set across 32 cores.



Article has 0 comments. Click To Read/Write Comments

How Supercomputing Geeks Will Invest $11m

Posted by Ilya Mirman



We have just closed a financing round (Series B, $11M), and folks ask me what we’re planning to do with the funds. Broadly speaking, this falls into 2 categories: customer-driven product enhancement, and the distribution channel.

Our Engineering team has a lot of cool things in the works, but here’s a quick overview of some of these:

  • Our largest investment will be to maintain the highest level of abstraction while dramatically increasing the run-time performance and global memory capability - well beyond what is typically encountered with interpreted very high level scripting languages typically used by domain experts (engineers, scientists, analysts).
  • Broader language support: there’s a handful of popular desktop apps – the very high level languages (MATLAB, Python, R, Mathematica, etc.) – and the funding will help us accelerate development of our multi-language platform, in terms of both language breadth, and functional coverage within each language.
  • Advances in automatic parallelization:  having been in the marketplace for a couple of years now, we have confirmed the fact that scientists/engs/analysts want to move their algorithms to deployment on the fastest, biggest systems without the delays of large modficiations to their codes. As a result, we will be investing more in state-of-the-art automatic parallelization, to make it increasingly easier to take an algorithm developed on a desktop PC, and optimize it for running on a large cluster.
  • Scaling to larger high performance computers: whereas today’s workgroup servers may contain 8-32 processors, tomorrow’s will contain hundreds and even thousands – enabled by microprocessor advancements, and demanded by increasingly growing data sets (e.g., from increasingly higher resolution MRIs, larger gene databases, hedge fund tick databases, etc.). With this funding, we will push the scaling limits up, in terms of both scaling efficiency, and in absolute terms.
On the channel side of things, we have a number of authorized resellers around the world, and will be working to expand this, in several key categories (more on this in the near future):



Article has 0 comments. Click To Read/Write Comments

GPUs: Threat or Menace? :)

Posted by Steve Reinhardt



[The title is something I first heard from Peter Hill, a Cray Research colleague.]

Graphics processing units, or GPUs, have provoked a lot of excitement among many scientific and engineering users recently, with the promise of much higher performance at very low cost compared to general-purpose processors (Opteron or Xeon x86-64 processors, e.g.). In this posting we’ll take a look at the details of why GPUs could be valuable in some domains, the difficulties in making them widely usable, and how they might be usable by desktop very high level language (VHLL) users.

First, a definition. What types of processors will we consider GPUs? The NVIDIA GeForce8000 series and the ATI Radeon are the original GPUs, designed as graphics coprocessors in desktop or laptop systems. The IBM/Sony/Toshiba Cell Broadband Engine was created for very similar roles in games and has similar attributes, so I think of it as a GPU as well. These chips all get their performance from numerous (100 or more) small, simple cores that can process independent streams of data. The means of coordinating the cores and the means of gaining access to data in non-local memories differs among the types of GPUs.

The appeal of GPUs

On October 4th, I attended a workshop at Northeastern University on general-purpose processing on GPUs, often known as GPGPUs, and got some of the details about why people are excited about GPGPUs. It basically comes down to performance and cost. People are seeing that tuned kernels can deliver 10X (and sometimes close to 100X) the performance of the fastest Opteron/Xeon chips, and this performance comes from a mass-market part that costs on the order of $500. So that’s the intuitive appeal.

Peeling a layer off the onion, the peak computational rate for the NVIDIA GeForce 8800 is over 500GFLOPS (compared to a few dozen GFLOPS for the fastest Xeon or Opteron sockets). That computation rate is accompanied by off-chip memory bandwidth of >80GB/s, again comparing to 6-20GB/s from Xeon/Opteron. So at a gross level, the fundamentals are present to get excellent performance. The price and ubiquity of GPUs plays in here strongly as well, as the GPU vendors hope that their cheapness will mean that lots of people will play around with programming them and come up with better approaches that make them more widely usable. Note, of course, that these amazing numbers are for the algorithms that match GPUs well; other algorithms, like the pointer-chasing common of databases, will not see any benefit from GPUs.

Does a potential of 10X better performance make you consider changing your programming tools?

The “Issues” with GPUs

With these amazing performance advantages, why haven’t GPUs already taken over the computing world? Well, there are a few minor (sic) issues with programmability.

  • You can’t just take your existing C/Fortran program and recompile it for a GPU. The chips that do have compilers don’t take the standard languages as input; rather, there are necessary extensions to address the hardware widgets that can yield high performance.
  • Often the chips aren’t even efficiently programmable from C and require delving into assembler (though there are some hybrid approaches [http://corepy.org/]). The GPUs are optimized to process long streams of data with exactly the same operations, much as the vector processors of the classic Cray systems worked, and you may remember what the strengths and weaknesses of those processors were. Namely, if you could arrange your code as streams of vectors, great, but there were certain constructs (e.g., data-dependent If tests inside loops) that could inhibit vectorization and lose an order of magnitude of performance. The GPUs present the low-level programmer with a similar situation, but the instruction sets and compilers do not yet appear to have the maturity of those Cray systems.
  • Beyond the operations themselves, the programmer has to be intimately familiar with the memory hierarchy within the chip. This is independent of whether you’re programming in C or assembler; you still have to ensure that high-use data winds up in the high-bandwidth scratch memory of the chip. Several experienced GPU programmers have told me that this is a level of optimization that requires a PhD-level programmer for success.
  • The programming interfaces within the GPUs are not (yet) standard. NVIDIA is strongly promoting its CUDA interface, which is C with extensions for both hardware widgets and communication/synchronization among the many cores on the chip. IBM’s Michael Gschwind claimed at the Northeastern workshop that the Cell BE’s use of the OpenMP interface for parallelism made it standard and open. He further stated he had no concerns about the scalability of OpenMP for future chip versions which presumably will have many more cores. I’m not familiar with the details of ATI’s programming approach.

In my view, the difficulty and lack of standardization of GPU programming will mean that three groups will program them directly: scientific library developers from the chip vendors, academic researchers wanting to understand the suitability of new computing architectures, and programmers working on projects where the increased performance is worth the extra effort (e.g., high-volume games, national security uses).

Given that most of us don't fall in those categories, how could we expect to make use of GPUs? I believe it will be mostly via the “library model”; i.e., we will expect that one of those 3 pioneering groups has created a routine that’s relevant for the computation we want to do, and we’ll be able to use it somehow . But what might that look like? If I have an existing VHLL program, I want to make one or two changes at the top of the program, denoting that I want to use GPU routines for this code, and perhaps the number of GPUs. I’d like to remain blissfully ignorant of other details (the vendor and model number of the GPU, exactly which routines exist for that GPU, how data gets to the GPU and back, etc.) unless I want to delve into them for some reason. To make this more explicit, that means if I have a call to fft in my code, and that function happens to be implemented in the GPU in my system, that I don’t have to make any changes to the fft call itself, just the few changes at the top to request GPU use.

Does this agree with how you need to use GPUs?
Are you more likely to dig in to lower levels of programming to get the potential performance gain, or remain in the VHLL until those tools make it practical to use GPUs?


Note that this won’t mean that GPUs are uniformly valuable, even with a simple interface like this. Like lots of configurations in parallel computing, there is an issue of how much data we’d need to move from the general-purpose processor’s cache or memory to the GPU’s cache or memory compared to how much work we’re doing with the data. As it stands today, this is a non-trivial issue. If AMD delivers on its goal of integrating general-purpose and graphics processors on the same die with the same memory, this issue may be reduced, at least for that configuration, making GPUs more usable.

But integration is not really the issue; it’s really the relative bandwidth between local memory (to the core, either graphics or general-purpose) and global memory (accessible by all cores). The Cell processor is already physically integrated, but the relative bandwidths are very different, making sharing of data between graphics and general-purpose cores an expensive operation a clever programmer will avoid. As long as this memory-residency/performance issue persists, efficient use of GPUs (or other special-purpose processors) will depend on smart scheduling of which operations of which sizes are best done on the graphics versus the general-purpose cores. What piece of software will do that scheduling? It doesn’t seem reasonable to have every programmer recreate this code, depending as it does on hardware-specific performance details that change with every chip generation. It seems like each function (fft in the example above) needs an uber-routine that decides which core-specific routine to call.

How would you expect to deal with this scheduling issue?
Would you address it yourself, or expect to have it addressed by somebody else’s software (and if so, whose)?


The future of general-purpose v. graphics processors

One possibility is the integration of general-purpose and graphics processors. Note, of course, that one of the primary differentiators of today’s GPUs is their higher off-chip bandwidth. When the general-purpose and graphics processors are on the same die, they’ll both have access to the same memory bandwidth. If that tracks the general-purpose sockets, GPU users won’t be happy. If that tracks GPUs, GPU users will be happy and general-purpose core users will be ecstatic. Bandwidth is expensive to add to sockets, so there are economic reasons to reduce it. It’s not clear to me how this will play out.

Another possibility, in my view, is that the processor architects figure out how to include the unique GPU widgets in the general-purpose instruction sets, and in the next several years the general-purpose processors take over much of what is today viewed as GPU work. Remember, the processor cores themselves on today’s sockets are a relatively small fraction of the die area, so increasing that by 10-20% to include stream constructs or vector registers would not be a huge overall cost. (The memory bandwidth issue from the prior paragraph still applies.)

As I think about this from the perspective of VHLL users who want excellent performance but refuse to spend their days rewriting their code for the latest hot hardware innovation, it seems like we need some insulation from the vagaries of those various innovations. Yes, we want to be able to get their benefits, when they’re appropriate for our codes, but not at the cost of extensive changes.

How often are you willing to make significant changes to your program(s) to exploit significant hardware innovations? Every month? Every year? Every 5 years?

Summary: GPUs provide amazing performance benefits for some key computational kernels, but they are beasts to program and their use from VHLL programs today is non-trivial. Most VHLL users will probably want to be insulated from these hardware details, yet still have a way to get the performance benefit.



Article has 5 comments. Click To Read/Write Comments

Types of Parallelism: Task Parallelism

Posted by Steve Reinhardt



Last time we looked at data parallelism, which is in some ways a restricted form of parallelism. This time I’ll address task parallelism, which is conceptually a very broad form of parallelism. Star-P currently implements a subset of complete task parallelism, where the “iterations” of a parallel loop have no dependencies or communication among them.

In its full glory, task-parallelism allows you to do completely different tasks in parallel. For instance (in a non-Star-P pseudocode), my morning ritual might look like
    if (myproc==1)
        Get_the_newspaper();
    elseif (myproc ==2)
         Make_breakfast();
         Eat_breakfast();
    elseif (myproc == 3)
         Let_the_dogs_out();
    elseif (myproc ==4)
         Wake_the_kids_up();
    end


Today’s Star-P implementation of task-parallelism is narrower than this, focused on performing embarrassingly parallel tasks on non-overlapping sections of distributed arrays. You might run into this as a set of images stored in a single large array, where you need to do a 2D FFT on each image. This doesn’t quite fit the definition of the data-parallel operations, as they assume that the operation will be applied to the array as a whole, not to portions of it. Original MATLAB® code for this might look like:

    % images is an N x N x #images array
    for i=1:size(images,3)
          fft_images(:,:,i) = fft2(images(:,:,i));
    end


It’s pretty easy to see that this loop doesn’t have any dependencies between iterations and could be correctly executed in parallel – each invocation of fft2 is using a different plane of images and creating a different plane of fft_images, and has no dependencies on other planes of either. To do this with Star-P, you’ll need to use the ppeval construct, modeled on the MATLAB cellfun construct, which takes as its input the name of a function to execute and the arguments to that function. The simplest Star-P implementation of this would be

    fft_images = ppeval(‘fft2’,images);

For a more complicated loop body, you would create a new function with just the body of the for loop, i.e.

    function y = dummy(x)
    y = fft2(x);


and then execution of that function as

    fft_images = ppeval(‘dummy’,images);

The ppeval construct splits its input array(s) along their last non-singleton dimension (by default) and passes them to the specified function, automatically mapping the size in the last dimension to the number of actual threads. Scalars are broadcast to all threads. [Note that all these examples have the dummy function written in the M language, but it’s also possible to have the dummy function as a serial C/C++ function.]

Another simple example is to take the 1-norm of a matrix, which is defined as the largest column sum of the matrix, in parallel (if this were not already implemented in Star-P). This could be implemented in the M language as

norm1 = norm(a, 1);


A parallel implementation with ppeval on the distributed array a could be done as

norm1 = max(ppeval(‘norm’, a, 1));

In this example, the return value from ppeval will be a vector (of the scalar values from each “iteration”), which will then be reduced by max to a scalar.


The ppeval construct is similar to the parfor construct in The MathWorks’ Distributed Computing Toolbox. It also bears some resemblance to the parallel for loop in OpenMP, though OpenMP is typically implemented with a dynamic work-sharing model, whereas ppeval does static work-sharing.

While a fair number of codes readily map to the current ppeval, we realize it could be generalized in several dimensions:

  • The mechanical step of creating the function (dummy above) could be eliminated by having a directive denote the parallelism, similar to the way directives are used in OpenMP programs
  • The required absence of any communication or synchronization between the iterations could be relaxed.
  • The fixed allocation of “iterations” of ppeval to threads could be made dynamic, giving better load balance where the computational cost of iterations is highly variable
  • The work-sharing of a single task and its data across all threads could be generalized to share several tasks and their data
  • The requirement of non-overlapping slices of input arrays could be relaxed
  • The iterations could be automatically blocked to the number of active cores in the Star-P session at the time of execution

    If your code doesn’t map well to ppeval, what are the obstacles?
    Would one of these generalizations enable it to run in parallel?

The basic ppeval construct can actually do some pretty sophisticated tasks. One case we’ve run into several times is that of assembling the stiffness matrix for a finite-element calculation, which can often be time-consuming and need parallelization. The stiffness matrix is usually very sparse, so code for this might look like

    stiffness = spalloc(dof,dof,0);
    for it=1:numel(elements)
        % create i, j, and v as row and column indices
        % and value, respectively, of the new stiffness
        % matrix elements
        stiffness(i,j) = stiffness(i,j) + v;
    end


Since the new elements of stiffness are scattered throughout, it doesn’t seem to meet the independence requirement of ppeval, but it actually turns out to be runnable in parallel. First, a quick reminder that the MATLAB code
    sparse_matrix = sparse(i,j,v);
    [ii jj vv] = find(sparse_matrix);

illustrates that sparse and find are nearly opposites of each other; “nearly” because sparse will combine the values of any duplicated i/j pairs it encounters. The values of i and j are the row and column indices, respectively, of the nonzero elements in the matrix and the values of v are those nonzero values themselves. In this example, barring duplicates, i, j, and v will equal ii, jj, and vv, respectively.

To parallelize the for loop just above, let’s think about each finite element creating a few entries in the stiffness matrix, and then aggregating those entries at the end.  First we’ll create our new dummy subroutine, in this case

     function [i j v] = dummy(elements, dof, ents_per_elem)
     stiffness = spalloc(dof,dof,0);
         for ii=1:ents_per_elem
              jj = column(i);
              …
              stiffness(ii,jj) = stiffness(ii,jj) + value;
         end
     end

     [i j v] = find(stiffness);
     return

Then, the original loop can be written as:

    [i j v] = ppeval(‘dummy’, elements,dof,ents_per_elem);

    stiffness = sparse(i,j,v);


This results in parallel code to assemble the stiffness matrix, which is conceptually not a simple thing to parallelize. In practice, you’d probably want to block this up so that each processor core would get a group of elements and return a single local stiffness matrix, but that’s a relatively straightforward optimization.

    Would you be willing to change your code this much to get it to run in parallel?
    If not, how would you want the Star-P constructs to be different?
    And what type of problem are you solving?

Summary: Existing task-parallel constructs provide effective means to parallelize some types of loops, and with ingenuity can be powerful.



Article has 0 comments. Click To Read/Write Comments

Large Graphs in Star-P

Posted by Viral Shah



Many of you use Star-P, Matlab, Python, R, among other languages for numerical computing. Often, in many scientific applications, the behavior of individual elements of a system is explained with simple rules. However, when assembled into a system, they collectively exhibit complex behavior. For example, Newton wrote down the equation for the gravitational attraction between two particles, and it is deceptively simple. But, interesting things happen when a whole bunch of these interact with each other. Feel free to post your own examples in the comments.

This blog entry is not about n-body simulations, but about combinatorial computing augmenting numerical computing. At a basic level, graphs are a simple way to capture relationships between elements in a system. In a dynamic system, this graph may also evolve over time.

Most languages for numerical computing (including Star-P) support sparse matrices as first class citizens. It turns out, that sparse matrices and graphs are one and the same. So, its quite convenient to represent graphs as sparse matrices. This approach has many benefits; you don't have to write new data structures to store graphs, and algorithms to operate on graphs. Sparse matrix operations can be used to implement graph algorithms !

Lets say, we have a collection of edges (u, v, w) - There is an edge of weight w between nodes u and v. One way to store this graph is to just store the tuples (u, v, w) in dense vectors. However, not much can be done with a graph in such a data structure. To convert this into a sparse matrix, we use the sparse() command:

>> G = sparse (u, v, w)

An interesting thing to do immediately at this point is to visualize the graph with the spy command. Lets create a random graph for the purpose of illustration.

>> n = 1000;                         % Use n = 1000*p in Star-P
>> U = ceil(n*rand(10*n,1));   % from nodes
>> V = ceil(n*rand(10*n,1));   % to nodes
>> G = sparse(U, V, 1, n, n);  % n-node graph with ~10n edges

So far, we have created a graph with 1,000 nodes and 10,000 random edges between them. We use spy and spyy for visualization. spy is a Matlab built-in which displays every nonzero in a sparse matrix (every edge in a graph) as a dot. Clearly, when the graphs are very large, with tens of millions of nodes and edges, such visualization becomes meaningless. spyy is a 2D histogram of nonzero/edge densities.

>> spy (G)                     % in Matlab



>> spyy (G)                    % in Matlab or Star-P
Now, here's why all this is so cool. You can create distributed graphs in Star-P from edge tuples stored in distributed dense vectors. In other words, u, v, and w can be distributed dense vectors, and G will be a distributed sparse matrix, or a distributed graph. This idea can be developed further to do lots of interesting things. If this is something interesting, email me, or leave comments, and I will post future blog entries on the subject.



Article has 0 comments. Click To Read/Write Comments

Previous Page | Next Page

 
 

Subscribe by Email

Your email:
 
 

Latest Posts

 
 

Browse by Tag

 
 

Most Popular Posts