Star-P Blog
    

Parallel Lounge: Parallel Computing Blog for Engineers, Scientists, Analysts

Current Articles | RSS Feed RSS Feed

Types of Parallelism: Task Parallelism

Digg digg it | Reddit reddit | del.icio.us del.icio.us | StumbleUpon StumbleUpon 

Last time we looked at data parallelism, which is in some ways a restricted form of parallelism. This time I’ll address task parallelism, which is conceptually a very broad form of parallelism. Star-P currently implements a subset of complete task parallelism, where the “iterations” of a parallel loop have no dependencies or communication among them.

In its full glory, task-parallelism allows you to do completely different tasks in parallel. For instance (in a non-Star-P pseudocode), my morning ritual might look like
    if (myproc==1)
        Get_the_newspaper();
    elseif (myproc ==2)
         Make_breakfast();
         Eat_breakfast();
    elseif (myproc == 3)
         Let_the_dogs_out();
    elseif (myproc ==4)
         Wake_the_kids_up();
    end


Today’s Star-P implementation of task-parallelism is narrower than this, focused on performing embarrassingly parallel tasks on non-overlapping sections of distributed arrays. You might run into this as a set of images stored in a single large array, where you need to do a 2D FFT on each image. This doesn’t quite fit the definition of the data-parallel operations, as they assume that the operation will be applied to the array as a whole, not to portions of it. Original MATLAB® code for this might look like:

    % images is an N x N x #images array
    for i=1:size(images,3)
          fft_images(:,:,i) = fft2(images(:,:,i));
    end


It’s pretty easy to see that this loop doesn’t have any dependencies between iterations and could be correctly executed in parallel – each invocation of fft2 is using a different plane of images and creating a different plane of fft_images, and has no dependencies on other planes of either. To do this with Star-P, you’ll need to use the ppeval construct, modeled on the MATLAB cellfun construct, which takes as its input the name of a function to execute and the arguments to that function. The simplest Star-P implementation of this would be

    fft_images = ppeval(‘fft2’,images);

For a more complicated loop body, you would create a new function with just the body of the for loop, i.e.

    function y = dummy(x)
    y = fft2(x);


and then execution of that function as

    fft_images = ppeval(‘dummy’,images);

The ppeval construct splits its input array(s) along their last non-singleton dimension (by default) and passes them to the specified function, automatically mapping the size in the last dimension to the number of actual threads. Scalars are broadcast to all threads. [Note that all these examples have the dummy function written in the M language, but it’s also possible to have the dummy function as a serial C/C++ function.]

Another simple example is to take the 1-norm of a matrix, which is defined as the largest column sum of the matrix, in parallel (if this were not already implemented in Star-P). This could be implemented in the M language as

norm1 = norm(a, 1);


A parallel implementation with ppeval on the distributed array a could be done as

norm1 = max(ppeval(‘norm’, a, 1));

In this example, the return value from ppeval will be a vector (of the scalar values from each “iteration”), which will then be reduced by max to a scalar.


The ppeval construct is similar to the parfor construct in The MathWorks’ Distributed Computing Toolbox. It also bears some resemblance to the parallel for loop in OpenMP, though OpenMP is typically implemented with a dynamic work-sharing model, whereas ppeval does static work-sharing.

While a fair number of codes readily map to the current ppeval, we realize it could be generalized in several dimensions:

  • The mechanical step of creating the function (dummy above) could be eliminated by having a directive denote the parallelism, similar to the way directives are used in OpenMP programs
  • The required absence of any communication or synchronization between the iterations could be relaxed.
  • The fixed allocation of “iterations” of ppeval to threads could be made dynamic, giving better load balance where the computational cost of iterations is highly variable
  • The work-sharing of a single task and its data across all threads could be generalized to share several tasks and their data
  • The requirement of non-overlapping slices of input arrays could be relaxed
  • The iterations could be automatically blocked to the number of active cores in the Star-P session at the time of execution

    If your code doesn’t map well to ppeval, what are the obstacles?
    Would one of these generalizations enable it to run in parallel?

The basic ppeval construct can actually do some pretty sophisticated tasks. One case we’ve run into several times is that of assembling the stiffness matrix for a finite-element calculation, which can often be time-consuming and need parallelization. The stiffness matrix is usually very sparse, so code for this might look like

    stiffness = spalloc(dof,dof,0);
    for it=1:numel(elements)
        % create i, j, and v as row and column indices
        % and value, respectively, of the new stiffness
        % matrix elements
        stiffness(i,j) = stiffness(i,j) + v;
    end


Since the new elements of stiffness are scattered throughout, it doesn’t seem to meet the independence requirement of ppeval, but it actually turns out to be runnable in parallel. First, a quick reminder that the MATLAB code
    sparse_matrix = sparse(i,j,v);
    [ii jj vv] = find(sparse_matrix);

illustrates that sparse and find are nearly opposites of each other; “nearly” because sparse will combine the values of any duplicated i/j pairs it encounters. The values of i and j are the row and column indices, respectively, of the nonzero elements in the matrix and the values of v are those nonzero values themselves. In this example, barring duplicates, i, j, and v will equal ii, jj, and vv, respectively.

To parallelize the for loop just above, let’s think about each finite element creating a few entries in the stiffness matrix, and then aggregating those entries at the end.  First we’ll create our new dummy subroutine, in this case

     function [i j v] = dummy(elements, dof, ents_per_elem)
     stiffness = spalloc(dof,dof,0);
         for ii=1:ents_per_elem
              jj = column(i);
              …
              stiffness(ii,jj) = stiffness(ii,jj) + value;
         end
     end

     [i j v] = find(stiffness);
     return

Then, the original loop can be written as:

    [i j v] = ppeval(‘dummy’, elements,dof,ents_per_elem);

    stiffness = sparse(i,j,v);


This results in parallel code to assemble the stiffness matrix, which is conceptually not a simple thing to parallelize. In practice, you’d probably want to block this up so that each processor core would get a group of elements and return a single local stiffness matrix, but that’s a relatively straightforward optimization.

    Would you be willing to change your code this much to get it to run in parallel?
    If not, how would you want the Star-P constructs to be different?
    And what type of problem are you solving?

Summary: Existing task-parallel constructs provide effective means to parallelize some types of loops, and with ingenuity can be powerful.

Posted by Steve Reinhardt

COMMENTS

Currently, there are no comments. Be the first to post one!
Post Comment
Name
 *
Email
 *
Website (optional)
Comment
 *

Allowed tags: <a> link, <b> bold, <i> italics

Receive email when someone replies.
 
 

Subscribe by Email

Your email:
 
 

Latest Posts

 
 

Browse by Tag

 
 

Most Popular Posts