Last time I
mentioned that there were several dimensions in which the current ppeval mechanism could be generalized
and become more useful. We’ve thought
about stencil operations, that is,
operations that use a specific set of elements of an array repeatedly, and how
to extend task parallelism to support them.
[NOTE: These are just ideas, and
do not exist in the current Star-P and may never exist in a future Star-P
either. The new directives in the
example below are in blue to denote “blue sky” thinking.]
The following figure shows two example stencil operations on a 2D rectangular grid (in blue). The first, in pink, is known as a five-point stencil, and involves an element of the array and its north, south, east, and west neighbors.

The MATLAB® code below expresses a simple averaging operation, where the element itself is weighted as heavily as the sum of the four neighboring elements.
newx(i,j)= 4*x(i,j) + x(i-1,j) + x(i,j-1) + x(i,j+1)+ x(i+1,j)
Those neighboring elements are sometimes known as a
halo around the central element. Typically the averaging or weighting operation would be applied to each element in the array, with some special handling around the boundary to avoid indexing out of bounds. The second example, in green, is known as a nine-point stencil, and additionally involves the northwest, northeast, southeast, and southwest neighbors. The code uses the MATLAB
blkproc function, which implements stencil operations itself via the supplied function
fn.
newy = blkproc(x,[3 3],fn)
with fn defined in this case as:
function z = fn(x)
z = sum(x(:))+7*x(2,2);
You can mentally extend these stencils to their 3D analogues
as well.
- Are stencils important to your codes?
- Does this description of stencils cover your usage, or are there other angles that I’ve missed?
The very fact of the stencil, or neighborhood, of elements is what requires some extra thought when it comes to parallelizing this construct. For parallelism to accelerate the operation, some of the elements of the array will need to reside on the memory associated with each processor core. When that core gets to the boundary elements, it will need to use neighboring elements that reside on another core. How does it get access to those elements, and how does it know that it is getting the most recent values of those elements? These are the hard points in parallelizing this construct.
The following example illustrates both these hard points. It’s a perhaps-nonsensical extension of the MATLAB conv2 routine to include an extra argument that denotes the number of convolution steps to take. conv2 calculates the convolution of each element of its first argument a with the neighborhood represented by its second argument b. A simplified version of this code, that doesn’t scrupulously account for all end-cases, is shown following.
function z = conv2(a,b,k)
[an am] = size(a);
[bn bm] = size(b);
bn = fix(bn/2); bm = fix(bm/2);
oldz = a;
for ii=1:k
for i=1:an
for j=1:am %ignore out-of-bounds refs to oldz
z(i,j) = oldz(i-bn:i+bn,j-bm:j+bm).*b;
end
end
oldz = z;
end
Annotating stencils for parallel executionWe want the
i and
j loops to run in parallel, but the
ii loop cannot, as it depends on the results of the prior convolution. What is the minimum we could add to this code sequence to make it run correctly and fast, in parallel? For simplicity of expression, let’s assume we have directives that can express what we need, instead of having to rewrite the code itself. It’s clear we need something to denote that the
i/j loops should run in parallel, but that would be true even if there were no stencil operation inside them. The added wrinkle of the stencil is that we need some way to ensure that the “
oldz = z” line results in each core getting the latest value of z from its neighbors before proceeding to the next iteration of the ii loop. One could imagine an update directive that would provide this functionality. The resulting code might look something like (as-yet-unimplemented directives in
blue): function z = conv2(a,b,k)
[an am] = size(a);
[bn bm] = size(b);
bn = fix(bn/2); bm = fix(bm/2);
oldz = a;
%%$starp tparallel bcast(b), halo(oldz,[bn bm])
for ii=1:k
for i=1:an
for j=1:am %ignore out-of-bounds refs to oldz
z(i,j) = oldz(i-bn:i+bn,j-bm:j+bm).*b;
end
end
oldz = z;
%%$starp update(oldz)
end
%%$starp tparallel end
Note that in this specific case, the Star-P version of the
conv2 routine would have this parallelism implemented internally, so you as a user wouldn’t have to think about it. But obviously you might have your own code where you need to implement something similar.
- Do you see this as a reasonable interface for changing your code to specify parallelism?
- Does this overlook angles that need to be addressed?
Summary: We discussed one means of extending the current ppeval mechanism for a specific type of fixed-pattern computations.