In a widely applauded article published
in 1973 and entitled, “Global variable considered harmful,” Bill Wulf and Mary
Shaw argued, “We claim that the non-local variable is a major contributing
factor in programs which are difficult to understand.” Thirty-five years later, however, nonlocal
variables are still very much in vogue in production code. Moreover, as software developers look toward
multicore-enabling their legacy codebases, nonlocal variables pose a
significant obstacle to parallelization, because they can cause race bugs, a
topic I discussed in an
earlier blog.
In this blog, I’ll discuss some of the strategies for coping with global
and other nonlocal variables when parallelizing software.
The Lure of Nonlocal Variables
To begin with, what were Wulf and Shaw
concerned about, and why are nonlocal variables nevertheless so prevalent? To be clear, by nonlocal variable, I mean one that is
declared outside of the scope in which it is used. A global variable
is a nonlocal variable declared in the outermost program scope.
Wulf and Shaw were concerned that when
a variable is used far away from its definition, a local inspection of the code
cannot easily determine its meaning.
They identified several reasons for eschewing nonlocal variables. Among them, they observed that if a function
modifies the nonlocal variable, this side-effect can be difficult to reason
about. In their words, “Untold confusion
can result when the consequences of executing a procedure cannot be determined
at the site of the procedure call.”
The clear alternative to referencing
nonlocal variables is to reference local ones, which one can do by passing the
variable as an argument to function calls.
The problem with this approach is that it leads to parameter proliferation — long argument
lists to functions for passing numerous, frequently used variables. For every global variable referenced in a
function, an extra parameter would have to be declared to the function to pass
that variable. Not only would that lead
to functions with dozens of extra arguments, there also the cost of passing the
arguments to consider.
As a practical matter, nonlocal
variables are just damn convenient. If I
have a code operating on a large data structure, why should I have to pass the
data structure to each and every function that operates on the data
structure? It’s much more convenient to
keep the function interfaces clean with fewer parameters and simply refer to
the data structure as a global variable or, commonly in object-oriented programming,
as a nonlocal member variable.
A Real-world Example
Parallel programming exacerbates the
problems with nonlocal variables, however, because they can cause races. To illustrate the issues, let me take an
example from one of Cilk Arts’s customers who supplies 3D computer-aided design
software. They represent a large
mechanical assembly as a tree of subassemblies down to the individual parts,
which may be complex geometric objects.
For example, the first few levels of a pickup truck might be represented
something like this:

One of the operations this customer’s product does is collision
detection, which involves listing all the parts in an assembly that intersect a
given target object. For example, to
compute a cutaway, the target might be a plane through 3-space. I’ve abstracted such a code below:
Node *target;
std::list<Node *>
output_list;
...
void walk(Node *x)
{
switch (x->kind) {
case Node::LEAF:
if (target->collides_with(x))
{
output_list.push_back(x);
}
break;
case Node::INTERNAL:
for (Node::const_iterator child =
x.begin();
child != x.end();
++child)
{
walk(child);
}
break;
}
}
In this example, the walk function starts at the root of the assembly and
recursively visits all the subassemblies.
If it encounters an internal node, it recursively walks the
subassemblies. If it encounters a leaf
(a part), it tests whether the part intersects the target object, and if it does, it appends the
part to the global variable output_list. At
the end of the computation, output_list
contains all the parts that intersect target.
On the surface, this code is easy to parallelize using
Cilk++. The programmer simply needs to
replace the for
keyword with cilk_for,
which allows each of the children to be walked in parallel:
Node *target;
std::list<Node *>
output_list;
...
void walk(Node *x)
{
switch (x->kind) {
case Node::LEAF:
if (target->collides_with(x))
{
output_list.push_back(x); // race bug
}
break;
case Node::INTERNAL:
cilk_for (Node::const_iterator child = x.begin();
child != x.end();
++child)
{
walk(child);
}
break;
}
}
Unfortunately, the output_list global variable causes a race in the line
indicated. When two branches of the
computation try to update output_list, they may collide. (See my
blog on determinacy races.)
Solution #1: Locking
One way to avoid the data race is to use mutual-exclusion locks, or
mutexes, to ensure that
only one subcomputation is appending to output_list
at a time (modified code is in red):
Node *target;
std::list<Node *>
output_list;
cilk::cilk_mutex
output_list_mutex;
...
Void walk(Node *x)
{
switch (x->kind) {
case Node::LEAF:
if (target->collides_with(x))
{
output_list_mutex.lock();
output_list.push_back(x);
output_list_mutex.unlock();
}
break;
case Node::INTERNAL:
cilk_for (Node::const_iterator
child = x.begin();
child != x.end();
++child)
{
walk(child);
}
break;
}
}
This code is now correct, but it may exhibit an adverse performance
problem. If many parts must be added to output_list,
the code may be slowed down by lock
contention as multiple processors all vie for access to the
mutex and then serialize their accesses to the update code. Although this kind of “hotspot” can
drastically curtail parallelism, the code modifications are fairly local and
straightforward, if you can find all the data races.
Solution #2: Refactor the code
Another strategy to solve the race on the global variable output_list is to refactor the code to
eliminate nonlocal references. The idea
is to construct portions of the output locally and combine them on the way up
the call tree, passing the partial results as a returned value or an output
parameter of the walk function. Although walk is
a void function, the following code illustrates the approach using an output
parameter, because that’s the more general way to solve the problem when the
function already returns a value or if there are more than one nonlocal
variables to eliminate:
Node *target;
std::list<Node *>
output_list;
...
void walk(Node *x, std::list<Node *>
o_list)
{
switch (x->kind) {
case Node::LEAF:
if (target->collides_with(x))
{
o_list.push_back(x);
}
break;
case Node::INTERNAL:
std::vector<std::list<Node *> >
children_list(x.num_children);
cilk_for (Node::const_iterator child = x.begin();
child != x.end();
++child)
{
walk(child, children_list[child]);
}
for (int i=0; i < x.num_children; ++i)
{
o_list.splice(
o_list.end(),
children_list[i]
);
}
break;
}
}
In this code, a local array children_list
is constructed to hold the output lists from each of the children. An array is necessary, rather than reusing a
single child list for all of the children, because otherwise a race might occur
as the children are computing their results in parallel. After the parallel recursive walk produces
all the output lists from all of the children, a serial for loop gathers all of the results into the output paramenter o_list.
The initial call to the walk of the entire assembly passes the global
variable output_list as its
second parameter.
Although the refactoring solution incurs some overhead due
to parameter passing, it generally produces code with predictably good
performance. Moreover, unlike the
locking solution, it constructs output_list
with elements in the same order as in the serial execution. In contrast, locking may cause the elements
of output_list to
appear in a jumbled order.
Unfortunately, however, refactoring can mean revising the
logic of substantial sections of code, and the prospect of refactoring a large
codebase is daunting. Should you think
you’re safe because you don’t use recursion or lots of nested function calls,
think again! The same problem can arise
when the body of a parallel loop accesses variables declared outside the loop.
Solution #3: Hyperobjects
Cilk++ provides a novel mechanism called hyperobjects, which mitigate
data races on nonlocal variables without the performance problems endemic to
locking or the need for code restructuring — the best of both worlds. The basic idea is that different parallel
branches of the computation may see different views of the hyperobject, but combined
they provide a single, consistent view.
The following code uses a particular kind of hyperobject called a reducer:
Node *target;
cilk::hyper_ptr<cilk::reducer_list_append<Node *> > output_list;
...
void walk(Node *x)
{
switch (x->kind) {
case Node::LEAF:
if (target->collides_with(x))
{
(*output_list).push_back(x);
// Or use ->
}
break;
case Node::INTERNAL:
cilk_for (Node::const_iterator child = x.begin();
child != x.end();
++child)
{
walk(child);
}
break;
}
}
A reducer over a type T
is a hyperobject for which T’s member
function reduce() implements a (usually
associative) binary operator ⊗ and its constructor implements an identity
e. In our example code, the reduce() function for reducer_list_append implements
list concatenation, which is an associative operation: if A, B, and C are lists
and ⊗ is list concatenation, we always have (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C) .
A declaration hyper_ptr<T>
foo instantiates foo as a hyperpointer to a reducer
of type T, which is initialized to e. In our code, output_list
is a hyperpointer to a list, which is initially empty.
A dereference *foo
produces a local view of type T
of the reducer. Views may differ from
point to point in the program and are combined using reduce()
to produce a value for the global reducer hyperobject. In our code, dereferencing *output_list yields a local view
of output_list, to which the object x is appended. At the end of the computation, output_list contains all the list
elements, and as in the refactoring solution, they appear in exactly the same
order as they are in the list produced by a serial execution.
The reducer solution implemented in Cilk++ has low
overhead. Dereferencing costs about the
same as a null function call, and there is no overhead for calling or spawning
functions. As long as there is enough
parallelism in the application, the reduce()
function, which is invoked as an up-call by the Cilk++ runtime system to
combine views, is executed relatively infrequently. Specifically, the number of calls to reduce() is at most proportional
to the span of the computation (see my
blog on parallelism).
The Cilk++ distribution includes a library for common
reducers, including list concatenation, sum, minimum, maximum, and logical
operations. Moreover, programmers can
write their own. Cilk Arts is also
developing other kinds of hyperobjects.
We’ll keep you informed about Cilk Arts innovations in this blog and on the Cilk Arts webpage.