Subscribe Via Email

Your email:


Multicore Programming Blog

Current Articles | RSS Feed RSS Feed

Global Variable Reconsidered

Posted by Charles Leiserson on Sat, Jun 21, 2008
 | Digg digg it | Reddit reddit | del.icio.us del.icio.us | StumbleUpon StumbleUpon 

In a widely applauded article published in 1973 and entitled, “Global variable considered harmful,” Bill Wulf and Mary Shaw argued, “We claim that the non-local variable is a major contributing factor in programs which are difficult to understand.”  Thirty-five years later, however, nonlocal variables are still very much in vogue in production code.  Moreover, as software developers look toward multicore-enabling their legacy codebases, nonlocal variables pose a significant obstacle to parallelization, because they can cause race bugs, a topic I discussed in an earlier blog.  In this blog, I’ll discuss some of the strategies for coping with global and other nonlocal variables when parallelizing software.

The Lure of Nonlocal Variables 

To begin with, what were Wulf and Shaw concerned about, and why are nonlocal variables nevertheless so prevalent?  To be clear, by nonlocal variable, I mean one that is declared outside of the scope in which it is used.  A global  variable is a nonlocal variable declared in the outermost program scope.

Wulf and Shaw were concerned that when a variable is used far away from its definition, a local inspection of the code cannot easily determine its meaning.  They identified several reasons for eschewing nonlocal variables.  Among them, they observed that if a function modifies the nonlocal variable, this side-effect can be difficult to reason about.  In their words, “Untold confusion can result when the consequences of executing a procedure cannot be determined at the site of the procedure call.”

The clear alternative to referencing nonlocal variables is to reference local ones, which one can do by passing the variable as an argument to function calls.  The problem with this approach is that it leads to parameter proliferation — long argument lists to functions for passing numerous, frequently used variables.  For every global variable referenced in a function, an extra parameter would have to be declared to the function to pass that variable.  Not only would that lead to functions with dozens of extra arguments, there also the cost of passing the arguments to consider.

As a practical matter, nonlocal variables are just damn convenient.  If I have a code operating on a large data structure, why should I have to pass the data structure to each and every function that operates on the data structure?  It’s much more convenient to keep the function interfaces clean with fewer parameters and simply refer to the data structure as a global variable or, commonly in object-oriented programming, as a nonlocal member variable.

A Real-world Example 

Parallel programming exacerbates the problems with nonlocal variables, however, because they can cause races.  To illustrate the issues, let me take an example from one of Cilk Arts’s customers who supplies 3D computer-aided design software.  They represent a large mechanical assembly as a tree of subassemblies down to the individual parts, which may be complex geometric objects.  For example, the first few levels of a pickup truck might be represented something like this:


One of the operations this customer’s product does is collision detection, which involves listing all the parts in an assembly that intersect a given target object.  For example, to compute a cutaway, the target might be a plane through 3-space.  I’ve abstracted such a code below:

Node *target;

std::list<Node *> output_list;

...

void walk(Node *x)

{

  switch (x->kind) {

  case Node::LEAF:

    if (target->collides_with(x))

      {

        output_list.push_back(x);

      }

    break;

  case Node::INTERNAL:

    for (Node::const_iterator child = x.begin();

                child != x.end();

                ++child)

      {

        walk(child);

      }

    break;

  }

}

In this example, the walk function starts at the root of the assembly and recursively visits all the subassemblies.  If it encounters an internal node, it recursively walks the subassemblies.  If it encounters a leaf (a part), it tests whether the part intersects the target object, and if it does, it appends the part to the global variable output_list.  At the end of the computation, output_list contains all the parts that intersect target. 

On the surface, this code is easy to parallelize using Cilk++.  The programmer simply needs to replace the for keyword with cilk_for, which allows each of the children to be walked in parallel:

Node *target;

std::list<Node *> output_list;

...

void walk(Node *x)

{

  switch (x->kind) {

  case Node::LEAF:

    if (target->collides_with(x))

      {

        output_list.push_back(x); // race bug

      }

    break;

  case Node::INTERNAL:

    cilk_for (Node::const_iterator child = x.begin();

                child != x.end();

                ++child)

      {

        walk(child);

      }

    break;

  }

}

Unfortunately, the output_list global variable causes a race in the line indicated.  When two branches of the computation try to update output_list, they may collide.  (See my blog on determinacy races.)

Solution #1: Locking 

One way to avoid the data race is to use mutual-exclusion locks, or mutexes, to ensure that only one subcomputation is appending to output_list at a time (modified code is in red):

Node *target;

std::list<Node *> output_list;

cilk::cilk_mutex output_list_mutex;

...

Void walk(Node *x)

{

  switch (x->kind) {

  case Node::LEAF:

    if (target->collides_with(x))

      {

        output_list_mutex.lock();

        output_list.push_back(x);

        output_list_mutex.unlock();

      }

    break;

  case Node::INTERNAL:

    cilk_for (Node::const_iterator child = x.begin();

                child != x.end();

                ++child)

      {

        walk(child);

      }

    break;

  }

}

This code is now correct, but it may exhibit an adverse performance problem.  If many parts must be added to output_list, the code may be slowed down by lock contention as multiple processors all vie for access to the mutex and then serialize their accesses to the update code.  Although this kind of “hotspot” can drastically curtail parallelism, the code modifications are fairly local and straightforward, if you can find all the data races.

Solution #2: Refactor the code

Another strategy to solve the race on the global variable output_list is to refactor the code to eliminate nonlocal references.  The idea is to construct portions of the output locally and combine them on the way up the call tree, passing the partial results as a returned value or an output parameter of the walk function.  Although walk is a void function, the following code illustrates the approach using an output parameter, because that’s the more general way to solve the problem when the function already returns a value or if there are more than one nonlocal variables to eliminate:

Node *target;

std::list<Node *> output_list;

...

void walk(Node *x, std::list<Node *> o_list)

{

  switch (x->kind) {

  case Node::LEAF:

    if (target->collides_with(x))

      {

        o_list.push_back(x);

      }

    break;

  case Node::INTERNAL:

    std::vector<std::list<Node *> > children_list(x.num_children);

    cilk_for (Node::const_iterator child = x.begin();

                child != x.end();

                ++child)

      {

        walk(child, children_list[child]);

      }

 

    for (int i=0; i < x.num_children; ++i)

      {

        o_list.splice(

               o_list.end(),

               children_list[i]

        );

      }

    break;

  }

}

In this code, a local array children_list is constructed to hold the output lists from each of the children.  An array is necessary, rather than reusing a single child list for all of the children, because otherwise a race might occur as the children are computing their results in parallel.  After the parallel recursive walk produces all the output lists from all of the children, a serial for loop gathers all of the results into the output paramenter o_list.  The initial call to the walk of the entire assembly passes the global variable output_list as its second parameter.

Although the refactoring solution incurs some overhead due to parameter passing, it generally produces code with predictably good performance.  Moreover, unlike the locking solution, it constructs output_list with elements in the same order as in the serial execution.  In contrast, locking may cause the elements of output_list to appear in a jumbled order.

Unfortunately, however, refactoring can mean revising the logic of substantial sections of code, and the prospect of refactoring a large codebase is daunting.  Should you think you’re safe because you don’t use recursion or lots of nested function calls, think again!  The same problem can arise when the body of a parallel loop accesses variables declared outside the loop.

Solution #3: Hyperobjects 

Cilk++ provides a novel mechanism called hyperobjects, which mitigate data races on nonlocal variables without the performance problems endemic to locking or the need for code restructuring — the best of both worlds.  The basic idea is that different parallel branches of the computation may see different views of the hyperobject, but combined they provide a single, consistent view.  The following code uses a particular kind of hyperobject called a reducer:

Node *target;

cilk::hyper_ptr<cilk::reducer_list_append<Node *> > output_list;

...

void walk(Node *x)

{

  switch (x->kind) {

  case Node::LEAF:

    if (target->collides_with(x))

      {

        (*output_list).push_back(x);   // Or use ->

      }

    break;

  case Node::INTERNAL:

    cilk_for (Node::const_iterator child = x.begin();

                child != x.end();

                ++child)

      {

        walk(child);

      }

    break;

  }

}

A reducer over a type T is a hyperobject for which T’s member function reduce() implements a (usually associative) binary operator ⊗ and its constructor implements an identity e.  In our example code, the reduce() function for reducer_list_append implements list concatenation, which is an associative operation: if A, B, and C are lists and ⊗ is list concatenation, we always have (A ⊗ B) ⊗ C = A ⊗ (B ⊗ C) . 

A declaration hyper_ptr<T> foo instantiates foo as a hyperpointer to a reducer of type T, which is initialized to e.  In our code, output_list is a hyperpointer to a list, which is initially empty. 

A dereference *foo produces a local view of type T of the reducer.  Views may differ from point to point in the program and are combined using reduce() to produce a value for the global reducer hyperobject.  In our code, dereferencing *output_list yields a local view of output_list, to which the object x is appended.  At the end of the computation, output_list contains all the list elements, and as in the refactoring solution, they appear in exactly the same order as they are in the list produced by a serial execution. 

The reducer solution implemented in Cilk++ has low overhead.  Dereferencing costs about the same as a null function call, and there is no overhead for calling or spawning functions.  As long as there is enough parallelism in the application, the reduce() function, which is invoked as an up-call by the Cilk++ runtime system to combine views, is executed relatively infrequently.  Specifically, the number of calls to reduce() is at most proportional to the span of the computation (see my blog on parallelism). 

The Cilk++ distribution includes a library for common reducers, including list concatenation, sum, minimum, maximum, and logical operations.  Moreover, programmers can write their own.  Cilk Arts is also developing other kinds of hyperobjects.  We’ll keep you informed about Cilk Arts innovations in this blog and on the Cilk Arts webpage.

0 Comments Click here to Read/write comments

Why and When Can Start-ups Overcome Incumbents?

Posted by Duncan McCallum on Fri, Jun 20, 2008
 | Digg digg it | Reddit reddit | del.icio.us del.icio.us | StumbleUpon StumbleUpon 

Last month, I was invited to speak to the Society for Information Management – Advanced Practices Council (SIM-APC) by Professor Lynda Applegate, one of my teachers at Harvard Business School.  SIM-APC is an exclusive forum of 40 CIO’s who value directing and applying pragmatic research; exploring emerging IT issues in-depth; and hearing different, global perspectives from colleagues in other industries.

The session, “Emerging Information Technologies and Business Innovation” was geared towards helping CIO’s think about which new technologies to bet on, and best practices for applying new technologies to business problems.   The invitation gave me a chance to reflect on, and speak about, what I’ve learned in my 20+ years in technology as a developer, business user, venture capitalist, and entrepreneur.  It was a ton of fun to talk to these super heavy-hitting tech exec’s about innovation – one of my passions.

My talk addressed three questions.

1) how should a customer think about picking between an established vendor and a new start-up in a given technology area?

2) How can a customer (or partner, or investor) judge the likelihood that a fledging company will thrive?   What are the key success factors for a new startup?

3) What are the “best practices” in building and managing a new technology company?

 In this note, I focus on the first question…incumbents vs. start-ups.

Sustaining vs. Disruptive Innovations

Purchasers of new software face a difficult set of questions when a new technology (like multicore software) emerges.  Should the CIO, VP Engineering, or whoever bet on an upstart new vendor, or stick with an existing, incumbent vendor?  The answer lies in considering the type of innovation that is driving the opportunity.    I believe established vendors are best for “sustaining innovations”.   New companies are best for “disruptive innovations.”  I’ll explain. (I am borrowing from the ideas of Prof. Clay Christensen.  His book “Innovator’s Dilemma” is one of my favorite books on innovation.)

Let us consider a typical scenario.   A new product category arises in the market – propelled by a change in technology (e.g., multicore computing) or a new business requirement (e.g., Sarbanes-Oxley).   In this typical scenario, at least one new company is launched to serve the new need.   Lets call this new company HOT (Hyper-charged Original Technology).    In the other corner is SAFE, Inc. (Successful And Financially Established Incumbent), a large and well-established company.   When should you bet on HOT?   When should you play it SAFE?

I’ll start by acknowledging that, while I have a bias towards startups (the bulk of my career has been spent helping to launch new products and companies), SAFE has many advantages.  Most obviously, SAFE is, well, safe.   Usually, you do not need to worry about whether SAFE will be in business long enough to support its product after you install it.   SAFE has ample resources to provide training, service and support to help make deployment successful (assuming they can execute).   Often, new software products need to integrate with other pieces of software already in place.   Who better to deliver a well-integrated offering than the vendor of the legacy software?   Perhaps SAFE is already one of your vendors.   You know the sales and support people, they’re already in your billing system and on the approved vendor list.   And you know that no one ever got fired for buying from SAFE.   Why would you ever consider anything other than SAFE?

Some would argue that large companies don’t really innovate – but is that really true?   I found some interesting data in an article from CIO Insight.   The top 81 US-based software/internet/computer companies invested $47 billion in 2006 – with giants like Google, Intel, Microsoft, HP and IBM averaging a whopping $4.2 billion.    By comparison, total venture capital software investments in Q1 2008 were at an annual rate of $5 billion (source: NVCA).   Much of that VC investment gets spent on things other than R&D – sales, marketing, and other overhead like CEO’s.   If we make a rough guess that 50% of the VC money goes to R&D, that’s $2.5 billion per year.

Let’s get this straight – Google and Microsoft alone outspend the entire population of venture-backed software companies by a factor of three.   Either someone is wasting a lot of money, or there is in fact innovation happening at big companies.   In fact, the large vendors have a massive R&D resource advantage.

SAFE: Masters of Sustaining Innovation

So who are these madmen investing in, and working at, companies like HOT who think they can compete with an R&D behemoth like SAFE.   How could a new company possibly compete?   What are these people thinking?

The answer lies in considering what big companies spend on, and the kinds of innovation at which they excel or struggle.   Clay Christensen did a phenomenal job of framing the issue in his research.   Big companies are very well set up to serve their existing customers and established markets.   They generally pursue “sustaining” technologies that introduce new features or improve performance for established products or product categories.   Companies like SAFE work very hard, and generally effectively, to serve their existing customers and their known requirements.   They invest aggressively to retain their position in their established markets.

Indeed, referring back to the CIO Insight article:

“(R&D) investing in 2006 was mostly aimed at shoring up existing product lines and was concentrated among the biggest firms, like Microsoft, Intel and Google.   The biggest technology companies have been pouring resources into product development, but don't look for a burst of innovative new business applications anytime soon.  … To the extent that companies are developing business applications, their focus is more on enhancing existing products than introducing new ones.”

“Adobe is a good example. The software maker typically spends 90% of its R&D budget enhancing products like Acrobat, Photoshop and Illustrator that have been around for a decade or more.   ‘Some of our products are Version 12 or whatever,’ says Leslie Bixel, director of technology programs at Adobe, which spent $540 million on R&D last year. ‘It really requires that we challenge ourselves to innovate and make sure we're close to the customer to deliver new value.’”

So, SAFE spends a lot on innovation.   But you can bet that most of the R&D budget, and the vast majority of the thought cycles of management, go to thinking about:  how to serve requirements articulated around existing products; how to protect the existing business; and where to find BIG opportunities that can meaningfully impact growth.    Against this backdrop, it must be hard to carve off meaningful attention, from the company’s best developers and marketeers, to thinking about a small new market that may or may not emerge.

Large companies have another challenge.   It’s harder for them to do “multi-functional innovation,” where customer-facing functions (like sales and marketing) and product development must work closely together to craft a new product that fits tightly with a new market requirement.   In a large company, sales, marketing and engineering are usually not on the same floor – and maybe not in the same building.   Product planning is not all that fast, and the communication from customer to product requirements is “lossy” – like that game of telephone we played as kids.   It can take months or quarters for a new product to win budget approval.   And “I think this is something a few customers will need in a year or two” doesn’t usually compete favorably for that budget for the multi-million dollar product line that must be progressed to protect the existing business.

Finally, large companies have to worry about how the new innovation fits with their legacy product and business practices.  They cannot cannibalize their existing product lines.   The products need to be markted and sold by existing channels.   And even if the product is killer and fits a small niche market, it is hard for SAFE to put its best sales/marketing/service muscle behind the new offering.   After all, the field team needs to make quota – so they will always focus on what lots of customers want now – not what a few customers may want next year.

HOT: Masters of Disruptive Innovation

We can now talk about the advantages small upstart companies like HOT have.   Small companies excel at what Christenson nicely describes as “disruptive innovations”.  These innovations create new market opportunities – opportunities that aren’t necessarily large and obvious enough to capture attention of SAFE during the market formation stage when winning, first-to-market technologies are minted.   They are created by discontinuities – big changes in technology or business requirements – that create new customer problems.   The advent of multicore computing, with the multicore software challenge it brings about, is an example of such a discontinuity.

Here, the small company can dominate.   First, the small company has a clean slate.   No large, existing revenue base to distract attention from the new opportunity.   100% of HOT’s attention is focused on the single problem – and a market that is not yet big enough for SAFE can be plenty big enough for HOT.   HOT has a clean slate.   No legacy products with which to integrate.   No installed base that demands ongoing improvements.    And HOT is agile – able to respond quickly to changes in the market and product requirements (because new markets change more quickly than established markets).

And I believe the biggest advantage of a small companies are in “multi-functional” innovations.   New market requirements are best served when a company can build tight communications loops between the customer and product development.  Customer-facing functions and product development in a well-run small company are exceptionally tight.  At Cilk Arts, as an example, we conducted over 40 detailed customer briefings – for a product category that was not yet defined – before we wrote the first line of Cilk++ (to be fair, we should probably count the 15 years of research on Cilk at MIT – but lets ignore that).   The CEO and CTO were in every meeting.

And now that we’re in full-on product development, the loops stay tight.   If our VP Marketing needs to communicate a requirement to our VP Engineering, its 10 paces to his office.   We discuss customer feedback and product requirements in every Monday’s staff meeting, and at our all-hands meetings on Friday.   Point of fact, I believe that superior EMPATHY (defn:  Identification with and understanding of another's situation, feelings, and motives) is a key corporate advantage.   Important enough that it’s reflected in our corporate values.  Empathy in business means listening actively and carefully to customers.  Superior empathy, combined with agility and unity of focus, enables small companies to meet emerging new market requirements more quickly, accurately, and completely than established companies.  

What do YOU think?

  • As a technology/product developer, do your experiences match Christensen’s model of which companies are best – and worst –  suited for sustaining (or disruptive) innovation?
  • As a buyer or adopter of technology, when have you bet on HOT, and when have you bet on SAFE?

0 Comments Click here to Read/write comments

Cilk Wins Most Influential PLDI Paper Award

Posted by Matteo Frigo on Mon, Jun 16, 2008
 | Digg digg it | Reddit reddit | del.icio.us del.icio.us | StumbleUpon StumbleUpon 

Last week I went to the 2008 Programming Language Design and Implementation (PLDI) conference to receive the award for the most influential paper published in PLDI 1998.  The award committee had the following to say about our paper:

"The 1998 PLDI paper "Implementation of the Cilk-5 Multithreaded Language" by Matteo Frigo, Charles E. Leiserson, and Keith H. Randall introduced an efficient form of thread-local deques to control scheduling of multithreaded programs. This innovation not only opened the way to faster and simpler runtimes for fine-grained parallelism, but also provided a basis for simpler parallel recursive programming techniques that elegantly extend those of sequential programming. The stack-like side of a deque acts just like a standard procedure stack, while the queue side enables breadth-first work-stealing by other threads. The work-stealing techniques introduced in this paper are beginning to influence common practice, such as the Intel Threading Building Blocks project, an upcoming standardized fork-join framework for Java, and a variety of projects at Microsoft."

At Cilk Arts we are really honored by this recognition.  I would like to express my deepest gratitude to Kathleen Fisher and to the rest of the award committee for honoring us in this way. 

Our 1998 paper is based on a couple of simple ideas that taken together yield a powerful point of view on parallel programming.

The first idea is that creating a new thread does not have to be expensive.  For example, the spawn keyword in the MIT Cilk implementation costs you something like 10-20 cycles.  (Mohr, Kranz, and Halstead were the first to show how to do this with their lazy task creation technique, which we optimized by eliminating locks in the common case.)

From the point of view of the programmer, cheap expression of parallelism eliminates the whole issue of thread granularity that you have in, e.g., pthreads programs, where you must make sure that the thread that you create performs enough work to amortize the overhead of the thread creation.  (Then you get into thread pools, which are a nightmare to manage when you have more than one library managing its own threads.)  In Cilk, in contrast, you do not have to worry too much about the cost of a spawn: If you think that a function can be executed in parallel, just spawn it---it will probably not slow down your program.

The second idea is harder to explain, and you will have to read the paper for the details, but it boils down to this.  In the execution of a parallel program, one can distinguish between cycles that would be spent on a single-core execution (called the “work term” in the paper), and cycles that are spent solely because of parallelism, e.g., spent migrating work from one core to another (called the “critical-path term” in the paper).  When implementing a parallel programming language, the implementer usually faces tradeoffs between reducing overheads in the work term or reducing overheads in the critical-path term.  The idea is that, in Cilk, we *postulate* that the work term is more important.

Why is the work term more important than the critical path term?  (By the way, in more recen