Subscribe by Email

Your email:

Multicore Programming Blog

Current Articles | RSS Feed RSS Feed

Visualizing Parallel Speedup with Cilkview

Posted by Will Leiserson on Tue, Jun 30, 2009
 | Submit to Digg digg it | Submit to Reddit reddit | Add to delicious delicious | Submit to StumbleUpon StumbleUpon | Share on Twitter Twitter 

Along with our upcoming release of Cilk++ v.1.1 we are including a new tool to help you visualize application performance: Cilkview. Cilkview runs an application binary and generates performance data on sections you specify. It combines this data with performance estimates generated by the work/span calculator and produces a graph of the results.

Consider the following parallel algorithm for calculating fibonacci numbers:

The Directed Acyclic Graph (DAG) that represents the call tree fib(35) is a binary tree of parallel nodes, so there is parallelism aplenty. Furthermore, a look at the function reveals that it is basically CPU-bound. The only memory accesses are into the frame and they have the highest possible locality. Naturally, this is a terrible way to calculate fibonacci numbers, but it is a simple example of the class of algorithms that are easily parallelized. To make Cilkview predict speedup, insert the following code near the call site:

start() and stop() indicate the interesting sections of code to Cilkview. dump() will associate the collected data with a label that Cilkview will use to generate a graph. Compile your program and run Cilkview to generate a graph:

  % cilkview ./fib

Cilkview runs your binary on a single core with instrumentation that calculates the work and span of the sections of code you specify. In this case, Cilkview made the same prediction that we did. The steeper line indicates perfect linear speedup up until the theoretical parallelism of the algorithm. The theoretical parallelism does not appear on the graph because it is too large. The more gently sloped line indicates Cilkview's prediction of more realistic performance, burdened speedup, based on the overhead of the Cilk++ runtime. Since the two lines are so close (nearly indistinguishable in this plot), Cilkview predicts that actual trials of this application should scale with nearly perfect linearity.

To test this, Cilkview can also run the trials themselves by specifying the "-trials all" option at the command line. No additional code is necessary:

 
% cilkview -trials all ./fib

 

The red stars indicate trial data. Cilkview by default runs one trial for each of 1..n Cilk workers where n is the number of cores. The plot, above, was generated on a computer with 8 cores. As predicted, the trial performances scaled with near-perfect linearity. Slight variations can be attributed to the limitation that only one trial was performed for each configuration.

But what about more common algorithms such as sorting? The following is a simple algorithm for parallel quicksort:

At first glance, this algorithm looks like it has plenty of parallelism, too. Even though there are lots of memory accesses the quicksort algorithm has good cache properties. But consider that the DAG would have a substantial sequential node at the root for the first partition(). Likewise, after the recursive call, each of the first parallel children have to run partition() on their portions of the array before they can do anything parallel. In fact, most of the work in quicksort is done in the partition() algorithm -- serially! Cilkview's analysis bears this out:

The plot represents qsort run on an array 10,000,000 integers. In this graph you can see the that the ideal parallelism (the horizontal line) is small enough to appear on the graph (for this or any reasonable size of input). Plotting the trials gives data that is consistent with this:

The trials fit well within the predicted margin.

Now consider the following code for parallel memcpy:

On the surface, this has lots of parallelism. All of the work is in the cilk_for, and the automatic grain size should handle the overhead of the implicit cilk_spawns with respect to the trivial work done at each iteration. Conceptually, memcpy() is similar to fib() in terms of parallelism. The DAG looks similar to that of fib() except the tree is balanced. Unfortunately, the prediction generated from work/span data is totally inconsistent with the actual performance data from the trials (parallel_memcpy() run on an array 400 million bytes long):

Why does the trial data not match with the prediction? Cilkview takes the work and span into account and estimates the overhead of the runtime to arrive at a reasonable estimation of parallelism. But the limitation was not in the amount of parallelism, here. A discrepancy between the prediction and the trial data indicates other factors at work. In this case, the memory bandwidth became saturated and the speedup ceased to scale as the processors began to queue for access to main memory. The computer on which this was run has two memory controllers, so the scaling would have been even worse on a typical desktop machine. Cilkview's parallelism prediction is useful for understanding whether your algorithm has enough parallelism to scale effectively. A graph in which the trial data falls below the lower-bound indicates that the algorithm is running up against another limitation.

A final, happier example is the parallel matrix multiplication algorithm developed by Matteo Frigo and Matthew Steele. This algorithm was intended as an alternative to more intuitive parallel matrix multiplication algorithms that tend not to use cache effectively:

Note that the algorithm does not actually get super-linear speedup as it might appear in the 8-worker case. Remember that a single trial was run at each level, so slight variations in runtime can cause visible changes in the graph.

Cilkview will ship with Cilk++ v.1.1 for both Windows and Linux. There is more functionality included with the tool (e.g., you can instrument multiple sections of code, generate .csv files, set the number of cores on which to predict performance, and more). The hope is that Cilkview will help users confirm or deny their intuitions about parallel algorithms. Also, it makes pretty pictures.

2 Comments Click here to Read/write comments

Multicore Adoption: More Competitive than World Cup and Miss Universe?

Posted by Ilya Mirman on Mon, Jun 29, 2009
 | Submit to Digg digg it | Submit to Reddit reddit | Add to delicious delicious | Submit to StumbleUpon StumbleUpon | Share on Twitter Twitter 

The map pictured above is not a secret plan for world domination (we're pretty open about our goals). Nor is it a chart of our most recent performance in the board game RISK (we feel more comfortable with troops stationed in Greenland and Kazakhstan).  Rather, the map pinpoints locations of the early adopters of Cilk++ worldwide.

Since releasing Cilk++, our solution for multicore-enabling C++ applications, just a few months ago, we've seen thousands of leads and downloads from developers from around the world. And hundreds of universities are starting to incorporate our multicore software in their curricula.

Would you rather be dating Miss Universe or Miss Multicore? 

We thought it might be fun to see if these "multicore hotspots" have any correlation with some other categories in which countries vigorously compete. Is a nation that thrives at World Cup Soccer more likely to score a leadership role in multicore programming? Are nations which produce high-ranking Miss Universe contestants more likely to recognize the pure beauty of Cilk++?

Take a moment to digest the following chart, which reflects the latest international rankings in economic productivity, soccer, beauty pageantry and parallel computing:

Global Competitiveness Measured in terms of Money, Sports, Beauty & Brains*

ECONOMY (GCI)

WORLD CUP
SOCCER

MISS UNIVERSE

WORLD
MULTICORE
INDEX

1. USA
2. Switzerland
3. Denmark
4. Sweden
5. Singapore
6. Finland
7. Germany
8. Netherlands
9. Japan
10. Canada

1. Spain
2. Germany
3. Netherlands
4. Brazil
5. Italy
6. Argentina
7. England
8. Croatia
9. Russia
10. France

1. Venezuela
2. Colombia
3. Dominican Republic
4. Russia
5. Mexico
6. Kosovo
7. Spain
8. USA
9. Italy
10. Australia

1. Singapore
2. Greece
3. USA
4. Australia
5. Canada
6. Iran
7. S. Korea
8. France
9. Germany
10. Italy

For the purpose of this fun exercise, we made up our own statistic called the "World Multicore Index."  The WMI is determined by dividing the number of Cilk++ downloads  in each country by its population.

Obviously, all statistics have their limitations and these WMI rankings are no exception. We are a start-up without a huge marketing megaphone, and these results reflect only the early adopters of Cilk++ worldwide.  That said, our efforts to spread the gospel of multicore do reach a wide and diverse audience that cannot be ignored.

So, are the nations that produce jocks and beauty queens equally committed to raising homegrown computer geeks?  Right now, if you want to create a start-up company with a gorgeous sales staff and a killer soccer team, Rome or Milan looks like the place to be. Italy is currently the only country to be ranked in the Top 10 for the World Cup, Miss Universe and Multicore Index.

But time to get a little more serious. If you need to have software optimized for multicore CPUs five years from now, you had better start thinking about it now.

Are economic powerhouses more likely to be going multicore?

Logic might dictate that the more powerful the country's economy, the more likely that its programmers would be exploring multicore technology. So let's dive a little deeper into that hypothesis and see how often it rings true.

The Global Competitive Index, or GCI, measures a dozen different indicators to determine a country's economic prowess.

Do nations with the highest GCI -- those with exceptional educational systems and a large population of engineers -- assume leadership roles in adopting multicore technology? 

The results, quite frankly, surprised us - there is not a great deal of correlation between the indices:

  1. Of the world's top 10 economies ranked by GCI, only four (Singapore, USA, Canada, Germany) seem to be aggressively pursuing opportunities to be multicore technology leaders.
  2. The saturation of blue dots on our map can be deceiving. Economic powerhouses like Japan (# 17 in WMI), China (# 16), and India (# 15) seem to lag a bit in terms of going multicore, in our humble opinion.  We're looking forward to sharing more trial software with our friends in the Far East.
  3. On the flip side, both Iran and Greece – nations better known for exporting yummy treats (pistachios, baklava) than computer technology – could potentially establish themselves as serious players on the world multicore stage.

The good news is that unlike soccer (where players have been training since birth), there is not an insurmountable bar to become competitive in this arena!

It's still early in the adoption of multicore (today's CPUs feature just 2-4 cores). And the various programming approaches are getting better and better. So jump in the multicore pool – the water is warm!

As warm as the Mediterranean or the Persian Gulf.

RESEARCH FOOTNOTES

  1. World economy rankings are based on the 2008-2009 Global Competitive Index (GCI), which measures a dozen economic indicators as criteria for success. http://www.weforum.org/en/media/Latest%20Press%20Releases/PR_GCR082
  1. Latest FIFA World Cup rankings as of May 2009. http://www.fifa.com/worldfootball/ranking/lastranking/gender=m/fullranking.html
  1. This Top 10 Miss Universe list is from 2008. The 2009 tiara is up for grabs in the Bahamas on August 23rd. http://www.missuniverse.org/
  1. The World Multicore Index (WMI) is determined by dividing a country's total number of Cilk software downloads by its population.

1 Comments Click here to Read/write comments

Multicore-enabling FP-tree Algorithm for Frequent Pattern Mining

Posted by Yuxiong He on Thu, Jun 25, 2009
 | Submit to Digg digg it | Submit to Reddit reddit | Add to delicious delicious | Submit to StumbleUpon StumbleUpon | Share on Twitter Twitter 
Untitled Document

In data mining, association rule mining is a popular and well-researched method for discovering interesting relations between variables in large databases.  For example, the rule {onions, potatoes} => {beef} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, he or she is likely to also buy beef. Such information can be used as the basis for decisions about marketing activities.  Frequent pattern mining is the basis of association rule mining.   Given a list of transactions, frequent pattern mining returns a complete set of items that occur more than a threshold of times.  One of the fastest and most popular algorithms for frequent pattern mining is the FP-tree [1] algorithm.   In this blog post, I describe how to parallelize the FP-tree algorithm using Cilk++.    

The FP-tree algorithm builds a prefix tree representation of a given database of transactions, and generates a complete group of frequent item sets through recursive tree projections.   Several algorithms [2-4] have been proposed in the literature to address the problem of parallelizing FP-tree.  The algorithm in [2] constructs multiple local parallel trees, interlinks the local counters, and conducts mining jointly on these trees.  The work in [3] classifies hot and cold items, forms a hot node hash table and divides the work manually for tree construction.   These algorithms use different approaches to achieve thread-level parallelism.  That said, they all have one thing in common – they require substantial changes to the sequential algorithm in order to explore parallelism, avoid data races and obtain fair load balancing.  Below I describe a multicore-enabled parallel FP-tree implementation using Cilk++, which maintains the sequential semantics of the FP-tree algorithm, and requires relatively minor changes to the sequential code.  On a 4-core machine, this implementation achieves 2.6x speedup on tree building and 3.3x speedup on frequent set generation for our tests on 1M-transaction database.

The organization of this blog post is as follows.  First, I present a definition of the frequent pattern mining problem.   Then, I illustrate the FP-tree algorithm in three parts – preprocessing of transactions, tree construction, and pattern generation.  Section 2 to 4 describe the algorithm as well as the parallelization of each of the three steps.    Section 5 covers performance.  The sequential implementation we used is from Christian Borgelt's Webpages (http://www.borgelt.net//fpgrowth.html).  (The paper [5] describing this implementation is available on that webpage, and this post uses some examples in the paper to illustrate the sequential algorithm.)

Section 1. Frequent Pattern Mining Problem

Given a list of transactions, frequent pattern mining means to look for sets of items that occur more than a predefined threshold of times in these transactions.   More formally, it is defined [1] as:

Let I be a set of items, and a transaction database DB = { T1, T2, …, Tn}, where Ti is a transaction which contains a set of items in I.  The support (or occurrence frequency) of a pattern A, which is a set of items, is the number of transactions containing A in DB.  A, is a frequent pattern if A’s support is no less than a predefined minimum support threshold S.  Given a transaction database DB and a minimum support threshold, S, the problem of finding the complete set of frequent patterns is called the frequent pattern mining problem.

 

Section 2. Preprocessing of Transactions

FP-tree preprocesses the transaction database as follows: in an initial scan of the frequencies of the individual items (support of single element item sets) are determined.  All infrequent individual items, i.e., all items that appear in few transactions than the user-defined threshold S, are discarded from the transactions since they cannot be part of a frequent pattern.  In addition, the items in each transaction are sorted, so that they are in descending order with respect to their frequency in the database.  Each transaction is then recoded using an integer array respecting each item using its index.  Figure 1 shows an example of a transaction database before and after preprocessing.

Figure 1.  (a) A transaction database before preprocessing.  (b) Item frequency and index table. The threshold value is 3, and f, g will be discarded since their frequency is less than the threshold.  (c) Reduced transaction database with items in transactions sorted in a descending order with respect to their frequency. (d) Reduced transaction database after recoding transactions using item index.

Parallelizing the scan of transactions in database is I/O specific, and is beyond the scope of this post.   Parallelizing the sorting and recording of items in each transaction is straightforward because they are all independent tasks.   As a result, all we need to do is to change the “for” loop in the sequential code to “cilk_for” loop.

 

Section 3. FP-Tree Construction

An FP-Tree is a prefix tree for transactions.  Each path of the tree represents a set of transactions that share the same prefix, and each node corresponds to one item.  In addition, all nodes referring to the same item are linked together in a list, so that all transactions containing a specific item can easily be found and counted by traversing this list.  This list is accessible through a head element, which also records the item’s total number of occurrences in the database.  Figure 2 shows the FP-Tree for the (reduced) database shown in Figure 1.


Figure 2. FP-Tree for the reduced database shown in Figure 1.


In the sequential implementation, the initial FP-Tree is built from the preprocessed transaction database as a simple list of integer arrays (as shown in Figure 1(d)).  The FP-Tree construction includes two phases – sorting and building.    

In the sorting phase, the transactions are sorted lexicographically.  The sequential sorting uses quicksort for large data arrays, and changes to insertion sort after the data has been sorted on a macro scale.  Our parallel sorting simply cilkifies quicksort by spawning out the sorting of sub-arrays, and changes the base case of quicksort to insertion sort when the array size is small.  Figure 3(a) and Figure 3(b) show the sequential and parallel sorting codes, respectively.    


Figure 3(a) Sequential quicksort with insertion sort for small scale


Figure 3(b) Parallel quicksort with insertion sort as base case.  (Major changes over sequential version are highlighted in dark-red.)

In the building phase, the sorted list is turned into an FP-Tree using the recursive procedure _build (Figure 4(a)).  At recursion depth k, the k-th item in each transaction is used to split the database into sections, and a node of the FP-tree is created and labeled with that item.   We can see that each section can be processed rather independently.  The only races occur when these sections attempt to update the header list.  Therefore, in the parallel version (Figure 4(b)), we spawn off the work of each section to expose parallelism, and use p_FPTLIST_reducer (similar to a link list reducer supporting node insertion at the head) to represent header list and prevent data races.      During the reduce operation, the link list reducer accumulates the weight and appends the item list from sections.     (To understand the basics of reducers and how to write a reducer, please refer to reducer section of the Cilk++ User Manual).


Figure 4(a) Sequential FP-Tree construction



Figure 4(b) Parallel FP-Tree construction. Major changes from sequential version are highlighted in dark-red color.

Section 4. Pattern Generation

The core operation of FP-tree pattern generation is to compute conditional projection tree of the database, that is, the database of the transactions containing a specific item, with this specific item removed.  An example of computing a projection from the FP-tree of a database is shown in Figure 5.  In this example, we are computing a conditional projection tree based on item e.  From the header list of e, we can see that there are three transactions including item e.  Going upward from the nodes in e’s header list to root, we can draw the edges in the e-based conditional projection subtree using red links.  Including all the nodes connected by these red links and removing nodes with item e, we get the conditional projection tree based on item e (Figure 5(b)).  This projection tree indicates there are 3 conditional patterns {db, dca, bc} based on e in DB, i.e., there are three transactions that include e in DB {dbe, dcae, bce}.




Figure 5(a) Computing item-e-based conditional projection tree from the database shown in Figure 2. (b) The resulting conditional projection tree with base e.


For pattern generation, an FP-Tree of the database is projected recursively.  The pseudocode for pattern generation is shown in Figure 6.  To parallelize the pattern generation, we can unwrap the outermost layer and run the tree-projection of different items in parallel.  If tree projection is thread-safe, our work here is done.  Although intuitively, the tasks of computing projection of the same FP-tree over different items are independent because the FP-Tree is the only shared object and projection should only involve reading of the tree nodes, because the sequential implementation (Figure 7(a)) changes member variables of the tree nodes, which results in data races for concurrent tree projection.  Our parallel implementation introduces thread-local storage for tree projection to make the projection thread-safe (see Figure 7(b)). 

Figure 6. Pseudocode for FP-Tree pattern generation

 


Figure 7(a) Tree projection in sequential code


Figure 7(b) Thread-safe tree projection in parallel code.  Major changes over sequential version are highlighted by dark-red color.

 

The changes indicated above are all we need to parallelize the FP-Tree algorithm using Cilk++ – no reinvention of a parallel algorithm, no rewriting a new parallel implementation, simply adding one reducer and some Cilk++ keywords.  So with this modest effort, a considerably complex algorithm is now multicore-enabled.    

Section 5. Results

We tested the performance of this multicore-enabled FP-Tree implementation using a 1M-transaction database on a 16-core AMD Opteron machine.    As shown in Figure 8, on 4 cores, the implementation gets a 2.6x speedup on tree building and 3.3x speedup on frequent set generation; on 8 cores, it gets speedup 3.0x and 4.8x; on 16 cores, it gets 3.6x and 6.6x.   The speedup on a small number of cores is not bad, though the speedup does not scale linearly to 16 cores – primarily due to a memory bandwidth limitation.    As I mentioned in a previous blog post, a simple test to explore whether a multithreaded program has a memory bandwidth problem is to run the sequential version of the program for two cases – (1) run one copy of the sequential program on each core, and (2) run a single copy of the sequential program on the entire machine.   If the program’s execution time in case 1 is significantly larger than that in case 2, the memory bandwidth demand of the program could be high enough to become the primary constraint for program scalability.  If we run a sequential version of FP-Tree test case, the execution time on case (1) is approximately 2.4 times of the execution time on case (2).    This indicates a high memory bandwidth requirement of this application, and explains its non-linear speedup on a larger number of cores.   Moreover, there is literature [3] discussing the high cache miss rate of the FP-tree algorithm and proposing variants of the original FP-tree algorithm with better cache behavior.  It would be interesting to parallelize some cache-aware versions of FP-tree algorithm using Cilk++ and observe its scaling performance.  


Figure 8.  Speedup of FP-tree implementation on 16-core machine

References

[1] J. Han, J. Pei, and Y. Yin, '' Mining Frequent Patterns without Candidate Generation'', Proc. 2000 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD'00), 2000.
[2] O. Zaiane, M. El-Hajj, and P. Lu. Fast Parallel Association Rules Mining without Candidacy Generation. In IEEE 2001 International Conference on Data Mining (ICDM'2001), 2001.
[3] Liu, L., Li, E., Zhang, Y., and Tang, Z. 2007. Optimization of frequent itemset mining on multiple-core processor. In Proceedings of the 33rd international Conference on Very Large Data Bases. 2007
[4] H. D. K. Moonesinghe, Moon-Jung Chung, Pang-Ning Tan. Fast Parallel Mining of Frequent Itemsets.  Michigan State University.
[5] C. Borgelt. An Implementation of the FP-growth Algorithm. Workshop Open Source Data Mining Software. 2005

0 Comments Click here to Read/write comments

Reducers and Other Cilk++ Hyperobjects: Peeking Under the Hood

Posted by Ilya Mirman on Wed, Jun 17, 2009
 | Submit to Digg digg it | Submit to Reddit reddit | Add to delicious delicious | Submit to StumbleUpon StumbleUpon | Share on Twitter Twitter 

We have written extensively about the challenges global variables pose for multithreaded applications - inhibiting parallelism by inducing data races. 

One way to deal with this challenge is by employing Cilk++ hyperobjects. Cilk++ hyperobjects mitigate data races on non-local variables without the need for locks or code restructuring. Cilk hyperobjects are straightforward to use: Simply by declaring a global variable to be a hyperobject of the appropriate type, data races on the global variable are automatically resolved.

What's Under the Hood?

To date, we have shared WHAT hyperobjects do, but not HOW they work.  With the patent filings behind us, we have authored a paper for the 2009 Symposium on Parallelism in Algorithms and Architectures.  

Here's the abstract:

This paper introduces hyperobjects, a linguistic mechanism that allows different branches of a multithreaded program to maintain coordinated local views of the same nonlocal variable. We have identified three kinds of hyperobjects that seem to be useful - reducers, holders, and splitters - and we have implemented reducers and holders in Cilk++, a set of extensions to the C++ programming language that enables multicore programming in the style of MIT Cilk. We analyze a randomized locking methodology for reducers and show that a work-stealing scheduler can support reducers without incurring significant overhead.

The paper can be downloaded here.

Comments / Questions?  We'd love to hear from you!

0 Comments Click here to Read/write comments

Multicore Courses through MIT's Professional Education program

Posted by Ilya Mirman on Thu, Jun 11, 2009
 | Submit to Digg digg it | Submit to Reddit reddit | Add to delicious delicious | Submit to StumbleUpon StumbleUpon | Share on Twitter Twitter 

We just wrapped up our 2-day course (6 lectures and 4 hands-on labs) at MIT.

Here are the lectures from this week's 2-day course:

5-day Course on Multicore Programming at MIT (register today!)

In addition, we'd like to point out that there is a great 5-day course on multicore programming being held July 20-24:

  • Overview: The course will approach multiprocessor programming from two complementary directions. In the first part of the course, we focus on foundations: what our programs and machines need to provide in order to ensure that concurrent programs do what we expect. We use an idealized model of computation in which multiple concurrent threads manipulate a set of shared objects. This model is essentially the model presented by standard Java or C++ threads packages.
  • Foundation: The foundations section is intended to build up the students' intuition and confidence in understanding and reasoning about concurrency. We approach this goal using examples, counter-examples, models, and exercises. These elements are laid out in a structured and progressive manner, from simple machine instructions to powerful universal constructions, the equivalent of Turing machines for multiprocessors.
  • Performance: The second part of the course will be concerned with performance. Reasoning about the performance of concurrent programs and data structures is very different in flavor from reasoning about their sequential counterparts. Sequential programming is based on a well-established and well-understood set of abstractions. There is little or no need to understand the specifics of the underlying architecture. In multiprocessor programming, by contrast, there are no such well-established abstractions. It is impossible to reason effectively about the performance of a concurrent data structure without understanding the fundamentals of the underlying architecture.  The performance part of the course will revisit many of the issues first raised in the foundations section, but in a more realistic model that exposes those aspects of the underlying architecture that most influence performance.
  • Data structures: The course then goes through a sequence of fundamental data structures, the concurrent analogs of the data structures found in any undergraduate data structures course, and a few foundation structures that are unique to the world of multithreaded computation. These data structures are introduced in an incremental way, each one extending the techniques developed for its predecessors. Each of these data structures is useful in and of itself as a reference. Moreover, by the end, the students will have built up a solid understanding of the fundamentals of concurrent data structure design, and should be well-prepared to design and implement his or her own concurrent data structures.
  • The course is based on the book, The Art of Multiprocessor Programming, by Herlihy and Shavit, Morgan-Kaufmann Elsevier 2008
Learn more...

2 Comments Click here to Read/write comments

Multicore-enabling Dense Polynomial Multiplication

Posted by Ilya Mirman on Tue, Jun 02, 2009
 | Submit to Digg digg it | Submit to Reddit reddit | Add to delicious delicious | Submit to StumbleUpon StumbleUpon | Share on Twitter Twitter 

Marc Moreno Maza
Ontario Research Centre for Computer Algebra, University of Western Ontario
moreno@csd.uwo.ca

Yuzhen Xie
Computer Science and Artificial Intelligence Laboratory, MIT
yxie@csail.mit.edu

1  Introduction

In symbolic computation, dense polynomial multiplication is a fundamental operation akin to dense matrix multiplication in numerical computation. We report herein on a parallelization, in Cilk++, of dense polynomial multiplication based on multi-dimensional FFTs. Dense polynomial arithmetic is important for most computer algebra algorithms, such as the Euclidean Algorithm and its variants, which tend to densify intermediate data, even when the input and output polynomials are sparse. In addition, we focus on polynomials over finite fields since modular techniques allow us to reduce to such fields of coefficients.

2  Algorithm Review

Given two multivariate polynomials f and g with coefficients in the prime field K = Z/pZ and with ordered variables x1 < ⋯ < xn. Define di = deg(f, xi) and di = deg(g, xi), for all i. Assume there exists a primitive si-th root of unity ωi ∈ K, for all i, where si is a power of 2 satisfying sidi + di + 1. Then f g can be computed as follows.

Step 1.
Evaluate f and g at each point of the n-dimensional grid ((ω1e1, …, ωnen), 0 ≤ e1 < s1, …, 0 ≤ en < sn ) via n-D FFT.
Step 2.
Evaluate f g at each point P of the grid, simply by computing f(P) g(P),
Step 3.
Interpolate f g (from its values on the grid) via n-D FFT.

3  Framework, Challenges and Solutions

We assume that one-dimensional FFTs are computed by a black-box program, which could be a serial one. This allows us to utilize non-standard 1-D FFTs, such as Truncated Fourier Transforms (TFTs) which are hard to parallelize. TFTs are, however, of high practical interest since they can reduce unnecessary work and permit more efficient memory usage. This feature is even more significant when the number of dimensions increases. Another reason for our "1-D FFT black box" assumption is the fact that for us multiplication is a basic routine on top of which we aim at building several layers of higher-level algorithms, each of them offering opportunities for parallel execution. Therefore, this assumption allows us to avoid the parallel overhead of too fine grain size at the very low-level routines.

We use the serial C routines for 1-D FFT and 1-D TFT from the modpn library shipped with the computer algebra system Maple. Our integer arithmetic modulo a prime number relies also on the efficient functions from modpn, in particular the improved Montgomery trick. This trick is another important specificity of 1-D FFTs over finite fields which makes their parallelization even more difficult.

Multi-dimensional FFT can naturally be computed in a parallel fashion. For instance, a 2-D FFT with dimension sizes d1 and d2 essentially proceeds by the computation of d1 1-D FFTs (which can be performed concurrently) followed by the computation of d2 1-D FFTs (which can also be performed concurrently). The difficulties arise from irregular problems. Consider again the above 2-D FFT example with dimensions x1 and x2. Assume that d1 is in the order of 1000 whereas d2 is in the order of 10. When performing 1-D FFTs along x1, one computes only a few (namely 10) 1-D FFTs of large vectors. This means little parallelism (under our black box assumption). When performing 1-D FFTs along x2, one computes many (namely 1000) 1-D FFTs of small vectors. This means mis-use of FFT since FFT is really worth on sufficiently large vectors, say with at least several hundred entries.

We have conducted complexity analysis considering the performances of multivariate multiplication in terms of parallel speed-up and cache misses. They show that 2-D FFT based bivariate multiplication:

  • performs optimally when the partial degrees of the product are equal;
  • performs poorly when the ratio of those partial degrees is large, say about 500.

We show how multivariate multiplication can be efficiently reduced to balanced bivariate multiplication, based on 2-D FFTs. By balanced we mean that both dimension sizes are equal (or almost equal). With respect to a multiplication based on n-dimensional FFT, our approach may increase the work (i.e. the serial running time) by at most a factor of 2. However, it provides much larger parallel speed-up factors as reported in our experimentation.

Our approach combines two techniques that we call contraction and extension. The technique of contraction reduces multivariate multiplication to bivariate one, without ensuring that dimension sizes are equal; however, the work remains unchanged and in many practical cases the parallelism and cache complexity are improved.

The technique of extension turns univariate multiplication to bivariate one. This has several applications. First, it permits to overcome the difficult cases where primitive roots of unity of “large” orders cannot be found in the prime field K. Secondly, combined with the technique of contraction, this leads to balanced bivariate multiplication.

4  Implementation

For a dense multivariate polynomial f ∈ K[x1, …, xn] with variables x1 < … < xn and degree di in xi for 1 ≤ in, we represent it in a dense recursive manner w.r.t the given variable order. Its coefficients are stored in a contiguous one-dimensional array B. The coefficient of the term x1e1xnen is indexed by

(d1+1) ⋯ (dn−1+1) en+ (d1+1) ⋯ (dn−2 +1) en−1 +⋯+(d1+1) e2e1

in B. This representation is equivalent to a n−dimensional matrix (or grid) in row-major layout.

The parallelization of the multiplication algorithm reviewed in Section 2 takes advantage of the ease-of-use parallel constructs in Cilk++. Evaluation of a polynomial f is done by n-dimensional FFT or TFT. Along the i-th dimension, the (di +1) calls to 1-D FFT or 1-D TFT are parallelized by a cilk_for loop.

With the parallel evaluation by FFT or TFT at hand, Step 1, explained in Section 2, is realized by one cilk_spawn and one cilk_sync. Step 2 is parallelized by a cilk_for. The interpolation in Step 3 is similar to the parallel evaluation, up to some details.

A number of n−1 data transpositions is needed for such n-dimensional FFT or TFT. For 2-dimensional case, we use the cache-efficient code provided by Dr. Matteo Frigo. It employs a divide-conquer approach and fits the base case into the cache of targeted machines. For multi-dimensional case, we divide the problem into multiple 2-dimensional transpositions where Dr. Matteo Frigo’s cache-efficient code can work.

Another difficulty encountered in our implementation is loop carried dependencies. One situation that is hard to identify is when there is deep data dependency but the procedure is parallelized using cilk_for in a top-level, likewise for the improved Montgomery trick. We use Cilkscreen to effectively detect and localize data-race bugs.

5  Experimental Results

Our benchmarks are carried out on a 16-core machine with 16 GB memory and 4096 KB L2 cache. All the processors are Intel Xeon E7340 @ 2.40GHz.

The performance of our bivariate interpolation (same as a 2-D TFT up to some details) for problems of sizes from 2047 to 16383 is given in Figure 1. Using 16 processors, the maximum speedup is 15 for the problem of size 16383.

Figure 1: Speedup of bivariate interpolation (Step 3) via 2-D TFT

For the multiplication of two bivariate polynomials with the degrees of all the variables being equal and large, say in the order of 1000 to 8000, our implementation got good speedup on a 16-core machine: quasi-linear speedup up to 12 processors. Using 16 processors, the speedup factor of the problem of size 8191 reaches 14, as illustrated in Figure 2.

Figure 2: Speedup of bivariate multiplication via 2-D TFT

Figures 3 and 4 compare the timing and speedup for the multiplication of two 4-variate polynomials via direct 4-D TFTs and contraction to balanced 2-D TFTs. In this problem, each polynomial has partial degrees 1023, 1, 1, and 1023. Even on one processor, multiplication by the contraction to balanced 2-D TFT method is faster than that by the direct 4-D TFT method by about 139%. Hence, using 16 processors, the net speedup by contraction to balanced 2-D TFT method w.r.t the serial direct 4-D TFT method reaches 31.

Figure 3: Timing of a 4-variate multiplication by direct 4-D TFTs vs contraction to 2-D TFTs

Figure 4: Speedup of a 4-variate multiplication by direct 4-D TFTs vs contraction to 2-D TFTs

The timing and speedup factors for the multiplication of two univariate polynomials of degree 25427968 are illustrated in Figure 5 and 6, by both the direct 1-D TFT method and extension to balanced 2-D method. On one processor, extension to balanced 2-D method is about 27% slower. However, using 16 processors, the speedup gain by extension to balanced 2-D is about 10.47 w.r.t the direct 1-D TFT method on one processor.

Figure 5: Timing of a univariate multiplication by direct 1-D TFTs vs extension to 2-D TFTs

Figure 6: Speedup of a univariate multiplication by direct 1-D TFTs vs extension to 2-D TFTs

Acknowledgements: This work was supported by NSERC and MITACS NCE of Canada, and NSF Grants 0540248, 0615215, 0541209, and 0621511. We are very grateful for the help of Professor Charles E. Leiserson at MIT, Dr. Matteo Frigo and all other members of Cilk Arts.

0 Comments Click here to Read/write comments

Multicore-enabling a Binary Decision Diagram algorithm

Posted by Yuxiong He on Fri, May 29, 2009
 | Submit to Digg digg it | Submit to Reddit reddit | Add to delicious delicious | Submit to StumbleUpon StumbleUpon | Share on Twitter Twitter 

Reduced and Ordered Binary Decision Diagram (a.k.a. ROBDD, though I will refer to it as BDD in this post for simplicity) is an acyclic graph representation for Boolean functions.  Because this representation provides a canonical form and is quite succinct in most cases, it is widely used for formal verification of protocols and digital circuits.  BDD can be used to look for satisfying assignment of Boolean expressions and comparing whether two Boolean expressions are equivalent.  Because the construction of BDD for certain large and complex Boolean functions can be very time consuming, we thought it would be interesting to parallelize BDD operations in order to achieve efficient execution on multicore CPUs.  
Our parallel BDD is developed based on BuDDy, an open source BDD package.  In this post, I will describe some key algorithms of BDD operations and introduce the approach we took to parallelize them.   We tested the performance of our parallel BDD by constructing a BDD for 13-bit multiplier, and obtained a 10x speedup on a 16-core machine.  (More detailed performance results are presented in Section 4 of this blog.)

Brief Introduction of BDD

A binary decision diagram is built based on the If-then-else Normal Form (INF) of a Boolean expression.  For example, a Boolean expression (x AND y) of variable x and y can be represented using the following INF form shown in Figure 1(a), which corresponds to the binary decision diagram in Figure 1(b).   Each BDD node has a variable, and two successor links.  The dashed line points to the low successor of the node representing the branch where the variable value is false.  The solid line points to the high successor of the node representing the branch where the variable value is true.


Figure 1.  (a) INF form of x AND y.  (b) BDD of x AND y.

Reduced and Ordered BDD (ROBDD) is more restricted than general BDDs.  A binary decision diagram is ordered (an OBDD) if the variables occur in the same order on all paths from the root of the BDD.  For example, BDD in Figure 1(b) follows variable order x < y.  (Please note that the operator “<” here is not the familiar less-than operator in math.  Instead, it indicates that the variable x always occurs before variable y in this BDD.)  An OBDD is reduced if both of the following conditions are satisfied:
  • (uniqueness) no two distinct nodes have the same variable name and low and high successor.
  • (no-redundancy) no variable node has identical low- and high-successor.

An important property of ROBDD is that for each Boolean expression with a fixed variable order, there is exactly one ROBDD.   Figure 2 shows an example of ROBDD for expression (x1 1) AND (x2 y2) with variable order x1 < x2 < y1 < y2 .  The operator (called bi-implication) means x is equivalent to y, i.e., the value of x y is 1 if and only if (x=0 AND y=0) OR (x=1 AND y=1).


Figure 2.  BDD of expression (x11) AND (x2 y2) with variable order x1 < x2 < y1 < y2.

 

Algorithm for BDD Construction

To describe the algorithm of constructing BDD, we need a data structure to represent BDD.    The data structure here is straightforward, which is a direct representation of directed acyclic graph – a set of nodes with outgoing edges.   As shown in Figure 2, each intermediate node has three attributes – variable, low edge node and high edge node.  There are two leaf nodes representing logical true and logical false, respectively.  Since reduced BDDs do not contain any redundant nodes, a hash table is required to provide a guarantee on the absence of any redundancies.   A hash table in BDD serves two purposes: to store the nodes, and to provide efficient lookup of nodes.
Constructing BDD for a Boolean expression can be treated as an iterative process of combining two Boolean expressions with a Boolean operator.   Let’s call this combination operation Apply(expression e1, expression e2, operator op).  For example, the boolean expression (x1 y1) AND (x2 y2) can be constructed using Apply(e1, e2, AND) where e1 = (x1 1) and e2 =(x2 y2).  Moreover, e1 can be further constructed using Apply(x1, y1, ), and so does e2.    Thus, Apply is a key operation in BDD construction and manipulation.   The pseudo code of the Apply operation is shown in figure 3. 


Figure 3.  The pseudo code of sequential Apply operation in BDD.
To apply a logic operation on two BDDs, the evaluation consists of several steps, as shown in Figure 3.  First, if this operation was just performed and the result is cached, the result is returned directly.  This requires a special data structure – a cache table.   A cache table is analogous to a hash table, and maintains <key, value> pair and returns the corresponding value given the key.  The difference between the two is that cache table doesn’t guarantee to record all <key, value> pairs inserted.  It works like a cache, and only stores the recent accessed <key, value> pairs.     Second, if both BDDs are constants (0 or 1), it simply applies the simple boolean operation on constants.  Otherwise, the Apply operation is recursively conducted on low edge and high edge of two bdds.

 

Parallel  Algorithm for BDD Construction

There was existing work on parallelizing BDD construction using thread models [1-3].  Because their parallel languages / libraries don’t have dynamic scheduling for load balancing, their codes have to divide the work manually or maintain a work queue in order to balance the load among processors.  These all involve significant changes to the sequential code.  In contrast, when we use Cilk++ to parallelize the algorithm, all we need to do is to add a handful of cilk_spawn and cilk_sync statements to expose potential parallelism (Figure 4).   We leave the heavy lifting to the Cilk++ runtime system featuring efficient dynamic scheduling and load balancing.


Figure 4.  The pseudo code of parallel Apply operation in BDD.

 

When the parallel Apply procedure is called, several instances of MakeNode can run concurrently, so do the cache table and hash table accesses.  To ensure correct parallel execution without data races, we need thread-safe cache table and hash table.
Our thread-safe cache table is a simple fixed-size table supporting lookup(key) and insert(key, value).   Each entry of the table includes a <key, value> pair.  The position of a pair in the table is determined by the hashed value of the key.  If two <key, value> pairs are hashed into the same entry, the new pair will overwrite the old one.  Each entry of the table is protected such that only one thread can read/write it at any time.  This is done by using one bit per entry to indicate status of the entry, and using Compare-and-Swap(CAS) to ensure atomicity. 

A Thread-Safe Hash Table

The hash table in BDD needs to support concurrently execution of one key operation – lookup_insert(key* k), which returns the key if it already exists and inserts the key otherwise.  Since the space requirement by BDD operations is usually unknown, we need a resizable hash table.  When the probe number is greater than a certain threshold value, the hash table performs auto-resizing. 
Our implementation uses an open addressing hash table.  The thread-safeness is ensured by atomic Compare-And-Swap (CAS) operations.    Lookup_insert operations can be safely executed in parallel.  During hash table resizing, any worker that calls lookup-update will first help with the resizing, and only then complete their lookup-update operations.  This all-participation scheme distributes the workload of hash table resizing to all the workers that use the hash table.   This ensures that resizing can be done efficiently, and that the worker triggering the resizing won’t be solely burdened with the task.  (The code of this thread-safe hash table and a simple test case is available for download here)

Performance Results

Our BDD achieves good speedup on multicore machines.   For the multiplier application – a classic example to test scalability of BDD implementation, it achieves 10x speedup on a 16-core system (four quad-core AMD Barcelona CPUs).


Figure 5.  Speedup results of 13-bit Multiplier Construction using multicore-enabled BDD
Although the speedup of multicore BDD is good, we don’t get perfect linear speedup here.   An important reason is that BDD multiplier has a considerable memory bandwidth requirement.  A simple test to confirm this is to run the sequential version of the program for two cases – (1) run one copy of the sequential program on each core, and (2) run a single copy of the sequential program on the entire machine.   If the program’s execution time in case 1 is significantly larger than that in case 2, the memory bandwidth demand of the program could be high enough to become the primary constraint for program scalability.  If we run a sequential BDD multiplier application, the execution time on case (1) is approximately 1.6 times of the execution time on case (2).  This highlights the intrinsic memory bandwidth demand of this application, and explains the imperfect linear speedup.

 

References:

[1] Implementation of an Efficient Parallel BDD Package.  Tony Stornetta  and Forrest Brewer.
[2] A Parallel Algorithm for Constructing Binary Decision Diagrams.  Shinji Kimura and Edmund M. Clarke.
[3] BDDNOW: A Parallel BDD Package.  Kim Milvang-Jensen.

2 Comments Click here to Read/write comments

First Impressions of the Fortress Language

Posted by Pablo Halpern on Fri, May 08, 2009
 | Submit to Digg digg it | Submit to Reddit reddit | Add to delicious delicious | Submit to StumbleUpon StumbleUpon | Share on Twitter Twitter 
Tags: 

I was privileged recently to attend a one-day hands-on introduction to Fortress lead by Sukyoung Ryu and Jan-Willem Maessen of Sun Microsystems and hosted by MIT.  Fortress is a new parallel programming language developed at Sun and designed to bring together many of the best ideas in computer language design from the last few decades.   Some of the highlights are:

  • Implicit parallelism using a Cilk-style work-stealing scheduler
  • Transactional synchronization to minimize contention on shared objects
  • An extensible syntax that uses mathematical symbols
  • Static type-checking with polymorphism and type inference
  • Definition of many language features through libraries rather than built-in syntax
  • Support for generic, object-oriented, and functional programming paradigms
  • Garbage collection
  • Write-once-run-anywhere coding enabled by using the Java Virtual Machine (JVM)

There is a lot to Fortress, and I will only touch on a few topics here.

Mathematical Notation

Fortress’s syntax has little in common with languages like C++ or Java.  In fact, it is unlike any other programming language I’ve seen (including Lisp, Perl, sh, FORTRAN, and APL).  I am told that it bears a vague resemblance to Haskell, though there are features that are drawn from many different sources including some that would be familiar to C++ or even Pascal programmers.

The most striking thing one first notices about Fortress syntax is the use of mathematical notation.  The following code defines a function that takes a list of strings and returns true if any of the strings have the value, “hello”:

The code above can also be represented in plain text as follows:

hasHello(xs: List[\String\]): Boolean =
  BIG OR[x <- xs] x="Hello"

Depending on your point of view, either the first notation is an ornamental rendering of the second, or the second notation is a primitive representation of the first (for people using impoverished editing and source-control tools). Using emacs or vim in Fortress mode, you can enter plain text and it will be partially rendered into the mathematical form as you type:

Before I did any work in Fortress, the graphical rendering seemed gimmicky to me.  After doing a few exercises, however, it started to grow on me.  Though I doubt that the notation alone is enough to make Fortress easy for a mathematician to use, I have to admit that it does make the program easier to read once you get used to it.  It’s kind of like a pretty-printer on steroids – not limited to just bolded keywords and italicized comments.

Extensible syntax

More profound than the mathematical notation is the fact that the syntax is extensible.  There wasn’t much talk of extending the syntax in the one-day workshop, but one mechanism that was mentioned was the ability to overload whitespace.  If you’ve read Bjarne Stroustrup’s April Fool’s Day proposal for overloadable whitespace in C++ (PDF), you probably got a chuckle out of that last sentence.  But Fortress really does, in a sense, allow whitespace to be overloaded so that two times x is written simply as 2 x.  More precisely, what is being overloaded is not the whitespace, but the juxtaposition of tokens.  Thus two times x can be written even more simply as 2x, with no intervening whitespace. 

Fortress’s syntax is so extensible, that during this short workshop I could not tell which parts of the language are baked into its syntax, and which parts are provided by libraries, just as cout<< may look like a C++ intrinsic to a newbie C++ programmer. This blurring of intrinsic syntax and library-provided syntax is apparently deliberate in Fortress.

Types and Dynamic Overloading

The Fortress type system lets the user define traits, which are like Java Interfaces or C++ pure abstract classes and objects which are like Java or C++ classes.  A function argument type can be declared using either a trait or object type and functions can be overloaded.  A function call is type-checked at compile-time to ensure that there is only one best-match for an overload set, but the actual resolution of the function overload is preformed at run-time based on the dynamic type of all of the arguments.  Contrast this dynamic function resolution to C++’s template instantiation, which occurs entirely at compile time based on static argument types, or C++’s virtual function binding (or Java’s member-function call), which occurs at run-time based on the dynamic type of the implicit this argument but the static types of all of the remaining arguments.

Parallelism

Of course, what interested me the most about Fortress was its model and mechanism for parallel programming.  The Fortress team consists of some very smart people, so it was with some satisfaction that I read on their web site that their parallel scheduler uses essentially the same work-stealing algorithm as Cilk++.  Unfortunately, the underlying JVM does not provide the stack-manipulation primitives needed for a maximally-efficient work-stealing scheduler.  (Providing such primitives would probably invite security exploits.)  I suspect that the JVM, with some guidance from the Fortress team, will someday evolve to support work-stealing natively.

Implicitly, Fortress performs the equivalent of a cilk_spawn whenever it sees two parts of an expression that can be executed in parallel.  For example, in the familiar recursive implementation of a function that computes a Fibonacci number,

fib(n: ZZ32): ZZ32 =
  if n < 2 then n
  else fib(n - 1) + fib(n - 2)
  end;

the two recursive calls to fib() are invoked in parallel.  The equivalent code in Cilk++ would be:

int fib(int n) {
   if (n < 2) return n;
   a = cilk_spawn fib(n – 1);
   b = fib(n – 2);
   cilk_sync;
   return a + b;
}

For this to work, it is important that that fib has no side-effects, otherwise a data race may occur.  One must be very careful when calling functions with side-effects, lest the same side-effect occur in multiple parts of a parallel expression.  (Even in serial languages like C++, performing side-effects on the same memory location in multiple branches of the same expression is never recommended because you cannot control the order in which the side-effects will occur.)  In addition to this implicit parallelism, the language supports some explicit parallel constructs: 

  • A for loop is parallel by default.
  • A do also construct specifies two or more blocks that may execute in parallel with one another.

Fortress provides at least two mechanisms for coordinating access to shared variables in parallel operations: atomic operations and reductions.  Atomic operations use Fortress’s transactional memory system to ensure that one or more operations occur atomically, but with minimal locking and contention.  Atomic operations prevent data races, but do not by themselves ensure determinism or serial semantics.  Reductions are constructs that perform associative operations in parallel and preserve serial semantics, like Cilk++ reducers.  The hasHello example, above, uses a BIG OR reduction to compute its result.

Immutable data types

The ability to use both implicit and explicit parallelism is enhanced by the Fortress library’s orientation towards immutable data structures.  An immutable data structure is one that supports no modifying operations.  This is not to be confused with a read-only (constant) variable, which is simply an immutable view on an otherwise mutable data structure.  A variable of immutable type can be re-assigned, but only by binding it to a different object – the original object remains unchanged. (In Fortress, as in Java and most other garbage-collected languages, a variable does not hold an object, but rather a reference to an object.  Assigning a variable causes it to be bound to a different object.) To better understand immutable data types, consider the following expression, which takes a list x and removes the third element:

x := x.take(2) || x.drop(3);

The two halves of the expression each return new lists holding a subsequence of x (take(2) returns the first two elements, drop(3) returns all but the first three elements), leaving x itself unmodified.  The ||  operator returns a third new list holding the concatenation of its operands.  Finally, the variable x is reassigned to refer to the return value from ||.  The original list may or may not still be referenced somewhere else in the program, but it remains unchanged by this expression (or any other expression).  The garbage collector is responsible for freeing the original list (if it is no longer used), as well as the temporary lists returned by take() and drop().

Using immutable objects makes it easier for both humans and compilers to reason about the parallelism in a program.  Given two operations on an object belonging to an immutable type, it is readily apparent that the two operations may be performed in parallel because there are no side effects that could change the value of an immutable object.  Immutable data types are an important part of functional programming, so Fortress’s library of immutable types encourages the use of a functional programming style, to some degree, and is a significant paradigm shift from imperative programming.  As a long-time C++ programmer, I found that programming in Fortress exercised a part of my brain that had not been used much since my Lisp programming days in college.

What Next?

The current interpreter and compiler are far from being commercial products and both the language and the library are in flux, so don’t expect Fortress to solve your current parallelization problems.  Performance is further hampered by limitations in the current JVM specification (including the absence of optimized tail recursion, which hurts many divide-and-conquer algorithms). Even if an industrial-strength development environment becomes available, legacy code in languages like C++ cannot easily be ported to Fortress.

Nevertheless, Fortress is an interesting research project that has the potential to advance the field of programming in general, and parallel programming specifically.  A native (non-JVM) compiler could conceivably generate code that would run on the Cilk++ runtime platform. I hope that Sun (and now Oracle) will continue to support the Fortress project.  The industry needs projects like Fortress to continue generating fresh ideas.

For more information about Fortress, see the Project Fortress Community web site.

2 Comments Click here to Read/write comments

A cute technique for avoiding certain race conditions

Posted by Matteo Frigo on Thu, May 07, 2009
 | Submit to Digg digg it | Submit to Reddit reddit | Add to delicious delicious | Submit to StumbleUpon StumbleUpon | Share on Twitter Twitter 
Tags: , , , ,

The following problem, slightly rephrased, was posed in the Intel TBB forum.  You have N particles. Particle j exerts force f(i,j) upon particle i. The usual action/reaction law holds, i.e., f(i,j) == -f(j,i). Compute the total force acting on each particle, in parallel. It turns out that the solution is nontrivial, as we discuss in the rest of this article.

The natural sequential solution is something like this:

 

   for (int i = 0; i < N; ++i)
for (int j = i; j < N; ++j) {
double t = f(i, j);
force[i] += t;
force[j] -= t;
}

The sequential implementation calls f(i,j) once to compute both the force f(i,j) exerted by particle j upon particle i, and the force -f(i,j) exerted by i upon j. Calling f once is better than calling f twice, especially because f tends to be expensive to compute. (E.g., computing gravitational forces involves a division and a square root.)

The serial solution does not parallelize easily, however. You cannot execute the outer loop in parallel, because then we have a race condition among parallel updates of force[j] for different values of i. For a similar reason, parallelizing the inner loop does not work either.

If you are willing to call f twice, both both as f(i,j) and as f(j,i), then a simple parallel implementation looks like this:

   cilk_for (int i = 0; i < N; ++i)
for (int j = 0; j < N; ++j)
force[i] += f(i, j);

That is, for each particle i (in parallel) compute (sequentially) the force exerted upon i by all other particles j. The inner loop is sequential, which avoids the race condition on the update of force[i]. However, this parallel solution calls f twice as much as the serial one, which is not good.

So how do you do it? My solution appears below. It calls f only once for every pair (i,j), for i <= j, so it is does not double the work, and it is race-free. It is a complete Cilk++ program that you can run and certify to be race-free by using Cilkscreen, our race-detecting tool.

 

#include <assert.h>
#define N 749   // number of particles

double force[N];

double f(int i, int j)
{
// compute your favorite force here. We use a constant
// force 1.0 for illustration purposes
return 1.0;
}

void basecase(int i, int j)
{
assert(i <= j);
double t = f(i, j);
force[i] += t;
force[j] -= t;
}

/* traverse the rectangle i0 <= i < i1, j0 <= j < j1 */
void rect(int i0, int i1, int j0, int j1)
{
int di = i1 - i0, dj = j1 - j0;
if (di > 1 && dj > 1) {
int im = i0 + di / 2;
int jm = j0 + dj / 2;
cilk_spawn rect(i0, im, j0, jm);
cilk_spawn rect(im, i1, jm, j1);
cilk_sync;
cilk_spawn rect(i0, im, jm, j1);
cilk_spawn rect(im, i1, j0, jm);
} else {
for (int i = i0; i < i1; ++i)
for (int j = j0; j < j1; ++j)
basecase(i, j);
}
}

/* traverse the triangle n0 <= i <= j < n1 */
void interact(int n0, int n1)
{
int dn = n1 - n0;
if (dn > 1) {
int nm = n0 + dn / 2;
cilk_spawn interact(n0, nm);
cilk_spawn interact(nm, n1);
cilk_sync;
rect(n0, nm, nm, n1);
} else if (dn == 1) {
basecase(n0, n0);
}
}

int cilk_main()
{
interact(0, N);
}

This program is not meant to be obvious, so let me explain what it does.

The sequential program is equivalent to calling basecase(i,j) for all 0 <= i <= j < N. Another way to look at it is that we call basecase(i,j) for all points (i,j) in the triangle 0 <= i <= j < N, which looks like the shaded area in this figure:

 

 

Procedure interact traverses this triangle in parallel, and in fact it is a little bit more general, because it traverses triangles of the form n0 <= i <= j < n1. Initially, we set n0 = 0 and n1 = N in cilk_main.

Procedure interact works by recursively partitioning the triangle. If the triangle consists of only one point, then it visits the point (n0, n0) directly. Otherwise, the procedure cuts the triangle into one rectangle and two triangles, like this:

 

 

The two smaller triangles can be executed in parallel, because one consists only of points (i,j) such that i < nm and j < nm, and the other consists only of points (i,j) such that i >= nm and j >= nm. Thus, the two triangles update nonoverlapping regions of the force array, and thus they do not race with each other. However, the rectangle races with both triangles, and thus we need the cilk_sync statement before processing the rectangle.

To traverse a rectangle we use procedure rect, which also works recursively. Specifically, if the rectangle is large enough, the procedure cuts the rectangle i0 <= i < i1, j0 <= j < j1, into four smaller subrectangles, like this:

 


 

The amazing thing is that the two black subrectangles can be traversed in parallel with each other without races. Similarly, the two gray subrectangles can be traversed in parallel with each other without races. However, the black subrectangles race with the gray, so we must use a cilk_sync statement after processing the first pair of subrectangles.

To see why there are no races between the two black subrectangles (the same argument applies to the grey) observe that the i-ranges of the two subrectangles do not overlap, because one is smaller than im and the other is larger. For the same reason, the j-ranges do not overlap either. In order for races not to occur, however, we must also prove that the i-range of one subrectangle does not overlap with the j-range of the other, because the base case updates both force[i] and force[j]. This property holds because when interact calls rect initially, the i-range is n0 <= i < nm, whereas the j-range is nm <= j < n1, so the two ranges never overlap.

23 Comments Click here to Read/write comments

Multicore-enabling the Murphi Verification Tool

Posted by Yuxiong He on Tue, Apr 28, 2009
 | Submit to Digg digg it | Submit to Reddit reddit | Add to delicious delicious | Submit to StumbleUpon StumbleUpon | Share on Twitter Twitter 

Murphi is a popular finite-state machine verification tool, used widely in the design of cache coherence algorithms and protocols, link-level protocols, executable memory model analysis, and analysis of cryptographic and security-related protocols. These complex protocols are often verified by examining all reachable protocol states from a set of possible start states. With proper memory management like hash compression and space reduction techniques, runtimes become a major limiting factor. We developed a multicore-enabled Murphi using the Cilk++ concurrency platform, and achieved excellent speedup:

Multicore Murphi ldash application speedup on 16-core AMD processor

Our parallelized Murphi is based on standard Murphi 3.1. The standard Murphi packages include a Murphi compiler and verifier, where the compiler translates models writing in the Murphi language to C++ code, and the verifier conducts the verification of this code. The Murphi compiler has approximately 19K lines of code, and the verifier has 7K lines. It took one engineer 2 weeks to parallelize Murphi using Cilk++. The parallel version supports all key functions of the sequential version. (It also fixes some severe memory leakage issues in the original code and provides 64-bit support where the original is highly dependent on the 32-bit platform.) Below I briefly describe the changes we made to parallelize the Murphi code.

Murphi verifies a protocol by traversing the entire state space, which is essentially a graph-traversal process. The most commonly used graph traversal algorithms are depth-first search (DFS) and breath-first search (BFS). We developed parallel DFS and BFS in the "Cilkified" version of Murphi.

  • Depth-First Search: Sequential DFS in Murphi uses a hash table to store all visited states and a stack to track new states. With a thread-safe concurrent hash table and a concurrent stack, one can easily get DFS running in parallel. However, development of a thread-safe and almost contention-free concurrent stack could be a time-consuming task. We therefore took an alternative approach and implemented a recursive version of DFS without using a stack. Then all we needed is a concurrent hash table to make the sequential DFS run in parallel. Our concurrent hash table is lock-free, and uses atomic operation Compare_and_Swap to ensure thread safety for entry lookup and insertion.
  • Breadth-First Search: Sequential BFS in Murphi uses a queue to track new states, and a hash table to store all visited states. We implemented the parallel breadth-first search slightly differently than the sequential version. Instead of using a queue approach, we used an implementation of an unordered set, called a bag. The parallel search keeps two bags at any time to represent the frontier of the search. The current bag contains the existing frontier of nodes and is walked in parallel, while the next bag accumulates the new frontier one step further from the root of the search. The start node for the search is initially placed into the current bag. The search is broken into phases, where each phase consists of examining all the nodes in the current bag, computing their successors, and storing those successors that have not yet been seen into the next bag. At the end of each phase, the current bag is destroyed, and the next bag becomes the new current bag. Since unioning bags is an associative operation, we declare the bags to be reducers, which allows each worker to operate largely on its own local bag. Our bag provides efficient implementation of insertion, merge, and split operations. Starting from n separate items, bag insertion and merging take constant amortized time and walking through splitting runs in linear time. (The bag implementation we used in Murphi is contributed by Tao Benjamin Schardl from his final year project at MIT.)

Another issue we faced when parallelizing Murphi code is that the sequential code uses a single buffer to perform all state operations. For each state operation like rule matching, invariant verification, etc., it copies the state into the single buffer, performs the operation, and copies the result out of the buffer. This structure is not parallel-friendly because that single buffer becomes the location of many data races if it is not protected by locks or atomic operations, and it becomes the center of contention if it is protected. One fix is to rewrite the program through passing state as parameter to avoid using the single buffer, however this fix would involve significant changes in the structure of the original verifier program as well as the compiler code. Alternatively, we chose to implement the single buffer as a hyperobject of the state, and it works as a holder where each worker has a different view of the buffer. This successfully solves the race and contention problem with just a few lines of changes to the original sequential program.

Our Murphi achieves excellent speedup on multicore machines. For most of the non-trivial applications in the example directory of Murphi package, speedup is linear. The figure above illustrates the results of checking ldash - a complex cache coherence protocol model. Multicore Murphi achieves 15.5X and 14.9X speedup using BFS and DFS respectively on 16-core AMD machine (four quad-core AMD Opteron processors connected using HyperTransport).

Our Multicore Murphi is available here for evaluation. This release includes the compiler binary, and source code of the verifier.

  • If you are interested in speeding up your model checking process, give it a try!
  • If you are interested in how to parallelize a full-size application using Cilk++, the source code of the Murphi verifier may be a good example.
  • If you are interested in how to conduct parallel graph traversal or build an efficient dag, you may find our concurrent hash table and Bag implementation useful.
  • Of course, feel free to try it just for fun as well!
(After the evaluation, if you'd like to continue using our multicore-enabled version of Murphi, please contact our team at info[at]cilk.com)

0 Comments Click here to Read/write comments

All Posts | Next Page