Subscribe by Email

Your email:

Multicore Programming Blog

Current Articles | RSS Feed RSS Feed

Concepts in Multicore Programming - Lecture 3: Analysis of Multithreaded Algorithms

Posted by Ilya Mirman on Mon, Jul 20, 2009
 | Submit to Digg digg it | Submit to Reddit reddit | Add to delicious delicious | Submit to StumbleUpon StumbleUpon | Share on Twitter Twitter 
Tags: 

We recently teamed up with MIT's Professional Development Program on a 2-day workshop focused on multicore programming.

(Here are lectures #1 and #2.)

The third lecture covered:

  • Implementation of Cilk Loops
  • Divide-&-Conquer Recurrences
  • Matrix Multiplication
  • Tableau Construction

Here is the video and slides from the lecture:

 

1 Comments Click here to read/write comments

Concepts in Multicore Programming - Lecture 2: Parallelism & Scheduling Theory

Posted by Ilya Mirman on Wed, Jul 15, 2009
 | Submit to Digg digg it | Submit to Reddit reddit | Add to delicious delicious | Submit to StumbleUpon StumbleUpon | Share on Twitter Twitter 
Tags: 

We recently teamed up with MIT's Professional Development Program on a 2-day workshop focused on multicore programming.

(Here's lecture #1.)

The second lecture covered:

  • What is Parallelism?
  • Scheduling Theory
  • Cilk++ Runtime System
  • A Chess Lesson

Here is the video and slides from the lecture:

2 Comments Click here to read/write comments

Cilk++ 1.1.0 Beta

Posted by Ilya Mirman on Wed, Jul 15, 2009
 | Submit to Digg digg it | Submit to Reddit reddit | Add to delicious delicious | Submit to StumbleUpon StumbleUpon | Share on Twitter Twitter 

Cilk++ 1.1.0 includes several cool features to make your multicore C++ programming projects even more fun:

  • A new, easier to use syntax for defining and using reducers
  • Cilkview Scalability Analyzer for analyzing, benchmarking and predicting parallel program performance
  • Additional reducers to support logical operations have been added to the reducer library
  • Cilkscreen Race Detector can now operate on significantly larger programs
  • Runtime performance improvements and compiler optimizations
  • A variety of bug fixes and other improvements
  • Linux-specific: Support for Cilk++ code in shared libraries

Cilk++ for MacAnd for Mac users: Alpha-level support for Macintosh available on request!

 

Download Cilk++ 1.1.0 Beta...

2 Comments Click here to read/write comments

Multicore Programming Workshop - Lecture 1

Posted by Ilya Mirman on Wed, Jul 15, 2009
 | Submit to Digg digg it | Submit to Reddit reddit | Add to delicious delicious | Submit to StumbleUpon StumbleUpon | Share on Twitter Twitter 
Tags: 

We recently teamed up with MIT's Professional Development Program on a 2-day workshop focused on multicore programming.

The first lecture covered:

  • The multicore programming challenge
  • Shared-memory hardware
  • Leading concurrency platforms (Pthreads, OpenMP, TBB, Cilk++)
  • Race conditions

Here is the video and slides from the first lecture:

Here's the next lecture (#2).

2 Comments Click here to read/write comments

Visualizing Parallel Speedup with Cilkview

Posted by Will Leiserson on Tue, Jun 30, 2009
 | Submit to Digg digg it | Submit to Reddit reddit | Add to delicious delicious | Submit to StumbleUpon StumbleUpon | Share on Twitter Twitter 

Along with our upcoming release of Cilk++ v.1.1 we are including a new tool to help you visualize application performance: Cilkview. Cilkview runs an application binary and generates performance data on sections you specify. It combines this data with performance estimates generated by the work/span calculator and produces a graph of the results.

Consider the following parallel algorithm for calculating fibonacci numbers:

The Directed Acyclic Graph (DAG) that represents the call tree fib(35) is a binary tree of parallel nodes, so there is parallelism aplenty. Furthermore, a look at the function reveals that it is basically CPU-bound. The only memory accesses are into the frame and they have the highest possible locality. Naturally, this is a terrible way to calculate fibonacci numbers, but it is a simple example of the class of algorithms that are easily parallelized. To make Cilkview predict speedup, insert the following code near the call site:

start() and stop() indicate the interesting sections of code to Cilkview. dump() will associate the collected data with a label that Cilkview will use to generate a graph. Compile your program and run Cilkview to generate a graph:

  % cilkview ./fib

Cilkview runs your binary on a single core with instrumentation that calculates the work and span of the sections of code you specify. In this case, Cilkview made the same prediction that we did. The steeper line indicates perfect linear speedup up until the theoretical parallelism of the algorithm. The theoretical parallelism does not appear on the graph because it is too large. The more gently sloped line indicates Cilkview's prediction of more realistic performance, burdened speedup, based on the overhead of the Cilk++ runtime. Since the two lines are so close (nearly indistinguishable in this plot), Cilkview predicts that actual trials of this application should scale with nearly perfect linearity.

To test this, Cilkview can also run the trials themselves by specifying the "-trials all" option at the command line. No additional code is necessary:

 
% cilkview -trials all ./fib

 

The red stars indicate trial data. Cilkview by default runs one trial for each of 1..n Cilk workers where n is the number of cores. The plot, above, was generated on a computer with 8 cores. As predicted, the trial performances scaled with near-perfect linearity. Slight variations can be attributed to the limitation that only one trial was performed for each configuration.

But what about more common algorithms such as sorting? The following is a simple algorithm for parallel quicksort:

At first glance, this algorithm looks like it has plenty of parallelism, too. Even though there are lots of memory accesses the quicksort algorithm has good cache properties. But consider that the DAG would have a substantial sequential node at the root for the first partition(). Likewise, after the recursive call, each of the first parallel children have to run partition() on their portions of the array before they can do anything parallel. In fact, most of the work in quicksort is done in the partition() algorithm -- serially! Cilkview's analysis bears this out:

The plot represents qsort run on an array 10,000,000 integers. In this graph you can see the that the ideal parallelism (the horizontal line) is small enough to appear on the graph (for this or any reasonable size of input). Plotting the trials gives data that is consistent with this:

The trials fit well within the predicted margin.

Now consider the following code for parallel memcpy:

On the surface, this has lots of parallelism. All of the work is in the cilk_for, and the automatic grain size should handle the overhead of the implicit cilk_spawns with respect to the trivial work done at each iteration. Conceptually, memcpy() is similar to fib() in terms of parallelism. The DAG looks similar to that of fib() except the tree is balanced. Unfortunately, the prediction generated from work/span data is totally inconsistent with the actual performance data from the trials (parallel_memcpy() run on an array 400 million bytes long):

Why does the trial data not match with the prediction? Cilkview takes the work and span into account and estimates the overhead of the runtime to arrive at a reasonable estimation of parallelism. But the limitation was not in the amount of parallelism, here. A discrepancy between the prediction and the trial data indicates other factors at work. In this case, the memory bandwidth became saturated and the speedup ceased to scale as the processors began to queue for access to main memory. The computer on which this was run has two memory controllers, so the scaling would have been even worse on a typical desktop machine. Cilkview's parallelism prediction is useful for understanding whether your algorithm has enough parallelism to scale effectively. A graph in which the trial data falls below the lower-bound indicates that the algorithm is running up against another limitation.

A final, happier example is the parallel matrix multiplication algorithm developed by Matteo Frigo and Matthew Steele. This algorithm was intended as an alternative to more intuitive parallel matrix multiplication algorithms that tend not to use cache effectively:

Note that the algorithm does not actually get super-linear speedup as it might appear in the 8-worker case. Remember that a single trial was run at each level, so slight variations in runtime can cause visible changes in the graph.

Cilkview will ship with Cilk++ v.1.1 for both Windows and Linux. There is more functionality included with the tool (e.g., you can instrument multiple sections of code, generate .csv files, set the number of cores on which to predict performance, and more). The hope is that Cilkview will help users confirm or deny their intuitions about parallel algorithms. Also, it makes pretty pictures.

9 Comments Click here to read/write comments

Multicore Adoption: More Competitive than World Cup and Miss Universe?

Posted by Ilya Mirman on Mon, Jun 29, 2009
 | Submit to Digg digg it | Submit to Reddit reddit | Add to delicious delicious | Submit to StumbleUpon StumbleUpon | Share on Twitter Twitter 

The map pictured above is not a secret plan for world domination (we're pretty open about our goals). Nor is it a chart of our most recent performance in the board game RISK (we feel more comfortable with troops stationed in Greenland and Kazakhstan).  Rather, the map pinpoints locations of the early adopters of Cilk++ worldwide.

Since releasing Cilk++, our solution for multicore-enabling C++ applications, just a few months ago, we've seen thousands of leads and downloads from developers from around the world. And hundreds of universities are starting to incorporate our multicore software in their curricula.

Would you rather be dating Miss Universe or Miss Multicore? 

We thought it might be fun to see if these "multicore hotspots" have any correlation with some other categories in which countries vigorously compete. Is a nation that thrives at World Cup Soccer more likely to score a leadership role in multicore programming? Are nations which produce high-ranking Miss Universe contestants more likely to recognize the pure beauty of Cilk++?

Take a moment to digest the following chart, which reflects the latest international rankings in economic productivity, soccer, beauty pageantry and parallel computing:

Global Competitiveness Measured in terms of Money, Sports, Beauty & Brains*

ECONOMY (GCI)

WORLD CUP
SOCCER

MISS UNIVERSE

WORLD
MULTICORE
INDEX

1. USA
2. Switzerland
3. Denmark
4. Sweden
5. Singapore
6. Finland
7. Germany
8. Netherlands
9. Japan
10. Canada

1. Spain
2. Germany
3. Netherlands
4. Brazil
5. Italy
6. Argentina
7. England
8. Croatia
9. Russia
10. France

1. Venezuela
2. Colombia
3. Dominican Republic
4. Russia
5. Mexico
6. Kosovo
7. Spain
8. USA
9. Italy
10. Australia

1. Singapore
2. Greece
3. USA
4. Australia
5. Canada
6. Iran
7. S. Korea
8. France
9. Germany
10. Italy

For the purpose of this fun exercise, we made up our own statistic called the "World Multicore Index."  The WMI is determined by dividing the number of Cilk++ downloads  in each country by its population.

Obviously, all statistics have their limitations and these WMI rankings are no exception. We are a start-up without a huge marketing megaphone, and these results reflect only the early adopters of Cilk++ worldwide.  That said, our efforts to spread the gospel of multicore do reach a wide and diverse audience that cannot be ignored.

So, are the nations that produce jocks and beauty queens equally committed to raising homegrown computer geeks?  Right now, if you want to create a start-up company with a gorgeous sales staff and a killer soccer team, Rome or Milan looks like the place to be. Italy is currently the only country to be ranked in the Top 10 for the World Cup, Miss Universe and Multicore Index.

But time to get a little more serious. If you need to have software optimized for multicore CPUs five years from now, you had better start thinking about it now.

Are economic powerhouses more likely to be going multicore?

Logic might dictate that the more powerful the country's economy, the more likely that its programmers would be exploring multicore technology. So let's dive a little deeper into that hypothesis and see how often it rings true.

The Global Competitive Index, or GCI, measures a dozen different indicators to determine a country's economic prowess.

Do nations with the highest GCI -- those with exceptional educational systems and a large population of engineers -- assume leadership roles in adopting multicore technology? 

The results, quite frankly, surprised us - there is not a great deal of correlation between the indices:

  1. Of the world's top 10 economies ranked by GCI, only four (Singapore, USA, Canada, Germany) seem to be aggressively pursuing opportunities to be multicore technology leaders.
  2. The saturation of blue dots on our map can be deceiving. Economic powerhouses like Japan (# 17 in WMI), China (# 16), and India (# 15) seem to lag a bit in terms of going multicore, in our humble opinion.  We're looking forward to sharing more trial software with our friends in the Far East.
  3. On the flip side, both Iran and Greece – nations better known for exporting yummy treats (pistachios, baklava) than computer technology – could potentially establish themselves as serious players on the world multicore stage.

The good news is that unlike soccer (where players have been training since birth), there is not an insurmountable bar to become competitive in this arena!

It's still early in the adoption of multicore (today's CPUs feature just 2-4 cores). And the various programming approaches are getting better and better. So jump in the multicore pool – the water is warm!

As warm as the Mediterranean or the Persian Gulf.

RESEARCH FOOTNOTES

  1. World economy rankings are based on the 2008-2009 Global Competitive Index (GCI), which measures a dozen economic indicators as criteria for success. http://www.weforum.org/en/media/Latest%20Press%20Releases/PR_GCR082
  1. Latest FIFA World Cup rankings as of May 2009. http://www.fifa.com/worldfootball/ranking/lastranking/gender=m/fullranking.html
  1. This Top 10 Miss Universe list is from 2008. The 2009 tiara is up for grabs in the Bahamas on August 23rd. http://www.missuniverse.org/
  1. The World Multicore Index (WMI) is determined by dividing a country's total number of Cilk software downloads by its population.

1 Comments Click here to read/write comments

Multicore-enabling FP-tree Algorithm for Frequent Pattern Mining

Posted by Yuxiong He on Thu, Jun 25, 2009
 | Submit to Digg digg it | Submit to Reddit reddit | Add to delicious delicious | Submit to StumbleUpon StumbleUpon | Share on Twitter Twitter 
Untitled Document

In data mining, association rule mining is a popular and well-researched method for discovering interesting relations between variables in large databases.  For example, the rule {onions, potatoes} => {beef} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, he or she is likely to also buy beef. Such information can be used as the basis for decisions about marketing activities.  Frequent pattern mining is the basis of association rule mining.   Given a list of transactions, frequent pattern mining returns a complete set of items that occur more than a threshold of times.  One of the fastest and most popular algorithms for frequent pattern mining is the FP-tree [1] algorithm.   In this blog post, I describe how to parallelize the FP-tree algorithm using Cilk++.    

The FP-tree algorithm builds a prefix tree representation of a given database of transactions, and generates a complete group of frequent item sets through recursive tree projections.   Several algorithms [2-4] have been proposed in the literature to address the problem of parallelizing FP-tree.  The algorithm in [2] constructs multiple local parallel trees, interlinks the local counters, and conducts mining jointly on these trees.  The work in [3] classifies hot and cold items, forms a hot node hash table and divides the work manually for tree construction.   These algorithms use different approaches to achieve thread-level parallelism.  That said, they all have one thing in common – they require substantial changes to the sequential algorithm in order to explore parallelism, avoid data races and obtain fair load balancing.  Below I describe a multicore-enabled parallel FP-tree implementation using Cilk++, which maintains the sequential semantics of the FP-tree algorithm, and requires relatively minor changes to the sequential code.  On a 4-core machine, this implementation achieves 2.6x speedup on tree building and 3.3x speedup on frequent set generation for our tests on 1M-transaction database.

The organization of this blog post is as follows.  First, I present a definition of the frequent pattern mining problem.   Then, I illustrate the FP-tree algorithm in three parts – preprocessing of transactions, tree construction, and pattern generation.  Section 2 to 4 describe the algorithm as well as the parallelization of each of the three steps.    Section 5 covers performance.  The sequential implementation we used is from Christian Borgelt's Webpages (http://www.borgelt.net//fpgrowth.html).  (The paper [5] describing this implementation is available on that webpage, and this post uses some examples in the paper to illustrate the sequential algorithm.)

Section 1. Frequent Pattern Mining Problem

Given a list of transactions, frequent pattern mining means to look for sets of items that occur more than a predefined threshold of times in these transactions.   More formally, it is defined [1] as:

Let I be a set of items, and a transaction database DB = { T1, T2, …, Tn}, where Ti is a transaction which contains a set of items in I.  The support (or occurrence frequency) of a pattern A, which is a set of items, is the number of transactions containing A in DB.  A, is a frequent pattern if A’s support is no less than a predefined minimum support threshold S.  Given a transaction database DB and a minimum support threshold, S, the problem of finding the complete set of frequent patterns is called the frequent pattern mining problem.

 

Section 2. Preprocessing of Transactions

FP-tree preprocesses the transaction database as follows: in an initial scan of the frequencies of the individual items (support of single element item sets) are determined.  All infrequent individual items, i.e., all items that appear in few transactions than the user-defined threshold S, are discarded from the transactions since they cannot be part of a frequent pattern.  In addition, the items in each transaction are sorted, so that they are in descending order with respect to their frequency in the database.  Each transaction is then recoded using an integer array respecting each item using its index.  Figure 1 shows an example of a transaction database before and after preprocessing.

Figure 1.  (a) A transaction database before preprocessing.  (b) Item frequency and index table. The threshold value is 3, and f, g will be discarded since their frequency is less than the threshold.  (c) Reduced transaction database with items in transactions sorted in a descending order with respect to their frequency. (d) Reduced transaction database after recoding transactions using item index.

Parallelizing the scan of transactions in database is I/O specific, and is beyond the scope of this post.   Parallelizing the sorting and recording of items in each transaction is straightforward because they are all independent tasks.   As a result, all we need to do is to change the “for” loop in the sequential code to “cilk_for” loop.

 

Section 3. FP-Tree Construction

An FP-Tree is a prefix tree for transactions.  Each path of the tree represents a set of transactions that share the same prefix, and each node corresponds to one item.  In addition, all nodes referring to the same item are linked together in a list, so that all transactions containing a specific item can easily be found and counted by traversing this list.  This list is accessible through a head element, which also records the item’s total number of occurrences in the database.  Figure 2 shows the FP-Tree for the (reduced) database shown in Figure 1.


Figure 2. FP-Tree for the reduced database shown in Figure 1.


In the sequential implementation, the initial FP-Tree is built from the preprocessed transaction database as a simple list of integer arrays (as shown in Figure 1(d)).  The FP-Tree construction includes two phases – sorting and building.    

In the sorting phase, the transactions are sorted lexicographically.  The sequential sorting uses quicksort for large data arrays, and changes to insertion sort after the data has been sorted on a macro scale.  Our parallel sorting simply cilkifies quicksort by spawning out the sorting of sub-arrays, and changes the base case of quicksort to insertion sort when the array size is small.  Figure 3(a) and Figure 3(b) show the sequential and parallel sorting codes, respectively.    


Figure 3(a) Sequential quicksort with insertion sort for small scale


Figure 3(b) Parallel quicksort with insertion sort as base case.  (Major changes over sequential version are highlighted in dark-red.)

In the building phase, the sorted list is turned into an FP-Tree using the recursive procedure _build (Figure 4(a)).  At recursion depth k, the k-th item in each transaction is used to split the database into sections, and a node of the FP-tree is created and labeled with that item.   We can see that each section can be processed rather independently.  The only races occur when these sections attempt to update the header list.  Therefore, in the parallel version (Figure 4(b)), we spawn off the work of each section to expose parallelism, and use p_FPTLIST_reducer (similar to a link list reducer supporting node insertion at the head) to represent header list and prevent data races.      During the reduce operation, the link list reducer accumulates the weight and appends the item list from sections.     (To understand the basics of reducers and how to write a reducer, please refer to reducer section of the Cilk++ User Manual).


Figure 4(a) Sequential FP-Tree construction



Figure 4(b) Parallel FP-Tree construction. Major changes from sequential version are highlighted in dark-red color.

Section 4. Pattern Generation

The core operation of FP-tree pattern generation is to compute conditional projection tree of the database, that is, the database of the transactions containing a specific item, with this specific item removed.  An example of computing a projection from the FP-tree of a database is shown in Figure 5.  In this example, we are computing a conditional projection tree based on item e.  From the header list of e, we can see that there are three transactions including item e.  Going upward from the nodes in e’s header list to root, we can draw the edges in the e-based conditional projection subtree using red links.  Including all the nodes connected by these red links and removing nodes with item e, we get the conditional projection tree based on item e (Figure 5(b)).  This projection tree indicates there are 3 conditional patterns {db, dca, bc} based on e in DB, i.e., there are three transactions that include e in DB {dbe, dcae, bce}.




Figure 5(a) Computing item-e-based conditional projection tree from the database shown in Figure 2. (b) The resulting conditional projection tree with base e.


For pattern generation, an FP-Tree of the database is projected recursively.  The pseudocode for pattern generation is shown in Figure 6.  To parallelize the pattern generation, we can unwrap the outermost layer and run the tree-projection of different items in parallel.  If tree projection is thread-safe, our work here is done.  Although intuitively, the tasks of computing projection of the same FP-tree over different items are independent because the FP-Tree is the only shared object and projection should only involve reading of the tree nodes, because the sequential implementation (Figure 7(a)) changes member variables of the tree nodes, which results in data races for concurrent tree projection.  Our parallel implementation introduces thread-local storage for tree projection to make the projection thread-safe (see Figure 7(b)). 

Figure 6. Pseudocode for FP-Tree pattern generation

 


Figure 7(a) Tree projection in sequential code


Figure 7(b) Thread-safe tree projection in parallel code.  Major changes over sequential version are highlighted by dark-red color.

 

The changes indicated above are all we need to parallelize the FP-Tree algorithm using Cilk++ – no reinvention of a parallel algorithm, no rewriting a new parallel implementation, simply adding one reducer and some Cilk++ keywords.  So with this modest effort, a considerably complex algorithm is now multicore-enabled.    

Section 5. Results

We tested the performance of this multicore-enabled FP-Tree implementation using a 1M-transaction database on a 16-core AMD Opteron machine.    As shown in Figure 8, on 4 cores, the implementation gets a 2.6x speedup on tree building and 3.3x speedup on frequent set generation; on 8 cores, it gets speedup 3.0x and 4.8x; on 16 cores, it gets 3.6x and 6.6x.   The speedup on a small number of cores is not bad, though the speedup does not scale linearly to 16 cores – primarily due to a memory bandwidth limitation.    As I mentioned in a previous blog post, a simple test to explore whether a multithreaded program has a memory bandwidth problem is to run the sequential version of the program for two cases – (1) run one copy of the sequential program on each core, and (2) run a single copy of the sequential program on the entire machine.   If the program’s execution time in case 1 is significantly larger than that in case 2, the memory bandwidth demand of the program could be high enough to become the primary constraint for program scalability.  If we run a sequential version of FP-Tree test case, the execution time on case (1) is approximately 2.4 times of the execution time on case (2).    This indicates a high memory bandwidth requirement of this application, and explains its non-linear speedup on a larger number of cores.   Moreover, there is literature [3] discussing the high cache miss rate of the FP-tree algorithm and proposing variants of the original FP-tree algorithm with better cache behavior.  It would be interesting to parallelize some cache-aware versions of FP-tree algorithm using Cilk++ and observe its scaling performance.  


Figure 8.  Speedup of FP-tree implementation on 16-core machine

References

[1] J. Han, J. Pei, and Y. Yin, '' Mining Frequent Patterns without Candidate Generation'', Proc. 2000 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD'00), 2000.
[2] O. Zaiane, M. El-Hajj, and P. Lu. Fast Parallel Association Rules Mining without Candidacy Generation. In IEEE 2001 International Conference on Data Mining (ICDM'2001), 2001.
[3] Liu, L., Li, E., Zhang, Y., and Tang, Z. 2007. Optimization of frequent itemset mining on multiple-core processor. In Proceedings of the 33rd international Conference on Very Large Data Bases. 2007
[4] H. D. K. Moonesinghe, Moon-Jung Chung, Pang-Ning Tan. Fast Parallel Mining of Frequent Itemsets.  Michigan State University.
[5] C. Borgelt. An Implementation of the FP-growth Algorithm. Workshop Open Source Data Mining Software. 2005

2 Comments Click here to read/write comments

Reducers and Other Cilk++ Hyperobjects: Peeking Under the Hood

Posted by Ilya Mirman on Wed, Jun 17, 2009
 | Submit to Digg digg it | Submit to Reddit reddit | Add to delicious delicious | Submit to StumbleUpon StumbleUpon | Share on Twitter Twitter 

We have written extensively about the challenges global variables pose for multithreaded applications - inhibiting parallelism by inducing data races. 

One way to deal with this challenge is by employing Cilk++ hyperobjects. Cilk++ hyperobjects mitigate data races on non-local variables without the need for locks or code restructuring. Cilk hyperobjects are straightforward to use: Simply by declaring a global variable to be a hyperobject of the appropriate type, data races on the global variable are automatically resolved.

What's Under the Hood?

To date, we have shared WHAT hyperobjects do, but not HOW they work.  With the patent filings behind us, we have authored a paper for the 2009 Symposium on Parallelism in Algorithms and Architectures.  

Here's the abstract:

This paper introduces hyperobjects, a linguistic mechanism that allows different branches of a multithreaded program to maintain coordinated local views of the same nonlocal variable. We have identified three kinds of hyperobjects that seem to be useful - reducers, holders, and splitters - and we have implemented reducers and holders in Cilk++, a set of extensions to the C++ programming language that enables multicore programming in the style of MIT Cilk. We analyze a randomized locking methodology for reducers and show that a work-stealing scheduler can support reducers without incurring significant overhead.

The paper can be downloaded here.

Comments / Questions?  We'd love to hear from you!

1 Comments Click here to read/write comments

Multicore Courses through MIT's Professional Education program

Posted by Ilya Mirman on Thu, Jun 11, 2009
 | Submit to Digg digg it | Submit to Reddit reddit | Add to delicious delicious | Submit to StumbleUpon StumbleUpon | Share on Twitter Twitter 

We just wrapped up our 2-day course (6 lectures and 4 hands-on labs) at MIT.

Here are the lectures from this week's 2-day course:

5-day Course on Multicore Programming at MIT (register today!)

In addition, we'd like to point out that there is a great 5-day course on multicore programming being held July 20-24:

  • Overview: The course will approach multiprocessor programming from two complementary directions. In the first part of the course, we focus on foundations: what our programs and machines need to provide in order to ensure that concurrent programs do what we expect. We use an idealized model of computation in which multiple concurrent threads manipulate a set of shared objects. This model is essentially the model presented by standard Java or C++ threads packages.
  • Foundation: The foundations section is intended to build up the students' intuition and confidence in understanding and reasoning about concurrency. We approach this goal using examples, counter-examples, models, and exercises. These elements are laid out in a structured and progressive manner, from simple machine instructions to powerful universal constructions, the equivalent of Turing machines for multiprocessors.
  • Performance: The second part of the course will be concerned with performance. Reasoning about the performance of concurrent programs and data structures is very different in flavor from reasoning about their sequential counterparts. Sequential programming is based on a well-established and well-understood set of abstractions. There is little or no need to understand the specifics of the underlying architecture. In multiprocessor programming, by contrast, there are no such well-established abstractions. It is impossible to reason effectively about the performance of a concurrent data structure without understanding the fundamentals of the underlying architecture.  The performance part of the course will revisit many of the issues first raised in the foundations section, but in a more realistic model that exposes those aspects of the underlying architecture that most influence performance.
  • Data structures: The course then goes through a sequence of fundamental data structures, the concurrent analogs of the data structures found in any undergraduate data structures course, and a few foundation structures that are unique to the world of multithreaded computation. These data structures are introduced in an incremental way, each one extending the techniques developed for its predecessors. Each of these data structures is useful in and of itself as a reference. Moreover, by the end, the students will have built up a solid understanding of the fundamentals of concurrent data structure design, and should be well-prepared to design and implement his or her own concurrent data structures.
  • The course is based on the book, The Art of Multiprocessor Programming, by Herlihy and Shavit, Morgan-Kaufmann Elsevier 2008
Learn more...

5 Comments Click here to read/write comments

Multicore-enabling Dense Polynomial Multiplication

Posted by Ilya Mirman on Tue, Jun 02, 2009
 | Submit to Digg digg it | Submit to Reddit reddit | Add to delicious delicious | Submit to StumbleUpon StumbleUpon | Share on Twitter Twitter 

Marc Moreno Maza
Ontario Research Centre for Computer Algebra, University of Western Ontario
moreno@csd.uwo.ca

Yuzhen Xie
Computer Science and Artificial Intelligence Laboratory, MIT
yxie@csail.mit.edu

1  Introduction

In symbolic computation, dense polynomial multiplication is a fundamental operation akin to dense matrix multiplication in numerical computation. We report herein on a parallelization, in Cilk++, of dense polynomial multiplication based on multi-dimensional FFTs. Dense polynomial arithmetic is important for most computer algebra algorithms, such as the Euclidean Algorithm and its variants, which tend to densify intermediate data, even when the input and output polynomials are sparse. In addition, we focus on polynomials over finite fields since modular techniques allow us to reduce to such fields of coefficients.

2  Algorithm Review

Given two multivariate polynomials f and g with coefficients in the prime field K = Z/pZ and with ordered variables x1 < ⋯ < xn. Define di = deg(f, xi) and di = deg(g, xi), for all i. Assume there exists a primitive si-th root of unity ωi ∈ K, for all i, where si is a power of 2 satisfying sidi + di + 1. Then f g can be computed as follows.

Step 1.
Evaluate f and g at each point of the n-dimensional grid ((ω1e1, …, ωnen), 0 ≤ e1 < s1, …, 0 ≤ en < sn ) via n-D FFT.
Step 2.
Evaluate f g at each point P of the grid, simply by computing f(P) g(P),
Step 3.
Interpolate f g (from its values on the grid) via n-D FFT.

3  Framework, Challenges and Solutions

We assume that one-dimensional FFTs are computed by a black-box program, which could be a serial one. This allows us to utilize non-standard 1-D FFTs, such as Truncated Fourier Transforms (TFTs) which are hard to parallelize. TFTs are, however, of high practical interest since they can reduce unnecessary work and permit more efficient memory usage. This feature is even more significant when the number of dimensions increases. Another reason for our "1-D FFT black box" assumption is the fact that for us multiplication is a basic routine on top of which we aim at building several layers of higher-level algorithms, each of them offering opportunities for parallel execution. Therefore, this assumption allows us to avoid the parallel overhead of too fine grain size at the very low-level routines.

We use the serial C routines for 1-D FFT and 1-D TFT from the modpn library shipped with the computer algebra system Maple. Our integer arithmetic modulo a prime number relies also on the efficient functions from modpn, in particular the improved Montgomery trick. This trick is another important specificity of 1-D FFTs over finite fields which makes their parallelization even more difficult.

Multi-dimensional FFT can naturally be computed in a parallel fashion. For instance, a 2-D FFT with dimension sizes d1 and d2 essentially proceeds by the computation of d1 1-D FFTs (which can be performed concurrently) followed by the computation of d2 1-D FFTs (which can also be performed concurrently). The difficulties arise from irregular problems. Consider again the above 2-D FFT example with dimensions x1 and x2. Assume that d1 is in the order of 1000 whereas d2 is in the order of 10. When performing 1-D FFTs along x1, one computes only a few (namely 10) 1-D FFTs of large vectors. This means little parallelism (under our black box assumption). When performing 1-D FFTs along x2, one computes many (namely 1000) 1-D FFTs of small vectors. This means mis-use of FFT since FFT is really worth on sufficiently large vectors, say with at least several hundred entries.

We have conducted complexity analysis considering the performances of multivariate multiplication in terms of parallel speed-up and cache misses. They show that 2-D FFT based bivariate multiplication:

  • performs optimally when the partial degrees of the product are equal;
  • performs poorly when the ratio of those partial degrees is large, say about 500.

We show how multivariate multiplication can be efficiently reduced to balanced bivariate multiplication, based on 2-D FFTs. By balanced we mean that both dimension sizes are equal (or almost equal). With respect to a multiplication based on n-dimensional FFT, our approach may increase the work (i.e. the serial running time) by at most a factor of 2. However, it provides much larger parallel speed-up factors as reported in our experimentation.

Our approach combines two techniques that we call contraction and extension. The technique of contraction reduces multivariate multiplication to bivariate one, without ensuring that dimension sizes are equal; however, the work remains unchanged and in many practical cases the parallelism and cache complexity are improved.

The technique of extension turns univariate multiplication to bivariate one. This has several applications. First, it permits to overcome the difficult cases where primitive roots of unity of “large” orders cannot be found in the prime field K. Secondly, combined with the technique of contraction, this leads to balanced bivariate multiplication.

4  Implementation

For a dense multivariate polynomial f ∈ K[x1, …, xn] with variables x1 < … < xn and degree di in xi for 1 ≤ in, we represent it in a dense recursive manner w.r.t the given variable order. Its coefficients are stored in a contiguous one-dimensional array B. The coefficient of the term x1e1xnen is indexed by

(d1+1) ⋯ (dn−1+1) en+ (d1+1) ⋯ (dn−2 +1) en−1 +⋯+(d1+1) e2e1

in B. This representation is equivalent to a n−dimensional matrix (or grid) in row-major layout.

The parallelization of the multiplication algorithm reviewed in Section 2 takes advantage of the ease-of-use parallel constructs in Cilk++. Evaluation of a polynomial f is done by n-dimensional FFT or TFT. Along the i-th dimension, the (di +1) calls to 1-D FFT or 1-D TFT are parallelized by a cilk_for loop.

With the parallel evaluation by FFT or TFT at hand, Step 1, explained in Section 2, is realized by one cilk_spawn and one cilk_sync. Step 2 is parallelized by a cilk_for. The interpolation in Step 3 is similar to the parallel evaluation, up to some details.

A number of n−1 data transpositions is needed for such n-dimensional FFT or TFT. For 2-dimensional case, we use the cache-efficient code provided by Dr. Matteo Frigo. It employs a divide-conquer approach and fits the base case into the cache of targeted machines. For multi-dimensional case, we divide the problem into multiple 2-dimensional transpositions where Dr. Matteo Frigo’s cache-efficient code can work.

Another difficulty encountered in our implementation is loop carried dependencies. One situation that is hard to identify is when there is deep data dependency but the procedure is parallelized using cilk_for in a top-level, likewise for the improved Montgomery trick. We use Cilkscreen to effectively detect and localize data-race bugs.

5  Experimental Results

Our benchmarks are carried out on a 16-core machine with 16 GB memory and 4096 KB L2 cache. All the processors are Intel Xeon E7340 @ 2.40GHz.

The performance of our bivariate interpolation (same as a 2-D TFT up to some details) for problems of sizes from 2047 to 16383 is given in Figure 1. Using 16 processors, the maximum speedup is 15 for the problem of size 16383.

Figure 1: Speedup of bivariate interpolation (Step 3) via 2-D TFT

For the multiplication of two bivariate polynomials with the degrees of all the variables being equal and large, say in the order of 1000 to 8000, our implementation got good speedup on a 16-core machine: quasi-linear speedup up to 12 processors. Using 16 processors, the speedup factor of the problem of size 8191 reaches 14, as illustrated in Figure 2.

Figure 2: Speedup of bivariate multiplication via 2-D TFT

Figures 3 and 4 compare the timing and speedup for the multiplication of two 4-variate polynomials via direct 4-D TFTs and contraction to balanced 2-D TFTs. In this problem, each polynomial has partial degrees 1023, 1, 1, and 1023. Even on one processor, multiplication by the contraction to balanced 2-D TFT method is faster than that by the direct 4-D TFT method by about 139%. Hence, using 16 processors, the net speedup by contraction to balanced 2-D TFT method w.r.t the serial direct 4-D TFT method reaches 31.

Figure 3: Timing of a 4-variate multiplication by direct 4-D TFTs vs contraction to 2-D TFTs

Figure 4: Speedup of a 4-variate multiplication by direct 4-D TFTs vs contraction to 2-D TFTs

The timing and speedup factors for the multiplication of two univariate polynomials of degree 25427968 are illustrated in Figure 5 and 6, by both the direct 1-D TFT method and extension to balanced 2-D method. On one processor, extension to balanced 2-D method is about 27% slower. However, using 16 processors, the speedup gain by extension to balanced 2-D is about 10.47 w.r.t the direct 1-D TFT method on one processor.

Figure 5: Timing of a univariate multiplication by direct 1-D TFTs vs extension to 2-D TFTs

Figure 6: Speedup of a univariate multiplication by direct 1-D TFTs vs extension to 2-D TFTs

Acknowledgements: This work was supported by NSERC and MITACS NCE of Canada, and NSF Grants 0540248, 0615215, 0541209, and 0621511. We are very grateful for the help of Professor Charles E. Leiserson at MIT, Dr. Matteo Frigo and all other members of Cilk Arts.

1 Comments Click here to read/write comments

All Posts | Next Page