One of our summer interns, Matthew Steele, suggested a matrix-multiplication algorithm that more effectively used the cache than an algorithm that might be more intuitive to a mathematician or physicist. No doubt, the intuitive triply-nested loop is the preferred solution of many software engineers. But Matt's function beat the intuitive one - even after the latter was cilkified. What made his algorithm perform so well? The short answer is that, although Matt's algorithm accessed all of the same memory addresses the same number of times as the intuitive algorithm, his function caused fewer cache misses. The original function caused the computer to spend more time loading and storing cache lines than executing the program.
Your computer's cache is divided into lines. When your CPU accesses a certain memory address, if it isn't in the cache, it will fetch a line from the next level out rather than a single word. This is a slow process, but if subsequent accesses to memory are nearby, there is a high probability that what the CPU needs is already in the cache. However, certain structures are unlikely to fit in the cache all at once, or at least they may be spread across many lines. Matrices are a prime example and the consequence is that the way in which you access elements of a matrix has a significant bearing on performance. Consider the following two C++ snippets accessing the same matrix along different dimensions:
for (int i = 0; i < rows; ++i) {
for (int j = 0; j < cols; ++j) {
out[i] += matrix[i][j]) * in[j];
}
}
for (int i = 0; i < cols; ++i) {
for (int j = 0; j < rows; ++j) {
out[i] += matrix[j][i] * in[j];
}
}
The two snippets have substantially different performance, which I'll illustrate with a recent example from an early adopter of Cilk++. Dan Mirman, a UConn Psychology Postdoctoral Fellow, brought his neural network simulation code to us because each time he ran it, it would take him a day to get results. Naturally, he wanted to parallelize it to improve performance. The two snippets above correspond directly to two functions in the MikeNet Neural Network Simulator library.
The former function is used to multiply a matrix by a vector, and the latter apparently was copied and pasted and the indices were reversed to perform the multiplication of a matrix transpose by a vector. In fact, there was a trivial modification, taking into account cache usage, that turned out to be nearly as significant as parallelization. To understand the nature of the change, consider the following layout of a matrix in memory:
Sequential accesses in the first function point to adjacent values in memory. Thus, accesses are quick. And the only time it misses the cache (on matrix accesses) is when it is done with the old line and will never load it again.
In the second function, rather than iterating through sequential addresses, every access is on a different line. In short, every access misses the cache. Furthermore, by the time the outer loop completes the first iteration, the second iteration's memory accesses are looking for lines that have long since been flushed. In other words, where the first function pulls each line into the cache once, the second pulls each line into the cache for every iteration of the outer loop.
With this in mind, a simple modification to MikeNet achieved ~3.4x improvement to the overall program on a roughly 500x500 matrix without using Cilk++ at all! All that was required was to invert the two for-loops in the transpose function so that sequential iterations of the inner loop were accessing adjacent locations in memory:
for (int j = 0; j < rows; ++j) {
for (int i = 0; i < cols; ++i) {
out[i] += matrix[j][i] * in[j];
}
}
Incidentally, if you use MikeNet or otherwise have similar matrix code, this is a simple modification that you can perform to save yourself some cycles. This is already enough to change the way Dan interacts with his program. But can the work be spread across multiple cores?
The original algorithm has some hidden parallelism in both functions that is easy to expose. In the original code, merely converting the outer for-loops into cilk_for-loops provides some limited parallelism. The change is not quite as trivial in the second function after it has been modified for cache locality, however. Since the output vector is based on the old inner loop, and since it is the new outer loop, parallelizing it causes a data race on each of the elements in the vector. Furthermore, so little work is done in the new inner loop that the overhead to running it in parallel is significant. This, of course, is precisely the purpose of a reducer. Wrapping the output vector in a reducer eliminates the data race caused by parallelizing the new outer loop:
array_reducer_t art(out, cols);
cilk::hyper_ptr art_ptr(art);
cilk_for (int j = 0; j < rows; ++j) {
float *tmp = art_ptr->get_value();
for (int i = 0; i < cols; ++i) {
tmp[i] += matrix[j][i] * in[j];
}
}
This reducer is about as expensive as a reducer can be. Every time it merges, it must walk down two vectors and sum each pair of elements (complexity: O(cols)). In general, one would much rather have reductions that perform trivial operations to cut down on overhead.Nevertheless, when I made the changes to the algorithm, even with the expensive reducer, and with the limited parallelism the cilk_for exposes, I saw significant performance improvement on one of our testing servers.
Bear in mind that I parallelized three functions in the whole program, and the only difference between the two versions is that one of the functions has been altered to access its matrix along a different dimension.
To sum up, identifying parallelism is a key part of program analysis, but it isn't the only thing you should consider when thinking about performance. Especially for some of these large data structures (e.g., matrices), taking some time to look critically at how they interact with your computer's cache is just as significant.