(This is a follow-up to our earlier post on multicore storage allocation.)
While working with an early Cilk++ adopter, it quickly became apparent that the
default memory allocator shipped with the Windows C Run-Time Library can be a
bottleneck in a multithreaded application. The Windows memory allocator has a
single lock which it uses to serialize access to its internal structures. While
this is safe, it proved to cause a serious loss of parallelism in the
customer’s application.
There are lots of memory allocators available, both open source and
commercially. We looked at using a number of them didn't find any that had the
combination of features we were looking for. Ultimately we chose to write our
own. Since the customer's program was a Windows application, we initially
wrote concentrated on the Windows implementation.
Features borrowed from Hoard
We decided to build the new allocator inspired by the principles in
Hoard: A Scalable Memory Allocator for Multithreaded
Applications by Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe and
Paul R. Wilson. Hoard is designed to minimize lock contention between
threads, false sharing and memory drift. Detailed information about Hoard can
be found at the Hoard
website.
We named our new memory allocator "Miser" to play on its relationship
to Hoard. We have implemented Miser for both Linux and Windows, and this post describes data structures common to both, several Windows-specific implementation details, and several Linux-specific notes at the end.
Superblocks
The basic unit of allocation in Miser is a "superblock." A superblock
holds a header, an array to hold the size of each requested block of memory and
an array of same-sized memory blocks, or bins, which can be used to satisfy
memory requests. The bins are powers of 2; 8 bytes through 256 bytes. Studies
have shown that this will satisfy over 98% of memory requests. Requests for
larger blocks are forwarded to the C Run-Time Library. The array of bins is
placed against the back of the superblock, so memory blocks returned to the
user are always naturally aligned.
Global and Per-Thread Pools
Miser, like Hoard, features a private pool of memory for each thread. Because
the pool is thread-private, it can be accessed without locking. A pool is
made up of superblocks which are kept in a roughly sorted order based on how
much space is available in each superblock. The use of the per-thread pools is
fundamental to improving application performance:
-
Per-thread pools allow Miser to satisfy most memory requests without locking.
-
Per-thread pools limit false-sharing. Unless a block is passed from one thread
to another, all use of the memory accessed through a cache-line should be from
one thread.
The global pool is not owned by any thread. If a per-thread pool cannot satisfy
a request, it will take a superblock from the global pool. If the global pool
is empty, Miser will allocate more memory from the operating system. As memory
is released to a per-thread pool, it will contribute superblocks to the global
pool to prevent memory drift.
Additional Features
While the features inspired by Hoard provided a good foundation, we had the
following additional requirements for Miser:
-
It must be able to be loaded dynamically. This allows a component shipped as
a shared library to use Miser instead of requiring
that the application load it at startup.
-
It must be able to recognize blocks that it had not allocated, and pass those
to the system C Run-Time Library.
-
It must be thread-safe and fast.
- It must be able to simultaneously
support the multiple versions of the C Run-Time Library that may be in use by
an application.
Loading Miser dynamically
The format for a Windows module, either an executable or Dynamic-Link
Library (DLL) is defined by the Windows Portable Executable File Format.
Almost every Windows module imports functionality from some other module.
These dependencies are described by a pair of tables; the Import Name Table
and the Import Address Table. When a module is loaded into memory, the loader
will "snap" the addresses in the Import Address Table to the
entrypoint in the destination module. All calls to that function from the
module will be made indirectly through this cell.
When Miser is loaded into an application, it will go through each of the
modules in the application and redirect the entries in the Import Address
Table for each of the C Run-Time Library memory allocation functions to its
own entrypoint. This technique was first described by Matt Pietrek in
Peering
Inside the PE: A Tour of the Win32 Portable Executable File Format.
Since the modification of the IAT entry is a 32-bit write, it’s an atomic operation;
the IAT entry is either the C Run-Time Library address or the Miser address.
Once Miser has been loaded into a process, it can never be unloaded. While
Miser could restore the original values of the C Run-Time Library functions
in the IATs, the C Run-Time Library functions cannot handle memory allocated
by Miser.
Recognizing blocks not allocated by Miser
When Miser loads into the application, the application may have already
allocated memory. This is no way for Miser to replace those blocks with blocks
from its own resources. Since Miser redirects all calls to free()
and _msize() to itself, it must be able to recognize memory that has
been allocated by the C Run-Time Library and pass those calls to the appropriate
version of the C Run-Time Library.
The Windows OS provides a number of user-mode APIs to manage memory. One of the
most fundamental is VirtualAlloc() which allocates memory in 64K
blocks on 64K boundaries. Miser will subdivide these into 4K superblocks, each
of which starts on a 4K boundary. Not coincidentally, this is the page size for
the 32-bit x86 Windows OS. The first thing in each superblock is a header
which contains a "magic number" and other control information. So
by masking off the low 12 bits of a memory address, we should find the header
for the superblock containing the memory block. There are two checks to
validate a superblock.
-
The superblock header contains the correct "magic number."
-
The superblock header contains a pointer to the per-thread pool owning the
superblock and an index into an array of pointers to the superblocks owned
by the per-thread pool. The entry in the array should match the address of
the superblock.
If either of these tests fails, Miser will pass the call off to the Windows
C Run-Time Library.
An additional benefit of being able to detect memory allocated by the C Run-Time
Library and pass the request on is that Miser can pass any requests for more
than 256 bytes off to the C Run-Time Library.
Supporting multiple versions of the C Run-Time Library
Each module loaded into an application can be built against a different
version of the C Run-Time Library. In addition to the side-by-side support
built into the OS, recent versions of the C Run-Time Library have had their
version number in the DLL name. Miser supports all versions of the C Run-Time
Library since Visual Studio .NET 2002. Since each version of the C Run-Time
Library has its own heap, Miser keeps track of which version of the C Run-Time
Library it has hooked and will forward calls to the appropriate version.
Speed and thread-safety
Thread-safety and speed are intimately tied. The more the code needs to take
out a lock, the greater the chance that it will have to wait for some other
thread. For most operations, Miser does not need to lock; it can usually
satisfy a request from resources already available in the per-thread pool.
Miser must take out a lock in the following circumstances:
-
Accessing the global pool – only one thread may be allowed to access
anything in the global pool at any time. Any thread that attempts to access
the global pool must lock it first, serializing access.
-
Freeing a block allocated from another per-thread pool – Unlike Hoard,
Miser does not maintain a lock in the superblock header. Blocks to be freed
in some other per-thread pool are placed on that per-thread pool's remote free
queue using interlocked instructions. The remote free queue is drained before
attempting to allocate a block in the hopes that a freed block may satisfy the
request.
-
Supporting
_msize() for a block allocated by another thread
- The _msize() function allows a Windows application to query the
C Run-Time Library about the size of a block of memory. The value returned is
the size in bytes that was specified when the block was allocated. If the
block was allocated by Miser in the current thread, we can safely access the
information in a superblock owned by the per-thread pool. However if the
superblock is owned by another thread’s per-thread pool, then we must lock
the per-thread pool before we attempt to validate the superblock against
its superblock array.
-
Maintaining the list of superblocks – Each per-thread pool maintains an
array of the superblocks that it has allocated. The header for each superblock
has a pointer to the per-thread pool that owns it, as well as an index into the
per-thread pool superblock array. These are used to validate that a memory
address is from a block that Miser allocated. Because another thread may access
the array of superblocks to satisfy an
_msize() call, the array of
superblocks must be locked before it is updated.
One important thing to note is that while Miser supports Cilk++ well, it is
not tied in any way to the Cilk++ compiler or runtime. It should serve equally
well for a multithreaded application using any of the threading technologies
available.
Miser Limitations on Windows
-
Miser does not support any of the C Run-Time Library calls which
allocate memory on an address boundary.
-
Miser assumes program correctness. It does not put any padding between blocks
to attempt to detect writes beyond the bounds of the allocated block. Allocated
blocks are butted directly against each other; Miser keeps the allocation size
in a separate structure to allow it to maintain natural alignment in it's
bin arrays.
-
Because of its dependence on the Import Address Table to hook into a module,
Miser can only be used with modules that use the DLL forms of the C Run-Time
Library. Modules that link against the static C Run-Time Library do not go
through the IAT, so Miser cannot hook into them.
Performance
The following is an example of performance gains achievable with Miser, from an earlier blog post (Multicore-enabling the N-Queens Problem Using Cilk++).

Miser on Linux
Although Miser was initially written for Windows, since it performed better than Hoard in some of our tests, it was ported to Linux. The back-end code remained basically identical except for various system calls (e.g., changing calls to VirtualAlloc() to mmap(), etc.).
On the front-end, the mechanism for accessing the functions was rewritten. The Miser library exports malloc(), free(), calloc(), etc., and linking a binary with it will cause the loader to look for it at runtime. Additionally, preloading it by setting LD_LIBRARY_PATH causes objects to find Miser's allocators before those of the C runtime library. Miser still uses the system malloc() for large requests and it finds them in the default library using dlsym().
In Linux, if Miser is loaded after a module has resolved the location of allocator functions, Miser does not hook itself in. You can still access Miser's malloc() using dlsym() and getting "malloc" from the library but, as in Windows, memory allocated with Miser must be freed with Miser. The system free() doesn't know what to do with Miser memory.