SGen – Concurrent Mark

Introduction

Like most garbage collectors, SGen is of the stop-the-world variety, which means that when the time has come to perform its work, it stops all managed threads in their tracks and only resumes them once it has finished.

Nursery collections are usually relatively short affairs, and can be sped up even further by using a smaller nursery size. Major collections, on the other hand, can take a long time because they must scan every single live object in the system. Long garbage collection pauses are clearly a serious problem for many applications, be they interactive or on the server, and garbage collector engineers make considerable efforts to shorten pause times.

One approach to this is to do parts of the major collection concurrently with the running program, aka the mutator. SGen’s major collector, mark-and-sweep, consists of two main phases, unsurprisingly called mark and sweep.

Thanks to Zoltan’s recent work—lazy sweep—SGen already does most of sweep’s work outside of major collection pauses. I have been working on the other part of the story for the past few weeks: concurrent mark.

How it works

Our implementation of concurrent mark is very similar to that of Sun JVM’s concurrent mark-and-sweep (CMS) collector.

A major garbage collection starts with an initial marking pause, during which all roots are scanned.[1] Then it restarts the world and in the background a worker thread performs the marking and scanning of major heap objects, in the usual manner. When the concurrent marking phase is done, the collector stops the world again to finish up the collection in the final marking pause.

Dealing with concurrent mutation

Consider the following code, which manipulates a linked list, swapping the order between the two nodes following a:

1   b = a->next;
2   c = b->next;
3   a->next = c;
4   b->next = c->next;
5   c->next = b;

Let’s assume that right after a->next has been set in line 3, the concurrent worker thread scans and marks the node referenced by a, then follows next and processes c. If at that point our program hasn’t come around to setting c->next to b in line 5, the garbage collector will never encounter b—it will have been lost because for a brief moment that node wasn’t part of the list.

We can recruit the help of our friend the write barrier to deal with this issue, but we need to extend its scope a little: instead of just recording stores that point from the major heap to the nursery, it must now also record stores pointing from the major heap to the major heap. In other words: all reference stores that happen in the major heap, no matter where they point to, must be recorded.[2]

With this new write barrier the store to c->next will record the node c as requiring a re-scan. This re-scanning of all modified major heap objects happens in the final marking pause, so it’s not further disturbed by the mutator.

What if the final marking pause happens before c->next is set in line 5, however? The node b might not be pointed to by any object in the heap, so we won’t find it that way, yet it is still live. It’s live because it’s referenced by the variable b on the stack, so the final marking pause must also re-scan all roots.[3]

Nursery collections

Nursery collections that happen while concurrent mark is running stop the world, but concurrent mark keeps chugging along. It never follows references into the nursery anyway, so there is little potential for conflict.

There is an issue with the card table, however: nursery collections clear the card table, but its data is required during the final marking pause, which might only occur a few nursery collections later. The solution to this is to maintain a separate card table that accumulates the marked bits from the “regular” card table, and to use this union table in the final scan.

Results

This graph shows maximum pause times for a few of the programs in our GC benchmarks suite. All of the pause times are relative to Boehm’s, so values below 1 are better than Boehm’s. Note that the values correspond to the longest stop-the-world pause that occurred when running the program, not the median or some average.

The three SGen configurations are standard mark-and-sweep in blue, concurrent mark-and-sweep in green and concurrent mark-and-sweep in conjunction with cementing, in yellow.

The programs represented in this graph are those whose maximum pause times with Boehm were longer than 10ms. The ones with the longest pause times are on the left—on graph4 Boehm had a pause of 2.5s—which are those that benefit most from concurrent mark, as expected: the longest pause on graph4 is less than 180ms with concurrent mark. I have not yet investigated why graph8 is such an outlier.


  1. For concurrent mark the roots not only include the usual suspects like thread stacks and global variables, but also nursery objects. In fact, the initial marking pause performs a nursery collection, after which only pinned objects will be left in the nursery. Those are treated as roots, or rather the references they contain are.  ↩

  2. Since we use a single card table for both kinds of references, this will increase the number of cards marked that don’t have references pertinent to nursery collections. The card scanning overhead will therefore increase, but in practice this doesn’t seem to be a problem. An alternative would be to use two separate card tables, one for references to nursery objects and the other for references to major heap objects. That would be more expensive both in terms of memory usage as well as in write barrier performance.  ↩

  3. This entails doing a full mark-scan loop, of course, which might still take a very long time in pathological cases. Imagine a program that keeps a huge linked list around, referenced only by a single object on the major heap. At some point concurrent mark kicks in. The program now loads the reference to the list into a local variable and nulls the reference on the heap. The garbage collector will not be able to mark the list until it scans the stack in the final marking pause, and will now have to traverse the list with the world stopped.  ↩

Advertisements

SGen – The Write Barrier

Introduction

In previous installments of my blog post series on SGen, Mono’s generational garbage collector, I wrote about the major and minor heaps and their organizations. What I have for the most part neglected to pontificate on is how they interact with each other.

The advantage of generational garbage collectors comes from being able to collect the minor heap, also referred to as the nursery, very quickly because it doesn’t have to look at the objects in the major heap. There is a problem, however, which this illustration from my talk on SGen explains:

Let’s say we collect the nursery, starting from the single root reference. We find a graph of four live objects, colored blue, and promote them to the major heap, discarding all the others, including the three white ones. But wait—there’s a reference in some object on the major heap pointing to one of the objects we have just destroyed! This cannot stand.

Where did this reference come from? Consider the previous nursery collection. We’ll ignore pinning for now, so after that collection the nursery was empty. The mutator (that’s what garbage collector people affectionately call the pesky thing that insists on messing with the heap and that most people refer to as the program) allocated that white object, and stored a reference to it in an object on the major heap—in that order, but not necessarily in direct succession.

One way to handle those references is to scan the whole major heap for them and to treat them as roots. That would make nursery collections as slow as major collections, so it’s not a good idea. What’s needed is a way to record all such references happening between nursery collections. The nursery collector would then process them like ordinary roots and everything would be in order again.

Write barriers

Write barriers[1] do exactly that: On every store of a reference in an object they check whether the reference resides in the major heap and points to the nursery. If those conditions are satisfied, the reference is recorded for the benefit of the next nursery collection. This sounds like an awful overhead for each store, but it really isn’t, and for almost all applications it’s far outweighed by the advantages from generational collection.[2]

A note on terminology: The literature sometimes uses “write barrier” to refer to just the code that identifies references to remember, and sometimes to refer to both the code that identifies as well as records those references. I shall be using the term in the latter sense.

SGen’s write barrier is a so-called card table. We also implement an alternative, a sequential store buffer, which I will not discuss here.

Card tables

The basic idea is to conceptually divide the major heap into small, equally sized regions, called cards, and to have one bit per card that is set—the card is marked—when a reference to a nursery object is written to it.[3]

The advantages of this scheme are that it requires only a handful of instructions to record a store, so it can be inlined, and that its space overhead is small and proportional to the heap size. The disadvantage is that it doesn’t record the exact locations of the references we’re interested in.

In SGen’s case, cards are 512 bytes large, so the nursery collector needs to scan roughly that amount of memory to potentially locate a single reference.[4] The references it does find it treats like ordinary roots: the referenced objects are promoted to the major heap and the references are updated to reflect their new locations. The newly promoted objects are then scanned in turn, of course.

Once all marked cards have been scanned there are no more references from the major heap to the nursery, so all cards mark bits are cleared, to be ready for the next round.

Pinned objects

So far we have been assuming that when we encounter a reference from the major heap to the nursery, we can always move the nursery object to the major heap, thereby rendering the bothersome reference harmless.

As I explained in the blog post on the nursery, sometimes objects become pinned, which implies that they cannot be moved. This threatens to confound our cunning plan: we can only clear the card mark bits because there are no references left that we need to keep track of. The solution here is simple: Cards containing references to pinned objects must stay marked so that they will be processed again during the next nursery collection, in the hopes that the object referred to will no longer be pinned.

This can lead to another potential problem: Assume that some nursery object remains pinned, and that there are lots of objects in the major heap that refer to it.[5] Lots of cards would remain perpetually marked and nursery collections would become dominated by card scanning. That is in fact what we noticed on two benchmarks in our small GC benchmark suite—nursery collections took much longer than expected and the programs took twice as long or more to run with SGen than with Boehm.

Cementing

I just implemented a new feature—not yet pushed to master—that solves this new problem in a very simple way. I call it “cementing”, because it institutionalizes bothersome pinned objects: if we encounter a certain number of references from the major heap to a particular pinned nursery object, we just give up on it. Instead of hoping that it will become unpinned soon, we cement it—the punishment for badly behaved pinned objects. A cemented object will always be pinned at the start of future nursery collections as a matter of principle. The consequences of this are that it will be treated as a root, and that it will not be moved, which in turn means that we don’t have to worry about references to it from the major heap any more: the object is kept alive and the references can stay the same. Cards with references to cemented objects can be cleared, reducing the card scanning load.

Major collections give us a blank slate because we need to scan all live major heap objects anyway, so we use them as opportunities to reset cementing, giving those poor objects another chance after all.

The preliminary results from cementing are encouraging. It doesn’t seem to introduce performance regressions, and for the two benchmarks affected, it works wonders:

The red bars represent the wall clock run time with Boehm. The green bars are SGen run times without cementing, and blue are SGen with cementing.


  1. There is at least one other meaning for the term “write barrier”, namely a machine instruction (or series of instructions) that ensures that all stores preceding it are seen by other threads before the stores following it. Those are not the barriers we’re looking for.  ↩

  2. An alternative way of solving the reference problem is a read barrier. Read barriers make it possible to write moving concurrent collectors, a magnificent example of which is Azul’s C4 collector. If you want a glimpse of the future of garbage collection, read the paper.  ↩

  3. SGen uses a whole byte per card for marking. Not only can whole bytes be written easier than single bits, but multiple threads can do so without conflicts between contiguous ones. If we used individual bits, we’d have to use CAS to avoid losing marks by other threads.  ↩

  4. Nitpickers will notice that a marked card might not even contain a single reference to a nursery object—the mutator can write a reference, hence mark the card, and then later overwrite it with null, for example.  ↩

  5. A simple way to trigger this is to allocate some object high up on the call stack, keep a reference to it on the stack, and then to construct a long list, with every node containing a reference to said object. Keep the list around and its nodes will eventually be promoted, staining the major heap with perpetually marked cards.  ↩