SGen – The Write Barrier

Introduction

In previous installments of my blog post series on SGen, Mono’s generational garbage collector, I wrote about the major and minor heaps and their organizations. What I have for the most part neglected to pontificate on is how they interact with each other.

The advantage of generational garbage collectors comes from being able to collect the minor heap, also referred to as the nursery, very quickly because it doesn’t have to look at the objects in the major heap. There is a problem, however, which this illustration from my talk on SGen explains:

Let’s say we collect the nursery, starting from the single root reference. We find a graph of four live objects, colored blue, and promote them to the major heap, discarding all the others, including the three white ones. But wait—there’s a reference in some object on the major heap pointing to one of the objects we have just destroyed! This cannot stand.

Where did this reference come from? Consider the previous nursery collection. We’ll ignore pinning for now, so after that collection the nursery was empty. The mutator (that’s what garbage collector people affectionately call the pesky thing that insists on messing with the heap and that most people refer to as the program) allocated that white object, and stored a reference to it in an object on the major heap—in that order, but not necessarily in direct succession.

One way to handle those references is to scan the whole major heap for them and to treat them as roots. That would make nursery collections as slow as major collections, so it’s not a good idea. What’s needed is a way to record all such references happening between nursery collections. The nursery collector would then process them like ordinary roots and everything would be in order again.

Write barriers

Write barriers[1] do exactly that: On every store of a reference in an object they check whether the reference resides in the major heap and points to the nursery. If those conditions are satisfied, the reference is recorded for the benefit of the next nursery collection. This sounds like an awful overhead for each store, but it really isn’t, and for almost all applications it’s far outweighed by the advantages from generational collection.[2]

A note on terminology: The literature sometimes uses “write barrier” to refer to just the code that identifies references to remember, and sometimes to both the code that identifies those references and the code that records them. I shall be using the term in the latter sense.

SGen’s write barrier is a so-called card table. We also implement an alternative, a sequential store buffer, which I will not discuss here.

Card tables

The basic idea is to conceptually divide the major heap into small, equally sized regions, called cards, and to have one bit per card that is set—the card is marked—when a reference to a nursery object is written to it.[3]

The advantages of this scheme are that it requires only a handful of instructions to record a store, so it can be inlined, and that its space overhead is small and proportional to the heap size. The disadvantage is that it doesn’t record the exact locations of the references we’re interested in.
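To make this concrete, here is a minimal sketch of what the inlined barrier might look like; the helper predicate and the card-table indexing are illustrative assumptions, not SGen’s actual code:

#include <stddef.h>

#define CARD_BITS 9                    /* 512-byte cards, see below */

extern unsigned char *card_table;      /* one byte per card, see footnote 3 */
extern int ptr_in_nursery (void *p);   /* hypothetical helper */

static void
write_barrier_store (void **slot, void *value)
{
        *slot = value;
        if (!ptr_in_nursery (slot) && ptr_in_nursery (value))
                card_table [(size_t) slot >> CARD_BITS] = 1;
}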

In SGen’s case, cards are 512 bytes large, so the nursery collector needs to scan roughly that amount of memory to potentially locate a single reference.[4] The references it does find it treats like ordinary roots: the referenced objects are promoted to the major heap and the references are updated to reflect their new locations. The newly promoted objects are then scanned in turn, of course.

Once all marked cards have been scanned there are no more references from the major heap to the nursery, so all card mark bits are cleared, to be ready for the next round.
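Sketched in code, the scanning pass might look roughly like this; the object-iteration helpers are hypothetical, and this version ignores pinning, which the next section deals with:

#include <stddef.h>

#define CARD_SIZE 512

extern unsigned char *card_table;
extern size_t num_cards;
extern char *major_heap_start;
/* hypothetical helpers: */
extern char *first_object_in_card (char *card_start);
extern char *next_object (char *obj);
extern void scan_object_for_nursery_references (char *obj);

static void
scan_marked_cards (void)
{
        size_t i;
        for (i = 0; i < num_cards; ++i) {
                char *card_start, *obj;
                if (!card_table [i])
                        continue;
                card_start = major_heap_start + i * CARD_SIZE;
                /* walk the objects overlapping this card */
                for (obj = first_object_in_card (card_start);
                     obj && obj < card_start + CARD_SIZE;
                     obj = next_object (obj))
                        scan_object_for_nursery_references (obj); /* promote and update */
                card_table [i] = 0;     /* nothing left to track (ignoring pinning) */
        }
}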

Pinned objects

So far we have been assuming that when we encounter a reference from the major heap to the nursery, we can always move the nursery object to the major heap, thereby rendering the bothersome reference harmless.

As I explained in the blog post on the nursery, sometimes objects become pinned, which implies that they cannot be moved. This threatens to confound our cunning plan: we may only clear a card’s mark bit if no references remain in it that we need to keep track of. The solution here is simple: Cards containing references to pinned objects must stay marked so that they will be processed again during the next nursery collection, in the hope that the objects referred to will no longer be pinned.

This can lead to another potential problem: Assume that some nursery object remains pinned, and that there are lots of objects in the major heap that refer to it.[5] Lots of cards would remain perpetually marked and nursery collections would become dominated by card scanning. That is in fact what we noticed on two benchmarks in our small GC benchmark suite—nursery collections took much longer than expected and the programs took twice as long or more to run with SGen than with Boehm.

Cementing

I just implemented a new feature—not yet pushed to master—that solves this new problem in a very simple way. I call it “cementing”, because it institutionalizes bothersome pinned objects: if we encounter a certain number of references from the major heap to a particular pinned nursery object, we just give up on it. Instead of hoping that it will become unpinned soon, we cement it—the punishment for badly behaved pinned objects. A cemented object will always be pinned at the start of future nursery collections as a matter of principle. The consequences of this are that it will be treated as a root, and that it will not be moved, which in turn means that we don’t have to worry about references to it from the major heap any more: the object is kept alive and the references can stay the same. Cards with references to cemented objects can be cleared, reducing the card scanning load.
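In sketch form, the bookkeeping could look like the following; the threshold value and all names are made up for illustration:

#define CEMENT_THRESHOLD 1000   /* made-up value */

typedef struct {
        void *object;   /* a pinned nursery object */
        int hits;       /* references to it seen during card scanning */
        int cemented;
} CementRecord;

static void
cement_notice_reference (CementRecord *rec)
{
        if (rec->cemented)
                return;
        if (++rec->hits >= CEMENT_THRESHOLD)
                rec->cemented = 1;      /* give up: always pin it, treat it as a root,
                                           and clear cards that refer to it */
}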

Major collections give us a blank slate because we need to scan all live major heap objects anyway, so we use them as opportunities to reset cementing, giving those poor objects another chance after all.

The preliminary results from cementing are encouraging. It doesn’t seem to introduce performance regressions, and for the two benchmarks affected, it works wonders:

The red bars represent the wall clock run time with Boehm. The green bars are SGen run times without cementing, and blue are SGen with cementing.


  1. There is at least one other meaning for the term “write barrier”, namely a machine instruction (or series of instructions) that ensures that all stores preceding it are seen by other threads before the stores following it. Those are not the barriers we’re looking for.  ↩

  2. An alternative way of solving the reference problem is a read barrier. Read barriers make it possible to write moving concurrent collectors, a magnificent example of which is Azul’s C4 collector. If you want a glimpse of the future of garbage collection, read the paper.  ↩

  3. SGen uses a whole byte per card for marking. Not only can whole bytes be written more easily than single bits, but multiple threads can mark adjacent cards without conflicting with one another. If we used individual bits, we’d have to use CAS to avoid losing marks made by other threads.  ↩

  4. Nitpickers will notice that a marked card might not even contain a single reference to a nursery object—the mutator can write a reference, hence mark the card, and then later overwrite it with null, for example.  ↩

  5. A simple way to trigger this is to allocate some object high up on the call stack, keep a reference to it on the stack, and then to construct a long list, with every node containing a reference to said object. Keep the list around and its nodes will eventually be promoted, staining the major heap with perpetually marked cards.  ↩

Flame Graphs for Instruments

Flame graphs are a great way to visualize stack-trace based profiling data. Brendan Gregg’s scripts for generating these graphs accept, among other options, output produced by DTrace. Unfortunately, OS X’s DTrace, when accessed through the command line, contains a bug that prevents it from producing stack traces in most cases. Xcode Instruments, on the other hand, is not compromised in this way, but it does not produce output similar to dtrace’s.

What it does provide is an export functionality which generates CSV files containing the exact contents of the details pane, so I wrote a script to do the intermediary work.

https://i1.wp.com/www.complang.tuwien.ac.at/schani/blog/sgen/instruments-fg.jpg

To produce the CSV file, expand all nodes in the call tree (Option-click all the roots) and use the Instrument -> Export Track menu item. Then convert the CSV to the format Brendan’s script understands and feed the result to flamegraph.pl:

./stackcollapse-instruments.pl out.csv | ./flamegraph.pl >mygraph.svg

flamegraph.png

For an alternative view you can invert the call tree in Instruments and you’ll get a correspondingly inverted flame graph, i.e., the flames go up from callee to caller, as opposed to the other way around when the call tree is not inverted.

SGen and DTrace

Introduction

SGen is Mono’s generational garbage collector. DTrace is a programmable dynamic tracing framework and is available for OS X and Solaris, among others.

DTrace provides an interface for applications to define their own probes which can then be observed, and actions triggered. We have recently added a few probes to SGen and I’d like to show a few simple things that can be done with them.

The program I’m using to demonstrate DTrace here is IronPython, running pystone. SGen’s DTrace probes are currently only available on OS X, on mono master.

Garbage collection times

How long do garbage collections take for a specific workload?

mono$target:::gc-begin {
        self->ts = timestamp;
}

mono$target:::gc-end {
        @[arg0] = quantize ((timestamp - self->ts) / 1000);
}

When a garbage collection starts, we save the current time, and when it ends we gather the time elapsed since then in microseconds. The results:

    0
value  ------------- Distribution ------------- count
    4 |                                         0
    8 |                                         1
   16 |                                         0
   32 |                                         1
   64 |                                         1
  128 |                                         0
  256 |                                         0
  512 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    90
 1024 |                                         1
 2048 |                                         0
 4096 |                                         0
 8192 |@                                        3
16384 |                                         1
32768 |                                         0

     1
value  ------------- Distribution ------------- count
   16 |                                         0
   32 |@@@@@@@@@@@@@                            1
   64 |                                         0
  128 |                                         0
  256 |                                         0
  512 |                                         0
 1024 |                                         0
 2048 |                                         0
 4096 |                                         0
 8192 |                                         0
16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              2
32768 |                                         0

The graph for 0 shows collection times for the nursery, the graph for 1 shows times for the major collections. We can see here that the majority of nursery collections took about half a millisecond, with 4 outliers taking around 10 ms. There were 3 major collections, two of which took around 16 ms and the third one barely took any time at all.

Apple’s Instruments profiling application gathers data using DTrace and can be extended with custom DTrace scripts. Here’s what a visualization of SGen collector pause times looks like:

https://i2.wp.com/www.complang.tuwien.ac.at/schani/blog/sgen/instruments.jpg

The dots lower down in the scatter diagram are likely nursery collections, the higher up ones major collections.

GC lock

Not counting garbage collections, how often and for how long is the GC lock held?

mono*:::gc-locked {
        self->ts = timestamp;
}

mono*:::gc-begin {
        @ = quantize ((timestamp - self->ts) / 1000);
}

mono*:::gc-end {
        self->ts = timestamp;
}

mono*:::gc-unlocked {
        @ = quantize ((timestamp - self->ts) / 1000);
}

The GC lock is taken shortly before collections start, and released shortly after they finish. It’s also taken in a variety of other cases. Naturally we want to avoid holding it for long.

value  ------------- Distribution ------------- count
    0 |                                         0
    1 |@@@@@@@@@@@@@@@@@@@@@@@@@                5462
    2 |@@@@@@@@@@@@@                            2789
    4 |                                         97
    8 |                                         49
   16 |@                                        192
   32 |                                         12
   64 |                                         2
  128 |                                         0
  256 |                                         0
  512 |                                         0
 1024 |                                         1
 2048 |                                         0

SGen takes the lock quite often, but apart from one outlier at about 1 millisecond, it’s released very quickly. Further analysis (not presented here) shows that the outlier is a result of thread-pool initialization during startup, when a few objects are allocated pinned, requiring allocating some memory from the operating system.

Objects allocated

How many objects of which classes are allocated?

mono*:::gc-nursery-obj-alloc {
        @[strjoin (copyinstr (arg2), strjoin (".", copyinstr (arg3)))] = count ();
}

mono*:::gc-major-obj-alloc-large {
        @[strjoin (copyinstr (arg2), strjoin (".", copyinstr (arg3)))] = count ();
}

mono*:::gc-major-obj-alloc-pinned {
        @[strjoin (copyinstr (arg2), strjoin (".", copyinstr (arg3)))] = count ();
}

mono*:::gc-major-obj-alloc-degraded {
        @[strjoin (copyinstr (arg2), strjoin (".", copyinstr (arg3)))] = count ();
}

mono*:::gc-major-obj-alloc-mature {
        @[strjoin (copyinstr (arg2), strjoin (".", copyinstr (arg3)))] = count ();
}

SGen has probes for five different kinds of object allocation, hence the repetition. User-space pointers to the namespace and class name are passed in the third and fourth argument, respectively, so we concatenate those with a dot in the middle to give the full class name.

Whenever possible SGen uses the managed allocator – a small managed piece of code to quickly allocate objects in the nursery, which handles the most common case. The managed allocator is not instrumented with a DTrace probe, however, so for this script to function properly the managed allocator must be turned off via the MONO_GC_DEBUG environment variable.

...
IronPython.Runtime.PythonDictionary                          500014
IronPython.Runtime.List                                      500067
IronPython.Runtime.Calls.Method                             1000002
System.Object[]                                             1505064
System.Int32                                               15499015

IronPython boxes lots of integers.

Objects pinned in the nursery

How many objects of which types are pinned during nursery collections?

mono$target:::gc-obj-pinned /arg4==0/ {
        @[strjoin (copyinstr (arg2), strjoin (".", copyinstr (arg3)))] = count ();
}

This is very similar to the preceding script, but we use the pinned object probe and a guard to select for the nursery, as opposed to the whole heap.

...
Microsoft.Scripting.Actions.DynamicSiteTarget`3                  863
IronPython.Runtime.Calls.CallTarget1                             864
System.Object[]                                                 1271
Microsoft.Scripting.Actions.CallSite`1                          1350
System.String[]                                                 1440

Here we see the pin counts for the 5 most pinned classes during nursery collections. Combining all nursery collections, for example, 1440 string arrays were pinned.

What if we’d like to know how many bytes of memory are pinned in the nursery during each collection?

mono$target:::gc-begin /arg0==0/ {
        bytes = 0;
}

mono$target:::gc-obj-pinned /arg4==0/ {
        bytes += arg1;
}

mono$target:::gc-end /arg0==0/ {
        @ = quantize (bytes);
}

All we have to do is add up the sizes of all pinned objects for nursery collections.

value  ------------- Distribution ------------- count
   32 |                                         0
   64 |                                         1
  128 |@                                        2
  256 |                                         0
  512 |                                         0
 1024 |                                         0
 2048 |                                         0
 4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  95
 8192 |                                         0

This looks a bit suspicious. We had about 100 nursery collections, so according to the object counts above there were on average 13 object arrays and 14 string arrays pinned per nursery collection, in addition to lots of other objects. All of those supposedly amount to only about 4 kilobytes of memory. Are we counting something wrong or are these very short arrays? Let’s see:

mono$target:::gc-obj-pinned /arg4==0/ {
        @[strjoin (copyinstr (arg2), strjoin (".", copyinstr (arg3)))] = quantize (arg1);
}

This gathers object size histograms per class for all objects pinned in the nursery. These are the parts of the output giving the counts for the arrays we’re interested in:

System.String[]
         value  ------------- Distribution ------------- count
             8 |                                         0
            16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    1344
            32 |@@@                                      96
            64 |                                         0

System.Object[]
         value  ------------- Distribution ------------- count
             8 |                                         0
            16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              870
            32 |@@@@@@@@@@@@@                            409
            64 |                                         0

Apparently those arrays really are very short, all around 16 to 32 bytes long. But maybe the probe is lying to us about the object size? If we know about Mono’s internal object layout we can read out the array length directly from the heap[1]:

mono$target:::gc-obj-pinned /arg4 == 0 && copyinstr (arg3) == "Object[]"/ {
        this->l = *(int*)copyin (arg0 + 12, 4);
        @[arg1] = lquantize (this->l, 0, 8);
}

On 32-bit Mono the length of an array is at offset 12, so we read 4 bytes from there and cast it to an int. We get a few graphs like these:

    16
value  ------------- Distribution ------------- count
  < 0 |                                         0
    0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 96
    1 |                                         0

    28
value  ------------- Distribution ------------- count
    2 |                                         0
    3 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2
    4 |                                         0

    20
value  ------------- Distribution ------------- count
    0 |                                         0
    1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 481
    2 |                                         0

Each of them gives us the distribution of the array lengths for array objects of specific sizes. If everything is correct then for each array object size there must only be one array length, which is exactly what we get. For example, all 481 array objects of size 20 bytes have a length of one.

In fact, the numbers show that the sizes we get are exactly what we should expect given Mono’s object layout on 32-bit: a reference array of length n has a size of 16 + 4*n bytes.

Footnotes:

1. Don’t try this at home!

The Schism

I’ve been a very bad blogger recently. Part of the reason is certainly laziness, but there’s also the fact that I have felt more motivated to write about photography than about software lately, whereas this blog has become focused almost exclusively on software, so those posts didn’t feel like they would fit in anymore.

For this reason I have decided to start a dedicated photography blog, creatively named Mark’s Photography Blog. I’ve started it off with a post on 10 of my favorite images from 2011.

What that means for this blog is that, for the rare occasions on which I will post, I shall focus it almost exclusively on software, which is also reflected in its new awesome title. I’ll also try to overcome my aversion to writing smaller, less epic posts, which will hopefully also result in a higher posting frequency.

On the more epic side of things, I still owe you one or two more posts on SGen, and I’d like to write a post on the lock-free allocator we’re now using in SGen, but which is independent of it and is pretty cool. What can it be used for? Dynamic memory allocation in signal handlers, for example.

The ICFP Programming Contest 2011

Introduction

Last weekend saw yet another iteration of the ICFP Programming Contest, an event the glorious team Funktion im Kopf der Mensch could not miss.

The task

The task this year was to build a program that would play a card game, Lambda The Gathering, to be run on the contest organizers’ machines. The game is played by two players, taking turns. In each round, a player must play one out of several kinds of cards, for each of which there is an infinite supply. Each player owns 256 slots, each of which has some “vitality”, initially 10000. If a slot’s vitality drops to zero it is dead and cannot be used anymore (unless it is explicitly revived). The object of the game is to kill all the opponent’s slots.

Apart from its vitality, a slot also holds a value, which is where the real action happens. A value can be either a number or a function. The cards in the game also represent functions, and the number 0. When a card is played, it must be played to a specific slot, meaning that the card’s value and the slot’s value are combined. The combination can be either that the slot’s value is applied to the card (right) or the other way around (left). Initially, each slot holds the identity function, I.

As an example, let’s say we want to generate the number 4 in a slot. Initially it will hold I, so we play zero right, giving I(zero), which is reduced to 0. Now we play the function succ, which takes a number and returns that number plus one, left, giving succ(0), which reduces to 1. In each of the next two turns we play the card dbl left. dbl is a function that doubles a number, so we end up with 4.

To actually do something, like attacking your opponent’s slots, there are cards whose functions have side effects. One of them is attack, which you supply with three numbers (it’s a curried function). The first is the number of one of your slots, the second the number of one of your opponent’s slots, and the third is the “strength” of the attack. Once the function has all of its arguments it will subtract the third number from your slot’s vitality and then subtract nine tenths of that number from the vitality of your opponent’s slot. There is another function, help, which you can use to top up your vitality, so if you first help and then attack with the right strengths you will have damaged your opponent without having lost any vitality yourself.

Also among the cards in the game are S, K and I, representing the homonymous functions of the SKI combinator calculus. These three simple functions, together with function application, are Turing complete, which means that it is possible to build programs in slots.

https://i0.wp.com/www.complang.tuwien.ac.at/schani/blog/icfp2011/L1006883.jpg

Strategy

Actual programs can potentially damage the opponent much quicker than would be possible by building each attack by playing the respective cards anew. (Note that I’m talking about the programs (functions) in the game’s slots now, not about the program that plays the card game).

There is a limit to how long a slot’s function is allowed to “execute”, however, which is 1000 function applications. If it has not finished by then, execution is stopped and the slot’s value is reset to I. Any side effects stay, however, so an endless attack loop would do damage, though only a limited amount.

There is a good reason for not using endless loops, however, namely that, since the slot’s value is reset, the function is “lost”, so it has to be copied to another slot first (which is possible with the get function, but somewhat costly, since the function’s slot number has to be built first) or, even worse, generated again.

A better strategy is to use a function that does some more or less fixed amount of damage and then returns itself, or some slightly different variant of itself. The actual result of the function will, after all, be the new value of the slot. So, suppose we could build a function like this:

function make_killer (slot) {
  return function (dummy) {
    kill_opponent_slot (slot);
    return make_killer (slot + 1);
  }
}

If we generate this function and give it the number 0, it will result in a function that, given a dummy argument, will kill the opponent’s slot 0 and return a function that when called with a dummy argument will kill the opponent’s slot 1 and return a function that when called with a dummy argument etc. In other words, once we have this function we can kill one enemy slot per turn. In comparison, just launching a single attack without building more complex functions requires more than 50 turns, and that’s not sufficient to kill an enemy’s slot. If you factor in the help calls to top up your own vitality so as not to die quicker than the enemy you’re looking at more than 200 rounds just to kill a single slot of the enemy (assuming it has the initial vitality of 10000).

Feasibility

Unfortunately, it’s not completely trivial to write a function like this in SKI calculus. In fact, compared to SKI, x86 machine code looks incredibly high level. Fortunately, there is help. As luck would have it, I wrote a compiler from a Scheme subset to Unlambda many years ago. Scheme is essentially Lambda calculus and Unlambda is essentially SKI. In other words, I was already a bit familiar with both SKI and the compilation algorithm (which is quite simple).

Here is the function from above in very simple Scheme:

((lambda (x) (x x))
 (lambda (self)
   (lambda (slot)
     (lambda (dummy)
       (kill-opponent-slot slot)
       ((self self) (succ slot))))))

Since there are no function names in SKI or pure Lambda calculus we have to achieve self-reference using a trick, namely the function (lambda (x) (x x)), which applies its argument to itself, so if you pass it your function then that function gets itself as its (first) argument, in this case named self. Note that I am assuming that we have sequential execution available, to first kill the opponent’s slot and then return the function for the next slot. This is not natively available in Lambda calculus, which doesn’t even deal with side effects, but can be implemented like so:

((lambda (dummy)
   do-this-afterwards)
 do-this-first)

How easily can we generate an attack function like the one above in a slot? The first step is to translate Lambda to SKI calculus. The simple algorithm (given in the task description) produces ridiculously huge programs. The problem is that since SKI calculus doesn’t know named function arguments the arguments must be passed downwards “manually” from where they originated. The simple algorithm passes them down all the way to all the leaves of the syntax tree, where they might or might not be needed. By cutting off the passing down at subtrees which don’t need the argument a significant reduction is achieved, but program size is still an issue, and depends a lot on how deeply nested the original function is.
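For reference, here is a sketch of the standard abstraction algorithm for a variable x; the middle rule is the cut-off just described, without which x would be distributed via S all the way down to every leaf:

[x] x      = I
[x] M      = K M                    (if x does not occur free in M)
[x] (M N)  = S ([x] M) ([x] N)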

Now that we have the function in SKI form, can we trivially play cards to summon it to a slot? It turns out that this is only possible in simple cases because a player’s turn can only apply a card left or right, so nesting on both sides of an application is not trivially possible. For example, it’s easy to generate S(S(SK)), starting from I, by applying K right and the S left three times. S((SS)K) also works: S right, S left, K right and S left. However, (SS)(KK) is not doable that easily.

There is a trick, though: Assume we already have some arbitrarily complex x, and we want x(MN). If we apply K left, S left and then M and N right we end up with ((S(Kx))M)N which, thanks to the way the S and K combinators work, reduces to x(MN).

This trick still only works if M and N are single cards. It can be used recursively however, and, using yet another similar trick, arbitrarily nested right sides can be produced, but at a great increase in the number of cards that have to be played. Before we figured out those tricks we came across a simpler general solution, using the card get, which, when given a slot number, reduces to the value that’s in that slot. If we use get and zero as the cards M and N in the trick above, the function will reduce to xy, with y being whatever is in slot 0. The generation algorithm built using this trick has to use more than one slot, and it depends vitally on slot 0, but it is reasonably simple and the overhead is small.

It looks as if we have all the pieces of the puzzle together now and can finally generate a function that does some serious damage quickly, but there is another problem. Let’s say we want to generate a function which, when given a dummy argument, does the side effect of inc zero (the details of what inc does are not important here). In Scheme, that is (lambda (d) (inc zero)). Translated to SKI calculus it is K(inc zero). To generate it in a slot (assuming it holds I) we apply zero right, then inc and K left. After having applied the inc the slot’s function is inc zero, though, which is immediately reduced, thereby doing the side effect and giving as a result the function I, so after applying K left we end up with K(I) and a side effect we only wanted after giving the dummy argument.

Clearly, we need a way to “guard” side effects against immediate execution. The easiest way we found to do this is to build an equivalent function, but in a different way. From a completely functional point of view, i.e. ignoring possible side effects, the functions

K(inc zero)

and

(S(K inc))(K zero)

are the same – they both take an argument that is ignored and reduce to inc zero, but in the latter form the inc zero can only be reduced once the dummy argument is given. Let’s call such a function a “side effect function” – it requires a dummy argument and only once that is present will it produce a side effect.

What if we want to combine two such side effect functions, resulting in a new side effect function that does one first, then the other? It seems tempting to just apply the second to the first, like this:

(lambda (d)
  (second-se-fn (first-se-fn d)))

The error in this approach is that even though (first-se-fn d) can only be reduced once the argument d is present, it is still, in terms of the SKI calculus, a function, and therefore a value, so second-se-fn can be reduced, even without d. What we need instead is a function that takes an argument and then applies first one function to that argument and then a second one. As luck would have it, that function is S. So, to combine two side effect functions, yielding a new side effect function, we have

((S first-se-fn) second-se-fn)

https://i2.wp.com/www.complang.tuwien.ac.at/schani/blog/icfp2011/L1006887.jpg

More strategy

With all of that in place we can come back to our self-returning slot-increasing attack function. As it turns out, passing around that extra slot number argument is quite an overhead when everything is translated down to SKI calculus, but we had already thought of a simpler solution. Instead of having the slot number inside the function we can use indirection: We store the number of the slot to be attacked in some other slot and use the get function to retrieve it from within the attack function. The disadvantage of this approach is that we need to increment the slot number after each attack, costing us one extra turn, but on the other hand we can switch to another slot to attack reasonably quickly.

After some optimizations, some of them directly on the generated SKI code, we managed to get the number of turns required to load this function down to below 180. After that we just have to load zero into a slot and we can kill one enemy slot every other turn.

One small detail I omitted above is that when attacking a slot the number you give will actually be inverted, i.e. giving 0 will attack slot 255, and vice versa. That makes it easy to attack high slot numbers but more difficult to attack lower ones, because higher numbers take more time to generate. Nonetheless, our bot doesn’t actually start its attack from the highest slot, as would be easiest. It first figures out which opponent slot holds the most complex function (using the trivial metric of counting the leaves of the syntax tree), attacks that first, then increments until it hits slot 0, and then starts again from the highest slot, hopefully finishing off the opponent.

We did not have much time left to refine our bot beyond that. There are very few measures we take to defend ourselves. If one of our few vital slots is killed while we are generating our attack function, for example, we just revive the slot and then start over with generating the function. A smarter bot would try to salvage the pieces left in the other slots. If the slot we take the vitality from to launch attacks is killed, we’re dead in the water – we had code to help it get back into working condition, but we failed at the last minute to integrate it.

Overall, I think we’re doing quite well. We launch our first attack within about 200 rounds and can finish off a bot that doesn’t defend itself within a bit more than 500 turns after that. Against the best teams, however, we’re helpless – they can kill all 256 of our slots stone dead in less than 190 rounds, starting from scratch. Kudos!

https://i1.wp.com/www.complang.tuwien.ac.at/schani/blog/icfp2011/ltgvis.jpg

Contest organization

In the past I have taken contest organizers to task for screwing up in minor and major ways. This year, I am happy to say, there was almost nothing to find fault with. The task was clear, not artificially obscured, many-layered, interesting and enjoyable. There was a test submission server but teams didn’t depend on it and it worked quite well in any case.

So, from Team Funktion im Kopf der Mensch a huge Thank You to the organizers for this amazing job! May all future contests be organized this well.

Programming languages

Last year was the first year we used Clojure after using OCaml almost exclusively in the years before. This time we stepped back a bit and used OCaml for the actual bot, i.e. the program that will run on the organizer’s servers. The main reasons for this were that OCaml has a much smaller footprint and more predictable performance than Clojure and we wanted to make sure we didn’t hit against any limits, and that we felt somewhat safer writing an unsupervised program when protected by OCaml’s type system.

We still used Clojure for developing the attack functions, starting from Lambda calculus, going down to SKI and then generating a sequence of card application.

Our visualizer was written in C.

All of these languages did their respective jobs well. No complaints this time.

The Code

All the code we wrote for this year’s contest is available on GitHub. If you need any assistance with it, please email me.

SGen – Finalization and Weak References

Introduction

In this installment of my series on SGen, Mono’s new garbage collector, we shall be looking at how finalizers and weak references are implemented, and why you (almost certainly) should not use finalizers.

Tracking object lifetime

Both finalizers and weak references need to track the lifetime of certain objects in order to take an action when those objects become unreachable. To that end SGen keeps lists of finalizable objects and weak references which it checks against at the end of every collection.

If the object referred to by a weak reference has become unreachable, the weak reference is nulled.

If a finalizable object is deemed unreachable by the collector, it is put onto the finalization queue and it is marked, since it must be kept alive until the finalizer has run. Of course, all objects that it references have to be marked as well, so the main collection loop is activated again.
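A rough sketch of this end-of-collection check, with illustrative names rather than SGen’s actual API:

extern int object_is_marked (void *obj);
extern void mark_object (void *obj);
extern void finalize_enqueue (void *obj);
extern void gray_stack_push (void *obj);

typedef struct FinEntry {
        void *object;
        struct FinEntry *next;
} FinEntry;

static void
check_finalizable_objects (FinEntry *list)
{
        FinEntry *e;
        for (e = list; e; e = e->next) {
                if (object_is_marked (e->object))
                        continue;               /* still reachable, nothing to do */
                finalize_enqueue (e->object);
                mark_object (e->object);        /* must survive until its finalizer runs */
                gray_stack_push (e->object);    /* everything it references survives, too */
        }
}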

Generations

A nursery collection cannot collect an object in the major heap, so it is not necessary to check the status of old objects after a nursery collection. That is why SGen keeps separate finalization and weak reference lists for the nursery and major heaps.

Invoking finalizers

SGen uses a dedicated thread for invoking finalizers. The finalization queue is processed one object at a time. As long as an object is in the finalization queue it is also considered live, i.e. the finalization queue is a GC root.

Resurrection

Resurrecting an object means making it reachable again from within its finalizer or a finalizer that can still reach the object (or via a tracking weak reference). The garbage collector does not have to treat this case specially—until the finalizer has run the object is considered live by virtue of its being in the finalization queue, and afterwards it is live because it is reachable through some other root(s).

Tracking weak references

If an object is weakly referenced and has a finalizer, the weak reference will be nulled during the same collection as the finalizer is put in the finalization queue. That is not always desirable, especially for objects that might be resurrected.

Tracking references solve the problem by keeping the reference intact at least until the finalizer has run. Once the finalizer has finished, a tracking reference acts like a standard weak reference, i.e. it will be nulled once the object becomes unreachable, typically during the next collection (unless the object was resurrected).

When SGen encounters a tracking reference, instead of nulling it, it turns it into a non-tracking reference. The referenced object is now on the finalization queue and therefore considered live again, so the reference will not be nulled before the finalizer has run.
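Put as code, processing a single weak reference at the end of a collection might look roughly like this (names are illustrative; see the update at the end of this post, which corrects the demotion behavior sketched here):

#include <stddef.h>

extern int object_is_marked (void *obj);

typedef struct {
        void *target;
        int tracking;
} WeakRef;

static void
process_weak_ref (WeakRef *ref)
{
        if (!ref->target || object_is_marked (ref->target))
                return;                 /* target survived; leave the reference alone */
        if (ref->tracking) {
                ref->tracking = 0;      /* demote to non-tracking */
                return;                 /* target is on the finalization queue, hence live */
        }
        ref->target = NULL;             /* plain weak reference: null it */
}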

Why finalization is Evil

Finalization should not be used to manage scarce resources, such as file descriptors.

Time of finalization is not determined

Unless you force a garbage collection and wait for the finalizers to finish, the time at which they are run is not determined. If your program does not allocate a lot of memory, or your heap is huge, it might take a very long time until a major collection is triggered, so dead objects on the major heap might not be finalized in time before your scarce resource is depleted.

In this highly recommended talk (starting at 8:35) Cliff Click gives an example where it was necessary to put a hack into the JVM garbage collector that triggered a collection whenever the system ran out of file handles, because Apache Tomcat relied on finalizers to reclaim them.

Finalizers run one after another

A finalizer that performs a time consuming task will delay the execution of other finalizers. An especially worrisome case is finalizers that do potentially blocking I/O. Depending on the timeout of the operation, other finalizers can be blocked for a long time.

Finalizers make objects live longer

Since finalizable objects are considered live until their finalizers have run, they also consume memory until their finalizers have run and the next garbage collection is triggered. This not only applies to the finalizable objects themselves but to all objects they reference as well. This is particularly problematic if finalization is blocked by an ill-behaved finalizer.

Order of finalization is not determined

A finalizer cannot count on other finalizers running before or after it, irrespective of whether its object references or is referenced by the objects those finalizers belong to. In other words, a finalizer cannot assume that an object it references has not been finalized yet. That is true even if that object is known to be strongly held by a GC root, namely when shutting down.

An exception to this are critical finalizers, which are run after “normal” finalizers, but there is no determined order among normal finalizers, nor among critical ones.

[Update]

It turns out that SGen’s handling of tracking references, as described above, is not only conceptually wrong, it was also incorrectly implemented, which, ironically, mostly fixed the bug in the design.

Our conceptual mistake was to demote tracking references to non-tracking ones the first time the object became unreachable, which would have led to the reference being non-tracking even if its target was re-registered for finalization. The bug in the implementation was that we didn’t actually do that. Instead, SGen would keep tracking references around until their targets became really, truly unreachable, even via finalization, i.e. until it was done finalizing, without any further chance of resurrection. Only at that point would it turn the tracking reference into a non-tracking one, thereby making the object alive once again and keeping it around for one more garbage collection cycle.

The bug is now fixed. Thanks to Alan for noticing this case and investigating.

SGen – The Major Collectors

Introduction

In the last post of my series on SGen, Mono’s new garbage collector, I talked about the young generation, called the nursery, and its collector. In this post I shall talk about the old, or major, generation and the two different collectors that implement it.

The older of the two, the copying collector, is mostly used for reference nowadays. It has been supplanted by the Mark-and-Sweep collector in practical use because it performs better in most cases and still has room for improvement.

The Mono man page gives details on how to specify which collector to use and how to change their parameters with the environment variable MONO_GC_PARAMS.

Copying

The copying major collector is a very simple copying garbage collector. Instead of using one big consecutive region of memory, however, it uses fixed-size blocks (of 128 KB) that are allocated and freed on demand. Within those blocks allocation is done by pointer bumping.

At a major collection all objects for which this is possible are copied to newly allocated blocks. Pinned objects clearly cannot be copied, so they stay in place. All blocks that have been vacated completely are freed. In the ones that still contain pinned objects we zero the regions that are now empty. We don’t actually reuse that space in those blocks but just keep them around until all their objects have become unpinned. This is clearly an area where things can be improved.

Why two collectors?

The copying collector was implemented when SGen was still young. It shares most of the copying machinery with the nursery collector, so it was easy to implement. As a production collector, however, it is not suited for most workloads. The old generation is expected to be large and composed of objects that have a long life, so copying them is bad from the perspective of memory usage (it might double for the duration of the collection) as well as of cache behavior.

Mark-and-Sweep

Mark-and-Sweep is SGen’s default major collector and is actively being developed and improved.

In contrast to the copying collector a mark-and-sweep collector has to deal with individual objects being freed and new objects filling up that space again. (This is different from pinned objects leaving holes in blocks for the copying collector because in that case it is expected that there will only be a tiny number of remaining pinned objects, so the free regions will be large, whereas here it is expected that many objects survive, leaving lots of small holes). This creates the problem of fragmentation – lots of small holes between objects that cannot be filled anymore, resulting in lots of wasted space.

Blocks

The basic idea of SGen’s Mark-and-Sweep collector is to have fixed-size blocks, each of which is divided into equally sized object slots of a variety of different sizes – this is similar to what Boehm does. One advantage of this over allocating objects sequentially is that the holes are always the same size, so if there are new objects arriving at that size the holes can always be filled.

Of course it is not feasible to support blocks with slots of every conceivable object size, so we space out the different supported sizes so as not to waste too much space (in the future we might want to choose the sizes dynamically, to fit the workload). Another thing that works to our advantage is that, since the slots have a known size, we don’t have to keep tabs on where they start and end. All we need is one mark bit per object which we keep in a separate region for better memory locality (we could use unused bits in the first object word, like the nursery collector does). For faster processing we’re actually keeping one bit per potential object start address, i.e. 1 bit per 8 bytes.

Since we only have a limited number of different slot sizes, we can also keep free lists, so allocation is more efficient because it doesn’t involve search.
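A minimal sketch of such an allocation, assuming the free slots of a size class are linked through their first word (names are made up):

#define NUM_SLOT_SIZES 32   /* assumption */

static void *free_lists [NUM_SLOT_SIZES];

static void *
alloc_from_size_class (int size_class)
{
        void **slot = free_lists [size_class];
        if (!slot)
                return (void *) 0;       /* no free slot: allocate a fresh block */
        free_lists [size_class] = *slot; /* next free slot is linked in place */
        return slot;
}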

During garbage collection it is necessary to get from an object to its block in order to access its mark bit. The size of each block is 16 KB, and they are allocated on 16 KB boundaries, so to get from an object that is inside a block to the address of its containing block a simple logical AND is sufficient. Unfortunately, large objects live outside of blocks and for them this operation would yield a bogus block address. To determine whether an object is large we need to examine its class metadata which is referenced from its VTable, requiring three loads (the “fixed heap” variant of the collector solves this problem and is described below.)

Each block has a “block info” data structure that contains metadata about the block. Apart from the size of the object slots and several other pieces of data it also contains the mark bits.
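Putting the last few paragraphs together, here is a sketch of marking an object that lives inside a block; the field and function names are made up:

#include <stddef.h>

#define MS_BLOCK_SIZE (16 * 1024)

typedef struct {
        unsigned int mark_bits [MS_BLOCK_SIZE / 8 / 32]; /* 1 bit per 8 bytes */
        /* ... slot size, flags, free list, ... */
} BlockInfo;

static void
mark_object_in_block (char *obj)
{
        /* blocks are 16 KB aligned, so mask off the low bits */
        char *block = (char *) ((size_t) obj & ~(size_t) (MS_BLOCK_SIZE - 1));
        /* the first word of the block links to its block info (see below) */
        BlockInfo *info = *(BlockInfo **) block;
        size_t bit = (size_t) (obj - block) >> 3;
        info->mark_bits [bit >> 5] |= 1u << (bit & 31);
}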

Fixed Heap

In the standard Mark-and-Sweep collector the blocks are allocated on demand, so they are potentially scattered all over the process’s address space. The fixed heap variant, on the other hand, allocates (using mmap) a heap of a fixed size (excluding the large object space) on startup and then assigns blocks out of that space. The advantage here is that we can check whether a pointer is within a block simply by checking whether it falls within the bounds of the allocated space, so no loads are necessary to get to its mark bit.

In standard Mark-and-Sweep the first word of each block links to its block info. Fixed heap can forego this load as well because the block infos are also allocated sequentially, so all that is needed to get to a block info is its index which is the same as its block’s index, which can be determined from the block address.

The obvious disadvantage of fixed heap Mark-and-Sweep is that the heap cannot grow beyond its fixed size. On some workloads fixed heap yields significant performance improvements. I will try to give some benchmarking results in a later post.

Non-Reference Objects

Objects that don’t contain references don’t have to be scanned, which implies that they also don’t have to be put on the gray stack. To handle this special case more efficiently Mark-and-Sweep uses separate blocks for objects with versus without references. The block info contains a flag that says which type the block is.

For the fixed heap collector this separation means that we don’t even have to load a single word of an object or its metadata if it doesn’t contain references.

Evacuation

Despite the segregation of blocks according to fixed object sizes, one form of memory waste can still occur. If the workload shifts from having lots of objects of a specific size to having only a few of them, a lot of blocks for that object size will have very low occupancy – perhaps a single object will still be live in the block and no new objects will arrive to fill the empty slots.

Unlike Boehm, SGen has the liberty to move objects around. During the sweep stage of a collection we keep statistics on the occupancy of blocks of all slot sizes. If the occupancy for a given slot size falls below a certain (user definable) threshold then we mark that slot size for evacuation on the next major collection.
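In sketch form, the end-of-sweep decision might look like this; the names and the exact statistic are illustrative, and the threshold is the user-definable one mentioned above:

#define NUM_SLOT_SIZES 32   /* assumption */

static int live_slots [NUM_SLOT_SIZES];   /* accumulated during sweep */
static int total_slots [NUM_SLOT_SIZES];
static int evacuate [NUM_SLOT_SIZES];

static void
decide_evacuation (double threshold)
{
        int i;
        for (i = 0; i < NUM_SLOT_SIZES; ++i)
                evacuate [i] = total_slots [i] &&
                        (double) live_slots [i] / total_slots [i] < threshold;
}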

When blocks of a slot size are being evacuated Mark-and-Sweep acts like a copying collector on the objects within those blocks. Instead of marking them they are copied to newly allocated blocks, filling them sequentially and emptying all the sparsely occupied ones which will then be freed during the following sweep.

Blocks with pinned objects are exempt from this whole process since they cannot be freed anyway.

Note that the occupancy situation might change between a sweep and the following major collection – the workload could change in the mean time and produce lots of live objects of that size which would all be copied, without actually saving any space. SGen cannot tell which objects will be live without doing the mark phase because that’s the mark phase’s job, so we take the risk of that happening.

As a side-note: Sun’s Garbage-First collector solves this problem by running the mark phase concurrently with the mutator, so it is actually able to tell at the start of a garbage collection pause which blocks will contain mostly garbage.

Parallel Mark

The mark phase can be parallelized with reasonable effort. Essentially, a number of threads each start with a set of root objects independently. Each thread has its own gray stack, but care must be taken that work is not repeated. Two threads can compete for marking the same object, and if they both put the object into their gray stacks, the object will be scanned twice, which is to be avoided.

Some care must also be taken with handling objects that still reside in the nursery. They have to be copied to the major heap and a forwarding pointer must be installed. The forwarding pointer can only be installed after the space on the major heap has been allocated (otherwise where would it point to?), so two threads might do that independently but only one will win out and the other will have done the allocation in vain.
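Here is an illustrative sketch of that race using a GCC atomic built-in; the helper functions and the forwarding-bit encoding are assumptions, not SGen’s actual code:

#include <stddef.h>
#include <string.h>

#define FORWARD_BIT 0x1

extern void *major_alloc (size_t size);
extern size_t object_size (void *obj);
extern void gray_stack_push (void *obj);

static void *
copy_and_forward (void *obj)
{
        size_t vtable_word = *(size_t *) obj;   /* assumed not yet forwarded */
        void *copy = major_alloc (object_size (obj));
        memcpy (copy, obj, object_size (obj));
        if (__sync_bool_compare_and_swap ((size_t *) obj, vtable_word,
                                          (size_t) copy | FORWARD_BIT)) {
                gray_stack_push (copy);         /* we won the race */
                return copy;
        }
        /* we lost: another thread installed its forwarding pointer first,
           so our allocation was in vain; return the winner's copy */
        return (void *) (*(size_t *) obj & ~(size_t) FORWARD_BIT);
}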

It is not actually sufficient to just partition the set of root objects between the worker threads – some threads might end up having a far smaller share of the live object graph to work with. SGen currently uses a shared gray stack section buffer where worker threads deposit parts of their gray queues for the others to take.

Parallel mark achieves quite impressive performance gains on some workloads, but on most the gains are still somewhat disappointing. I suspect the work distribution needs improvement.

Concurrent Sweep

The sweep phase goes through all blocks in the system, resets the mark bits, zeroes the object slots that were freed and rebuilds the free lists.

None of those data structures are actually needed until the next nursery collection, so the sweep lends itself to being done in the background while the mutator is running, which is what Mark-and-Sweep can optionally do. The next nursery collection will in that case wait for the sweep thread to finish before it proceeds.

SGen – The Nursery

Introduction

This is the second part of a series of blog posts on SGen, Mono’s new garbage collector. In this installment we shall be taking a closer look at the first generation, also called the “nursery”.

Nursery

As the name implies, the nursery is the generation where objects are allocated, or, to be more poetic, born. In SGen, it is a contiguous region of memory that is allocated when SGen starts up, and it does not change size. The default size is 4 MB, but a different size can be specified via the environment variable MONO_GC_PARAMS when invoking Mono. See the man page for details.

Allocation

In a classic semi-space collector allocation is done in a linear fashion by pointer bumping, i.e. simply incrementing the pointer that indicates the start of the current semi-space’s region that is still unused by the amount of memory that is required for the new object.

In a multi-threaded environment like Mono using a single pointer would introduce additional costs because we would have to do at least a compare-and-swap to increment it, potentially looping because it might fail due to contention. The standard solution to this problem is to give each thread a little piece of the nursery exclusively from which it can bump-allocate without contention. Synchronization is only needed when a thread has filled up its piece and needs a new one, which should be far less frequent. These pieces of nursery are called “thread local allocation buffers”, or “TLABs” for short.

TLABs are assigned to threads lazily. The relevant pointers, most notably the “bumper” and the address of the TLAB’s end, are stored in thread local variables. Currently TLABs are fixed in size to 4 KB (actually that’s the maximum size – they might be smaller because the nursery can be fragmented). In the future we might dynamically adapt that size to the current situation. The fewer threads are allocating objects, the larger the TLABs should be so that they need to be assigned less often. On the other hand, the more threads we allocate TLABs to, the sooner we run out of nursery space and collect prematurely. In that situation the TLABs should be smaller. Ideally a thread should get a TLAB whose size is proportional to its rate of allocation.

Calling from managed into unmanaged code involves doing a relatively costly transition. To avoid this the fast paths of the allocation functions are managed code. They only handle allocation from the TLAB. If the thread does not have a TLAB assigned yet or it is not of sufficient size to fulfill the allocation request we call the full, unmanaged allocation function.
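As a rough sketch, assuming thread-local bump and limit pointers with made-up names (the real fast path is managed code emitted by the JIT, not C):

#include <stddef.h>

extern void *alloc_slow_path (size_t size);   /* new TLAB or collection needed */

static __thread char *tlab_next;   /* the "bumper" */
static __thread char *tlab_end;

static void *
tlab_alloc (size_t size)   /* size assumed suitably aligned */
{
        char *p = tlab_next;
        if (p + size > tlab_end)
                return alloc_slow_path (size);
        tlab_next = p + size;
        return p;
}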

Collection

Collecting the nursery is not much different in principle from collecting the major heap, so much of what I discuss here is also valid there.

Coloring

The collection process can be thought of as a coloring game. At the start of the collection all objects start out as white. First the collector goes through all the roots and marks those gray, signifying that they are reachable but their contents have not been processed yet. Each gray object is then scanned for further references. The objects found during that scanning are also colored gray if they are still white. When the collector has finished scanning a gray object it paints it black, meaning that object is reachable and fully processed. This is repeated until there are no gray objects left, at which point all white objects are known not to be reachable and can be disposed of whereas all black objects must be considered reachable.

One of the central data structures in this process, then, is the “gray set”, i.e. the set of all gray objects. In SGen it is implemented as a stack.

The roots

As I mentioned above as well as in my overview post, the collection starts with the roots. These include, most prominently, the references on the stacks of all managed threads and static class fields. In addition to that there are numerous other roots that the runtime registers and uses for internal purposes. Since SGen is a generational collector, the roots for the nursery collection also include all references from the major heap to objects in the nursery. These are kept track of with write barriers which I will discuss in a later blog post.

Forwarding

In a copying collector like SGen’s nursery collector, which copies reachable objects from the nursery to the major heap, coloring an object also implies copying it, which in turn requires that all references to it must be updated. To that end we must keep track of where each object has been copied to. The easiest way to do this is to use a forwarding pointer. In Mono, the first pointer-sized word of each object is a pointer to the object’s vtable, which is aligned to at least 4 bytes. This alignment leaves the two least significant bits for SGen to play with during a collection, so we use one of them to indicate whether that particular object is forwarded, i.e. has already been copied to the major heap. If it is set, the rest of the word doesn’t actually point to the vtable anymore, but to the object’s new location on the major heap. The second bit is used for pinning, which I will discuss further down.
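As a sketch, the forwarding helpers used by the loop below could be implemented like this; the bit values are illustrative:

#include <stddef.h>

#define FORWARD_BIT 0x1
#define PIN_BIT     0x2

static int
object_is_forwarded (void *obj)
{
        return (int) (*(size_t *) obj & FORWARD_BIT);
}

static void *
forwarding_destination (void *obj)
{
        return (void *) (*(size_t *) obj & ~(size_t) (FORWARD_BIT | PIN_BIT));
}

static void
forwarding_set (void *obj, void *destination)
{
        *(size_t *) obj = (size_t) destination | FORWARD_BIT;
}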

The loop

Here is a simplified (pinning is not handled) pseudo-code implementation of the central loop for the nursery collector:

 1:  while (!gray_stack_is_empty ()) {
 2:      object = gray_stack_pop ();
 3:      foreach (refp in object) {
 4:          old = *refp;
 5:          if (!ptr_in_nursery (old))
 6:              continue;
 7:          if (object_is_forwarded (old)) {
 8:              new = forwarding_destination (old);
 9:          } else {
10:              new = major_alloc (object_size (old));
11:              copy_object (new, old);
12:              forwarding_set (old, new);
13:              gray_stack_push (new);
14:          }
15:          *refp = new;
16:      }
17:  }

In line 2 we pop an object out of the gray stack and in line 3 we loop over all the references that it contains. refp is a pointer to the reference we’re currently looking at. In line 4 we fetch the actual reference. We are only interested in collecting the nursery, so if the reference already points to the old generation we skip it (lines 5 and 6). If the object is already forwarded (line 7) we fetch its new location (line 8). Otherwise we must copy it to the major heap. To that end we allocate the space required (line 10), do the actual copying (line 11) and forward the object (line 12). That newly copied object must also be processed, so we push it on the gray stack (line 13). Finally, for both cases, we update the reference to point to the new location of the object (line 15).

Pinning

In the overview post I mentioned that we currently cannot scan stack frames precisely, i.e. we do not know whether a word that looks like a reference actually is one or just resembles one by coincidence. Work on scanning managed stack frames precisely is under way, but for unmanaged frames it is not possible anyway, because the C compiler does not provide us with the necessary information (some runtimes avoid this problem by not exposing managed references to unmanaged code directly).

Because we lack this knowledge we must proceed conservatively and assume that what looks like a reference is one, which means considering the object it points to as reachable. On the other hand, since we cannot be sure, we are not allowed to modify this supposed reference, because it might just be a number. That implies that we also cannot move the object. In other words the object is “pinned”.

Pinned objects are a bit of a complication. Having pinned objects in the nursery means that we cannot clean it out completely and so it becomes fragmented. It also requires an additional check in the central loop.
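In the pseudo-code above the extra check would slot in between lines 6 and 7, along these lines. Pinned objects are discovered before the loop runs, via the pin queue, and are scanned in place, so the loop merely has to leave them, and references to them, untouched:

    if (object_is_pinned (old))
        continue;    /* the object stays put, so the reference remains valid */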

Finishing up

After all the copying is done, what’s left is to clean up the nursery for the next round of allocations. SGen doesn’t actually zero the nursery memory at this point, however. Instead, that happens when TLABs are assigned to threads. The advantage is not only that the zeroing potentially gets distributed over more than one thread, but also that it has better cache behavior – the TLAB is likely to be touched again soon by that thread.

What does have to happen at this point, though, is to take stock of the regions of nursery memory that are available for allocation, which means everything apart from pinned objects. Since we have a list of all pinned objects – the “pin queue”, which I will discuss in a later post – this is done very efficiently. It is these “fragments” from which TLABs will be assigned in the next mutator round.
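Assuming the pin queue is sorted by address, building the fragments amounts to a single pass over it, roughly like this (add_fragment and the nursery bounds are illustrative names):

    start = nursery_start;
    foreach (pinned in pin_queue) {    /* sorted by address */
        if (pinned > start)
            add_fragment (start, pinned);
        start = pinned + object_size (pinned);
    }
    if (start < nursery_end)
        add_fragment (start, nursery_end);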

I should mention that there is a bit of additional work to be done prior to fragment creation, namely making sure unreachable finalizable objects are finalized and weak references are zeroed if the objects they pointed to have become unreachable. Both are very similar issues and shall be discussed in a later post.

SGen

Introduction

SGen is Mono’s new garbage collector, which we have been working on intensely for almost two years now and which has become stable and competitive over the past few months.

In this series of blog posts I will try to explain how garbage collection works in general, how SGen works in particular, how to get the best performance out of it and, finally, what our plans are for the future.

This first post will give a very brief overview of what a garbage collector does and how it does it, before outlining the broad architecture of SGen.

Why a new garbage collector?

Since its inception Mono, like many other garbage collected runtimes, has been using the Boehm-Demers-Weiser collector, which I shall refer to as “Boehm” henceforth. Boehm’s main advantages are its portability, stability and the ease with which it can be embedded. It is designed as a garbage collector for C and C++, not for managed languages, however, so it does not come as a surprise that it falls short for that purpose compared to collectors dedicated to such languages or runtimes.

Our goal with SGen was to overcome Boehm’s limitations and provide better performance for managed applications, in terms of allocation speed, collector pauses, and memory utilization. In this and the following posts of this series I will mention points where we improve upon Boehm whenever appropriate.

Garbage collection

Before digging into the details of SGen it seems prudent to discuss how garbage collection actually works. I will only discuss topics that are relevant to SGen, and even paint those only with very broad strokes. For more comprehensive overviews of garbage collection see “Uniprocessor Garbage Collection Techniques” by Wilson or, for more detailed information, Jones and Lins’s book.

The garbage collector, in short, is that part of the language’s or platform’s runtime that keeps a program from exhausting the computer’s memory, despite the lack of an explicit way to free objects that have been allocated. It does that by discovering which objects might still be reached by the program and getting rid of the rest. It considers as reachable those objects that are directly or indirectly referenced by the so-called “roots”, which are global variables, local variables on the stacks or in registers, and other references explicitly registered as roots. In other words, all those objects are considered reachable, as well as those that they reference, and those that they reference, and so on. The pedantic will note that the program might not actually be able to reach all of those. Computing the minimum set of objects that the program could reach is impossible in general, however – a consequence of the halting problem.

SGen, as well as Boehm, are “stop-the-world” collectors, meaning that they do their work while the actual program, slightingly called the “mutator” in garbage collection lingo, is stopped.

There are three classic garbage collection algorithms: Mark-and-sweep, Copying, and Compaction, of which only the first two are relevant to our discussion of SGen since we have not implemented a compacting collector and have no plans of doing so in the foreseeable future.

Mark-and-Sweep

Mark-and-sweep is the oldest and probably most widely implemented garbage collection algorithm. Boehm is a mark-and-sweep collector.

The idea is to have a “mark” bit on each object that is set on all reachable ones in the “mark” phase of the collection, by recursively traversing references starting from the root set. Of course in practice this is not actually implemented recursively, but by using a queue or stack.

The “sweep” phase then frees the memory occupied by the objects that were not marked and resets the mark bits on the marked ones.
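In pseudo-code, the two phases look roughly like this, assuming a mark bit in each object’s header and some way to enumerate all objects for the sweep (null checks omitted for brevity):

    /* Mark phase: traverse the object graph from the roots, using
       an explicit mark stack instead of recursion. */
    foreach (root in roots)
        mark_stack_push (*root);
    while (!mark_stack_is_empty ()) {
        object = mark_stack_pop ();
        if (object_is_marked (object))
            continue;
        set_mark_bit (object);
        foreach (refp in object)
            mark_stack_push (*refp);
    }

    /* Sweep phase: walk the entire heap, freeing unmarked objects
       and resetting the mark bits on the marked ones. */
    foreach (object in heap) {
        if (object_is_marked (object))
            clear_mark_bit (object);
        else
            free_object (object);
    }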

Of course many variants of this algorithm exist that vary in details.

Copying

Traditional mark-and-sweep has two main weaknesses. First, it needs to visit all objects, including the unreachable ones. Second, it can suffer from memory fragmentation like a malloc/free allocator.

A copying collector addresses both problems by copying all reachable objects to a new memory region, allocating them linearly one after the other. The old memory region can then be discarded wholesale. The classic copying collector is the so-called “semi-space” algorithm, where the heap is divided into two halves. Allocation proceeds linearly from one half until it is full, at which point the collector copies the reachable objects to the second half. Allocation continues from there, and at the next collection the now empty first half is used as the copying destination, the so-called “to-space”.
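The classic way to implement the copying is Cheney’s algorithm, in which the to-space itself serves as the queue of objects that still need scanning, so no extra mark stack is required. A pseudo-code sketch (forwarding records where an object has been copied to; null checks omitted):

    copy (old) {
        if (object_is_forwarded (old))
            return forwarding_destination (old);
        new = next;
        next += object_size (old);
        copy_object (new, old);
        forwarding_set (old, new);
        return new;
    }

    scan = next = to_space_start;
    foreach (root in roots)
        *root = copy (*root);
    while (scan < next) {    /* objects between scan and next are not yet scanned */
        foreach (refp in scan)
            *refp = copy (*refp);
        scan += object_size (scan);
    }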

Since with a copying collector objects change their addresses it is obvious that the collector also has to update all the references to the objects that it moves. In Mono this is not always possible because we cannot in all cases unambiguously identify whether a value in memory (usually on the stack) is really a reference or just a number that points to an object by coincidence. I will discuss how we deal with this problem in a later installment of this series.

Generational garbage collection

It has been observed that for most workloads the majority of objects either die very quickly or live for a very long time. One can take advantage of this so-called “generational hypothesis” by dividing the heap into two or more “generations” that are collected at different frequencies.

Objects begin their lives in the first generation, also referred to as the “nursery”. If they manage to stick around long enough they get “promoted” to the second generation, etc. Since all objects are born in the nursery it grows very quickly and needs to be collected often. If the generational hypothesis holds only a small fraction of those objects will make it to the next generation so it needs to be collected less frequently. Also, it is expected that while only a small fraction of the objects in the nursery will survive a collection, the percentage will be higher for older generations, so a collection algorithm that is better suited to high survival rates can be used for those. Some collectors go so far as to give objects “tenure” at some point, making them immortal so they don’t burden the collection anymore.

One difficulty with generational collection is that it’s not quite possible to collect a young generation without looking at the older generations at all, because the mutator might have stored a reference to a young generation object in an older generation object. Even if that young object is not referenced from anywhere else it must still be considered alive. Clearly scanning all objects in older generations for such references would defeat the whole purpose of separating them in the first place. To address this problem generational collectors make the mutator register all new references from old to young generations. This registration is referred to as a “write barrier” and will be discussed in a later installment. It is also possible to register reads instead of writes, with a “read barrier”, but this is impractical without hardware support. It’s obvious that the write barrier must be very efficient since it’s invoked for every write to a reference field or array member.
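Conceptually, every store of a reference into an object’s field is compiled into something like the following sketch, where remember_reference is a placeholder for whatever recording mechanism the collector uses (more on that in the later installment):

    obj->field = value;                       /* the store itself */
    if (!ptr_in_nursery (obj) && ptr_in_nursery (value))
        remember_reference (&obj->field);     /* record it for the next nursery collection */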

SGen

SGen, which historically stands for “Simple Generational”, is a generational garbage collector with two generations. The nursery is collected with a copying collector that immediately promotes all live objects, if possible, to the old generation (or “major heap”).

For the old generation we have two collectors that the user can choose between: A mark-and-sweep and a copying collector. The mark-and-sweep collector is available in four variants, with and without a parallel mark phase and with a dynamic or fixed-size heap.

In addition to that SGen has a separate space for large objects (larger than 8000 bytes), called the “Large Object Space”, or “LOS”, which is logically part of the old generation. Large objects are not born in the nursery but directly in the LOS, and they are not moved.

One major difficulty SGen has to deal with is objects that are “pinned”, which means that they cannot be moved. The reason for this is usually that they are referenced from stack frames that we cannot scan “precisely”, i.e. for which we do not know what is a reference and what is not. Work is currently under way to scan managed stack frames precisely, but we will always have to handle unmanaged stack frames, too, for which no type information is available and which can therefore only be scanned “conservatively”. More on this in one of the following posts.

Future installments

Having covered a few basic concepts of garbage collection I shall go into more detail about SGen in the blog posts following this one. Here is a tentative list of the posts I intend to write in the coming weeks. Comments and suggestions are of course welcome.

Allocating objects and the minor collector

How is the nursery organized, how are new objects allocated and how is the nursery collected?

The major collectors

An overview of SGen’s major collectors, mark-and-sweep and copying. I’ll discuss how they organize the heap and how they collect.

Sundry

Several shorter topics: Large objects, GC descriptors, conservative scanning, pinning, write barriers, finalization, weak references and domain unloading.

Debugging a garbage collector

Finding bugs in a garbage collector is ridiculously hard. Here I’ll describe some of the tools we use for assistance.

Getting the best performance out of SGen

How to fine-tune SGen for a workload and how to write applications that take advantage of SGen’s performance characteristics.

The future

What’s in store for SGen in the future?