Mostly Software

October 5, 2013

Stacks and Signals

Filed under: Software — Tags: , , — schani @ 1:56 am

Now and then we get bug reports of test cases crashing on a specific configuration in an indeterminate manner or frequency, both from our own build bots as well as from our users. One of those I looked at recently involved a runtime test case on Linux/AMD64. The test would only crash infrequently, but luckily I could reproduce it in a VM.

What the program does is run two threads, one of which causes null reference exceptions in a loop over and over again:

while (!finished) {
    object o = null;
    try {
        o.ToString ();
    } catch {

The other thread also runs a very simple loop, manually invoking the garbage collector again and again:

while (i < niter) {
    if (i % 100 == 0)
        Console.Write (".");
    GC.Collect ();

Mono generates null reference exceptions by just letting the code trigger a segmentation violation signal, which we catch in a signal handler that then proceeds to throw the NRE. This approach, as opposed to doing a null check whenever we access an object, is that there is no overhead at all in the case the reference is legit, and the code we generate is shorter.

Garbage collection, on the other hand, also involves signals: stopping the world for a collection on Linux is done by sending a signal to each managed thread. Each thread catches its signal, which tells it that it has to suspend until the GC has done its work.[1]

When we look for the cause of a bug like this the first thing we try to do is catch the crash in a debugger, in this case GDB. Unfortunately, the crash was caused by a segmentation violation that our signal handler cannot handle (because it doesn’t occur in managed code), but a regular run of the program also triggers segmentation violations like crazy, so just breaking at each SIGSEGV didn’t get us anywhere.

My first attempt at getting to the crashing SIGSEGV was to write an Expect script that drove GDB to the first SIGSEGV, presumably caused by the null reference in the managed code loop, remember the offending address and then keep continuing until it hit a SIGSEGV at a different address. Sadly (and quite typical of cases like this), running the code that way in GDB never succeeded in triggering the bug, so we had to think of other options.

At first we thought the crash was caused by the interaction of the two signals (the segmentation violation from the null reference and the stop-the-world signal), but we looked into other hypotheses. We tried two different ways of getting rid of the segmentation violation: using explicit null checks to trigger NREs and just throwing an exception manually without accessing a null reference.

Explicit null checks (enabled with the environment variable MONO_DEBUG=explicit-null-checks) makes the JIT emit less efficient and more elaborate null checks whenever an object is accessed, so an NRE can be thrown without going through a SIGSEGV.

The other way to get rid of the SIGSEGV while still retaining most of the code’s behavior is to just throw an exception, like so:

while (!finished) {
    var e = new Exception ();
    try {
        throw e;
    } catch {

We figured it probably didn’t matter which type the exception was, and indeed it did not: both with explicit null checks as well as with the manual exception throw the crash still occurred. We also looked into whether invoking the garbage collector was critical to the crash, and it turned out it was: just making the other thread sleep instead of calling the GC made the crash go away. Also, despite what the bug report claims, the crash was not dependent on SGen—it occurred with the Boehm collector, as well.

Armed with this new data we set out to finally get a close look at the crash in the debugger. Since the crash still only occurred very infrequently I brought my Expect script to bear again, and after a while we did get the crash in GDB. It did, regrettably, give us less information that we had hoped: instead of crashing somewhere in unmanaged code while dereferencing an invalid pointer, the crash resulted from the program counter being set to an invalid address, namely zero.

There are three main ways of setting the program counter on AMD64: CALLing a function, RETurning from a function and JMPing to an address. Spending some time looking at the stack dump it seemed that the most likely cause was a jump, which did not make our job easier, since it leaves no trace of where it came from.

After some more tinkering we decided to investigate whether the point at which the exception throwing thread was interrupted by the GC played any role. We got lucky—it so happened that every time the test crashed, the last time the thread was stopped was always in the exact same piece of code; even more, at the exact same instruction. This is the code in which the thread suspension occurred, with the instruction where it happened labeled:

    mov    %rdi,%r11
    mov    (%r11),%rax
    mov    0x10(%r11),%rcx
    mov    0x18(%r11),%rdx
    mov    0x8(%r11),%rbx
    mov    0x20(%r11),%rbp
    mov    0x30(%r11),%rsi
    mov    0x38(%r11),%rdi
    mov    0x60(%r11),%r12
    mov    0x68(%r11),%r13
    mov    0x70(%r11),%r14
    mov    0x78(%r11),%r15
    mov    0x28(%r11),%rsp
    mov    0x80(%r11),%r11
    jmpq   *%r11

This piece of code is executed as the last step of throwing an exception. It jumps to the catch handler, which involves first restoring a bunch of registers, including the stack pointer, and then doing the actual jump. We guessed that this jump to register r11 was the one that landed us at address zero. Observing in a few more crashes that r11 was indeed always zero corroborated that hypothesis.

The code that calls this restore_context() function looks like this:

mono_amd64_throw_exception (MonoObject *exception, ...)
    MonoContext ctx;
    ... figure out where on the stack the matching catch clause is ...
    ... set ctx to the values required to end up in that catch block ...
    restore_context (&ctx);

Note that the context—the data structure that contains all the values for the registers, including the program counter—is a local variable of this function, i.e., it resides in this function’s stack frame. Said stack frame is on the same stack as the catch handler that it eventually dispatches to, but further down, of course. We pass the address of the context to the above restore_context() function.

It seemed obvious now that for whatever reason the value for the program counter was set to zero in the context, so we added an assertion to check for that directly before the call to restore_context(). The assertion never failed, but the program still crashed.

Looking at the assembly code again Rodrigo suddenly proclaimed that he knew exactly what caused the bug: after setting the stack pointer to the catch handler’s frame we did another load from the context, which we weren’t allowed to do. By moving the stack pointer upward the context ended up lying not above the stack pointer, where it used to be, but below:

Most ABIs specify that programs must assume that data below the stack pointer (plus the red zone) is subject to arbitrary change because it’s used by the operating system for invoking signal handlers, for example—exactly what happened in our case with the thread suspend signal.

The correct way to do the context restore, then, is to load everything from the context before restoring the stack pointer. In our case we use the scratch register r8 to hold the value that will end up in rsp:

    mov    0x28(%r11),%r8
    mov    0x80(%r11),%r11
    mov    %r8,%rsp
    jmpq   *%r11

It’s somewhat embarrassing that we could already generate the correct code, but only did so in case we were running on Valgrind, because it gave us a warning about it that we apparently thought was a false positive.

Looking at our code for x86, after fixing AMD64, we found that we had the same bug there, but nobody had reported it yet, probably because it was harder to trigger.

  1. Reality is a tiny bit more complicated: Sending a signal to a thread is asynchronous, meaning that the GC has no direct way of knowing that a thread has already gotten the signal and reacted to it. Instead, the suspend signal handlers message back to the GC that the point of suspension has been reached, upon which they actually suspend, waiting for yet another signal from the GC, telling them to resume. Note that if the resume signal arrived between the point where the GC is notified of the suspension and the point of initiating the wait for said resume signal, the signal would be lost and the thread would never resume.  ↩

April 27, 2013

How to Split a Pizza

Filed under: Uncategorized — schani @ 9:35 pm

Assume two people want to split a pizza in a fair way. In our context, fair means that if both people follow the procedure, each person gets half a pizza, or, if not, only has themselves to blame. The obvious answer seems to be “split it in the middle”, but that leads to all kinds of further questions and complications: Who does the splitting? How is “the middle” determined? Even if we had protocols for that, it would still leave the possibility of one person not being satisfied with how they were carried out—sloppy measurements or sloppy cutting.

The standard way to avoid all such problems is to let one person cut the pizza in two halves, and the other person choose who gets which half. If the cutter thinks they got the smaller piece it’s because they didn’t cut fairly or precisely enough, and if the chooser thinks their piece is smaller, they should have chosen the bigger one.

What if there are three people? Before we try to come up with a procedure, let’s consider people’s motivations first. In the case of two people, whatever one person wins, the other loses. With three people, one person might elect to get a smaller piece for themselves in order to also make one of the other two worse off. Or, more realistically, two people might collude against the third. We’d like a procedure that ensures that even under those circumstances, fairness is guaranteed.

One generalization of the two-people procedure that comes to mind is to let one person—let’s call them “A”—cut, and then let the two others choose, first B, then C. Unfortunately, if A cuts in such a way that there is one large piece and two pieces that are each less than 1/3, and B chooses the large piece, C doesn’t get their fair share (neither does A, but that’s A’s fault). Randomizing the order of cutter and choosers is not a solution, either, because there is still the possibility of somebody getting the short end of the stick.

Here’s a procedure that works: A cuts the pizza, B chooses their slice. C has two options: they can either choose one of the remaining two slices, or they can call shenanigans. In the latter case, B picks a second slice[1], then both slices are put together and C divides them into two halves. B picks one half, C gets the other.

To see why this procedure is fair, let’s consider each person in turn. A will always get the piece that B and C didn’t pick, so if A cut fairly, they get their fair share. B will either get the piece they pick, or at least half of two pieces they pick, so if they pick the two largest pieces, they will get at least one third of the pizza. C will either get the piece they pick, or, in the case of shenanigans, half of the two pieces that B picked. It’s not obvious that they will get at least a third in the latter case. We assume that C only calls shenanigans if the two pieces that are left after B picked one are both smaller than one third—if one piece is at least a third, C should of course just pick it, thereby getting their fair share. The larger of the two pieces will have size 1/3-x, the smaller one 1/3-x-y, where x>0 and y>=0. If the two pieces are the same size, y is zero. The piece that B picked therefore has size 1/3+2x+y. B now picks one of the smaller pieces. If B wants to minimize C’s share, B must pick the smallest piece, i.e.1/3-x-y, which then gets put together with B’s original piece for C to cut. The two pieces together have size 2/3+x, so if C cuts fairly, they will actually get a size larger than their fair share.

I don’t know whether this procedure can be generalized to a larger number of people.

Update: As many commenters have thankfully pointed out, I have solved a problem that smarter people have already solved for more general cases. For a good overview of a few methods, read this article. My personal favorite is the Last Diminisher algorithm. The method I came up with is a variant of the three-person Lone Divider procedure.

  1. It is important that it is B, not C, who picks the second slice when C calls shenanigans.  ↩

February 25, 2013

SGen – Concurrency and Evacuation

Filed under: Software — Tags: , , — schani @ 11:54 pm


In my last blog post about SGen, Mono’s garbage collector, I wrote about concurrent mark-and-sweep, which makes most of the garbage collection work happen concurrently with the running program.

One thing we cannot do in a concurrent collector, as opposed to a stop-the-world collector, is move objects around in memory[1]. This is typically not a problem in a mark-and-sweep collector, but SGen goes beyond basic mark-and-sweep and prevents fragmentation by evacuating objects when the need arises.

Fragmentation is no less of an issue in a concurrent collector, so being unable to do evacuation is problematic. Our best solution at this point is to stop the world when it happens[2]. The threshold we use to decide when to do evacuation with concurrent mark-and-sweep is higher than on our non-concurrent collector, but there is still a chance that a full collection pause will hit the program when interactive speeds are desired.


We’re introducing a new API in the Mono.Runtime class. The application can use it to ask the GC not to do full major collection pauses:

if (!Mono.Runtime.SetGCAllowSynchronousMajor (false))
        Console.WriteLine ("Sorry, the GC won't cooperate.");

// your low-latency code here

Mono.Runtime.SetGCAllowSynchronousMajor (true);

One possible use case would be a game, where you want the lowest latency possible while the player is in a level, but between levels, or during loads, a longer collection pause is not much of an issue. Note that there are still no guarantees about the lengths of the pauses—this is a best-effort approach, not a real-time garbage collector.

This API will be available in the next Mono release, but only via reflection. It will be a public API on mobile soon.

  1. This is technically possible, but it involves using a read barrier, which would not only be forbiddingly complicated to implement in Mono, but would also probably incur a significant performance hit as well as break our embedding API.  ↩

  2. An approach that would keep marking mostly concurrent would be to decide which blocks to evacuate before starting concurrent mark, and then during marking to keep tabs on the references from already marked objects to objects within those blocks. In the finishing collection pause we could evacuate the marked objects in those blocks, but we’d have to update the references to them. Depending on the workload that might still require a significant pause.  ↩

December 21, 2012

SGen – Concurrent Mark

Filed under: Software — Tags: , , — schani @ 4:02 pm


Like most garbage collectors, SGen is of the stop-the-world variety, which means that when the time has come to perform its work, it stops all managed threads in their tracks and only resumes them once it has finished.

Nursery collections are usually relatively short affairs, and can be sped up even further by using a smaller nursery size. Major collections, on the other hand, can take a long time because they must scan every single live object in the system. Long garbage collection pauses are clearly a serious problem for many applications, be they interactive or on the server, and garbage collector engineers make considerable efforts to shorten pause times.

One approach to this is to do parts of the major collection concurrently with the running program, aka the mutator. SGen’s major collector, mark-and-sweep, consists of two main phases, unsurprisingly called mark and sweep.

Thanks to Zoltan’s recent work—lazy sweep—SGen already does most of sweep’s work outside of major collection pauses. I have been working on the other part of the story for the past few weeks: concurrent mark.

How it works

Our implementation of concurrent mark is very similar to that of Sun JVM’s concurrent mark-and-sweep (CMS) collector.

A major garbage collection starts with an initial marking pause, during which all roots are scanned.[1] Then it restarts the world and in the background a worker thread performs the marking and scanning of major heap objects, in the usual manner. When the concurrent marking phase is done, the collector stops the world again to finish up the collection in the final marking pause.

Dealing with concurrent mutation

Consider the following code, which manipulates a linked list, swapping the order between the two nodes following a:

1   b = a->next;
2   c = b->next;
3   a->next = c;
4   b->next = c->next;
5   c->next = b;

Let’s assume that right after a->next has been set in line 3, the concurrent worker thread scans and marks the node referenced by a, then follows next and processes c. If at that point our program hasn’t come around to setting c->next to b in line 5, the garbage collector will never encounter b—it will have been lost because for a brief moment that node wasn’t part of the list.

We can recruit the help of our friend the write barrier to deal with this issue, but we need to extend its scope a little: instead of just recording stores that point from the major heap to the nursery, it must now also record stores pointing from the major heap to the major heap. In other words: all reference stores that happen in the major heap, no matter where they point to, must be recorded.[2]

With this new write barrier the store to c->next will record the node c as requiring a re-scan. This re-scanning of all modified major heap objects happens in the final marking pause, so it’s not further disturbed by the mutator.

What if the final marking pause happens before c->next is set in line 5, however? The node b might not be pointed to by any object in the heap, so we won’t find it that way, yet it is still live. It’s live because it’s referenced by the variable b on the stack, so the final marking pause must also re-scan all roots.[3]

Nursery collections

Nursery collections that happen while concurrent mark is running stop the world, but concurrent mark keeps chugging along. It never follows references into the nursery anyway, so there is little potential for conflict.

There is an issue with the card table, however: nursery collections clear the card table, but its data is required during the final marking pause, which might only occur a few nursery collections later. The solution to this is to maintain a separate card table that accumulates the marked bits from the “regular” card table, and to use this union table in the final scan.


This graph shows maximum pause times for a few of the programs in our GC benchmarks suite. All of the pause times are relative to Boehm’s, so values below 1 are better than Boehm’s. Note that the values correspond to the longest stop-the-world pause that occurred when running the program, not the median or some average.

The three SGen configurations are standard mark-and-sweep in blue, concurrent mark-and-sweep in green and concurrent mark-and-sweep in conjunction with cementing, in yellow.

The programs represented in this graph are those whose maximum pause times with Boehm were longer than 10ms. The ones with the longest pause times are on the left—on graph4 Boehm had a pause of 2.5s—which are those that benefit most from concurrent mark, as expected: the longest pause on graph4 is less than 180ms with concurrent mark. I have not yet investigated why graph8 is such an outlier.

  1. For concurrent mark the roots not only include the usual suspects like thread stacks and global variables, but also nursery objects. In fact, the initial marking pause performs a nursery collection, after which only pinned objects will be left in the nursery. Those are treated as roots, or rather the references they contain are.  ↩

  2. Since we use a single card table for both kinds of references, this will increase the number of cards marked that don’t have references pertinent to nursery collections. The card scanning overhead will therefore increase, but in practice this doesn’t seem to be a problem. An alternative would be to use two separate card tables, one for references to nursery objects and the other for references to major heap objects. That would be more expensive both in terms of memory usage as well as in write barrier performance.  ↩

  3. This entails doing a full mark-scan loop, of course, which might still take a very long time in pathological cases. Imagine a program that keeps a huge linked list around, referenced only by a single object on the major heap. At some point concurrent mark kicks in. The program now loads the reference to the list into a local variable and nulls the reference on the heap. The garbage collector will not be able to mark the list until it scans the stack in the final marking pause, and will now have to traverse the list with the world stopped.  ↩

December 18, 2012

SGen – The Write Barrier

Filed under: Software — Tags: , , — schani @ 5:05 pm


In previous installments of my blog post series on SGen, Mono’s generational garbage collector, I wrote about the major and minor heaps and their organizations. What I have for the most part neglected to pontificate on is how they interact with each other.

The advantage of generational garbage collectors comes from being able to collect the minor heap, also referred to as the nursery, very quickly because it doesn’t have to look at the objects in the major heap. There is a problem, however, which this illustration from my talk on SGen explains:

Let’s say we collect the nursery, starting from the single root reference. We find a graph of four live objects, colored blue, and promote them to the major heap, discarding all the others, including the three white ones. But wait—there’s a reference in some object on the major heap pointing to one of the objects we have just destroyed! This cannot stand.

Where did this reference come from? Consider the previous nursery collection. We’ll ignore pinning for now, so after that collection the nursery was empty. The mutator (that’s what garbage collector people affectionately call the pesky thing that insists on messing with the heap and that most people refer to as the program) allocated that white object, and stored a reference to it in an object on the major heap—in that order, but not necessarily in direct succession.

One way to handle those references is to scan the whole major heap for them and to treat them as roots. That would make nursery collections as slow as major collections, so it’s not a good idea. What’s needed is a way to record all such references happening between nursery collections. The nursery collector would then process them like ordinary roots and everything would be in order again.

Write barriers

Write barriers[1] do exactly that: On every store of a reference in an object they check whether the reference resides in the major heap and points to the nursery. If those conditions are satisfied, the reference is recorded for the benefit of the next nursery collection. This sounds like an awful overhead for each store, but it really isn’t, and for almost all applications it’s far outweighed by the advantages from generational collection.[2]

A note on terminology: The literature sometimes uses “write barrier” to refer to just the code that identifies references to remember, and sometimes to refer to both the code that identifies as well as records those references. I shall be using the term in the latter sense.

SGen’s write barrier is a so-called card table. We also implement an alternative, a sequential store buffer, which I will not discuss here.

Card tables

The basic idea is to conceptually divide the major heap into small, equally sized regions, called cards, and to have one bit per card that is set—the card is marked—when a reference to a nursery object is written to it.[3]

The advantages of this scheme are that it requires only a handful of instructions to record a store, so it can be inlined, and that its space overhead is small and proportional to the heap size. The disadvantage is that it doesn’t record the exact locations of the references we’re interested in.

In SGen’s case, cards are 512 bytes large, so the nursery collector needs to scan roughly that amount of memory to potentially locate a single reference.[4] The references it does find it treats like ordinary roots: the referenced objects are promoted to the major heap and the references are updated to reflect their new locations. The newly promoted objects are then scanned in turn, of course.

Once all marked cards have been scanned there are no more references from the major heap to the nursery, so all cards mark bits are cleared, to be ready for the next round.

Pinned objects

So far we have been assuming that when we encounter a reference from the major heap to the nursery, we can always move the nursery object to the major heap, thereby rendering the bothersome reference harmless.

As I explained in the blog post on the nursery, sometimes objects become pinned, which implies that they cannot be moved. This threatens to confound our cunning plan: we can only clear the card mark bits because there are no references left that we need to keep track of. The solution here is simple: Cards containing references to pinned objects must stay marked so that they will be processed again during the next nursery collection, in the hopes that the object referred to will no longer be pinned.

This can lead to another potential problem: Assume that some nursery object remains pinned, and that there are lots of objects in the major heap that refer to it.[5] Lots of cards would remain perpetually marked and nursery collections would become dominated by card scanning. That is in fact what we noticed on two benchmarks in our small GC benchmark suite—nursery collections took much longer than expected and the programs took twice as long or more to run with SGen than with Boehm.


I just implemented a new feature—not yet pushed to master—that solves this new problem in a very simple way. I call it “cementing”, because it institutionalizes bothersome pinned objects: if we encounter a certain number of references from the major heap to a particular pinned nursery object, we just give up on it. Instead of hoping that it will become unpinned soon, we cement it—the punishment for badly behaved pinned objects. A cemented object will always be pinned at the start of future nursery collections as a matter of principle. The consequences of this are that it will be treated as a root, and that it will not be moved, which in turn means that we don’t have to worry about references to it from the major heap any more: the object is kept alive and the references can stay the same. Cards with references to cemented objects can be cleared, reducing the card scanning load.

Major collections give us a blank slate because we need to scan all live major heap objects anyway, so we use them as opportunities to reset cementing, giving those poor objects another chance after all.

The preliminary results from cementing are encouraging. It doesn’t seem to introduce performance regressions, and for the two benchmarks affected, it works wonders:

The red bars represent the wall clock run time with Boehm. The green bars are SGen run times without cementing, and blue are SGen with cementing.

  1. There is at least one other meaning for the term “write barrier”, namely a machine instruction (or series of instructions) that ensures that all stores preceding it are seen by other threads before the stores following it. Those are not the barriers we’re looking for.  ↩

  2. An alternative way of solving the reference problem is a read barrier. Read barriers make it possible to write moving concurrent collectors, a magnificent example of which is Azul’s C4 collector. If you want a glimpse of the future of garbage collection, read the paper.  ↩

  3. SGen uses a whole byte per card for marking. Not only can whole bytes be written easier than single bits, but multiple threads can do so without conflicts between contiguous ones. If we used individual bits, we’d have to use CAS to avoid losing marks by other threads.  ↩

  4. Nitpickers will notice that a marked card might not even contain a single reference to a nursery object—the mutator can write a reference, hence mark the card, and then later overwrite it with null, for example.  ↩

  5. A simple way to trigger this is to allocate some object high up on the call stack, keep a reference to it on the stack, and then to construct a long list, with every node containing a reference to said object. Keep the list around and its nodes will eventually be promoted, staining the major heap with perpetually marked cards.  ↩

November 16, 2012

Flame Graphs for Instruments

Filed under: Software — Tags: , , — schani @ 10:02 pm

Flame graphs are a great way to visualize stack-trace based profiling data. Brendan Gregg’s scripts for generating these graphs accept, among other options, output produced by DTrace. Unfortunately, OS X’s DTrace, when accessed through the command line, contains a bug that prevents it from producing stack traces in most cases. XCode Instruments, on the other hand, is not compromised in this way, but it will not give output similar to dtrace.

What it does provide is an export functionality which generates CSV files containing the exact contents of the details pane, so I wrote a script to do the intermediary work.

To produce the CSV file expand all nodes in the call tree (Option-click all the roots) and use the Instrument -> Export Track menu item. Then convert the CSV to the format Brendan’s script understands and feed it with it:

./ out.csv | ./ >mygraph.svg


For an alternative view you can invert the call tree in Instruments and you’ll get a correspondingly inverted flame graph, i.e., the flames go up from callee to caller, as opposed to the other way around when the call tree is not inverted.

November 2, 2012

SGen and DTrace

Filed under: Software — Tags: , , , — schani @ 1:57 am


SGen is Mono’s generational garbage collector. DTrace is a programmable dynamic tracing framework and is available for OS X and Solaris, among others.

DTrace provides an interface for applications to define their own probes which can then be observed, and actions triggered. We have recently added a few probes to SGen and I’d like to show a few simple things that can be done with them.

The program I’m using to demonstrate DTrace here is IronPython, running pystone. SGen’s DTrace probes are currently only available on OS X, on mono master.

Garbage collection times

How long do garbage collections take for a specific workload?

1:  mono$target:::gc-begin {
2:          self->ts = timestamp;
3:  }
5:  mono$target:::gc-end {
6:          @[arg0] = quantize ((timestamp - self->ts) / 1000);
7:  }

When a garbage collection starts, we save the current time, and when it ends we gather the time elapsed since then in microseconds. The results:

value  ------------- Distribution ------------- count
    4 |                                         0
    8 |                                         1
   16 |                                         0
   32 |                                         1
   64 |                                         1
  128 |                                         0
  256 |                                         0
  512 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    90
 1024 |                                         1
 2048 |                                         0
 4096 |                                         0
 8192 |@                                        3
16384 |                                         1
32768 |                                         0

value  ------------- Distribution ------------- count
   16 |                                         0
   32 |@@@@@@@@@@@@@                            1
   64 |                                         0
  128 |                                         0
  256 |                                         0
  512 |                                         0
 1024 |                                         0
 2048 |                                         0
 4096 |                                         0
 8192 |                                         0
16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              2
32768 |                                         0

The graph for 0 shows collection times for the nursery, the graph for 1 shows times for the major collections. We can see here that the majority of nursery collections took about half a millisecond, with 4 outliers taking around 10 ms. There were 3 major collections, two of which took around 16 ms and the third one barely took any time at all.

Apple’s Instruments profiling application gathers data using DTrace and can be extended with custom DTrace scripts. Here’s what a visualization of SGen collector pause times looks like:

The dots lower down in the scatter diagram are likely nursery collections, the higher up ones major collections.

GC lock

Not counting garbage collections, how often and for how long is the GC lock held?

 1:  mono*:::gc-locked {
 2:          self->ts = timestamp;
 3:  }
 5:  mono*:::gc-begin {
 6:          @ = quantize ((timestamp - self->ts) / 1000);
 7:  }
 9:  mono*:::gc-end {
10:          self->ts = timestamp;
11:  }
13:  mono*:::gc-unlocked {
14:          @ = quantize ((timestamp - self->ts) / 1000);
15:  }

The GC lock is taken shortly before collections start, and released shortly after they finish. It’s also taken in a variety of other cases. Naturally we want to avoid holding it for long.

value  ------------- Distribution ------------- count
    0 |                                         0
    1 |@@@@@@@@@@@@@@@@@@@@@@@@@                5462
    2 |@@@@@@@@@@@@@                            2789
    4 |                                         97
    8 |                                         49
   16 |@                                        192
   32 |                                         12
   64 |                                         2
  128 |                                         0
  256 |                                         0
  512 |                                         0
 1024 |                                         1
 2048 |                                         0

SGen takes the lock quite often, but apart from one outlier at about 1 millisecond, it’s released very quickly. Further analysis (not presented here) shows that the outlier is a result of thread-pool initialization during startup, when a few objects are allocated pinned, requiring allocating some memory from the operating system.

Objects allocated

How many objects of which classes are allocated?

 1:  mono*:::gc-nursery-obj-alloc {
 2:          @[strjoin (copyinstr (arg2), strjoin(".", copyinstr (arg3)))] = count ();
 3:  }
 5:  mono*:::gc-major-obj-alloc-large {
 6:          @[strjoin (copyinstr (arg2), strjoin(".", copyinstr (arg3)))] = count ();
 7:  }
 9:  mono*:::gc-major-obj-alloc-pinned {
10:          @[strjoin (copyinstr (arg2), strjoin(".", copyinstr (arg3)))] = count ();
11:  }
13:  mono*:::gc-major-obj-alloc-degraded {
14:          @[strjoin (copyinstr (arg2), strjoin(".", copyinstr (arg3)))] = count ();
15:  }
17:  mono*:::gc-major-obj-alloc-mature {
18:          @[strjoin (copyinstr (arg2), strjoin(".", copyinstr (arg3)))] = count ();
19:  }

SGen has probes for five different kinds of object allocation, hence the repetition. User-space pointers to the namespace and class name are passed in the third and fourth argument, respectively, so we concatenate those with a dot in the middle to give the full class name.

Whenever possible SGen uses the managed allocator – a small managed piece of code to quickly allocate objects in the nursery, which handles the most common case. The managed allocator is not instrumented with a DTrace probe, however, so for this script to function properly the managed allocator must be turned off via the MONO_GC_DEBUG environment variable.

IronPython.Runtime.PythonDictionary                          500014
IronPython.Runtime.List                                      500067
IronPython.Runtime.Calls.Method                             1000002
System.Object[]                                             1505064
System.Int32                                               15499015

IronPython boxes lots of integers.

Objects pinned in the nursery

How many objects of which types are pinned during nursery collections?

1:  mono$target:::gc-obj-pinned /arg4==0/ {
2:          @[strjoin (copyinstr (arg2), strjoin (".", copyinstr (arg3)))] = count ();
3:  }

This is very similar to the preceding script, but we use the pinned object probe and a guard to select for the nursery, as opposed to the whole heap.

Microsoft.Scripting.Actions.DynamicSiteTarget`3                  863
IronPython.Runtime.Calls.CallTarget1                             864
System.Object[]                                                 1271
Microsoft.Scripting.Actions.CallSite`1                          1350
System.String[]                                                 1440

Here we see the pin counts for the 5 most pinned classes during nursery collections. Combining all nursery collections, for example, 1440 string arrays were pinned.

What if we’d like to know how many bytes of memory are pinned in the nursery during each collection?

 1:  mono$target:::gc-begin /arg0==0/ {
 2:          bytes = 0;
 3:  }
 5:  mono$target:::gc-obj-pinned /arg4==0/ {
 6:          bytes += arg1;
 7:  }
 9:  mono$target:::gc-end /arg0==0/ {
10:          @ = quantize (bytes);
11:  }

All we have to do is add up the sizes of all pinned objects for nursery collections.

value  ------------- Distribution ------------- count
   32 |                                         0
   64 |                                         1
  128 |@                                        2
  256 |                                         0
  512 |                                         0
 1024 |                                         0
 2048 |                                         0
 4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  95
 8192 |                                         0

This looks a bit suspicious. We had about 100 nursery collections, so according to the object counts above there were on average 13 object arrays and 14 string arrays pinned per nursery collection, in addition to lots of other objects. All of those supposedly amount to only about 4 kilobytes of memory. Are we counting something wrong or are these very short arrays? Let’s see:

1:  mono$target:::gc-obj-pinned /arg4==0/ {
2:          @[strjoin (copyinstr (arg2), strjoin (".", copyinstr (arg3)))] = quantize (arg1);
3:  }

This gathers object size histograms per class for all objects pinned in the nursery. These are the parts of the output giving the counts for the arrays we’re interested in:

         value  ------------- Distribution ------------- count
             8 |                                         0
            16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    1344
            32 |@@@                                      96
            64 |                                         0

         value  ------------- Distribution ------------- count
             8 |                                         0
            16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              870
            32 |@@@@@@@@@@@@@                            409
            64 |                                         0

Apparently those arrays really are very short, all around 16 to 32 bytes long. But maybe the probe is lying to us about the object size? If we know about Mono’s internal object layout we can read out the array length directly from the heap1:

1:  mono$target:::gc-obj-pinned /arg4 == 0 && copyinstr (arg3) == "Object[]"/ {
2:          this->l = *(int*)copyin (arg0 + 12, 4);
3:          @[arg1] = lquantize (this->l, 0, 8);
4:  }

On 32-bit Mono the length of an array is at offset 12, so we read 4 bytes from there and cast it to an int. We get a few graphs like these:

value  ------------- Distribution ------------- count
  < 0 |                                         0
    0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 96
    1 |                                         0

value  ------------- Distribution ------------- count
    2 |                                         0
    3 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2
    4 |                                         0

value  ------------- Distribution ------------- count
    0 |                                         0
    1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 481
    2 |                                         0

Each of them gives us the distribution of the array lengths for array objects of specific sizes. If everything is correct then for each array object size there must only be one array length, which is exactly what we get. For example, all 481 array objects of size 20 bytes have a length of one.

In fact, the numbers show that the sizes we get are exactly what we should expect given Mono’s object layout on 32-bits: A reference array with a given length has a size of 16 + 4*n bytes.


1 Don’t try this at home!

January 13, 2012

The Schism

Filed under: Photography — Tags: — schani @ 6:16 pm

I’ve been a very bad blogger recently. Part of the reason is certainly laziness, but there’s also the fact that I felt more motivated to write about photography recently than about software whereas this blog has become focused almost exclusively on software, so it didn’t feel like it would fit in anymore.

For this reason I have decided to start a dedicated photography blog, creatively named Mark’s Photography Blog. I’ve started it off with a post on 10 of my favorite images from 2011.

What that means for this blog is that, for the rare occasions on which I will post, I shall focus it almost exclusively on software, which is also reflected in its new awesome title. I’ll also try to overcome my aversion to writing smaller, less epic posts, which will hopefully also result in a higher posting frequency.

On the more epic side of things, I still owe you one or two more posts on SGen, and I’d like to write a post on the lock-free allocator we’re now using in SGen, but which is independent of it and is pretty cool. What can it be used for? Dynamic memory allocation in signal handlers, for example.

June 22, 2011

The ICFP Programming Contest 2011

Filed under: Software — Tags: , , , — schani @ 10:21 pm


Last weekend saw yet another iteration of the ICFP Programming Contest, an event the glorious team Funktion im Kopf der Mensch could not miss.

The task

The task this year was to build a program that would play a card game, Lambda The Gathering, to be run on the contest organizer’s machines. The game is played with two players each, taking turns. In each round, a player must play one out of several kinds of cards, for each of which there is an infinite supply. Each player owns 256 slots, each of which has some “vitality”, initially 10000. If a slot’s vitality drops to zero it is dead and cannot be used anymore (unless it is explicitly revived). The object of the game is to kill all the opponent’s slots.

Apart from its vitality, a slot also holds a value, which is where the real action happens. A value can be either a number or a function. The cards in the game also represent functions, and the number 0. When a card is played, it must be played to a specific slot, meaning that the card’s value and the slot’s value are combined. The combination can be either that the slot’s value is applied to the card (right) or the other way around (left). Initially, each slot holds the identity function, I.

As an example, let’s say we want to generate the number 4 in a slot. Initially it will hold I, so we play zero right, giving I(zero), which is reduced to 0. Now we play the function succ, which takes a number and returns that number plus one, left, giving succ(0), which reduces to 1. In each of the next two turns we play the card dbl left. dbl is a function that doubles a number, so we end up with 4.

To actually do something, like attacking your opponent’s slots, there are cards whose functions have side effects. One of them is attack, which you supply with three numbers (it’s a curried function). The first is the number of one of your slots, the second the number of one your opponent’s slots and the third is the “strength” of the attack. Once the function has all of its arguments it will subtract the third number from your slot’s vitality and then subtract nine tenths of that number from your opponent’s slot. There is another function, help, which you can use to top up your vitality, so if you first help and then attack with the right strengths you will have damaged your opponent without having lost any vitality yourself.

Also among the cards in the game are S, K and I, representing the homonymous functions of the SKI combinator calculus. These three simple functions, together with function application, are Turing complete, which means that it is possible to build programs in slots.


Actual programs can potentially damage the opponent much quicker than would be possible by building each attack by playing the respective cards anew. (Note that I’m talking about the programs (functions) in the game’s slots now, not about the program that plays the card game).

There is a limit to how long a slot’s function is allowed to “execute”, however, which is 1000 function applications. If it has not finished until then, execution is stopped and the slot’s value is reset to I. Any side effects stay, however, so an endless attack loop would do damage, though only a limited amount.

There is a good reason for not using endless loops, however, namely that, since the slot’s value is reset, the function is “lost”, so it has to be copied to another slot first (which is possible with the get function, but somewhat costly, since the function’s slot number has to be built first) or, even worse, generated again.

A better strategy is to use a function that does some more or less fixed amount of damage and then returns itself, or some slightly different variant of itself. The actual result of the function will, after all, be the new value of the slot. So, suppose we could build a function like this:

1:  function make_killer (slot) {
2:    return function (dummy) {
3:      kill_opponent_slot (slot);
4:      return make_killer (slot + 1);
5:    }
6:  }

If we generate this function and give it the number 0, it will result in a function that, given a dummy argument, will kill the opponent’s slot 0 and return a function that when called with a dummy argument will kill the opponent’s slot 1 and return a function that when called with a dummy argument etc. In other words, once we have this function we can kill one enemy slot per turn. In comparison, just launching a single attack without building more complex functions requires more than 50 turns, and that’s not sufficient to kill an enemy’s slot. If you factor in the help calls to top up your own vitality so as not to die quicker than the enemy you’re looking at more than 200 rounds just to kill a single slot of the enemy (assuming it has the initial vitality of 10000).


Unfortunately, it’s not completely trivial to write a function like this in SKI calculus. In fact, compared to SKI, x86 machine code looks incredibly high level. Fortunately, there is help. As luck would have it, I wrote a Scheme subset to Unlambda compiler many years ago. Scheme is essentially Lambda calculus and Unlambda is essentially SKI. In other words I was already a bit familiar with both SKI as well as the compilation algorithm (which is quite simple).

Here is the function from above in very simple Scheme:

1:  ((lambda (x) (x x))
2:   (lambda (self)
3:     (lambda (slot)
4:       (lambda (dummy)
5:         (kill-opponent-slot slot)
6:         ((self self) (succ slot))))))

Since there are no function names in SKI or pure Lambda calculus we have to achieve self-reference using a trick, namely the function (lambda (x) (x x)), which applies its argument to itself, so if you pass it your function then that function gets itself as its (first) argument, in this case named self. Note that I am assuming that we have sequential execution available, to first kill the opponent’s slot and then return the function for the next slot. This is not natively available in Lambda calculus, which doesn’t even deal with side effects, but can be implemented like so:

1:  ((lambda (dummy)
2:     do-this-afterwards)
3:   do-this-first)

How easily can we generate an attack function like the one above in a slot? The first step is to translate Lambda to SKI calculus. The simple algorithm (given in the task description) produces ridiculously huge programs. The problem is that since SKI calculus doesn’t know named function arguments the arguments must be passed downwards “manually” from where they originated. The simple algorithm passes them down all the way to all the leaves of the syntax tree, where they might or might not be needed. By cutting off the passing down at subtrees which don’t need the argument a significant reduction is achieved, but program size is still an issue, and depends a lot on how deeply nested the original function is.

Now that we have the function in SKI form, can we trivially play cards to summon it to a slot? It turns out that this is only possible in simple cases because a player’s turn can only apply a card left or right, so nesting on both sides of an application is not trivially possible. For example, it’s easy to generate S(S(SK)), starting from I, by applying K right and the S left three times. S((SS)K) also works: S right, S left, K right and S left. However, (SS)(KK) is not doable that easily.

There is a trick, though: Assume we already have some arbitrarily complex x, and we want x(MN). If we apply K left, S left and then M and N right we end up with ((S(Kx))M)N which, thanks to the way the S and K combinators work, reduces to x(MN).

This trick still only works if M and N are single cards. It can be used recursively however, and, using yet another similar trick, arbitrarily nested right sides can be produced, but at a great increase in the number of cards that have to be played. Before we figured out those tricks we came across a simpler general solution, using the card get, which, when given a slot number, reduces to the value that’s in that slot. If we use get and zero as the cards M and N in the trick above, the function will reduce to xy, with y being whatever is in slot 0. The generation algorithm built using this trick has to use more than one slot, and it depends vitally on slot 0, but it is reasonably simple and the overhead is small.

It looks as if we have all the pieces of the puzzle together now and can finally generate a function that does some serious damage quickly, but there is another problem. Let’s say we want to generate a function which, when given a dummy argument does the side effect of inc zero (the details of what inc does are not important here). In Scheme, that is (lambda (d) (inc zero)). Translated to SKI calculus it is K(inc zero). To generate it in a slot (assuming it holds I) we apply zero right, then inc and K left. After having applied the inc the slot’s function is inc zero, though, which is immediately reduced, thereby doing the side effect and giving as a result the function I, so after applying K left we end up with K(I) and a side effect we only wanted after giving the dummy argument.

Clearly, we need a way to “guard” side effects against immediate execution. The easiest way we found to do this is to build an equivalent function, but in a different way. From a completely functional point of view, i.e. ignoring possible side effects, the functions

1:  K(inc zero)


1:  (S(K inc))(K zero)

are the same – they both take an argument that is ignored and reduce to inc zero, but in the latter form the inc zero can only be reduced once the dummy argument is given. Let’s call such a function a “side effect function” – it requires a dummy argument and only once that is present will it produce a side effect.

What if we want to combine two such side effect functions, resulting in a new side effect function that does one first, then the other? It seems tempting to just apply the second to the first, like this:

1:  (lambda (d)
2:    (second-se-fn (first-se-fn d)))

The error in this approach is that even though (first-se-fn d) can only be reduced once the argument d is present, it is still, in terms of the SKI calculus, a function, and therefore a value, so second-se-fn can be reduced, even without d. What we need instead is a function that takes an argument and then applies first one function to that argument and then a second one. As luck would have it, that function is S. So, to combine two side effect functions, yielding a new side effect function, we have

1:  ((S first-se-fn) second-se-fn)

More strategy

With all of that in place we can come back to our self-returning slot-increasing attack function. As it turns out, passing around that extra slot number argument is quite an overhead when everything is translated down to SKI calculus, but we had already thought of a simpler solution. Instead of having the slot number inside the function we can use indirection: We store the number of the slot to be attacked in some other slot and use the get function to retrieve it from within the attack function. The disadvantage of this approach is that we need to increment the slot number after each attach, costing us one extra turn, but on the other hand we can switch to another slot to attack reasonably quickly.

After some optimizations, some of them directly on the generated SKI code, we managed to get the number of turns required to load this function down to below 180. After that we just have to load zero into a slot and we can kill one enemy slot every other turn.

One small detail I omitted above is that when attacking a slot the number you give will actually be inverted, i.e. giving 0 will attack slot 255, and vice versa. That makes it easy to attack high slot numbers but more difficult to attack lower ones, because higher numbers take more time to generate. Nonetheless, our bot doesn’t actually start its attack from the highest slot, as would be easiest. It first figures out which opponent slot holds the most complex function (using the trivial metric of counting the leaves of the syntax trees) and attack that first, then increment until we hit slot 0, then we start again from the highest slot, hopefully finishing off the opponent.

We did not have much time left to refine our bot beyond that. There are very few measures we take to defend ourselves. If one of our few vital slots is killed while we are generating our attack function, for example, we just revive the slot and the start over with generating the function. A smarter bot would try to salvage the pieces left in the other slots. If the slot we take the vitality from to launch attacks is killed, we’re dead in the water – we had code to help it get back into working condition, but we failed at the last minute to integrate it.

Overall, I think we’re doing quite well. We launch our first attack within about 200 rounds and can finish off a bot that doesn’t defend itself within a bit more than 500 turns after that. Against the best teams, however, we’re helpless – they can kill all 256 of out slots stone dead in less than 190 rounds, starting from scratch. Kudos!

Contest organization

In the past I have taken contest organizers to task for screwing up in minor and major ways. This year, I am happy to say, there was almost nothing to find fault with. The task was clear, not artificially obscured, many-layered, interesting and enjoyable. There was a test submission server but teams didn’t depend on it and it worked quite well in any case.

So, from Team Funktion im Kopf der Mensch a huge Thank You to the organizers for this amazing job! May all future contests be organized this well.

Programming languages

Last year was the first year we used Clojure after using OCaml almost exclusively in the years before. This time we stepped back a bit and used OCaml for the actual bot, i.e. the program that will run on the organizer’s servers. The main reasons for this were that OCaml has a much smaller footprint and more predictable performance than Clojure and we wanted to make sure we didn’t hit against any limits, and that we felt somewhat safer writing an unsupervised program when protected by OCaml’s type system.

We still used Clojure for developing the attack functions, starting from Lambda calculus, going down to SKI and then generating a sequence of card application.

Our visualizer was written in C.

All of these languages did their respective jobs well. No complaints this time.

The Code

All the code we wrote for this year’s contest is available on GitHub. If you need any assistance with it, please email me.

February 22, 2011

SGen – Finalization and Weak References

Filed under: Software — Tags: , , — schani @ 3:41 pm


In this installment of my series on SGen, Mono’s new garbage collector, we shall be looking at how finalizers and weak references are implemented, and why you (almost certainly) should not use finalizers.

Tracking object lifetime

Both finalizers and weak references need to track the lifetime of certain objects in order to take an action when those objects become unreachable. To that end SGen keeps lists of finalizable objects and weak references which it checks against at the end of every collection.

If the object referred to by a weak reference has become unreachable, the weak reference is nulled.

If a finalizable object is deemed unreachable by the collector, it is put onto the finalization queue and it is marked, since it must be kept alive until the finalizer has run. Of course, all objects that it references have to be marked as well, so the main collection loop is activated again.


A nursery collection cannot collect an object in the major heap, so it is not necessary to check the status of old objects after a nursery collection. That is why SGen keeps separate finalization and weak reference lists for the nursery and major heaps.

Invoking finalizers

SGen uses a dedicated thread for invoking finalizers. The finalization queue is processed one object at a time. As long as an object is in the finalization queue it is also considered live, i.e. the finalization queue is a GC root.


Resurrecting an object means making it reachable again from within its finalizer or a finalizer that can still reach the object (or via a tracking weak reference). The garbage collector does not have to treat this case specially—until the finalizer has run the object is considered live by virtue of its being in the finalization queue, and afterwards it is live because it is reachable through some other root(s).

Tracking weak references

If an object is weakly referenced and has a finalizer, the weak reference will be nulled during the same collection as the finalizer is put in the finalization queue. That is not always desirable, especially for objects that might be resurrected.

Tracking references solve the problem by keeping the reference intact at least until the finalizer has run. Once the finalizer has finished, a tracking reference acts like a standard weak reference, i.e. it will be nulled once the object becomes unreachable, typically during the next collection (unless the object was resurrected).

When SGen encounters a tracking reference, instead of nulling it, it turns it into a non-tracking reference. The referenced object is now on the finalization queue and therefore considered live again, so the reference will not be nulled before the finalizer has run.

Why finalization is Evil

Finalization should not be used to manage scarce resources, such as file descriptors.

Time of finalization is not determined

Unless you force a garbage collection and wait for the finalizers to finish, the time at which they are run is not determined. If your program does not allocate a lot of memory, or your heap is huge, it might take a very long time until a major collection is triggered, so dead objects on the major heap might not be finalized in time before your scarce resource is depleted.

In this highly recommended talk (starting at 8:35) Cliff Click gives an example where it was necessary to put a hack into the JVM garbage collector that triggered a collection whenever the system ran out of file handles, because Apache Tomcat relied on finalizers to reclaim them.

Finalizers run one after another

A finalizer that performs a time consuming task will delay the execution of other finalizers. An especially worrisome case is finalizers that do potentially blocking I/O. Depending on the timeout of the operation, other finalizers can be blocked for a long time.

Finalizers make objects live longer

Since finalizable objects are considered live until their finalizers have run, they also consume memory until their finalizers have run and the next garbage collection is triggered. This not only applies to the finalizable objects themselves but to all objects they reference as well. This is particularly problematic if finalization is blocked by an ill-behaved finalizer.

Order of finalization is not determined

A finalizer cannot count on other finalizers running before or after it, irrespective of whether it references or is referenced by those objects or not. In other words, a finalizer cannot assume that an object it references has not been finalized yet. That is true even if that object is known to be strongly held by a GC root, namely when shutting down.

An exception to this are critical finalizers, which are run after “normal” finalizers, but between normal finalizers there is no determined order, neither between critical finalizers.


It turns out that SGen’s handling of tracking references, as described above, is not only conceptually wrong, it was also incorrectly implemented, which, ironically, mostly fixed the bug in the design.

Our conceptual mistake was to demote tracking references to non-tracking ones the first time the object became unreachable, which would have led to the reference being non-tracking even if its target was re-registered for finalization. The bug in the implementation was that we didn’t actually do that. Instead, SGen would keep tracking references around until their targets became really, truly unreachable, even via finalization, i.e. until it was done finalizing, without any further chance of resurrection. Only at that point would it turn the tracking reference into a non-tracking one, thereby making the object alive once again and keeping it around for one more garbage collection cycle.

The bug is now fixed. Thanks to Alan for noticing this case and investigating.

Older Posts »

The Shocking Blue Green Theme Blog at


Get every new post delivered to your Inbox.