Associated Types Considered Weird

In my first post on Swift I expressed mild bewilderment at Apple’s decision to make protocols generic not by giving them generic type arguments, like C# does, but via what they call “associated types”. To reiterate the example, the Sequence protocol in Swift is defined as

protocol Sequence {
    typealias GeneratorType : Generator
    func generate() -> GeneratorType
}

whereas if Swift had generic type arguments in protocols (as it does in classes), the definition would look more like

protocol Sequence<T> {
    func generate() -> Generator<T>
}

When playing around with a set implementation that Alexis Gallagher posted, I came across a problem with associated types: they expose more type information than the user of a class needs, and worse, more than the user of a class should have.

Here’s how Alexis implemented the generate method on his set type:

func generate() -> SetGeneratorType<KeyType> {
    let dictGenerator = self.dictionaryOfItems.generate()
    return SetGeneratorType<KeyType>(dictGenerator)
}

The type that generate is actually required to return (to conform to the Sequence protocol) is Generator. It’s not possible, however (as far as I can tell), to declare generate as such:

func generate() -> Generator

The reason is that Generator here is under-specified: it must be a Generator whose associated type Element is KeyType, but there is no way to express that. If Swift had generic type arguments for protocols, it would simply be Generator<KeyType>.

The problem here is that the set type needs to publicly expose an implementation detail, namely the concrete type by which it implements its Generator. If that implementation detail changes, the method’s type changes, and code might break.

This problem is exacerbated by the fact that users of the generator might have to depend on its type. As an example, look at my simplification of Alexis’ code:

typealias GeneratorType = MapSequenceGenerator<Dictionary<KeyType,
                                                          Bool>.GeneratorType,
                                               KeyType>

func generate() -> GeneratorType {
    return dictionaryOfItems.keys.generate()
}

The body of the function explains the weird GeneratorType: The dictionary has a keys property that is exactly what I need—a sequence of all the elements in the set. The type that I’m returning now is the generator type of that sequence, instead of my own, and I must specify the concrete return type. The only way to get the concrete type, however, is to look up the type that keys returns, which means that I now depend on an implementation detail of Dictionary, and, making matters worse, I’m surfacing it to my users.

For comparison, let’s look at how this would work if Swift had generic type arguments for protocols. Sequence and Generator would be defined as

protocol Sequence<T> {
    func generate() -> Generator<T>
}

protocol Generator<T> {
    func next() -> T?
}

Dictionary’s keys property would be

var keys: Sequence<KeyType> { get }

And this would be our set’s generate method:

func generate() -> Generator<KeyType> {
    return dictionaryOfItems.keys.generate()
}

No leaked implementation details and no unneeded dependencies.

Generators in Swift

In my last post on Swift I examined sequences and generators. Today we’ll implement our own sequence.

Many languages have a function like Clojure’s interpose, which takes an item and a sequence, and inserts the item between each pair of consecutive elements of the sequence:[1]

Array(interpose("in", ["San Francisco", "California", "the USA"]))
=> ["San Francisco", "in", "California", "in", "the USA"]

We need three things here: the interpose function itself, the Sequence it must return, and the Generator that sequence produces, which does the actual work of generating the items.

Let’s start with the Generator:

struct InterposeGenerator<G: Generator, T where T == G.Element>
        : Generator {
    let sep: T
    var gen: G
    var needSep: Bool
    var nextOrNil: T?

    init (sep: T, gen: G) {
        self.sep = sep
        self.needSep = false
        self.gen = gen
        self.nextOrNil = self.gen.next()
    }

    mutating func next() -> T? {
        if needSep {
            needSep = false
            return sep
        } else {
            let n = nextOrNil
            if n {
                nextOrNil = gen.next()
                needSep = nextOrNil != nil
            }
            return n
        }
    }
}

InterposeGenerator is a little state machine that alternates between producing either an element of the original generator or the separator. A bit of care is required to make sure that it doesn’t produce the separator again after the generator is exhausted.

One noteworthy thing here is the invocation of self.gen.next() in init. If we try to call next on the constructor argument gen instead of the property self.gen we get a compiler error:

immutable value of type 'G' only has mutating members named 'next'

This would make sense if the Generator were a value type: it’s passed in here as an argument, which makes it immutable, and next is a mutating function. The compiler, however, doesn’t seem to have any information to that effect: Generator is a protocol, which can be implemented by reference types as well as value types:

protocol Generator {
    typealias Element
    func next() -> Element?
}

As you can see, next is not marked as mutating, either. I suspect there are two things going on here. First, the protocol definition that Xcode shows is probably incorrect, in that the mutating modifier is missing; the definition that Swift actually uses contains the modifier. Second, the compiler doesn’t know whether the Generator we are using to create the InterposeGenerator is a value type or not, so for the sake of safety with mutating functions it must assume that it is.

Let’s move on to the Sequence. It doesn’t really have to do anything other than creating an InterposeGenerator when asked to, so apart from all the type fiddling involved it’s rather simple:

struct InterposeSequence<S: Sequence, T
                         where T == S.GeneratorType.Element>
        : Sequence {
    let sep: T
    var seq: S
    func generate () -> InterposeGenerator<S.GeneratorType, T> {
        return InterposeGenerator(sep: sep, gen: seq.generate())
    }
}

Finally, the interpose function just creates an InterposeSequence:

func interpose<S: Sequence, T
               where T == S.GeneratorType.Element>
        (sep: T, seq: S) -> InterposeSequence<S, T> {
    return InterposeSequence(sep: sep, seq: seq)
}

It seems to be a pattern in the base library for the Sequence and Generator to be implemented in the same type, which we could easily do here, as well. I’ll leave it as an exercise for the reader.

Languages with a yield operator, like C# or Python, allow sequence generation to be expressed much more succinctly, with the compiler taking over the task of managing the state machine:

static IEnumerable<T> Interpose<T> (T sep, IEnumerable<T> seq) {
    var en = seq.GetEnumerator ();
    if (!en.MoveNext ())
        yield break;
    for (;;) {
        yield return en.Current;
        if (!en.MoveNext ())
            yield break;
        yield return sep;
    }
}

  1. It doesn’t modify the original sequence, of course, but rather returns a new one.  ↩

Playing with Swift

I really like what we’ve seen so far of Swift, Apple’s new programming language. For starters, I was happy to learn that references cannot be nil[1], and integer arithmetic is checked by default[2].

Let’s look at how Swift deals with sequences, and learn a little bit about generics on the way. Arrays, for example, are sequences, so you can iterate over them with a for loop:

for x in [1,2,3] {
    println(x)
}

Lots of other types are sequences, too, like strings, dictionaries and ranges. What makes them sequences is that they all implement the Sequence protocol:

protocol Sequence {
    typealias GeneratorType : Generator
    func generate() -> GeneratorType
}

I’m not sure why Swift elects to do generic protocols via typealias instead of just having the protocol take type arguments, like C# does, which might look something like this:

protocol Sequence<T> {
    func generate() -> Generator<T>
}

Whatever the reason might be, we can see that a sequence can generate a Generator, which is defined as follows:

protocol Generator {
    typealias Element
    func next() -> Element?
}

Generators have a next method which returns either the next element in the sequence, or nil, when the end has been reached. The for loop from above is probably mostly equivalent to

var gen = [1,2,3].generate()
while let x = gen.next() {
    println(x)
}

The equivalents of Sequence and Generator in C# are IEnumerable and IEnumerator.

The map function takes a sequence and a function, returning a new sequence containing the results of applying the function to the elements of the given sequence. For example:

for x in map([1,2,3], {i in i*2}) {
    println(x)
}

produces

2
4
6

It’s important to realize that the sequence which map returns is not necessarily of the same type as the one it is passed:

println([1,2,3])
println(map([1,2,3], {i in i*2}))

prints

[1, 2, 3]
VSs17MapCollectionView (has 2 children)

Also, sequences can generate their elements lazily:

var seq = map([1,2,3], { (i: Int) -> Int in
    println("processing " + String(i))
    return i*2
    })
var gen = seq.generate()
println(gen.next())

prints

processing 1
2

Let’s write a function that takes a sequence and returns an array containing all the sequence’s elements!

First we need to figure out the type of this function. It turns out that this doesn’t work:

func toArray (seq: Sequence) -> seq.GeneratorType.Element[]

The compiler complains that seq is an undeclared type. This doesn’t work, either:

func toArray (seq: Sequence) -> Sequence.GeneratorType.Element[]

Here the complaint is that Sequence.GeneratorType is ambiguous. That’s because we can’t tell the compiler that the two Sequence types are the same. Maybe we can use generics to solve the problem:

func toArray<S:Sequence> (seq: S) -> S.GeneratorType.Element[] {
    var arr = S.GeneratorType.Element[]()
    ...
    return arr
}

Swift is now happy with the type of the function, but unfortunately it doesn’t let us instantiate the array. Instead it thinks we’re trying to subscript S.GeneratorType.Element, which isn’t possible, because it’s a type and not a value. As far as I can tell, the only way to get around this problem is to introduce a second generic argument, which we can force to the correct type via a where clause:

func toArray<S : Sequence,
             T where T == S.GeneratorType.Element>
        (seq : S) -> T[] {
    var arr = T[]()
    for x in seq {
        arr.append(x)
    }
    return arr
}

println(toArray(map([1,2,3], {x in x*2})))

And there we have it:

[2, 4, 6]

  1. Swift would be the first mainstream programming language that corrects Tony Hoare’s “billion dollar mistake”, after barely 60 years of all others repeating it.  ↩

  2. It seems we’re finally learning that it’s better for a program to halt than to give away all your secrets and then launch all the nuclear missiles in the case of a bug.  ↩

Stacks and Signals

Now and then we get bug reports of test cases crashing on a specific configuration in an indeterminate manner or frequency, both from our own build bots as well as from our users. One of those I looked at recently involved a runtime test case on Linux/AMD64. The test would only crash infrequently, but luckily I could reproduce it in a VM.

What the program does is run two threads, one of which causes null reference exceptions in a loop over and over again:

while (!finished) {
    object o = null;
    try {
        o.ToString ();
    } catch {
    }
}

The other thread also runs a very simple loop, manually invoking the garbage collector again and again:

while (i < niter) {
    i++;
    if (i % 100 == 0)
        Console.Write (".");
    GC.Collect ();
}

Mono generates null reference exceptions by just letting the code trigger a segmentation violation signal, which we catch in a signal handler that then proceeds to throw the NRE. The advantage of this approach, as opposed to doing a null check whenever we access an object, is that there is no overhead at all in the case where the reference is legit, and the code we generate is shorter.
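
As a rough illustration of the mechanism—this is a minimal sketch, not Mono’s actual handler, and the two helper functions are hypothetical—the runtime installs a SIGSEGV handler and, when the fault comes from managed code, redirects the interrupted thread into exception-throwing code:

#include <signal.h>
#include <stdlib.h>

/* hypothetical helpers, assumed to be provided by the runtime */
int  ip_is_in_managed_code (void *ctx);
void redirect_to_exception_thrower (void *ctx);

static void
segv_handler (int sig, siginfo_t *info, void *ctx)
{
    /* if the faulting instruction is not in managed code, this is a
       genuine runtime crash, not a managed null dereference */
    if (!ip_is_in_managed_code (ctx))
        abort ();
    /* rewrite the interrupted context so that, when the handler
       returns, the thread resumes in code that throws the NRE */
    redirect_to_exception_thrower (ctx);
}

static void
install_segv_handler (void)
{
    struct sigaction sa;
    sa.sa_sigaction = segv_handler;
    sigemptyset (&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;
    sigaction (SIGSEGV, &sa, NULL);
}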

Garbage collection, on the other hand, also involves signals: stopping the world for a collection on Linux is done by sending a signal to each managed thread. Each thread catches its signal, which tells it that it has to suspend until the GC has done its work.[1]

When we look for the cause of a bug like this the first thing we try to do is catch the crash in a debugger, in this case GDB. Unfortunately, the crash was caused by a segmentation violation that our signal handler cannot handle (because it doesn’t occur in managed code), but a regular run of the program also triggers segmentation violations like crazy, so just breaking at each SIGSEGV didn’t get us anywhere.

My first attempt at getting to the crashing SIGSEGV was to write an Expect script that drove GDB to the first SIGSEGV, presumably caused by the null reference in the managed code loop, remembered the offending address, and then kept continuing until it hit a SIGSEGV at a different address. Sadly (and quite typical of cases like this), running the code that way in GDB never succeeded in triggering the bug, so we had to think of other options.

At first we thought the crash was caused by the interaction of the two signals (the segmentation violation from the null reference and the stop-the-world signal), but we looked into other hypotheses as well. We tried two different ways of getting rid of the segmentation violation: using explicit null checks to trigger NREs, and throwing an exception manually without accessing a null reference.

Explicit null checks (enabled with the environment variable MONO_DEBUG=explicit-null-checks) makes the JIT emit less efficient and more elaborate null checks whenever an object is accessed, so an NRE can be thrown without going through a SIGSEGV.

The other way to get rid of the SIGSEGV while still retaining most of the code’s behavior is to just throw an exception, like so:

while (!finished) {
    var e = new Exception ();
    try {
        throw e;
    } catch {
    }
}

We figured it probably didn’t matter which type the exception was, and indeed it did not: both with explicit null checks as well as with the manual exception throw the crash still occurred. We also looked into whether invoking the garbage collector was critical to the crash, and it turned out it was: just making the other thread sleep instead of calling the GC made the crash go away. Also, despite what the bug report claims, the crash was not dependent on SGen—it occurred with the Boehm collector, as well.

Armed with this new data we set out to finally get a close look at the crash in the debugger. Since the crash still only occurred very infrequently I brought my Expect script to bear again, and after a while we did get the crash in GDB. It did, regrettably, give us less information than we had hoped: instead of crashing somewhere in unmanaged code while dereferencing an invalid pointer, the crash resulted from the program counter being set to an invalid address, namely zero.

There are three main ways of setting the program counter on AMD64: CALLing a function, RETurning from a function, and JMPing to an address. After spending some time looking at the stack dump, we concluded that the most likely cause was a jump, which did not make our job easier, since a jump leaves no trace of where it came from.

After some more tinkering we decided to investigate whether the point at which the exception throwing thread was interrupted by the GC played any role. We got lucky—it so happened that every time the test crashed, the last time the thread was stopped was always in the exact same piece of code; even more, at the exact same instruction. This is the code in which the thread suspension occurred, with the instruction where it happened labeled:

    mov    %rdi,%r11
    mov    (%r11),%rax
    mov    0x10(%r11),%rcx
    mov    0x18(%r11),%rdx
    mov    0x8(%r11),%rbx
    mov    0x20(%r11),%rbp
    mov    0x30(%r11),%rsi
    mov    0x38(%r11),%rdi
    mov    0x60(%r11),%r12
    mov    0x68(%r11),%r13
    mov    0x70(%r11),%r14
    mov    0x78(%r11),%r15
    mov    0x28(%r11),%rsp
here:
    mov    0x80(%r11),%r11
    jmpq   *%r11

This piece of code is executed as the last step of throwing an exception. It jumps to the catch handler, which involves first restoring a bunch of registers, including the stack pointer, and then doing the actual jump. We guessed that this jump to register r11 was the one that landed us at address zero. Observing in a few more crashes that r11 was indeed always zero corroborated that hypothesis.

The code that calls this restore_context() function looks like this:

void
mono_amd64_throw_exception (MonoObject *exception, ...)
{
    MonoContext ctx;
    ... figure out where on the stack the matching catch clause is ...
    ... set ctx to the values required to end up in that catch block ...
    restore_context (&ctx);
}

Note that the context—the data structure that contains all the values for the registers, including the program counter—is a local variable of this function, i.e., it resides in this function’s stack frame. Said stack frame is on the same stack as the catch handler that it eventually dispatches to, but further down, of course. We pass the address of the context to the above restore_context() function.

It seemed obvious now that for whatever reason the value for the program counter was set to zero in the context, so we added an assertion to check for that directly before the call to restore_context(). The assertion never failed, but the program still crashed.

Looking at the assembly code again, Rodrigo suddenly proclaimed that he knew exactly what caused the bug: after setting the stack pointer to the catch handler’s frame we did another load from the context, which we weren’t allowed to do. By moving the stack pointer upward, the context ended up lying not above the stack pointer, where it used to be, but below it.

Most ABIs specify that programs must assume that data below the stack pointer (beyond the red zone) is subject to arbitrary change, because it’s used by the operating system for invoking signal handlers, for example—which is exactly what happened in our case with the thread suspend signal.

The correct way to do the context restore, then, is to load everything from the context before restoring the stack pointer. In our case we use the scratch register r8 to hold the value that will end up in rsp:

    mov    0x28(%r11),%r8
    mov    0x80(%r11),%r11
    mov    %r8,%rsp
    jmpq   *%r11

It’s somewhat embarrassing that we could already generate the correct code, but only did so in case we were running on Valgrind, because it gave us a warning about it that we apparently thought was a false positive.

Looking at our code for x86, after fixing AMD64, we found that we had the same bug there, but nobody had reported it yet, probably because it was harder to trigger.


  1. Reality is a tiny bit more complicated: Sending a signal to a thread is asynchronous, meaning that the GC has no direct way of knowing that a thread has already gotten the signal and reacted to it. Instead, the suspend signal handlers message back to the GC that the point of suspension has been reached, upon which they actually suspend, waiting for yet another signal from the GC, telling them to resume. Note that if the resume signal arrived between the point where the GC is notified of the suspension and the point of initiating the wait for said resume signal, the signal would be lost and the thread would never resume.  ↩

How to Split a Pizza

Assume two people want to split a pizza in a fair way. In our context, fair means that if both people follow the procedure, each person gets half a pizza, or, if not, only has themselves to blame. The obvious answer seems to be “split it in the middle”, but that leads to all kinds of further questions and complications: Who does the splitting? How is “the middle” determined? Even if we had protocols for that, it would still leave the possibility of one person not being satisfied with how they were carried out—sloppy measurements or sloppy cutting.

The standard way to avoid all such problems is to let one person cut the pizza in two halves, and the other person choose who gets which half. If the cutter thinks they got the smaller piece it’s because they didn’t cut fairly or precisely enough, and if the chooser thinks their piece is smaller, they should have chosen the bigger one.

What if there are three people? Before we try to come up with a procedure, let’s consider people’s motivations first. In the case of two people, whatever one person wins, the other loses. With three people, one person might elect to get a smaller piece for themselves in order to also make one of the other two worse off. Or, more realistically, two people might collude against the third. We’d like a procedure that ensures that even under those circumstances, fairness is guaranteed.

One generalization of the two-people procedure that comes to mind is to let one person—let’s call them “A”—cut, and then let the two others choose, first B, then C. Unfortunately, if A cuts in such a way that there is one large piece and two pieces that are each less than 1/3, and B chooses the large piece, C doesn’t get their fair share (neither does A, but that’s A’s fault). Randomizing the order of cutter and choosers is not a solution, either, because there is still the possibility of somebody getting the short end of the stick.

Here’s a procedure that works: A cuts the pizza, B chooses their slice. C has two options: they can either choose one of the remaining two slices, or they can call shenanigans. In the latter case, B picks a second slice[1], then both slices are put together and C divides them into two halves. B picks one half, C gets the other.

To see why this procedure is fair, let’s consider each person in turn.

A will always get the piece that B and C didn’t pick, so if A cut fairly, they get their fair share.

B will either get the piece they pick, or at least half of the two pieces they pick, so if they pick the two largest pieces, they will get at least one third of the pizza.

C will either get the piece they pick, or, in the case of shenanigans, half of the two pieces that B picked. It’s not obvious that they will get at least a third in the latter case. We assume that C only calls shenanigans if the two pieces that are left after B picked one are both smaller than one third—if one piece is at least a third, C should of course just pick it, thereby getting their fair share. The larger of the two remaining pieces will have size 1/3-x, the smaller one 1/3-x-y, where x>0 and y>=0 (if the two pieces are the same size, y is zero). The piece that B picked therefore has size 1/3+2x+y. B now picks one of the smaller pieces. If B wants to minimize C’s share, B must pick the smallest piece, i.e. 1/3-x-y, which then gets put together with B’s original piece for C to cut. The two pieces together have size 2/3+x, so if C cuts fairly, they will actually get more than their fair share.
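
Spelled out as equations—this just restates the arithmetic above—the shenanigans case works out as follows:

\begin{align*}
p_B &= \tfrac{1}{3} + 2x + y && \text{(the piece B picked first)} \\
p_1 &= \tfrac{1}{3} - x && \text{(the larger remaining piece)} \\
p_2 &= \tfrac{1}{3} - x - y && \text{(the smaller remaining piece, picked by B second)} \\
p_B + p_2 &= \tfrac{2}{3} + x && \text{(what C gets to cut in half)} \\
\tfrac{p_B + p_2}{2} &= \tfrac{1}{3} + \tfrac{x}{2} > \tfrac{1}{3} && \text{(C’s half, since } x > 0\text{)}
\end{align*}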

I don’t know whether this procedure can be generalized to a larger number of people.

Update: As many commenters have thankfully pointed out, I have solved a problem that smarter people have already solved for more general cases. For a good overview of a few methods, read this article. My personal favorite is the Last Diminisher algorithm. The method I came up with is a variant of the three-person Lone Divider procedure.


  1. It is important that it is B, not C, who picks the second slice when C calls shenanigans.  ↩

SGen – Concurrency and Evacuation

Introduction

In my last blog post about SGen, Mono’s garbage collector, I wrote about concurrent mark-and-sweep, which makes most of the garbage collection work happen concurrently with the running program.

One thing we cannot do in a concurrent collector, as opposed to a stop-the-world collector, is move objects around in memory[1]. This is typically not a problem in a mark-and-sweep collector, but SGen goes beyond basic mark-and-sweep and prevents fragmentation by evacuating objects when the need arises.

Fragmentation is no less of an issue in a concurrent collector, so being unable to do evacuation is problematic. Our best solution at this point is to stop the world when the need for evacuation arises[2]. The threshold we use to decide when to do evacuation with concurrent mark-and-sweep is higher than with the non-concurrent collector, but there is still a chance that a full collection pause will hit the program when interactive speeds are desired.

The API

We’re introducing a new API in the Mono.Runtime class. The application can use it to ask the GC not to do full major collection pauses:

if (!Mono.Runtime.SetGCAllowSynchronousMajor (false))
        Console.WriteLine ("Sorry, the GC won't cooperate.");

// your low-latency code here

Mono.Runtime.SetGCAllowSynchronousMajor (true);

One possible use case would be a game, where you want the lowest latency possible while the player is in a level, but between levels, or during loads, a longer collection pause is not much of an issue. Note that there are still no guarantees about the lengths of the pauses—this is a best-effort approach, not a real-time garbage collector.

This API will be available in the next Mono release, but only via reflection. It will be a public API on mobile soon.


  1. This is technically possible, but it involves using a read barrier, which would not only be forbiddingly complicated to implement in Mono, but would also probably incur a significant performance hit as well as break our embedding API.  ↩

  2. An approach that would keep marking mostly concurrent would be to decide which blocks to evacuate before starting concurrent mark, and then during marking to keep tabs on the references from already marked objects to objects within those blocks. In the finishing collection pause we could evacuate the marked objects in those blocks, but we’d have to update the references to them. Depending on the workload that might still require a significant pause.  ↩

SGen – Concurrent Mark

Introduction

Like most garbage collectors, SGen is of the stop-the-world variety, which means that when the time has come to perform its work, it stops all managed threads in their tracks and only resumes them once it has finished.

Nursery collections are usually relatively short affairs, and can be sped up even further by using a smaller nursery size. Major collections, on the other hand, can take a long time because they must scan every single live object in the system. Long garbage collection pauses are clearly a serious problem for many applications, be they interactive or on the server, and garbage collector engineers make considerable efforts to shorten pause times.

One approach to this is to do parts of the major collection concurrently with the running program, aka the mutator. SGen’s major collector, mark-and-sweep, consists of two main phases, unsurprisingly called mark and sweep.

Thanks to Zoltan’s recent work—lazy sweep—SGen already does most of sweep’s work outside of major collection pauses. I have been working on the other part of the story for the past few weeks: concurrent mark.

How it works

Our implementation of concurrent mark is very similar to that of Sun JVM’s concurrent mark-and-sweep (CMS) collector.

A major garbage collection starts with an initial marking pause, during which all roots are scanned.[1] Then it restarts the world and in the background a worker thread performs the marking and scanning of major heap objects, in the usual manner. When the concurrent marking phase is done, the collector stops the world again to finish up the collection in the final marking pause.

Dealing with concurrent mutation

Consider the following code, which manipulates a linked list, swapping the order between the two nodes following a:

1   b = a->next;
2   c = b->next;
3   a->next = c;
4   b->next = c->next;
5   c->next = b;

Let’s assume that right after a->next has been set in line 3, the concurrent worker thread scans and marks the node referenced by a, then follows next and processes c. If at that point our program hasn’t come around to setting c->next to b in line 5, the garbage collector will never encounter b—it will have been lost because for a brief moment that node wasn’t part of the list.

We can recruit the help of our friend the write barrier to deal with this issue, but we need to extend its scope a little: instead of just recording stores that point from the major heap to the nursery, it must now also record stores pointing from the major heap to the major heap. In other words: all reference stores that happen in the major heap, no matter where they point to, must be recorded.[2]

With this new write barrier the store to c->next will record the node c as requiring a re-scan. This re-scanning of all modified major heap objects happens in the final marking pause, so it’s not further disturbed by the mutator.

What if the final marking pause happens before c->next is set in line 5, however? The node b might not be pointed to by any object in the heap, so we won’t find it that way, yet it is still live. It’s live because it’s referenced by the variable b on the stack, so the final marking pause must also re-scan all roots.[3]

Nursery collections

Nursery collections that happen while concurrent mark is running stop the world, but concurrent mark keeps chugging along. It never follows references into the nursery anyway, so there is little potential for conflict.

There is an issue with the card table, however: nursery collections clear the card table, but its data is required during the final marking pause, which might only occur a few nursery collections later. The solution to this is to maintain a separate card table that accumulates the marked bits from the “regular” card table, and to use this union table in the final scan.
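
A sketch of that idea, assuming one byte per card as SGen uses (the names here are mine, not SGen’s): before a nursery collection clears the regular card table, its marks are folded into the union table, which the final marking pause then consults:

#include <stddef.h>

/* fold the regular card table into the union table; called during a
   nursery collection, before the regular table is cleared */
static void
update_union_cards (unsigned char *cards, unsigned char *union_cards,
                    size_t num_cards)
{
    size_t i;
    for (i = 0; i < num_cards; ++i)
        union_cards [i] |= cards [i];
}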

Results

This graph shows maximum pause times for a few of the programs in our GC benchmarks suite. All of the pause times are relative to Boehm’s, so values below 1 are better than Boehm’s. Note that the values correspond to the longest stop-the-world pause that occurred when running the program, not the median or some average.

The three SGen configurations are standard mark-and-sweep in blue, concurrent mark-and-sweep in green and concurrent mark-and-sweep in conjunction with cementing, in yellow.

The programs represented in this graph are those whose maximum pause times with Boehm were longer than 10ms. The ones with the longest pause times are on the left—on graph4 Boehm had a pause of 2.5s—which are those that benefit most from concurrent mark, as expected: the longest pause on graph4 is less than 180ms with concurrent mark. I have not yet investigated why graph8 is such an outlier.


  1. For concurrent mark the roots not only include the usual suspects like thread stacks and global variables, but also nursery objects. In fact, the initial marking pause performs a nursery collection, after which only pinned objects will be left in the nursery. Those are treated as roots, or rather the references they contain are.  ↩

  2. Since we use a single card table for both kinds of references, this will increase the number of cards marked that don’t have references pertinent to nursery collections. The card scanning overhead will therefore increase, but in practice this doesn’t seem to be a problem. An alternative would be to use two separate card tables, one for references to nursery objects and the other for references to major heap objects. That would be more expensive both in terms of memory usage as well as in write barrier performance.  ↩

  3. This entails doing a full mark-scan loop, of course, which might still take a very long time in pathological cases. Imagine a program that keeps a huge linked list around, referenced only by a single object on the major heap. At some point concurrent mark kicks in. The program now loads the reference to the list into a local variable and nulls the reference on the heap. The garbage collector will not be able to mark the list until it scans the stack in the final marking pause, and will now have to traverse the list with the world stopped.  ↩

SGen – The Write Barrier

Introduction

In previous installments of my blog post series on SGen, Mono’s generational garbage collector, I wrote about the major and minor heaps and their organizations. What I have for the most part neglected to pontificate on is how they interact with each other.

The advantage of generational garbage collectors comes from being able to collect the minor heap, also referred to as the nursery, very quickly because it doesn’t have to look at the objects in the major heap. There is a problem, however, which this illustration from my talk on SGen explains:

Let’s say we collect the nursery, starting from the single root reference. We find a graph of four live objects, colored blue, and promote them to the major heap, discarding all the others, including the three white ones. But wait—there’s a reference in some object on the major heap pointing to one of the objects we have just destroyed! This cannot stand.

Where did this reference come from? Consider the previous nursery collection. We’ll ignore pinning for now, so after that collection the nursery was empty. The mutator (that’s what garbage collector people affectionately call the pesky thing that insists on messing with the heap and that most people refer to as the program) allocated that white object, and stored a reference to it in an object on the major heap—in that order, but not necessarily in direct succession.

One way to handle those references is to scan the whole major heap for them and to treat them as roots. That would make nursery collections as slow as major collections, so it’s not a good idea. What’s needed is a way to record all such references happening between nursery collections. The nursery collector would then process them like ordinary roots and everything would be in order again.

Write barriers

Write barriers[1] do exactly that: On every store of a reference in an object they check whether the reference resides in the major heap and points to the nursery. If those conditions are satisfied, the reference is recorded for the benefit of the next nursery collection. This sounds like an awful overhead for each store, but it really isn’t, and for almost all applications it’s far outweighed by the advantages from generational collection.[2]

A note on terminology: The literature sometimes uses “write barrier” to refer to just the code that identifies references to remember, and sometimes to refer to both the code that identifies as well as records those references. I shall be using the term in the latter sense.

SGen’s write barrier is a so-called card table. We also implement an alternative, a sequential store buffer, which I will not discuss here.

Card tables

The basic idea is to conceptually divide the major heap into small, equally sized regions, called cards, and to have one bit per card that is set—the card is marked—when a reference to a nursery object is written to it.[3]

The advantages of this scheme are that it requires only a handful of instructions to record a store, so it can be inlined, and that its space overhead is small and proportional to the heap size. The disadvantage is that it doesn’t record the exact locations of the references we’re interested in.
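
In code, such a card-marking barrier boils down to something like the following sketch—my illustration, not SGen’s actual implementation; the names and the simplistic card addressing are assumptions:

#include <stdint.h>

#define CARD_BITS 9                        /* 512-byte cards, as in SGen */

static unsigned char *card_table;          /* one byte per card */

int ptr_in_nursery (void *p);              /* hypothetical helper */

static inline void
write_barrier (void **field_ptr, void *new_value)
{
    *field_ptr = new_value;
    /* record only stores of references that point into the nursery; a
       full version would also check that field_ptr is in the major heap,
       and would offset the index by the heap's start address */
    if (ptr_in_nursery (new_value))
        card_table [(uintptr_t) field_ptr >> CARD_BITS] = 1;
}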

In SGen’s case, cards are 512 bytes large, so the nursery collector needs to scan roughly that amount of memory to potentially locate a single reference.[4] The references it does find it treats like ordinary roots: the referenced objects are promoted to the major heap and the references are updated to reflect their new locations. The newly promoted objects are then scanned in turn, of course.
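
The scanning side, continuing the same sketch: the nursery collector visits each marked card and walks the objects overlapping those 512 bytes to find the references it is actually after:

#include <stddef.h>

/* hypothetical helper: walks the objects overlapping the given card and
   treats any references into the nursery found there as roots */
void scan_card_for_nursery_references (size_t card);

static void
scan_marked_cards (unsigned char *card_table, size_t num_cards)
{
    size_t card;
    for (card = 0; card < num_cards; ++card) {
        if (card_table [card])
            scan_card_for_nursery_references (card);
    }
}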

Once all marked cards have been scanned there are no more references from the major heap to the nursery, so all card mark bits are cleared, to be ready for the next round.

Pinned objects

So far we have been assuming that when we encounter a reference from the major heap to the nursery, we can always move the nursery object to the major heap, thereby rendering the bothersome reference harmless.

As I explained in the blog post on the nursery, sometimes objects become pinned, which implies that they cannot be moved. This threatens to confound our cunning plan: we may only clear the card mark bits when there are no references left that we need to keep track of. The solution here is simple: Cards containing references to pinned objects must stay marked so that they will be processed again during the next nursery collection, in the hopes that the object referred to will no longer be pinned.

This can lead to another potential problem: Assume that some nursery object remains pinned, and that there are lots of objects in the major heap that refer to it.[5] Lots of cards would remain perpetually marked and nursery collections would become dominated by card scanning. That is in fact what we noticed on two benchmarks in our small GC benchmark suite—nursery collections took much longer than expected and the programs took twice as long or more to run with SGen than with Boehm.

Cementing

I just implemented a new feature—not yet pushed to master—that solves this new problem in a very simple way. I call it “cementing”, because it institutionalizes bothersome pinned objects: if we encounter a certain number of references from the major heap to a particular pinned nursery object, we just give up on it. Instead of hoping that it will become unpinned soon, we cement it—the punishment for badly behaved pinned objects. A cemented object will always be pinned at the start of future nursery collections as a matter of principle. The consequences of this are that it will be treated as a root, and that it will not be moved, which in turn means that we don’t have to worry about references to it from the major heap any more: the object is kept alive and the references can stay the same. Cards with references to cemented objects can be cleared, reducing the card scanning load.
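
In pseudo-C the decision looks roughly like this—a sketch of the idea only; the threshold, the names, and the bookkeeping are my assumptions, not SGen’s actual code:

#define CEMENT_THRESHOLD 1000   /* assumed value; SGen's may differ */

typedef struct {
    void *object;      /* a pinned nursery object */
    int   refs_seen;   /* major-to-nursery references to it seen so far */
    int   cemented;
} PinStats;

/* called for each major-heap reference found to a pinned nursery object */
static void
note_reference_to_pinned (PinStats *stats)
{
    if (++stats->refs_seen >= CEMENT_THRESHOLD)
        stats->cemented = 1;   /* from now on: always pin it, treat it as
                                  a root, and allow its cards to be cleared */
}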

Major collections give us a blank slate because we need to scan all live major heap objects anyway, so we use them as opportunities to reset cementing, giving those poor objects another chance after all.

The preliminary results from cementing are encouraging. It doesn’t seem to introduce performance regressions, and for the two benchmarks affected, it works wonders:

The red bars represent the wall clock run time with Boehm. The green bars are SGen run times without cementing, and blue are SGen with cementing.


  1. There is at least one other meaning for the term “write barrier”, namely a machine instruction (or series of instructions) that ensures that all stores preceding it are seen by other threads before the stores following it. Those are not the barriers we’re looking for.  ↩

  2. An alternative way of solving the reference problem is a read barrier. Read barriers make it possible to write moving concurrent collectors, a magnificent example of which is Azul’s C4 collector. If you want a glimpse of the future of garbage collection, read the paper.  ↩

  3. SGen uses a whole byte per card for marking. Not only can whole bytes be written easier than single bits, but multiple threads can do so without conflicts between contiguous ones. If we used individual bits, we’d have to use CAS to avoid losing marks by other threads.  ↩

  4. Nitpickers will notice that a marked card might not even contain a single reference to a nursery object—the mutator can write a reference, hence mark the card, and then later overwrite it with null, for example.  ↩

  5. A simple way to trigger this is to allocate some object high up on the call stack, keep a reference to it on the stack, and then to construct a long list, with every node containing a reference to said object. Keep the list around and its nodes will eventually be promoted, staining the major heap with perpetually marked cards.  ↩

Flame Graphs for Instruments

Flame graphs are a great way to visualize stack-trace based profiling data. Brendan Gregg’s scripts for generating these graphs accept, among other formats, output produced by DTrace. Unfortunately, OS X’s DTrace, when accessed through the command line, contains a bug that prevents it from producing stack traces in most cases. Xcode Instruments, on the other hand, is not compromised in this way, but it does not produce output similar to dtrace’s.

What it does provide is an export functionality which generates CSV files containing the exact contents of the details pane, so I wrote a script to do the intermediary work.

http://www.complang.tuwien.ac.at/schani/blog/sgen/instruments-fg.jpg

To produce the CSV file, expand all nodes in the call tree (Option-click all the roots) and use the Instrument -> Export Track menu item. Then convert the CSV to the format Brendan’s script understands and feed the result to flamegraph.pl:

./stackcollapse-instruments.pl out.csv | ./flamegraph.pl >mygraph.svg


For an alternative view you can invert the call tree in Instruments and you’ll get a correspondingly inverted flame graph, i.e., the flames go up from callee to caller, as opposed to the other way around when the call tree is not inverted.

SGen and DTrace

Introduction

SGen is Mono’s generational garbage collector. DTrace is a programmable dynamic tracing framework and is available for OS X and Solaris, among others.

DTrace provides an interface for applications to define their own probes which can then be observed, and actions triggered. We have recently added a few probes to SGen and I’d like to show a few simple things that can be done with them.

The program I’m using to demonstrate DTrace here is IronPython, running pystone. SGen’s DTrace probes are currently only available on OS X, on mono master.

Garbage collection times

How long do garbage collections take for a specific workload?

mono$target:::gc-begin {
        self->ts = timestamp;
}

mono$target:::gc-end {
        @[arg0] = quantize ((timestamp - self->ts) / 1000);
}

When a garbage collection starts, we save the current time, and when it ends we gather the time elapsed since then in microseconds. The results:

    0
value  ------------- Distribution ------------- count
    4 |                                         0
    8 |                                         1
   16 |                                         0
   32 |                                         1
   64 |                                         1
  128 |                                         0
  256 |                                         0
  512 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    90
 1024 |                                         1
 2048 |                                         0
 4096 |                                         0
 8192 |@                                        3
16384 |                                         1
32768 |                                         0

     1
value  ------------- Distribution ------------- count
   16 |                                         0
   32 |@@@@@@@@@@@@@                            1
   64 |                                         0
  128 |                                         0
  256 |                                         0
  512 |                                         0
 1024 |                                         0
 2048 |                                         0
 4096 |                                         0
 8192 |                                         0
16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              2
32768 |                                         0

The graph for 0 shows collection times for the nursery, the graph for 1 shows times for the major collections. We can see here that the majority of nursery collections took about half a millisecond, with 4 outliers taking around 10 ms. There were 3 major collections, two of which took around 16 ms and the third one barely took any time at all.

Apple’s Instruments profiling application gathers data using DTrace and can be extended with custom DTrace scripts. Here’s what a visualization of SGen collector pause times looks like:

http://www.complang.tuwien.ac.at/schani/blog/sgen/instruments.jpg

The dots lower down in the scatter diagram are likely nursery collections, the higher up ones major collections.

GC lock

Not counting garbage collections, how often and for how long is the GC lock held?

mono*:::gc-locked {
        self->ts = timestamp;
}

mono*:::gc-begin {
        @ = quantize ((timestamp - self->ts) / 1000);
}

mono*:::gc-end {
        self->ts = timestamp;
}

mono*:::gc-unlocked {
        @ = quantize ((timestamp - self->ts) / 1000);
}

The GC lock is taken shortly before collections start, and released shortly after they finish. It’s also taken in a variety of other cases. Naturally we want to avoid holding it for long.

value  ------------- Distribution ------------- count
    0 |                                         0
    1 |@@@@@@@@@@@@@@@@@@@@@@@@@                5462
    2 |@@@@@@@@@@@@@                            2789
    4 |                                         97
    8 |                                         49
   16 |@                                        192
   32 |                                         12
   64 |                                         2
  128 |                                         0
  256 |                                         0
  512 |                                         0
 1024 |                                         1
 2048 |                                         0

SGen takes the lock quite often, but apart from one outlier at about 1 millisecond, it’s released very quickly. Further analysis (not presented here) shows that the outlier is a result of thread-pool initialization during startup, when a few objects are allocated pinned, requiring allocating some memory from the operating system.

Objects allocated

How many objects of which classes are allocated?

mono*:::gc-nursery-obj-alloc {
        @[strjoin (copyinstr (arg2), strjoin(".", copyinstr (arg3)))] = count ();
}

mono*:::gc-major-obj-alloc-large {
        @[strjoin (copyinstr (arg2), strjoin(".", copyinstr (arg3)))] = count ();
}

mono*:::gc-major-obj-alloc-pinned {
        @[strjoin (copyinstr (arg2), strjoin(".", copyinstr (arg3)))] = count ();
}

mono*:::gc-major-obj-alloc-degraded {
        @[strjoin (copyinstr (arg2), strjoin(".", copyinstr (arg3)))] = count ();
}

mono*:::gc-major-obj-alloc-mature {
        @[strjoin (copyinstr (arg2), strjoin(".", copyinstr (arg3)))] = count ();
}

SGen has probes for five different kinds of object allocation, hence the repetition. User-space pointers to the namespace and class name are passed in the third and fourth argument, respectively, so we concatenate those with a dot in the middle to give the full class name.

Whenever possible SGen uses the managed allocator – a small managed piece of code to quickly allocate objects in the nursery, which handles the most common case. The managed allocator is not instrumented with a DTrace probe, however, so for this script to function properly the managed allocator must be turned off via the MONO_GC_DEBUG environment variable.
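
If I remember correctly, MONO_GC_DEBUG=no-managed-allocator is the option that does it, but check the option name against your Mono version.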

...
IronPython.Runtime.PythonDictionary                          500014
IronPython.Runtime.List                                      500067
IronPython.Runtime.Calls.Method                             1000002
System.Object[]                                             1505064
System.Int32                                               15499015

IronPython boxes lots of integers.

Objects pinned in the nursery

How many objects of which types are pinned during nursery collections?

mono$target:::gc-obj-pinned /arg4==0/ {
        @[strjoin (copyinstr (arg2), strjoin (".", copyinstr (arg3)))] = count ();
}

This is very similar to the preceding script, but we use the pinned object probe and a guard to select for the nursery, as opposed to the whole heap.

...
Microsoft.Scripting.Actions.DynamicSiteTarget`3                  863
IronPython.Runtime.Calls.CallTarget1                             864
System.Object[]                                                 1271
Microsoft.Scripting.Actions.CallSite`1                          1350
System.String[]                                                 1440

Here we see the pin counts for the 5 most pinned classes during nursery collections. Combining all nursery collections, for example, 1440 string arrays were pinned.

What if we’d like to know how many bytes of memory are pinned in the nursery during each collection?

mono$target:::gc-begin /arg0==0/ {
        bytes = 0;
}

mono$target:::gc-obj-pinned /arg4==0/ {
        bytes += arg1;
}

mono$target:::gc-end /arg0==0/ {
        @ = quantize (bytes);
}

All we have to do is add up the sizes of all pinned objects for nursery collections.

value  ------------- Distribution ------------- count
   32 |                                         0
   64 |                                         1
  128 |@                                        2
  256 |                                         0
  512 |                                         0
 1024 |                                         0
 2048 |                                         0
 4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  95
 8192 |                                         0

This looks a bit suspicious. We had about 100 nursery collections, so according to the object counts above there were on average 13 object arrays and 14 string arrays pinned per nursery collection, in addition to lots of other objects. All of those supposedly amount to only about 4 kilobytes of memory. Are we counting something wrong or are these very short arrays? Let’s see:

mono$target:::gc-obj-pinned /arg4==0/ {
        @[strjoin (copyinstr (arg2), strjoin (".", copyinstr (arg3)))] = quantize (arg1);
}

This gathers object size histograms per class for all objects pinned in the nursery. These are the parts of the output giving the counts for the arrays we’re interested in:

System.String[]
         value  ------------- Distribution ------------- count
             8 |                                         0
            16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    1344
            32 |@@@                                      96
            64 |                                         0

System.Object[]
         value  ------------- Distribution ------------- count
             8 |                                         0
            16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              870
            32 |@@@@@@@@@@@@@                            409
            64 |                                         0

Apparently those arrays really are very short, all around 16 to 32 bytes long. But maybe the probe is lying to us about the object size? If we know about Mono’s internal object layout we can read out the array length directly from the heap[1]:

mono$target:::gc-obj-pinned /arg4 == 0 && copyinstr (arg3) == "Object[]"/ {
        this->l = *(int*)copyin (arg0 + 12, 4);
        @[arg1] = lquantize (this->l, 0, 8);
}

On 32-bit Mono the length of an array is at offset 12, so we read 4 bytes from there and cast it to an int. We get a few graphs like these:

    16
value  ------------- Distribution ------------- count
  < 0 |                                         0
    0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 96
    1 |                                         0

    28
value  ------------- Distribution ------------- count
    2 |                                         0
    3 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2
    4 |                                         0

    20
value  ------------- Distribution ------------- count
    0 |                                         0
    1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 481
    2 |                                         0

Each of them gives us the distribution of the array lengths for array objects of specific sizes. If everything is correct then for each array object size there must only be one array length, which is exactly what we get. For example, all 481 array objects of size 20 bytes have a length of one.

In fact, the numbers show that the sizes we get are exactly what we should expect given Mono’s object layout on 32 bits: a reference array of length n has a size of 16 + 4n bytes—16 bytes for the 96 empty arrays, 20 for the 481 arrays of length one, and 28 for the two arrays of length three.
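
That formula falls straight out of the object layout. Roughly—the field names are my shorthand, not Mono’s exact definitions, but the offsets match what we probed above—a 32-bit Mono array looks like this:

/* rough sketch of a 32-bit Mono array object */
typedef struct {
    void *vtable;           /* offset  0 */
    void *synchronisation;  /* offset  4 */
    void *bounds;           /* offset  8: only used for multi-dimensional arrays */
    int   max_length;       /* offset 12: the length the script reads with copyin */
    void *vector [1];       /* offset 16: the elements, 4 bytes each */
} SketchedMonoArray;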

Footnotes:

1. Don’t try this at home!