Flame Graphs for Instruments

Flame graphs are a great way to visualize stack-trace based profiling data. Brendan Gregg’s scripts for generating these graphs accept, among other formats, output produced by DTrace. Unfortunately, OS X’s DTrace, when used from the command line, contains a bug that prevents it from producing stack traces in most cases. Xcode Instruments, on the other hand, is not affected by this bug, but it will not produce output similar to DTrace’s.

What it does provide is an export function that generates CSV files containing the exact contents of the details pane, so I wrote a script to do the intermediate work.

https://i0.wp.com/www.complang.tuwien.ac.at/schani/blog/sgen/instruments-fg.jpg

To produce the CSV file, expand all nodes in the call tree (Option-click all the roots) and use the Instrument -> Export Track menu item. Then convert the CSV to the format Brendan’s script understands and feed the result to it:

./stackcollapse-instruments.pl out.csv | ./flamegraph.pl >mygraph.svg
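For reference, the intermediate “folded” format that flamegraph.pl expects, and that the conversion script produces from the CSV, is one line per unique stack: the frames separated by semicolons, followed by a sample count. The frame names here are made-up examples:

main;Parser::Run;Lexer::Next 310
main;Parser::Run;malloc 125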

[Image: flamegraph.png, the resulting flame graph]

For an alternative view you can invert the call tree in Instruments, which gives a correspondingly inverted flame graph, i.e., the flames go up from callee to caller rather than from caller to callee as with the non-inverted tree.

SGen and DTrace

Introduction

SGen is Mono’s generational garbage collector. DTrace is a programmable dynamic tracing framework and is available for OS X and Solaris, among others.

DTrace provides an interface for applications to define their own probes, which can then be observed and used to trigger actions. We have recently added a few probes to SGen, and I’d like to show a few simple things that can be done with them.
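As an illustration of how such user-space probes are typically declared, here is a sketch of a DTrace provider description. It mirrors the gc-begin/gc-end probes used below, but is not necessarily Mono’s exact provider file:

provider mono {
        /* fired when a collection of the given generation starts/ends */
        probe gc__begin (int generation);
        probe gc__end (int generation);
};

The double underscores in the declaration become dashes in the probe names, which is why the scripts below refer to gc-begin and gc-end.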

The program I’m using to demonstrate DTrace here is IronPython, running pystone. SGen’s DTrace probes are currently only available on OS X, on mono master.
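Some of the scripts that follow use the $target variable, so the easiest way to run them is to let dtrace launch the program itself, which binds $target to the new process ID. A sketch of such an invocation, with a placeholder script name and paths:

sudo dtrace -s gc-times.d -c 'mono ipy.exe pystone.py'

The scripts written against mono* instead of mono$target match all running Mono processes, so they can also be used against an already running program.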

Garbage collection times

How long do garbage collections take for a specific workload?

mono$target:::gc-begin {
        self->ts = timestamp;
}

mono$target:::gc-end {
        @[arg0] = quantize ((timestamp - self->ts) / 1000);
}

When a garbage collection starts, we save the current time, and when it ends we gather the time elapsed since then in microseconds. The results:

    0
value  ------------- Distribution ------------- count
    4 |                                         0
    8 |                                         1
   16 |                                         0
   32 |                                         1
   64 |                                         1
  128 |                                         0
  256 |                                         0
  512 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    90
 1024 |                                         1
 2048 |                                         0
 4096 |                                         0
 8192 |@                                        3
16384 |                                         1
32768 |                                         0

     1
value  ------------- Distribution ------------- count
   16 |                                         0
   32 |@@@@@@@@@@@@@                            1
   64 |                                         0
  128 |                                         0
  256 |                                         0
  512 |                                         0
 1024 |                                         0
 2048 |                                         0
 4096 |                                         0
 8192 |                                         0
16384 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              2
32768 |                                         0

The graph for 0 shows collection times for the nursery, the graph for 1 shows times for the major collections. We can see here that the majority of nursery collections took about half a millisecond, with 4 outliers taking around 10 ms. There were 3 major collections, two of which took around 16 ms and the third one barely took any time at all.

Apple’s Instruments profiling application gathers data using DTrace and can be extended with custom DTrace scripts. Here’s what a visualization of SGen collector pause times looks like:

https://i0.wp.com/www.complang.tuwien.ac.at/schani/blog/sgen/instruments.jpg

The dots lower down in the scatter diagram are likely nursery collections, the ones higher up major collections.

GC lock

Not counting garbage collections, how often and for how long is the GC lock held?

mono*:::gc-locked {
        self->ts = timestamp;
}

mono*:::gc-begin {
        @ = quantize ((timestamp - self->ts) / 1000);
}

mono*:::gc-end {
        self->ts = timestamp;
}

mono*:::gc-unlocked {
        @ = quantize ((timestamp - self->ts) / 1000);
}

The GC lock is taken shortly before collections start, and released shortly after they finish. It’s also taken in a variety of other cases. Naturally we want to avoid holding it for long.

value  ------------- Distribution ------------- count
    0 |                                         0
    1 |@@@@@@@@@@@@@@@@@@@@@@@@@                5462
    2 |@@@@@@@@@@@@@                            2789
    4 |                                         97
    8 |                                         49
   16 |@                                        192
   32 |                                         12
   64 |                                         2
  128 |                                         0
  256 |                                         0
  512 |                                         0
 1024 |                                         1
 2048 |                                         0

SGen takes the lock quite often, but apart from one outlier at about 1 millisecond, it is released very quickly. Further analysis (not presented here) shows that the outlier is a result of thread-pool initialization during startup, when a few objects are allocated pinned, which requires allocating some memory from the operating system.

Objects allocated

How many objects of which classes are allocated?

mono*:::gc-nursery-obj-alloc {
        @[strjoin (copyinstr (arg2), strjoin (".", copyinstr (arg3)))] = count ();
}

mono*:::gc-major-obj-alloc-large {
        @[strjoin (copyinstr (arg2), strjoin (".", copyinstr (arg3)))] = count ();
}

mono*:::gc-major-obj-alloc-pinned {
        @[strjoin (copyinstr (arg2), strjoin (".", copyinstr (arg3)))] = count ();
}

mono*:::gc-major-obj-alloc-degraded {
        @[strjoin (copyinstr (arg2), strjoin (".", copyinstr (arg3)))] = count ();
}

mono*:::gc-major-obj-alloc-mature {
        @[strjoin (copyinstr (arg2), strjoin (".", copyinstr (arg3)))] = count ();
}

SGen has probes for five different kinds of object allocation, hence the repetition. User-space pointers to the namespace and the class name are passed in the third and fourth arguments, respectively, so we concatenate them with a dot in between to get the full class name.

Whenever possible SGen uses the managed allocator – a small managed piece of code to quickly allocate objects in the nursery, which handles the most common case. The managed allocator is not instrumented with a DTrace probe, however, so for this script to function properly the managed allocator must be turned off via the MONO_GC_DEBUG environment variable.
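As a sketch, assuming the relevant debug option is called no-managed-allocator (check the SGen documentation for your Mono version), the program would be started like this:

MONO_GC_DEBUG=no-managed-allocator mono ipy.exe pystone.py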

...
IronPython.Runtime.PythonDictionary                          500014
IronPython.Runtime.List                                      500067
IronPython.Runtime.Calls.Method                             1000002
System.Object[]                                             1505064
System.Int32                                               15499015

IronPython boxes lots of integers.

Objects pinned in the nursery

How many objects of which types are pinned during nursery collections?

mono$target:::gc-obj-pinned /arg4==0/ {
        @[strjoin (copyinstr (arg2), strjoin (".", copyinstr (arg3)))] = count ();
}

This is very similar to the preceding script, but we use the pinned object probe and a guard to select for the nursery, as opposed to the whole heap.

...
Microsoft.Scripting.Actions.DynamicSiteTarget`3                  863
IronPython.Runtime.Calls.CallTarget1                             864
System.Object[]                                                 1271
Microsoft.Scripting.Actions.CallSite`1                          1350
System.String[]                                                 1440

Here we see the pin counts for the 5 most pinned classes during nursery collections. Combining all nursery collections, for example, 1440 string arrays were pinned.

What if we’d like to know how many bytes of memory are pinned in the nursery during each collection?

mono$target:::gc-begin /arg0==0/ {
        bytes = 0;
}

mono$target:::gc-obj-pinned /arg4==0/ {
        bytes += arg1;
}

mono$target:::gc-end /arg0==0/ {
        @ = quantize (bytes);
}

All we have to do is add up the sizes of all pinned objects for nursery collections.

value  ------------- Distribution ------------- count
   32 |                                         0
   64 |                                         1
  128 |@                                        2
  256 |                                         0
  512 |                                         0
 1024 |                                         0
 2048 |                                         0
 4096 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  95
 8192 |                                         0

This looks a bit suspicious. We had about 100 nursery collections, so according to the object counts above there were on average 13 object arrays and 14 string arrays pinned per nursery collection, in addition to lots of other objects. All of those supposedly amount to only about 4 kilobytes of memory. Are we counting something wrong or are these very short arrays? Let’s see:

mono$target:::gc-obj-pinned /arg4==0/ {
        @[strjoin (copyinstr (arg2), strjoin (".", copyinstr (arg3)))] = quantize (arg1);
}

This gathers object size histograms per class for all objects pinned in the nursery. These are the parts of the output giving the counts for the arrays we’re interested in:

System.String[]
         value  ------------- Distribution ------------- count
             8 |                                         0
            16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    1344
            32 |@@@                                      96
            64 |                                         0

System.Object[]
         value  ------------- Distribution ------------- count
             8 |                                         0
            16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              870
            32 |@@@@@@@@@@@@@                            409
            64 |                                         0

Apparently those arrays really are very short, all around 16 to 32 bytes long. But maybe the probe is lying to us about the object size? If we know about Mono’s internal object layout we can read out the array length directly from the heap [1]:

mono$target:::gc-obj-pinned /arg4 == 0 && copyinstr (arg3) == "Object[]"/ {
        this->l = *(int*)copyin (arg0 + 12, 4);
        @[arg1] = lquantize (this->l, 0, 8);
}

On 32-bit Mono the length of an array is at offset 12, so we read 4 bytes from there and cast it to an int. We get a few graphs like these:

    16
value  ------------- Distribution ------------- count
  < 0 |                                         0
    0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 96
    1 |                                         0

    28
value  ------------- Distribution ------------- count
    2 |                                         0
    3 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2
    4 |                                         0

    20
value  ------------- Distribution ------------- count
    0 |                                         0
    1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 481
    2 |                                         0

Each of them gives us the distribution of the array lengths for array objects of specific sizes. If everything is correct then for each array object size there must only be one array length, which is exactly what we get. For example, all 481 array objects of size 20 bytes have a length of one.

In fact, the numbers show that the sizes we get are exactly what we should expect given Mono’s 32-bit object layout: a reference array of length n occupies 16 + 4*n bytes, so the 16-byte arrays are empty, the 20-byte arrays hold one element, and the 28-byte arrays hold three.

Footnotes:

[1] Don’t try this at home!