Fast Virtual Generic Calls

Until about a week ago, Mono did not have an efficient implementation of virtual generic method calls. Unlike with non-virtual generic methods, it is not known at compile time which method must be called, so some kind of look-up needs to be performed. However, unlike with non-generic virtual methods, there can be more than one instantiation of a particular virtual generic method per class, and it is not known in advance how many or which instantiations will be needed, so a traditional virtual method table cannot be used. Mono’s previous solution was to do the look-up dynamically, by calling into unmanaged code, which is very inefficient. According to a simple benchmark I wrote, performing a generic virtual call was about 200 times slower than performing a non-generic virtual call.

My first stab at implementing a fast virtual generic call mechanism involved the runtime generic context data structure I implemented for generic code sharing. The idea was to use the RGCTX as an extensible virtual generic method table. Whenever a new instantiation of a virtual generic method was needed, we would reserve a new slot for it in the RGCTX. The performance this approach achieved was quite impressive, making generic virtual calls only about 3 times slower than non-generic virtual calls, but memory usage made it impractical. For only 64 different instantiations of virtual generic methods one benchmark required 46k more memory than with the old implementation.

Another implementation idea came from Paolo, who wanted to reuse the machinery we have for doing interface method lookups. Every class’s VTable has a fixed number of slots set aside for interface methods. These slots are used as a hash table, with pointers to the individual interface method’s metadata as the keys. Since the interface method to be called is known at compile-time (but not which particular implementation of it), the hash can be computed then and used as a constant index into the hash table. Thus, an interface method call is done in almost the same way as a “normal” virtual method call. Problems occur only when more than one method hashes to the same slot. In that case, a small piece of code, called an “IMT thunk”, is generated, which resolves the lookup by doing a hard-coded search for the method’s implementation, given the method metadata pointer, which is passed as an implicit argument in interface calls.

Applying this technique to virtual generic methods requires reserving a single VTable slot for each virtual generic method, which points to a thunk that performs the look-up of the actual implementation, given its type arguments, which we represent by a single pointer. In contrast to interface methods, where we have a complete list of all methods, the required instantiations of a virtual generic method become known one by one at run-time, so we fall back into unmanaged code if the required instantiation is not found. There the thunk might be re-built to include the new instantiation.

Currently the policy for deciding which instantiations to include in the thunk is very simple: If an instantiation was invoked at least 10 times then it will be included. The potential danger of this policy is that the thunks might grow too big. We discussed a few alternatives, like having an upper limit on the number of instantiations and including only the most frequently used ones, but decided to keep the current implementation until we run into problems.

Performance numbers are a bit harder to come by for this implementation, since the speed of the look-up depends on the size of the thunk. For small thunks the performance seems to be comparable to that of my first approach, i.e. about 3 times slower than non-generic virtual calls. Interface calls are not supported yet.


Another Generic Sharing Update

Since my last report on generic code sharing I chased down a few bugs we uncovered when trying out IronPython 2.0. That new version uses the Microsoft Dynamic Language Runtime, which extensively utilizes generics. One issue we came across was how to figure out the actual method for a delegate when only the native code pointer (acquired with ldftn) and no target class is given. For example:

public class Gen<T> {
  public void work () { ... }
}

With generic code sharing the methods Gen<string>.work and Gen<object>.work share the same native code, so given only a pointer to that code it’s not possible to tell the two apart. One way to make them distinguishable would be to let ldftn produce a pointer not to the method directly but to a small piece of trampoline code, of which there would be one for each instantiation of the method. Fortunately it seems we don’t have to bother with that, since the .NET CLR doesn’t either: instead it gives you the instantiation of the method where all type arguments are object, so we do the same.

Another thing I did was implement sharing of methods of generic value types. There doesn’t seem too much code out there which utilizes generic value types extensively, but it wasn’t a big deal to implement so I went ahead and did it. Since instances of value types don’t contain VTable pointers we need to pass the runtime generic context (RGCTX) explicitly for all methods, like we do for static methods of reference types. One complication that arises here is when the value type implements an interface. When casting such a value type to the interface type it gets boxed and receives a VTable for the interface methods. Since the caller of those methods doesn’t know it’s dealing with a value type, much less which particular one, it cannot pass the RGCTX, so the methods in the interface VTable need a wrapper which will pass it. This is very similar to the wrapper we use when taking the address of a static method of a reference type (for constructing a delegate, for example).

I’ll end with an updated table of memory statistics for a few test applications. “Nemerle” is the Nemerle compiler compiling itself. “IronPython 2.0” is running pystone. “F# 1.9” is running a simple “Hello world” program on the command line and “F# 2.0” is compiling a simple program.

                  No sharing            Sharing
                  Methods    Code       Methods    Code       Memory for
                  compiled   produced   compiled   produced   (M)RGCTXs   Savings
Nemerle            7127      2008k       6159      1895k        23k         90k
IronPython 2.0     9060      1607k       5833      1011k        42k        554k
F# 1.9            15268      2187k       9828      1659k       111k        417k
F# 2.0            27186      3781k      15828      2830k       239k        712k

Sharing Generic Methods

All of my posts on sharing generic code so far have been about sharing non-generic methods of generic classes, like this one:

class Gen<T> {
  public T [] newArr (int n) {
    return new T [n];
  }
}

But what about generic methods, like this one:

class Gen<S> {
  public Dictionary<S,T> newDict<T> () {
    return new Dictionary<S,T> ();
  }
}

If we want to share this method between different instantiations, i.e. different values of the type argument T, we need to provide a place for the code to look up the type Dictionary<S,T>. This place cannot be the runtime generic context, because the data in there depends only on the type arguments of the class, i.e. S, but not on those of the generic method.

Our solution is to introduce a data structure very similar to the runtime generic context, called the method runtime generic context, or MRGCTX. It is associated not with generic classes and their type arguments, like the RGCTX, but with generic methods and their type arguments. We use the same MRGCTX for generic methods of a specific class if the method type arguments are the same. As an example, these methods would all share the same MRGCTX:

Gen<object>.foo<object,string> ()
Gen<object>.bar<object,string> ()
Gen<object>.quux<object,string> ()

while no two of these methods would use the same MRGCTX:

Gen<object>.foo<object,string> ()
Gen<object>.foo<string,object> ()
Gen<string>.foo<string,object> ()
Ban<string>.foo<string,object> ()

The MRGCTX contains, apart from the RGCTX-like slots, two items of data: A pointer to the vtable of the method’s class, and the values of the method’s type arguments. The first one is needed to get to the class’s RGCTX if no this argument is passed, i.e. in static generic methods. The type arguments are needed to instantiate new slots in the MRGCTX – without knowing what the value of T is, for example, we cannot look up the type Dictionary<S,T>.

So how much memory do we save with shared generic methods? In my previous post on sharing generic code I presented a table with the memory savings for my three large test applications. Here it is again, updated with data for sharing generic methods:

             No sharing           Sharing              Sharing w/gen. methods
             Methods   Code       Methods   Code       Methods   Code       Memory for
             compiled  produced   compiled  produced   compiled  produced   (M)RGCTXs   Savings
IronPython    3614      719k       3368      691k       3324      687k         7k         25k
Nemerle       7210     2001k       6302     1943k       6150     1891k        34k         76k
F#           15529     2193k      11431     2062k       9823     1652k       154k        387k

Note that this time I’m also counting all the memory used by (M)RGCTXs and the (M)RGCTX templates, which I didn’t do last time.

Sharing Everything and Saving Memory

Two weeks ago we hit an important milestone in our implementation of sharing generic code. To put it into the proper context I must first explain a bit about how we started sharing generic code, rehashing a few bits from an older post. The CIL has lots of instructions, only some of which need special care when dealing with generic code. Whether you’re in a generic method (be it shared or not) or in non-generic code won’t affect how two integers are added, for example. The effect of the instruction for creating a new array, however, might depend on the type arguments if you’re in generic code. And if you share generic code between different instantiations of a generic type, you have to produce code that takes the type arguments into consideration when creating that array.

When we started out with sharing generic code, we could of course not yet do that special handling for any of the instructions that required it. So we could either implement it for all instructions in one swoop or do the instructions one by one. We of course chose the latter. For that to work we needed some way of treating the instructions we could not yet handle; otherwise we could only have run trivial test applications. Our solution was to provide a fallback. Whenever we came across some piece of generic code we would try to compile it for sharing. If we came across an instruction we could not handle yet, we would abort the compilation and just compile a specialized version of the code, like we used to do before we had generic sharing.

That is all in the past now, however, because as of about two weeks ago we can finally handle every single instruction. There are still some other limitations that prevent us from sharing all generic code (which I will discuss in some other blog post), but it’s a pretty important step.

To celebrate this breakthrough I did some benchmarks. In terms of speed there are several advantages and disadvantages one should expect from generic code sharing. On the one hand fewer methods are compiled, so less time should be spent in the JIT. On the other hand the generated code is a bit more complicated, so it should run more slowly, unless the savings in generated code size have a positive impact on instruction cache behavior, in which case it should run faster. Having said all that, I did not detect any significant speed difference in any of the tests I ran, which were: contrived benchmarks for the List<T> and Dictionary<S,T> classes, IronPython running pystone, the Nemerle compiler compiling itself, and F# compiling and running a hello world program.

The next thing I looked at was how much memory we were saving by sharing generic code. Here are the details for the non-contrived benchmarks, on x86:

             No sharing           Sharing
             Methods   Code       Methods   Code
             compiled  produced   compiled  produced   Savings
IronPython    3614      719k       3368      691k       28k
Nemerle       7210     2001k       6302     1943k       58k
F#           15529     2193k      11431     2062k      131k

We’re not saving terribly much, but it’s something. And, as I’ve said, we’re not sharing all the code we eventually will, so there is still room for improvement. I’ll collect some more data this week to find out how much more room exactly, and will report in due time on this blog.

Another thing I wanted to find out was how much memory we’re using for the runtime generic context data structures that we need for supporting shared code. For IronPython we allocated about 10k of memory for runtime generic contexts, which still left us with memory savings of 18k. For Nemerle, however, we allocated 230k and for F# no less than 600k, which meant that we were actually using more memory rather than less with sharing.

After some period of consternation, thinking, and collecting further statistics, I wrote an email outlining the situation and proposing a solution which I then implemented last week.

Using the data structure described in that email reduced the amount of memory needed for runtime generic contexts from 230k to less than 6k for Nemerle and from 600k to less than 10k for F#. In addition I reduced the memory required for the runtime generic context templates, which store the actual layout and content of the runtime generic contexts, by about two thirds; they now weigh in at about 2k for Nemerle and 4k for F#.

It turns out that it’s not as easy as I imagined to make that new runtime generic context data structure thread-safe, however. Using hazard pointers to guard the runtime generic context from getting freed is just too much overhead on every lookup operation, so I revised the small data structure to be of a fixed size and hence not require re-allocation. That means that space would be wasted in contexts which require fewer slots than there is space for, but I hoped that I could find a reasonable compromise.

Experimenting a bit with my benchmarks I found out that the best size for saving memory is 3 entries. With that setting the runtime generic contexts for F# now use up about 14k, and for Nemerle about 6k, which is still very reasonable.

Since the managed part of the lookup code can only handle the small, fixed table of the runtime generic context with 3 slots, and lookups which do not succeed there have to call into unmanaged code, it’s rather important that the small table contain the most used slots. I did some benchmarks and found that about 10% of all lookup operations had to call into unmanaged code for F#, whereas it was about 80% for Nemerle, which is clearly not acceptable. My next goal was therefore to make the small, fixed table of the runtime generic context act as a cache for the potentially much larger variable-size table.

What I did was very simple: Whenever a lookup in the small table fails, but succeeds in the variable-size table, the result of the lookup operation, i.e. the found slot, is put into the first place in the small table. To make room for the new slot the last slot of the small table is evicted into the variable-size table and the rest of the slots are shifted down by one place. This must be done carefully to ensure that lookup operations from other threads which might occur in parallel don’t get wrong results.

With that new code in place the lookup failure rate for F# goes down to less than 2% and that of Nemerle to about 10%, which I find acceptable. It doesn’t go down significantly any more if the size of the small table is increased to 4.

Repeating the speed benchmarks again with the new code still showed no significant performance differences between sharing and not sharing generic code. I’m happy.

Fine print: This code is not in SVN yet, but we’re working on it.

Update: Fixed a bug in the table with the statistics.

Sharing Static Methods

I wrote about a data structure we call the “runtime generic context” in my previous posts on sharing generic code. We use this data structure to store type information that shared generic code needs to access and that is specific to each instantiation of the generic type.

It follows that shared generic code must have easy access to the runtime generic context – otherwise it couldn’t efficiently access that data. Non-static methods of generic classes access the runtime generic context through the vtable which they can get to via the implicit “this” argument. Static methods do not get this implicit argument, however, and need some other means of accessing the runtime generic context.

Our solution to this problem is to pass the runtime generic context as an additional argument to shared static methods. By doing this we create another problem: The pointer to the code of the method suddenly isn’t enough anymore to call the method. That becomes an issue when the method is to be used as a delegate, for example, where the caller does not and cannot know that the method needs an additional argument.

What we do in those cases is provide an additional piece of code that calls the method and also passes the correct runtime generic context. As such, this piece of code is specific to an instantiation of a generic type, i.e. it is not shared between instantiations (which is the whole point). Right now this piece of code is a wrapper but we will probably replace it by a piece of custom-generated machine code which will be much shorter and more efficient.

Side-note: Methods of generic value types can be handled in exactly the same way. Value type objects don’t have vtables so we have to pass the runtime generic context explicitly, just like for static methods.

Generic Types are Lazy

In my last post about my work on generics sharing in Mono I expanded on the runtime generic context and its extensible part. Today I want to talk about when the runtime generic context is actually filled, i.e. at which point the actual metadata information is put into the slots.

There are two “natural” strategies for doing that: Strictly and lazily. The strict strategy is to fill the runtime generic context when it is created, whereas doing it lazily means to instantiate the slots when they are actually needed by managed code.

At first it seems that doing it strictly is the way to go because it’s less overhead for the managed code, which can just assume that the values are there instead of having to check and call into unmanaged code if they are not. Unfortunately, we need to be able to handle infinitely deep types, like this canonical example:

class A<T> ...
class B<T> : A<B<B<T>>> ...

The transitive closure of the superclasses involved in this short example is infinite. The class B<T> has the superclass A<B<B<T>>>. The class B<B<T>>, which is involved in that type, has the superclass A<B<B<B<T>>>> in turn, and so on. Nonetheless, that code has to run, so strict instantiation in all cases is not an option.

At first we thought we could keep using strict instantiation in most cases and use lazy instantiation only in some cases by looking at the depth of the types involved, or, more specifically, the depth at which type variables occur. Very informally speaking, we can define the depth of an occurrence of a type variable in a type as the number of angle brackets you have to open to get to that occurrence. For example, the T in T has a depth of zero because it has no angle brackets around it at all. The T in A<T> has depth one, and the T in Dictionary<List<List<string>>,List<T>> has depth two. Since what we have to avoid is the creation of ever deeper types, it seemed to us that if we used lazy instantiation in all the cases where the type depth increases, and strict instantiation everywhere else, the problem would be solved.

Unfortunately, that is wrong. What we were missing was that you can “transform” a dependency where the type depth does not increase into one where it does by subclassing, while being stuck with the original decision to instantiate strictly. Let’s look at this example:

class A<T> ...
 
class B<T>
  someMethod ()
    ... A<T> ...
 
class C<T> : B<C<C<T>>> ...

A<T> is completely harmless. B<T> needs A<T> in someMethod(), so A<T> will be added to the extensible part of B<T>. The nesting depth of A<T> is the same as of B<T>, so we make the decision to instantiate it strictly. Therefore someMethod() will access that particular slot in the runtime generic context without a check, assuming that it is already instantiated, so that slot will also need to be instantiated strictly in every subclass of B<T>. Then we have C<T>, which is a subclass of B<C<C<T>>>. Because it’s a subclass of B<X>, it needs to have A<X> in its runtime generic context. However, for C<T>, X is C<C<T>>, so we’ll need to put A<C<C<T>>> into that slot, and we have an infinite recursion again, of the same form as in the first example.

So, unfortunately, we have to use lazy instantiation for all slots. The (potentially) good news is that we can bring the run-time overhead for it down to almost zero by employing memory management tricks, should we decide that it’s worth the effort to implement that. The current implementation takes about 5 machine instructions to fetch a slot from the runtime generic context.

Other Types

In my last longer post on generics sharing in Mono I talked about the three different interesting ways an open type that’s used in a method of a generic class (to create an array, for example) can be related to that generic class: The used type can be one of the class’s type arguments, it can be the class itself or one of its superclasses, or it can be more or less unrelated to the class (where we treat all degrees of more or less the same). I talked about the “run-time generic context”, which is a data structure that stores various pieces of type information in connection to a closed generic class, so that shared code can execute efficiently and doesn’t need to do complicated look-up maneuvers to get to those pieces of data. I also explained that the third category of relatedness, namely “other”, is different from the first two in that we cannot know which, or even how many of those other types there will be when we load the class or compile one of its methods, and that we therefore need an extensible data structure to store them. This post is about that data structure.

To use an example from the post alluded to above, let’s say we have a class GenA<T> which has a method newSomeClassArr() that we are in the process of compiling:

public SomeClass<T>[] newSomeClassArr () {
  return new SomeClass<T> [3];
}

We encounter a NEWARR instruction telling us to create an array with element type SomeClass<T>. To create an array we need the internal meta-data of the element type, so SomeClass<T> must be in the run-time generic context. Since it is unrelated to GenA<T>, i.e. it is not a type argument nor a subclass of the latter, we have to put it in the extensible part of the run-time generic context.

To make things easy let’s assume that the run-time generic context contains an extensible array, so we can just append elements without having to worry about memory management or complex data structures. What exactly does it mean to put that class into the run-time generic context? I have already explained that every closed generic class, i.e. every instantiation of an open generic class, has its own run-time generic context, so there might already be several run-time generic contexts of some instantiations of GenA<T> in the wild where we need to put that information in. Furthermore, we have to make sure that for every new instantiation that we might create later on, that information is put in there, too. To that latter end we keep track of all the type information stored in the run-time generic contexts of the instantiations of an open generic class in another structure we call the “run-time generic context template”.

Things get a bit more complicated, however. Not only do we have to make sure that all instantiations of GenA<T> contain the meta-data information of SomeClass<T>, we must do the same for all subclasses of GenA<T> as well, since they might also execute the newSomeClassArr() method. This has two interesting consequences: First, we sometimes have to create run-time generic contexts for non-generic classes, and second, we cannot just pick any unused slot in the extensible array for a new type.

Non-generic classes can be subclasses of generic classes, and since we have to create run-time generic contexts for all subclasses of a generic class, it follows that some non-generic classes, namely those which are direct or indirect subclasses of generic ones, need one too.

The second point has to do with the downward propagation, in the inheritance tree, of entries in the run-time generic context. If we put a type into a specific slot for some class, then we have to put the corresponding type into the same slot of all subclasses of that class. That means that we have to pick a slot that is not already occupied by some other type in one or more of the subclasses. We solve this problem by not only propagating entries downward but also by marking the slot we selected as occupied in all the parent classes (to be precise: in their run-time generic context templates).

The last remaining mystery is the question how the slots are actually represented in the run-time generic context. For now we have decided to put a fixed number of slots directly into the run-time generic context and use an indirect slot for the rest. That indirect slot points to an array of slots that can be reallocated if more space is needed. Other approaches are possible and have been discussed, but we’re taking our chances with this simple solution and see how it works out.