Martin Bartenstein
Martin Bartenstein, Austria’s Minister of Economics and Labour, at the Vienna City Marathon.
Martin Bartenstein, Austria’s Minister of Economics and Labour, at the Vienna City Marathon.
Two weeks ago we hit an important milestone in our implementation of sharing generic code. To put it into the proper context I must first explain a bit about how we started sharing generic code, rehashing a few bits from an older post. The CIL has lots of instructions, only some of which need special care when dealing with generic code. Whether you’re in a generic method (be it shared or not) or in non-generic code won’t affect how two integers are added, for example. The effect of the instruction for creating a new array, however, might depend on the type arguments if you’re in generic code. And if you share generic code between different instantiations of a generic type, you have to produce code that takes the type arguments into consideration when creating that array.
When we started out with sharing generic code, we could of course not yet do that special handling for any of the instructions that required it. So we could either implement it for all instructions in one swoop or do the instructions one by one. We did of course choose to do the latter. For that to work we needed some way of treating the instructions we could not handle yet, or otherwise we could have only run trivial test applications. Our solution was to provide a fallback. Whenever we came across some piece of generic code we would try to compile it for sharing. If we came across an instruction we could not handle yet, we would abort the compilation and just compile a specialized version of the code, like we used to do before we had generic sharing.
That is all in the past now, however, because since about two weeks we can finally handle every single instruction. There are still some other limitations that prevent us from sharing all generic code (which I will discuss in some other blog post), but it’s a pretty important step.
To celebrate this breakthrough I did some benchmarks. In terms of speed there are several advantages and disadvantages one should expect from generic code. On the one hand less methods are compiled, so the time spent in the JIT should be less. On the other hand the generated code is a bit more complicated, so it should run slower. Unless the savings in generated code size have a positive impact on instruction cache behavior, in which case it should run faster. Having said all that I did not detect any significant speed difference in any of the tests I ran, which were: Contrived benchmarks for the List<T> and Dictionary<S,T> classes, IronPython running pystone, the Nemerle compiler compiling itself and F# compiling and running a hello world program.
The next thing I looked at was how much memory we were saving by sharing generic code. Here are the details for the non-contrived benchmarks, on x86:
| No sharing | Sharing | ||||
| Methods compiled |
Code produced |
Methods compiled |
Code produced |
Savings | |
| IronPython | 3614 | 719k | 3368 | 691k | 28k |
| Nemerle | 7210 | 2001k | 6302 | 1943k | 58k |
| F# | 15529 | 2193k | 11431 | 2062k | 131k |
We’re not saving terribly much, but it’s something. And, as I’ve said, we’re not sharing all the code we eventually will, so there is still room for improvement. I’ll collect some more data this week to find out how much more room exactly, and will report in due time on this blog.
Another thing I wanted to find out was how much memory we’re using for the runtime generic context data structures that we need for supporting shared code. For IronPython we allocated about 10k of memory for runtime generic contexts, which still left us with memory savings of 18k. For Nemerle, however, we allocated 230k and for F# no less than 600k, which meant that we were actually using more memory rather than less with sharing.
After some period of consternation, thinking, and collecting further statistics, I wrote an email outlining the situation and proposing a solution which I then implemented last week.
Using the data structure described in that email reduced the amount of memory needed for runtime generic contexts from 230k for Nemerle to less than 6k and from 600k for F# to less than 10k. In addition I reduced the memory required for the runtime generic context templates, which store the actual layout and content of the runtime generic contexts by about two thirds, and they now weigh in at about 2k for Nemerle and 4k for F#.
It turns out that it’s not as easy to make that new runtime generic context data structure thread-safe as I imagined, however. Using hazard pointers to guard the runtime generic context from getting freed is just too much overhead to do on every lookup operation, so I revised the small data structure to be of a fixed size and hence not require re-allocation. That means that space would be wasted in contexts which require less slots than there is space for, but I hoped that I could get a reasonable compromise.
Experimenting a bit with my benchmarks I found out that the best size for saving memory is 3 entries. With that setting the runtime generic contexts for F# now use up about 14k, and for Nemerle about 6k, which is still very reasonable.
Since the managed part of the lookup code can only handle the small, fixed table of the runtime generic context with 3 slots, and lookups which do not succeed there have to call into unmanaged code it’s rather important that the small table contains the most used slots. I did some benchmarks and found that about 10% of all lookup operations had to call into unmanaged code for F#, whereas it was about 80% for Nemerle, which is clearly not acceptable. My next goal was therefore to make the small, fixed table of the runtime generic context act as a cache for the potentially much larger variable-size table.
What I did was very simple: Whenever a lookup in the small table fails, but succeeds in the variable-size table, the result of the lookup operation, i.e. the found slot, is put into the first place in the small table. To make room for the new slot the last slot of the small table is evicted into the variable-size table and the rest of the slots are shifted down by one place. This must be done carefully to ensure that lookup operations from other threads which might occur in parallel don’t get wrong results.
With that new code in place the lookup failure rate for F# goes down to less than 2% and that of Nemerle to about 10%, which I find acceptable. It doesn’t go down significantly any more if the size of the small table is increased to 4.
Repeating the speed benchmarks again with the new code still showed no significant performance differences between sharing and not sharing generic code. I’m happy.
Fine print: This code is not in SVN yet, but we’re working on it.
Update: Fixed a bug in the table with the statistics.
Looking through my spam folder I came across the subject “Make your love wand function better”. At first I wondered what a love wand function might be - something similar to a hash function maybe? It took me more than two seconds to realize that “function” is not used as a noun but as a verb in that imperative sentence.
All I can say is: The horse raced past the barn fell and time flies like an arrow.
I wrote about a data structure we call the “runtime generic context” in my previous posts on sharing generic code. We use this data structure to store type information that shared generic code needs to access that is specific for each instantiation of the generic type.
It follows that shared generic code must have easy access to the runtime generic context - otherwise it couldn’t efficiently access that data. Non-static methods of generic classes access the runtime generic context through the vtable which they can get to via the implicit “this” argument. Static methods do not get this implicit argument, however, and need some other means of accessing the runtime generic context.
Our solution to this problem is to pass the runtime generic context as an additional argument to shared static methods. By doing this we create another problem: The pointer to the code of the method suddenly isn’t enough anymore to call the method. That becomes an issue when the method is to be used as a delegate, for example, where the caller does not and cannot know that the method needs an additional argument.
What we do in those cases is provide an additional piece of code that calls the method and also passes the correct runtime generic context. As such, this piece of code is specific to an instantiation of a generic type, i.e. it is not shared between instantiations (which is the whole point). Right now this piece of code is a wrapper but we will probably replace it by a piece of custom-generated machine code which will be much shorter and more efficient.
Side-note: Methods of generic value types can be handled in exactly the same way. Value type objects don’t have vtables so we have to pass the runtime generic context explicitly, just like for static methods.
A few years ago Herbert Pötzl and I have developed a native MacOS X version of MathMap we called “MathMap Cocoa”. We haven’t updated it since then and it has become quite dated. Until now, that is. In a few hacking sessions, the last one this weekend, we updated MathMap Cocoa to correspond to the latest stable MathMap release, 1.2.4. It even comes with an introductory video, showing off, among other things, movie creation. Give it a go and download MathMap Cocoa!
In my last post about my work on generics sharing in Mono I expanded on the runtime generic context and its extensible part. Today I want to talk about when the runtime generic context is actually filled, i.e. at which point the actual metadata information is put into the slots.
There are two “natural” strategies for doing that: Strictly and lazily. The strict strategy is to fill the runtime generic context when it is created, whereas doing it lazily means to instantiate the slots when they are actually needed by managed code.
At first it seems that doing it strictly is the way to go because it’s less overhead for the managed code, which can just assume that the values are there instead of having to check and call into unmanaged code if they are not. Unfortunately, we need to be able to handle infinitely deep types, like this canonical example:
class A<T> ...
class B<T> : A<B<B<T>>> …
The transitive closure of the superclasses involved in this short example is infinite. The class B<T> has the superclass A<B<B<T>>>. The class B<B<T>>, which is involved in that type, has the superclass A<B<B<B<T>>>> in turn, and so on. Nonetheless, that code has to run, so strict instantiation in all cases is not an option.
At first we thought we could keep using strict instantiation in most cases and use lazy instantiation only in some cases by looking at the depth of the types involved, or, more specifically, the depth at which type variables occur. Very informally speaking, we can define the depth of an occurrence of a type variable in a type as the number of angle brackets you have to open to get to that occurrence. For example, the T in T has a depth of zero because it has no angle brackets around it at all. The T in A<T> has depth one, and the T in Dictionary<List<List<string>>,List<T>> has depth two. Since what we have to avoid is the creation of ever deeper types, it seemed to us that if we used lazy instantiation in all the cases where the type depth increases, and strict instantiation everywhere else, the problem would be solved.
Unfortunately, that is wrong. What we were missing was that you can “transform” a dependency where the type depth does not increase into one where it does by subclassing, while being stuck with the original decision to instantiate strictly. Let’s look at this example:
class A<T> ...
class B<T>
someMethod ()
… A<T> …
class C<T> : B<C<C<T>>> …
A<T> is completely harmless. B<T> needs A<T> in someMethod(), so A<T> will be added to the extensible part of B<T>. The nesting depth of A<T> is the same as of B<T>, so we make the decision to instantiate it strictly. Therefore someMethod() will access that particular slot in the runtime generic context without a check, assuming that it is already instantiated, so that slot will also need to be instantiated strictly in every subclass of B<T>. Then we have C<T>, which is a subclass of B<C<C<T>>>. Because it’s a subclass of B<X>, it needs to have A<X> in its runtime generic context. However, for C<T>, X is C<C<T>>, so we’ll need to put A<C<C<T>>> into that slot, and we have an infinite recursion again, of the same form as in the first example.
So, unfortunately, we have to use lazy instantiation for all slots. The (potentially) good news is that we can bring the run-time overhead for it down to almost zero by employing memory management tricks, should we decide that it’s worth the effort to implement that. The current implementation takes about 5 machine instructions to fetch a slot from the runtime generic context.
In one of his latest essays Paul Graham seems to try to defend his new language Arc against the criticism it has evoked. His essay, in a nutshell, says this (my words):
People criticize Arc. People have criticized my work before and it went on to be successful. Therefore their criticism against Arc is wrong.
That is not an argument. It is merely rethorics.
This week at Novell we had another Hack Week, for which we are encouraged to work on some new or interesting project of our own choice. I decided to spend the week working on MathMap, which I’ve blogged about before. I’ve improved the composer to a point where it’s actually usable for doing real stuff. That work has made it into the new release 1.3.2.
As a bonus I’ve begun implementing a new feature which will greatly extend the range of image manipulations possible with MathMap. I will talk about this in detail when it is at a point where I can give an actual demonstration.
What I’ve also done is put together a new screencast. This time it’s not about a new feature but is an introduction to the small language that’s at the core of MathMap:
Paul Graham has released his Lisp dialect Arc. I am underwhelmed. It’s basically a standard Lisp with some syntactic sugar. There are already too many Lisp dialects. Hell, there are too many implementations of Common Lisp and Scheme, the two main Lisp dialects. Do we really need yet another one?
In my last longer post on generics sharing in Mono I talked about the three different interesting ways an open type that’s used in a method of a generic class (to create an array, for example) can be related to that generic class: The used type can be one of the class’s type arguments, it can be the class itself or one of its superclasses, or it can be more or less unrelated to the class (where we treat all degrees of more or less the same). I talked about the “run-time generic context”, which is a data structure that stores various pieces of type information in connection to a closed generic class, so that shared code can execute efficiently and doesn’t need to do complicated look-up maneuvers to get to those pieces of data. I also explained that the third category of relatedness, namely “other”, is different from the first two in that we cannot know which, or even how many of those other types there will be when we load the class or compile one of its methods, and that we therefore need an extensible data structure to store them. This post is about that data structure.
To use an example from the post alluded to above, let’s say we have a class GenA<T> which has a method newSomeClassArr() that we are in the process of compiling:
public SomeClass<T>[] newSomeClassArr () {
return new SomeClass<T> [3];
}
We encounter a NEWARR instruction telling us to create an array with element type SomeClass<T>. To create an array we need the internal meta-data of the element type, so SomeClass<T> must be in the run-time generic context. Since it is unrelated to GenA<T>, i.e. it is not a type argument nor a subclass of the latter, we have to put it in the extensible part of the run-time generic context.
To make things easy let’s assume that the run-time generic context contains an extensible array, so we can just append elements without having to worry about memory management or complex data structures. What exactly does it mean to put that class into the run-time generic context? I have already explained that every closed generic class, i.e. every instantiation of an open generic class, has its own run-time generic context, so there might already be several run-time generic contexts of some instantiations of GenA<T> in the wild where we need to put that information in. Furthermore, we have to make sure that for every new instantiation that we might create later on, that information is put in there, too. To that latter end we keep track of all the type information stored in the run-time generic contexts of the instantiations of an open generic class in another structure we call the “run-time generic context template”.
Things get a bit more complicated, however. Not only do we have to make sure that all instantiations of GenA<T> contain the meta-data information of SomeClass<T>, we must do the same for all subclasses of GenA<T> as well, since they might also execute the newSomeClassArr() method. This has two interesting consequences: First, we sometimes have to create run-time generic contexts for non-generic classes, and second, we cannot just pick any unused slot in the extensible array for a new type.
Non-generic classes can be subclasses of generic classes, and since we have to create run-time generic contexts for all subclasses of a generic class, it follows that some non-generic classes, namely those which are direct or indirect subclasses of generic ones, need one too.
The second point has to do with the downward propagation, in the inheritance tree, of entries in the run-time generic context. If we put a type into a specific slot for some class, then we have to put the corresponding type into the same slot of all subclasses of that class. That means that we have to pick a slot that is not already occupied by some other type in one or more of the subclasses. We solve this problem by not only propagating entries downward but also by marking the slot we selected as occupied in all the parent classes (to be precise: in their run-time generic context templates).
The last remaining mystery is the question how the slots are actually represented in the run-time generic context. For now we have decided to put a fixed number of slots directly into the run-time generic context and use an indirect slot for the rest. That indirect slot points to an array of slots that can be reallocated if more space is needed. Other approaches are possible and have been discussed, but we’re taking our chances with this simple solution and see how it works out.
Powered by WordPress.com