Two weeks ago we hit an important milestone in our implementation of sharing generic code. To put it into the proper context I must first explain a bit about how we started sharing generic code, rehashing a few bits from an older post. The CIL has lots of instructions, only some of which need special care when dealing with generic code. Whether you’re in a generic method (be it shared or not) or in non-generic code won’t affect how two integers are added, for example. The effect of the instruction for creating a new array, however, might depend on the type arguments if you’re in generic code. And if you share generic code between different instantiations of a generic type, you have to produce code that takes the type arguments into consideration when creating that array.
When we started out with sharing generic code, we could of course not yet do that special handling for any of the instructions that required it. So we could either implement it for all instructions in one swoop or do the instructions one by one. We did of course choose to do the latter. For that to work we needed some way of treating the instructions we could not handle yet, or otherwise we could have only run trivial test applications. Our solution was to provide a fallback. Whenever we came across some piece of generic code we would try to compile it for sharing. If we came across an instruction we could not handle yet, we would abort the compilation and just compile a specialized version of the code, like we used to do before we had generic sharing.
That is all in the past now, however, because since about two weeks we can finally handle every single instruction. There are still some other limitations that prevent us from sharing all generic code (which I will discuss in some other blog post), but it’s a pretty important step.
To celebrate this breakthrough I did some benchmarks. In terms of speed there are several advantages and disadvantages one should expect from generic code. On the one hand less methods are compiled, so the time spent in the JIT should be less. On the other hand the generated code is a bit more complicated, so it should run slower. Unless the savings in generated code size have a positive impact on instruction cache behavior, in which case it should run faster. Having said all that I did not detect any significant speed difference in any of the tests I ran, which were: Contrived benchmarks for the List<T> and Dictionary<S,T> classes, IronPython running pystone, the Nemerle compiler compiling itself and F# compiling and running a hello world program.
The next thing I looked at was how much memory we were saving by sharing generic code. Here are the details for the non-contrived benchmarks, on x86:
We’re not saving terribly much, but it’s something. And, as I’ve said, we’re not sharing all the code we eventually will, so there is still room for improvement. I’ll collect some more data this week to find out how much more room exactly, and will report in due time on this blog.
Another thing I wanted to find out was how much memory we’re using for the runtime generic context data structures that we need for supporting shared code. For IronPython we allocated about 10k of memory for runtime generic contexts, which still left us with memory savings of 18k. For Nemerle, however, we allocated 230k and for F# no less than 600k, which meant that we were actually using more memory rather than less with sharing.
After some period of consternation, thinking, and collecting further statistics, I wrote an email outlining the situation and proposing a solution which I then implemented last week.
Using the data structure described in that email reduced the amount of memory needed for runtime generic contexts from 230k for Nemerle to less than 6k and from 600k for F# to less than 10k. In addition I reduced the memory required for the runtime generic context templates, which store the actual layout and content of the runtime generic contexts by about two thirds, and they now weigh in at about 2k for Nemerle and 4k for F#.
It turns out that it’s not as easy to make that new runtime generic context data structure thread-safe as I imagined, however. Using hazard pointers to guard the runtime generic context from getting freed is just too much overhead to do on every lookup operation, so I revised the small data structure to be of a fixed size and hence not require re-allocation. That means that space would be wasted in contexts which require less slots than there is space for, but I hoped that I could get a reasonable compromise.
Experimenting a bit with my benchmarks I found out that the best size for saving memory is 3 entries. With that setting the runtime generic contexts for F# now use up about 14k, and for Nemerle about 6k, which is still very reasonable.
Since the managed part of the lookup code can only handle the small, fixed table of the runtime generic context with 3 slots, and lookups which do not succeed there have to call into unmanaged code it’s rather important that the small table contains the most used slots. I did some benchmarks and found that about 10% of all lookup operations had to call into unmanaged code for F#, whereas it was about 80% for Nemerle, which is clearly not acceptable. My next goal was therefore to make the small, fixed table of the runtime generic context act as a cache for the potentially much larger variable-size table.
What I did was very simple: Whenever a lookup in the small table fails, but succeeds in the variable-size table, the result of the lookup operation, i.e. the found slot, is put into the first place in the small table. To make room for the new slot the last slot of the small table is evicted into the variable-size table and the rest of the slots are shifted down by one place. This must be done carefully to ensure that lookup operations from other threads which might occur in parallel don’t get wrong results.
With that new code in place the lookup failure rate for F# goes down to less than 2% and that of Nemerle to about 10%, which I find acceptable. It doesn’t go down significantly any more if the size of the small table is increased to 4.
Repeating the speed benchmarks again with the new code still showed no significant performance differences between sharing and not sharing generic code. I’m happy.
Fine print: This code is not in SVN yet, but we’re working on it.
Update: Fixed a bug in the table with the statistics.