.Net Garbage Collector Pain

How well does your garbage collection method scale?

A simple iteration loop should scale linearly right? Wrong! The .net garbage collector commits a cardinal sin.

The four graphs below (which I will described in more detail later) clearly show On2 scaling of a process where execution time is plotted against number of iterations. The really scary thing is that the process which created these graphs should be linear, it is just a loop!

So what is causing the non linear scaling.  It is the .net garbage collector.  This is the cardinal sin I am referring to in the title; the garbage collector is forcing a piece of code to scale worse than linear and the developer of that code can do little to nothing about it.  This is a major challenge to managed code as it matures into the approach of choice for business applications.  Scaling kills applications when then move from demonstration to production.  So having a scaling problem in the runtime infrastructure is a nightmare we could do without.  All is not lost though - there are partial solutions; however, this is another wake-up call.  As developers and architects we must not simply rely on systems scaling and behaving the way we expect under production loads.  We must test and we must fully appreciate the underlying computer science of anything we expect to actually work.  Assume something will "just work" and we are done for.

Loop scaling equation fits

Putting 4 Million Objects References On The Heap Made And Unrelated Loop Run 15 Times Slower

This is the result which alerted me to the reality of this problem.  I knew .net's mark sweep garbage collector should scale linearly with the number of references on the heap, but I had not realized the extent of the problems this can cause.  You  might think that 4 million object references is a ridiculous number, but when you consider a transactional system processing XML documents as DOM trees you may change your mind.  In this situation, as with many other business activities (like database records set processing) 4 million is small - not big.

To demonstrate the point I used some VSPL:

4000000 !Count
timer !time
[
?Count,1>Diff !Count,0>GT
]
Loop
timer,?time>Diff ," 4 000 000 iteration with no large data structure">Concat Print
4000000 !Count
@ !result
timer !time
[
"Hello",?result>@Push
?Count,1>Diff !Count,0>GT
]
Loop
timer,?time>Diff ," 4 000 000 ""Hello"" into list">Concat Print
4000000 !Count
timer !time
[
?Count,1>Diff !Count,0>GT
]
Loop
timer,?time>Diff ," 4 000 000 iteration with large data structure">Concat Print
@ !result
4000000 !Count
timer !time
[
?Count,1>Diff !Count,0>GT
]
Loop
timer,?time>Diff ," 4 000 000 iteration with no large data structure">Concat Print
This is actually 4 loops.  The first, third and fourth just iterate around an empty loop.  The second loop creates a list of 4 million references to the same string.  The references are accumulated in a System.Collections.ArrayList object.  I ran this piece of code in VSPL in an implementation which created new objects even when iterating an empty loop (technically, the accumulator was created fresh each time it was used).  I then patched the VSPL so it no longer called the new operator inside empty loops. The results are below:
Unpatched timings
1.5 -> 4 000 000 iterations with no large data structure
18.41 -> 4 000 000 "Hello" into list
27.12 -> 4 000 000 iterations with large data structure
1.75 -> 4 000 000 iterations with no large data structure

Patched timings
1.88 -> 4 000 000 iteration with no large data structure
7.45 -> 4 000 000 "Hello" into list
2.08 -> 4 000 000 iteration with large data structure
2.02 -> 4 000 000 iteration with no large data structure
The most interesting result is the one in bold text.  Before the patch, just by having the 4 million references on the heap meant an unrelated loop ran some 15 times more slowly!  The patched results show the empty loops running slightly slower (the patch requires some extra function calls in the inner working of VSPL) but the creation of the 4 million references is much faster (7.45s versa 27.12) and the impact on the third, empty loop is now negligible.  So we can now see that the .net garbage collector is having an massive  impact on any code which happens to be running when a garbage collection is called.  As garbage collections are stimulated by when new calling, garbage collection scaling explains the above results.

Don't Call New If You Can Avoid It!

The patch I did which helped VSPL was to allow object re-use.  The drawback of the patch was that extra code has to execute to 'know' when an object can be re-used or not.  This has a performance impact and a complexity impact.  In other words, I am having to code around the issue of scaling in the garbage collector so some of the simplicity benefits of having a collector are thrown away.  What is more, in a language like C# you cannot get away from using the heap, you simply cannot create objects on the stack like you can in C++.  This means means C#, VB.net, Java and the like all have 'baked in scaling problems'.  This really worried me (still does actually) so I ran some more tests to double check my thinking and try to understand the impact a little more.

   This graph shows scaling of the reference list when
 the all the loops were in the same process.

The first thing I tried was writing VSPL script which created a list (implemented in C# as and ArrayList) of references (as above) of increasing size.  I did this with the newly patched VSPL implementation which does not its self cause new to be called inside loops.  This meant that the only object calling new was the ArrayList implementation its self.  The System.Collections.ArrayList is a wrapper around an array.  The array has an initial capacity and then when that capacity is exhausted, it creates a new internal array and copies the contents over into that one.  This means that in this case, new was being called from inside System.Collections.ArrayList as the list grew.

The results were startling to say the least.  Rather than a nice smooth graph, it started off smooth but once I got above 5 million references, it went all zig zag.  Initially I just assumed these results were something else happening on the computer.  But they are completely repeatable.  The only explanation I can think of is that, under some circumstances, the data from the previous list is not completely garbage collected before the next loop.  This means that the garbage collector is causing the slow downs.  The choices of when the large array object gets garbage collected seems to depend on its size and the size of other object, or maybe the number of collection cycles.


This graph shows scaling of the reference list when
 the the loops were in difference processes.

Repeating the test, but putting each loop in a new process (thus separate heaps) makes the graph behave the way one would expect.  One can see from the graphs in "Loop scaling equation fits" that the best fit for this is an On2 scaling.  Both a binomial and trinomial fit have R=1 (perfect fit); however, the trinomial has a trivial cubic component.  Other equations (like exponential and power series) fit less well (R<1).  Therefore was can safely conclude that the program scales as the square of the number of iterations of the loop.

To finally nail that this is the garbage collector causing this scaling I needed to discount any other On2 scaling algorithms.  You may have already spotted that the ArrayList implementation scales a On1.5 in this usage scenario.  This is because of the copy operation each time the capacity is increased.  However, as the capacity is doubled each time, the frequency of capacity increases dies off, hence the index of 1.5 not 2.  However, the following table discounts this as a possibility and pretty much proves that it is the garbage collector which  is the culprit:

No Presize Equal Presize Double Presize
Counts Time Counts Time Counts Time
1000000 1.09 1000000 1.05 1000000 1.05
2000000 2.69 2000000 2.64 2000000 2.62
3000000 4.92 3000000 4.83 3000000 4.84
4000000 7.62 4000000 7.55 4000000 7.5
5000000 10.91 5000000 10.78 5000000 10.75
6000000 14.75 6000000 14.56 6000000 14.56
7000000 19.06 7000000 18.9 7000000 18.87
8000000 23.94 8000000 23.77 8000000 23.73
9000000 29.7 9000000 29.34 9000000 29.31

I added the ability of VSPL to pre-allocate the capacity of the ArrayList, This meant the implementation without garbage collection should scale as  On2 . Then, to be extra careful, I ran the test with the pre-allocation set to the eventual size of the list and double the eventual size.  We can see from the table that the timings are nearly identical for all three setups. This clearly shows that scaling in the allocation of the ArrayList is not a noticeably contributing factor to the scaling of the program.

So What Does This Mean To Me?

Things you should be careful about and do to help:

  • All those people who say you should pool objects and avoid using the new operator - believe them!
  • Avoid using new inside loops like the plague.
  • "Scale test" applications to large memory footprints.  Assumptions that it will keep scaling OK from small tests are no use because garbage collection scaling might only kick in at larger sizes.  For example, if at 1 million objects, linear scaling factors are dominate, but binomial scaling garbage collection effects may 'catch up' once you go above 10 million. So testing from 100 thousand to 1 million is no use at all in telling if the application will work in production at 10 million.  To know if it will work at production, test at production memory pressures.
  • If two sub-systems, which never interact, work OK separately, it does not mean that they will work OK when included in the same application.  The garbage collector effectively means that if two sub-systems run in the same process, they are never separated.
  • Consider breaking truly monumental programs into sub programs. Sometimes multi-process is better than multi-thread.
  • Retest each time your VM (be it Java or .net) has a new or different garbage collector, it might make a huge difference.
  • Put as much pressure as possible on VM development companies to come up with ever better scaling garbage collection schemes. Some are already working with hybrid reference counting / mark sweep solutions.  Why - because reference counting scales linearly.