Deep Dive Java Performance And Escape Analysis

Nope - this post is hard-core!
Copyright Dr Alexander J Turner all rights reserved
Escape Analysis is part of the powerful compilation optimiser in both the Oracle (ex Sun) JVM and the IBM JVM. It can have dramatic impacts on performance and free developers from worrying about object creation overheads.

On really nice thing about this optimiser is that it encourages good programming. The example here might look quite complicated but 90% of the code is there to support benchmarking (which is a black art  on the JVM). The real take home messages are:
  • Make code's functional structure match its logical structure.
  • Avoid Java Bad Smells like if/else if/else.
  • Use immutable objects.
That last one is a huge change on older JVMs! In the past a good trick for improving performance was to go through the immutable objects which were created inside loops and change them to mutable objects created outside loops. The reason being that object creation causes allocation on the heap which  is not cost free and it requires those objects to be garbage collected, which is also non trivial.

What is escape analysis?
In the JVM escape analysis looks at where an object reference can go. We can logically (if not physically) think of all C derived languages (like Java) as having three memory types [scopes]:

  1. Stack - parameters, local and block scope all of which can be passed down to called functions.
  2. Static - in the case of Java this is class loader static i.e. static within classes.
  3. Dynamic - sometimes called 'heap' this memory can be passed around freely, is dynamically allocated and reclaimed explicitly (e.g. C) or garbage collected (e.g. Java).

[Please note that there is enough ambiguity and overloading of the above names as to ensure there is no 'correct' word for each type].

Stack storage is often the place where most storage optimisations can occur. In the JVM, stacks are thread local automatically (with no overhead) and the memory on the stack is automatically retrieved by simply moving the stack pointer (or the logical equivalent) rather than having to perform garbage collection. Similarly, allocating memory on the stack is very cheep and does not have any threading (direct or via memory barriers) implications. Escape analysis takes advantage of the performance benefits of the stack.

If an object is allocated in stack scope (as an rvalue* or as a local variable) then the JIT optimizer will trace where in the code call graph it goes. If it is never transferred to static or dynamic scope then it does not actually have to be allocated in dynamic scope. By default (and in all older JVMs) all objects are allocated in dynamic scope (which is only of the most common criticisms of Java and the JVM).
* An rvalue is so called because in right associative languages it sits to the right of that to which it is passed. It might never be directly assigned to a local. For example, in the expression to the right of = in a = b * (x + y), the result of x + y is an rvalue. The result of b * (x + y), is also an rvalue which is then converted into an lvalue by the assignment operator.
The Test:
On my test machine with 8 xeon cores it is possible to do some heavy thread testing. This reveals some very interesting aspects of escape analysis and thread contention. The guts of the test is the following two methods:


  private final static void countNotEscaped(boolean save){
    for(int j=0;j<count;++j){
      ToEscape esc= new ToEscape(j);
      if(save)list.add(null);
    }
  }
  
  private final static void countEscaped(boolean save){
    for(int j=0;j<count;++j){
      ToEscape esc= new ToEscape(j);
      if(save)list.add(esc);
    }
  }

These operate on the following inner class:



  public static class ToEscape{
    private static int counter;
    public long what;
    public ToEscape(long what){
      this.what=what;
      if(counter++==100)System.out.println("I Exist");
    }
  }
  
The only difference between the two methods is highlighted in blue (bold italic underline for the colour blind amongst us). In the method countNotEscape the ToEscape objects created in the loop cannot escape the block (stack) scope; in countEscaped the compiler cannot know if the objects will escape or not. So in countNotEscape the compiler can make esc a stack object rather than a dynamic object. 

Actually, the sharp eyed amongst you might notice that under some conditions the compiler could get rid of esc altogether. It is for this reason that if(counter++==100)System.out.println("I Exist"); clause in the ToEscape object constructor was put there as a side effect so the optimizer cannot get rid of the object created in the loop.



On my machine the performance difference in around 3.5 to 1 on the 7u3 x64 server JVM. The difference was largely independent of the number of threads tested (I tried from 1 to 16). This is a great demonstration of escape analysis at work. The other thing that made it very obvious that escape analysis was the cause was:
  • The code with the possibility of escape took 1.3 Gig of memory to run. IE the heap was being filled up with objects which were then collected.
  • Making both loops not permit escape took that down to 37 Meg! The heap was largely static and there was very little (almost non) garbage collection activity.


Removing Objects Altogether And Thread Cache Synchronization:
When I removed the if(counter++==100)System.out.println("I Exist"); clause and counter itself from ToEscape the countNotEscape still run 3.5 times faster with 1 thread but around 45 times faster with 16 threads. I have not looked in great detail as to why; nevertheless, I have two theories (which might complement one another). The JIT compiler might be getting rid of the allocation to ToEscape altogether. Alternatively, it could be that the performance of the multi-threaded case with the count in place was being restricted by the sharing of the count static field state between the cores on the machine. Remember that the machine is a twin socket quad-core xeon machine; the cost of synchronising count between the cores and sockets will be rather large even though it is not market as volatile. 

This is a good example of BAD coding. I have not synchronised count in any way, so its state will be undefined because multiple threads will be reading, and updating it at the same time. Thus, count is not really doing much of logical use yet it is crippling the ability of the application to scale!

Here is the full code:
package com.nerdscentral.escape;

import java.util.ArrayList;
import java.util.concurrent.atomic.AtomicInteger;

public class TestEscape {
  private static ArrayList<ToEscape> list= new ArrayList<>();
  private final static int count=100000;

  public static class ToEscape{
    private static int counter;
    public long what;
    public ToEscape(long what){
      this.what=what;
      if(counter++==100)System.out.println("I Exist");
    }
  }
  
  private final static void countNotEscaped(boolean save){
    for(int j=0;j<count;++j){
      ToEscape esc= new ToEscape(j);
      if(save)list.add(null);
    }
  }
  
  private final static void countEscaped(boolean save){
    for(int j=0;j<count;++j){
      ToEscape esc= new ToEscape(j);
      if(save)list.add(esc);
    }
  }
  
  public static void main(String[] args) throws InterruptedException{
    Thread.currentThread().sleep(5000);
      
    final AtomicInteger pcnt = new AtomicInteger(0);
    final AtomicInteger done=new AtomicInteger(0);
    final int tThreads=16;
    final int scale=1000000;
    for(int i=0;i<tThreads;++i){
      new Thread(){
        public void run(){
          double totalE=0;
          double totalNE=0;
          for(int i=0;i<1000;++i){
            if(i==100)totalE=totalNE=0;
            long t=System.nanoTime();
            countNotEscaped(false);
            totalNE+=(System.nanoTime()-t);
            t=System.nanoTime();
            countEscaped(false);
            totalE+=(System.nanoTime()-t);
          }
          pcnt.addAndGet((int) ((totalE/totalNE)*scale));
          done.incrementAndGet();
          
        }
      }.start();
    }
    while(done.get()<tThreads)Thread.sleep(50);
    System.out.println("Final ratio escaped:not escaped = " + (((double)pcnt.get())/((double)(scale*tThreads))));
  }
  
}