You Thought You Knew About Java Performance - Part 1

[Image: Seeing dimly through the mist. Copyright: Dr Alexander J Turner]
You've paid your money for a fancy profiler, or perhaps you just use VisualVM. But is it telling you the truth?

It tells you that 10% of your program's time is spent garbage collecting, so you spend the next year fiddling with GC options and telling your boss how clever you are. Well, stop - because it is all nonsense.

OK - I am overstating the case; but there is a completely different view of the world, and we as programmers really need to start thinking about it as well as our cosy linear-execution view.

We still measure performance as if we were running a Tandy TRS-80 from the late seventies. We are so wrong because we equate time directly to performance, as though the CPU is actually doing stuff during the times we are looking at. We think that because the profiler says 100ms is spent in a function, the CPU is executing instructions during those 100ms, and that we can therefore optimise performance by improving the algorithm used in the method. We are used to the idea that the CPU waits for IO; we see CPU activity monitors drop as our code waits for IO. However, other sources of waiting do not show up as drops in CPU usage because they are not mediated by the kernel scheduler.

All this time measuring rather misses the point; modern computers struggle to get their CPUs even 50% active. The CPU is so fast that a massive proportion of the time is spent trying to get information to and from memory. Any performance gain from clever algorithms and garbage-collection tweaking can easily be lost to memory issues, and standard performance measures like CPU utilisation will never notice it. Profilers, on the other hand, simply do not have the accuracy to notice these subtle effects. Sorry to burst the bubble, but those numbers coming out of your tool are guidance only. They are handy for comparing one method to another or finding hot spots, but that is where it ends. The very act of profiling something completely alters its behaviour. Don't believe me? Well, where does that profile information go? It goes into memory, the very thing that modern computers struggle with so much.
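To see how much the memory system alone can matter, here is a toy benchmark (a hypothetical illustration, nothing to do with Sonic Field itself). It sums the same 4M longs twice - once in order, once in a shuffled order that defeats the caches. A wall-clock profiler would report both as "time in sum()" and tell you nothing about why one run is slower.

```java
import java.util.Random;

public class CacheMissDemo {
    static final int N = 1 << 22; // 4M longs, ~32 MB - far bigger than L3

    // Sum data[] in the order given by order[] - identical arithmetic
    // regardless of the ordering, very different memory behaviour.
    public static long sum(long[] data, int[] order) {
        long s = 0;
        for (int i = 0; i < order.length; i++) s += data[order[i]];
        return s;
    }

    public static void main(String[] args) {
        long[] data = new long[N];
        int[] sequential = new int[N];
        int[] shuffled = new int[N];
        for (int i = 0; i < N; i++) { data[i] = i; sequential[i] = i; shuffled[i] = i; }
        Random r = new Random(42);
        for (int i = N - 1; i > 0; i--) { // Fisher-Yates shuffle
            int j = r.nextInt(i + 1);
            int t = shuffled[i]; shuffled[i] = shuffled[j]; shuffled[j] = t;
        }
        long t0 = System.nanoTime();
        long s1 = sum(data, sequential);
        long t1 = System.nanoTime();
        long s2 = sum(data, shuffled);
        long t2 = System.nanoTime();
        // Same result either way, so the work done is identical.
        System.out.println("sums equal   : " + (s1 == s2));
        System.out.println("sequential ms: " + (t1 - t0) / 1_000_000);
        System.out.println("shuffled   ms: " + (t2 - t1) / 1_000_000);
    }
}
```

On typical hardware the shuffled pass takes several times longer, yet CPU utilisation stays pegged at 100% throughout - the stall cycles are invisible to the scheduler.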

Let me show an example:

Parsing: patches/performances/ping-5.sfpl size: 1642
SFPL: [153]1 >> Sonic Field Version 0.2 Starting  Processing
SFPL: [153]1 >> ============================================

SFPL: [12127]1 >> Sonic Field Version 0.2 Completed Processing
SFPL: [12127]1 >> ============================================
Exit code: 0

 EXEC  : instructions per nominal CPU cycle
 IPC   : instructions per CPU cycle
 FREQ  : relation to nominal CPU frequency='unhalted clock ticks'/'invariant timer ticks' (includes Intel Turbo Boost)
 AFREQ : relation to nominal CPU frequency while in active state (not in power-saving C state)='unhalted clock ticks'/'invariant timer ticks while in C0-state'  (includes Intel Turbo Boost)
 L3MISS: L3 cache misses 
 L2MISS: L2 cache misses (including other core's L2 cache *hits*) 
 L3HIT : L3 cache hit ratio (0.00-1.00)
 L2HIT : L2 cache hit ratio (0.00-1.00)
 L3CLK : ratio of CPU cycles lost due to L3 cache misses (0.00-1.00), in some cases could be >1.0 due to a higher memory latency
 L2CLK : ratio of CPU cycles lost due to missing L2 cache but still hitting L3 cache (0.00-1.00)
 READ  : bytes read from memory controller (in GBytes)
 WRITE : bytes written to memory controller (in GBytes)
 TEMP  : Temperature reading in 1 degree Celsius relative to the TjMax temperature (thermal headroom): 0 corresponds to the max temperature


 Core (SKT) | EXEC | IPC  | FREQ | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3CLK | L2CLK | READ | WRITE | TEMP

   0    0     1.00   0.89   1.12    1.30      67 M    101 M    0.34    0.30    0.32    0.04     N/A     N/A     50
   1    0     0.62   0.69   0.89    1.29      50 M     73 M    0.31    0.19    0.30    0.04     N/A     N/A     50
   2    0     0.81   0.79   1.02    1.30      60 M     84 M    0.28    0.30    0.31    0.03     N/A     N/A     54
   3    0     0.61   0.70   0.88    1.29      49 M     65 M    0.24    0.20    0.30    0.03     N/A     N/A     54
   4    0     0.82   0.79   1.03    1.29      60 M     74 M    0.19    0.33    0.31    0.02     N/A     N/A     55
   5    0     0.61   0.70   0.87    1.29      48 M     58 M    0.17    0.22    0.29    0.02     N/A     N/A     55
   6    0     0.85   0.80   1.07    1.30      66 M     74 M    0.11    0.33    0.33    0.01     N/A     N/A     53
   7    0     0.58   0.68   0.85    1.29      48 M     53 M    0.09    0.22    0.30    0.01     N/A     N/A     53
 SKT    0     0.74   0.76   0.97    1.29     452 M    585 M    0.23    0.27    0.31    0.03     N/A     N/A     48
 TOTAL  *     0.74   0.76   0.97    1.29     452 M    585 M    0.23    0.27    0.31    0.03     N/A     N/A     N/A

 Instructions retired:  201 G ; Active cycles:  264 G ; Time (TSC):   34 Gticks ; C0 (active,non-halted) core residency: 74.74 %

 C1 core residency: 10.40 %; C3 core residency: 0.00 %; C6 core residency: 0.00 %; C7 core residency: 14.86 %
 C2 package residency: 0.01 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.05 %

 PHYSICAL CORE IPC                 : 1.52 => corresponds to 38.11 % utilization for cores in active state
 Instructions per nominal CPU cycle: 1.48 => corresponds to 36.88 % core utilization over time interval

 SKT    0 package consumed 421.69 Joules
 TOTAL:                    421.69 Joules
Cleaning up

This is the output from a completely different kind of profiler: it profiles the performance of the cores themselves during the execution of a program. It tells us some really useful information which gives us an idea of how much room there is for improvement, rather than how much time various things are taking. The above is the output from Intel's Performance Counter Monitor tool, pcm.x (which runs on the Mac). The trace is from a MacBook Pro Retina, whose quad-core i7 gives a good set of performance counters to look at. The program is Sonic Field, a Java/Spring audio generation and manipulation tool.

There is a lot of information there so let's start to consider it bit by bit to see what it means. First, to anchor this in something more familiar, we can compare the results of pcm.x with those from sar. To do this I ran sar immediately before running the process which I was measuring with pcm.

Here are the sar results:

Time %usr %nice %sys %idle
11:46:56 35 0 3 61
11:46:57 92 0 3 6
11:46:58 87 0 9 4
11:46:59 80 0 13 8
11:47:00 83 0 10 7
11:47:01 58 0 7 35
11:47:02 60 0 5 35
11:47:03 92 0 6 3
11:47:04 82 0 8 10
11:47:05 69 0 6 25
11:47:06 88 0 7 5
11:47:07 88 0 5 7
11:47:08 21 0 4 75

Average %usr: 71.92
Average non-idle (100 - %idle): 78.38

The table above gives the following graph.

[Graph: sar output for the test program]
Now, sar samples the kernel scheduler every second, so its resolution for this program (which lasts around 13 seconds) is not going to be perfect. The number we are really interested in is the area above the idle line, which is (approximately) 100 minus the mean of the idle time. This gives a CPU utilisation of 78%, which compares well to the pcm C0-state percentage of 75%.
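The two averages can be reproduced straight from the table: the first is the mean of the %usr column, the second is 100 minus the mean of the %idle column. A quick sketch:

```java
public class SarAverages {
    // Arithmetic mean of a column of sar samples.
    public static double mean(double[] xs) {
        double s = 0;
        for (double x : xs) s += x;
        return s / xs.length;
    }

    public static void main(String[] args) {
        // Columns copied from the sar table above.
        double[] usr  = {35, 92, 87, 80, 83, 58, 60, 92, 82, 69, 88, 88, 21};
        double[] idle = {61,  6,  4,  8,  7, 35, 35,  3, 10, 25,  5,  7, 75};
        System.out.printf("mean %%usr       : %.2f%n", mean(usr));        // 71.92
        System.out.printf("100 - mean %%idle: %.2f%n", 100 - mean(idle)); // 78.38
    }
}
```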

Given these numbers, we can go away quite happy that we are being 75% efficient in our use of CPU resource. That is, our algorithms can be tweaked to improve performance, but our parallel execution system is good enough to keep 75% of the CPUs busy all the time, so there is some gain to be had but not a world-shatteringly big amount.

But wait - that is wrong. Let us take a look at some of the other metrics from pcm. These are metrics which we cannot see from sar or from any normal profiler.

Consider L3CLK, which for the whole socket came out at 0.31. This is the ratio of CPU cycles lost to misses on the level-three cache, and it is a good measure of the amount of time the CPU is waiting for main memory. It shows that the CPU is waiting for main memory around a third of the time. Thus, we can take the fraction of time in the active state (C0) and multiply it by the fraction of that time not spent waiting for main memory to get the fraction of time actually available for calculations: 0.7474 * (1 - 0.31), which is about 0.52.
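That calculation, using the C0 residency (74.74%) and the socket-wide L3CLK (0.31) from the pcm output above, can be sketched as:

```java
public class UsableCycles {
    // Fraction of wall-clock time the core can actually compute:
    // time active (C0) times the share of active cycles not lost to L3 misses.
    public static double usableFraction(double c0Residency, double l3Clk) {
        return c0Residency * (1.0 - l3Clk);
    }

    public static void main(String[] args) {
        double usable = usableFraction(0.7474, 0.31);
        System.out.printf("usable fraction: %.3f%n", usable); // ~0.516
    }
}
```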


[Image: Do not be fooled, reality is not always how it would first appear. Copyright: Dr Alexander J Turner]
Oh dear: this means that our expensive i7 CPU is actually able to do something only about half the time. Let us have a think about that. Say we spend £1000 on a server and then spend the same on running it for its life. That gives a spend of £2000, and we throw away about £1000 of it on leaks in the threading model and misses on the L3 cache. Ouch!

Other Stuff:
Unfortunately, there are a lot of other places where we can leak performance: branch prediction failures, pipeline stalls, ITLB and DTLB misses, and so on. This leads us to the following number:

 Instructions per nominal CPU cycle: 1.48 => corresponds to 36.88 % core utilization over time interval

This is the number we 'thought' we would get from sar or any other software-based CPU utilisation tool, but did not, because such tools only see a tiny part of the story. By including hardware stalls which are not known to the scheduler, we get the real measure of our program's efficiency. At 36.88% it does not make for happy reading.

Now let's look at this line:

 PHYSICAL CORE IPC                 : 1.52 => corresponds to 38.11 % utilization for cores in active state

It had me confused for quite some time (I still am a bit, to be honest). I would normally expect the ratio of (instructions per nominal cycle) to (physical core IPC) to be around the proportion of time spent in C0. We can see here that it is not. In the example below it is more like what I expected:

pcm.x "du -sk ./*"

C0 (active,non-halted) core residency: 42.85 %
Instructions per nominal CPU cycle: 18.10 %

So how come instructions per nominal CPU cycle is so high compared to the physical core IPC in the active state for my program? My guess is that this is due to Turbo Boost. Whilst the process is not running on all the cores (note that the machine is pretty much idle apart from this process), those cores which are running boost up in clock speed, and so we get more instructions retired per nominal cycle than we would expect. I am sure there are many other issues which might cause this effect as well.
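The sanity check I am describing can be written down explicitly (the numbers are taken from the pcm output above; the expectation is that dividing utilisation-over-time by utilisation-in-active-state should land near the C0 residency):

```java
public class IpcSanityCheck {
    // Expected relation: (instructions per nominal cycle) / (active-state IPC)
    // should roughly equal the fraction of time spent in the C0 state.
    public static double impliedC0(double instrPerNominalCycle, double activeIpc) {
        return instrPerNominalCycle / activeIpc;
    }

    public static void main(String[] args) {
        // Sonic Field run: 1.48 / 1.52 implies ~97% C0 residency, but pcm
        // measured only 74.74% - the gap is what prompted the Turbo Boost guess.
        System.out.printf("implied C0: %.2f (measured 0.75)%n",
                impliedC0(1.48, 1.52));
    }
}
```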

What To Do Next:
Armed with this new information, I know that I have the theoretical potential to increase the performance of my application by three times. In other words, out of the £2000 server cost I was talking about, £1300 or so would be thrown away! So, what I thought was a well optimised program just is not; it is rubbish :(

I do know that Sonic Field uses an immutable data model. This might well be a big contributor to the very poor memory performance. If I were to introduce some scratch memory buffers which are mutated, but whose mutation is hidden from the rest of the system so that I retain the immutable model, then I might get some big performance gains. Watch this space (but not too hard, because I am rather busy right now).
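One way the scratch-buffer idea could look - a purely hypothetical sketch, not Sonic Field's actual API - is a signal class that reuses a per-thread mutable buffer internally while only ever handing out immutable results:

```java
import java.util.Arrays;

// Hypothetical sketch: gain() mutates a reused per-thread scratch buffer,
// but callers only ever see immutable objects, so from the outside the
// immutable model is preserved.
public final class ScratchSignal {
    private final double[] samples; // never mutated after construction

    private ScratchSignal(double[] samples) { this.samples = samples; }

    public static ScratchSignal of(double... samples) {
        return new ScratchSignal(samples.clone());
    }

    public double sample(int i) { return samples[i]; }

    // Scratch buffer reused across calls on the same thread; the real win
    // would come from fusing several operations on it before copying out.
    private static final ThreadLocal<double[]> SCRATCH = new ThreadLocal<>();

    public ScratchSignal gain(double factor) {
        double[] scratch = SCRATCH.get();
        if (scratch == null || scratch.length < samples.length) {
            scratch = new double[samples.length];
            SCRATCH.set(scratch);
        }
        for (int i = 0; i < samples.length; i++) scratch[i] = samples[i] * factor;
        // Copy the scratch into a fresh immutable result.
        return new ScratchSignal(Arrays.copyOf(scratch, samples.length));
    }
}
```

Whether this actually improves cache behaviour would have to be measured with pcm rather than assumed - which is rather the point of the whole article.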