Synchronising sun.misc.Unsafe With C++

Testing atomics

What is a fence, a volatile read and a volatile set, and why do so many people get this wrong?

Quick Note

I have simplified and glossed over quite a bit in this article, especially regarding the implementation of x86 CPUs. I expect to get some backlash over this - I look forward to it. However, I personally believe the level of simplification is about right to allow safe and sound reasoning without drowning in the details. If you disagree, please leave a comment and add to the discussion around this important and interesting topic.

Unsafe - The New Fence Methods

Let us start by looking at the new fence methods in Unsafe. These were added with Java 8 and seem to be very widely misunderstood.

    /**
     * Ensures lack of reordering of loads before the fence
     * with loads or stores after the fence.
     * @since 1.8
     */
    public native void loadFence();

    /**
     * Ensures lack of reordering of stores before the fence
     * with loads or stores after the fence.
     * @since 1.8
     */
    public native void storeFence();

    /**
     * Ensures lack of reordering of loads or stores before the fence
     * with loads or stores after the fence.
     * @since 1.8
     */
    public native void fullFence();
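Note that Unsafe.getUnsafe() throws a SecurityException for ordinary application classes, so using any of these methods assumes the usual reflection trick to obtain the instance. A minimal sketch (the class name is mine):

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class UnsafeProvider {
    // Unsafe.getUnsafe() rejects application class loaders, so we
    // read the private static 'theUnsafe' singleton field instead.
    public static Unsafe getUnsafe() {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Unsafe not available", e);
        }
    }
}
```

With the instance in hand, unsafe.loadFence(), unsafe.storeFence() and unsafe.fullFence() can be called directly.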

From now on I am discussing Linux x86 and x86_64. 99% of everything I say will apply to Windows and Mac on those chips; for other chips, what I say below is not directly relevant.

A Simple Model Of Memory

We can think of a two CPU (or two core) x86 system a little bit like this.

[Execution Unit]->[Store Buffer]->[Cache]\
                                          ->[Main Memory]
[Execution Unit]->[Store Buffer]->[Cache]/

[Execution Unit]<-[Read Buffer]<-[Cache]\
                                          <-[Main Memory]
[Execution Unit]<-[Read Buffer]<-[Cache]/

On one CPU, reads and writes to the same location are not reordered; however, between different locations, reads can be reordered with reads and writes can be reordered with writes. Reordering between locations is permitted either way as long as the locally observed result is consistent. Further, the compiler is at liberty to perform any reordering it wants. The reordering is done via the store and read buffers (a gross simplification).

Cache coherence ensures that the values read from and written to cache are consistent and coherent across the entire machine. We do not need to worry about caches writing to or reading from main memory; as long as reads and writes go to/from cache we get a consistent global state. All race conditions occur due to reordering by the compiler or in the read and store buffers.

Fences are all about trying to prevent some or all of these reorderings.  In the x86 instruction set we have mfence, sfence and lfence. These are about preventing reordering and ensuring that memory state of the local cpu/core is globally visible. The cache coherence system of the CPU is then able to ensure that global visibility actually works.

  • lfence: no read reordering across this instruction; reads after it will be consistent with the global state of the cache at the point this instruction retires.
  • sfence: no write reordering across this instruction; all writes before its retirement will be moved to the cache and consistently seen across the entire machine.
  • mfence: lfence and sfence combined atomically.

The Unsafe fences are a little different. For fullFence the behaviour is as expected, but the other two are nothing like the x86 instructions.

Full Fence

Fences 'flush the buffers' (again, a gross simplification), which is also often referred to as serialisation. This means that reads are never reordered over a load fence and writes are never reordered over a store fence. Unsafe.fullFence() could have been implemented using the mfence instruction (though, as we will see, it is not). Because we flush the store buffer to the cache, and the cache has a coherence model, the new value is seen globally.

Thread A Store  0 to 0x467897
Thread B Store 25 to 0x467897
Thread B fullFence()
Thread A fullFence()
Thread A Load from 0x467897

What is the value of the read in Thread A? The answer is 25, because both Thread A and Thread B are guaranteed to have no reordering, and all reads and writes go via the cache. The cache is coherent, so we have a consistent memory model.

inline void OrderAccess::fence() {
  if (os::is_MP()) {
    // always use locked addl since mfence is sometimes expensive
#ifdef AMD64
    __asm__ volatile ("lock; addl $0,0(%%rsp)" : : : "cc", "memory");
#else
    __asm__ volatile ("lock; addl $0,0(%%esp)" : : : "cc", "memory");
#endif
  }
}

The JDK uses the serialising/fencing effect of a locked add to flush the read and write buffers at the same time. Despite the x86 instruction set having mfence, the developers of the JDK had evidence that their implementation is faster. They perform an atomic addition of zero to the stack pointer. This is a full memory barrier/fence and does ensure that all other cpus/cores will see the memory values stored by this one before the instruction (due to the x86 memory model being so strong).

So - on x86 this is true:
Thread A Store  0 to 0x467897
Thread B Store 25 to 0x467897
Thread B fullFence()
Thread A Load from 0x467897

We can still guarantee that A will read 25. It seems a bit extreme, but yes - the x86 memory model really is that strong on ordering.

Note that because we are hitting the stack, the chances are very good indeed that this operation will not attract inter-cpu or inter-core communication, as stacks are usually local to a particular core. Further, because it is the top of the stack, we can guarantee the page is swapped in and will not cause a segmentation violation. As hacks go (and let us be honest - this is a hack) it is quite a nice one.

Load And Store Fence

These are ridiculously weak in Java. They actually do nothing other than prevent compiler reordering:

inline void OrderAccess::acquire() {
  volatile intptr_t local_dummy;
#ifdef AMD64
  __asm__ volatile ("movq 0(%%rsp), %0" : "=r" (local_dummy) : : "memory");
#else
  __asm__ volatile ("movl 0(%%esp),%0" : "=r" (local_dummy) : : "memory");
#endif // AMD64
}

inline void OrderAccess::release() {
  // Avoid hitting the same cache-line from
  // different threads.
  volatile jint local_dummy = 0;
}

Really? Yes - these are store and read buffer neutral operations. The fence operations are identical to a read and a write (respectively) of a C/C++ volatile. I.e. they have no special effect on the hardware and serve only to prevent the compiler reordering around them or performing speculative operations.

One might ask 'why have these at all?'. The answer - I guess - is a sort of reverse logic. As we will see below, the volatile store operation from Unsafe is a fully serialising, locked instruction and so potentially very expensive (tens to hundreds of clock cycles). Without them, Java would completely lack a lightweight way of preventing the compiler from reordering across a barrier. Unsafe.loadFence() and Unsafe.storeFence() provide exactly that mechanism, but offer none of the guarantees an x86-experienced developer might think come with the word 'fence'. These are undoubtedly more useful and more powerful on hardware with a weaker memory ordering model than x86. On x86 they are almost, but not quite, ridiculous.

Let us be very clear about this: the above JVM code does nothing to prevent reordering of stores and loads to different locations, even on the same CPU. The above implementations are not actually fences at all, and the name could be quite misleading.

** If I am wrong about this interpretation of the Unsafe fences, please shout out!

Volatile And Atomic

Now we can look at the interaction of C++ atomics with Java. Both Java and C++ simplify the memory situation by ignoring read reordering, leveraging cache coherence and the effect of the lock prefix (or an implied prefix). Here is how it works:
  1. A read cannot be inconsistent unless there is a write because reads have no effect on memory state.
  2. Therefore, if writes are in the correct order and reads cannot be reordered around them, reads are effectively in the correct order.
  3. If we use the lock prefix (or implied lock) we can guarantee GLOBAL write/store order for a given memory cell (a cell being 32 (x86) or 64 (x86_64) bits aligned). We also make all stores before the instruction from that cpu/core globally visible. See this article.
  4. As, on each cpu/core, reads cannot be reordered with writes, we get GLOBAL sequentially consistent ordering.
Using a locked instruction on x86 is like reaching into all the cores/cpus on the machine and telling them all to read from cache any stores which have been made by the locking cpu/core. This is usually much more efficient than using lfence for reads and sfence for writes. Why? Well, in general, reads outnumber writes by a large margin; if we use fences then the cost of the lfence is paid frequently. By using the lock instead, reads run at full speed; the only cost incurred is the loss of compiler reordering. Writes are expensive due to the lock instruction; this might be more expensive than an sfence, but writes are usually less common than reads, so it is a price worth paying. Anyhow, we have no choice, because both Java and C++ have chosen the naked read / locked write approach and we should just get on and use it.
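The same naked read / locked write bargain is visible through the public API: on x86, AtomicLong.get() JITs to a plain read while AtomicLong.set() is a volatile (locked) store. A minimal sketch (names are mine):

```java
import java.util.concurrent.atomic.AtomicLong;

public class NakedReadLockedWrite {
    public static void main(String[] args) {
        AtomicLong cell = new AtomicLong(0);
        cell.set(25);          // volatile store: serialising (locked) on x86
        long v = cell.get();   // volatile load: a plain read the compiler may not reorder
        System.out.println(v); // 25
    }
}
```

The expensive half of the bargain is the set(); the get() costs essentially nothing beyond the lost compiler reordering.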

The C++ standard has several memory orderings. However, for x86 Java interop' we are really only interested in three. Here are the C++ memory orderings and their Java equivalents.

Table 1: Matching Java Operations to C++ Atomics
Java Volatile Operation        Java Unsafe Operation                           C++ Memory Order

volatile long x;               Unsafe unsafe;
long v = x;                    long v = unsafe.getLongVolatile(null, addr);    memory_order_acquire

volatile long y;               Unsafe unsafe;
y = v;                         unsafe.putLongVolatile(null, addr, v);          memory_order_release

(no direct volatile form)      Unsafe unsafe;
                               long v = unsafe.getAndSetLong(null, addr, x);   memory_order_seq_cst
Note that I have used long as an example - int and Object versions also exist in Java.
Note that passing null as the first parameter to the Unsafe methods uses absolute addressing rather than object-relative addressing.

This memory ordering approach means that the get/load/memory_order_acquire actions have no special machine operations associated with them. The only thing they require/enforce is that the compiler does not reorder across them. The put/store/memory_order_release operations are implemented using the lock assembler prefix (or an instruction with an implied prefix), which ensures the stores are in order and thus forces the reads to be in order with respect to the stores.
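In Java source terms, this acquire/release pairing is what makes the classic safe-publication idiom work: the volatile write is the release, and any read that sees the new reference is the acquire. A minimal sketch (class and field names are mine; the writer and reader would normally be different threads):

```java
public class Publisher {
    static class Payload {
        int value;
        Payload(int v) { value = v; }
    }

    // The volatile write of this reference is the release; a volatile
    // read that observes it is the acquire, so payload.value is visible.
    static volatile Payload published;

    public static void main(String[] args) {
        published = new Payload(42);  // release store (writer side)
        Payload p = published;        // acquire load (reader side)
        System.out.println(p.value);  // 42
    }
}
```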

The Java implementation of get and put volatile relies on the following macros:

#define GET_FIELD_VOLATILE(obj, offset, type_name, v) \
  oop p = JNIHandles::resolve(obj); \
  volatile type_name v = OrderAccess::load_acquire((volatile type_name*)index_oop_from_field_offset_long(p, offset));

#define SET_FIELD_VOLATILE(obj, offset, type_name, x) \
  oop p = JNIHandles::resolve(obj); \
  OrderAccess::release_store_fence((volatile type_name*)index_oop_from_field_offset_long(p, offset), x);

These call out to the following methods:

inline void     OrderAccess::release_store_fence(volatile jbyte*  p, jbyte  v) {
  __asm__ volatile (  "xchgb (%2),%0"
                    : "=q" (v)
                    : "0" (v), "r" (p)
                    : "memory");
}

inline void     OrderAccess::release_store_fence(volatile jshort* p, jshort v) {
  __asm__ volatile (  "xchgw (%2),%0"
                    : "=r" (v)
                    : "0" (v), "r" (p)
                    : "memory");
}

inline void     OrderAccess::release_store_fence(volatile jint*   p, jint   v) {
  __asm__ volatile (  "xchgl (%2),%0"
                    : "=r" (v)
                    : "0" (v), "r" (p)
                    : "memory");
}

inline void     OrderAccess::release_store_fence(volatile jlong*  p, jlong  v) {
#ifdef AMD64
  __asm__ __volatile__ (  "xchgq (%2), %0"
                          : "=r" (v)
                          : "0" (v), "r" (p)
                          : "memory");
#else
  release_store(p, v); fence();
#endif // AMD64
}

inline jbyte    OrderAccess::load_acquire(volatile jbyte*   p) { return *p; }
inline jshort   OrderAccess::load_acquire(volatile jshort*  p) { return *p; }
inline jint     OrderAccess::load_acquire(volatile jint*    p) { return *p; }
inline jlong    OrderAccess::load_acquire(volatile jlong*   p) { return Atomic::load(p); }
inline jubyte   OrderAccess::load_acquire(volatile jubyte*  p) { return *p; }
inline jushort  OrderAccess::load_acquire(volatile jushort* p) { return *p; }
inline juint    OrderAccess::load_acquire(volatile juint*   p) { return *p; }
inline julong   OrderAccess::load_acquire(volatile julong*  p) { return Atomic::load((volatile jlong*)p); }
inline jfloat   OrderAccess::load_acquire(volatile jfloat*  p) { return *p; }
inline jdouble  OrderAccess::load_acquire(volatile jdouble* p) { return jdouble_cast(Atomic::load((volatile jlong*)p)); }

We can see that, as I explained, the stores are done using locks; the xchg instruction on x86 has what is called an 'implied lock' because the explicit lock prefix is not required. The reads are just memory reads around which the compiler must not reorder.

The Java implementation of the getAndSet operations actually uses compareAndSwap and is, to be honest, a bit sloppy. I have not checked, but I hope the JIT compiler can improve on things:

    /**
     * Atomically exchanges the given value with the current value of
     * a field or array element within the given object o
     * at the given offset.
     * @param o object/array to update the field/element in
     * @param offset field/element offset
     * @param newValue new value
     * @return the previous value
     * @since 1.8
     */
    public final long getAndSetLong(Object o, long offset, long newValue) {
        long v;
        do {
            v = getLongVolatile(o, offset);
        } while (!compareAndSwapLong(o, offset, v, newValue));
        return v;
    }
My guess is that this implementation is just a convenience because it works on many CPUs, rather than using the lock-prefixed implementation which x86 provides.

I have not shown the actual assembler generated for volatile variables by the JVM; it does, however, follow the same convention. A volatile read is just a read on x86, whilst a volatile write uses an explicit or implied lock instruction (note that some sources, including mechanical-sympathy, say volatiles are achieved using sfence and lfence; this is not the case in modern JVMs). This means that the state of volatile variables is consistent across all the threads on a JVM, just as a variable stored with memory_order_release is in C++ and one set with putXXXXVolatile is from an instance of Unsafe.

Conclusions And Recommendations

  1. Only bother with the ordering semantics from Table 1.
  2. If in doubt, use sequentially consistent.
  3. Do not bother with the fence stuff except for ultimate last-nanosecond tuning of a very important piece of code. Basically, if you find yourself needing to use one of the fence methods from Unsafe, it is 99.9% certain there is a better way of implementing what you are doing.**
  4. Remember that memory_order_acquire and volatile get/load are cheap; memory_order_release and volatile set/store are expensive.
  5. Now that C++ and Java both have memory models and they are both sensibly and reliably implemented on x86 we can confidently build high performance synchronisation without having to resort to the sledgehammer of mutexes or guesswork and figure crossing.
** One illustrative example where they might be useful would be publishing to a single producer single consumer queue where we just want to ensure the compiler does actually write the queue end location to memory, but we are not especially bothered when that write happens.
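For that queue-publishing case, the closest public-API equivalent of a compiler-only store-ordering barrier is AtomicLong.lazySet (backed by Unsafe.putOrderedLong in Java 8). A minimal sketch of publishing a ring-buffer tail this way (all names are mine):

```java
import java.util.concurrent.atomic.AtomicLong;

public class SpscPublish {
    static final int SIZE = 16;                  // power-of-two ring
    static final long[] ring = new long[SIZE];
    static final AtomicLong tail = new AtomicLong(0);

    // Producer side: write the slot, then publish the new tail with
    // lazySet, which orders the two stores without paying for a full
    // serialising (locked) instruction.
    static void offer(long item) {
        long t = tail.get();
        ring[(int) (t & (SIZE - 1))] = item;
        tail.lazySet(t + 1);
    }

    public static void main(String[] args) {
        offer(99);
        long t = tail.get();
        System.out.println(ring[(int) ((t - 1) & (SIZE - 1))]); // 99
    }
}
```

The consumer would volatile-read tail and drain slots up to it; the lazySet guarantees the slot write is visible no later than the tail update.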

Source Code Lookup

This article is largely based on the OpenJDK HotSpot source code.

Using Unsafe For Unions Under The JVM

Unions are one of those things I have been thinking about for years - because Java did not support them - until recently.

Back in the Java 6 days, byte buffers were slow and sun.misc.Unsafe was not well known. Both of these things have now changed. Seriously, software bit twiddling was used to move doubles in and out of byte buffers in Java 6. The situation was completely terrible. As a consequence, there was no good way of getting JVM implementations of pointers or unions to run at anything like native speed. Languages like C, Fortran and COBOL implemented on the JVM suffered severe slowdowns due to the limitations of memory management on that platform.
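For the record, that software bit twiddling looked roughly like this - a reconstruction of the idea, not the actual JDK code; the class and method names are mine. Every double is split through doubleToRawLongBits and moved one byte at a time:

```java
public class TwiddleDouble {
    // Store a double into a byte array big-endian, one byte at a time -
    // roughly what byte buffer code had to do in software on Java 6.
    static void putDouble(byte[] buf, int off, double d) {
        long bits = Double.doubleToRawLongBits(d);
        for (int i = 0; i < 8; i++)
            buf[off + i] = (byte) (bits >>> (56 - 8 * i));
    }

    static double getDouble(byte[] buf, int off) {
        long bits = 0;
        for (int i = 0; i < 8; i++)
            bits = (bits << 8) | (buf[off + i] & 0xFFL);
        return Double.longBitsToDouble(bits);
    }

    public static void main(String[] args) {
        byte[] buf = new byte[8];
        putDouble(buf, 0, Math.PI);
        System.out.println(getDouble(buf, 0)); // 3.141592653589793
    }
}
```

Sixteen shifts, masks and array bounds checks per double; it is not hard to see why this was slow.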

However, that situation has completely changed. Java 7 got better, but I am going to skip a generation and look at Java 8, which has solved the performance problem for unions through both community acceptance of sun.misc.Unsafe and the intrinsification of its methods.

On Java 8, unions and pointers can be implemented at full native speed. With 'pause free' (actually - very low pause) JVM implementations like Azul, JVM languages are well and truly in the speed game.

So - let's see how all this performance is achieved in byte buffers. Well - a byte buffer is either direct or heap. To get started, we can look at putShort on a heap byte buffer.

public ByteBuffer putShort(int i, short x) {
    Bits.putShort(this, ix(checkIndex(i, 2)), x, bigEndian);
    return this;
}

static void putShort(ByteBuffer bb, int bi, short x, boolean bigEndian) {
    if (bigEndian)
        putShortB(bb, bi, x);
    else
        putShortL(bb, bi, x);
}

static void putShortL(long a, short x) {
    _put(a    , short0(x));
    _put(a + 1, short1(x));
}

private static void _put(long a, byte b) {
    unsafe.putByte(a, b);
}

To put a short into a byte buffer, the JDK delegates to Bits, which knows about the hardware endianness. This then calls sun.misc.Unsafe. Sadly, there is also ix(checkIndex(i, 2)). This is a major issue because, whilst the latter two calls can be inlined and then optimised away, the index check causes a safe point in the JVM and prevents optimising away the entire call stack (or so I have been told). The overhead of inlining is quite large here as well; all this means that a call to putShort on a HeapByteBuffer is not always perfectly optimised, but it is pretty good.

Can we do better by hand? Yes - naturally we can. We can call sun.misc.Unsafe directly ourselves and bypass all this clutter (at the expense of having to consider endianness ourselves and losing range checking). If we are happy for our program to be 'unsafe' then we can run at full native speed for unions.
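A minimal sketch of the hand-rolled route (class and method names are mine; note that we now own the endianness and bounds problems ourselves):

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class DirectShort {
    // Round-trip a short through raw memory: allocate 2 bytes, store
    // in native byte order with no bounds checks, read back, free.
    static short roundTrip(short x) {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            Unsafe unsafe = (Unsafe) f.get(null);

            long addr = unsafe.allocateMemory(2);
            try {
                unsafe.putShort(addr, x);   // no checkIndex, no Bits dispatch
                return unsafe.getShort(addr);
            } finally {
                unsafe.freeMemory(addr);    // we own the lifetime too
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip((short) 12345)); // 12345
    }
}
```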

Analysis using JIT disassembly shows that the inlining, loop unrolling and other techniques of the Oracle JDK 8 JVM are comparable to the very best profile guided optimisations we might expect from gcc/g++, so we really are 'in line' for getting unions, pointers and other concepts in JVM languages to run as fast as they do in native code.

DANGER: what I am about to describe makes a JVM compiled language operate as though it were in 'unverified mode' - as with unsafe code in C#. What you will have is a program which is just as dangerous as a pure native program. Do not pass the result off as being safe like 'normal' Java - it is not.

OK, so we have established that the JDK itself uses sun.misc.Unsafe to implement byte buffers. Indeed, although it is supposed to be hidden, sun.misc.Unsafe is now an integral part of so much JVM code running 'in the wild' that we can, for any full JVM/JDK implementation, expect it to just be there. Because of that, we can expect to use it.

What is sun.misc.Unsafe and why is it so fast?
It is a class which has most of its methods intrinsic; that means the methods are known to the JVM JIT compiler and are directly implemented by the compiler itself, rather than being implemented via byte code which is then compiled (just like malloc/free under gcc, for example). We can find out which methods are intrinsic from the relevant JVM header file.

Think about what this actually means:

    double x;
    double a = x;

    // where unsafe is an instance of the Unsafe class
    long ptr;
    double b = unsafe.getDouble(ptr);

These will compile to code which runs at approximately the same speed as each other. However, the latter is accessing some random location in memory, using the long as a pointer! Yes - Java can now use unverified pointers - even (with some tricks) into its own heap!

We can access fields in JVM classes in much the same way. There are a number of tricks around this which I might write up in another post. However, here is the bottom line when it comes to unions.

    long ptr;
    double d=unsafe.getDouble(null,ptr);
    long l=unsafe.getLong(ptr);

This is a union! We are accessing the same memory but with different meanings. We can do this in arbitrary memory, or we can 'look into' JVM arrays at just the same speed. This way we can implement unions in JVM arrays at full speed, with all the JIT optimisations we would expect for non-unions. This has some small benefit for Java, but for implementing languages like COBOL and C on the JVM the implications are enormous.
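For readers who want the union effect without Unsafe, a direct ByteBuffer gives the same one-cell-two-meanings behaviour at some cost in ceremony; a minimal sketch (names are mine):

```java
import java.nio.ByteBuffer;

public class UnionDemo {
    public static void main(String[] args) {
        // One 8-byte cell viewed both as a double and as a long - a union.
        ByteBuffer cell = ByteBuffer.allocateDirect(8);
        cell.putDouble(0, 1.0);
        long asLong = cell.getLong(0);           // the IEEE 754 bit pattern
        System.out.printf("%016X%n", asLong);    // 3FF0000000000000
    }
}
```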

As of Java 8, we are able to implement C and COBOL on the JVM at full native speed. Whether anyone will is another question; but it could be done.

Why multi-tracking a mono synth is NOT cheating

Let us consider this piece - how could this have been made without multi-tracking?

This piece of Bach is actually 10 tracks overlaid. The harpsichord is two hands, each with 4 tracks. The flute is two tracks. The whole piece was 'played' with Sonic Field controlling the MIDI to correctly distribute the notes in sequence to record all 10 tracks. I then overlaid them in Audacity.

Now - is this cheating? Should I have played this in real time? Well, yes and no. It would be lovely to do so but also impossible, as the flute and keyboard are just too hard for one person to play simultaneously (well - for most people anyhow). But, even putting that to one side, the kit involved would be stupid. The sound for the harpsichord required this signal chain:

Waldorf Pulse 2 -> Electro Harmonix Poly Chorus -> Alter Ego X4 Delay -> Behringer Sonic Exciter

Now, we might say we could use 8 Waldorf Pulse 2 synthesizers. That alone is just daft (at £360 each - say $600). Nevertheless, there are good polysynths out there for a lot less than £2880 (e.g. the Prophet 08). But that is assuming we could pass the output through the other components and get the same effect; the truth is that we cannot. The key to the sound is the overlay of the polychorus and delays. The sound would be much, much thinner if all 8 voices passed through the same effects chain. Note that even the Sonic Exciter is actually an envelope-following device in the drive circuit, which means it would be much less responsive if processing all 8 voices at once. Further still, I use different settings for the left and right 'hand' voices.

So - in reality we would require 8 of everything to mimic the sound of the harpsichord. The flute uses a reverb rather than a delay, so we would need Electro Harmonix Cathedral pedals plus two more of the synths and exciters:

10 Synths
 8 Polychorus
 2 Reverbs (EHX Cathedral)
 8 Delays
10 Sonic Exciters

Then we would also need a good mixer!

I guess we are looking at over £10 000 or $15 000.

Seriously - multi-tracking is not cheating - it is a way to bring creating music within reach of a hobbyist!