Why Modernising Away 'Classic' (aka legacy) Systems Might Never Happen

An interesting post turned up on my Twitter feed:

http://www.informationweek.com/why-wal-marts-cio-calls-legacy-it-systems-classic/a/d-id/1320286

A good read about supporting legacy systems, and how calling them 'Classic' - and giving them the support and respect they need and deserve - benefits an organisation. The author commented thus:

Thomas Claburn,
Given the realities of legacy systems, I wonder why more emphasis isn't made on avoiding problems by adopting applications that are open and upgradeable, without lock-in or other burdens. Being an IT leader shouldn't mean kicking the can down the road for the next IT leader to deal with.


Which I thought was just too easy and showed an over-simplistic view of the challenges involved. Here is my reply explaining why we are not going to see even the beginning of the end of classic for at least 15 years.

Hi Thomas,
The answer to your question is not simple but here goes:

1) Being open and upgradeable is only true briefly. Open source projects get abandoned as they become stale or just 'no longer cool'. We see open source as upgradeable because we are looking over a planning horizon of years, not decades.

2) Open source is not always more maintainable; do we go for a clean, well-constructed commercial stack or a loosely associated constellation of applications which happen to work together? We need to be very careful over which open source we go for. In a highly regulated environment, 'any old Linux' is not going to cut it, for example. A bank, or even a large retail chain, will not have the resources in-house to manage an entire Linux stack and so has little choice but to buy in Red Hat or one of the similar third-party offerings. Otherwise, the user of the stack cannot demonstrate to the regulator due diligence over the stack they are using.

3) Now for the kicker: the mainframe. The problem here is that the open source, closed source or any other world has not yet provided a distributed computing alternative to large mainframe installations. The transactional consistency model available with System Z cannot be replicated using commodity hardware and modern networking technology in any meaningful way. The issue is throughput vs latency. You can code up an x86_64 server system to manage huge throughput with an eventually consistent model (think Google) or with an instantly consistent, low latency model (think high frequency trading), but both at once is simply impossible due to all sorts of issues such as the von Neumann bottleneck (i.e. main memory speeds), processor design, network latency, power constraints and hardware reliability.

So, if you have a 'classic' System Z based infrastructure the transactional consistency it provides is likely to be baked into the way the business works. It will be baked into the business's relationship with the regulators. It cannot be 'modernised' because there is nothing to modernise it to. It could be replaced, but that is not just replacing it but replacing an entire way of functioning for the business; this can amount to a complete pivot of a company's business model. Will the new business be competitive? Will it be hit by huge regulator fines? Will the disruption in the pivot cause irreparable damage to the business?

I do see a world, 15 or even 25 years hence, where silicon-to-silicon fibre and FPGAs baked into network cards can start to allow x86_64 (or similar) based hardware to lift and shift away from System Z; but that is firmly in the future and still in doubt. Until such time, maintenance of classic systems will continue to be necessary, and achieving small, incremental shifts away from those systems is all that even the most ambitious CIO should be expecting. I look back 15 years to see that distributed computing has scaled out enormously but has done little to tackle the transactional consistency vertical scaling issue; hence my prediction that we need at least another 15 years for that to happen.

Maybe, just maybe, it will never happen. We might still be using classic technology in 50 years' time, just as we are still driving around on Otto cycle engines which were invented over a century ago and have never been replaced by more modern systems.

- AJ

Ivy Standalone: Tips On Launching An Application

[Image: Poison Ivy - which seems appropriate, as Ivy also drives me batty :)]
I searched and searched and found very little actually helpful documentation on how to use Ivy for anything but building an application. Here are my notes on RUNNING an application with Ivy, to save the next person this pain.


Seriously, why was this so hard? Why are the docs so terrible, and why is there so little out there in the way of help? I find the Java community very poor in terms of help for even simple tooling tasks. It's a case of 'do it this particular way or go to hell'.

Anyhow:

java
    xxxxxx                          \
    -cp ant.jar:ivy.jar             \
    org.apache.ivy.Main             \
    -ivy ivy.xml                    \
    -settings ivy-settings.xml      \
    -main your.main.ClassName       \
    -cp yyyyyyyy                    \
    -args p q r

The trick is to think of the java command line as split into three parts:

JVM Config:

This is the xxxxxx above, and this is where you add JVM command line flags like -Dmy.variable=fred and -XX:+UnlockDiagnosticVMOptions. Any -D option you set here will be available to your application in the normal way.
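
For instance, a trivial sketch (the property name is made up): a -D option set in this part is read back in the usual way:

// Given -Dmy.variable=fred in the JVM-config part of the command line:
String v = System.getProperty("my.variable"); // "fred"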

Running Ivy:

This is the first -cp, which has your ant and ivy jars on it. You might not need ant.jar on there, but in my case I did. The main class for the JVM is org.apache.ivy.Main. Then you specify the location of your ivy and ivy-settings files.

Running Your Application:

Any jars or class directories you want on the application classpath which are not ivy resolved need to go in this, the second, -cp command line argument. NOTE: yes, you have two -cp arguments. This is a good way of:

  1. Going mad.
  2. Wasting a day's work.
  3. Being very profane.

-args then starts the parameter list for your actual program. It is everything after -args which gets passed to public static void main(String[] args).
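
Putting the three parts together, a complete invocation might look something like this (the paths, main class and program arguments are made up for illustration):

java                                \
    -Dmy.variable=fred              \
    -cp ant.jar:ivy.jar             \
    org.apache.ivy.Main             \
    -ivy ivy.xml                    \
    -settings ivy-settings.xml      \
    -main com.example.app.Main      \
    -cp config:extra-lib.jar        \
    -args p q r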

The ultimate gotcha

Do not get fooled into adding jars and class directories for your application to the first -cp. That first -cp is for ivy itself. The second -cp is for you. Forget this piece of wisdom and all efforts to get ivy to work will be doomed to failure.

Synchronising sun.misc.Unsafe With C++

Testing atomics

What is a fence, a volatile read and a volatile set - and why do so many people get them wrong?


Quick Note

I have simplified and glossed over quite a bit in this article; this is especially so regarding the implementation of x86 cpus. I expect to get some backlash over this - I look forward to it. However, I personally believe the level of simplification I have used is about right to allow for safe and sound reasoning without drowning in the details. If you disagree, please put a comment and add to the discussion around this important and interesting topic.

Unsafe - The New Fence Methods

Let us start by looking at the new fence entries in Unsafe. These were added with Java 8 and seem to be very widely misunderstood.

/**
 * Ensures lack of reordering of loads before the fence
 * with loads or stores after the fence.
 * @since 1.8
 */
public native void loadFence();

/**
 * Ensures lack of reordering of stores before the fence
 * with loads or stores after the fence.
 * @since 1.8
 */
public native void storeFence();

/**
 * Ensures lack of reordering of loads or stores before the fence
 * with loads or stores after the fence.
 * @since 1.8
 */
public native void fullFence();


From now on I am discussing Linux x86 and x86_64. 99% of everything I say will apply to Windows and Mac on those chips; for other chips, what I say below is not directly relevant.


A Simple Model Of Memory

We can think of a two CPU (or two core) x86 system a little like this:

Store:
[Execution Unit]->[Store Buffer]->[Cache]\
                                          ->[Main Memory]
[Execution Unit]->[Store Buffer]->[Cache]/

Load:
[Execution Unit]<-[Read Buffer]<-[Cache]\
                                          <-[Main Memory]
[Execution Unit]<-[Read Buffer]<-[Cache]/


On a single CPU, reads and writes are not reordered with respect to the same location; however, reads can be reordered with other reads, and writes with other writes. Reordering between different locations is permitted either way, as long as the single-threaded result stays consistent. Further, the compiler is at liberty to perform any reordering it wants. The hardware reordering is done via the store and read buffers (a gross simplification).

Cache coherence ensures that the values read from and written to cache are consistent and coherent across the entire machine. We do not need to worry about caches writing to or reading from main memory; as long as reads and writes go to/from cache, we get a consistent global state. All race conditions occur due to reordering by the compiler or in the read and store buffers.
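
To make these reorderings concrete, here is the classic store-buffer litmus test as a minimal Java sketch (my own illustration, not from any source; with plain, non-volatile fields both threads can read 0, an outcome impossible under any simple interleaving, though thread start-up costs make it rare to observe in practice):

public class StoreBufferLitmus {
    static int x, y, r1, r2; // plain fields: no volatile, no fences

    public static void main(String[] args) throws Exception {
        for (int run = 0; run < 1_000_000; run++) {
            x = 0; y = 0; r1 = -1; r2 = -1;
            Thread a = new Thread(() -> { x = 1; r1 = y; });
            Thread b = new Thread(() -> { y = 1; r2 = x; });
            a.start(); b.start();
            a.join(); b.join();
            if (r1 == 0 && r2 == 0) { // neither thread saw the other's store
                System.out.println("Store buffer reordering observed on run " + run);
                return;
            }
        }
        System.out.println("Not observed this time - rare, but not impossible");
    }
}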

Fences are all about trying to prevent some or all of these reorderings. In the x86 instruction set we have mfence, sfence and lfence. These are about preventing reordering and ensuring that the memory state of the local cpu/core is globally visible. The cache coherence system of the CPU is then able to ensure that global visibility actually works.

  • lfence: no read reordering across this instruction; reads after this will be consistent with the global state of the cache at the point of retiring this instruction.
  • sfence: no write reordering across this instruction; all writes before the retirement of this instruction will be moved to the cache and consistently seen across the entire machine.
  • mfence: lfence and sfence combined, atomically.

Unsafe fences are a little different. fullFence behaves as you would expect, but the other two are nothing like the x86 instructions.

Full Fence

Fences 'flush the buffers' (again, a gross simplification), which is also often referred to as serialisation. This means that reads are never reordered over a load fence and writes are never reordered over a store fence. Unsafe.fullFence could have been implemented using the mfence instruction (though, as we will see, it is not). Because we flush the store buffer to the cache, and the cache has a consistency model, we see the new value globally.

Thread A Store  0 to 0x467897
Thread B Store 25 to 0x467897
Thread B fullFence()
Thread A fullFence()
Thread A Load from 0x467897

What is the value of the read in Thread A? The answer is 25, because both Thread A and Thread B are guaranteed to have no reordering and to be reading/writing from main memory or the cache. The cache is consistent due to cache coherence, so we have a consistent memory model.
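
As a rough Java sketch of the same scenario (my own illustration; it assumes JDK 8's sun.misc.Unsafe obtained via the usual reflection hack and ignores how the two threads are scheduled):

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class FullFenceSketch {
    static final Unsafe UNSAFE;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new AssertionError(e);
        }
    }

    static long value; // plain field standing in for address 0x467897

    static void threadB() {
        value = 25;         // the store may initially sit in B's store buffer
        UNSAFE.fullFence(); // store buffer drained; 25 is now globally visible
    }

    static long threadA() {
        UNSAFE.fullFence(); // no loads are reordered back over this point
        return value;       // sees 25 once threadB's fence has retired
    }
}

Under the hood, HotSpot implements fullFence via OrderAccess::fence, which looks like this: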

inline void OrderAccess::fence() {
  if (os::is_MP()) {
    // always use locked addl since mfence is sometimes expensive
#ifdef AMD64
    __asm__ volatile ("lock; addl $0,0(%%rsp)" : : : "cc", "memory");
#else
    __asm__ volatile ("lock; addl $0,0(%%esp)" : : : "cc", "memory");
#endif
  }
}

The JDK uses the serialising/fence effect of a 'locked add' to flush the read and write buffers at the same time. Despite the x86 instruction set having mfence, the developers of the JDK have evidence that their implementation is faster. They perform an atomic addition of zero to the stack pointer. This is a full memory barrier/fence and does ensure that all other cpus/cores will see the memory values stored by this one prior to the instruction (due to the x86 memory model being so strong).

So - on x86 this is true:
Thread A Store  0 to 0x467897
Thread B Store 25 to 0x467897
Thread B fullFence()
Thread A Load from 0x467897

We can still guarantee that A will read 25. It seems a bit extreme, but yes - the x86 memory model is that strong on ordering.

Note that because we are hitting the stack, the chances are very good indeed that this operation will not attract inter-cpu or inter-core communication, as stacks are usually local to a particular core. Further, because it is the top of the stack, we can guarantee that it will be swapped in and not cause a segmentation violation. As hacks go (and let us be honest - this is a hack) it is quite a nice one.


Load And Store Fence

These are ridiculously weak in Java. They actually do nothing other than prevent compiler reordering:

inline void OrderAccess::acquire() {
  volatile intptr_t local_dummy;
#ifdef AMD64
  __asm__ volatile ("movq 0(%%rsp), %0" : "=r" (local_dummy) : : "memory");
#else
  __asm__ volatile ("movl 0(%%esp),%0" : "=r" (local_dummy) : : "memory");
#endif // AMD64
}

inline void OrderAccess::release() {
  // Avoid hitting the same cache-line from
  // different threads.
  volatile jint local_dummy = 0;
}

Really? Yes - these are store and read buffer neutral operations. The fence operations are identical to a read and a write (respectively) of a C/C++ volatile. I.e. they have no special effect on the hardware and serve only to prevent the compiler reordering around them or performing speculative operations.

One might ask 'why have these at all?'. The answer - I guess - is a sort of reverse logic. As we will see below, the volatile store operation from Unsafe is a fully serialising, locked instruction and so potentially very expensive (tens to hundreds of clock cycles). Without these methods, Java would completely lack a lightweight way of preventing the compiler from reordering across a barrier. Unsafe.loadFence() and Unsafe.storeFence() provide that very mechanism, but offer none of the functionality an x86-experienced developer might think comes with the word 'fence'. They are undoubtedly more useful and more powerful on hardware with a weaker memory ordering model than x86. On x86 they are almost, but not quite, ridiculous.

Let us be very clear about this: the above JVM code does nothing to prevent reordering of stores and loads to different locations, even on the same CPU. The above implementations are not actually fences at all, and the name could be quite misleading.

** If I am wrong about this interpretation of Unsafe fences, please shout out!

Volatile And Atomic

Now we can look at the interaction of C++ atomics with Java. Both Java and C++ simplify the memory situation by ignoring read reordering, leveraging cache coherence and the effect of the lock prefix (or implied prefix). Here is how it works:

  1. A read cannot be inconsistent unless there is a write, because reads have no effect on memory state.
  2. Therefore, if writes are in the correct order and reads cannot be reordered around them, reads are effectively in the correct order.
  3. If we use the lock prefix (or an implied lock) we can guarantee GLOBAL write/store order for a given memory cell (a cell being 32 (x86) or 64 (x86_64) bits, aligned). We also make all stores before the instruction from that cpu/core globally visible.
  4. As, on each cpu/core, reads cannot be reordered with writes, we get GLOBAL sequentially consistent ordering.

Using a locked instruction on x86 is like reaching into all the cores/cpus on the machine and telling them all to read from cache all the stores which have been made by the locking cpu/core. This is usually much more efficient than using lfence for reads and sfence for writes. Why? Well, in general, reads outnumber writes by a large margin; if we used fences, the cost of the lfence would be paid frequently. By using the lock instead, reads run at full speed; the only cost incurred is the lack of compiler reordering. Writes are expensive due to the lock instruction; this might be more expensive than an sfence, but writes are usually less common than reads, so it is a price worth paying. Anyhow, we have no choice, because both Java and C++ have chosen the naked read / locked write approach and we should just get on and use it.

The C++ standard has several memory orderings. However, for x86 Java interop' we are really only interested in three. Here are the C++ memory orderings and their Java equivalents.

Table 1: Matching Java Operations to C++ Atomics

Java Volatile Operation    | Java Unsafe Operation            | C++ Memory Order
volatile long x; x = y;    | unsafe.putLongVolatile(o, p, q); | memory_order_release
volatile long x; y = x;    | unsafe.getLongVolatile(o, p);    | memory_order_acquire
N/A                        | unsafe.getAndSetLong(o, p, q);   | memory_order_seq_cst

Note that I have used long as an example - int and Object versions also exist in Java.
Note that passing null as the first parameter to the Unsafe methods uses absolute addressing rather than object-relative addressing.

This memory ordering approach means that the get/load/memory_order_acquire actions have no special machine operations associated with them. The only thing they require/enforce is that the compiler does not reorder across them. The put/store/memory_order_release operations are implemented using the lock assembler prefix (or an instruction with an implied prefix), which ensures the stores are in order and thus forces the reads to be in order with respect to the stores.
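
As a small illustrative sketch (my own, not from the JDK; it assumes the reflection-obtained UNSAFE from earlier and a made-up Holder class, with checked exception handling elided), the Java volatile and Unsafe columns of Table 1 are interchangeable:

class Holder {
    volatile long x;
}

Holder h = new Holder();
long off = UNSAFE.objectFieldOffset(Holder.class.getDeclaredField("x"));

h.x = 42L;                               // volatile store: locked/xchg on x86 (release)
UNSAFE.putLongVolatile(h, off, 42L);     // the same store via Unsafe

long a = h.x;                            // volatile load: plain read plus compiler barrier (acquire)
long b = UNSAFE.getLongVolatile(h, off); // the same load via Unsafe

long prev = UNSAFE.getAndSetLong(h, off, 7L); // sequentially consistent exchange; returns the old value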

The Java implementation of get and put volatile relies on the following macros:

#define GET_FIELD_VOLATILE(obj, offset, type_name, v) \
  oop p = JNIHandles::resolve(obj); \
  volatile type_name v = OrderAccess::load_acquire((volatile type_name*)index_oop_from_field_offset_long(p, offset));

#define SET_FIELD_VOLATILE(obj, offset, type_name, x) \
  oop p = JNIHandles::resolve(obj); \
  OrderAccess::release_store_fence((volatile type_name*)index_oop_from_field_offset_long(p, offset), x);

These call out to the following methods:

inline void     OrderAccess::release_store_fence(volatile jbyte*  p, jbyte  v) {
  __asm__ volatile (  "xchgb (%2),%0"
                    : "=q" (v)
                    : "0" (v), "r" (p)
                    : "memory");
}
inline void     OrderAccess::release_store_fence(volatile jshort* p, jshort v) {
  __asm__ volatile (  "xchgw (%2),%0"
                    : "=r" (v)
                    : "0" (v), "r" (p)
                    : "memory");
}
inline void     OrderAccess::release_store_fence(volatile jint*   p, jint   v) {
  __asm__ volatile (  "xchgl (%2),%0"
                    : "=r" (v)
                    : "0" (v), "r" (p)
                    : "memory");
}

inline void     OrderAccess::release_store_fence(volatile jlong*   p, jlong   v) {
#ifdef AMD64
  __asm__ __volatile__ (  "xchgq (%2), %0"
                          : "=r" (v)
                          : "0" (v), "r" (p)
                          : "memory");
#else
  release_store(p, v); fence();
#endif // AMD64
}

inline jbyte    OrderAccess::load_acquire(volatile jbyte*   p) { return *p; }
inline jshort   OrderAccess::load_acquire(volatile jshort*  p) { return *p; }
inline jint     OrderAccess::load_acquire(volatile jint*    p) { return *p; }
inline jlong    OrderAccess::load_acquire(volatile jlong*   p) { return Atomic::load(p); }
inline jubyte   OrderAccess::load_acquire(volatile jubyte*  p) { return *p; }
inline jushort  OrderAccess::load_acquire(volatile jushort* p) { return *p; }
inline juint    OrderAccess::load_acquire(volatile juint*   p) { return *p; }
inline julong   OrderAccess::load_acquire(volatile julong*  p) { return Atomic::load((volatile jlong*)p); }
inline jfloat   OrderAccess::load_acquire(volatile jfloat*  p) { return *p; }
inline jdouble  OrderAccess::load_acquire(volatile jdouble* p) { return jdouble_cast(Atomic::load((volatile jlong*)p)); }

We can see that, as I explained, the stores are done using locks; xchg instructions on x86 have what is called an 'implied lock', because the explicit lock prefix is not required. The reads are just memory reads around which the compiler must not reorder.

The Java implementation of the getAndSet operations actually uses compareAndSwap and is a bit sloppy, to be honest. I have not checked, but I hope the JIT compiler can improve on things:

/**
 * Atomically exchanges the given value with the current value of
 * a field or array element within the given object o
 * at the given offset.
 *
 * @param o object/array to update the field/element in
 * @param offset field/element offset
 * @param newValue new value
 * @return the previous value
 * @since 1.8
 */
public final long getAndSetLong(Object o, long offset, long newValue) {
    long v;
    do {
        v = getLongVolatile(o, offset);
    } while (!compareAndSwapLong(o, offset, v, newValue));
    return v;
}

My guess is that this implementation is just a convenience because it works on many CPUs, rather than using the locked xchg which x86 provides directly.

I have not shown the actual assembler generated for volatile variables by the JVM. It does, however, follow the same convention: a volatile read is just a read on x86, whilst a volatile write uses an explicit or implied lock instruction (note that some sources, including mechanical-sympathy, say volatiles are achieved using sfence and lfence; this is not the case in modern JVMs). This means that the state of volatile variables is consistent across all the threads on a JVM, just as a variable stored with memory_order_release is in C++, and one set with putXXXXVolatile is from an instance of Unsafe.

Conclusions And Recommendations

  1. Only bother with the ordering semantics from Table 1.
  2. If in doubt, use sequentially consistent.
  3. Do not bother with the fence stuff except for last-nanosecond tuning of a very important piece of code. Basically, if you find yourself needing to use one of the fence methods from Unsafe, it is 99.9% certain there is a better way of implementing what you are doing.**
  4. Remember that memory_order_acquire and volatile get/load are cheap; memory_order_release and volatile set/store are expensive.
  5. Now that C++ and Java both have memory models, and both are sensibly and reliably implemented on x86, we can confidently build high performance synchronisation without having to resort to the sledgehammer of mutexes or to guesswork and finger crossing.

** One illustrative example where they might be useful would be publishing to a single publisher single consumer queue, where we just want to ensure the compiler does actually write the queue end location to memory but are not especially bothered when that write happens; a sketch of this follows.
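
Here is a minimal, hypothetical sketch of that pattern (my own illustration, assuming the reflection-obtained UNSAFE from earlier; the ring size and names are invented, and this is not production code):

class SpscQueue {
    private final long[] ring = new long[1024];
    private long tail; // written only by the producer; read by the consumer

    // Producer: fill the slot, then publish the new tail.
    void publish(long value) {
        ring[(int) (tail & 1023L)] = value;
        UNSAFE.storeFence(); // compiler barrier: the slot store is emitted before the tail store
        tail = tail + 1;     // x86 keeps these two stores in order in hardware anyway
    }

    // Consumer: force a fresh read of tail rather than a compiler-cached one.
    long available(long head) {
        UNSAFE.loadFence();  // compiler barrier: re-read tail from memory
        return tail - head;
    }
}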

Source Code Lookup

This article is largely based on the following source code references:

http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/sun/misc/Unsafe.java
http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/share/vm/prims/unsafe.cpp
http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os_cpu/linux_x86/vm/orderAccess_linux_x86.inline.hpp