Inlining - The Real Reason Template Dispatch Matters

Unwinding the complex spiral of a call tree is the
key benefit of templated dispatch.

Evaluation at compile time is obviously fast, but why is compile time template dispatch so much better than runtime virtual dispatch?


'call' is not that expensive; indeed, dispatch off a vtable is not _that_ expensive. So why is template based programming so performant? Why is it so much more efficient to compute dispatch at compile time than at runtime?

The reason is inlining, followed by later optimisation. Inlining is a powerful optimisation by itself because it elides (removes) the overhead of a call. A call has some cost: the jump in the instruction pointer can waste instruction cache lines, there is the overhead of marshalling local data into the call arguments, and there is manipulation of the stack. Inlining gets rid of all of these costs, but that is not the real benefit of compile time dispatch.

Consider what happens after the inlining 'event'. We need to think like a compiler: a function is inlined and then the optimiser can consider the body of the function in the same context as the calling code. Suddenly all sorts of very powerful optimisations, like loop unrolling, register localisation, constant folding and instruction eliding, come into play which are just not possible through an opaque function call.
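
As a deliberately simplified illustration (nothing to do with the synth code which follows), consider a trivial helper whose real value only appears once it has been inlined:

#include <cstddef>

// Illustrative only: a tiny helper which, once inlined, exposes the
// constant multiplier to the optimiser inside the caller's loop.
inline double scale(double x, double gain)
{
 return x * gain;
}

double sumScaled(const double* data, std::size_t n)
{
 double total = 0.0;
 for (std::size_t i = 0; i < n; ++i)
 {
  // After inlining, the compiler sees data[i] * 2.0 directly, so it can
  // unroll and vectorise the loop; an opaque call to scale() would block
  // those transformations.
  total += scale(data[i], 2.0);
 }
 return total;
}

The call itself is cheap; the win is that, once the body is visible, the multiplication becomes part of the loop the optimiser is working on.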

Here is a fuller example which illustrates the inline/optimisation effect.

Code being code, any example can be replicated using a different programming approach. It would be possible to write this code to run just as efficiently and with just as effective optimisation in C or Fortran. So, please accept this example as just that, an example of inline/optimisation.

#include "core.hpp"
namespace SonicCpp
{

 // Recurse template to generate a sawtooth function without aliasing
 // this will construct all the generators for a sawtooth at compile time
 // and therefore allow them to be inlined and optimised
 template<pitch_t terminator,pitch_t pitch,size_t Size,size_t recurse>
 struct Sawtooth_inner
 {
  SinGenerator<pitch*recurse,Size> gen;
  //maxHarmonic(pitch) is a constexpr which will equal recurse
  // when the max frequency has been reached so match the 0 specialised
  // template
  Sawtooth_inner<maxHarmonic<pitch>()-recurse,pitch,Size,recurse+1> saw;
  double get(size_t i)
  {
   return gen.get(i)/(double)recurse + saw.get(i);
  }
 };

 template<pitch_t pitch,size_t Size,size_t recurse>
 struct Sawtooth_inner<0,pitch,Size,recurse>
 {
  SinGenerator<pitch*recurse,Size> gen;
  double get(size_t i)
  {
   return gen.get(i)/(double)recurse;
  }
 };

 template<pitch_t pitch,size_t Size>
 struct Sawtooth
 {
  Sawtooth_inner<1,pitch,Size,1> gen;
  double get(size_t i)
  {
   return gen.get(i);
  }

  size_t size()
  {
   return Size;
  }
 };
}

This program generates a recursive loop which adds together scaled sine waves to produce a non-aliased sawtooth. To achieve this I use template specialisation to act as the termination condition for template recursion. In this case, I use a constexpr to work out the harmonic one above the highest harmonic to include. When the current harmonic equals that limiting one, the subtraction of the two is zero, which matches the terminating template specialisation (in a later version of the code I use another constexpr to make it easier to match just on a bool):

 template<size_t A,size_t B>
 constexpr bool greaterThan(){ return A>B;}

 template<size_t A,size_t B>
 constexpr bool lessThan(){ return A<B;}

 template<size_t A,size_t B>
 constexpr bool equalTo(){ return A==B;}
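
For illustration, a recursion driven by one of these bool helpers might look like the following sketch (a simplified sketch, not the exact later version of the code); the 'false' specialisation is the stop case:

 // Sketch: dispatch the recursion on a bool computed with lessThan<>().
 // The primary template recurses to the next harmonic; the 'false'
 // specialisation terminates the chain.
 template<bool more,pitch_t pitch,size_t Size,size_t recurse>
 struct Saw_step
 {
  SinGenerator<pitch*recurse,Size> gen;
  Saw_step<lessThan<recurse+1,maxHarmonic<pitch>()>(),pitch,Size,recurse+1> next;
  double get(size_t i)
  {
   return gen.get(i)/(double)recurse + next.get(i);
  }
 };

 template<pitch_t pitch,size_t Size,size_t recurse>
 struct Saw_step<false,pitch,Size,recurse>
 {
  SinGenerator<pitch*recurse,Size> gen;
  double get(size_t i)
  {
   return gen.get(i)/(double)recurse;
  }
 };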

So, in theory, all I have generated is a complex recursive set of functions which will be dog slow to run. It would have been better to express the whole thing as an imperative loop. However, all the functions, the number of functions there are and the way they call one another are all known to the compiler at compile time. Consequently, it can massively optimise away all the function calls and produce a single block of object code which does the entire calculation in a very efficient manner.

The unoptimised compilation:

.globl __ZN8SonicCpp14Sawtooth_innerILy42ELy4400000ELm1024ELm5EE3getEm
 .weak_definition __ZN8SonicCpp14Sawtooth_innerILy42ELy4400000ELm1024ELm5EE3getEm
__ZN8SonicCpp14Sawtooth_innerILy42ELy4400000ELm1024ELm5EE3getEm:
LFB2037:
LM92:
 pushq %rbp
LCFI77:
 movq %rsp, %rbp
LCFI78:
 subq $32, %rsp
 movq %rdi, -8(%rbp)
 movq %rsi, -16(%rbp)
LM93:
 movq -8(%rbp), %rax
 movq -16(%rbp), %rdx
 movq %rdx, %rsi
 movq %rax, %rdi
 call __ZNK8SonicCpp12SinGeneratorILy22000000ELm1024EE3getEm
 movd %xmm0, %rdx
 movabsq $4617315517961601024, %rax
 movd %rdx, %xmm1
 movd %rax, %xmm3
 divsd %xmm3, %xmm1
 movsd %xmm1, -24(%rbp)
 movq -8(%rbp), %rax
 leaq 8(%rax), %rdx
 movq -16(%rbp), %rax
 movq %rax, %rsi
 movq %rdx, %rdi
 call __ZN8SonicCpp14Sawtooth_innerILy41ELy4400000ELm1024ELm6EE3getEm
 movd %xmm0, %rax
 movd %rax, %xmm2
 addsd -24(%rbp), %xmm2
 movd %xmm2, %rax
LM94:
 movd %rax, %xmm0
 leave
LCFI79:
 ret

 .globl __ZNK8SonicCpp12SinGeneratorILy22000000ELm1024EE3getEm
 .weak_definition __ZNK8SonicCpp12SinGeneratorILy22000000ELm1024EE3getEm
__ZNK8SonicCpp12SinGeneratorILy22000000ELm1024EE3getEm:
LFB2048:
LM101:
 pushq %rbp
LCFI86:
 movq %rsp, %rbp
LCFI87:
 subq $16, %rsp
 movq %rdi, -8(%rbp)
 movq %rsi, -16(%rbp)
LM102:
 movq -8(%rbp), %rax
 movq (%rax), %rcx
 movq -16(%rbp), %rax
 testq %rax, %rax
 js L70
 pxor %xmm0, %xmm0
 cvtsi2sdq %rax, %xmm0
 jmp L71
L70:
 movq %rax, %rdx
 shrq %rdx
 andl $1, %eax
 orq %rax, %rdx
 pxor %xmm0, %xmm0
 cvtsi2sdq %rdx, %xmm0
 addsd %xmm0, %xmm0
L71:
 movd %rcx, %xmm1
 mulsd %xmm0, %xmm1
 movd %xmm1, %rax
 movd %rax, %xmm0
 call _sin
 movd %xmm0, %rax
LM103:
 movd %rax, %xmm0
 leave
LCFI88:
 ret

In the above unoptimised code we can see that every one of the recursive implementations of the sine generator does an integral to double conversion (the cvtsi2sdq sequence in each function). The conversion is doubly complicated by using slightly different semantics for negative and positive numbers (at least I think that is what it is doing). When we add this overhead on top of the call and return framework we end up with a huge amount of complexity which does not need to be there.

The conversion to double is due to the index in the sin generator:

 template<pitch_t pitch,size_t Size>
 class SinGenerator: public GeneratorRoot
 {
  const double mult=M_PI * 2.0 * pitch / (SAMPLE_RATE * NORMAL_OFF);
 public:
  double get(size_t index) const
  {
   return sin(mult*index);
  }

  static constexpr size_t size()
  {
   return Size;
  }
 };

However, this very inefficient implementation becomes highly efficient when the optimiser gets to it. Because the level of recursion and all the values apart from the index are known at compile time, the compiler optimises almost all the framework away. The conversion to double only has to happen once (the single cvtsi2sdq near the top of the listing below) and then all the computation of sine is inlined into one simple sequence of instructions (each load of a constant, mulsd and call _sin is one 'recursion' equivalent).

LM24:
 cmpq $1024, %rbx
 je L8
 movsd LC3(%rip), %xmm0
 pxor %xmm2, %xmm2
 cvtsi2sdq %rbx, %xmm2
 mulsd %xmm2, %xmm0
 movsd %xmm2, 8(%rsp)
 call _sin
LVL25:
 movsd %xmm0, 16(%rsp)
 movsd LC54(%rip), %xmm0
 mulsd 8(%rsp), %xmm0
 call _sin
LVL26:
 movd %xmm0, %r15
 movsd LC55(%rip), %xmm0
 mulsd 8(%rsp), %xmm0
 call _sin
LVL27:
 movsd LC56(%rip), %xmm4
 mulsd 8(%rsp), %xmm4
 movd %xmm0, %r12
LVL28:
 movapd %xmm4, %xmm0
 call _sin
LVL29:
 movsd %xmm0, 24(%rsp)
 movsd 8(%rsp), %xmm0
 mulsd LC57(%rip), %xmm0
 call _sin
LVL30:
 movd %xmm0, %r13
LVL31:
 movsd 8(%rsp), %xmm0
 mulsd LC58(%rip), %xmm0
 call _sin
LVL32:
 movsd %xmm0, 32(%rsp)
 movsd 8(%rsp), %xmm0
 mulsd LC59(%rip), %xmm0
 call _sin
LVL33:
 movd %xmm0, %r14
 movsd 8(%rsp), %xmm0
 mulsd LC60(%rip), %xmm0
 call _sin
LVL34:
 movsd %xmm0, 40(%rsp)
 movsd 8(%rsp), %xmm0
 mulsd LC61(%rip), %xmm0
 call _sin
LVL35:
 movsd %xmm0, 48(%rsp)
 movsd 8(%rsp), %xmm0
 mulsd LC62(%rip), %xmm0
 call _sin
... AND SO ON ...

Whilst this might not be the very best possible example, I hope it shows how templates and constexpr allow much more optimisation than just working out some stuff at compile time. Whilst it is great to compute some stuff up front, the real heavy lifting comes from allowing the optimiser access to a complete description of the algorithms being employed rather than being blocked by opaque function calls.

Java: Eventually Consistent Off Heap Memory Models

All about consistency!

In my post on atomics I discussed instantaneously consistent memory models; here we look at more relaxed, and thus potentially more performant, 'eventually consistent' models.


Again, I am going to stick with x86_64 architecture for this post. The concepts also directly apply to x86 as it is fundamentally (from a memory point of view) the same thing with narrower registers. Some of this is generally applicable but there are no guarantees. 

Cache Consistency

Here is my super simple model of an x86_64 two-core system.

Store:
[Execution Unit]->[Store Buffer]->[Cache]\
                                          ->[Main Memory]
[Execution Unit]->[Store Buffer]->[Cache]/

Load:
[Execution Unit]<-[Read Buffer]<-[Cache]\
                                          <-[Main Memory]
[Execution Unit]<-[Read Buffer]<-[Cache]/

The key to understanding (rather than getting very confused by) the way memory consistency works in x86_64 is that the cache is consistent. Changes to state do not need to 'get to main memory' or 'be flushed to main memory'. As soon as a change (mutation) gets to the cache, all the other cores/CPUs will see the mutated value. Cores/CPUs do not communicate via main memory; they communicate via the cache consistency system.

Where does reordering come from?

However, as I discussed before, x86_64 does have memory write reordering. This is due to the store buffer, which can be thought of as a queue between the processor's execution unit and the cache. The queue is not FIFO (first in first out) though; it allows reordering. This is to permit optimisations to occur in hardware and is critically important for many of the performance tricks modern silicon can perform.

The store buffer is not a cache

Here we have the key to eventual consistency. The contents of the store buffer will eventually be flushed to cache. When they are, the cache consistency system will ensure the write is consistent across all processors. Writes can be elided (as far as I can tell from the standard). In other words, if there is code which writes a value to address X then immediately writes another value to X, other processors might only ever see the second value. We cannot guarantee the time at which a value will leave the store buffer either. However, there are many useful things we can do for inter processor communication even with these restrictions. We do not always have to resort to memory barriers and atomic instructions.

Two illustrative eventually consistent patterns

Continuous appender

Whilst stores to the same location can be elided in the store buffer, stores to different locations cannot. If we are always storing to different locations then we never elide those stores. An example is where one processor writes out a block of data and its length. Consider that we have two processors, a writer and a reader.


  • All memory is zeroed (mfence if required). This is the critical start step.
  • Reader polls the length, waiting for it to become non-zero.
  • Writer writes the content.
  • Writer sets the length field to the length it just wrote.
  • At some point the reader will see the length update and read a block.

A pattern like that works well for logging, where we have a log writer and a log reader. Any sort of streaming data communication can be performed this way. Unfortunately, there is a catch: the store buffer can reorder the data writes such that the length is updated before the data is all written out. This means that the reader must be tolerant of dirty writes. The performance benefit of the approach is enormous, but there is this drawback.

We can work around the drawback, however, because we know the store buffer is really small. The amount of reordering is necessarily small. If the blocks of data run into hundreds of bytes, it is perfectly safe (on modern x86) to read not the last block written, but the one before that.
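
To make the pattern concrete, here is a minimal single-writer/single-reader sketch (the slot layout and names are purely illustrative; the length word is a relaxed std::atomic, which on x86_64 compiles to a plain move with no fences and no locked instructions, it simply keeps the accesses well defined for the compiler):

#include <atomic>
#include <cstddef>
#include <cstring>

// Sketch only: a single writer appends fixed-size records, a single
// reader polls the length words. A zero length means 'not written yet',
// so all the memory must start zeroed, as in the steps above.
constexpr std::size_t SLOTS = 1024;
constexpr std::size_t SLOT_BYTES = 256;

struct Slot
{
 std::atomic<std::size_t> length{0};
 char data[SLOT_BYTES];
};

Slot logBuffer[SLOTS];

// Writer: fill the payload first, then publish it by storing the length.
void append(std::size_t slot, const char* src, std::size_t len)
{
 std::memcpy(logBuffer[slot].data, src, len);
 logBuffer[slot].length.store(len, std::memory_order_relaxed);
}

// Reader: returns false until the slot looks published. Because the store
// buffer can reorder the publishes, a cautious caller only consumes slot N
// once slot N+1 is also non-zero, staying one block behind as described above.
bool tryRead(std::size_t slot, char* dst)
{
 std::size_t len = logBuffer[slot].length.load(std::memory_order_relaxed);
 if (len == 0)
  return false;
 std::memcpy(dst, logBuffer[slot].data, len);
 return true;
}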

Value handover

This pattern is useful for synchronous communication between two threads. Here is the initial idea.

  • Thread A sets X to a value P.
  • At some point Thread B will 'see' X change to value P.

We can use this as a lazy inter-thread, or inter-process (using shared memory), procedure call.
  • Thread A sets X to a value P.
  • At some point Thread B will 'see' X change to value P.
  • Thread B does some work.
  • Thread B sets X to value Q.
  • At some point Thread A will 'see' X change to value Q.
If A and B spin checking the value of X then this might be a bit wasteful. If A and B are looping doing other useful work, then this is like a one element long asynchronous communication queue. Because A only ever sets X to P and B only ever sets X to Q, we really do not require any atomics; there is no ABA problem; it is really quite beautifully simple.
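
A minimal sketch of the handover (purely illustrative; X is made a relaxed std::atomic so the compiler cannot cache it in a register, and on x86_64 a relaxed load or store is just an ordinary move, with no barriers and no locked instructions):

#include <atomic>

// Sketch only: two threads hand a token back and forth through X.
// A only ever writes P and B only ever writes Q, so there is no ABA
// problem and no lock is needed.
enum Token { EMPTY = 0, P = 1, Q = 2 };

std::atomic<int> X{EMPTY};

// Thread A: request the work, then wait (or get on with something else
// between checks) until B has answered with Q.
void threadA()
{
 X.store(P, std::memory_order_relaxed);
 while (X.load(std::memory_order_relaxed) != Q)
 {
  // spin, or do other useful work between checks
 }
}

// Thread B: wait to see P, do the work, then answer with Q.
void threadB()
{
 while (X.load(std::memory_order_relaxed) != P)
 {
  // spin, or do other useful work between checks
 }
 // ... do the requested work here ...
 X.store(Q, std::memory_order_relaxed);
}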

A Final Thought

It is really easy to get into an 'everything must be locked' mindset. We can end up putting mutexes, or spin locks, around everything. We then learn to use atomics. How refreshing to think of patterns which do not require any of this extra machinery but still function and, in general, do so much more efficiently.

Force constexpr to only be at compile time

Pigging out on constexpr

constexpr is an amazing way of asking a C++ compiler to evaluate at compile time; but how can we FORCE it to ONLY evaluate at compile time?


Templates are evaluated at compile time; however, recursive template evaluation is tricky. With the up-and-coming C++14 compilers we can even have imperative programming inside constexpr. So how can we have the ease of constexpr functions with the compile-time-only guarantee of templates?

I am sure there are many, many ways of doing this (why program in C++ if not to have 1001 ways of doing any particular task); however, how about just calling a constexpr function from a template function? That is super simple.

For example:

constexpr size_t maxHarmonic(pitch_t fund,size_t multi)
{
 return fund*multi>MAX_FREQUENCY_N?
                       multi:maxHarmonic(fund,multi+1);
}

// Useful for template and constexpr recursion termination
// computation
constexpr size_t maxHarmonic(pitch_t fund)
{
 return maxHarmonic(fund,1);
}

// Guaranteed compile time only
template<pitch_t fund>
constexpr size_t maxHarmonic()
{
 return maxHarmonic(fund);
}

This is using C++11 style constexpr. I use two constexpr functions to recursively search for the maximum harmonic, below some cut-off, of a fundamental frequency. However, just calling:

maxHarmonic(123);

should be evaluated at compile time but might be evaluated at run time. As the code gets more complex there could be a line like this:

maxHarmonic(x);

where I thought x was compile time, but the compiler finds a compilation route under which it is evaluated at runtime. The program will run correctly, but that very inefficient recursive function (which might or might not be tail-call optimised) could end up being evaluated in some inner loop where I had expected maxHarmonic(x) to be a constant.

With the above code I can have the best of both worlds:

maxHarmonic(x); // works anywhere
maxHarmonic<x>(); // guaranteed to be compile time only
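
For example, the template form can be used anywhere the language demands a constant expression, which doubles as a compile-time check (a sketch; pitch_t comes from the project headers and the fundamental value here is made up):

 // Sketch: in these contexts maxHarmonic<fund>() must be a constant
 // expression, so if the compiler could not evaluate it at compile time
 // the code would fail to compile rather than silently running the
 // recursion at runtime.
 constexpr pitch_t fund = 4400000;
 double harmonicWeights[maxHarmonic<fund>()];
 static_assert(maxHarmonic<fund>() > 0, "must include at least one harmonic");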