User Injected High Speed Benchmarking In C

The timings for some real world code (anonymised).
The profiler shows the relative time of different code
segments.
Profiling tools a great - but oh so very often I find myself in situations where they cannot be used or they are not a perfect fit. I light weight, very high speed, benchmarking system which can be injects by hand wherever it is required would be really handy - here is one!

The profiler is a simple piece of C. As I have written it here, it uses static state and so is not OK for multiple threads. This can be fixed easily enough by replacing the statics with a structure which is passed around. However, for the example below and for my current project, this static is enough. Also please note that this implementation is very much a Windows implementation. It should not be too hard to port though.

What it does is measure the time taken between specific points in the code as well as the number of times that pair of points in that order are executed. This is very different from a typical function based profiler but is more appropriate for hand profiling. Simply by placing PROFILE_POINT in two places in your code, it will measure the cumulative time (in seconds) that the code took between those two points.

The implementation is a pains to be quick and light weight and the timing system tries to ignore the time taken by the profiler itself. However, placing the PROFILE_POINT inside a hot loop is bound to have and effect on the program performance. Not only will the calls to the profiler take some time, the vary fact they are there and they call a kernel function will have an impact on the optimization strategy of the compiler. Therefore, this profiler, just like any other which uses injected code, will never give a completely 'true' picture of what is going on time wise on a running program. However, it can get close if used around rather than inside hot loops.

Here is the output from the tests in the main below:

From Point,To Point,Hits,Total Time (seconds)
Profiler.cpp252,Profiler.cpp254,10,0.999603
Profiler.cpp254,Profiler.cpp256,10,1.00002
Profiler.cpp256,Profiler.cpp258,10,1.00003
Profiler.cpp258,Profiler.cpp252,9,1.23185e-005
From Point,To Point,Hits,Total Time (seconds)
Profiler.cpp263,Profiler.cpp265,10,0.999644
Profiler.cpp265,Profiler.cpp267,10,1.00003
Profiler.cpp267,Profiler.cpp269,10,1.00003
Profiler.cpp269,Profiler.cpp263,9,1.3345e-005
From Point,To Point,Hits,Total Time (seconds)
Profiler.cpp274,Profiler.cpp276,1,3.07962e-006
Profiler.cpp276,Profiler.cpp276,999999,0.259093
Profiler.cpp276,Profiler.cpp278,1,5.1327e-007

The output (which I have directed to standard out, but could be sent anywhere) is a comma separated value dataset. This allows it to be imported into any spreadsheet for analysis. For example, the pie chart above was created in Excel by using the 'text import wizard' under the Paste ribbon tab.

How Does I Work?
Tries are a highly efficient data structure for storing information keys by byte sets where the sets may well share a common root. In my implementation I have created a 256 way trie. This means that each node in the try has 256 children. The implementation is thus UTF8 ignorant in that it will just work with UFT8 or any other 8 bit encoding where \0 is still the string terminator. (I did try using a 16 way trie where two nodes where used for each encoding of data but the performance overhead was large.) This is a compromise which consumes a reasonable amount of memory which seems OK for large memory systems like those on which I work most of the time. For example, 1000 profile points with 10 bytes differing in the name and line number between them (on average) will consume 1000 * 10 * 256 bytes in the try which is 2.5 megabytes - which is large but not huge by modern standards.

So what are these keys? 
#define STRINGIFY(message_) #message_
#define TOSTRING(message_) STRINGIFY(message_)
#define PROFILE_POINT (profilerPoint( __FILE__  TOSTRING(__LINE__) ))

This macro code takes the current file and line number and creates a string constant like "Profiler.cpp276". These being string constants means that pointer to them remain valid. The code relies on this to create one trie for all the PROFILE_POINTs which are met in the code. Each time a PROFILE_TIME is met, the time from the high resolution timer. The timer is the TSC time-stamp-counter register on the intel chip (so this has to be Intel unless you use a different timing source).

Problems with TSC:
  1. It gives clock cycles passed, not clock-cycles consumed. Consequently, the profile take no account at all of multi-tasking pre-empting by the OS.
  2. It is not consistent between cores and so odd result might occur if the process migrate to a different core during the run. Newer chips are consistent between cores, but not between sockets. 
  3. If the CPU speed-steps all bets are off. You must lock the CPU to maximum before doing any bench marks.
  4. It is not completely trivial to convert from it to seconds.
  5. As far as I know - you need some assembler to get at it in gcc.
inline volatile long long RDTSC() {
   register long long TSC asm("eax");
   asm volatile (".byte 15, 49" : : : "eax", "edx");
   return TSC;


See: http://www.flipcode.com/archives/How_To_Read_The_TSC_With_GCC.shtml

A call to PROFILE_POINT works like this:

  1. Record TSC.
  2. Find or create a new node in the profile point tire for the point.
  3. Find or create a new record on the previous node (which is a linked list of records) which links to this profile point's node.
  4. Update that record by increasing the hit count and incrementing the clock cycle count by the difference between the last check and the recorded TSC value.
  5. Record TSC for the next store.


Note that we record TSC at the start and end of the profile function to remove as much as possible of the cost of profiling from the number of cycles recorded.

Using The Profiler:
There are two approached (which can be combined) to getting a profile. The simplest is to place PROFILE_POINT in several places in your code by hand. It is a good idea to use a #define or compiler define to turn profiling on and off so release code does not run through the compiler!

The other alternative it to target an already common macro in your code. Some code might have macros for error handing, calling another library etc. These marcos can have PROFILE_POINT placed inside them and the profiling happens automatically. Again, best to be able to turn it off at compile time.


// Code to turn on or off PROFILE_POINT
#ifdef PROFILE
#define PROFILE_POINT (profilerPoint( __FILE__  TOSTRING(__LINE__) ))
#else
#define PROFILE_POINT NULL
#endif



  1. DO NOT PROFILE DEBUG CODE.
  2. RUN PROFILES SEVERAL TIMES TO MAKE SURE THE RESULTS ARE 'REAL'.
  3. Then, place your profile into Excel or some other tool and find out where what is taking the most time.
  4. Look in detail at the code path joining the to PROFILE_POINTs.
  5. If it is not obvious what is taking the time, add some more PROFILE_POINTs!
  6. Find the expensive stuff and fix it.


Anyhow - here is the code. It is rough and the output is primitive, but it makes a starting point for creating something better:

/*
Copyright (c) <2012>, <Dr Alexander J Turner>
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met: 

1. Redistributions of source code must retain the above copyright notice, this
   list of conditions and the following disclaimer. 
2. Redistributions in binary form must reproduce the above copyright notice,
   this list of conditions and the following disclaimer in the documentation
   and/or other materials provided with the distribution. 

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

The views and conclusions contained in the software and documentation are those
of the authors and should not be interpreted as representing official policies, 
either expressed or implied, of the FreeBSD Project.
*/

>----- HEADER -----<

#include "Windows.h"
#include <malloc.h>
#include <stdio.h>

#define CounterType         LARGE_INTEGER
#define CounterType_internal long long

#define STRINGIFY(message_) #message_
#define TOSTRING(message_) STRINGIFY(message_)
#define PROFILE_POINT (profilerPoint( __FILE__  TOSTRING(__LINE__) ))

void profilerPoint(char*);
void profilerDump();
void profilerRestart();<

>----- DEFINITIONS -----<

#define PROFILE

#define STRINGIFY(message_) #message_
#define TOSTRING(message_) STRINGIFY(message_)
#ifdef PROFILE
#define PROFILE_POINT (profilerPoint( __FILE__  TOSTRING(__LINE__) ))
#else
#define PROFILE_POINT NULL
#endif
void profilerPoint(char*);
void profilerDump();
void profilerRestart();

#define PROFILER_HEADER__
#endif
/*
This is a derived work - the original work  was copyright:

Copyright (c) <2012>, <Dr Alexander J Turner>
All rights reserved. Use granted to GE under 
a FreeBSD license - see www.nerds-central.com
for details.
*/


#pragma intrinsic(__rdtsc)

#define  CounterType      unsigned __int64 
#define PROFILER_BLOCK_SIZE 65536
#define PROFILER_KEY_LEN 256
/* We use a localised memory allocation scheme where the multiple
   small allocations required for the profiler are grouped into
   larger block allocations from malloc. This helps with heap localicallity
   and hence caching etc.

   The allocated blocks are stored in a linked list pointing to earlier allocated
   blocks. To free all the memory all that is required it to work up the chain
   freeing until the previoud pointer is NULL.
 */
typedef struct profilerBlock_ {
  char block[PROFILER_BLOCK_SIZE];
  struct profilerBlock_ *previous;
} profilerBlock;

static profilerBlock *profilerCurrentBlock=NULL;
static size_t     profilerCurrentTop=PROFILER_BLOCK_SIZE;
static void* blockAlloc(size_t howMuch){
  if(howMuch+profilerCurrentTop>=PROFILER_BLOCK_SIZE){
    // allocate
    profilerBlock *old=profilerCurrentBlock;
    profilerCurrentBlock=(profilerBlock *)malloc(sizeof(profilerBlock));
    if(profilerCurrentBlock==NULL)abort();
    profilerCurrentBlock->previous=old;
    profilerCurrentTop=0;
  }
  {
    void *ret = profilerCurrentBlock->block + profilerCurrentTop;
    profilerCurrentTop+=howMuch;
    return ret;
  }
}

typedef struct ProfilerNode_ {
  struct ProfilerNode_ *kids[PROFILER_KEY_LEN];
  char      *name;
  void      *recordChain;
  char      filler[7]; // allignment
} ProfilerNode;

typedef struct ProfileRecord_ {
  ProfilerNode      *toNode;
  struct ProfileRecord_ *next;
  CounterType       cumulativeCount;
  long          hitCount;
} ProfileRecord;

static ProfilerNode *newProfilerNode(){
  ProfilerNode *ret =(ProfilerNode*)blockAlloc(sizeof(ProfilerNode));
  ProfilerNode n={
    {
      NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,
      NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,
      NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,
      NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,

      NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,
      NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,
      NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,
      NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,

      NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,
      NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,
      NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,
      NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,

      NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,
      NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,
      NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,
      NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL
    },
    "",
    NULL
  };
  *ret=n;
  return ret;
}

static ProfilerNode *findProfileNode(char *expanded,char* name,ProfilerNode *root){
  for(;;){
    if(*expanded==(char)0x0){
      // at end of that for which we seek
      return root;
    }else{
      if(root->kids[*expanded]==NULL){
        // Make a new one and then go to it
        ProfilerNode *newNode;
        newNode=root->kids[*expanded]=newProfilerNode();
        newNode->name=name;
      }
      root=root->kids[*expanded];
      expanded=expanded+1;
    }
  }
}

// Pointers in here screw with your head - but it is efficient.
static ProfileRecord *findProfileRecord(ProfileRecord **chain,ProfilerNode *toNode){
  for(;;){
    if(*chain==NULL){
      // make a new record
      ProfileRecord n = { 
        toNode,
        NULL,
        0,
        0
      };
      *chain=(ProfileRecord *)blockAlloc(sizeof(ProfileRecord));
      **chain=n;
      return *chain;
    }else if((*chain)->toNode == toNode){
      return *chain;
    }else{
      chain=&((*chain)->next);
    }
  }
}

static ProfilerNode *profilerSetDifference(char* message,ProfilerNode *fromNode, CounterType difference, ProfilerNode *profilerRootNode){
  // Go down to and find or create the node
  ProfileRecord *chain;
  ProfilerNode  *toNode  =findProfileNode(message,message,profilerRootNode);
  // Follow the record chain and update or create the record
  chain=findProfileRecord((ProfileRecord **)&(fromNode->recordChain),toNode);
  chain->cumulativeCount+=difference;
  ++(chain->hitCount);
  return toNode;
}

// Note that this code assumes the message will not go away
// this works because it should come from the marco which will make
// a constant.
static ProfilerNode *profilerRootNode = NULL;
static ProfilerNode *profilerLastNode = NULL;
void profilerPoint(char* message){
  static CounterType  lastCounter;
  CounterType     nowCounter;
  // Check time before searching for and setting records

  nowCounter=__rdtsc();
  if(profilerLastNode==NULL){
    // Initialise a root node
    profilerRootNode=newProfilerNode();
    profilerLastNode=profilerRootNode;
    profilerLastNode->name="ROOT";
    lastCounter=nowCounter;
  }
  profilerLastNode = profilerSetDifference(
    message,
    profilerLastNode,nowCounter-lastCounter,
    profilerRootNode
  );
  // Set time after all work done
  lastCounter=__rdtsc();
  // alternative for benching the profiler
  //lastCounter=nowCounter; 

  return; 
}

static void dumpProfiler_internal(ProfilerNode *node,double divisor){
  {
    long i;
    for(i=0;i<PROFILER_KEY_LEN;++i){
      ProfilerNode *thisNode = node->kids[i];
      if(thisNode==NULL)continue;
      if(!strcmp("ROOT",thisNode->name))continue;
      {
        ProfileRecord *thisRecord=(ProfileRecord *)thisNode->recordChain;
        while(thisRecord!=NULL){
          printf("%s,%s,%d,%g\n",
            thisNode->name,
            thisRecord->toNode->name,
            thisRecord->hitCount,
            ((double)thisRecord->cumulativeCount)/divisor
          );
          thisRecord=thisRecord->next;
        }
      }
      dumpProfiler_internal(thisNode,divisor);
    }
  }
}

void profilerDump(){
  double divisor;
  __int64 t1=__rdtsc();
  { // Not good but good enough
    int i;
    for(i=0;i<5;++i){
      Sleep(200);
    }
  }
  divisor=(double)(__rdtsc()-t1);
  if(profilerLastNode==NULL){
    printf("No Profiler Information Recorded \n");
  }else{
    printf("From Point,To Point,Hits,Total Time (seconds)\n");
    dumpProfiler_internal(profilerRootNode,divisor);
  }
  fflush(stdout);
}

void profilerRestart(){
  while(profilerCurrentBlock!=NULL){
    profilerBlock *block=profilerCurrentBlock->previous;
    free(profilerCurrentBlock);
    profilerCurrentBlock=block;
  }
  profilerRootNode = NULL;
  profilerLastNode = NULL;
  profilerCurrentTop=PROFILER_BLOCK_SIZE;
}


int _tmain(int argc, _TCHAR* argv[])
{
  for(long i=0;i<10;++i){
    PROFILE_POINT;
    Sleep(100);
    PROFILE_POINT;
    Sleep(100);
    PROFILE_POINT;
    Sleep(100);
    PROFILE_POINT;
  }
  profilerDump();
  profilerRestart();
  for(long i=0;i<10;++i){
    PROFILE_POINT;
    Sleep(100);
    PROFILE_POINT;
    Sleep(100);
    PROFILE_POINT;
    Sleep(100);
    PROFILE_POINT;
  }
  profilerDump();
  profilerRestart();

  PROFILE_POINT;
  for(long i=0;i<1000000;++i){
      PROFILE_POINT;
  }
  PROFILE_POINT;
  profilerDump();
  return 0;
}