shared_ptr performance issues and workaround

Beautiful things are often simple things.
Copyright Dr Alexander J Turner
C++11 is, as I have already said far too many times, brilliant. One thing it brings into the standard is smart pointers (formerly of Boost). But there are performance problems which, with care, can be worked around.

Library writers face a constant challenge: they have to write code that always works; the corner cases really matter. However, this can make for inefficient code. Sometimes knowing it is safe to ignore a possible usage situation can pay huge dividends in performance; shared_ptr is a perfect example. shared_ptr is a reference counted pointer. My post here is a simple example of why reference counted pointers are really handy. They act as a garbage collection system; alternatively, we can view them as an extension to RAII where we can distribute ownership of dynamically created objects.

The snag with shared_ptr is that the implementation has to be thread safe. It either has to use mutexes or CAS (compare and swap) to ensure that the reference count has the same value across all threads which might have access to the pointer. My understanding is that the VC++11 implementation uses CAS on the Intel platform, so that is what I am going to talk about from here on (further, it is my understanding that CAS dominates the performance of shared_ptr in VC++11).

Compare and swap has to synchronise memory between multiple processors/cores in hardware. This can require flushing cache lines and dropping instruction pipelines [see here]. So, what impact does it really have, and how can we avoid it?

The Meat Of The Post:
The simplest way to avoid the performance cost of shared_ptr is not to use it. If we know that the owner of an object holds a shared_ptr somewhere further up the stack then we can use references to it, or raw pointers, to avoid the overhead. Hang on though - that means we have used stack locality to make assumptions about pointers. We have used it to say 'we are safe not to use reference counting because we have something above us on the stack holding the pointer'.

That is not thread safe. If we were to pass our reference or raw pointer to another thread (by handing it to an existing thread or spawning a new one) then the owning stack could unwind and leave our references or raws hanging. Such a mistake should not only cause hanging pointers but should also involve hanging developers!

What have we just said? We have said that to avoid the cost of shared_ptr we can use references or raws if and only if we know they will not escape the owning thread. If we are sure of that - then we do not need CAS and so could still use reference counting if we so wished.

There is a subset of situations where it would be really good, safe and clean to use reference counting but we know the counted references will not escape the owning thread. In these cases we do not need CAS. shared_ptr is simply overkill.

A non thread safe reference counted pointer:
Many moons ago I wrote a ground-up implementation of a thread-safe counted pointer. That is something I can hardly believe now that we have C++11, but I did it. The Windows version used InterlockedIncrement and InterlockedDecrement to make it thread safe. I have taken this old code and simply ripped out the thread safety, giving this class (note this was part of my VSPL research project - hence the names):

  typedef long long VSPL_INT ;
  typedef bool      VSPL_BOOL ;

  template <class T>
   class counted_ptr
   {
   private:
     VSPL_INT* count;
     T* ptr;
     VSPL_BOOL useCount;
   public:
     explicit counted_ptr(T* p=0)
       :ptr(p), count(p!=0?new VSPL_INT(1):0),useCount(p!=0){}

     counted_ptr(T* p,VSPL_BOOL permanent)
       :ptr(p), count(permanent?0:new VSPL_INT(1)),useCount(!permanent){}

     counted_ptr(const counted_ptr<T>& p) throw()
       :    ptr(p.ptr),count(p.count),useCount(p.useCount)
     {if(useCount)++*count;}

     ~counted_ptr() throw()
     {dispose();}

     counted_ptr<T>& operator=(const counted_ptr<T>& p) throw()
     {
       if(this != &p){
         dispose();
         ptr=p.ptr;
         count=p.count;
         useCount=p.useCount;
         if(useCount) ++*count;
       }
       return *this;
     }

     T& operator*() const throw()
     {return *ptr;}

     T* operator->() const throw()
     {return ptr;}

     VSPL_BOOL IsNull()
     { return ptr==0; }

   private:
     void dispose()
     {
       if(useCount)
       {
         if(--*count == 0)
         {
           delete count;
           delete ptr;
         }
       }
     }
   };

This implementation is a simple counted pointer with performance gains from not counting NULL pointers and from allowing non-counted (permanent) pointers. With this, and a bit of code I wrote which times filling and emptying vectors of shared_ptr or counted_ptr to long long, we can start to look at the impact of CAS and how much performance there is to be gained by not using it:

#include "stdafx.h"
#include <vector>
#include <memory>
#include <chrono>
#include <iostream>
#include <functional>
#include <thread>
#include <algorithm>

using std::vector;
using std::shared_ptr;
using std::function;
using std::thread;
using std::for_each;

namespace nerdscentral{

  ... see above ...

  typedef long long Long;
  typedef shared_ptr<Long> GLong;
  //typedef counted_ptr<Long> GLong;
  typedef std::chrono::high_resolution_clock Clock;
  // Note: duration<double, std::micro> is actually microseconds despite
  // the name, so the timings printed below are in microseconds.
  typedef std::chrono::duration<double, std::micro> Milliseconds;

  class ReferenceCounter{
  private:
    vector<GLong> refStore;
  public:
    Long getNRef(){
      return refStore.size();
    }

    ReferenceCounter* flush(){
      refStore.clear();
      return this;
    }

    ReferenceCounter* append(GLong item){
      refStore.insert(refStore.end(),item);
      return this;
    }
  };

  void timeRun(Long count,function<void()> toRun){
    auto t1 = Clock::now();
    for(;count>0;--count)toRun();
    auto t2 = Clock::now();
    // Convert the elapsed clock ticks into the floating point unit above.
    Milliseconds elapsed = t2-t1;
    std::cout << elapsed.count() << '\n';
  }

  void performRun(){
    auto oneRun =[](){
      ReferenceCounter counter;
      for(;counter.getNRef()<100;){
        counter.append(GLong(new Long(counter.getNRef())));
      }
      counter.flush();
    };
    timeRun(1000,oneRun);
  }
}
int _tmain(int argc, _TCHAR* argv[]){
  function<void()> run = []{  for(long c=0;c<10;++c)nerdscentral::performRun(); };
  vector<shared_ptr<thread>> threads;
  for(long i=0;i<1;++i){ // change the limit to 8 for the 8-thread run
    threads.insert(threads.end(),shared_ptr<thread>(new thread(run)));
  }
  for_each(threads.begin(),threads.end(),[](shared_ptr<thread> t){t->join();});
  return 0;
}

By flipping the typedef for GLong I can switch it between shared_ptr and counted_ptr. The vectors of these pointers are local to the threads created in the main method, which means that either pointer implementation is equally valid. So how much does shared_ptr cost compared to counted_ptr?

[Chart: the program ran 2.3 times slower with shared_ptr over 8 threads.]

[Chart: the output of two runs (one shared, one counted) plotted against one another.]

Seriously, shared_ptr completely dominated performance when using 8 threads [I ran the test on an Intel i5 laptop in performance mode, compiled for release x86_64 on Windows Vista using the VC++11 beta]. Even with just one thread shared_ptr was considerably slower:
Counted:
18001.1
16000.9
16000.9
17001
16000.9
14000.8
14000.8
15000.9
14000.8
15000.9
Shared:
25001.4
24001.3
24001.3
24001.4
23001.3
26001.5
22001.3
23001.4
24001.4
23001.3

For the 8-thread version the means of the two runs differed by a factor of 2.3 in favour of counted_ptr. Thread safety comes at a huge cost. In this situation the cost of shared_ptr completely dominates the code.

Conclusion:
shared_ptr is just fantastic. It can be relied upon and simply used. Nevertheless, when performance really counts (think high-performance, low-latency work such as trading or gaming) it has problems. In these cases, a (better) implementation of something like counted_ptr could make a really good alternative where references or raws are not appropriate.