C++ Boost

The Boost Statechart Library

Performance


Speed versus scalability tradeoffs
Memory management customization
RTTI customization
Resource usage

Speed versus scalability tradeoffs

Quite a bit of effort has gone into making the library fast for small simple machines and scalable at the same time (this applies only to state_machine<>; there is still some room for optimizing fifo_scheduler<>, especially for multi-threaded builds). While I believe it should perform reasonably in most applications, the scalability does not come for free. Small, carefully handcrafted state machines will thus easily outperform equivalent Boost.Statechart machines. To get a picture of how big the gap is, I implemented a simple benchmark in the BitMachine example. The Handcrafted example is a handcrafted variant of the 1-bit-BitMachine implementing the same benchmark.

I tried to create a fair but somewhat unrealistic worst-case scenario:

The benchmarks, run on a 3.2GHz Intel Pentium 4, produced the following dispatch and transition times per event:

Although this is a big difference, I still think it will not be noticeable in most real-world applications. No matter whether an application uses handcrafted or Boost.Statechart machines, it will...

Therefore, in real-world applications event dispatch and transition does not normally constitute a bottleneck, and the relative gap between handcrafted and Boost.Statechart machines also becomes much smaller than in the worst-case scenario.

Detailed performance data

In an effort to identify the main performance bottleneck, the example program "Performance" has been written. It measures the time spent processing one event in different BitMachine variants. In contrast to the BitMachine example, which uses only transitions, Performance uses a varying number of in-state reactions together with transitions. The only difference between in-state reactions and transitions is that the former neither enter nor exit any states. Apart from that, the same amount of code needs to be run to dispatch an event and execute the resulting action.

The following diagrams show the average time the library spends to process one event, depending on the percentage of in-state reactions employed. 0% means that all triggered reactions are transitions. 100% means that all triggered reactions are in-state reactions. I draw the following conclusions from these measurements:

Out of the box

The library is used as is, without any optimizations/modifications.

[Charts: PerformanceNormal1 through PerformanceNormal4]

Native RTTI

The library is used with BOOST_STATECHART_USE_NATIVE_RTTI defined.

[Charts: PerformanceNative1 through PerformanceNative4]

Customized memory-management

The library is used with customized memory management (boost::fast_pool_allocator).

[Charts: PerformanceCustom1 through PerformanceCustom4]

Double dispatch

At the heart of every state machine lies an implementation of double dispatch, because the incoming event and the active state together determine exactly which reaction the state machine will produce. For each event dispatch, one virtual call is followed by a linear search for the appropriate reaction, using one RTTI comparison per reaction. The following alternatives were considered but rejected:

Memory management customization

Out of the box, everything (event objects, state objects, internal data, etc.) is allocated through std::allocator< void > (the default for the Allocator template parameter). This should be satisfactory for applications meeting the following prerequisites:

Should an application not meet these prerequisites, Boost.Statechart's memory management can be customized as follows:
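As a rough configuration sketch (assuming Boost is available), one customization point already mentioned above is the Allocator template parameter, which replaces std::allocator< void > for the machine's internal allocations, for example with boost::fast_pool_allocator:

```cpp
// Configuration sketch: pass a custom allocator as the third template
// parameter of state_machine<> so that events, states and internal data
// are allocated through a pool instead of the default std::allocator.
#include <boost/statechart/state_machine.hpp>
#include <boost/statechart/simple_state.hpp>
#include <boost/pool/pool_alloc.hpp>

namespace sc = boost::statechart;

struct Idle; // forward declaration of the initial state

struct Machine : sc::state_machine<
  Machine, Idle, boost::fast_pool_allocator< int > > {};

struct Idle : sc::simple_state< Idle, Machine > {};
```

The "Customized memory-management" measurements above were produced with exactly this kind of setup.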

RTTI customization

RTTI is used for event dispatch and state_downcast<>(). Currently, there are exactly two options:

  1. By default, a speed-optimized internal implementation is employed
  2. The library can be instructed to use native C++ RTTI instead by defining BOOST_STATECHART_USE_NATIVE_RTTI
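Option 2 amounts to a one-line configuration switch (sketch, assuming Boost headers are available):

```cpp
// Configuration sketch: BOOST_STATECHART_USE_NATIVE_RTTI must be
// defined before including any library headers, so that event dispatch
// and state_downcast<>() use native C++ RTTI (typeid/dynamic_cast)
// instead of the speed-optimized internal implementation.
#define BOOST_STATECHART_USE_NATIVE_RTTI

#include <boost/statechart/state_machine.hpp>
#include <boost/statechart/simple_state.hpp>
```

In practice the macro is usually defined project-wide (e.g. on the compiler command line), since it must be defined consistently in all translation units that use the library.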

There are two reasons to favor option 2:

Resource usage

Memory

On a 32-bit box, one empty active state typically needs less than 50 bytes of memory. Even very complex machines will usually have fewer than 20 simultaneously active states, so just about every machine should run with less than one kilobyte of memory (not counting event queues). Obviously, the per-machine memory footprint is offset by whatever state-local members the user adds.

Processor cycles

The following ranking should give a rough picture of what feature will consume how many cycles:

  1. state_cast<>(): By far the most cycle-consuming feature. Searches linearly for a suitable state, using one dynamic_cast per visited state
  2. State entry and exit: Profiling of the fully optimized 1-bit-BitMachine suggested that roughly three quarters of the total event processing time is spent destructing the exited state and constructing the entered state. Obviously, transitions where the innermost common context is "far" from the leaf states and/or with lots of orthogonal states can easily cause the destruction and construction of quite a few states, leading to a significant amount of time spent on a transition
  3. state_downcast<>(): Searches linearly for the requested state, using one virtual call and one RTTI comparison per visited state
  4. Deep history: For all innermost states inside a state passing either has_deep_history or has_full_history to its state base class, a binary search through the (usually small) history map must be performed on each exit. History slot allocation is performed exactly once, at first exit
  5. Shallow history: For all direct inner states of a state passing either has_shallow_history or has_full_history to its state base class, a binary search through the (usually small) history map must be performed on each exit. History slot allocation is performed exactly once, at first exit
  6. Event dispatch: One virtual call followed by a linear search for a suitable reaction, using one RTTI comparison per visited reaction
  7. Orthogonal states: One additional virtual call for each exited state if there is more than one active leaf state before a transition. It should also be noted that the worst-case event dispatch time is multiplied in the presence of orthogonal states. For example, if two orthogonal leaf states are added to a given state configuration, the worst-case time is tripled


Revised 03 December, 2006

Copyright © 2003-2006 Andreas Huber Dönni

Distributed under the Boost Software License, Version 1.0. (See accompanying file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)