How Double Dispatch Supercharged My Library's Performance

During the design phase of Levenshtypo I ran into a familiar dilemma which many library authors face:

How do I balance API design, future-compatibility and performance?

The tension results from opposing forces. As a library author I want to keep the API simple enough to encourage adoption while still exposing enough bells and whistles for power users, and I want to give myself room to grow by abstracting the internals, all without tanking performance.

In this article I explain how generics and double dispatch allowed me to have my cake and eat it too.


Context


Levenshtypo exposes two main abstractions - the Levenshtein Automaton and the Trie, detailed here. Briefly, you can think of a trie as a mapping from strings to values, and an automaton as a schematic for which strings to find.

As an example a trie might represent the mapping { "where" : 1, "there": 2, "here": 3, "what": 4 }, and an automaton might represent the query "words similar to 'there'". If you were to execute the automaton against the trie, you'd get [1, 2, 3].

Tries are usually searched via Depth-First Search (DFS). DFS can often be implemented recursively such that many of the branching decisions can be stored on the stack, allowing optimized implementations to avoid heap allocations.
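
To make this concrete, below is a minimal sketch of such a recursive traversal. The TrieNode type is a simplified illustration for this article, not Levenshtypo's actual trie node.

// Simplified, illustrative trie node; not Levenshtypo's implementation.
public sealed class TrieNode
{
    public Dictionary<char, TrieNode> Children { get; } = new();
    public bool IsWord { get; set; }

    // Recursive DFS: counts every word stored at or beneath this node.
    // The traversal state lives entirely in locals and on the call stack,
    // so the walk itself performs no heap allocations.
    public int CountWords()
    {
        int count = IsWord ? 1 : 0;

        foreach (var child in Children.Values)
        {
            count += child.CountWords();
        }

        return count;
    }
}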


Goals


While the trie and the automaton implementations in Levenshtypo were literally made for each other, there are many use-cases where each could be used independently.

Tries are great for searching strings by prefix. Instead of limiting the Levenshtrie to fuzzy search, Levenshtypo should allow advanced users to build their own queries e.g. "5 letter words starting with 'tri'".

The in-built automata should also be entirely usable with any data structure, not only the aforementioned trie. For best adoption Levenshtypo should cater for developers who have already built their own trie, perhaps with features such as external storage support, which Levenshtypo itself does not offer.

Neither of these advanced use-cases should come with any performance hits.

The best implementation of a Levenshtein Automaton must be chosen dynamically depending on the pattern and the edit distance. This provides the best performance for the widest variety of scenarios, and opens the possibility to improve the library in the future by adding newer, better implementations. This is optional reading, but if you want to know more, see my article on how Bitwise Parallel automata compare to Parameterized Levenshtein Automata.


Boxed In: Structs and Interfaces


The following simple interface describes the conceptual contract between Tries and Automata:

public interface IAutomatonState
{
    bool MoveNext(char c, out IAutomatonState next);

    bool IsFinal { get; }
}

A Trie Node finds matching child nodes by consulting IAutomatonState.MoveNext. If IAutomatonState.IsFinal is true, then the automaton matches the word formed by stringing together all characters from the root to the current node.

Note that MoveNext returns a new state (next) which represents the state obtained by following the character. This allows a node to attempt MoveNext across many children, branching into the next state for each matching child.

While Levenshtypo is designed for Fuzzy Search, the example below keeps things simple. It illustrates an automaton which matches all words of a specified length (counter).

public class LengthAutomatonState(int counter) : IAutomatonState
{
    public bool MoveNext(char c, out IAutomatonState next)
    {
        if (counter > 0)
        {
            next = new LengthAutomatonState(counter - 1);
            return true;
        }

        // When this method returns false, the next state shouldn't be used.
        next = default!;
        return false;
    }


    public bool IsFinal => counter == 0;
}

In the LengthAutomatonState, each call to MoveNext returns a new instance with a decremented counter.

LengthAutomatonState is typical in the sense that it has a small internal state which could easily be wrapped in a struct. This, combined with the recursive DFS algorithm, makes it a perfect candidate to be a struct.
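
For context, the class-based variant in the benchmark below corresponds roughly to a traversal like this sketch, which reuses the illustrative TrieNode from earlier and is not Levenshtypo's actual code.

// Illustrative class-based traversal: every MoveNext allocates a new state object.
public static int CountMatches(TrieNode node, IAutomatonState state)
{
    int count = (state.IsFinal && node.IsWord) ? 1 : 0;

    foreach (var (character, child) in node.Children)
    {
        // Each matching child branches into its own next state.
        if (state.MoveNext(character, out IAutomatonState nextState))
        {
            count += CountMatches(child, nextState);
        }
    }

    return count;
}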

Benchmarking the above automaton against its struct equivalent produces the following results:

Method | Mean      | Gen0     | Allocated
Class  | 12.598 ms | 218.7500 | 3913912 B
Struct |  7.349 ms | -        | 88 B

The benchmark shows a great performance improvement when opting for a struct; the garbage collector simply has much less to do than in the class variant.

Unfortunately, though, changing the LengthAutomatonState into a struct isn't as easy as it seems...


Struct-ification


.NET can only keep a struct on the stack (or in registers) when its concrete type is known at JIT compile time.

For example if an int is cast to an object, then the JIT compiler treats it like it would any other object - as a class. To achieve this, the int is wrapped in a "box" and allocated onto the heap, which makes it look like a class. This process is unsurprisingly called "boxing".

The same boxing process happens if a struct is assigned to an interface; interface variables always reference the heap, so if the object is a struct then it must first be boxed.
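
A tiny, self-contained example makes the boxing visible:

int number = 42;                      // a plain value: no heap allocation
object boxedObject = number;          // boxing: a heap object is allocated to hold the 42
IComparable boxedInterface = number;  // the same boxing happens when assigning to an interface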

If we were to naively change the LengthAutomatonState from a class to a struct, it would look something like the following.

public struct LengthAutomatonStateStruct(int charsRemaining) : IAutomatonState
{
    public bool MoveNext(char c, out IAutomatonState next)
    {
        ...
    }


    public bool IsFinal => ...;
}

public class Trie
{
   public int CountMatches(IAutomatonState state) => ...;
}

In an attempt to be re-usable with different automaton implementations, the Trie exposes a method which accepts the IAutomatonState interface. However, in doing so it forces any struct passed as a parameter to be boxed immediately, since interface-typed variables must always be managed references to objects on the heap. This naive change hasn't actually improved anything.

Generic Method Specialization

The JIT has a quirk which is especially powerful when optimizing performance.

When a generic method is called, the JIT must determine which machine code to execute. The JIT will generate the code if necessary, but it will re-use already-generated machine code if the generic parameters are compatible. To determine compatibility, the generic parameters must be equal after replacing class- and interface-type parameters with the System.__Canon placeholder. Classes and interfaces can share the same machine code because variables of those types are all managed references, 8 bytes long on a 64-bit machine.

The same is not true for structs. A struct may require 16, 32 or any number of bytes. Also, the garbage collector must keep track of the managed references within the struct, so the number of references as well as their exact in-memory bitwise layout matters to maintain memory safety.

Therefore when invoking a generic method with a struct type parameter, the machine code generated by the JIT compiler will always be hardcoded and optimized specifically for that struct parameter.
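
As a quick illustration (the type names here are arbitrary):

static void Describe<T>(T value) => Console.WriteLine(typeof(T));

Describe("hello");      // reference type: runs the shared __Canon machine code
Describe(new object()); // re-uses that same shared machine code
Describe(42);           // struct: gets machine code specialized for int
Describe(3.14);         // struct: gets a separate body specialized for double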

For a great example to highlight how useful this can be for maximising performance, look no further than the core .NET library itself. In System.Private.CoreLib there exists the following code.

internal interface IRuntimeConst
{
    static abstract bool Value { get; }
}

internal readonly struct TrueConst : IRuntimeConst
{
    public static bool Value => true;
}

internal readonly struct FalseConst : IRuntimeConst
{
    public static bool Value => false;
}

The code is used in the performance-critical class SingleStringSearchValuesFallback. The class must make either case-sensitive or case-insensitive comparisons. Compare the implementations below (which have been heavily trimmed for this example):

// Unoptimized
internal sealed class SingleStringSearchValuesFallback(string value, bool ignoreCase)
{
    int IndexOf(ReadOnlySpan<char> span) =>
        ignoreCase
            ? IndexOf_CaseInsensitive(span, _value)
            : IndexOf_CaseSensitive(span, _value);
}

// Real
internal sealed class SingleStringSearchValuesFallback<TIgnoreCase>(string value)
    where TIgnoreCase : struct, SearchValues.IRuntimeConst
{
    int IndexOf(ReadOnlySpan<char> span) =>
        TIgnoreCase.Value // 👈 Branch based on the static readonly property in the generic parameter.
            ? IndexOf_CaseInsensitive(span, _value)
            : IndexOf_CaseSensitive(span, _value);
}

In the unoptimized version, every call to IndexOf tests the value of ignoreCase to choose a branch, as you would see in almost all code ever written.

In the optimized version, however, the code is written in a much more complex way, making use of the IRuntimeConst interface in the pursuit of ever-increasing performance gains. The code author relies on the JIT compiler generating different code for struct TrueConst and struct FalseConst, since generic method code is not re-used across struct type parameters.

When generating the code specialized for TrueConst, the JIT can resolve TIgnoreCase.Value as always being true, because it is effectively a constant: a static property which always returns the same value. It can therefore deduce that the first branch will always be chosen, compiling down to a direct call to IndexOf_CaseInsensitive(span, _value) without any branching. The same reasoning applies if the generic parameter were FalseConst, which would also result in no branching, directly calling IndexOf_CaseSensitive(span, _value).
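
The runtime bool is consulted only once, when the searcher is constructed. Paraphrased (this is not the exact CoreLib source, and the return type is simplified), the selection looks something like this:

// Paraphrased selection logic; in the real code both specializations share a
// common base class, but object keeps this sketch self-contained.
static object CreateSearcher(string value, bool ignoreCase) =>
    ignoreCase
        ? new SingleStringSearchValuesFallback<TrueConst>(value)
        : new SingleStringSearchValuesFallback<FalseConst>(value);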

Back to Automata

This quirk of generic method specialization can be used in our scenario, too.

public int CountMatches<TAutomatonState>(TAutomatonState state)
    where TAutomatonState : IAutomatonState
{
    // Some code goes here

    // This is a recursive method, so at some point it does the following:
    if (state.MoveNext(c, out IAutomatonState nextState))
    {
        CountMatches(nextState);
    }

    // Some more code goes here
}

The CountMatches method is now generic and the code will be specialized to whatever TAutomatonState is. Note that there's no need to constrain it to where TAutomatonState : struct for this to work.

If the method is called with the generic parameter as in CountMatches<LengthAutomatonStateStruct>(...) then that method call will not box variables of type TAutomatonState because TAutomatonState is known to be LengthAutomatonStateStruct at JIT compile-time.

Unfortunately, if you were to benchmark the code in this state, you'd see that the problem hasn't yet been fully solved: there's still a lot of boxing going on.

Can you work out why?

There's Generics... Then There's Self-Referential Generics

If you guessed that, because IAutomatonState.MoveNext has an out parameter of type IAutomatonState, the compiler types nextState as IAutomatonState, which means CountMatches(nextState) is really CountMatches<IAutomatonState>(nextState) and simply falls back to the boxing version of the method, then well done you!
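
Spelled out with comments on the snippet from above:

// Inside the generic CountMatches<TAutomatonState>:
if (state.MoveNext(c, out IAutomatonState nextState)) // 👈 the struct state is boxed into the interface here
{
    CountMatches(nextState); // 👈 inferred as CountMatches<IAutomatonState>: the shared, boxing version
}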

To fix this we would need the MoveNext method to return the actual concrete type as in the following (invalid) code.

public interface IAutomatonState
{
    bool MoveNext(char c, out TSelf next);

    bool IsFinal { get; }
}

The code won't compile because TSelf is not defined.

Do you know how to make it work? ... see below

public interface IAutomatonState<TSelf> where TSelf : IAutomatonState<TSelf>
{
    bool MoveNext(char c, out TSelf next);

    bool IsFinal { get; }
}

public struct LengthAutomatonStateStruct(int charsRemaining) : IAutomatonState<LengthAutomatonStateStruct>
{
    // Same logic as before, but next is now the concrete struct type.
    public bool MoveNext(char c, out LengthAutomatonStateStruct next) => ...;

    public bool IsFinal => ...;
}

It's kind of mind-bendy but a type can be used as a constraint on one of its generic arguments. This syntax pattern enables implementations of IAutomatonState<TSelf> to reference themselves as TSelf in method signatures, as seen in LengthAutomatonStateStruct.

The code is now ready to be used without boxing, leading to the performance gains seen earlier.

From an API perspective, though, it's unusable. Continue reading to understand why, and what can be done to fix it.


Virtual Dispatch


As discussed earlier, one of the goals of Levenshtypo was to dynamically decide which automaton implementation is most efficient at runtime based on the pattern and fuzziness required. As IAutomatonState<TSelf> requires specification of TSelf, how would the factory even be written?

public class Factory
{
    public IAutomatonState<IAutomatonState<IAutomatonState<IAutomatonState<...>>>> Create(string pattern, int fuzziness);
}

To solve this conundrum, we need to understand Virtual Dispatch.

Single Dispatch

When a method is called, the program must choose which version of the method to run. Study the following C# code. What does it print?

object number = (int)1;
Print(number);

static void Print(object o) => Console.WriteLine("object");
static void Print(int i) => Console.WriteLine("int");

When the above program is compiled, the C# compiler must choose which of the two methods named Print is being invoked, in a step called Overload Resolution. The C# compiler can only use information it knows at compile time, and that is the fact that Print is being invoked with a parameter of static† type object, not the runtime type int. The compiler chooses Print(object) and the console window displays "object".

† N.B. Static in this context means "the type known at compile time".

Single Dispatch is a language feature which allows programmers to choose the method based on the runtime type rather than the static type. You most likely already know the feature by a different name: method overriding. Now consider the following C# code. What does it print?

Animal item;

item = new Animal();
item.Print();

item = new Dog();
item.Print();

class Animal
{ 
    public virtual void Print() => Console.WriteLine("Animal");
}

class Dog : Animal
{
    public override void Print() => Console.WriteLine("Dog");
}

The program first prints "Animal" and then "Dog", showing clearly that it is the runtime type of the item variable which determines which method gets called.

In this context, the "Single" in Single Dispatch refers to the fact that you can only do this based on the runtime type of a single parameter. It's such a special parameter that it gets its own language keyword (this).

In Levenshtypo we need to design an API which executes code based on the runtime type not only of the automaton, but also of the trie. Hence Single Dispatch is not enough - Double Dispatch is necessary.

Double Dispatch

As C# does not natively support Double Dispatch† we can simulate it using pre-school mathematics.

1 x Double Dispatch = 2 x Single Dispatch

† N.B. It does in fact support Double (and indeed Multiple) Dispatch via dynamic, but there are plenty of reasons to avoid dynamic, so we will pretend it does not.

Surprisingly this equation does work out. To achieve this we must introduce two new types - one for each Single Dispatch.

public interface IAutomatonExecutor<TResult>
{
    public TResult Execute<TAutomatonState>(TAutomatonState automatonState)
        where TAutomatonState : IAutomatonState<TAutomatonState>;
}

public abstract class Automaton
{
    public abstract TResult Execute<TResult>(IAutomatonExecutor<TResult> executor);
}

public class LengthAutomaton(int charsRemaining) : Automaton
{
    public override TResult Execute<TResult>(IAutomatonExecutor<TResult> executor)
        => executor.Execute(new LengthAutomatonStateStruct(charsRemaining));
}

IAutomatonExecutor is an interface indicating that a class knows how to execute automata.

Automaton is a base class for Automata. When Automaton.Execute is called, the first Single Dispatch will resolve to a concrete class which knows about a single strategy.

The LengthAutomaton, therefore, can be passed around statically typed as an Automaton. When it is ready to be executed, Automaton.Execute will be Single Dispatched to LengthAutomaton.Execute, which in turn will call IAutomatonExecutor.Execute<LengthAutomatonStateStruct>, which will be Single Dispatched to whatever module needs the concrete automaton type.

The example below completes the puzzle so that the Trie can execute a struct-based automaton.


public class Trie : IAutomatonExecutor<int>
{
    public int CountMatches(Automaton automaton) => automaton.Execute(this);

    int IAutomatonExecutor<int>.Execute<TAutomatonState>(TAutomatonState automatonState) => CountMatchesInternal(automatonState);

    public int CountMatchesInternal<TAutomatonState>(TAutomatonState state) where TAutomatonState : IAutomatonState<TAutomatonState>
    {
        // Some code goes here

        // This is a recursive method, so at some point it does the following:
        if (state.MoveNext(c, out TAutomatonState nextState))
        {
            CountMatchesInternal<TAutomatonState>(nextState);
        }

        // Some more code goes here
    }
}
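
Putting it all together, a caller only ever deals with the abstract Automaton type. A usage sketch with this article's illustrative types (not Levenshtypo's public API) looks like this:

// Usage sketch with this article's illustrative types, not Levenshtypo's public API.
Automaton automaton = new LengthAutomaton(5); // in practice chosen at runtime, e.g. by a factory

var trie = new Trie();
int fiveLetterWordCount = trie.CountMatches(automaton); // double dispatch happens inside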

Conclusion


Virtual dispatch is, at face value, regarded as a slower mechanism than direct or static method calls because of the extra lookup required to find the right implementation. As in so many situations when optimizing code, though, the solution must be judged as a whole. In the article above I described how taking the minuscule hit of virtual dispatch upfront allowed the JIT to emit highly specialized code for the rest of the algorithm, resulting in a 42% performance gain (12.6 ms -> 7.3 ms). Adding work upfront to choose a better strategy for the critical path is a common pattern seen in optimized code-bases.

Thanks to careful use of generics and structs, Levenshtypo met its design objectives:

✅ The Trie can be used independently, and custom traversals come with no performance loss compared to the built-in fuzzy search.

✅ The Automata can be used independently with custom data structures, also with no performance loss.

✅ The Automata are dynamically selected for performance based on user inputs.

✅ The internal implementations of the Automata remain private enabling future extensions.

This work shows that with a deep understanding of the JIT, generics, and dispatch mechanisms, you can design a flexible, extensible API without sacrificing performance.

