During the design phase of Levenshtypo I ran into a familiar dilemma which many library authors face:
How do I balance API design, future-compatibility and performance?
The tension results from opposing forces. As a library author I want to keep a very simple API to encourage adoption while exposing enough bells and whistles for power users. I want to give myself room to grow by abstracting the internals, but without tanking performance.
In this article I explain how using generics and double dispatch allowed me to have my cake and eat it too.
Context
Levenshtypo exposes two main abstractions - the Levenshtein Automaton and the Trie, detailed here. Briefly, you can think of a trie as a mapping from strings to values, and an automaton as a schematic for which strings to find.
As an example, a trie might represent the mapping { "where": 1, "there": 2, "here": 3, "what": 4 }, and an automaton might represent the query "words similar to 'there'". If you were to execute the automaton against the trie, you'd get [1, 2, 3].
Tries are usually searched via Depth-First Search (DFS). DFS can often be implemented recursively such that many of the branching decisions can be stored on the stack, allowing optimized implementations to avoid heap allocations.
Goals
While the trie and the automaton implementations in Levenshtypo were literally made for each other, there are many use-cases where each could be used independently.
Tries are great for searching strings by prefix. Instead of limiting the Levenshtrie to fuzzy search, Levenshtypo should allow advanced users to build their own queries, e.g. "5 letter words starting with 'tri'".
The in-built automata should also be entirely usable by any data structure, not only the aforementioned trie. For the best adoption, Levenshtypo should cater for developers who have already built their own trie, perhaps with features we do not cater for, such as external storage support.
Neither of these advanced use-cases should come with any performance hits.
The best implementation of a Levenshtein Automaton must be chosen dynamically depending on the pattern and the edit distance. This provides the best performance for the widest variety of scenarios, and opens the possibility to improve the library in the future by adding newer, better implementations. It's optional but if you want to know more, read my article on how Bitwise Parallel automata compare to Parameterized Levenshtein Automata.
Boxed In: Structs and Interfaces
The following simple interface describes the conceptual contract between Tries and Automata:
public interface IAutomatonState
{
    bool MoveNext(char c, out IAutomatonState next);
    bool IsFinal { get; }
}
A Trie Node finds matching child nodes by consulting IAutomatonState.MoveNext. If IAutomatonState.IsFinal is true, then the automaton matches the word represented by stringing together all the characters from the root to the current node.
Note that MoveNext returns a new state (next) which represents the state obtained by following the character. This allows a node to attempt MoveNext across many children, branching to the next state for each matching child.
While Levenshtypo is designed for Fuzzy Search, the example below keeps things simple. It illustrates an automaton which matches all words of a specified length (counter).
public class LengthAutomatonState(int counter) : IAutomatonState
{
    public bool MoveNext(char c, out IAutomatonState next)
    {
        if (counter > 0)
        {
            next = new LengthAutomatonState(counter - 1);
            return true;
        }

        // When this method returns false, the next state shouldn't be used.
        next = default!;
        return false;
    }

    public bool IsFinal => counter == 0;
}
In the LengthAutomatonState, each call to MoveNext returns a new instance with a decremented counter.
LengthAutomatonState is typical in the sense that it has a small internal state which can easily be wrapped in a struct. This, combined with the DFS recursion algorithm, makes it a perfect candidate to be a struct.
Benchmarking the above automaton with its struct equivalent results in the following:
| Method | Mean | Gen0 | Allocated |
|---|---|---|---|
| Class | 12.598 ms | 218.7500 | 3913912 B |
| Struct | 7.349 ms | - | 88 B |
The benchmark shows a great performance improvement when opting for a struct; the Garbage Collector simply has much less to do than in the class variant.
Unfortunately, though, changing the LengthAutomatonState into a struct isn't as easy as it seems...
Struct-ification
.NET only allocates structs to stack memory if their type is known at JIT compile time. For example, if an int is cast to an object, then the JIT compiler treats it like it would any other object - as a class. To achieve this, the int is wrapped in a "box" and allocated onto the heap, which makes it look like a class. This process is unsurprisingly called "boxing".
The same boxing process happens if a struct is assigned to an interface; interface variables always reference the heap, so if the object is a struct then it must first be boxed.
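As a quick illustration (a hypothetical snippet, not from Levenshtypo), both of the following assignments allocate a box on the heap:

int number = 42;
object boxedNumber = number;          // the int is copied into a heap-allocated "box"

// The same happens when a struct is assigned to an interface it implements:
IComparable<int> boxedAgain = number; // still a heap allocation, even though int is a struct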
If we were to naively change the LengthAutomatonState from a class to a struct, it would look something like the following.
public struct LengthAutomatonStateStruct(int charsRemaining) : IAutomatonState
{
    public bool MoveNext(char c, out IAutomatonState next)
    {
        ...
    }

    public bool IsFinal => ...;
}

public class Trie
{
    public int CountMatches(IAutomatonState state) => ...;
}
In an attempt to be re-usable with different automaton implementations, the Trie exposes a method which accepts the IAutomatonState interface. However, in doing so it forces any struct passed as a parameter to be immediately boxed, since interfaces must always be managed references to objects on the heap. This naive change hasn't actually gained us anything.
Generic Method Specialization
The JIT has a quirk which is especially powerful when optimizing for performance. When a generic method is called, the JIT must determine which machine code to execute. The JIT will generate the code if necessary, but it will re-use already-generated machine code if the generic parameters are compatible. To determine compatibility, the generic parameters must be equal after replacing class- and interface-type parameters with the System.__Canon placeholder.
Classes and interfaces can share the same machine code because variables of those types are all managed references 8 bytes long (on a 64-bit machine).
The same is not true for structs. A struct may require 16, 32 or any other number of bytes. Also, the garbage collector must keep track of the managed references within the struct, so the number of references as well as their exact in-memory layout matter for memory safety.
Therefore when invoking a generic method with a struct type parameter, the machine code generated by the JIT compiler will always be hardcoded and optimized specifically for that struct parameter.
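The following hypothetical snippet (not from Levenshtypo) shows this rule in action: the first two calls run the same shared machine code, while each struct type argument gets its own specialized copy.

using System;

static class Demo
{
    static string Describe<T>(T value) => $"{typeof(T).Name}: {value}";

    static void Main()
    {
        Console.WriteLine(Describe("abc"));        // reference type: shared System.__Canon code
        Console.WriteLine(Describe(new object())); // reference type: re-uses that shared code
        Console.WriteLine(Describe(42));           // int: machine code specialized for int
        Console.WriteLine(Describe(3.14));         // double: yet another specialized copy
    }
}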
For a great example to highlight how useful this can be for maximising performance, look no further than the core .NET library itself.
In System.Private.CoreLib there exists the following code.
internal interface IRuntimeConst
{
    static abstract bool Value { get; }
}

internal readonly struct TrueConst : IRuntimeConst
{
    public static bool Value => true;
}

internal readonly struct FalseConst : IRuntimeConst
{
    public static bool Value => false;
}
The code is used in the performance-critical class SingleStringSearchValuesFallback. The class must make either case sensitive or case insensitive comparisons. Compare the implementations below (which have been heavily trimmed for this example):
// Unoptimized
internal sealed class SingleStringSearchValuesFallback(string value, bool ignoreCase)
{
    int IndexOf(ReadOnlySpan<char> span) =>
        ignoreCase
            ? IndexOf_CaseInsensitive(span, _value)
            : IndexOf_CaseSensitive(span, _value);
}

// Real
internal sealed class SingleStringSearchValuesFallback<TIgnoreCase>(string value)
    where TIgnoreCase : struct, SearchValues.IRuntimeConst
{
    int IndexOf(ReadOnlySpan<char> span) =>
        TIgnoreCase.Value // 👈 Branch based on the static property of the generic parameter.
            ? IndexOf_CaseInsensitive(span, _value)
            : IndexOf_CaseSensitive(span, _value);
}
In the unoptimized version, every call to IndexOf tests the value of ignoreCase to choose a branch, as you would see in almost all code ever written.
In the optimized version, however, the code is written in a much more complex way, making use of the IRuntimeConst interface in the pursuit of ever-increasing performance gains. The author relies on the JIT compiler generating different code for struct TrueConst and struct FalseConst, since generic method code is not re-used across structs. When generating the code specialized for TrueConst, it can resolve TIgnoreCase.Value as always being true, as it is effectively a constant - a static property whose implementation always returns the literal true. It can therefore deduce that the first branch will always be chosen, compiling down to a direct call to IndexOf_CaseInsensitive(span, _value) without any branching. The same reasoning applies if the generic parameter were set to FalseConst, which would also result in no branching, directly calling IndexOf_CaseSensitive(span, _value).
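The single runtime branch on ignoreCase doesn't disappear entirely; it moves to the point where the generic instantiation is chosen. A hedged sketch of such a call site (the real CoreLib factory is more involved) could look like this, following the trimmed constructor signatures above:

// Branch once up front; every subsequent IndexOf call is branch-free.
object searcher = ignoreCase
    ? new SingleStringSearchValuesFallback<TrueConst>(value)
    : new SingleStringSearchValuesFallback<FalseConst>(value);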
Back to Automata
This quirk of generic method specialization can be used in our scenario, too.
public int CountMatches<TAutomatonState>(TAutomatonState state)
    where TAutomatonState : IAutomatonState
{
    // Some code goes here

    // This is a recursive method, so at some point it does the following:
    if (state.MoveNext(c, out IAutomatonState nextState))
    {
        CountMatches(nextState);
    }

    // Some more code goes here
}
The CountMatches method is now generic and the code will be specialized to whatever TAutomatonState is. Note that there's no need to constrain it with where TAutomatonState : struct for this to work. If the method is called with the generic parameter as in CountMatches<LengthAutomatonStateStruct>(...), then that call will not box variables of type TAutomatonState, because TAutomatonState is known to be LengthAutomatonStateStruct at JIT compile time.
Unfortunately, if you were to benchmark the code in this state, you'd see that the problem hasn't yet been fully solved - there's still a lot of boxing going on. Can you work out why?
There's Generics... Then There's Self-Referential Generics
If you guessed that, since IAutomatonState.MoveNext has an out parameter of type IAutomatonState, the type of nextState according to the JIT compiler is IAutomatonState, which means that CountMatches(nextState) is actually CountMatches<IAutomatonState>(nextState) and just falls back to the boxing version of the method, then well done you!
To fix this we would need the MoveNext method to return the actual concrete type, as in the following (invalid) code.
public interface IAutomatonState
{
    bool MoveNext(char c, out TSelf next);
    bool IsFinal { get; }
}
The code won't compile because TSelf is not defined. Do you know how to make it work? ... see below.
public interface IAutomatonState<TSelf> where TSelf : IAutomatonState<TSelf>
{
    bool MoveNext(char c, out TSelf next);
    bool IsFinal { get; }
}
public struct LengthAutomatonStateStruct(int counter) : IAutomatonState<LengthAutomatonStateStruct>
{
    // Same contract as the class version, but `next` is now the concrete struct type.
    public bool MoveNext(char c, out LengthAutomatonStateStruct next) { next = new(counter - 1); return counter > 0; }
    public bool IsFinal => counter == 0;
}
It's kind of mind-bendy, but a type can be used as a constraint on one of its own generic arguments. This syntax pattern enables implementations of IAutomatonState<TSelf> to reference themselves as TSelf in method signatures, as seen in LengthAutomatonStateStruct.
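With the self-referential constraint in place, the generic CountMatches from earlier can keep its recursion on the concrete type. A sketch of the updated method (the full version appears inside the Trie further down):

public int CountMatches<TAutomatonState>(TAutomatonState state)
    where TAutomatonState : IAutomatonState<TAutomatonState>
{
    // ...
    // nextState is now the concrete TAutomatonState, so the recursive call stays
    // inside the same specialized instantiation and nothing gets boxed.
    if (state.MoveNext(c, out TAutomatonState nextState))
    {
        CountMatches(nextState);
    }
    // ...
}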
The code is now ready to be used without boxing, leading to the performance gains seen earlier.
From an API perspective, though, it's unusable. Continue reading to understand why, and what can be done to fix it.
Virtual Dispatch
As discussed earlier, one of the goals of Levenshtypo was to dynamically decide which automaton implementation is most efficient at runtime, based on the pattern and fuzziness required.
As IAutomatonState<TSelf> requires specification of TSelf, how would the factory even be written?
public class Factory
{
    public IAutomatonState<IAutomatonState<IAutomatonState<IAutomatonState<...>>>> Create(string pattern, int fuzziness);
}
To solve this conundrum, we need to understand Virtual Dispatch.
Single Dispatch
When a method is called, the program must choose which version of the method to run. Study the following C# code. What does it print?
object number = (int)1;
Print(number);
static void Print(object o) => Console.WriteLine("object");
static void Print(int i) => Console.WriteLine("int");
When the above program is compiled, the C# compiler must choose which of the two methods, both called Print, is being invoked, in a step called Overload Resolution. The C# compiler can only use information it knows at compile time, and that is the fact that Print is being invoked with a parameter of static† type object, not the runtime type int. The compiler chooses Print(object) and the console window displays "object".
†N.B. Static in this context means "the type known at compile time".
Single Dispatch is a language feature which allows programmers to choose the method based on the runtime type, not the static type. You most likely already know the feature by a different name: method override.
Now consider the following C# code. What does it print?
Animal item;
item = new Animal();
item.Print();
item = new Dog();
item.Print();
class Animal
{
    public virtual void Print() => Console.WriteLine("Animal");
}

class Dog : Animal
{
    public override void Print() => Console.WriteLine("Dog");
}
The program first prints "Animal" and then "Dog", showing clearly that it is the runtime type of the object referenced by the item variable which determines which method gets called.
In this context, the "Single" in Single Dispatch refers to the fact that you can only dispatch on the runtime type of a single parameter. It's such a special parameter that it gets its own language keyword (this).
In Levenshtypo we need to design an API which executes code based on the runtime type not only of the automaton, but also of the trie. Hence Single Dispatch is not enough - Double Dispatch is necessary.
Double Dispatch
As C# does not natively support Double Dispatch†, we can simulate it using pre-school mathematics.
1 x Double Dispatch = 2 x Single Dispatch
†N.B. It does in fact support Double (and indeed Multiple) Dispatch, but there are plenty of reasons to avoid dynamic, so we will pretend it does not.
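For the curious, this is roughly what runtime overload resolution via dynamic looks like, re-using the earlier Print example (an aside only; Levenshtypo does not use it):

object number = (int)1;
Print((dynamic)number); // prints "int": the overload is chosen from the runtime type

static void Print(object o) => Console.WriteLine("object");
static void Print(int i) => Console.WriteLine("int");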
Surprisingly this equation does work out. To achieve this we must introduce two new types - one for each Single Dispatch.
public interface IAutomatonExecutor<TResult>
{
    public TResult Execute<TAutomatonState>(TAutomatonState automatonState)
        where TAutomatonState : IAutomatonState<TAutomatonState>;
}

public abstract class Automaton
{
    public abstract TResult Execute<TResult>(IAutomatonExecutor<TResult> executor);
}

public class LengthAutomaton(int charsRemaining) : Automaton
{
    public override TResult Execute<TResult>(IAutomatonExecutor<TResult> executor)
        => executor.Execute(new LengthAutomatonStateStruct(charsRemaining));
}
IAutomatonExecutor is an interface indicating that a class knows how to execute automata. Automaton is a base class for automata. When Automaton.Execute is called, the first Single Dispatch will resolve to a concrete class which knows about a single strategy. The LengthAutomaton, therefore, can be passed around statically typed as an Automaton.
When it is ready to be executed, Automaton.Execute will be Single Dispatched to LengthAutomaton.Execute, which in turn will call IAutomatonExecutor.Execute<LengthAutomatonStateStruct>, which will be Single Dispatched to whatever module needs the concrete automaton type.
The example below completes the puzzle so that the Trie can execute a struct-based automaton.
public class Trie : IAutomatonExecutor<int>
{
    public int CountMatches(Automaton automaton) => automaton.Execute(this);

    int IAutomatonExecutor<int>.Execute<TAutomatonState>(TAutomatonState automatonState)
        => CountMatchesInternal(automatonState);

    public int CountMatchesInternal<TAutomatonState>(TAutomatonState state)
        where TAutomatonState : IAutomatonState<TAutomatonState>
    {
        // Some code goes here

        // This is a recursive method, so at some point it does the following:
        if (state.MoveNext(c, out TAutomatonState nextState))
        {
            CountMatchesInternal<TAutomatonState>(nextState);
        }

        // Some more code goes here
    }
}
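To close the loop, a hypothetical usage sketch (the trie-building helper is illustrative and not the real Levenshtypo API):

Trie trie = BuildTrie();                      // assumed helper that constructs the trie
Automaton automaton = new LengthAutomaton(5); // in practice, selected dynamically by a factory
int matches = trie.CountMatches(automaton);   // double dispatch lands in CountMatchesInternal<LengthAutomatonStateStruct>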
Conclusion
Virtual dispatch is, at face value, regarded as a slower mechanism than direct or static method calls because of the extra lookup required to find the right implementation. As in so many situations when optimizing code, though, the solution must be taken as a whole. In the article above I described how taking the minuscule hit of virtual dispatch upfront allowed the JIT to run extremely specialized code for the rest of the algorithm, resulting in massive performance gains of 42% (12.6 ms -> 7.3 ms). Adding work upfront to choose a better strategy for the critical path is a common pattern seen in optimized code-bases.
Thanks to careful use of generics and structs, Levenshtypo met its design objectives:
✅ The Trie can be used independently and custom traversal comes with no performance loss compared to native.
✅ The Automata can be used independently with custom data structures, also with no performance loss.
✅ The Automata are dynamically selected for performance based on user inputs.
✅ The internal implementations of the Automata remain private enabling future extensions.
This work shows that with a deep understanding of the JIT, generics, and dispatch mechanisms, you can design a flexible, extensible API without sacrificing performance.