Last time on this blog, we dug deep into two powerful
data structures at the core of fuzzy search engines: Tries and Levenshtein Automata.
We wrote some unoptimised blog-quality code to search through 450,000 words for those resembling "et" using these data structures.
The results showed it running roughly 1,500 times faster than a naive loop using a highly-optimized .NET library.
The main idea presented was to use Levenshtein Automata as a guide to know which Trie nodes to visit and which to prune. In this article we tackle the considerable task of how to construct Levenshtein Automata.
In other words, how do we go from "2 typos in abc" to this:
Non-Deterministic Finite State Automata
In the previous article it was specifically noted that the arrow labelled '*' is only usable for characters
which don't match another arrow. This is because the FSAs we have seen so far are deterministic;
an input character can only match a single state transition for any given state.
A Non-Deterministic Finite Automaton (NFA) is one where an input character can match many transitions (arrows). An NFA accepts an input if any of the possible paths through it reaches an accepting state, even if most don't.
Try out the strings "et", "tet" and "tt" on the below NFA - they should all match.
Can you work out what string the automaton is trying to match, and the edit distance tolerated?
N.B. Arrow colors are solely for visual distinction
Hint: For "tet" and "tt", 3 arrows can be considered for the first 't' but only 1 will pan out.
If you figured it out, you'll have noticed that we've seen this example before.
The above NFA accepts strings within 1 edit from "et".
Shown below is the DFA version of the same NFA. Take some time to evaluate the two Automata (NFA above vs DFA below) on the following points:
- How easy is it for a human to understand what the automaton does?
- How easy would it be for a computer to match some input against the automata?
Humans tend to understand the NFA better, due to its repeated patterns and structures. Even if it's not clear why the arrows are as they are, there's a clear progression towards the right when inputs are matched, and downwards for "errors". Discovering a pattern in the DFA is much harder.
On the other hand, following the DFA is very simple - find the matching arrow and follow it. To execute an NFA, however, a computer must keep track of a potentially huge number of states at the same time, making this option too inefficient for fast fuzzy search.
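To make that bookkeeping concrete, here is a hypothetical brute-force simulation of the "et" / 1-edit NFA above, with its arrows transcribed by hand (a null label stands for a '*' arrow; the (c, e) state names match the grid layout explained later in this article):

```csharp
// Arrows of the "et" / 1-edit NFA, transcribed by hand.
// A null label is a '*' arrow and matches any character.
var arrows = new Dictionary<(int c, int e), List<(char? label, (int c, int e) to)>>
{
    [(0, 0)] = new() { ('e', (1, 0)), (null, (0, 1)), (null, (1, 1)), ('t', (2, 1)) },
    [(1, 0)] = new() { ('t', (2, 0)), (null, (1, 1)), (null, (2, 1)) },
    [(2, 0)] = new() { (null, (2, 1)) },
    [(0, 1)] = new() { ('e', (1, 1)) },
    [(1, 1)] = new() { ('t', (2, 1)) },
    [(2, 1)] = new(),
};
var accepting = new HashSet<(int, int)> { (1, 0), (2, 0), (2, 1) };

bool Accepts(string input)
{
    // Start in the top-left state and fan out through every matching arrow.
    var current = new HashSet<(int c, int e)> { (0, 0) };
    foreach (var ch in input)
    {
        current = current
            .SelectMany(state => arrows[state])
            .Where(arrow => arrow.label is null || arrow.label == ch)
            .Select(arrow => arrow.to)
            .ToHashSet();
    }
    // Accept if any surviving path ended in an accepting state.
    return current.Overlaps(accepting);
}

Console.WriteLine(Accepts("tet"));  // True
Console.WriteLine(Accepts("teet")); // False - 2 edits away from "et"
```

Every input character can multiply the set of live states - exactly the bookkeeping a DFA avoids.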
After comparing NFAs and DFAs, it may come as a surprise that there is nothing an NFA can do which a DFA can't; in other words, they are equally expressive. We know this because every NFA can be converted into a DFA and vice versa. As every DFA is also an NFA, one direction is trivial. An NFA can be converted into a DFA using an algorithm called Powerset Construction (detailed later); the DFA shown earlier was in fact derived from the NFA using this technique.
The strategy we shall explore, then, is to design the NFA which represents the desired string/distance parameters. The repetitive structures inherent to Levenshtein NFAs make this possible and actually relatively simple. A DFA will then be derived from the NFA to keep the matching logic simple.
The Anatomy of a Levenshtein NFA
A visual example should help clarify the internals of Levenshtein NFAs.
The large complex DFA at the start of the article was actually derived from the NFA shown below.
Both accept inputs if they are within 2 edits from the pattern "abc".
First, a quick glossary of the terminology:
- Pattern: the string which the Levenshtein Automata is trying to match.
- Maximum Edit Distance: the automaton will accept strings with this many edits or fewer
- Edit: a primitive operation in Levenshtein Distance Algorithm:
- Inserting a character
- Removing a character
- Substituting a character
- Input: the string to be tested by the automaton, checking if it is within Maximum Edit Distance of the Pattern.
The States (circles) are organised based on the following rules:
- Each state keeps track of two values:
  - c (consumed) is the number of characters in Pattern seen so far
  - e (error) is the number of errors so far
- The diagram uses c and e as x and y co-ordinates respectively.
- As pattern characters are consumed, the states move towards the right.
  - Thus, the number of columns is 1 more than the length of the Pattern.
- As edits increase, the states move down.
  - Thus, the number of rows is 1 more than Maximum Edit Distance.
- For a state to be accepting, it must either be in the final column, or have enough edits remaining to reach the final column †.
  - Specifically, we can use this formula: IsAccepting = c + (MaxEditDistance - e) >= Pattern.Length
  - For example, the string "a" is accepted by the NFA above because we can use 2 delete edits to reach the final column at c3 e2.
In C#, we first define an NfaState which is parameterized by c and e. It also has access to the
pattern and the max edit distance.
/// <summary>
/// Represents a single state in the Levenshtein NFA
/// </summary>
public class NfaState(string pattern, int maxEditDistance, int c, int e)
{
/// <summary>
/// Number of characters consumed so far
/// </summary>
public int C => c;
/// <summary>
/// Number of errors seen so far
/// </summary>
public int E => e;
/// <summary>
/// Returns true to match input strings ending in this state
/// </summary>
public bool IsAccepting => C + (maxEditDistance - E) >= pattern.Length;
}
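To sanity-check the accepting rule, we can evaluate a throwaway copy of the formula against a couple of states from the "abc" / 2-edit diagram (the helper below is hypothetical scratch code, not part of the final classes):

```csharp
// Hypothetical standalone copy of the IsAccepting formula,
// pinned to the "abc" / max-2-edits example.
static bool IsAccepting(int c, int e, int patternLength = 3, int maxEditDistance = 2)
    => c + (maxEditDistance - e) >= patternLength;

// (c: 1, e: 0): only "a" consumed, no errors yet; the 2 remaining
// edits can delete "b" and "c", so this state accepts.
Console.WriteLine(IsAccepting(c: 1, e: 0)); // True

// (c: 0, e: 2): nothing consumed and both edits already spent.
Console.WriteLine(IsAccepting(c: 0, e: 2)); // False

// (c: 3, e: 2): the final column always accepts.
Console.WriteLine(IsAccepting(c: 3, e: 2)); // True
```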
An Nfa is constructed using a pattern and a max edit distance. The states are pre-created according to the above rules.
public class Nfa
{
private readonly NfaState[,] _states; // Indexed via [c, e]
public Nfa(string pattern, int maxEditDistance)
{
_states = CreateEmptyStates(pattern, maxEditDistance);
}
private static NfaState[,] CreateEmptyStates(string pattern, int maxEditDistance)
{
var result = new NfaState[pattern.Length + 1, maxEditDistance + 1];
// To each array location assign the right NfaState
for (var c = 0; c < result.GetLength(0); c++)
{
for (var e = 0; e < result.GetLength(1); e++)
{
result[c, e] = new NfaState(pattern, maxEditDistance, c, e);
}
}
return result;
}
}
† N.B. Some NFAs allow empty transitions - transitions which can be followed at any point without consuming characters from the input string. This allows for simplifications in some aspects of the NFA - for example, only the states in the final column must be accepting - but adds complexity to the overall representation of the model. Therefore the above rules create an NFA which does not need the empty transition.
The Transitions (arrows) are organised based on the following rules for an arbitrary state (c, e) (defined above):
- If a transition leads to a non-existent state then ignore that transition and continue.
  - (0, 0) are the co-ordinates of the start state, the topmost leftmost state. (Pattern.Length, Maximum Edit Distance) are the co-ordinates of the bottommost rightmost state.
- As a reminder, this is an NFA and the execution can follow multiple transitions, even in the straightforward case where the expected character is found.
- Create a transition for each possible Levenshtein Operation:
  - A matching character, which must match Pattern[c]. (green)
    - Transition to state (c+1, e) as this consumes a character with no errors.
    - Example: "abc"... right -> right -> right
  - An inserted character. (black)
    - Transition to state (c, e+1) as this does not consume a character but does count as an edit.
    - Example: "axbc"... right -> down -> right -> right
  - A substituted character. (orange)
    - Transition to state (c+1, e+1) as this consumes a character as a single edit.
    - Example: "abx"... right -> right -> diagonal to (c3, e1)
  - Deleted characters. (purple, red)
    - The maximum number of deleted characters is the Maximum Edit Distance (D).
    - For d = 1..D (inclusive), a character matching Pattern[c + d] can transition to (c+1+d, e+d), as this consumes d+1 pattern characters (the d deleted ones plus the match) with d edits.
    - Example: "bc"... diagonal to (c2, e1) -> right
To the NfaState class we can add the following method which returns the possible transitions from any state:
public record struct NfaTransition(char? Symbol, int C, int E);
public class NfaState
{
// To the existing code, add the following method
public IEnumerable<NfaTransition> GetTransitions()
{
if (C < pattern.Length)
{
// Matches the expected character without incrementing the error value.
yield return new NfaTransition(pattern[C], C + 1, E);
}
if (E < maxEditDistance)
{
// Insertion
yield return new NfaTransition(null, C, E + 1);
}
if (C < pattern.Length && E < maxEditDistance)
{
// Substitute
yield return new NfaTransition(null, C + 1, E + 1);
}
for (var deletions = 1; deletions <= maxEditDistance; deletions++)
{
if (C + deletions < pattern.Length && E + deletions <= maxEditDistance)
{
// Delete {deletions} characters
yield return new NfaTransition(pattern[C + deletions], C + 1 + deletions, E + deletions);
}
}
}
}
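As a quick check of the rules, we can re-implement GetTransitions as a standalone tuple-returning function (a throwaway sketch, not the article's class) and list the transitions leaving the start state of the "abc" / 2-edit example:

```csharp
// Throwaway tuple version of GetTransitions, mirroring the logic above.
IEnumerable<(char? Symbol, int C, int E)> GetTransitions(
    string pattern, int maxEditDistance, int c, int e)
{
    if (c < pattern.Length)
        yield return (pattern[c], c + 1, e);                 // match
    if (e < maxEditDistance)
        yield return (null, c, e + 1);                       // insertion
    if (c < pattern.Length && e < maxEditDistance)
        yield return (null, c + 1, e + 1);                   // substitution
    for (var d = 1; d <= maxEditDistance; d++)               // delete d characters
        if (c + d < pattern.Length && e + d <= maxEditDistance)
            yield return (pattern[c + d], c + 1 + d, e + d);
}

foreach (var transition in GetTransitions("abc", maxEditDistance: 2, c: 0, e: 0))
    Console.WriteLine(transition);
// Five transitions leave (0, 0):
//   ('a', 1, 0)  - match 'a'
//   (null, 0, 1) - insertion (any character)
//   (null, 1, 1) - substitution (any character)
//   ('b', 2, 1)  - delete "a", then match 'b'
//   ('c', 3, 2)  - delete "ab", then match 'c'
```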
Powerset Construction
Now that we have a way to create NFAs, we can use Powerset Construction to derive a DFA which matches the same strings.
For every combination of NFA states, Powerset Construction creates a DFA state. Hence no matter which combination of states the NFA is in there is always a DFA state which represents that combination.
Example: an NFA with the states {a, b, c} will have a DFA with the states {a, b, c, ab, ac, bc, abc}.
Since there are a finite number of states in the NFA, it stands to reason that the number of states in the DFA will also be finite, though
possibly very large; an NFA with n states could result in 2^n DFA states.
To determine the transitions for a particular DFA state, first the underlying NFA states are found. Then, for each possible input character, all the valid NFA transitions for all the underlying NFA states are followed, building up a set of next NFA states for that character. A transition is then created for that character to the DFA state which represents the specific combination of next NFA states.
In practice, as not all combinations of states are reachable, the algorithm starts with the DFA state representing the initial NFA state. As it builds transitions and encounters unseen combinations, it adds them to a queue to be processed. The queue is processed one item at a time; once no more new combinations are encountered and all combinations have been processed, all reachable states have been accounted for and the algorithm can stop.
If you just want to understand the concepts behind fuzzy matching, then you can skip the following code-heavy section.
It's time to see what Powerset Construction looks like in C#.
First we'll create a skeleton showing the outline of the algorithm.
There's a PowersetConstruction class which is aware of the NFA states and has a main method ToDfa() which
starts by inspecting the initial state at (c: 0, e: 0) and continues to process all encountered states.
public class Nfa
{
// To the existing code, add the following method
public FiniteStateAutomaton ToDfa() => new PowersetConstruction(_states).ToDfa();
}
public class PowersetConstruction(NfaState[,] states)
{
/// <summary>
/// A full list of all DFA states encountered so far
/// </summary>
private readonly List<FsaState> _dfaStates = new ();
/// <summary>
/// Unprocessed DFA states to be processed.
/// </summary>
private readonly Queue<QueueItem> _queue = new ();
private record struct QueueItem(int DfaIndex, (int c, int e)[] NfaStates);
public FiniteStateAutomaton ToDfa()
{
// "Encounter" the initial state
MapNfaStatesToDfaIndex([(c: 0, e: 0)]);
while (_queue.Count > 0)
{
var queueItem = _queue.Dequeue();
ProcessNext(queueItem);
}
return new FiniteStateAutomaton(_dfaStates.ToArray());
}
}
The above skeleton makes reference to MapNfaStatesToDfaIndex, which is defined in the next code block.
Its job is to map combinations of NFA states to a DFA state.
If the combination has not been encountered before, it does some bookkeeping before adding it to the processing queue.
/// <summary>
/// Maps from a unique string representing the combination of states to the index in `dfaStates`
/// </summary>
private readonly Dictionary<string, int> _dfaMap = new ();
/// <summary>
/// Maps a combination of NFA states to an index into `_dfaStates`.
/// Creates an empty DFA state if this combination has not yet been encountered.
/// </summary>
private int MapNfaStatesToDfaIndex((int c, int e)[] nfaStateRefs)
{
// Create a string form of this unique combination to use as a dictionary key
var key = string.Join(":", nfaStateRefs
.OrderBy(static state => state.c)
.ThenBy(static state => state.e)
.Select(static state => $"({state.c}, {state.e})"));
if (!_dfaMap.TryGetValue(key, out var dfaIndex))
{
// We've encountered this combination for the first time.
// Create an entry in _dfaStates and update _dfaMap to
// find it easily next time.
dfaIndex = _dfaStates.Count;
// Ensure dfaMap and dfaStates are kept in-sync.
Debug.Assert(_dfaStates.Count == _dfaMap.Count);
_dfaMap.Add(key, dfaIndex);
_dfaStates.Add(new FsaState());
// Add this combination to the processing queue.
_queue.Enqueue(new QueueItem(dfaIndex, nfaStateRefs));
}
return dfaIndex;
}
Finally, the ProcessNext method is defined below. It is responsible for initializing the DFA state by
determining whether it's an accepting state, as well as building the transitions to other DFA states.
private void ProcessNext(QueueItem queueItem)
{
var (dfaIndex, nfaStateRefs) = queueItem;
var dfaState = _dfaStates[dfaIndex];
// A DFA state is accepting if any of the NFA states are accepting
dfaState.IsAccepting = nfaStateRefs
.Select(stateRef => states[stateRef.c, stateRef.e])
.Any(nfaState => nfaState.IsAccepting);
// List all the possible transitions from all the possible underlying NFA states
var nfaTransitions = nfaStateRefs
.Select(stateRef => states[stateRef.c, stateRef.e])
.SelectMany(nfaState => nfaState.GetTransitions())
.ToList();
// Get a list of all possible input symbols
var inputChars = nfaTransitions.Select(t => t.Symbol).Distinct();
foreach (var inputChar in inputChars)
{
// Follow all transitions matching this character, keeping track of the destination state.
var destinationStates = nfaTransitions
.Where(t => t.Symbol is null || t.Symbol == inputChar)
.Select(t => (t.C, t.E))
.Distinct()
.ToArray();
// Missing Optimization: In production code you'd typically see some pruning of unnecessary states
// but this is already getting way overcomplicated.
var nextState = MapNfaStatesToDfaIndex(destinationStates);
// Add the transition for this character
if (inputChar is null)
{
dfaState.WildcardTransition = nextState;
}
else
{
dfaState.Transitions.Add(inputChar.Value, nextState);
}
}
}
To create a Levenshtein Automaton, we can now do this:
var levenshteinAutomaton = new Nfa(pattern, editDistance).ToDfa();
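The FiniteStateAutomaton and FsaState types come from the previous article. As a rough sketch of what executing the finished DFA could look like, here are minimal stand-in versions; their shape (Transitions, WildcardTransition, IsAccepting) is an assumption inferred from how ProcessNext uses them above, not the previous article's exact code:

```csharp
// Assumed minimal shape of the previous article's DFA state.
public class FsaState
{
    public bool IsAccepting;
    public int? WildcardTransition;            // the '*' arrow, if any
    public Dictionary<char, int> Transitions = new();
}

public class FiniteStateAutomaton(FsaState[] states)
{
    public bool Matches(string input)
    {
        var current = 0;                       // state 0 is the start state
        foreach (var ch in input)
        {
            var state = states[current];
            if (state.Transitions.TryGetValue(ch, out var next))
                current = next;                // follow the labelled arrow
            else if (state.WildcardTransition is int wildcard)
                current = wildcard;            // fall back to the '*' arrow
            else
                return false;                  // no arrow at all: reject
        }
        return states[current].IsAccepting;
    }
}
```

Matching is a single pass over the input with one lookup per character - which is what makes the DFA so much faster to execute than the NFA.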
Benchmarks
To determine how this all performs we will use a benchmark. The goal is to search the words of the English language.
The 2 strategies to test are:
- Quickenshtein: naive loop using Quickenshtein, one of the fastest OSS .NET Levenshtein Distance calculators
- Blog: runtime-generated FiniteStateAutomaton (as above) with a pre-indexed Trie (as in the last article in this series)
| Method | Scenario | Words | Mean |
|---|---|---|---|
| Naive Quickenshtein | "hello", edits = 1 | 1000 | 121.98 us |
| Automata | "hello", edits = 1 | 1000 | 24.83 us |
| Naive Quickenshtein | "hello", edits = 1 | 450000 | 59,137.71 us |
| Automata | "hello", edits = 1 | 450000 | 72.23 us |
| Naive Quickenshtein | "parallelogram", edits = 3 | 1000 | 195.77 us |
| Automata | "parallelogram", edits = 3 | 1000 | 4,078.31 us |
| Naive Quickenshtein | "parallelogram", edits = 3 | 450000 | 99,242.34 us |
| Automata | "parallelogram", edits = 3 | 450000 | 11,982.05 us |
The results come with the usual warning that the numbers should be interpreted relative to each other, and absolute values will vary widely across systems.
The naive approach exhibits roughly linear performance in the number of characters being compared. The longer query takes very roughly twice as long, and the larger dataset also takes proportionally longer.
Analyzing the benchmark results for the Automata reveals a very different story. For longer queries and smaller datasets, this approach is actually slower than the naive one. This happens because whilst building the NFA and executing the DFA are extremely fast, Powerset Construction takes up most of the time dealing with an exponential number of states. Moreover, there isn't really a faster general-case way to convert NFAs to DFAs.
So what do practical fuzzy search engines do to get around this? Unfortunately the rabbit hole goes much deeper and this post is already very long. This will be the subject of the next entry in this series.
Conclusion
In this article we explored an algorithm for creating Levenshtein Automata. The strategy was to create an NFA which is easy to understand by humans, and then derive a DFA from that using Powerset Construction. The DFA is very fast for machines to execute.
All of this was provided as C# code, ready to be copy/pasted into your playground environment for further exploration.
The benchmarks, however, revealed a snag. Powerset Construction proved to be a performance killer, nullifying many of the gains for simpler scenarios.
Next time, we look at how search engines get around this.
.NET implementations of production-ready Levenshtein Automata are available publicly via the Levenshtypo library which powers Autypo, an autocomplete library created as an idiomatic way to do autocomplete in .NET.