Under the Hood of Fuzzy Search: Parameterized Levenshtein Automata

In this series we're diving deep into the data structures and algorithms behind Fuzzy Search. As an introduction to the topic, the first post in this series shed light on how Tries and Levenshtein Automata are used together to provide rapid information retrieval.

The second post highlighted just how complex Levenshtein Automata can be to create, and ultimately just how slow they are due to the Powerset Construction algorithm having to explore an exponentially large number of states. The article concluded by hinting that all is not lost; after all, we have all seen real fuzzy search in large, scalable search engines.

In this article we will explore Parameterized Levenshtein Automata and how they solve the performance issues which plague regular Levenshtein Automata. It's a notoriously complex topic, but some say it's just misunderstood and has a bad reputation due to the dense scientific article which first introduced it: Fast String Correction with Levenshtein-Automata by Klaus Schulz and Stoyan Mihov.

It's best to start by studying the two automata below (left is "cat" and right is "pig", edit distance = 1). Can you notice anything interesting?

DFA of "cat" and "pig", distance=1

The Levenshtein Automata for "cat" and "pig" are structurally identical! The only difference is the transition labels. Replacing ['c', 'a', 't'] with ['p', 'i', 'g'] respectively transforms the left into the right. Let's look at some other three-letter words - "egg" and "pop" (edit distance = 1). Can you predict what they might look like?

DFA of "egg" and "pop", distance=1

Both of these automata differ from "cat" and "pig". The duplicate letters 'g' in "egg" and 'p' in "pop" transform the structure of the DFA.

For example the "pop" DFA must keep track that the 'p' could be either the first or the third letter in "pop". The DFA can only clarify this once further characters are processed. There is no such ambiguity in "cat".

How far apart is too far apart?

Can you guess how the automata for "fred" and "greg" compare?

DFA of "fred" and "greg", distance=1

You may have expected that the repeated 'g' in "greg" would result in the two automata being structurally different, but they are in fact identical.

The 'g' characters are so far apart that the DFA will never mistake one for the other.

For example, if the pattern is "greg" and the input string starts with "reg..." then this can only ever match if the first 'g' was dropped. Similarly, if encountering the prefix "eg..." then the remainder of the string can only match if the 'e' at the front was an extra new character and the remaining characters are "reg".

If we repeat this process, a clear pattern emerges:

  • For edit distance 1, duplicates matter within 3 letters
  • For edit distance 2, duplicates matter within 5 letters
  • For edit distance 3, duplicates matter within 7 letters
  • For edit distance n, duplicates matter within 2n + 1 letters

The effect of duplicates only extends a limited distance into the automaton.
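The rule above can be captured in a tiny predicate. This is purely illustrative - the name `HasInterferingDuplicate` is mine, not part of any library - but it encodes the observation that duplicates only matter when two occurrences fall within 2n + 1 letters of each other:

```csharp
using System;

// Two equal characters interfere when they are at most 2n positions
// apart, i.e. both fall inside a window of 2n + 1 letters.
// Illustrative helper, not part of any library.
static bool HasInterferingDuplicate(string word, int maxEdits)
{
    for (var i = 0; i < word.Length; i++)
    {
        for (var j = i + 1; j < word.Length && j - i <= 2 * maxEdits; j++)
        {
            if (word[i] == word[j])
            {
                return true;
            }
        }
    }

    return false;
}

Console.WriteLine(HasInterferingDuplicate("egg", 1));  // True
Console.WriteLine(HasInterferingDuplicate("pop", 1));  // True
Console.WriteLine(HasInterferingDuplicate("greg", 1)); // False: the 'g's span 4 letters, outside 2n + 1 = 3
Console.WriteLine(HasInterferingDuplicate("cat", 1));  // False
```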

This result can also be derived by analytical reasoning. Start by assuming an arbitrarily long pattern with edit distance n:

Extreme 1 — insertions at the start

If all n edits are insertions before the first character, then after reading n + 1 input characters the first pattern character has either been consumed or skipped; it will never match later.

Extreme 2 — matches + deletions

On the other hand, suppose the first n input characters match, and another n pattern characters are deleted. Then the next input character (at index n + 1) could correspond to pattern character 2n + 1.

Hence after reading n + 1 input characters, the automaton could be anywhere between:

  • 1 character into the pattern
  • 2n + 1 characters into the pattern

✅ This matches our earlier observations.

NFA

The algorithm employs a sliding-window technique, meaning that it only "looks" at a small part of the pattern at any one time. The window starts off at a length of 2n + 1 (n = max allowed edits). As discussed earlier, this length guarantees that the automaton can slide the window forward in time, before the input reaches the end of the window.

The window starts at the beginning of the pattern string and slides forward, keeping the start of the window aligned with the earliest possible pattern character.

As implied by the name, Parameterized Levenshtein Automata can be re-used across words because the state transitions do not depend on actual characters. The NFA state transitions simply refer to a Character Index. Hence the NFA states shown below for edit distance 1, for example, are almost identical to those of a regular Levenshtein NFA for the pattern "012".

To re-iterate: for a parameterized NFA, ['0', '1', '2'] are not literal characters, but indices (charIndex) into the sliding window where the matching character can be found. This indirection is where the concept of "parameterization" comes from. Specifically, the matching character is found at window[charIndex], which can be rephrased as pattern[windowOffset + charIndex].

N.B. Arrow colors are solely for visual distinction

Parameterized NFA of distance=1

Sliding the Window

A visual example of how the window slides forward during matching is shown below. Observe the NFA on the left, and assume that after a transition the automaton is in one of the red states.

Sliding the window involves not only moving the window forward in the pattern, but also shifting the set of active states back by 1, resulting in the NFA on the right.

Sliding NFA

In mathematical terms, we maintain pattern[windowOffset + charIndex] by incrementing windowOffset and decrementing charIndex.
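In code form this invariant is just index arithmetic. A quick standalone check (plain string indexing, nothing automaton-specific):

```csharp
using System;

var pattern = "here";

// An NFA state at charIndex 2 with the window starting at offset 0
// refers to pattern[0 + 2] == 'r'.
// After sliding the window forward by 1, the same state is expressed
// as charIndex 1, and still refers to pattern[1 + 1] == 'r'.
var before = pattern[0 + 2];
var after = pattern[1 + 1];

Console.WriteLine(before == after); // True: both refer to 'r'
```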

Example: "here" or "there"

Consider the diagram below, showing how a Parameterized Automaton can match words within 1 edit of "here" and how it behaves when processing the input string "there".

NFA walkthrough of here vs there

Each box shows a single character being processed. The starting states for the step are green, and the matching transitions and end states for that step are marked in blue. Final states have a red border.

  1. The window starts as "her" and the input is 't'.
    • 't' could be inserted (c0 e1) or a typo for 'h' (c1 e1).
    • The window does not move forward as we could be in state (c0 e1).
  2. The window is still "her" and the next input is 'h'
    • (c0 e1) has a matching transition to (c1 e1)
    • (c1 e1) has no matching transition.
    • The window can slide 1 forward at this point since we know we have moved past the (c0 e*) nodes.
  3. The window is now "ere" and the next input is 'e'
    • Notice how sliding the window moved us back to state (c0 e1)
    • (c0 e1) has a matching transition to (c1 e1)
    • The window can slide 1 forward at this point since we know we have moved past the (c0 e*) nodes.
  4. The window is now "re" and the next input is 'r'
    • Notice how sliding the window means that the window size must be reduced since we're at the end of the pattern.
    • (c0 e1) has a matching transition to (c1 e1)
    • The window can slide 1 forward at this point since we know we have moved past the (c0 e*) nodes.
  5. The window is now "e" and the next input is 'e'
    • (c0 e1) has a matching transition to (c1 e1)
    • The window can slide 1 forward at this point since we know we have moved past the (c0 e*) nodes.
    • As we have consumed all input characters and are in a final state, the word matches.
  6. Although unused in this example, the NFA remains valid if we arrive in (c0 e0) and still have another character to read.
    • In such a scenario, moving to (c0 e1) is a valid transition.

Parameterized Final States

Aside from the transitions, the "finality" of NFA states is also parameterized. Whether a state is Accepting / Final or not depends on the window size instead of the pattern length. The formula is otherwise identical.
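A state's acceptance can thus be sketched as a pure function of (c, e) and the current window size. This is a standalone restatement of the formula, not the article's class:

```csharp
using System;

// Accept when the remaining edit budget (maxEdits - e) is enough to
// delete whatever is left of the window beyond position c.
static bool IsAccepting(int c, int e, int maxEdits, int windowSize)
    => c + (maxEdits - e) >= windowSize;

// With maxEdits = 1:
Console.WriteLine(IsAccepting(c: 1, e: 1, maxEdits: 1, windowSize: 1)); // True - walkthrough step 5
Console.WriteLine(IsAccepting(c: 0, e: 1, maxEdits: 1, windowSize: 1)); // False - one character left, no budget
Console.WriteLine(IsAccepting(c: 0, e: 0, maxEdits: 1, windowSize: 0)); // True - an empty window always accepts
```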

Show me the code!

If you've been following along in the series, here is the list of code changes to make the NFA Parameterized:

  1. NfaTransition has property int? CharIndex instead of char? Symbol.
  2. NfaState has property int maxWindowSize instead of string pattern.
  3. IsAccepting is now a method accepting parameter int currentWindowSize, using that instead of pattern.Length

The full code listing below is for reference. Check out the previous article for a detailed explanation.

/// <summary>
/// Represents a transition from an NFA state.
/// If the input character matches the pattern character at index <see cref="CharIndex"/>,
/// move the DFA to state c<see cref="C"/> e<see cref="E"/>.
/// </summary>
public record struct NfaTransition(int? CharIndex, int C, int E);

/// <summary>
/// Represents a single state in the Levenshtein NFA
/// </summary>
public class NfaState(int maxWindowSize, int maxEditDistance, int c, int e)
{
    /// <summary>
    /// Number of characters consumed so far
    /// </summary>
    public int C => c;

    /// <summary>
    /// Number of errors seen so far
    /// </summary>
    public int E => e;

    /// <summary>
    /// Returns true to match input strings ending in this state
    /// </summary>
    public bool IsAccepting(int currentWindowSize) => C + (maxEditDistance - E) >= currentWindowSize;

    public IEnumerable<NfaTransition> GetTransitions()
    {
        if (C < maxWindowSize)
        {
            // Matches the expected character without incrementing the error value.
            yield return new NfaTransition(C, C + 1, E);
        }

        if (E < maxEditDistance)
        {
            // Insertion
            yield return new NfaTransition(null, C, E + 1);
        }

        if (C < maxWindowSize && E < maxEditDistance)
        {
            // Change
            yield return new NfaTransition(null, C + 1, E + 1);
        }

        for (var deletions = 1; deletions <= maxEditDistance; deletions++)
        {
            if (C + deletions < maxWindowSize && E + deletions <= maxEditDistance)
            {
                // Delete {deletions} characters
                yield return new NfaTransition(C + deletions, C + 1 + deletions, E + deletions);
            }
        }
    }
}

The Nfa class prepares the NFA states in an array. Instead of converting the NFA directly to a DFA (as in the last article), it exposes a GetDfaFactory method. This method caches the result of powerset construction and returns a factory method which can specialize these states to any provided pattern very quickly.

The distinction between returning a DFA vs a DFA Factory is critical. It reveals how Parameterized Levenshtein Automata improve upon regular Levenshtein Automata.

public class Nfa(int editDistance)
{
    private readonly NfaState[,] _states = CreateEmptyStates(editDistance); // Indexed via [c, e]

    private static NfaState[,] CreateEmptyStates(int maxEditDistance)
    {
        var windowSize = 2 * maxEditDistance + 1;
        var result = new NfaState[windowSize + 1, maxEditDistance + 1];

        // To each array location assign the right NfaNode
        for (var c = 0; c < result.GetLength(0); c++)
        {
            for (var e = 0; e < result.GetLength(1); e++)
            {
                result[c, e] = new NfaState(windowSize, maxEditDistance, c, e);
            }
        }

        return result;
    }

    /// <summary>
    /// Creates a factory for Levenshtein Automata for this edit distance.
    /// The DFA states are computed once and re-used with minimal computation
    /// when a new pattern is provided.
    /// Re-using the DFA states is what provides the biggest motivation for using Parameterized Levenshtein Automata.
    /// </summary>
    public Func<string, ParameterizedLevenshteinDfa> GetDfaFactory()
    {
        var dfaStates = new PowersetConstruction(_states, editDistance).GetDfaStates();
        return pattern => new ParameterizedLevenshteinDfa(dfaStates, editDistance, pattern);
    }
}

Characteristic Vectors

The DFA must also be modified to account for the parameterized states. Previously each DFA state mapped input characters to transitions - which is no longer possible in a Parameterized world. Parameterized DFA states must:

  1. Represent combinations of NFA states, not just a single one (just like in regular DFA).
  2. Be based on the charIndex of the underlying NFA states, described earlier.
  3. Account for repeating characters. Remember "cat" vs "egg".

This is how it works:

After reading an input character, it is compared against each window character. This results in a sequence of booleans; for example, reading 'p' with a window of "pop" results in [true, false, true]. This sequence can be represented as a uint where each bit represents a boolean. This uint is known as the Characteristic Vector.

Transitions in Parameterized Levenshtein Automata are labelled by Characteristic Vectors, not characters.

In C#:

// A method in the class `ParameterizedLevenshteinDfa`
// Examples (bits are RightToLeft)
// ('e', "egg") => 0b001
// ('g', "egg") => 0b110
private static uint CreateCharacteristicVector(char inputChar, ReadOnlySpan<char> window)
{
    uint result = 0;
    for (var i = 0; i < window.Length; i++)
    {
        if (inputChar == window[i])
        {
            result |= 1u << i;
        }
    }
    return result;
}
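The examples in the comments are easy to verify with a standalone copy of the function (bit i corresponds to window position i, so the binary literals read right-to-left):

```csharp
using System;

// Standalone copy of CreateCharacteristicVector for quick checks.
static uint CreateCharacteristicVector(char inputChar, ReadOnlySpan<char> window)
{
    uint result = 0;
    for (var i = 0; i < window.Length; i++)
    {
        if (inputChar == window[i])
        {
            result |= 1u << i;
        }
    }
    return result;
}

Console.WriteLine(CreateCharacteristicVector('e', "egg") == 0b001); // True
Console.WriteLine(CreateCharacteristicVector('g', "egg") == 0b110); // True
Console.WriteLine(CreateCharacteristicVector('p', "pop") == 0b101); // True: [true, false, true]
```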

DFA Diagram

The diagram below shows the DFA for edit distance 1. It's actually limited to window size 3; there are further variants (not shown) which handle window sizes 0, 1, and 2.

The DFA states are labelled by the combination of NFA states they represent. Each transition is labelled by multiple possible characteristic vectors. Each characteristic vector is a 3-digit binary number followed by a window slide factor. The window slide factor is simply the number of spaces forward the window must move after following a transition.

The binary number is in the form abc where a == 1 if the input character matches the first window character, b == 1 if it matches the second, and c == 1 if it matches the third. For example, if the input character is r and the pattern window is rre then the vector is 110. N.B. The binary numbers have been inverted to make this easier to follow.

Below the diagram is an example which will help clarify what's going on.

DFA of edit distance 1 (limited to d=3)

We will re-use the example of an automaton which represents "frog" and an input string "rog".

  1. 'r' vs "fro"
    • Characteristic Vector: 010
    • End State: c0_e1 c1_e1 c2_e1 (the big one in the middle)
    • Slide Window By: 0
  2. 'o' vs "fro"
    • Characteristic Vector: 001
    • End State: c0_e1 (the bottom one)
    • Slide Window By: 3
  3. 'g' vs "g"
    • Characteristic Vector: 1
    • At this point we've moved past the d=3 limits of the diagram, but you should hopefully be able to see how we have correctly determined that there has been 1 error (c0 e1) so far, and trust that 'g' would match "g" without incurring another error if we had the diagram for d=1.

Code Walkthrough: Modelling the DFA

Transitions must keep track of the slide amount as well as the next state index. This is wrapped in a ParameterizedLevenshteinTransition struct.

public readonly struct ParameterizedLevenshteinTransition(int? nextStateIndex, int slideWindowBy)
{
    public int? NextStateIndex { get; } = nextStateIndex;

    public int SlideWindowBy { get; } = slideWindowBy;
}

The DFA state is encapsulated by ParameterizedLevenshteinDfaState, which defines different behaviours (...ByLength) depending on the remaining pattern window length. Since there are always 2 ^ windowLength transitions for each DFA state we allow direct indexing into the transition array.

/// <summary>
/// Represents a state which can be reached by an Automaton.
/// </summary>
public class ParameterizedLevenshteinDfaState(int maxWindowLength)
{
    /// <summary>
    /// The state behaves differently depending on the length of the window.
    /// Index N represents how the state behaves for a window of length N.
    /// </summary>
    public ParameterizedLevenshteinDfaStateByLength[] ByLength = new ParameterizedLevenshteinDfaStateByLength[1 + maxWindowLength];
}

public class ParameterizedLevenshteinDfaStateByLength(int windowLength)
{
    /// <summary>
    /// Whether the state is accepting (inputs ending on this state are considered a match).
    /// </summary>
    public bool IsAccepting { get; set; }

    /// <summary>
    /// The transitions, indexed by CharacteristicVector.
    /// For example if the window length is 2 then the indices are:
    /// - Index 0 represents 0b00
    /// - Index 1 represents 0b01
    /// - Index 2 represents 0b10
    /// - Index 3 represents 0b11
    /// </summary>
    public ParameterizedLevenshteinTransition[] Transitions { get; } = new ParameterizedLevenshteinTransition[(int)Math.Pow(2, windowLength)];
}

The states are encapsulated by ParameterizedLevenshteinDfa. Also encapsulated are the edit distance, used for calculating window length, as well as the pattern itself. The pattern is required for matching because the state transitions make reference to the index in the pattern window.

The usage of ParameterizedLevenshteinDfa is identical to FsaState described in the first article in this series.

/// <summary>
/// Represents an automaton which exists in a single state
/// </summary>
public class ParameterizedLevenshteinDfa(ParameterizedLevenshteinDfaState[] states, int editDistance, string pattern)
{
    // A DFA must always be in exactly 1 state.
    // We use the convention that the first state (index 0) is the initial state.
    private int _currentStateIndex = 0;

    private int _windowStartIndex = 0;

    private int MaxWindowLength => 2 * editDistance + 1;

    private static uint CreateCharacteristicVector(char inputChar, ReadOnlySpan<char> window) => ...; // Detailed above.

    /// <summary>
    /// Advances the automaton based on the input symbol.
    /// </summary>
    public bool Transition(char input) => ...; // Detailed next

    /// <summary>
    /// If true then the automaton is in a state representing successful match.
    /// </summary>
    public bool IsAccepting => ...; // Detailed next
}

To keep the public API compatible with FsaState, the Clone and Reset methods work in exactly the same way.

/// <summary>
/// Returns a new automaton ready to match a new string with the same pattern.
/// </summary>
public ParameterizedLevenshteinDfa Reset()
{
    var clone = new ParameterizedLevenshteinDfa(states, editDistance, pattern);
    clone._currentStateIndex = 0;
    clone._windowStartIndex = 0;
    return clone;
}

/// <summary>
/// Returns a new automaton in the same state as this one.
/// </summary>
public ParameterizedLevenshteinDfa Clone()
{
    var clone = new ParameterizedLevenshteinDfa(states, editDistance, pattern);
    clone._currentStateIndex = _currentStateIndex;
    clone._windowStartIndex = _windowStartIndex;
    return clone;
}

Finally the Transition and IsAccepting methods are listed below. They use a shared helper method GetCurrentState which determines the remaining window length to find the relevant variant of the DFA state.

/// <summary>
/// Resolves the current state depending on the window length.
/// </summary>
private void GetCurrentState(out ParameterizedLevenshteinDfaStateByLength state, out ReadOnlySpan<char> window)
{
    var remainingPatternChars = pattern.Length - _windowStartIndex;
    var windowLength = Math.Min(MaxWindowLength, remainingPatternChars);

    window = pattern.AsSpan(_windowStartIndex, windowLength);
    state = states[_currentStateIndex].ByLength[windowLength];
}

/// <summary>
/// Advances the automaton based on the input symbol.
/// </summary>
public bool Transition(char input)
{
    const int invalidState = -1;

    GetCurrentState(out var state, out var window);

    var characteristicVector = CreateCharacteristicVector(input, window);
    var transition = state.Transitions[characteristicVector];

    if (transition.NextStateIndex is null)
    {
        _currentStateIndex = invalidState;
        return false;
    }

    _currentStateIndex = transition.NextStateIndex.Value;
    _windowStartIndex += transition.SlideWindowBy;
    return true;
}

/// <summary>
/// If true then the automaton is in a state representing successful match.
/// </summary>
public bool IsAccepting
{
    get
    {
        GetCurrentState(out var state, out _);

        return state.IsAccepting;
    }
}

Powerset Construction for Parameterized Levenshtein Automata

As a recap, powerset construction creates a DFA state for every possible combination of NFA states. For each one of these DFA states, it finds what the next DFA state would be when encountering the next input symbol.

DFA state transitions are labelled by Characteristic Vectors. As we are dealing with sliding windows, there will come a point where the window is longer than the remaining pattern, at which point it gets truncated. Hence for edit distance n, while the maximum window size is 2n + 1, powerset construction must model how each DFA state behaves for every window size from 0 up to 2n + 1.

N.B. As we have already seen, zero is a valid window size: it occurs when there are no more pattern characters but there are still input symbols to process.

To model how a DFA state behaves for a given window size, we must consider every possible characteristic vector. For a window size w, the characteristic vector is w bits long, meaning there are 2 ^ w possible vectors.

Powerset construction consults the NFA to determine the next state, given that characteristic vector.

In addition to pointing to the next state, transitions also keep track of how many spaces forward the window should move. This turns out to be equal to the minimum C value in the "destination" NFA states.
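As a sanity check on the sizes involved: each DFA state stores one transition slot per characteristic vector, for every window size from 0 to 2n + 1. Summed up, that's 2^(2n + 2) - 1 slots per state, which multiplied by the state counts reproduces the transition totals in the table in the Serialization section:

```csharp
using System;

// Each DFA state has one transition slot per characteristic vector,
// for every window size w = 0 .. 2n + 1.
static int TransitionsPerState(int n)
{
    var total = 0;
    for (var w = 0; w <= 2 * n + 1; w++)
    {
        total += 1 << w; // 2^w vectors of length w
    }

    return total; // equals 2^(2n + 2) - 1
}

Console.WriteLine(TransitionsPerState(1));      // 15
Console.WriteLine(5 * TransitionsPerState(1));  // 75: matches the n = 1 row (5 states)
Console.WriteLine(30 * TransitionsPerState(2)); // 1890: matches the n = 2 row (30 states)
```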

Optimization: State Reduction

Consider a situation where the automaton is in two states: [(c: 0, e: 0), (c: 0, e: 1)]. In other words, the second state is at the same position as the first, but has "burned" an extra edit along the way. The first state can reach any state which the second can reach.

This multi-state is no different to simply being in [(c: 0, e: 0)]. In the literature, it is said that the former state "subsumes" the latter.

Subsumption reduces the possible combinations of states to those where every state adds at least something.

The above example is a single one - the full subsumption relationship is as follows:

Assuming two NFA states (c1, e1) and (c2, e2), the former subsumes the latter if:

e2 > e1 && (e2 - e1) >= Abs(c2 - c1)

The article is already very long so I won't be going into the analytical reason why this works. It's also not essential to understanding the algorithm, although professional implementations all do this.
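For intuition, here is the relation as a standalone predicate together with a few example pairs (the first pair of numbers is the would-be subsuming state):

```csharp
using System;

// (c1, e1) subsumes (c2, e2): anything the latter could still match,
// the former can match too, so the latter adds nothing.
static bool Subsumes(int c1, int e1, int c2, int e2)
    => e2 > e1 && (e2 - e1) >= Math.Abs(c2 - c1);

Console.WriteLine(Subsumes(0, 0, 0, 1)); // True: same position, extra edit burned
Console.WriteLine(Subsumes(0, 0, 1, 1)); // True: (0,0) can reach (1,1) via a substitution
Console.WriteLine(Subsumes(0, 0, 2, 1)); // False: two characters ahead with only one extra edit
Console.WriteLine(Subsumes(0, 1, 0, 0)); // False: the lower-error state is never discarded
```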

Code Walkthrough: Powerset Construction

For Parameterized Levenshtein NFA we will modify the algorithm to account for the fact that each DFA state has:

  1. A "variant" for every possible window size (0 to 2n + 1 inclusive, where n = max edit distance)
  2. For each "variant" there are as many transitions as Characteristic Vectors (2 ^ w where w is window size)

The main loop for the algorithm barely changes from last time; a few types have been renamed and the computed DFA states are now returned directly. The Nfa class shown earlier stores and re-uses these DFA states to avoid repeating powerset construction.

In fact, if you've read the previous article, you can skip this next code block and go straight to the new implementation of ProcessNext.

public class PowersetConstruction(NfaState[,] states, int editDistance)
{
    /// <summary>
    /// Maps from a unique string representing the combination of states to the index in `dfaStates`
    /// </summary>
    private readonly Dictionary<string, int> _dfaMap = new ();

    /// <summary>
    /// A full list of all DFA states encountered so far
    /// </summary>
    private readonly List<ParameterizedLevenshteinDfaState> _dfaStates = new ();

    /// <summary>
    /// Unprocessed DFA states to be processed.
    /// </summary>
    private readonly Queue<QueueItem> _queue = new ();

    private record struct QueueItem(int DfaIndex, (int c, int e)[] NfaStates);

    private int MaxWindowLength => 2 * editDistance + 1;

    /// <summary>
    /// Maps a combination of NFA states to an index into `_dfaStates`.
    /// Creates an empty DFA state if this combination has not yet been encountered.
    /// </summary>
    private int MapNfaStatesToDfaIndex((int c, int e)[] nfaStateRefs)
    {
        // Create a string form of this unique combination to use as a dictionary key
        var key = string.Join(":", nfaStateRefs
            .OrderBy(static state => state.c)
            .ThenBy(static state => state.e)
            .Select(static state => $"({state.c}, {state.e})"));

        if (!_dfaMap.TryGetValue(key, out var dfaIndex))
        {
            // We've encountered this combination for the first time.
            // Create an entry in _dfaStates and update _dfaMap to
            // find it easily next time.

            dfaIndex = _dfaStates.Count;

            // Ensure dfaMap and dfaStates are kept in-sync.
            Debug.Assert(_dfaStates.Count == _dfaMap.Count);
            _dfaMap.Add(key, dfaIndex);
            _dfaStates.Add(new ParameterizedLevenshteinDfaState(MaxWindowLength));

            // Add this combination to the processing queue.
            _queue.Enqueue(new QueueItem(dfaIndex, nfaStateRefs));
        }

        return dfaIndex;
    }

    public ParameterizedLevenshteinDfaState[] GetDfaStates()
    {
        // "Encounter" the initial state
        MapNfaStatesToDfaIndex([(c: 0, e: 0)]);

        while (_queue.Count > 0)
        {
            var queueItem = _queue.Dequeue();

            ProcessNext(queueItem);
        }

        return _dfaStates.ToArray();
    }

    private void ProcessNext(QueueItem queueItem)
    {
        ... // Detailed below
    }
}

The ProcessNext method below contains some of the most complex logic so far, as it draws together all the concepts and rationale we have discussed.

private void ProcessNext(QueueItem queueItem)
{
    var (dfaIndex, nfaStateRefs) = queueItem;

    var dfaState = _dfaStates[dfaIndex];

    for (var windowLength = 0; windowLength <= MaxWindowLength; windowLength++)
    {
        // Process each length variant separately
        var stateByLength = dfaState.ByLength[windowLength] = new ParameterizedLevenshteinDfaStateByLength(windowLength);
        ProcessNextInternal(nfaStateRefs, stateByLength, windowLength);
    }
}

private void ProcessNextInternal((int c, int e)[] nfaStateRefs, ParameterizedLevenshteinDfaStateByLength dfaState, int windowLength)
{
    // A DFA state is accepting if any of the NFA states are accepting
    dfaState.IsAccepting = nfaStateRefs
        .Select(stateRef => states[stateRef.c, stateRef.e])
        .Any(nfaState => nfaState.IsAccepting(windowLength));

    // We predict what will happen for any possible characteristic vector of this length
    foreach (uint characteristicVector in GetAllCharacteristicVectorsForWindowLength(windowLength))
    {
        var destinationStates = FollowNfaStateTransitions(nfaStateRefs, windowLength, characteristicVector);

        if (destinationStates.Count == 0)
        {
            continue;
        }

        // Remove any states which are subsumed by another state
        destinationStates.RemoveWhere(candidate
            => destinationStates.Any(other 
                => Subsumes(other, candidate)));

        // Shift the window forward by the minimum 'c' state value
        var slideWindowBy = destinationStates.Min(s => s.c);

        var shiftedDestinationStates = destinationStates
            .Select(s => (c: s.c - slideWindowBy, e: s.e))
            .ToArray();

        var nextStateIndex = MapNfaStatesToDfaIndex(shiftedDestinationStates);
        dfaState.Transitions[characteristicVector] = new(
            nextStateIndex: nextStateIndex,
            slideWindowBy: slideWindowBy);
    }
}

/// <summary>
/// Given a combination of starting NFA states,
/// follows the state transitions matching the
/// characteristic vector and returns the resulting
/// states in a HashSet.
/// </summary>
private HashSet<(int c, int e)> FollowNfaStateTransitions((int c, int e)[] nfaStateRefs, int windowLength, uint characteristicVector)
{
    // This is the set of end states for this transition.
    // We use a HashSet to avoid duplicates.
    var destinationStates = new HashSet<(int c, int e)>();

    // For each underlying state...
    foreach (var nfaState in nfaStateRefs.Select(stateRef => states[stateRef.c, stateRef.e]))
    {
        // ... check which transitions match
        foreach (var nfaTransition in nfaState.GetTransitions())
        {
            if (nfaTransition.C > windowLength)
            {
                // Not applicable
                continue;
            }

            bool isMatch = false;

            if (nfaTransition.CharIndex is null)
            {
                // This transition matches all characters
                isMatch = true;
            }
            else if (DoesCharacteristicVectorMatch(characteristicVector, nfaTransition.CharIndex.Value))
            {
                isMatch = true;
            }

            if (isMatch)
            {
                // We have found a possible next state, add it to the list
                destinationStates.Add((c: nfaTransition.C, e: nfaTransition.E));
            }
        }
    }

    return destinationStates;
}

private static IEnumerable<uint> GetAllCharacteristicVectorsForWindowLength(int windowLength)
{
    uint maxVector = (1u << windowLength) - 1;
    for (uint i = 0; i <= maxVector; i++)
    {
        yield return i;
    }
}

private static bool DoesCharacteristicVectorMatch(uint characteristicVector, int charIndex)
    // Check the relevant bit is set
    => (characteristicVector & (1u << charIndex)) is not 0u;

/// <summary>
/// Returns true if the first state subsumes the latter i.e.
/// the latter state is useless in the presence of the former.
/// </summary>
private static bool Subsumes(
    (int c, int e) former,
    (int c, int e) latter)
{
    var (c1, e1) = former;
    var (c2, e2) = latter;
    return e2 > e1 && (e2 - e1) >= Math.Abs(c2 - c1);
}

Benchmarks

Before continuing, it's worth taking stock of what has already been achieved.

The code above is able to generate a DFA for any edit distance. The DFA can be re-used across patterns, effectively making the powerset construction a one-time operation. Hence the following benchmarks will not run the costly powerset construction logic†. Instead the benchmarks will only incur the small cost of specializing the DFA to a specific pattern.

† N.B. Another reason we won't run powerset construction is that real-world applications like Lucene don't actually run powerset construction at all at runtime! Intrigued? Continue reading to find out more...

The benchmarks are, as usual, to search a dataset of 450,000 English words for ones within a certain edit distance of a specific pattern.

The 2 strategies to test are:

  • Quickenshtein: A naive loop using Quickenshtein, one of the fastest OSS .NET Levenshtein Distance calculators.
  • Blog: Runtime-generated (Parameterized) Levenshtein Automaton (as above) with a pre-indexed Trie.

Method                 Scenario                      Words     Mean
Naive Quickenshtein    "hello", edits = 1            1000      127,692.4 ns
Automata               "hello", edits = 1            1000      584.0 ns
Naive Quickenshtein    "hello", edits = 1            450000    58,820,725.9 ns
Automata               "hello", edits = 1            450000    49,694.7 ns
Naive Quickenshtein    "parallelogram", edits = 3    1000      189,514.5 ns
Automata               "parallelogram", edits = 3    1000      8,056.2 ns
Naive Quickenshtein    "parallelogram", edits = 3    450000    97,108,883.3 ns
Automata               "parallelogram", edits = 3    450000    6,400,798.8 ns

The results come with the usual warning that the numbers should be interpreted relative to each other, and absolute values will vary widely across systems.

The naive approach, like last time, exhibits roughly linear performance in the number of characters being compared. The longer query takes very roughly twice as long, and the larger dataset also takes proportionally longer.

Unlike last time, the Automata approach scales beautifully. It's consistently much faster than the naive approach - and remember, this is all with blog-quality code!

Although a full fuzzy search has been implemented, there are a few loose ends to tie up to really show the beauty of Parameterized Levenshtein Automata. The next section is going to blow your mind.

Serialization

During the benchmarks section I hinted that powerset construction is never actually run at all in search engines.

Study the following pseudocode block and see if you can guess how it can be optimized:

// Create the DFA once
private static readonly Func<string, ParameterizedLevenshteinDfa> dfaFactoryForEditDistance1 = new Nfa(1).GetDfaFactory();

bool Search(string pattern)
{
    // Specialize the DFA
    var automaton = dfaFactoryForEditDistance1(pattern);
    return trie.Search(automaton).Any();
}

If you guessed that the DFA creation does not need the pattern, and hence can be pre-computed and serialized, you'd be correct. In such a scheme, the application need only deserialize the DFA at runtime instead of running powerset construction. This technique makes the cost of creating the DFA insignificant.

This article is already very long and the serialization logic will depend on the implementation, so I won't be showing any code. However, it is interesting that the authors of the original paper took this very literally and printed out the entire DFA for edit distance 1, in black-and-white.
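That said, the core idea is small enough to sketch: a round-trip of the transition tables. The following toy uses flat int arrays and BinaryWriter - emphatically not any real engine's format:

```csharp
using System;
using System.IO;

// Serialize a transition table (nextStateIndex per characteristic
// vector, -1 for "no transition") and read it back. A real engine
// would also store slide amounts and per-window-size variants.
static byte[] Serialize(int[] transitions)
{
    using var stream = new MemoryStream();
    using var writer = new BinaryWriter(stream);
    writer.Write(transitions.Length);
    foreach (var t in transitions)
    {
        writer.Write(t);
    }
    writer.Flush();
    return stream.ToArray();
}

static int[] Deserialize(byte[] bytes)
{
    using var reader = new BinaryReader(new MemoryStream(bytes));
    var result = new int[reader.ReadInt32()];
    for (var i = 0; i < result.Length; i++)
    {
        result[i] = reader.ReadInt32();
    }
    return result;
}

var table = new[] { 1, -1, 2, 0 };
var roundTripped = Deserialize(Serialize(table));
Console.WriteLine(string.Join(",", roundTripped)); // 1,-1,2,0
```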

There is, however, a big limitation: the sheer number of states and transitions gets quite large. The table below shows just how quickly the DFA can grow in complexity. Serializing all that information comes at a cost, and different search engines make different decisions on how much to serialize. Lucene, for example, chooses to only serialize Automata for edit distances 1 and 2, hence it does not support edit distances larger than that. This decision can be justified by observing that it's rare to make more than 2 typos.

Edit Distance    Time Taken to Precompute    Num States    Num Transitions
1                0.01 s                      5             75
2                0.02 s                      30            1,890
3                0.27 s                      196           49,980
4                2.30 s                      1,353         1,384,119

Conclusion

In this article we explored the principles behind Parameterized Levenshtein Automata and used those principles to implement them in .NET.

We also saw how this algorithm has diminishing returns as the edit distance grows; the larger and larger DFAs are infeasible to serialize in advance because they take up so much space and memory.

This is why so many production fuzzy search engines cap edits at 2. Levenshtypo, the library I developed and maintain, gets around this and allows for relatively fast Levenshtein Automata of up to 30 edits. Levenshtypo powers Autypo, designed to be the idiomatic way to do autocomplete in .NET.

In the next post, I'll reveal how I managed to support edit distances up to 30 - far beyond what most systems attempt in practice.

Since the start of the series, we've gone from fuzzy searching Tries with hardcoded automata to building full Parameterized Levenshtein Automata. We've then seen how real production search engines hardcode (serialize) the parameterized DFAs after all.

This journey has hopefully given you a deeper look under the hood of how fuzzy search and typo tolerance really works in practice.

