Under the Hood of Fuzzy Search: Building a Search Engine 15 times fuzzier than Lucene

Typos are very common nowadays. When working in a fast-paced environment we expect computers to just understand what we mean intuitively, even if we didn't spell it out - sometimes literally in the form of spelling mistakes.

On this blog we've been through a deep dive into Levenshtein Automata, exploring how they are created using NFAs and Powerset Construction and finally diving deep into the Parameterized Levenshtein Automata to expose how professional search engines like Lucene handle fuzziness.

While Lucene can only fuzzy search with a maximum of 2 edits, today we review an algorithm which can handle up to 30 edits.

The algorithm was built while working on Levenshtypo, the library I developed and now maintain. It's used as the engine of Autypo, which I designed to be the idiomatic way to do autocomplete in .NET.

How it Works

In the lead-up to Parameterized Levenshtein Automata, I claimed that it is intractible for machines to execute NFA due to the potentially huge number of states. This claim is true if we assume that the machine were to consider each NFA state independently, as it would if the "active" NFA states were stored in a list.

Another way that a machine can represent an NFA is as bit strings, where each bit reflects whether the automaton is in a specific NFA state. For example for an automaton with 8 possible NFA states, if we wanted to represent that the currently active states are either 3 or 5, we would represent this as 0001 0100, which is a byte where the 3rd and 5th bit (right-to-left) are set.

In this world of bit strings, bitwise operations can be used to run NFA state transitions in parallel, slashing the running time of the algorithm down; critically no longer being related to active state count.

Encoding the NFA to Bits

Below is an example which regular readers will recall: a (non-paramerized) Levenshtein NFA representing 1 edit away from "et".

NFA Levenshtein Automaton for "et" distance 1

The red states represent the active states after processing 'e' as the first input character. To represent the states as bits, each row is represented as a separate sequence of 3 bits. The bits are show in most-significant order (right-to-left), such that the first bit 0 is at the right of the bit string. With this encoding, the bits representing the above NFA are:

010
011

The same encoding can be applied to Parameterized Levenshtein NFA. Consider the two below. If you were given a set of bits representing active states on the left, how would you transform them to the active states on the right?

Sliding NFA

The transformation can be done using a bitwise right shift operation. It will work regardless of which states are active. Remember to interpret bits right-to-left.

0010 >> 1 = 0001
0110 >> 1 = 0011

Recasting Automaton Operations as Bitwise Logic

We have just seen a bitwise right shift can be used to slide the window by a single character. In this section every action performed by Parameterized Levenshtein Automaton will be re-interpreted into bitwise operations.

The Transition function shown below computes the next NFA state based on the current NFA state and the Characteristic Vector. To do so, it first follows each transition (match, insert, substitute, delete) as a separate operation. Then, after some tidying-up, it checks if there are no active NFA state, in which case the input does not match. Just like in Parameterized Levenshtein Automata, it calculates how far to slide the pattern window.

/// <summary>
///     Given the previous state, the characteristic vector and the automaton's edit distance,
///     calculates the next state.
/// </summary>
/// <param name="prevState">
///     Current automaton state. Index 0 represents e = 0, and so on.
/// </param>
/// <param name="characteristicVector">
///     Characteristic vector. Bit 0 (least significant) represents c 0, and so on.
/// </param>
/// <param name="edits">
///     Number of edits supported by the automaton.
/// </param>
/// <param name="nextState">
///     Next automaton state. Start at 0, will be populated by this method.
/// </param>
/// <returns>
///   isMatch: True if any state is still active.
///   slideFactor: The number of characters to advance the window.
/// </returns>
private static (bool match, int slideFactor) Transition(
    uint[] prevState,
    uint characteristicVector,
    int edits,
    uint[] nextState)
{
    // Double-check a few invariants. Helps understand expectations.
    Debug.Assert(prevState.Length == nextState.Length);
    Debug.Assert(prevState.Length == edits + 1); // 0..e inclusive

    // Follow "Matching" transitions
    // TODO

    // Follow "Insert" transitions
    // TODO

    // Follow "Substitute" transitions
    // TODO

    // Follow "Delete" transitions
    // TODO

    // Mask Bitmaps 
    // TODO

    // Check that at least 1 state is active
    bool anyActive = ...; // TODO

    if (!anyActive)
    {
        return (false, slideFactor: default);
    }

    // Slide Window
    int slideFactor = ...; // TODO

    return (true, slideFactor);
}

Each // TODO is presented below as a small puzzle. If you want, try to work it out before looking at the answer. Its a great way to build intuition for how NFAs map onto bitwise logic.

Follow "Matching" transitions

The goal is to activate transitions from (c, e) to (c + 1, e) if the input character matches window[c].

Visual Example

The example diagram shows how the NFA on the left behaves when matching the Characteristic Vector 101. The black transition represent the "match" operation, the others represent other operations and can be ignored.

NFA bitwise matching characteristic vector

Bitwise Encoding

Edits: 1
CharacteristicVector: 010

PrevState   =>   NextState
[0]: 0001   =>   [0]: 0010
[1]: 0011   =>   [1]: 0010

Hint:

(c0, e0) matches on bit 0, activating (c1, e0)
(c0, e1) matches on bit 0, activating (c1, e1)
(c1, e1) has no matches

Answer

// Follow "Matching" transitions
for (var i = 0; i < nextState.Length; i++)
{
    nextState[i] |= (prevState[i] & characteristicVector) << 1;
}

Follow "Insert" transitions

The goal is to activate transitions from (c, e) to (c, e + 1) in all cases.

Visual Example

The example diagram shows how the NFA on the left behaves for the insert transitions (marked in black). Remember to ignore other (grey) transitions for this operation.

NFA bitwise insert

Bitwise Encoding

Edits: 1
CharacteristicVector: not relevant

PrevState   =>   NextState
[0]: 0011   =>   [0]: 0000
[1]: 0001   =>   [1]: 0011

Hint:

(c0, e0) activates (c0, e1)
(c1, e0) activates (c1, e1)
(c0, e1) has no insert transitions

Answer

// Follow "Insert" transitions
for (var i = 1; i < nextState.Length; i++)
{
    nextState[i] |= prevState[i - 1];
}

Follow "Substitute" transitions

The goal is to activate transitions from (c, e) to (c + 1, e + 1) in all cases.

Visual Example

The example diagram has been expanded to an NFA of edit distance 2. It shows how the NFA on the left behaves for the substitute transitions (marked in black). Remember to ignore other (grey) transitions for this operation.

NFA bitwise subsitute

Bitwise Encoding

Edits: 2
CharacteristicVector: not relevant

PrevState     =>   NextState
[0]: 000001   =>   [0]: 000000
[1]: 000110   =>   [1]: 000010
[2]: 000010   =>   [2]: 001100

Hint:

(c0, e0) activates (c1, e1)
(c1, e1) activates (c2, e2)
(c2, e1) activates (c3, e2)
(c1, e2) has no substitute transitions

Answer

// Follow "Substitute" transitions
for (var i = 1; i < nextState.Length; i++)
{
    nextState[i] |= prevState[i - 1] << 1;
}

Follow "Delete" transitions

This one is more complex than the previous, as each node has many delete transitions. For d = 1 .. edits, each node (c, e) activates transitions to (c + 1 + d, e + d) but only if the characteristic vector has bit c + d set.

We're looking to stick to a single for loop like the other operations, so try avoid nested loops if you can.

Visual Example

A visual example is much needed here. It shows how the NFA on the left behaves for the delete transitions (marked in black for 1 char deleted and orange for 2 chars deleted). Remember to ignore other (grey) transitions for this operation.

Notice how the (_, e0) states have 2 possible transitions. States with a higher e value would also have multiple transitions if the target states existed.

NFA bitwise delete

Bitwise Encoding

Edits: 2
CharacteristicVector: 01100

PrevState     =>   NextState
[0]: 000001   =>   [0]: 000000
[1]: 000101   =>   [1]: 000000
[2]: 000000   =>   [2]: 011000

Hint:

(c0, e0) activates (c3, e2)
(c0, e1) has a delete transition but it does not match the characteristic vector
(c2, e1) activates (c4, e2)

Answer

// Follow "Delete" transitions
uint deletionVector = prevState[0];
for (int i = 1; i < nextState.Length; i++)
{
    nextState[i] |= (deletionVector << 2) & (characteristicVector << 1);
    deletionVector = (deletionVector << 1) | prevState[i];
}

This one needs some explanation as it is trickier than the others.

Instead of using the previous states, we keep track of "deletion states". In the diagram, focus on a single state for example (c3, e2). It can be activated from either (c0, e0) and (c1, e1) with a vector matching 001000. Hence, for the purposes of determining row e2, as long as either of those states is active then (c3, e2) will be activated.

The answer above avoids a nested loop by "carrying forward" active states onto the next row, "for free", using (c, e) => (c + 1, e + 1). From there, we can simply deletion to (c, e) activating (c + 2, e + 1), if it matches character c + 1.

In other words, (c0, e0) first silently activates (c1, e1), which in turn activates the deletion transition to (c3, e2).

Mask Bitmap

Edit distance 1 requires a bitstring of length 4, 2 of length 6 and so on. As bits are grouped into 8 (byte, ushort, uint, ulong) there is a chance that an "invalid" bit might be set.

It's best to just clear any bits which are not part of the bitstring, as we would not want to accidentally move it back into the window when sliding it.

Bitwise Encoding

Edits: 2
CharacteristicVector: not relevant

NextState (before) =>   NextState (after)
[0]: 00 000001     =>   [0]: 00 000001
[1]: 00 000101     =>   [1]: 00 000101
[2]: 01 010000     =>   [2]: 00 010000

Answer

// Mask Bitmaps
var windowLength = 2 * edits + 2; // 0 .. 2n+1 (inclusive)
uint mask = (1u << windowLength) - 1; // All bits in windowLength set to 1
for (int i = 0; i < nextState.Length; i++)
{
    nextState[i] &= mask;
}

Check >= 1 State is Active

A straightforward implementation of this would be to check whether any states are non-zero.

Can you think of a bitwise operation to combine the states in a for loop, making a if check after the loop?

Bitwise Encoding

Edits: 2
CharacteristicVector: not relevant

NextState Example 1, AnyActive = true
[0]: 000001
[1]: 000101
[2]: 010000

NextState Example 2, AnyActive = false
[0]: 000000
[1]: 000000
[2]: 000000

Answer

// Check that at least 1 state is active

uint combined = 0u;
for (int i = 0; i < nextState.Length; i++)
{
    combined |= nextState[i];
}

bool anyActive = combined is not 0u;
if (!anyActive)
{
    return (false, slideFactor: default);
}

Slide Window

This operation must first determine the number of characters to advance the window. The state must then be updated to reflect the sliding window, a bitwise operation which we have already seen.

Bitwise Encoding

Edits: 2
CharacteristicVector: not relevant

NextState (before) =>   NextState (after)
[0]: 000010        =>   [0]: 000001
[1]: 001010        =>   [1]: 000101
[2]: 100000        =>   [2]: 010000

Answer

// Slide Window

uint combined = 0u; // same loop as the last puzzle
for (int i = 0; i < nextState.Length; i++)
{
    combined |= nextState[i];
}

// TrailingZeroCount counts the number of `0`s at the end of the combined state flag.
int slideFactor = (int)uint.TrailingZeroCount(combined);

if (slideFactor > 0)
{
    for (int i = 0; i < nextState.Length; i++)
    {
        nextState[i] >>= slideFactor;
    }
}

Putting it Back Together

It would be sufficient to just copy the various answers back into the Transition function. In addition to that, the listing below merges the for loops together for matching, inserting, substituting, deleting, masking, and buliding the combined state.

int windowLength = 2 * edits + 2; // 0 .. 2n+1 (inclusive)
uint mask = (1u << windowLength) - 1; // All bits in windowLength set to 1

uint deletionVector = 0u;
uint combined = 0u;

for (var i = 0; i < nextState.Length; i++)
{
    uint next = (prevState[i] & characteristicVector) << 1; // Match transition

    if (i > 0) // Hoist i == 0 out of the loop to avoid this "if test"
    {
        next |= prevState[i - 1]; // Insert transition
        next |= prevState[i - 1] << 1; // Substitute transition
        next |= (deletionVector << 2) & (characteristicVector << 1); // Delete transition
    }

    deletionVector = (deletionVector << 1) | prevState[i];

    next &= mask;
    nextState[i] = next;
    combined |= next;
}

if (combined is 0u)
{
    return (false, slideFactor: default);
}

// TrailingZeroCount counts the number of `0`s at the end of the combined state flag.
int slideFactor = (int)uint.TrailingZeroCount(combined);

if (slideFactor > 0)
{
    for (int i = 0; i < nextState.Length; i++)
    {
        nextState[i] >>= slideFactor;
    }
}

return (true, slideFactor);

Creating the DFA

The transition function we created above contains most of the logic we need. It does, however, need to be encapsulated correctly to be usable as a DFA.

In this section we encapsulate the logic into a BitwiseLevenshteinAutomaton class. It's given the editDistance and the pattern upon creation, and in turn initializes the current state as a uint[] with (c0, e0) set, as well as aligning the window to the first pattern character.

public class BitwiseLevenshteinAutomaton(int editDistance, string pattern)
{
    private uint[] _currentState = InitialState(editDistance);
    private int _windowStartIndex = 0;

    private static uint[] InitialState(int editDistance)
    {
        var result = new uint[editDistance + 1];
        result[0] = 1u;
        return result;
    }

    public bool Transition(char inputChar) => ...; // TODO

    public bool IsAccepting => ...; // TODO
}

As always for DFAs in this series so far, we add the Reset and Clone methods.

/// <summary>
/// Returns a new automaton ready to match a new string with the same pattern.
/// </summary>
public BitwiseLevenshteinAutomaton Reset() => new BitwiseLevenshteinAutomaton(editDistance, pattern);

/// <summary>
/// Returns a new automaton in the same state as this one.
/// </summary>
public BitwiseLevenshteinAutomaton Clone()
{
    var clone = new BitwiseLevenshteinAutomaton(editDistance, pattern);
    clone._currentState = (uint[])_currentState.Clone();
    clone._windowStartIndex = _windowStartIndex;
    return clone;
}

Transition

The public overload of Transition accepts an input character to modify the internal state then return whether the input string can still match.

It is implemented by building up the parameters to the private Transition method.

The pattern is first narrowed to the window using the offset and length. inputChar is then compared to each of the window characters, creating the Characteristic Vector.

From there the rest of the method should be relatively easy to follow.

public bool Transition(char inputChar)
{
    // First we build the characteristic vector
    var remainingPatternChars = pattern.Length - _windowStartIndex;
    var windowLength = Math.Min(2 * editDistance + 1, remainingPatternChars);
    var window = pattern.AsSpan(_windowStartIndex, windowLength);
    var characteristicVector = CreateCharacteristicVector(inputChar, window);

    // Initialize the next state
    var nextState = new uint[editDistance + 1];
    
    // Run the transitions
    var (isMatch, slideFactor) = Transition(_currentState, characteristicVector, editDistance, nextState);

    // Update the internal state
    _currentState = nextState;
    _windowStartIndex += slideFactor;
    return isMatch;
}

// Examples (bits are RightToLeft)
// ('e', "egg") => 0b001
// ('g', "egg") => 0b110
private static uint CreateCharacteristicVector(char inputChar, ReadOnlySpan<char> window)
{
    uint result = 0;
    for (var i = 0; i < window.Length; i++)
    {
        if (inputChar == window[i])
        {
            result |= 1u << i;
        }
    }
    return result;
}

IsAccepting

A state is accepting if it reaches the final column - or if it can consume remaining edits to reach it.

As an example, consider the diagram below which shows the state after the input "her" is matched against the pattern "here" (the final 'e' is missing).

We would also reach this state after processing the first three characters of the input "hers". However, in this case the automaton would then process the 's' and follow the (black) diagonal transition to the final state (c1, e1).

Both of these cases are equally 1 edit distance - "her" is missing a letter and "hers" has a substituted letter.

To simulate the missing input characters, the automaton can activate the substitution transitions anyway. This allows the automaton to reach a final state even when missing the final characters.

Hence: Before testing the final states, we recursively activate (c, e) => (c + 1, e + 1) to account for trailing deletions.

NFA "e" distance 1

/// <summary>
/// If true then the automaton is in a state representing successful match.
/// </summary>
public bool IsAccepting
{
    get
    {
        var remainingPatternChars = pattern.Length - _windowStartIndex;
        if (remainingPatternChars > 2 * editDistance + 1)
        {
            // Too far from the end to possibly accept
            return false;
        }
        
        uint finalVector = 0;
        for (var i = 0; i < _currentState.Length; i++)
        {
            // Recursively activate the diagonal transitions
            finalVector = (finalVector << 1) | _currentState[i];
        }

        // Acceptance requires reaching at least the last column
        uint threshold = 1u << remainingPatternChars;
        return finalVector >= threshold;
    }
}

Implementation Notes

The above code is a complete implementation of the bit-parallel execution of the Levenshtein NFA. There are, however, a few implementation details which make it sub-optimal for real use-cases.

The bit strings were all encoded within 32-bit uints. Edit distance of 1 to 3 can fit in a byte, 4 to 7 in a ushort, 8 to 15 in a uint, and 16 to 30 in a ulong. Practical implementations should carefully choose the right data type for the use-case.

The NFA state was represented as arrays (uint[]) which in .NET are costly heap allocations subject to garbage collection. A practical implementation should attempt to reduce these and allocate them on the stack instead.

In Levenshtypo, a .NET implementation, both of these concerns were solved. Generic Math was used to parameterize the data type (byte vs ushort vs ...) and Inline Buffers to allocate the NFA state on the stack as a struct. Hence Levenshtypo will be used in the coming benchmarks, as opposed to the blog-quality code shown above.

Evaluating The Claim

15 times fuzzier than Lucene is a bold claim - but is it fair?

While it's an objective fact that 30 edits is 15 times more Lucene's 2 edits, the spirit of the claim implies comparable efficiency.

In this section we will focus on whether the bit-parallel algorithm described above is efficient enough to be used in near-instant fuzzy search.

Let's evaluate the algorithm from a few practical angles: memory usage, throughput, and real-world applicability

N.B. Annoyingly, a review of the scientific literature reveals that I may not be the first to "discover" this algorithm, so I refrain from making that claim.

Real-World Applicability

I also want to clarify that I'm not trying to bash Lucene. It's an incredibly well-made and ambitious tool and is extremely well known - making it a great benchmark. The vast majority of use-cases are well covered by a fuzziness of 2 edits, especially when using Restricted Edit Distance. The way Lucene hardcodes the Parameterized Levenshtein DFAs directly into their code-base is an insanely efficient approach. It is, however, undeniable that there will be some use-cases for which more than 2 edits is a hard requirement. This algorithm is for those cases.

Memory Usage

The canonical use-case for Levenshtein Automata in Fuzzy Search is searching a Trie using Depth-First Search.

Briefly, Depth-First Search maximises efficient stack allocations by recursively exploring child nodes of the Trie. Recursion occurs at least whenever a node has multiple children; in this case the node is kept on the stack and child nodes are explored using further stack frames.

Professional implementations will be aware of the risk of too much recursion - overflowing the stack memory - and will have measures to break recursion by allocating on the heap. This approach provides a balance between the efficient stack and the plentiful heap.

For the purposes here, we can err on the side of pessimism, assuming each character requires a stack frame on which the entire NFA state is stored. Searching a word of 10 characters will require 10 copies of the NFA state.

The graphs below show how many bytes are required to represent the current automaton state as edit distance increases.

Stack Frame Size

For example, 16 bytes are required to represent the state of the automaton for edit distance of 7. At the upper end, searching with edit distance 30 requires a massive 248 bytes to hold the current state.

An analysis of the memory usage reveals why Powerset Construction is such a useful algorithm for executing NFA...

Powerset Construction creates a DFA state only for reachable combinations of NFA states. For a Parameterized Levenshtein Automaton matching 4 edits there are 50 NFA states (c0 e0 to c10 e4). If every possible combination was reachable there would be around 2 ^ 50 combinations. Powerset Construction identifies that only 1353 (~ 2 ^ 11) combinations are reachable.

Already for edit distance 4, if the DFA were not pruned to remove unreachable states, then there would be more DFA states than uint.MaxValue.

Repeating the analysis for edit distance 30 shows there are 1922 NFA states (c0 e0 to c61 e30) leading to 2 ^ 1922 combinations. If we were to assign each NFA state to a single bit ("active" or "inactive"), it would take 241 bytes to represent all states.

The bit-parallel algorithm can achieve such high fuzziness in part because it does not prune unreachable states (which would be intractable). As a result it must be able to represent unreachable states, accounting for the significant memory usage.

Throughput

The usual benchmarks in this series have measured how long it takes to search for a word in a dataset of 450,000 English words. This will not really be helpful with an edit distance of 30 because nearly every single word would match. In the benchmark below I've modified each word including the search word to repeat each character many times (proportional to edit distance). This should keep a comparable result set size.

The table below shows the performance of different automata implementations:

Bit Parallel is the algorithm described in this article, edit distances from 1 to 30.
Parameterized (Hardcoded) uses the hardcoded states of Parameterized Levenshtein Automata, the same as Lucene does, edit distances 1 and 2.
Parameterized runs Powerset Construction to build the Parameterized Levenshtein Automata at runtime.

The implementations are available for review on the Levenshtypo library GitHub page.

Method	Edit Distance	Mean	Heap Allocated
Parameterized (Hardcoded)	1	23.27 us	400 B
Parameterized (Hardcoded)	2	28.63 us	400 B
Parameterized	1	26.90 us	496 B
Parameterized	2	31.58 us	496 B
Parameterized	3	42.50 us	496 B
Parameterized	4	43.95 us	496 B
Bit Parallel	1	35.29 us	400 B
Bit Parallel	2	52.59 us	400 B
Bit Parallel	3	65.20 us	400 B
Bit Parallel	4	96.13 us	432 B
Bit Parallel	5	119.65 us	432 B
Bit Parallel	6	145.52 us	464 B
Bit Parallel	7	166.72 us	464 B
Bit Parallel	8	176.55 us	528 B
Bit Parallel	9	207.64 us	560 B
Bit Parallel	10	233.77 us	560 B
Bit Parallel	15	454.14 us	656 B
Bit Parallel	20	757.98 us	1072 B
Bit Parallel	25	1,103.97 us	1233 B
Bit Parallel	30	1,478.37 us	1393 B

For Edit Distances 1-4 there are directly comparable data points for Bit Parallel vs Parameterized. These indicate that the Bit Parallel approach is between 30% and 100% slower than the Parameterized approach. As the Parameterized approach is blazingly fast it can be argued that the Bit Parallel approach is probably also suitable for fuzzy search.

As the maximum edit distance increases, it becomes increasingly slower. At the extreme end, a 30-edits search with Bit Parallel is 42 times slower than a 1-edit search. A lot of this comes not only from more inner loops, but a larger characteristic vector which must be compared against more window characetrs, worse memory locality, and increased zeroing of data.

However, I'd argue that considering the extreme degree of fuzziness, 1.5 milliseconds to search 450,000 entries is pretty good.

This argument relies on the assumption that using such a large edit distance will not match most of the entries - in which case a lot of the benefits are lost.

Conclusion

In this article we discussed how to use bitwise operations to execute the NFA of the Parameterized Levenshtein Automata directly, avoiding powerset construction (the costly part).

By avoiding powerset construciton we end up having to model all possible states, even if they are unreachable. This creates some costly memory requirements, and a state which is more expensive to operate on than a simple integer as in the Parameterized approach.

The Bit Parallel algorithm described in this article delivers on the premise that degrees of fuzziness significantly larger than 2 are achievable in practice. Perhaps 30 is a stretch, but I would argue that the cut-off is somewhere in between.

Under the Hood of Fuzzy Search: Building a Search Engine 15 times fuzzier than Lucene

How it Works

Encoding the NFA to Bits

Recasting Automaton Operations as Bitwise Logic

Follow "Matching" transitions

Follow "Insert" transitions

Follow "Substitute" transitions

Follow "Delete" transitions

Mask Bitmap

Check >= 1 State is Active

Slide Window

Putting it Back Together

Creating the DFA

Transition

IsAccepting

Implementation Notes

Evaluating The Claim

Real-World Applicability

Memory Usage

Throughput

Conclusion

Other posts you might like

Most Recent

Next Post

Previous Post

Join the Discussion