Under the Hood of Fuzzy Search: Parameterized Levenshtein Automata

In this series we're diving deep into the data structures and algorithms behind Fuzzy Search. As an introduction to the topic, the first post in this series shed light on how Tries and Levenshtein Automata are used together to provide rapid information retrieval.

The second post highlighted just how complex Levenshtein Automata can be to create, and ultimately just how slow they are due to the Powerset Construction algorithm having to explore an exponentially large number of states. The article concluded by hinting that all is not lost; after all we have all seen real fuzzy search in large scalable search engines.

In this article we will explore Parameterized Levenshtein Automata and how they solve the performance issues which plague regular Levenshtein Automata. It's a notoriosly complex topic, but some say it's just misunderstood and has a bad reputation due to the dense scientific article which first introduced it: Fast String Correction with Levenshtein-Automata by Klaus Schulz and Stoyan Mihov.

It's best to start by studying the two automata below (left is "cat" and right is "pig", edit distance = 1). Can you notice anything interesting?

DFA of "cat" and "pig", distance=1

Read more

Under the Hood of Fuzzy Search: Constructing Levenshtein Automata

Last time on this blog, we dug deep into two powerful data structures which are at the core of fuzzy search engines; Tries and Levenshtein Automata. We wrote some unoptimised blog-quality code to search through 450,000 words for those resembling "et" using these data structures. The results showed that it benchmarked as being 1500 faster than a naive loop with a highly-optimized .NET library.

The main idea presented was to use Levenshtein Automata as a guide to know which Trie nodes to visit and which to prune. In this article we tackle the considerable task of how to construct Levenshtein Automata.

In other words, how do we go from "2 typos in abc" to this:

DFA Levenshtein Automaton for "abc" distance 2

Read more

Under the Hood of Fuzzy Search: Tries, Automata, and Speed

Have you wondered how your phone knows how to correct your typos as you type? Fuzzy Search! Have you searched "resrtaurants near me" and Google knows exactly what you mean? Also Fuzzy Search! Fuzzy Search is when a program looks for words which almost match some input.

But how does that actually work? In this article we will explore the data structures which are at the core of many of the fuzzy search engines we use today, Tries and Levenshtein Automata. Throughout the post we will implement these data structures in C#, and by the end be able to run a functional and algorithmically efficient fuzzy search which in 30 microseconds can find all English words similar to an input string.

Read more
Page 1 of 5
Next Page »