In this series we're diving deep into the data structures and algorithms behind Fuzzy Search. As an introduction to the topic, the first post in this series shed light on how Tries and Levenshtein Automata are used together to provide rapid information retrieval.
The second post highlighted just how complex Levenshtein Automata are to create, and ultimately just how slow construction is, because the Powerset Construction algorithm must explore an exponentially large number of states. The article concluded by hinting that all is not lost; after all, fast fuzzy search clearly exists in large-scale search engines.
In this article we will explore Parameterized Levenshtein Automata and how they solve the performance issues that plague regular Levenshtein Automata. It's a notoriously complex topic, but some say it's just misunderstood and has a bad reputation due to the dense scientific article that first introduced it: Fast String Correction with Levenshtein-Automata by Klaus Schulz and Stoyan Mihov.
It's best to start by studying the two automata below (left is "cat", right is "pig", edit distance = 1).
Can you notice anything interesting?
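Before looking at the automata themselves, it helps to pin down exactly what they accept. A Levenshtein automaton for a word with maximum distance 1 accepts precisely the strings within edit distance 1 of that word. As a minimal sketch (the `levenshtein` function below is an illustrative helper, not code from this series), we can check that membership with the classic dynamic-programming edit distance:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance: insertions,
    # deletions, and substitutions each cost 1.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                  # deletion
                cur[j - 1] + 1,               # insertion
                prev[j - 1] + (ca != cb),     # substitution (free on match)
            ))
        prev = cur
    return prev[-1]

# The automaton for "cat" with max distance 1 accepts exactly the
# strings w with levenshtein("cat", w) <= 1:
print(levenshtein("cat", "cats"))  # 1 -> accepted
print(levenshtein("cat", "dog"))   # 3 -> rejected
```

The automaton gives the same yes/no answer in a single left-to-right pass over the input, without computing the full distance table; that is what makes it attractive for scanning a large index.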