Under the Hood of Fuzzy Search: Constructing Levenshtein Automata

Last time on this blog, we dug deep into two powerful data structures which are at the core of fuzzy search engines; Tries and Levenshtein Automata. We wrote some unoptimised blog-quality code to search through 450,000 words for those resembling "et" using these data structures. The results showed that it benchmarked as being 1500 faster than a naive loop with a highly-optimized .NET library.

The main idea presented was to use Levenshtein Automata as a guide to know which Trie nodes to visit and which to prune. In this article we tackle the considerable task of how to construct Levenshtein Automata.

In other words, how do we go from "2 typos in abc" to this:

DFA Levenshtein Automaton for "abc" distance 2

Read more

Under the Hood of Fuzzy Search: Tries, Automata, and Speed

Have you wondered how your phone knows how to correct your typos as you type? Fuzzy Search! Have you searched "resrtaurants near me" and Google knows exactly what you mean? Also Fuzzy Search! Fuzzy Search is when a program looks for words which almost match some input.

But how does that actually work? In this article we will explore the data structures which are at the core of many of the fuzzy search engines we use today, Tries and Levenshtein Automata. Throughout the post we will implement these data structures in C#, and by the end be able to run a functional and algorithmically efficient fuzzy search which in 30 microseconds can find all English words similar to an input string.

Read more

My Implementation is Better

Code on this blog can rarely be considered remotely production-worthy. Today's example could possibly take the crown as the most filthy.

First, some context. I recently was working on a project where I wanted to represent ids as custom types instead of having ints everywhere. I found a package - StonglyTypedId - which is a source generator to remove most of the boilerplate for these strongly typed ids. Andrew Lock, the package author, has a series of blog posts describing this problem he calls primitive obsession as well as how this package was designed to solve it.

In using this StronglyTypedId library with Dapper.Contrib and MySQL I came across an issue where the .NET method Convert.ChangeType is used to convert from a ulong into the generated custom PersonId type. Convert.ChangeType can not be extended for custom types and hence the bug. The solution would be to either change Dapper.Contrib to use Dappers SqlMapper type conversion system or to use the .NET TypeConverter (example below) or some other extensible conversion method.

var converter = TypeDescriptor.GetConverter(typeof(idp.PropertyType));
var newValue = converter.ConvertFrom(id);
idp.SetValue(entityToInsert, newValue, null);

However practical a solution like that may be, it wouldn't make for a good blog post. What if, being stubborn individuals, we really wanted to extend Convert.ChangeType? Is that possible? Where would such a persuit take us? The implementation is pretty limited and so I say replace it - my implementation is better!

Read more
Page 1 of 4
Next Page »