How Double Dispatch Supercharged My Library's Performance

During the design phase of Levenshtypo I ran into a familiar dilemma which many library authors face:

How do I balance API design, future-compatibility and performance?

The tension comes from opposing forces. As a library author, I want to keep the API simple to encourage adoption while still exposing enough bells and whistles for power users, and I want to give myself room to grow by abstracting the internals, all without tanking performance.

In this article I explain how generics and double dispatch let me have my cake and eat it too.
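To make the pattern concrete, here is a minimal double-dispatch sketch in C#. All of the types below are invented for illustration and are not Levenshtypo's actual API: a small non-generic interface is the whole public surface, while a generic visitor receives the concrete type back without any casts, which is what leaves room to evolve the internals.

```csharp
using System;

// Invented example types; not Levenshtypo's API.
// The public surface stays tiny: one interface, one method.
public interface IShape
{
    TResult Accept<TResult>(IShapeVisitor<TResult> visitor);
}

// New operations are added by writing new visitors, without
// ever changing IShape: that is the room to grow.
public interface IShapeVisitor<TResult>
{
    TResult Visit(Circle circle);
    TResult Visit(Square square);
}

public sealed class Circle : IShape
{
    public double Radius { get; init; }
    // Second dispatch: 'this' has its concrete type here, so the
    // correct Visit overload is chosen at compile time, no casts.
    public TResult Accept<TResult>(IShapeVisitor<TResult> v) => v.Visit(this);
}

public sealed class Square : IShape
{
    public double Side { get; init; }
    public TResult Accept<TResult>(IShapeVisitor<TResult> v) => v.Visit(this);
}

public sealed class AreaVisitor : IShapeVisitor<double>
{
    public double Visit(Circle c) => Math.PI * c.Radius * c.Radius;
    public double Visit(Square s) => s.Side * s.Side;
}
```

Calling `new Square { Side = 3 }.Accept(new AreaVisitor())` returns `9`: the first dispatch picks the shape's `Accept` override, the second picks the right `Visit` overload.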


Under the Hood of Fuzzy Search: Building a Search Engine 15 times fuzzier than Lucene

Typos are very common nowadays. When working in a fast-paced environment we expect computers to just understand what we mean intuitively, even if we didn't spell it out - sometimes literally in the form of spelling mistakes.

On this blog we've been through a deep dive into Levenshtein Automata, exploring how they are created using NFAs and Powerset Construction and finally diving deep into the Parameterized Levenshtein Automata to expose how professional search engines like Lucene handle fuzziness.

While Lucene can only fuzzy search with a maximum of 2 edits, today we review an algorithm which can handle up to 30 edits.

The algorithm was built while working on Levenshtypo, the library I developed and now maintain. It's used as the engine of Autypo, which I designed to be the idiomatic way to do autocomplete in .NET.


Under the Hood of Fuzzy Search: Parameterized Levenshtein Automata

In this series we're diving deep into the data structures and algorithms behind Fuzzy Search. As an introduction to the topic, the first post in this series shed light on how Tries and Levenshtein Automata are used together to provide rapid information retrieval.

The second post highlighted just how complex Levenshtein Automata can be to create, and ultimately just how slow they are, because the Powerset Construction algorithm has to explore an exponentially large number of states. The article concluded by hinting that all is not lost; after all, we have all seen real fuzzy search in large, scalable search engines.

In this article we will explore Parameterized Levenshtein Automata and how they solve the performance issues which plague regular Levenshtein Automata. It's a notoriously complex topic, but some say it's just misunderstood and has a bad reputation due to the dense academic paper which first introduced it: Fast String Correction with Levenshtein-Automata by Klaus Schulz and Stoyan Mihov.

It's best to start by studying the two automata below (left is "cat" and right is "pig", edit distance = 1). Can you notice anything interesting?

DFA of "cat" and "pig", distance=1


Under the Hood of Fuzzy Search: Constructing Levenshtein Automata

Last time on this blog, we dug deep into two powerful data structures at the core of fuzzy search engines: Tries and Levenshtein Automata. We wrote some unoptimised blog-quality code to search through 450,000 words for those resembling "et" using these data structures. In benchmarks, this approach came out roughly 1,500 times faster than a naive loop built on a highly-optimised .NET library.

The main idea presented was to use Levenshtein Automata as a guide to know which Trie nodes to visit and which to prune. In this article we tackle the considerable task of constructing Levenshtein Automata.

In other words, how do we go from "2 typos in abc" to this:

DFA Levenshtein Automaton for "abc" distance 2
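The guide-and-prune idea can be sketched without building an automaton at all, by carrying one row of the classic Levenshtein dynamic-programming table down the trie and abandoning a subtree once every entry in the row exceeds the edit budget. This is a simplified stand-in for the automaton-guided traversal, with names invented for the example (it is not Levenshtypo's code):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy trie + bounded edit-distance search. Each step down the trie
// computes one new row of the Levenshtein DP table; if the whole row
// exceeds maxEdits, nothing below this node can match, so we prune.
public sealed class TrieNode
{
    public readonly Dictionary<char, TrieNode> Children = new();
    public bool IsWord;

    public void Add(string word)
    {
        var node = this;
        foreach (var ch in word)
        {
            if (!node.Children.TryGetValue(ch, out var next))
                node.Children[ch] = next = new TrieNode();
            node = next;
        }
        node.IsWord = true;
    }

    public List<string> Search(string query, int maxEdits)
    {
        var results = new List<string>();
        // First DP row: distance from "" to each query prefix.
        var firstRow = Enumerable.Range(0, query.Length + 1).ToArray();
        foreach (var (ch, child) in Children)
            child.Walk(ch.ToString(), query, firstRow, maxEdits, results);
        return results;
    }

    void Walk(string prefix, string query, int[] prevRow, int maxEdits, List<string> results)
    {
        var row = new int[query.Length + 1];
        row[0] = prevRow[0] + 1; // all-deletions column
        for (int i = 1; i <= query.Length; i++)
        {
            int cost = query[i - 1] == prefix[^1] ? 0 : 1;
            row[i] = Math.Min(Math.Min(row[i - 1] + 1,   // insertion
                                       prevRow[i] + 1),  // deletion
                              prevRow[i - 1] + cost);    // match/substitution
        }
        if (IsWord && row[^1] <= maxEdits) results.Add(prefix);
        if (row.Min() > maxEdits) return; // prune the whole subtree
        foreach (var (ch, child) in Children)
            child.Walk(prefix + ch, query, row, maxEdits, results);
    }
}
```

Searching a trie containing cat, cart, and dog for "cat" with one edit returns cat and cart, while the dog subtree is pruned after only two characters. The automaton replaces the per-node DP row with precomputed states, which is what the post goes on to construct.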


Under the Hood of Fuzzy Search: Tries, Automata, and Speed

Have you ever wondered how your phone knows how to correct your typos as you type? Fuzzy Search! Have you searched "resrtaurants near me" and found that Google knows exactly what you mean? Also Fuzzy Search! Fuzzy Search is when a program looks for words which almost match some input.

But how does that actually work? In this article we will explore the data structures at the core of many of the fuzzy search engines we use today: Tries and Levenshtein Automata. Throughout the post we will implement these data structures in C#, and by the end we will be able to run a functional, algorithmically efficient fuzzy search that finds all English words similar to an input string in around 30 microseconds.
