The downside to any precomputation-based algorithm is that you must rerun the preprocessing code whenever the input changes (in this case, the dictionary), and you often must allocate extra storage to hold the preprocessed input.
This is a really interesting algorithm but, as always, you have to know the pros and cons and apply it where it fits (i.e. not somewhere that needs a compressed dictionary).
[1] http://www.codeproject.com/Articles/31669/Hierarchical-Tree-...
http://java.dzone.com/news/lucenes-fuzzyquery-100-times
Looks like some interesting research and dev going on in this space at the moment.
They don't mention the approach I'd take for a naive spellcheck:
Generate a (giant) trie from the dictionary. Then perform a DFS on the trie, allowing at most <the current closest word's number of edits> edits from the input word and keeping track of the edits made so far - at each node try no change first, then a deletion, then a substitution, then an insertion. This works better with a smaller alphabet.
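A minimal sketch of that idea in Python (not the Go code mentioned below; all names like `fuzzy_search` are mine). It uses a fixed edit budget rather than shrinking the bound as closer words are found, which is the extra pruning described above:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_word = False

def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def fuzzy_search(root, word, max_edits):
    """Return all dictionary words within max_edits edits of `word`."""
    results = set()

    def dfs(node, i, edits, prefix):
        if edits > max_edits:
            return  # prune: over the edit budget
        if i == len(word) and node.is_word:
            results.add(prefix)
        # 1) no change: match the next character exactly
        if i < len(word) and word[i] in node.children:
            dfs(node.children[word[i]], i + 1, edits, prefix + word[i])
        # 2) deletion: skip a character of the query word
        if i < len(word):
            dfs(node, i + 1, edits + 1, prefix)
        for ch, child in node.children.items():
            # 3) substitution: consume a query char, emit a different one
            if i < len(word) and ch != word[i]:
                dfs(child, i + 1, edits + 1, prefix + ch)
            # 4) insertion: emit a dictionary char, consume nothing
            dfs(child, i, edits + 1, prefix + ch)

    dfs(root, 0, 0, "")
    return results
```

For example, with a trie built from ["cat", "cart", "bat", "dog"], `fuzzy_search(trie, "cat", 1)` returns {"cat", "cart", "bat"}. The branching per node is bounded by the alphabet size, which is why a smaller alphabet helps.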
An optimization: a DAWG is potentially even better space-wise than a hashmap (effectively it ends up compressing the input data), but takes a while to precompute.
BTW - if anyone's interested in the Go implementation let me know, I'll try to find it and post it somewhere.
Am I mistaken?
Actually, even after reading the article I'm not completely certain where that 1000x comes from; I suppose it's in comparison to the naive approach.
> This is three orders of magnitude less expensive (36 terms for n=9 and d=2)
Three orders of magnitude is 1000. Peter Norvig's approach was described above with:
> still expensive at search time (114,324 terms for n=9, a=36, d=2)
So 114,324 / 36 ≈ 3,176 - hence "three orders of magnitude" - and I suppose he went conservative by saying "1000x" rather than "3000x".
Some compounding languages (Dutch, German) have an effectively infinite number of words, as you can take two nouns (e.g. a and b) and concatenate them to form a new one (ab means something different from ba). Some languages also have inflection....