The Wonders of the Suffix Tree Through the Lens of Ukkonen’s Algorithm (opens in new tab)

(humanreadablemag.com)

150 pointspekalicious6y ago39 comments

39 comments

lbj6y ago

So often lately, Ive come here and left disappointed. Reading about Wasabi and hipster sweaters. But this post is exactly the kind of content we should be pushing. Everyone who reads this will become a better hacker for it.

mumblemumble6y ago

This site is currently running a Kickstarter to launch a new programming magazine. If you'd like to see more content like this out there, consider taking a look:

https://www.kickstarter.com/projects/pekalicious/human-reada...

weitzj6y ago

Also interesting to get into are suffix arrays which you could construct from a tree. Then there is a notion of generalized suffix arrays, where you mix multiple suffix arrays into a big suffix array. Then you can give fast answers like: what is the longest common prefix between the several tokens, whereas the tokens of your individual suffix arrays could be letters from an alphabet (so you construct a suffix array from words) or maybe words (you construct a suffix array from a sentence)

robinhouston6y ago

And then you could get into compressed indexes such as the FM-index, which represent a text in a form that is both (typically) smaller than the original, and also supports fast substring queries.

Suffix trees are cool, but they’re just the entry point to a world of wonderful data structures and algorithms.

del_operator6y ago

The Burrows-Wheeler transform was an interesting thing to see! When I used to teach at an immersive 3 month program for web dev we had a few weeks of playing with such arrays, trees, tries, dynamic programming, and text from Wikipedia.

ridiculous_fish6y ago

In practice one should typically prefer a suffix array to a suffix tree. A suffix array is conceptually so easy!

1. Given a set of strings, put all the suffixes into a set. For example "derp" -> "derp", "erp", "rp", "p".

2. Sort it

It's cheap! The number of suffixes equals the string length, and suffixes can be mere references into the string input. So in practice you use maybe ~2x memory compared to the input - and in big cache friendly chunks (not tries).

Now you're flying. If you want to perform a substring search:

1. Binary search on the first character. You get a range of suffixes which begin with that character.

2. Within that range, perform a binary search on the second character. Now you have a range of suffixes which begin with the first two characters.

3. etc.

Powerful and easy.

By the way this technique of going character by character can and should be extended to the sort itself; look up "three-way radix quicksort" for what you never learned in school.

SamReidHughes6y ago

More like 4x or 8x the memory (of an ASCII string) if you just use pointers or size_t’s. Abstractly, you need at minimum (n log n) memory to represent a permutation, which is what you’re doing.

nukke6y ago

I took string algorithms course with Esko Ukkonen. It was one of the most interesting and fun algoritm courses ever and very useful.

Ukkonen algorithm for suffix trees and Kärkkainen-Sanders for suffix array are very beautiful algorithms.

saagarjha6y ago

> Introduced to the world in 1973 by Peter Weiner, suffix trees are a relatively new concept in computer science, with this newness likely fueling its obscurity.

That isn't really all that new; it's just about when red-black trees and B-trees were invented as well and I wouldn't call those "obscure".

kadoban6y ago

You're right, it is not new. It is slightly obscure though. Definitely comes up if you study algorithms enough, but I don't think intro courses usually even mention them, and most working programmers won't have any idea they exist.

th-th-throwaway6y ago

Are there some hidden constants/considerations that prevent suffix trees from being used in the real world?

Every time I expect it to show up (e.g., for pattern matching in databases), I almost always see something else used instead (e.g., trigrams)

mehrdadn6y ago

A couple of considerations that come to mind for a database: (a) efficiency, and (b) dynamicity. The tree structure isn't particularly efficient (it has a lot of space overhead, and I think it's also not as fast), so you'd prefer something else if possible. So people often use suffix arrays + LCA preprocessing instead. However, those are hard to make dynamic (your database is constantly changing), so I'd expect that still makes it tough to use them in databases. Though I suspect this isn't the whole story.

srpablo6y ago

Not _quite_ a suffix tree, but a similar structure called a GADDAG is sometimes used in word-generating board games (Scrabble, Words With Friends).

GADDAGs are different because they're more like "reversed prefix" trees, but they help with the issues of placing a new word from a dictionary onto the board using the existing letters of another word.

   [1]: https://en.wikipedia.org/wiki/GADDAG

StefanKarpinski6y ago

A suffix array is of similar utility to a suffix tree, more compact to store, and pretty simple to compute: all you need to do is start with an array of each index into the data from start to end, then sort the indices by memcmp'ing the suffix string starting at each index. There are more optimized algorithms, but they're within 2-3x for moderately sized data (megabytes). You need the fancier sorting algorithms if you're constructing a prefix array of something huge like a genome.

kadoban6y ago

Just as a side note, that algorithm for suffix array construction has really bad worst cases (consider a string of all "a"s).

It won't matter for experimenting or most likely for using it on smallish realistic texts, but anything where you're taking user data for instance you'd want to step up to a slightly more clever algorithm (there's an O(n lg n) worst case one that's still very simple, called rank doubling I believe).

carapace6y ago

(Ironically, for a site called "Human Readable Magazine" I found that really hard to read.)

joe_the_user6y ago

(Edit: I been curious/confused by this for years)

Can someone explain how this is fundamentally different from a b-tree of the substrings of a string? You'd store each substring by start index and each leaf is where the substring becomes unique.

mumblemumble6y ago

The only thing suffix trees have in common with B-tries that I can see is that they're both non-binary trees.

A B-tree stores whole keys in the nodes, not fragments of values. And a B-tree tries to self-balance, so there's a lot of leeway for what keys get pushed up to the root node. B-trees split and merge nodes based on the number of items in them, trying to achieve a certain fill factor.

A suffix tree, on the other hand, is storing fragments of the values, and I believe (but have not bothered to prove) that there is one and only one minimal and valid way to organize the tree for a given set of data. Suffix trees add children when the data mandate that they do in order to remain correct, not just in order to make room.

joe_the_user6y ago

I should phrase it differently; what does a suffix tree get you that a standard sorting tree doesn't get you. A standard sorting tree would get you all the neighbors in lexical order. What more do you need?

1 more reply

alexchamberlain6y ago

This might be a stupid question: why is the fact it’s a _suffix_ tree important? (as opposed to building a similar structure with prefixes)

j-pb6y ago

A suffix tree contains a compact trie like graph representation of all the suffixes of a long string. So if you want to know if it contains a substring, you can start at the root and just search like in a prefix tree. Since for the queried word to be an infix it must be a prefix of a suffix, you will find the word in O(|query|) time.

So yeah its a prefix tree, but a prefix tree of all the suffixes.

And now to answer your question. If you naively put every suffix into a regular prefix tree, you would get the same lookup performance. BUT, since the suffixes will share a lot of suffixes, which the prefix tree can't compress you have a lot more memory consumption.

The suffix tree avoids this by essentially performing compression on the suffix as well, so that you simply have pointers to shared suffixes of suffixes, which point to the root.

In the end it's like a diamond shape I guess, trie in the front, compression in the back.

raincom6y ago

Why can't one construct a prefix tree of all prefixes, then start process from the right to left (assume that it is an offline algorithm)?

1 more reply

eindiran6y ago

The underlying data structure (a trie[0]) is used to create both suffix trees and prefix trees. The 'default' trie is a prefix tree actually.

The reason you would want a suffix tree (over a prefix trie) is that suffix trees are quite useful for computing the longest common substring of two strings (they can do it in linear time).

[0] https://en.wikipedia.org/wiki/Trie

raincom6y ago

Peter Weiner, who first invented suffix trees gave an answer to this question on p.3 of "Linear Pattern Matching Algorithms"[1]: "If P does match a substring of S, then reverse(P) also matches a substring of reverse(S). This observation implies that every technique which solves a pattern matching problem working from left to right has a dual procedure which works from right to left. In what follows, we adopt a left to right view point, referring only briefly to dual concepts as appropriate."

[1]https://pdfs.semanticscholar.org/d339/0f13eed511cde1116446e7...

pelario6y ago

The key property is that each substring is the begining of a suffix, so if you have your suffixes ordered in lexicographical order, is easy to answer pattern matching queries. Now, you could do the same with prefixes, but then you would need to reverse the query pattern, and things become less straightforward.

mehrdadn6y ago

The prefix tree of a string is just a linked list, though—it's the string itself. It doesn't help you do anything for a single string, unlike a suffix tree, which is used for finding substrings of a single string.

Moreover, how would you even start searching for a substring using a prefix tree? You can't start at the root, because the substring wouldn't be a prefix. It doesn't help you there at all.

joppy6y ago

This is such a clear and illuminating explanation as to why _suffix_, rather than prefix trees, are so useful.

1 more reply

nroets6y ago

For most problems, a prefix tree will be just as useful.

But suppose you want to look for the best match of string in a data stream with no predetermined length e.g. a TCP pipe. Then a suffix tree will allow you to process each incoming character in constant time. A prefix tree will not.

msclrhd6y ago

If you have keys that tend to have common prefices (e.g. URLs in RDF) then checking them in reverse (e.g. with a suffix tree) is generally faster, as you will hit differences quicker.

herostratus1016y ago

I remember learning about this in a data structures class, and it is a remarkable data structure. But the author of this article needs to pick up a copy of Strunk and White. This reads like it was written by a high school junior. Awful, awful writing.

pzh6y ago

It seems like the style of the article is deliberately long-winded and poetic to match the type of content that the “human readable magazine” is going for (similar to New Yorker articles). It’s likely seeking to cater to a specific niche of reader preferences.

herostratus1016y ago

You've got to be kidding me. The sentences are excruciating to read. New Yorker prose is in a different universe. The writing is spare and lucid. You can't evaluate the quality of prose with "wc -w".

tracer42016y ago

This was one of the most insightful articles I’ve read on HN in a while, and I’d love to see more of this content or even pay a premium to some publisher who has this category of technical deep dives.

My unsolicited feedback to you - it doesn’t hurt to practice empathy. The fact that you assumed the authors primary language is English (and even if it really is - it doesn’t matter) and hold them to that bar - and you point out the data structure is something you already knew - is this the same way you provide feedback in a work environment? Don’t take this as an attack, but I guarantee you would be a short timer in my org.

1 more reply

xenihn6y ago

English may not be their first language.

j / k navigate · click thread line to collapse

39 comments

lbj6y ago

mumblemumble6y ago

This site is currently running a Kickstarter to launch a new programming magazine. If you'd like to see more content like this out there, consider taking a look:

https://www.kickstarter.com/projects/pekalicious/human-reada...

weitzj6y ago

robinhouston6y ago

And then you could get into compressed indexes such as the FM-index, which represent a text in a form that is both (typically) smaller than the original, and also supports fast substring queries.

Suffix trees are cool, but they’re just the entry point to a world of wonderful data structures and algorithms.

del_operator6y ago

ridiculous_fish6y ago

In practice one should typically prefer a suffix array to a suffix tree. A suffix array is conceptually so easy!

1. Given a set of strings, put all the suffixes into a set. For example "derp" -> "derp", "erp", "rp", "p".

2. Sort it

Now you're flying. If you want to perform a substring search:

1. Binary search on the first character. You get a range of suffixes which begin with that character.

2. Within that range, perform a binary search on the second character. Now you have a range of suffixes which begin with the first two characters.

3. etc.

Powerful and easy.

By the way this technique of going character by character can and should be extended to the sort itself; look up "three-way radix quicksort" for what you never learned in school.

SamReidHughes6y ago

More like 4x or 8x the memory (of an ASCII string) if you just use pointers or size_t’s. Abstractly, you need at minimum (n log n) memory to represent a permutation, which is what you’re doing.

nukke6y ago

I took string algorithms course with Esko Ukkonen. It was one of the most interesting and fun algoritm courses ever and very useful.

Ukkonen algorithm for suffix trees and Kärkkainen-Sanders for suffix array are very beautiful algorithms.

saagarjha6y ago

> Introduced to the world in 1973 by Peter Weiner, suffix trees are a relatively new concept in computer science, with this newness likely fueling its obscurity.

That isn't really all that new; it's just about when red-black trees and B-trees were invented as well and I wouldn't call those "obscure".

kadoban6y ago

th-th-throwaway6y ago

Are there some hidden constants/considerations that prevent suffix trees from being used in the real world?

Every time I expect it to show up (e.g., for pattern matching in databases), I almost always see something else used instead (e.g., trigrams)

mehrdadn6y ago

srpablo6y ago

Not _quite_ a suffix tree, but a similar structure called a GADDAG is sometimes used in word-generating board games (Scrabble, Words With Friends).

   [1]: https://en.wikipedia.org/wiki/GADDAG

StefanKarpinski6y ago

kadoban6y ago

Just as a side note, that algorithm for suffix array construction has really bad worst cases (consider a string of all "a"s).

carapace6y ago

(Ironically, for a site called "Human Readable Magazine" I found that really hard to read.)

joe_the_user6y ago

(Edit: I been curious/confused by this for years)

Can someone explain how this is fundamentally different from a b-tree of the substrings of a string? You'd store each substring by start index and each leaf is where the substring becomes unique.

mumblemumble6y ago

The only thing suffix trees have in common with B-tries that I can see is that they're both non-binary trees.

joe_the_user6y ago

1 more reply

alexchamberlain6y ago

This might be a stupid question: why is the fact it’s a _suffix_ tree important? (as opposed to building a similar structure with prefixes)

j-pb6y ago

So yeah its a prefix tree, but a prefix tree of all the suffixes.

The suffix tree avoids this by essentially performing compression on the suffix as well, so that you simply have pointers to shared suffixes of suffixes, which point to the root.

In the end it's like a diamond shape I guess, trie in the front, compression in the back.

raincom6y ago

Why can't one construct a prefix tree of all prefixes, then start process from the right to left (assume that it is an offline algorithm)?

1 more reply

eindiran6y ago

The underlying data structure (a trie[0]) is used to create both suffix trees and prefix trees. The 'default' trie is a prefix tree actually.

The reason you would want a suffix tree (over a prefix trie) is that suffix trees are quite useful for computing the longest common substring of two strings (they can do it in linear time).

[0] https://en.wikipedia.org/wiki/Trie

raincom6y ago

[1]https://pdfs.semanticscholar.org/d339/0f13eed511cde1116446e7...

pelario6y ago

mehrdadn6y ago

Moreover, how would you even start searching for a substring using a prefix tree? You can't start at the root, because the substring wouldn't be a prefix. It doesn't help you there at all.

joppy6y ago

This is such a clear and illuminating explanation as to why _suffix_, rather than prefix trees, are so useful.

1 more reply

nroets6y ago

For most problems, a prefix tree will be just as useful.

msclrhd6y ago

If you have keys that tend to have common prefices (e.g. URLs in RDF) then checking them in reverse (e.g. with a suffix tree) is generally faster, as you will hit differences quicker.

herostratus1016y ago

pzh6y ago

herostratus1016y ago

You've got to be kidding me. The sentences are excruciating to read. New Yorker prose is in a different universe. The writing is spare and lucid. You can't evaluate the quality of prose with "wc -w".

tracer42016y ago

1 more reply

xenihn6y ago

English may not be their first language.

j / k navigate · click thread line to collapse