https://www.kickstarter.com/projects/pekalicious/human-reada...
Suffix trees are cool, but they’re just the entry point to a world of wonderful data structures and algorithms.
1. Given a set of strings, put all the suffixes into a set. For example "derp" -> "derp", "erp", "rp", "p".
2. Sort it
It's cheap! The number of suffixes equals the string length, and suffixes can be mere references into the string input. So in practice you use maybe ~2x memory compared to the input - and in big cache friendly chunks (not tries).
Now you're flying. If you want to perform a substring search:
1. Binary search on the first character. You get a range of suffixes which begin with that character.
2. Within that range, perform a binary search on the second character. Now you have a range of suffixes which begin with the first two characters.
3. etc.
Powerful and easy.
By the way this technique of going character by character can and should be extended to the sort itself; look up "three-way radix quicksort" for what you never learned in school.
Ukkonen algorithm for suffix trees and Kärkkainen-Sanders for suffix array are very beautiful algorithms.
That isn't really all that new; it's just about when red-black trees and B-trees were invented as well and I wouldn't call those "obscure".
Every time I expect it to show up (e.g., for pattern matching in databases), I almost always see something else used instead (e.g., trigrams)
GADDAGs are different because they're more like "reversed prefix" trees, but they help with the issues of placing a new word from a dictionary onto the board using the existing letters of another word.
[1]: https://en.wikipedia.org/wiki/GADDAGIt won't matter for experimenting or most likely for using it on smallish realistic texts, but anything where you're taking user data for instance you'd want to step up to a slightly more clever algorithm (there's an O(n lg n) worst case one that's still very simple, called rank doubling I believe).
Can someone explain how this is fundamentally different from a b-tree of the substrings of a string? You'd store each substring by start index and each leaf is where the substring becomes unique.
A B-tree stores whole keys in the nodes, not fragments of values. And a B-tree tries to self-balance, so there's a lot of leeway for what keys get pushed up to the root node. B-trees split and merge nodes based on the number of items in them, trying to achieve a certain fill factor.
A suffix tree, on the other hand, is storing fragments of the values, and I believe (but have not bothered to prove) that there is one and only one minimal and valid way to organize the tree for a given set of data. Suffix trees add children when the data mandate that they do in order to remain correct, not just in order to make room.
So yeah its a prefix tree, but a prefix tree of all the suffixes.
And now to answer your question. If you naively put every suffix into a regular prefix tree, you would get the same lookup performance. BUT, since the suffixes will share a lot of suffixes, which the prefix tree can't compress you have a lot more memory consumption.
The suffix tree avoids this by essentially performing compression on the suffix as well, so that you simply have pointers to shared suffixes of suffixes, which point to the root.
In the end it's like a diamond shape I guess, trie in the front, compression in the back.
The reason you would want a suffix tree (over a prefix trie) is that suffix trees are quite useful for computing the longest common substring of two strings (they can do it in linear time).
[1]https://pdfs.semanticscholar.org/d339/0f13eed511cde1116446e7...
Moreover, how would you even start searching for a substring using a prefix tree? You can't start at the root, because the substring wouldn't be a prefix. It doesn't help you there at all.
But suppose you want to look for the best match of string in a data stream with no predetermined length e.g. a TCP pipe. Then a suffix tree will allow you to process each incoming character in constant time. A prefix tree will not.
My unsolicited feedback to you - it doesn’t hurt to practice empathy. The fact that you assumed the authors primary language is English (and even if it really is - it doesn’t matter) and hold them to that bar - and you point out the data structure is something you already knew - is this the same way you provide feedback in a work environment? Don’t take this as an attack, but I guarantee you would be a short timer in my org.