I played with n-grams for a while and even produced similar results. But I don't see how those data can be useful to anybody.
In this case: at a depth of 6, a trie can handle ~75% of all words. At 5 it can handle ~67%. Since a fully populated trie grows exponentially in memory with depth, dropping a level and still getting roughly a 70% solution might be good enough. It costs about 8 percentage points of the representable lexicon. However, if you go to depth 4, you can only cover ~56% of words, meaning there's a ~44% chance that a given word won't be stored in the trie.
Supposing we set a target that the trie needs to handle roughly 70% of all words, then depth 5 is pretty reasonable and space efficient, with only about a 1/3 chance that a word in our lexicon won't be in the trie.
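To make that trade-off concrete, here is a rough Python sketch. It assumes "handled at depth d" means the word's length is at most d, which is my reading of the numbers above rather than something stated outright. `SAMPLE_WORDS`, `coverage_by_depth` and `worst_case_nodes` are made-up names for illustration, and the word list is a toy, not the real lexicon.

```python
from collections import Counter

# Hypothetical toy lexicon; swap in a real word list to get meaningful numbers.
SAMPLE_WORDS = ["a", "to", "cat", "tree", "house", "letter", "reading"]


def coverage_by_depth(words, max_depth):
    """Return {depth: fraction of words whose length is <= depth}."""
    lengths = Counter(len(w) for w in words)
    total = sum(lengths.values())
    if total == 0:
        raise ValueError("need at least one word")
    covered = 0
    result = {}
    for depth in range(1, max_depth + 1):
        covered += lengths.get(depth, 0)
        result[depth] = covered / total
    return result


def worst_case_nodes(depth, alphabet_size=26):
    """Node count of a fully populated trie of this depth, root included."""
    return sum(alphabet_size ** level for level in range(depth + 1))


if __name__ == "__main__":
    for depth, fraction in coverage_by_depth(SAMPLE_WORDS, 6).items():
        print(depth, round(fraction, 2), worst_case_nodes(depth))
```

Each extra level multiplies the worst-case node count by about the alphabet size, which is why giving up a few points of coverage to drop a level can be a fair trade. A real trie over real words is far sparser than the worst case, so treat the node count as an upper bound only.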
Also, it's used to guess what language a text might be in for anthropology and archaeology. The frequency charts for a given language are reliable enough (in a sufficiently large sample, etc) to guide that.
It actually works pretty well. You end up having to do something like this pretty early on in the Matasano crypto challenge problems.
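For anyone curious what that looks like, here is a minimal sketch of frequency scoring applied to a single-byte XOR cipher, which is the kind of exercise that shows up early in those challenges. The frequency table holds approximate, commonly quoted English letter percentages, the weight for the space character is my own heuristic, and `english_score` and `crack_single_byte_xor` are names I made up. It is one simple way to do it, not the reference solution.

```python
# Approximate English letter frequencies in percent (illustrative values).
# The weight for ' ' is a heuristic assumption, not a measured figure.
ENGLISH_FREQ = {
    'e': 12.7, 't': 9.1, 'a': 8.2, 'o': 7.5, 'i': 7.0, 'n': 6.7,
    's': 6.3, 'h': 6.1, 'r': 6.0, 'd': 4.3, 'l': 4.0, 'c': 2.8,
    'u': 2.8, 'm': 2.4, 'w': 2.4, 'f': 2.2, 'g': 2.0, 'y': 2.0,
    'p': 1.9, 'b': 1.5, 'v': 1.0, 'k': 0.8, 'j': 0.15, 'x': 0.15,
    'q': 0.1, 'z': 0.07, ' ': 13.0,
}


def english_score(data: bytes) -> float:
    """Sum the frequency weight of every byte; higher looks more like English."""
    return sum(ENGLISH_FREQ.get(chr(b).lower(), 0.0) for b in data)


def crack_single_byte_xor(ciphertext: bytes):
    """Try all 256 keys and return (key, plaintext) with the best score."""
    best_key = max(
        range(256),
        key=lambda k: english_score(bytes(b ^ k for b in ciphertext)),
    )
    return best_key, bytes(b ^ best_key for b in ciphertext)


if __name__ == "__main__":
    message = b"letter frequencies are reliable enough in a large sample"
    secret = bytes(b ^ 0x58 for b in message)
    print(crack_single_byte_xor(secret))
```

Guessing a language works the same way in principle: keep one table per candidate language and pick the table that scores the text highest. Short or unusual inputs can fool a sum-of-weights score like this, so a chi-squared comparison against the expected distribution is a common refinement.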
[1] https://play.google.com/store/apps/details?id=com.starwords
- it's silent in the case of modifying preceding vowels separated by a medial consonant e.g. hat vs. hate, bat vs. bate
- and in older English (or English that wants to feel old) was a superfluous final letter e.g. olde, pubbe
- as a silent letter entirely e.g. eagle
- as itself e.g. egg, education
- as a silent or nearly silent suffix separator for -ed e.g. dropped, judged
- as a non-silent suffix for -ed e.g. educated
- silent as an immediate vowel modifier in vowel digraphs (in some spellings) e.g. archaeology, encyclopaedia, caesar; it was so incidental that the pair used to be written as a ligature
- silent as a modifier on itself e.g. teen, feel
- one of several representations for schwa, ə e.g. taken (takən), enemy (enəmy)
etc.
'e' is a mess. It's mostly silent, either ignored completely or modifying something else (an issue even Benjamin Franklin tried to solve through a proposed spelling reform). It's also conflated with schwa, which is the most common vowel sound in English yet has no single representation.
A language reformer would probably tackle this letter first and fix a great deal of the spelling problems in English.
You switched to "reformer" in your closing sentence. Perhaps that was what you originally meant, too?
Of course, such a reform is not exactly easy to implement.
e.g. https://en.wikipedia.org/wiki/Hangul https://en.wikipedia.org/wiki/Cyrillic and I guess even https://en.wikipedia.org/wiki/Klingon_language
This is different than the meaning of https://en.wikipedia.org/wiki/Natural_language and https://en.wikipedia.org/wiki/Constructed_language
I guess if you want to get pedantic a better term might be "Orthographic design".