BTW, the test statistic for recognizing individual words isn't interesting unless you sample/weight by word frequency.
Indonesian and Malay insert vowels in consonant clusters and replace quite a few consonants, so they should be easily distinguishable from Dutch, even on Dutch loan words (which are not that frequent anyway).
Dutch has a much larger overlap with German (probably the largest), but even those can be distinguished (by a human) with just a few words of a meaningful sentence. I find it difficult to come up with three words that could be a grammatical fragment in both languages, but even then I expect the n-gram frequencies to diverge quite a bit.
Can this be a problem? If a text in Language_A includes names/words from Language_B, relying only on special characters would wrongly classify the entire text as Language_B.
For example, given "'I think, therefore I am' is the first principle of René Descartes's philosophy that was originally published in French as je pense, donc je suis.", is there a library that would tell me the main passage is in English, but contains fragments in French?
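I believe Lingua (linked elsewhere in this thread) advertises detection of multiple languages within one text. The general idea can be sketched with the standard library alone: score a small window around each word against per-language character-bigram profiles and label the word with the winner. Everything below (training snippets, window size, scoring) is a made-up toy illustration, not any library's actual method:

```python
from collections import Counter

def profile(text):
    """Relative character-bigram frequencies of a training text."""
    t = text.lower()
    counts = Counter(t[i:i + 2] for i in range(len(t) - 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def score(text, prof):
    """Average profile mass of the text's bigrams."""
    t = text.lower()
    grams = [t[i:i + 2] for i in range(len(t) - 1)]
    return sum(prof.get(g, 0.0) for g in grams) / max(len(grams), 1)

def label_words(text, profiles, window=3):
    """Label each word with the best-scoring language of its surrounding window."""
    words = text.split()
    half = window // 2
    labels = []
    for i in range(len(words)):
        ctx = " ".join(words[max(0, i - half): i + half + 1])
        labels.append(max(profiles, key=lambda lang: score(ctx, profiles[lang])))
    return list(zip(words, labels))

# Toy training snippets -- real systems train on large corpora.
profiles = {
    "en": profile("i think therefore i am is the first principle of the philosophy"),
    "fr": profile("je pense donc je suis la philosophie le premier principe"),
}
for word, lang in label_words("the first principle is je pense donc je suis", profiles):
    print(word, lang)
```

Words near the boundary between the two fragments can flip either way with a toy scorer like this; real implementations smooth adjacent labels into contiguous spans.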
I only dabbled in language detection at a workshop at a conference years ago, but I was very impressed by how well such models work on short text with only bigrams.
Maybe bi- and tri-grams fall short once you expand to over 70 languages, but I just wanted to say that this is a use case where very simple models can get you really far.
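The bigram approach the parent describes can be sketched in a few lines of standard-library Python. The training sentences and the scoring rule here are a toy illustration, not how any particular library does it:

```python
from collections import Counter

def bigrams(text):
    """Character-bigram counts of a lowercased text."""
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))

def train(samples):
    """Turn one sample text per language into a relative-frequency profile."""
    profiles = {}
    for lang, text in samples.items():
        counts = bigrams(text)
        total = sum(counts.values())
        profiles[lang] = {g: c / total for g, c in counts.items()}
    return profiles

def detect(text, profiles):
    """Pick the language whose profile gives the text's bigrams the most mass."""
    counts = bigrams(text)
    return max(
        profiles,
        key=lambda lang: sum(
            profiles[lang].get(g, 0.0) * c for g, c in counts.items()
        ),
    )

# Toy corpora -- real systems train on millions of characters per language.
samples = {
    "en": "the quick brown fox jumps over the lazy dog and then runs away",
    "nl": "de snelle bruine vos springt over de luie hond en rent dan weg",
}
profiles = train(samples)
print(detect("the dog runs", profiles))  # en
print(detect("de hond rent", profiles))  # nl
```

Even with one training sentence per language, the overlap in frequent bigrams ("th"/"he" for English, "de"/"en" for Dutch) is enough to separate these short inputs.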
If you see a blog post where a language detection problem is solved with deep learning, chances are the author doesn't know what they are doing (Towards Data Science, I'm looking at you!) or it's a tutorial for working with an NN framework.
By "real", I mean texts in a mix of multiple languages (super common on the web); short texts; texts in a different (unknown) language, where n-gram models don't know how to say "I don't know" and return rubbish instead; texts in closely related languages; etc.
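The "I don't know" failure mode is fixable in principle: threshold the average per-bigram log-likelihood and refuse to answer when even the best-fitting language scores poorly. A stdlib-only toy sketch (the floor and threshold values are arbitrary choices, not taken from any library):

```python
import math
from collections import Counter

def profile(text):
    """Relative character-bigram frequencies of a training text."""
    t = text.lower()
    counts = Counter(t[i:i + 2] for i in range(len(t) - 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def avg_log_likelihood(text, prof, floor=1e-6):
    """Mean log-probability per bigram; unseen bigrams get a tiny floor value."""
    t = text.lower()
    grams = [t[i:i + 2] for i in range(len(t) - 1)]
    return sum(math.log(prof.get(g, floor)) for g in grams) / max(len(grams), 1)

def detect_or_unknown(text, profiles, threshold=-8.0):
    """Return the best language, or 'unknown' if even the best fit is poor."""
    scores = {lang: avg_log_likelihood(text, p) for lang, p in profiles.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else "unknown"

# Toy corpora -- real systems train on far more text per language.
profiles = {
    "en": profile("the quick brown fox jumps over the lazy dog and then runs away"),
    "nl": profile("de snelle bruine vos springt over de luie hond en rent dan weg"),
}
print(detect_or_unknown("the dog runs", profiles))   # en
print(detect_or_unknown("xqzjwx kqvxz", profiles))   # unknown
```

Text in a language outside the training set mostly hits the floor probability, so its average log-likelihood drops well below that of any genuine match.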
Going "deep learning" is not the only alternative. Even simpler methods can work significantly better, while being fully interpretable:
https://link.springer.com/chapter/10.1007/978-3-642-00382-0_...
[1] https://fasttext.cc/docs/en/language-identification.html
I’ll try to find time to do it myself, but most probably only tomorrow.
I see https://github.com/google/cld3, but how does this compare with https://github.com/CLD2Owners/cld2 which is used by the large https://commoncrawl.org project to classify billions of samples from the whole internet?
https://github.com/pemistahl/lingua-py#4-how-good-is-it
CLD2 seems to be slightly less accurate than CLD3 on average.
I'm the author of Lingua. Thank you for sharing my work and making it known in the NLP world.
Apart from the Go implementation, I've implemented the library in Kotlin, Python and Rust. Just take a look at my profile if you are interested: https://github.com/pemistahl