BTW, the test statistic for recognizing individual words isn't interesting unless you sample/weight by word frequency.
Indonesian and Malay insert vowels in consonant clusters and replace quite a few consonants, so they should be easily distinguishable from Dutch, even on Dutch loan words (which are not that frequent anyway).
Dutch has a much larger overlap with German (probably the largest), but even those can be distinguished (by a human) with just a few words of a meaningful sentence. I find it difficult to come up with three words that could be a grammatical fragment in both languages, but even then I expect the n-gram frequencies to diverge quite a bit.
Can this be a problem? If a text in Language_A includes names/words from Language_B, relying only on special characters would wrongly classify the entire text as Language_B.
For example, given "'I think, therefore I am' is the first principle of René Descartes's philosophy that was originally published in French as je pense, donc je suis.", is there a library that would tell me the main passage is in English, but contains fragments in French?
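I believe Lingua (linked elsewhere in this thread) advertises detection of multiple languages within one text. The general idea can be sketched with the standard library alone: score a small window around each word against per-language character-bigram profiles and label the word with the winner. Everything below (training snippets, window size, scoring) is a made-up toy illustration, not any library's actual method:

```python
from collections import Counter

def profile(text):
    """Relative character-bigram frequencies of a training text."""
    t = text.lower()
    counts = Counter(t[i:i + 2] for i in range(len(t) - 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def score(text, prof):
    """Average profile mass of the text's bigrams."""
    t = text.lower()
    grams = [t[i:i + 2] for i in range(len(t) - 1)]
    return sum(prof.get(g, 0.0) for g in grams) / max(len(grams), 1)

def label_words(text, profiles, window=3):
    """Label each word with the best-scoring language of its surrounding window."""
    words = text.split()
    half = window // 2
    labels = []
    for i in range(len(words)):
        ctx = " ".join(words[max(0, i - half): i + half + 1])
        labels.append(max(profiles, key=lambda lang: score(ctx, profiles[lang])))
    return list(zip(words, labels))

# Toy training snippets -- real systems train on large corpora.
profiles = {
    "en": profile("i think therefore i am is the first principle of the philosophy"),
    "fr": profile("je pense donc je suis la philosophie le premier principe"),
}
for word, lang in label_words("the first principle is je pense donc je suis", profiles):
    print(word, lang)
```

Words near the boundary between the two fragments can flip either way with a toy scorer like this; real implementations smooth adjacent labels into contiguous spans.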
I only dabbled in language detection at a workshop at a conference years ago, but I was very impressed by how well such models work on short text with only bigrams.
Maybe bi- and tri-grams fall short once you expand to over 70 languages, but I just wanted to say that this is a use case where very simple models can get you really far.
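The bigram approach the parent describes can be sketched in a few lines of standard-library Python. The training sentences and the scoring rule here are a toy illustration, not how any particular library does it:

```python
from collections import Counter

def bigrams(text):
    """Character-bigram counts of a lowercased text."""
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))

def train(samples):
    """Turn one sample text per language into a relative-frequency profile."""
    profiles = {}
    for lang, text in samples.items():
        counts = bigrams(text)
        total = sum(counts.values())
        profiles[lang] = {g: c / total for g, c in counts.items()}
    return profiles

def detect(text, profiles):
    """Pick the language whose profile gives the text's bigrams the most mass."""
    counts = bigrams(text)
    return max(
        profiles,
        key=lambda lang: sum(
            profiles[lang].get(g, 0.0) * c for g, c in counts.items()
        ),
    )

# Toy corpora -- real systems train on millions of characters per language.
samples = {
    "en": "the quick brown fox jumps over the lazy dog and then runs away",
    "nl": "de snelle bruine vos springt over de luie hond en rent dan weg",
}
profiles = train(samples)
print(detect("the dog runs", profiles))  # en
print(detect("de hond rent", profiles))  # nl
```

Even with one training sentence per language, the overlap in frequent bigrams ("th"/"he" for English, "de"/"en" for Dutch) is enough to separate these short inputs.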
If you see a blog post where a language detection problem is solved with deep learning, chances are the author doesn't know what they are doing (Towards Data Science, I'm looking at you!) or it's a tutorial for working with an NN framework.
By "real", I mean texts in a mix of multiple languages (super common on the web); short texts; texts in a different (unknown) language, where n-gram models don't know how to say "I don't know" and return rubbish instead; texts in closely related languages; etc.
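The "I don't know" failure mode is fixable in principle: threshold the average per-bigram log-likelihood and refuse to answer when even the best-fitting language scores poorly. A stdlib-only toy sketch (the floor and threshold values are arbitrary choices, not taken from any library):

```python
import math
from collections import Counter

def profile(text):
    """Relative character-bigram frequencies of a training text."""
    t = text.lower()
    counts = Counter(t[i:i + 2] for i in range(len(t) - 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def avg_log_likelihood(text, prof, floor=1e-6):
    """Mean log-probability per bigram; unseen bigrams get a tiny floor value."""
    t = text.lower()
    grams = [t[i:i + 2] for i in range(len(t) - 1)]
    return sum(math.log(prof.get(g, floor)) for g in grams) / max(len(grams), 1)

def detect_or_unknown(text, profiles, threshold=-8.0):
    """Return the best language, or 'unknown' if even the best fit is poor."""
    scores = {lang: avg_log_likelihood(text, p) for lang, p in profiles.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else "unknown"

# Toy corpora -- real systems train on far more text per language.
profiles = {
    "en": profile("the quick brown fox jumps over the lazy dog and then runs away"),
    "nl": profile("de snelle bruine vos springt over de luie hond en rent dan weg"),
}
print(detect_or_unknown("the dog runs", profiles))   # en
print(detect_or_unknown("xqzjwx kqvxz", profiles))   # unknown
```

Text in a language outside the training set mostly hits the floor probability, so its average log-likelihood drops well below that of any genuine match.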
Going "deep learning" is not the only alternative. Even simpler methods can work significantly better, while being fully interpretable:
https://link.springer.com/chapter/10.1007/978-3-642-00382-0_...
[1] https://fasttext.cc/docs/en/language-identification.html
I’ll try to find time to do it myself, but most probably only tomorrow.
I see https://github.com/google/cld3, but how does this compare with https://github.com/CLD2Owners/cld2 which is used by the large https://commoncrawl.org project to classify billions of samples from the whole internet?
https://github.com/pemistahl/lingua-py#4-how-good-is-it
CLD2 seems to be slightly less accurate than CLD3 on average.
I'm the author of Lingua. Thank you for sharing my work and making it known in the NLP world.
Apart from the Go implementation, I've implemented the library in Kotlin, Python and Rust. Just take a look at my profile if you are interested: https://github.com/pemistahl