It's a 9M Conformer-CTC model trained on ~300h (AISHELL + Primewords), quantized to INT8 (11 MB), runs 100% in-browser via ONNX Runtime Web.
Grades per-syllable pronunciation + tones with Viterbi forced alignment.
Try it here: https://simedw.com/projects/ear/
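For anyone unfamiliar with the alignment step, here is a minimal pure-Python sketch of Viterbi forced alignment over CTC output. It is not the project's code: the function name `ctc_forced_align`, the blank index of 0, and the toy three-symbol vocabulary are assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")

def ctc_forced_align(log_probs, targets, blank=0):
    """Return one (start_frame, end_frame) span per target token, end exclusive.

    log_probs: list of frames, each a list of per-symbol log-probabilities.
    targets:   the symbol ids the speaker was asked to say, in order.
    """
    if not targets or not log_probs:
        raise ValueError("need at least one frame and one target")
    T = len(log_probs)
    # Interleave blanks: [blank, y1, blank, y2, ..., yL, blank]
    ext = [blank]
    for tok in targets:
        ext += [tok, blank]
    S = len(ext)

    score = [[NEG_INF] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = log_probs[0][ext[0]]
    score[0][1] = log_probs[0][ext[1]]

    for t in range(1, T):
        for s in range(S):
            # Stay, advance by one, or skip a blank between two different tokens.
            candidates = [s]
            if s >= 1:
                candidates.append(s - 1)
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(s - 2)
            prev = max(candidates, key=lambda p: score[t - 1][p])
            if score[t - 1][prev] == NEG_INF:
                continue
            score[t][s] = score[t - 1][prev] + log_probs[t][ext[s]]
            back[t][s] = prev

    # A valid path ends on the last token or on the trailing blank.
    s = S - 1 if score[T - 1][S - 1] >= score[T - 1][S - 2] else S - 2
    if score[T - 1][s] == NEG_INF:
        raise ValueError("too few frames for this target sequence")
    path = [0] * T
    for t in range(T - 1, -1, -1):
        path[t] = s
        s = back[t][s]

    spans = [None] * len(targets)
    for t, s in enumerate(path):
        if s % 2 == 1:  # odd states are real tokens, even states are blanks
            i = s // 2
            start = spans[i][0] if spans[i] else t
            spans[i] = (start, t + 1)
    return spans

# Toy input: symbols are 0=blank, 1="A", 2="B"; five frames of probabilities.
probs = [
    [0.1, 0.8, 0.1],
    [0.1, 0.8, 0.1],
    [0.8, 0.1, 0.1],
    [0.1, 0.1, 0.8],
    [0.1, 0.1, 0.8],
]
frames = [[math.log(p) for p in row] for row in probs]
spans = ctc_forced_align(frames, [1, 2])
print(spans)  # [(0, 2), (3, 5)]
```

Each span gives the frames assigned to one expected syllable, which is what a per-syllable grade would then be computed over.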
This is a great initiative and I hope to see more come out of this; I am not criticizing, but just want to provide my user experience here so you have data points.
In short, my experience lines up with your native speakers.
I found that it loses track of the phonemes when speaking quickly, and tones don't seem to line up when speaking at normal conversational speed.
For example, if I say 他是我的朋友 at normal conversational speed, it assigns `de` to 我, and sometimes it decides I dropped the retroflex in `shi` and renders it as `si`. I listened back to make sure I said everything: the phonemes are there in the recording, but the UI displays the wrong phonemes and tones.
By contrast, if I speak slowly and really push each tone, the phonemes and tones all register correctly.
Also, is this taking into account tone transformation? For example, third tones (bottom out tone) tend to smoosh into a second tone (rising) when multiple third tones are spoken in a row. Sometimes the first tone influences the next tone slightly, etc.
Again, great initiative, but I think it needs a way to deal with speech that is conversationally spoken and maybe even slurred a bit due to the nature of conversational level speech.
Hoping to see improvements in this area
I have just added sandhi support, please let me know if it's working better.
Will comment that the shorter phrases (2-4 characters long) were generally accurate at normal speed, but the longer sentences have issues.
Maybe focusing on the accuracy of the smaller phrases and then scaling that might be a good way to go, since those smaller phrases are returning better accuracy.
Again, really think this is a great initiative, want to see how it grows. :)
Will check once the TV is off in the house. :)
The classical example is 4/4 不是. Which goes bùshì -> búshì.
Or 3/3 that becomes 2/3. E.g. 你好 nǐhǎo becoming níhǎo.
The 1/4 -> 2/4 transformation is, I think, specific to 一 (one): 一个 yīgè becomes yígè.
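The three rules listed above can be sketched as a small rewrite over citation tones. This is only an illustration of those rules, not how the tool implements sandhi: the function name `apply_sandhi` and the `(hanzi, tone)` input shape are assumptions, and longer runs of third tones, which depend on phrasing, are handled naively.

```python
def apply_sandhi(syllables):
    """Map citation tones to spoken tones.

    syllables: list of (hanzi, citation_tone) pairs, tones numbered 1-5.
    Only covers the cases mentioned in this thread:
      3 + 3 -> 2 + 3, and 不 (4) or 一 (1) before a fourth tone -> 2.
    """
    spoken = []
    for i, (char, tone) in enumerate(syllables):
        # Look ahead at the citation tone of the following syllable, if any.
        next_tone = syllables[i + 1][1] if i + 1 < len(syllables) else None
        if tone == 3 and next_tone == 3:
            tone = 2
        elif char == "不" and tone == 4 and next_tone == 4:
            tone = 2
        elif char == "一" and tone == 1 and next_tone == 4:
            tone = 2
        spoken.append(tone)
    return spoken

print(apply_sandhi([("不", 4), ("是", 4)]))  # [2, 4]
print(apply_sandhi([("你", 3), ("好", 3)]))  # [2, 3]
print(apply_sandhi([("一", 1), ("个", 4)]))  # [2, 4]
```

A grader would compare the detected tones against the output of a function like this rather than against the dictionary tones.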
There's one thing that gave me pause: In the phrase 我想学中文 it identified "wén" as "guó". While my pronunciation isn't perfect, there's no way that what I said is closer to "guó" than to "wén".
This indicates to me that the model learned word structures instead of tones here. "Zhōng guó" probably appears in the training data a lot, so the model has a bias towards recognizing that.
- Edit -
From the blog post:
> If my tone is wrong, I don’t want the model to guess what I meant. I want it to tell me what I actually said.
Your architecture also doesn't tell you what you actually said. It just maps what you said to the likeliest of the 1254 syllables that you allow. For example, it couldn't tell you that you said "wi" or "wr" instead of "wo", because those syllables don't exist in your setup.
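To make the closed-vocabulary point concrete, here is a hedged sketch of what picking the likeliest entry from a fixed inventory looks like. The three-entry inventory and the logits are made up for illustration, and `classify` is a hypothetical helper, not the model's actual decoding code.

```python
import math

def classify(logits, inventory):
    """Return (label, probability) for the best-scoring entry of a fixed inventory."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # softmax, shifted for stability
    total = sum(exps)
    best = max(range(len(inventory)), key=lambda i: exps[i])
    return inventory[best], exps[best] / total

# Hypothetical inventory and scores; a real model would have ~1254 entries.
inventory = ["wo3", "wen2", "guo2"]
label, prob = classify([0.2, 0.1, 0.3], inventory)
print(label, round(prob, 2))
```

Whatever was actually said, the label always comes from `inventory`. The most such a setup can do for an out-of-inventory sound like "wi" is report a low probability for the winner.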
Although I like the active aspect of the approach. Language apps where sound is the main form of learning should have a great advantage, as any written text just confuses as every country has its own spin on orthography. Even pinyin, despite making sense, for a beginner, has so many conflicting symbols.
Can you elaborate? I'm not sure I understand.
Though, as a guy from Beijing who speaks perfect Mandarin, I struggle even to pass the easy ones… So it could definitely use some improvements. The example 你好吃饭了吗 returns hào → hǎo, fān → fàn, le → liǎo. In the first two the model misheard my tones, and the last one should be le rather than liǎo in this context.
Also, I see people in the comment section are worried about tones. I can guarantee tones are not particularly useful and you can communicate with native speakers with all the tones messed up and that’s perfectly fine. As soon as you leave Beijing, you’ll find all the tones are shuffled, because every region has its own dialect and accent, and that doesn’t stop people from communicating at all. So don’t let tone stuff slow your learning process down.
> I can guarantee that tones are not particularly useful and that you can communicate with native speakers with all the tones messed up, and that's perfectly fine.
Not at all. Tones are extremely important. If you have all the tones messed up, you can hardly communicate in Mandarin. It's true, as you said, that different regions of China have different dialects, and you'll find that people can communicate normally because: 1) The tonal differences in nearby regions are not too significant, and people can still try to understand based on context. And 2) In many cases, people switch to regular Mandarin when their dialects are not mutually intelligible. This is why Mandarin exists. It is an officially regulated dialect that all Chinese people learn, to solve the dialect problem among different regions. Chinese people may speak their own dialects in their hometowns, but when two Chinese people meet and find that they cannot understand each other's dialects, they immediately switch to Mandarin. Therefore, the tones in Mandarin are very important. To a considerable extent, Mandarin exists because of tones. You cannot communicate in it with messed up tones.
> To a considerable extent, Mandarin exists because of tones. You cannot communicate in it with messed up tones.
These statements are false. If they were true, it would be impossible to understand written tone-free pinyin; in reality, it's not just possible but easy.
Even for non-Mandarin/Guanhua, such as the Shanxi dialect, I can understand them because the pronunciation is much closer to mine, just the tones are completely novel.
Point being, this idea of a Universal Reference is exactly the kind of linguistic erasure that is wrongheaded to begin with. Nor does this completely prevent comprehension. These debates underestimate how much human communication is contextual: you read what I wrote above, and for most of it your mind was already filling in (gasp, like an LLM) the next words, enabling you to read relatively quickly.
"As soon as you leave Beijing, you’ll find all the tones are shuffled, because every region has its own dialect and accent, and that doesn’t stop people from communicating at all."
Isn't this in fact one of the reasons why China relies heavily on the written language: the different regions lose the ability to communicate by voice, because the changes in tones and pronunciation render the language unintelligible to people from other regions?
That might be true between native speakers of similar enough dialects who otherwise speak "properly" with each other: proper grammar, idiomatic expressions, predictable accents (also regarding tones, which are not random, just different patterns from the standard). Language learners make errors in all these categories, so giving them more motivation to neglect the tones is harmful. If tones were completely irrelevant to intelligibility, they would have disappeared long ago.
Probably because it's a legacy and disappearing slowly? Modern Mandarin only has four tones left and has already lost tone patterns.
Do you know there's a "robot tone" in Chinese? It simply swaps every character's tone for the flat first tone. Though it rests on the stereotypical false assumption that robots have trouble with tones, kids in the late last century often communicated in that tone for fun without issues.
At the end of the day, vocal Chinese is always ambiguous with or without tones and in practice heavily relies on context. It requires written language to truly fix that.
I just tried the tool and it couldn't properly recognize a very clearly pronounced "吃" and instead heard some shi2. I think it needs more training data or something. Or one needs a good mic.
The other two are probably things that could be fixed with a bigger and more varied dataset.
I've found that especially true with Mandarin because (I think) a beginner speaker is more likely to speak a little more quickly, which allows the listener to essentially ignore the occasional incorrect or slightly mispronounced tone and understand what they're trying to say.
(This is anecdotal, but with n>1. Discussed and observed with other Mandarin language learners)
90% of the effort in learning any language is just learning massive amounts of vocabulary.
Things like tone and grammar are the very basics that you learn right at the beginning.‡ Beginners complain about them, but after a few months of studying Chinese, you should be fairly comfortable with the tones. Then, you spend years learning vocabulary.
The two things that make Chinese difficult are:
1. The lack of shared vocabulary with Indo-European languages (this obviously doesn't apply if your native language is something with more shared vocabulary with Chinese).
2. The writing system, which because it's not phonetic requires essentially the same level of effort as learning an entirely new language (beyond spoken Chinese).
‡. The same goes for grammar issues (like declension and conjugation) that people always complain about when learning Indo-European languages. These are the very basics that you learn early on. Most of the real effort is in learning vocab.
Disagree slightly with this- pronouncing the tones individually and getting to the point where you can be understood isn't too hard (well still hard), but combining them when speaking more quickly is more challenging, especially if you want it to flow nicely, and adding emphasis while maintaining the tones. Not that it's mandatory if you just want to understand/be understood, it depends on one's goals.
It's a common misconception that it's enough just to learn the tones and move on, and it's very hard to find teachers who are able to help with more advanced pronunciation.
Most of it is passively paying attention. It should not be a struggle; it's one of those things where the more you struggle and overintellectualize, the less time you spend paying attention and letting your hearing do the work it evolved to do.
The other thing is this whole emphasis on accents is misdirected. Teachers do not place this excessive emphasis on accents, it is people who want to sound "authentic" which is not a very wise goal of language learning in the first place.
I do think that learning music can help a little, especially a sonically complex instrument like violin and the like.
(caveat: I'm way oversimplifying on my Saturday afternoon, but that's my tentative views on this that I would try to argue for.)
This is an interesting observation. Another one that I sometimes mention to my friends who didn't have an occasion to learn Chinese before is that in this language speaking, reading and writing are actually 3 separate components. You can read characters without knowing how to write them properly or even remembering them entirely. Lots of my Taiwanese acquaintances forget how to write certain characters, because nowadays most of the text they write is in bopomofo on their phones. Bopomofo represents sounds, so basically knowing how an expression sounds and being able to read the character (pick it from a set of given characters for the chosen sound) is enough to "write" it.
It took me a very long time to really understand how important tone is in Chinese.
Chinese does not have clusters of consonants like "rst" in "first." The closest thing in Chinese phonology to "first" would be something like "fi-re-se-te." If you grow up never pronouncing consonant clusters, they are incredibly difficult to learn.
This is all related to the existence of tones, but tones are not the direct reason why Chinese people have difficulty pronouncing words like "first." Tone provides one additional way of differentiating syllables, so Chinese can get away with having far fewer syllables than non-tonal languages. You essentially get 4-5 different versions of every syllable.
> The number of vowels is subject to greater variation; in the system presented on this page there are 20–25 vowel phonemes in Received Pronunciation, 14–16 in General American and 19–21 in Australian English.
Native English speakers, if they are not teachers, tend to underestimate the challenge. I see YouTube videos in which Western learners of Chinese hypothesize that Spanish is the most difficult language for Chinese speakers to learn because of the RR consonant -- I learned Spanish casually for a few years and I disagree. RR is difficult to pronounce, but I can clearly hear it and I won't confuse it with a different sound. In contrast to English, Spanish vowels are so easy.
Whereas in Chinese or to a lesser degree English, you have to be very mindful of how you pronounce stuff.
As a native Spanish speaker, the thing I dread the most is grammar and the absurd number of verb tenses there are. Even native speakers don't speak with perfect grammar.
I had no problems with tone pronunciation, but tone recognition was indeed much trickier. I still often get lost when listening to fast speech although I can follow formal speech (news) usually without problems.
At least, this is the case for slow text. Once the text is sped up it’s amazing how my brain just stops processing that information. Both listening and speaking.
I’m sure this will come with practice and time but for now I find it fascinating
My experience in learning Japanese pitch accent was eye-opening. At the start, I couldn't hear any difference. On quizzes I essentially scored the same as random guessing.
The first thing that helped me a lot was noticing how there were things in my native language (English) that used pitch information. For example, "uh-oh" has a high-low pitch. If you say it wrong it sounds very strange. "Uh-huh" to show understanding goes low-high. Again, if you reverse it it sounds unusual.
The next part was just doing lots of practice with minimal pairs. Each time I would listen and try my best to work out where the pitch changed. This took quite a lot of time. I feel like massed practice (many hours in a day) helped me more than trying to do 10 minutes regularly. Try to hear them correctly, but don't try too hard. I didn't have any luck with trying harder to 'understand' what was going on. I liken it to trying to learn to see a new color. There isn't much conscious thought.
The final piece of the puzzle was learning phrases, not individual words, that had pitch changes. For example: "yudetamago" could be boiled egg or boiled grandchildren. Somehow my brain just had a much easier time latching on to multi-word phrases instead of single words. Listening to kaki (persimmon) vs kaki (oyster) again and again seemed much harder.
Of course, your mileage may vary with these techniques. I already spoke decent Japanese when I started doing this.
Wow… Thanks for making it clear that English also has tones! I hadn’t thought of it this way before. “Uh-huh” sounds similar to Mandarin tones 3 & 2. “Uh-oh” is similar to Cantonese tones 1 & 3.
I’m wondering if we can find good examples to teach the Mandarin tones. I think two or three syllable words are best because it illustrates the contour of the tones.
It helped a lot even if I did look like an insane expat conducting an invisible orchestra.
One more thing: there's quite a bit of variation in how regional accents in the mainland can affect tonal pronunciation. It might be worth reaching out to some native speakers to give you some baseline figures.
A few years later, he had the most clean and consistent pronunciation out of anyone I'd been in a class with, and easily switched between the Beijing and other accents depending on which teacher we had on any given day.
I rather regret not emulating him, even though I haven't really used it for nearly 20 years and have forgotten most of it.
I used simple index finger motions to mark tones.
Highly recommend taking a look at Phonemica for this:
If you can’t easily hear your pronunciation mistakes so clearly it hurts, consider putting more energy into training your ear. Adult language learners usually have brains that have become resistant to, but not incapable of, changing the parts of the brain responsible for phoneme recognition. The neuroplasticity is still there but it needs some nudging with focused exercises that make it clear to your brain exactly what the problem is. Minimal pair recognition drills, for example, are a great place to start.
It’s not the most fun task, but it’s worth it. You will tighten the pronunciation practice feedback loop much more than is possible with external feedback, so a better accent is the most obvious benefit. But beyond that, it will make a night and day difference for your listening comprehension. And that will get you access to more interesting learning materials sooner. Which hopefully increases your enjoyment and hence your time on task. Plus, more accurate and automatic phoneme recognition leaves more neurological resources free for processing other aspects of your input materials. So it may even help speed things like vocabulary and grammar acquisition.
What has been extremely beneficial has been having the text and audio force-aligned and highlighted, karaoke-style, every time I hear the audio. It has improved my phoneme recognition remarkably well with remarkably little content. Several users also report the same thing - that even native speech feels a lot more like separate words than just a slew of sounds. I attribute this in large part to this karaoke-style audio. It works better for phonetic scripts, so I would recommend using this with pinyin/jyutping/furigana for character-based languages.
For production, when I was at Regina Coeli (world-class language institute) their main thing was just 1. you hear a short passage in Dutch, 10-40 words 2. you record yourself reading the same passage and 3. you play back the two audio tracks on top of one another and listen for the difference. Optional step 4. Re-record and replay until it’s close enough.
There was no grading, no teacher checking recordings, no right or wrong; just hundreds of random sentences and a simple app to layer them. You needed to learn to hear the differences yourself and experiment until you no longer could. (fwiw this is not present in phrasing, I just found it relevant. One day soon I hope to add it!)
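The layering step in that routine is simple enough to sketch. This is an assumption-laden illustration, not the institute's app: it presumes both recordings are already decoded to mono float samples at the same sample rate, and `overlay` is a hypothetical helper name.

```python
from itertools import zip_longest

def overlay(reference, attempt):
    """Mix two mono tracks sample by sample.

    Both inputs are sequences of floats in [-1, 1]. The shorter track is
    padded with silence, and averaging keeps the mix inside [-1, 1].
    """
    return [(a + b) / 2 for a, b in zip_longest(reference, attempt, fillvalue=0.0)]

mixed = overlay([1.0, 0.0], [0.0, 0.0, -1.0])
print(mixed)  # [0.5, 0.0, -0.5]
```

Playing back `mixed` is what lets the learner hear where their own timing or pitch drifts from the reference.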
I feel like listening is the key to speaking. You don't necessarily need to rote learn the tones for each word. You just need to say words as you hear them spoken by others.
The thing you've built is so good, and I would have loved to have it when I was learning Mandarin.
I tried it with a couple of sentences and it did a good job of identifying which tones were off.
( I’m learning using a flashcards web app I made and continue to update with vocab I encounter or need: https://memalign.github.io/m/mandarin/cards/index.html )
I am mostly developing this for myself, to have the perfect tool for me, but I dare say that I have not seen anything comparable, and I let my 10y+ experience in learning Chinese influence my design decisions. Oh, and it is free/libre software of course (AGPL). It comes with an ever-improving vocabulary file that has tons of metadata about words, their usage, how to memorize them, etc. under ODbL (Open Database License).
[1]: https://codeberg.org/ZelphirKaltstahl/xiaolong-dictionary
It takes text, adds colours for tones, pinyin, literal, and parallel translations.
There’s also a character decomposition tool at the bottom of the page which can be helpful if you’re able to recognise half a character but can’t remember the pronunciation for typing it.
The YouTube channel has some song lyrics, movie subtitles, and audio Bible that might help with learning.
What I noticed though is, that some of the components don't seem to be like what I would expect to be shown as components. For example I tried the word 衣服 and 服 is shown to have the component "二". I guess one could see it that way, but some other dictionaries stop at 月 which itself is a component with set meaning (moon) and usage as radical (often for body parts). My favorite online normal dictionary for example: https://www.mdbg.net/chinese/dictionary?page=worddict&wdrst=... (hover over 3 dots of character and click the button with the 字 and scissors to see decomposition) says:
服 = 月 + 𠬝
𠬝 = 卩 + 又
If you go further, wouldn't you also have to decompose "二" into "一" and "一"? A Chinese teacher told me there are various approaches for decomposition, so this might not be a science or that rigorous, but I think consistency would then dictate, that you decompose "二" as well. I don't always agree fully with their decomposition either and usually I stop at any component, that still has meaning by itself, which can be pretty low level 1 or 2 strokes components already. For determining that, I also use information from a language school, which I copied into a repo: https://codeberg.org/ZelphirKaltstahl/language-learning/src/... "All radicals from their website". Also useful for memorizing the characters, if one can derive a mnemonic for a character from its components and their meaning.
The advanced UI looks very complex, but I don't mind that. In fact it is quite cool! Just has some stuff I don't even know what it is about. I noticed, that once one toggles the advanced UI, I didn't find a way to toggle it back to simple again.
Bookmarked!
Pronunciation correction is an insanely underdeveloped field. Hit me up via email/twitter/discord (my bio) if you're interested in collabing.
[0]: https://gist.github.com/anchpop/acbfb6599ce8c273cc89c7d1bb36...
Cantonese tones are also different from those of Mandarin, so no, it can't be adopted for Cantonese and it would require a complete rework.
> It is a surprisingly difficult language to learn.
I keep hearing this quite a bit, but I do not find Cantonese to be any more difficult than most languages[0]. Or at least we would need to define a metric based on which we could assess the difficulty. If it is the number of tones, their number (six – no, not nine) may look formidable at first, but they are, in fact, rather simple tones and broadly fall into three categories: flat, rising, and falling. As a random example, Cantonese does not even have a dipping tone.
In comparison, «fancy» tones of Vietnamese are significantly more challenging or even difficult – they can curl and unfurl (so to speak).
[0] That crown appears to belong to Archi, with honourable mentions going out to Inuit, Basque, Georgian, Navajo, Yimas and several other polysynthetic languages.
1. tones, and generally the gatekeeping of some Cantonese communities towards people who haven't gotten the tones completely right
2. the lack of learning materials relative to the number of speakers, the confusion between written Chinese and written Cantonese (and also the general lack of the latter)
As they say, "a language is a dialect with an army and navy"... I'll leave it at that.
There are still holdouts!
Come back to me in a couple of decades when the trove of humanity's data has been pored over and drifted further out of sync with (verifiable) reality.
Hand-tuning is the only way to make progress when you've hit a domain's limits. Go deep and have fun.
Just curious - would you need insane HW infrastructure to begin with, or can it be hosted/managed? And what tooling does the industry prefer for the "training"?
I played around with python scripts for the same purpose. The AI gives feedback that can be transformed to a percentage of correctness. One annoyance is that for Mandarin, the percentage is calculated at the character level, whereas with English, it gives you a more granular score at the phoneme level.
> One annoyance is that for Mandarin, the percentage is calculated at the character level, whereas with English, it gives you a more granular score at the phoneme level.
This is the case for most solutions you'd find for this task. Probably because of the 1 character -> 1 syllable property. It's pretty straightforward to split the detected pinyin into initial+final and build a score from that though.
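A minimal sketch of that initial+final split, assuming numbered pinyin such as `shi4` as input. The helper names, the choice to treat `y` and `w` as initials, and the equal weighting of initial, final, and tone are all simplifications made for illustration.

```python
# Two-letter initials come first so "zh" is matched before "z".
INITIALS = (
    "zh", "ch", "sh",
    "b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h",
    "j", "q", "x", "r", "z", "c", "s", "y", "w",
)

def split_syllable(syllable):
    """Split numbered pinyin into (initial, final, tone); e.g. 'shi4' -> ('sh', 'i', 4)."""
    tone = 5  # treat a missing tone digit as neutral tone
    if syllable[-1].isdigit():
        syllable, tone = syllable[:-1], int(syllable[-1])
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):], tone
    return "", syllable, tone  # zero initial, e.g. 'an1'

def syllable_score(expected, heard):
    """Fraction of the three components (initial, final, tone) that match."""
    pairs = zip(split_syllable(expected), split_syllable(heard))
    return sum(a == b for a, b in pairs) / 3

print(split_syllable("shi4"))                    # ('sh', 'i', 4)
print(round(syllable_score("shi4", "si4"), 2))   # 0.67
print(round(syllable_score("chi1", "shi2"), 2))  # 0.33
```

With this, a dropped retroflex (`shi` heard as `si`) costs one third of the syllable's score instead of failing the whole character.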
I suck at Chinese but I want to get better, and I'm too embarrassed to try and talk with real people and practise.
This is a great compromise. Even just practising for a few minutes, I already feel way more confident based on its feedback, and I feel like I know more about the details of pronunciation.
I'm worried this might get too big and start sucking like everything else.
I'm sure there are a bunch of apps out there that claim they do the same thing, but they don't, IMO. Even if they do, as you said, where is the fun in that?
Great post, thanks for it!
For me it doesn't work very well. Even easy phrases like 他很忙 get transcribed as something completely random, like "ma he yu". Is it maybe overfitted to some type of voice?
It would be cool if a model could tell you if you are singing or playing a piece of music with the right intonation and other ways.
I had a quick look at Farsi datasets, and there seem to be a few options. That said, written Farsi doesn’t include short vowels… so can you derive pronunciation from the text using rules?
You can't, but Farsi dictionaries list the missing short vowels/diacritics/"eraab" for every word.
For instance, see this entry: https://vajehyab.com/dehkhoda/%D8%AD%D8%B3%D8%A7%D8%A8?q=%D8...
With the short vowel on the first letter it would be written حِساب (normally written as just حساب)
The dictionary entry linked shows that there is a ِ on the first letter ح
But you would have to disambiguate between homographs that differ only in the eraab.
https://pingtype.github.io/farsi.html
Paste in some parallel text (e.g. Bible verses, movie subtitles, song lyrics) and read what Farsi you can on the first line, looking to the lower lines for clues if you get stuck.
The core version of Pingtype is for traditional Chinese, but it supports a few other languages too.
- māmā (incorrect)
- māma (correct)