Show HN: I trained a 9M speech model to fix my Mandarin tones

(simedw.com)

469 points · simedw · 3mo ago · 153 comments

Built this because tones are killing my spoken Mandarin and I can't reliably hear my own mistakes.

It's a 9M Conformer-CTC model trained on ~300h (AISHELL + Primewords), quantized to INT8 (11 MB), runs 100% in-browser via ONNX Runtime Web.

Grades per-syllable pronunciation + tones with Viterbi forced alignment.
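
For readers unfamiliar with the term, here is a minimal sketch of textbook Viterbi forced alignment over CTC outputs. It is not the author's code: the function name `ctc_forced_align`, the blank index of 0, and the toy three-token vocabulary are all assumptions made for illustration. Given per-frame log-probabilities and the syllables the learner was supposed to say, it finds the single best path through the sequence "blank, syllable 1, blank, syllable 2, ..., blank" and reports which expected syllable each frame belongs to.

```python
import math

NEG_INF = float("-inf")

def ctc_forced_align(log_probs, targets, blank=0):
    """Align frames to a known target sequence (hypothetical helper).

    log_probs: list of T frames, each a list of per-token log-probabilities.
    targets:   list of token ids the speaker was expected to produce.
    Returns a list of length T: the index into `targets` for each frame,
    or -1 where the frame is aligned to the CTC blank.
    """
    T = len(log_probs)
    # Interleave blanks: [blank, t1, blank, t2, ..., blank]
    ext = [blank]
    for tok in targets:
        ext += [tok, blank]
    S = len(ext)

    score = [[NEG_INF] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    # A path may start in the leading blank or in the first label.
    score[0][0] = log_probs[0][blank]
    if S > 1:
        score[0][1] = log_probs[0][ext[1]]

    for t in range(1, T):
        for s in range(S):
            # Allowed moves: stay, advance one state, or skip a blank
            # (skipping is only legal between two different labels).
            cands = [(score[t - 1][s], s)]
            if s >= 1:
                cands.append((score[t - 1][s - 1], s - 1))
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append((score[t - 1][s - 2], s - 2))
            best, prev = max(cands)
            score[t][s] = best + log_probs[t][ext[s]]
            back[t][s] = prev

    # A valid path ends in the last label or the trailing blank.
    s = S - 1
    if S > 1 and score[T - 1][S - 2] > score[T - 1][S - 1]:
        s = S - 2
    path = []
    for t in range(T - 1, -1, -1):
        path.append(s)
        s = back[t][s]
    path.reverse()
    return [-1 if ext[s] == blank else (s - 1) // 2 for s in path]


def frame(best, n=3, hi=0.9):
    """Toy frame: log-probs that strongly favour token `best`."""
    lo = (1 - hi) / (n - 1)
    return [math.log(hi if i == best else lo) for i in range(n)]

# Five frames that sound like: token 1, token 1, silence, token 2, token 2
lp = [frame(1), frame(1), frame(0), frame(2), frame(2)]
print(ctc_forced_align(lp, [1, 2]))  # [0, 0, -1, 1, 1]
```

Once each frame is tied to an expected syllable, a grader can compare the model's per-frame scores for the expected tone against competing tones within that syllable's span. How the actual tool scores each span is not stated in the post.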

Try it here: https://simedw.com/projects/ear/

dapangzi3mo ago

Longtime lurker, made an account specifically to give feedback here as an intermediate speaker. :)

This is a great initiative and I hope to see more come out of this; I am not criticizing, but just want to provide my user experience here so you have data points.

In short, my experience lines up with your native speakers.

I found that it loses track of the phonemes when speaking quickly, and tones don't seem to line up when speaking at normal conversational speed.

For example, if I say 他是我的朋友 at normal conversational speed, it will assign `de` to 我, and sometimes it decides that I didn't produce the retroflex in `shi` and renders it `si`. I listened back to make sure I said everything: the phonemes are there in the recording, but the UI displays the wrong phonemes and tones.

By contrast, if I speak slowly and really push each tone, the phonemes and tones all register correctly.

Also, is this taking into account tone transformation? Example, third tones (bottom out tone) tend to smoosh into a second tone (rising) when multiple third tones are spoken in a row. Sometimes the first tone influences the next tone slightly, etc.

Again, great initiative, but I think it needs a way to deal with speech that is conversationally spoken and maybe even slurred a bit due to the nature of conversational level speech.

mercanlIl3mo ago

The tool definitely needs to address tone transformations, it’s a big part of how the language is spoken. Otherwise it’s mostly useful for a first year student speaking in isolation.

Hoping to see improvements in this area

simedwOP3mo ago

Thanks for the great feedback!

I have just added sandhi support, please let me know if it's working better.

dapangzi3mo ago

Still having some issues that match my previous comment, I'll try to follow your blog and give more feedback as you work on it.

Will comment that the shorter phrases (2-4 characters long) were generally accurate at normal speed, but the longer sentences have issues.

Maybe focusing on the accuracy of the smaller phrases and then scaling that might be a good way to go, since those smaller phrases are returning better accuracy.

Again, really think this is a great initiative, want to see how it grows. :)

dapangzi3mo ago

ACKing your comment.

Will check once the TV is off in the house. :)

sqs3mo ago

I don't think it takes care of tone transformation (eg 他是 ni3shi4 -> ni2shi4). Or if it does, my tones are just off. But it's a really cool idea!

carlmr3mo ago

他是 is tāshì which doesn't transform I think. Did you mean to write 你是 nǐshì? I think that transforms differently though. With the half 3rd tone only dropping.

The classical example is 4/4 不是. Which goes bùshì -> búshì.

Or 3/3 that becomes 2/3. E.g. 你好 nǐhǎo becoming níhǎo.

The 1/4 -> 2/4 transformation I think is specific to 一 (one). 一个 yīgè becomes yígè.
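
The three patterns listed in this comment can be sketched as a small rule function. This is only an illustration under stated assumptions: the name `apply_sandhi` and the `(hanzi, pinyin, tone)` tuple layout are made up here, it encodes only the rules mentioned above, and it treats runs of third tones pairwise, which ignores the phrase grouping that real speech depends on. It says nothing about how the tool's own sandhi support works.

```python
def apply_sandhi(syllables):
    """Return surface tones for a list of (hanzi, pinyin, tone) tuples.

    Hypothetical helper covering only the rules from the comment above:
      * 不 (bu4) before a 4th tone becomes 2nd tone
      * 一 (yi1) before a 4th tone becomes 2nd tone
      * a 3rd tone before another 3rd tone becomes 2nd tone
    Lookahead uses the dictionary (underlying) tone of the next syllable.
    """
    surface = []
    for i, (hanzi, _pinyin, tone) in enumerate(syllables):
        next_tone = syllables[i + 1][2] if i + 1 < len(syllables) else None
        if hanzi == "不" and tone == 4 and next_tone == 4:
            tone = 2
        elif hanzi == "一" and tone == 1 and next_tone == 4:
            tone = 2
        elif tone == 3 and next_tone == 3:
            tone = 2
        surface.append(tone)
    return surface


print(apply_sandhi([("你", "ni", 3), ("好", "hao", 3)]))   # [2, 3]
print(apply_sandhi([("不", "bu", 4), ("是", "shi", 4)]))   # [2, 4]
print(apply_sandhi([("一", "yi", 1), ("个", "ge", 4)]))    # [2, 4]
print(apply_sandhi([("他", "ta", 1), ("是", "shi", 4)]))   # [1, 4]
```

A grader could run the expected sentence through a function like this first, then compare the learner's tones against the surface tones rather than the dictionary tones.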

jhanschoo3mo ago

The tone sandhi example you just gave looks incorrect to me

jimz3mo ago

Well, OP wrote "he is" but then wrote "you are" in pinyin for one, and that's a bit hard to reconcile.

tifan3mo ago

I had the same issue! Perhaps being another dapangzi is the problem here lol

et-al3mo ago

I'm not familiar with this slang: what's a big plate?

allan_s3mo ago

It's slang for somebody fat. 子 does not carry a specific meaning; it is more of a character with a grammatical function, turning the word into a noun.

dirteater_3mo ago

the commenter's username (i'm guessing they mean 大胖子, feel free to google translate)

dapangzi3mo ago

胖 (pàng) means fat, vs 盘 (pán), which means plate.

Quite alright! We have to make mistakes to learn!

yunusabd3mo ago

Super nice, thanks for sharing!

There's one thing that gave me pause: In the phrase 我想学中文 it identified "wén" as "guó". While my pronunciation isn't perfect, there's no way that what I said is closer to "guó" than to "wén".

This indicates to me that the model learned word structures instead of tones here. "Zhōng guó" probably appears in the training data a lot, so the model has a bias towards recognizing that.

- Edit -

From the blog post:

> If my tone is wrong, I don’t want the model to guess what I meant. I want it to tell me what I actually said.

Your architecture also doesn't tell you what you actually said. It just maps what you said to the likeliest of the 1254 syllables that you allow. For example, it couldn't tell you that you said "wi" or "wr" instead of "wo", because those syllables don't exist in your setup.
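
To make the closed-vocabulary point concrete, here is a minimal sketch of greedy CTC decoding. The function name `greedy_ctc_decode`, the four-entry vocabulary, and the scores are all invented for illustration, and the tool may decode differently. Whatever the audio contains, each frame is mapped to the highest-scoring entry in a fixed list, so a sound like "wi" can only ever come out as one of the listed syllables.

```python
def greedy_ctc_decode(frame_scores, vocab, blank="<blank>"):
    """Pick the best vocabulary entry per frame, then apply the CTC collapse:
    merge consecutive repeats and drop blanks. (Hypothetical helper.)"""
    decoded = []
    previous = None
    for scores in frame_scores:
        best_index = max(range(len(vocab)), key=lambda i: scores[i])
        token = vocab[best_index]
        if token != blank and token != previous:
            decoded.append(token)
        previous = token
    return decoded


# Toy vocabulary: the decoder has no way to output anything outside it.
vocab = ["<blank>", "wo3", "wen2", "guo2"]
frames = [
    [0.1, 0.7, 0.1, 0.1],    # closest entry: wo3
    [0.1, 0.7, 0.1, 0.1],    # repeat of wo3, merged by the collapse step
    [0.8, 0.1, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.3, 0.5],    # ambiguous sound, still forced onto guo2
]
print(greedy_ctc_decode(frames, vocab))  # ['wo3', 'guo2']
```

The last frame shows the effect described above: a 0.5 versus 0.3 split is reported as a confident-looking "guo2", with no way to say "neither".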

vjerancrnjak3mo ago

I tried just repeating guó for as many times as symbols and repetition was not recognized.

Although I like the active aspect of the approach. Language apps where sound is the main form of learning should have a great advantage, as any written text just confuses as every country has its own spin on orthography. Even pinyin, despite making sense, for a beginner, has so many conflicting symbols.

yunusabd3mo ago

> I tried just repeating guó for as many times as symbols and repetition was not recognized.

Can you elaborate? I'm not sure I understand.

qingcharles3mo ago

I think he's saying transliteration and romanization is horribly flawed in some instances.

namelosw3mo ago

Impressive work! The idea and the UI are very intuitive.

Though, as a guy from Beijing who speaks perfect Mandarin, I struggle to pass even the easy ones… So it could definitely use some improvements. The example 你好吃饭了吗 returns hào → hǎo, fān → fàn, le → liǎo. In the first two, the model misheard my tones, and the last one should be le instead of liǎo in this context.

Also, I see people in the comment section are worried about tones. I can guarantee tones are not particularly useful: you can communicate with native speakers with all the tones messed up, and that’s perfectly fine. As soon as you leave Beijing, you’ll find all the tones are shuffled, because every region has its own dialect and accent, and that doesn’t stop people from communicating at all. So don’t let tone stuff slow your learning process down.

tianqi3mo ago

Please allow me to share some of my views. I'm a native Mandarin speaker.

> I can guarantee that tones are not particularly useful and that you can communicate with native speakers with all the tones messed up, and that's perfectly fine.

Not at all. Tones are extremely important. If you have all the tones messed up, you can hardly communicate in Mandarin. It's true, as you said, that different regions of China have different dialects, and you'll find that people can communicate normally because: 1) The tonal differences between nearby regions are not too significant, and people can still try to understand based on context. And 2) In many cases, people switch to regular Mandarin when their dialects are not mutually intelligible. This is why Mandarin exists. It is an officially regulated dialect that all Chinese people learn, to solve the dialect problem among different regions. Chinese people may speak their own dialects in their hometowns, but when two Chinese people meet and find that their dialects are not mutually intelligible, they immediately switch to Mandarin. Therefore, the tones in Mandarin are very important. To a considerable extent, Mandarin exists because of tones. You cannot communicate in it with messed up tones.

thaumasiotes3mo ago

> If you have all the tones messed up, you can hardly communicate in Mandarin.

> To a considerable extent, Mandarin exists because of tones. You cannot communicate in it with messed up tones.

These statements are false. If they were true, it would be impossible to understand written tone-free pinyin; in reality, it's not just possible but easy.

namelosw3mo ago

Well, as a northern guy, I do find myself able to understand Mandarin even from Yunnan easily without prior learning. The harder ones for me, like the Hefei dialect, are hard because the pronunciation is very different, not the tone. Nanjing dialect, on the other hand, is from the same Jianghuai Mandarin group as Hefei, yet it is perfectly intelligible to me.

Even for non-Mandarin/Guanhua, such as the Shanxi dialect, I can understand them because the pronunciation is much closer to mine, just the tones are completely novel.

calf3mo ago

Yes but Regular Mandarin has different tones, Beijing Mandarin is not Hong Kong-style Mandarin is not Taiwanese Mandarin and so when a foreigner chooses "Reference Mandarin", they are choosing what, exactly?

Point being, this idea of a Universal Reference is exactly the kind of linguistic erasure that is wrongheaded to begin with. Nor does this completely prevent comprehension, these debates underestimate how much human communication is contextual, you read what I wrote above and most of it was your mind already filling in (gasp, like an LLM) the next words enabling you to read relatively quickly.

samiv3mo ago

As a person who lived in Taiwan and reached C1 in Chinese, I can also say that the tones are indeed less important than one might think once one can say more and communicate more context. In the beginning, when you're very limited in your expressive capacity and can only say simple sentences, there's less context and getting the tones wrong does produce confusion.

"As soon as you leave Beijing, you’ll find all the tones are shuffled, because every region has its own dialect and accent, and that doesn’t stop people from communicating at all."

Isn't this in fact one of the reasons why China relies heavily on the written language: the different regions lose the ability to communicate by voice, because the changes in tones and pronunciation render the language incomprehensible to people from other regions?

zelphirkalt3mo ago

The point about being a beginner and having limited capacity to express oneself is an important point. When you can say more, you will also have learned more about the language's tendency to use words of 2 syllables, rather than 1 syllable words. Using 2 syllables instead of 1 already removes a lot of ambiguity, and people will understand you better.

samus3mo ago

> Also, I see people in the comment section are worried about tones. I can guarantee tones are not particularly useful: you can communicate with native speakers with all the tones messed up, and that’s perfectly fine.

That might be true between native speakers of similar enough dialects who otherwise speak "properly" with each other: proper grammar, idiomatic expressions, predictable accents (also regarding tones, which are not random, just different patterns from the standard). Language learners make errors in all these categories, and giving them more motivation to neglect the tones is harmful. If tones were completely irrelevant to intelligibility, they would have disappeared long ago.

namelosw3mo ago

> If tones were completely irrelevant to intelligibility, they would have disappeared long ago.

Probably because it's a legacy and disappearing slowly? Modern Mandarin only has four tones left and has already lost tone patterns.

Do you know there's a "robot tone" in Chinese? It simply swaps every character's tone for the flat, or first, tone. Though it rests on the stereotypical false assumption that robots have trouble with tones, kids late last century often communicated in that tone for fun without issues.

At the end of the day, vocal Chinese is always ambiguous with or without tones and in practice heavily relies on context. It requires written language to truly fix that.

zelphirkalt3mo ago

About the tones not being as useful ... I think there are cases, in which they matter. Take for example 熊猫 and 胸毛: "有 xiongmao 吗？" "Are there Pandas? " or "Do you have chest hair?". Another one: 时间 and 事件. Sometimes it gets comical, but natives can and some will be confused, when your tones are off by too much, and the conversation just started, so that the context is not as narrowed down. Context is key in the language. You can notice that, when you are trying to join a conversation between natives. Until you understand a phrase or most of a phrase, that gives you a hint for the topic they are talking about, you will usually have a hard time understanding anything.

I just tried the tool and it couldn't properly recognize a very clearly pronounced "吃" and instead heard some shi2. I think it needs more training data or something. Or one needs a good mic.

simedwOP3mo ago

Hi, thanks for the feedback. The 了 issue was a bug on the JavaScript side; that should be fixed (training did thankfully handle it correctly).

The other two are probably things that could be fixed with a bigger and more varied dataset.

mijoharas3mo ago

I feel like there is a commonly mentioned idea that "speaking a foreign language is easier after having a drink or two".

I've found that especially true with Mandarin because (I think) a beginner speaker is more likely to speak a little more quickly, which allows the listener to essentially ignore the occasional incorrect or slightly mispronounced tone and understand what they're trying to say.

(This is anecdotal, but with n>1. Discussed and observed with other Mandarin language learners)

ecshafer3mo ago

For anyone who is a native speaker of a European language and hasn't tried to learn Chinese or some other tonal language, it's really hard to understand how hard it is. The tones can be very subtle, and your ear is not fine-tuned to them. So you think you are saying it right, but native speakers have no idea what you are saying.

DiogenesKynikos3mo ago

The tones are really not as difficult as people make them out to be.

90% of the effort in learning any language is just learning massive amounts of vocabulary.

Things like tone and grammar are the very basics that you learn right at the beginning.‡ Beginners complain about them, but after a few months of studying Chinese, you should be fairly comfortable with the tones. Then, you spend years learning vocabulary.

The two things that make Chinese difficult are:

1. The lack of shared vocabulary with Indo-European languages (this obviously doesn't apply if your native language is something with more shared vocabulary with Chinese).

2. The writing system, which because it's not phonetic requires essentially the same level of effort as learning an entirely new language (beyond spoken Chinese).

‡. The same goes for grammar issues (like declension and conjugation) that people always complain about when learning Indo-European languages. These are the very basics that you learn early on. Most of the real effort is in learning vocab.

dtm9876541233mo ago

>Things like tone and grammar are the very basics that you learn right at the beginning.‡ Beginners complain about them, but after a few months of studying Chinese, you should be fairly comfortable with the tones.

Disagree slightly with this- pronouncing the tones individually and getting to the point where you can be understood isn't too hard (well still hard), but combining them when speaking more quickly is more challenging, especially if you want it to flow nicely, and adding emphasis while maintaining the tones. Not that it's mandatory if you just want to understand/be understood, it depends on one's goals.

It's a common misconception that it's enough just to learn the tones and move on and it's very hard to find teachers who are able to help with more advanced pronunciation

DiogenesKynikos3mo ago

I fully agree that a lot of the difficulty with the tones is in pronouncing them at pace, and in internalizing how they interact with one another.

However, this is still something that happens very early on when learning Chinese, and it takes nowhere near the same amount of invested time as learning thousands of vocabulary terms.

calf3mo ago

Yours is the first comment I strongly agree with; as a multilingual/bicultural Asian American, children don't have this supposed difficulty hearing tones.

Most of it is passively paying attention. It should not be a struggle; it's one of those things where the more you struggle and overintellectualize, the less time you spend paying attention and letting your hearing do the work it evolved to do.

The other thing is this whole emphasis on accents is misdirected. Teachers do not place this excessive emphasis on accents, it is people who want to sound "authentic" which is not a very wise goal of language learning in the first place.

I do think that learning music can help a little, especially a sonically complex instrument like violin and the like.

(caveat: I'm way oversimplifying on my Saturday afternoon, but that's my tentative views on this that I would try to argue for.)

DiogenesKynikos3mo ago

I agree on not over-intellectualizing the tones.

I've seen people struggle to pronounce a word when I explicitly tell them what tones it contains, but then pronounce it perfectly when I ask them to just imitate me.

But I disagree about accents. One of the major flaws in most foreign language education, in my opinion, is that pronunciation is not emphasized heavily enough at the beginning. Being able to pronounce the basic sounds correctly has a huge impact on how native speakers perceive your language skills, even if you're not very advanced in the language.

snicky3mo ago

> 2. The writing system, which because it's not phonetic requires essentially the same level of effort as learning an entirely new language (beyond spoken Chinese).

This is an interesting observation. Another one that I sometimes mention to my friends who didn't have an occasion to learn Chinese before is that in this language speaking, reading and writing are actually 3 separate components. You can read characters without knowing how to write them properly or even remembering them entirely. Lots of my Taiwanese acquaintances forget how to write certain characters, because nowadays most of the text they write is in bopomofo on their phones. Bopomofo represents sounds, so basically knowing how an expression sounds and being able to read the character (pick it from a set of given characters for the chosen sound) is enough to "write" it.

Sxubas3mo ago

Your comment is written as if learning a language were not a subjective experience, which could not be further from the truth.

DiogenesKynikos3mo ago

Learning 10,000 words is objectively more difficult than getting used to tones.

You can get used to the tones in a relatively short amount of time. If you are in an immersive environment for a month or two, you will end up wondering how it is that anyone can't hear the tones.

In contrast, there is simply no way to memorize thousands of words in that timeframe.

vjvjvjvjghv3mo ago

Agree. It’s really hard. It also explains why a lot of people born in China tend to make serious pronunciation errors when speaking English or German. They are used to focusing on different things than us Westerners.

It took me a very long time to really understand how important tone is in Chinese.

DiogenesKynikos3mo ago

The reason why Chinese people have difficulty pronouncing Indo-European languages is that Chinese has a very limited set of syllables, and they always follow the pattern (consonant) + vowel + (nasal/rhotic consonant), with possibly one of the consonants being dropped.

Chinese does not have clusters of consonants like "rst" in "first." The closest thing in Chinese phonology to "first" would be something like "fi-re-se-te." If you grow up never pronouncing consonant clusters, they are incredibly difficult to learn.

This is all related to the existence of tones, but tones are not the direct reason why Chinese people have difficulty pronouncing words like "first." Tone provides one additional way of differentiating syllables, so Chinese can get away with having far fewer syllables than non-tonal languages. You essentially get 4-5 different versions of every syllable.

samus3mo ago

> This is all related to the existence of tones, but tones are not the direct reason why Chinese people have difficulty pronouncing words like "first."

Actually they kind of are. The tonal system of modern Chinese dialects developed from voiced initial consonants of syllables. Old Chinese (Han dynasty and older) might not have been a tonal language at all. Many linguists think that tones developed from final consonants that have since disappeared, and before that happened, yes, Chinese would have had (some) consonant clusters. But still nothing like the essentially free-form syllables of other language families.

cvhc3mo ago

As a native Mandarin speaker, I always think the most difficult feature of English (and a few other European languages, like French) is the rich vowel inventory. Like done vs down, beat vs bit, trailing dark l vs -ou/-u sound, and frequent vowel reduction in speech. Even worse, different English dialects randomly shift vowels (maybe like how Mandarin dialects use different tones). Neither my ear nor my mouth is tuned. From Wikipedia "English phonology":

> The number of vowels is subject to greater variation; in the system presented on this page there are 20–25 vowel phonemes in Received Pronunciation, 14–16 in General American and 19–21 in Australian English.

Native English speakers, if they are not teachers, tend to underestimate the challenge. I see YouTube videos in which Western learners of Chinese hypothesize that Spanish is the most difficult language for Chinese speakers to learn because of the RR consonant. I learned Spanish casually for a few years and I disagree. RR is difficult to pronounce, but I can clearly hear it and I won't confuse it with a different sound. In contrast to English, Spanish vowels are so easy.

Sxubas3mo ago

Spanish is such a blessing as mispronouncing a word rarely changes the meaning.

Whereas in Chinese, or to a lesser degree English, you have to be very mindful of how you pronounce stuff.

As a native Spanish speaker, the thing I dread the most is grammar and the absurd number of verb tenses there are. Even native speakers don't speak with perfect grammar.

cyberax3mo ago

I'm a native Russian speaker, and I decided to learn Mandarin, because it's linguistically almost the opposite of Russian.

I had no problems with tone pronunciation, but tone recognition was indeed much trickier. I still often get lost when listening to fast speech although I can follow formal speech (news) usually without problems.

barrell3mo ago

I recently started learning a tonal language, and so far have not struggled too much wrt tones when everything is slow. There was an original strangeness and refusal for my vocal cords to want to work that way, but probably only for the first month or so.

At least, this is the case for slow text. Once the text is sped up it’s amazing how my brain just stops processing that information. Both listening and speaking.

I’m sure this will come with practice and time but for now I find it fascinating

thenthenthen3mo ago

Euro speaker here, no problem with recognising tones but speaking them…:/

cyberax3mo ago

Are you studying a language with contour tones or with high/mid tone distinction? I tried to study a bit of Thai, and the high/mid tone distinction was a complete show-stopper for me.

laurieg3mo ago

For someone who hasn't grown up speaking a language with tones or pitches, the process of learning them can be maddening. I applaud anyone who makes tools like this to try to make the process easier.

My experience in learning Japanese pitch accent was eye-opening. At the start, I couldn't hear any difference. On quizzes I essentially scored the same as random guessing.

The first thing that helped me a lot was noticing how there were things in my native language (English) that used pitch information. For example, "uh-oh" has a high-low pitch. If you say it wrong it sounds very strange. "Uh-huh" to show understanding goes low-high. Again, if you reverse it it sounds unusual.

The next part was just doing lots of practice with minimal pairs. Each time I would listen and try my best to work out where the pitch changed. This took quite a lot of time. I feel like massed practice (many hours in a day) helped me more than trying to do 10 minutes regularly. Try to hear them correctly, but don't try too hard. I didn't have any luck with trying harder to 'understand' what was going on. I liken it to trying to learn to see a new color. There isn't much conscious thought.

The final piece of the puzzle was learning phrases, not individual words, that had pitch changes. For example: "yudetamago" could be boiled egg or boiled grandchildren. Somehow my brain just had a much easier time latching on to multi-word phrases instead of single words. Listening to kaki (persimmon) vs kaki (oyster) again and again seemed much harder.

Of course, your mileage may vary with these techniques. I already spoke decent Japanese when I started doing this.

ronyeh3mo ago

> For example, "uh-oh" has a high-low pitch. If you say it wrong it sounds very strange. "Uh-huh" to show understanding goes low-high. Again, if you reverse it it sounds unusual.

Wow… Thanks for making it clear that English also has tones! I hadn’t thought of it this way before. “Uh-huh” sounds similar to Mandarin tones 3 & 2. “Uh-oh” is similar to Cantonese tones 1 & 3.

I’m wondering if we can find good examples to teach the Mandarin tones. I think two or three syllable words are best because it illustrates the contour of the tones.

thaumasiotes3mo ago

Pitch levels are important enough in English that native speakers spontaneously develop ways to write them down, even though the standard written language has no way to indicate them.

However, they operate at the level of the sentence rather than the individual word, which sets up a conflict if an English speaker wants to learn Chinese.

The most common uses of pitch in English are to annotate the grammatical structure of a sentence, making it clear which words belong together in larger phrases, and to mark yes/no questions.

English does have one clear example of lexical tone, the "I don't know" word, which is pronounced very similarly to the Mandarin pinyin éēě. (If pronounced with the mouth open. With the mouth closed, it would be more like 嗯嗯嗯 in the same 2-1-3 tone sequence.)

danparsonson3mo ago

Wholeheartedly (or maybe downheartedly?) agree with this - sometimes I try to say the simplest things and people just stare at me like I'm speaking Martian. Which I suppose I might as well be! One of my big problems is implicit use of tones for things like expressing uncertainty; that's a very difficult habit to get out of.

bunderbunder3mo ago

Another one that I wish I had realized sooner is that, contrary to the impression teachers tend to convey, tones aren’t just a pitch contour thing. There are also intensity and cadence elements. Native speakers can fairly accurately recognize tones in recordings that have had all the pitch contour autotuned out.

dionian3mo ago

It's critical because without proper tonal enunciation the words can be ambiguous.

tifan3mo ago

Well, it would work only when I speak word by word, not in sentences or at a normal speed for daily conversation. The model thinks I am making mistakes when I speak casually (as a native Chinese speaker, I had Mandarin 2A certification, which is required for teachers and other occupations that require a very high degree of Mandarin accuracy). You wouldn’t really notice it, but pronunciation is very different between casual and formal speech…

vunderba3mo ago

When I was living in Taiwan, one of the ways I forced myself to remember to pronounce the tones distinctly was by waving my hand in front of me, tracing the arc of each character’s tone.

It helped a lot even if I did look like an insane expat conducting an invisible orchestra.

One more thing: there's quite a bit of variation in how regional accents in the mainland can affect tonal pronunciation. It might be worth reaching out to some native speakers to give you some baseline figures.

zdragnar3mo ago

In a university Mandarin class, one of the adult students (i.e. probably 40 or so) WAY over exaggerated his tones, to the point that the little old lady teaching us laughed out loud after one of his answers.

A few years later, he had the most clean and consistent pronunciation out of anyone I'd been in a class with, and easily switched between the Beijing and other accents depending on which teacher we had on any given day.

I rather regret not emulating him, even though I haven't really used it for nearly 20 years and have forgotten most of it.

ecshafer3mo ago

From a language learning standpoint that does make sense. Over-exaggerate while you are learning to help cement the idea, and then when you are speaking more naturally you will fall back into a regular kind of tone.

mleonhard3mo ago

Over-exaggeration also works well when learning to play stringed instruments like cello.

luckydata3mo ago

That's EXACTLY how I taught myself to speak with a Spanish accent from Madrid. I imitated the way TV celebrities spoke and the way the speakers on the metro announced the stations, and it gave me a base for how to use my mouth and throat appropriately. After a while I was able to tone it down, and my accent got so good that locals couldn't tell I wasn't Spanish. I had this cool party trick of pulling out my ID and showing them I was truly a foreigner!

sowbug3mo ago

You'll love Mike Laoshi: https://youtu.be/cna89A2KAU4?si=SQEZ_0ooO1z119_k

cyberax3mo ago

Hand motions help! Especially when you want to memorize new words, because initially you need to treat tone as something additional to remember.

I used simple index finger motions to mark tones.

simedwOP3mo ago

For accents, I’ve mostly tested with a few friends so far. I’m wondering whether region should be a parameter, because training on all dialects might make the system too lax.

vunderba3mo ago

Probably be a lot of work but it would be really interesting if you had sufficient data sets to train across accents.

Highly recommend taking a look at Phonemica for this:

https://phonemica.net/

devin3mo ago

This sounds like how solfège training works. You use a hand signal to indicate a specific tone: do re mi fa so la ti

bunderbunder3mo ago

This is very cool, but from one Mandarin learner to another I’d caution against relying too heavily on any external feedback mechanism for improving your pronunciation.

If you can’t easily hear your pronunciation mistakes so clearly it hurts, consider putting more energy into training your ear. Adult language learners usually have brains that have become resistant to, but not incapable of, changing the parts of the brain responsible for phoneme recognition. The neuroplasticity is still there but it needs some nudging with focused exercises that make it clear to your brain exactly what the problem is. Minimal pair recognition drills, for example, are a great place to start.

It’s not the most fun task, but it’s worth it. You will tighten the pronunciation practice feedback loop much more than is possible with external feedback, so a better accent is the most obvious benefit. But beyond that, it will make a night and day difference for your listening comprehension. And that will get you access to more interesting learning materials sooner. Which hopefully increases your enjoyment and hence your time on task. Plus, more accurate and automatic phoneme recognition leaves more neurological resources free for processing other aspects of your input materials. So it may even help speed things like vocabulary and grammar acquisition.

barrell3mo ago

I’m building a language learning app [https://phrasing.app] and this is really good advice. I’ve not had any interest in STT for the application, and have no plans to integrate it. In my experience, I’ve never seen them be truly beneficial in the language learning process.

What has been extremely beneficial has been having the text and audio force-aligned and highlighted, karaoke-style, every time I hear the audio. It has improved my phoneme recognition remarkably well with remarkably little content. Several users also report the same thing - that even native speech feels a lot more like separate words than just a slew of sounds. I attribute this in large part to this karaoke-style audio. It works better for phonetic scripts, so I would recommend using this with pinyin/jyutping/furigana for character-based languages.

For production, when I was at Regina Coeli (world-class language institute) their main thing was just 1. you hear a short passage in Dutch, 10-40 words 2. you record yourself reading the same passage and 3. you play back the two audio tracks on top of one another and listen for the difference. Optional step 4. Re-record and replay until it’s close enough.

There was no grading, no teacher checking recordings, no right or wrong; just hundreds of random sentences and a simple app to layer them. You needed to learn to hear the differences yourself and experiment until you no longer could. (fwiw this is not present in phrasing, I just found it relevant. One day soon I hope to add it!)
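The layering step described above (step 3, playing the two tracks on top of one another) could be sketched roughly like this. It is only an illustration, not the institute's software: the `overlay` function is a hypothetical name, and it assumes both recordings are already decoded to mono float samples in [-1.0, 1.0] at the same sample rate.

```python
# Hypothetical sketch: overlay a reference recording and a learner recording.
# Assumes mono audio, identical sample rates, samples as floats in [-1.0, 1.0].


def overlay(reference: list[float], attempt: list[float]) -> list[float]:
    """Mix two tracks sample by sample, padding the shorter one with silence."""
    length = max(len(reference), len(attempt))
    mixed = []
    for i in range(length):
        ref_sample = reference[i] if i < len(reference) else 0.0
        att_sample = attempt[i] if i < len(attempt) else 0.0
        # Average the two samples so the mix stays inside [-1.0, 1.0].
        mixed.append(0.5 * (ref_sample + att_sample))
    return mixed
```

Averaging rather than summing keeps the mix from clipping; the listening and re-recording loop is still up to the learner.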

zdc13mo ago

I completely agree with this. There's a certain confidence you get when you can hear a word you don't know, but can still comprehend it well enough to know what pinyin to type into your dictionary app. Mandarin Blueprint has a nice pinyin pronunciation video on YouTube that I worked through a while ago; after following that with a few weeks of immersion in Taiwan, I was able to really pick out what people were saying.

I feel like listening is the key to speaking. You don't necessarily need to rote learn the tones for each word. You just need to say words as you hear them spoken by others.

rahimnathwani3mo ago

This is incredible. When I was first learning Chinese (casually, ~20 years ago), my teacher used some Windows software that drew a diagram of the shape of my pronunciation, so she could illustrate what I was getting wrong in some objective way.

The thing you've built is so good, and I would have loved to have it when I was learning Mandarin.

I tried it with a couple of sentences and it did a good job of identifying which tones were off.

yunusabd3mo ago

You're probably thinking of Praat, which is still around. Even has the same UI as 20 years ago.

memalign3mo ago

I wish this had a pinyin mode…! I am learning to speak Mandarin but I am not learning to read/write.

( I’m learning using a flashcards web app I made and continue to update with vocab I encounter or need: https://memalign.github.io/m/mandarin/cards/index.html )

data_ders3mo ago

same! but if you get it inevitably wrong the first time it gives you the pinyin. but i struggled to get it to transcribe the consonants I was making let alone the tones. i'm pretty sure i'm not as bad as that!

simedwOP3mo ago

Great suggestion, added a toggle to see pinyin.

siwatanejo3mo ago

+1 for pinyin

knocte3mo ago

alixwang3mo ago

As a native speaker of Mandarin, the demo doesn't work for me. It can't check my pronunciation. I don't know what's wrong with it; maybe it's too sensitive (my daughter is watching cartoons next to me).

simedwOP3mo ago

It’s fairly sensitive to background noise at the moment. I’m planning to train an improved version with stronger data augmentation, including background noise.

affogarty3mo ago

This is extremely cool, although I asked my wife (who is Chinese) to try it out and it said she made some mistakes.

hawflakes3mo ago

I tried it out and it has some issues with my native speech. I grew up with more Taiwan mandarin but I know the Beijing standard and the recognizer was flagging some of my utterances incorrectly.

zelphirkalt3mo ago

I think this is a good time for a shameless plug. For the last 2 months or so I have been working on my own project [1] for learning more characters. I have made a tool with a powerful search function, training mode, and other useful features, such as displaying plots that show you your progress and whether you are reaching your daily training goal, and the ability to save searches, a la Thunderbird saved filters. It is written in Python and old-school tkinter with custom widgets for a somewhat more modern and capable feel. It is very configurable. Though currently configuring it means touching a JSON file, as I have not yet bothered writing a GUI for that.

I am mostly developing this for myself, to have the perfect tool for me, but I dare say that I have not seen anything comparable and that I let my 10y+ experience in learning Chinese influence my design decisions. Oh, and it is free/libre software of course (AGPL). It comes with an ever-improving vocabulary file that has tons of metadata about words, their usage, how to memorize them, etc. under ODbL (Open Database License).

[1]: https://codeberg.org/ZelphirKaltstahl/xiaolong-dictionary

peterburkimsher3mo ago

Good to see that there are others learning and creating! Another shameless plug for my translator site: https://pingtype.github.io

It takes text, adds colours for tones, pinyin, literal, and parallel translations.

There’s also a character decomposition tool at the bottom of the page which can be helpful if you’re able to recognise half a character but can’t remember the pronunciation for typing it.

The YouTube channel has some song lyrics, movie subtitles, and audio Bible that might help with learning.

zelphirkalt3mo ago

Also I just read some of your blog about learning Chinese :) Haha, I can totally relate to some of it. What I noticed is, that when I speak Mandarin with locals (on vacation, because I am not living there), they are always super happy, that I speak their language and they make an effort to speak it with me. This might be dependent on the region one is in. From your writing I would guess you might be in Taiwan or HK, and while I have been in HK, I have never been in Taiwan and I don't know how people handle it there. I have mostly been in southern China and it's always been great and an overwhelming amount of people were very friendly and welcoming. Of course living there and traveling there for a while are 2 different things and experience might differ. If you happen to visit Berlin, feel welcome to visit our Chinese language meetup (https://dragon-descendants.de/en/) and if you want you can ask for me, 小龙.

zelphirkalt3mo ago

Wow, the tool for decomposing characters is very cool! I assume you are talking about the thing that appears, when I click "Matrix"? I think it would be good to have "decompose characters" somewhere. But I might actually use this to get the component characters. In my app in my vocabulary file I also have tags for words, which are like "component:<component here>", so that if one knows parts of a character, one could also search for it, without knowing its pinyin, by searching for "tags contain component1 and contain component2 and ...". I might add more component tags using your tool.

What I noticed though is, that some of the components don't seem to be like what I would expect to be shown as components. For example I tried the word 衣服 and 服 is shown to have the component "二". I guess one could see it that way, but some other dictionaries stop at 月 which itself is a component with set meaning (moon) and usage as radical (often for body parts). My favorite online normal dictionary for example: https://www.mdbg.net/chinese/dictionary?page=worddict&wdrst=... (hover over 3 dots of character and click the button with the 字 and scissors to see decomposition) says:

    服 = 月 + 𠬝
    𠬝 = 卩 + 又

If you go further, wouldn't you also have to decompose "二" into "一" and "一"?

A Chinese teacher told me there are various approaches for decomposition, so this might not be a science or that rigorous, but I think consistency would then dictate, that you decompose "二" as well. I don't always agree fully with their decomposition either and usually I stop at any component, that still has meaning by itself, which can be pretty low level 1 or 2 strokes components already. For determining that, I also use information from a language school, which I copied into a repo: https://codeberg.org/ZelphirKaltstahl/language-learning/src/... "All radicals from their website". Also useful for memorizing the characters, if one can derive a mnemonic for a character from its components and their meaning.

The advanced UI looks very complex, but I don't mind that. In fact it is quite cool! Just has some stuff I don't even know what it is about. I noticed, that once one toggles the advanced UI, I didn't find a way to toggle it back to simple again.

Bookmarked!

peterburkimsher3mo ago

Thanks for your long and thoughtful reply!

Matrix is just a visualisation tool, I never actually found a practical use for it other than looking cool.

The decomposition feature is at the bottom of the page below the generated HTML. It's the text box with "隹" and a Search button. Clicking Search will show the 2 parts of the character, and all characters that contain that radical (䧶, 䳡, etc), and all multi-character words containing that character.

Clicking any of the related characters (or numeric codes for radicals that don't have a Unicode representation) will then show the genealogy for that character.

See "copying from images" in http://localhost/pingtype/docs/docs.html

If I ever come to Berlin then your meetup sounds fun! I'm pretty far away though; I live in New Zealand now.

All the best with your learning, I hope you keep making progress!

frozennothing3mo ago

This is really cool. Thank you for sharing. Before now I had not sought to understand how this technology works under the hood, but seeing it done at this scale made me curious to see if I could do something similar.

ChadNauseam3mo ago

This is amazing. I'm also working on free language learning tech. (I have some SOTA NLP models on huggingface and a free app.) My most recent research is a list of every phrase [0].

Pronunciation correction is an insanely underdeveloped field. Hit me up via email/twitter/discord (my bio) if you're interested in collabing.

[0]: https://gist.github.com/anchpop/acbfb6599ce8c273cc89c7d1bb36...

stuxnet793mo ago

How difficult would it be to adapt this to Cantonese? It is a surprisingly difficult language to learn. It has more tones than Mandarin plus comparatively less access to learning resources (in my experience)

inkyoto3mo ago

Unlike Mandarin and other Chinese languages, Cantonese does not have tone sandhi and has changed tones instead.

Cantonese tones are also different from those of Mandarin, so no, it can't be adapted for Cantonese and it would require a complete rework.

> It is a surprisingly difficult language to learn.

I keep hearing this quite a bit, but I do not find Cantonese to be any more difficult than most languages[0]. Or at least we would need to define a metric based on which we could assess the difficulty. If it is the number of tones, their number (six – no, not nine) may look formidable at first, but they are, in fact, rather simple tones and broadly fall into three categories: flat, rising, and falling. As a random example, Cantonese does not even have a dipping tone.

In comparison, «fancy» tones of Vietnamese are significantly more challenging or even difficult – they can curl and unfurl (so to speak).

[0] That crown appears to belong to Archi, with honourable mentions going out to Inuit, Basque, Georgian, Navajo, Yimas and several other polysynthetic languages.

hnfong3mo ago

Cantonese is "hard" mainly for two reasons-

1. tones, and generally the gatekeeping of some Cantonese communities towards people who haven't gotten the tones completely right

2. the lack of learning materials relative to the number of speakers, the confusion between written Chinese and written Cantonese (and also the general lack of the latter)

As they say, "a language is a dialect with an army and navy"... I'll leave it at that.

calf3mo ago

You seem to be confusing/overgeneralizing the understandable resentment of "some Cantonese" who likely had bad experiences of postcolonialism and/or authoritarian-revanchist state policies. If Hong Kong diaspora has a poor reception towards newcomers to their local microculture, maybe it's because the people attempting to engage are not treading lightly with those actual historical legacies in mind.

1 more reply

inkyoto3mo ago

Given that linguistics does not have a concept of what makes a language «hard» or not, the language hardness classification is highly subjective and perceptional.

I have already commented on why I do not think that Cantonese tones are hard, so I will leave it at that – it is the first, oft repeated myth that is not based on facts.

> 2. the lack of learning materials relative to the number of speakers […]

On the subject of the availability of learning materials, there would have been a strong case for, e.g. Wu (Shanghainese), Min (Hokkien), Hakka etc – for all of which the learning materials virtually do not exist, and that is true.

With Cantonese, it is a remarkably different situation. My local bookshop has two large shelves stacked with Cantonese textbooks and dictionaries that suit a range of people from vastly different age groups – from toddlers starting to babble to serious advanced learners and anyone in between. More is available online, e.g. Virginia Yip's Routledge series, which includes a comprehensive book on the Cantonese grammar of rarely seen quality and coverage, Robert Bauer and Victor Mair's «ABC Cantonese-English Comprehensive Dictionary», and many more. There are online resources, an open-source, cross-platform «Jyut Dictionary», Google and Apple support the Cantonese keyboard etc.

If their printed versions are not easily locally available, they can be purchased as Kindle books as well.

Granted, Mandarin surpasses Cantonese in terms of the quantity of learning materials, and that is a dry fact.

> […] the confusion between written Chinese and written Cantonese […]

Many languages have quirks or come with a wealth of idiosyncrasies when it comes to how the language is spoken and written down. Burmese, Thai, and Tibetan, for example, are written according to extremely archaic pronunciation rules to the point that spoken and written languages have to be learned separately.

Written Cantonese has existed since at least the Ming dynasty[0][1], but the reasons why there are two distinct forms are entirely different as they go back to replacing Classical Chinese, which by the late 19th century had become incomprehensible to anyone without years of dedicated study, with a modern written standard based on northern Chinese varieties.

> […] (and also the general lack of the latter).

This is the second often repeated myth. Many Cantonese speakers believe that Cantonese can only be spoken but not written down, which is patently false – if a language has a writing system, it can be written down with it. When pressed with the question «why do you think so», there is typically no answer or «because we have been told so». 口語粵語好容易用漢字寫低，就好似書面粵語咁。 (Spoken Cantonese is easily written down in Chinese characters, just like written Cantonese.) There is a real issue of some native Cantonese words not having dedicated Chinese characters for them, but it is more of a philosophical disagreement between the academics rather than an insurmountable problem.

So, in reality – at least in Hong Kong – since formal literacy has long meant competence in Standard Written Chinese, not in a full Cantonese-written system, schools and institutions tend to penalise written vernacular Cantonese forms in formal contexts – entirely for non-linguistic reasons as explained in [2].

To sum it up, I do not find any of the counterarguments to be compelling, persuasive or supported by linguistic facts which would make Cantonese a «hard» language.

[0] https://www.fe.hku.hk/clear/doc/WC%20and%20Implications%20fo... – «The story of written Cantonese begins in the Ming dynasty with texts printed in woodblock print books called wooden fish books (木魚書)»

[1] https://cantoneseforfamilies.com/cantonese-vernacular-and-fo...

[2] https://hkupress.hku.hk/image/catalog/pdf-preview/9789622097...

2 more replies

rablackburn3mo ago

> And if there’s one thing we’ve learned over the last decade, it’s the bitter lesson: when you have enough data and compute, learned representations usually beat carefully hand-tuned systems.

There are still holdouts!

Come back to me in a couple of decades when the trove of humanity's data has been pored over and drifted further out of sync with (verifiable) reality.

Hand-tuning is the only way to make progress when you've hit a domain's limits. Go deep and have fun.

sgt3mo ago

How do you actually go about training specialized speech models? Let's say you have a language dialect you want to specialize on, or a pidgin English from West Africa, or a regular language but with highly specialized terminologies being used.

Just curious - would you need insane HW infrastructure to begin with, or hosted/managed. And what tooling is preferred by the industry for the "training"?

cocoa193mo ago

Have you tried the Azure Speech Studio? I wonder how your custom model compares to this solution.

I played around with python scripts for the same purpose. The AI gives feedback that can be transformed to a percentage of correctness. One annoyance is that for Mandarin, the percentage is calculated at the character level, whereas with English, it gives you a more granular score at the phoneme level.

dirteater_3mo ago

IMO the SotA for this is https://www.speechsuper.com/. Amazon offers something similar

> One annoyance is that for Mandarin, the percentage is calculated at the character level, whereas with English, it gives you a more granular score at the phoneme level.

This is the case for most solutions you'd find for this task. Probably because of the 1 character -> 1 syllable property. It's pretty straightforward to split the detected pinyin into initial+final and build a score from that though.
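A rough sketch of that initial+final split, for anyone curious. Everything here is an assumption for illustration: the function names are made up, the input is toneless pinyin already aligned syllable by syllable, `y` and `w` are treated as initials for simplicity, and the 50/50 weighting is arbitrary.

```python
# Hypothetical sketch: split toneless pinyin syllables into initial + final
# and score a detected syllable against the expected one.

# Two-letter initials come first so "sh" is tried before "s".
# "y" and "w" are treated as initials here purely for convenience.
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]


def split_syllable(syllable: str) -> tuple[str, str]:
    """Return (initial, final); the initial is '' for syllables like 'an'."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable


def syllable_score(expected: str, detected: str) -> float:
    """Half credit for a matching initial, half for a matching final."""
    exp_initial, exp_final = split_syllable(expected)
    det_initial, det_final = split_syllable(detected)
    return 0.5 * (exp_initial == det_initial) + 0.5 * (exp_final == det_final)


def utterance_score(expected: list[str], detected: list[str]) -> float:
    """Average per-syllable score as a percentage; lists must be aligned."""
    if not expected or len(expected) != len(detected):
        raise ValueError("expected and detected must be non-empty and aligned")
    total = sum(syllable_score(e, d) for e, d in zip(expected, detected))
    return 100.0 * total / len(expected)
```

With this, a `shi` heard as `si` loses only the initial's half of the credit instead of the whole character. Tones would need a separate term.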

drekipus3mo ago

instantly awesome.

I suck at Chinese but I want to get better and I'm too embarrassed to try and talk with real people and practise.

This is a great compromise. even just practising for a few minutes I already feel way more confident based on its feedback, and I feel like I know more about the details of pronunciation.

I'm worried this might get too big and start sucking like everything else.

erdemo3mo ago

This thread is like a diamond to me because I have been thinking about building almost the same thing for English tones. I need a model like this.

I'm sure there are a bunch of apps out there that claim they do the same thing, but they don't, IMO. Even if they do, as you said, where is the fun in that?

Great post, thanks for it!

ctkhn3mo ago

This is fantastic. Been looking for a way to get feedback on my pronunciation since I came back from Shanghai and haven't been seeing native speakers every day. Is there any plan to make this a download for desktop or mobile? Would be using it weekly to get back up to par on Mandarin

kris_builds3mo ago

Super interesting project. Curious about the data collection - did you record yourself, use existing datasets, or both? I've been thinking about building something similar for Hebrew vowels (which are often omitted in writing). Would love to hear what the hardest part of the pipeline was.

alexandermorgan3mo ago

I wish this were available for more languages! It would also be neat to estimate the native language of the speaker, given their pronunciation of the target language, and propose a prioritization of the pronunciation mistakes the language learner should work on first.

SequoiaHope3mo ago

Amazingly I just did the same thing! Only with AISHELL. It needs work. I used the encoder from the Meta MMS model.

https://github.com/sequoia-hope/mandarin-practice

arjie3mo ago

Very cool. As a super newbie who's only made it to Pimsleur 15 and only for the speaking, it would be cool to have a pinyin text entry and so on. In the end, I just type into ChatGPT what I want and paste it in your box so it's not a big deal.

redleader553mo ago

This is very cool to have! Thanks for putting in the time to build it.

For me it doesn't work very well. Even easy phrases like 他很忙 get transcribed completely randomly, as "ma he yu". Is it maybe over-fitted to some type of voice?

tomaytotomato3mo ago

Can the implementation used here for tone and pronunciation apply to music?

It would be cool if a model could tell you if you are singing or playing a piece of music with the right intonation and other ways.

olalonde3mo ago

It might be a mic issue but my wife, who is a native speaker, seems to get most characters wrong. I will try again later in a quieter place to see if that helps.

jainaayush053mo ago

Any plans on releasing the inference/training code?

byb3mo ago

Neat. A personal tone trainer. Seriously, shut up and take my money now. Of course, it needs a vocabulary trainer, and zhuyin/traditional character support.

jrockway3mo ago

Interesting application! A friend of mine built a model like this to help her make her voice more feminine, and it is neat to see a similar use case here.

sim04ful3mo ago

I'm also working on a Chinese learning app (heyzima.com) and my "solution" to this was to use the TTS token/word log probabilities.

baby3mo ago

For people trying to say the "j" sound correctly, as in "jiu" (old), just say "dz", so in that example "dziu"

JCharante3mo ago

Cool! I'm not great at Chinese but I have to speak slowly for it to recognize the tones/words. I wonder how fast the training data is.

dionian3mo ago

it heard wu2 but i heard wo2 from you fine. and it should sound like wo2 not wo3 if spoken quickly. not a native speaker though so i could be wrong

holg3mo ago

Great idea and effort, thanks for sharing. It is even way more strict than my native chinese tryarounds :D

bytesandbits3mo ago

great work! I am going to try it out. Currently about to learn some Mandarin to be able to talk with hawker stand owners for a trip I am doing soon. I am trilingual and can speak a few languages on top of that, but none of them tonal. I am new to tonal languages and I find myself struggling with this... a lot!

anonzzzies3mo ago

good luck! I speak 6 languages fluently but none of them tonal and I find mandarin very challenging; it does not help that people in places where you might need it are not very forgiving; asking for green fork in a tea shop has people very bewildered.

namr20003mo ago

Wow, I was going to make something almost exactly like this! Really cool work and thank you for sharing

while_true_3mo ago

Suggestion: in addition to the microphone input, allow the user to upload an audio file.

eudamoniac3mo ago

How do you know that what it tells you is correct if you can't hear it yourself?

btrlsnqtn3mo ago

The article mentions the bitter lesson. I'm confused about the status of Sutton's opinion of the bitter lesson. On the one hand, he invented the concept. On the other hand, he appears to be saying that LLMs are not the correct approach to artificial intelligence, which to a naive outsider looks like a contradiction. What gives?

allan_s3mo ago

Maybe he means that LLMs will hit a glass ceiling, or that the "right" approach will give equivalent results with less training/less intensive compute requirements?

victorbjorklund3mo ago

Cool. Would love a write up about how you did it if you have time

kris_builds3mo ago

+1 on wanting a writeup. The model architecture choices alone would be interesting - did they use a transformer, CNN, or something hybrid? And how they handled the tone pair ambiguities... Would read that blog post for sure.

mentalgear3mo ago

Very cool ! Will you make the source available as well?

martianlantern3mo ago

Nice! I need something similar for english now

irl_zebra3mo ago

I am a huge, huge AI hater. I hate, hate, hate all the "Show HN: My Latest AI Slop App That Sucks and Required No Creativity to Think Up or Vibe Code and is Useless and I am a Useless Void of a Person for Having Created It." I say that to give context to say that this is the first legitimately useful "Show HN" I have seen in this AI sphere. It's really great, it seems to work quite well (I am an amateur mandarin speaker, I "know" about 5,000 words, so can vaguely judge) and fulfills a legitimate use case. I would pay you to use this once the model improves a little bit. It's really fantastic. You did well.

jellojello3mo ago

This is amazing, if you feel like opening an entire language to being learned more easily.. Farsi is a VERY overlooked language, my wife/her family speak it but it's so difficult finding great language lessons (it's also called Persian/Dari)

simedwOP3mo ago

Thank you.

I had a quick look at Farsi datasets, and there seem to be a few options. That said, written Farsi doesn’t include short vowels… so can you derive pronunciation from the text using rules?

kranner3mo ago

> written Farsi doesn’t include short vowels… so can you derive pronunciation from the text using rules?

You can't, but Farsi dictionaries list the missing short vowels/diacritics/"eraab" for every word.

For instance, see this entry: https://vajehyab.com/dehkhoda/%D8%AD%D8%B3%D8%A7%D8%A8?q=%D8...

With the short vowel on the first letter it would be written حِساب (normally written as just حساب)

The dictionary entry linked shows that there is a ِ on the first letter ح

But you would have to disambiguate between homographs that differ only in the eraab.

peterburkimsher3mo ago

I made a parallel literal translator for Farsi:

https://pingtype.github.io/farsi.html

Paste in some parallel text (e.g. Bible verses, movie subtitles, song lyrics) and read what Farsi you can on the first line, looking to the lower lines for clues if you get stuck.

The core version of Pingtype is for traditional Chinese, but it supports a few other languages too.

maximedupre3mo ago

This is sick... you can just do things :D

cheonn6383mo ago

Unclear if it wants 媽媽 / 妈妈 as:

- māmā (incorrect)

- māma (correct)

nirvanatikku3mo ago

talk about 30 seconds to wow. great app, UX and demo. would love to use this. kudos.

felixbecker3mo ago

What a brilliant project!

somesun3mo ago

is there a English or Japanese learning model like this?

cmuguythrow3mo ago

Awesome idea!

wenjian3mo ago

Chinese here, some of the tones are wrong, maybe the environment here has some noise. Good luck learning Mandarin ;)

kanwisher3mo ago

Nice now just add Thai support ;)

iamanllm3mo ago

holy crap, I was literally imagining how I wanted something exactly like this yesterday! you are a hero!

contingencies3mo ago

Man, get a girlfriend.


dapangzi3mo ago

Longtime lurker, made an account specifically to give feedback here as an intermediate speaker. :)

This is a great initiative and I hope to see more come out of this; I am not criticizing, but just want to provide my user experience here so you have data points.

In short, my experience lines up with your native speakers.

I found that it loses track of the phonemes when speaking quickly, and tones don't seem to line up when speaking at normal conversational speed.

By contrast, if I speak slowly and really push each tone, the phonemes and tones all register correctly.

Again, great initiative, but I think it needs a way to deal with speech that is conversationally spoken and maybe even slurred a bit due to the nature of conversational level speech.

mercanlIl3mo ago

The tool definitely needs to address tone transformations, it’s a big part of how the language is spoken. Otherwise it’s mostly useful for a first year student speaking in isolation.

Hoping to see improvements in this area

simedwOP3mo ago

Thanks for the great feedback!

I have just added sandhi support, please let me know if it's working better.

dapangzi3mo ago

Still having some issues that match my previous comment, I'll try to follow your blog and give more feedback as you work on it.

Will comment that the shorter phrases (2-4 characters long) were generally accurate at normal speed, but the longer sentences have issues.

Maybe focusing on the accuracy of the smaller phrases and then scaling that might be a good way to go, since those smaller phrases are returning better accuracy.

Again, really think this is a great initiative, want to see how it grows. :)

dapangzi3mo ago

ACKing your comment.

Will check once the TV is off in the house. :)

sqs3mo ago

I don't think it takes care of tone transformation (eg 他是 ni3shi4 -> ni2shi4). Or if it does, my tones are just off. But it's a really cool idea!

carlmr3mo ago

他是 is tāshì which doesn't transform I think. Did you mean to write 你是 nǐshì? I think that transforms differently though. With the half 3rd tone only dropping.

The classical example is 4/4 不是. Which goes bùshì -> búshì.

Or 3/3 that becomes 2/3. E.g. 你好 nǐhǎo becoming níhǎo.

The 1/4 -> 2/4 transformation I think is specific to one. 一个 yīgè becomes yígè.
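For reference, the three rules listed above could be sketched like this. It is a simplification under stated assumptions, not how the tool itself handles sandhi: `apply_sandhi` and the `(character, pinyin, tone)` tuple layout are hypothetical, chains of third tones are handled naively left to right, and other changes (the half third tone, 一 before tones 1 to 3) are left out.

```python
# Hypothetical sketch of the sandhi rules described above.
# Each syllable is (character, toneless pinyin, citation tone 1-4, 5 = neutral).

Syllable = tuple[str, str, int]


def apply_sandhi(syllables: list[Syllable]) -> list[Syllable]:
    """Return the syllables with spoken tones substituted for citation tones."""
    result = []
    for index, (char, pinyin, tone) in enumerate(syllables):
        # Citation tone of the following syllable, or None at the end.
        next_tone = syllables[index + 1][2] if index + 1 < len(syllables) else None
        spoken = tone
        if char == "不" and tone == 4 and next_tone == 4:
            spoken = 2  # bù + 4th tone -> bú
        elif char == "一" and tone == 1 and next_tone == 4:
            spoken = 2  # yī + 4th tone -> yí
        elif tone == 3 and next_tone == 3:
            spoken = 2  # 3rd + 3rd -> 2nd + 3rd
        result.append((char, pinyin, spoken))
    return result
```

A grader would then compare what it heard against the output of something like this rather than against the dictionary tones.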

jhanschoo3mo ago

The tone sandhi example you just gave looks incorrect to me

jimz3mo ago

Well, OP wrote "he is" but then wrote "you are" in pinyin for one, and that's a bit hard to reconcile.

tifan3mo ago

I had the same issue! Perhaps being another dapangzi is the problem here lol

et-al3mo ago

I'm not familiar with this slang: what's a big plate?

allan_s3mo ago

It's slang for somebody fat. 子 does not carry a specific meaning; it is more of a character with a grammatical function, marking the word as a noun.

dirteater_3mo ago

the commenter's username (i'm guessing they mean 大胖子, feel free to google translate)

1 more reply

dapangzi3mo ago

胖 (pàng) means fat, vs 盘 (pán), which means plate.

Quite alright! We have to make mistakes to learn!

yunusabd3mo ago

Super nice, thanks for sharing!

This indicates to me that the model learned word structures instead of tones here. "Zhōng guó" probably appears in the training data a lot, so the model has a bias towards recognizing that.

- Edit -

From the blog post:

> If my tone is wrong, I don’t want the model to guess what I meant. I want it to tell me what I actually said.

vjerancrnjak3mo ago

I tried just repeating guó for as many times as symbols and repetition was not recognized.

yunusabd3mo ago

> I tried just repeating guó for as many times as symbols and repetition was not recognized.

Can you elaborate? I'm not sure I understand.

qingcharles3mo ago

I think he's saying transliteration and romanization is horribly flawed in some instances.

namelosw3mo ago

Impressive work! The idea and the UI is very intuitive.

tianqi3mo ago

Please allow me to share some of my views. I'm a native Mandarin speaker.

> I can guarantee that tones are not particularly useful and that you can communicate with native speakers with all the tones messed up, and that's perfectly fine.

thaumasiotes3mo ago

> If you have all the tones messed up, you can hardly communicate in Mandarin.

> To a considerable extent, Mandarin exists because of tones. You cannot communicate in it with messed up tones.

These statements are false. If they were true, it would be impossible to understand written tone-free pinyin; in reality, it's not just possible but easy.

namelosw3mo ago

Even for non-Mandarin/Guanhua, such as the Shanxi dialect, I can understand them because the pronunciation is much closer to mine, just the tones are completely novel.

calf3mo ago

samiv3mo ago

"Because as soon as you leave Beijing, you’ll find all the tones are shuffled because of every region has their own dialect and accents, which doesn’t stop people from communicate at all. "

zelphirkalt3mo ago

samus3mo ago

namelosw3mo ago

> If tones were completely irrelevant regarding understandably then they would have disappeared long ago.

Probably because it's a legacy and disappearing slowly? Modern Mandarin only has four tones left and has already lost tone patterns.

At the end of the day, vocal Chinese is always ambiguous with or without tones and in practice heavily relies on context. It requires written language to truly fix that.

zelphirkalt3mo ago

I just tried the tool and it couldn't properly recognize a very clearly pronounced "吃" and instead heard some shi2. I think it needs more training data or something. Or one needs a good mic.

simedwOP3mo ago

Hi, thanks for the feedback. The 了 issue was a bug on the JavaScript side; that should be fixed (training did thankfully handle it correctly).

The other two are probably things that could be fixed with a bigger and more varied dataset.

mijoharas3mo ago

I feel like there is a commonly mentioned idea that "speaking a foreign language is easier after having a drink or two".

(This is anecdotal, but with n>1. Discussed and observed with other Mandarin language learners)

ecshafer3mo ago

DiogenesKynikos3mo ago

The tones are really not as difficult as people make them out to be.

90% of the effort in learning any language is just learning massive amounts of vocabulary.

The two things that make Chinese difficult are:

1. The lack of shared vocabulary with Indo-European languages (this obviously doesn't apply if your native language is something with more shared vocabulary with Chinese).

2. The writing system, which because it's not phonetic requires essentially the same level of effort as learning an entirely new language (beyond spoken Chinese).

dtm9876541233mo ago

It's a common misconception that it's enough just to learn the tones and move on, and it's very hard to find teachers who are able to help with more advanced pronunciation.

DiogenesKynikos3mo ago

I fully agree that a lot of the difficulty with the tones is in pronouncing them at pace, and in internalizing how they interact with one another.

However, this is still something that happens very early on when learning Chinese, and it takes nowhere near the same amount of invested time as learning thousands of vocabulary terms.

calf3mo ago

Yours is the first comment I strongly agree with. As a multilingual/bicultural Asian American, I'd note that children don't have this supposed difficulty hearing tones.

I do think that learning music can help a little, especially a sonically complex instrument like violin and the like.

(caveat: I'm way oversimplifying on my Saturday afternoon, but those are my tentative views on this that I would try to argue for.)

DiogenesKynikos3mo ago

I agree on not over-intellectualizing the tones.

I've seen people struggle to pronounce a word when I explicitly tell them what tones it contains, but then pronounce it perfectly when I ask them to just imitate me.

snicky3mo ago

> 2. The writing system, which because it's not phonetic requires essentially the same level of effort as learning an entirely new language (beyond spoken Chinese).

Sxubas3mo ago

Your comment is written as if learning a language were not a subjective experience, which could not be further from the truth.

DiogenesKynikos3mo ago

Learning 10,000 words is objectively more difficult than getting used to tones.

You can get used to the tones in a relatively short amount of time. If you are in an immersive environment for a month or two, you will end up wondering how it is that anyone can't hear the tones.

In contrast, there is simply no way to memorize thousands of words in that timeframe.

vjvjvjvjghv3mo ago

It took me a very long time to really understand how important tone is in Chinese.

DiogenesKynikos3mo ago

samus3mo ago

> This is all related to the existence of tones, but tones are not the direct reason why Chinese people have difficulty pronouncing words like "first."

cvhc3mo ago

Sxubas3mo ago

Spanish is such a blessing as mispronouncing a word rarely changes the meaning.

Whereas in Chinese, or to a lesser degree English, you have to be very mindful of how you pronounce stuff.

As a native Spanish speaker, the thing I dread the most is grammar and the absurd number of verb tenses there are. Even native speakers don't speak with perfect grammar.

cyberax3mo ago

I'm a native Russian speaker, and I decided to learn Mandarin, because it's linguistically almost the opposite of Russian.

barrell3mo ago

At least, this is the case for slow text. Once the text is sped up it’s amazing how my brain just stops processing that information. Both listening and speaking.

I’m sure this will come with practice and time but for now I find it fascinating

thenthenthen3mo ago

Euro speaker here, no problem with recognising tones but speaking them…:/

cyberax3mo ago

Are you studying a language with contour tones or with high/mid tone distinction? I tried to study a bit of Thai, and the high/mid tone distinction was a complete show-stopper for me.

laurieg3mo ago

My experience in learning Japanese pitch accent was eye-opening. At the start, I couldn't hear any difference. On quizzes I essentially scored the same as random guessing.

Of course, your mileage may vary with these techniques. I already spoke decent Japanese when I started doing this.

ronyeh3mo ago

> For example, "uh-oh" has a high-low pitch. If you say it wrong it sounds very strange. "Uh-huh" to show understanding goes low-high. Again, if you reverse it it sounds unusual.

I’m wondering if we can find good examples to teach the Mandarin tones. I think two or three syllable words are best because it illustrates the contour of the tones.

thaumasiotes3mo ago

Pitch levels are important enough in English that native speakers spontaneously develop ways to write them down, even though the standard written language has no way to indicate them.

However, they operate at the level of the sentence rather than the individual word, which sets up a conflict if an English speaker wants to learn Chinese.

The most common uses of pitch in English are to annotate the grammatical structure of a sentence, making it clear which words belong together in larger phrases, and to mark yes/no questions.

danparsonson3mo ago

bunderbunder3mo ago

dionian3mo ago

It's critical because without proper tonal enunciation the words can be ambiguous.

tifan3mo ago

vunderba3mo ago

When I was living in Taiwan, one of the ways I forced myself to remember to pronounce the tones distinctly was by waving my hand in front of me, tracing the arc of each character’s tone.

It helped a lot even if I did look like an insane expat conducting an invisible orchestra.

zdragnar3mo ago

I rather regret not emulating him, even though I haven't really used it for nearly 20 years and have forgotten most of it.

ecshafer3mo ago

mleonhard3mo ago

Over-exaggeration also works well when learning to play stringed instruments like cello.

luckydata3mo ago

sowbug3mo ago

You'll love Mike Laoshi: https://youtu.be/cna89A2KAU4?si=SQEZ_0ooO1z119_k

cyberax3mo ago

Hand motions help! Especially when you want to memorize new words, because initially you need to treat tone as something additional to remember.

I used simple index finger motions to mark tones.

simedwOP3mo ago

For accents, I’ve mostly tested with a few friends so far. I’m wondering whether region should be a parameter, because training on all dialects might make the system too lax.

vunderba3mo ago

It would probably be a lot of work, but it would be really interesting if you had sufficient datasets to train across accents.

Highly recommend taking a look at Phonemica for this:

https://phonemica.net/

devin3mo ago

This sounds like how solfège training works. You use a hand signal to indicate a specific tone: do re mi fa so la ti

bunderbunder3mo ago

This is very cool, but from one Mandarin learner to another I’d caution against relying too heavily on any external feedback mechanism for improving your pronunciation.

barrell3mo ago

zdc13mo ago

I feel like listening is the key to speaking. You don't necessarily need to rote learn the tones for each word. You just need to say words as you hear them spoken by others.

rahimnathwani3mo ago

The thing you've built is so good, and I would have loved to have it when I was learning Mandarin.

I tried it with a couple of sentences and it did a good job of identifying which tones were off.

yunusabd3mo ago

You're probably thinking of Praat, which is still around. Even has the same UI as 20 years ago.

memalign3mo ago

I wish this had a pinyin mode…! I am learning to speak Mandarin but I am not learning to read/write.

( I’m learning using a flashcards web app I made and continue to update with vocab I encounter or need: https://memalign.github.io/m/mandarin/cards/index.html )

data_ders3mo ago

simedwOP3mo ago

Great suggestion, added a toggle to see pinyin.

siwatanejo3mo ago

+1 for pinyin

knocte3mo ago

alixwang3mo ago

simedwOP3mo ago

It’s fairly sensitive to background noise at the moment. I’m planning to train an improved version with stronger data augmentation, including background noise.
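For anyone curious what background-noise augmentation can look like, here is a minimal sketch that mixes a noise clip into an utterance at a chosen signal-to-noise ratio. This is only an illustration, not the training code used for this model: `rms`, `mix_at_snr`, and the synthetic sine/noise inputs are all hypothetical, and it assumes non-empty audio given as plain lists of float samples.

```python
import math
import random


def rms(samples):
    """Root-mean-square amplitude of a non-empty list of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Return speech + scaled noise so the mix has the requested SNR in dB."""
    if len(noise) < len(speech):
        # Loop the noise clip so it covers the whole utterance.
        noise = noise * (len(speech) // len(noise) + 1)
    noise = noise[:len(speech)]
    noise_rms = rms(noise)
    if noise_rms == 0:
        return list(speech)  # silent noise clip: nothing to add
    # SNR_dB = 20 * log10(rms(speech) / (gain * rms(noise))), solved for gain.
    gain = rms(speech) / (noise_rms * 10 ** (snr_db / 20))
    return [s + gain * n for s, n in zip(speech, noise)]


# Synthetic stand-ins for a real recording and a real noise clip.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.uniform(-1.0, 1.0) for _ in range(4000)]
noisy = mix_at_snr(speech, noise, snr_db=10)
```

During training one would typically draw `snr_db` at random per example so the model sees both clean and noisy conditions.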

affogarty3mo ago

This is extremely cool, although I asked my wife (who is Chinese) to try it out and it said she made some mistakes.

hawflakes3mo ago

I tried it out and it has some issues with my native speech. I grew up with more Taiwan mandarin but I know the Beijing standard and the recognizer was flagging some of my utterances incorrectly.

zelphirkalt3mo ago

[1]: https://codeberg.org/ZelphirKaltstahl/xiaolong-dictionary

peterburkimsher3mo ago

Good to see that there are others learning and creating! Another shameless plug for my translator site: https://pingtype.github.io

It takes text, adds colours for tones, pinyin, literal, and parallel translations.

There’s also a character decomposition tool at the bottom of the page which can be helpful if you’re able to recognise half a character but can’t remember the pronunciation for typing it.

The YouTube channel has some song lyrics, movie subtitles, and audio Bible that might help with learning.

zelphirkalt3mo ago

    服 = 月 + 𠬝
    𠬝 = 卩 + 又

If you go further, wouldn't you also have to decompose "二" into "一" and "一"?

Bookmarked!
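On the question of how far to decompose: a minimal sketch, assuming the decomposition data is just a lookup table, shows that the stopping point is whatever the table says it is. The `DECOMPOSITION` table and `decompose` function here are hypothetical and contain only the two entries quoted above; this is not how Pingtype is implemented.

```python
# Hypothetical table holding only the two decompositions quoted above.
DECOMPOSITION = {
    "服": ["月", "𠬝"],
    "𠬝": ["卩", "又"],
}


def decompose(char, table=DECOMPOSITION):
    """Recursively expand a character into components with no table entry."""
    parts = table.get(char)
    if parts is None:
        return [char]  # no entry: treat the character as atomic
    leaves = []
    for part in parts:
        leaves.extend(decompose(part, table))
    return leaves


print(decompose("服"))  # ['月', '卩', '又']
```

With this structure, 二 stays atomic unless someone adds an entry such as `"二": ["一", "一"]`, so how deep the decomposition goes is an editorial choice in the data rather than something the algorithm decides.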

peterburkimsher3mo ago

Thanks for your long and thoughtful reply!

Matrix is just a visualisation tool, I never actually found a practical use for it other than looking cool.

Clicking any of the related characters (or numeric codes for radicals that don't have a Unicode representation) will then show the genealogy for that character.

See "copying from images" in http://localhost/pingtype/docs/docs.html

If I ever come to Berlin then your meetup sounds fun! I'm pretty far away though; I live in New Zealand now.

All the best with your learning, I hope you keep making progress!

frozennothing3mo ago

ChadNauseam3mo ago

Pronunciation correction is an insanely underdeveloped field. Hit me up via email/twitter/discord (my bio) if you're interested in collabing.

[0]: https://gist.github.com/anchpop/acbfb6599ce8c273cc89c7d1bb36...

stuxnet793mo ago

inkyoto3mo ago

Unlike Mandarin and other Chinese languages, Cantonese does not have tone sandhi and has changed tones instead.

Cantonese tones are also different from those of Mandarin, so no, it can't be adopted for Cantonese and it would require a complete rework.

> It is a surprisingly difficult language to learn.

In comparison, «fancy» tones of Vietnamese are significantly more challenging or even difficult – they can curl and unfurl (so to speak).

[0] That crown appears to belong to Archi, with honourable mentions going out to Inuit, Basque, Georgian, Navajo, Yimas and several other polysynthetic languages.

hnfong3mo ago

Cantonese is "hard" mainly for two reasons-

1. tones, and generally the gatekeeping of some Cantonese communities towards people who haven't gotten the tones completely right

2. the lack of learning materials relative to the number of speakers, the confusion between written Chinese and written Cantonese (and also the general lack of the latter)

As they say, "a language is a dialect with an army and navy"... I'll leave it at that.

calf3mo ago

inkyoto3mo ago

Given that linguistics does not have a concept of what makes a language «hard» or not, the language hardness classification is highly subjective and perceptional.

I have already commented on why I do not think that Cantonese tones are hard, so I will leave it at that – it is the first, oft repeated myth that is not based on facts.

> 2. the lack of learning materials relative to the number of speakers […]

If their printed versions are not easily locally available, they can be purchased as Kindle books as well.

Granted, Mandarin surpasses Cantonese in terms of the quantity of learning materials, and that is a dry fact.

> […] the confusion between written Chinese and written Cantonese […]

> […] (and also the general lack of the latter).

To sum it up, I do not find any of the counterarguments to be compelling, persuasive or supported by linguistic facts which would make Cantonese a «hard» language.

[1] https://cantoneseforfamilies.com/cantonese-vernacular-and-fo...

[2] https://hkupress.hku.hk/image/catalog/pdf-preview/9789622097...

rablackburn3mo ago

> And if there’s one thing we’ve learned over the last decade, it’s the bitter lesson: when you have enough data and compute, learned representations usually beat carefully hand-tuned systems.

There are still holdouts!

Come back to me in a couple of decades when the trove of humanity's data has been pored over and drifted further out of sync with (verifiable) reality.

Hand-tuning is the only way to make progress when you've hit a domain's limits. Go deep and have fun.

sgt3mo ago

Just curious - would you need insane HW infrastructure to begin with, or is hosted/managed enough? And what tooling is preferred by the industry for the "training"?

cocoa193mo ago

Have you tried the Azure Speech Studio? I wonder how your custom model compares to this solution.

dirteater_3mo ago

IMO the SotA for this is https://www.speechsuper.com/. Amazon suffers for similar

> One annoyance is that for Mandarin, the percentage is calculated at the character level, whereas with English, it gives you a more granular score at the phoneme level.

drekipus3mo ago

instantly awesome.

I suck at Chinese but I want to get better, and I'm too embarrassed to try and talk with real people and practise.

This is a great compromise. even just practising for a few minutes I already feel way more confident based on its feedback, and I feel like I know more about the details of pronunciation.

I'm worried this might get too big and start sucking like everything else.

erdemo3mo ago

This thread is like a diamond to me because I have been thinking about building almost the same thing for English tones. I need a model like this.

I'm sure there are a bunch of apps out there that claim they do the same thing, but they don't, IMO. Even if they do, as you said, where is the fun in that?

Great post, thanks for it!

ctkhn3mo ago

kris_builds3mo ago

alexandermorgan3mo ago

SequoiaHope3mo ago

Amazingly I just did the same thing! Only with AISHELL. It needs work. I used the encoder from the Meta MMS model.

https://github.com/sequoia-hope/mandarin-practice

arjie3mo ago

redleader553mo ago

This is very cool to have! Thanks for putting in the time to build it.

For me it doesn't work very well. Even easy phrases like 他很忙 get transcribed as something completely random like "ma he yu". Is it maybe over-fitted to some type of voice?

tomaytotomato3mo ago

Can the implementation used here for tone and pronunciation be applied to music?

It would be cool if a model could tell you if you are singing or playing a piece of music with the right intonation and other ways.

olalonde3mo ago

It might be a mic issue but my wife, who is a native speaker, seems to get most characters wrong. I will try again later in a quieter place to see if that helps.

jainaayush053mo ago

Any plans on releasing the inference/training code?

byb3mo ago

Neat. A personal tone trainer. Seriously, shut up and take my money now. Of course, it needs a vocabulary trainer, and zhuyin/traditional character support.

jrockway3mo ago

Interesting application! A friend of mine built a model like this to help her make her voice more feminine, and it is neat to see a similar use case here.

sim04ful3mo ago

I'm also working on a Chinese learning app (heyzima.com) and my "solution" to this was to use the TTS token/word log probabilities.

baby3mo ago

For people trying to say the "j" sound correctly, as in "jiu" (old), just say "dz", so in that example "dziu"

JCharante3mo ago

Cool! I'm not great at Chinese but I have to speak slowly for it to recognize the tones/words. I wonder how fast the training data is.

dionian3mo ago

it heard wu2 but i heard wo2 from you fine. and it should sound like wo2 not wo3 if spoken quickly. not a native speaker though so i could be wrong
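To make the wo3 → wo2 point concrete, here is a minimal sketch of the basic third-tone sandhi rule (a third tone directly before another third tone is pronounced as a second tone). It is a deliberate simplification: it ignores prosodic grouping in longer runs of third tones, and `apply_third_tone_sandhi` is a hypothetical helper, not something the tool is known to do.

```python
def apply_third_tone_sandhi(syllables):
    """Map citation tones to spoken tones for third-tone sequences.

    syllables: list of (pinyin, tone) pairs, tone in 1..5 (5 = neutral).
    Simplified rule: tone 3 followed by a citation tone 3 becomes tone 2.
    """
    spoken = []
    for i, (pinyin, tone) in enumerate(syllables):
        next_tone = syllables[i + 1][1] if i + 1 < len(syllables) else None
        if tone == 3 and next_tone == 3:
            tone = 2
        spoken.append((pinyin, tone))
    return spoken


# 我很忙: citation tones wo3 hen3 mang2
print(apply_third_tone_sandhi([("wo", 3), ("hen", 3), ("mang", 2)]))
# [('wo', 2), ('hen', 3), ('mang', 2)]
```

A grader could compare what it hears against the output of a rule like this instead of the dictionary tones, so a correctly spoken wo2 would not be flagged as a mistake.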

holg3mo ago

Great idea and effort, thanks for sharing. It is even way more strict than my native Chinese tryarounds :D

bytesandbits3mo ago

anonzzzies3mo ago

namr20003mo ago

Wow, I was going to make something almost exactly like this! Really cool work and thank you for sharing

while_true_3mo ago

Suggestion: in addition to the microphone input, allow the user to upload an audio file.

eudamoniac3mo ago

How do you know that what it tells you is correct if you can't hear it yourself?

btrlsnqtn3mo ago

allan_s3mo ago

Maybe he means that LLMs will hit a glass ceiling, or that the "right" approach will give equivalent results with less training and less intensive compute requirements?

victorbjorklund3mo ago

Cool. Would love a write up about how you did it if you have time

kris_builds3mo ago

mentalgear3mo ago

Very cool! Will you make the source available as well?

martianlantern3mo ago

Nice! I need something similar for English now

irl_zebra3mo ago

jellojello3mo ago

simedwOP3mo ago

Thank you.

I had a quick look at Farsi datasets, and there seem to be a few options. That said, written Farsi doesn’t include short vowels… so can you derive pronunciation from the text using rules?

kranner3mo ago

> written Farsi doesn’t include short vowels… so can you derive pronunciation from the text using rules?

You can't, but Farsi dictionaries list the missing short vowels/diacritics/"eraab" for every word.

For instance, see this entry: https://vajehyab.com/dehkhoda/%D8%AD%D8%B3%D8%A7%D8%A8?q=%D8...

With the short vowel on the first letter it would be written حِساب (normally written as just حساب)

The dictionary entry linked shows that there is a ِ on the first letter ح

But you would have to disambiguate between homographs that differ only in the eraab.

peterburkimsher3mo ago

I made a parallel literal translator for Farsi:

https://pingtype.github.io/farsi.html

Paste in some parallel text (e.g. Bible verses, movie subtitles, song lyrics) and read what Farsi you can on the first line, looking to the lower lines for clues if you get stuck.

The core version of Pingtype is for traditional Chinese, but it supports a few other languages too.

maximedupre3mo ago

This is sick... you can just do things :D

cheonn6383mo ago

Unclear if it wants 媽媽 / 妈妈 as:

- māmā (incorrect)

- māma (correct)

nirvanatikku3mo ago

talk about 30 seconds to wow. great app, UX and demo. would love to use this. kudos.

felixbecker3mo ago

What a brilliant project!

somesun3mo ago