Show HN: I trained a 9M speech model to fix my Mandarin tones

Overall reception

  • Many commenters are enthusiastic, calling it an immediate “wow” with great UX and a very useful compromise for shy learners who don’t want to practice with people.
  • Several say it would have been invaluable when they first learned Mandarin; others say even a few minutes of use already increased their confidence.

Usefulness and learning strategies

  • People relate it to prior tools like Praat and various commercial pronunciation graders.
  • Commenters connect it to known pedagogy: exaggerated tones during learning, mimicking native speakers, and using hand motions or solfege-like gestures to embody tonal contours.
  • Multiple experienced learners warn against relying solely on external scoring; they emphasize ear training (minimal pairs, lots of listening, overlaying your recording with a native one) as critical for long‑term pronunciation and listening gains.
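
The "overlay your recording with a native one" advice can be roughly approximated in code. This is a minimal sketch, not how the tool or any commenter does it. It assumes the pitch (F0) values in Hz have already been extracted for one syllable, for example with Praat, and all function names are illustrative. Each contour is converted to semitones relative to the speaker's own mean, so a low voice and a high voice producing the same tone shape compare as equal.

```python
import math

def to_semitones(f0_hz):
    """Convert F0 values (Hz) to semitones relative to the speaker's own mean."""
    log_vals = [12 * math.log2(f) for f in f0_hz]
    mean = sum(log_vals) / len(log_vals)
    return [v - mean for v in log_vals]

def resample(values, n):
    """Linearly resample a contour to n points (n >= 2) so lengths match."""
    if len(values) == 1:
        return values * n
    out = []
    for i in range(n):
        pos = i * (len(values) - 1) / (n - 1)
        lo = int(pos)
        hi = min(lo + 1, len(values) - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out

def contour_distance(learner_hz, native_hz, n=20):
    """Root-mean-square difference, in semitones, between two pitch contours."""
    a = resample(to_semitones(learner_hz), n)
    b = resample(to_semitones(native_hz), n)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / n)

# Hypothetical F0 tracks: a falling contour from a native speaker,
# the same shape from a learner with a lower voice, and a rising attempt.
native_fall = [300.0, 260.0, 220.0, 180.0]
learner_fall = [150.0, 130.0, 110.0, 90.0]
learner_rise = [90.0, 110.0, 130.0, 150.0]

print(contour_distance(learner_fall, native_fall))  # close to zero: same shape
print(contour_distance(learner_rise, native_fall))  # clearly larger: wrong direction
```

A distance like this is only a rough check on contour shape. It does not replace the listening practice the commenters recommend.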

Accuracy, speed, accents, and noise

  • Several native speakers report poor results: correct tones and syllables misclassified, especially in casual or fast speech; the system often works only when words are spoken slowly and distinctly.
  • Issues are noted for Taiwan-accented Mandarin, Beijing-standard speakers, and noisy environments; the model is described as sensitive to background noise.
  • Some examples suggest phrase‑level bias (e.g., favoring very common collocations) and point to the limitations of mapping to a fixed set of 1,200-odd allowed syllables.
  • Users ask whether tone sandhi is modeled; most evidence suggests it is not, making the tool more suitable for isolated words or very careful speech.
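
To illustrate why unmodeled tone sandhi matters, here is a minimal sketch of the basic third-tone rule: a third tone followed by another third tone is pronounced as a second tone. The function name and the integer tone encoding are assumptions for illustration and are not taken from the tool. Real sandhi in longer phrases also depends on word grouping, which this sketch ignores.

```python
def apply_third_tone_sandhi(tones):
    """Return surface tones after the basic third-tone sandhi rule.

    tones: one int per syllable, 1-4 for the four tones, 5 for neutral.
    Simplification: a tone 3 becomes tone 2 whenever the next syllable
    is an underlying tone 3; prosodic grouping is not considered.
    """
    surface = list(tones)
    for i in range(len(tones) - 1):
        if tones[i] == 3 and tones[i + 1] == 3:
            surface[i] = 2
    return surface

# 你好 (nǐ hǎo) has dictionary tones 3-3 but is pronounced 2-3.
citation = [3, 3]
spoken = apply_third_tone_sandhi(citation)
print(spoken)  # [2, 3]

# A grader that compares against dictionary tones would flag the first syllable.
mismatches = [i for i, (c, s) in enumerate(zip(citation, spoken)) if c != s]
print(mismatches)  # [0]
```

If a grader scores against dictionary tones only, a natural 2-3 pronunciation of 你好 would be marked wrong, which is consistent with the reports that the tool works best on isolated words or very careful speech.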

Debate on tones: difficulty and importance

  • European and Russian learners describe tones and pitch accent as initially unintuitive, especially at natural speed, but trainable with practice.
  • There’s disagreement on importance. Some natives say tones are overrated and that communication relies heavily on context and tolerates regional variation. Others insist that badly wrong tones make Mandarin communication very hard, and give concrete minimal-pair examples where meaning flips.
  • Several note that disyllabic words and context reduce ambiguity over time, but beginners with limited vocabulary are more vulnerable to tonal errors.

Extensions, requests, and related work

  • Frequent feature requests: pinyin input mode, zhuyin and traditional character support, integrated vocabulary training.
  • Interest in adapting the idea to other languages and tasks: Cantonese (noted as requiring a separate system), Farsi and Hebrew (vowel recovery), English pronunciation, and even music intonation or voice feminization.
  • Commenters link alternative APIs and products (Azure, Amazon, SpeechSuper) and share their own open‑source or commercial Chinese learning tools, tone‑coloring translators, and character decomposition utilities.

Modeling and “bitter lesson” discussion

  • Some ask for a technical write‑up: architecture and training objective (transformer, CNN, CTC), datasets used, handling of ambiguities, and data collection pipeline.
  • There’s brief debate about Sutton’s “bitter lesson” versus hand‑tuned systems, and questions about what hardware and tooling are needed to train similar specialized speech models for other dialects or domains.
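
For readers unfamiliar with CTC, one of the options commenters asked about, here is a minimal sketch of greedy CTC decoding. The thread does not confirm that the model uses CTC, and the token ids below are hypothetical. A CTC-trained model emits one label per audio frame, including a special blank; decoding collapses consecutive repeats and then drops the blanks.

```python
from itertools import groupby

BLANK = 0  # hypothetical id reserved for the CTC blank label

def ctc_greedy_decode(frame_ids, blank=BLANK):
    """Collapse per-frame best labels into an output sequence.

    Step 1: merge runs of identical consecutive labels.
    Step 2: remove blank labels.
    """
    collapsed = [label for label, _ in groupby(frame_ids)]
    return [label for label in collapsed if label != blank]

# Hypothetical per-frame argmax output; 1 and 2 stand for two syllable tokens.
frames = [0, 1, 1, 0, 1, 2, 2, 0]
print(ctc_greedy_decode(frames))  # [1, 1, 2]
```

The blank between the two runs of 1 is what lets the same token appear twice in a row in the output.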