How AI hears accents: An audible visualization of accent clusters

Overall reception & visualization

  • Many found the tool fun and compelling, especially clicking points to hear accents and exploring the 3D UMAP visualization.
  • Several praised the clarity of the JS code and use of Plotly; one compared it to classic MNIST/embedding visualizers.
  • Some asked for ways to subscribe (e.g., via RSS) and for more such visualizations of other latent spaces.

Model, latent space & methods

  • Accent model: ~12 layers × 768 dimensions; the 3D plot is a UMAP projection of these embeddings.
  • The model wasn’t explicitly disentangled for timbre/pitch; fine-tuning for accent classification appears to push later layers to ignore non-accent characteristics (verified at least for gender).
  • One commenter questioned the choice of UMAP over t‑SNE, noting UMAP’s “line” artifacts versus t‑SNE’s more blob-like clusters (a projection sketch comparing the two follows this list).
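
To make the projection step concrete, here is a minimal sketch that reduces a matrix of 768-dimensional accent embeddings to 3D with both UMAP and t-SNE and renders the UMAP result with Plotly. The embedding matrix, labels, and every hyperparameter are stand-ins rather than the authors’ settings, and the article’s own visualization code is JS, not Python.

```python
# Sketch: project 768-d accent embeddings to 3D with UMAP and t-SNE.
# The embeddings, labels, and hyperparameters below are placeholders.
import numpy as np
import plotly.express as px
import umap  # pip install umap-learn
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(2000, 768))                        # stand-in for real accent embeddings
labels = rng.choice(["spanish", "french", "german"], size=2000)  # stand-in accent labels

# UMAP keeps more global structure and can produce filament-like "lines".
coords_umap = umap.UMAP(n_components=3, n_neighbors=15,
                        min_dist=0.1, random_state=0).fit_transform(embeddings)

# t-SNE emphasizes local neighborhoods, so clusters tend to look more blob-like.
coords_tsne = TSNE(n_components=3, perplexity=30,
                   random_state=0).fit_transform(embeddings)

# Interactive 3D scatter of the UMAP projection, colored by accent label.
fig = px.scatter_3d(x=coords_umap[:, 0], y=coords_umap[:, 1], z=coords_umap[:, 2],
                    color=labels, opacity=0.6,
                    title="UMAP projection of accent embeddings")
fig.show()
```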

Dataset, labels & clustering quirks

  • Spanish is highly scattered, attributed to:
    • Many distinct dialects collapsed into a single “Spanish” label.
    • Label noise and a highly imbalanced dataset in which Spanish is the most common class, leading the model to over-predict it when uncertain (a class-weighting sketch follows this list).
  • Users repeatedly requested breakdowns by regional varieties (Spanish by country/region; similarly French, Chinese, Arabic, UK English, German, etc.).
  • Irish English appears poorly modeled due to limited labeled Irish data; finer UK/Irish regional labels are planned.
  • Observed clusters prompted discussion:
    • Persian–Turkish–Slavic/Balkan languages clustering together.
    • Perceived similarity between Portuguese (especially European) and Russian.
    • Australian–Vietnamese proximity likely reflecting teacher geography rather than phonetic similarity.
    • Tight cluster of Australian, British, and South African English despite large perceived differences to human ears.
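
The thread does not say whether the authors correct for the skew toward Spanish. One common mitigation, sketched below with assumed labels and counts, is to weight the classification loss inversely to class frequency so the model gains nothing from defaulting to the majority class when it is unsure.

```python
# Sketch: counteract an over-represented class (here "spanish") with a weighted loss.
# Labels, counts, and the four-class setup are illustrative assumptions.
import numpy as np
import torch
import torch.nn as nn
from sklearn.utils.class_weight import compute_class_weight

classes = np.array(["spanish", "french", "german", "irish"])
train_labels = np.array(["spanish"] * 6000 + ["french"] * 1500
                        + ["german"] * 1200 + ["irish"] * 300)

# "balanced" weights = n_samples / (n_classes * count_per_class):
# rare classes get larger weights, the majority class a smaller one.
weights = compute_class_weight("balanced", classes=classes, y=train_labels)
print(dict(zip(classes, weights.round(2))))

# Use the weights in the loss for the accent classification head;
# integer targets index into `classes` in the same order.
criterion = nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float32))
logits = torch.randn(8, len(classes))            # stand-in classifier outputs
targets = torch.randint(0, len(classes), (8,))   # stand-in class ids
loss = criterion(logits, targets)
```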

Voice standardization & audio quality

  • All samples use a single “neutral” synthetic voice to protect privacy and emphasize accent over speaker identity.
  • Some listeners found this helpful; others said:
    • All the voices sound like the same middle-aged man.
    • “French” and “Spanish” samples don’t resemble real native accents they know (e.g., missing characteristic /r/ patterns, prosody).
    • Many accents sound like generic non-native English with only a faint hint of the labeled accent.
  • Authors acknowledge the accent-preserving voice conversion is early and imperfect.

User experience with the accent oracle

  • Some users were classified correctly or nearly so; others got surprising results (e.g., Yorkshire labeled Dutch, Americans labeled Swedish).
  • Deaf and hard-of-hearing users reported:
    • Being misclassified (often as Scandinavian) both by the model and by non-native listeners in real life, whereas native listeners correctly identify them as native speakers with a speech difference.
    • ASR systems struggling heavily with their speech; commenters suggested fine-tuning Whisper on personalized data (a minimal fine-tuning sketch follows this list).
  • Several criticized the notion of a single “British” or “German” accent and the framing of “having an accent,” noting everyone has one.
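
The Whisper suggestion above is not something the article implements; the skeleton below follows the widely used Hugging Face recipe for fine-tuning Whisper on a small personal dataset. The data directory, the transcript column name, the model size, and every hyperparameter are placeholders.

```python
# Sketch: fine-tune Whisper on personalized recordings (Hugging Face recipe).
# Paths, the "text" column name, model size, and hyperparameters are placeholders.
from dataclasses import dataclass
import torch
from datasets import load_dataset, Audio
from transformers import (WhisperProcessor, WhisperForConditionalGeneration,
                          Seq2SeqTrainingArguments, Seq2SeqTrainer)

processor = WhisperProcessor.from_pretrained("openai/whisper-small",
                                             language="english", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Personal recordings: an "audio" column plus a transcript column (assumed to be "text").
ds = load_dataset("audiofolder", data_dir="my_recordings")["train"]
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

def prepare(batch):
    audio = batch["audio"]
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    batch["labels"] = processor.tokenizer(batch["text"]).input_ids
    return batch

ds = ds.map(prepare, remove_columns=ds.column_names)

@dataclass
class SpeechCollator:
    processor: WhisperProcessor
    def __call__(self, features):
        inputs = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(inputs, return_tensors="pt")
        labels = [{"input_ids": f["labels"]} for f in features]
        labels = self.processor.tokenizer.pad(labels, return_tensors="pt")
        # Mask padding tokens so they are ignored by the loss.
        batch["labels"] = labels["input_ids"].masked_fill(labels["attention_mask"].ne(1), -100)
        return batch

args = Seq2SeqTrainingArguments(output_dir="whisper-personalized",
                                per_device_train_batch_size=8,
                                learning_rate=1e-5, max_steps=500,
                                fp16=torch.cuda.is_available())
trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=ds,
                         data_collator=SpeechCollator(processor))
trainer.train()
```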

Ethical and linguistic reflections

  • Some argued the product targets the insecurity of non-native speakers who want to “sound native”; others warned against overstating that “need” and playing on people’s fears.
  • A few found it offensive that some native speakers’ accents could be implicitly treated as “less than native.”
  • Commenters noted that accent perception involves not only segmental sounds but prosody, vocabulary, and local idioms, which the tool does not model explicitly.