How AI hears accents: An audible visualization of accent clusters
Overall reception & visualization
- Many found the tool fun and compelling, especially clicking points to hear accents and exploring the 3D UMAP visualization.
- Several praised the clarity of the JS code and the use of Plotly; one compared it to classic MNIST/embedding visualizers (a plotting sketch follows this list).
- Some asked for ways to subscribe (e.g., via RSS) and for more such visualizations of other latent spaces.
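To make the kind of view being praised concrete, here is a minimal sketch of a 3D embedding scatter using Plotly's Python API. The original tool is written in JavaScript; the `umap_3d` array and accent labels below are synthetic stand-ins, and click-to-play audio would additionally require a JS callback, a Dash app, or Plotly's `FigureWidget` click events.

```python
import numpy as np
import plotly.graph_objects as go

# Synthetic stand-ins: a 3D UMAP projection of accent embeddings plus labels.
rng = np.random.default_rng(0)
umap_3d = rng.normal(size=(500, 3))
labels = rng.choice(["Spanish", "French", "Australian"], size=500)

fig = go.Figure()
for accent in np.unique(labels):
    mask = labels == accent
    fig.add_trace(go.Scatter3d(
        x=umap_3d[mask, 0], y=umap_3d[mask, 1], z=umap_3d[mask, 2],
        mode="markers", name=accent, marker=dict(size=3),
    ))
fig.show()
```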
Model, latent space & methods
- Accent model: ~12 layers × 768 dimensions; the 3D plot is a UMAP projection of these embeddings.
- The model wasn’t explicitly disentangled for timbre or pitch; fine-tuning for accent classification appears to push later layers to discard non-accent characteristics (verified at least for gender; see the probing sketch after this list).
- One commenter questioned the choice of UMAP over t‑SNE, noting UMAP’s “line” artifacts versus t‑SNE’s more blob-like clusters (see the projection sketch after this list).
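To make the projection step concrete, here is a minimal sketch contrasting UMAP and t‑SNE on 768-dimensional embeddings, using the umap-learn and scikit-learn packages. The `embeddings` array is a synthetic stand-in and the hyperparameters are illustrative, not the tool's actual settings:

```python
import numpy as np
from sklearn.manifold import TSNE
import umap  # pip install umap-learn

# Synthetic stand-in for one layer of accent embeddings (n_samples x 768).
embeddings = np.random.default_rng(0).normal(size=(2000, 768))

# UMAP preserves more global structure and can produce filament-like "lines";
# t-SNE typically yields more compact, blob-like clusters.
umap_3d = umap.UMAP(n_components=3, n_neighbors=15, min_dist=0.1,
                    random_state=0).fit_transform(embeddings)
tsne_3d = TSNE(n_components=3, perplexity=30,
               random_state=0).fit_transform(embeddings)

print(umap_3d.shape, tsne_3d.shape)  # both (2000, 3)
```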
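The gender check described above is the kind of result a layer-wise probing classifier can verify. The probing sketch below shows the general shape of such a check, with synthetic stand-ins for the per-layer embeddings and labels (this is not the authors' exact procedure): if fine-tuning really strips gender information, probe accuracy should fall toward chance in the later layers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Synthetic data: 12 layers of (n_samples x 768) embeddings plus binary labels.
layer_embeddings = [rng.normal(size=(1000, 768)) for _ in range(12)]
gender = rng.integers(0, 2, size=1000)

# Train a linear probe per layer; accuracy near 0.5 means the layer carries
# little linearly-decodable gender information.
for i, X in enumerate(layer_embeddings):
    acc = cross_val_score(LogisticRegression(max_iter=1000), X, gender,
                          cv=5, scoring="accuracy").mean()
    print(f"layer {i:2d}: probe accuracy = {acc:.3f}")
```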
Dataset, labels & clustering quirks
- Spanish is highly scattered, attributed to:
  - Many distinct dialects collapsed into a single “Spanish” label.
  - Label noise and a highly imbalanced dataset in which Spanish is the most common class, leading the model to over-predict it when uncertain (see the class-weighting sketch after this list).
- Users repeatedly requested breakdowns by regional variety (Spanish by country/region; similarly for French, Chinese, Arabic, UK English, German, etc.).
- Irish English appears poorly modeled due to limited labeled Irish data; finer UK/Irish regional labels are planned.
- Observed clusters prompted discussion:
  - Persian, Turkish, and Slavic/Balkan languages clustering together.
  - Perceived similarity between Portuguese (especially European Portuguese) and Russian.
  - Australian–Vietnamese proximity, likely reflecting the geography of English teachers rather than phonetic similarity.
  - A tight cluster of Australian, British, and South African English despite large perceived differences to human ears.
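One common mitigation for the imbalance problem flagged above, offered here as a hedged sketch rather than the authors' actual training setup, is to weight the loss by inverse class frequency so the model is not rewarded for defaulting to the majority label when uncertain:

```python
import numpy as np
import torch
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical accent labels with Spanish heavily over-represented.
labels = np.array(["spanish"] * 700 + ["french"] * 200 + ["irish"] * 100)
classes = np.unique(labels)

# "balanced" gives each class weight n_samples / (n_classes * count),
# down-weighting the majority class in the training loss.
weights = compute_class_weight("balanced", classes=classes, y=labels)
loss_fn = torch.nn.CrossEntropyLoss(
    weight=torch.tensor(weights, dtype=torch.float32))
print(dict(zip(classes, weights.round(3))))
```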
Voice standardization & audio quality
- All samples use a single “neutral” synthetic voice to protect privacy and emphasize accent over speaker identity.
- Some listeners found this helpful; others said:
  - The voices all sound like the same middle-aged man.
  - “French” and “Spanish” samples don’t resemble real native accents they know (e.g., missing characteristic /r/ patterns and prosody).
  - Many accents sound like generic non-native English with only a faint hint of the labeled accent.
- The authors acknowledge that the accent-preserving voice conversion is still early and imperfect.
User experience with the accent oracle
- Some users were classified correctly or nearly so; others got surprising results (e.g., a Yorkshire speaker labeled Dutch, Americans labeled Swedish).
- Deaf and hard-of-hearing users reported:
  - Being misclassified (often as Scandinavian) both by the model and by non-native listeners in real life, while native listeners correctly identify them as native speakers with a speech difference.
  - ASR systems struggling heavily with their speech; commenters suggested fine-tuning Whisper on personalized data (see the fine-tuning sketch after this list).
- Several criticized the notion of a single “British” or “German” accent and the framing of “having an accent,” noting everyone has one.
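For the Whisper suggestion, the standard Hugging Face fine-tuning recipe looks roughly like the sketch below. The data folder, column names, checkpoint, and hyperparameters are placeholders for a user's own recordings; this is a simplified outline of the published recipe, not a drop-in script.

```python
import torch
from datasets import load_dataset, Audio
from transformers import (Seq2SeqTrainer, Seq2SeqTrainingArguments,
                          WhisperForConditionalGeneration, WhisperProcessor)

# Placeholder dataset: a folder of recordings with a metadata.csv mapping
# file_name -> transcription (the datasets "audiofolder" convention).
ds = load_dataset("audiofolder", data_dir="my_recordings")["train"]
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

def prepare(batch):
    # Log-mel input features for the encoder, token ids for the decoder.
    audio = batch["audio"]
    batch["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    batch["labels"] = processor.tokenizer(batch["transcription"]).input_ids
    return batch

ds = ds.map(prepare, remove_columns=ds.column_names)

def collate(features):
    # Whisper inputs are fixed-size log-mel spectrograms; only the label
    # token sequences need padding (pad positions masked out with -100).
    input_features = torch.tensor([f["input_features"] for f in features])
    labels = processor.tokenizer.pad(
        [{"input_ids": f["labels"]} for f in features], return_tensors="pt"
    )
    masked = labels["input_ids"].masked_fill(labels["attention_mask"].ne(1), -100)
    return {"input_features": input_features, "labels": masked}

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="whisper-personalized",
        per_device_train_batch_size=8,
        learning_rate=1e-5,
        max_steps=500,
    ),
    train_dataset=ds,
    data_collator=collate,
)
trainer.train()
```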
Ethical and linguistic reflections
- Some argued the product targets the insecurity of non-native speakers who want to “sound native”; others warned against overstating that “need” and playing on fears.
- A few found it offensive that native accents could be implicitly treated as “less than native.”
- Commenters noted that accent perception involves not only segmental sounds but also prosody, vocabulary, and local idioms, which the tool does not model explicitly.