Complete silence is always hallucinated as “ترجمة نانسي قنقر” (“Translation by Nancy Qunqar”) in Arabic
Observed Behavior Across Languages
- Users report that Whisper, especially large-v3, frequently “hears” fixed phrases during silence:
  - Arabic: “translation by [person]”.
  - German: “Subtitling of [broadcaster] for [network], 2017”.
  - Czech, Italian, Romanian, Russian, Turkish, Chinese, English, Welsh, Norwegian, Danish, Dutch, French: variants of “subtitles by X”, “thanks for watching”, “don’t forget to like and subscribe”, broadcaster credits, or similar.
- Similar artifacts show up in other products built on Whisper or similar models (Telegram voice recognition, ChatGPT audio, video platforms’ auto-captions).
Suspected Training Data Sources
- Widely shared belief that the model was trained heavily on subtitle tracks from:
  - Movies and TV (including fansubs and community subtitles).
  - YouTube-style content and other online videos.
- Silent credit-roll segments often contain translator or channel credits instead of “[silence]”, so silence in training data is frequently paired with such strings.
- Some commenters point to specific subtitle sites and torrent-associated subtitles; others note there are also large “public” subtitle corpora.
Technical Cause: Overfitting vs Garbage Data
- One camp calls this classic overfitting: the model learns spurious correlations (silence → credits) that hurt generalization.
- Another camp says it’s primarily bad labeling / classification: silence is inconsistently labeled or not labeled at all, so the model has no clean “silence → nothing” pattern to learn.
- Several note both can be true: dirty data causes the model to overfit to noise.
- Broader point: the model can’t recognize “I don’t know” and instead picks the most likely learned pattern.
Mitigations and Usage Patterns
- Many practitioners say Whisper is usable only with strong preprocessing:
  - Voice Activity Detection (VAD) or silence trimming before feeding audio to the model.
  - Some commercial and open-source pipelines (e.g., WhisperX, faster-whisper with VAD) significantly reduce hallucinations.
- Suggestions include small classifier models to detect hallucinations, simple silence detection, and post-filters that strip known credit phrases.
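The silence-trimming and post-filtering ideas above can be sketched in pure Python. This is a minimal illustration, not a production VAD (pipelines like faster-whisper rely on a trained model such as Silero VAD); the frame size, energy threshold, and phrase list here are assumptions chosen for the example.

```python
# Sketch: an energy-based silence gate (a crude stand-in for a real VAD)
# plus a post-filter that drops transcript segments matching known
# hallucinated credit phrases. All thresholds and phrases are illustrative.

import math

FRAME_SIZE = 1600  # 0.1 s of audio at 16 kHz

# Illustrative blocklist of frequently reported hallucinations.
KNOWN_HALLUCINATIONS = {
    "ترجمة نانسي قنقر",
    "thanks for watching",
    "don't forget to like and subscribe",
}

def rms(frame):
    """Root-mean-square energy of a frame of float samples in [-1, 1]."""
    return math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0

def trim_silence(samples, threshold=0.01):
    """Keep only frames whose RMS energy exceeds the threshold."""
    voiced = []
    for i in range(0, len(samples), FRAME_SIZE):
        frame = samples[i:i + FRAME_SIZE]
        if rms(frame) > threshold:
            voiced.extend(frame)
    return voiced

def strip_hallucinations(segments):
    """Drop transcript segments that exactly match a known credit phrase."""
    blocklist = {p.lower() for p in KNOWN_HALLUCINATIONS}
    return [s for s in segments if s.strip().lower() not in blocklist]
```

In practice the gate would run before transcription (so silent stretches never reach the model) and the phrase filter after it, as a safety net; exact-match filtering is deliberately conservative to avoid deleting legitimate speech.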
Copyright, Piracy, and Fair Use Debate
- Strong suspicion that training corpora include pirated or unofficial content (fansubs, torrent subtitles, paywalled books and media).
- Long debate over:
  - The distinction between training as potential “fair use” and illegally acquiring the material.
  - A perceived double standard: individuals fined for torrenting vs. AI companies scraping and pirating at massive scale.
  - Ongoing lawsuits and preliminary rulings in which training itself may be fair use, but obtaining pirated data is not.
Broader Takeaways about AI Limits
- Many see this as evidence that these systems are pattern matchers, not reasoners: they confidently hallucinate plausible text in edge cases like silence.
- Commenters stress that “garbage in, garbage out” applies: poor data cleaning surfaces directly in model behavior, sometimes amusingly, sometimes in legally risky ways.