Complete silence is always hallucinated as “ترجمة نانسي قنقر” (“Translation by Nancy Qunqar”) in Arabic
Observed Behavior Across Languages
- Users report that Whisper, especially large-v3, frequently “hears” fixed phrases during silence:
  - Arabic: “translation by [person]”.
  - German: “Subtitling of [broadcaster] for [network], 2017”.
  - Czech, Italian, Romanian, Russian, Turkish, Chinese, English, Welsh, Norwegian, Danish, Dutch, French: variants of “subtitles by X”, “thanks for watching”, “don’t forget to like and subscribe”, broadcaster credits, or similar.
- Similar artifacts show up in other products built on Whisper or similar models (Telegram voice recognition, ChatGPT audio, video platforms’ auto-captions).
Suspected Training Data Sources
- Widely shared belief that the model was trained heavily on subtitle tracks from:
  - Movies and TV (including fansubs and community subtitles).
  - YouTube-style content and other online videos.
- Silent credit-roll segments often contain translator or channel credits instead of “[silence]”, so silence in training data is frequently paired with such strings.
- Some commenters point to specific subtitle sites and torrent-associated subtitles; others note there are also large “public” subtitle corpora.
Technical Cause: Overfitting vs Garbage Data
- One camp calls this classic overfitting: the model learns spurious correlations (silence → credits) that hurt generalization.
- Another camp says it’s primarily bad labeling / classification: silence is inconsistently labeled or not labeled at all, so the model has no clean “silence → nothing” pattern to learn.
- Several note both can be true: dirty data causes the model to overfit to noise.
- Broader point: the model can’t recognize “I don’t know” and instead picks the most likely learned pattern.
Mitigations and Usage Patterns
- Many practitioners say Whisper is usable only with strong preprocessing:
  - Voice Activity Detection (VAD) or silence trimming before feeding audio to the model.
  - Some commercial and open-source pipelines (e.g., WhisperX, faster-whisper with VAD) significantly reduce hallucinations.
- Suggestions include small classifier models to detect hallucinations, simple silence detection, and post-filters that strip known credit phrases.
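The silence-trimming and post-filtering ideas above can be sketched in pure Python. This is a minimal illustration, not a production VAD (pipelines like faster-whisper rely on a trained model such as Silero VAD); the frame size, energy threshold, and phrase list here are assumptions chosen for the example.

```python
# Sketch: an energy-based silence gate (a crude stand-in for a real VAD)
# plus a post-filter that drops transcript segments matching known
# hallucinated credit phrases. All thresholds and phrases are illustrative.

import math

FRAME_SIZE = 1600  # 0.1 s of audio at 16 kHz

# Illustrative blocklist of frequently reported hallucinations.
KNOWN_HALLUCINATIONS = {
    "ترجمة نانسي قنقر",
    "thanks for watching",
    "don't forget to like and subscribe",
}

def rms(frame):
    """Root-mean-square energy of a frame of float samples in [-1, 1]."""
    return math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0

def trim_silence(samples, threshold=0.01):
    """Keep only frames whose RMS energy exceeds the threshold."""
    voiced = []
    for i in range(0, len(samples), FRAME_SIZE):
        frame = samples[i:i + FRAME_SIZE]
        if rms(frame) > threshold:
            voiced.extend(frame)
    return voiced

def strip_hallucinations(segments):
    """Drop transcript segments that exactly match a known credit phrase."""
    blocklist = {p.lower() for p in KNOWN_HALLUCINATIONS}
    return [s for s in segments if s.strip().lower() not in blocklist]
```

In practice the gate would run before transcription (so silent stretches never reach the model) and the phrase filter after it, as a safety net; exact-match filtering is deliberately conservative to avoid deleting legitimate speech.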
Copyright, Piracy, and Fair Use Debate
- Strong suspicion that training corpora include pirated or unofficial content (fansubs, torrent subtitles, paywalled books and media).
- Long debate over:
  - The distinction between training as potential “fair use” and illegally acquiring the material.
  - A perceived double standard: individuals fined for torrenting vs. AI companies scraping and pirating at massive scale.
  - Ongoing lawsuits and preliminary rulings in which training itself may be fair use, but obtaining pirated data is not.
Broader Takeaways about AI Limits
- Many see this as evidence that these systems are pattern matchers, not reasoners: they confidently hallucinate plausible text in edge cases like silence.
- Commenters stress that “garbage in, garbage out” applies: poor data cleaning surfaces directly in model behavior, sometimes amusingly, sometimes in legally risky ways.