Complete silence is always hallucinated as "ترجمة نانسي قنقر" ("translation by Nancy Qunqar") in Arabic

Observed Behavior Across Languages

  • Users report that Whisper, especially large-v3, frequently “hears” fixed phrases during silence:
    • Arabic: “translation by [person]”.
    • German: “Subtitling of [broadcaster] for [network], 2017”.
    • Czech, Italian, Romanian, Russian, Turkish, Chinese, English, Welsh, Norwegian, Danish, Dutch, French: variants of “subtitles by X”, “thanks for watching”, “don’t forget to like and subscribe”, broadcaster credits, or similar.
  • Similar artifacts show up in other products using Whisper or similar models (Telegram voice recognition, ChatGPT audio, video platforms’ auto-captions).

Suspected Training Data Sources

  • Widely shared belief that the model was trained heavily on subtitle tracks from:
    • Movies and TV (including fansubs and community subtitles).
    • YouTube-style content and other online videos.
  • Silent credit-roll segments often contain translator or channel credits instead of “[silence]”, so silence in training data is frequently paired with such strings.
  • Some commenters suggest specific subtitle sites and torrent-associated subtitles; others note there are also large “public” subtitle corpora.

Technical Cause: Overfitting vs Garbage Data

  • One camp calls this classic overfitting: the model learns spurious correlations (silence → credits) that hurt generalization.
  • Another camp says it’s primarily bad labeling / classification: silence is inconsistently labeled or not labeled at all, so the model has no clean “silence → nothing” pattern to learn.
  • Several note both can be true: dirty data causes the model to overfit to noise.
  • Broader point: the model has no way to say “I don’t know”; when nothing fits, it emits the most likely learned pattern instead.

Mitigations and Usage Patterns

  • Many practitioners say Whisper is usable only with strong preprocessing:
    • Voice Activity Detection (VAD) or silence trimming before feeding audio (a VAD sketch follows this list).
    • Some commercial and open-source pipelines (e.g., WhisperX, faster-whisper with VAD) significantly reduce hallucinations.
  • Suggestions include small classifier models to detect hallucinations, simple silence detection, and post-filters to strip known credit phrases (a post-filter sketch also follows below).
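
A minimal sketch of the VAD-based mitigation, assuming the faster-whisper package (which bundles a Silero VAD filter); the model size, audio path, and VAD parameters below are placeholder choices, not recommendations from the thread:

    # pip install faster-whisper
    from faster_whisper import WhisperModel

    # Placeholder model and audio file; adjust to your own setup.
    model = WhisperModel("large-v3", device="cpu", compute_type="int8")

    # vad_filter=True runs Silero VAD before decoding, so long silent stretches
    # are dropped instead of being handed to the decoder as "empty" windows.
    segments, info = model.transcribe(
        "interview.wav",
        vad_filter=True,
        vad_parameters={"min_silence_duration_ms": 500},
    )

    for seg in segments:
        print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")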
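
A similarly hedged sketch of the post-filtering idea: dropping transcript segments whose text matches known credit-style hallucinations. The blocklist is illustrative only, built from the example phrases in the thread, and the segment format is a plain dict rather than any particular library’s type:

    import re

    # Illustrative blocklist of credit-style hallucinations reported in the thread;
    # a real deployment would maintain a larger, per-language list.
    KNOWN_HALLUCINATIONS = [
        r"ترجمة نانسي قنقر",               # the Arabic "translation by ..." credit
        r"thanks for watching",
        r"don'?t forget to like and subscribe",
        r"subtitles? by",
        r"untertitel",                     # German subtitle credits
    ]
    PATTERN = re.compile("|".join(KNOWN_HALLUCINATIONS), re.IGNORECASE)

    def drop_credit_hallucinations(segments):
        """Yield only segments whose text does not match a known credit phrase."""
        for seg in segments:
            if PATTERN.search(seg["text"]):
                continue  # likely a silence-induced hallucination; drop it
            yield seg

    # Usage: filter segments (start, end, text) produced by any Whisper frontend.
    cleaned = list(drop_credit_hallucinations([
        {"start": 0.0, "end": 4.2, "text": "Welcome back to the show."},
        {"start": 4.2, "end": 9.0, "text": "ترجمة نانسي قنقر"},
    ]))
    print(cleaned)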

Copyright, Piracy, and Fair Use Debate

  • Strong suspicion that training corpora include pirated or unofficial content (fansubs, torrent subtitles, paywalled books and media).
  • Long debate over:
    • Distinction between training as potential “fair use” vs illegally acquiring the material.
    • Perceived double standard: individuals fined for torrenting vs AI companies scraping and pirating at massive scale.
    • Ongoing lawsuits and preliminary rulings where training itself may be fair use, but obtaining pirated data is not.

Broader Takeaways about AI Limits

  • Many see this as evidence that these systems are pattern matchers, not reasoners: they confidently hallucinate plausible text in edge cases like silence.
  • Commenters stress “garbage in, garbage out”: sloppy data cleaning surfaces directly in model behavior, sometimes amusingly, sometimes in legally risky ways.