Show HN: Gemini LLM corrects ASR YouTube transcripts

Use of LLMs to Correct YouTube ASR Transcripts

  • Many see this as a natural, high-value use case: fix “boneheaded” ASR errors and improve readability and domain vocabulary.
  • LLMs can use extra context (video title/description, possibly frames) to pick better words and correct technical terms and names.
  • Some report success using pipelines: ASR (e.g., Whisper/WhisperX) → LLM cleanup → separate LLM for summarization.
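
The pipeline idea above can be sketched as follows. This is a minimal illustration, not any commenter's actual implementation: `VideoContext`, `build_cleanup_prompt`, `run_pipeline`, and the prompt wording are all hypothetical, and the two LLM calls are passed in as plain functions so no particular vendor API is assumed.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical type: any function that sends a prompt to an LLM and returns text.
LLM = Callable[[str], str]


@dataclass
class VideoContext:
    """Extra context the cleanup model can use to pick better words."""
    title: str
    description: str


def build_cleanup_prompt(raw_transcript: str, ctx: VideoContext) -> str:
    """Assemble a correction prompt that includes the video metadata."""
    return (
        "Correct obvious speech-recognition errors in the transcript below.\n"
        "Keep the speaker's wording; do not summarize or paraphrase.\n"
        f"Video title: {ctx.title}\n"
        f"Video description: {ctx.description}\n"
        "Transcript:\n"
        f"{raw_transcript}"
    )


def run_pipeline(
    raw_transcript: str,
    ctx: VideoContext,
    cleanup_llm: LLM,
    summary_llm: LLM,
) -> Tuple[str, str]:
    """ASR output -> LLM cleanup -> separate LLM call for summarization."""
    cleaned = cleanup_llm(build_cleanup_prompt(raw_transcript, ctx))
    summary = summary_llm("Summarize this transcript:\n" + cleaned)
    return cleaned, summary
```

Keeping cleanup and summarization as separate calls means the cleaned transcript can be inspected or stored on its own before anything is summarized.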

Limitations and Risks of LLM Post‑Processing

  • LLMs tend to:
    • Normalize toward “average” language, potentially deleting outliers or unusual but correct phrases (e.g., odd activities, nonsense words that convey tone).
    • Reformat speech into polished “internet text,” reducing fidelity to how people actually talked.
    • Hallucinate, especially over long inputs or when given multiple modalities.
  • People with ASR experience argue that generic LLM cleanup often reduces overall transcript accuracy, even when it helps with rare words and readability.
  • Chunking (e.g., ~512 words) is reported to reduce hallucinations versus feeding very long transcripts.
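
A minimal sketch of the chunking approach, assuming a simple whitespace split and a fixed word budget. The function names and the `cleanup` callable are hypothetical; the 512-word default only mirrors the figure mentioned in the thread.

```python
from typing import Callable, List


def chunk_words(text: str, max_words: int = 512) -> List[str]:
    """Split text into consecutive chunks of at most max_words words.

    Splitting on whitespace discards the original line breaks; a real tool
    might prefer to cut at sentence or caption boundaries instead.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def clean_in_chunks(
    text: str,
    cleanup: Callable[[str], str],
    max_words: int = 512,
) -> str:
    """Run a cleanup function on each chunk and join the results in order."""
    return " ".join(cleanup(chunk) for chunk in chunk_words(text, max_words))
```

Each chunk is processed independently, so the model never sees the full transcript at once; the trade-off is that a correction in one chunk cannot draw on context from another.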

Accessibility, Law, and the Berkeley Lecture Archive

  • The discussion revisits Berkeley’s removal of course videos after an accessibility complaint about missing or poor captions.
  • Some argue modern ASR + LLMs now make captioning cheap enough that such archives could be captioned instead of removed.
  • Others stress the legal issue remains: regulations effectively force “caption well or remove,” leading institutions to pull content when costs or risk are high.
  • Commenters debate whether such rules protect disabled users or end up harming everyone by reducing the amount of available content.

Quality of YouTube/Google ASR vs Alternatives

  • Mixed views on YouTube captions:
    • Some say they’re now “mostly fine” for clear, standard English.
    • Others, citing Deaf/HoH use, accents, domain jargon, and non‑English languages such as Japanese, find them inaccurate, misleading, or useless.
  • One commenter claims Google’s ASR is among the weakest of the hyperscalers’ offerings and describes Azure (via Nuance) as significantly better; several self‑hosted or non‑cloud options (Whisper, Kaldi) are also mentioned.
  • Re‑transcribing with modern ASR (Whisper variants, commercial APIs) is seen as cheap and often more reliable than “fixing” a bad transcript with an LLM.
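
Claims like "more reliable" are usually checked against a hand-corrected reference. The thread does not name a metric; the sketch below assumes word error rate (WER), a common choice, computed as word-level edit distance divided by the reference length. The function name and the lowercase normalization are illustrative choices.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitute/insert/delete) over reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Scoring the original captions, the LLM-cleaned version, and a fresh re-transcription against the same reference gives a like-for-like comparison, although WER alone does not capture readability or punctuation quality.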

Gemini as Product and API

  • Several users report poor experiences with consumer Gemini: refusals, prudish/risk‑averse behavior, weaker quality than GPT‑4o/Claude.
  • Others note that the Gemini API and multimodal models are strong on long audio/video but remain prone to hallucinations in meeting summaries.
  • Cost: Gemini Flash‑8B is cited as extremely cheap per hour of transcript, making LLM cleanup attractive at scale.
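
The per-hour cost argument comes down to simple arithmetic, sketched below. Every parameter is an assumption supplied by the caller: the thread quotes no specific prices, so none are hard-coded here, and the function name is hypothetical.

```python
def cleanup_cost_per_hour(
    words_per_minute: float,
    tokens_per_word: float,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Estimate the cost of LLM cleanup for one hour of speech.

    Assumes the cleaned transcript is about as long as the raw one, so
    output tokens are taken to equal input tokens. Prices are per million
    tokens, in whatever currency the caller uses.
    """
    tokens = words_per_minute * 60 * tokens_per_word
    return tokens * (input_price_per_mtok + output_price_per_mtok) / 1_000_000
```

To use it, plug in a measured speaking rate, the tokenizer's observed tokens-per-word ratio, and the provider's current published prices; prompt overhead such as instructions and video metadata is not included in this estimate.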

Security and API Keys

  • Users are wary of pasting personal API keys into third‑party tools, even if calls are client‑side.
  • Suggested mitigations: create low‑budget or temporary keys, rotate/delete keys after use, or self‑host the open‑source tool.