Show HN: Gemini LLM corrects ASR YouTube transcripts
Use of LLMs to Correct YouTube ASR Transcripts
- Many see this as a natural, high-value use case: fix “boneheaded” ASR errors and improve readability and domain vocabulary.
- LLMs can use extra context (video title/description, possibly frames) to pick better words and correct technical terms and names.
- Some report success using pipelines: ASR (e.g., Whisper/WhisperX) → LLM cleanup → separate LLM for summarization.
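A minimal sketch of the staged pipeline described above, assuming the ASR step has already produced a raw transcript. The names `build_cleanup_prompt` and `run_pipeline` are hypothetical, the prompt wording is illustrative only, and the two LLM stages are passed in as plain callables so that no particular vendor API is implied.

```python
from typing import Callable, Dict

# Hypothetical type: any function that maps a prompt string to a model reply.
LLM = Callable[[str], str]


def build_cleanup_prompt(transcript: str, title: str, description: str) -> str:
    """Assemble a correction prompt that carries video metadata as context."""
    return (
        "Correct obvious speech-recognition errors in the transcript below.\n"
        "Keep the speaker's wording; do not summarize or paraphrase.\n"
        f"Video title: {title}\n"
        f"Video description: {description}\n"
        "Transcript:\n"
        f"{transcript}"
    )


def run_pipeline(
    raw_transcript: str,
    title: str,
    description: str,
    cleanup_llm: LLM,
    summary_llm: LLM,
) -> Dict[str, str]:
    """Run cleanup and summarization as two separate LLM stages."""
    # Stage 1: fix ASR errors, using title and description to disambiguate terms.
    cleaned = cleanup_llm(build_cleanup_prompt(raw_transcript, title, description))
    # Stage 2: summarize the cleaned text with a separate model call.
    summary = summary_llm("Summarize the following transcript:\n" + cleaned)
    return {"cleaned": cleaned, "summary": summary}
```

Keeping the two stages as separate callables means the cleaned transcript can be inspected or stored on its own before any summary is produced, and either stage can be swapped for a different model.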
Limitations and Risks of LLM Post‑Processing
- LLMs tend to:
  - Normalize toward “average” language, potentially deleting outliers or unusual but correct phrases (e.g., odd activities, nonsense words that convey tone).
  - Reformat speech into polished “internet text,” reducing fidelity to how people actually talked.
  - Hallucinate, especially over long inputs or when given multiple modalities.
- People with ASR experience argue generic LLM cleanup often reduces transcript accuracy overall, even if it helps on rare words and readability.
- Chunking (e.g., ~512 words) is reported to reduce hallucinations versus feeding very long transcripts.
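The chunking idea can be sketched as follows, together with one possible guard against the over-editing risk listed above. The 512-word default comes from the discussion. The `max_change` threshold of 0.2, the fallback to the raw chunk, and all function names are assumptions made for illustration, not something commenters described.

```python
import difflib
from typing import Callable, List


def chunk_words(text: str, max_words: int = 512) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def changed_fraction(original: str, corrected: str) -> float:
    """Rough word-level difference between two texts (0.0 means identical)."""
    matcher = difflib.SequenceMatcher(
        None, original.split(), corrected.split(), autojunk=False
    )
    return 1.0 - matcher.ratio()


def correct_in_chunks(
    text: str,
    correct: Callable[[str], str],  # hypothetical LLM cleanup call
    max_words: int = 512,
    max_change: float = 0.2,  # assumed threshold, tune per use case
) -> str:
    """Correct each chunk separately, keeping the raw text if too much changed."""
    out = []
    for chunk in chunk_words(text, max_words):
        fixed = correct(chunk)
        # A heavy rewrite suggests normalization or hallucination rather than
        # small ASR fixes, so fall back to the unmodified chunk in that case.
        out.append(fixed if changed_fraction(chunk, fixed) <= max_change else chunk)
    return " ".join(out)
```

Splitting on a fixed word count can cut sentences in half; splitting on ASR segment or sentence boundaries may work better in practice, at the cost of uneven chunk sizes.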
Accessibility, Law, and the Berkeley Lecture Archive
- The discussion revisits Berkeley’s removal of course videos after an accessibility complaint about missing or poor captions.
- Some argue modern ASR + LLMs now make captioning cheap enough that such archives could be captioned instead of removed.
- Others stress the legal issue remains: regulations effectively force “caption well or remove,” leading institutions to pull content when costs or risk are high.
- Commenters debate whether such rules protect disabled users or end up harming everyone by reducing available content.
Quality of YouTube/Google ASR vs Alternatives
- Mixed views on YouTube captions:
  - Some say they’re now “mostly fine” for clear, standard English.
  - Others (especially referencing Deaf/HoH use, accents, domain jargon, and non‑English like Japanese) find them inaccurate, misleading, or useless.
- One commenter claims Google’s ASR is among the weakest offered by the hyperscalers and describes Azure (via Nuance) as significantly better; several non‑cloud, self‑hosted options (Whisper, Kaldi) are also mentioned.
- Re‑transcribing with modern ASR (Whisper variants, commercial APIs) is seen as cheap and often more reliable than “fixing” a bad transcript with an LLM.
Gemini as Product and API
- Several users report poor experiences with consumer Gemini: refusals, prudish/risk‑averse behavior, weaker quality than GPT‑4o/Claude.
- Others note that the Gemini API and its multimodal models are strong for long audio/video, but still prone to hallucinations in meeting summaries.
- Cost: Gemini Flash‑8B is cited as extremely cheap per hour of transcript, making LLM cleanup attractive at scale.
Security and API Keys
- Users are wary of pasting personal API keys into third‑party tools, even if calls are client‑side.
- Suggested mitigations: create low‑budget or temporary keys, rotate/delete keys after use, or self‑host the open‑source tool.