2024-11-25

Show HN: Gemini LLM corrects ASR YouTube transcripts

Use of LLMs to Correct YouTube ASR Transcripts

Many commenters see this as a natural, high‑value use case: fixing “boneheaded” ASR errors and improving readability and the handling of domain vocabulary.
LLMs can use extra context (video title/description, possibly frames) to pick better words and correct technical terms and names.
Some report success using pipelines: ASR (e.g., Whisper/WhisperX) → LLM cleanup → separate LLM for summarization.
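The pipeline shape described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: `build_cleanup_prompt`, `run_pipeline`, and the three callables (`transcribe`, `cleanup_llm`, `summary_llm`) are hypothetical names, and the prompt wording is an assumption. Any ASR engine or LLM client could be plugged in behind the callables.

```python
from typing import Callable, Dict


def build_cleanup_prompt(transcript: str, title: str = "", description: str = "") -> str:
    """Assemble a cleanup prompt, adding video context when it is available."""
    parts = [
        "Correct obvious speech-recognition errors in the transcript below.",
        "Keep the speaker's wording; do not paraphrase, summarize, or drop content.",
    ]
    if title:
        parts.append(f"Video title: {title}")
    if description:
        parts.append(f"Video description: {description}")
    parts.append("Transcript:")
    parts.append(transcript)
    return "\n".join(parts)


def run_pipeline(
    audio_path: str,
    transcribe: Callable[[str], str],   # hypothetical: wraps an ASR engine
    cleanup_llm: Callable[[str], str],  # hypothetical: wraps the cleanup model
    summary_llm: Callable[[str], str],  # hypothetical: wraps a separate summarizer
    title: str = "",
    description: str = "",
) -> Dict[str, str]:
    """Run ASR, then LLM cleanup, then summarization as separate stages."""
    raw = transcribe(audio_path)
    cleaned = cleanup_llm(build_cleanup_prompt(raw, title, description))
    summary = summary_llm("Summarize this transcript:\n" + cleaned)
    # Keep the raw ASR output so the cleaned text can be audited against it.
    return {"raw": raw, "cleaned": cleaned, "summary": summary}
```

Keeping the raw transcript next to the cleaned one makes it easier to spot cases where the cleanup step changed meaning rather than fixing an error.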

Limitations and Risks of LLM Post‑Processing

LLMs tend to:
- Normalize toward “average” language, potentially deleting outliers or unusual but correct phrases (e.g., odd activities, nonsense words that convey tone).
- Reformat speech into polished “internet text,” reducing fidelity to how people actually talked.
- Hallucinate, especially over long inputs or when given multiple modalities.
Commenters with ASR experience argue that generic LLM cleanup often reduces overall transcript accuracy, even if it helps with rare words and readability.
Chunking (e.g., ~512 words) is reported to reduce hallucinations versus feeding very long transcripts.
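A minimal sketch of the chunking idea, assuming plain whitespace-separated words and a fixed chunk size. The function name `chunk_words` and the default of 512 are illustrative; the thread does not specify how chunk boundaries were chosen.

```python
from typing import List


def chunk_words(text: str, max_words: int = 512) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()  # splits on any whitespace; original spacing is not preserved
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


chunks = chunk_words("one two three four five", max_words=2)
# chunks == ["one two", "three four", "five"]
```

Each chunk would then be sent to the model separately and the results concatenated. A fixed word count can cut a sentence in half, so splitting on sentence or caption boundaries may work better in practice.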

Accessibility, Law, and the Berkeley Lecture Archive

The discussion revisits Berkeley's removal of course videos after an accessibility complaint about missing or poor captions.
Some argue modern ASR + LLMs now make captioning cheap enough that such archives could be captioned instead of removed.
Others stress the legal issue remains: regulations effectively force “caption well or remove,” leading institutions to pull content when costs or risk are high.
Commenters debate whether such rules protect disabled users or end up harming everyone by reducing the amount of available content.

Quality of YouTube/Google ASR vs Alternatives

Mixed views on YouTube captions:
- Some say they’re now “mostly fine” for clear, standard English.
- Others (especially referencing Deaf/HoH use, accents, domain jargon, and non‑English like Japanese) find them inaccurate, misleading, or useless.
One commenter claims Google’s ASR is among the weakest offered by the hyperscalers; Azure (via Nuance) is described as significantly better, and several non‑cloud, self‑hosted options (Whisper, Kaldi) are also mentioned.
Re‑transcribing with modern ASR (Whisper variants, commercial APIs) is seen as cheap and often more reliable than “fixing” a bad transcript with an LLM.
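One way to check claims like these on your own material is to score each candidate transcript against a hand-corrected reference using word error rate (WER). The sketch below is a plain word-level edit distance, assuming lowercase and whitespace tokenization as the only normalization. `word_error_rate` is an illustrative name, and real evaluations usually normalize punctuation and numbers as well.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Computing this for the original captions, the LLM-cleaned captions, and a fresh re-transcription of the same clip gives a rough like-for-like comparison. WER does not capture readability, so a cleaned transcript can read better while scoring worse.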

Gemini as Product and API

Several users report poor experiences with consumer Gemini: refusals, prudish/risk‑averse behavior, weaker quality than GPT‑4o/Claude.
Others note Gemini API and multimodal models are strong for long audio/video, but still prone to hallucinations in meeting summaries.
Cost: Gemini Flash‑8B is cited as extremely cheap per hour of transcript, making LLM cleanup attractive at scale.

Security and API Keys

Users are wary of pasting personal API keys into third‑party tools, even if calls are client‑side.
Suggested mitigations: create low‑budget or temporary keys, rotate/delete keys after use, or self‑host the open‑source tool.
