FFmpeg 8.0 adds Whisper support
Whisper in FFmpeg: Capabilities and Interface
- FFmpeg 8 adds a `whisper` audio filter (via whisper.cpp) that can output plain text, SRT subtitles, or JSON to files or AVIO destinations; the text is also exposed as frame metadata.
- It doesn't embed subtitles into video by itself, but it simplifies generating sidecar SRT/VTT files directly from arbitrary audio/video, without pre-extracting or re-encoding audio.
- Voice Activity Detection is already supported, and the filter has a `queue` option to trade off latency against accuracy.
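A minimal invocation might look like the following sketch. The option names (`model`, `language`, `format`, `destination`) follow the FFmpeg 8 `whisper` filter documentation, but the file paths are placeholders: any whisper.cpp GGML model you have downloaded will do, and `ffmpeg -h filter=whisper` shows the authoritative option list for your build.

```shell
# Transcribe the audio track of input.mp4 straight to a sidecar SRT file.
# The decoded media itself is discarded (-f null -); only the transcript
# written by the filter's `destination` option survives.
# ggml-base.en.bin is a placeholder path to a whisper.cpp GGML model.
ffmpeg -i input.mp4 \
  -af "whisper=model=ggml-base.en.bin:language=en:format=srt:destination=output.srt" \
  -f null -
```

Because the filter runs on the decoded audio stream, the same line works unchanged on audio-only inputs (podcasts, recordings) as well as video files.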
Performance, Real-Time Use, and Chunking
- Users report acceptable real-time performance with small/tiny models on modern CPUs; GPUs help, but are not strictly required.
- The FFmpeg filter defaults to ~3s chunks; longer chunks (10–20s) improve accuracy and reduce CPU use but increase latency.
- Several commenters discuss overlapping-chunk strategies for live transcription and note that Whisper’s 30s context and non-streaming architecture complicate low-latency, high-accuracy streaming.
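The chunking trade-off described above is exposed through the filter's `queue` option, which controls how many seconds of audio are buffered before each whisper.cpp inference call. A hedged sketch (file names hypothetical):

```shell
# A larger queue gives the model more context per call: better accuracy and
# fewer, cheaper inference calls, at the cost of latency -- no text is emitted
# until roughly 20 seconds of audio have accumulated. The default (~3s) suits
# live use; 10-20s suits offline transcription.
ffmpeg -i lecture.mp4 \
  -af "whisper=model=ggml-small.bin:queue=20:format=srt:destination=lecture.srt" \
  -f null -
```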
Subtitles, Translation, and UX Debates
- People are excited about automatic subtitles/translation in players (VLC, mpv, OBS, etc.), though models must still be shipped or configured separately.
- There is extended debate over what “good subtitles” are:
  - One camp wants verbatim, word-for-word captions matching the audio.
  - Another argues film/TV subtitles must be edited for readability, timing, and space, and sometimes soften profanity.
- Burned-in “engagement” subtitles on social media are widely disliked (non-toggleable, stylistically loud, single language), though some note platforms lack proper captioning, forcing this approach.
Accuracy, Hallucinations, and Multilingual Behavior
- Hallucinations on silence or music (e.g., repeated “Thanks for watching”) are a known issue; VAD and vocal-isolation preprocessing help but don’t eliminate it.
- Mixed-language audio (e.g., Dutch/English code-switching) can cause Whisper to translate segments instead of transcribing them; some suggest using transcription-only or “turbo” models.
- Experiences vary: some find Whisper excellent for many languages; others report failures or invented content, especially for translation and multilingual material.
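One mitigation raised above, VAD gating, is built into the filter itself via a separate Silero VAD model passed through the `vad_model` option. In this sketch the model filename is an assumption, standing in for the GGML-converted Silero model that whisper.cpp's download scripts provide:

```shell
# Gate the audio through voice-activity detection before transcription, so
# silence and music are never fed to Whisper -- the situations where
# "Thanks for watching"-style hallucinations typically arise.
# Both .bin paths are placeholders for models downloaded via whisper.cpp.
ffmpeg -i podcast.mp3 \
  -af "whisper=model=ggml-base.en.bin:vad_model=ggml-silero-v5.1.2.bin:format=srt:destination=podcast.srt" \
  -f null -
```

As the thread notes, this reduces but does not eliminate hallucinations; heavily mixed speech-over-music content may still need vocal-isolation preprocessing.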
Integration, Dependencies, and “Bloat” Concerns
- The filter is a wrapper over whisper.cpp; users must separately build whisper.cpp and download models (hundreds of MB to several GB). Some fear this will frustrate novices.
- Others say this is consistent with existing FFmpeg filters that rely on external ML libs and models and see tight FFmpeg integration as a net win for tooling and downstream apps.
- A minority view calls this feature creep that breaks the “small tools” Unix philosophy; others counter that FFmpeg already includes various ML-based filters.
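For reference, the dependency chain looks roughly like this. This is a sketch: the build and model-download steps come from whisper.cpp's README, and the exact commands may differ by platform or version; the FFmpeg side is the opt-in `--enable-whisper` configure flag.

```shell
# 1. Build and install whisper.cpp, which provides the library FFmpeg links.
git clone https://github.com/ggml-org/whisper.cpp
cmake -S whisper.cpp -B whisper.cpp/build
cmake --build whisper.cpp/build
sudo cmake --install whisper.cpp/build

# 2. Download a GGML model separately (~150 MB for base.en; larger models
#    run into the GB range).
./whisper.cpp/models/download-ggml-model.sh base.en

# 3. Configure FFmpeg with the filter enabled; it is not built by default.
./configure --enable-whisper
make
```

This is the "consistent with existing practice" argument in concrete form: like FFmpeg's other ML-backed filters, the heavy dependency and the model weights live outside the FFmpeg tree.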
Accessibility and New Workflows
- Hard-of-hearing users describe Whisper-based tools (Subtitle Edit, custom pipelines, browser extensions) as transformative: any video, lecture, or podcast can be transcribed, searched, summarized, and translated.
- Examples include live police scanner transcripts, podcast archives, GNOME speech-to-text extensions, and voice-driven personal assistants wired through LLMs.
Site Access and Infrastructure Issues
- Many commenters struggle with FFmpeg’s Anubis bot filter (slow or broken challenges on older browsers/GrapheneOS); others report it passing instantly.
- Some argue proper configuration (e.g., meta-refresh challenges) would preserve protection while remaining usable; others defend strict bot filters as necessary to keep the Git UI responsive.