FFmpeg 8.0 adds Whisper support
Whisper in FFmpeg: Capabilities and Interface
- FFmpeg 8 adds a `whisper` audio filter (via whisper.cpp) that can output plain text, SRT subtitles, or JSON to files or AVIO destinations; the text is also exposed as frame metadata.
- It doesn't embed subtitles into video by itself, but it simplifies generating sidecar SRT/VTT files directly from arbitrary audio/video, without pre-extracting or re-encoding audio.
- Voice Activity Detection is already supported, and the filter has a `queue` option to trade off latency against accuracy.
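A minimal invocation might look like the following sketch. The option names (`model`, `language`, `format`, `destination`) follow the FFmpeg 8 `whisper` filter documentation, but the file paths are placeholders: any whisper.cpp GGML model you have downloaded will do, and `ffmpeg -h filter=whisper` shows the authoritative option list for your build.

```shell
# Transcribe the audio track of input.mp4 straight to a sidecar SRT file.
# The decoded media itself is discarded (-f null -); only the transcript
# written by the filter's `destination` option survives.
# ggml-base.en.bin is a placeholder path to a whisper.cpp GGML model.
ffmpeg -i input.mp4 \
  -af "whisper=model=ggml-base.en.bin:language=en:format=srt:destination=output.srt" \
  -f null -
```

Because the filter runs on the decoded audio stream, the same line works unchanged on audio-only inputs (podcasts, recordings) as well as video files.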
Performance, Real-Time Use, and Chunking
- Users report acceptable real-time performance with small/tiny models on modern CPUs; GPUs help, but are not strictly required.
- The FFmpeg filter defaults to ~3s chunks; longer chunks (10–20s) improve accuracy and reduce CPU use but increase latency.
- Several commenters discuss overlapping-chunk strategies for live transcription and note that Whisper’s 30s context and non-streaming architecture complicate low-latency, high-accuracy streaming.
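The chunking trade-off described above is exposed through the filter's `queue` option, which controls how many seconds of audio are buffered before each whisper.cpp inference call. A hedged sketch (file names hypothetical):

```shell
# A larger queue gives the model more context per call: better accuracy and
# fewer, cheaper inference calls, at the cost of latency -- no text is emitted
# until roughly 20 seconds of audio have accumulated. The default (~3s) suits
# live use; 10-20s suits offline transcription.
ffmpeg -i lecture.mp4 \
  -af "whisper=model=ggml-small.bin:queue=20:format=srt:destination=lecture.srt" \
  -f null -
```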
Subtitles, Translation, and UX Debates
- People are excited about automatic subtitles/translation in players (VLC, mpv, OBS, etc.), though models must still be shipped or configured separately.
- There is extended debate over what “good subtitles” are:
  - One camp wants verbatim, word-for-word captions matching the audio.
  - Another argues film/TV subtitles must be edited for readability, timing, and space, and sometimes soften profanity.
- Burned-in “engagement” subtitles on social media are widely disliked (non-toggleable, stylistically loud, single language), though some note platforms lack proper captioning, forcing this approach.
Accuracy, Hallucinations, and Multilingual Behavior
- Hallucinations on silence or music (e.g., repeated “Thanks for watching”) are a known issue; VAD and vocal-isolation preprocessing help but don’t eliminate it.
- Mixed-language audio (e.g., Dutch/English code-switching) can cause Whisper to translate segments instead of transcribing them; some suggest using transcription-only or “turbo” models.
- Experiences vary: some find Whisper excellent for many languages; others report failures or invented content, especially for translation and multilingual material.
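One mitigation raised above, VAD gating, is built into the filter itself via a separate Silero VAD model passed through the `vad_model` option. In this sketch the model filename is an assumption, standing in for the GGML-converted Silero model that whisper.cpp's download scripts provide:

```shell
# Gate the audio through voice-activity detection before transcription, so
# silence and music are never fed to Whisper -- the situations where
# "Thanks for watching"-style hallucinations typically arise.
# Both .bin paths are placeholders for models downloaded via whisper.cpp.
ffmpeg -i podcast.mp3 \
  -af "whisper=model=ggml-base.en.bin:vad_model=ggml-silero-v5.1.2.bin:format=srt:destination=podcast.srt" \
  -f null -
```

As the thread notes, this reduces but does not eliminate hallucinations; heavily mixed speech-over-music content may still need vocal-isolation preprocessing.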
Integration, Dependencies, and “Bloat” Concerns
- The filter is a wrapper over whisper.cpp; users must separately build whisper.cpp and download models (hundreds of MB to several GB). Some fear this will frustrate novices.
- Others say this is consistent with existing FFmpeg filters that rely on external ML libs and models and see tight FFmpeg integration as a net win for tooling and downstream apps.
- A minority view calls this feature creep that breaks the “small tools” Unix philosophy; others counter that FFmpeg already includes various ML-based filters.
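For reference, the dependency chain looks roughly like this. This is a sketch: the build and model-download steps come from whisper.cpp's README, and the exact commands may differ by platform or version; the FFmpeg side is the opt-in `--enable-whisper` configure flag.

```shell
# 1. Build and install whisper.cpp, which provides the library FFmpeg links.
git clone https://github.com/ggml-org/whisper.cpp
cmake -S whisper.cpp -B whisper.cpp/build
cmake --build whisper.cpp/build
sudo cmake --install whisper.cpp/build

# 2. Download a GGML model separately (~150 MB for base.en; larger models
#    run into the GB range).
./whisper.cpp/models/download-ggml-model.sh base.en

# 3. Configure FFmpeg with the filter enabled; it is not built by default.
./configure --enable-whisper
make
```

This is the "consistent with existing practice" argument in concrete form: like FFmpeg's other ML-backed filters, the heavy dependency and the model weights live outside the FFmpeg tree.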
Accessibility and New Workflows
- Hard-of-hearing users describe Whisper-based tools (Subtitle Edit, custom pipelines, browser extensions) as transformative: any video, lecture, or podcast can be transcribed, searched, summarized, and translated.
- Examples include live police scanner transcripts, podcast archives, GNOME speech-to-text extensions, and voice-driven personal assistants wired through LLMs.
Site Access and Infrastructure Issues
- Many commenters struggle with FFmpeg’s Anubis bot filter (slow or broken challenges on older browsers/GrapheneOS); others report it passing instantly.
- Some argue proper configuration (e.g., meta-refresh challenges) would preserve protection while remaining usable; others defend strict bot filters as necessary to keep the Git UI responsive.