FFmpeg devs boast of another 100x leap thanks to handwritten assembly code
Clarifying the “100x” Claim
- Commenters note the article inconsistently says “100x” and “100%” speed boost; screenshots and mailing-list posts show ~100.73× for a single function, not 100%.
- The 100× figure applies to rangedetect8_avx512 alone, not to FFmpeg overall. The whole filter may see closer to ~2×, and FFmpeg as a whole much less.
- The baseline C code was compiled with -march=generic -fno-tree-vectorize, making the comparison very favorable to hand-tuned AVX-512. With vectorization enabled, independent benchmarks show more like 2.65× over optimized C, not 100×.
- Several people criticize the headline and marketing as misleading, even if the technical work is good.
Scope and Real-World Impact
- The optimized function belongs to an “obscure filter” that detects color range (full vs limited) and related properties; it is not a general encoder/decoder speedup.
- The filter is new, not yet committed, and only runs when explicitly requested by users who know they need that analysis.
- For typical conversions—even large-scale pipelines—this is unlikely to change overall throughput in any noticeable way.
SIMD, AVX2/AVX-512, and Architecture Limits
- The gains are primarily from SIMD vectorization (AVX2/AVX‑512) on 8‑bit data, not from “assembly magic” per se.
- AVX‑512’s width and single‑instruction min/max on many bytes make the 100× microbenchmark speedup plausible on tiny hot-cache data.
- Commenters note AVX-512 support is fragmented across x86 CPUs, and you can’t always rely on specific AVX‑512 subsets being present.
Auto-Vectorization vs Hand-Written SIMD/Assembly
- Some argue modern compilers (GCC/Clang, MSVC) auto-vectorize simple loops very well and often schedule instructions better than humans.
- Others report that auto-vectorization is brittle, varies across compilers/architectures, and cannot handle more complex kernels, data layouts (AoS vs SoA), or gather/scatter patterns.
- ISPC is discussed: it can force vectorization but suffers from hardware gather/scatter inefficiencies, language limitations around access patterns, and precision and calling-convention quirks.
- Consensus: for hot, complex kernels and non-trivial data structures, manual SIMD (intrinsics or assembly) is still routinely needed.
Benchmarking Skepticism and Macro vs Micro
- Many emphasize that microbenchmarks (small buffers, hot caches, isolated functions) exaggerate speedups compared to real-world workloads with cache pressure and many interacting components.
- Some suspect past FFmpeg “×90”-type claims were vs unoptimized (-O0) C; others stress that in any case these are tiny, rarely used code paths.
- Several call for macrobenchmarks over realistic videos and filter pipelines; others describe statistical methods (blocking designs) that allow comparing versions without dedicated hardware.