FFmpeg devs boast of another 100x leap thanks to handwritten assembly code
Clarifying the “100x” Claim
- Commenters note the article inconsistently says “100x” and “100%” speed boost; screenshots and mailing-list posts show ~100.73× for a single function, not 100%.
- The 100× figure applies to rangedetect8_avx512 alone, not to FFmpeg overall. The whole filter may see closer to ~2×, and FFmpeg as a whole much less.
- The baseline C code was compiled with -march=generic -fno-tree-vectorize, making the comparison very favorable to hand-tuned AVX-512. With vectorization enabled, independent benchmarks show more like 2.65× over optimized C, not 100×.
- Several people criticize the headline and marketing as misleading, even if the technical work is good.
Scope and Real-World Impact
- The optimized function belongs to an “obscure filter” that detects color range (full vs limited) and related properties; it is not a general encoder/decoder speedup.
- The filter is new, not yet committed, and only runs when explicitly requested by users who know they need that analysis.
- For typical conversions—even large-scale pipelines—this is unlikely to change overall throughput in any noticeable way.
SIMD, AVX2/AVX-512, and Architecture Limits
- The gains are primarily from SIMD vectorization (AVX2/AVX‑512) on 8‑bit data, not from “assembly magic” per se.
- AVX‑512’s width and single‑instruction min/max on many bytes make the 100× microbenchmark speedup plausible on tiny hot-cache data.
- Commenters note AVX-512 support is fragmented across x86 CPUs, and you can’t always rely on specific AVX‑512 subsets being present.
Auto-Vectorization vs Hand-Written SIMD/Assembly
- Some argue modern compilers (GCC/Clang, MSVC) auto-vectorize simple loops very well and often schedule instructions better than humans.
- Others report that auto-vectorization is brittle, varies across compilers/architectures, and cannot handle more complex kernels, data layouts (AoS vs SoA), or gather/scatter patterns.
- ISPC is discussed: it can force vectorization but suffers from hardware gather/scatter inefficiencies, language limitations around access patterns, and precision and calling-convention quirks.
- Consensus: for hot, complex kernels and non-trivial data structures, manual SIMD (intrinsics or assembly) is still routinely needed.
Benchmarking Skepticism and Macro vs Micro
- Many emphasize that microbenchmarks (small buffers, hot caches, isolated functions) exaggerate speedups compared to real-world workloads with cache pressure and many interacting components.
- Some suspect past FFmpeg “×90”-type claims were vs unoptimized (-O0) C; others stress that in any case these are tiny, rarely used code paths.
- Several call for macrobenchmarks over realistic videos and filter pipelines; others describe statistical methods (blocking designs) that allow comparing versions without dedicated hardware.