FFmpeg devs boast of another 100x leap thanks to handwritten assembly code

Clarifying the “100x” Claim

  • Commenters note the article inconsistently uses “100x” and “100%” for the speed boost; screenshots and mailing-list posts show ~100.73× for a single function, i.e. 100×, not a 100% (2×) improvement.
  • That 100× applies to rangedetect8_avx512, not to FFmpeg overall. The whole filter may see closer to ~2×, and FFmpeg as a whole much less.
  • Baseline C code was compiled with -march=generic -fno-tree-vectorize, making the comparison very favorable to hand-tuned AVX-512. With vectorization enabled, independent benchmarks show more like 2.65× vs optimized C, not 100×.
  • Several people criticize the headline/marketing as misleading, even if the technical work is good.
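To make the baseline dispute concrete, here is a minimal sketch (illustrative, not FFmpeg’s actual code) of the kind of per-byte min/max scan such a range-detect function performs. Built with `-O2`/`-O3`, GCC and Clang auto-vectorize this loop; building with `-fno-tree-vectorize` forces the scalar version that the 100× figure was measured against.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical per-byte min/max scan. With vectorization enabled,
 * compilers turn this into SIMD min/max over wide registers; with
 * -fno-tree-vectorize it runs one byte per iteration. */
void minmax_u8(const uint8_t *src, size_t n,
               uint8_t *out_min, uint8_t *out_max)
{
    uint8_t lo = 255, hi = 0;
    for (size_t i = 0; i < n; i++) {
        uint8_t v = src[i];
        if (v < lo) lo = v;
        if (v > hi) hi = v;
    }
    *out_min = lo;
    *out_max = hi;
}
```

Comparing hand-written AVX-512 against this loop compiled both ways is what separates the ~100× headline from the ~2.65× vs-optimized-C figure.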

Scope and Real-World Impact

  • The optimized function belongs to an “obscure filter” that detects color range (full vs limited) and related properties; it is not a general encoder/decoder speedup.
  • The filter is new, not yet committed, and only runs when explicitly requested by users who know they need that analysis.
  • For typical conversions—even large-scale pipelines—this is unlikely to change overall throughput in any noticeable way.
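For readers unfamiliar with what the filter analyzes, a rough sketch of the full-vs-limited classification follows. The names and the exact rule are illustrative assumptions, not FFmpeg’s API: 8-bit “limited” (TV) range confines luma to 16–235, so any sample outside that band proves full (PC) range.

```c
#include <stdint.h>

typedef enum { RANGE_LIMITED, RANGE_FULL } pix_range; /* illustrative names */

/* Classify from an observed min/max. A sample outside 16..235 proves
 * full range; staying inside is merely *consistent* with limited range
 * (full-range content can happen to stay within those bounds too). */
pix_range classify_range(uint8_t min_v, uint8_t max_v)
{
    if (min_v < 16 || max_v > 235)
        return RANGE_FULL;
    return RANGE_LIMITED;
}
```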

SIMD, AVX2/AVX-512, and Architecture Limits

  • The gains are primarily from SIMD vectorization (AVX2/AVX‑512) on 8‑bit data, not from “assembly magic” per se.
  • AVX‑512’s width and single‑instruction min/max on many bytes make the 100× microbenchmark speedup plausible on tiny hot-cache data.
  • Commenters note AVX-512 support is fragmented across x86 CPUs, and you can’t always rely on specific AVX‑512 subsets being present.
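To see why the width matters, here is a portable scalar sketch of what a single AVX-512BW instruction pair (`_mm512_min_epu8` / `_mm512_max_epu8`) accomplishes: a min and a max across 64 byte lanes. The loop below updates the lanes one at a time; AVX-512 hardware performs all 64 lane updates in one instruction each, which is where most of the microbenchmark gain comes from.

```c
#include <stdint.h>

#define VLEN 64  /* one ZMM register holds 64 bytes */

/* Scalar emulation of one vertical SIMD min/max step over 64 lanes. */
void vec_minmax_step(uint8_t vmin[VLEN], uint8_t vmax[VLEN],
                     const uint8_t src[VLEN])
{
    for (int i = 0; i < VLEN; i++) {
        if (src[i] < vmin[i]) vmin[i] = src[i];
        if (src[i] > vmax[i]) vmax[i] = src[i];
    }
}
```

Because of the fragmentation the commenters mention, real code must still check at runtime that the AVX-512BW subset is present before dispatching to such a kernel.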

Auto-Vectorization vs Hand-Written SIMD/Assembly

  • Some argue modern compilers (GCC, Clang, MSVC) auto-vectorize simple loops very well and often schedule instructions better than humans do.
  • Others report that auto-vectorization is brittle, varies across compilers/architectures, and cannot handle more complex kernels, data layouts (AoS vs SoA), or gather/scatter patterns.
  • ISPC is discussed: it can force vectorization but suffers from hardware gather/scatter inefficiencies, language limitations around access patterns, and precision and calling-convention quirks.
  • Consensus: for hot, complex kernels and non-trivial data structures, manual SIMD (intrinsics or assembly) is still routinely needed.
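The AoS-vs-SoA point can be shown with a small sketch (function name illustrative): interleaved RGB (array of structs) forces a strided access pattern that simple auto-vectorizers handle poorly without shuffle or gather instructions, whereas the planar (struct of arrays) outputs give the contiguous, unit-stride loops compilers vectorize reliably.

```c
#include <stdint.h>
#include <stddef.h>

/* De-interleave packed RGB into planar channels. The 3*i strided reads
 * are the awkward part for auto-vectorization; the planar writes are
 * exactly the layout compilers vectorize well in later passes. */
void deinterleave_rgb(const uint8_t *aos, size_t npix,
                      uint8_t *r, uint8_t *g, uint8_t *b)
{
    for (size_t i = 0; i < npix; i++) {
        r[i] = aos[3 * i + 0];
        g[i] = aos[3 * i + 1];
        b[i] = aos[3 * i + 2];
    }
}
```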

Benchmarking Skepticism and Macro vs Micro

  • Many emphasize that microbenchmarks (small buffers, hot caches, isolated functions) exaggerate speedups compared to real-world workloads with cache pressure and many interacting components.
  • Some suspect past FFmpeg “90×”-type claims were measured against unoptimized (-O0) C; others stress that, in any case, these are tiny, rarely used code paths.
  • Several call for macrobenchmarks over realistic videos and filter pipelines; others describe statistical methods (blocking designs) that allow comparing versions without dedicated hardware.
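One way to read the “blocking designs” suggestion: interleave measurements of the two versions so that slow drift (thermal throttling, background load) hits both equally within each block, then analyze the per-block differences rather than two independent averages. A minimal sketch of that paired comparison:

```c
#include <stddef.h>

/* Mean of per-block timing differences for implementations A and B,
 * where a_times[i] and b_times[i] were measured back-to-back in the
 * same block, so shared noise largely cancels in the subtraction. */
double mean_block_diff(const double *a_times, const double *b_times,
                       size_t blocks)
{
    double sum = 0.0;
    for (size_t i = 0; i < blocks; i++)
        sum += a_times[i] - b_times[i];
    return sum / (double)blocks;
}
```

This is a generic paired-measurement sketch, not a specific commenter’s protocol; the point is that pairing within blocks makes version comparisons meaningful even on noisy, shared hardware.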