Improving performance of rav1d video decoder
Compiler behavior & u16 comparison optimization
- Discussion centers on an inefficient instruction sequence that LLVM generates, in some cases, when comparing pairs of 16-bit integers in both Rust and C.
- Rust-specific ideas: using a freeze intrinsic to avoid “poison” and enable better optimizations; concerns about struct alignment differences between Rust and C affecting codegen.
- Example C code shows Clang optimizing better when structs are passed by value vs by reference, while GCC emits more complex code in both cases.
- Store-forwarding failures are raised as a possible reason compilers avoid merging 16-bit loads into a single 32-bit load, with microarchitecture-dependent tradeoffs.
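To make the pattern under discussion concrete, here is a minimal Rust sketch of the two ways a pair of 16-bit integers can be compared. The names Pair, eq_fieldwise, and eq_packed are hypothetical and not taken from rav1d; whether a given compiler merges the two 16-bit loads of the field-wise version into one 32-bit load depends on the compiler version and target, so this shows the shape of the code rather than a guaranteed codegen outcome.

```rust
/// Hypothetical stand-in for a struct holding two 16-bit values.
/// repr(C) fixes the field order and layout, as in the C original.
#[derive(Clone, Copy, Debug)]
#[repr(C)]
pub struct Pair {
    pub a: u16,
    pub b: u16,
}

/// Field-by-field comparison: the natural way to write it, and the form
/// the compiler may lower to two separate 16-bit compares.
pub fn eq_fieldwise(l: &Pair, r: &Pair) -> bool {
    l.a == r.a && l.b == r.b
}

/// Packs both fields into one u32 and compares once. Both fields are
/// always read, so this is only equivalent when reading both is fine.
pub fn eq_packed(l: &Pair, r: &Pair) -> bool {
    let pack = |p: &Pair| (u32::from(p.a) << 16) | u32::from(p.b);
    pack(l) == pack(r)
}
```

Both functions return the same result for every input; the difference is only in how much freedom the compiler has to emit a single wide comparison.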
Zeroing buffers & initialization elision
- A major performance win came from avoiding unnecessary buffer zeroing; commenters link this to recent discussions about how hard it is for compilers to safely skip initialization.
- Compilers struggle to prove that no uninitialized element is ever read, especially with arrays, unknown sizes, or assembly-based initialization.
- Using assembly for initialization further hinders optimization because the compiler lacks visibility into what the assembly does.
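The zeroing point can be sketched in Rust as follows. This is an illustration under assumptions, not the actual rav1d change: N, fill_zeroed, and fill_uninit are hypothetical names, and the fill loop stands in for whatever really writes the buffer. In a case this small the optimizer may well remove the redundant zeroing on its own; the difficulty described above arises when the writes happen somewhere the compiler cannot see, such as an assembly routine.

```rust
use std::mem::MaybeUninit;

/// Hypothetical buffer size for the sketch.
pub const N: usize = 64;

/// Zeroes the buffer first, then overwrites every element.
/// The initial zeroing is wasted work if the compiler cannot elide it.
pub fn fill_zeroed(src: &[u8; N]) -> [u8; N] {
    let mut buf = [0u8; N];
    for (d, s) in buf.iter_mut().zip(src) {
        *d = s.wrapping_add(1);
    }
    buf
}

/// Skips the zeroing by starting from uninitialized storage.
pub fn fill_uninit(src: &[u8; N]) -> [u8; N] {
    let mut buf = [MaybeUninit::<u8>::uninit(); N];
    for (d, s) in buf.iter_mut().zip(src) {
        d.write(s.wrapping_add(1));
    }
    // SAFETY: the loop above writes all N elements, and
    // [MaybeUninit<u8>; N] has the same layout as [u8; N].
    unsafe { std::mem::transmute::<[MaybeUninit<u8>; N], [u8; N]>(buf) }
}
```

The second version shifts the burden of proof from the compiler to the programmer: the unsafe block is only sound because every element is written before the buffer is read.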
Profiling methodology & “obvious” wins
- Some are surprised that the first optimization was findable with straightforward profiling, but others stress that simple perf/differential profiling across C vs Rust implementations is powerful and underused.
- There’s praise for detailed, stepwise optimization writeups and references to similar series on speeding up large codebases.
AV1, performance, and ecosystem
- AV1 is viewed very positively: comparable to or better than HEVC in compression efficiency and royalty-free, but still catching up in universal hardware support.
- Hardware encode/decode status across GPUs is discussed, along with confusion between Mbit/s and Mbyte/s in bitrate claims.
- VP9 vs H.264/H.265 vs AV1 is debated: VP9 often beats H.264 at equal bitrate but uses more CPU; AV1 generally beats both but at higher computational cost.
- Live streaming and device compatibility drive many deployments to H.264 due to ubiquitous hardware decoders.
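On the Mbit/s vs Mbyte/s confusion mentioned above, the two units differ by a factor of 8, which is enough to make a bitrate claim look either implausible or unremarkable. A small sketch of the conversion, with hypothetical function names:

```rust
/// Converts megabits per second to megabytes per second (8 bits per byte).
pub fn mbit_to_mbyte_per_s(mbit_per_s: f64) -> f64 {
    mbit_per_s / 8.0
}

/// Converts megabytes per second to megabits per second.
pub fn mbyte_to_mbit_per_s(mbyte_per_s: f64) -> f64 {
    mbyte_per_s * 8.0
}
```

For example, a 40 Mbit/s stream carries 5 Mbyte/s of data, so reading one unit as the other misstates the bitrate by 8x.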
WUFFS vs Rust/C for codecs & memory models
- One view: in an ideal world, codecs would be written in a safe, specialized language like WUFFS; others counter that WUFFS’ no-heap model is ill-suited to AV1-class decoders with complex, dynamic state.
- Clarifications: decoders typically have bounded but nontrivial dynamic state due to GOP structures (I/P/B frames, multiple references, motion vectors, film grain).
- Hardware-oriented codec design imposes strict memory bounds; many implementations minimize heap allocations but rarely reach zero.
ffmpeg vs Rust ports & security vs performance
- A social subthread analyzes a critical ffmpeg Twitter thread about Rust ports being slower and overfunded compared to C originals.
- Some see the tone as toxic and off-putting; others defend the frustration as a reaction to language zealotry and underfunding of incumbent projects.
- Security tradeoffs: ffmpeg has a steady flow of CVEs; Rust-based decoders like rav1d seek better memory safety at some performance cost. There’s no drop-in ffmpeg alternative, so users must accept its tradeoffs.
Project scope & dav1d composition
- Commenters clarify that dav1d is predominantly hand-written assembly, and that the Rust work mainly touches the coordinating C layer, not the hot assembly kernels themselves.
- Some commenters initially misunderstand this and assume the Rust port is targeting the entire assembly-heavy core.