2025-05-22

Improving performance of rav1d video decoder

Compiler behavior & u16 comparison optimization

Discussion centers on inefficient code that LLVM sometimes generates, for both Rust and C, when comparing pairs of 16-bit integers.
Rust-specific ideas: using a freeze intrinsic to avoid “poison” and enable better optimizations; concerns about struct alignment differences between Rust and C affecting codegen.
Example C code shows Clang optimizing better when structs are passed by value vs by reference, while GCC emits more complex code in both cases.
Store-forwarding failures are raised as a possible reason compilers avoid merging 16-bit loads into a single 32-bit load, with microarchitecture-dependent tradeoffs.
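
To make the comparison pattern concrete, here is a minimal sketch of the two forms under discussion. The `Pair` struct and the function names are hypothetical illustrations, not code from rav1d or from the thread. Which form a compiler emits, and which is faster, depends on the compiler version and the target microarchitecture.

```rust
/// Hypothetical stand-in for a struct holding two 16-bit values.
#[derive(Clone, Copy)]
#[repr(C)]
pub struct Pair {
    pub x: u16,
    pub y: u16,
}

/// Field-by-field comparison: the form that may be lowered to two
/// separate 16-bit loads and compares.
pub fn eq_fieldwise(a: &Pair, b: &Pair) -> bool {
    a.x == b.x && a.y == b.y
}

/// Pack both fields into one u32 (x in the low half, y in the high half).
pub fn as_u32(p: &Pair) -> u32 {
    (p.x as u32) | ((p.y as u32) << 16)
}

/// Comparison on the packed value: logically a single 32-bit compare.
pub fn eq_packed(a: &Pair, b: &Pair) -> bool {
    as_u32(a) == as_u32(b)
}
```

Both functions return the same result for every input; the difference is only in the machine code a compiler might produce. The store-forwarding concern applies to the packed form: if the two halves were just written as separate 16-bit stores, a following 32-bit load may stall on some CPUs.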

Zeroing buffers & initialization elision

A major performance win came from avoiding unnecessary buffer zeroing; commenters link this to recent discussions about how hard it is for compilers to safely skip initialization.
Compilers struggle to prove that no uninitialized element is ever read, especially with arrays, unknown sizes, or assembly-based initialization.
Using assembly for initialization further hinders optimization because the compiler lacks visibility into what the assembly does.
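
The zeroing cost can be sketched as follows. This is an illustrative example only, not the rav1d change itself: the function names and the fill pattern are hypothetical. The first version zero-fills the buffer and then overwrites every element; the second writes each element exactly once, so there is no redundant initialization for the compiler to prove away.

```rust
/// Zero-fills the whole buffer first, then overwrites every element.
/// The initial zeroing is redundant work unless the compiler can prove
/// that every element is written before it is read.
pub fn fill_zeroed_first(n: usize) -> Vec<u16> {
    let mut buf = vec![0u16; n];
    for (i, slot) in buf.iter_mut().enumerate() {
        *slot = (i as u16).wrapping_mul(3);
    }
    buf
}

/// Reserves capacity without initializing it, then appends each element
/// once. No element is ever zeroed and then overwritten.
pub fn fill_without_zeroing(n: usize) -> Vec<u16> {
    let mut buf = Vec::with_capacity(n);
    buf.extend((0..n).map(|i| (i as u16).wrapping_mul(3)));
    buf
}
```

Both functions produce identical contents. In the second form the safety argument is carried by the API (elements only exist once they have been written) rather than by a proof the optimizer has to find, which is the kind of proof commenters say compilers often cannot complete.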

Profiling methodology & “obvious” wins

Some are surprised that the first optimization was findable with straightforward profiling, but others stress that simple perf/differential profiling across C vs Rust implementations is powerful and underused.
There’s praise for detailed, stepwise optimization writeups and references to similar series on speeding up large codebases.

AV1, performance, and ecosystem

AV1 is viewed very positively: comparable to or better than HEVC in compression efficiency and royalty-free, but still catching up in universal hardware support.
Hardware encode/decode status across GPUs is discussed, along with confusion between Mbit/s and Mbyte/s in bitrate claims.
VP9 vs H.264/H.265 vs AV1 is debated: VP9 often beats H.264 at equal bitrate but uses more CPU; AV1 generally beats both but at higher computational cost.
Live streaming and device compatibility drive many deployments to H.264 due to ubiquitous hardware decoders.

WUFFS vs Rust/C for codecs & memory models

One view: an ideal world would use a safe, specialized language like WUFFS for codecs; others counter that WUFFS’ no-heap model is ill-suited to AV1-class decoders with complex, dynamic state.
Clarifications: decoders typically have bounded but nontrivial dynamic state due to GOP structures (I/P/B frames, multiple references, motion vectors, film grain).
Hardware-oriented codec design imposes strict memory bounds; many implementations minimize heap allocations but rarely reach zero.

ffmpeg vs Rust ports & security vs performance

A social subthread analyzes a critical ffmpeg Twitter thread about Rust ports being slower and overfunded compared to C originals.
Some see the tone as toxic and off-putting; others defend the frustration as a reaction to language zealotry and underfunding of incumbent projects.
Security tradeoffs: ffmpeg has a steady flow of CVEs; Rust-based decoders like rav1d seek better memory safety at some performance cost. There’s no drop-in ffmpeg alternative, so users must accept its tradeoffs.

Project scope & dav1d composition

Commenters clarify that dav1d is predominantly hand-written assembly, and that the Rust work mainly touches the coordinating C layer, not the hot assembly kernels themselves.
Some commenters initially misunderstand this and assume the Rust port is targeting the entire assembly-heavy core.
