Perceptually lossless (talking head) video compression at 22kbit/s

Definition of “perceptually lossless”

  • Strong debate over the term:
    • Critics call it marketing for “lossy,” arguing that “lossless” should mean bit-identical and that adding qualifiers is misleading.
    • Others say it’s a long-established technical term, essentially equivalent to “transparent” compression where typical viewers can’t tell the difference.
  • Several note that perception is audience-dependent (human vs. animal viewers, people with vision impairments, how closely the image is inspected).
  • Generation loss is raised: even if one encode looks identical, repeated recompression will eventually show degradation.
  • Some argue all digital media is ultimately perceptually, not truly, lossless due to ADC/DAC limits and discretization.
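The discretization point can be made concrete with a tiny sketch (illustrative only, not from the discussion): an 8-bit ADC rounds every sample to one of 256 levels, so the digital copy is never bit-identical to the analog original, only perceptually close.

```python
import math

# Illustrative assumption: an 8-bit ADC mapping a [-1, 1] signal to 256 levels.
levels = 256
samples = [math.sin(2 * math.pi * 5 * t / 1000) for t in range(1000)]

def quantize(x):
    # Round onto the integer grid and map back -- the rounding is irreversible.
    return round((x + 1) / 2 * (levels - 1)) / (levels - 1) * 2 - 1

# Error is bounded by half a quantization step (~1/255 here), small but nonzero.
max_error = max(abs(x - quantize(x)) for x in samples)
print(f"max quantization error: {max_error:.5f}")
```

The error is tiny, which is exactly the argument: every digital capture is already "perceptually" rather than truly lossless relative to the analog source.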

Visual quality, artifacts, and uncanny valley

  • Many find the result impressive but clearly not “lossless”:
    • Noted artifacts: background objects (bike saddle, tire) moving with the head, off-target gaze direction, and temporal jitter.
  • Some say it still sits in the uncanny valley, especially for familiar faces.
  • Others argue that compared to traditional CGI, neural methods capture light and “essence” better, making fakes more easily mistaken for real.

Comparison with traditional codecs and graphics

  • Video codec experts argue:
    • 22 kbit/s isn’t extraordinary given the huge compute (e.g., RTX 4090) and narrow talking-head domain.
    • Traditional codecs already trade encode/decode complexity vs. bitrate; with similar compute, they could likely reach similar or better efficiency.
    • Pure learned codecs currently lose to hybrid (traditional + learned) approaches, though some learned systems already beat modern standards on certain metrics.
  • Disagreement over how “magical” this is; some see it as numerical methods/heuristics rather than a qualitative leap.
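A back-of-the-envelope comparison puts the headline number in context (the ~1 Mbit/s figure for a conventional H.264 videoconference stream is an illustrative assumption, not from the article):

```python
# Data volume of a one-hour call at a given bitrate.
def megabytes_per_hour(kbit_per_s):
    # kbit/s -> bits/s -> bytes over 3600 s -> MB (decimal)
    return kbit_per_s * 1000 * 3600 / 8 / 1e6

print(megabytes_per_hour(22))    # ~9.9 MB/hour at the claimed 22 kbit/s
print(megabytes_per_hour(1000))  # ~450 MB/hour at an assumed ~1 Mbit/s H.264 call
```

The ~45x gap is striking, but the codec experts' point stands: it is bought with orders of magnitude more decode compute and a domain restricted to talking heads.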

Practicality and use cases

  • The current need for a high-end GPU to run in real time makes the compression use case feel premature.
  • Proposed future/edge use cases:
    • Low-bandwidth or metered mobile networks; long video calls on very low data volume.
    • Situations with abundant compute but tight bandwidth (space, underwater) – though available bitrates there are debated.
    • Dial-in video from a single headshot when only a narrow uplink (or even voice channel) is available.
    • Many-to-one conferencing where each participant sends only expression/pose parameters.
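The many-to-one idea can be sized with a quick sketch. All the figures below are illustrative assumptions (e.g. an ARKit-style blendshape rig), not the parameterization used by the system under discussion:

```python
# Hedged estimate: each participant sends only expression/pose parameters.
n_params = 52        # assumed: blendshape-style facial coefficients
bits_per_param = 8   # assumed: coarse quantization per coefficient
fps = 25             # assumed: parameter update rate

bitrate_kbps = n_params * bits_per_param * fps / 1000
print(f"{bitrate_kbps:.1f} kbit/s per participant")  # 10.4 kbit/s
```

Even without entropy coding, an uplink of roughly 10 kbit/s per participant would suffice, which is why this use case keeps coming up despite the heavy decode-side compute.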

Analogies and side references

  • Multiple comparisons to MP3/AAC and “transparent” audio; ABX testing, encoder quality, and re-encoding artifacts mentioned.
  • References to older standards and media:
    • MPEG-4 face animation parameters as a conceptual predecessor.
    • Sci-fi depictions of ultra-compressed or locally reconstructed video feeds.
    • Game and cinematic CGI claimed to be sometimes indistinguishable, with others strongly disagreeing.