Perceptually lossless (talking head) video compression at 22kbit/s
Definition of “perceptually lossless”
- Strong debate over the term:
  - Critics call it marketing for “lossy,” arguing that “lossless” should mean bit-identical and that adding qualifiers is misleading.
  - Others say it’s a long-established technical term, essentially equivalent to “transparent” compression where typical viewers can’t tell the difference.
- Several note perception is audience-dependent (human vs animals, people with eye issues, closeness of inspection).
- Generation loss is raised: even if one encode looks identical, repeated recompression will eventually show degradation.
- Some argue that all digital media is at best perceptually lossless, never truly lossless, because of ADC/DAC limits and discretization.
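The generation-loss point can be illustrated with a toy Python sketch. This is not from the thread and is not a real codec: `recompress` is a hypothetical stand-in that models one lossy encode as mild smoothing followed by rounding to integers, and the test signal is made up. It only shows that error against the original can keep growing when a lossy step is applied repeatedly.

```python
def recompress(samples):
    """One toy lossy generation: mild smoothing, then rounding to integers.

    Assumption: a stand-in for a real encode/decode cycle, not an actual codec.
    """
    last = len(samples) - 1
    out = []
    for i, centre in enumerate(samples):
        left = samples[max(i - 1, 0)]      # repeat the edge sample at the borders
        right = samples[min(i + 1, last)]
        out.append(round(0.25 * left + 0.5 * centre + 0.25 * right))
    return out


def mean_abs_error(reference, candidate):
    """Average absolute difference between two equally long sample lists."""
    return sum(abs(r - c) for r, c in zip(reference, candidate)) / len(reference)


def generation_errors(original, generations):
    """Error against the original after each successive re-encode."""
    current = list(original)
    errors = []
    for _ in range(generations):
        current = recompress(current)
        errors.append(mean_abs_error(original, current))
    return errors


# Made-up test signal: a single sharp detail on a flat background.
original = [0, 0, 0, 100, 0, 0, 0]
errors = generation_errors(original, 5)
for generation, error in enumerate(errors, start=1):
    print(f"generation {generation}: mean absolute error {error:.2f}")
```

In this toy the second generation is already further from the original than the first. Real codecs degrade in different and more complex ways, so the numbers here carry no meaning beyond the trend.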
Visual quality, artifacts, and uncanny valley
- Many find the result impressive but clearly not “lossless”:
  - Noted artifacts: background objects (a bike saddle, a tire) that move with the head, an off gaze direction, and jitter.
- Some say it still sits in the uncanny valley, especially for familiar faces.
- Others argue that compared to traditional CGI, neural methods capture light and “essence” better, making fakes more easily mistaken for real.
Comparison with traditional codecs and graphics
- Video codec experts argue:
  - 22 kbit/s isn’t extraordinary given the heavy compute required (e.g., an RTX 4090) and the narrow talking-head domain.
  - Traditional codecs already trade encode/decode complexity against bitrate; with similar compute, they could likely reach similar or better efficiency.
  - Pure learned codecs currently lose to hybrid (traditional + learned) approaches, though some learned systems already beat modern standards on certain metrics.
- Disagreement over how “magical” this is; some see it as numerical methods/heuristics rather than a qualitative leap.
Practicality and use cases
- Real-time operation currently requires a high-end GPU, which makes the compression use case feel premature.
- Proposed future/edge use cases:
  - Low-bandwidth or metered mobile networks; long video calls on very low data volume.
  - Situations with abundant compute but tight bandwidth (space, underwater) – though available bitrates there are debated.
  - Dial-in video from a single headshot when only a narrow uplink (or even voice channel) is available.
  - Many-to-one conferencing where each participant sends only expression/pose parameters.
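To put the "very low data volume" point in numbers, here is a back-of-the-envelope Python sketch. Only the 22 kbit/s figure comes from the title; the 30 fps frame rate is an assumption for illustration, the units are decimal (1 kbit = 1000 bits, 1 MB = 10^6 bytes), and audio and protocol overhead are ignored. The function names are hypothetical.

```python
BITRATE_KBIT_S = 22  # video bitrate from the title


def data_volume_megabytes(bitrate_kbit_s, minutes):
    """Data volume of a stream in decimal megabytes (video payload only)."""
    bits = bitrate_kbit_s * 1000 * minutes * 60
    return bits / 8 / 1_000_000


def bits_per_frame(bitrate_kbit_s, fps):
    """Average bit budget per frame at a given frame rate."""
    return bitrate_kbit_s * 1000 / fps


print(f"{data_volume_megabytes(BITRATE_KBIT_S, 60):.1f} MB per hour")
# 30 fps is an assumed frame rate, not a figure from the source.
print(f"{bits_per_frame(BITRATE_KBIT_S, 30):.0f} bits per frame at 30 fps")
```

Under these assumptions an hour-long call needs about 9.9 MB for video, and each frame has a budget of roughly 733 bits, or about 92 bytes. A budget that small is consistent with sending a handful of expression/pose parameters per frame rather than pixels.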
Analogies and side references
- Multiple comparisons to MP3/AAC and “transparent” audio, with mentions of ABX testing, encoder quality, and re-encoding artifacts.
- References to older standards and media:
  - MPEG-4 face animation parameters as a conceptual predecessor.
  - Sci-fi depictions of ultra-compressed or locally reconstructed video feeds.
- Some claim that game and cinematic CGI is sometimes indistinguishable from real footage; others strongly disagree.
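The ABX testing mentioned in the audio comparisons gives one way to check a "transparent" claim. In each trial a listener or viewer says whether X matches A or B, and the score is compared against pure guessing. Below is a minimal Python sketch of that comparison, assuming independent trials with a 50% chance of a correct guess. The function name and the example scores are hypothetical, not results from the thread.

```python
from math import comb


def abx_p_value(correct, trials):
    """Probability of at least `correct` right answers in `trials` by pure guessing.

    One-sided binomial tail with p = 0.5 per trial (assumed independent trials).
    """
    tail = sum(comb(trials, k) for k in range(correct, trials + 1))
    return tail / 2 ** trials


# Hypothetical scores for illustration only.
print(f"12 of 16 correct: p = {abx_p_value(12, 16):.3f}")
print(f" 8 of 16 correct: p = {abx_p_value(8, 16):.3f}")
```

A score of 12 out of 16 gives p ≈ 0.038, which is unlikely under guessing and suggests the difference is audible or visible to that person. A score of 8 out of 16 gives p ≈ 0.598, which is what guessing looks like. This also shows why transparency is audience-dependent: the result describes one tester under one set of conditions.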