Perceptually lossless (talking head) video compression at 22kbit/s

Definition of “perceptually lossless”

  • Strong debate over the term:
    • Critics call it marketing for “lossy,” arguing that “lossless” should mean bit-identical and that adding qualifiers is misleading.
    • Others say it’s a long-established technical term, essentially equivalent to “transparent” compression where typical viewers can’t tell the difference.
  • Several note that perception is audience-dependent (human vs. animal viewers, people with vision impairments, how closely the image is inspected).
  • Generation loss is raised: even if one encode looks identical, repeated recompression will eventually show degradation.
  • Some argue all digital media is ultimately perceptually, not truly, lossless due to ADC/DAC limits and discretization.
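The discretization point can be made concrete with a tiny sketch (illustrative only, not from the discussion): an 8-bit ADC rounds every sample to one of 256 levels, so the digital copy is never bit-identical to the analog original, only perceptually close.

```python
import math

# Illustrative assumption: an 8-bit ADC mapping a [-1, 1] signal to 256 levels.
levels = 256
samples = [math.sin(2 * math.pi * 5 * t / 1000) for t in range(1000)]

def quantize(x):
    # Round onto the integer grid and map back -- the rounding is irreversible.
    return round((x + 1) / 2 * (levels - 1)) / (levels - 1) * 2 - 1

# Error is bounded by half a quantization step (~1/255 here), small but nonzero.
max_error = max(abs(x - quantize(x)) for x in samples)
print(f"max quantization error: {max_error:.5f}")
```

The error is tiny, which is exactly the argument: every digital capture is already "perceptually" rather than truly lossless relative to the analog source.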

Visual quality, artifacts, and uncanny valley

  • Many find the result impressive but clearly not “lossless”:
    • Noted artifacts: background objects (bike saddle, tire) moving with the head, off-target gaze direction, and temporal jitter.
  • Some say it still sits in the uncanny valley, especially for familiar faces.
  • Others argue that compared to traditional CGI, neural methods capture light and “essence” better, making fakes more easily mistaken for real.

Comparison with traditional codecs and graphics

  • Video codec experts argue:
    • 22 kbit/s isn’t extraordinary given the huge compute (e.g., RTX 4090) and narrow talking-head domain.
    • Traditional codecs already trade encode/decode complexity vs. bitrate; with similar compute, they could likely reach similar or better efficiency.
    • Pure learned codecs currently lose to hybrid (traditional + learned) approaches, though some learned systems already beat modern standards on certain metrics.
  • Disagreement over how “magical” this is; some see it as numerical methods/heuristics rather than a qualitative leap.
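A back-of-the-envelope comparison puts the headline number in context (the ~1 Mbit/s figure for a conventional H.264 videoconference stream is an illustrative assumption, not from the article):

```python
# Data volume of a one-hour call at a given bitrate.
def megabytes_per_hour(kbit_per_s):
    # kbit/s -> bits/s -> bytes over 3600 s -> MB (decimal)
    return kbit_per_s * 1000 * 3600 / 8 / 1e6

print(megabytes_per_hour(22))    # ~9.9 MB/hour at the claimed 22 kbit/s
print(megabytes_per_hour(1000))  # ~450 MB/hour at an assumed ~1 Mbit/s H.264 call
```

The ~45x gap is striking, but the codec experts' point stands: it is bought with orders of magnitude more decode compute and a domain restricted to talking heads.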

Practicality and use cases

  • The current need for a high-end GPU to run in real time makes the compression use case feel premature.
  • Proposed future/edge use cases:
    • Low-bandwidth or metered mobile networks; long video calls on very low data volume.
    • Situations with abundant compute but tight bandwidth (space, underwater) – though available bitrates there are debated.
    • Dial-in video from a single headshot when only a narrow uplink (or even voice channel) is available.
    • Many-to-one conferencing where each participant sends only expression/pose parameters.
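The many-to-one idea can be sized with a quick sketch. All the figures below are illustrative assumptions (e.g. an ARKit-style blendshape rig), not the parameterization used by the system under discussion:

```python
# Hedged estimate: each participant sends only expression/pose parameters.
n_params = 52        # assumed: blendshape-style facial coefficients
bits_per_param = 8   # assumed: coarse quantization per coefficient
fps = 25             # assumed: parameter update rate

bitrate_kbps = n_params * bits_per_param * fps / 1000
print(f"{bitrate_kbps:.1f} kbit/s per participant")  # 10.4 kbit/s
```

Even without entropy coding, an uplink of roughly 10 kbit/s per participant would suffice, which is why this use case keeps coming up despite the heavy decode-side compute.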

Analogies and side references

  • Multiple comparisons to MP3/AAC and “transparent” audio; ABX testing, encoder quality, and re-encoding artifacts mentioned.
  • References to older standards and media:
    • MPEG-4 face animation parameters as a conceptual predecessor.
    • Sci-fi depictions of ultra-compressed or locally reconstructed video feeds.
    • Game and cinematic CGI claimed to be sometimes indistinguishable, with others strongly disagreeing.