Short Message Compression Using LLMs
How the LLM-based compression works
- The system runs a language model deterministically to predict a probability distribution over the next token.
- An arithmetic coder then encodes the actual next token using that distribution, achieving lossless compression.
- When the model assigns high probability to the actual next token, few bits are needed; when the actual token is one the model considered unlikely, more bits are required.
- This is conceptually “encoding the difference between prediction and reality,” but always reversibly.
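The prediction-plus-arithmetic-coding loop described above can be sketched in a few lines. This is a toy illustration, not the linked implementation: the `predict` function below is a fixed stand-in distribution (a real system would query the LLM with the decoded-so-far context at every step), and exact `Fraction` arithmetic is used to sidestep the precision handling a production coder needs.

```python
from fractions import Fraction

# Stand-in for the LLM: a fixed next-token distribution. A real system
# would re-run the model on the growing context at each step.
def predict(context):
    return {"the": Fraction(1, 2), "cat": Fraction(1, 4),
            "sat": Fraction(1, 8), "<eos>": Fraction(1, 8)}

def encode(tokens):
    """Narrow [low, high) by each token's probability slice (arithmetic coding)."""
    low, high = Fraction(0), Fraction(1)
    for i, tok in enumerate(tokens):
        dist = predict(tokens[:i])
        width, cum = high - low, Fraction(0)
        for sym, p in dist.items():
            if sym == tok:
                low, high = low + width * cum, low + width * (cum + p)
                break
            cum += p
    return low, high  # any value in [low, high) identifies the sequence

def decode(value, n_tokens):
    """Replay the same predictions and walk back down the intervals."""
    low, high = Fraction(0), Fraction(1)
    out = []
    for _ in range(n_tokens):
        dist = predict(out)
        width, cum = high - low, Fraction(0)
        for sym, p in dist.items():
            lo, hi = low + width * cum, low + width * (cum + p)
            if lo <= value < hi:
                out.append(sym)
                low, high = lo, hi
                break
            cum += p
    return out

msg = ["the", "cat", "sat", "<eos>"]
low, high = encode(msg)
assert decode(low, len(msg)) == msg
```

The final interval width is the product of the token probabilities (here 1/512, i.e. about 9 bits): a confident model means wide slices and a short code, which is exactly the "few bits when confident" behavior above.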
Relation to traditional compression & information theory
- Commenters compare this directly to arithmetic coding, Huffman coding, PAQ, and other high-order statistical models.
- The LLM acts as a powerful adaptive probability model, analogous to traditional context models but trained with ML.
- Debate over whether to encode each token's rank in the sorted distribution vs its actual probability; consensus is that coding against the full distribution (as arithmetic coding does) is more efficient than rank-based schemes.
- Some note that similar ideas already appear in top entries for text compression benchmarks and the Hutter Prize.
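The information-theoretic core of the thread is Shannon's code length: a token the model assigns probability p ideally costs -log2(p) bits, so the model's quality translates directly into message size. A quick illustration (the probabilities are made up for the example):

```python
import math

def ideal_bits(token_probs):
    """Ideal total code length for a sequence, in bits: sum of -log2(p)."""
    return sum(-math.log2(p) for p in token_probs)

# A confident model vs. a flat one on the same 4-token message:
confident = ideal_bits([0.9, 0.8, 0.95, 0.9])    # ~0.7 bits total
flat      = ideal_bits([0.25, 0.25, 0.25, 0.25])  # exactly 8 bits
```

This is why a strong LLM beats classical high-order context models on text: it concentrates far more probability mass on the token that actually occurs.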
Lossy variants, embeddings, and generative uses
- Several people speculate about lossy text compression: nudging text toward more probable tokens to save bits.
- Encoder–decoder schemes that store only embeddings are suggested but reported as hard in practice; small perturbations tend to erase critical details (e.g., dates) rather than just style.
- Analogies are made to image/video compression and to compressed “evocations” in science fiction.
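Nobody in the discussion reports implementing the lossy "nudging" idea; a minimal sketch of what it could mean is below, with a made-up per-token probability table and synonym map standing in for a real model and paraphrase search. Tokens that would be expensive to code are swapped for cheaper near-synonyms before lossless coding.

```python
import math

# Hypothetical model probabilities and a hand-built lossy synonym map.
P = {"purchased": 0.02, "bought": 0.30, "automobile": 0.01, "car": 0.25}
SYNONYM = {"purchased": "bought", "automobile": "car"}

def nudge(tokens, max_bits=4.0):
    """If a token would cost more than max_bits to code, try a cheaper synonym."""
    out = []
    for t in tokens:
        cost = -math.log2(P.get(t, 0.001))
        out.append(SYNONYM[t] if cost > max_bits and t in SYNONYM else t)
    return out
```

As the thread notes, the hard part is that a real system must nudge only style, not substance; a table like `SYNONYM` hides exactly the problem (dates, names, numbers) that makes this difficult in practice.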
Potential applications
- Suggested domains: LoRa/Meshtastic-style low-bandwidth radios, satellite SMS/messaging, possibly Apple’s satellite iMessages.
- Idea: heavy compute on phones or gateways, but dramatically fewer bits over extremely constrained links.
- Some envision on-device models enabling highly compressed offline storage of reference material.
Steganography and security
- Idea: encode secret data in choices among plausible next tokens so resulting text looks natural.
- Discussion over whether this is steganography or cryptography; consensus is that it functions as steganography if the text appears ordinary.
- As a lossless scheme, it is not seen as an obvious attack vector beyond possibly degrading compression ratios; correctness depends on deterministic model configuration.
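The steganographic idea above reduces to letting secret bits pick among tokens the model considers roughly equally plausible. A toy sketch, with a fixed list of choice pairs standing in for the model's top-2 candidates at each position (a real scheme would derive these from the shared model):

```python
# Hypothetical per-position candidate pairs; sender and receiver must
# derive the same pairs (e.g. from the same deterministic model).
CHOICES = [("Hi", "Hello"), ("there", "friend"), ("!", ".")]

def embed(bits):
    """Pick choice[bit] at each position; the text stays plausible."""
    return [pair[b] for pair, b in zip(CHOICES, bits)]

def extract(tokens):
    """Recover each bit from which candidate was chosen."""
    return [pair.index(t) for pair, t in zip(CHOICES, tokens)]

cover = embed([1, 0, 1])   # ["Hello", "there", "."]
assert extract(cover) == [1, 0, 1]
```

This is why the thread lands on "steganography": the payload hides in an ordinary-looking message, and recovery depends on both parties replaying identical model predictions, echoing the determinism requirement of the lossless scheme.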
Model size, performance, and benchmarks
- The implementation uses a ~169M-parameter model; download size is about 153 MB.
- Some criticize comparisons to brotli as unfair unless the model size is counted; others argue the model cost amortizes across many messages and is acceptable for modern phones.
- Hutter Prize relevance is noted, but constraints on total program+data size and CPU/time make large LLMs impractical there; training-on-the-fly neural compressors exist but are very slow.
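The amortization argument is simple arithmetic; a back-of-envelope version is below. Only the 153 MB model size comes from the thread, and the per-message savings figure is an assumption for illustration.

```python
# One-time model download vs. per-message savings (savings figure assumed).
model_bytes = 153 * 1024 * 1024
savings_per_msg = 60   # assumed bytes saved vs a classical codec

break_even = model_bytes // savings_per_msg  # messages to "pay off" the download
```

Under these assumptions break-even is a few million messages, which is why the two camps disagree: counted per message it never pays off for one user, but as a one-time cost on a modern phone it is negligible.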
Side discussion: compressing JSON in Postgres
- For repetitive JSONB, suggestions include: TOAST compression with modern codecs (LZ4, Zstd), gzip/zstd with seed dictionaries, and custom schemas or templating techniques inspired by log compression systems.
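The seed-dictionary suggestion can be shown with Python's stdlib `zlib`, whose `zdict` parameter implements the same idea as zstd's trained dictionaries: common substrings live in a shared dictionary, so each small document compresses as if it followed a long, familiar prefix. The dictionary contents here are made up; in practice one would build it from representative documents (or use `zstd --train`).

```python
import json
import zlib

# Shared seed dictionary holding the boilerplate common to the documents.
# (Contents invented for the example; derive from real samples in practice.)
ZDICT = b'{"event": "click", "user_id": 0, "ts": 0, "page": "/"}'

def compress(doc: dict) -> bytes:
    c = zlib.compressobj(zdict=ZDICT)
    return c.compress(json.dumps(doc).encode()) + c.flush()

def decompress(blob: bytes) -> dict:
    d = zlib.decompressobj(zdict=ZDICT)
    return json.loads(d.decompress(blob))

doc = {"event": "click", "user_id": 123, "ts": 1700000000, "page": "/home"}
assert decompress(compress(doc)) == doc
```

The same round-trip discipline applies as with TOAST or zstd dictionaries: the exact dictionary must be versioned and kept available for as long as any compressed rows reference it.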