Short Message Compression Using LLMs
How the LLM-based compression works
- The system runs a language model deterministically to predict a probability distribution over the next token.
- An arithmetic coder then encodes the actual next token using that distribution, achieving lossless compression.
- When the model assigns high probability to the actual next token, few bits are needed; when the actual token is one the model considered unlikely, more bits are required.
- This is conceptually “encoding the difference between prediction and reality,” but always reversibly.
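The prediction-plus-arithmetic-coding loop described above can be sketched in a few lines. This is a toy illustration, not the linked implementation: the `predict` function below is a fixed stand-in distribution (a real system would query the LLM with the decoded-so-far context at every step), and exact `Fraction` arithmetic is used to sidestep the precision handling a production coder needs.

```python
from fractions import Fraction

# Stand-in for the LLM: a fixed next-token distribution. A real system
# would re-run the model on the growing context at each step.
def predict(context):
    return {"the": Fraction(1, 2), "cat": Fraction(1, 4),
            "sat": Fraction(1, 8), "<eos>": Fraction(1, 8)}

def encode(tokens):
    """Narrow [low, high) by each token's probability slice (arithmetic coding)."""
    low, high = Fraction(0), Fraction(1)
    for i, tok in enumerate(tokens):
        dist = predict(tokens[:i])
        width, cum = high - low, Fraction(0)
        for sym, p in dist.items():
            if sym == tok:
                low, high = low + width * cum, low + width * (cum + p)
                break
            cum += p
    return low, high  # any value in [low, high) identifies the sequence

def decode(value, n_tokens):
    """Replay the same predictions and walk back down the intervals."""
    low, high = Fraction(0), Fraction(1)
    out = []
    for _ in range(n_tokens):
        dist = predict(out)
        width, cum = high - low, Fraction(0)
        for sym, p in dist.items():
            lo, hi = low + width * cum, low + width * (cum + p)
            if lo <= value < hi:
                out.append(sym)
                low, high = lo, hi
                break
            cum += p
    return out

msg = ["the", "cat", "sat", "<eos>"]
low, high = encode(msg)
assert decode(low, len(msg)) == msg
```

The final interval width is the product of the token probabilities (here 1/512, i.e. about 9 bits): a confident model means wide slices and a short code, which is exactly the "few bits when confident" behavior above.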
Relation to traditional compression & information theory
- Commenters compare this directly to arithmetic coding, Huffman coding, PAQ, and other high-order statistical models.
- The LLM acts as a powerful adaptive probability model, analogous to traditional context models but trained with ML.
- Debate over whether to encode each token's rank in the sorted distribution vs its actual probability; consensus is that coding against the full distribution (as arithmetic coding does) is more efficient than rank-based schemes.
- Some note that similar ideas already appear in top entries for text compression benchmarks and the Hutter Prize.
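The information-theoretic core of the thread is Shannon's code length: a token the model assigns probability p ideally costs -log2(p) bits, so the model's quality translates directly into message size. A quick illustration (the probabilities are made up for the example):

```python
import math

def ideal_bits(token_probs):
    """Ideal total code length for a sequence, in bits: sum of -log2(p)."""
    return sum(-math.log2(p) for p in token_probs)

# A confident model vs. a flat one on the same 4-token message:
confident = ideal_bits([0.9, 0.8, 0.95, 0.9])    # ~0.7 bits total
flat      = ideal_bits([0.25, 0.25, 0.25, 0.25])  # exactly 8 bits
```

This is why a strong LLM beats classical high-order context models on text: it concentrates far more probability mass on the token that actually occurs.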
Lossy variants, embeddings, and generative uses
- Several people speculate about lossy text compression: nudging text toward more probable tokens to save bits.
- Encoder–decoder schemes that store only embeddings are suggested but reported as hard in practice; small perturbations tend to erase critical details (e.g., dates) rather than just style.
- Analogies are made to image/video compression and to compressed “evocations” in science fiction.
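Nobody in the discussion reports implementing the lossy "nudging" idea; a minimal sketch of what it could mean is below, with a made-up per-token probability table and synonym map standing in for a real model and paraphrase search. Tokens that would be expensive to code are swapped for cheaper near-synonyms before lossless coding.

```python
import math

# Hypothetical model probabilities and a hand-built lossy synonym map.
P = {"purchased": 0.02, "bought": 0.30, "automobile": 0.01, "car": 0.25}
SYNONYM = {"purchased": "bought", "automobile": "car"}

def nudge(tokens, max_bits=4.0):
    """If a token would cost more than max_bits to code, try a cheaper synonym."""
    out = []
    for t in tokens:
        cost = -math.log2(P.get(t, 0.001))
        out.append(SYNONYM[t] if cost > max_bits and t in SYNONYM else t)
    return out
```

As the thread notes, the hard part is that a real system must nudge only style, not substance; a table like `SYNONYM` hides exactly the problem (dates, names, numbers) that makes this difficult in practice.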
Potential applications
- Suggested domains: LoRa/Meshtastic-style low-bandwidth radios, satellite SMS/messaging, possibly Apple’s satellite iMessages.
- Idea: heavy compute on phones or gateways, but dramatically fewer bits over extremely constrained links.
- Some envision on-device models enabling highly compressed offline storage of reference material.
Steganography and security
- Idea: encode secret data in choices among plausible next tokens so resulting text looks natural.
- Discussion over whether this is steganography or cryptography; consensus is that it functions as steganography if the text appears ordinary.
- As a lossless scheme, it is not seen as an obvious attack vector beyond possibly degrading compression ratios; correctness depends on deterministic model configuration.
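The steganographic idea above reduces to letting secret bits pick among tokens the model considers roughly equally plausible. A toy sketch, with a fixed list of choice pairs standing in for the model's top-2 candidates at each position (a real scheme would derive these from the shared model):

```python
# Hypothetical per-position candidate pairs; sender and receiver must
# derive the same pairs (e.g. from the same deterministic model).
CHOICES = [("Hi", "Hello"), ("there", "friend"), ("!", ".")]

def embed(bits):
    """Pick choice[bit] at each position; the text stays plausible."""
    return [pair[b] for pair, b in zip(CHOICES, bits)]

def extract(tokens):
    """Recover each bit from which candidate was chosen."""
    return [pair.index(t) for pair, t in zip(CHOICES, tokens)]

cover = embed([1, 0, 1])   # ["Hello", "there", "."]
assert extract(cover) == [1, 0, 1]
```

This is why the thread lands on "steganography": the payload hides in an ordinary-looking message, and recovery depends on both parties replaying identical model predictions, echoing the determinism requirement of the lossless scheme.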
Model size, performance, and benchmarks
- The implementation uses a ~169M-parameter model; download size is about 153 MB.
- Some criticize comparisons to brotli as unfair unless the model size is counted; others argue the model cost amortizes across many messages and is acceptable for modern phones.
- Hutter Prize relevance is noted, but constraints on total program+data size and CPU/time make large LLMs impractical there; training-on-the-fly neural compressors exist but are very slow.
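The amortization argument is simple arithmetic; a back-of-envelope version is below. Only the 153 MB model size comes from the thread, and the per-message savings figure is an assumption for illustration.

```python
# One-time model download vs. per-message savings (savings figure assumed).
model_bytes = 153 * 1024 * 1024
savings_per_msg = 60   # assumed bytes saved vs a classical codec

break_even = model_bytes // savings_per_msg  # messages to "pay off" the download
```

Under these assumptions break-even is a few million messages, which is why the two camps disagree: counted per message it never pays off for one user, but as a one-time cost on a modern phone it is negligible.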
Side discussion: compressing JSON in Postgres
- For repetitive JSONB, suggestions include: TOAST compression with modern codecs (LZ4, Zstd), gzip/zstd with seed dictionaries, and custom schemas or templating techniques inspired by log compression systems.
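The seed-dictionary suggestion can be shown with Python's stdlib `zlib`, whose `zdict` parameter implements the same idea as zstd's trained dictionaries: common substrings live in a shared dictionary, so each small document compresses as if it followed a long, familiar prefix. The dictionary contents here are made up; in practice one would build it from representative documents (or use `zstd --train`).

```python
import json
import zlib

# Shared seed dictionary holding the boilerplate common to the documents.
# (Contents invented for the example; derive from real samples in practice.)
ZDICT = b'{"event": "click", "user_id": 0, "ts": 0, "page": "/"}'

def compress(doc: dict) -> bytes:
    c = zlib.compressobj(zdict=ZDICT)
    return c.compress(json.dumps(doc).encode()) + c.flush()

def decompress(blob: bytes) -> dict:
    d = zlib.decompressobj(zdict=ZDICT)
    return json.loads(d.decompress(blob))

doc = {"event": "click", "user_id": 123, "ts": 1700000000, "page": "/home"}
assert decompress(compress(doc)) == doc
```

The same round-trip discipline applies as with TOAST or zstd dictionaries: the exact dictionary must be versioned and kept available for as long as any compressed rows reference it.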