Show HN: 1-Bit Bonsai, the First Commercially Viable 1-Bit LLMs
Quantization Approach & “1‑Bit” Details
- Weights are stored as 1‑bit values in groups of 128, each sharing a 16‑bit scaling factor; effective precision is ~1.1 bits, not pure 1‑bit.
- Some compare this to earlier 1.58‑bit / ternary work and ask how it scales to larger models (27B, 35B, 100B+).
- There’s interest in theoretical work on fully binary training and backprop, but Bonsai appears to be a quantized Qwen variant, not trained from scratch in binary.
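The group-wise scheme in the first bullet can be sketched numerically. This is a generic sign-plus-scale quantizer written from the description (the mean-absolute-value scale is an assumption, not Bonsai's actual kernel); it also reproduces the ~1.1-bit effective-precision arithmetic:

```python
import numpy as np

def quantize_1bit_grouped(w, group_size=128):
    """Sketch of group-wise 1-bit quantization: each group of 128 weights
    keeps only its sign bits plus one shared fp16 scale. The scale choice
    (mean absolute value per group) is an assumption for illustration."""
    w = w.reshape(-1, group_size)
    scales = np.abs(w).mean(axis=1, keepdims=True).astype(np.float16)
    signs = np.sign(w)          # values in {-1, 0, +1}
    signs[signs == 0] = 1.0     # map exact zeros to +1 so storage is 1 bit
    return signs.astype(np.int8), scales

def dequantize(signs, scales):
    """Reconstruct approximate weights from signs and per-group scales."""
    return signs.astype(np.float32) * scales.astype(np.float32)

# Effective storage: 1 sign bit per weight + 16 scale bits per 128 weights.
bits_per_weight = (128 * 1 + 16) / 128   # = 1.125, i.e. the "~1.1 bits" figure
```

This makes the "not pure 1-bit" point concrete: the per-group fp16 scale adds 16/128 = 0.125 bits of overhead per weight.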
Performance, Quality & Trade‑offs
- Benchmarks in the whitepaper put the 8B model below larger mainstream models (e.g. Qwen3) in accuracy, but at a dramatically smaller size (16×) and with much faster inference (≈6× on an RTX 4090).
- Users report:
  - Very fast generation (hundreds of tokens/s on high‑end GPUs, workable on older CPUs and phones).
  - Quality reminiscent of early GPT‑3: often coherent and useful for coding, SQL, LaTeX, and simple data tasks, but with frequent hallucinations and factual mistakes.
  - Failures on some reasoning tests (e.g. the “car wash” distance puzzle, the strawberry test, timezone conversions) and nonsense in some factual domains (e.g. physics, Harry Potter lore).
Deployment Experiences
- Runs via a fork of llama.cpp, with special kernels and a custom quantization type; building from source and checking out the right branch is required.
- Some struggle with gibberish output until they use the correct fork/branch or parameters (e.g. context size, AVX2, KV cache precision).
- Works on Jetson, older laptops, iPhones (via third‑party apps), and consumer GPUs; CPU‑only is possible but can be slow without optimizations.
- Memory usage in practice is sometimes closer to that of 4‑bit quants than the headline “14× less” suggests, leading to confusion.
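The memory confusion in the last bullet can be illustrated with a back-of-envelope estimate: 1-bit weights shrink dramatically, but runtime buffers such as an fp16 KV cache do not, so total usage can approach that of a 4-bit quant. All layer counts, head sizes, and precisions below are illustrative assumptions, not Bonsai's actual configuration:

```python
def model_mem_gb(params_b, bits_per_weight):
    """Back-of-envelope weight memory in GB, ignoring runtime buffers."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

weights_1bit = model_mem_gb(8, 1.125)  # ~1.1 GB for 8B params at ~1.1 bits
weights_4bit = model_mem_gb(8, 4.5)    # ~4.5 GB for a typical Q4 with scales

def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx, bytes_per_elem=2):
    """KV cache size: 2 tensors (K and V) per layer, fp16 by default.
    The architecture numbers passed in are assumptions for illustration."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1e9

# e.g. an assumed 32-layer model, 8 KV heads of dim 128, 32k context:
cache = kv_cache_gb(32, 8, 128, 32768)  # ~4.3 GB at fp16
```

Under these assumed numbers, total usage (~1.1 GB weights + ~4.3 GB cache) lands near a 4-bit quant's weight footprint alone, which would explain observations that practical memory savings fall short of the headline figure.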
Use Cases & Outlook
- Seen as promising for: lightweight agents, classification, translation, simple summarization, SQL agents, and as sub‑components under stronger “orchestrator” models.
- Some expect future systems to rely more on small, tool‑using models rather than memorizing facts.
- Enthusiasm for 1‑bit models as a path to democratized, large‑parameter local LLMs coexists with skepticism about the missing comparisons against strong 4‑/8‑bit quantized baselines and the unclear training cost.