Show HN: 1-Bit Bonsai, the First Commercially Viable 1-Bit LLMs

Quantization Approach & “1‑Bit” Details

  • Weights are stored as 1‑bit values in groups of 128, with each group sharing a 16‑bit scaling factor; effective precision is therefore ~1.1 bits per weight (128 sign bits + one 16‑bit scale = 1.125 bits/weight), not pure 1‑bit.
  • Some commenters compare this to earlier 1.58‑bit / ternary work (e.g. BitNet b1.58) and ask how the approach scales to larger models (27B, 35B, 100B+).
  • There’s interest in theoretical work on fully binary training and backprop, but Bonsai appears to be a quantized Qwen variant, not trained from scratch in binary.
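The grouped scheme described above can be sketched numerically. This is a minimal illustration, not the release's actual kernel code; in particular, using the mean absolute value of each group as its shared scale is an assumption (the whitepaper may pick the scale differently):

```python
import random

GROUP = 128       # weights per group (from the post)
SCALE_BITS = 16   # one 16-bit scale shared by each group

def quantize_group(w):
    """Binarize a group of weights to {-1, +1} with one shared scale.
    The absmean scale is an assumption, not the release's documented choice."""
    scale = sum(abs(x) for x in w) / len(w)
    bits = [1 if x >= 0 else -1 for x in w]
    return bits, scale

def dequantize_group(bits, scale):
    # Reconstruction: each weight becomes +/- scale.
    return [b * scale for b in bits]

# Effective storage cost per weight: 1 sign bit plus the amortized scale.
effective_bits = (GROUP * 1 + SCALE_BITS) / GROUP
print(effective_bits)  # 1.125 -- the "~1.1 bits" headline figure

random.seed(0)
w = [random.gauss(0, 1) for _ in range(GROUP)]
bits, scale = quantize_group(w)
w_hat = dequantize_group(bits, scale)
mse = sum((a - b) ** 2 for a, b in zip(w, w_hat)) / GROUP
print(round(mse, 3))  # reconstruction error of 1 bit + shared scale
```

The reconstruction error is what separates this from a plain rounding scheme: every weight in a group collapses to ±scale, which is why quality sits below full-precision baselines while storage drops to ~1.1 bits/weight.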

Performance, Quality & Trade‑offs

  • Benchmarks in the whitepaper put the 8B model below larger mainstream models (e.g. Qwen3) in accuracy, but at a dramatically smaller size (~16×) and with much faster inference (≈6× on an RTX 4090).
  • Users report:
    • Very fast generation (hundreds of tokens/s on high‑end GPUs, workable on older CPUs and phones).
    • Quality reminiscent of early GPT‑3: often coherent and useful for coding, SQL, LaTeX, simple data tasks; but frequent hallucinations and factual mistakes.
    • Fails some reasoning tests (e.g. “car wash” distance, strawberry test, timezone conversions), and produces nonsense in some factual domains (e.g. physics, Harry Potter lore).

Deployment Experiences

  • Runs via a fork of llama.cpp with special kernels and a custom quantization type; building from source and checking out the correct branch is required.
  • Some users get gibberish output until they switch to the correct fork/branch or adjust parameters (e.g. context size, AVX2 support, KV‑cache precision).
  • Works on Jetson, older laptops, iPhones (via third‑party apps), and consumer GPUs; CPU‑only is possible but can be slow without optimizations.
  • Memory usage in practice is sometimes closer to that of 4‑bit quants than the headline “14× less” suggests, which has caused confusion.

Use Cases & Outlook

  • Seen as promising for: lightweight agents, classification, translation, simple summarization, SQL agents, and as sub‑components under stronger “orchestrator” models.
  • Some expect future systems to rely more on small, tool‑using models rather than memorizing facts.
  • Enthusiasm for 1‑bit models as a path to democratized, large‑parameter local LLMs coexists with skepticism over the missing comparisons against strong 4‑/8‑bit quantized baselines and the unclear training cost.