Show HN: 1-Bit Bonsai, the First Commercially Viable 1-Bit LLMs

Quantization Approach & “1‑Bit” Details

  • Weights are stored as 1‑bit values in groups of 128, with each group sharing a 16‑bit scaling factor; effective precision is therefore ~1.1 bits per weight (128 sign bits + one 16‑bit scale = 1.125 bits/weight), not pure 1‑bit.
  • Some commenters compare this to earlier 1.58‑bit / ternary work (e.g. BitNet b1.58) and ask how the approach scales to larger models (27B, 35B, 100B+).
  • There’s interest in theoretical work on fully binary training and backprop, but Bonsai appears to be a quantized Qwen variant, not trained from scratch in binary.
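The grouped scheme described above can be sketched numerically. This is a minimal illustration, not the release's actual kernel code; in particular, using the mean absolute value of each group as its shared scale is an assumption (the whitepaper may pick the scale differently):

```python
import random

GROUP = 128       # weights per group (from the post)
SCALE_BITS = 16   # one 16-bit scale shared by each group

def quantize_group(w):
    """Binarize a group of weights to {-1, +1} with one shared scale.
    The absmean scale is an assumption, not the release's documented choice."""
    scale = sum(abs(x) for x in w) / len(w)
    bits = [1 if x >= 0 else -1 for x in w]
    return bits, scale

def dequantize_group(bits, scale):
    # Reconstruction: each weight becomes +/- scale.
    return [b * scale for b in bits]

# Effective storage cost per weight: 1 sign bit plus the amortized scale.
effective_bits = (GROUP * 1 + SCALE_BITS) / GROUP
print(effective_bits)  # 1.125 -- the "~1.1 bits" headline figure

random.seed(0)
w = [random.gauss(0, 1) for _ in range(GROUP)]
bits, scale = quantize_group(w)
w_hat = dequantize_group(bits, scale)
mse = sum((a - b) ** 2 for a, b in zip(w, w_hat)) / GROUP
print(round(mse, 3))  # reconstruction error of 1 bit + shared scale
```

The reconstruction error is what separates this from a plain rounding scheme: every weight in a group collapses to ±scale, which is why quality sits below full-precision baselines while storage drops to ~1.1 bits/weight.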

Performance, Quality & Trade‑offs

  • Benchmarks in the whitepaper put the 8B model below larger mainstream models (e.g. Qwen3) in accuracy, but at a dramatically smaller size (~16×) and with much faster inference (≈6× on an RTX 4090).
  • Users report:
    • Very fast generation (hundreds of tokens/s on high‑end GPUs, workable on older CPUs and phones).
    • Quality reminiscent of early GPT‑3: often coherent and useful for coding, SQL, LaTeX, simple data tasks; but frequent hallucinations and factual mistakes.
    • Fails some reasoning tests (e.g. “car wash” distance, strawberry test, timezone conversions), and produces nonsense in some factual domains (e.g. physics, Harry Potter lore).

Deployment Experiences

  • Runs via a fork of llama.cpp with special kernels and a custom quantization type; building from source and checking out the correct branch is required.
  • Some users get gibberish output until they switch to the correct fork/branch or adjust parameters (e.g. context size, AVX2 support, KV‑cache precision).
  • Works on Jetson, older laptops, iPhones (via third‑party apps), and consumer GPUs; CPU‑only is possible but can be slow without optimizations.
  • Memory usage in practice is sometimes closer to that of 4‑bit quants than the headline “14× less” suggests, which has caused confusion.

Use Cases & Outlook

  • Seen as promising for: lightweight agents, classification, translation, simple summarization, SQL agents, and as sub‑components under stronger “orchestrator” models.
  • Some expect future systems to rely more on small, tool‑using models rather than memorizing facts.
  • Enthusiasm for 1‑bit models as a path to democratized, large‑parameter local LLMs coexists with skepticism over the missing comparisons against strong 4‑/8‑bit quantized baselines and the unclear training cost.