BitNet: Inference framework for 1-bit LLMs

Scope and Nature of the Release

  • The “100B 1‑bit model” title is widely viewed as misleading.
  • The repo provides an inference framework (bitnet.cpp) that can run a 100B-class BitNet model on CPUs; no 100B model has actually been trained or released.
  • Existing official BitNet models are small (≈1–3B parameters). The largest mentioned in docs/papers is 10B, used only for experiments.

1-Bit vs 1.58-Bit / Ternary Weights

  • The models are ternary (weights in {−1, 0, +1}), which carry log2(3) ≈ 1.58 bits of information per parameter, not strictly 1 bit.
  • Implementation uses 2 physical bits per weight (e.g., sign + value), sometimes packing 4 symbols per byte for simplicity.
  • “1‑bit LLM” is seen as marketing shorthand; several commenters prefer calling it “1‑trit” or 1.58‑bit.
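The 1.58-bit figure and the 2-bit physical encoding above can be made concrete in a few lines. This is an illustrative packing scheme, not bitnet.cpp's actual memory layout; the `ENCODE` mapping and helper names are invented for the example.

```python
import math

# A ternary parameter carries log2(3) ≈ 1.585 bits of information.
print(math.log2(3))  # ≈ 1.585

# Hypothetical 2-bit encoding: map {-1, 0, +1} to 2-bit codes and
# pack four weights per byte (hence "2 physical bits per weight").
ENCODE = {-1: 0b00, 0: 0b01, 1: 0b10}
DECODE = {v: k for k, v in ENCODE.items()}

def pack(weights):
    """Pack a list of ternary weights (length divisible by 4) into bytes."""
    out = bytearray()
    for i in range(0, len(weights), 4):
        b = 0
        for j, w in enumerate(weights[i:i + 4]):
            b |= ENCODE[w] << (2 * j)
        out.append(b)
    return bytes(out)

def unpack(data, n):
    """Recover n ternary weights from packed bytes."""
    return [DECODE[(data[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n)]

w = [-1, 0, 1, 1, 0, 0, -1, 1]
packed = pack(w)
assert unpack(packed, len(w)) == w
assert len(packed) == 2  # 8 weights in 2 bytes: 2 bits per weight
```

The gap between 1.58 information bits and 2 stored bits is the price of byte-aligned simplicity; denser base-3 encodings (e.g., 5 trits per byte) are possible but complicate the kernels.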

Training vs Post-Training Quantization

  • BitNet’s core idea: design and train models from scratch with ternary weights (via custom BitLinear layers), not quantize full‑precision models down afterward.
  • Post‑training 1.58‑bit quantization of normal models performs poorly; native ternary models can be more competitive but still lag SOTA.
  • Scaling to 100B parameters should be roughly as hard as training a standard 100B model, and perhaps harder given the approach's relative immaturity.
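The train-from-scratch idea above hinges on the quantization inside BitLinear. A minimal sketch of the absmean ternarization described in the BitNet b1.58 paper follows; function names are illustrative, and real training keeps full-precision master weights with a straight-through estimator around the rounding step.

```python
import numpy as np

def absmean_ternarize(W, eps=1e-6):
    """Ternarize a weight matrix: scale by the mean absolute value,
    then round and clip to {-1, 0, +1}. A sketch of the scheme in the
    b1.58 paper, not the repo's exact kernel."""
    gamma = np.mean(np.abs(W)) + eps            # per-tensor absmean scale
    Wq = np.clip(np.round(W / gamma), -1, 1)    # ternary weights
    return Wq.astype(np.int8), gamma

def bitlinear_forward(x, Wq, gamma):
    """Forward pass: ternary matmul, then rescale by gamma.
    (Activation quantization, norms, etc. are omitted here.)"""
    return (x @ Wq.T.astype(np.float32)) * gamma

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8)).astype(np.float32)
x = rng.normal(size=(2, 8)).astype(np.float32)
Wq, gamma = absmean_ternarize(W)
assert set(np.unique(Wq)) <= {-1, 0, 1}
y = bitlinear_forward(x, Wq, gamma)
assert y.shape == (2, 4)
```

Because rounding happens during training, the model learns weights that survive ternarization, which is why native ternary models outperform post-hoc 1.58-bit quantization of ordinary checkpoints.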

Performance, Memory, and Energy

  • CPU inference is memory‑bandwidth bound for large models; ternary/packed weights reduce bandwidth demands.
  • Matmuls can become mostly additions/XOR+popcount, changing the compute profile versus FP16/INT8 FMA-heavy kernels.
  • Reported CPU gains: near-linear speedup with thread count and roughly 70–82% lower energy use than the baselines. The authors claim 5–7 tokens/s for hypothetical 100B CPU inference; some users say they would want ≥10 tok/s for comfortable use.
  • Current demos use only a 3B model; details like RAM/storage requirements are not clearly documented.
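The "matmuls become additions" point above can be shown directly: with ternary weights, a dot product needs no multiplies at all. A toy sketch (real kernels operate on packed bits with XOR/popcount, but they exploit the same structure):

```python
def ternary_dot(x, w):
    """Dot product with ternary weights: only additions and
    subtractions, never a multiplication. Zero weights are skipped,
    so sparsity in the ternary matrix is free."""
    acc = 0.0
    for xi, wi in zip(x, w):
        if wi == 1:
            acc += xi
        elif wi == -1:
            acc -= xi
        # wi == 0 contributes nothing
    return acc

x = [0.5, -1.0, 2.0, 3.0]
w = [1, 0, -1, 1]
assert ternary_dot(x, w) == 0.5 - 2.0 + 3.0  # == 1.5
```

This is why the compute profile shifts away from FMA throughput: once weights fit in so few bits, large-model CPU inference is dominated by how fast memory can stream them, which is exactly where the packed ternary format helps.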

Model Quality and Practical Value

  • Demo text is described as repetitive, shallow, and sometimes incorrect (e.g., odd obsessions, fake citations).
  • Defenders note the shown model is a small, 2‑year‑old base model trained on relatively few tokens.
  • A newer 2B BitNet model posts solid numbers on some benchmarks (e.g., GSM8K) yet is described as weak at math in practice; its overall competitiveness with small Qwen models is debated, and some call BitNet more of a research curiosity.

Adoption, Skepticism, and Broader Context

  • Some argue that if ternary were truly revolutionary, leading labs (Qwen, DeepSeek, etc.) would already be using it; others say absence of public results isn’t conclusive.
  • There’s interest in low‑bit models for custom hardware, NPUs, and fully on‑device “minimal” LLMs paired with tools/RAG.
  • Thread also contains meta-discussion about suspected bot accounts, reflecting broader concern over AI‑generated forum content.