BitNet: Inference framework for 1-bit LLMs
Scope and Nature of the Release
- The “100B 1‑bit model” title is widely viewed as misleading.
- The repo provides an inference framework (bitnet.cpp) that can run a 100B-class BitNet model on CPUs; no 100B model has actually been trained or released.
- Existing official BitNet models are small (≈1–3B parameters); the largest mentioned in the docs/papers is a 10B model used only for experiments.
1-Bit vs 1.58-Bit / Ternary Weights
- The models are ternary (weights in {−1, 0, +1}), which carries at most log2(3) ≈ 1.58 bits of information per parameter, not strictly 1 bit.
- The implementation uses 2 physical bits per weight (e.g., a sign bit plus a magnitude bit), sometimes packing 4 symbols per byte for simplicity.
- “1‑bit LLM” is seen as marketing shorthand; several commenters prefer calling it “1‑trit” or 1.58‑bit.
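The 1.58-bit figure and the 2-bit packing above can be made concrete with a small sketch. This is an illustrative encoding, not the actual layout bitnet.cpp uses; the `ENCODE` mapping and function names are assumptions for the example.

```python
import math

# A uniform ternary symbol carries log2(3) ≈ 1.585 bits of information,
# which is where the "1.58-bit" name comes from.
TERNARY_BITS = math.log2(3)

# Map ternary weights {-1, 0, +1} to 2-bit codes (illustrative choice;
# the real bitnet.cpp packing may differ).
ENCODE = {-1: 0b00, 0: 0b01, 1: 0b10}
DECODE = {v: k for k, v in ENCODE.items()}

def pack_ternary(weights):
    """Pack 4 ternary weights per byte, 2 bits each."""
    assert len(weights) % 4 == 0
    out = bytearray()
    for i in range(0, len(weights), 4):
        b = 0
        for j, w in enumerate(weights[i:i + 4]):
            b |= ENCODE[w] << (2 * j)
        out.append(b)
    return bytes(out)

def unpack_ternary(packed, n):
    """Inverse of pack_ternary: recover the first n weights."""
    ws = []
    for b in packed:
        for j in range(4):
            ws.append(DECODE[(b >> (2 * j)) & 0b11])
    return ws[:n]
```

At 4 weights per byte, 8 weights occupy 2 bytes instead of the 16 bytes FP16 would need, which is the bandwidth saving the thread discusses.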
Training vs Post-Training Quantization
- BitNet’s core idea: design and train models from scratch with ternary weights (via custom BitLinear layers), not quantize full‑precision models down afterward.
- Post‑training 1.58‑bit quantization of normal models performs poorly; native ternary models can be more competitive but still lag SOTA.
- Scaling to 100B parameters should be roughly as hard as training a standard 100B model, and perhaps harder given the approach's relative immaturity.
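The "train from scratch with ternary weights" idea rests on quantizing weights inside the layer at training time. Below is a minimal sketch of the absmean ternarization described in the BitNet b1.58 paper, forward pass only (real training also needs a straight-through estimator for gradients, which is omitted here); the function name and `eps` guard are my own.

```python
def ternarize_absmean(weights, eps=1e-6):
    """Round weights to {-1, 0, +1} using the absmean scale:
    W_q = clip(round(W / mean|W|), -1, 1), as in BitNet b1.58."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale
```

A post-training quantizer applies a step like this once to a model trained in full precision; BitNet instead applies it on every forward pass so the optimizer learns weights that survive the rounding, which is why the native models fare better than post-hoc 1.58-bit quantization.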
Performance, Memory, and Energy
- CPU inference is memory‑bandwidth bound for large models; ternary/packed weights reduce bandwidth demands.
- Matmuls reduce mostly to additions/subtractions (or XOR+popcount tricks on packed bits), changing the compute profile versus FMA-heavy FP16/INT8 kernels.
- Reported CPU gains: near-linear speedup with thread count and ~70–82% energy reduction versus baselines. Claimed 5–7 tok/s for hypothetical 100B CPU inference; some users want ≥10 tok/s for comfortable use.
- Current demos use only a 3B model; details like RAM/storage requirements are not clearly documented.
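Why ternary matmuls need no multiplies can be shown in a few lines. This is a naive scalar sketch (real kernels operate on packed bits with SIMD); the function name and the per-row scale convention are assumptions for illustration.

```python
def ternary_matvec(W_ternary, scales, x):
    """Matrix-vector product where every weight is in {-1, 0, +1}.
    Each inner product becomes pure adds/subtracts, with a single
    floating-point scale applied once per output row."""
    out = []
    for row, s in zip(W_ternary, scales):
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # +1 weight: add the activation
            elif w == -1:
                acc -= xi      # -1 weight: subtract it
            # 0 weight: contributes nothing, skipped entirely
        out.append(s * acc)
    return out
```

Since zeros are skipped and the remaining terms are sign flips, the hot loop has no multiply at all, which is why the compute profile shifts away from FMA units even before any bit-packing tricks.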
Model Quality and Practical Value
- Demo text is described as repetitive, shallow, and sometimes incorrect (e.g., odd obsessions, fake citations).
- Defenders note the shown model is a small, 2‑year‑old base model trained on relatively few tokens.
- A newer 2B BitNet model shows solid benchmarks in some tasks (e.g., GSM8K) but is weak on math; overall competitiveness vs small Qwen models is debated, with some calling BitNet more of a research curiosity.
Adoption, Skepticism, and Broader Context
- Some argue that if ternary were truly revolutionary, leading labs (Qwen, DeepSeek, etc.) would already be using it; others counter that the absence of public results isn't conclusive.
- There’s interest in low‑bit models for custom hardware, NPUs, and fully on‑device “minimal” LLMs paired with tools/RAG.
- Thread also contains meta-discussion about suspected bot accounts, reflecting broader concern over AI‑generated forum content.