2024-05-30

“Imprecise” language models are smaller, speedier, and nearly as accurate

Energy, Compute, and Scaling

Many argue that efficiency gains won’t reduce energy use; they’ll be reinvested into bigger/better models until marginal quality gains no longer justify cost.
Others downplay LLM training energy versus much larger sectors (video, food), but some counter that large GPU clusters running 24/7 are already substantial.
Renewables are seen as part of the answer, but there’s concern about opportunity cost of dedicated power for AI.
Some object to framing energy use as the main problem, seeing it as a proxy for disliking the tech.

Quantization, 1‑bit/ternary Models, and Real-World Quality

Strong consensus that quantization is not “free”: lower precision usually costs quality.
Experience varies: some claim Q8 is effectively identical to FP16, Q5 only slightly worse; others say all quants are clearly degraded in practice.
Critical voices call out “1‑bit LLM” papers as marketing: real effective bits are higher due to extra parameters; perplexity often degrades sharply.
BitNet/BitNet‑1.58 (ternary weights) are seen as promising, especially when trained from scratch, but skeptics note that the results have not yet been demonstrated at frontier scale.
One quantization researcher outlines key evaluation metrics: perplexity (on comparable datasets/context), true bits/parameter (via file size), actual throughput/latency, and strength of the base model.
Several suggest we’re nearing a practical floor around ~2.5–4 bits/parameter; truly useful 1‑bit models may be unlikely.
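
The "true bits/parameter (via file size)" metric mentioned above can be sketched in a few lines of Python. This is only an illustration, not the researcher's own tooling: the function name, the file size, and the parameter count are hypothetical. It also shows where the "1.58" in BitNet‑1.58 comes from: a weight with three possible values carries at most log2(3) bits.

```python
import math

def effective_bits_per_param(file_size_bytes: int, n_params: int) -> float:
    """Total bits in the model file divided by the parameter count.

    This includes scales, zero points, and any tensors kept at higher
    precision, so it is usually above the nominal bit width.
    """
    if n_params <= 0:
        raise ValueError("n_params must be positive")
    return file_size_bytes * 8 / n_params

# Information-theoretic ceiling for a ternary weight in {-1, 0, +1}.
TERNARY_BITS = math.log2(3)

# Hypothetical numbers: a 7B-parameter model stored in a 4.5 GB file.
bits = effective_bits_per_param(file_size_bytes=4_500_000_000,
                                n_params=7_000_000_000)
print(f"effective bits/param: {bits:.2f}")
print(f"ternary ceiling: {TERNARY_BITS:.2f} bits")
```

Comparing this figure with the advertised bit width is one way to check the claim that a "1‑bit" model's real effective bits are higher.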

Model Accuracy, Reliability, and Hype

Some see LLMs as overhyped and too unreliable for high-stakes tasks; others report large productivity gains in coding, debugging, everyday problem-solving, and creative work.
Trust is a central issue: users can’t reliably tell when an answer is wrong, making even rare errors dangerous as perceived accuracy rises.
One comparison likens pushing from ~90% to near-perfect accuracy to approaching light speed: returns on data and compute diminish, and it is unclear what “perfect” even means for language.
Some argue smaller, domain-specific models plus tools (calculators, search, symbolic engines) and agent-style orchestration may be more realistic than a single “god model.”

Data, Synthetic Data, and Limits

Several argue that data, not just compute, is the real bottleneck; quantization may work better on undertrained models, which are more redundant.
Synthetic data (e.g., toy story datasets) is viewed by some as promising for capabilities like coherence at small scale; others criticize its low quality and worry about “garbage in, garbage out.”
There’s debate over how far synthetic data and in‑context learning can substitute for scarce or missing real-world data (e.g., low‑resource languages).

Practical Uses and “Good Enough” Models

Many see value in mid‑size “Goldilocks” models that approximate GPT‑3.5‑level capabilities but run locally and cheaply, especially as NPUs become common.
For some tasks (autocomplete, low‑risk assistance, brainstorming) lower‑precision or tiny models are considered acceptable and attractive.
Several note that current deployment decisions depend heavily on hardware and the inference stack (e.g., llama.cpp vs specialized inference engines), and on quantization that fits big models into limited VRAM.
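
The VRAM point comes down to simple arithmetic, sketched below in Python. The function names, the parameter counts, and the flat overhead allowance for KV cache and activations are all assumptions for illustration; real overhead depends on context length, batch size, and the inference engine.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30

def fits_in_vram(n_params: float, bits_per_param: float,
                 vram_gib: float, overhead_gib: float = 1.5) -> bool:
    """Rough check: weights plus an assumed flat overhead must fit.

    overhead_gib is a placeholder, not a measured value.
    """
    return weight_memory_gib(n_params, bits_per_param) + overhead_gib <= vram_gib

# Hypothetical 70B-parameter model at several precisions, on a 24 GiB card.
for bits in (16, 8, 4, 2.5):
    size = weight_memory_gib(70e9, bits)
    verdict = "fits" if fits_in_vram(70e9, bits, vram_gib=24) else "does not fit"
    print(f"{bits:>4} bits/param: {size:6.1f} GiB of weights, {verdict} in 24 GiB")
```

Under these assumptions, a 70B model needs roughly 130 GiB of weights at 16 bits and about 33 GiB at 4 bits, which is why low-bit quantization matters so much for local deployment.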
