Quantized Llama models with increased speed and a reduced memory footprint
Model & quantization overview
- Meta released quantized Llama 3.2 1B and 3B models with near-original accuracy, using post-training quantization (PTQ), SpinQuant, and QLoRA-style quantization-aware training.
- The QLoRA variants are further fine-tuned with weights quantized to nf4, which gives higher accuracy than plain PTQ but at higher compute cost.
- SpinQuant learns rotations to “smear out” outliers in weights/activations. It doesn’t consistently beat nf4 QLoRA but offers better throughput and memory use.
- A Meta engineer clarifies that the “vanilla PTQ” baseline is simple 4-bit per-group symmetric weight quantization with 8-bit activations, with no advanced scheme like AWQ/GPTQ/SpinQuant (a minimal sketch of that baseline follows this list).
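A minimal sketch of what that baseline looks like, assuming per-group symmetric int4 quantization along the last weight dimension (the group size, helper names, and numpy implementation are illustrative, not Meta's code):

```python
import numpy as np

def quantize_int4_per_group(w: np.ndarray, group_size: int = 32):
    """Symmetric 4-bit per-group weight quantization (illustrative sketch).

    Each group of `group_size` consecutive weights shares one scale;
    values are rounded to integers in [-8, 7].
    """
    orig_shape = w.shape
    groups = w.reshape(-1, group_size)
    # One scale per group: the largest magnitude maps to the int4 extreme.
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q.reshape(orig_shape), scales.astype(np.float16)

def dequantize(q: np.ndarray, scales: np.ndarray, group_size: int = 32):
    """Reverse mapping: int4 codes times per-group scales."""
    groups = q.reshape(-1, group_size).astype(np.float32)
    return (groups * scales.astype(np.float32)).reshape(q.shape)

# Example: reconstruction error on a random weight matrix.
w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int4_per_group(w)
print("mean abs error:", np.abs(w - dequantize(q, s)).mean())
```

Storing 4-bit codes plus one fp16 scale per 32-weight group works out to roughly 4.5 bits per weight, which is where the large size reductions come from.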
Performance, VRAM, and benefits of quantization
- Quantization mainly helps because LLM inference is memory-bound: each generated token streams the weight matrices through a matrix-vector multiply, so fewer bits per weight mean less memory bandwidth used and faster inference.
- On phones (e.g., OnePlus 12), Meta reports large latency and memory gains and ~56% model size reduction.
- Some commenters wish releases would state VRAM needs directly; others argue it’s roughly “parameters × bytes per weight,” with KV cache and context length being the real variable.
- Discussion covers how to estimate KV cache impact (a rough estimator is sketched after this list) and notes that quantization is also used simply to make models fit on smaller GPUs.
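As a back-of-the-envelope for those estimates, assuming the usual formulas (weights ≈ parameters × bytes per weight; KV cache ≈ 2 × layers × KV heads × head dim × bytes per value × context tokens) and illustrative 3B-class dimensions rather than official figures:

```python
def estimate_memory_and_speed(
    params_b: float,           # parameters in billions
    bytes_per_weight: float,   # 2 for fp16, ~0.5 for 4-bit
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    kv_bytes: float = 2.0,         # fp16 KV cache
    bandwidth_gbps: float = 100.0, # device memory bandwidth in GB/s
):
    weight_bytes = params_b * 1e9 * bytes_per_weight
    # K and V per layer, per token: n_kv_heads * head_dim values each.
    kv_cache_bytes = 2 * n_layers * n_kv_heads * head_dim * kv_bytes * context_tokens
    # Decoding streams (roughly) all weights once per generated token,
    # so the token rate is bounded by bandwidth / weight size.
    tokens_per_sec = bandwidth_gbps * 1e9 / weight_bytes
    return weight_bytes / 1e9, kv_cache_bytes / 1e9, tokens_per_sec

# Illustrative numbers in the ballpark of a 3B model (not official figures).
w_gb, kv_gb, tps = estimate_memory_and_speed(
    params_b=3.2, bytes_per_weight=0.5,
    n_layers=28, n_kv_heads=8, head_dim=128,
    context_tokens=8192,
)
print(f"weights ~{w_gb:.1f} GB, KV cache ~{kv_gb:.2f} GB, ~{tps:.0f} tok/s upper bound")
```

The decode-speed bound reflects the memory-bound argument above: every generated token has to read roughly the whole weight set once, so halving bytes per weight roughly doubles the ceiling.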
Quality and use cases of 1B/3B models
- Mixed experiences: some find 3B models capable for lightweight tasks, testing, or simple Slack bots; others report poor performance even on basic classification/translation and reject them for production.
- Several report small models frequently ignoring instructions like “output only X,” especially with longer inputs.
- Recommended mitigations: constrained grammars, schema-based tools, multi-step prompting, short contexts, or dedicated JSON/grammar support in inference servers.
- There is debate on “speculative decoding”: one side claims it preserves the exact output distribution of the large model, another describes it as only “tolerably close” (see the acceptance-rule sketch after this list).
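On the speculative-decoding point, the standard scheme verifies drafted tokens with a rejection step whose accept/resample rule reproduces the large model's distribution exactly; a toy single-token sketch (made-up distributions, numpy only):

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p_target: np.ndarray, q_draft: np.ndarray) -> int:
    """Accept or reject one drafted token so the result follows p_target.

    The draft model proposes x ~ q_draft; it is accepted with probability
    min(1, p_target[x] / q_draft[x]); on rejection we resample from the
    residual distribution max(0, p_target - q_draft), renormalized.
    """
    x = rng.choice(len(q_draft), p=q_draft)
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x
    residual = np.maximum(p_target - q_draft, 0.0)
    residual /= residual.sum()
    return rng.choice(len(residual), p=residual)

# Toy vocabulary of 4 tokens: empirical frequencies match p_target, not q_draft.
p = np.array([0.5, 0.2, 0.2, 0.1])
q = np.array([0.25, 0.25, 0.25, 0.25])
samples = [speculative_step(p, q) for _ in range(50_000)]
print(np.bincount(samples, minlength=4) / len(samples))  # ~[0.5, 0.2, 0.2, 0.1]
```

Implementations that skip or loosen the rejection step do trade fidelity for speed, which may be where the “tolerably close” impression comes from.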
Structured output & control
- Thread dives into advanced methods for constrained/JSON output:
  - Grammars in llama.cpp and other frameworks.
  - Pre-filling outputs (e.g., starting the assistant response with `{` or `json`).
  - State-machine approaches over JSON schemas that alternate fixed tokens and model-generated spans (sketched after this list).
- Users note grammars can slightly hurt output quality, and escaping errors can still yield semantically wrong but valid JSON.
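A minimal sketch of the state-machine idea from the sub-list above, assuming a hypothetical generate_value() helper standing in for whatever inference server is used; the program emits the fixed JSON scaffolding itself and only lets the model fill in value spans:

```python
import json

def generate_value(prompt: str, stop: str) -> str:
    """Placeholder for a real model call (hypothetical helper, not a real API).

    In practice this would hit llama.cpp, Ollama, etc. and stop at `stop`.
    """
    return "positive"  # canned answer so the sketch runs standalone

def fill_schema(text: str, fields: list[str]) -> dict:
    """Alternate fixed tokens (keys, braces, quotes) with model-generated spans."""
    out, prompt = {}, f"Classify the text:\n{text}\nAnswer as JSON.\n{{"
    for field in fields:
        # We write the key and opening quote ourselves...
        prompt += f' "{field}": "'
        # ...and only let the model produce the value, stopping at the closing quote.
        value = generate_value(prompt, stop='"')
        out[field] = value
        prompt += value + '",'
    return out

result = fill_schema("Great battery life, mediocre camera.", ["sentiment"])
print(json.dumps(result))  # e.g. {"sentiment": "positive"}
```

Serializing the collected values with json.dumps keeps the output syntactically valid JSON regardless of what the model produced for each span; the remaining risk is the model putting the wrong content in a span, which no output constraint can fix.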
On-device and app deployment
- Suggested iOS/Android options: MLC Chat, PocketGPT, PocketPal, or running Ollama/llama.cpp remotely and accessing via SSH/Matrix.
- Some are exploring bundling llama.cpp directly in Android apps; Termux-based setups work but are seen as too technical for most users.
- ExecuTorch is mentioned as Meta’s mobile/embedded runtime; still early but positioned for fast on-device inference.
Meta, ecosystem, and “open source”
- Several express appreciation that Meta released code, models, and full comparison tables without overclaiming.
- Others criticize over-engineered stacks (like Llama Stack) and difficulty getting CUDA or simple deployments working.
- One comment objects to Meta calling Llama “open source” without releasing training data.
- Some see these releases as part of LLM commoditization and note that excitement/discussion density is lower than for major frontier-model launches.