Quantized Llama models with increased speed and a reduced memory footprint
Model & quantization overview
- Meta released quantized Llama 3.2 1B and 3B models with near-original accuracy, using post-training quantization (PTQ), SpinQuant, and QLoRA-style quantization-aware training.
- The QLoRA variants are further fine-tuned with weights quantized to nf4, which gives higher accuracy than plain PTQ but at higher compute cost.
- SpinQuant learns rotations to “smear out” outliers in weights/activations. It doesn’t consistently beat nf4 QLoRA but offers better throughput and memory use.
- A Meta engineer clarifies that the “vanilla PTQ” baseline is simple 4-bit per-group symmetric weight quantization with 8-bit activations, with no advanced scheme like AWQ/GPTQ/SpinQuant (a minimal sketch of that baseline follows this list).
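A minimal sketch of what that baseline looks like, assuming per-group symmetric int4 quantization along the last weight dimension (the group size, helper names, and numpy implementation are illustrative, not Meta's code):

```python
import numpy as np

def quantize_int4_per_group(w: np.ndarray, group_size: int = 32):
    """Symmetric 4-bit per-group weight quantization (illustrative sketch).

    Each group of `group_size` consecutive weights shares one scale;
    values are rounded to integers in [-8, 7].
    """
    orig_shape = w.shape
    groups = w.reshape(-1, group_size)
    # One scale per group: the largest magnitude maps to the int4 extreme.
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q.reshape(orig_shape), scales.astype(np.float16)

def dequantize(q: np.ndarray, scales: np.ndarray, group_size: int = 32):
    """Reverse mapping: int4 codes times per-group scales."""
    groups = q.reshape(-1, group_size).astype(np.float32)
    return (groups * scales.astype(np.float32)).reshape(q.shape)

# Example: reconstruction error on a random weight matrix.
w = np.random.randn(256, 256).astype(np.float32)
q, s = quantize_int4_per_group(w)
print("mean abs error:", np.abs(w - dequantize(q, s)).mean())
```

Storing 4-bit codes plus one fp16 scale per 32-weight group works out to roughly 4.5 bits per weight, which is where the large size reductions come from.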
Performance, VRAM, and benefits of quantization
- Quantization mainly helps because LLM inference is memory-bound: each generated token streams the weight matrices through a matrix-vector multiply, so fewer bits per weight mean less memory bandwidth used and faster inference.
- On phones (e.g., OnePlus 12), Meta reports large latency and memory gains and ~56% model size reduction.
- Some commenters wish releases would state VRAM needs directly; others argue it’s roughly “parameters × bytes per weight,” with KV cache and context length being the real variable.
- Discussion covers how to estimate KV cache impact (a rough estimator is sketched after this list) and notes that quantization is also used simply to make models fit on smaller GPUs.
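As a back-of-the-envelope for those estimates, assuming the usual formulas (weights ≈ parameters × bytes per weight; KV cache ≈ 2 × layers × KV heads × head dim × bytes per value × context tokens) and illustrative 3B-class dimensions rather than official figures:

```python
def estimate_memory_and_speed(
    params_b: float,           # parameters in billions
    bytes_per_weight: float,   # 2 for fp16, ~0.5 for 4-bit
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    kv_bytes: float = 2.0,         # fp16 KV cache
    bandwidth_gbps: float = 100.0, # device memory bandwidth in GB/s
):
    weight_bytes = params_b * 1e9 * bytes_per_weight
    # K and V per layer, per token: n_kv_heads * head_dim values each.
    kv_cache_bytes = 2 * n_layers * n_kv_heads * head_dim * kv_bytes * context_tokens
    # Decoding streams (roughly) all weights once per generated token,
    # so the token rate is bounded by bandwidth / weight size.
    tokens_per_sec = bandwidth_gbps * 1e9 / weight_bytes
    return weight_bytes / 1e9, kv_cache_bytes / 1e9, tokens_per_sec

# Illustrative numbers in the ballpark of a 3B model (not official figures).
w_gb, kv_gb, tps = estimate_memory_and_speed(
    params_b=3.2, bytes_per_weight=0.5,
    n_layers=28, n_kv_heads=8, head_dim=128,
    context_tokens=8192,
)
print(f"weights ~{w_gb:.1f} GB, KV cache ~{kv_gb:.2f} GB, ~{tps:.0f} tok/s upper bound")
```

The decode-speed bound reflects the memory-bound argument above: every generated token has to read roughly the whole weight set once, so halving bytes per weight roughly doubles the ceiling.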
Quality and use cases of 1B/3B models
- Mixed experiences: some find 3B models capable for lightweight tasks, testing, or simple Slack bots; others report poor performance even on basic classification/translation and reject them for production.
- Several report small models frequently ignoring instructions like “output only X,” especially with longer inputs.
- Recommended mitigations: constrained grammars, schema-based tools, multi-step prompting, short contexts, or dedicated JSON/grammar support in inference servers.
- There is debate on “speculative decoding”: one side claims it preserves the exact output distribution of the large model, another describes it as only “tolerably close” (see the acceptance-rule sketch after this list).
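On the speculative-decoding point, the standard scheme verifies drafted tokens with a rejection step whose accept/resample rule reproduces the large model's distribution exactly; a toy single-token sketch (made-up distributions, numpy only):

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p_target: np.ndarray, q_draft: np.ndarray) -> int:
    """Accept or reject one drafted token so the result follows p_target.

    The draft model proposes x ~ q_draft; it is accepted with probability
    min(1, p_target[x] / q_draft[x]); on rejection we resample from the
    residual distribution max(0, p_target - q_draft), renormalized.
    """
    x = rng.choice(len(q_draft), p=q_draft)
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x
    residual = np.maximum(p_target - q_draft, 0.0)
    residual /= residual.sum()
    return rng.choice(len(residual), p=residual)

# Toy vocabulary of 4 tokens: empirical frequencies match p_target, not q_draft.
p = np.array([0.5, 0.2, 0.2, 0.1])
q = np.array([0.25, 0.25, 0.25, 0.25])
samples = [speculative_step(p, q) for _ in range(50_000)]
print(np.bincount(samples, minlength=4) / len(samples))  # ~[0.5, 0.2, 0.2, 0.1]
```

Implementations that skip or loosen the rejection step do trade fidelity for speed, which may be where the “tolerably close” impression comes from.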
Structured output & control
- Thread dives into advanced methods for constrained/JSON output:
  - Grammars in llama.cpp and other frameworks.
  - Pre-filling outputs (e.g., starting the assistant response with `{` or `json`).
  - State-machine approaches over JSON schemas that alternate fixed tokens and model-generated spans (sketched after this list).
- Users note grammars can slightly hurt output quality, and escaping errors can still yield semantically wrong but valid JSON.
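A minimal sketch of the state-machine idea from the sub-list above, assuming a hypothetical generate_value() helper standing in for whatever inference server is used; the program emits the fixed JSON scaffolding itself and only lets the model fill in value spans:

```python
import json

def generate_value(prompt: str, stop: str) -> str:
    """Placeholder for a real model call (hypothetical helper, not a real API).

    In practice this would hit llama.cpp, Ollama, etc. and stop at `stop`.
    """
    return "positive"  # canned answer so the sketch runs standalone

def fill_schema(text: str, fields: list[str]) -> dict:
    """Alternate fixed tokens (keys, braces, quotes) with model-generated spans."""
    out, prompt = {}, f"Classify the text:\n{text}\nAnswer as JSON.\n{{"
    for field in fields:
        # We write the key and opening quote ourselves...
        prompt += f' "{field}": "'
        # ...and only let the model produce the value, stopping at the closing quote.
        value = generate_value(prompt, stop='"')
        out[field] = value
        prompt += value + '",'
    return out

result = fill_schema("Great battery life, mediocre camera.", ["sentiment"])
print(json.dumps(result))  # e.g. {"sentiment": "positive"}
```

Serializing the collected values with json.dumps keeps the output syntactically valid JSON regardless of what the model produced for each span; the remaining risk is the model putting the wrong content in a span, which no output constraint can fix.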
On-device and app deployment
- Suggested iOS/Android options: MLC Chat, PocketGPT, PocketPal, or running Ollama/llama.cpp remotely and accessing via SSH/Matrix.
- Some are exploring bundling llama.cpp directly in Android apps; Termux-based setups work but are seen as too technical for most users.
- ExecuTorch is mentioned as Meta’s mobile/embedded runtime; still early but positioned for fast on-device inference.
Meta, ecosystem, and “open source”
- Several express appreciation that Meta released code, models, and full comparison tables without overclaiming.
- Others criticize over-engineered stacks (like Llama Stack) and difficulty getting CUDA or simple deployments working.
- One comment objects to Meta calling Llama “open source” without releasing training data.
- Some see these releases as part of LLM commoditization and note that excitement/discussion density is lower than for major frontier-model launches.