Quantized Llama models with increased speed and a reduced memory footprint

Model & quantization overview

  • Meta released quantized Llama 3.2 1B and 3B models with near-original accuracy, using PTQ, SpinQuant, and QLoRA-style quantization-aware training.
  • The QLoRA variants are further fine-tuned with weights aligned to NF4, giving higher accuracy than plain PTQ but at higher training cost.
  • SpinQuant learns rotation matrices that “smear out” outliers in weights and activations. It doesn’t consistently beat the NF4 QLoRA variants, but offers better throughput and memory use.
  • A Meta engineer clarifies that the “vanilla PTQ” baseline is simple 4-bit per-group symmetric weight quantization with 8-bit activations, with no advanced scheme such as AWQ, GPTQ, or SpinQuant.
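The baseline described in the last bullet fits in a few lines. A minimal sketch of 4-bit per-group symmetric weight quantization; the group size of 32 is an illustrative choice, not Meta's published setting:

```python
def quantize_group_sym4(weights, group_size=32):
    """Symmetric 4-bit per-group quantization: each group of weights shares
    one floating-point scale; values map to integers in [-8, 7]."""
    quantized, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        amax = max(abs(w) for w in group) or 1.0  # avoid div-by-zero on all-zero groups
        scale = amax / 7.0                        # 7 = largest positive int4 value
        scales.append(scale)
        quantized.append([max(-8, min(7, round(w / scale))) for w in group])
    return quantized, scales

def dequantize(quantized, scales):
    """Reconstruct approximate weights: integer value × per-group scale."""
    out = []
    for group, scale in zip(quantized, scales):
        out.extend(q * scale for q in group)
    return out
```

Per-group scales are what make this “per-group”: a single outlier only degrades precision within its own group of 32 weights rather than across the whole tensor, which is also the problem SpinQuant's rotations attack from a different angle.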

Performance, VRAM, and benefits of quantization

  • Quantization mainly helps because LLM inference is memory-bound (each generated token requires reading every weight for a matrix-vector product); fewer bits per weight → less memory traffic → faster inference.
  • On phones (e.g., OnePlus 12), Meta reports large latency and memory gains and ~56% model size reduction.
  • Some commenters wish releases would state VRAM needs directly; others argue the weights are simply “parameters × bytes per weight,” with the KV cache (which scales with context length) being the real variable.
  • Discussion covers how to estimate KV cache impact and notes that quantization is also used just to make models fit on smaller GPUs.
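The back-of-envelope math from these comments is easy to script. In the sketch below, the Llama 3.2 3B shapes (28 layers, 8 KV heads, head dim 128) and the 1 TB/s bandwidth figure are illustrative assumptions, not quoted from the thread:

```python
def weight_bytes(n_params, bits_per_weight):
    """Rule of thumb from the thread: parameters × bytes per weight."""
    return n_params * bits_per_weight // 8

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """KV cache: 2 tensors (K and V) × layers × kv_heads × head_dim per token,
    at bytes_per_elem precision (2 for fp16)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

def tokens_per_sec_ceiling(total_weight_bytes, mem_bandwidth_bytes_per_sec):
    """Memory-bound upper bound: every token must stream all weights once."""
    return mem_bandwidth_bytes_per_sec / total_weight_bytes

# Illustrative numbers for a 3B model at 4 bits on a ~1 TB/s GPU:
w = weight_bytes(3_000_000_000, 4)          # 1.5 GB of weights
kv = kv_cache_bytes(28, 8, 128, 8192)       # ~0.94 GB KV cache at 8K context, fp16
ceiling = tokens_per_sec_ceiling(w, 1e12)   # ~667 tok/s bandwidth ceiling
```

The last function is why quantization speeds up decoding even with dequantization overhead: halving the bytes per weight roughly doubles the bandwidth-limited token rate.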

Quality and use cases of 1B/3B models

  • Mixed experiences: some find 3B models capable for lightweight tasks, testing, or simple Slack bots; others report poor performance even on basic classification/translation and reject them for production.
  • Several report small models frequently ignoring instructions like “output only X,” especially with longer inputs.
  • Recommended mitigations: constrained grammars, schema-based tools, multi-step prompting, short contexts, or dedicated JSON/grammar support in inference servers.
  • There is debate on “speculative decoding”: one side claims it preserves the large model’s output distribution exactly; another describes the output as only “tolerably close.”
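The exactness side of that debate is checkable on paper. Under the standard accept/reject rule (accept a draft token x drawn from the small model's distribution q with probability min(1, p(x)/q(x)); on rejection, resample from the normalized positive part of p − q), the marginal distribution of the emitted token is exactly the large model's p. A toy analytic check with hand-picked distributions, not a real decoder:

```python
def speculative_marginal(p, q):
    """Compute the exact marginal of one speculative-decoding step over a
    small discrete vocabulary: draft proposes x ~ q, accepted with
    probability min(1, p[x]/q[x]); on rejection, resample from the
    residual distribution norm(max(0, p - q))."""
    n = len(p)
    accept = [q[i] * min(1.0, p[i] / q[i]) if q[i] > 0 else 0.0 for i in range(n)]
    reject_mass = 1.0 - sum(accept)
    residual = [max(0.0, p[i] - q[i]) for i in range(n)]
    z = sum(residual)
    return [accept[i] + (reject_mass * residual[i] / z if z > 0 else 0.0)
            for i in range(n)]
```

For example, with p = [0.5, 0.3, 0.2] and q = [0.2, 0.5, 0.3], the marginal works out to [0.5, 0.3, 0.2] = p. The “tolerably close” experiences likely come from implementations that skip the residual resampling step or greedily accept mismatched tokens.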

Structured output & control

  • Thread dives into advanced methods for constrained/JSON output:
    • Grammars in llama.cpp and other frameworks.
    • Pre-filling outputs (e.g., starting with { or json).
    • State machine approaches over JSON schemas that alternate fixed tokens and model-generated spans.
  • Users note grammars can slightly hurt output quality, and escaping errors can still yield semantically wrong but valid JSON.
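The state-machine idea above can be sketched as a template walker: structural JSON tokens are emitted verbatim, and only the value spans are delegated to the model. Here `generate_value` is a hypothetical stand-in for the model call, and the flat {field: type} schema is a simplifying assumption:

```python
import json

def fill_schema(schema, generate_value):
    """Alternate fixed tokens with model-generated spans: the JSON skeleton
    (braces, quoted keys, commas) is emitted verbatim; the model only
    produces values, which are coerced/escaped to keep the output valid."""
    out = ["{"]
    for i, (field, ftype) in enumerate(schema.items()):
        if i:
            out.append(",")
        out.append(json.dumps(field) + ":")   # fixed token: quoted key + colon
        raw = generate_value(field, ftype)    # model-generated span
        if ftype is int:
            out.append(str(int(raw)))         # force a numeric literal
        else:
            out.append(json.dumps(str(raw)))  # escape free text
    out.append("}")
    return "".join(out)
```

Note that the failure mode from the last bullet survives: escaping guarantees the output parses, but if the model returns the wrong text for a span, the JSON is valid yet semantically wrong.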

On-device and app deployment

  • Suggested iOS/Android options: MLC Chat, PocketGPT, PocketPal, or running Ollama/llama.cpp remotely and accessing via SSH/Matrix.
  • Some are exploring bundling llama.cpp directly in Android apps; Termux-based setups work but are seen as too technical for most users.
  • ExecuTorch is mentioned as Meta’s mobile/embedded runtime; still early but positioned for fast on-device inference.

Meta, ecosystem, and “open source”

  • Several express appreciation that Meta released code, models, and full comparison tables without overclaiming.
  • Others criticize over-engineered stacks (like Llama Stack) and difficulty getting CUDA or simple deployments working.
  • One comment objects to Meta calling Llama “open source” without releasing training data.
  • Some see these releases as part of LLM commoditization and note that excitement/discussion density is lower than for major frontier-model launches.