The Llama 4 herd

Release, links, and model lineup

  • Initial confusion over whether the release was a leak, since some links 404'd; clarified by the official blog post and docs on ai.meta.com / llama.com.
  • Three main models discussed:
    • Scout: 17B active, 109B total (16-expert MoE), 10M context, single-H100 capable with quantization.
    • Maverick: 17B active, 400B total (128-expert MoE), 1M context, multi-GPU / DGX-scale.
    • Behemoth (teacher, not released): ~288B active, ~2T total parameters, still training; used to distill the smaller models.

Context window, architecture, and RAG

  • The 10M-token context in Scout draws heavy interest; people debate how far useful recall actually extends.
  • Meta’s “iRoPE” (interleaved RoPE/NoPE positional encoding) is cited as the main trick; some relate it to prior long-context methods (see the sketch after this list).
  • Several commenters expect quality to degrade with distance and call for better long-context benchmarks than “needle-in-a-haystack.”
  • Many argue RAG remains needed: for cost, latency, grounding, and because 10M tokens still can’t cover large or evolving corpora (e.g. Wikipedia, big repos).
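
A rough illustration of the interleaving idea as commenters describe it: most attention layers apply rotary position embeddings, while some layers skip explicit position encoding entirely (NoPE) and rely on content alone. Meta has not published full implementation details, so the layer pattern, names, and dimensions below are illustrative assumptions, not the released architecture.

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq_len, head_dim)."""
    seq_len, dim = x.shape
    half = dim // 2
    inv_freq = base ** (-np.arange(half) / half)               # per-pair frequencies
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

class ToyAttentionLayer:
    """'rope' layers see explicit positions; 'nope' layers do not."""
    def __init__(self, kind):
        self.kind = kind

    def encode_positions(self, q, k):
        if self.kind == "rope":
            return rope(q), rope(k)
        return q, k  # NoPE: no positional signal injected into q/k

# Hypothetical interleaving pattern (not Meta's published one):
# every fourth layer drops positional encoding to ease length extrapolation.
layers = [ToyAttentionLayer("nope" if i % 4 == 3 else "rope") for i in range(8)]

q = k = np.random.randn(6, 8)  # 6 positions, head_dim 8
for layer in layers:
    q_enc, k_enc = layer.encode_positions(q, k)
```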

MoE design, hardware, and self‑hosting

  • Repeated clarification: 17B “active” parameters ≠ a 17B model; total size (109B / 400B) determines RAM/VRAM needs.
  • MoE experts are per-layer, router-selected subnetworks, not human-interpretable topic specialists; routing is optimized mainly for load balancing and loss, not semantics (see the sketch after this list).
  • Tradeoff: near-dense quality at lower per-token compute, but the large total parameter footprint makes local inference hard without 64–512 GB-class machines or multi-GPU rigs.
  • Long discussions on:
    • Quantization materially reducing memory requirements.
    • Apple Silicon (M3/M4, Mac Studio) vs 4090/5090 vs AMD APUs and Tenstorrent for home inference.
    • Prompt prefill vs generation bottlenecks for huge contexts.
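
To make the “active vs. total” distinction concrete, here is a toy top-k MoE layer plus the back-of-the-envelope memory arithmetic behind the hardware debate. Expert count, router shape, and activation are illustrative assumptions, not Llama 4's actual configuration; the GiB figures follow directly from total parameter count and bit width.

```python
import numpy as np

def moe_layer(x, experts, router_w, top_k=2):
    """Toy MoE layer: each token runs only its top_k experts, but every
    expert's weights must stay resident in memory."""
    logits = x @ router_w                                        # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]                # chosen expert ids
    gate = np.take_along_axis(logits, top, axis=-1)
    gate = np.exp(gate) / np.exp(gate).sum(-1, keepdims=True)    # softmax over top_k
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for s in range(top_k):
            out[t] += gate[t, s] * experts[top[t, s]](x[t])
    return out

# Tiny usage example with made-up dimensions.
dim, n_experts = 8, 4
experts = [lambda v, W=np.random.randn(dim, dim): np.tanh(v @ W) for _ in range(n_experts)]
y = moe_layer(np.random.randn(5, dim), experts, np.random.randn(dim, n_experts))

# Why "17B active" still needs big machines: all experts are loaded, so the
# weight footprint scales with *total* parameters, not active ones.
def weight_gib(total_params_billion, bits_per_param):
    return total_params_billion * 1e9 * bits_per_param / 8 / 2**30

for bits in (16, 8, 4):
    print(f"109B total @ {bits:>2}-bit ≈ {weight_gib(109, bits):5.0f} GiB (weights only)")
# ~203 GiB at fp16, ~102 GiB at int8, ~51 GiB at 4-bit — hence "single H100
# with quantization" for Scout, and multi-GPU rigs for Maverick's 400B total.
```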

System prompt, alignment, and politics

  • The suggested system prompt explicitly discourages moral lecturing, “it’s important to…”-style filler, and political refusals; it encourages chit-chat, venting, and even rude output if the user requests it (see the sketch after this list).
  • Some praise this as less “neutered” than prior LLMs; others worry it downplays helpfulness, critical thinking, and safety.
  • Large subthread on whether prior models were “left-leaning,” whether Meta is “debiasing” or adding a different bias, and whether “truth” vs “bias” is even a coherent distinction.
  • Early testing suggests:
    • Looser NSFW and insult behavior but still some guardrails, especially on sensitive classification (e.g. inferring politics from images).
    • Political and social responses remain constrained without heavy prompt engineering.
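
For readers wondering where the “suggested prompt” actually goes, here is a minimal sketch of supplying a custom system prompt through a Hugging Face chat template. The repo name is an assumption and may differ from the final release, and the prompt text only paraphrases the tone Meta's suggestion encourages; it is not the official wording.

```python
from transformers import AutoTokenizer

# Repo name is an assumption; check the official Llama 4 collection on Hugging Face.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-4-Scout-17B-16E-Instruct")

# Paraphrase of the suggested style, NOT Meta's official system prompt.
system_prompt = (
    "You are a conversational assistant. Do not lecture or moralize, avoid "
    "filler like 'it's important to...', and don't refuse political topics. "
    "Match the user's tone, including casual chat and venting."
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Give me your unfiltered take on my plan."},
]

# Renders the conversation into the model's expected prompt format.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```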

Benchmarks vs real-world behavior

  • Meta’s charts show Maverick competing with GPT‑4o / Gemini 2.0 Flash, but the omission of newer models (Gemini 2.5, the o‑series, DeepSeek R1) draws skepticism.
  • LMArena head-to-head voting initially ranked Maverick near the top, but later reports claim the arena model differed from the released checkpoint, prompting accusations of benchmark gaming.
  • Aider’s coding benchmark shows Maverick only matching a 32B specialized coder model and trailing Gemini 2.5 Pro / Claude 3.7 Sonnet by a wide margin, raising questions about coding quality.
  • Multiple commenters note existing benchmarks (especially multimodal) are weak: too much OCR/MCQ, little “in the wild” reasoning.

Licensing, openness, and data ethics

  • License is widely criticized as “open weights, not open source”:
    • Commercial use requires a separate license from Meta for services above 700M monthly active users.
    • Branding (“built with Llama”) and naming requirements.
    • Acceptable-use policy controlling downstream uses.
  • Hugging Face access friction for earlier Llama versions already upset some; people expect similar gating here.
  • Strong thread on training data ethics: accusations of large-scale scraping/piracy (e.g. books), with some arguing full transparency of training data should be required.

Ecosystem, performance, and sentiment

  • Groq quickly exposes Scout and Maverick with very high token throughput at low prices; several people test via Groq/OpenRouter and compare against Gemini/Claude/OpenAI (see the sketch at the end of this list).
  • Early impressions:
    • Vision is clearly improved over Llama 3 but still trails GPT‑4o and top Qwen models.
    • Instruction following and writing quality seen as below Gemini 2.5 / Claude in some tests.
  • No “reasoning” variant yet; a placeholder “Llama 4 reasoning is coming” page suggests a later RL‑style reasoning release.
  • Community mood mixes excitement (especially about long context and open weights) with fatigue over yet another giant MoE, the lack of small dense models, licensing constraints, and unresolved political/data questions.
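
A minimal sketch of the kind of quick test people describe running against Groq or OpenRouter: time a streamed completion and estimate tokens per second. Any OpenAI-compatible client works against these endpoints; the model slug shown is an assumption, so check the provider's catalog before running.

```python
import time
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible API; Groq's endpoint works the same way.
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

start, chunks = time.time(), []
stream = client.chat.completions.create(
    model="meta-llama/llama-4-scout",   # assumed slug; verify in the provider catalog
    messages=[{"role": "user", "content": "Summarize the Llama 4 release in 200 words."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunks.append(chunk.choices[0].delta.content)

elapsed = time.time() - start
text = "".join(chunks)
# Crude throughput estimate: streamed chunk count roughly tracks generated tokens.
print(f"~{len(chunks) / elapsed:.0f} tokens/s over {elapsed:.1f}s, {len(text)} chars")
```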