The Llama 4 herd
Release, links, and model lineup
- Initial confusion over whether the release was a leak, since early links 404ed; clarified by the official blog post and docs on ai.meta.com / llama.com.
- Three main models discussed:
- Scout: 17B active, 109B total MoE, 10M-token context, single-H100 capable (with quantization).
- Maverick: 17B active, 400B total MoE, 1M-token context, multi-GPU / DGX-scale.
- Behemoth (teacher, not released): ~288B active, ~2T total parameters, still training; used for distillation.
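A quick sanity check on why total, not active, parameter count sets the hardware floor; a minimal sketch in Python, counting weights only (KV cache and activations add more on top):

```python
# Rough weight-only memory footprint; KV cache and activations are extra.
def weight_gib(total_params_billions: float, bits_per_param: int) -> float:
    return total_params_billions * 1e9 * bits_per_param / 8 / 2**30

for name, total_b in [("Scout", 109), ("Maverick", 400)]:
    for bits in (16, 8, 4):
        print(f"{name:8s} @ {bits:2d}-bit: {weight_gib(total_b, bits):6.0f} GiB")

# Scout at 4-bit is ~51 GiB of weights, which is why "single H100 (80 GB)
# with quantization" is plausible; Maverick at 4-bit (~186 GiB) still
# needs multiple GPUs.
```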
Context window, architecture, and RAG
- 10M-token context in Scout draws heavy interest; people debate whether useful recall extends beyond a fraction of that.
- Meta’s “iRoPE” (interleaved RoPE/NoPE positional encodings) is cited as the main trick; some relate it to prior long-context methods. A rough layer-interleaving sketch follows this list.
- Several commenters suspect real performance degrades with distance and call for long-context benchmarks better than “needle-in-a-haystack” retrieval.
- Many argue RAG remains necessary for cost, latency, and grounding, and because even 10M tokens cannot cover large or evolving corpora (e.g. Wikipedia, big repos).
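Meta has published few details of iRoPE beyond the idea of interleaving rotary-embedded attention layers with layers that use no positional encoding. The PyTorch sketch below only illustrates that interleaving; the 3:1 RoPE/NoPE ratio, the dimensions, and the unbatched attention are illustrative assumptions, not Meta's configuration.

```python
import torch

def apply_rope(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE: rotate channel pairs by position-dependent angles."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)  # (half,)
    angles = pos[:, None].float() * freqs[None, :]                     # (seq, half)
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]      # broadcast over heads
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class InterleavedAttention(torch.nn.Module):
    """Causal self-attention; applies RoPE only when use_rope=True (NoPE otherwise)."""
    def __init__(self, dim: int, n_heads: int, use_rope: bool):
        super().__init__()
        self.n_heads, self.head_dim, self.use_rope = n_heads, dim // n_heads, use_rope
        self.qkv = torch.nn.Linear(dim, 3 * dim, bias=False)
        self.out = torch.nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (seq, dim), unbatched
        seq, dim = x.shape
        q, k, v = (t.view(seq, self.n_heads, self.head_dim) for t in self.qkv(x).chunk(3, -1))
        if self.use_rope:
            pos = torch.arange(seq)
            q, k = apply_rope(q, pos), apply_rope(k, pos)
        att = torch.einsum("qhd,khd->hqk", q, k) / self.head_dim ** 0.5
        att = att.masked_fill(torch.ones(seq, seq, dtype=torch.bool).triu(1), float("-inf"))
        return self.out(torch.einsum("hqk,khd->qhd", att.softmax(-1), v).reshape(seq, dim))

# Assumed ratio: RoPE on three of every four layers, NoPE on the fourth.
layers = [InterleavedAttention(512, 8, use_rope=(i % 4 != 3)) for i in range(8)]
```

The motivation discussed in prior long-context work is that position-free layers have no rotation to extrapolate past the training length, so they degrade more gracefully at extreme distances, while the RoPE layers keep local ordering sharp.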
MoE design, hardware, and self‑hosting
- Repeated clarification: 17B “active” ≠ 17B model; total size (109B / 400B) determines RAM/VRAM needs.
- MoE experts are per-layer, router-selected subnetworks, not human-understandable topic specialists; routing is optimized mainly for load balancing and overall performance (see the sketch after this list).
- Tradeoff: dense-level quality at lower per-token compute, but large total parameter footprint makes local inference hard without 64–512GB-class machines or multi-GPU rigs.
- Long discussions on:
- Quantization materially reducing memory.
- Apple Silicon (M3/M4, Mac Studio) vs 4090/5090 vs AMD APUs and Tenstorrent for home inference.
- Prompt prefill (compute-bound) vs. generation (memory-bandwidth-bound) bottlenecks for huge contexts.
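For intuition on router-driven experts, a minimal top-k routed MoE layer in PyTorch; the expert count, k, and FFN shape are illustrative assumptions, and Llama 4's actual layout (e.g. any shared experts or routing constraints) is not reproduced here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Per-layer MoE: a learned router sends each token to k of n experts, so only
    ~k/n of the FFN weights are 'active' per token, yet all n must sit in memory."""
    def __init__(self, dim: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        gate_logits = self.router(x)                      # (tokens, n_experts)
        weights, idx = gate_logits.topk(self.k, dim=-1)   # pick k experts per token
        weights = F.softmax(weights, dim=-1)              # renormalize the k gates
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            hit = idx == e                                # (tokens, k) routing mask
            rows = hit.any(dim=-1)                        # tokens routed to expert e
            if rows.any():
                gate = weights[rows][hit[rows]].unsqueeze(-1)
                out[rows] += gate * expert(x[rows])
        return out

moe = TopKMoE(dim=64)
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```

Training-time load-balancing losses are omitted; those losses, not topical specialization, are what shape expert assignment in practice, which is why experts do not map onto human-legible subjects.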
System prompt, alignment, and politics
- The suggested system prompt explicitly discourages moral lecturing, “it’s important to…”-style hedging, and political refusals; it encourages chit-chat, venting, and even rude output when the user requests it.
- Some praise this as less “neutered” than prior LLMs; others worry it downplays helpfulness, critical thinking, and safety.
- Large subthread on whether prior models were “left-leaning,” whether Meta is “debiasing” or adding a different bias, and whether “truth” vs “bias” is even a coherent distinction.
- Early testing suggests:
- Looser NSFW and insult behavior but still some guardrails, especially on sensitive classification (e.g. inferring politics from images).
- Political and social responses remain constrained without heavy prompt engineering.
Benchmarks vs real-world behavior
- Meta’s charts show Maverick competing with GPT‑4o / Gemini 2.0 Flash, but the omission of newer models (Gemini 2.5, OpenAI’s o-series, DeepSeek R1) draws skepticism.
- LMArena head-to-head initially ranked Maverick near the top, but later reports claim the arena model differed from the released one, suggesting benchmark gaming.
- Aider’s coding benchmark shows Maverick only matching a specialized 32B coder model and trailing Gemini 2.5 Pro / Claude 3.7 Sonnet by a wide margin, raising questions about coding quality.
- Multiple commenters note existing benchmarks (especially multimodal) are weak: too much OCR/MCQ, little “in the wild” reasoning.
Licensing, openness, and data ethics
- License is widely criticized as “open weights, not open source”:
- Commercial use above 700M MAU requires a separate license from Meta.
- Branding (“built with Llama”) and naming requirements.
- Acceptable-use policy controlling downstream uses.
- Gated Hugging Face access already frustrated users of earlier Llama versions; similar friction is expected here.
- Strong thread on training data ethics: accusations of large-scale scraping/piracy (e.g. books), with some arguing full transparency of training data should be required.
Ecosystem, performance, and sentiment
- Groq quickly exposes Scout and Maverick with very high token throughput at low prices; several people test via Groq/OpenRouter and compare against Gemini/Claude/OpenAI (a minimal API sketch closes this section).
- Early impressions:
- Vision is clearly improved over Llama 3 but still trails GPT‑4o and top Qwen models.
- Instruction following and writing quality seen as below Gemini 2.5 / Claude in some tests.
- No “reasoning” variant yet; a placeholder “Llama 4 reasoning is coming” page suggests a later RL‑style reasoning release.
- Community mood mixes excitement (especially about long context and open weights) with fatigue over yet another giant MoE, the lack of small dense models, licensing constraints, and unresolved political and data-ethics questions.
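For anyone reproducing the hosted-endpoint tests above, a minimal sketch using OpenRouter's OpenAI-compatible API; the model slug is an assumption to verify against OpenRouter's current model list:

```python
# Minimal sketch: query Llama 4 Scout via OpenRouter's OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",  # placeholder
)

resp = client.chat.completions.create(
    model="meta-llama/llama-4-scout",  # assumed slug; check OpenRouter's list
    messages=[{"role": "user",
               "content": "In two sentences, what is a mixture-of-experts model?"}],
    max_tokens=200,
)
print(resp.choices[0].message.content)
```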