The Llama 4 herd

Release, links, and model lineup

  • Initial confusion over whether the release was a leak, since some links 404'd; clarified by the official blog post and docs on ai.meta.com / llama.com.
  • Three main models discussed:
    • Scout: 17B active, 109B total (16-expert MoE), 10M context, single-H100 capable with quantization.
    • Maverick: 17B active, 400B total (128-expert MoE), 1M context, multi-GPU / DGX-scale.
    • Behemoth (teacher, not released): ~288B active, ~2T total parameters, still training; used to distill the smaller models.

Context window, architecture, and RAG

  • The 10M-token context in Scout draws heavy interest; people debate how far useful recall actually extends.
  • Meta’s “iRoPE” (interleaved RoPE/NoPE positional encoding) is cited as the main trick; some relate it to prior long-context methods (see the sketch after this list).
  • Several commenters expect quality to degrade with distance and call for better long-context benchmarks than “needle-in-a-haystack.”
  • Many argue RAG remains needed: for cost, latency, grounding, and because 10M tokens still can’t cover large or evolving corpora (e.g. Wikipedia, big repos).
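
A rough illustration of the interleaving idea as commenters describe it: most attention layers apply rotary position embeddings, while some layers skip explicit position encoding entirely (NoPE) and rely on content alone. Meta has not published full implementation details, so the layer pattern, names, and dimensions below are illustrative assumptions, not the released architecture.

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq_len, head_dim)."""
    seq_len, dim = x.shape
    half = dim // 2
    inv_freq = base ** (-np.arange(half) / half)               # per-pair frequencies
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

class ToyAttentionLayer:
    """'rope' layers see explicit positions; 'nope' layers do not."""
    def __init__(self, kind):
        self.kind = kind

    def encode_positions(self, q, k):
        if self.kind == "rope":
            return rope(q), rope(k)
        return q, k  # NoPE: no positional signal injected into q/k

# Hypothetical interleaving pattern (not Meta's published one):
# every fourth layer drops positional encoding to ease length extrapolation.
layers = [ToyAttentionLayer("nope" if i % 4 == 3 else "rope") for i in range(8)]

q = k = np.random.randn(6, 8)  # 6 positions, head_dim 8
for layer in layers:
    q_enc, k_enc = layer.encode_positions(q, k)
```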

MoE design, hardware, and self‑hosting

  • Repeated clarification: 17B “active” parameters ≠ a 17B model; total size (109B / 400B) determines RAM/VRAM needs.
  • MoE experts are per-layer, router-selected subnetworks, not human-interpretable topic specialists; routing is optimized mainly for load balancing and loss, not semantics (see the sketch after this list).
  • Tradeoff: near-dense quality at lower per-token compute, but the large total parameter footprint makes local inference hard without 64–512 GB-class machines or multi-GPU rigs.
  • Long discussions on:
    • Quantization materially reducing memory requirements.
    • Apple Silicon (M3/M4, Mac Studio) vs 4090/5090 vs AMD APUs and Tenstorrent for home inference.
    • Prompt prefill vs generation bottlenecks for huge contexts.
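
To make the “active vs. total” distinction concrete, here is a toy top-k MoE layer plus the back-of-the-envelope memory arithmetic behind the hardware debate. Expert count, router shape, and activation are illustrative assumptions, not Llama 4's actual configuration; the GiB figures follow directly from total parameter count and bit width.

```python
import numpy as np

def moe_layer(x, experts, router_w, top_k=2):
    """Toy MoE layer: each token runs only its top_k experts, but every
    expert's weights must stay resident in memory."""
    logits = x @ router_w                                        # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]                # chosen expert ids
    gate = np.take_along_axis(logits, top, axis=-1)
    gate = np.exp(gate) / np.exp(gate).sum(-1, keepdims=True)    # softmax over top_k
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for s in range(top_k):
            out[t] += gate[t, s] * experts[top[t, s]](x[t])
    return out

# Tiny usage example with made-up dimensions.
dim, n_experts = 8, 4
experts = [lambda v, W=np.random.randn(dim, dim): np.tanh(v @ W) for _ in range(n_experts)]
y = moe_layer(np.random.randn(5, dim), experts, np.random.randn(dim, n_experts))

# Why "17B active" still needs big machines: all experts are loaded, so the
# weight footprint scales with *total* parameters, not active ones.
def weight_gib(total_params_billion, bits_per_param):
    return total_params_billion * 1e9 * bits_per_param / 8 / 2**30

for bits in (16, 8, 4):
    print(f"109B total @ {bits:>2}-bit ≈ {weight_gib(109, bits):5.0f} GiB (weights only)")
# ~203 GiB at fp16, ~102 GiB at int8, ~51 GiB at 4-bit — hence "single H100
# with quantization" for Scout, and multi-GPU rigs for Maverick's 400B total.
```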

System prompt, alignment, and politics

  • The suggested system prompt explicitly discourages moral lecturing, “it’s important to…”-style filler, and political refusals; it encourages chit-chat, venting, and even rude output if the user requests it (see the sketch after this list).
  • Some praise this as less “neutered” than prior LLMs; others worry it downplays helpfulness, critical thinking, and safety.
  • Large subthread on whether prior models were “left-leaning,” whether Meta is “debiasing” or adding a different bias, and whether “truth” vs “bias” is even a coherent distinction.
  • Early testing suggests:
    • Looser NSFW and insult behavior but still some guardrails, especially on sensitive classification (e.g. inferring politics from images).
    • Political and social responses remain constrained without heavy prompt engineering.
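
For readers wondering where the “suggested prompt” actually goes, here is a minimal sketch of supplying a custom system prompt through a Hugging Face chat template. The repo name is an assumption and may differ from the final release, and the prompt text only paraphrases the tone Meta's suggestion encourages; it is not the official wording.

```python
from transformers import AutoTokenizer

# Repo name is an assumption; check the official Llama 4 collection on Hugging Face.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-4-Scout-17B-16E-Instruct")

# Paraphrase of the suggested style, NOT Meta's official system prompt.
system_prompt = (
    "You are a conversational assistant. Do not lecture or moralize, avoid "
    "filler like 'it's important to...', and don't refuse political topics. "
    "Match the user's tone, including casual chat and venting."
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Give me your unfiltered take on my plan."},
]

# Renders the conversation into the model's expected prompt format.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```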

Benchmarks vs real-world behavior

  • Meta’s charts show Maverick competing with GPT‑4o / Gemini 2.0 Flash, but the omission of newer models (Gemini 2.5, the o‑series, DeepSeek R1) draws skepticism.
  • LMArena head-to-head voting initially ranked Maverick near the top, but later reports claim the arena model differed from the released checkpoint, prompting accusations of benchmark gaming.
  • Aider’s coding benchmark shows Maverick only matching a 32B specialized coder model and trailing Gemini 2.5 Pro / Claude 3.7 Sonnet by a wide margin, raising questions about coding quality.
  • Multiple commenters note existing benchmarks (especially multimodal) are weak: too much OCR/MCQ, little “in the wild” reasoning.

Licensing, openness, and data ethics

  • License is widely criticized as “open weights, not open source”:
    • Commercial use requires a separate license from Meta for services above 700M monthly active users.
    • Branding (“built with Llama”) and naming requirements.
    • Acceptable-use policy controlling downstream uses.
  • Hugging Face access friction for earlier Llama versions already upset some; people expect similar gating here.
  • Strong thread on training data ethics: accusations of large-scale scraping/piracy (e.g. books), with some arguing full transparency of training data should be required.

Ecosystem, performance, and sentiment

  • Groq quickly exposes Scout and Maverick with very high token throughput at low prices; several people test via Groq/OpenRouter and compare against Gemini/Claude/OpenAI (see the sketch at the end of this list).
  • Early impressions:
    • Vision is clearly improved over Llama 3 but still trails GPT‑4o and top Qwen models.
    • Instruction following and writing quality seen as below Gemini 2.5 / Claude in some tests.
  • No “reasoning” variant yet; a placeholder “Llama 4 reasoning is coming” page suggests a later RL‑style reasoning release.
  • Community mood mixes excitement (especially about long context and open weights) with fatigue over yet another giant MoE, the lack of small dense models, licensing constraints, and unresolved political/data questions.
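
A minimal sketch of the kind of quick test people describe running against Groq or OpenRouter: time a streamed completion and estimate tokens per second. Any OpenAI-compatible client works against these endpoints; the model slug shown is an assumption, so check the provider's catalog before running.

```python
import time
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible API; Groq's endpoint works the same way.
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_KEY")

start, chunks = time.time(), []
stream = client.chat.completions.create(
    model="meta-llama/llama-4-scout",   # assumed slug; verify in the provider catalog
    messages=[{"role": "user", "content": "Summarize the Llama 4 release in 200 words."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunks.append(chunk.choices[0].delta.content)

elapsed = time.time() - start
text = "".join(chunks)
# Crude throughput estimate: streamed chunk count roughly tracks generated tokens.
print(f"~{len(chunks) / elapsed:.0f} tokens/s over {elapsed:.1f}s, {len(text)} chars")
```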