Large Enough

Perceived model quality & rankings

  • Many commenters say Claude 3.5 Sonnet is currently the best “everyday” and coding model, often “blowing away” GPT‑4/4o and Copilot in real workflows, especially for complex code reasoning and self‑correction.
  • Others report opposite experiences, finding GPT‑4o better or at least not worse; several suggest performance depends heavily on task type and prompting style.
  • In early tests that rerun prompts previously used with Claude, Mistral Large 2 and Llama 3.1 405B often come out roughly tied with each other and slightly below Claude 3.5 Sonnet.
  • Some see GPT‑4 as having degraded over time (more boilerplate, laziness, shallow outputs) and think GPT‑4o is optimized more for cost and latency than for raw capability.

Coding assistants & tooling

  • Claude 3.5 Sonnet paired with tools like Aider or OpenWebUI is repeatedly praised as a highly effective coding partner with strong project‑ and codebase‑wide context.
  • Cursor, Copilot, and other IDE tools get mixed reviews: good for inline suggestions but weaker on large refactors, continuity across edits, or complex reasoning.
  • Some users report massive productivity gains (e.g., shipping new apps or navigating complex Unreal C++); others find LLM code help too error‑prone to trust.

Benchmarks, evaluation, and “strawberry”

  • Commenters debate the value of leaderboards (e.g., LMSys, ArtificialAnalysis, Aider’s coding boards) vs “mass anecdata” from real use.
  • The “how many r’s in strawberry” question becomes a focal example: many top models answer incorrectly unless they are guided through step‑by‑step reasoning or given a tool to do the counting.
  • This sparks long discussion of tokenization limits, counting/math weaknesses, hallucination confidence, and the need for better tests of reasoning and long‑context competence.
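
The tokenization point can be illustrated with a minimal Python sketch. The subword split used here is hypothetical and chosen only for illustration; real tokenizers may segment the word differently. The idea is that a model receives opaque token IDs rather than letters, whereas a tool call that runs ordinary code counts characters directly.

```python
# Counting letters is trivial for ordinary code, which is why routing the
# question through a tool sidesteps the problem.
word = "strawberry"


def count_letter(text: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in text."""
    return text.lower().count(letter.lower())


# Hypothetical subword split, for illustration only; a real tokenizer
# may segment the word differently.
hypothetical_tokens = ["str", "aw", "berry"]

# The pieces still spell the word, but a model receives them as opaque IDs,
# not as a sequence of characters it can scan.
assert "".join(hypothetical_tokens) == word

# How the r's are spread across the hypothetical pieces.
per_token_counts = [count_letter(tok, "r") for tok in hypothetical_tokens]

print(count_letter(word, "r"))  # 3
print(per_token_counts)         # [1, 0, 2]
```

Under this assumed split, the letter is spread unevenly across pieces that the model never sees spelled out, which is one plausible reading of why step‑by‑step spelling or a tool call helps.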

Costs, scale, and possible plateau

  • Some argue frontier models are converging and we may be near the limits of scaling transformers; incremental benchmark gains are costly and may be marginal in practice.
  • Others think bigger or better‑trained models (along with new architectures, internal reasoning, and tool integration) still have significant headroom.
  • There’s concern that proprietary leaders are shifting from capability to cost/latency optimization and that open models plus local deployment will erode their advantage.

Licensing, openness, and deployment

  • Mistral Large 2’s open weights with a non‑commercial license are welcomed but viewed as less attractive for many use cases than Llama 3.1, whose license permits commercial use.
  • Anthropic’s restrictive commercial terms (no using Claude outputs to “compete”) worry some; others doubt such clauses are enforceable.
  • Many users now run multiple models via unified UIs (OpenWebUI, local Ollama, API multiplexing) and select per‑task based on speed, cost, and refusals rather than a single “winner.”