Large Enough
Perceived model quality & rankings
- Many commenters say Claude 3.5 Sonnet is currently the best “everyday” and coding model, often “blowing away” GPT‑4/4o and Copilot in real workflows, especially for complex code reasoning and self‑correction.
- Others report opposite experiences, finding GPT‑4o better or at least not worse; several suggest performance depends heavily on task type and prompting style.
- Early tests that rerun prompts previously given to Claude often rank Mistral Large 2 and Llama 3.1 405B as roughly tied with each other and slightly below Claude 3.5 Sonnet.
- Some see GPT‑4 as having degraded over time (more boilerplate, laziness, shallow outputs) and see 4o as optimized more for cost/latency than raw capability.
Coding assistants & tooling
- Claude 3.5 paired with tools like Aider or OpenWebUI is repeatedly praised as a highly effective coding partner with strong project‑/codebase‑wide context.
- Cursor, Copilot, and other IDE tools get mixed reviews: good for inline suggestions but weaker on large refactors, continuity across edits, or complex reasoning.
- Some users report massive productivity gains (e.g., shipping new apps or navigating complex Unreal C++), while others find LLM code help too error‑prone to trust.
Benchmarks, evaluation, and “strawberry”
- Commenters debate the value of leaderboards (e.g., LMSys, ArtificialAnalysis, Aider’s coding boards) vs “mass anecdata” from real use.
- The “how many r’s in strawberry” question becomes a focal example: many top models answer incorrectly unless they are guided through step‑by‑step reasoning or given tools.
- This sparks a long discussion of tokenization limits, counting/math weaknesses, confidently stated hallucinations, and the need for better tests of reasoning and long‑context competence.
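To illustrate the tokenization point from the thread, here is a minimal Python sketch. The letter count itself is trivial once code operates on characters, which is why tool use fixes the question. The three-piece split in `TOY_TOKENS` is a hypothetical example, not the output of any real tokenizer; it only shows that a model working on subword pieces never sees the individual letters directly.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


# Hypothetical subword split, for illustration only; real tokenizers
# may split this word differently.
TOY_TOKENS = ["str", "aw", "berry"]


def per_token_counts(tokens: list[str], letter: str) -> list[tuple[str, int]]:
    """Show how the occurrences of a letter are spread across token pieces."""
    return [(tok, count_letter(tok, letter)) for tok in tokens]


if __name__ == "__main__":
    # Character-level view: the answer a tool call would return.
    print(count_letter("strawberry", "r"))  # 3
    # Token-level view: the letters are hidden inside opaque pieces.
    print(per_token_counts(TOY_TOKENS, "r"))
```

A model has to recover the per-piece counts from what it learned about each token and then add them up, which is one explanation commenters offer for why step‑by‑step spelling helps.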
Costs, scale, and possible plateau
- Some argue frontier models are converging and we may be near the limits of scaling transformers; incremental benchmark gains are costly and may be marginal in practice.
- Others think bigger or better‑trained models (along with new architectures, internal reasoning, and tool integration) still have significant headroom.
- There’s concern that proprietary leaders are shifting from capability to cost/latency optimization and that open models plus local deployment will erode their advantage.
Licensing, openness, and deployment
- Mistral Large 2’s open weights are welcomed, but its non‑commercial license makes it less attractive for many use cases than Llama 3.1, whose license permits commercial use.
- Anthropic’s restrictive commercial terms (no using Claude outputs to “compete”) worry some; others doubt such clauses are enforceable.
- Many users now run multiple models via unified UIs (OpenWebUI, local Ollama, API multiplexing) and select per‑task based on speed, cost, and refusals rather than a single “winner.”
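The per‑task selection described above can be sketched as a small weighted scoring function. This is only an illustration under stated assumptions: the model names, the `ModelProfile` fields, the weights, and every number below are placeholders invented for the example, not measurements of any real model or the API of any tool named in the thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical per-model stats a user might track for themselves."""
    name: str
    latency_s: float           # typical response time in seconds
    cost_per_1k_tokens: float  # price in dollars
    refusal_rate: float        # fraction of prompts refused, 0.0 to 1.0


def pick_model(profiles, w_latency=1.0, w_cost=1.0, w_refusal=1.0):
    """Return the profile with the lowest weighted penalty for this task."""
    def penalty(p: ModelProfile) -> float:
        return (w_latency * p.latency_s
                + w_cost * p.cost_per_1k_tokens
                + w_refusal * p.refusal_rate)
    return min(profiles, key=penalty)


# Placeholder values, made up for illustration only.
PROFILES = [
    ModelProfile("model_a", latency_s=2.0, cost_per_1k_tokens=0.015, refusal_rate=0.02),
    ModelProfile("model_b", latency_s=0.5, cost_per_1k_tokens=0.002, refusal_rate=0.10),
]

if __name__ == "__main__":
    # Equal weights favor the faster, cheaper model.
    print(pick_model(PROFILES).name)
    # A task where only refusals matter favors the other one.
    print(pick_model(PROFILES, w_latency=0.0, w_cost=0.0, w_refusal=1.0).name)
```

Changing the weights per task is what replaces the idea of a single “winner”: the same table of models yields different choices for a quick autocomplete than for a prompt that is likely to be refused.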