2024-07-24

Large Enough

Perceived model quality & rankings

Many commenters say Claude 3.5 Sonnet is currently the best “everyday” and coding model, often “blowing away” GPT‑4/4o and Copilot in real workflows, especially for complex code reasoning and self‑correction.
Others report opposite experiences, finding GPT‑4o better or at least not worse; several suggest performance depends heavily on task type and prompting style.
Early tests that rerun prompts previously given to Claude on Mistral Large 2 and Llama 3.1 405B often rank the two roughly tied and slightly below Claude 3.5 Sonnet.
Some see GPT‑4 as having degraded over time (more boilerplate, laziness, shallow outputs) and see GPT‑4o as optimized more for cost/latency than for raw capability.

Coding assistants & tooling

Claude 3.5 Sonnet paired with tools like Aider or OpenWebUI is repeatedly praised as a highly effective coding partner with strong project‑ and codebase‑wide context.
Cursor, Copilot, and other IDE tools get mixed reviews: good for inline suggestions but weaker on large refactors, continuity across edits, or complex reasoning.
Some users report massive productivity gains (e.g., shipping new apps or navigating complex Unreal C++); others find LLM code help too error‑prone to trust.

Benchmarks, evaluation, and “strawberry”

Commenters debate the value of leaderboards (e.g., LMSys, ArtificialAnalysis, Aider’s coding boards) vs “mass anecdata” from real use.
The “how many r’s in strawberry” question becomes a focal example: many top models answer incorrectly unless guided through step‑by‑step reasoning or via tools.
This sparks a long discussion of tokenization limits, counting/math weaknesses, confidently stated hallucinations, and the need for better tests of reasoning and long‑context competence.
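
As a rough illustration of the "via tools" route, the sketch below counts letters deterministically in Python rather than asking a model to do it, and also spells the word out one character at a time, roughly what a step‑by‑step prompt asks for. The function names and the three‑piece token split are hypothetical and chosen only for illustration; real tokenizers may split the word differently.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of one letter in a word."""
    return word.lower().count(letter.lower())


def spell_out(word: str, letter: str) -> list[tuple[int, str, bool]]:
    """Walk the word character by character: (position, char, is_match)."""
    target = letter.lower()
    return [(i, ch, ch.lower() == target) for i, ch in enumerate(word, start=1)]


# Hypothetical token split, for illustration only. A model sees pieces like
# these rather than individual letters, which is one proposed explanation
# for why letter counting is hard for it.
TOKEN_PIECES = ["str", "aw", "berry"]

if __name__ == "__main__":
    print(count_letter("strawberry", "r"))
    for position, char, is_match in spell_out("strawberry", "r"):
        print(position, char, "<-- r" if is_match else "")
    # Counting per piece and summing agrees with counting over the whole word.
    print(sum(count_letter(piece, "r") for piece in TOKEN_PIECES))
```

Running the script prints the count, then the character‑by‑character walk, then the per‑piece total.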

Costs, scale, and possible plateau

Some argue frontier models are converging and we may be near the limits of scaling transformers; incremental benchmark gains are costly and may be marginal in practice.
Others think bigger or better‑trained models (along with new architectures, internal reasoning, and tool integration) still have significant headroom.
There’s concern that proprietary leaders are shifting from capability to cost/latency optimization and that open models plus local deployment will erode their advantage.

Licensing, openness, and deployment

Mistral Large 2’s open weights with a non‑commercial license are welcomed but viewed as less attractive than the more permissively licensed Llama 3.1 for many use cases.
Anthropic’s restrictive commercial terms (no using Claude outputs to “compete”) worry some; others doubt such clauses are enforceable.
Many users now run multiple models via unified UIs (OpenWebUI, local Ollama, API multiplexing) and select per‑task based on speed, cost, and refusals rather than a single “winner.”
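
The per‑task selection described above can be sketched as a small weighted scoring function. This is a minimal sketch, not any particular tool's routing logic: the `ModelProfile` fields, the weights, the model names, and all numbers below are hypothetical placeholders rather than measurements of any real model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    cost_per_mtok: float  # hypothetical cost per million tokens
    latency_s: float      # hypothetical typical response time in seconds
    refusal_rate: float   # hypothetical fraction of prompts refused (0..1)


def pick_model(models, w_cost=1.0, w_latency=1.0, w_refusal=1.0):
    """Return the model with the lowest weighted penalty for this task."""
    def penalty(m: ModelProfile) -> float:
        # Refusal rate is scaled to a percentage so it is comparable in size
        # to the other two terms; the scaling is an arbitrary choice.
        return (w_cost * m.cost_per_mtok
                + w_latency * m.latency_s
                + w_refusal * m.refusal_rate * 100)
    return min(models, key=penalty)


# Placeholder candidates; the values are made up for illustration.
CANDIDATES = [
    ModelProfile("model-a", cost_per_mtok=15.0, latency_s=4.0, refusal_rate=0.02),
    ModelProfile("model-b", cost_per_mtok=1.0, latency_s=1.0, refusal_rate=0.10),
]

if __name__ == "__main__":
    # Default weights favor the cheaper, faster model.
    print(pick_model(CANDIDATES).name)
    # Weighting refusals heavily flips the choice.
    print(pick_model(CANDIDATES, w_refusal=5.0).name)
```

Changing the weights per task (for example, raising `w_refusal` for sensitive prompts or `w_latency` for autocomplete) is what moves the choice from one model to another.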
