Recent AI model progress feels mostly like bullshit
Commoditization and “Real” Progress
- Many argue the most meaningful recent advance is cost and ubiquity, not intelligence: GPT‑3.5‑level quality or better is now available on commodity GPUs or cheap APIs, including 4‑bit quantized local models that approach top‑tier coding performance.
- This commoditization is seen as enabling new applications (e.g. consumer defense against corporate bureaucracy, or brute‑force generation of candidate solutions wherever results can be automatically verified; see the sketch below) rather than AGI‑like breakthroughs.
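One way to read the "brute‑force testing" point: once an automatic verifier exists, even a weak, cheap model becomes useful because you can sample until something passes. A minimal Python sketch of that loop, where `generate` and `verify` are hypothetical stand‑ins for whatever local model and checker you have, not any specific API:

```python
import random
from typing import Callable, Optional

def brute_force_solve(
    generate: Callable[[], str],
    verify: Callable[[str], bool],
    max_attempts: int = 100,
) -> Optional[str]:
    """Sample candidates from a cheap model; keep the first one that passes
    an automatic check. The model only has to be occasionally right, because
    the verifier does all the quality control."""
    for _ in range(max_attempts):
        candidate = generate()
        if verify(candidate):
            return candidate
    return None

# Toy usage: "generate" stands in for any cheap local-model call,
# "verify" for any deterministic check (unit tests, a parser, a regex, ...).
if __name__ == "__main__":
    answer = brute_force_solve(
        generate=lambda: str(random.randint(0, 40)),  # placeholder, not a real model
        verify=lambda s: int(s) * int(s) == 1369,     # automatic ground truth
        max_attempts=500,
    )
    print(answer)  # almost always "37"
```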
Mixed User Experiences and Role of Tooling
- Some report dramatic productivity gains in coding, refactoring, debugging, brainstorming, and data-wrangling; others say newer models feel only incrementally better or even worse (more verbose, over‑engineered, poor instruction adherence).
- Agentic coding tools (Cursor, Windsurf, Aider, etc.) and orchestration (MCP, tool calling, search) are where many see the real progress; others find agents brittle and frustrating, especially on non‑boilerplate tasks (a minimal orchestration loop is sketched after this list).
- Several note that prompt skill and problem formulation matter more over time; “lazy” prompts and vaguely specified tasks quickly hit a ceiling.
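For readers unsure what "orchestration" means concretely, here is a bare sketch of the loop these tools implement. Everything is a hypothetical stand‑in: `ask_model` is whatever completion call you use, and the tool names are not Cursor, Windsurf, or MCP APIs.

```python
from typing import Callable, Dict, Tuple

# Hypothetical shapes: "ask_model" returns either ("final", answer_text)
# or ("tool", tool_name, argument).
ModelStep = Tuple[str, ...]

def run_agent(
    ask_model: Callable[[list], ModelStep],
    tools: Dict[str, Callable[[str], str]],
    task: str,
    max_steps: int = 8,
) -> str:
    """Bare-bones orchestration: the model decides, the harness executes,
    results go back into the transcript, repeat until a final answer."""
    transcript = [("user", task)]
    for _ in range(max_steps):
        step = ask_model(transcript)
        if step[0] == "final":
            return step[1]
        _, name, arg = step
        handler = tools.get(name)
        result = handler(arg) if handler else f"unknown tool: {name}"
        transcript.append(("tool", f"{name}({arg}) -> {result}"))
    return "step budget exhausted without a final answer"

# "tools" might be {"search": web_search, "run_tests": run_tests}, supplied by
# the editor or an MCP server; the loop itself is the easy part.
```

Much of the "revolutionary but subtle" camp's argument is that value now comes from this scaffolding, grounding, and feedback, not from the raw model underneath.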
Gemini 2.5, Claude 3.7, o‑series: Hype vs Reality
- Gemini 2.5 Pro is widely praised as the best practical coding assistant to date (good adherence, large usable context, strong visual reasoning), yet considered an incremental, not revolutionary, step.
- Claude 3.7 is often described as more capable but less controllable than 3.5 (too much code, ignores constraints).
- Even proponents who are impressed by specific wins (reverse‑engineering bytecode, complex refactors, system design help) still feel limitations constantly and reject “imminent AGI” narratives.
Benchmarks, Cheating, and Reasoning Limits
- Strong skepticism that headline benchmark gains reflect real-world utility; concern that labs overfit or “cheat” on public tests (e.g. IMO‑style problems, ARC‑AGI, coding contests).
- A cited USAMO 2025 study shows frontier models scoring ~5% under strict proof grading, contrasting with earlier claims of Olympiad‑level performance and reinforcing suspicion of data contamination and weak generalization (a toy contamination check is sketched after this list).
- Users highlight failures on non‑standard puzzles, chess, security pentesting, large‑repo understanding, and system‑level software design: LLMs can pattern‑match known problems but often collapse on genuinely novel ones.
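The contamination worry is at least partly testable: labs commonly report n‑gram overlap between benchmark items and training data as a decontamination heuristic. A toy version of that check, to make the idea concrete (corpus access and the choice of n are the hard parts in practice; this is an assumption‑laden simplification, not any lab's actual pipeline):

```python
def ngrams(text: str, n: int = 8) -> set:
    """Lowercased word n-grams; crude but a standard basis for overlap checks."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_score(benchmark_item: str, corpus_docs: list, n: int = 8) -> float:
    """Fraction of the item's n-grams that appear verbatim somewhere in the
    corpus. A high score suggests the item (or a near-copy) was in the training
    data, so a correct answer says little about generalization."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)
```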
Reliability, Hallucinations, and Information Retrieval
- Long subthread on factual errors (e.g. Paul Newman’s alcoholism, golf‑balls‑in‑a‑plane estimates) shows:
  - Models can be confidently wrong on simple, verifiable facts.
  - Some systems answer correctly when allowed to search; others appear constrained by safety/libel filters or weak internal “knowledge”.
- Many stress that LLMs are not knowledge bases; they are text predictors that must be paired with retrieval, tools, or human checking (a minimal retrieval sketch follows this list).
- Growing worry that AI‑generated web slop will pollute search and future training data.
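A concrete version of the "pair the model with retrieval" point: hand the model sources and tell it to refuse rather than guess. In this sketch, `ask_model` is a stand‑in for any completion call, and the keyword scorer is a deliberately crude placeholder for real search or embeddings.

```python
from typing import Callable

def keyword_score(query: str, doc: str) -> int:
    """Crude relevance: how many query words appear in the document.
    Real systems use a search index or embeddings; this is only for shape."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def grounded_answer(
    ask_model: Callable[[str], str],  # any completion call (hypothetical stand-in)
    query: str,
    documents: list,
    top_k: int = 3,
) -> str:
    """Retrieve-then-read: instead of trusting the model's parametric memory,
    hand it the best-matching sources and instruct it to refuse rather than guess."""
    ranked = sorted(documents, key=lambda d: keyword_score(query, d), reverse=True)
    context = "\n\n".join(ranked[:top_k])
    prompt = (
        "Answer using only the sources below. If they do not contain the "
        "answer, say so instead of guessing.\n\n"
        f"SOURCES:\n{context}\n\nQUESTION: {query}"
    )
    return ask_model(prompt)
```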
Business Models, Bubble Fears, and End-State Analogies
- Debate over whether current AI economics are sustainable: GPU vendors clearly profit; model labs and SaaS wrappers often appear to “spend a dollar to earn a dime”.
- Some see a classic bubble driven by ZIRP‑era capital and hype, vulnerable to macro shocks; others counter that unsustainable ≠ no business model.
- A popular analogy: LLMs may end up like compilers—crucial infrastructure, mostly commoditized and OSS, with value in surrounding expertise rather than API margin.
Definitions of “Intelligence” and Societal Role
- Dispute over calling LLMs “AI” or “stochastic parrots”:
  - Critics say the term “AI” is a memetic hazard that over‑primes people to assume general intelligence.
  - Supporters note that many narrow systems (chess engines, compilers) also fit “artificial intelligence” and that LLMs clearly encode useful concepts and abstractions.
- Some warn of technocratic or oligarchic misuse—LLMs managing companies or policy, justifying wage suppression or opaque decisions—despite their fragility and lack of true understanding.
Meta: Anecdotes, Evals, and Trajectory
- Commenters note the discussion is almost entirely anecdote‑driven, with wildly divergent personal experiences and heavy benchmark gaming.
- Many perceive diminishing returns at the frontier: newer models feel more polished and reliable at what LLMs already did, but don’t obviously cross new qualitative thresholds.
- Others, especially high‑volume or agentic users, insist we’re in a “revolutionary but subtle” phase where stitching, grounding, and workflow integration matter more than raw model IQ.