OpenAI Researchers Find That AI Is Unable to Solve Most Coding Problems
Human vs AI Coding (and Internet Access)
- Several argue the comparison is skewed: models are tested without internet, tools, or iteration, while humans rely heavily on docs, search, compilers, and debuggers.
- Others counter that experienced engineers can still solve many real-world tasks offline, especially bug fixes and small features, and that interviews already test “paper” or pseudocode reasoning.
- There’s nostalgia for pre-internet coding, with claims that modern reliance on copy-paste and Stack Overflow can reduce deep understanding.
Observed Capabilities and Failure Modes
- Many report LLMs are helpful for small, well-specified tasks, boilerplate, simple scripts, and well-known technologies (e.g., React, Python), often performing like a strong junior engineer.
- When users don’t understand the domain well, or problems involve tricky design, large codebases, or complex SQL, models frequently hallucinate, repeat the same errors, or endlessly rewrite broken code.
- Effective use requires domain expertise to spot nonsense quickly; novices who trust outputs blindly are seen as especially vulnerable.
Benchmarks, SWE-Lancer, and Evaluation
- Commenters highlight that the new SWE-Lancer benchmark uses real Upwork-style tasks (many bugfixes) and that top models solve only a minority, even after multiple attempts.
- Some see this as honest, positive signal: a more realistic bar that today’s systems can’t clear, contradicting strong “replace engineers soon” narratives.
- Others worry about overfitting to yet another benchmark and about increased code churn and superficial “pass the test” behavior.
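The discussion doesn't specify how SWE-Lancer scores "multiple attempts," but such evaluations are commonly summarized with the pass@k metric: the probability that at least one of k sampled attempts succeeds, estimated without bias from n total attempts of which c passed. A minimal sketch of that standard estimator (the SWE-Lancer-specific scoring may differ):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    attempts, drawn without replacement from n recorded attempts of
    which c were correct, solves the task."""
    if n - c < k:
        # Fewer than k failures exist, so any k-sample contains a success.
        return 1.0
    # 1 minus the probability that all k sampled attempts are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 attempts and 2 successes: pass@1 = 0.2, pass@5 ≈ 0.78.
```

A low pass@1 with a much higher pass@k is exactly the "solves it sometimes, but not reliably" behavior commenters describe, which is why single-attempt and multi-attempt numbers are usually reported separately.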
Assistant vs Replacement; Job Impact
- Strong consensus: current LLMs augment, not replace, most professional software engineers; half the work is specification, iteration, and deep codebase understanding.
- Non-engineers report using LLMs to do work they once outsourced to freelancers, suggesting displacement at the margins rather than wholesale replacement.
- Some expect cumulative efficiency gains to eventually add up to full roles, while skeptics note this falls far short of the “trillions in disruption” being hyped.
Hype, AGI, and Learning Analogy
- Many express skepticism or cynicism toward AGI timelines and claims that models already rival low-level engineers, likening this to past “self-driving next year” promises.
- A recurring theme: LLMs haven’t “learned to code” like humans—no structured curriculum, no mentoring, no interactive practice—just massive passive ingestion of often low-quality code.
- A minority argue that dismissing them as “just pattern matchers” ignores that human reasoning is also pattern-based, and that LLMs plus better tooling/agents might eventually tackle more complex work.