Coding with LLMs in the summer of 2025 – an update
LLM‑friendly codebases and testing structure
- Many argue codebases “that work for LLMs” look like good human‑oriented codebases: clear modules, small functions, sound interfaces, and good docs. If an LLM is confused, humans probably are too.
- Some suggest going further: finer‑grained runnable stages (multiple dev/test environments, layered Nix flakes, tagged pytest stages) so an agent can focus on stage‑local code and tests while ignoring the rest (see the pytest‑marker sketch after this list).
- Several people now split larger integrations into separate libraries to give LLMs smaller, self‑contained scopes.
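As a concrete illustration of the “tagged pytest stages” idea, here is a minimal sketch; the stage names and the toy test are hypothetical, not any specific commenter’s setup:

```python
# conftest.py -- register one marker per runnable "stage" so an agent can
# scope a test run to the code it is currently touching.
# (Stage names are illustrative; adapt to your own pipeline.)
def pytest_configure(config):
    config.addinivalue_line("markers", "ingest: tests for the data-ingest stage")
    config.addinivalue_line("markers", "transform: tests for the transform stage")


# test_transform.py -- only the transform stage's tests carry its marker.
import pytest

@pytest.mark.transform
def test_collapses_whitespace():
    # stand-in for a real stage-local unit under test
    collapse = lambda s: " ".join(s.split())
    assert collapse("  a  b ") == "a b"
```

The agent (or a wrapper script) can then run `pytest -m transform` and safely ignore every other stage’s code and tests.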
Context management and prompting strategies
- Large context is a double‑edged sword: great for “architect” or design sessions, but harmful for focused coding, where aggressive pruning works better.
- A common pattern:
- Use maximum context for design/architecture.
- For coding, feed only the adjacent files and tests; restart sessions instead of “arguing” when the model drifts (a context‑pruning sketch follows this list).
- Ask the model to first describe a plan in prose, refine that, then implement.
- Some workflows use one branch per conversation, sometimes running several parallel branches from the same prompt and then picking the best diff.
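A minimal sketch of the “adjacent files only” idea, assuming a conventional src/ plus tests/ layout; the paths, module name, and prompt wording are illustrative, not any particular tool’s behaviour:

```python
from pathlib import Path

def build_focused_prompt(target: Path, task: str) -> str:
    """Assemble a prompt from the file being edited plus its sibling test,
    instead of dumping the whole repository into the context window."""
    parts = [f"Task: {task}", "", f"# {target}", target.read_text()]
    test_file = Path("tests") / f"test_{target.name}"   # conventional layout
    if test_file.exists():
        parts += [f"# {test_file}", test_file.read_text()]
    return "\n".join(parts)

# Plan first in prose; implement later in a fresh session with a new prompt.
prompt = build_focused_prompt(
    Path("src/parser.py"),   # hypothetical module
    "Describe, in prose only, how you would add comment support to the parser.",
)
```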
Models, tools, and division of labor
- Many distinguish roles:
- Gemini 2.5 Pro / Opus 4 / DeepSeek R1 for big‑picture reasoning and architecture.
- Claude Sonnet 4 (and similar) for day‑to‑day coding: cheaper, more concise, and less prone to over‑engineering.
- Experiences with Gemini CLI and Claude Code are mixed but often positive: good at small scripts, refactors, and code review; weaker on large, complex feature work without careful steering.
- Some use LLMs heavily for automated PR review, build‑failure triage, and static‑analysis‑driven cleanups; the signal is imperfect but often catches real bugs (see the sketch below).
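A hedged sketch of what such PR‑review automation might look like, using the OpenAI Python client against a chat‑completions API; the model name, prompt, and context cap are assumptions, not a specific commenter’s pipeline:

```python
import subprocess
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def review_diff(base: str = "origin/main") -> str:
    """Ask a model to flag likely bugs in the current branch's diff."""
    diff = subprocess.run(
        ["git", "diff", base, "--", "."],
        capture_output=True, text=True, check=True,
    ).stdout
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed model; any chat-capable model works
        messages=[
            {"role": "system",
             "content": "You are a strict code reviewer. List concrete bugs, "
                        "race conditions, and missing tests. Say 'LGTM' if none."},
            {"role": "user", "content": diff[:100_000]},  # crude context cap
        ],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    print(review_diff())
```

As the commenters note, the output still needs human triage: treat it as a noisy extra linter, not an approver.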
Agents vs manual control
- One camp follows the article’s advice: avoid agents and IDE magic; instead, manually copy/paste code into a frontier model’s web UI to control context precisely and stay mentally “in the loop.”
- Another camp finds this too laborious: they prefer agentic tools (Claude Code, Cursor, Gemini CLI, JetBrains assistants, Copilot) that can read files, run tests, and apply edits, while the human reviews diffs and steers.
- There is broad agreement that fully autonomous “one‑shot” agents still fail on medium/large tasks; human supervision and iterative prompting remain crucial.
Quality, bugs, and domain dependence
- Users report LLMs excel at: one‑off scripts, glue code, adapters, API clients, test generation, and “boring” boilerplate—often writing more tests and spotting edge cases humans missed.
- Others show counter‑examples: extremely inefficient or subtly wrong code, commented‑out assertions, flaky concurrency, or heavy complexity creep.
- Domain, language, and problem type matter a lot: what feels magical in one stack can be nearly useless in another; people caution against generalizing from single anecdotes.
Proprietary vs open models, lock‑in, and cost
- Strong debate over relying on closed, paid frontier models:
- Pro side: paid models are currently “much better,” and switching providers or falling back to manual coding is trivial, so the dependency is weak.
- Skeptical side: worries about enshittification, rising prices, usage limits, data exposure, and the re‑creation of a pay‑to‑play gate around programming, similar to historical proprietary toolchains.
- Some point to open‑weight models (Kimi K2, DeepSeek, Qwen, etc.) as improving fast but still lagging for serious coding; local inference remains expensive and hardware‑bounded.
- Tooling exists to abstract model choice (Ollama, vLLM, Continue, Cline, Aider, generic OpenAI‑compatible APIs), but most people still gravitate to frontier SaaS for productivity (a base‑URL swap sketch follows).
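Because most of these tools speak an OpenAI‑compatible chat API, switching between a frontier SaaS model and a local open‑weight one can be close to a one‑line change. A sketch, assuming an Ollama server on its default port and with illustrative model names:

```python
from openai import OpenAI

# Frontier SaaS endpoint (API key read from the environment).
cloud = OpenAI()

# Local open-weight model served by Ollama's OpenAI-compatible endpoint.
local = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

def ask(client: OpenAI, model: str, question: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

# Same call shape, different provider; only the client and model name change.
print(ask(local, "qwen2.5-coder", "Write a function that reverses a string."))
```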
Skills, “PhD‑level knowledge,” and future of programming
- The “PhD‑level knowledge” metaphor is criticized: a PhD is more about learning to do research and ask questions than about static knowledge; LLMs are “lazy knowledge‑rich workers” that don’t generate their own hypotheses unless prompted.
- Some fear LLM‑centric workflows will deskill programmers or tie careers to subscriptions; others see them as powerful amplifiers that still require deep human understanding, especially for problem formulation and verification.
- Overall sentiment: today’s best use is human‑in‑the‑loop amplification, not autonomous replacement; workflows, tools, and open models are still rapidly evolving.