Can LLMs write better code if you keep asking them to “write better code”?

Variation in Code Quality Across Languages & Domains

  • Experiences vary widely by language: good results reported for Arduino, Python, web frontend; poor for Ruby, Rust, Android/Kotlin, and some OpenSCAD tasks.
  • Models often produce “beginner”/tutorial-style code, pick outdated or inappropriate libraries, and use deprecated APIs unless guided.
  • Some see this as a sensible default for novice users; others say it makes LLM-written code unusable without strong prior expertise.

How People Actually Use LLMs

  • Productive uses: autocomplete (e.g., Copilot), boilerplate, small utilities, unit tests, refactors, and rubber-ducking/brainstorming.
  • Several treat LLMs as “brilliant but unreliable interns” or “professors holding office hours”: great for ideas, not for paste-in code.
  • Others rely heavily on them for unfamiliar stacks to build working prototypes much faster, accepting extra review and fixes.

Iterative Improvement & “Write Better Code”

  • Many confirm that iterative refinement (“improve this”, “optimize this”, add tests, run, repeat) yields substantially better code.
  • However, simply asking “write better code” can:
    • Help converge toward more efficient or structured solutions, or
    • Degrade working code, especially when no tests are enforced.
  • Human reviewers often find simpler, more impactful optimizations than the model, highlighting the need for human judgment.
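The safeguard implied above is to gate each “improved” revision behind tests. A minimal, runnable sketch of that loop, where `ask_llm` is a hypothetical stand-in for a real model call (here it just replays canned candidate revisions):

```python
from typing import List


def ask_llm(prompt: str, candidates: List[str]) -> str:
    # Hypothetical model call: return the next canned revision.
    return candidates.pop(0)


def passes_tests(src: str) -> bool:
    # Run the candidate and a unit check in an isolated namespace.
    ns: dict = {}
    try:
        exec(src, ns)
        return ns["double"](21) == 42
    except Exception:
        return False


def refine(initial: str, candidates: List[str], rounds: int = 3) -> str:
    best = initial
    for _ in range(rounds):
        if not candidates:
            break
        candidate = ask_llm("write better code:\n" + best, candidates)
        # Accept a revision only if tests still pass; without this gate,
        # "better" code can silently degrade working code.
        if passes_tests(candidate):
            best = candidate
    return best


broken = "def double(x): return x + x + 1"  # fails the test, rejected
working = "def double(x): return x * 2"     # passes, accepted
result = refine("def double(x): return 0", [broken, working])
```

The key design choice is that the loop never trusts the model’s claim of improvement: a failing candidate is discarded and the last known-good version is kept.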

Execution, Testing, and Tooling

  • Core limitation noted: base LLMs cannot natively run arbitrary code; they “fly blind” without an external sandbox.
  • Multiple tools/agents (IDE integrations, Aider, Cursor, Devin, Gemini/Claude/ChatGPT code interpreters) run code, read compiler/test output, and loop automatically.
  • Strong view that serious agents must operate inside the developer’s environment and under version control (e.g., via git).
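The agent pattern these tools automate can be sketched in a few lines: execute the candidate in a subprocess, capture the traceback, and feed it back for a fix. Here `fake_fixer` is a hypothetical stand-in for the LLM call; real agents send the stderr text back to the model.

```python
import subprocess
import sys


def run(src: str):
    # Execute the candidate in a fresh interpreter and capture stderr,
    # so the loop is not "flying blind".
    proc = subprocess.run(
        [sys.executable, "-c", src],
        capture_output=True, text=True, timeout=10,
    )
    return proc.returncode, proc.stderr


def fake_fixer(src: str, stderr: str) -> str:
    # Stand-in for an LLM: repair the NameError the traceback reports.
    if "NameError" in stderr:
        return src.replace("pritn", "print")
    return src


code = 'pritn("hello")'
for _ in range(3):  # bounded retry loop, as agent tools use
    rc, err = run(code)
    if rc == 0:
        break
    code = fake_fixer(code, err)
```

The bounded loop matters: without a retry cap (and, in real tools, version control to roll back), an agent can thrash indefinitely on its own broken output.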

What “Better Code” Means

  • Disagreement over metrics: speed vs readability vs simplicity vs maintainability.
  • Some criticize optimizing toy Python tasks as misleading; they’d prefer idiomatic, clear code unless profiling shows a bottleneck.
  • Others value LLMs for quickly finding performance tricks once the problem and benchmarks are well specified.
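The “profile before optimizing” position above is easy to act on with the standard library: benchmark the idiomatic version against the proposed “optimization” on representative data before accepting either. A toy sketch (the workload and both implementations are illustrative, not from the article):

```python
import timeit

data = list(range(10_000))


def clear_version(nums):
    # Idiomatic, readable: generator expression with built-in sum().
    return sum(n for n in nums if n % 2 == 0)


def loop_version(nums):
    # Hand-rolled loop an LLM might propose as an "optimization".
    total = 0
    for n in nums:
        if n % 2 == 0:
            total += n
    return total


# Both must agree before timing is even worth discussing.
assert clear_version(data) == loop_version(data)

t_clear = timeit.timeit(lambda: clear_version(data), number=100)
t_loop = timeit.timeit(lambda: loop_version(data), number=100)
```

Only if the measured gap is large, and the call site is a real bottleneck, does trading readability for speed make sense.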

Capabilities, Limits, and Prompting

  • Debate over whether LLMs “think” or merely pattern-match; some argue they learn real algorithms and world models, others insist they’re stochastic parrots.
  • Prompting strategies that often help: ask for an architecture/plan first, specify libraries and versions, ask for likely pitfalls up front, or require tests and type annotations.
  • Emotional or threatening prompts sometimes appear to improve effort, but many see this as unreliable “prompt voodoo” rather than principled control.
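The structured-prompting strategies above can be made concrete with a small prompt builder. This is a hypothetical illustration, not a prescribed template: it pins library versions, asks for a plan and pitfalls first, and requires tests and type hints.

```python
from typing import Dict


def build_prompt(task: str, libraries: Dict[str, str]) -> str:
    # Pin exact versions so the model doesn't reach for outdated
    # or deprecated APIs.
    pins = ", ".join(f"{lib}=={ver}" for lib, ver in libraries.items())
    return "\n".join([
        "First outline the architecture, then implement it.",
        f"Task: {task}",
        f"Use only these libraries, at these versions: {pins}.",
        "List likely pitfalls before writing the code.",
        "Include type annotations and pytest unit tests.",
    ])


# Example usage with an illustrative task and version pin.
prompt = build_prompt("parse a CSV of orders", {"pandas": "2.2.3"})
```

Unlike “prompt voodoo”, each line here encodes a checkable constraint the reviewer can verify in the model’s output.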