Can LLMs write better code if you keep asking them to “write better code”?

Variation in Code Quality Across Languages & Domains

  • Experiences vary widely by language: good results reported for Arduino, Python, web frontend; poor for Ruby, Rust, Android/Kotlin, and some OpenSCAD tasks.
  • Models often produce “beginner”/tutorial-style code, pick outdated or inappropriate libraries, and use deprecated APIs unless guided.
  • Some see this as a sensible default for novice users; others say it makes LLM-written code unusable without strong prior expertise.

How People Actually Use LLMs

  • Productive uses: autocomplete (e.g., Copilot), boilerplate, small utilities, unit tests, refactors, and rubber-ducking/brainstorming.
  • Several treat LLMs as “brilliant but unreliable interns” or “professors holding office hours”: great for ideas, not for paste-in code.
  • Others rely heavily on them for unfamiliar stacks to build working prototypes much faster, accepting extra review and fixes.

Iterative Improvement & “Write Better Code”

  • Many confirm that iterative refinement (“improve this”, “optimize this”, add tests, run, repeat) yields substantially better code.
  • However, simply asking “write better code” can:
    • Help converge toward more efficient or structured solutions, or
    • Degrade working code, especially when no tests are enforced.
  • Human reviewers often find simpler, more impactful optimizations than the model, highlighting the need for human judgment.
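The safeguard implied above is to gate each “improved” revision behind tests. A minimal, runnable sketch of that loop, where `ask_llm` is a hypothetical stand-in for a real model call (here it just replays canned candidate revisions):

```python
from typing import List


def ask_llm(prompt: str, candidates: List[str]) -> str:
    # Hypothetical model call: return the next canned revision.
    return candidates.pop(0)


def passes_tests(src: str) -> bool:
    # Run the candidate and a unit check in an isolated namespace.
    ns: dict = {}
    try:
        exec(src, ns)
        return ns["double"](21) == 42
    except Exception:
        return False


def refine(initial: str, candidates: List[str], rounds: int = 3) -> str:
    best = initial
    for _ in range(rounds):
        if not candidates:
            break
        candidate = ask_llm("write better code:\n" + best, candidates)
        # Accept a revision only if tests still pass; without this gate,
        # "better" code can silently degrade working code.
        if passes_tests(candidate):
            best = candidate
    return best


broken = "def double(x): return x + x + 1"  # fails the test, rejected
working = "def double(x): return x * 2"     # passes, accepted
result = refine("def double(x): return 0", [broken, working])
```

The key design choice is that the loop never trusts the model’s claim of improvement: a failing candidate is discarded and the last known-good version is kept.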

Execution, Testing, and Tooling

  • Core limitation noted: base LLMs cannot natively run arbitrary code; they “fly blind” without an external sandbox.
  • Multiple tools/agents (IDE integrations, Aider, Cursor, Devin, Gemini/Claude/ChatGPT code interpreters) run code, read compiler/test output, and loop automatically.
  • Strong view that serious agents must operate inside the developer’s environment and under version control (e.g., via git).
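The agent pattern these tools automate can be sketched in a few lines: execute the candidate in a subprocess, capture the traceback, and feed it back for a fix. Here `fake_fixer` is a hypothetical stand-in for the LLM call; real agents send the stderr text back to the model.

```python
import subprocess
import sys


def run(src: str):
    # Execute the candidate in a fresh interpreter and capture stderr,
    # so the loop is not "flying blind".
    proc = subprocess.run(
        [sys.executable, "-c", src],
        capture_output=True, text=True, timeout=10,
    )
    return proc.returncode, proc.stderr


def fake_fixer(src: str, stderr: str) -> str:
    # Stand-in for an LLM: repair the NameError the traceback reports.
    if "NameError" in stderr:
        return src.replace("pritn", "print")
    return src


code = 'pritn("hello")'
for _ in range(3):  # bounded retry loop, as agent tools use
    rc, err = run(code)
    if rc == 0:
        break
    code = fake_fixer(code, err)
```

The bounded loop matters: without a retry cap (and, in real tools, version control to roll back), an agent can thrash indefinitely on its own broken output.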

What “Better Code” Means

  • Disagreement over metrics: speed vs readability vs simplicity vs maintainability.
  • Some criticize optimizing toy Python tasks as misleading; they’d prefer idiomatic, clear code unless profiling shows a bottleneck.
  • Others value LLMs for quickly finding performance tricks once the problem and benchmarks are well specified.
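The “profile before optimizing” position above is easy to act on with the standard library: benchmark the idiomatic version against the proposed “optimization” on representative data before accepting either. A toy sketch (the workload and both implementations are illustrative, not from the article):

```python
import timeit

data = list(range(10_000))


def clear_version(nums):
    # Idiomatic, readable: generator expression with built-in sum().
    return sum(n for n in nums if n % 2 == 0)


def loop_version(nums):
    # Hand-rolled loop an LLM might propose as an "optimization".
    total = 0
    for n in nums:
        if n % 2 == 0:
            total += n
    return total


# Both must agree before timing is even worth discussing.
assert clear_version(data) == loop_version(data)

t_clear = timeit.timeit(lambda: clear_version(data), number=100)
t_loop = timeit.timeit(lambda: loop_version(data), number=100)
```

Only if the measured gap is large, and the call site is a real bottleneck, does trading readability for speed make sense.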

Capabilities, Limits, and Prompting

  • Debate over whether LLMs “think” or merely pattern-match; some argue they learn real algorithms and world models, others insist they’re stochastic parrots.
  • Prompting strategies that often help: ask for an architecture/plan first, specify libraries and versions, ask for likely pitfalls up front, or require tests and type annotations.
  • Emotional or threatening prompts sometimes appear to improve effort, but many see this as unreliable “prompt voodoo” rather than principled control.
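The structured-prompting strategies above can be made concrete with a small prompt builder. This is a hypothetical illustration, not a prescribed template: it pins library versions, asks for a plan and pitfalls first, and requires tests and type hints.

```python
from typing import Dict


def build_prompt(task: str, libraries: Dict[str, str]) -> str:
    # Pin exact versions so the model doesn't reach for outdated
    # or deprecated APIs.
    pins = ", ".join(f"{lib}=={ver}" for lib, ver in libraries.items())
    return "\n".join([
        "First outline the architecture, then implement it.",
        f"Task: {task}",
        f"Use only these libraries, at these versions: {pins}.",
        "List likely pitfalls before writing the code.",
        "Include type annotations and pytest unit tests.",
    ])


# Example usage with an illustrative task and version pin.
prompt = build_prompt("parse a CSV of orders", {"pandas": "2.2.3"})
```

Unlike “prompt voodoo”, each line here encodes a checkable constraint the reviewer can verify in the model’s output.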