Scaling LLMs to Larger Codebases

Prompt libraries, context files, and “LLM literacy”

  • Many comments reinforce the article’s point that iteratively improving prompts and context files (e.g., CLAUDE.md) is high-ROI.
  • Others report that agents often ignore or randomly drop these documents from context, even at session start.
  • Some experiment with having the model rewrite instructions into highly structured, repetitive Markdown, which seems easier for models to follow (a sketch follows this list).
  • There’s interest in tools that can “force inject” dynamic rules or manage growing sets of hooks/instructions more deterministically.
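
To make the “highly structured, repetitive Markdown” idea concrete, here is a minimal CLAUDE.md-style sketch; the section names, rules, and make targets are illustrative assumptions, not examples quoted from the thread:

    # Project rules

    ## Always
    - ALWAYS run `make test` before reporting a task as done.
    - ALWAYS keep edits limited to the files named in the task.

    ## Never
    - NEVER delete files, branches, or migrations without asking first.
    - NEVER touch generated code under build/.

    ## Commands
    - Build: make build
    - Test: make test

The repetition (one rule per line, the same ALWAYS/NEVER framing throughout) is the point: several commenters found models follow this style more reliably than prose paragraphs.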

Instruction-following, nondeterminism, and safety

  • A recurring frustration: models sometimes ignore clear instructions or even do the opposite, seemingly at random.
  • This unpredictability is seen as a core unsolved problem for robust workflows, especially on large, multi-step tasks.
  • Several people share horror stories of agents deleting projects or wiping unstaged changes, leading to advice about strict permissions, backups, sandboxing, and blocking destructive commands (a guard sketch follows this list).
  • Some suspect providers are training models to rely more on “intuition,” making explicit instructions less effective.
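
One common shape for “blocking destructive commands” is a denylist wrapper in front of the shell. A minimal Python sketch, where run_guarded and the pattern list are assumptions rather than any particular agent’s API:

    import re
    import subprocess

    # Hypothetical denylist of obviously destructive shell patterns.
    DESTRUCTIVE_PATTERNS = [
        r"\brm\s+-rf\b",
        r"\bgit\s+reset\s+--hard\b",
        r"\bgit\s+clean\b",
        r"\bmkfs\b",
    ]

    def run_guarded(cmd: str) -> int:
        """Refuse any command matching the denylist; run everything else."""
        for pattern in DESTRUCTIVE_PATTERNS:
            if re.search(pattern, cmd, re.IGNORECASE):
                raise PermissionError(f"blocked destructive command: {cmd!r}")
        return subprocess.run(cmd, shell=True, check=False).returncode

    # Usage: route every agent-proposed shell command through the guard.
    # run_guarded("git clean -fdx")   # raises PermissionError
    # run_guarded("pytest -q")        # runs normally

A denylist is only a backstop, which is why the same comments pair it with backups and sandboxed working copies.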

Preferred workflows and agent usage

  • Many avoid free-roaming agents and instead use tightly scoped, one-shot prompts (“write this function,” “change this file”) with manual review.
  • Others report success with explicit multi-phase loops: research → plan (written to a Markdown file) → clear context → implement → clear → review/test (a sketch follows this list).
  • There’s debate over whether elaborate planning loops are necessary with newer models; some say recent models can handle larger tasks with simpler “divide and conquer” prompting.
  • A common theme: separate “planner/architect” behavior from “implementor/typist” behavior, and don’t let the implementor improvise.
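
A compressed Python sketch of that loop; run_agent, PLAN.md, and the task string are hypothetical stand-ins for whatever agent interface and conventions a team actually uses:

    from pathlib import Path

    def run_agent(prompt: str) -> str:
        """Hypothetical stand-in for one agent call started with a fresh context."""
        # A real implementation would invoke a coding agent here.
        return f"(agent output for: {prompt[:50]})"

    task = "add pagination to the /users endpoint"

    # Phase 1: research + plan; persist the plan to Markdown so it survives a context reset.
    plan = run_agent(f"Research the codebase and write a step-by-step plan for: {task}")
    Path("PLAN.md").write_text(plan)

    # Phase 2: fresh context; the implementor follows the written plan and nothing else.
    run_agent("Implement exactly the steps in PLAN.md. Do not improvise beyond them.")

    # Phase 3: fresh context again; review the diff and run the tests.
    print(run_agent("Review the changes against PLAN.md and report test results."))

Writing the plan to disk is what separates the planner/architect role from the implementor/typist role: the implementation phase sees only the plan, not the planner’s exploratory context.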

Codebase structure, frameworks, and context partitioning

  • Several comments argue that the real bottleneck is codebase design: organized, domain-driven, well-documented systems are far easier for agents than messy ones.
  • Highly opinionated frameworks (Rails, etc.) are seen as easier for LLMs than “glue everything yourself” stacks.
  • Others experiment with decomposing large systems into smaller, strongly bounded units (e.g., Nix flakes, libraries) to keep context small and explicit; a layout sketch follows.
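
A hypothetical layout in that spirit, keeping each bounded unit self-describing so an agent can work from one small directory and one small context file; the directory names are illustrative:

    repo/
      billing/           # one bounded domain with its own tests and docs
        CLAUDE.md        # rules and commands for this unit only
        src/
        tests/
      inventory/
        CLAUDE.md
        src/
        tests/
      platform/          # thin shared glue, rarely touched
        CLAUDE.md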

Capabilities, limits, and economics

  • Experiences diverge: some say agents “crush it on large codebases” with the right guidance; others find large-scale agentic editing uneconomical and unreliable versus small, focused tasks.
  • Concerns include silent, subtle mistakes in complex changes, token burn, and the risk of developers learning less if they stop reading and understanding generated code.
  • There’s interest in extended-context models and AST-based “large code models,” but their maturity is unclear in the thread.