Agent design is still hard
Frameworks vs. Custom Agent Runtimes
- Many commenters report better outcomes from building minimal, bespoke agent loops rather than adopting heavyweight SDKs (LangChain/LangGraph, MCP-heavy stacks, etc.); a minimal loop of this kind is sketched after this list.
- Core argument: agents quickly become complex (subagents, shared state, reinforcement, context packing), and opaque framework abstractions make debugging and mentally tracing execution harder.
- Counterpoint: others expect agent platforms to converge to “game engine”–style batteries-included systems; for some teams, using solid vendor frameworks (PydanticAI, OpusAgents, ADK, etc.) is already productive.
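To make the “minimal, bespoke loop” argument concrete, here is a rough sketch of the pattern commenters describe, assuming a hypothetical `call_model` function and a plain dict of tool callables rather than any specific vendor SDK; the message shapes and field names are illustrative.

```python
# Minimal agent loop sketch. `call_model` and the message/tool-call shapes are
# hypothetical stand-ins, not a real vendor API; the point is that the whole
# "runtime" fits in one readable function.
import json

def run_agent(call_model, tools, system_prompt, user_goal, max_steps=20):
    """call_model(messages) -> {"content": str, "tool_call": {"name", "args"} or None}."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_goal},
    ]
    for _ in range(max_steps):
        reply = call_model(messages)          # the LLM decides: answer or call a tool
        tool_call = reply.get("tool_call")
        if tool_call is None:
            return reply["content"]           # final answer; loop ends
        name, args = tool_call["name"], tool_call["args"]
        try:
            result = tools[name](**args)      # tools are plain Python functions
        except Exception as exc:              # surface failures to the model, don't crash
            result = f"tool error: {exc}"
        messages.append({"role": "assistant", "content": reply["content"], "tool_call": tool_call})
        messages.append({"role": "tool", "name": name, "content": json.dumps(result, default=str)})
    return "stopped: step budget exhausted"
```

Everything the debugging complaints are about (what went into the prompt, which tool ran, what came back) stays visible in `messages`, which is exactly the visibility commenters say opaque frameworks take away.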
Using Vendor Agents vs. Rolling Your Own
- Strong praise for Claude Code / Agent SDK and similar “opinionated” coding agents: they feel “magic,” especially for code-heavy tasks.
- Some argue most teams shouldn’t build bespoke coding agents that will underperform Claude/ChatGPT; better to invest in tools, context, and a smart proxy around frontier agents.
- Others warn about vendor lock-in, model instability, and reward-hacking / hallucinations; they recommend alternative systems (e.g., Codex, Sourcegraph Amp) and keeping the ability to swap models (a thin provider-adapter sketch follows below).
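One hedge against lock-in that commenters allude to is keeping the model behind a thin interface so providers can be swapped without rewriting agent logic. A sketch under that assumption; the adapter classes are placeholders and the real vendor request/response translation is omitted:

```python
# Thin provider abstraction sketch so agent code never imports a vendor SDK directly.
# The adapters are stubs; the real translation to each vendor's API would go inside them.
from typing import Protocol

class ChatModel(Protocol):
    def complete(self, messages: list[dict], tools: list[dict] | None = None) -> dict:
        """Return {"content": str, "tool_call": dict | None} regardless of vendor."""
        ...

class AnthropicAdapter:
    def __init__(self, client, model: str):
        self.client, self.model = client, model
    def complete(self, messages, tools=None):
        raise NotImplementedError("translate to/from the Anthropic Messages API here")

class OpenAIAdapter:
    def __init__(self, client, model: str):
        self.client, self.model = client, model
    def complete(self, messages, tools=None):
        raise NotImplementedError("translate to/from the OpenAI Chat Completions API here")

def build_model(provider: str, client, model: str) -> ChatModel:
    # Swapping providers becomes a config change, not an agent rewrite.
    adapters = {"anthropic": AnthropicAdapter, "openai": OpenAIAdapter}
    return adapters[provider](client, model)
```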
Agent Architecture, State, and Tools
- Popular minimal pattern: treat an agent as a REPL-style loop (read context, let the LLM decide, execute a tool call or return an answer, repeat), essentially the loop sketched above.
- More advanced setups use:
- Subagents as specialized tools with their own context windows, tools, and sometimes different models.
- Shared “heap” or virtual file systems so tools don’t become dead ends and multiple tools/agents can consume prior state.
- Chatroom- or event-bus-like backends where both client and server publish/subscribe to messages.
- Debate over terminology: some claim “subagent” is just a tool abstraction; others insist subagents differ by control flow, autonomy, and durability.
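A rough sketch of the “subagent as a specialized tool” idea from the list above, reusing the hypothetical `run_agent` loop sketched earlier: the parent only sees a callable, while the subagent gets its own system prompt, narrower tool set, fresh message history, and possibly a cheaper model.

```python
# Subagent-as-a-tool sketch: to the parent this is just another entry in its tool
# registry; internally it is a full agent loop with its own context window.
def make_research_subagent(call_cheap_model, subagent_tools):
    def research(question: str) -> str:
        # Fresh context: the parent's conversation is NOT passed down;
        # only the task goes in and only the final answer comes back.
        return run_agent(
            call_model=call_cheap_model,   # may be a different/cheaper model
            tools=subagent_tools,          # narrower tool set than the parent's
            system_prompt="You are a research assistant. Answer concisely and cite sources.",
            user_goal=question,
            max_steps=10,
        )
    return research

# Illustrative wiring (names are made up):
# parent_tools["research"] = make_research_subagent(cheap_model_fn, {"web_search": web_search})
```

The terminology debate largely turns on what else wraps this callable: give it its own budget, persistence, and the ability to run asynchronously, and people start calling it a subagent rather than a tool.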
Caching, Memory, and Context Windows
- A distinction is drawn between caching (a cost/latency optimization over distributed state) and “memory.”
- A virtual FS plus explicit caching is used to avoid recomputation and enable cross-tool workflows (see the sketch after this list).
- Several note that huge modern context windows and built-in reasoning/tool-calling have already obsoleted earlier chunking/RAG patterns.
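A sketch of the virtual FS plus explicit caching idea mentioned above, with made-up names: tool outputs land at paths in a shared in-memory store, the transcript only carries short handles, and repeated identical calls hit the store instead of recomputing.

```python
# Virtual file system / explicit cache sketch: tools write results to paths and the
# model sees only small "vfs://" handles plus whatever it explicitly reads back.
import hashlib, json

class VirtualFS:
    def __init__(self):
        self._files: dict[str, str] = {}

    def exists(self, path: str) -> bool:
        return path in self._files

    def write(self, path: str, content: str) -> str:
        self._files[path] = content
        return f"vfs://{path} ({len(content)} bytes)"   # short handle for the transcript

    def read(self, path: str, start: int = 0, length: int = 4000) -> str:
        return self._files[path][start:start + length]  # paged reads keep context small

def cached(vfs: VirtualFS, tool_fn):
    """Wrap a tool so repeated identical calls return the stored result's path."""
    def wrapper(**kwargs):
        key = tool_fn.__name__ + "/" + hashlib.sha256(
            json.dumps(kwargs, sort_keys=True).encode()).hexdigest()[:16]
        if not vfs.exists(key):
            vfs.write(key, str(tool_fn(**kwargs)))      # compute once
        return f"vfs://{key}"                           # later tools/agents read on demand
    return wrapper
```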
Tool Schemas, Tree-Sitter, and APIs
- Persistent pain around function I/O types (ints vs. strings, JSON numeric precision, nested dicts) and framework inconsistencies (e.g., OpenAI docs vs. SDK behavior, ADK numeric issues); an argument-validation sketch follows this list.
- Question about why coding agents don’t use tree-sitter more; responses:
- LLMs are heavily RL’d on shells/grep and do well with “agentic search.”
- AST-based tools can bloat context and sometimes degrade performance; keeping them as optional tools may be best.
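On the I/O-type pain in the first bullet of this list: a common mitigation is to validate and coerce arguments at the tool boundary instead of trusting the raw tool-call payload. A minimal sketch using pydantic; the model, its fields, and the `run_search` tool are made up for illustration.

```python
# Coerce/validate tool-call arguments at the boundary: pydantic turns "3" into 3,
# rejects malformed nested structures, and yields an error the model can retry on.
from pydantic import BaseModel, Field, ValidationError

class SearchArgs(BaseModel):
    query: str
    max_results: int = Field(default=10, ge=1, le=100)   # "10" (a string) coerces to 10
    filters: dict[str, str] = Field(default_factory=dict)

def dispatch_search(raw_args: dict) -> str:
    try:
        args = SearchArgs(**raw_args)
    except ValidationError as exc:
        # Feed the error back to the model instead of crashing the loop.
        return f"invalid arguments: {exc.errors()}"
    return run_search(args.query, args.max_results, args.filters)  # hypothetical tool
```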
Testing, Evals, and Observability
- Broad agreement that evals for agents are one of the hardest unsolved problems.
- Simple prompt benchmarks don’t capture multi-step, tool-using behavior; evals often need to be run inside the actual runtime using observability traces (OTEL, custom logging).
- Many suspect production agents are shipped after only ad-hoc manual testing and “vibes”; some teams build LLM-as-judge e2e frameworks (sketched below), but acknowledge they’re imperfect and still require human-written scenarios.
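A stripped-down sketch of the LLM-as-judge end-to-end pattern described above: human-written scenarios drive the real agent runtime, the trace is captured, and a second model scores the transcript against each scenario’s success criteria. The scenario data, judge prompt, and wiring are illustrative, and as commenters note the approach is imperfect and still needs humans to write and review the scenarios.

```python
# LLM-as-judge e2e eval sketch: run the real agent, then have a second model grade
# the full trace against human-written success criteria. All wiring is illustrative.
import json

SCENARIOS = [
    {"goal": "Find the latest stable release of the project and report its version.",
     "criteria": "States a concrete version string and explains how it was found."},
]

JUDGE_PROMPT = """You are grading an AI agent's transcript.
Success criteria: {criteria}
Transcript: {transcript}
Reply with JSON: {{"score": <0 to 1>, "reason": "<one sentence>"}}"""

def run_evals(run_agent_with_trace, call_judge_model):
    results = []
    for scenario in SCENARIOS:
        answer, trace = run_agent_with_trace(scenario["goal"])   # real runtime + OTEL/log trace
        verdict = call_judge_model(JUDGE_PROMPT.format(
            criteria=scenario["criteria"],
            transcript=json.dumps(trace)[:20000],                # truncate; judges have limits too
        ))
        results.append({"goal": scenario["goal"], "answer": answer, **json.loads(verdict)})
    return results
```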
Pace of Change and “Wait vs Build”
- One camp: many sophisticated patterns (caching, RAG variants, chain-of-thought tricks) are just stopgaps until models/APIs absorb them; investing heavily now risks being obsoleted in months.
- Other camp: deeply understanding and implementing your own agents today yields durable intuition and product differentiation; “doing nothing” can be more dangerous if your problem is core to your product.
Hype, Capabilities, and Usefulness
- Split sentiment: some report AI has radically changed their workflow (coding, tooling, even full features built by agents); others find LLMs too error-prone beyond small, scoped tasks and see no “amazeballs” applications yet.
- There’s meta-debate over whether agentic systems are overhyped, whether it’s reasonable to wait out the churn, and how much skepticism vs experimentation is healthy.