AI agents: Less capability, more reliability, please

Human-in-the-loop and workflows vs “general agents”

  • Strong support for narrow, deterministic workflows with clear steps and human checkpoints, rather than open-ended autonomous agents.
  • Many argue the right pattern is: define explicit processes, insert LLM calls as tools inside them, and focus on reversibility (git, undo, containers, CRDTs) instead of pretending errors won’t happen.
  • Some see “agents” as emergent from stitched workflows, not a fundamentally separate thing.
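The "explicit process with LLM steps plus reversibility" pattern can be made concrete. A minimal sketch, assuming a hypothetical `Workflow` wrapper: every state change is paired with an undo action, so an LLM-driven step can be rolled back at a human checkpoint instead of being trusted blindly.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Workflow:
    """Deterministic steps with an undo log, so any step (including an
    LLM-backed one) can be reverted -- design for reversibility, not
    for the absence of errors."""
    undo_log: List[Callable[[], None]] = field(default_factory=list)

    def apply(self, do: Callable[[], None], undo: Callable[[], None]) -> None:
        do()                        # run the step (may wrap an LLM call)
        self.undo_log.append(undo)  # record how to reverse it

    def rollback(self) -> None:
        while self.undo_log:        # reverse steps in LIFO order
            self.undo_log.pop()()


# Usage: treat an LLM-drafted change like any other reversible step.
files: dict = {}
wf = Workflow()
wf.apply(do=lambda: files.update({"draft.txt": "llm output"}),
         undo=lambda: files.pop("draft.txt"))
# A human checkpoint would inspect `files` here; on rejection:
wf.rollback()
assert files == {}
```

The same idea scales up with heavier machinery (git commits, container snapshots, CRDTs); the undo log here is just the smallest version of it.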

Agent UX vs existing tools

  • Repeated complaint: “book a flight” or “order groceries” agents are worse than mature UIs like Google Flights, Uber, or a basic web search.
  • Natural-language interfaces are unbounded and hard to test; they also obscure what the system can and can’t do, making trust harder.
  • Comparisons to 2016-era chatbots and touchscreens in cars: shiny, but often a UX regression.
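The testability point can be made concrete: a structured query has an enumerable input space, while a free-text request does not. A sketch, with the `FlightQuery` type and its fields invented for illustration:

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FlightQuery:
    """A bounded interface: every field has a known type and range, so
    invalid states can be rejected up front and tests can enumerate them."""
    origin: str   # IATA code, e.g. "SFO"
    dest: str
    depart: date

    def validate(self) -> bool:
        return (len(self.origin) == 3 and self.origin.isalpha()
                and len(self.dest) == 3 and self.dest.isalpha()
                and self.origin != self.dest)


# The structured query is trivially testable:
assert FlightQuery("SFO", "JFK", date(2025, 6, 1)).validate()
assert not FlightQuery("SFO", "SFO", date(2025, 6, 1)).validate()
# A natural-language request ("get me to New York cheap-ish, ideally
# Tuesday") has no such validator: the space of inputs -- and of intents
# behind them -- is unbounded, which is what makes trust hard.
```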

Coding assistants and “vibe coding”

  • Many developers only accept AI-written code they can understand, or at least cover with solid tests.
  • Others are comfortable with “vibe coding” for disposable/one-off analysis and scripts, treating LLMs like a black-box library, as long as behavior passes tests.
  • The Cursor/git wipeout anecdote splits opinion: some blame users’ lack of version control basics; others say tools must embed guardrails for catastrophic operations.
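The "black-box library, as long as behavior passes tests" stance implies a concrete acceptance gate. A sketch of one, using `sorted` as a stand-in trusted oracle (the function names are illustrative, not from any real tool):

```python
import random


def accept_if_passes(candidate, cases, trials=200):
    """Treat `candidate` (e.g. LLM-written code) as a black box: accept
    it only if it matches a trusted oracle on fixed cases plus a batch
    of randomly generated ones."""
    oracle = sorted  # trusted reference implementation
    for xs in cases:
        if candidate(xs) != oracle(xs):
            return False
    for _ in range(trials):
        xs = [random.randint(-100, 100) for _ in range(random.randint(0, 20))]
        if candidate(xs) != oracle(xs):
            return False
    return True


# Hypothetical LLM-generated candidates:
llm_sort = lambda xs: sorted(xs)   # matches the oracle, gets accepted
buggy_sort = lambda xs: list(xs)   # fails on any unsorted input

assert accept_if_passes(llm_sort, [[3, 1, 2], []])
assert not accept_if_passes(buggy_sort, [[3, 1, 2]])
```

The catch, of course, is that this only works where a trusted oracle or strong property suite exists; for novel code, writing the tests is the part you still can't outsource.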

Flight booking as a case study

  • Users describe highly personal heuristics: price vs layover length, airport quality, loyalty points, weird award tricks, “hidden city” routes.
  • Many doubt a generic agent can capture these shifting, tacit tradeoffs; they’d accept suggestions, but not blind booking.
  • A few argue that exactly these complex, multi-step optimizations are where agents could shine—if they ever become reliably better than doing it yourself.
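To see why these heuristics resist generic automation, it helps to write one down. A sketch of a single traveler's tradeoff as an explicit cost function; the weights and fields are invented for illustration, and the whole point is that they differ per person and per trip:

```python
from dataclasses import dataclass


@dataclass
class Flight:
    price: float       # fare in USD
    layover_min: int   # total layover minutes
    award_miles: int   # loyalty miles earned
    red_eye: bool


def score(f: Flight, *, mile_value=0.012, layover_cost_per_hr=15.0,
          red_eye_penalty=80.0) -> float:
    """One traveler's heuristic as a cost (lower is better). The weights
    are personal and shift by trip -- exactly the tacit knowledge a
    generic agent would have to learn per user before booking blindly."""
    cost = f.price
    cost += layover_cost_per_hr * f.layover_min / 60  # time is money
    cost -= mile_value * f.award_miles                # points have value
    cost += red_eye_penalty if f.red_eye else 0.0
    return cost


cheap_long = Flight(price=320, layover_min=300, award_miles=2000, red_eye=True)
pricier_direct = Flight(price=420, layover_min=0, award_miles=2500, red_eye=False)
# With these weights the direct flight wins despite the higher fare:
assert score(pricier_direct) < score(cheap_long)
```

An agent that only optimizes the visible field (price) gets this wrong; one that learns the hidden weights would be doing the "complex multi-step optimization" the optimists describe.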

Reliability, evaluation, and error handling

  • Several builders emphasize rigorous, task-specific evals instead of benchmark chasing; “real-world” tests often differ drastically from academic ones.
  • Ideas: retry loops modeled after TCP, specialized sub-agents for “is this irreversible?”, and designing systems around detection and rollback rather than perfection.
  • Skeptics note that hallucinations and probabilistic behavior seem inherent; for high-stakes domains (medicine, finance, travel booking), partial reliability is not acceptable.
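The "TCP-style retry plus irreversibility check" ideas combine naturally. A minimal sketch, assuming a hypothetical action registry: transient failures get exponential backoff, but anything flagged irreversible is refused outright and escalated rather than blindly retried.

```python
import time

# Hypothetical registry: actions an agent must never auto-retry or
# auto-execute, because no rollback exists.
IRREVERSIBLE = {"charge_card", "send_email", "delete_account"}


def run_with_retries(action: str, attempt_fn, max_tries=5, base_delay=0.5):
    """TCP-inspired retry: exponential backoff on transient failure, but
    refuse irreversible actions -- those need detection and human review,
    not repetition."""
    if action in IRREVERSIBLE:
        raise PermissionError(f"{action!r} is irreversible; escalate to a human")
    delay = base_delay
    for attempt in range(max_tries):
        try:
            return attempt_fn()
        except Exception:
            if attempt == max_tries - 1:
                raise  # exhausted: surface the error instead of hiding it
            time.sleep(delay)
            delay *= 2  # exponential backoff, like TCP retransmission timers
```

Usage: `run_with_retries("search_fares", fetch)` retries a flaky lookup, while `run_with_retries("charge_card", pay)` raises immediately. The specialized "is this irreversible?" sub-agent some commenters propose would replace the static set with a learned classifier, which is where the reliability question reappears.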

Incentives, hype, and platform dynamics

  • Widespread criticism of overpromised “AI engineer” / “MCP for everything” products and capability-first demos that barely work in practice.
  • Concern that agent platforms will become new rent-seeking intermediaries, similar to ad-powered search, app stores, or private APIs—especially if businesses differentiate between human and AI “users.”