Andrej Karpathy: Software in the era of AI [video]

Software “1.0 / 2.0 / 3.0” and roles of AI

  • Many commenters like the framing that ML models (2.0) and LLMs/agents (3.0) are additional tools, not replacements: code, weights, and prompts will coexist.
  • Others argue the “versioning” metaphor is misleading because it implies linear improvement and displacement, whereas older paradigms persist (like assembly or the web).
  • Several propose rephrasing (contrasted in the sketch after this list):
    • 1.0 = precise code for precisely specified problems.
    • 2.0 = learned models for problems defined by examples.
    • 3.0 = natural-language specification of goals and behavior.
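
As a rough illustration of the three paradigms applied to one task: the function, model, and prompt below are hypothetical placeholders, not anything from the talk or the thread.

```python
# Toy contrast of the three "software" paradigms on a single task: spam detection.
# Every name here is a hypothetical placeholder.

def is_spam_v1(text: str) -> bool:
    """Software 1.0: hand-written rules for a precisely specified problem."""
    banned = {"free money", "wire transfer", "act now"}
    return any(phrase in text.lower() for phrase in banned)

def is_spam_v2(text: str, model) -> bool:
    """Software 2.0: a learned classifier; behavior is defined by training examples."""
    return model.predict([text])[0] == "spam"  # e.g. a scikit-learn-style classifier

def is_spam_v3(text: str, llm) -> bool:
    """Software 3.0: the "program" is a natural-language specification handed to an LLM."""
    prompt = f"Answer only 'yes' or 'no': is the following message spam?\n\n{text}"
    return llm(prompt).strip().lower().startswith("yes")
```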

LLMs for coding, structured outputs, and “vibe coding”

  • Strong interest in structured outputs / JSON mode / constrained decoding as a way to make LLMs reliable components in pipelines and avoid brittle parsing (a validation sketch follows this list).
  • Experiences are mixed: some report big gains (classification, extraction, function calling), others show concrete failures (misclassified ingredients, dropped fields) even with schemas and post‑processing.
  • “Vibe coding” (natural-language-driven app building) is seen by some as empowering and a good way to prototype or learn; others see it as unmaintainable code generation that just moves developers into low‑value reviewing of sloppy PRs.
  • There’s debate over whether LLM-assisted code is ever “top-tier” quality, and whether multiple AI-generated PR variants are helpful or just more review burden.
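
A mitigation several commenters rely on is validating model output against a schema and retrying on failure. Below is a minimal sketch using pydantic; the call_llm argument stands in for whatever LLM client is in use and is purely hypothetical.

```python
from pydantic import BaseModel, ValidationError

class Ingredient(BaseModel):
    name: str
    quantity: float
    unit: str

class Recipe(BaseModel):
    title: str
    ingredients: list[Ingredient]

def extract_recipe(text: str, call_llm, max_retries: int = 2) -> Recipe:
    """Ask the model for JSON matching the Recipe schema and validate it;
    on failure, feed the validation error back so the model can correct itself."""
    prompt = (
        "Extract the recipe from the text below as JSON matching this schema:\n"
        f"{Recipe.model_json_schema()}\n\nText:\n{text}"
    )
    for _ in range(max_retries + 1):
        raw = call_llm(prompt)  # hypothetical LLM client returning a string
        try:
            return Recipe.model_validate_json(raw)
        except ValidationError as err:
            prompt += (
                f"\n\nYour previous output failed validation:\n{err}\n"
                "Return corrected JSON only."
            )
    raise ValueError("model never produced schema-valid JSON")
```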

Determinism, debugging, and formal methods

  • A recurring concern: LLM-based systems are hard to debug and reason about; unlike traditional code, you can’t step through to find why a specific edge case fails.
  • Some push for tighter verification loops, including formal methods and “AI on a tight leash” (AI proposes, formal systems or tests verify; a minimal loop is sketched after this list).
  • Others argue English (or natural language generally) is fundamentally ambiguous and cannot replace formal languages for safety-critical or complex systems, warning of a drift back toward “magical thinking.”
  • Counterpoint: most real software already depends on non-deterministic components (APIs, hardware, ML models), so the real issue is designing robust verification and isolation layers, not banning probabilistic tools.
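
The “tight leash” idea amounts to a generate-and-verify loop: the model proposes code, and a deterministic checker (a test suite, type checker, or prover) accepts or rejects it. A minimal sketch, assuming pytest is installed and generate_patch is a hypothetical LLM call:

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

def propose_and_verify(task: str, tests_dir: str, generate_patch, max_attempts: int = 3) -> str | None:
    """Generate-and-verify loop: the model proposes, pytest decides.
    generate_patch(task, feedback) -> source string (hypothetical LLM call).
    tests_dir holds a human-written test suite that imports 'solution'."""
    feedback = ""
    for _ in range(max_attempts):
        code = generate_patch(task, feedback)
        with tempfile.TemporaryDirectory() as tmp:
            shutil.copytree(tests_dir, Path(tmp) / "tests")
            Path(tmp, "solution.py").write_text(code)
            # The fixed test suite is the deterministic "leash": it must pass unchanged.
            result = subprocess.run(
                ["pytest", "tests", "-q"], cwd=tmp, capture_output=True, text=True
            )
        if result.returncode == 0:
            return code
        feedback = (result.stdout + result.stderr)[-2000:]  # failures fed back to the model
    return None  # nothing verified; hand the task back to a human
```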

Interfaces, UX, and llms.txt

  • Several latch onto the analogy between today’s chat UIs and 1960s terminals: powerful backend, weak interface.
  • New ideas discussed: dynamic, LLM-generated GUIs per task; “malleable” personal interfaces; LLMs orchestrating tools behind the scenes. Concerns focus on non-deterministic, constantly shifting UIs being unlearnable and ripe for dark patterns.
  • The proposed llms.txt standard for sites is widely discussed:
    • Enthusiasts like the idea of clean, LLM‑oriented descriptions and API instructions.
    • Critics worry about the file diverging from the HTML site, about gaming or misalignment between the human and machine views, and about adding yet another root-level file instead of using /.well-known/ (see the fetch sketch after this list).
    • Broader lament that the human web is being sidelined (apps, SEO, social feeds) while machines get the “good,” structured view.
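
For anyone experimenting, retrieving a site’s llms.txt is a one-liner; the sketch below also tries the /.well-known/ location some critics prefer. Both paths come from the discussion; neither is a settled standard.

```python
import urllib.error
import urllib.request

def fetch_llms_txt(origin: str) -> str | None:
    """Try the root-level llms.txt first, then the /.well-known/ variant
    argued for in the thread. Returns the file body, or None if neither exists."""
    for path in ("/llms.txt", "/.well-known/llms.txt"):
        try:
            with urllib.request.urlopen(origin.rstrip("/") + path, timeout=10) as resp:
                if resp.status == 200:
                    return resp.read().decode("utf-8", errors="replace")
        except urllib.error.URLError:
            continue
    return None

# Example: print(fetch_llms_txt("https://example.com"))
```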

Self-driving, world models, and analogies

  • The self-driving segment triggers a technical debate:
    • Some think a single generalist, multimodal model (“drive safely”) could eventually subsume specialized stacks.
    • Others argue driving is a tightly constrained, high-speed, partial-information control problem where specialized architectures and physics-based prediction (world models, MuZero-style planning, 3D state spaces) remain superior.
  • Broader skepticism about analogies:
    • Electricity/OS/mainframe metaphors are seen as insightful by some, but nitpicked or rejected by others as historically inaccurate or overextended.
    • One line of critique: these analogies obscure who actually controls LLMs (corporations, sometimes governments), even while the talk emphasizes “power to ordinary people.”

Power, diffusion, and centralization

  • Disagreement over whether LLMs truly “flip” tech diffusion:
    • Supporters note early mass consumer use (boiling eggs, homework, small scripts) versus the historical pattern of government/military first use (cryptography, computing, GPS).
    • Skeptics stress that model training, data access, and infrastructure are dominated by large corporations and governments; open‑weights remain dependent on corporate-scale datasets and compute.
  • Some worry that concentration of model power plus agentic capabilities will further entrench big platforms, not democratize software.

Limits, brittleness, and skepticism

  • Many practitioners report that current LLMs often “almost work” but fail in subtle ways: wrong math, off‑by‑one bugs, dropped fields, mis-normalized data, or plausible but incorrect logic.
  • There’s pushback against “AI as electricity” or “near-AGI” narratives:
    • People compare the hype to crypto and metaverse bubbles.
    • Some point to high-profile “AI coding” experiments at large companies where AI-generated PRs required intense human micromanagement and added little value.
  • Nonetheless, others share compelling use cases: faster test scaffolding, refactors, documentation, data munging, bespoke scripts, and domain-specific helpers, especially when paired with good rules files and schemas.

Future of work, education, and small models

  • Concern that widespread “vibe coding” and AI code generation will kill entry-level roles, deskill developers into PR reviewers, and worsen long‑term code quality.
  • Others say the main shift is that domain experts (doctors, teachers, small business owners) can build narrow tools without learning full-stack development, with engineers focusing more on architecture, verification, and “context wrangling.”
  • Debate on small/local models:
    • Some argue rapid improvement (e.g., compact models) will make on-device AI a real alternative to centralized “mainframes,” especially once good enough for many tasks.
    • Others counter that frontier cloud models remain far ahead in capability, and running strong local models is still costly and technically demanding.

DevOps, deployment, and enterprise concerns

  • Several note a practical friction: adding AI to an app often forces teams to build backends just to safely proxy LLM APIs and manage keys, tests, and logging—undermining “frontend-only” or “no-backend” development.
  • Ideas for “Firebase for LLMs” or platform features to handle secure proxying, rate limiting, and tool orchestration are floated (a minimal proxy sketch follows this list).
  • Enterprise and regulated settings raise special worries:
    • How to certify safety, security, and compliance if parts of systems are non-deterministic and poorly understood, or if vendors themselves rely heavily on LLM-generated internals.
    • How to maintain and evolve systems where no human fully understands the code the agent originally wrote.
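
A minimal sketch of such a proxy using FastAPI and httpx; the endpoint name, upstream URL, and rate limit are placeholders rather than any real product.

```python
import os
import time

import httpx
from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
UPSTREAM_URL = "https://llm-provider.example/v1/chat"  # placeholder provider endpoint
API_KEY = os.environ["LLM_API_KEY"]                    # secret stays server-side
_last_call: dict[str, float] = {}                      # naive per-client rate limiting

@app.post("/proxy/chat")
async def proxy_chat(request: Request):
    """Forward a chat request to the provider without exposing the API key,
    with a crude one-request-per-second limit and basic server-side logging."""
    client = request.client.host if request.client else "unknown"
    now = time.monotonic()
    if now - _last_call.get(client, 0.0) < 1.0:
        raise HTTPException(status_code=429, detail="rate limited")
    _last_call[client] = now

    payload = await request.json()
    async with httpx.AsyncClient(timeout=60) as http:
        upstream = await http.post(
            UPSTREAM_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            json=payload,
        )
    print(f"{client} -> {UPSTREAM_URL} [{upstream.status_code}]")  # stand-in for real logging
    return upstream.json()
```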