Andrej Karpathy: Software in the era of AI [video]
Software “1.0 / 2.0 / 3.0” and roles of AI
- Many commenters like the framing that ML models (2.0) and LLMs/agents (3.0) are additional tools, not replacements: code, weights, and prompts will coexist.
- Others argue the “versioning” metaphor is misleading because it implies linear improvement and displacement, whereas older paradigms persist, just as assembly and the web still do.
- Several propose rephrasing (made concrete in the sketch after this list):
- 1.0 = precise code for precisely specified problems.
- 2.0 = learned models for problems defined by examples.
- 3.0 = natural-language specification of goals and behavior.
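To make the proposed rephrasing concrete, here is one toy problem (sentiment labeling) expressed in each paradigm. This is an illustrative sketch, not from the talk; the 2.0 snippet uses scikit-learn, and the 3.0 “program” is just a prompt for a hypothetical LLM call.

```python
# 1.0: precise code for a precisely specified problem.
def sentiment_v1(text: str) -> int:
    """Hand-written rule: 1 for positive, 0 otherwise."""
    return 1 if "great" in text.lower() else 0

# 2.0: behavior defined by labeled examples, stored as learned weights.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

examples = ["great movie", "loved it", "terrible film", "waste of time"]
labels = [1, 1, 0, 0]  # toy training set; real 2.0 systems need far more data
sentiment_v2 = make_pipeline(CountVectorizer(), LogisticRegression())
sentiment_v2.fit(examples, labels)

# 3.0: a natural-language specification, interpreted by an LLM at runtime.
SENTIMENT_V3_PROMPT = (
    "Label the following review as 1 (positive) or 0 (negative). "
    "Reply with the digit only.\n\nReview: {text}"
)
```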
LLMs for coding, structured outputs, and “vibe coding”
- Strong interest in structured outputs / JSON mode / constrained decoding as a way to make LLMs reliable components in pipelines and avoid brittle parsing (see the validation sketch after this list).
- Experiences are mixed: some report big gains (classification, extraction, function calling), others show concrete failures (misclassified ingredients, dropped fields) even with schemas and post‑processing.
- “Vibe coding” (natural-language-driven app building) is seen by some as empowering and a good way to prototype or learn; others see it as unmaintainable code generation that just moves developers into low‑value reviewing of sloppy PRs.
- There’s debate over whether LLM-assisted code is ever “top-tier” quality, and whether multiple AI-generated PR variants are helpful or just more review burden.
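Several comments describe a validate-and-retry pattern for structured outputs. A minimal sketch, assuming a hypothetical recipe-extraction schema and a call_llm helper standing in for whatever client the pipeline actually uses:

```python
import json

import jsonschema  # pip install jsonschema

# Hypothetical schema for a recipe-extraction pipeline.
SCHEMA = {
    "type": "object",
    "properties": {
        "ingredient": {"type": "string"},
        "quantity": {"type": "number"},
        "unit": {"type": "string"},
    },
    "required": ["ingredient", "quantity", "unit"],
    "additionalProperties": False,
}

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for an actual LLM client call."""
    raise NotImplementedError

def parse_structured(raw: str, max_retries: int = 2) -> dict:
    """Validate model output against the schema; feed errors back on failure."""
    for attempt in range(max_retries + 1):
        try:
            data = json.loads(raw)
            jsonschema.validate(instance=data, schema=SCHEMA)
            return data  # passed the deterministic gate
        except (json.JSONDecodeError, jsonschema.ValidationError) as err:
            if attempt == max_retries:
                raise  # surface the failure instead of guessing
            raw = call_llm(f"Fix this JSON to match the schema. Error: {err}\n{raw}")
```

Note that schema validation catches dropped fields and type errors but not semantically wrong values, which matches the mixed experiences reported above.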
Determinism, debugging, and formal methods
- A recurring concern: LLM-based systems are hard to debug and reason about; unlike traditional code, you can’t step through to find why a specific edge case fails.
- Some push for tighter verification loops, including formal methods and “AI on a tight leash” (AI proposes; formal systems or tests verify); one such loop is sketched after this list.
- Others argue English (or natural language generally) is fundamentally ambiguous and cannot replace formal languages for safety-critical or complex systems, warning of a drift back toward “magical thinking.”
- Counterpoint: most real software already depends on non-deterministic components (APIs, hardware, ML models), so the real issue is designing robust verification and isolation layers, not banning probabilistic tools.
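A minimal sketch of the “tight leash” loop: the model proposes code, a deterministic test suite verifies it. It assumes a test file test_solution.py that imports the candidate from solution.py; both names, and the generate_code helper, are illustrative.

```python
import pathlib
import subprocess

def generate_code(prompt: str) -> str:
    """Hypothetical stand-in for an LLM code-generation call."""
    raise NotImplementedError

def propose_and_verify(task: str, max_rounds: int = 3) -> str | None:
    """AI proposes; tests dispose. Returns verified code, or None."""
    feedback = ""
    for _ in range(max_rounds):
        code = generate_code(task + feedback)  # probabilistic step
        pathlib.Path("solution.py").write_text(code)
        result = subprocess.run(               # deterministic gate
            ["pytest", "test_solution.py", "-q"],
            capture_output=True,
            text=True,
        )
        if result.returncode == 0:
            return code                        # only verified code ships
        feedback = "\n\nPrevious attempt failed tests:\n" + result.stdout
    return None                                # give up rather than ship unverified code
```

The loop only verifies what the tests encode, so it narrows, rather than eliminates, the ambiguity concern raised above.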
Interfaces, UX, and llms.txt
- Several latch onto the analogy of today’s chat UIs to 1960s terminals: powerful backend, weak interface.
- New ideas discussed: dynamic, LLM-generated GUIs per task; “malleable” personal interfaces; LLMs orchestrating tools behind the scenes. Concerns focus on non-deterministic, constantly shifting UIs being unlearnable and ripe for dark patterns.
- The proposed llms.txt standard for sites is widely discussed (a minimal example follows this list):
- Enthusiasts like the idea of clean, LLM‑oriented descriptions and API instructions.
- Critics worry about divergence from HTML, gaming or misalignment between human and machine views, and yet another root-level file vs /.well-known/.
- Broader lament that the human web is being sidelined (apps, SEO, social feeds) while machines get the “good,” structured view.
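For reference, the proposal (described at llmstxt.org) is a markdown file served from the site root. A minimal sketch, with invented names and URLs:

```markdown
# Example Project

> One-paragraph summary of what the site or product does, written for machine consumption.

## Docs

- [Quick start](https://example.com/docs/quickstart.md): install and make a first request
- [API reference](https://example.com/docs/api.md): endpoints, auth, rate limits

## Optional

- [Changelog](https://example.com/changelog.md): release history
```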
Self-driving, world models, and analogies
- The self-driving segment triggers a technical debate:
- Some think a single generalist, multimodal model (“drive safely”) could eventually subsume specialized stacks.
- Others argue driving is a tightly constrained, high-speed, partial-information control problem where specialized architectures and physics-based prediction (world models, MuZero-style planning, 3D state spaces) remain superior.
- Broader skepticism about analogies:
- Electricity/OS/mainframe metaphors are seen as insightful by some, but nitpicked or rejected by others as historically inaccurate or overextended.
- One line of critique: these analogies obscure who actually controls LLMs (corporations, sometimes governments), even while the talk emphasizes “power to ordinary people.”
Power, diffusion, and centralization
- Disagreement over whether LLMs truly “flip” tech diffusion:
- Supporters note early mass consumer use (boiling eggs, homework help, small scripts) versus the historical pattern of government/military first use (cryptography, computing, GPS).
- Skeptics stress that model training, data access, and infrastructure are dominated by large corporations and governments; open‑weights remain dependent on corporate-scale datasets and compute.
- Some worry that concentration of model power plus agentic capabilities will further entrench big platforms, not democratize software.
Limits, brittleness, and skepticism
- Many practitioners report that current LLMs often “almost work” but fail in subtle ways: wrong math, off‑by‑one bugs, dropped fields, mis-normalized data, or plausible but incorrect logic.
- There’s pushback against “AI as electricity” or “near-AGI” narratives:
- People compare the hype to crypto and metaverse bubbles.
- Some point to high-profile “AI coding” experiments at large companies where AI-generated PRs required intense human micromanagement and added little value.
- Nonetheless, others share compelling use cases: faster test scaffolding, refactors, documentation, data munging, bespoke scripts, and domain-specific helpers, especially when paired with good rules files and schemas.
Future of work, education, and small models
- Concern that widespread “vibe coding” and AI code generation will kill entry-level roles, deskill developers into PR reviewers, and worsen long‑term code quality.
- Others say the main shift is that domain experts (doctors, teachers, small business owners) can build narrow tools without learning full-stack development, with engineers focusing more on architecture, verification, and “context wrangling.”
- Debate on small/local models:
- Some argue that rapidly improving compact models will make on-device AI a real alternative to centralized “mainframes,” especially once they are good enough for many everyday tasks.
- Others counter that frontier cloud models remain far ahead in capability, and running strong local models is still costly and technically demanding.
DevOps, deployment, and enterprise concerns
- Several note a practical friction: adding AI to an app often forces teams to build backends just to safely proxy LLM APIs and manage keys, tests, and logging—undermining “frontend-only” or “no-backend” development.
- Ideas for “Firebase for LLMs” or platform features to handle secure proxying, rate limiting, and tool orchestration are floated; a minimal proxy sketch closes this section.
- Enterprise and regulated settings raise special worries:
- How to certify safety, security, and compliance if parts of systems are non-deterministic and poorly understood, or if vendors themselves rely heavily on LLM-generated internals.
- How to maintain and evolve systems where no human fully understands the code the agent originally wrote.
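As a sketch of the thin proxy backend discussed above, assuming FastAPI and httpx; the endpoint, provider URL, and rate limits are all illustrative:

```python
import os
import time
from collections import defaultdict

import httpx
from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
UPSTREAM = "https://api.example-llm.com/v1/chat"  # hypothetical provider endpoint
API_KEY = os.environ["LLM_API_KEY"]               # secret key stays server-side
WINDOW_SECONDS, LIMIT = 60.0, 20                  # illustrative: 20 requests/minute per IP
hits: dict[str, list[float]] = defaultdict(list)

@app.post("/api/chat")
async def chat(request: Request) -> dict:
    # Naive in-memory per-IP rate limiting; production would use Redis or similar.
    ip = request.client.host if request.client else "unknown"
    now = time.time()
    hits[ip] = [t for t in hits[ip] if now - t < WINDOW_SECONDS]
    if len(hits[ip]) >= LIMIT:
        raise HTTPException(status_code=429, detail="rate limit exceeded")
    hits[ip].append(now)

    # Forward the request body upstream, attaching the key server-side.
    body = await request.json()
    async with httpx.AsyncClient(timeout=30.0) as client:
        upstream = await client.post(
            UPSTREAM,
            json=body,
            headers={"Authorization": f"Bearer {API_KEY}"},
        )
    return upstream.json()
```

The browser only ever talks to /api/chat; the key and the rate-limit policy never leave the server.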