Andrej Karpathy: Software in the era of AI [video]

Software “1.0 / 2.0 / 3.0” and roles of AI

  • Many commenters like the framing that ML models (2.0) and LLMs/agents (3.0) are additional tools, not replacements: code, weights, and prompts will coexist.
  • Others argue the “versioning” metaphor is misleading because it implies linear improvement and displacement, whereas older paradigms persist (like assembly or the web).
  • Several propose rephrasing (contrasted in the sketch after this list):
    • 1.0 = precise code for precisely specified problems.
    • 2.0 = learned models for problems defined by examples.
    • 3.0 = natural-language specification of goals and behavior.
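
As a rough illustration of the three paradigms applied to one task: the function, model, and prompt below are hypothetical placeholders, not anything from the talk or the thread.

```python
# Toy contrast of the three "software" paradigms on a single task: spam detection.
# Every name here is a hypothetical placeholder.

def is_spam_v1(text: str) -> bool:
    """Software 1.0: hand-written rules for a precisely specified problem."""
    banned = {"free money", "wire transfer", "act now"}
    return any(phrase in text.lower() for phrase in banned)

def is_spam_v2(text: str, model) -> bool:
    """Software 2.0: a learned classifier; behavior is defined by training examples."""
    return model.predict([text])[0] == "spam"  # e.g. a scikit-learn-style classifier

def is_spam_v3(text: str, llm) -> bool:
    """Software 3.0: the "program" is a natural-language specification handed to an LLM."""
    prompt = f"Answer only 'yes' or 'no': is the following message spam?\n\n{text}"
    return llm(prompt).strip().lower().startswith("yes")
```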

LLMs for coding, structured outputs, and “vibe coding”

  • Strong interest in structured outputs / JSON mode / constrained decoding as a way to make LLMs reliable components in pipelines and avoid brittle parsing (a validation sketch follows this list).
  • Experiences are mixed: some report big gains (classification, extraction, function calling), others show concrete failures (misclassified ingredients, dropped fields) even with schemas and post‑processing.
  • “Vibe coding” (natural-language-driven app building) is seen by some as empowering and a good way to prototype or learn; others see it as unmaintainable code generation that just moves developers into low‑value reviewing of sloppy PRs.
  • There’s debate over whether LLM-assisted code is ever “top-tier” quality, and whether multiple AI-generated PR variants are helpful or just more review burden.
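
A mitigation several commenters rely on is validating model output against a schema and retrying on failure. Below is a minimal sketch using pydantic; the call_llm argument stands in for whatever LLM client is in use and is purely hypothetical.

```python
from pydantic import BaseModel, ValidationError

class Ingredient(BaseModel):
    name: str
    quantity: float
    unit: str

class Recipe(BaseModel):
    title: str
    ingredients: list[Ingredient]

def extract_recipe(text: str, call_llm, max_retries: int = 2) -> Recipe:
    """Ask the model for JSON matching the Recipe schema and validate it;
    on failure, feed the validation error back so the model can correct itself."""
    prompt = (
        "Extract the recipe from the text below as JSON matching this schema:\n"
        f"{Recipe.model_json_schema()}\n\nText:\n{text}"
    )
    for _ in range(max_retries + 1):
        raw = call_llm(prompt)  # hypothetical LLM client returning a string
        try:
            return Recipe.model_validate_json(raw)
        except ValidationError as err:
            prompt += (
                f"\n\nYour previous output failed validation:\n{err}\n"
                "Return corrected JSON only."
            )
    raise ValueError("model never produced schema-valid JSON")
```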

Determinism, debugging, and formal methods

  • A recurring concern: LLM-based systems are hard to debug and reason about; unlike traditional code, you can’t step through to find why a specific edge case fails.
  • Some push for tighter verification loops, including formal methods and “AI on a tight leash” (AI proposes, formal systems or tests verify; a minimal loop is sketched after this list).
  • Others argue English (or natural language generally) is fundamentally ambiguous and cannot replace formal languages for safety-critical or complex systems, warning of a drift back toward “magical thinking.”
  • Counterpoint: most real software already depends on non-deterministic components (APIs, hardware, ML models), so the real issue is designing robust verification and isolation layers, not banning probabilistic tools.
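
The “tight leash” idea amounts to a generate-and-verify loop: the model proposes code, and a deterministic checker (a test suite, type checker, or prover) accepts or rejects it. A minimal sketch, assuming pytest is installed and generate_patch is a hypothetical LLM call:

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

def propose_and_verify(task: str, tests_dir: str, generate_patch, max_attempts: int = 3) -> str | None:
    """Generate-and-verify loop: the model proposes, pytest decides.
    generate_patch(task, feedback) -> source string (hypothetical LLM call).
    tests_dir holds a human-written test suite that imports 'solution'."""
    feedback = ""
    for _ in range(max_attempts):
        code = generate_patch(task, feedback)
        with tempfile.TemporaryDirectory() as tmp:
            shutil.copytree(tests_dir, Path(tmp) / "tests")
            Path(tmp, "solution.py").write_text(code)
            # The fixed test suite is the deterministic "leash": it must pass unchanged.
            result = subprocess.run(
                ["pytest", "tests", "-q"], cwd=tmp, capture_output=True, text=True
            )
        if result.returncode == 0:
            return code
        feedback = (result.stdout + result.stderr)[-2000:]  # failures fed back to the model
    return None  # nothing verified; hand the task back to a human
```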

Interfaces, UX, and llms.txt

  • Several latch onto the analogy between today’s chat UIs and 1960s terminals: powerful backend, weak interface.
  • New ideas discussed: dynamic, LLM-generated GUIs per task; “malleable” personal interfaces; LLMs orchestrating tools behind the scenes. Concerns focus on non-deterministic, constantly shifting UIs being unlearnable and ripe for dark patterns.
  • The proposed llms.txt standard for sites is widely discussed:
    • Enthusiasts like the idea of clean, LLM‑oriented descriptions and API instructions.
    • Critics worry about the file diverging from the HTML site, about gaming or misalignment between the human and machine views, and about adding yet another root-level file instead of using /.well-known/ (see the fetch sketch after this list).
    • Broader lament that the human web is being sidelined (apps, SEO, social feeds) while machines get the “good,” structured view.
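
For anyone experimenting, retrieving a site’s llms.txt is a one-liner; the sketch below also tries the /.well-known/ location some critics prefer. Both paths come from the discussion; neither is a settled standard.

```python
import urllib.error
import urllib.request

def fetch_llms_txt(origin: str) -> str | None:
    """Try the root-level llms.txt first, then the /.well-known/ variant
    argued for in the thread. Returns the file body, or None if neither exists."""
    for path in ("/llms.txt", "/.well-known/llms.txt"):
        try:
            with urllib.request.urlopen(origin.rstrip("/") + path, timeout=10) as resp:
                if resp.status == 200:
                    return resp.read().decode("utf-8", errors="replace")
        except urllib.error.URLError:
            continue
    return None

# Example: print(fetch_llms_txt("https://example.com"))
```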

Self-driving, world models, and analogies

  • The self-driving segment triggers a technical debate:
    • Some think a single generalist, multimodal model (“drive safely”) could eventually subsume specialized stacks.
    • Others argue driving is a tightly constrained, high-speed, partial-information control problem where specialized architectures and physics-based prediction (world models, MuZero-style planning, 3D state spaces) remain superior.
  • Broader skepticism about analogies:
    • Electricity/OS/mainframe metaphors are seen as insightful by some, but nitpicked or rejected by others as historically inaccurate or overextended.
    • One line of critique: these analogies obscure who actually controls LLMs (corporations, sometimes governments), even while the talk emphasizes “power to ordinary people.”

Power, diffusion, and centralization

  • Disagreement over whether LLMs truly “flip” tech diffusion:
    • Supporters note early mass consumer use (boiling eggs, homework, small scripts) versus the historical pattern of government/military first use (cryptography, computing, GPS).
    • Skeptics stress that model training, data access, and infrastructure are dominated by large corporations and governments; open‑weights remain dependent on corporate-scale datasets and compute.
  • Some worry that concentration of model power plus agentic capabilities will further entrench big platforms, not democratize software.

Limits, brittleness, and skepticism

  • Many practitioners report that current LLMs often “almost work” but fail in subtle ways: wrong math, off‑by‑one bugs, dropped fields, mis-normalized data, or plausible but incorrect logic.
  • There’s pushback against “AI as electricity” or “near-AGI” narratives:
    • People compare the hype to crypto and metaverse bubbles.
    • Some point to high-profile “AI coding” experiments at large companies where AI-generated PRs required intense human micromanagement and added little value.
  • Nonetheless, others share compelling use cases: faster test scaffolding, refactors, documentation, data munging, bespoke scripts, and domain-specific helpers, especially when paired with good rules files and schemas.

Future of work, education, and small models

  • Concern that widespread “vibe coding” and AI code generation will kill entry-level roles, deskill developers into PR reviewers, and worsen long‑term code quality.
  • Others say the main shift is that domain experts (doctors, teachers, small business owners) can build narrow tools without learning full-stack development, with engineers focusing more on architecture, verification, and “context wrangling.”
  • Debate on small/local models:
    • Some argue rapid improvement (e.g., compact models) will make on-device AI a real alternative to centralized “mainframes,” especially once good enough for many tasks.
    • Others counter that frontier cloud models remain far ahead in capability, and running strong local models is still costly and technically demanding.

DevOps, deployment, and enterprise concerns

  • Several note a practical friction: adding AI to an app often forces teams to build backends just to safely proxy LLM APIs and manage keys, tests, and logging—undermining “frontend-only” or “no-backend” development.
  • Ideas for “Firebase for LLMs” or platform features to handle secure proxying, rate limiting, and tool orchestration are floated (a minimal proxy sketch follows this list).
  • Enterprise and regulated settings raise special worries:
    • How to certify safety, security, and compliance if parts of systems are non-deterministic and poorly understood, or if vendors themselves rely heavily on LLM-generated internals.
    • How to maintain and evolve systems where no human fully understands the code the agent originally wrote.
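
A minimal sketch of such a proxy using FastAPI and httpx; the endpoint name, upstream URL, and rate limit are placeholders rather than any real product.

```python
import os
import time

import httpx
from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
UPSTREAM_URL = "https://llm-provider.example/v1/chat"  # placeholder provider endpoint
API_KEY = os.environ["LLM_API_KEY"]                    # secret stays server-side
_last_call: dict[str, float] = {}                      # naive per-client rate limiting

@app.post("/proxy/chat")
async def proxy_chat(request: Request):
    """Forward a chat request to the provider without exposing the API key,
    with a crude one-request-per-second limit and basic server-side logging."""
    client = request.client.host if request.client else "unknown"
    now = time.monotonic()
    if now - _last_call.get(client, 0.0) < 1.0:
        raise HTTPException(status_code=429, detail="rate limited")
    _last_call[client] = now

    payload = await request.json()
    async with httpx.AsyncClient(timeout=60) as http:
        upstream = await http.post(
            UPSTREAM_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            json=payload,
        )
    print(f"{client} -> {UPSTREAM_URL} [{upstream.status_code}]")  # stand-in for real logging
    return upstream.json()
```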