LLM Year in Review

Scope Gaps and Strategic Questions

  • Several commenters feel the review underplays structural issues: concentration of power in model labs, hardware bottlenecks, the future of open source, and what truly “local” AI means.
  • Confusion around the claim that Claude Code “runs on your computer”: commenters clarify that the agent harness runs locally while inference happens in the cloud, and some argue the review should state this distinction explicitly.
  • Proposed 2025–26 priorities: online/continuous learning, reducing hallucinations, improving reliability and escalation when models hit unfamiliar situations.

Agents, Local vs Cloud, and Coding Workflows

  • Strong interest in “localhost agents”: wrappers that can invoke shells and tools and operate over full file systems, so the LLM becomes a “remote brain” driving a local “mech suit” (see the sketch after this list).
  • Discussion of local agent stacks (Codex + gpt‑oss, llama.cpp, Ollama) vs frontier cloud models; consensus that local models are still noticeably weaker but strategically important (offline, privacy, new architectures).
  • Intense comparison of coding agents: Claude Code, Cursor, Codex, GLM.
    • Some report 5×+ productivity gains and almost no hand-written code; others find Claude Code too heavyweight for small tasks and prefer tighter IDE integration (e.g., Cursor).
    • Benchmarks and user reports conflict on which frontier model (GPT‑5.2, Gemini 3, Opus 4.5) is “best”; steerability, tool use, and instruction-following matter more than raw scores.
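
A minimal sketch of the “remote brain in a local mech suit” pattern: the loop, shell access, and file system all live on your machine, while BASE_URL decides whether the brain is a cloud frontier model or a local one served via Ollama’s OpenAI-compatible endpoint. The endpoint, model tag, and SHELL: convention below are illustrative assumptions, not any specific product’s implementation.

    # Hypothetical "localhost agent": local harness, swappable brain.
    import subprocess
    import requests

    BASE_URL = "http://localhost:11434/v1"  # Ollama default; or a cloud API base
    MODEL = "gpt-oss:20b"                   # any chat model the endpoint serves

    def ask_brain(messages: list[dict]) -> str:
        """Send the conversation to the (remote or local) model."""
        # Cloud endpoints would also need an Authorization header; omitted here.
        resp = requests.post(
            f"{BASE_URL}/chat/completions",
            json={"model": MODEL, "messages": messages},
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

    def run_shell(command: str) -> str:
        """The local 'mech suit': execute a shell command, capture its output."""
        result = subprocess.run(command, shell=True, capture_output=True,
                                text=True, timeout=60)
        return (result.stdout + result.stderr)[:4000]  # keep context small

    messages = [
        {"role": "system",
         "content": "You are a coding agent. To run a shell command, reply with "
                    "exactly: SHELL: <command>. Otherwise answer normally."},
        {"role": "user", "content": "List the Python files in this directory."},
    ]
    for _ in range(5):  # cap the agent loop
        reply = ask_brain(messages)
        messages.append({"role": "assistant", "content": reply})
        if reply.startswith("SHELL: "):
            output = run_shell(reply[len("SHELL: "):])
            messages.append({"role": "user",
                             "content": f"Command output:\n{output}"})
        else:
            print(reply)
            break

Pointing BASE_URL at a hosted provider instead of localhost changes nothing in the harness, which is exactly the local/cloud distinction raised above.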

“Vibe Coding” and Ephemeral Software

  • Many are excited about “vibe coding”: spinning up one-off tools, debug scripts, or micro‑apps and discarding them afterwards.
  • Others note economic caveats: current LLM usage is heavily subsidized; if prices rise, ephemeral coding may become less attractive.

UI Generation and Modal Interfaces

  • Some see UI generation as a major underexplored frontier: models that choose the best representation (apps, graphics, animations) rather than just text.
  • Nano Banana and video models are cited as early hints of models that can transform and reason over visual environments.
  • Skeptics worry about chaotic, inconsistent LLM‑generated UIs and already dislike “chatbot-as-UI” patterns (e.g., being forced to talk to a bot to unsubscribe).
  • Counterpoint: text/speech likely remain primary UI; images/video are additive, not replacements.

Jagged Intelligence, “Ghosts,” and Style Drift

  • The “jagged intelligence” / “summoning ghosts vs growing animals” framing resonates with some as an explanation for benchmark fragility and RL overfitting; others dismiss it as merely metaphorical, insisting that present systems lack true general intelligence and embodiment.
  • Concern that bots now reply to bots on social platforms, “haunting” the training data and making online argument pointless.
  • Many notice AI‑like rhetorical tics (“it’s not X, it’s Y”, overuse of em dashes) seeping into human writing, making it harder to distinguish human from model text and provoking reader fatigue.

Enterprise Adoption, Reliability, and Evaluations

  • Commenters working in industry say LLMs still feel like “geek toys” in many enterprises: creatives hand-tune prompts, but critical domains (HR, finance) often forbid sending data to external models.
  • Non-determinism and hallucinations make standard training and process control difficult.
  • Interest in building domain-specific eval suites to measure whether models are “good enough” for particular workflows, beyond public benchmarks (a minimal sketch follows below).
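
A sketch of what such a suite might look like, assuming any prompt-to-answer callable (for instance the ask_brain helper sketched earlier). The cases, policy ID, and substring grading are hypothetical placeholders for real domain checks; running each case several times is one cheap way to surface the non-determinism noted above.

    # Hypothetical domain eval suite: pass rate of a model over in-house cases.
    from dataclasses import dataclass

    @dataclass
    class EvalCase:
        prompt: str
        must_contain: list[str]  # substring checks; real suites use richer graders

    CASES = [  # illustrative placeholders, not real policies or tickets
        EvalCase("What is our standard refund window, per policy REF-7?",
                 ["30 days"]),
        EvalCase("Summarize ticket #4521 in one sentence.", ["4521"]),
    ]

    def run_suite(model_fn, trials: int = 3) -> float:
        """Return the pass rate of model_fn (prompt -> answer) over the suite.
        Each case runs `trials` times so non-deterministic failures show up."""
        passed = total = 0
        for case in CASES:
            for _ in range(trials):
                answer = model_fn(case.prompt)
                ok = all(s.lower() in answer.lower() for s in case.must_contain)
                passed += ok
                total += 1
                print(f"{'PASS' if ok else 'FAIL'}: {case.prompt[:50]}")
        return passed / total

    # Usage: rate = run_suite(lambda p: ask_brain([{"role": "user", "content": p}]))
    # Re-run on every model or prompt change; "good enough" is a threshold you pick.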