Why do LLMs have emergent properties?

Debate over “emergent abilities” vs metric artifacts

  • Several comments cite work arguing that many “emergent abilities” are illusions caused by non‑linear or discontinuous evaluation metrics; if you use smooth metrics, performance scales smoothly (a toy version of this metric effect is sketched after this list).
  • Others push back: the metrics criticized there are exactly what people care about in practice (pass/fail, accuracy thresholds), so sudden jumps are meaningful. Smooth internal properties do not rule out real emergent behaviors at the task level.
  • Some criticize the article for acknowledging this line of work yet still talking about “emergence” as if it were unquestioned.
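
A toy sketch of the metric argument (an illustration added here, not code or numbers from the thread): assume per‑token accuracy improves smoothly with scale, and score a 10‑token answer two ways. The smooth per‑token metric rises gradually, while the pass/fail exact‑match metric sits near zero and then jumps.

```python
# Toy model of the metric-artifact argument. Assumptions: per-token accuracy
# follows a smooth logistic curve in log(parameter count), and the task
# answer is ANSWER_LEN tokens, so exact match requires every token to be right.
import math

ANSWER_LEN = 10  # hypothetical answer length in tokens

def per_token_accuracy(log10_params: float) -> float:
    """Smoothly improving per-token accuracy as a function of model scale."""
    return 1.0 / (1.0 + math.exp(-(log10_params - 9.0)))  # midpoint at ~1B params

for log10_params in range(7, 13):
    p = per_token_accuracy(log10_params)
    exact_match = p ** ANSWER_LEN  # discontinuous, pass/fail-style metric
    print(f"10^{log10_params} params: per-token = {p:.3f}   exact match = {exact_match:.3f}")
```

Under the smooth metric nothing abrupt happens; under exact match the same underlying curve looks like an ability appearing suddenly over the last couple of orders of magnitude.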

What “emergence” means (and doesn’t)

  • One camp treats “emergent properties” as a vague label for “we don’t understand this yet” or even a dualist cop‑out.
  • Another camp gives standard complex‑systems definitions: macroscopic properties not present in individual parts (thermodynamics, entropy, flocking, cars transporting people, Game of Life patterns, fractals).
  • Several stress that emergence is not magic or ignorance: you can fully understand the parts and still have qualitatively new system‑level behavior (the Game of Life sketch after this list is a compact example).
  • Disagreement persists on whether this is just semantics or a substantive systems‑theory concept.
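
The Game of Life example can be made concrete in a few lines. The local update rule below is the entire system specification, yet iterating it on a five‑cell “glider” produces a shape that travels across the grid, a system‑level behavior stated nowhere in the rule. A minimal sketch:

```python
# Conway's Game of Life on a set of live-cell coordinates. The rule is fully
# known and local, yet the glider below is a pattern that moves: a standard
# illustration of emergence without any mystery about the parts.
from collections import Counter

def step(live: set[tuple[int, int]]) -> set[tuple[int, int]]:
    """One Game of Life update: count live neighbours, then apply the rule."""
    neighbour_counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live
        for dx in (-1, 0, 1) for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # Birth on exactly 3 neighbours; survival on 2 or 3.
    return {cell for cell, n in neighbour_counts.items()
            if n == 3 or (n == 2 and cell in live)}

glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
state = glider
for _ in range(4):  # the glider has period 4
    state = step(state)
print(state == {(x + 1, y + 1) for (x, y) in glider})  # True: same shape, shifted diagonally
```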

Benchmarks, thresholds, and human perception

  • People note that many abilities are treated as binary (“can do addition”, “can fly”), but underlying competence improves continuously until a threshold is crossed, at which point we relabel it as a new capability.
  • This is tied to benchmark design: percentage scores saturate, so small gains near the top feel like big leaps (see the log‑odds sketch after this list); humans also choose arbitrary cut points and then call what happens beyond them “emergent.”
  • Others argue that the rapid breaking of increasingly sophisticated benchmarks suggests something more than arbitrary re‑labeling is going on.
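
One way to make the saturation point concrete (the choice of scale here is purely illustrative, not something proposed in the discussion) is to re‑express accuracy as log‑odds: near the ceiling, a few percentage points correspond to a large underlying improvement, which is part of why crossing a high cut point reads as a leap.

```python
# Percentage scores compress progress near the ceiling. Measured in log-odds
# (one common unsaturated scale, used here only for illustration), the step
# from 90% to 99% is about as large as the step from 50% to 90%.
import math

def log_odds(acc: float) -> float:
    return math.log(acc / (1.0 - acc))

for lo, hi in [(0.50, 0.90), (0.90, 0.99)]:
    print(f"{lo:.0%} -> {hi:.0%}: +{100 * (hi - lo):.0f} points, "
          f"+{log_odds(hi) - log_odds(lo):.2f} log-odds")
```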

Scaling, history, and why big models were tried

  • Emergence wasn’t predicted as a sharp phase change; model sizes increased gradually because each bigger model delivered smooth but real gains.
  • Earlier successes in deep learning (vision, games) and hardware advances made “just scale it up” a reasonable, incremental bet rather than a wild leap.

Interpolation, data, and where “intelligence” lives

  • Some argue LLMs mainly interpolate within massive training corpora and effectively store human labeling effort; “emergence” may belong more to the data’s latent structure than to the models.
  • Others counter that even if it’s “just interpolation,” human brains are also sophisticated interpolators, and the qualitative novelty of some solved tasks is still notable.
  • One line of thought suggests that beyond a certain scale, “learning general heuristics” becomes more parameter‑efficient than storing countless task‑specific tricks (a back‑of‑envelope sketch of this trade‑off follows); whether LLMs have crossed that line remains debated.
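
A back‑of‑envelope version of that trade‑off, using k‑digit addition purely as an illustration (the costs are hypothetical stand‑ins for “parameters spent”, not measurements of any model): memorizing every problem scales as 10^(2k) entries, while a general carry algorithm has a fixed cost, so past a small k the general rule is the cheaper way to get the answers right.

```python
# Hypothetical cost comparison: memorize all k-digit additions vs. represent
# one general carry algorithm. RULE_COST is an assumed constant, not a
# measured quantity; the point is only the crossover, not the exact numbers.
RULE_COST = 1_000  # assumed fixed "parameter" cost of a general addition rule

def lookup_entries(k: int) -> int:
    """Entries needed to memorize every k-digit + k-digit sum."""
    return (10 ** k) ** 2

for k in range(1, 6):
    table = lookup_entries(k)
    cheaper = "general rule" if RULE_COST < table else "lookup table"
    print(f"{k}-digit addition: {table:>12,} memorized entries vs {RULE_COST:,} -> {cheaper} wins")
```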

Underspecification, parameters, and training dynamics

  • There is disagreement about “bit budgets”: some see models as undertrained relative to their size; others emphasize underspecification (many parameter settings yield similar loss).
  • Different random initializations lead to different minima with broadly similar behavior; some see this as evidence of many equivalent optima in high‑dimensional space, not radically different emergent skill sets (a toy demonstration follows this list).
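
A toy picture of underspecification (a linear‑model analogy, not an LLM experiment): with more parameters than data points, gradient descent from different random starts reaches essentially the same near‑zero training loss at clearly different weight vectors, because the null‑space component of the weights is never updated.

```python
# Underspecification in miniature: 100 parameters, 20 data points. Different
# random initializations converge to the same (near-zero) training loss but
# end up at different weights. A linear toy model, not a claim about LLMs.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 100))   # 20 examples, 100 parameters
y = rng.normal(size=20)

def train(seed: int, steps: int = 2000, lr: float = 0.01) -> np.ndarray:
    w = np.random.default_rng(seed).normal(scale=0.1, size=100)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)   # gradient of mean squared error
        w -= lr * grad
    return w

def mse(w: np.ndarray) -> float:
    return float(np.mean((X @ w - y) ** 2))

w_a, w_b = train(seed=1), train(seed=2)
print(f"loss A = {mse(w_a):.2e}, loss B = {mse(w_b):.2e}")   # both ~0
print(f"|w_a - w_b| = {np.linalg.norm(w_a - w_b):.2f}")      # clearly nonzero
```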

Limits, missing pieces, and skepticism

  • Skeptical voices say LLMs haven’t yet shown truly unexpected behavior; they do what they were optimized to do, so calling that “emergence” is subjective.
  • Others point out that humans need far less data to reach comparable reasoning, implying that current architectures might be missing key mechanisms for self‑learning and sense‑making.
  • There is interest in whether we can predict when specific capabilities will appear, control which emergent behaviors do or don’t arise, and rigorously distinguish genuine new abstractions from ever‑larger bags of heuristics.