Adversarial policies beat superhuman Go AIs (2023)

Human playstyles, intuition, and ratings

  • Some compare the Go exploit to humans using bizarre, unpredictable play or obscure chess openings to push opponents out of preparation.
  • Discussion on human strengths: memory, calculation, and especially intuition; top players differ in which they excel at.
  • Several comments debate Elo/ladder systems:
    • One side enjoys 50% win rates and evenly matched games, valuing improvement and quality over sheer winning.
    • Others dislike systems where win rate stabilizes, preferring formats (like tournaments or “playing down”) that reward visible progress.
    • There’s tension between “winning is the fun part” and “learning and good games are the fun part.”
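The rating-plateau complaint follows directly from how Elo-style ladders update: once your rating matches your true strength, your expected score against equal-rated opponents is 0.5, so wins and losses cancel out. A minimal sketch of the standard Elo formulas (function names are mine):

```python
def elo_expected(r_a, r_b):
    """Expected score for player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score, k=32.0):
    """A's new rating after one game (score: 1 win, 0.5 draw, 0 loss)."""
    return r_a + k * (score - elo_expected(r_a, r_b))
```

At equal ratings the expected score is exactly 0.5, and an upset win moves the underdog's rating more than an expected win moves the favorite's; a matchmaking system that pairs you with peers therefore drives your long-run win rate toward 50% by construction.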

Nature of the Go adversarial attack

  • The attack creates positions where superhuman Go AIs mis-evaluate life-and-death, especially long “dead-man walking” situations where a group is effectively dead but not yet captured.
  • The adversary often plays slightly suboptimal moves to keep the AI “confused” instead of cashing in an obvious win, because exposing the true status might let the AI recover.
  • There are two main strategies discussed: a “pass adversary” exploiting a particular formal ruleset, and a “cyclic adversary” that lures the victim into misjudging large ring-shaped groups.
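The training setup behind these adversaries can be illustrated with a toy sketch (hypothetical code, not the paper's): freeze the victim policy, fold it into the environment, and optimize the adversary solely to exploit that one opponent. Here the “victim” is a rock-paper-scissors bot with a strong-looking but exploitable heuristic, and “training” is exhaustive search over the adversary's tiny policy space:

```python
from itertools import product

MOVES = ("rock", "paper", "scissors")
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}
COUNTER = {loser: winner for winner, loser in BEATS.items()}  # move beating key

def victim(adv_last):
    # "Strong" heuristic with a blind spot: counter the adversary's last
    # move, assuming the adversary will repeat it.
    return COUNTER[adv_last] if adv_last else "rock"

def exploit_rate(adversary_policy, rounds=300):
    # The frozen victim is part of the environment; we only measure how
    # often the adversary beats it.
    wins, adv_last = 0, None
    for _ in range(rounds):
        v = victim(adv_last)
        a = adversary_policy(adv_last)
        if BEATS[a] == v:
            wins += 1
        adv_last = a
    return wins / rounds

def train_adversary():
    # "Training" is exhaustive search over deterministic policies keyed on
    # the adversary's own previous move (the only state victim() reads).
    states = (None,) + MOVES
    best_table, best_rate = None, -1.0
    for assignment in product(MOVES, repeat=len(states)):
        table = dict(zip(states, assignment))
        rate = exploit_rate(lambda last, t=table: t[last])
        if rate > best_rate:
            best_table, best_rate = table, rate
    return best_table, best_rate
```

The search finds a cycling policy that wins every round. That is the point of the analogy: the adversary never needs to be strong in general, only to find one pattern its fixed opponent systematically mishandles.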

Go rules, ladders, and persistent weaknesses

  • Part of the controversy centers on rulesets:
    • Some say the “pass” attack mainly abuses a bare Tromp–Taylor ruleset, in which passing ends the game and dead stones are never removed, which is not how AIs typically play against humans.
    • Others argue the evaluation should match the rules the AI is configured for, regardless of human conventions.
  • The more respected “cyclic” attack reveals genuine misreads of group status, not just rules quirks.
  • Separate thread on ladders: early Go AIs and even modern NNs struggle with long, mechanical ladder sequences, prompting hard-coded ladder solvers.
  • KataGo developers reportedly patched some cyclic flaws via extra training and larger networks, but expect other, harder-to-find flaws will always exist.
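To see why the ruleset matters, here is a minimal sketch of Tromp–Taylor area scoring (assumed simplifications: rectangular board, no komi, no capture or legality logic). Because there is no dead-stone agreement phase, every stone left on the board counts as alive, and an empty region touching both colors scores for neither side; that is the quirk the pass adversary exploits.

```python
def tromp_taylor_score(board):
    """board: list of equal-length strings, '.' empty, 'b'/'w' stones.
    Returns (black_points, white_points) under area scoring."""
    rows, cols = len(board), len(board[0])

    def neighbors(r, c):
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            if 0 <= r + dr < rows and 0 <= c + dc < cols:
                yield r + dr, c + dc

    score = {"b": 0, "w": 0}
    seen = set()
    for r in range(rows):
        for c in range(cols):
            cell = board[r][c]
            if cell in score:
                score[cell] += 1  # every stone on the board counts as alive
            elif (r, c) not in seen:
                # Flood-fill this empty region, recording which colors it touches.
                region, touch, stack = [], set(), [(r, c)]
                seen.add((r, c))
                while stack:
                    rr, cc = stack.pop()
                    region.append((rr, cc))
                    for nr, nc in neighbors(rr, cc):
                        v = board[nr][nc]
                        if v == ".":
                            if (nr, nc) not in seen:
                                seen.add((nr, nc))
                                stack.append((nr, nc))
                        else:
                            touch.add(v)
                if len(touch) == 1:  # region reaches exactly one color
                    score[touch.pop()] += len(region)
    return score["b"], score["w"]
```

For example, a lone white stone sitting inside otherwise black territory still scores a point for White and turns the adjacent empty points neutral, even though a human (or a dead-stone removal phase) would call it dead: `tromp_taylor_score(["b.b", "bbb", "b.w"])` scores Black 7, White 1 rather than Black 9, White 0.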

Broader AI reliability and “superhuman” claims

  • Several commenters highlight that “superhuman” at a game does not mean robust or generally intelligent; narrow AIs can still have brittle, surprising failure modes.
  • Some see the paper as important evidence that future powerful systems will also harbor unknown vulnerabilities; others call the conclusion empty or overgeneralized.
  • A later defense paper is noted: defenses can stop known attacks but fail against newly trained adversaries, suggesting an ongoing arms race.

Parallels to chess engines and other games

  • Chess examples (fortresses, locked pawn structures) show top engines assigning winning evaluations to positions humans recognize as dead draws, underscoring that engines rely on search and heuristics rather than human-like reasoning about structural constraints.
  • Past experience with anti-computer strategies in chess is cited as an analogy: you can target the evaluation function, but enough compute and better training typically overcome such tricks.

LLMs, hallucinations, and adversarial prompts

  • Some draw analogies between Go adversarial policies and LLM “hallucinations” and jailbreaks.
  • Debate over terminology:
    • One view treats hallucinations as the model’s attempt to extrapolate beyond its training data.
    • Another stresses they’re simply outputs violating constraints (e.g., fake cases, unsafe recipes) and not real “reasoning.”
  • Adversarial attacks on LLMs are noted as an active research area, reinforcing the general theme: powerful models can be steered into failure regimes their creators didn’t anticipate.