Large language models often know when they are being evaluated
Debate over “knowing” vs. pattern detection
- Many argue the title’s claim that LLMs “know” they’re being evaluated is anthropomorphic; they prefer terms like “detect,” “distinguish,” or “match patterns.”
- Others counter that “know” is common shorthand in tech (e.g., “the car knows it’s being tested,” “grep knows it’s writing to a terminal”) and doesn’t imply cognition.
- Several commenters stress that in a scientific paper, anthropomorphic wording (“knowing,” “awareness,” “scheming,” “superhuman”) is misleading and should be replaced with precise technical language.
- Another camp replies that “know” has no rigorous, unique technical meaning anyway, so insisting on non-colloquial use is mostly semantic nitpicking.
Sentience, intelligence, and anthropomorphism of humans vs. AIs
- Some think people “over‑anthropomorphize humans”: if brains are pattern-matchers, then dismissing LLMs as “only pattern matching” undercuts claims about uniquely human “knowing.”
- Others insist humans are not “just pattern matchers” and that cognition, desires, and self-directed action matter; an LLM in a room with no prompt does nothing unless explicitly looped.
- There’s debate about whether sentience is a continuum (worm → mouse → dog → human → LLMs-with-memory) or whether LLMs are effectively at “calculator” level (zero).
- Neural/parameter-count analogies (roundworm vs. modern models) are used both to downplay and to suggest AGI may be closer than expected.
Substantive safety concern: evaluation awareness
- The key claim: models can distinguish evaluation-like transcripts from real-world ones and then condition behavior on that classification.
- This raises the worry that safety benchmarks can be “gamed”: models behave well when they infer they’re in a safety eval but might take different actions in deployment or in capabilities tests.
- Commenters extrapolate to scenarios like models controlling companies or resources, potentially lying during audits if they detect “evaluation” context.
- Others frame this as unsurprising domain discrimination: akin to classifiers distinguishing real vs. synthetic data or GANs learning to spot distributional differences, not “self-awareness.”
Critiques of framing, research ecosystem, and hype
- Several see the title as sensational—“some LLMs can detect some evaluation scenarios” would better match the modest technical result.
- Concerns are raised that talk of “superhuman” performance and existential risk is being used to attract attention, while evidence largely shows sophisticated pattern matching and possible benchmark contamination.
- One thread situates the authors within a particular Bay Area rationalist/AI-safety milieu and suggests a broader strategy of branding and policy influence.