OpenAI's o1 Playing Codenames

LLMs Playing Codenames and Similar Experiments

  • Multiple people report running Codenames-style experiments with various models (Claude, GPT‑3/3.5, o1), often finding AI guesses align well with human guesses.
  • Some tried Codenames Pictures and got weaker results.
  • Others built or linked apps to play Codenames/variants with LLM partners.

Evaluation, Fairness, and Benchmarks

  • Many see Codenames as a natural benchmark for LLMs, given its reliance on semantic associations and light strategy.
  • Suggestions include Elo-style ratings across board/card games and pitting different models or AI–human teams against each other, not just a model playing with itself.
  • Critics argue AI–AI play is easier because both roles share the same “brain” and associations.
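The Elo-style rating idea could be sketched with the standard Elo update; the K-factor and starting rating of 1500 here are conventional defaults, not values anyone in the thread specified:

```python
def expected_score(r_a, r_b):
    """Expected score of A against B under the Elo model (logistic, base 10, scale 400)."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update(r_a, r_b, score_a, k=32):
    """Return updated ratings; score_a is 1.0 (A wins), 0.5 (draw), or 0.0 (A loses)."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1 - score_a) - (1 - e_a))

# Two model teams start at 1500; team A wins one game and gains 16 points.
a, b = update(1500, 1500, 1.0)
```

Running many such games between mixed model-model and model-human teams would give the cross-model leaderboard the thread asks for, rather than a model scored only against itself.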

Game Strategy, Rules, and Exploits

  • Discussion of advanced tactics: high-number clues meant to span several turns, deliberately accepting one neutral/opponent hit, and inflating a later clue's number so teammates keep guessing leftover words from earlier, unfinished clues.
  • Some argue strict rules require the number to match actual related words; others play with looser house rules.
  • A binary-encoding “powers of two” strategy is noted but called explicitly illegal under the official rules.
  • Layout-card memorization and pattern abuse in physical boards are mentioned; online versions can randomize away such patterns.
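The binary exploit is easy to see in code: clue numbers alone carry enough bits to name every friendly card, with no word meanings involved, which is exactly why the official rules forbid numbers unrelated to the clue word. The chunk size below is a hypothetical choice for illustration:

```python
def encode_positions(team_indices):
    """Pack board positions (0-24 on a 5x5 Codenames grid) into one integer bitmask."""
    mask = 0
    for i in team_indices:
        mask |= 1 << i
    return mask

def decode_positions(mask):
    """Recover the set of board positions from the bitmask."""
    return {i for i in range(25) if mask & (1 << i)}

def bitmask_to_clue_numbers(mask, bits_per_clue=4):
    """Split the 25-bit mask into small chunks (0-15) that a cheating spymaster
    could smuggle out as the 'number' part of successive clues."""
    chunks = []
    while mask:
        chunks.append(mask & ((1 << bits_per_clue) - 1))
        mask >>= bits_per_clue
    return chunks

# Cards 0, 3, and 7 belong to our team: the whole layout fits in two clue numbers.
chunks = bitmask_to_clue_numbers(encode_positions({0, 3, 7}))
```

A teammate who knows the scheme reassembles the chunks and recovers the layout exactly, regardless of what words the clues were.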

Quality of o1’s Play and Clue Choices

  • Some are impressed, especially by a single clue like “paper” covering four cards and by the model's explicit reasoning traces.
  • Several think the performance is overhyped: mostly safe clues targeting only two cards at a time, occasional luck, and questionable tactical decisions about when to keep guessing.
  • Specific clues such as “007” are criticized as weak or risky due to many plausible unintended associations.

Embeddings vs Full LLMs

  • Some believe classic word embeddings (word2vec, GloVe) should suffice; others report poor results, especially for 3+ word clues, unless augmented with association graphs.
  • LLMs are praised for broader “latent space” search across concepts, quotes, books, and puzzles like NYT Connections.
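A minimal sketch of the embedding-only approach the thread debates, with toy 2-D vectors standing in for word2vec/GloVe embeddings; scoring a candidate by its weakest link to the target words, minus its strongest pull toward forbidden words, is one common heuristic, not a method anyone in the thread attributes to a specific system:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def best_clue(vectors, team_words, avoid_words):
    """Pick the candidate clue whose weakest association to the team's words is
    strongest, penalized by its closest association to any word we must avoid."""
    best, best_score = None, -np.inf
    for clue, vec in vectors.items():
        if clue in team_words or clue in avoid_words:
            continue  # a clue may not be a word on the board
        link = min(cosine(vec, vectors[w]) for w in team_words)
        risk = max(cosine(vec, vectors[w]) for w in avoid_words) if avoid_words else 0.0
        if link - risk > best_score:
            best, best_score = clue, link - risk
    return best

# Hypothetical toy vectors: "ocean" sits near both team words and far from "fire".
vectors = {
    "ocean": np.array([1.0, 0.0]),
    "water": np.array([0.95, 0.05]),
    "wave":  np.array([0.9, 0.1]),
    "beach": np.array([0.8, 0.2]),
    "fire":  np.array([0.0, 1.0]),
}
clue = best_clue(vectors, team_words={"wave", "beach"}, avoid_words={"fire"})
```

This min-over-targets scoring is also where pure embeddings reportedly struggle: as the target set grows to three or more words, the weakest link drops fast, which is consistent with the poor 3+ word results people describe without an association graph on top.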

Reasoning, Explanations, and Human Factors

  • Debate over whether models’ step-by-step explanations reflect real internal reasoning or are post‑hoc justifications.
  • Comparisons are drawn to humans’ own post‑hoc rationalizations, with disagreement on how analogous the processes are.
  • Several emphasize that true Codenames skill depends on modeling specific teammates, inside jokes, and psychology—areas where same‑model AI–AI play has an unfair advantage and where human–human play remains uniquely fun.