OpenAI's o1 Playing Codenames

LLMs Playing Codenames and Similar Experiments

  • Multiple people report running Codenames-style experiments with various models (Claude, GPT‑3/3.5, o1), often finding AI guesses align well with human guesses.
  • Some tried Codenames Pictures and got weaker results.
  • Others built or linked apps to play Codenames/variants with LLM partners.

Evaluation, Fairness, and Benchmarks

  • Many see Codenames as a natural benchmark for LLMs, given its reliance on semantic associations and light strategy.
  • Suggestions include Elo-style ratings across board/card games and pitting different models or AI–human teams against each other, not just a model playing with itself.
  • Critics argue AI–AI play is easier because both roles share the same “brain” and associations.
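The Elo-style rating idea could be sketched with the standard Elo update; the K-factor and starting rating of 1500 here are conventional defaults, not values anyone in the thread specified:

```python
def expected_score(r_a, r_b):
    """Expected score of A against B under the Elo model (logistic, base 10, scale 400)."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update(r_a, r_b, score_a, k=32):
    """Return updated ratings; score_a is 1.0 (A wins), 0.5 (draw), or 0.0 (A loses)."""
    e_a = expected_score(r_a, r_b)
    return r_a + k * (score_a - e_a), r_b + k * ((1 - score_a) - (1 - e_a))

# Two model teams start at 1500; team A wins one game and gains 16 points.
a, b = update(1500, 1500, 1.0)
```

Running many such games between mixed model-model and model-human teams would give the cross-model leaderboard the thread asks for, rather than a model scored only against itself.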

Game Strategy, Rules, and Exploits

  • Discussion of advanced tactics: high-number clues meant to span several turns, deliberately accepting one neutral/opponent hit, and inflating a later clue's number so teammates keep guessing leftover words from earlier, unfinished clues.
  • Some argue strict rules require the number to match actual related words; others play with looser house rules.
  • A binary-encoding “powers of two” strategy is noted but called explicitly illegal under the official rules.
  • Layout-card memorization and pattern abuse in physical boards are mentioned; online versions can randomize away such patterns.
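The binary exploit is easy to see in code: clue numbers alone carry enough bits to name every friendly card, with no word meanings involved, which is exactly why the official rules forbid numbers unrelated to the clue word. The chunk size below is a hypothetical choice for illustration:

```python
def encode_positions(team_indices):
    """Pack board positions (0-24 on a 5x5 Codenames grid) into one integer bitmask."""
    mask = 0
    for i in team_indices:
        mask |= 1 << i
    return mask

def decode_positions(mask):
    """Recover the set of board positions from the bitmask."""
    return {i for i in range(25) if mask & (1 << i)}

def bitmask_to_clue_numbers(mask, bits_per_clue=4):
    """Split the 25-bit mask into small chunks (0-15) that a cheating spymaster
    could smuggle out as the 'number' part of successive clues."""
    chunks = []
    while mask:
        chunks.append(mask & ((1 << bits_per_clue) - 1))
        mask >>= bits_per_clue
    return chunks

# Cards 0, 3, and 7 belong to our team: the whole layout fits in two clue numbers.
chunks = bitmask_to_clue_numbers(encode_positions({0, 3, 7}))
```

A teammate who knows the scheme reassembles the chunks and recovers the layout exactly, regardless of what words the clues were.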

Quality of o1’s Play and Clue Choices

  • Some are impressed, especially by a single clue like “paper” covering four cards and by the model's explicit reasoning traces.
  • Several think the performance is overhyped: mostly safe clues targeting only two cards at a time, occasional luck, and questionable tactical decisions about when to keep guessing.
  • Specific clues such as “007” are criticized as weak or risky due to many plausible unintended associations.

Embeddings vs Full LLMs

  • Some believe classic word embeddings (word2vec, GloVe) should suffice; others report poor results, especially for 3+ word clues, unless augmented with association graphs.
  • LLMs are praised for broader “latent space” search across concepts, quotes, books, and puzzles like NYT Connections.
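A minimal sketch of the embedding-only approach the thread debates, with toy 2-D vectors standing in for word2vec/GloVe embeddings; scoring a candidate by its weakest link to the target words, minus its strongest pull toward forbidden words, is one common heuristic, not a method anyone in the thread attributes to a specific system:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def best_clue(vectors, team_words, avoid_words):
    """Pick the candidate clue whose weakest association to the team's words is
    strongest, penalized by its closest association to any word we must avoid."""
    best, best_score = None, -np.inf
    for clue, vec in vectors.items():
        if clue in team_words or clue in avoid_words:
            continue  # a clue may not be a word on the board
        link = min(cosine(vec, vectors[w]) for w in team_words)
        risk = max(cosine(vec, vectors[w]) for w in avoid_words) if avoid_words else 0.0
        if link - risk > best_score:
            best, best_score = clue, link - risk
    return best

# Hypothetical toy vectors: "ocean" sits near both team words and far from "fire".
vectors = {
    "ocean": np.array([1.0, 0.0]),
    "water": np.array([0.95, 0.05]),
    "wave":  np.array([0.9, 0.1]),
    "beach": np.array([0.8, 0.2]),
    "fire":  np.array([0.0, 1.0]),
}
clue = best_clue(vectors, team_words={"wave", "beach"}, avoid_words={"fire"})
```

This min-over-targets scoring is also where pure embeddings reportedly struggle: as the target set grows to three or more words, the weakest link drops fast, which is consistent with the poor 3+ word results people describe without an association graph on top.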

Reasoning, Explanations, and Human Factors

  • Debate over whether models’ step-by-step explanations reflect real internal reasoning or are post‑hoc justifications.
  • Comparisons are drawn to humans’ own post‑hoc rationalizations, with disagreement on how analogous the processes are.
  • Several emphasize that true Codenames skill depends on modeling specific teammates, inside jokes, and psychology—areas where same‑model AI–AI play has an unfair advantage and where human–human play remains uniquely fun.