OpenAI's o1 Playing Codenames
LLMs Playing Codenames and Similar Experiments
- Multiple people report running Codenames-style experiments with various models (Claude, GPT‑3/3.5, o1), often finding AI guesses align well with human guesses.
- Some tried Codenames Pictures and got weaker results.
- Others built or linked apps to play Codenames/variants with LLM partners.
Evaluation, Fairness, and Benchmarks
- Many see Codenames as a natural benchmark for LLMs, given its reliance on semantic associations and light strategy.
- Suggestions include Elo-style ratings across board/card games and pitting different models or AI–human teams against each other, not just a model playing with itself.
- Critics argue AI–AI play is easier because both roles share the same “brain” and associations.
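The Elo-style rating idea above could be sketched as follows. This is an illustrative implementation of standard Elo updates (the K-factor and function names are assumptions, not anything proposed in the thread); each Codenames match between two model pairings would feed one update.

```python
# Minimal Elo sketch for rating model pairings against each other.
# K-factor of 32 is a conventional default, chosen here for illustration.

def elo_expected(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return updated (r_a, r_b) after a game.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss.
    """
    e_a = elo_expected(r_a, r_b)
    delta = k * (score_a - e_a)
    return r_a + delta, r_b - delta
```

For example, two evenly rated teams (1000 vs. 1000) would move to 1016 and 984 after a decisive game, and cross-pairings (model A spymaster with model B guesser, etc.) could each carry their own rating.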
Game Strategy, Rules, and Exploits
- Discussion of advanced tactics: high-number clues intended to span several turns, deliberately tolerating one neutral/opponent hit, and inflating the count on later clues to recover words left over from earlier, unfinished sets.
- Some argue strict rules require the number to match actual related words; others play with looser house rules.
- A binary-encoding “powers of two” strategy is noted but called explicitly illegal under the official rules.
- Layout-card memorization and pattern abuse in physical boards are mentioned; online versions can randomize away such patterns.
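The “powers of two” exploit mentioned above can be made concrete with a short sketch. The idea is that the clue's *number* (unbounded in principle) can act as a bitmask identifying exactly which board positions are friendly, ignoring the clue word entirely; this is why it is explicitly illegal, since official rules require the number to count words actually related to the clue. The indexing scheme and function names here are illustrative assumptions.

```python
# Sketch of the (illegal) binary-encoding exploit: the clue number is a
# bitmask over the 25 board positions, with set bits marking friendly cards.

def encode_positions(friendly: list[int]) -> int:
    """Pack friendly card indices (0..24) into a single integer 'clue number'."""
    mask = 0
    for i in friendly:
        mask |= 1 << i
    return mask

def decode_positions(mask: int) -> list[int]:
    """Recover the friendly card indices from the clue number."""
    return [i for i in range(25) if mask & (1 << i)]
```

A single clue such as “anything: 16777225” would then pinpoint cards 0, 3, and 24 with no semantic content at all, which is precisely the degenerate play the rules forbid.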
Quality of o1’s Play and Clue Choices
- Some are impressed, especially by high‑count clues like “paper” covering four cards and by the model's explicit reasoning traces.
- Several think the performance is overhyped: mostly safe two‑card clues, occasional luck, and questionable tactical decisions about when to keep guessing.
- Specific clues such as “007” are criticized as weak or risky due to many plausible unintended associations.
Embeddings vs Full LLMs
- Some believe classic word embeddings (word2vec, GloVe) should suffice; others report poor results, especially for 3+ word clues, unless augmented with association graphs.
- LLMs are praised for broader “latent space” search across concepts, quotes, books, and puzzles like NYT Connections.
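The embedding-based approach debated above amounts to scoring candidate clues by cosine similarity: reward closeness to the intended target words, penalize closeness to opponent/neutral words. A minimal sketch, using random stand-in vectors (a real version would load word2vec or GloVe embeddings; all names here are illustrative):

```python
import numpy as np

# Toy vocabulary with random 50-d vectors standing in for real embeddings.
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=50) for w in
         ["bank", "river", "money", "stream", "cash", "current"]}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_clue(clue: str, targets: list[str], avoid: list[str]) -> float:
    """Score a clue by its weakest link to the targets, minus its
    strongest pull toward any word the team must avoid."""
    v = vocab[clue]
    worst_target = min(cosine(v, vocab[t]) for t in targets)
    best_avoid = max(cosine(v, vocab[a]) for a in avoid) if avoid else -1.0
    return worst_target - best_avoid
```

Using the *minimum* target similarity is one way to capture the 3+ word problem commenters describe: a clue must bind even its weakest target more strongly than any dangerous word, which classic embeddings often fail to do without an association graph on top.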
Reasoning, Explanations, and Human Factors
- Debate over whether models’ step-by-step explanations reflect real internal reasoning or are post‑hoc justifications.
- Comparisons are drawn to humans’ own post‑hoc rationalizations, with disagreement on how analogous the processes are.
- Several emphasize that true Codenames skill depends on modeling specific teammates, inside jokes, and psychology—areas where same‑model AI–AI play has an unfair advantage and where human–human play remains uniquely fun.