Poker Tournament for LLMs

Quality of LLM Poker Play

  • Many hands show blatant misunderstandings: models mis-evaluate hand strength, misread boards (calling a wet board “dry”), or claim “top pair” when holding a weaker pair.
  • Models sometimes fold strong or decent hands with no pressure, mis-handle Omaha hands, or confuse draws vs made hands.
  • Participants note that these are not subtle GTO deviations but basic reasoning errors, often attributable to hallucinations and mis-parsing of state.

Limits of This “Tournament” as a Benchmark

  • Very small sample size (hundreds of hands per model) means bankroll graphs are dominated by variance; results are “for entertainment”, not statistically meaningful.
  • Full-ring no-limit is far harder to analyze than the well-studied heads-up limit variant; using it makes serious comparison even harder.
  • The format is actually a cash game despite being labeled a tournament; a long-running table with deep stacks leads to big swings.
  • Some technical oddities are observed (hand numbering, stack totals, odd pots), further undermining rigor.

Game Theory, Poker AI, and LLMs

  • Commenters with poker-AI backgrounds stress that strong play requires mixed strategies, equilibrium approximations (e.g., CFR, DeepStack, Pluribus), and a consistent strategy across subgames.
  • Current general-purpose LLMs lack internal mechanisms for proper probabilistic play and search; they can’t match specialized poker bots.
  • Debate: some argue LLMs could approximate good play via tools (search, RNG, solvers) or by learning value functions; others think text-trained models are too imprecise and math-weak.
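The mixed-strategy requirement those commenters describe is concrete even in a toy game: regret matching, the update rule at the core of CFR, converges to the uniform equilibrium of rock-paper-scissors. A minimal self-play sketch (illustrative, not anything run in the tournament):

```python
import random

def regret_matching(regrets):
    """Turn cumulative positive regrets into a mixed strategy."""
    positives = [max(r, 0.0) for r in regrets]
    total = sum(positives)
    n = len(regrets)
    if total <= 0:
        return [1.0 / n] * n  # no positive regret yet: play uniformly
    return [p / total for p in positives]

# Payoff for the row player's action vs. the column player's action (RPS).
PAYOFF = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]

def train(iterations=20000, seed=0):
    rng = random.Random(seed)
    regrets = [0.0, 0.0, 0.0]
    strategy_sum = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        strat = regret_matching(regrets)
        for i in range(3):
            strategy_sum[i] += strat[i]
        # Self-play: both players sample from the current mixed strategy.
        my = rng.choices(range(3), weights=strat)[0]
        opp = rng.choices(range(3), weights=strat)[0]
        utility = PAYOFF[my][opp]
        # Regret of each action: what it would have earned minus what we got.
        for a in range(3):
            regrets[a] += PAYOFF[a][opp] - utility
    total = sum(strategy_sum)
    return [s / total for s in strategy_sum]

avg = train()
# The *average* strategy converges toward the (1/3, 1/3, 1/3) equilibrium,
# even though the per-iteration strategy keeps oscillating.
```

The point of the sketch is the one the commenters make: equilibrium play falls out of an explicit iterative update over regrets, a mechanism a plain text-sampling LLM does not have.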

Randomness and Tool Calling

  • Simple tests of “pick a random number 1–10” show biased or obviously patterned outputs, illustrating that naive token sampling is not suitable as a game RNG.
  • Others demonstrate that with code-execution tools, models can call real PRNGs and even generate well-distributed samples.
  • There is disagreement over whether relying on external tools still “counts” as the LLM playing.
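The code-execution path mentioned above is easy to demonstrate: instead of sampling tokens, the model emits a snippet that calls a real PRNG. A minimal sketch of what such a tool call could run (the function name is illustrative):

```python
import random
from collections import Counter

def draw_uniform(n_samples, low=1, high=10, seed=None):
    """Draw integer samples from a real PRNG, as a tool-calling
    model could via a code-execution tool."""
    rng = random.Random(seed)
    return [rng.randint(low, high) for _ in range(n_samples)]

samples = draw_uniform(100_000, seed=42)
counts = Counter(samples)
# With a real PRNG, each of the 10 values lands close to 10% of the draws,
# unlike the biased or patterned sequences naive token sampling produces.
frequencies = {k: v / len(samples) for k, v in sorted(counts.items())}
```

Whether this “counts” as the LLM playing is exactly the disagreement in the thread: the randomness is now real, but it comes from the tool, not the model.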

Alternative Designs & Extensions

  • Suggestions include: heads-up formats, many more hands with position-swapping, pre-defined scenarios to probe decision quality, or using LLMs to write dedicated poker bots instead of playing directly.
  • Several people want table talk: bluffing, trash talk, visible chains of thought, and attempts to manipulate other models as a richer test of “intelligence”.
  • Parallel efforts (on-chain AI poker, custom research setups, educational poker apps) are mentioned as more controlled or specialized explorations of AI poker.
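The “pre-defined scenarios to probe decision quality” suggestion could be scripted with even the simplest poker arithmetic, such as a pot-odds check. A sketch of one such probe (the scenario numbers and function names are illustrative, not from the thread):

```python
def pot_odds(pot, to_call):
    """Required break-even equity; `pot` already includes the opponent's bet."""
    return to_call / (pot + to_call)

def should_call(equity, pot, to_call):
    """A call is profitable (ignoring future betting) when equity
    exceeds the pot odds."""
    return equity > pot_odds(pot, to_call)

# Illustrative scenario: a 9-out flush draw (~35% equity with two cards
# to come) faces a pot-sized bet. Initial pot 100, opponent bets 100,
# so the pot is 200 and the call costs 100.
required = pot_odds(200, 100)          # 1/3: need >33.3% equity to call
decision = should_call(0.35, 200, 100)
```

A battery of such fixed spots, each with a known correct answer, would separate basic state-reading and arithmetic errors from genuine strategic deviations, which is exactly the distinction the bankroll graphs cannot make.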