Poker Tournament for LLMs
Quality of LLM Poker Play
- Many hands show blatant misunderstandings: models mis-evaluate hand strength, misread boards (calling a wet board “dry”), or claim “top pair” when holding a weaker pair.
- Models sometimes fold strong or decent hands with no pressure, mishandle Omaha hands, or confuse draws with made hands.
- Participants note that these are not subtle GTO deviations but basic reasoning errors, often attributable to hallucinations and mis-parsing of state.
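Misreads like claiming "top pair" with a weaker pair are mechanically checkable. A minimal sketch (the helper names `rank_value` and `has_top_pair` are illustrative, not from the original setup) of the kind of deterministic state-parsing the models get wrong:

```python
RANK_ORDER = "23456789TJQKA"  # ascending rank order

def rank_value(card: str) -> int:
    """Numeric rank of a card string like 'Kh' (king of hearts)."""
    return RANK_ORDER.index(card[0])

def has_top_pair(hole: list[str], board: list[str]) -> bool:
    """True only if a hole card pairs the HIGHEST board rank."""
    top_board = max(rank_value(c) for c in board)
    return any(rank_value(c) == top_board for c in hole)

# Holding Q9 on a K-7-2 board is not top pair, even if a model claims it is.
print(has_top_pair(["Qh", "9d"], ["Ks", "7c", "2h"]))  # False
print(has_top_pair(["Kd", "9d"], ["Ks", "7c", "2h"]))  # True
```

The point is that these errors are trivially decidable from the game state, which is why commenters classify them as parsing failures rather than strategic deviations.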
Limits of This “Tournament” as a Benchmark
- Very small sample size (hundreds of hands per model) means bankroll graphs are dominated by variance; results are “for entertainment”, not statistically meaningful.
- Full-ring, no-limit is far harder than the well-studied heads-up limit variant; using it makes serious comparison even harder.
- The format is actually a cash game despite being labeled a tournament; a long-running table with deep stacks produces big swings.
- Technical oddities are also observed (inconsistent hand numbering, stack totals, odd pots), further undermining rigor.
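The variance point is easy to demonstrate by simulation. A minimal sketch, assuming per-hand results roughly normal with an edge of 5 bb/100 and a per-hand standard deviation of 10 bb (typical ballpark figures for no-limit hold'em, not numbers from the tournament itself):

```python
import random

def simulate_bankroll(hands: int, edge_bb: float, stdev_bb: float, seed: int) -> float:
    """Sum per-hand results (in big blinds) drawn from a normal
    distribution with a small mean edge and large standard deviation."""
    rng = random.Random(seed)
    return sum(rng.gauss(edge_bb, stdev_bb) for _ in range(hands))

# A solid 5 bb/100 winner over only 300 hands: the outcome is
# dominated by which seed (i.e., run of cards) it happens to get.
results = [simulate_bankroll(300, 0.05, 10.0, seed) for seed in range(5)]
print([round(r) for r in results])
```

Over 300 hands the standard deviation of the total (~170 bb here) dwarfs the expected profit (~15 bb), so a bankroll graph at this sample size mostly measures luck.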
Game Theory, Poker AI, and LLMs
- Commenters with poker-AI background stress that strong play requires mixed strategies, equilibrium approximations (e.g., CFR, DeepStack, Pluribus), and consistent strategy across subgames.
- Current general-purpose LLMs lack internal mechanisms for proper probabilistic play and search; they can’t match specialized poker bots.
- Debate: some argue LLMs could approximate good play via tools (search, RNG, solvers) or by learning value functions; others think text-trained models are too imprecise and math-weak.
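To make the "mixed strategies via regret minimization" point concrete, here is a minimal sketch of regret matching, the update rule at the core of CFR, played against a fixed opponent mix in rock–paper–scissors (the setup and function names are illustrative, not from the discussion):

```python
ACTIONS = 3  # indices: 0 = rock, 1 = paper, 2 = scissors

def payoff(a: int, b: int) -> int:
    """+1 if action a beats b, -1 if it loses, 0 on a tie."""
    return 0 if a == b else (1 if (a - b) % 3 == 1 else -1)

def regret_matching(regrets: list[float]) -> list[float]:
    """Turn cumulative regrets into a mixed strategy: play each action
    in proportion to its positive regret (uniform if none is positive)."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    return [p / total for p in positive] if total > 0 else [1.0 / ACTIONS] * ACTIONS

def train_vs(opponent: list[float], iterations: int) -> list[float]:
    """Accumulate regrets against a fixed opponent mix; the
    time-averaged strategy converges to a best response."""
    regrets = [0.0] * ACTIONS
    strategy_sum = [0.0] * ACTIONS
    for _ in range(iterations):
        strat = regret_matching(regrets)
        # Expected payoff of each pure action against the opponent's mix.
        action_ev = [sum(opponent[b] * payoff(a, b) for b in range(ACTIONS))
                     for a in range(ACTIONS)]
        my_ev = sum(strat[a] * action_ev[a] for a in range(ACTIONS))
        for a in range(ACTIONS):
            regrets[a] += action_ev[a] - my_ev
            strategy_sum[a] += strat[a]
    total = sum(strategy_sum)
    return [s / total for s in strategy_sum]

# Exploiting an opponent who over-plays rock: weight shifts to paper.
print(train_vs([0.5, 0.3, 0.2], 1000))
```

CFR applies this same regret-to-strategy update at every decision point of the game tree; the commenters' point is that nothing in a plain LLM forward pass performs this kind of iterative equilibrium computation.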
Randomness and Tool Calling
- Simple tests of “random number 1–10” show biased outputs or obviously patterned sequences, illustrating that naive token sampling is unsuitable as a game RNG.
- Others demonstrate that with code-execution tools, models can call real PRNGs and even generate well-distributed samples.
- There is disagreement over whether relying on external tools still “counts” as the LLM playing.
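The bias is easy to quantify. A minimal sketch of a uniformity check, comparing a real PRNG against a caricature of the reported LLM behavior where one number ("7") is heavily over-represented (the over-representation rate here is an assumption for illustration):

```python
import random
from collections import Counter

def uniformity_gap(samples: list[int]) -> float:
    """Max minus min observed frequency over the values 1-10,
    as a fraction of sample size. Near 0 for a uniform source."""
    counts = Counter(samples)
    n = len(samples)
    freqs = [counts.get(v, 0) / n for v in range(1, 11)]
    return max(freqs) - min(freqs)

# A real PRNG, as available via a code-execution tool: close to uniform.
rng = random.Random(42)
prng_samples = [rng.randint(1, 10) for _ in range(10_000)]

# Caricature of naive LLM sampling: "7" picked far too often.
biased_samples = [7] * 4_000 + [rng.randint(1, 10) for _ in range(6_000)]

print(uniformity_gap(prng_samples))   # small gap
print(uniformity_gap(biased_samples)) # large gap
```

This is the substance of the tool-calling argument: the model only needs to emit a call to a real RNG, not produce the entropy itself, which is exactly what fuels the "does that still count?" disagreement.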
Alternative Designs & Extensions
- Suggestions include: heads-up formats, many more hands with position-swapping, pre-defined scenarios to probe decision quality, or having LLMs write dedicated poker bots instead of playing directly.
- Several people want table talk: bluffing, trash talk, visible chains-of-thought, and attempts to manipulate other models as a richer test of “intelligence”.
- Parallel efforts (on-chain AI poker, custom research setups, educational poker apps) are mentioned as more controlled or specialized explorations of AI poker.