Poker Tournament for LLMs
Quality of LLM Poker Play
- Many hands show blatant misunderstandings: models mis-evaluate hand strength, misread boards (calling a wet board “dry”), or claim “top pair” when holding a weaker pair.
- Models sometimes fold strong or decent hands with no pressure, mishandle Omaha hands, or confuse draws with made hands.
- Participants note that these are not subtle GTO deviations but basic reasoning errors, often attributable to hallucinations and mis-parsing of state.
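Misreads like claiming "top pair" with a weaker pair are mechanically checkable. A minimal sketch (the helper names `rank_value` and `has_top_pair` are illustrative, not from the original setup) of the kind of deterministic state-parsing the models get wrong:

```python
RANK_ORDER = "23456789TJQKA"  # ascending rank order

def rank_value(card: str) -> int:
    """Numeric rank of a card string like 'Kh' (king of hearts)."""
    return RANK_ORDER.index(card[0])

def has_top_pair(hole: list[str], board: list[str]) -> bool:
    """True only if a hole card pairs the HIGHEST board rank."""
    top_board = max(rank_value(c) for c in board)
    return any(rank_value(c) == top_board for c in hole)

# Holding Q9 on a K-7-2 board is not top pair, even if a model claims it is.
print(has_top_pair(["Qh", "9d"], ["Ks", "7c", "2h"]))  # False
print(has_top_pair(["Kd", "9d"], ["Ks", "7c", "2h"]))  # True
```

The point is that these errors are trivially decidable from the game state, which is why commenters classify them as parsing failures rather than strategic deviations.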
Limits of This “Tournament” as a Benchmark
- Very small sample size (hundreds of hands per model) means bankroll graphs are dominated by variance; results are “for entertainment”, not statistically meaningful.
- Full-ring, no-limit is far harder than the well-studied heads-up limit variant; using it makes serious comparison even harder.
- The format is actually a cash game despite being labeled a tournament; a long-running table with deep stacks produces big swings.
- Technical oddities are also observed (inconsistent hand numbering, stack totals, odd pots), further undermining rigor.
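The variance point is easy to demonstrate by simulation. A minimal sketch, assuming per-hand results roughly normal with an edge of 5 bb/100 and a per-hand standard deviation of 10 bb (typical ballpark figures for no-limit hold'em, not numbers from the tournament itself):

```python
import random

def simulate_bankroll(hands: int, edge_bb: float, stdev_bb: float, seed: int) -> float:
    """Sum per-hand results (in big blinds) drawn from a normal
    distribution with a small mean edge and large standard deviation."""
    rng = random.Random(seed)
    return sum(rng.gauss(edge_bb, stdev_bb) for _ in range(hands))

# A solid 5 bb/100 winner over only 300 hands: the outcome is
# dominated by which seed (i.e., run of cards) it happens to get.
results = [simulate_bankroll(300, 0.05, 10.0, seed) for seed in range(5)]
print([round(r) for r in results])
```

Over 300 hands the standard deviation of the total (~170 bb here) dwarfs the expected profit (~15 bb), so a bankroll graph at this sample size mostly measures luck.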
Game Theory, Poker AI, and LLMs
- Commenters with poker-AI background stress that strong play requires mixed strategies, equilibrium approximations (e.g., CFR, DeepStack, Pluribus), and consistent strategy across subgames.
- Current general-purpose LLMs lack internal mechanisms for proper probabilistic play and search; they can’t match specialized poker bots.
- Debate: some argue LLMs could approximate good play via tools (search, RNG, solvers) or by learning value functions; others think text-trained models are too imprecise and math-weak.
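To make the "mixed strategies via regret minimization" point concrete, here is a minimal sketch of regret matching, the update rule at the core of CFR, played against a fixed opponent mix in rock–paper–scissors (the setup and function names are illustrative, not from the discussion):

```python
ACTIONS = 3  # indices: 0 = rock, 1 = paper, 2 = scissors

def payoff(a: int, b: int) -> int:
    """+1 if action a beats b, -1 if it loses, 0 on a tie."""
    return 0 if a == b else (1 if (a - b) % 3 == 1 else -1)

def regret_matching(regrets: list[float]) -> list[float]:
    """Turn cumulative regrets into a mixed strategy: play each action
    in proportion to its positive regret (uniform if none is positive)."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    return [p / total for p in positive] if total > 0 else [1.0 / ACTIONS] * ACTIONS

def train_vs(opponent: list[float], iterations: int) -> list[float]:
    """Accumulate regrets against a fixed opponent mix; the
    time-averaged strategy converges to a best response."""
    regrets = [0.0] * ACTIONS
    strategy_sum = [0.0] * ACTIONS
    for _ in range(iterations):
        strat = regret_matching(regrets)
        # Expected payoff of each pure action against the opponent's mix.
        action_ev = [sum(opponent[b] * payoff(a, b) for b in range(ACTIONS))
                     for a in range(ACTIONS)]
        my_ev = sum(strat[a] * action_ev[a] for a in range(ACTIONS))
        for a in range(ACTIONS):
            regrets[a] += action_ev[a] - my_ev
            strategy_sum[a] += strat[a]
    total = sum(strategy_sum)
    return [s / total for s in strategy_sum]

# Exploiting an opponent who over-plays rock: weight shifts to paper.
print(train_vs([0.5, 0.3, 0.2], 1000))
```

CFR applies this same regret-to-strategy update at every decision point of the game tree; the commenters' point is that nothing in a plain LLM forward pass performs this kind of iterative equilibrium computation.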
Randomness and Tool Calling
- Simple tests of “random number 1–10” show biased outputs or obviously patterned sequences, illustrating that naive token sampling is unsuitable as a game RNG.
- Others demonstrate that with code-execution tools, models can call real PRNGs and even generate well-distributed samples.
- There is disagreement over whether relying on external tools still “counts” as the LLM playing.
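The bias is easy to quantify. A minimal sketch of a uniformity check, comparing a real PRNG against a caricature of the reported LLM behavior where one number ("7") is heavily over-represented (the over-representation rate here is an assumption for illustration):

```python
import random
from collections import Counter

def uniformity_gap(samples: list[int]) -> float:
    """Max minus min observed frequency over the values 1-10,
    as a fraction of sample size. Near 0 for a uniform source."""
    counts = Counter(samples)
    n = len(samples)
    freqs = [counts.get(v, 0) / n for v in range(1, 11)]
    return max(freqs) - min(freqs)

# A real PRNG, as available via a code-execution tool: close to uniform.
rng = random.Random(42)
prng_samples = [rng.randint(1, 10) for _ in range(10_000)]

# Caricature of naive LLM sampling: "7" picked far too often.
biased_samples = [7] * 4_000 + [rng.randint(1, 10) for _ in range(6_000)]

print(uniformity_gap(prng_samples))   # small gap
print(uniformity_gap(biased_samples)) # large gap
```

This is the substance of the tool-calling argument: the model only needs to emit a call to a real RNG, not produce the entropy itself, which is exactly what fuels the "does that still count?" disagreement.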
Alternative Designs & Extensions
- Suggestions include: heads-up formats, many more hands with position-swapping, pre-defined scenarios to probe decision quality, or having LLMs write dedicated poker bots instead of playing directly.
- Several people want table talk: bluffing, trash talk, visible chains-of-thought, and attempts to manipulate other models as a richer test of “intelligence”.
- Parallel efforts (on-chain AI poker, custom research setups, educational poker apps) are mentioned as more controlled or specialized explorations of AI poker.