Agent activity loop: Capturing → Perceiving → Routing → Acting → Verifying → Recovering → Requesting Human
A Texas Hold'em-style tabletop benchmark that couples semantic grounding, sequential state tracking, and fine-grained multi-finger manipulation on a ShadowHand-UR10e platform.
Abstract
Current embodied-agent benchmarks often emphasize semantic grounding and planning while relying on simulation, coarse actions, or gripper-centric manipulation. Dexterous-manipulation benchmarks capture contact-rich control, but usually evaluate isolated motor skills without instruction-conditioned visual grounding or long-horizon state tracking.
We introduce DexHoldem, a real-world ShadowHand benchmark that uses Texas Hold'em-style tabletop interaction to couple semantic grounding, sequential state tracking, and fine-grained multi-finger manipulation. The benchmark contains 1,470 teleoperated demonstrations across 14 atomic card and chip primitives, a completed physical evaluation of nine policies, and a 36-problem perception benchmark for parsing turn state, cards, chips, bets, robot activity, recovery conditions, and outcomes.
On the current benchmark, the strongest policy reaches 47.5% scene-preserving success and 61.2% task completion, while perception remains bottlenecked by complete state recovery: Opus 4.7 obtains the best strict problem-level score at 34.3%, and GPT 5.5 obtains the best field-wise average at 66.8%. Three full-loop case studies further expose how waits, recoveries, human-help requests, and repeated primitive dispatches accumulate in real hand-level play.
Benchmark Scope
DexHoldem is not a benchmark of gambling strategy. It uses Hold'em-style tabletop interaction as a controlled real-world setting where semantic state, object layout, long-horizon progress tracking, and dexterous physical execution all matter.
Policy Bench: 14 language-instructed atomic primitives over cards and chips, each with 105 teleoperated demonstrations, a fixed 100/5 train/validation split, and a shared 30-dimensional joint-position action space (a sketch of one reading of this action space follows the cards below).
Perception Bench: 36 real deployment states test whether a perceiver can recover structured game-state memory from one current agent-view capture and predecessor context.
Embodied System: A capture-parse-route-execute-verify loop composes visual state parsing, legal high-level actions, primitive execution, recovery, and human intervention when safe continuation fails.
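To make the shared action space concrete, here is a minimal Python sketch. The 6-arm-joint / 24-hand-joint decomposition is our assumption (the benchmark states only the 30-dimensional total), and every name in the snippet is illustrative rather than the released API.

```python
import numpy as np

# ASSUMPTION: the 30 action dimensions split into 6 UR10e arm joints plus
# 24 ShadowHand joint-position targets; the benchmark specifies only the
# 30-dimensional total.
ARM_DOF, HAND_DOF = 6, 24

def make_action(arm_q: np.ndarray, hand_q: np.ndarray) -> np.ndarray:
    """Pack arm and hand joint-position targets into one 30-D action."""
    assert arm_q.shape == (ARM_DOF,)
    assert hand_q.shape == (HAND_DOF,)
    return np.concatenate([arm_q, hand_q])  # shape (30,)
```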
System Details
The policy benchmark isolates atomic dexterous execution from game-level decision making. Each policy is evaluated under the same 80-rollout physical primitive schedule and scored with a four-level rubric that separates task completion from scene preservation.
| Component | Specification |
|---|---|
| Robot Platform | ShadowHand on UR10e, real-world tabletop |
| Observations | Top-down, third-person, and wrist RGB-D views plus arm and hand proprioception |
| Policy Data | 100 train and 5 validation trajectories per primitive |
| Perception Data | 36 labeled tabletop-state problems with deterministic semantic checks |
| Evaluation | SPSR/TCR for completed policy rollouts; strict exact-match and field-wise perception accuracy; protocol specification for system rollouts |
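As a concrete reading of the rubric-to-metric mapping, the sketch below assumes the four levels are the SP/DC/TF/DF columns of the results table, that SPSR counts only scene-preserving successes, and that TCR also admits completions that disturb the scene; this reading reproduces the abstract's 47.5% / 61.2% figures for the strongest policy.

```python
from collections import Counter

def policy_metrics(outcomes: list[str], n_rollouts: int = 80) -> dict[str, float]:
    """Score one policy's 80 physical rollouts under the four-level rubric.

    ASSUMPTION: SP = scene-preserving success, DC = completion that disturbs
    the scene, TF = task failure, DF = destructive failure. Only SP counts
    toward SPSR; SP and DC both count toward TCR.
    """
    counts = Counter(outcomes)
    return {
        "SPSR": 100.0 * counts["SP"] / n_rollouts,
        "TCR": 100.0 * (counts["SP"] + counts["DC"]) / n_rollouts,
    }

# pi0.5's counts (38 SP, 11 DC, 31 TF, 0 DF) give SPSR 47.5 and TCR 61.25,
# matching the reported 47.5% / 61.2%.
print(policy_metrics(["SP"] * 38 + ["DC"] * 11 + ["TF"] * 31))
```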
Embodied System
DexHoldem composes an agent-view perception loop with dexterous-policy primitives. The agent captures a tabletop image, updates its structured game-state memory, routes through deterministic workflow gates, and dispatches a robot primitive only when physical motion is required.
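A minimal sketch of this loop follows, assuming duck-typed robot, perceiver, memory, and harness objects; every function and attribute name here is a hypothetical illustration, not the released harness API.

```python
def run_hand(robot, perceiver, memory, harness, max_steps=200):
    """Drive one hand through the capture-parse-route-execute-verify loop."""
    for _ in range(max_steps):
        image = robot.capture_agent_view()         # Capturing
        state = perceiver.parse(image, memory)     # Perceiving
        memory.update(state)                       # refresh game-state memory
        if state.outcome is not None:
            return state.outcome                   # hand resolved
        action = harness.route(state)              # Routing via deterministic gates
        if action.kind == "wait":
            continue                               # opponent's turn: no motion
        robot.dispatch(action.primitive)           # Acting: one dexterous primitive
        verified = harness.verify(robot.capture_agent_view(), memory, action)
        if not verified:                           # Verifying failed
            if not harness.recover(robot, memory, action):   # Recovering
                harness.request_human(memory, action)        # Requesting Human
    return None  # step budget exhausted without resolving the hand
```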
Agentic Perception Bench
Each perception problem samples one real tabletop state from system-level deployment. The perceiver parses only the current agent-view capture, optionally uses predecessor states as context, and fills in a fixed structured schema covering loop stage, turn ownership, blinds, cards, chip inventories, current bets, outcome, and uncertainty. The problems probe (a minimal schema sketch follows the list):
- Turn ownership, blind context, and wait-vs-act decisions.
- Acting, atom-idle, cached-sequence continuation, and activity-stage tracking.
- Retryable recovery, down states, and human intervention.
- Cards, chips, bets, and turn state for downstream routing.
- Robot-held card identity using current visual evidence and predecessor context.
- Win-or-lose judgment from visible cards, cached state, or opponent fold.
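These categories all write into one fixed schema; below is a hypothetical Python rendering of it, with field names taken from the prose above rather than from the released data format.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TableState:
    """Hypothetical rendering of the fixed perception schema."""
    loop_stage: str                      # e.g. acting, atom-idle, recovery, down
    turn_owner: str                      # robot or opponent
    blinds: dict                         # small/big blind assignment and sizes
    cards: dict                          # visible community and robot-held cards
    chip_inventory: dict                 # per-player chip counts
    current_bets: dict                   # outstanding bets in this betting round
    outcome: Optional[str] = None        # win / lose once the hand resolves
    uncertainty: list = field(default_factory=list)  # fields flagged as unsure
```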
Results
Policy results are measured over 80 real-world primitive rollouts per model. Perception results use strict problem-level exact-match over 36 tabletop states, plus field-wise diagnostic accuracies. Full-system results are reported as three case-study trajectories with operational counters rather than as a hand-level leaderboard.
| Policy | Family | SP | DC | TF | DF | SPSR (%) | TCR (%) |
|---|---|---|---|---|---|---|---|
| π0.5 | VLA | 38 | 11 | 31 | 0 | 47.5 | 61.2 |
| π0 | VLA | 38 | 8 | 33 | 1 | 47.5 | 57.5 |
| RDT | Robot-pretrained | 24 | 13 | 40 | 3 | 30.0 | 46.2 |
| DP (DINO) | Task-trained | 21 | 8 | 48 | 3 | 26.2 | 36.2 |
| DP-Transformer | Task-trained | 11 | 5 | 46 | 18 | 13.8 | 20.0 |
| RDT-small | Task-trained | 11 | 3 | 59 | 7 | 13.8 | 17.5 |
| ACT | Task-trained | 8 | 4 | 67 | 1 | 10.0 | 15.0 |
| BAKU | Task-trained | 5 | 5 | 67 | 3 | 6.2 | 12.5 |
| DP-UNet | Task-trained | 1 | 0 | 79 | 0 | 1.2 | 1.2 |

SP/DC/TF/DF counts sum to the 80 rollouts per policy; SPSR = SP / 80 and TCR = (SP + DC) / 80, consistent with the abstract's 47.5% and 61.2% for the strongest policy.
| Harness | Perceiver | Overall | LS | TO | BI | CC | CB | RCI | OCI | SO | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Codex | GPT 5.5 | 31.5 | 72.2 | 80.6 | 100.0 | 61.5 | 45.8 | 62.5 | 35.4 | 76.2 | 66.8 |
| Codex | GPT 5.4 | 31.5 | 65.7 | 93.5 | 100.0 | 23.1 | 31.2 | 56.2 | 18.8 | 47.6 | 54.5 |
| Codex | GPT 5.4 mini | 25.9 | 56.5 | 94.4 | 99.1 | 33.3 | 14.6 | 29.2 | 18.8 | 47.6 | 49.2 |
| Claude Code | Opus 4.7 | 34.3 | 43.5 | 93.5 | 100.0 | 43.6 | 31.2 | 37.5 | 43.8 | 0.0 | 49.1 |
| Claude Code | Sonnet 4.6 | 25.0 | 46.3 | 88.0 | 100.0 | 23.1 | 10.4 | 29.2 | 22.9 | 14.3 | 41.8 |
| Claude Code | Haiku 4.5 | 13.9 | 47.2 | 68.5 | 91.7 | 35.9 | 12.5 | 25.0 | 18.8 | 0.0 | 37.4 |
| Gemini CLI | Gemini 3 Flash | 20.4 | 63.9 | 77.8 | 100.0 | 28.2 | 18.8 | 29.2 | 22.9 | 71.4 | 51.5 |
| Gemini CLI | Gemini 3.1 Flash L. | 10.2 | 27.8 | 73.1 | 94.4 | 28.2 | 12.5 | 22.9 | 14.6 | 0.0 | 34.2 |

All values are percentages. Overall is strict problem-level exact-match over the 36 states; the middle columns are per-field accuracies; Avg is their field-wise mean.
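The two headline perception metrics can be reproduced in a few lines; this sketch assumes predictions and labels arrive as flat field dictionaries, which is our simplification of the released format.

```python
def perception_scores(preds: list[dict], labels: list[dict]) -> dict[str, float]:
    """Strict problem-level exact-match plus per-field accuracies.

    A problem counts toward Overall only if every labeled field matches
    exactly; Avg is the mean of the per-field accuracies. Dict keys stand
    in for the benchmark's schema fields.
    """
    fields = sorted(labels[0].keys())
    n = len(labels)
    scores = {
        f: 100.0 * sum(p[f] == l[f] for p, l in zip(preds, labels)) / n
        for f in fields
    }
    scores["avg"] = sum(scores[f] for f in fields) / len(fields)
    scores["overall"] = 100.0 * sum(
        all(p[f] == l[f] for f in fields) for p, l in zip(preds, labels)
    ) / n
    return scores
```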
Demos & Release
Policy-bench rollouts are released as compressed web videos with labels recovered from the original model-task-result filenames. The system-level release documents the GPT 5.5 + π0 hand traces, state folders, wait branches, recovery dispatches, and primitive counters used in the paper's case studies.
Policy Bench Rollouts
Source recordings are 4K 60 fps MOV files. The hosted videos are 960×540 H.264 MP4 previews with the original filenames preserved for traceability.
Benchmark Data
1,470 physical policy demonstrations plus 36 agent-state problems with ground-truth state, route, and action labels.
Policy Evaluation
Physical primitive trials compare π-series, RDT, diffusion-policy, ACT, and BAKU policies under SPSR and TCR.
Agent Evaluation
Agentic perception results over 36 tabletop states quantify strict state recovery and field-wise bottlenecks.
System Evaluation
Three GPT 5.5 + π0 hand-level case studies expose wait branches, recovery dispatches, human-help requests, and primitive counters.
Resources
@misc{dexholdem2026,
title = {DexHoldem: Playing Texas Hold'em with Dexterous Embodied System},
author = {Chen, Feng and Chu, Tianzhe and Sun, Li and Zhou, Pei and Xu, Zhuxiu and Gao, Shenghua and Zhai, Yuexiang and Yang, Yanchao and Ma, Yi},
year = {2026},
url = {https://dexholdem.github.io/Dexholdem/}
}
Author Contributions
Contributions are summarized for the project website. For a compact list of every author with profile links, see the dedicated author page.
- Feng Chen: Co-proposed and led the project; designed the data-collection infrastructure; maintained the hardware; trained DP, RDT, and ACT; contributed to embodied-agent and perception-benchmark design; collected data; and built the project website.
- Tianzhe Chu: Co-proposed the project; designed the data-collection infrastructure; led the embodied-agent and perception-benchmark design; and performed teleoperation.
- Li Sun: Co-proposed the project; designed the data-collection infrastructure; trained Octo; and performed teleoperation.
- Pei Zhou: Trained the π-series and BAKU models; deployed and evaluated policy models and embodied agents; and performed teleoperation.
- Zhuxiu Xu: Designed the simulation component; deployed and evaluated embodied agents; and collected data.
Shenghua Gao, Yuexiang Zhai, Yanchao Yang, and Yi Ma provided project guidance and feedback. Yuexiang Zhai and Yi Ma also co-proposed the project.