DexHoldem: Playing Texas Hold'em with Dexterous Embodied System

Feng Chen*†, Tianzhe Chu*, Li Sun*, Pei Zhou*, Zhuxiu Xu, Shenghua Gao, Yuexiang Zhai, Yanchao Yang, Yi Ma

* Equal contribution. † Project leader.

1,470 teleoperated demonstrations
14 Texas Hold'em-style primitives
36 agentic perception problems
3 system-level case studies

One-Minute Demo

Abstract

We introduce DexHoldem, a real-world benchmark for dexterous-hand embodied systems built around Texas Hold'em-style tabletop interaction. DexHoldem includes 1,470 teleoperated demonstrations across 14 card and chip action primitives for policy training and evaluation, along with a 36-problem perception bench that tests an embodied agent's ability to parse turn gates, cards, chips, bets, outcomes, and robot activity from table and robot scene observations. On the current benchmark, the best policy achieves 61.2% task completion and 47.5% scene-preserving success. The best perceiver reaches 66.8% field-wise accuracy (GPT 5.5), while the best strict problem-level exact match is only 34.3% (Opus 4.7). We further present three full-system case studies that close the perception-to-action loop on a physical table, revealing how recovery, human-help requests, and repeated primitive dispatches compound during real dexterous play.

Full Agent Demos

Complete Texas Hold'em Demos

The autonomous demo runs the embodied agent loop end to end, while in the teleoperation demo a human operator controls the robot hand. Videos are shown at 10x speed.

Autonomous System

Fully autonomous agent demo

The system perceives the tabletop state, routes to an action, and executes it with the policy models.

Human Teleoperation

Human teleop demo with agent guidance.

A human teleoperator controls the robot hand through a complete poker hand, separate from the autonomous policy benchmark.

Figure 1: (a) Environment setup with ShadowHand on UR10e, showing top-view, third-person, and wrist camera placements alongside the teleoperation device. (b) System overview: the agent perceives the table, updates game-state memory, routes through a strategy layer, and dispatches a policy primitive as the action. (c) Policy bench results: task completion rate versus pretraining data scale for VLA and imitation learning families. (d) Agent bench results: problem-level exact match and field-wise accuracy across frontier models.

Benchmark Scope

Texas Hold'em tabletop interaction naturally couples the three capabilities DexHoldem measures. Cards and chips demand dexterous multi-finger primitives (policy bench). The evolving game state (turn gates, bets, hands, outcomes) must be recovered from raw scene observations (perception bench). A full hand of play requires the agent to perceive the scene, maintain game-state memory, reason over state and robot activity to select an action, and dispatch dexterous primitives for execution (system-level evaluation).

Embodied System

The DexHoldem embodied system is implemented as a skill for coding-agent harnesses (Claude Code, Gemini CLI, Codex), each binding its natively supported vision-language model. The skill follows a perceive-route-execute loop. First, the agent parses the current table image into structured game state (turn gate, cards, chips, bets, robot activity). Next, a deterministic router loads persistent game-state memory (including verified hole cards and multi-step action progress), validates the parsed fields, enforces safety limits, and selects an action class: wait, view card, push/pull chips, show hand, or request human help. Finally, the selected action is translated into instruction codes that direct the dexterous policy model during physical execution.

Figure 2: DexHoldem system loop: capture tabletop observations, load and renew game-state memory, reason over the activity stage, and execute the selected dexterous primitive.
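
The routing step of the loop can be illustrated with a minimal Python sketch. Only the action classes come from the description above; the field names (`turn_gate`, `robot_activity`, `outcome`), the `Memory` layout, the retry limit, and the rule order are assumptions made for illustration, not the released skill. The strategy layer is omitted.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    """Action classes named in the system description."""
    WAIT = "wait"
    VIEW_CARD = "view_card"
    PUSH_CHIPS = "push_chips"
    PULL_CHIPS = "pull_chips"
    SHOW_HAND = "show_hand"
    REQUEST_HELP = "request_human_help"


@dataclass
class Memory:
    """Persistent game-state memory (hypothetical layout)."""
    hole_cards: Optional[list] = None  # verified hole cards, once viewed
    failed_dispatches: int = 0         # consecutive failed primitive dispatches


MAX_FAILED_DISPATCHES = 3  # hypothetical safety limit
REQUIRED_FIELDS = {"turn_gate", "robot_activity", "outcome"}  # hypothetical names


def route(parsed: dict, memory: Memory) -> Action:
    """Deterministically map parsed state plus memory to an action class."""
    # Validate the parsed fields before acting on them.
    if not REQUIRED_FIELDS.issubset(parsed):
        return Action.REQUEST_HELP
    # Enforce the safety limit on repeated failures.
    if memory.failed_dispatches >= MAX_FAILED_DISPATCHES:
        return Action.REQUEST_HELP
    # Do nothing while the robot is busy or it is not the robot's turn.
    if parsed["robot_activity"] == "busy" or parsed["turn_gate"] != "robot":
        return Action.WAIT
    if memory.hole_cards is None:
        return Action.VIEW_CARD
    if parsed["outcome"] == "showdown":
        return Action.SHOW_HAND
    if parsed["outcome"] == "robot_wins":
        return Action.PULL_CHIPS
    return Action.PUSH_CHIPS


state = {"turn_gate": "robot", "robot_activity": "idle", "outcome": None}
print(route(state, Memory()))                          # hole cards not yet verified
print(route(state, Memory(hole_cards=["Ah", "Kd"])))   # ready to bet
```

Keeping the router free of model calls means the same parsed state and memory always yield the same action, which makes perception errors easier to separate from routing errors.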

Agentic Perception Bench

Each of the 36 perception problems is a snapshot from a real system-level deployment trajectory. The perceiver receives the current agent-view capture together with all predecessor states as context, and must output a structured state covering 8 fields: loop stage, turn ownership, blind position, community cards, current bets, chip inventories (two fields, reported as RCI and OCI in the results table), and outcome. An example problem follows.
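
One way to picture a perception problem is as a record holding the predecessor states, the target capture, and the ground-truth target state. The sketch below is an assumption-laden illustration: the field keys reuse the column abbreviations from the results table, while the class name, the `target_image` path, and all state values are hypothetical and do not reflect the released data format.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

# Keys reuse the results-table abbreviations; value types are not specified
# on this page, so they are left as Any.
FIELDS = ("LS", "TO", "BI", "CC", "CB", "RCI", "OCI", "SO")


def validate_state(state: Dict[str, Any]) -> None:
    """Reject states that carry keys outside the eight known fields."""
    unknown = set(state) - set(FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")


@dataclass
class PerceptionProblem:
    problem_id: str
    context: List[Dict[str, Any]]  # predecessor states, oldest first
    target_image: str              # agent-view capture to parse (hypothetical path)
    target: Dict[str, Any]         # ground truth; holds only the applicable fields

    def __post_init__(self) -> None:
        for state in self.context + [self.target]:
            validate_state(state)


# Placeholder values only; they are made up to show the shape of a problem.
placeholder = {"LS": "wait", "TO": "opponent", "CC": []}
problem = PerceptionProblem(
    problem_id="p19",
    context=[dict(placeholder) for _ in range(32)],
    target_image="captures/p19_target.jpg",
    target={"LS": "act", "TO": "robot", "CC": []},
)
print(len(problem.context), "context states,", len(problem.target), "target fields")
```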

Example: problem p19 provides 32 predecessor states as context and 1 target state to predict.

Results

We report three sets of results on this page: (1) aggregate policy results scored by a four-level rubric over 80 real-world primitive rollouts per model, (2) per-perceiver accuracy on the 36-problem perception bench with strict problem-level exact match and field-wise diagnostics, and (3) RDT pretraining data-scaling analysis. See the full results page for system-level case studies and detailed analysis.

Aggregate policy results over 80 real-world primitive-evaluation trials per policy.
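| Policy | Family | SP | DC | TF | DF | SPSR | TCR |
|---|---|---|---|---|---|---|---|
| π0.5 | Pretrained | 38 | 11 | 31 | 0 | 47.5% | 61.2% |
| π0 | Pretrained | 38 | 8 | 33 | 1 | 47.5% | 57.5% |
| RDT | Pretrained | 24 | 13 | 40 | 3 | 30.0% | 46.2% |
| DP (DINO) | From-scratch | 21 | 8 | 48 | 3 | 26.2% | 36.2% |
| DP-Transformer | From-scratch | 11 | 5 | 46 | 18 | 13.8% | 20.0% |
| RDT-small | From-scratch | 11 | 3 | 59 | 7 | 13.8% | 17.5% |
| ACT | From-scratch | 8 | 4 | 67 | 1 | 10.0% | 15.0% |
| BAKU | From-scratch | 5 | 5 | 67 | 3 | 6.2% | 12.5% |
| DP-UNet | From-scratch | 1 | 0 | 79 | 0 | 1.2% | 1.2% |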
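
The two rate columns can be recomputed from the four count columns. The relationship used below, SPSR = SP / N and TCR = (SP + DC) / N with N = SP + DC + TF + DF, is inferred from the numbers in the table rather than stated on this page, so treat it as an assumption. SPSR and TCR correspond to the scene-preserving success and task completion rates quoted in the abstract.

```python
# Counts copied from the policy table: (SP, DC, TF, DF) per policy.
ROLLOUTS = {
    "π0.5": (38, 11, 31, 0),
    "π0": (38, 8, 33, 1),
    "RDT": (24, 13, 40, 3),
    "DP (DINO)": (21, 8, 48, 3),
    "DP-Transformer": (11, 5, 46, 18),
    "RDT-small": (11, 3, 59, 7),
    "ACT": (8, 4, 67, 1),
    "BAKU": (5, 5, 67, 3),
    "DP-UNet": (1, 0, 79, 0),
}


def rates(sp: int, dc: int, tf: int, df: int) -> tuple:
    """Return (SPSR, TCR) in percent under the inferred definitions."""
    n = sp + dc + tf + df
    spsr = 100 * sp / n         # first rubric level only
    tcr = 100 * (sp + dc) / n   # first two rubric levels
    return spsr, tcr


for name, counts in ROLLOUTS.items():
    spsr, tcr = rates(*counts)
    print(f"{name:<15} N={sum(counts)}  SPSR={spsr:.1f}%  TCR={tcr:.1f}%")
```

Under these definitions every row sums to 80 trials and the recomputed rates agree with the published columns to within rounding.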
Per-perceiver accuracy on perception bench. Overall is strict problem-level exact match; sub-field columns are field-wise accuracies on applicable subsets.
| Harness | Perceiver | Overall | LS | TO | BI | CC | CB | RCI | OCI | SO | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Codex | GPT 5.5 | 31.5 | 72.2 | 80.6 | 100.0 | 61.5 | 45.8 | 62.5 | 35.4 | 76.2 | 66.8 |
| Codex | GPT 5.4 | 31.5 | 65.7 | 93.5 | 100.0 | 23.1 | 31.2 | 56.2 | 18.8 | 47.6 | 54.5 |
| Codex | GPT 5.4 mini | 25.9 | 56.5 | 94.4 | 99.1 | 33.3 | 14.6 | 29.2 | 18.8 | 47.6 | 49.2 |
| Claude Code | Opus 4.7 | 34.3 | 43.5 | 93.5 | 100.0 | 43.6 | 31.2 | 37.5 | 43.8 | 0.0 | 49.1 |
| Claude Code | Sonnet 4.6 | 25.0 | 46.3 | 88.0 | 100.0 | 23.1 | 10.4 | 29.2 | 22.9 | 14.3 | 41.8 |
| Claude Code | Haiku 4.5 | 13.9 | 47.2 | 68.5 | 91.7 | 35.9 | 12.5 | 25.0 | 18.8 | 0.0 | 37.4 |
| Gemini CLI | Gemini 3 Flash | 20.4 | 63.9 | 77.8 | 100.0 | 28.2 | 18.8 | 29.2 | 22.9 | 71.4 | 51.5 |
| Gemini CLI | Gemini 3.1 Flash L. | 10.2 | 27.8 | 73.1 | 94.4 | 28.2 | 12.5 | 22.9 | 14.6 | 0.0 | 34.2 |

Notes: each sub-field column (LS, TO, BI, CC, CB, RCI, OCI, SO) is scored only on the subset of problems where that field applies. For example, SO is evaluated only on showdown problems, so a 0% SO does not prevent a model from achieving a higher Overall score. Overall requires an exact match across all applicable fields of a problem. Accuracy may vary with harness version. Results were evaluated on May 7, 2026.
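
The scoring rules in the notes can be sketched as follows. This is an illustrative implementation under stated assumptions, not the benchmark's evaluation script: a ground-truth state is assumed to contain only the fields that apply to that problem, and the toy predictions and targets are made up. The last lines check that the Avg column for GPT 5.5 matches the unweighted mean of its eight sub-field scores.

```python
from typing import Any, Dict, List, Tuple


def score(preds: List[Dict[str, Any]],
          targets: List[Dict[str, Any]]) -> Tuple[float, Dict[str, float]]:
    """Return (strict exact match %, field-wise accuracy % per field)."""
    exact = 0
    hits: Dict[str, int] = {}
    counts: Dict[str, int] = {}
    for pred, target in zip(preds, targets):
        all_ok = True
        # Only fields present in the target are applicable to this problem.
        for name, truth in target.items():
            ok = pred.get(name) == truth
            counts[name] = counts.get(name, 0) + 1
            hits[name] = hits.get(name, 0) + int(ok)
            all_ok = all_ok and ok
        exact += int(all_ok)
    overall = 100 * exact / len(targets)
    fieldwise = {k: 100 * hits[k] / counts[k] for k in counts}
    return overall, fieldwise


# Toy data: the second problem is the only one where SO applies.
targets = [{"TO": "robot", "CC": ["Ah", "Kd", "2c"]},
           {"TO": "opponent", "SO": "robot_wins"}]
preds = [{"TO": "robot", "CC": ["Ah", "Kd", "2c"]},
         {"TO": "opponent", "SO": "opponent_wins"}]
overall, fieldwise = score(preds, targets)
print(overall, fieldwise)

# Sub-field scores for GPT 5.5, copied from the table (LS through SO).
gpt55 = [72.2, 80.6, 100.0, 61.5, 45.8, 62.5, 35.4, 76.2]
avg = sum(gpt55) / len(gpt55)
print(f"{avg:.1f}")
```

In the toy data SO scores 0% while Overall is 50%, because the first problem has no SO field and is matched exactly.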

Figure 3: Train-time validation-loss curves for the RDT fine-tuning data-scaling study. Each panel compares random initialization against pretrained RDT across 10%, 20%, 50%, and 100% data ratios, with shaded bands denoting one standard deviation over paired seeds.
Figure 4: Final validation loss for the RDT fine-tuning data-scaling probe. Random and pretrained initializations follow similar data-scaling trends. Error bars denote one standard deviation over three completed paired seeds.

Demo Categories & Release

DexHoldem demos are grouped by what they show: complete hand-level agent play, representative dataset camera triplets, and primitive-level policy benchmark rollouts. The embedded block below is the policy-demo preview; each category also has a dedicated page.

Policy Bench Rollouts

Compressed real-world demos with model, task, and outcome labels.

Source recordings are 4K 60fps MOV files. The hosted videos are 960x540 H.264 MP4 previews with original filenames preserved for traceability.
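
A preview with these properties could be produced with ffmpeg. The sketch below only builds the command line; it is an assumption about one workable conversion, not the pipeline actually used for the hosted videos, and the file and directory names are hypothetical. It uses ffmpeg's standard `-vf scale=W:H` filter and `-c:v libx264` encoder options and keeps the source filename stem for traceability.

```python
from pathlib import Path
from typing import List


def preview_command(src: str, out_dir: str) -> List[str]:
    """Build an ffmpeg command that writes a 960x540 H.264 MP4 preview."""
    src_path = Path(src)
    # Keep the original filename stem so previews trace back to recordings.
    dst = Path(out_dir) / (src_path.stem + ".mp4")
    return [
        "ffmpeg",
        "-i", str(src_path),
        "-vf", "scale=960:540",  # downscale the video stream
        "-c:v", "libx264",       # encode video as H.264
        str(dst),
    ]


cmd = preview_command("recordings/rollout_001.MOV", "previews")
print(" ".join(cmd))
# To run it (requires ffmpeg on PATH): subprocess.run(cmd, check=True)
```

If the sources are 3840x2160, 960x540 is an exact quarter-scale, so the aspect ratio is preserved.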


BibTeX

@misc{dexholdem2026,
  title = {DexHoldem: Playing Texas Hold'em with Dexterous Embodied System},
  author = {Chen, Feng and Chu, Tianzhe and Sun, Li and Zhou, Pei and Xu, Zhuxiu and Gao, Shenghua and Zhai, Yuexiang and Yang, Yanchao and Ma, Yi},
  year = {2026},
  url = {https://dexholdem.github.io/Dexholdem/}
}

Author Contributions

Contributions are summarized for the project website. For a compact list of every author with profile links, see the dedicated author page.


Feng Chen

Co-proposed and led the project; designed the data-collection infrastructure; maintained the hardware; trained DP, RDT, and ACT; contributed to embodied-agent and perception-benchmark design; collected data; and built the project website.

Tianzhe Chu

Co-proposed the project; designed the data-collection infrastructure; led the embodied-agent and perception-benchmark design; and performed teleoperation.

Li Sun

Co-proposed the project; designed the data-collection infrastructure; trained Octo; and performed teleoperation.

Pei Zhou

Trained the pi-series and BAKU models; deployed and evaluated policy models and embodied agents; and performed teleoperation.

Zhuxiu Xu

Designed the simulation component; deployed and evaluated embodied agents; and collected data.