DexHoldem: Playing Texas Hold'em with Dexterous Embodied System

A Texas Hold'em-style tabletop benchmark that couples semantic grounding, sequential state tracking, and fine-grained multi-finger manipulation on a ShadowHand-UR10e platform.

Feng Chen*†, Tianzhe Chu*, Li Sun*, Pei Zhou*, Zhuxiu Xu, Shenghua Gao, Yuexiang Zhai, Yanchao Yang, Yi Ma

* Equal contribution. † Project leader.

• 1,470 teleoperated robot demonstrations
• 14 card and chip primitives
• 36 fixed agentic perception problems
• 3 system-level case-study rollouts

Abstract

From isolated dexterous skills to embodied-agent tabletop play.

DexHoldem teaser: real ShadowHand tabletop setup, embodied-agent system loop, physical policy results, and agentic perception results.

Current embodied-agent benchmarks often emphasize semantic grounding and planning while relying on simulation, coarse actions, or gripper-centric manipulation. Dexterous-manipulation benchmarks capture contact-rich control, but usually evaluate isolated motor skills without instruction-conditioned visual grounding or long-horizon state tracking.

We introduce DexHoldem, a real-world ShadowHand benchmark that uses Texas Hold'em-style tabletop interaction to couple semantic grounding, sequential state tracking, and fine-grained multi-finger manipulation. The benchmark contains 1,470 teleoperated demonstrations across 14 atomic card and chip primitives, a completed physical policy evaluation, and a 36-problem perception benchmark for parsing turn state, cards, chips, bets, robot activity, recovery conditions, and outcomes.

In the current benchmark, the strongest policy reaches 47.5% scene-preserving success and 61.2% task completion, while perception remains bottlenecked by complete state recovery: Opus 4.7 obtains the best strict problem-level score at 34.3%, and GPT 5.5 obtains the best field-wise average at 66.8%. Three full-loop case studies further expose how waits, recovery, human-help requests, and repeated primitive dispatches accumulate in real hand-level play.

Benchmark Scope

A benchmark suite with three coupled layers.

DexHoldem is not a benchmark of gambling strategy. It uses Hold'em-style tabletop interaction as a controlled real-world setting where semantic state, object layout, long-horizon progress tracking, and dexterous physical execution all matter.

DexHoldem system loop: capture tabletop observations, load and renew game-state memory, reason over the activity stage, and execute the selected dexterous primitive.

Primitive Suite

The policy benchmark isolates atomic dexterous execution from game-level decision making. Each policy is evaluated under the same 80-rollout physical primitive schedule and scored with a four-level rubric that separates task completion from scene preservation.

pick_up_left, pick_up_right, put_down_left, put_down_right, show_left, show_right, push_5, push_10, push_50, push_100, pull_5, pull_10, pull_50, pull_100
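For reference, the 14 primitive names group cleanly by object type. A small Python sketch (names taken verbatim from the list above; the grouping is ours, not the release's):

# The 14 released primitives, grouped by object type.
CARD_PRIMITIVES = (
    "pick_up_left", "pick_up_right",
    "put_down_left", "put_down_right",
    "show_left", "show_right",
)
CHIP_PRIMITIVES = tuple(
    f"{verb}_{denom}" for verb in ("push", "pull") for denom in (5, 10, 50, 100)
)
assert len(CARD_PRIMITIVES) + len(CHIP_PRIMITIVES) == 14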
Released benchmark components

• Robot Platform: ShadowHand on UR10e, real-world tabletop
• Observations: top-down, third-person, and wrist RGB-D views, plus arm and hand proprioception
• Policy Data: 100 train and 5 validation trajectories per primitive
• Perception Data: 36 labeled tabletop-state problems with deterministic semantic checks
• Evaluation: SPSR/TCR for completed policy rollouts; strict exact-match and field-wise perception accuracy; protocol specification for system rollouts

Embodied System

Capture, parse, route, execute, and recover.

DexHoldem composes an agent-view perception loop with dexterous-policy primitives. The agent captures a tabletop image, renews structured game-state memory, routes through deterministic workflow gates, and dispatches a robot primitive only when physical motion is required.
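A minimal sketch of this loop, with stub helpers standing in for the released perception and skills APIs (every name here is a hypothetical placeholder, not the actual interface):

from dataclasses import dataclass

# Hypothetical sketch of the capture -> renew-memory -> route -> dispatch
# loop described above; the stubs stand in for the released skills API.

@dataclass
class Decision:
    kind: str            # "wait" or "act"
    primitive: str = ""  # a policy primitive name when kind == "act"

def capture_image() -> str:
    return "agent_view.png"  # stub: one tabletop capture per loop step

def renew_memory(memory: dict, image: str) -> dict:
    memory["last_capture"] = image  # stub: parse the image into game state
    return memory

def route(memory: dict) -> Decision:
    # Stub deterministic gate: dispatch a primitive only on the robot's turn.
    if memory.get("my_turn"):
        return Decision("act", "push_50")
    return Decision("wait")

def agent_step(memory: dict) -> dict:
    memory = renew_memory(memory, capture_image())
    decision = route(memory)
    if decision.kind == "act":
        print(f"dispatch primitive: {decision.primitive}")
    return memory

agent_step({"my_turn": True})  # -> dispatch primitive: push_50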

Activity

  • Capturing
  • Perceiving
  • Routing
  • Acting
  • Verifying
  • Recovering
  • Requesting Human

Game State Memory

My Chips:        5: 0 | 10: 0 | 50: 0 | 100: 0
Opponent Chips:  5: 0 | 10: 0 | 50: 0 | 100: 0
My Bet:          5: 0 | 10: 0 | 50: 0 | 100: 0
Opponent Bet:    5: 0 | 10: 0 | 50: 0 | 100: 0
Community Poker: Flop: Unknown | Turn: Unknown | River: Unknown
My Poker:        Left: Unknown | Right: Unknown
Opponent Poker:  Left: Unknown | Right: Unknown
Status:          My Turn: True | Stable: True
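This memory maps naturally onto a small structured record. A sketch that mirrors the display above (the key names are illustrative, not the released schema):

# Illustrative encoding of the game-state memory shown above; key names
# mirror the display and are not taken from the released schema.
EMPTY_MEMORY = {
    "my_chips":       {5: 0, 10: 0, 50: 0, 100: 0},
    "opponent_chips": {5: 0, 10: 0, 50: 0, 100: 0},
    "my_bet":         {5: 0, 10: 0, 50: 0, 100: 0},
    "opponent_bet":   {5: 0, 10: 0, 50: 0, 100: 0},
    "community":      {"flop": None, "turn": None, "river": None},
    "my_cards":       {"left": None, "right": None},   # None == Unknown
    "opponent_cards": {"left": None, "right": None},
    "status":         {"my_turn": True, "stable": True},
}

def bet_total(bet: dict) -> int:
    """Total value of a bet by denomination, e.g. {50: 1, 10: 2} -> 70."""
    return sum(denom * count for denom, count in bet.items())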

Runtime Contract

  1. Capture: acquire a single agent-view tabletop image for the current loop step.
  2. Parse: recover loop stage, turn ownership, cards, bets, chip inventories, and showdown outcome.
  3. Route: gate the state through waiting, verification, continuation, recovery, or legal decision selection.
  4. Execute: translate a legal agent primitive into one or more policy primitives.
  5. Verify: check the tabletop state and retry individual failed atoms when recovery is safe.
  6. Continue: advance pending multi-atom translations without re-prompting the main agent.
DexHoldem Skills repository
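Steps 4 and 5 of the contract amount to an execute-verify-retry inner loop over policy atoms. A hedged sketch, where the retry budget and the callback signatures are assumptions rather than the released protocol:

import time

MAX_RETRIES = 2  # assumption: a small per-atom retry budget

def execute_with_recovery(atoms, run_atom, verify_atom):
    """Run policy atoms in order (Execute), check each one (Verify), and
    retry only failures the verifier marks as safely retryable; anything
    else escalates to the Requesting Human activity."""
    for atom in atoms:
        for _ in range(1 + MAX_RETRIES):
            run_atom(atom)
            ok, retryable = verify_atom(atom)  # (success, safe to retry)
            if ok:
                break
            if not retryable:
                return ("request_human", atom)
            time.sleep(1.0)  # settle before a safe retry
        else:
            return ("request_human", atom)  # retry budget exhausted
    return ("done", None)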

Agentic Perception Bench

State parsing before any action is allowed.

Each perception problem samples one real tabletop state from system-level deployment. The perceiver parses only the current agent-view capture, optionally uses predecessor states as context, and writes the fixed structured schema for loop stage, turn ownership, blinds, cards, chip inventories, current bets, outcome, and uncertainty.
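A sketch of the structured record such a perceiver might write; the field names mirror the prose above and are not the released schema:

# Illustrative perception record; field names follow the prose above,
# not the released schema.
def blank_record() -> dict:
    return {
        "loop_stage": None,        # e.g. waiting, acting, recovering
        "turn_ownership": None,    # robot or opponent to act
        "blinds": None,
        "cards": None,             # community and hole cards
        "chip_inventories": None,  # per-player, per-denomination counts
        "current_bets": None,
        "outcome": None,           # win / lose / undecided
        "uncertainty": None,       # fields the perceiver cannot ground
    }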

Turn-gate state: decide whether the robot should act or wait.
Recovery state: distinguish retryable failures from human-help cases.
Outcome state: parse community cards, hole cards, bets, and winner.

Turn Gate

Turn ownership, blind context, and wait-vs-act decisions.

Robot Progress

Acting, atom-idle, cached sequence continuation, and activity-stage tracking.

Recovery Safety

Retryable recovery, down states, and human intervention.

Table Decision

Cards, chips, bets, and turn state for downstream routing.

Held Card Read

Robot-held card identity using current visual evidence and predecessor context.

Outcome Judge

Win or lose judgment from visible cards, cached state, or opponent fold.
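For reference, the six problem families above can be enumerated directly; a sketch (names paraphrased from the headings, not identifiers from the release):

from enum import Enum

# The six perception problem families described above.
class ProblemFamily(Enum):
    TURN_GATE = "turn_gate"
    ROBOT_PROGRESS = "robot_progress"
    RECOVERY_SAFETY = "recovery_safety"
    TABLE_DECISION = "table_decision"
    HELD_CARD_READ = "held_card_read"
    OUTCOME_JUDGE = "outcome_judge"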

Results

Current policies and agents remain far from reliable.

Policy results are measured over 80 real-world primitive rollouts per model. Perception results use strict problem-level exact-match over 36 tabletop states, plus field-wise diagnostic accuracies. Full-system results are reported as three case-study trajectories with operational counters rather than as a hand-level leaderboard.
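The two perception scores are straightforward to compute; a sketch under assumed record formats (records are plain dicts, and a field counts as applicable when the gold record sets it):

# Sketch of strict problem-level exact match and field-wise accuracy;
# the record format is assumed, as in the earlier perception sketch.
def strict_accuracy(preds: list, golds: list) -> float:
    exact = sum(
        all(p.get(f) == v for f, v in g.items())
        for p, g in zip(preds, golds)
    )
    return exact / len(golds)

def field_accuracy(preds: list, golds: list, field: str) -> float:
    pairs = [(p, g) for p, g in zip(preds, golds) if field in g]
    hits = sum(p.get(field) == g[field] for p, g in pairs)
    return hits / len(pairs) if pairs else float("nan")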

Aggregate policy-model results over 80 real-world primitive-evaluation trials per policy. SP, DC, TF, and DF count the four rubric levels and sum to 80 per row; the rates satisfy SPSR = SP/80 and TCR = (SP + DC)/80.

Policy          Family            SP   DC   TF   DF   SPSR    TCR
π0.5            VLA               38   11   31    0   47.5%   61.2%
π0              VLA               38    8   33    1   47.5%   57.5%
RDT             Robot-pretrained  24   13   40    3   30.0%   46.2%
DP (DINO)       Task-trained      21    8   48    3   26.2%   36.2%
DP-Transformer  Task-trained      11    5   46   18   13.8%   20.0%
RDT-small       Task-trained      11    3   59    7   13.8%   17.5%
ACT             Task-trained       8    4   67    1   10.0%   15.0%
BAKU            Task-trained       5    5   67    3    6.2%   12.5%
DP-UNet         Task-trained       1    0   79    0    1.2%    1.2%
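The published rates follow directly from the rubric counts, which a quick check confirms for every row:

# The table's rates follow from the rubric counts: SPSR = SP/80 and
# TCR = (SP + DC)/80, reported to one decimal place.
def rates(sp: int, dc: int, tf: int, df: int, rollouts: int = 80):
    assert sp + dc + tf + df == rollouts  # the four levels partition trials
    return 100 * sp / rollouts, 100 * (sp + dc) / rollouts

print(rates(38, 11, 31, 0))  # pi0.5 row -> (47.5, 61.25)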
Agentic perception semantic accuracy (%). Overall is strict exact-match over the 36 problems; the remaining field columns are field-wise accuracies on applicable subsets, and Avg is their mean. The field columns correspond to the structured-schema fields described in the perception bench.

Harness       Perceiver            Overall   LS     TO     BI      CC     CB     RCI    OCI    SO     Avg
Codex         GPT 5.5              31.5      72.2   80.6   100.0   61.5   45.8   62.5   35.4   76.2   66.8
Codex         GPT 5.4              31.5      65.7   93.5   100.0   23.1   31.2   56.2   18.8   47.6   54.5
Codex         GPT 5.4 mini         25.9      56.5   94.4    99.1   33.3   14.6   29.2   18.8   47.6   49.2
Claude Code   Opus 4.7             34.3      43.5   93.5   100.0   43.6   31.2   37.5   43.8    0.0   49.1
Claude Code   Sonnet 4.6           25.0      46.3   88.0   100.0   23.1   10.4   29.2   22.9   14.3   41.8
Claude Code   Haiku 4.5            13.9      47.2   68.5    91.7   35.9   12.5   25.0   18.8    0.0   37.4
Gemini CLI    Gemini 3 Flash       20.4      63.9   77.8   100.0   28.2   18.8   29.2   22.9   71.4   51.5
Gemini CLI    Gemini 3.1 Flash L.  10.2      27.8   73.1    94.4   28.2   12.5   22.9   14.6    0.0   34.2
Representative real-world policy rollouts for view, show, push, and pull primitives.
Aggregate policy benchmark overview across pretrained and task-trained policy families.
Policy pretraining scale, model size, and real-world task completion.
RDT pretraining data-scaling diagnostics on DexHoldem trajectories.
RDT final validation loss under DexHoldem data-scaling ablations.

Demos & Release

Videos, runnable skills, and benchmark artifacts.

Policy-bench rollouts are released as compressed web videos with labels recovered from the original model-task-result filenames. The system-level release documents the GPT 5.5 + pi0 hand traces, state folders, wait branches, recovery dispatches, and primitive counters used in the paper's case studies.
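Since the rollout labels come from model-task-result filenames, recovering them is a one-line parse. A hypothetical sketch, assuming a hyphen-delimited pattern like pi0-push_50-success.mp4 (the actual released convention may differ):

from pathlib import Path

# Hypothetical parser for a "model-task-result" filename convention
# such as "pi0-push_50-success.mp4"; the released pattern may differ.
def parse_rollout_name(path: str) -> dict:
    model, task, result = Path(path).stem.split("-", 2)
    return {"model": model, "task": task, "result": result}

print(parse_rollout_name("pi0-push_50-success.mp4"))
# -> {'model': 'pi0', 'task': 'push_50', 'result': 'success'}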

Policy Bench Rollouts

Compressed real-world demos with model, task, and outcome labels.

Source recordings are 4K 60 fps MOV files. The hosted videos are 960×540 H.264 MP4 previews with the original filenames preserved for traceability.

Resources

Paper, dataset, skills, metadata, and citation.

BibTeX

@misc{dexholdem2026,
  title = {DexHoldem: Playing Texas Hold'em with Dexterous Embodied System},
  author = {Chen, Feng and Chu, Tianzhe and Sun, Li and Zhou, Pei and Xu, Zhuxiu and Gao, Shenghua and Zhai, Yuexiang and Yang, Yanchao and Ma, Yi},
  year = {2026},
  url = {https://dexholdem.github.io/Dexholdem/}
}

Author Contributions

Project roles across benchmark design, hardware, data, policies, agents, and guidance.

Contributions are summarized for the project website. For a compact list of every author with profile links, see the dedicated author page.

View all authors

Feng Chen

Co-proposed and led the project; designed the data-collection infrastructure; maintained the hardware; trained DP, RDT, and ACT; contributed to embodied-agent and perception-benchmark design; collected data; and built the project website.

Tianzhe Chu

Co-proposed the project; designed the data-collection infrastructure; led the embodied-agent and perception-benchmark design; and performed teleoperation.

Li Sun

Co-proposed the project; designed the data-collection infrastructure; trained Octo; and performed teleoperation.

Pei Zhou

Trained the pi-series and BAKU models; deployed and evaluated policy models and embodied agents; and performed teleoperation.

Zhuxiu Xu

Designed the simulation component; deployed and evaluated embodied agents; and collected data.