Benchmark Results

Policy and Perception Evaluation

Policy results are measured over 80 real-world primitive rollouts per model. Perception results use strict problem-level exact-match over 36 tabletop states, plus field-wise diagnostic accuracies.

[Figure: DexHoldem aggregate policy result overview]

Policy Benchmark

Aggregate physical policy results over 80 real-world trials per model.

Nine policies, spanning pretrained (VLA and robot-pretrained) and task-trained families, are evaluated under the same primitive schedule, observation interface, and physical scoring rubric. The table reports raw outcome counts (SP, DC, TF, DF) out of N = 80 trials and two aggregate rates: scene-preserving success rate (SPSR = SP / N) and task completion rate (TCR = (SP + DC) / N).

SP

Scene-preserving success: the primitive is completed and the tabletop remains usable.

DC

Disruptive completion: the goal is achieved but the scene is disturbed enough to prevent continuation.

TF

Task failure: the primitive is not completed, but the scene remains stable for retry.

DF

Disruptive failure: the primitive fails and the environment must be reset.

Aggregate policy-model results over 80 real-world primitive-evaluation trials per policy.
Policy          Family            Params  SP  DC  TF  DF   N   SPSR    TCR
π0.5            VLA               2.94B   38  11  31   0  80  47.5%  61.2%
π0              VLA               2.82B   38   8  33   1  80  47.5%  57.5%
RDT             Robot-pretrained  1.23B   24  13  40   3  80  30.0%  46.2%
DP (DINO)       Task-trained      128M    21   8  48   3  80  26.2%  36.2%
DP-Transformer  Task-trained      128M    11   5  46  18  80  13.8%  20.0%
RDT-small       Task-trained      165M    11   3  59   7  80  13.8%  17.5%
ACT             Task-trained      72.8M    8   4  67   1  80  10.0%  15.0%
BAKU            Task-trained      39.4M    5   5  67   3  80   6.2%  12.5%
DP-UNet         Task-trained      74.4M    1   0  79   0  80   1.2%   1.2%
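The two aggregate rates can be recomputed from the raw counts. The sketch below assumes SPSR = SP / N and TCR = (SP + DC) / N, which is consistent with every row of the table; the `Outcomes` class and its attribute names are illustrative, not part of the benchmark code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcomes:
    """Raw outcome counts for one policy (hypothetical container)."""
    sp: int  # scene-preserving success
    dc: int  # disruptive completion
    tf: int  # task failure
    df: int  # disruptive failure

    @property
    def n(self) -> int:
        # Total number of trials is the sum of the four outcome counts.
        return self.sp + self.dc + self.tf + self.df

    @property
    def spsr(self) -> float:
        # Assumed definition: only scene-preserving successes count.
        return self.sp / self.n

    @property
    def tcr(self) -> float:
        # Assumed definition: any completion counts, disruptive or not.
        return (self.sp + self.dc) / self.n


# Counts copied from the table rows for π0.5 and RDT.
pi05 = Outcomes(sp=38, dc=11, tf=31, df=0)
rdt = Outcomes(sp=24, dc=13, tf=40, df=3)

for name, o in [("π0.5", pi05), ("RDT", rdt)]:
    print(f"{name}: N={o.n} SPSR={o.spsr:.2%} TCR={o.tcr:.2%}")
```

Printing two decimals shows the unrounded values (for example 61.25% for π0.5), which the table reports to one decimal place.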
[Figure: DexHoldem physical policy benchmark success and failure examples]
Representative real-world policy rollouts for view, show, push, and pull primitives. Success is not just object motion: non-target cards and chips must remain usable for later tabletop interaction.

Policy Analysis

Pretraining scale, model size, and task completion.

The bubble chart maps each policy by pretraining data scale (x-axis), policy-only parameter count (bubble size), and physical task completion rate (y-axis). VLA-family models with large-scale robot pretraining occupy the upper right; task-trained baselines cluster toward the lower left.

[Figure: DexHoldem model scale and policy completion chart]
Model family, pretraining scale, policy-only parameter count, and physical task completion rate over the shared 80-trial primitive schedule.
Aggregate policy benchmark overview across pretrained and task-trained policy families.

RDT Scaling Diagnostics

Pretraining data scale and fine-tuning loss under ablation.

Two diagnostic views isolate how gripper-pretrained initialization interacts with DexHoldem-specific fine-tuning data. The left panel tracks physical task completion under data-ratio ablation; the right panel shows the corresponding final validation loss.

[Figure: RDT pretraining data scaling chart]
RDT pretraining data-scaling diagnostics on DexHoldem trajectories.
[Figure: RDT fine-tuning final validation loss across data ratios]
RDT final validation loss under DexHoldem data-scaling ablations.

Perception Benchmark

Agentic perception accuracy across frontier vision-language models.

Each perceiver is deployed inside its native coding-agent harness and evaluated on 36 real-deployment tabletop states. Overall is strict problem-level exact-match; field-wise columns report per-field accuracy on applicable subsets. The best field-wise average is 66.8% (GPT 5.5) while the best strict exact-match is 34.3% (Opus 4.7).
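The difference between the two metric types can be sketched on toy data. Everything below is hypothetical: the field names, card strings, and the convention that `None` marks a non-applicable field are assumptions for illustration, not the benchmark's actual schema.

```python
from typing import Optional

# Hypothetical toy states: field name -> value; None means "field not applicable".
State = dict[str, Optional[str]]

truth: list[State] = [
    {"turn": "robot", "community": "Ah Kd 7c", "showdown": None},
    {"turn": "opponent", "community": "2s 9h Jd", "showdown": "robot_wins"},
    {"turn": "robot", "community": "Qc Qd 3s", "showdown": None},
]
pred: list[State] = [
    {"turn": "robot", "community": "Ah Kd 7c", "showdown": None},
    {"turn": "opponent", "community": "2s 9h Jc", "showdown": "robot_wins"},
    {"turn": "opponent", "community": "Qc Qd 3s", "showdown": None},
]


def exact_match(truth: list[State], pred: list[State]) -> float:
    """Strict problem-level score: a problem counts only if every field matches."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def field_accuracy(truth: list[State], pred: list[State], field: str) -> float:
    """Accuracy on one field, restricted to problems where that field applies."""
    pairs = [(t[field], p[field]) for t, p in zip(truth, pred) if t[field] is not None]
    if not pairs:
        raise ValueError(f"field {field!r} applies to no problem")
    return sum(t == p for t, p in pairs) / len(pairs)


print("exact-match:", exact_match(truth, pred))
for field in ("turn", "community", "showdown"):
    print(field, field_accuracy(truth, pred, field))
```

In this toy set each field is right on at least two thirds of its applicable problems, yet only one of three problems is fully correct, which is the pattern the Overall and field-wise columns separate.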

Agentic perception semantic accuracy. Overall is strict exact-match over 36 problems; other columns are field-wise accuracies on applicable subsets.
Harness      Perceiver            Overall     LS     TO     BI     CC     CB    RCI    OCI     SO    Avg
Codex        GPT 5.5                 31.5   72.2   80.6  100.0   61.5   45.8   62.5   35.4   76.2   66.8
Codex        GPT 5.4                 31.5   65.7   93.5  100.0   23.1   31.2   56.2   18.8   47.6   54.5
Codex        GPT 5.4 mini            25.9   56.5   94.4   99.1   33.3   14.6   29.2   18.8   47.6   49.2
Claude Code  Opus 4.7                34.3   43.5   93.5  100.0   43.6   31.2   37.5   43.8    0.0   49.1
Claude Code  Sonnet 4.6              25.0   46.3   88.0  100.0   23.1   10.4   29.2   22.9   14.3   41.8
Claude Code  Haiku 4.5               13.9   47.2   68.5   91.7   35.9   12.5   25.0   18.8    0.0   37.4
Gemini CLI   Gemini 3 Flash          20.4   63.9   77.8  100.0   28.2   18.8   29.2   22.9   71.4   51.5
Gemini CLI   Gemini 3.1 Flash L.     10.2   27.8   73.1   94.4   28.2   12.5   22.9   14.6    0.0   34.2
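The source does not state how the Avg column is computed. The sketch below checks one plausible reading, an unweighted mean over the eight field-wise columns, against three rows of the table; the variable names are illustrative.

```python
from statistics import mean

FIELDS = ["LS", "TO", "BI", "CC", "CB", "RCI", "OCI", "SO"]

# Field-wise accuracies (%) copied from three table rows, in FIELDS order.
rows = {
    "GPT 5.5": [72.2, 80.6, 100.0, 61.5, 45.8, 62.5, 35.4, 76.2],
    "Opus 4.7": [43.5, 93.5, 100.0, 43.6, 31.2, 37.5, 43.8, 0.0],
    "Gemini 3 Flash": [63.9, 77.8, 100.0, 28.2, 18.8, 29.2, 22.9, 71.4],
}
reported_avg = {"GPT 5.5": 66.8, "Opus 4.7": 49.1, "Gemini 3 Flash": 51.5}

for name, values in rows.items():
    assert len(values) == len(FIELDS)
    macro = mean(values)  # assumed: each field weighted equally
    delta = abs(macro - reported_avg[name])
    print(f"{name}: recomputed={macro:.3f} reported={reported_avg[name]} delta={delta:.3f}")
```

For these rows the recomputed mean agrees with the reported Avg to within rounding of the displayed one-decimal inputs. Under this reading, a field with a small applicable subset such as SO carries the same weight as any other field.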

Perception Bottlenecks

Field-wise accuracy reveals where perceivers struggle most.

Blind / Turn

Blind information (BI) and turn ownership (TO) are near-ceiling for most models, since they rely on game-flow reasoning rather than fine-grained visual parsing.

Community Cards

Community card recognition (CC) ranges from 23% to 62%, reflecting difficulty in reading partially occluded or angled card faces from the agent view.

Chip Counting

Current bets (CB) and chip inventories (RCI, OCI) are the hardest fields, requiring the perceiver to locate, identify, and count physical chip stacks.

Showdown

Showdown outcome (SO) is binary but context-dependent: three perceivers (Opus 4.7, Haiku 4.5, and Gemini 3.1 Flash L.) score 0% because they never detect the opponent folding or the hand concluding.

Key Observations

What the results reveal about current dexterous embodied systems.

VLA Advantage

Large-scale robot-pretrained VLA policies (π0.5, π0) achieve the highest task completion rates (61.2% and 57.5%). Pretraining on diverse manipulation data transfers meaningfully to the ShadowHand domain despite the 30-DOF action-space gap.

Scene Preservation Gap

Even the best policy (π0.5) shows a gap of roughly 14 points between task completion (61.2%) and scene-preserving success (47.5%). Disruptive completions are common: in 11 of its 49 completed primitives, the target object moves correctly but surrounding cards or chips are displaced.
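The size of this gap follows directly from the disruptive-completion count. A minimal sketch, assuming TCR = (SP + DC) / N and SPSR = SP / N so that the gap equals DC / N; the helper names are hypothetical and the counts are copied from the policy table.

```python
N_TRIALS = 80  # trials per policy, from the policy table

# (SP, DC) counts for the three strongest policies.
completions = {"π0.5": (38, 11), "π0": (38, 8), "RDT": (24, 13)}


def gap_points(dc: int, n: int = N_TRIALS) -> float:
    """TCR minus SPSR in percentage points under the assumed definitions."""
    return 100.0 * dc / n


def disruptive_share(sp: int, dc: int) -> float:
    """Fraction of completed primitives that left the scene unusable."""
    return dc / (sp + dc)


for name, (sp, dc) in completions.items():
    print(
        f"{name}: gap={gap_points(dc):.2f} pts, "
        f"disruptive share of completions={disruptive_share(sp, dc):.1%}"
    )
```

Reading the gap as a share of completions rather than of all trials makes policies with different completion rates easier to compare: RDT completes fewer primitives than π0.5, but a larger fraction of its completions are disruptive.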

Perception Ceiling

No perceiver exceeds 34.3% strict exact-match (Opus 4.7). Even GPT 5.5, which has the best field-wise average (66.8%), reaches only 31.5% exact-match. The gap shows that errors across different fields compound: getting every field right simultaneously remains challenging.
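A rough way to see the compounding is to multiply the per-field accuracies of one perceiver. This is only a reference point under two strong assumptions that the source does not make: field errors are independent, and every field applies to every problem. The numbers are GPT 5.5's field-wise accuracies from the perception table.

```python
from math import prod

# GPT 5.5 field-wise accuracies (%) from the perception table.
gpt55_fields = {
    "LS": 72.2, "TO": 80.6, "BI": 100.0, "CC": 61.5,
    "CB": 45.8, "RCI": 62.5, "OCI": 35.4, "SO": 76.2,
}
reported_exact_match = 31.5  # GPT 5.5 Overall column (%)

# Probability that all fields are right at once, if errors were independent
# and all eight fields applied to every problem (both are assumptions).
independent_estimate = 100.0 * prod(v / 100.0 for v in gpt55_fields.values())

print(f"independence estimate: {independent_estimate:.1f}%")
print(f"reported exact-match:  {reported_exact_match:.1f}%")
```

The independence estimate comes out far below the reported 31.5%, which suggests that field errors tend to co-occur on the same hard states and that not every field applies to every problem. Either way, a moderate per-field error rate is enough to pull strict exact-match well below the field-wise average.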