DexHoldem Results

Policy Benchmark

Aggregate physical policy results over 80 real-world trials per model.

Nine policies spanning two model families are evaluated under the same primitive schedule, observation interface, and physical scoring rubric. The table reports raw outcome counts (SP, DC, TF, DF) and the two aggregate rates: scene-preserving success rate (SPSR) and task completion rate (TCR).

SP

Scene-preserving success: the primitive is completed and the tabletop remains usable.

DC

Disruptive completion: the goal is achieved but the scene is disturbed enough to prevent continuation.

TF

Task failure: the primitive is not completed, but the scene remains stable for retry.

DF

Disruptive failure: the primitive fails and the environment must be reset.

Aggregate policy-model results over 80 real-world primitive-evaluation trials per policy.
Policy	Family	Params	SP	DC	TF	DF	N	SPSR	TCR
π_0.5	VLA	2.94B	38	11	31	0	80	47.5%	61.2%
π₀	VLA	2.82B	38	8	33	1	80	47.5%	57.5%
RDT	Robot-pretrained	1.23B	24	13	40	3	80	30.0%	46.2%
DP (DINO)	Task-trained	128M	21	8	48	3	80	26.2%	36.2%
DP-Transformer	Task-trained	128M	11	5	46	18	80	13.8%	20.0%
RDT-small	Task-trained	165M	11	3	59	7	80	13.8%	17.5%
ACT	Task-trained	72.8M	8	4	67	1	80	10.0%	15.0%
BAKU	Task-trained	39.4M	5	5	67	3	80	6.2%	12.5%
DP-UNet	Task-trained	74.4M	1	0	79	0	80	1.2%	1.2%

DexHoldem physical policy benchmark success and failure examples — Representative real-world policy rollouts for view, show, push, and pull primitives. Success is not just object motion: non-target cards and chips must remain usable for later tabletop interaction.

Policy Analysis

Pretraining scale, model size, and task completion.

The bubble chart maps each policy by pretraining data scale (x-axis), policy-only parameter count (bubble size), and physical task completion rate (y-axis). VLA-family models with large-scale robot pretraining occupy the upper right; task-trained baselines cluster toward the lower left.

DexHoldem model scale and policy completion chart — Model family, pretraining scale, policy-only parameter count, and physical task completion rate over the shared 80-trial primitive schedule.

Aggregate policy benchmark overview across pretrained and task-trained policy families.

RDT Scaling Diagnostics

Pretraining data scale and fine-tuning loss under ablation.

Two diagnostic views isolate how gripper-pretrained initialization interacts with DexHoldem-specific fine-tuning data. The left panel tracks physical task completion under data-ratio ablation; the right panel shows the corresponding final validation loss.

RDT pretraining data scaling chart — RDT pretraining data-scaling diagnostics on DexHoldem trajectories.

RDT fine-tuning final validation loss across data ratios — RDT final validation loss under DexHoldem data-scaling ablations.

Perception Benchmark

Agentic perception accuracy across frontier vision-language models.

Each perceiver is deployed inside its native coding-agent harness and evaluated on 36 real-deployment tabletop states. Overall is strict problem-level exact-match; field-wise columns report per-field accuracy on applicable subsets. The best field-wise average is 66.8% (GPT 5.5) while the best strict exact-match is 34.3% (Opus 4.7).

Agentic perception semantic accuracy. Overall is strict exact-match over 36 problems; other columns are field-wise accuracies on applicable subsets.
Harness	Perceiver	Overall	LS	TO	BI	CC	CB	RCI	OCI	SO	Avg
Codex	GPT 5.5	31.5	72.2	80.6	100.0	61.5	45.8	62.5	35.4	76.2	66.8
Codex	GPT 5.4	31.5	65.7	93.5	100.0	23.1	31.2	56.2	18.8	47.6	54.5
Codex	GPT 5.4 mini	25.9	56.5	94.4	99.1	33.3	14.6	29.2	18.8	47.6	49.2
Claude Code	Opus 4.7	34.3	43.5	93.5	100.0	43.6	31.2	37.5	43.8	0.0	49.1
Claude Code	Sonnet 4.6	25.0	46.3	88.0	100.0	23.1	10.4	29.2	22.9	14.3	41.8
Claude Code	Haiku 4.5	13.9	47.2	68.5	91.7	35.9	12.5	25.0	18.8	0.0	37.4
Gemini CLI	Gemini 3 Flash	20.4	63.9	77.8	100.0	28.2	18.8	29.2	22.9	71.4	51.5
Gemini CLI	Gemini 3.1 Flash L.	10.2	27.8	73.1	94.4	28.2	12.5	22.9	14.6	0.0	34.2

Perception Bottlenecks

Field-wise accuracy reveals where perceivers struggle most.

Blind / Turn

Blind information (BI) and turn ownership (TO) are near-ceiling for most models, since they rely on game-flow reasoning rather than fine-grained visual parsing.

Community Cards

Community card recognition (CC) ranges from 23% to 62%, reflecting difficulty in reading partially occluded or angled card faces from the agent view.

Chip Counting

Current bets (CB) and chip inventories (RCI, OCI) are the hardest fields, requiring the perceiver to locate, identify, and count physical chip stacks.

Showdown

Showdown outcome (SO) is binary but context-dependent: some models score 0% because they never detect the opponent folding or the hand concluding.

Key Observations

What the results reveal about current dexterous embodied systems.

VLA Advantage

Large-scale robot-pretrained VLA policies (pi0.5, pi0) achieve the highest task completion rates. Pretraining on diverse manipulation data transfers meaningfully to the ShadowHand domain despite the 30-DOF action-space gap.

Scene Preservation Gap

Even the best policy shows a 14-point gap between task completion (61.2%) and scene-preserving success (47.5%). Disruptive completions are common: the target object moves correctly but surrounding cards or chips are displaced.

Perception Ceiling

No perceiver exceeds 34.3% strict exact-match. The gap between field-wise average (66.8%) and exact-match shows that errors across different fields compound: getting every field right simultaneously remains challenging.

Back To Main Results Policy Details Perception Details System Evaluation

Policy and Perception Evaluation