DexHoldem Policy Implementation

14 Primitives

Card pickup, card placement, card reveal, chip push, and chip pull tasks are specified as language-instructed atomic skills.

1,470 Demos

Each primitive has 105 accepted teleoperated demonstrations, with a fixed 100 train / 5 validation split.

30-D Actions

Policies output joint-position targets for the 6-DOF UR10e arm and 24-DOF Shadow Dexterous Hand.

80 Rollouts

Physical evaluation uses 80 primitive-level trials per policy, grouped into pickup, chip push, chip pull, and put-down/show.

Benchmark Interface

The shared interface maps three RGB-D views and proprioception to joint-position action chunks.

The policy bench removes game-level decision making from the low-level model. At each rollout step, the policy receives top-down, third-person, and wrist-mounted RealSense RGB-D observations, the current arm-hand joint state, and a task condition. It returns a short-horizon sequence of 30-dimensional joint-position targets in the canonical robot order.

Observation

Top-down, third-person, and wrist RGB-D cameras plus normalized joint-position proprioception.

Condition

Pretrained policies use natural-language task text; task-specific baselines use discrete instruction IDs.

Prediction

The canonical loader exposes RGB/depth inputs, optional precomputed visual features, instruction IDs, and 30-D action targets.

Deployment

A ZeroMQ policy server on port 13579 returns executable joint targets to the robot-side client.

Primitive Suite

Policy tasks use the exact primitive IDs and instructions from the paper.

Directional labels are interpreted in the robot-facing tabletop frame: push moves chips away from the robot into the forward betting region, and pull moves chips back toward the robot-side region.

Primitive-level task definitions.
ID	Primitive	Policy instruction
0	`pick_up_left`	Pick up the card on the left side.
1	`pick_up_right`	Pick up the card on the right side.
2	`push_5`	Push forward the chips worth 5.
3	`push_10`	Push forward the chips worth 10.
4	`push_50`	Push forward the chips worth 50.
5	`push_100`	Push forward the chips worth 100.
6	`pull_5`	Pull back the chips worth 5.
7	`pull_10`	Pull back the chips worth 10.
8	`pull_50`	Pull back the chips worth 50.
9	`pull_100`	Pull back the chips worth 100.
10	`put_down_left`	Place the held card onto the left position.
11	`put_down_right`	Place the held card onto the right position.
12	`show_left`	Reveal the face of the left card.
13	`show_right`	Reveal the face of the right card.

Policy Families

The main comparison contains two model families under one physical trial protocol.

The reported benchmark focuses on the policy families and variants included in the main physical comparison. Experimental adapters that are not part of that comparison are kept separate from the benchmarked-policy summary.

01

Pretrained Robot and VLA Policies

pi0.5, pi0, RDT, and RDT-small are adapted to the ShadowHand-UR interface and conditioned on natural-language task text.

02

Task-Specific Imitation Baselines

DP (DINO), DP-Transformer, DP-UNet, ACT, and BAKU are trained on the DexHoldem demonstrations with the same action space, primitive schedule, and scoring rubric.

03

Shared Scoring Contract

All models are evaluated by SP, DC, TF, and DF outcome labels, then reported as scene-preserving success rate and task completion rate.

Policy implementations used in the main physical comparison.
Policy	Conditioning	Implementation summary
pi0.5	Natural-language prompt	OpenPI bridge maps the three camera streams and robot state to pi0.5; the DexHoldem-side output uses absolute joint-action targets.
pi0	Natural-language prompt	OpenPI bridge with the same camera and prompt mapping as pi0.5; the default action convention is delta joint motion before conversion.
RDT	Cached T5 language tokens	RDT-1B-style diffusion Transformer adapted from a gripper interface to the 30-D ShadowHand-UR joint space with SigLIP visual patch tokens.
RDT-small	Cached T5 language tokens	Reduced-capacity RDT variant using the same observation adapter and deployment path, initialized randomly and trained from scratch.
DP (DINO)	Instruction ID	High-capacity diffusion-policy baseline using frozen DINOv2 visual features and a Transformer denoiser for 30-D action chunks.
DP-Transformer	Instruction ID	Diffusion-policy Transformer baseline trained from scratch under the same instruction-ID-conditioned objective.
DP-UNet	Instruction ID	Lightweight diffusion-policy baseline with trainable ResNet visual encoders and a 1D UNet denoiser.
ACT	Instruction ID	CVAE Transformer policy that decodes deterministic 30-D action chunks from observation and instruction tokens at inference time.
BAKU	Instruction ID	Deterministic action-token Transformer adapted to the canonical batch format and 30-D robot command space.

Training and Runtime

The implementation keeps model-specific adapters behind a common robot-facing contract.

Implementation stages exposed by DexHoldem Policy.
Stage	Implementation	Purpose
Organize	Raw primitive folders are converted to per-episode `.npy` arrays with 5 held-out validation trajectories per primitive.	Build the fixed 100/5 train-validation split used by every benchmarked policy.
Load	The loader exposes RGB/depth observations, optional precomputed RGB features, normalized proprioception, instruction ID, and action target.	Keep task-trained policies and pretrained adapters on the same data and state representation.
Normalize	Numeric proprioception and action channels are normalized to `[-1, 1]` with training-set statistics saved in the checkpoint.	Reuse the same statistics at deployment before unnormalizing executable joint targets.
Serve	`deploy_policy.py` and OpenPI deployment scripts expose a ZeroMQ inference endpoint on port `13579`.	Keep GPU inference, checkpoint loading, and model-specific preprocessing off the robot-control process.
Execute	`robot_client.py` packages live cameras, robot joints, and the selected primitive, then executes returned action chunks.	Run physical rollouts under the fixed primitive schedule and shared robot command format.

Physical Scoring

DexHoldem separates nominal task completion from scene-preserving execution.

This distinction matters because a primitive can locally succeed while moving non-target cards or chips enough to block later poker actions.

SP

Scene-preserving success: the requested primitive is completed and the tabletop remains usable.

DC

Disruptive completion: the local goal is achieved, but the scene is disturbed enough to prevent normal continuation.

TF

Task failure: the primitive is not completed, but the scene remains stable enough for retry.

DF

Disruptive failure: the primitive fails and the environment must be reset before continuing.

Aggregate physical policy results over 80 real-world primitive trials per policy.
Policy	Params	SP	DC	TF	DF	N	SPSR	TCR
pi0.5	2.94B	38	11	31	0	80	47.5%	61.2%
pi0	2.82B	38	8	33	1	80	47.5%	57.5%
RDT	1.23B	24	13	40	3	80	30.0%	46.2%
DP (DINO)	128M	21	8	48	3	80	26.2%	36.2%
DP-Transformer	128M	11	5	46	18	80	13.8%	20.0%
RDT-small	165M	11	3	59	7	80	13.8%	17.5%
ACT	72.8M	8	4	67	1	80	10.0%	15.0%
BAKU	39.4M	5	5	67	3	80	6.2%	12.5%
DP-UNet	74.4M	1	0	79	0	80	1.2%	1.2%