Dexterous Hand Policy Bench

Experiment and Policy Implementation

DexHoldem isolates low-level dexterous execution from poker strategy. Every model is trained and evaluated as a single multi-task primitive policy under the same real-robot observation, action, rollout, and scoring interface.

Figure: policy-result bubble chart comparing model family, pretraining scale, model size, and task completion rate.

14 Primitives

Card pickup, card placement, card reveal, chip push, and chip pull tasks are specified as language-instructed atomic skills.

1,470 Demos

Each primitive has 105 accepted teleoperated demonstrations, with a fixed 100 train / 5 validation split.

30-D Actions

Policies output joint-position targets for the 6-DOF UR10e arm and 24-DOF Shadow Dexterous Hand.

80 Rollouts

Physical evaluation uses 80 primitive-level trials per policy, grouped into pickup, chip push, chip pull, and put-down/show.

Benchmark Interface

The shared interface maps three RGB-D views and proprioception to joint-position action chunks.

The policy bench removes game-level decision making from the low-level model. At each rollout step, the policy receives top-down, third-person, and wrist-mounted RealSense RGB-D observations, the current arm-hand joint state, and a task condition. It returns a short-horizon sequence of 30-dimensional joint-position targets in the canonical robot order.
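
As a point of reference, here is a minimal sketch of this per-step contract. The Observation fields, policy_step, and CHUNK_LEN are illustrative assumptions, not the repository's actual API.

```python
# Minimal sketch of the shared rollout-step contract.
# Names (Observation, policy_step, CHUNK_LEN) are illustrative assumptions.
from dataclasses import dataclass
import numpy as np

ACTION_DIM = 30  # 6-DOF UR10e arm + 24-DOF Shadow Dexterous Hand, canonical order
CHUNK_LEN = 16   # assumed short-horizon chunk length

@dataclass
class Observation:
    top_rgbd: np.ndarray     # top-down RealSense RGB-D frame
    third_rgbd: np.ndarray   # third-person RGB-D frame
    wrist_rgbd: np.ndarray   # wrist-mounted RGB-D frame
    joint_state: np.ndarray  # (ACTION_DIM,) normalized joint positions
    condition: "str | int"   # language text (pretrained) or instruction ID (baseline)

def policy_step(obs: Observation) -> np.ndarray:
    """Map one observation to a short-horizon chunk of joint-position targets."""
    # ... model-specific inference lives behind this shared signature ...
    return np.zeros((CHUNK_LEN, ACTION_DIM))  # placeholder output
```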

Observation

Top-down, third-person, and wrist RGB-D cameras plus normalized joint-position proprioception.

Condition

Pretrained policies use natural-language task text; task-specific baselines use discrete instruction IDs.

Prediction

The canonical loader exposes RGB/depth inputs, optional precomputed visual features, instruction IDs, and 30-D action targets.

Deployment

A ZeroMQ policy server on port 13579 returns executable joint targets to the robot-side client.
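
Reusing the names from the interface sketch above, the deployment split might look like the following. Only the ZeroMQ request/reply pattern and port 13579 come from the benchmark; the message schema and the build_observation helper are assumptions.

```python
# Hypothetical request/reply loop for the policy server; only the REP socket
# and port 13579 are taken from the benchmark description.
import zmq
import numpy as np

def serve(policy_step, build_observation):
    ctx = zmq.Context()
    sock = ctx.socket(zmq.REP)
    sock.bind("tcp://*:13579")  # robot-side client connects here
    while True:
        request = sock.recv_pyobj()         # cameras, joint state, primitive
        obs = build_observation(request)    # model-specific preprocessing (assumed)
        chunk = policy_step(obs)            # (CHUNK_LEN, 30) joint-position targets
        sock.send_pyobj(np.asarray(chunk))  # client executes the returned chunk
```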

Primitive Suite

Policy tasks use the exact primitive IDs and instructions from the paper.

Directional labels are interpreted in the robot-facing tabletop frame: push moves chips away from the robot into the forward betting region, and pull moves chips back toward the robot-side region.

Primitive-level task definitions.
ID   Primitive        Policy instruction
0    pick_up_left     Pick up the card on the left side.
1    pick_up_right    Pick up the card on the right side.
2    push_5           Push forward the chips worth 5.
3    push_10          Push forward the chips worth 10.
4    push_50          Push forward the chips worth 50.
5    push_100         Push forward the chips worth 100.
6    pull_5           Pull back the chips worth 5.
7    pull_10          Pull back the chips worth 10.
8    pull_50          Pull back the chips worth 50.
9    pull_100         Pull back the chips worth 100.
10   put_down_left    Place the held card onto the left position.
11   put_down_right   Place the held card onto the right position.
12   show_left        Reveal the face of the left card.
13   show_right       Reveal the face of the right card.
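
The two conditioning modes reduce to a few lines. The mapping below simply transcribes the table; the helper name is illustrative.

```python
# Instruction text per primitive ID, transcribed from the table above.
INSTRUCTIONS = {
    0: "Pick up the card on the left side.",
    1: "Pick up the card on the right side.",
    2: "Push forward the chips worth 5.",
    3: "Push forward the chips worth 10.",
    4: "Push forward the chips worth 50.",
    5: "Push forward the chips worth 100.",
    6: "Pull back the chips worth 5.",
    7: "Pull back the chips worth 10.",
    8: "Pull back the chips worth 50.",
    9: "Pull back the chips worth 100.",
    10: "Place the held card onto the left position.",
    11: "Place the held card onto the right position.",
    12: "Reveal the face of the left card.",
    13: "Reveal the face of the right card.",
}

def task_condition(primitive_id: int, uses_language: bool):
    """Pretrained policies receive the instruction text;
    task-specific baselines receive the raw instruction ID."""
    return INSTRUCTIONS[primitive_id] if uses_language else primitive_id
```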

Policy Families

The main comparison contains two model families under one physical trial protocol.

The benchmark reports the policy families and variants included in the main physical comparison; experimental adapters outside that comparison are kept separate from the benchmarked-policy summary.

01

Pretrained Robot and VLA Policies

pi0.5, pi0, RDT, and RDT-small are adapted to the ShadowHand-UR interface and conditioned on natural-language task text.

02

Task-Specific Imitation Baselines

DP (DINO), DP-Transformer, DP-UNet, ACT, and BAKU are trained on the DexHoldem demonstrations with the same action space, primitive schedule, and scoring rubric.

03

Shared Scoring Contract

Every rollout is labeled with one of four outcomes (SP, DC, TF, DF), then summarized as a scene-preserving success rate (SPSR) and a task completion rate (TCR).

Policy implementations used in the main physical comparison.
Policy          Conditioning               Implementation summary
pi0.5           Natural-language prompt    OpenPI bridge maps the three camera streams and robot state to pi0.5; the DexHoldem-side output uses absolute joint-position targets.
pi0             Natural-language prompt    OpenPI bridge with the same camera and prompt mapping as pi0.5; the default action convention is delta joint motion, converted to absolute targets before execution.
RDT             Cached T5 language tokens  RDT-1B-style diffusion Transformer adapted from a gripper interface to the 30-D ShadowHand-UR joint space, with SigLIP visual patch tokens.
RDT-small       Cached T5 language tokens  Reduced-capacity RDT variant with the same observation adapter and deployment path, randomly initialized and trained from scratch.
DP (DINO)       Instruction ID             High-capacity diffusion-policy baseline using frozen DINOv2 visual features and a Transformer denoiser for 30-D action chunks.
DP-Transformer  Instruction ID             Diffusion-policy Transformer baseline trained from scratch under the same instruction-ID-conditioned objective.
DP-UNet         Instruction ID             Lightweight diffusion-policy baseline with trainable ResNet visual encoders and a 1D UNet denoiser.
ACT             Instruction ID             CVAE Transformer policy that decodes deterministic 30-D action chunks from observation and instruction tokens at inference time.
BAKU            Instruction ID             Deterministic action-token Transformer adapted to the canonical batch format and 30-D robot command space.

Training and Runtime

The implementation keeps model-specific adapters behind a common robot-facing contract.
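
One way to picture that contract, with hypothetical names that are not the repository's actual modules:

```python
# Illustrative common contract; class and function names are assumptions.
from typing import Protocol
import numpy as np

class PolicyAdapter(Protocol):
    """What the robot-facing stack sees, regardless of model family."""
    def predict(self, obs) -> np.ndarray:
        """Return a (CHUNK_LEN, 30) joint-position chunk."""
        ...

def run_primitive(adapter: PolicyAdapter, obs) -> np.ndarray:
    # The serving and execution stages below only ever touch this surface;
    # prompt handling, tokenizers, and denoisers stay inside each adapter.
    return adapter.predict(obs)
```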

Implementation stages exposed by DexHoldem Policy.
Organize: Raw primitive folders are converted to per-episode .npy arrays, with 5 held-out validation trajectories per primitive. Purpose: build the fixed 100/5 train-validation split used by every benchmarked policy.
Load: The loader exposes RGB/depth observations, optional precomputed RGB features, normalized proprioception, the instruction ID, and the action target. Purpose: keep task-trained policies and pretrained adapters on the same data and state representation.
Normalize: Numeric proprioception and action channels are normalized to [-1, 1] with training-set statistics saved in the checkpoint. Purpose: reuse the same statistics at deployment before unnormalizing executable joint targets (see the sketch after this table).
Serve: deploy_policy.py and the OpenPI deployment scripts expose a ZeroMQ inference endpoint on port 13579. Purpose: keep GPU inference, checkpoint loading, and model-specific preprocessing off the robot-control process.
Execute: robot_client.py packages the live cameras, robot joints, and the selected primitive, then executes the returned action chunks. Purpose: run physical rollouts under the fixed primitive schedule and shared robot command format.
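
A minimal sketch of the normalization contract from the Normalize stage, with per-channel min/max arrays standing in for whatever form the checkpoint actually stores:

```python
import numpy as np

def normalize(x: np.ndarray, lo: np.ndarray, hi: np.ndarray) -> np.ndarray:
    """Map raw proprioception/action channels into [-1, 1] with train stats."""
    return 2.0 * (x - lo) / (hi - lo) - 1.0

def unnormalize(y: np.ndarray, lo: np.ndarray, hi: np.ndarray) -> np.ndarray:
    """Invert the mapping at deployment to recover executable joint targets."""
    return 0.5 * (y + 1.0) * (hi - lo) + lo
```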

Physical Scoring

DexHoldem separates nominal task completion from scene-preserving execution.

This distinction matters because a primitive can locally succeed while moving non-target cards or chips enough to block later poker actions.

SP

Scene-preserving success: the requested primitive is completed and the tabletop remains usable.

DC

Disruptive completion: the local goal is achieved, but the scene is disturbed enough to prevent normal continuation.

TF

Task failure: the primitive is not completed, but the scene remains stable enough for retry.

DF

Disruptive failure: the primitive fails and the environment must be reset before continuing.
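
The two reported rates follow directly from the four outcome counts; the sketch below reproduces the pi0.5 row of the table that follows.

```python
def rates(sp: int, dc: int, tf: int, df: int) -> tuple[float, float]:
    """Scene-preserving success rate and task completion rate from outcome counts."""
    n = sp + dc + tf + df  # 80 trials per policy in the benchmark
    spsr = sp / n          # only scene-preserving successes count
    tcr = (sp + dc) / n    # any completion counts, disruptive or not
    return spsr, tcr

# pi0.5 row: SP=38, DC=11, TF=31, DF=0 -> SPSR 47.5%, TCR 61.2%
assert rates(38, 11, 31, 0) == (0.475, 0.6125)
```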

Aggregate physical policy results over 80 real-world primitive trials per policy.
Policy          Params  SP  DC  TF  DF   N  SPSR   TCR
pi0.5            2.94B  38  11  31   0  80  47.5%  61.2%
pi0              2.82B  38   8  33   1  80  47.5%  57.5%
RDT              1.23B  24  13  40   3  80  30.0%  46.2%
DP (DINO)         128M  21   8  48   3  80  26.2%  36.2%
DP-Transformer    128M  11   5  46  18  80  13.8%  20.0%
RDT-small         165M  11   3  59   7  80  13.8%  17.5%
ACT              72.8M   8   4  67   1  80  10.0%  15.0%
BAKU             39.4M   5   5  67   3  80   6.2%  12.5%
DP-UNet          74.4M   1   0  79   0  80   1.2%   1.2%
Figure: representative physical rollouts from the policy benchmark, with success and failure examples. The examples show that success is not just object motion: non-target cards and chips must remain usable for later tabletop interaction.
Figure: model family, pretraining scale, policy-only parameter count, and physical task completion rate over the shared 80-trial primitive schedule.
Figure: RDT fine-tuning data-scaling probe, plotting final validation loss across data ratios. Gripper-pretrained initialization gives a modest lower-loss offset but does not create a strong low-data regime at 10% data.