Policy Bench

9 policies, 14 primitives, 80 real-world trials per policy.

The policy bench isolates low-level dexterous execution from poker strategy. Every policy—pretrained VLAs and from-scratch baselines alike—is trained and evaluated under the same observation, action, rollout, and scoring interface on the physical ShadowHand–UR10e setup. A four-level rubric separates nominal task completion from scene-preserving execution, because a primitive can locally succeed while displacing cards or chips enough to block later actions.

[Figure: Model family, pretraining scale, parameter count, and task completion rate.]

Primitives

14 language-instructed atomic skills spanning card and chip manipulation.

Directional labels are in the robot-facing tabletop frame: push moves chips away from the robot into the forward betting region; pull moves them back.

Card Pickup

pick_up_left, pick_up_right

Card Placement

put_down_left, put_down_right, show_left, show_right

Chip Push

push_5, push_10, push_50, push_100

Chip Pull

pull_5, pull_10, pull_50, pull_100
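
The four groups above can be enumerated programmatically. The following is a minimal sketch: the names `PRIMITIVE_GROUPS`, `PRIMITIVES`, and `INSTRUCTION_ID` are hypothetical, and the integer ID ordering is an assumption, since the page does not say how instruction IDs are assigned.

```python
# Hypothetical registry of the 14 primitives listed above.
# Group names follow the page; the ID ordering is an assumption.
CHIP_VALUES = (5, 10, 50, 100)

PRIMITIVE_GROUPS = {
    "card_pickup": ["pick_up_left", "pick_up_right"],
    "card_placement": ["put_down_left", "put_down_right", "show_left", "show_right"],
    "chip_push": [f"push_{v}" for v in CHIP_VALUES],
    "chip_pull": [f"pull_{v}" for v in CHIP_VALUES],
}

# Flatten in group order to get a stable list of primitive names.
PRIMITIVES = [name for names in PRIMITIVE_GROUPS.values() for name in names]

# Assumed discrete instruction IDs, one per primitive name.
INSTRUCTION_ID = {name: idx for idx, name in enumerate(PRIMITIVES)}

if __name__ == "__main__":
    print(len(PRIMITIVES), "primitives")
    print("push_50 ->", INSTRUCTION_ID["push_50"])
```

Keeping the names in one place makes it straightforward to check that the count stays at 14 and that no name is duplicated.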

Scoring Rubric

Four outcome levels: SP, DC, TF, DF.

SP — Scene-Preserving Success

Primitive completed and the tabletop remains usable for subsequent actions.

DC — Disruptive Completion

Goal achieved, but execution disturbs the scene enough to prevent normal continuation.

TF — Task Failure

Primitive not completed, but the scene remains stable enough for retry.

DF — Disruptive Failure

Primitive fails and the environment must be reset before continuing.
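
One way to read the four levels is as the combination of two yes/no judgments: whether the primitive was completed, and whether the scene remains usable. The sketch below encodes that reading. The `Outcome` enum and `classify` function are hypothetical names, and reducing the rubric to two booleans is an assumption about how a rater would apply it.

```python
from enum import Enum


class Outcome(Enum):
    SP = "Scene-Preserving Success"
    DC = "Disruptive Completion"
    TF = "Task Failure"
    DF = "Disruptive Failure"


def classify(completed: bool, scene_usable: bool) -> Outcome:
    """Map two rater judgments to one of the four rubric levels."""
    if completed:
        # Goal achieved: clean if the scene is still usable, disruptive otherwise.
        return Outcome.SP if scene_usable else Outcome.DC
    # Goal not achieved: retry is possible only if the scene is still usable.
    return Outcome.TF if scene_usable else Outcome.DF


if __name__ == "__main__":
    for completed in (True, False):
        for usable in (True, False):
            print(completed, usable, classify(completed, usable).name)
```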

SPSR (scene-preserving success rate) = SP / N counts only clean successes. TCR (task completion rate) = (SP + DC) / N counts any completion. N is the number of trials.
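
The two rates can be computed directly from the four outcome counts. This is a minimal sketch; the `Counts` class is a hypothetical helper, and the example counts are the π0.5 row of the results table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    sp: int  # scene-preserving successes
    dc: int  # disruptive completions
    tf: int  # task failures
    df: int  # disruptive failures

    @property
    def n(self) -> int:
        """Total number of trials."""
        return self.sp + self.dc + self.tf + self.df

    @property
    def spsr(self) -> float:
        """SPSR = SP / N."""
        return self.sp / self.n

    @property
    def tcr(self) -> float:
        """TCR = (SP + DC) / N."""
        return (self.sp + self.dc) / self.n


# Counts taken from the pi0.5 row of the results table.
pi05 = Counts(sp=38, dc=11, tf=31, df=0)

if __name__ == "__main__":
    print(f"N={pi05.n} SPSR={pi05.spsr:.4f} TCR={pi05.tcr:.4f}")
```

The gap between TCR and SPSR is DC / N, the share of trials in which the goal was reached at the cost of a disturbed scene.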

Results

Aggregate results over 80 physical trials per policy.

Policy bench results. Each policy is evaluated on the same 80 primitive-level real-world trials.
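| Policy | Family | Params | SP | DC | TF | DF | SPSR | TCR |
|---|---|---|---|---|---|---|---|---|
| π0.5 | Pretrained | 2.94B | 38 | 11 | 31 | 0 | 47.5% | 61.2% |
| π0 | Pretrained | 2.82B | 38 | 8 | 33 | 1 | 47.5% | 57.5% |
| RDT | Pretrained | 1.23B | 24 | 13 | 40 | 3 | 30.0% | 46.2% |
| DP (DINO) | From-scratch | 128M | 21 | 8 | 48 | 3 | 26.2% | 36.2% |
| DP-Transformer | From-scratch | 128M | 11 | 5 | 46 | 18 | 13.8% | 20.0% |
| RDT-small | From-scratch | 165M | 11 | 3 | 59 | 7 | 13.8% | 17.5% |
| ACT | From-scratch | 72.8M | 8 | 4 | 67 | 1 | 10.0% | 15.0% |
| BAKU | From-scratch | 39.4M | 5 | 5 | 67 | 3 | 6.2% | 12.5% |
| DP-UNet | From-scratch | 74.4M | 1 | 0 | 79 | 0 | 1.2% | 1.2% |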

Interface

Three RGB-D views and proprioception map to 30-D joint-position action chunks.

At each rollout step the policy receives top-down, third-person, and wrist-mounted RealSense RGB-D observations, the current 30-D arm+hand joint state, and a task condition (natural-language text for pretrained models, discrete instruction ID for from-scratch baselines). It returns a short-horizon sequence of joint-position targets.

1,470 Demos

105 teleoperated demonstrations per primitive, with a fixed 100-train / 5-validation split.
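
The dataset arithmetic and a reproducible split can be sketched as follows. The page states only that the split is fixed; using a seeded shuffle to fix it, and the name `split_indices`, are assumptions made for illustration.

```python
import random

N_PRIMITIVES = 14
DEMOS_PER_PRIMITIVE = 105
N_TRAIN, N_VAL = 100, 5


def split_indices(seed: int = 0) -> tuple[list[int], list[int]]:
    """Split one primitive's demo indices 100/5 with a seeded shuffle.

    The seeded shuffle is an assumed mechanism; the source only says
    the split is fixed.
    """
    idx = list(range(DEMOS_PER_PRIMITIVE))
    random.Random(seed).shuffle(idx)
    return sorted(idx[:N_TRAIN]), sorted(idx[N_TRAIN:])


train, val = split_indices()
total_demos = N_PRIMITIVES * DEMOS_PER_PRIMITIVE

if __name__ == "__main__":
    print(total_demos, len(train), len(val))
```

Calling `split_indices` twice with the same seed returns the same partition, which is the property a fixed split needs.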

30-D Actions

Joint-position targets for the 6-DOF UR10e arm and 24-DOF Shadow Dexterous Hand.
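
A 30-D target can be split into its arm and hand parts as sketched below. The ordering (arm joints first, then hand joints) and the function name `split_action` are assumptions; the page gives only the dimensions.

```python
ARM_DOF = 6    # UR10e arm
HAND_DOF = 24  # Shadow Dexterous Hand
ACTION_DIM = ARM_DOF + HAND_DOF


def split_action(action):
    """Split one 30-D joint-position target into (arm, hand) lists.

    Assumes the arm joints come first; the real layout may differ.
    """
    if len(action) != ACTION_DIM:
        raise ValueError(f"expected {ACTION_DIM} values, got {len(action)}")
    return list(action[:ARM_DOF]), list(action[ARM_DOF:])


if __name__ == "__main__":
    arm, hand = split_action([0.0] * ACTION_DIM)
    print(len(arm), len(hand))
```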

ZMQ Server

GPU inference runs behind a ZeroMQ endpoint; the robot client sends observations and executes returned chunks.
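
The client side of this loop can be sketched without the network layer. In the sketch below a plain function call stands in for the ZeroMQ request, and every name (`run_rollout`, `dummy_policy`, the chunk length of 8, the step budget) is hypothetical. Executing a whole chunk before querying again is also an assumption, not a description of the actual client.

```python
from typing import Callable, Sequence

Action = Sequence[float]  # one 30-D joint-position target
Chunk = Sequence[Action]  # short-horizon sequence of targets


def run_rollout(
    get_observation: Callable[[], dict],
    query_policy: Callable[[dict], Chunk],
    execute: Callable[[Action], None],
    max_steps: int,
) -> int:
    """Alternate between querying the policy and executing its chunk.

    Returns the number of joint-position targets executed.
    """
    executed = 0
    while executed < max_steps:
        chunk = query_policy(get_observation())
        if not chunk:
            break  # policy returned nothing; stop the rollout
        for action in chunk:
            if executed >= max_steps:
                break
            execute(action)
            executed += 1
    return executed


def dummy_policy(obs: dict) -> Chunk:
    """Stand-in for the remote policy: a chunk of 8 all-zero targets."""
    return [[0.0] * 30 for _ in range(8)]


log = []  # stand-in for the robot: record targets instead of moving
steps = run_rollout(lambda: {"instruction": "push_50"}, dummy_policy, log.append, max_steps=20)

if __name__ == "__main__":
    print(steps, len(log))
```

Passing the transport in as a callable keeps the loop testable offline; a real client would replace `dummy_policy` with a function that serializes the observation, sends it to the server, and decodes the returned chunk.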

80 Trials

Physical evaluation uses 80 primitive-level trials per policy, grouped into pickup, push, pull, and put-down/show.

Policies

Pretrained VLAs and from-scratch baselines under one protocol.

Policy implementations in the main physical comparison.
| Policy | Family | Conditioning | Implementation |
|---|---|---|---|
| π0.5 | Pretrained | Language | OpenPI bridge maps three camera streams and robot state; absolute joint-action targets. |
| π0 | Pretrained | Language | Same OpenPI bridge as π0.5; default delta joint motion before conversion. |
| RDT | Pretrained | T5 tokens | 1B-class diffusion Transformer adapted from gripper to 30-D joint space with SigLIP patches. |
| RDT-small | From-scratch | T5 tokens | Reduced-capacity RDT variant, randomly initialized and trained from scratch. |
| DP (DINO) | From-scratch | Instr. ID | Diffusion policy with frozen DINOv2 features and Transformer denoiser. |
| DP-Transformer | From-scratch | Instr. ID | Diffusion policy Transformer trained from scratch. |
| DP-UNet | From-scratch | Instr. ID | Lightweight diffusion policy with trainable ResNet encoders and 1D UNet denoiser. |
| ACT | From-scratch | Instr. ID | CVAE Transformer decoding deterministic action chunks at inference. |
| BAKU | From-scratch | Instr. ID | Deterministic action-token Transformer adapted to the 30-D command space. |

Figures

Scaling and physical rollout examples.

[Figure: Physical policy benchmark success and failure examples.]
Representative physical rollouts. Success is not just object motion: non-target cards and chips must remain usable for later tabletop interaction.

[Figure: Model scale and policy completion chart.]
Model family, pretraining scale, parameter count, and physical task completion rate.

[Figure: RDT fine-tuning final validation loss.]
RDT data-scaling probe: pretrained initialization gives a modest lower-loss offset but does not create a strong low-data advantage at 10% data.