System-level Evaluation

Three closed-loop hand-level case studies.

System evaluation composes GPT 5.5 (perceiver + main agent), deterministic routing, and the pi0 dexterous policy into full hand-level rollouts on the physical table. Unlike the perception bench—which isolates visual state parsing—the system evaluation closes the loop: every captured state feeds back into perception, routing, and physical execution. This page presents the three released case-study trajectories with per-state agent-view captures, predicted parsed states, and operational counters.

Figure: DexHoldem system-level trajectory state.

Protocol

Perceive, route, execute.

At each loop step the agent captures an image, parses it into structured game state, and routes through deterministic workflow gates. Physical motion is dispatched only when the scene is stable and an executable primitive is needed; otherwise the router handles waiting, verification, continuation of pending multi-atom sequences, and retryable recovery.

Figure: DexHoldem embodied-agent system loop. The agent loads structured game-state memory and routes to a dexterous policy only when the scene is stable and an executable primitive is needed.
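The gating described above can be sketched as a single deterministic function. This is a minimal illustration, not the released router: the `ParsedState` fields, the `Route` names, and the order in which the gates are checked are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    WAIT = "wait"          # scene unstable, robot moving, or opponent's turn
    RETRY = "retry"        # retryable recovery of a failed primitive
    VERIFY = "verify"      # check that the last primitive completed
    CONTINUE = "continue"  # advance a pending multi-atom sequence
    DISPATCH = "dispatch"  # hand a new primitive to the dexterous policy


@dataclass
class ParsedState:
    # Hypothetical fields; the page does not publish the parsed-state schema.
    scene_stable: bool
    robot_moving: bool
    our_turn: bool
    awaiting_verification: bool = False
    last_primitive_failed: bool = False
    pending_atoms: int = 0


def route(state: ParsedState) -> Route:
    """Pick the workflow branch for one captured state (assumed gate order)."""
    if not state.scene_stable or state.robot_moving:
        return Route.WAIT
    if state.last_primitive_failed:
        return Route.RETRY
    if state.awaiting_verification:
        return Route.VERIFY
    if state.pending_atoms > 0:
        return Route.CONTINUE
    if not state.our_turn:
        return Route.WAIT
    # Scene is stable and a new executable primitive is needed.
    return Route.DISPATCH
```

Because the function has no side effects and no model call, the same parsed state always yields the same branch, which is what makes the routing step deterministic in this sketch.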

Operational Counters

Operational counters.

These three trajectories are case studies, not a statistically powered estimate. Each row reports how many states the system captured, how many agent-level primitives and dexterous-policy primitives were dispatched, how many states went to waiting, human-help requests, or recovery, and which agent and dexterous-policy primitives occupied the longest run of consecutive states.

Case studies of per-trajectory operational counters under the system-level protocol.
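| Agent | Policy | Trajectory | States | AP | DPP | WA | HL | RC | LAP | LDP |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT 5.5 | pi0 | (i) | 22 | 8 | 7 | 7 | 2 | 1 | view_card(L) | pick_up_left |
| GPT 5.5 | pi0 | (ii) | 54 | 13 | 22 | 26 | 0 | 1 | collect_winnings | push_100 |
| GPT 5.5 | pi0 | (iii) | 23 | 8 | 10 | 7 | 0 | 1 | call | pick_up_left |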

States

Trajectory length. Hands folded at preflop are short; hands reaching the river or showdown are long.

AP / DPP

Counts of dispatched agent primitives (high-level decisions) and dispatched dexterous-policy primitives (physical motor commands). Together they reflect trajectory complexity.

WA

Wait-branch count. Elevated WA suggests over-sensitivity to scene motion or conservative completion/stage-transition judgment.

HL / RC

Human-help escalations and recovery dispatches. HL marks unrecoverable states; RC marks retryable primitive failures.

LAP

Agent primitive occupying the most consecutive states. Identifies agent-level bottlenecks such as stuck multi-atom sequences.

LDP

Dexterous-policy primitive occupying the most consecutive states. Pinpoints where physical execution is slow or repeatedly fails verification.
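The counter definitions above can be sketched as a tally over a logged trajectory. The `StateRecord` schema and the event names are hypothetical, and the toy trajectory is made up for illustration; it is not one of the released rollouts. A state may carry more than one event in this sketch, so the counts need not sum to the number of states.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import groupby
from typing import Iterable, List, Optional, Tuple


@dataclass
class StateRecord:
    # Hypothetical log record; the released trajectories may use another schema.
    events: Tuple[str, ...]         # any of "agent", "dex", "wait", "human", "recovery"
    agent_primitive: Optional[str]  # agent primitive active at this state, if any
    dex_primitive: Optional[str]    # dexterous-policy primitive active, if any


def longest_run(labels: Iterable[Optional[str]]) -> Optional[str]:
    """Label with the longest run of consecutive states (first wins on ties)."""
    best, best_len = None, 0
    for label, group in groupby(labels):
        run = sum(1 for _ in group)
        if label is not None and run > best_len:
            best, best_len = label, run
    return best


def counters(trajectory: List[StateRecord]) -> dict:
    """Compute the per-trajectory operational counters."""
    tally = Counter(event for s in trajectory for event in s.events)
    return {
        "States": len(trajectory),
        "AP": tally["agent"],
        "DPP": tally["dex"],
        "WA": tally["wait"],
        "HL": tally["human"],
        "RC": tally["recovery"],
        "LAP": longest_run(s.agent_primitive for s in trajectory),
        "LDP": longest_run(s.dex_primitive for s in trajectory),
    }


# Toy trajectory (illustrative only).
toy = [
    StateRecord(("agent",), "view_card", None),
    StateRecord(("dex",), "view_card", "pick_up_left"),
    StateRecord(("wait",), "view_card", "pick_up_left"),
    StateRecord(("recovery", "dex"), "view_card", "pick_up_left"),
    StateRecord(("agent", "dex"), "raise", "push_100"),
    StateRecord(("human",), None, None),
]
result = counters(toy)
```

On the toy data, `view_card` stays active for four consecutive states and `pick_up_left` for three, so they are reported as LAP and LDP respectively.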

Trajectory Previews

Trajectory (i) — 22 states. Hole cards, two human-help escalations, raise, check, check, call.
Figure: Agent view for trajectory (i).
Trajectory (ii) — 54 states. Full hand through all-in, showdown, and collect winnings.
Figure: Agent view for trajectory (ii).
Trajectory (iii) — 23 states. Raise, check, call, showdown with both cards revealed.
Figure: Agent view for trajectory (iii).

Label Legend

State-label grammar.

Agent Choice

view_card, raise, check, call, all_in, show_card, collect_winnings—the main agent selects a new primitive.

Wait

wait (scene), wait (acting), wait (turn)—the router pauses because the scene is unstable, the robot is still moving, or it is the opponent's turn.

Router Gate

cont., cache hole card, verify, complete, retry, end—deterministic routing advances or closes a pending primitive.

Human Help

request_human—the agent escalates when the scene cannot be resolved automatically (e.g. a misplaced card).
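The legend can be read as a small lookup from a state label to one of the four groups. The sketch below assumes labels are written exactly as listed in the legend, optionally followed by a parenthesized argument such as `view_card(L)` or `wait (turn)`; the exact label strings in the released trajectories may differ.

```python
# Label sets copied from the legend; the parsing rule is an assumption.
AGENT_CHOICES = {
    "view_card", "raise", "check", "call",
    "all_in", "show_card", "collect_winnings",
}
ROUTER_GATES = {"cont.", "cache hole card", "verify", "complete", "retry", "end"}


def classify(label: str) -> str:
    """Map a state label to its legend group."""
    # Drop any parenthesized argument: "view_card(L)" -> "view_card".
    head = label.split("(", 1)[0].strip()
    if head == "wait":
        return "Wait"
    if head == "request_human":
        return "Human Help"
    if head in AGENT_CHOICES:
        return "Agent Choice"
    if head in ROUTER_GATES:
        return "Router Gate"
    raise ValueError(f"unknown state label: {label!r}")
```

Raising on unknown labels, rather than returning a default group, keeps a mistyped or new label from being silently counted under the wrong heading.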