System Evaluation

Closed-loop hand-level rollouts.

This page follows the paper's Section 3.3, Section 4.5, and Appendix B.3. It is separate from agent evaluation: the agent benchmark isolates visual state parsing, while system evaluation composes GPT 5.5, deterministic routing, and the pi0 dexterous policy in real hand-level rollouts.

Boundary

Agent evaluation is a perception benchmark; system evaluation is a composed rollout.

Agent Evaluation

The agentic-perception benchmark scores whether a perceiver can recover structured table state from fixed tabletop states. It is isolated from poker-action selection and physical execution.

System Evaluation

The system-level study runs closed-loop Texas Hold'em-style hands. It counts captured states, wait branches, recovery dispatches, human-help requests, and repeated pi0 primitive dispatches across the full trace.

Section 3.3 Protocol

The system composes perception, routing, and dexterous-policy execution.

At each loop step, the agent captures an agent-view image, parses it into structured state, routes that state through deterministic workflow gates, and dispatches a dexterous-policy primitive whenever physical motion is required. The main agent is not invoked at every captured state: the router handles waiting, verification, completion, continuation of pending multi-atom translations, and retryable recovery.

DexHoldem embodied-agent system loop — Figure 2 protocol context: the agent loads structured game-state memory and routes to a dexterous policy only when the scene is stable and an executable primitive is needed.

Section 4.5 / Table 3

Three GPT 5.5 + pi0 traces are reported as operational counters.

The reported instantiation uses the Codex harness with GPT 5.5 as both perceiver and main agent, paired with the pi0 dexterous policy. These three released trajectories are case studies, not a statistically powered success-rate estimate.

Case studies of per-trajectory operational counters under the system-level protocol.
Agent	Policy	Trajectory	States	AP	DPP	WA	HL	RC	LAP	LDP
GPT 5.5	pi0	(i)	22	8	7	7	2	1	`view_card(L)`	`pick_up_left`
GPT 5.5	pi0	(ii)	54	13	22	26	0	1	`collect_winnings`	`push_100`
GPT 5.5	pi0	(iii)	23	8	10	7	0	1	`call`	`pick_up_left`

AP / DPP

AP counts dispatched agent primitives, including request_human. DPP counts dispatched dexterous-policy primitives from the physical policy suite.

WA / HL / RC

WA counts wait-branch events; HL counts human-help requests; RC counts recovery dispatches for retryable failures.

LAP / LDP

LAP and LDP identify the agent primitive and dexterous-policy primitive occupying the most consecutive states. They are longest-run counters, not final actions.

Section 4.5 Case Study

Trajectory (iii) shows the compounding closed-loop reliability gap.

The main text inspects trajectory (iii). Across 23 states, GPT 5.5 makes eight high-level decisions: it views both hole cards, raises 10 chips, checks twice, calls a 200-chip bet, and reveals both cards at showdown. About one third of the states are spent in wait branches around chip or card changes. The run has one router-level retry, never requests human help, and terminates after the second card reveal.

Trajectory (i)

A 22-state rollout: view_card on both hole cards, two request_human escalations when the scene fails to settle, then raise (10), check, check, and call. The final call is interrupted before chip-push completion.

Trajectory (ii)

A 54-state rollout: both hole cards are read, the bet escalates through raise (5), raise (105), two raise (100) actions, and all_in, then showdown and collect_winnings expand into chip-pull atoms.

Trajectory (iii)

A 23-state rollout: both hole cards are read, the betting sequence is raise (10), check, check, and call, and showdown finishes through show_card(L) and show_card(R) after one router-level retry.

Appendix B.3

Trace labels reveal where the router, main agent, and pi0 policy each act.

Appendix B.3 shows per-state agent-view captures for all three rollouts. The label grammar is part of the evaluation: it distinguishes main-agent choices, waits, continuation of already translated multi-atom primitives, verification, completion, recovery, and terminal states.

State-Label Grammar

Main-Agent ChoiceLabels such as view_card(L), raise (10), check, call, show_card(R), collect_winnings, and request_human.
Wait Branchwait (scene), wait (acting), and wait (turn) indicate scene instability, robot still acting, or not the robot's turn.
Continuationcont. labels advance a multi-atom translation already chosen earlier, such as another push_5 atom in a raise (105) sequence.
Router Gatecache hole card, verify, complete, retry, and end expose visual reads, verification, recovery, and terminal rollout states.

Takeaway

System evaluation exposes accumulation, not a single decision outcome.

The counters show that closed-loop execution is dominated by repeated waiting, verification, continuation, and occasional recovery. Even when the policy and perceiver solve parts of the benchmark in isolation, the composed rollout must keep maintaining state, selecting or continuing legal primitives, executing pi0 actions, and verifying the resulting table state across many captured states.

DexHoldem initial system trace state — `s0`: capture and parse before the first route decision.

DexHoldem post-dispatch system trace state — `s1`: table state after a dispatched physical primitive.

DexHoldem continued system trace state — `s2`: continued verification, wait, or recovery routing.

Back To System Overview Agent Evaluation Policy Details DexHoldem Skills Repository