Agent Evaluation
DexHoldem evaluates agent perception as a controlled state-parsing problem. The bench isolates visual state parsing from downstream routing, poker-action selection, and physical execution, then scores whether a perceiver recovers the structured game-state memory needed by the embodied-system router.
Problem Interface
The released bench contains 36 problems, p1-p36, drawn from representative states encountered during system-level deployment. For each problem, the perceiver receives one current agent-view capture and, when relevant, predecessor-state context consisting of earlier captures and pre-labeled structured game-state memory.
The sampled state's agent-view image is the only current observation parsed from raw pixels. Previous states serve as context rather than new visual targets.
Perceivers receive the same system visual guidelines and workflow guidelines used by the deployed agent, but the isolated bench does not require script execution.
The required output is a structured visual summary, not a caption. It records loop stage, turn state, cards, chip dictionaries, bets, showdown outcome, and uncertainty.
Structured State Schema
    {
      "loop_stage": "idle",
      "blind": "big_blind",
      "showdown_outcome": "not_showdown",
      "table": {
        "scene_stable": true,
        "is_my_turn": true,
        "community_cards": [],
        "my_chips": {"5": 4, "10": 3, "50": 3, "100": 3},
        "opponent_chips": {"5": 4, "10": 4, "50": 3, "100": 3},
        "my_current_bet": {"5": 0, "10": 0, "50": 0, "100": 0},
        "opponent_bet": {"5": 0, "10": 0, "50": 0, "100": 0},
        "uncertain_fields": []
      }
    }
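A perceiver output in this shape can be checked mechanically before it is scored. The following is a minimal sketch, not the released evaluator: the field names and loop-stage values are taken from this page, while `validate_state` and its error messages are hypothetical helpers introduced for illustration.

```python
# Hypothetical shape check for the structured state shown above.
# Field names come from the schema example; the function itself is an assumption.
LOOP_STAGES = {"acting", "atom_idle", "idle", "win", "lose", "to_recover", "down"}
CHIP_DENOMINATIONS = {"5", "10", "50", "100"}
CHIP_FIELDS = ("my_chips", "opponent_chips", "my_current_bet", "opponent_bet")


def validate_state(state):
    """Return a list of problems; an empty list means the shape looks valid."""
    problems = []
    if state.get("loop_stage") not in LOOP_STAGES:
        problems.append("loop_stage is not a known stage")
    table = state.get("table")
    if not isinstance(table, dict):
        return problems + ["table must be an object"]
    for field in ("scene_stable", "is_my_turn"):
        if not isinstance(table.get(field), bool):
            problems.append(f"table.{field} must be a boolean")
    for field in CHIP_FIELDS:
        chips = table.get(field)
        if not isinstance(chips, dict) or set(chips) != CHIP_DENOMINATIONS:
            problems.append(f"table.{field} must have keys 5, 10, 50, 100")
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if any(isinstance(n, bool) or not isinstance(n, int) or n < 0
               for n in chips.values()):
            problems.append(f"table.{field} counts must be non-negative integers")
    return problems


example_state = {
    "loop_stage": "idle",
    "blind": "big_blind",
    "showdown_outcome": "not_showdown",
    "table": {
        "scene_stable": True,
        "is_my_turn": True,
        "community_cards": [],
        "my_chips": {"5": 4, "10": 3, "50": 3, "100": 3},
        "opponent_chips": {"5": 4, "10": 4, "50": 3, "100": 3},
        "my_current_bet": {"5": 0, "10": 0, "50": 0, "100": 0},
        "opponent_bet": {"5": 0, "10": 0, "50": 0, "100": 0},
        "uncertain_fields": [],
    },
}
```

Calling `validate_state(example_state)` returns an empty list for the example above; a misspelled loop stage or a missing denomination key would each add one entry.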
Scoring Columns
The deterministic evaluator reports one strict Overall score plus eight sub-capability columns. Universal fields are scored on all 36 problems; chip, card, and outcome fields are scored only where they are routing-relevant.
| Column | Problems | Scored Content |
|---|---|---|
| Overall | 36 | Strict exact match over every applicable challenge for that problem. |
| LS | 36 | Loop stage: acting, atom_idle, idle, win, lose, to_recover, or down. |
| TO | 36 | Turn ownership, represented by whether the robot is allowed to act or should wait. |
| BI | 36 | Blind assignment for initial and legal-decision context. |
| CC | 13 | Visible community-card set, order-insensitive, when board cards are present. |
| CB | 16 | Exact denomination-level current-bet dictionaries for robot and opponent bets. |
| RCI | 16 | Exact robot-side chip inventory over 5, 10, 50, and 100 denominations. |
| OCI | 16 | Exact opponent-side chip inventory over 5, 10, 50, and 100 denominations. |
| SO | 7 | Showdown outcome, requiring win or lose from visible cards or a detected fold. |
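The column definitions above suggest a simple per-problem scoring rule. The sketch below is an illustration under stated assumptions rather than the released evaluator: the mapping from column codes to schema fields is inferred from the table, the `applicable` list stands in for whatever per-problem metadata the bench uses, and the card strings such as "Ah" are a hypothetical notation.

```python
# Hypothetical column checks; the column-to-field mapping is an assumption
# inferred from the table above, not the released scoring code.
COLUMN_CHECKS = {
    "LS": lambda p, g: p["loop_stage"] == g["loop_stage"],
    "TO": lambda p, g: p["table"]["is_my_turn"] == g["table"]["is_my_turn"],
    "BI": lambda p, g: p["blind"] == g["blind"],
    # Community cards are compared as sets, so order does not matter.
    "CC": lambda p, g: set(p["table"]["community_cards"])
    == set(g["table"]["community_cards"]),
    # Both bet dictionaries must match exactly, denomination by denomination.
    "CB": lambda p, g: all(
        p["table"][k] == g["table"][k] for k in ("my_current_bet", "opponent_bet")
    ),
    "RCI": lambda p, g: p["table"]["my_chips"] == g["table"]["my_chips"],
    "OCI": lambda p, g: p["table"]["opponent_chips"] == g["table"]["opponent_chips"],
    "SO": lambda p, g: p["showdown_outcome"] == g["showdown_outcome"],
}


def score_problem(pred, gold, applicable):
    """Score one problem on its applicable columns plus strict Overall."""
    columns = {col: COLUMN_CHECKS[col](pred, gold) for col in applicable}
    # Strict Overall: every applicable column must be correct.
    columns["Overall"] = all(columns.values())
    return columns


gold = {
    "loop_stage": "idle",
    "blind": "big_blind",
    "showdown_outcome": "not_showdown",
    "table": {
        "is_my_turn": True,
        "community_cards": ["Ah", "Kd", "7c"],
        "my_chips": {"5": 4, "10": 3, "50": 3, "100": 3},
        "opponent_chips": {"5": 4, "10": 4, "50": 3, "100": 3},
        "my_current_bet": {"5": 0, "10": 0, "50": 0, "100": 0},
        "opponent_bet": {"5": 0, "10": 1, "50": 0, "100": 0},
    },
}
# A prediction with reordered cards and one miscounted opponent chip.
pred = {
    **gold,
    "table": {
        **gold["table"],
        "community_cards": ["7c", "Ah", "Kd"],
        "opponent_chips": {"5": 4, "10": 3, "50": 3, "100": 3},
    },
}
result = score_problem(pred, gold, ["LS", "TO", "BI", "CC", "CB", "RCI", "OCI"])
```

Here the reordered community cards still count as correct for CC, but the single miscounted opponent chip fails OCI and therefore fails strict Overall, which is the composition effect the result tables measure.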
Core Challenge Types
Each problem carries one or more core challenges. For example, a state with the robot executing a primitive stresses loop-stage recognition, while a state where both players have revealed cards stresses showdown-outcome judgment.
Resolve whether the robot is allowed to start acting or must wait for the opponent.
Detect whether the robot is acting, settled between atoms, or ready for verification.
Use the current capture and predecessor context to identify a robot-held card.
Distinguish retryable harmless failures from states that require human intervention.
Recover cards, chip inventories, current bets, blind state, and turn ownership.
Determine win or lose from visible cards, cached state, or an opponent fold.
Embodied-Agent Context
In the full system, the embodied agent runs a capture -> perceive -> route -> execute workflow. Each loop iteration captures a single agent-camera frame, writes the structured game-state memory above, routes through deterministic gates, and dispatches a dexterous primitive only when physical motion is required.
The loop stages acting, atom_idle, idle, win, lose, to_recover, and down drive the waiting, continuation, recovery, stopping, and human-help branches.
Hard workflow constraints are deterministic: a fresh game routes first to view_card, and a pre-translated chip-bet sequence advances one robot atom at a time without re-prompting the main agent.
The main agent is invoked only when multiple branches are legal, such as the idle loop stage where a new poker action must be selected.
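The routing rules above can be sketched as a small deterministic function. This is an illustrative sketch only and not the contents of router.py: the gate names, the `fresh_game` and `pending_atoms` parameters, and the exact stage-to-branch mapping are assumptions chosen to match the prose, since the page does not spell out the mapping.

```python
import json

# Assumed mapping from loop stage to branch; the page lists the stages and
# the branch kinds but not which stage maps to which branch.
STAGE_TO_GATE = {
    "acting": "wait",
    "atom_idle": "continue",
    "to_recover": "recover",
    "win": "stop",
    "lose": "stop",
    "down": "human_help",
}


def route(state, fresh_game=False, pending_atoms=()):
    """Return the next gate as a JSON string (hypothetical gate names)."""
    stage = state["loop_stage"]
    if stage == "idle":
        # A fresh game routes first to view_card; otherwise several poker
        # actions are legal, so the main agent is asked to choose one.
        gate = "view_card" if fresh_game else "main_agent"
        return json.dumps({"gate": gate})
    if stage == "atom_idle" and pending_atoms:
        # Advance the pre-translated sequence by exactly one robot atom
        # without re-prompting the main agent.
        return json.dumps({"gate": "continue", "next_atom": pending_atoms[0]})
    return json.dumps({"gate": STAGE_TO_GATE[stage]})
```

For example, `route({"loop_stage": "atom_idle"}, pending_atoms=["push_50", "push_10"])` yields a continue gate carrying only the first pending atom (the atom names here are placeholders), so the remaining atoms stay queued for later loop iterations.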
| Component | Role | Class |
|---|---|---|
| SKILL.md | Workflow document: loop, action space, and routing rules. | Doc |
| visual_guidelines/ | Ten Markdown modules used during the perceive stage. | Doc |
| preflight.py | Backend validation and experiment-folder initialization. | Setup |
| capture.py | Single-frame capture from the agent camera. | Perception I/O |
| state.py | State-folder manager and parsed-state writer. | State |
| router.py | Rule-based per-state router that emits the next gate as JSON. | Routing |
| action_translator.py | Translates an agent primitive into a robot-primitive sequence. | Translation |
| executor.py | Dispatches robot commands and records execution progress. | Execution |
| text_to_sound.py | Plays audio cues for non-robot primitives. | Audio |
| remote_exec.py | Sends commands to the dexterous-hand control terminal. | Control |
| utils.py | Shared config, file I/O, and state helpers. | Helpers |
| Agent Primitive | Dexterous-Policy Primitive Sequence | Type |
|---|---|---|
| wait, fold, stop | State-machine sleep, recognized-by-scene fold, or route termination. | Control |
| reset_to_init | Reset hand to the home pose. | Reset |
| view_card(L/R) | pick_up_left/right -> perceive -> put_down_left/right. | View |
| show_card(L/R) | pick_up_left/right -> show_left/right. | Show |
| put_down_card(L/R, down/up) | put_down_left/right or show_left/right, depending on target face state. | Put-down |
| check | Audio cue: Check. | Audio |
| call | Push chips for delta = opponent_bet - my_bet. | Chip push |
| raise(amount A) | Push chips for delta = A - my_bet. | Chip push |
| all_in | Push primitives over the full robot-side chip stack. | Chip push |
| collect_winnings | Pull primitives across both bet zones. | Chip pull |
| request_human(reason) | Audio cue, then set loop_stage to down. | Help |
Chip-betting primitives are split by a min-count rule that prefers larger denominations and dispatches one push or pull primitive per chip in 100 -> 50 -> 10 -> 5 order, so a single failed primitive can be retried in isolation.
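The min-count split described above can be illustrated with a greedy pass over the denominations. This is a minimal sketch assuming the chip dictionaries use the string keys from the state schema; `bet_total` and `split_into_chip_atoms` are hypothetical names, and the real action_translator.py may differ.

```python
# Denominations in dispatch order, largest first.
DENOMINATIONS = (100, 50, 10, 5)


def bet_total(bet):
    """Total value of a denomination-level chip dictionary."""
    return sum(int(denom) * count for denom, count in bet.items())


def split_into_chip_atoms(delta, inventory):
    """Greedily split delta into one atom per chip, preferring large chips.

    Raises ValueError if the inventory cannot pay delta exactly.
    """
    if delta < 0:
        raise ValueError("delta must be non-negative")
    atoms = []
    remaining = delta
    for denom in DENOMINATIONS:
        available = inventory.get(str(denom), 0)
        count = min(remaining // denom, available)
        atoms.extend([denom] * count)  # one push primitive per chip
        remaining -= denom * count
    if remaining != 0:
        raise ValueError(f"cannot pay {delta} exactly with this inventory")
    return atoms


# A call: delta = opponent_bet - my_bet, as in the primitive table.
my_chips = {"5": 4, "10": 3, "50": 3, "100": 3}
my_bet = {"5": 0, "10": 1, "50": 0, "100": 0}        # 10 already in
opponent_bet = {"5": 1, "10": 2, "50": 1, "100": 0}  # 75 in total
call_delta = bet_total(opponent_bet) - bet_total(my_bet)
call_atoms = split_into_chip_atoms(call_delta, my_chips)
```

With these example values the call delta is 65, which splits into one 50, one 10, and one 5, dispatched in that order. Because each list entry corresponds to a single push, a failed push can be retried without replaying the chips that already landed.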
Evaluation Protocol
Codex, Claude Code, and Gemini CLI are each given the same current observation, allowed predecessor-state context, system visual guidelines, and workflow guidelines. Each row below averages three validation runs under the same medium thinking budget exposed by the corresponding harness.
| Harness | Perceiver | Overall | LS | TO | BI | CC | CB | RCI | OCI | SO | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Codex | GPT 5.5 | 31.5 | 72.2 | 80.6 | 100.0 | 61.5 | 45.8 | 62.5 | 35.4 | 76.2 | 66.8 |
| Codex | GPT 5.4 | 31.5 | 65.7 | 93.5 | 100.0 | 23.1 | 31.2 | 56.2 | 18.8 | 47.6 | 54.5 |
| Codex | GPT 5.4 mini | 25.9 | 56.5 | 94.4 | 99.1 | 33.3 | 14.6 | 29.2 | 18.8 | 47.6 | 49.2 |
| Claude Code | Opus 4.7 | 34.3 | 43.5 | 93.5 | 100.0 | 43.6 | 31.2 | 37.5 | 43.8 | 0.0 | 49.1 |
| Claude Code | Sonnet 4.6 | 25.0 | 46.3 | 88.0 | 100.0 | 23.1 | 10.4 | 29.2 | 22.9 | 14.3 | 41.8 |
| Claude Code | Haiku 4.5 | 13.9 | 47.2 | 68.5 | 91.7 | 35.9 | 12.5 | 25.0 | 18.8 | 0.0 | 37.4 |
| Gemini CLI | Gemini 3 Flash | 20.4 | 63.9 | 77.8 | 100.0 | 28.2 | 18.8 | 29.2 | 22.9 | 71.4 | 51.5 |
| Gemini CLI | Gemini 3.1 Flash L. | 10.2 | 27.8 | 73.1 | 94.4 | 28.2 | 12.5 | 22.9 | 14.6 | 0.0 | 34.2 |
Result Takeaways
Overall counts a problem correct only when every applicable structured field is correct. The best strict score is 34.3%, achieved by Opus 4.7.
GPT 5.5 has the best unweighted mean over the eight sub-capability columns at 66.8%, yet its strict Overall of 31.5% trails Opus 4.7, showing that strong isolated fields do not automatically compose into a fully correct state.
Current-bet dictionaries (CB) and opponent chip inventory (OCI) are the weakest routing-critical fields, peaking at 45.8% and 43.8% respectively, because they require exact denomination-level dictionaries under occlusion.
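As a worked check of the Avg column, the unweighted mean can be recomputed from the GPT 5.5 row of the results table. The snippet only restates numbers already reported above; the variable names are illustrative.

```python
# Sub-capability scores for the GPT 5.5 row, copied from the results table.
gpt_5_5 = {
    "LS": 72.2, "TO": 80.6, "BI": 100.0, "CC": 61.5,
    "CB": 45.8, "RCI": 62.5, "OCI": 35.4, "SO": 76.2,
}

# Unweighted mean: every column counts equally, regardless of how many
# problems it is scored on.
avg = sum(gpt_5_5.values()) / len(gpt_5_5)
print(round(avg, 1))  # 66.8
```

Because the mean weights all eight columns equally, a column scored on 7 problems (SO) moves Avg as much as one scored on all 36 (LS), which is one reason Avg and strict Overall can rank perceivers differently.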
Failure Modes
Blind information is nearly saturated, and turn ownership is often reliable. The harder cases are table-decision and outcome-judge states, where one wrong chip dictionary, community-card set, or showdown decision is enough to fail strict Overall. In a closed-loop run, missing a change in the opponent's bet can cause the system to keep routing into a wait branch even after the opponent has moved.