Agent Evaluation

Agentic Perception Bench

DexHoldem evaluates agent perception as a controlled state-parsing problem. The bench isolates visual state parsing from downstream routing, poker-action selection, and physical execution, then scores whether a perceiver recovers the structured game-state memory needed by the embodied-system router.

[Figure: DexHoldem agentic perception tabletop state]

Problem Interface

Each problem is one real tabletop state with predecessor context.

The released bench contains 36 problems, p1-p36, drawn from representative states encountered during system-level deployment. For each problem, the perceiver receives one current agent-view capture and, when relevant, predecessor-state context consisting of earlier captures and pre-labeled structured game-state memory.

Current Frame

The sampled state's agent-view image is the only current observation parsed from raw pixels. Previous states serve as context rather than new visual targets.

Shared Workflow

Perceivers receive the same system visual guidelines and workflow guidelines used by the deployed agent, but the isolated bench does not require script execution.

Fixed Output

The required output is a structured visual summary, not a caption. It records loop stage, turn state, cards, chip dictionaries, bets, showdown outcome, and uncertainty.

Structured State Schema

The target artifact matches the full-system game-state memory.

{
  "loop_stage": "idle",
  "blind": "big_blind",
  "showdown_outcome": "not_showdown",
  "table": {
    "scene_stable": true,
    "is_my_turn": true,
    "community_cards": [],
    "my_chips":       {"5": 4, "10": 3, "50": 3, "100": 3},
    "opponent_chips": {"5": 4, "10": 4, "50": 3, "100": 3},
    "my_current_bet": {"5": 0, "10": 0, "50": 0, "100": 0},
    "opponent_bet":   {"5": 0, "10": 0, "50": 0, "100": 0},
    "uncertain_fields": []
  }
}

Scoring Columns

Overall is strict exact match over the fields applicable to each state.

The deterministic evaluator reports one strict Overall score plus eight sub-capability columns. Universal fields are scored on all 36 problems; chip, card, and outcome fields are scored only where they are routing-relevant.

Per-column problem applicability for the 36-problem agentic perception benchmark.

| Column | Problems Scored | Content |
| --- | --- | --- |
| Overall | 36 | Strict exact match over every applicable challenge for that problem. |
| LS | 36 | Loop stage: acting, atom_idle, idle, win, lose, to_recover, or down. |
| TO | 36 | Turn ownership: whether the robot is allowed to act or should wait. |
| BI | 36 | Blind assignment for initial and legal-decision context. |
| CC | 13 | Visible community-card set, order-insensitive, when board cards are present. |
| CB | 16 | Exact denomination-level current-bet dictionaries for robot and opponent bets. |
| RCI | 16 | Exact robot-side chip inventory over 5, 10, 50, and 100 denominations. |
| OCI | 16 | Exact opponent-side chip inventory over 5, 10, 50, and 100 denominations. |
| SO | 7 | Showdown outcome, requiring win or lose from visible cards or a detected fold. |
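The strict-Overall rule and the order-insensitive CC rule can be sketched as a per-problem scorer. This is an illustrative reimplementation under the column definitions above, not the released evaluator; the function name and dictionary shapes are assumptions.

```python
# Hypothetical strict scorer: a problem counts as correct overall only when
# every applicable field matches; community cards compare as unordered sets.
def score_problem(pred: dict, gold: dict, applicable: set[str]) -> dict[str, bool]:
    per_field = {}
    for field in applicable:
        p, g = pred.get(field), gold[field]
        if field == "community_cards":           # CC: order-insensitive match
            per_field[field] = p is not None and sorted(p) == sorted(g)
        else:                                    # all other fields: exact match
            per_field[field] = p == g
    per_field["overall"] = all(per_field.values())
    return per_field
```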

Core Challenge Types

Different states stress different perception capabilities.

Each problem carries one or more core challenges. For example, a state with the robot executing a primitive stresses loop-stage recognition, while a state where both players have revealed cards stresses showdown-outcome judgment.

Turn Gate

Resolve whether the robot is allowed to start acting or must wait for the opponent.

Robot Progress

Detect whether the robot is acting, settled between atoms, or ready for verification.

Held-Card Read

Use the current capture and predecessor context to identify a robot-held card.

Recovery Safety

Distinguish retryable harmless failures from states that require human intervention.

Table Decision

Recover cards, chip inventories, current bets, blind state, and turn ownership.

Outcome Judge

Determine win or lose from visible cards, cached state, or an opponent fold.

Embodied-Agent Context

The perception bench measures the front end of the deployed agent loop.

In the full system, the embodied agent runs a capture -> perceive -> route -> execute workflow. Each loop iteration captures a single agent-camera frame, writes the structured game-state memory above, routes through deterministic gates, and dispatches a dexterous primitive only when physical motion is required.
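The loop can be sketched as follows. Function names (capture_frame, perceive, route, execute) are illustrative stand-ins for capture.py, state.py, router.py, and executor.py; the gate dictionary shape is an assumption.

```python
# Minimal sketch of the capture -> perceive -> route -> execute loop.
def run_loop(capture_frame, perceive, route, execute, max_iters=100):
    history = []                                  # predecessor-state context
    for _ in range(max_iters):
        frame = capture_frame()                   # single agent-camera frame
        state = perceive(frame, history)          # structured game-state memory
        history.append(state)
        gate = route(state)                       # deterministic routing gate
        if gate["kind"] == "stop":
            return history
        if gate["kind"] == "robot_primitive":     # physical motion required
            execute(gate["primitive"])
        # "wait"-style gates fall through to the next capture
    return history
```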

Seven Loop Stages

acting, atom_idle, idle, win, lose, to_recover, and down drive waiting, continuation, recovery, stopping, or human-help branches.

Rule-Based Router

Hard workflow constraints are deterministic: a fresh game routes first to view_card, and a pre-translated chip-bet sequence advances one robot atom at a time without re-prompting the main agent.
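The two hard constraints named above can be sketched as deterministic gates. This is a simplified illustration: the cards_viewed flag and queue handling are assumptions, and only a few of the seven loop stages are shown.

```python
import json

# Hypothetical rule-based router: pre-translated chip-bet atoms advance one at
# a time without re-prompting the main agent; a fresh game routes to view_card.
def route(state: dict, pending_atoms: list[str]) -> str:
    if pending_atoms:                                  # chip-bet sequence in flight
        nxt = pending_atoms.pop(0)
        return json.dumps({"gate": "robot_atom", "atom": nxt})
    if state["loop_stage"] == "idle":
        if not state["table"]["community_cards"] and not state.get("cards_viewed"):
            return json.dumps({"gate": "view_card"})   # fresh game: read hole cards first
        if state["table"]["is_my_turn"]:
            return json.dumps({"gate": "main_agent"})  # multiple legal branches
        return json.dumps({"gate": "wait"})
    if state["loop_stage"] == "to_recover":
        return json.dumps({"gate": "recover"})
    return json.dumps({"gate": "wait"})
```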

Main-Agent Calls

The main agent is invoked only when multiple branches are legal, such as the idle loop stage where a new poker action must be selected.

Sandbox components used by the embodied-agent runtime.

| Component | Role | Class |
| --- | --- | --- |
| SKILL.md | Workflow document: loop, action space, and routing rules. | Doc |
| visual_guidelines/ | Ten Markdown modules used during the perceive stage. | Doc |
| preflight.py | Backend validation and experiment-folder initialization. | Setup |
| capture.py | Single-frame capture from the agent camera. | Perception I/O |
| state.py | State-folder manager and parsed-state writer. | State |
| router.py | Rule-based per-state router that emits the next gate as JSON. | Routing |
| action_translator.py | Translates an agent primitive into a robot-primitive sequence. | Translation |
| executor.py | Dispatches robot commands and records execution progress. | Execution |
| text_to_sound.py | Plays audio cues for non-robot primitives. | Audio |
| remote_exec.py | Sends commands to the dexterous-hand control terminal. | Control |
| utils.py | Shared config, file I/O, and state helpers. | Helpers |
Agent-primitive to dexterous-policy-primitive mapping used by the runtime.

| Agent Primitive | Dexterous-Policy Primitive Sequence | Type |
| --- | --- | --- |
| wait, fold, stop | State-machine sleep, recognized-by-scene fold, or route termination. | Control |
| reset_to_init | Reset hand to the home pose. | Reset |
| view_card(L/R) | pick_up_left/right -> perceive -> put_down_left/right. | View |
| show_card(L/R) | pick_up_left/right -> show_left/right. | Show |
| put_down_card(L/R, down/up) | put_down_left/right or show_left/right, depending on target face state. | Put-down |
| check | Audio cue: Check. | Audio |
| call | Push chips for delta = opponent_bet - my_bet. | Chip push |
| raise(amount A) | Push chips for delta = A - my_bet. | Chip push |
| all_in | Push primitives over the full robot-side chip stack. | Chip push |
| collect_winnings | Pull primitives across both bet zones. | Chip pull |
| request_human(reason) | Audio cue, then set loop_stage to down. | Help |

Chip-betting primitives are split by a min-count rule that prefers larger denominations and dispatches one push or pull primitive per chip in 100 -> 50 -> 10 -> 5 order, so a single failed primitive can be retried in isolation.
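The min-count split can be sketched as a greedy decomposition. The function name and primitive strings are illustrative, assuming the delta is representable by the available inventory; the released translator's exact interface is not reproduced here.

```python
# Sketch of the min-count rule: prefer larger denominations and emit one push
# primitive per chip, in 100 -> 50 -> 10 -> 5 order, so each chip move is a
# separately retryable primitive.
def split_bet(delta: int, inventory: dict[str, int]) -> list[str]:
    primitives = []
    remaining = delta
    for denom in (100, 50, 10, 5):
        avail = inventory.get(str(denom), 0)
        count = min(remaining // denom, avail)    # greedy, capped by inventory
        primitives += [f"push_{denom}"] * count
        remaining -= count * denom
    if remaining:
        raise ValueError(f"cannot represent {delta} with this inventory")
    return primitives
```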

Evaluation Protocol

Each perceiver runs inside its native agent harness.

Codex, Claude Code, and Gemini CLI are each given the same current observation, allowed predecessor-state context, system visual guidelines, and workflow guidelines. Each row below averages three validation runs under the same medium thinking budget exposed by the corresponding harness.

Per-perceiver accuracy on the 36-problem agentic perception benchmark.

| Harness | Perceiver | Overall | LS | TO | BI | CC | CB | RCI | OCI | SO | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Codex | GPT 5.5 | 31.5 | 72.2 | 80.6 | 100.0 | 61.5 | 45.8 | 62.5 | 35.4 | 76.2 | 66.8 |
| Codex | GPT 5.4 | 31.5 | 65.7 | 93.5 | 100.0 | 23.1 | 31.2 | 56.2 | 18.8 | 47.6 | 54.5 |
| Codex | GPT 5.4 mini | 25.9 | 56.5 | 94.4 | 99.1 | 33.3 | 14.6 | 29.2 | 18.8 | 47.6 | 49.2 |
| Claude Code | Opus 4.7 | 34.3 | 43.5 | 93.5 | 100.0 | 43.6 | 31.2 | 37.5 | 43.8 | 0.0 | 49.1 |
| Claude Code | Sonnet 4.6 | 25.0 | 46.3 | 88.0 | 100.0 | 23.1 | 10.4 | 29.2 | 22.9 | 14.3 | 41.8 |
| Claude Code | Haiku 4.5 | 13.9 | 47.2 | 68.5 | 91.7 | 35.9 | 12.5 | 25.0 | 18.8 | 0.0 | 37.4 |
| Gemini CLI | Gemini 3 Flash | 20.4 | 63.9 | 77.8 | 100.0 | 28.2 | 18.8 | 29.2 | 22.9 | 71.4 | 51.5 |
| Gemini CLI | Gemini 3.1 Flash L. | 10.2 | 27.8 | 73.1 | 94.4 | 28.2 | 12.5 | 22.9 | 14.6 | 0.0 | 34.2 |

Result Takeaways

Current agent perceivers still fail to recover complete routing state reliably.

Strict Overall

Overall counts a problem correct only when every applicable structured field is correct. The best strict score is 34.3%, achieved by Opus 4.7 under Claude Code.

Field-Wise Average

GPT 5.5 has the best unweighted mean over the eight sub-capability columns at 66.8%, showing that strong isolated fields do not automatically compose.
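The quoted field-wise mean is just the unweighted average of the eight sub-capability columns; reproducing it for GPT 5.5 from the results table:

```python
# Eight sub-capability scores for GPT 5.5 from the accuracy table.
columns = {"LS": 72.2, "TO": 80.6, "BI": 100.0, "CC": 61.5,
           "CB": 45.8, "RCI": 62.5, "OCI": 35.4, "SO": 76.2}
avg = sum(columns.values()) / len(columns)
print(round(avg, 1))  # -> 66.8
```

Note that the strict Overall column (31.5 for GPT 5.5) is far below this mean, which is the composition gap the takeaway describes.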

Chip Bottleneck

Current bets (CB) and opponent chip inventory (OCI) are the weakest routing-critical fields, peaking at 45.8% and 43.8% respectively, because both require exact denomination-level chip dictionaries under occlusion.

Failure Modes

Coarse markers are easier than long-horizon chip and outcome state.

Blind information is nearly saturated, and turn ownership is often reliable. The harder cases are table-decision and outcome-judge states, where one wrong chip dictionary, community-card set, or showdown decision is enough to fail strict Overall. In a closed-loop run, missing a change in the opponent's bet can cause the system to keep routing into a wait branch even after the opponent has moved.

[Figure] Turn gate: parse turn ownership, blind context, and whether the route may begin.
[Figure] Recovery safety: distinguish retryable physical failures from human-help states.
[Figure] Outcome judge: combine visible cards, chip state, and terminal win/loss evidence.