Agent Evaluation

Agentic Perception Bench

DexHoldem evaluates agent perception as a controlled state-parsing problem. The bench isolates visual state parsing from downstream routing, poker-action selection, and physical execution, then scores whether a perceiver recovers the structured game-state memory needed by the embodied-system router.

[Figure: DexHoldem agentic perception tabletop state]

Problem Interface

Each problem is one real tabletop state with predecessor context.

The released bench contains 36 problems, p1-p36, drawn from representative states encountered during system-level deployment. For each problem, the perceiver receives one current agent-view capture and, when relevant, predecessor-state context consisting of earlier captures and pre-labeled structured game-state memory.

Current Frame

The sampled state's agent-view image is the only current observation parsed from raw pixels. Previous states serve as context rather than new visual targets.

Shared Workflow

Perceivers receive the same system visual guidelines and workflow guidelines used by the deployed agent, but the isolated bench does not require script execution.

Fixed Output

The required output is a structured visual summary, not a caption. It records loop stage, turn state, cards, chip dictionaries, bets, showdown outcome, and uncertainty.

Structured State Schema

The target artifact matches the full-system game-state memory.

{
  "loop_stage": "idle",
  "blind": "big_blind",
  "showdown_outcome": "not_showdown",
  "table": {
    "scene_stable": true,
    "is_my_turn": true,
    "community_cards": [],
    "my_chips":       {"5": 4, "10": 3, "50": 3, "100": 3},
    "opponent_chips": {"5": 4, "10": 4, "50": 3, "100": 3},
    "my_current_bet": {"5": 0, "10": 0, "50": 0, "100": 0},
    "opponent_bet":   {"5": 0, "10": 0, "50": 0, "100": 0},
    "uncertain_fields": []
  }
}

Scoring Columns

Overall is strict exact match over the fields applicable to each state.

The deterministic evaluator reports one strict Overall score plus eight sub-capability columns. Universal fields are scored on all 36 problems; chip, card, and outcome fields are scored only where they are routing-relevant.

Per-column problem applicability for the 36-problem agentic perception benchmark.

| Column | Problems Scored | Content |
| --- | --- | --- |
| Overall | 36 | Strict exact match over every applicable challenge for that problem. |
| LS | 36 | Loop stage: acting, atom_idle, idle, win, lose, to_recover, or down. |
| TO | 36 | Turn ownership: whether the robot is allowed to act or should wait. |
| BI | 36 | Blind assignment for initial and legal-decision context. |
| CC | 13 | Visible community-card set, order-insensitive, when board cards are present. |
| CB | 16 | Exact denomination-level current-bet dictionaries for robot and opponent bets. |
| RCI | 16 | Exact robot-side chip inventory over 5, 10, 50, and 100 denominations. |
| OCI | 16 | Exact opponent-side chip inventory over 5, 10, 50, and 100 denominations. |
| SO | 7 | Showdown outcome, requiring win or lose from visible cards or a detected fold. |
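The strict-Overall rule and the order-insensitive CC rule can be sketched as a per-problem scorer. This is an illustrative reimplementation under the column definitions above, not the released evaluator; the function name and dictionary shapes are assumptions.

```python
# Hypothetical strict scorer: a problem counts as correct overall only when
# every applicable field matches; community cards compare as unordered sets.
def score_problem(pred: dict, gold: dict, applicable: set[str]) -> dict[str, bool]:
    per_field = {}
    for field in applicable:
        p, g = pred.get(field), gold[field]
        if field == "community_cards":           # CC: order-insensitive match
            per_field[field] = p is not None and sorted(p) == sorted(g)
        else:                                    # all other fields: exact match
            per_field[field] = p == g
    per_field["overall"] = all(per_field.values())
    return per_field
```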

Core Challenge Types

Different states stress different perception capabilities.

Each problem carries one or more core challenges. For example, a state with the robot executing a primitive stresses loop-stage recognition, while a state where both players have revealed cards stresses showdown-outcome judgment.

Turn Gate

Resolve whether the robot is allowed to start acting or must wait for the opponent.

Robot Progress

Detect whether the robot is acting, settled between atoms, or ready for verification.

Held-Card Read

Use the current capture and predecessor context to identify a robot-held card.

Recovery Safety

Distinguish retryable harmless failures from states that require human intervention.

Table Decision

Recover cards, chip inventories, current bets, blind state, and turn ownership.

Outcome Judge

Determine win or lose from visible cards, cached state, or an opponent fold.

Embodied-Agent Context

The perception bench measures the front end of the deployed agent loop.

In the full system, the embodied agent runs a capture -> perceive -> route -> execute workflow. Each loop iteration captures a single agent-camera frame, writes the structured game-state memory above, routes through deterministic gates, and dispatches a dexterous primitive only when physical motion is required.
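The loop can be sketched as follows. Function names (capture_frame, perceive, route, execute) are illustrative stand-ins for capture.py, state.py, router.py, and executor.py; the gate dictionary shape is an assumption.

```python
# Minimal sketch of the capture -> perceive -> route -> execute loop.
def run_loop(capture_frame, perceive, route, execute, max_iters=100):
    history = []                                  # predecessor-state context
    for _ in range(max_iters):
        frame = capture_frame()                   # single agent-camera frame
        state = perceive(frame, history)          # structured game-state memory
        history.append(state)
        gate = route(state)                       # deterministic routing gate
        if gate["kind"] == "stop":
            return history
        if gate["kind"] == "robot_primitive":     # physical motion required
            execute(gate["primitive"])
        # "wait"-style gates fall through to the next capture
    return history
```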

Seven Loop Stages

acting, atom_idle, idle, win, lose, to_recover, and down drive waiting, continuation, recovery, stopping, or human-help branches.

Rule-Based Router

Hard workflow constraints are deterministic: a fresh game routes first to view_card, and a pre-translated chip-bet sequence advances one robot atom at a time without re-prompting the main agent.
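The two hard constraints named above can be sketched as deterministic gates. This is a simplified illustration: the cards_viewed flag and queue handling are assumptions, and only a few of the seven loop stages are shown.

```python
import json

# Hypothetical rule-based router: pre-translated chip-bet atoms advance one at
# a time without re-prompting the main agent; a fresh game routes to view_card.
def route(state: dict, pending_atoms: list[str]) -> str:
    if pending_atoms:                                  # chip-bet sequence in flight
        nxt = pending_atoms.pop(0)
        return json.dumps({"gate": "robot_atom", "atom": nxt})
    if state["loop_stage"] == "idle":
        if not state["table"]["community_cards"] and not state.get("cards_viewed"):
            return json.dumps({"gate": "view_card"})   # fresh game: read hole cards first
        if state["table"]["is_my_turn"]:
            return json.dumps({"gate": "main_agent"})  # multiple legal branches
        return json.dumps({"gate": "wait"})
    if state["loop_stage"] == "to_recover":
        return json.dumps({"gate": "recover"})
    return json.dumps({"gate": "wait"})
```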

Main-Agent Calls

The main agent is invoked only when multiple branches are legal, such as the idle loop stage where a new poker action must be selected.

Sandbox components used by the embodied-agent runtime.

| Component | Role | Class |
| --- | --- | --- |
| SKILL.md | Workflow document: loop, action space, and routing rules. | Doc |
| visual_guidelines/ | Ten Markdown modules used during the perceive stage. | Doc |
| preflight.py | Backend validation and experiment-folder initialization. | Setup |
| capture.py | Single-frame capture from the agent camera. | Perception I/O |
| state.py | State-folder manager and parsed-state writer. | State |
| router.py | Rule-based per-state router that emits the next gate as JSON. | Routing |
| action_translator.py | Translates an agent primitive into a robot-primitive sequence. | Translation |
| executor.py | Dispatches robot commands and records execution progress. | Execution |
| text_to_sound.py | Plays audio cues for non-robot primitives. | Audio |
| remote_exec.py | Sends commands to the dexterous-hand control terminal. | Control |
| utils.py | Shared config, file I/O, and state helpers. | Helpers |
Agent-primitive to dexterous-policy-primitive mapping used by the runtime.

| Agent Primitive | Dexterous-Policy Primitive Sequence | Type |
| --- | --- | --- |
| wait, fold, stop | State-machine sleep, recognized-by-scene fold, or route termination. | Control |
| reset_to_init | Reset hand to the home pose. | Reset |
| view_card(L/R) | pick_up_left/right -> perceive -> put_down_left/right. | View |
| show_card(L/R) | pick_up_left/right -> show_left/right. | Show |
| put_down_card(L/R, down/up) | put_down_left/right or show_left/right, depending on target face state. | Put-down |
| check | Audio cue: Check. | Audio |
| call | Push chips for delta = opponent_bet - my_bet. | Chip push |
| raise(amount A) | Push chips for delta = A - my_bet. | Chip push |
| all_in | Push primitives over the full robot-side chip stack. | Chip push |
| collect_winnings | Pull primitives across both bet zones. | Chip pull |
| request_human(reason) | Audio cue, then set loop_stage to down. | Help |

Chip-betting primitives are split by a min-count rule that prefers larger denominations and dispatches one push or pull primitive per chip in 100 -> 50 -> 10 -> 5 order, so a single failed primitive can be retried in isolation.
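The min-count split can be sketched as a greedy decomposition. The function name and primitive strings are illustrative, assuming the delta is representable by the available inventory; the released translator's exact interface is not reproduced here.

```python
# Sketch of the min-count rule: prefer larger denominations and emit one push
# primitive per chip, in 100 -> 50 -> 10 -> 5 order, so each chip move is a
# separately retryable primitive.
def split_bet(delta: int, inventory: dict[str, int]) -> list[str]:
    primitives = []
    remaining = delta
    for denom in (100, 50, 10, 5):
        avail = inventory.get(str(denom), 0)
        count = min(remaining // denom, avail)    # greedy, capped by inventory
        primitives += [f"push_{denom}"] * count
        remaining -= count * denom
    if remaining:
        raise ValueError(f"cannot represent {delta} with this inventory")
    return primitives
```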

Evaluation Protocol

Each perceiver runs inside its native agent harness.

Codex, Claude Code, and Gemini CLI are each given the same current observation, allowed predecessor-state context, system visual guidelines, and workflow guidelines. Each row below averages three validation runs under the same medium thinking budget exposed by the corresponding harness.

Per-perceiver accuracy on the 36-problem agentic perception benchmark.

| Harness | Perceiver | Overall | LS | TO | BI | CC | CB | RCI | OCI | SO | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Codex | GPT 5.5 | 31.5 | 72.2 | 80.6 | 100.0 | 61.5 | 45.8 | 62.5 | 35.4 | 76.2 | 66.8 |
| Codex | GPT 5.4 | 31.5 | 65.7 | 93.5 | 100.0 | 23.1 | 31.2 | 56.2 | 18.8 | 47.6 | 54.5 |
| Codex | GPT 5.4 mini | 25.9 | 56.5 | 94.4 | 99.1 | 33.3 | 14.6 | 29.2 | 18.8 | 47.6 | 49.2 |
| Claude Code | Opus 4.7 | 34.3 | 43.5 | 93.5 | 100.0 | 43.6 | 31.2 | 37.5 | 43.8 | 0.0 | 49.1 |
| Claude Code | Sonnet 4.6 | 25.0 | 46.3 | 88.0 | 100.0 | 23.1 | 10.4 | 29.2 | 22.9 | 14.3 | 41.8 |
| Claude Code | Haiku 4.5 | 13.9 | 47.2 | 68.5 | 91.7 | 35.9 | 12.5 | 25.0 | 18.8 | 0.0 | 37.4 |
| Gemini CLI | Gemini 3 Flash | 20.4 | 63.9 | 77.8 | 100.0 | 28.2 | 18.8 | 29.2 | 22.9 | 71.4 | 51.5 |
| Gemini CLI | Gemini 3.1 Flash L. | 10.2 | 27.8 | 73.1 | 94.4 | 28.2 | 12.5 | 22.9 | 14.6 | 0.0 | 34.2 |

Result Takeaways

Current agent perceivers still fail to recover complete routing state reliably.

Strict Overall

Overall counts a problem correct only when every applicable structured field is correct. The best strict score is 34.3%, achieved by Opus 4.7 under Claude Code.

Field-Wise Average

GPT 5.5 has the best unweighted mean over the eight sub-capability columns at 66.8%, showing that strong isolated fields do not automatically compose.
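The quoted field-wise mean is just the unweighted average of the eight sub-capability columns; reproducing it for GPT 5.5 from the results table:

```python
# Eight sub-capability scores for GPT 5.5 from the accuracy table.
columns = {"LS": 72.2, "TO": 80.6, "BI": 100.0, "CC": 61.5,
           "CB": 45.8, "RCI": 62.5, "OCI": 35.4, "SO": 76.2}
avg = sum(columns.values()) / len(columns)
print(round(avg, 1))  # -> 66.8
```

Note that the strict Overall column (31.5 for GPT 5.5) is far below this mean, which is the composition gap the takeaway describes.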

Chip Bottleneck

Current bets (CB) and opponent chip inventory (OCI) are the weakest routing-critical fields, peaking at 45.8% and 43.8% respectively, because both require exact denomination-level chip dictionaries under occlusion.

Failure Modes

Coarse markers are easier than long-horizon chip and outcome state.

Blind information is nearly saturated, and turn ownership is often reliable. The harder cases are table-decision and outcome-judge states, where one wrong chip dictionary, community-card set, or showdown decision is enough to fail strict Overall. In a closed-loop run, missing a change in the opponent's bet can cause the system to keep routing into a wait branch even after the opponent has moved.

[Figure] Turn gate: parse turn ownership, blind context, and whether the route may begin.
[Figure] Recovery safety: distinguish retryable physical failures from human-help states.
[Figure] Outcome judge: combine visible cards, chip state, and terminal win/loss evidence.