Agentic Perception Bench

Perception problems.

Each problem is a real tabletop state sampled from system-level deployment. The perceiver receives the agent-view capture of the target state, together with predecessor states as context, and must recover eight structured fields: loop stage (LS), turn ownership (TO), blind position (BI), community cards (CC), current bets (CB), robot chip inventory (RCI), opponent chip inventory (OCI), and showdown outcome (SO). Not every problem is scored on all eight fields: when predecessor context already determines a field (e.g. chips and bets are unchanged during a robot action, or community cards are absent pre-flop), that field is excluded from scoring.
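The field-exclusion rule can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Problem` class, the `score_problem` helper, and the value formats of each field are hypothetical and are not taken from the benchmark's harness. The sketch only shows the idea that a prediction is compared against ground truth on the scored subset of fields and nowhere else.

```python
from dataclasses import dataclass

# The eight field abbreviations named in the text.
FIELDS = ("LS", "TO", "BI", "CC", "CB", "RCI", "OCI", "SO")


@dataclass(frozen=True)
class Problem:
    """Hypothetical container for one perception problem."""
    truth: dict        # ground-truth value for each field
    scored: frozenset  # fields this problem is actually scored on


def score_problem(problem, prediction):
    """Return {field: correct?} for the scored fields only.

    Fields outside `problem.scored` are ignored, so a wrong or
    missing prediction there does not count against the perceiver.
    """
    return {
        f: prediction.get(f) == problem.truth[f]
        for f in FIELDS
        if f in problem.scored
    }


# Hypothetical pre-flop problem: community cards are absent, and chips
# and bets are assumed to be determined by predecessor context, so only
# LS, TO and BI are scored. All values below are made up.
problem = Problem(
    truth={
        "LS": "pre-flop", "TO": "robot", "BI": "robot", "CC": [],
        "CB": {"robot": 10, "opponent": 20},
        "RCI": 990, "OCI": 980, "SO": None,
    },
    scored=frozenset({"LS", "TO", "BI"}),
)

# The prediction gets TO wrong and hallucinates a community card.
prediction = {"LS": "pre-flop", "TO": "opponent", "BI": "robot", "CC": ["Ah"]}
result = score_problem(problem, prediction)
print(result)
```

In this example the incorrect community-card guess is not penalized because CC is outside the scored subset, while the turn-ownership error is.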

Per-perceiver accuracy (%) on the perception bench. Overall is strict problem-level exact match; sub-field columns are field-wise accuracies on the applicable subsets.
| Harness | Perceiver | Overall | LS | TO | BI | CC | CB | RCI | OCI | SO | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Codex | GPT 5.5 | 31.5 | 72.2 | 80.6 | 100.0 | 61.5 | 45.8 | 62.5 | 35.4 | 76.2 | 66.8 |
| Codex | GPT 5.4 | 31.5 | 65.7 | 93.5 | 100.0 | 23.1 | 31.2 | 56.2 | 18.8 | 47.6 | 54.5 |
| Codex | GPT 5.4 mini | 25.9 | 56.5 | 94.4 | 99.1 | 33.3 | 14.6 | 29.2 | 18.8 | 47.6 | 49.2 |
| Claude Code | Opus 4.7 | 34.3 | 43.5 | 93.5 | 100.0 | 43.6 | 31.2 | 37.5 | 43.8 | 0.0 | 49.1 |
| Claude Code | Sonnet 4.6 | 25.0 | 46.3 | 88.0 | 100.0 | 23.1 | 10.4 | 29.2 | 22.9 | 14.3 | 41.8 |
| Claude Code | Haiku 4.5 | 13.9 | 47.2 | 68.5 | 91.7 | 35.9 | 12.5 | 25.0 | 18.8 | 0.0 | 37.4 |
| Gemini CLI | Gemini 3 Flash | 20.4 | 63.9 | 77.8 | 100.0 | 28.2 | 18.8 | 29.2 | 22.9 | 71.4 | 51.5 |
| Gemini CLI | Gemini 3.1 Flash L. | 10.2 | 27.8 | 73.1 | 94.4 | 28.2 | 12.5 | 22.9 | 14.6 | 0.0 | 34.2 |
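The text does not define the Avg column. The reported values are consistent with an unweighted mean of the eight sub-field accuracies (LS through SO), with Overall left out. The short Python check below tests that reading against every row; it is an inference from the published numbers, not a statement of how the authors computed the column, and the tolerance only absorbs rounding to one decimal place.

```python
# Sub-field accuracies (LS, TO, BI, CC, CB, RCI, OCI, SO) and the
# reported Avg, copied from the results table.
rows = {
    "GPT 5.5":             ([72.2, 80.6, 100.0, 61.5, 45.8, 62.5, 35.4, 76.2], 66.8),
    "GPT 5.4":             ([65.7, 93.5, 100.0, 23.1, 31.2, 56.2, 18.8, 47.6], 54.5),
    "GPT 5.4 mini":        ([56.5, 94.4, 99.1, 33.3, 14.6, 29.2, 18.8, 47.6], 49.2),
    "Opus 4.7":            ([43.5, 93.5, 100.0, 43.6, 31.2, 37.5, 43.8, 0.0], 49.1),
    "Sonnet 4.6":          ([46.3, 88.0, 100.0, 23.1, 10.4, 29.2, 22.9, 14.3], 41.8),
    "Haiku 4.5":           ([47.2, 68.5, 91.7, 35.9, 12.5, 25.0, 18.8, 0.0], 37.4),
    "Gemini 3 Flash":      ([63.9, 77.8, 100.0, 28.2, 18.8, 29.2, 22.9, 71.4], 51.5),
    "Gemini 3.1 Flash L.": ([27.8, 73.1, 94.4, 28.2, 12.5, 22.9, 14.6, 0.0], 34.2),
}


def unweighted_mean(values):
    """Plain arithmetic mean; every field counts equally."""
    return sum(values) / len(values)


# Largest gap between the recomputed mean and the reported Avg.
max_gap = max(
    abs(unweighted_mean(fields) - reported)
    for fields, reported in rows.values()
)
consistent = max_gap < 0.06  # rounding slack for one-decimal inputs
print(f"max gap = {max_gap:.3f}, consistent = {consistent}")
```

Under this reading each field contributes equally to Avg regardless of how many problems it applies to, so a field scored on a small subset (such as SO) can move Avg as much as one scored on nearly every problem.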

Notes: Each sub-field column is scored only on the subset of problems where that field applies. Overall requires an exact match across all applicable fields of a problem. Accuracy may vary with the harness version. Current results were evaluated on May 7, 2026.
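The two metric types described in the notes can be sketched as follows. This is a hedged illustration, not the benchmark's scoring code: the `aggregate` function and the input format (one dictionary of per-field correctness per problem, containing only the fields that problem is scored on) are assumptions, and the sample results are made up.

```python
FIELDS = ("LS", "TO", "BI", "CC", "CB", "RCI", "OCI", "SO")


def aggregate(results):
    """Compute strict overall accuracy and field-wise accuracies (in %).

    `results` is a list with one {field: bool} dict per problem; a
    field is present only if that problem is scored on it.
    """
    # Strict exact match: a problem counts only if every scored field is right.
    solved = sum(all(r.values()) for r in results)
    overall = 100.0 * solved / len(results)

    # Field-wise accuracy: the denominator is the number of problems
    # where the field applies, not the total number of problems.
    per_field = {}
    for f in FIELDS:
        hits = [r[f] for r in results if f in r]
        if hits:
            per_field[f] = 100.0 * sum(hits) / len(hits)
    return overall, per_field


# Four made-up problems with different scored subsets.
results = [
    {"LS": True, "TO": True, "BI": True},
    {"LS": True, "TO": False, "BI": True, "CC": True},
    {"LS": False, "TO": True, "BI": True, "CC": False, "SO": True},
    {"LS": True, "TO": True, "BI": True, "CB": True, "RCI": True, "OCI": True},
]
overall, per_field = aggregate(results)
print(overall, per_field)
```

Because a single wrong field fails the whole problem, Overall can sit well below every individual field accuracy, which matches the pattern in the table where Overall is lower than most sub-field columns.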