We Built an AI Tug-of-War Arena in 73 Minutes. Here's the Receipt.
🪢 What we built: A head-to-head AI agent competition — submit a JS function, climb the ELO ladder, watch animated replays.
⏱️ How long it took: 1 hour 13 minutes, one /one-shot-scripts session, $85.67 in API spend.
📊 What it cost in trade-offs: One dimension (robustness) scored 0.88 — and we left the documented MVP trade-off in the code instead of hiding it.
What Got Built
The Agent Arena lives at /agent-arena/arena/ on getgodmode.dev. It's a sibling to the existing solo training ground levels, but the gameplay is head-to-head: your agent against another player's agent in a tug-of-war.
The rope starts at position 50. Player A wants it at 0. Player B wants it at 100. Each round, both agents return a structured move — a stance (pull, brace, or sprint) and an effort from 0 to 20. A deterministic engine resolves the round: pull beats brace, sprint beats pull, brace beats sprint. The matchup winner's effort is multiplied by 1.5 and the loser's by 0.5. Each agent has 100 stamina to spend across the whole match. First to push the rope past 0 or 100 wins.
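The round resolution above fits in a few lines. This is an illustrative sketch of those rules, not the shipped engine; names are made up, and stamina deduction is omitted:

```javascript
// One round of the tug-of-war, per the rules described above.
// Illustrative only -- not the arena's actual engine code.
const BEATS = { pull: "brace", sprint: "pull", brace: "sprint" };

function resolveRound(a, b, rope) {
  // Clamp effort to the legal 0-20 range.
  const ea = Math.max(0, Math.min(20, a.effort));
  const eb = Math.max(0, Math.min(20, b.effort));

  let forceA = ea, forceB = eb;
  if (BEATS[a.stance] === b.stance) {        // A wins the matchup
    forceA *= 1.5; forceB *= 0.5;
  } else if (BEATS[b.stance] === a.stance) { // B wins the matchup
    forceB *= 1.5; forceA *= 0.5;
  }                                          // same stance: no modifier

  // Player A pushes toward 0, Player B toward 100.
  // (Stamina deduction omitted for brevity.)
  return rope - forceA + forceB;
}
```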
Update (2026-04-15): The original build used a Claude Haiku judge to score prose "pull" strings. We ripped it out the same day. The judge added latency, cost, and — more importantly — third-party scoring in what's supposed to be user-vs-user. The current engine is fully deterministic. The only AI involved is whatever the user uses to write their script.
✍️ Each agent returns {stance, effort}
↓
🎯 RPS matchup × effort × stamina
↓
🪢 Rope moves, stamina deducted
↓
🏆 First past 0 or 100 takes the match — ELO updates
Three weight classes scaffold the architecture, but only one is live. Middleweight ships fully wired: submit a JavaScript pull(state, history) function, the server runs three auto-seeded matches against random opponents, your ELO updates, you appear on the leaderboard. Featherweight (live browser lobby) and heavyweight (API webhooks) get coming-soon pages with the full pitch and a draft API contract.
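For orientation, here's what a minimal middleweight agent might look like. The post only specifies the signature and the move shape, so the field names on state and history below are assumptions:

```javascript
// A minimal middleweight agent -- a sketch, not a reference
// implementation. ASSUMED shapes: `state.rope` (current position) and
// `history[i].opponentStance` are guesses at the API, not documented.
function pull(state, history) {
  const stances = ["pull", "brace", "sprint"];

  // Counter the opponent's last stance: pull beats brace,
  // sprint beats pull, brace beats sprint.
  const counter = { brace: "pull", pull: "sprint", sprint: "brace" };
  const last = history.length
    ? history[history.length - 1].opponentStance
    : null;
  const stance = last ? counter[last] : stances[Math.floor(Math.random() * 3)];

  // Budget stamina: spend harder the closer the rope is to a wall.
  const urgency = Math.abs(state.rope - 50) / 50; // 0 at center, 1 at a wall
  const effort = Math.min(20, Math.round(8 + 12 * urgency));

  return { stance, effort };
}
```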
Build Stats
The run-report block at the end of the session (node ~/.claude/scripts/run-report.js end one-shot-scripts) told the cost story in one line: 23.1M tokens read from cache against 1.3M created. That cache hit rate is the number that matters. Without prompt caching, the same session would have cost a multiple of what it did: the protocol re-reads the same recon files, the same plan, and the same task list across every phase, and caching is what makes that affordable.
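A back-of-envelope on those numbers. The discount rates below are assumptions, not figures from the session: cached reads are typically priced around a tenth of the base input rate, and cache writes carry a small premium.

```javascript
// Rough estimate of the caching saving. Pricing multipliers are
// ASSUMPTIONS for illustration, not the provider's actual rates.
const base = 1.0;            // relative price per uncached input token
const cacheRead = 0.1 * base;   // assumed: reads at 10% of base
const cacheWrite = 1.25 * base; // assumed: writes at a 25% premium

const readTokens = 23.1e6;   // cached reads (from the run report)
const writtenTokens = 1.3e6; // cache creations (from the run report)

const withCache = readTokens * cacheRead + writtenTokens * cacheWrite;
const withoutCache = (readTokens + writtenTokens) * base;

console.log((withoutCache / withCache).toFixed(1) + "x cheaper");
```

Under those assumptions the session's input bill comes out roughly six times cheaper than the uncached equivalent, which is why the re-read-heavy protocol is viable at all.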
Dimension Scorecard
The /one-shot-scripts protocol blocks delivery until every rubric dimension passes 0.85 and the composite hits 0.92. Here's how the final loop scored:
| Dimension | Score | Notes |
|---|---|---|
| Correctness | 0.93 | 36/36 unit tests + integration test of the replay viewer with mocked match data |
| Completeness | 0.94 | Middleweight live + scaffolds for the other two tiers + replay viewer |
| Quality | 0.92 | Mirrors existing obstacle-actions and migration 008 patterns exactly |
| Robustness | 0.88 | Documented MVP trade-off — see "the honest one" below |
| UX | 0.94 | All 7 pages visually verified via Playwright screenshots |
| Documentation | 0.93 | README walks the deploy steps, security caveats, rollback |
| Testing | 0.90 | Honest about what the watchdog catches and what it doesn't |
| Composite | 0.92 | At threshold — shipped |
The Honest One: Robustness 0.88
Middleweight scripts run inside new Function(...) in the Deno edge isolate. The protection layers are a static blacklist (no fetch, no Deno, no while(true), no Date.now busy-waits), a 400ms async watchdog, frozen inputs, and a per-user rate limit.
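A static blacklist of this kind is just pattern matching over source text. A sketch (illustrative patterns, not the shipped list) shows both how it works and why it's bypassable:

```javascript
// Sketch of a static blacklist pass. Patterns are illustrative,
// not the arena's actual list. String matching over source text is
// exactly why clever obfuscation slips past it.
const BANNED = [
  /\bfetch\b/,               // no network calls
  /\bDeno\b/,                // no runtime APIs
  /while\s*\(\s*true\s*\)/,  // no obvious infinite loops
  /Date\.now/,               // no busy-wait clocks
];

function passesBlacklist(source) {
  return !BANNED.some((re) => re.test(source));
}

passesBlacklist("fetch('https://evil')");          // false: caught
passesBlacklist('globalThis["fe" + "tch"]("x")');  // true: the bypass
```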
Here's what the rubric flagged: the watchdog only catches asynchronous runaway. Single-threaded JavaScript can't preempt a synchronous busy loop from outside, so a script that bypasses the blacklist with clever obfuscation runs until Supabase's wall-time kill (~150 seconds) terminates the function.
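A sketch of the watchdog pattern (a hypothetical helper, not the arena's code) makes the limitation concrete. Promise.race can only observe the script at points where it yields back to the event loop; a synchronous loop never yields, so the timeout timer cannot fire until the loop ends on its own:

```javascript
// Hypothetical async watchdog, illustrating the limitation above.
function withWatchdog(fn, ms) {
  const timeout = new Promise((_, reject) =>
    setTimeout(() => reject(new Error("watchdog: timed out")), ms));
  // Races the script against the timer. This only works if the
  // script yields to the event loop (awaits, timers, I/O).
  return Promise.race([Promise.resolve().then(fn), timeout]);
}

// An agent that yields is killable:
withWatchdog(() => new Promise((r) => setTimeout(r, 1000)), 400)
  .catch((e) => console.log(e.message)); // logs "watchdog: timed out"

// A synchronous busy loop is not. This would block the event loop,
// so the 400 ms timer could never fire until the loop finished:
// withWatchdog(() => { while (true) {} }, 400); // never preempted
```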
The trade-off we shipped with: Moving execution into a Deno Worker would give real CPU-time enforcement, but we couldn't validate that against Supabase Edge Functions locally. Deploying untested isolation code is worse than shipping documented MVP isolation. The upgrade path lives in arena/README.md.
The score reflects an explicit choice, not a gap. We could have inflated it to 0.95 and called it production-ready. Instead the dimension stays at 0.88 and the README has a "what is NOT protected" section. That's the kind of honesty /one-shot-scripts is built to surface.
- Synchronous busy loop bypasses the 400 ms async watchdog
- Supabase wall-time kill (~150 s) is the only hard floor
- Static blacklist won't catch obfuscated fetch/Date.now globals
- Deno Worker upgrade path: untested against Supabase Edge Functions locally
All four caveats are documented in arena/README.md. The 0.92 ships with eyes open, not closed.
The Real Analogy
Think of /one-shot-scripts like commissioning a contractor who walks the site, draws the plan, builds, runs their own punch-list, screenshots every room, and only hands you the keys when their own checklist is green. Except they finish in 73 minutes and the punch-list is public.
What Shipped — by the File
| File | Lines |
|---|---|
| supabase/migrations/013_obstacle_arena.sql | 113 |
| supabase/functions/arena-actions/index.ts | 745 |
| agent-arena/arena/index.html | 384 |
| agent-arena/arena/middleweight.html | 593 |
| agent-arena/arena/match.html | 439 |
| agent-arena/arena/featherweight.html | 148 |
| agent-arena/arena/heavyweight.html | 176 |
| agent-arena/arena/arena-client.js | 173 |
| agent-arena/arena/test-arena.mjs | 259 |
| plus an arena card injected into the existing course landing | +12 |
What the Protocol Caught That a Single Pass Wouldn't
Two moments in the run earned their keep:
Caught
The async watchdog test failed on the first pass because Promise.race can't preempt a sync busy loop. The protocol forced the test to be honest about it instead of papering over with a bigger timeout.
Caught
The replay viewer was never actually rendered until Phase 7. A Playwright integration test with mocked match data caught it in time — and the screenshot proved the rope animation worked end-to-end.
Both findings landed in the README's "security caveats" section and the test suite. Neither would have surfaced from a "build it and see" approach.
Try the Arena
Middleweight is live. Sign in with a forum account, paste a pull(state, history) function, and the server will run three matches against random opponents on the spot. ELO updates immediately. The leaderboard is empty as of writing — first agent in gets the rank-1 slot until someone takes it.
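The post doesn't say which rating formula or K-factor the arena uses; for orientation, the textbook Elo update is only a few lines:

```javascript
// Textbook Elo update -- shown for orientation only; the arena's
// actual formula and K-factor are not specified in the post.
function eloUpdate(ra, rb, scoreA, K = 32) {
  // Expected score for A against B, per the standard logistic curve.
  const expectedA = 1 / (1 + 10 ** ((rb - ra) / 400));
  // scoreA is 1 for a win, 0.5 for a draw, 0 for a loss.
  return Math.round(ra + K * (scoreA - expectedA));
}

eloUpdate(1200, 1200, 1); // 1216: beating an equal opponent gains K/2
```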
The other two tiers are scaffolded with full pitches and (for heavyweight) a draft API contract. They ship next.
Enter the Arena
Submit a tug-of-war agent, climb the ELO ladder, or just watch a replay.
Enter the Arena →
See /one-shot-scripts →