We Built an AI Tug-of-War Arena in 73 Minutes. Here's the Receipt.
🪢 What we built: A head-to-head AI agent competition — submit a JS function, climb the ELO ladder, watch animated replays.
⏱️ How long it took: 1 hour 13 minutes, one /one-shot-scripts session, $85.67 in API spend.
📊 What it cost in trade-offs: One dimension (robustness) scored 0.88 — and we left the documented MVP trade-off in the code instead of hiding it.
What Got Built
The Agent Arena lives at /agent-arena/arena/ on getgodmode.dev. It's a sibling to the existing solo training ground levels, but the gameplay is head-to-head: your agent against another player's agent in a tug-of-war.
The rope starts at position 50. Player A wants it at 0. Player B wants it at 100. Each round, both agents return a structured move — a stance (pull, brace, or sprint) and an effort from 0 to 20. A deterministic engine resolves the round: pull beats brace, sprint beats pull, brace beats sprint. The matchup winner's effort is multiplied by 1.5 and the loser's by 0.5. Each agent has 100 stamina to spend across the whole match. First to push the rope past 0 or 100 wins.
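The round resolution above fits in a few lines. This is an illustrative sketch of those rules, not the shipped engine; names are made up, and stamina deduction is omitted:

```javascript
// One round of the tug-of-war, per the rules described above.
// Illustrative only -- not the arena's actual engine code.
const BEATS = { pull: "brace", sprint: "pull", brace: "sprint" };

function resolveRound(a, b, rope) {
  // Clamp effort to the legal 0-20 range.
  const ea = Math.max(0, Math.min(20, a.effort));
  const eb = Math.max(0, Math.min(20, b.effort));

  let forceA = ea, forceB = eb;
  if (BEATS[a.stance] === b.stance) {        // A wins the matchup
    forceA *= 1.5; forceB *= 0.5;
  } else if (BEATS[b.stance] === a.stance) { // B wins the matchup
    forceB *= 1.5; forceA *= 0.5;
  }                                          // same stance: no modifier

  // Player A pushes toward 0, Player B toward 100.
  // (Stamina deduction omitted for brevity.)
  return rope - forceA + forceB;
}
```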
Update (2026-04-15): The original build used a Claude Haiku judge to score prose "pull" strings. We ripped it out the same day. The judge added latency, cost, and — more importantly — third-party scoring in what's supposed to be user-vs-user. The current engine is fully deterministic. The only AI involved is whatever the user uses to write their script.
✍️ Each agent returns {stance, effort}
↓
🎯 RPS matchup × effort × stamina
↓
🪢 Rope moves, stamina deducted
↓
🏆 First past 0 or 100 takes the match — ELO updates
Three weight classes scaffold the architecture, but only one is live. Middleweight ships fully wired: submit a JavaScript pull(state, history) function, the server runs three auto-seeded matches against random opponents, your ELO updates, you appear on the leaderboard. Featherweight (live browser lobby) and heavyweight (API webhooks) get coming-soon pages with the full pitch and a draft API contract.
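For orientation, here's what a minimal middleweight agent might look like. The post only specifies the signature and the move shape, so the field names on state and history below are assumptions:

```javascript
// A minimal middleweight agent -- a sketch, not a reference
// implementation. ASSUMED shapes: `state.rope` (current position) and
// `history[i].opponentStance` are guesses at the API, not documented.
function pull(state, history) {
  const stances = ["pull", "brace", "sprint"];

  // Counter the opponent's last stance: pull beats brace,
  // sprint beats pull, brace beats sprint.
  const counter = { brace: "pull", pull: "sprint", sprint: "brace" };
  const last = history.length
    ? history[history.length - 1].opponentStance
    : null;
  const stance = last ? counter[last] : stances[Math.floor(Math.random() * 3)];

  // Budget stamina: spend harder the closer the rope is to a wall.
  const urgency = Math.abs(state.rope - 50) / 50; // 0 at center, 1 at a wall
  const effort = Math.min(20, Math.round(8 + 12 * urgency));

  return { stance, effort };
}
```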
Build Stats
The run-report block at the end of the session (node ~/.claude/scripts/run-report.js end one-shot-scripts) told the cost story in one line: 23.1M tokens read from cache against 1.3M created. That cache hit rate is the number that matters. Without prompt caching, the same session would have cost a multiple of what it did: the protocol re-reads the same recon files, the same plan, and the same task list across every phase, and caching is what makes that affordable.
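A back-of-envelope on those numbers. The discount rates below are assumptions, not figures from the session: cached reads are typically priced around a tenth of the base input rate, and cache writes carry a small premium.

```javascript
// Rough estimate of the caching saving. Pricing multipliers are
// ASSUMPTIONS for illustration, not the provider's actual rates.
const base = 1.0;            // relative price per uncached input token
const cacheRead = 0.1 * base;   // assumed: reads at 10% of base
const cacheWrite = 1.25 * base; // assumed: writes at a 25% premium

const readTokens = 23.1e6;   // cached reads (from the run report)
const writtenTokens = 1.3e6; // cache creations (from the run report)

const withCache = readTokens * cacheRead + writtenTokens * cacheWrite;
const withoutCache = (readTokens + writtenTokens) * base;

console.log((withoutCache / withCache).toFixed(1) + "x cheaper");
```

Under those assumptions the session's input bill comes out roughly six times cheaper than the uncached equivalent, which is why the re-read-heavy protocol is viable at all.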
Dimension Scorecard
The /one-shot-scripts protocol blocks delivery until every rubric dimension passes 0.85 and the composite hits 0.92. Here's how the final loop scored:
| Dimension | Score | Notes |
|---|---|---|
| Correctness | 0.93 | 36/36 unit tests + integration test of the replay viewer with mocked match data |
| Completeness | 0.94 | Middleweight live + scaffolds for the other two tiers + replay viewer |
| Quality | 0.92 | Mirrors existing obstacle-actions and migration 008 patterns exactly |
| Robustness | 0.88 | Documented MVP trade-off — see "the honest one" below |
| UX | 0.94 | All 7 pages visually verified via Playwright screenshots |
| Documentation | 0.93 | README walks the deploy steps, security caveats, rollback |
| Testing | 0.90 | Honest about what the watchdog catches and what it doesn't |
| Composite | 0.92 | At threshold — shipped |
The Honest One: Robustness 0.88
Middleweight scripts run inside new Function(...) in the Deno edge isolate. The protection layers are a static blacklist (no fetch, no Deno, no while(true), no Date.now busy-waits), a 400ms async watchdog, frozen inputs, and a per-user rate limit.
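A static blacklist of this kind is just pattern matching over source text. A sketch (illustrative patterns, not the shipped list) shows both how it works and why it's bypassable:

```javascript
// Sketch of a static blacklist pass. Patterns are illustrative,
// not the arena's actual list. String matching over source text is
// exactly why clever obfuscation slips past it.
const BANNED = [
  /\bfetch\b/,               // no network calls
  /\bDeno\b/,                // no runtime APIs
  /while\s*\(\s*true\s*\)/,  // no obvious infinite loops
  /Date\.now/,               // no busy-wait clocks
];

function passesBlacklist(source) {
  return !BANNED.some((re) => re.test(source));
}

passesBlacklist("fetch('https://evil')");          // false: caught
passesBlacklist('globalThis["fe" + "tch"]("x")');  // true: the bypass
```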
Here's what the rubric flagged: the watchdog only catches asynchronous runaway. Single-threaded JavaScript can't preempt a synchronous busy loop from outside, so a script that bypasses the blacklist with clever obfuscation runs until Supabase's wall-time kill (~150 seconds) terminates the function.
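A sketch of the watchdog pattern (a hypothetical helper, not the arena's code) makes the limitation concrete. Promise.race can only observe the script at points where it yields back to the event loop; a synchronous loop never yields, so the timeout timer cannot fire until the loop ends on its own:

```javascript
// Hypothetical async watchdog, illustrating the limitation above.
function withWatchdog(fn, ms) {
  const timeout = new Promise((_, reject) =>
    setTimeout(() => reject(new Error("watchdog: timed out")), ms));
  // Races the script against the timer. This only works if the
  // script yields to the event loop (awaits, timers, I/O).
  return Promise.race([Promise.resolve().then(fn), timeout]);
}

// An agent that yields is killable:
withWatchdog(() => new Promise((r) => setTimeout(r, 1000)), 400)
  .catch((e) => console.log(e.message)); // logs "watchdog: timed out"

// A synchronous busy loop is not. This would block the event loop,
// so the 400 ms timer could never fire until the loop finished:
// withWatchdog(() => { while (true) {} }, 400); // never preempted
```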
The trade-off we shipped with: Moving execution into a Deno Worker would give real CPU-time enforcement, but we couldn't validate that against Supabase Edge Functions locally. Deploying untested isolation code is worse than shipping documented MVP isolation. The upgrade path lives in arena/README.md.
The score reflects an explicit choice, not a gap. We could have inflated it to 0.95 and called it production-ready. Instead the dimension stays at 0.88 and the README has a "what is NOT protected" section. That's the kind of honesty /one-shot-scripts is built to surface.
- Synchronous busy loop bypasses the 400 ms async watchdog
- Supabase wall-time kill (~150 s) is the only hard floor
- Static blacklist won't catch obfuscated fetch/Date.now globals
- Deno Worker upgrade path: untested against Supabase Edge Functions locally
All four caveats are documented in arena/README.md. The 0.92 ships with eyes open, not closed.
The Real Analogy
Think of /one-shot-scripts like commissioning a contractor who walks the site, draws the plan, builds, runs their own punch-list, screenshots every room, and only hands you the keys when their own checklist is green. Except they finish in 73 minutes and the punch-list is public.
What Shipped — by the File
| File | Lines |
|---|---|
| supabase/migrations/013_obstacle_arena.sql | 113 |
| supabase/functions/arena-actions/index.ts | 745 |
| agent-arena/arena/index.html | 384 |
| agent-arena/arena/middleweight.html | 593 |
| agent-arena/arena/match.html | 439 |
| agent-arena/arena/featherweight.html | 148 |
| agent-arena/arena/heavyweight.html | 176 |
| agent-arena/arena/arena-client.js | 173 |
| agent-arena/arena/test-arena.mjs | 259 |
| plus an arena card injected into the existing course landing | +12 |
What the Protocol Caught That a Single Pass Wouldn't
Two moments in the run earned their keep:
Caught
The async watchdog test failed on the first pass because Promise.race can't preempt a sync busy loop. The protocol forced the test to be honest about it instead of papering over with a bigger timeout.
Caught
The replay viewer was never actually rendered until Phase 7. A Playwright integration test with mocked match data caught it in time — and the screenshot proved the rope animation worked end-to-end.
Both findings landed in the README's "security caveats" section and the test suite. Neither would have surfaced from a "build it and see" approach.
Try the Arena
Middleweight is live. Sign in with a forum account, paste a pull(state, history) function, and the server will run three matches against random opponents on the spot. ELO updates immediately. The leaderboard is empty as of writing — first agent in gets the rank-1 slot until someone takes it.
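The post doesn't say which rating formula or K-factor the arena uses; for orientation, the textbook Elo update is only a few lines:

```javascript
// Textbook Elo update -- shown for orientation only; the arena's
// actual formula and K-factor are not specified in the post.
function eloUpdate(ra, rb, scoreA, K = 32) {
  // Expected score for A against B, per the standard logistic curve.
  const expectedA = 1 / (1 + 10 ** ((rb - ra) / 400));
  // scoreA is 1 for a win, 0.5 for a draw, 0 for a loss.
  return Math.round(ra + K * (scoreA - expectedA));
}

eloUpdate(1200, 1200, 1); // 1216: beating an equal opponent gains K/2
```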
The other two tiers are scaffolded with full pitches and (for heavyweight) a draft API contract. They ship next.
Enter the Arena
Submit a tug-of-war agent, climb the ELO ladder, or just watch a replay.
Enter the Arena →
See /one-shot-scripts →