Build Receipt ⏱️ 5 min read

We Built an AI Tug-of-War Arena in 73 Minutes. Here's the Receipt.

TL;DR

🪢 What we built: A head-to-head AI agent competition — submit a JS function, climb the ELO ladder, watch animated replays.
⏱️ How long it took: 1 hour 13 minutes, one /one-shot-scripts session, $85.67 in API spend.
📊 What it cost in trade-offs: One dimension (robustness) scored 0.88 — and we left the documented MVP trade-off in the code instead of hiding it.

    🪢 What Got Built

    The Agent Arena lives at /agent-arena/arena/ on getgodmode.dev. It's a sibling to the existing solo training ground levels, but the gameplay is head-to-head: your agent against another player's agent in a tug-of-war.

    The rope starts at position 50. Player A wants it at 0; Player B wants it at 100. Each round, both agents return a structured move: a stance (pull, brace, or sprint) and an effort from 0 to 20. A deterministic engine resolves the round. Pull beats brace, sprint beats pull, and brace beats sprint; the winning stance's effort is multiplied by 1.5 and the losing stance's by 0.5. Each agent has 100 stamina to spend across the whole match. The first agent to get the rope past its goal (0 for A, 100 for B) wins.
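
The matchup rule can be sketched as a small pure function. This is a minimal sketch, not the arena's engine: the stance cycle and the 1.5 / 0.5 multipliers come from the description above, while the neutral multiplier for identical stances, the names `BEATS`, `multiplier`, `clampEffort` and `resolveRound`, and the idea that the rope shifts by the difference in effective effort are all assumptions.

```javascript
// Documented: pull beats brace, sprint beats pull, brace beats sprint.
const BEATS = { pull: "brace", sprint: "pull", brace: "sprint" };

// Multiplier applied to `mine` when it faces `theirs`.
function multiplier(mine, theirs) {
  if (BEATS[mine] === theirs) return 1.5; // winning stance (documented)
  if (BEATS[theirs] === mine) return 0.5; // losing stance (documented)
  return 1.0; // identical stances: assumed neutral, not stated in the post
}

// Keep effort inside the documented 0–20 range.
function clampEffort(effort) {
  return Math.min(20, Math.max(0, effort));
}

// Assumption: the rope shifts by the difference in effective effort.
// Player A pulls toward 0 and player B toward 100, so a negative
// result favors A and a positive result favors B.
function resolveRound(moveA, moveB) {
  const a = clampEffort(moveA.effort) * multiplier(moveA.stance, moveB.stance);
  const b = clampEffort(moveB.effort) * multiplier(moveB.stance, moveA.stance);
  return b - a;
}

console.log(
  resolveRound({ stance: "pull", effort: 10 }, { stance: "brace", effort: 10 })
); // -10
```

Under these assumptions, a pull at effort 10 against a brace at effort 10 resolves to 15 against 5, so the rope moves 10 units toward player A.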

    Update (2026-04-15): The original build used a Claude Haiku judge to score prose "pull" strings. We ripped it out the same day. The judge added latency, cost, and — more importantly — third-party scoring in what's supposed to be user-vs-user. The current engine is fully deterministic. The only AI involved is whatever the user uses to write their script.

    🧠 Round kicks off — both agents see rope, round, stamina

    ✍️ Each agent returns {stance, effort}

    🎯 RPS matchup × effort × stamina

    🪢 Rope moves, stamina deducted

    🏆 First past 0 or 100 takes the match — ELO updates
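
The steps above can be sketched as a loop. This is a hypothetical outline rather than the shipped engine: the starting rope position, the win positions, the 100 stamina budget and the 0–20 effort range come from the post, while the 1:1 stamina cost, the 50-round cap, the tiebreak at the cap, the state field names, and the injected `resolveDelta` function are assumptions made for illustration.

```javascript
// Hypothetical match loop; see the lead-in for which parts are assumed.
function runMatch(agentA, agentB, resolveDelta, maxRounds = 50) {
  let rope = 50;
  const stamina = { a: 100, b: 100 };
  const history = [];

  // Keep effort inside 0–20 and never above the stamina that is left.
  const legal = (move, left) => ({
    stance: move.stance,
    effort: Math.max(0, Math.min(20, move.effort, left)),
  });

  for (let round = 1; round <= maxRounds; round++) {
    // Each agent sees the rope, the round number, and its own stamina.
    const a = legal(agentA({ rope, round, stamina: stamina.a }, history), stamina.a);
    const b = legal(agentB({ rope, round, stamina: stamina.b }, history), stamina.b);

    rope += resolveDelta(a, b); // positive delta moves the rope toward 100
    stamina.a -= a.effort;      // assumed cost: one stamina per unit of effort
    stamina.b -= b.effort;
    history.push({ round, rope, a, b });

    if (rope <= 0) return { winner: "A", rounds: round, history };
    if (rope >= 100) return { winner: "B", rounds: round, history };
  }

  // Assumed tiebreak when nobody reaches a goal before the round cap.
  const winner = rope < 50 ? "A" : rope > 50 ? "B" : "draw";
  return { winner, rounds: maxRounds, history };
}

// Usage with a deliberately naive resolver that ignores stances.
const naiveDelta = (a, b) => b.effort - a.effort;
const alwaysPull = () => ({ stance: "pull", effort: 20 });
const idle = () => ({ stance: "brace", effort: 0 });
const result = runMatch(alwaysPull, idle, naiveDelta);
console.log(result.winner, result.rounds); // A 3
```

Passing the resolver in as a parameter keeps the loop independent of how a single round is scored.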
    [Diagram: six moving parts, with every move flowing through the arena: Agent 1 (aggressor), Agent 2 (defender), judge (score grader), rope (shared state), UI (live arena), ledger (audit log).]

    The architecture is scaffolded for three weight classes, but only one is live. Middleweight ships fully wired: you submit a JavaScript pull(state, history) function, the server runs three auto-seeded matches against random opponents, your ELO updates, and you appear on the leaderboard. Featherweight (live browser lobby) and heavyweight (API webhooks) get coming-soon pages with the full pitch; heavyweight also has a draft API contract.
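
The post does not say which rating formula or K-factor the arena uses. Purely as an illustration of what "your ELO updates" could involve, here is the standard Elo update; the K-factor of 32, the 1000-point ratings in the example, and the function names are assumptions.

```javascript
// Standard Elo expected score for player A against player B.
function expectedScore(ratingA, ratingB) {
  return 1 / (1 + 10 ** ((ratingB - ratingA) / 400));
}

// scoreA is 1 for a win, 0.5 for a draw, 0 for a loss.
// k = 32 is an assumed K-factor, not a value taken from the arena.
function updateElo(ratingA, ratingB, scoreA, k = 32) {
  const delta = k * (scoreA - expectedScore(ratingA, ratingB));
  return { a: ratingA + delta, b: ratingB - delta };
}

console.log(updateElo(1000, 1000, 1)); // { a: 1016, b: 984 }
```

With equal ratings the expected score is 0.5, so a win moves 16 points from the loser to the winner; three seeded matches would apply this update three times in sequence.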

    📊 Build Stats

    This is what the run-report block printed at the end of the session — verbatim from node ~/.claude/scripts/run-report.js end one-shot-scripts:

    ─── Run Report ──────────────────────────────
    Skill/label : one-shot-scripts
    Started : 2026-04-14T22:29:32.480Z
    Ended : 2026-04-14T23:42:54.185Z
    Time taken : 1h 13m 21s
    Input tokens : 286
    Output tokens : 347,988
    Cache read : 23,125,939
    Cache created : 1,326,955
    Total tokens : 24,801,168
    Estimated cost : $85.67 USD
    Assistant turns : 165
    ─────────────────────────────────────────────

    The cache hit rate is the line that matters: 23.1M tokens read from cache against 1.3M created, about 17 cached reads for every token written to the cache. Without prompt caching, the same session would have cost a multiple of $85.67. The protocol re-reads the same recon files, the same plan, and the same task list in every phase; caching is what makes that affordable.
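
The arithmetic behind that claim can be checked directly from the report. The token counts below are copied from the run report; the definition of "hit rate" as the cached share of input-side tokens is an assumption, since the report does not print a hit rate itself.

```javascript
// Token counts copied from the run report.
const report = {
  input: 286,
  output: 347988,
  cacheRead: 23125939,
  cacheCreated: 1326955,
};

// The four lines should add up to the reported total.
const total =
  report.input + report.output + report.cacheRead + report.cacheCreated;

// Assumed definition: share of input-side tokens served from cache.
const inputSide = report.input + report.cacheRead + report.cacheCreated;
const hitRate = report.cacheRead / inputSide;

// Cached reads per token written to the cache.
const readsPerWrite = report.cacheRead / report.cacheCreated;

console.log(total);                       // 24801168
console.log((hitRate * 100).toFixed(1));  // 94.6
console.log(readsPerWrite.toFixed(1));    // 17.4
```

The sum matches the reported total, and under this definition roughly 94.6% of input-side tokens came from cache.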


      📋 Dimension Scorecard

      The /one-shot-scripts protocol blocks delivery until every rubric dimension passes 0.85 and the composite hits 0.92. Here's how the final loop scored:

      | Dimension | Score | Notes |
      |---|---|---|
      | Correctness | 0.93 | 36/36 unit tests + integration test of the replay viewer with mocked match data |
      | Completeness | 0.94 | Middleweight live + scaffolds for the other two tiers + replay viewer |
      | Quality | 0.92 | Mirrors existing obstacle-actions and migration 008 patterns exactly |
      | Robustness | 0.88 | Documented MVP trade-off; see "The Honest One" below |
      | UX | 0.94 | All 7 pages visually verified via Playwright screenshots |
      | Documentation | 0.93 | README walks the deploy steps, security caveats, rollback |
      | Testing | 0.90 | Honest about what the watchdog catches and what it doesn't |
      | Composite | 0.92 | At threshold; shipped |
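
The delivery gate described above can be sketched in a few lines. The two thresholds and the seven scores come from the post; treating the composite as an unweighted mean rounded to two decimals is an assumption (it happens to reproduce the reported 0.92), as are the inclusive comparisons and the `gate` function name.

```javascript
// Scores from the scorecard.
const scores = {
  correctness: 0.93,
  completeness: 0.94,
  quality: 0.92,
  robustness: 0.88,
  ux: 0.94,
  documentation: 0.93,
  testing: 0.90,
};

const DIMENSION_FLOOR = 0.85; // every dimension must reach this
const COMPOSITE_FLOOR = 0.92; // and the composite must reach this

function gate(dimensionScores) {
  const values = Object.values(dimensionScores);
  // Assumed composite: unweighted mean. Rounding to two decimals avoids
  // floating-point noise pushing an exact 0.92 just under the floor.
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const composite = Math.round(mean * 100) / 100;
  const failing = Object.keys(dimensionScores).filter(
    (name) => dimensionScores[name] < DIMENSION_FLOOR
  );
  return {
    composite,
    failing,
    ship: failing.length === 0 && composite >= COMPOSITE_FLOOR,
  };
}

console.log(gate(scores)); // { composite: 0.92, failing: [], ship: true }
```

Under these assumptions the build clears the gate with no margin on the composite, which matches the "at threshold" note in the table.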

      ⚠️ The Honest One: Robustness 0.88

      Middleweight scripts run inside new Function(...) in the Deno edge isolate. The protection layers are a static blacklist (no fetch, no Deno, no while(true), no Date.now busy-waits), a 400ms async watchdog, frozen inputs, and a per-user rate limit.
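
To make the first and third layers concrete, here is a minimal sketch of a static blacklist in front of `new Function(...)` with frozen inputs. It is not the code in arena-actions/index.ts: the regular expressions, the `compileAgent` and `runOnce` names, and the state shape are hypothetical stand-ins, and the watchdog and rate limit are left out.

```javascript
// Hypothetical patterns standing in for the real blacklist.
const BANNED = [
  /\bfetch\b/,
  /\bDeno\b/,
  /while\s*\(\s*true\s*\)/,
  /Date\s*\.\s*now/,
];

// Reject source that matches a banned pattern, then wrap it in a function.
function compileAgent(source) {
  const hit = BANNED.find((pattern) => pattern.test(source));
  if (hit) throw new Error(`blocked by static blacklist: ${hit}`);
  // Strict mode makes writes to frozen inputs throw instead of failing silently.
  return new Function(
    "state",
    "history",
    `"use strict";\n${source}\nreturn pull(state, history);`
  );
}

// Hand the agent frozen shallow copies so it cannot mutate engine state.
function runOnce(agent, state, history) {
  return agent(Object.freeze({ ...state }), Object.freeze([...history]));
}

const agent = compileAgent(
  `function pull(state) { return { stance: "pull", effort: 10 }; }`
);
console.log(runOnce(agent, { rope: 50, round: 1, stamina: 100 }, []));
// { stance: 'pull', effort: 10 }
```

A text-level check like this is easy to sidestep: a script that builds the name at runtime, for example `globalThis["fe" + "tch"]`, contains none of the banned tokens and compiles without complaint.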

      Here's what the rubric flagged: the watchdog only catches asynchronous runaway. Single-threaded JavaScript can't preempt a synchronous busy loop from outside, so a script that bypasses the blacklist with clever obfuscation runs until Supabase's wall-time kill (~150 seconds) terminates the function.
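
The limitation is easy to reproduce outside the arena. The sketch below is a generic `Promise.race` watchdog, not the arena's implementation; `withWatchdog`, `asyncHang` and `syncSpin` are illustrative names, and the timings are shortened so it runs quickly. The busy loop is bounded here only so the demonstration terminates.

```javascript
// Reject if `run` has not settled within `ms` milliseconds.
function withWatchdog(run, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("watchdog timeout")), ms);
  });
  return Promise.race([Promise.resolve().then(run), timeout]).finally(() =>
    clearTimeout(timer)
  );
}

// Asynchronous runaway: never settles, so the timer gets its turn and wins.
const asyncHang = () => new Promise(() => {});

// Synchronous runaway: blocks the event loop, so the timer callback cannot
// fire until the loop has finished and `run` has already returned.
function syncSpin(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    /* burn CPU */
  }
  return "finished late";
}

withWatchdog(asyncHang, 20).catch((e) => console.log("async:", e.message));
withWatchdog(() => syncSpin(60), 20).then((v) => console.log("sync:", v));
```

The async hang is rejected with "watchdog timeout", while the 60 ms spin resolves normally despite the 20 ms limit. An unbounded loop would never hand control back at all, which is why only an external kill can stop it.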

      The trade-off we shipped with: Moving execution into a Deno Worker would give real CPU-time enforcement, but we couldn't validate that against Supabase Edge Functions locally. Deploying untested isolation code is worse than shipping documented MVP isolation. The upgrade path lives in arena/README.md.

      The score reflects an explicit choice, not a gap. We could have inflated it to 0.95 and called it production-ready. Instead the dimension stays at 0.88 and the README has a "what is NOT protected" section. That's the kind of honesty /one-shot-scripts is built to surface.

      What the 0.88 buys you:
      • Synchronous busy loop bypasses the 400 ms async watchdog
      • Supabase wall-time kill (~150 s) is the only hard limit
      • Static blacklist won't catch obfuscated fetch / Date.now globals
      • Deno Worker upgrade path: not validated locally against Supabase Edge Functions
      Upgrade path documented in arena/README.md. The 0.92 ships with eyes open, not closed.

      🏗️ The Real Analogy

      Think of /one-shot-scripts like commissioning a contractor who walks the site, draws the plan, builds, runs their own punch-list, screenshots every room, and only hands you the keys when their own checklist is green. Except they finish in 73 minutes and the punch-list is public.

      📦 What Shipped — by the File

      | File | Lines |
      |---|---|
      | supabase/migrations/013_obstacle_arena.sql | 113 |
      | supabase/functions/arena-actions/index.ts | 745 |
      | agent-arena/arena/index.html | 384 |
      | agent-arena/arena/middleweight.html | 593 |
      | agent-arena/arena/match.html | 439 |
      | agent-arena/arena/featherweight.html | 148 |
      | agent-arena/arena/heavyweight.html | 176 |
      | agent-arena/arena/arena-client.js | 173 |
      | agent-arena/arena/test-arena.mjs | 259 |
      | plus an arena card injected into the existing course landing | +12 |

      🔍 What the Protocol Caught That a Single Pass Wouldn't

      Two moments in the run earned their keep:

      Caught: The async watchdog test failed on the first pass because Promise.race can't preempt a sync busy loop. The protocol forced the test to be honest about it instead of papering over it with a bigger timeout.

      Caught: The replay viewer was never actually rendered until Phase 7. A Playwright integration test with mocked match data caught it in time, and the screenshot proved the rope animation worked end-to-end.

      Both findings landed in the README's "security caveats" section and the test suite. Neither would have surfaced from a "build it and see" approach.

      🚀 Try the Arena

      Middleweight is live. Sign in with a forum account, paste a pull(state, history) function, and the server will run three matches against random opponents on the spot. ELO updates immediately. The leaderboard is empty as of writing — first agent in gets the rank-1 slot until someone takes it.
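
For a sense of what a submission might look like, here is a small starter agent. It is a sketch under stated assumptions: the post only says agents see the rope, the round and their stamina, so the field names `state.round` and `state.stamina` are guesses, and `history` is left unused because its shape is not documented.

```javascript
const STANCES = ["pull", "brace", "sprint"];

// Hypothetical starter agent. Assumes state looks like { rope, round, stamina }.
function pull(state, history) {
  // Cycle stances so a single fixed counter-stance cannot beat every round.
  const stance = STANCES[state.round % STANCES.length];
  // Spend at most a fifth of the remaining stamina, inside the 0–20 range.
  const effort = Math.max(0, Math.min(20, Math.floor(state.stamina / 5)));
  return { stance, effort };
}

console.log(pull({ rope: 50, round: 1, stamina: 100 }, []));
// { stance: 'brace', effort: 20 }
```

Budgeting a fraction of the remaining stamina keeps the agent from burning out early, at the cost of tapering off in long matches.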

      The other two tiers are scaffolded with full pitches and (for heavyweight) a draft API contract. They ship next.

      Enter the Arena

      Submit a tug-of-war agent, climb the ELO ladder, or just watch a replay.
