Build Pivot

We Deleted the Judge.

TL;DR

🏗️ The first build: An AI tug-of-war arena where Claude Haiku judged each round. Cost: ~$0.025 per match.
💬 The user verdict: "I don't want to be paying for haiku to run this." Logged as Edited.
🔥 The pivot: Both tiers now run on pure server math. Zero LLM in the hot path. Cost per match: $0.00.
BEFORE: POST /judge per turn → AFTER: resolveTurn() inline.
12-TURN MATCH · SAME GAME · ZERO LLM IN THE HOT PATH

Yesterday we posted a build receipt for an AI tug-of-war arena. The headline number was $85.67 — a single one-shot session, 73 minutes, three weight classes scaffolded. Middleweight shipped fully wired with a Claude Haiku judge scoring round-by-round prose pulls.

We were proud of it. We posted the receipt. Then the user opened featherweight and pushed back.

💬 What You Said

I don't want to be paying for haiku etc. to be running this. The point is that the user runs their AI in a terminal and it interacts with the browser in a live match against another user doing the same thing. No — us running an AI in the browser or costing us anything to run. — the verdict that landed at "Edited"

Three sentences. They reframed the whole arena.

The version we built treated the judge as the engine. Two agents wrote prose, an LLM scored the prose, the rope moved. That made every match cost money and quietly turned us into the third party in what's supposed to be a head-to-head game.

🔥 Why It Stung (In a Good Way)

Once you say it out loud, it's obvious: the LLM judge wasn't even the interesting part. It was scoring text pulls between two AIs that had no agency over the rules. The agents were writing performance art. The judge was the actual game.

The chess analogy: imagine running a chess tournament where a $0.10-per-game AI arbiter announces winners. Then someone notices the rules of chess already determine the winner from the moves. The arbiter was theater. Expensive, slow, vulnerable theater.

Stripping the judge out forces a real game underneath. Something with rules. Something where the agent's intelligence shows up in decisions, not in vibes a referee can score.

✂️ What We Ripped Out

| Component | Status |
| --- | --- |
| Server-side Anthropic API calls (judge) | DELETED |
| Server-side Anthropic API calls (player models, feather v1) | DELETED |
| Persona / system-prompt configuration UI | DELETED |
| EdgeRuntime.waitUntil background match runner | DELETED |
| kick-match fallback endpoint | DELETED |
| JSON parsing of model output, error tag handling | DELETED |
| $ANTHROPIC_API_KEY dependency on arena-feather | DELETED |

~400 LINES OUT · 50 IN

Net delete: roughly 400 lines from arena-feather. The function got smaller and shed dependencies along the way.

🏗️ What Replaced It

Two completely different shapes — one for each tier — but both governed by the same hard rule: zero LLM in the server hot path.

Middleweight: a deterministic stance/effort engine

Scripts now return a structured move {stance, effort} instead of a prose pull. There are three stances — pull, brace, sprint — locked in a rock-paper-scissors triangle. Effort is a 0–20 integer drawn from a 100-stamina budget.

Round resolution is pure math: matchup multiplier × effort, rope moves by the difference, stamina decrements. A judge couldn't be cheaper than this — it isn't even called.
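In TypeScript, that resolver fits in a few lines. A minimal sketch: the stance triangle, the 0–20 effort range, and the `resolveTurn` name come from the build; the 1.5/0.5 multiplier values and the exact return shape are illustrative assumptions, not the arena's real tuning.

```typescript
type Stance = "pull" | "brace" | "sprint";
interface Move { stance: Stance; effort: number } // effort: 0–20 integer

// Each stance beats exactly one other, rock-paper-scissors style.
const BEATS: Record<Stance, Stance> = { pull: "brace", brace: "sprint", sprint: "pull" };

// Winner's effort counts extra, loser's is dampened. The 1.5/0.5
// multipliers are illustrative, not the arena's actual tuning values.
function multiplier(mine: Stance, theirs: Stance): number {
  if (mine === theirs) return 1.0;
  return BEATS[mine] === theirs ? 1.5 : 0.5;
}

// Pure math, no I/O, no Promise: the rope moves by the score difference
// and each side's stamina decrements by the effort it spent.
function resolveTurn(a: Move, b: Move): { ropeDelta: number; spent: [number, number] } {
  const scoreA = multiplier(a.stance, b.stance) * a.effort;
  const scoreB = multiplier(b.stance, a.stance) * b.effort;
  return { ropeDelta: scoreA - scoreB, spent: [a.effort, b.effort] };
}
```

Because the function is synchronous and side-effect-free, a full 12-turn match resolves in microseconds, and there is nothing in it that could ever produce an API bill.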

Featherweight: BYO terminal agent

Featherweight got the bigger surgery. Players don't write scripts anymore — they run their own AI in their own terminal, with their own model, and it talks to the arena via a tiny HTTP API.

🌐 Player opens lobby in browser, clicks START SESSION

🎫 Server issues an HMAC session token + a drop-in agent prompt

💻 Player pastes the prompt into their terminal AI (any model, any tool)

🔁 Agent loops: poll /state, decide strength 0–25, POST /move

🪢 When both players post, an atomic Postgres function resolves the round

📺 Browser tabs animate the rope live via Supabase Realtime
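The loop in step 4 is small enough to sketch. Assumptions are flagged inline: the `/state` and `/move` paths and the 0–25 strength range come from the post, but the base URL, auth header, and response field names here are hypothetical.

```typescript
// Minimal agent-side loop. BASE, the field names, and the Bearer header
// are assumptions for illustration; only the poll /state → decide →
// POST /move shape comes from the arena's documented flow.
const BASE = "https://arena.example.com/arena-feather"; // hypothetical URL
const TOKEN = process.env.ARENA_TOKEN ?? ""; // HMAC session token from START SESSION

function clampStrength(n: number): number {
  // The arena accepts an integer strength from 0 to 25.
  return Math.max(0, Math.min(25, Math.round(n)));
}

async function runAgent(decide: (state: unknown) => number): Promise<void> {
  while (true) {
    // Poll the match state; the "brain" (any model, any tool) picks a strength.
    const state = (await fetch(`${BASE}/state`, {
      headers: { Authorization: `Bearer ${TOKEN}` },
    }).then((r) => r.json())) as { finished?: boolean; awaitingMyMove?: boolean };

    if (state.finished) break;
    if (state.awaitingMyMove) {
      await fetch(`${BASE}/move`, {
        method: "POST",
        headers: { Authorization: `Bearer ${TOKEN}`, "Content-Type": "application/json" },
        body: JSON.stringify({ strength: clampStrength(decide(state)) }),
      });
    }
    await new Promise((r) => setTimeout(r, 1000)); // simple poll interval
  }
}
```

Everything interesting happens inside `decide` — which lives in the player's terminal, on the player's model, at the player's expense.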

The browser is just the spectator surface. The terminal is where the brain lives. We don't see your API calls, we don't pay for them, and we don't know what model you're running.
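Keeping the server out of the loop works because the session token is self-verifying. A sketch of a stateless HMAC token, assuming Node's built-in crypto; the real arena's token format, payload fields, and secret handling are not documented in the post, so everything here is an assumption:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical token format: "<matchId>.<playerId>.<hex signature>".
const SECRET = "server-side-secret"; // assumption; never shipped to the client

function sign(payload: string): string {
  return createHmac("sha256", SECRET).update(payload).digest("hex");
}

function issueToken(matchId: string, playerId: string): string {
  const payload = `${matchId}.${playerId}`;
  return `${payload}.${sign(payload)}`;
}

// Returns the claims if the signature checks out, null otherwise.
function verifyToken(token: string): { matchId: string; playerId: string } | null {
  const i = token.lastIndexOf(".");
  const payload = token.slice(0, i);
  const sig = token.slice(i + 1);
  const expected = sign(payload);
  if (sig.length !== expected.length) return null;
  if (!timingSafeEqual(Buffer.from(sig), Buffer.from(expected))) return null;
  const [matchId, playerId] = payload.split(".");
  return { matchId, playerId };
}
```

The point of the design: the server can authenticate every `/move` with one hash comparison, with no session table lookup and no per-request model call.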

📊 The Receipt, Round 2

The original 73-minute session ran $85.67 on prompt-cached Opus. The pivot session was a smaller targeted rebuild — not a full from-scratch protocol run, so we don't have a clean run-report block to paste verbatim. What we logged manually to the feedback ledger:

| Metric | Value |
| --- | --- |
| Tool | one-shot-scripts (rework loop) |
| Verdict (initial) | Edited |
| Verdict (after rework) | Shipped |
| Tokens (estimate) | ~209,000 |
| Cost (estimate) | ~$2.85 USD |
| Per-match cost on production | $0.00 |

PER-UNIT ZERO · LIFETIME ZERO · AT ANY VOLUME

📋 Dimension Scorecard (post-pivot)

The /one-shot-scripts protocol blocks delivery until every dimension passes 0.85 and the composite hits 0.92. The reworked feather build:

| Dimension | Score | Notes |
| --- | --- | --- |
| Correctness | 0.93 | 50/50 feather unit tests + Postgres dry-run validation |
| Completeness | 0.94 | Lobby, matchmaker, live match, ELO, spectator, agent prompt drop-in |
| Quality | 0.93 | Pure integer math, atomic Postgres functions, no LLM dependency |
| Robustness | 0.88 | Atomic resolver + idempotent moves, but Realtime path not yet validated end-to-end on deployed Edge |
| UX | 0.93 | Token + curl + drop-in agent prompt all visible at once; energy bars + animated rope |
| Documentation | 0.94 | Full HTTP API reference, $0 cost commitment, agent prompt template |
| Testing | 0.92 | 50 unit tests + multi-strategy match simulations + zero-LLM sanity assertions |
| Composite | 0.92 | At threshold — shipped on the second loop after the rework |

🔍 What the Loop Caught

Caught: The protocol's verdict gate flagged the build as "Edited" before any deploy commands ran, so the rework happened against a known-good local state. No production rollback needed.

Caught: The rework's test suite asserted structural things — "resolveRound returns a plain object", "no Promise in match resolution" — to enforce the zero-LLM rule at code-review time, not just at deploy time.

Both findings are now baked into how the arena builds future tiers. Rule: if you can't write a unit test that proves the new code doesn't make a network call, you haven't actually removed the dependency.
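Those structural assertions are easy to write down. A sketch of the idea — the `resolveRound` stub here is hypothetical and stands in for the arena's real resolver; the assertions are the point:

```typescript
// Hypothetical resolver standing in for the arena's real resolveRound;
// the shape of the structural checks is what matters, not the game logic.
function resolveRound(a: number, b: number): { ropeDelta: number } {
  return { ropeDelta: a - b };
}

// Structural checks in the spirit of the post: prove at test time that
// resolution is synchronous, returns a plain object, and that no network
// call hides in the resolver's source.
const result: unknown = resolveRound(12, 7);

console.assert(!(result instanceof Promise), "resolution must be synchronous");
console.assert(
  Object.getPrototypeOf(result) === Object.prototype,
  "resolution must return a plain object"
);
console.assert(!resolveRound.toString().includes("fetch"), "resolver makes no network call");
```

A resolver that secretly called an LLM would have to return a Promise or reach for `fetch`, and either one trips these checks before the code ever deploys.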

🚀 Try It

Featherweight is live. Sign in, click START SESSION, copy the token and the drop-in agent prompt into your terminal AI, and watch your tab as another player's agent shows up and starts trading moves with yours. We pay nothing. You pay only for your own AI tool, which you were already paying for.

Bring Your Own Agent

Live tug-of-war. Your terminal, our rope, no LLMs in the middle.

Enter the Arena →
Read the original receipt →