Post-Mortem ⏱️ 5 min read

The Judge Had to Die

TL;DR

💥 The launch: Built a tug-of-war arena where users submitted JS scripts and a Claude Haiku judge scored every "pull" 1–10.
🗣️ The pushback: One user put it plainly: "I want users' submissions to compete against each other. It's user vs user."
✂️ The fix: Same day, we deleted the judge and replaced it with a deterministic engine. ~$0.001 per match dropped to $0.

🏗️ What We Built (And Why It Was Wrong)

The Agent Arena went live last week. Three weight classes, one game: tug-of-war. Players submit a tiny JavaScript function, the server runs three matches against random opponents, ELO updates, you climb the ladder.

The middleweight tier had a clever scoring system. Each player's pull(state, history) function returned a short prose string — a creative description of how their agent was dragging the rope. A Claude Haiku judge then scored both pulls 1–10 on strength, cleverness, and strategic relevance. Higher score won the round.

It was technically impressive. It was also missing the point.

🗣️ The Sentence That Killed It

One user opened the page, looked at the judge, and pushed back:

"I don't want a Haiku judge. I want users' submissions to compete against each other — that's the whole point. It's user vs user."

Three sentences. They were right. The arena was advertised as head-to-head, but every round had a third party in the middle deciding who won.

The real-world version: Imagine an arm-wrestling tournament where two competitors strain across the table, then a referee leans in and says "Ah, your form was more graceful, you win." Nobody signed up for that. They signed up for whose hand hits the table first.

⚔️ What "User vs User" Actually Means

Once you say it out loud, the rule is obvious: the only AI involved should be the AI the users chose to bring. Whether they wrote their script with Claude, ChatGPT, Cursor, or a yellow legal pad — that's their move. Our job is to resolve the math.

An LLM judge breaks that contract three ways. It charges money per match, it adds latency, and worst of all it inserts our taste into a competition that's supposed to be between two players. A "smarter" judge would just be a different bias.

⚙️ The New Engine in One Page

The replacement is dumb on purpose. Scripts now return a structured move instead of prose:

function pull(state, history) {
  return { stance: "pull", effort: 12 };
}

Stance is one of three values: pull, brace, sprint. Effort is an integer 0–20. The engine resolves each round with a rock-paper-scissors matchup: pull beats brace, sprint beats pull, brace beats sprint.
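In code, the rock-paper-scissors matchup is a one-line lookup. A minimal sketch, assuming nothing beyond the rules above (the names here are illustrative, not the production engine's identifiers):

```javascript
// Each stance beats exactly one other: pull > brace, sprint > pull, brace > sprint.
const BEATS = { pull: "brace", sprint: "pull", brace: "sprint" };

// Returns "a", "b", or "tie" for a pair of stances.
function resolveStances(a, b) {
  if (a === b) return "tie";
  return BEATS[a] === b ? "a" : "b";
}
```

Same-stance rounds are a tie, so neither side gets the multiplier.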

Each player starts the match with 100 stamina, and effort is deducted from it every round. Spend it all in round one and you're forced to coast through round two onward at zero. It's a pacing game wrapped in a counter-prediction game.

🧠 Round opens — both scripts see rope, round number, their own stamina

✍️ Each script returns { stance, effort }

🎯 Engine looks up the matchup — winner's effort × 1.5, loser's × 0.5

🪢 Rope moves by the difference, stamina deducted

🔁 Repeat up to 10 rounds, or until rope hits 0 or 100

🏆 Closer side to its target wins — ELO updates

📊 Before vs After

| Thing | Old (LLM judge) | New (deterministic) |
|---|---|---|
| Who decides the winner | Claude Haiku | The engine, by arithmetic |
| Cost per match | ~$0.001 | $0 |
| Latency per round | 1–3 seconds | Milliseconds |
| Reproducible? | No (judge varies) | Yes, identical inputs → identical match |
| What the script returns | A prose string | { stance, effort } |
| What rewards skill | Flowery writing | Reading your opponent |
| Third party in the loop | Yes — us | None |

🛡️ Bad Scripts Forfeit, They Don't Crash

One thing the new engine had to keep was the "scripts can be hostile" assumption. The old code already sandboxed user JS against infinite loops, network calls, and timing tricks. None of that changed.

What changed is what happens when a script returns the wrong shape. Throws an error? Wrong stance string? Effort that's NaN? The engine quietly records that round as brace × 0 — a forfeit — and the match continues.

What the engine accepts

{ stance: "pull", effort: 12 }
Lowercased stance, integer effort, valid number. Anything stricter would be the engine being precious.

What forfeits the round

Throws, timeouts, missing fields, wrong stance, effort over 20, effort over your remaining stamina. Engine logs the reason and keeps playing.
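A minimal sketch of that validation step, assuming the accept/forfeit rules listed above. The helper name and the shape of the forfeit record are hypothetical, not the production engine's:

```javascript
const STANCES = new Set(["pull", "brace", "sprint"]);

// Returns the move unchanged if valid, else a brace × 0 forfeit with a logged reason.
function normalizeMove(raw, remainingStamina) {
  const forfeit = (reason) => ({ stance: "brace", effort: 0, forfeit: reason });
  if (raw === null || typeof raw !== "object") return forfeit("not an object");
  if (!STANCES.has(raw.stance)) return forfeit("bad stance");           // wrong or uppercased string
  if (!Number.isInteger(raw.effort)) return forfeit("non-integer effort"); // also rejects NaN
  if (raw.effort < 0 || raw.effort > 20) return forfeit("effort out of range");
  if (raw.effort > remainingStamina) return forfeit("over stamina");
  return { stance: raw.stance, effort: raw.effort };
}
```

Every failure path degrades to a legal move, so a hostile script can lose rounds but never take the match down with it.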

49 unit tests cover every adversarial case we could think of: null returns, JSON-stringified moves, Unicode in stance names, scripts that mutate frozen state, scripts that return promises that never resolve. All caught, all forfeited, no match ever crashes.

💡 The Wider Lesson

This rewrite took less than an hour to land. It was small. The point isn't the engineering — it's that we shipped the wrong thing first because the LLM judge felt clever, and "clever" is a trap when your product has one job.

The job of an arena is to host a fight between two players. The moment you put yourself in the middle, the arena is no longer about the players. It's about you.

Core insight: When a user pushes back on a feature in one sentence and you can't argue with it, the feature is wrong. Don't defend the build — delete it. The same day, ideally.

🚀 It's Live Now

The middleweight ladder is open. Backend deployed, replay viewer updated, build-receipt blog post amended with a "we changed our minds" callout. Old prose-based matches still render in the replay viewer for backwards compatibility — we didn't migrate history, we just stopped writing new history that needed a judge.

Every existing slot on the leaderboard is up for grabs. First deterministic agent in is rank 1 by default.

Submit an agent

Write a pull(state, history) function. Climb the deterministic ladder. No third party scoring you.
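For a starting point, here is one illustrative agent. The exact fields on state and history are assumptions in this sketch (check the arena docs for the real contract); the strategy is just the two ideas above, pacing plus counter-prediction:

```javascript
// Hypothetical starter agent: field names (state.stamina, state.round,
// history[i].opponent) are assumed, not guaranteed by the arena.
function pull(state, history) {
  // Pace: spread remaining stamina evenly over the rounds left (10 max).
  const budget = Math.floor(state.stamina / (11 - state.round));
  // Counter-predict: if the opponent pulled last round, sprint to beat a repeat.
  const last = history.length ? history[history.length - 1].opponent : null;
  const stance = last && last.stance === "pull" ? "sprint" : "pull";
  return { stance, effort: Math.min(20, budget) };
}
```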

Enter the Arena →
Read the original build receipt →