The Judge Had to Die
💥 The launch: Built a tug-of-war arena where users submitted JS scripts and a Claude Haiku judge scored every "pull" 1–10.
🗣️ The pushback: One user, two sentences: "I want users' submissions to compete against each other. It's user vs user."
✂️ The fix: Same day, deleted the judge. Replaced it with a deterministic engine. $0.001 per match dropped to $0.
tug-of-war engine — live, no model in the loop
DETERMINISTIC ENGINE · NO MODEL CALL · NO COST
What We Built (And Why It Was Wrong)
The Agent Arena went live last week. Three weight classes, one game: tug-of-war. Players submit a tiny JavaScript function, the server runs three matches against random opponents, your Elo updates, and you climb the ladder.
The middleweight tier had a clever scoring system. Each player's pull(state, history) function returned a short prose string — a creative description of how their agent was dragging the rope. A Claude Haiku judge then scored both pulls 1–10 on strength, cleverness, and strategic relevance. Higher score won the round.
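For a concrete sense of the old contract, here's a hypothetical submission. The prose itself is invented; only the shape — a function returning a string for the judge to score — matches the retired system:

```javascript
// Old middleweight contract (now retired): return prose, and a Claude Haiku
// judge scored it 1–10 on strength, cleverness, and strategic relevance.
function pull(state, history) {
  return "My agent plants both heels, waits out the slack, and heaves on the third beat.";
}
```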
It was technically impressive. It was also missing the point.
The Sentence That Killed It
One user opened the page, looked at the judge, and pushed back:

> "I want users' submissions to compete against each other. It's user vs user."

Two sentences. They were right. The arena was advertised as head-to-head, but every round had a third party in the middle deciding who won.
The real-world version: Imagine an arm-wrestling tournament where two competitors strain across the table, then a referee leans in and says "Ah, your form was more graceful, you win." Nobody signed up for that. They signed up for whose hand hits the table first.
What "User vs User" Actually Means
Once you say it out loud, the rule is obvious: the only AI involved should be the AI the users chose to bring. Whether they wrote their script with Claude, ChatGPT, Cursor, or a yellow legal pad — that's their move. Our job is to resolve the math.
An LLM judge breaks that contract three ways. It charges money per match, it adds latency, and worst of all it inserts our taste into a competition that's supposed to be between two players. A "smarter" judge would just be a different bias.
The New Engine in One Page
The replacement is dumb on purpose. Scripts now return a structured move instead of prose:
```js
function pull(state, history) {
  return { stance: "pull", effort: 12 };
}
```
Stance is one of three values: pull, brace, sprint. Effort is an integer 0–20. The engine resolves each round with a rock-paper-scissors matchup: pull beats brace, sprint beats pull, brace beats sprint.
three stances · rock-paper-scissors
RPS RULE · WHO BEATS WHO · NO LLM TASTE
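The matchup rule fits in a single lookup table. A minimal sketch, using the multipliers from the resolution step below (winner's effort × 1.5, loser's × 0.5); how a mirrored stance resolves isn't stated in the post, so raw effort is assumed here:

```javascript
// Stance matchup as a lookup: each stance beats the one it maps to.
const BEATS = { pull: "brace", sprint: "pull", brace: "sprint" };

// Effective force each side applies this round.
// Assumption: a mirror match (same stance) resolves on raw effort.
function resolve(a, b) {
  if (a.stance === b.stance) return [a.effort, b.effort];
  return BEATS[a.stance] === b.stance
    ? [a.effort * 1.5, b.effort * 0.5] // a wins the matchup
    : [a.effort * 0.5, b.effort * 1.5]; // b wins the matchup
}
```

So `resolve({ stance: "pull", effort: 12 }, { stance: "brace", effort: 12 })` comes out `[18, 6]` — same effort spent, triple the force, purely from reading the opponent.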
Each player starts the match with 100 stamina, and effort is deducted from it every round. Spend it all in round one and you're forced to coast at zero effort from round two onward. It's a pacing game wrapped in a counter-prediction game.
↓
✍️ Each script returns
{ stance, effort }
↓
🎯 Engine looks up the matchup — winner's effort × 1.5, loser's × 0.5
↓
🪢 Rope moves by the difference, stamina deducted
↓
🔁 Repeat up to 10 rounds, or until rope hits 0 or 100
↓
🏆 Closer side to its target wins — Elo updates
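The loop above can be sketched end to end. Everything not stated in the post is an assumption here: the rope starting at 50 with player A dragging toward 0 and B toward 100, the shape of the state object handed to scripts, mirrored stances resolving on raw effort, and both scripts always returning a valid move (the real engine forfeits invalid rounds instead):

```javascript
const BEATS = { pull: "brace", sprint: "pull", brace: "sprint" };

// Runs one match between two scripts. Assumes both scripts always return a
// valid { stance, effort } move.
function runMatch(scriptA, scriptB) {
  let rope = 50;                      // assumed midpoint start
  const stamina = { a: 100, b: 100 }; // 100 stamina each, per the post
  const history = [];

  for (let round = 0; round < 10 && rope > 0 && rope < 100; round++) {
    const a = scriptA({ rope, stamina: stamina.a }, history);
    const b = scriptB({ rope, stamina: stamina.b }, history);

    // Matchup lookup: winner's effort × 1.5, loser's × 0.5; mirror = raw effort.
    let [fa, fb] = [a.effort, b.effort];
    if (a.stance !== b.stance) {
      const aWins = BEATS[a.stance] === b.stance;
      fa *= aWins ? 1.5 : 0.5;
      fb *= aWins ? 0.5 : 1.5;
    }

    rope += fb - fa;                  // A pulls toward 0, B toward 100
    stamina.a -= a.effort;
    stamina.b -= b.effort;
    history.push({ a, b, rope });
  }

  // Closer side to its target wins.
  return rope < 50 ? "A" : rope > 50 ? "B" : "draw";
}
```

With an always-pull script against an always-sprint script, sprint wins the matchup every round and drags the rope to its end in five rounds.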
stamina decay · recorded 8-turn match
EFFORT IS NOT FREE — BIG SWINGS GET PUNISHED LATER
Before vs After
| Thing | Old (LLM judge) | New (deterministic) |
|---|---|---|
| Who decides the winner | Claude Haiku | The engine, by arithmetic |
| Cost per match | ~$0.001 | $0 |
| Latency per round | 1–3 seconds | milliseconds |
| Reproducible? | No (judge varies) | Yes, identical inputs → identical match |
| What the script returns | A prose string | { stance, effort } |
| What rewards skill | Flowery writing | Reading your opponent |
| Third party in the loop | Yes — us | None |
cost per match · old judge vs new engine
1,000 MATCHES · ONE METER STILL · ONE METER SPINNING
Bad Scripts Forfeit, They Don't Crash
One thing the new engine had to keep was the "scripts can be hostile" assumption. The old code already sandboxed user JS against infinite loops, network calls, and timing tricks. None of that changed.
What changed is what happens when a script misbehaves. Returns the wrong shape? Throws an error? Wrong stance string? Effort that's NaN? The engine quietly records that round as brace × 0 — a forfeit — and the match continues.
What the engine accepts
{ stance: "pull", effort: 12 }
Lowercased stance, integer effort, valid number. Anything stricter would be the engine being precious.
What forfeits the round
Throws, timeouts, missing fields, wrong stance, effort over 20, effort over your remaining stamina. Engine logs the reason and keeps playing.
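Those rules collapse into one validation gate. A sketch, assuming the forfeit move is the brace × 0 described above; the helper name and check order are illustrative, and throws and timeouts are caught around the sandboxed call rather than in this function:

```javascript
const STANCES = new Set(["pull", "brace", "sprint"]);
const FORFEIT = { stance: "brace", effort: 0 }; // the forfeit round: brace × 0

// Validates whatever a script returned. Anything off-shape forfeits the round.
function sanitizeMove(raw, staminaLeft) {
  if (raw === null || typeof raw !== "object") return FORFEIT; // null, JSON strings
  if (!STANCES.has(raw.stance)) return FORFEIT;                // wrong/unicode stance, promises
  if (!Number.isInteger(raw.effort)) return FORFEIT;           // NaN, floats, missing field
  if (raw.effort < 0 || raw.effort > 20) return FORFEIT;       // outside 0–20
  if (raw.effort > staminaLeft) return FORFEIT;                // overspending stamina
  return { stance: raw.stance, effort: raw.effort };
}
```

A JSON-stringified move, for instance, is a string: it fails the object check and forfeits, same as a promise that never resolves fails the stance check.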
49 unit tests cover every adversarial case we could think of: null returns, JSON-stringified moves, unicode in stance names, scripts that mutate frozen state, scripts that return promises that never resolve. All caught, all forfeited, no match ever crashes.
The Wider Lesson
This rewrite took less than an hour to land. It was small. The point isn't the engineering — it's that we shipped the wrong thing first because the LLM judge felt clever, and "clever" is a trap when your product has one job.
The job of an arena is to host a fight between two players. The moment you put yourself in the middle, the arena is no longer about the players. It's about you.
Core insight: When a user pushes back on a feature in one sentence and you can't argue with it, the feature is wrong. Don't defend the build — delete it. The same day, ideally.
It's Live Now
The middleweight ladder is open. Backend deployed, replay viewer updated, build-receipt blog post amended with a "we changed our minds" callout. Old prose-based matches still render in the replay viewer for backwards compatibility — we didn't migrate history, we just stopped writing new history that needed a judge.
Every existing slot on the leaderboard is up for grabs. First deterministic agent in is rank 1 by default.
Submit an agent
Write a pull(state, history) function. Climb the deterministic ladder. No third party scoring you.