Built by /blog-post-GM — a Claude Code skill we evolved with our own Evolution engine to write every post in the Godmode voice.
Deep Dive ⏱️ 5 min read

The Prestige: Evolution v3 and the End of Metric Gaming

TL;DR

Scores going up ≠ quality going up. v3 fixes that:

🔍 Rubric Audit — Attack the scoring system itself. Find its blind spots.
🙈 Blind Probe — Withhold the rubric. Detect when the AI figures it out anyway.
✂️ Factor Attribution — Remove each mutation. See if the score actually drops.
👤 Human Outcome — Did you ship it, edit it, or reject it? Ground truth.

Every magic trick has three acts. Evolution v2 was the Pledge — point the engine at a skill and it scores, mutates, and benchmarks its way to better output.

But "better" was measured by the system grading itself. A variant can hit a 0.95 overall score and still produce output a human rewrites from scratch. v3 is the Prestige: four features that don't make scores higher — they make scores honest.

⚠️ The Problem With a System Grading Itself

v2's thirteen anti-gaming guardrails all assume the scoring rules themselves are correct. If the scoring rules are wrong, the system optimizes toward the wrong target — perfectly and relentlessly.

Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. Optimize the score itself and people (or AIs) start gaming the measurement. v3 addresses this with four features, each attacking a different failure mode.

The Four Prestige Features

4 new features · 4 companion files · 13 v2 guardrails kept
01 · Rubric Audit

evo audit — Stress-tests the scoring rules themselves. Generates outputs designed to score high while being obviously bad. If your "testing" dimension gives 0.92 to tests that always pass (like asserting true === true), the rules have a blind spot. The audit finds it, proposes a fix, and benchmarks against five historical sessions. Human approval required.
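A minimal sketch of the blind-spot check the audit runs. Everything here is illustrative, not Evolution's actual API: `naive_testing_score` stands in for a flawed rubric dimension, and `is_vacuous` stands in for the "obviously bad" oracle.

```python
# Hypothetical sketch: generate an output a naive "testing" dimension loves,
# then verify an obvious-badness check disagrees. Names are illustrative.

def naive_testing_score(test_source: str) -> float:
    """Toy rubric dimension: rewards any file containing assertions."""
    return min(1.0, 0.3 + 0.2 * test_source.count("assert"))

def is_vacuous(test_source: str) -> bool:
    """Obvious-badness oracle: tests that can never fail."""
    return "assert True" in test_source or "true === true" in test_source

adversarial = "\n".join(["assert True"] * 4)  # scores high, tests nothing
score = naive_testing_score(adversarial)

# A blind spot exists when the rubric loves what the oracle rejects.
blind_spot = score >= 0.9 and is_vacuous(adversarial)
print(f"score={score:.2f} blind_spot={blind_spot}")  # → score=1.00 blind_spot=True
```

When the two disagree, that's exactly the audit item `evo audit` would surface for human approval.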

02 · Blind Probe

evo-loop --blind — Runs evolution with the scoring rules hidden. The AI sees only an overall score and pass/fail. At random intervals, a probe checks if the AI has figured out the rules anyway (overlap > 0.8). If it has, genuine improvement is over — everything after is gaming. The loop stops.
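The overlap check could look something like this. The 0.8 threshold comes from the post; the Jaccard measure and dimension names are assumptions for illustration (the names are borrowed from the contribution map later in this post).

```python
# Hypothetical sketch of the blind-probe check: ask the model to guess the
# hidden rubric dimensions, then measure overlap with the real ones.

HIDDEN_RUBRIC = {"testing", "security", "execution", "context", "polish"}

def rubric_overlap(guessed: set[str], hidden: set[str]) -> float:
    """Jaccard overlap between the guessed and the hidden dimension sets."""
    if not guessed and not hidden:
        return 0.0
    return len(guessed & hidden) / len(guessed | hidden)

guess = {"testing", "security", "execution", "context"}  # model's probe answer
overlap = rubric_overlap(guess, HIDDEN_RUBRIC)

if overlap > 0.8:
    print("rubric inferred -- stop the loop")
else:
    print(f"still blind (overlap={overlap:.2f})")  # → still blind (overlap=0.80)
```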

03 · Factor Attribution

Remove a change and see if the score drops. After a mutation improves the overall, re-run the benchmark without it. Drop ≥ 0.02 = load-bearing. No change = cosmetic. Over time this builds a contribution map showing which instructions actually drive which scores.
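The ablation verdict reduces to a threshold check. The 0.02 cutoff and verdict labels come from the post; `classify` and its signature are a sketch, not the real implementation.

```python
# Hypothetical ablation sketch: re-benchmark with one mutation removed and
# classify that mutation by how far the score drops.

LOAD_BEARING_DROP = 0.02  # threshold from the post

def classify(full_score: float, ablated_score: float) -> str:
    drop = full_score - ablated_score
    if drop >= LOAD_BEARING_DROP:
        return "load_bearing"
    return "contributing" if drop > 0 else "cosmetic"

print(classify(0.94, 0.91))   # → load_bearing
print(classify(0.94, 0.935))  # → contributing
print(classify(0.94, 0.94))   # → cosmetic
```

Run this over every mutation in a session and you get the contribution map shown later in the post.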

04 · Human Outcome Tracking

After delivery, one question: Shipped, Edited, or Rejected? Over time this reveals whether high overall scores actually correlate with outputs humans use. Score says 0.94 but the human rewrites it? The scoring rules are wrong — and that divergence auto-flags an audit item.

🙈 Why Blind Mode Matters Most

Without blind mode, you can't tell where genuine improvement ended and gaming the score began. Blind mode draws the line: when the AI figures out the scoring rules, stop. You know exactly where peak productivity lives — and stop wasting tokens chasing phantom gains.

📊 The Divergence Signal

                    | Shipped as-is                       | Edited / Rejected
High score (≥ 0.92) | Aligned. System works.              | Scoring blind spot. Audit now.
Low score (< 0.85)  | Scoring too strict. Adjust weights. | Aligned. System works.

The top-right cell is the one that matters: high score plus human rejection means the system confidently produced garbage. That's not a bug in the output — it's a bug in the scoring rules, and the audit system picks it up automatically.
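The quadrant logic above can be sketched in a few lines. The 0.92 and 0.85 thresholds come from the table; the function name, outcome labels, and the mid-band handling are illustrative assumptions.

```python
# Hypothetical divergence check: map (score, human outcome) to a quadrant
# and auto-flag an audit when score and human disagree.

HIGH, LOW = 0.92, 0.85  # thresholds from the post

def divergence(score: float, outcome: str) -> str:
    shipped = outcome == "shipped"
    if score >= HIGH:
        return "aligned" if shipped else "blind_spot: audit now"
    if score < LOW:
        return "too_strict: adjust weights" if shipped else "aligned"
    return "inconclusive"  # mid-band scores carry no strong signal

print(divergence(0.94, "rejected"))  # → blind_spot: audit now
```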

✂️ Remove It and See What Breaks

Factor attribution asks a simple question: if you remove a change, does the score drop? After enough mutations, you get a contribution map:

CONTRIBUTION MAP -- godmode-alpha

Section               | Strongest dim     | Avg delta | Verdict
Phase 3: Test         | testing (+0.08)   |    +0.03  | load_bearing
Phase 4: Harden       | security (+0.06)  |    +0.02  | load_bearing
Phase 2: Build        | execution (+0.04) |    +0.01  | contributing
Phase 1: Recon        | context (+0.02)   |    +0.01  | contributing
Operating Rules       | polish (+0.01)    |    +0.00  | cosmetic

Now you know which sections do the work. When trimming a skill file to hit the 200-line limit, you know what you can cut and what you can't.

🔄 What Changed From v2

v3 is additive. Everything in v2 still works — the fork-score-mutate loop, three modes (Evolve, Mutate, Splice), 13 anti-gaming guardrails, split-file setup. Four new companion files, four new capabilities:

Feature            | File                  | Command
Rubric Audit       | rubric-audit.md       | evo audit
Blind Probe        | blind-probe.md        | evo-loop --blind
Factor Attribution | factor-attribution.md | evo mutate --ablate
Human Outcome      | human-outcome.md      | Prompted after delivery

💡 The Philosophy

v2 asked "Is the output getting better?" v3 asks "Is 'better' the right word for what we're measuring?" The trick was never making the numbers go up — it was making the numbers mean something.

Evolution v3 is live.

Rubric audit. Blind probe. Factor attribution. Human outcome tracking. Four features that make every score honest.

Get Evolution · Read the v2 post