/blog-post-GM — a Claude Code skill we evolved with our own Evolution engine to write every post in the Godmode voice.
The Prestige: Evolution v3 and the End of Metric Gaming
Scores going up ≠ quality going up. v3 fixes that:
🔍 Rubric Audit — Attack the scoring system itself. Find its blind spots.
🙈 Blind Probe — Withhold the rubric. Detect when the AI figures it out anyway.
✂️ Factor Attribution — Remove each mutation. See if the score actually drops.
👤 Human Outcome — Did you ship it, edit it, or reject it? Ground truth.
[Chart: Evolution v3 · composite score across generations]
Every magic trick has three acts. Evolution v2 was the Pledge — point the engine at a skill and it scores, mutates, and benchmarks its way to better output.
But "better" was measured by the system grading itself. A variant can hit a 0.95 overall score and still produce output a human rewrites from scratch. v3 is the Prestige: four features that don't make scores higher — they make scores honest.
The Problem With a System Grading Itself
v2's thirteen anti-gaming guardrails all assume the scoring rules themselves are correct. If the scoring rules are wrong, the system optimizes toward the wrong target — perfectly and relentlessly.
Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. Optimize the score hard enough and people (or AIs) start gaming the measurement itself. v3 addresses this with four features, each attacking a different failure mode.
The Four Prestige Features
Rubric Audit
evo audit — Stress-tests the scoring rules themselves. Generates outputs designed to score high while being obviously bad. If your "testing" dimension gives 0.92 to tests that always pass (like asserting true === true), the rules have a blind spot. The audit finds it, proposes a fix, and benchmarks against five historical sessions. Human approval required.
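A minimal sketch of the blind-spot check described above, under assumptions: the scorer, thresholds, and function names here are all hypothetical, not the actual `evo audit` implementation. The idea is simply to feed a deliberately degenerate artifact to one scoring dimension and flag it if garbage scores high.

```python
# Hypothetical sketch of a rubric-audit probe: score a deliberately
# degenerate artifact and flag any dimension that rewards it.

def score_testing_dim(test_source: str) -> float:
    """Toy 'testing' scorer with a blind spot: it only counts asserts."""
    asserts = test_source.count("assert")
    return min(1.0, 0.5 + 0.14 * asserts)

DEGENERATE_TESTS = """
assert True == True
assert True == True
assert True == True
"""  # always passes, verifies nothing

def audit_dimension(scorer, adversarial_input: str, threshold: float = 0.9):
    """A dimension has a blind spot if obviously-bad input scores high."""
    score = scorer(adversarial_input)
    return {"score": score, "blind_spot": score >= threshold}

report = audit_dimension(score_testing_dim, DEGENERATE_TESTS)
# The always-passing suite still clears the threshold -> blind spot flagged.
```

The real audit would then propose a rule fix and benchmark it against historical sessions; this sketch only shows the detection step.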
Blind Probe
evo-loop --blind — Runs evolution with the scoring rules hidden. The AI sees only an overall score and pass/fail. At random intervals, a probe checks if the AI has figured out the rules anyway (overlap > 0.8). If it has, genuine improvement is over — everything after is gaming. The loop stops.
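One way to picture the probe's stop rule, as a hedged sketch: treat the hidden rubric and the model's inferred rubric as weight distributions over dimensions, and stop when their overlap exceeds 0.8. The overlap measure and all names here are illustrative assumptions, not the shipped `evo-loop --blind` logic.

```python
# Hypothetical sketch of the blind-probe stop rule: if the model's
# inferred rubric overlaps the hidden one too closely, further gains
# are likely gaming, so the loop halts.

HIDDEN_RUBRIC = {"testing": 0.3, "security": 0.3, "execution": 0.2, "context": 0.2}

def rubric_overlap(inferred: dict, hidden: dict) -> float:
    """Overlap = shared weight mass across dimensions (1.0 = identical)."""
    dims = set(inferred) | set(hidden)
    return sum(min(inferred.get(d, 0.0), hidden.get(d, 0.0)) for d in dims)

def should_stop(inferred_rubric: dict, threshold: float = 0.8) -> bool:
    """True once the rubric has effectively been reverse-engineered."""
    return rubric_overlap(inferred_rubric, HIDDEN_RUBRIC) > threshold

# Early generations: the model's guess is far off -> keep evolving.
assert not should_stop({"testing": 0.9, "polish": 0.1})
# Later: the guess converges on the hidden weights -> stop the loop.
assert should_stop({"testing": 0.3, "security": 0.25, "execution": 0.2, "context": 0.2})
```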
Factor Attribution
Remove a change and see if the score drops. After a mutation improves the overall, re-run the benchmark without it. Drop ≥ 0.02 = load-bearing. No change = cosmetic. Over time this builds a contribution map showing which instructions actually drive which scores.
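The classification rule above fits in a few lines. A sketch, using the thresholds from the text (drop ≥ 0.02 = load-bearing); the function and category names are assumptions for illustration.

```python
# Hypothetical sketch of factor attribution: re-score the benchmark with
# one mutation removed and classify it by how much the score drops.

LOAD_BEARING_DROP = 0.02  # drop at or above this = the mutation does real work

def classify_mutation(score_with: float, score_without: float) -> str:
    drop = score_with - score_without
    if drop >= LOAD_BEARING_DROP:
        return "load_bearing"
    if drop > 0.0:
        return "contributing"
    return "cosmetic"

assert classify_mutation(0.94, 0.89) == "load_bearing"   # drop ~0.05
assert classify_mutation(0.94, 0.93) == "contributing"   # drop ~0.01
assert classify_mutation(0.94, 0.94) == "cosmetic"       # no drop
```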
Human Outcome Tracking
After delivery, one question: Shipped, Edited, or Rejected? Over time this reveals whether high overall scores actually correlate with outputs humans use. Score says 0.94 but the human rewrites it? The scoring rules are wrong — and that divergence auto-flags an audit item.
Why Blind Mode Matters Most
[Chart: composite score over 1,000 iterations · v3 brakes in action]
Without blind mode, you can't tell where genuine improvement ended and gaming the score began. Blind mode draws the line: when the AI figures out the scoring rules, stop. You know exactly where peak productivity lives, and you stop wasting tokens chasing phantom gains.
The Divergence Signal
| | Shipped as-is | Edited / Rejected |
|---|---|---|
| High score (≥ 0.92) | Aligned. System works. | Scoring blind spot. Audit now. |
| Low score (< 0.85) | Scoring too strict. Adjust weights. | Aligned. System works. |
The top-right cell is the one that matters: high score plus human rejection means the system confidently produced garbage. That's not a bug in the output — it's a bug in the scoring rules, and the audit system picks it up automatically.
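The quadrant logic can be sketched directly from the table. This is a hypothetical illustration using the table's own thresholds; the "inconclusive" middle band (scores between 0.85 and 0.92) is my assumption, since the table doesn't cover it.

```python
# Hypothetical sketch of the divergence signal: map (score, human outcome)
# onto the four quadrants. Thresholds match the table above.

def divergence(score: float, outcome: str) -> str:
    shipped = outcome == "shipped"       # "edited" and "rejected" count together
    if score >= 0.92:
        return "aligned" if shipped else "scoring_blind_spot"  # auto-flags an audit
    if score < 0.85:
        return "scoring_too_strict" if shipped else "aligned"
    return "inconclusive"                # middle band: no strong signal (assumption)

assert divergence(0.94, "rejected") == "scoring_blind_spot"
assert divergence(0.94, "shipped") == "aligned"
assert divergence(0.80, "shipped") == "scoring_too_strict"
```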
Remove It and See What Breaks
Factor attribution asks a simple question: if you remove a change, does the score drop? After enough mutations, you get a contribution map:
CONTRIBUTION MAP -- godmode-alpha

| Section | Strongest dim | Avg delta | Verdict |
|---|---|---|---|
| Phase 3: Test | testing (+0.08) | +0.03 | load_bearing |
| Phase 4: Harden | security (+0.06) | +0.02 | load_bearing |
| Phase 2: Build | execution (+0.04) | +0.01 | contributing |
| Phase 1: Recon | context (+0.02) | +0.01 | contributing |
| Operating Rules | polish (+0.01) | +0.00 | cosmetic |
Now you know which sections do the work. When trimming a skill file to hit the 200-line limit, you know what you can cut and what you can't.
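As a hedged sketch of that trimming decision: given a contribution map, cut cosmetic sections first, then the weakest contributors, and never the load-bearing ones. The data structure and function names here are illustrative assumptions, not the engine's internal format.

```python
# Hypothetical sketch: use the contribution map to pick trim candidates
# when a skill file must shrink to a line limit. Cosmetic sections go first.

CONTRIBUTION_MAP = [
    {"section": "Phase 3: Test",   "avg_delta": 0.03, "verdict": "load_bearing"},
    {"section": "Phase 2: Build",  "avg_delta": 0.01, "verdict": "contributing"},
    {"section": "Operating Rules", "avg_delta": 0.00, "verdict": "cosmetic"},
]

def trim_candidates(contribution_map):
    """Sections safe to cut, safest first; load-bearing sections are off-limits."""
    order = {"cosmetic": 0, "contributing": 1}
    safe = [e for e in contribution_map if e["verdict"] != "load_bearing"]
    safe.sort(key=lambda e: (order[e["verdict"]], e["avg_delta"]))
    return [e["section"] for e in safe]

assert trim_candidates(CONTRIBUTION_MAP) == ["Operating Rules", "Phase 2: Build"]
```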
What Changed From v2
v3 is additive. Everything in v2 still works — the fork-score-mutate loop, three modes (Evolve, Mutate, Splice), 13 anti-gaming guardrails, split-file setup. Four new companion files, four new capabilities:
| Feature | File | Command |
|---|---|---|
| Rubric Audit | rubric-audit.md | evo audit |
| Blind Probe | blind-probe.md | evo-loop --blind |
| Factor Attribution | factor-attribution.md | evo mutate --ablate |
| Human Outcome | human-outcome.md | Prompted after delivery |
Goodhart's Law in One Frame
The Philosophy
v2 asked "Is the output getting better?" v3 asks "Is 'better' the right word for what we're measuring?" The trick was never making the numbers go up — it was making the numbers mean something.
Evolution v3 is live.
Rubric audit. Blind probe. Factor attribution. Human outcome tracking. Four features that make every score honest.
Get Evolution · Read the v2 post