Post-Mortem ⏱️ 5 min read

The Audit: 16 Bugs in the Thing That Finds Bugs

TL;DR

🔍 What we did: Full audit of Evolution v3 — 14 files, ~2,100 lines of self-improvement protocol
🚨 What we found: 16 issues including 2 critical data-corruption risks
🧬 What we added: 6 research-backed improvements from AlphaEvolve, MAP-Elites, and RL evaluation literature
Result: Evolution v3.1.0 — hardened, honest, and harder to game

📊 Audit Stats

| Metric | Value |
| --- | --- |
| Tool | one-shot-orchestra |
| Files audited | 14 markdown files |
| Files modified | 10 / 14 |
| Issues found | 16 (2 critical, 5 high, 7 medium, 2 low) |
| Fresh-spawn workers | 7 |
| Composite score | 0.928 / 1.00 |
| Verdict | Shipped |

⚠️ The Irony of a Self-Improving System That Can’t Audit Itself

Evolution is the skill that makes other skills better. It scores every session, tracks weaknesses over time, and proposes targeted mutations tested against benchmarks.

But who audits the auditor?

Think of it like this: You have a factory inspector who stamps “PASS” on every product. The inspector is thorough, well-trained, follows a checklist. But nobody has ever inspected the inspector’s checklist. That’s what v3.0.0 was — a powerful system that had never been turned on itself.

So we pointed One-Shot Orchestra — 7 fresh Claude workers, each with a clean million-token context — directly at Evolution’s own 14 files. Here’s what fell out.

🚨 The Two Critical Bugs

The image below tells the story. Left: a crumbling database — your scoring history silently overwritten. Right: a snapping chain — a running loop crashing mid-execution.

Two red warning triangles: a crumbling database representing data corruption, and a snapping chain representing a mid-loop crash

Both could silently corrupt your evolution data. Neither had ever fired in production — but both were one accidental command away.

1. evo init Had No Safety Net

Running evo init on an already-initialized skill would overwrite your scoring config, benchmarks, and audit history. No warning, no confirmation. Every session score you’d collected — gone.

The fix: An idempotency guard (a check that prevents duplicate operations). Now evo init refuses if the skill is already set up. You need evo reinit to deliberately start over.
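A minimal sketch of that guard in Python. The `state.json` path, the `evo_init` function name, and the `force` flag (standing in for `evo reinit`) are illustrative assumptions, not the engine's actual layout:

```python
from pathlib import Path

def evo_init(skill_dir: Path, force: bool = False) -> str:
    """Initialize evolution state, refusing to overwrite existing data.

    `force=True` models the deliberate `evo reinit` path.
    Hypothetical layout: <skill>/evolution/state.json
    """
    marker = skill_dir / "evolution" / "state.json"
    if marker.exists() and not force:
        raise FileExistsError(
            f"{skill_dir} is already initialized; run `evo reinit` to start over"
        )
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.write_text('{"sessions": []}')  # fresh, empty scoring history
    return "initialized"
```

The second call fails loudly instead of silently wiping the scoring history; destroying state now requires an explicit, differently-named command.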

2. Mid-Loop Variant Pruning Could Crash the Engine

If you pruned a variant while an evolution loop was running, the loop’s variant selector would try to sample from a variant that no longer existed. The result: a crash with no state saved.

The fix: The loop now checks that each variant’s directory still exists before sampling. Missing variants get logged and skipped instead of crashing the run.
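A sketch of the existence check, assuming each variant lives in its own directory under a variants root (the function name and layout are illustrative):

```python
from pathlib import Path

def surviving_variants(variant_ids, variants_root: Path, log=print):
    """Filter out variants whose directory was pruned mid-loop.

    Missing variants are logged and skipped rather than crashing the run.
    """
    alive = []
    for vid in variant_ids:
        if (variants_root / vid).is_dir():
            alive.append(vid)
        else:
            log(f"variant {vid} missing on disk; skipping")
    if not alive:
        raise RuntimeError("no variants left to sample from")
    return alive
```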

📋 All 16 Findings

| Severity | Count | Examples |
| --- | --- | --- |
| CRITICAL | 2 | Init data destruction, mid-loop crash |
| HIGH | 5 | 7 undocumented commands, SKILL.md 4 lines from cap, held-out benchmark leak, template convergence naming collision |
| MEDIUM | 7 | "Exporter" typo, dead CLI flag, dangling reference, missing config seeding, goal-drift detection gap, staging boundary ambiguity, schema mismatch |
| LOW | 2 | No variant cap (could grow unbounded), less-useful behavioral descriptor |

Every finding was fixed in the same session. Zero deferred.

🧬 6 Research-Backed Additions

The audit wasn’t just bug-hunting. We researched what the latest evolutionary AI papers say about self-improvement. Six ideas made it into v3.1.0.

Below: a DNA helix made of code, with 6 glowing nodes — each one a new capability spliced into the engine’s genome.

A DNA double helix made of green code with 6 glowing research nodes: clock, flask, target, recycling, chart, and lock

Six research nodes spliced into the evolution engine’s DNA

⏱️ Temporal Consistency

Flags sudden score jumps (>0.10 between sessions) as anomalies. Like a credit card fraud detector — if your score suddenly spikes, something suspicious happened.
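The core of that check fits in a few lines. This is a sketch under the article's stated threshold (a jump greater than 0.10 between adjacent sessions); the function name is an assumption:

```python
def flag_anomalies(scores, max_jump=0.10):
    """Return session indexes whose score jumped more than max_jump
    relative to the previous session."""
    return [
        i for i in range(1, len(scores))
        if abs(scores[i] - scores[i - 1]) > max_jump
    ]
```

For example, a history of 0.70, 0.72, 0.90, 0.91 flags the third session: a +0.18 jump is too good to trust without a second look.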

🧪 Evaluator Stress Test

Every 10 mutations, re-runs a benchmark with reworded prompts. If scores swing wildly, the scoring dimension is measuring phrasing — not quality.
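One way to express the swing test, sketched with an assumed tolerance of 0.05 (the source doesn't specify the threshold, so treat it as illustrative):

```python
def phrasing_sensitive(original_score, reworded_scores, tol=0.05):
    """True if any reworded-prompt score deviates from the original by
    more than tol: the dimension is likely measuring phrasing, not quality."""
    return any(abs(s - original_score) > tol for s in reworded_scores)
```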

🎯 Adaptive Staging

Early variants get big structural changes. Mature variants get polish. Like training for a sport — beginners learn fundamentals, pros refine technique.

🔄 Meta-Mutation Trigger

If mutations keep getting rejected, the system suggests changing the mutation process itself. Evolution that evolves its own evolution.

📉 Goal-Drift Detection

Catches the #1 failure mode in self-improvement: overall score goes up while a specific capability quietly degrades. Now flagged and blocked.
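A minimal sketch of per-dimension regression tracking. Dimension names and the regression tolerance are assumptions; the point is that the check compares dimensions individually rather than trusting the composite:

```python
def goal_drift(before: dict, after: dict, tol=0.02):
    """Return dimensions that regressed by more than tol, regardless of
    whether the composite score improved."""
    return [
        dim for dim in before
        if before[dim] - after.get(dim, 0.0) > tol
    ]
```

A mutation that lifts accuracy from 0.90 to 0.95 while style slides from 0.70 to 0.60 raises the composite yet still gets flagged, because `style` regressed past the tolerance.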

🔒 Held-Out Validation

20% of benchmarks are kept secret from the mutation process. Like a teacher keeping some exam questions hidden until test day — catches “studying the test” instead of learning.
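The split itself is simple. A sketch with a seeded shuffle so the held-out slice is stable across runs (the function name and seeding scheme are assumptions):

```python
import random

def split_benchmarks(benchmarks, held_out_frac=0.2, seed=0):
    """Deterministically reserve a held-out slice the mutation loop never sees.

    Returns (visible, held_out).
    """
    rng = random.Random(seed)
    shuffled = benchmarks[:]
    rng.shuffle(shuffled)
    k = max(1, int(len(shuffled) * held_out_frac))
    return shuffled[k:], shuffled[:k]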

Core insight: The biggest risk in self-improvement isn’t that the system gets worse — it’s that it gets better at the wrong thing. Goal-drift detection and held-out validation exist to catch exactly that.

⚙️ How One-Shot Orchestra Ran This

Orchestra delegates each phase to a fresh worker with a clean context window. Seven terminals, one conductor, zero shared state between workers.

Seven glowing green terminal screens arranged in a semicircle around a central conductor's podium, connected by green light beams

Seven fresh-spawn workers, each with a clean 1M-token context

Each worker takes one slice — accuracy, polish, refs, structure, security, line-budget, drift — then ticks ✓

Here’s the actual flow. Each step was a separate worker that started with zero knowledge of the others:

1. 🔍 Recon — Read all 14 files, map every cross-reference
2. 🌐 Research — Fetch AlphaEvolve, MAP-Elites, SCOPE papers
3. 🛠️ Build — Implement fixes across 10 files
4. 🧪 Test — Verify line counts, cross-refs, helper scripts
5. 🛡️ Harden — Adversarial critic finds 4 more issues
6. ✅ Verify — All 14 files under 200-line cap, 0 broken refs
7. 📊 Score — 0.928 composite — shipped

The orchestrator never touched a single skill file directly. It planned, delegated, read results, and decided.

✂️ The 200-Line Problem

Evolution enforces a hard rule: no skill file may exceed 200 lines. It’s one of the engine’s own laws — and its main controller was at 196 lines. Four lines from breaking its own rule.

The audit needed to add content (7 missing commands, a new anti-gaming rule, updated CLI signatures). That meant trimming first.

Do

Compress “How to Parse a Request” from 5 lines to 1. Remove redundant flow descriptions. Make room before adding.

Don’t

Add the new content first and “figure out the line budget later.” The 200-line cap prevents monolith files that overflow the AI’s working memory.

Final count: 186 lines. Fourteen lines of headroom for future mutations.
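A budget check like this is easy to automate. A sketch, assuming skill files are the markdown files in one directory (the function name and glob pattern are illustrative):

```python
from pathlib import Path

def over_budget(skill_dir: Path, cap: int = 200):
    """Return (filename, line_count) for markdown files exceeding the cap."""
    return [
        (p.name, n)
        for p in sorted(skill_dir.glob("*.md"))
        if (n := len(p.read_text().splitlines())) > cap
    ]
```

Run it before every mutation lands and a 196-line controller can never silently drift past its own rule.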

📊 v3.0.0 vs v3.1.0

Left: a crumbling tower of warning signs. Right: a clean modular grid, every block in place. That’s the difference between unaudited and audited.

Split screen: a crumbling red tower of warnings on the left versus a clean green modular grid on the right
| Capability | v3.0.0 | v3.1.0 |
| --- | --- | --- |
| Init safety | None — overwrites silently | Idempotency guard + evo reinit |
| Mid-loop resilience | Crashes on pruned variant | Skips missing, logs, continues |
| Score anomaly detection | None | Temporal consistency + stress test |
| Overfitting prevention | Benchmark rotation only | Rotation + held-out validation |
| Goal-drift detection | None | Cross-dimension regression tracking |
| Operator selection | Static heuristic | Adaptive staging + meta-mutation |
| Variant cap | Unbounded | Soft cap at 15 |
| Documented commands | 11 | 18 |
| Runtime config seeding | Manual | Automatic (4 files) |

💬 What We Didn’t Fix

Three issues were found but intentionally left. Transparency matters more than a clean sheet.

Why list unfixed issues? Because a post-mortem that claims a 100% fix rate is either lying or didn't look hard enough. These three items are real; they're just not worth the line-budget cost to fix right now.

Try the Evolution Engine

Point it at any Claude Code skill. It gets better every time you use it.

Get Godmode · Read: Evolution v3