The Audit: 16 Bugs in the Thing That Finds Bugs
🔍 What we did: Full audit of Evolution v3 — 14 files, ~2,100 lines of self-improvement protocol
🚨 What we found: 16 issues including 2 critical data-corruption risks
🧬 What we added: 6 research-backed improvements from AlphaEvolve, MAP-Elites, and RL evaluation literature
✅ Result: Evolution v3.1.0 — hardened, honest, and harder to game
Audit Stats
| Metric | Value |
|---|---|
| Tool | one-shot-orchestra |
| Files audited | 14 markdown files |
| Files modified | 10 / 14 |
| Issues found | 16 (2 critical, 5 high, 7 medium, 2 low) |
| Fresh-spawn workers | 7 |
| Composite score | 0.928 / 1.00 |
| Verdict | Shipped |
The Irony of a Self-Improving System That Can’t Audit Itself
Evolution is the skill that makes other skills better. It scores every session, tracks weaknesses over time, and proposes targeted mutations tested against benchmarks.
But who audits the auditor?
Think of it like this: You have a factory inspector who stamps “PASS” on every product. The inspector is thorough, well-trained, follows a checklist. But nobody has ever inspected the inspector’s checklist. That’s what v3.0.0 was — a powerful system that had never been turned on itself.
So we pointed One-Shot Orchestra — 7 fresh Claude workers, each with a clean million-token context — directly at Evolution’s own 14 files. Here’s what fell out.
The Two Critical Bugs
The image below tells the story. Left: a crumbling database — your scoring history silently overwritten. Right: a snapping chain — a running loop crashing mid-execution.
Both could silently corrupt your evolution data. Neither had ever fired in production — but both were one accidental command away.
1. evo init Had No Safety Net
Running evo init on an already-initialized skill would overwrite your scoring config, benchmarks, and audit history. No warning, no confirmation. Every session score you’d collected — gone.
The fix: An idempotency guard (a check that prevents duplicate operations). Now evo init refuses if the skill is already set up. You need evo reinit to deliberately start over.
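A minimal sketch of what such a guard can look like, in Python. The marker path, function names, and `force` flag are illustrative assumptions, not the skill's actual implementation:

```python
from pathlib import Path

# Hypothetical marker file; the real skill's layout may differ.
EVO_MARKER = ".evolution/config.json"

def evo_init(skill_dir: str, force: bool = False) -> None:
    """Initialize evolution state, refusing to clobber existing data."""
    marker = Path(skill_dir) / EVO_MARKER
    if marker.exists() and not force:
        raise SystemExit(
            f"{skill_dir} is already initialized. "
            "Run `evo reinit` to deliberately start over."
        )
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.write_text("{}")  # fresh scoring config, benchmarks, history

def evo_reinit(skill_dir: str) -> None:
    """Explicit, intentional re-initialization."""
    evo_init(skill_dir, force=True)
```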
2. Mid-Loop Variant Pruning Could Crash the Engine
If you pruned a variant while an evolution loop was running, the loop’s variant selector would try to sample from a variant that no longer existed. The result: a crash with no state saved.
The fix: The loop now checks that each variant’s directory still exists before sampling. Missing variants get logged and skipped instead of crashing the run.
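A sketch of that defensive sampling step, assuming the directory-per-variant layout the fix describes; the function names and logging shape are invented for illustration:

```python
import random
from pathlib import Path

def pick_variant(variant_names, variants_root: Path, log=print):
    """Sample a variant, skipping any whose directory was pruned mid-loop."""
    live = []
    for name in variant_names:
        if (variants_root / name).is_dir():
            live.append(name)
        else:
            log(f"variant '{name}' missing on disk; skipping")  # pruned mid-run
    if not live:
        raise RuntimeError("no surviving variants to sample from")
    return random.choice(live)
```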
All 16 Findings
| Severity | Count | Examples |
|---|---|---|
| CRITICAL | 2 | Init data destruction, mid-loop crash |
| HIGH | 5 | 7 undocumented commands, SKILL.md 4 lines from cap, held-out benchmark leak, template convergence naming collision |
| MEDIUM | 7 | “Exporter” typo, dead CLI flag, dangling reference, missing config seeding, goal-drift detection gap, staging boundary ambiguity, schema mismatch |
| LOW | 2 | No variant cap (could grow unbounded), less-useful behavioral descriptor |
Thirteen of the 16 findings were fixed in the same session; the remaining three were deliberately left unfixed (see “What We Didn’t Fix” below).
6 Research-Backed Additions
The audit wasn’t just bug-hunting. We researched what the latest evolutionary AI papers say about self-improvement. Six ideas made it into v3.1.0.
Below: a DNA helix made of code, with 6 glowing nodes — each one a new capability spliced into the engine’s genome.
Six research nodes spliced into the evolution engine’s DNA
Temporal Consistency
Flags sudden score jumps (>0.10 between sessions) as anomalies. Like a credit card fraud detector — if your score suddenly spikes, something suspicious happened.
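The rule is simple enough to sketch. Only the 0.10 threshold comes from the post; flagging drops as well as spikes is an assumption on my part:

```python
JUMP_THRESHOLD = 0.10  # from the rule: >0.10 between sessions is anomalous

def flag_score_anomalies(session_scores):
    """Return (session_index, delta) pairs where the score moved too fast.
    The post names spikes; drops are flagged too as a conservative choice."""
    anomalies = []
    for i in range(1, len(session_scores)):
        delta = session_scores[i] - session_scores[i - 1]
        if abs(delta) > JUMP_THRESHOLD:
            anomalies.append((i, round(delta, 3)))
    return anomalies

print(flag_score_anomalies([0.71, 0.73, 0.88, 0.87]))  # [(2, 0.15)]
```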
Evaluator Stress Test
Every 10 mutations, re-runs a benchmark with reworded prompts. If scores swing wildly, the scoring dimension is measuring phrasing — not quality.
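Roughly, the check might look like this; the 10-mutation cadence is from the post, while the swing tolerance and the shape of `score_fn` are assumptions:

```python
STRESS_EVERY = 10   # from the post: re-test every 10 mutations
SWING_LIMIT = 0.05  # assumed tolerance; the post doesn't state the real one

def stress_test(mutation_count, rewordings, score_fn):
    """Score semantically identical rewordings of one benchmark prompt.
    A wide spread means the dimension is grading phrasing, not quality."""
    if mutation_count % STRESS_EVERY != 0:
        return None  # not a stress-test checkpoint yet
    scores = [score_fn(prompt) for prompt in rewordings]
    spread = max(scores) - min(scores)
    return {"scores": scores, "spread": spread, "unstable": spread > SWING_LIMIT}
```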
Adaptive Staging
Early variants get big structural changes. Mature variants get polish. Like training for a sport — beginners learn fundamentals, pros refine technique.
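One plausible shape for this, with made-up operator pools and maturity cutoff; the skill's real staging rules aren't spelled out in the post:

```python
# Assumed operator pools and cutoff, purely for illustration.
STRUCTURAL_OPS = ["split_file", "reorder_sections", "merge_rules"]
POLISH_OPS = ["tighten_wording", "fix_examples", "tune_thresholds"]
MATURITY_CUTOFF = 5  # generations before a variant counts as "mature"

def operators_for(variant_generation: int) -> list[str]:
    """Pick big structural mutations early, fine-grained polish later."""
    if variant_generation < MATURITY_CUTOFF:
        return STRUCTURAL_OPS
    return POLISH_OPS
```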
Meta-Mutation Trigger
If mutations keep getting rejected, the system suggests changing the mutation process itself. Evolution that evolves its own evolution.
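The trigger is essentially a rejection-streak counter. A sketch, with an assumed streak limit since the post doesn't give the exact number:

```python
REJECTION_STREAK_LIMIT = 5  # assumed; the post doesn't state the threshold

class MetaMutationTrigger:
    """Suggest mutating the mutation process after repeated rejections."""

    def __init__(self) -> None:
        self.streak = 0

    def record(self, accepted: bool) -> bool:
        """Return True when the streak says the operators themselves need changing."""
        self.streak = 0 if accepted else self.streak + 1
        return self.streak >= REJECTION_STREAK_LIMIT
```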
Goal-Drift Detection
Catches the #1 failure mode in self-improvement: overall score goes up while a specific capability quietly degrades. Now flagged and blocked.
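A minimal sketch of the cross-dimension check; the score-dictionary shape and the drift tolerance are illustrative assumptions:

```python
DRIFT_TOLERANCE = 0.02  # assumed slack before a dimension counts as degraded

def detect_goal_drift(before: dict, after: dict) -> list[str]:
    """Return dimensions that regressed even though the overall score improved."""
    if after["overall"] <= before["overall"]:
        return []  # overall didn't improve, so this isn't the drift pattern
    return [
        dim for dim in before
        if dim != "overall" and after[dim] < before[dim] - DRIFT_TOLERANCE
    ]

before = {"overall": 0.80, "accuracy": 0.85, "brevity": 0.75}
after = {"overall": 0.84, "accuracy": 0.78, "brevity": 0.90}
print(detect_goal_drift(before, after))  # ['accuracy']
```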
Held-Out Validation
20% of benchmarks are kept secret from the mutation process. Like a teacher keeping some exam questions hidden until test day — catches “studying the test” instead of learning.
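The split itself is one shuffle-and-cut; only the 20% figure comes from the post, everything else here is a sketch:

```python
import random

HELD_OUT_FRACTION = 0.20  # from the post: 20% of benchmarks stay hidden

def split_benchmarks(benchmarks, seed=0):
    """Partition benchmarks into a visible set and a held-out validation set."""
    rng = random.Random(seed)
    shuffled = list(benchmarks)
    rng.shuffle(shuffled)
    cut = max(1, int(len(shuffled) * HELD_OUT_FRACTION))
    return shuffled[cut:], shuffled[:cut]  # (visible, held_out)

visible, held_out = split_benchmarks([f"bench_{i}" for i in range(10)])
print(len(visible), len(held_out))  # 8 2
```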
Core insight: The biggest risk in self-improvement isn’t that the system gets worse — it’s that it gets better at the wrong thing. Goal-drift detection and held-out validation exist to catch exactly that.
How One-Shot Orchestra Ran This
Orchestra delegates each phase to a fresh worker with a clean context window. Seven terminals, one conductor, zero shared state between workers.
Seven fresh-spawn workers, each with a clean 1M-token context
Each step was a separate worker that started with zero knowledge of the others.
The orchestrator never touched a single skill file directly. It planned, delegated, read results, and decided.
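In pseudocode terms, the conductor pattern looks roughly like this; the worker-spawning call is a stub, since the post doesn't describe Orchestra's actual transport:

```python
def spawn_fresh_worker(phase: str, instructions: str) -> dict:
    """Stand-in for launching a clean-context worker and collecting its report."""
    return {"phase": phase, "report": f"(results of {phase} phase)"}

def conduct(phases: list[str]) -> list[dict]:
    """Plan, delegate, read each result, decide. No direct file edits."""
    reports = []
    for phase in phases:
        # Each worker starts fresh: no shared context, no carried-over state.
        reports.append(spawn_fresh_worker(phase, f"Run the {phase} phase"))
    return reports
```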
The 200-Line Problem
Evolution enforces a hard rule: no skill file may exceed 200 lines. It’s one of the engine’s own laws — and its main controller was at 196 lines. Four lines from breaking its own rule.
The audit needed to add content (7 missing commands, a new anti-gaming rule, updated CLI signatures). That meant trimming first.
Do: Compress “How to Parse a Request” from 5 lines to 1. Remove redundant flow descriptions. Make room before adding.
Don’t: Add the new content first and “figure out the line budget later.” The 200-line cap prevents monolith files that overflow the AI’s working memory.
Final count: 186 lines. Fourteen lines of headroom for future mutations.
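The cap itself is trivial to enforce mechanically. A sketch of a budget check, assuming the skill ships as a folder of markdown files; the 200-line figure is the engine's stated rule, the glob pattern an assumption:

```python
from pathlib import Path

LINE_CAP = 200  # the engine's own law: no skill file may exceed 200 lines

def over_budget(skill_dir: str) -> list[tuple[str, int]]:
    """Return (path, line_count) for every markdown file breaking the cap."""
    offenders = []
    for path in Path(skill_dir).glob("**/*.md"):
        count = len(path.read_text().splitlines())
        if count > LINE_CAP:
            offenders.append((str(path), count))
    return offenders
```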
v3.0.0 vs v3.1.0
Left: a crumbling tower of warning signs. Right: a clean modular grid, every block in place. That’s the difference between unaudited and audited.
| Capability | v3.0.0 | v3.1.0 |
|---|---|---|
| Init safety | None — overwrites silently | Idempotency guard + evo reinit |
| Mid-loop resilience | Crashes on pruned variant | Skips missing, logs, continues |
| Score anomaly detection | None | Temporal consistency + stress test |
| Overfitting prevention | Benchmark rotation only | Rotation + held-out validation |
| Goal-drift detection | None | Cross-dimension regression tracking |
| Operator selection | Static heuristic | Adaptive staging + meta-mutation |
| Variant cap | Unbounded | Soft cap at 15 |
| Documented commands | 11 | 18 |
| Runtime config seeding | Manual | Automatic (4 files) |
What We Didn’t Fix
Three issues were found but intentionally left. Transparency matters more than a clean sheet.
- Hardcoded canary names in an example: The rule says canary dimensions must come from the real rubric. The example uses placeholders. Low risk — it’s an example, not runtime code.
- Factor-attribution precision gap: A threshold defined as exactly 0.01 instead of a range. Pre-existing, not introduced by this audit.
- Implicit audit cadence in evo-loop: The loop delegates scoring to the main flow, which handles the counter. Not broken, just not obvious.
Why list unfixed issues? Because a post-mortem that claims 100% fix rate is either lying or didn’t look hard enough. The three items above are real — they’re just not worth the line-budget cost to fix right now.
Try the Evolution Engine
Point it at any Claude Code skill. It gets better every time you use it.
Get Godmode
Read: Evolution v3