The Audit: 16 Bugs in the Thing That Finds Bugs
🔍 What we did: Full audit of Evolution v3 — 14 files, ~2,100 lines of self-improvement protocol
🚨 What we found: 16 issues including 2 critical data-corruption risks
🧬 What we added: 6 research-backed improvements from AlphaEvolve, MAP-Elites, and RL evaluation literature
✅ Result: Evolution v3.1.0 — hardened, honest, and harder to game
Audit Stats
| Metric | Value |
|---|---|
| Tool | one-shot-orchestra |
| Files audited | 14 markdown files |
| Files modified | 10 / 14 |
| Issues found | 16 (2 critical, 5 high, 7 medium, 2 low) |
| Fresh-spawn workers | 7 |
| Composite score | 0.928 / 1.00 |
| Verdict | Shipped |
The Irony of a Self-Improving System That Can’t Audit Itself
Evolution is the skill that makes other skills better. It scores every session, tracks weaknesses over time, and proposes targeted mutations tested against benchmarks.
But who audits the auditor?
Think of it like this: You have a factory inspector who stamps “PASS” on every product. The inspector is thorough, well-trained, follows a checklist. But nobody has ever inspected the inspector’s checklist. That’s what v3.0.0 was — a powerful system that had never been turned on itself.
So we pointed One-Shot Orchestra — 7 fresh Claude workers, each with a clean million-token context — directly at Evolution’s own 14 files. Here’s what fell out.
The Two Critical Bugs
The image below tells the story. Left: a crumbling database — your scoring history silently overwritten. Right: a snapping chain — a running loop crashing mid-execution.
Both could silently corrupt your evolution data. Neither had ever fired in production — but both were one accidental command away.
1. evo init Had No Safety Net
Running evo init on an already-initialized skill would overwrite your scoring config, benchmarks, and audit history. No warning, no confirmation. Every session score you’d collected — gone.
The fix: An idempotency guard (a check that prevents duplicate operations). Now evo init refuses if the skill is already set up. You need evo reinit to deliberately start over.
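A minimal sketch of what such a guard can look like, in Python. The marker path, function names, and `force` flag are illustrative assumptions, not the skill's actual implementation:

```python
from pathlib import Path

# Hypothetical marker file; the real skill's layout may differ.
EVO_MARKER = ".evolution/config.json"

def evo_init(skill_dir: str, force: bool = False) -> None:
    """Initialize evolution state, refusing to clobber existing data."""
    marker = Path(skill_dir) / EVO_MARKER
    if marker.exists() and not force:
        raise SystemExit(
            f"{skill_dir} is already initialized. "
            "Run `evo reinit` to deliberately start over."
        )
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.write_text("{}")  # fresh scoring config, benchmarks, history

def evo_reinit(skill_dir: str) -> None:
    """Explicit, intentional re-initialization."""
    evo_init(skill_dir, force=True)
```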
2. Mid-Loop Variant Pruning Could Crash the Engine
If you pruned a variant while an evolution loop was running, the loop’s variant selector would try to sample from a variant that no longer existed. The result: a crash with no state saved.
The fix: The loop now checks that each variant’s directory still exists before sampling. Missing variants get logged and skipped instead of crashing the run.
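A sketch of that defensive sampling step, assuming the directory-per-variant layout the fix describes; the function names and logging shape are invented for illustration:

```python
import random
from pathlib import Path

def pick_variant(variant_names, variants_root: Path, log=print):
    """Sample a variant, skipping any whose directory was pruned mid-loop."""
    live = []
    for name in variant_names:
        if (variants_root / name).is_dir():
            live.append(name)
        else:
            log(f"variant '{name}' missing on disk; skipping")  # pruned mid-run
    if not live:
        raise RuntimeError("no surviving variants to sample from")
    return random.choice(live)
```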
All 16 Findings
| Severity | Count | Examples |
|---|---|---|
| CRITICAL | 2 | Init data destruction, mid-loop crash |
| HIGH | 5 | 7 undocumented commands, SKILL.md 4 lines from cap, held-out benchmark leak, template convergence naming collision |
| MEDIUM | 7 | “Exporter” typo, dead CLI flag, dangling reference, missing config seeding, goal-drift detection gap, staging boundary ambiguity, schema mismatch |
| LOW | 2 | No variant cap (could grow unbounded), less-useful behavioral descriptor |
Thirteen of the 16 findings were fixed in the same session; the remaining three were deliberately left unfixed (see “What We Didn’t Fix” below).
6 Research-Backed Additions
The audit wasn’t just bug-hunting. We researched what the latest evolutionary AI papers say about self-improvement. Six ideas made it into v3.1.0.
Below: a DNA helix made of code, with 6 glowing nodes — each one a new capability spliced into the engine’s genome.
Six research nodes spliced into the evolution engine’s DNA
Temporal Consistency
Flags sudden score jumps (>0.10 between sessions) as anomalies. Like a credit card fraud detector — if your score suddenly spikes, something suspicious happened.
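The rule is simple enough to sketch. Only the 0.10 threshold comes from the post; flagging drops as well as spikes is an assumption on my part:

```python
JUMP_THRESHOLD = 0.10  # from the rule: >0.10 between sessions is anomalous

def flag_score_anomalies(session_scores):
    """Return (session_index, delta) pairs where the score moved too fast.
    The post names spikes; drops are flagged too as a conservative choice."""
    anomalies = []
    for i in range(1, len(session_scores)):
        delta = session_scores[i] - session_scores[i - 1]
        if abs(delta) > JUMP_THRESHOLD:
            anomalies.append((i, round(delta, 3)))
    return anomalies

print(flag_score_anomalies([0.71, 0.73, 0.88, 0.87]))  # [(2, 0.15)]
```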
Evaluator Stress Test
Every 10 mutations, re-runs a benchmark with reworded prompts. If scores swing wildly, the scoring dimension is measuring phrasing — not quality.
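Roughly, the check might look like this; the 10-mutation cadence is from the post, while the swing tolerance and the shape of `score_fn` are assumptions:

```python
STRESS_EVERY = 10   # from the post: re-test every 10 mutations
SWING_LIMIT = 0.05  # assumed tolerance; the post doesn't state the real one

def stress_test(mutation_count, rewordings, score_fn):
    """Score semantically identical rewordings of one benchmark prompt.
    A wide spread means the dimension is grading phrasing, not quality."""
    if mutation_count % STRESS_EVERY != 0:
        return None  # not a stress-test checkpoint yet
    scores = [score_fn(prompt) for prompt in rewordings]
    spread = max(scores) - min(scores)
    return {"scores": scores, "spread": spread, "unstable": spread > SWING_LIMIT}
```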
Adaptive Staging
Early variants get big structural changes. Mature variants get polish. Like training for a sport — beginners learn fundamentals, pros refine technique.
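One plausible shape for this, with made-up operator pools and maturity cutoff; the skill's real staging rules aren't spelled out in the post:

```python
# Assumed operator pools and cutoff, purely for illustration.
STRUCTURAL_OPS = ["split_file", "reorder_sections", "merge_rules"]
POLISH_OPS = ["tighten_wording", "fix_examples", "tune_thresholds"]
MATURITY_CUTOFF = 5  # generations before a variant counts as "mature"

def operators_for(variant_generation: int) -> list[str]:
    """Pick big structural mutations early, fine-grained polish later."""
    if variant_generation < MATURITY_CUTOFF:
        return STRUCTURAL_OPS
    return POLISH_OPS
```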
Meta-Mutation Trigger
If mutations keep getting rejected, the system suggests changing the mutation process itself. Evolution that evolves its own evolution.
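The trigger is essentially a rejection-streak counter. A sketch, with an assumed streak limit since the post doesn't give the exact number:

```python
REJECTION_STREAK_LIMIT = 5  # assumed; the post doesn't state the threshold

class MetaMutationTrigger:
    """Suggest mutating the mutation process after repeated rejections."""

    def __init__(self) -> None:
        self.streak = 0

    def record(self, accepted: bool) -> bool:
        """Return True when the streak says the operators themselves need changing."""
        self.streak = 0 if accepted else self.streak + 1
        return self.streak >= REJECTION_STREAK_LIMIT
```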
Goal-Drift Detection
Catches the #1 failure mode in self-improvement: overall score goes up while a specific capability quietly degrades. Now flagged and blocked.
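A minimal sketch of the cross-dimension check; the score-dictionary shape and the drift tolerance are illustrative assumptions:

```python
DRIFT_TOLERANCE = 0.02  # assumed slack before a dimension counts as degraded

def detect_goal_drift(before: dict, after: dict) -> list[str]:
    """Return dimensions that regressed even though the overall score improved."""
    if after["overall"] <= before["overall"]:
        return []  # overall didn't improve, so this isn't the drift pattern
    return [
        dim for dim in before
        if dim != "overall" and after[dim] < before[dim] - DRIFT_TOLERANCE
    ]

before = {"overall": 0.80, "accuracy": 0.85, "brevity": 0.75}
after = {"overall": 0.84, "accuracy": 0.78, "brevity": 0.90}
print(detect_goal_drift(before, after))  # ['accuracy']
```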
Held-Out Validation
20% of benchmarks are kept secret from the mutation process. Like a teacher keeping some exam questions hidden until test day — catches “studying the test” instead of learning.
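The split itself is one shuffle-and-cut; only the 20% figure comes from the post, everything else here is a sketch:

```python
import random

HELD_OUT_FRACTION = 0.20  # from the post: 20% of benchmarks stay hidden

def split_benchmarks(benchmarks, seed=0):
    """Partition benchmarks into a visible set and a held-out validation set."""
    rng = random.Random(seed)
    shuffled = list(benchmarks)
    rng.shuffle(shuffled)
    cut = max(1, int(len(shuffled) * HELD_OUT_FRACTION))
    return shuffled[cut:], shuffled[:cut]  # (visible, held_out)

visible, held_out = split_benchmarks([f"bench_{i}" for i in range(10)])
print(len(visible), len(held_out))  # 8 2
```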
Core insight: The biggest risk in self-improvement isn’t that the system gets worse — it’s that it gets better at the wrong thing. Goal-drift detection and held-out validation exist to catch exactly that.
How One-Shot Orchestra Ran This
Orchestra delegates each phase to a fresh worker with a clean context window. Seven terminals, one conductor, zero shared state between workers.
Seven fresh-spawn workers, each with a clean 1M-token context
Each step was a separate worker that started with zero knowledge of the others.
The orchestrator never touched a single skill file directly. It planned, delegated, read results, and decided.
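In pseudocode terms, the conductor pattern looks roughly like this; the worker-spawning call is a stub, since the post doesn't describe Orchestra's actual transport:

```python
def spawn_fresh_worker(phase: str, instructions: str) -> dict:
    """Stand-in for launching a clean-context worker and collecting its report."""
    return {"phase": phase, "report": f"(results of {phase} phase)"}

def conduct(phases: list[str]) -> list[dict]:
    """Plan, delegate, read each result, decide. No direct file edits."""
    reports = []
    for phase in phases:
        # Each worker starts fresh: no shared context, no carried-over state.
        reports.append(spawn_fresh_worker(phase, f"Run the {phase} phase"))
    return reports
```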
The 200-Line Problem
Evolution enforces a hard rule: no skill file may exceed 200 lines. It’s one of the engine’s own laws — and its main controller was at 196 lines. Four lines from breaking its own rule.
The audit needed to add content (7 missing commands, a new anti-gaming rule, updated CLI signatures). That meant trimming first.
Do: Compress “How to Parse a Request” from 5 lines to 1. Remove redundant flow descriptions. Make room before adding.
Don’t: Add the new content first and “figure out the line budget later.” The 200-line cap prevents monolith files that overflow the AI’s working memory.
Final count: 186 lines. Fourteen lines of headroom for future mutations.
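The cap itself is trivial to enforce mechanically. A sketch of a budget check, assuming the skill ships as a folder of markdown files; the 200-line figure is the engine's stated rule, the glob pattern an assumption:

```python
from pathlib import Path

LINE_CAP = 200  # the engine's own law: no skill file may exceed 200 lines

def over_budget(skill_dir: str) -> list[tuple[str, int]]:
    """Return (path, line_count) for every markdown file breaking the cap."""
    offenders = []
    for path in Path(skill_dir).glob("**/*.md"):
        count = len(path.read_text().splitlines())
        if count > LINE_CAP:
            offenders.append((str(path), count))
    return offenders
```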
v3.0.0 vs v3.1.0
Left: a crumbling tower of warning signs. Right: a clean modular grid, every block in place. That’s the difference between unaudited and audited.
| Capability | v3.0.0 | v3.1.0 |
|---|---|---|
| Init safety | None — overwrites silently | Idempotency guard + evo reinit |
| Mid-loop resilience | Crashes on pruned variant | Skips missing, logs, continues |
| Score anomaly detection | None | Temporal consistency + stress test |
| Overfitting prevention | Benchmark rotation only | Rotation + held-out validation |
| Goal-drift detection | None | Cross-dimension regression tracking |
| Operator selection | Static heuristic | Adaptive staging + meta-mutation |
| Variant cap | Unbounded | Soft cap at 15 |
| Documented commands | 11 | 18 |
| Runtime config seeding | Manual | Automatic (4 files) |
What We Didn’t Fix
Three issues were found but intentionally left. Transparency matters more than a clean sheet.
- Hardcoded canary names in an example: The rule says canary dimensions must come from the real rubric. The example uses placeholders. Low risk — it’s an example, not runtime code.
- Factor-attribution precision gap: A threshold defined as exactly 0.01 instead of a range. Pre-existing, not introduced by this audit.
- Implicit audit cadence in evo-loop: The loop delegates scoring to the main flow, which handles the counter. Not broken, just not obvious.
Why list unfixed issues? Because a post-mortem that claims 100% fix rate is either lying or didn’t look hard enough. The three items above are real — they’re just not worth the line-budget cost to fix right now.
Try the Evolution Engine
Point it at any Claude Code skill. It gets better every time you use it.
Get Godmode
Read: Evolution v3