Vanilla Claude vs Our Best Skill. Here's What Actually Happened.
🏁 The race: Raw Claude Code vs a monolith skill vs a modular scripts skill — same 3 tasks, same rubric
💥 The shock: Our monolith skill produces fewer tests than vanilla Claude on large tasks
🏆 The winner: Modular scripts — 0.92 avg quality vs 0.77 (monolith) vs 0.55 (vanilla). Scripts won all 3 rounds.
[Chart: composite quality across 3 rounds for all 3 contestants; each block represents 1 quality point.]
The Question We Had to Answer
We'd been building execution skills for months. Godmode, Godmode+, One-Shot — layers of instructions that tell Claude how to build, test, and verify code.
But we'd never tested the obvious question: does any of this actually help?
Maybe raw Claude Code — no skill, no protocol, just "build this" — produces the same quality. Maybe our elaborate 8-phase execution protocol is just expensive overhead. We had to know.
The analogy: Think of it like cooking. We built an elaborate multi-step recipe (the skill). But what if a talented chef just winging it produces the same meal? The recipe only matters if the food is measurably better.
The Experiment
Three tasks. Three sizes. Three competitors. Same rubric every time.
Vanilla
Raw Claude Code. No skill loaded. Just the task prompt — "build X, include tests."
Monolith
One-Shot Beta: our 8-phase protocol in a single skill file. ~2,300 tokens of instructions.
Scripts
One-Shot Scripts: same protocol split into 16 files. Each phase is a standalone script. ~11,600 tokens total.
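To make the structural difference concrete, here's a minimal sketch of what a modular skill's phase manifest might look like. Everything here except phase-3-test.md (which appears later in this post) is a hypothetical name, not the skill's actual file list:

```typescript
// Illustrative manifest: one instruction file per phase, each with its
// own exit gate. Only phase-3-test.md is a real file name from this post;
// the rest are hypothetical stand-ins for the 16-file layout.
const phases = [
  { n: 1, file: "phase-1-plan.md",  gate: "scope agreed, plan written" },
  { n: 2, file: "phase-2-build.md", gate: "code compiles cleanly" },
  { n: 3, file: "phase-3-test.md",  gate: "tests written and passing" },
  // ...remaining phases, each a standalone file loaded when it starts
] as const;
```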
The tasks scaled in complexity:
| Round | Size | Task |
|---|---|---|
| 1 | Small | TypeScript parsing library — 3 functions with Result types and tests |
| 2 | Medium | Express rate limiter — two algorithms, mutex, pluggable store interface |
| 3 | Large | Markdown link checker CLI — glob patterns, YAML config, --fix mode |
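For a flavor of Round 1, here's roughly what "functions with Result types" means in TypeScript. This is our paraphrase of the pattern the task asked for, not any contestant's actual output:

```typescript
// A conventional Result type: errors are returned as values rather than thrown.
type Result<T, E = Error> =
  | { ok: true; value: T }
  | { ok: false; error: E };

// One of the three parsing functions might look like this (illustrative).
function parseInteger(input: string): Result<number> {
  const trimmed = input.trim();
  const n = Number(trimmed);
  return trimmed !== "" && Number.isInteger(n)
    ? { ok: true, value: n }
    : { ok: false, error: new Error(`not an integer: ${JSON.stringify(input)}`) };
}
```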
Each output was graded on 8 dimensions: code quality, test coverage, security, completeness, process, documentation, polish, and decision quality. Weighted. Blind where possible.
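The composite is a weighted sum of those dimension scores. A sketch of the calculation, with placeholder weights (we're not publishing the exact weighting here):

```typescript
// Each dimension is scored 0..1; the composite is a weighted sum.
// These weights are placeholders that sum to 1.0, not the real rubric.
type Scores = Record<string, number>;

const weights: Scores = {
  codeQuality: 0.20, testCoverage: 0.20, security: 0.15, completeness: 0.15,
  process: 0.10, documentation: 0.10, polish: 0.05, decisions: 0.05,
};

function composite(scores: Scores): number {
  return Object.entries(weights).reduce(
    (sum, [dim, w]) => sum + w * (scores[dim] ?? 0), 0);
}
```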
The Scoreboard
Here's every round, every score, every competitor.
Round 1 — Small
| Metric | Vanilla | Monolith | Scripts |
|---|---|---|---|
| Code Quality | 0.70 | 0.85 | 1.00 |
| Test Coverage | 0.55 | 0.85 | 1.00 |
| Security | 0.20 | 0.70 | 1.00 |
| Composite | 0.47 | 0.82 | 0.97 |
| Tests written | 17 | 36 | 75 |
| Time | 1m 47s | 2m 25s | 5m 41s |
Round 2 — Medium
| Metric | Vanilla | Monolith | Scripts |
|---|---|---|---|
| Code Quality | 0.80 | 0.85 | 0.90 |
| Test Coverage | 0.65 | 0.70 | 0.90 |
| Security | 0.35 | 0.70 | 0.85 |
| Composite | 0.59 | 0.80 | 0.89 |
| Tests written | 18 | 23 | 46 |
| Time | 2m 30s | 4m 37s | 7m 44s |
Round 3 — Large
| Metric | Vanilla | Monolith | Scripts |
|---|---|---|---|
| Code Quality | 0.82 | 0.85 | 0.92 |
| Test Coverage | 0.62 | 0.25 | 0.95 |
| Security | 0.25 | 0.70 | 0.85 |
| Composite | 0.59 | 0.70 | 0.91 |
| Tests written | 30 | 0 | 81 |
| Time | 4m 47s | 6m 49s | 9m 53s |
Read the Round 3 tests row again. The monolith — our carefully crafted 8-phase protocol that explicitly says "write tests" — produced zero unit tests. Vanilla Claude, with no instructions at all, wrote 30.
[Chart: tests written per round per contestant; the Round 3 Monolith bar collapses to zero, the failure mode in one picture.]
The Result Nobody Expected
Here are the averages across all three rounds:
| Metric | Vanilla | Monolith | Scripts |
|---|---|---|---|
| Composite | 0.55 | 0.77 | 0.92 |
| Avg tests written | 22 | 20 | 67 |
| Avg time | ~3m | ~4m 37s | ~7m 46s |
| Security avg | 0.27 | 0.70 | 0.90 |
| Wins | 0 | 0 | 3 / 3 |
Scripts won every single round. That part wasn't surprising — we'd already seen that in the A/B test.
What was surprising: vanilla Claude writes more tests on average than our monolith skill. 22 tests vs 20. On the large task, it's 30 vs zero.
Our skill's phase overhead — the scorecards, delivery reports, process labels — was consuming the "attention budget" that would otherwise go to writing tests. The model spent so much time following protocol that it forgot the most important part.
Think of it like a student. The monolith is a student who spends so long filling out the exam cover sheet — name, date, section, ID number, declaration — that they run out of time for the actual questions. Vanilla just starts answering.
Why Scripts Beat Everything
The monolith puts every instruction in one big file. By the time Claude is deep into building, the "write tests" instruction is paragraph 12 in a wall of text. Easy to skip.
The scripts model loads each phase from its own file, when that phase starts. Testing isn't buried — it's a standalone document with its own clear instructions and exit criteria.
📄 Monolith: all instructions load up front as one wall of text
↓
🔍 Phase 3 ("write tests") is paragraph 12 of 30
↓
❌ Model skips or minimises tests under context pressure
vs
📂 Scripts: Phase 3 loads fresh as its own file
↓
🎯 Model reads focused testing instructions with full attention
↓
✅ 67 tests avg, adversarial inputs included, every round
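In pseudo-TypeScript, the loading model is trivial; the leverage is entirely in when the instructions enter the context. Paths below are illustrative:

```typescript
import { readFile } from "node:fs/promises";

// Instead of one wall of text loaded up front, each phase's instructions
// are read only when that phase begins, so they land with full attention.
async function runPhase(
  n: number,
  execute: (instructions: string) => Promise<void>,
): Promise<void> {
  const instructions = await readFile(`phases/phase-${n}.md`, "utf8");
  await execute(instructions); // the model sees only this phase's rules
}
```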
Four structural advantages showed up in every round:
Can't Skip What's Explicit
A monolith says "write tests" somewhere in paragraph 12. A standalone file called phase-3-test.md is a gate you walk through. Scripts produced tests in 3 out of 3 rounds. Monolith: 2 out of 3.
Self-Assessment Stays Honest
The scoring script checks "do test files exist? how many pass?" The monolith scores on vibes. Monolith self-scored 0.92 on a zero-test output. Scripts was off by only 0.03 on average.
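The shape of that objective check, sketched. The file pattern and test command are assumptions, but the principle is the point: score testing on evidence that exists on disk, not on the model's self-report:

```typescript
import { readdirSync } from "node:fs";
import { execSync } from "node:child_process";

// Count test files on disk, then run the suite and read the exit code.
// Requires Node 18.17+ for recursive readdirSync.
function testEvidence(dir = "src"): { testFiles: number; suitePasses: boolean } {
  const testFiles = readdirSync(dir, { recursive: true })
    .filter((f) => /\.(test|spec)\.ts$/.test(String(f))).length;
  let suitePasses = false;
  try {
    execSync("npm test", { stdio: "ignore" }); // throws on non-zero exit
    suitePasses = true;
  } catch {
    // failing or absent suite: suitePasses stays false
  }
  return { testFiles, suitePasses };
}
```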
No Phase Blurring
The monolith combined "Harden" and "Document" into one step and skipped security review entirely. Scripts executes them separately because they are separate files.
Decisions Get Documented
Scripts delivery reports include explicit "DECISIONS MADE" with rationale. Monolith delivery reports list features but never explain why. Avg decision score: 0.96 vs 0.80.
How We Got Here: The Journey
The scripts model didn't appear from nowhere. It emerged from failures.
🧱 Built Godmode and Godmode+, our first layered execution skills
↓
🔄 Built the Evolution engine — skills that score and mutate themselves
↓
📝 Created One-Shot Beta (monolith) — 8 phases, self-assessment, zero-rework contract
↓
⚠ Noticed inconsistency — tests skipped, self-scores inflated, phases blurred
↓
✂ Split the monolith into 16 standalone scripts — same protocol, modular architecture
↓
🏁 A/B tested: Scripts won 3-0 across Small, Medium, Large
↓
🔍 Added vanilla baseline — discovered monolith doesn't even reliably beat raw Claude on tests
↓
✅ Conclusion: modular scripts is the architecture that works
The conventional wisdom was that splitting a skill into separate files would hurt performance. More handoffs, more file reads, more chances to lose context. That intuition was completely wrong.
What actually happens: each phase gets fresher, more focused context instead of competing with a wall of instructions the model has been staring at for 10 minutes.
Think of it like textbook chapters vs cramming everything on one page. A student who reads Chapter 3 before the Chapter 3 test performs better than a student who scanned 8 chapters at once and hopes they remember page 47.
What It Costs
Better quality isn't free. Here's the price tag at Opus API pricing ($15/$75 per million input/output tokens):
| Approach | Cost/Run | Quality | Cost per Quality Point |
|---|---|---|---|
| Vanilla | ~$5.25 | 0.55 | $9.55 |
| Monolith | ~$8.40 | 0.77 | $10.91 |
| Scripts | ~$16.50 | 0.92 | $17.93 |
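The last column is just cost per run divided by composite quality, if you want to sanity-check it:

```typescript
// Cost per quality point = cost per run / composite quality score.
const runs = [
  { name: "Vanilla",  cost: 5.25,  quality: 0.55 },
  { name: "Monolith", cost: 8.40,  quality: 0.77 },
  { name: "Scripts",  cost: 16.50, quality: 0.92 },
];
for (const { name, cost, quality } of runs) {
  console.log(`${name}: $${(cost / quality).toFixed(2)} per quality point`);
}
// Vanilla: $9.55, Monolith: $10.91, Scripts: $17.93
```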
Scripts costs 3x vanilla. But the quality jump is enormous — from "probably has bugs" to "tested, secure, documented, verified."
The real question isn't "is $16 too much?" It's "how much does rework cost?" If you ship vanilla output and spend 20 minutes manually adding tests and reviewing for security, the skill already paid for itself.
[Chart: cost rises left to right, quality rises upward; only Scripts crosses the 0.85 ship-ready line.]
Where Each Approach Wins
This isn't a simple "scripts is always best" story. Each approach has a use case.
Use Vanilla When...
You need speed. Prototyping. Throwaway code. Quick scripts. Anything where you'll review the output yourself and add tests manually. ~3 minutes, functional code, no ceremony.
Use Scripts When...
You're shipping production code. Need tests, security review, and documentation. Want confidence the output works without manual verification. ~8 minutes, verified code.
The monolith? Honestly, the data says it's the worst of both worlds. Slower than vanilla but doesn't reliably produce tests. More ceremony but less substance.
What Skills Actually Improve
Security review (+0.63 over vanilla). Decision documentation (+0.61). Process structure (+0.67). Testing — but ONLY when tests have their own dedicated phase file.
What Skills Don't Improve
Completeness (+0.06 — vanilla already handles requirements well). Code quality (+0.17 — Opus baseline is already decent). Speed (skills are always slower).
The Deeper Lesson: Architecture Matters More Than Content
The monolith and scripts contain identical instructions. Same phases. Same scoring rubric. Same delivery format. The only difference is how they're packaged.
That packaging difference produces a +0.15 composite quality gap. Consistently. Across all task sizes.
| Property | Monolith | Scripts |
|---|---|---|
| Total tokens | ~2,300 | ~11,600 |
| Files | 1 | 16 |
| Per-phase context | All at once | Loaded on demand |
| Expandable? | Near ceiling | 7-13x headroom |
| Composite avg | 0.77 | 0.92 |
| Self-score accuracy | Off by +0.15 | Off by +0.03 |
The scripts model also has 7-13x more headroom before hitting its ceiling. The monolith is already near its architectural limit — you can't cram much more into one file before instructions start getting ignored.
The lesson: When building AI skills, how you structure instructions matters more than what those instructions say. Modular files with focused context beat monolithic prompts with comprehensive context. Every time we tested.
What We Shipped
Based on these results, we've adopted the modular scripts architecture as the default for all Godmode execution skills. Here's what changed:
- One-Shot Scripts is now the primary execution skill. 16 files, ~11,600 tokens, 0.92 average composite.
- The monolith stays available as a fallback for edge cases where script loading might fail.
- The A/B grader skill we built to run this experiment is now a permanent tool for testing future skill changes.
- Every future skill improvement gets measured against vanilla baseline. If it doesn't beat raw Claude, it doesn't ship.
The irony isn't lost on us. We built a complex 8-phase protocol, discovered it was barely better than doing nothing on the dimension that matters most (testing), then fixed it by making the architecture more complex (16 files instead of 1). But it's the right kind of complexity — modular, focused, measurable.
We failed our way to something that genuinely outperforms vanilla Claude Code. That's the whole story.
Get the Skills That Beat Vanilla
One-Shot Scripts, Evolution, and the full Godmode protocol — tested, measured, proven.
Get Godmode · Read the A/B Test