Vanilla Claude vs Our Best Skill. Here's What Actually Happened.
🏁 The race: Raw Claude Code vs a monolith skill vs a modular scripts skill — same 3 tasks, same rubric
💥 The shock: Our monolith skill produces fewer tests than vanilla Claude on large tasks
🏆 The winner: Modular scripts — 0.92 avg quality vs 0.77 (monolith) vs 0.55 (vanilla). Scripts won all 3 rounds.
[Chart: composite quality across 3 rounds for all 3 contestants; each block represents 1 quality point.]
The Question We Had to Answer
We'd been building execution skills for months. Godmode, Godmode+, One-Shot — layers of instructions that tell Claude how to build, test, and verify code.
But we'd never tested the obvious question: does any of this actually help?
Maybe raw Claude Code — no skill, no protocol, just "build this" — produces the same quality. Maybe our elaborate 8-phase execution protocol is just expensive overhead. We had to know.
The analogy: Think of it like cooking. We built an elaborate multi-step recipe (the skill). But what if a talented chef just winging it produces the same meal? The recipe only matters if the food is measurably better.
The Experiment
Three tasks. Three sizes. Three competitors. Same rubric every time.
Vanilla
Raw Claude Code. No skill loaded. Just the task prompt — "build X, include tests."
Monolith
One-Shot Beta: our 8-phase protocol in a single skill file. ~2,300 tokens of instructions.
Scripts
One-Shot Scripts: same protocol split into 16 files. Each phase is a standalone script. ~11,600 tokens total.
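To make the structural difference concrete, here's a minimal sketch of what a modular skill's phase manifest might look like. Everything here except phase-3-test.md (which appears later in this post) is a hypothetical name, not the skill's actual file list:

```typescript
// Illustrative manifest: one instruction file per phase, each with its
// own exit gate. Only phase-3-test.md is a real file name from this post;
// the rest are hypothetical stand-ins for the 16-file layout.
const phases = [
  { n: 1, file: "phase-1-plan.md",  gate: "scope agreed, plan written" },
  { n: 2, file: "phase-2-build.md", gate: "code compiles cleanly" },
  { n: 3, file: "phase-3-test.md",  gate: "tests written and passing" },
  // ...remaining phases, each a standalone file loaded when it starts
] as const;
```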
The tasks scaled in complexity:
| Round | Size | Task |
|---|---|---|
| 1 | Small | TypeScript parsing library — 3 functions with Result types and tests |
| 2 | Medium | Express rate limiter — two algorithms, mutex, pluggable store interface |
| 3 | Large | Markdown link checker CLI — glob patterns, YAML config, --fix mode |
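For a flavor of Round 1, here's roughly what "functions with Result types" means in TypeScript. This is our paraphrase of the pattern the task asked for, not any contestant's actual output:

```typescript
// A conventional Result type: errors are returned as values rather than thrown.
type Result<T, E = Error> =
  | { ok: true; value: T }
  | { ok: false; error: E };

// One of the three parsing functions might look like this (illustrative).
function parseInteger(input: string): Result<number> {
  const trimmed = input.trim();
  const n = Number(trimmed);
  return trimmed !== "" && Number.isInteger(n)
    ? { ok: true, value: n }
    : { ok: false, error: new Error(`not an integer: ${JSON.stringify(input)}`) };
}
```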
Each output was graded on 8 dimensions: code quality, test coverage, security, completeness, process, documentation, polish, and decision quality. Weighted. Blind where possible.
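The composite is a weighted sum of those dimension scores. A sketch of the calculation, with placeholder weights (we're not publishing the exact weighting here):

```typescript
// Each dimension is scored 0..1; the composite is a weighted sum.
// These weights are placeholders that sum to 1.0, not the real rubric.
type Scores = Record<string, number>;

const weights: Scores = {
  codeQuality: 0.20, testCoverage: 0.20, security: 0.15, completeness: 0.15,
  process: 0.10, documentation: 0.10, polish: 0.05, decisions: 0.05,
};

function composite(scores: Scores): number {
  return Object.entries(weights).reduce(
    (sum, [dim, w]) => sum + w * (scores[dim] ?? 0), 0);
}
```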
The Scoreboard
Here's every round, every score, every competitor.
Round 1 — Small
| Metric | Vanilla | Monolith | Scripts |
|---|---|---|---|
| Code Quality | 0.70 | 0.85 | 1.00 |
| Test Coverage | 0.55 | 0.85 | 1.00 |
| Security | 0.20 | 0.70 | 1.00 |
| Composite | 0.47 | 0.82 | 0.97 |
| Tests written | 17 | 36 | 75 |
| Time | 1m 47s | 2m 25s | 5m 41s |
Round 2 — Medium
| Metric | Vanilla | Monolith | Scripts |
|---|---|---|---|
| Code Quality | 0.80 | 0.85 | 0.90 |
| Test Coverage | 0.65 | 0.70 | 0.90 |
| Security | 0.35 | 0.70 | 0.85 |
| Composite | 0.59 | 0.80 | 0.89 |
| Tests written | 18 | 23 | 46 |
| Time | 2m 30s | 4m 37s | 7m 44s |
Round 3 — Large
| Metric | Vanilla | Monolith | Scripts |
|---|---|---|---|
| Code Quality | 0.82 | 0.85 | 0.92 |
| Test Coverage | 0.62 | 0.25 | 0.95 |
| Security | 0.25 | 0.70 | 0.85 |
| Composite | 0.59 | 0.70 | 0.91 |
| Tests written | 30 | 0 | 81 |
| Time | 4m 47s | 6m 49s | 9m 53s |
Read the Round 3 tests row again. The monolith — our carefully crafted 8-phase protocol that explicitly says "write tests" — produced zero unit tests. Vanilla Claude, with no instructions at all, wrote 30.
[Chart: tests written per round per contestant; the Round 3 Monolith bar collapses to zero, the failure mode in one picture.]
The Result Nobody Expected
Here are the averages across all three rounds:
| Metric | Vanilla | Monolith | Scripts |
|---|---|---|---|
| Composite | 0.55 | 0.77 | 0.92 |
| Avg tests written | 22 | 20 | 67 |
| Avg time | ~3m | ~4m 37s | ~7m 46s |
| Security avg | 0.27 | 0.70 | 0.90 |
| Wins | 0 | 0 | 3 / 3 |
Scripts won every single round. That part wasn't surprising — we'd already seen that in the A/B test.
What was surprising: vanilla Claude writes more tests on average than our monolith skill. 22 tests vs 20. On the large task, it's 30 vs zero.
Our skill's phase overhead — the scorecards, delivery reports, process labels — was consuming the "attention budget" that would otherwise go to writing tests. The model spent so much time following protocol that it forgot the most important part.
Think of it like a student. The monolith is a student who spends so long filling out the exam cover sheet — name, date, section, ID number, declaration — that they run out of time for the actual questions. Vanilla just starts answering.
Why Scripts Beat Everything
The monolith puts every instruction in one big file. By the time Claude is deep into building, the "write tests" instruction is paragraph 12 in a wall of text. Easy to skip.
The scripts model loads each phase from its own file, when that phase starts. Testing isn't buried — it's a standalone document with its own clear instructions and exit criteria.
📄 Monolith: all instructions load up front as one wall of text
↓
🔍 Phase 3 ("write tests") is paragraph 12 of 30
↓
❌ Model skips or minimises tests under context pressure
vs
📂 Scripts: Phase 3 loads fresh as its own file
↓
🎯 Model reads focused testing instructions with full attention
↓
✅ 67 tests avg, adversarial inputs included, every round
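In pseudo-TypeScript, the loading model is trivial; the leverage is entirely in when the instructions enter the context. Paths below are illustrative:

```typescript
import { readFile } from "node:fs/promises";

// Instead of one wall of text loaded up front, each phase's instructions
// are read only when that phase begins, so they land with full attention.
async function runPhase(
  n: number,
  execute: (instructions: string) => Promise<void>,
): Promise<void> {
  const instructions = await readFile(`phases/phase-${n}.md`, "utf8");
  await execute(instructions); // the model sees only this phase's rules
}
```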
Four structural advantages showed up in every round:
Can't Skip What's Explicit
A monolith says "write tests" somewhere in paragraph 12. A standalone file called phase-3-test.md is a gate you walk through. Scripts produced tests in 3 out of 3 rounds. Monolith: 2 out of 3.
Self-Assessment Stays Honest
The scoring script checks "do test files exist? how many pass?" The monolith scores on vibes. Monolith self-scored 0.92 on a zero-test output. Scripts was off by only 0.03 on average.
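The shape of that objective check, sketched. The file pattern and test command are assumptions, but the principle is the point: score testing on evidence that exists on disk, not on the model's self-report:

```typescript
import { readdirSync } from "node:fs";
import { execSync } from "node:child_process";

// Count test files on disk, then run the suite and read the exit code.
// Requires Node 18.17+ for recursive readdirSync.
function testEvidence(dir = "src"): { testFiles: number; suitePasses: boolean } {
  const testFiles = readdirSync(dir, { recursive: true })
    .filter((f) => /\.(test|spec)\.ts$/.test(String(f))).length;
  let suitePasses = false;
  try {
    execSync("npm test", { stdio: "ignore" }); // throws on non-zero exit
    suitePasses = true;
  } catch {
    // failing or absent suite: suitePasses stays false
  }
  return { testFiles, suitePasses };
}
```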
No Phase Blurring
The monolith combined "Harden" and "Document" into one step and skipped security review entirely. Scripts executes them separately because they are separate files.
Decisions Get Documented
Scripts delivery reports include explicit "DECISIONS MADE" with rationale. Monolith delivery reports list features but never explain why. Avg decision score: 0.96 vs 0.80.
How We Got Here: The Journey
The scripts model didn't appear from nowhere. It emerged from failures.
🧱 Built Godmode and Godmode+, our first layered execution skills
↓
🔄 Built the Evolution engine — skills that score and mutate themselves
↓
📝 Created One-Shot Beta (monolith) — 8 phases, self-assessment, zero-rework contract
↓
⚠ Noticed inconsistency — tests skipped, self-scores inflated, phases blurred
↓
✂ Split the monolith into 16 standalone scripts — same protocol, modular architecture
↓
🏁 A/B tested: Scripts won 3-0 across Small, Medium, Large
↓
🔍 Added vanilla baseline — discovered monolith doesn't even reliably beat raw Claude on tests
↓
✅ Conclusion: modular scripts is the architecture that works
The conventional wisdom was that splitting a skill into separate files would hurt performance. More handoffs, more file reads, more chances to lose context. That intuition was completely wrong.
What actually happens: each phase gets fresher, more focused context instead of competing with a wall of instructions the model has been staring at for 10 minutes.
Think of it like textbook chapters vs cramming everything on one page. A student who reads Chapter 3 before the Chapter 3 test performs better than a student who scanned 8 chapters at once and hopes they remember page 47.
What It Costs
Better quality isn't free. Here's the price tag at Opus API pricing ($15/$75 per million input/output tokens):
| Approach | Cost/Run | Quality | Cost per Quality Point |
|---|---|---|---|
| Vanilla | ~$5.25 | 0.55 | $9.55 |
| Monolith | ~$8.40 | 0.77 | $10.91 |
| Scripts | ~$16.50 | 0.92 | $17.93 |
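The last column is just cost per run divided by composite quality, if you want to sanity-check it:

```typescript
// Cost per quality point = cost per run / composite quality score.
const runs = [
  { name: "Vanilla",  cost: 5.25,  quality: 0.55 },
  { name: "Monolith", cost: 8.40,  quality: 0.77 },
  { name: "Scripts",  cost: 16.50, quality: 0.92 },
];
for (const { name, cost, quality } of runs) {
  console.log(`${name}: $${(cost / quality).toFixed(2)} per quality point`);
}
// Vanilla: $9.55, Monolith: $10.91, Scripts: $17.93
```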
Scripts costs 3x vanilla. But the quality jump is enormous — from "probably has bugs" to "tested, secure, documented, verified."
The real question isn't "is $16 too much?" It's "how much does rework cost?" If you ship vanilla output and spend 20 minutes manually adding tests and reviewing for security, the skill already paid for itself.
[Chart: cost rises left to right, quality rises upward; only Scripts crosses the 0.85 ship-ready line.]
Where Each Approach Wins
This isn't a simple "scripts is always best" story. Each approach has a use case.
Use Vanilla When...
You need speed. Prototyping. Throwaway code. Quick scripts. Anything where you'll review the output yourself and add tests manually. ~3 minutes, functional code, no ceremony.
Use Scripts When...
You're shipping production code. Need tests, security review, and documentation. Want confidence the output works without manual verification. ~8 minutes, verified code.
The monolith? Honestly, the data says it's the worst of both worlds. Slower than vanilla but doesn't reliably produce tests. More ceremony but less substance.
What Skills Actually Improve
Security review (+0.63 over vanilla). Decision documentation (+0.61). Process structure (+0.67). Testing — but ONLY when tests have their own dedicated phase file.
What Skills Don't Improve
Completeness (+0.06 — vanilla already handles requirements well). Code quality (+0.17 — Opus baseline is already decent). Speed (skills are always slower).
The Deeper Lesson: Architecture Matters More Than Content
The monolith and scripts contain identical instructions. Same phases. Same scoring rubric. Same delivery format. The only difference is how they're packaged.
That packaging difference produces a +0.15 composite quality gap. Consistently. Across all task sizes.
| Property | Monolith | Scripts |
|---|---|---|
| Total tokens | ~2,300 | ~11,600 |
| Files | 1 | 16 |
| Per-phase context | All at once | Loaded on demand |
| Expandable? | Near ceiling | 7-13x headroom |
| Composite avg | 0.77 | 0.92 |
| Self-score accuracy | Off by +0.15 | Off by +0.03 |
The scripts model also has 7-13x more headroom before hitting its ceiling. The monolith is already near its architectural limit — you can't cram much more into one file before instructions start getting ignored.
The lesson: When building AI skills, how you structure instructions matters more than what those instructions say. Modular files with focused context beat monolithic prompts with comprehensive context. Every time we tested.
What We Shipped
Based on these results, we've adopted the modular scripts architecture as the default for all Godmode execution skills. Here's what changed:
- One-Shot Scripts is now the primary execution skill. 16 files, ~11,600 tokens, 0.92 average composite.
- The monolith stays available as a fallback for edge cases where script loading might fail.
- The A/B grader skill we built to run this experiment is now a permanent tool for testing future skill changes.
- Every future skill improvement gets measured against vanilla baseline. If it doesn't beat raw Claude, it doesn't ship.
The irony isn't lost on us. We built a complex 8-phase protocol, discovered it was barely better than doing nothing on the dimension that matters most (testing), then fixed it by making the architecture more complex (16 files instead of 1). But it's the right kind of complexity — modular, focused, measurable.
We failed our way to something that genuinely outperforms vanilla Claude Code. That's the whole story.
Get the Skills That Beat Vanilla
One-Shot Scripts, Evolution, and the full Godmode protocol — tested, measured, proven.
Get Godmode · Read the A/B Test