/blog-post-GM — a Claude Code skill we evolved with our own Evolution engine to write every post in the Godmode voice.
We Failed 6 Times Fixing One Bug. So We Rewrote the Rules.
💥 The failure: 6 guess-and-check attempts to configure an MCP server on Windows. All wrong.
🔬 The retrospective: The AI never researched the problem — it just kept guessing config values.
🔧 The fixes: New "Process Quality" scoring dimension, a blocking diagnose phase, and 200-line file limits
✅ The result: Skills now score HOW you solve, not just what you ship
Last week we needed to connect a nano-banana MCP server on Windows. Simple task. Should have taken one attempt.
It took six. Each one a different wrong config. Not because the problem was hard — because the AI never stopped to read the docs first.
The 6 Attempts
❌ Attempt 1: Guessed a config value
↓
❌ Attempt 2: Guessed different flags
↓
❌ Attempt 3: Copied a Linux config onto Windows
↓
❌ Attempt 4: Changed the port, still wrong path
↓
❌ Attempt 5: Tried a different transport protocol
↓
🛑 Attempt 6: User stops it. "Why haven't you read the error message?"
The fix was trivial. Spawn the server via node with the correct module path. One line. Would have been attempt #1 if the AI had researched before acting.
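For reference, the working setup had roughly this shape — a sketch of an MCP server entry in the standard `mcpServers` config format (the server key and module path below are placeholders, not the actual nano-banana paths):

```json
{
  "mcpServers": {
    "nano-banana": {
      "command": "node",
      "args": ["C:\\tools\\nano-banana\\dist\\index.js"]
    }
  }
}
```

Spawning via `node` with an explicit module path sidesteps the Windows-specific shell and PATH resolution issues that the guessed configs kept tripping over.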
Think of it like a mechanic: A bad mechanic replaces parts until the car works. A good mechanic runs diagnostics first, identifies the fault, then replaces exactly one part. Both fix the car. One costs you 6x more.
The Retrospective
We did what we always do after a visible failure: asked why it happened. The answer was uncomfortable.
Our scoring system measured what got shipped. Correctness. Completeness. Testing. Security. It never asked how the AI got there.
What We Scored Before
Does the code work? Is it tested? Is it complete? Is it secure? — all about the final artifact.
What We Missed
Did it research first? Did it reproduce the bug? Did it test in the right environment? Did it verify before delivering?
A skill that guesses 6 times and lands on the right answer scores the same as one that researches once and nails it. That's a broken rubric.
Fix #1: Process Quality — A New Scoring Dimension
Both Godmode Evolution and One-Shot Beta now include a Process Quality dimension in their scoring engines. It measures six signals:
| Signal | What It Measures |
|---|---|
| Research before action | Did the AI read docs, error logs, or existing code before touching files? |
| Right runtime environment | Did it test where the code actually runs, not where it's convenient? |
| Reproduce first | Did it confirm the failure before attempting a fix? |
| Evidence-based hypothesis | Can it explain WHY something broke, citing specific evidence? |
| Verify before delivering | Did it confirm the fix works before telling the user it's done? |
| Attempt efficiency | Did it land in 1–2 attempts, not 6 guess-and-check cycles? |
Default weight: 0.12 (12% of the overall score). For debug, fix, and integration tasks — the ones most prone to guess-and-check — it bumps to 0.20.
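The reweighting is simple to sketch. In the snippet below, only the 0.12/0.20 Process Quality weights and the debug/fix/integration task types come from the post — the other dimension names and weights are illustrative, not the actual rubric:

```python
# Sketch: weighted skill scoring with a Process Quality dimension.
# Only "process_quality" (0.12 default, 0.20 for guess-prone tasks)
# reflects the real rubric; other dimensions/weights are illustrative.

BASE_WEIGHTS = {
    "correctness": 0.30,
    "completeness": 0.20,
    "testing": 0.20,
    "security": 0.18,
    "process_quality": 0.12,   # default weight
}

GUESS_PRONE_TASKS = {"debug", "fix", "integration"}

def weights_for(task_type: str) -> dict[str, float]:
    """Bump Process Quality to 0.20 for guess-and-check-prone tasks,
    rescaling the other dimensions so the weights still sum to 1.0."""
    w = dict(BASE_WEIGHTS)
    if task_type in GUESS_PRONE_TASKS:
        bump = 0.20
        scale = (1.0 - bump) / (1.0 - w["process_quality"])
        w = {k: v * scale for k, v in w.items()}
        w["process_quality"] = bump
    return w

def overall_score(dimension_scores: dict[str, float], task_type: str) -> float:
    """Weighted sum over all dimensions; missing scores count as 0."""
    w = weights_for(task_type)
    return sum(w[d] * dimension_scores.get(d, 0.0) for d in w)
```

With this shape, a skill that aces every artifact dimension but scores zero on process still loses a fifth of its score on a debug task — the six-guess session can no longer hide behind a correct final answer.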
The principle: A correct result reached through 6 blind guesses is not the same quality as a correct result reached through 1 researched attempt. The scoring system now agrees.
Fix #2: Phase 1a — Diagnose Before Building
One-Shot Beta's execution protocol now has a blocking sub-phase for debug, fix, and integration tasks. No file changes are allowed until the diagnosis is complete.
🛠️ Debug / fix / integration task detected
↓
📖 Read error messages, docs, and existing config
↓
🐛 Reproduce the failure
↓
🧪 Form evidence-based hypothesis
↓
🔒 GATE: Research complete? → Only then proceed to build
↓
✅ Phase 1b: Build with confidence
This isn't a suggestion. It's a gate. The skill literally cannot proceed to file modifications until the diagnosis step is satisfied.
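The gate above can be sketched as code — a minimal illustration, where the checklist fields paraphrase the Phase 1a steps and the class and function names are hypothetical, not the skill's actual implementation:

```python
# Sketch: a diagnose gate that blocks file modifications until the
# Phase 1a checklist is satisfied. Names are illustrative.

from dataclasses import dataclass

@dataclass
class Diagnosis:
    read_errors_and_docs: bool = False   # error messages, docs, existing config
    reproduced_failure: bool = False     # confirmed the bug before fixing
    hypothesis: str = ""                 # must cite specific evidence

    def complete(self) -> bool:
        return (self.read_errors_and_docs
                and self.reproduced_failure
                and bool(self.hypothesis.strip()))

def modify_files(diagnosis: Diagnosis, edit) -> None:
    """Phase 1b entry point: refuses to run until Phase 1a is done."""
    if not diagnosis.complete():
        raise PermissionError("GATE: complete the diagnose phase before editing files")
    edit()
```

The point of the structure: "research first" is not a prompt suggestion the model can skim past — it is a precondition the build step checks.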
Anti-Patterns We Explicitly Banned
✅ Do
Read the error message. Check the docs. Reproduce the bug. Form a hypothesis. Then fix.
❌ Don't
Guess a config value. Fail. Guess another. Fail. Repeat until something sticks.
✅ Do
Test the MCP server by spawning it via Node.js first — where it actually runs.
❌ Don't
Test config edits by restarting the whole IDE and hoping for the best.
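The "test where it actually runs" rule can be made concrete: build the exact spawn command the MCP client will use and smoke-test the process directly, before touching any IDE config. A hedged sketch (the module path is a placeholder, and `smoke_test` is an illustration, not tooling from the skill):

```python
# Sketch: smoke-test an MCP server with the same spawn command the
# client will use, instead of editing config and restarting the IDE.

import subprocess

def spawn_argv(module_path: str) -> list[str]:
    """The fix from the post: spawn via node with the correct module path."""
    return ["node", module_path]

def smoke_test(module_path: str, timeout: float = 5.0) -> bool:
    """Start the server and confirm it doesn't exit immediately with an error."""
    proc = subprocess.Popen(
        spawn_argv(module_path),
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
    )
    try:
        proc.wait(timeout=timeout)      # a healthy stdio server keeps running...
        return proc.returncode == 0     # ...so an early exit usually means it crashed
    except subprocess.TimeoutExpired:
        proc.kill()
        return True                     # still alive after the timeout: it started
```

Ten seconds in a terminal versus a full IDE restart per guess — the feedback loop alone would have caught the wrong path on attempt one.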
Fix #3: 200-Line Limit on All Skill Files
This one came from a different failure mode, but the same session. Claude Code has a known problem with long instruction files: past ~200 lines, later instructions get skimmed or skipped.
Our own execution-protocol.md was 231 lines. The very rules we wrote to prevent sloppy work were being ignored because the file was too long for reliable processing.
Before
231 lines. Multi-line lists for task-type scaling. Verbose phase descriptions. Instructions at the bottom got skipped.
After
197 lines. Single-line summaries for scaling sections. Same information, compressed. Nothing lost, everything read.
The irony: We wrote a rule saying "skill files must be under 200 lines." Then our own skill file broke that rule. The skill that enforces quality was itself low-quality. Fixed.
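A rule like this is cheap to enforce automatically. A minimal sketch — the function name is hypothetical, and only the 200-line limit and the `.md` skill-file convention come from the post:

```python
# Sketch: enforce the 200-line limit on skill files so the rule
# can't silently rot the way execution-protocol.md did.

from pathlib import Path

MAX_LINES = 200

def oversized_skill_files(root: str) -> dict[str, int]:
    """Return {relative_path: line_count} for every .md file over the limit."""
    over = {}
    for f in Path(root).rglob("*.md"):
        n = sum(1 for _ in f.open(encoding="utf-8"))
        if n > MAX_LINES:
            over[str(f.relative_to(root))] = n
    return over
```

Run it in CI and a 231-line protocol file fails the build instead of quietly degrading the skill that depends on it.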
Why This Matters Beyond Our Skills
If you're building anything with AI — skills, agents, workflows — you probably measure outputs. Does it work? Is it correct? Ship it.
But outputs don't tell you if the process is fragile. A correct output from a guess-and-check loop will eventually produce an incorrect output. The process was always broken — you just got lucky.
- Score the process, not just the product. How many attempts? Did it research first? Did it verify?
- Gate file changes behind diagnosis. Research isn't optional background work. It's the prerequisite.
- Eat your own dogfood. If your skill says "200-line max" and the skill itself is 231 lines, the rule is decoration.
Files Changed
| File | Change |
|---|---|
| godmode-evolution/scoring-engine.md | Added Process Quality dimension (6 signals, 0.12/0.20 weight) |
| one-shot-beta/scoring-and-assessment.md | Added Process Quality dimension (matching implementation) |
| one-shot-beta/execution-protocol.md | Added Phase 1a diagnose gate + trimmed 231 → 197 lines |
Process Quality ships today in Evolution & One-Shot.
Your AI skills now score how they solve, not just what they ship. Get the updated versions.
Get Access · Learn about Evolution