Built by /blog-post-GM — a Claude Code skill we evolved with our own Evolution engine to write every post in the Godmode voice.
Post-Mortem

We Failed 6 Times Fixing One Bug. So We Rewrote the Rules.

TL;DR

💥 The failure: 6 guess-and-check attempts to configure an MCP server on Windows. All wrong.
🔬 The retrospective: The AI never researched the problem — it just kept guessing config values
🔧 The fixes: New "Process Quality" scoring dimension, a blocking diagnose phase, and 200-line file limits
The result: Skills now score HOW you solve, not just what you ship
[Figure: Diagnose before you build · Process Quality v1. Attempt-loop mode (skip diagnosis, guess until something sticks): six failed guesses, then the user stops it. Process-quality mode (read the error message, reproduce locally, form a hypothesis from logs, verify in the target runtime): lands on attempt #1. The guess-and-check run scores 0.50 on Process Quality.]

Last week we needed to connect a nano-banana MCP server on Windows. Simple task. Should have taken one attempt.

It took six. Each one a different wrong config. Not because the problem was hard — because the AI never stopped to read the docs first.

😬 The 6 Attempts

❌ Attempt 1: Guessed the binary path

❌ Attempt 2: Guessed different flags

❌ Attempt 3: Copied a Linux config onto Windows

❌ Attempt 4: Changed the port, still wrong path

❌ Attempt 5: Tried a different transport protocol

🛑 Attempt 6: User stops it. "Why haven't you read the error message?"

The fix was trivial. Spawn the server via node with the correct module path. One line. Would have been attempt #1 if the AI had researched before acting.
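For reference, this is the shape of the working config, using the standard mcpServers format. A minimal sketch only: the server name and module path below are placeholders, not our actual values.

```json
{
  "mcpServers": {
    "nano-banana": {
      "command": "node",
      "args": ["C:\\tools\\nano-banana\\dist\\index.js"]
    }
  }
}
```

Pointing command at node and passing the correct module path as an argument was the whole fix.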

Think of it like a mechanic: A bad mechanic replaces parts until the car works. A good mechanic runs diagnostics first, identifies the fault, then replaces exactly one part. Both fix the car. One costs you 6x more.

[Figure: Bad mechanic (swap parts until it runs, six parts deep) vs. good mechanic (read the OBD-II code, find the fault, replace exactly one part).]

Both mechanics fix the car. One bills $720; the other bills $120. The scoring system used to call them equal. Same outcome, 6× the cost, broken rubric.

🔬 The Retrospective

We did what we always do after a visible failure: asked why it happened. The answer was uncomfortable.

Our scoring system measured what got shipped. Correctness. Completeness. Testing. Security. It never asked how the AI got there.

📦 What We Scored Before

Does the code work? Is it tested? Is it complete? Is it secure? — all about the final artifact.

🔍 What We Missed

Did it research first? Did it reproduce the bug? Did it test in the right environment? Did it verify before delivering?

A skill that guesses 6 times and lands on the right answer scores the same as one that researches once and nails it. That's a broken rubric.

📊 Fix #1: Process Quality — A New Scoring Dimension

Both Godmode Evolution and One-Shot Beta now include a Process Quality dimension in their scoring engines. It measures six signals:

| Signal | What It Measures |
| --- | --- |
| Research before action | Did the AI read docs, error logs, or existing code before touching files? |
| Right runtime environment | Did it test where the code actually runs, not where it's convenient? |
| Reproduce first | Did it confirm the failure before attempting a fix? |
| Evidence-based hypothesis | Can it explain WHY something broke, citing specific evidence? |
| Verify before delivering | Did it confirm the fix works before telling the user it's done? |
| Attempt efficiency | 1–2 attempts, not 6 guess-and-check cycles |

Default weight: 0.12 (12% of the overall score). For debug, fix, and integration tasks — the ones most prone to guess-and-check — it bumps to 0.20.
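As a sketch of how that weighting plays out, assuming a simple weighted blend (the real rubric lives in the skills' markdown files; every name below is illustrative):

```typescript
// Hypothetical sketch of the Process Quality weighting described above.
// The actual scoring is defined in scoring-engine.md, not in code.
const DEBUG_LIKE_TASKS = new Set(["debug", "fix", "integration"]);

function processQualityWeight(taskType: string): number {
  // Guess-and-check-prone task types get the heavier 0.20 weight.
  return DEBUG_LIKE_TASKS.has(taskType) ? 0.2 : 0.12;
}

function overallScore(
  taskType: string,
  processQuality: number,
  otherDimensions: number // blended score of the artifact dimensions
): number {
  const w = processQualityWeight(taskType);
  return w * processQuality + (1 - w) * otherDimensions;
}

// The incident run: a debug task scoring 0.50 on Process Quality.
// Even with perfect artifact scores, six blind guesses cap it at 0.90.
console.log(overallScore("debug", 0.5, 1.0)); // 0.9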

[Figure: Process-quality radar. The bad run (6 attempts) vs. the good run (1 attempt) plotted across all six signals. Process Quality is a shape: a low Attempt-efficiency score drags down the whole dimension.]

The principle: A correct result reached through 6 blind guesses is not the same quality as a correct result reached through 1 researched attempt. The scoring system now agrees.

🚧 Fix #2: Phase 1a — Diagnose Before Building

One-Shot Beta's execution protocol now has a blocking sub-phase for debug, fix, and integration tasks. No file changes are allowed until the diagnosis is complete.

🔍 Phase 1a: Diagnose

📖 Read error messages, docs, and existing config

🐛 Reproduce the failure

🧪 Form evidence-based hypothesis

🔒 GATE: Research complete? → Only then proceed to build

✅ Phase 1b: Build with confidence

This isn't a suggestion. It's a gate. The skill literally cannot proceed to file modifications until the diagnosis step is satisfied.
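In pseudocode terms, the gate amounts to something like this. A sketch only: the actual gate is written as instructions in execution-protocol.md, and all names here are hypothetical.

```typescript
// Hypothetical sketch of the Phase 1a hard gate.
interface Diagnosis {
  errorMessagesRead: boolean; // read errors, docs, existing config
  failureReproduced: boolean; // confirmed the bug before fixing
  hypothesis: string | null;  // evidence-based explanation of WHY it broke
  evidence: string[];         // the logs/docs the hypothesis cites
}

function assertDiagnosisComplete(d: Diagnosis): void {
  const complete =
    d.errorMessagesRead &&
    d.failureReproduced &&
    d.hypothesis !== null &&
    d.evidence.length > 0;
  if (!complete) {
    // STATUS: LOCKED. No file modifications until every step passes.
    throw new Error("Phase 1a incomplete: file modifications blocked");
  }
}
```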

[Figure: The Phase 1a lock. Steps 1–4 (read errors, reproduce, hypothesis, evidence) must all pass before Phase 1b (ship code) unlocks. Status while incomplete: LOCKED, diagnosis incomplete, file modifications blocked. Phase 1a is a hard gate, not a recommendation.]

Anti-patterns we explicitly banned

✅ Do

Read the error message. Check the docs. Reproduce the bug. Form a hypothesis. Then fix.

❌ Don't

Guess a config value. Fail. Guess another. Fail. Repeat until something sticks.

✅ Do

Test the MCP server by spawning it via Node.js first — where it actually runs.

❌ Don't

Test config edits by restarting the whole IDE and hoping for the best.

✂️ Fix #3: 200-Line Limit on All Skill Files

This one came from a different failure mode, but the same session. Claude Code has a known problem with long instruction files: past ~200 lines, later instructions get skimmed or skipped.

Our own execution-protocol.md was 231 lines. The very rules we wrote to prevent sloppy work were being ignored because the file was too long for reliable processing.

📏 Before

231 lines. Multi-line lists for task-type scaling. Verbose phase descriptions. Instructions at the bottom got skipped.

📏 After

197 lines. Single-line summaries for scaling sections. Same information, compressed. Nothing lost, everything read.
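As an invented illustration of the kind of compression involved (not the actual diff from execution-protocol.md; the numbers are the weights from Fix #1):

```markdown
<!-- Before: multi-line task-type scaling (hypothetical excerpt) -->
## Process Quality weight by task type
- debug: 0.20
- fix: 0.20
- integration: 0.20
- everything else: 0.12

<!-- After: same rule, one line -->
Process Quality weight: 0.20 for debug/fix/integration, 0.12 otherwise.
```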

The irony: We wrote a rule saying "skill files must be under 200 lines." Then our own skill file broke that rule. The skill that enforces quality was itself low-quality. Fixed.

💡 Why This Matters Beyond Our Skills

If you're building anything with AI — skills, agents, workflows — you probably measure outputs. Does it work? Is it correct? Ship it.

But outputs don't tell you if the process is fragile. A correct output from a guess-and-check loop will eventually produce an incorrect output. The process was always broken — you just got lucky.

📁 Files Changed

| File | Change |
| --- | --- |
| godmode-evolution/scoring-engine.md | Added Process Quality dimension (6 signals, 0.12/0.20 weight) |
| one-shot-beta/scoring-and-assessment.md | Added Process Quality dimension (matching implementation) |
| one-shot-beta/execution-protocol.md | Added Phase 1a diagnose gate + trimmed 231 → 197 lines |

Process Quality ships today in Evolution & One-Shot.

Your AI skills now score how they solve, not just what they ship. Get the updated versions.
