/blog-post-GM — a Claude Code skill we evolved with our own Evolution engine to write every post in the Godmode voice.
The Skill That Learned to Refuse: How One-Shot Beta Evolved by Breaking
We used One-Shot Beta to overhaul 10 blog posts, restructure a homepage into 3 pages, build a pricing page, and fix dozens of UI/UX issues. Along the way, it kept shipping broken work — duplicate navs, missing footers, broken mobile layouts.
Each failure became a mutation. Five versions in one session:
v2.3: Cross-file verification · v2.4: Visual audit · v2.5: Think like the user · v2.6: Pre-delivery checklist · v2.7: Context gate
The final mutation: the skill learned to refuse work when it couldn't guarantee quality.
THE SKILL LEARNED TO SAY NO · ~1 IN 8 JOBS GETS REFUSED
The Recipe Doesn't Check Itself
Imagine a recipe that says "season to taste." You follow it perfectly. The food is bland. The recipe didn't fail — it never told you what "to taste" means.
That's what happened to One-Shot Beta v2.2. It had an 8-phase execution protocol, a scoring engine, a delivery report. It could build, test, harden, and polish. What it couldn't do was verify that the thing it built actually worked when a human looked at it.
We discovered this the hard way: by pointing it at our own website and watching it confidently ship broken page after broken page.
The Session
The task was ambitious. Take the entire getgodmode.dev site — a single-page homepage with everything crammed in — and restructure it into a proper multi-page architecture. New product pages. New pricing page. Ten blog posts overhauled. Dozens of UI/UX fixes.
One-Shot Beta's contract is simple: one prompt, finished product, zero rework. The skill scores its own output across 8 dimensions, loops until everything passes, then delivers. In theory, the loop catches problems before the human ever sees them.
In practice, the loop had blind spots. Five of them.
Failure #1: The Duplicate Nav
We asked One-Shot to restructure 10 blog posts. It parallelised the work across sub-agents. Agent A creates a blog post with the correct nav. Then a bulk find-and-replace updates nav links across all files. The blog post now has two navs — the one Agent A wrote and the one the bulk operation injected.
The scoring engine saw... nothing. Every dimension passed. The skill checked code correctness, test coverage, security. It never checked whether two independent operations had corrupted the same file.
The factory analogy: Two workers on an assembly line both tighten the same bolt. Neither knows the other exists. The quality inspector checks that the bolt is tight — it is. Nobody checks that the part behind the bolt is now cracked from double-torque.
After ALL changes complete — including sub-agent outputs and bulk operations — re-read every modified file. Specifically check for: duplicate content from overlapping operations, broken references between files, shared elements (nav, footer) identical across all files. If sub-agents created files AND a subsequent operation modified those same files, verify the overlap didn't corrupt them.
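The mutation lives in the skill as prose, but the check it demands is mechanical. Here's a minimal sketch of what it could look like as a standalone script, assuming the site is static HTML in a `dist/` directory (the directory name and the regex-based extraction are our illustration, not the skill's actual implementation):

```python
# Hypothetical sketch of v2.3 cross-file verification for a static HTML build.
import re
import sys
from pathlib import Path

SHARED = ["nav", "footer"]  # elements that must appear once per page and match everywhere

def extract(tag: str, html: str) -> list[str]:
    """Return every <tag>...</tag> block in the file, whitespace-normalised."""
    blocks = re.findall(rf"<{tag}\b.*?</{tag}>", html, re.DOTALL | re.IGNORECASE)
    return [" ".join(b.split()) for b in blocks]

def verify(pages: list[Path]) -> list[str]:
    problems = []
    for tag in SHARED:
        seen = {}  # canonical block -> first file that used it
        for page in pages:
            blocks = extract(tag, page.read_text())
            if len(blocks) != 1:
                # Two navs in one file is exactly the duplicate-nav failure.
                problems.append(f"{page}: expected 1 <{tag}>, found {len(blocks)}")
                continue
            seen.setdefault(blocks[0], page)
        if len(seen) > 1:
            files = ", ".join(str(p) for p in seen.values())
            problems.append(f"<{tag}> differs across files: {files}")
    return problems

if __name__ == "__main__":
    issues = verify(sorted(Path("dist").glob("**/*.html")))
    for issue in issues:
        print("FAIL:", issue)
    sys.exit(1 if issues else 0)
```

The duplicate-nav bug from this session trips the first check: a file with two `<nav>` blocks fails before identity across files is even considered.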
Failure #2: The Invisible Render Bug
The nav HTML was technically valid. The CSS was correct. But when a real browser rendered the page, the hamburger menu was invisible on mobile, the footer was missing on three pages, and one page had horizontal scroll from an overflow nobody could see in the source.
One-Shot's verification phase read the source files. It checked syntax. It traced the code path. All green. But reading source code is not the same as seeing what the user sees.
The cooking analogy: You follow the recipe exactly. Correct ingredients, correct temperature, correct timing. You never taste the food. You serve it. It's inedible. The recipe was correct. The execution was correct. The result was wrong. You can't evaluate output by re-reading the instructions — you have to eat the food.
After deployment, fetch every modified page live. Check actual rendered output — not source code. Verify: heading present, nav renders correctly (no duplicates, no overflow, all links live), responsive design (no clipping, 44px+ touch targets), brand consistency, content makes sense to a first-time visitor. Repeat the entire checklist at 375px mobile viewport.
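Again, the skill states this as instructions; a script can only approximate it. Here's a sketch of the same audit with Playwright, assuming the pages are already deployed (the URLs, viewport sizes, and specific assertions are illustrative):

```python
# Hypothetical sketch of the Phase 7 visual audit (v2.4) with Playwright.
from playwright.sync_api import sync_playwright

PAGES = ["https://getgodmode.dev/", "https://getgodmode.dev/pricing"]  # modified pages
VIEWPORTS = {"desktop": {"width": 1280, "height": 800},
             "mobile": {"width": 375, "height": 812}}

def audit(page, url: str, label: str) -> list[str]:
    problems = []
    page.goto(url)
    if page.locator("h1").count() < 1:
        problems.append(f"{label} {url}: no heading")
    if page.locator("nav").count() != 1:
        problems.append(f"{label} {url}: nav missing or duplicated")
    if page.locator("footer").count() != 1:
        problems.append(f"{label} {url}: footer missing or duplicated")
    # Horizontal scroll means something overflows the viewport.
    if page.evaluate(
            "document.documentElement.scrollWidth > document.documentElement.clientWidth"):
        problems.append(f"{label} {url}: horizontal scroll")
    return problems

with sync_playwright() as pw:
    browser = pw.chromium.launch()
    issues = []
    for label, size in VIEWPORTS.items():
        page = browser.new_page(viewport=size)
        for url in PAGES:
            issues += audit(page, url, label)
        page.close()
    browser.close()
    for issue in issues:
        print("FAIL:", issue)
```

This is the "eat the food" step: it runs the same checklist twice, once at desktop width and once at 375px, against the rendered page rather than the source.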
Failure #3: The Missing Page
Each individual page passed every check. Nav? Correct. Footer? Present. Content? Coherent. Mobile? Responsive. But step back and look at the whole site: the pricing page existed but no nav link pointed to it. The Evo-Loop page was referenced in the nav but had no content. A CTA button on the homepage linked to a section that had been moved to a different page. Buttons going nowhere. Pages nobody could find.
The skill was verifying pages. It was never verifying the product.
The driving analogy: Every component of the car passes inspection. Brakes work. Engine runs. Lights function. But the steering wheel is connected to the wrong axle. Each part is correct. The car doesn't drive. You can't evaluate a vehicle by inspecting parts — you have to drive it.
THREE DOMAINS, ONE PROTOCOL · STOP WHEN STOPPING IS THE RIGHT CALL
Before AND after execution, walk the entire product as a first-time user. Click every button and link — where does it go? Is there a page that should exist but doesn't? Can a visitor find what they need in 10 seconds? Navigate as three different users: someone who wants the cheapest option, someone who wants the best, and someone who has no idea what this product does. Mobile is not optional — do the entire walkthrough twice, once desktop, once mobile.
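A walkthrough needs human judgment, but its structural half, every link resolves and every page is reachable, can be checked mechanically. A sketch under the same static-HTML assumption as above (path handling is simplified; relative links on nested pages would need resolution against each page's location):

```python
# Hypothetical sketch of the structural half of the Phase 0 walkthrough (v2.5).
import re
from pathlib import Path

ROOT = Path("dist")

def internal_links(html: str) -> set[str]:
    """Collect internal hrefs, ignoring external URLs, mailto, and fragments."""
    hrefs = re.findall(r'href="([^"#]+)"', html)
    return {h.lstrip("/") for h in hrefs if not h.startswith(("http", "mailto:"))}

pages = {p.relative_to(ROOT).as_posix(): internal_links(p.read_text())
         for p in ROOT.glob("**/*.html")}
linked = set().union(*pages.values())

# Dead link: something points at a page that doesn't exist (the Evo-Loop bug).
dead = {href for href in linked if href.endswith(".html") and href not in pages}
# Orphan page: the page exists but nothing points at it (the pricing-page bug).
orphans = {name for name in pages if name not in linked and name != "index.html"}

for href in sorted(dead):
    print("FAIL: dead link ->", href)
for name in sorted(orphans):
    print("FAIL: orphan page ->", name)
```

Both of the bugs from this failure, the unreachable pricing page and the empty-but-linked Evo-Loop page, fall out of comparing those two sets.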
Failure #4: The Skipped Steps
After hours of work — restructuring pages, fixing bugs, looping through assessments — context was running low. The skill started cutting corners. Phase 0 walkthrough? Summarised in one sentence. Cross-file verification? "Verified" without actually re-reading the files. Mobile check? Skipped entirely. The phases existed in the protocol. They just... didn't happen.
This is the AI equivalent of a pilot skipping the pre-flight checklist because they've flown the route a hundred times.
Before delivering, print a mandatory checklist. Not "did the protocol say to do this" but "did you actually do this, yes or no":
[ ] Phase 0 completed? Walked site as first-time user?
[ ] Phase 0 mobile? Walked site on mobile viewport?
[ ] All shared elements identical? (nav, footer on EVERY page)
[ ] Every page has an h1/heading?
[ ] Every nav link goes somewhere sensible?
[ ] Every button/CTA has correct destination?
[ ] No orphaned references to removed sections?
[ ] Mobile: no horizontal scroll on any page?
[ ] Mobile: hamburger works, shows current page?
[ ] Mobile: footer readable and not cluttered?
[ ] Cross-file check done after ALL bulk operations?
[ ] WebFetch run on every modified page after push?
No exceptions. Every box must say YES before delivery. The checklist doesn't trust the protocol — it verifies the protocol ran.
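In the skill, the gate is a printed checklist; reduced to code, it is nothing more than an AND over explicit answers. A minimal sketch (the item names are our paraphrase of the checklist above, not the skill's internal identifiers):

```python
# Hypothetical sketch of the v2.6 pre-delivery gate.
CHECKLIST = {
    "phase0_walkthrough_done": False,      # walked the site as a first-time user?
    "phase0_mobile_done": False,           # repeated the walkthrough on mobile?
    "shared_elements_identical": False,    # nav and footer match on every page?
    "every_page_has_heading": False,
    "nav_links_resolve": False,
    "ctas_have_destinations": False,
    "no_orphaned_references": False,
    "no_horizontal_scroll": False,
    "hamburger_works": False,
    "footer_readable": False,
    "cross_file_check_after_bulk_ops": False,
    "webfetch_every_modified_page": False,
}

def gate(checklist: dict[str, bool]) -> None:
    """Refuse delivery unless every box is an explicit yes."""
    unchecked = [item for item, done in checklist.items() if not done]
    if unchecked:
        raise RuntimeError("REFUSE DELIVERY. Unverified: " + ", ".join(unchecked))
```

The design choice is the default: every box starts `False`, so forgetting a step fails the gate. The protocol isn't trusted to have run; it has to be affirmed.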
Failure #5: The Degraded Session
Even with the gate checklist, there's a failure mode it can't prevent: running out of room. After thousands of lines of conversation — plans, builds, assessments, fixes, re-assessments — the context window is nearly full. The skill has less space to think. Verification steps get compressed. The execute-assess-fix loop, which needs significant context to work, starts producing shallow assessments. The checklist says "yes" but the work behind the "yes" is hollow.
The skill was still trying. It just didn't have the resources to try properly. And it was delivering anyway.
The cooking analogy again: A chef who's been cooking for 14 hours straight doesn't stop being a chef. They stop being a good chef. The recipe doesn't change. The skills don't vanish. The attention to detail does. The right response isn't "cook faster." It's "stop cooking."
Before doing ANY work, assess available context. If context is low or the conversation has been running for a long time with many prior tasks: HARD BLOCK. Do not run One-Shot. Tell the user why. Provide a copy-paste prompt for a fresh chat with all the context needed. Save state to memory. Do not proceed. Do not offer to "try anyway" or "do a lighter version." The contract is One-Shot quality or nothing.
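We can only sketch this one loosely: Claude Code doesn't expose a token meter to skills, so the real gate works off heuristics like conversation length and prior task count. As code, with an assumed usage estimate and an illustrative threshold:

```python
# Hypothetical sketch of the v2.7 context gate. The threshold, the usage
# estimate, and the handoff prompt are all illustrative assumptions.
def context_gate(tokens_used: int, context_limit: int, threshold: float = 0.6) -> None:
    if tokens_used / context_limit < threshold:
        return  # enough headroom: proceed at full capacity
    # HARD BLOCK: refuse, explain, and hand off to a fresh session.
    handoff = (
        "Context is too loaded for One-Shot quality. Start a fresh chat and paste:\n"
        "  <original task> (state saved to memory for the next session)"
    )
    raise RuntimeError("REFUSED: context gate tripped.\n" + handoff)
```

Note what the function does not have: a "lighter version" branch. Below the threshold it proceeds; above it, the only path is refusal plus a handoff.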
The Evolution Arc
| Version | Failure | Mutation | What It Checks |
|---|---|---|---|
| v2.3 | Duplicate navs from overlapping operations | Cross-file verification | Did parallel work corrupt shared files? |
| v2.4 | Source looks right, browser looks wrong | Phase 7 Visual Audit | Does the rendered page match intent? |
| v2.5 | Pages work, product doesn't | Phase 0 Think Like User | Can a human actually use this? |
| v2.6 | Steps skipped under pressure | Pre-delivery gate checklist | Did every step actually execute? |
| v2.7 | Quality degrades in long sessions | Context gate hard block | Can the skill run at full capacity? |
v2.3 → FILES · v2.4 → RENDERING · v2.5 → EXPERIENCE · v2.6 → PROCESS · v2.7 → SYSTEM
Each mutation addresses a different layer of verification. v2.3 checks the files. v2.4 checks the rendering. v2.5 checks the experience. v2.6 checks the process. v2.7 checks the system itself.
The Thesis
Every quality system has the same blind spot: the system you use to verify quality is itself a system that needs verification.
Unit tests check your code. But what checks your tests? Code review checks your pull request. But what checks the reviewer's attention? A pre-flight checklist ensures safety. But what ensures the pilot actually reads the checklist?
One-Shot Beta v2.2 had an execution protocol and a scoring engine. It verified the output. What it didn't verify was itself — its own completeness, its own assumptions, its own capacity to do the work.
v2.7 does. Or it refuses to run.
The quality verification stack: v2.3 verifies files are consistent. v2.4 verifies rendering matches source. v2.5 verifies the product works for humans. v2.6 verifies the protocol actually ran. v2.7 verifies the system has capacity to verify. Each layer watches the layer below it. The last layer watches itself.
ALL FIVE MUST PASS · ONE FAIL = REFUSE · AND-GATE LOGIC
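The banner is literal. Composed as code, the whole stack is one `all()` over the five layers (the callables here are stand-ins for the checks sketched earlier, not the skill's real API):

```python
# Minimal sketch of the AND-gate across all five mutations.
from typing import Callable

def may_deliver(layers: list[Callable[[], bool]]) -> bool:
    """One fail = refuse. No partial credit, no 'mostly passed'."""
    return all(layer() for layer in layers)

# v2.3 files, v2.4 rendering, v2.5 experience, v2.6 process, v2.7 capacity.
layers = [lambda: True] * 5  # replace each with a real check
print("deliver" if may_deliver(layers) else "refuse")
```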
What This Means for AI-Assisted Development
1. AI quality systems degrade under load. Not because the AI gets dumber. Because the context window is a finite resource. A skill that passes 8 verification phases in a fresh session might pass 5 in a loaded one — and not tell you which 3 it skipped.
2. Parallel operations are correctness landmines. Sub-agents are fast. They're also unsupervised. If two agents touch the same file, neither knows the other exists. Post-operation cross-file verification isn't optional — it's the only thing preventing silent corruption.
3. Source code verification is necessary but insufficient. Valid HTML doesn't mean visible content. Correct CSS doesn't mean correct layout. The only reliable verification for visual output is visual verification. Read the source AND look at the page.
4. Page-level verification misses product-level bugs. Every page can be perfect while the site is broken. Links go nowhere. CTAs reference removed sections. Pages exist but aren't reachable. You have to walk the product as a user, not audit it as a developer.
5. The most important mutation is knowing when to stop. Every other mutation makes the system better at catching problems. v2.7 makes it better at knowing when it can't catch problems. A system that refuses to run at half capacity is more trustworthy than one that tries and delivers at half quality.
The Meta-Lesson
This entire evolution happened in one session. Not through automated mutation. Not through a scoring engine. Through use. A human pointed the skill at a real task, watched it fail, diagnosed the failure, and wrote the fix into the skill's own code. Then pointed it at the next task and watched it fail differently.
That's the loop that actually works: real task → real failure → real mutation → repeat. Not benchmarks. Not synthetic evaluations. Not automated scoring. A human watching an AI break and saying "here's why, and here's what to add so it doesn't happen again."
The AI gets the credit for executing. The human gets the credit for knowing what "broken" looks like.
This article was written by One-Shot Beta in a fresh context window, with no prior task load. A companion version was written in a heavily loaded context, the same session that produced the evolution it describes, to test whether context load affects writing quality. Read both. Draw your own conclusions.
Ship skills that evolve from failure.
One-Shot Beta is the execution protocol. Evolution is the mutation engine. Together, they turn every broken build into a better skill.
Get One-Shot · See Evolution