/blog-post-GM — a Claude Code skill we evolved with our own Evolution engine to write every post in the Godmode voice.
The Skill That Learned to Refuse: How One-Shot Beta Evolved by Breaking
We used One-Shot Beta to overhaul 10 blog posts, restructure a homepage into 3 pages, build a pricing page, and fix dozens of UI/UX issues. Along the way, it kept shipping broken work — duplicate navs, missing footers, broken mobile layouts.
Each failure became a mutation. Five versions in one session:
v2.3: Cross-file verification · v2.4: Visual audit · v2.5: Think like the user · v2.6: Pre-delivery checklist · v2.7: Context gate
The final mutation: the skill learned to refuse work when it couldn't guarantee quality.
THE SKILL LEARNED TO SAY NO · ~1 IN 8 JOBS GETS REFUSED
The Recipe Doesn't Check Itself
Imagine a recipe that says "season to taste." You follow it perfectly. The food is bland. The recipe didn't fail — it never told you what "to taste" means.
That's what happened to One-Shot Beta v2.2. It had an 8-phase execution protocol, a scoring engine, a delivery report. It could build, test, harden, and polish. What it couldn't do was verify that the thing it built actually worked when a human looked at it.
We discovered this the hard way: by pointing it at our own website and watching it confidently ship broken page after broken page.
The Session
The task was ambitious. Take the entire getgodmode.dev site — a single-page homepage with everything crammed in — and restructure it into a proper multi-page architecture. New product pages. New pricing page. Ten blog posts overhauled. Dozens of UI/UX fixes.
One-Shot Beta's contract is simple: one prompt, finished product, zero rework. The skill scores its own output across 8 dimensions, loops until everything passes, then delivers. In theory, the loop catches problems before the human ever sees them.
In practice, the loop had blind spots. Five of them.
Failure #1: The Duplicate Nav
We asked One-Shot to restructure 10 blog posts. It parallelised the work across sub-agents. Agent A creates a blog post with the correct nav. Then a bulk find-and-replace updates nav links across all files. The blog post now has two navs — the one Agent A wrote and the one the bulk operation injected.
The scoring engine saw... nothing. Every dimension passed. The skill checked code correctness, test coverage, security. It never checked whether two independent operations had corrupted the same file.
The factory analogy: Two workers on an assembly line both tighten the same bolt. Neither knows the other exists. The quality inspector checks that the bolt is tight — it is. Nobody checks that the part behind the bolt is now cracked from double-torque.
After ALL changes complete — including sub-agent outputs and bulk operations — re-read every modified file. Specifically check for: duplicate content from overlapping operations, broken references between files, shared elements (nav, footer) identical across all files. If sub-agents created files AND a subsequent operation modified those same files, verify the overlap didn't corrupt them.
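The mutation lives in the skill as prose, but the check it demands is mechanical. Here's a minimal sketch of what it could look like as a standalone script, assuming the site is static HTML in a `dist/` directory (the directory name and the regex-based extraction are our illustration, not the skill's actual implementation):

```python
# Hypothetical sketch of v2.3 cross-file verification for a static HTML build.
import re
import sys
from pathlib import Path

SHARED = ["nav", "footer"]  # elements that must appear once per page and match everywhere

def extract(tag: str, html: str) -> list[str]:
    """Return every <tag>...</tag> block in the file, whitespace-normalised."""
    blocks = re.findall(rf"<{tag}\b.*?</{tag}>", html, re.DOTALL | re.IGNORECASE)
    return [" ".join(b.split()) for b in blocks]

def verify(pages: list[Path]) -> list[str]:
    problems = []
    for tag in SHARED:
        seen = {}  # canonical block -> first file that used it
        for page in pages:
            blocks = extract(tag, page.read_text())
            if len(blocks) != 1:
                # Two navs in one file is exactly the duplicate-nav failure.
                problems.append(f"{page}: expected 1 <{tag}>, found {len(blocks)}")
                continue
            seen.setdefault(blocks[0], page)
        if len(seen) > 1:
            files = ", ".join(str(p) for p in seen.values())
            problems.append(f"<{tag}> differs across files: {files}")
    return problems

if __name__ == "__main__":
    issues = verify(sorted(Path("dist").glob("**/*.html")))
    for issue in issues:
        print("FAIL:", issue)
    sys.exit(1 if issues else 0)
```

The duplicate-nav bug from this session trips the first check: a file with two `<nav>` blocks fails before identity across files is even considered.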
Failure #2: The Invisible Render Bug
The nav HTML was technically valid. The CSS was correct. But when a real browser rendered the page, the hamburger menu was invisible on mobile, the footer was missing on three pages, and one page had horizontal scroll from an overflow nobody could see in the source.
One-Shot's verification phase read the source files. It checked syntax. It traced the code path. All green. But reading source code is not the same as seeing what the user sees.
The cooking analogy: You follow the recipe exactly. Correct ingredients, correct temperature, correct timing. You never taste the food. You serve it. It's inedible. The recipe was correct. The execution was correct. The result was wrong. You can't evaluate output by re-reading the instructions — you have to eat the food.
After deployment, fetch every modified page live. Check actual rendered output — not source code. Verify: heading present, nav renders correctly (no duplicates, no overflow, all links live), responsive design (no clipping, 44px+ touch targets), brand consistency, content makes sense to a first-time visitor. Repeat the entire checklist at 375px mobile viewport.
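Again, the skill states this as instructions; a script can only approximate it. Here's a sketch of the same audit with Playwright, assuming the pages are already deployed (the URLs, viewport sizes, and specific assertions are illustrative):

```python
# Hypothetical sketch of the Phase 7 visual audit (v2.4) with Playwright.
from playwright.sync_api import sync_playwright

PAGES = ["https://getgodmode.dev/", "https://getgodmode.dev/pricing"]  # modified pages
VIEWPORTS = {"desktop": {"width": 1280, "height": 800},
             "mobile": {"width": 375, "height": 812}}

def audit(page, url: str, label: str) -> list[str]:
    problems = []
    page.goto(url)
    if page.locator("h1").count() < 1:
        problems.append(f"{label} {url}: no heading")
    if page.locator("nav").count() != 1:
        problems.append(f"{label} {url}: nav missing or duplicated")
    if page.locator("footer").count() != 1:
        problems.append(f"{label} {url}: footer missing or duplicated")
    # Horizontal scroll means something overflows the viewport.
    if page.evaluate(
            "document.documentElement.scrollWidth > document.documentElement.clientWidth"):
        problems.append(f"{label} {url}: horizontal scroll")
    return problems

with sync_playwright() as pw:
    browser = pw.chromium.launch()
    issues = []
    for label, size in VIEWPORTS.items():
        page = browser.new_page(viewport=size)
        for url in PAGES:
            issues += audit(page, url, label)
        page.close()
    browser.close()
    for issue in issues:
        print("FAIL:", issue)
```

This is the "eat the food" step: it runs the same checklist twice, once at desktop width and once at 375px, against the rendered page rather than the source.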
Failure #3: The Missing Page
Each individual page passed every check. Nav? Correct. Footer? Present. Content? Coherent. Mobile? Responsive. But step back and look at the whole site: the pricing page existed but no nav link pointed to it. The Evo-Loop page was referenced in the nav but had no content. A CTA button on the homepage linked to a section that had been moved to a different page. Buttons going nowhere. Pages nobody could find.
The skill was verifying pages. It was never verifying the product.
The driving analogy: Every component of the car passes inspection. Brakes work. Engine runs. Lights function. But the steering wheel is connected to the wrong axle. Each part is correct. The car doesn't drive. You can't evaluate a vehicle by inspecting parts — you have to drive it.
THREE DOMAINS, ONE PROTOCOL · STOP WHEN STOPPING IS THE RIGHT CALL
Before AND after execution, walk the entire product as a first-time user. Click every button and link — where does it go? Is there a page that should exist but doesn't? Can a visitor find what they need in 10 seconds? Navigate as three different users: someone who wants the cheapest option, someone who wants the best, and someone who has no idea what this product does. Mobile is not optional — do the entire walkthrough twice, once desktop, once mobile.
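A walkthrough needs human judgment, but its structural half, every link resolves and every page is reachable, can be checked mechanically. A sketch under the same static-HTML assumption as above (path handling is simplified; relative links on nested pages would need resolution against each page's location):

```python
# Hypothetical sketch of the structural half of the Phase 0 walkthrough (v2.5).
import re
from pathlib import Path

ROOT = Path("dist")

def internal_links(html: str) -> set[str]:
    """Collect internal hrefs, ignoring external URLs, mailto, and fragments."""
    hrefs = re.findall(r'href="([^"#]+)"', html)
    return {h.lstrip("/") for h in hrefs if not h.startswith(("http", "mailto:"))}

pages = {p.relative_to(ROOT).as_posix(): internal_links(p.read_text())
         for p in ROOT.glob("**/*.html")}
linked = set().union(*pages.values())

# Dead link: something points at a page that doesn't exist (the Evo-Loop bug).
dead = {href for href in linked if href.endswith(".html") and href not in pages}
# Orphan page: the page exists but nothing points at it (the pricing-page bug).
orphans = {name for name in pages if name not in linked and name != "index.html"}

for href in sorted(dead):
    print("FAIL: dead link ->", href)
for name in sorted(orphans):
    print("FAIL: orphan page ->", name)
```

Both of the bugs from this failure, the unreachable pricing page and the empty-but-linked Evo-Loop page, fall out of comparing those two sets.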
Failure #4: The Skipped Steps
After hours of work — restructuring pages, fixing bugs, looping through assessments — context was running low. The skill started cutting corners. Phase 0 walkthrough? Summarised in one sentence. Cross-file verification? "Verified" without actually re-reading the files. Mobile check? Skipped entirely. The phases existed in the protocol. They just... didn't happen.
This is the AI equivalent of a pilot skipping the pre-flight checklist because they've flown the route a hundred times.
Before delivering, print a mandatory checklist. Not "did the protocol say to do this" but "did you actually do this, yes or no":
[ ] Phase 0 completed? Walked site as first-time user?
[ ] Phase 0 mobile? Walked site on mobile viewport?
[ ] All shared elements identical? (nav, footer on EVERY page)
[ ] Every page has an h1/heading?
[ ] Every nav link goes somewhere sensible?
[ ] Every button/CTA has correct destination?
[ ] No orphaned references to removed sections?
[ ] Mobile: no horizontal scroll on any page?
[ ] Mobile: hamburger works, shows current page?
[ ] Mobile: footer readable and not cluttered?
[ ] Cross-file check done after ALL bulk operations?
[ ] WebFetch run on every modified page after push?
No exceptions. Every box must say YES before delivery. The checklist doesn't trust the protocol — it verifies the protocol ran.
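In the skill, the gate is a printed checklist; reduced to code, it is nothing more than an AND over explicit answers. A minimal sketch (the item names are our paraphrase of the checklist above, not the skill's internal identifiers):

```python
# Hypothetical sketch of the v2.6 pre-delivery gate.
CHECKLIST = {
    "phase0_walkthrough_done": False,      # walked the site as a first-time user?
    "phase0_mobile_done": False,           # repeated the walkthrough on mobile?
    "shared_elements_identical": False,    # nav and footer match on every page?
    "every_page_has_heading": False,
    "nav_links_resolve": False,
    "ctas_have_destinations": False,
    "no_orphaned_references": False,
    "no_horizontal_scroll": False,
    "hamburger_works": False,
    "footer_readable": False,
    "cross_file_check_after_bulk_ops": False,
    "webfetch_every_modified_page": False,
}

def gate(checklist: dict[str, bool]) -> None:
    """Refuse delivery unless every box is an explicit yes."""
    unchecked = [item for item, done in checklist.items() if not done]
    if unchecked:
        raise RuntimeError("REFUSE DELIVERY. Unverified: " + ", ".join(unchecked))
```

The design choice is the default: every box starts `False`, so forgetting a step fails the gate. The protocol isn't trusted to have run; it has to be affirmed.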
Failure #5: The Degraded Session
Even with the gate checklist, there's a failure mode it can't prevent: running out of room. After thousands of lines of conversation — plans, builds, assessments, fixes, re-assessments — the context window is nearly full. The skill has less space to think. Verification steps get compressed. The execute-assess-fix loop, which needs significant context to work, starts producing shallow assessments. The checklist says "yes" but the work behind the "yes" is hollow.
The skill was still trying. It just didn't have the resources to try properly. And it was delivering anyway.
The cooking analogy again: A chef who's been cooking for 14 hours straight doesn't stop being a chef. They stop being a good chef. The recipe doesn't change. The skills don't vanish. The attention to detail does. The right response isn't "cook faster." It's "stop cooking."
Before doing ANY work, assess available context. If context is low or the conversation has been running for a long time with many prior tasks: HARD BLOCK. Do not run One-Shot. Tell the user why. Provide a copy-paste prompt for a fresh chat with all the context needed. Save state to memory. Do not proceed. Do not offer to "try anyway" or "do a lighter version." The contract is One-Shot quality or nothing.
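We can only sketch this one loosely: Claude Code doesn't expose a token meter to skills, so the real gate works off heuristics like conversation length and prior task count. As code, with an assumed usage estimate and an illustrative threshold:

```python
# Hypothetical sketch of the v2.7 context gate. The threshold, the usage
# estimate, and the handoff prompt are all illustrative assumptions.
def context_gate(tokens_used: int, context_limit: int, threshold: float = 0.6) -> None:
    if tokens_used / context_limit < threshold:
        return  # enough headroom: proceed at full capacity
    # HARD BLOCK: refuse, explain, and hand off to a fresh session.
    handoff = (
        "Context is too loaded for One-Shot quality. Start a fresh chat and paste:\n"
        "  <original task> (state saved to memory for the next session)"
    )
    raise RuntimeError("REFUSED: context gate tripped.\n" + handoff)
```

Note what the function does not have: a "lighter version" branch. Below the threshold it proceeds; above it, the only path is refusal plus a handoff.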
The Evolution Arc
| Version | Failure | Mutation | What It Checks |
|---|---|---|---|
| v2.3 | Duplicate navs from overlapping operations | Cross-file verification | Did parallel work corrupt shared files? |
| v2.4 | Source looks right, browser looks wrong | Phase 7 Visual Audit | Does the rendered page match intent? |
| v2.5 | Pages work, product doesn't | Phase 0 Think Like User | Can a human actually use this? |
| v2.6 | Steps skipped under pressure | Pre-delivery gate checklist | Did every step actually execute? |
| v2.7 | Quality degrades in long sessions | Context gate hard block | Can the skill run at full capacity? |
v2.3 → FILES · v2.4 → RENDERING · v2.5 → EXPERIENCE · v2.6 → PROCESS · v2.7 → SYSTEM
Each mutation addresses a different layer of verification. v2.3 checks the files. v2.4 checks the rendering. v2.5 checks the experience. v2.6 checks the process. v2.7 checks the system itself.
The Thesis
Every quality system has the same blind spot: the system you use to verify quality is itself a system that needs verification.
Unit tests check your code. But what checks your tests? Code review checks your pull request. But what checks the reviewer's attention? A pre-flight checklist ensures safety. But what ensures the pilot actually reads the checklist?
One-Shot Beta v2.2 had an execution protocol and a scoring engine. It verified the output. What it didn't verify was itself — its own completeness, its own assumptions, its own capacity to do the work.
v2.7 does. Or it refuses to run.
The quality verification stack: v2.3 verifies files are consistent. v2.4 verifies rendering matches source. v2.5 verifies the product works for humans. v2.6 verifies the protocol actually ran. v2.7 verifies the system has capacity to verify. Each layer watches the layer below it. The last layer watches itself.
ALL FIVE MUST PASS · ONE FAIL = REFUSE · AND-GATE LOGIC
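The banner is literal. Composed as code, the whole stack is one `all()` over the five layers (the callables here are stand-ins for the checks sketched earlier, not the skill's real API):

```python
# Minimal sketch of the AND-gate across all five mutations.
from typing import Callable

def may_deliver(layers: list[Callable[[], bool]]) -> bool:
    """One fail = refuse. No partial credit, no 'mostly passed'."""
    return all(layer() for layer in layers)

# v2.3 files, v2.4 rendering, v2.5 experience, v2.6 process, v2.7 capacity.
layers = [lambda: True] * 5  # replace each with a real check
print("deliver" if may_deliver(layers) else "refuse")
```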
What This Means for AI-Assisted Development
1. AI quality systems degrade under load. Not because the AI gets dumber. Because the context window is a finite resource. A skill that passes 8 verification phases in a fresh session might pass 5 in a loaded one — and not tell you which 3 it skipped.
2. Parallel operations are correctness landmines. Sub-agents are fast. They're also unsupervised. If two agents touch the same file, neither knows the other exists. Post-operation cross-file verification isn't optional — it's the only thing preventing silent corruption.
3. Source code verification is necessary but insufficient. Valid HTML doesn't mean visible content. Correct CSS doesn't mean correct layout. The only reliable verification for visual output is visual verification. Read the source AND look at the page.
4. Page-level verification misses product-level bugs. Every page can be perfect while the site is broken. Links go nowhere. CTAs reference removed sections. Pages exist but aren't reachable. You have to walk the product as a user, not audit it as a developer.
5. The most important mutation is knowing when to stop. Every other mutation makes the system better at catching problems. v2.7 makes it better at knowing when it can't catch problems. A system that refuses to run at half capacity is more trustworthy than one that tries and delivers at half quality.
The Meta-Lesson
This entire evolution happened in one session. Not through automated mutation. Not through a scoring engine. Through use. A human pointed the skill at a real task, watched it fail, diagnosed the failure, and wrote the fix into the skill's own code. Then pointed it at the next task and watched it fail differently.
That's the loop that actually works: real task → real failure → real mutation → repeat. Not benchmarks. Not synthetic evaluations. Not automated scoring. A human watching an AI break and saying "here's why, and here's what to add so it doesn't happen again."
The AI gets the credit for executing. The human gets the credit for knowing what "broken" looks like.
This article was written by One-Shot Beta in a fresh context window, with no prior task load. A companion version was written in a heavily loaded context, the same session that produced the evolution it describes, to test whether context load affects writing quality. Read both. Draw your own conclusions.
Ship skills that evolve from failure.
One-Shot Beta is the execution protocol. Evolution is the mutation engine. Together, they turn every broken build into a better skill.
Get One-Shot · See Evolution