Built by /blog-post-GM — a Claude Code skill we evolved with our own Evolution engine to write every post in the Godmode voice.
Experiment ⏱️ 5 min read

We Evolved One-Shot 5 Times in One Session. Here's What Broke.

🔬 LIVE EXPERIMENT: This post was written after hours of work in the same AI conversation — with depleted context. A companion post covers the same topic in a fresh session. Compare them to see what context fatigue does to AI output.

TL;DR

🛠️ What we did: Used One-Shot to overhaul 10 blogs, restructure a homepage into 3 pages, and create a pricing page
💥 What kept breaking: Duplicate navs, missing footers, broken mobile, no headings, buttons going nowhere
🧬 What we built: 5 mutations (v2.2 → v2.7) — each one a direct fix for a specific failure
💡 The insight: The biggest threat to AI quality isn't bad code — it's depleted context

We spent a full session using One-Shot Beta to rebuild getgodmode.dev. Overhauled every blog post. Split the homepage into three pages. Created a dedicated pricing page. Fixed dozens of UI bugs.

The skill kept failing. Not on the code — on everything around the code. And each failure taught us something the skill didn't know yet.

💥 The Five Failures

| Version | What Broke | What We Added |
| --- | --- | --- |
| v2.3 | Agents created files correctly → bulk sed corrupted them → duplicate nav links on 2 pages | Cross-file verification after bulk operations |
| v2.4 | Stale HUD sidebar, missing page headings, mobile nav overflow — invisible in source code review | Phase 7: Visual Audit (fetch live pages, check rendering) |
| v2.5 | "Access" button went to a non-existent anchor. No pricing page existed. Nobody asked "should this page exist?" | Phase 0: Think Like the User + mandatory mobile walkthrough |
| v2.6 | 22 pages had no footer. Hamburger UX was broken. Steps were being skipped under context pressure. | Pre-delivery gate checklist (12 items, must print before every delivery) |
| v2.7 | Quality degraded as the conversation got longer. More tasks = more skipped steps = more bugs. | Context Gate: hard block if context is too low. Refuse to run. |

🔍 The Pattern

Every failure followed the same shape: the skill had rules for writing code, but no rules for checking the result of the code.

⚙️ Code is correct in source

👁️ Nobody looks at the rendered page

📱 Nobody checks mobile

🚶 Nobody walks the site as a visitor

💩 User finds the bugs

Think of it like a restaurant: The chef follows the recipe perfectly. The food is cooked right. But nobody checks if the plate looks good, if the portion is right, if the table is set, or if the menu even lists the dish. The kitchen is flawless. The dining room is a mess.

🏗️ What Each Mutation Actually Does

v2.3 — Cross-File Verification

When you launch 3 agents to create 3 pages, then run a bulk find-and-replace across all files, the agent-created files get edited twice: once correctly by the agents, then again by a bulk operation that was written without them in mind. The agents did their job. The bulk operation did its job. The combination broke everything.

The fix: after ALL changes are complete, re-read every file that was touched by both an agent and a bulk operation. Check for duplicates.
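A minimal sketch of what that verification pass could look like. This is our illustration, not the skill's actual implementation; the duplicate-nav check stands in for whatever corruption a bulk edit might introduce.

```python
import re
from pathlib import Path
from collections import Counter

def duplicate_nav_links(html: str) -> list[str]:
    """Return hrefs that appear more than once inside <nav> blocks."""
    hrefs = []
    for nav in re.findall(r"<nav\b.*?</nav>", html, flags=re.S | re.I):
        hrefs += re.findall(r'href="([^"]+)"', nav)
    return [href for href, count in Counter(hrefs).items() if count > 1]

def verify_after_bulk_edit(paths: list[Path]) -> dict[str, list[str]]:
    """Re-read every file touched by both an agent and a bulk operation,
    and report any that now contain duplicated nav links."""
    report = {}
    for path in paths:
        dups = duplicate_nav_links(path.read_text(encoding="utf-8"))
        if dups:
            report[str(path)] = dups
    return report
```

The key design point is the timing: the check runs after all changes are complete, against the files as they exist on disk, not against what each writer believed it produced.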

v2.4 — Visual Audit

Source code review can't tell you that a page has no heading. The HTML is valid. The CSS is correct. But when you open it in a browser, there's no title — just a warning box floating in space.

The fix: Phase 7 fetches every modified page after deployment and checks 6 things — headings, nav, orphaned references, responsive design, brand consistency, and first-time visitor comprehension.
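A hedged sketch of a Phase 7-style audit. The four checks below are a subset of the six, chosen because they are mechanically testable; the real phase also covers judgment calls like brand consistency and first-time comprehension.

```python
import re
import urllib.request

def audit_html(html: str) -> list[str]:
    """Return rendering problems that source-level review tends to miss."""
    checks = [
        (r"<h1\b", "no <h1> page heading"),
        (r"<nav\b", "no <nav> element"),
        (r"<footer\b", "no <footer>"),
        (r'name="viewport"', "no responsive viewport meta tag"),
    ]
    return [message for pattern, message in checks
            if not re.search(pattern, html, re.I)]

def audit_page(url: str) -> list[str]:
    """Fetch the live, deployed page (not the source tree) and audit it."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return audit_html(resp.read().decode("utf-8", errors="replace"))
```

Fetching the deployed page matters: a valid template with a broken include produces valid source and a headless page, and only the rendered output shows it.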

v2.5 — Think Like the User

The "Access" button in the nav pointed to /#pricing. But the pricing section had been moved to its own page. The button went to an anchor that no longer existed. Nobody asked: "where does this button actually go?"

The fix: Phase 0 walks the entire site as a first-time visitor before building anything, and again after. "Is there a page missing? Does every button go somewhere useful? What would confuse someone?"
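The "does every button go somewhere useful?" question can be partly mechanized. This sketch (our illustration, assuming pages are available as path → HTML) would have caught the /#pricing failure: the link resolves to a page that exists but an anchor that doesn't.

```python
import re

def dead_links(pages: dict[str, str]) -> list[tuple[str, str]]:
    """pages maps a site path to its HTML. Returns (page, href) pairs whose
    internal link points to a missing page or a missing anchor id."""
    ids = {path: set(re.findall(r'id="([^"]+)"', html))
           for path, html in pages.items()}
    dead = []
    for path, html in pages.items():
        for target, frag in re.findall(r'href="(/[^"#]*)(#[^"]*)?"', html):
            if target not in pages:
                dead.append((path, target + frag))       # page doesn't exist
            elif frag and frag[1:] not in ids[target]:
                dead.append((path, target + frag))       # anchor doesn't exist
    return dead
```

The mechanical pass only answers "does every link resolve?"; the Phase 0 walkthrough still has to answer the harder questions, like whether a page should exist at all.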

v2.6 — Pre-Delivery Checklist

The skill had rules for footers, mobile, nav consistency. The rules existed. They were being skipped. Not intentionally — the context was so long that the instructions from the beginning of the conversation were effectively invisible.

Think of it like a pilot's preflight checklist: Experienced pilots don't skip the checklist because they know how to fly. They use it BECAUSE they know that memory under pressure is unreliable. The checklist doesn't teach — it forces verification.

The fix: a 12-item gate that must be printed and answered before every delivery. Not prose instructions that get skimmed — yes/no checkboxes that force engagement.
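In code form, the gate could be as simple as the sketch below. The items are paraphrased from this post's failures, not the skill's actual 12-item list; the point is the mechanism, where every item is printed and answered, and a single "no" blocks delivery.

```python
# Paraphrased gate items -- the real skill's 12-item list differs.
GATE = [
    "Every page has exactly one nav?",
    "Every page has a footer?",
    "Mobile nav opens and closes?",
    "Every button resolves to a real target?",
    "Headings render on every page?",
    "No orphaned anchor references?",
]

def run_gate(answers: dict[str, bool]) -> bool:
    """Print each item with its answer; allow delivery only if all pass.
    An unanswered item counts as a failure, never as a pass."""
    ok = True
    for item in GATE:
        passed = answers.get(item, False)
        print(f"[{'x' if passed else ' '}] {item}")
        ok = ok and passed
    return ok
```

Defaulting missing answers to False is the whole trick: skipping an item under context pressure fails the gate instead of silently passing it.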

v2.7 — Context Gate

This was the hardest one. After fixing the same category of bug three times, we asked: why does the quality keep dropping? The answer wasn't the skill. It was the conversation.

After hours of work, the skill's instructions are thousands of tokens away from the active work. The AI starts cutting corners — not maliciously, but because the relevant instructions have been pushed out of focus by the volume of prior work.

The fix: before doing any work, assess available context. If it's too low, hard block. Don't offer a lighter version. Don't try anyway. Save a memory entry, give the user an exact copy-paste prompt for a fresh chat, and refuse to proceed.
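A sketch of the gate's shape, assuming the caller can estimate remaining context tokens. The threshold and message wording here are invented for illustration; what matters is the behavior: no fallback, no lighter version, just a block plus a hand-off prompt.

```python
MIN_CONTEXT_TOKENS = 40_000  # assumed floor -- the skill's real number may differ

def context_gate(remaining_tokens: int, task: str) -> str:
    """Refuse to run when context is too low; hand back a fresh-chat prompt."""
    if remaining_tokens >= MIN_CONTEXT_TOKENS:
        return "PROCEED"
    return (
        "BLOCKED: context too low to run safely.\n"
        "Start a fresh chat and paste:\n"
        f"  Use One-Shot to: {task}"
    )
```

Note what the blocked branch does not do: it never degrades gracefully, because a degraded run is exactly the failure mode the gate exists to prevent.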

🔬 This Post Is the Proof

You're reading the degraded version right now. This post was written at the end of a session that overhauled 10 blog posts, created 3 new pages, fixed 30+ UI bugs, and evolved the skill through 5 versions.

A companion post covers the same topic — written in a fresh session with full context. Same skill. Same template. Same topic.

Compare them. The differences are the experiment.

😫 This Post (Degraded)

Written after 15+ tasks in the same conversation. Context heavily used. The skill warned against it. We did it anyway — to prove the point.

💪 Companion Post (Fresh)

Same topic, same skill, same template. Fresh conversation. Full context window. The v2.7 Context Gate would have sent you here.

💡 What This Means for AI-Assisted Work

If you're using AI for multi-step projects, context management is as important as prompt engineering. A perfect skill running on depleted context produces worse output than a mediocre skill on fresh context.

The thesis: The system you use to verify quality is itself a system that needs verification. And when that system is an AI with a finite context window, the biggest threat to quality isn't bad code — it's the conversation being too long.

📊 The Evolution Timeline

📦 v2.2 — Starting point. 7 phases. No visual audit.

🔄 v2.3 — +Cross-file verification

👁️ v2.4 — +Phase 7: Visual Audit (8 phases)

🚶 v2.5 — +Phase 0: Think Like the User + mobile mandatory

📋 v2.6 — +Pre-delivery gate checklist (12 items)

🛑 v2.7 — +Context Gate: refuse to run if context depleted

Five mutations. Each one a scar from a real failure. The skill at v2.7 is measurably better than v2.2 — not because we planned it, but because we used it hard enough to break it.

One-Shot Beta v2.7 ships with all five mutations.

Context Gate. Visual Audit. Think Like the User. Pre-delivery checklist. Cross-file verification. Every lesson from this session, baked in.

Get One-Shot · Read the fresh version