Deep Dive ⏱️ 5 min read

Same Approach, Harder — Why We Banned Retries That Don't Change Tactic

TL;DR

💥 The problem: One-Shot-Scripts looped on low scores — but every loop did the same thing, just harder.
🔧 The fix: Retries MUST log the approach tried and pick a structurally different one next loop.
The result: Loop 2 is now a genuinely new attempt, not a tantrum.
tactic-orbit-3d click a tactic — same one twice = blocked
RetryDiversityError Pick a different angle. Doing the same thing harder is not a tactic.

Tactics already tried

— ledger empty — pick a tactic to begin —
DIM TESTING · score 0.50
Reduced motion: the constellation is paused. The DIM at the centre is TESTING; six orbiting tactics are UNIT, FUZZ, INTEGRATION, E2E, PROPERTY, MUTATION. Click a tactic to log it — clicking the same tactic twice raises RetryDiversityError("Pick a different angle. Doing the same thing harder is not a tactic.") and the score climb is rejected.

THE GUARD CLAUSE THAT BANS ‘TRY HARDER’ — ENFORCED ON YOU

🔍 What We Caught

One-Shot-Scripts is our loop-until-perfect execution skill. It runs, scores, and if any dimension falls short, it retries — targeting only the weak dimension, not the whole run.

That sounded fine on paper. In practice, "retry the testing phase" kept meaning "write more tests of the same kind." Loop 1: unit tests for the happy path. Loop 2: more unit tests for the happy path. Score barely moves. Loop 3: even more.

Core insight: If loop 1 failed, the approach was wrong — not the effort level. Doing the same thing harder is a tantrum, not a retry.

🎓 The Student Analogy

Picture a student who fails a test by rereading their notes. They fail. What do they do?

The wrong move is "reread the notes again, but really concentrate this time." The right move is switching to practice problems, flashcards, or teaching the material to someone else. Different method, not more intensity of the same method.

The loop was making the first mistake. We needed it to make the second.

student-rereading-3d reread the notes — or switch method
Test score
0.22
Method REREADING NOTES
Reduced motion: imagine a student at a desk with a notebook open. Rereading the notebook keeps the test score flat at 0.22. Switching to flashcards, practice problems, or teaching someone else spawns a new study tool on the desk and pushes the gauge to 0.78, 0.86, or 0.92 respectively. Different method, not more intensity of the same method.

🔧 The New Rule

We added a mandatory approach-diversity step to loop-mechanics.md. Before any retry, the loop must:

📝 1. Log the approach just tried (one line)

🧠 2. Generate 2–3 structurally different alternatives

🎯 3. Pick the one most likely to fix the weak dimension

🚫 4. Forbidden: any approach already on the list

The list grows as the session progresses. If the loop catches itself reaching for a logged approach, it stops and picks something else.

⚖️ Real Diversity vs Fake Diversity

The whole rule falls apart if "different approach" is gamed into meaning "same approach with different wording." So we wrote explicit examples.

Real diversity

Testing weak? Switch from example-based unit tests to property-based fuzzing, or from unit tests to integration tests.

Fake diversity

"I wrote more tests." Same paradigm, same inputs, more volume. This is what we're trying to stop.

Real diversity

Execution weak? Swap the library, split one module into three, or move from sync to event-driven.

Fake diversity

"I refactored the naming." Cosmetic change that touches no behaviour and no architecture.

Real diversity

Visual weak? Swap flex for grid, change the component library, or rebuild the layout from scratch.

Fake diversity

"I tweaked the padding." The critic already listed the failure — padding wasn't it.

real-vs-fake-diversity-cube rotate a face vs re-skin a sticker

Pick a mode and tap the cube

REAL diversity = full face rotation (all 9 tiles move). FAKE diversity = swap one sticker's colour. Same cube, different change. One is structural; the other is cosmetic.

Mode REAL
Reduced motion: the cube has six faces labelled UNIT, INTEGRATION, FUZZ, E2E, PROPERTY, MUTATION. REAL mode rotates a whole face — nine tiles move together (e.g. switching example-based unit tests to property-based fuzzing). FAKE mode swaps one sticker's colour (e.g. tweaking padding). Anyone who has held a Rubik’s cube knows which feels like a real change.

🤔 Why Not Use Evolution?

Fair question — we have an evolution engine that mutates skills across sessions. Couldn't that solve this?

No — those are different problems. Evolution changes the skill itself over weeks of real use. This problem needs to be solved inside a single session, during the loop, before the user sees the output.

🧬

Evolution

Cross-session. Logs scores over many runs, mutates the skill file itself. Slow feedback, permanent changes.

🔄

Approach diversity

Intra-session. On retry, switch tactic. Fast feedback, local to the run. What we just added.

📜 The Bigger Cleanup

This sat on top of a week of anti-inflation work. The loop was cheap to fix once we'd already forced the scoring to be honest.

FixWhat it does
Earn-up from 0.5Every dimension starts at 0.5 and earns up with evidence — not starts at 1.0 and deducts down.
Evidence receiptsAny score above 0.70 needs a concrete receipt (test output, grep, screenshot). No receipt, no high score.
Adversarial critic passEvery scoring cycle spawns a critic subagent whose only job is to find what's wrong. Findings go in the scorecard.
First-loop capFirst-run composite is capped at 0.88 unless machine-checked evidence exists.
Approach diversityThis post. Retries must switch tactic, not double down.
anti-inflation-stack-3d balloon drifts up — lock the keystone to anchor it

Tap a tablet for its rule

Bottom four rules limit drift but the balloon still floats. The fifth tablet — APPROACH-DIVERSITY — is the keystone that yanks the self-flattery score back to the 0.88 ceiling.

Score 1.04 (drifting)
Reduced motion: a 3D stack of five tablets, bottom-up — EARN-UP-FROM-0.5, EVIDENCE-RECEIPTS, ADVERSARIAL-CRITIC, FIRST-LOOP-CAP-0.88, APPROACH-DIVERSITY. With only the bottom four locked, the “self-flattery” balloon drifts up past 1.04. Lock the fifth tablet (this post’s rule) and the balloon yanks down to the 0.88 ceiling. The stack is the visual: this post is the 5th rule.

Self-graded systems drift toward self-flattery. These rules are the guardrails that stop the drift.

Run the loop that won't lie to itself

One-Shot-Scripts ships with the Godmode tier. Pick up access on the pricing page, then install via the dashboard or the godmode-mcp CLI.

Get Godmode Open dashboard