Same Approach, Harder — Why We Banned Retries That Don't Change Tactic
💥 The problem: One-Shot-Scripts looped on low scores — but every loop did the same thing, just harder.
🔧 The fix: Retries MUST log the approach tried and pick a structurally different one next loop.
✅ The result: Loop 2 is now a genuinely new attempt, not a tantrum.
If a retry reaches for a tactic already on the "tactics already tried" list, the loop raises `RetryDiversityError("Pick a different angle. Doing the same thing harder is not a tactic.")` and the score climb is rejected. The guard clause bans "try harder", and it is enforced on you.
What We Caught
One-Shot-Scripts is our loop-until-perfect execution skill. It runs, scores, and if any dimension falls short, it retries — targeting only the weak dimension, not the whole run.
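The loop shape reads roughly like this. A minimal sketch, assuming a dict of per-dimension scores; the threshold, function names, and retry hooks are illustrative, not the skill's real API:

```python
# Hypothetical sketch of the score-and-retry loop. THRESHOLD and the
# callback signatures are illustrative assumptions, not the real skill file.
THRESHOLD = 0.85

def run_loop(run, score, retry_dimension, max_loops=3):
    result = run()
    for _ in range(max_loops):
        scores = score(result)                   # dimension -> score in [0, 1]
        weak = {d: s for d, s in scores.items() if s < THRESHOLD}
        if not weak:
            break                                # every dimension passed
        worst = min(weak, key=weak.get)
        result = retry_dimension(result, worst)  # retry only the weak dimension
    return result
```

Note that nothing in this shape forces the retry to do anything *different*; that gap is exactly what the rest of this post is about.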
That sounded fine on paper. In practice, "retry the testing phase" kept meaning "write more tests of the same kind." Loop 1: unit tests for the happy path. Loop 2: more unit tests for the happy path. Score barely moves. Loop 3: even more.
Core insight: If loop 1 failed, the approach was wrong — not the effort level. Doing the same thing harder is a tantrum, not a retry.
The Student Analogy
Picture a student who studies for a test by rereading their notes. They fail. What do they do?
The wrong move is "reread the notes again, but really concentrate this time." The right move is switching to practice problems, flashcards, or teaching the material to someone else. Different method, not more intensity of the same method.
The loop was making the first mistake. We needed it to make the second.
The New Rule
We added a mandatory approach-diversity step to loop-mechanics.md. Before any retry, the loop must:
📋 1. Log the approach just tried
↓
🧠 2. Generate 2–3 structurally different alternatives
↓
🎯 3. Pick the one most likely to fix the weak dimension
↓
🚫 4. Forbidden: any approach already on the list
The list grows as the session progresses. If the loop catches itself reaching for a logged approach, it stops and picks something else.
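In code, the guard is small. A sketch under assumptions: the real skill lives in a markdown file, so this Python version (including `plan_retry` and the module-level `tried_approaches` set) is an illustration of the rule, not the actual implementation:

```python
class RetryDiversityError(Exception):
    """Raised when a retry reaches for an approach already on the log."""

# Illustrative session-scoped log; in the real skill this lives in the loop's state.
tried_approaches: set[str] = set()

def plan_retry(candidates: list[str]) -> str:
    """Pick a structurally new approach; refuse anything already logged."""
    fresh = [c for c in candidates if c not in tried_approaches]
    if not fresh:
        raise RetryDiversityError(
            "Pick a different angle. Doing the same thing harder is not a tactic."
        )
    choice = fresh[0]  # in practice: the one most likely to fix the weak dimension
    tried_approaches.add(choice)
    return choice
```

The key property: the second call with the same candidate list can never return the same answer as the first.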
Real Diversity vs Fake Diversity
The whole rule falls apart if "different approach" is gamed into meaning "same approach with different wording." So we wrote explicit examples.
Real diversity
Testing weak? Switch from example-based unit tests to property-based fuzzing, or from unit tests to integration tests.
Fake diversity
"I wrote more tests." Same paradigm, same inputs, more volume. This is what we're trying to stop.
Real diversity
Execution weak? Swap the library, split one module into three, or move from sync to event-driven.
Fake diversity
"I refactored the naming." Cosmetic change that touches no behaviour and no architecture.
Real diversity
Visual weak? Swap flex for grid, change the component library, or rebuild the layout from scratch.
Fake diversity
"I tweaked the padding." The critic already listed the failure — padding wasn't it.
REAL diversity = full face rotation (all 9 tiles move). FAKE diversity = swap one sticker's colour. Same cube, different change. One is structural; the other is cosmetic.
Why Not Use Evolution?
Fair question — we have an evolution engine that mutates skills across sessions. Couldn't that solve this?
No — those are different problems. Evolution changes the skill itself over weeks of real use. This problem needs to be solved inside a single session, during the loop, before the user sees the output.
Evolution
Cross-session. Logs scores over many runs, mutates the skill file itself. Slow feedback, permanent changes.
Approach diversity
Intra-session. On retry, switch tactic. Fast feedback, local to the run. What we just added.
The Bigger Cleanup
This sat on top of a week of anti-inflation work. The loop was cheap to fix once we'd already forced the scoring to be honest.
| Fix | What it does |
|---|---|
| Earn-up from 0.5 | Every dimension starts at 0.5 and earns up with evidence — not starts at 1.0 and deducts down. |
| Evidence receipts | Any score above 0.70 needs a concrete receipt (test output, grep, screenshot). No receipt, no high score. |
| Adversarial critic pass | Every scoring cycle spawns a critic subagent whose only job is to find what's wrong. Findings go in the scorecard. |
| First-loop cap | First-run composite is capped at 0.88 unless machine-checked evidence exists. |
| Approach diversity | This post. Retries must switch tactic, not double down. |
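The scoring rules in the table compose into a small amount of arithmetic. A hedged sketch: the 0.5 floor, 0.70 receipt threshold, and 0.88 first-loop cap are from the post; the function names and shapes are assumptions:

```python
BASELINE = 0.5           # earn-up floor: every dimension starts here
RECEIPT_THRESHOLD = 0.70 # scores above this need a concrete receipt
FIRST_LOOP_CAP = 0.88    # first-run composite cap without machine evidence

def score_dimension(claimed: float, receipts: list[str]) -> float:
    """Earn-up scoring: start at 0.5; no receipt, no score above 0.70."""
    score = max(BASELINE, claimed)
    if not receipts:
        score = min(score, RECEIPT_THRESHOLD)
    return score

def composite(scores: dict[str, float], loop: int, machine_checked: bool) -> float:
    """Average the dimensions; cap the first loop unless machine-checked."""
    total = sum(scores.values()) / len(scores)
    if loop == 1 and not machine_checked:
        total = min(total, FIRST_LOOP_CAP)
    return total
```

So a dimension claiming 0.9 with no receipt lands at 0.70, and a receipt-free first run can never report better than 0.88, no matter how good it says it feels.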
The first four rules limit the drift, but self-graded scores still float upward. The fifth, approach diversity, is the keystone: it is what pulls the self-flattery back under the 0.88 first-loop ceiling.
Self-graded systems drift toward self-flattery. These rules are the guardrails that stop the drift.
Run the loop that won't lie to itself
One-Shot-Scripts ships with the Godmode tier. Pick up access on the pricing page, then install via the dashboard or the godmode-mcp CLI.