The 0.99 Bar Was Unreachable
🎯 The job: Inject animated CSS visuals into 40 static-only blog posts. Orchestra v2 ran the full 5-phase route on the first pass and shipped working HTML to disk
📉 The catch: Composite came in at 0.50, ship bar was 0.99. Loop-Judge ruled "loop" three times in a row; two of those loops were procedural no-ops chasing a gap the worker couldn't close
🔧 The fix: Lower the ship threshold from 0.99 to 0.90. A 0.99 bar on a subjective visual task is a permission slip to loop forever
The Job Was Done On Pass One
The task was simple to state: every blog post on this site should have an animated CSS visual that matches what the post is about. 40 of them were missing one. The other 7 already had something.
Orchestra v2 ran its full 5-phase route — Diagnose, Recon, Builder, Polish, Verifier — in one go. By the time scoring kicked in, the work on disk was correct: 40 posts modified, 7 untouched, every modified post carrying both an animated SVG figure AND a prefers-reduced-motion override for accessibility.
Core insight: The 5 phases are the route. Loops aren’t extra phases — they’re re-runs of the last two (Verifier and Polish) chasing a higher score on work that already exists.
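That route-plus-loops structure can be sketched in a few lines. The phase names come from the post; the runner and judge interfaces are assumptions, not Orchestra's actual API:

```python
# Hypothetical sketch of the Orchestra v2 route described above.
# Phase names are from the post; run_phase/judge signatures are assumed.

PHASES = ["Diagnose", "Recon", "Builder", "Polish", "Verifier"]
LOOPABLE = {"Polish", "Verifier"}  # loops only re-run the last two phases

def run_route(run_phase, judge, max_loops=3):
    """Run all five phases once, then loop Verifier/Polish on the judge's pick."""
    for phase in PHASES:
        run_phase(phase)
    for _ in range(max_loops):
        target = judge()  # judge picks a phase to re-run, or None to ship
        if target is None or target not in LOOPABLE:
            break
        run_phase(target)
```

This run matches the sketch exactly: five phases, then Verifier, Verifier, Polish.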
Three Loops, Two No-Ops, One Real Lift
The composite came in at 0.50. The ship bar was 0.99. That's nowhere near the bar, so the runner woke the Loop-Judge to decide whether to ship anyway or send the work back through.
Loop 1 — Verifier (no-op)
Judge picked Verifier, asked it to broaden visual coverage from 9 of 40 posts to all 40. Verifier re-ran, re-stated the original 9 archetypes, and wrote a byte-identical report.
Loop 2 — Verifier again (no-op, predicted)
Judge looked at the round-1 result, noticed it was a procedural no-op, and warned in its own verdict: “if attempt 3 also returns identical artifacts, that’s strong signal we’ve hit a protocol diminishing-returns wall.” Then it picked Verifier a second time anyway, hoping the sharper instructions would land. They didn’t. verify.md mtime never updated.
Loop 3 — Polish (real lift)
The third judge changed the target. Instead of Verifier, it pointed Polish at the weakest dim — ambition at 0.78 — and Polish actually delivered: terminal animation hard-cap dropped, dead code stripped, eight near-identical CSS rules collapsed into one selector, pyflakes silent.
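A cheap guard against no-op loops like rounds one and two is to hash the target artifact before and after each re-run, rather than eyeballing mtimes. This is a sketch of the idea, not what Orchestra actually does:

```python
import hashlib

def artifact_digest(path):
    """Content hash of an artifact; a byte-identical digest means the re-run was a no-op."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def rerun_was_noop(path, rerun):
    """Run the phase, then compare the artifact's content hash to its pre-run state."""
    before = artifact_digest(path)
    rerun()
    return artifact_digest(path) == before
```

Two identical digests in a row is exactly the "protocol diminishing-returns wall" signal the round-2 judge described, made mechanical.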
Real-world analogy: Olympic gymnastics judging where the scale tops out at 10.0 but every routine has a 0.3 deduction baked in for breathing. You can do a flawless routine and still score 9.7. The bar isn’t reachable; the judges are just calibrated to never give it.
What “Composite 0.50” Actually Means
The composite isn’t the average of the dim scores. It’s a weighted score with a first-loop cap at 0.80 and additional penalties for unverified work.
So the 0.50 number that triggered every loop wasn’t the dim scores being bad — those ranged 0.78 to 0.93. It was the rubric’s structural protections firing because verification only sampled 9 of 40 posts. Every loop would have to either close that gap (Verifier’s job) or move the dim scores up so the composite punched through the cap (Polish’s job).
| Layer | Number | Meaning |
|---|---|---|
| Raw composite | 0.50 | Weighted score with first-loop cap and unverified-work penalty |
| Cited dim range | 0.78 – 0.93 | What individual scorers actually awarded each dimension |
| Ship threshold (was) | 0.99 | Composite must clear this to auto-ship without judge involvement |
| Ship threshold (now) | 0.90 | Lowered after this run — reachable by good work, still rejects bad |
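The layering in the table can be written as a scoring function. The exact weights and penalty sizes aren't public, so the numbers here are illustrative, not Orchestra's real rubric:

```python
def composite(dim_scores, verified_fraction, first_loop=True,
              first_loop_cap=0.80, unverified_penalty=0.30):
    """Illustrative composite: average the dims, cap the first loop,
    then penalize in proportion to how much work went unverified.
    Cap and penalty values are assumptions."""
    raw = sum(dim_scores) / len(dim_scores)
    if first_loop:
        raw = min(raw, first_loop_cap)  # structural cap, not a quality judgment
    raw -= unverified_penalty * (1.0 - verified_fraction)
    return round(max(raw, 0.0), 2)
```

With dims in the cited 0.78 to 0.93 range but only 9 of 40 posts verified, a function shaped like this lands far below the dim scores themselves, which is the shape of the 0.50 result.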
Why 0.99 Doesn’t Work For Subjective Tasks
A 0.99 bar makes sense for objective tasks where every defect is a fact. Did the migration apply? Did the test pass? Did the binary compile? Yes-or-no, no-defects-allowed, ship-or-don’t.
It doesn’t work for visual tasks where every scorer is allowed to invent objections. “The terminal archetype animates in stills but the keyframe sweep could be smoother.” “Library was lifted not designed.” “Single visual language across all 40 posts.” All true. None of them are bugs.
Use 0.99 when
The task has a binary pass/fail at the end. Tests pass or they don’t. Migration ships or it doesn’t. The scorers can only ding you for facts.
Don’t use 0.99 when
The task is subjective. Visual quality, ambition, polish on a generated artifact — the scorers will always find something. The loop will run forever.
The Bug We Patched Mid-Run
While the loops were running, we noticed chat.md was missing most of the worker signoffs. Workers were being killed by the reaper before they finished writing their signoff posts.
The cause was step ordering inside the spawn template. Workers were instructed to write result.json first, then their chat signoff. But writing result.json is what triggers the reaper to kill them — so the chat post never landed.
BEFORE
Step 4: Write result.json (triggers the reaper)
↓
Step 5: Post signoff to chat.md (never executes)

AFTER
Step 4: Post signoff to chat.md (lands first)
↓
Step 5: Write result.json last (reaper fires after the signoff is on disk)
Patched the template, bumped the skill version, and re-spawned the next worker with the fix. Earlier workers’ signoffs got reconstructed from their result.json files via a small backfill script.
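The backfill mentioned above can be sketched like this. The result.json field names and chat.md signoff format are assumptions, since the post doesn't show either:

```python
import json
from pathlib import Path

def backfill_signoffs(results_dir, chat_path):
    """Reconstruct missing chat.md signoffs from each worker's result.json.
    Field names ('worker', 'summary') are hypothetical."""
    chat = Path(chat_path)
    existing = chat.read_text() if chat.exists() else ""
    lines = []
    for result in sorted(Path(results_dir).glob("*/result.json")):
        data = json.loads(result.read_text())
        worker = data.get("worker", result.parent.name)
        if worker in existing:
            continue  # this signoff landed before the reaper fired
        lines.append(f"[{worker}] signoff (backfilled): {data.get('summary', 'n/a')}")
    with chat.open("a") as f:
        for line in lines:
            f.write(line + "\n")
    return len(lines)
```

Appending rather than rewriting keeps any signoffs that did land intact, so the script is safe to re-run.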
What Actually Shipped
| Metric | Value |
|---|---|
| Tool | one-shot-orchestra-v2 (v0.5.1) |
| Phases run | Diagnose, Recon, Builder, Polish, Verifier (full route, then 3 loops) |
| Judge loops | 3 (Verifier × 2, Polish × 1) |
| Composite (final) | 0.50 raw · cited dims 0.78–0.93 |
| Files modified | 40 blog posts, +7,071 / -40 lines |
| Files skipped (correctly) | 7 already-rich posts + index + 15 art gallery pages |
| Accessibility coverage | 100% — every modified post has prefers-reduced-motion |
| Verdict | User-initiated ship after 3 loops + threshold lowered to 0.90 |
The Lesson For Anyone Else Wiring A Scoring Loop
The loop is only as useful as the threshold makes it. Set the bar where good work clears and bad work doesn’t — not where perfect work clears, because perfect work doesn’t exist on a subjective task.
Auto-ship at 0.90. Human-judge at 0.50–0.89. Auto-loop below 0.50. The middle band is where the judge does its real job: deciding whether the gaps are worth closing or whether the work is genuinely shipped.
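The three bands can be written down directly. The thresholds are from the post; the function name and return labels are mine:

```python
def ship_decision(composite, ship_bar=0.90, loop_floor=0.50):
    """Route a scored run: auto-ship good work, auto-loop bad work,
    and hand the middle band to the judge."""
    if composite >= ship_bar:
        return "auto-ship"
    if composite < loop_floor:
        return "auto-loop"
    return "human-judge"
```

Under the old 0.99 bar, this run's 0.50 would have sat in judge territory forever; under 0.90, good work clears on its own and the judge only sees genuinely ambiguous runs.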
The 0.99 bar didn’t make the work better. It just made the run longer.
Run Orchestra v2 On Your Next Build
Lean session, fresh workers, automatic scoring loop — now with a ship bar that good work can actually clear.
Get Godmode · How Orchestra v2 Works