The 0.99 Bar Was Unreachable
🎯 The job: Inject animated CSS visuals into 40 static-only blog posts. Orchestra v2 ran the full 5-phase route on the first pass and shipped working HTML to disk
📉 The catch: Composite came in at 0.50, ship bar was 0.99. Loop-Judge ruled "loop" three times in a row; two of those loops were procedural no-ops chasing a gap the worker couldn't close
🔧 The fix: Lower the ship threshold from 0.99 to 0.90. A 0.99 bar on a subjective visual task is a permission slip to loop forever
The Job Was Done On Pass One
The task was simple to state: every blog post on this site should have an animated CSS visual that matches what the post is about. 40 of them were missing one. The other 7 already had something.
Orchestra v2 ran its full 5-phase route — Diagnose, Recon, Builder, Polish, Verifier — in one go. By the time scoring kicked in, the work on disk was correct: 40 posts modified, 7 untouched, every modified post carrying both an animated SVG figure AND a prefers-reduced-motion override for accessibility.
Core insight: The 5 phases are the route. Loops aren’t extra phases — they’re re-runs of the last two (Verifier and Polish) chasing a higher score on work that already exists.
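That route-plus-loops structure can be sketched in a few lines. The phase names come from the post; the runner and judge interfaces are assumptions, not Orchestra's actual API:

```python
# Hypothetical sketch of the Orchestra v2 route described above.
# Phase names are from the post; run_phase/judge signatures are assumed.

PHASES = ["Diagnose", "Recon", "Builder", "Polish", "Verifier"]
LOOPABLE = {"Polish", "Verifier"}  # loops only re-run the last two phases

def run_route(run_phase, judge, max_loops=3):
    """Run all five phases once, then loop Verifier/Polish on the judge's pick."""
    for phase in PHASES:
        run_phase(phase)
    for _ in range(max_loops):
        target = judge()  # judge picks a phase to re-run, or None to ship
        if target is None or target not in LOOPABLE:
            break
        run_phase(target)
```

This run matches the sketch exactly: five phases, then Verifier, Verifier, Polish.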
Three Loops, Two No-Ops, One Real Lift
The composite came in at 0.50. The ship bar was 0.99. That's nowhere near the bar, so the runner woke the Loop-Judge to decide whether to ship anyway or send the work back through.
Loop 1 — Verifier (no-op)
Judge picked Verifier, asked it to broaden visual coverage from 9 of 40 posts to all 40. Verifier re-ran, re-stated the original 9 archetypes, and wrote a byte-identical report.
Loop 2 — Verifier again (no-op, predicted)
Judge looked at the round-1 result, noticed it was a procedural no-op, and warned in its own verdict: “if attempt 3 also returns identical artifacts, that’s strong signal we’ve hit a protocol diminishing-returns wall.” Then it picked Verifier a second time anyway, hoping the sharper instructions would land. They didn’t. verify.md mtime never updated.
Loop 3 — Polish (real lift)
The third judge changed the target. Instead of Verifier, it pointed Polish at the weakest dim — ambition at 0.78 — and Polish actually delivered: terminal animation hard-cap dropped, dead code stripped, eight near-identical CSS rules collapsed into one selector, pyflakes silent.
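A cheap guard against no-op loops like rounds one and two is to hash the target artifact before and after each re-run, rather than eyeballing mtimes. This is a sketch of the idea, not what Orchestra actually does:

```python
import hashlib

def artifact_digest(path):
    """Content hash of an artifact; a byte-identical digest means the re-run was a no-op."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def rerun_was_noop(path, rerun):
    """Run the phase, then compare the artifact's content hash to its pre-run state."""
    before = artifact_digest(path)
    rerun()
    return artifact_digest(path) == before
```

Two identical digests in a row is exactly the "protocol diminishing-returns wall" signal the round-2 judge described, made mechanical.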
Real-world analogy: Olympic gymnastics judging where the scale tops out at 10.0 but every routine has a 0.3 deduction baked in for breathing. You can do a flawless routine and still score 9.7. The bar isn’t reachable; the judges are just calibrated to never give it.
What “Composite 0.50” Actually Means
The composite isn’t the average of the dim scores. It’s a weighted score with a first-loop cap at 0.80 and additional penalties for unverified work.
So the 0.50 number that triggered every loop wasn’t the dim scores being bad — those ranged 0.78 to 0.93. It was the rubric’s structural protections firing because verification only sampled 9 of 40 posts. Every loop would have to either close that gap (Verifier’s job) or move the dim scores up so the composite punched through the cap (Polish’s job).
| Layer | Number | Meaning |
|---|---|---|
| Raw composite | 0.50 | Weighted score with first-loop cap and unverified-work penalty |
| Cited dim range | 0.78 – 0.93 | What individual scorers actually awarded each dimension |
| Ship threshold (was) | 0.99 | Composite must clear this to auto-ship without judge involvement |
| Ship threshold (now) | 0.90 | Lowered after this run — reachable by good work, still rejects bad |
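The layering in the table can be written as a scoring function. The exact weights and penalty sizes aren't public, so the numbers here are illustrative, not Orchestra's real rubric:

```python
def composite(dim_scores, verified_fraction, first_loop=True,
              first_loop_cap=0.80, unverified_penalty=0.30):
    """Illustrative composite: average the dims, cap the first loop,
    then penalize in proportion to how much work went unverified.
    Cap and penalty values are assumptions."""
    raw = sum(dim_scores) / len(dim_scores)
    if first_loop:
        raw = min(raw, first_loop_cap)  # structural cap, not a quality judgment
    raw -= unverified_penalty * (1.0 - verified_fraction)
    return round(max(raw, 0.0), 2)
```

With dims in the cited 0.78 to 0.93 range but only 9 of 40 posts verified, a function shaped like this lands far below the dim scores themselves, which is the shape of the 0.50 result.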
Why 0.99 Doesn’t Work For Subjective Tasks
A 0.99 bar makes sense for objective tasks where every defect is a fact. Did the migration apply? Did the test pass? Did the binary compile? Yes-or-no, no-defects-allowed, ship-or-don’t.
It doesn’t work for visual tasks where every scorer is allowed to invent objections. “The terminal archetype animates in stills but the keyframe sweep could be smoother.” “Library was lifted not designed.” “Single visual language across all 40 posts.” All true. None of them are bugs.
Use 0.99 when
The task has a binary pass/fail at the end. Tests pass or they don’t. Migration ships or it doesn’t. The scorers can only ding you for facts.
Don’t use 0.99 when
The task is subjective. Visual quality, ambition, polish on a generated artifact — the scorers will always find something. The loop will run forever.
The Bug We Patched Mid-Run
While the loops were running, we noticed chat.md was missing most of the worker signoffs. Workers were being killed by the reaper before they finished writing their signoff posts.
The cause was step ordering inside the spawn template. Workers were instructed to write result.json first, then their chat signoff. But writing result.json is what triggers the reaper to kill them — so the chat post never landed.
BEFORE
Step 4: Write result.json (triggers the reaper)
↓
Step 5: Post signoff to chat.md (never executes)

AFTER
Step 4: Post signoff to chat.md (lands first)
↓
Step 5: Write result.json last (reaper fires after the signoff is on disk)
Patched the template, bumped the skill version, and re-spawned the next worker with the fix. Earlier workers’ signoffs got reconstructed from their result.json files via a small backfill script.
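The backfill mentioned above can be sketched like this. The result.json field names and chat.md signoff format are assumptions, since the post doesn't show either:

```python
import json
from pathlib import Path

def backfill_signoffs(results_dir, chat_path):
    """Reconstruct missing chat.md signoffs from each worker's result.json.
    Field names ('worker', 'summary') are hypothetical."""
    chat = Path(chat_path)
    existing = chat.read_text() if chat.exists() else ""
    lines = []
    for result in sorted(Path(results_dir).glob("*/result.json")):
        data = json.loads(result.read_text())
        worker = data.get("worker", result.parent.name)
        if worker in existing:
            continue  # this signoff landed before the reaper fired
        lines.append(f"[{worker}] signoff (backfilled): {data.get('summary', 'n/a')}")
    with chat.open("a") as f:
        for line in lines:
            f.write(line + "\n")
    return len(lines)
```

Appending rather than rewriting keeps any signoffs that did land intact, so the script is safe to re-run.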
What Actually Shipped
| Metric | Value |
|---|---|
| Tool | one-shot-orchestra-v2 (v0.5.1) |
| Phases run | Diagnose, Recon, Builder, Polish, Verifier (full route, then 3 loops) |
| Judge loops | 3 (Verifier × 2, Polish × 1) |
| Composite (final) | 0.50 raw · cited dims 0.78–0.93 |
| Files modified | 40 blog posts, +7,071 / -40 lines |
| Files skipped (correctly) | 7 already-rich posts + index + 15 art gallery pages |
| Accessibility coverage | 100% — every modified post has prefers-reduced-motion |
| Verdict | User-initiated ship after 3 loops + threshold lowered to 0.90 |
The Lesson For Anyone Else Wiring A Scoring Loop
The loop is only as useful as the threshold makes it. Set the bar where good work clears and bad work doesn’t — not where perfect work clears, because perfect work doesn’t exist on a subjective task.
Auto-ship at 0.90. Human-judge at 0.50–0.89. Auto-loop below 0.50. The middle band is where the judge does its real job: deciding whether the gaps are worth closing or whether the work is genuinely shipped.
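The three bands can be written down directly. The thresholds are from the post; the function name and return labels are mine:

```python
def ship_decision(composite, ship_bar=0.90, loop_floor=0.50):
    """Route a scored run: auto-ship good work, auto-loop bad work,
    and hand the middle band to the judge."""
    if composite >= ship_bar:
        return "auto-ship"
    if composite < loop_floor:
        return "auto-loop"
    return "human-judge"
```

Under the old 0.99 bar, this run's 0.50 would have sat in judge territory forever; under 0.90, good work clears on its own and the judge only sees genuinely ambiguous runs.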
The 0.99 bar didn’t make the work better. It just made the run longer.
Run Orchestra v2 On Your Next Build
Lean session, fresh workers, automatic scoring loop — now with a ship bar that good work can actually clear.
Get Godmode · How Orchestra v2 Works