Silent Ship, Then Three Quality Lifters
💥 The bug: Loop-Judge wrote `polish` (lowercase). The runner did a strict-equality match against `Polish`. The match failed, and the runner shipped at first-pass quality without ever looping.
🔧 The fix: A case-insensitive resolver that accepts the display name OR the slug, with a safe fallback when names don’t match at all
🚀 The bonus: Three quality lifters bolted on top — an ambition score, a tightened Builder brief, and a new Polish phase — so the loop has something worth catching
What Was Supposed To Happen
Orchestra v2 has a Loop-Judge worker. When the score lands in the ambiguous band — not great, not bad — the judge reads the scorecard, picks one or two phases worth re-running, and writes its verdict to a file. The runner reads that verdict and resumes the run from the chosen phase.
That’s the loop. It’s how a build climbs from “passes” to “ships.”
What Actually Happened
The first time we ran a hard task through v2 — “make the most advanced thing you can possibly make” — the judge picked the new Polish phase as the rerun target. It wrote ["polish"] in the verdict. Lowercase.
The runner’s phaseIndexForRerun did strict equality against the canonical name Polish. The match failed. The list of resumable phases came back empty. The runner shrugged and shipped the first-pass build.
Real-world analogy: a kitchen where the head chef shouts “send it back!” but the runner can’t read the rejection slip because the chef’s handwriting is too messy. The plate goes out anyway. Nobody intended a silent ship — the handoff just dropped on the floor.
The Bug In One Line
The original match looked like this:
const idx = phaseRoute.findIndex(r => r.name === p);
Strict equality. Case-sensitive. Only matched against name, never the slug. If the judge said anything other than the exact display name, findIndex returned -1, the phase dropped out of the rerun list, and the runner walked.
Worse: the empty result wasn’t treated as an error. It was treated as “nothing to rerun” — which meant the runner happily marked the build as done.
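Here’s a minimal reproduction of that failure mode. It’s a sketch, not the runner’s actual code: the `phaseRoute` entries, every slug except `build`, and the step that drops misses are assumptions made for illustration. Only the strict `r.name === p` comparison comes from the original line.

```javascript
// Hypothetical route table. Display names follow the article; slugs are assumed.
const phaseRoute = [
  { name: "Recon", slug: "recon" },
  { name: "Builder", slug: "build" },
  { name: "Verifier", slug: "verify" },
  { name: "Polish", slug: "polish" },
];

// The original strict match, applied to every phase in a verdict.
function strictRerunIndexes(verdict) {
  return verdict
    .map((p) => phaseRoute.findIndex((r) => r.name === p))
    .filter((idx) => idx !== -1); // assumed: misses are dropped, not reported
}

// What the judge wrote: findIndex returns -1, so the list ends up empty.
const fromJudge = strictRerunIndexes(["polish"]);

// What the runner expected: the exact display name resolves to index 3.
const canonical = strictRerunIndexes(["Polish"]);
```

An empty list and a deliberate “nothing to rerun” look identical to whatever reads the result, which is why nothing complained.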
The Handoff Fix
We replaced that line with a real resolver. It does three things the old match didn’t.
- Normalises casing on both sides: `Polish`, `polish`, and `POLISH` all match.
- Accepts the display name OR the slug. The judge can write `Builder` or `build`; both resolve.
- Returns the canonical names so downstream filters (like “which phases have I completed?”) get the right strings every time.
And critically: when the judge writes something we can’t resolve at all — an invented phase name, a typo, anything — the runner falls back to [Builder, Verifier] and flags the fallback in the run log. Better to do the safe rerun than ship silently.
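A resolver with those properties might look roughly like this. Treat it as a sketch under stated assumptions: `resolveRerunPhases`, `SAFE_FALLBACK`, the `usedFallback` flag, and every slug except `build` are hypothetical names, and trimming whitespace is an extra not mentioned above. It also assumes the fallback fires only when nothing in the verdict resolves.

```javascript
// Hypothetical route table. "build" is the only slug the article confirms.
const phaseRoute = [
  { name: "Recon", slug: "recon" },
  { name: "Builder", slug: "build" },
  { name: "Verifier", slug: "verify" },
  { name: "Polish", slug: "polish" },
];

const SAFE_FALLBACK = ["Builder", "Verifier"];

// Resolve judge-written phase names to canonical display names.
function resolveRerunPhases(requested, route = phaseRoute) {
  const norm = (s) => String(s).trim().toLowerCase();
  const resolved = [];
  for (const p of requested) {
    const key = norm(p);
    // Match on the display name OR the slug, ignoring case.
    const hit = route.find((r) => norm(r.name) === key || norm(r.slug) === key);
    if (hit && !resolved.includes(hit.name)) resolved.push(hit.name);
  }
  if (resolved.length === 0) {
    // Nothing resolved: fail closed with the safe rerun and flag it.
    return { phases: [...SAFE_FALLBACK], usedFallback: true };
  }
  return { phases: resolved, usedFallback: false };
}

const lower = resolveRerunPhases(["polish"]);
const mixed = resolveRerunPhases(["build", "POLISH"]);
const invented = resolveRerunPhases(["Sparkle"]);
```

In this sketch the caller would write `usedFallback` to the run log. It also assumes the resolver is only called when the judge actually asked for a rerun, since an empty request would land on the fallback too.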
Core insight: when a system can fail open or fail closed, default to fail closed — especially at the boundary between two AI workers. A loose string match plus a silent “nothing to do” is exactly how you ship work that should have been redone.
The Bigger Issue
The bug was real but small. The bigger issue showed up when we looked at why the build had needed a loop in the first place.
The black hole rendered. Tests passed. Composite was 0.86. But it wasn’t the most advanced thing you can possibly make. It was technically fine but pedestrian. None of the existing scorers caught that. The loop wasn’t going to find ambition gaps even if it had fired — the rubric didn’t ask.
So we added three changes on top of the bug fix — one to the rubric, one to the brief, one new phase.
Lifter 1: The Ambition Dim
A new scorecard dimension called ambition. It asks: does this feel like the version a senior would demo at a conference, or the version that compiles?
It’s wired into the visual, 3D-entity, and writing scorecard presets. It is not wired into refactor or bugfix presets — cleanups don’t get judged on wow factor, and that’s the right call.
The dim has a hard cap at 0.85 if any claimed headline feature is invisible in a still capture. Live-only motion doesn’t save it. If your demo only sells in motion, the still is the portfolio shot you don’t have.
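The cap rule is simple enough to sketch. Only the 0.85 figure comes from the rule above; `cappedAmbition`, the `visibleInStill` field, and the shape of the feature list are assumptions for illustration.

```javascript
const STILL_CAP = 0.85; // cap stated in the article

// features: [{ name, visibleInStill }], a hypothetical shape.
function cappedAmbition(rawScore, features) {
  // Any claimed headline feature missing from the still triggers the cap.
  const hidden = features.filter((f) => !f.visibleInStill).map((f) => f.name);
  const score = hidden.length > 0 ? Math.min(rawScore, STILL_CAP) : rawScore;
  return { score, hidden };
}

const allVisible = cappedAmbition(0.95, [
  { name: "feature-a", visibleInStill: true },
  { name: "feature-b", visibleInStill: true },
]);

const oneHidden = cappedAmbition(0.95, [
  { name: "feature-a", visibleInStill: true },
  { name: "feature-b", visibleInStill: false },
]);
```

Returning the hidden feature names alongside the score gives a later phase something concrete to target, rather than just a lower number.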
Lifter 2: The Tightened Builder Brief
The Builder used to get told “implement the task.” Now it gets a non-negotiable quality bar.
- Headline visible in stills. If the task has a visual surface, every claimed feature must show up in a single still capture, not just live motion.
- Default state is the demo. The first frame a user sees should already sell the artifact. Sliders exist to vary it, not to unlock it.
- Feature-rich beats minimum-viable. “The most advanced X” is not “X with the textbook formula.” Add the second-order touches a craftsperson would.
- Honest reporting. If you ship with a known gap, write it in running-notes.md so Polish can target it. Don’t ship and hope.
Lifter 3: The Polish Phase
A new phase between Verifier and scoring. Its job is the “feel finished” pass — framing, exposure, money-shot first frame, micro-interactions, voice for writing tasks. One or two focused improvements, never a rewrite.
📡 Recon
↓
🏗️ Builder
↓
✅ Verifier
↓
✨ Polish — new in v0.2.0
↓
📊 Scoring
If Polish thinks the build is already polished, it ships a no-op and writes its reasoning. Skipping is fine; lying is not.
Targeted Reruns
The Polish phase isn’t just a once-per-run pass. The Loop-Judge can rerun it independently when ambition lags but the build itself is fine.
That’s the routing rule that turns a quality lift into a cheap quality lift.
| Weakness | Reruns | Cost |
|---|---|---|
| Structural (missing feature, wrong geometry, broken behavior) | Builder + Verifier (Polish chains automatically) | Expensive |
| Verification gap (no captures, flaky test) | Verifier only | Medium |
| Finish-line gap (framing, exposure, voice, ambition) | Polish only | Cheap |
Three phases of work to fix bad framing was overkill. One Polish pass is all it needs.
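The routing table translates almost directly into a lookup. This is a sketch: the category keys (`structural`, `verification`, `finish`), the function name `routeRerun`, and the choice to send unknown categories to the Builder + Verifier rerun are all assumptions, not the runner’s real API.

```javascript
// Routing table from the article; the string keys are hypothetical labels.
const RERUN_ROUTES = new Map([
  ["structural", { phases: ["Builder", "Verifier"], cost: "expensive" }], // Polish chains after
  ["verification", { phases: ["Verifier"], cost: "medium" }],
  ["finish", { phases: ["Polish"], cost: "cheap" }],
]);

// Look up which phases to rerun for a weakness category.
function routeRerun(weakness) {
  const route = RERUN_ROUTES.get(weakness);
  if (!route) {
    // Assumption: an unknown category gets the safe, expensive rerun.
    return { phases: ["Builder", "Verifier"], cost: "expensive", usedFallback: true };
  }
  return { phases: [...route.phases], cost: route.cost, usedFallback: false };
}

const framingGap = routeRerun("finish");
const missingFeature = routeRerun("structural");
```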
The Regression Tests
Every change in the runner gets a scenario in test-phase3.sh. The new tests cover the silent-ship surface area plus the new quality wiring:
- Lowercase phase names (`polish`, `builder`) resolve to canonical names
- Unknown phase names trigger a safe fallback to `[Builder, Verifier]` with the fallback flagged
- `MAX_ATTEMPTS` ceiling forces a ship after 3 loops, so no infinite cycles
- Ambition score below the floor auto-loops to Polish only, not the full Builder + Verifier sequence
- The `code-visual` preset wires both `visual_quality` and `ambition`; `code-non-visual` wires neither
13 scenarios, all green. The silent-ship case has its own scenario now — if anyone ever regresses the resolver, the test breaks the build.
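The loop ceiling is the easiest of these to picture in code. The real scenarios live in test-phase3.sh; this is only a sketch of the behaviour they guard. `runWithLoop`, `runPass`, `judge`, and the verdict shape are hypothetical, and it assumes “3 loops” counts reruns after the first pass.

```javascript
const MAX_ATTEMPTS = 3; // ceiling named in the regression list

// runPass(rerunTarget) and judge(result) are hypothetical callbacks.
function runWithLoop(runPass, judge) {
  let result = runPass(null); // first pass, no rerun target
  let loops = 0;
  while (loops < MAX_ATTEMPTS) {
    const verdict = judge(result);
    if (verdict.ship) return { result, loops, forced: false };
    loops += 1;
    result = runPass(verdict.rerun);
  }
  // Ceiling reached: ship what we have and say it was forced.
  return { result, loops, forced: true };
}

// A judge that never approves should still terminate after three reruns.
let passes = 0;
const stubborn = runWithLoop(
  () => { passes += 1; return { pass: passes }; },
  () => ({ ship: false, rerun: ["Polish"] })
);
```

The `forced` flag keeps a ceiling-triggered ship distinguishable from one the judge approved.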
Why The Real Fix Wasn’t The Bug Fix
Do think of this as: a handoff bug that exposed a missing rubric dim and a missing phase. The bug was the trigger; the quality lifters were the actual work.
Don’t think of this as: just a string-match fix. Patching the resolver alone would have looped the same build through the same scorers and converged on the same pedestrian-but-passing output.
You can fix a string match in 30 seconds. Fixing the rubric so the loop has something to find takes longer — but that’s the change that lifts every future build, not just the next one.
The silent ship was the prompt. The lifters were the answer.
Run Orchestra v2 On Your Next Build
One prompt in, finished product out. Lean session, fresh workers, and a loop that won’t ship until the rubric says ship.