Post-Mortem

Silent Ship, Then Three Quality Lifters

TL;DR

💥 The bug: Loop-Judge wrote polish (lowercase). The runner did a strict-equality match against Polish. The match failed, so the runner shipped at first-pass quality without ever looping.
🔧 The fix: Case-insensitive resolver that accepts the display name OR the slug, with a safe fallback when names don’t match at all
🚀 The bonus: Three quality lifters bolted on top — an ambition score, a tightened Builder brief, and a new Polish phase — so the loop has something worth catching

🔍 What Was Supposed To Happen

Orchestra v2 has a Loop-Judge worker. When the score lands in the ambiguous band — not great, not bad — the judge reads the scorecard, picks one or two phases worth re-running, and writes its verdict to a file. The runner reads that verdict and resumes the run from the chosen phase.

That’s the loop. It’s how a build climbs from “passes” to “ships.”

💥 What Actually Happened

The first time we ran a hard task through v2 — “make the most advanced thing you can possibly make” — the judge picked the new Polish phase as the rerun target. It wrote ["polish"] in the verdict. Lowercase.

The runner’s phaseIndexForRerun did strict equality against the canonical name Polish. The match failed. The list of resumable phases came back empty. The runner shrugged and shipped the first-pass build.

Real-world analogy: a kitchen where the head chef shouts “send it back!” but the runner can’t read the rejection slip because the chef’s handwriting is too messy. The plate goes out anyway. Nobody intended a silent ship — the handoff just dropped on the floor.

🐛 The Bug In One Line

The original match looked like this:

const idx = phaseRoute.findIndex(r => r.name === p);

Strict equality. Case-sensitive. Only matched against name, never the slug. If the judge said anything other than the exact display name, findIndex returned -1, the list of resumable phases came back empty, and the runner walked.

Worse: the empty result wasn’t treated as an error. It was treated as “nothing to rerun” — which meant the runner happily marked the build as done.
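Here's a minimal sketch of that failure mode, not the runner's actual code. The route entries, the `strictRerunIndices` helper, and the `ship`/`rerun` decision strings are all hypothetical; only the strict `name === p` comparison comes from the line above.

```javascript
// Hypothetical route entries; only the name/slug pairing is implied by the post.
const phaseRoute = [
  { name: "Builder", slug: "builder" },
  { name: "Verifier", slug: "verifier" },
  { name: "Polish", slug: "polish" },
];

// Sketch of the old behaviour: strict match on name, misses silently dropped.
function strictRerunIndices(verdictPhases) {
  return verdictPhases
    .map((p) => phaseRoute.findIndex((r) => r.name === p)) // -1 on any miss
    .filter((idx) => idx !== -1);
}

// The judge wrote the lowercase slug, so nothing matches.
const indices = strictRerunIndices(["polish"]);

// An empty list is read as "nothing to rerun", which is the silent ship.
const decision = indices.length === 0 ? "ship" : "rerun";
console.log(indices.length, decision);
```

The miss and the ship happen in two different places, which is why nothing in the run looked wrong at the time.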

Figure: Strict match vs. case-insensitive resolver. Same letters, two matchers: one ships silent, one loops.

🔧 The Handoff Fix

We replaced that line with a real resolver. It does what the old match didn't: it compares case-insensitively, and it accepts either the display name or the slug.

And critically: when the judge writes something we can’t resolve at all — an invented phase name, a typo, anything — the runner falls back to [Builder, Verifier] and flags the fallback in the run log. Better to do the safe rerun than ship silently.
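One way such a resolver could look, as a sketch under stated assumptions rather than the shipped code. The names `resolvePhase`, `resolveRerun`, and `SAFE_FALLBACK` are made up for illustration, the whitespace trim is an extra assumption, and we assume an empty verdict is passed through as-is while a non-empty verdict that resolves to nothing triggers the fallback.

```javascript
// Illustrative route table; the real phaseRoute entries may carry more fields.
const phaseRoute = [
  { name: "Builder", slug: "builder" },
  { name: "Verifier", slug: "verifier" },
  { name: "Polish", slug: "polish" },
];

// Safe rerun used when nothing in the verdict resolves (per the post).
const SAFE_FALLBACK = ["Builder", "Verifier"];

// Resolve one judge-written string to a canonical phase name, or null.
// Trimming whitespace is an assumption, not something the post states.
function resolvePhase(raw) {
  const key = String(raw).trim().toLowerCase();
  const hit = phaseRoute.find(
    (r) => r.name.toLowerCase() === key || r.slug.toLowerCase() === key
  );
  return hit ? hit.name : null;
}

// Resolve a whole verdict. A non-empty verdict that resolves to nothing
// fails closed: rerun the safe pair and flag it through the logger.
function resolveRerun(verdictPhases, log = console.warn) {
  const resolved = [
    ...new Set(verdictPhases.map(resolvePhase).filter((n) => n !== null)),
  ];
  if (verdictPhases.length > 0 && resolved.length === 0) {
    log(`rerun fallback: could not resolve ${JSON.stringify(verdictPhases)}`);
    return { phases: [...SAFE_FALLBACK], fallback: true };
  }
  return { phases: resolved, fallback: false };
}

console.log(resolveRerun(["polish"]).phases);
```

Returning a `fallback` flag alongside the phases keeps the decision visible to whatever writes the run log, instead of burying it in the resolver.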

Core insight: when a system can fail open or fail closed, default to fail closed — especially at the boundary between two AI workers. A loose string match plus a silent “nothing to do” is exactly how you ship work that should have been redone.

🎯 The Bigger Issue

The bug was real but small. The bigger issue showed up when we looked at why the build had needed a loop in the first place.

The black hole rendered. Tests passed. Composite was 0.86. But it wasn’t the most advanced thing you can possibly make. It was technically fine but pedestrian. None of the existing scorers caught that. The loop wasn’t going to find ambition gaps even if it had fired — the rubric didn’t ask.

So we added three changes on top of the bug fix — one to the rubric, one to the brief, one new phase.

Lifter 1: The Ambition Dim

A new scorecard dimension called ambition. It asks: does this feel like the version a senior would demo at a conference, or the version that compiles?

It’s wired into the visual, 3D-entity, and writing scorecard presets. It is not wired into refactor or bugfix presets — cleanups don’t get judged on wow factor, and that’s the right call.

The dim has a hard cap at 0.85 if any claimed headline feature is invisible in a still capture. Live-only motion doesn’t save it. If your demo only sells in motion, the still is the portfolio shot you don’t have.
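The cap rule is easy to state as code. This is a sketch only: the preset keys, the `visibleInStill` flag, and the `ambitionScore` function are hypothetical, and only the 0.85 cap and the preset split come from the description above.

```javascript
const AMBITION_CAP = 0.85; // cap value from the post

// Presets that score ambition, per the post; the key spellings are assumed.
const AMBITION_PRESETS = new Set(["visual", "3d-entity", "writing"]);

// Returns the ambition score for a build, or null when the preset
// (for example refactor or bugfix) does not judge ambition at all.
function ambitionScore(preset, rawScore, headlineFeatures) {
  if (!AMBITION_PRESETS.has(preset)) return null;
  // Any claimed headline feature missing from the still capture caps the score.
  const anyInvisible = headlineFeatures.some((f) => !f.visibleInStill);
  return anyInvisible ? Math.min(rawScore, AMBITION_CAP) : rawScore;
}

const features = [
  { name: "accretion disk", visibleInStill: true },
  { name: "lensing", visibleInStill: false }, // only shows up in motion
];
console.log(ambitionScore("visual", 0.95, features));
```

Using `Math.min` means the cap only ever lowers a score; a build that already scored below 0.85 keeps its own number.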

🏗️ Lifter 2: The Tightened Builder Brief

The Builder used to get told “implement the task.” Now it gets a non-negotiable quality bar.

🖼️

Headline visible in stills

If the task has a visual surface, every claimed feature must show up in a single still capture — not just live motion.

🎬

Default state is the demo

The first frame a user sees should already sell the artifact. Sliders exist to vary it, not to unlock it.

🌟

Feature-rich beats minimum-viable

“The most advanced X” is not “X with the textbook formula.” Add the second-order touches a craftsperson would.

📝

Honest reporting

If you ship with a known gap, write it in running-notes.md so Polish can target it. Don’t ship and hope.

Lifter 3: The Polish Phase

A new phase between Verifier and scoring. Its job is the “feel finished” pass — framing, exposure, money-shot first frame, micro-interactions, voice for writing tasks. One or two focused improvements, never a rewrite.

🔍 Diagnose

📡 Recon

🏗️ Builder

✅ Verifier

✨ Polish — new in v0.2.0

📊 Scoring
Figure: the v0.2.0 phase pipeline, with Polish slotted in between Verifier and Scoring.

If Polish thinks the build is already polished, it ships a no-op and writes its reasoning. Skipping is fine; lying is not.
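The two Polish constraints (one or two improvements, and a no-op that explains itself) could be checked along these lines. The report shape and the `validatePolishReport` name are assumptions for illustration; the post doesn't show what Polish actually writes.

```javascript
// Hypothetical report shape: { improvements: string[], reasoning: string }.
function validatePolishReport(report) {
  const improvements = report.improvements ?? [];
  const reasoning = (report.reasoning ?? "").trim();

  // More than two changes stops being polish and starts being a rewrite.
  if (improvements.length > 2) {
    return { ok: false, why: "more than two improvements reads as a rewrite" };
  }
  // Skipping is fine, but only with the reasoning written down.
  if (improvements.length === 0 && reasoning === "") {
    return { ok: false, why: "a no-op must explain itself" };
  }
  return { ok: true, noop: improvements.length === 0 };
}

console.log(validatePolishReport({ improvements: [], reasoning: "first frame already sells it" }));
```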

🎯 Targeted Reruns

The Polish phase isn’t just a once-per-run pass. The Loop-Judge can rerun it independently when ambition lags but the build itself is fine.

That’s the routing rule that turns a quality lift into a cheap quality lift.

| Weakness | Reruns | Cost |
| --- | --- | --- |
| Structural (missing feature, wrong geometry, broken behavior) | Builder + Verifier (Polish chains automatically) | Expensive |
| Verification gap (no captures, flaky test) | Verifier only | Medium |
| Finish-line gap (framing, exposure, voice, ambition) | Polish only | Cheap |

Running three phases of work to fix bad framing is overkill. One Polish pass is all it takes.
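The routing rule reads naturally as a lookup. A sketch, with assumed names: the weakness keys, the `routeFor` function, and the `chainsPolish` field are illustrative, and sending an unknown weakness class down the structural route is our assumption, chosen to match the fail-closed stance from the handoff fix.

```javascript
// Weakness class -> rerun plan, following the table in the post.
const ROUTES = new Map([
  ["structural", { reruns: ["Builder", "Verifier"], chainsPolish: true, cost: "expensive" }],
  ["verification", { reruns: ["Verifier"], chainsPolish: false, cost: "medium" }],
  ["finish-line", { reruns: ["Polish"], chainsPolish: false, cost: "cheap" }],
]);

// Unknown classes take the most thorough route rather than no route at all.
function routeFor(weakness) {
  const route = ROUTES.get(weakness);
  if (route) return { ...route, fallback: false };
  return { ...ROUTES.get("structural"), fallback: true };
}

console.log(routeFor("finish-line").reruns);
```

A `Map` is used instead of a plain object so that a stray key such as `constructor` can't resolve to something inherited.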

🧪 The Regression Tests

Every change in the runner gets a scenario in test-phase3.sh. The new tests cover the silent-ship surface area plus the new quality wiring.

13 scenarios, all green. The silent-ship case has its own scenario now — if anyone ever regresses the resolver, the test breaks the build.
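The real scenarios live in test-phase3.sh and aren't reproduced here. As a rough idea of how a silent-ship scenario can be pinned, here's a table-driven sketch in JavaScript; `rerunTargets` is a minimal stand-in for the resolver, and the scenario labels and the `runScenarios` helper are invented for illustration.

```javascript
// Minimal stand-in for the resolver under test (assumed behaviour).
const ROUTE = [
  { name: "Builder", slug: "builder" },
  { name: "Verifier", slug: "verifier" },
  { name: "Polish", slug: "polish" },
];

function rerunTargets(verdict) {
  const hits = verdict
    .map((p) => ROUTE.find((r) =>
      [r.name, r.slug].some((s) => s.toLowerCase() === String(p).toLowerCase())))
    .filter(Boolean)
    .map((r) => r.name);
  // Non-empty verdict with no hits falls back to the safe rerun.
  return hits.length > 0 || verdict.length === 0 ? hits : ["Builder", "Verifier"];
}

const scenarios = [
  { label: "lowercase slug still loops", verdict: ["polish"], expect: ["Polish"] },
  { label: "display name still loops", verdict: ["Polish"], expect: ["Polish"] },
  { label: "unknown name falls back", verdict: ["sparkle"], expect: ["Builder", "Verifier"] },
];

// Returns the labels of every scenario the given function fails.
function runScenarios(fn, cases) {
  return cases
    .filter((c) => JSON.stringify(fn(c.verdict)) !== JSON.stringify(c.expect))
    .map((c) => c.label);
}

const failures = runScenarios(rerunTargets, scenarios);
if (failures.length > 0) throw new Error(`regressions: ${failures.join("; ")}`);
```

Swapping the old strict matcher into `runScenarios` fails the lowercase case, which is the point: the scenario names the exact input that shipped silently.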

💡 Why The Real Fix Wasn’t The Bug Fix

Do think of this as

A handoff bug that exposed a missing rubric dim and a missing phase. The bug was the trigger; the quality lifters were the actual work.

Don’t think of this as

Just a string-match fix. Patching the resolver alone would have looped the same build through the same scorers and converged on the same pedestrian-but-passing output.

You can fix a string match in 30 seconds. Fixing the rubric so the loop has something to find takes longer — but that’s the change that lifts every future build, not just the next one.

The silent ship was the prompt. The lifters were the answer.

Run Orchestra v2 On Your Next Build

One prompt in, finished product out. Lean session, fresh workers, and a loop that won’t ship until the rubric says ship.
