Build Receipt ⏱️ 5 min read

Feedback.jsonl Made the Edits — One-Shot-Scripts v3.2 Recap

TL;DR

📝 What shipped: One-Shot-Scripts v3.1.0 → v3.2.0 — 16 surgical edits to the protocol that ships your code.
🔍 How the changes were chosen: The skill's own feedback.jsonl — 14 past run outcomes — pointed at exactly what to fix.
🎯 The result: Composite 0.96, shipped in 17 minutes, $15.73.

📊 Build Stats

| Metric | Value |
| --- | --- |
| Tool | /ignition (Stage 1 planning + Stage 2 autonomous) |
| Target | ~/.claude/skills/one-shot-scripts/ |
| Time taken | 17m 29s |
| Assistant turns | 33 |
| Total tokens | 4,106,807 |
| Estimated cost | $15.73 USD |
| Composite score | 0.96 / 1.00 (passing) |
| Verdict | Shipped |

🎯 Why Self-Improve A Skill

One-Shot-Scripts is the protocol that ships finished code in one prompt. It runs nine phases, scores its own output across eleven dimensions, and loops until every dimension passes. It also leaves a paper trail.

Every completed run appends one line to feedback.jsonl — the rubric scores, the verdict (Shipped, Edited, or Rejected), and any notes the user added. After 14 runs that file stops being a log and starts being a diagnosis.
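An entry might look like this — the field names here are illustrative, not the skill's actual schema — and once the file holds a dozen entries, a few lines of code turn the log into a diagnosis:

```javascript
// Hypothetical feedback.jsonl lines; field names are illustrative.
const lines = [
  '{"verdict":"Shipped","scores":{"process_quality":0.95,"visual":0.80}}',
  '{"verdict":"Edited","scores":{"process_quality":0.90,"visual":0.75}}',
  '{"verdict":"Shipped","scores":{"process_quality":1.00,"visual":0.90}}',
];
const entries = lines.map(line => JSON.parse(line));

// How often did a given dimension score below threshold?
const lowVisual = entries.filter(e => e.scores.visual < 0.85).length;
console.log(`visual < 0.85 in ${lowVisual} of ${entries.length} runs`);
```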

Think of it like a chef reading their own customer comment cards. You don't change the menu because you feel like it. You change it because the same dish keeps coming back half-eaten with the same complaint written on the back.

feedback.jsonl is append-only: every run adds one entry, and when a pattern fires, a fix gets queued. Here, visual < 0.85 in 3 of the last 5 runs queued Fix 3.

📝 What The Log Actually Said

Three patterns showed up so often they couldn't be ignored. Each one became a rule the protocol now enforces by default.

| Pattern in the log | What it meant |
| --- | --- |
| 14 entries used different names for the same scoring dimensions (process vs process_quality, ripple vs ripple_integrity) | Schema drift — the analytics dataset was quietly breaking |
| 4 entries shipped below the 0.92 composite threshold but were rescued by the "adjusted" score | The actual gating practice didn't match the documented rule |
| 3 entries with notes like "pieces look terrible" despite passing every code-level check | The visual audit phase wasn't catching procedural-content failures |

🔧 The Three Fixes That Mattered Most

📐

Schema discipline

Delivery now requires the canonical dimension keys verbatim. No abbreviations, no ad-hoc dimensions per task. Adjust weights if needed, never the dimension set.
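A minimal sketch of what "canonical keys verbatim" could look like as an enforced check — the canonical set shown is a two-key subset for illustration; the real rubric's set is larger:

```javascript
// Canonical dimension keys (a two-key subset for illustration).
const CANONICAL = new Set(['process_quality', 'ripple_integrity']);

// Reject ad-hoc abbreviations before an entry reaches feedback.jsonl.
function validateRubric(scores) {
  const drift = Object.keys(scores).filter(key => !CANONICAL.has(key));
  if (drift.length > 0) throw new Error(`schema drift: ${drift.join(', ')}`);
  return true;
}
```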

⚖️

Adjusted-score gating

The adjusted composite gates delivery only when the rationale documents which dimensions were down-weighted and why. No rationale → original score applies.
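As a sketch, the rule reads roughly like this — function names are hypothetical, and the 0.92 threshold is the one documented above:

```javascript
// No rationale on record -> the original composite applies.
function gatingScore(original, adjusted, rationale) {
  return rationale && rationale.trim() ? adjusted : original;
}

function shouldShip(original, adjusted, rationale, threshold = 0.92) {
  return gatingScore(original, adjusted, rationale) >= threshold;
}
```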

🔍

Per-item visual evidence

If the task makes 6 distinct things, you need 6 close-up screenshots. Wide shots that look fine from a distance are how bad output ships.

✏️

Plus 13 more

Routing table renamed Code → Feature/Refactor split. Operating Rules deduplicated. Dead reference to the old beta protocol dropped. Live task tracker reference cleaned up.

3 PATTERNS → 3 FIXES

P1 · Schema drift
14 entries used different names for the same dimensions (process / process_quality, ripple / ripple_integrity).

P2 · Rescued by adjusted
4 entries shipped below 0.92 composite, rescued by an adjusted score with no rationale recorded.

P3 · Visual audit miss
3 entries with notes like “pieces look terrible” despite passing every code-level check.

F1 · Schema discipline

    // delivery rubric
    process: 0.95,          // ad-hoc abbreviation
    process_quality: 0.95,  // canonical key
    ripple: 1.00,
    ripple_integrity: 1.00,

F2 · Adjusted-score gating

    // gating decision
    if (adjusted >= 0.92) ship();               // old rule
    if (adjusted >= 0.92 && rationale) ship();  // new rule
    else apply(originalScore);

F3 · Per-item visual evidence

    // visual phase 1c
    capture({ scope: 'wide' });
    for (const item of items)
      capture(item, { closeup: true });

Each fix is anchored in the log · no rule without a pattern

🔥 Why Ignition, Not One-Shot

The original instinct was “run One-Shot on itself.” Two seconds of thought killed that idea — a skill modifying its own scaffolding while running on that scaffolding is a recipe for losing your work.

Ignition is the safer sibling. Stage 1 is collaborative and gated — ideation, research, blind-spot review, with explicit approval before any file is touched. Stage 2 only fires after you sign off the change list.

💬 Recon: read all 18 scripts + every line of feedback.jsonl

🧠 Propose 16 surgical edits, each tied to a specific log pattern or defect

🚪 Approval gate: user reviews the change list, picks defaults, approves

⚡ Stage 2 fires: edits applied, cross-script references verified, task tracker closed

✅ Score, deliver, log outcome, emit receipt

📊 Dimension Scorecard

The full scorecard from the run log, including the lowest score — that's the honest detail.

| Dimension | Score | Notes |
| --- | --- | --- |
| Context Load | 1.00 | read all 18 scripts + every feedback entry |
| Execution | 1.00 | 16 of 16 planned edits applied |
| Testing | 0.90 | cross-reference scan, no runtime suite for markdown |
| Security | 1.00 | no attack surface in internal docs |
| Process Quality | 0.95 | diagnose-before-build followed end to end |
| Deep Research | 0.85 | flagged LOW relevance — pure internal refactor |
| Alternatives | 0.95 | 3 options weighed for each major edit |
| Documentation | 1.00 | version bumped, change rationale captured |
| Ripple Integrity | 1.00 | all 18 cross-script references verified |
| Polish | 0.95 | one stale heading missed in first pass, fixed in loop |
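As a sanity check, the 0.96 composite follows from the ten scores in the scorecard — assuming equal weights, which the reported number is consistent with:

```javascript
// The ten dimension scores from the run-log scorecard, in order.
const scores = [1.00, 1.00, 0.90, 1.00, 0.95, 0.85, 0.95, 1.00, 1.00, 0.95];

// Equal-weight mean (an assumption; the rubric may weight dimensions).
const composite = scores.reduce((sum, s) => sum + s, 0) / scores.length;
console.log(composite.toFixed(2)); // 0.96
```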

✏️ The Honest Trade-off

Deep Research scored 0.85 — the lowest of any dimension. The rubric expects a research brief with external sources, asset URLs, API verification.

For this task, that would have been fake compliance. There's no library to verify, no asset to source, no domain knowledge to confirm — the inputs were our own past runs. The rubric flagged it as LOW relevance and down-weighted it, which is exactly what the new adjusted-score rule is for.
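Down-weighting is just a weighted mean with the rationale attached to the weight it justifies — a sketch, with illustrative weights and dimension names rather than the rubric's actual values:

```javascript
// Weighted composite; a weight below 1.0 must carry a recorded rationale.
function adjustedComposite(dims) {
  const totalWeight = dims.reduce((sum, d) => sum + d.weight, 0);
  const weighted = dims.reduce((sum, d) => sum + d.score * d.weight, 0);
  return weighted / totalWeight;
}

// Deep Research down-weighted to 0.5, rationale on record.
const dims = [
  { name: 'deep_research', score: 0.85, weight: 0.5,
    rationale: 'LOW relevance: pure internal refactor' },
  { name: 'execution', score: 1.00, weight: 1.0 },
];
console.log(adjustedComposite(dims).toFixed(2)); // 0.95
```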

Do

Down-weight a dimension when the rubric genuinely doesn't fit the task — and write the rationale.

Don't

Inflate a low score with hollow research just to clear the threshold. The score is a gift; honesty is the whole point.

📚 The Recursive Bit

The new rules in v3.2 are the rules that would have caught the failures in v3.1's own log. Schema discipline would have stopped the dimension-name drift. Adjusted-score gating would have made the four below-threshold ships explicit instead of implicit. Per-item visual evidence would have flagged the “pieces look terrible” runs in Phase 1c, before any code ran.

The next 14 entries in feedback.jsonl are the test. If the same patterns reappear, the rules need teeth. If they don't, v3.2 actually learned something.

Run the same protocol on your own builds

One-Shot-Scripts is free. Ignition adds the Stage 1 planning gate for high-stakes work.

See One-Shot · Get Access