Build Receipt ⏱️ 5 min read

Feedback.jsonl Made the Edits — One-Shot-Scripts v3.2 Recap

TL;DR

📝 What shipped: One-Shot-Scripts v3.1.0 → v3.2.0 — 16 surgical edits to the protocol that ships your code.
🔍 How the changes were chosen: The skill's own feedback.jsonl — 14 past run outcomes — pointed at exactly what to fix.
🎯 The result: Composite 0.96, shipped in 17 minutes, $15.73.

📊 Build Stats

| Metric | Value |
| --- | --- |
| Tool | /ignition (Stage 1 planning + Stage 2 autonomous) |
| Target | ~/.claude/skills/one-shot-scripts/ |
| Time taken | 17m 29s |
| Assistant turns | 33 |
| Total tokens | 4,106,807 |
| Estimated cost | $15.73 USD |
| Composite score | 0.96 / 1.00 (passing) |
| Verdict | Shipped |

🎯 Why Self-Improve A Skill

One-Shot-Scripts is the protocol that ships finished code in one prompt. It runs nine phases, scores its own output across eleven dimensions, and loops until every dimension passes. It also leaves a paper trail.

Every completed run appends one line to feedback.jsonl — the rubric scores, the verdict (Shipped, Edited, or Rejected), and any notes the user added. After 14 runs that file stops being a log and starts being a diagnosis.
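An entry might look like this — the field names here are illustrative, not the skill's actual schema — and once the file holds a dozen entries, a few lines of code turn the log into a diagnosis:

```javascript
// Hypothetical feedback.jsonl lines; field names are illustrative.
const lines = [
  '{"verdict":"Shipped","scores":{"process_quality":0.95,"visual":0.80}}',
  '{"verdict":"Edited","scores":{"process_quality":0.90,"visual":0.75}}',
  '{"verdict":"Shipped","scores":{"process_quality":1.00,"visual":0.90}}',
];
const entries = lines.map(line => JSON.parse(line));

// How often did a given dimension score below threshold?
const lowVisual = entries.filter(e => e.scores.visual < 0.85).length;
console.log(`visual < 0.85 in ${lowVisual} of ${entries.length} runs`);
```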

Think of it like a chef reading their own customer comment cards. You don't change the menu because you feel like it. You change it because the same dish keeps coming back half-eaten with the same complaint written on the back.

feedback.jsonl is append-only: every run adds one entry, and when a pattern fires, a fix gets queued. Here, visual < 0.85 in 3 of the last 5 runs queued Fix 3.

📝 What The Log Actually Said

Three patterns showed up so often they couldn't be ignored. Each one became a rule the protocol now enforces by default.

| Pattern in the log | What it meant |
| --- | --- |
| 14 entries used different names for the same scoring dimensions (process vs process_quality, ripple vs ripple_integrity) | Schema drift — the analytics dataset was quietly breaking |
| 4 entries shipped below the 0.92 composite threshold but were rescued by the "adjusted" score | The actual gating practice didn't match the documented rule |
| 3 entries with notes like "pieces look terrible" despite passing every code-level check | The visual audit phase wasn't catching procedural-content failures |

🔧 The Three Fixes That Mattered Most

📐

Schema discipline

Delivery now requires the canonical dimension keys verbatim. No abbreviations, no ad-hoc dimensions per task. Adjust weights if needed, never the dimension set.
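A minimal sketch of what "canonical keys verbatim" could look like as an enforced check — the canonical set shown is a two-key subset for illustration; the real rubric's set is larger:

```javascript
// Canonical dimension keys (a two-key subset for illustration).
const CANONICAL = new Set(['process_quality', 'ripple_integrity']);

// Reject ad-hoc abbreviations before an entry reaches feedback.jsonl.
function validateRubric(scores) {
  const drift = Object.keys(scores).filter(key => !CANONICAL.has(key));
  if (drift.length > 0) throw new Error(`schema drift: ${drift.join(', ')}`);
  return true;
}
```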

⚖️

Adjusted-score gating

The adjusted composite gates delivery only when the rationale documents which dimensions were down-weighted and why. No rationale → original score applies.
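As a sketch, the rule reads roughly like this — function names are hypothetical, and the 0.92 threshold is the one documented above:

```javascript
// No rationale on record -> the original composite applies.
function gatingScore(original, adjusted, rationale) {
  return rationale && rationale.trim() ? adjusted : original;
}

function shouldShip(original, adjusted, rationale, threshold = 0.92) {
  return gatingScore(original, adjusted, rationale) >= threshold;
}
```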

🔍

Per-item visual evidence

If the task makes 6 distinct things, you need 6 close-up screenshots. Wide shots that look fine from a distance are how bad output ships.

✏️

Plus 13 more

Routing table renamed Code → Feature/Refactor split. Operating Rules deduplicated. Dead reference to the old beta protocol dropped. Live task tracker reference cleaned up.

3 PATTERNS → 3 FIXES

P1 · Schema drift
14 entries used different names for the same dimensions (process / process_quality, ripple / ripple_integrity).

P2 · Rescued by adjusted
4 entries shipped below 0.92 composite, rescued by an adjusted score with no rationale recorded.

P3 · Visual audit miss
3 entries with notes like “pieces look terrible” despite passing every code-level check.

F1 · Schema discipline

    // delivery rubric
    process: 0.95,          // ad-hoc abbreviation
    process_quality: 0.95,  // canonical key
    ripple: 1.00,
    ripple_integrity: 1.00,

F2 · Adjusted-score gating

    // gating decision
    if (adjusted >= 0.92) ship();               // old rule
    if (adjusted >= 0.92 && rationale) ship();  // new rule
    else apply(originalScore);

F3 · Per-item visual evidence

    // visual phase 1c
    capture({ scope: 'wide' });
    for (const item of items)
      capture(item, { closeup: true });

Each fix is anchored in the log · no rule without a pattern

🔥 Why Ignition, Not One-Shot

The original instinct was “run One-Shot on itself.” Two seconds of thought killed that idea — a skill modifying its own scaffolding while running on that scaffolding is a recipe for losing your work.

Ignition is the safer sibling. Stage 1 is collaborative and gated — ideation, research, blind-spot review, with explicit approval before any file is touched. Stage 2 only fires after you sign off the change list.

💬 Recon: read all 18 scripts + every line of feedback.jsonl

🧠 Propose 16 surgical edits, each tied to a specific log pattern or defect

🚪 Approval gate: user reviews the change list, picks defaults, approves

⚡ Stage 2 fires: edits applied, cross-script references verified, task tracker closed

✅ Score, deliver, log outcome, emit receipt

📊 Dimension Scorecard

The full scorecard from the run log, including the lowest score — that's the honest detail.

| Dimension | Score | Notes |
| --- | --- | --- |
| Context Load | 1.00 | read all 18 scripts + every feedback entry |
| Execution | 1.00 | 16 of 16 planned edits applied |
| Testing | 0.90 | cross-reference scan, no runtime suite for markdown |
| Security | 1.00 | no attack surface in internal docs |
| Process Quality | 0.95 | diagnose-before-build followed end to end |
| Deep Research | 0.85 | flagged LOW relevance — pure internal refactor |
| Alternatives | 0.95 | 3 options weighed for each major edit |
| Documentation | 1.00 | version bumped, change rationale captured |
| Ripple Integrity | 1.00 | all 18 cross-script references verified |
| Polish | 0.95 | one stale heading missed in first pass, fixed in loop |
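As a sanity check, the 0.96 composite follows from the ten scores in the scorecard — assuming equal weights, which the reported number is consistent with:

```javascript
// The ten dimension scores from the run-log scorecard, in order.
const scores = [1.00, 1.00, 0.90, 1.00, 0.95, 0.85, 0.95, 1.00, 1.00, 0.95];

// Equal-weight mean (an assumption; the rubric may weight dimensions).
const composite = scores.reduce((sum, s) => sum + s, 0) / scores.length;
console.log(composite.toFixed(2)); // 0.96
```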

✏️ The Honest Trade-off

Deep Research scored 0.85 — the lowest of any dimension. The rubric expects a research brief with external sources, asset URLs, API verification.

For this task, that would have been fake compliance. There's no library to verify, no asset to source, no domain knowledge to confirm — the inputs were our own past runs. The rubric flagged it as LOW relevance and down-weighted it, which is exactly what the new adjusted-score rule is for.
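Down-weighting is just a weighted mean with the rationale attached to the weight it justifies — a sketch, with illustrative weights and dimension names rather than the rubric's actual values:

```javascript
// Weighted composite; a weight below 1.0 must carry a recorded rationale.
function adjustedComposite(dims) {
  const totalWeight = dims.reduce((sum, d) => sum + d.weight, 0);
  const weighted = dims.reduce((sum, d) => sum + d.score * d.weight, 0);
  return weighted / totalWeight;
}

// Deep Research down-weighted to 0.5, rationale on record.
const dims = [
  { name: 'deep_research', score: 0.85, weight: 0.5,
    rationale: 'LOW relevance: pure internal refactor' },
  { name: 'execution', score: 1.00, weight: 1.0 },
];
console.log(adjustedComposite(dims).toFixed(2)); // 0.95
```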

Do

Down-weight a dimension when the rubric genuinely doesn't fit the task — and write the rationale.

Don't

Inflate a low score with hollow research just to clear the threshold. The score is a gift; honesty is the whole point.

📚 The Recursive Bit

The new rules in v3.2 are the rules that would have caught the failures in v3.1's own log. Schema discipline would have stopped the dimension-name drift. Adjusted-score gating would have made the four below-threshold ships explicit instead of implicit. Per-item visual evidence would have flagged the “pieces look terrible” runs in Phase 1c, before any code ran.

The next 14 entries in feedback.jsonl are the test. If the same patterns reappear, the rules need teeth. If they don't, v3.2 actually learned something.

Run the same protocol on your own builds

One-Shot-Scripts is free. Ignition adds the Stage 1 planning gate for high-stakes work.

See One-Shot · Get Access