Feedback.jsonl Made the Edits — One-Shot-Scripts v3.2 Recap
📝 What shipped: One-Shot-Scripts v3.1.0 → v3.2.0 — 16 surgical edits to the protocol that ships your code.
🔍 How the changes were chosen: the skill's own feedback.jsonl — 14 past run outcomes — pointed at exactly what to fix.
🎯 The result: composite 0.96, shipped in 17 minutes, $15.73.
Build Stats
| Metric | Value |
|---|---|
| Tool | /ignition (Stage 1 planning + Stage 2 autonomous) |
| Target | ~/.claude/skills/one-shot-scripts/ |
| Time taken | 17m 29s |
| Assistant turns | 33 |
| Total tokens | 4,106,807 |
| Estimated cost | $15.73 USD |
| Composite score | 0.96 / 1.00 (passing) |
| Verdict | Shipped |
Why Self-Improve A Skill
One-Shot-Scripts is the protocol that ships finished code in one prompt. It runs nine phases, scores its own output across eleven dimensions, and loops until every dimension passes. It also leaves a paper trail.
Every completed run appends one line to feedback.jsonl — the rubric scores, the verdict (Shipped, Edited, or Rejected), and any notes the user added. After 14 runs that file stops being a log and starts being a diagnosis.
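For concreteness, here is a minimal sketch of the shape one of those lines might decode to, assuming fields implied by the prose; the names are illustrative, not the skill's actual schema:

```ts
// Hypothetical shape of one feedback.jsonl record. Field names are
// inferred from the prose (scores, verdict, notes), not the real schema.
interface FeedbackEntry {
  scores: Record<string, number>;             // dimension key → 0..1
  composite: number;                          // the weighted roll-up
  verdict: 'Shipped' | 'Edited' | 'Rejected'; // the three verdicts named above
  notes?: string;                             // optional user comment
}
```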
Think of it like a chef reading their own customer comment cards. You don't change the menu because you feel like it. You change it because the same dish keeps coming back half-eaten with the same complaint written on the back.
What The Log Actually Said
Three patterns showed up so often they couldn't be ignored. Each one became a rule the protocol now enforces by default.
| Pattern in the log | What it meant |
|---|---|
| 14 entries used different names for the same scoring dimensions (process vs process_quality, ripple vs ripple_integrity) | Schema drift — the analytics dataset was quietly breaking (drift scan sketched below) |
| 4 entries shipped below the 0.92 composite threshold but were rescued by the “adjusted” score | The actual gating practice didn't match the documented rule |
| 3 entries with notes like “pieces look terrible” despite passing every code-level check | The visual audit phase wasn't catching procedural-content failures |
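The drift check itself is mechanical once the log is parsed. A minimal sketch, assuming the FeedbackEntry shape above and a partial canonical key list — both are illustrative, not the skill's actual code:

```ts
// Scan feedback.jsonl for dimension keys outside the canonical set.
import { readFileSync } from 'node:fs';

const CANONICAL = new Set(['process_quality', 'ripple_integrity']); // partial, illustrative

const drifted = readFileSync('feedback.jsonl', 'utf8')
  .trim()
  .split('\n')
  .map(line => JSON.parse(line) as { scores?: Record<string, number> })
  .flatMap(entry => Object.keys(entry.scores ?? {}))
  .filter(key => !CANONICAL.has(key));

if (drifted.length > 0) {
  console.warn('schema drift detected:', [...new Set(drifted)]);
}
```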
The Three Fixes That Mattered Most
Schema discipline
Delivery now requires the canonical dimension keys verbatim. No abbreviations, no ad-hoc dimensions per task. Adjust weights if needed, never the dimension set.
Adjusted-score gating
The adjusted composite gates delivery only when the rationale documents which dimensions were down-weighted and why. No rationale → original score applies.
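Spelled out as code, the rule is a one-line guard. A sketch with assumed names (gate, THRESHOLD); only the rationale-required mechanic comes from the rule above:

```ts
// Without a documented rationale, the adjusted score is ignored and the
// original composite gates delivery. Names here are assumptions.
const THRESHOLD = 0.92;

function gate(original: number, adjusted: number, rationale?: string): boolean {
  const effective = rationale ? adjusted : original;
  return effective >= THRESHOLD;
}

// gate(0.89, 0.94)                                  → false
// gate(0.89, 0.94, 'Deep Research: internal refactor') → true
```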
Per-item visual evidence
If the task makes 6 distinct things, you need 6 close-up screenshots. Wide shots that look fine from a distance are how bad output ships.
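Enforcing that is a simple completeness check. A sketch with hypothetical record names, not the protocol's actual phase code:

```ts
// Hypothetical check: every produced item needs its own close-up shot.
interface Screenshot { subject: string; closeup: boolean }

function assertVisualEvidence(items: string[], shots: Screenshot[]): void {
  for (const item of items) {
    const covered = shots.some(s => s.subject === item && s.closeup);
    if (!covered) throw new Error(`no close-up evidence for "${item}"`);
  }
}
```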
Plus 13 more
The routing table's Code entry became a Feature/Refactor split. Operating Rules were deduplicated. A dead reference to the old beta protocol was dropped. The live task-tracker reference was cleaned up.
3 Patterns → 3 Fixes
```js
// delivery rubric
process: 0.95,         // ad-hoc abbreviation
process_quality: 0.95, // canonical key
ripple: 1.00,          // ad-hoc abbreviation
ripple_integrity: 1.00 // canonical key
```

```js
// gating decision
if (adjusted >= 0.92) ship();              // before: adjusted score gated unconditionally
if (adjusted >= 0.92 && rationale) ship(); // after: rationale required
else apply(originalScore);
```

```js
// visual phase 1c
capture({ scope: 'wide' });                                 // before: one wide shot
for (const item of items) capture(item, { closeup: true }); // after: per-item close-ups
```
Each fix is anchored in the log · no rule without a pattern
Why Ignition, Not One-Shot
The original instinct was “run One-Shot on itself.” Two seconds of thought killed that idea — a skill modifying its own scaffolding while running on that scaffolding is a recipe for losing your work.
Ignition is the safer sibling. Stage 1 is collaborative and gated — ideation, research, blind-spot review, with explicit approval before any file is touched. Stage 2 only fires after you sign off on the change list.
📖 Stage 1 reads all 18 scripts and all 14 feedback.jsonl entries
↓
🧠 Propose 16 surgical edits, each tied to a specific log pattern or defect
↓
🚪 Approval gate: user reviews the change list, picks defaults, approves
↓
⚡ Stage 2 fires: edits applied, cross-script references verified, task tracker closed
↓
✅ Score, deliver, log outcome, emit receipt
Dimension Scorecard
The full scorecard from the run log. Including the lowest one — that's the honest detail.
| Dimension | Score | Notes |
|---|---|---|
| Context Load | 1.00 | read all 18 scripts + every feedback entry |
| Execution | 1.00 | 16 of 16 planned edits applied |
| Testing | 0.90 | cross-reference scan, no runtime suite for markdown |
| Security | 1.00 | no attack surface in internal docs |
| Process Quality | 0.95 | diagnose-before-build followed end to end |
| Deep Research | 0.85 | flagged LOW relevance — pure internal refactor |
| Alternatives | 0.95 | 3 options weighed for each major edit |
| Documentation | 1.00 | version bumped, change rationale captured |
| Ripple Integrity | 1.00 | all 18 cross-script references verified |
| Polish | 0.95 | one stale heading missed in first pass, fixed in loop |
The Honest Trade-off
Deep Research scored 0.85 — the lowest of any dimension. The rubric expects a research brief with external sources, asset URLs, API verification.
For this task, that would have been fake compliance. There's no library to verify, no asset to source, no domain knowledge to confirm — the inputs were our own past runs. The rubric flagged it as LOW relevance and down-weighted it, which is exactly what the new adjusted-score rule is for.
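The arithmetic behind that down-weighting is an ordinary weighted mean; the only protocol-specific part is that a lowered weight must carry a rationale. A sketch with invented names and weights:

```ts
// Illustrative adjusted composite: weighted mean over dimension scores,
// where any down-weight must document why. Names and weights are made up.
type Adjustment = { weight: number; rationale: string };

function adjustedComposite(
  scores: Record<string, number>,
  adjustments: Record<string, Adjustment> = {},
): number {
  let total = 0;
  let weightSum = 0;
  for (const [dim, score] of Object.entries(scores)) {
    const adj = adjustments[dim];
    if (adj && !adj.rationale) throw new Error(`down-weight of ${dim} needs a rationale`);
    const w = adj?.weight ?? 1; // default: every dimension weighs 1
    total += score * w;
    weightSum += w;
  }
  return total / weightSum;
}
```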
Do
Down-weight a dimension when the rubric genuinely doesn't fit the task — and write the rationale.
Don't
Inflate a low score with hollow research just to clear the threshold. The score is a gift; honesty is the whole point.
The Recursive Bit
The new rules in v3.2 are the rules that would have caught the failures in v3.1's own log. Schema discipline would have stopped the dimension-name drift. Adjusted-score gating would have made the four below-threshold ships explicit instead of implicit. Per-item visual evidence would have flagged the “pieces look terrible” runs in Phase 1c, before any code ran.
The next 14 entries in feedback.jsonl are the test. If the same patterns reappear, the rules need teeth. If they don't, v3.2 actually learned something.
Run the same protocol on your own builds
One-Shot-Scripts is free. Ignition adds the Stage 1 planning gate for high-stakes work.