/blog-post-GM — a Claude Code skill we evolved with our own Evolution engine to write every post in the Godmode voice.
Outcome Tracking: Skills That Learn From Users, Not Just Themselves
📊 The problem: Skills scored themselves — no reality check from actual users
🔧 The fix: Every One-Shot run now asks: Shipped, Edited, or Rejected?
🌐 The scale: Outcomes are shared via API so all users feed the same evolution engine
✅ The result: A live dashboard at /outcomes tracking real-world skill performance
Until now, One-Shot scored its own work. It evaluated 9 dimensions, calculated a composite, and decided whether to loop or ship. That scoring is rigorous — but it’s still a student grading their own exam.
Today, the student gets a teacher.
The Problem With Self-Assessment
One-Shot’s internal scoring catches real issues. A missing error handler drops the security score. An untested edge case drops coverage. These are measurable, and the skill fixes them before delivery.
But some failures are invisible to the rubric.
Rubric Blind Spots
The scoring dimensions might not cover what actually matters for a specific task type.
Score Inflation
A skill can score 0.95 and still produce output the user needs to edit. The rubric says “great” — the user says “close, but no.”
Silent Drift
Without external signal, a skill can evolve toward gaming its own rubric rather than serving users.
Unknown Unknowns
Some quality dimensions only exist in the user’s head. No rubric can anticipate every preference.
Think of it like a restaurant. A chef who only tastes their own food might think every dish is perfect. But the reviews on the wall — “great steak, soggy fries” — tell a different story. You need both the chef’s palate AND the customer’s feedback.
How Outcome Tracking Works
After every One-Shot delivery, the skill now asks one question:
How did this output land?
S. Shipped as-is (no edits needed)
E. Edited (usable but needed changes)
R. Rejected (started over or abandoned)
That single letter — S, E, or R — gets logged alongside the full scorecard: every dimension score, the composite, the task type, and any notes about what needed work.
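One way to picture the logged record is as a single structure bundling the scorecard and the verdict. This is an illustrative sketch only: the field names, the example dimensions, and the unweighted-mean composite are assumptions, not the skill's actual schema.

```python
from datetime import datetime, timezone

def make_outcome_record(skill, task_type, scores, verdict, weak_areas=None, notes=""):
    """Bundle the internal scorecard and the user's S/E/R verdict into one record."""
    assert verdict in {"S", "E", "R"}
    return {
        "skill": skill,
        "task_type": task_type,
        "scores": scores,                                 # per-dimension, 0.0-1.0
        "composite": sum(scores.values()) / len(scores),  # simple mean for the sketch
        "verdict": verdict,                               # S shipped, E edited, R rejected
        "weak_areas": weak_areas or [],                   # notes on what needed work
        "notes": notes,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

record = make_outcome_record(
    "one-shot-scripts", "bugfix",
    {"security": 0.9, "testing": 0.8},
    "E", notes="missed an edge case",
)
```

The point of keeping the verdict in the same record as the scores is that every later analysis — ship rate, rubric gap, dimension correlations — is just a scan over one flat log.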
⚡ One-Shot run
↓
📊 Internal scoring (9 dimensions)
↓
📦 Delivery to user
↓
👤 User verdict: S / E / R
↓
💾 Log locally + POST to shared API
↓
🔬 Evolution engine pulls real-world data
From Local to Global: The Shared Outcomes API
Here’s where it gets interesting. Outcome data isn’t just saved on your machine. It’s also sent to a shared API that every user contributes to.
This means the evolution engine doesn’t just learn from one person’s experience. It learns from everyone’s.
Local Logging
- Saved to feedback.jsonl in the skill directory
- Works offline
- Your personal history
- Available for local evo runs
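Appending to a JSONL log is a one-liner per run; a minimal sketch, assuming the record is a plain dict (the real skill's file location and schema may differ):

```python
import json
from pathlib import Path

def log_outcome(record: dict, path: Path) -> None:
    """Append one outcome as a single JSON line.

    JSONL keeps the file append-only: no rewriting, no locking headaches,
    and the whole history stays greppable line by line.
    """
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```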
Remote API
- POST to shared outcomes endpoint
- Aggregated across all users
- Powers the public dashboard
- Feeds cross-user evolution
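The remote half can be as small as a fire-and-forget POST. A sketch using only the standard library; the endpoint URL here is a placeholder, and swallowing failures is a design assumption (logging should never block delivery):

```python
import json
import urllib.request

API_URL = "https://example.com/api/outcomes"  # placeholder, not the real endpoint

def build_request(record: dict) -> urllib.request.Request:
    """Serialize an outcome record into a JSON POST request."""
    body = json.dumps(record).encode("utf-8")
    return urllib.request.Request(
        API_URL, data=body,
        headers={"Content-Type": "application/json"}, method="POST",
    )

def post_outcome(record: dict, timeout: float = 5.0) -> int:
    """Send one outcome; return the HTTP status, or -1 on any network failure.

    Failures are deliberately non-fatal: the local JSONL log already has the
    record, so the shared API is best-effort.
    """
    try:
        with urllib.request.urlopen(build_request(record), timeout=timeout) as resp:
            return resp.status
    except OSError:
        return -1  # offline or endpoint down; local history is still intact
```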
The API accepts outcomes from any Godmode skill, not just One-Shot Scripts. As more skills adopt outcome tracking, the dataset grows — and the evolution engine gets smarter.
The Outcomes Dashboard
All of this data flows into a live dashboard at /outcomes. It shows the numbers that matter:
Ship Rate
What percentage of One-Shot outputs ship without edits. The North Star metric.
Avg Composite
The mean self-assessed score across all runs. Useful when compared against ship rate.
Dimension Averages
Which scoring dimensions consistently score lowest? That’s where the next mutation should aim.
Verdict Breakdown
S vs E vs R distribution. A high “E” rate means the skill is close but not there yet.
The gap matters most. If average composite is 0.95 but ship rate is 60%, the rubric is measuring the wrong things. That gap is the single most valuable signal for evolution.
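Computing that gap from the outcome log takes a few lines. A sketch, assuming records shaped like `{"composite": float, "verdict": "S" | "E" | "R"}`:

```python
def rubric_gap(records):
    """Return (avg composite, ship rate, gap).

    A large positive gap means the rubric is optimistic about outputs
    that users still had to edit or reject.
    """
    avg = sum(r["composite"] for r in records) / len(records)
    ship_rate = sum(r["verdict"] == "S" for r in records) / len(records)
    return avg, ship_rate, avg - ship_rate

# High self-scores, low ship rate: the miscalibration case from the text.
runs = [
    {"composite": 0.95, "verdict": "S"},
    {"composite": 0.96, "verdict": "E"},
    {"composite": 0.94, "verdict": "E"},
    {"composite": 0.95, "verdict": "R"},
]
avg, ship, gap = rubric_gap(runs)
```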
How This Feeds Evolution
The Evo-Loop and Evolution engine already mutate skills based on dimensional scores. Outcome tracking adds a layer they never had: ground truth.
Pattern Detection
If bugfix tasks consistently get “E” verdicts while scoring 0.93+, the evolution engine knows the bugfix rubric is miscalibrated. It can adjust dimension weights for that task type.
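Spotting a miscalibrated task type can be a simple group-by over the log: flag task types where the average composite stays high while the ship rate stays low. The thresholds below are illustrative, not the engine's actual values:

```python
from collections import defaultdict

def miscalibrated_task_types(records, min_composite=0.9, max_ship_rate=0.5):
    """Return task types the rubric loves but users keep editing or rejecting."""
    by_type = defaultdict(list)
    for r in records:
        by_type[r["task_type"]].append(r)
    flagged = []
    for task_type, runs in by_type.items():
        avg = sum(r["composite"] for r in runs) / len(runs)
        ship = sum(r["verdict"] == "S" for r in runs) / len(runs)
        if avg >= min_composite and ship <= max_ship_rate:
            flagged.append(task_type)
    return flagged
```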
Targeted Mutation
Instead of guessing which phase to mutate, the engine looks at which dimensions correlate with “R” verdicts. If low testing scores predict rejection, testing gets priority in the next mutation.
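That correlation check can be sketched as a per-dimension comparison: how much lower does each dimension score on rejected runs than on the rest? The dimension names here are hypothetical, and the sketch assumes the log contains at least one rejected and one non-rejected run:

```python
def rejection_signals(records):
    """For each dimension, mean score on rejected runs minus mean on the rest.

    The most negative gap is the strongest rejection predictor, so it comes
    first in the returned list and is the best candidate for the next mutation.
    """
    dims = records[0]["scores"].keys()
    rejected = [r for r in records if r["verdict"] == "R"]
    kept = [r for r in records if r["verdict"] != "R"]
    gaps = {}
    for d in dims:
        mean_rej = sum(r["scores"][d] for r in rejected) / len(rejected)
        mean_kept = sum(r["scores"][d] for r in kept) / len(kept)
        gaps[d] = mean_rej - mean_kept
    return sorted(gaps.items(), key=lambda kv: kv[1])  # worst offender first
```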
Rubric Evolution
The scoring rubric itself can evolve. If users consistently edit outputs in ways no dimension captures, that’s a signal to add a new dimension or reweight existing ones.
What We Don’t Collect
Outcome tracking is deliberately minimal. No code is sent. No file contents are shared. No personal information leaves your machine.
📡 Collected — sent to /api/outcomes
Skill name, task type, dimension scores, composite score, verdict (S/E/R), weak areas, timestamp.
🔒 Not Collected — never leaves your machine
Source code, file paths, project names, user identity, conversation content, API keys.
The API is open. You can hit GET /api/outcomes to see exactly what’s stored. Nothing hidden.
Why This Changes Everything
Most AI tools improve through developer intuition. Someone guesses what the prompt should say, tests it a few times, and ships it. That’s fine for v1.
But Godmode skills now have something different: a closed feedback loop from execution to human verdict to mutation.
| Approach | Signal Source | Feedback Speed | Scale |
|---|---|---|---|
| Manual prompt tuning | Developer intuition | Slow (manual testing) | Single developer |
| Self-assessment only | Internal rubric | Fast (same session) | Single run |
| Outcome tracking | Real user verdicts | Fast (post-delivery) | All users, all runs |
This is the difference between a student and a professional. A student improves by studying harder. A professional improves by shipping work, hearing what clients actually think, and adjusting. Outcome tracking makes our skills professionals.
Try It Now
If you’re already using One-Shot Scripts, you have outcome tracking. After your next delivery, you’ll see the S/E/R prompt. Hit a key. That’s it — you just contributed to skill evolution.
Check the live data at /outcomes.
Your Verdict Shapes the Next Version
Every S, E, or R you submit makes the skill better for everyone. Try One-Shot Scripts and see the difference.
Get Access · View Live Outcomes