Built by /blog-post-GM — a Claude Code skill we evolved with our own Evolution engine to write every post in the Godmode voice.

Outcome Tracking: Skills That Learn From Users, Not Just Themselves

TL;DR

📊 The problem: Skills scored themselves — no reality check from actual users
🔧 The fix: Every One-Shot run now asks: Shipped, Edited, or Rejected?
🌐 The scale: Outcomes are shared via API so all users feed the same evolution engine
The result: A live dashboard at /outcomes tracking real-world skill performance
[Live dashboard counters: runs, ship rate, avg composite, gap]
OUTCOMES IN — SKILL OUT — LOOP NEVER ENDS

Until now, One-Shot scored its own work. It rated itself across 9 dimensions, calculated a composite, and decided whether to loop or ship. That scoring is rigorous — but it’s still a student grading their own exam.

Today, the student gets a teacher.

🎯 The Problem With Self-Assessment

One-Shot’s internal scoring catches real issues. A missing error handler drops the security score. An untested edge case drops coverage. These are measurable, and the skill fixes them before delivery.

But some failures are invisible to the rubric.

🔍 Rubric Blind Spots

The scoring dimensions might not cover what actually matters for a specific task type.

📈 Score Inflation

A skill can score 0.95 and still produce output the user needs to edit. The rubric says “great” — the user says “close, but no.”

🔄 Silent Drift

Without external signal, a skill can evolve toward gaming its own rubric rather than serving users.

Unknown Unknowns

Some quality dimensions only exist in the user’s head. No rubric can anticipate every preference.

Think of it like a restaurant. A chef who only tastes their own food might think every dish is perfect. But the reviews on the wall — “great steak, soggy fries” — tell a different story. You need both the chef’s palate AND the customer’s feedback.

⚙️ How Outcome Tracking Works

After every One-Shot delivery, the skill now asks one question:

How did this output land?

  S. Shipped as-is (no edits needed)
  E. Edited (usable but needed changes)
  R. Rejected (started over or abandoned)

That single letter — S, E, or R — gets logged alongside the full scorecard: every dimension score, the composite, the task type, and any notes about what needed work.

🎯 One-Shot executes task → 📊 Internal scoring (9 dimensions) → 📦 Delivery to user → 👤 User verdict: S / E / R → 💾 Log locally + POST to shared API → 🔬 Evolution engine pulls real-world data
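
Concretely, the logging step might look something like this: a minimal Python sketch, assuming a hypothetical record_outcome helper. The field names mirror the scorecard described above but are illustrative, not the skill’s actual schema.

```python
import json
import time
from pathlib import Path

# "Skill directory" is approximated here as the directory this script lives in.
FEEDBACK_LOG = Path(__file__).parent / "feedback.jsonl"

def record_outcome(task_type: str, scores: dict, composite: float, notes: str = "") -> dict:
    """Ask for the S/E/R verdict and append it to the local log."""
    verdict = ""
    while verdict not in ("S", "E", "R"):
        verdict = input("How did this output land? [S]hipped / [E]dited / [R]ejected: ").strip().upper()

    record = {
        "skill": "one-shot-scripts",   # illustrative skill name
        "task_type": task_type,
        "scores": scores,              # the 9 dimension scores
        "composite": composite,        # self-assessed composite
        "verdict": verdict,            # S, E, or R
        "notes": notes,                # what needed work, if anything
        "timestamp": time.time(),
    }
    with FEEDBACK_LOG.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```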

🌐 From Local to Global: The Shared Outcomes API

Here’s where it gets interesting. Outcome data isn’t just saved on your machine. It’s also sent to a shared API that every user contributes to.

This means the evolution engine doesn’t just learn from one person’s experience. It learns from everyone’s.

Local Logging

  • Saved to feedback.jsonl in the skill directory
  • Works offline
  • Your personal history
  • Available for local evo runs

Remote API

  • POST to shared outcomes endpoint
  • Aggregated across all users
  • Powers the public dashboard
  • Feeds cross-user evolution

The API accepts outcomes from any Godmode skill, not just One-Shot Scripts. As more skills adopt outcome tracking, the dataset grows — and the evolution engine gets smarter.
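
The remote half is a single POST of that same record. Here is a sketch under assumptions: the base URL is a placeholder (only the /api/outcomes path is named in this post), and any failure simply falls back to the local log.

```python
import requests  # any HTTP client works; requests keeps the sketch short

# Placeholder base URL; only the /api/outcomes path appears in this post.
OUTCOMES_API = "https://godmode.example/api/outcomes"

def share_outcome(record: dict) -> None:
    """Best-effort POST of the local outcome record to the shared API."""
    try:
        requests.post(OUTCOMES_API, json=record, timeout=5).raise_for_status()
    except requests.RequestException:
        # Offline or unreachable: the feedback.jsonl entry still counts locally.
        pass
```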

📊 The Outcomes Dashboard

All of this data flows into a live dashboard at /outcomes. It shows the numbers that matter:

📦 Ship Rate

The percentage of One-Shot outputs that ship without edits. The North Star metric.

🎯 Avg Composite

The mean self-assessed score across all runs. Useful when compared against ship rate.

📏 Dimension Averages

Which scoring dimensions consistently score lowest? That’s where the next mutation should aim.

⚠️ Verdict Breakdown

The S vs. E vs. R distribution. A high “E” rate means the skill is close but not there yet.

The gap matters most. If average composite is 0.95 but ship rate is 60%, the rubric is measuring the wrong things. That gap is the single most valuable signal for evolution.
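
Those headline numbers fall straight out of the log. A rough sketch of how they could be recomputed locally from feedback.jsonl, assuming the record shape sketched earlier:

```python
import json
from pathlib import Path

def dashboard_numbers(log_path: str = "feedback.jsonl") -> dict:
    """Recompute the headline metrics from the local outcome log."""
    lines = Path(log_path).read_text().splitlines()
    records = [json.loads(line) for line in lines if line.strip()]
    if not records:
        return {"runs": 0}

    runs = len(records)
    ship_rate = sum(r["verdict"] == "S" for r in records) / runs
    avg_composite = sum(r["composite"] for r in records) / runs
    return {
        "runs": runs,
        "ship_rate": round(ship_rate, 2),
        "avg_composite": round(avg_composite, 2),
        # The gap: how much the rubric flatters itself relative to what ships.
        "gap": round(avg_composite - ship_rate, 2),
    }
```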

[Dashboard widget: avg composite vs. ship rate gap, with the verdict split of 52% shipped, 31% edited, 17% rejected]
REWEIGHT THE RUBRIC ALL YOU LIKE — ONLY OUTCOMES MOVE THE SHIP LINE

🔬 How This Feeds Evolution

The Evo-Loop and Evolution engine already mutate skills based on dimensional scores. Outcome tracking adds a layer they never had: ground truth.

1. Pattern Detection

If bugfix tasks consistently get “E” verdicts while scoring 0.93+, the evolution engine knows the bugfix rubric is miscalibrated. It can adjust dimension weights for that task type.

2. Targeted Mutation

Instead of guessing which phase to mutate, the engine looks at which dimensions correlate with “R” verdicts (see the sketch after this list). If low testing scores predict rejection, testing gets priority in the next mutation.

3. Rubric Evolution

The scoring rubric itself can evolve. If users consistently edit outputs in ways no dimension captures, that’s a signal to add a new dimension or reweight existing ones.
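
A minimal sketch of the correlation step from point 2, assuming the same record shape as before: compare each dimension’s average on shipped runs against rejected runs, and aim the next mutation where the gap is widest. This is an illustration, not the engine’s actual analysis.

```python
from collections import defaultdict

def mutation_targets(records: list[dict]) -> list[tuple[str, float]]:
    """Rank dimensions by how much lower they score on rejected runs than shipped ones."""
    buckets = defaultdict(lambda: {"S": [], "R": []})
    for r in records:
        if r["verdict"] in ("S", "R"):
            for dim, score in r["scores"].items():
                buckets[dim][r["verdict"]].append(score)

    gaps = []
    for dim, b in buckets.items():
        if b["S"] and b["R"]:
            # Large positive gap: this dimension tends to be low when users reject.
            gap = sum(b["S"]) / len(b["S"]) - sum(b["R"]) / len(b["R"])
            gaps.append((dim, round(gap, 3)))
    return sorted(gaps, key=lambda pair: pair[1], reverse=True)
```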

🛡️ What We Don’t Collect

Outcome tracking is deliberately minimal. No code is sent. No file contents are shared. No personal information leaves your machine.

Collected

Skill name, task type, dimension scores, composite score, verdict (S/E/R), weak areas, timestamp.

Not Collected

Source code, file paths, project names, user identity, conversation content, API keys.

The API is open. You can hit GET /api/outcomes to see exactly what’s stored. Nothing hidden.
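
For example, inspecting the stored data might be as simple as the following, assuming the endpoint returns a JSON array; the base URL and field names are placeholders.

```python
import requests

# Placeholder base URL and field names; the response shape is an assumption.
resp = requests.get("https://godmode.example/api/outcomes", timeout=5)
for outcome in resp.json():
    print(outcome.get("skill"), outcome.get("verdict"), outcome.get("composite"))
```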

[Interactive demo: drag each field into “📡 SHARED — sent to /api/outcomes” or “🔒 STAYS LOCAL — never leaves your machine”]
GET /api/outcomes — THE 7 FIELDS WE COLLECT, THE 5 WE NEVER TOUCH

🚀 Why This Changes Everything

Most AI tools improve through developer intuition. Someone guesses what the prompt should say, tests it a few times, and ships it. That’s fine for v1.

But Godmode skills now have something different: a closed feedback loop from execution to human verdict to mutation.

Approach              | Signal Source       | Feedback Speed        | Scale
Manual prompt tuning  | Developer intuition | Slow (manual testing) | Single developer
Self-assessment only  | Internal rubric     | Fast (same session)   | Single run
Outcome tracking      | Real user verdicts  | Fast (post-delivery)  | All users, all runs

This is the difference between a student and a professional. A student improves by studying harder. A professional improves by shipping work, hearing what clients actually think, and adjusting. Outcome tracking makes our skills professionals.

[Chart: skill quality over time]
STUDENT PLATEAUS · PROFESSIONAL CLIMBS — FEEDBACK LOOP CHANGES EVERYTHING

Try It Now

If you’re already using One-Shot Scripts, you have outcome tracking. After your next delivery, you’ll see the S/E/R prompt. Hit a key. That’s it — you just contributed to skill evolution.

Check the live data at /outcomes.

Your Verdict Shapes the Next Version

Every S, E, or R you submit makes the skill better for everyone. Try One-Shot Scripts and see the difference.
