We Killed Workers Mid-Thought — Then Taught Orchestra Not To
💀 The bug: Two orchestra workers got force-killed mid-conversation when the user re-engaged them after they finished.
🔍 The cause: The reaper used result.json existing as the kill signal. It didn't notice the human had taken the mic back.
🔧 The fix: Sample the worker's CPU for 3 seconds after the signoff window. If the model is generating, back off and tell the user they own the close.
Same scenario, two reaper strategies — only one keeps the conversation alive
What Orchestra Does
Orchestra is our multi-agent execution mode for Claude Code. A lean main session (the Orchestrator) spawns each phase as a fresh Claude Code worker in its own terminal tab.
Each worker reads a brief, does the work, writes result.json, posts a signoff to the shared chat.md, and then sits at the prompt waiting. A separate reaper script watches for result.json and closes the tab.
The Bug, In One Sentence
The reaper killed two workers in the middle of follow-up conversations because it never checked whether anyone was talking to them.
Here's how it played out. A worker finished its phase, wrote result.json, and posted its signoff. The user saw the worker was "done" and typed a follow-up question into the same tab. The model started generating an answer. About thirty seconds later, the reaper killed the entire process tree mid-sentence.
Think of it like a bouncer clearing finished customers from a booth. Normally fine — but if the customer just started chatting with someone new, you don't drag them out by the collar. You wait, and ideally hand them a key and walk away.
We Were Asking The Wrong Question
The old reaper logic only knew about one signal: does result.json exist? If yes, wait thirty seconds for the chat signoff, then kill the tab.
That works fine when the worker is autonomous. It fails the moment a human takes the mic back — because "is the work done?" and "is the model thinking right now?" are two different questions, and the reaper was only asking the first one.
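To make the gap concrete, here is a minimal Python sketch of the old decision as described above. It is an illustration, not the shipped reaper script: the function name, the callables `result_exists` and `signoff_seen`, and the one-second poll interval are all assumptions introduced for the example. Only the single `result.json` check and the thirty-second signoff wait come from the post.

```python
import time
from typing import Callable


def old_reaper_should_kill(
    result_exists: Callable[[], bool],   # hypothetical: "is result.json on disk?"
    signoff_seen: Callable[[], bool],    # hypothetical: "is the signoff in chat.md?"
    signoff_wait_s: float = 30.0,        # thirty-second wait, from the post
    poll_s: float = 1.0,                 # assumed poll interval
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True when the old logic would kill the worker's tab."""
    if not result_exists():
        return False  # the only question asked: is the work done?

    # Give the worker up to signoff_wait_s to post its signoff.
    waited = 0.0
    while waited < signoff_wait_s and not signoff_seen():
        sleep(poll_s)
        waited += poll_s

    # Nothing here looks at what the worker process is doing right now,
    # so a follow-up turn in progress is killed along with everything else.
    return True
```

Notice that no input to this function could ever say "the model is mid-answer". That missing input is what the fix adds.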
The Signal That Worked
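An interactive Claude Code session burns very different amounts of CPU depending on what it's doing: little while the TUI sits idle at the prompt, noticeably more while the model is working through a turn. That gap is the signal we now sample over a 3-second window before deciding to kill.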
The threshold is set at roughly 17% average over 3 seconds — well above the noise floor of an idle TUI, well below the floor of an active turn. In code, that's a CPU-time delta of 0.5 seconds across the 3-second wall-clock window.
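The arithmetic is 0.5 CPU-seconds / 3 wall-clock seconds ≈ 16.7%, which is where the "roughly 17%" comes from. The sketch below shows the sampling idea in Python under stated assumptions. The real reaper is described as a single PowerShell call, so this is an illustration of the technique rather than the shipped code. `read_cpu_seconds` is a hypothetical callable returning the process's cumulative CPU time in seconds (with the third-party psutil package, summing the `user` and `system` fields of `Process.cpu_times()` would be one way to supply it). Treating a delta exactly at the threshold as busy is also an assumption.

```python
import time
from typing import Callable

WINDOW_S = 3.0      # wall-clock sampling window, from the post
BUSY_DELTA_S = 0.5  # CPU-time delta threshold, from the post (about 17%)


def cpu_delta(
    read_cpu_seconds: Callable[[], float],  # hypothetical CPU-time reader
    window_s: float = WINDOW_S,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """CPU-seconds the process consumed across one sampling window."""
    before = read_cpu_seconds()
    sleep(window_s)
    return read_cpu_seconds() - before


def is_busy(delta_s: float, threshold_s: float = BUSY_DELTA_S) -> bool:
    """True when the delta meets the threshold (>= is an assumption)."""
    return delta_s >= threshold_s
```

Comparing a CPU-time delta rather than an instantaneous percentage keeps the check to two reads and one subtraction, with no dependence on how any particular tool averages its percentages.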
The New Decision Tree
↓
⏱️ wait up to 30s for chat signoff
↓
📊 sample claude.exe CPU for 3 seconds
↓
🔠 busy? — leave tab open, post system note, exit
↓
💤 idle? — kill the worker process tree
↓
✅ orchestrator gets result.json either way
Timeouts still kill regardless — if the worker never writes result.json within its budget, the reaper doesn't wait around to check what it's "really" doing.
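The tree and the timeout rule can be written as one small pure function. This is a sketch under assumptions: the `Action` names and the `decide` signature are invented for illustration, and `busy` is assumed to be the outcome of the 3-second CPU sample taken after the signoff wait.

```python
from enum import Enum


class Action(Enum):
    KEEP_WATCHING = "keep_watching"  # no result yet, budget not exhausted
    BACK_OFF = "back_off"            # leave tab open, post system note, exit
    KILL_TREE = "kill_tree"          # kill the worker process tree


def decide(result_exists: bool, timed_out: bool, busy: bool) -> Action:
    """Map the reaper's observations to an action (hypothetical helper)."""
    if not result_exists:
        # Timeouts kill regardless of CPU activity.
        return Action.KILL_TREE if timed_out else Action.KEEP_WATCHING
    if busy:
        # Work is done but the model is generating: the human owns the close.
        return Action.BACK_OFF
    return Action.KILL_TREE
```

Keeping the decision free of I/O means every branch can be checked with plain booleans, separate from the file watching and process sampling that feed it.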
Telling The Human
Detection without communication is half a fix. When the reaper backs off, it appends a system note to chat.md — which renders live in the browser viewer the user is already watching:
### [system] 2026-04-26T14:22:08+10:00 — auto-close skipped for Researcher-A
Worker Researcher-A finished and wrote result.json, but you've started chatting with it again.
The auto-reaper has backed off — this terminal will NOT close on its own. Close the tab manually when you're done.
---
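A note in that shape could be produced along these lines. This is a hedged Python sketch, not the actual reaper code: `format_backoff_note` and `append_note` are hypothetical names, and the wording is copied from the example note above. The timestamp needs a timezone-aware `datetime` so the UTC offset appears in the header.

```python
from datetime import datetime


def format_backoff_note(worker: str, now: datetime) -> str:
    """Build the back-off system note; `now` must be timezone-aware."""
    stamp = now.isoformat(timespec="seconds")  # e.g. 2026-04-26T14:22:08+10:00
    return (
        f"### [system] {stamp} \u2014 auto-close skipped for {worker}\n"
        f"Worker {worker} finished and wrote result.json, "
        "but you've started chatting with it again.\n"
        "The auto-reaper has backed off \u2014 this terminal will NOT close "
        "on its own. Close the tab manually when you're done.\n"
        "---\n"
    )


def append_note(chat_path: str, note: str) -> None:
    """Append the note to chat.md without rewriting existing content."""
    with open(chat_path, "a", encoding="utf-8") as fh:
        fh.write(note)
```

Opening the file in append mode leaves earlier chat entries untouched, which matters when several workers post to the same chat.md.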
The worker's own prompt also got a small addendum, so the model knows what to do when a human message arrives after STEP 5:
NOTE — IF THE HUMAN STARTS CHATTING WITH YOU AFTER YOU FINISH:
The auto-reaper waits ~30 seconds after result.json appears, then samples your CPU.
If you're mid-turn (the human typed a follow-up), it backs off and leaves your terminal
open — the human owns closing it manually. So if a human message arrives after
STEP 5, just answer it normally; treat the session as a regular Claude Code chat
from that point on.
Why Not Other Signals?
We considered three other ways to detect re-engagement before settling on CPU sampling.
| Approach | Why we rejected it |
|---|---|
| Worker writes a sentinel file (.busy-NAME) on each turn | Requires the model to remember to clear it. One forgotten clear and the worker is immortal. |
| Watch the transcript JSONL mtime in ~/.claude/projects/ | Multiple workers share the same project folder. Mapping name → session UUID is a brittle dance. |
| Hook into Claude Code internals to ask "are you generating?" | Couples the reaper to the TUI's private state. Breaks every Claude Code release. |
| 3-second CPU sample on claude.exe | Zero coupling. One PowerShell call. Same answer every time. |
Do / Don't For Reaping Live Sessions
Do
Treat "completed" and "currently active" as two independent signals. Read both before destroying state.
Don't
Assume a "done" agent is also idle. The human can take the mic back any time the tab is interactive.
Do
Tell the user when you change behavior. A backed-off reaper that says nothing is indistinguishable from a broken one.
Don't
Couple the cleanup logic to the agent's internal state. Process-level signals stay valid across version bumps.
The Lesson
"Done" isn't a single bit when humans share sessions with autonomous agents. The system that owns cleanup needs both a logical signal (the worker said it finished) and a runtime signal (nothing's happening right now). Neither alone is enough.
Core insight: When you give a human the ability to grab the mic back from an autonomous agent, your "the agent is done" signal becomes advisory. Read the live process state before doing anything destructive.
The fix shipped as one-shot-orchestra v0.16.1. Two files touched, about thirty lines added.