Lesson 02 — Startup recovery wedge
Grounding: issue #1595, PR #1604 (merged), and the resulting code on main in
packages/agents/src/index.ts. The PR was filed in response to a production
incident: a class of Durable Objects was permanently unreachable until
someone deleted SQL rows out of band.
1. What broke and why it mattered
Production traffic on a high-volume Agent worker was accumulating
blockConcurrencyWhile timeouts at a steady rate: about
2,370 in seven days, with one specific Durable Object hitting
66 in two days, all spaced at roughly 30-second
intervals. Every hit came back as the same error from
workerd:
Error: A call to blockConcurrencyWhile() in a Durable Object waited for
too long. The call was canceled and the Durable Object was reset.
"Reset" sounds like recovery. It wasn't. The DO came back, the same code path ran, the same timeout triggered, the DO reset again. Every entry point - fetch, WebSocket upgrade, native RPC, alarm - produced the same error. The only way to unstick the DO was to manually delete rows from its SQLite tables.
The issue had a label and an assignee (me). It was reassigned and fixed by Sunil in PR #1604 before I started. This lesson reconstructs both the failure and the fix so the next time something with this shape shows up, I recognise it.
2. Primer: BCW and partyserver startup
Two concepts you need before the rest makes sense.
2.1 blockConcurrencyWhile (BCW)
A Durable Object can call
this.ctx.blockConcurrencyWhile(async () => { … })
to hold back every incoming event - fetch, WebSocket message, RPC,
alarm - until the callback resolves. While the gate is held the DO
still accepts requests; it just doesn't dispatch them.
workerd enforces a hard ceiling: roughly 30 seconds of wall time. If the callback hasn't resolved by then, workerd throws inside the callback, the DO is reset, and the original promise rejects. The ceiling is not configurable.
2.2 partyserver startup
The Agent base class extends partyserver's
Server. Partyserver wraps the user's
onStart in a single
blockConcurrencyWhile call so consumers can do
startup work (restore in-memory state, validate schema, open
connections) without worrying about a request arriving
mid-initialisation.
That means anything the Agent class adds to its own
onStart wrapper - and everything user code awaits
inside onStart - is on the BCW critical path.
Implication
Awaiting cross-DO RPC inside onStart is dangerous.
The other DO has its own startup that may itself await cross-DO
RPC. Total wall time is the sum of every step in that tree, and
the first leaf that exceeds 30 seconds kills the whole tree.
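The arithmetic behind that warning can be sketched with a toy model. Everything here is hypothetical (the StartupNode shape, the millisecond figures, the names); the only thing taken from the text is the roughly 30-second ceiling and the idea that each startup awaits its children in sequence.

```typescript
// Hypothetical model: each node is a DO whose startup does some local work
// and then awaits each child's full startup, one after another.
interface StartupNode {
  name: string;
  ownWorkMs: number;       // local restore work done inside the gate
  children: StartupNode[]; // children awaited sequentially during startup
}

const GATE_BUDGET_MS = 30_000; // approximate ceiling described above

// Wall time a node's gate is held: its own work plus every child's gate time.
function gateHeldMs(node: StartupNode): number {
  return node.children.reduce(
    (total, child) => total + gateHeldMs(child),
    node.ownWorkMs,
  );
}

function exceedsBudget(node: StartupNode): boolean {
  return gateHeldMs(node) > GATE_BUDGET_MS;
}

const grandchild: StartupNode = { name: "grandchild", ownWorkMs: 12_000, children: [] };
const child1: StartupNode = { name: "child-1", ownWorkMs: 9_000, children: [grandchild] };
const child2: StartupNode = { name: "child-2", ownWorkMs: 8_000, children: [] };
const parent: StartupNode = { name: "parent", ownWorkMs: 2_000, children: [child1, child2] };

console.log(gateHeldMs(child1));    // 21000: under budget on its own
console.log(gateHeldMs(parent));    // 31000: over budget once everything is summed
console.log(exceedsBudget(parent)); // true
```

No single node in this example is slow enough to trip the ceiling by itself; the parent goes over only because it pays for the whole tree.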
3. Cast of characters
| Name | What it is | Reference |
|---|---|---|
| cf_agent_tool_runs | SQLite table on every Agent that records sub-agent invocations. One row per child run, with status, parent_tool_call_id, started_at, terminal output, etc. Persists across restarts. | packages/agents/src/index.ts |
| _reconcileAgentToolRuns | The recovery routine: for every row in starting or running state, walk over to the child facet, inspect its actual status, and finalize the parent row (completed / interrupted) and fire onAgentToolFinish hooks. | same file |
| _cf_resolveSubAgent + _cf_initAsFacet | The path that materialises a child DO and runs its onStart via __unsafe_ensureInitialized(). Triggered when the parent inspects a child. | same file |
| ctx.waitUntil | Workers primitive: keep the DO alive long enough for a promise to resolve, but don't block the response. The promise can run after the gate that triggered it releases. | workerd |
| Single-flight | Pattern where overlapping calls share one in-progress promise instead of starting their own. Used here so a second wake while recovery is still running doesn't schedule a second recovery. | idiom |
4. The deadlock chain
Picture a parent Agent with two sub-agent runs in flight. The
worker crashes, evicts, or just gets a new isolate. The two child
rows in cf_agent_tool_runs remain in
running status. Now a request arrives and wakes the
parent.
4.1 The old startup wrapper
// agents/src/index.ts, before #1604
await this.ctx.blockConcurrencyWhile(async () => {
  // …other restore work…
  await this._checkRunFibers();
  const recoveredAgentToolFinishes = await this._reconcileAgentToolRuns({
    deferFinishHooks: true,
  });
  this._insideOnStart = true;
  try {
    result = await _onStart(); // user's onStart
  } finally {
    this._insideOnStart = false;
  }
  await this._runDeferredAgentToolFinishHooks(recoveredAgentToolFinishes);
  return result;
});
_reconcileAgentToolRuns walks every stale row
sequentially. For each row it calls
_cf_resolveSubAgent(agent_type, run_id) which
triggers the child's own
__unsafe_ensureInitialized() - that is, the
child's blockConcurrencyWhile(onStart). If
the child also has stale rows, it does the same thing for its own
children, and so on.
4.2 The recursion in pictures
sequenceDiagram
  autonumber
  participant Req as Request
  participant P as Parent DO
  participant C1 as Child 1 DO
  participant C2 as Child 2 DO
  participant G as Grandchild DO
  Req->>P: any wake event
  P->>P: blockConcurrencyWhile begins (30s budget starts)
  P->>P: _reconcileAgentToolRuns starts
  P->>C1: _cf_resolveSubAgent
  C1->>C1: blockConcurrencyWhile begins (child has its own gate)
  C1->>G: _cf_resolveSubAgent (child also has stale rows)
  G->>G: blockConcurrencyWhile begins
  Note over G: grandchild's tree is large or wedged
  Note over P,G: ~30s wall time elapsed
  G--xC1: workerd cancels grandchild gate
  C1--xP: workerd cancels child gate
  P--xReq: parent gate also cancelled, DO reset
  Note over P: rows still in 'running' status. Next wake repeats.
The crucial detail
_reconcileAgentToolRuns finalised rows only
after per-row work completed. If workerd interrupted
the loop mid-row, no rows were marked terminal. On the next
wake the same set of running rows triggered the
same cascade. Deterministic, durable wedge.
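A small simulation makes the "same rows on every wake" behaviour concrete. This is a simplified model, not the Agent code: time is simulated by summing made-up per-row costs, writes are assumed to land only after the whole loop finishes, and the names (RunRow, wakeWithDeferredWrites) are hypothetical.

```typescript
type RunStatus = "running" | "completed" | "interrupted";

interface RunRow {
  runId: string;
  status: RunStatus;
  inspectCostMs: number; // simulated wall time to drive this child's startup
}

// One simulated wake with the old shape: inspect every stale row, write afterwards.
// `budgetMs` stands in for the gate ceiling.
function wakeWithDeferredWrites(
  table: RunRow[],
  budgetMs: number,
): "finalized" | "cancelled" {
  let elapsedMs = 0;
  const inspected: RunRow[] = [];
  for (const row of table) {
    if (row.status !== "running") continue;
    elapsedMs += row.inspectCostMs;
    if (elapsedMs > budgetMs) return "cancelled"; // reset: nothing below ever runs
    inspected.push(row);
  }
  for (const row of inspected) row.status = "completed"; // writes happen only here
  return "finalized";
}

const table: RunRow[] = [
  { runId: "a", status: "running", inspectCostMs: 10_000 },
  { runId: "b", status: "running", inspectCostMs: 25_000 }, // the slow child
];

// Three wakes in a row: each one is cancelled and leaves the table untouched.
const outcomes = [1, 2, 3].map(() => wakeWithDeferredWrites(table, 30_000));
console.log(outcomes);                   // [ 'cancelled', 'cancelled', 'cancelled' ]
console.log(table.map((r) => r.status)); // [ 'running', 'running' ]
```

Because the input to each wake is identical to the input of the previous one, so is the result.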
4.3 Why #1578's fix wasn't enough
Issue #1577 / PR #1578 had already fixed a related deadlock: a facet whose DO id collided with its root would make an RPC into itself and deadlock on its own BCW lock instantly. That fix added path-v2 identities so child DO ids no longer collide with the root.
#1595's wedge is on the same critical path but with a different
cause. The children here are distinct DO instances. No
self-RPC. No identity collision. The cost is just the sum of
sequentially driving N children through their own full
onStart while the parent holds its own 30-second
gate. With a deep enough tree, or one stuck child, that sum
exceeds the budget on its own.
5. Why it was permanent, not transient
A normal 30-second timeout retries on the next request and eventually succeeds. This one didn't. Three properties combined to make it durable.
| Property | Consequence |
|---|---|
| Stale rows live in SQLite. | They survive isolate eviction, deploy rollouts, OOMs. |
| Rows are only finalized after recovery completes. | Cancellation throws away in-flight finalization work. Same rows on next wake. |
| Every entry point goes through partyserver's ensureInitialized. | No way to bypass onStart from the outside. Even a "wake the DO and ignore the result" request hits the same gate. |
Production evidence in the issue: every cancelled parent gate
correlated to-the-millisecond with a child's
_cf_initAsFacet being cancelled. The hang was
independent of the wake trigger.
6. The fix, step by step
The fix in PR #1604 has four moving parts. They all sit in
packages/agents/src/index.ts.
6.1 Move reconciliation out of the startup gate
The old onStart wrapper awaited
_reconcileAgentToolRuns. The new one snapshots and
defers.
// agents/src/index.ts, after #1604
await this.ctx.blockConcurrencyWhile(async () => {
  // …other restore work…
  await this._checkRunFibers();
  const startupAgentToolRunIds = this._agentToolRunRecoveryRunIds(); // ← snapshot
  this._insideOnStart = true;
  try {
    result = await _onStart(); // user's onStart
  } finally {
    this._insideOnStart = false;
  }
  this._scheduleAgentToolRunRecovery({ // ← background
    runIds: startupAgentToolRunIds,
  });
  return result;
});
6.2 Schedule the background work
private _scheduleAgentToolRunRecovery(options?: {
  childInspectionTimeoutMs?: number;
  runIds?: readonly string[];
}): Promise<void> {
  if (this._agentToolRunRecoveryPromise) {
    return this._agentToolRunRecoveryPromise; // single-flight
  }
  if (options?.runIds && options.runIds.length === 0) {
    return Promise.resolve();
  }
  const recovery = (async () => {
    await new Promise<void>((resolve) => setTimeout(resolve, 0));
    const recoveredAgentToolFinishes = await this._reconcileAgentToolRuns({
      deferFinishHooks: true,
      childInspectionTimeoutMs: options?.childInspectionTimeoutMs,
      runIds: options?.runIds,
    });
    await this._runDeferredAgentToolFinishHooks(recoveredAgentToolFinishes);
  })()
    .catch(async (error) => {
      try { await this.onError(error); }
      catch { /* never wedge */ }
    })
    .finally(() => { this._agentToolRunRecoveryPromise = undefined; });
  this._agentToolRunRecoveryPromise = recovery;
  this.ctx.waitUntil(recovery); // keep DO alive but don't gate
  return recovery;
}
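The single-flight part of that method can be isolated into a few lines. The helper below is a hypothetical, generic sketch of the idiom (singleFlight is not a function in the agents package); it only illustrates the "reuse the in-flight promise, clear it in finally" shape.

```typescript
// Hypothetical helper: wraps a task so overlapping calls share one in-flight promise.
function singleFlight<T>(task: () => Promise<T>): () => Promise<T> {
  let inFlight: Promise<T> | undefined;
  return () => {
    if (inFlight) return inFlight; // a run is already in progress: join it
    const started = task().finally(() => {
      inFlight = undefined; // allow a later call to start a fresh run
    });
    inFlight = started;
    return started;
  };
}

let runs = 0;
const recover = singleFlight(async () => {
  runs += 1;
  await new Promise<void>((resolve) => setTimeout(resolve, 10));
  return runs;
});

async function demo(): Promise<number[]> {
  const [first, second] = await Promise.all([recover(), recover()]); // overlapping wakes
  const third = await recover(); // called after the first run has settled
  return [first, second, third];
}

demo().then((results) => console.log(results)); // [ 1, 1, 2 ]
```

The two overlapping calls observe the same run; the third call starts a new one because the slot was cleared when the first run settled.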
6.3 Bound per-row inspection with a race
Even when recovery is no longer on the BCW critical path, an
individual stuck child shouldn't be able to block the parent's
background work forever. The new
_inspectAgentToolRunForRecovery wraps the resolve +
init call in a 2-second
Promise.race:
private async _inspectAgentToolRunForRecovery(
  row: AgentToolRunStorageRow,
  _sequence: number,
  timeoutMs = DEFAULT_AGENT_TOOL_RECOVERY_TIMEOUT_MS, // 2_000
): Promise<AgentToolRecoveryInspection> {
  const inspect = (async (): Promise<AgentToolRecoveryInspection> => {
    const child = await this._cf_resolveSubAgent(row.agent_type, row.run_id);
    const adapter = this._asAgentToolChildAdapter(child);
    const inspection = await adapter.inspectAgentToolRun(row.run_id);
    return { status: "inspected", adapter, inspection };
  })().catch((): AgentToolRecoveryInspection => ({ status: "failed" }));
  if (timeoutMs <= 0) return inspect;
  let timeoutId: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<AgentToolRecoveryInspection>((resolve) => {
    timeoutId = setTimeout(() => resolve({ status: "timed-out" }), timeoutMs);
  });
  const result = await Promise.race([inspect, timeout]);
  if (timeoutId !== undefined) clearTimeout(timeoutId);
  return result;
}
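Stripped of the Agent-specific types, the same shape works for any promise. The sketch below is a hypothetical generic version (Outcome and inspectWithTimeout are illustrative names, and the timings are arbitrary); it shows failures and timeouts being turned into values so the caller never has to catch.

```typescript
type Outcome<T> =
  | { status: "inspected"; value: T }
  | { status: "failed" }
  | { status: "timed-out" };

// Hypothetical generic helper: race `work` against a timer and report a
// three-way outcome instead of throwing.
async function inspectWithTimeout<T>(
  work: () => Promise<T>,
  timeoutMs: number,
): Promise<Outcome<T>> {
  const inspect = (async (): Promise<Outcome<T>> => ({
    status: "inspected",
    value: await work(),
  }))().catch((): Outcome<T> => ({ status: "failed" }));

  let timeoutId: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<Outcome<T>>((resolve) => {
    timeoutId = setTimeout(() => resolve({ status: "timed-out" }), timeoutMs);
  });
  try {
    return await Promise.race([inspect, timeout]);
  } finally {
    if (timeoutId !== undefined) clearTimeout(timeoutId); // don't leave the timer pending
  }
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function demo(): Promise<string[]> {
  const quick = await inspectWithTimeout(async () => "done", 50);
  const broken = await inspectWithTimeout(async () => { throw new Error("boom"); }, 50);
  const stuck = await inspectWithTimeout(() => sleep(200), 20);
  return [quick.status, broken.status, stuck.status];
}

demo().then((statuses) => console.log(statuses)); // [ 'inspected', 'failed', 'timed-out' ]
```

Note that Promise.race does not cancel the loser: the stuck work keeps running in the background, and the race only stops the caller from waiting on it.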
6.4 Terminal-finalize timed-out rows as interrupted
Inside the per-row loop in _reconcileAgentToolRuns,
a timed-out outcome no longer leaves the row
untouched. It writes a terminal
interrupted result so the same row never re-enters
recovery.
} else if (recovery.status === "timed-out") {
  result = {
    runId: row.run_id,
    agentType: row.agent_type,
    status: "interrupted",
    error: "Agent tool run inspection timed out during parent recovery.",
  };
} else { // status === "failed"
  result = {
    runId: row.run_id,
    agentType: row.agent_type,
    status: "interrupted",
    // …
  };
}
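Why the durable write matters can be shown with an in-memory stand-in for the table. This is a simplified sketch under stated assumptions: the Row shape, the Inspection values, and the reconcile function are hypothetical, and inspection is modelled as a synchronous callback rather than a cross-DO call.

```typescript
type Status = "starting" | "running" | "completed" | "interrupted";

interface Row {
  runId: string;
  status: Status;
  error?: string;
}

// Hypothetical three-way inspection result, standing in for the real outcome type.
type Inspection = "child-completed" | "timed-out" | "failed";

// Every stale row leaves this loop with a terminal status, whatever the outcome.
// Returns how many rows were finalized on this pass.
function reconcile(rows: Row[], inspect: (row: Row) => Inspection): number {
  let finalized = 0;
  for (const row of rows) {
    if (row.status !== "starting" && row.status !== "running") continue;
    const outcome = inspect(row);
    if (outcome === "child-completed") {
      row.status = "completed";
    } else {
      row.status = "interrupted";
      row.error = outcome === "timed-out" ? "inspection timed out" : "inspection failed";
    }
    finalized += 1;
  }
  return finalized;
}

const rows: Row[] = [
  { runId: "ok", status: "running" },
  { runId: "stuck", status: "running" },
];
const inspect = (row: Row): Inspection =>
  row.runId === "stuck" ? "timed-out" : "child-completed";

console.log(reconcile(rows, inspect)); // 2: both rows reach a terminal status
console.log(reconcile(rows, inspect)); // 0: the next wake has nothing left to recover
console.log(rows.map((r) => `${r.runId}:${r.status}`)); // [ 'ok:completed', 'stuck:interrupted' ]
```

The second pass finding zero stale rows is the property that breaks the loop: a timed-out child costs the parent one bounded wait, once.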
6.5 The complete new picture
sequenceDiagram
  autonumber
  participant Req as Request
  participant P as Parent DO
  participant BG as Recovery via waitUntil
  participant C as Stuck Child
  Req->>P: any wake event
  P->>P: blockConcurrencyWhile begins
  P->>P: _agentToolRunRecoveryRunIds runs a SELECT
  P->>P: user onStart runs
  P->>BG: _scheduleAgentToolRunRecovery
  P-->>Req: response served
  Note over P,BG: gate released, DO is alive and serving
  BG->>C: _cf_resolveSubAgent inside a 2s race
  Note over C: child wedged
  BG->>BG: timeout fires
  BG->>BG: write status interrupted for that row
  BG-->>BG: next row
7. The snapshot is doing real work
One detail that's easy to miss:
_agentToolRunRecoveryRunIds() returns the set of
running/starting
run_ids at the moment the parent's
onStart is about to run. That snapshot is
passed into the background task.
Why bother? Because user code can call runAgentTool
from inside onStart (or immediately after). Those
new runs will write their own rows with
status = 'running' seconds before the background
recovery actually starts. Without the snapshot, the recovery loop
would see those legitimate new rows, fail to inspect them within
2 seconds (the child is still busy executing a brand-new request,
not stuck), and write them off as interrupted. That
would be a regression that silently kills new tool calls every
time the DO restarts.
The snapshot draws a clean line: "rows that existed before
startup got a chance to run." Anything created during or after
onStart is the running system's concern, not
recovery's.
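The effect of the snapshot can be sketched with an in-memory table. The names here (snapshotStaleRunIds, recover, the Row shape) are hypothetical stand-ins, and the recovery step assumes the worst case described above, where every inspection times out.

```typescript
interface Row {
  runId: string;
  status: "running" | "interrupted";
}

// Hypothetical stand-in for the snapshot query: ids of rows that are stale right now.
function snapshotStaleRunIds(table: Row[]): readonly string[] {
  return table.filter((r) => r.status === "running").map((r) => r.runId);
}

// Recovery only touches ids in `runIds` when a snapshot is supplied.
// Returns the ids it marked as interrupted.
function recover(table: Row[], runIds?: readonly string[]): string[] {
  const allowed = runIds ? new Set(runIds) : undefined;
  const touched: string[] = [];
  for (const row of table) {
    if (row.status !== "running") continue;
    if (allowed && !allowed.has(row.runId)) continue;
    row.status = "interrupted"; // worst case: the inspection timed out
    touched.push(row.runId);
  }
  return touched;
}

const table: Row[] = [{ runId: "stale-1", status: "running" }];
const snapshot = snapshotStaleRunIds(table);          // taken before onStart runs
table.push({ runId: "fresh-1", status: "running" });  // user code starts a new run in onStart

// Without a snapshot (run on a copy), the fresh run is swept up too.
console.log(recover(table.map((r) => ({ ...r })))); // [ 'stale-1', 'fresh-1' ]
// With the snapshot, only the pre-startup row is touched.
console.log(recover(table, snapshot));              // [ 'stale-1' ]
```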
Test that locks this in
"only recovers rows that were stale before Think startup began" in
packages/think/src/tests/agent-tools.test.ts.
8. Trade-offs Sunil accepted
The PR body is explicit about behavior changes. They are real trade-offs the fix takes deliberately:
- onStart() can now observe stale tool-run rows. User code that reads cf_agent_tool_runs from onStart will see running rows that are about to become interrupted. The PR weighs that against the alternative (DO never starts) and accepts it.
- Recovered onAgentToolFinish hooks run after startup, not before. Anything that depended on those hooks firing before user onStart no longer can.
- Clients may briefly see a stale running tool before receiving the recovered terminal event over the wire.
- A row classified as interrupted may have actually completed successfully on the child if the inspection just happened to time out. This is the "small amount of correctness for a hard cap on cost" trade the issue itself proposed.
9. The reusable pattern
Boil the fix down. It's a five-step template for any "startup wants to do recovery that touches another DO" situation.
- Cheap synchronous snapshot. Inside the startup gate, read just enough state to know what to recover. A SELECT against your own SQLite is fine.
- Schedule, don't await. Hand the snapshot to a background task via ctx.waitUntil. Yield with setTimeout(0) first so the current task can finish and release the gate before recovery begins.
- Single-flight the background task. Stash the in-flight promise on the instance so overlapping wakes don't schedule duplicate work. Clear it in finally.
- Bound per-item work. Promise.race against a small timeout. Carry an outcome enum, not a boolean, so callers can distinguish success from failure from timeout.
- Write a terminal outcome on timeout. The row must not look the same on the next wake. Even a "give up" outcome must be durable.
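Put together, the five steps fit in one small class. This is a minimal in-memory sketch, not the Agent implementation: RecoveringObject, its Map-backed "table", and the injected inspect function are all hypothetical, and a real Durable Object would additionally hand the returned promise to ctx.waitUntil instead of letting a caller await it.

```typescript
type Status = "running" | "completed" | "interrupted";
type Outcome = "ok" | "failed" | "timed-out";

class RecoveringObject {
  private inFlight: Promise<void> | undefined;

  constructor(
    private readonly rows: Map<string, Status>,              // stand-in for the DO's own SQLite
    private readonly inspect: (id: string) => Promise<void>, // stand-in for the cross-DO call
    private readonly timeoutMs: number,
  ) {}

  // Steps 1 and 2: snapshot cheaply, then schedule. The startup gate would not await this.
  start(): Promise<void> {
    const ids = Array.from(this.rows.entries())
      .filter(([, status]) => status === "running")
      .map(([id]) => id);
    return this.schedule(ids);
  }

  // Step 3: single-flight the background task.
  private schedule(ids: readonly string[]): Promise<void> {
    if (this.inFlight) return this.inFlight;
    const run = (async () => {
      await new Promise<void>((resolve) => setTimeout(resolve, 0)); // yield first
      for (const id of ids) {
        const outcome = await this.bounded(id);
        // Step 5: every outcome, including timeout, writes a terminal status.
        this.rows.set(id, outcome === "ok" ? "completed" : "interrupted");
      }
    })().finally(() => {
      this.inFlight = undefined;
    });
    this.inFlight = run;
    return run;
  }

  // Step 4: bound each item with a race and report a three-way outcome.
  private async bounded(id: string): Promise<Outcome> {
    const work = this.inspect(id).then((): Outcome => "ok", (): Outcome => "failed");
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<Outcome>((resolve) => {
      timer = setTimeout(() => resolve("timed-out"), this.timeoutMs);
    });
    const outcome = await Promise.race([work, timeout]);
    if (timer !== undefined) clearTimeout(timer);
    return outcome;
  }
}

const rows = new Map<string, Status>([["fast", "running"], ["stuck", "running"]]);
const obj = new RecoveringObject(
  rows,
  (id) => new Promise<void>((resolve) => setTimeout(resolve, id === "stuck" ? 500 : 1)),
  20,
);
obj.start().then(() => console.log(Array.from(rows.entries())));
// [ [ 'fast', 'completed' ], [ 'stuck', 'interrupted' ] ]
```

The stuck item costs one bounded wait and then gets a durable terminal status, so a second call to start() would find nothing left to recover.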
When this pattern applies
Any place where a Durable Object needs to reconcile
cross-DO state at startup. The shape is independent of agents:
if you're awaiting a remote RPC inside
blockConcurrencyWhile, you are one stuck remote
away from this bug.
10. Self check
Q1. Why was the wedge permanent rather than transient?
Answer:
Three things had to align. The stale state lived in
durable SQLite, so it survived resets. The recovery loop
wrote terminal status only after per-row work completed,
so cancellation produced zero updates. And every entry
point on the DO went through the same partyserver
ensureInitialized gate, so there was no path
around the failing code.
Q2. Why didn't PR #1578 (facet identity fix) help here?
Answer:
#1578 eliminated one specific deadlock: a facet whose DO id
collided with its root, so it RPC'd into itself and blocked
on its own gate instantly. The children in #1595's scenario
are distinct DO instances. The cost comes from the
cumulative wall time of sequentially driving every
child (and grandchild) through its own
onStart while the parent's 30-second gate is
held. No identity collision is required.
Q3. Why does the snapshot of run ids matter? What goes wrong without it?
Answer:
Without the snapshot, the background recovery loop runs over
whatever rows are running when it actually
starts. That set includes any new tool runs initiated by
user code during onStart. Those new runs are
legitimately busy, will not be inspectable within 2 seconds,
and would be written off as interrupted by the
timeout branch. The snapshot draws a clean before/after
line so only pre-startup rows are touched.
Q4. Why ctx.waitUntil rather than just an
unawaited promise?
Answer:
Without waitUntil, the Workers runtime is free
to suspend the DO as soon as the request handler returns,
and the background recovery may never run to completion.
waitUntil tells the runtime "keep this DO alive
until this promise settles," so recovery either completes or
gets at most one more chance, instead of being dropped on
the floor.
Q5. The fix accepts that a row may be marked
interrupted when the child actually completed
successfully. Why is that acceptable?
Answer:
The issue framed the choice explicitly: a small amount of correctness for a hard cap on cost. The alternative is a Durable Object that no client can ever reach. A child whose inspection times out is by definition uncooperative for at least 2 seconds; the recovered terminal status downstream (interrupted) is far easier to handle than an unreachable parent.
Q6. Sketch the five-step pattern for "startup wants to recover something cross-DO."
Answer:
- Snapshot what to recover from your own SQLite inside the gate.
- Schedule the work in ctx.waitUntil after yielding with setTimeout(0).
- Single-flight the background task on a private promise.
- Race each item against a small timeout; carry an outcome enum.
- Write a terminal status on timeout so the next wake doesn't repeat it.