Lesson 02 — Startup recovery wedge

Grounding: issue #1595, PR #1604 (merged), and the resulting code on main in packages/agents/src/index.ts. The PR was filed in response to a production incident: a class of Durable Objects was permanently unreachable until someone deleted SQL rows out of band.

1. What broke and why it mattered

Production traffic on a high-volume Agent worker was accumulating blockConcurrencyWhile timeouts at a steady rate: about 2,370 in seven days, with one specific Durable Object hitting 66 in two days, all spaced at roughly 30-second intervals. Every hit came back as the same error from workerd:

Error: A call to blockConcurrencyWhile() in a Durable Object waited for
too long. The call was canceled and the Durable Object was reset.

"Reset" sounds like recovery. It wasn't. The DO came back, the same code path ran, the same timeout triggered, the DO reset again. Every entry point - fetch, WebSocket upgrade, native RPC, alarm - produced the same error. The only way to unstick the DO was to manually delete rows from its SQLite tables.

The issue had a label and an assignee (me). It was reassigned and fixed by Sunil in PR #1604 before I started. This lesson reconstructs both the failure and the fix so the next time something with this shape shows up, I recognise it.

2. Primer: BCW and partyserver startup

Two concepts you need before the rest makes sense.

2.1 blockConcurrencyWhile (BCW)

A Durable Object can call this.ctx.blockConcurrencyWhile(async () => { … }) to hold back every incoming event - fetch, WebSocket message, RPC, alarm - until the callback resolves. While the gate is held the DO still accepts requests; it just doesn't dispatch them.

workerd enforces a hard ceiling: roughly 30 seconds of wall time. If the callback hasn't resolved by then, workerd throws inside the callback, the DO is reset, and the original promise rejects. The ceiling is not configurable.
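
To make the ceiling concrete, here is a small simulation of the observable behaviour. It is a sketch, not how workerd implements the gate: gateWithCeiling, sleep and demoGate are hypothetical names, and the ceiling is passed in as a parameter only so the demo runs in milliseconds rather than 30 seconds.

```typescript
// Hypothetical model of a startup gate with a hard ceiling.
// It mimics only the observable behaviour, not workerd's implementation.
const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function gateWithCeiling<T>(
  callback: () => Promise<T>,
  ceilingMs: number, // the real ceiling is roughly 30_000 and not configurable
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const ceiling = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error("gate held too long")), ceilingMs);
  });
  try {
    // Whichever settles first decides the outcome of the gate.
    return await Promise.race([callback(), ceiling]);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Usage: a fast callback passes, a slow one is cut off at the ceiling.
async function demoGate(): Promise<string[]> {
  const log: string[] = [];
  log.push(await gateWithCeiling(async () => { await sleep(5); return "ready"; }, 50));
  try {
    await gateWithCeiling(() => sleep(200), 20);
  } catch (error) {
    log.push(error instanceof Error ? error.message : "unknown");
  }
  return log;
}
```

The point of the model: the caller of the gate sees a rejection, and nothing the callback was in the middle of gets a chance to finish.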

2.2 partyserver startup

The Agent base class extends partyserver's Server. Partyserver wraps the user's onStart in a single blockConcurrencyWhile call so consumers can do startup work (restore in-memory state, validate schema, open connections) without worrying about a request arriving mid-initialisation.

That means anything the Agent class adds to its own onStart wrapper - and everything user code awaits inside onStart - is on the BCW critical path.

Implication

Awaiting cross-DO RPC inside onStart is dangerous. The other DO has its own startup that may itself await cross-DO RPC. Total wall time is the sum of every step in that tree, and once that sum passes 30 seconds the gate at the top is cancelled, even if no single step was slow on its own.
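
A quick worked example of that arithmetic. The tree shape and the millisecond figures are invented for illustration; StartupNode and gateHeldMs are hypothetical names, and the model assumes each node awaits its children one after another.

```typescript
// Hypothetical model: each node is a DO whose startup awaits its children one by one.
interface StartupNode {
  ownMs: number;           // time spent on the node's own startup work
  children: StartupNode[]; // children it awaits sequentially during startup
}

// Wall time the node's gate is held: own work plus every child's full startup.
function gateHeldMs(node: StartupNode): number {
  return node.children.reduce((total, child) => total + gateHeldMs(child), node.ownMs);
}

const BUDGET_MS = 30_000;

const leaf = (ownMs: number): StartupNode => ({ ownMs, children: [] });
const parent: StartupNode = {
  ownMs: 500,
  children: [
    { ownMs: 2_000, children: [leaf(9_000), leaf(9_000)] }, // 20_000 in total
    { ownMs: 2_000, children: [leaf(9_000)] },              // 11_000 in total
  ],
};

const held = gateHeldMs(parent); // 500 + 20_000 + 11_000 = 31_500
console.log(held > BUDGET_MS ? "parent gate is cancelled" : "parent starts");
```

No node in this tree takes more than 9 seconds on its own, yet the parent holds its gate for 31.5 seconds and is cancelled.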

3. Cast of characters

| Name | What it is | Reference |
| --- | --- | --- |
| cf_agent_tool_runs | SQLite table on every Agent that records sub-agent invocations. One row per child run, with status, parent_tool_call_id, started_at, terminal output, etc. Persists across restarts. | packages/agents/src/index.ts |
| _reconcileAgentToolRuns | The recovery routine: for every row in starting or running state, walk over to the child facet, inspect its actual status, finalize the parent row (completed / interrupted), and fire onAgentToolFinish hooks. | same file |
| _cf_resolveSubAgent + _cf_initAsFacet | The path that materialises a child DO and runs its onStart via __unsafe_ensureInitialized(). Triggered when the parent inspects a child. | same file |
| ctx.waitUntil | Workers primitive: keep the DO alive long enough for a promise to resolve, but don't block the response. The promise can run after the gate that triggered it releases. | workerd |
| Single-flight | Pattern where overlapping calls share one in-progress promise instead of starting their own. Used here so a second wake while recovery is still running doesn't schedule a second recovery. | idiom |

4. The deadlock chain

Picture a parent Agent with two sub-agent runs in flight. The worker crashes, is evicted, or just gets a new isolate. The two child rows in cf_agent_tool_runs remain in running status. Now a request arrives and wakes the parent.

4.1 The old startup wrapper

// agents/src/index.ts, before #1604
await this.ctx.blockConcurrencyWhile(async () => {
  // …other restore work…
  await this._checkRunFibers();
  const recoveredAgentToolFinishes = await this._reconcileAgentToolRuns({
    deferFinishHooks: true,
  });

  this._insideOnStart = true;
  try {
    result = await _onStart();          // user's onStart
  } finally {
    this._insideOnStart = false;
  }

  await this._runDeferredAgentToolFinishHooks(recoveredAgentToolFinishes);
  return result;
});

_reconcileAgentToolRuns walks every stale row sequentially. For each row it calls _cf_resolveSubAgent(agent_type, run_id) which triggers the child's own __unsafe_ensureInitialized() - that is, the child's blockConcurrencyWhile(onStart). If the child also has stale rows, it does the same thing for its own children, and so on.

4.2 The recursion in pictures

sequenceDiagram
  autonumber
  participant Req as Request
  participant P as Parent DO
  participant C1 as Child 1 DO
  participant C2 as Child 2 DO
  participant G as Grandchild DO

  Req->>P: any wake event
  P->>P: blockConcurrencyWhile begins (30s budget starts)
  P->>P: _reconcileAgentToolRuns starts
  P->>C1: _cf_resolveSubAgent
  C1->>C1: blockConcurrencyWhile begins (child has its own gate)
  C1->>G: _cf_resolveSubAgent (child also has stale rows)
  G->>G: blockConcurrencyWhile begins
  Note over G: grandchild's tree is large or wedged
  Note over P,G: ~30s wall time elapsed
  G--xC1: workerd cancels grandchild gate
  C1--xP: workerd cancels child gate
  P--xReq: parent gate also cancelled, DO reset
  Note over P: rows still in 'running' status. Next wake repeats.

The crucial detail

_reconcileAgentToolRuns finalised rows only after per-row work completed. If workerd interrupted the loop mid-row, no rows were marked terminal. On the next wake the same set of running rows triggered the same cascade. Deterministic, durable wedge.
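
The loop is easy to reproduce in miniature. This is a toy model with simulated costs instead of real waiting: StaleRow, inspectCostMs and wakeWithInlineRecovery are hypothetical names, and the 45-second cost is invented. It only shows why a write that happens after the slow step never lands.

```typescript
// Hypothetical in-memory stand-in for the durable cf_agent_tool_runs rows.
interface StaleRow {
  runId: string;
  status: "running" | "completed";
  inspectCostMs: number; // simulated cost of driving the child through its startup
}

const GATE_BUDGET_MS = 30_000;
const durableRows: StaleRow[] = [
  { runId: "run-1", status: "running", inspectCostMs: 45_000 },
];

// One wake under the old design: inspect inside the gate, write the row afterwards.
function wakeWithInlineRecovery(rows: StaleRow[]): "started" | "reset" {
  let elapsedMs = 0;
  for (const row of rows) {
    if (row.status !== "running") continue;
    elapsedMs += row.inspectCostMs;
    if (elapsedMs > GATE_BUDGET_MS) return "reset"; // cancelled before the write below
    row.status = "completed";
  }
  return "started";
}

const outcomes = [1, 2, 3].map(() => wakeWithInlineRecovery(durableRows));
console.log(outcomes.join(", "), "| row is still", durableRows[0].status);
```

Every wake ends in a reset and the row is exactly as it was, so the fourth wake will behave like the first.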

4.3 Why #1578's fix wasn't enough

Issue #1577 / PR #1578 had already fixed a related deadlock: a facet whose DO id collided with its root would make an RPC into itself and deadlock on its own BCW lock instantly. That fix added path-v2 identities so child DO ids no longer collide with the root.

#1595's wedge is on the same critical path but with a different cause. The children here are distinct DO instances. No self-RPC. No identity collision. The cost is just the sum of sequentially driving N children through their own full onStart while the parent holds its own 30-second gate. With a deep enough tree, or one stuck child, that sum exceeds the budget on its own.

5. Why it was permanent, not transient

After a normal 30-second timeout, the next request retries startup and eventually succeeds. This one didn't. Three properties combined to make it durable.

| Property | Consequence |
| --- | --- |
| Stale rows live in SQLite. | They survive isolate eviction, deploy rollouts, OOMs. |
| Rows are only finalized after recovery completes. | Cancellation throws away in-flight finalization work. Same rows on next wake. |
| Every entry point goes through partyserver's ensureInitialized. | No way to bypass onStart from the outside. Even a "wake the DO and ignore the result" request hits the same gate. |

Production evidence in the issue: every cancelled parent gate correlated, to the millisecond, with a child's _cf_initAsFacet being cancelled. The hang was independent of the wake trigger.

6. The fix, step by step

The fix in PR #1604 has four moving parts. They all sit in packages/agents/src/index.ts.

6.1 Move reconciliation out of the startup gate

The old onStart wrapper awaited _reconcileAgentToolRuns. The new one snapshots and defers.

// agents/src/index.ts, after #1604
await this.ctx.blockConcurrencyWhile(async () => {
  // …other restore work…
  await this._checkRunFibers();
  const startupAgentToolRunIds = this._agentToolRunRecoveryRunIds();   // ← snapshot

  this._insideOnStart = true;
  try {
    result = await _onStart();                                          // user's onStart
  } finally {
    this._insideOnStart = false;
  }

  this._scheduleAgentToolRunRecovery({                                  // ← background
    runIds: startupAgentToolRunIds,
  });
  return result;
});

6.2 Schedule the background work

private _scheduleAgentToolRunRecovery(options?: {
  childInspectionTimeoutMs?: number;
  runIds?: readonly string[];
}): Promise<void> {
  if (this._agentToolRunRecoveryPromise) {
    return this._agentToolRunRecoveryPromise;          // single-flight
  }

  if (options?.runIds && options.runIds.length === 0) {
    return Promise.resolve();
  }

  const recovery = (async () => {
    await new Promise<void>((resolve) => setTimeout(resolve, 0));
    const recoveredAgentToolFinishes = await this._reconcileAgentToolRuns({
      deferFinishHooks: true,
      childInspectionTimeoutMs: options?.childInspectionTimeoutMs,
      runIds: options?.runIds,
    });
    await this._runDeferredAgentToolFinishHooks(recoveredAgentToolFinishes);
  })()
    .catch(async (error) => {
      try { await this.onError(error); }
      catch { /* never wedge */ }
    })
    .finally(() => { this._agentToolRunRecoveryPromise = undefined; });

  this._agentToolRunRecoveryPromise = recovery;
  this.ctx.waitUntil(recovery);                          // keep DO alive but don't gate
  return recovery;
}
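
Stripped of the agent specifics, the single-flight part of that method can be sketched on its own. SingleFlight is a hypothetical helper, not something the agents package exports, and it leaves out the ctx.waitUntil call and the onError reporting from the real method.

```typescript
// Hypothetical helper that isolates the single-flight idiom.
class SingleFlight {
  private inFlight: Promise<void> | undefined;

  run(task: () => Promise<void>): Promise<void> {
    if (this.inFlight) return this.inFlight; // join the run already in progress

    const current = task()
      .catch(() => {
        /* swallow: a failed run must not poison later calls */
      })
      .finally(() => {
        this.inFlight = undefined; // the next call starts a fresh run
      });

    this.inFlight = current;
    return current;
  }
}
```

Two calls made while a run is in progress get the same promise back. A call made after it settles starts a new run, which is why the slot is cleared in finally rather than only on success.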

6.3 Bound per-row inspection with a race

Even when recovery is no longer on the BCW critical path, an individual stuck child shouldn't be able to block the parent's background work forever. The new _inspectAgentToolRunForRecovery wraps the resolve + init call in a 2-second Promise.race:

private async _inspectAgentToolRunForRecovery(
  row: AgentToolRunStorageRow,
  _sequence: number,
  timeoutMs = DEFAULT_AGENT_TOOL_RECOVERY_TIMEOUT_MS,  // 2_000
): Promise<AgentToolRecoveryInspection> {
  const inspect = (async (): Promise<AgentToolRecoveryInspection> => {
    const child = await this._cf_resolveSubAgent(row.agent_type, row.run_id);
    const adapter = this._asAgentToolChildAdapter(child);
    const inspection = await adapter.inspectAgentToolRun(row.run_id);
    return { status: "inspected", adapter, inspection };
  })().catch((): AgentToolRecoveryInspection => ({ status: "failed" }));

  if (timeoutMs <= 0) return inspect;

  let timeoutId: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<AgentToolRecoveryInspection>((resolve) => {
    timeoutId = setTimeout(() => resolve({ status: "timed-out" }), timeoutMs);
  });

  const result = await Promise.race([inspect, timeout]);
  if (timeoutId !== undefined) clearTimeout(timeoutId);
  return result;
}

6.4 Terminal-finalize timed-out rows as interrupted

Inside the per-row loop in _reconcileAgentToolRuns, a timed-out outcome no longer leaves the row untouched. It writes a terminal interrupted result so the same row never re-enters recovery.

} else if (recovery.status === "timed-out") {
  result = {
    runId: row.run_id,
    agentType: row.agent_type,
    status: "interrupted",
    error: "Agent tool run inspection timed out during parent recovery.",
  };
} else {                            // status === "failed"
  result = {
    runId: row.run_id,
    agentType: row.agent_type,
    status: "interrupted",
    // …
  };
}
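
Why the terminal write matters is easiest to see by running recovery twice. The sketch below uses simplified, hypothetical shapes (ToolRunRow, Outcome, recoveryPass) and one loud assumption: a successful inspection is treated as "child completed", which is a shortcut and not what the real reconciliation logic does in full.

```typescript
// Hypothetical, simplified shapes; the real ones live in packages/agents/src/index.ts.
type RowStatus = "running" | "completed" | "interrupted";
type Outcome = "inspected" | "timed-out" | "failed";
interface ToolRunRow {
  runId: string;
  status: RowStatus;
}

// One recovery pass. Returns how many rows it had to touch.
function recoveryPass(
  rows: ToolRunRow[],
  inspect: (row: ToolRunRow) => Outcome,
): number {
  let touched = 0;
  for (const row of rows) {
    if (row.status !== "running") continue; // terminal rows never re-enter recovery
    const outcome = inspect(row);
    // Assumption for brevity: a successful inspection means the child completed.
    // Both "timed-out" and "failed" still end in a durable terminal status.
    row.status = outcome === "inspected" ? "completed" : "interrupted";
    touched += 1;
  }
  return touched;
}

const rows: ToolRunRow[] = [
  { runId: "ok-child", status: "running" },
  { runId: "stuck-child", status: "running" },
];
const inspect = (row: ToolRunRow): Outcome =>
  row.runId === "stuck-child" ? "timed-out" : "inspected";

const firstPass = recoveryPass(rows, inspect);  // touches both rows
const secondPass = recoveryPass(rows, inspect); // nothing left in "running"
console.log(firstPass, secondPass, rows.map((row) => row.status).join(", "));
```

The second pass has nothing to do, even though the stuck child never answered. That is the property the old code lacked.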

6.5 The complete new picture

sequenceDiagram
  autonumber
  participant Req as Request
  participant P as Parent DO
  participant BG as Recovery via waitUntil
  participant C as Stuck Child

  Req->>P: any wake event
  P->>P: blockConcurrencyWhile begins
  P->>P: _agentToolRunRecoveryRunIds runs a SELECT
  P->>P: user onStart runs
  P->>BG: _scheduleAgentToolRunRecovery
  P-->>Req: response served
  Note over P,BG: gate released, DO is alive and serving
  BG->>C: _cf_resolveSubAgent inside a 2s race
  Note over C: child wedged
  BG->>BG: timeout fires
  BG->>BG: write status interrupted for that row
  BG-->>BG: next row

7. The snapshot is doing real work

One detail that's easy to miss: _agentToolRunRecoveryRunIds() returns the set of running/starting run_ids at the moment the parent's onStart is about to run. That snapshot is passed into the background task.

Why bother? Because user code can call runAgentTool from inside onStart (or immediately after). Those new runs will write their own rows with status = 'running' seconds before the background recovery actually starts. Without the snapshot, the recovery loop would see those legitimate new rows, fail to inspect them within 2 seconds (the child is still busy executing a brand-new request, not stuck), and write them off as interrupted. That would be a regression that silently kills new tool calls every time the DO restarts.

The snapshot draws a clean line: "rows that existed before startup got a chance to run." Anything created during or after onStart is the running system's concern, not recovery's.
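
A minimal sketch of that line being drawn, using an in-memory array in place of the SQLite table. Row, table and rowsToRecover are hypothetical names; the real snapshot comes from _agentToolRunRecoveryRunIds().

```typescript
interface Row {
  runId: string;
  status: "running" | "completed" | "interrupted";
}

// Hypothetical in-memory table standing in for cf_agent_tool_runs.
const table: Row[] = [
  { runId: "stale-1", status: "running" },
  { runId: "done-1", status: "completed" },
];

// Taken inside the gate, before user onStart runs.
const snapshot: readonly string[] = table
  .filter((row) => row.status === "running")
  .map((row) => row.runId);

// User code starts a brand-new run during onStart.
table.push({ runId: "fresh-1", status: "running" });

// Recovery later considers only running rows that were in the snapshot.
function rowsToRecover(rows: Row[], runIds: readonly string[]): Row[] {
  return rows.filter(
    (row) => row.status === "running" && runIds.indexOf(row.runId) !== -1,
  );
}

console.log(rowsToRecover(table, snapshot).map((row) => row.runId).join(", "));
```

Only stale-1 is eligible. fresh-1 is also in running status, but it was created after the snapshot, so recovery leaves it alone.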

Test that locks this in

"only recovers rows that were stale before Think startup began", in packages/think/src/tests/agent-tools.test.ts.

8. Trade-offs Sunil accepted

The PR body is explicit about the behavior changes. Each is a trade-off the fix makes deliberately:

  • onStart() can now observe stale tool-run rows. User code that reads cf_agent_tool_runs from onStart will see running rows that are about to become interrupted. The PR weighs that against the alternative (DO never starts) and accepts it.
  • Recovered onAgentToolFinish hooks run after startup, not before. Anything that depended on those hooks firing before user onStart no longer can.
  • Clients may briefly see a stale running tool before receiving the recovered terminal event over the wire.
  • A row classified as interrupted may have actually completed successfully on the child if the inspection just happened to time out. This is the "small amount of correctness for a hard cap on cost" trade the issue itself proposed.

9. The reusable pattern

Boil the fix down. It's a five-step template for any "startup wants to do recovery that touches another DO" situation.

  1. Cheap synchronous snapshot. Inside the startup gate, read just enough state to know what to recover. A SELECT against your own SQLite is fine.
  2. Schedule, don't await. Hand the snapshot to a background task via ctx.waitUntil. Have the task yield with setTimeout(0) first, so the startup callback can return and release the gate before recovery begins.
  3. Single-flight the background task. Stash the in-flight promise on the instance so overlapping wakes don't schedule duplicate work. Clear in finally.
  4. Bound per-item work. Promise.race against a small timeout. Carry an outcome enum, not a boolean, so callers can distinguish success from failure from timeout.
  5. Write a terminal outcome on timeout. The row must not look the same on the next wake. Even a "give up" outcome must be durable.
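
The ordering that steps 1 and 2 rely on can be checked without a Durable Object. In this sketch startupCallback stands in for the callback passed to blockConcurrencyWhile and wake stands in for the runtime; both names, along with scheduleRecovery and the events log, are hypothetical, and there is no ctx.waitUntil call because nothing here runs on workerd.

```typescript
const events: string[] = [];
let background: Promise<void> = Promise.resolve();

// Hypothetical background task; in a real DO this promise would also go to ctx.waitUntil.
function scheduleRecovery(snapshot: readonly string[]): Promise<void> {
  return (async () => {
    await new Promise<void>((resolve) => setTimeout(resolve, 0)); // yield to a later task
    events.push(`recovery started for ${snapshot.length} rows`);
  })();
}

// Hypothetical stand-in for the callback passed to blockConcurrencyWhile.
async function startupCallback(): Promise<void> {
  const snapshot = ["run-1", "run-2"];     // step 1: cheap local read
  background = scheduleRecovery(snapshot); // step 2: scheduled, not awaited
  events.push("startup callback returning");
}

async function wake(): Promise<string[]> {
  await startupCallback();
  events.push("gate released");
  await background; // awaited here only so the demo can observe the final order
  return events;
}
```

Calling wake() records "startup callback returning", then "gate released", then "recovery started for 2 rows". The timer callback cannot fire until the current task and its pending promise continuations have run, which is what keeps the recovery work off the gate.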

When this pattern applies

Any place where a Durable Object needs to reconcile cross-DO state at startup. The shape is independent of agents: if you're awaiting a remote RPC inside blockConcurrencyWhile, you are one stuck remote away from this bug.

10. Self check

Q1. Why was the wedge permanent rather than transient?

Three things had to align. The stale state lived in durable SQLite, so it survived resets. The recovery loop wrote terminal status only after per-row work completed, so cancellation produced zero updates. And every entry point on the DO went through the same partyserver ensureInitialized gate, so there was no path around the failing code.

Q2. Why didn't PR #1578 (facet identity fix) help here?

#1578 eliminated one specific deadlock: a facet whose DO id collided with its root, so it RPC'd into itself and blocked on its own gate instantly. The children in #1595's scenario are distinct DO instances. The cost comes from the cumulative wall time of sequentially driving every child (and grandchild) through its own onStart while the parent's 30-second gate is held. No identity collision is required.

Q3. Why does the snapshot of run ids matter? What goes wrong without it?

Without the snapshot, the background recovery loop runs over whatever rows are running when it actually starts. That set includes any new tool runs initiated by user code during onStart. Those new runs are legitimately busy, will not be inspectable within 2 seconds, and would be written off as interrupted by the timeout branch. The snapshot draws a clean before/after line so only pre-startup rows are touched.

Q4. Why ctx.waitUntil rather than just an unawaited promise?

Without waitUntil, the Workers runtime is free to suspend the DO as soon as the request handler returns, and the background recovery may never run to completion. waitUntil tells the runtime "keep this DO alive until this promise settles," so recovery either completes or gets at most one more chance, instead of being dropped on the floor.

Q5. The fix accepts that a row may be marked interrupted when the child actually completed successfully. Why is that acceptable?

The issue framed the choice explicitly: a small amount of correctness for a hard cap on cost. The alternative is a Durable Object that no client can ever reach. A child whose inspection times out has by definition been unresponsive for at least 2 seconds, and an interrupted status is far easier for downstream code to handle than an unreachable parent.

Q6. Sketch the five-step pattern for "startup wants to recover something cross-DO."

  1. Snapshot what to recover from your own SQLite inside the gate.
  2. Schedule the work with ctx.waitUntil, and have it yield with setTimeout(0) before it starts.
  3. Single-flight the background task on a private promise.
  4. Race each item against a small timeout; carry an outcome enum.
  5. Write a terminal status on timeout so the next wake doesn't repeat it.