Non-compaction auto-retry policy

This document describes the standard API-error retry path in AgentSession.

It explicitly excludes context-overflow recovery via auto-compaction. Overflow is handled by compaction logic and is documented separately in compaction.md.

Implementation files

Scope boundary vs compaction

Retry and compaction are checked from the same agent_end path, but they are intentionally separated:

  1. agent_end inspects the last assistant message.
  2. #isRetryableError(...) runs first.
  3. If retry is initiated, compaction checks are skipped for that turn.
  4. Context-overflow errors are hard-excluded from retry classification (isContextOverflow(...) short-circuits retry).
  5. Overflow therefore falls through to #checkCompaction(...) instead of standard retry.

So: overload/rate/server/network-style failures use this retry policy; context-window overflow uses compaction recovery.
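The ordering above can be sketched as a small routing function. This is an illustration only: the message shape, the RoutePredicates interface, and routeAgentEnd are hypothetical names, both predicates are injected because their real implementations are not shown in this document, and the sketch assumes retry is enabled and actually initiated.

```typescript
// Hypothetical message shape; only the fields named in this document are modeled.
interface AssistantMessageSketch {
  stopReason: string;
  errorMessage?: string;
}

type AgentEndRoute = "retry" | "compaction-check";

// Stand-ins for isContextOverflow(...) and the retryable-message check.
interface RoutePredicates {
  isContextOverflow: (msg: AssistantMessageSketch) => boolean;
  matchesRetryablePattern: (errorMessage: string) => boolean;
}

function routeAgentEnd(msg: AssistantMessageSketch, p: RoutePredicates): AgentEndRoute {
  const errorMessage = msg.errorMessage;
  if (msg.stopReason !== "error" || errorMessage === undefined) {
    return "compaction-check"; // not an error turn, so retry is never considered
  }
  if (p.isContextOverflow(msg)) {
    return "compaction-check"; // overflow short-circuits retry classification
  }
  // Retry is checked first; anything not retryable falls through to compaction.
  return p.matchesRetryablePattern(errorMessage) ? "retry" : "compaction-check";
}
```

Note that an overflow error is routed to the compaction check even if its text would also match a retryable pattern, which is the point of the hard exclusion in step 4.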

Retry classification

#isRetryableError(...) requires all of the following:

  • assistant stopReason === "error"
  • errorMessage exists
  • message is not context overflow
  • errorMessage matches #isRetryableErrorMessage(...)

Current retryable pattern set (regex-based):

  • overloaded
  • rate limit / usage limit / too many requests
  • HTTP-like status codes: 429 (rate limiting) and 500, 502, 503, 504 (server errors)
  • service unavailable / server error / internal error
  • connection error / fetch failed
  • retry delay wording

This is string-pattern classification, not typed provider error codes.
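A minimal sketch of this kind of string-pattern classifier follows. The pattern list is an approximation assembled from the bullets above, not the actual regex used by #isRetryableErrorMessage(...); the exact alternations (especially the "retry delay" wording) and the name isRetryableErrorMessageSketch are assumptions.

```typescript
// Approximate pattern set built from the list above; the real regex may differ.
const RETRYABLE_ERROR_PATTERN = new RegExp(
  [
    "overloaded",
    "rate.?limit",
    "usage.?limit",
    "too many requests",
    "\\b(?:429|500|502|503|504)\\b", // status codes as standalone tokens
    "service.?unavailable",
    "server error",
    "internal error",
    "connection.?error",
    "fetch failed",
    "retry delay",
  ].join("|"),
  "i", // case-insensitive; no "g" flag, so test() keeps no state between calls
);

function isRetryableErrorMessageSketch(errorMessage: string): boolean {
  return RETRYABLE_ERROR_PATTERN.test(errorMessage);
}
```

Because the match is textual, any message that happens to contain one of these tokens is classified as retryable, regardless of which provider produced it or what the underlying failure was.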

Retry lifecycle and state transitions

Session state used by retry:

  • #retryAttempt: number (0 means idle)
  • #retryPromise: Promise<void> | undefined (tracks in-progress retry lifecycle)
  • #retryResolve: (() => void) | undefined (resolves #retryPromise)
  • #retryAbortController: AbortController | undefined (cancels backoff sleep)

Flow (#handleRetryableError):

  1. Read retry settings group.
  2. If retry.enabled === false, stop immediately (false, no retry started).
  3. Increment #retryAttempt.
  4. Create #retryPromise once (first attempt in a chain).
  5. If attempt exceeded retry.maxRetries, emit final failure event and stop.
  6. Compute delay: retry.baseDelayMs * 2^(attempt-1).
  7. For usage-limit errors, parse retry hints and call auth storage (markUsageLimitReached(...)); if provider/model switch succeeds, force delay to 0.
  8. Emit auto_retry_start.
  9. Remove the trailing assistant error message from agent runtime state (kept in persisted session history).
  10. Sleep with abort support.
  11. On wake, schedule agent.continue() via setTimeout(..., 0).
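The decision part of this flow (steps 2 to 6 and 8) can be sketched as a synchronous planner. This is a simplified illustration, not the #handleRetryableError implementation: planRetryStep and its types are hypothetical, the usage-limit handling, message removal, sleep, and agent.continue() scheduling are omitted, and using the triggering error text as finalError is an assumption.

```typescript
interface RetrySettingsSketch {
  enabled: boolean;
  maxRetries: number;
  baseDelayMs: number;
}

type RetryStep =
  | { kind: "disabled" }
  | { kind: "exhausted"; event: { type: "auto_retry_end"; success: false; attempt: number; finalError: string } }
  | { kind: "retry"; event: { type: "auto_retry_start"; attempt: number; maxAttempts: number; delayMs: number; errorMessage: string } };

interface RetryCounter {
  attempt: number; // 0 means idle
}

function planRetryStep(counter: RetryCounter, settings: RetrySettingsSketch, errorMessage: string): RetryStep {
  if (!settings.enabled) return { kind: "disabled" }; // step 2: no retry started
  counter.attempt += 1; // step 3: increment before the max check
  if (counter.attempt > settings.maxRetries) {
    const lastAttempt = counter.attempt - 1; // report the last retry actually attempted
    counter.attempt = 0; // the max-exceeded path resets the counter
    return {
      kind: "exhausted",
      event: { type: "auto_retry_end", success: false, attempt: lastAttempt, finalError: errorMessage },
    };
  }
  const delayMs = settings.baseDelayMs * 2 ** (counter.attempt - 1); // step 6
  return {
    kind: "retry",
    event: { type: "auto_retry_start", attempt: counter.attempt, maxAttempts: settings.maxRetries, delayMs, errorMessage },
  };
}
```

With the default settings, four consecutive retryable errors produce three "retry" steps followed by one "exhausted" step whose event reports attempt 3.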

What resets retry counters

#retryAttempt resets to 0 in these cases:

  • first successful non-error, non-aborted assistant message after retries started (emits auto_retry_end { success: true })
  • retry cancellation during backoff sleep
  • max retries exceeded path

#retryPromise resolves/clears when retry chain ends (success, cancellation, or max-exceeded), via #resolveRetry().

Backoff and max-attempt semantics

Settings:

  • retry.enabled (default true)
  • retry.maxRetries (default 3)
  • retry.baseDelayMs (default 2000)

Attempt numbering:

  • attempt counter is incremented before max-check
  • start events use current attempt (1-based)
  • max-exceeded end event reports attempt: this.#retryAttempt - 1 (last attempted retry count)

Backoff sequence with default settings:

  • attempt 1: 2000 ms
  • attempt 2: 4000 ms
  • attempt 3: 8000 ms

Delay-override inputs (the parsed retry hints) are used only in the usage-limit handling path, and only to inform the auth-storage model/account switching decision. In the main non-compaction retry path, the delay is always the local exponential backoff, unless switching succeeds, in which case delayMs is 0.
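As a worked example of the arithmetic, the sketch below computes the schedule and the cumulative backoff wait for a given configuration. The function names and the switchSucceeded flag are illustrative; only the formula baseDelayMs * 2^(attempt-1) and the zero-delay override come from the description above.

```typescript
// Delay before a given 1-based attempt; 0 when a provider/model switch succeeded.
function backoffDelayMs(baseDelayMs: number, attempt: number, switchSucceeded: boolean = false): number {
  return switchSucceeded ? 0 : baseDelayMs * 2 ** (attempt - 1);
}

// Full schedule for attempts 1..maxRetries.
function backoffSchedule(baseDelayMs: number, maxRetries: number): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    delays.push(backoffDelayMs(baseDelayMs, attempt));
  }
  return delays;
}

const defaultSchedule = backoffSchedule(2000, 3); // [2000, 4000, 8000]
const totalBackoffMs = defaultSchedule.reduce((sum, d) => sum + d, 0); // 14000
```

So with defaults, a chain that exhausts all three retries spends 14 seconds in backoff sleep, not counting the time taken by the requests themselves.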

Abort mechanics

Explicit retry abort

abortRetry():

  • aborts #retryAbortController (if present)
  • resolves retry promise (#resolveRetry()) so awaiters are unblocked

If the abort lands while the backoff sleep is in progress, the catch path:

  • emits auto_retry_end { success: false, finalError: "Retry cancelled" }
  • resets the attempt counter and abort controller
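One common way to build a cancellable sleep is to race a timer against an AbortSignal. The sketch below shows that pattern together with a catch path that reports cancellation; abortableSleep, backoffWithCancel, and the emit callback are hypothetical names, and the actual session code may be structured differently.

```typescript
// Resolves after ms, or rejects as soon as the signal aborts.
function abortableSleep(ms: number, signal: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal.aborted) {
      reject(new Error("Retry cancelled"));
      return;
    }
    const onAbort = (): void => {
      clearTimeout(timer); // do not leave the timer pending after cancellation
      reject(new Error("Retry cancelled"));
    };
    const timer = setTimeout(() => {
      signal.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal.addEventListener("abort", onAbort, { once: true });
  });
}

interface RetryEndSketch {
  type: "auto_retry_end";
  success: boolean;
  attempt: number;
  finalError?: string;
}

// Returns true if the sleep completed, false if it was cancelled.
async function backoffWithCancel(
  delayMs: number,
  attempt: number,
  signal: AbortSignal,
  emit: (event: RetryEndSketch) => void,
): Promise<boolean> {
  try {
    await abortableSleep(delayMs, signal);
    return true;
  } catch {
    emit({ type: "auto_retry_end", success: false, attempt, finalError: "Retry cancelled" });
    return false;
  }
}
```

In this sketch, calling abort() on the controller that owns the signal plays the role of abortRetry(): the sleep rejects immediately and the caller never proceeds to the continue step.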

Global operation abort interaction

abort() calls abortRetry() before aborting the active agent stream. This guarantees that any retry backoff in progress is cancelled when the user issues a general abort.

TUI interaction

On auto_retry_start, EventController:

  • swaps Esc handler to session.abortRetry()
  • renders loader text: Retrying (attempt/maxAttempts) in Ns… (esc to cancel)

On auto_retry_end, it restores prior Esc handler and clears loader state.

Streaming and prompt completion behavior

prompt() ultimately waits on #waitForRetry() after agent.prompt(...) returns.

Effect:

  • a prompt call does not fully resolve until any started retry chain finishes (success/failure/cancel)
  • retry lifecycle is part of one logical prompt execution boundary

This prevents callers from treating a retrying turn as complete too early.
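The waiting behavior can be modeled with a single deferred promise that is created once per chain and resolved when the chain ends. The sketch below uses closures in place of the private #retryPromise and #retryResolve fields; createRetryTracker and promptSketch are hypothetical names, and the real prompt() does more than this.

```typescript
function createRetryTracker() {
  let retryPromise: Promise<void> | undefined;
  let retryResolve: (() => void) | undefined;
  return {
    // Called on the first attempt of a chain; later attempts reuse the same promise.
    begin(): void {
      if (retryPromise === undefined) {
        retryPromise = new Promise<void>((resolve) => {
          retryResolve = resolve;
        });
      }
    },
    // Called when the chain ends (success, cancellation, or max-exceeded).
    resolveRetry(): void {
      retryResolve?.();
      retryPromise = undefined;
      retryResolve = undefined;
    },
    isRetrying(): boolean {
      return retryPromise !== undefined;
    },
    // Resolves immediately when no chain is active.
    waitForRetry(): Promise<void> {
      return retryPromise ?? Promise.resolve();
    },
  };
}

// A prompt does not resolve until both the agent call and any retry chain finish.
async function promptSketch(
  runAgentPrompt: () => Promise<void>,
  tracker: ReturnType<typeof createRetryTracker>,
): Promise<void> {
  await runAgentPrompt();
  await tracker.waitForRetry();
}
```

Creating the promise only once per chain matters: awaiters that captured it on attempt 1 are still released when attempt 3 finally succeeds or fails.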

Controls: settings and RPC

Configuration knobs

Defined in the settings schema under the retry group:

  • retry.enabled
  • retry.maxRetries
  • retry.baseDelayMs

Programmatic toggles in session:

  • setAutoRetryEnabled(enabled) writes retry.enabled
  • autoRetryEnabled reads retry.enabled
  • isRetrying reports whether retry lifecycle promise is active

RPC controls

RPC command surface:

  • set_auto_retry → session.setAutoRetryEnabled(command.enabled)
  • abort_retry → session.abortRetry()

Client helpers:

  • RpcClient.setAutoRetry(enabled)
  • RpcClient.abortRetry()

Both commands return success responses; retry progress/failure details come from streamed session events, not command response payloads.

Event emission and failure surfacing

Session-level retry events:

  • auto_retry_start { attempt, maxAttempts, delayMs, errorMessage }
  • auto_retry_end { success, attempt, finalError? }

Propagation:

  • emitted through AgentSession.subscribe(...)
  • forwarded to extension runner as extension events
  • in RPC mode, forwarded directly as JSON event objects (session.subscribe(event => output(event)))
  • in TUI, consumed by EventController for loader/error UI

Final failure surfacing:

  • On max-exceeded or cancellation, auto_retry_end.success === false
  • TUI shows: Retry failed after N attempts: <finalError>
  • Extensions/hooks receive auto_retry_end with same fields
  • RPC consumers receive same event object on stdout stream
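For consumers of the event stream, the two event shapes can be modeled as a discriminated union. The sketch below types them from the field lists above and renders the TUI strings quoted in this document; the AutoRetryEvent type name, describeRetryEvent, the rounding of delayMs to whole seconds, and the fallback text for a missing finalError are assumptions.

```typescript
type AutoRetryEvent =
  | { type: "auto_retry_start"; attempt: number; maxAttempts: number; delayMs: number; errorMessage: string }
  | { type: "auto_retry_end"; success: boolean; attempt: number; finalError?: string };

// Returns the status text to display, or undefined when there is nothing to show.
function describeRetryEvent(event: AutoRetryEvent): string | undefined {
  if (event.type === "auto_retry_start") {
    const seconds = Math.ceil(event.delayMs / 1000); // rounding is an assumption
    return `Retrying (${event.attempt}/${event.maxAttempts}) in ${seconds}s… (esc to cancel)`;
  }
  if (!event.success) {
    return `Retry failed after ${event.attempt} attempts: ${event.finalError ?? "unknown error"}`;
  }
  return undefined; // successful end: the loader is simply cleared
}
```

A handler of this form could be passed to a subscriber in any of the propagation paths listed above, since each path delivers the same event object.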

Permanent stop conditions

Retry stops and will not auto-continue when any of these occur:

  • retry.enabled is false
  • error is not retry-classified
  • error is context overflow (delegated to compaction path)
  • max retries exceeded
  • user cancels retry (abort_retry or Esc during retry loader)
  • global abort (abort) cancels retry first

A new retry chain can still start later on a future retryable error after counters reset.

Operational caveats

  • Classification is regex text matching; provider-specific structured errors are not used here.
  • Retry strips the failing assistant error from runtime context before re-continue, but session history still keeps that error entry.
  • RpcSessionState currently exposes autoCompactionEnabled but not an autoRetryEnabled field; RPC callers must track their own toggle state or query settings through other APIs.
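One way for an RPC caller to work around the missing state field is to mirror the toggle locally. The sketch below is a hypothetical wrapper: AutoRetryClient models only the setAutoRetry helper named in this document, its Promise return type is an assumption, and the initial value assumes the session still has the default retry.enabled of true.

```typescript
// Hypothetical minimal view of the client; the return type is an assumption.
interface AutoRetryClient {
  setAutoRetry(enabled: boolean): Promise<unknown>;
}

function trackAutoRetry(client: AutoRetryClient, initiallyEnabled: boolean = true) {
  let enabled = initiallyEnabled;
  return {
    // Last value this caller successfully sent (or the assumed initial value).
    get enabled(): boolean {
      return enabled;
    },
    async set(next: boolean): Promise<void> {
      await client.setAutoRetry(next);
      enabled = next; // recorded only after the command succeeds
    },
  };
}
```

A local mirror like this goes stale if anything else changes retry.enabled, so it is a convenience for a single controlling client rather than a source of truth.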