Non-compaction auto-retry policy

This document describes the standard API-error retry path in AgentSession.

It explicitly excludes context-overflow recovery via auto-compaction. Overflow is handled by compaction logic and is documented separately in compaction.md.

Implementation files

Scope boundary vs compaction

Retry and compaction are checked from the same agent_end path, but they are intentionally separated:

agent_end inspects the last assistant message.
#isRetryableError(...) runs first.
If retry is initiated, compaction checks are skipped for that turn.
Context-overflow errors are hard-excluded from retry classification (isContextOverflow(...) short-circuits retry).
Overflow therefore falls through to #checkCompaction(...) instead of standard retry.

So: overload/rate/server/network-style failures use this retry policy; context-window overflow uses compaction recovery.
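The ordering above can be sketched as a small dispatch function. This is an illustration only: TurnHooks and onAgentEnd are hypothetical names standing in for the private AgentSession methods, and the retry handler is simplified to be synchronous.

```typescript
// Hypothetical hooks standing in for the private AgentSession methods.
interface TurnHooks {
  isRetryableError(): boolean; // assumed to already be false for context overflow
  handleRetryableError(): boolean; // true only when a retry was actually started
  checkCompaction(): void;
}

// Returns which path handled the turn.
function onAgentEnd(hooks: TurnHooks): "retry" | "compaction" {
  if (hooks.isRetryableError() && hooks.handleRetryableError()) {
    return "retry"; // compaction checks are skipped for this turn
  }
  // Overflow, non-retryable errors, and disabled retry all fall through here.
  hooks.checkCompaction();
  return "compaction";
}
```

Because the retry check runs first and overflow is excluded from it, the two recovery mechanisms never both act on the same turn in this sketch.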

Retry classification

#isRetryableError(...) requires all of the following:

assistant stopReason === "error"
errorMessage exists
message is not context overflow
errorMessage matches #isRetryableErrorMessage(...)

Current retryable pattern set (regex-based):

overloaded
rate limit / usage limit / too many requests
HTTP-like status codes: 429, 500, 502, 503, 504
service unavailable / server error / internal error
connection error / fetch failed
retry delay wording

This is string-pattern classification, not typed provider error codes.
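A minimal sketch of this kind of classifier follows. The regex is reconstructed from the list above and is an assumption, not the actual pattern; the AssistantMessage shape and the placeholder isContextOverflow body are also hypothetical (the real overflow check is covered in compaction.md).

```typescript
// Hypothetical message shape; only the two fields named above are modelled.
interface AssistantMessage {
  stopReason: string;
  errorMessage?: string;
}

// Assumed pattern set reconstructed from the list above; the real regex may differ.
const RETRYABLE_PATTERN =
  /overloaded|rate.?limit|usage.?limit|too many requests|\b(?:429|500|502|503|504)\b|service.?unavailable|server error|internal error|connection.?error|fetch failed|retry delay/i;

// Placeholder overflow check, for illustration only.
function isContextOverflow(msg: AssistantMessage): boolean {
  return /context (?:window|length)|too many tokens/i.test(msg.errorMessage ?? "");
}

function isRetryableError(msg: AssistantMessage): boolean {
  if (msg.stopReason !== "error") return false; // must be an error stop
  if (!msg.errorMessage) return false; // must carry a message
  if (isContextOverflow(msg)) return false; // overflow short-circuits retry
  return RETRYABLE_PATTERN.test(msg.errorMessage);
}
```

Note that the overflow check runs before the pattern match, so a message that mentions both a 500 status and a context-window overflow is still routed to compaction.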

Retry lifecycle and state transitions

Session state used by retry:

#retryAttempt: number (0 means idle)
#retryPromise: Promise<void> | undefined (tracks in-progress retry lifecycle)
#retryResolve: (() => void) | undefined (resolves #retryPromise)
#retryAbortController: AbortController | undefined (cancels backoff sleep)

Flow (#handleRetryableError):

Read retry settings group.
If retry.enabled === false, stop immediately (false, no retry started).
Increment #retryAttempt.
Create #retryPromise once (first attempt in a chain).
If the attempt count exceeds retry.maxRetries, emit the final failure event and stop.
Compute delay: retry.baseDelayMs * 2^(attempt-1).
For usage-limit errors, parse retry hints and call auth storage (markUsageLimitReached(...)); if provider/model switch succeeds, force delay to 0.
Emit auto_retry_start.
Remove the trailing assistant error message from agent runtime state (kept in persisted session history).
Sleep with abort support.
On wake, schedule agent.continue() via setTimeout(..., 0).
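The core of this flow can be sketched as below. It is a simplified illustration under stated assumptions: RetryChain and RetryDeps are hypothetical names, the `type` discriminator on events is assumed, finalError is assumed to carry the last error message, and the usage-limit handling, #retryPromise bookkeeping, and abort support are left out.

```typescript
interface RetrySettings {
  enabled: boolean;
  maxRetries: number;
  baseDelayMs: number;
}

type RetryEvent =
  | { type: "auto_retry_start"; attempt: number; maxAttempts: number; delayMs: number; errorMessage: string }
  | { type: "auto_retry_end"; success: boolean; attempt: number; finalError?: string };

// Hypothetical collaborators, injected so the sketch stays self-contained.
interface RetryDeps {
  emit(event: RetryEvent): void;
  dropTrailingError(): void; // remove the error message from runtime state
  sleep(ms: number): Promise<void>; // abort support omitted in this sketch
  scheduleContinue(): void; // stands in for setTimeout(() => agent.continue(), 0)
}

class RetryChain {
  attempt = 0; // 0 means idle
  private readonly settings: RetrySettings;
  private readonly deps: RetryDeps;

  constructor(settings: RetrySettings, deps: RetryDeps) {
    this.settings = settings;
    this.deps = deps;
  }

  // Resolves to true when a retry was started, false otherwise.
  async handle(errorMessage: string): Promise<boolean> {
    if (!this.settings.enabled) return false;

    this.attempt++; // incremented before the max check
    if (this.attempt > this.settings.maxRetries) {
      this.deps.emit({ type: "auto_retry_end", success: false, attempt: this.attempt - 1, finalError: errorMessage });
      this.attempt = 0;
      return false;
    }

    const delayMs = this.settings.baseDelayMs * 2 ** (this.attempt - 1);
    this.deps.emit({ type: "auto_retry_start", attempt: this.attempt, maxAttempts: this.settings.maxRetries, delayMs, errorMessage });
    this.deps.dropTrailingError();
    await this.deps.sleep(delayMs);
    this.deps.scheduleContinue();
    return true;
  }
}
```

Injecting sleep and scheduleContinue keeps the sketch testable without real timers; the actual session methods are private and wired differently.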

What resets retry counters

#retryAttempt resets to 0 in these cases:

first successful non-error, non-aborted assistant message after retries started (emits auto_retry_end { success: true })
retry cancellation during backoff sleep
max retries exceeded path

#retryPromise resolves/clears when retry chain ends (success, cancellation, or max-exceeded), via #resolveRetry().

Backoff and max-attempt semantics

Settings:

retry.enabled (default true)
retry.maxRetries (default 3)
retry.baseDelayMs (default 2000)

Attempt numbering:

attempt counter is incremented before max-check
start events use current attempt (1-based)
max-exceeded end event reports attempt: this.#retryAttempt - 1 (last attempted retry count)

Backoff sequence with default settings:

attempt 1: 2000 ms
attempt 2: 4000 ms
attempt 3: 8000 ms

Delay override inputs are used only in the usage-limit handling path, and only to inform the auth-storage decision on whether to switch model or account. In the main non-compaction retry path, backoff stays the local exponential delay unless switching succeeds, in which case delayMs = 0.
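As a worked example, the delay rule can be written as a single function. The name retryDelayMs and the `switched` flag are hypothetical; the flag models a successful provider/model switch.

```typescript
// Delay before retry number `attempt` (1-based): baseDelayMs * 2^(attempt - 1).
// `switched` models a successful provider/model switch, which forces the delay to 0.
function retryDelayMs(attempt: number, baseDelayMs = 2000, switched = false): number {
  if (switched) return 0;
  return baseDelayMs * 2 ** (attempt - 1);
}

const defaults = [1, 2, 3].map((attempt) => retryDelayMs(attempt));
console.log(defaults); // [ 2000, 4000, 8000 ]
```

With the default settings, a chain that exhausts all three retries spends 14000 ms in backoff in total, not counting the time taken by the requests themselves.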

Abort mechanics

Explicit retry abort

abortRetry():

aborts #retryAbortController (if present)
resolves retry promise (#resolveRetry()) so awaiters are unblocked

If the abort lands during the backoff sleep, the catch path:

emits auto_retry_end { success: false, finalError: "Retry cancelled" }
resets the attempt counter and abort controller
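One common way to get a cancellable backoff sleep is to race a timer against an AbortSignal. The sketch below shows that pattern; abortableSleep and backoff are hypothetical helpers, not the session's own code, and the `type` field on the event is an assumption.

```typescript
// Sleep that rejects as soon as the signal aborts.
function abortableSleep(ms: number, signal: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal.aborted) {
      reject(new Error("aborted"));
      return;
    }
    const onAbort = () => {
      clearTimeout(timer); // do not keep the process waiting on a cancelled timer
      reject(new Error("aborted"));
    };
    const timer = setTimeout(() => {
      signal.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal.addEventListener("abort", onAbort, { once: true });
  });
}

type BackoffResult =
  | { outcome: "slept" }
  | { outcome: "cancelled"; event: { type: "auto_retry_end"; success: false; attempt: number; finalError: string } };

// Waits out the backoff; on cancellation, returns the end event the caller should emit.
// Resetting the attempt counter and controller is left to the caller.
async function backoff(ms: number, attempt: number, controller: AbortController): Promise<BackoffResult> {
  try {
    await abortableSleep(ms, controller.signal);
    return { outcome: "slept" };
  } catch {
    return {
      outcome: "cancelled",
      event: { type: "auto_retry_end", success: false, attempt, finalError: "Retry cancelled" },
    };
  }
}
```

Calling controller.abort() from an abortRetry()-style method then ends the sleep immediately instead of waiting for the timer to fire.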

Global operation abort interaction

abort() calls abortRetry() before aborting the active agent stream. This guarantees that retry backoff is cancelled when the user issues a general abort.

TUI interaction

On auto_retry_start, EventController:

swaps Esc handler to session.abortRetry()
renders loader text: Retrying (attempt/maxAttempts) in Ns… (esc to cancel)

On auto_retry_end, it restores prior Esc handler and clears loader state.

Streaming and prompt completion behavior

prompt() ultimately waits on #waitForRetry() after agent.prompt(...) returns.

Effect:

a prompt call does not fully resolve until any started retry chain finishes (success/failure/cancel)
retry lifecycle is part of one logical prompt execution boundary

This prevents callers from treating a retrying turn as complete too early.
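The waiting behaviour can be illustrated with a small promise gate. RetryGate and the standalone prompt function below are hypothetical stand-ins for #retryPromise, #resolveRetry(), #waitForRetry(), and prompt(); they show the pattern, not the actual implementation.

```typescript
class RetryGate {
  private promise: Promise<void> | undefined;
  private resolveFn: (() => void) | undefined;

  // True while a retry chain is in progress.
  get isRetrying(): boolean {
    return this.promise !== undefined;
  }

  // Called on the first attempt of a chain; later attempts reuse the same promise.
  begin(): void {
    if (this.promise) return;
    this.promise = new Promise<void>((resolve) => {
      this.resolveFn = resolve;
    });
  }

  // Called when the chain ends: success, cancellation, or max-exceeded.
  end(): void {
    this.resolveFn?.();
    this.promise = undefined;
    this.resolveFn = undefined;
  }

  // Resolves immediately when no chain is active.
  async wait(): Promise<void> {
    await this.promise;
  }
}

// A prompt call finishes only after the turn and any retry chain it started.
async function prompt(gate: RetryGate, runTurn: () => Promise<void>): Promise<void> {
  await runTurn();
  await gate.wait();
}
```

Creating the promise once per chain means every attempt in the chain is covered by the same wait, so the caller sees a single completion.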

Controls: settings and RPC

Configuration knobs

Defined in settings schema under retry group:

retry.enabled
retry.maxRetries
retry.baseDelayMs

Programmatic toggles in session:

setAutoRetryEnabled(enabled) writes retry.enabled
autoRetryEnabled reads retry.enabled
isRetrying reports whether retry lifecycle promise is active

RPC controls

RPC command surface:

set_auto_retry → session.setAutoRetryEnabled(command.enabled)
abort_retry → session.abortRetry()

Client helpers:

RpcClient.setAutoRetry(enabled)
RpcClient.abortRetry()

Both commands return success responses; retry progress/failure details come from streamed session events, not command response payloads.

Event emission and failure surfacing

Session-level retry events:

auto_retry_start { attempt, maxAttempts, delayMs, errorMessage }
auto_retry_end { success, attempt, finalError? }

Propagation:

emitted through AgentSession.subscribe(...)
forwarded to extension runner as extension events
in RPC mode, forwarded directly as JSON event objects (session.subscribe(event => output(event)))
in TUI, consumed by EventController for loader/error UI

Final failure surfacing:

On max-exceeded or cancellation, auto_retry_end.success === false
TUI shows: Retry failed after N attempts: <finalError>
Extensions/hooks receive auto_retry_end with same fields
RPC consumers receive same event object on stdout stream
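A subscriber that turns these events into the UI strings quoted above might look like the following sketch. The `type` discriminator, the describeRetryEvent name, and the rounding of delayMs to whole seconds are assumptions; only the field names and message templates come from this document.

```typescript
type AutoRetryStart = {
  type: "auto_retry_start";
  attempt: number;
  maxAttempts: number;
  delayMs: number;
  errorMessage: string;
};
type AutoRetryEnd = {
  type: "auto_retry_end";
  success: boolean;
  attempt: number;
  finalError?: string;
};
type RetryEvent = AutoRetryStart | AutoRetryEnd;

// Returns the text to show, or undefined when there is nothing to report.
function describeRetryEvent(event: RetryEvent): string | undefined {
  switch (event.type) {
    case "auto_retry_start": {
      const seconds = Math.ceil(event.delayMs / 1000); // rounding is an assumption
      return `Retrying (${event.attempt}/${event.maxAttempts}) in ${seconds}s… (esc to cancel)`;
    }
    case "auto_retry_end":
      return event.success
        ? undefined
        : `Retry failed after ${event.attempt} attempts: ${event.finalError ?? "unknown error"}`;
  }
}
```

An RPC consumer could apply the same function to each JSON event object read from the stdout stream, since the event shape is the same on every surface.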

Permanent stop conditions

Retry stops and will not auto-continue when any of these occur:

retry.enabled is false
error is not retry-classified
error is context overflow (delegated to compaction path)
max retries exceeded
user cancels retry (abort_retry or Esc during retry loader)
global abort (abort) cancels retry first

A new retry chain can still start later on a future retryable error after counters reset.

Operational caveats

Classification is regex text matching; provider-specific structured errors are not used here.
Retry strips the failing assistant error from runtime context before re-continue, but session history still keeps that error entry.
RpcSessionState currently exposes autoCompactionEnabled but not an autoRetryEnabled field; RPC callers must track their own toggle state or query settings through other APIs.
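One way an RPC caller could track its own toggle state is a thin wrapper around the client. In this sketch, AutoRetryRpc and TrackedAutoRetry are hypothetical; only the setAutoRetry(enabled) method name is taken from this document, and its Promise return type is an assumption.

```typescript
// Minimal view of the client; the Promise return type is an assumption.
interface AutoRetryRpc {
  setAutoRetry(enabled: boolean): Promise<unknown>;
}

class TrackedAutoRetry {
  private readonly client: AutoRetryRpc;
  private known: boolean | undefined; // undefined until this caller first sets the toggle

  constructor(client: AutoRetryRpc) {
    this.client = client;
  }

  async set(enabled: boolean): Promise<void> {
    await this.client.setAutoRetry(enabled);
    this.known = enabled; // record only after the command succeeds
  }

  // Last value this caller set, or undefined if it never set one.
  get enabled(): boolean | undefined {
    return this.known;
  }
}
```

The cached value only reflects what this caller last sent; it goes stale if the setting is changed from another surface, such as the TUI or a settings file.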
