Intermediate · 10 min read · Module 9, Lesson 4

🔄 Agent Error Recovery & Retry

Handle failures gracefully in autonomous agent workflows


Autonomous AI agents operate in unpredictable environments. APIs go down, tools return unexpected results, models hallucinate, and token budgets run out. A production-grade agent must handle every one of these failures without crashing and, whenever possible, recover automatically.

This lesson covers the full spectrum of agent failure modes and teaches you battle-tested patterns for building resilient, self-correcting agent systems.


Common Agent Failures

Before you can handle errors, you need to understand what can go wrong. Agent failures fall into several distinct categories:

1. API Errors

These are the most common failures. The Claude API (or any external API your agent calls) can return errors for many reasons:

Error Code | Meaning      | Typical Cause
-----------|--------------|------------------------------------
400        | Bad Request  | Malformed input, invalid parameters
401        | Unauthorized | Expired or missing API key
403        | Forbidden    | Insufficient permissions
429        | Rate Limited | Too many requests in a time window
500        | Server Error | Upstream service failure
529        | Overloaded   | API is temporarily at capacity
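A retry loop can consult a small helper that encodes the table's retryable/non-retryable split. A sketch -- treating 429 and all 5xx codes as transient is a common convention, not an official SDK rule:

```typescript
// Decide whether an HTTP status from the table above is worth retrying.
// 4xx errors indicate a problem with the request itself and will not
// succeed on retry -- except 429, which just means "slow down".
function isRetryable(status: number): boolean {
  if (status === 429) return true; // rate limited: back off and retry
  if (status >= 500) return true;  // 500/529 and other server-side failures
  return false;                    // 400/401/403: fix the request instead
}
```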

2. Tool Failures

When your agent calls external tools (file system, database, web search), those tools can fail:

  • Timeout: The tool takes too long to respond
  • Invalid output: The tool returns data your agent cannot parse
  • Permission denied: The tool lacks access to the requested resource
  • Resource not found: The file, URL, or record does not exist

3. Infinite Loops

An agent can get stuck repeating the same action when:

  • The model keeps calling the same tool with the same arguments
  • The agent's correction attempt produces the same error
  • Two tools trigger each other in a cycle
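One way to catch the first two cases is to fingerprint recent tool calls and flag a repeat before it burns the iteration budget. A sketch -- the class name, window size, and threshold are illustrative choices, not part of any SDK:

```typescript
// Detect when an agent keeps issuing the same tool call with the same
// arguments. Each call is reduced to a string signature; if the same
// signature appears `repeatThreshold` times within a sliding window of
// recent calls, the agent loop should intervene (e.g. stop or re-prompt).
class LoopDetector {
  private history: string[] = [];

  constructor(
    private windowSize = 6,
    private repeatThreshold = 3
  ) {}

  // Record a tool call; returns true once a loop is detected.
  record(toolName: string, args: unknown): boolean {
    const signature = `${toolName}:${JSON.stringify(args)}`;
    this.history.push(signature);
    if (this.history.length > this.windowSize) this.history.shift();
    const repeats = this.history.filter((s) => s === signature).length;
    return repeats >= this.repeatThreshold;
  }
}
```

When the detector fires, useful responses include appending a corrective message ("you already tried this; take a different approach") or terminating with a partial result.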

4. Context Overflow

Every model has a context window limit. Agents that accumulate tool results, conversation history, and internal reasoning can exceed this limit, causing:

  • Truncated context and lost information
  • Degraded reasoning quality as context fills up
  • Hard failures when the API rejects the request
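A cheap first line of defense is a sliding-window trim: estimate context size with the rough characters/4 heuristic and drop the oldest middle messages when over budget. A sketch -- the `Msg` shape, budget numbers, and `keepRecent` default are illustrative assumptions:

```typescript
interface Msg {
  role: "user" | "assistant";
  content: string;
}

// Crude size estimate: roughly 4 characters per token for English text.
function estimateTokens(messages: Msg[]): number {
  return Math.ceil(JSON.stringify(messages).length / 4);
}

// Keep the original task (first message) and the most recent turns,
// dropping the middle. Shrinks the recent window further if still over
// budget; returns a best effort if even that is not enough.
function trimToBudget(messages: Msg[], maxTokens: number, keepRecent = 4): Msg[] {
  if (estimateTokens(messages) <= maxTokens) return messages;
  const head = messages.slice(0, 1);
  let tail = messages.slice(-keepRecent);
  let trimmed = [...head, ...tail];
  while (estimateTokens(trimmed) > maxTokens && tail.length > 1) {
    tail = tail.slice(1);
    trimmed = [...head, ...tail];
  }
  return trimmed;
}
```

Trimming loses detail, so it pairs well with the summarization approach shown later in this lesson: trim aggressively only when a summarization call is not worth its own cost.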

Retry Strategies

Exponential Backoff

Exponential backoff is the most important retry pattern. Instead of retrying immediately (which can worsen rate limiting), you wait progressively longer between attempts:

TypeScript
interface RetryConfig {
  maxRetries: number;
  baseDelayMs: number;
  maxDelayMs: number;
  jitterMs: number;
}

async function withExponentialBackoff<T>(
  fn: () => Promise<T>,
  config: RetryConfig = {
    maxRetries: 5,
    baseDelayMs: 1000,
    maxDelayMs: 60000,
    jitterMs: 500,
  }
): Promise<T> {
  let lastError: Error | undefined;

  for (let attempt = 0; attempt <= config.maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error: any) {
      lastError = error;

      // Do NOT retry on non-retryable errors
      if (error.status === 400 || error.status === 401 || error.status === 403) {
        throw error;
      }

      if (attempt === config.maxRetries) break;

      const delay = Math.min(
        config.baseDelayMs * Math.pow(2, attempt) + Math.random() * config.jitterMs,
        config.maxDelayMs
      );
      console.log(
        `[Retry] Attempt ${attempt + 1} failed. ` +
        `Retrying in ${Math.round(delay)}ms...`
      );
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }

  throw lastError;
}

Key principles:

  • Jitter adds randomness so multiple agents do not retry in sync
  • Max delay caps the wait so your agent is not stuck for minutes
  • Non-retryable errors (400, 401, 403) are thrown immediately -- retrying them is pointless

Circuit Breaker

When a service is consistently failing, you should stop calling it entirely for a cooldown period instead of burning through retries on every request:

TypeScript
class CircuitBreaker {
  private failures = 0;
  private lastFailureTime = 0;
  private state: "closed" | "open" | "half-open" = "closed";

  constructor(
    private failureThreshold: number = 5,
    private cooldownMs: number = 30000
  ) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (Date.now() - this.lastFailureTime > this.cooldownMs) {
        this.state = "half-open";
      } else {
        throw new Error(
          `Circuit breaker is OPEN. Service unavailable. ` +
          `Retry after ${this.remainingCooldownSec()}s.`
        );
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    this.failures = 0;
    this.state = "closed";
  }

  private onFailure() {
    this.failures++;
    this.lastFailureTime = Date.now();
    if (this.failures >= this.failureThreshold) {
      this.state = "open";
      console.warn("[CircuitBreaker] OPEN -- halting requests.");
    }
  }

  private remainingCooldownSec(): number {
    const elapsed = Date.now() - this.lastFailureTime;
    return Math.max(0, Math.round((this.cooldownMs - elapsed) / 1000));
  }
}

The circuit breaker has three states:

  • Closed: everything works normally
  • Open: all calls are rejected immediately (service is down)
  • Half-open: one test call is allowed to check if the service has recovered

Self-Correction Patterns

A sophisticated agent does not just retry blindly. It analyzes the error and adjusts its approach.

Error Classification and Adaptive Response

TypeScript
interface AgentError {
  type: "api" | "tool" | "parsing" | "logic" | "context_overflow";
  message: string;
  retryable: boolean;
  suggestedAction: string;
}

function classifyError(error: any): AgentError {
  // API rate limit
  if (error.status === 429) {
    return {
      type: "api",
      message: "Rate limited by API",
      retryable: true,
      suggestedAction: "wait_and_retry",
    };
  }

  // Context too long
  if (error.status === 400 && error.message?.includes("too many tokens")) {
    return {
      type: "context_overflow",
      message: "Context window exceeded",
      retryable: true,
      suggestedAction: "summarize_and_retry",
    };
  }

  // Tool returned invalid output
  if (error.message?.includes("Invalid JSON")) {
    return {
      type: "parsing",
      message: "Tool returned unparseable output",
      retryable: true,
      suggestedAction: "retry_with_stricter_prompt",
    };
  }

  // Default: non-retryable
  return {
    type: "logic",
    message: error.message || "Unknown error",
    retryable: false,
    suggestedAction: "escalate_to_human",
  };
}

Context Summarization on Overflow

When context grows too large, compress it instead of failing:

TypeScript
async function summarizeContext(
  client: Anthropic,
  messages: Message[]
): Promise<Message[]> {
  const tokenEstimate = JSON.stringify(messages).length / 4;
  if (tokenEstimate < 150000) return messages; // under limit

  console.log("[Agent] Context nearing limit. Summarizing...");

  const summary = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 2000,
    system:
      "Summarize the conversation so far. Keep all key decisions, " +
      "tool results, and pending tasks. Be concise but complete.",
    messages: [
      {
        role: "user",
        content: JSON.stringify(messages.slice(0, -3)),
      },
    ],
  });

  const summaryText =
    summary.content[0].type === "text" ? summary.content[0].text : "";

  // Replace old messages with summary + keep last 3 messages
  return [
    { role: "user" as const, content: `[Previous context summary]: ${summaryText}` },
    ...messages.slice(-3),
  ];
}

Fallback Behaviors

When the primary path fails, agents need fallback strategies:

TypeScript
interface FallbackChain {
  primary: () => Promise<string>;
  fallbacks: Array<{
    name: string;
    condition: (error: any) => boolean;
    handler: () => Promise<string>;
  }>;
  lastResort: () => Promise<string>;
}

async function executeWithFallbacks(chain: FallbackChain): Promise<string> {
  try {
    return await chain.primary();
  } catch (primaryError) {
    console.warn("[Agent] Primary action failed:", primaryError);

    for (const fallback of chain.fallbacks) {
      if (fallback.condition(primaryError)) {
        try {
          console.log(`[Agent] Trying fallback: ${fallback.name}`);
          return await fallback.handler();
        } catch (fallbackError) {
          console.warn(`[Agent] Fallback "${fallback.name}" also failed.`);
        }
      }
    }

    console.error("[Agent] All fallbacks exhausted. Using last resort.");
    return await chain.lastResort();
  }
}

Example usage: if the main model is overloaded, fall back to a smaller model. If that also fails, return a cached response.
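That scenario can be made concrete as a runnable sketch. The interface and runner mirror the FallbackChain code above (restated compactly so this compiles on its own); the model-call stubs and the cache are hypothetical stand-ins, not real API calls:

```typescript
interface FallbackChain {
  primary: () => Promise<string>;
  fallbacks: Array<{
    name: string;
    condition: (error: any) => boolean;
    handler: () => Promise<string>;
  }>;
  lastResort: () => Promise<string>;
}

async function executeWithFallbacks(chain: FallbackChain): Promise<string> {
  try {
    return await chain.primary();
  } catch (primaryError) {
    for (const fb of chain.fallbacks) {
      if (fb.condition(primaryError)) {
        try {
          return await fb.handler();
        } catch {
          // fall through to the next fallback
        }
      }
    }
    return await chain.lastResort();
  }
}

// Hypothetical stubs: the primary model is overloaded (529), the smaller
// model works, and a cached response sits at the end of the chain.
const researchChain: FallbackChain = {
  primary: async () => {
    throw { status: 529, message: "Overloaded" };
  },
  fallbacks: [
    {
      name: "smaller-model",
      condition: (e) => e.status === 529 || e.status === 429,
      handler: async () => "report from smaller model",
    },
  ],
  lastResort: async () => "stale cached report",
};

// executeWithFallbacks(researchChain) resolves to "report from smaller model".
```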


Timeout Management

Agents can hang indefinitely if a tool or API call never responds. Always enforce timeouts:

TypeScript
function withTimeout<T>(
  promise: Promise<T>,
  timeoutMs: number,
  label: string = "operation"
): Promise<T> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => {
      reject(new Error(
        `[Timeout] ${label} did not complete within ${timeoutMs}ms`
      ));
    }, timeoutMs);

    promise
      .then((result) => {
        clearTimeout(timer);
        resolve(result);
      })
      .catch((error) => {
        clearTimeout(timer);
        reject(error);
      });
  });
}

// Usage in an agent loop
const toolResult = await withTimeout(
  executeTool(toolName, toolInput),
  30000, // 30-second timeout
  `Tool: ${toolName}`
);

Logging and Debugging Agents

Agents are difficult to debug because they make autonomous decisions. Structured logging is essential:

TypeScript
interface AgentLog {
  timestamp: string;
  iteration: number;
  action: string;
  input?: any;
  output?: any;
  error?: string;
  durationMs: number;
  tokenUsage?: { input: number; output: number };
}

class AgentLogger {
  private logs: AgentLog[] = [];
  private iteration = 0;

  log(entry: Omit<AgentLog, "timestamp" | "iteration">) {
    this.logs.push({
      ...entry,
      timestamp: new Date().toISOString(),
      iteration: this.iteration,
    });
  }

  nextIteration() {
    this.iteration++;
  }

  dump(): string {
    return JSON.stringify(this.logs, null, 2);
  }

  summary(): string {
    const totalTokens = this.logs.reduce(
      (sum, l) => sum + (l.tokenUsage?.input ?? 0) + (l.tokenUsage?.output ?? 0),
      0
    );
    const errors = this.logs.filter((l) => l.error);
    return (
      `Iterations: ${this.iteration} | ` +
      `Total tokens: ${totalTokens} | ` +
      `Errors: ${errors.length}`
    );
  }
}

Guardrails: Max Iterations & Cost Limits

Production agents must have hard limits to prevent runaway behavior:

TypeScript
interface AgentGuardrails {
  maxIterations: number;
  maxTotalTokens: number;
  maxCostUsd: number;
  maxToolCalls: number;
  maxWallClockMs: number;
}

const DEFAULT_GUARDRAILS: AgentGuardrails = {
  maxIterations: 25,
  maxTotalTokens: 500000,
  maxCostUsd: 2.0,
  maxToolCalls: 50,
  maxWallClockMs: 300000, // 5 minutes
};

function checkGuardrails(
  state: {
    iteration: number;
    totalTokens: number;
    estimatedCostUsd: number;
    toolCallCount: number;
    startTimeMs: number;
  },
  limits: AgentGuardrails = DEFAULT_GUARDRAILS
): { ok: boolean; reason?: string } {
  if (state.iteration >= limits.maxIterations) {
    return { ok: false, reason: `Max iterations (${limits.maxIterations}) reached` };
  }
  if (state.totalTokens >= limits.maxTotalTokens) {
    return { ok: false, reason: `Token budget (${limits.maxTotalTokens}) exhausted` };
  }
  if (state.estimatedCostUsd >= limits.maxCostUsd) {
    return { ok: false, reason: `Cost limit ($${limits.maxCostUsd}) reached` };
  }
  if (state.toolCallCount >= limits.maxToolCalls) {
    return { ok: false, reason: `Tool call limit (${limits.maxToolCalls}) reached` };
  }
  const elapsed = Date.now() - state.startTimeMs;
  if (elapsed >= limits.maxWallClockMs) {
    return { ok: false, reason: `Wall clock limit (${limits.maxWallClockMs}ms) exceeded` };
  }
  return { ok: true };
}

Full Practical Example: Resilient Research Agent

Here is a complete agent that searches the web, reads pages, and writes a report -- with full error handling:

TypeScript
// --- Configuration ---
const client = new Anthropic();
const MODEL = "claude-sonnet-4-20250514";

const GUARDRAILS: AgentGuardrails = {
  maxIterations: 20,
  maxTotalTokens: 300000,
  maxCostUsd: 1.5,
  maxToolCalls: 40,
  maxWallClockMs: 180000,
};

// --- Agent State ---
interface AgentState {
  messages: any[];
  iteration: number;
  totalTokens: number;
  estimatedCostUsd: number;
  toolCallCount: number;
  startTimeMs: number;
  logger: AgentLogger;
}

// --- Main Agent Loop ---
async function runResearchAgent(query: string): Promise<string> {
  const state: AgentState = {
    messages: [{ role: "user", content: query }],
    iteration: 0,
    totalTokens: 0,
    estimatedCostUsd: 0,
    toolCallCount: 0,
    startTimeMs: Date.now(),
    logger: new AgentLogger(),
  };

  const breaker = new CircuitBreaker(3, 15000);

  while (true) {
    state.iteration++;
    state.logger.nextIteration();

    // --- Check guardrails ---
    const guardrailCheck = checkGuardrails(state, GUARDRAILS);
    if (!guardrailCheck.ok) {
      state.logger.log({
        action: "guardrail_stop",
        error: guardrailCheck.reason,
        durationMs: Date.now() - state.startTimeMs,
      });
      return buildPartialReport(state, guardrailCheck.reason!);
    }

    // --- Call Claude with retry + circuit breaker ---
    let response;
    const callStart = Date.now();
    try {
      response = await breaker.execute(() =>
        withExponentialBackoff(() =>
          withTimeout(
            client.messages.create({
              model: MODEL,
              max_tokens: 4096,
              system:
                "You are a research agent. Use tools to find info, " +
                "then write a comprehensive report.\n" +
                "When done, respond with your final report text.",
              tools: researchTools,
              messages: state.messages,
            }),
            60000,
            "Claude API"
          )
        )
      );
    } catch (error: any) {
      const classified = classifyError(error);
      state.logger.log({
        action: "api_call_failed",
        error: classified.message,
        durationMs: Date.now() - callStart,
      });
      if (classified.suggestedAction === "summarize_and_retry") {
        state.messages = await summarizeContext(client, state.messages);
        continue;
      }
      return buildPartialReport(state, `Agent stopped: ${classified.message}`);
    }

    // --- Update token tracking ---
    const usage = response.usage;
    state.totalTokens += usage.input_tokens + usage.output_tokens;
    state.estimatedCostUsd +=
      (usage.input_tokens * 3 + usage.output_tokens * 15) / 1_000_000;
    state.logger.log({
      action: "api_call",
      durationMs: Date.now() - callStart,
      tokenUsage: {
        input: usage.input_tokens,
        output: usage.output_tokens,
      },
    });

    // --- Process response ---
    if (response.stop_reason === "end_turn") {
      const text = response.content.find((b: any) => b.type === "text");
      console.log("[Agent] Task complete.");
      console.log("[Agent]", state.logger.summary());
      return text?.text || "No report generated.";
    }

    // --- Handle tool calls ---
    if (response.stop_reason === "tool_use") {
      state.messages.push({ role: "assistant", content: response.content });
      const toolResults = [];

      for (const block of response.content) {
        if (block.type !== "tool_use") continue;
        state.toolCallCount++;
        const toolStart = Date.now();

        try {
          const result = await withTimeout(
            executeTool(block.name, block.input),
            30000,
            block.name
          );
          toolResults.push({
            type: "tool_result" as const,
            tool_use_id: block.id,
            content: result,
          });
          state.logger.log({
            action: `tool:${block.name}`,
            input: block.input,
            output: result.slice(0, 200),
            durationMs: Date.now() - toolStart,
          });
        } catch (toolError: any) {
          toolResults.push({
            type: "tool_result" as const,
            tool_use_id: block.id,
            content: `ERROR: ${toolError.message}`,
            is_error: true,
          });
          state.logger.log({
            action: `tool:${block.name}`,
            input: block.input,
            error: toolError.message,
            durationMs: Date.now() - toolStart,
          });
        }
      }

      state.messages.push({ role: "user", content: toolResults });
    }
  }
}

function buildPartialReport(state: AgentState, reason: string): string {
  return (
    `[Agent stopped: ${reason}]\n\n` +
    `Completed ${state.iteration} iterations.\n` +
    `Token usage: ${state.totalTokens}\n` +
    `Estimated cost: $${state.estimatedCostUsd.toFixed(4)}\n\n` +
    `Partial results may be available in the conversation history.`
  );
}

Key Takeaways

  • Always classify errors before deciding how to handle them. Not every error deserves a retry.
  • Exponential backoff with jitter is the standard retry strategy for API errors.
  • Circuit breakers protect your agent from hammering a failing service.
  • Context summarization prevents overflow and keeps your agent running longer.
  • Guardrails are non-negotiable: set hard limits on iterations, tokens, cost, and time.
  • Structured logging is the only way to debug autonomous agent behavior after the fact.
  • Fallback chains ensure your agent always produces some useful output, even when things go wrong.
  • Timeouts must wrap every external call -- never trust an API or tool to respond promptly.

Build your agents to expect failure. The difference between a demo agent and a production agent is how gracefully it handles the unexpected.