Intermediate · 12 min read · Module 10, Lesson 2

📈 Rate Limits & Scaling

Understand rate limits, handle 429 errors, and scale your application


When you build production applications with the Claude API, one of the first walls you will hit is rate limits. Understanding how they work, how to handle them gracefully, and how to architect your system for scale is critical for any serious deployment.


What Are Rate Limits?

Rate limits are restrictions that Anthropic places on how many requests you can send within a given time window. They exist to:

  • Protect infrastructure from overload
  • Ensure fair usage across all customers
  • Maintain quality of service during peak demand
  • Prevent abuse of the API

Rate limits are measured along three dimensions:

| Dimension | What It Measures |
|---|---|
| Requests per minute (RPM) | How many API calls you can make per minute |
| Tokens per minute (TPM) | How many input + output tokens you can consume per minute |
| Tokens per day (TPD) | Daily token consumption ceiling |

You are constrained by whichever limit you hit first. For example, even if you have RPM capacity left, exceeding your TPM will trigger a rate limit.
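A quick back-of-the-envelope check makes this concrete. The sketch below uses illustrative Tier 2-style numbers; substitute your tier's actual limits and your own per-request token estimate:

TypeScript
// Illustrative limits only: substitute your tier's actual numbers
const rpmLimit = 1000;          // requests per minute
const tpmLimit = 80000;         // input tokens per minute
const tokensPerRequest = 2000;  // estimated input tokens per call

// Effective throughput is capped by whichever limit binds first
const maxByRpm = rpmLimit;                                 // 1,000 req/min
const maxByTpm = Math.floor(tpmLimit / tokensPerRequest);  // 40 req/min

const effectiveRpm = Math.min(maxByRpm, maxByTpm);
console.log(`Effective throughput: ${effectiveRpm} requests/min`);
// => 40: TPM, not RPM, is the binding constraint in this scenario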


Anthropic Rate Limit Tiers

Anthropic organizes rate limits into four tiers based on your usage history and spend:

| Tier | RPM | TPM (Input) | TPM (Output) | How to Qualify |
|---|---|---|---|---|
| Tier 1 | 50 | 40,000 | 8,000 | New accounts with valid payment |
| Tier 2 | 1,000 | 80,000 | 16,000 | $100+ total spend |
| Tier 3 | 2,000 | 160,000 | 32,000 | $500+ total spend |
| Tier 4 | 4,000 | 400,000 | 80,000 | $1,000+ total spend, high trust |

Note: These limits vary by model. Claude Opus has lower limits than Claude Haiku. Always check the latest limits on the Anthropic docs.

How Tier Upgrades Work

Tier upgrades are automatic based on cumulative spend. However:

  • Upgrades are not instant; they may take up to 24 hours
  • You can request manual upgrades by contacting Anthropic sales
  • Enterprise customers can negotiate custom limits

Rate Limit Headers

Every API response from Anthropic includes headers that tell you exactly where you stand:

Output
x-ratelimit-limit-requests: 1000
x-ratelimit-limit-tokens: 80000
x-ratelimit-remaining-requests: 847
x-ratelimit-remaining-tokens: 63250
x-ratelimit-reset-requests: 2025-01-15T12:01:00Z
x-ratelimit-reset-tokens: 2025-01-15T12:01:00Z
retry-after: 3
| Header | Purpose |
|---|---|
| x-ratelimit-limit-requests | Your total RPM allowance |
| x-ratelimit-limit-tokens | Your total TPM allowance |
| x-ratelimit-remaining-requests | How many requests you have left this window |
| x-ratelimit-remaining-tokens | How many tokens you have left this window |
| x-ratelimit-reset-requests | When the request counter resets |
| x-ratelimit-reset-tokens | When the token counter resets |
| retry-after | Seconds to wait before retrying (on 429) |

Reading Headers in Code

TypeScript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

async function callWithRateLimitInfo() {
  // .withResponse() returns both the parsed message and the raw
  // fetch Response, so we can read headers without a second API call
  const { data: message, response: raw } = await client.messages
    .create({
      model: "claude-sonnet-4-20250514",
      max_tokens: 1024,
      messages: [{ role: "user", content: "Hello" }],
    })
    .withResponse();

  const remaining = raw.headers.get("x-ratelimit-remaining-requests");
  const resetTime = raw.headers.get("x-ratelimit-reset-requests");

  console.log("Requests remaining:", remaining);
  console.log("Resets at:", resetTime);

  return message;
}

Handling 429 Errors

When you exceed a rate limit, the API returns a 429 Too Many Requests status code. You must handle this gracefully.

Basic 429 Error Structure

JSON
{ "type": "error", "error": { "type": "rate_limit_error", "message": "Rate limit exceeded. Please retry after 3 seconds." } }

Exponential Backoff Implementation

The standard approach is exponential backoff with jitter: each retry waits longer than the last, with added randomness to prevent a thundering herd of synchronized retries:

TypeScript
async function callWithRetry<T>(
  fn: () => Promise<T>,
  maxRetries: number = 5,
  baseDelay: number = 1000
): Promise<T> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error: any) {
      if (error.status !== 429 || attempt === maxRetries) {
        throw error;
      }

      // Honor the retry-after header if the SDK surfaces it
      const retryAfter = error.headers?.["retry-after"];
      let delay: number;
      if (retryAfter) {
        delay = parseInt(retryAfter, 10) * 1000;
      } else {
        // Exponential backoff: 1s, 2s, 4s, 8s, 16s
        delay = baseDelay * Math.pow(2, attempt);
      }

      // Add jitter: +/- 25% randomness
      const jitter = delay * 0.25 * (Math.random() * 2 - 1);
      delay = Math.max(0, delay + jitter);

      console.warn(
        `Rate limited. Attempt ${attempt + 1}/${maxRetries + 1}. ` +
          `Retrying in ${Math.round(delay)}ms...`
      );

      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  // Unreachable: the loop always returns or rethrows
  throw new Error("unreachable");
}

// Usage
const response = await callWithRetry(() =>
  client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    messages: [{ role: "user", content: "Explain rate limiting" }],
  })
);

Request Queuing

For applications processing many requests, a queue prevents you from exceeding limits:

TypeScript
class RequestQueue {
  private queue: Array<{
    fn: () => Promise<any>;
    resolve: (value: any) => void;
    reject: (reason: any) => void;
  }> = [];
  private running: number = 0;
  private maxConcurrent: number;
  private intervalMs: number;

  constructor(requestsPerMinute: number, maxConcurrent: number = 10) {
    this.maxConcurrent = maxConcurrent;
    this.intervalMs = (60 * 1000) / requestsPerMinute;
  }

  async add<T>(fn: () => Promise<T>): Promise<T> {
    return new Promise((resolve, reject) => {
      this.queue.push({ fn, resolve, reject });
      this.process();
    });
  }

  private async process() {
    if (this.running >= this.maxConcurrent || this.queue.length === 0) {
      return;
    }

    const item = this.queue.shift();
    if (!item) return;

    this.running++;
    try {
      const result = await item.fn();
      item.resolve(result);
    } catch (error) {
      item.reject(error);
    } finally {
      this.running--;
      // Enforce rate spacing
      setTimeout(() => this.process(), this.intervalMs);
    }
  }
}

// Usage (prompts: string[] defined elsewhere)
const queue = new RequestQueue(50, 5); // 50 RPM, 5 concurrent

const results = await Promise.all(
  prompts.map((prompt) =>
    queue.add(() =>
      client.messages.create({
        model: "claude-sonnet-4-20250514",
        max_tokens: 512,
        messages: [{ role: "user", content: prompt }],
      })
    )
  )
);

Concurrent Request Management

When handling multiple users or tasks simultaneously, you need a concurrency limiter:

TypeScript
class ConcurrencyLimiter {
  private active: number = 0;
  private waiting: Array<() => void> = [];

  constructor(private limit: number) {}

  async acquire(): Promise<void> {
    if (this.active < this.limit) {
      this.active++;
      return;
    }
    return new Promise<void>((resolve) => {
      this.waiting.push(() => {
        this.active++;
        resolve();
      });
    });
  }

  release(): void {
    this.active--;
    const next = this.waiting.shift();
    if (next) next();
  }

  async run<T>(fn: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await fn();
    } finally {
      this.release();
    }
  }
}

// Allow max 10 concurrent Claude requests
const limiter = new ConcurrencyLimiter(10);

async function handleUserRequest(userMessage: string) {
  return limiter.run(() =>
    client.messages.create({
      model: "claude-sonnet-4-20250514",
      max_tokens: 1024,
      messages: [{ role: "user", content: userMessage }],
    })
  );
}

Batch API for High Volume

For processing large volumes of requests where latency is not critical, Anthropic offers a Message Batches API that provides 50% cost savings:

TypeScript
// Create a batch of requests
const batch = await client.messages.batches.create({
  requests: [
    {
      custom_id: "request-1",
      params: {
        model: "claude-sonnet-4-20250514",
        max_tokens: 1024,
        messages: [{ role: "user", content: "Summarize article 1" }],
      },
    },
    {
      custom_id: "request-2",
      params: {
        model: "claude-sonnet-4-20250514",
        max_tokens: 1024,
        messages: [{ role: "user", content: "Summarize article 2" }],
      },
    },
    // Up to 10,000 requests per batch
  ],
});

console.log("Batch ID:", batch.id);
console.log("Status:", batch.processing_status);

// Poll for completion
async function waitForBatch(batchId: string) {
  while (true) {
    const status = await client.messages.batches.retrieve(batchId);
    if (status.processing_status === "ended") {
      console.log("Batch complete!");
      console.log("Succeeded:", status.request_counts.succeeded);
      console.log("Failed:", status.request_counts.errored);
      return status;
    }
    // Wait 30 seconds before checking again
    await new Promise((r) => setTimeout(r, 30000));
  }
}

// Retrieve results
const completedBatch = await waitForBatch(batch.id);
const results: { id: string; text: string }[] = [];
const resultStream = await client.messages.batches.results(batch.id);
for await (const result of resultStream) {
  if (result.result.type === "succeeded") {
    const block = result.result.message.content[0];
    results.push({
      id: result.custom_id,
      // Narrow the content block type: only text blocks have .text
      text: block.type === "text" ? block.text : "",
    });
  }
}

When to Use Batch API

| Use Case | Batch API? | Why |
|---|---|---|
| Summarizing 1,000 articles | Yes | High volume, latency flexible |
| Real-time chatbot | No | Needs instant responses |
| Nightly data processing | Yes | Scheduled, no urgency |
| Interactive code review | No | User waiting for result |
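If you route requests programmatically, this decision can be encoded as a simple heuristic. A minimal sketch, where shouldUseBatchApi is a hypothetical helper (not an SDK function) and the thresholds are assumptions to tune for your workload:

TypeScript
// Hypothetical heuristic: prefer the Batch API when the caller can
// tolerate long turnaround and the job is large enough to benefit
function shouldUseBatchApi(
  requestCount: number,
  maxAcceptableLatencyMs: number
): boolean {
  const LATENCY_FLOOR_MS = 5 * 60 * 1000; // batches can take minutes to hours
  const MIN_BATCH_SIZE = 100;             // below this, savings are marginal
  return (
    maxAcceptableLatencyMs >= LATENCY_FLOOR_MS &&
    requestCount >= MIN_BATCH_SIZE
  );
}

shouldUseBatchApi(1000, 8 * 3600 * 1000); // nightly job of 1,000 docs -> true
shouldUseBatchApi(1, 3000);               // chatbot turn -> false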

Priority Tiers

When your application serves multiple types of requests, implement a priority system:

TypeScript
type Priority = "critical" | "high" | "normal" | "low";

class PriorityQueue {
  private queues: Map<Priority, Array<() => Promise<any>>> = new Map([
    ["critical", []],
    ["high", []],
    ["normal", []],
    ["low", []],
  ]);
  private activeRequests: number = 0;
  private maxConcurrent: number;

  constructor(maxConcurrent: number = 5) {
    this.maxConcurrent = maxConcurrent;
  }

  async enqueue<T>(priority: Priority, fn: () => Promise<T>): Promise<T> {
    return new Promise((resolve, reject) => {
      const wrappedFn = async () => {
        try {
          const result = await fn();
          resolve(result);
        } catch (err) {
          reject(err);
        }
      };
      this.queues.get(priority)!.push(wrappedFn);
      this.processNext();
    });
  }

  private async processNext() {
    if (this.activeRequests >= this.maxConcurrent) return;

    // Drain higher-priority queues first
    const priorities: Priority[] = ["critical", "high", "normal", "low"];
    for (const priority of priorities) {
      const queue = this.queues.get(priority)!;
      if (queue.length > 0) {
        const fn = queue.shift()!;
        this.activeRequests++;
        fn().finally(() => {
          this.activeRequests--;
          this.processNext();
        });
        return;
      }
    }
  }
}

// Usage
const pq = new PriorityQueue(5);

// Critical: user-facing real-time request
pq.enqueue("critical", () =>
  client.messages.create({ /* ... */ })
);

// Low: background analytics summarization
pq.enqueue("low", () =>
  client.messages.create({ /* ... */ })
);

Scaling Strategies

1. Load Balancing Across API Keys

Distribute requests across multiple API keys to multiply your effective rate limits. Note that Anthropic typically enforces limits at the organization (and optionally workspace) level, so the keys should map to separate workspaces or organizations to genuinely add capacity:

TypeScript
class KeyRotator {
  private keys: string[];
  private index: number = 0;
  private usage: Map<string, { requests: number; resetTime: number }>;

  constructor(apiKeys: string[]) {
    this.keys = apiKeys;
    this.usage = new Map();
    for (const key of apiKeys) {
      this.usage.set(key, { requests: 0, resetTime: Date.now() + 60000 });
    }
  }

  getNextKey(): string {
    // Round-robin with usage awareness
    const startIndex = this.index;
    do {
      const key = this.keys[this.index];
      const usage = this.usage.get(key)!;

      // Reset counter if window expired
      if (Date.now() > usage.resetTime) {
        usage.requests = 0;
        usage.resetTime = Date.now() + 60000;
      }

      this.index = (this.index + 1) % this.keys.length;

      // Return key if it has capacity (50 RPM per key assumed here)
      if (usage.requests < 50) {
        usage.requests++;
        return key;
      }
    } while (this.index !== startIndex);

    throw new Error("All API keys rate limited");
  }
}

const rotator = new KeyRotator([
  process.env.CLAUDE_KEY_1!,
  process.env.CLAUDE_KEY_2!,
  process.env.CLAUDE_KEY_3!,
]);

async function scaledRequest(prompt: string) {
  const key = rotator.getNextKey();
  const client = new Anthropic({ apiKey: key });
  return client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    messages: [{ role: "user", content: prompt }],
  });
}

2. Monitoring Rate Limit Usage

Track your consumption to proactively avoid hitting limits:

TypeScript
class RateLimitMonitor {
  private history: Array<{
    timestamp: number;
    tokensUsed: number;
    model: string;
  }> = [];

  record(tokensUsed: number, model: string) {
    this.history.push({
      timestamp: Date.now(),
      tokensUsed,
      model,
    });
    // Keep only last hour
    const oneHourAgo = Date.now() - 3600000;
    this.history = this.history.filter((h) => h.timestamp > oneHourAgo);
  }

  getUsageLastMinute(): { requests: number; tokens: number } {
    const oneMinuteAgo = Date.now() - 60000;
    const recent = this.history.filter((h) => h.timestamp > oneMinuteAgo);
    return {
      requests: recent.length,
      tokens: recent.reduce((sum, h) => sum + h.tokensUsed, 0),
    };
  }

  shouldThrottle(rpmLimit: number, tpmLimit: number): boolean {
    const usage = this.getUsageLastMinute();
    return (
      usage.requests >= rpmLimit * 0.8 ||
      usage.tokens >= tpmLimit * 0.8
    );
  }

  getReport(): string {
    const usage = this.getUsageLastMinute();
    return (
      `Last minute: ${usage.requests} requests, ` +
      `${usage.tokens} tokens`
    );
  }
}
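To put the monitor to work, record usage after each call and throttle before you hit the ceiling. A usage sketch, assuming Tier 2-style limits and the client from the earlier examples:

TypeScript
const monitor = new RateLimitMonitor();

async function monitoredRequest(prompt: string) {
  // Back off proactively at 80% of capacity instead of waiting for a 429
  if (monitor.shouldThrottle(1000, 80000)) {
    console.warn(monitor.getReport());
    await new Promise((r) => setTimeout(r, 5000)); // simple cool-down
  }

  const response = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    messages: [{ role: "user", content: prompt }],
  });

  // Record actual consumption reported in the response's usage block
  monitor.record(
    response.usage.input_tokens + response.usage.output_tokens,
    response.model
  );

  return response;
}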

3. Choosing the Right Model for Scale

| Scenario | Recommended Model | Why |
|---|---|---|
| High-volume classification | Claude Haiku | Fast, cheap, high RPM |
| Complex analysis at scale | Claude Sonnet | Balance of quality and speed |
| Critical decisions only | Claude Opus | Best quality, lowest limits |
| Mixed workloads | Route by complexity | Optimize cost and capacity |
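For mixed workloads, the routing itself can be a small function. A sketch, where the complexity signal is an assumption your application must supply and the model IDs are illustrative (check the Anthropic docs for current ones):

TypeScript
type TaskComplexity = "simple" | "moderate" | "complex";

// Map task complexity to a model tier so cheap, high-RPM models
// absorb the bulk of the traffic (model IDs are illustrative)
function pickModel(complexity: TaskComplexity): string {
  switch (complexity) {
    case "simple":
      return "claude-3-5-haiku-20241022";  // classification, extraction
    case "moderate":
      return "claude-sonnet-4-20250514";   // analysis, summarization
    case "complex":
      return "claude-opus-4-20250514";     // critical reasoning
  }
}

async function routedRequest(prompt: string, complexity: TaskComplexity) {
  return client.messages.create({
    model: pickModel(complexity),
    max_tokens: 1024,
    messages: [{ role: "user", content: prompt }],
  });
}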

Key Takeaways

  • Know your tier and track your usage against limits
  • Always implement retry logic with exponential backoff and jitter
  • Use queues to smooth out traffic spikes
  • Batch API saves 50% on high-volume, latency-tolerant workloads
  • Rotate keys and balance load for horizontal scaling
  • Monitor proactively; do not wait for 429 errors before reacting
  • Choose the right model for each task to maximize throughput

Next up: We will explore caching strategies that can reduce your API costs and avoid rate limits altogether.