Intermediate · 12 min read · Module 10, Lesson 2

📈 Rate Limits & Scaling

Understand rate limits, handle 429 errors, and scale your application


When you build production applications with the Claude API, one of the first walls you will hit is rate limits. Understanding how they work, how to handle them gracefully, and how to architect your system for scale is critical for any serious deployment.


What Are Rate Limits?

Rate limits are restrictions that Anthropic places on how many requests you can send within a given time window. They exist to:

  • Protect infrastructure from overload
  • Ensure fair usage across all customers
  • Maintain quality of service during peak demand
  • Prevent abuse of the API

Rate limits are measured along three dimensions:

| Dimension | What It Measures |
|---|---|
| Requests per minute (RPM) | How many API calls you can make per minute |
| Tokens per minute (TPM) | How many input + output tokens you can consume per minute |
| Tokens per day (TPD) | Daily token consumption ceiling |

You are constrained by whichever limit you hit first. For example, even if you have RPM capacity left, exceeding your TPM will trigger a rate limit.
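A quick back-of-the-envelope check makes this concrete. The sketch below uses illustrative Tier 2-style numbers; substitute your tier's actual limits and your own per-request token estimate:

TypeScript
// Illustrative limits only: substitute your tier's actual numbers
const rpmLimit = 1000;          // requests per minute
const tpmLimit = 80000;         // input tokens per minute
const tokensPerRequest = 2000;  // estimated input tokens per call

// Effective throughput is capped by whichever limit binds first
const maxByRpm = rpmLimit;                                 // 1,000 req/min
const maxByTpm = Math.floor(tpmLimit / tokensPerRequest);  // 40 req/min

const effectiveRpm = Math.min(maxByRpm, maxByTpm);
console.log(`Effective throughput: ${effectiveRpm} requests/min`);
// => 40: TPM, not RPM, is the binding constraint in this scenario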


Anthropic Rate Limit Tiers

Anthropic organizes rate limits into four tiers based on your usage history and spend:

| Tier | RPM | TPM (Input) | TPM (Output) | How to Qualify |
|---|---|---|---|---|
| Tier 1 | 50 | 40,000 | 8,000 | New accounts with valid payment |
| Tier 2 | 1,000 | 80,000 | 16,000 | $100+ total spend |
| Tier 3 | 2,000 | 160,000 | 32,000 | $500+ total spend |
| Tier 4 | 4,000 | 400,000 | 80,000 | $1,000+ total spend, high trust |

Note: These limits vary by model. Claude Opus has lower limits than Claude Haiku. Always check the latest limits on the Anthropic docs.

How Tier Upgrades Work

Tier upgrades are automatic based on cumulative spend. However:

  • Upgrades are not instant; they may take up to 24 hours
  • You can request manual upgrades by contacting Anthropic sales
  • Enterprise customers can negotiate custom limits

Rate Limit Headers

Every API response from Anthropic includes headers that tell you exactly where you stand:

Output
x-ratelimit-limit-requests: 1000
x-ratelimit-limit-tokens: 80000
x-ratelimit-remaining-requests: 847
x-ratelimit-remaining-tokens: 63250
x-ratelimit-reset-requests: 2025-01-15T12:01:00Z
x-ratelimit-reset-tokens: 2025-01-15T12:01:00Z
retry-after: 3
| Header | Purpose |
|---|---|
| x-ratelimit-limit-requests | Your total RPM allowance |
| x-ratelimit-limit-tokens | Your total TPM allowance |
| x-ratelimit-remaining-requests | How many requests you have left this window |
| x-ratelimit-remaining-tokens | How many tokens you have left this window |
| x-ratelimit-reset-requests | When the request counter resets |
| x-ratelimit-reset-tokens | When the token counter resets |
| retry-after | Seconds to wait before retrying (on 429) |

Reading Headers in Code

TypeScript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

async function callWithRateLimitInfo() {
  // .withResponse() returns both the parsed message and the raw
  // fetch Response, so we can read headers without a second API call
  const { data: message, response: raw } = await client.messages
    .create({
      model: "claude-sonnet-4-20250514",
      max_tokens: 1024,
      messages: [{ role: "user", content: "Hello" }],
    })
    .withResponse();

  const remaining = raw.headers.get("x-ratelimit-remaining-requests");
  const resetTime = raw.headers.get("x-ratelimit-reset-requests");

  console.log("Requests remaining:", remaining);
  console.log("Resets at:", resetTime);

  return message;
}

Handling 429 Errors

When you exceed a rate limit, the API returns a 429 Too Many Requests status code. You must handle this gracefully.

Basic 429 Error Structure

JSON
{ "type": "error", "error": { "type": "rate_limit_error", "message": "Rate limit exceeded. Please retry after 3 seconds." } }

Exponential Backoff Implementation

The standard approach is exponential backoff with jitter: each retry waits longer than the last, with added randomness to prevent a thundering herd of synchronized retries:

TypeScript
async function callWithRetry<T>(
  fn: () => Promise<T>,
  maxRetries: number = 5,
  baseDelay: number = 1000
): Promise<T> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error: any) {
      if (error.status !== 429 || attempt === maxRetries) {
        throw error;
      }

      // Honor the retry-after header if the SDK surfaces it
      const retryAfter = error.headers?.["retry-after"];
      let delay: number;
      if (retryAfter) {
        delay = parseInt(retryAfter, 10) * 1000;
      } else {
        // Exponential backoff: 1s, 2s, 4s, 8s, 16s
        delay = baseDelay * Math.pow(2, attempt);
      }

      // Add jitter: +/- 25% randomness
      const jitter = delay * 0.25 * (Math.random() * 2 - 1);
      delay = Math.max(0, delay + jitter);

      console.warn(
        `Rate limited. Attempt ${attempt + 1}/${maxRetries + 1}. ` +
          `Retrying in ${Math.round(delay)}ms...`
      );

      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  // Unreachable: the loop always returns or rethrows
  throw new Error("unreachable");
}

// Usage
const response = await callWithRetry(() =>
  client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    messages: [{ role: "user", content: "Explain rate limiting" }],
  })
);

Request Queuing

For applications processing many requests, a queue prevents you from exceeding limits:

TypeScript
class RequestQueue {
  private queue: Array<{
    fn: () => Promise<any>;
    resolve: (value: any) => void;
    reject: (reason: any) => void;
  }> = [];
  private running: number = 0;
  private maxConcurrent: number;
  private intervalMs: number;

  constructor(requestsPerMinute: number, maxConcurrent: number = 10) {
    this.maxConcurrent = maxConcurrent;
    this.intervalMs = (60 * 1000) / requestsPerMinute;
  }

  async add<T>(fn: () => Promise<T>): Promise<T> {
    return new Promise((resolve, reject) => {
      this.queue.push({ fn, resolve, reject });
      this.process();
    });
  }

  private async process() {
    if (this.running >= this.maxConcurrent || this.queue.length === 0) {
      return;
    }

    const item = this.queue.shift();
    if (!item) return;

    this.running++;
    try {
      const result = await item.fn();
      item.resolve(result);
    } catch (error) {
      item.reject(error);
    } finally {
      this.running--;
      // Enforce rate spacing
      setTimeout(() => this.process(), this.intervalMs);
    }
  }
}

// Usage (prompts: string[] defined elsewhere)
const queue = new RequestQueue(50, 5); // 50 RPM, 5 concurrent

const results = await Promise.all(
  prompts.map((prompt) =>
    queue.add(() =>
      client.messages.create({
        model: "claude-sonnet-4-20250514",
        max_tokens: 512,
        messages: [{ role: "user", content: prompt }],
      })
    )
  )
);

Concurrent Request Management

When handling multiple users or tasks simultaneously, you need a concurrency limiter:

TypeScript
class ConcurrencyLimiter {
  private active: number = 0;
  private waiting: Array<() => void> = [];

  constructor(private limit: number) {}

  async acquire(): Promise<void> {
    if (this.active < this.limit) {
      this.active++;
      return;
    }
    return new Promise<void>((resolve) => {
      this.waiting.push(() => {
        this.active++;
        resolve();
      });
    });
  }

  release(): void {
    this.active--;
    const next = this.waiting.shift();
    if (next) next();
  }

  async run<T>(fn: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await fn();
    } finally {
      this.release();
    }
  }
}

// Allow max 10 concurrent Claude requests
const limiter = new ConcurrencyLimiter(10);

async function handleUserRequest(userMessage: string) {
  return limiter.run(() =>
    client.messages.create({
      model: "claude-sonnet-4-20250514",
      max_tokens: 1024,
      messages: [{ role: "user", content: userMessage }],
    })
  );
}

Batch API for High Volume

For processing large volumes of requests where latency is not critical, Anthropic offers a Message Batches API that provides 50% cost savings:

TypeScript
// Create a batch of requests
const batch = await client.messages.batches.create({
  requests: [
    {
      custom_id: "request-1",
      params: {
        model: "claude-sonnet-4-20250514",
        max_tokens: 1024,
        messages: [{ role: "user", content: "Summarize article 1" }],
      },
    },
    {
      custom_id: "request-2",
      params: {
        model: "claude-sonnet-4-20250514",
        max_tokens: 1024,
        messages: [{ role: "user", content: "Summarize article 2" }],
      },
    },
    // Up to 10,000 requests per batch
  ],
});

console.log("Batch ID:", batch.id);
console.log("Status:", batch.processing_status);

// Poll for completion
async function waitForBatch(batchId: string) {
  while (true) {
    const status = await client.messages.batches.retrieve(batchId);
    if (status.processing_status === "ended") {
      console.log("Batch complete!");
      console.log("Succeeded:", status.request_counts.succeeded);
      console.log("Failed:", status.request_counts.errored);
      return status;
    }
    // Wait 30 seconds before checking again
    await new Promise((r) => setTimeout(r, 30000));
  }
}

// Retrieve results
const completedBatch = await waitForBatch(batch.id);
const results: { id: string; text: string }[] = [];
const resultStream = await client.messages.batches.results(batch.id);
for await (const result of resultStream) {
  if (result.result.type === "succeeded") {
    const block = result.result.message.content[0];
    results.push({
      id: result.custom_id,
      // Narrow the content block type: only text blocks have .text
      text: block.type === "text" ? block.text : "",
    });
  }
}

When to Use Batch API

| Use Case | Batch API? | Why |
|---|---|---|
| Summarizing 1,000 articles | Yes | High volume, latency flexible |
| Real-time chatbot | No | Needs instant responses |
| Nightly data processing | Yes | Scheduled, no urgency |
| Interactive code review | No | User waiting for result |
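If you route requests programmatically, this decision can be encoded as a simple heuristic. A minimal sketch, where shouldUseBatchApi is a hypothetical helper (not an SDK function) and the thresholds are assumptions to tune for your workload:

TypeScript
// Hypothetical heuristic: prefer the Batch API when the caller can
// tolerate long turnaround and the job is large enough to benefit
function shouldUseBatchApi(
  requestCount: number,
  maxAcceptableLatencyMs: number
): boolean {
  const LATENCY_FLOOR_MS = 5 * 60 * 1000; // batches can take minutes to hours
  const MIN_BATCH_SIZE = 100;             // below this, savings are marginal
  return (
    maxAcceptableLatencyMs >= LATENCY_FLOOR_MS &&
    requestCount >= MIN_BATCH_SIZE
  );
}

shouldUseBatchApi(1000, 8 * 3600 * 1000); // nightly job of 1,000 docs -> true
shouldUseBatchApi(1, 3000);               // chatbot turn -> false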

Priority Tiers

When your application serves multiple types of requests, implement a priority system:

TypeScript
type Priority = "critical" | "high" | "normal" | "low";

class PriorityQueue {
  private queues: Map<Priority, Array<() => Promise<any>>> = new Map([
    ["critical", []],
    ["high", []],
    ["normal", []],
    ["low", []],
  ]);
  private activeRequests: number = 0;
  private maxConcurrent: number;

  constructor(maxConcurrent: number = 5) {
    this.maxConcurrent = maxConcurrent;
  }

  async enqueue<T>(priority: Priority, fn: () => Promise<T>): Promise<T> {
    return new Promise((resolve, reject) => {
      const wrappedFn = async () => {
        try {
          const result = await fn();
          resolve(result);
        } catch (err) {
          reject(err);
        }
      };
      this.queues.get(priority)!.push(wrappedFn);
      this.processNext();
    });
  }

  private async processNext() {
    if (this.activeRequests >= this.maxConcurrent) return;

    // Drain higher-priority queues first
    const priorities: Priority[] = ["critical", "high", "normal", "low"];
    for (const priority of priorities) {
      const queue = this.queues.get(priority)!;
      if (queue.length > 0) {
        const fn = queue.shift()!;
        this.activeRequests++;
        fn().finally(() => {
          this.activeRequests--;
          this.processNext();
        });
        return;
      }
    }
  }
}

// Usage
const pq = new PriorityQueue(5);

// Critical: user-facing real-time request
pq.enqueue("critical", () =>
  client.messages.create({ /* ... */ })
);

// Low: background analytics summarization
pq.enqueue("low", () =>
  client.messages.create({ /* ... */ })
);

Scaling Strategies

1. Load Balancing Across API Keys

Distribute requests across multiple API keys to multiply your effective rate limits. Note that Anthropic typically enforces limits at the organization (and optionally workspace) level, so the keys should map to separate workspaces or organizations to genuinely add capacity:

TypeScript
class KeyRotator {
  private keys: string[];
  private index: number = 0;
  private usage: Map<string, { requests: number; resetTime: number }>;

  constructor(apiKeys: string[]) {
    this.keys = apiKeys;
    this.usage = new Map();
    for (const key of apiKeys) {
      this.usage.set(key, { requests: 0, resetTime: Date.now() + 60000 });
    }
  }

  getNextKey(): string {
    // Round-robin with usage awareness
    const startIndex = this.index;
    do {
      const key = this.keys[this.index];
      const usage = this.usage.get(key)!;

      // Reset counter if window expired
      if (Date.now() > usage.resetTime) {
        usage.requests = 0;
        usage.resetTime = Date.now() + 60000;
      }

      this.index = (this.index + 1) % this.keys.length;

      // Return key if it has capacity (50 RPM per key assumed here)
      if (usage.requests < 50) {
        usage.requests++;
        return key;
      }
    } while (this.index !== startIndex);

    throw new Error("All API keys rate limited");
  }
}

const rotator = new KeyRotator([
  process.env.CLAUDE_KEY_1!,
  process.env.CLAUDE_KEY_2!,
  process.env.CLAUDE_KEY_3!,
]);

async function scaledRequest(prompt: string) {
  const key = rotator.getNextKey();
  const client = new Anthropic({ apiKey: key });
  return client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    messages: [{ role: "user", content: prompt }],
  });
}

2. Monitoring Rate Limit Usage

Track your consumption to proactively avoid hitting limits:

TypeScript
class RateLimitMonitor {
  private history: Array<{
    timestamp: number;
    tokensUsed: number;
    model: string;
  }> = [];

  record(tokensUsed: number, model: string) {
    this.history.push({
      timestamp: Date.now(),
      tokensUsed,
      model,
    });
    // Keep only last hour
    const oneHourAgo = Date.now() - 3600000;
    this.history = this.history.filter((h) => h.timestamp > oneHourAgo);
  }

  getUsageLastMinute(): { requests: number; tokens: number } {
    const oneMinuteAgo = Date.now() - 60000;
    const recent = this.history.filter((h) => h.timestamp > oneMinuteAgo);
    return {
      requests: recent.length,
      tokens: recent.reduce((sum, h) => sum + h.tokensUsed, 0),
    };
  }

  shouldThrottle(rpmLimit: number, tpmLimit: number): boolean {
    const usage = this.getUsageLastMinute();
    return (
      usage.requests >= rpmLimit * 0.8 ||
      usage.tokens >= tpmLimit * 0.8
    );
  }

  getReport(): string {
    const usage = this.getUsageLastMinute();
    return (
      `Last minute: ${usage.requests} requests, ` +
      `${usage.tokens} tokens`
    );
  }
}
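To put the monitor to work, record usage after each call and throttle before you hit the ceiling. A usage sketch, assuming Tier 2-style limits and the client from the earlier examples:

TypeScript
const monitor = new RateLimitMonitor();

async function monitoredRequest(prompt: string) {
  // Back off proactively at 80% of capacity instead of waiting for a 429
  if (monitor.shouldThrottle(1000, 80000)) {
    console.warn(monitor.getReport());
    await new Promise((r) => setTimeout(r, 5000)); // simple cool-down
  }

  const response = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    messages: [{ role: "user", content: prompt }],
  });

  // Record actual consumption reported in the response's usage block
  monitor.record(
    response.usage.input_tokens + response.usage.output_tokens,
    response.model
  );

  return response;
}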

3. Choosing the Right Model for Scale

| Scenario | Recommended Model | Why |
|---|---|---|
| High-volume classification | Claude Haiku | Fast, cheap, high RPM |
| Complex analysis at scale | Claude Sonnet | Balance of quality and speed |
| Critical decisions only | Claude Opus | Best quality, lowest limits |
| Mixed workloads | Route by complexity | Optimize cost and capacity |
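For mixed workloads, the routing itself can be a small function. A sketch, where the complexity signal is an assumption your application must supply and the model IDs are illustrative (check the Anthropic docs for current ones):

TypeScript
type TaskComplexity = "simple" | "moderate" | "complex";

// Map task complexity to a model tier so cheap, high-RPM models
// absorb the bulk of the traffic (model IDs are illustrative)
function pickModel(complexity: TaskComplexity): string {
  switch (complexity) {
    case "simple":
      return "claude-3-5-haiku-20241022";  // classification, extraction
    case "moderate":
      return "claude-sonnet-4-20250514";   // analysis, summarization
    case "complex":
      return "claude-opus-4-20250514";     // critical reasoning
  }
}

async function routedRequest(prompt: string, complexity: TaskComplexity) {
  return client.messages.create({
    model: pickModel(complexity),
    max_tokens: 1024,
    messages: [{ role: "user", content: prompt }],
  });
}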

Key Takeaways

  • Know your tier and track your usage against limits
  • Always implement retry logic with exponential backoff and jitter
  • Use queues to smooth out traffic spikes
  • Batch API saves 50% on high-volume, latency-tolerant workloads
  • Rotate keys and balance load for horizontal scaling
  • Monitor proactively; do not wait for 429 errors before reacting
  • Choose the right model for each task to maximize throughput

Next up: We will explore caching strategies that can reduce your API costs and avoid rate limits altogether.