📈 Rate Limits & Scaling
Understand rate limits, handle 429 errors, and scale your application
When you build production applications with the Claude API, one of the first walls you will hit is rate limits. Understanding how they work, how to handle them gracefully, and how to architect your system for scale is critical for any serious deployment.
What Are Rate Limits?
Rate limits are restrictions that Anthropic places on how many requests you can send within a given time window. They exist to:
- Protect infrastructure from overload
- Ensure fair usage across all customers
- Maintain quality of service during peak demand
- Prevent abuse of the API
Rate limits are measured along three dimensions:
| Dimension | What It Measures |
|---|---|
| Requests per minute (RPM) | How many API calls you can make per minute |
| Tokens per minute (TPM) | How many input + output tokens you can consume per minute |
| Tokens per day (TPD) | Daily token consumption ceiling |
You are constrained by whichever limit you hit first. For example, even if you have RPM capacity left, exceeding your TPM will trigger a rate limit.
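For example, with Tier 2-style numbers and an assumed average request size (both figures illustrative), the token budget can become the binding constraint long before RPM does:
```typescript
// Illustrative figures only -- substitute your actual tier limits and workload
const rpmLimit = 1000;                  // requests per minute
const tpmLimit = 80_000;                // input tokens per minute
const avgInputTokensPerRequest = 2_000;

// How many requests per minute the token budget actually allows
const tokenBoundRpm = Math.floor(tpmLimit / avgInputTokensPerRequest); // 40

// The effective ceiling is whichever dimension runs out first
const effectiveRpm = Math.min(rpmLimit, tokenBoundRpm);
console.log(`Effective throughput: ${effectiveRpm} requests/minute`); // 40, not 1000
```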
Anthropic Rate Limit Tiers
Anthropic organizes rate limits into four tiers based on your usage history and spend:
| Tier | RPM | TPM (Input) | TPM (Output) | How to Qualify |
|---|---|---|---|---|
| Tier 1 | 50 | 40,000 | 8,000 | New accounts with valid payment |
| Tier 2 | 1,000 | 80,000 | 16,000 | $100+ total spend |
| Tier 3 | 2,000 | 160,000 | 32,000 | $500+ total spend |
| Tier 4 | 4,000 | 400,000 | 80,000 | $1,000+ total spend, high trust |
Note: These limits vary by model. Claude Opus has lower limits than Claude Haiku. Always check the latest limits on the Anthropic docs.
How Tier Upgrades Work
Tier upgrades are automatic based on cumulative spend. However:
- Upgrades are not instant; they may take up to 24 hours
- You can request manual upgrades by contacting Anthropic sales
- Enterprise customers can negotiate custom limits
Rate Limit Headers
Every API response from Anthropic includes headers that tell you exactly where you stand:
```
anthropic-ratelimit-requests-limit: 1000
anthropic-ratelimit-tokens-limit: 80000
anthropic-ratelimit-requests-remaining: 847
anthropic-ratelimit-tokens-remaining: 63250
anthropic-ratelimit-requests-reset: 2025-01-15T12:01:00Z
anthropic-ratelimit-tokens-reset: 2025-01-15T12:01:00Z
retry-after: 3
```
| Header | Purpose |
|---|---|
| anthropic-ratelimit-requests-limit | Your total RPM allowance |
| anthropic-ratelimit-tokens-limit | Your total TPM allowance |
| anthropic-ratelimit-requests-remaining | How many requests you have left this window |
| anthropic-ratelimit-tokens-remaining | How many tokens you have left this window |
| anthropic-ratelimit-requests-reset | When the request counter resets |
| anthropic-ratelimit-tokens-reset | When the token counter resets |
| retry-after | Seconds to wait before retrying (sent with 429 responses) |
Reading Headers in Code
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();
async function callWithRateLimitInfo() {
  // .withResponse() returns both the parsed message and the raw HTTP response,
  // so the rate limit headers can be read without making a second API call
  const { data: message, response: raw } = await client.messages
    .create({
      model: "claude-sonnet-4-20250514",
      max_tokens: 1024,
      messages: [{ role: "user", content: "Hello" }],
    })
    .withResponse();

  const remaining = raw.headers.get("anthropic-ratelimit-requests-remaining");
  const resetTime = raw.headers.get("anthropic-ratelimit-requests-reset");
  console.log("Requests remaining:", remaining);
  console.log("Resets at:", resetTime);

  return message;
}
```
Handling 429 Errors
When you exceed a rate limit, the API returns a 429 Too Many Requests status code. You must handle this gracefully.
Basic 429 Error Structure
```json
{
"type": "error",
"error": {
"type": "rate_limit_error",
"message": "Rate limit exceeded. Please retry after 3 seconds."
}
}
```
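With the TypeScript SDK, a 429 surfaces as a typed exception, so you can detect it explicitly before deciding to retry. A minimal sketch using the SDK's exported error classes and the `client` from earlier:
```typescript
try {
  await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    messages: [{ role: "user", content: "Hello" }],
  });
} catch (error) {
  if (error instanceof Anthropic.RateLimitError) {
    // HTTP 429: back off before retrying (see the implementation below)
    console.warn("Rate limited:", error.message);
  } else {
    throw error; // not a rate limit problem -- surface it
  }
}
```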
Exponential Backoff Implementation
The standard approach is exponential backoff with jitter. Each retry waits longer than the last, with randomness added to prevent a thundering herd:
```typescript
async function callWithRetry(
fn: () => Promise<any>,
maxRetries: number = 5,
baseDelay: number = 1000
): Promise<any> {
for (let attempt = 0; attempt <= maxRetries; attempt++) {
try {
return await fn();
} catch (error: any) {
if (error.status !== 429 || attempt === maxRetries) {
throw error;
}
// Parse retry-after header if available
const retryAfter = error.headers?.["retry-after"];
let delay: number;
if (retryAfter) {
delay = parseInt(retryAfter, 10) * 1000;
} else {
// Exponential backoff: 1s, 2s, 4s, 8s, 16s
delay = baseDelay * Math.pow(2, attempt);
}
// Add jitter: +/- 25% randomness
const jitter = delay * 0.25 * (Math.random() * 2 - 1);
delay = Math.max(0, delay + jitter);
console.warn(
`Rate limited. Attempt ${attempt + 1}/${maxRetries}. ` +
`Retrying in ${Math.round(delay)}ms...`
);
await new Promise((resolve) => setTimeout(resolve, delay));
}
}
}
// Usage
const response = await callWithRetry(() =>
client.messages.create({
model: "claude-sonnet-4-20250514",
max_tokens: 1024,
messages: [{ role: "user", content: "Explain rate limiting" }],
})
);
```
Request Queuing
For applications processing many requests, a queue prevents you from exceeding limits:
```typescript
class RequestQueue {
private queue: Array<{
fn: () => Promise<any>;
resolve: (value: any) => void;
reject: (reason: any) => void;
}> = [];
private running: number = 0;
private maxConcurrent: number;
private intervalMs: number;
constructor(requestsPerMinute: number, maxConcurrent: number = 10) {
this.maxConcurrent = maxConcurrent;
this.intervalMs = (60 * 1000) / requestsPerMinute;
}
async add<T>(fn: () => Promise<T>): Promise<T> {
return new Promise((resolve, reject) => {
this.queue.push({ fn, resolve, reject });
this.process();
});
}
private async process() {
if (this.running >= this.maxConcurrent || this.queue.length === 0) {
return;
}
const item = this.queue.shift();
if (!item) return;
this.running++;
try {
const result = await item.fn();
item.resolve(result);
} catch (error) {
item.reject(error);
} finally {
this.running--;
// Enforce rate spacing
setTimeout(() => this.process(), this.intervalMs);
}
}
}
// Usage
const queue = new RequestQueue(50, 5); // 50 RPM, 5 concurrent
const results = await Promise.all(
prompts.map((prompt) =>
queue.add(() =>
client.messages.create({
model: "claude-sonnet-4-20250514",
max_tokens: 512,
messages: [{ role: "user", content: prompt }],
})
)
)
);
```
Concurrent Request Management
When handling multiple users or tasks simultaneously, you need a concurrency limiter:
```typescript
class ConcurrencyLimiter {
private active: number = 0;
private waiting: Array<() => void> = [];
constructor(private limit: number) {}
async acquire(): Promise<void> {
if (this.active < this.limit) {
this.active++;
return;
}
return new Promise<void>((resolve) => {
this.waiting.push(() => {
this.active++;
resolve();
});
});
}
release(): void {
this.active--;
const next = this.waiting.shift();
if (next) next();
}
async run<T>(fn: () => Promise<T>): Promise<T> {
await this.acquire();
try {
return await fn();
} finally {
this.release();
}
}
}
// Allow max 10 concurrent Claude requests
const limiter = new ConcurrencyLimiter(10);
async function handleUserRequest(userMessage: string) {
return limiter.run(() =>
client.messages.create({
model: "claude-sonnet-4-20250514",
max_tokens: 1024,
messages: [{ role: "user", content: userMessage }],
})
);
}
```
Batch API for High Volume
For processing large volumes of requests where latency is not critical, Anthropic offers a Message Batches API that provides 50% cost savings:
```typescript
// Create a batch of requests
const batch = await client.messages.batches.create({
requests: [
{
custom_id: "request-1",
params: {
model: "claude-sonnet-4-20250514",
max_tokens: 1024,
messages: [{ role: "user", content: "Summarize article 1" }],
},
},
{
custom_id: "request-2",
params: {
model: "claude-sonnet-4-20250514",
max_tokens: 1024,
messages: [{ role: "user", content: "Summarize article 2" }],
},
},
// Up to 10,000 requests per batch
],
});
console.log("Batch ID:", batch.id);
console.log("Status:", batch.processing_status);
// Poll for completion
async function waitForBatch(batchId: string) {
while (true) {
const status = await client.messages.batches.retrieve(batchId);
if (status.processing_status === "ended") {
console.log("Batch complete!");
console.log("Succeeded:", status.request_counts.succeeded);
console.log("Failed:", status.request_counts.errored);
return status;
}
// Wait 30 seconds before checking again
await new Promise((r) => setTimeout(r, 30000));
}
}
// Retrieve results
const completedBatch = await waitForBatch(batch.id);
const results = [];
const resultStream = await client.messages.batches.results(batch.id);
for await (const result of resultStream) {
if (result.result.type === "succeeded") {
results.push({
id: result.custom_id,
text: result.result.message.content[0].text,
});
}
}
```
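The loop above only collects successful entries. It is also worth capturing entries that failed so they can be resubmitted; a small sketch, assuming the same `batch` and `client` as above and that non-succeeded entries carry an error payload:
```typescript
// Collect the custom_ids of requests that errored so they can be retried
const failedIds: string[] = [];
const retryStream = await client.messages.batches.results(batch.id);
for await (const entry of retryStream) {
  if (entry.result.type === "errored") {
    console.warn(`${entry.custom_id} failed:`, entry.result.error);
    failedIds.push(entry.custom_id);
  }
}
```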
When to Use Batch API
| Use Case | Batch API? | Why |
|---|---|---|
| Summarizing 1,000 articles | Yes | High volume, latency flexible |
| Real-time chatbot | No | Needs instant responses |
| Nightly data processing | Yes | Scheduled, no urgency |
| Interactive code review | No | User waiting for result |
Priority Tiers
When your application serves multiple types of requests, implement a priority system:
```typescript
type Priority = "critical" | "high" | "normal" | "low";
class PriorityQueue {
private queues: Map<Priority, Array<() => Promise<any>>> = new Map([
["critical", []],
["high", []],
["normal", []],
["low", []],
]);
private activeRequests: number = 0;
private maxConcurrent: number;
constructor(maxConcurrent: number = 5) {
this.maxConcurrent = maxConcurrent;
}
async enqueue<T>(
priority: Priority,
fn: () => Promise<T>
): Promise<T> {
return new Promise((resolve, reject) => {
const wrappedFn = async () => {
try {
const result = await fn();
resolve(result);
} catch (err) {
reject(err);
}
};
this.queues.get(priority)!.push(wrappedFn);
this.processNext();
});
}
private async processNext() {
if (this.activeRequests >= this.maxConcurrent) return;
const priorities: Priority[] = ["critical", "high", "normal", "low"];
for (const priority of priorities) {
const queue = this.queues.get(priority)!;
if (queue.length > 0) {
const fn = queue.shift()!;
this.activeRequests++;
fn().finally(() => {
this.activeRequests--;
this.processNext();
});
return;
}
}
}
}
// Usage
const pq = new PriorityQueue(5);
// Critical: user-facing real-time request
pq.enqueue("critical", () =>
client.messages.create({ /* ... */ })
);
// Low: background analytics summarization
pq.enqueue("low", () =>
client.messages.create({ /* ... */ })
);
```
Scaling Strategies
1. Load Balancing Across API Keys
Distributing requests across multiple API keys can raise your effective rate limits, but note that Anthropic enforces limits at the organization level: keys that belong to the same organization share one budget, so this pattern only multiplies capacity when the keys map to separately rate-limited organizations:
```typescript
class KeyRotator {
private keys: string[];
private index: number = 0;
private usage: Map<string, { requests: number; resetTime: number }>;
constructor(apiKeys: string[]) {
this.keys = apiKeys;
this.usage = new Map();
for (const key of apiKeys) {
this.usage.set(key, { requests: 0, resetTime: Date.now() + 60000 });
}
}
getNextKey(): string {
// Round-robin with usage awareness
const startIndex = this.index;
do {
const key = this.keys[this.index];
const usage = this.usage.get(key)!;
// Reset counter if window expired
if (Date.now() > usage.resetTime) {
usage.requests = 0;
usage.resetTime = Date.now() + 60000;
}
this.index = (this.index + 1) % this.keys.length;
// Return key if it has capacity
if (usage.requests < 50) {
usage.requests++;
return key;
}
} while (this.index !== startIndex);
throw new Error("All API keys rate limited");
}
}
const rotator = new KeyRotator([
process.env.CLAUDE_KEY_1!,
process.env.CLAUDE_KEY_2!,
process.env.CLAUDE_KEY_3!,
]);
async function scaledRequest(prompt: string) {
const key = rotator.getNextKey();
const client = new Anthropic({ apiKey: key });
return client.messages.create({
model: "claude-sonnet-4-20250514",
max_tokens: 1024,
messages: [{ role: "user", content: prompt }],
});
}
```
2. Monitoring Rate Limit Usage
Track your consumption to proactively avoid hitting limits:
```typescript
class RateLimitMonitor {
private history: Array<{
timestamp: number;
tokensUsed: number;
model: string;
}> = [];
record(tokensUsed: number, model: string) {
this.history.push({
timestamp: Date.now(),
tokensUsed,
model,
});
// Keep only last hour
const oneHourAgo = Date.now() - 3600000;
this.history = this.history.filter((h) => h.timestamp > oneHourAgo);
}
getUsageLastMinute(): { requests: number; tokens: number } {
const oneMinuteAgo = Date.now() - 60000;
const recent = this.history.filter((h) => h.timestamp > oneMinuteAgo);
return {
requests: recent.length,
tokens: recent.reduce((sum, h) => sum + h.tokensUsed, 0),
};
}
shouldThrottle(rpmLimit: number, tpmLimit: number): boolean {
const usage = this.getUsageLastMinute();
return (
usage.requests >= rpmLimit * 0.8 ||
usage.tokens >= tpmLimit * 0.8
);
}
getReport(): string {
const usage = this.getUsageLastMinute();
return (
`Last minute: ${usage.requests} requests, ` +
`${usage.tokens} tokens`
);
}
}
```
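One way to wire the monitor into a request path is to check it before each call and record actual usage afterwards. The limits passed to `shouldThrottle` are the Tier 2 figures from the table above, and the one-second pause is an arbitrary choice:
```typescript
const monitor = new RateLimitMonitor();

async function monitoredRequest(prompt: string) {
  // Pause while we are within 80% of the RPM or TPM budget
  while (monitor.shouldThrottle(1000, 80000)) {
    await new Promise((r) => setTimeout(r, 1000));
  }
  const response = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    messages: [{ role: "user", content: prompt }],
  });
  // Record the token consumption actually reported by the API
  monitor.record(
    response.usage.input_tokens + response.usage.output_tokens,
    response.model
  );
  return response;
}
```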
3. Choosing the Right Model for Scale
| Scenario | Recommended Model | Why |
|---|---|---|
| High-volume classification | Claude Haiku | Fast, cheap, high RPM |
| Complex analysis at scale | Claude Sonnet | Balance of quality and speed |
| Critical decisions only | Claude Opus | Best quality, lowest limits |
| Mixed workloads | Route by complexity | Optimize cost and capacity |
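For mixed workloads, a simple router can pick a model per request. The heuristic, threshold, and Haiku model alias below are illustrative assumptions, not a recommendation from Anthropic:
```typescript
// Hypothetical heuristic: long or analysis-heavy prompts go to Sonnet, the rest to Haiku
function pickModel(prompt: string): string {
  const looksComplex =
    prompt.length > 2000 || /analyze|refactor|architect/i.test(prompt);
  return looksComplex ? "claude-sonnet-4-20250514" : "claude-3-5-haiku-latest";
}

async function routedRequest(prompt: string) {
  return client.messages.create({
    model: pickModel(prompt),
    max_tokens: 1024,
    messages: [{ role: "user", content: prompt }],
  });
}
```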
Key Takeaways
- Know your tier and track your usage against limits
- Always implement retry logic with exponential backoff and jitter
- Use queues to smooth out traffic spikes
- Batch API saves 50% on high-volume, latency-tolerant workloads
- Rotate keys and balance load for horizontal scaling
- Monitor proactively -- do not wait for 429 errors to react
- Choose the right model for each task to maximize throughput
Next up: We will explore caching strategies that can reduce your API costs and avoid rate limits altogether.