💰Cost Optimization at Scale
Reduce AI costs by 50-90% with smart caching, batching, and model selection
When you move from prototype to production, API costs can explode. A single Claude call might cost fractions of a cent, but at 10 million calls per month the bill becomes significant. This lesson covers the key techniques for cutting costs without sacrificing quality.
1. Understanding the Cost Formula
Every API call has a deterministic cost:
cost = (input_tokens × input_price) + (output_tokens × output_price)
Current pricing (per million tokens):
| Model | Input | Output |
|---|---|---|
| Claude Opus 4 | $15 | $75 |
| Claude Sonnet 4 | $3 | $15 |
| Claude Haiku 3.5 | $0.80 | $4 |
Key insight: Output tokens cost 5x more than input tokens. Controlling output length is the single highest-leverage cost reduction.
Quick Estimator
function estimateCost(
inputTokens: number,
outputTokens: number,
model: "opus" | "sonnet" | "haiku"
): number {
const pricing = {
opus: { input: 15 / 1_000_000, output: 75 / 1_000_000 },
sonnet: { input: 3 / 1_000_000, output: 15 / 1_000_000 },
haiku: { input: 0.8 / 1_000_000, output: 4 / 1_000_000 },
};
const p = pricing[model];
return inputTokens * p.input + outputTokens * p.output;
}
// Example: 2000 input + 500 output on Sonnet
console.log(estimateCost(2000, 500, "sonnet"));
// $0.0135 per call → $135 per 10k calls
2. Prompt Caching — Reuse What You Already Sent
Prompt caching lets you mark a stable prefix (system prompt, documents, few-shot examples) so the API stores it server-side. Subsequent requests that share the same prefix skip re-processing those tokens.
How It Works
- Cache write: First request pays a 25% surcharge on the cached portion.
- Cache hit: Subsequent requests pay only 10% of the normal input price.
- TTL: 5 minutes by default, refreshed each time the cached prefix is reused; a 1-hour TTL is available by setting `ttl: "1h"` on the `cache_control` block (at a higher cache-write surcharge).
- Minimum cacheable prefix: 1,024 tokens (Sonnet and Opus), 2,048 tokens (Haiku).
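To see when caching pays off, here is a minimal sketch, assuming the Sonnet input rate from the pricing table and the 1.25× write / 0.10× read multipliers described above:

// Input-token cost for N requests sharing a cacheable prefix, with and without caching.
// Assumes Sonnet input at $3/M; cache write = 1.25x, cache read = 0.10x.
function cachedVsUncached(prefixTokens: number, requests: number) {
  const perToken = 3 / 1_000_000;
  const uncached = prefixTokens * perToken * requests;
  const cached =
    prefixTokens * perToken * 1.25 + // first request writes the cache
    prefixTokens * perToken * 0.1 * (requests - 1); // later requests read it
  return { uncached, cached, savings: 1 - cached / uncached };
}

console.log(cachedVsUncached(4_000, 1_000)); // ~90% savings on the prefix
console.log(cachedVsUncached(4_000, 3)); // even 3 requests within the TTL save ~50%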
Implementation
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();
// The system prompt is the stable prefix — cache it
const response = await client.messages.create({
model: "claude-sonnet-4-20250514",
max_tokens: 1024,
system: [
{
type: "text",
text: longSystemPrompt, // 3000+ tokens of instructions
cache_control: { type: "ephemeral" }, // cached for 5 minutes, refreshed on each hit
},
],
messages: [{ role: "user", content: userQuery }],
});
// Check cache performance
console.log("Cache read tokens:", response.usage.cache_read_input_tokens);
console.log("Cache write tokens:", response.usage.cache_creation_input_tokens);Savings Calculation
| Scenario (Sonnet input, $3/M) | Without Cache | With Cache (hits only) | Savings |
|---|---|---|---|
| 4,000 token system prompt × 1,000 calls | $12.00 | $1.20 | 90% |
| 10,000 token docs × 500 calls | $15.00 | $1.50 | 90% |
| 2,000 token few-shot × 2,000 calls | $12.00 | $1.20 | 90% |
Rule of thumb: If the same prefix appears in 3+ requests within 5 minutes, cache it.
3. Batch API — 50% Off for Non-Urgent Work
The Message Batches API lets you submit up to 100,000 requests at once and receive results within 24 hours — at 50% of the standard price.
Best Use Cases
- Nightly content generation
- Bulk classification / tagging
- Dataset enrichment
- Evaluation runs
- Report generation
Implementation
const client = new Anthropic();
// Step 1: Create the batch
const batch = await client.messages.batches.create({
requests: products.map((product, i) => ({
custom_id: `product-${i}`,
params: {
model: "claude-haiku-3-5-20241022",
max_tokens: 200,
messages: [
{
role: "user",
content: `Write a one-line description for: ${product.name}`,
},
],
},
})),
});
console.log("Batch ID:", batch.id);
// Step 2: Poll for completion
async function waitForBatch(batchId: string) {
while (true) {
const status = await client.messages.batches.retrieve(batchId);
if (status.processing_status === "ended") return status;
await new Promise((r) => setTimeout(r, 30_000)); // check every 30s
}
}
// Step 3: Wait for completion, then retrieve results
await waitForBatch(batch.id);
const results = await client.messages.batches.results(batch.id);
for await (const entry of results) {
if (entry.result.type === "succeeded") {
console.log(entry.custom_id, entry.result.message.content[0].text);
} else {
console.error(entry.custom_id, "failed with result type:", entry.result.type);
}
}
Cost Comparison
| Task | Standard API | Batch API | Savings |
|---|---|---|---|
| 10,000 classifications (Haiku) | $16 | $8 | $8 |
| 5,000 summaries (Sonnet) | $150 | $75 | $75 |
| 1,000 analyses (Opus) | $900 | $450 | $450 |
4. Model Routing — Right Model for the Right Job
Not every request needs the most powerful model. A smart router directs tasks to the cheapest model that can handle them.
Tiered Architecture
┌─────────────────────────────┐
│ Incoming Request │
└─────────┬───────────────────┘
│
┌──────▼──────┐
│ Classifier │ (Haiku — near-zero cost)
└──────┬──────┘
│
┌──────┴──────────────┬───────────────────┐
▼ ▼ ▼
Simple Medium Complex
(Haiku) (Sonnet) (Opus)
$0.80/M in $3/M in $15/M in
FAQ, greetings Summaries Analysis,
classifications Code gen reasoning
Router Implementation
type Tier = "haiku" | "sonnet" | "opus";
interface RouteResult {
tier: Tier;
model: string;
}
async function routeRequest(userMessage: string): Promise<RouteResult> {
// Use Haiku to classify — costs fractions of a cent
const classification = await client.messages.create({
model: "claude-haiku-3-5-20241022",
max_tokens: 20,
messages: [
{
role: "user",
content: `Classify this request complexity as SIMPLE, MEDIUM, or COMPLEX.
Only output one word.
Request: "${userMessage}"`,
},
],
});
const level = classification.content[0].text.trim().toUpperCase();
const routes: Record<string, RouteResult> = {
SIMPLE: { tier: "haiku", model: "claude-3-5-haiku-20241022" },
MEDIUM: { tier: "sonnet", model: "claude-sonnet-4-20250514" },
COMPLEX: { tier: "opus", model: "claude-opus-4-20250514" },
};
return routes[level] ?? routes.MEDIUM;
}
// Usage
const route = await routeRequest("What is your return policy?");
const response = await client.messages.create({
model: route.model,
max_tokens: 512,
messages: [{ role: "user", content: "What is your return policy?" }],
});
Routing Savings
| Traffic Mix | All-Opus Cost | Routed Cost | Savings |
|---|---|---|---|
| 70% simple, 20% medium, 10% complex | $15,000 | $2,460 | 84% |
| 50% simple, 30% medium, 20% complex | $15,000 | $4,520 | 70% |
| 30% simple, 40% medium, 30% complex | $15,000 | $6,780 | 55% |
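These routed figures are illustrative; the exact blend depends on the token mix each tier sees. Here is a minimal sketch for estimating a blend, assuming every tier handles the same average token volume per request (relative prices per tier are identical for input and output tokens, so one ratio per tier suffices):

// Blended cost of routed traffic relative to an all-Opus baseline.
function routedCost(
  allOpusCost: number,
  mix: { haiku: number; sonnet: number; opus: number } // fractions summing to 1
): number {
  const rel = { haiku: 0.8 / 15, sonnet: 3 / 15, opus: 1 }; // price ratios vs. Opus
  return (
    allOpusCost *
    (mix.haiku * rel.haiku + mix.sonnet * rel.sonnet + mix.opus * rel.opus)
  );
}

// 70/20/10 mix: ≈ $2,660, in the same ballpark as the table's $2,460
console.log(routedCost(15_000, { haiku: 0.7, sonnet: 0.2, opus: 0.1 }));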
5. Token Reduction Techniques
5a. Shorter Prompts
Every word in your prompt costs tokens. Trim aggressively:
// BEFORE: 47 tokens
const verbose = `I would like you to please analyze the following customer
feedback message and determine whether the overall sentiment expressed by
the customer is positive, negative, or neutral. Here is the feedback:`;
// AFTER: 15 tokens
const concise = `Classify sentiment as positive/negative/neutral:`;
Savings at scale: 32 fewer tokens × 1M calls × $3/M = $96 saved on input alone.
5b. Limit Output with max_tokens
Always set max_tokens to the minimum you need:
// Classification: only need one word
{ max_tokens: 10 }
// Short summary: one paragraph
{ max_tokens: 150 }
// Full analysis: still cap it
{ max_tokens: 1024 }
5c. Structured Output Constraints
Ask for compact formats:
const response = await client.messages.create({
model: "claude-haiku-3-5-20241022",
max_tokens: 100,
messages: [
{
role: "user",
content: `Extract entities as JSON. No explanation.
Input: "John Smith ordered 3 laptops from Tokyo office"
Output format: {"people":[],"items":[],"locations":[]}`,
},
],
});
// Output: {"people":["John Smith"],"items":["laptops"],"locations":["Tokyo"]}
// ~20 tokens instead of 100+ with explanation
6. Response Caching — Never Pay Twice for the Same Answer
Build an application-level cache so identical (or near-identical) questions reuse previous answers.
Hash-Based Cache
import crypto from "node:crypto";

interface CacheEntry {
response: string;
timestamp: number;
model: string;
tokens: { input: number; output: number };
}
class ResponseCache {
private cache = new Map<string, CacheEntry>();
private ttlMs: number;
constructor(ttlMinutes = 60) {
this.ttlMs = ttlMinutes * 60 * 1000;
}
private hash(model: string, systemPrompt: string, userMessage: string): string {
return crypto
.createHash("sha256")
.update(`${model}|${systemPrompt}|${userMessage}`)
.digest("hex");
}
get(model: string, systemPrompt: string, userMessage: string): CacheEntry | null {
const key = this.hash(model, systemPrompt, userMessage);
const entry = this.cache.get(key);
if (!entry) return null;
if (Date.now() - entry.timestamp > this.ttlMs) {
this.cache.delete(key);
return null;
}
return entry;
}
set(
model: string,
systemPrompt: string,
userMessage: string,
response: string,
tokens: { input: number; output: number }
): void {
const key = this.hash(model, systemPrompt, userMessage);
this.cache.set(key, {
response,
timestamp: Date.now(),
model,
tokens,
});
}
}
Using the Cache
const cache = new ResponseCache(120); // 2-hour TTL
async function cachedQuery(
systemPrompt: string,
userMessage: string,
model: string
): Promise<string> {
// Check cache first
const cached = cache.get(model, systemPrompt, userMessage);
if (cached) {
console.log("Cache HIT — saved",
estimateCost(cached.tokens.input, cached.tokens.output, "sonnet"), // assumes Sonnet rates for this log line
"dollars"
);
return cached.response;
}
// Cache miss — call API
const response = await client.messages.create({
model,
max_tokens: 1024,
system: systemPrompt,
messages: [{ role: "user", content: userMessage }],
});
const text = response.content[0].text;
cache.set(model, systemPrompt, userMessage, text, {
input: response.usage.input_tokens,
output: response.usage.output_tokens,
});
return text;
}
7. Building a Cost-Aware Wrapper
Combine all techniques into a single wrapper class:
interface CostReport {
model: string;
inputTokens: number;
outputTokens: number;
cost: number;
cached: boolean;
batchEligible: boolean;
}
class CostAwareClient {
private client: Anthropic;
private cache: ResponseCache;
private totalSpend = 0;
private callCount = 0;
constructor() {
this.client = new Anthropic();
this.cache = new ResponseCache(120);
}
async query(options: {
userMessage: string;
systemPrompt?: string;
forceModel?: string;
maxTokens?: number;
skipCache?: boolean;
}): Promise<{ text: string; report: CostReport }> {
const sys = options.systemPrompt ?? "";
// 1. Check response cache
if (!options.skipCache) {
const cached = this.cache.get(
options.forceModel ?? "auto",
sys,
options.userMessage
);
if (cached) {
return {
text: cached.response,
report: {
model: cached.model,
inputTokens: 0,
outputTokens: 0,
cost: 0,
cached: true,
batchEligible: false,
},
};
}
}
// 2. Route to cheapest suitable model
const route = options.forceModel
? { model: options.forceModel, tier: "manual" }
: await routeRequest(options.userMessage);
// 3. Call with prompt caching
const response = await this.client.messages.create({
model: route.model,
max_tokens: options.maxTokens ?? 1024,
system: sys
? [{ type: "text", text: sys, cache_control: { type: "ephemeral" } }]
: undefined,
messages: [{ role: "user", content: options.userMessage }],
});
const text = response.content[0].text;
const inp = response.usage.input_tokens;
const out = response.usage.output_tokens;
// Derive the pricing tier from the model name (covers forced models, whose tier is "manual")
const tier: Tier = route.model.includes("haiku") ? "haiku" : route.model.includes("opus") ? "opus" : "sonnet";
const cost = estimateCost(inp, out, tier);
// 4. Store in cache, keyed to match the look-up in step 1 (so hits report the cache key, not the routed model)
this.cache.set(options.forceModel ?? "auto", sys, options.userMessage, text, {
input: inp,
output: out,
});
this.totalSpend += cost;
this.callCount += 1;
return {
text,
report: {
model: route.model,
inputTokens: inp,
outputTokens: out,
cost,
cached: false,
batchEligible: false,
},
};
}
getStats() {
return {
totalSpend: this.totalSpend,
callCount: this.callCount,
avgCostPerCall: this.totalSpend / (this.callCount || 1),
};
}
}
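A quick usage sketch for the wrapper; the query text and token budget here are arbitrary examples:

const costAware = new CostAwareClient();

const { text, report } = await costAware.query({
  userMessage: "Summarize our refund policy in two sentences.",
  maxTokens: 150,
});
console.log(text);
console.log(report); // model used, token counts, cost, cache status
console.log(costAware.getStats()); // cumulative spend and average cost per call

8. Real-World Cost Scenarios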
Scenario A: Customer Support Chatbot
- Volume: 50,000 conversations/month, avg 4 turns each
- Without optimization: All Sonnet, no caching = ~$3,600/month (sanity-checked in the sketch below)
- With optimization: Haiku routing (70%), prompt cache, response cache = ~$420/month — 88% reduction
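A rough back-of-envelope reproducing that baseline; the per-turn token counts are assumptions chosen to match the stated figure, and Scenarios B and C follow the same arithmetic:

// 50,000 conversations × 4 turns = 200,000 Sonnet calls per month.
// Assumed per-turn sizes: ~2,000 input + ~800 output tokens.
const callsPerMonth = 50_000 * 4;
const perCall = estimateCost(2_000, 800, "sonnet"); // $0.018
console.log(callsPerMonth * perCall); // ≈ $3,600/month before optimization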
Scenario B: Document Processing Pipeline
- Volume: 10,000 documents/day, each ~5,000 tokens
- Without optimization: All Sonnet, real-time = ~$4,500/month
- With optimization: Batch API + Haiku for extraction + Sonnet for summary = ~$900/month — 80% reduction
Scenario C: Code Review Tool
- Volume: 2,000 PRs/month, avg 3,000 tokens per diff
- Without optimization: All Opus = ~$5,400/month
- With optimization: Sonnet for most, Opus only for complex, prompt cache = ~$810/month — 85% reduction
9. Cost Monitoring Dashboard
Track spending in real time:
class CostMonitor {
private dailyCosts: Map<string, number> = new Map();
private alertThreshold: number;
constructor(dailyBudget: number) {
this.alertThreshold = dailyBudget;
}
recordCall(model: "opus" | "sonnet" | "haiku", inputTokens: number, outputTokens: number): void {
const today = new Date().toISOString().split("T")[0];
const cost = estimateCost(inputTokens, outputTokens, model);
const current = this.dailyCosts.get(today) ?? 0;
this.dailyCosts.set(today, current + cost);
if (current + cost > this.alertThreshold) {
console.warn(`ALERT: Daily spend $${(current + cost).toFixed(2)} exceeds budget $${this.alertThreshold}`);
}
}
getReport(): { date: string; spend: number }[] {
return Array.from(this.dailyCosts.entries()).map(([date, spend]) => ({
date,
spend: Math.round(spend * 100) / 100,
}));
}
}
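A usage sketch; the $50 daily budget is an arbitrary example:

const monitor = new CostMonitor(50); // alert once daily spend passes $50

// Record each call using the usage block from the API response
monitor.recordCall("sonnet", 2_000, 500);
console.log(monitor.getReport()); // [{ date: "YYYY-MM-DD", spend: 0.01 }]

10. Quick-Reference Checklist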
- Set `max_tokens` to the minimum needed for every call
- Enable prompt caching for system prompts over 1,024 tokens
- Route simple tasks to Haiku, medium to Sonnet, complex to Opus
- Use Batch API for any workload that can tolerate 24-hour latency
- Cache identical responses at the application level
- Trim prompt wording — every token counts at scale
- Monitor daily spend and set budget alerts
- Request structured (JSON) output to reduce output tokens
- Review the cost dashboard weekly and adjust routing thresholds
Target: Aim for under $0.01 average cost per interaction. With the techniques above, many production workloads can reach $0.001-$0.005.