Intermediate · 12 min read · Module 10, Lesson 4

💰 Cost Optimization at Scale

Reduce AI costs by 50-90% with smart caching, batching, and model selection


When you move from prototype to production, API costs can explode. A single Claude call might cost fractions of a cent, but at 10 million calls per month the bill becomes significant. This lesson covers the key techniques for cutting costs without sacrificing quality.


1. Understanding the Cost Formula

Every API call has a deterministic cost:

cost = (input_tokens × input_price) + (output_tokens × output_price)

Current pricing (per million tokens):

Model              Input    Output
Claude Opus 4      $15      $75
Claude Sonnet 4    $3       $15
Claude Haiku 3.5   $0.80    $4

Key insight: Output tokens cost 5x more than input tokens. Controlling output length is the single highest-leverage cost reduction.

Quick Estimator

TypeScript
function estimateCost(
  inputTokens: number,
  outputTokens: number,
  model: "opus" | "sonnet" | "haiku"
): number {
  const pricing = {
    opus: { input: 15 / 1_000_000, output: 75 / 1_000_000 },
    sonnet: { input: 3 / 1_000_000, output: 15 / 1_000_000 },
    haiku: { input: 0.8 / 1_000_000, output: 4 / 1_000_000 },
  };
  const p = pricing[model];
  return inputTokens * p.input + outputTokens * p.output;
}

// Example: 2000 input + 500 output on Sonnet
console.log(estimateCost(2000, 500, "sonnet"));
// $0.0135 per call → $135 per 10k calls

2. Prompt Caching — Reuse What You Already Sent

Prompt caching lets you mark a stable prefix (system prompt, documents, few-shot examples) so the API stores it server-side. Subsequent requests that share the same prefix skip re-processing those tokens.

How It Works

  • Cache write: First request pays a 25% surcharge on the cached portion.
  • Cache hit: Subsequent requests pay only 10% of the normal input price.
  • TTL: 5 minutes by default, refreshed each time the cached prefix is hit; a 1-hour TTL is also available at a higher cache-write price.
  • Minimum cacheable prefix: 1,024 tokens (Sonnet and Opus models), 2,048 tokens (Haiku models).

Implementation

TypeScript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// The system prompt is the stable prefix — cache it
const response = await client.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: longSystemPrompt, // 3000+ tokens of instructions
      cache_control: { type: "ephemeral" }, // cached for 5 minutes, refreshed on each hit
    },
  ],
  messages: [{ role: "user", content: userQuery }],
});

// Check cache performance
console.log("Cache read tokens:", response.usage.cache_read_input_tokens);
console.log("Cache write tokens:", response.usage.cache_creation_input_tokens);

Savings Calculation

Scenario                                    Without Cache   With Cache (hit)   Savings
4,000 token system prompt × 1,000 calls     $12.00          $1.20              90%
10,000 token docs × 500 calls               $15.00          $1.50              90%
2,000 token few-shot × 2,000 calls          $12.00          $1.20              90%

Rule of thumb: If the same prefix appears in 3+ requests within 5 minutes, cache it.
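With the 25% write surcharge and 10% hit price above, the break-even point is easy to derive. Here is a minimal sketch of the arithmetic for the cached prefix alone, ignoring the uncached part of each request:

TypeScript
// Relative cost of the cached prefix vs. paying normal input price every time
const WRITE = 1.25; // first request pays the cache-write surcharge
const HIT = 0.1;    // later requests within the TTL pay 10%

function prefixCost(n: number): { cached: number; uncached: number } {
  return { cached: WRITE + (n - 1) * HIT, uncached: n };
}

console.log(prefixCost(1));  // { cached: 1.25, uncached: 1 }  (a single request loses money)
console.log(prefixCost(2));  // { cached: 1.35, uncached: 2 }  (already ahead on the second request)
console.log(prefixCost(10)); // { cached: 2.15, uncached: 10 } (~78% saved on the prefix)

Caching technically pays for itself from the second request onward, so the 3+ rule of thumb leaves comfortable margin.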


3. Batch API — 50% Off for Non-Urgent Work

The Message Batches API lets you submit up to 100,000 requests at once and receive results within 24 hours — at 50% of the standard price.

Best Use Cases

  • Nightly content generation
  • Bulk classification / tagging
  • Dataset enrichment
  • Evaluation runs
  • Report generation

Implementation

TypeScript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Step 1: Create the batch
const batch = await client.messages.batches.create({
  requests: products.map((product, i) => ({
    custom_id: `product-${i}`,
    params: {
      model: "claude-3-5-haiku-20241022",
      max_tokens: 200,
      messages: [
        {
          role: "user",
          content: `Write a one-line description for: ${product.name}`,
        },
      ],
    },
  })),
});
console.log("Batch ID:", batch.id);

// Step 2: Poll for completion
async function waitForBatch(batchId: string) {
  while (true) {
    const status = await client.messages.batches.retrieve(batchId);
    if (status.processing_status === "ended") return status;
    await new Promise((r) => setTimeout(r, 30_000)); // check every 30s
  }
}
await waitForBatch(batch.id);

// Step 3: Retrieve results
const results = await client.messages.batches.results(batch.id);
for await (const entry of results) {
  if (entry.result.type === "succeeded") {
    console.log(entry.custom_id, entry.result.message.content[0].text);
  }
}

Cost Comparison

Task                             Standard API   Batch API   Savings
10,000 classifications (Haiku)   $16            $8          $8
5,000 summaries (Sonnet)         $150           $75         $75
1,000 analyses (Opus)            $900           $450        $450

4. Model Routing — Right Model for the Right Job

Not every request needs the most powerful model. A smart router directs tasks to the cheapest model that can handle them.

Tiered Architecture

┌─────────────────────────────┐
│      Incoming Request       │
└─────────┬───────────────────┘
          │
   ┌──────▼──────┐
   │ Classifier  │  (Haiku — near-zero cost)
   └──────┬──────┘
          │
   ┌──────┴──────────────┬───────────────────┐
   ▼                     ▼                   ▼
 Simple                Medium             Complex
 (Haiku)               (Sonnet)           (Opus)
 $0.80/M in            $3/M in            $15/M in
 FAQ, greetings,       Summaries,         Analysis,
 classifications       code gen           reasoning

Router Implementation

TypeScript
type Tier = "haiku" | "sonnet" | "opus";

interface RouteResult {
  tier: Tier;
  model: string;
}

async function routeRequest(userMessage: string): Promise<RouteResult> {
  // Use Haiku to classify — costs fractions of a cent
  const classification = await client.messages.create({
    model: "claude-3-5-haiku-20241022",
    max_tokens: 20,
    messages: [
      {
        role: "user",
        content: `Classify this request complexity as SIMPLE, MEDIUM, or COMPLEX. Only output one word.
Request: "${userMessage}"`,
      },
    ],
  });

  const level = classification.content[0].text.trim().toUpperCase();
  const routes: Record<string, RouteResult> = {
    SIMPLE: { tier: "haiku", model: "claude-3-5-haiku-20241022" },
    MEDIUM: { tier: "sonnet", model: "claude-sonnet-4-20250514" },
    COMPLEX: { tier: "opus", model: "claude-opus-4-20250514" },
  };
  return routes[level] ?? routes.MEDIUM;
}

// Usage
const route = await routeRequest("What is your return policy?");
const response = await client.messages.create({
  model: route.model,
  max_tokens: 512,
  messages: [{ role: "user", content: "What is your return policy?" }],
});

Routing Savings

Traffic Mix                            All-Opus Cost   Routed Cost   Savings
70% simple, 20% medium, 10% complex    $15,000         $2,460        84%
50% simple, 30% medium, 20% complex    $15,000         $4,520        70%
30% simple, 40% medium, 30% complex    $15,000         $6,780        55%
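To estimate the blended cost for your own traffic mix, a rough approach is to weight each tier's relative price by its share of requests. The sketch below assumes per-request cost scales with the model's price and ignores the small classifier call, so it approximates the table above rather than reproducing it exactly:

TypeScript
// Rough blended-cost estimator for a routed traffic mix (approximation:
// per-request cost assumed proportional to model price; classifier overhead ignored)
const relativeToOpus = { haiku: 0.8 / 15, sonnet: 3 / 15, opus: 1 };

function routedCost(
  allOpusMonthly: number,
  mix: { simple: number; medium: number; complex: number }
): number {
  return (
    allOpusMonthly *
    (mix.simple * relativeToOpus.haiku +
      mix.medium * relativeToOpus.sonnet +
      mix.complex * relativeToOpus.opus)
  );
}

console.log(routedCost(15_000, { simple: 0.7, medium: 0.2, complex: 0.1 }));
// ≈ $2,660, the same ballpark as the 84%-savings row above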

5. Token Reduction Techniques

5a. Shorter Prompts

Every word in your prompt costs tokens. Trim aggressively:

TypeScript
// BEFORE: 47 tokens
const verbose = `I would like you to please analyze the following customer feedback message and determine whether the overall sentiment expressed by the customer is positive, negative, or neutral. Here is the feedback:`;

// AFTER: 15 tokens
const concise = `Classify sentiment as positive/negative/neutral:`;

Savings at scale: 32 fewer tokens × 1M calls × $3/M = $96 saved on input alone.

5b. Limit Output with max_tokens

Always set max_tokens to the minimum you need:

TypeScript
// Classification: only need one word
{ max_tokens: 10 }

// Short summary: one paragraph
{ max_tokens: 150 }

// Full analysis: still cap it
{ max_tokens: 1024 }

5c. Structured Output Constraints

Ask for compact formats:

TypeScript
const response = await client.messages.create({
  model: "claude-3-5-haiku-20241022",
  max_tokens: 100,
  messages: [
    {
      role: "user",
      content: `Extract entities as JSON. No explanation.
Input: "John Smith ordered 3 laptops from Tokyo office"
Output format: {"people":[],"items":[],"locations":[]}`,
    },
  ],
});

// Output: {"people":["John Smith"],"items":["laptops"],"locations":["Tokyo"]}
// ~20 tokens instead of 100+ with explanation

6. Response Caching — Never Pay Twice for the Same Answer

Build an application-level cache so identical (or near-identical) questions reuse previous answers.

Hash-Based Cache

TypeScript
import crypto from "crypto";

interface CacheEntry {
  response: string;
  timestamp: number;
  model: string;
  tokens: { input: number; output: number };
}

class ResponseCache {
  private cache = new Map<string, CacheEntry>();
  private ttlMs: number;

  constructor(ttlMinutes = 60) {
    this.ttlMs = ttlMinutes * 60 * 1000;
  }

  private hash(model: string, systemPrompt: string, userMessage: string): string {
    return crypto
      .createHash("sha256")
      .update(`${model}|${systemPrompt}|${userMessage}`)
      .digest("hex");
  }

  get(model: string, systemPrompt: string, userMessage: string): CacheEntry | null {
    const key = this.hash(model, systemPrompt, userMessage);
    const entry = this.cache.get(key);
    if (!entry) return null;
    if (Date.now() - entry.timestamp > this.ttlMs) {
      this.cache.delete(key);
      return null;
    }
    return entry;
  }

  set(
    model: string,
    systemPrompt: string,
    userMessage: string,
    response: string,
    tokens: { input: number; output: number }
  ): void {
    const key = this.hash(model, systemPrompt, userMessage);
    this.cache.set(key, {
      response,
      timestamp: Date.now(),
      model,
      tokens,
    });
  }
}

Using the Cache

TypeScript
const cache = new ResponseCache(120); // 2-hour TTL

async function cachedQuery(
  systemPrompt: string,
  userMessage: string,
  model: string
): Promise<string> {
  // Check cache first
  const cached = cache.get(model, systemPrompt, userMessage);
  if (cached) {
    console.log(
      "Cache HIT — saved",
      estimateCost(cached.tokens.input, cached.tokens.output, "sonnet"),
      "dollars"
    );
    return cached.response;
  }

  // Cache miss — call API
  const response = await client.messages.create({
    model,
    max_tokens: 1024,
    system: systemPrompt,
    messages: [{ role: "user", content: userMessage }],
  });

  const text = response.content[0].text;
  cache.set(model, systemPrompt, userMessage, text, {
    input: response.usage.input_tokens,
    output: response.usage.output_tokens,
  });
  return text;
}

7. Building a Cost-Aware Wrapper

Combine all techniques into a single wrapper class:

TypeScript
interface CostReport {
  model: string;
  inputTokens: number;
  outputTokens: number;
  cost: number;
  cached: boolean;
  batchEligible: boolean;
}

class CostAwareClient {
  private client: Anthropic;
  private cache: ResponseCache;
  private totalSpend = 0;
  private callCount = 0;

  constructor() {
    this.client = new Anthropic();
    this.cache = new ResponseCache(120);
  }

  async query(options: {
    userMessage: string;
    systemPrompt?: string;
    forceModel?: string;
    maxTokens?: number;
    skipCache?: boolean;
  }): Promise<{ text: string; report: CostReport }> {
    const sys = options.systemPrompt ?? "";

    // 1. Check response cache
    if (!options.skipCache) {
      const cached = this.cache.get(
        options.forceModel ?? "auto",
        sys,
        options.userMessage
      );
      if (cached) {
        return {
          text: cached.response,
          report: {
            model: cached.model,
            inputTokens: 0,
            outputTokens: 0,
            cost: 0,
            cached: true,
            batchEligible: false,
          },
        };
      }
    }

    // 2. Route to cheapest suitable model
    // (forced models are billed at Sonnet rates here as a simplification)
    const route = options.forceModel
      ? { model: options.forceModel, tier: "sonnet" as Tier }
      : await routeRequest(options.userMessage);

    // 3. Call with prompt caching
    const response = await this.client.messages.create({
      model: route.model,
      max_tokens: options.maxTokens ?? 1024,
      system: sys
        ? [{ type: "text", text: sys, cache_control: { type: "ephemeral" } }]
        : undefined,
      messages: [{ role: "user", content: options.userMessage }],
    });

    const text = response.content[0].text;
    const inp = response.usage.input_tokens;
    const out = response.usage.output_tokens;
    const cost = estimateCost(inp, out, route.tier);

    // 4. Store in cache
    this.cache.set(route.model, sys, options.userMessage, text, {
      input: inp,
      output: out,
    });

    this.totalSpend += cost;
    this.callCount += 1;

    return {
      text,
      report: {
        model: route.model,
        inputTokens: inp,
        outputTokens: out,
        cost,
        cached: false,
        batchEligible: false,
      },
    };
  }

  getStats() {
    return {
      totalSpend: this.totalSpend,
      callCount: this.callCount,
      avgCostPerCall: this.totalSpend / (this.callCount || 1),
    };
  }
}

8. Real-World Cost Scenarios

Scenario A: Customer Support Chatbot

  • Volume: 50,000 conversations/month, avg 4 turns each
  • Without optimization: All Sonnet, no caching = ~$3,600/month
  • With optimization: Haiku routing (70%), prompt cache, response cache = ~$420/month — 88% reduction

Scenario B: Document Processing Pipeline

  • Volume: 10,000 documents/day, each ~5,000 tokens
  • Without optimization: All Sonnet, real-time = ~$4,500/month
  • With optimization: Batch API + Haiku for extraction + Sonnet for summary = ~$900/month — 80% reduction

Scenario C: Code Review Tool

  • Volume: 2,000 PRs/month, avg 3,000 tokens per diff
  • Without optimization: All Opus = ~$5,400/month
  • With optimization: Sonnet for most, Opus only for complex, prompt cache = ~$810/month — 85% reduction
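These figures follow from the same arithmetic as the estimator in Section 1. As a sanity check, here is a back-of-envelope version of Scenario A using assumed per-turn token counts (illustrative values chosen to match the stated total, not measurements from a real workload):

TypeScript
// Scenario A, unoptimized: 50,000 conversations × 4 turns, all on Sonnet.
// Assumes ~2,000 input and ~800 output tokens per turn (illustrative).
const turnsPerMonth = 50_000 * 4;
const monthlyCost = turnsPerMonth * estimateCost(2_000, 800, "sonnet");

console.log(monthlyCost); // ≈ $3,600/month, matching the "without optimization" figure above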

9. Cost Monitoring Dashboard

Track spending in real time:

TypeScript
class CostMonitor {
  private dailyCosts: Map<string, number> = new Map();
  private alertThreshold: number;

  constructor(dailyBudget: number) {
    this.alertThreshold = dailyBudget;
  }

  recordCall(model: string, inputTokens: number, outputTokens: number): void {
    const today = new Date().toISOString().split("T")[0];
    const cost = estimateCost(inputTokens, outputTokens, model as any);
    const current = this.dailyCosts.get(today) ?? 0;
    this.dailyCosts.set(today, current + cost);

    if (current + cost > this.alertThreshold) {
      console.warn(
        `ALERT: Daily spend $${(current + cost).toFixed(2)} exceeds budget $${this.alertThreshold}`
      );
    }
  }

  getReport(): { date: string; spend: number }[] {
    return Array.from(this.dailyCosts.entries()).map(([date, spend]) => ({
      date,
      spend: Math.round(spend * 100) / 100,
    }));
  }
}

10. Quick-Reference Checklist

  • Set max_tokens to the minimum needed for every call
  • Enable prompt caching for system prompts over 1,024 tokens
  • Route simple tasks to Haiku, medium to Sonnet, complex to Opus
  • Use Batch API for any workload that can tolerate 24-hour latency
  • Cache identical responses at the application level
  • Trim prompt wording — every token counts at scale
  • Monitor daily spend and set budget alerts
  • Request structured (JSON) output to reduce output tokens
  • Review the cost dashboard weekly and adjust routing thresholds

Target: Aim for under $0.01 average cost per interaction. Most production systems can achieve $0.001-$0.005 with proper optimization.