Intermediate · 12 min read · Module 10, Lesson 4

💰 Cost Optimization at Scale

Reduce AI costs by 50-90% with smart caching, batching, and model selection


When you move from prototype to production, API costs can explode. A single Claude call might cost fractions of a cent, but at 10 million calls per month the bill becomes significant. This lesson covers the key techniques for cutting costs without sacrificing quality.


1. Understanding the Cost Formula

Every API call has a deterministic cost:

cost = (input_tokens × input_price) + (output_tokens × output_price)

Current pricing (per million tokens):

Model              Input    Output
Claude Opus 4      $15      $75
Claude Sonnet 4    $3       $15
Claude Haiku 3.5   $0.80    $4

Key insight: Output tokens cost 5x more than input tokens. Controlling output length is the single highest-leverage cost reduction.

Quick Estimator

TypeScript
function estimateCost(
  inputTokens: number,
  outputTokens: number,
  model: "opus" | "sonnet" | "haiku"
): number {
  const pricing = {
    opus: { input: 15 / 1_000_000, output: 75 / 1_000_000 },
    sonnet: { input: 3 / 1_000_000, output: 15 / 1_000_000 },
    haiku: { input: 0.8 / 1_000_000, output: 4 / 1_000_000 },
  };
  const p = pricing[model];
  return inputTokens * p.input + outputTokens * p.output;
}

// Example: 2000 input + 500 output on Sonnet
console.log(estimateCost(2000, 500, "sonnet"));
// $0.0135 per call → $135 per 10k calls

2. Prompt Caching — Reuse What You Already Sent

Prompt caching lets you mark a stable prefix (system prompt, documents, few-shot examples) so the API stores it server-side. Subsequent requests that share the same prefix skip re-processing those tokens.

How It Works

  • Cache write: First request pays a 25% surcharge on the cached portion.
  • Cache hit: Subsequent requests pay only 10% of the normal input price.
  • TTL: 5 minutes by default, refreshed each time the cached prefix is hit; a 1-hour TTL is also available at a higher cache-write price.
  • Minimum cacheable prefix: 1,024 tokens (Sonnet and Opus models), 2,048 tokens (Haiku models).

Implementation

TypeScript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// The system prompt is the stable prefix — cache it
const response = await client.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: longSystemPrompt, // 3000+ tokens of instructions
      cache_control: { type: "ephemeral" }, // cached for 5 minutes, refreshed on each hit
    },
  ],
  messages: [{ role: "user", content: userQuery }],
});

// Check cache performance
console.log("Cache read tokens:", response.usage.cache_read_input_tokens);
console.log("Cache write tokens:", response.usage.cache_creation_input_tokens);

Savings Calculation

Scenario                                    Without Cache   With Cache (hit)   Savings
4,000 token system prompt × 1,000 calls     $12.00          $1.20              90%
10,000 token docs × 500 calls               $15.00          $1.50              90%
2,000 token few-shot × 2,000 calls          $12.00          $1.20              90%

Rule of thumb: If the same prefix appears in 3+ requests within 5 minutes, cache it.
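With the 25% write surcharge and 10% hit price above, the break-even point is easy to derive. Here is a minimal sketch of the arithmetic for the cached prefix alone, ignoring the uncached part of each request:

TypeScript
// Relative cost of the cached prefix vs. paying normal input price every time
const WRITE = 1.25; // first request pays the cache-write surcharge
const HIT = 0.1;    // later requests within the TTL pay 10%

function prefixCost(n: number): { cached: number; uncached: number } {
  return { cached: WRITE + (n - 1) * HIT, uncached: n };
}

console.log(prefixCost(1));  // { cached: 1.25, uncached: 1 }  (a single request loses money)
console.log(prefixCost(2));  // { cached: 1.35, uncached: 2 }  (already ahead on the second request)
console.log(prefixCost(10)); // { cached: 2.15, uncached: 10 } (~78% saved on the prefix)

Caching technically pays for itself from the second request onward, so the 3+ rule of thumb leaves comfortable margin.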


3. Batch API — 50% Off for Non-Urgent Work

The Message Batches API lets you submit up to 100,000 requests at once and receive results within 24 hours — at 50% of the standard price.

Best Use Cases

  • Nightly content generation
  • Bulk classification / tagging
  • Dataset enrichment
  • Evaluation runs
  • Report generation

Implementation

TypeScript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Step 1: Create the batch
const batch = await client.messages.batches.create({
  requests: products.map((product, i) => ({
    custom_id: `product-${i}`,
    params: {
      model: "claude-3-5-haiku-20241022",
      max_tokens: 200,
      messages: [
        {
          role: "user",
          content: `Write a one-line description for: ${product.name}`,
        },
      ],
    },
  })),
});
console.log("Batch ID:", batch.id);

// Step 2: Poll for completion
async function waitForBatch(batchId: string) {
  while (true) {
    const status = await client.messages.batches.retrieve(batchId);
    if (status.processing_status === "ended") return status;
    await new Promise((r) => setTimeout(r, 30_000)); // check every 30s
  }
}
await waitForBatch(batch.id);

// Step 3: Retrieve results
const results = await client.messages.batches.results(batch.id);
for await (const entry of results) {
  if (entry.result.type === "succeeded") {
    console.log(entry.custom_id, entry.result.message.content[0].text);
  }
}

Cost Comparison

Task                             Standard API   Batch API   Savings
10,000 classifications (Haiku)   $16            $8          $8
5,000 summaries (Sonnet)         $150           $75         $75
1,000 analyses (Opus)            $900           $450        $450

4. Model Routing — Right Model for the Right Job

Not every request needs the most powerful model. A smart router directs tasks to the cheapest model that can handle them.

Tiered Architecture

┌─────────────────────────────┐
│      Incoming Request       │
└─────────┬───────────────────┘
          │
   ┌──────▼──────┐
   │ Classifier  │  (Haiku — near-zero cost)
   └──────┬──────┘
          │
   ┌──────┴──────────────┬───────────────────┐
   ▼                     ▼                   ▼
 Simple                Medium             Complex
 (Haiku)               (Sonnet)           (Opus)
 $0.80/M in            $3/M in            $15/M in
 FAQ, greetings,       Summaries,         Analysis,
 classifications       code gen           reasoning

Router Implementation

TypeScript
type Tier = "haiku" | "sonnet" | "opus";

interface RouteResult {
  tier: Tier;
  model: string;
}

async function routeRequest(userMessage: string): Promise<RouteResult> {
  // Use Haiku to classify — costs fractions of a cent
  const classification = await client.messages.create({
    model: "claude-3-5-haiku-20241022",
    max_tokens: 20,
    messages: [
      {
        role: "user",
        content: `Classify this request complexity as SIMPLE, MEDIUM, or COMPLEX. Only output one word.
Request: "${userMessage}"`,
      },
    ],
  });

  const level = classification.content[0].text.trim().toUpperCase();
  const routes: Record<string, RouteResult> = {
    SIMPLE: { tier: "haiku", model: "claude-3-5-haiku-20241022" },
    MEDIUM: { tier: "sonnet", model: "claude-sonnet-4-20250514" },
    COMPLEX: { tier: "opus", model: "claude-opus-4-20250514" },
  };
  return routes[level] ?? routes.MEDIUM;
}

// Usage
const route = await routeRequest("What is your return policy?");
const response = await client.messages.create({
  model: route.model,
  max_tokens: 512,
  messages: [{ role: "user", content: "What is your return policy?" }],
});

Routing Savings

Traffic Mix                            All-Opus Cost   Routed Cost   Savings
70% simple, 20% medium, 10% complex    $15,000         $2,460        84%
50% simple, 30% medium, 20% complex    $15,000         $4,520        70%
30% simple, 40% medium, 30% complex    $15,000         $6,780        55%
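To estimate the blended cost for your own traffic mix, a rough approach is to weight each tier's relative price by its share of requests. The sketch below assumes per-request cost scales with the model's price and ignores the small classifier call, so it approximates the table above rather than reproducing it exactly:

TypeScript
// Rough blended-cost estimator for a routed traffic mix (approximation:
// per-request cost assumed proportional to model price; classifier overhead ignored)
const relativeToOpus = { haiku: 0.8 / 15, sonnet: 3 / 15, opus: 1 };

function routedCost(
  allOpusMonthly: number,
  mix: { simple: number; medium: number; complex: number }
): number {
  return (
    allOpusMonthly *
    (mix.simple * relativeToOpus.haiku +
      mix.medium * relativeToOpus.sonnet +
      mix.complex * relativeToOpus.opus)
  );
}

console.log(routedCost(15_000, { simple: 0.7, medium: 0.2, complex: 0.1 }));
// ≈ $2,660, the same ballpark as the 84%-savings row above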

5. Token Reduction Techniques

5a. Shorter Prompts

Every word in your prompt costs tokens. Trim aggressively:

TypeScript
// BEFORE: 47 tokens
const verbose = `I would like you to please analyze the following customer feedback message and determine whether the overall sentiment expressed by the customer is positive, negative, or neutral. Here is the feedback:`;

// AFTER: 15 tokens
const concise = `Classify sentiment as positive/negative/neutral:`;

Savings at scale: 32 fewer tokens × 1M calls × $3/M = $96 saved on input alone.

5b. Limit Output with max_tokens

Always set max_tokens to the minimum you need:

TypeScript
// Classification: only need one word
{ max_tokens: 10 }

// Short summary: one paragraph
{ max_tokens: 150 }

// Full analysis: still cap it
{ max_tokens: 1024 }

5c. Structured Output Constraints

Ask for compact formats:

TypeScript
const response = await client.messages.create({
  model: "claude-3-5-haiku-20241022",
  max_tokens: 100,
  messages: [
    {
      role: "user",
      content: `Extract entities as JSON. No explanation.
Input: "John Smith ordered 3 laptops from Tokyo office"
Output format: {"people":[],"items":[],"locations":[]}`,
    },
  ],
});

// Output: {"people":["John Smith"],"items":["laptops"],"locations":["Tokyo"]}
// ~20 tokens instead of 100+ with explanation

6. Response Caching — Never Pay Twice for the Same Answer

Build an application-level cache so identical (or near-identical) questions reuse previous answers.

Hash-Based Cache

TypeScript
import crypto from "crypto";

interface CacheEntry {
  response: string;
  timestamp: number;
  model: string;
  tokens: { input: number; output: number };
}

class ResponseCache {
  private cache = new Map<string, CacheEntry>();
  private ttlMs: number;

  constructor(ttlMinutes = 60) {
    this.ttlMs = ttlMinutes * 60 * 1000;
  }

  private hash(model: string, systemPrompt: string, userMessage: string): string {
    return crypto
      .createHash("sha256")
      .update(`${model}|${systemPrompt}|${userMessage}`)
      .digest("hex");
  }

  get(model: string, systemPrompt: string, userMessage: string): CacheEntry | null {
    const key = this.hash(model, systemPrompt, userMessage);
    const entry = this.cache.get(key);
    if (!entry) return null;
    if (Date.now() - entry.timestamp > this.ttlMs) {
      this.cache.delete(key);
      return null;
    }
    return entry;
  }

  set(
    model: string,
    systemPrompt: string,
    userMessage: string,
    response: string,
    tokens: { input: number; output: number }
  ): void {
    const key = this.hash(model, systemPrompt, userMessage);
    this.cache.set(key, {
      response,
      timestamp: Date.now(),
      model,
      tokens,
    });
  }
}

Using the Cache

TypeScript
const cache = new ResponseCache(120); // 2-hour TTL

async function cachedQuery(
  systemPrompt: string,
  userMessage: string,
  model: string
): Promise<string> {
  // Check cache first
  const cached = cache.get(model, systemPrompt, userMessage);
  if (cached) {
    console.log(
      "Cache HIT — saved",
      estimateCost(cached.tokens.input, cached.tokens.output, "sonnet"),
      "dollars"
    );
    return cached.response;
  }

  // Cache miss — call API
  const response = await client.messages.create({
    model,
    max_tokens: 1024,
    system: systemPrompt,
    messages: [{ role: "user", content: userMessage }],
  });

  const text = response.content[0].text;
  cache.set(model, systemPrompt, userMessage, text, {
    input: response.usage.input_tokens,
    output: response.usage.output_tokens,
  });
  return text;
}

7. Building a Cost-Aware Wrapper

Combine all techniques into a single wrapper class:

TypeScript
interface CostReport {
  model: string;
  inputTokens: number;
  outputTokens: number;
  cost: number;
  cached: boolean;
  batchEligible: boolean;
}

class CostAwareClient {
  private client: Anthropic;
  private cache: ResponseCache;
  private totalSpend = 0;
  private callCount = 0;

  constructor() {
    this.client = new Anthropic();
    this.cache = new ResponseCache(120);
  }

  async query(options: {
    userMessage: string;
    systemPrompt?: string;
    forceModel?: string;
    maxTokens?: number;
    skipCache?: boolean;
  }): Promise<{ text: string; report: CostReport }> {
    const sys = options.systemPrompt ?? "";

    // 1. Check response cache
    if (!options.skipCache) {
      const cached = this.cache.get(
        options.forceModel ?? "auto",
        sys,
        options.userMessage
      );
      if (cached) {
        return {
          text: cached.response,
          report: {
            model: cached.model,
            inputTokens: 0,
            outputTokens: 0,
            cost: 0,
            cached: true,
            batchEligible: false,
          },
        };
      }
    }

    // 2. Route to cheapest suitable model
    // (forced models are billed at Sonnet rates here as a simplification)
    const route = options.forceModel
      ? { model: options.forceModel, tier: "sonnet" as Tier }
      : await routeRequest(options.userMessage);

    // 3. Call with prompt caching
    const response = await this.client.messages.create({
      model: route.model,
      max_tokens: options.maxTokens ?? 1024,
      system: sys
        ? [{ type: "text", text: sys, cache_control: { type: "ephemeral" } }]
        : undefined,
      messages: [{ role: "user", content: options.userMessage }],
    });

    const text = response.content[0].text;
    const inp = response.usage.input_tokens;
    const out = response.usage.output_tokens;
    const cost = estimateCost(inp, out, route.tier);

    // 4. Store in cache
    this.cache.set(route.model, sys, options.userMessage, text, {
      input: inp,
      output: out,
    });

    this.totalSpend += cost;
    this.callCount += 1;

    return {
      text,
      report: {
        model: route.model,
        inputTokens: inp,
        outputTokens: out,
        cost,
        cached: false,
        batchEligible: false,
      },
    };
  }

  getStats() {
    return {
      totalSpend: this.totalSpend,
      callCount: this.callCount,
      avgCostPerCall: this.totalSpend / (this.callCount || 1),
    };
  }
}

8. Real-World Cost Scenarios

Scenario A: Customer Support Chatbot

  • Volume: 50,000 conversations/month, avg 4 turns each
  • Without optimization: All Sonnet, no caching = ~$3,600/month
  • With optimization: Haiku routing (70%), prompt cache, response cache = ~$420/month — 88% reduction

Scenario B: Document Processing Pipeline

  • Volume: 10,000 documents/day, each ~5,000 tokens
  • Without optimization: All Sonnet, real-time = ~$4,500/month
  • With optimization: Batch API + Haiku for extraction + Sonnet for summary = ~$900/month — 80% reduction

Scenario C: Code Review Tool

  • Volume: 2,000 PRs/month, avg 3,000 tokens per diff
  • Without optimization: All Opus = ~$5,400/month
  • With optimization: Sonnet for most, Opus only for complex, prompt cache = ~$810/month — 85% reduction
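These figures follow from the same arithmetic as the estimator in Section 1. As a sanity check, here is a back-of-envelope version of Scenario A using assumed per-turn token counts (illustrative values chosen to match the stated total, not measurements from a real workload):

TypeScript
// Scenario A, unoptimized: 50,000 conversations × 4 turns, all on Sonnet.
// Assumes ~2,000 input and ~800 output tokens per turn (illustrative).
const turnsPerMonth = 50_000 * 4;
const monthlyCost = turnsPerMonth * estimateCost(2_000, 800, "sonnet");

console.log(monthlyCost); // ≈ $3,600/month, matching the "without optimization" figure above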

9. Cost Monitoring Dashboard

Track spending in real time:

TypeScript
class CostMonitor {
  private dailyCosts: Map<string, number> = new Map();
  private alertThreshold: number;

  constructor(dailyBudget: number) {
    this.alertThreshold = dailyBudget;
  }

  recordCall(model: string, inputTokens: number, outputTokens: number): void {
    const today = new Date().toISOString().split("T")[0];
    const cost = estimateCost(inputTokens, outputTokens, model as any);
    const current = this.dailyCosts.get(today) ?? 0;
    this.dailyCosts.set(today, current + cost);

    if (current + cost > this.alertThreshold) {
      console.warn(
        `ALERT: Daily spend $${(current + cost).toFixed(2)} exceeds budget $${this.alertThreshold}`
      );
    }
  }

  getReport(): { date: string; spend: number }[] {
    return Array.from(this.dailyCosts.entries()).map(([date, spend]) => ({
      date,
      spend: Math.round(spend * 100) / 100,
    }));
  }
}

10. Quick-Reference Checklist

  • Set max_tokens to the minimum needed for every call
  • Enable prompt caching for system prompts over 1,024 tokens
  • Route simple tasks to Haiku, medium to Sonnet, complex to Opus
  • Use Batch API for any workload that can tolerate 24-hour latency
  • Cache identical responses at the application level
  • Trim prompt wording — every token counts at scale
  • Monitor daily spend and set budget alerts
  • Request structured (JSON) output to reduce output tokens
  • Review the cost dashboard weekly and adjust routing thresholds

Target: Aim for under $0.01 average cost per interaction. Most production systems can achieve $0.001-$0.005 with proper optimization.