Advanced · 12 min read · Module 6, Lesson 6
# 💾 Prompt Caching — Save Money
Cache repeated prompts and save up to 90% on input token costs
Prompt caching lets you save up to 90% on input token costs when you repeat the same context across multiple requests.
## How It Works
Normally, every API request processes all input tokens from scratch. With caching:
- **First request:** you send content marked with `cache_control`; it gets cached (25% surcharge)
- **Subsequent requests:** cached content is retrieved instantly at a 90% discount
## When Caching Helps
- Same system instructions in every request (e.g., chatbot)
- Large document you ask multiple questions about
- Fixed context + varying questions
- Few-shot examples shared between requests
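The few-shot case works the same way as caching system instructions: put the fixed examples in the cached block and let only the input vary. A minimal sketch, assuming a hypothetical sentiment-classification task (the example strings are made up, and a real few-shot block would need to clear the minimum cacheable size discussed below):

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Hypothetical few-shot examples reused across every request.
const fewShotExamples = `
Review: "Arrived broken." -> Sentiment: negative
Review: "Works great, fast shipping!" -> Sentiment: positive
`;

const response = await client.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 16,
  system: [
    {
      type: "text",
      text: `Classify the sentiment of product reviews.\n${fewShotExamples}`,
      // The instructions + examples are fixed, so cache them;
      // only the review in the user message changes per request.
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [{ role: "user", content: 'Review: "Battery died after a week."' }],
});
```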
## Practical Example
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const systemPrompt = "You are a customer support assistant for TechStore. Rules: ..."; // Long instructions

async function chat(userMessage: string) {
  const response = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: systemPrompt,
        cache_control: { type: "ephemeral" }, // Cache this content
      },
    ],
    messages: [{ role: "user", content: userMessage }],
  });

  console.log("Cache created:", response.usage.cache_creation_input_tokens || 0);
  console.log("Cache read:", response.usage.cache_read_input_tokens || 0);

  return response.content[0].text;
}

// First request — creates cache (25% surcharge)
await chat("How much are the AirPods?");

// Subsequent requests — reads from cache (90% discount!)
await chat("What's the return policy?");
```

## Python Example
```python
import anthropic

client = anthropic.Anthropic()

system_prompt = "You are a customer support assistant..."

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "What's the return policy?"}],
)

print(f"Cache creation: {response.usage.cache_creation_input_tokens}")
print(f"Cache read: {response.usage.cache_read_input_tokens}")
```

## Calculating Savings
Without cache:
- System instructions: 2,000 tokens x $3.00/million = $0.006 per request
- 1,000 requests/day = $6.00/day
With cache:
- First request: 2,000 tokens x $3.75/million = $0.0075 (25% surcharge)
- Remaining requests: 2,000 tokens x $0.30/million = $0.0006 each
- 999 requests x $0.0006 = $0.60
- Total: $0.61/day (90% savings!)
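The same arithmetic as a quick sketch. The rates and token counts are the Sonnet 4 figures used above, and it assumes requests land inside the 5-minute window so the cache stays warm all day:

```typescript
// Rough cost model for cached vs. uncached system prompts.
// Rates are $ per million input tokens (Sonnet 4).
const REGULAR = 3.0;      // regular input
const CACHE_WRITE = 3.75; // first, cache-creating request (+25%)
const CACHE_READ = 0.3;   // subsequent cached reads (-90%)

function dailyCost(promptTokens: number, requestsPerDay: number, cached: boolean): number {
  const millions = promptTokens / 1_000_000;
  if (!cached) return millions * REGULAR * requestsPerDay;
  // One cache write, then cache reads for the rest of the day.
  return millions * CACHE_WRITE + millions * CACHE_READ * (requestsPerDay - 1);
}

console.log(dailyCost(2000, 1000, false).toFixed(2)); // "6.00"
console.log(dailyCost(2000, 1000, true).toFixed(2));  // "0.61"
```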
## Cache Pricing Summary

| Model | Regular | Cache Write | Cache Read |
|---|---|---|---|
| Opus 4 | $15.00/M | $18.75/M (+25%) | $1.50/M (-90%) |
| Sonnet 4 | $3.00/M | $3.75/M (+25%) | $0.30/M (-90%) |
| Haiku 3.5 | $0.80/M | $1.00/M (+25%) | $0.08/M (-90%) |

Note how low the break-even point is: the one-time 25% write surcharge is recovered by a single cache read (which saves 90%), so caching pays for itself as soon as the prompt is reused even once within the cache window.
## Caching a Document for Multiple Questions
```typescript
import fs from "node:fs";

const longDocument = fs.readFileSync("contract.txt", "utf-8");

const response = await client.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: longDocument,
          cache_control: { type: "ephemeral" }, // Cache the document
        },
        {
          type: "text",
          text: "What are the termination clauses?",
        },
      ],
    },
  ],
});
```
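Because the cache lasts 5 minutes and refreshes on each use, you can keep asking questions about the same document and only the first request pays the write surcharge. A sketch reusing `client` and `longDocument` from above (the follow-up questions are made up):

```typescript
// Hypothetical follow-up questions against the same cached contract.
const questions = [
  "What are the termination clauses?",
  "What is the governing law?",
  "Are there any renewal terms?",
];

for (const question of questions) {
  const res = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    messages: [
      {
        role: "user",
        content: [
          // Identical document text + cache_control -> cache hit after the first call.
          { type: "text", text: longDocument, cache_control: { type: "ephemeral" } },
          { type: "text", text: question },
        ],
      },
    ],
  });
  console.log(question, "| cache read tokens:", res.usage.cache_read_input_tokens ?? 0);
}
```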
## Important Rules

- **Minimum size:** cached content must be at least 1,024 tokens (Sonnet/Opus) or 2,048 tokens (Haiku)
- **Exact match:** the text must match character-for-character — any change invalidates the cache
- **Duration:** 5 minutes, resets with each use
- **Order matters:** put fixed (cached) content first, variable content last
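One pitfall worth calling out: the exact-match rule means any dynamic value interpolated into the "fixed" block changes the text on every call, so nothing is ever reused. A hypothetical anti-pattern:

```typescript
// BAD: the timestamp changes on every call, so the cached prefix never
// matches and each request pays the 25% cache-write surcharge again.
const badSystem = `You are a support assistant. Current time: ${new Date().toISOString()}. Rules: ...`;

// GOOD: keep the cached block character-for-character identical, and pass
// anything variable (like the current time) in the uncached user message.
const goodSystem = "You are a support assistant. Rules: ...";
```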
## Summary
- Prompt caching saves up to 90% on input costs
- Use `cache_control: { type: "ephemeral" }` on fixed content
- Cache lasts 5 minutes and resets with each use
- Perfect for system prompts, large documents, and recurring examples
Next: we'll cover reusable prompt templates and patterns.