Advanced · 12 min read · Module 6, Lesson 6
# 💾 Prompt Caching — Save Money
Cache repeated prompts and save up to 90% on input token costs
Prompt caching lets you save up to 90% on input token costs when you repeat the same context across multiple requests.
## How It Works
Normally, every API request processes all input tokens from scratch. With caching:
- **First request:** you send content marked with `cache_control`; it gets cached (25% surcharge)
- **Subsequent requests:** cached content is retrieved instantly at a 90% discount
## When Caching Helps
- Same system instructions in every request (e.g., chatbot)
- Large document you ask multiple questions about
- Fixed context + varying questions
- Few-shot examples shared between requests
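The few-shot case works the same way as caching system instructions: put the fixed examples in the cached block and let only the input vary. A minimal sketch, assuming a hypothetical sentiment-classification task (the example strings are made up, and a real few-shot block would need to clear the minimum cacheable size discussed below):

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Hypothetical few-shot examples reused across every request.
const fewShotExamples = `
Review: "Arrived broken." -> Sentiment: negative
Review: "Works great, fast shipping!" -> Sentiment: positive
`;

const response = await client.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 16,
  system: [
    {
      type: "text",
      text: `Classify the sentiment of product reviews.\n${fewShotExamples}`,
      // The instructions + examples are fixed, so cache them;
      // only the review in the user message changes per request.
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [{ role: "user", content: 'Review: "Battery died after a week."' }],
});
```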
## Practical Example
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const systemPrompt = "You are a customer support assistant for TechStore. Rules: ..."; // Long instructions

async function chat(userMessage: string) {
  const response = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: systemPrompt,
        cache_control: { type: "ephemeral" }, // Cache this content
      },
    ],
    messages: [{ role: "user", content: userMessage }],
  });

  console.log("Cache created:", response.usage.cache_creation_input_tokens || 0);
  console.log("Cache read:", response.usage.cache_read_input_tokens || 0);

  return response.content[0].text;
}

// First request — creates cache (25% surcharge)
await chat("How much are the AirPods?");

// Subsequent requests — reads from cache (90% discount!)
await chat("What's the return policy?");
```

## Python Example
```python
import anthropic

client = anthropic.Anthropic()

system_prompt = "You are a customer support assistant..."

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "What's the return policy?"}],
)

print(f"Cache creation: {response.usage.cache_creation_input_tokens}")
print(f"Cache read: {response.usage.cache_read_input_tokens}")
```

## Calculating Savings
Without cache:
- System instructions: 2,000 tokens x $3.00/million = $0.006 per request
- 1,000 requests/day = $6.00/day
With cache:
- First request: 2,000 tokens x $3.75/million = $0.0075 (25% surcharge)
- Remaining requests: 2,000 tokens x $0.30/million = $0.0006 each
- 999 requests x $0.0006 = $0.60
- Total: $0.61/day (90% savings!)
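The same arithmetic as a quick sketch. The rates and token counts are the Sonnet 4 figures used above, and it assumes requests land inside the 5-minute window so the cache stays warm all day:

```typescript
// Rough cost model for cached vs. uncached system prompts.
// Rates are $ per million input tokens (Sonnet 4).
const REGULAR = 3.0;      // regular input
const CACHE_WRITE = 3.75; // first, cache-creating request (+25%)
const CACHE_READ = 0.3;   // subsequent cached reads (-90%)

function dailyCost(promptTokens: number, requestsPerDay: number, cached: boolean): number {
  const millions = promptTokens / 1_000_000;
  if (!cached) return millions * REGULAR * requestsPerDay;
  // One cache write, then cache reads for the rest of the day.
  return millions * CACHE_WRITE + millions * CACHE_READ * (requestsPerDay - 1);
}

console.log(dailyCost(2000, 1000, false).toFixed(2)); // "6.00"
console.log(dailyCost(2000, 1000, true).toFixed(2));  // "0.61"
```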
## Cache Pricing Summary

| Model | Regular | Cache Write | Cache Read |
|---|---|---|---|
| Opus 4 | $15.00/M | $18.75/M (+25%) | $1.50/M (-90%) |
| Sonnet 4 | $3.00/M | $3.75/M (+25%) | $0.30/M (-90%) |
| Haiku 3.5 | $0.80/M | $1.00/M (+25%) | $0.08/M (-90%) |

Note how low the break-even point is: the one-time 25% write surcharge is recovered by a single cache read (which saves 90%), so caching pays for itself as soon as the prompt is reused even once within the cache window.
## Caching a Document for Multiple Questions
```typescript
import fs from "node:fs";

const longDocument = fs.readFileSync("contract.txt", "utf-8");

const response = await client.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  messages: [
    {
      role: "user",
      content: [
        {
          type: "text",
          text: longDocument,
          cache_control: { type: "ephemeral" }, // Cache the document
        },
        {
          type: "text",
          text: "What are the termination clauses?",
        },
      ],
    },
  ],
});
```
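Because the cache lasts 5 minutes and refreshes on each use, you can keep asking questions about the same document and only the first request pays the write surcharge. A sketch reusing `client` and `longDocument` from above (the follow-up questions are made up):

```typescript
// Hypothetical follow-up questions against the same cached contract.
const questions = [
  "What are the termination clauses?",
  "What is the governing law?",
  "Are there any renewal terms?",
];

for (const question of questions) {
  const res = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    messages: [
      {
        role: "user",
        content: [
          // Identical document text + cache_control -> cache hit after the first call.
          { type: "text", text: longDocument, cache_control: { type: "ephemeral" } },
          { type: "text", text: question },
        ],
      },
    ],
  });
  console.log(question, "| cache read tokens:", res.usage.cache_read_input_tokens ?? 0);
}
```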
## Important Rules

- **Minimum size:** cached content must be at least 1,024 tokens (Sonnet/Opus) or 2,048 tokens (Haiku)
- **Exact match:** the text must match character-for-character — any change invalidates the cache
- **Duration:** 5 minutes, resets with each use
- **Order matters:** put fixed (cached) content first, variable content last
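One pitfall worth calling out: the exact-match rule means any dynamic value interpolated into the "fixed" block changes the text on every call, so nothing is ever reused. A hypothetical anti-pattern:

```typescript
// BAD: the timestamp changes on every call, so the cached prefix never
// matches and each request pays the 25% cache-write surcharge again.
const badSystem = `You are a support assistant. Current time: ${new Date().toISOString()}. Rules: ...`;

// GOOD: keep the cached block character-for-character identical, and pass
// anything variable (like the current time) in the uncached user message.
const goodSystem = "You are a support assistant. Rules: ...";
```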
## Summary
- Prompt caching saves up to 90% on input costs
- Use `cache_control: { type: "ephemeral" }` on fixed content
- Cache lasts 5 minutes and resets with each use
- Perfect for system prompts, large documents, and recurring examples
Next: we'll cover reusable prompt templates and patterns.