🧪 Testing Production AI Apps
Integration tests, mocking the API, and CI pipelines
AI applications are inherently non-deterministic. The same prompt can produce different outputs on every call. This makes traditional assertion-based testing inadequate on its own. In this lesson you will learn a battle-tested strategy for testing Claude-powered applications — from unit tests with mocked SDKs to integration tests, snapshot regression, and fully automated CI pipelines.
1. The Core Challenge: Non-Determinism
Traditional software tests rely on exact equality:
expect(add(2, 3)).toBe(5); // always passes
AI outputs break this contract:
const result = await claude.messages.create({
model: "claude-sonnet-4-20250514",
max_tokens: 100,
messages: [{ role: "user", content: "Say hello" }],
});
// result.content[0].text could be "Hello!", "Hi there!", "Hey!", etc.
You cannot assert toBe("Hello!") because the response varies. Instead you need strategies that verify behavior rather than exact output.
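For example, instead of pinning one exact string, you can assert invariants that hold on every run. A minimal sketch, assuming result is the response object from the call above:
// Behavioral assertions: these pass for "Hello!", "Hi there!", "Hey!", ...
const block = result.content[0];
const greeting = block.type === "text" ? block.text : "";
expect(greeting.length).toBeGreaterThan(0);
expect(greeting).toMatch(/\b(hello|hi|hey)\b/i);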
Testing Pyramid for AI Apps
      ╱   E2E   ╲          ← Few, expensive, real API
     ╱───────────╲
    ╱ Integration ╲        ← Moderate, mocked or real
   ╱───────────────╲
  ╱  Unit (mocked)  ╲      ← Many, fast, no API calls
 ╱───────────────────╲
2. Mocking the Anthropic SDK
The fastest and cheapest tests mock the SDK entirely. You never call the real API, so tests run in milliseconds and cost nothing.
Basic Mock Setup (Jest / Vitest)
// __mocks__/@anthropic-ai/sdk.ts
export class Anthropic {
  messages!: { create: jest.Mock };
}
// Attach the mock to the prototype so every instance (and the test,
// via Anthropic.prototype) shares the same jest.fn()
Anthropic.prototype.messages = { create: jest.fn() };
export default Anthropic;
Using the Mock in a Test
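The test below exercises a summarize helper that the lesson never defines. Here is a minimal sketch of the shape it assumes (the module path, prompt wording, and first-text-block handling are illustrative):
// src/summarize.ts (hypothetical module under test)
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

export async function summarize(article: string): Promise<string> {
  const response = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 100,
    messages: [
      { role: "user", content: `Summarize in one sentence:\n\n${article}` },
    ],
  });
  const block = response.content[0];
  return block.type === "text" ? block.text : "";
}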
import Anthropic from "@anthropic-ai/sdk";
import { summarize } from "../src/summarize"; // hypothetical path

jest.mock("@anthropic-ai/sdk");
describe("summarize", () => {
it("returns the model summary", async () => {
const mockCreate = (Anthropic as any).prototype.messages.create;
mockCreate.mockResolvedValue({
id: "msg_mock",
type: "message",
role: "assistant",
content: [{ type: "text", text: "This article discusses AI testing." }],
model: "claude-sonnet-4-20250514",
stop_reason: "end_turn",
usage: { input_tokens: 50, output_tokens: 10 },
});
const result = await summarize("Long article about AI testing...");
expect(result).toBe("This article discusses AI testing.");
expect(mockCreate).toHaveBeenCalledTimes(1);
});
});
Mocking Streaming Responses
function createMockStream(chunks: string[]) {
return {
async *[Symbol.asyncIterator]() {
for (const chunk of chunks) {
yield {
type: "content_block_delta",
delta: { type: "text_delta", text: chunk },
};
}
},
};
}
// In a test: with stream: true, create() resolves to the async-iterable stream
mockCreate.mockResolvedValue(createMockStream(["Hello", " world", "!"]));
// Code under test can then `for await (const event of stream)` and will see
// three content_block_delta events: "Hello", " world", "!"
Mocking Tool Use Responses
mockCreate.mockResolvedValue({
id: "msg_mock",
type: "message",
role: "assistant",
content: [
{
type: "tool_use",
id: "toolu_mock",
name: "get_weather",
input: { city: "London" },
},
],
stop_reason: "tool_use",
usage: { input_tokens: 30, output_tokens: 20 },
});
3. Assertion Strategies for AI Output
Since you cannot match exact text, use these strategies instead:
Strategy 1: Structural Assertions
it("returns valid JSON", async () => {
const result = await classifyEmail(email);
const parsed = JSON.parse(result);
expect(parsed).toHaveProperty("category");
expect(parsed).toHaveProperty("confidence");
expect(typeof parsed.confidence).toBe("number");
expect(parsed.confidence).toBeGreaterThanOrEqual(0);
expect(parsed.confidence).toBeLessThanOrEqual(1);
});
Strategy 2: Containment Checks
it("mentions the key topic", async () => {
const summary = await summarize(article);
expect(summary.toLowerCase()).toContain("machine learning");
});
Strategy 3: Regex Pattern Matching
it("returns a numbered list", async () => {
const steps = await generateSteps("bake a cake");
expect(steps).toMatch(/1\./);
expect(steps).toMatch(/2\./);
expect(steps).toMatch(/3\./);
});
Strategy 4: Schema Validation with Zod
import { z } from "zod";

const SentimentSchema = z.object({
sentiment: z.enum(["positive", "negative", "neutral"]),
score: z.number().min(-1).max(1),
reasoning: z.string().min(10),
});
it("returns valid sentiment analysis", async () => {
const result = await analyzeSentiment("I love this product!");
const parsed = SentimentSchema.safeParse(JSON.parse(result));
expect(parsed.success).toBe(true);
});
Strategy 5: LLM-as-Judge
Use a second, cheaper call to evaluate the first:
async function llmJudge(question: string, answer: string): Promise<boolean> {
const response = await client.messages.create({
model: "claude-haiku-3-5-20241022",
max_tokens: 10,
messages: [
{
role: "user",
content: `Does this answer correctly address the question?
Question: ${question}
Answer: ${answer}
Reply only YES or NO.`,
},
],
});
  const block = response.content[0];
  return block.type === "text" && block.text.trim().toUpperCase() === "YES";
}
4. Snapshot Testing for Regression Detection
Snapshot tests record a known-good output and flag when future outputs change significantly. This catches prompt regressions.
Setting Up Snapshot Tests
interface Snapshot {
promptHash: string;
outputStructure: string[];
keyPhrases: string[];
outputLength: { min: number; max: number };
}
function captureSnapshot(prompt: string, output: string): Snapshot {
return {
promptHash: createHash("md5").update(prompt).digest("hex"),
outputStructure: output.split("\n").filter((l) => l.startsWith("#")),
    keyPhrases: extractKeyPhrases(output), // helper sketched after the flow below
outputLength: {
min: Math.floor(output.length * 0.7),
max: Math.ceil(output.length * 1.3),
},
};
}
function compareSnapshot(current: string, snapshot: Snapshot): boolean {
const len = current.length;
if (len < snapshot.outputLength.min || len > snapshot.outputLength.max) {
return false;
}
const matchedPhrases = snapshot.keyPhrases.filter((p) =>
current.toLowerCase().includes(p.toLowerCase())
);
return matchedPhrases.length >= snapshot.keyPhrases.length * 0.6;
}
Regression Test Flow
1. Record baseline snapshot (manually approved)
2. On every PR, re-run the prompt
3. Compare new output against snapshot
4. Flag if structural drift exceeds threshold
5. Human reviews and approves or rejects
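captureSnapshot above relies on an extractKeyPhrases helper that the lesson leaves undefined. A minimal sketch, assuming naive frequency-based keyword extraction (a production suite might substitute a proper keyword-extraction library):
// Hypothetical helper: pick the most frequent non-trivial words as key phrases
function extractKeyPhrases(text: string, limit = 5): string[] {
  const stopWords = new Set(["the", "and", "for", "that", "with", "this", "from"]);
  const counts = new Map<string, number>();
  for (const word of text.toLowerCase().match(/[a-z]{4,}/g) ?? []) {
    if (!stopWords.has(word)) counts.set(word, (counts.get(word) ?? 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([word]) => word);
}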
5. Integration Tests with the Real API
Integration tests call the actual Anthropic API. They are slower and cost money, so run them sparingly — typically nightly or on release branches.
Guarding Integration Tests
const describeIntegration =
process.env.RUN_INTEGRATION_TESTS === "true" ? describe : describe.skip;
describeIntegration("Claude Integration", () => {
const client = new Anthropic();
it("classifies emails correctly", async () => {
const response = await client.messages.create({
model: "claude-haiku-3-5-20241022",
max_tokens: 100,
messages: [
{
role: "user",
content: "Classify this email as spam or not spam: 'You won a prize! Click here!'",
},
],
});
    const first = response.content[0];
    const text = first.type === "text" ? first.text.toLowerCase() : "";
expect(text).toContain("spam");
}, 30000);
it("respects token limits", async () => {
const response = await client.messages.create({
model: "claude-haiku-3-5-20241022",
max_tokens: 10,
messages: [{ role: "user", content: "Write a long essay about history." }],
});
    expect(response.usage.output_tokens).toBeLessThanOrEqual(10); // API stops at max_tokens
}, 30000);
});
Rate-Limiting Your Tests
function createRateLimiter(delayMs: number) {
let lastCall = 0;
return async function withRateLimit<T>(fn: () => Promise<T>): Promise<T> {
const now = Date.now();
const wait = Math.max(0, lastCall + delayMs - now);
if (wait > 0) await new Promise((r) => setTimeout(r, wait));
lastCall = Date.now();
return fn();
};
}
const throttle = createRateLimiter(500);
// Usage: await throttle(() => client.messages.create({ ... }));
6. Cost-Effective Testing Strategies
Use the Cheapest Model for Tests
const TEST_MODEL =
  process.env.TEST_MODEL || "claude-3-5-haiku-20241022";
Cache API Responses
const CACHE_DIR = path.join(__dirname, "__api_cache__");
async function cachedApiCall(key: string, fn: () => Promise<any>) {
const cacheFile = path.join(CACHE_DIR, `${key}.json`);
if (fs.existsSync(cacheFile)) {
return JSON.parse(fs.readFileSync(cacheFile, "utf-8"));
}
const result = await fn();
fs.mkdirSync(CACHE_DIR, { recursive: true });
fs.writeFileSync(cacheFile, JSON.stringify(result, null, 2));
return result;
}
// Usage: key the cache on a stable hash of the request parameters, e.g.
// const res = await cachedApiCall(promptHash, () => client.messages.create(params));
Set a Test Budget
class TestBudget {
private spent = 0;
constructor(private maxCents: number) {}
  track(usage: { input_tokens: number; output_tokens: number }) {
    // Assumes Claude 3.5 Haiku pricing: $0.80 per MTok input, $4.00 per MTok output
    const cost =
      (usage.input_tokens / 1_000_000) * 0.8 +
      (usage.output_tokens / 1_000_000) * 4;
    this.spent += cost * 100; // convert dollars to cents
if (this.spent > this.maxCents) {
throw new Error(
`Test budget exceeded: $${(this.spent / 100).toFixed(4)} > $${(this.maxCents / 100).toFixed(4)}`
);
}
}
}
const budget = new TestBudget(5); // 5 cents max; call budget.track(response.usage) after each API call
7. CI Pipeline with GitHub Actions
Complete Workflow
# .github/workflows/ai-tests.yml
name: AI Test Suite
on:
push:
branches: [main]
pull_request:
branches: [main]
schedule:
- cron: "0 3 * * *" # nightly at 3 AM UTC
env:
NODE_ENV: test
jobs:
unit-tests:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: 20
cache: npm
- run: npm ci
- run: npm run test:unit
env:
CI: true
integration-tests:
runs-on: ubuntu-latest
if: github.event_name == 'schedule' || github.event_name == 'push'
needs: unit-tests
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: 20
cache: npm
- run: npm ci
- run: npm run test:integration
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
RUN_INTEGRATION_TESTS: "true"
          TEST_MODEL: claude-3-5-haiku-20241022
snapshot-regression:
runs-on: ubuntu-latest
if: github.event_name == 'schedule'
needs: unit-tests
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: 20
cache: npm
- run: npm ci
- run: npm run test:snapshots
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
RUN_INTEGRATION_TESTS: "true"
- uses: actions/upload-artifact@v4
if: failure()
with:
name: snapshot-diffs
          path: ./test-results/
NPM Scripts
{
"scripts": {
"test": "vitest run",
"test:unit": "vitest run --dir tests/unit",
"test:integration": "vitest run --dir tests/integration --timeout 60000",
"test:snapshots": "vitest run --dir tests/snapshots"
}
}
8. Example Full Test Suite
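The suite tests a classifyEmail helper that is not shown in the lesson. A minimal sketch of the shape the tests assume (module path and prompt are illustrative; note that it returns a parsed object and lets API errors propagate):
// src/email-classifier.ts (hypothetical module under test)
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

export interface Classification {
  category: "spam" | "legitimate";
  confidence: number;
}

export async function classifyEmail(email: string): Promise<Classification> {
  const response = await client.messages.create({
    model: process.env.TEST_MODEL || "claude-3-5-haiku-20241022",
    max_tokens: 100,
    messages: [
      {
        role: "user",
        content: `Classify this email as spam or legitimate. Reply with JSON {"category": "...", "confidence": 0-1}.\n\n${email}`,
      },
    ],
  });
  const block = response.content[0];
  if (block.type !== "text") throw new Error("unexpected response type");
  return JSON.parse(block.text) as Classification;
}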
// tests/unit/email-classifier.test.ts
import Anthropic from "@anthropic-ai/sdk";
import { classifyEmail } from "../../src/email-classifier"; // hypothetical path

jest.mock("@anthropic-ai/sdk");

// Reuse the shared jest.fn() from the manual mock in __mocks__/
const mockCreate = (Anthropic as any).prototype.messages.create as jest.Mock;
describe("Email Classifier", () => {
beforeEach(() => mockCreate.mockReset());
it("classifies spam correctly", async () => {
mockCreate.mockResolvedValue({
content: [{ type: "text", text: '{"category":"spam","confidence":0.95}' }],
usage: { input_tokens: 40, output_tokens: 15 },
});
const result = await classifyEmail("You won a million dollars!");
expect(result.category).toBe("spam");
expect(result.confidence).toBeGreaterThan(0.8);
});
it("classifies legitimate email correctly", async () => {
mockCreate.mockResolvedValue({
content: [
{ type: "text", text: '{"category":"legitimate","confidence":0.88}' },
],
usage: { input_tokens: 40, output_tokens: 15 },
});
const result = await classifyEmail("Meeting at 3pm tomorrow");
expect(result.category).toBe("legitimate");
});
it("handles API errors gracefully", async () => {
mockCreate.mockRejectedValue(new Error("rate_limit_exceeded"));
await expect(classifyEmail("test")).rejects.toThrow("rate_limit_exceeded");
});
it("sends correct parameters to the API", async () => {
mockCreate.mockResolvedValue({
content: [{ type: "text", text: '{"category":"spam","confidence":0.5}' }],
usage: { input_tokens: 40, output_tokens: 15 },
});
await classifyEmail("test email");
expect(mockCreate).toHaveBeenCalledWith(
expect.objectContaining({
model: expect.any(String),
max_tokens: expect.any(Number),
messages: expect.arrayContaining([
expect.objectContaining({ role: "user" }),
]),
})
);
});
});
9. Testing Best Practices Checklist
| Practice | Reason |
|---|---|
| Mock by default, integrate on schedule | Speed + cost savings |
| Use cheapest model for integration | Haiku costs 95% less than Opus |
| Cache real API responses | Avoid redundant charges |
| Set test budgets | Prevent runaway costs |
| Test structure, not exact text | AI output varies each call |
| Use Zod for schema validation | Catches structural regressions |
| Run snapshots nightly | Catches prompt drift early |
| Gate integration tests with env vars | Prevent accidental API calls in CI |
| Test error handling paths | APIs fail; your app must not |
| Keep test prompts minimal | Fewer tokens = cheaper tests |
Summary
Testing AI apps requires a shift in mindset. You test behavior, structure, and constraints rather than exact output. Your test pyramid should be mostly mocked unit tests at the base, a layer of schema and structural assertions in the middle, and a thin top layer of real API integration tests run on a schedule. Pair this with a GitHub Actions pipeline, response caching, and test budgets to keep costs under control while maintaining high confidence in your production AI application.