Advanced · 12 min read · Module 10, Lesson 6

🧪 Testing Production AI Apps

Integration tests, mocking the API, and CI pipelines


AI applications are inherently non-deterministic. The same prompt can produce different outputs on every call. This makes traditional assertion-based testing inadequate on its own. In this lesson you will learn a battle-tested strategy for testing Claude-powered applications — from unit tests with mocked SDKs to integration tests, snapshot regression, and fully automated CI pipelines.


1. The Core Challenge: Non-Determinism

Traditional software tests rely on exact equality:

TypeScript
expect(add(2, 3)).toBe(5); // always passes

AI outputs break this contract:

TypeScript
const result = await claude.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 100,
  messages: [{ role: "user", content: "Say hello" }],
});
// result.content[0].text could be "Hello!", "Hi there!", "Hey!", etc.

You cannot assert toBe("Hello!") because the response varies on every call. Instead you need strategies that verify behavior rather than exact output.

Testing Pyramid for AI Apps

          ╱ E2E ╲            ← Few, expensive, real API
         ╱──────────╲
        ╱ Integration ╲      ← Moderate, mocked or real
       ╱────────────────╲
      ╱  Unit (mocked)   ╲   ← Many, fast, no API calls
     ╱──────────────────────╲

2. Mocking the Anthropic SDK

The fastest and cheapest tests mock the SDK entirely. You never call the real API, so tests run in milliseconds and cost nothing.

Basic Mock Setup (Jest / Vitest)

TypeScript
// __mocks__/@anthropic-ai/sdk.ts
// One shared jest.fn so every Anthropic instance — including the one the
// code under test creates — exposes the same mock that tests configure.
// (Class fields live on instances, not the prototype, so a shared fn is
// the reliable way to reach the mock from a test.)
const create = jest.fn();

export class Anthropic {
  messages = { create };
}
export default Anthropic;

Using the Mock in a Test

TypeScript
import Anthropic from "@anthropic-ai/sdk";
import { summarize } from "./summarize"; // adjust to the module under test

jest.mock("@anthropic-ai/sdk"); // picks up __mocks__/@anthropic-ai/sdk.ts

describe("summarize", () => {
  it("returns the model summary", async () => {
    // All instances share one jest.fn (see the manual mock above), so a
    // throwaway instance gives the test access to it.
    const mockCreate = new (Anthropic as any)().messages.create as jest.Mock;
    mockCreate.mockResolvedValue({
      id: "msg_mock",
      type: "message",
      role: "assistant",
      content: [{ type: "text", text: "This article discusses AI testing." }],
      model: "claude-sonnet-4-20250514",
      stop_reason: "end_turn",
      usage: { input_tokens: 50, output_tokens: 10 },
    });

    const result = await summarize("Long article about AI testing...");

    expect(result).toBe("This article discusses AI testing.");
    expect(mockCreate).toHaveBeenCalledTimes(1);
  });
});

Mocking Streaming Responses

TypeScript
function createMockStream(chunks: string[]) {
  return {
    async *[Symbol.asyncIterator]() {
      for (const chunk of chunks) {
        yield {
          type: "content_block_delta",
          delta: { type: "text_delta", text: chunk },
        };
      }
    },
  };
}

// In the test:
mockCreate.mockResolvedValue(createMockStream(["Hello", " world", "!"]));
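To confirm the mock stream behaves like the real one, a consumer can iterate it and accumulate the text deltas. The factory is repeated here so the sketch runs standalone:

```typescript
// Same factory shape as above, inlined for a self-contained check.
function createMockStream(chunks: string[]) {
  return {
    async *[Symbol.asyncIterator]() {
      for (const chunk of chunks) {
        yield {
          type: "content_block_delta",
          delta: { type: "text_delta", text: chunk },
        };
      }
    },
  };
}

// Accumulate deltas the way real streaming code would.
let streamedText = "";
for await (const event of createMockStream(["Hello", " world", "!"])) {
  if (event.type === "content_block_delta") streamedText += event.delta.text;
}
console.log(streamedText); // "Hello world!"
```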

Mocking Tool Use Responses

TypeScript
mockCreate.mockResolvedValue({
  id: "msg_mock",
  type: "message",
  role: "assistant",
  content: [
    {
      type: "tool_use",
      id: "toolu_mock",
      name: "get_weather",
      input: { city: "London" },
    },
  ],
  stop_reason: "tool_use",
  usage: { input_tokens: 30, output_tokens: 20 },
});

3. Assertion Strategies for AI Output

Since you cannot match exact text, use these strategies instead:

Strategy 1: Structural Assertions

TypeScript
it("returns valid JSON", async () => {
  const result = await classifyEmail(email);
  const parsed = JSON.parse(result);

  expect(parsed).toHaveProperty("category");
  expect(parsed).toHaveProperty("confidence");
  expect(typeof parsed.confidence).toBe("number");
  expect(parsed.confidence).toBeGreaterThanOrEqual(0);
  expect(parsed.confidence).toBeLessThanOrEqual(1);
});

Strategy 2: Containment Checks

TypeScript
it("mentions the key topic", async () => {
  const summary = await summarize(article);
  expect(summary.toLowerCase()).toContain("machine learning");
});

Strategy 3: Regex Pattern Matching

TypeScript
it("returns a numbered list", async () => {
  const steps = await generateSteps("bake a cake");
  expect(steps).toMatch(/1\./);
  expect(steps).toMatch(/2\./);
  expect(steps).toMatch(/3\./);
});

Strategy 4: Schema Validation with Zod

TypeScript
import { z } from "zod";

const SentimentSchema = z.object({
  sentiment: z.enum(["positive", "negative", "neutral"]),
  score: z.number().min(-1).max(1),
  reasoning: z.string().min(10),
});

it("returns valid sentiment analysis", async () => {
  const result = await analyzeSentiment("I love this product!");
  const parsed = SentimentSchema.safeParse(JSON.parse(result));
  expect(parsed.success).toBe(true);
});

Strategy 5: LLM-as-Judge

Use a second, cheaper call to evaluate the first:

TypeScript
async function llmJudge(question: string, answer: string): Promise<boolean> {
  const response = await client.messages.create({
    model: "claude-3-5-haiku-20241022",
    max_tokens: 10,
    messages: [
      {
        role: "user",
        content: `Does this answer correctly address the question?
Question: ${question}
Answer: ${answer}
Reply only YES or NO.`,
      },
    ],
  });
  return response.content[0].text.trim().toUpperCase() === "YES";
}

4. Snapshot Testing for Regression Detection

Snapshot tests record a known-good output and flag when future outputs change significantly. This catches prompt regressions.

Setting Up Snapshot Tests

TypeScript
import { createHash } from "node:crypto";

interface Snapshot {
  promptHash: string;
  outputStructure: string[];
  keyPhrases: string[];
  outputLength: { min: number; max: number };
}

function captureSnapshot(prompt: string, output: string): Snapshot {
  return {
    promptHash: createHash("md5").update(prompt).digest("hex"),
    outputStructure: output.split("\n").filter((l) => l.startsWith("#")),
    keyPhrases: extractKeyPhrases(output), // your phrase-extraction helper
    outputLength: {
      min: Math.floor(output.length * 0.7),
      max: Math.ceil(output.length * 1.3),
    },
  };
}

function compareSnapshot(current: string, snapshot: Snapshot): boolean {
  const len = current.length;
  if (len < snapshot.outputLength.min || len > snapshot.outputLength.max) {
    return false;
  }
  const matchedPhrases = snapshot.keyPhrases.filter((p) =>
    current.toLowerCase().includes(p.toLowerCase())
  );
  return matchedPhrases.length >= snapshot.keyPhrases.length * 0.6;
}

Regression Test Flow

1. Record a baseline snapshot (manually approved)
2. On every PR, re-run the prompt
3. Compare the new output against the snapshot
4. Flag if structural drift exceeds the threshold
5. A human reviews and approves or rejects
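The flow can be sketched end to end. This standalone version inlines simplified capture and compare helpers (fixed key phrases instead of an extractKeyPhrases helper) so both the pass and the drift paths are easy to see:

```typescript
import { createHash } from "node:crypto";

interface Snapshot {
  promptHash: string;
  keyPhrases: string[];
  outputLength: { min: number; max: number };
}

// Step 1: record a manually approved baseline.
function recordBaseline(
  prompt: string,
  output: string,
  keyPhrases: string[]
): Snapshot {
  return {
    promptHash: createHash("md5").update(prompt).digest("hex"),
    keyPhrases,
    outputLength: {
      min: Math.floor(output.length * 0.7),
      max: Math.ceil(output.length * 1.3),
    },
  };
}

// Steps 3-4: compare a fresh output against the baseline and flag drift.
function checkForDrift(
  current: string,
  baseline: Snapshot
): { drifted: boolean; reason?: string } {
  if (
    current.length < baseline.outputLength.min ||
    current.length > baseline.outputLength.max
  ) {
    return { drifted: true, reason: "length outside tolerance" };
  }
  const matched = baseline.keyPhrases.filter((p) =>
    current.toLowerCase().includes(p.toLowerCase())
  );
  if (matched.length < baseline.keyPhrases.length * 0.6) {
    return { drifted: true, reason: "key phrases missing" };
  }
  return { drifted: false };
}

const baseline = recordBaseline(
  "Summarize the article",
  "The article covers machine learning and model testing in production.",
  ["machine learning", "testing"]
);

const stillGood = checkForDrift(
  "The article covers machine learning and testing at scale in production.",
  baseline
);
const regressed = checkForDrift("Totally unrelated output.", baseline);

console.log(stillGood.drifted); // false — similar length, key phrases present
console.log(regressed.drifted); // true — too short, phrases missing
```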

5. Integration Tests with the Real API

Integration tests call the actual Anthropic API. They are slower and cost money, so run them sparingly — typically nightly or on release branches.

Guarding Integration Tests

TypeScript
const describeIntegration =
  process.env.RUN_INTEGRATION_TESTS === "true" ? describe : describe.skip;

describeIntegration("Claude Integration", () => {
  const client = new Anthropic();

  it("classifies emails correctly", async () => {
    const response = await client.messages.create({
      model: "claude-3-5-haiku-20241022",
      max_tokens: 100,
      messages: [
        {
          role: "user",
          content:
            "Classify this email as spam or not spam: 'You won a prize! Click here!'",
        },
      ],
    });
    const text = response.content[0].text.toLowerCase();
    expect(text).toContain("spam");
  }, 30000);

  it("respects token limits", async () => {
    const response = await client.messages.create({
      model: "claude-3-5-haiku-20241022",
      max_tokens: 10,
      messages: [{ role: "user", content: "Write a long essay about history." }],
    });
    expect(response.usage.output_tokens).toBeLessThanOrEqual(15);
  }, 30000);
});

Rate-Limiting Your Tests

TypeScript
// Returns a helper that spaces successive API calls at least delayMs apart
function createThrottle(delayMs: number) {
  let lastCall = 0;
  return async function withRateLimit<T>(fn: () => Promise<T>): Promise<T> {
    const now = Date.now();
    const wait = Math.max(0, lastCall + delayMs - now);
    if (wait > 0) await new Promise((r) => setTimeout(r, wait));
    lastCall = Date.now();
    return fn();
  };
}

const throttle = createThrottle(500);
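To sanity-check the rate limiter itself, wrap two trivial async calls and confirm the second one waits out the window. The helper is inlined here (under the name createThrottle) so the sketch runs standalone:

```typescript
// Inlined copy of the rate-limit helper from above.
function createThrottle(delayMs: number) {
  let lastCall = 0;
  return async function withRateLimit<T>(fn: () => Promise<T>): Promise<T> {
    const now = Date.now();
    const wait = Math.max(0, lastCall + delayMs - now);
    if (wait > 0) await new Promise((r) => setTimeout(r, wait));
    lastCall = Date.now();
    return fn();
  };
}

const throttle = createThrottle(200);
const start = Date.now();

// In a real suite these would be client.messages.create(...) calls.
await throttle(async () => "first");
await throttle(async () => "second");

const elapsed = Date.now() - start;
console.log(elapsed >= 200); // true: the second call waited out the window
```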

6. Cost-Effective Testing Strategies

Use the Cheapest Model for Tests

TypeScript
const TEST_MODEL = process.env.TEST_MODEL || "claude-3-5-haiku-20241022";

Cache API Responses

TypeScript
import * as fs from "node:fs";
import * as path from "node:path";

const CACHE_DIR = path.join(__dirname, "__api_cache__");

async function cachedApiCall(key: string, fn: () => Promise<any>) {
  const cacheFile = path.join(CACHE_DIR, `${key}.json`);
  if (fs.existsSync(cacheFile)) {
    return JSON.parse(fs.readFileSync(cacheFile, "utf-8"));
  }
  const result = await fn();
  fs.mkdirSync(CACHE_DIR, { recursive: true });
  fs.writeFileSync(cacheFile, JSON.stringify(result, null, 2));
  return result;
}
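A quick standalone check of the caching behavior. The helper is inlined, pointed at a temp directory instead of __dirname, and the "API" is a counter-backed stub, so the cache hit is observable:

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Inlined cache helper, writing to a throwaway temp directory.
const CACHE_DIR = fs.mkdtempSync(path.join(os.tmpdir(), "api-cache-"));

async function cachedApiCall<T>(key: string, fn: () => Promise<T>): Promise<T> {
  const cacheFile = path.join(CACHE_DIR, `${key}.json`);
  if (fs.existsSync(cacheFile)) {
    return JSON.parse(fs.readFileSync(cacheFile, "utf-8"));
  }
  const result = await fn();
  fs.writeFileSync(cacheFile, JSON.stringify(result, null, 2));
  return result;
}

// Stand-in for a real API call: counts how often it actually runs.
let realCalls = 0;
const fakeApi = async () => {
  realCalls++;
  return { text: "mocked summary" };
};

await cachedApiCall("summarize-v1", fakeApi); // miss: runs fakeApi, writes file
const second = await cachedApiCall("summarize-v1", fakeApi); // hit: reads file

console.log(realCalls); // 1 — the second call never reached the "API"
console.log(second.text); // "mocked summary"
```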

Set a Test Budget

TypeScript
class TestBudget {
  private spent = 0; // running cost in cents

  constructor(private maxCents: number) {}

  track(usage: { input_tokens: number; output_tokens: number }) {
    // Haiku 3.5 pricing: $0.80 / MTok input, $4 / MTok output
    const cost =
      (usage.input_tokens / 1_000_000) * 0.8 +
      (usage.output_tokens / 1_000_000) * 4;
    this.spent += cost * 100;
    if (this.spent > this.maxCents) {
      throw new Error(
        `Test budget exceeded: $${(this.spent / 100).toFixed(4)} > $${(this.maxCents / 100).toFixed(4)}`
      );
    }
  }
}

const budget = new TestBudget(5); // 5 cents max
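Exercising the guard with made-up usage numbers shows the cap tripping (TestBudget inlined from above; the token counts are purely illustrative):

```typescript
// Inlined from above: tracks spend in cents at Haiku 3.5 rates
// ($0.80 / MTok input, $4 / MTok output) and throws past the cap.
class TestBudget {
  private spent = 0;
  constructor(private maxCents: number) {}
  track(usage: { input_tokens: number; output_tokens: number }) {
    const cost =
      (usage.input_tokens / 1_000_000) * 0.8 +
      (usage.output_tokens / 1_000_000) * 4;
    this.spent += cost * 100;
    if (this.spent > this.maxCents) {
      throw new Error("Test budget exceeded");
    }
  }
}

const budget = new TestBudget(5); // 5 cents max

// A small call: 0.008 + 0.004 dollars = 1.2 cents, stays under budget.
budget.track({ input_tokens: 10_000, output_tokens: 1_000 });

// A huge hypothetical call (~480 cents) blows past the cap and throws.
let tripped = false;
try {
  budget.track({ input_tokens: 1_000_000, output_tokens: 1_000_000 });
} catch {
  tripped = true;
}
console.log(tripped); // true
```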

7. CI Pipeline with GitHub Actions

Complete Workflow

YAML
# .github/workflows/ai-tests.yml
name: AI Test Suite

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  schedule:
    - cron: "0 3 * * *" # nightly at 3 AM UTC

env:
  NODE_ENV: test

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm
      - run: npm ci
      - run: npm run test:unit
        env:
          CI: true

  integration-tests:
    runs-on: ubuntu-latest
    if: github.event_name == 'schedule' || github.event_name == 'push'
    needs: unit-tests
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm
      - run: npm ci
      - run: npm run test:integration
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          RUN_INTEGRATION_TESTS: "true"
          TEST_MODEL: claude-3-5-haiku-20241022

  snapshot-regression:
    runs-on: ubuntu-latest
    if: github.event_name == 'schedule'
    needs: unit-tests
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm
      - run: npm ci
      - run: npm run test:snapshots
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          RUN_INTEGRATION_TESTS: "true"
      - uses: actions/upload-artifact@v4
        if: failure()
        with:
          name: snapshot-diffs
          path: ./test-results/

NPM Scripts

JSON
{
  "scripts": {
    "test": "vitest run",
    "test:unit": "vitest run --dir tests/unit",
    "test:integration": "vitest run --dir tests/integration --timeout 60000",
    "test:snapshots": "vitest run --dir tests/snapshots"
  }
}

8. Example Full Test Suite

TypeScript
// tests/unit/email-classifier.test.ts
import { classifyEmail } from "../../src/email-classifier"; // adjust to your path

const mockCreate = jest.fn();

// Stub the SDK with a class whose create() is a shared jest.fn. (Class fields
// live on instances, so calling mockImplementation on the class itself would
// not reach messages.create; the shared fn sidesteps that.)
jest.mock("@anthropic-ai/sdk", () => ({
  __esModule: true,
  default: class {
    messages = { create: mockCreate };
  },
}));

describe("Email Classifier", () => {
  beforeEach(() => mockCreate.mockReset());

  it("classifies spam correctly", async () => {
    mockCreate.mockResolvedValue({
      content: [{ type: "text", text: '{"category":"spam","confidence":0.95}' }],
      usage: { input_tokens: 40, output_tokens: 15 },
    });
    const result = await classifyEmail("You won a million dollars!");
    expect(result.category).toBe("spam");
    expect(result.confidence).toBeGreaterThan(0.8);
  });

  it("classifies legitimate email correctly", async () => {
    mockCreate.mockResolvedValue({
      content: [
        { type: "text", text: '{"category":"legitimate","confidence":0.88}' },
      ],
      usage: { input_tokens: 40, output_tokens: 15 },
    });
    const result = await classifyEmail("Meeting at 3pm tomorrow");
    expect(result.category).toBe("legitimate");
  });

  it("handles API errors gracefully", async () => {
    mockCreate.mockRejectedValue(new Error("rate_limit_exceeded"));
    await expect(classifyEmail("test")).rejects.toThrow("rate_limit_exceeded");
  });

  it("sends correct parameters to the API", async () => {
    mockCreate.mockResolvedValue({
      content: [{ type: "text", text: '{"category":"spam","confidence":0.5}' }],
      usage: { input_tokens: 40, output_tokens: 15 },
    });
    await classifyEmail("test email");
    expect(mockCreate).toHaveBeenCalledWith(
      expect.objectContaining({
        model: expect.any(String),
        max_tokens: expect.any(Number),
        messages: expect.arrayContaining([
          expect.objectContaining({ role: "user" }),
        ]),
      })
    );
  });
});

9. Testing Best Practices Checklist

| Practice | Reason |
| --- | --- |
| Mock by default, integrate on schedule | Speed + cost savings |
| Use cheapest model for integration | Haiku costs 95% less than Opus |
| Cache real API responses | Avoid redundant charges |
| Set test budgets | Prevent runaway costs |
| Test structure, not exact text | AI output varies each call |
| Use Zod for schema validation | Catches structural regressions |
| Run snapshots nightly | Catches prompt drift early |
| Gate integration tests with env vars | Prevent accidental API calls in CI |
| Test error handling paths | APIs fail; your app must not |
| Keep test prompts minimal | Fewer tokens = cheaper tests |

Summary

Testing AI apps requires a shift in mindset. You test behavior, structure, and constraints rather than exact output. Your test pyramid should be mostly mocked unit tests at the base, a layer of schema and structural assertions in the middle, and a thin top layer of real API integration tests run on a schedule. Pair this with a GitHub Actions pipeline, response caching, and test budgets to keep costs under control while maintaining high confidence in your production AI application.