Intermediate · 12 min read · Module 10, Lesson 3

📡 Monitoring & Observability

Track costs, latency, errors, and usage across your Claude integration


Shipping an AI feature is only half the work. The other half is knowing what is actually happening once it runs in production. Without monitoring, you are flying blind — you cannot tell if responses are slow, if costs are spiking, if error rates are climbing, or if users are even getting value from the integration.

This lesson covers everything you need to build a production-grade observability layer around your Claude API integration.


Why Monitoring Matters for LLM Applications

Traditional software monitoring focuses on uptime and error rates. LLM applications add entirely new dimensions:

| Dimension | Why It Matters |
| --- | --- |
| Latency | LLM calls can take 2-30+ seconds. Users notice. |
| Token usage | Directly drives cost. Uncontrolled usage can bankrupt a project. |
| Cost per request | Different models and prompt sizes have wildly different costs. |
| Error rates | Rate limits, overloaded errors, malformed responses. |
| Response quality | The model can return valid JSON but terrible content. |
| User satisfaction | Are users accepting, editing, or rejecting AI outputs? |

Without visibility into these dimensions, you will only discover problems when users complain — or when you get an unexpected bill.


What to Monitor

Here is a comprehensive checklist of metrics every Claude integration should track:

1. Latency Metrics

  • Time to first token (TTFT): How long before the first byte of the response arrives. Critical for streaming UIs.
  • Total response time: End-to-end duration of the API call.
  • P50 / P95 / P99 latencies: Median tells you the norm; P95 and P99 reveal tail latency problems.
  • Latency by model: Compare performance across Claude Sonnet, Haiku, and Opus.
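
For streaming calls, TTFT can be measured by timestamping the first content delta. The sketch below is illustrative: it assumes an async-iterable event stream shaped like the Messages streaming API (event objects with a `type` field such as `"content_block_delta"`); adapt the event names to whatever your SDK actually emits.

```typescript
interface StreamEvent {
  type: string; // e.g. "message_start", "content_block_delta", "message_stop"
}

interface StreamTimings {
  ttftMs: number | null; // time until the first content delta arrives
  totalMs: number;       // time until the stream ends
}

// Consume a streaming response, recording time-to-first-token and total time.
async function timeStream(
  events: AsyncIterable<StreamEvent>
): Promise<StreamTimings> {
  const start = performance.now();
  let ttftMs: number | null = null;
  for await (const event of events) {
    if (ttftMs === null && event.type === "content_block_delta") {
      ttftMs = Math.round(performance.now() - start);
    }
  }
  return { ttftMs, totalMs: Math.round(performance.now() - start) };
}
```

Feed both numbers into your latency percentile tracking: TTFT drives perceived responsiveness in a streaming UI, while total time drives throughput and timeout budgets.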

2. Token Metrics

  • Input tokens per request: Are your prompts growing out of control?
  • Output tokens per request: Are responses unreasonably long?
  • Total tokens per user session: Track cumulative usage across a conversation.
  • Cache hit rate: If you use prompt caching, measure how often it activates.
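
As a rough sketch of the cache hit rate, the field names below mirror the camelCased usage counts used elsewhere in this lesson (the API reports them as `input_tokens`, `cache_read_input_tokens`, and `cache_creation_input_tokens`); the rate is the share of prompt tokens served from the cache:

```typescript
interface PromptUsage {
  inputTokens: number;         // uncached input tokens
  cacheReadTokens: number;     // prompt tokens served from the cache
  cacheCreationTokens: number; // prompt tokens written to the cache
}

// Fraction of all prompt tokens that were served from the cache.
function cacheHitRate(usage: PromptUsage): number {
  const totalPrompt =
    usage.inputTokens + usage.cacheReadTokens + usage.cacheCreationTokens;
  if (totalPrompt === 0) return 0;
  return usage.cacheReadTokens / totalPrompt;
}
```

A persistently low hit rate usually means the cached prefix is changing between requests (for example, a timestamp in the system prompt), which silently forfeits the caching discount.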

3. Cost Metrics

  • Cost per request: Calculated from input/output token counts and model pricing.
  • Cost per user: How much does each user cost you?
  • Cost per feature: Which features are the most expensive?
  • Daily / weekly / monthly spend: Trend tracking to catch spikes early.

4. Error Metrics

  • Error rate by type: 429 (rate limit), 500 (server error), 529 (overloaded), timeout.
  • Retry count: How many retries before success?
  • Failure rate: Requests that fail even after retries.
  • Error rate by model: Some models may have higher error rates during peak hours.

5. Usage Metrics

  • Requests per user: Identify power users and potential abuse.
  • Requests per feature: Know which integrations get the most traffic.
  • Peak usage hours: Plan capacity and rate limit budgets.
  • Unique users per day: Track adoption over time.
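
Peak-hour tracking from the list above can be sketched as a simple hourly histogram over request timestamps. This is an illustrative helper (the `peakUsageHour` name is ours), assuming ISO 8601 timestamps like those in the log entries later in this lesson:

```typescript
// Count requests per hour-of-day bucket ("00".."23") from ISO timestamps
// and report the busiest hour.
function peakUsageHour(timestamps: string[]): { hour: string; count: number } {
  const buckets = new Map<string, number>();
  for (const ts of timestamps) {
    const hour = ts.slice(11, 13); // "HH" from "YYYY-MM-DDTHH:MM:SSZ"
    buckets.set(hour, (buckets.get(hour) ?? 0) + 1);
  }
  let best = { hour: "00", count: 0 };
  for (const [hour, count] of buckets) {
    if (count > best.count) best = { hour, count };
  }
  return best;
}
```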

Structured Logging

The foundation of observability is structured logging. Never log plain strings — always log structured JSON so you can query, filter, and aggregate later.

Basic Logging Setup

```typescript
interface ClaudeRequestLog {
  requestId: string;
  timestamp: string;
  model: string;
  userId: string;
  feature: string;
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheCreationTokens: number;
  latencyMs: number;
  statusCode: number;
  error: string | null;
  costUsd: number;
}

function logClaudeRequest(entry: ClaudeRequestLog): void {
  console.log(
    JSON.stringify({
      level: "info",
      service: "claude-integration",
      event: "claude_api_call",
      ...entry,
    })
  );
}
```

What to Include in Every Log Entry

Every single Claude API call should produce a log entry with these fields:

| Field | Purpose |
| --- | --- |
| requestId | Unique ID to correlate logs, traces, and user reports |
| timestamp | ISO 8601 timestamp for time-series analysis |
| model | Which model was called |
| userId | Who triggered the call |
| feature | Which product feature initiated the call |
| inputTokens | Token count from the request |
| outputTokens | Token count from the response |
| latencyMs | Wall-clock time for the full request |
| statusCode | HTTP status code returned |
| error | Error message if the call failed |
| costUsd | Calculated cost in USD |
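
Structured entries pay off when you aggregate them. As an illustrative sketch (assuming an in-memory array of entries with the fields above; `rollupByModel` is our own helper name), here is a per-model latency and cost rollup:

```typescript
interface LogEntry {
  model: string;
  latencyMs: number;
  costUsd: number;
}

// Average latency, total cost, and request count per model.
function rollupByModel(
  entries: LogEntry[]
): Map<string, { avgLatencyMs: number; totalCostUsd: number; count: number }> {
  const acc = new Map<string, { latencySum: number; cost: number; count: number }>();
  for (const e of entries) {
    const a = acc.get(e.model) ?? { latencySum: 0, cost: 0, count: 0 };
    a.latencySum += e.latencyMs;
    a.cost += e.costUsd;
    a.count++;
    acc.set(e.model, a);
  }
  const out = new Map<string, { avgLatencyMs: number; totalCostUsd: number; count: number }>();
  for (const [model, a] of acc) {
    out.set(model, {
      avgLatencyMs: a.latencySum / a.count,
      totalCostUsd: a.cost,
      count: a.count,
    });
  }
  return out;
}
```

In production you would run the same aggregation in your log pipeline or analytics store rather than in memory, but the shape of the query is the same.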

Cost Tracking

Cost tracking is non-negotiable for production LLM applications. Here is how to calculate and track costs accurately.

Pricing Reference

```typescript
const MODEL_PRICING: Record<string, { inputPer1M: number; outputPer1M: number }> = {
  "claude-sonnet-4-20250514": { inputPer1M: 3.0, outputPer1M: 15.0 },
  "claude-3-5-haiku-20241022": { inputPer1M: 0.8, outputPer1M: 4.0 },
  "claude-opus-4-20250514": { inputPer1M: 15.0, outputPer1M: 75.0 },
};

function calculateCost(
  model: string,
  inputTokens: number,
  outputTokens: number
): number {
  const pricing = MODEL_PRICING[model];
  if (!pricing) return 0;
  const inputCost = (inputTokens / 1_000_000) * pricing.inputPer1M;
  const outputCost = (outputTokens / 1_000_000) * pricing.outputPer1M;
  // Round to six decimal places to avoid floating-point noise in aggregates.
  return Math.round((inputCost + outputCost) * 1_000_000) / 1_000_000;
}
```
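
As a quick sanity check of the formula: a Claude Sonnet 4 call with 2,000 input tokens and 500 output tokens, at $3 / $15 per million tokens, works out as follows:

```typescript
// 2,000 input tokens at $3.00 per million tokens
const inputCost = (2_000 / 1_000_000) * 3.0;  // $0.006
// 500 output tokens at $15.00 per million tokens
const outputCost = (500 / 1_000_000) * 15.0;  // $0.0075
const total = inputCost + outputCost;         // $0.0135
console.log(total.toFixed(4)); // "0.0135"
```

Fractions of a cent per request look harmless, but at 100,000 requests a day this example is $1,350/day, which is why per-day aggregation matters.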

Cost Aggregation

Track costs at multiple levels of granularity:

```typescript
interface CostTracker {
  byUser: Map<string, number>;
  byFeature: Map<string, number>;
  byModel: Map<string, number>;
  byHour: Map<string, number>;
  total: number;
}

function updateCostTracker(
  tracker: CostTracker,
  userId: string,
  feature: string,
  model: string,
  cost: number
): void {
  tracker.byUser.set(userId, (tracker.byUser.get(userId) ?? 0) + cost);
  tracker.byFeature.set(feature, (tracker.byFeature.get(feature) ?? 0) + cost);
  tracker.byModel.set(model, (tracker.byModel.get(model) ?? 0) + cost);
  // Hourly bucket key, e.g. "2025-06-01T14"
  const hourKey = new Date().toISOString().slice(0, 13);
  tracker.byHour.set(hourKey, (tracker.byHour.get(hourKey) ?? 0) + cost);
  tracker.total += cost;
}
```

Budget Alerts

Set up thresholds to catch cost spikes before they become a problem:

```typescript
interface BudgetConfig {
  dailyLimitUsd: number;
  perUserLimitUsd: number;
  perRequestWarnUsd: number;
  alertCallback: (message: string) => void;
}

function checkBudget(
  config: BudgetConfig,
  tracker: CostTracker,
  userId: string,
  requestCost: number
): boolean {
  if (requestCost > config.perRequestWarnUsd) {
    config.alertCallback(
      `High-cost request: $${requestCost.toFixed(4)} for user ${userId}`
    );
  }
  const userTotal = tracker.byUser.get(userId) ?? 0;
  if (userTotal > config.perUserLimitUsd) {
    config.alertCallback(
      `User ${userId} exceeded daily budget: $${userTotal.toFixed(2)}`
    );
    return false;
  }
  // Sum today's hourly buckets (keys starting with "YYYY-MM-DD").
  const todayKey = new Date().toISOString().slice(0, 10);
  let dailyTotal = 0;
  for (const [key, value] of tracker.byHour) {
    if (key.startsWith(todayKey)) dailyTotal += value;
  }
  if (dailyTotal > config.dailyLimitUsd) {
    config.alertCallback(`Daily budget exceeded: $${dailyTotal.toFixed(2)}`);
    return false;
  }
  return true;
}
```

Latency Monitoring

Latency is one of the most impactful metrics for user experience. Here is how to measure it properly.

Measuring Latency

```typescript
interface LatencyMetrics {
  ttftMs: number | null;
  totalMs: number;
  model: string;
  inputTokens: number;
  outputTokens: number;
}

async function measureLatency(
  apiCall: () => Promise<any>,
  model: string
): Promise<{ result: any; metrics: LatencyMetrics }> {
  const start = performance.now();
  const result = await apiCall();
  const totalMs = performance.now() - start;
  return {
    result,
    metrics: {
      // TTFT can only be measured on streaming responses;
      // it stays null for non-streaming calls like this one.
      ttftMs: null,
      totalMs: Math.round(totalMs),
      model,
      inputTokens: result.usage?.input_tokens ?? 0,
      outputTokens: result.usage?.output_tokens ?? 0,
    },
  };
}
```

Latency Percentile Tracker

```typescript
class PercentileTracker {
  private values: number[] = [];
  private readonly maxSize: number;

  constructor(maxSize = 10000) {
    this.maxSize = maxSize;
  }

  record(value: number): void {
    this.values.push(value);
    // Keep a sliding window of the most recent maxSize samples.
    if (this.values.length > this.maxSize) {
      this.values.shift();
    }
  }

  percentile(p: number): number {
    if (this.values.length === 0) return 0;
    const sorted = [...this.values].sort((a, b) => a - b);
    const index = Math.ceil((p / 100) * sorted.length) - 1;
    return sorted[Math.max(0, index)];
  }

  summary(): { p50: number; p95: number; p99: number; count: number } {
    return {
      p50: this.percentile(50),
      p95: this.percentile(95),
      p99: this.percentile(99),
      count: this.values.length,
    };
  }
}
```

Error Dashboards

Errors in LLM applications are not just binary pass/fail. You need to categorize, track trends, and set up alerting for different error types.

Error Categorization

```typescript
type ErrorCategory =
  | "rate_limit"
  | "overloaded"
  | "server_error"
  | "timeout"
  | "invalid_request"
  | "auth_error"
  | "context_too_long"
  | "content_filter"
  | "unknown";

function categorizeError(statusCode: number, errorMessage: string): ErrorCategory {
  if (statusCode === 429) return "rate_limit";
  if (statusCode === 529) return "overloaded";
  if (statusCode >= 500) return "server_error";
  if (statusCode === 408 || errorMessage.includes("timeout")) return "timeout";
  if (statusCode === 401 || statusCode === 403) return "auth_error";
  if (errorMessage.includes("too many tokens")) return "context_too_long";
  if (errorMessage.includes("content filtering")) return "content_filter";
  if (statusCode === 400) return "invalid_request";
  return "unknown";
}
```

Error Rate Tracker

```typescript
class ErrorRateTracker {
  private windows: Map<string, { total: number; errors: number }> = new Map();
  private errorsByCategory: Map<ErrorCategory, number> = new Map();

  record(success: boolean, category?: ErrorCategory): void {
    // Per-minute window key, e.g. "2025-06-01T14:32"
    const windowKey = new Date().toISOString().slice(0, 16);
    const window = this.windows.get(windowKey) ?? { total: 0, errors: 0 };
    window.total++;
    if (!success) {
      window.errors++;
      if (category) {
        this.errorsByCategory.set(
          category,
          (this.errorsByCategory.get(category) ?? 0) + 1
        );
      }
    }
    this.windows.set(windowKey, window);
  }

  getErrorRate(windowKey: string): number {
    const window = this.windows.get(windowKey);
    if (!window || window.total === 0) return 0;
    return window.errors / window.total;
  }

  getBreakdown(): Record<ErrorCategory, number> {
    return Object.fromEntries(this.errorsByCategory) as Record<ErrorCategory, number>;
  }
}
```

Alerting

Monitoring without alerting is just data collection. Set up alerts for the conditions that require human attention.

Alert Configuration

```typescript
interface AlertRule {
  name: string;
  condition: () => boolean;
  message: () => string;
  severity: "warning" | "critical";
  cooldownMinutes: number;
}

// Assumes module-level errorTracker, latencyTracker, and costTracker instances.
const alertRules: AlertRule[] = [
  {
    name: "high_error_rate",
    condition: () => {
      const currentWindow = new Date().toISOString().slice(0, 16);
      return errorTracker.getErrorRate(currentWindow) > 0.1;
    },
    message: () => "Error rate exceeded 10% in the current window",
    severity: "critical",
    cooldownMinutes: 15,
  },
  {
    name: "high_latency",
    condition: () => latencyTracker.percentile(95) > 10000,
    message: () =>
      `P95 latency is ${latencyTracker.percentile(95)}ms (threshold: 10000ms)`,
    severity: "warning",
    cooldownMinutes: 30,
  },
  {
    name: "budget_warning",
    condition: () => costTracker.total > 80,
    message: () =>
      `Daily spend at $${costTracker.total.toFixed(2)} — approaching $100 limit`,
    severity: "warning",
    cooldownMinutes: 60,
  },
];
```

Alert Engine

```typescript
class AlertEngine {
  private lastFired: Map<string, number> = new Map();

  evaluate(rules: AlertRule[], notify: (alert: AlertRule) => void): void {
    const now = Date.now();
    for (const rule of rules) {
      const lastTime = this.lastFired.get(rule.name) ?? 0;
      const cooldownMs = rule.cooldownMinutes * 60 * 1000;
      // Skip rules still in their cooldown window to avoid alert spam.
      if (now - lastTime < cooldownMs) continue;
      if (rule.condition()) {
        this.lastFired.set(rule.name, now);
        notify(rule);
      }
    }
  }
}
```

Usage Analytics

Beyond operational metrics, you want to understand how your AI features are being used.

Usage Event Tracking

```typescript
interface UsageEvent {
  timestamp: string;
  userId: string;
  feature: string;
  action: "request" | "accept" | "reject" | "edit" | "retry";
  model: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  costUsd: number;
  metadata: Record<string, string>;
}

class UsageAnalytics {
  private events: UsageEvent[] = [];

  track(event: UsageEvent): void {
    this.events.push(event);
  }

  acceptanceRate(feature: string): number {
    const featureEvents = this.events.filter(
      (e) =>
        e.feature === feature &&
        (e.action === "accept" || e.action === "reject")
    );
    if (featureEvents.length === 0) return 0;
    const accepted = featureEvents.filter((e) => e.action === "accept").length;
    return accepted / featureEvents.length;
  }

  topFeatures(limit = 10): Array<{ feature: string; count: number }> {
    const counts = new Map<string, number>();
    for (const event of this.events) {
      counts.set(event.feature, (counts.get(event.feature) ?? 0) + 1);
    }
    return [...counts.entries()]
      .map(([feature, count]) => ({ feature, count }))
      .sort((a, b) => b.count - a.count)
      .slice(0, limit);
  }

  uniqueUsersToday(): number {
    const today = new Date().toISOString().slice(0, 10);
    const users = new Set(
      this.events
        .filter((e) => e.timestamp.startsWith(today))
        .map((e) => e.userId)
    );
    return users.size;
  }
}
```

Complete Monitoring Wrapper

Here is a full monitoring wrapper class that ties everything together. Use this as the single entry point for all Claude API calls in your application.

```typescript
class MonitoredClaudeClient {
  private client: Anthropic;
  private costTracker: CostTracker;
  private latencyTracker: PercentileTracker;
  private errorTracker: ErrorRateTracker;
  private analytics: UsageAnalytics;
  private alertEngine: AlertEngine;
  private budgetConfig: BudgetConfig;

  constructor(apiKey: string, budgetConfig: BudgetConfig) {
    this.client = new Anthropic({ apiKey });
    this.costTracker = {
      byUser: new Map(),
      byFeature: new Map(),
      byModel: new Map(),
      byHour: new Map(),
      total: 0,
    };
    this.latencyTracker = new PercentileTracker();
    this.errorTracker = new ErrorRateTracker();
    this.analytics = new UsageAnalytics();
    this.alertEngine = new AlertEngine();
    this.budgetConfig = budgetConfig;
  }

  async createMessage(params: {
    model: string;
    max_tokens: number;
    messages: Array<{ role: string; content: string }>;
    userId: string;
    feature: string;
  }) {
    const requestId = crypto.randomUUID();
    const start = performance.now();

    // Block the request up front if the user or the day is over budget.
    const withinBudget = checkBudget(
      this.budgetConfig,
      this.costTracker,
      params.userId,
      0
    );
    if (!withinBudget) {
      throw new Error("Budget exceeded — request blocked");
    }

    let statusCode = 200;
    let error: string | null = null;
    let result: any = null;

    try {
      result = await this.client.messages.create({
        model: params.model,
        max_tokens: params.max_tokens,
        messages: params.messages as any,
      });
    } catch (err: any) {
      statusCode = err.status ?? 500;
      error = err.message ?? "Unknown error";
      const category = categorizeError(statusCode, error);
      this.errorTracker.record(false, category);
      throw err;
    } finally {
      const latencyMs = Math.round(performance.now() - start);
      const inputTokens = result?.usage?.input_tokens ?? 0;
      const outputTokens = result?.usage?.output_tokens ?? 0;
      const cost = calculateCost(params.model, inputTokens, outputTokens);

      this.latencyTracker.record(latencyMs);
      // Failures were already recorded (with a category) in the catch block,
      // so only record successes here to avoid double-counting.
      if (error === null) this.errorTracker.record(true);
      updateCostTracker(
        this.costTracker,
        params.userId,
        params.feature,
        params.model,
        cost
      );

      const logEntry: ClaudeRequestLog = {
        requestId,
        timestamp: new Date().toISOString(),
        model: params.model,
        userId: params.userId,
        feature: params.feature,
        inputTokens,
        outputTokens,
        cacheReadTokens: result?.usage?.cache_read_input_tokens ?? 0,
        cacheCreationTokens: result?.usage?.cache_creation_input_tokens ?? 0,
        latencyMs,
        statusCode,
        error,
        costUsd: cost,
      };
      logClaudeRequest(logEntry);

      this.analytics.track({
        timestamp: logEntry.timestamp,
        userId: params.userId,
        feature: params.feature,
        action: "request",
        model: params.model,
        inputTokens,
        outputTokens,
        latencyMs,
        costUsd: cost,
        metadata: { requestId },
      });
    }

    return result;
  }

  getMetrics() {
    return {
      latency: this.latencyTracker.summary(),
      errors: this.errorTracker.getBreakdown(),
      costs: {
        total: this.costTracker.total,
        byModel: Object.fromEntries(this.costTracker.byModel),
        byFeature: Object.fromEntries(this.costTracker.byFeature),
      },
      usage: {
        topFeatures: this.analytics.topFeatures(),
        uniqueUsersToday: this.analytics.uniqueUsersToday(),
      },
    };
  }
}
```

Key Takeaways

  1. Log every API call with structured JSON including tokens, cost, latency, and error details.
  2. Track costs at multiple levels — per request, per user, per feature, and per day.
  3. Measure latency percentiles, not just averages. P95 and P99 reveal real user pain.
  4. Categorize errors so you can distinguish between rate limits, server issues, and bad requests.
  5. Set up alerts with cooldown periods so you get notified without being spammed.
  6. Track usage analytics to understand which features deliver value and which do not.
  7. Use a monitoring wrapper as a single entry point so every call is automatically instrumented.

Observability is not optional for production AI applications. Build it in from day one, and you will catch problems before your users do.