📡 Monitoring & Observability
Track costs, latency, errors, and usage across your Claude integration
Shipping an AI feature is only half the work. The other half is knowing what is actually happening once it runs in production. Without monitoring, you are flying blind — you cannot tell if responses are slow, if costs are spiking, if error rates are climbing, or if users are even getting value from the integration.
This lesson covers everything you need to build a production-grade observability layer around your Claude API integration.
Why Monitoring Matters for LLM Applications
Traditional software monitoring focuses on uptime and error rates. LLM applications add entirely new dimensions:
| Dimension | Why It Matters |
|---|---|
| Latency | LLM calls can take 2-30+ seconds. Users notice. |
| Token usage | Directly drives cost. Uncontrolled usage can bankrupt a project. |
| Cost per request | Different models and prompt sizes have wildly different costs. |
| Error rates | Rate limits, overloaded errors, malformed responses. |
| Response quality | The model can return valid JSON but terrible content. |
| User satisfaction | Are users accepting, editing, or rejecting AI outputs? |
Without visibility into these dimensions, you will only discover problems when users complain — or when you get an unexpected bill.
What to Monitor
Here is a comprehensive checklist of metrics every Claude integration should track:
1. Latency Metrics
- Time to first token (TTFT): How long before the first byte of the response arrives. Critical for streaming UIs.
- Total response time: End-to-end duration of the API call.
- P50 / P95 / P99 latencies: Median tells you the norm; P95 and P99 reveal tail latency problems.
- Latency by model: Compare performance across Claude Sonnet, Haiku, and Opus.
2. Token Metrics
- Input tokens per request: Are your prompts growing out of control?
- Output tokens per request: Are responses unreasonably long?
- Total tokens per user session: Track cumulative usage across a conversation.
- Cache hit rate: If you use prompt caching, measure how often it activates.
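The cache hit rate is easy to derive from the token counts you are already logging. One reasonable definition — cached prompt tokens as a fraction of all prompt tokens — can be sketched like this (field names mirror the `inputTokens` and `cacheReadTokens` values logged later in this lesson):

```typescript
// Cache hit rate: fraction of prompt tokens served from the cache.
// With prompt caching, cached tokens are reported separately from
// uncached input tokens, so the full prompt is the sum of both.
function cacheHitRate(inputTokens: number, cacheReadTokens: number): number {
  const totalPromptTokens = inputTokens + cacheReadTokens;
  if (totalPromptTokens === 0) return 0;
  return cacheReadTokens / totalPromptTokens;
}
```

A rate near zero on a feature that should be cache-friendly usually means the cached prefix is changing between calls.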
3. Cost Metrics
- Cost per request: Calculated from input/output token counts and model pricing.
- Cost per user: How much does each user cost you?
- Cost per feature: Which features are the most expensive?
- Daily / weekly / monthly spend: Trend tracking to catch spikes early.
4. Error Metrics
- Error rate by type: 429 (rate limit), 500 (server error), 529 (overloaded), timeout.
- Retry count: How many retries before success?
- Failure rate: Requests that fail even after retries.
- Error rate by model: Some models may have higher error rates during peak hours.
5. Usage Metrics
- Requests per user: Identify power users and potential abuse.
- Requests per feature: Know which integrations get the most traffic.
- Peak usage hours: Plan capacity and rate limit budgets.
- Unique users per day: Track adoption over time.
Structured Logging
The foundation of observability is structured logging. Never log plain strings — always log structured JSON so you can query, filter, and aggregate later.
Basic Logging Setup
```typescript
interface ClaudeRequestLog {
  requestId: string;
  timestamp: string;
  model: string;
  userId: string;
  feature: string;
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheCreationTokens: number;
  latencyMs: number;
  statusCode: number;
  error: string | null;
  costUsd: number;
}

function logClaudeRequest(entry: ClaudeRequestLog): void {
  console.log(JSON.stringify({
    level: "info",
    service: "claude-integration",
    event: "claude_api_call",
    ...entry,
  }));
}
```

What to Include in Every Log Entry
Every single Claude API call should produce a log entry with these fields:
| Field | Purpose |
|---|---|
| requestId | Unique ID to correlate logs, traces, and user reports |
| timestamp | ISO 8601 timestamp for time-series analysis |
| model | Which model was called |
| userId | Who triggered the call |
| feature | Which product feature initiated the call |
| inputTokens | Token count from the request |
| outputTokens | Token count from the response |
| latencyMs | Wall-clock time for the full request |
| statusCode | HTTP status code returned |
| error | Error message if the call failed |
| costUsd | Calculated cost in USD |
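The payoff of structured entries is that later analysis is just parse-and-reduce. As an illustrative sketch, here is an aggregation over parsed log lines — the row type is a trimmed-down slice of ClaudeRequestLog, kept minimal so the snippet stands alone:

```typescript
// Minimal slice of the log entry shape needed for this aggregation.
interface LogRow {
  feature: string;
  latencyMs: number;
  costUsd: number;
}

// Average latency and total cost per feature from parsed log lines.
function summarizeByFeature(
  rows: LogRow[]
): Map<string, { avgLatencyMs: number; costUsd: number }> {
  const acc = new Map<string, { totalMs: number; costUsd: number; n: number }>();
  for (const row of rows) {
    const entry = acc.get(row.feature) ?? { totalMs: 0, costUsd: 0, n: 0 };
    entry.totalMs += row.latencyMs;
    entry.costUsd += row.costUsd;
    entry.n += 1;
    acc.set(row.feature, entry);
  }
  const out = new Map<string, { avgLatencyMs: number; costUsd: number }>();
  for (const [feature, { totalMs, costUsd, n }] of acc) {
    out.set(feature, { avgLatencyMs: totalMs / n, costUsd });
  }
  return out;
}
```

The same shape works for any of the groupings in the checklist above — swap `feature` for `userId` or `model`.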
Cost Tracking
Cost tracking is non-negotiable for production LLM applications. Here is how to calculate and track costs accurately.
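As a quick sanity check of the arithmetic before any code: a Sonnet 4 request with 1,200 input tokens and 300 output tokens, at the rates listed below ($3 and $15 per million tokens), costs (1,200 / 1,000,000) × $3 + (300 / 1,000,000) × $15 = $0.0036 + $0.0045 ≈ $0.0081. The same computation as a one-off sketch:

```typescript
// One-off cost check for a Sonnet 4 call: $3/MTok input, $15/MTok output.
const inputTokens = 1_200;
const outputTokens = 300;
const costUsd =
  (inputTokens / 1_000_000) * 3.0 + (outputTokens / 1_000_000) * 15.0;
// costUsd is approximately 0.0081
```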
Pricing Reference
```typescript
const MODEL_PRICING: Record<string, { inputPer1M: number; outputPer1M: number }> = {
  "claude-sonnet-4-20250514": { inputPer1M: 3.0, outputPer1M: 15.0 },
  "claude-3-5-haiku-20241022": { inputPer1M: 0.80, outputPer1M: 4.0 },
  "claude-opus-4-20250514": { inputPer1M: 15.0, outputPer1M: 75.0 },
};

function calculateCost(
  model: string,
  inputTokens: number,
  outputTokens: number
): number {
  const pricing = MODEL_PRICING[model];
  if (!pricing) return 0; // unknown model: report zero rather than throw
  const inputCost = (inputTokens / 1_000_000) * pricing.inputPer1M;
  const outputCost = (outputTokens / 1_000_000) * pricing.outputPer1M;
  // Round to micro-dollar precision to avoid floating-point noise in sums.
  return Math.round((inputCost + outputCost) * 1_000_000) / 1_000_000;
}
```

Cost Aggregation
Track costs at multiple levels of granularity:
```typescript
interface CostTracker {
  byUser: Map<string, number>;
  byFeature: Map<string, number>;
  byModel: Map<string, number>;
  byHour: Map<string, number>;
  total: number;
}

function updateCostTracker(
  tracker: CostTracker,
  userId: string,
  feature: string,
  model: string,
  cost: number
): void {
  tracker.byUser.set(userId, (tracker.byUser.get(userId) ?? 0) + cost);
  tracker.byFeature.set(feature, (tracker.byFeature.get(feature) ?? 0) + cost);
  tracker.byModel.set(model, (tracker.byModel.get(model) ?? 0) + cost);
  // Hour bucket keyed by ISO timestamp truncated to the hour, e.g. "2025-01-02T03"
  const hourKey = new Date().toISOString().slice(0, 13);
  tracker.byHour.set(hourKey, (tracker.byHour.get(hourKey) ?? 0) + cost);
  tracker.total += cost;
}
```

Budget Alerts
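A budget check needs a daily total, but the cost tracker above only keeps hourly buckets keyed by ISO timestamp prefixes. Rolling hours up into a day is just a longer string prefix — a small sketch of the key scheme:

```typescript
// Bucket keys are ISO 8601 prefixes: 13 chars = hour, 10 chars = day.
function hourKey(d: Date): string {
  return d.toISOString().slice(0, 13);
}
function dayKey(d: Date): string {
  return d.toISOString().slice(0, 10);
}

// Every hour bucket belonging to a day starts with that day's key,
// so a daily total is a prefix-filtered sum over the hourly map.
function dailyTotal(byHour: Map<string, number>, day: string): number {
  let total = 0;
  for (const [key, value] of byHour) {
    if (key.startsWith(day)) total += value;
  }
  return total;
}
```

This is exactly the rollup the budget check performs against its daily limit.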
Set up thresholds to catch cost spikes before they become a problem:
```typescript
interface BudgetConfig {
  dailyLimitUsd: number;
  perUserLimitUsd: number;
  perRequestWarnUsd: number;
  alertCallback: (message: string) => void;
}

function checkBudget(
  config: BudgetConfig,
  tracker: CostTracker,
  userId: string,
  requestCost: number
): boolean {
  if (requestCost > config.perRequestWarnUsd) {
    config.alertCallback(
      `High-cost request: $${requestCost.toFixed(4)} for user ${userId}`
    );
  }
  // Note: byUser is cumulative; for a true daily per-user limit,
  // reset the tracker (or swap in a fresh one) at day boundaries.
  const userTotal = tracker.byUser.get(userId) ?? 0;
  if (userTotal > config.perUserLimitUsd) {
    config.alertCallback(
      `User ${userId} exceeded daily budget: $${userTotal.toFixed(2)}`
    );
    return false;
  }
  const todayKey = new Date().toISOString().slice(0, 10);
  let dailyTotal = 0;
  for (const [key, value] of tracker.byHour) {
    if (key.startsWith(todayKey)) dailyTotal += value;
  }
  if (dailyTotal > config.dailyLimitUsd) {
    config.alertCallback(
      `Daily budget exceeded: $${dailyTotal.toFixed(2)}`
    );
    return false;
  }
  return true;
}
```

Latency Monitoring
Latency is one of the most impactful metrics for user experience. Here is how to measure it properly.
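Time to first token can only be observed on a streaming response, which is why the non-streaming helper below leaves it null. A transport-agnostic sketch — it works on any async-iterable stream of chunks, whether from the SDK's streaming mode or a mock:

```typescript
// Measure time to first token over any async-iterable stream of chunks.
async function measureTtft<T>(
  stream: AsyncIterable<T>
): Promise<{ ttftMs: number | null; chunks: T[] }> {
  const start = performance.now();
  let ttftMs: number | null = null;
  const chunks: T[] = [];
  for await (const chunk of stream) {
    if (ttftMs === null) {
      ttftMs = performance.now() - start; // first chunk has arrived
    }
    chunks.push(chunk);
  }
  return { ttftMs, chunks };
}
```

A null result after the stream completes means it produced no chunks at all — worth logging as its own failure mode.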
Measuring Latency
```typescript
interface LatencyMetrics {
  ttftMs: number | null;
  totalMs: number;
  model: string;
  inputTokens: number;
  outputTokens: number;
}

async function measureLatency(
  apiCall: () => Promise<any>,
  model: string
): Promise<{ result: any; metrics: LatencyMetrics }> {
  const start = performance.now();
  const result = await apiCall();
  const totalMs = performance.now() - start;
  return {
    result,
    metrics: {
      // TTFT is only observable on streaming responses; null for
      // a plain request/response call like this one.
      ttftMs: null,
      totalMs: Math.round(totalMs),
      model,
      inputTokens: result.usage?.input_tokens ?? 0,
      outputTokens: result.usage?.output_tokens ?? 0,
    },
  };
}
```

Latency Percentile Tracker
```typescript
class PercentileTracker {
  private values: number[] = [];
  private readonly maxSize: number;

  constructor(maxSize = 10000) {
    this.maxSize = maxSize;
  }

  record(value: number): void {
    this.values.push(value);
    if (this.values.length > this.maxSize) {
      this.values.shift(); // drop the oldest value: a sliding window
    }
  }

  percentile(p: number): number {
    if (this.values.length === 0) return 0;
    const sorted = [...this.values].sort((a, b) => a - b);
    const index = Math.ceil((p / 100) * sorted.length) - 1;
    return sorted[Math.max(0, index)];
  }

  summary(): { p50: number; p95: number; p99: number; count: number } {
    return {
      p50: this.percentile(50),
      p95: this.percentile(95),
      p99: this.percentile(99),
      count: this.values.length,
    };
  }
}
```

Error Dashboards
Errors in LLM applications are not just binary pass/fail. You need to categorize, track trends, and set up alerting for different error types.
Error Categorization
```typescript
type ErrorCategory =
  | "rate_limit"
  | "overloaded"
  | "server_error"
  | "timeout"
  | "invalid_request"
  | "auth_error"
  | "context_too_long"
  | "content_filter"
  | "unknown";

function categorizeError(statusCode: number, errorMessage: string): ErrorCategory {
  if (statusCode === 429) return "rate_limit";
  if (statusCode === 529) return "overloaded";
  if (statusCode >= 500) return "server_error";
  if (statusCode === 408 || errorMessage.includes("timeout")) return "timeout";
  if (statusCode === 401 || statusCode === 403) return "auth_error";
  // Check message-based 400 causes before the generic 400 fallback.
  if (errorMessage.includes("too many tokens")) return "context_too_long";
  if (errorMessage.includes("content filtering")) return "content_filter";
  if (statusCode === 400) return "invalid_request";
  return "unknown";
}
```

Error Rate Tracker
```typescript
class ErrorRateTracker {
  private windows: Map<string, { total: number; errors: number }> = new Map();
  private errorsByCategory: Map<ErrorCategory, number> = new Map();

  record(success: boolean, category?: ErrorCategory): void {
    // One window per minute: ISO timestamp truncated to minutes.
    const windowKey = new Date().toISOString().slice(0, 16);
    const window = this.windows.get(windowKey) ?? { total: 0, errors: 0 };
    window.total++;
    if (!success) {
      window.errors++;
      if (category) {
        this.errorsByCategory.set(
          category,
          (this.errorsByCategory.get(category) ?? 0) + 1
        );
      }
    }
    this.windows.set(windowKey, window);
  }

  getErrorRate(windowKey: string): number {
    const window = this.windows.get(windowKey);
    if (!window || window.total === 0) return 0;
    return window.errors / window.total;
  }

  getBreakdown(): Record<ErrorCategory, number> {
    return Object.fromEntries(this.errorsByCategory) as Record<ErrorCategory, number>;
  }
}
```

Alerting
Monitoring without alerting is just data collection. Set up alerts for the conditions that require human attention.
Alert Configuration
```typescript
interface AlertRule {
  name: string;
  condition: () => boolean;
  message: () => string;
  severity: "warning" | "critical";
  cooldownMinutes: number;
}

// errorTracker, latencyTracker, and costTracker are assumed to be
// module-level instances of the trackers defined earlier.
const alertRules: AlertRule[] = [
  {
    name: "high_error_rate",
    condition: () => {
      const currentWindow = new Date().toISOString().slice(0, 16);
      return errorTracker.getErrorRate(currentWindow) > 0.1;
    },
    message: () => "Error rate exceeded 10% in the current window",
    severity: "critical",
    cooldownMinutes: 15,
  },
  {
    name: "high_latency",
    condition: () => latencyTracker.percentile(95) > 10000,
    message: () =>
      `P95 latency is ${latencyTracker.percentile(95)}ms (threshold: 10000ms)`,
    severity: "warning",
    cooldownMinutes: 30,
  },
  {
    name: "budget_warning",
    condition: () => costTracker.total > 80,
    message: () =>
      `Daily spend at $${costTracker.total.toFixed(2)} — approaching $100 limit`,
    severity: "warning",
    cooldownMinutes: 60,
  },
];
```

Alert Engine
```typescript
class AlertEngine {
  private lastFired: Map<string, number> = new Map();

  evaluate(rules: AlertRule[], notify: (alert: AlertRule) => void): void {
    const now = Date.now();
    for (const rule of rules) {
      const lastTime = this.lastFired.get(rule.name) ?? 0;
      const cooldownMs = rule.cooldownMinutes * 60 * 1000;
      if (now - lastTime < cooldownMs) continue; // still in cooldown
      if (rule.condition()) {
        this.lastFired.set(rule.name, now);
        notify(rule);
      }
    }
  }
}
```

Usage Analytics
Beyond operational metrics, you want to understand how your AI features are being used.
Usage Event Tracking
```typescript
interface UsageEvent {
  timestamp: string;
  userId: string;
  feature: string;
  action: "request" | "accept" | "reject" | "edit" | "retry";
  model: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  costUsd: number;
  metadata: Record<string, string>;
}

class UsageAnalytics {
  private events: UsageEvent[] = [];

  track(event: UsageEvent): void {
    this.events.push(event);
  }

  acceptanceRate(feature: string): number {
    const featureEvents = this.events.filter(
      (e) => e.feature === feature && (e.action === "accept" || e.action === "reject")
    );
    if (featureEvents.length === 0) return 0;
    const accepted = featureEvents.filter((e) => e.action === "accept").length;
    return accepted / featureEvents.length;
  }

  topFeatures(limit = 10): Array<{ feature: string; count: number }> {
    const counts = new Map<string, number>();
    for (const event of this.events) {
      counts.set(event.feature, (counts.get(event.feature) ?? 0) + 1);
    }
    return [...counts.entries()]
      .map(([feature, count]) => ({ feature, count }))
      .sort((a, b) => b.count - a.count)
      .slice(0, limit);
  }

  uniqueUsersToday(): number {
    const today = new Date().toISOString().slice(0, 10);
    const users = new Set(
      this.events
        .filter((e) => e.timestamp.startsWith(today))
        .map((e) => e.userId)
    );
    return users.size;
  }
}
```

Complete Monitoring Wrapper
Here is a full monitoring wrapper class that ties everything together. Use this as the single entry point for all Claude API calls in your application.
```typescript
import Anthropic from "@anthropic-ai/sdk";

class MonitoredClaudeClient {
  private client: Anthropic;
  private costTracker: CostTracker;
  private latencyTracker: PercentileTracker;
  private errorTracker: ErrorRateTracker;
  private analytics: UsageAnalytics;
  private alertEngine: AlertEngine;
  private budgetConfig: BudgetConfig;

  constructor(apiKey: string, budgetConfig: BudgetConfig) {
    this.client = new Anthropic({ apiKey });
    this.costTracker = {
      byUser: new Map(),
      byFeature: new Map(),
      byModel: new Map(),
      byHour: new Map(),
      total: 0,
    };
    this.latencyTracker = new PercentileTracker();
    this.errorTracker = new ErrorRateTracker();
    this.analytics = new UsageAnalytics();
    this.alertEngine = new AlertEngine();
    this.budgetConfig = budgetConfig;
  }

  async createMessage(params: {
    model: string;
    max_tokens: number;
    messages: Array<{ role: string; content: string }>;
    userId: string;
    feature: string;
  }) {
    const requestId = crypto.randomUUID();
    const start = performance.now();

    // Pre-flight check: the request's cost is not known yet, so pass 0
    // and rely on the accumulated per-user and daily totals.
    const withinBudget = checkBudget(
      this.budgetConfig,
      this.costTracker,
      params.userId,
      0
    );
    if (!withinBudget) {
      throw new Error("Budget exceeded — request blocked");
    }

    let statusCode = 200;
    let error: string | null = null;
    let result: any = null;
    try {
      result = await this.client.messages.create({
        model: params.model,
        max_tokens: params.max_tokens,
        messages: params.messages as any,
      });
    } catch (err: any) {
      statusCode = err.status ?? 500;
      error = err.message ?? "Unknown error";
      throw err;
    } finally {
      // Instrument every outcome exactly once, success or failure.
      const latencyMs = Math.round(performance.now() - start);
      const inputTokens = result?.usage?.input_tokens ?? 0;
      const outputTokens = result?.usage?.output_tokens ?? 0;
      const cost = calculateCost(params.model, inputTokens, outputTokens);
      this.latencyTracker.record(latencyMs);
      this.errorTracker.record(
        statusCode < 400,
        error ? categorizeError(statusCode, error) : undefined
      );
      updateCostTracker(
        this.costTracker,
        params.userId,
        params.feature,
        params.model,
        cost
      );
      const logEntry: ClaudeRequestLog = {
        requestId,
        timestamp: new Date().toISOString(),
        model: params.model,
        userId: params.userId,
        feature: params.feature,
        inputTokens,
        outputTokens,
        cacheReadTokens: result?.usage?.cache_read_input_tokens ?? 0,
        cacheCreationTokens: result?.usage?.cache_creation_input_tokens ?? 0,
        latencyMs,
        statusCode,
        error,
        costUsd: cost,
      };
      logClaudeRequest(logEntry);
      this.analytics.track({
        timestamp: logEntry.timestamp,
        userId: params.userId,
        feature: params.feature,
        action: "request",
        model: params.model,
        inputTokens,
        outputTokens,
        latencyMs,
        costUsd: cost,
        metadata: { requestId },
      });
    }
    return result;
  }

  getMetrics() {
    return {
      latency: this.latencyTracker.summary(),
      errors: this.errorTracker.getBreakdown(),
      costs: {
        total: this.costTracker.total,
        byModel: Object.fromEntries(this.costTracker.byModel),
        byFeature: Object.fromEntries(this.costTracker.byFeature),
      },
      usage: {
        topFeatures: this.analytics.topFeatures(),
        uniqueUsersToday: this.analytics.uniqueUsersToday(),
      },
    };
  }
}
```

Key Takeaways
- Log every API call with structured JSON including tokens, cost, latency, and error details.
- Track costs at multiple levels — per request, per user, per feature, and per day.
- Measure latency percentiles, not just averages. P95 and P99 reveal real user pain.
- Categorize errors so you can distinguish between rate limits, server issues, and bad requests.
- Set up alerts with cooldown periods so you get notified without being spammed.
- Track usage analytics to understand which features deliver value and which do not.
- Use a monitoring wrapper as a single entry point so every call is automatically instrumented.
Observability is not optional for production AI applications. Build it in from day one, and you will catch problems before your users do.