📡 Monitoring & Observability
Track costs, latency, errors, and usage across your Claude integration
Shipping an AI feature is only half the work. The other half is knowing what is actually happening once it runs in production. Without monitoring, you are flying blind — you cannot tell if responses are slow, if costs are spiking, if error rates are climbing, or if users are even getting value from the integration.
This lesson covers everything you need to build a production-grade observability layer around your Claude API integration.
Why Monitoring Matters for LLM Applications
Traditional software monitoring focuses on uptime and error rates. LLM applications add entirely new dimensions:
| Dimension | Why It Matters |
|---|---|
| Latency | LLM calls can take 2-30+ seconds. Users notice. |
| Token usage | Directly drives cost. Uncontrolled usage can bankrupt a project. |
| Cost per request | Different models and prompt sizes have wildly different costs. |
| Error rates | Rate limits, overloaded errors, malformed responses. |
| Response quality | The model can return valid JSON but terrible content. |
| User satisfaction | Are users accepting, editing, or rejecting AI outputs? |
Without visibility into these dimensions, you will only discover problems when users complain — or when you get an unexpected bill.
What to Monitor
Here is a comprehensive checklist of metrics every Claude integration should track:
1. Latency Metrics
- Time to first token (TTFT): How long before the first byte of the response arrives. Critical for streaming UIs.
- Total response time: End-to-end duration of the API call.
- P50 / P95 / P99 latencies: Median tells you the norm; P95 and P99 reveal tail latency problems.
- Latency by model: Compare performance across Claude Sonnet, Haiku, and Opus.
2. Token Metrics
- Input tokens per request: Are your prompts growing out of control?
- Output tokens per request: Are responses unreasonably long?
- Total tokens per user session: Track cumulative usage across a conversation.
- Cache hit rate: If you use prompt caching, measure how often it activates.
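The cache hit rate is easy to derive from the token counts you are already logging. One reasonable definition — cached prompt tokens as a fraction of all prompt tokens — can be sketched like this (field names mirror the `inputTokens` and `cacheReadTokens` values logged later in this lesson):

```typescript
// Cache hit rate: fraction of prompt tokens served from the cache.
// With prompt caching, cached tokens are reported separately from
// uncached input tokens, so the full prompt is the sum of both.
function cacheHitRate(inputTokens: number, cacheReadTokens: number): number {
  const totalPromptTokens = inputTokens + cacheReadTokens;
  if (totalPromptTokens === 0) return 0;
  return cacheReadTokens / totalPromptTokens;
}
```

A rate near zero on a feature that should be cache-friendly usually means the cached prefix is changing between calls.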
3. Cost Metrics
- Cost per request: Calculated from input/output token counts and model pricing.
- Cost per user: How much does each user cost you?
- Cost per feature: Which features are the most expensive?
- Daily / weekly / monthly spend: Trend tracking to catch spikes early.
4. Error Metrics
- Error rate by type: 429 (rate limit), 500 (server error), 529 (overloaded), timeout.
- Retry count: How many retries before success?
- Failure rate: Requests that fail even after retries.
- Error rate by model: Some models may have higher error rates during peak hours.
5. Usage Metrics
- Requests per user: Identify power users and potential abuse.
- Requests per feature: Know which integrations get the most traffic.
- Peak usage hours: Plan capacity and rate limit budgets.
- Unique users per day: Track adoption over time.
Structured Logging
The foundation of observability is structured logging. Never log plain strings — always log structured JSON so you can query, filter, and aggregate later.
Basic Logging Setup
```typescript
interface ClaudeRequestLog {
  requestId: string;
  timestamp: string;
  model: string;
  userId: string;
  feature: string;
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheCreationTokens: number;
  latencyMs: number;
  statusCode: number;
  error: string | null;
  costUsd: number;
}

function logClaudeRequest(entry: ClaudeRequestLog): void {
  console.log(JSON.stringify({
    level: "info",
    service: "claude-integration",
    event: "claude_api_call",
    ...entry,
  }));
}
```

What to Include in Every Log Entry
Every single Claude API call should produce a log entry with these fields:
| Field | Purpose |
|---|---|
| requestId | Unique ID to correlate logs, traces, and user reports |
| timestamp | ISO 8601 timestamp for time-series analysis |
| model | Which model was called |
| userId | Who triggered the call |
| feature | Which product feature initiated the call |
| inputTokens | Token count from the request |
| outputTokens | Token count from the response |
| latencyMs | Wall-clock time for the full request |
| statusCode | HTTP status code returned |
| error | Error message if the call failed |
| costUsd | Calculated cost in USD |
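The payoff of structured entries is that later analysis is just parse-and-reduce. As an illustrative sketch, here is an aggregation over parsed log lines — the row type is a trimmed-down slice of ClaudeRequestLog, kept minimal so the snippet stands alone:

```typescript
// Minimal slice of the log entry shape needed for this aggregation.
interface LogRow {
  feature: string;
  latencyMs: number;
  costUsd: number;
}

// Average latency and total cost per feature from parsed log lines.
function summarizeByFeature(
  rows: LogRow[]
): Map<string, { avgLatencyMs: number; costUsd: number }> {
  const acc = new Map<string, { totalMs: number; costUsd: number; n: number }>();
  for (const row of rows) {
    const entry = acc.get(row.feature) ?? { totalMs: 0, costUsd: 0, n: 0 };
    entry.totalMs += row.latencyMs;
    entry.costUsd += row.costUsd;
    entry.n += 1;
    acc.set(row.feature, entry);
  }
  const out = new Map<string, { avgLatencyMs: number; costUsd: number }>();
  for (const [feature, { totalMs, costUsd, n }] of acc) {
    out.set(feature, { avgLatencyMs: totalMs / n, costUsd });
  }
  return out;
}
```

The same shape works for any of the groupings in the checklist above — swap `feature` for `userId` or `model`.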
Cost Tracking
Cost tracking is non-negotiable for production LLM applications. Here is how to calculate and track costs accurately.
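As a quick sanity check of the arithmetic before any code: a Sonnet 4 request with 1,200 input tokens and 300 output tokens, at the rates listed below ($3 and $15 per million tokens), costs (1,200 / 1,000,000) × $3 + (300 / 1,000,000) × $15 = $0.0036 + $0.0045 ≈ $0.0081. The same computation as a one-off sketch:

```typescript
// One-off cost check for a Sonnet 4 call: $3/MTok input, $15/MTok output.
const inputTokens = 1_200;
const outputTokens = 300;
const costUsd =
  (inputTokens / 1_000_000) * 3.0 + (outputTokens / 1_000_000) * 15.0;
// costUsd is approximately 0.0081
```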
Pricing Reference
```typescript
const MODEL_PRICING: Record<string, { inputPer1M: number; outputPer1M: number }> = {
  "claude-sonnet-4-20250514": { inputPer1M: 3.0, outputPer1M: 15.0 },
  "claude-3-5-haiku-20241022": { inputPer1M: 0.80, outputPer1M: 4.0 },
  "claude-opus-4-20250514": { inputPer1M: 15.0, outputPer1M: 75.0 },
};

function calculateCost(
  model: string,
  inputTokens: number,
  outputTokens: number
): number {
  const pricing = MODEL_PRICING[model];
  if (!pricing) return 0; // unknown model: report zero rather than throw
  const inputCost = (inputTokens / 1_000_000) * pricing.inputPer1M;
  const outputCost = (outputTokens / 1_000_000) * pricing.outputPer1M;
  // Round to micro-dollar precision to avoid floating-point noise in sums.
  return Math.round((inputCost + outputCost) * 1_000_000) / 1_000_000;
}
```

Cost Aggregation
Track costs at multiple levels of granularity:
```typescript
interface CostTracker {
  byUser: Map<string, number>;
  byFeature: Map<string, number>;
  byModel: Map<string, number>;
  byHour: Map<string, number>;
  total: number;
}

function updateCostTracker(
  tracker: CostTracker,
  userId: string,
  feature: string,
  model: string,
  cost: number
): void {
  tracker.byUser.set(userId, (tracker.byUser.get(userId) ?? 0) + cost);
  tracker.byFeature.set(feature, (tracker.byFeature.get(feature) ?? 0) + cost);
  tracker.byModel.set(model, (tracker.byModel.get(model) ?? 0) + cost);
  // Hour bucket keyed by ISO timestamp truncated to the hour, e.g. "2025-01-02T03"
  const hourKey = new Date().toISOString().slice(0, 13);
  tracker.byHour.set(hourKey, (tracker.byHour.get(hourKey) ?? 0) + cost);
  tracker.total += cost;
}
```

Budget Alerts
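A budget check needs a daily total, but the cost tracker above only keeps hourly buckets keyed by ISO timestamp prefixes. Rolling hours up into a day is just a longer string prefix — a small sketch of the key scheme:

```typescript
// Bucket keys are ISO 8601 prefixes: 13 chars = hour, 10 chars = day.
function hourKey(d: Date): string {
  return d.toISOString().slice(0, 13);
}
function dayKey(d: Date): string {
  return d.toISOString().slice(0, 10);
}

// Every hour bucket belonging to a day starts with that day's key,
// so a daily total is a prefix-filtered sum over the hourly map.
function dailyTotal(byHour: Map<string, number>, day: string): number {
  let total = 0;
  for (const [key, value] of byHour) {
    if (key.startsWith(day)) total += value;
  }
  return total;
}
```

This is exactly the rollup the budget check performs against its daily limit.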
Set up thresholds to catch cost spikes before they become a problem:
```typescript
interface BudgetConfig {
  dailyLimitUsd: number;
  perUserLimitUsd: number;
  perRequestWarnUsd: number;
  alertCallback: (message: string) => void;
}

function checkBudget(
  config: BudgetConfig,
  tracker: CostTracker,
  userId: string,
  requestCost: number
): boolean {
  if (requestCost > config.perRequestWarnUsd) {
    config.alertCallback(
      `High-cost request: $${requestCost.toFixed(4)} for user ${userId}`
    );
  }
  // Note: byUser is cumulative; for a true daily per-user limit,
  // reset the tracker (or swap in a fresh one) at day boundaries.
  const userTotal = tracker.byUser.get(userId) ?? 0;
  if (userTotal > config.perUserLimitUsd) {
    config.alertCallback(
      `User ${userId} exceeded daily budget: $${userTotal.toFixed(2)}`
    );
    return false;
  }
  const todayKey = new Date().toISOString().slice(0, 10);
  let dailyTotal = 0;
  for (const [key, value] of tracker.byHour) {
    if (key.startsWith(todayKey)) dailyTotal += value;
  }
  if (dailyTotal > config.dailyLimitUsd) {
    config.alertCallback(
      `Daily budget exceeded: $${dailyTotal.toFixed(2)}`
    );
    return false;
  }
  return true;
}
```

Latency Monitoring
Latency is one of the most impactful metrics for user experience. Here is how to measure it properly.
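Time to first token can only be observed on a streaming response, which is why the non-streaming helper below leaves it null. A transport-agnostic sketch — it works on any async-iterable stream of chunks, whether from the SDK's streaming mode or a mock:

```typescript
// Measure time to first token over any async-iterable stream of chunks.
async function measureTtft<T>(
  stream: AsyncIterable<T>
): Promise<{ ttftMs: number | null; chunks: T[] }> {
  const start = performance.now();
  let ttftMs: number | null = null;
  const chunks: T[] = [];
  for await (const chunk of stream) {
    if (ttftMs === null) {
      ttftMs = performance.now() - start; // first chunk has arrived
    }
    chunks.push(chunk);
  }
  return { ttftMs, chunks };
}
```

A null result after the stream completes means it produced no chunks at all — worth logging as its own failure mode.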
Measuring Latency
```typescript
interface LatencyMetrics {
  ttftMs: number | null;
  totalMs: number;
  model: string;
  inputTokens: number;
  outputTokens: number;
}

async function measureLatency(
  apiCall: () => Promise<any>,
  model: string
): Promise<{ result: any; metrics: LatencyMetrics }> {
  const start = performance.now();
  const result = await apiCall();
  const totalMs = performance.now() - start;
  return {
    result,
    metrics: {
      // TTFT is only observable on streaming responses; null for
      // a plain request/response call like this one.
      ttftMs: null,
      totalMs: Math.round(totalMs),
      model,
      inputTokens: result.usage?.input_tokens ?? 0,
      outputTokens: result.usage?.output_tokens ?? 0,
    },
  };
}
```

Latency Percentile Tracker
```typescript
class PercentileTracker {
  private values: number[] = [];
  private readonly maxSize: number;

  constructor(maxSize = 10000) {
    this.maxSize = maxSize;
  }

  record(value: number): void {
    this.values.push(value);
    if (this.values.length > this.maxSize) {
      this.values.shift(); // drop the oldest value: a sliding window
    }
  }

  percentile(p: number): number {
    if (this.values.length === 0) return 0;
    const sorted = [...this.values].sort((a, b) => a - b);
    const index = Math.ceil((p / 100) * sorted.length) - 1;
    return sorted[Math.max(0, index)];
  }

  summary(): { p50: number; p95: number; p99: number; count: number } {
    return {
      p50: this.percentile(50),
      p95: this.percentile(95),
      p99: this.percentile(99),
      count: this.values.length,
    };
  }
}
```

Error Dashboards
Errors in LLM applications are not just binary pass/fail. You need to categorize, track trends, and set up alerting for different error types.
Error Categorization
```typescript
type ErrorCategory =
  | "rate_limit"
  | "overloaded"
  | "server_error"
  | "timeout"
  | "invalid_request"
  | "auth_error"
  | "context_too_long"
  | "content_filter"
  | "unknown";

function categorizeError(statusCode: number, errorMessage: string): ErrorCategory {
  if (statusCode === 429) return "rate_limit";
  if (statusCode === 529) return "overloaded";
  if (statusCode >= 500) return "server_error";
  if (statusCode === 408 || errorMessage.includes("timeout")) return "timeout";
  if (statusCode === 401 || statusCode === 403) return "auth_error";
  // Check message-based 400 causes before the generic 400 fallback.
  if (errorMessage.includes("too many tokens")) return "context_too_long";
  if (errorMessage.includes("content filtering")) return "content_filter";
  if (statusCode === 400) return "invalid_request";
  return "unknown";
}
```

Error Rate Tracker
```typescript
class ErrorRateTracker {
  private windows: Map<string, { total: number; errors: number }> = new Map();
  private errorsByCategory: Map<ErrorCategory, number> = new Map();

  record(success: boolean, category?: ErrorCategory): void {
    // One window per minute: ISO timestamp truncated to minutes.
    const windowKey = new Date().toISOString().slice(0, 16);
    const window = this.windows.get(windowKey) ?? { total: 0, errors: 0 };
    window.total++;
    if (!success) {
      window.errors++;
      if (category) {
        this.errorsByCategory.set(
          category,
          (this.errorsByCategory.get(category) ?? 0) + 1
        );
      }
    }
    this.windows.set(windowKey, window);
  }

  getErrorRate(windowKey: string): number {
    const window = this.windows.get(windowKey);
    if (!window || window.total === 0) return 0;
    return window.errors / window.total;
  }

  getBreakdown(): Record<ErrorCategory, number> {
    return Object.fromEntries(this.errorsByCategory) as Record<ErrorCategory, number>;
  }
}
```

Alerting
Monitoring without alerting is just data collection. Set up alerts for the conditions that require human attention.
Alert Configuration
```typescript
interface AlertRule {
  name: string;
  condition: () => boolean;
  message: () => string;
  severity: "warning" | "critical";
  cooldownMinutes: number;
}

// errorTracker, latencyTracker, and costTracker are assumed to be
// module-level instances of the trackers defined earlier.
const alertRules: AlertRule[] = [
  {
    name: "high_error_rate",
    condition: () => {
      const currentWindow = new Date().toISOString().slice(0, 16);
      return errorTracker.getErrorRate(currentWindow) > 0.1;
    },
    message: () => "Error rate exceeded 10% in the current window",
    severity: "critical",
    cooldownMinutes: 15,
  },
  {
    name: "high_latency",
    condition: () => latencyTracker.percentile(95) > 10000,
    message: () =>
      `P95 latency is ${latencyTracker.percentile(95)}ms (threshold: 10000ms)`,
    severity: "warning",
    cooldownMinutes: 30,
  },
  {
    name: "budget_warning",
    condition: () => costTracker.total > 80,
    message: () =>
      `Daily spend at $${costTracker.total.toFixed(2)} — approaching $100 limit`,
    severity: "warning",
    cooldownMinutes: 60,
  },
];
```

Alert Engine
```typescript
class AlertEngine {
  private lastFired: Map<string, number> = new Map();

  evaluate(rules: AlertRule[], notify: (alert: AlertRule) => void): void {
    const now = Date.now();
    for (const rule of rules) {
      const lastTime = this.lastFired.get(rule.name) ?? 0;
      const cooldownMs = rule.cooldownMinutes * 60 * 1000;
      if (now - lastTime < cooldownMs) continue; // still in cooldown
      if (rule.condition()) {
        this.lastFired.set(rule.name, now);
        notify(rule);
      }
    }
  }
}
```

Usage Analytics
Beyond operational metrics, you want to understand how your AI features are being used.
Usage Event Tracking
```typescript
interface UsageEvent {
  timestamp: string;
  userId: string;
  feature: string;
  action: "request" | "accept" | "reject" | "edit" | "retry";
  model: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  costUsd: number;
  metadata: Record<string, string>;
}

class UsageAnalytics {
  private events: UsageEvent[] = [];

  track(event: UsageEvent): void {
    this.events.push(event);
  }

  acceptanceRate(feature: string): number {
    const featureEvents = this.events.filter(
      (e) => e.feature === feature && (e.action === "accept" || e.action === "reject")
    );
    if (featureEvents.length === 0) return 0;
    const accepted = featureEvents.filter((e) => e.action === "accept").length;
    return accepted / featureEvents.length;
  }

  topFeatures(limit = 10): Array<{ feature: string; count: number }> {
    const counts = new Map<string, number>();
    for (const event of this.events) {
      counts.set(event.feature, (counts.get(event.feature) ?? 0) + 1);
    }
    return [...counts.entries()]
      .map(([feature, count]) => ({ feature, count }))
      .sort((a, b) => b.count - a.count)
      .slice(0, limit);
  }

  uniqueUsersToday(): number {
    const today = new Date().toISOString().slice(0, 10);
    const users = new Set(
      this.events
        .filter((e) => e.timestamp.startsWith(today))
        .map((e) => e.userId)
    );
    return users.size;
  }
}
```

Complete Monitoring Wrapper
Here is a full monitoring wrapper class that ties everything together. Use this as the single entry point for all Claude API calls in your application.
```typescript
import Anthropic from "@anthropic-ai/sdk";

class MonitoredClaudeClient {
  private client: Anthropic;
  private costTracker: CostTracker;
  private latencyTracker: PercentileTracker;
  private errorTracker: ErrorRateTracker;
  private analytics: UsageAnalytics;
  private alertEngine: AlertEngine;
  private budgetConfig: BudgetConfig;

  constructor(apiKey: string, budgetConfig: BudgetConfig) {
    this.client = new Anthropic({ apiKey });
    this.costTracker = {
      byUser: new Map(),
      byFeature: new Map(),
      byModel: new Map(),
      byHour: new Map(),
      total: 0,
    };
    this.latencyTracker = new PercentileTracker();
    this.errorTracker = new ErrorRateTracker();
    this.analytics = new UsageAnalytics();
    this.alertEngine = new AlertEngine();
    this.budgetConfig = budgetConfig;
  }

  async createMessage(params: {
    model: string;
    max_tokens: number;
    messages: Array<{ role: string; content: string }>;
    userId: string;
    feature: string;
  }) {
    const requestId = crypto.randomUUID();
    const start = performance.now();

    // Pre-flight check: the request's cost is not known yet, so pass 0
    // and rely on the accumulated per-user and daily totals.
    const withinBudget = checkBudget(
      this.budgetConfig,
      this.costTracker,
      params.userId,
      0
    );
    if (!withinBudget) {
      throw new Error("Budget exceeded — request blocked");
    }

    let statusCode = 200;
    let error: string | null = null;
    let result: any = null;
    try {
      result = await this.client.messages.create({
        model: params.model,
        max_tokens: params.max_tokens,
        messages: params.messages as any,
      });
    } catch (err: any) {
      statusCode = err.status ?? 500;
      error = err.message ?? "Unknown error";
      throw err;
    } finally {
      // Instrument every outcome exactly once, success or failure.
      const latencyMs = Math.round(performance.now() - start);
      const inputTokens = result?.usage?.input_tokens ?? 0;
      const outputTokens = result?.usage?.output_tokens ?? 0;
      const cost = calculateCost(params.model, inputTokens, outputTokens);
      this.latencyTracker.record(latencyMs);
      this.errorTracker.record(
        statusCode < 400,
        error ? categorizeError(statusCode, error) : undefined
      );
      updateCostTracker(
        this.costTracker,
        params.userId,
        params.feature,
        params.model,
        cost
      );
      const logEntry: ClaudeRequestLog = {
        requestId,
        timestamp: new Date().toISOString(),
        model: params.model,
        userId: params.userId,
        feature: params.feature,
        inputTokens,
        outputTokens,
        cacheReadTokens: result?.usage?.cache_read_input_tokens ?? 0,
        cacheCreationTokens: result?.usage?.cache_creation_input_tokens ?? 0,
        latencyMs,
        statusCode,
        error,
        costUsd: cost,
      };
      logClaudeRequest(logEntry);
      this.analytics.track({
        timestamp: logEntry.timestamp,
        userId: params.userId,
        feature: params.feature,
        action: "request",
        model: params.model,
        inputTokens,
        outputTokens,
        latencyMs,
        costUsd: cost,
        metadata: { requestId },
      });
    }
    return result;
  }

  getMetrics() {
    return {
      latency: this.latencyTracker.summary(),
      errors: this.errorTracker.getBreakdown(),
      costs: {
        total: this.costTracker.total,
        byModel: Object.fromEntries(this.costTracker.byModel),
        byFeature: Object.fromEntries(this.costTracker.byFeature),
      },
      usage: {
        topFeatures: this.analytics.topFeatures(),
        uniqueUsersToday: this.analytics.uniqueUsersToday(),
      },
    };
  }
}
```

Key Takeaways
- Log every API call with structured JSON including tokens, cost, latency, and error details.
- Track costs at multiple levels — per request, per user, per feature, and per day.
- Measure latency percentiles, not just averages. P95 and P99 reveal real user pain.
- Categorize errors so you can distinguish between rate limits, server issues, and bad requests.
- Set up alerts with cooldown periods so you get notified without being spammed.
- Track usage analytics to understand which features deliver value and which do not.
- Use a monitoring wrapper as a single entry point so every call is automatically instrumented.
Observability is not optional for production AI applications. Build it in from day one, and you will catch problems before your users do.