TechLead
Lesson 28 of 30
6 min read
System Design

Monitoring and Observability

Master monitoring and observability with the three pillars: logs, metrics, and traces. Learn about Prometheus, Grafana, alerting, SLOs, and dashboards.

Monitoring tells you when something is broken. Observability tells you why. As systems grow in complexity with microservices, containers, and distributed architectures, traditional monitoring is insufficient. Observability is the ability to understand the internal state of your system by examining its external outputs: logs, metrics, and traces.

Monitoring vs. Observability

| Aspect   | Monitoring                       | Observability                              |
|----------|----------------------------------|--------------------------------------------|
| Focus    | Known failure modes              | Unknown unknowns                           |
| Approach | Predefined dashboards and alerts | Ad-hoc exploration and correlation         |
| Question | "Is the system healthy?"         | "Why is this request slow?"                |
| Data     | Aggregated metrics               | High-cardinality, high-dimensionality data |

The Three Pillars

1. Logs

Logs are discrete, timestamped records of events. They provide detailed context about what happened at a specific point in time.

// Structured logging (JSON format)
interface LogEntry {
  timestamp: string;
  level: "DEBUG" | "INFO" | "WARN" | "ERROR";
  message: string;
  service: string;
  traceId?: string;
  spanId?: string;
  userId?: string;
  requestId?: string;
  duration?: number;
  error?: {
    message: string;
    stack: string;
    code: string;
  };
  [key: string]: unknown; // Additional context
}

// Good logging practices
class Logger {
  private context: Record<string, unknown>;

  constructor(service: string) {
    this.context = { service };
  }

  withContext(ctx: Record<string, unknown>): Logger {
    const logger = new Logger(this.context.service as string);
    logger.context = { ...this.context, ...ctx };
    return logger;
  }

  info(message: string, data?: Record<string, unknown>): void {
    console.log(JSON.stringify({
      timestamp: new Date().toISOString(),
      level: "INFO",
      message,
      ...this.context,
      ...data,
    }));
  }

  error(message: string, error: Error, data?: Record<string, unknown>): void {
    console.error(JSON.stringify({
      timestamp: new Date().toISOString(),
      level: "ERROR",
      message,
      error: { message: error.message, stack: error.stack, name: error.name },
      ...this.context,
      ...data,
    }));
  }
}

// Usage
const logger = new Logger("order-service");
const reqLogger = logger.withContext({ requestId: "req_123", userId: "user_456" });
reqLogger.info("Order created", { orderId: "ord_789", amount: 99.99 });
// Output: {"timestamp":"2025-...","level":"INFO","message":"Order created","service":"order-service","requestId":"req_123","userId":"user_456","orderId":"ord_789","amount":99.99}

2. Metrics

Metrics are numerical measurements collected over time. They are aggregatable and efficient to store, making them ideal for dashboards, alerts, and trend analysis.

// Four types of metrics (following Prometheus conventions; the API
// shown matches the Node prom-client library)
import { Counter, Gauge, Histogram, Summary } from "prom-client";

// Counter: monotonically increasing value
// Use for: total requests, errors, bytes transferred
const httpRequestsTotal = new Counter({
  name: "http_requests_total",
  help: "Total HTTP requests",
  labelNames: ["method", "path", "status"],
});
httpRequestsTotal.labels("GET", "/api/orders", "200").inc();

// Gauge: value that goes up and down
// Use for: current connections, queue depth, temperature
const activeConnections = new Gauge({
  name: "active_connections",
  help: "Current active connections",
});
activeConnections.set(42);

// Histogram: distribution of values in buckets
// Use for: request duration, response size
const requestDuration = new Histogram({
  name: "http_request_duration_seconds",
  help: "Request duration in seconds",
  labelNames: ["method", "path"],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});
const end = requestDuration.labels("GET", "/api/orders").startTimer();
// ... handle request ...
end(); // Records the duration

// Summary: similar to histogram but calculates percentiles on the client
const requestDurationSummary = new Summary({
  name: "http_request_duration_summary",
  help: "Request duration percentiles",
  percentiles: [0.5, 0.9, 0.95, 0.99],
});
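Histogram buckets stay cheap to aggregate because percentiles are estimated at query time; this is what PromQL's histogram_quantile function does. A simplified sketch of that estimate, assuming cumulative bucket counts and linear interpolation within the target bucket:

```typescript
// Cumulative histogram: `count` is the number of observations <= `le`
interface Bucket {
  le: number;
  count: number;
}

// Estimate the q-th quantile by finding the bucket containing the
// target rank and interpolating linearly between its bounds
// (the same idea behind PromQL's histogram_quantile)
function histogramQuantile(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].count;
  const rank = q * total; // rank of the target observation
  let prevLe = 0;
  let prevCount = 0;
  for (const b of buckets) {
    if (b.count >= rank) {
      const fraction = (rank - prevCount) / (b.count - prevCount);
      return prevLe + (b.le - prevLe) * fraction;
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return prevLe;
}

// With buckets [<=0.1s: 50 reqs, <=0.5s: 90, <=1s: 100]:
const latencyBuckets: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 100 },
];
histogramQuantile(0.95, latencyBuckets); // estimated p95 = 0.75s
```

The estimate is only as precise as the bucket boundaries, which is why choosing buckets around your SLO thresholds matters.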

3. Traces

Distributed traces follow a request as it flows through multiple services. Each trace is composed of spans, where each span represents a unit of work.

// Trace structure
interface Span {
  traceId: string;     // Unique ID for the entire request
  spanId: string;      // Unique ID for this span
  parentSpanId?: string; // Parent span (undefined for root span)
  operationName: string;
  serviceName: string;
  startTime: number;
  duration: number;    // milliseconds
  status: "OK" | "ERROR";
  tags: Record<string, string>;
  logs: SpanLog[];
}

// Example: API request -> Order Service -> Payment Service -> Database
// Trace: traceId = "abc"
//   Span 1: API Gateway (root span)
//     Span 2: Order Service - createOrder
//       Span 3: Payment Service - processPayment
//         Span 4: Stripe API call
//       Span 5: Database - insertOrder

// Express middleware to propagate trace context
// (generateTraceId, generateSpanId, and recordSpan are assumed helpers)
function tracingMiddleware(req: Request, res: Response, next: NextFunction) {
  const traceId = (req.headers["x-trace-id"] as string) ?? generateTraceId();
  const parentSpanId = req.headers["x-span-id"] as string | undefined;
  const spanId = generateSpanId();

  // Attach to the request (declare `traceContext` via Express type
  // augmentation in real code)
  req.traceContext = { traceId, spanId, parentSpanId };

  // Propagate to downstream calls
  res.setHeader("x-trace-id", traceId);

  const startTime = Date.now();
  res.on("finish", () => {
    recordSpan({
      traceId,
      spanId,
      parentSpanId,
      operationName: `${req.method} ${req.path}`,
      serviceName: "api-gateway",
      startTime,
      duration: Date.now() - startTime,
      status: res.statusCode < 400 ? "OK" : "ERROR",
      tags: { "http.method": req.method, "http.status": String(res.statusCode) },
      logs: [],
    });
  });

  next();
}
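Once spans are recorded, a trace viewer reassembles the flat list into the tree shown earlier by following parentSpanId references. A minimal sketch of that reconstruction (field names follow the Span interface above; the span IDs and operation names are illustrative):

```typescript
interface SpanNode {
  spanId: string;
  parentSpanId?: string;
  operationName: string;
  children: SpanNode[];
}

// Group a flat list of collected spans into a tree; the span whose
// parent is missing (or outside this trace) is the root
function buildTraceTree(
  spans: { spanId: string; parentSpanId?: string; operationName: string }[]
): SpanNode | undefined {
  const nodes = new Map<string, SpanNode>();
  for (const s of spans) nodes.set(s.spanId, { ...s, children: [] });
  let root: SpanNode | undefined;
  for (const node of nodes.values()) {
    const parent = node.parentSpanId ? nodes.get(node.parentSpanId) : undefined;
    if (parent) parent.children.push(node);
    else root = node;
  }
  return root;
}

// Rebuilding the earlier example trace:
const tree = buildTraceTree([
  { spanId: "1", operationName: "GET /api/orders" },
  { spanId: "2", parentSpanId: "1", operationName: "createOrder" },
  { spanId: "3", parentSpanId: "2", operationName: "processPayment" },
  { spanId: "4", parentSpanId: "3", operationName: "stripe.charge" },
  { spanId: "5", parentSpanId: "2", operationName: "insertOrder" },
]);
// The root is the API Gateway span; createOrder has two children:
// processPayment and insertOrder
```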

Tools Landscape

| Tool          | Pillar        | Purpose                                               |
|---------------|---------------|-------------------------------------------------------|
| Prometheus    | Metrics       | Time-series metrics collection and querying (PromQL)  |
| Grafana       | Visualization | Dashboards for metrics, logs, and traces              |
| ELK Stack     | Logs          | Elasticsearch + Logstash + Kibana for log aggregation |
| Loki          | Logs          | Lightweight log aggregation by Grafana Labs           |
| Jaeger        | Traces        | Distributed tracing (originally by Uber)              |
| OpenTelemetry | All three     | Vendor-neutral standard for telemetry collection      |
| Datadog       | All three     | Commercial all-in-one observability platform          |

Alerting Strategies

Effective Alerting Principles

  • Alert on symptoms, not causes. Alert on "API error rate > 1%" not "CPU > 80%"
  • Every alert must be actionable. If the on-call person cannot do anything about it, it should not page them
  • Use severity levels: P1 (pages immediately), P2 (Slack notification), P3 (ticket created)
  • Include runbooks. Every alert should link to a document explaining what to check and how to remediate
  • Avoid alert fatigue. Too many alerts means engineers start ignoring them. Regularly review and prune alerts
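As an illustration of these principles, an alert rule can encode the symptom threshold, severity level, and runbook link together. A sketch (the rule shape and the URL are illustrative, not from any specific alerting tool):

```typescript
type Severity = "P1" | "P2" | "P3";

interface AlertRule {
  name: string;
  threshold: number;  // symptom threshold, e.g. 0.01 = 1% error rate
  severity: Severity; // P1 pages, P2 notifies, P3 files a ticket
  runbookUrl: string; // every alert links to remediation steps
}

// Evaluate a symptom-based rule against a window of request counts
function evaluateRule(rule: AlertRule, errors: number, total: number) {
  const errorRate = total === 0 ? 0 : errors / total;
  return {
    firing: errorRate > rule.threshold,
    errorRate,
    severity: rule.severity,
    runbook: rule.runbookUrl,
  };
}

const apiErrorRule: AlertRule = {
  name: "api-error-rate",
  threshold: 0.01,
  severity: "P1",
  runbookUrl: "https://wiki.example.com/runbooks/api-errors", // hypothetical
};
evaluateRule(apiErrorRule, 25, 1000); // firing: true (2.5% > 1%)
```

Note the rule alerts on the error rate (a symptom users feel), not on CPU or memory (causes that may or may not matter).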

SLIs, SLOs, and SLAs

These three concepts form a hierarchy for defining and measuring service reliability.

| Concept | Definition | Example |
|---------|------------|---------|
| SLI (Service Level Indicator) | A quantitative measure of service quality | 99.2% of requests complete in under 200ms |
| SLO (Service Level Objective) | A target value for an SLI | 99.9% of requests should complete in under 200ms |
| SLA (Service Level Agreement) | A contract with consequences for missing the SLO | If availability drops below 99.9%, customer gets credits |

// Error budget calculation
interface SLOConfig {
  name: string;
  target: number;     // e.g., 0.999 (99.9%)
  window: number;     // Rolling window in days (e.g., 30)
}

function calculateErrorBudget(slo: SLOConfig, currentSLI: number) {
  const totalMinutesInWindow = slo.window * 24 * 60;
  const allowedDowntimeMinutes = totalMinutesInWindow * (1 - slo.target);
  const usedDowntimeMinutes = totalMinutesInWindow * (1 - currentSLI);
  const remainingBudgetMinutes = allowedDowntimeMinutes - usedDowntimeMinutes;
  const budgetRemainingPercent = (remainingBudgetMinutes / allowedDowntimeMinutes) * 100;

  return {
    totalBudgetMinutes: allowedDowntimeMinutes,      // 43.2 min for 99.9% over 30 days
    usedBudgetMinutes: usedDowntimeMinutes,
    remainingBudgetMinutes: remainingBudgetMinutes,
    budgetRemainingPercent: budgetRemainingPercent,
    burnRate: usedDowntimeMinutes / allowedDowntimeMinutes,
  };
}

// Example: 99.9% SLO over 30 days
// Total budget: 30 * 24 * 60 * 0.001 = 43.2 minutes of downtime allowed
// If current SLI = 99.85%, used 64.8 minutes, budget is EXHAUSTED
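The burn rate computed above drives SLO-based alerting. A sketch of the multi-window burn-rate check popularized by the Google SRE workbook, assuming error ratios per window are already available; 14.4 is the burn rate that exhausts a 30-day budget in roughly two days:

```typescript
// Burn rate: how fast the error budget is being consumed.
// 1.0 means the budget lasts exactly the SLO window.
function burnRate(errorRatio: number, sloTarget: number): number {
  return errorRatio / (1 - sloTarget);
}

// Page only when both a long and a short window exceed the threshold,
// so the alert fires quickly but stops once the problem is fixed
function shouldPage(
  longWindowErrorRatio: number,  // e.g. over 1 hour
  shortWindowErrorRatio: number, // e.g. over 5 minutes
  sloTarget = 0.999,
  threshold = 14.4
): boolean {
  return (
    burnRate(longWindowErrorRatio, sloTarget) > threshold &&
    burnRate(shortWindowErrorRatio, sloTarget) > threshold
  );
}

shouldPage(0.0216, 0.03);   // true: burning ~21.6x faster than budgeted
shouldPage(0.0216, 0.0005); // false: the short window has already recovered
```

Pairing windows this way avoids both slow detection (long window alone) and flappy alerts (short window alone).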

Dashboard Design

Effective Dashboard Principles

  • USE method (for infrastructure): Utilization, Saturation, and Errors for each resource (CPU, memory, disk, network)
  • RED method (for services): Rate (requests/sec), Errors (error rate), Duration (latency percentiles)
  • Layer your dashboards: Top-level overview with drill-down to specific services, then specific endpoints
  • Show percentiles, not averages. p50, p95, p99 latencies reveal the experience of tail users that averages hide
  • Include context. Show deployment markers, alert thresholds, and SLO targets on graphs
  • Time range consistency. All panels on a dashboard should show the same time range
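To make the RED method concrete, here is a sketch that computes Rate, Errors, and Duration percentiles from a window of request samples. The nearest-rank percentile is a simplification; production systems derive these from histograms as shown earlier:

```typescript
interface RequestSample {
  durationMs: number;
  isError: boolean;
}

// Nearest-rank percentile over a sorted array
function percentile(sorted: number[], p: number): number {
  const idx = Math.max(
    0,
    Math.min(sorted.length - 1, Math.ceil(p * sorted.length) - 1)
  );
  return sorted[idx];
}

// Rate, Errors, Duration for one service over one window
function redMetrics(samples: RequestSample[], windowSeconds: number) {
  const durations = samples.map((s) => s.durationMs).sort((a, b) => a - b);
  return {
    rate: samples.length / windowSeconds, // requests/sec
    errorRate: samples.filter((s) => s.isError).length / samples.length,
    p50: percentile(durations, 0.5),
    p95: percentile(durations, 0.95),
    p99: percentile(durations, 0.99),
  };
}

// 10 requests over a 10-second window, one of them failing:
const windowSamples: RequestSample[] = Array.from({ length: 10 }, (_, i) => ({
  durationMs: (i + 1) * 10,
  isError: i === 9,
}));
redMetrics(windowSamples, 10); // rate: 1 req/s, errorRate: 0.1, p50: 50, p95: 100
```

Note that p95 (100ms) is double p50 (50ms) here; an average would report 55ms and hide that tail entirely.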

Continue Learning