TechLead
Lesson 28 of 30
6 min read
System Design

Monitoring and Observability

Master monitoring and observability with the three pillars: logs, metrics, and traces. Learn about Prometheus, Grafana, alerting, SLOs, and dashboards.

Monitoring tells you when something is broken. Observability tells you why. As systems grow in complexity with microservices, containers, and distributed architectures, traditional monitoring is insufficient. Observability is the ability to understand the internal state of your system by examining its external outputs: logs, metrics, and traces.

Monitoring vs. Observability

| Aspect   | Monitoring                       | Observability                              |
|----------|----------------------------------|--------------------------------------------|
| Focus    | Known failure modes              | Unknown unknowns                           |
| Approach | Predefined dashboards and alerts | Ad-hoc exploration and correlation         |
| Question | "Is the system healthy?"         | "Why is this request slow?"                |
| Data     | Aggregated metrics               | High-cardinality, high-dimensionality data |

The Three Pillars

1. Logs

Logs are discrete, timestamped records of events. They provide detailed context about what happened at a specific point in time.

// Structured logging (JSON format)
interface LogEntry {
  timestamp: string;
  level: "DEBUG" | "INFO" | "WARN" | "ERROR";
  message: string;
  service: string;
  traceId?: string;
  spanId?: string;
  userId?: string;
  requestId?: string;
  duration?: number;
  error?: {
    message: string;
    stack: string;
    code: string;
  };
  [key: string]: unknown; // Additional context
}

// Good logging practices
class Logger {
  private context: Record<string, unknown>;

  constructor(service: string) {
    this.context = { service };
  }

  withContext(ctx: Record<string, unknown>): Logger {
    const logger = new Logger(this.context.service as string);
    logger.context = { ...this.context, ...ctx };
    return logger;
  }

  info(message: string, data?: Record<string, unknown>): void {
    console.log(JSON.stringify({
      timestamp: new Date().toISOString(),
      level: "INFO",
      message,
      ...this.context,
      ...data,
    }));
  }

  error(message: string, error: Error, data?: Record<string, unknown>): void {
    console.error(JSON.stringify({
      timestamp: new Date().toISOString(),
      level: "ERROR",
      message,
      error: { message: error.message, stack: error.stack, name: error.name },
      ...this.context,
      ...data,
    }));
  }
}

// Usage
const logger = new Logger("order-service");
const reqLogger = logger.withContext({ requestId: "req_123", userId: "user_456" });
reqLogger.info("Order created", { orderId: "ord_789", amount: 99.99 });
// Output: {"timestamp":"2025-...","level":"INFO","message":"Order created","service":"order-service","requestId":"req_123","userId":"user_456","orderId":"ord_789","amount":99.99}

2. Metrics

Metrics are numerical measurements collected over time. They are aggregatable and efficient to store, making them ideal for dashboards, alerts, and trend analysis.

// Four types of metrics (following Prometheus conventions; the API
// shown matches the Node prom-client library)
import { Counter, Gauge, Histogram, Summary } from "prom-client";

// Counter: monotonically increasing value
// Use for: total requests, errors, bytes transferred
const httpRequestsTotal = new Counter({
  name: "http_requests_total",
  help: "Total HTTP requests",
  labelNames: ["method", "path", "status"],
});
httpRequestsTotal.labels("GET", "/api/orders", "200").inc();

// Gauge: value that goes up and down
// Use for: current connections, queue depth, temperature
const activeConnections = new Gauge({
  name: "active_connections",
  help: "Current active connections",
});
activeConnections.set(42);

// Histogram: distribution of values in buckets
// Use for: request duration, response size
const requestDuration = new Histogram({
  name: "http_request_duration_seconds",
  help: "Request duration in seconds",
  labelNames: ["method", "path"],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});
const end = requestDuration.labels("GET", "/api/orders").startTimer();
// ... handle request ...
end(); // Records the duration

// Summary: similar to histogram but calculates percentiles on the client
const requestDurationSummary = new Summary({
  name: "http_request_duration_summary",
  help: "Request duration percentiles",
  percentiles: [0.5, 0.9, 0.95, 0.99],
});
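Histogram buckets stay cheap to aggregate because percentiles are estimated at query time; this is what PromQL's histogram_quantile function does. A simplified sketch of that estimate, assuming cumulative bucket counts and linear interpolation within the target bucket:

```typescript
// Cumulative histogram: `count` is the number of observations <= `le`
interface Bucket {
  le: number;
  count: number;
}

// Estimate the q-th quantile by finding the bucket containing the
// target rank and interpolating linearly between its bounds
// (the same idea behind PromQL's histogram_quantile)
function histogramQuantile(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].count;
  const rank = q * total; // rank of the target observation
  let prevLe = 0;
  let prevCount = 0;
  for (const b of buckets) {
    if (b.count >= rank) {
      const fraction = (rank - prevCount) / (b.count - prevCount);
      return prevLe + (b.le - prevLe) * fraction;
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return prevLe;
}

// With buckets [<=0.1s: 50 reqs, <=0.5s: 90, <=1s: 100]:
const latencyBuckets: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 100 },
];
histogramQuantile(0.95, latencyBuckets); // estimated p95 = 0.75s
```

The estimate is only as precise as the bucket boundaries, which is why choosing buckets around your SLO thresholds matters.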

3. Traces

Distributed traces follow a request as it flows through multiple services. Each trace is composed of spans, where each span represents a unit of work.

// Trace structure
interface Span {
  traceId: string;     // Unique ID for the entire request
  spanId: string;      // Unique ID for this span
  parentSpanId?: string; // Parent span (undefined for root span)
  operationName: string;
  serviceName: string;
  startTime: number;
  duration: number;    // milliseconds
  status: "OK" | "ERROR";
  tags: Record<string, string>;
  logs: SpanLog[];
}

// Example: API request -> Order Service -> Payment Service -> Database
// Trace: traceId = "abc"
//   Span 1: API Gateway (root span)
//     Span 2: Order Service - createOrder
//       Span 3: Payment Service - processPayment
//         Span 4: Stripe API call
//       Span 5: Database - insertOrder

// Express middleware to propagate trace context
// (generateTraceId, generateSpanId, and recordSpan are assumed helpers)
function tracingMiddleware(req: Request, res: Response, next: NextFunction) {
  const traceId = (req.headers["x-trace-id"] as string) ?? generateTraceId();
  const parentSpanId = req.headers["x-span-id"] as string | undefined;
  const spanId = generateSpanId();

  // Attach to the request (declare `traceContext` via Express type
  // augmentation in real code)
  req.traceContext = { traceId, spanId, parentSpanId };

  // Propagate to downstream calls
  res.setHeader("x-trace-id", traceId);

  const startTime = Date.now();
  res.on("finish", () => {
    recordSpan({
      traceId,
      spanId,
      parentSpanId,
      operationName: `${req.method} ${req.path}`,
      serviceName: "api-gateway",
      startTime,
      duration: Date.now() - startTime,
      status: res.statusCode < 400 ? "OK" : "ERROR",
      tags: { "http.method": req.method, "http.status": String(res.statusCode) },
      logs: [],
    });
  });

  next();
}
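Once spans are recorded, a trace viewer reassembles the flat list into the tree shown earlier by following parentSpanId references. A minimal sketch of that reconstruction (field names follow the Span interface above; the span IDs and operation names are illustrative):

```typescript
interface SpanNode {
  spanId: string;
  parentSpanId?: string;
  operationName: string;
  children: SpanNode[];
}

// Group a flat list of collected spans into a tree; the span whose
// parent is missing (or outside this trace) is the root
function buildTraceTree(
  spans: { spanId: string; parentSpanId?: string; operationName: string }[]
): SpanNode | undefined {
  const nodes = new Map<string, SpanNode>();
  for (const s of spans) nodes.set(s.spanId, { ...s, children: [] });
  let root: SpanNode | undefined;
  for (const node of nodes.values()) {
    const parent = node.parentSpanId ? nodes.get(node.parentSpanId) : undefined;
    if (parent) parent.children.push(node);
    else root = node;
  }
  return root;
}

// Rebuilding the earlier example trace:
const tree = buildTraceTree([
  { spanId: "1", operationName: "GET /api/orders" },
  { spanId: "2", parentSpanId: "1", operationName: "createOrder" },
  { spanId: "3", parentSpanId: "2", operationName: "processPayment" },
  { spanId: "4", parentSpanId: "3", operationName: "stripe.charge" },
  { spanId: "5", parentSpanId: "2", operationName: "insertOrder" },
]);
// The root is the API Gateway span; createOrder has two children:
// processPayment and insertOrder
```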

Tools Landscape

| Tool          | Pillar        | Purpose                                               |
|---------------|---------------|-------------------------------------------------------|
| Prometheus    | Metrics       | Time-series metrics collection and querying (PromQL)  |
| Grafana       | Visualization | Dashboards for metrics, logs, and traces              |
| ELK Stack     | Logs          | Elasticsearch + Logstash + Kibana for log aggregation |
| Loki          | Logs          | Lightweight log aggregation by Grafana Labs           |
| Jaeger        | Traces        | Distributed tracing (originally by Uber)              |
| OpenTelemetry | All three     | Vendor-neutral standard for telemetry collection      |
| Datadog       | All three     | Commercial all-in-one observability platform          |

Alerting Strategies

Effective Alerting Principles

  • Alert on symptoms, not causes. Alert on "API error rate > 1%" not "CPU > 80%"
  • Every alert must be actionable. If the on-call person cannot do anything about it, it should not page them
  • Use severity levels: P1 (pages immediately), P2 (Slack notification), P3 (ticket created)
  • Include runbooks. Every alert should link to a document explaining what to check and how to remediate
  • Avoid alert fatigue. Too many alerts means engineers start ignoring them. Regularly review and prune alerts
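As an illustration of these principles, an alert rule can encode the symptom threshold, severity level, and runbook link together. A sketch (the rule shape and the URL are illustrative, not from any specific alerting tool):

```typescript
type Severity = "P1" | "P2" | "P3";

interface AlertRule {
  name: string;
  threshold: number;  // symptom threshold, e.g. 0.01 = 1% error rate
  severity: Severity; // P1 pages, P2 notifies, P3 files a ticket
  runbookUrl: string; // every alert links to remediation steps
}

// Evaluate a symptom-based rule against a window of request counts
function evaluateRule(rule: AlertRule, errors: number, total: number) {
  const errorRate = total === 0 ? 0 : errors / total;
  return {
    firing: errorRate > rule.threshold,
    errorRate,
    severity: rule.severity,
    runbook: rule.runbookUrl,
  };
}

const apiErrorRule: AlertRule = {
  name: "api-error-rate",
  threshold: 0.01,
  severity: "P1",
  runbookUrl: "https://wiki.example.com/runbooks/api-errors", // hypothetical
};
evaluateRule(apiErrorRule, 25, 1000); // firing: true (2.5% > 1%)
```

Note the rule alerts on the error rate (a symptom users feel), not on CPU or memory (causes that may or may not matter).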

SLIs, SLOs, and SLAs

These three concepts form a hierarchy for defining and measuring service reliability.

| Concept | Definition | Example |
|---------|------------|---------|
| SLI (Service Level Indicator) | A quantitative measure of service quality | 99.2% of requests complete in under 200ms |
| SLO (Service Level Objective) | A target value for an SLI | 99.9% of requests should complete in under 200ms |
| SLA (Service Level Agreement) | A contract with consequences for missing the SLO | If availability drops below 99.9%, customer gets credits |

// Error budget calculation
interface SLOConfig {
  name: string;
  target: number;     // e.g., 0.999 (99.9%)
  window: number;     // Rolling window in days (e.g., 30)
}

function calculateErrorBudget(slo: SLOConfig, currentSLI: number) {
  const totalMinutesInWindow = slo.window * 24 * 60;
  const allowedDowntimeMinutes = totalMinutesInWindow * (1 - slo.target);
  const usedDowntimeMinutes = totalMinutesInWindow * (1 - currentSLI);
  const remainingBudgetMinutes = allowedDowntimeMinutes - usedDowntimeMinutes;
  const budgetRemainingPercent = (remainingBudgetMinutes / allowedDowntimeMinutes) * 100;

  return {
    totalBudgetMinutes: allowedDowntimeMinutes,      // 43.2 min for 99.9% over 30 days
    usedBudgetMinutes: usedDowntimeMinutes,
    remainingBudgetMinutes: remainingBudgetMinutes,
    budgetRemainingPercent: budgetRemainingPercent,
    burnRate: usedDowntimeMinutes / allowedDowntimeMinutes,
  };
}

// Example: 99.9% SLO over 30 days
// Total budget: 30 * 24 * 60 * 0.001 = 43.2 minutes of downtime allowed
// If current SLI = 99.85%, used 64.8 minutes, budget is EXHAUSTED
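The burn rate computed above drives SLO-based alerting. A sketch of the multi-window burn-rate check popularized by the Google SRE workbook, assuming error ratios per window are already available; 14.4 is the burn rate that exhausts a 30-day budget in roughly two days:

```typescript
// Burn rate: how fast the error budget is being consumed.
// 1.0 means the budget lasts exactly the SLO window.
function burnRate(errorRatio: number, sloTarget: number): number {
  return errorRatio / (1 - sloTarget);
}

// Page only when both a long and a short window exceed the threshold,
// so the alert fires quickly but stops once the problem is fixed
function shouldPage(
  longWindowErrorRatio: number,  // e.g. over 1 hour
  shortWindowErrorRatio: number, // e.g. over 5 minutes
  sloTarget = 0.999,
  threshold = 14.4
): boolean {
  return (
    burnRate(longWindowErrorRatio, sloTarget) > threshold &&
    burnRate(shortWindowErrorRatio, sloTarget) > threshold
  );
}

shouldPage(0.0216, 0.03);   // true: burning ~21.6x faster than budgeted
shouldPage(0.0216, 0.0005); // false: the short window has already recovered
```

Pairing windows this way avoids both slow detection (long window alone) and flappy alerts (short window alone).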

Dashboard Design

Effective Dashboard Principles

  • USE method (for infrastructure): Utilization, Saturation, and Errors for each resource (CPU, memory, disk, network)
  • RED method (for services): Rate (requests/sec), Errors (error rate), Duration (latency percentiles)
  • Layer your dashboards: Top-level overview with drill-down to specific services, then specific endpoints
  • Show percentiles, not averages. p50, p95, p99 latencies reveal the experience of tail users that averages hide
  • Include context. Show deployment markers, alert thresholds, and SLO targets on graphs
  • Time range consistency. All panels on a dashboard should show the same time range
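To make the RED method concrete, here is a sketch that computes Rate, Errors, and Duration percentiles from a window of request samples. The nearest-rank percentile is a simplification; production systems derive these from histograms as shown earlier:

```typescript
interface RequestSample {
  durationMs: number;
  isError: boolean;
}

// Nearest-rank percentile over a sorted array
function percentile(sorted: number[], p: number): number {
  const idx = Math.max(
    0,
    Math.min(sorted.length - 1, Math.ceil(p * sorted.length) - 1)
  );
  return sorted[idx];
}

// Rate, Errors, Duration for one service over one window
function redMetrics(samples: RequestSample[], windowSeconds: number) {
  const durations = samples.map((s) => s.durationMs).sort((a, b) => a - b);
  return {
    rate: samples.length / windowSeconds, // requests/sec
    errorRate: samples.filter((s) => s.isError).length / samples.length,
    p50: percentile(durations, 0.5),
    p95: percentile(durations, 0.95),
    p99: percentile(durations, 0.99),
  };
}

// 10 requests over a 10-second window, one of them failing:
const windowSamples: RequestSample[] = Array.from({ length: 10 }, (_, i) => ({
  durationMs: (i + 1) * 10,
  isError: i === 9,
}));
redMetrics(windowSamples, 10); // rate: 1 req/s, errorRate: 0.1, p50: 50, p95: 100
```

Note that p95 (100ms) is double p50 (50ms) here; an average would report 55ms and hide that tail entirely.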

Continue Learning