Monitoring and Observability
Monitoring tells you when something is broken. Observability tells you why. As systems grow in complexity with microservices, containers, and distributed architectures, traditional monitoring is insufficient. Observability is the ability to understand the internal state of your system by examining its external outputs: logs, metrics, and traces.
Monitoring vs. Observability
| Aspect | Monitoring | Observability |
|---|---|---|
| Focus | Known failure modes | Unknown unknowns |
| Approach | Predefined dashboards and alerts | Ad-hoc exploration and correlation |
| Question | "Is the system healthy?" | "Why is this request slow?" |
| Data | Aggregated metrics | High-cardinality, high-dimensionality data |
The Three Pillars
1. Logs
Logs are discrete, timestamped records of events. They provide detailed context about what happened at a specific point in time.
// Structured logging (JSON format)
interface LogEntry {
  timestamp: string;
  level: "DEBUG" | "INFO" | "WARN" | "ERROR";
  message: string;
  service: string;
  traceId?: string;
  spanId?: string;
  userId?: string;
  requestId?: string;
  duration?: number;
  error?: {
    message: string;
    stack: string;
    code: string;
  };
  [key: string]: unknown; // Additional context
}
// Good logging practices
class Logger {
  private context: Record<string, unknown>;

  constructor(service: string) {
    this.context = { service };
  }

  withContext(ctx: Record<string, unknown>): Logger {
    const logger = new Logger(this.context.service as string);
    logger.context = { ...this.context, ...ctx };
    return logger;
  }

  info(message: string, data?: Record<string, unknown>): void {
    console.log(JSON.stringify({
      timestamp: new Date().toISOString(),
      level: "INFO",
      message,
      ...this.context,
      ...data,
    }));
  }

  error(message: string, error: Error, data?: Record<string, unknown>): void {
    console.error(JSON.stringify({
      timestamp: new Date().toISOString(),
      level: "ERROR",
      message,
      error: { message: error.message, stack: error.stack, name: error.name },
      ...this.context,
      ...data,
    }));
  }
}
// Usage
const logger = new Logger("order-service");
const reqLogger = logger.withContext({ requestId: "req_123", userId: "user_456" });
reqLogger.info("Order created", { orderId: "ord_789", amount: 99.99 });
// Output: {"timestamp":"2025-...","level":"INFO","message":"Order created","service":"order-service","requestId":"req_123","userId":"user_456","orderId":"ord_789","amount":99.99}
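In an HTTP service, the per-request context is usually attached once by middleware so that every log line emitted during a request carries the same identifiers. A minimal sketch, assuming an Express app and the Logger class above:
// Sketch: request-scoped logging middleware (assumes Express and the Logger above)
import { Request, Response, NextFunction } from "express";
import { randomUUID } from "node:crypto";

function requestLogging(baseLogger: Logger) {
  return (req: Request, res: Response, next: NextFunction) => {
    // Derive a child logger once; every log in this request shares the same IDs
    const reqLogger = baseLogger.withContext({
      requestId: randomUUID(),
      method: req.method,
      path: req.path,
    });
    (req as Request & { logger?: Logger }).logger = reqLogger;
    res.on("finish", () => {
      reqLogger.info("Request completed", { status: res.statusCode });
    });
    next();
  };
}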
2. Metrics
Metrics are numerical measurements collected over time. They are aggregatable and efficient to store, making them ideal for dashboards, alerts, and trend analysis.
// Four types of metrics (following Prometheus conventions).
// The constructors below match the prom-client Node.js library:
import { Counter, Gauge, Histogram, Summary } from "prom-client";

// Counter: monotonically increasing value
// Use for: total requests, errors, bytes transferred
const httpRequestsTotal = new Counter({
  name: "http_requests_total",
  help: "Total HTTP requests",
  labelNames: ["method", "path", "status"],
});
httpRequestsTotal.labels("GET", "/api/orders", "200").inc();

// Gauge: value that goes up and down
// Use for: current connections, queue depth, temperature
const activeConnections = new Gauge({
  name: "active_connections",
  help: "Current active connections",
});
activeConnections.set(42);

// Histogram: distribution of values in buckets
// Use for: request duration, response size
const requestDuration = new Histogram({
  name: "http_request_duration_seconds",
  help: "Request duration in seconds",
  labelNames: ["method", "path"],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});
const end = requestDuration.labels("GET", "/api/orders").startTimer();
// ... handle request ...
end(); // Records the duration

// Summary: similar to histogram, but calculates percentiles on the client
const requestDurationSummary = new Summary({
  name: "http_request_duration_summary",
  help: "Request duration percentiles",
  percentiles: [0.5, 0.9, 0.95, 0.99],
});
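For Prometheus to collect these values, the process must expose them over HTTP as a scrape target. A minimal sketch, assuming prom-client's default registry and an Express app (the port is arbitrary):
// Sketch: exposing the default prom-client registry for Prometheus to scrape
import express from "express";
import { register } from "prom-client";

const app = express();
app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", register.contentType);
  res.end(await register.metrics()); // Prometheus text exposition format
});
app.listen(9100); // Configure this address as a scrape target in Prometheus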
3. Traces
Distributed traces follow a request as it flows through multiple services. Each trace is composed of spans, where each span represents a unit of work.
// Trace structure
interface SpanLog {
  timestamp: number; // When the event occurred
  fields: Record<string, string>; // Event details attached to the span
}

interface Span {
  traceId: string; // Unique ID for the entire request
  spanId: string; // Unique ID for this span
  parentSpanId?: string; // Parent span (undefined for root span)
  operationName: string;
  serviceName: string;
  startTime: number;
  duration: number; // milliseconds
  status: "OK" | "ERROR";
  tags: Record<string, string>;
  logs: SpanLog[];
}

// Example: API request -> Order Service -> Payment Service -> Database
// Trace: traceId = "abc"
//   Span 1: API Gateway (root span)
//     Span 2: Order Service - createOrder
//       Span 3: Payment Service - processPayment
//         Span 4: Stripe API call
//       Span 5: Database - insertOrder

// Middleware to propagate trace context
// (generateTraceId, generateSpanId, and recordSpan are assumed helpers)
import { Request, Response, NextFunction } from "express";

interface TraceContext {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
}

function tracingMiddleware(req: Request, res: Response, next: NextFunction) {
  const traceId = (req.headers["x-trace-id"] as string) || generateTraceId();
  const parentSpanId = req.headers["x-span-id"] as string | undefined;
  const spanId = generateSpanId();

  // Attach to the request so handlers can forward it on outgoing calls
  (req as Request & { traceContext?: TraceContext }).traceContext = { traceId, spanId, parentSpanId };

  // Echo the trace ID back to the caller for correlation
  res.setHeader("x-trace-id", traceId);

  const startTime = Date.now();
  res.on("finish", () => {
    recordSpan({
      traceId,
      spanId,
      parentSpanId,
      operationName: `${req.method} ${req.path}`,
      serviceName: "api-gateway",
      startTime,
      duration: Date.now() - startTime,
      status: res.statusCode < 400 ? "OK" : "ERROR",
      tags: { "http.method": req.method, "http.status": String(res.statusCode) },
      logs: [],
    });
  });

  next();
}
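Echoing the header on the response only helps the caller correlate; to keep the trace intact, outgoing calls made while handling the request must forward the context as request headers. A minimal sketch, assuming the same x-trace-id / x-span-id convention as the middleware above (the payment-service URL is hypothetical):
// Sketch: forwarding trace context on an outgoing call (URL is hypothetical)
async function chargePayment(ctx: TraceContext, payload: unknown): Promise<Response> {
  return fetch("http://payment-service/api/charge", {
    method: "POST",
    headers: {
      "content-type": "application/json",
      "x-trace-id": ctx.traceId, // Same trace ID across every hop
      "x-span-id": ctx.spanId,   // Becomes parentSpanId in the next service
    },
    body: JSON.stringify(payload),
  });
}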
Tools Landscape
| Tool | Pillar | Purpose |
|---|---|---|
| Prometheus | Metrics | Time-series metrics collection and querying (PromQL) |
| Grafana | Visualization | Dashboards for metrics, logs, and traces |
| ELK Stack | Logs | Elasticsearch + Logstash + Kibana for log aggregation |
| Loki | Logs | Lightweight log aggregation by Grafana Labs |
| Jaeger | Traces | Distributed tracing (originally by Uber) |
| OpenTelemetry | All three | Vendor-neutral standard for telemetry collection |
| Datadog | All three | Commercial all-in-one observability platform |
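In practice you rarely hand-roll span propagation as in the middleware above; OpenTelemetry standardizes it. A minimal sketch of manual span creation with the @opentelemetry/api package (the tracer name and attribute are illustrative; exporter and propagation setup live in SDK configuration, not shown here):
// Sketch: creating a span with the OpenTelemetry API (SDK setup not shown)
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("order-service");

async function createOrder(orderId: string) {
  return tracer.startActiveSpan("createOrder", async (span) => {
    try {
      span.setAttribute("order.id", orderId);
      // ... business logic; spans started here become children automatically ...
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end(); // Always end the span, even on error
    }
  });
}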
Alerting Strategies
Effective Alerting Principles
- Alert on symptoms, not causes. Alert on "API error rate > 1%" not "CPU > 80%"
- Every alert must be actionable. If the on-call person cannot do anything about it, it should not page them
- Use severity levels: P1 (pages immediately), P2 (Slack notification), P3 (ticket created)
- Include runbooks. Every alert should link to a document explaining what to check and how to remediate (a sketch of such an alert definition follows this list)
- Avoid alert fatigue. Too many alerts means engineers start ignoring them. Regularly review and prune alerts
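A minimal sketch of these principles in an alert definition. The shape is illustrative rather than any particular tool's schema; the query assumes the http_requests_total counter from the metrics section, and the runbook URL is hypothetical:
// Sketch: a symptom-based alert definition (illustrative shape, not a real tool's schema)
interface AlertRule {
  name: string;
  expr: string; // Query over a symptom, not a cause
  for: string; // How long the condition must hold before firing
  severity: "P1" | "P2" | "P3"; // P1 pages, P2 notifies Slack, P3 opens a ticket
  runbookUrl: string; // Every alert links to remediation steps
}

const highErrorRate: AlertRule = {
  name: "HighApiErrorRate",
  expr: 'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01',
  for: "5m",
  severity: "P1",
  runbookUrl: "https://runbooks.example.com/high-api-error-rate", // Hypothetical URL
};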
SLIs, SLOs, and SLAs
These three concepts form a hierarchy for defining and measuring service reliability.
| Concept | Definition | Example |
|---|---|---|
| SLI (Service Level Indicator) | A quantitative measure of service quality | 99.2% of requests complete in under 200ms |
| SLO (Service Level Objective) | A target value for an SLI | 99.9% of requests should complete in under 200ms |
| SLA (Service Level Agreement) | A contract with consequences for missing the SLO | If availability drops below 99.9%, customer gets credits |
// Error budget calculation
interface SLOConfig {
  name: string;
  target: number; // e.g., 0.999 (99.9%)
  window: number; // Rolling window in days (e.g., 30)
}

function calculateErrorBudget(slo: SLOConfig, currentSLI: number) {
  const totalMinutesInWindow = slo.window * 24 * 60;
  const allowedDowntimeMinutes = totalMinutesInWindow * (1 - slo.target);
  const usedDowntimeMinutes = totalMinutesInWindow * (1 - currentSLI);
  const remainingBudgetMinutes = allowedDowntimeMinutes - usedDowntimeMinutes;
  const budgetRemainingPercent = (remainingBudgetMinutes / allowedDowntimeMinutes) * 100;
  return {
    totalBudgetMinutes: allowedDowntimeMinutes, // 43.2 min for 99.9% over 30 days
    usedBudgetMinutes: usedDowntimeMinutes,
    remainingBudgetMinutes,
    budgetRemainingPercent,
    burnRate: usedDowntimeMinutes / allowedDowntimeMinutes,
  };
}

// Example: 99.9% SLO over 30 days
// Total budget: 30 * 24 * 60 * 0.001 = 43.2 minutes of downtime allowed
// If current SLI = 99.85%, 64.8 minutes are used and the budget is EXHAUSTED
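Running the worked example through the function confirms the arithmetic (the SLO name is illustrative):
// Usage: the 99.9% / 30-day example above, with a current SLI of 99.85%
const budget = calculateErrorBudget(
  { name: "api-availability", target: 0.999, window: 30 },
  0.9985,
);
console.log(budget.usedBudgetMinutes); // ≈ 64.8
console.log(budget.remainingBudgetMinutes); // ≈ -21.6 (exhausted)
console.log(budget.burnRate); // ≈ 1.5: consuming budget 1.5x faster than allowed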
Dashboard Design
Effective Dashboard Principles
- USE method (for infrastructure): Utilization, Saturation, Errors for each resource (CPU, memory, disk, network)
- RED method (for services): Rate (requests/sec), Errors (error rate), Duration (latency percentiles); see the sketch after this list
- Layer your dashboards: Top-level overview with drill-down to specific services, then specific endpoints
- Show percentiles, not averages. p50, p95, p99 latencies reveal the experience of tail users that averages hide
- Include context. Show deployment markers, alert thresholds, and SLO targets on graphs
- Time range consistency. All panels on a dashboard should show the same time range
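A minimal sketch of RED-method instrumentation, reusing the prom-client counter and histogram from the metrics section in a single Express middleware:
// Sketch: one middleware records all three RED signals per request
// (reuses httpRequestsTotal and requestDuration from the metrics section)
import { Request, Response, NextFunction } from "express";

function redMetrics(req: Request, res: Response, next: NextFunction) {
  // Caution: raw req.path can explode label cardinality; prefer the route template
  const end = requestDuration.labels(req.method, req.path).startTimer(); // Duration
  res.on("finish", () => {
    end();
    // Rate and Errors both derive from this counter in PromQL:
    // rate(...) for traffic, filtered by status=~"5.." for errors
    httpRequestsTotal.labels(req.method, req.path, String(res.statusCode)).inc();
  });
  next();
}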