Incident Management for Leaders

Lead incident response effectively with structured processes, blameless postmortems, and SRE practices

Why Incident Management Matters

Every software system will eventually fail. The question is not whether incidents will happen, but how prepared your team is to handle them. As a Tech Lead, your role during incidents extends beyond technical troubleshooting: you set the tone for how the team responds under pressure, ensure communication flows to the right people, and drive the postmortem process that prevents recurrence.

Organizations with mature incident management practices tend to have lower mean time to resolution (MTTR), less engineer burnout, and stronger customer trust. Those without structured processes are more prone to cascading failures, blame-driven cultures, and the same incidents recurring over and over.

Incident Severity Levels

  • SEV-1 (Critical): Service is completely down or data is being lost. All hands on deck. Customer-facing impact is severe.
  • SEV-2 (Major): Significant functionality is degraded. Subset of users affected. Requires immediate attention during business hours.
  • SEV-3 (Minor): Minor functionality is impaired but workarounds exist. Can be addressed on the next business day.
  • SEV-4 (Low): Cosmetic or minor issues with no significant user impact. Tracked and fixed in normal sprint work.
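
The levels above can be encoded so that tooling and people apply the same definitions. The following is a minimal sketch, not a standard: the `classify` function, its three boolean inputs, and the `RESPONSE` table are hypothetical names chosen for illustration, and the response text is taken directly from the list above.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # Critical: service down or data being lost
    SEV2 = 2  # Major: significant functionality degraded
    SEV3 = 3  # Minor: impaired, but workarounds exist
    SEV4 = 4  # Low: cosmetic, no significant user impact


# Expected response per level, paraphrased from the definitions above.
RESPONSE = {
    Severity.SEV1: "All hands on deck",
    Severity.SEV2: "Immediate attention during business hours",
    Severity.SEV3: "Next business day",
    Severity.SEV4: "Normal sprint work",
}


def classify(*, outage_or_data_loss: bool, significant_degradation: bool,
             functionality_impaired: bool) -> Severity:
    """Pick the most severe level whose condition holds (hypothetical rules)."""
    if outage_or_data_loss:
        return Severity.SEV1
    if significant_degradation:
        return Severity.SEV2
    if functionality_impaired:
        return Severity.SEV3
    return Severity.SEV4
```

Checking the conditions from most to least severe means an incident that matches several descriptions is always assigned the higher severity, which is usually the safer default when declaring.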

Incident Response Process

Phase 1: Detection and Alerting

Incidents should be detected by monitoring systems, not by customer complaints. Invest in:

  • Application performance monitoring (APM) with alerting thresholds
  • Health check endpoints that verify critical dependencies
  • Error rate monitoring with automatic escalation
  • Synthetic monitoring (automated tests running against production)
  • Customer-facing status page that updates automatically
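
To make "error rate monitoring" concrete, here is a small sketch of a sliding-window check. It assumes HTTP status codes are the signal and that a 5xx counts as an error; the class name `ErrorRateMonitor` and the default window, threshold, and minimum sample count are illustrative assumptions, not recommended values. In practice an APM or metrics platform would normally do this for you.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the share of 5xx responses over the most recent requests."""

    def __init__(self, window_size: int = 100, threshold: float = 0.05,
                 min_samples: int = 20):
        # deque with maxlen drops the oldest entry automatically.
        self._results = deque(maxlen=window_size)
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, status_code: int) -> None:
        # Store True for server errors, False otherwise.
        self._results.append(status_code >= 500)

    def error_rate(self) -> float:
        if not self._results:
            return 0.0
        return sum(self._results) / len(self._results)

    def should_alert(self) -> bool:
        # Require a minimum sample count so one early failure does not page.
        enough_data = len(self._results) >= self.min_samples
        return enough_data and self.error_rate() > self.threshold
```

The minimum sample count is a simple guard against noisy alerts on low traffic; a real setup would likely also require the condition to hold for some duration before escalating.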

Phase 2: Triage and Response

Once an incident is declared, immediately establish structure:

## Incident Response Roles

Incident Commander (IC):
- Owns the overall incident response
- Coordinates between responders
- Makes decisions about escalation
- Communicates status updates

Technical Lead:
- Leads the technical investigation
- Directs debugging efforts
- Proposes and implements fixes

Communications Lead:
- Updates the status page
- Notifies affected customers
- Keeps stakeholders informed
- Manages the incident Slack channel

Scribe:
- Documents the timeline of events
- Records decisions and actions taken
- Captures information for the postmortem
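
One way to keep these roles from being forgotten under pressure is to model them explicitly in whatever tooling declares the incident. The sketch below is an assumption about how that might look: `Incident`, `REQUIRED_ROLES`, `assign`, `log`, and `unfilled_roles` are hypothetical names, not part of any particular incident tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional, Tuple

# The four roles described above.
REQUIRED_ROLES = (
    "Incident Commander",
    "Technical Lead",
    "Communications Lead",
    "Scribe",
)


@dataclass
class Incident:
    title: str
    severity: str
    roles: Dict[str, str] = field(default_factory=dict)
    timeline: List[Tuple[datetime, str]] = field(default_factory=list)

    def log(self, note: str, at: Optional[datetime] = None) -> None:
        """Scribe entry: append a timestamped note (UTC by default)."""
        self.timeline.append((at or datetime.now(timezone.utc), note))

    def assign(self, role: str, person: str) -> None:
        if role not in REQUIRED_ROLES:
            raise ValueError(f"unknown role: {role}")
        self.roles[role] = person
        # Role changes are part of the timeline the postmortem will need.
        self.log(f"{person} assigned as {role}")

    def unfilled_roles(self) -> List[str]:
        return [role for role in REQUIRED_ROLES if role not in self.roles]
```

Calling `unfilled_roles()` right after declaration gives the Incident Commander a quick checklist of who still needs to be pulled in. On a small team one person may hold more than one role, which this structure allows.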

Phase 3: Mitigation and Resolution

The priority during an active incident is to mitigate first, root-cause later. Restoring service is more important than understanding why it broke. Common mitigation strategies include:

  • Rolling back the most recent deployment
  • Scaling up infrastructure to handle increased load
  • Using feature flags to turn off problematic functionality
  • Failing over to a backup system or region
  • Applying a temporary hotfix while investigating the root cause
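
As an illustration of the feature-flag strategy, the sketch below shows a flag acting as a kill switch around a new code path. Everything here is hypothetical: the `FeatureFlags` class is a toy in-memory store, and `order_summary`, `load_recommendations`, and the flag name `order_recommendations` are invented for the example. A production system would typically read flags from a shared flag service so that a change takes effect across all instances without a deploy.

```python
class FeatureFlags:
    """Toy in-memory flag store used only for illustration."""

    def __init__(self, initial=None):
        self._flags = dict(initial or {})

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so new code paths are opt-in.
        return self._flags.get(name, False)

    def disable(self, name: str) -> None:
        self._flags[name] = False


def load_recommendations(order_id: int) -> list:
    # Stand-in for the new, potentially problematic feature.
    return [f"item-for-order-{order_id}"]


def order_summary(flags: FeatureFlags, order_id: int) -> dict:
    summary = {"order_id": order_id}
    # The new feature only runs while its flag is on; turning the flag
    # off restores the previous behavior without a rollback.
    if flags.is_enabled("order_recommendations"):
        summary["recommendations"] = load_recommendations(order_id)
    return summary
```

The design point is that the old behavior stays intact behind the flag, so mitigation is a configuration change rather than a code change made under pressure.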

Blameless Postmortems

The postmortem is where the real value of incident management lives. A blameless postmortem focuses on understanding what happened and improving systems, not on punishing individuals. The philosophy is simple: humans make mistakes; systems should be designed to prevent those mistakes from causing outages.

## Postmortem Template

### Incident Summary
- Date/Time: 2026-03-10, 14:32 - 15:47 UTC
- Duration: 1 hour 15 minutes
- Severity: SEV-2
- Impact: 40% of API requests returned 500 errors

### Timeline
14:32 - Monitoring alerts fire for elevated 5xx rate
14:35 - On-call engineer acknowledges alert
14:40 - Incident declared, IC assigned
14:45 - Proximate cause identified: database connection pool exhausted
14:50 - Attempted mitigation: restart application pods
14:55 - Restart did not resolve; pool fills up again in minutes
15:10 - Root cause identified: new feature with N+1 query pattern
15:20 - Feature flag disabled for the new feature
15:30 - Error rates return to normal
15:47 - Incident resolved, monitoring confirmed stable

### Root Cause
A new feature deployed at 13:00 contained an N+1 query that
generated 500 database queries per API request instead of 2.
Under normal traffic, this exhausted the connection pool.

### Contributing Factors
- No load testing was performed before deployment
- Database connection pool metrics were not monitored
- Code review did not catch the N+1 pattern

### Action Items
1. [P0] Add database connection pool monitoring and alerting
2. [P0] Fix the N+1 query in the new feature
3. [P1] Add N+1 detection to the CI pipeline (for example, a per-request query-count check in integration tests)
4. [P1] Create load testing requirements for features
   touching high-traffic endpoints
5. [P2] Add database query count to code review checklist
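
For readers unfamiliar with the N+1 pattern named in this example postmortem, the sketch below reproduces it on a tiny in-memory SQLite database and shows how counting queries exposes it. The schema, the `CountingConnection` wrapper, and the function names are all hypothetical; the incident in the template does not specify the real tables or stack.

```python
import sqlite3


class CountingConnection:
    """Wraps a sqlite3 connection and counts statements executed through it."""

    def __init__(self, conn: sqlite3.Connection):
        self._conn = conn
        self.query_count = 0

    def execute(self, sql: str, params=()):
        self.query_count += 1
        return self._conn.execute(sql, params)


def make_db(n_orders: int = 5) -> CountingConnection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
    ids = range(1, n_orders + 1)
    conn.executemany("INSERT INTO customers VALUES (?, ?)",
                     [(i, f"customer-{i}") for i in ids])
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(i, i) for i in ids])
    conn.commit()
    # Setup statements above are not counted; only queries made via the wrapper are.
    return CountingConnection(conn)


def orders_n_plus_one(db: CountingConnection) -> list:
    # 1 query for the orders, then 1 more per order: N+1 in total.
    orders = db.execute("SELECT id, customer_id FROM orders ORDER BY id").fetchall()
    result = []
    for order_id, customer_id in orders:
        row = db.execute("SELECT name FROM customers WHERE id = ?",
                         (customer_id,)).fetchone()
        result.append((order_id, row[0]))
    return result


def orders_joined(db: CountingConnection) -> list:
    # Same data in a single query, regardless of how many orders exist.
    return db.execute(
        "SELECT o.id, c.name FROM orders o "
        "JOIN customers c ON c.id = o.customer_id ORDER BY o.id"
    ).fetchall()
```

With five orders, the first function issues six queries and the second issues one, and both return the same rows. A CI check along the lines of action item 3 could assert that `query_count` for a request stays under a fixed budget, so the count cannot silently grow with the size of the data.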

Making Postmortems Truly Blameless

  • Make systems, not people, the subject of each sentence: write "The deploy pipeline did not include load tests," not "John did not run load tests."
  • Focus on systemic improvements: every action item should change a system, not a person
  • The IC or Tech Lead should model blamelessness by sharing their own mistakes openly
  • Never use postmortems as negative evidence in performance reviews
  • Thank people who identify the root cause and contribute to the postmortem

On-Call Best Practices

  • Runbooks: Create step-by-step guides for every common alert so any on-call engineer can respond
  • Rotation fairness: Distribute on-call evenly. Compensate engineers for off-hours work.
  • Escalation paths: Clear documentation of who to escalate to and when
  • Handoff process: Structured handoff between rotations documenting active issues
  • Alert quality: Regularly review and tune alerts. An on-call rotation with 50 false alarms per week creates alert fatigue.
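
The alert-quality review can be partly automated. The sketch below assumes you can export a page history as `(alert_name, was_actionable)` pairs, where the on-call engineer marks each page as actionable or not; that export format, the function name `noisy_alerts`, and the default thresholds are assumptions made for illustration.

```python
from collections import Counter


def noisy_alerts(pages, max_false_rate: float = 0.5, min_pages: int = 5) -> list:
    """Return alert names whose share of non-actionable pages is too high.

    pages: iterable of (alert_name, was_actionable) tuples.
    """
    total = Counter()
    false_alarms = Counter()
    for name, was_actionable in pages:
        total[name] += 1
        if not was_actionable:
            false_alarms[name] += 1
    return sorted(
        name for name in total
        # Skip alerts with too few pages to judge fairly.
        if total[name] >= min_pages
        and false_alarms[name] / total[name] > max_false_rate
    )
```

Running this over each rotation's pages gives a short list of alerts to tune, rewrite, or delete at the handoff meeting.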

Summary

Incident management is a core leadership competency. Your job as a Tech Lead is to ensure the team has the processes, tools, and culture to detect incidents quickly, respond effectively, and learn systematically from every failure. Blameless postmortems are the key to continuous improvement: they transform incidents from crises into learning opportunities.
