Incident Management and Post-Mortems

Master incident severity levels, response processes, the incident commander role, blameless post-mortems, and root cause analysis techniques

Incident Severity Levels

Not all incidents are created equal. Severity levels determine the response urgency, communication cadence, and escalation path. Defining them up front lets responders act on an agreed playbook instead of debating urgency during an actual incident.

| Severity | Definition | Response Time | Communication | Example |
| --- | --- | --- | --- | --- |
| SEV-1 (Critical) | Complete outage or data breach | Immediate (within 15 min) | Every 30 min to exec team | Site is down for all users |
| SEV-2 (High) | Major feature broken, significant user impact | < 1 hour | Hourly to stakeholders | Checkout broken, users cannot purchase |
| SEV-3 (Medium) | Minor feature degraded, workaround exists | < 4 hours | Daily updates | Search returning slow results |
| SEV-4 (Low) | Cosmetic issue, minimal impact | Next business day | In next standup | Footer link broken on FAQ page |
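
One way to keep these definitions from drifting is to encode them as data that tooling can read. The following is a minimal sketch, not a prescribed implementation: the names `SeverityPolicy`, `SEVERITY_POLICIES`, and `nextUpdateDue` are hypothetical, and the minute values are simply the table's response times and update cadences converted to numbers.

```typescript
// Severity levels as listed in the table above.
type Severity = 'SEV-1' | 'SEV-2' | 'SEV-3' | 'SEV-4';

// Hypothetical policy shape; field names are assumptions for illustration.
interface SeverityPolicy {
  // Minutes within which a response must start; null = next business day.
  responseWithinMinutes: number | null;
  // Minutes between status updates; null = no fixed interval (next standup).
  updateEveryMinutes: number | null;
}

const SEVERITY_POLICIES: Record<Severity, SeverityPolicy> = {
  'SEV-1': { responseWithinMinutes: 15, updateEveryMinutes: 30 },
  'SEV-2': { responseWithinMinutes: 60, updateEveryMinutes: 60 },
  'SEV-3': { responseWithinMinutes: 240, updateEveryMinutes: 24 * 60 },
  'SEV-4': { responseWithinMinutes: null, updateEveryMinutes: null },
};

// Returns when the next status update is due, or null when the severity
// has no fixed update cadence.
function nextUpdateDue(severity: Severity, lastUpdate: Date): Date | null {
  const interval = SEVERITY_POLICIES[severity].updateEveryMinutes;
  if (interval === null) return null;
  return new Date(lastUpdate.getTime() + interval * 60 * 1000);
}
```

A chat bot or paging integration could call `nextUpdateDue` after each status post to remind the Incident Commander when the next update is expected.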

Incident Response Process

The Incident Response Lifecycle

  1. Detection: Alert fires (monitoring), user reports issue, or team member discovers the problem.
  2. Triage: Determine severity level. Assign an Incident Commander (IC).
  3. Communication: IC opens a war room (Slack channel/video call). Notifies stakeholders per severity.
  4. Investigation: Team diagnoses the root cause. Check recent deployments, infrastructure changes, external dependencies.
  5. Mitigation: Apply a fix or workaround. Priority is restoring service, not finding the perfect solution.
  6. Resolution: Confirm the fix is working. Monitor for recurrence.
  7. Post-Mortem: Within 48 hours, conduct a blameless retrospective. Document findings and action items.
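
The lifecycle above can be sketched as a small state machine that records when each phase was entered. This is an illustrative simplification: the strictly linear "one step forward" rule is an assumption (real incidents often loop between investigation and mitigation), and `IncidentState`, `startIncident`, and `advance` are hypothetical names.

```typescript
// Phases from the lifecycle list, in order.
const PHASES = [
  'detection',
  'triage',
  'communication',
  'investigation',
  'mitigation',
  'resolution',
  'post-mortem',
] as const;

type Phase = (typeof PHASES)[number];

interface IncidentState {
  phase: Phase;
  // Audit trail of phase changes, useful later for the post-mortem timeline.
  history: { phase: Phase; enteredAt: Date }[];
}

// Creates a new incident in the first phase.
function startIncident(now: Date): IncidentState {
  return { phase: 'detection', history: [{ phase: 'detection', enteredAt: now }] };
}

// Moves the incident to the next phase (assumed linear order) and
// throws if the lifecycle is already complete.
function advance(state: IncidentState, now: Date): IncidentState {
  const next = PHASES[PHASES.indexOf(state.phase) + 1];
  if (next === undefined) {
    throw new Error('Incident lifecycle is already complete');
  }
  return { phase: next, history: [...state.history, { phase: next, enteredAt: now }] };
}
```

Returning a new state object instead of mutating the old one keeps the history append-only, which makes it easier to reconstruct the sequence of events afterwards.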

Post-Mortem Template

// Post-Mortem Data Model
interface PostMortem {
  id: string;
  title: string;
  date: Date;
  severity: 'SEV-1' | 'SEV-2' | 'SEV-3' | 'SEV-4';
  incidentCommander: string;
  authors: string[];
  status: 'draft' | 'in-review' | 'published';

  summary: string;
  impact: {
    duration: string;
    usersAffected: number;
    revenueImpact: string;
    slaBreached: boolean;
  };

  // Chronological log of the incident; `time` is 'HH:MM' in UTC
  timeline: {
    time: string;
    event: string;
    actor: string;
  }[];

  rootCause: string;
  contributingFactors: string[];

  whatWentWell: string[];
  whatWentPoorly: string[];
  luckyBreaks: string[];

  actionItems: {
    description: string;
    type: 'prevent' | 'detect' | 'mitigate';
    owner: string;
    priority: 'P0' | 'P1' | 'P2';
    dueDate: Date;
    status: 'open' | 'in-progress' | 'done';
    jiraTicket: string;
  }[];

  fiveWhys: {
    question: string;
    answer: string;
  }[];
}

// Example post-mortem
const checkoutOutage: PostMortem = {
  id: 'PM-2024-007',
  title: 'Checkout Service Outage - March 15, 2024',
  date: new Date('2024-03-15'),
  severity: 'SEV-1',
  incidentCommander: 'Bob Chen',
  authors: ['Bob Chen', 'Alice Kim', 'Dave Park'],
  status: 'published',

  summary: 'The checkout service was unavailable for 47 minutes due to a database connection pool exhaustion caused by a missing index on the orders table after a schema migration.',

  impact: {
    duration: '47 minutes (14:23 - 15:10 UTC)',
    usersAffected: 12500,
    revenueImpact: 'Estimated $85,000 in lost orders',
    slaBreached: true
  },

  timeline: [
    { time: '14:15', event: 'Schema migration deployed with new column on orders table', actor: 'CI/CD pipeline' },
    { time: '14:20', event: 'Database CPU spikes to 95%', actor: 'System' },
    { time: '14:23', event: 'PagerDuty alert: Checkout API p99 latency > 10s', actor: 'Monitoring' },
    { time: '14:25', event: 'Incident Commander (Bob) opens war room', actor: 'Bob' },
    { time: '14:30', event: 'Identified database as bottleneck via APM', actor: 'Alice' },
    { time: '14:35', event: 'Found missing index on orders.user_id after migration', actor: 'Dave' },
    { time: '14:40', event: 'Attempted rollback of migration', actor: 'Dave' },
    { time: '14:45', event: 'Rollback failed — new column has data, cannot drop', actor: 'Dave' },
    { time: '14:50', event: 'Manually created index on orders.user_id', actor: 'Dave' },
    { time: '14:55', event: 'Index creation completed. DB CPU dropping.', actor: 'System' },
    { time: '15:05', event: 'Connection pool recovered. Checkout latency normal.', actor: 'System' },
    { time: '15:10', event: 'Incident resolved. Monitoring confirmed stable.', actor: 'Bob' },
  ],

  rootCause: 'A schema migration added a new column to the orders table but did not include an index on user_id, which was used in a high-frequency query. The full table scan under production load exhausted the database connection pool.',

  contributingFactors: [
    'No query performance testing in CI pipeline for migrations',
    'Migration review checklist did not include index verification',
    'Connection pool limits were set too low (50), amplifying the problem',
    'Rollback procedure for migrations was not tested'
  ],

  whatWentWell: [
    'Alert fired within 3 minutes of degradation',
    'War room opened within 2 minutes of the alert',
    'Root cause identified within 10 minutes of opening the war room',
    'Communication to stakeholders was timely and clear'
  ],

  whatWentPoorly: [
    'Migration was not tested with production-scale data',
    'Rollback plan failed — untested assumption',
    'No automated check for missing indexes on high-traffic tables',
    'Time to resolution was 47 minutes for a SEV-1'
  ],

  luckyBreaks: [
    'The incident happened during business hours when the team was available',
    'No data was lost or corrupted'
  ],

  fiveWhys: [
    { question: 'Why was checkout down?', answer: 'Database connection pool was exhausted.' },
    { question: 'Why was the connection pool exhausted?', answer: 'Queries were taking 10x longer than normal, holding connections.' },
    { question: 'Why were queries slow?', answer: 'A full table scan on the orders table (2M rows) due to missing index.' },
    { question: 'Why was the index missing?', answer: 'Schema migration did not include it, and review did not catch it.' },
    { question: 'Why did the review not catch it?', answer: 'No automated tool or checklist item for verifying indexes on modified tables.' }
  ],

  actionItems: [
    { description: 'Add index verification to migration CI checks', type: 'prevent', owner: 'Dave', priority: 'P0', dueDate: new Date('2024-03-22'), status: 'in-progress', jiraTicket: 'PLAT-1234' },
    { description: 'Increase DB connection pool to 200', type: 'mitigate', owner: 'SRE team', priority: 'P0', dueDate: new Date('2024-03-18'), status: 'done', jiraTicket: 'PLAT-1235' },
    { description: 'Add query performance testing for migrations', type: 'prevent', owner: 'Dave', priority: 'P1', dueDate: new Date('2024-04-01'), status: 'open', jiraTicket: 'PLAT-1236' },
    { description: 'Test migration rollback procedures monthly', type: 'mitigate', owner: 'SRE team', priority: 'P1', dueDate: new Date('2024-04-01'), status: 'open', jiraTicket: 'PLAT-1237' },
  ]
};
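
A structured timeline like the one above makes it possible to derive response metrics instead of estimating them from memory. The sketch below is one possible approach under stated assumptions: `TimelineEntry`, `toMinutes`, and `minutesBetween` are hypothetical helpers, all times are assumed to fall on the same UTC day, and entries are located by matching event text, which is brittle and only meant for illustration.

```typescript
// Minimal structural type matching the timeline entries in the example.
interface TimelineEntry {
  time: string; // 'HH:MM', assumed same-day UTC
  event: string;
  actor: string;
}

// Converts 'HH:MM' to minutes since midnight.
function toMinutes(time: string): number {
  const match = /^(\d{2}):(\d{2})$/.exec(time);
  if (!match) throw new Error(`Invalid time: ${time}`);
  return Number(match[1]) * 60 + Number(match[2]);
}

// Minutes elapsed between the first entries whose event text contains
// each marker. Throws if either marker is not found.
function minutesBetween(
  timeline: TimelineEntry[],
  fromMarker: string,
  toMarker: string,
): number {
  const find = (marker: string): TimelineEntry => {
    const entry = timeline.find((e) => e.event.includes(marker));
    if (!entry) throw new Error(`No timeline entry containing "${marker}"`);
    return entry;
  };
  return toMinutes(find(toMarker).time) - toMinutes(find(fromMarker).time);
}

// Three entries copied from the example timeline.
const sample: TimelineEntry[] = [
  { time: '14:20', event: 'Database CPU spikes to 95%', actor: 'System' },
  { time: '14:23', event: 'PagerDuty alert: Checkout API p99 latency > 10s', actor: 'Monitoring' },
  { time: '15:10', event: 'Incident resolved. Monitoring confirmed stable.', actor: 'Bob' },
];

const timeToDetect = minutesBetween(sample, 'CPU spikes', 'PagerDuty alert'); // 3
const timeToResolve = minutesBetween(sample, 'PagerDuty alert', 'Incident resolved'); // 47
```

Tracking these two numbers across post-mortems gives a rough trend for detection and resolution speed without any extra tooling.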

Blameless Post-Mortems

Principles of Blameless Post-Mortems

  • Focus on Systems, Not People: "The process allowed a migration without index verification" not "Dave forgot to add an index."
  • Assume Good Intentions: Everyone was doing their best with the information they had at the time.
  • Prevent, Do Not Punish: The goal is to make the system more resilient, not to assign blame.
  • Action Items Are Mandatory: A post-mortem without action items is a story, not a learning opportunity.
  • Share Widely: Publish post-mortems so other teams learn from your incidents.
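
The "Action Items Are Mandatory" principle can be enforced mechanically before a post-mortem is published. The following is a hedged sketch of such a check, not an established tool: `PostMortemLike` is a hypothetical subset of the `PostMortem` interface, and the specific rules (non-empty root cause, at least one action item, each with an owner and a valid due date) are assumptions about what "mandatory" should mean in practice.

```typescript
// Hypothetical subset of the post-mortem fields needed for the check.
interface ActionItemLike {
  description: string;
  owner: string;
  dueDate: Date;
}

interface PostMortemLike {
  rootCause: string;
  actionItems: ActionItemLike[];
}

// Returns a list of problems; an empty list means nothing blocks publishing.
function publishBlockers(pm: PostMortemLike): string[] {
  const problems: string[] = [];
  if (pm.rootCause.trim() === '') {
    problems.push('Root cause is empty');
  }
  if (pm.actionItems.length === 0) {
    problems.push('No action items recorded');
  }
  pm.actionItems.forEach((item, i) => {
    if (item.owner.trim() === '') {
      problems.push(`Action item ${i + 1} has no owner`);
    }
    if (Number.isNaN(item.dueDate.getTime())) {
      problems.push(`Action item ${i + 1} has no valid due date`);
    }
  });
  return problems;
}
```

Returning a list of problems rather than a boolean lets the review tool show authors everything that needs fixing in one pass.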

Root Cause Analysis Techniques

  • 5 Whys: Keep asking "why?" until you reach the systemic root cause (typically 3-5 levels deep)
  • Fishbone Diagram: Categorize causes by type (People, Process, Technology, Environment)
  • Timeline Analysis: Map every event chronologically to identify where things diverged from expectations
  • Contributing Factors: Most incidents have multiple contributing factors, not a single root cause
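
As a rough illustration of the fishbone approach, causes can be tagged with a category and grouped programmatically. In this sketch the `Cause` type and `groupByCategory` function are hypothetical, and the category assigned to each factor from the checkout example is an editorial assumption rather than part of the original post-mortem.

```typescript
// The four categories named in the fishbone bullet above.
type FishboneCategory = 'People' | 'Process' | 'Technology' | 'Environment';

interface Cause {
  description: string;
  category: FishboneCategory;
}

// Groups cause descriptions under their fishbone category.
function groupByCategory(causes: Cause[]): Record<FishboneCategory, string[]> {
  const groups: Record<FishboneCategory, string[]> = {
    People: [],
    Process: [],
    Technology: [],
    Environment: [],
  };
  for (const cause of causes) {
    groups[cause.category].push(cause.description);
  }
  return groups;
}

// Contributing factors from the checkout example; categories are assumed.
const fishbone = groupByCategory([
  { description: 'No query performance testing in CI pipeline for migrations', category: 'Process' },
  { description: 'Migration review checklist did not include index verification', category: 'Process' },
  { description: 'Connection pool limits were set too low (50)', category: 'Technology' },
  { description: 'Rollback procedure for migrations was not tested', category: 'Process' },
]);
```

Here three of the four factors land under Process and none under People, which is consistent with the blameless framing: the grouping points at gaps in review and testing rather than at an individual.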
