Support AI Skill

Major Incident Management

Manage major incidents affecting multiple customers, including incident command, stakeholder communication, cross-team coordination, post-mortem analysis, and prevention planning. Use when handling service outages, coordinating incident response teams, cond...

Manage major incidents that affect multiple customers, with structured command, communication, and recovery.

Workflow

Major Incident Response Process

Trigger: A P1 incident is declared, multi-customer impact is detected, or an executive escalation is received.

  1. Incident declaration: On-call engineer or support lead declares P1 incident; notify incident commander (IC); create incident record in tracking system; start incident timer.
  2. War room activation: Create dedicated communication channel (Slack #incident-[date]-[summary]); invite IC, engineering leads, support lead, comms lead, executive sponsor; set communication rules (updates only, no debugging in channel).
  3. Initial assessment (within 15 minutes).
  4. Customer notification: First external communication within 30 minutes; use proactive-service-notifications workflow; update status page; prepare support team with canned responses.
  5. Resolution work: Engineering team investigates and implements fix; IC tracks progress; comms lead sends updates per cadence (P1: every 30 minutes); support team handles inbound tickets.
  6. Resolution verification: Fix deployed; verification tests run; monitoring confirms stability for 15 minutes; IC declares incident resolved; update all channels.
  7. Recovery monitoring: Enhanced monitoring for 2 hours post-resolution; watch for recurrence; support team monitors for related tickets; comms sends "all clear" notification.
  8. Post-mortem (within 48 hours): Schedule blameless post-mortem meeting; document timeline, root cause, impact, response effectiveness, action items; publish internally (and externally for P1).
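
The time targets in this workflow can be computed from the declaration timestamp. The following is a minimal sketch, not a prescribed implementation: `IncidentClock`, `channel_name`, the date format in the channel name, and the choice to measure the post-mortem window from resolution are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

# Time targets taken from the workflow steps.
ASSESSMENT_WINDOW = timedelta(minutes=15)
FIRST_NOTIFICATION = timedelta(minutes=30)
UPDATE_CADENCE = timedelta(minutes=30)      # P1 cadence
RECOVERY_MONITORING = timedelta(hours=2)
POST_MORTEM_WINDOW = timedelta(hours=48)


def channel_name(declared_at: datetime, summary: str) -> str:
    """Build a name in the #incident-[date]-[summary] pattern.

    The YYYY-MM-DD date format and the slug rules are assumptions;
    the workflow only gives the overall pattern.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", summary.lower()).strip("-")
    return f"#incident-{declared_at:%Y-%m-%d}-{slug}"


@dataclass
class IncidentClock:
    """Hypothetical helper that derives deadlines from the declaration time."""
    declared_at: datetime

    def assessment_due(self) -> datetime:
        return self.declared_at + ASSESSMENT_WINDOW

    def first_notification_due(self) -> datetime:
        return self.declared_at + FIRST_NOTIFICATION

    def update_times(self, until: datetime) -> List[datetime]:
        """Scheduled update times, one per cadence interval, up to `until`.

        Assumes the cadence is counted from the declaration time.
        """
        times = []
        next_update = self.declared_at + UPDATE_CADENCE
        while next_update <= until:
            times.append(next_update)
            next_update += UPDATE_CADENCE
        return times

    def recovery_monitoring_ends(self, resolved_at: datetime) -> datetime:
        return resolved_at + RECOVERY_MONITORING

    def post_mortem_due(self, resolved_at: datetime) -> datetime:
        # Assumption: the 48-hour window starts at resolution.
        return resolved_at + POST_MORTEM_WINDOW
```

For example, an incident declared at 09:00 would have its first customer notification due by 09:30 and scheduled updates at 09:30, 10:00, and 10:30 if it is still open at 10:45.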

Incident Command Structure

INCIDENT COMMAND STRUCTURE
============================

Roles:

Incident Commander (IC):
  - Overall leader; makes final decisions
  - Manages communication cadence
  - Tracks resolution progress
  - Declares incident start/end
  - Authorizes escalation to executives
  - Primary contact for external communications
  - Cannot be the person implementing the fix

Engineering Lead:
  - Leads technical investigation
  - Implements fix or workaround
  - Provides technical status updates to IC
  - Manages engineering team assignments
  - Validates fix before deployment

Support Lead:
  - Manages inbound customer tickets
  - Coordinates canned responses
  - Flags critical customer situations (enterprise, VIP)
  - Reports ticket volume trends to IC
  - Manages SLA exceptions during incident

Communications Lead:
  - Drafts and sends external updates
  - Manages status page
  - Handles social media communications
  - Prepares post-incident customer communication
  - Coordinates with PR/legal if needed

Executive Sponsor:
  - Notified for P1 incidents
  - Receives hourly updates
  - Authorizes compensation (credits, refunds)
  - Contacts enterprise customers personally if needed
  - Seeks board approval where required for material incidents

Escalation Paths:
  IC → Engineering VP → CTO (if resolution takes longer than 2 hours)
  IC → Support VP → CSO (if ticket volume exceeds 5× normal)
  Comms Lead → PR → CEO (if the incident draws media or social media attention)
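
The escalation conditions above are simple threshold checks, so they can be evaluated mechanically. This is a hedged sketch only: `IncidentStatus`, its field names, and `escalation_paths` are hypothetical, and how "normal" ticket volume is measured is left to the caller.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List

RESOLUTION_LIMIT = timedelta(hours=2)   # from "resolution > 2 hours"
TICKET_MULTIPLIER = 5                   # from "ticket volume > 5x normal"


@dataclass
class IncidentStatus:
    """Hypothetical snapshot of an open incident."""
    elapsed: timedelta          # time since declaration, still unresolved
    ticket_volume: int          # tickets received during the incident
    normal_ticket_volume: int   # baseline for the same period (caller-defined)
    media_attention: bool       # media or social media attention observed


def escalation_paths(status: IncidentStatus) -> List[str]:
    """Return every escalation path whose condition is currently met."""
    paths = []
    if status.elapsed > RESOLUTION_LIMIT:
        paths.append("IC -> Engineering VP -> CTO")
    if status.ticket_volume > TICKET_MULTIPLIER * status.normal_ticket_volume:
        paths.append("IC -> Support VP -> CSO")
    if status.media_attention:
        paths.append("Comms Lead -> PR -> CEO")
    return paths
```

The comparisons are strict, matching the ">" in the escalation paths: an incident at exactly 2 hours, or at exactly 5× normal volume, does not yet trigger escalation in this sketch.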

Post-Mortem Template

POST-MORTEM TEMPLATE — BLAMELESS REVIEW
==========================================

Incident Summary:
  Title: [Brief description — e.g., "Payment Processing Outage"]
  Date: [Date and time range]
  Severity: P1 / P2
  Duration: [Start time] to [End time] ([X] hours [Y] minutes)
  Incident Commander: [Name]

Impact:
  Customers affected: [X] ([Y]% of total customer base)
  Revenue impact: $[X] (estimated lost revenue during outage)
  SLA breaches: [X] customers (credits issued: $[Y])
  Support tickets: [X] tickets created during incident ([Y]% increase vs. normal)
  Reputation: [Media mentions, social media sentiment, customer complaints]

Timeline:
  [HH:MM] — Incident detected by [system/person]
  [HH:MM] — P1 declared by [person]
  [HH:MM] — War room activated
  [HH:MM] — First customer notification sent
  [HH:MM] — Root cause identified: [description]
  [HH:MM] — Fix deployed
  [HH:MM] — Resolution verified
  [HH:MM] — Incident declared resolved
  [HH:MM] — All-clear notification sent

Root Cause:
  Technical root cause: [Detailed explanation]
  Contributing factors: [List of factors that made the incident worse]
  Why detection was delayed: [If applicable — monitoring gap, alert threshold, etc.]
  Why escalation was delayed: [If applicable — process gap, communication issue, etc.]

What Went Well:
  1. [Positive aspect — e.g., "Fast root cause identification"]
  2. [Positive aspect — e.g., "Clear customer communication"]
  3. [Positive aspect — e.g., "Effective war room coordination"]

What Could Be Improved:
  1. [Area for improvement — e.g., "Monitoring gap for service X"]
  2. [Area for improvement — e.g., "Slow customer notification (45 min vs. 30 min target)"]
  3. [Area for improvement — e.g., "No documented runbook for this scenario"]

Action Items:
  1. [Action] — Owner: [Name] — Due: [Date] — Priority: [High/Medium/Low]
  2. [Action] — Owner: [Name] — Due: [Date] — Priority: [High/Medium/Low]
  3. [Action] — Owner: [Name] — Due: [Date] — Priority: [High/Medium/Low]

Prevention:
  - How to prevent this from happening again: [Long-term prevention strategy]
  - How to detect faster if it does happen: [Monitoring/alerting improvements]
  - How to respond more effectively: [Process/tool improvements]

Distribution:
  - Internal: All engineering, support, and leadership teams
  - External: Public post-mortem published on blog/status page (for P1)
  - Follow-up: Action items tracked in Jira; reviewed in monthly ops meeting
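
Several template fields are derived values: the duration in hours and minutes, the share of the customer base affected, and the ticket increase versus normal. A minimal sketch of that arithmetic follows; the function names are hypothetical, and the baseline used for "normal" ticket volume is an assumption the author of the post-mortem must define.

```python
from datetime import datetime


def format_duration(start: datetime, end: datetime) -> str:
    """Render the Duration field as "[X] hours [Y] minutes"."""
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"


def percent_of_base(affected: int, total_customers: int) -> float:
    """Customers affected as a percentage of the total customer base."""
    return round(100 * affected / total_customers, 1)


def percent_increase(during_incident: int, normal: int) -> float:
    """Ticket increase vs. normal, as a percentage of the normal volume.

    `normal` is the expected ticket count for a comparable period
    (an assumption; the template does not define the baseline).
    """
    return round(100 * (during_incident - normal) / normal, 1)
```

For instance, an outage from 09:00 to 11:25 is reported as "2 hours 25 minutes", and 450 tickets against a normal 150 is a 200.0% increase.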

Edge Cases

Integration Points