---
name: major-incident-management
description: Manage major incidents affecting multiple customers, including incident command, stakeholder communication, cross-team coordination, post-mortem analysis, and prevention planning. Use when handling service outages, coordinating incident response teams, conducting post-mortems, managing crisis communications, or improving incident response processes. Triggers on phrases like "major incident", "incident management", "incident command", "war room", "outage response", "post-mortem", "incident review", "crisis management", "service disruption", "incident commander".
---

# Major Incident Management

Manage major incidents that affect multiple customers, with structured command, communication, and recovery.

## Workflow

### Major Incident Response Process

Trigger: a P1 incident is declared, multi-customer impact is detected, or an executive escalation is received.

1. **Incident declaration**: On-call engineer or support lead declares P1 incident; notify incident commander (IC); create incident record in tracking system; start incident timer.
2. **War room activation**: Create dedicated communication channel (Slack #incident-[date]-[summary]); invite IC, engineering leads, support lead, comms lead, executive sponsor; set communication rules (updates only, no debugging in channel).
3. **Initial assessment** (within 15 minutes):
   - Scope: Which services affected? How many customers?
   - Impact: Revenue impact (estimated), SLA breach risk, customer-facing symptoms
   - Root cause: Initial hypothesis (confirmed or unconfirmed)
   - ETA: Estimated resolution time (with confidence level)
4. **Customer notification**: First external communication within 30 minutes; use proactive-service-notifications workflow; update status page; prepare support team with canned responses.
5. **Resolution work**: Engineering team investigates and implements fix; IC tracks progress; comms lead sends updates per cadence (P1: every 30 minutes); support team handles inbound tickets.
6. **Resolution verification**: Fix deployed; verification tests run; monitoring confirms stability for 15 minutes; IC declares incident resolved; update all channels.
7. **Recovery monitoring**: Enhanced monitoring for 2 hours post-resolution; watch for recurrence; support team monitors for related tickets; comms sends "all clear" notification.
8. **Post-mortem** (within 48 hours): Schedule blameless post-mortem meeting; document timeline, root cause, impact, response effectiveness, action items; publish internally (and externally for P1).
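The time limits in the steps above can be collected into a small helper so tooling or a checklist bot can show what is due and when. This is a minimal sketch, not part of any specific incident tool: the function names and dictionary keys are hypothetical, and it assumes the 2-hour monitoring window and the 48-hour post-mortem window both start at the moment the incident is declared resolved, which the workflow does not state explicitly.

```python
from datetime import datetime, timedelta

# Time limits taken from the workflow steps above.
INITIAL_ASSESSMENT = timedelta(minutes=15)
FIRST_CUSTOMER_NOTICE = timedelta(minutes=30)
UPDATE_CADENCE_P1 = timedelta(minutes=30)
ENHANCED_MONITORING = timedelta(hours=2)
POST_MORTEM_WINDOW = timedelta(hours=48)


def declaration_deadlines(declared_at: datetime) -> dict[str, datetime]:
    """Deadlines that start counting when the P1 is declared."""
    return {
        "initial_assessment_due": declared_at + INITIAL_ASSESSMENT,
        "first_customer_notice_due": declared_at + FIRST_CUSTOMER_NOTICE,
    }


def next_update_due(last_update_at: datetime) -> datetime:
    """Next stakeholder update under the P1 cadence (every 30 minutes)."""
    return last_update_at + UPDATE_CADENCE_P1


def recovery_deadlines(resolved_at: datetime) -> dict[str, datetime]:
    """Deadlines that start at resolution (assumption, see lead-in)."""
    return {
        "enhanced_monitoring_ends": resolved_at + ENHANCED_MONITORING,
        "post_mortem_due": resolved_at + POST_MORTEM_WINDOW,
    }


if __name__ == "__main__":
    declared = datetime(2024, 1, 1, 10, 0)
    for name, due in declaration_deadlines(declared).items():
        print(f"{name}: {due:%Y-%m-%d %H:%M}")
```

Keeping the limits as named constants makes it straightforward to adjust them if your organization uses different targets.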

### Incident Command Structure

```
INCIDENT COMMAND STRUCTURE
============================

Roles:

Incident Commander (IC):
  - Overall leader; makes final decisions
  - Manages communication cadence
  - Tracks resolution progress
  - Declares incident start/end
  - Authorizes escalation to executives
  - Primary contact for external communications
  - Cannot be the person implementing the fix

Engineering Lead:
  - Leads technical investigation
  - Implements fix or workaround
  - Provides technical status updates to IC
  - Manages engineering team assignments
  - Validates fix before deployment

Support Lead:
  - Manages inbound customer tickets
  - Coordinates canned responses
  - Flags critical customer situations (enterprise, VIP)
  - Reports ticket volume trends to IC
  - Manages SLA exceptions during incident

Communications Lead:
  - Drafts and sends external updates
  - Manages status page
  - Handles social media communications
  - Prepares post-incident customer communication
  - Coordinates with PR/legal if needed

Executive Sponsor:
  - Notified for P1 incidents
  - Receives hourly updates
  - Authorizes compensation (credits, refunds)
  - Contacts enterprise customers personally if needed
  - Briefs the board on material incidents

Escalation Paths:
  IC → Engineering VP → CTO (if resolution > 2 hours)
  IC → Support VP → CSO (if ticket volume > 5× normal)
  Comms Lead → PR → CEO (if media/social media attention)
```
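The escalation thresholds and the rule that the IC cannot implement the fix lend themselves to simple automated checks. The sketch below is illustrative only: `IncidentStatus`, `escalation_paths`, and `validate_roles` are hypothetical names, and it reads "resolution > 2 hours" as "the incident has been open for more than 2 hours without resolution".

```python
from dataclasses import dataclass


@dataclass
class IncidentStatus:
    hours_open: float            # time since declaration, still unresolved
    ticket_volume: int           # tickets in the current window
    normal_ticket_volume: int    # baseline for the same window
    media_attention: bool        # media or social media attention detected


def escalation_paths(status: IncidentStatus) -> list[str]:
    """Return the escalation paths whose trigger condition is met."""
    paths = []
    if status.hours_open > 2:
        paths.append("IC -> Engineering VP -> CTO")
    if status.ticket_volume > 5 * status.normal_ticket_volume:
        paths.append("IC -> Support VP -> CSO")
    if status.media_attention:
        paths.append("Comms Lead -> PR -> CEO")
    return paths


def validate_roles(incident_commander: str, fix_implementers: set[str]) -> None:
    """Enforce the rule that the IC is not the person implementing the fix."""
    if incident_commander in fix_implementers:
        raise ValueError("Incident Commander cannot also implement the fix")


if __name__ == "__main__":
    status = IncidentStatus(hours_open=2.5, ticket_volume=620,
                            normal_ticket_volume=100, media_attention=False)
    print(escalation_paths(status))
```

The comparisons are strict (`>`), matching the thresholds as written: exactly 2 hours or exactly 5× normal volume does not trigger escalation.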

### Post-Mortem Template

```
POST-MORTEM TEMPLATE — BLAMELESS REVIEW
==========================================

Incident Summary:
  Title: [Brief description — e.g., "Payment Processing Outage"]
  Date: [Date and time range]
  Severity: P1 / P2
  Duration: [Start time] to [End time] ([X] hours [Y] minutes)
  Incident Commander: [Name]

Impact:
  Customers affected: [X] ([Y]% of total customer base)
  Revenue impact: $[X] (estimated lost revenue during outage)
  SLA breaches: [X] customers (credits issued: $[Y])
  Support tickets: [X] tickets created during incident ([Y]% increase vs. normal)
  Reputation: [Media mentions, social media sentiment, customer complaints]

Timeline:
  [HH:MM] — Incident detected by [system/person]
  [HH:MM] — P1 declared by [person]
  [HH:MM] — War room activated
  [HH:MM] — First customer notification sent
  [HH:MM] — Root cause identified: [description]
  [HH:MM] — Fix deployed
  [HH:MM] — Resolution verified
  [HH:MM] — Incident declared resolved by IC
  [HH:MM] — All-clear notification sent

Root Cause:
  Technical root cause: [Detailed explanation]
  Contributing factors: [List of factors that made the incident worse]
  Why detection was delayed: [If applicable — monitoring gap, alert threshold, etc.]
  Why escalation was delayed: [If applicable — process gap, communication issue, etc.]

What Went Well:
  1. [Positive aspect — e.g., "Fast root cause identification"]
  2. [Positive aspect — e.g., "Clear customer communication"]
  3. [Positive aspect — e.g., "Effective war room coordination"]

What Could Be Improved:
  1. [Area for improvement — e.g., "Monitoring gap for service X"]
  2. [Area for improvement — e.g., "Slow customer notification (45 min vs. 30 min target)"]
  3. [Area for improvement — e.g., "No documented runbook for this scenario"]

Action Items:
  1. [Action] — Owner: [Name] — Due: [Date] — Priority: [High/Medium/Low]
  2. [Action] — Owner: [Name] — Due: [Date] — Priority: [High/Medium/Low]
  3. [Action] — Owner: [Name] — Due: [Date] — Priority: [High/Medium/Low]

Prevention:
  - How to prevent this from happening again: [Long-term prevention strategy]
  - How to detect faster if it does happen: [Monitoring/alerting improvements]
  - How to respond more effectively: [Process/tool improvements]

Distribution:
  - Internal: All engineering, support, and leadership teams
  - External: Public post-mortem published on blog/status page (for P1)
  - Follow-up: Action items tracked in Jira; reviewed in monthly ops meeting
```
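Several template fields are derived values: the duration in hours and minutes, the share of customers affected, and the ticket increase versus normal. A minimal sketch of that arithmetic follows; the function names are hypothetical, and rounding to one decimal place is an arbitrary choice.

```python
from datetime import datetime


def format_duration(start: datetime, end: datetime) -> str:
    """Render the Duration field as '[X] hours [Y] minutes'."""
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"


def percent_of(part: int, whole: int) -> float:
    """Share of the whole, e.g. affected customers vs. total customer base."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return round(100 * part / whole, 1)


def percent_increase(observed: int, baseline: int) -> float:
    """Increase over a baseline, e.g. incident tickets vs. normal volume."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return round(100 * (observed - baseline) / baseline, 1)


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 10, 0)
    end = datetime(2024, 1, 1, 12, 25)
    print(format_duration(start, end))
    print(percent_of(150, 5000))
    print(percent_increase(300, 100))
```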

## Edge Cases

- **Cascading failures** (one incident triggers multiple system failures):
  - Response: Single incident commander for all cascading failures; prioritize customer-facing impact; fix root cause first, then downstream effects
  - Communication: Single incident thread; avoid multiple separate communications that confuse customers
  - Complexity: May require multiple engineering teams; IC coordinates priorities and dependencies
  - Example: Database migration fails → API down → payments fail → reporting errors → customer notifications fail

- **Security-related incidents** (data breach, unauthorized access):
  - Escalation: Immediate notification to security team + legal + executive leadership
  - Communication: Legal-approved messaging only; no technical details until investigation complete
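  - Regulatory: Breach notification timelines (GDPR: notify the supervisory authority within 72 hours of becoming aware of the breach; CCPA/California breach law: without unreasonable delay; plus any industry-specific requirements)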
  - Coordination: Security team leads investigation; IC manages operational response; comms manages external communication
  - Documentation: Detailed forensic log; preserve evidence; external auditor engagement if required

- **Cross-organization incidents** (SaaS dependency outage — AWS, Stripe, Twilio):
  - Response: Monitor upstream provider status; adjust communication to reflect dependency ("Our payment processing partner is experiencing issues")
  - Customer expectation: Customers may blame you regardless; communicate proactively; show active monitoring of upstream provider
  - SLA: Check contract terms; the upstream provider may owe credits or compensation, which can be passed along to affected customers if appropriate
  - Prevention: Multi-provider strategy for critical dependencies; fallback systems

- **Incident fatigue** (frequent incidents causing customer or team burnout):
  - Detection: Team satisfaction drops; customer complaints about frequency; increase in churn
  - Response: Executive review of incident trends; root cause analysis of recurring patterns; resource investment in reliability
  - Communication: Acknowledge pattern in post-mortem ("This is the third incident this quarter; here's our plan to improve")
  - Prevention: Reliability engineering investment; chaos testing; improved deployment practices

- **After-hours incidents** (no senior staff available):
  - Response: On-call engineer acts as IC; escalate to available senior staff via phone; follow documented runbooks
  - Communication: Send an automated customer notification from pre-approved templates if the on-call engineer cannot draft one
  - Support: On-call support agent handles tickets; escalation to day team if beyond scope
  - Prevention: Robust runbooks; clear on-call escalation paths; automated responses for known scenarios
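For the cascading-failures case, "fix root cause first, then downstream effects" amounts to working through the affected services in dependency order. One way to sketch this is a topological sort over a dependency map, here using Python's standard `graphlib` module (Python 3.9+). The service names and the assumption that the example chain is strictly linear are illustrative; a real dependency graph would come from your own service catalog.

```python
from graphlib import TopologicalSorter

# Each service maps to the set of services it depends on.
# Mirrors the example chain: database -> API -> payments -> reporting
# -> customer notifications (hypothetical service names).
FAILURE_CHAIN = {
    "api": {"database"},
    "payments": {"api"},
    "reporting": {"payments"},
    "customer_notifications": {"reporting"},
}


def fix_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Order services so each one comes after everything it depends on."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for step, service in enumerate(fix_order(FAILURE_CHAIN), start=1):
        print(f"{step}. {service}")
```

The first entry is the root cause candidate (the service with no failed dependencies); the IC can still reprioritize within that order based on customer-facing impact.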

## Integration Points

- **Incident management**: PagerDuty, Opsgenie, ServiceNow — incident tracking, on-call scheduling, escalation
- **Monitoring**: Datadog, New Relic, Sentry, Grafana — incident detection, alerting, metrics
- **Communication**: Slack, Teams — war room channels, update coordination
- **Status page**: Atlassian Statuspage, Better Uptime — public status updates, subscriber notifications
- **Help desk**: Zendesk, Freshdesk — ticket management, canned responses, auto-close
- **Email**: SendGrid, Mailgun — customer notifications, mass email
- **Documentation**: Confluence, Notion — runbooks, post-mortem storage, knowledge base
- **Project management**: Jira, Linear — action item tracking, improvement projects
- **Analytics**: Custom dashboard — MTTR, incident frequency, impact metrics
- **Collaboration**: Zoom, Google Meet — remote war room, post-mortem meetings
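The analytics metrics named above (MTTR and incident frequency) can be computed from a list of declared and resolved timestamps. The following is a minimal sketch under assumptions: the `(declared_at, resolved_at)` tuple format and the function names are hypothetical, MTTR is measured from declaration to resolution, and frequency is bucketed by calendar month of declaration.

```python
from datetime import datetime, timedelta

Incident = tuple[datetime, datetime]  # (declared_at, resolved_at)


def mean_time_to_resolve(incidents: list[Incident]) -> timedelta:
    """Average time from declaration to resolution."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((resolved - declared for declared, resolved in incidents),
                timedelta())
    return total / len(incidents)


def incidents_per_month(incidents: list[Incident]) -> dict[str, int]:
    """Count incidents by the month ('YYYY-MM') in which they were declared."""
    counts: dict[str, int] = {}
    for declared, _ in incidents:
        key = declared.strftime("%Y-%m")
        counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    history = [
        (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 11, 0)),
        (datetime(2024, 1, 20, 9, 0), datetime(2024, 1, 20, 12, 0)),
        (datetime(2024, 2, 2, 8, 0), datetime(2024, 2, 2, 10, 0)),
    ]
    print(mean_time_to_resolve(history))
    print(incidents_per_month(history))
```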