Support AI Skill
Major Incident Management
Manage major incidents affecting multiple customers, including incident command, stakeholder communication, cross-team coordination, post-mortem analysis, and prevention planning. Use when handling service outages, coordinating incident response teams, cond...
Manage major incidents that affect multiple customers, with structured command, communication, and recovery.
Workflow
Major Incident Response Process
Trigger: P1 incident declared, multi-customer impact detected, or executive escalation received.
- Incident declaration: On-call engineer or support lead declares P1 incident; notify incident commander (IC); create incident record in tracking system; start incident timer.
- War room activation: Create dedicated communication channel (Slack #incident-[date]-[summary]); invite IC, engineering leads, support lead, comms lead, executive sponsor; set communication rules (updates only, no debugging in channel).
- Initial assessment (within 15 minutes):
  - Scope: Which services are affected? How many customers?
  - Impact: Revenue impact (estimated), SLA breach risk, customer-facing symptoms
  - Root cause: Initial hypothesis (confirmed or unconfirmed)
  - ETA: Estimated resolution time (with confidence level)
- Customer notification: First external communication within 30 minutes; use proactive-service-notifications workflow; update status page; prepare support team with canned responses.
- Resolution work: Engineering team investigates and implements fix; IC tracks progress; comms lead sends updates per cadence (P1: every 30 minutes); support team handles inbound tickets.
- Resolution verification: Fix deployed; verification tests run; monitoring confirms stability for 15 minutes; IC declares incident resolved; update all channels.
- Recovery monitoring: Enhanced monitoring for 2 hours post-resolution; watch for recurrence; support team monitors for related tickets; comms sends "all clear" notification.
- Post-mortem (within 48 hours): Schedule blameless post-mortem meeting; document timeline, root cause, impact, response effectiveness, action items; publish internally (and externally for P1).
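The timing targets in the process above can be sketched as a small deadline calculator. This is a minimal illustration, not part of any tool named in this document: the function and constant names are hypothetical, and it assumes the 48-hour post-mortem window is counted from the moment the incident is declared resolved, which the process does not state explicitly.

```python
from datetime import datetime, timedelta

# Timing targets taken from the response process above.
INITIAL_ASSESSMENT = timedelta(minutes=15)
FIRST_CUSTOMER_NOTICE = timedelta(minutes=30)
P1_UPDATE_CADENCE = timedelta(minutes=30)
STABILITY_WINDOW = timedelta(minutes=15)
RECOVERY_MONITORING = timedelta(hours=2)
POST_MORTEM_WINDOW = timedelta(hours=48)  # assumption: counted from resolution


def response_deadlines(declared_at: datetime) -> dict[str, datetime]:
    """Deadlines that are fixed as soon as the P1 is declared."""
    return {
        "initial_assessment": declared_at + INITIAL_ASSESSMENT,
        "first_customer_notice": declared_at + FIRST_CUSTOMER_NOTICE,
    }


def next_update_due(last_update_at: datetime) -> datetime:
    """When the comms lead owes the next P1 update."""
    return last_update_at + P1_UPDATE_CADENCE


def earliest_resolution(fix_deployed_at: datetime) -> datetime:
    """Earliest time the IC can declare resolution after a stable window."""
    return fix_deployed_at + STABILITY_WINDOW


def recovery_deadlines(resolved_at: datetime) -> dict[str, datetime]:
    """Deadlines that start once the incident is declared resolved."""
    return {
        "recovery_monitoring_ends": resolved_at + RECOVERY_MONITORING,
        "post_mortem_due": resolved_at + POST_MORTEM_WINDOW,
    }
```

Keeping the targets as named constants means a team with a different cadence (for example, a P2 update interval) only changes one line.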
Incident Command Structure
INCIDENT COMMAND STRUCTURE
============================
Roles:
Incident Commander (IC):
- Overall leader; makes final decisions
- Manages communication cadence
- Tracks resolution progress
- Declares incident start/end
- Authorizes escalation to executives
- Primary contact for external communications
- Cannot be the person implementing the fix
Engineering Lead:
- Leads technical investigation
- Implements fix or workaround
- Provides technical status updates to IC
- Manages engineering team assignments
- Validates fix before deployment
Support Lead:
- Manages inbound customer tickets
- Coordinates canned responses
- Flags critical customer situations (enterprise, VIP)
- Reports ticket volume trends to IC
- Manages SLA exceptions during incident
Communications Lead:
- Drafts and sends external updates
- Manages status page
- Handles social media communications
- Prepares post-incident customer communication
- Coordinates with PR/legal if needed
Executive Sponsor:
- Notified for P1 incidents
- Receives hourly updates
- Authorizes compensation (credits, refunds)
- Contacts enterprise customers personally if needed
- Seeks board approval for material incidents
Escalation Paths:
IC → Engineering VP → CTO (if resolution > 2 hours)
IC → Support VP → CSO (if ticket volume > 5× normal)
Comms Lead → PR → CEO (if media/social media attention)
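The escalation thresholds above can be expressed as a simple rule check. The sketch below is illustrative only: `IncidentSnapshot` and `escalation_paths` are hypothetical names, and it reads "resolution > 2 hours" as "the incident has been open for more than 2 hours without resolution".

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class IncidentSnapshot:
    """Point-in-time view of an open incident (hypothetical structure)."""
    elapsed: timedelta          # time since the P1 was declared
    ticket_volume: int          # tickets per hour right now
    normal_ticket_volume: int   # usual tickets per hour
    media_attention: bool       # press or social media pickup


def escalation_paths(snapshot: IncidentSnapshot) -> list[str]:
    """Return every escalation path whose threshold has been crossed."""
    paths = []
    if snapshot.elapsed > timedelta(hours=2):
        paths.append("IC -> Engineering VP -> CTO")
    if snapshot.ticket_volume > 5 * snapshot.normal_ticket_volume:
        paths.append("IC -> Support VP -> CSO")
    if snapshot.media_attention:
        paths.append("Comms Lead -> PR -> CEO")
    return paths
```

The paths are independent, so a long-running incident with heavy ticket volume triggers both the engineering and the support escalation at once.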
Post-Mortem Template
POST-MORTEM TEMPLATE — BLAMELESS REVIEW
==========================================
Incident Summary:
Title: [Brief description — e.g., "Payment Processing Outage"]
Date: [Date and time range]
Severity: P1 / P2
Duration: [Start time] to [End time] ([X] hours [Y] minutes)
Incident Commander: [Name]
Impact:
Customers affected: [X] ([Y]% of total customer base)
Revenue impact: $[X] (estimated lost revenue during outage)
SLA breaches: [X] customers (credits issued: $[Y])
Support tickets: [X] tickets created during incident ([Y]% increase vs. normal)
Reputation: [Media mentions, social media sentiment, customer complaints]
Timeline:
[HH:MM] — Incident detected by [system/person]
[HH:MM] — P1 declared by [person]
[HH:MM] — War room activated
[HH:MM] — First customer notification sent
[HH:MM] — Root cause identified: [description]
[HH:MM] — Fix deployed
[HH:MM] — Resolution verified
[HH:MM] — Incident declared resolved
[HH:MM] — All-clear notification sent
Root Cause:
Technical root cause: [Detailed explanation]
Contributing factors: [List of factors that made the incident worse]
Why detection was delayed: [If applicable — monitoring gap, alert threshold, etc.]
Why escalation was delayed: [If applicable — process gap, communication issue, etc.]
What Went Well:
1. [Positive aspect — e.g., "Fast root cause identification"]
2. [Positive aspect — e.g., "Clear customer communication"]
3. [Positive aspect — e.g., "Effective war room coordination"]
What Could Be Improved:
1. [Area for improvement — e.g., "Monitoring gap for service X"]
2. [Area for improvement — e.g., "Slow customer notification (45 min vs. 30 min target)"]
3. [Area for improvement — e.g., "No documented runbook for this scenario"]
Action Items:
1. [Action] — Owner: [Name] — Due: [Date] — Priority: [High/Medium/Low]
2. [Action] — Owner: [Name] — Due: [Date] — Priority: [High/Medium/Low]
3. [Action] — Owner: [Name] — Due: [Date] — Priority: [High/Medium/Low]
Prevention:
- How to prevent this from happening again: [Long-term prevention strategy]
- How to detect faster if it does happen: [Monitoring/alerting improvements]
- How to respond more effectively: [Process/tool improvements]
Distribution:
- Internal: All engineering, support, and leadership teams
- External: Public post-mortem published on blog/status page (for P1)
- Follow-up: Action items tracked in Jira; reviewed in monthly ops meeting
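The derived fields in the template (duration, share of customers affected, ticket increase vs. normal) are simple arithmetic. A minimal sketch, using hypothetical helper names and assuming the baseline ticket count covers a comparable time window:

```python
from datetime import datetime


def format_duration(start: datetime, end: datetime) -> str:
    """Render the Duration field as '[X] hours [Y] minutes'."""
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"


def percent_of(part: int, whole: int) -> float:
    """Share of the customer base affected, as a percentage."""
    return round(100 * part / whole, 1)


def percent_increase(observed: int, baseline: int) -> float:
    """Ticket increase relative to the normal volume, as a percentage."""
    return round(100 * (observed - baseline) / baseline, 1)
```

For example, 450 tickets against a normal volume of 100 is a 350% increase, which is the figure the "Support tickets" line expects rather than the raw ratio of 4.5×.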
Edge Cases
- Cascading failures (one incident triggers multiple system failures):
  - Response: Single incident commander for all cascading failures; prioritize customer-facing impact; fix root cause first, then downstream effects
  - Communication: Single incident thread; avoid multiple separate communications that confuse customers
  - Complexity: May require multiple engineering teams; IC coordinates priorities and dependencies
  - Example: Database migration fails → API down → payments fail → reporting errors → customer notifications fail
- Security-related incidents (data breach, unauthorized access):
  - Escalation: Immediate notification to security team, legal, and executive leadership
  - Communication: Legal-approved messaging only; no technical details until investigation is complete
  - Regulatory: Breach notification timeline (GDPR: 72 hours from becoming aware of the breach; CCPA: reasonable time; industry-specific requirements)
  - Coordination: Security team leads investigation; IC manages operational response; comms manages external communication
  - Documentation: Detailed forensic log; preserve evidence; external auditor engagement if required
- Cross-organization incidents (SaaS dependency outage, e.g. AWS, Stripe, Twilio):
  - Response: Monitor upstream provider status; adjust communication to reflect dependency ("Our payment processing partner is experiencing issues")
  - Customer expectation: Customers may blame you regardless; communicate proactively; show active monitoring of upstream provider
  - SLA: Check contract terms; the upstream provider may owe credits or compensation, which can be passed along to affected customers if appropriate
  - Prevention: Multi-provider strategy for critical dependencies; fallback systems
- Incident fatigue (frequent incidents causing customer or team burnout):
  - Detection: Team satisfaction drops; customer complaints about frequency; increase in churn
  - Response: Executive review of incident trends; root cause analysis of recurring patterns; resource investment in reliability
  - Communication: Acknowledge pattern in post-mortem ("This is the third incident this quarter; here's our plan to improve")
  - Prevention: Reliability engineering investment; chaos testing; improved deployment practices
- After-hours incidents (no senior staff available):
  - Response: On-call engineer acts as IC; escalate to available senior staff via phone; follow documented runbooks
  - Communication: Automated customer notification if engineer cannot draft; use pre-approved templates
  - Support: On-call support agent handles tickets; escalation to day team if beyond scope
  - Prevention: Robust runbooks; clear on-call escalation paths; automated responses for known scenarios
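For the cascading-failure case, "fix root cause first, then downstream effects" amounts to working through the failure chain in dependency order. The sketch below models the example chain from that edge case with Python's standard `graphlib` module (Python 3.9+). The failure names and the `remediation_order` helper are hypothetical, and a real incident would still weigh customer-facing impact when choosing between independent branches.

```python
from graphlib import TopologicalSorter

# Each failure maps to the upstream failures that caused it.
# Names are illustrative, taken from the cascading-failure example.
caused_by = {
    "api_down": {"database_migration_failed"},
    "payments_failing": {"api_down"},
    "reporting_errors": {"payments_failing"},
    "customer_notifications_failing": {"reporting_errors"},
}


def remediation_order(causes: dict[str, set[str]]) -> list[str]:
    """Order failures so every cause is addressed before its effects."""
    return list(TopologicalSorter(causes).static_order())


order = remediation_order(caused_by)
# The root cause comes first, the furthest downstream effect last.
```

`TopologicalSorter` raises `graphlib.CycleError` if the recorded causes form a loop, which is a useful signal that the dependency picture is wrong.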
Integration Points
- Incident management: PagerDuty, Opsgenie, ServiceNow — incident tracking, on-call scheduling, escalation
- Monitoring: Datadog, New Relic, Sentry, Grafana — incident detection, alerting, metrics
- Communication: Slack, Teams — war room channels, update coordination
- Status page: Atlassian Statuspage, Better Uptime — public status updates, subscriber notifications
- Help desk: Zendesk, Freshdesk — ticket management, canned responses, auto-close
- Email: SendGrid, Mailgun — customer notifications, mass email
- Documentation: Confluence, Notion — runbooks, post-mortem storage, knowledge base
- Project management: Jira, Linear — action item tracking, improvement projects
- Analytics: Custom dashboard — MTTR, incident frequency, impact metrics
- Collaboration: Zoom, Google Meet — remote war room, post-mortem meetings
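The MTTR figure on the analytics dashboard can be computed from declaration and resolution timestamps. This is a minimal sketch with a hypothetical function name; it assumes MTTR here means mean time to resolve, measured from P1 declaration to the IC declaring resolution.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(
    incidents: list[tuple[datetime, datetime]],
) -> timedelta:
    """Average of (resolved - declared) across a list of incidents."""
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (resolved - declared for declared, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)
```

Feeding it the timestamps from each post-mortem timeline keeps the dashboard consistent with what was published.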