Support AI Skill

Major Incident Management

Manage major incidents affecting multiple customers, including incident command, stakeholder communication, cross-team coordination, post-mortem analysis, and prevention planning. Use when handling service outages, coordinating incident response teams, conducting post-mortems, managing crisis communications, or improving incident response processes. Triggers on phrases like "major incident", "incident management", "incident command", "war room", "outage response", "post-mortem", "incident review", "crisis management", "service disruption", "incident commander".

Manage major incidents that affect multiple customers, with structured command, communication, and recovery.

Workflow

Major Incident Response Process

Trigger: P1 incident declared, multi-customer impact detected, or executive escalation received.

1. Incident declaration: On-call engineer or support lead declares P1 incident; notify incident commander (IC); create incident record in tracking system; start incident timer.
2. War room activation: Create dedicated communication channel (Slack #incident-[date]-[summary]); invite IC, engineering leads, support lead, comms lead, executive sponsor; set communication rules (updates only, no debugging in channel).
3. Initial assessment (within 15 minutes):
  - Scope: Which services affected? How many customers?
  - Impact: Revenue impact (estimated), SLA breach risk, customer-facing symptoms
  - Root cause: Initial hypothesis (confirmed or unconfirmed)
  - ETA: Estimated resolution time (with confidence level)
4. Customer notification: First external communication within 30 minutes; use proactive-service-notifications workflow; update status page; prepare support team with canned responses.
5. Resolution work: Engineering team investigates and implements fix; IC tracks progress; comms lead sends updates per cadence (P1: every 30 minutes); support team handles inbound tickets.
6. Resolution verification: Fix deployed; verification tests run; monitoring confirms stability for 15 minutes; IC declares incident resolved; update all channels.
7. Recovery monitoring: Enhanced monitoring for 2 hours post-resolution; watch for recurrence; support team monitors for related tickets; comms sends "all clear" notification.
8. Post-mortem (within 48 hours): Schedule blameless post-mortem meeting; document timeline, root cause, impact, response effectiveness, action items; publish internally (and externally for P1).
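
The time targets in the steps above can be turned into concrete deadlines once the incident timer starts. The following is a minimal sketch, not a prescribed implementation: the function and key names are hypothetical, and it assumes that the stability window starts at fix deployment and that the 48-hour post-mortem window is counted from resolution, neither of which the workflow states explicitly.

```python
from datetime import datetime, timedelta, timezone

# Targets taken from the workflow; the key names are illustrative only.
DECLARATION_TARGETS = {
    "initial_assessment_due": timedelta(minutes=15),
    "first_customer_notification_due": timedelta(minutes=30),
}
UPDATE_CADENCE = timedelta(minutes=30)    # P1 update cadence
STABILITY_WINDOW = timedelta(minutes=15)  # monitoring must stay stable this long
RECOVERY_WINDOW = timedelta(hours=2)      # enhanced monitoring after resolution
POSTMORTEM_WINDOW = timedelta(hours=48)   # assumption: counted from resolution


def declaration_deadlines(declared_at):
    """Deadlines that start counting when the P1 is declared."""
    return {name: declared_at + delta for name, delta in DECLARATION_TARGETS.items()}


def next_update_due(declared_at, now):
    """Next scheduled update time strictly after `now`."""
    elapsed = max(now - declared_at, timedelta(0))
    completed_slots = elapsed // UPDATE_CADENCE  # timedelta // timedelta -> int
    return declared_at + (completed_slots + 1) * UPDATE_CADENCE


def recovery_schedule(fix_deployed_at):
    """Earliest resolution time, end of enhanced monitoring, post-mortem deadline."""
    earliest_resolution = fix_deployed_at + STABILITY_WINDOW
    return {
        "earliest_resolution": earliest_resolution,
        "enhanced_monitoring_ends": earliest_resolution + RECOVERY_WINDOW,
        "postmortem_due": earliest_resolution + POSTMORTEM_WINDOW,
    }


if __name__ == "__main__":
    declared = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(declaration_deadlines(declared))
    print(next_update_due(declared, declared + timedelta(minutes=45)))
```

Using timezone-aware timestamps keeps the deadlines unambiguous when responders sit in different time zones.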

Incident Command Structure

INCIDENT COMMAND STRUCTURE
============================

Roles:

Incident Commander (IC):
  - Overall leader; makes final decisions
  - Manages communication cadence
  - Tracks resolution progress
  - Declares incident start/end
  - Authorizes escalation to executives
  - Primary contact for external communications
  - Cannot be the person implementing the fix

Engineering Lead:
  - Leads technical investigation
  - Implements fix or workaround
  - Provides technical status updates to IC
  - Manages engineering team assignments
  - Validates fix before deployment

Support Lead:
  - Manages inbound customer tickets
  - Coordinates canned responses
  - Flags critical customer situations (enterprise, VIP)
  - Reports ticket volume trends to IC
  - Manages SLA exceptions during incident

Communications Lead:
  - Drafts and sends external updates
  - Manages status page
  - Handles social media communications
  - Prepares post-incident customer communication
  - Coordinates with PR/legal if needed

Executive Sponsor:
  - Notified for P1 incidents
  - Receives hourly updates
  - Authorizes compensation (credits, refunds)
  - Contacts enterprise customers personally if needed
  - Escalates to the board for material incidents

Escalation Paths:
  IC → Engineering VP → CTO (if resolution > 2 hours)
  IC → Support VP → CSO (if ticket volume > 5× normal)
  Comms Lead → PR → CEO (if media/social media attention)
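
The escalation thresholds above are simple enough to check mechanically. This is a rough sketch under stated assumptions: the `IncidentSnapshot` fields and the function name are hypothetical, and "normal" ticket volume is assumed to be an hourly baseline supplied by the caller.

```python
from dataclasses import dataclass


@dataclass
class IncidentSnapshot:
    """Hypothetical point-in-time view of an incident."""
    hours_unresolved: float
    tickets_last_hour: int
    normal_tickets_per_hour: float  # assumed baseline for the same hour
    media_attention: bool


def escalation_paths(snapshot):
    """Return every escalation path whose threshold is currently exceeded."""
    paths = []
    if snapshot.hours_unresolved > 2:
        paths.append("IC → Engineering VP → CTO")
    if snapshot.tickets_last_hour > 5 * snapshot.normal_tickets_per_hour:
        paths.append("IC → Support VP → CSO")
    if snapshot.media_attention:
        paths.append("Comms Lead → PR → CEO")
    return paths


if __name__ == "__main__":
    snapshot = IncidentSnapshot(
        hours_unresolved=2.5,
        tickets_last_hour=120,
        normal_tickets_per_hour=20,
        media_attention=False,
    )
    for path in escalation_paths(snapshot):
        print(path)
```

More than one path can apply at once, so the function returns a list instead of stopping at the first match.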

Post-Mortem Template

POST-MORTEM TEMPLATE — BLAMELESS REVIEW
==========================================

Incident Summary:
  Title: [Brief description — e.g., "Payment Processing Outage"]
  Date: [Date and time range]
  Severity: P1 / P2
  Duration: [Start time] to [End time] ([X] hours [Y] minutes)
  Incident Commander: [Name]

Impact:
  Customers affected: [X] ([Y]% of total customer base)
  Revenue impact: $[X] (estimated lost revenue during outage)
  SLA breaches: [X] customers (credits issued: $[Y])
  Support tickets: [X] tickets created during incident ([Y]% increase vs. normal)
  Reputation: [Media mentions, social media sentiment, customer complaints]

Timeline:
  [HH:MM] — Incident detected by [system/person]
  [HH:MM] — P1 declared by [person]
  [HH:MM] — War room activated
  [HH:MM] — First customer notification sent
  [HH:MM] — Root cause identified: [description]
  [HH:MM] — Fix deployed
  [HH:MM] — Resolution verified
  [HH:MM] — Incident declared resolved
  [HH:MM] — All-clear notification sent

Root Cause:
  Technical root cause: [Detailed explanation]
  Contributing factors: [List of factors that made the incident worse]
  Why detection was delayed: [If applicable — monitoring gap, alert threshold, etc.]
  Why escalation was delayed: [If applicable — process gap, communication issue, etc.]

What Went Well:
  1. [Positive aspect — e.g., "Fast root cause identification"]
  2. [Positive aspect — e.g., "Clear customer communication"]
  3. [Positive aspect — e.g., "Effective war room coordination"]

What Could Be Improved:
  1. [Area for improvement — e.g., "Monitoring gap for service X"]
  2. [Area for improvement — e.g., "Slow customer notification (45 min vs. 30 min target)"]
  3. [Area for improvement — e.g., "No documented runbook for this scenario"]

Action Items:
  1. [Action] — Owner: [Name] — Due: [Date] — Priority: [High/Medium/Low]
  2. [Action] — Owner: [Name] — Due: [Date] — Priority: [High/Medium/Low]
  3. [Action] — Owner: [Name] — Due: [Date] — Priority: [High/Medium/Low]

Prevention:
  - How to prevent this from happening again: [Long-term prevention strategy]
  - How to detect faster if it does happen: [Monitoring/alerting improvements]
  - How to respond more effectively: [Process/tool improvements]

Distribution:
  - Internal: All engineering, support, and leadership teams
  - External: Public post-mortem published on blog/status page (for P1)
  - Follow-up: Action items tracked in Jira; reviewed in monthly ops meeting
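
Several fields in the template are derived values (duration, share of the customer base, ticket increase). A small sketch of how they might be computed follows; the function names are hypothetical, and the ticket increase assumes a known baseline count for a comparable period.

```python
from datetime import datetime


def format_duration(start, end):
    """Render the Duration field as '[X] hours [Y] minutes'."""
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"


def percent_of_base(affected, total_customers):
    """Share of the total customer base affected, in percent."""
    return round(100 * affected / total_customers, 1)


def ticket_increase_percent(incident_tickets, normal_tickets):
    """Percentage increase in tickets vs. the assumed normal volume."""
    return round(100 * (incident_tickets - normal_tickets) / normal_tickets, 1)


if __name__ == "__main__":
    start = datetime(2024, 3, 1, 14, 5)
    end = datetime(2024, 3, 1, 16, 47)
    print(format_duration(start, end))
    print(percent_of_base(1200, 48000))
    print(ticket_increase_percent(450, 150))
```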

Edge Cases

Cascading failures (one incident triggers multiple system failures):
Response: Single incident commander for all cascading failures; prioritize customer-facing impact; fix root cause first, then downstream effects
Communication: Single incident thread; avoid multiple separate communications that confuse customers
Complexity: May require multiple engineering teams; IC coordinates priorities and dependencies
Example: Database migration fails → API down → payments fail → reporting errors → customer notifications fail
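
One way to make "fix root cause first, then downstream effects" concrete is to order the failing components by dependency. The sketch below uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+) on the example chain; the component names and the `repair_order` helper are illustrative assumptions, and a real incident would still need the IC to weigh customer-facing impact.

```python
from graphlib import TopologicalSorter

# Maps each failing component to the components it depends on.
# Names mirror the example chain above and are illustrative only.
DEPENDS_ON = {
    "api": {"database_migration"},
    "payments": {"api"},
    "reporting": {"payments"},
    "customer_notifications": {"reporting"},
}


def repair_order(dependencies):
    """Return components with each dependency listed before its dependents."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for step, component in enumerate(repair_order(DEPENDS_ON), start=1):
        print(step, component)
```

`TopologicalSorter` raises `graphlib.CycleError` if the dependency map contains a cycle, which is itself a useful signal during a cascading failure.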

Security-related incidents (data breach, unauthorized access):
Escalation: Immediate notification to security team + legal + executive leadership
Communication: Legal-approved messaging only; no technical details until investigation complete
Regulatory: Meet breach notification deadlines (GDPR: notify the supervisory authority within 72 hours of becoming aware of the breach; CCPA: reasonable time; check industry-specific requirements)
Coordination: Security team leads investigation; IC manages operational response; comms manages external communication
Documentation: Detailed forensic log; preserve evidence; external auditor engagement if required
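
The 72-hour GDPR window can be tracked as a countdown alongside the incident timer. This is a minimal sketch with hypothetical function names; it assumes the clock starts when the organization becomes aware of the breach, and legal counsel should confirm the actual start time in each case.

```python
from datetime import datetime, timedelta, timezone

GDPR_NOTIFICATION_WINDOW = timedelta(hours=72)


def gdpr_notification_deadline(aware_at):
    """Deadline for notifying the supervisory authority."""
    return aware_at + GDPR_NOTIFICATION_WINDOW


def hours_remaining(aware_at, now):
    """Hours left before the deadline; negative once it has passed."""
    remaining = gdpr_notification_deadline(aware_at) - now
    return remaining.total_seconds() / 3600


if __name__ == "__main__":
    aware = datetime(2024, 5, 6, 9, 0, tzinfo=timezone.utc)
    now = datetime(2024, 5, 7, 15, 0, tzinfo=timezone.utc)
    print(gdpr_notification_deadline(aware))
    print(hours_remaining(aware, now))
```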

Cross-organization incidents (SaaS dependency outage — AWS, Stripe, Twilio):
Response: Monitor upstream provider status; adjust communication to reflect dependency ("Our payment processing partner is experiencing issues")
Customer expectation: Customers may blame you regardless; communicate proactively; show active monitoring of upstream provider
SLA: Check contract terms; the upstream provider may owe credits or compensation, which can be passed along to affected customers if appropriate
Prevention: Multi-provider strategy for critical dependencies; fallback systems

Incident fatigue (frequent incidents causing customer or team burnout):
Detection: Team satisfaction drops; customer complaints about frequency; increase in churn
Response: Executive review of incident trends; root cause analysis of recurring patterns; resource investment in reliability
Communication: Acknowledge pattern in post-mortem ("This is the third incident this quarter; here's our plan to improve")
Prevention: Reliability engineering investment; chaos testing; improved deployment practices
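
Spotting a pattern such as "the third incident this quarter" requires only a simple tally of past incidents. The sketch below assumes each incident is recorded as a (date, root-cause label) pair; that record shape, the function names, and the threshold of two occurrences are all assumptions for illustration.

```python
from collections import Counter
from datetime import date


def quarter_label(day):
    """Label a date with its calendar quarter, e.g. '2024-Q1'."""
    return f"{day.year}-Q{(day.month - 1) // 3 + 1}"


def incidents_per_quarter(incidents):
    """Count incidents per calendar quarter."""
    return Counter(quarter_label(day) for day, _cause in incidents)


def recurring_causes(incidents, min_occurrences=2):
    """Root-cause labels seen at least `min_occurrences` times."""
    counts = Counter(cause for _day, cause in incidents)
    return {cause: n for cause, n in counts.items() if n >= min_occurrences}


if __name__ == "__main__":
    history = [
        (date(2024, 1, 10), "deploy"),
        (date(2024, 2, 2), "deploy"),
        (date(2024, 3, 20), "dns"),
        (date(2024, 4, 5), "deploy"),
    ]
    print(incidents_per_quarter(history))
    print(recurring_causes(history))
```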

After-hours incidents (no senior staff available):
Response: On-call engineer acts as IC; escalate to available senior staff via phone; follow documented runbooks
Communication: Automated customer notification if engineer cannot draft; use pre-approved templates
Support: On-call support agent handles tickets; escalation to day team if beyond scope
Prevention: Robust runbooks; clear on-call escalation paths; automated responses for known scenarios

Integration Points

Incident management: PagerDuty, Opsgenie, ServiceNow — incident tracking, on-call scheduling, escalation
Monitoring: Datadog, New Relic, Sentry, Grafana — incident detection, alerting, metrics
Communication: Slack, Teams — war room channels, update coordination
Status page: Atlassian Statuspage, Better Uptime — public status updates, subscriber notifications
Help desk: Zendesk, Freshdesk — ticket management, canned responses, auto-close
Email: SendGrid, Mailgun — customer notifications, mass email
Documentation: Confluence, Notion — runbooks, post-mortem storage, knowledge base
Project management: Jira, Linear — action item tracking, improvement projects
Analytics: Custom dashboard — MTTR, incident frequency, impact metrics
Collaboration: Zoom, Google Meet — remote war room, post-mortem meetings
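
The analytics metrics listed above can be computed from incident records exported by any of the tracking tools. This is a sketch only: MTTR is taken here to mean mean time to resolve, measured from detection to resolution, and the (detected, resolved) record shape and function names are assumptions, not any vendor's API.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents):
    """Average of (resolved - detected); None when there are no incidents."""
    if not incidents:
        return None
    total = sum((resolved - detected for detected, resolved in incidents), timedelta(0))
    return total / len(incidents)


def incidents_per_30_days(incident_count, period_days):
    """Incident frequency normalized to a 30-day window."""
    return incident_count * 30 / period_days


if __name__ == "__main__":
    records = [
        (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 11, 0)),
        (datetime(2024, 2, 9, 22, 0), datetime(2024, 2, 10, 0, 0)),
    ]
    print(mean_time_to_resolve(records))
    print(incidents_per_30_days(len(records), period_days=60))
```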

Disclaimer: All rights reserved by Circulos AI. These skills are specifically designed for Claude Code, Claude Cowork, Codex, and OpenClaw. When using or referencing any skill, please provide proper attribution to Circulos AI.