---
name: incident-management
description: Manage IT incidents from detection through resolution including triage, investigation, containment, communication, and post-incident review. Use when handling outages, managing major incidents, running incident command, performing root cause analysis, conducting post-mortems, or managing incident communication. Triggers on phrases like "incident response", "outage management", "major incident", "incident triage", "post-mortem", "root cause analysis", "incident communication", "war room", "incident commander".
---

# Incident Management

Restore normal service operation quickly and systematically through structured incident management processes.

## Workflow

### 1. Incident Detection & Logging

1. **Multi-channel detection**:
   - Automated detection from monitoring systems (alerts, thresholds, anomaly detection)
   - User-reported incidents via service portal, email, phone, chat
   - Vendor notifications (cloud provider outages, software vulnerabilities)
   - Social media monitoring for public-facing service issues
   - Proactive detection from synthetic monitoring and uptime checks

2. **Incident creation and initial classification**:
   - Auto-create incident from alerts with enriched context (affected service, timestamp, alert source)
   - Capture: service affected, symptoms, impact scope, user/reporter details
   - Initial classification: Incident vs Problem vs Change vs Request
   - Assign initial priority based on impact × urgency matrix

3. **Priority matrix**:
   - P1/Critical: Major service outage, security breach, data loss — response within 15 minutes
   - P2/High: Significant service degradation, workaround available — response within 30 minutes
   - P3/Medium: Minor service impact, single user or team — response within 2 hours
   - P4/Low: Cosmetic issue, informational request — response within 24 hours
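
The response targets above can be turned into a deadline check. The following is a minimal sketch, assuming incidents carry a priority label and a logged timestamp; the names `RESPONSE_TARGETS`, `response_deadline`, and `is_response_breached` are hypothetical and not tied to any particular ITSM tool.

```python
from datetime import datetime, timedelta

# Response targets copied from the priority list above.
RESPONSE_TARGETS = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(minutes=30),
    "P3": timedelta(hours=2),
    "P4": timedelta(hours=24),
}


def response_deadline(priority: str, logged_at: datetime) -> datetime:
    """Return the latest time a first response is still within target."""
    return logged_at + RESPONSE_TARGETS[priority]


def is_response_breached(priority: str, logged_at: datetime, now: datetime) -> bool:
    """True once the response target has passed (hypothetical helper)."""
    return now > response_deadline(priority, logged_at)
```

A scheduler or alert rule could call `is_response_breached` periodically for unacknowledged incidents and escalate when it returns `True`.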

### 2. Incident Triage & Assignment

1. **Initial triage (L1)**:
   - Verify incident (not false positive)
   - Gather initial diagnostic information
   - Check known errors and KB for existing solutions
   - Attempt basic troubleshooting (service restart, cache clear)
   - Escalate to L2/L3 if unresolved within SLA

2. **Specialist assignment (L2/L3)**:
   - Route by technical domain: infrastructure, application, network, database, security
   - Consider on-call rotation for after-hours incidents
   - Check specialist availability and workload
   - Assign backup engineer for coverage
   - Auto-page for P1/P2 incidents

3. **Major incident declaration**:
   - Criteria: more than 25% of users affected, revenue impact, or executive attention required
   - Notify incident commander and communication lead
   - Open major incident bridge/war room
   - Activate status page
   - Begin stakeholder notification cascade
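
The declaration criteria can be encoded as a simple check. This sketch assumes that any single criterion is enough to declare a major incident, which the list above does not state explicitly; the `ImpactAssessment` type and its fields are hypothetical.

```python
from dataclasses import dataclass

MAJOR_INCIDENT_USER_THRESHOLD = 0.25  # "more than 25% of users" criterion


@dataclass
class ImpactAssessment:
    """Hypothetical summary of what is known about an incident's impact."""
    affected_users: int
    total_users: int
    revenue_impact: bool = False
    executive_attention: bool = False


def should_declare_major(a: ImpactAssessment) -> bool:
    """Assumption: meeting any one criterion is sufficient."""
    share = a.affected_users / a.total_users if a.total_users else 0.0
    return (
        share > MAJOR_INCIDENT_USER_THRESHOLD
        or a.revenue_impact
        or a.executive_attention
    )
```

If your organization requires several criteria to hold together, replace the `or` chain accordingly.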

### 3. Investigation & Diagnosis

1. **Data gathering**:
   - Collect relevant logs (application, system, network, database)
   - Review recent changes (deployment, configuration, infrastructure)
   - Check monitoring dashboards for correlated anomalies
   - Interview affected users for symptom details
   - Review error rates, latency, resource utilization trends

2. **Root cause investigation**:
   - Systematic elimination of potential causes
   - Reproduce issue in non-production if possible
   - Check vendor status pages and known issues
   - Engage external vendors if needed (cloud provider, SaaS vendor)
   - Time-correlate events across systems

3. **Workaround identification**:
   - Identify temporary solution to restore service
   - Test workaround in staging if available
   - Document workaround steps clearly
   - Deploy workaround to production
   - Communicate workaround to affected users
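
Reviewing recent changes and time-correlating events often starts with listing what changed shortly before the incident began. The sketch below assumes change records are available as simple objects; `ChangeEvent`, `changes_before`, and the two-hour default window are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ChangeEvent:
    """Hypothetical record of a deployment, config, or infrastructure change."""
    system: str
    description: str
    applied_at: datetime


def changes_before(events, incident_start, window=timedelta(hours=2)):
    """Return changes applied within `window` before the incident, newest first."""
    earliest = incident_start - window
    candidates = [e for e in events if earliest <= e.applied_at <= incident_start]
    return sorted(candidates, key=lambda e: e.applied_at, reverse=True)
```

The most recent change is listed first because it is usually the first candidate to rule in or out, not because recency proves causation.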

### 4. Containment & Resolution

1. **Containment actions**:
   - Isolate affected components to prevent spread
   - Scale resources to handle degraded load
   - Enable fallback services or maintenance pages
   - Block malicious traffic (security incidents)
   - Freeze deployments and changes to affected systems

2. **Fix implementation**:
   - Apply permanent fix via change management process
   - For emergencies: expedited emergency change approval
   - Deploy fix in staging for validation first
   - Production deployment with rollback readiness
   - Monitor post-deployment metrics closely

3. **Verification**:
   - Confirm service restoration through monitoring
   - Verify with affected users (sample survey for broad impact)
   - Run smoke tests against affected service
   - Monitor for 30 minutes post-fix for stability
   - Clear status page and send resolution notification
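
The 30-minute stability check can be expressed as a small gate before the status page is cleared. This is a sketch under the assumption that failed health checks are recorded as timestamps; `is_stable` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta

STABILITY_WINDOW = timedelta(minutes=30)  # from the verification step above


def is_stable(fix_deployed_at, now, failed_check_times, window=STABILITY_WINDOW):
    """True when the full window has elapsed since the fix with no failed checks."""
    if now - fix_deployed_at < window:
        return False  # still inside the observation window
    # Any failure recorded at or after the fix means the service is not yet stable.
    return not any(t >= fix_deployed_at for t in failed_check_times)
```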

### 5. Communication Management

1. **Stakeholder communication plan**:
   - Internal: engineering teams, management, support staff (every 30 min for P1)
   - External: customers, partners, public (via status page, every 60 min for P1)
   - Executive: C-suite briefings for major incidents
   - Regulator: required notifications for data breaches

2. **Communication templates**:
   - Initial notification: "We are investigating an issue affecting [service]"
   - Update: "We have identified the cause and are implementing a fix"
   - Resolution: "Service has been restored. We are monitoring for stability"
   - Post-incident: "Post-mortem available — we've taken steps to prevent recurrence"

3. **Status page management**:
   - Update status within 15 minutes of incident detection
   - Provide clear, non-technical impact descriptions
   - Set expectations for next update timing
   - Archive status for incident record
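
To keep to the P1 update cadence, it can help to compute when each audience is next due an update. The sketch below covers only the two P1 intervals given in the communication plan; cadences for other priorities are not specified there, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

# P1 cadences from the stakeholder communication plan above.
P1_UPDATE_INTERVALS = {
    "internal": timedelta(minutes=30),
    "external": timedelta(minutes=60),
}


def next_update_due(audience: str, last_update_at: datetime) -> datetime:
    """Return when the next P1 update is due for the given audience."""
    return last_update_at + P1_UPDATE_INTERVALS[audience]


def overdue_audiences(last_updates: dict, now: datetime) -> list:
    """Return audiences whose next P1 update is past due, sorted by name."""
    return sorted(a for a, t in last_updates.items() if now > next_update_due(a, t))
```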

### 6. Post-Incident Review

1. **Post-mortem process**:
   - Schedule within 3 business days of incident resolution
   - Invite all participants and stakeholders
   - Blameless culture focus: process improvement, not individual fault
   - Document timeline, impact, root cause, resolution

2. **Root cause analysis**:
   - 5 Whys analysis to reach fundamental cause
   - Timeline reconstruction with precise timestamps
   - Identify contributing factors (people, process, technology)
   - Distinguish between root cause and proximate cause

3. **Action item tracking**:
   - Define specific, measurable action items with owners and deadlines
   - Categorize: immediate fix, process improvement, long-term prevention
   - Track action item completion (weekly status updates)
   - Verify effectiveness of actions (no recurrence within 90 days)
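
Weekly action item tracking comes down to finding open items that have passed their deadline. A minimal sketch, assuming action items are stored with an owner, a due date, and a completion flag; the `ActionItem` type is hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """Hypothetical post-mortem action item."""
    title: str
    owner: str
    due: date
    done: bool = False


def overdue_items(items, today: date):
    """Return open items whose due date has passed."""
    return [i for i in items if not i.done and i.due < today]
```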

## Templates & Frameworks

### Incident Priority Matrix

```
PRIORITY MATRIX
================

Urgency         | Low Impact | Medium Impact | High Impact | Critical Impact
----------------|------------|---------------|-------------|----------------
Scheduled       | P4         | P3            | P3          | P3
Non-Urgent      | P4         | P3            | P2          | P2
Urgent          | P3         | P2            | P1          | P1
Critical        | P3         | P2            | P1          | P1

IMPACT DEFINITIONS:
  Low: Single user, non-critical function
  Medium: Team/department, degraded functionality
  High: Multiple departments, service unavailable
  Critical: Organization-wide, revenue impact, security breach
```
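
The matrix can be encoded as a lookup table so that priority assignment is consistent across tools. This is a sketch only: the values mirror the matrix above, while `IMPACTS`, `PRIORITY_MATRIX`, and `assign_priority` are hypothetical names.

```python
IMPACTS = ("low", "medium", "high", "critical")

# Rows are urgency levels; columns follow the order in IMPACTS.
PRIORITY_MATRIX = {
    "scheduled":  ("P4", "P3", "P3", "P3"),
    "non-urgent": ("P4", "P3", "P2", "P2"),
    "urgent":     ("P3", "P2", "P1", "P1"),
    "critical":   ("P3", "P2", "P1", "P1"),
}


def assign_priority(impact: str, urgency: str) -> str:
    """Look up the priority for an impact/urgency pair (case-insensitive)."""
    column = IMPACTS.index(impact.lower())
    return PRIORITY_MATRIX[urgency.lower()][column]
```

For example, `assign_priority("High", "Urgent")` returns `"P1"`. Unknown impact or urgency values raise an error, which surfaces bad input early.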

### Post-Mortem Template

```
POST-MORTEM — [Incident ID, Date, Severity]
============================================

EXECUTIVE SUMMARY:
  What happened: [2-3 sentence summary]
  Impact: [affected users/services, duration, business impact]
  Root cause: [fundamental cause identified]

TIMELINE:
  [HH:MM] — Incident detected by [system/person]
  [HH:MM] — Initial triage and classification
  [HH:MM] — Root cause identified
  [HH:MM] — Workaround deployed
  [HH:MM] — Permanent fix deployed
  [HH:MM] — Service fully restored
  [HH:MM] — Monitoring confirmed stable

ROOT CAUSE ANALYSIS (5 Whys):
  Why did [symptom] occur? → [Answer 1]
  Why did [Answer 1] happen? → [Answer 2]
  Why did [Answer 2] happen? → [Answer 3]
  Why did [Answer 3] happen? → [Answer 4]
  Why did [Answer 4] happen? → ROOT CAUSE: [Answer 5]

WHAT WORKED WELL:
  [Positive observations about response]

WHAT COULD BE IMPROVED:
  [Areas for improvement]

ACTION ITEMS:
  [ ] [Action] — Owner: [Name] — Due: [Date] — Priority: [P1/P2/P3]
  [ ] [Action] — Owner: [Name] — Due: [Date] — Priority: [P1/P2/P3]

PREVENTION MEASURES:
  [Long-term changes to prevent recurrence]
```

## Integration Points

- Incident management tools (PagerDuty, Opsgenie, VictorOps): Alerting, paging, escalation
- ITSM platforms (ServiceNow, Jira Service Management): Incident tracking, SLA management
- Monitoring systems (Datadog, New Relic, Prometheus): Detection source, metrics context
- Communication tools (Slack, Teams, Statuspage): Stakeholder communication
- Runbook automation (Confluence, Rundeck): Automated response procedures
- ChatOps (Slack/Teams integrations): Real-time collaboration during incidents
- BI/reporting tools: Incident trend analysis and metrics

## Edge Cases

- **Cascading failures across multiple services**: Prioritize restoration order based on dependency graph; declare separate incidents for each service
- **Vendor-caused outages**: Track vendor communication; set realistic expectations; document for SLA exclusion
- **Security incident overlap**: Coordinate with security team; balance transparency with information security; follow breach notification requirements
- **Recurring incidents**: Trigger problem management ticket; investigate systemic root cause; implement permanent fix
- **After-hours/weekend incidents**: Validate on-call coverage; adjust communication cadence; defer non-critical stakeholder notifications
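
For cascading failures, a restoration order can be derived from the dependency graph with a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9+) and assumes the graph is available as a mapping from each service to the services it depends on; the service names in the usage example are made up.

```python
from graphlib import TopologicalSorter


def restoration_order(dependencies: dict, affected: set) -> list:
    """Order affected services so that dependencies are restored first.

    `dependencies` maps each service to the set of services it depends on.
    """
    order = TopologicalSorter(dependencies).static_order()
    return [service for service in order if service in affected]
```

With `{"web": {"api"}, "api": {"database"}, "database": set()}` and all three services affected, the result is `["database", "api", "web"]`. A cycle in the graph raises `graphlib.CycleError`, which is itself a useful signal that the dependency data needs review.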

## Output

### Incident Management Dashboard

```
INCIDENT DASHBOARD — Real-Time
===============================

ACTIVE INCIDENTS:
  🔴 P1: Production API Gateway — 45 min duration (war room active)
  ⚠  P2: Email delivery delays — 2h 15min duration (investigation)
  ✓  P3: Internal wiki slow — 45 min duration (workaround deployed)

MTTR TRENDS:
  P1 MTTR (30-day avg): 52 min (target: <60 min ✓)
  P2 MTTR (30-day avg): 2h 15min (target: <4h ✓)
  P3 MTTR (30-day avg): 1h 30min (target: <8h ✓)

INCIDENT VOLUME:
  This month: 47 (vs 52 last month ↓)
  P1 incidents: 3 (vs 5 last month ↓)
  Change-related: 8 (17% — target: <20% ✓)

SLA COMPLIANCE:
  Response time SLA: 97% (target: >95% ✓)
  Resolution time SLA: 94% (target: >90% ✓)
  Communication SLA: 99% (target: >95% ✓)

POST-MORTEM TRACKING:
  Completed (last 30 days): 5/5 (100%)
  Open action items: 12
  Overdue action items: 2
```
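
The MTTR and percentage figures in the dashboard can be computed from raw incident records. A minimal sketch, assuming each resolved incident has a known time-to-restore; the helper names are hypothetical.

```python
from datetime import timedelta


def mean_time_to_restore(durations):
    """Average time to restore; returns None when there are no incidents."""
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


def percentage(part: int, total: int) -> int:
    """Whole-number percentage, or 0 when total is zero."""
    return round(100 * part / total) if total else 0
```

Applied to the sample dashboard, `percentage(8, 47)` gives the 17% change-related share shown above.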

## Trigger Phrases

"incident", "outage", "major incident", "war room", "incident response", "post-mortem", "root cause analysis", "incident triage", "incident communication", "incident commander", "P1/P2/P3", "service restoration", "incident bridge", "status page update", "blameless post-mortem", "incident timeline", "on-call escalation"
