---
name: incident-response-playbook
description: Create, execute, and manage incident response playbooks for IT operational incidents including outages, performance degradation, security breaches, and infrastructure failures. Use when developing incident response procedures, handling production incidents, conducting post-incident reviews, managing on-call rotations, defining incident severity levels, or building incident management frameworks. Triggers on phrases like "incident response", "incident playbook", "production incident", "outage response", "incident management", "post-mortem", "incident review", "on-call", "severity level", "war room".
---

# Incident Response Playbook

Structured approach to handling IT operational incidents from detection through resolution and learning.

## Workflow

1. Detect incident via monitoring, user reports, or automated alerts.
2. Triage and classify incident severity (SEV-1 through SEV-4).
3. Activate incident response team based on severity level.
4. Establish incident command structure: Incident Commander, Communications Lead, Technical Lead.
5. Execute diagnostic procedures to identify root cause.
6. Implement mitigation or fix; verify resolution.
7. Conduct post-incident review within 48 hours (SEV-1) or 5 business days (SEV-2/3).
8. Document lessons learned and create action items to prevent recurrence.
9. Update playbooks and monitoring based on findings.
10. Track action item completion to closure.
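
The review deadlines in step 7 can be computed mechanically once an incident is resolved. The following is a minimal sketch, not a definitive implementation: the function names `add_business_days` and `pir_deadline` are hypothetical, and it assumes business days are Monday through Friday with no holiday calendar.

```python
from datetime import datetime, timedelta
from typing import Optional


def add_business_days(start: datetime, days: int) -> datetime:
    """Advance by whole business days, skipping Saturday and Sunday.

    Assumption: no holiday calendar is applied.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 .. Friday=4
            remaining -= 1
    return current


def pir_deadline(severity: str, resolved_at: datetime) -> Optional[datetime]:
    """Latest time the post-incident review should take place."""
    if severity == "SEV-1":
        return resolved_at + timedelta(hours=48)
    if severity in ("SEV-2", "SEV-3"):
        return add_business_days(resolved_at, 5)
    return None  # the workflow sets no review deadline for SEV-4


if __name__ == "__main__":
    resolved = datetime(2024, 3, 1, 10, 0)  # a Friday
    print(pir_deadline("SEV-1", resolved))
    print(pir_deadline("SEV-2", resolved))
```

A team with regional holidays would likely replace the weekday check with a lookup against its own calendar.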

## Incident Severity Classification

```
INCIDENT SEVERITY FRAMEWORK
============================

SEV-1 (Critical) — Major Outage
  Definition: Complete service outage affecting all or majority of users;
               data loss; security breach; revenue impact > $100K/hour.
  Impact: All users affected or business operations halted
  Response Time: Immediate (< 5 minutes to acknowledge)
  Resolution Target: 2 hours maximum (4 hours with approval from VP)
  Escalation: CTO, VP Engineering, Head of Operations automatically notified
  Communication: Customer-facing status page updated within 30 minutes;
                 executive briefing within 1 hour
  Examples:
    - Production database cluster down (no failover)
    - Authentication service unavailable (no one can log in)
    - Ransomware detected encrypting production data
    - Payment processing system down during peak hours
    - Data breach confirmed (customer data exposed)

SEV-2 (High) — Significant Degradation
  Definition: Major functionality impaired; partial outage; performance
               severely degraded; affecting 25%+ of users.
  Impact: Significant user impact; workaround may exist
  Response Time: 15 minutes to acknowledge
  Resolution Target: 8 hours (one business day)
  Escalation: Engineering manager, operations manager notified
  Communication: Status page updated within 1 hour;
                 customer notification if affecting > 1,000 users
  Examples:
    - API response times > 10 seconds (normal: < 200ms)
    - 50% of payment transactions failing
    - Single-region failure with limited impact (multi-region setup)
    - Critical feature broken (e.g., checkout, search, reporting)
    - Security vulnerability exploited (no data loss yet)

SEV-3 (Medium) — Minor Degradation
  Definition: Minor functionality impaired; non-critical feature broken;
               performance degraded but usable; affecting < 25% of users.
  Impact: Noticeable user impact but core functionality works
  Response Time: 1 hour to acknowledge
  Resolution Target: 24 hours (by the next business day)
  Escalation: Team lead notified
  Communication: Internal notification; no external communication required
  Examples:
    - Non-critical UI bug affecting specific browser
    - Reporting dashboard loading slowly
    - Email notifications delayed by 30+ minutes
    - Single microservice degraded with fallback active
    - CDN performance degradation in specific region

SEV-4 (Low) — Minor Issue
  Definition: Cosmetic issue; minor inconvenience; no functional impact;
               affecting < 1% of users.
  Impact: Minimal to none on business operations
  Response Time: Next business day
  Resolution Target: 5 business days (include in sprint backlog)
  Escalation: Assigned to appropriate team member
  Communication: No notification required
  Examples:
    - Typo in UI text
    - Non-critical feature cosmetic issue
    - Documentation out of date
    - Performance optimization opportunity
    - Feature request disguised as bug report

Severity matrix (Impact × Urgency):

                    Urgency: Low        Medium          High
  Impact:  Low           SEV-4            SEV-3          SEV-3
           Medium        SEV-3            SEV-3          SEV-2
           High          SEV-3            SEV-2          SEV-1
           Critical      SEV-2            SEV-1          SEV-1
```
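
The Impact × Urgency matrix above maps directly onto a lookup table. This is a minimal sketch under stated assumptions: the names `SEVERITY_MATRIX` and `classify` are hypothetical, and the lowercase impact and urgency labels are an illustrative encoding of the matrix rows and columns.

```python
# Rows are impact levels, columns are urgency levels, copied from the matrix.
SEVERITY_MATRIX = {
    "low":      {"low": "SEV-4", "medium": "SEV-3", "high": "SEV-3"},
    "medium":   {"low": "SEV-3", "medium": "SEV-3", "high": "SEV-2"},
    "high":     {"low": "SEV-3", "medium": "SEV-2", "high": "SEV-1"},
    "critical": {"low": "SEV-2", "medium": "SEV-1", "high": "SEV-1"},
}


def classify(impact: str, urgency: str) -> str:
    """Return the severity for an impact/urgency pair (case-insensitive)."""
    try:
        return SEVERITY_MATRIX[impact.lower()][urgency.lower()]
    except KeyError:
        raise ValueError(
            f"unknown impact/urgency: {impact!r}/{urgency!r}"
        ) from None


if __name__ == "__main__":
    print(classify("High", "Medium"))
```

Rejecting unknown labels with an error, rather than defaulting to a low severity, keeps a typo from silently under-classifying an incident.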

## Incident Response Process

```
SEV-1 INCIDENT RESPONSE PLAYBOOK
===================================

Phase 1: Detection and Activation (T+0 to T+5 minutes)

  Step 1 — Incident Detection:
    - Automated alert fires (monitoring system: PagerDuty, Datadog, New Relic)
    - Alert includes: service name, severity, error type, affected region, timestamp
    - On-call engineer acknowledges alert within 5 minutes
    - If not acknowledged in 5 minutes: automatic escalation to backup on-call

  Step 2 — Initial Triage:
    - On-call engineer verifies alert is real (not false positive)
    - Check status page for existing incident (avoid duplicate incidents)
    - Check dashboards for related issues (cascade effect)
    - Confirm severity level (SEV-1 if confirmed)

  Step 3 — Incident Declaration:
    - Create incident ticket in tracking system (Jira Service Management, ServiceNow)
    - Open incident bridge call (Zoom, Google Meet, Slack huddle)
    - Tag incident as SEV-1
    - Post to #incidents Slack channel with initial summary
    - Auto-notify stakeholders via PagerDuty escalation policy

  Step 4 — Assign Roles:
    - Incident Commander (IC): On-call engineer or senior engineer; manages the process
    - Technical Lead: Most knowledgeable engineer on affected system; leads diagnosis
    - Communications Lead: Manages internal/external communication (can be IC initially)
    - Scribe: Documents timeline and decisions (can be automated with transcript tools)
    - Other responders: Subject matter experts as needed

  Team activation:
    - Core team: 2–4 engineers maximum (limits coordination overhead, maintains efficiency)
    - Subject matter experts: DBA, network engineer, security engineer (as needed)
    - Management observer: VP/Director (informs stakeholders, does not interfere)

Phase 2: Diagnosis and Mitigation (T+5 to T+60 minutes)

  Step 5 — Situation Assessment:
    - IC: "What do we know? What don't we know? What's the impact?"
    - Review dashboards: error rates, response times, infrastructure metrics
    - Check recent deployments (last 24 hours): was there a change?
    - Check external factors: cloud provider status, CDN status, DNS health
    - Determine user impact: how many users affected, what functionality broken

  Step 6 — Hypothesis and Testing:
    - Technical Lead forms hypothesis: "We believe X caused Y"
    - Test hypothesis with targeted investigation:
      * Check logs for error patterns
      * Compare current vs. baseline metrics
      * Check dependency health (upstream/downstream services)
      * Run diagnostic scripts/playbooks
    - Validate or invalidate hypothesis within 15 minutes

  Step 7 — Mitigation Decision:
    - Options analysis (choose fastest path to restore service):
      * Rollback: revert last deployment (5–15 minutes)
      * Failover: switch to backup region/system (2–10 minutes)
      * Scale: add capacity if overload (5–20 minutes)
      * Fix: deploy hotfix (30–120 minutes, riskier)
      * Disable: turn off problematic feature with feature flag (1–5 minutes)
    - IC makes decision based on: speed, risk, confidence, reversibility
    - Rule: prefer rollback or failover over hotfix during active incident

  Step 8 — Execute Mitigation:
    - Technical Lead executes chosen mitigation
    - Monitor impact in real-time (dashboard, error rates, user reports)
    - Confirm service restoration (synthetic tests, health checks)
    - Communicate progress: "Working on fix, expected resolution in X minutes"

Phase 3: Resolution and Communication (T+60 minutes to resolution)

  Step 9 — Verification:
    - Confirm all services restored to normal operation
    - Monitor for 15–30 minutes post-fix for stability
    - Run regression tests if code fix deployed
    - Verify data integrity (no data loss/corruption)
    - Confirm with business stakeholders

  Step 10 — Resolution Communication:
    - Update status page: "Resolved" with brief explanation
    - Post resolution to #incidents channel
    - Send customer notification (if external impact)
    - Close incident bridge call
    - Transition ticket to "Resolved" status

Phase 4: Post-Incident (Within 48 hours for SEV-1)

  Step 11 — Post-Incident Review (PIR):
    - Schedule within 24 hours of resolution and hold within 48 hours (while details are fresh)
    - Attendees: all incident participants + interested parties
    - Duration: 60–90 minutes
    - Format: blameless retrospective (focus on process, not people)
    - Required: written incident report published within 48 hours

  Step 12 — Action Items:
    - Identify root cause (5-Whys analysis)
    - Create prevention items (avoid recurrence)
    - Create detection items (faster detection next time)
    - Create response items (faster resolution next time)
    - Assign owners and due dates for each action item
    - Track action items in project management system
    - Follow up in 30 days to verify completion
```
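
The options analysis in Step 7 can be approximated by ordering the applicable options by their worst-case time to restore service. This is a rough sketch only: the `Mitigation` class and `rank_mitigations` function are hypothetical, the durations are the ranges listed in Step 7, and it considers speed alone. Risk, confidence, and reversibility remain the Incident Commander's judgment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mitigation:
    name: str
    min_minutes: int
    max_minutes: int


# Time ranges taken from the Step 7 options list.
OPTIONS = [
    Mitigation("rollback", 5, 15),
    Mitigation("failover", 2, 10),
    Mitigation("scale", 5, 20),
    Mitigation("hotfix", 30, 120),
    Mitigation("disable_feature", 1, 5),
]


def rank_mitigations(applicable: set[str]) -> list[str]:
    """Order the applicable options by worst-case duration, fastest first."""
    candidates = [m for m in OPTIONS if m.name in applicable]
    return [m.name for m in sorted(candidates, key=lambda m: m.max_minutes)]


if __name__ == "__main__":
    print(rank_mitigations({"hotfix", "rollback", "failover"}))
```

Because a hotfix has the longest worst case, sorting on that value also reproduces the rule of preferring rollback or failover over a hotfix during an active incident.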

## On-Call Management

```
ON-CALL MANAGEMENT FRAMEWORK
==============================

On-call rotation structure:

  Primary on-call:
    - Schedule: 1 week on, 3 weeks off (typical)
    - Hours: 24/7 coverage (24-hour shifts) or business hours only
    - Expectation: acknowledge alerts within 5 minutes
    - PagerDuty integration: phone call, SMS, email, mobile push
    - Compensation: on-call stipend $500–$2,000/month (varies by company/role)

  Secondary on-call:
    - Backup if primary unavailable or unresponsive
    - Steps in after 5–10 minute escalation
    - Same rotation as primary but offset schedule

  On-call handover process:
    T-1 day:
      - Review open incidents and known issues
      - Update on-call runbook with current system state
      - Brief secondary on-call on any active concerns
    T-0 day:
      - Formal handover meeting (15 minutes, video call)
      - Verify pager/alerting working (test alert)
      - Confirm access to all required tools and dashboards

On-call health metrics:

  Target metrics:
    - Alert acknowledgment time (p50): < 5 minutes
    - Alert acknowledgment time (p95): < 15 minutes
    - Alerts per on-call week (acceptable): < 20 (outside business hours)
    - Escalations to secondary: < 2 per week
    - SEV-1 incidents per quarter: < 4 (example target; calibrate against your own history)

  Unhealthy signals:
    - > 30 alerts per week (alert fatigue — reduce noise)
    - > 5 SEV-1 incidents per quarter (systemic reliability issues)
    - On-call burnout signs: delayed acknowledgments, complaints, turnover
    - Single point of failure: only 1 person who can fix critical system

On-call burnout prevention:

  1. Alert reduction:
     - Review all alerts monthly; disable non-actionable alerts
     - Implement alert grouping (related alerts → single notification)
     - Use intelligent alerting (ML-based anomaly detection)
     - Target: < 5 false positive alerts per on-call week

  2. Runbook coverage:
     - Every alert must have a corresponding runbook
     - Runbook includes: diagnosis steps, remediation steps, escalation path
     - Test runbooks quarterly (ensure they work)
     - Runbook quality score: > 80% coverage of common incidents

  3. Knowledge distribution:
     - Minimum 2 people who can handle each critical system
     - Cross-training sessions monthly
     - Document tribal knowledge in wiki/runbooks
     - Rotate on-call duties across team (no permanent on-call)
```
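
The target metrics above lend themselves to an automated weekly check. The sketch below is illustrative: `percentile` and `oncall_health_findings` are hypothetical names, the percentile uses the nearest-rank method (one of several common definitions), and the inputs are assumed to come from an alerting tool's export.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def oncall_health_findings(
    ack_minutes: list[float],
    off_hours_alerts: int,
    escalations: int,
    sev1_this_quarter: int,
) -> list[str]:
    """Return one message per target metric that was missed."""
    checks = [
        (percentile(ack_minutes, 50) < 5,
         "p50 acknowledgment time is not under 5 minutes"),
        (percentile(ack_minutes, 95) < 15,
         "p95 acknowledgment time is not under 15 minutes"),
        (off_hours_alerts < 20, "20 or more off-hours alerts this week"),
        (escalations < 2, "2 or more escalations to secondary this week"),
        (sev1_this_quarter < 4, "4 or more SEV-1 incidents this quarter"),
    ]
    return [message for ok, message in checks if not ok]


if __name__ == "__main__":
    acks = [1, 2, 2, 3, 4, 4, 6, 9, 12, 20]
    for finding in oncall_health_findings(acks, 25, 1, 2):
        print(finding)
```

An empty result means every target was met for the period being checked.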

## Post-Incident Review

```
POST-INCIDENT REVIEW TEMPLATE
===============================

Incident Report — [Date] — [Service/System] — SEV-[Level]

1. Summary (1–2 paragraphs):
   - What happened, when, how long, what was affected
   - Resolution approach and outcome
   - Business impact (users, revenue, SLA)

2. Timeline:
   - [HH:MM UTC] — Incident detected (by whom, how)
   - [HH:MM UTC] — SEV-[Level] declared
   - [HH:MM UTC] — Incident team activated
   - [HH:MM UTC] — Root cause identified
   - [HH:MM UTC] — Fix implemented
   - [HH:MM UTC] — Service restored
   - [HH:MM UTC] — Incident closed

3. Impact:
   - Duration: [X hours Y minutes]
   - Users affected: [number or percentage]
   - Revenue impact: [$ amount if calculable]
   - SLA impact: [did it breach any SLA commitments?]
   - Data impact: [any data loss or corruption?]
   - Customer communications: [how many notifications sent]

4. Root Cause Analysis (5-Whys):
   - Why did the incident occur? [direct cause]
   - Why was the direct cause not prevented? [underlying cause]
   - Why was the underlying cause not detected? [systemic cause]
   - Why did the detection system not catch it? [process gap]
   - Why was the process not in place? [organizational cause]

5. What Went Well:
   - Fast detection time
   - Effective communication
   - Quick resolution
   - Team coordination

6. What Could Be Improved:
   - Detection could be faster (current: X min, target: Y min)
   - Communication could include more technical detail
   - Runbook was outdated
   - Team lacked access to X tool

7. Action Items:
   | # | Action | Type | Owner | Due Date | Status |
   |---|--------|------|-------|----------|--------|
   | 1 | Add monitoring for X metric | Detection | Eng A | MM/DD | Open |
   | 2 | Update runbook for Y service | Prevention | Eng B | MM/DD | Open |
   | 3 | Implement circuit breaker for Z | Prevention | Eng C | MM/DD | Open |
   | 4 | Cross-train team on database failover | Response | Manager | MM/DD | Open |

8. Classification:
   - Type: [Infrastructure, Application, Network, Security, Third-party, Human error]
   - Severity: [SEV-1/2/3/4]
   - Category: [Outage, Performance, Data, Security, Third-party]
   - Recurring: [Yes/No — if yes, reference prior incidents]
```
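
The Duration field in section 3 of the template can be derived from the timeline in section 2. A minimal sketch follows, assuming timeline entries are stored as full UTC datetimes rather than bare HH:MM values, so that incidents crossing midnight are handled. The dictionary keys and function names are hypothetical.

```python
from datetime import datetime


def format_duration(start: datetime, end: datetime) -> str:
    """Format an interval in the template's "X hours Y minutes" style."""
    total_minutes = int((end - start).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"


def timeline_metrics(timeline: dict[str, datetime]) -> dict[str, str]:
    """Derive headline durations, measured from detection."""
    detected = timeline["detected"]
    return {
        "time_to_root_cause": format_duration(
            detected, timeline["root_cause_identified"]
        ),
        "time_to_restore": format_duration(
            detected, timeline["service_restored"]
        ),
    }


if __name__ == "__main__":
    timeline = {
        "detected": datetime(2024, 3, 1, 23, 40),
        "root_cause_identified": datetime(2024, 3, 2, 0, 25),
        "service_restored": datetime(2024, 3, 2, 1, 15),
    }
    print(timeline_metrics(timeline))
```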

## Incident Communication Templates

```
INCIDENT COMMUNICATION TEMPLATES
==================================

Template 1: Initial Internal Notification (Slack #incidents)
  🚨 SEV-[Level] — [Service Name] — [Brief Description]
  Detected: [Time UTC]
  Impact: [Who/what is affected]
  Status: Investigating
  Bridge: [Link to call]
  On-call: [@person]

Template 2: Status Page Update (Initial)
  DEGRADED PERFORMANCE / OUTAGE — [Service Name]
  We are currently investigating [issue description].
  Our team is actively working on a resolution.
  Next update in 30 minutes.
  Last updated: [Time UTC]

Template 3: Status Page Update (Identified)
  IDENTIFIED — [Service Name]
  We have identified the root cause as [brief explanation].
  Our team is working on a fix.
  Estimated resolution: [Timeframe]
  Next update in 30 minutes.
  Last updated: [Time UTC]

Template 4: Status Page Update (Resolved)
  RESOLVED — [Service Name]
  The issue affecting [service] has been resolved.
  All services are operating normally.
  A post-incident report will be published within 48 hours.
  We apologize for the inconvenience.

Template 5: Customer Email Notification (SEV-1 with customer impact)
  Subject: Service Update — [Company] — [Date]
  Dear [Customer],
  We experienced a service disruption on [Date] from [Start Time] to [End Time].
  During this period, [description of impact in customer terms].
  The issue has been resolved and all services are operating normally.
  We are conducting a thorough review to prevent recurrence.
  A detailed post-incident report is available at [Link].
  We sincerely apologize for the disruption to your business.
  Best regards,
  [Name]
  [Title]

Communication cadence by severity:
  SEV-1: Every 30 minutes until resolved
  SEV-2: Every 1 hour until resolved
  SEV-3: When status changes
  SEV-4: No external communication
```
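
The communication cadence above can drive a reminder for the Communications Lead. This is a small sketch under assumptions: `UPDATE_INTERVAL` and `next_update_due` are hypothetical names, and only the fixed-interval severities are scheduled, since SEV-3 updates follow status changes and SEV-4 has no external communication.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Fixed update intervals from the cadence list; other severities have none.
UPDATE_INTERVAL = {
    "SEV-1": timedelta(minutes=30),
    "SEV-2": timedelta(hours=1),
}


def next_update_due(severity: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next update is due, or None if none is scheduled."""
    interval = UPDATE_INTERVAL.get(severity)
    if interval is None:
        return None
    return last_update + interval


if __name__ == "__main__":
    last = datetime(2024, 3, 1, 12, 0, tzinfo=timezone.utc)
    print(next_update_due("SEV-1", last))
```

Keeping the timestamps timezone-aware in UTC matches the templates, which state all times in UTC.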

## Integration Points

- **PagerDuty**: On-call scheduling, alert routing, escalation policies, incident tracking; integrates with all major monitoring tools
- **Opsgenie**: Alternative to PagerDuty; an Atlassian product; integrates with Jira for ticket creation
- **Jira Service Management**: Incident tracking, SLA management, post-incident review workflows
- **ServiceNow**: Enterprise ITSM; incident management module with automated workflows and integrations
- **Atlassian Statuspage**: Customer-facing status pages; automated status updates from monitoring
- **Datadog / New Relic / Dynatrace**: Monitoring platforms that trigger incident alerts; APM and infrastructure metrics
- **Slack / Microsoft Teams**: Incident communication channels; bridge calls; real-time collaboration
- **Zoom / Google Meet**: Incident bridge calls; screen sharing for diagnosis
- **Blameless / Fireflies**: Blameless for post-incident review automation, action item tracking, and incident analytics; Fireflies for transcribing bridge calls to support the scribe role
- **Grafana / Kibana**: Real-time dashboards for incident diagnosis; metric visualization

## Edge Cases

- **Cascading incidents** (one failure triggers multiple dependent failures): IC must identify root cause vs. symptoms; don't fix downstream symptoms while upstream root cause remains; use dependency maps to understand blast radius; consider full rollback vs. incremental fix when cascade detected
  - Example: Database overload → application timeout → API queue buildup → CDN cache miss → full site degradation
  - Approach: fix database first; downstream issues resolve automatically
  - Time investment: 30 minutes to map dependencies vs. 2+ hours fixing symptoms

- **Third-party dependency failure** (cloud provider outage, CDN down, payment gateway unavailable): Cannot fix externally; focus on mitigation (fallback, graceful degradation); communicate transparently with customers; monitor third-party status pages; have contingency plans for critical dependencies
  - AWS region outage: failover to secondary region (RTO: 15–60 minutes depending on setup)
  - Payment gateway down: activate backup payment processor (pre-integrated, pre-tested)
  - CDN failure: route to origin with increased capacity; cache aggressively

- **Security incident overlap** (outage caused by security breach, DDoS, or ransomware): Involve security team immediately; SEV-1 for both operations AND security; follow security incident response alongside operational response; preserve forensic evidence (don't just reboot); consider legal/regulatory notification requirements
  - Ransomware: isolate affected systems immediately; do NOT pay ransom; restore from last known clean backup; notify legal and compliance within 1 hour
  - DDoS: activate DDoS mitigation service (Cloudflare, AWS Shield); increase CDN capacity; implement rate limiting
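  - Data breach: preserve evidence; contain breach; assess data scope; notify the supervisory authority within 72 hours of becoming aware of the breach (GDPR Art. 33) and notify affected individuals without undue delay when the breach poses a high risk to them (GDPR Art. 34)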

- **Multi-region global incident**: Coordinate across time zones; ensure on-call coverage globally; use UTC for all timestamps; consider regional failover implications (data consistency, RPO); communicate in multiple languages if needed
  - Primary region: US East (most on-call engineers)
  - Secondary regions: EU West, Asia Pacific (rotating on-call)
  - Failover decision requires: data consistency verification, DNS TTL consideration, regional capacity check

- **Burnout and staffing constraints**: During prolonged incidents (> 4 hours), rotate responders; ensure team has food, water, rest breaks; management support visible but non-intrusive; plan for follow-on fatigue (team less productive for 1–2 days post-incident); consider compensatory time off
  - 4-hour rule: force 30-minute break after 4 consecutive hours on incident
  - Post-incident: team gets flexible scheduling for 48 hours
  - Monthly review: track on-call incident hours per engineer (target: < 10 hours/month outside normal hours)

- **Blameless culture enforcement**: Post-incident reviews must focus on system and process improvements, not individual blame; leaders model this behavior; action items target processes, not people; track "action items completed" as a team metric; recognize teams that learn from incidents
  - Prohibited language: "John forgot to...", "Sarah made a mistake..."
  - Required language: "The process did not catch...", "The system allowed..."
  - Action items: "Add automated check for X" not "John should remember to check X"
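
For the cascading-incident case earlier in this list, a dependency map can help separate the upstream root cause from downstream symptoms. The sketch below is one possible approach, not an established tool: `root_cause_candidates` is a hypothetical function, and the service names mirror the database-to-CDN example. It treats a failing service as a candidate only when none of its own dependencies are failing.

```python
def root_cause_candidates(
    depends_on: dict[str, list[str]], failing: set[str]
) -> list[str]:
    """Return failing services whose dependencies are all healthy.

    depends_on maps each service to the services it calls. If every
    failing service sits in a dependency cycle, the result is empty.
    """
    return sorted(
        service
        for service in failing
        if not any(dep in failing for dep in depends_on.get(service, []))
    )


if __name__ == "__main__":
    # Hypothetical map for the cascade example: each layer depends on the next.
    depends_on = {
        "cdn": ["api"],
        "api": ["application"],
        "application": ["database"],
        "database": [],
    }
    failing = {"cdn", "api", "application", "database"}
    print(root_cause_candidates(depends_on, failing))
```

With all four layers failing, only the database remains as a candidate, which matches the guidance to fix the database first.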
