Incident Response Playbook

Create, execute, and manage incident response playbooks for IT operational incidents including outages, performance degradation, security breaches, and infrastructure failures. Use when developing incident response procedures, handling production incidents,...

Structured approach to handling IT operational incidents from detection through resolution and learning.

Workflow

  1. Detect incident via monitoring, user reports, or automated alerts.
  2. Triage and classify incident severity (SEV-1 through SEV-4).
  3. Activate incident response team based on severity level.
  4. Establish incident command structure: Incident Commander, Communications Lead, Technical Lead.
  5. Execute diagnostic procedures to identify root cause.
  6. Implement mitigation or fix; verify resolution.
  7. Conduct post-incident review within 48 hours (SEV-1) or 5 business days (SEV-2/3).
  8. Document lessons learned and create action items to prevent recurrence.
  9. Update playbooks and monitoring based on findings.
  10. Track action item completion to closure.

Incident Severity Classification

INCIDENT SEVERITY FRAMEWORK
============================

SEV-1 (Critical) — Major Outage
  Definition: Complete service outage affecting all or majority of users;
               data loss; security breach; revenue impact > $100K/hour.
  Impact: All users affected or business operations halted
  Response Time: Immediate (< 5 minutes to acknowledge)
  Resolution Target: 2 hours maximum (4 hours with approval from VP)
  Escalation: CTO, VP Engineering, Head of Operations automatically notified
  Communication: Customer-facing status page updated within 30 minutes;
                 executive briefing within 1 hour
  Examples:
    - Production database cluster down (no failover)
    - Authentication service unavailable (no one can log in)
    - Ransomware detected encrypting production data
    - Payment processing system down during peak hours
    - Data breach confirmed (customer data exposed)

SEV-2 (High) — Significant Degradation
  Definition: Major functionality impaired; partial outage; performance
               severely degraded; affecting 25%+ of users.
  Impact: Significant user impact; workaround may exist
  Response Time: 15 minutes to acknowledge
  Resolution Target: 8 hours (one business day)
  Escalation: Engineering manager, operations manager notified
  Communication: Status page updated within 1 hour;
                 customer notification if affecting > 1,000 users
  Examples:
    - API response times > 10 seconds (normal: < 200ms)
    - 50% of payment transactions failing
    - Single-region failure with limited impact (multi-region setup)
    - Critical feature broken (e.g., checkout, search, reporting)
    - Security vulnerability exploited (no data loss yet)

SEV-3 (Medium) — Minor Degradation
  Definition: Minor functionality impaired; non-critical feature broken;
               performance degraded but usable; affecting < 25% of users.
  Impact: Noticeable user impact but core functionality works
  Response Time: 1 hour to acknowledge
  Resolution Target: 24 hours (one calendar day)
  Escalation: Team lead notified
  Communication: Internal notification; no external communication required
  Examples:
    - Non-critical UI bug affecting specific browser
    - Reporting dashboard loading slowly
    - Email notifications delayed by 30+ minutes
    - Single microservice degraded with fallback active
    - CDN performance degradation in specific region

SEV-4 (Low) — Minor Issue
  Definition: Cosmetic issue; minor inconvenience; no functional impact;
               affecting < 1% of users.
  Impact: Minimal to none on business operations
  Response Time: Next business day
  Resolution Target: 5 business days (include in sprint backlog)
  Escalation: Assigned to appropriate team member
  Communication: No notification required
  Examples:
    - Typo in UI text
    - Non-critical feature cosmetic issue
    - Documentation out of date
    - Performance optimization opportunity
    - Feature request disguised as bug report

Severity matrix (Impact × Urgency):

                        Urgency
  Impact        Low      Medium    High
  Low           SEV-4    SEV-3     SEV-3
  Medium        SEV-3    SEV-3     SEV-2
  High          SEV-3    SEV-2     SEV-1
  Critical      SEV-2    SEV-1     SEV-1
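
The severity matrix can be encoded as a simple lookup so that triage tooling and humans agree on the result. The following is a minimal sketch: the cell values are copied from the matrix above, while the names `SEVERITY_MATRIX` and `classify_severity` are hypothetical and not part of any tool named in this playbook.

```python
# Rows are impact levels, columns are urgency levels (values from the matrix above).
SEVERITY_MATRIX = {
    "low":      {"low": "SEV-4", "medium": "SEV-3", "high": "SEV-3"},
    "medium":   {"low": "SEV-3", "medium": "SEV-3", "high": "SEV-2"},
    "high":     {"low": "SEV-3", "medium": "SEV-2", "high": "SEV-1"},
    "critical": {"low": "SEV-2", "medium": "SEV-1", "high": "SEV-1"},
}


def classify_severity(impact: str, urgency: str) -> str:
    """Return the SEV level for an impact/urgency pair (case-insensitive)."""
    try:
        return SEVERITY_MATRIX[impact.strip().lower()][urgency.strip().lower()]
    except KeyError:
        # Hypothetical error handling: reject values outside the matrix.
        raise ValueError(f"unknown impact/urgency: {impact!r}, {urgency!r}") from None


print(classify_severity("High", "Medium"))  # SEV-2
```

Keeping the matrix as data rather than nested conditionals makes it easy to review against the written framework when either one changes.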

Incident Response Process

SEV-1 INCIDENT RESPONSE PLAYBOOK
===================================

Phase 1: Detection and Activation (T+0 to T+5 minutes)

  Step 1 — Incident Detection:
    - Automated alert fires (monitoring system: PagerDuty, Datadog, New Relic)
    - Alert includes: service name, severity, error type, affected region, timestamp
    - On-call engineer acknowledges alert within 5 minutes
    - If not acknowledged in 5 minutes: automatic escalation to backup on-call

  Step 2 — Initial Triage:
    - On-call engineer verifies alert is real (not false positive)
    - Check status page for existing incident (avoid duplicate incidents)
    - Check dashboards for related issues (cascade effect)
    - Confirm severity level (SEV-1 if confirmed)

  Step 3 — Incident Declaration:
    - Create incident ticket in tracking system (Jira Service Management, ServiceNow)
    - Open incident bridge call (Zoom, Google Meet, Slack huddle)
    - Tag incident as SEV-1
    - Post to #incidents Slack channel with initial summary
    - Auto-notify stakeholders via PagerDuty escalation policy

  Step 4 — Assign Roles:
    - Incident Commander (IC): On-call engineer or senior engineer; manages the process
    - Technical Lead: Most knowledgeable engineer on affected system; leads diagnosis
    - Communications Lead: Manages internal/external communication (can be IC initially)
    - Scribe: Documents timeline and decisions (can be automated with transcript tools)
    - Other responders: Subject matter experts as needed

  Team activation:
    - Core team: 2–4 engineers maximum (avoid groupthink, maintain efficiency)
    - Subject matter experts: DBA, network engineer, security engineer (as needed)
    - Management observer: VP/Director (informs stakeholders, does not interfere)
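
The acknowledgment rule in Step 1 (escalate to the backup on-call if the alert is not acknowledged within 5 minutes) could be expressed as the check below. This is a minimal sketch under stated assumptions: `should_escalate` and its parameters are hypothetical names, and a paging tool's escalation policy would normally enforce this rather than custom code.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# From Step 1: escalate if the alert is not acknowledged in 5 minutes.
ACK_TIMEOUT = timedelta(minutes=5)


def should_escalate(fired_at: datetime, acked_at: Optional[datetime], now: datetime) -> bool:
    """Return True when the alert is still unacknowledged after the 5-minute window."""
    if acked_at is not None:
        return False  # someone already acknowledged; no escalation needed
    return now - fired_at >= ACK_TIMEOUT


fired = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # illustrative timestamp
print(should_escalate(fired, None, fired + timedelta(minutes=6)))  # True
```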

Phase 2: Diagnosis and Mitigation (T+5 to T+60 minutes)

  Step 5 — Situation Assessment:
    - IC: "What do we know? What don't we know? What's the impact?"
    - Review dashboards: error rates, response times, infrastructure metrics
    - Check recent deployments (last 24 hours): was there a change?
    - Check external factors: cloud provider status, CDN status, DNS health
    - Determine user impact: how many users affected, what functionality broken

  Step 6 — Hypothesis and Testing:
    - Technical Lead forms hypothesis: "We believe X caused Y"
    - Test hypothesis with targeted investigation:
      * Check logs for error patterns
      * Compare current vs. baseline metrics
      * Check dependency health (upstream/downstream services)
      * Run diagnostic scripts/playbooks
    - Validate or invalidate hypothesis within 15 minutes

  Step 7 — Mitigation Decision:
    - Options analysis (choose fastest path to restore service):
      * Rollback: revert last deployment (5–15 minutes)
      * Failover: switch to backup region/system (2–10 minutes)
      * Scale: add capacity if overload (5–20 minutes)
      * Fix: deploy hotfix (30–120 minutes, riskier)
      * Disable: turn off problematic feature with feature flag (1–5 minutes)
    - IC makes decision based on: speed, risk, confidence, reversibility
    - Rule: prefer rollback or failover over hotfix during active incident

  Step 8 — Execute Mitigation:
    - Technical Lead executes chosen mitigation
    - Monitor impact in real-time (dashboard, error rates, user reports)
    - Confirm service restoration (synthetic tests, health checks)
    - Communicate progress: "Working on fix, expected resolution in X minutes"

Phase 3: Resolution and Communication (T+60 minutes to resolution)

  Step 9 — Verification:
    - Confirm all services restored to normal operation
    - Monitor for 15–30 minutes post-fix for stability
    - Run regression tests if code fix deployed
    - Verify data integrity (no data loss/corruption)
    - Confirm with business stakeholders

  Step 10 — Resolution Communication:
    - Update status page: "Resolved" with brief explanation
    - Post resolution to #incidents channel
    - Send customer notification (if external impact)
    - Close incident bridge call
    - Transition ticket to "Resolved" status

Phase 4: Post-Incident (Within 48 hours for SEV-1)

  Step 11 — Post-Incident Review (PIR):
    - Schedule within 24 hours of resolution, while details are fresh in memory
    - Attendees: all incident participants + interested parties
    - Duration: 60–90 minutes
    - Format: blameless retrospective (focus on process, not people)
    - Required: written incident report published within 48 hours

  Step 12 — Action Items:
    - Identify root cause (5-Whys analysis)
    - Create prevention items (avoid recurrence)
    - Create detection items (faster detection next time)
    - Create response items (faster resolution next time)
    - Assign owners and due dates for each action item
    - Track action items in project management system
    - Follow up in 30 days to verify completion
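
The Phase 4 deadlines for a SEV-1 can be derived from the resolution time. The sketch below assumes every interval is measured from the moment of resolution; the playbook does not say when the 30-day follow-up clock starts, so that anchor is an assumption, and the function and key names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def post_incident_deadlines(resolved_at: datetime) -> dict[str, datetime]:
    """Compute SEV-1 post-incident deadlines relative to resolution time."""
    return {
        # Step 11: schedule the review within 24 hours of resolution.
        "review_scheduled_by": resolved_at + timedelta(hours=24),
        # Step 11: written incident report published within 48 hours.
        "report_published_by": resolved_at + timedelta(hours=48),
        # Step 12: follow up in 30 days (assumed to count from resolution).
        "action_item_follow_up": resolved_at + timedelta(days=30),
    }


resolved = datetime(2024, 3, 1, 9, 30, tzinfo=timezone.utc)  # illustrative timestamp
for name, due in post_incident_deadlines(resolved).items():
    print(f"{name}: {due.isoformat()}")
```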

On-Call Management

ON-CALL MANAGEMENT FRAMEWORK
==============================

On-call rotation structure:

  Primary on-call:
    - Schedule: 1 week on, 3 weeks off (typical)
    - Hours: 24/7 coverage (24-hour shifts) or business hours only
    - Expectation: acknowledge alerts within 5 minutes
    - PagerDuty integration: phone call, SMS, email, mobile push
    - Compensation: on-call stipend $500–$2,000/month (varies by company/role)

  Secondary on-call:
    - Backup if primary unavailable or unresponsive
    - Steps in after 5–10 minute escalation
    - Same rotation as primary but offset schedule

  On-call handover process:
    T-1 day:
      - Review open incidents and known issues
      - Update on-call runbook with current system state
      - Brief secondary on-call on any active concerns
    T-0 day:
      - Formal handover meeting (15 minutes, video call)
      - Verify pager/alerting working (test alert)
      - Confirm access to all required tools and dashboards

On-call health metrics:

  Target metrics:
    - Alert acknowledgment time (p50): < 5 minutes
    - Alert acknowledgment time (p95): < 15 minutes
    - Alerts per on-call week (acceptable): < 20 (outside business hours)
    - Escalations to secondary: < 2 per week
    - SEV-1 incidents per quarter: < 4 (industry benchmark)

  Unhealthy signals:
    - > 30 alerts per week (alert fatigue — reduce noise)
    - > 5 SEV-1 incidents per quarter (systemic reliability issues)
    - On-call burnout signs: delayed acknowledgments, complaints, turnover
    - Single point of failure: only 1 person who can fix critical system
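
The target metrics and unhealthy signals above lend themselves to an automated weekly check. The following is a minimal sketch: the thresholds come from the lists above, the nearest-rank percentile is one of several reasonable definitions, and all function and parameter names are hypothetical.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def oncall_health(ack_minutes: list[float], total_alerts: int,
                  after_hours_alerts: int, escalations: int) -> list[str]:
    """Return a list of findings for one on-call week; an empty list means all targets were met."""
    findings = []
    if percentile(ack_minutes, 50) >= 5:
        findings.append("p50 acknowledgment time is not under 5 minutes")
    if percentile(ack_minutes, 95) >= 15:
        findings.append("p95 acknowledgment time is not under 15 minutes")
    if after_hours_alerts >= 20:
        findings.append("20 or more alerts outside business hours")
    if total_alerts > 30:
        findings.append("more than 30 alerts this week (alert fatigue)")
    if escalations >= 2:
        findings.append("2 or more escalations to secondary")
    return findings


print(oncall_health([1, 2, 3, 4, 20], total_alerts=10, after_hours_alerts=5, escalations=0))
```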

On-call burnout prevention:

  1. Alert reduction:
     - Review all alerts monthly; disable non-actionable alerts
     - Implement alert grouping (related alerts → single notification)
     - Use intelligent alerting (ML-based anomaly detection)
     - Target: < 5 false positive alerts per on-call week

  2. Runbook coverage:
     - Every alert must have a corresponding runbook
     - Runbook includes: diagnosis steps, remediation steps, escalation path
     - Test runbooks quarterly (ensure they work)
     - Runbook quality score: > 80% coverage of common incidents

  3. Knowledge distribution:
     - Minimum 2 people who can handle each critical system
     - Cross-training sessions monthly
     - Document tribal knowledge in wiki/runbooks
     - Rotate on-call duties across team (no permanent on-call)

Post-Incident Review

POST-INCIDENT REVIEW TEMPLATE
===============================

Incident Report — [Date] — [Service/System] — SEV-[Level]

1. Summary (1–2 paragraphs):
   - What happened, when, how long, what was affected
   - Resolution approach and outcome
   - Business impact (users, revenue, SLA)

2. Timeline:
   - [HH:MM UTC] — Incident detected (by whom, how)
   - [HH:MM UTC] — SEV-[Level] declared
   - [HH:MM UTC] — Incident team activated
   - [HH:MM UTC] — Root cause identified
   - [HH:MM UTC] — Fix implemented
   - [HH:MM UTC] — Service restored
   - [HH:MM UTC] — Incident closed

3. Impact:
   - Duration: [X hours Y minutes]
   - Users affected: [number or percentage]
   - Revenue impact: [$ amount if calculable]
   - SLA impact: [did it breach any SLA commitments?]
   - Data impact: [any data loss or corruption?]
   - Customer communications: [how many notifications sent]

4. Root Cause Analysis (5-Whys):
   - Why did the incident occur? [direct cause]
   - Why was the direct cause not prevented? [underlying cause]
   - Why was the underlying cause not detected? [systemic cause]
   - Why did the detection system not catch it? [process gap]
   - Why was the process not in place? [organizational cause]

5. What Went Well:
   - Fast detection time
   - Effective communication
   - Quick resolution
   - Team coordination

6. What Could Be Improved:
   - Detection could be faster (current: X min, target: Y min)
   - Communication could include more technical detail
   - Runbook was outdated
   - Team lacked access to X tool

7. Action Items:
   | # | Action | Type | Owner | Due Date | Status |
   |---|--------|------|-------|----------|--------|
   | 1 | Add monitoring for X metric | Detection | Eng A | MM/DD | Open |
   | 2 | Update runbook for Y service | Prevention | Eng B | MM/DD | Open |
   | 3 | Implement circuit breaker for Z | Prevention | Eng C | MM/DD | Open |
   | 4 | Cross-train team on database failover | Response | Manager | MM/DD | Open |

8. Classification:
   - Type: [Infrastructure, Application, Network, Security, Third-party, Human error]
   - Severity: [SEV-1/2/3/4]
   - Category: [Outage, Performance, Data, Security, Third-party]
   - Recurring: [Yes/No — if yes, reference prior incidents]
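
The Duration field in the Impact section can be filled in from two timeline entries. This sketch assumes duration is measured from detection to service restoration, which the template does not state explicitly; `format_duration` is a hypothetical helper.

```python
from datetime import datetime, timezone


def format_duration(detected_at: datetime, restored_at: datetime) -> str:
    """Format the incident duration in the template's '[X hours Y minutes]' style."""
    total_minutes = int((restored_at - detected_at).total_seconds() // 60)
    if total_minutes < 0:
        raise ValueError("restored_at is earlier than detected_at")
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"


detected = datetime(2024, 3, 1, 14, 5, tzinfo=timezone.utc)   # illustrative timestamps
restored = datetime(2024, 3, 1, 16, 47, tzinfo=timezone.utc)
print(format_duration(detected, restored))  # 2 hours 42 minutes
```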

Incident Communication Templates

INCIDENT COMMUNICATION TEMPLATES
==================================

Template 1: Initial Internal Notification (Slack #incidents)
  🚨 SEV-[Level] — [Service Name] — [Brief Description]
  Detected: [Time UTC]
  Impact: [Who/what is affected]
  Status: Investigating
  Bridge: [Link to call]
  On-call: [@person]

Template 2: Status Page Update (Initial)
  DEGRADED PERFORMANCE / OUTAGE — [Service Name]
  We are currently investigating [issue description].
  Our team is actively working on a resolution.
  Next update in 30 minutes.
  Last updated: [Time UTC]

Template 3: Status Page Update (Identified)
  IDENTIFIED — [Service Name]
  We have identified the root cause as [brief explanation].
  Our team is working on a fix.
  Estimated resolution: [Timeframe]
  Next update in 30 minutes.
  Last updated: [Time UTC]

Template 4: Status Page Update (Resolved)
  RESOLVED — [Service Name]
  The issue affecting [service] has been resolved.
  All services are operating normally.
  A post-incident report will be published within 48 hours.
  We apologize for the inconvenience.

Template 5: Customer Email Notification (SEV-1 with customer impact)
  Subject: Service Update — [Company] — [Date]
  Dear [Customer],
  We experienced a service disruption on [Date] from [Start Time] to [End Time].
  During this period, [description of impact in customer terms].
  The issue has been resolved and all services are operating normally.
  We are conducting a thorough review to prevent recurrence.
  A detailed post-incident report is available at [Link].
  We sincerely apologize for the disruption to your business.
  Best regards,
  [Name]
  [Title]

Communication cadence by severity:
  SEV-1: Every 30 minutes until resolved
  SEV-2: Every 1 hour until resolved
  SEV-3: When status changes
  SEV-4: No external communication
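
The cadence above can drive a reminder for the Communications Lead. The following is a minimal sketch: the intervals are taken from the cadence list, and the names `UPDATE_INTERVAL` and `next_update_due` are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Fixed intervals from the cadence list; SEV-3 and SEV-4 have no timer.
UPDATE_INTERVAL = {
    "SEV-1": timedelta(minutes=30),
    "SEV-2": timedelta(hours=1),
}
KNOWN_SEVERITIES = ("SEV-1", "SEV-2", "SEV-3", "SEV-4")


def next_update_due(severity: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next update is due, or None when no timed update applies.

    SEV-3 updates are sent on status change and SEV-4 has no external
    communication, so both return None.
    """
    if severity not in KNOWN_SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")
    interval = UPDATE_INTERVAL.get(severity)
    return last_update + interval if interval else None


last = datetime(2024, 3, 1, 12, 0, tzinfo=timezone.utc)  # illustrative timestamp
print(next_update_due("SEV-1", last))  # 2024-03-01 12:30:00+00:00
```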

Integration Points

Edge Cases