Incident Response Playbook
Create, execute, and manage incident response playbooks for IT operational incidents including outages, performance degradation, security breaches, and infrastructure failures. Use when developing incident response procedures, handling production incidents,...
Structured approach to handling IT operational incidents from detection through resolution and learning.
Workflow
- Detect incident via monitoring, user reports, or automated alerts.
- Triage and classify incident severity (SEV-1 through SEV-4).
- Activate incident response team based on severity level.
- Establish incident command structure: Incident Commander, Communications Lead, Technical Lead.
- Execute diagnostic procedures to identify root cause.
- Implement mitigation or fix; verify resolution.
- Conduct post-incident review within 48 hours (SEV-1) or 5 business days (SEV-2/3).
- Document lessons learned and create action items to prevent recurrence.
- Update playbooks and monitoring based on findings.
- Track action item completion to closure.
Incident Severity Classification
INCIDENT SEVERITY FRAMEWORK
============================
SEV-1 (Critical) — Major Outage
Definition: Complete service outage affecting all or majority of users;
data loss; security breach; revenue impact > $100K/hour.
Impact: All users affected or business operations halted
Response Time: Immediate (< 5 minutes to acknowledge)
Resolution Target: 2 hours maximum (4 hours with approval from VP)
Escalation: CTO, VP Engineering, Head of Operations automatically notified
Communication: Customer-facing status page updated within 30 minutes;
executive briefing within 1 hour
Examples:
- Production database cluster down (no failover)
- Authentication service unavailable (no one can log in)
- Ransomware detected encrypting production data
- Payment processing system down during peak hours
- Data breach confirmed (customer data exposed)
SEV-2 (High) — Significant Degradation
Definition: Major functionality impaired; partial outage; performance
severely degraded; affecting 25%+ of users.
Impact: Significant user impact; workaround may exist
Response Time: 15 minutes to acknowledge
Resolution Target: 8 hours (one business day)
Escalation: Engineering manager, operations manager notified
Communication: Status page updated within 1 hour;
customer notification if affecting > 1,000 users
Examples:
- API response times > 10 seconds (normal: < 200ms)
- 50% of payment transactions failing
- Single-region failure with limited impact (multi-region setup)
- Critical feature broken (e.g., checkout, search, reporting)
- Security vulnerability exploited (no data loss yet)
SEV-3 (Medium) — Minor Degradation
Definition: Minor functionality impaired; non-critical feature broken;
performance degraded but usable; affecting < 25% of users.
Impact: Noticeable user impact but core functionality works
Response Time: 1 hour to acknowledge
Resolution Target: 24 hours (by the next business day)
Escalation: Team lead notified
Communication: Internal notification; no external communication required
Examples:
- Non-critical UI bug affecting specific browser
- Reporting dashboard loading slowly
- Email notifications delayed by 30+ minutes
- Single microservice degraded with fallback active
- CDN performance degradation in specific region
SEV-4 (Low) — Minor Issue
Definition: Cosmetic issue; minor inconvenience; no functional impact;
affecting < 1% of users.
Impact: Minimal to none on business operations
Response Time: Next business day
Resolution Target: 5 business days (include in sprint backlog)
Escalation: Assigned to appropriate team member
Communication: No notification required
Examples:
- Typo in UI text
- Non-critical feature cosmetic issue
- Documentation out of date
- Performance optimization opportunity
- Feature request disguised as bug report
Severity matrix (Impact × Urgency):
| Impact \ Urgency | Low | Medium | High |
|------------------|-----|--------|------|
| Low | SEV-4 | SEV-3 | SEV-3 |
| Medium | SEV-3 | SEV-3 | SEV-2 |
| High | SEV-3 | SEV-2 | SEV-1 |
| Critical | SEV-2 | SEV-1 | SEV-1 |
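The matrix above can be encoded as a simple lookup so that triage tooling and humans agree on the result. The following is a minimal Python sketch; the names `SEVERITY_MATRIX` and `classify_severity` are illustrative assumptions and do not come from any tool named in this playbook.

```python
# Hypothetical encoding of the Impact x Urgency matrix; all names are illustrative.
SEVERITY_MATRIX = {
    "low":      {"low": "SEV-4", "medium": "SEV-3", "high": "SEV-3"},
    "medium":   {"low": "SEV-3", "medium": "SEV-3", "high": "SEV-2"},
    "high":     {"low": "SEV-3", "medium": "SEV-2", "high": "SEV-1"},
    "critical": {"low": "SEV-2", "medium": "SEV-1", "high": "SEV-1"},
}


def classify_severity(impact: str, urgency: str) -> str:
    """Return the severity label for an impact/urgency pair (case-insensitive)."""
    try:
        return SEVERITY_MATRIX[impact.strip().lower()][urgency.strip().lower()]
    except KeyError:
        raise ValueError(f"unknown impact/urgency: {impact!r}/{urgency!r}") from None


if __name__ == "__main__":
    print(classify_severity("High", "Medium"))  # SEV-2
```

Keeping the matrix as data rather than nested conditionals makes it easy to review against the written framework when either one changes.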
Incident Response Process
SEV-1 INCIDENT RESPONSE PLAYBOOK
===================================
Phase 1: Detection and Activation (T+0 to T+5 minutes)
Step 1 — Incident Detection:
- Automated alert fires (monitoring system: PagerDuty, Datadog, New Relic)
- Alert includes: service name, severity, error type, affected region, timestamp
- On-call engineer acknowledges alert within 5 minutes
- If not acknowledged in 5 minutes: automatic escalation to backup on-call
Step 2 — Initial Triage:
- On-call engineer verifies alert is real (not false positive)
- Check status page for existing incident (avoid duplicate incidents)
- Check dashboards for related issues (cascade effect)
- Confirm severity against the classification criteria (continue with this playbook if SEV-1)
Step 3 — Incident Declaration:
- Create incident ticket in tracking system (Jira Service Management, ServiceNow)
- Open incident bridge call (Zoom, Google Meet, Slack huddle)
- Tag incident as SEV-1
- Post to #incidents Slack channel with initial summary
- Auto-notify stakeholders via PagerDuty escalation policy
Step 4 — Assign Roles:
- Incident Commander (IC): On-call engineer or senior engineer; manages the process
- Technical Lead: Most knowledgeable engineer on affected system; leads diagnosis
- Communications Lead: Manages internal/external communication (can be IC initially)
- Scribe: Documents timeline and decisions (can be automated with transcript tools)
- Other responders: Subject matter experts as needed
Team activation:
- Core team: 2–4 engineers maximum (keeps the bridge call focused and decisions fast)
- Subject matter experts: DBA, network engineer, security engineer (as needed)
- Management observer: VP/Director (informs stakeholders, does not interfere)
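The acknowledgment rule in Step 1 (escalate to the backup on-call if the alert is not acknowledged within 5 minutes) is normally enforced by the paging tool's escalation policy. As a rough illustration of the logic only, here is a Python sketch; `ACK_TIMEOUT` and `current_page_target` are hypothetical names, not part of PagerDuty or any other product.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: a single escalation step after 5 minutes, as described in Step 1.
ACK_TIMEOUT = timedelta(minutes=5)


def current_page_target(fired_at: datetime, now: datetime, acknowledged: bool) -> Optional[str]:
    """Return who should be paged right now, or None once the alert is acknowledged."""
    if acknowledged:
        return None
    if now - fired_at >= ACK_TIMEOUT:
        return "backup"
    return "primary"


if __name__ == "__main__":
    fired = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
    print(current_page_target(fired, fired + timedelta(minutes=6), acknowledged=False))  # backup
```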
Phase 2: Diagnosis and Mitigation (T+5 to T+60 minutes)
Step 5 — Situation Assessment:
- IC: "What do we know? What don't we know? What's the impact?"
- Review dashboards: error rates, response times, infrastructure metrics
- Check recent deployments (last 24 hours): was there a change?
- Check external factors: cloud provider status, CDN status, DNS health
- Determine user impact: how many users affected, what functionality broken
Step 6 — Hypothesis and Testing:
- Technical Lead forms hypothesis: "We believe X caused Y"
- Test hypothesis with targeted investigation:
* Check logs for error patterns
* Compare current vs. baseline metrics
* Check dependency health (upstream/downstream services)
* Run diagnostic scripts/playbooks
- Validate or invalidate hypothesis within 15 minutes
Step 7 — Mitigation Decision:
- Options analysis (choose fastest path to restore service):
* Rollback: revert last deployment (5–15 minutes)
* Failover: switch to backup region/system (2–10 minutes)
* Scale: add capacity if overload (5–20 minutes)
* Fix: deploy hotfix (30–120 minutes, riskier)
* Disable: turn off problematic feature with feature flag (1–5 minutes)
- IC makes decision based on: speed, risk, confidence, reversibility
- Rule: prefer rollback or failover over hotfix during active incident
Step 8 — Execute Mitigation:
- Technical Lead executes chosen mitigation
- Monitor impact in real-time (dashboard, error rates, user reports)
- Confirm service restoration (synthetic tests, health checks)
- Communicate progress: "Working on fix, expected resolution in X minutes"
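The rule in Step 7 (prefer the fastest low-risk option, and use a hotfix only when nothing else applies) can be sketched as a small ranking function. This Python sketch considers only speed and risk; the Incident Commander's judgment on confidence and reversibility is not modeled. The class and function names are illustrative assumptions, and the time ranges are the ones listed in Step 7.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class MitigationOption:
    name: str
    min_minutes: int
    max_minutes: int
    higher_risk: bool = False  # only the hotfix is marked riskier in Step 7


# Time ranges copied from the Step 7 options analysis.
OPTIONS = [
    MitigationOption("disable", 1, 5),
    MitigationOption("failover", 2, 10),
    MitigationOption("rollback", 5, 15),
    MitigationOption("scale", 5, 20),
    MitigationOption("fix", 30, 120, higher_risk=True),
]


def choose_mitigation(applicable: Iterable[str]) -> MitigationOption:
    """Pick the lowest-risk applicable option, breaking ties by worst-case time."""
    wanted = set(applicable)
    candidates = [option for option in OPTIONS if option.name in wanted]
    if not candidates:
        raise ValueError("no applicable mitigation option")
    return min(candidates, key=lambda o: (o.higher_risk, o.max_minutes))


if __name__ == "__main__":
    print(choose_mitigation({"rollback", "fix"}).name)  # rollback
```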
Phase 3: Resolution and Communication (T+60 minutes to resolution)
Step 9 — Verification:
- Confirm all services restored to normal operation
- Monitor for 15–30 minutes post-fix for stability
- Run regression tests if code fix deployed
- Verify data integrity (no data loss/corruption)
- Confirm with business stakeholders
Step 10 — Resolution Communication:
- Update status page: "Resolved" with brief explanation
- Post resolution to #incidents channel
- Send customer notification (if external impact)
- Close incident bridge call
- Transition ticket to "Resolved" status
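The stability check in Step 9 (monitor for 15–30 minutes after the fix) can be expressed as a check over recent error-rate readings. The Python sketch below assumes one reading per minute and a threshold chosen by the team; both are assumptions, as is the function name.

```python
from typing import Sequence


def stable_since_fix(error_rates: Sequence[float], threshold: float, required_samples: int) -> bool:
    """True if the most recent `required_samples` readings are all at or below `threshold`."""
    if required_samples <= 0:
        raise ValueError("required_samples must be positive")
    if len(error_rates) < required_samples:
        return False  # not enough post-fix data yet
    return all(rate <= threshold for rate in error_rates[-required_samples:])


if __name__ == "__main__":
    # Hypothetical per-minute error rates (percent); the first reading predates the fix.
    readings = [5.0] + [0.2] * 15
    print(stable_since_fix(readings, threshold=0.5, required_samples=15))  # True
```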
Phase 4: Post-Incident (Within 48 hours for SEV-1)
Step 11 — Post-Incident Review (PIR):
- Schedule the review within 24 hours of resolution, while details are still fresh
- Attendees: all incident participants + interested parties
- Duration: 60–90 minutes
- Format: blameless retrospective (focus on process, not people)
- Required: written incident report published within 48 hours
Step 12 — Action Items:
- Identify root cause (5-Whys analysis)
- Create prevention items (avoid recurrence)
- Create detection items (faster detection next time)
- Create response items (faster resolution next time)
- Assign owners and due dates for each action item
- Track action items in project management system
- Follow up in 30 days to verify completion
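The Phase 4 deadlines for a SEV-1 follow directly from the resolution time. A minimal Python sketch of that arithmetic follows; the function and key names are illustrative assumptions, and the intervals (24 hours, 48 hours, 30 days) are taken from Steps 11 and 12.

```python
from datetime import datetime, timedelta, timezone


def sev1_follow_up_deadlines(resolved_at: datetime) -> dict:
    """Compute SEV-1 post-incident deadlines from the resolution timestamp."""
    return {
        "review_scheduled_by": resolved_at + timedelta(hours=24),
        "report_published_by": resolved_at + timedelta(hours=48),
        "action_item_follow_up": resolved_at + timedelta(days=30),
    }


if __name__ == "__main__":
    resolved = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    for name, due in sev1_follow_up_deadlines(resolved).items():
        print(f"{name}: {due.isoformat()}")
```

Using timezone-aware UTC timestamps keeps these deadlines consistent with the UTC convention used in the incident timeline.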
On-Call Management
ON-CALL MANAGEMENT FRAMEWORK
==============================
On-call rotation structure:
Primary on-call:
- Schedule: 1 week on, 3 weeks off (typical)
- Hours: 24/7 coverage (24-hour shifts) or business hours only
- Expectation: acknowledge alerts within 5 minutes
- PagerDuty integration: phone call, SMS, email, mobile push
- Compensation: on-call stipend $500–$2,000/month (varies by company/role)
Secondary on-call:
- Backup if primary unavailable or unresponsive
- Steps in after 5–10 minute escalation
- Same rotation as primary but offset schedule
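The rotation described above (one week on, three weeks off, with the secondary on the same rotation but offset) can be sketched as a round-robin over the team list. The Python below assumes a four-person team and a one-week offset for the secondary; the names and the offset are illustrative assumptions.

```python
from typing import Sequence, Tuple


def on_call_for_week(team: Sequence[str], week_index: int) -> Tuple[str, str]:
    """Return (primary, secondary) for a zero-based week index.

    The secondary is the engineer who becomes primary the following week.
    """
    if len(team) < 2:
        raise ValueError("need at least two engineers for primary and secondary")
    primary = team[week_index % len(team)]
    secondary = team[(week_index + 1) % len(team)]
    return primary, secondary


if __name__ == "__main__":
    team = ["eng_a", "eng_b", "eng_c", "eng_d"]  # hypothetical roster
    for week in range(4):
        print(week, on_call_for_week(team, week))
```

With four engineers, each person is primary one week in four, which matches the "1 week on, 3 weeks off" pattern.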
On-call handover process:
T-1 day:
- Review open incidents and known issues
- Update on-call runbook with current system state
- Brief secondary on-call on any active concerns
T-0 day:
- Formal handover meeting (15 minutes, video call)
- Verify pager/alerting working (test alert)
- Confirm access to all required tools and dashboards
On-call health metrics:
Target metrics:
- Alert acknowledgment time (p50): < 5 minutes
- Alert acknowledgment time (p95): < 15 minutes
- Alerts per on-call week (acceptable): < 20 (outside business hours)
- Escalations to secondary: < 2 per week
- SEV-1 incidents per quarter: < 4 (rule of thumb; calibrate against your own incident history)
Unhealthy signals:
- > 30 alerts per week (alert fatigue — reduce noise)
- > 5 SEV-1 incidents per quarter (systemic reliability issues)
- On-call burnout signs: delayed acknowledgments, complaints, turnover
- Single point of failure: only 1 person who can fix critical system
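The numeric targets and unhealthy signals above can be checked automatically at the end of each on-call week. The following Python sketch uses a nearest-rank percentile and the thresholds listed in this section; the function names and the choice of percentile method are assumptions.

```python
import math
from typing import List, Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def on_call_warnings(ack_minutes: Sequence[float], alerts_this_week: int, sev1_this_quarter: int) -> List[str]:
    """Return warnings for each target or unhealthy-signal threshold that is missed."""
    warnings = []
    if percentile(ack_minutes, 50) >= 5:
        warnings.append("p50 acknowledgment time is not under 5 minutes")
    if percentile(ack_minutes, 95) >= 15:
        warnings.append("p95 acknowledgment time is not under 15 minutes")
    if alerts_this_week > 30:
        warnings.append("more than 30 alerts this week (alert fatigue)")
    if sev1_this_quarter > 5:
        warnings.append("more than 5 SEV-1 incidents this quarter")
    return warnings


if __name__ == "__main__":
    # Hypothetical acknowledgment times in minutes for one on-call week.
    print(on_call_warnings([1, 2, 2, 3, 4, 4, 6, 20], alerts_this_week=35, sev1_this_quarter=2))
```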
On-call burnout prevention:
1. Alert reduction:
- Review all alerts monthly; disable non-actionable alerts
- Implement alert grouping (related alerts → single notification)
- Use intelligent alerting (ML-based anomaly detection)
- Target: < 5 false positive alerts per on-call week
2. Runbook coverage:
- Every alert must have a corresponding runbook
- Runbook includes: diagnosis steps, remediation steps, escalation path
- Test runbooks quarterly (ensure they work)
- Runbook quality score: > 80% coverage of common incidents
3. Knowledge distribution:
- Minimum 2 people who can handle each critical system
- Cross-training sessions monthly
- Document tribal knowledge in wiki/runbooks
- Rotate on-call duties across team (no permanent on-call)
Post-Incident Review
POST-INCIDENT REVIEW TEMPLATE
===============================
Incident Report — [Date] — [Service/System] — SEV-[Level]
1. Summary (1–2 paragraphs):
- What happened, when, how long, what was affected
- Resolution approach and outcome
- Business impact (users, revenue, SLA)
2. Timeline:
- [HH:MM UTC] — Incident detected (by whom, how)
- [HH:MM UTC] — SEV-[Level] declared
- [HH:MM UTC] — Incident team activated
- [HH:MM UTC] — Root cause identified
- [HH:MM UTC] — Fix implemented
- [HH:MM UTC] — Service restored
- [HH:MM UTC] — Incident closed
3. Impact:
- Duration: [X hours Y minutes]
- Users affected: [number or percentage]
- Revenue impact: [$ amount if calculable]
- SLA impact: [did it breach any SLA commitments?]
- Data impact: [any data loss or corruption?]
- Customer communications: [how many notifications sent]
4. Root Cause Analysis (5-Whys):
- Why did the incident occur? [direct cause]
- Why was the direct cause not prevented? [underlying cause]
- Why was the underlying cause not detected? [systemic cause]
- Why did the detection system not catch it? [process gap]
- Why was the process not in place? [organizational cause]
5. What Went Well:
- Fast detection time
- Effective communication
- Quick resolution
- Team coordination
6. What Could Be Improved:
- Detection could be faster (current: X min, target: Y min)
- Communication could include more technical detail
- Runbook was outdated
- Team lacked access to X tool
7. Action Items:
| # | Action | Type | Owner | Due Date | Status |
|---|--------|------|-------|----------|--------|
| 1 | Add monitoring for X metric | Detection | Eng A | MM/DD | Open |
| 2 | Update runbook for Y service | Response | Eng B | MM/DD | Open |
| 3 | Implement circuit breaker for Z | Prevention | Eng C | MM/DD | Open |
| 4 | Cross-train team on database failover | Response | Manager | MM/DD | Open |
8. Classification:
- Type: [Infrastructure, Application, Network, Security, Third-party, Human error]
- Severity: [SEV-1/2/3/4]
- Category: [Outage, Performance, Data, Security, Third-party]
- Recurring: [Yes/No — if yes, reference prior incidents]
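The timeline in section 2 of the template also yields the response metrics that reviews usually discuss. As a minimal sketch, the Python below derives three intervals from timeline timestamps; the dictionary keys and the metric names are illustrative assumptions rather than fields of any tracking system.

```python
from datetime import datetime, timezone
from typing import Dict


def timeline_metrics(timeline: Dict[str, datetime]) -> Dict[str, float]:
    """Derive intervals (in minutes) measured from the detection timestamp."""

    def minutes(start_key: str, end_key: str) -> float:
        return (timeline[end_key] - timeline[start_key]).total_seconds() / 60

    return {
        "minutes_to_declare": minutes("detected", "declared"),
        "minutes_to_root_cause": minutes("detected", "root_cause_identified"),
        "minutes_to_restore": minutes("detected", "service_restored"),
    }


if __name__ == "__main__":
    day = dict(year=2024, month=1, day=1, tzinfo=timezone.utc)
    example = {  # hypothetical timestamps, all in UTC
        "detected": datetime(hour=10, minute=0, **day),
        "declared": datetime(hour=10, minute=4, **day),
        "root_cause_identified": datetime(hour=10, minute=40, **day),
        "service_restored": datetime(hour=11, minute=15, **day),
    }
    print(timeline_metrics(example))
```

Full timestamps are used instead of bare HH:MM values so that incidents crossing midnight are measured correctly.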
Incident Communication Templates
INCIDENT COMMUNICATION TEMPLATES
==================================
Template 1: Initial Internal Notification (Slack #incidents)
🚨 SEV-[Level] — [Service Name] — [Brief Description]
Detected: [Time UTC]
Impact: [Who/what is affected]
Status: Investigating
Bridge: [Link to call]
On-call: [@person]
Template 2: Status Page Update (Initial)
DEGRADED PERFORMANCE / OUTAGE — [Service Name]
We are currently investigating [issue description].
Our team is actively working on a resolution.
Next update in 30 minutes.
Last updated: [Time UTC]
Template 3: Status Page Update (Identified)
IDENTIFIED — [Service Name]
We have identified the root cause as [brief explanation].
Our team is working on a fix.
Estimated resolution: [Timeframe]
Next update in 30 minutes.
Last updated: [Time UTC]
Template 4: Status Page Update (Resolved)
RESOLVED — [Service Name]
The issue affecting [service] has been resolved.
All services are operating normally.
A post-incident report will be published within 48 hours.
We apologize for the inconvenience.
Template 5: Customer Email Notification (SEV-1 with customer impact)
Subject: Service Update — [Company] — [Date]
Dear [Customer],
We experienced a service disruption on [Date] from [Start Time] to [End Time].
During this period, [description of impact in customer terms].
The issue has been resolved and all services are operating normally.
We are conducting a thorough review to prevent recurrence.
A detailed post-incident report will be available at [Link] within 48 hours.
We sincerely apologize for the disruption to your business.
Best regards,
[Name]
[Title]
Communication cadence by severity:
SEV-1: Every 30 minutes until resolved
SEV-2: Every 1 hour until resolved
SEV-3: When status changes
SEV-4: No external communication
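The cadence above can drive a reminder for the Communications Lead. The Python sketch below returns the time of the next scheduled update, or `None` where updates are event-driven (SEV-3) or not required (SEV-4); `UPDATE_INTERVAL` and `next_update_due` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Scheduled intervals from the cadence list; SEV-3 and SEV-4 have no fixed schedule.
UPDATE_INTERVAL = {
    "SEV-1": timedelta(minutes=30),
    "SEV-2": timedelta(hours=1),
    "SEV-3": None,
    "SEV-4": None,
}


def next_update_due(severity: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next scheduled update is due, or None if none is scheduled."""
    if severity not in UPDATE_INTERVAL:
        raise ValueError(f"unknown severity: {severity!r}")
    interval = UPDATE_INTERVAL[severity]
    return last_update + interval if interval is not None else None


if __name__ == "__main__":
    last = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
    print(next_update_due("SEV-1", last))  # 2024-01-01 10:30:00+00:00
```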
Integration Points
- PagerDuty: On-call scheduling, alert routing, escalation policies, incident tracking; integrates with all major monitoring tools
- Opsgenie: Alternative to PagerDuty; an Atlassian product; integrates with Jira for ticket creation
- Jira Service Management: Incident tracking, SLA management, post-incident review workflows
- ServiceNow: Enterprise ITSM; incident management module with automated workflows and integrations
- Atlassian Statuspage: Customer-facing status pages; automated status updates from monitoring
- Datadog / New Relic / Dynatrace: Monitoring platforms that trigger incident alerts; APM and infrastructure metrics
- Slack / Microsoft Teams: Incident communication channels; bridge calls; real-time collaboration
- Zoom / Google Meet: Incident bridge calls; screen sharing for diagnosis
- Blameless / Fireflies: Post-incident review automation; action item tracking; incident analytics
- Grafana / Kibana: Real-time dashboards for incident diagnosis; metric visualization
Edge Cases
- Cascading incidents (one failure triggers multiple dependent failures): IC must identify root cause vs. symptoms; don't fix downstream symptoms while upstream root cause remains; use dependency maps to understand blast radius; consider full rollback vs. incremental fix when cascade detected
- Example: Database overload → application timeout → API queue buildup → CDN cache miss → full site degradation
- Approach: fix database first; downstream issues resolve automatically
- Time investment: 30 minutes to map dependencies vs. 2+ hours fixing symptoms
- Third-party dependency failure (cloud provider outage, CDN down, payment gateway unavailable): Cannot fix externally; focus on mitigation (fallback, graceful degradation); communicate transparently with customers; monitor third-party status pages; have contingency plans for critical dependencies
- AWS region outage: failover to secondary region (RTO: 15–60 minutes depending on setup)
- Payment gateway down: activate backup payment processor (pre-integrated, pre-tested)
- CDN failure: route to origin with increased capacity; cache aggressively
- Security incident overlap (outage caused by security breach, DDoS, or ransomware): Involve security team immediately; SEV-1 for both operations AND security; follow security incident response alongside operational response; preserve forensic evidence (don't just reboot); consider legal/regulatory notification requirements
- Ransomware: isolate affected systems immediately; do NOT pay ransom; restore from last known clean backup; notify legal and compliance within 1 hour
- DDoS: activate DDoS mitigation service (Cloudflare, AWS Shield); increase CDN capacity; implement rate limiting
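- Data breach: preserve evidence; contain breach; assess data scope; under GDPR, notify the supervisory authority within 72 hours of becoming aware of the breach, and notify affected individuals without undue delay when the risk to them is high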
- Multi-region global incident: Coordinate across time zones; ensure on-call coverage globally; use UTC for all timestamps; consider regional failover implications (data consistency, RPO); communicate in multiple languages if needed
- Primary region: US East (most on-call engineers)
- Secondary regions: EU West, Asia Pacific (rotating on-call)
- Failover decision requires: data consistency verification, DNS TTL consideration, regional capacity check
- Burnout and staffing constraints: During prolonged incidents (> 4 hours), rotate responders; ensure team has food, water, rest breaks; management support visible but non-intrusive; plan for follow-on fatigue (team less productive for 1–2 days post-incident); consider compensatory time off
- 4-hour rule: force 30-minute break after 4 consecutive hours on incident
- Post-incident: team gets flexible scheduling for 48 hours
- Monthly review: track on-call incident hours per engineer (target: < 10 hours/month outside normal hours)
- Blameless culture enforcement: Post-incident reviews must focus on system and process improvements, not individual blame; leaders model this behavior; action items target processes, not people; track "action items completed" as a team metric; recognize teams that learn from incidents
- Prohibited language: "John forgot to...", "Sarah made a mistake..."
- Required language: "The process did not catch...", "The system allowed..."
- Action items: "Add automated check for X" not "John should remember to check X"
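For the cascading-incident case above, a dependency map makes it possible to separate likely root causes from downstream symptoms: an unhealthy service is a root candidate when none of the services it depends on are unhealthy. The Python sketch below illustrates that idea; the service names and the `depends_on` structure are hypothetical, loosely following the database example in that edge case.

```python
from typing import Dict, Iterable, List, Set


def root_candidates(depends_on: Dict[str, Iterable[str]], unhealthy: Set[str]) -> List[str]:
    """Return unhealthy services whose direct dependencies are all healthy."""
    return sorted(
        service
        for service in unhealthy
        if not any(dep in unhealthy for dep in depends_on.get(service, ()))
    )


if __name__ == "__main__":
    # Hypothetical dependency map: each service lists what it depends on.
    depends_on = {
        "site": ["cdn"],
        "cdn": ["api"],
        "api": ["app"],
        "app": ["database"],
        "database": [],
    }
    unhealthy = {"site", "cdn", "api", "app", "database"}
    print(root_candidates(depends_on, unhealthy))  # ['database']
```

The result is only a starting point for the Technical Lead's hypothesis; a map that is incomplete or out of date will produce misleading candidates.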