IT AI Skill
Incident Management
Manage IT incidents from detection through resolution including triage, investigation, containment, communication, and post-incident review. Use when handling outages, managing major incidents, running incident command, performing root cause analysis, condu...
Restore normal service operation quickly and systematically through structured incident management processes.
Workflow
1. Incident Detection & Logging
- Multi-channel detection:
- Automated detection from monitoring systems (alerts, thresholds, anomaly detection)
- User-reported incidents via service portal, email, phone, chat
- Vendor notifications (cloud provider outages, software vulnerabilities)
- Social media monitoring for public-facing service issues
- Proactive detection from synthetic monitoring and uptime checks
- Incident creation and initial classification:
- Auto-create incident from alerts with enriched context (affected service, timestamp, alert source)
- Capture: service affected, symptoms, impact scope, user/reporter details
- Initial classification: Incident vs Problem vs Change vs Request
- Assign initial priority based on impact × urgency matrix
- Priority matrix:
- P1/Critical: Major service outage, security breach, data loss — response within 15 minutes
- P2/High: Significant service degradation, workaround available — response within 30 minutes
- P3/Medium: Minor service impact, single user or team — response within 2 hours
- P4/Low: Cosmetic issue, informational request — response within 24 hours
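The response targets above translate directly into deadline checks. A minimal sketch, assuming incidents carry a logged-at timestamp; the names `RESPONSE_TARGETS`, `response_deadline`, and `is_response_breached` are illustrative and not taken from any particular ITSM tool:

```python
from datetime import datetime, timedelta

# Response targets copied from the priority list above.
RESPONSE_TARGETS = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(minutes=30),
    "P3": timedelta(hours=2),
    "P4": timedelta(hours=24),
}


def response_deadline(priority: str, logged_at: datetime) -> datetime:
    """Return the latest time a first response still meets the target."""
    return logged_at + RESPONSE_TARGETS[priority]


def is_response_breached(priority: str, logged_at: datetime, now: datetime) -> bool:
    """True once the current time has passed the response deadline."""
    return now > response_deadline(priority, logged_at)
```

A P1 logged at 09:00 would need a first response by 09:15 under these targets.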
2. Incident Triage & Assignment
- Initial triage (L1):
- Verify the incident is genuine (not a false positive)
- Gather initial diagnostic information
- Check known errors and KB for existing solutions
- Attempt basic troubleshooting (service restart, cache clear)
- Escalate to L2/L3 if unresolved within SLA
- Specialist assignment (L2/L3):
- Route by technical domain: infrastructure, application, network, database, security
- Consider on-call rotation for after-hours incidents
- Check specialist availability and workload
- Assign backup engineer for coverage
- Auto-page for P1/P2 incidents
- Major incident declaration:
- Criteria: service disruption affecting >25% of users, revenue impact, executive attention required
- Notify incident commander and communication lead
- Open major incident bridge/war room
- Activate status page
- Begin stakeholder notification cascade
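The major incident criteria can be expressed as a simple check. This sketch assumes that meeting any one criterion is enough to declare a major incident (the criteria list does not say whether they combine with "and" or "or"); `ImpactAssessment` and its fields are hypothetical names:

```python
from dataclasses import dataclass

# ">25% of users" from the declaration criteria above.
MAJOR_INCIDENT_USER_THRESHOLD = 0.25


@dataclass
class ImpactAssessment:
    affected_users: int
    total_users: int
    revenue_impact: bool
    executive_attention: bool


def is_major_incident(a: ImpactAssessment) -> bool:
    """Assumption: any single criterion triggers a major incident."""
    share = a.affected_users / a.total_users if a.total_users else 0.0
    return (
        share > MAJOR_INCIDENT_USER_THRESHOLD
        or a.revenue_impact
        or a.executive_attention
    )
```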
3. Investigation & Diagnosis
- Data gathering:
- Collect relevant logs (application, system, network, database)
- Review recent changes (deployment, configuration, infrastructure)
- Check monitoring dashboards for correlated anomalies
- Interview affected users for symptom details
- Review error rates, latency, resource utilization trends
- Root cause investigation:
- Systematic elimination of potential causes
- Reproduce issue in non-production if possible
- Check vendor status pages and known issues
- Engage external vendors if needed (cloud provider, SaaS vendor)
- Time-correlate events across systems
- Workaround identification:
- Identify temporary solution to restore service
- Test workaround in staging if available
- Document workaround steps clearly
- Deploy workaround to production
- Communicate workaround to affected users
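Reviewing recent changes and time-correlating events usually starts with a window query: which events landed shortly before the incident began? A minimal sketch; the `Event` shape and the one-hour default window are assumptions for illustration:

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Event(NamedTuple):
    source: str          # e.g. "deploy", "config", "network"
    timestamp: datetime
    description: str


def events_before(incident_start: datetime, events: list[Event],
                  window: timedelta = timedelta(hours=1)) -> list[Event]:
    """Events inside [incident_start - window, incident_start], newest first."""
    earliest = incident_start - window
    candidates = [e for e in events if earliest <= e.timestamp <= incident_start]
    return sorted(candidates, key=lambda e: e.timestamp, reverse=True)
```

Sorting newest first puts the change closest to the incident start at the top of the list, which is often the first candidate to examine.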
4. Containment & Resolution
- Containment actions:
- Isolate affected components to prevent spread
- Scale resources to absorb load while the service is degraded
- Enable fallback services or maintenance pages
- Block malicious traffic (security incidents)
- Freeze deployments and changes to affected systems
- Fix implementation:
- Apply permanent fix via change management process
- For emergencies: expedited emergency change approval
- Deploy fix in staging for validation first
- Production deployment with rollback readiness
- Monitor post-deployment metrics closely
- Verification:
- Confirm service restoration through monitoring
- Verify with affected users (sample survey for broad impact)
- Run smoke tests against affected service
- Monitor for 30 minutes post-fix for stability
- Clear status page and send resolution notification
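The 30-minute stability check in the verification step can be automated against monitoring samples. A sketch under stated assumptions: samples are `(timestamp, error_rate)` pairs, and the 1% error-rate ceiling is a hypothetical threshold, not a value from this document:

```python
from datetime import datetime, timedelta

# "Monitor for 30 minutes post-fix" from the verification step above.
STABILITY_WINDOW = timedelta(minutes=30)


def is_stable(fix_deployed_at: datetime,
              samples: list[tuple[datetime, float]],
              now: datetime,
              max_error_rate: float = 0.01) -> bool:
    """True if the full window has elapsed and every post-fix sample is healthy."""
    if now - fix_deployed_at < STABILITY_WINDOW:
        return False  # too early to call it stable
    post_fix = [rate for ts, rate in samples if fix_deployed_at <= ts <= now]
    # No samples at all is treated as "unknown", not as "stable".
    return bool(post_fix) and all(rate <= max_error_rate for rate in post_fix)
```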
5. Communication Management
- Stakeholder communication plan:
- Internal: engineering teams, management, support staff (every 30 min for P1)
- External: customers, partners, public (via status page, every 60 min for P1)
- Executive: C-suite briefings for major incidents
- Regulator: required notifications for data breaches
- Communication templates:
- Initial notification: "We are investigating an issue affecting [service]"
- Update: "We have identified the cause and are implementing a fix"
- Resolution: "Service has been restored. We are monitoring for stability"
- Post-incident: "Post-mortem available — we've taken steps to prevent recurrence"
- Status page management:
- Update status within 15 minutes of incident detection
- Provide clear, non-technical impact descriptions
- Set expectations for next update timing
- Archive status for incident record
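The update cadences and templates above can be scheduled and rendered mechanically. A minimal sketch for P1 incidents only; the function names are illustrative, and only the intervals and template wording come from this section:

```python
from datetime import datetime, timedelta

# P1 cadences from the stakeholder communication plan above.
P1_UPDATE_INTERVALS = {
    "internal": timedelta(minutes=30),
    "external": timedelta(minutes=60),
}
# Status page must be updated within 15 minutes of detection.
STATUS_PAGE_FIRST_UPDATE = timedelta(minutes=15)


def next_update_due(audience: str, last_update_at: datetime) -> datetime:
    """When the next P1 update is due for the given audience."""
    return last_update_at + P1_UPDATE_INTERVALS[audience]


def status_page_deadline(detected_at: datetime) -> datetime:
    """Latest time for the first status page update."""
    return detected_at + STATUS_PAGE_FIRST_UPDATE


def initial_notification(service: str) -> str:
    """Fill the initial notification template with the affected service."""
    return f"We are investigating an issue affecting {service}"
```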
6. Post-Incident Review
- Post-mortem process:
- Schedule within 3 business days of incident resolution
- Invite all participants and stakeholders
- Blameless culture focus: process improvement, not individual fault
- Document timeline, impact, root cause, resolution
- Root cause analysis:
- 5 Whys analysis to reach fundamental cause
- Timeline reconstruction with precise timestamps
- Identify contributing factors (people, process, technology)
- Distinguish between root cause and proximate cause
- Action item tracking:
- Define specific, measurable action items with owners and deadlines
- Categorize: immediate fix, process improvement, long-term prevention
- Track action item completion (weekly status updates)
- Verify effectiveness of actions (no recurrence within 90 days)
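Action item tracking reduces to two checks: which open items are past their deadline, and whether the incident recurred inside the 90-day verification window. A sketch with a hypothetical `ActionItem` record:

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ActionItem:
    description: str
    owner: str
    due: date
    done: bool = False


def overdue_items(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Open items whose deadline has already passed."""
    return [i for i in items if not i.done and i.due < today]


def no_recurrence_within(resolved_on: date, recurrence_dates: list[date],
                         days: int = 90) -> bool:
    """True if no recurrence falls within `days` after resolution."""
    cutoff = resolved_on + timedelta(days=days)
    return not any(resolved_on < d <= cutoff for d in recurrence_dates)
```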
Templates & Frameworks
Incident Severity Matrix
PRIORITY MATRIX
================
Urgency         | Low Impact  | Medium Impact | High Impact | Critical Impact
----------------|-------------|---------------|-------------|----------------
Scheduled       | P4          | P3            | P3          | P3
Non-Urgent      | P4          | P3            | P2          | P2
Urgent          | P3          | P2            | P1          | P1
Critical        | P3          | P2            | P1          | P1
IMPACT DEFINITIONS:
Low: Single user, non-critical function
Medium: Team/department, degraded functionality
High: Multiple departments, service unavailable
Critical: Organization-wide, revenue impact, security breach
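The matrix maps naturally onto a nested lookup keyed by urgency, then impact. A minimal sketch; `PRIORITY_MATRIX` and `priority_for` are illustrative names, and the cell values are copied from the matrix above:

```python
# Rows are urgency levels, columns are impact levels, as in the matrix above.
PRIORITY_MATRIX = {
    "scheduled":  {"low": "P4", "medium": "P3", "high": "P3", "critical": "P3"},
    "non-urgent": {"low": "P4", "medium": "P3", "high": "P2", "critical": "P2"},
    "urgent":     {"low": "P3", "medium": "P2", "high": "P1", "critical": "P1"},
    "critical":   {"low": "P3", "medium": "P2", "high": "P1", "critical": "P1"},
}


def priority_for(urgency: str, impact: str) -> str:
    """Look up the priority for an urgency/impact pair (case-insensitive)."""
    return PRIORITY_MATRIX[urgency.lower()][impact.lower()]
```

For example, `priority_for("Urgent", "High")` yields `"P1"`, while an unknown level raises `KeyError` rather than silently defaulting.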
Post-Mortem Template
POST-MORTEM — [Incident ID, Date, Severity]
============================================
EXECUTIVE SUMMARY:
What happened: [2-3 sentence summary]
Impact: [affected users/services, duration, business impact]
Root cause: [fundamental cause identified]
TIMELINE:
[HH:MM] — Incident detected by [system/person]
[HH:MM] — Initial triage and classification
[HH:MM] — Root cause identified
[HH:MM] — Workaround deployed
[HH:MM] — Permanent fix deployed
[HH:MM] — Service fully restored
[HH:MM] — Monitoring confirmed stable
ROOT CAUSE ANALYSIS (5 Whys):
Why did [symptom] occur? → [Answer 1]
Why did [Answer 1] happen? → [Answer 2]
Why did [Answer 2] happen? → [Answer 3]
Why did [Answer 3] happen? → [Answer 4]
Why did [Answer 4] happen? → ROOT CAUSE: [Answer 5]
WHAT WORKED WELL:
[Positive observations about response]
WHAT COULD BE IMPROVED:
[Areas for improvement]
ACTION ITEMS:
[ ] [Action] — Owner: [Name] — Due: [Date] — Priority: [P1/P2/P3]
[ ] [Action] — Owner: [Name] — Due: [Date] — Priority: [P1/P2/P3]
PREVENTION MEASURES:
[Long-term changes to prevent recurrence]
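Timeline entries in the template become more useful when expressed as minutes since detection. A sketch that assumes all `HH:MM` entries fall on the same calendar day and that the first entry is the detection time; incidents spanning midnight would need full timestamps instead:

```python
from datetime import datetime


def minutes_between(start_hhmm: str, end_hhmm: str) -> int:
    """Whole minutes from start to end, assuming both are on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end_hhmm, fmt) - datetime.strptime(start_hhmm, fmt)
    return int(delta.total_seconds() // 60)


def timeline_offsets(timeline: list[tuple[str, str]]) -> list[tuple[int, str]]:
    """Convert (HH:MM, label) entries to (minutes since detection, label)."""
    detected_at = timeline[0][0]  # assumption: first entry is detection
    return [(minutes_between(detected_at, t), label) for t, label in timeline]
```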
Integration Points
- Incident management tools (PagerDuty, Opsgenie, VictorOps): Alerting, paging, escalation
- ITSM platforms (ServiceNow, Jira Service Management): Incident tracking, SLA management
- Monitoring systems (Datadog, New Relic, Prometheus): Detection source, metrics context
- Communication tools (Slack, Teams, Statuspage): Stakeholder communication
- Runbook automation (Confluence, Rundeck): Automated response procedures
- ChatOps (Slack/Teams integrations): Real-time collaboration during incidents
- BI/reporting tools: Incident trend analysis and metrics
Edge Cases
- Cascading failures across multiple services: Prioritize restoration order based on dependency graph; declare separate incidents for each service
- Vendor-caused outages: Track vendor communication; set realistic expectations; document for SLA exclusion
- Security incident overlap: Coordinate with security team; balance transparency with information security; follow breach notification requirements
- Recurring incidents: Trigger problem management ticket; investigate systemic root cause; implement permanent fix
- After-hours/weekend incidents: Validate on-call coverage; adjust communication cadence; defer non-critical stakeholder notifications
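For cascading failures, a restoration order can be derived from the dependency graph with a topological sort, so that each service is restored only after the services it depends on. A minimal sketch using the standard library's `graphlib` (Python 3.9+); the service names and dependencies are hypothetical:

```python
from graphlib import TopologicalSorter

# Hypothetical graph: each service maps to the services it depends on.
DEPENDENCIES = {
    "web-frontend": {"api-gateway"},
    "api-gateway": {"auth-service", "database"},
    "auth-service": {"database"},
    "database": set(),
}


def restoration_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Order services so every dependency is restored before its dependents."""
    return list(TopologicalSorter(dependencies).static_order())
```

`TopologicalSorter` raises `graphlib.CycleError` on circular dependencies, which is itself worth surfacing during an incident.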
Output
Incident Management Dashboard
INCIDENT DASHBOARD — Real-Time
===============================
ACTIVE INCIDENTS:
🔴 P1: Production API Gateway — 45 min duration (war room active)
⚠ P2: Email delivery delays — 2h 15min duration (investigation)
✓ P3: Internal wiki slow — 45 min duration (workaround deployed)
MTTR TRENDS:
P1 MTTR (30-day avg): 52 min (target: <60 min ✓)
P2 MTTR (30-day avg): 2h 15min (target: <4h ✓)
P3 MTTR (30-day avg): 1h 30min (target: <8h ✓)
INCIDENT VOLUME:
This month: 47 (vs 52 last month ↓)
P1 incidents: 3 (vs 5 last month ↓)
Change-related: 8 (17% — target: <20% ✓)
SLA COMPLIANCE:
Response time SLA: 97% (target: >95% ✓)
Resolution time SLA: 94% (target: >90% ✓)
Communication SLA: 99% (target: >95% ✓)
POST-MORTEM TRACKING:
Completed (last 30 days): 5/5 (100%)
Open action items: 12
Overdue action items: 2
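The dashboard figures are simple aggregates. A sketch of how MTTR and percentage metrics such as the change-related share could be computed; the function names are illustrative, and the inputs are assumed to come from the incident tracker:

```python
from datetime import timedelta
from typing import Optional


def mttr(durations: list[timedelta]) -> Optional[timedelta]:
    """Mean time to resolve; None when there are no resolved incidents."""
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


def percent(part: int, total: int) -> int:
    """Share of `part` in `total`, rounded to a whole percentage."""
    return round(100 * part / total) if total else 0
```

With the sample dashboard values, 8 change-related incidents out of 47 gives `percent(8, 47) == 17`, matching the 17% shown.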
Trigger Phrases
"incident", "outage", "major incident", "war room", "incident response", "post-mortem", "root cause analysis", "incident triage", "incident communication", "incident commander", "P1/P2/P3", "service restoration", "incident bridge", "status page update", "blameless post-mortem", "incident timeline", "on-call escalation"