IT AI Skill

Incident Management

Manage IT incidents from detection through resolution, including triage, investigation, containment, communication, and post-incident review. Use when handling outages, managing major incidents, running incident command, performing root cause analysis, condu...

Restore normal service operation quickly and systematically through structured incident management processes.

Workflow

1. Incident Detection & Logging

  1. Multi-channel detection:
  2. Incident creation and initial classification:
  3. Priority matrix:

2. Incident Triage & Assignment

  1. Initial triage (L1):
  2. Specialist assignment (L2/L3):
  3. Major incident declaration:

3. Investigation & Diagnosis

  1. Data gathering:
  2. Root cause investigation:
  3. Workaround identification:

4. Containment & Resolution

  1. Containment actions:
  2. Fix implementation:
  3. Verification:

5. Communication Management

  1. Stakeholder communication plan:
  2. Communication templates:
  3. Status page management:

6. Post-Incident Review

  1. Post-mortem process:
  2. Root cause analysis:
  3. Action item tracking:

Templates & Frameworks

Incident Severity Matrix

PRIORITY MATRIX
================

Urgency         | Low Impact | Medium Impact | High Impact | Critical Impact
----------------|------------|---------------|-------------|----------------
Scheduled       | P4         | P3            | P3          | P3
Non-Urgent      | P4         | P3            | P2          | P2
Urgent          | P3         | P2            | P1          | P1
Critical        | P3         | P2            | P1          | P1

IMPACT DEFINITIONS:
  Low: Single user, non-critical function
  Medium: Team/department, degraded functionality
  High: Multiple departments, service unavailable
  Critical: Organization-wide, revenue impact, security breach
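
The priority matrix above can be encoded as a simple lookup table. The following is a minimal Python sketch, not a prescribed implementation: the names Urgency, Impact, PRIORITY_MATRIX, and classify are hypothetical, and the cell values are copied directly from the matrix.

```python
from enum import Enum


class Urgency(Enum):
    # Row labels of the priority matrix.
    SCHEDULED = "Scheduled"
    NON_URGENT = "Non-Urgent"
    URGENT = "Urgent"
    CRITICAL = "Critical"


class Impact(Enum):
    # Column positions of the priority matrix, left to right.
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3


# One tuple per urgency row; columns are Low, Medium, High, Critical impact.
PRIORITY_MATRIX = {
    Urgency.SCHEDULED:  ("P4", "P3", "P3", "P3"),
    Urgency.NON_URGENT: ("P4", "P3", "P2", "P2"),
    Urgency.URGENT:     ("P3", "P2", "P1", "P1"),
    Urgency.CRITICAL:   ("P3", "P2", "P1", "P1"),
}


def classify(urgency: Urgency, impact: Impact) -> str:
    """Return the priority (P1 to P4) for an urgency/impact pair."""
    return PRIORITY_MATRIX[urgency][impact.value]


if __name__ == "__main__":
    # An urgent incident that makes a service unavailable to multiple departments.
    print(classify(Urgency.URGENT, Impact.HIGH))
```

Keeping the matrix as data rather than nested conditionals means a change to a single cell does not require touching the lookup logic.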

Post-Mortem Template

POST-MORTEM — [Incident ID, Date, Severity]
============================================

EXECUTIVE SUMMARY:
  What happened: [2-3 sentence summary]
  Impact: [affected users/services, duration, business impact]
  Root cause: [fundamental cause identified]

TIMELINE:
  [HH:MM] — Incident detected by [system/person]
  [HH:MM] — Initial triage and classification
  [HH:MM] — Root cause identified
  [HH:MM] — Workaround deployed
  [HH:MM] — Permanent fix deployed
  [HH:MM] — Service fully restored
  [HH:MM] — Monitoring confirmed stable

ROOT CAUSE ANALYSIS (5 Whys):
  Why did [symptom] occur? → [Answer 1]
  Why did [Answer 1] happen? → [Answer 2]
  Why did [Answer 2] happen? → [Answer 3]
  Why did [Answer 3] happen? → [Answer 4]
  Why did [Answer 4] happen? → ROOT CAUSE: [Answer 5]

WHAT WORKED WELL:
  [Positive observations about response]

WHAT COULD BE IMPROVED:
  [Areas for improvement]

ACTION ITEMS:
  [ ] [Action] — Owner: [Name] — Due: [Date] — Priority: [P1/P2/P3]
  [ ] [Action] — Owner: [Name] — Due: [Date] — Priority: [P1/P2/P3]

PREVENTION MEASURES:
  [Long-term changes to prevent recurrence]
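
The ACTION ITEMS entries in the template carry an owner, a due date, and a priority, which is enough to count open and overdue items. The sketch below shows one possible way to do that in Python; the ActionItem class, its field names, and the summarize function are assumptions made for illustration, and the sample data is invented.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    # Fields mirror the template line: action, owner, due date, priority.
    action: str
    owner: str
    due: date
    priority: str
    done: bool = False


def summarize(items: list[ActionItem], today: date) -> dict[str, int]:
    """Count open items and the subset of open items past their due date."""
    open_items = [item for item in items if not item.done]
    overdue = [item for item in open_items if item.due < today]
    return {"open": len(open_items), "overdue": len(overdue)}


if __name__ == "__main__":
    # Invented sample data for illustration only.
    items = [
        ActionItem("Add gateway health alert", "A. Owner", date(2024, 6, 1), "P1"),
        ActionItem("Document rollback steps", "B. Owner", date(2024, 7, 1), "P2"),
        ActionItem("Review on-call rota", "C. Owner", date(2024, 5, 1), "P3", done=True),
    ]
    print(summarize(items, today=date(2024, 6, 15)))
```

A completed item is never counted as overdue, even if it was closed after its due date; whether late completion should be reported separately is a policy choice this sketch leaves open.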

Integration Points

Edge Cases

Output

Incident Management Dashboard

INCIDENT DASHBOARD — Real-Time
===============================

ACTIVE INCIDENTS:
  🔴 P1: Production API Gateway — 45 min duration (war room active)
  ⚠  P2: Email delivery delays — 2h 15min duration (investigation)
  ✓  P3: Internal wiki slow — 45 min duration (workaround deployed)

MTTR TRENDS:
  P1 MTTR (30-day avg): 52 min (target: <60 min ✓)
  P2 MTTR (30-day avg): 2h 15min (target: <4h ✓)
  P3 MTTR (30-day avg): 1h 30min (target: <8h ✓)

INCIDENT VOLUME:
  This month: 47 (vs 52 last month ↓)
  P1 incidents: 3 (vs 5 last month ↓)
  Change-related: 8 (17% — target: <20% ✓)

SLA COMPLIANCE:
  Response time SLA: 97% (target: >95% ✓)
  Resolution time SLA: 94% (target: >90% ✓)
  Communication SLA: 99% (target: >95% ✓)

POST-MORTEM TRACKING:
  Completed (last 30 days): 5/5 (100%)
  Open action items: 12
  Overdue action items: 2
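
The MTTR and change-related figures on the dashboard can be derived from per-incident records. The following Python sketch assumes MTTR is measured from detection to resolution and averaged per priority; the Incident class, its fields, and both function names are hypothetical, and a real deployment may define these metrics differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    priority: str          # "P1", "P2", "P3", ...
    detected: datetime     # when the incident was detected
    resolved: datetime     # when service was restored
    change_related: bool = False


def mttr(incidents: list[Incident], priority: str) -> Optional[timedelta]:
    """Mean time to resolve for one priority, or None if there are no such incidents."""
    durations = [i.resolved - i.detected for i in incidents if i.priority == priority]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


def change_related_share(incidents: list[Incident]) -> float:
    """Percentage of incidents flagged as change-related (0.0 for an empty list)."""
    if not incidents:
        return 0.0
    return 100 * sum(i.change_related for i in incidents) / len(incidents)


if __name__ == "__main__":
    # Invented sample data for illustration only.
    start = datetime(2024, 1, 1, 9, 0)
    incidents = [
        Incident("P1", start, start + timedelta(minutes=40), change_related=True),
        Incident("P1", start, start + timedelta(minutes=60)),
        Incident("P2", start, start + timedelta(hours=2)),
    ]
    print(mttr(incidents, "P1"))
    print(round(change_related_share(incidents), 1))
```

The same share calculation reproduces the dashboard's change-related figure: 8 of 47 incidents is roughly 17%.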

Trigger Phrases

"incident", "outage", "major incident", "war room", "incident response", "post-mortem", "root cause analysis", "incident triage", "incident communication", "incident commander", "P1/P2/P3", "service restoration", "incident bridge", "status page update", "blameless post-mortem", "incident timeline", "on-call escalation"