IT AI Skill

Incident Management

Manage IT incidents from detection through resolution including triage, investigation, containment, communication, and post-incident review. Use when handling outages, managing major incidents, running incident command, performing root cause analysis, conducting post-mortems, or managing incident communication. Triggers on phrases like "incident response", "outage management", "major incident", "incident triage", "post-mortem", "root cause analysis", "incident communication", "war room", "incident commander".

Restore normal service operation quickly and systematically through structured incident management processes.

Workflow

1. Incident Detection & Logging

Multi-channel detection:

Automated detection from monitoring systems (alerts, thresholds, anomaly detection)
User-reported incidents via service portal, email, phone, chat
Vendor notifications (cloud provider outages, software vulnerabilities)
Social media monitoring for public-facing service issues
Proactive detection from synthetic monitoring and uptime checks

Incident creation and initial classification:

Auto-create incident from alerts with enriched context (affected service, timestamp, alert source)
Capture: service affected, symptoms, impact scope, user/reporter details
Initial classification: Incident vs Problem vs Change vs Request
Assign initial priority based on impact × urgency matrix
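
As an illustration of auto-creating an incident from an alert, the sketch below maps a monitoring payload onto an incident record. The Incident class, the incident_from_alert function, and the alert dictionary keys (service, source, timestamp, summary, scope, reporter) are all hypothetical; real monitoring tools use their own payload shapes.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    service: str
    symptoms: str
    impact_scope: str
    reporter: str
    alert_source: str
    detected_at: datetime
    record_type: str = "incident"   # incident | problem | change | request
    priority: Optional[str] = None  # assigned later from the impact x urgency matrix


def incident_from_alert(alert: dict) -> Incident:
    """Build an incident from an alert payload (assumed dictionary shape)."""
    return Incident(
        service=alert["service"],
        symptoms=alert.get("summary", "unknown"),
        impact_scope=alert.get("scope", "unknown"),
        reporter=alert.get("reporter", "monitoring"),
        alert_source=alert["source"],
        # Assumes the alert carries a Unix epoch timestamp in seconds.
        detected_at=datetime.fromtimestamp(alert["timestamp"], tz=timezone.utc),
    )
```

Missing optional fields default to "unknown" so that an incomplete alert still produces a record that triage can enrich.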

Priority matrix:

P1/Critical: Major service outage, security breach, data loss — response within 15 minutes
P2/High: Significant service degradation, workaround available — response within 30 minutes
P3/Medium: Minor service impact, single user or team — response within 2 hours
P4/Low: Cosmetic issue, informational request — response within 24 hours
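
The response targets above can be expressed as a small lookup. This is a minimal sketch: the names RESPONSE_TARGETS, response_deadline, and is_response_breached are illustrative, and the code assumes targets run on wall-clock time rather than business hours.

```python
from datetime import datetime, timedelta

# Response targets taken from the priority list above.
RESPONSE_TARGETS = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(minutes=30),
    "P3": timedelta(hours=2),
    "P4": timedelta(hours=24),
}


def response_deadline(priority: str, logged_at: datetime) -> datetime:
    """Latest time by which a first response is expected."""
    return logged_at + RESPONSE_TARGETS[priority]


def is_response_breached(priority: str, logged_at: datetime, now: datetime) -> bool:
    """True once the response target has passed without a response."""
    return now > response_deadline(priority, logged_at)
```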

2. Incident Triage & Assignment

Initial triage (L1):

Verify incident (not false positive)
Gather initial diagnostic information
Check known errors and KB for existing solutions
Attempt basic troubleshooting (service restart, cache clear)
Escalate to L2/L3 if the incident is not resolved within the SLA window

Specialist assignment (L2/L3):

Route by technical domain: infrastructure, application, network, database, security
Consider on-call rotation for after-hours incidents
Check specialist availability and workload
Assign backup engineer for coverage
Auto-page for P1/P2 incidents

Major incident declaration:

Criteria: service issue affecting more than 25% of users, revenue impact, or executive attention required
Notify incident commander and communication lead
Open major incident bridge/war room
Activate status page
Begin stakeholder notification cascade
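
A declaration check along these lines could be automated. The sketch below assumes that any single criterion is enough to declare a major incident, which the criteria list does not state explicitly; the function name and parameters are hypothetical.

```python
def should_declare_major(
    affected_users: int,
    total_users: int,
    revenue_impact: bool,
    executive_attention: bool,
) -> bool:
    """Return True when at least one major-incident criterion is met.

    Assumption: the criteria are alternatives (any one suffices).
    """
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    affected_share = affected_users / total_users
    return affected_share > 0.25 or revenue_impact or executive_attention
```

Treating the criteria as alternatives errs toward declaring early; a team that prefers fewer declarations could require two or more to match.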

3. Investigation & Diagnosis

Data gathering:

Collect relevant logs (application, system, network, database)
Review recent changes (deployment, configuration, infrastructure)
Check monitoring dashboards for correlated anomalies
Interview affected users for symptom details
Review error rates, latency, resource utilization trends

Root cause investigation:

Systematic elimination of potential causes
Reproduce issue in non-production if possible
Check vendor status pages and known issues
Engage external vendors if needed (cloud provider, SaaS vendor)
Time-correlate events across systems
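
Time correlation can be as simple as pulling every event that falls near the first symptom into one ordered list. This sketch assumes events have already been collected as (timestamp, system, message) tuples; the function name and the 10-minute default window are illustrative choices, not values from this skill.

```python
from datetime import datetime, timedelta


def events_near(events, anchor: datetime, window: timedelta = timedelta(minutes=10)):
    """Return events within `window` of `anchor`, ordered by timestamp.

    Each event is a (timestamp, system, message) tuple.
    """
    nearby = [e for e in events if abs(e[0] - anchor) <= window]
    return sorted(nearby, key=lambda e: e[0])
```

Reading a deployment, a configuration change, and the first error spike side by side in one timeline often narrows the list of candidate causes quickly.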

Workaround identification:

Identify temporary solution to restore service
Test workaround in staging if available
Document workaround steps clearly
Deploy workaround to production
Communicate workaround to affected users

4. Containment & Resolution

Containment actions:

Isolate affected components to prevent spread
Scale resources to absorb load while the service is degraded
Enable fallback services or maintenance pages
Block malicious traffic (security incidents)
Freeze deployments and changes to the affected systems

Fix implementation:

Apply permanent fix via change management process
For emergencies: expedited emergency change approval
Deploy fix in staging for validation first
Production deployment with rollback readiness
Monitor post-deployment metrics closely

Verification:

Confirm service restoration through monitoring
Verify with affected users (survey a sample when impact is broad)
Run smoke tests against affected service
Monitor for 30 minutes post-fix for stability
Mark the incident as resolved on the status page and send a resolution notification
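
The 30-minute stability watch can be sketched as a check over error-rate samples collected after the fix. The function name, the sample format of (timestamp, error_rate) pairs, and the max_error_rate threshold are assumptions; the right metric and threshold depend on the service.

```python
from datetime import datetime, timedelta


def is_stable(samples, fix_time: datetime, now: datetime,
              max_error_rate: float,
              window: timedelta = timedelta(minutes=30)) -> bool:
    """True when the full window has elapsed and every sample stayed under the threshold.

    `samples` is an iterable of (timestamp, error_rate) pairs.
    """
    window_end = fix_time + window
    if now < window_end:
        return False  # still inside the observation window
    in_window = [rate for t, rate in samples if fix_time <= t <= window_end]
    # No samples means stability cannot be confirmed.
    return bool(in_window) and all(rate <= max_error_rate for rate in in_window)
```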

5. Communication Management

Stakeholder communication plan:

Internal: engineering teams, management, support staff (every 30 min for P1)
External: customers, partners, public (via status page, every 60 min for P1)
Executive: C-suite briefings for major incidents
Regulators: legally required notifications for data breaches
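
To keep the P1 update cadence from slipping during a busy incident, a reminder check like the following could run on a timer. It is a sketch: only the 30-minute internal and 60-minute external intervals come from the plan above, while the names and the last_sent dictionary are hypothetical.

```python
from datetime import datetime, timedelta

# P1 update intervals from the communication plan above.
P1_UPDATE_INTERVALS = {
    "internal": timedelta(minutes=30),
    "external": timedelta(minutes=60),
}


def audiences_due(last_sent: dict, now: datetime) -> list:
    """Return the audiences whose next P1 update is due, in alphabetical order.

    `last_sent` maps an audience name to the time of its most recent update.
    """
    return sorted(
        audience
        for audience, interval in P1_UPDATE_INTERVALS.items()
        if now - last_sent[audience] >= interval
    )
```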

Communication templates:

Initial notification: "We are investigating an issue affecting [service]"
Update: "We have identified the cause and are implementing a fix"
Resolution: "Service has been restored. We are monitoring for stability"
Post-incident: "Post-mortem available — we've taken steps to prevent recurrence"

Status page management:

Update status within 15 minutes of incident detection
Provide clear, non-technical impact descriptions
Set expectations for next update timing
Archive status for incident record

6. Post-Incident Review

Post-mortem process:

Schedule within 3 business days of incident resolution
Invite all participants and stakeholders
Blameless culture focus: process improvement, not individual fault
Document timeline, impact, root cause, resolution
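
The scheduling deadline of 3 business days can be computed as shown below. This sketch treats Monday through Friday as business days and ignores public holidays, which is a simplifying assumption; the function name is illustrative.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start` (weekends skipped)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


# Example: an incident resolved on Thursday 2024-01-04 should have its
# post-mortem scheduled by Tuesday 2024-01-09.
postmortem_due = add_business_days(date(2024, 1, 4), 3)
```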

Root cause analysis:

5 Whys analysis to reach fundamental cause
Timeline reconstruction with precise timestamps
Identify contributing factors (people, process, technology)
Distinguish between root cause and proximate cause

Action item tracking:

Define specific, measurable action items with owners and deadlines
Categorize: immediate fix, process improvement, long-term prevention
Track action item completion (weekly status updates)
Verify effectiveness of actions (no recurrence within 90 days)
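
Action item tracking lends itself to a small data model. The following sketch assumes a hypothetical ActionItem record and two helper functions; the 90-day recurrence window is the only value taken from the list above.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ActionItem:
    """Hypothetical action item record."""
    description: str
    owner: str
    due: date
    category: str  # immediate fix | process improvement | long-term prevention
    done: bool = False


def overdue(items, today: date):
    """Open items whose due date has passed."""
    return [item for item in items if not item.done and item.due < today]


def is_effective(resolved_on: date, recurrences, today: date, window_days: int = 90) -> bool:
    """True once the window has elapsed with no recurrence inside it."""
    window_end = resolved_on + timedelta(days=window_days)
    if today < window_end:
        return False  # too early to judge
    return not any(resolved_on < r <= window_end for r in recurrences)
```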

Templates & Frameworks

Incident Severity Matrix

PRIORITY MATRIX
================

                | Low Impact | Medium Impact | High Impact | Critical Impact
----------------|------------|---------------|-------------|----------------
Scheduled       | P4         | P3            | P3          | P3
Non-Urgent      | P4         | P3            | P2          | P2
Urgent          | P3         | P2            | P1          | P1
Critical        | P3         | P2            | P1          | P1

IMPACT DEFINITIONS:
  Low: Single user, non-critical function
  Medium: Team/department, degraded functionality
  High: Multiple departments, service unavailable
  Critical: Organization-wide, revenue impact, security breach
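
The matrix can be encoded directly as a lookup table. In this sketch the cell values are copied from the matrix above, while the names IMPACT_LEVELS, PRIORITY_MATRIX, and priority_for are illustrative.

```python
# Column order matches the matrix above.
IMPACT_LEVELS = ("low", "medium", "high", "critical")

# Rows are urgency levels; each tuple follows IMPACT_LEVELS order.
PRIORITY_MATRIX = {
    "scheduled":  ("P4", "P3", "P3", "P3"),
    "non-urgent": ("P4", "P3", "P2", "P2"),
    "urgent":     ("P3", "P2", "P1", "P1"),
    "critical":   ("P3", "P2", "P1", "P1"),
}


def priority_for(urgency: str, impact: str) -> str:
    """Look up the priority for an urgency/impact pair (case-insensitive)."""
    column = IMPACT_LEVELS.index(impact.lower())
    return PRIORITY_MATRIX[urgency.lower()][column]
```

Keeping the table as data rather than nested conditionals makes it easy to review against the written matrix and to adjust when the organization revises its definitions.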

Post-Mortem Template

POST-MORTEM — [Incident ID, Date, Severity]
============================================

EXECUTIVE SUMMARY:
  What happened: [2-3 sentence summary]
  Impact: [affected users/services, duration, business impact]
  Root cause: [fundamental cause identified]

TIMELINE:
  [HH:MM] — Incident detected by [system/person]
  [HH:MM] — Initial triage and classification
  [HH:MM] — Root cause identified
  [HH:MM] — Workaround deployed
  [HH:MM] — Permanent fix deployed
  [HH:MM] — Service fully restored
  [HH:MM] — Monitoring confirmed stable

ROOT CAUSE ANALYSIS (5 Whys):
  Why did [symptom] occur? → [Answer 1]
  Why did [Answer 1] happen? → [Answer 2]
  Why did [Answer 2] happen? → [Answer 3]
  Why did [Answer 3] happen? → [Answer 4]
  Why did [Answer 4] happen? → ROOT CAUSE: [Answer 5]

WHAT WORKED WELL:
  [Positive observations about response]

WHAT COULD BE IMPROVED:
  [Areas for improvement]

ACTION ITEMS:
  [ ] [Action] — Owner: [Name] — Due: [Date] — Priority: [P1/P2/P3]
  [ ] [Action] — Owner: [Name] — Due: [Date] — Priority: [P1/P2/P3]

PREVENTION MEASURES:
  [Long-term changes to prevent recurrence]

Integration Points

Incident management tools (PagerDuty, Opsgenie, VictorOps): Alerting, paging, escalation
ITSM platforms (ServiceNow, Jira Service Management): Incident tracking, SLA management
Monitoring systems (Datadog, New Relic, Prometheus): Detection source, metrics context
Communication tools (Slack, Teams, Statuspage): Stakeholder communication
Runbook documentation and automation (Confluence, Rundeck): Documented and automated response procedures
ChatOps (Slack/Teams integrations): Real-time collaboration during incidents
BI/reporting tools: Incident trend analysis and metrics

Edge Cases

Cascading failures across multiple services: Prioritize restoration order based on dependency graph; declare separate incidents for each service
Vendor-caused outages: Track vendor communication; set realistic expectations; document for SLA exclusion
Security incident overlap: Coordinate with security team; balance transparency with information security; follow breach notification requirements
Recurring incidents: Trigger problem management ticket; investigate systemic root cause; implement permanent fix
After-hours/weekend incidents: Validate on-call coverage; adjust communication cadence; defer non-critical stakeholder notifications
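
For cascading failures, a restoration order can be derived from the dependency graph with a topological sort, so that each service is restored only after the services it depends on. This sketch uses the standard-library graphlib module (Python 3.9+); the restoration_order function and the example service names are hypothetical.

```python
from graphlib import TopologicalSorter


def restoration_order(dependencies: dict, affected: set) -> list:
    """Order affected services so that dependencies are restored first.

    `dependencies` maps each service to the set of services it depends on.
    Dependencies that are not affected are ignored.
    """
    subgraph = {
        service: {dep for dep in dependencies.get(service, ()) if dep in affected}
        for service in affected
    }
    # static_order() yields each node after all of its predecessors.
    return list(TopologicalSorter(subgraph).static_order())


# Example: web depends on api, and api depends on db and cache.
deps = {"web": {"api"}, "api": {"db", "cache"}, "db": set(), "cache": set()}
order = restoration_order(deps, {"web", "api", "db"})
```

graphlib raises CycleError if the graph contains a circular dependency, which is itself a useful finding to record in the post-mortem.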

Output

Incident Management Dashboard

INCIDENT DASHBOARD — Real-Time
===============================

ACTIVE INCIDENTS:
  🔴 P1: Production API Gateway — 45 min duration (war room active)
  ⚠  P2: Email delivery delays — 2h 15min duration (investigation)
  ✓  P3: Internal wiki slow — 45 min duration (workaround deployed)

MTTR TRENDS:
  P1 MTTR (30-day avg): 52 min (target: <60 min ✓)
  P2 MTTR (30-day avg): 2h 15min (target: <4h ✓)
  P3 MTTR (30-day avg): 1h 30min (target: <8h ✓)

INCIDENT VOLUME:
  This month: 47 (vs 52 last month ↓)
  P1 incidents: 3 (vs 5 last month ↓)
  Change-related: 8 (17% — target: <20% ✓)

SLA COMPLIANCE:
  Response time SLA: 97% (target: >95% ✓)
  Resolution time SLA: 94% (target: >90% ✓)
  Communication SLA: 99% (target: >95% ✓)

POST-MORTEM TRACKING:
  Completed (last 30 days): 5/5 (100%)
  Open action items: 12
  Overdue action items: 2
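
Two of the dashboard figures, MTTR and the change-related share, are simple aggregates. The sketch below shows one way to compute them; the function names are illustrative, and MTTR is assumed here to be the arithmetic mean of detection-to-resolution durations.

```python
from datetime import timedelta


def mttr(durations: list) -> timedelta:
    """Mean time to resolve, as the average of resolution durations."""
    if not durations:
        raise ValueError("at least one resolved incident is required")
    return sum(durations, timedelta()) / len(durations)


def percent_of_total(part: int, total: int) -> int:
    """Share of `part` in `total`, rounded to a whole percent."""
    return round(100 * part / total)


# Example: 8 change-related incidents out of 47 this month.
change_related_share = percent_of_total(8, 47)
```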

Trigger Phrases

"incident", "outage", "major incident", "war room", "incident response", "post-mortem", "root cause analysis", "incident triage", "incident communication", "incident commander", "P1/P2/P3", "service restoration", "incident bridge", "status page update", "blameless post-mortem", "incident timeline", "on-call escalation"

Disclaimer: All rights reserved by Circulos AI. These skills are specifically designed for Claude Code, Claude Cowork, Codex, and OpenClaw. When using or referencing any skill, please provide proper attribution to Circulos AI.