IT AI Skill

Problem Management

Identify, analyze, and resolve underlying causes of recurring incidents through systematic problem management. Use when investigating root causes of recurring issues, managing known errors, coordinating workarounds, tracking problem resolution, or conducting trend analysis. Triggers on phrases like "problem management", "root cause", "known error", "recurring incident", "problem record", "trend analysis", "permanent fix", "workaround".

Proactively identify and eliminate root causes of incidents to prevent recurrence and improve service stability.

Workflow

1. Problem Identification

Reactive problem identification:

Analyze incident patterns from incident management system
Identify recurring incidents (same cause, >3 occurrences in 30 days)
Major incident post-mortems triggering problem records
Emergency changes indicating underlying issues
User complaints and support ticket trend analysis
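
The recurrence rule above (same cause, more than 3 occurrences in 30 days) can be sketched as a sliding-window count. This is a minimal illustration, not a reference implementation: the (cause, opened_on) pair layout and the function name find_recurring_causes are assumptions, and a real incident export would need its own field mapping.

```python
from collections import defaultdict
from datetime import date, timedelta


def find_recurring_causes(incidents, window_days=30, min_occurrences=4):
    """Return {cause: peak_count} for causes that recur within a rolling window.

    incidents: iterable of (cause, opened_on) pairs (hypothetical layout).
    min_occurrences=4 encodes the "more than 3 occurrences" rule.
    """
    by_cause = defaultdict(list)
    for cause, opened_on in incidents:
        by_cause[cause].append(opened_on)

    window = timedelta(days=window_days)
    recurring = {}
    for cause, dates in by_cause.items():
        dates.sort()
        start = 0
        best = 0
        # Two incidents share a window when they are less than window_days apart.
        for end, current in enumerate(dates):
            while current - dates[start] >= window:
                start += 1
            best = max(best, end - start + 1)
        if best >= min_occurrences:
            recurring[cause] = best
    return recurring


# Illustrative data only.
sample = [
    ("db-timeout", date(2025, 4, 1)),
    ("db-timeout", date(2025, 4, 3)),
    ("db-timeout", date(2025, 4, 9)),
    ("db-timeout", date(2025, 4, 20)),
    ("disk-full", date(2025, 4, 2)),
    ("disk-full", date(2025, 5, 15)),
]
print(find_recurring_causes(sample))  # {'db-timeout': 4}
```

Grouping by an exact cause string is the weakest part of this sketch; in practice the grouping key might be a CI, a category, or an error signature.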

Proactive problem identification:

Trend analysis from monitoring data (degradation before failure)
Capacity analysis identifying approaching thresholds
Vendor advisories and security bulletins
Quality metrics indicating systemic issues
Proactive health assessments of critical infrastructure
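
For capacity analysis, one simple way to spot an approaching threshold is to fit a straight line to recent usage samples and extrapolate. The sketch below assumes one sample per day and roughly linear growth; the function name days_until_threshold and the 90% default threshold are illustrative choices, not values from this skill.

```python
def days_until_threshold(daily_usage_pct, threshold_pct=90.0):
    """Estimate days remaining until usage crosses threshold_pct.

    Fits an ordinary least-squares line to one-sample-per-day data.
    Returns None when usage is flat or falling.
    """
    n = len(daily_usage_pct)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_usage_pct) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_usage_pct))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_day = (threshold_pct - intercept) / slope
    # Days are counted from the most recent sample (index n - 1).
    return max(0.0, crossing_day - (n - 1))


# Illustrative data: disk usage growing 2 points per day.
print(days_until_threshold([70, 72, 74, 76, 78]))  # 6.0
```

A result below an agreed lead time (for example, the time needed to procure storage) would be a reasonable trigger for a proactive problem record.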

Problem record creation:

Create problem record linked to related incidents
Initial classification: severity (S1-S4), category (hardware, software, network, process)
Assign problem owner based on technical domain
Set target resolution date based on severity
Notify stakeholders of problem investigation
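
The record creation steps above could be modeled roughly as follows. The class name ProblemRecord, its field names, and especially the TARGET_DAYS values are hypothetical; this skill does not define resolution targets per severity, so substitute the figures from your own policy.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

# Hypothetical resolution targets in calendar days, keyed by severity.
TARGET_DAYS = {"S1": 7, "S2": 21, "S3": 45, "S4": 90}
CATEGORIES = {"hardware", "software", "network", "process"}


@dataclass
class ProblemRecord:
    problem_id: str
    summary: str
    severity: str
    category: str
    owner: str
    opened_on: date
    related_incidents: list = field(default_factory=list)
    target_resolution: date = field(init=False)

    def __post_init__(self):
        if self.severity not in TARGET_DAYS:
            raise ValueError(f"unknown severity: {self.severity}")
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")
        # Target date is derived from severity, as in the workflow step above.
        self.target_resolution = self.opened_on + timedelta(
            days=TARGET_DAYS[self.severity]
        )


record = ProblemRecord(
    problem_id="PROB-EXAMPLE-1",  # placeholder ID
    summary="Intermittent database connection timeouts",
    severity="S2",
    category="software",
    owner="Database Team",
    opened_on=date(2025, 4, 3),
    related_incidents=["INC-EXAMPLE-1", "INC-EXAMPLE-2"],
)
print(record.target_resolution)  # 2025-04-24
```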

2. Problem Investigation & Diagnosis

Data collection and analysis:

Gather related incidents, error logs, monitoring data, change records
Interview subject matter experts and incident responders
Review recent changes that may have contributed
Analyze system architecture for single points of failure
Check vendor knowledge bases and support forums
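
Reviewing recent changes usually starts with a shortlist of what was deployed shortly before the first related incident. A minimal sketch, assuming change records are available as (change_id, deployed_on) pairs; the function name, the 21-day lookback, and the CHG-EXAMPLE identifiers are placeholders.

```python
from datetime import date, timedelta


def changes_before(first_incident_on, changes, lookback_days=21):
    """Return changes deployed in the lookback window, most recent first."""
    earliest = first_incident_on - timedelta(days=lookback_days)
    candidates = [
        (change_id, deployed_on)
        for change_id, deployed_on in changes
        if earliest <= deployed_on <= first_incident_on
    ]
    return sorted(candidates, key=lambda c: c[1], reverse=True)


# Illustrative change log.
change_log = [
    ("CHG-EXAMPLE-1", date(2025, 2, 1)),
    ("v2.3 deployment", date(2025, 3, 15)),
    ("CHG-EXAMPLE-2", date(2025, 3, 28)),
]
for change_id, deployed_on in changes_before(date(2025, 4, 1), change_log):
    print(change_id, deployed_on)
# CHG-EXAMPLE-2 2025-03-28
# v2.3 deployment 2025-03-15
```

Proximity in time only produces candidates; each shortlisted change still has to be confirmed or ruled out with the techniques listed next.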

Root cause analysis techniques:

5 Whys: iterative questioning to reach fundamental cause
Fishbone diagram (Ishikawa): categorize potential causes (people, process, technology, environment)
Fault tree analysis: top-down logical analysis of failure paths
Pareto analysis: identify the small share of causes (often around 20%) that drives most incidents (often around 80%)
Time-based correlation: map events chronologically
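
Of the techniques above, Pareto analysis is the easiest to automate. The sketch below ranks causes by incident count and stops once the cumulative share reaches a cutoff; the function name pareto_causes and the sample cause labels are illustrative.

```python
from collections import Counter


def pareto_causes(incident_causes, cutoff=0.8):
    """Return (cause, count, cumulative_share) rows up to the cutoff."""
    counts = Counter(incident_causes)
    total = sum(counts.values())
    vital = []
    cumulative = 0
    for cause, n in counts.most_common():
        cumulative += n
        vital.append((cause, n, cumulative / total))
        if cumulative / total >= cutoff:
            break
    return vital


# Illustrative data: 100 incidents tagged with a cause.
causes = ["db"] * 50 + ["network"] * 30 + ["config"] * 12 + ["disk"] * 8
for row in pareto_causes(causes):
    print(row)
# ('db', 50, 0.5)
# ('network', 30, 0.8)
```

Here two of four causes account for 80% of incidents, so those two would be investigated first.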

Interim and permanent solutions:

Define interim workaround to reduce the impact of further incidents
Document workaround as a knowledge article
Communicate workaround to support teams
Develop permanent fix plan with timeline
Create change request for permanent fix implementation

3. Known Error Management

Known Error Database (KEDB):

Document known error with: symptoms, root cause (if known), workaround, affected CIs, status
Publish to service knowledge base for support team access
Link to related problem and incident records
Update as investigation progresses
Archive when permanent fix is deployed and verified
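
A KEDB entry with the fields listed above, plus a crude lookup a service desk tool might run against a new incident description, could look like this. The KnownError class, its field names, and the word-overlap matching are assumptions for illustration; a production KEDB would typically use the ITSM platform's own search.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KnownError:
    error_id: str
    symptoms: str
    workaround: str
    affected_cis: list
    status: str = "active"
    root_cause: Optional[str] = None  # may be unknown while investigating
    problem_id: Optional[str] = None


def match_known_errors(kedb, incident_text):
    """Rank active known errors by words shared with the incident text."""
    words = set(incident_text.lower().split())
    hits = []
    for entry in kedb:
        if entry.status != "active":
            continue
        overlap = words & set(entry.symptoms.lower().split())
        if overlap:
            hits.append((len(overlap), entry))
    hits.sort(key=lambda h: h[0], reverse=True)
    return [entry for _, entry in hits]


# Illustrative entries with placeholder IDs.
kedb = [
    KnownError(
        "KE-EXAMPLE-1",
        "Checkout requests fail with database connection timeout",
        "Retry the request; escalate if it fails twice",
        ["checkout-service", "orders-db"],
    ),
    KnownError(
        "KE-EXAMPLE-2",
        "VPN login rejected after password reset",
        "Clear cached credentials and reconnect",
        ["vpn-gateway"],
    ),
]
matches = match_known_errors(kedb, "checkout page shows connection timeout error")
print([m.error_id for m in matches])  # ['KE-EXAMPLE-1']
```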

Workaround deployment:

Implement workaround across all affected systems
Validate workaround effectiveness
Monitor for recurrence with workaround in place
Train support teams on workaround application
Review workaround regularly for continued effectiveness

Known error lifecycle management:

Regular review of open known errors
Prioritize by incident volume and business impact
Escalate known errors without progress to management
Close known errors when permanent fix confirmed
Track workaround vs permanent fix ratio
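
The review steps above (prioritize by incident volume and business impact, escalate entries without progress) can be sketched with a simple score. The weights in IMPACT_WEIGHT, the 30-day idle limit, and the dictionary keys are all hypothetical; they are placeholders for whatever your review policy specifies.

```python
from datetime import date

# Hypothetical weights for business impact.
IMPACT_WEIGHT = {"low": 1, "medium": 2, "high": 3}


def rank_known_errors(known_errors):
    """Sort by incident volume multiplied by the impact weight, highest first."""
    return sorted(
        known_errors,
        key=lambda ke: ke["incidents_30d"] * IMPACT_WEIGHT[ke["impact"]],
        reverse=True,
    )


def needs_escalation(known_error, today, max_idle_days=30):
    """True when the record has not been updated within max_idle_days."""
    return (today - known_error["last_update"]).days > max_idle_days


# Illustrative records.
open_errors = [
    {"id": "KE-A", "incidents_30d": 15, "impact": "high", "last_update": date(2025, 4, 20)},
    {"id": "KE-B", "incidents_30d": 9, "impact": "medium", "last_update": date(2025, 3, 15)},
    {"id": "KE-C", "incidents_30d": 20, "impact": "low", "last_update": date(2025, 4, 25)},
]
today = date(2025, 4, 30)
print([ke["id"] for ke in rank_known_errors(open_errors)])  # ['KE-A', 'KE-C', 'KE-B']
print([ke["id"] for ke in open_errors if needs_escalation(ke, today)])  # ['KE-B']
```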

4. Permanent Resolution & Verification

Fix implementation:

Submit change request for permanent fix
Coordinate with change management for scheduling
Test in non-production before production deployment
Implement fix through the change management process
Validate fix resolves root cause (not just symptom)

Verification and validation:

Monitor for 30 days post-fix for recurrence
Compare incident rates before and after fix
Validate fix does not introduce new issues
Confirm all related incidents resolved
Update problem record with verification evidence
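
Comparing incident rates before and after the fix is simple arithmetic, but normalizing by the observation period avoids misleading results when the two windows differ in length. A minimal sketch; the function name and the sample "after" figure are illustrative.

```python
def incident_rate_change(before_count, before_days, after_count, after_days):
    """Return (before_rate, after_rate, reduction) in incidents per day.

    reduction is the fractional drop relative to the before rate.
    """
    if before_days <= 0 or after_days <= 0:
        raise ValueError("observation periods must be positive")
    before_rate = before_count / before_days
    after_rate = after_count / after_days
    if before_rate == 0:
        raise ValueError("no incidents before the fix; reduction is undefined")
    reduction = (before_rate - after_rate) / before_rate
    return before_rate, after_rate, reduction


# 12 incidents in the 30 days before the fix, 1 in the 30 days after (illustrative).
before, after, reduction = incident_rate_change(12, 30, 1, 30)
print(f"{before:.2f}/day -> {after:.2f}/day ({reduction:.1%} reduction)")
# 0.40/day -> 0.03/day (91.7% reduction)
```

With small counts like these the percentage is noisy, so the comparison works best as supporting evidence alongside confirmation that the root cause itself is gone.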

Problem closure:

Close linked incidents if not already closed
Archive the known error record
Document lessons learned
Share findings with relevant teams
Update runbooks and procedures

5. Trend Analysis & Continuous Improvement

Problem trend reporting:

Monthly problem trend report by category, severity, and resolution time
Track mean time to identify (MTTI) and mean time to resolve (MTTR)
Identify top problem areas by incident volume
Measure proactive vs reactive problem ratio
Track problem resolution compliance with targets
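
The reporting metrics above could be derived from problem records along these lines. The sketch assumes MTTI runs from record creation to root cause identification and MTTR from record creation to resolution; those definitions, the dictionary keys, and the function name trend_metrics are assumptions, so adjust them to your own measurement rules.

```python
from datetime import date
from statistics import mean


def trend_metrics(problems):
    """Compute MTTI, MTTR (in days) and the proactive share of problems."""
    identify_days = [
        (p["root_cause_on"] - p["opened_on"]).days
        for p in problems
        if p["root_cause_on"] is not None
    ]
    resolve_days = [
        (p["resolved_on"] - p["opened_on"]).days
        for p in problems
        if p["resolved_on"] is not None
    ]
    proactive = sum(1 for p in problems if p["proactive"])
    return {
        "mtti_days": mean(identify_days) if identify_days else None,
        "mttr_days": mean(resolve_days) if resolve_days else None,
        "proactive_share": proactive / len(problems) if problems else None,
    }


# Illustrative records; the third problem is still open.
problems = [
    {"opened_on": date(2025, 4, 1), "root_cause_on": date(2025, 4, 5),
     "resolved_on": date(2025, 4, 15), "proactive": False},
    {"opened_on": date(2025, 4, 2), "root_cause_on": date(2025, 4, 4),
     "resolved_on": date(2025, 4, 10), "proactive": True},
    {"opened_on": date(2025, 4, 10), "root_cause_on": date(2025, 4, 16),
     "resolved_on": None, "proactive": False},
]
metrics = trend_metrics(problems)
print(metrics["mtti_days"], metrics["mttr_days"])  # 4 11
```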

Service improvement planning:

Feed problem data into the Continual Service Improvement (CSI) register
Prioritize improvement initiatives based on problem impact
Track improvement initiative implementation
Measure improvement effectiveness
Close the loop: verify that improvements reduce problem volume

Knowledge management:

Update documentation with problem findings
Enhance monitoring based on problem patterns
Improve detection and early warning indicators
Update training materials for support teams
Share cross-team learnings

Templates & Frameworks

Problem Record Template

PROBLEM RECORD — [PROB-2025-0089]
===================================

SUMMARY: Intermittent database connection timeouts affecting checkout service

SEVERITY: S2 (High)
CATEGORY: Software / Database
STATUS: Investigation in progress
OWNER: Database Team — John Smith

RELATED INCIDENTS: 12 (last 30 days)
AFFECTED SERVICES: E-commerce checkout, order processing
BUSINESS IMPACT: ~5% order failure rate during peak hours

TIMELINE:
  2025-04-01 — First related incident reported
  2025-04-03 — Problem record created (recurrence pattern identified)
  2025-04-05 — Interim workaround implemented (connection pool increase)
  2025-04-10 — Workaround validated (incidents reduced by 80%)
  2025-04-15 — Root cause identified (query optimization needed)
  2025-04-20 — Permanent fix planned (CR submitted)

INTERIM WORKAROUND:
  Increased database connection pool from 50 to 100 connections
  Added connection timeout retry logic in application layer
  Effectiveness: 80% reduction in related incidents

ROOT CAUSE (Preliminary):
  Inefficient database queries causing connection pool exhaustion during peak load.
  Query plan regression introduced in v2.3 deployment (March 15).

PERMANENT FIX PLAN:
  1. Optimize affected queries and add missing indexes
  2. Implement query plan baseline enforcement
  3. Add connection pool monitoring and auto-scaling
  4. Deploy via CR-2025-0445 — scheduled April 22

TARGET RESOLUTION: April 25, 2025

Problem Trend Report

PROBLEM MANAGEMENT TRENDS — Q1 2025
=====================================

PROBLEM STATISTICS:
  Total problems: 34
  Resolved: 28 (82.4%)
  Open: 6
  Proactive: 9 (26.5%)
  Reactive: 25 (73.5%)

MEAN TIMES:
  MTTI (Mean Time to Identify): 4.2 days
  MTTR (Mean Time to Resolve): 12.8 days
  Time to Workaround: 2.1 days

TOP PROBLEM CATEGORIES:
  1. Database connectivity: 7 problems (20.6%)
  2. Application performance: 6 problems (17.6%)
  3. Network connectivity: 5 problems (14.7%)
  4. Storage capacity: 4 problems (11.8%)
  5. Configuration errors: 4 problems (11.8%)

PROBLEM-RELATED INCIDENT REDUCTION:
  Before resolution: avg 8.3 incidents/problem
  After workaround: avg 1.6 incidents/problem (80.7% reduction)
  After permanent fix: avg 0.3 incidents/problem (96.4% reduction)

TOP IMPROVEMENT RECOMMENDATIONS:
  1. Implement query performance baseline monitoring (estimated 40% DB problem reduction)
  2. Automate connection pool scaling (estimated 60% connectivity problem reduction)
  3. Enhanced pre-deployment testing for performance regression

Integration Points

ITSM platforms (ServiceNow, Jira Service Management): Problem tracking and workflow
Incident management: Related incident correlation
CMDB: CI impact and dependency analysis
Monitoring systems: Trend data and proactive detection
Change management: Fix implementation coordination
Knowledge management: Known error documentation and sharing
Service desk: Workaround communication
Vendor management: External support escalation

Edge Cases

Multi-vendor problem: Designate single problem owner for coordination; establish joint investigation timeline; manage vendor blame-shifting
No clear root cause after extended investigation: Escalate to management; consider accepting the workaround as a long-term measure; allocate additional resources or external expertise
Fix introduces new problems: Halt fix deployment; create new problem record; assess whether to revert to previous state
Business accepts risk without fix: Document business decision; implement monitoring and alerting; set review date; track as managed risk
Cross-organizational problems: Coordinate with partner organizations; define information sharing boundaries; align resolution timelines

Output

Problem Management Dashboard

PROBLEM MANAGEMENT — April 2025
================================

OPEN PROBLEMS:
  S1 (Critical): 0
  S2 (High): 2
  S3 (Medium): 3
  S4 (Low): 1

RESOLUTION TRACKING:
  Resolved this month: 8
  Mean resolution time: 11.3 days
  Workaround deployment time: 2.4 days
  On-target resolution: 7/8 (87.5%)

KNOWN ERRORS:
  Active known errors: 14
  With workarounds: 12 (86%)
  Without workarounds: 2 (escalated)
  Average workaround effectiveness: 84%

PROACTIVE VS REACTIVE:
  Proactive problems: 4 (36%)
  Reactive problems: 7 (64%)
  Target: >40% proactive

TOP RECURRING ISSUES:
  🔴 Database timeout cascade — 15 related incidents (PROB-0089)
  ⚠  Network latency spikes — 9 related incidents (PROB-0091)
  ⚠  Application memory leaks — 7 related incidents (PROB-0093)

IMPROVEMENT INITIATIVES:
  [ ] Query performance monitoring — Due: May 15
  [ ] Auto-scaling connection pools — Due: May 30
  [ ] Memory leak detection automation — Due: June 15

Trigger Phrases

"problem management", "root cause analysis", "known error", "recurring incident", "problem record", "trend analysis", "permanent fix", "workaround", "KEDB", "problem owner", "problem investigation", "fault tree analysis", "5 whys", "service improvement", "problem closure"

Disclaimer: All rights reserved by Circulos AI. These skills are specifically designed for Claude Code, Claude Cowork, Codex, and OpenClaw. When using or referencing any skill, please provide proper attribution to Circulos AI.