Problem Management

Identify, analyze, and resolve underlying causes of recurring incidents through systematic problem management. Use when investigating root causes of recurring issues, managing known errors, coordinating workarounds, tracking problem resolution, or conductin...

Proactively identify and eliminate root causes of incidents to prevent recurrence and improve service stability.

Workflow

1. Problem Identification

  1. Reactive problem identification:
  2. Proactive problem identification:
  3. Problem record creation:
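
Reactive identification usually starts by spotting incidents that repeat. The following is a minimal sketch, assuming incidents carry a service name and a normalized symptom label; the `Incident` fields, the 30-day window, and the threshold of 3 are illustrative assumptions, not values prescribed by this skill.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Incident:
    incident_id: str  # hypothetical identifier format
    service: str
    symptom: str      # assumed to be a normalized symptom label
    opened: date


def recurring_candidates(incidents, today, window_days=30, threshold=3):
    """Return (service, symptom) pairs seen at least `threshold` times in the window."""
    cutoff = today - timedelta(days=window_days)
    counts = Counter(
        (i.service, i.symptom) for i in incidents if i.opened >= cutoff
    )
    return {key: n for key, n in counts.items() if n >= threshold}


incidents = [
    Incident("INC-1", "checkout", "db timeout", date(2025, 4, 1)),
    Incident("INC-2", "checkout", "db timeout", date(2025, 4, 2)),
    Incident("INC-3", "checkout", "db timeout", date(2025, 4, 3)),
    Incident("INC-4", "search", "slow response", date(2025, 4, 2)),
]
print(recurring_candidates(incidents, today=date(2025, 4, 3)))
# {('checkout', 'db timeout'): 3}
```

Each pair returned is a candidate for a problem record; whether one is actually opened remains a judgment call for the problem owner.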

2. Problem Investigation & Diagnosis

  1. Data collection and analysis:
  2. Root cause analysis techniques:
  3. Interim and permanent solutions:

3. Known Error Management

  1. Known Error Database (KEDB):
  2. Workaround deployment:
  3. Known error lifecycle management:
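
One way to picture a KEDB lookup is a keyword match between a new incident description and the active known errors. This is only a sketch: the `KnownError` fields, the status labels, and the all-keywords-must-match rule are assumptions made for illustration, and a real ITSM tool would expose its own search.

```python
from dataclasses import dataclass


@dataclass
class KnownError:
    problem_id: str
    keywords: frozenset
    workaround: str
    status: str = "active"  # "active" / "retired" labels are assumptions


def match_known_errors(kedb, incident_text):
    """Return active known errors whose keywords all appear in the incident text."""
    words = set(incident_text.lower().split())
    return [ke for ke in kedb if ke.status == "active" and ke.keywords <= words]


kedb = [
    KnownError(
        "PROB-2025-0089",
        frozenset({"connection", "timeout"}),
        "Increase connection pool; retry on connection timeout",
    ),
    KnownError(
        "PROB-2025-0001",  # hypothetical retired entry
        frozenset({"timeout"}),
        "No longer needed",
        status="retired",
    ),
]

for ke in match_known_errors(kedb, "Checkout failed: database connection timeout"):
    print(ke.problem_id, "->", ke.workaround)
```

Retired entries are skipped so that a workaround is not suggested once the permanent fix is in place.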

4. Permanent Resolution & Verification

  1. Fix implementation:
  2. Verification and validation:
  3. Problem closure:

5. Trend Analysis & Continuous Improvement

  1. Problem trend reporting:
  2. Service improvement planning:
  3. Knowledge management:

Templates & Frameworks

Problem Record Template

PROBLEM RECORD — [PROB-2025-0089]
===================================

SUMMARY: Intermittent database connection timeouts affecting checkout service

SEVERITY: S2 (High)
CATEGORY: Software / Database
STATUS: Investigation in progress
OWNER: Database Team — John Smith

RELATED INCIDENTS: 12 (last 30 days)
AFFECTED SERVICES: E-commerce checkout, order processing
BUSINESS IMPACT: ~5% order failure rate during peak hours

TIMELINE:
  2025-04-01 — First related incident reported
  2025-04-03 — Problem record created (recurrence pattern identified)
  2025-04-05 — Interim workaround implemented (connection pool increase)
  2025-04-10 — Workaround validated (incidents reduced by 80%)
  2025-04-15 — Root cause identified (query optimization needed)
  2025-04-20 — Permanent fix planned (CR submitted)

INTERIM WORKAROUND:
  Increased database connection pool from 50 to 100 connections
  Added connection timeout retry logic in application layer
  Effectiveness: 80% reduction in related incidents

ROOT CAUSE (Preliminary):
  Inefficient database queries causing connection pool exhaustion during peak load.
  Query plan regression introduced in v2.3 deployment (March 15).

PERMANENT FIX PLAN:
  1. Optimize affected queries and add missing indexes
  2. Implement query plan baseline enforcement
  3. Add connection pool monitoring and auto-scaling
  4. Deploy via CR-2025-0445 — scheduled April 22

TARGET RESOLUTION: April 25, 2025
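
If the record above is kept as structured data, elapsed times between milestones can be derived rather than typed by hand. The sketch below assumes a hypothetical `ProblemRecord` class and event names; only the dates and field values come from the template.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ProblemRecord:
    problem_id: str
    summary: str
    severity: str
    status: str
    timeline: dict = field(default_factory=dict)  # event name -> date

    def days_between(self, start_event, end_event):
        """Whole days elapsed between two timeline events."""
        return (self.timeline[end_event] - self.timeline[start_event]).days


record = ProblemRecord(
    problem_id="PROB-2025-0089",
    summary="Intermittent database connection timeouts affecting checkout service",
    severity="S2",
    status="Investigation in progress",
    timeline={
        "first_incident": date(2025, 4, 1),
        "record_created": date(2025, 4, 3),
        "workaround": date(2025, 4, 5),
        "root_cause": date(2025, 4, 15),
    },
)

print(record.days_between("record_created", "workaround"))  # 2
print(record.days_between("record_created", "root_cause"))  # 12
```

Per-record figures of this kind can then feed reported means such as time to workaround.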

Problem Trend Report

PROBLEM MANAGEMENT TRENDS — Q1 2025
=====================================

PROBLEM STATISTICS:
  Total problems: 34
  Resolved: 28 (82.4%)
  Open: 6
  Proactive: 9 (26.5%)
  Reactive: 25 (73.5%)

MEAN TIMES:
  MTTI (Mean Time to Identify): 4.2 days
  MTTR (Mean Time to Resolve): 12.8 days
  Time to Workaround: 2.1 days

TOP PROBLEM CATEGORIES:
  1. Database connectivity: 7 problems (20.6%)
  2. Application performance: 6 problems (17.6%)
  3. Network connectivity: 5 problems (14.7%)
  4. Storage capacity: 4 problems (11.8%)
  5. Configuration errors: 4 problems (11.8%)

PROBLEM-RELATED INCIDENT REDUCTION:
  Before resolution: avg 8.3 incidents/problem
  After workaround: avg 1.6 incidents/problem (80.7% reduction)
  After permanent fix: avg 0.3 incidents/problem (96.4% reduction)

TOP IMPROVEMENT RECOMMENDATIONS:
  1. Implement query performance baseline monitoring (estimated 40% reduction in database problems)
  2. Automate connection pool scaling (estimated 60% reduction in connectivity problems)
  3. Enhance pre-deployment testing to catch performance regressions
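
The percentages in the report follow from simple ratios. As a quick check, the sketch below recomputes a few of them from the counts shown above; the helper names `share` and `reduction` are made up for this example.

```python
def share(part, total):
    """Percentage of `total` represented by `part`, rounded to one decimal."""
    return round(100 * part / total, 1)


def reduction(before, after):
    """Percentage drop from `before` to `after`, rounded to one decimal."""
    return round(100 * (before - after) / before, 1)


print(share(28, 34))        # resolved share: 82.4
print(share(9, 34))         # proactive share: 26.5
print(reduction(8.3, 1.6))  # after workaround: 80.7
print(reduction(8.3, 0.3))  # after permanent fix: 96.4
```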

Integration Points

Edge Cases

Output

Problem Management Dashboard

PROBLEM MANAGEMENT — April 2025
================================

OPEN PROBLEMS:
  S1 (Critical): 0
  S2 (High): 2
  S3 (Medium): 3
  S4 (Low): 1

RESOLUTION TRACKING:
  Resolved this month: 8
  Mean resolution time: 11.3 days
  Workaround deployment time: 2.4 days
  On-target resolution: 7/8 (87.5%)

KNOWN ERRORS:
  Active known errors: 14
  With workarounds: 12 (86%)
  Without workarounds: 2 (escalated)
  Average workaround effectiveness: 84%

PROACTIVE VS REACTIVE:
  Proactive problems: 4 (36%)
  Reactive problems: 7 (64%)
  Target: >40% proactive

TOP RECURRING ISSUES:
  🔴 Database timeout cascade — 15 related incidents (PROB-0089)
  ⚠  Network latency spikes — 9 related incidents (PROB-0091)
  ⚠  Application memory leaks — 7 related incidents (PROB-0093)

IMPROVEMENT INITIATIVES:
  [ ] Query performance monitoring — Due: May 15
  [ ] Auto-scaling connection pools — Due: May 30
  [ ] Memory leak detection automation — Due: June 15
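
The dashboard compares the proactive share against a target. A minimal sketch of that check, using the counts shown above; the function and constant names are assumptions for illustration.

```python
TARGET_PROACTIVE_PCT = 40.0  # from "Target: >40% proactive"


def proactive_share(proactive, reactive):
    """Percentage of problems that were identified proactively."""
    total = proactive + reactive
    return 100 * proactive / total if total else 0.0


pct = proactive_share(4, 7)
print(f"{pct:.0f}% proactive, target met: {pct > TARGET_PROACTIVE_PCT}")
# 36% proactive, target met: False
```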

Trigger Phrases

"problem management", "root cause analysis", "known error", "recurring incident", "problem record", "trend analysis", "permanent fix", "workaround", "KEDB", "problem owner", "problem investigation", "fault tree analysis", "5 whys", "service improvement", "problem closure"