---
name: problem-management
description: Identify, analyze, and resolve underlying causes of recurring incidents through systematic problem management. Use when investigating root causes of recurring issues, managing known errors, coordinating workarounds, tracking problem resolution, or conducting trend analysis. Triggers on phrases like "problem management", "root cause", "known error", "recurring incident", "problem record", "trend analysis", "permanent fix", "workaround".
---

# Problem Management

Proactively identify and eliminate root causes of incidents to prevent recurrence and improve service stability.

## Workflow

### 1. Problem Identification

1. **Reactive problem identification**:
   - Analyze incident patterns from incident management system
   - Flag recurring incidents (same symptom or suspected cause, more than 3 occurrences in 30 days)
   - Open problem records from major incident post-mortems
   - Review emergency changes that point to underlying issues
   - Analyze trends in user complaints and support tickets

2. **Proactive problem identification**:
   - Trend analysis from monitoring data (degradation before failure)
   - Capacity analysis identifying approaching thresholds
   - Vendor advisories and security bulletins
   - Quality metrics indicating systemic issues
   - Proactive health assessments of critical infrastructure

3. **Problem record creation**:
   - Create problem record linked to related incidents
   - Initial classification: severity (S1-S4), category (hardware, software, network, process)
   - Assign problem owner based on technical domain
   - Set target resolution date based on severity
   - Notify stakeholders of problem investigation
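
The recurrence rule above (more than 3 occurrences in 30 days) can be checked mechanically against an incident export. The following is a minimal sketch, assuming each incident has already been tagged with a grouping key; the `signature` values and sample dates are hypothetical, not taken from any particular ITSM platform.

```python
from collections import defaultdict
from datetime import date, timedelta

# Thresholds from the workflow above: more than 3 occurrences in 30 days.
WINDOW = timedelta(days=30)
MIN_OCCURRENCES = 4  # ">3 occurrences"


def find_recurring(incidents):
    """Return {signature: peak count} for signatures that recur within WINDOW.

    `incidents` is an iterable of (signature, opened_on) pairs. The signature
    is a hypothetical grouping key such as "db-timeout".
    """
    by_signature = defaultdict(list)
    for signature, opened_on in incidents:
        by_signature[signature].append(opened_on)

    recurring = {}
    for signature, dates in by_signature.items():
        dates.sort()
        start = 0
        best = 0
        # Sliding window over sorted dates: advance the left edge until
        # the span between first and last incident is under 30 days.
        for end, current in enumerate(dates):
            while current - dates[start] >= WINDOW:
                start += 1
            best = max(best, end - start + 1)
        if best >= MIN_OCCURRENCES:
            recurring[signature] = best
    return recurring


# Illustrative data only.
sample = [
    ("db-timeout", date(2025, 4, 1)),
    ("db-timeout", date(2025, 4, 3)),
    ("db-timeout", date(2025, 4, 9)),
    ("db-timeout", date(2025, 4, 20)),
    ("dns-failure", date(2025, 4, 2)),
    ("dns-failure", date(2025, 6, 1)),
]
print(find_recurring(sample))  # {'db-timeout': 4}
```

Each flagged signature is a candidate for a problem record; whether the incidents truly share a cause still needs human review.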

### 2. Problem Investigation & Diagnosis

1. **Data collection and analysis**:
   - Gather related incidents, error logs, monitoring data, change records
   - Interview subject matter experts and incident responders
   - Review recent changes that may have contributed
   - Analyze system architecture for single points of failure
   - Check vendor knowledge bases and support forums

2. **Root cause analysis techniques**:
   - 5 Whys: iterative questioning to reach fundamental cause
   - Fishbone diagram (Ishikawa): categorize potential causes (people, process, technology, environment)
   - Fault tree analysis: top-down logical analysis of failure paths
   - Pareto analysis: identify 20% of causes driving 80% of incidents
   - Time-based correlation: map events chronologically

3. **Interim and permanent solutions**:
   - Define an interim workaround to limit the impact of further incidents
   - Document the workaround as a knowledge article
   - Communicate workaround to support teams
   - Develop permanent fix plan with timeline
   - Create change request for permanent fix implementation
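
As an illustration of the Pareto analysis listed among the root cause techniques, the sketch below ranks causes by incident count and keeps the most frequent ones until they cover 80% of incidents. It assumes each incident has already been assigned a single cause label; the labels and counts are invented for the example.

```python
from collections import Counter


def pareto_causes(cause_per_incident, threshold=0.8):
    """Return the most frequent causes whose cumulative share reaches `threshold`.

    Each result entry is (cause, count, cumulative_share).
    """
    counts = Counter(cause_per_incident)
    total = sum(counts.values())
    selected = []
    cumulative = 0
    for cause, count in counts.most_common():
        # Stop once the causes already selected cover the threshold.
        if cumulative / total >= threshold:
            break
        cumulative += count
        selected.append((cause, count, cumulative / total))
    return selected


# Illustrative data only: 20 incidents across 5 hypothetical causes.
causes = (
    ["query regression"] * 9
    + ["pool size"] * 5
    + ["dns"] * 3
    + ["disk"] * 2
    + ["cert expiry"] * 1
)
top = pareto_causes(causes)
print([cause for cause, _, _ in top])  # ['query regression', 'pool size', 'dns']
```

In this sample, three of five causes account for 85% of incidents, so investigation effort would go to those three first.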

### 3. Known Error Management

1. **Known Error Database (KEDB)**:
   - Document known error with: symptoms, root cause (if known), workaround, affected configuration items (CIs), status
   - Publish to service knowledge base for support team access
   - Link to related problem and incident records
   - Update as investigation progresses
   - Archive once the permanent fix is deployed and verified

2. **Workaround deployment**:
   - Implement workaround across all affected systems
   - Validate workaround effectiveness
   - Monitor for recurrence with workaround in place
   - Train support teams on workaround application
   - Review workaround regularly for continued effectiveness

3. **Known error lifecycle management**:
   - Regular review of open known errors
   - Prioritize by incident volume and business impact
   - Escalate known errors without progress to management
   - Close known errors when permanent fix confirmed
   - Track workaround vs permanent fix ratio
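
One way to picture a KEDB entry and the review prioritization described above is the sketch below. The fields mirror the list under "Known Error Database (KEDB)"; the status values, the 1 to 4 impact scale, and the `KE-n` identifiers are assumptions made for the example, not a prescribed schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    OPEN = "open"
    WORKAROUND_AVAILABLE = "workaround available"
    ARCHIVED = "archived"  # permanent fix confirmed


@dataclass
class KnownError:
    error_id: str
    symptoms: str
    affected_cis: list[str]
    root_cause: Optional[str] = None  # may be unknown during investigation
    workaround: Optional[str] = None
    status: Status = Status.OPEN
    incident_count: int = 0
    business_impact: int = 4  # hypothetical scale: 1 = highest, 4 = lowest


def review_queue(known_errors):
    """Order active known errors: highest impact first, then by incident volume."""
    active = [ke for ke in known_errors if ke.status is not Status.ARCHIVED]
    return sorted(active, key=lambda ke: (ke.business_impact, -ke.incident_count))


# Illustrative records only.
known_errors = [
    KnownError("KE-1", "checkout timeouts", ["db-01"],
               workaround="raise pool size",
               status=Status.WORKAROUND_AVAILABLE,
               incident_count=15, business_impact=2),
    KnownError("KE-2", "latency spikes", ["core-switch"],
               incident_count=9, business_impact=2),
    KnownError("KE-3", "memory leak", ["app-03"],
               status=Status.ARCHIVED, incident_count=7, business_impact=1),
    KnownError("KE-4", "report export fails", ["report-svc"],
               incident_count=2, business_impact=1),
]
print([ke.error_id for ke in review_queue(known_errors)])  # ['KE-4', 'KE-1', 'KE-2']
```

Sorting on impact before volume is a design choice; a team that weights incident volume more heavily could swap the key order.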

### 4. Permanent Resolution & Verification

1. **Fix implementation**:
   - Submit change request for permanent fix
   - Coordinate with change management for scheduling
   - Test in non-production before production deployment
   - Implement the fix through the change management process
   - Validate fix resolves root cause (not just symptom)

2. **Verification and validation**:
   - Monitor for 30 days post-fix for recurrence
   - Compare incident rates before and after fix
   - Validate fix does not introduce new issues
   - Confirm all related incidents resolved
   - Update problem record with verification evidence

3. **Problem closure**:
   - Close linked incidents if not already closed
   - Archive the known error record
   - Document lessons learned
   - Share findings with relevant teams
   - Update runbooks and procedures
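
The before-and-after comparison in the verification step can be sketched as follows, assuming equal 30-day windows on either side of the fix date. The function name and the sample dates are hypothetical.

```python
from datetime import date, timedelta


def incident_rate_change(incident_dates, fix_date, window_days=30):
    """Compare incident counts in equal windows before and after `fix_date`.

    Returns (before, after, reduction). `reduction` is a fraction of the
    before count, or None when no incidents occurred before the fix.
    """
    window = timedelta(days=window_days)
    before = sum(1 for d in incident_dates if fix_date - window <= d < fix_date)
    after = sum(1 for d in incident_dates if fix_date <= d < fix_date + window)
    reduction = (before - after) / before if before else None
    return before, after, reduction


# Illustrative data only: 12 incidents before the fix, 1 after.
fix_date = date(2025, 4, 22)
incident_dates = [date(2025, 4, 1) + timedelta(days=i) for i in range(12)]
incident_dates.append(date(2025, 5, 3))

before, after, reduction = incident_rate_change(incident_dates, fix_date)
print(before, after, f"{reduction:.1%}")  # 12 1 91.7%
```

A non-zero `after` count does not by itself mean the fix failed; the remaining incidents need checking against the root cause before the problem is closed.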

### 5. Trend Analysis & Continuous Improvement

1. **Problem trend reporting**:
   - Monthly problem trend report by category, severity, and resolution time
   - Track mean time to identify (MTTI) and mean time to resolve (MTTR)
   - Identify top problem areas by incident volume
   - Measure proactive vs reactive problem ratio
   - Track problem resolution compliance with targets

2. **Service improvement planning**:
   - Feed problem data into the Continual Service Improvement (CSI) register
   - Prioritize improvement initiatives based on problem impact
   - Track improvement initiative implementation
   - Measure improvement effectiveness
   - Close loop: verify improvements reduce problem volume

3. **Knowledge management**:
   - Update documentation with problem findings
   - Enhance monitoring based on problem patterns
   - Improve detection and early warning indicators
   - Update training materials for support teams
   - Share cross-team learnings
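
The trend metrics above can be computed from problem records along the following lines. This is a sketch under stated assumptions: MTTI is measured from problem record creation to root cause identification, and MTTR from creation to resolution. Teams define these intervals differently, so the definitions should be adjusted to match local practice. The `Problem` fields and sample dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean
from typing import Optional


@dataclass
class Problem:
    opened: date
    proactive: bool
    root_cause_found: Optional[date] = None
    resolved: Optional[date] = None


def trend_metrics(problems):
    """Return MTTI and MTTR in days plus the proactive share of problems."""
    # Assumed definitions: both intervals start at problem record creation.
    identify = [(p.root_cause_found - p.opened).days
                for p in problems if p.root_cause_found]
    resolve = [(p.resolved - p.opened).days
               for p in problems if p.resolved]
    return {
        "mtti_days": mean(identify) if identify else None,
        "mttr_days": mean(resolve) if resolve else None,
        "proactive_ratio": (sum(p.proactive for p in problems) / len(problems)
                            if problems else None),
    }


# Illustrative records only.
problems = [
    Problem(date(2025, 1, 6), False, date(2025, 1, 10), date(2025, 1, 20)),
    Problem(date(2025, 2, 3), True, date(2025, 2, 5), date(2025, 2, 13)),
    Problem(date(2025, 3, 10), False, date(2025, 3, 16)),  # still open
]
metrics = trend_metrics(problems)
print(metrics["mtti_days"], metrics["mttr_days"])  # 4 12
```

Open problems are excluded from MTTR here, which understates resolution time when long-running problems are still open; reporting the open count alongside the mean helps avoid a misleading figure.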

## Templates & Frameworks

### Problem Record Template

```
PROBLEM RECORD — [PROB-2025-0089]
===================================

SUMMARY: Intermittent database connection timeouts affecting checkout service

SEVERITY: S2 (High)
CATEGORY: Software / Database
STATUS: Investigation in progress
OWNER: Database Team — John Smith

RELATED INCIDENTS: 12 (last 30 days)
AFFECTED SERVICES: E-commerce checkout, order processing
BUSINESS IMPACT: ~5% order failure rate during peak hours

TIMELINE:
  2025-04-01 — First related incident reported
  2025-04-03 — Problem record created (recurrence pattern identified)
  2025-04-05 — Interim workaround implemented (connection pool increase)
  2025-04-10 — Workaround validated (incidents reduced by 80%)
  2025-04-15 — Root cause identified (query optimization needed)
  2025-04-20 — Permanent fix planned (CR submitted)

INTERIM WORKAROUND:
  Increased database connection pool from 50 to 100 connections
  Added connection timeout retry logic in application layer
  Effectiveness: 80% reduction in related incidents

ROOT CAUSE (Preliminary):
  Inefficient database queries causing connection pool exhaustion during peak load.
  Query plan regression introduced in v2.3 deployment (March 15).

PERMANENT FIX PLAN:
  1. Optimize affected queries and add missing indexes
  2. Implement query plan baseline enforcement
  3. Add connection pool monitoring and auto-scaling
  4. Deploy via CR-2025-0445 — scheduled April 22

TARGET RESOLUTION: April 25, 2025
```

### Problem Trend Report

```
PROBLEM MANAGEMENT TRENDS — Q1 2025
=====================================

PROBLEM STATISTICS:
  Total problems: 34
  Resolved: 28 (82.4%)
  Open: 6
  Proactive: 9 (26.5%)
  Reactive: 25 (73.5%)

MEAN TIMES:
  MTTI (Mean Time to Identify): 4.2 days
  MTTR (Mean Time to Resolve): 12.8 days
  Time to Workaround: 2.1 days

TOP PROBLEM CATEGORIES:
  1. Database connectivity: 7 problems (20.6%)
  2. Application performance: 6 problems (17.6%)
  3. Network connectivity: 5 problems (14.7%)
  4. Storage capacity: 4 problems (11.8%)
  5. Configuration errors: 4 problems (11.8%)

PROBLEM-RELATED INCIDENT REDUCTION:
  Before resolution: avg 8.3 incidents/problem
  After workaround: avg 1.6 incidents/problem (80.7% reduction)
  After permanent fix: avg 0.3 incidents/problem (96.4% reduction)

TOP IMPROVEMENT RECOMMENDATIONS:
  1. Implement query performance baseline monitoring (estimated 40% DB problem reduction)
  2. Automate connection pool scaling (estimated 60% connectivity problem reduction)
  3. Enhanced pre-deployment testing for performance regression
```

## Integration Points

- ITSM platforms (ServiceNow, Jira Service Management): Problem tracking and workflow
- Incident management: Related incident correlation
- CMDB: CI impact and dependency analysis
- Monitoring systems: Trend data and proactive detection
- Change management: Fix implementation coordination
- Knowledge management: Known error documentation and sharing
- Service desk: Workaround communication
- Vendor management: External support escalation

## Edge Cases

- **Multi-vendor problem**: Designate single problem owner for coordination; establish joint investigation timeline; manage vendor blame-shifting
- **No clear root cause after extended investigation**: Escalate to management; consider accepting the workaround as a long-term measure; allocate additional resources or external expertise
- **Fix introduces new problems**: Halt fix deployment; create new problem record; assess whether to revert to previous state
- **Business accepts risk without fix**: Document business decision; implement monitoring and alerting; set review date; track as managed risk
- **Cross-organizational problems**: Coordinate with partner organizations; define information sharing boundaries; align resolution timelines

## Output

### Problem Management Dashboard

```
PROBLEM MANAGEMENT — April 2025
================================

OPEN PROBLEMS:
  S1 (Critical): 0
  S2 (High): 2
  S3 (Medium): 3
  S4 (Low): 1

RESOLUTION TRACKING:
  Resolved this month: 8
  Mean resolution time: 11.3 days
  Workaround deployment time: 2.4 days
  On-target resolution: 7/8 (87.5%)

KNOWN ERRORS:
  Active known errors: 14
  With workarounds: 12 (86%)
  Without workarounds: 2 (escalated)
  Average workaround effectiveness: 84%

PROACTIVE VS REACTIVE:
  Proactive problems: 4 (36%)
  Reactive problems: 7 (64%)
  Target: >40% proactive

TOP RECURRING ISSUES:
  🔴 Database timeout cascade — 15 related incidents (PROB-0089)
  ⚠  Network latency spikes — 9 related incidents (PROB-0091)
  ⚠  Application memory leaks — 7 related incidents (PROB-0093)

IMPROVEMENT INITIATIVES:
  [ ] Query performance monitoring — Due: May 15
  [ ] Auto-scaling connection pools — Due: May 30
  [ ] Memory leak detection automation — Due: June 15
```

## Trigger Phrases

"problem management", "root cause analysis", "known error", "recurring incident", "problem record", "trend analysis", "permanent fix", "workaround", "KEDB", "problem owner", "problem investigation", "fault tree analysis", "5 whys", "service improvement", "problem closure"
