---
name: sla-management
description: Define, monitor, and optimize service level agreements across support operations including SLA design, performance tracking, escalation management, breach prevention, and continuous SLA improvement. Use when creating SLAs, monitoring SLA compliance, managing escalations, analyzing SLA performance, or renegotiating SLA terms. Triggers on phrases like "SLA", "service level agreement", "SLA compliance", "response time", "resolution time", "escalation management", "SLA breach", "service credits", "SLA reporting".
---

# SLA Management & Service Quality

Design, track, and optimize service level agreements to ensure consistent service quality and customer satisfaction.

## Workflow

### 1. SLA Design & Definition

1. **Service level target setting**:
   - Priority/severity level definition (P1-Critical, P2-High, P3-Medium, P4-Low)
   - Response time targets by priority (first response to customer)
   - Resolution time targets by priority (time to resolve)
   - Update frequency requirements (status updates during resolution)
   - Business hours vs 24/7 coverage definition

2. **SLA scope and applicability**:
   - Services and products covered under SLA
   - Customer tier applicability (enterprise, premium, standard)
   - Geographic coverage and timezone considerations
   - Exclusions and out-of-scope items
   - Force majeure and exception clauses

3. **SLA measurement framework**:
   - Clock start/stop rules (when timer starts, pauses, stops)
   - Business calendar definition (holidays, maintenance windows)
   - Customer acknowledgment requirements
   - Duplicate and related ticket handling
   - SLA calculation methodology (calendar vs business time)
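
The clock rules and the calendar-versus-business-time choice above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the Monday-to-Friday 09:00-17:00 window, the single holiday, and the function names `business_minutes` and `sla_elapsed_minutes` are all hypothetical, timestamps are naive local datetimes, and pause intervals are assumed not to overlap.

```python
from datetime import date, datetime, time, timedelta

# Hypothetical business calendar: Mon-Fri, 09:00-17:00, one holiday.
WORK_START = time(9, 0)
WORK_END = time(17, 0)
HOLIDAYS = {date(2025, 4, 18)}


def business_minutes(start: datetime, end: datetime) -> float:
    """Minutes between start and end that fall inside business hours."""
    total = 0.0
    day = start.date()
    while day <= end.date():
        if day.weekday() < 5 and day not in HOLIDAYS:
            # Clip the interval to this day's working window.
            window_open = max(start, datetime.combine(day, WORK_START))
            window_close = min(end, datetime.combine(day, WORK_END))
            if window_close > window_open:
                total += (window_close - window_open).total_seconds() / 60
        day += timedelta(days=1)
    return total


def sla_elapsed_minutes(start: datetime, end: datetime, pauses=()) -> float:
    """Business minutes on the SLA clock, excluding paused intervals."""
    paused = sum(
        business_minutes(max(p_start, start), min(p_end, end))
        for p_start, p_end in pauses
    )
    return business_minutes(start, end) - paused


# Ticket opened Thursday 16:00, resolved Monday 10:00, with a 30-minute
# "awaiting customer" pause on Monday morning.
opened = datetime(2025, 4, 17, 16, 0)
resolved = datetime(2025, 4, 21, 10, 0)
pause = (datetime(2025, 4, 21, 9, 0), datetime(2025, 4, 21, 9, 30))
print(sla_elapsed_minutes(opened, resolved, [pause]))  # 90.0
```

In this example, calendar time would count almost four days, while business time counts only two working hours minus the pause. This is why the calculation methodology needs to be fixed before targets are agreed.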

### 2. SLA Monitoring & Tracking

1. **Real-time SLA monitoring**:
   - Live SLA dashboard for all active tickets
   - Breach prediction and early warning alerts
   - At-risk ticket identification and prioritization
   - Agent SLA performance tracking
   - Queue-level SLA compliance monitoring

2. **Automated SLA enforcement**:
   - Auto-assignment based on SLA and agent capacity
   - Priority adjustment based on SLA risk
   - Automated escalation when SLA threshold approached
   - Customer notification on SLA status
   - Internal notification for SLA breach prevention
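
One simple way to drive at-risk identification and threshold-based escalation is to track the fraction of the SLA target already consumed. The sketch below assumes a 75% warning threshold and hypothetical names (`sla_consumed`, `sla_risk`, `most_urgent_first`). It uses plain calendar time and ignores clock pauses, so it illustrates the idea rather than a production SLA engine.

```python
from datetime import datetime, timedelta

WARN_AT = 0.75  # hypothetical: flag a ticket once 75% of the target is used


def sla_consumed(created: datetime, now: datetime, target: timedelta) -> float:
    """Fraction of the SLA target elapsed so far (1.0 means breached)."""
    return (now - created) / target  # timedelta / timedelta yields a float


def sla_risk(created: datetime, now: datetime, target: timedelta,
             warn_at: float = WARN_AT) -> str:
    """Classify a ticket as on_track, at_risk, or breached."""
    consumed = sla_consumed(created, now, target)
    if consumed >= 1.0:
        return "breached"
    if consumed >= warn_at:
        return "at_risk"
    return "on_track"


def most_urgent_first(tickets, now: datetime):
    """Sort open tickets by time remaining before breach, least first."""
    return sorted(tickets, key=lambda t: t["created"] + t["target"] - now)


# A P1 ticket with a 4-hour resolution target, checked 3h15m after creation.
created = datetime(2025, 4, 1, 10, 0)
print(sla_risk(created, datetime(2025, 4, 1, 13, 15), timedelta(hours=4)))
```

The same classification can feed both the dashboard (at-risk list) and the automation rules (escalate or notify when a ticket moves to `at_risk`).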

3. **SLA reporting and analytics**:
   - Daily SLA compliance report
   - Weekly SLA performance trend analysis
   - Monthly SLA compliance by agent, team, and category
   - Quarterly SLA review with stakeholders
   - Annual SLA assessment and adjustment
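
The compliance figures in these reports are grouped percentages. As a rough sketch, assuming each ticket is a dict with a boolean `within_sla` field plus grouping fields such as `team` (both field names are hypothetical), the per-group rate and the status marker could be derived like this:

```python
from collections import defaultdict


def compliance_by(tickets, key):
    """Return {group: percent of tickets within SLA}, rounded to 0.1."""
    totals = defaultdict(int)
    met = defaultdict(int)
    for ticket in tickets:
        group = ticket[key]
        totals[group] += 1
        if ticket["within_sla"]:
            met[group] += 1
    return {g: round(100 * met[g] / totals[g], 1) for g in totals}


def status_marker(actual_pct: float, target_pct: float) -> str:
    """Check mark when the actual rate exceeds the target, warning otherwise."""
    return "✓" if actual_pct > target_pct else "⚠"


sample = [
    {"team": "Alpha", "priority": "P1", "within_sla": True},
    {"team": "Alpha", "priority": "P2", "within_sla": True},
    {"team": "Alpha", "priority": "P2", "within_sla": False},
    {"team": "Delta", "priority": "P3", "within_sla": True},
]
for team, pct in compliance_by(sample, "team").items():
    print(team, pct, status_marker(pct, 92.0))
```

Passing `"priority"` or a category field as `key` gives the other breakdowns listed above without changing the function.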

### 3. Escalation Management

1. **Escalation framework**:
   - Technical escalation path (L1 → L2 → L3 → vendor)
   - Management escalation path (agent → team lead → manager → director)
   - Customer escalation handling (customer-requested escalation)
   - Escalation criteria by SLA breach imminence
   - Escalation response time requirements

2. **Escalation execution**:
   - Automated escalation trigger on SLA breach prediction
   - Escalation handoff with full context transfer
   - Escalation acknowledgment and acceptance tracking
   - Escalation resolution and feedback loop
   - Escalation documentation and knowledge capture
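
The escalation paths and the handoff/acknowledgment tracking described above can be modeled with a small record type. This is a hedged sketch: the `Escalation` fields and the `next_level` helper are hypothetical names chosen for illustration, and a real helpdesk platform would supply its own escalation objects.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

# Paths as listed in the escalation framework.
TECHNICAL_PATH = ["L1", "L2", "L3", "vendor"]
MANAGEMENT_PATH = ["agent", "team lead", "manager", "director"]


def next_level(path: List[str], current: str) -> Optional[str]:
    """Next step on an escalation path, or None at the top of the path."""
    idx = path.index(current)
    return path[idx + 1] if idx + 1 < len(path) else None


@dataclass
class Escalation:
    ticket_id: str
    from_level: str
    to_level: str
    reason: str               # e.g. "SLA breach predicted"
    context: str              # summary handed over with the ticket
    raised_at: datetime
    acknowledged_at: Optional[datetime] = None

    def ack_delay_minutes(self) -> Optional[float]:
        """Minutes from escalation to acknowledgment, None if still pending."""
        if self.acknowledged_at is None:
            return None
        return (self.acknowledged_at - self.raised_at).total_seconds() / 60


esc = Escalation(
    ticket_id="T-1001",
    from_level="L1",
    to_level=next_level(TECHNICAL_PATH, "L1"),
    reason="SLA breach predicted",
    context="Steps tried so far and customer impact summary",
    raised_at=datetime(2025, 4, 1, 10, 0),
)
esc.acknowledged_at = datetime(2025, 4, 1, 10, 12)
print(esc.to_level, esc.ack_delay_minutes())
```

Recording the acknowledgment timestamp separately makes it possible to measure escalation response time against its own requirement.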

3. **Escalation prevention**:
   - First contact resolution improvement
   - Agent skill development for faster resolution
   - Knowledge base improvement for self-service
   - Proactive issue identification and resolution
   - Root cause analysis to prevent repeat escalations

### 4. SLA Breach Management

1. **Breach response and recovery**:
   - Immediate notification to stakeholders
   - Customer communication and apology (if customer-visible)
   - Root cause investigation and documentation
   - Corrective action plan development
   - Service credit or compensation processing

2. **Breach analysis and prevention**:
   - Breach categorization (staffing, process, system, complexity)
   - Breach frequency and trend analysis
   - Breach impact assessment (customer satisfaction, revenue)
   - Recurring breach pattern identification
   - Process improvement recommendations
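
Breach categorization and frequency analysis reduce to counting breaches per root-cause label. A minimal sketch, assuming each breach has already been tagged with one category string (the `breach_breakdown` name and the sample labels are illustrative):

```python
from collections import Counter


def breach_breakdown(causes):
    """Return (cause, count, share_pct) tuples, most frequent cause first."""
    counts = Counter(causes)
    total = sum(counts.values())
    return [
        (cause, n, round(100 * n / total, 1))
        for cause, n in counts.most_common()
    ]


tagged = ["staffing", "system", "staffing", "process", "system", "staffing"]
for cause, n, share in breach_breakdown(tagged):
    print(f"{cause}: {n} breaches ({share}%)")
```

Running the same breakdown per month, and comparing the results, gives a simple view of recurring breach patterns.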

3. **Service credit management**:
   - Service credit calculation based on SLA terms
   - Credit approval and authorization workflow
   - Credit application to customer account
   - Credit cost tracking and analysis
   - Credit abuse prevention
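
Service credit calculation can be automated once the contract terms are expressed as data. The sketch below assumes banded tiers, where each uptime threshold maps to a fixed percentage of the monthly fee. The tier values, the cap, and the `service_credit` name are illustrative assumptions. Actual contracts may define credits differently (for example, scaling per increment below target), and the contract wording governs.

```python
# Hypothetical banded tiers: (uptime threshold in %, credit in % of fee).
# A tier applies when measured uptime is below its threshold.
CREDIT_TIERS = [(99.0, 50.0), (99.5, 25.0), (99.9, 10.0)]


def service_credit(uptime_pct: float, monthly_fee: float,
                   tiers=CREDIT_TIERS, cap_pct: float = 100.0) -> float:
    """Credit owed for the month, using the lowest tier that was missed."""
    credit_pct = 0.0
    for threshold, pct in sorted(tiers):  # lowest threshold first
        if uptime_pct < threshold:
            credit_pct = pct
            break
    return round(monthly_fee * min(credit_pct, cap_pct) / 100, 2)


for uptime in (99.94, 99.7, 99.2, 98.0):
    print(uptime, service_credit(uptime, monthly_fee=1000.0))
```

Keeping the tiers in a table rather than in code makes the approval workflow easier to audit, since the calculation inputs can be attached to each credit request.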

### 5. SLA Continuous Improvement

1. **SLA target optimization**:
   - Historical performance analysis vs SLA targets
   - Customer expectation vs actual SLA comparison
   - Industry benchmark comparison
   - Target adjustment recommendation
   - Customer communication on SLA changes
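
Historical performance can be compared against a target in two directions: how often a given target would have been met, and what target would have delivered a given compliance rate. The following is a sketch under simple assumptions (resolution times in hours, a ticket counts as compliant when its time is less than or equal to the target; `attainment_pct` and `target_for` are hypothetical names):

```python
import math


def attainment_pct(durations_hours, target_hours):
    """Percent of historical tickets that would have met the target."""
    met = sum(1 for d in durations_hours if d <= target_hours)
    return round(100 * met / len(durations_hours), 1)


def target_for(durations_hours, goal_pct):
    """Smallest observed duration that would have met goal_pct compliance."""
    ordered = sorted(durations_hours)
    needed = math.ceil(goal_pct * len(ordered) / 100)  # tickets that must pass
    return ordered[max(needed, 1) - 1]


history = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # illustrative resolution times
print(attainment_pct(history, 8))   # share of tickets resolved within 8 hours
print(target_for(history, 90))      # target needed for 90% compliance
```

A recommendation to adjust a target is better supported when both numbers are reported: the current attainment and the target implied by the desired compliance rate.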

2. **Performance improvement initiatives**:
   - SLA compliance improvement projects
   - Process optimization for faster resolution
   - Technology and tool enhancement
   - Agent training and certification
   - Knowledge base expansion for resolution speed

3. **SLA review and renegotiation**:
   - Annual SLA review with key customers
   - SLA adjustment based on service evolution
   - New service level tier creation
   - Contract renewal SLA discussion
   - Competitive SLA benchmarking

## Templates & Frameworks

### SLA Matrix

```
SERVICE LEVEL AGREEMENT MATRIX
===============================

PRIORITY DEFINITIONS:
  P1 (Critical): System down, no workaround, major business impact, multiple users affected
  P2 (High): System degraded, limited workaround, significant business impact
  P3 (Medium): System functional with issues, workaround available, moderate impact
  P4 (Low): Minor issue, cosmetic, informational, minimal impact

RESPONSE & RESOLUTION TARGETS:

  Priority | Response | Update   | Resolution      | Coverage
  ---------|----------|----------|-----------------|---------------
  P1       | 15 min   | 30 min   | 4 hours         | 24/7
  P2       | 1 hour   | 2 hours  | 8 hours         | 24/7
  P3       | 4 hours  | 8 hours  | 3 business days | Business hours
  P4       | 8 hours  | 24 hours | 5 business days | Business hours

SERVICE AVAILABILITY TARGETS (downtime budgets assume a 30-day month):
  Core services: 99.9% uptime (max 43 min downtime/month)
  Standard services: 99.5% uptime (max 3.6 hours downtime/month)
  Non-critical services: 99.0% uptime (max 7.2 hours downtime/month)

SLA CLOCKS:
  Start: Ticket created and assigned
  Pause: Awaiting customer response, scheduled maintenance, vendor dependency (with notification)
  Resume: Customer response received, maintenance complete, vendor resolution
  Stop: Ticket resolved and confirmed by customer

SERVICE CREDIT TERMS:
  Core services 99.5% to below 99.9%: 10% monthly credit
  Core services 99.0% to below 99.5%: 25% monthly credit
  Core services below 99.0%: 50% monthly credit
  Maximum credit: 100% of monthly fee
```
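
The downtime allowances that go with availability targets follow directly from the uptime percentage and the length of the measurement period. A small sketch of the arithmetic, assuming a 30-day month by default (the function name `downtime_budget_minutes` is illustrative):

```python
def downtime_budget_minutes(uptime_pct: float, days: int = 30) -> float:
    """Maximum downtime, in minutes, allowed by an uptime target."""
    minutes_in_period = days * 24 * 60
    return (100.0 - uptime_pct) / 100.0 * minutes_in_period


for target in (99.9, 99.5, 99.0):
    minutes = downtime_budget_minutes(target)
    print(f"{target}% uptime: {minutes:.1f} min ({minutes / 60:.1f} hours)")
```

For a 30-day month this gives about 43.2 minutes at 99.9%, 3.6 hours at 99.5%, and 7.2 hours at 99.0%. Using an average month of roughly 730 hours instead shifts the figures slightly (about 7.3 hours at 99.0%), so the agreement should state which period length applies.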

### SLA Breach Root Cause Template

```
SLA BREACH ANALYSIS — [Ticket ID, Date]
=========================================

INCIDENT OVERVIEW:
  Ticket ID: [ID]
  Priority: [P1/P2/P3/P4]
  SLA target: [X hours]
  Actual resolution time: [Y hours]
  Breach duration: [Z hours]
  Customer impact: [description]

ROOT CAUSE ANALYSIS:
  Primary cause: [staffing gap / process gap / system issue / knowledge gap / complexity]
  Contributing factors: [list]
  Timeline:
    [time] — Ticket created
    [time] — Initial response (SLA: X min, Actual: Y min)
    [time] — Escalation triggered
    [time] — Resolution attempted
    [time] — Resolution confirmed

CORRECTIVE ACTIONS:
  Immediate: [actions taken to prevent immediate recurrence]
  Short-term: [process/tool improvement — within 30 days]
  Long-term: [systemic improvement — within 90 days]

PREVENTION STRATEGY:
  Knowledge base article created/updated: [yes/no, link]
  Process change implemented: [yes/no, description]
  Training delivered: [yes/no, topic]
  Tool/system enhancement: [yes/no, description]

SERVICE CREDIT:
  Credit authorized: [yes/no]
  Credit amount: [percentage or value]
  Customer communication: [completed date]

LESSONS LEARNED:
  [Key takeaways and organizational learning]
```

## Integration Points

- ITSM/helpdesk platforms (ServiceNow, Zendesk, Jira Service Management): SLA engine and tracking
- Monitoring tools (Datadog, PagerDuty, Uptime Robot): Service availability tracking
- CRM platforms: Customer tier and contract SLA terms
- Communication platforms (Slack, Teams, SMS): Escalation notifications
- Reporting platforms (Tableau, Power BI, Looker): SLA analytics dashboards
- CMDB: Service and dependency mapping for impact assessment
- Vendor management platforms: Third-party SLA tracking
- Billing platforms: Service credit application

## Edge Cases

- **Multiple simultaneous P1 incidents**: Activate an incident commander; prioritize by business impact; reallocate resources; notify executives; coordinate customer communication
- **SLA breach due to vendor dependency**: Vendor escalation per vendor SLA; internal customer communication; compensatory action documentation; vendor performance review
- **Contract renewal with aggressive SLA demands**: Historical performance review; realistic target setting; tiered SLA proposal; penalty/credit negotiation
- **Global operations with varying SLA expectations**: Region-specific SLA definitions; timezone-aware SLA clocks; regional performance reporting
- **SLA compliance conflict with quality**: Balance speed vs quality metrics; customer satisfaction correlation analysis; resolution quality tracking; agent burnout prevention
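
For the global-operations case, a timezone-aware clock comes down to converting a UTC timestamp into the region's local time before applying business hours. The sketch below uses fixed UTC offsets to stay self-contained. The region names, offsets, and the `in_regional_business_hours` helper are assumptions for illustration, and fixed offsets ignore daylight saving time, so a real deployment would more likely use IANA zones (for example through Python's `zoneinfo` module).

```python
from datetime import datetime, time, timedelta, timezone

# Hypothetical regions with fixed UTC offsets (no daylight saving handling).
REGION_OFFSETS = {
    "EMEA": timezone(timedelta(hours=1)),
    "APAC": timezone(timedelta(hours=8)),
}


def in_regional_business_hours(ts_utc: datetime, region: str,
                               open_at: time = time(9, 0),
                               close_at: time = time(17, 0)) -> bool:
    """True if a timezone-aware timestamp falls in the region's working day."""
    local = ts_utc.astimezone(REGION_OFFSETS[region])
    return local.weekday() < 5 and open_at <= local.time() < close_at


# Wednesday 07:30 UTC is 08:30 in the EMEA offset and 15:30 in the APAC one.
ts = datetime(2025, 4, 16, 7, 30, tzinfo=timezone.utc)
print(in_regional_business_hours(ts, "EMEA"))  # False
print(in_regional_business_hours(ts, "APAC"))  # True
```

The same ticket can therefore be inside the SLA clock for one region and outside it for another, which is why region-specific definitions need to be explicit.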

## Output

### SLA Performance Dashboard

```
SLA PERFORMANCE — April 2025
===============================

OVERALL SLA COMPLIANCE:
  Total tickets: 3,420
  Within SLA: 3,098 (90.6%) ⚠ (target: >92%)
  Breached: 322 (9.4%)
  Trend: ↓ 1.2 percentage points from last month

COMPLIANCE BY PRIORITY:
  P1 (Critical): 89.1% compliance ⚠ (target: >95%)
  P2 (High): 93.4% compliance ✓ (target: >92%)
  P3 (Medium): 94.7% compliance ✓ (target: >90%)
  P4 (Low): 96.2% compliance ✓ (target: >90%)

RESPONSE TIME PERFORMANCE:
  Avg P1 response: 12 min ✓ (target: <15 min)
  Avg P2 response: 48 min ✓ (target: <60 min)
  Avg P3 response: 3.2 hours ✓ (target: <4 hours)
  Avg P4 response: 5.8 hours ✓ (target: <8 hours)

RESOLUTION TIME PERFORMANCE:
  Avg P1 resolution: 3.1 hours ✓ (target: <4 hours)
  Avg P2 resolution: 6.8 hours ✓ (target: <8 hours)
  Avg P3 resolution: 2.4 days ✓ (target: <3 days)
  Avg P4 resolution: 3.8 days ✓ (target: <5 days)

ESCALATION METRICS:
  Escalations this month: 87
  Escalation-to-resolution: 4.2 hours avg
  Preventable escalations: 23 (26%)
  Customer-requested escalations: 12

SERVICE AVAILABILITY:
  Core services: 99.94% uptime ✓
  Standard services: 99.67% uptime ✓
  Non-critical: 99.31% uptime ✓
  Maintenance windows: 4 (all within schedule)

BREACH ANALYSIS:
  Breach root causes: Staffing (34%), System issue (28%), Complexity (22%), Process (16%)
  Recurring breach areas: Database issues (12%), API integration (8%)
  Service credits issued: $4,200 (0.8% of monthly revenue)

TEAM PERFORMANCE:
  Top performer: Team Alpha — 97.2% SLA compliance
  Needs improvement: Team Delta — 84.1% SLA compliance
  First contact resolution: 67% ⚠ (target: >70%)
```
