IT AI Skill
ITSM Service Desk
Manage IT service management and service desk operations including incident management, problem management, change management, request fulfillment, knowledge management, asset management, and ITIL compliance. Use when managing IT service requests, handling...
IT Service Management (ITSM)
Deliver and manage IT services using ITIL-aligned processes, automated workflows, and measurable SLAs.
Incident Management
Incident Lifecycle & Operations
INCIDENT MANAGEMENT FRAMEWORK:
═══════════════════════════════
ITSM PLATFORM: ServiceNow (ITSM module) + Freshservice (backup)
Service desk team: 8 analysts (L1) + 4 specialists (L2) + 2 engineers (L3)
Operating hours: 24/7 coverage (L1 staffed 8 AM - 8 PM; L2 on-call 24/7; L3 on-call)
Support channels: Email, web portal, phone, chat (Teams integration)
Average tickets/day: 85 (L1) + 15 (L2) + 3 (L3)
INCIDENT CLASSIFICATION:
┌───────────────────────┬─────────────┬────────────────┬────────────┐
│ Category              │ Count (Jan) │ Avg Resolution │ SLA Target │
├───────────────────────┼─────────────┼────────────────┼────────────┤
│ Hardware issues       │ 120         │ 2.5 hrs        │ 4 hours    │
│ Software issues       │ 185         │ 1.8 hrs        │ 4 hours    │
│ Network/connectivity  │ 45          │ 3.2 hrs        │ 2 hours    │
│ Access/authentication │ 95          │ 0.8 hrs        │ 1 hour     │
│ Application errors    │ 78          │ 2.1 hrs        │ 4 hours    │
│ Email/calendar        │ 62          │ 1.2 hrs        │ 2 hours    │
│ Printer/peripheral    │ 38          │ 1.5 hrs        │ 8 hours    │
│ Account/identity      │ 55          │ 0.6 hrs        │ 1 hour     │
│ Cloud service issues  │ 28          │ 2.8 hrs        │ 4 hours    │
│ Security incidents    │ 8           │ 1.5 hrs        │ Immediate  │
│ Other/unknown         │ 15          │ 2.0 hrs        │ 8 hours    │
├───────────────────────┼─────────────┼────────────────┼────────────┤
│ TOTAL                 │ 729         │ 1.9 hrs        │ Variable   │
└───────────────────────┴─────────────┴────────────────┴────────────┘
INCIDENT SEVERITY DEFINITIONS:
P1 — Critical:
- Entire service down (all users affected)
- Revenue-impacting outage
- Data loss or corruption
- Security breach (active)
Response: Immediate (<15 min), Update: Every 30 min, Resolution: 4 hours
P2 — High:
- Major service degradation (many users affected)
- Critical function unavailable (workaround exists)
- Department-level impact
Response: <30 min, Update: Every 1 hour, Resolution: 8 hours
P3 — Medium:
- Single user or small group affected
- Non-critical function unavailable
- Workaround available
Response: <2 hours, Update: Every 4 hours, Resolution: 24 hours
P4 — Low:
- Cosmetic issues
- Questions/inquiries
- Minor inconveniences
Response: <8 hours, Update: As needed, Resolution: 3 business days
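The response and resolution targets above translate directly into per-ticket deadlines. A minimal Python sketch, assuming a hypothetical `sla_deadlines` helper and, for simplicity, treating the P4 target of 3 business days as 3 calendar days:

```python
from datetime import datetime, timedelta

# Targets copied from the severity definitions above.
# Assumption: P4 "3 business days" is approximated as 3 calendar days.
SLA_TARGETS = {
    "P1": {"response": timedelta(minutes=15), "resolution": timedelta(hours=4)},
    "P2": {"response": timedelta(minutes=30), "resolution": timedelta(hours=8)},
    "P3": {"response": timedelta(hours=2), "resolution": timedelta(hours=24)},
    "P4": {"response": timedelta(hours=8), "resolution": timedelta(days=3)},
}

def sla_deadlines(priority: str, opened_at: datetime) -> dict:
    """Return the respond-by and resolve-by timestamps for a ticket."""
    targets = SLA_TARGETS[priority]
    return {
        "respond_by": opened_at + targets["response"],
        "resolve_by": opened_at + targets["resolution"],
    }

deadlines = sla_deadlines("P1", datetime(2025, 1, 15, 9, 0))
print(deadlines["respond_by"])  # 2025-01-15 09:15:00
print(deadlines["resolve_by"])  # 2025-01-15 13:00:00
```

A production version would need a business-calendar calculation for P4 and the L1 staffing hours; both are left out here.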
INCIDENT STATISTICS (January 2025):
Total incidents: 729
┌───────────────┬───────┬───────────┬──────────┐
│ Severity      │ Count │ SLA Met   │ SLA Miss │
├───────────────┼───────┼───────────┼──────────┤
│ P1 (Critical) │ 2     │ 2 (100%)  │ 0        │
│ P2 (High)     │ 28    │ 27 (96%)  │ 1        │
│ P3 (Medium)   │ 185   │ 178 (96%) │ 7        │
│ P4 (Low)      │ 514   │ 498 (97%) │ 16       │
├───────────────┼───────┼───────────┼──────────┤
│ TOTAL         │ 729   │ 705 (97%) │ 24 (3%)  │
└───────────────┴───────┴───────────┴──────────┘
First Contact Resolution (FCR): 68% (target: >65%) ✓
Escalation rate: 18% (L1 → L2/L3)
Mean Time to Resolve (MTTR): 1.9 hours (all incidents)
Mean Time to Acknowledge (MTTA): 18 minutes (all incidents)
Customer satisfaction (CSAT): 4.3/5.0
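The SLA compliance percentages in the January statistics can be reproduced from the raw counts. A short sketch; the `JANUARY` dictionary and the helper name are illustrative, and the figures are the ones reported above:

```python
# (total incidents, incidents resolved within SLA) per severity, January 2025
JANUARY = {"P1": (2, 2), "P2": (28, 27), "P3": (185, 178), "P4": (514, 498)}

def sla_compliance_pct(total: int, met: int) -> int:
    """Return SLA compliance as a whole-number percentage."""
    return round(100 * met / total)

per_severity = {sev: sla_compliance_pct(t, m) for sev, (t, m) in JANUARY.items()}
total = sum(t for t, _ in JANUARY.values())
met = sum(m for _, m in JANUARY.values())
overall = sla_compliance_pct(total, met)

print(per_severity)                      # {'P1': 100, 'P2': 96, 'P3': 96, 'P4': 97}
print(total, met, total - met, overall)  # 729 705 24 97
```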
ESCALATION MATRIX:
L1 → L2 (technical specialist):
Criteria: L1 cannot resolve within 30 minutes
Process: Ticket auto-rerouted + chat handoff + context transfer
Avg. time: 25 minutes (L2 acknowledgment)
L2 → L3 (engineering):
Criteria: L2 cannot resolve within 2 hours; code/config change needed
Process: Ticket created (engineering queue) + detailed notes
Avg. time: 1 hour (L3 acknowledgment)
L3 → Vendor:
Criteria: Vendor-specific issue (SaaS, hardware, cloud)
Process: Vendor ticket + internal ticket linkage
Avg. time: 4 hours (vendor acknowledgment)
L3 → Management:
Criteria: Business impact, budget needed, policy exception
Process: Verbal escalation + email summary
Avg. time: 2 hours (management response)
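The escalation criteria above can be expressed as a small routing function. This is a sketch under stated assumptions: `next_escalation` and its flag names are hypothetical, and the L2 criteria (2-hour limit, code/config change needed) are treated as alternatives, since the matrix does not say whether both must hold.

```python
from typing import Optional

def next_escalation(
    tier: str,
    minutes_open: int,
    needs_code_or_config_change: bool = False,
    vendor_specific: bool = False,
    needs_management_decision: bool = False,
) -> Optional[str]:
    """Return the next escalation target, or None to stay at the current tier."""
    if tier == "L1":
        # L1 hands off once 30 minutes pass without resolution.
        return "L2" if minutes_open >= 30 else None
    if tier == "L2":
        # Assumption: either condition alone triggers the L3 handoff.
        if minutes_open >= 120 or needs_code_or_config_change:
            return "L3"
        return None
    if tier == "L3":
        if vendor_specific:
            return "Vendor"
        if needs_management_decision:
            return "Management"
        return None
    raise ValueError(f"unknown tier: {tier}")

print(next_escalation("L1", 45))                                    # L2
print(next_escalation("L2", 60, needs_code_or_config_change=True))  # L3
print(next_escalation("L3", 0, vendor_specific=True))               # Vendor
```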
KNOWLEDGE-BASED RESOLUTION:
Knowledge base (KB): 850+ articles (ServiceNow KB)
KB utilization: 45% of tickets (KB article suggested → resolved)
Self-service resolution: 28% of tickets (user resolves via KB)
KB articles/month: 12-15 (new or updated)
KB accuracy: 92% (reviewed quarterly)
Auto-suggestion:
AI-powered: ServiceNow Virtual Agent (chatbot)
Accuracy: 78% (correct KB article suggested)
Coverage: 65% of ticket categories
Continuous learning: Feedback loop (correct/incorrect)
Change Management
Controlled Change Process
CHANGE MANAGEMENT FRAMEWORK:
════════════════════════════
CHANGE TYPES:
┌───────────────────────────┬─────────────┬────────────────────────┬───────────────┐
│ Type                      │ Count (Jan) │ Approval Required      │ Lead Time     │
├───────────────────────────┼─────────────┼────────────────────────┼───────────────┤
│ Standard (pre-approved)   │ 125         │ None                   │ Immediate     │
│   (patch deployment,      │             │                        │               │
│   user provisioning,      │             │                        │               │
│   config update)          │             │                        │               │
│ Normal (reviewed)         │ 42          │ Manager + change owner │ 24-48 hours   │
│   (new software,          │             │                        │               │
│   infrastructure change)  │             │                        │               │
│ Emergency                 │ 5           │ Emergency CAB chair    │ Immediate     │
│   (critical fix,          │             │                        │ (post-review) │
│   security patch)         │             │                        │               │
├───────────────────────────┼─────────────┼────────────────────────┼───────────────┤
│ TOTAL                     │ 172         │ Variable               │ Variable      │
└───────────────────────────┴─────────────┴────────────────────────┴───────────────┘
CHANGE PROCESS (Normal):
1. Request: Change request (CR) submitted (ServiceNow form)
- Description, scope, impact, risk level
- Implementation plan, rollback plan
- Testing evidence, back-out criteria
2. Assessment: Change manager evaluates
- Risk assessment (low, medium, high)
- Impact analysis (services, users, dependencies)
- Resource availability (team, time, budget)
- Schedule check (maintenance window conflict?)
3. Approval:
Low risk: Change manager approval (1 person)
Medium risk: Change manager + technical lead (2 people)
High risk: CAB (Change Advisory Board — 3-5 members)
4. Scheduling:
Maintenance window: Sunday 2:00 AM - 6:00 AM (primary)
Emergency window: Scheduled as needed (emergency CAB chair approval)
Business hours: Exception (CAB + management approval)
5. Implementation:
Pre-check: System health, backup verification
Execute: Follow the documented runbook and log each step as it is performed
Post-check: Health check, user validation, monitoring
6. Review:
Successful: Close + knowledge base update (if needed)
Failed: Rollback + post-incident review + RCA
7. CAB (monthly):
Review: All changes (success/failure metrics)
Discussion: Process improvement, lessons learned
Calendar: First Wednesday of each month
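The schedule check in step 2 amounts to testing whether a proposed start time falls inside the primary maintenance window from step 4. A minimal sketch, assuming local time and a hypothetical `in_primary_window` helper:

```python
from datetime import datetime, time

def in_primary_window(when: datetime) -> bool:
    """True if `when` falls in the Sunday 2:00 AM - 6:00 AM window (local time)."""
    # datetime.weekday(): Monday is 0, Sunday is 6.
    return when.weekday() == 6 and time(2, 0) <= when.time() < time(6, 0)

print(in_primary_window(datetime(2025, 1, 5, 3, 0)))  # True  (Sunday 3:00 AM)
print(in_primary_window(datetime(2025, 1, 5, 6, 0)))  # False (window has closed)
print(in_primary_window(datetime(2025, 1, 6, 3, 0)))  # False (Monday)
```

Changes that fail this check would follow the emergency or business-hours exception paths listed in step 4.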
CHANGE STATISTICS (January 2025):
Total changes: 172
┌──────────────────────┬───────┬───────┬──────┐
│ Outcome              │ Count │ %     │ Flag │
├──────────────────────┼───────┼───────┼──────┤
│ Successful           │ 162   │ 94.2% │      │
│ Failed (rollback)    │ 7     │ 4.1%  │      │
│ Failed (no rollback) │ 1     │ 0.6%  │ ⚠️   │
│ Cancelled            │ 2     │ 1.2%  │      │
├──────────────────────┼───────┼───────┼──────┤
│ TOTAL                │ 172   │ 100%  │      │
└──────────────────────┴───────┴───────┴──────┘
Change success rate: 94.2% (target: >90%) ✓
Change-related incidents: 3 (1.7% — target: <5%) ✓
Emergency changes: 5 (2.9% — target: <5%) ✓
Failed changes without rollback: 1 (investigated)
Root cause: Runbook gap (documented, fixed)
Impact: 30-minute service degradation (internal tool only)
STANDARD CHANGES (Pre-approved):
Inventory: 45 standard change templates
Examples:
- Server patch deployment (monthly cycle)
- User account creation/modification
- Firewall rule addition (approved vendor)
- SSL certificate renewal
- Backup verification (weekly)
- Monitoring threshold adjustment
- Application configuration update (low risk)
- DNS record update (internal)
- Software license renewal
- Storage expansion (<10% increase)
Auto-approval: Pre-defined criteria met → auto-approve
Auto-implementation: 28 of 45 (fully automated via Runbook Automation)
Monthly volume: ~125 (73% of all changes)
RISK ASSESSMENT MODEL:
Risk score = Impact × Likelihood × Complexity
Impact:
Low (1): Single user, non-critical system
Medium (2): Department, critical system with workaround
High (3): Enterprise-wide, revenue-impacting
Likelihood:
Low (1): Well-tested, proven change
Medium (2): New but tested in staging
High (3): Unproven, complex, tight timeline
Complexity:
Low (1): Single system, well-documented
Medium (2): Multiple systems, dependencies
High (3): Enterprise-wide, cross-team
Risk score mapping:
1-4: Low risk (auto-approve or single approval)
5-9: Medium risk (dual approval)
10-27: High risk (CAB approval)
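The risk model above is a product of three 1-3 ratings mapped onto approval bands. A short sketch of that calculation; the function names are illustrative, and the approver lists come from the approval step of the normal change process:

```python
LEVELS = {"low": 1, "medium": 2, "high": 3}

# Approvers per risk band, as listed in the normal change process.
APPROVERS = {
    "low": ["change manager"],
    "medium": ["change manager", "technical lead"],
    "high": ["CAB"],
}

def risk_score(impact: str, likelihood: str, complexity: str) -> int:
    """Risk score = Impact x Likelihood x Complexity, each rated 1-3."""
    return LEVELS[impact] * LEVELS[likelihood] * LEVELS[complexity]

def risk_band(score: int) -> str:
    """Map a score to the bands 1-4 (low), 5-9 (medium), 10-27 (high)."""
    if score <= 4:
        return "low"
    if score <= 9:
        return "medium"
    return "high"

score = risk_score("medium", "medium", "high")
print(score, risk_band(score), APPROVERS[risk_band(score)])  # 12 high ['CAB']
print(risk_band(risk_score("high", "low", "low")))           # low
```

Because each factor is 1, 2, or 3, the only reachable scores are 1, 2, 3, 4, 6, 8, 9, 12, 18, and 27, so the band edges at 5 and 10 never fall on an actual score.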
Problem Management
Root Cause Analysis & Prevention
PROBLEM MANAGEMENT FRAMEWORK:
══════════════════════════════
PROBLEM vs. INCIDENT:
Incident: Restore service (quick fix)
Problem: Find and eliminate root cause (permanent fix)
PROBLEM LIFECYCLE:
1. Identification:
- Reactive: From recurring incidents (3+ incidents with the same root cause)
- Proactive: Trend analysis, monitoring anomaly, error pattern
- Preventive: Vendor advisory, industry alert, audit finding
2. Logging:
- Problem record (ServiceNow)
- Linked incidents (all related)
- Priority assignment (based on incident severity + frequency)
3. Investigation:
- Root cause analysis (5-Why, Fishbone, Pareto)
- Data collection (logs, metrics, user reports)
- Reproduction (lab environment, if needed)
- Timeline creation (event sequence)
4. Resolution:
- Workaround (immediate, if possible)
- Permanent fix (code change, config update, process change)
- Verification (test, monitor, validate)
5. Closure:
- Known Error record (KB update)
- Change request (if fix requires change)
- Metrics update (problem stats)
- Lessons learned (team sharing)
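Reactive identification in step 1 can be automated by counting incidents per root cause and flagging any cause that reaches three occurrences. A minimal sketch; the `root_cause` field and the sample incidents are hypothetical, not real ticket data:

```python
from collections import Counter

def problem_candidates(incidents, threshold=3):
    """Return root causes seen in at least `threshold` incidents."""
    counts = Counter(i["root_cause"] for i in incidents if i.get("root_cause"))
    return {cause: n for cause, n in counts.items() if n >= threshold}

# Hypothetical incident records for illustration only.
incidents = [
    {"id": "INC001", "root_cause": "vpn-gateway"},
    {"id": "INC002", "root_cause": "vpn-gateway"},
    {"id": "INC003", "root_cause": "expired-cert"},
    {"id": "INC004", "root_cause": "vpn-gateway"},
    {"id": "INC005", "root_cause": None},  # not yet diagnosed, ignored
]

print(problem_candidates(incidents))  # {'vpn-gateway': 3}
```

Each flagged cause would then be logged as a problem record with the matching incidents linked, as described in step 2.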
PROBLEM STATISTICS (January 2025):
Total problems logged: 12
┌────────────┬───────┬──────────┐
│ Type       │ Count │ Status   │
├────────────┼───────┼──────────┤
│ Reactive   │ 8     │ 6 closed │
│ Proactive  │ 3     │ 2 closed │
│ Preventive │ 1     │ 1 closed │
├────────────┼───────┼──────────┤
│ TOTAL      │ 12    │ 9 closed │
└────────────┴───────┴──────────┘
Open problems: 3 (2 investigating, 1 awaiting vendor fix)
Mean time to resolve: 5.2 days (target: <7 days) ✓
Known errors created: 8 (KB updated)
Recurring incidents reduced: 35% (vs. previous quarter)
ROOT CAUSE ANALYSIS (5-Why Example):
Problem: API latency spike (daily, 2 PM - 3 PM)
Why 1: Why does latency spike? → Slow database queries
Why 2: Why are the queries slow? → Missing index on a large table
Why 3: Why is the index missing? → A schema change added a new column without an index
Why 4: Why was no index added? → The developer did not know one was needed
Why 5: Why did the developer not know? → No database design review process for schema changes
Root cause: Lack of database design review in CI/CD pipeline
Permanent fix: Add database migration review step (CI pipeline)
Workaround: Manual index creation (temporary, same day)
Verification: Latency normalized, no recurrence (2 weeks monitoring)
TREND ANALYSIS:
Top recurring issues (January):
1. API timeout (8 incidents → 1 problem → resolved)
2. Email delivery failure (5 incidents → 1 problem → resolved)
3. VPN connectivity (12 incidents → 1 problem → workaround applied)
4. Printer issues (15 incidents → no problem record opened; user training needed)
5. Login failures (6 incidents → 1 problem → MFA policy update)
Trend tracking:
Monthly: Incident categorization + frequency analysis
Quarterly: Problem trend report + improvement plan
Annually: Service improvement plan (SIP) + investment plan
Request Fulfillment
Service Catalog & User Requests
SERVICE CATALOG:
════════════════
REQUEST TYPES (ServiceNow Service Catalog):
┌────────────────────────┬─────────────┬───────────┬──────────┐
│ Request Type           │ Count (Jan) │ Automated │ Avg Time │
├────────────────────────┼─────────────┼───────────┼──────────┤
│ New hardware/laptop    │ 12          │ 60%       │ 3 days   │
│ Software installation  │ 28          │ 75%       │ 4 hours  │
│ Access request         │ 85          │ 80%       │ 2 hours  │
│ Password reset         │ 45          │ 95%       │ 5 min    │
│ Account unlock         │ 22          │ 90%       │ 10 min   │
│ Cloud resource request │ 15          │ 50%       │ 1 day    │
│ Meeting room booking   │ 68          │ 100%      │ Instant  │
│ IT training request    │ 8           │ 30%       │ 2 days   │
│ Equipment return       │ 10          │ 40%       │ 1 day    │
│ Vendor access request  │ 5           │ 20%       │ 3 days   │
│ Other/general inquiry  │ 32          │ 25%       │ 4 hours  │
├────────────────────────┼─────────────┼───────────┼──────────┤
│ TOTAL                  │ 330         │ 63%       │ 4.2 hrs  │
└────────────────────────┴─────────────┴───────────┴──────────┘
Service catalog items: 45 (standardized)
Self-service rate: 63% (automated fulfillment)
Manual fulfillment: 37% (complex requests, approvals)
Request satisfaction: 4.5/5.0
Average fulfillment time: 4.2 hours (all requests)
FULFILLMENT WORKFLOW:
1. User submits request (ServiceNow portal / chatbot)
2. Auto-validation (eligibility, policy check, approval matrix)
3. Approval (if required — manager, IT, security)
4. Fulfillment (automated runbook or manual technician)
5. Confirmation (email/chat notification)
6. CSAT survey (post-fulfillment, 48 hours)
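The six workflow steps above form a linear pipeline in which only the approval step is conditional. A small sketch of that ordering, with hypothetical step labels and a hypothetical `workflow_path` helper:

```python
from typing import List

def workflow_path(approval_required: bool) -> List[str]:
    """Return the ordered fulfillment steps for a request."""
    steps = ["submit", "auto-validate"]
    if approval_required:
        # Step 3 runs only when the approval matrix calls for it.
        steps.append("approve")
    steps += ["fulfill", "confirm", "csat-survey"]
    return steps

print(workflow_path(approval_required=False))
# ['submit', 'auto-validate', 'fulfill', 'confirm', 'csat-survey']
print(workflow_path(approval_required=True))
# ['submit', 'auto-validate', 'approve', 'fulfill', 'confirm', 'csat-survey']
```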
SLA PERFORMANCE:
┌──────────────────────────┬─────────┬─────────┐
│ Metric                   │ Target  │ Actual  │
├──────────────────────────┼─────────┼─────────┤
│ Incident resolution      │ 95%     │ 97%     │
│ Request fulfillment      │ 90%     │ 93%     │
│ Change success rate      │ 90%     │ 94.2%   │
│ First contact resolution │ 65%     │ 68%     │
│ CSAT score               │ 4.0/5.0 │ 4.3/5.0 │
│ Knowledge base accuracy  │ 90%     │ 92%     │
├──────────────────────────┼─────────┼─────────┤
│ Overall                  │ Pass    │ Pass    │
└──────────────────────────┴─────────┴─────────┘
All SLAs met (January 2025) ✓
Trend: Improving (consistent over 6 months)
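The overall pass/fail result can be derived by comparing each actual value against its target. A sketch using the January figures; it assumes a metric passes when the actual value is at or above target, which the table does not state explicitly:

```python
# (target, actual) pairs from the SLA performance table, January 2025
SCORECARD = {
    "Incident resolution": (95.0, 97.0),
    "Request fulfillment": (90.0, 93.0),
    "Change success rate": (90.0, 94.2),
    "First contact resolution": (65.0, 68.0),
    "CSAT score": (4.0, 4.3),
    "Knowledge base accuracy": (90.0, 92.0),
}

def evaluate(scorecard):
    """Return a pass/fail flag per metric."""
    # Assumption: a metric passes when actual is at or above target.
    return {name: actual >= target for name, (target, actual) in scorecard.items()}

results = evaluate(SCORECARD)
overall = "Pass" if all(results.values()) else "Fail"
print(overall)  # Pass
```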
Output
ITSM Operations Dashboard
ITSM OPERATIONS DASHBOARD — Jan 2025
══════════════════════════════════
Service Desk:
Team: 8 (L1) + 4 (L2) + 2 (L3)
Tickets/day: 85 (L1) + 15 (L2) + 3 (L3)
Channels: Email, web, phone, chat (Teams)
CSAT: 4.3/5.0
Incidents:
Total: 729
SLA compliance: 97% (705/729)
P1 incidents: 2 (100% SLA met)
MTTR: 1.9 hours
MTTA: 18 minutes
FCR: 68% (target: >65%) ✓
Changes:
Total: 172
Success rate: 94.2% (target: >90%) ✓
Emergency: 5 (2.9% — target: <5%) ✓
Standard: 125 (73% — pre-approved)
Change-related incidents: 3 (1.7% — target: <5%) ✓
Problems:
Total logged: 12
Closed: 9 (75%)
Open: 3 (investigation/vendor)
MTTR: 5.2 days (target: <7 days) ✓
Known errors: 8 (KB updated)
Requests:
Total: 330
Auto-fulfillment: 63%
Avg. fulfillment: 4.2 hours
Satisfaction: 4.5/5.0
Self-service (KB): 28%
Knowledge Base:
Articles: 850+
Utilization: 45% of tickets
Accuracy: 92% (reviewed quarterly)
New/updated: 12-15/month
Actions:
1. Close 3 open problems (2 under investigation, 1 awaiting vendor fix)
2. KB article expansion (low-coverage categories)
3. Standard change automation (28 → 35 target)
4. FCR improvement (68% → 72% target)
5. CAB review (monthly — first Wednesday)
Integration Points
- ITSM platforms (ServiceNow, Freshservice, Jira Service Management): Ticketing, workflows
- CMDB (ServiceNow CMDB): Configuration items, relationships, impact analysis
- Monitoring (Datadog, Prometheus): Alert-to-incident automation
- Automation (ServiceNow Flow, Ansible, Runbook Automation): Self-healing, fulfillment
- Communication (Teams, Slack, email): Notifications, updates, CSAT surveys
- Identity (Okta, Microsoft Entra ID, formerly Azure AD): User lookup, access management
- Asset management (ServiceNow SAM, Snipe-IT): Hardware, software inventory
- Knowledge base (ServiceNow KB, Confluence): Articles, KB suggestions
- HRIS (Rippling, Workday): Employee lifecycle trigger (onboarding/offboarding IT)
- Vendor management (ServiceNow, Procure): Vendor tickets, SLA tracking
- Reporting (ServiceNow reports, Power BI): Dashboards, analytics
- Change management (ServiceNow CAB Workbench): Change advisory, scheduling
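As an illustration of the alert-to-incident integration, the sketch below maps a monitoring alert to an incident payload. The field names `short_description`, `description`, `impact`, and `urgency` are standard columns on ServiceNow's incident table, where 1 is the highest value; the alert dictionary shape, the severity mapping, and the sample alert are hypothetical. The code only builds the payload and does not send it anywhere.

```python
def alert_to_incident(alert: dict) -> dict:
    """Build an incident payload from a monitoring alert (hypothetical shape)."""
    # Hypothetical mapping from alert severity to (impact, urgency).
    severity_map = {"critical": ("1", "1"), "warning": ("2", "2")}
    impact, urgency = severity_map.get(alert.get("severity"), ("3", "3"))
    return {
        "short_description": f"[{alert['source']}] {alert['title']}",
        "description": alert.get("message", ""),
        "impact": impact,
        "urgency": urgency,
    }

# Illustrative alert, not real monitoring output.
example_alert = {
    "source": "Datadog",
    "title": "API latency above threshold",
    "severity": "critical",
    "message": "Latency exceeded the configured threshold.",
}
payload = alert_to_incident(example_alert)
print(payload["short_description"])  # [Datadog] API latency above threshold
```

Keeping the mapping in one function makes it easier to review how alert severities translate into incident priority before any automation is switched on.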
Edge Cases
- Major incident (enterprise-wide outage): War room activation; communication cadence; cross-team coordination; post-incident review
- Repeated incident (no permanent fix): Problem escalation; vendor engagement; workaround optimization; permanent fix timeline
- Emergency change (no CAB available): Emergency CAB chair; post-review; documentation; risk acceptance
- Change rollback failure: Blast radius containment; manual recovery; root cause; process improvement
- SLA breach (impending): Proactive communication; escalation; workaround; customer notification
- Knowledge gap (no KB article): Immediate article creation; peer review; publish; prevention
- Service catalog overload (high volume): Auto-fulfillment expansion; queue management; staffing adjustment
- Vendor dependency (extended resolution): Vendor escalation; contract review; workaround; alternative
- Security incident (ITSM + SOC): Dual workflow; information sharing; coordinated response; documentation
- System migration (ITSM platform): Data migration; process validation; training; parallel run