Support AI Skill
Quality Assurance Automation
Automate quality assurance scoring of support interactions using AI to evaluate agent performance, identify coaching opportunities, and ensure consistent service quality. Use when implementing automated QA, AI scoring of interactions, quality calibration, c...
Quality Assurance (QA) Automation
Automate the evaluation of support interactions using AI — scaling QA from sampling 2–5% of interactions to evaluating 100%, providing consistent scoring and actionable coaching insights.
Workflow
- Define quality scorecard with weighted criteria aligned to company values.
- Select QA automation platform (built-in or third-party).
- Train AI model on historical interactions with manual QA scores as ground truth.
- Calibrate AI scores against human QA scores (target: 85%+ agreement).
- Deploy automated scoring on all new interactions.
- Review AI scores with human QA auditor (calibration loop).
- Generate agent scorecards, team dashboards, and coaching recommendations.
- Identify systemic quality issues and process improvements.
- Continuously recalibrate AI model with new manual evaluations.
QA Scorecard Design
AUTOMATED QA SCORECARD
========================
Scorecard Categories (weighted):
════════════════════════════════════════════════════════════════════════
Category | Weight | Criteria | AI Evaluation Method
════════════════════════════════════════════════════════════════════════
Empathy & Tone | 20% | Warm, professional, customer-focused language | NLP sentiment analysis + empathy keyword detection
Process Compliance | 20% | Followed required steps, used correct templates | Rule-based: checklist verification, form completion
Knowledge Accuracy | 20% | Information provided is correct and complete | LLM comparison against knowledge base / playbook
Resolution Quality | 20% | Issue fully resolved, clear next steps | Outcome analysis: re-contact rate, resolution confirmation, CSAT score
Communication Clarity | 10% | Clear, concise, well-structured response | Readability score, formatting, jargon detection
Proactivity | 10% | Anticipated follow-up questions/needs | LLM analysis: suggested related solutions, asked clarifying questions
════════════════════════════════════════════════════════════════════════
Total: 100%
SCORING SCALE:
→ Each criterion scored 1–5:
1 = Critical Failure (must be coached immediately)
2 = Below Expectations (coaching needed)
3 = Meets Expectations (acceptable)
4 = Exceeds Expectations (good practice)
5 = Exceptional (potential training material)
→ Overall score: Weighted average across all categories (1–5), reported as a percentage (weighted average ÷ 5 × 100)
→ Score bands:
90–100% = Star performer
80–89% = Strong performer
70–79% = Meets expectations
60–69% = Needs improvement
< 60% = Critical coaching required
SCORECARD EXAMPLE (Single Interaction):
════════════════════════════════════════════════════════════════════════
Criterion | Score | Weight | Weighted Score | Notes
════════════════════════════════════════════════════════════════════════
Empathy & Tone | 5 | 20% | 1.00 | "Thank you for your patience"
Process Compliance | 4 | 20% | 0.80 | Followed all steps
Knowledge Accuracy | 5 | 20% | 1.00 | Correct solution provided
Resolution Quality | 4 | 20% | 0.80 | Resolved, clear next steps
Communication Clarity | 4 | 10% | 0.40 | Well-structured
Proactivity | 3 | 10% | 0.30 | Basic, could anticipate more
════════════════════════════════════════════════════════════════════════
Overall QA Score: 4.30 / 5 = 86% (Strong performer)
════════════════════════════════════════════════════════════════════════
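The weighted-average arithmetic behind the scorecard can be sketched in a few lines. This is a minimal illustration, assuming the six categories and weights listed above; the dictionary keys and function names are hypothetical and not taken from any QA platform.

```python
# Weights copied from the scorecard; the key names are illustrative only.
WEIGHTS = {
    "empathy_tone": 0.20,
    "process_compliance": 0.20,
    "knowledge_accuracy": 0.20,
    "resolution_quality": 0.20,
    "communication_clarity": 0.10,
    "proactivity": 0.10,
}

# Lower bound of each score band, highest first.
BANDS = [
    (90, "Star performer"),
    (80, "Strong performer"),
    (70, "Meets expectations"),
    (60, "Needs improvement"),
]


def overall_score(scores: dict) -> float:
    """Return the overall QA score as a percentage of the 5-point maximum."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the scorecard categories")
    if any(not 1 <= value <= 5 for value in scores.values()):
        raise ValueError("each criterion is scored from 1 to 5")
    weighted = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return round(weighted / 5 * 100, 1)


def score_band(percent: float) -> str:
    """Map a percentage to the score band it falls into."""
    for floor, label in BANDS:
        if percent >= floor:
            return label
    return "Critical coaching required"


example = {
    "empathy_tone": 5,
    "process_compliance": 4,
    "knowledge_accuracy": 5,
    "resolution_quality": 4,
    "communication_clarity": 4,
    "proactivity": 3,
}
print(overall_score(example), score_band(overall_score(example)))
```

For the example interaction's scores (5, 4, 5, 4, 4, 3) the weighted sum is 4.30 out of 5, which is 86% and falls in the Strong performer band. Because every criterion scores at least 1, the lowest possible overall score on this scale is 20%.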
AI QA Platform Setup
QA AUTOMATION PLATFORMS
=========================
Platform 1 — Built-in Help Desk QA:
→ Zendesk QA: Native scoring, AI insights, sentiment analysis
→ Freshdesk Quality: Automated scoring, keyword detection
→ Intercom Quality: AI-powered interaction review
→ Pricing: Included in enterprise plans ($80–$150/agent/month)
→ Pros: No separate platform, native integration
→ Cons: Limited customization, platform-specific features
Platform 2 — Dedicated QA Platforms:
→ Guru: Knowledge + QA in one platform, AI scoring
→ Maestra: Conversational intelligence, AI QA, coaching
→ NICE CXone Quality Management: Enterprise-grade QA
→ Crisp: Real-time conversation analytics, AI scoring
→ Pricing: $20–$100/user/month
→ Pros: Advanced features, multi-channel, calibration tools
→ Cons: Additional platform to manage
Platform 3 — Custom AI QA:
→ Build: LLM-based scoring (GPT-4, Claude, open-source)
→ Pipeline: Transcribe → Analyze → Score → Recommend
→ Pricing: Model API costs ($0.01–$0.05 per interaction)
→ Pros: Fully customizable, scalable, integrated with any system
→ Cons: Development effort, ongoing maintenance
PLATFORM SELECTION:
════════════════════════════════════════════════════════════════════════
Team Size | Recommended Platform | Why
════════════════════════════════════════════════════════════════════════
< 20 agents | Built-in help desk QA | Simple, included
20–100 agents | Dedicated QA platform | Advanced features
100+ agents | Custom AI QA | Scale, customization
════════════════════════════════════════════════════════════════════════
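To make the pricing models above easier to compare, here is a rough cost sketch. It assumes the price ranges quoted in the platform list; the team size and interaction volume in the example are hypothetical, and the estimate ignores the development and maintenance effort listed as a con of the custom option.

```python
# Price ranges quoted in the platform list above (USD).
DEDICATED_PER_USER = (20.0, 100.0)      # per user per month
CUSTOM_PER_INTERACTION = (0.01, 0.05)   # model API cost per interaction


def dedicated_monthly_cost(users: int) -> tuple:
    """Low/high monthly cost of a per-seat dedicated QA platform."""
    low, high = DEDICATED_PER_USER
    return (round(users * low, 2), round(users * high, 2))


def custom_monthly_cost(interactions: int) -> tuple:
    """Low/high monthly model API cost of scoring every interaction."""
    low, high = CUSTOM_PER_INTERACTION
    return (round(interactions * low, 2), round(interactions * high, 2))


if __name__ == "__main__":
    users, interactions = 150, 60_000  # hypothetical team, not from the source
    print("Dedicated platform:", dedicated_monthly_cost(users))
    print("Custom AI QA (API only):", custom_monthly_cost(interactions))
```

With these assumed volumes, API costs for the custom route come to $600–$3,000 per month against $3,000–$15,000 for per-seat licensing, which is one reason the table points large teams toward a custom build. Engineering time is not included in that comparison.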
AI Calibration and Training
AI QA CALIBRATION PROCESS
===========================
Phase 1 — Ground Truth Collection:
→ Collect 500–1,000 historically scored interactions (manual QA)
→ Ensure variety: Different agents, channels, issue types, scores
→ Score range: Include 1s through 5s (not just high scores)
→ Quality: Manual scores from calibrated QA team (inter-rater reliability > 0.8)
Phase 2 — Model Training:
→ LLM prompt engineering: Scorecard criteria → AI scoring instructions
→ Few-shot learning: Provide 10–20 examples per score level
→ Iterative refinement: Test → compare to human scores → adjust prompts
→ Channel-specific: Separate prompts and example sets (or separate models) for chat, email, phone
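The prompt-engineering and few-shot steps in Phase 2 can be sketched as plain string assembly. This is an illustrative sketch only: the `ScoredExample` type, the `build_scoring_prompt` function, and the prompt wording are assumptions, and no specific LLM API is called.

```python
from dataclasses import dataclass

# Labels from the 1-5 scoring scale in the scorecard.
SCALE = {
    1: "Critical Failure",
    2: "Below Expectations",
    3: "Meets Expectations",
    4: "Exceeds Expectations",
    5: "Exceptional",
}


@dataclass
class ScoredExample:
    transcript: str
    score: int        # human QA score, 1-5
    rationale: str    # why the human reviewer gave that score


def build_scoring_prompt(criterion, definition, examples, transcript):
    """Assemble a few-shot scoring prompt for one scorecard criterion."""
    for ex in examples:
        if ex.score not in SCALE:
            raise ValueError(f"example score {ex.score} is outside 1-5")
    lines = [
        f"You are scoring a support interaction on one criterion: {criterion}.",
        f"Definition: {definition}",
        "Scale:",
    ]
    lines += [f"  {n} = {label}" for n, label in SCALE.items()]
    lines.append("Scored examples:")
    # Order examples from lowest to highest score so each level is easy to find.
    for ex in sorted(examples, key=lambda e: e.score):
        lines += [
            "---",
            f"Transcript: {ex.transcript}",
            f"Score: {ex.score}",
            f"Why: {ex.rationale}",
        ]
    lines += [
        "---",
        "Now score this interaction. Reply with a score from 1 to 5 and a one-sentence reason.",
        f"Transcript: {transcript}",
    ]
    return "\n".join(lines)
```

Keeping one prompt per criterion and per channel makes the iterative refinement step simpler, since a disagreement with human scores can be traced to a single definition or example set.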
Phase 3 — Calibration Testing:
→ Test set: 200 interactions NOT used in training
→ Compare: AI scores vs human scores
→ Agreement rate: Target > 85% (within 1 point on 5-point scale)
→ Disagreement analysis: Where does AI differ? Why?
→ Iteration: Adjust prompts, add examples, re-test
CALIBRATION METRICS:
════════════════════════════════════════════════════════════════════════
Metric | Target
════════════════════════════════════════════════════════════════════════
AI-human score agreement | > 85%
Mean absolute error (score) | < 0.5 points
Precision (flagging low scores) | > 90%
Recall (catching all low scores) | > 85%
Calibration drift (monthly) | < 5%
════════════════════════════════════════════════════════════════════════
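The first four calibration metrics can be computed directly from paired human and AI scores on the held-out test set. The sketch below is a minimal version under two assumptions that the source does not state: agreement means the scores differ by at most 1 point (as in Phase 3), and a "low score" is 2 or below (the levels marked as needing coaching).

```python
def calibration_metrics(human, ai, tolerance=1, low_threshold=2):
    """Compare AI scores with human scores on the same interactions.

    human, ai: equal-length lists of integer scores (1-5).
    tolerance: maximum point difference that still counts as agreement.
    low_threshold: scores at or below this value count as "low" (assumption).
    """
    if len(human) != len(ai) or not human:
        raise ValueError("need two non-empty score lists of equal length")
    pairs = list(zip(human, ai))
    n = len(pairs)

    agreement = sum(abs(h - a) <= tolerance for h, a in pairs) / n
    mae = sum(abs(h - a) for h, a in pairs) / n

    # Treat "human gave a low score" as the positive class.
    tp = sum(h <= low_threshold and a <= low_threshold for h, a in pairs)
    fp = sum(h > low_threshold and a <= low_threshold for h, a in pairs)
    fn = sum(h <= low_threshold and a > low_threshold for h, a in pairs)

    return {
        "agreement": agreement,
        "mae": mae,
        "precision": tp / (tp + fp) if tp + fp else None,
        "recall": tp / (tp + fn) if tp + fn else None,
    }


if __name__ == "__main__":
    human_scores = [5, 4, 3, 2, 1, 2, 4, 3]
    ai_scores = [5, 3, 3, 2, 3, 1, 4, 5]
    print(calibration_metrics(human_scores, ai_scores))
```

Precision and recall are returned as `None` when the test set contains no low scores to flag, which is a reminder that the ground-truth sample needs to cover the full 1–5 range.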
Phase 4 — Ongoing Recalibration:
→ Monthly: QA team reviews 50 AI-scored interactions
→ Discrepancy threshold: If disagreement exceeds 15% (agreement falls below the 85% target), recalibrate
→ New criteria: Add scorecard items as processes evolve
→ Agent feedback: Agents flag incorrect AI scores → reviewed by QA team
QA Reporting and Coaching
QA REPORTING DASHBOARD
========================
Agent-Level Reports:
════════════════════════════════════════════════════════════════════════
Agent Name | Avg QA | Interactions | Top Strength | Top Area for Growth
════════════════════════════════════════════════════════════════════════
Sarah K. | 92% | 342 | Empathy (4.8) | Proactivity (3.6)
Michael T. | 85% | 298 | Knowledge (4.5)| Process (3.9)
Lisa R. | 78% | 315 | Resolution (4.2)| Tone (3.5)
James W. | 91% | 280 | Clarity (4.7) | Compliance (3.8)
════════════════════════════════════════════════════════════════════════
Team-Level Reports:
════════════════════════════════════════════════════════════════════════
Category | Team Avg | Trend | Target | Status
════════════════════════════════════════════════════════════════════════
Empathy & Tone | 4.3/5 | ↑ +0.1 | 4.5 | 🟡 Close
Process Compliance | 3.9/5 | → 0.0 | 4.5 | 🔴 Below
Knowledge Accuracy | 4.4/5 | ↑ +0.2 | 4.5 | 🟢 On Track
Resolution Quality | 4.2/5 | ↑ +0.1 | 4.5 | 🟡 Close
Communication | 4.1/5 | → 0.0 | 4.3 | 🟡 Close
Proactivity | 3.6/5 | ↓ -0.1 | 4.0 | 🔴 Declining
════════════════════════════════════════════════════════════════════════
Overall Team QA: 82.6% (Target: 87%)
COACHING RECOMMENDATIONS (AI-generated):
════════════════════════════════════════════════════════════════════════
Agent: Lisa R. | Score: 78% | Priority: HIGH
Areas for Improvement:
1. Tone (3.5/5): Responses detected as too formal/robotic
→ Coaching: Role-play exercises for conversational tone
→ Examples: Show "good" vs "current" responses
→ Resource: "Writing Naturally" training module
2. Process Compliance (3.7/5): Skipping verification step in 15% of tickets
→ Coaching: Review ticket verification checklist
→ Action: Add mandatory checklist to ticket workflow
→ Follow-up: Review compliance in 2 weeks
Expected improvement: 5–8 QA points with focused coaching
════════════════════════════════════════════════════════════════════════
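The "Top Strength" and "Top Area for Growth" columns in the agent-level report can be derived from per-category averages. The sketch below assumes each scored interaction is available as a dictionary of category scores; the function name and data shape are hypothetical.

```python
from statistics import mean


def agent_summary(interactions):
    """Summarize one agent's scored interactions.

    interactions: non-empty list of dicts mapping category name -> score (1-5),
    all with the same categories.
    """
    if not interactions:
        raise ValueError("need at least one scored interaction")
    categories = list(interactions[0])
    averages = {
        category: round(mean(item[category] for item in interactions), 2)
        for category in categories
    }
    return {
        "averages": averages,
        # On ties, the category listed first wins.
        "top_strength": max(averages, key=averages.get),
        "top_growth_area": min(averages, key=averages.get),
    }


if __name__ == "__main__":
    scored = [
        {"empathy": 5, "process": 3, "knowledge": 4},
        {"empathy": 4, "process": 3, "knowledge": 4},
    ]
    print(agent_summary(scored))
```

The lowest-scoring category is a natural starting point for the coaching recommendations, though a QA lead would still want to check the underlying interactions before assigning training.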
Integration Points
- Help Desk (Zendesk, Freshdesk, Intercom): Interaction data, transcripts, ticket history
- QA Platforms (Maestra, Guru, NICE): Automated scoring, calibration, reporting
- AI/ML (OpenAI, Anthropic, custom LLMs): NLP analysis, scoring, recommendation generation
- HR/Learning (Cornerstone, LinkedIn Learning): Training assignment based on QA gaps
- Analytics (Tableau, Power BI): QA dashboards, trend analysis, coaching tracking
- Communication (Slack, Teams): Real-time QA alerts, coaching notifications
- CRM (Salesforce, HubSpot): Agent performance data, QA score sharing with managers
- Data Warehouse (Snowflake, BigQuery): QA data storage, cross-metric analysis
Edge Cases
- AI scores too harshly/leniently: Systematic bias in AI scoring
  - Calibration: Regular comparison against human scores (monthly)
  - Bias detection: Compare AI scores across agents (is one agent consistently penalized?)
  - Adjust: Modify prompt weighting if AI over-penalizes specific criteria
  - Agent appeals: Allow agents to dispute AI scores; QA team reviews
  - Transparency: Show agents WHY they received a score (AI explanation)
- Multilingual interactions: AI QA trained on English, scores other languages poorly
  - Translation: Translate interaction to English → score in English → report in original
  - Or: Train separate models per language (resource-intensive)
  - Priority: Support top 3–5 languages first
  - Quality check: Manual review of non-English AI scores (higher variance expected)
- Phone call QA: Transcription errors lead to incorrect scoring
  - Transcription accuracy: Target > 95% (use high-quality ASR like Deepgram, Google Speech)
  - Fallback: If transcription confidence < 80%, flag for human review
  - Noise handling: Background noise reduces accuracy; recommend headset use
  - Validation: Agent can review and correct transcript before scoring
- QA score gaming: Agents optimize for AI score, not actual quality
  - Multi-signal scoring: Combine AI score with CSAT, resolution rate, re-contact rate
  - Human override: QA team reviews borderline cases and spot-checks high scorers
  - Mystery shopping: Periodic fake interactions to test real performance
  - Scorecard evolution: Regularly update criteria to prevent gaming
  - Culture: Emphasize coaching over punishment; QA as improvement tool, not weapon
- New agent scoring: Insufficient interactions for reliable QA score
  - Minimum threshold: Require 50 interactions before QA score displayed
  - Early feedback: AI still provides feedback but score not published
  - Ramp-up expectation: New agents start at lower expected score (60–70% → 80%+)
  - Mentor matching: Pair with high-QA-score agent for shadowing
- Scorecard change impact: Updating scorecard changes historical scores
  - Versioning: Scorecard v1, v2, v3 — scores tagged by version
  - Reporting: Show current score and trend (not affected by scorecard change)
  - Communication: Notify agents of scorecard changes and rationale
  - Grace period: 2-week transition with both old and new scores displayed
  - Baseline reset: After major scorecard change, reset targets for 1 month
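Several of the edge cases above reduce to a routing decision made before an AI score is published. The sketch below combines three of them: the transcription-confidence fallback for phone calls, the 50-interaction minimum for new agents, and scorecard version tagging. The field names, the `route` function, and the three outcome labels are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional

MIN_TRANSCRIPT_CONFIDENCE = 0.80   # below this, flag phone calls for human review
MIN_INTERACTIONS_FOR_SCORE = 50    # below this, give feedback but hide the score


@dataclass
class ScoredInteraction:
    channel: str                             # "chat", "email", or "phone"
    agent_interactions_to_date: int          # scored interactions for this agent so far
    ai_score_pct: float                      # overall AI QA score, 0-100
    scorecard_version: str                   # e.g. "v2", so scores stay comparable
    transcript_confidence: Optional[float] = None  # ASR confidence, phone only


def route(item: ScoredInteraction) -> str:
    """Decide what happens to an AI score before anyone sees it."""
    if (
        item.channel == "phone"
        and item.transcript_confidence is not None
        and item.transcript_confidence < MIN_TRANSCRIPT_CONFIDENCE
    ):
        return "human_review"    # unreliable transcript, do not trust the score
    if item.agent_interactions_to_date < MIN_INTERACTIONS_FOR_SCORE:
        return "feedback_only"   # new agent: coaching notes without a published score
    return "publish"


if __name__ == "__main__":
    call = ScoredInteraction("phone", 120, 74.0, "v2", transcript_confidence=0.62)
    print(route(call))
```

The transcript check runs first so that a noisy call from a new agent still reaches a human reviewer instead of producing feedback based on a faulty transcript.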