Support AI Skill

Quality Assurance Automation

Automate quality assurance scoring of support interactions using AI to evaluate agent performance, identify coaching opportunities, and ensure consistent service quality. Use when implementing automated QA, AI scoring of interactions, quality calibration, c...

Quality Assurance (QA) Automation

Automate the evaluation of support interactions using AI — scaling QA from sampling 2–5% of interactions to evaluating 100%, providing consistent scoring and actionable coaching insights.

Workflow

  1. Define quality scorecard with weighted criteria aligned to company values.
  2. Select QA automation platform (built-in or third-party).
  3. Train AI model on historical interactions with manual QA scores as ground truth.
  4. Calibrate AI scores against human QA scores (target: 85%+ agreement).
  5. Deploy automated scoring on all new interactions.
  6. Review AI scores with human QA auditor (calibration loop).
  7. Generate agent scorecards, team dashboards, and coaching recommendations.
  8. Identify systemic quality issues and process improvements.
  9. Continuously recalibrate AI model with new manual evaluations.

QA Scorecard Design

AUTOMATED QA SCORECARD
========================

Scorecard Categories (weighted):
  ════════════════════════════════════════════════════════════════════════
  Category              | Weight | Criteria                    | AI Evaluation Method
  ════════════════════════════════════════════════════════════════════════
  Empathy & Tone        | 20%    | Warm, professional,         | NLP sentiment analysis
                        |        | customer-focused language   | + empathy keyword detection
  Process Compliance    | 20%    | Followed required steps,    | Rule-based: checklist
                        |        | used correct templates      | verification, form completion
  Knowledge Accuracy    | 20%    | Information provided is     | LLM comparison against
                        |        | correct and complete        | knowledge base / playbook
  Resolution Quality    | 20%    | Issue fully resolved,       | Outcome analysis:
                        |        | clear next steps            | re-contact rate, resolution
                        |        |                             | confirmation, CSAT score
  Communication Clarity | 10%    | Clear, concise,             | Readability score,
                        |        | well-structured response    | formatting, jargon detection
  Proactivity           | 10%    | Anticipated follow-up       | LLM analysis: suggested
                        |        | questions/needs             | related solutions, asked
                        |        |                             | clarifying questions
  ════════════════════════════════════════════════════════════════════════
  Total: 100%

SCORING SCALE:
  → Each criterion scored 1–5:
     1 = Critical Failure (must be coached immediately)
     2 = Below Expectations (coaching needed)
     3 = Meets Expectations (acceptable)
     4 = Exceeds Expectations (good practice)
     5 = Exceptional (potential training material)
  → Overall score: Weighted average across all categories, converted to a percentage of the 5-point maximum (weighted average ÷ 5 × 100)
  → Score bands:
     90–100% = Star performer
     80–89%  = Strong performer
     70–79%  = Meets expectations
     60–69%  = Needs improvement
     < 60%   = Critical coaching required

SCORECARD EXAMPLE (Single Interaction):
  ════════════════════════════════════════════════════════════════════════
  Criterion              | Score | Weight | Weighted Score | Notes
  ════════════════════════════════════════════════════════════════════════
  Empathy & Tone         | 5     | 20%    | 1.00           | "Thank you for your patience"
  Process Compliance     | 4     | 20%    | 0.80           | Followed all steps
  Knowledge Accuracy     | 5     | 20%    | 1.00           | Correct solution provided
  Resolution Quality     | 4     | 20%    | 0.80           | Resolved, clear next steps
  Communication Clarity  | 4     | 10%    | 0.40           | Well-structured
  Proactivity            | 3     | 10%    | 0.30           | Basic, could anticipate more
  ════════════════════════════════════════════════════════════════════════
  Overall QA Score: 4.30 / 5 = 86% (Strong Performer)
  ════════════════════════════════════════════════════════════════════════
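
The weighting and banding rules above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `WEIGHTS`, `BANDS`, `overall_score`, and `band` are hypothetical, and the conversion to a percentage assumes the overall score is the weighted average divided by the 5-point maximum.

```python
# Category weights copied from the scorecard table; they must sum to 1.0.
WEIGHTS = {
    "Empathy & Tone": 0.20,
    "Process Compliance": 0.20,
    "Knowledge Accuracy": 0.20,
    "Resolution Quality": 0.20,
    "Communication Clarity": 0.10,
    "Proactivity": 0.10,
}

# Score bands as (lower bound in percent, label), highest band first.
BANDS = [
    (90, "Star performer"),
    (80, "Strong performer"),
    (70, "Meets expectations"),
    (60, "Needs improvement"),
    (0, "Critical coaching required"),
]


def overall_score(scores):
    """Return (weighted average on the 1-5 scale, percent of the 5-point maximum)."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the scorecard categories")
    weighted = sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)
    return weighted, weighted / 5 * 100


def band(percent):
    """Map a percentage onto the score bands."""
    for floor, label in BANDS:
        if percent >= floor:
            return label
    return BANDS[-1][1]


# The single-interaction example from the scorecard above.
example = {
    "Empathy & Tone": 5,
    "Process Compliance": 4,
    "Knowledge Accuracy": 5,
    "Resolution Quality": 4,
    "Communication Clarity": 4,
    "Proactivity": 3,
}
weighted, percent = overall_score(example)
print(f"{weighted:.2f} / 5 = {percent:.0f}% ({band(percent)})")
```

Because every criterion is scored from 1 to 5, the lowest possible overall score under this conversion is 20%, not 0%.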

AI QA Platform Setup

QA AUTOMATION PLATFORMS
=========================

Platform 1 — Built-in Help Desk QA:
  → Zendesk QA: Native scoring, AI insights, sentiment analysis
  → Freshdesk Quality: Automated scoring, keyword detection
  → Intercom Quality: AI-powered interaction review
  → Pricing: Included in enterprise plans ($80–$150/agent/month)
  → Pros: No separate platform, native integration
  → Cons: Limited customization, platform-specific features

Platform 2 — Dedicated QA Platforms:
  → Guru: Knowledge + QA in one platform, AI scoring
  → MaestroQA: Conversational intelligence, AI QA, coaching
  → NICE CXone Quality Management: Enterprise-grade QA
  → Crisp: Real-time conversation analytics, AI scoring
  → Pricing: $20–$100/user/month
  → Pros: Advanced features, multi-channel, calibration tools
  → Cons: Additional platform to manage

Platform 3 — Custom AI QA:
  → Build: LLM-based scoring (GPT-4, Claude, open-source)
  → Pipeline: Transcribe → Analyze → Score → Recommend
  → Pricing: Model API costs ($0.01–$0.05 per interaction)
  → Pros: Fully customizable, scalable, integrated with any system
  → Cons: Development effort, ongoing maintenance
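
As a rough sketch of the Transcribe → Analyze → Score → Recommend pipeline, the skeleton below wires the stages together with pluggable callables. Everything here is hypothetical scaffolding: `QAResult`, `run_pipeline`, and the `transcribe` and `score` parameters are illustrative names, the analyze and score stages are folded into one `score` callable, and no real LLM or speech-to-text API is called. The coaching cutoff (scores below 3) follows the scoring scale, where 1 and 2 both call for coaching.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class QAResult:
    interaction_id: str
    scores: Dict[str, int]
    recommendations: List[str] = field(default_factory=list)


def run_pipeline(
    interaction_id: str,
    raw: str,
    transcribe: Callable[[str], str],
    score: Callable[[str], Dict[str, int]],
    coach_below: int = 3,
) -> QAResult:
    """Run one interaction through transcribe -> analyze/score -> recommend."""
    transcript = transcribe(raw)   # e.g. speech-to-text for calls, cleanup for chat
    scores = score(transcript)     # e.g. an LLM call returning category -> 1-5
    recommendations = [
        f"Coach on {category} (scored {value}/5)"
        for category, value in sorted(scores.items())
        if value < coach_below
    ]
    return QAResult(interaction_id, scores, recommendations)


# Usage with stand-in stages; a real deployment would swap in actual services.
def fake_scorer(transcript: str) -> Dict[str, int]:
    return {"Empathy & Tone": 2, "Proactivity": 4}


result = run_pipeline("ticket-1", "  Hi, my invoice is wrong.  ", str.strip, fake_scorer)
print(result.recommendations)
```

Keeping the stages as injected callables makes it straightforward to test the orchestration without network access and to swap model providers later.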

PLATFORM SELECTION:
  ════════════════════════════════════════════════════════════════════════
  Team Size      | Recommended Platform         | Why
  ════════════════════════════════════════════════════════════════════════
  < 20 agents    | Built-in help desk QA        | Simple, included
  20–100 agents  | Dedicated QA platform        | Advanced features
  100+ agents    | Custom AI QA                 | Scale, customization
  ════════════════════════════════════════════════════════════════════════
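
The selection table and the per-interaction pricing for a custom build can be expressed as two small helpers. This is only a sketch: `recommend_platform` and `custom_api_cost` are hypothetical names, and because the table lists both "20–100" and "100+", the code assumes a team of exactly 100 agents falls in the dedicated-platform tier.

```python
def recommend_platform(agent_count: int) -> str:
    """Map team size onto the platform tiers from the selection table."""
    if agent_count < 0:
        raise ValueError("agent_count must be non-negative")
    if agent_count < 20:
        return "Built-in help desk QA"
    if agent_count <= 100:  # assumption: exactly 100 counts as the middle tier
        return "Dedicated QA platform"
    return "Custom AI QA"


def custom_api_cost(interactions_per_month: int, low: float = 0.01, high: float = 0.05):
    """Estimate the monthly model API cost range for a custom build, in dollars."""
    return interactions_per_month * low, interactions_per_month * high


print(recommend_platform(45))
print(custom_api_cost(10_000))
```

The cost helper covers model API spend only; development and maintenance effort, listed as the main drawbacks of a custom build, are not included.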

AI Calibration and Training

AI QA CALIBRATION PROCESS
===========================

Phase 1 — Ground Truth Collection:
  → Collect 500–1,000 historically scored interactions (manual QA)
  → Ensure variety: Different agents, channels, issue types, scores
  → Score range: Include 1s through 5s (not just high scores)
  → Quality: Manual scores from calibrated QA team (inter-rater reliability > 0.8)

Phase 2 — Model Training:
  → LLM prompt engineering: Scorecard criteria → AI scoring instructions
  → Few-shot learning: Provide 10–20 examples per score level
  → Iterative refinement: Test → compare to human scores → adjust prompts
  → Channel-specific: Separate models for chat, email, phone
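
One way to turn scorecard criteria into scoring instructions is to assemble the prompt programmatically, so that criteria and few-shot examples stay in sync with the scorecard. The sketch below only builds the prompt text; it does not call any model. The function name `build_scoring_prompt`, the example record layout, and the exact wording of the instructions are assumptions for illustration.

```python
import json

# Criteria text taken from the scorecard table.
CRITERIA = {
    "Empathy & Tone": "Warm, professional, customer-focused language",
    "Process Compliance": "Followed required steps, used correct templates",
    "Knowledge Accuracy": "Information provided is correct and complete",
    "Resolution Quality": "Issue fully resolved, clear next steps",
    "Communication Clarity": "Clear, concise, well-structured response",
    "Proactivity": "Anticipated follow-up questions/needs",
}

SCALE = (
    "1 = Critical Failure, 2 = Below Expectations, 3 = Meets Expectations, "
    "4 = Exceeds Expectations, 5 = Exceptional"
)


def build_scoring_prompt(transcript, examples):
    """Build a few-shot scoring prompt.

    examples: list of {"transcript": str, "scores": {criterion: int}} taken
    from manually scored interactions (hypothetical record layout).
    """
    lines = [
        "You are a support QA reviewer. Score the interaction on each criterion from 1 to 5.",
        "",
        "Criteria:",
    ]
    lines += [f"- {name}: {description}" for name, description in CRITERIA.items()]
    lines += ["", f"Scale: {SCALE}", ""]
    for example in examples:
        lines += [
            "Example interaction:",
            example["transcript"],
            "Scores: " + json.dumps(example["scores"]),
            "",
        ]
    lines += [
        "Interaction to score:",
        transcript,
        "",
        "Reply with a JSON object mapping each criterion to an integer score.",
    ]
    return "\n".join(lines)


prompt = build_scoring_prompt(
    "Customer: My order is late. Agent: I'm sorry about that, let me check.",
    [{"transcript": "Customer: Hi. Agent: What do you want?", "scores": {"Empathy & Tone": 1}}],
)
print(prompt)
```

For channel-specific scoring, the same builder could be called with a different example set per channel.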

Phase 3 — Calibration Testing:
  → Test set: 200 interactions NOT used in training
  → Compare: AI scores vs human scores
  → Agreement rate: Target > 85% (within 1 point on 5-point scale)
  → Disagreement analysis: Where does AI differ? Why?
  → Iteration: Adjust prompts, add examples, re-test

CALIBRATION METRICS:
  ════════════════════════════════════════════════════════════════════════
  Metric                            | Target
  ════════════════════════════════════════════════════════════════════════
  AI-human score agreement          | > 85%
  Mean absolute error (score)       | < 0.5 points
  Precision (flagging low scores)   | > 90%
  Recall (catching all low scores)  | > 85%
  Calibration drift (monthly)       | < 5%
  ════════════════════════════════════════════════════════════════════════
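
The first four metrics in the table can be computed directly from paired human and AI scores on the held-out test set. The sketch below is one possible reading of them: agreement means the two scores are within 1 point, as stated in Phase 3, and a "low score" is assumed to be 2 or below, since the scoring scale marks 1 and 2 as needing coaching. The function name `calibration_metrics` and the `low_threshold` parameter are hypothetical.

```python
def calibration_metrics(human, ai, low_threshold=2):
    """Compare AI scores with human scores (both lists of 1-5 integers)."""
    if len(human) != len(ai) or not human:
        raise ValueError("human and ai must be non-empty and the same length")
    pairs = list(zip(human, ai))
    n = len(pairs)

    # Agreement: AI score within 1 point of the human score.
    agreement = sum(abs(h - a) <= 1 for h, a in pairs) / n
    # Mean absolute error on the 5-point scale.
    mae = sum(abs(h - a) for h, a in pairs) / n

    # Treat "low score" as a binary flag and compare AI flags with human flags.
    tp = sum(h <= low_threshold and a <= low_threshold for h, a in pairs)
    fp = sum(h > low_threshold and a <= low_threshold for h, a in pairs)
    fn = sum(h <= low_threshold and a > low_threshold for h, a in pairs)
    precision = tp / (tp + fp) if (tp + fp) else None  # None: AI flagged nothing
    recall = tp / (tp + fn) if (tp + fn) else None     # None: no low human scores

    return {"agreement": agreement, "mae": mae, "precision": precision, "recall": recall}


metrics = calibration_metrics(human=[1, 2, 3, 4, 5, 2], ai=[1, 3, 3, 4, 3, 2])
print(metrics)
```

Monthly drift could be tracked by running the same function on each month's review sample and comparing the agreement values.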

Phase 4 — Ongoing Recalibration:
  → Monthly: QA team reviews 50 AI-scored interactions
  → Discrepancy threshold: If AI and human scores disagree on more than 10% of the reviewed sample, recalibrate
  → New criteria: Add scorecard items as processes evolve
  → Agent feedback: Agents flag incorrect AI scores → reviewed by QA team
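
The monthly review loop above might look like the following sketch: draw a random sample of AI-scored interactions for the QA team, then check whether the share of disagreements crosses the recalibration threshold. The names `sample_for_review` and `needs_recalibration` are hypothetical, and "disagreement" is assumed to mean the AI and human scores differ by more than 1 point, mirroring the agreement definition used in calibration testing.

```python
import random


def sample_for_review(interaction_ids, k=50, seed=None):
    """Pick up to k interactions at random for manual QA review."""
    ids = list(interaction_ids)
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return rng.sample(ids, min(k, len(ids)))


def needs_recalibration(human, ai, threshold=0.10):
    """Return True when the disagreement rate exceeds the threshold."""
    if len(human) != len(ai) or not human:
        raise ValueError("human and ai must be non-empty and the same length")
    disagreements = sum(abs(h - a) > 1 for h, a in zip(human, ai))
    return disagreements / len(human) > threshold


review_ids = sample_for_review(range(1000), k=50, seed=7)
print(len(review_ids))
print(needs_recalibration([3] * 10, [3] * 8 + [5, 5]))
```

Sampling uniformly at random keeps the review set representative; a team might instead stratify by channel or score level, which this sketch does not attempt.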

QA Reporting and Coaching

QA REPORTING DASHBOARD
========================

Agent-Level Reports:
  ════════════════════════════════════════════════════════════════════════
  Agent Name    | Avg QA | Interactions | Top Strength     | Top Area for Growth
  ════════════════════════════════════════════════════════════════════════
  Sarah K.      | 92%    | 342          | Empathy (4.8)    | Proactivity (3.6)
  Michael T.    | 85%    | 298          | Knowledge (4.5)  | Process (3.9)
  Lisa R.       | 78%    | 315          | Resolution (4.2) | Tone (3.5)
  James W.      | 91%    | 280          | Clarity (4.7)    | Compliance (3.8)
  ════════════════════════════════════════════════════════════════════════
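
Rows like those in the agent-level report could be derived from per-interaction scores roughly as follows. This is a sketch under assumptions: the input record layout (`agent`, `scores`) and the function name `agent_report` are hypothetical, the weights are the ones from the scorecard, and the average QA percentage is taken as the weighted category average divided by the 5-point maximum.

```python
from collections import defaultdict

WEIGHTS = {
    "Empathy & Tone": 0.20,
    "Process Compliance": 0.20,
    "Knowledge Accuracy": 0.20,
    "Resolution Quality": 0.20,
    "Communication Clarity": 0.10,
    "Proactivity": 0.10,
}


def agent_report(interactions):
    """Aggregate scored interactions into one report row per agent.

    interactions: list of {"agent": str, "scores": {category: 1-5}}.
    """
    by_agent = defaultdict(list)
    for item in interactions:
        by_agent[item["agent"]].append(item["scores"])

    rows = []
    for agent, score_list in sorted(by_agent.items()):
        # Average each category across the agent's interactions.
        category_avg = {
            c: sum(s[c] for s in score_list) / len(score_list) for c in WEIGHTS
        }
        overall = sum(WEIGHTS[c] * category_avg[c] for c in WEIGHTS) / 5 * 100
        rows.append({
            "agent": agent,
            "avg_qa": round(overall, 1),
            "interactions": len(score_list),
            "top_strength": max(category_avg, key=category_avg.get),
            "top_growth": min(category_avg, key=category_avg.get),
        })
    return rows


scores = {
    "Empathy & Tone": 5, "Process Compliance": 4, "Knowledge Accuracy": 4,
    "Resolution Quality": 4, "Communication Clarity": 4, "Proactivity": 3,
}
print(agent_report([{"agent": "Sarah K.", "scores": scores}] * 2))
```

When categories tie, `max` and `min` return the first category in scorecard order; a production report would likely want an explicit tie-break rule.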

Team-Level Reports:
  ════════════════════════════════════════════════════════════════════════
  Category           | Team Avg | Trend   | Target | Status
  ════════════════════════════════════════════════════════════════════════
  Empathy & Tone     | 4.3/5    | ↑ +0.1  | 4.5    | 🟡 Close
  Process Compliance | 3.9/5    | → 0.0   | 4.5    | 🔴 Below
  Knowledge Accuracy | 4.4/5    | ↑ +0.2  | 4.5    | 🟢 On Track
  Resolution Quality | 4.2/5    | ↑ +0.1  | 4.5    | 🟡 Close
  Communication      | 4.1/5    | → 0.0   | 4.3    | 🟡 Close
  Proactivity        | 3.6/5    | ↓ -0.1  | 4.0    | 🔴 Declining
  ════════════════════════════════════════════════════════════════════════
  Overall Team QA: 82.6% (Target: 87%)

COACHING RECOMMENDATIONS (AI-generated):
  ════════════════════════════════════════════════════════════════════════
  Agent: Lisa R. | Score: 78% | Priority: HIGH
  
  Areas for Improvement:
  1. Tone (3.5/5): Responses detected as too formal/robotic
     → Coaching: Role-play exercises for conversational tone
     → Examples: Show "good" vs "current" responses
     → Resource: "Writing Naturally" training module
  
  2. Process Compliance (3.7/5): Skipping verification step in 15% of tickets
     → Coaching: Review ticket verification checklist
     → Action: Add mandatory checklist to ticket workflow
     → Follow-up: Review compliance in 2 weeks
  
  Expected improvement: 5–8 QA points with focused coaching
  ════════════════════════════════════════════════════════════════════════
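
A simple rule-based starting point for picking coaching focus areas is to rank an agent's category averages and surface the weakest ones. The sketch below is an assumption-heavy illustration, not how any particular platform generates recommendations: the function name `coaching_focus`, the 4.0 cutoff, and the limit of two focus areas are all hypothetical choices.

```python
def coaching_focus(category_avgs, threshold=4.0, max_items=2):
    """Return the lowest-scoring categories below the threshold, weakest first."""
    below = [(category, avg) for category, avg in category_avgs.items() if avg < threshold]
    below.sort(key=lambda pair: pair[1])
    return below[:max_items]


# Category averages for Lisa R. as quoted in the reports above.
lisa = {"Tone": 3.5, "Process Compliance": 3.7, "Resolution": 4.2}
for rank, (category, avg) in enumerate(coaching_focus(lisa), start=1):
    print(f"{rank}. {category} ({avg}/5)")
```

Limiting the list to a couple of items keeps coaching sessions focused; the specific coaching actions and resources would still come from a QA lead or a separate LLM step.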

Integration Points

Edge Cases