Support AI Skill

Quality Assurance Automation

Automate quality assurance scoring of support interactions using AI to evaluate agent performance, identify coaching opportunities, and ensure consistent service quality. Use when implementing automated QA, AI scoring of interactions, quality calibration, coaching identification, or scaling QA coverage across all interactions. Triggers on phrases like "QA automation", "automated quality assurance", "AI quality scoring", "interaction scoring", "quality calibration", "QA coverage", "auto-coaching", "quality monitoring AI", "support QA automation", "conversation analytics".

Quality Assurance (QA) Automation

Automate the evaluation of support interactions using AI — scaling QA from sampling 2–5% of interactions to evaluating 100%, providing consistent scoring and actionable coaching insights.

Workflow

Define quality scorecard with weighted criteria aligned to company values.
Select QA automation platform (built-in or third-party).
Train AI model on historical interactions with manual QA scores as ground truth.
Calibrate AI scores against human QA scores (target: 85%+ agreement).
Deploy automated scoring on all new interactions.
Review AI scores with human QA auditor (calibration loop).
Generate agent scorecards, team dashboards, and coaching recommendations.
Identify systemic quality issues and process improvements.
Continuously recalibrate AI model with new manual evaluations.

QA Scorecard Design

AUTOMATED QA SCORECARD
========================

Scorecard Categories (weighted):
  ════════════════════════════════════════════════════════════════════════
  Category              | Weight | Criteria                    | AI Evaluation Method
  ════════════════════════════════════════════════════════════════════════
  Empathy & Tone        | 20%    | Warm, professional,         | NLP sentiment analysis
                        |        | customer-focused language   | + empathy keyword detection
  Process Compliance    | 20%    | Followed required steps,    | Rule-based: checklist
                        |        | used correct templates      | verification, form completion
  Knowledge Accuracy    | 20%    | Information provided is     | LLM comparison against
                        |        | correct and complete        | knowledge base / playbook
  Resolution Quality    | 20%    | Issue fully resolved,       | Outcome analysis:
                        |        | clear next steps            | re-contact rate, resolution
                        |        |                             | confirmation, CSAT score
  Communication Clarity | 10%    | Clear, concise,             | Readability score,
                        |        | well-structured response    | formatting, jargon detection
  Proactivity           | 10%    | Anticipated follow-up       | LLM analysis: suggested
                        |        | questions/needs             | related solutions, asked
                        |        |                             | clarifying questions
  ════════════════════════════════════════════════════════════════════════
  Total: 100%

SCORING SCALE:
  → Each criterion scored 1–5:
     1 = Critical Failure (must be coached immediately)
     2 = Below Expectations (coaching needed)
     3 = Meets Expectations (acceptable)
     4 = Exceeds Expectations (good practice)
     5 = Exceptional (potential training material)
  → Overall score: Weighted average across all categories (1–5 scale), reported as a percentage of the maximum score (weighted average ÷ 5 × 100)
  → Score bands:
     90–100% = Star performer
     80–89%  = Strong performer
     70–79%  = Meets expectations
     60–69%  = Needs improvement
     < 60%   = Critical coaching required

SCORECARD EXAMPLE (Single Interaction):
  ════════════════════════════════════════════════════════════════════════
  Criterion              | Score | Weight | Weighted Score | Notes
  ════════════════════════════════════════════════════════════════════════
  Empathy & Tone         | 5     | 20%    | 1.00           | "Thank you for your patience"
  Process Compliance     | 4     | 20%    | 0.80           | Followed all steps
  Knowledge Accuracy     | 5     | 20%    | 1.00           | Correct solution provided
  Resolution Quality     | 4     | 20%    | 0.80           | Resolved, clear next steps
  Communication Clarity  | 4     | 10%    | 0.40           | Well-structured
  Proactivity            | 3     | 10%    | 0.30           | Basic, could anticipate more
  ════════════════════════════════════════════════════════════════════════
  Overall QA Score: 86% (4.30 / 5, Strong Performer)
  ════════════════════════════════════════════════════════════════════════
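
The weighted-average calculation behind the example above can be sketched in Python as follows. This is a minimal illustration, not a platform implementation: the names `WEIGHTS`, `BANDS`, `overall_score`, and `score_band` are hypothetical, and converting the 1–5 weighted average to a percentage by dividing by the maximum score of 5 is an assumption inferred from the score bands.

```python
# Category weights in percent, copied from the scorecard above.
WEIGHTS = {
    "Empathy & Tone": 20,
    "Process Compliance": 20,
    "Knowledge Accuracy": 20,
    "Resolution Quality": 20,
    "Communication Clarity": 10,
    "Proactivity": 10,
}

# Lower bound (percent) of each score band, highest first.
BANDS = [
    (90, "Star performer"),
    (80, "Strong performer"),
    (70, "Meets expectations"),
    (60, "Needs improvement"),
]


def overall_score(scores: dict[str, int]) -> float:
    """Return the weighted 1-5 average as a percentage of the maximum (5)."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the scorecard categories")
    weighted_sum = sum(scores[name] * weight for name, weight in WEIGHTS.items())
    # weighted_sum / 100 is the 1-5 average; (average / 5) * 100 simplifies to weighted_sum / 5.
    return weighted_sum / 5


def score_band(percent: float) -> str:
    """Map a percentage to the score band it falls into."""
    for lower_bound, label in BANDS:
        if percent >= lower_bound:
            return label
    return "Critical coaching required"


# Scores from the single-interaction example above.
example = {
    "Empathy & Tone": 5,
    "Process Compliance": 4,
    "Knowledge Accuracy": 5,
    "Resolution Quality": 4,
    "Communication Clarity": 4,
    "Proactivity": 3,
}
percent = overall_score(example)
print(f"{percent:.0f}% ({score_band(percent)})")
```

Using integer percent weights keeps the arithmetic exact and avoids floating-point rounding at band boundaries.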

AI QA Platform Setup

QA AUTOMATION PLATFORMS
=========================

Platform 1 — Built-in Help Desk QA:
  → Zendesk QA: Native scoring, AI insights, sentiment analysis
  → Freshdesk Quality: Automated scoring, keyword detection
  → Intercom Quality: AI-powered interaction review
  → Pricing: Included in enterprise plans ($80–$150/agent/month)
  → Pros: No separate platform, native integration
  → Cons: Limited customization, platform-specific features

Platform 2 — Dedicated QA Platforms:
  → Guru: Knowledge + QA in one platform, AI scoring
  → MaestroQA: Conversational intelligence, AI QA, coaching
  → NICE CXone Quality Management: Enterprise-grade QA
  → Crisp: Real-time conversation analytics, AI scoring
  → Pricing: $20–$100/user/month
  → Pros: Advanced features, multi-channel, calibration tools
  → Cons: Additional platform to manage

Platform 3 — Custom AI QA:
  → Build: LLM-based scoring (GPT-4, Claude, open-source)
  → Pipeline: Transcribe → Analyze → Score → Recommend
  → Pricing: Model API costs ($0.01–$0.05 per interaction)
  → Pros: Fully customizable, scalable, integrated with any system
  → Cons: Development effort, ongoing maintenance

PLATFORM SELECTION:
  ════════════════════════════════════════════════════════════════════════
  Team Size      | Recommended Platform         | Why
  ════════════════════════════════════════════════════════════════════════
  < 20 agents    | Built-in help desk QA        | Simple, included
  20–100 agents  | Dedicated QA platform        | Advanced features
  > 100 agents   | Custom AI QA                 | Scale, customization
  ════════════════════════════════════════════════════════════════════════
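
For the custom option (Platform 3), the Transcribe → Analyze → Score → Recommend pipeline could be wired together roughly as below. Every name here is a hypothetical placeholder: `transcribe` assumes the text is already available, and `analyze` uses a toy keyword check where a real build would call an ASR service and an LLM.

```python
from dataclasses import dataclass, field


@dataclass
class Interaction:
    interaction_id: str
    channel: str       # "chat", "email", or "phone"
    raw_content: str   # message text, or an already-transcribed call in this sketch


@dataclass
class QAResult:
    interaction_id: str
    scores: dict[str, int] = field(default_factory=dict)
    recommendations: list[str] = field(default_factory=list)


def transcribe(interaction: Interaction) -> str:
    # Placeholder: a real pipeline would send phone audio to an ASR service here.
    return interaction.raw_content.strip()


def analyze(transcript: str) -> dict[str, bool]:
    # Placeholder signals; a real pipeline would prompt an LLM per criterion.
    text = transcript.lower()
    return {
        "shows_empathy": any(p in text for p in ("sorry", "thank you", "understand")),
        "gives_next_steps": "next step" in text,
    }


def score(signals: dict[str, bool]) -> dict[str, int]:
    # Toy mapping of signals onto the 1-5 scale (3 = meets expectations).
    return {
        "Empathy & Tone": 4 if signals["shows_empathy"] else 2,
        "Resolution Quality": 4 if signals["gives_next_steps"] else 3,
    }


def recommend(scores: dict[str, int]) -> list[str]:
    # Scores of 1-2 are the levels the scoring scale marks as needing coaching.
    return [f"Coach on {name}" for name, value in scores.items() if value <= 2]


def run_pipeline(interaction: Interaction) -> QAResult:
    transcript = transcribe(interaction)
    scores = score(analyze(transcript))
    return QAResult(interaction.interaction_id, scores, recommend(scores))


result = run_pipeline(
    Interaction("T-1", "chat", "Your refund is processed. The next step is to wait 3 days.")
)
print(result)
```

Keeping each stage a separate function means the placeholder logic can be swapped for a real service one stage at a time.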

AI Calibration and Training

AI QA CALIBRATION PROCESS
===========================

Phase 1 — Ground Truth Collection:
  → Collect 500–1,000 historically scored interactions (manual QA)
  → Ensure variety: Different agents, channels, issue types, scores
  → Score range: Include 1s through 5s (not just high scores)
  → Quality: Manual scores from calibrated QA team (inter-rater reliability > 0.8)
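
The text does not say which statistic the 0.8 inter-rater reliability threshold refers to. One common choice for two reviewers is Cohen's kappa, sketched below under that assumption; the function name and the sample scores are illustrative only.

```python
from collections import Counter


def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Unweighted Cohen's kappa for two raters scoring the same interactions."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set")
    n = len(rater_a)
    # Fraction of interactions where both raters gave the identical score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's score distribution.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical score throughout
    return (observed - expected) / (1 - expected)


# Made-up scores for ten interactions; the raters differ on one of them.
reviewer_1 = [1, 2, 3, 4, 5, 3, 4, 5, 2, 1]
reviewer_2 = [1, 2, 3, 4, 5, 3, 4, 4, 2, 1]
print(round(cohens_kappa(reviewer_1, reviewer_2), 3))
```

Because the 1–5 scale is ordinal, a weighted kappa that gives partial credit for near-misses may be a better fit. The unweighted form is shown for brevity.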

Phase 2 — Model Training:
  → LLM prompt engineering: Scorecard criteria → AI scoring instructions
  → Few-shot learning: Provide 10–20 examples per score level
  → Iterative refinement: Test → compare to human scores → adjust prompts
  → Channel-specific: Separate models for chat, email, phone
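
As a rough illustration of the prompt-engineering and few-shot steps, the helper below assembles a scoring prompt for one criterion from human-scored examples. It only builds the prompt string; sending it to a model is left out. `ScoredExample`, `build_scoring_prompt`, and the prompt wording are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class ScoredExample:
    transcript: str
    score: int       # human QA score, 1-5
    rationale: str   # why the human reviewer gave that score


def build_scoring_prompt(criterion: str, definition: str,
                         examples: list[ScoredExample], transcript: str) -> str:
    """Turn one scorecard criterion plus few-shot examples into an LLM prompt."""
    lines = [
        f"You are a support QA reviewer. Score the interaction on '{criterion}'.",
        f"Definition: {definition}",
        "Scale: 1 = Critical Failure, 2 = Below Expectations, 3 = Meets Expectations, "
        "4 = Exceeds Expectations, 5 = Exceptional.",
        "",
        "Scored examples:",
    ]
    # Present examples from lowest to highest score so the scale reads in order.
    for ex in sorted(examples, key=lambda e: e.score):
        lines += [f"Transcript: {ex.transcript}", f"Score: {ex.score}",
                  f"Reason: {ex.rationale}", ""]
    lines += [
        "Now score this interaction.",
        f"Transcript: {transcript}",
        "Reply with the score (1-5) and a one-sentence reason.",
    ]
    return "\n".join(lines)


prompt = build_scoring_prompt(
    "Empathy & Tone",
    "Warm, professional, customer-focused language",
    [
        ScoredExample("I'm so sorry for the delay, let me fix this now.", 5, "Warm and proactive"),
        ScoredExample("That's not my department.", 1, "Dismissive"),
    ],
    "Thanks for waiting. I have reset your password.",
)
print(prompt)
```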

Phase 3 — Calibration Testing:
  → Test set: 200 interactions NOT used in training
  → Compare: AI scores vs human scores
  → Agreement rate: Target > 85% (within 1 point on 5-point scale)
  → Disagreement analysis: Where does AI differ? Why?
  → Iteration: Adjust prompts, add examples, re-test

CALIBRATION METRICS:
  ════════════════════════════════════════════════════════════════════════
  Metric                            | Target
  ════════════════════════════════════════════════════════════════════════
  AI-human score agreement          | > 85%
  Mean absolute error (score)       | < 0.5 points
  Precision (flagging low scores)   | > 90%
  Recall (catching all low scores)  | > 85%
  Calibration drift (monthly)       | < 5%
  ════════════════════════════════════════════════════════════════════════
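
The first four metrics in the table can be computed from paired human and AI scores along these lines. This is a sketch under two assumptions that the text does not state outright: "agreement" means the scores differ by at most 1 point (as in Phase 3), and a "low score" is a 1 or 2. The function name and sample data are illustrative.

```python
def calibration_metrics(human: list[int], ai: list[int],
                        low_threshold: int = 2) -> dict[str, float]:
    """Compare AI scores with human scores for the same interactions."""
    if len(human) != len(ai) or not human:
        raise ValueError("human and ai must be the same non-empty length")
    n = len(human)
    pairs = list(zip(human, ai))
    # Agreement: AI score within 1 point of the human score.
    agreement = sum(abs(h - a) <= 1 for h, a in pairs) / n
    mean_abs_error = sum(abs(h - a) for h, a in pairs) / n
    # Low-score flagging, treating the human score as ground truth.
    true_positives = sum(h <= low_threshold and a <= low_threshold for h, a in pairs)
    ai_flagged = sum(a <= low_threshold for a in ai)
    human_low = sum(h <= low_threshold for h in human)
    return {
        "agreement": agreement,
        "mean_abs_error": mean_abs_error,
        # Reported as 0.0 when there is nothing to flag or nothing to catch.
        "precision": true_positives / ai_flagged if ai_flagged else 0.0,
        "recall": true_positives / human_low if human_low else 0.0,
    }


# Made-up scores for eight test-set interactions.
human_scores = [5, 4, 2, 1, 3, 2, 4, 5]
ai_scores = [5, 3, 2, 2, 3, 4, 4, 4]
metrics = calibration_metrics(human_scores, ai_scores)
print(metrics)
```

Calibration drift is not computed here. One option is to track the month-over-month change in `agreement`.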

Phase 4 — Ongoing Recalibration:
  → Monthly: QA team reviews 50 AI-scored interactions
  → Discrepancy threshold: If > 10% disagreement, recalibrate
  → New criteria: Add scorecard items as processes evolve
  → Agent feedback: Agents flag incorrect AI scores → reviewed by QA team
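
The monthly discrepancy check could be automated with a small helper like the one below. It assumes that a "disagreement" means the AI and human scores differ by more than 1 point, consistent with the Phase 3 agreement definition. The function name and defaults are hypothetical.

```python
def needs_recalibration(human: list[int], ai: list[int],
                        max_disagreement: float = 0.10, tolerance: int = 1) -> bool:
    """Return True when the share of disagreeing scores exceeds the threshold."""
    if len(human) != len(ai) or not human:
        raise ValueError("human and ai must be the same non-empty length")
    disagreements = sum(abs(h - a) > tolerance for h, a in zip(human, ai))
    return disagreements / len(human) > max_disagreement


# Monthly sample of 50 reviewed interactions (made-up scores).
human_sample = [3] * 50
ai_sample = [3] * 44 + [5] * 6   # 6 of 50 differ by more than 1 point = 12%
print(needs_recalibration(human_sample, ai_sample))
```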

QA Reporting and Coaching

QA REPORTING DASHBOARD
========================

Agent-Level Reports:
  ════════════════════════════════════════════════════════════════════════
  Agent Name    | Avg QA | Interactions | Top Strength     | Top Area for Growth
  ════════════════════════════════════════════════════════════════════════
  Sarah K.      | 92%    | 342          | Empathy (4.8)    | Proactivity (3.6)
  Michael T.    | 85%    | 298          | Knowledge (4.5)  | Process (3.9)
  Lisa R.       | 78%    | 315          | Resolution (4.2) | Tone (3.5)
  James W.      | 91%    | 280          | Clarity (4.7)    | Compliance (3.8)
  ════════════════════════════════════════════════════════════════════════

Team-Level Reports:
  ════════════════════════════════════════════════════════════════════════
  Category           | Team Avg | Trend   | Target | Status
  ════════════════════════════════════════════════════════════════════════
  Empathy & Tone     | 4.3/5    | ↑ +0.1  | 4.5    | 🟡 Close
  Process Compliance | 3.9/5    | → 0.0   | 4.5    | 🔴 Below
  Knowledge Accuracy | 4.4/5    | ↑ +0.2  | 4.5    | 🟢 On Track
  Resolution Quality | 4.2/5    | ↑ +0.1  | 4.5    | 🟡 Close
  Communication      | 4.1/5    | → 0.0   | 4.3    | 🟡 Close
  Proactivity        | 3.6/5    | ↓ -0.1  | 4.0    | 🔴 Declining
  ════════════════════════════════════════════════════════════════════════
  Overall Team QA: 82.6% (Target: 87%)

COACHING RECOMMENDATIONS (AI-generated):
  ════════════════════════════════════════════════════════════════════════
  Agent: Lisa R. | Score: 78% | Priority: HIGH
  
  Areas for Improvement:
  1. Tone (3.5/5): Responses detected as too formal/robotic
     → Coaching: Role-play exercises for conversational tone
     → Examples: Show "good" vs "current" responses
     → Resource: "Writing Naturally" training module
  
  2. Process Compliance (3.7/5): Skipping verification step in 15% of tickets
     → Coaching: Review ticket verification checklist
     → Action: Add mandatory checklist to ticket workflow
     → Follow-up: Review compliance in 2 weeks
  
  Expected improvement: 5–8 QA points with focused coaching
  ════════════════════════════════════════════════════════════════════════
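
One simple way to pick the areas for improvement is to rank an agent's category averages and keep the lowest ones that fall below a target. The sketch below assumes a flat target of 4.0 and a cap of two focus areas. Both values, and the function name, are illustrative choices rather than part of the scorecard.

```python
def coaching_focus(category_avgs: dict[str, float], target: float = 4.0,
                   max_areas: int = 2) -> list[tuple[str, float]]:
    """Return the lowest-scoring categories below target, weakest first."""
    below_target = [(name, avg) for name, avg in category_avgs.items() if avg < target]
    return sorted(below_target, key=lambda item: item[1])[:max_areas]


# Category averages for Lisa R. that appear in the reports above.
lisa = {"Tone": 3.5, "Process Compliance": 3.7, "Resolution Quality": 4.2}
print(coaching_focus(lisa))
```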

Integration Points

Help Desk (Zendesk, Freshdesk, Intercom): Interaction data, transcripts, ticket history
QA Platforms (MaestroQA, Guru, NICE): Automated scoring, calibration, reporting
AI/ML (OpenAI, Anthropic, custom LLMs): NLP analysis, scoring, recommendation generation
HR/Learning (Cornerstone, LinkedIn Learning): Training assignment based on QA gaps
Analytics (Tableau, Power BI): QA dashboards, trend analysis, coaching tracking
Communication (Slack, Teams): Real-time QA alerts, coaching notifications
CRM (Salesforce, HubSpot): Agent performance data, QA score sharing with managers
Data Warehouse (Snowflake, BigQuery): QA data storage, cross-metric analysis

Edge Cases

AI scores too harshly/leniently: Systematic bias in AI scoring
Calibration: Regular comparison against human scores (monthly)
Bias detection: Compare AI scores across agents (is one agent consistently penalized?)
Adjust: Modify prompt weighting if AI over-penalizes specific criteria
Agent appeals: Allow agents to dispute AI scores; QA team reviews
Transparency: Show agents WHY they received a score (AI explanation)
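
The bias-detection check could start from the average gap between AI and human scores per agent, computed on interactions that both have scored. In this sketch the minimum sample size of 5 and the 0.5-point gap threshold are assumptions, as is the function name.

```python
from collections import defaultdict
from statistics import mean


def score_bias_by_agent(records: list[tuple[str, int, int]],
                        min_samples: int = 5, max_gap: float = 0.5) -> dict[str, float]:
    """Flag agents whose AI scores differ systematically from human scores.

    records holds (agent, human_score, ai_score) for double-scored interactions.
    A negative gap means the AI scores that agent lower than humans do.
    """
    gaps = defaultdict(list)
    for agent, human_score, ai_score in records:
        gaps[agent].append(ai_score - human_score)
    return {
        agent: mean(diffs)
        for agent, diffs in gaps.items()
        if len(diffs) >= min_samples and abs(mean(diffs)) > max_gap
    }


# Made-up records: the AI scores agent "A" one point lower than humans every time.
records = (
    [("A", 4, 3)] * 5      # consistent -1 gap, enough samples
    + [("B", 4, 4)] * 5    # no gap
    + [("C", 5, 3)] * 2    # large gap but too few samples to judge
)
flagged = score_bias_by_agent(records)
print(flagged)
```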

Multilingual interactions: AI QA trained on English, scores other languages poorly
Translation: Translate interaction to English → score in English → report in original
Or: Train separate models per language (resource-intensive)
Priority: Support top 3–5 languages first
Quality check: Manual review of non-English AI scores (higher variance expected)

Phone call QA: Transcription errors lead to incorrect scoring
Transcription accuracy: Target > 95% (use high-quality ASR like Deepgram, Google Speech)
Fallback: If transcription confidence < 80%, flag for human review
Noise handling: Background noise reduces accuracy; recommend headset use
Validation: Agent can review and correct transcript before scoring
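
The fallback rule could look like the sketch below, assuming the ASR service returns a confidence value between 0 and 1 for each word and that the transcript-level confidence is their plain average. Both are assumptions, since services report confidence differently, and `route_call` is a hypothetical name.

```python
def route_call(words: list[tuple[str, float]], threshold: float = 0.80) -> str:
    """Decide whether a call transcript goes to AI scoring or human review.

    words holds (word, confidence) pairs from the ASR output.
    """
    if not words:
        return "human_review"  # nothing usable was transcribed
    average_confidence = sum(conf for _, conf in words) / len(words)
    return "human_review" if average_confidence < threshold else "ai_scoring"


print(route_call([("hello", 0.9), ("refund", 0.6)]))   # average 0.75, below the threshold
```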

QA score gaming: Agents optimize for AI score, not actual quality
Multi-signal scoring: Combine AI score with CSAT, resolution rate, re-contact rate
Human override: QA team reviews borderline cases and spot-checks high scorers
Mystery shopping: Periodic fake interactions to test real performance
Scorecard evolution: Regularly update criteria to prevent gaming
Culture: Emphasize coaching over punishment; QA as improvement tool, not weapon

New agent scoring: Insufficient interactions for reliable QA score
Minimum threshold: Require 50 interactions before QA score displayed
Early feedback: AI still provides feedback but score not published
Ramp-up expectation: New agents start at lower expected score (60–70% → 80%+)
Mentor matching: Pair with high-QA-score agent for shadowing
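
The minimum-threshold rule amounts to withholding the aggregate score until enough interactions exist. A minimal sketch, with a hypothetical function name:

```python
from typing import Optional


def published_score(interaction_scores: list[float],
                    min_interactions: int = 50) -> Optional[float]:
    """Return the agent's average QA score, or None while below the threshold."""
    if len(interaction_scores) < min_interactions:
        return None  # per-interaction feedback can still be shown to the agent
    return sum(interaction_scores) / len(interaction_scores)


print(published_score([80.0] * 49))   # None: one interaction short
print(published_score([80.0] * 50))
```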

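Scorecard change impact: After a scorecard update, new scores are not directly comparable with historical ones
Versioning: Scorecard v1, v2, v3; every score is tagged with the version that produced it
Reporting: Show the current score and a trend computed within a single scorecard version, so a scorecard change does not distort it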
Communication: Notify agents of scorecard changes and rationale
Grace period: 2-week transition with both old and new scores displayed
Baseline reset: After major scorecard change, reset targets for 1 month
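
Tagging scores by version makes it straightforward to report averages per scorecard version instead of mixing them. The record layout below, a (version, percent score) pair, is an assumption for illustration.

```python
from collections import defaultdict
from statistics import mean


def average_by_version(records: list[tuple[str, float]]) -> dict[str, float]:
    """Average QA scores separately for each scorecard version."""
    grouped = defaultdict(list)
    for version, percent in records:
        grouped[version].append(percent)
    return {version: mean(scores) for version, scores in grouped.items()}


# Made-up scores spanning a scorecard change.
history = [("v1", 80.0), ("v1", 90.0), ("v2", 70.0)]
print(average_by_version(history))
```

During the grace period, the same interaction can be stored once per version so both scores are displayed side by side.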

Disclaimer: All rights reserved by Circulos AI. These skills are specifically designed for Claude Code, Claude Cowork, Codex, and OpenClaw. When using or referencing any skill, please provide proper attribution to Circulos AI.