AI Customer Service Quality Management
Target Audience: Customer Service Managers, Quality Management Personnel, Customer Service Trainers
1. Quick Start: Three Quality Metrics for AI Customer Service
How to View Evaluation Report Scores?
Path: AgentOps (sidebar) → AI Assistant Monitoring
In the table, you can directly view the three major scoring metrics for each conversation. Click "View" to see complete details.
Why Do We Need Evaluation?
Just like reviewing customer service call recordings, we also need to check the quality of AI responses. The system automatically scores each conversation, helping you quickly identify issues.
Three Core Metrics
Faithfulness Score
Whether the information provided by AI is correct, whether it fabricates or makes things up
85 points or above ✅ 60-84 points ⚠️ Below 60 points ❌
Answer Relevancy Score
Whether AI answers the customer's actual question
85 points or above ✅ 60-84 points ⚠️ Below 60 points ❌
Context Precision Score
Whether AI finds the right reference materials, whether it's precise regarding context
85 points or above ✅ 60-84 points ⚠️ Below 60 points ❌
Simple Assessment Method
All three metrics > 80 points → ✅ This response is excellent
Any metric < 60 points → ❌ Requires immediate improvement
Two or more metrics < 70 points → ⚠️ Systemic issue, needs comprehensive review (see the scoring sketch below)
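If your team copies these scores into a spreadsheet or a small script, the three rules above can be written down directly. The sketch below is illustrative only; the assess function, the 0-100 integer inputs, and the fallback "keep monitoring" label are assumptions for this example, not part of the platform.

def assess(faithfulness: int, relevancy: int, precision: int) -> str:
    # Apply the "Simple Assessment Method" rules to one conversation's scores (0-100).
    scores = [faithfulness, relevancy, precision]
    if any(s < 60 for s in scores):
        return "❌ Requires immediate improvement"
    if sum(s < 70 for s in scores) >= 2:
        return "⚠️ Systemic issue, needs comprehensive review"
    if all(s > 80 for s in scores):
        return "✅ This response is excellent"
    # Scores in between are not explicitly classified by this guide.
    return "⚠️ Acceptable, keep monitoring"

print(assess(90, 95, 88))  # ✅ This response is excellent
print(assess(38, 95, 85))  # ❌ Requires immediate improvement (the Case 1 scores below)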
2. How to Understand Evaluation Reports
Report Example
Three Common Problem Types
Problem A: Low Faithfulness Score (< 60 points)
Symptoms: Information provided by AI is incorrect or fabricated
Common Causes:
Reference materials are outdated (prices, inventory, policies have been updated)
Conflicting data (different documents say different things)
AI "guesses" answers without relying on database content
Impact: Customers may receive incorrect information, leading to complaints
Problem B: Low Answer Relevancy Score (< 60 points)
Symptoms: AI doesn't answer what the customer actually wants to know
Common Causes:
AI provides lengthy responses but misses the point
Answers are irrelevant, discussing unrelated content
Only explains background without providing actual answers
Impact: Customers need to ask again, reducing satisfaction
Problem C: Low Context Precision Score (< 60 points)
Symptoms: AI finds wrong reference materials or is not precise enough
Common Causes:
Inaccurate keyword searches
Confusion between different products/categories of data
Unclear titles or categories in reference materials
Impact: Even if the AI phrases its answer well, it is working from the wrong data, so the answer is still wrong
3. Real-World Cases: Common Issues in Fashion Retail
Case 1: Incorrect Pricing (Low Faithfulness Score)
📊 Problem Discovery
Customer Question: "How much is this down jacket?"
AI Response: "This down jacket is priced at NT$ 3,990."
Evaluation Results:
Faithfulness Score: 38 points ❌
Answer Relevancy Score: 95 points ✅
Context Precision Score: 85 points ✅
Actual Situation: The product has been repriced to NT$ 2,990 (after discount), but AI still provided the original price.
💡 Improvement Solution
Step 1: Check Data
Open "Product Price List" in knowledge base
Discovered prices were not updated
Step 2: Update Data
❌ Problem Data
✅ Corrected Data
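The original before/after tables are not reproduced here. Purely as an illustration based on this case, the knowledge-base entry would change from something like "Down jacket: NT$ 3,990" (the outdated original price) to "Down jacket: original price NT$ 3,990, discounted price NT$ 2,990", so the AI quotes the current price.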
Step 3: Remind AI
Add to "AI Response Principles":
"When answering about prices, always confirm whether it's the latest price and specify the applicable period. If there's a discount promotion, please inform both original price and discounted price."
Case 2: Irrelevant Answer (Low Answer Relevancy Score)
📊 Problem Discovery
Customer Question: "If it doesn't fit, can I return it?"
AI Response: "We provide multiple size options, including S, M, L, XL. Each garment's size chart is indicated on the product page. We recommend referring to the size chart for measurements before purchasing."
Evaluation Results:
Faithfulness Score: 90 points ✅
Answer Relevancy Score: 35 points ❌
Context Precision Score: 60 points ⚠️
Problem Diagnosis: Customer asked "can I return it," but AI is explaining "how to choose sizes," completely missing the core question.
💡 Improvement Solution
Step 1: Adjust Response Principles
Add to "AI Response Principles":
Core Principle: Answer the customer's core question first
Wrong Example:
Customer asks "Can I return it?"
AI answers "We have comprehensive after-sales service..." ❌ (Doesn't say yes or no)
Correct Example:
Customer asks "Can I return it?"
AI answers "Yes. Within 7 days of receiving the product, if unused, you can apply for return..." ✅ (Direct answer)
Step 2: Reorganize Data Structure
❌ Problem Data (Too scattered)
✅ Corrected Data
Improvement Highlights:
✅ Directly answer "yes" or "no" at the beginning
✅ Use Q&A format for clarity (see the example below)
✅ List clear conditions to avoid disputes
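As an illustration of these points (the wording is drawn from this case, not from your actual policy), a corrected Q&A-style entry might read:
Q: If it doesn't fit, can I return it?
A: Yes. Within 7 days of receiving the product, if it is unused, you can apply for a return.
Conditions: product unused; request submitted within 7 days of receipt.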
Case 3: Wrong Product Found (Low Context Precision Score)
📊 Problem Discovery
Customer Question: "What is the material of the black knit top?"
AI Response: "This knit top is made of 100% pure cotton, soft and comfortable, suitable for all seasons."
Evaluation Results:
Faithfulness Score: 88 points ✅
Answer Relevancy Score: 90 points ✅
Context Precision Score: 48 points ❌
Actual Situation: The black knit top is 70% wool + 30% polyester, but AI found data for the "white knit top" (100% pure cotton).
💡 Improvement Solution
Step 1: Check Data Labels
Problem data file name:
Issue: All knit tops are in one document, making it difficult for AI to distinguish.
Step 2: Improve Data Structure
✅ Solution A: Separate Files
✅ Solution B: Clear Titles
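The original examples are not shown here. As an illustration (the file names and titles below are hypothetical), Solution A would keep one file per product variant, such as "Knit Top - Black" (70% wool, 30% polyester) and "Knit Top - White" (100% pure cotton), while Solution B would keep a single file but give each section an unambiguous title, such as "[Black Knit Top] Material: 70% wool, 30% polyester" and "[White Knit Top] Material: 100% pure cotton".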
Step 3: Remind AI
Add to "AI Response Principles":
"When customers mention product color or model number, always confirm that the reference material corresponds to that specific color or model. Different colors of the same product may have different materials and specifications."
4. Three-Step Improvement Plan
When problems are identified, follow this process:
Step 1: Update Data Content
Applicable Situations:
✅ Low faithfulness score (incorrect or outdated data)
✅ Low context precision score (disorganized data, unclear labels)
Checklist:
Data Quality Examples:
❌ Poor Data
✅ Good Data
Step 2: Adjust AI Response Principles
Applicable Situations:
✅ Low answer relevancy score (irrelevant answers)
✅ Low faithfulness score (AI guessing, fabricating)
AI Response Principles Template:
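The original template is not reproduced here. As a starting point, a template could simply collect the principles added in the three cases above:
1. Answer the customer's core question first; say "yes" or "no" before adding details.
2. When answering about prices, confirm it is the latest price and state the applicable period; if there is a promotion, give both the original and discounted prices.
3. When a customer mentions a color or model number, confirm the reference material corresponds to that exact color or model.
4. If the knowledge base does not contain the answer, say so instead of guessing.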
Step 3: Report to Technical Team
Applicable Situations:
Context precision score consistently low
Same problem recurring
No improvement after adjusting data and principles
Report Content:
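The original checklist is not reproduced here. As a suggestion, a report to the technical team could include at least: the conversation link or ID and when it occurred, the customer question and the AI response, the three scores from the evaluation report, and the data or response-principle changes you have already tried together with their results.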
5. Daily Management Checklist
Daily Inspection
When problems are discovered:
Response Quality Tracking
1. Data Review
2. Problem Analysis
3. Improvement Actions
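For the daily inspection, if your team exports conversation scores for offline review (the CSV file name and column names below are assumptions for illustration; the platform itself is viewed through AgentOps), a short script can flag the conversations that need attention each day:

import csv

# Hypothetical export with columns: conversation_id, faithfulness, relevancy, precision
with open("daily_scores.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        scores = [int(row["faithfulness"]), int(row["relevancy"]), int(row["precision"])]
        if any(score < 60 for score in scores):
            print(f"Review today: conversation {row['conversation_id']}, scores {scores}")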
Appendix A: Problem Diagnosis Quick Reference
Low Faithfulness Score → Likely cause: outdated or incorrect data, AI fabrication → Action: Step 1 (update data content)
Low Answer Relevancy Score → Likely cause: AI provides irrelevant answers → Action: Step 2 (adjust response principles)
Low Context Precision Score → Likely cause: AI finds wrong or imprecise data → Action: Step 1 (improve data labels)
Multiple low metrics → Likely cause: systemic issue → Action: Steps 1+2, and Step 3 if necessary
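For teams that script their review workflow, the quick reference above can also be kept as a small lookup table; the metric keys below are illustrative only.

# Illustrative mapping from a low metric to the improvement step to try first.
NEXT_STEP = {
    "faithfulness": "Step 1: Update data content",
    "answer_relevancy": "Step 2: Adjust response principles",
    "context_precision": "Step 1: Improve data labels",
}
# If two or more metrics are low, apply Steps 1 and 2 together and escalate to Step 3 if needed.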
Improvement Priority
Appendix B: System Evaluation Metrics Reference Table
Primary Metrics (No Standard Answer Required)
These three metrics are the core of this guide and can be directly applied to daily customer service conversation evaluation:
Faithfulness Score (Faithfulness): Evaluates whether AI responses align with database content and whether the AI fabricates or makes up information
Answer Relevancy Score (Answer Relevancy): Evaluates whether AI responses are relevant to customer questions and whether answers go off-topic
Context Precision Score (Context Precision): Evaluates whether AI responses use the retrieved context precisely and whether the correct reference materials are found
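Under the hood, scores like these can be produced with an evaluation library such as DeepEval, mentioned later in this appendix. The sketch below assumes a plain DeepEval setup with an LLM judge configured (for example an OpenAI API key); your deployment may wrap this differently. Only faithfulness and answer relevancy are shown, because DeepEval's own contextual precision metric additionally expects a reference answer, whereas the platform's context precision score does not.

# Minimal DeepEval sketch; scores are 0-1 and can be multiplied by 100
# to match the 0-100 scale used in the reports above.
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric

test_case = LLMTestCase(
    input="How much is this down jacket?",                           # customer question
    actual_output="This down jacket is priced at NT$ 3,990.",        # AI response (Case 1)
    retrieval_context=["Down jacket: discounted price NT$ 2,990."],  # what the AI retrieved
)

for metric in (FaithfulnessMetric(threshold=0.6), AnswerRelevancyMetric(threshold=0.6)):
    metric.measure(test_case)
    print(type(metric).__name__, round(metric.score * 100), metric.reason)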
Advanced Metrics (Standard Answer Required)
The following metrics require prepared "ground truth" standard answers, suitable for test case evaluation:
Answer Correctness: Compares the AI response with the standard answer and evaluates its correctness
Answer Similarity: Evaluates the semantic similarity between the AI response and the standard answer
Context Recall: Evaluates whether the system retrieves all necessary reference materials
Other Available Metrics (DeepEval)
The system also supports the following additional evaluation metrics for more comprehensive quality inspection:
Bias Detection (Bias): Detects whether responses contain biased or discriminatory content
Toxicity Detection (Toxicity): Detects whether responses contain inappropriate or offensive content
Hallucination Detection (Hallucination): Detects whether the AI generates content inconsistent with the facts
Contextual Relevancy: Evaluates whether the retrieved reference materials are relevant to the question
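To enable the bias and toxicity checks referred to in the usage recommendations below, a DeepEval-style sketch looks similar. Again, this assumes an LLM judge is configured, and your deployment may expose these checks as settings rather than code.

from deepeval.test_case import LLMTestCase
from deepeval.metrics import BiasMetric, ToxicityMetric

test_case = LLMTestCase(
    input="If it doesn't fit, can I return it?",
    actual_output="Yes. Within 7 days of receiving the product, if unused, you can apply for return.",
)

for metric in (BiasMetric(threshold=0.5), ToxicityMetric(threshold=0.5)):
    metric.measure(test_case)
    # For these two metrics a lower score is better (0 means no bias / no toxicity detected).
    print(type(metric).__name__, metric.score, metric.is_successful())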
Usage Recommendations
Daily Monitoring: Use three primary metrics (Faithfulness Score, Answer Relevancy Score, Context Precision Score)
Test Evaluation: Combine with advanced metrics, prepare standard answers for systematic evaluation
Quality Control: Enable bias and toxicity detection to ensure responses comply with corporate standards
FAQ
Q1: I'm not technical, can I manage AI customer service? A: Yes! Just like managing customer service staff, you only need to:
Review evaluation reports daily, identify problem conversations
Check that data is correct and complete
Adjust AI "response principles" (like training service scripts)
Q2: How are scores generated? Does AI evaluate itself? A: No. Scoring is automatically performed by a specialized "evaluation system," like having another AI acting as "quality control" to check the first AI's responses.
Q3: Are all three metrics important? Can I just look at one? A: We recommend reviewing all three because they reflect different issues:
Faithfulness Score: Whether AI aligns with database content, whether it fabricates
Answer Relevancy Score: Whether AI understands the question, whether response is relevant
Context Precision Score: Whether AI is precise regarding context, finds correct reference materials
If you only look at one, you may miss important issues.
Q4: How soon will I see results after improvements? A:
Data updates: Immediate effect (improvements visible same day)
Response principle adjustments: Immediate effect
Technical adjustments: Requires 2-4 weeks (depending on problem complexity)
Conclusion
Managing AI customer service is like managing a human customer service team:
✅ Regular quality checks (review evaluation reports)
✅ Continuous knowledge updates (update data content)
✅ Optimize service scripts (adjust response principles)
✅ Track improvement results (monitor score changes)
By following this guide, even without technical knowledge, you can make AI customer service better and better!