AI Customer Service Quality Management

Target Audience: Customer Service Managers, Quality Management Personnel, Customer Service Trainers

1. Quick Start: Three Quality Metrics for AI Customer Service

How to View Evaluation Report Scores?

Path: AgentOps (sidebar) → AI Assistant Monitoring

In the table, you can directly view the three major scoring metrics for each conversation. Click "View" to see complete details.

Why Do We Need Evaluation?

Just like reviewing customer service call recordings, we also need to check the quality of AI responses. The system automatically scores each conversation, helping you quickly identify issues.


Three Core Metrics

| Metric Name | Plain Language Explanation | Scoring Standards |
| --- | --- | --- |
| Faithfulness Score | Whether the information the AI provides is correct, and whether it fabricates or makes things up | Above 85 points ✅ / 60-84 points ⚠️ / Below 60 points ❌ |
| Answer Relevancy Score | Whether the AI answers the customer's actual question | Above 85 points ✅ / 60-84 points ⚠️ / Below 60 points ❌ |
| Context Precision Score | Whether the AI finds the right reference materials and is precise regarding context | Above 85 points ✅ / 60-84 points ⚠️ / Below 60 points ❌ |


Simple Assessment Method

All three metrics > 80 points → ✅ This response is excellent
Any metric < 60 points → ❌ Requires immediate improvement
Two or more metrics < 70 points → ⚠️ Systemic issue; needs a comprehensive review
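
If someone on your team tracks scores in a spreadsheet or script, the same rule can be applied automatically. The sketch below is illustrative only: the function name and the neutral fallback label are additions of this guide's authors' rule, not product features, and it assumes scores on the 0-100 scale used above.

```python
def assess_conversation(faithfulness: int, relevancy: int, precision: int) -> str:
    """Apply the simple assessment rule to one conversation's three scores (0-100 scale)."""
    scores = [faithfulness, relevancy, precision]
    if any(s < 60 for s in scores):
        return "❌ Requires immediate improvement"
    if sum(1 for s in scores if s < 70) >= 2:
        return "⚠️ Systemic issue, needs comprehensive review"
    if all(s > 80 for s in scores):
        return "✅ This response is excellent"
    # Scores that match none of the three rules fall through to a neutral
    # label (this in-between case is not defined by the rule above).
    return "Acceptable, keep monitoring"


# Example: the down jacket pricing case from Section 3
print(assess_conversation(38, 95, 85))  # ❌ Requires immediate improvement
```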

2. How to Understand Evaluation Reports

Report Example


Three Common Problem Types

Problem A: Low Faithfulness Score (< 60 points)

Symptoms: Information provided by AI is incorrect or fabricated

Common Causes:

  • Reference materials are outdated (prices, inventory, policies have been updated)

  • Conflicting data (different documents say different things)

  • AI "guesses" answers without relying on database content

Impact: Customers may receive incorrect information, leading to complaints


Problem B: Low Answer Relevancy Score (< 60 points)

Symptoms: AI doesn't answer what the customer actually wants to know

Common Causes:

  • AI provides lengthy responses but misses the point

  • Answers are irrelevant, discussing unrelated content

  • Only explains background without providing actual answers

Impact: Customers need to ask again, reducing satisfaction


Problem C: Low Context Precision Score (< 60 points)

Symptoms: AI finds wrong reference materials or is not precise enough

Common Causes:

  • Inaccurate keyword searches

  • Confusion between different products/categories of data

  • Unclear titles or categories in reference materials

Impact: Even when the AI tries to answer well, using the wrong data leads to wrong answers


3. Real-World Cases: Common Issues in Fashion Retail

Case 1: Incorrect Pricing (Low Faithfulness Score)

📊 Problem Discovery

Customer Question: "How much is this down jacket?"

AI Response: "This down jacket is priced at NT$ 3,990."

Evaluation Results:

  • Faithfulness Score: 38 points

  • Answer Relevancy Score: 95 points ✅

  • Context Precision Score: 85 points ✅

Actual Situation: The product has been repriced to NT$ 2,990 (after discount), but AI still provided the original price.


💡 Improvement Solution

Step 1: Check Data

  • Open "Product Price List" in knowledge base

  • Discovered prices were not updated

Step 2: Update Data

Problem Data: "Down Jacket, Price: NT$ 3,990" (the pre-discount price)

Corrected Data: "Down Jacket, Original Price: NT$ 3,990; Discounted Price: NT$ 2,990 (current)"
Step 3: Remind AI

Add to "AI Response Principles":

"When answering about prices, always confirm whether it's the latest price and specify the applicable period. If there's a discount promotion, please inform both original price and discounted price."


Case 2: Irrelevant Answer (Low Answer Relevancy Score)

📊 Problem Discovery

Customer Question: "If it doesn't fit, can I return it?"

AI Response: "We provide multiple size options, including S, M, L, XL. Each garment's size chart is indicated on the product page. We recommend referring to the size chart for measurements before purchasing."

Evaluation Results:

  • Faithfulness Score: 90 points ✅

  • Answer Relevancy Score: 35 points

  • Context Precision Score: 60 points ⚠️

Problem Diagnosis: Customer asked "can I return it," but AI is explaining "how to choose sizes," completely missing the core question.


💡 Improvement Solution

Step 1: Adjust Response Principles

Add to "AI Response Principles":

Core Principle: Answer the customer's core question first

Wrong Example:

  • Customer asks "Can I return it?"

  • AI answers "We have comprehensive after-sales service..." ❌ (Doesn't say yes or no)

Correct Example:

  • Customer asks "Can I return it?"

  • AI answers "Yes. Within 7 days of receiving the product, if unused, you can apply for return..." ✅ (Direct answer)

Step 2: Reorganize Data Structure

Problem Data (Too scattered)

Corrected Data

Improvement Highlights:

  • ✅ Directly answer "yes" or "no" at the beginning

  • ✅ Use Q&A format for clarity

  • ✅ List clear conditions to avoid disputes


Case 3: Wrong Product Found (Low Context Precision Score)

📊 Problem Discovery

Customer Question: "What is the material of the black knit top?"

AI Response: "This knit top is made of 100% pure cotton, soft and comfortable, suitable for all seasons."

Evaluation Results:

  • Faithfulness Score: 88 points ✅

  • Answer Relevancy Score: 90 points ✅

  • Context Precision Score: 48 points

Actual Situation: The black knit top is 70% wool + 30% polyester, but AI found data for the "white knit top" (100% pure cotton).


💡 Improvement Solution

Step 1: Check Data Labels

Problem data file name:

Issue: All knit tops are in one document, making it difficult for AI to distinguish.

Step 2: Improve Data Structure

Solution A: Separate Files

Solution B: Clear Titles

Step 3: Remind AI

Add to "AI Response Principles":

"When customers mention product color or model number, always confirm that the reference material corresponds to that specific color or model. Different colors of the same product may have different materials and specifications."


4. Three-Step Improvement Plan

When problems are identified, follow this process:


Step 1: Update Data Content

Applicable Situations:

  • ✅ Low faithfulness score (incorrect or outdated data)

  • ✅ Low context precision score (disorganized data, unclear labels)

Checklist:

Data Quality Examples:

Poor Data

Good Data


Step 2: Adjust AI Response Principles

Applicable Situations:

  • ✅ Low answer relevancy score (irrelevant answers)

  • ✅ Low faithfulness score (AI guessing, fabricating)

AI Response Principles Template:


Step 3: Report to Technical Team

Applicable Situations:

  • Context precision score consistently low

  • Same problem recurring

  • No improvement after adjusting data and principles

Report Content:


5. Daily Management Checklist

Daily Inspection

When problems are discovered:


Response Quality Tracking

1. Data Review

2. Problem Analysis

3. Improvement Actions


Appendix A: Problem Diagnosis Quick Reference

| Score Situation | Possible Cause | Improvement Method |
| --- | --- | --- |
| Low Faithfulness Score | Outdated or incorrect data; AI fabrication | Step 1: Update data content |
| Low Answer Relevancy Score | AI provides irrelevant answers | Step 2: Adjust response principles |
| Low Context Precision Score | AI finds wrong or imprecise data | Step 1: Improve data labels |
| Multiple low metrics | Systemic issue | Step 1 + 2; Step 3 if necessary |
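
For teams that script their daily review, this quick reference can also be encoded as a simple lookup. The helper below is hypothetical (the dictionary keys and function name are not product terms); it reuses the "Below 60 points ❌" standard from Section 1 as the default threshold.

```python
# Improvement methods from the quick-reference table above.
ACTIONS = {
    "faithfulness": "Step 1: Update data content",
    "answer_relevancy": "Step 2: Adjust response principles",
    "context_precision": "Step 1: Improve data labels",
}


def suggest_actions(scores: dict, threshold: int = 60) -> list:
    """Return the improvement method for every metric scoring below the threshold."""
    low = [name for name, score in scores.items() if score < threshold]
    actions = [ACTIONS[name] for name in low]
    if len(low) >= 2:  # multiple low metrics point to a systemic issue
        actions.append("Step 1 + 2; Step 3 (report to technical team) if necessary")
    return actions


# Example: the black knit top case from Section 3
print(suggest_actions({"faithfulness": 88, "answer_relevancy": 90, "context_precision": 48}))
# ['Step 1: Improve data labels']
```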


Improvement Priority


Appendix B: System Evaluation Metrics Reference Table

Primary Metrics (No Standard Answer Required)

These three metrics are the core of this guide and can be directly applied to daily customer service conversation evaluation:

| Metric Name | English Full Name | Description |
| --- | --- | --- |
| Faithfulness Score | Faithfulness | Evaluates whether AI responses align with database content and whether the AI fabricates or makes up information |
| Answer Relevancy Score | Answer Relevancy | Evaluates whether AI responses are relevant to customer questions and whether answers go off-topic |
| Context Precision Score | Context Precision | Evaluates whether AI responses are precise regarding context and whether the correct reference materials are found |
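
The guide does not specify which engine computes these three scores. For teams that want to run a comparable check themselves, the sketch below uses the open-source DeepEval library (mentioned later in this appendix), whose reference-free FaithfulnessMetric and AnswerRelevancyMetric correspond to the first two scores; DeepEval's own contextual precision metric additionally expects a standard answer, so it is left out here. The example values are illustrative, DeepEval reports scores from 0 to 1, and an LLM API key (for example OPENAI_API_KEY) must be configured.

```python
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# One customer conversation, reusing Case 1 from Section 3 (values are illustrative).
case = LLMTestCase(
    input="How much is this down jacket?",
    actual_output="This down jacket is priced at NT$ 3,990.",
    retrieval_context=["Down Jacket, Original Price: NT$ 3,990; Discounted Price: NT$ 2,990 (current)"],
)

for metric in (FaithfulnessMetric(threshold=0.6), AnswerRelevancyMetric(threshold=0.6)):
    metric.measure(case)
    # DeepEval scores range from 0 to 1; multiply by 100 to compare with the 0-100 scale in this guide.
    print(type(metric).__name__, round(metric.score * 100))
```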

Advanced Metrics (Standard Answer Required)

The following metrics require a prepared "ground truth" standard answer and are suited to test-case evaluation:

| Metric Name | English Full Name | Description |
| --- | --- | --- |
| Answer Correctness | Answer Correctness | Compares the AI response with the standard answer and evaluates its correctness |
| Answer Similarity | Answer Similarity | Evaluates the semantic similarity between the AI response and the standard answer |
| Context Recall | Context Recall | Evaluates whether the system retrieves all necessary reference materials |
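
The guide does not name the library behind these ground-truth metrics; their names match the open-source RAGAS package, so the sketch below assumes a RAGAS 0.1.x-style API (column and metric names differ between versions) and an OpenAI key in the environment for the judging model. Treat it as an illustration of evaluating against a standard answer, not as the product's actual implementation.

```python
from datasets import Dataset          # pip install datasets
from ragas import evaluate            # pip install ragas  (0.1.x-style API assumed)
from ragas.metrics import answer_correctness, answer_similarity, context_recall

# One test case with a prepared "ground truth" standard answer (values are illustrative).
dataset = Dataset.from_dict({
    "question": ["If it doesn't fit, can I return it?"],
    "answer": ["Yes. Within 7 days of receiving the product, if unused, you can apply for a return."],
    "contexts": [["Return policy: unused items can be returned within 7 days of receipt."]],
    "ground_truth": ["Yes, unused items can be returned within 7 days of receipt."],
})

result = evaluate(dataset, metrics=[answer_correctness, answer_similarity, context_recall])
print(result)  # per-metric scores between 0 and 1
```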

Other Available Metrics (DeepEval)

The system also supports the following additional evaluation metrics for more comprehensive quality inspection:

| Metric Name | English Name | Description |
| --- | --- | --- |
| Bias Detection | Bias | Detects whether responses contain biased or discriminatory content |
| Toxicity Detection | Toxicity | Detects whether responses contain inappropriate or offensive content |
| Hallucination Detection | Hallucination | Detects whether the AI generates content inconsistent with the facts |
| Contextual Relevancy | Contextual Relevancy | Evaluates whether retrieved reference materials are relevant to the question |
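
If your team wants to switch on these extra checks in a test script, a minimal DeepEval sketch might look like the following. It assumes DeepEval is installed and an LLM API key is configured; in DeepEval, lower Bias and Toxicity scores are better and the threshold is the maximum allowed.

```python
from deepeval import evaluate
from deepeval.metrics import BiasMetric, ToxicityMetric
from deepeval.test_case import LLMTestCase

# A single customer service reply to screen for bias and toxicity (values are illustrative).
case = LLMTestCase(
    input="If it doesn't fit, can I return it?",
    actual_output="Yes. Within 7 days of receiving the product, if unused, you can apply for a return.",
)

# Bias and Toxicity only need the question and the AI's reply; no reference materials are required.
evaluate(test_cases=[case], metrics=[BiasMetric(threshold=0.5), ToxicityMetric(threshold=0.5)])
```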

Usage Recommendations

  1. Daily Monitoring: Use three primary metrics (Faithfulness Score, Answer Relevancy Score, Context Precision Score)

  2. Test Evaluation: Combine with advanced metrics, prepare standard answers for systematic evaluation

  3. Quality Control: Enable bias and toxicity detection to ensure responses comply with corporate standards


FAQ

Q1: I'm not technical. Can I still manage AI customer service?
A: Yes! Just like managing customer service staff, you only need to:

  • Review evaluation reports daily, identify problem conversations

  • Check that data is correct and complete

  • Adjust AI "response principles" (like training service scripts)


Q2: How are scores generated? Does the AI evaluate itself?
A: No. Scoring is performed automatically by a specialized "evaluation system", like having another AI act as "quality control" to check the first AI's responses.


Q3: Are all three metrics important? Can I just look at one?
A: We recommend reviewing all three because they reflect different issues:

  • Faithfulness Score: Whether AI aligns with database content, whether it fabricates

  • Answer Relevancy Score: Whether AI understands the question, whether response is relevant

  • Context Precision Score: Whether AI is precise regarding context, finds correct reference materials

If you only look at one, you may miss important issues.


Q4: How soon will I see results after improvements?
A:

  • Data updates: Immediate effect (improvements visible same day)

  • Response principle adjustments: Immediate effect

  • Technical adjustments: Requires 2-4 weeks (depending on problem complexity)


Conclusion

Managing AI customer service is like managing a human customer service team:

✅ Regular quality checks (review evaluation reports)
✅ Continuous knowledge updates (update data content)
✅ Optimize service scripts (adjust response principles)
✅ Track improvement results (monitor score changes)

By following this guide, even without technical knowledge, you can make AI customer service better and better!
