# AI Customer Service Quality Management

> **Target Audience**: Customer Service Managers, Quality Management Personnel, Customer Service Trainers

## 1. Quick Start: Three Quality Metrics for AI Customer Service

#### How to View Evaluation Report Scores?

**Path**: AgentOps (sidebar) → AI Assistant Monitoring

The table shows the three core metric scores for each conversation. Click "View" to see the full details of a conversation.

#### Why Do We Need Evaluation?

Just like reviewing customer service call recordings, we also need to check the quality of AI responses.\
The system automatically scores each conversation, helping you quickly identify issues.

***

#### Three Core Metrics

| Metric Name                 | Plain Language Explanation                                                                                  | Scoring Standards                                              |
| --------------------------- | ----------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------- |
| **Faithfulness Score**      | Whether the information the AI provides is correct and based on the reference materials, not fabricated     | 85 points or above ✅<br>60-84 points ⚠️<br>Below 60 points ❌ |
| **Answer Relevancy Score**  | Whether the AI answers the customer's actual question                                                       | 85 points or above ✅<br>60-84 points ⚠️<br>Below 60 points ❌ |
| **Context Precision Score** | Whether the AI finds the right reference materials for the question, without pulling in unrelated ones      | 85 points or above ✅<br>60-84 points ⚠️<br>Below 60 points ❌ |
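
For teams that export scores into their own spreadsheets or scripts, the banding in the table can be sketched in Python as follows. The function name `score_band` and the band labels are illustrative and not part of the product, and treating a score of exactly 85 as passing is an assumption.

```python
def score_band(score: float) -> str:
    """Map a 0-100 metric score to the band used in the scoring table."""
    if score >= 85:
        return "pass"      # shown as ✅
    if score >= 60:
        return "warning"   # shown as ⚠️
    return "fail"          # shown as ❌


for value in (90, 70, 45):
    print(value, score_band(value))
```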

***

#### Simple Assessment Method

```
All three metrics > 80 points  → ✅ This response is excellent
Any metric < 60 points  → ❌ Requires immediate improvement
Two or more < 70 points → ⚠️ Systemic issue, needs comprehensive review
```
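
The three rules above can overlap (a conversation may have one metric below 60 and two below 70), and the guide does not say which takes precedence. The sketch below therefore returns every rule that applies instead of picking one. The function name `assess` and the dictionary keys are hypothetical.

```python
def assess(scores: dict[str, float]) -> list[str]:
    """Return every assessment rule that the three metric scores trigger."""
    values = list(scores.values())
    findings = []
    if any(v < 60 for v in values):
        findings.append("requires immediate improvement")
    if sum(v < 70 for v in values) >= 2:
        findings.append("systemic issue, needs comprehensive review")
    if all(v > 80 for v in values):
        findings.append("excellent")
    return findings  # an empty list means none of the three rules applies


example = {"faithfulness": 45, "answer_relevancy": 90, "context_precision": 70}
print(assess(example))
```

An empty result means the conversation falls between the rules; it is neither excellent nor flagged.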

***

## 2. How to Understand Evaluation Reports

#### Report Example

```
Conversation ID: #20240120-001
Customer Question: "Is the black trench coat in XL size still in stock?"
AI Response: "The black trench coat is currently in stock, XL size can be ordered."

Evaluation Results:
├─ Faithfulness Score: 45 points ❌ (Claims in stock, but actually out of stock)
├─ Answer Relevancy Score: 90 points ✅ (Indeed answered the stock question)
└─ Context Precision Score: 70 points ⚠️ (Found trench coat data, but size information is not precise enough)

Problem Diagnosis: AI provided incorrect inventory information
```

***

#### Three Common Problem Types

**Problem A: Low Faithfulness Score (< 60 points)**

**Symptoms**: Information provided by AI is incorrect or fabricated

**Common Causes**:

* Reference materials are outdated (prices, inventory, policies have been updated)
* Conflicting data (different documents say different things)
* AI "guesses" answers without relying on database content

**Impact**: Customers may receive incorrect information, leading to complaints

***

**Problem B: Low Answer Relevancy Score (< 60 points)**

**Symptoms**: AI doesn't answer what the customer actually wants to know

**Common Causes**:

* AI provides lengthy responses but misses the point
* Answers are irrelevant, discussing unrelated content
* Only explains background without providing actual answers

**Impact**: Customers need to ask again, reducing satisfaction

***

**Problem C: Low Context Precision Score (< 60 points)**

**Symptoms**: AI retrieves the wrong reference materials, or the materials it retrieves are not specific enough

**Common Causes**:

* Inaccurate keyword searches
* Confusion between different products/categories of data
* Unclear titles or categories in reference materials

**Impact**: Even a well-written answer will be wrong if it is built on the wrong data

***

## 3. Real-World Cases: Common Issues in Fashion Retail

#### Case 1: Incorrect Pricing (Low Faithfulness Score)

**📊 Problem Discovery**

**Customer Question**: "How much is this down jacket?"

**AI Response**: "This down jacket is priced at NT$ 3,990."

**Evaluation Results**:

* Faithfulness Score: **38 points** ❌
* Answer Relevancy Score: 95 points ✅
* Context Precision Score: 85 points ✅

**Actual Situation**:\
The product has been repriced to NT$ 2,990 (after discount), but AI still provided the original price.

***

**💡 Improvement Solution**

**Step 1: Check Data**

* Open "Product Price List" in the knowledge base
* Finding: the prices had not been updated

**Step 2: Update Data**

❌ **Problem Data**

```
Down Jacket Series
- Classic Down Jacket: NT$ 3,990
- Long Down Jacket: NT$ 4,990
```

✅ **Corrected Data**

```
Down Jacket Series (Winter 2024 Prices)

Product Name: Classic Down Jacket
- Original Price: NT$ 3,990
- Discounted Price: NT$ 2,990 (Starting 2024/1/1)
- Promotion Period: 2024/1/1 - 2024/2/28

Product Name: Long Down Jacket
- Original Price: NT$ 4,990
- Discounted Price: NT$ 3,990 (Starting 2024/1/1)
- Promotion Period: 2024/1/1 - 2024/2/28

Last Updated: 2024/1/15
```
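
The corrected price list works because each product now carries an original price, a discounted price, and a promotion period, so the applicable price depends on the date. A minimal sketch of that logic follows, assuming the record is held as a Python dictionary. The field names and the function `price_on` are hypothetical, not how the knowledge base stores data.

```python
from datetime import date

# Hypothetical record mirroring the corrected price list.
classic_down_jacket = {
    "name": "Classic Down Jacket",
    "original_price": 3990,
    "discounted_price": 2990,
    "promo_start": date(2024, 1, 1),
    "promo_end": date(2024, 2, 28),
}


def price_on(product: dict, day: date) -> dict:
    """Return the price that applies on `day`, plus the promotion end date if active."""
    in_promo = product["promo_start"] <= day <= product["promo_end"]
    return {
        "price": product["discounted_price"] if in_promo else product["original_price"],
        "original_price": product["original_price"],
        "promotion_active": in_promo,
        "valid_until": product["promo_end"] if in_promo else None,
    }


print(price_on(classic_down_jacket, date(2024, 1, 20)))
```

During the promotion period the result carries both prices and the end date, which is the information the response should state.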

**Step 3: Remind AI**

Add to "AI Response Principles":

> "When answering about prices, always confirm whether it's the latest price and specify the applicable period. If there's a discount promotion, please inform both original price and discounted price."

***

#### Case 2: Irrelevant Answer (Low Answer Relevancy Score)

**📊 Problem Discovery**

**Customer Question**: "If it doesn't fit, can I return it?"

**AI Response**: "We provide multiple size options, including S, M, L, XL. Each garment's size chart is indicated on the product page. We recommend referring to the size chart for measurements before purchasing."

**Evaluation Results**:

* Faithfulness Score: 90 points ✅
* Answer Relevancy Score: **35 points** ❌
* Context Precision Score: 60 points ⚠️

**Problem Diagnosis**:\
The customer asked "can I return it," but the AI explained "how to choose sizes," completely missing the core question.

***

**💡 Improvement Solution**

**Step 1: Adjust Response Principles**

Add to "AI Response Principles":

> **Core Principle: Answer the customer's core question first**
>
> Wrong Example:
>
> * Customer asks "Can I return it?"
> * AI answers "We have comprehensive after-sales service..." ❌ (Doesn't say yes or no)
>
> Correct Example:
>
> * Customer asks "Can I return it?"
> * AI answers "Yes. Within 7 days of receiving the product, if unused, you can apply for return..." ✅ (Direct answer)

**Step 2: Reorganize Data Structure**

❌ **Problem Data (Too scattered)**

```
After-Sales Service Description

Our company values customer rights and provides comprehensive after-sales service.
Please read the size chart carefully when purchasing and choose the appropriate size.
If you have any questions, please contact customer service...
```

✅ **Corrected Data**

```
Return and Exchange Policy

Q: Can I return the product?
A: Yes. Within 7 days of receiving the product, if unused with tags intact, you can apply for return.

Return Conditions:
✅ Product unused, not washed, not tried on for more than 10 minutes
✅ Tags and packaging intact
✅ Application submitted within 7 days of receipt
❌ Sale items, underwear cannot be returned

Return Process:
1. Call customer service hotline 0800-XXX-XXX or fill out online form
2. Customer service provides return address and tracking number
3. Refund within 7-14 days to original payment method

---

Q: Can I exchange the product?
A: Yes. Within 7 days of receipt, you can exchange size or color for free (same product only).

Exchange Process:
1. Contact customer service to confirm stock availability
2. Return product (same conditions as returns)
3. New product shipped within 3 days after receiving returned item
```

**Improvement Highlights**:

* ✅ Directly answer "yes" or "no" at the beginning
* ✅ Use Q&A format for clarity
* ✅ List clear conditions to avoid disputes

***

#### Case 3: Wrong Product Found (Low Context Precision Score)

**📊 Problem Discovery**

**Customer Question**: "What is the material of the black knit top?"

**AI Response**: "This knit top is made of 100% pure cotton, soft and comfortable, suitable for all seasons."

**Evaluation Results**:

* Faithfulness Score: 88 points ✅
* Answer Relevancy Score: 90 points ✅
* Context Precision Score: **48 points** ❌

**Actual Situation**:\
The black knit top is 70% wool + 30% polyester, but AI found data for the "white knit top" (100% pure cotton).

***

**💡 Improvement Solution**

**Step 1: Check Data Labels**

Problem data file name:

```
Knit_Tops.pdf
```

Issue: All knit tops are in one document with no clear separation, making it difficult for the AI to tell the color variants apart.

**Step 2: Improve Data Structure**

✅ **Solution A: Separate Files**

```
Product_Data/
├─ Knit_Top_Black_Model_A001.pdf
├─ Knit_Top_White_Model_A002.pdf
├─ Knit_Top_Gray_Model_A003.pdf
```

✅ **Solution B: Clear Titles**

```markdown
# Knit Top Product Information

## Black Knit Top (Model: A001)
- Color: Black
- Material: 70% wool + 30% polyester
- Suitable Season: Autumn/Winter
- Care Instructions: Hand wash, do not tumble dry

## White Knit Top (Model: A002)
- Color: White
- Material: 100% pure cotton
- Suitable Season: All seasons
- Care Instructions: Machine washable, low temperature dry

## Gray Knit Top (Model: A003)
- Color: Gray
- Material: 50% wool + 50% acrylic
- Suitable Season: Autumn/Winter
- Care Instructions: Dry clean only
```
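
Once each variant has its own color and model number, a lookup can insist on an exact match instead of returning the closest knit top. The sketch below illustrates that idea only. The `PRODUCTS` list and `find_product` function are hypothetical and do not describe how the system's retrieval actually works.

```python
# Hypothetical records mirroring the knit top data above.
PRODUCTS = [
    {"model": "A001", "color": "black", "material": "70% wool + 30% polyester"},
    {"model": "A002", "color": "white", "material": "100% pure cotton"},
    {"model": "A003", "color": "gray", "material": "50% wool + 50% acrylic"},
]


def find_product(color=None, model=None):
    """Return the single record matching every attribute given, else None."""
    matches = [
        p for p in PRODUCTS
        if (color is None or p["color"] == color.lower())
        and (model is None or p["model"] == model.upper())
    ]
    # Refuse to guess: zero or several candidates means no confident answer.
    return matches[0] if len(matches) == 1 else None


print(find_product(color="Black"))
print(find_product(color="purple"))
```

Returning `None` when the match is missing or ambiguous corresponds to handing the question to human customer service instead of answering from the wrong variant.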

**Step 3: Remind AI**

Add to "AI Response Principles":

> "When customers mention product color or model number, always confirm that the reference material corresponds to that specific color or model. Different colors of the same product may have different materials and specifications."

***

## 4. Three-Step Improvement Plan

When problems are identified, follow this process:

```
Discover low scores
    ↓
Step 1: Update data content (most important)
    ↓
Step 2: Adjust AI response principles
    ↓
Step 3: Report to technical team (if needed)
```

***

#### Step 1: Update Data Content

**Applicable Situations**:

* ✅ Low faithfulness score (incorrect or outdated data)
* ✅ Low context precision score (disorganized data, unclear labels)

**Checklist**:

* [ ] Is the data the latest version?
* [ ] Are prices, inventory, and policies correct?
* [ ] Is data for different products clearly distinguished?
* [ ] Are titles clear? (So the AI can find the right section easily)
* [ ] Is content presented in bullet points or tables? (Rather than long paragraphs)
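
The first checklist item can be partly automated if each document records a "Last Updated" date, as the corrected price list does. The following is a rough sketch under that assumption. The 30-day limit, the function name `stale_documents`, and the second sample document's date are all invented for illustration.

```python
from datetime import date


def stale_documents(last_updated: dict[str, date], today: date, max_age_days: int = 30) -> list[str]:
    """Return names of documents whose last update is older than max_age_days."""
    return sorted(
        name for name, updated in last_updated.items()
        if (today - updated).days > max_age_days
    )


# Hypothetical "Last Updated" dates collected from the knowledge base.
docs = {
    "Product Price List": date(2024, 1, 15),
    "Return and Exchange Policy": date(2023, 9, 1),
}
print(stale_documents(docs, today=date(2024, 2, 1)))
```

A document on this list is not necessarily wrong; it is only a candidate for manual review.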

**Data Quality Examples**:

❌ **Poor Data**

```
Return Policy

Some products can be returned, but certain conditions must be met.
Some special products cannot be returned, please note before purchasing.
If you need to return, please contact customer service.
```

✅ **Good Data**

```
Return Policy

Returnable Products:
✅ General clothing (tops, pants, outerwear)
✅ Accessories (bags, hats, scarves)

Non-Returnable Products:
❌ Underwear, swimwear
❌ Sale items (50% off or more)
❌ Customized products

Return Conditions (all must be met):
1. Within 7 days of receipt
2. Product unused (tags intact, no signs of wear)
3. Packaging intact

Return Process:
1. Call customer service hotline 0800-XXX-XXX
2. Provide order number
3. Customer service provides return address
4. Return product (registered mail recommended)
5. Refund within 7-14 days after receiving returned item

Contact Methods:
- Customer Service Hotline: 0800-XXX-XXX (09:00-21:00)
- Online Chat: Chat box at bottom right of website
- Email: service@example.com
```

***

#### Step 2: Adjust AI Response Principles

**Applicable Situations**:

* ✅ Low answer relevancy score (irrelevant answers)
* ✅ Low faithfulness score (AI guessing, fabricating)

**AI Response Principles Template**:

```markdown
# AI Customer Service Response Principles

## Core Rules

1. **Answer the core question first**
   - Customer asks "can I/is it possible" → First answer "yes" or "no"
   - Customer asks "how much" → State price first
   - Customer asks "how to" → Provide steps first

2. **Only state what you're certain about**
   - All information must come from reference materials
   - If uncertain, say "This requires human customer service assistance"
   - Never guess or assume

3. **Pay attention to details**
   - Do not confuse colors, sizes, model numbers
   - Confirm prices are current
   - Clearly state promotion periods

## Response Format

### Policy Questions (returns/exchanges, membership, promotions)
First paragraph: Directly answer "yes" or "no"
Second paragraph: Explain conditions (in bullet points)
Third paragraph: Tell customer what to do (process or contact method)

### Product Questions (price, material, stock)
First paragraph: Directly answer the question (price/material/stock availability)
Second paragraph: Supplement product information (specifications, sizes, colors)
Third paragraph: Purchase link or next step

### Process Questions (how to buy, return, exchange)
First paragraph: Summarize process (3-5 steps)
Second paragraph: Detail each step
Third paragraph: Important notes or contact information

## Prohibited Actions

❌ Cannot say "usually," "generally," "approximately" (must be specific)
❌ Cannot confuse information from different products
❌ Cannot omit important conditions (price, size, period)
❌ Cannot speculate about information customer didn't mention

## Examples

✅ Good Response:
Customer: "Can I return this jacket?"
AI: "Yes. Within 7 days of receipt, if the following conditions are met, you can apply for return:
- Product unused, tags intact
- Packaging in good condition
- Not a sale item

Return Process:
Please call customer service hotline 0800-XXX-XXX, we will provide return address and instructions.
Refund will be processed within 7-14 business days to original payment method."

❌ Poor Response:
Customer: "Can I return this jacket?"
AI: "Our company values customer rights and provides comprehensive after-sales service.
We recommend reading product descriptions carefully before purchasing and choosing the appropriate size.
If you have any questions, please contact customer service..."
(Does not directly answer whether return is possible)
```

***

#### Step 3: Report to Technical Team

**Applicable Situations**:

* Context precision score consistently low
* Same problem recurring
* No improvement after adjusting data and principles

**Report Content**:

```
Problem Type: Low Context Precision Score

Problem Description:
When customers inquire about "black" products, AI frequently finds data for "white" or other colored products.

Impact Scope:
Approximately 15% of product inquiries are affected

Attempted Improvements:
✅ Separated data files for different colored products
✅ Clearly labeled colors in titles
⚠️ Problem still not completely resolved

Recommended Technical Adjustments:
Improve the system's recognition of "color" keywords during retrieval

Attachments:
- test_cases_color_queries.csv (100 test questions)
- current_results.csv (current system retrieval results)
- expected_results.csv (expected correct results)
```
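
An impact figure like the one in the report can be derived by comparing the two result files question by question. The sketch below assumes both CSV files share a `question_id` column and a `retrieved_doc` column. Those column names and the three sample rows are hypothetical, since the guide does not specify the file layout.

```python
import csv
import io

# Hypothetical contents standing in for current_results.csv and expected_results.csv.
current_csv = """question_id,retrieved_doc
1,Knit_Top_White_Model_A002.pdf
2,Knit_Top_Black_Model_A001.pdf
3,Knit_Top_Gray_Model_A003.pdf
"""
expected_csv = """question_id,retrieved_doc
1,Knit_Top_Black_Model_A001.pdf
2,Knit_Top_Black_Model_A001.pdf
3,Knit_Top_Gray_Model_A003.pdf
"""


def load(text: str) -> dict[str, str]:
    """Map each question_id to the document retrieved for it."""
    return {row["question_id"]: row["retrieved_doc"] for row in csv.DictReader(io.StringIO(text))}


current = load(current_csv)
expected = load(expected_csv)

# Questions whose retrieved document differs from the expected one.
wrong = [qid for qid, doc in expected.items() if current.get(qid) != doc]
error_rate = 100 * len(wrong) / len(expected)
print(f"Mismatched questions: {wrong} ({error_rate:.1f}%)")
```

With real files, replace the inline strings with `open("current_results.csv", newline="")` and pass the file object to `csv.DictReader`.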

***

## 5. Daily Management Checklist

#### Daily Inspection

**When problems are discovered:**

```
If same type of problem occurs ≥ 3 times
→ Handle immediately (update data or adjust principles)

If involves pricing or policy errors
→ Emergency correction, complete same day

If isolated incident
→ Record it, keep observing, and raise it in the next team discussion
```
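
The triage rules above can be written as a small decision function. This is a sketch with two assumptions that the guide leaves open: pricing or policy errors are checked first, and a problem seen twice is treated like an isolated incident. The name `triage` is hypothetical.

```python
def triage(occurrences: int, involves_pricing_or_policy: bool) -> str:
    """Pick the handling rule for one problem type found during daily inspection."""
    if involves_pricing_or_policy:
        # Assumption: pricing and policy errors outrank the frequency rule.
        return "emergency correction, complete same day"
    if occurrences >= 3:
        return "handle immediately"
    # Assumption: one or two occurrences both count as "isolated".
    return "record for observation"


print(triage(occurrences=1, involves_pricing_or_policy=True))
print(triage(occurrences=4, involves_pricing_or_policy=False))
print(triage(occurrences=1, involves_pricing_or_policy=False))
```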

***

#### Response Quality Tracking

**1. Data Review**

```
Weekly Statistics:
- Total conversations: ___ 
- Average Faithfulness Score: ___ points
- Average Answer Relevancy Score: ___ points
- Average Context Precision Score: ___ points
- Abnormal conversations: ___ (____%)
```
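
If scores are exported from the monitoring table, the weekly figures can be computed with a few lines of Python. The sketch below uses the three cases from Section 3 as sample data and assumes "abnormal" means any metric below 60. The guide does not define the term, so adjust the threshold to your own standard. The function expects at least one conversation.

```python
from statistics import mean

METRICS = ("faithfulness", "answer_relevancy", "context_precision")


def weekly_summary(conversations: list[dict]) -> dict:
    """Average each metric and count conversations with any metric below 60."""
    abnormal = [c for c in conversations if any(c[m] < 60 for m in METRICS)]
    summary = {f"avg_{m}": round(mean(c[m] for c in conversations), 1) for m in METRICS}
    summary["total"] = len(conversations)
    summary["abnormal"] = len(abnormal)
    summary["abnormal_pct"] = round(100 * len(abnormal) / len(conversations), 1)
    return summary


# Scores taken from Cases 1-3 in this guide.
week = [
    {"faithfulness": 38, "answer_relevancy": 95, "context_precision": 85},
    {"faithfulness": 90, "answer_relevancy": 35, "context_precision": 60},
    {"faithfulness": 88, "answer_relevancy": 90, "context_precision": 48},
]
print(weekly_summary(week))
```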

**2. Problem Analysis**

```
Top 3 High-Frequency Issues:
1. ________ (__ times) - Which metric is low?
2. ________ (__ times) - Which metric is low?
3. ________ (__ times) - Which metric is low?
```
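
If the team keeps a simple log of problem conversations, the top three issues can be counted automatically. The log entries below are invented for illustration, and the tuple layout (issue description, low metric) is an assumption.

```python
from collections import Counter

# Hypothetical issue log: (issue description, metric that scored low).
issue_log = [
    ("outdated price", "faithfulness"),
    ("wrong color variant", "context_precision"),
    ("outdated price", "faithfulness"),
    ("return question not answered", "answer_relevancy"),
    ("outdated price", "faithfulness"),
    ("wrong color variant", "context_precision"),
    ("size chart missing", "context_precision"),
]

# Count identical (issue, metric) pairs and keep the three most frequent.
top3 = Counter(issue_log).most_common(3)
for rank, ((issue, metric), count) in enumerate(top3, start=1):
    print(f"{rank}. {issue} ({count} times) - low metric: {metric}")
```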

**3. Improvement Actions**

```
This Week's Tasks:
□ Update ___ data files (Responsible person: ___)
□ Adjust ___ response principles (Responsible person: ___)
□ Report ___ technical issues (Responsible person: ___)

Next Week's Goals:
- Reduce abnormal conversation rate to < ____%
- All metrics average > ___ points
```

***

## Appendix A: Problem Diagnosis Quick Reference

| Score Situation                 | Possible Cause                             | Improvement Method                 |
| ------------------------------- | ------------------------------------------ | ---------------------------------- |
| **Low Faithfulness Score**      | Outdated or incorrect data, AI fabrication | Step 1: Update data content        |
| **Low Answer Relevancy Score**  | AI provides irrelevant answers             | Step 2: Adjust response principles |
| **Low Context Precision Score** | AI finds wrong or imprecise data           | Step 1: Improve data labels        |
| **Multiple low metrics**        | Systemic issue                             | Step 1+2, Step 3 if necessary      |
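
The quick reference table can also be expressed as a lookup, which may help when annotating an exported score sheet. This sketch assumes "low" means below 60 points, and the names `STEP_FOR_METRIC` and `recommend` are hypothetical.

```python
LOW = 60  # assumption: "low" means below the 60-point threshold

STEP_FOR_METRIC = {
    "faithfulness": "Step 1: Update data content",
    "answer_relevancy": "Step 2: Adjust response principles",
    "context_precision": "Step 1: Improve data labels",
}


def recommend(scores: dict[str, float]) -> list[str]:
    """Look up the improvement method for the low metrics, following the table."""
    low = [m for m in STEP_FOR_METRIC if scores[m] < LOW]
    if len(low) >= 2:
        # Multiple low metrics: treat as a systemic issue.
        return [
            "Step 1: Update data content",
            "Step 2: Adjust response principles",
            "Step 3: Report to technical team (if necessary)",
        ]
    return [STEP_FOR_METRIC[m] for m in low]


print(recommend({"faithfulness": 38, "answer_relevancy": 95, "context_precision": 85}))
```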

***

#### Improvement Priority

```
First Priority: Faithfulness Score < 60 points
→ May provide customers with incorrect information or fabricated content, causing complaints

Second Priority: Answer Relevancy Score < 60 points
→ Poor customer experience, requires repeated inquiries

Third Priority: Context Precision Score < 60 points
→ The problem is less visible at first, but it erodes answer quality over the long term
```
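
To work through a backlog in this order, flagged conversations can be sorted by their highest-priority failing metric. The sketch below uses the three cases from Section 3. The `priority` function and the record layout are hypothetical.

```python
PRIORITY_ORDER = ("faithfulness", "answer_relevancy", "context_precision")


def priority(scores: dict[str, float]) -> int:
    """Return 1, 2 or 3 for the highest-priority metric below 60, or 4 if none."""
    for rank, metric in enumerate(PRIORITY_ORDER, start=1):
        if scores[metric] < 60:
            return rank
    return 4


queue = [
    {"id": "case-3", "scores": {"faithfulness": 88, "answer_relevancy": 90, "context_precision": 48}},
    {"id": "case-2", "scores": {"faithfulness": 90, "answer_relevancy": 35, "context_precision": 60}},
    {"id": "case-1", "scores": {"faithfulness": 38, "answer_relevancy": 95, "context_precision": 85}},
]

# Lowest rank number first, so faithfulness failures are handled before the rest.
ordered = sorted(queue, key=lambda c: priority(c["scores"]))
print([c["id"] for c in ordered])
```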

***

## Appendix B: System Evaluation Metrics Reference Table

#### Primary Metrics (No Standard Answer Required)

These three metrics are the core of this guide and can be directly applied to daily customer service conversation evaluation:

| Name in Reports             | Metric Name       | Description                                                                                                 |
| --------------------------- | ----------------- | ----------------------------------------------------------------------------------------------------------- |
| **Faithfulness Score**      | Faithfulness      | Evaluates whether AI responses are consistent with database content, or contain fabricated information      |
| **Answer Relevancy Score**  | Answer Relevancy  | Evaluates whether AI responses address the customer's question, or drift off-topic                          |
| **Context Precision Score** | Context Precision | Evaluates whether the reference materials the AI retrieved are the correct ones for the question            |

#### Advanced Metrics (Standard Answer Required)

The following metrics require prepared "ground truth" standard answers, suitable for test case evaluation:

| Name in Reports        | Metric Name        | Description                                                           |
| ---------------------- | ------------------ | --------------------------------------------------------------------- |
| **Answer Correctness** | Answer Correctness | Compares AI response with standard answer, evaluates correctness      |
| **Answer Similarity**  | Answer Similarity  | Evaluates semantic similarity between AI response and standard answer |
| **Context Recall**     | Context Recall     | Evaluates whether system retrieves all necessary reference materials  |
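
To make the idea behind Context Recall concrete, here is a simplified, set-based illustration: of the documents a standard answer needs, what share did the system retrieve? This is not how the evaluation system computes the metric, which this guide does not describe. The function name and the example file names are only for illustration.

```python
def simple_context_recall(retrieved: set[str], needed: set[str]) -> float:
    """Share of the needed reference documents that were actually retrieved."""
    if not needed:
        return 1.0  # nothing was required, so nothing was missed
    return len(retrieved & needed) / len(needed)


needed = {"Return_Policy.pdf", "Sale_Items_Rules.pdf"}
retrieved = {"Return_Policy.pdf", "Size_Chart.pdf"}
print(simple_context_recall(retrieved, needed))
```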

#### Other Available Metrics (DeepEval)

The system also supports the following additional evaluation metrics for more comprehensive quality inspection:

| Name in Reports             | Metric Name          | Description                                                                  |
| --------------------------- | -------------------- | ---------------------------------------------------------------------------- |
| **Bias Detection**          | Bias                 | Detects whether responses contain biased or discriminatory content           |
| **Toxicity Detection**      | Toxicity             | Detects whether responses contain inappropriate or offensive content         |
| **Hallucination Detection** | Hallucination        | Detects whether AI generates content inconsistent with facts                 |
| **Contextual Relevancy**    | Contextual Relevancy | Evaluates whether retrieved reference materials are relevant to the question |

#### Usage Recommendations

1. **Daily Monitoring**: Use three primary metrics (Faithfulness Score, Answer Relevancy Score, Context Precision Score)
2. **Test Evaluation**: Combine with advanced metrics, prepare standard answers for systematic evaluation
3. **Quality Control**: Enable bias and toxicity detection to ensure responses comply with corporate standards

***

## FAQ

**Q1: I'm not technical, can I manage AI customer service?**\
A: Yes! Just like managing customer service staff, you only need to:

* Review evaluation reports daily, identify problem conversations
* Check that data is correct and complete
* Adjust AI "response principles" (like training service scripts)

***

**Q2: How are scores generated? Does AI evaluate itself?**\
A: No. Scoring is automatically performed by a specialized "evaluation system," like having another AI acting as "quality control" to check the first AI's responses.

***

**Q3: Are all three metrics important? Can I just look at one?**\
A: We recommend reviewing all three because they reflect different issues:

* **Faithfulness Score**: Whether AI aligns with database content, whether it fabricates
* **Answer Relevancy Score**: Whether AI understands the question, whether response is relevant
* **Context Precision Score**: Whether AI finds the correct reference materials for the question

If you only look at one, you may miss important issues.

***

**Q4: How soon will I see results after improvements?**\
A:

* Data updates: Immediate effect (improvements visible same day)
* Response principle adjustments: Immediate effect
* Technical adjustments: Requires 2-4 weeks (depending on problem complexity)

***

## Conclusion

Managing AI customer service is like managing a human customer service team:

✅ **Regular quality checks** (review evaluation reports)\
✅ **Continuous knowledge updates** (update data content)\
✅ **Optimize service scripts** (adjust response principles)\
✅ **Track improvement results** (monitor score changes)

By following this guide, you can steadily improve your AI customer service, even without technical knowledge!
