# Response Evaluation and Monitoring Results

Last Updated: 2025-11-14

## Overview

MaiAgent uses the Deepeval framework for response evaluation and has upgraded to **Deepeval 3.7.0**, which provides more powerful evaluation capabilities and more flexible configuration options.

## Version Update Notes

### Deepeval 3.7.0 New Features

The MaiAgent platform has been upgraded to Deepeval 3.7.0, bringing the following important improvements:

#### 1. Configurable Evaluation LLM

In the new version, you can customize the large language model (LLM) used for evaluation:

* **Flexible Selection**: No longer limited to a specific LLM for evaluation
* **Cost Optimization**: Choose more economical models for evaluation to reduce operational costs
* **Performance Tuning**: Select an appropriate balance between model speed and accuracy based on evaluation needs

**Configuration Example:**

```python
# Specify the LLM to use in evaluation settings
evaluation_config = {
    "evaluation_model": "gpt-4",  # or other supported models
    "temperature": 0.0,
    "max_tokens": 1000
}
```

#### 2. Flexible Handling of Empty Ground Truth

The new version improves handling of empty or missing Ground Truth:

* **Auto-adaptation**: When test cases don't provide Ground Truth, the system automatically adjusts evaluation strategy
* **Partial Evaluation**: Other dimensions can still be evaluated even when complete Ground Truth is missing
* **Friendly Prompts**: Clearly indicates which evaluation metrics cannot be calculated due to missing Ground Truth

**Applicable Scenarios:**

* Exploratory testing phase where standard answers haven't been defined yet
* Open-ended Q\&A scenarios without a single correct answer
* Quick verification of AI assistant's basic response capabilities
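The fallback behavior described above can be sketched as follows. This is an illustrative sketch with hypothetical helper names, not the Deepeval API: it splits a requested metric list into metrics that can still run and metrics that must be skipped when a test case has no Ground Truth.

```python
# Metrics that cannot be scored without a Ground Truth answer
GROUND_TRUTH_METRICS = {"Context Recall", "Answer Correctness", "Answer Similarity"}

def select_metrics(requested, ground_truth):
    """Split requested metrics into (runnable, skipped) for one test case."""
    if ground_truth:
        return list(requested), []
    runnable = [m for m in requested if m not in GROUND_TRUTH_METRICS]
    skipped = [m for m in requested if m in GROUND_TRUTH_METRICS]
    return runnable, skipped

runnable, skipped = select_metrics(
    ["Faithfulness", "Answer Relevancy", "Answer Correctness"],
    ground_truth=None,  # exploratory test case: no standard answer yet
)
print(runnable)  # ['Faithfulness', 'Answer Relevancy']
print(skipped)   # ['Answer Correctness']
```

The skipped list is what powers the "friendly prompts": the report can name exactly which metrics were omitted and why.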

#### 3. Parallel Processing Enhances Evaluation Performance

Deepeval 3.7.0 introduces parallel processing mechanisms, significantly improving evaluation speed:

* **Batch Evaluation**: Multiple test cases can be evaluated simultaneously
* **Performance Improvement**: Evaluation speed increased 2-3x compared to the old version
* **Resource Optimization**: More efficient use of computational resources

**Performance Comparison:**

| Number of Test Cases | Old Version Time | New Version Time | Performance Gain |
| -------------------- | ---------------- | ---------------- | ---------------- |
| 10                   | 45 seconds       | 18 seconds       | 2.5x             |
| 50                   | 3.5 minutes      | 1.5 minutes      | 2.3x             |
| 100                  | 7 minutes        | 3 minutes        | 2.3x             |
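The speedup in the table comes from evaluating test cases concurrently rather than one at a time. Deepeval's parallelism is internal to the framework; the sketch below only illustrates the principle with a stub scorer standing in for the (I/O-bound) evaluation LLM call.

```python
from concurrent.futures import ThreadPoolExecutor

def score_case(case):
    # Stand-in for one per-test-case evaluation LLM call
    return {"id": case["id"], "score": 0.8}

def evaluate_batch(cases, max_workers=8):
    # Fan test cases out over a thread pool; results keep input order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(score_case, cases))

results = evaluate_batch([{"id": i} for i in range(10)])
print(len(results))  # 10
```

Because each evaluation call mostly waits on a remote LLM, a thread pool is enough to overlap the requests; no process-level parallelism is needed.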

### Upgrade Recommendations

If you are using the old version of evaluation features, consider the following upgrade strategy:

1. **Review existing evaluation settings**: Confirm currently used evaluation parameters
2. **Test new configuration options**: Try using the new LLM configuration features
3. **Optimize test cases**: Leverage flexible Ground Truth handling to expand test coverage
4. **Monitor performance improvements**: Observe speed improvements from parallel processing

## View Response Evaluation Results

The response evaluation feature is located in the **AgentOps** module, providing two viewing methods:

### Real-time Monitoring

AgentOps → AI Assistant Monitoring

Real-time calculation of scores for each conversation, used to monitor the response quality of online AI assistants.

### Automated Testing

AgentOps → Automated Testing

Use test sets to perform batch evaluations, generating complete reports and improvement suggestions, suitable for quality verification before version releases.

## Scoring Metrics

The MaiAgent platform provides response evaluation functionality, recording and automatically scoring each Q\&A session. Scores include:

<table><thead><tr><th>Metric</th><th>Description</th><th>Influencing Factors</th><th>Question</th><th>Response</th><th width="118">Retrieval Context</th><th>Ground Truth</th></tr></thead><tbody><tr><td>Faithfulness</td><td>Whether the LLM answers truthfully rather than fabricating answers</td><td>LLM, RAG, Knowledge Base</td><td></td><td>✅</td><td>✅</td><td></td></tr><tr><td>Answer Relevancy</td><td>Whether the LLM answers the question directly, without being incomplete or padded with redundant text</td><td>LLM, RAG, Knowledge Base</td><td>✅</td><td>✅</td><td></td><td></td></tr><tr><td>Context Precision</td><td>Whether RAG-retrieved content is relevant to the question</td><td>RAG, Knowledge Base</td><td>✅</td><td></td><td>✅</td><td></td></tr><tr><td>Contextual Relevancy</td><td>Overall relevance of retrieved content to the question</td><td>RAG, Knowledge Base</td><td>✅</td><td></td><td>✅</td><td></td></tr><tr><td>Context Recall</td><td>Whether RAG retrieval, measured against Ground Truth, recovered all the necessary data</td><td>RAG, Knowledge Base</td><td></td><td></td><td>✅</td><td>✅</td></tr><tr><td>Answer Correctness</td><td>Correctness of the response compared to Ground Truth</td><td>LLM, RAG, Knowledge Base</td><td></td><td>✅</td><td></td><td>✅</td></tr><tr><td>Answer Similarity</td><td>Semantic similarity of the response to Ground Truth</td><td>LLM, RAG, Knowledge Base</td><td></td><td>✅</td><td></td><td>✅</td></tr><tr><td>Bias</td><td>Detects whether the response contains gender, racial, religious, or other biases</td><td>LLM</td><td></td><td>✅</td><td></td><td></td></tr><tr><td>Toxicity</td><td>Detects whether the response contains harmful or offensive content</td><td>LLM</td><td></td><td>✅</td><td></td><td></td></tr><tr><td>Hallucination</td><td>Detects whether the response contains fabricated information inconsistent with context</td><td>LLM, RAG</td><td></td><td>✅</td><td>✅</td><td></td></tr></tbody></table>
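The input columns of the table above can be encoded directly, which makes it easy to check which metrics a given test case supports. This is a sketch using this document's metric and field names, not a Deepeval API.

```python
# Which inputs each metric consumes, mirroring the table above
REQUIRED_INPUTS = {
    "Faithfulness":         {"response", "retrieval_context"},
    "Answer Relevancy":     {"question", "response"},
    "Context Precision":    {"question", "retrieval_context"},
    "Contextual Relevancy": {"question", "retrieval_context"},
    "Context Recall":       {"retrieval_context", "ground_truth"},
    "Answer Correctness":   {"response", "ground_truth"},
    "Answer Similarity":    {"response", "ground_truth"},
    "Bias":                 {"response"},
    "Toxicity":             {"response"},
    "Hallucination":        {"response", "retrieval_context"},
}

def evaluable_metrics(available):
    """Metrics whose required inputs are all present in `available`."""
    return [m for m, req in REQUIRED_INPUTS.items() if req <= set(available)]

print(evaluable_metrics({"question", "response"}))
# ['Answer Relevancy', 'Bias', 'Toxicity']
```

With only a question and a response (no retrieval context or Ground Truth), just three metrics remain evaluable, which matches the table's check marks.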

### Difference Between Faithfulness and Hallucination Detection

These two metrics are often confused, but they evaluate from different angles:

* **Faithfulness**: Measures "what proportion" of the response content is based on retrieved context, a **positive metric** (higher is better)
* **Hallucination**: Detects whether the response "contains" content that contradicts or cannot be verified by context, a **negative metric** (lower is better)

| Aspect              | Faithfulness                                | Hallucination                               |
| ------------------- | ------------------------------------------- | ------------------------------------------- |
| Direction           | Positive (higher is better)                 | Negative (lower is better)                  |
| Evaluation Question | How much of the answer is based on sources? | Does the answer contain fabricated content? |
| Calculation Method  | Verifiable statements ÷ Total statements    | Detects existence of fabricated information |

**Example**

> **Retrieval Context**: "Taipei 101 has a height of 508 meters and was completed in 2004"
>
> **Response**: "Taipei 101 has a height of 508 meters, was completed in 2004, and was once the world's tallest building"

* **Faithfulness drops to about 0.67**: only 2 of the 3 statements are supported by the context
* **Hallucination is flagged**: "was once the world's tallest building" does not appear in the context and is treated as hallucinated content

In short, Faithfulness focuses on "degree of fidelity to sources," while Hallucination focuses on "whether there is fabrication." They are related but not the same: low Faithfulness doesn't necessarily mean hallucination, but hallucination will definitely lower Faithfulness.
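The Taipei 101 example can be worked through numerically. This is a toy simplification: the real metric uses an LLM to extract and verify claims, whereas here the per-claim verdicts are given by hand.

```python
# Each response statement, paired with whether the retrieval context supports it
claims = [
    ("Taipei 101 has a height of 508 meters", True),   # in context
    ("was completed in 2004", True),                   # in context
    ("was once the world's tallest building", False),  # not in context
]

supported = sum(1 for _, ok in claims if ok)
faithfulness = supported / len(claims)              # 2/3, i.e. ~0.67
has_hallucination = any(not ok for _, ok in claims)  # any unsupported claim

print(round(faithfulness, 2), has_hallucination)  # 0.67 True
```

This also makes the relationship concrete: a single unsupported claim both lowers Faithfulness (the ratio) and trips the Hallucination flag (the existence check).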

## Feature Support Comparison

| Metric               | Real-time Monitoring | Automated Testing |
| -------------------- | :------------------: | :---------------: |
| Faithfulness         |           ✅          |         ✅         |
| Answer Relevancy     |           ✅          |         ✅         |
| Context Precision    |           ✅          |         ✅         |
| Contextual Relevancy |                      |         ✅         |
| Context Recall       |          ⚠️          |         ✅         |
| Answer Correctness   |          ⚠️          |                   |
| Answer Similarity    |          ⚠️          |                   |
| Bias                 |                      |         ✅         |
| Toxicity             |                      |         ✅         |
| Hallucination        |                      |         ✅         |

> ⚠️ Coming soon

## Score Interpretation

* Below 0.5 is generally considered to need improvement
* 0.6-0.7 is an acceptable range
* Above 0.8 is considered good performance
* Above 0.9 is excellent performance
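These bands can be applied mechanically, for example when triaging batch results. One caveat: the document leaves the 0.5–0.6 and 0.7–0.8 ranges unspecified, so treating them as "acceptable" and "acceptable/good boundary" below is an assumption, not a platform rule.

```python
def interpret_score(score):
    """Map a metric score to the rough bands described above."""
    if score >= 0.9:
        return "excellent"
    if score >= 0.8:
        return "good"
    if score >= 0.5:
        # 0.5-0.6 is unspecified in the doc; treated as acceptable here
        return "acceptable"
    return "needs improvement"

print(interpret_score(0.85))  # good
print(interpret_score(0.42))  # needs improvement
```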

## Identifying Causes of Low Scores and Solutions

* **LLM capability**: the model cannot produce a correct answer even from the reference materials
  * Solution: switch to a more capable LLM, or use the configurable evaluation LLM introduced in 3.7.0
* **RAG retrieval**: data relevant to the question was not retrieved
  * Solution: contact MaiAgent official support
* **Knowledge base coverage**: the knowledge base does not contain the needed data
  * Solution: supplement the knowledge base with correct data and FAQ entries

## Best Practices

### Using Flexible Ground Truth

When standard answers are not available, you can still:

1. Start with basic evaluation (metrics that don't require Ground Truth)
2. Observe AI assistant's response patterns
3. Gradually establish evaluation standards based on actual performance
4. Supplement Ground Truth for complete evaluation

### Leverage Parallel Processing

For optimal evaluation performance:

* Recommend evaluating multiple test cases at once (10 or more)
* Avoid overly frequent small batch evaluations
* Consider performing large evaluations during off-peak hours
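The batching advice above can be sketched as a simple grouping helper (illustrative only): collect test cases into batches of at least ~10 so each run benefits from parallel processing instead of incurring overhead on many tiny evaluations.

```python
def make_batches(cases, batch_size=10):
    """Group test cases into consecutive batches of up to batch_size."""
    return [cases[i:i + batch_size] for i in range(0, len(cases), batch_size)]

batches = make_batches(list(range(25)), batch_size=10)
print([len(b) for b in batches])  # [10, 10, 5]
```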

## Technical Resources

* [Deepeval Official Documentation](https://docs.confident-ai.com/)
* [Deepeval 3.7.0 Changelog](https://github.com/confident-ai/deepeval/releases/tag/v3.7.0)
