Response Evaluation and Monitoring Results

Last Updated: 2025-11-14

Overview

MaiAgent uses the Deepeval framework for response evaluation and has been upgraded to Deepeval 3.7.0, providing more powerful evaluation capabilities and more flexible configuration options.

Version Update Notes

Deepeval 3.7.0 New Features

The MaiAgent platform has been upgraded to Deepeval 3.7.0, bringing the following important improvements:

1. Configurable Evaluation LLM

In the new version, you can customize the large language model (LLM) used for evaluation:

  • Flexible Selection: No longer limited to a specific LLM for evaluation

  • Cost Optimization: Choose more economical models for evaluation to reduce operational costs

  • Performance Tuning: Select an appropriate balance between model speed and accuracy based on evaluation needs

Configuration Example:

# Specify the LLM to use in evaluation settings
evaluation_config = {
    "evaluation_model": "gpt-4",  # or other supported models
    "temperature": 0.0,
    "max_tokens": 1000
}
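
Before submitting a configuration like the one above, it can help to check it locally. The following is a minimal sketch, not a documented MaiAgent or Deepeval API: the function name `validate_evaluation_config` and the validation rules (non-empty model name, non-negative temperature, positive integer `max_tokens`) are assumptions made for illustration.

```python
# Hypothetical helper. The keys mirror the configuration example above;
# the validation rules themselves are assumptions, not platform behavior.
REQUIRED_KEYS = {"evaluation_model", "temperature", "max_tokens"}


def validate_evaluation_config(config: dict) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []

    missing = REQUIRED_KEYS - set(config)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")

    if "evaluation_model" in config:
        model = config["evaluation_model"]
        if not isinstance(model, str) or not model.strip():
            problems.append("evaluation_model must be a non-empty string")

    if "temperature" in config:
        temperature = config["temperature"]
        if not isinstance(temperature, (int, float)) or temperature < 0:
            problems.append("temperature must be a non-negative number")

    if "max_tokens" in config:
        max_tokens = config["max_tokens"]
        if not isinstance(max_tokens, int) or max_tokens <= 0:
            problems.append("max_tokens must be a positive integer")

    return problems


evaluation_config = {
    "evaluation_model": "gpt-4",
    "temperature": 0.0,
    "max_tokens": 1000,
}
print(validate_evaluation_config(evaluation_config))
```

A temperature of 0.0, as in the example, is a common choice for evaluation because it keeps the judging model's output as repeatable as possible.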

2. Flexible Handling of Empty Ground Truth

The new version improves handling of empty or missing Ground Truth:

  • Auto-adaptation: When test cases don't provide Ground Truth, the system automatically adjusts evaluation strategy

  • Partial Evaluation: Other dimensions can still be evaluated even when complete Ground Truth is missing

  • Friendly Prompts: Clearly indicates which evaluation metrics cannot be calculated due to missing Ground Truth

Applicable Scenarios:

  • Exploratory testing phase where standard answers haven't been defined yet

  • Open-ended Q&A scenarios without a single correct answer

  • Quick verification of AI assistant's basic response capabilities
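
The partial-evaluation behavior described above can be pictured as follows. This is an illustrative sketch only: the `EvalCase` class and `plan_metrics` function are hypothetical names, and the set of metrics treated as requiring Ground Truth is an assumption based on the metric descriptions in the Scoring Metrics section, not on the platform's actual implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumption: these are the metrics whose descriptions in this guide
# compare the result against Ground Truth.
GROUND_TRUTH_METRICS = {"Context Recall", "Answer Correctness", "Answer Similarity"}

ALL_METRICS = [
    "Faithfulness", "Answer Relevancy", "Context Precision",
    "Contextual Relevancy", "Context Recall", "Answer Correctness",
    "Answer Similarity", "Bias", "Toxicity", "Hallucination",
]


@dataclass
class EvalCase:
    """Hypothetical test case; ground_truth may be omitted or left empty."""
    question: str
    response: str
    ground_truth: Optional[str] = None


def plan_metrics(case: EvalCase) -> Tuple[List[str], List[str]]:
    """Split the metrics into (runnable, skipped) for a single test case."""
    has_ground_truth = bool(case.ground_truth and case.ground_truth.strip())
    runnable = [
        m for m in ALL_METRICS
        if has_ground_truth or m not in GROUND_TRUTH_METRICS
    ]
    skipped = [m for m in ALL_METRICS if m not in runnable]
    return runnable, skipped


runnable, skipped = plan_metrics(EvalCase("What is the refund policy?", "Refunds take 7 days."))
print("Evaluated:", runnable)
print("Skipped (no Ground Truth):", skipped)
```

Reporting the skipped metrics explicitly, rather than silently omitting them, corresponds to the "Friendly Prompts" behavior listed above.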

3. Parallel Processing Enhances Evaluation Performance

Deepeval 3.7.0 introduces parallel processing mechanisms, significantly improving evaluation speed:

  • Batch Evaluation: Multiple test cases can be evaluated simultaneously

  • Performance Improvement: Evaluation speed increased 2-3x compared to the old version

  • Resource Optimization: More efficient use of computational resources
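
The general idea behind batch evaluation can be sketched with Python's standard library. How Deepeval or MaiAgent parallelize internally is not described here, so treat this as an assumption-laden illustration: `score_case` is a placeholder for a slow call to an evaluation LLM, and `evaluate_batch` and `max_workers` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def score_case(case: dict) -> dict:
    """Placeholder scorer standing in for a slow evaluation-LLM call."""
    # Illustrative rule only: a non-empty response scores 1.0, otherwise 0.0.
    score = 1.0 if case["response"].strip() else 0.0
    return {"id": case["id"], "score": score}


def evaluate_batch(cases: list, max_workers: int = 4) -> list:
    """Score several test cases concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the same order as the inputs.
        return list(executor.map(score_case, cases))


cases = [{"id": i, "response": "ok" if i % 2 == 0 else ""} for i in range(10)]
for result in evaluate_batch(cases):
    print(result)
```

Threads are used in this sketch because waiting on a remote model is I/O-bound; the speedup depends on how many requests can be in flight at once.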

Performance Comparison:

| Number of Test Cases | Old Version Time | New Version Time | Performance Gain |
| --- | --- | --- | --- |
| 10 | 45 seconds | 18 seconds | 2.5x |
| 50 | 3.5 minutes | 1.5 minutes | 2.3x |
| 100 | 7 minutes | 3 minutes | 2.3x |
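
The Performance Gain figures are the old time divided by the new time. As a quick check of the numbers in the comparison above (the variable and function names are illustrative only):

```python
# Timings from the performance comparison, converted to seconds:
# number of test cases -> (old version, new version)
timings_seconds = {
    10: (45, 18),
    50: (3.5 * 60, 1.5 * 60),
    100: (7 * 60, 3 * 60),
}


def speedup(old_seconds: float, new_seconds: float) -> float:
    """Performance gain expressed as old time divided by new time."""
    return old_seconds / new_seconds


gains = {n: round(speedup(old, new), 1) for n, (old, new) in timings_seconds.items()}
print(gains)  # {10: 2.5, 50: 2.3, 100: 2.3}
```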

Upgrade Recommendations

If you are using the old version of evaluation features, consider the following upgrade strategy:

  1. Review existing evaluation settings: Confirm currently used evaluation parameters

  2. Test new configuration options: Try using the new LLM configuration features

  3. Optimize test cases: Leverage flexible Ground Truth handling to expand test coverage

  4. Monitor performance improvements: Observe speed improvements from parallel processing

View Response Evaluation Results

The response evaluation feature is located in the AgentOps module, providing two viewing methods:

Real-time Monitoring

AgentOps → AI Assistant Monitoring

Scores are calculated in real time for each conversation, so you can monitor the response quality of AI assistants that are already online.

Automated Testing

AgentOps → Automated Testing

Runs batch evaluations against test sets and generates complete reports with improvement suggestions. Suitable for quality verification before a version release.

Scoring Metrics

The MaiAgent platform provides response evaluation functionality, recording and automatically scoring each Q&A session. Scores include:

| Metric | Description | Influencing Factors | Question | Response | Retrieval Context | Ground Truth |
| --- | --- | --- | --- | --- | --- | --- |
| Faithfulness | Whether the LLM answers truthfully rather than fabricating answers | LLM, RAG, Knowledge Base | | | | |
| Answer Relevancy | Whether the LLM answers to the point, without being incomplete or containing redundant text | LLM, RAG, Knowledge Base | | | | |
| Context Precision | Whether RAG-retrieved content is relevant to the question | RAG, Knowledge Base | | | | |
| Contextual Relevancy | Overall relevance of retrieved content to the question | RAG, Knowledge Base | | | | |
| Context Recall | Whether the RAG-retrieved content covers all the information in the Ground Truth | RAG, Knowledge Base | | | | |
| Answer Correctness | Correctness of the response compared to Ground Truth | LLM, RAG, Knowledge Base | | | | |
| Answer Similarity | Semantic similarity of the response to Ground Truth | LLM, RAG, Knowledge Base | | | | |
| Bias | Detects whether the response contains gender, racial, religious, or other biases | LLM | | | | |
| Toxicity | Detects whether the response contains harmful or offensive content | LLM | | | | |
| Hallucination | Detects whether the response contains fabricated information inconsistent with context | LLM, RAG | | | | |

Difference Between Faithfulness and Hallucination Detection

These two metrics are often confused, but they evaluate from different angles:

  • Faithfulness: Measures "what proportion" of the response content is based on retrieved context, a positive metric (higher is better)

  • Hallucination: Detects whether the response "contains" content that contradicts or cannot be verified by context, a negative metric (lower is better)

| Aspect | Faithfulness | Hallucination |
| --- | --- | --- |
| Direction | Positive (higher is better) | Negative (lower is better) |
| Evaluation Question | How much of the answer is based on sources? | Does the answer contain fabricated content? |
| Calculation Method | Verifiable statements ÷ Total statements | Detects existence of fabricated information |

Example

Retrieval Context: "Taipei 101 has a height of 508 meters and was completed in 2004"

Response: "Taipei 101 has a height of 508 meters, was completed in 2004, and was once the world's tallest building"

  • Faithfulness score is lowered: only 2 of the 3 statements in the response are supported by the context (2 ÷ 3 ≈ 0.67)

  • Hallucination score is high: "was once the world's tallest building" is not mentioned in the context and is considered hallucinated content

In short, Faithfulness focuses on "degree of fidelity to sources," while Hallucination focuses on "whether there is fabrication." They are related but not the same: low Faithfulness doesn't necessarily mean hallucination, but hallucination will definitely lower Faithfulness.
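
The Taipei 101 example can be worked through in a few lines. This sketch applies the "Verifiable statements ÷ Total statements" formula directly; the split into statements and the supported/unsupported labels are assigned by hand here, whereas in practice an evaluation LLM makes those judgments. The function name `faithfulness_score` is illustrative.

```python
def faithfulness_score(supported_flags: list) -> float:
    """Verifiable statements divided by total statements."""
    if not supported_flags:
        raise ValueError("at least one statement is required")
    return sum(supported_flags) / len(supported_flags)


# Statements extracted from the response in the example above.
statements = [
    "Taipei 101 has a height of 508 meters",
    "Taipei 101 was completed in 2004",
    "Taipei 101 was once the world's tallest building",
]

# Hand-assigned labels: is each statement backed by the retrieval context?
supported_by_context = [True, True, False]

score = faithfulness_score(supported_by_context)
unsupported = [s for s, ok in zip(statements, supported_by_context) if not ok]
contains_hallucination = bool(unsupported)

print(f"Faithfulness: {score:.2f}")
print("Unsupported statements:", unsupported)
print("Contains hallucination:", contains_hallucination)
```

Faithfulness yields a proportion, while the hallucination check only asks whether the list of unsupported statements is non-empty, which is why the two metrics move in opposite directions.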

Feature Support Comparison

Metric
Real-time Monitoring
Automated Testing

Faithfulness

Answer Relevancy

Context Precision

Contextual Relevancy

Context Recall

⚠️

Answer Correctness

⚠️

Answer Similarity

⚠️

Bias

Toxicity

Hallucination

⚠️ Coming soon

Score Interpretation

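The following ranges apply to positive metrics, where higher is better. For a negative metric such as Hallucination, lower is better.

  • Below 0.5 is generally considered to need improvement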

  • 0.6-0.7 is an acceptable range

  • Above 0.8 is considered good performance

  • Above 0.9 is excellent performance
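
The thresholds above can be turned into a small lookup for reports or dashboards. This is a sketch under assumptions: it treats scores as falling between 0 and 1, applies only to higher-is-better metrics, and returns "unspecified" for values that fall between the listed ranges (for example 0.55 or 0.75), because the guide does not label them. The function name `interpret_score` is hypothetical.

```python
def interpret_score(score: float) -> str:
    """Map a higher-is-better score to the labels listed above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score is assumed to lie between 0 and 1")
    if score > 0.9:
        return "excellent"
    if score > 0.8:
        return "good"
    if 0.6 <= score <= 0.7:
        return "acceptable"
    if score < 0.5:
        return "needs improvement"
    # Ranges the guide does not label, such as 0.5-0.6 and 0.7-0.8.
    return "unspecified"


for value in (0.3, 0.65, 0.85, 0.95):
    print(value, interpret_score(value))
```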

Identifying Causes of Low Scores and Solutions

  • LLM capability: the LLM is unable to answer questions based on the reference materials

    • Solution: Switch to a more capable LLM, or use the new version's configurable evaluation LLM feature

  • RAG retrieval capability: data relevant to the question was not retrieved

    • Solution: Contact MaiAgent official support

  • Knowledge base coverage: the knowledge base does not contain sufficient data

    • Solution: Add correct knowledge base data and FAQ entries
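
One way to narrow down the cause is to count which influencing factors are shared by the low-scoring metrics. The sketch below uses the Influencing Factors column from the Scoring Metrics table. The function name `likely_causes` and the 0.5 default threshold (taken from the "needs improvement" range) are illustrative choices, and Bias, Toxicity, and Hallucination are left out so that only higher-is-better metrics are compared against the threshold.

```python
from collections import Counter

# Influencing factors per metric, as listed in the Scoring Metrics table.
INFLUENCING_FACTORS = {
    "Faithfulness": ["LLM", "RAG", "Knowledge Base"],
    "Answer Relevancy": ["LLM", "RAG", "Knowledge Base"],
    "Context Precision": ["RAG", "Knowledge Base"],
    "Contextual Relevancy": ["RAG", "Knowledge Base"],
    "Context Recall": ["RAG", "Knowledge Base"],
    "Answer Correctness": ["LLM", "RAG", "Knowledge Base"],
    "Answer Similarity": ["LLM", "RAG", "Knowledge Base"],
}


def likely_causes(scores: dict, threshold: float = 0.5) -> list:
    """Count how many low-scoring metrics implicate each factor."""
    counts = Counter()
    for metric, score in scores.items():
        if metric in INFLUENCING_FACTORS and score < threshold:
            counts.update(INFLUENCING_FACTORS[metric])
    # Factors implicated by the most low-scoring metrics come first.
    return counts.most_common()


scores = {
    "Faithfulness": 0.9,
    "Context Precision": 0.3,
    "Context Recall": 0.4,
    "Answer Correctness": 0.45,
}
print(likely_causes(scores))
```

In this example RAG and Knowledge Base are each implicated three times and the LLM once, which points toward retrieval or knowledge base content rather than the model.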

Best Practices

Using Flexible Ground Truth

When standard answers are not available, you can still:

  1. Start with basic evaluation (metrics that don't require Ground Truth)

  2. Observe AI assistant's response patterns

  3. Gradually establish evaluation standards based on actual performance

  4. Supplement Ground Truth for complete evaluation

Leverage Parallel Processing

For optimal evaluation performance:

  • Evaluate multiple test cases at once (10 or more is recommended)

  • Avoid overly frequent small batch evaluations

  • Consider performing large evaluations during off-peak hours
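
If test cases are collected incrementally, they can be grouped before submission so that each evaluation run contains at least ten. A minimal sketch, assuming a hypothetical helper named `make_batches`; the batch size of 10 comes from the recommendation above, and folding a short remainder into the previous batch is a design choice made here, not a platform rule.

```python
def make_batches(cases: list, min_batch_size: int = 10) -> list:
    """Group cases into batches, avoiding a small trailing batch."""
    if min_batch_size < 1:
        raise ValueError("min_batch_size must be at least 1")
    batches = [
        cases[i:i + min_batch_size]
        for i in range(0, len(cases), min_batch_size)
    ]
    # Fold a short final batch into the previous one instead of
    # submitting it as a separate small evaluation.
    if len(batches) > 1 and len(batches[-1]) < min_batch_size:
        tail = batches.pop()
        batches[-1].extend(tail)
    return batches


pending_cases = [f"case-{i}" for i in range(25)]
print([len(batch) for batch in make_batches(pending_cases)])  # [10, 15]
```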

Technical Resources
