Response Evaluation and Monitoring Results
Last Updated: 2025-11-14
Overview
MaiAgent uses the Deepeval framework for response evaluation and has been upgraded to Deepeval 3.7.0, providing more powerful evaluation capabilities and more flexible configuration options.
Version Update Notes
Deepeval 3.7.0 New Features
The MaiAgent platform has been upgraded to Deepeval 3.7.0, bringing the following important improvements:
1. Configurable Evaluation LLM
In the new version, you can customize the large language model (LLM) used for evaluation:
Flexible Selection: No longer limited to a specific LLM for evaluation
Cost Optimization: Choose more economical models for evaluation to reduce operational costs
Performance Tuning: Select an appropriate balance between model speed and accuracy based on evaluation needs
Configuration Example:
# Specify the LLM to use in evaluation settings
evaluation_config = {
    "evaluation_model": "gpt-4",  # or other supported models
    "temperature": 0.0,
    "max_tokens": 1000
}
2. Flexible Handling of Empty Ground Truth
The new version improves handling of empty or missing Ground Truth:
Auto-adaptation: When test cases don't provide Ground Truth, the system automatically adjusts evaluation strategy
Partial Evaluation: Other dimensions can still be evaluated even when complete Ground Truth is missing
Friendly Prompts: Clearly indicates which evaluation metrics cannot be calculated due to missing Ground Truth
Applicable Scenarios:
Exploratory testing phase where standard answers haven't been defined yet
Open-ended Q&A scenarios without a single correct answer
Quick verification of AI assistant's basic response capabilities
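The partial-evaluation behavior described above can be sketched as follows. This is a minimal illustration, not MaiAgent's or Deepeval's actual implementation: the function name `plan_metrics` is hypothetical, and the sketch assumes that only the metrics whose descriptions mention Ground Truth (Context Recall, Answer Correctness, Answer Similarity) depend on it.

```python
from typing import Optional

# Assumption: only metrics described in terms of Ground Truth require it.
GROUND_TRUTH_METRICS = {"Context Recall", "Answer Correctness", "Answer Similarity"}

ALL_METRICS = [
    "Faithfulness", "Answer Relevancy", "Context Precision",
    "Contextual Relevancy", "Context Recall", "Answer Correctness",
    "Answer Similarity", "Bias", "Toxicity", "Hallucination",
]


def plan_metrics(ground_truth: Optional[str]) -> tuple[list[str], list[str]]:
    """Split metrics into (runnable, skipped) for a single test case."""
    has_ground_truth = bool(ground_truth and ground_truth.strip())
    runnable, skipped = [], []
    for metric in ALL_METRICS:
        if metric in GROUND_TRUTH_METRICS and not has_ground_truth:
            skipped.append(metric)  # cannot be calculated without Ground Truth
        else:
            runnable.append(metric)
    return runnable, skipped


runnable, skipped = plan_metrics(ground_truth=None)
print("Evaluated:", ", ".join(runnable))
print("Skipped (no Ground Truth):", ", ".join(skipped))
```

Reporting the skipped list alongside the results mirrors the "Friendly Prompts" behavior: the reader can see which scores are missing and why.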
3. Parallel Processing Enhances Evaluation Performance
Deepeval 3.7.0 introduces parallel processing mechanisms, significantly improving evaluation speed:
Batch Evaluation: Multiple test cases can be evaluated simultaneously
Performance Improvement: Evaluation speed increased 2-3x compared to the old version
Resource Optimization: More efficient use of computational resources
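Performance Comparison:
| Test Cases | Old Version | Deepeval 3.7.0 | Speedup |
| --- | --- | --- | --- |
| 10 | 45 seconds | 18 seconds | 2.5x |
| 50 | 3.5 minutes | 1.5 minutes | 2.3x |
| 100 | 7 minutes | 3 minutes | 2.3x |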
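To illustrate why evaluating test cases concurrently helps, here is a minimal sketch using Python's standard thread pool. It does not show Deepeval's internal mechanism: `evaluate_case` is a hypothetical stand-in that sleeps briefly to mimic the latency of an evaluation LLM call, and the worker count is an arbitrary assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def evaluate_case(case: dict) -> dict:
    """Hypothetical stand-in for one evaluation; sleeps to mimic LLM latency."""
    time.sleep(0.05)
    return {"id": case["id"], "score": 1.0 if case["response"] else 0.0}


def evaluate_batch(cases: list[dict], max_workers: int = 8) -> list[dict]:
    """Evaluate cases concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate_case, cases))


cases = [{"id": i, "response": f"answer {i}"} for i in range(16)]

start = time.perf_counter()
sequential = [evaluate_case(c) for c in cases]
sequential_s = time.perf_counter() - start

start = time.perf_counter()
parallel = evaluate_batch(cases)
parallel_s = time.perf_counter() - start

print(f"sequential: {sequential_s:.2f}s, parallel: {parallel_s:.2f}s")
```

Because each evaluation mostly waits on a remote model, threads are enough to overlap the waiting time. The actual speedup depends on rate limits and model latency.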
Upgrade Recommendations
If you are using the old version of evaluation features, consider the following upgrade strategy:
Review existing evaluation settings: Confirm currently used evaluation parameters
Test new configuration options: Try using the new LLM configuration features
Optimize test cases: Leverage flexible Ground Truth handling to expand test coverage
Monitor performance improvements: Observe speed improvements from parallel processing
View Response Evaluation Results
The response evaluation feature is located in the AgentOps module, providing two viewing methods:
Real-time Monitoring
AgentOps → AI Assistant Monitoring
Scores are calculated in real time for each conversation, so you can monitor the response quality of AI assistants that are already live.
Automated Testing
AgentOps → Automated Testing
Use test sets to perform batch evaluations, generating complete reports and improvement suggestions, suitable for quality verification before version releases.
Scoring Metrics
The MaiAgent platform provides response evaluation functionality, recording and automatically scoring each Q&A session. Scores include:
| Metric | Description | Evaluation Target | |
| --- | --- | --- | --- |
| Faithfulness | Whether the LLM answers truthfully rather than fabricating answers | LLM, RAG, Knowledge Base | ✅ ✅ |
| Answer Relevancy | Whether the LLM answers to the point, without omissions or redundant text | LLM, RAG, Knowledge Base | ✅ ✅ |
| Context Precision | Whether RAG-retrieved content is relevant to the question | RAG, Knowledge Base | ✅ ✅ |
| Contextual Relevancy | Overall relevance of retrieved content to the question | RAG, Knowledge Base | ✅ ✅ |
| Context Recall | Whether the RAG-retrieved content covers all the information in the Ground Truth | RAG, Knowledge Base | ✅ ✅ |
| Answer Correctness | Correctness of the response compared to Ground Truth | LLM, RAG, Knowledge Base | ✅ ✅ |
| Answer Similarity | Semantic similarity of the response to Ground Truth | LLM, RAG, Knowledge Base | ✅ ✅ |
| Bias | Detects whether the response contains gender, racial, religious, or other biases | LLM | ✅ |
| Toxicity | Detects whether the response contains harmful or offensive content | LLM | ✅ |
| Hallucination | Detects whether the response contains fabricated information inconsistent with context | LLM, RAG | ✅ ✅ |
Difference Between Faithfulness and Hallucination Detection
These two metrics are often confused, but they evaluate from different angles:
Faithfulness: Measures "what proportion" of the response content is based on retrieved context, a positive metric (higher is better)
Hallucination: Detects whether the response "contains" content that contradicts or cannot be verified by context, a negative metric (lower is better)
| | Faithfulness | Hallucination |
| --- | --- | --- |
| Direction | Positive (higher is better) | Negative (lower is better) |
| Evaluation Question | How much of the answer is based on sources? | Does the answer contain fabricated content? |
| Calculation Method | Verifiable statements ÷ Total statements | Detects existence of fabricated information |
Example
Retrieval Context: "Taipei 101 has a height of 508 meters and was completed in 2004"
Response: "Taipei 101 has a height of 508 meters, was completed in 2004, and was once the world's tallest building"
Faithfulness score drops to about 0.67: only 2 of the 3 statements in the response are supported by the context
Hallucination score is high: "was once the world's tallest building" is not mentioned in the context and is considered hallucinated content
In short, Faithfulness focuses on "degree of fidelity to sources," while Hallucination focuses on "whether there is fabrication." They are related but not the same: low Faithfulness doesn't necessarily mean hallucination, but hallucination will definitely lower Faithfulness.
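The Taipei 101 example can be worked through in a few lines. This sketch assumes the response has already been split into statements and that each one has been labeled as supported or not. Here the labels are assigned by hand, whereas in practice the evaluation LLM would make that judgment.

```python
# Each statement in the response, hand-labeled as supported by the context or not.
# Assumption: a real evaluation would have an LLM extract and judge these statements.
statements = [
    ("Taipei 101 has a height of 508 meters", True),
    ("Taipei 101 was completed in 2004", True),
    ("Taipei 101 was once the world's tallest building", False),
]

# Faithfulness: verifiable statements ÷ total statements.
supported = sum(1 for _, is_supported in statements if is_supported)
faithfulness = supported / len(statements)

# Hallucination: does any statement lack support in the context?
unsupported = [text for text, is_supported in statements if not is_supported]
has_hallucination = len(unsupported) > 0

print(f"Faithfulness: {supported}/{len(statements)} = {faithfulness:.2f}")
print("Hallucination detected:", has_hallucination)
print("Unsupported statements:", unsupported)
```

The first line printed is `Faithfulness: 2/3 = 0.67`, matching the 2/3 figure in the example.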
Feature Support Comparison
| Metric | Support |
| --- | --- |
| Faithfulness | ✅ ✅ |
| Answer Relevancy | ✅ ✅ |
| Context Precision | ✅ ✅ |
| Contextual Relevancy | ✅ |
| Context Recall | ⚠️ ✅ |
| Answer Correctness | ⚠️ |
| Answer Similarity | ⚠️ |
| Bias | ✅ |
| Toxicity | ✅ |
| Hallucination | ✅ |
⚠️ Coming soon
Score Interpretation
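For positive metrics (higher is better), scores can be read roughly as follows:
Below 0.5: generally considered to need improvement
0.6-0.7: acceptable range
Above 0.8: good performance
Above 0.9: excellent performance
For Hallucination, a negative metric, lower scores are better.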
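The bands above can be turned into a small helper for reports or alerts. This is a sketch rather than a platform feature: the function name `interpret_score` is hypothetical, it applies only to positive metrics (higher is better), and the handling of the ranges the text leaves undefined (0.5-0.6 and 0.7-0.8) is an assumption marked in the comments.

```python
def interpret_score(score: float) -> str:
    """Map a positive-metric score in [0, 1] to the bands listed above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score > 0.9:
        return "excellent"
    if score > 0.8:
        return "good"
    if score >= 0.6:
        return "acceptable"  # assumption: 0.7-0.8 is also treated as acceptable
    if score >= 0.5:
        return "borderline"  # assumption: the text leaves 0.5-0.6 undefined
    return "needs improvement"


for value in (0.95, 0.85, 0.65, 0.55, 0.30):
    print(f"{value:.2f} -> {interpret_score(value)}")
```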
Identifying Causes of Low Scores and Solutions
LLM capability: the LLM is unable to answer questions based on the reference materials
Solution: Switch to a more capable LLM, or use the new version's configurable evaluation LLM feature
RAG retrieval: data relevant to the question was not retrieved
Solution: Contact MaiAgent official support
Knowledge base coverage: the knowledge base does not contain sufficient data
Solution: Add correct knowledge base data and FAQ entries
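One way to narrow down the cause is to group low-scoring metrics by the evaluation targets listed under Scoring Metrics. The sketch below is a rough triage aid under stated assumptions, not a MaiAgent feature: `suspect_components` is a hypothetical name, the 0.5 threshold is taken from Score Interpretation, and only positive metrics (higher is better) are included.

```python
# Evaluation targets per metric, copied from the Scoring Metrics list.
# Assumption: only positive metrics are included here.
METRIC_TARGETS = {
    "Faithfulness": ("LLM", "RAG", "Knowledge Base"),
    "Answer Relevancy": ("LLM", "RAG", "Knowledge Base"),
    "Context Precision": ("RAG", "Knowledge Base"),
    "Contextual Relevancy": ("RAG", "Knowledge Base"),
    "Context Recall": ("RAG", "Knowledge Base"),
    "Answer Correctness": ("LLM", "RAG", "Knowledge Base"),
    "Answer Similarity": ("LLM", "RAG", "Knowledge Base"),
}


def suspect_components(
    scores: dict[str, float], threshold: float = 0.5
) -> dict[str, list[str]]:
    """Group metrics scoring below the threshold by the components they evaluate."""
    suspects: dict[str, list[str]] = {}
    for metric, score in scores.items():
        if score < threshold:
            for target in METRIC_TARGETS.get(metric, ()):
                suspects.setdefault(target, []).append(metric)
    return suspects


scores = {
    "Faithfulness": 0.90,
    "Answer Relevancy": 0.85,
    "Context Precision": 0.40,
    "Context Recall": 0.30,
}
for component, metrics in suspect_components(scores).items():
    print(f"{component}: low on {', '.join(metrics)}")
```

In this example only the retrieval-side metrics are low, so RAG and the knowledge base are flagged while the LLM is not. That points toward the second and third causes in the list above.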
Best Practices
Using Flexible Ground Truth
When standard answers are not available, you can still:
Start with basic evaluation (metrics that don't require Ground Truth)
Observe AI assistant's response patterns
Gradually establish evaluation standards based on actual performance
Supplement Ground Truth for complete evaluation
Leverage Parallel Processing
For optimal evaluation performance:
Evaluate multiple test cases at once (10 or more)
Avoid overly frequent small batch evaluations
Consider performing large evaluations during off-peak hours
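The batching advice above can be sketched as a small queue that holds test cases until at least 10 are pending. The class name `EvaluationQueue` and its methods are hypothetical, and the sketch leaves out how a released batch is actually submitted for evaluation.

```python
from typing import Any


class EvaluationQueue:
    """Collects test cases and releases them in batches of a minimum size."""

    def __init__(self, min_batch_size: int = 10) -> None:
        self.min_batch_size = min_batch_size
        self._pending: list[Any] = []

    def add(self, case: Any) -> list[Any]:
        """Queue a case; return a full batch once enough are pending, else []."""
        self._pending.append(case)
        if len(self._pending) >= self.min_batch_size:
            return self.flush()
        return []

    def flush(self) -> list[Any]:
        """Return everything still pending (e.g. at the end of a run) and reset."""
        batch, self._pending = self._pending, []
        return batch


queue = EvaluationQueue()
batches = [batch for i in range(25) if (batch := queue.add({"id": i}))]
leftover = queue.flush()
print(f"{len(batches)} full batches, {len(leftover)} cases left over")
```

Calling `flush()` at the end ensures the remaining cases are still evaluated, just in one final smaller batch rather than many tiny ones.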
Technical Resources
