Response Evaluation and Monitoring Results
View Response Evaluation Results
Navigate to: AI Assistant → Configuration → Response Evaluation
Scoring Metrics
The MaiAgent platform records each Q&A session and scores it automatically. The following metrics are scored:
Faithfulness
Whether the LLM answers truthfully without fabricating information
LLM, RAG, Knowledge Base
✅
✅
Answer Relevancy
Whether the LLM's answer is on point, complete, and free of redundant text
LLM, RAG, Knowledge Base
✅
✅
Context Precision
Whether the RAG-retrieved content is relevant to the question
RAG, Knowledge Base
✅
✅
Contextual Relevancy
Overall relevance between retrieved content and question
RAG, Knowledge Base
✅
✅
Context Recall
Whether the RAG-retrieved content covers all the information contained in the correct answer
RAG, Knowledge Base
✅
✅
Answer Correctness
Correctness of the response compared to the correct answer
LLM, RAG, Knowledge Base
✅
✅
Answer Similarity
Semantic similarity between response and correct answer
LLM, RAG, Knowledge Base
✅
✅
Bias
Detects whether the answer contains gender, racial, religious, or other biases
LLM
✅
Toxicity
Detects whether the answer contains harmful or offensive content
LLM
✅
Hallucination
Detects whether the answer contains fabricated information inconsistent with the context
LLM, RAG
✅
✅
Difference Between Faithfulness and Hallucination
These two metrics are often confused, but they evaluate from different angles:
Faithfulness: Measures what proportion of the answer is grounded in the retrieved context. It is a positive metric (higher is better).
Hallucination: Detects whether the answer contains content that contradicts the context or cannot be verified from it. It is a negative metric (lower is better).
Aspect | Faithfulness | Hallucination
Direction | Positive (higher is better) | Negative (lower is better)
Evaluation Question | How much of the answer is based on sources? | Does the answer contain fabricated content?
Calculation Method | Verifiable statements ÷ Total statements | Detects presence of fabricated information
Example
Retrieved Context: "Taipei 101 has a height of 508 meters and was completed in 2004"
Answer: "Taipei 101 has a height of 508 meters, was completed in 2004, and was once the world's tallest building"
Low Faithfulness score: only 2 of the 3 statements in the answer are supported by the context
High Hallucination score: "was once the world's tallest building" is not mentioned in the context, so it is treated as hallucinated content
In short, Faithfulness focuses on "degree of adherence to sources," while Hallucination focuses on "whether fabrication exists." The two are related but different: low Faithfulness doesn't necessarily mean hallucination, but hallucination will definitely lead to lower Faithfulness.
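The Taipei 101 example can be worked through in a few lines of Python. This is a minimal sketch, not MaiAgent's implementation: the function names are hypothetical, and the split into statements and the supported/unsupported labels are entered by hand here, whereas an evaluator would have to derive them from the answer and the context.

```python
# Hand-labelled statements from the Taipei 101 example.
# Each entry is (statement, supported_by_context). The labels are
# assumptions made for illustration, not output from the platform.
statements = [
    ("Taipei 101 has a height of 508 meters", True),
    ("was completed in 2004", True),
    ("was once the world's tallest building", False),
]


def faithfulness(labelled):
    """Verifiable statements divided by total statements."""
    if not labelled:
        raise ValueError("at least one statement is required")
    supported = sum(1 for _, is_supported in labelled if is_supported)
    return supported / len(labelled)


def unsupported_statements(labelled):
    """Statements with no support in the context (hallucination candidates)."""
    return [text for text, is_supported in labelled if not is_supported]


score = faithfulness(statements)
flagged = unsupported_statements(statements)

print(f"Faithfulness: {score:.2f}")
print("Not supported by the context:", flagged)
```

The ratio reproduces the 2/3 figure from the example, and the single flagged statement is the one the example treats as hallucinated content.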
Feature Support Comparison
Faithfulness
✅
✅
Answer Relevancy
✅
✅
Context Precision
✅
✅
Contextual Relevancy
✅
Context Recall
⚠️
✅
Answer Correctness
⚠️
Answer Similarity
⚠️
Bias
✅
Toxicity
✅
Hallucination
✅
⚠️ = Coming soon
Score Interpretation
Below 0.5 is generally considered to need improvement
0.6-0.7 is an acceptable range
Above 0.8 is considered good performance
Above 0.9 is excellent performance
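The bands above can be expressed as a small lookup. The sketch below is illustrative only: `interpret_score` is a hypothetical helper, scores are assumed to lie between 0 and 1, and the bands are assumed to apply to higher-is-better metrics. The ranges that the list does not label (0.5 up to 0.6, and above 0.7 up to 0.8) are reported as "unspecified" rather than guessed.

```python
def interpret_score(score: float) -> str:
    """Map a score to the band names listed above.

    Assumes scores fall in [0, 1]. Values in ranges the list does not
    label are returned as "unspecified" instead of being assigned a band.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score is assumed to be between 0 and 1")
    if score > 0.9:
        return "excellent"
    if score > 0.8:
        return "good"
    if 0.6 <= score <= 0.7:
        return "acceptable"
    if score < 0.5:
        return "needs improvement"
    return "unspecified"


for value in (0.3, 0.65, 0.75, 0.85, 0.95):
    print(value, interpret_score(value))
```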
Identifying Causes of Low Scores and Solutions
LLM capability: the model is unable to answer questions based on the reference materials
Solution: Switch to a more capable LLM
RAG retrieval capability: relevant data is not being retrieved for the question
Solution: Contact MaiAgent official support
Knowledge base coverage: the knowledge base does not provide sufficient data
Solution: Supplement the knowledge base with correct data and FAQ entries
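One way to turn the causes above into a triage step is sketched below. It rests on several assumptions: the component lists from the Scoring Metrics section are read as "components that can influence this metric", only a subset of the higher-is-better metrics is included (Hallucination is left out because lower is better for it), and the 0.5 threshold is taken from the Score Interpretation section. The names `METRIC_COMPONENTS`, `SOLUTIONS`, and `suggest_actions` are hypothetical.

```python
from collections import Counter

# Components listed next to each metric in the Scoring Metrics section
# (subset shown; reading them as "possible causes" is an assumption).
METRIC_COMPONENTS = {
    "Faithfulness": ["LLM", "RAG", "Knowledge Base"],
    "Answer Relevancy": ["LLM", "RAG", "Knowledge Base"],
    "Context Precision": ["RAG", "Knowledge Base"],
    "Contextual Relevancy": ["RAG", "Knowledge Base"],
    "Context Recall": ["RAG", "Knowledge Base"],
}

# Solutions as given in the list above.
SOLUTIONS = {
    "LLM": "Switch to a more capable LLM",
    "RAG": "Contact MaiAgent official support",
    "Knowledge Base": "Supplement the knowledge base with correct data and FAQ entries",
}


def suggest_actions(scores, threshold=0.5):
    """Return (component, solution) pairs, most frequently implicated first."""
    implicated = Counter()
    for metric, score in scores.items():
        if score < threshold:
            implicated.update(METRIC_COMPONENTS.get(metric, []))
    return [(name, SOLUTIONS[name]) for name, _ in implicated.most_common()]


example_scores = {"Faithfulness": 0.9, "Context Precision": 0.3, "Context Recall": 0.4}
for component, solution in suggest_actions(example_scores):
    print(f"{component}: {solution}")
```

In this example only the retrieval-related metrics fall below the threshold, so the LLM is not implicated and the suggestions point at retrieval and the knowledge base.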
