Response Evaluation and Monitoring Results

View Response Evaluation Results

AI Assistant → Configuration → Response Evaluation

Scoring Metrics

The MaiAgent platform records each Q&A session and scores it automatically. The scores cover the following metrics:

| Metric | Description | Influencing Factors | Question | Response | Retrieved Context | Correct Answer |
| --- | --- | --- | --- | --- | --- | --- |
| Faithfulness | Whether the LLM answers truthfully without fabricating information | LLM, RAG, Knowledge Base | | | | |
| Answer Relevancy | Whether the LLM answers to the point, is complete, and doesn't contain redundant text | LLM, RAG, Knowledge Base | | | | |
| Context Precision | Whether the RAG-retrieved content is relevant to the question | RAG, Knowledge Base | | | | |
| Contextual Relevancy | Overall relevance between retrieved content and question | RAG, Knowledge Base | | | | |
| Context Recall | Whether the RAG-retrieved content includes all data compared to the correct answer | RAG, Knowledge Base | | | | |
| Answer Correctness | Correctness of the response compared to the correct answer | LLM, RAG, Knowledge Base | | | | |
| Answer Similarity | Semantic similarity between response and correct answer | LLM, RAG, Knowledge Base | | | | |
| Bias | Detects whether the answer contains gender, racial, religious, or other biases | LLM | | | | |
| Toxicity | Detects whether the answer contains harmful or offensive content | LLM | | | | |
| Hallucination | Detects whether the answer contains fabricated information inconsistent with the context | LLM, RAG | | | | |

Difference Between Faithfulness and Hallucination

These two metrics are often confused, but they evaluate from different angles:

  • Faithfulness: Measures the proportion of the answer that is based on the retrieved context. It is a positive metric (higher is better).

  • Hallucination: Detects whether the answer contains content that contradicts the context or cannot be verified from it. It is a negative metric (lower is better).

| Aspect | Faithfulness | Hallucination |
| --- | --- | --- |
| Direction | Positive (higher is better) | Negative (lower is better) |
| Evaluation Question | How much of the answer is based on sources? | Does the answer contain fabricated content? |
| Calculation Method | Verifiable statements ÷ Total statements | Detects presence of fabricated information |

Example

Retrieved Context: "Taipei 101 has a height of 508 meters and was completed in 2004"

Answer: "Taipei 101 has a height of 508 meters, was completed in 2004, and was once the world's tallest building"

  • Lower Faithfulness score: only 2 of the 3 statements in the answer (about 0.67) are supported by the context

  • Higher Hallucination score: "was once the world's tallest building" is not mentioned in the context, so it is treated as hallucinated content

In short, Faithfulness focuses on "degree of adherence to sources," while Hallucination focuses on "whether fabrication exists." The two are related but different: low Faithfulness doesn't necessarily mean hallucination, but hallucination will definitely lead to lower Faithfulness.
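The Taipei 101 example can be worked through in a few lines of Python. This is a minimal sketch of the "verifiable statements ÷ total statements" calculation only. The `Statement` class, the function names, and the hand-assigned `supported` flags are illustrative assumptions; how the platform actually splits an answer into statements and checks each one against the context is not described here.

```python
from dataclasses import dataclass


@dataclass
class Statement:
    text: str
    supported: bool  # hand-labeled here: True if the retrieved context backs the statement


def faithfulness(statements):
    """Verifiable statements divided by total statements."""
    if not statements:
        raise ValueError("the answer has no statements to score")
    return sum(s.supported for s in statements) / len(statements)


def unsupported(statements):
    """Statements the context does not back (candidates for hallucination)."""
    return [s.text for s in statements if not s.supported]


# The answer from the example, split into three statements by hand.
answer = [
    Statement("Taipei 101 has a height of 508 meters", True),
    Statement("Taipei 101 was completed in 2004", True),
    Statement("Taipei 101 was once the world's tallest building", False),
]

score = faithfulness(answer)
flagged = unsupported(answer)

print(round(score, 2))  # 0.67
print(flagged)
```

The same labeled statements feed both views: the ratio gives the Faithfulness figure, and the unsupported list shows the content a Hallucination check would flag.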

Feature Support Comparison

Metric
Real-time Monitoring
Automated Testing

Faithfulness

Answer Relevancy

Context Precision

Contextual Relevancy

Context Recall

⚠️

Answer Correctness

⚠️

Answer Similarity

⚠️

Bias

Toxicity

Hallucination

⚠️ Coming soon

Score Interpretation

The ranges below apply to metrics where a higher score is better:

  • Below 0.5: generally needs improvement

  • 0.6 to 0.7: acceptable

  • Above 0.8: good performance

  • Above 0.9: excellent performance
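The ranges above can be turned into a small lookup helper. This is a sketch under two assumptions: scores fall between 0 and 1, and the metric is one where higher is better. The function name `interpret_score` is hypothetical. Scores that fall between the stated ranges (for example 0.55 or 0.75) are reported as "unspecified" rather than guessed.

```python
def interpret_score(score: float) -> str:
    """Map a higher-is-better score to the stated interpretation ranges."""
    if not 0.0 <= score <= 1.0:  # assumption: scores are normalized to 0..1
        raise ValueError("score is expected to be between 0 and 1")
    if score > 0.9:
        return "excellent"
    if score > 0.8:
        return "good"
    if 0.6 <= score <= 0.7:
        return "acceptable"
    if score < 0.5:
        return "needs improvement"
    # The stated ranges do not cover this value.
    return "unspecified"


for value in (0.95, 0.85, 0.65, 0.3, 0.75):
    print(value, interpret_score(value))
```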

Identifying Causes of Low Scores and Solutions

  • LLM capability: the LLM cannot answer the question based on the reference materials

    • Solution: Switch to a more capable LLM

  • RAG retrieval: relevant data is not retrieved for the question

    • Solution: Contact MaiAgent official support

  • Knowledge base coverage: the knowledge base does not provide sufficient data

    • Solution: Add correct data and FAQ entries to the knowledge base
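One way to narrow down the cause is to combine the low-scoring metrics with the Influencing Factors listed in the metrics table. The sketch below counts how often each factor appears among metrics scoring below 0.5. The counting heuristic, the function name `likely_causes`, and the sample scores are illustrative assumptions, not a documented platform feature. Only higher-is-better metrics are included.

```python
from collections import Counter

# Influencing factors per metric, copied from the Scoring Metrics table.
INFLUENCING_FACTORS = {
    "Faithfulness": ["LLM", "RAG", "Knowledge Base"],
    "Answer Relevancy": ["LLM", "RAG", "Knowledge Base"],
    "Context Precision": ["RAG", "Knowledge Base"],
    "Contextual Relevancy": ["RAG", "Knowledge Base"],
    "Context Recall": ["RAG", "Knowledge Base"],
    "Answer Correctness": ["LLM", "RAG", "Knowledge Base"],
    "Answer Similarity": ["LLM", "RAG", "Knowledge Base"],
}

# Suggested solution per factor, from the list above.
SOLUTIONS = {
    "LLM": "Switch to a more capable LLM",
    "RAG": "Contact MaiAgent official support",
    "Knowledge Base": "Add correct data and FAQ entries to the knowledge base",
}


def likely_causes(scores, threshold=0.5):
    """Count how many low-scoring metrics each factor influences."""
    low = [m for m, s in scores.items()
           if m in INFLUENCING_FACTORS and s < threshold]
    counts = Counter(f for m in low for f in INFLUENCING_FACTORS[m])
    return dict(counts)


# Hypothetical scores for one Q&A session.
sample = {
    "Faithfulness": 0.9,
    "Context Precision": 0.4,
    "Context Recall": 0.3,
    "Answer Correctness": 0.45,
}

causes = likely_causes(sample)
for factor, count in sorted(causes.items(), key=lambda kv: -kv[1]):
    print(f"{factor}: {count} low metric(s) -> {SOLUTIONS[factor]}")
```

In this sample the retrieval-side metrics are low while Faithfulness is high, so RAG and the knowledge base appear more often than the LLM, which suggests checking those first.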
