# Response Evaluation and Monitoring Results

Last Updated: 2025-11-14

## Overview

MaiAgent uses the Deepeval framework for response evaluation and has upgraded to **Deepeval 3.7.0**, which provides more powerful evaluation capabilities and more flexible configuration options.

## Version Update Notes

### Deepeval 3.7.0 New Features

The MaiAgent platform has been upgraded to Deepeval 3.7.0, bringing the following important improvements:

#### 1. Configurable Evaluation LLM

In the new version, you can customize the large language model (LLM) used for evaluation:

* **Flexible Selection**: No longer limited to a specific LLM for evaluation
* **Cost Optimization**: Choose more economical models for evaluation to reduce operational costs
* **Performance Tuning**: Select an appropriate balance between model speed and accuracy based on evaluation needs

**Configuration Example:**

```python
# Specify the LLM to use in evaluation settings
evaluation_config = {
    "evaluation_model": "gpt-4",  # or other supported models
    "temperature": 0.0,
    "max_tokens": 1000
}
```

#### 2. Flexible Handling of Empty Ground Truth

The new version improves handling of empty or missing Ground Truth:

* **Auto-adaptation**: When test cases don't provide Ground Truth, the system automatically adjusts evaluation strategy
* **Partial Evaluation**: Other dimensions can still be evaluated even when complete Ground Truth is missing
* **Friendly Prompts**: Clearly indicates which evaluation metrics cannot be calculated due to missing Ground Truth

**Applicable Scenarios:**

* Exploratory testing phase where standard answers haven't been defined yet
* Open-ended Q\&A scenarios without a single correct answer
* Quick verification of AI assistant's basic response capabilities
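The fallback behavior described above can be sketched as follows. This is an illustrative sketch with hypothetical helper names, not the Deepeval API: it splits a requested metric list into metrics that can still run and metrics that must be skipped when a test case has no Ground Truth.

```python
# Metrics that cannot be scored without a Ground Truth answer
GROUND_TRUTH_METRICS = {"Context Recall", "Answer Correctness", "Answer Similarity"}

def select_metrics(requested, ground_truth):
    """Split requested metrics into (runnable, skipped) for one test case."""
    if ground_truth:
        return list(requested), []
    runnable = [m for m in requested if m not in GROUND_TRUTH_METRICS]
    skipped = [m for m in requested if m in GROUND_TRUTH_METRICS]
    return runnable, skipped

runnable, skipped = select_metrics(
    ["Faithfulness", "Answer Relevancy", "Answer Correctness"],
    ground_truth=None,  # exploratory test case: no standard answer yet
)
print(runnable)  # ['Faithfulness', 'Answer Relevancy']
print(skipped)   # ['Answer Correctness']
```

The skipped list is what powers the "friendly prompts": the report can name exactly which metrics were omitted and why.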

#### 3. Parallel Processing Enhances Evaluation Performance

Deepeval 3.7.0 introduces parallel processing mechanisms, significantly improving evaluation speed:

* **Batch Evaluation**: Multiple test cases can be evaluated simultaneously
* **Performance Improvement**: Evaluation speed increased 2-3x compared to the old version
* **Resource Optimization**: More efficient use of computational resources

**Performance Comparison:**

| Number of Test Cases | Old Version Time | New Version Time | Performance Gain |
| -------------------- | ---------------- | ---------------- | ---------------- |
| 10                   | 45 seconds       | 18 seconds       | 2.5x             |
| 50                   | 3.5 minutes      | 1.5 minutes      | 2.3x             |
| 100                  | 7 minutes        | 3 minutes        | 2.3x             |
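The speedup in the table comes from evaluating test cases concurrently rather than one at a time. Deepeval's parallelism is internal to the framework; the sketch below only illustrates the principle with a stub scorer standing in for the (I/O-bound) evaluation LLM call.

```python
from concurrent.futures import ThreadPoolExecutor

def score_case(case):
    # Stand-in for one per-test-case evaluation LLM call
    return {"id": case["id"], "score": 0.8}

def evaluate_batch(cases, max_workers=8):
    # Fan test cases out over a thread pool; results keep input order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(score_case, cases))

results = evaluate_batch([{"id": i} for i in range(10)])
print(len(results))  # 10
```

Because each evaluation call mostly waits on a remote LLM, a thread pool is enough to overlap the requests; no process-level parallelism is needed.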

### Upgrade Recommendations

If you are using the old version of evaluation features, consider the following upgrade strategy:

1. **Review existing evaluation settings**: Confirm currently used evaluation parameters
2. **Test new configuration options**: Try using the new LLM configuration features
3. **Optimize test cases**: Leverage flexible Ground Truth handling to expand test coverage
4. **Monitor performance improvements**: Observe speed improvements from parallel processing

## View Response Evaluation Results

The response evaluation feature is located in the **AgentOps** module, providing two viewing methods:

### Real-time Monitoring

AgentOps → AI Assistant Monitoring

Real-time calculation of scores for each conversation, used to monitor the response quality of online AI assistants.

### Automated Testing

AgentOps → Automated Testing

Use test sets to perform batch evaluations, generating complete reports and improvement suggestions, suitable for quality verification before version releases.

## Scoring Metrics

The MaiAgent platform provides response evaluation functionality, recording and automatically scoring each Q\&A session. Scores include:

<table><thead><tr><th>Metric</th><th>Description</th><th>Influencing Factors</th><th>Question</th><th>Response</th><th width="118">Retrieval Context</th><th>Ground Truth</th></tr></thead><tbody><tr><td>Faithfulness</td><td>Whether the LLM answers truthfully rather than fabricating answers</td><td>LLM, RAG, Knowledge Base</td><td></td><td>✅</td><td>✅</td><td></td></tr><tr><td>Answer Relevancy</td><td>Whether the LLM answers the question directly, without being incomplete or padded with redundant text</td><td>LLM, RAG, Knowledge Base</td><td>✅</td><td>✅</td><td></td><td></td></tr><tr><td>Context Precision</td><td>Whether RAG-retrieved content is relevant to the question</td><td>RAG, Knowledge Base</td><td>✅</td><td></td><td>✅</td><td></td></tr><tr><td>Contextual Relevancy</td><td>Overall relevance of retrieved content to the question</td><td>RAG, Knowledge Base</td><td>✅</td><td></td><td>✅</td><td></td></tr><tr><td>Context Recall</td><td>Whether RAG retrieval, measured against Ground Truth, recovered all the necessary data</td><td>RAG, Knowledge Base</td><td></td><td></td><td>✅</td><td>✅</td></tr><tr><td>Answer Correctness</td><td>Correctness of the response compared to Ground Truth</td><td>LLM, RAG, Knowledge Base</td><td></td><td>✅</td><td></td><td>✅</td></tr><tr><td>Answer Similarity</td><td>Semantic similarity of the response to Ground Truth</td><td>LLM, RAG, Knowledge Base</td><td></td><td>✅</td><td></td><td>✅</td></tr><tr><td>Bias</td><td>Detects whether the response contains gender, racial, religious, or other biases</td><td>LLM</td><td></td><td>✅</td><td></td><td></td></tr><tr><td>Toxicity</td><td>Detects whether the response contains harmful or offensive content</td><td>LLM</td><td></td><td>✅</td><td></td><td></td></tr><tr><td>Hallucination</td><td>Detects whether the response contains fabricated information inconsistent with context</td><td>LLM, RAG</td><td></td><td>✅</td><td>✅</td><td></td></tr></tbody></table>
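The input columns of the table above can be encoded directly, which makes it easy to check which metrics a given test case supports. This is a sketch using this document's metric and field names, not a Deepeval API.

```python
# Which inputs each metric consumes, mirroring the table above
REQUIRED_INPUTS = {
    "Faithfulness":         {"response", "retrieval_context"},
    "Answer Relevancy":     {"question", "response"},
    "Context Precision":    {"question", "retrieval_context"},
    "Contextual Relevancy": {"question", "retrieval_context"},
    "Context Recall":       {"retrieval_context", "ground_truth"},
    "Answer Correctness":   {"response", "ground_truth"},
    "Answer Similarity":    {"response", "ground_truth"},
    "Bias":                 {"response"},
    "Toxicity":             {"response"},
    "Hallucination":        {"response", "retrieval_context"},
}

def evaluable_metrics(available):
    """Metrics whose required inputs are all present in `available`."""
    return [m for m, req in REQUIRED_INPUTS.items() if req <= set(available)]

print(evaluable_metrics({"question", "response"}))
# ['Answer Relevancy', 'Bias', 'Toxicity']
```

With only a question and a response (no retrieval context or Ground Truth), just three metrics remain evaluable, which matches the table's check marks.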

### Difference Between Faithfulness and Hallucination Detection

These two metrics are often confused, but they evaluate from different angles:

* **Faithfulness**: Measures "what proportion" of the response content is based on retrieved context, a **positive metric** (higher is better)
* **Hallucination**: Detects whether the response "contains" content that contradicts or cannot be verified by context, a **negative metric** (lower is better)

| Aspect              | Faithfulness                                | Hallucination                               |
| ------------------- | ------------------------------------------- | ------------------------------------------- |
| Direction           | Positive (higher is better)                 | Negative (lower is better)                  |
| Evaluation Question | How much of the answer is based on sources? | Does the answer contain fabricated content? |
| Calculation Method  | Verifiable statements ÷ Total statements    | Detects existence of fabricated information |

**Example**

> **Retrieval Context**: "Taipei 101 has a height of 508 meters and was completed in 2004"
>
> **Response**: "Taipei 101 has a height of 508 meters, was completed in 2004, and was once the world's tallest building"

* **Faithfulness drops to about 0.67**: only 2 of the 3 statements are supported by the context
* **Hallucination is flagged**: "was once the world's tallest building" does not appear in the context and is treated as hallucinated content

In short, Faithfulness focuses on "degree of fidelity to sources," while Hallucination focuses on "whether there is fabrication." They are related but not the same: low Faithfulness doesn't necessarily mean hallucination, but hallucination will definitely lower Faithfulness.
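The Taipei 101 example can be worked through numerically. This is a toy simplification: the real metric uses an LLM to extract and verify claims, whereas here the per-claim verdicts are given by hand.

```python
# Each response statement, paired with whether the retrieval context supports it
claims = [
    ("Taipei 101 has a height of 508 meters", True),   # in context
    ("was completed in 2004", True),                   # in context
    ("was once the world's tallest building", False),  # not in context
]

supported = sum(1 for _, ok in claims if ok)
faithfulness = supported / len(claims)              # 2/3, i.e. ~0.67
has_hallucination = any(not ok for _, ok in claims)  # any unsupported claim

print(round(faithfulness, 2), has_hallucination)  # 0.67 True
```

This also makes the relationship concrete: a single unsupported claim both lowers Faithfulness (the ratio) and trips the Hallucination flag (the existence check).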

## Feature Support Comparison

| Metric               | Real-time Monitoring | Automated Testing |
| -------------------- | :------------------: | :---------------: |
| Faithfulness         |           ✅          |         ✅         |
| Answer Relevancy     |           ✅          |         ✅         |
| Context Precision    |           ✅          |         ✅         |
| Contextual Relevancy |                      |         ✅         |
| Context Recall       |          ⚠️          |         ✅         |
| Answer Correctness   |          ⚠️          |                   |
| Answer Similarity    |          ⚠️          |                   |
| Bias                 |                      |         ✅         |
| Toxicity             |                      |         ✅         |
| Hallucination        |                      |         ✅         |

> ⚠️ Coming soon

## Score Interpretation

* Below 0.5 is generally considered to need improvement
* 0.6-0.7 is an acceptable range
* Above 0.8 is considered good performance
* Above 0.9 is excellent performance
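These bands can be applied mechanically, for example when triaging batch results. One caveat: the document leaves the 0.5–0.6 and 0.7–0.8 ranges unspecified, so treating them as "acceptable" and "acceptable/good boundary" below is an assumption, not a platform rule.

```python
def interpret_score(score):
    """Map a metric score to the rough bands described above."""
    if score >= 0.9:
        return "excellent"
    if score >= 0.8:
        return "good"
    if score >= 0.5:
        # 0.5-0.6 is unspecified in the doc; treated as acceptable here
        return "acceptable"
    return "needs improvement"

print(interpret_score(0.85))  # good
print(interpret_score(0.42))  # needs improvement
```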

## Identifying Causes of Low Scores and Solutions

* **LLM capability**: the model cannot produce a correct answer even from the reference materials
  * Solution: switch to a more capable LLM, or use the configurable evaluation LLM introduced in 3.7.0
* **RAG retrieval**: data relevant to the question was not retrieved
  * Solution: contact MaiAgent official support
* **Knowledge base coverage**: the knowledge base does not contain the needed data
  * Solution: supplement the knowledge base with correct data and FAQ entries

## Best Practices

### Using Flexible Ground Truth

When standard answers are not available, you can still:

1. Start with basic evaluation (metrics that don't require Ground Truth)
2. Observe AI assistant's response patterns
3. Gradually establish evaluation standards based on actual performance
4. Supplement Ground Truth for complete evaluation

### Leverage Parallel Processing

For optimal evaluation performance:

* Recommend evaluating multiple test cases at once (10 or more)
* Avoid overly frequent small batch evaluations
* Consider performing large evaluations during off-peak hours
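The batching advice above can be sketched as a simple grouping helper (illustrative only): collect test cases into batches of at least ~10 so each run benefits from parallel processing instead of incurring overhead on many tiny evaluations.

```python
def make_batches(cases, batch_size=10):
    """Group test cases into consecutive batches of up to batch_size."""
    return [cases[i:i + batch_size] for i in range(0, len(cases), batch_size)]

batches = make_batches(list(range(25)), batch_size=10)
print([len(b) for b in batches])  # [10, 10, 5]
```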

## Technical Resources

* [Deepeval Official Documentation](https://docs.confident-ai.com/)
* [Deepeval 3.7.0 Changelog](https://github.com/confident-ai/deepeval/releases/tag/v3.7.0)
