AgentOps (Automated Testing)
This article introduces how to use the AgentOps feature in the MaiAgent system to help you conduct testing and monitoring operations, systematically improving the relevance and accuracy of AI responses.
What is AgentOps?
As AI assistants become more widely used, user questions are becoming increasingly complex.
A systematic approach is needed to ensure AI assistants can:
Provide correct answers
Avoid inappropriate responses
Maintain stable response times
Protect brand reputation
AgentOps is a feature designed specifically for testing and monitoring AI assistant quality. It runs tests against test question banks you create, helping you improve the relevance and accuracy of the AI system's answers and ensure that, in practical use, it reliably provides correct responses.
Traditional Manual Question-by-Question Testing vs. AgentOps Automated Testing
| | Manual Testing | AgentOps |
| --- | --- | --- |
| Source | Manual question-by-question input | Pre-built test question bank with automatic question retrieval and batch execution |
| Testing Method | Tester manually inputs questions, waits for AI responses, and manually compares answers | System automatically runs all questions, compares them against expected answers, and generates a complete report |
| Speed | Slow; requires question-by-question operation | Fast; single-click execution can run dozens to hundreds of questions at once |
| Coverage | Low; typically only a small number of questions are tested | High; large question banks can be tested in batches, avoiding missed issues |
| Labor Cost | High; requires repeated manual verification | Low; tests run fully automatically and manual work is only needed to review results |
| Use Cases | Minor changes, ad-hoc confirmation, quick results | After model updates, knowledge base modifications, or prompt adjustments, to verify there is no regression |
Core Features of AgentOps
🧩 1. Automated Evaluation
Create test question bank → Batch automated testing → Scoring and comparison → Identify issues → Feedback optimization
The system will run through the entire test question bank at once and output:
Whether each question was successful
Differences between AI responses and expected answers
Response quality scores
Answer relevance
Average response time in seconds
Failed case reports
When you want to verify the quality of your AI assistant's responses, you no longer need to test question by question.
AgentOps will automatically run through all questions based on the "test question bank" you created, compare expected answers, identify non-conforming cases, and finally compile the results into a complete report.
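As a mental model, the evaluation loop can be pictured roughly as follows. This is only a conceptual sketch, not the MaiAgent implementation; `ask_assistant` and `score_answer` are hypothetical placeholders for the assistant call and the scoring step described above.

```python
# Conceptual sketch of a batch evaluation run (not the MaiAgent implementation).
import time

def run_evaluation(test_cases, ask_assistant, score_answer):
    """Run every test case once and collect a simple report."""
    report = []
    for case in test_cases:
        start = time.monotonic()
        answer = ask_assistant(case["question"])        # query the AI assistant
        elapsed = time.monotonic() - start
        # score against the expected answer (ground_truth may be missing)
        scores = score_answer(answer, case.get("ground_truth"))
        report.append({
            "question": case["question"],
            "answer": answer,
            "passed": scores.get("passed", False),
            "scores": scores,                            # relevance, hallucination, ...
            "response_time_s": round(elapsed, 2),
        })
    return report
```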
📡 2. Assistant Monitoring
The system automatically monitors "every real user Q&A" and records:
Whether responses are inappropriate
Whether they comply with company regulations
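Conceptually, monitoring behaves like a logging hook attached to every conversation. The sketch below is illustrative only and not the MaiAgent implementation; `check_compliance` is a hypothetical placeholder for the two checks listed above.

```python
# Illustrative monitoring hook (not the MaiAgent implementation).
import datetime
import json

def log_interaction(question, answer, check_compliance, log_path="monitor.log"):
    """Append one real user Q&A and its compliance result to a log file."""
    record = {
        "time": datetime.datetime.now().isoformat(timespec="seconds"),
        "question": question,
        "answer": answer,
        # e.g. {"inappropriate": False, "violates_company_policy": False}
        "flags": check_compliance(answer),
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```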
Automated Testing vs. Assistant Monitoring
| | Automated Testing | Assistant Monitoring |
| --- | --- | --- |
| Source | Pre-designed test question bank; controllable questions organized by topic | Real user input; varied questions, closer to real-world scenarios |
| Operation Method | Automatically executes tests through the pre-built test question bank; no manual question input needed | Monitors the AI's response to every real user input and records response quality and results |
| Primary Purpose | Systematic testing of AI response quality | Real-time monitoring of AI-user interaction performance |
| Controllability | High; specific questions can be tested repeatedly | Low; question sources are uncontrollable and users may ask unexpected questions |
| Use Cases | Before launch, version updates, prompt adjustments, knowledge base corrections, to verify there is no regression | Long-term monitoring after launch, to understand AI performance in real scenarios |
| Advantages | Batch testing is possible, with stable and comparable results | Can discover real issues that are not in the question bank but occur frequently in practice |
What AgentOps Can Do
Specific Application Scenarios
🏥 Medical Clinic
🏫 School Administration
How to Use AgentOps Features in MaiAgent
1. Create Test Question Bank (First-Time Use)
Step 1: Create New Question Bank
Open "AgentOps" and enter "Test Sets"
Click the "Create Test Set" button

Set the test set name

Step 2: Add Test Questions
Method A: Upload CSV File (Recommended)
Enter the test cases page
Click "Add Test Case"

Select "Import"

Prepare a CSV file in the following format. If there is no standard answer, the ground_truth (standard answer) field can be left empty (this option requires uploading a CSV file):

| Question | ground_truth (standard answer) |
| --- | --- |
| Can I return items after 14 days? | Currently, returns can only be requested within 14 days of receiving the product |
| What payment methods are supported? | Credit card, ATM transfer, cash on delivery (depending on channel availability) |
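If you prefer to build the file programmatically, the following minimal sketch writes the two example rows with Python's csv module. The column header names (question, ground_truth) are an assumption based on the ground_truth field mentioned above; check them against the import template in the UI before uploading.

```python
# Minimal sketch for preparing the test-set CSV locally before uploading.
# Header names are assumptions; verify against the import template in the UI.
import csv

rows = [
    {"question": "Can I return items after 14 days?",
     "ground_truth": "Currently, returns can only be requested within 14 days of receiving the product"},
    {"question": "What payment methods are supported?",
     "ground_truth": "Credit card, ATM transfer, cash on delivery (depending on channel availability)"},
]

with open("test_set.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "ground_truth"])
    writer.writeheader()
    # ground_truth may be left as an empty string when there is no standard answer
    writer.writerows(rows)
```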
| | Without standard answer | With standard answer |
| --- | --- | --- |
| Scoring Criteria | Whether the answer is fluent, complete, and logically correct | Whether it semantically matches the standard answer |
| Discoverable Issues | Off-topic answers, logical errors | Incorrect information, policy inconsistencies, overly verbose answers, missing information |
After uploading the file, the system will import the test cases automatically

Method B: Manual Input
Enter the test cases page
Click "Add Test Case"

Select "Manual Input"

Enter question and expected answer
2. Execute Batch Testing
Step 3: Create Evaluation Task
Enter "AgentOps" → "Automated Testing"
Click "Create Test"

Fill in the information:
Name: E-commerce Customer Service Test - 2025/11/18
Description: (test records; refer to the template below)
Select Test Set: E-commerce Customer Service Test
Select AI Assistant: E-commerce Customer Service

Click "Confirm" to start testing
Step 4: Wait for Test Completion
Status shows "Evaluating"
Status changes to "Completed" after completion
Test Duration: overview of overall evaluation execution time. When a test set exceeds 50 cases, it is recommended to run it in batches.

| Number of Test Cases | Estimated Duration | Scenario |
| --- | --- | --- |
| 10 | Approximately 2-3 minutes | Quick test |
| 20 | Approximately 4-6 minutes | Routine test |
| 30 | Approximately 6-9 minutes | Medium scale |
| 50 | Approximately 10-15 minutes | Complete evaluation |
| 100 | Approximately 20-30 minutes | Large-scale test |
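If your question bank exceeds the recommended size, one simple approach is to split it into several smaller test sets before uploading. A minimal sketch of the idea, assuming the 50-case batch size recommended above:

```python
# Illustrative helper for splitting a large question bank into smaller runs.
def split_into_batches(test_cases, batch_size=50):
    """Yield successive slices so each evaluation run stays within ~50 cases."""
    for start in range(0, len(test_cases), batch_size):
        yield test_cases[start:start + batch_size]

# Example: a 120-question bank yields batches of 50, 50, and 20 cases.
batches = list(split_into_batches(list(range(120))))
print([len(b) for b in batches])  # [50, 50, 20]
```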
Factors Affecting Execution Time
AI Assistant Complexity ⇒ Normal / Agent Mode
Knowledge Base Size ⇒ Larger knowledge bases require longer retrieval times
Network Conditions ⇒ API call delays affect overall time
3. View Results
Step 5: View Evaluation Results
Click evaluation record to enter details page


All Q&A records and scores from the same test set
View key metrics:
Pass Rate: 0% (no test cases passed; none of the test questions matched the preset answers (Ground Truth))
Quality Score: 100 points
Average Response Time: 5.0 seconds
View scores for each metric:
Answer Relevance: 0.88 (needs improvement)
Bias: 0 (excellent)
Offensiveness: 0 (excellent)
Hallucination: 100 (needs improvement)

Individual Q&A scoring details

View the detailed analysis for each individual metric
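For orientation, the headline numbers are simple aggregates over the individual test cases. The sketch below uses a hypothetical per-case record format (a passed flag and a response time in seconds); in MaiAgent these values are shown directly on the evaluation details page.

```python
# Illustrative aggregation over per-case results (record format is hypothetical).
def summarize(results):
    """Compute pass rate and average response time from per-case records."""
    total = len(results)
    passed = sum(1 for r in results if r["passed"])
    avg_time = sum(r["response_time_s"] for r in results) / total
    return {
        "pass_rate": f"{passed / total:.0%}",
        "average_response_time_s": round(avg_time, 1),
    }
```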
Step 6: View Failed Cases
Click "Test Case Details"
View the reasons for low-scoring metrics and the corresponding AI responses
4. Improve AI Assistant
Step 7: Adjust Based on Results
Make improvements based on evaluation results:
Issue 1: Low Answer Relevance (0.88)
Improvement methods:
Adjust role instructions, emphasize that answers need to stay on topic
Check if knowledge base content is complete
Optimize retrieval parameters
Issue 2: Low Context Recall Rate (0)
Improvement methods:
Check if knowledge base covers relevant information
Issue 3: Hallucinations Occurring (100)
Improvement methods:
Adjust role instructions by adding: "Please answer user questions completely based on the provided knowledge base content"
Check the "gray areas" in the knowledge base, make all ambiguous guidelines specific
❌ Before correction: Please contact us.
✅ After correction: Please call customer service hotline 02-1234-5678 or email [email protected].
Step 8: Modify AI Assistant Settings
Enter the "AI Assistant" page
Modify "Role Instructions"
Save changes
5. Test Again to Verify Improvement Effects
Step 9: Create New Round of Testing
Repeat "Steps 3-4" to create a new evaluation task
Name: E-commerce Customer Service Test - 2025/11/27 (After Improvement)
Detailed records can be noted in the Description field
Use the same test question bank for comparison
Step 10: Compare Before and After Results
Compare results from two tests:
| Metric | Before Improvement | After Improvement | Change |
| --- | --- | --- | --- |
| Answer Relevance | 0.88 | 1 | +0.12 |
| Hallucination | 1 | 0 | -1 |
| Average Response Time | 5 seconds | 2.3 seconds | -2.7 seconds |
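If you note down the metric values from each run, the comparison reduces to a delta per metric. A minimal sketch using the example values from the table above:

```python
# Illustrative before/after comparison of two evaluation runs.
# The metric values below are copied from the example table above.
before = {"answer_relevance": 0.88, "hallucination": 1, "avg_response_time_s": 5.0}
after  = {"answer_relevance": 1.00, "hallucination": 0, "avg_response_time_s": 2.3}

for metric in before:
    delta = after[metric] - before[metric]
    print(f"{metric}: {before[metric]} -> {after[metric]} ({delta:+.2f})")
```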
6. Continuous Optimization Cycle
Recommended Frequency
Test immediately after each major update
Test after adding knowledge base content
Test after adjusting AI assistant settings
Quick Checklist
7. Common Questions
Q1
Can it be used if there are only questions (Q) without standard answers (A)?
Yes. The system will execute tests and generate evaluation results, but limited to metrics that don't require standard answers (such as answer relevance, bias, offensiveness, hallucination, etc.).
Q2
Will the model apply the Prompt we set when generating answers?
Yes. Batch testing will use the AI assistant's current complete settings.
Q3
During batch testing, will answers to consecutive questions influence each other?
No. Each test case is an independent conversation and does not retain the conversation history from the previous question, ensuring each answer is generated based on the same contextual conditions.
Q4
Generating answers in one batch vs. asking questions one by one manually—will there be differences in answers?
Theoretically they should be the same, because each test case is an independent conversation that uses the same Prompt and knowledge base settings. Actual differences may come from knowledge base content updates, system setting changes, or LLM randomness (usually minimal).
Q5
Without standard answers, how to judge test results?
You can refer to the following metrics:
Answer Relevance: whether the answer stays on topic
Safety Metrics: whether there is bias, offensiveness, or hallucination
Response Time: whether the speed is reasonable
Manual Review: check whether the actual answer content meets expectations
Q6
How to compare test results before and after improvements?
Recommendations:
Use the same test question bank
Record each test's description and setting changes
Compare pass rates, quality scores, and scores for each metric
View failed cases and analyze improvement effects
Q7
What to do if response time is too long?
Improvement methods:
Adjust LLM model parameters
Optimize the retrieval process
Reduce unnecessary tool calls
Q8
If test scores are low, how to improve?
Recommended adjustment steps:
1. Adjust role instructions: add clear restrictions, such as "Please answer completely based on the provided knowledge base content."
2. Check the knowledge base: make ambiguous guidelines specific (e.g., add specific phone numbers).
Q9
Why does success rate drop after adding standard answers (Ground Truth)?
Because system verification becomes stricter. Without standard answers, the system only checks response quality (such as tone, completeness); after adding them, the system compares "whether AI responses are completely consistent with company policy."