# AgentOps (Automated Testing)

This article introduces the AgentOps feature in the MaiAgent system, which helps you test and monitor AI assistants and systematically improve the relevance and accuracy of their responses.

## **What is AgentOps?**

As AI assistants become more widely used, user questions are becoming increasingly complex.

A systematic approach is needed to ensure AI assistants can:

* Provide correct answers
* Avoid inappropriate responses
* Maintain stable response times
* Protect brand reputation

**AgentOps** is a feature specifically designed for testing and monitoring AI assistant quality. It primarily conducts tests by creating test question banks and helps improve the relevance and accuracy of AI systems when handling queries. Through this approach, you can ensure more precise AI responses and effectively provide correct answers in practical applications.

#### **Traditional Manual Question-by-Question Testing vs. AgentOps Automated Testing**

|                    | **Manual Testing**                                                                  | **AgentOps**                                                                                       |
| ------------------ | ----------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------- |
| **Source**         | Manual question-by-question input                                                   | Pre-built test question bank with automatic question retrieval and batch execution                 |
| **Testing Method** | Tester manually inputs questions, waits for AI responses, manually compares answers | **System automatically runs all questions**, compares expected answers, generates complete reports |
| **Speed**          | Slow, requires question-by-question operation                                       | **Fast**, single-click execution, can run dozens to hundreds of questions simultaneously           |
| **Coverage**       | Low, typically tests only a small number of questions                               | **High**, can batch test large question banks, avoiding missed issues                              |
| **Labor Cost**     | High, requires repeated manual verification                                         | **Low**, tests run completely automatically, manual work only needed for reviewing results         |
| **Use Cases**      | Minor changes, ad-hoc confirmation, quick results                                   | After model updates, database modifications, prompt adjustments—needs verification for regression  |

## Core Features of AgentOps

#### 🧩 1. Automated Evaluation

{% hint style="info" %}
**Create test question bank → Batch automated testing → Scoring and comparison → Identify issues → Feedback optimization**
{% endhint %}

The system will **run through the entire test question bank at once** and output:

* Whether each question was successful
* Differences between AI responses and expected answers
* Response quality scores
* Answer relevance
* Average response time in seconds
* Failed case reports

When you want to verify the quality of your AI assistant's responses, you no longer need to test question by question.

AgentOps will automatically run through all questions based on the "test question bank" you created, compare expected answers, identify non-conforming cases, and finally compile the results into a complete report.
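Conceptually, a batch run works like the sketch below. Note that `ask_assistant` and `is_match` are hypothetical stand-ins for the platform's internal answer generation and scoring; this is an illustration of the flow, not a real MaiAgent API:

```python
# Conceptual sketch of a batch evaluation run.
# `ask_assistant` and `is_match` are hypothetical stand-ins for the
# platform's internals -- they are NOT a real MaiAgent API.
import time

def run_batch(test_cases, ask_assistant, is_match):
    """Run every case, compare to the expected answer, build a report."""
    results, failures, total_time = [], [], 0.0
    for case in test_cases:
        start = time.perf_counter()
        # Each case is an independent conversation with no shared history.
        answer = ask_assistant(case["question"])
        elapsed = time.perf_counter() - start
        total_time += elapsed
        passed = is_match(answer, case.get("ground_truth"))
        results.append({"question": case["question"], "passed": passed,
                        "seconds": elapsed})
        if not passed:
            failures.append(case["question"])
    return {
        "pass_rate": sum(r["passed"] for r in results) / len(results),
        "avg_seconds": total_time / len(results),
        "failed_cases": failures,
    }

# Toy usage with canned answer/scoring functions:
cases = [{"question": "Q1", "ground_truth": "A"},
         {"question": "Q2", "ground_truth": "B"}]
report = run_batch(cases,
                   ask_assistant=lambda q: "A",
                   is_match=lambda ans, gt: ans == gt)
print(report["pass_rate"], report["failed_cases"])
```

The real system scores on several metrics rather than exact string match, but the shape of the output report (pass rate, average time, failed cases) is the same.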

#### 📡 2. Assistant Monitoring

The system automatically monitors "every real user Q\&A" and records:

* Whether responses are inappropriate
* Whether they comply with company regulations

#### **Automated Testing vs. Assistant Monitoring**

|                      | **Automated Testing**                                                                                  | **Assistant Monitoring**                                                                 |
| -------------------- | ------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- |
| **Source**           | Pre-designed test question bank, controllable questions organized by topic                             | Real user input, varied questions, closer to real-world scenarios                        |
| **Operation Method** | Automatically executes tests through pre-built "test question bank," no manual question input needed   | Monitors AI responses to every real user input, records response quality and results     |
| **Primary Purpose**  | Systematic testing of AI response quality                                                              | Real-time monitoring of AI-user interaction performance                                  |
| **Controllability**  | High, can repeatedly test specific questions                                                           | Low, question sources uncontrollable, users may ask unexpected questions                 |
| **Use Cases**        | Before launch, version updates, prompt adjustments, database corrections—used to verify for regression | Long-term monitoring after launch, used to understand AI performance in real scenarios   |
| **Advantages**       | Batch testing possible, stable comparable results                                                      | Can discover real issues "not in the question bank but frequently occurring in practice" |

## What AgentOps Can Do

### Specific Application Scenarios

#### 🏥 Medical Clinic

```
Create test question bank:
"What treatment is suitable for me?"
"When does Dr. Lin have consultation hours?"
```

```
AgentOps automated test report:
1 answer violates medical regulations
2 answers lack sufficient precision
```

#### 🏫 School Administration

```
Create test question bank:
"How to request leave?"
"Grade calculation explanation"
"Transfer student procedures"
```

```
AgentOps automated test report:
Accuracy rate: 86%
2 answers with poor content quality
1 answer with excessively low relevance
```

## How to Use AgentOps Features in MaiAgent

### 1. Create Test Question Bank (First-Time Use)

#### Step 1: Create New Question Bank

1. Open "AgentOps" and enter "Test Sets"
2. Click the "Create Test Set" button

<figure><img src="https://1360999650-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6v6TNkkOQVfRYfcNirHL%2Fuploads%2Fgit-blob-8ea7b122e651e859d326efd2d92ea42c8e2f19fd%2F1%20(1).jpg?alt=media" alt=""><figcaption></figcaption></figure>

3. Set the test set name

<figure><img src="https://1360999650-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6v6TNkkOQVfRYfcNirHL%2Fuploads%2Fgit-blob-cfd0666de938cb54efd32f59d2efde943c11efb8%2F%E6%88%AA%E5%9C%96%202025-11-14%20%E4%B8%8B%E5%8D%884.24.25.png?alt=media" alt=""><figcaption></figcaption></figure>

#### Step 2: Add Test Questions

**Method A: Upload CSV File (Recommended)**

1. Enter the test cases page
2. Click "Add Test Case"

<figure><img src="https://1360999650-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6v6TNkkOQVfRYfcNirHL%2Fuploads%2Fgit-blob-ce08edc30919fee75c7a62798377a62c8ce3cf9f%2F1%20(2).jpg?alt=media" alt=""><figcaption></figcaption></figure>

3. Select "Import"

<figure><img src="https://1360999650-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6v6TNkkOQVfRYfcNirHL%2Fuploads%2Fgit-blob-ac328629f127803be7efd0ac9747cb16d2095042%2F%E6%88%AA%E5%9C%96%202025-11-14%20%E4%B8%8B%E5%8D%884.39.03.png?alt=media" alt=""><figcaption></figcaption></figure>

4. Prepare CSV file with the following format:

{% hint style="danger" %}
If there is no standard answer, the `ground_truth` (standard answer) field can be left empty when uploading the CSV file
{% endhint %}

| `question`                          | `ground_truth`                                                                   |
| ----------------------------------- | -------------------------------------------------------------------------------- |
| Can I return items after 14 days?   | Currently, returns can only be requested within 14 days of receiving the product |
| What payment methods are supported? | Credit card, ATM transfer, cash on delivery (depending on channel availability)  |

|                         | **Without Ground Truth**                                  | **With Ground Truth**                                                           |
| ----------------------- | --------------------------------------------------------- | ------------------------------------------------------------------------------- |
| **Scoring Criteria**    | Whether answer is fluent, complete, and logically correct | Whether it semantically matches the standard answer                             |
| **Discoverable Issues** | Off-topic, logical errors                                 | Incorrect information, policy inconsistencies, too verbose, missing information |
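If you prefer to generate the CSV programmatically, a minimal Python sketch like the following produces the two-column format shown above (the filename `test_cases.csv` is just an example):

```python
# Write a test-case CSV in the expected two-column format.
# The filename is illustrative; column names match the fields described above.
import csv

rows = [
    {"question": "Can I return items after 14 days?",
     "ground_truth": "Currently, returns can only be requested within 14 days of receiving the product"},
    {"question": "What payment methods are supported?",
     "ground_truth": "Credit card, ATM transfer, cash on delivery (depending on channel availability)"},
]

with open("test_cases.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "ground_truth"])
    writer.writeheader()     # first row: column names
    writer.writerows(rows)   # ground_truth may be left empty for a case
```

Any spreadsheet tool that exports UTF-8 CSV with these two column headers works equally well.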

5. After uploading the file, the system imports the test cases automatically

<figure><img src="https://1360999650-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6v6TNkkOQVfRYfcNirHL%2Fuploads%2Fgit-blob-a0478356a9fbb25c2e20800f112872e89c72be43%2F%E6%88%AA%E5%9C%96%202025-11-27%20%E4%B8%AD%E5%8D%8812.51.42.png?alt=media" alt=""><figcaption></figcaption></figure>

**Method B: Manual Input**

1. Enter the test cases page
2. Click "Add Test Case"

<figure><img src="https://1360999650-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6v6TNkkOQVfRYfcNirHL%2Fuploads%2Fgit-blob-ce08edc30919fee75c7a62798377a62c8ce3cf9f%2F1%20(2)%20(1).jpg?alt=media" alt=""><figcaption></figcaption></figure>

3. Select "Manual Input"

<figure><img src="https://1360999650-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6v6TNkkOQVfRYfcNirHL%2Fuploads%2Fgit-blob-17f991ea8e71af8ea9df58e5ea831675c71a4628%2F%E6%88%AA%E5%9C%96%202025-11-14%20%E4%B8%8B%E5%8D%884.58.27.png?alt=media" alt=""><figcaption></figcaption></figure>

4. Enter question and expected answer

### 2. Execute Batch Testing

#### Step 3: Create Evaluation Task

1. Enter "AgentOps" → "Automated Testing"
2. Click "Create Test"

<figure><img src="https://1360999650-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6v6TNkkOQVfRYfcNirHL%2Fuploads%2Fgit-blob-11538f6b25ea8a160bf93d7b43a109df1938cc17%2F1%20(3).jpg?alt=media" alt=""><figcaption></figcaption></figure>

3. Fill in information:

* Name: `E-commerce Customer Service Test - 2025/11/18`
* Description: (Test records, refer to template below)

```markdown
[Test Type] - [Brief Description]

Test Scope: [Describe the content covered by the test]
Test Purpose: [Explain why this test is being conducted]
Special Notes: [If there are changes or items requiring attention]
Expected Goals: [If there are specific objectives]
```

* Select Test Set: `E-commerce Customer Service Test`
* Select AI Assistant: `E-commerce Customer Service`

<figure><img src="https://1360999650-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6v6TNkkOQVfRYfcNirHL%2Fuploads%2Fgit-blob-fd0c520786f803b661cf44169ec639c5d0e9c28e%2F%E6%88%AA%E5%9C%96%202025-11-18%20%E4%B8%8B%E5%8D%886.29.00.png?alt=media" alt=""><figcaption></figcaption></figure>
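As a concrete example, the description template above might be filled in like this (all details here are illustrative):

```markdown
[Prompt Adjustment] - Tightened role instructions for returns policy

Test Scope: 20 returns- and payment-related questions
Test Purpose: Verify that the new role instructions reduce hallucinated policy details
Special Notes: Knowledge base was also updated before this run
Expected Goals: Hallucination score drops to 0; pass rate above 90%
```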

4. Click "Confirm" to start testing

#### Step 4: Wait for Test Completion

* Status shows "Evaluating" while the test is running
* Status changes to "Completed" when it finishes
* "Test Duration" gives an overview of the overall evaluation execution time

{% hint style="danger" %}
When exceeding 50 cases, it is recommended to execute in batches
{% endhint %}

| Number of Test Cases | Estimated Execution Time        | Description         |
| -------------------- | ------------------------------- | ------------------- |
| **10**               | **Approximately 2-3 minutes**   | Quick test          |
| **20**               | **Approximately 4-6 minutes**   | Routine test        |
| **30**               | **Approximately 6-9 minutes**   | Medium scale        |
| **50**               | **Approximately 10-15 minutes** | Complete evaluation |
| **100**              | **Approximately 20-30 minutes** | Large-scale test    |

**Factors Affecting Execution Time**

1. AI Assistant Complexity ⇒ Normal / Agent Mode
2. Knowledge Base Size ⇒ Larger knowledge bases require longer retrieval times
3. Network Conditions ⇒ API call delays affect overall time
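The table above works out to roughly 12-18 seconds per case, so a rough estimate is easy to script (the per-case range is inferred from the table, not an official figure):

```python
# Rough runtime estimate based on the table above (~12-18 s per case).
# The per-case range is inferred from the table, not an official figure.
def estimate_minutes(num_cases, secs_per_case=(12, 18)):
    """Return (low, high) estimated runtime in minutes."""
    low, high = secs_per_case
    return (num_cases * low / 60, num_cases * high / 60)

lo, hi = estimate_minutes(50)
print(f"{lo:.0f}-{hi:.0f} minutes")  # 10-15 minutes, matching the 50-case row
```

Actual times vary with the factors listed above (agent mode, knowledge base size, network conditions), so treat this only as a planning aid.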

### 3. View Results

#### Step 5: View Evaluation Results

1. Click evaluation record to enter details page

<figure><img src="https://1360999650-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6v6TNkkOQVfRYfcNirHL%2Fuploads%2Fgit-blob-bdcf7416f69d89162aefca185ba5185669ac689c%2F1%20(4).jpg?alt=media" alt=""><figcaption></figcaption></figure>

<figure><img src="https://1360999650-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6v6TNkkOQVfRYfcNirHL%2Fuploads%2Fgit-blob-87e7bb6ab63058f8dbf21f99d1c0518f513d37fe%2F1%20(7).jpg?alt=media" alt=""><figcaption></figcaption></figure>

<mark style="color:$info;">All Q\&A records and scores from the same test set</mark>

2. View key metrics:

   * Pass Rate: <mark style="color:red;">`0%`</mark> (No test cases passed)

     None of the AI responses matched the preset answers (Ground Truth)

   * Quality Score: 100 points
   * Average Response Time: 5.0 seconds
3. View scores for each metric:
   * Answer Relevance: <mark style="color:red;">`0.88`</mark> (Needs improvement)
   * Bias: <mark style="color:red;">`0`</mark> (Excellent)
   * Offensiveness: <mark style="color:red;">`0`</mark> (Excellent)
   * Hallucination: <mark style="color:red;">`100`</mark> (Needs improvement)

<figure><img src="https://1360999650-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6v6TNkkOQVfRYfcNirHL%2Fuploads%2Fgit-blob-8a39f7f9cd3d0b3912c953f660ce3984737d45ce%2F%E6%88%AA%E5%9C%96%202025-11-26%20%E4%B8%8B%E5%8D%886.11.42.png?alt=media" alt=""><figcaption></figcaption></figure>

*<mark style="color:$info;">Individual Q\&A scoring details</mark>*

<figure><img src="https://1360999650-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6v6TNkkOQVfRYfcNirHL%2Fuploads%2Fgit-blob-60d8fc182047e8ee6ad1788074e9e33a59581318%2F1%20(8).jpg?alt=media" alt=""><figcaption></figcaption></figure>

<mark style="color:$info;">View detailed analysis description of individual metrics</mark>

#### Step 6: View Failed Cases

1. Click "Test Case Details"
2. Review the reasons behind low-scoring metrics together with the corresponding AI responses

### 4. Improve AI Assistant

#### Step 7: Adjust Based on Results

Make improvements based on evaluation results:

**Issue 1: Low Answer Relevance (0.88)**

* Improvement methods:
  * Adjust role instructions, emphasize that answers need to stay on topic
  * Check if knowledge base content is complete
  * Optimize retrieval parameters

**Issue 2: Low Context Recall Rate (0)**

* Improvement methods:
  * Check if knowledge base covers relevant information

**Issue 3: Hallucinations Occurring (100)**

* Improvement methods:
  * Adjust role instructions, add: Please answer user questions **completely based on** the provided knowledge base content
  * Check the "gray areas" in the knowledge base, make all ambiguous guidelines specific
  * ❌ Before correction: Please contact us.
  * ✅ After correction: Please call customer service hotline 02-1234-5678 or email <service@example.com>.

#### Step 8: Modify AI Assistant Settings

1. Enter the "AI Assistant" page
2. Modify "Role Instructions"
3. Save changes

### 5. Test Again to Verify Improvement Effects

#### Step 9: Create New Round of Testing

1. Repeat "Steps 3-4" to create a new evaluation task
2. Name: `E-commerce Customer Service Test - 2025/11/27 (After Improvement)`

   Detailed records can be noted in the `Description` field
3. Use the same test question bank for comparison

#### Step 10: Compare Before and After Results

Compare results from two tests:

| Metric                | First Test | After Improvement | Improvement Margin |
| --------------------- | ---------- | ----------------- | ------------------ |
| Answer Relevance      | 0.88       | 1                 | +0.12              |
| Hallucination         | 100        | 0                 | -100               |
| Average Response Time | 5 seconds  | 2.3 seconds       | -2.7 seconds       |
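If you export both reports, margins like the ones above can be computed in a few lines (the dictionary layout here is an assumption for illustration, not a platform-defined export format):

```python
# Compute improvement margins between two evaluation runs.
# The dict layout is illustrative -- not a platform-defined export format.
before = {"answer_relevance": 0.88, "hallucination": 100, "avg_seconds": 5.0}
after  = {"answer_relevance": 1.00, "hallucination": 0,   "avg_seconds": 2.3}

for metric in before:
    delta = after[metric] - before[metric]
    print(f"{metric}: {before[metric]} -> {after[metric]} ({delta:+.2f})")
```

Keeping both runs on the same test question bank (as recommended above) is what makes this per-metric comparison meaningful.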

### 6. Continuous Optimization Cycle

```
Create Question Bank → Execute Test → View Results → Improve Settings → Test Again → Compare Results
        ↑                                                                                 ↓
        └───────────────────────── Continuous Optimization Cycle ─────────────────────────┘
```

#### Recommended Frequency

* Test immediately after each major update
* Test after adding knowledge base content
* Test after adjusting AI assistant settings

#### Quick Checklist

* [ ] Test question bank created (at least 10 test cases)
* [ ] Test cases include questions and standard answers
* [ ] First batch test executed
* [ ] Evaluation results reviewed and issues identified
* [ ] AI assistant settings adjusted based on results
* [ ] Post-improvement test executed
* [ ] Before and after test results compared

### 7. Common Questions

| Question No. | Question                                                                                                       | Answer                                                                                                                                                                                                                                                                                                                                          |
| ------------ | -------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Q1**       | Can it be used if there are only questions (Q) without standard answers (A)?                                   | **Yes**. The system will execute tests and generate evaluation results, but limited to metrics that don't require standard answers (such as answer relevance, bias, offensiveness, hallucination, etc.).                                                                                                                                        |
| **Q2**       | Will the model apply the Prompt we set when generating answers?                                                | **Yes**. Batch testing will use the AI assistant's current complete settings.                                                                                                                                                                                                                                                                   |
| **Q3**       | During batch testing, will answers to consecutive questions influence each other?                              | **No**. Each test case is an independent conversation and does not retain the conversation history from the previous question, ensuring each answer is generated based on the same contextual conditions.                                                                                                                                       |
| **Q4**       | Generating answers in one batch vs. asking questions one by one manually—will there be differences in answers? | <p><strong>Theoretically the same</strong>. Because each test case is an independent conversation using the same Prompt and knowledge base settings.<br>Actual differences may come from: knowledge base content updates, system setting changes, LLM randomness (usually minimal).</p>                                                         |
| **Q5**       | Without standard answers, how to judge test results?                                                           | <p>Can refer to the following metrics:<br>• <strong>Answer Relevance</strong>: Is the answer on topic<br>• <strong>Safety Metrics</strong>: Is there bias, offensiveness, hallucination<br>• <strong>Response Time</strong>: Is the speed reasonable<br>• <strong>Manual Review</strong>: Check if actual answer content meets expectations</p> |
| **Q6**       | How to compare test results before and after improvements?                                                     | <p>Recommendations:</p><ol><li>Use the same test question bank</li><li>Record each test's description and setting changes</li><li>Compare pass rates, quality scores, and scores for each metric</li><li>View failed cases and analyze improvement effects</li></ol>                                                                            |
| **Q7**       | What to do if response time is too long?                                                                       | <p>Improvement methods:<br>◦ Adjust LLM model parameters<br>◦ Optimize retrieval process<br>◦ Reduce unnecessary tool calls</p>                                                                                                                                                                                                                 |
| **Q8**       | If test scores are low, how to improve?                                                                        | <p>Recommended adjustment steps:</p><ol><li><strong>Adjust role instructions</strong>: Add clear restrictions, such as "Please answer completely based on the provided knowledge base content."</li><li><strong>Check knowledge base</strong>: Make ambiguous guidelines specific (e.g., add specific phone numbers).</li></ol>                   |
| **Q9**       | Why does success rate drop after adding standard answers (Ground Truth)?                                       | Because system verification becomes stricter. Without standard answers, the system only checks response quality (such as tone, completeness); after adding them, the system compares "whether AI responses are completely consistent with company policy."                                                                                      |
