AgentOps (Automated Testing)

This article explains how to use the AgentOps feature in the MaiAgent system for testing and monitoring, systematically improving the relevance and accuracy of AI responses.

What is AgentOps?

As AI assistants become more widely used, user questions are becoming increasingly complex.

A systematic approach is needed to ensure AI assistants can:

  • Provide correct answers

  • Avoid inappropriate responses

  • Maintain stable response times

  • Protect brand reputation

AgentOps is a feature designed specifically for testing and monitoring AI assistant quality. It works by creating test question banks and running them against your assistant, helping you improve the relevance and accuracy of the AI's answers so that it reliably provides correct responses in practical applications.

Traditional Manual Question-by-Question Testing vs. AgentOps Automated Testing

| | Manual Testing | AgentOps |
| --- | --- | --- |
| Source | Manual question-by-question input | Pre-built test question bank with automatic question retrieval and batch execution |
| Testing Method | Tester manually inputs questions, waits for AI responses, and manually compares answers | System automatically runs all questions, compares them against expected answers, and generates a complete report |
| Speed | Slow; requires question-by-question operation | Fast; single-click execution can run dozens to hundreds of questions at once |
| Coverage | Low; typically only a small number of questions are tested | High; large question banks can be batch tested, avoiding missed issues |
| Labor Cost | High; requires repeated manual verification | Low; tests run fully automatically, and manual work is only needed to review results |
| Use Cases | Minor changes, ad-hoc confirmation, quick results | After model updates, database modifications, or prompt adjustments, to verify there is no regression |

Core Features of AgentOps

🧩 1. Automated Evaluation


Create test question bank → Batch automated testing → Scoring and comparison → Identify issues → Feedback optimization

The system will run through the entire test question bank at once and output:

  • Whether each question was successful

  • Differences between AI responses and expected answers

  • Response quality scores

  • Answer relevance

  • Average response time in seconds

  • Failed case reports

When you want to verify the quality of your AI assistant's responses, you no longer need to test question by question.

AgentOps will automatically run through all questions based on the "test question bank" you created, compare expected answers, identify non-conforming cases, and finally compile the results into a complete report.
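
Conceptually, the automated run is just a loop over the test set. The sketch below is illustrative only and does not reflect MaiAgent's internal implementation; `ask_assistant` and `score_similarity` are hypothetical placeholders standing in for "send the question to the assistant" and "compare the answer with the expected one".

```python
# Illustrative sketch of the batch-evaluation flow described above.
# ask_assistant() and score_similarity() are hypothetical placeholders,
# not MaiAgent APIs.
from dataclasses import dataclass

@dataclass
class TestCase:
    question: str
    ground_truth: str

@dataclass
class CaseResult:
    question: str
    answer: str
    score: float
    passed: bool

def run_test_set(test_cases, ask_assistant, score_similarity, threshold=0.8):
    """Run every case once, compare with the expected answer, and collect a report."""
    results = []
    for case in test_cases:
        # Each case is an independent conversation (see Q3 in the FAQ below).
        answer = ask_assistant(case.question)
        score = score_similarity(answer, case.ground_truth)
        results.append(CaseResult(case.question, answer, score, score >= threshold))
    return results
```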

📡 2. Assistant Monitoring

The system automatically monitors "every real user Q&A" and records:

  • Whether responses are inappropriate

  • Whether they comply with company regulations

Automated Testing vs. Assistant Monitoring

| | Automated Testing | Assistant Monitoring |
| --- | --- | --- |
| Source | Pre-designed test question bank; controllable questions organized by topic | Real user input; varied questions, closer to real-world scenarios |
| Operation Method | Automatically executes tests through the pre-built test question bank, with no manual question input needed | Monitors AI responses to every real user input, recording response quality and results |
| Primary Purpose | Systematic testing of AI response quality | Real-time monitoring of AI-user interaction performance |
| Controllability | High; specific questions can be tested repeatedly | Low; question sources are uncontrollable, and users may ask unexpected questions |
| Use Cases | Before launch, version updates, prompt adjustments, database corrections, to verify there is no regression | Long-term monitoring after launch, to understand AI performance in real scenarios |
| Advantages | Batch testing is possible, with stable, comparable results | Can discover real issues that are not in the question bank but occur frequently in practice |

What AgentOps Can Do

Specific Application Scenarios

🏥 Medical Clinic

🏫 School Administration

How to Use AgentOps Features in MaiAgent

1. Create Test Question Bank (First-Time Use)

Step 1: Create New Question Bank

  1. Open "AgentOps" and enter "Test Sets"

  2. Click the "Create Test Set" button

  3. Set the test set name

Step 2: Add Test Questions

Method A: Upload CSV File (Recommended)

  1. Enter the test cases page

  2. Click "Add Test Case"

  3. Select "Import"

  4. Prepare a CSV file in the following format (a sketch for generating the file programmatically follows these steps):

| question | Ground Truth |
| --- | --- |
| Can I return items after 14 days? | Currently, returns can only be requested within 14 days of receiving the product |
| What payment methods are supported? | Credit card, ATM transfer, cash on delivery (depending on channel availability) |

| | Question only (no Ground Truth) | With Ground Truth |
| --- | --- | --- |
| Scoring Criteria | Whether the answer is fluent, complete, and logically correct | Whether it semantically matches the standard answer |
| Discoverable Issues | Off-topic answers, logical errors | Incorrect information, policy inconsistencies, overly verbose answers, missing information |

  5. After uploading the file, the system will automatically import it
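
If your questions and expected answers already live in a spreadsheet or a script, the CSV can also be produced programmatically. Below is a minimal sketch using only Python's standard library; the file name and example rows are assumptions, and the only detail taken from this guide is the question / Ground Truth column layout.

```python
# Illustrative sketch: build a test-case CSV in the format shown above
# (columns "question" and "Ground Truth"). The file name and rows are
# examples, not MaiAgent requirements.
import csv

test_cases = [
    ("Can I return items after 14 days?",
     "Currently, returns can only be requested within 14 days of receiving the product"),
    ("What payment methods are supported?",
     "Credit card, ATM transfer, cash on delivery (depending on channel availability)"),
]

with open("ecommerce_test_set.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["question", "Ground Truth"])  # header row used by the import
    writer.writerows(test_cases)
```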

Method B: Manual Input

  1. Enter the test cases page

  2. Click "Add Test Case"

  3. Select "Manual Input"

  4. Enter the question and expected answer

2. Execute Batch Testing

Step 3: Create Evaluation Task

  1. Enter "AgentOps" → "Automated Testing"

  2. Click "Create Test"

  1. Fill in information:

  • Name: E-commerce Customer Service Test - 2025/11/18

  • Description: (Test records, refer to template below)

  • Select Test Set: E-commerce Customer Service Test

  • Select AI Assistant: E-commerce Customer Service

  1. Click "Confirm" to start testing

Step 4: Wait for Test Completion

  • Status shows "Evaluating"

  • Status changes to "Completed" after completion

  • Test Duration → Overview of overall evaluation execution time

| Number of Test Cases | Estimated Execution Time | Description |
| --- | --- | --- |
| 10 | Approximately 2-3 minutes | Quick test |
| 20 | Approximately 4-6 minutes | Routine test |
| 30 | Approximately 6-9 minutes | Medium scale |
| 50 | Approximately 10-15 minutes | Complete evaluation |
| 100 | Approximately 20-30 minutes | Large-scale test |

Factors Affecting Execution Time

  1. AI Assistant Complexity ⇒ Normal / Agent Mode

  2. Knowledge Base Size ⇒ Larger knowledge bases require longer retrieval times

  3. Network Conditions ⇒ API call delays affect overall time
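
From the table above, a single test case takes very roughly 12 to 18 seconds end to end. For a quick planning estimate on a larger question bank, a back-of-the-envelope calculation (assuming sequential execution, which may not match the actual scheduler) looks like this:

```python
# Rough planning estimate derived from the table above, assuming cases run
# sequentially at roughly 12-18 seconds each. Actual times depend on the
# assistant mode, knowledge base size, and network conditions listed above.
def estimate_minutes(num_cases: int, sec_low: float = 12, sec_high: float = 18):
    return num_cases * sec_low / 60, num_cases * sec_high / 60

low, high = estimate_minutes(50)
print(f"50 cases: ~{low:.0f}-{high:.0f} minutes")  # roughly matches the 10-15 minute row
```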

3. View Results

Step 5: View Evaluation Results

  1. Click an evaluation record to enter the details page; it shows all Q&A records and scores from the same test set

  2. View the key metrics:

    • Pass Rate: 0% (no test cases passed; none of the answers match the preset answers (Ground Truth))

    • Quality Score: 100 points

    • Average Response Time: 5.0 seconds

  3. View the scores for each metric:

    • Answer Relevance: 0.88 (Needs improvement)

    • Bias: 0 (Excellent)

    • Offensiveness: 0 (Excellent)

    • Hallucination: 100 (Needs improvement)

  4. Click an individual Q&A record to view its scoring details and the detailed analysis for each metric
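
For reference, the report-level figures are simple aggregations of the per-case records. The sketch below only illustrates that arithmetic; the field names are assumptions and do not refer to MaiAgent's data model.

```python
# Illustrative aggregation of per-case results into report-level figures
# (pass rate, average response time). Field names are assumptions.
def summarize(results):
    passed = sum(1 for r in results if r["passed"])
    pass_rate = passed / len(results) * 100
    avg_response_time = sum(r["response_seconds"] for r in results) / len(results)
    return {"pass_rate_percent": round(pass_rate, 1),
            "avg_response_seconds": round(avg_response_time, 1)}

example = [
    {"passed": False, "response_seconds": 4.8},
    {"passed": False, "response_seconds": 5.2},
]
print(summarize(example))  # {'pass_rate_percent': 0.0, 'avg_response_seconds': 5.0}
```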

Step 6: View Failed Cases

  1. Click "Test Case Details"

  2. View reasons for low-scoring metrics and AI responses

4. Improve AI Assistant

Step 7: Adjust Based on Results

Make improvements based on evaluation results:

Issue 1: Low Answer Relevance (0.88)

  • Improvement methods:

    • Adjust role instructions, emphasize that answers need to stay on topic

    • Check if knowledge base content is complete

    • Optimize retrieval parameters

Issue 2: Low Context Recall Rate (0)

  • Improvement methods:

    • Check if knowledge base covers relevant information

Issue 3: Hallucinations Occurring (100)

  • Improvement methods:

    • Adjust role instructions, add: Please answer user questions completely based on the provided knowledge base content

    • Check the "gray areas" in the knowledge base, make all ambiguous guidelines specific

    • ❌ Before correction: Please contact us.

    • ✅ After correction: Please call customer service hotline 02-1234-5678 or email [email protected].

Step 8: Modify AI Assistant Settings

  1. Enter the "AI Assistant" page

  2. Modify "Role Instructions"

  3. Save changes

5. Test Again to Verify Improvement Effects

Step 9: Create New Round of Testing

  1. Repeat "Steps 3-4" to create a new evaluation task

  2. Name: E-commerce Customer Service Test - 2025/11/27 (After Improvement)

    Detailed records can be noted in the Description field

  3. Use the same test question bank for comparison

Step 10: Compare Before and After Results

Compare results from two tests:

| Metric | First Test | After Improvement | Improvement Margin |
| --- | --- | --- | --- |
| Answer Relevance | 0.88 | 1 | +0.12 |
| Hallucination | 1 | 0 | -1 |
| Average Response Time | 5 seconds | 2.3 seconds | -2.7 seconds |
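
The improvement margin is simply the difference between the two runs on the same test set. A minimal sketch of that comparison, using the example numbers from this section:

```python
# Compute improvement margins between two runs of the same test set.
# Metric names and values mirror the example above; they are not live data.
first_run = {"answer_relevance": 0.88, "hallucination": 1, "avg_response_seconds": 5.0}
second_run = {"answer_relevance": 1.0, "hallucination": 0, "avg_response_seconds": 2.3}

for metric in first_run:
    delta = second_run[metric] - first_run[metric]
    print(f"{metric}: {first_run[metric]} -> {second_run[metric]} ({delta:+.2f})")
```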

6. Continuous Optimization Cycle

  • Test immediately after each major update

  • Test after adding knowledge base content

  • Test after adjusting AI assistant settings

Quick Checklist

Common Questions

Q1: Can it be used if there are only questions (Q) without standard answers (A)?
Yes. The system will still execute the tests and generate evaluation results, but only for metrics that do not require a standard answer (such as answer relevance, bias, offensiveness, and hallucination).

Q2: Will the model apply the Prompt we set when generating answers?
Yes. Batch testing uses the AI assistant's current, complete settings.

Q3: During batch testing, will answers to consecutive questions influence each other?
No. Each test case is an independent conversation and does not retain the conversation history of the previous question, so every answer is generated under the same contextual conditions.

Q4: Will answers generated in one batch differ from answers obtained by asking the questions one by one manually?
Theoretically they are the same, because each test case is an independent conversation using the same Prompt and knowledge base settings. Actual differences may come from knowledge base content updates, system setting changes, or LLM randomness (usually minimal).

Q5: Without standard answers, how do you judge test results?
Refer to the following metrics:

  • Answer Relevance: whether the answer stays on topic

  • Safety Metrics: whether there is bias, offensiveness, or hallucination

  • Response Time: whether the speed is reasonable

  • Manual Review: whether the actual answer content meets expectations

Q6: How do you compare test results before and after improvements?
Recommendations:

  1. Use the same test question bank

  2. Record each test's description and setting changes

  3. Compare pass rates, quality scores, and the scores for each metric

  4. Review failed cases and analyze the improvement effects

Q7: What should you do if the response time is too long?
Improvement methods:

  • Adjust the LLM model parameters

  • Optimize the retrieval process

  • Reduce unnecessary tool calls

Q8: If test scores are low, how do you improve them?
Recommended adjustment steps:

  1. Adjust role instructions: add clear restrictions, such as "Please answer completely based on the provided knowledge base content."

  2. Check the knowledge base: make ambiguous guidelines specific (e.g., add specific phone numbers).

Q9: Why does the success rate drop after adding standard answers (Ground Truth)?
Because system verification becomes stricter. Without standard answers, the system only checks response quality (such as tone and completeness); after adding them, the system also checks whether the AI's responses are completely consistent with company policy.
