# AgentOps (Automated Testing)

This article introduces the AgentOps feature in the MaiAgent system, which helps you test and monitor AI assistants and systematically improve the relevance and accuracy of their responses.

## **What is AgentOps?**

As AI assistants become more widely used, user questions are becoming increasingly complex.

A systematic approach is needed to ensure AI assistants can:

* Provide correct answers
* Avoid inappropriate responses
* Maintain stable response times
* Protect brand reputation

**AgentOps** is a feature specifically designed for testing and monitoring AI assistant quality. It primarily conducts tests by creating test question banks and helps improve the relevance and accuracy of AI systems when handling queries. Through this approach, you can ensure more precise AI responses and effectively provide correct answers in practical applications.

#### **Traditional Manual Question-by-Question Testing vs. AgentOps Automated Testing**

|                    | **Manual Testing**                                                                  | **AgentOps**                                                                                       |
| ------------------ | ----------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------- |
| **Source**         | Manual question-by-question input                                                   | Pre-built test question bank with automatic question retrieval and batch execution                 |
| **Testing Method** | Tester manually inputs questions, waits for AI responses, manually compares answers | **System automatically runs all questions**, compares expected answers, generates complete reports |
| **Speed**          | Slow, requires question-by-question operation                                       | **Fast**, single-click execution, can run dozens to hundreds of questions simultaneously           |
| **Coverage**       | Low, typically tests only a small number of questions                               | **High**, can batch test large question banks, avoiding missed issues                              |
| **Labor Cost**     | High, requires repeated manual verification                                         | **Low**, tests run completely automatically, manual work only needed for reviewing results         |
| **Use Cases**      | Minor changes, ad-hoc confirmation, quick results                                   | After model updates, database modifications, prompt adjustments—needs verification for regression  |

## Core Features of AgentOps

#### 🧩 1. Automated Evaluation

{% hint style="info" %}
**Create test question bank → Batch automated testing → Scoring and comparison → Identify issues → Feedback optimization**
{% endhint %}

The system will **run through the entire test question bank at once** and output:

* Whether each question was successful
* Differences between AI responses and expected answers
* Response quality scores
* Answer relevance
* Average response time in seconds
* Failed case reports

When you want to verify the quality of your AI assistant's responses, you no longer need to test question by question.

AgentOps will automatically run through all questions based on the "test question bank" you created, compare expected answers, identify non-conforming cases, and finally compile the results into a complete report.
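Conceptually, a batch run works like the sketch below. Note that `ask_assistant` and `is_match` are hypothetical stand-ins for the platform's internal answer generation and scoring; this is an illustration of the flow, not a real MaiAgent API:

```python
# Conceptual sketch of a batch evaluation run.
# `ask_assistant` and `is_match` are hypothetical stand-ins for the
# platform's internals -- they are NOT a real MaiAgent API.
import time

def run_batch(test_cases, ask_assistant, is_match):
    """Run every case, compare to the expected answer, build a report."""
    results, failures, total_time = [], [], 0.0
    for case in test_cases:
        start = time.perf_counter()
        # Each case is an independent conversation with no shared history.
        answer = ask_assistant(case["question"])
        elapsed = time.perf_counter() - start
        total_time += elapsed
        passed = is_match(answer, case.get("ground_truth"))
        results.append({"question": case["question"], "passed": passed,
                        "seconds": elapsed})
        if not passed:
            failures.append(case["question"])
    return {
        "pass_rate": sum(r["passed"] for r in results) / len(results),
        "avg_seconds": total_time / len(results),
        "failed_cases": failures,
    }

# Toy usage with canned answer/scoring functions:
cases = [{"question": "Q1", "ground_truth": "A"},
         {"question": "Q2", "ground_truth": "B"}]
report = run_batch(cases,
                   ask_assistant=lambda q: "A",
                   is_match=lambda ans, gt: ans == gt)
print(report["pass_rate"], report["failed_cases"])
```

The real system scores on several metrics rather than exact string match, but the shape of the output report (pass rate, average time, failed cases) is the same.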

#### 📡 2. Assistant Monitoring

The system automatically monitors "every real user Q\&A" and records:

* Whether responses are inappropriate
* Whether they comply with company regulations

#### **Automated Testing vs. Assistant Monitoring**

|                      | **Automated Testing**                                                                                  | **Assistant Monitoring**                                                                 |
| -------------------- | ------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- |
| **Source**           | Pre-designed test question bank, controllable questions organized by topic                             | Real user input, varied questions, closer to real-world scenarios                        |
| **Operation Method** | Automatically executes tests through pre-built "test question bank," no manual question input needed   | Monitors AI responses to every real user input, records response quality and results     |
| **Primary Purpose**  | Systematic testing of AI response quality                                                              | Real-time monitoring of AI-user interaction performance                                  |
| **Controllability**  | High, can repeatedly test specific questions                                                           | Low, question sources uncontrollable, users may ask unexpected questions                 |
| **Use Cases**        | Before launch, version updates, prompt adjustments, database corrections—used to verify for regression | Long-term monitoring after launch, used to understand AI performance in real scenarios   |
| **Advantages**       | Batch testing possible, stable comparable results                                                      | Can discover real issues "not in the question bank but frequently occurring in practice" |

## What AgentOps Can Do

### Specific Application Scenarios

#### 🏥 Medical Clinic

```
Create test question bank:
"What treatment is suitable for me?"
"When does Dr. Lin have consultation hours?"
```

```
AgentOps automated test report:
1 answer violates medical regulations
2 answers lack sufficient precision
```

#### 🏫 School Administration

```
Create test question bank:
"How to request leave?"
"Grade calculation explanation"
"Transfer student procedures"
```

```
AgentOps automated test report:
Accuracy rate: 86%
2 answers with poor content quality
1 answer with excessively low relevance
```

## How to Use AgentOps Features in MaiAgent

### 1. Create Test Question Bank (First-Time Use)

#### Step 1: Create New Question Bank

1. Open "AgentOps" and enter "Test Sets"
2. Click the "Create Test Set" button

<figure><img src="https://1360999650-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6v6TNkkOQVfRYfcNirHL%2Fuploads%2Fgit-blob-8ea7b122e651e859d326efd2d92ea42c8e2f19fd%2F1%20(1).jpg?alt=media" alt=""><figcaption></figcaption></figure>

3. Set the test set name

<figure><img src="https://1360999650-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6v6TNkkOQVfRYfcNirHL%2Fuploads%2Fgit-blob-cfd0666de938cb54efd32f59d2efde943c11efb8%2F%E6%88%AA%E5%9C%96%202025-11-14%20%E4%B8%8B%E5%8D%884.24.25.png?alt=media" alt=""><figcaption></figcaption></figure>

#### Step 2: Add Test Questions

**Method A: Upload CSV File (Recommended)**

1. Enter the test cases page
2. Click "Add Test Case"

<figure><img src="https://1360999650-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6v6TNkkOQVfRYfcNirHL%2Fuploads%2Fgit-blob-ce08edc30919fee75c7a62798377a62c8ce3cf9f%2F1%20(2).jpg?alt=media" alt=""><figcaption></figcaption></figure>

3. Select "Import"

<figure><img src="https://1360999650-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6v6TNkkOQVfRYfcNirHL%2Fuploads%2Fgit-blob-ac328629f127803be7efd0ac9747cb16d2095042%2F%E6%88%AA%E5%9C%96%202025-11-14%20%E4%B8%8B%E5%8D%884.39.03.png?alt=media" alt=""><figcaption></figcaption></figure>

4. Prepare CSV file with the following format:

{% hint style="danger" %}
If there is no standard answer, the `ground_truth` (standard answer) field can be left empty when uploading the CSV file
{% endhint %}

| `question`                          | `ground_truth`                                                                   |
| ----------------------------------- | -------------------------------------------------------------------------------- |
| Can I return items after 14 days?   | Currently, returns can only be requested within 14 days of receiving the product |
| What payment methods are supported? | Credit card, ATM transfer, cash on delivery (depending on channel availability)  |

|                         | **Without Ground Truth**                                  | **With Ground Truth**                                                           |
| ----------------------- | --------------------------------------------------------- | ------------------------------------------------------------------------------- |
| **Scoring Criteria**    | Whether answer is fluent, complete, and logically correct | Whether it semantically matches the standard answer                             |
| **Discoverable Issues** | Off-topic, logical errors                                 | Incorrect information, policy inconsistencies, too verbose, missing information |
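If you prefer to generate the CSV programmatically, a minimal Python sketch like the following produces the two-column format shown above (the filename `test_cases.csv` is just an example):

```python
# Write a test-case CSV in the expected two-column format.
# The filename is illustrative; column names match the fields described above.
import csv

rows = [
    {"question": "Can I return items after 14 days?",
     "ground_truth": "Currently, returns can only be requested within 14 days of receiving the product"},
    {"question": "What payment methods are supported?",
     "ground_truth": "Credit card, ATM transfer, cash on delivery (depending on channel availability)"},
]

with open("test_cases.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "ground_truth"])
    writer.writeheader()     # first row: column names
    writer.writerows(rows)   # ground_truth may be left empty for a case
```

Any spreadsheet tool that exports UTF-8 CSV with these two column headers works equally well.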

5. After uploading the file, the system imports the test cases automatically

<figure><img src="https://1360999650-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6v6TNkkOQVfRYfcNirHL%2Fuploads%2Fgit-blob-a0478356a9fbb25c2e20800f112872e89c72be43%2F%E6%88%AA%E5%9C%96%202025-11-27%20%E4%B8%AD%E5%8D%8812.51.42.png?alt=media" alt=""><figcaption></figcaption></figure>

**Method B: Manual Input**

1. Enter the test cases page
2. Click "Add Test Case"

<figure><img src="https://1360999650-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6v6TNkkOQVfRYfcNirHL%2Fuploads%2Fgit-blob-ce08edc30919fee75c7a62798377a62c8ce3cf9f%2F1%20(2)%20(1).jpg?alt=media" alt=""><figcaption></figcaption></figure>

3. Select "Manual Input"

<figure><img src="https://1360999650-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6v6TNkkOQVfRYfcNirHL%2Fuploads%2Fgit-blob-17f991ea8e71af8ea9df58e5ea831675c71a4628%2F%E6%88%AA%E5%9C%96%202025-11-14%20%E4%B8%8B%E5%8D%884.58.27.png?alt=media" alt=""><figcaption></figcaption></figure>

4. Enter question and expected answer

### 2. Execute Batch Testing

#### Step 3: Create Evaluation Task

1. Enter "AgentOps" → "Automated Testing"
2. Click "Create Test"

<figure><img src="https://1360999650-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6v6TNkkOQVfRYfcNirHL%2Fuploads%2Fgit-blob-11538f6b25ea8a160bf93d7b43a109df1938cc17%2F1%20(3).jpg?alt=media" alt=""><figcaption></figcaption></figure>

3. Fill in information:

* Name: `E-commerce Customer Service Test - 2025/11/18`
* Description: (Test records, refer to template below)

```markdown
[Test Type] - [Brief Description]

Test Scope: [Describe the content covered by the test]
Test Purpose: [Explain why this test is being conducted]
Special Notes: [If there are changes or items requiring attention]
Expected Goals: [If there are specific objectives]
```

* Select Test Set: `E-commerce Customer Service Test`
* Select AI Assistant: `E-commerce Customer Service`

<figure><img src="https://1360999650-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6v6TNkkOQVfRYfcNirHL%2Fuploads%2Fgit-blob-fd0c520786f803b661cf44169ec639c5d0e9c28e%2F%E6%88%AA%E5%9C%96%202025-11-18%20%E4%B8%8B%E5%8D%886.29.00.png?alt=media" alt=""><figcaption></figcaption></figure>
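As a concrete example, the description template above might be filled in like this (all details here are illustrative):

```markdown
[Prompt Adjustment] - Tightened role instructions for returns policy

Test Scope: 20 returns- and payment-related questions
Test Purpose: Verify that the new role instructions reduce hallucinated policy details
Special Notes: Knowledge base was also updated before this run
Expected Goals: Hallucination score drops to 0; pass rate above 90%
```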

4. Click "Confirm" to start testing

#### Step 4: Wait for Test Completion

* Status shows "Evaluating" while the test is running
* Status changes to "Completed" when it finishes
* "Test Duration" gives an overview of the overall evaluation execution time

{% hint style="danger" %}
When exceeding 50 cases, it is recommended to execute in batches
{% endhint %}

| Number of Test Cases | Estimated Execution Time        | Description         |
| -------------------- | ------------------------------- | ------------------- |
| **10**               | **Approximately 2-3 minutes**   | Quick test          |
| **20**               | **Approximately 4-6 minutes**   | Routine test        |
| **30**               | **Approximately 6-9 minutes**   | Medium scale        |
| **50**               | **Approximately 10-15 minutes** | Complete evaluation |
| **100**              | **Approximately 20-30 minutes** | Large-scale test    |

**Factors Affecting Execution Time**

1. AI Assistant Complexity ⇒ Normal / Agent Mode
2. Knowledge Base Size ⇒ Larger knowledge bases require longer retrieval times
3. Network Conditions ⇒ API call delays affect overall time
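The table above works out to roughly 12-18 seconds per case, so a rough estimate is easy to script (the per-case range is inferred from the table, not an official figure):

```python
# Rough runtime estimate based on the table above (~12-18 s per case).
# The per-case range is inferred from the table, not an official figure.
def estimate_minutes(num_cases, secs_per_case=(12, 18)):
    """Return (low, high) estimated runtime in minutes."""
    low, high = secs_per_case
    return (num_cases * low / 60, num_cases * high / 60)

lo, hi = estimate_minutes(50)
print(f"{lo:.0f}-{hi:.0f} minutes")  # 10-15 minutes, matching the 50-case row
```

Actual times vary with the factors listed above (agent mode, knowledge base size, network conditions), so treat this only as a planning aid.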

### 3. View Results

#### Step 5: View Evaluation Results

1. Click evaluation record to enter details page

<figure><img src="https://1360999650-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6v6TNkkOQVfRYfcNirHL%2Fuploads%2Fgit-blob-bdcf7416f69d89162aefca185ba5185669ac689c%2F1%20(4).jpg?alt=media" alt=""><figcaption></figcaption></figure>

<figure><img src="https://1360999650-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6v6TNkkOQVfRYfcNirHL%2Fuploads%2Fgit-blob-87e7bb6ab63058f8dbf21f99d1c0518f513d37fe%2F1%20(7).jpg?alt=media" alt=""><figcaption></figcaption></figure>

<mark style="color:$info;">All Q\&A records and scores from the same test set</mark>

2. View key metrics:

   * Pass Rate: <mark style="color:red;">`0%`</mark> (No test cases passed)

     None of the AI responses matched the preset answers (Ground Truth)

   * Quality Score: 100 points
   * Average Response Time: 5.0 seconds
3. View scores for each metric:
   * Answer Relevance: <mark style="color:red;">`0.88`</mark> (Needs improvement)
   * Bias: <mark style="color:red;">`0`</mark> (Excellent)
   * Offensiveness: <mark style="color:red;">`0`</mark> (Excellent)
   * Hallucination: <mark style="color:red;">`100`</mark> (Needs improvement)

<figure><img src="https://1360999650-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6v6TNkkOQVfRYfcNirHL%2Fuploads%2Fgit-blob-8a39f7f9cd3d0b3912c953f660ce3984737d45ce%2F%E6%88%AA%E5%9C%96%202025-11-26%20%E4%B8%8B%E5%8D%886.11.42.png?alt=media" alt=""><figcaption></figcaption></figure>

*<mark style="color:$info;">Individual Q\&A scoring details</mark>*

<figure><img src="https://1360999650-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F6v6TNkkOQVfRYfcNirHL%2Fuploads%2Fgit-blob-60d8fc182047e8ee6ad1788074e9e33a59581318%2F1%20(8).jpg?alt=media" alt=""><figcaption></figcaption></figure>

<mark style="color:$info;">View detailed analysis description of individual metrics</mark>

#### Step 6: View Failed Cases

1. Click "Test Case Details"
2. Review the reasons behind low-scoring metrics together with the corresponding AI responses

### 4. Improve AI Assistant

#### Step 7: Adjust Based on Results

Make improvements based on evaluation results:

**Issue 1: Low Answer Relevance (0.88)**

* Improvement methods:
  * Adjust role instructions, emphasize that answers need to stay on topic
  * Check if knowledge base content is complete
  * Optimize retrieval parameters

**Issue 2: Low Context Recall Rate (0)**

* Improvement methods:
  * Check if knowledge base covers relevant information

**Issue 3: Hallucinations Occurring (100)**

* Improvement methods:
  * Adjust role instructions, add: Please answer user questions **completely based on** the provided knowledge base content
  * Check the "gray areas" in the knowledge base, make all ambiguous guidelines specific
  * ❌ Before correction: Please contact us.
  * ✅ After correction: Please call customer service hotline 02-1234-5678 or email <service@example.com>.

#### Step 8: Modify AI Assistant Settings

1. Enter the "AI Assistant" page
2. Modify "Role Instructions"
3. Save changes

### 5. Test Again to Verify Improvement Effects

#### Step 9: Create New Round of Testing

1. Repeat "Steps 3-4" to create a new evaluation task
2. Name: `E-commerce Customer Service Test - 2025/11/27 (After Improvement)`

   Detailed records can be noted in the `Description` field
3. Use the same test question bank for comparison

#### Step 10: Compare Before and After Results

Compare results from two tests:

| Metric                | First Test | After Improvement | Improvement Margin |
| --------------------- | ---------- | ----------------- | ------------------ |
| Answer Relevance      | 0.88       | 1                 | +0.12              |
| Hallucination         | 100        | 0                 | -100               |
| Average Response Time | 5 seconds  | 2.3 seconds       | -2.7 seconds       |
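If you export both reports, margins like the ones above can be computed in a few lines (the dictionary layout here is an assumption for illustration, not a platform-defined export format):

```python
# Compute improvement margins between two evaluation runs.
# The dict layout is illustrative -- not a platform-defined export format.
before = {"answer_relevance": 0.88, "hallucination": 100, "avg_seconds": 5.0}
after  = {"answer_relevance": 1.00, "hallucination": 0,   "avg_seconds": 2.3}

for metric in before:
    delta = after[metric] - before[metric]
    print(f"{metric}: {before[metric]} -> {after[metric]} ({delta:+.2f})")
```

Keeping both runs on the same test question bank (as recommended above) is what makes this per-metric comparison meaningful.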

### 6. Continuous Optimization Cycle

```
Create Question Bank → Execute Test → View Results → Improve Settings → Test Again → Compare Results
        ↑                                                                                 ↓
        └───────────────────────── Continuous Optimization Cycle ─────────────────────────┘
```

#### Recommended Frequency

* Test immediately after each major update
* Test after adding knowledge base content
* Test after adjusting AI assistant settings

#### Quick Checklist

* [ ] Test question bank created (at least 10 test cases)
* [ ] Test cases include questions and standard answers
* [ ] First batch test executed
* [ ] Evaluation results reviewed and issues identified
* [ ] AI assistant settings adjusted based on results
* [ ] Post-improvement test executed
* [ ] Before and after test results compared

### 7. Common Questions

| Question No. | Question                                                                                                       | Answer                                                                                                                                                                                                                                                                                                                                          |
| ------------ | -------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Q1**       | Can it be used if there are only questions (Q) without standard answers (A)?                                   | **Yes**. The system will execute tests and generate evaluation results, but limited to metrics that don't require standard answers (such as answer relevance, bias, offensiveness, hallucination, etc.).                                                                                                                                        |
| **Q2**       | Will the model apply the Prompt we set when generating answers?                                                | **Yes**. Batch testing will use the AI assistant's current complete settings.                                                                                                                                                                                                                                                                   |
| **Q3**       | During batch testing, will answers to consecutive questions influence each other?                              | **No**. Each test case is an independent conversation and does not retain the conversation history from the previous question, ensuring each answer is generated based on the same contextual conditions.                                                                                                                                       |
| **Q4**       | Generating answers in one batch vs. asking questions one by one manually—will there be differences in answers? | <p><strong>Theoretically the same</strong>. Because each test case is an independent conversation using the same Prompt and knowledge base settings.<br>Actual differences may come from: knowledge base content updates, system setting changes, LLM randomness (usually minimal).</p>                                                         |
| **Q5**       | Without standard answers, how to judge test results?                                                           | <p>Can refer to the following metrics:<br>• <strong>Answer Relevance</strong>: Is the answer on topic<br>• <strong>Safety Metrics</strong>: Is there bias, offensiveness, hallucination<br>• <strong>Response Time</strong>: Is the speed reasonable<br>• <strong>Manual Review</strong>: Check if actual answer content meets expectations</p> |
| **Q6**       | How to compare test results before and after improvements?                                                     | <p>Recommendations:</p><ol><li>Use the same test question bank</li><li>Record each test's description and setting changes</li><li>Compare pass rates, quality scores, and scores for each metric</li><li>View failed cases and analyze improvement effects</li></ol>                                                                            |
| **Q7**       | What to do if response time is too long?                                                                       | <p>Improvement methods:<br>◦ Adjust LLM model parameters<br>◦ Optimize retrieval process<br>◦ Reduce unnecessary tool calls</p>                                                                                                                                                                                                                 |
| **Q8**       | If test scores are low, how to improve?                                                                        | <p>Recommended adjustment steps:</p><ol><li><strong>Adjust role instructions</strong>: Add clear restrictions, such as "Please answer completely based on the provided knowledge base content."</li><li><strong>Check knowledge base</strong>: Make ambiguous guidelines specific (e.g., add specific phone numbers).</li></ol>                   |
| **Q9**       | Why does success rate drop after adding standard answers (Ground Truth)?                                       | Because system verification becomes stricter. Without standard answers, the system only checks response quality (such as tone, completeness); after adding them, the system compares "whether AI responses are completely consistent with company policy."                                                                                      |
