# LiveKit Voice Agent Integration Architecture

> This document explains how the MaiAgent platform integrates LiveKit to implement real-time voice call functionality, including WebRTC connection management, Socket.IO namespace isolation, and audio data streaming.

## 1. What is LiveKit?

LiveKit is an open-source real-time communication infrastructure that provides low-latency, high-quality audio and video communication capabilities. MaiAgent uses LiveKit to implement its AI Voice Agent feature, allowing users to have natural voice conversations with AI assistants.

### 1.1 Core Technical Advantages

| Aspect            | Traditional Phone Systems                   | LiveKit + MaiAgent                                                |
| ----------------- | ------------------------------------------- | ----------------------------------------------------------------- |
| **Latency**       | 200-500ms                                   | < 100ms, close to the feel of natural conversation                |
| **Audio Quality** | Limited by telephone network encoding       | Supports Opus high-quality encoding with adjustable bitrate       |
| **Scalability**   | Requires dedicated voice gateways           | Based on WebRTC, easily scalable horizontally                     |
| **Cost**          | Requires phone system rental fees           | Internet-based, lower cost                                        |
| **Integration**   | Difficult to integrate with digital systems | Natively integrates with web applications, embeddable in any page |

### 1.2 Common Use Cases

* **AI Voice Customer Service**: Customers have voice conversations with AI assistants directly through web pages or apps
* **Voice Assistants**: Employees operate internal systems by voice command, such as querying data or creating tickets
* **Remote Assistance**: AI assistants guide users through operations via voice, such as equipment installation or troubleshooting
* **Multilingual Support**: Combining speech recognition and translation for real-time cross-language communication

***

## 2. MaiAgent LiveKit Integration Architecture

```mermaid
flowchart TB
    subgraph Client["Browser/Client"]
        Web["Web UI"]
        WebRTC["WebRTC SDK"]
    end

    subgraph MaiAgent["MaiAgent Platform"]
        Django["Django Backend"]
        SIO_Main["Socket.IO Server<br/>(Main Namespace)"]
        SIO_Voice["Socket.IO Server<br/>(Voice Agent Dedicated)"]
        VoiceService["Voice Agent Service"]
    end

    subgraph LiveKit["LiveKit Infrastructure"]
        SFU["LiveKit SFU<br/>(Selective Forwarding Unit)"]
        Room["Virtual Room"]
    end

    subgraph AI["AI Processing Layer"]
        STT["Speech Recognition<br/>(Speech-to-Text)"]
        LLM["Language Model"]
        TTS["Speech Synthesis<br/>(Text-to-Speech)"]
    end

    Web <--> WebRTC
    WebRTC <-. "WebRTC Media Stream" .-> SFU
    Web <-. "Socket.IO Control" .-> SIO_Voice
    SIO_Voice <--> VoiceService
    VoiceService <--> SFU
    VoiceService --> STT
    STT --> LLM
    LLM --> TTS
    TTS -.-> SFU

    Django --> SIO_Main
    Django --> SIO_Voice
```

### 2.1 Core Component Overview

* **LiveKit SFU (Selective Forwarding Unit)**: Handles audio/video data routing and forwarding, supporting multiple simultaneous calls
* **Socket.IO Server**: Handles control signals such as call start/end and state synchronization
* **Voice Agent Service**: Coordinates the complete flow of speech recognition, LLM inference, and speech synthesis
* **WebRTC**: Establishes the media connection between the browser and the LiveKit SFU for audio data transmission

### 2.2 Connection Establishment Flow

```mermaid
sequenceDiagram
    participant User as User
    participant Browser as Browser
    participant Django as Backend
    participant SocketIO as Voice Agent Socket
    participant LiveKit as LiveKit SFU

    User->>Browser: Click "Start Call"
    Browser->>Django: Request to create voice session
    Django->>LiveKit: Create virtual room
    LiveKit-->>Django: Return room information and Token
    Django-->>Browser: Return connection parameters

    Browser->>SocketIO: Establish Socket.IO connection
    SocketIO-->>Browser: Connection confirmed

    Browser->>LiveKit: WebRTC connection request
    LiveKit-->>Browser: ICE Candidates exchange
    Browser->>LiveKit: Establish DTLS/SRTP encrypted channel
    LiveKit-->>Browser: Encrypted channel established

    Note over Browser,LiveKit: Audio streaming begins

    Browser->>SocketIO: Send PCM audio data
    SocketIO->>LiveKit: Forward audio to AI processing
```
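The token returned to the browser in the flow above is what section 6.4 calls a time-limited JWT. The following is a minimal standard-library sketch of what such a token looks like, assuming HS256 signing with the LiveKit API secret and a `video` grant carrying `room` and `roomJoin`. The function name, key, secret, room, and identity values are placeholders for illustration, and a real deployment would normally rely on an official LiveKit server SDK instead of hand-rolled signing.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as the JWT format requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_access_token(api_key: str, api_secret: str, room: str,
                      identity: str, ttl_seconds: int = 600,
                      now: Optional[int] = None) -> str:
    """Build an HS256 JWT carrying a room-join grant (illustrative only)."""
    issued = int(time.time()) if now is None else now
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {
        "iss": api_key,                 # the API key identifies the issuer
        "sub": identity,                # participant identity
        "nbf": issued,
        "exp": issued + ttl_seconds,    # time-limited, as section 6.4 requires
        "video": {"room": room, "roomJoin": True},  # assumed grant layout
    }
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(api_secret.encode("utf-8"),
                         signing_input.encode("ascii"),
                         hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"
```

Keeping `ttl_seconds` short limits how long a leaked token stays usable, since the client only needs it to join the room.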

***

## 3. Socket.IO Namespace Isolation

### 3.1 Why a Dedicated Namespace?

MaiAgent's original Socket.IO server was used primarily for real-time delivery of general chat messages. Voice calls have different requirements:

* **High-frequency Data Transmission**: Audio data must be transmitted continuously at 20-60ms intervals
* **Low Latency Requirements**: Any delay noticeably affects the call experience
* **Independent Error Handling**: Voice connection failures should not affect general chat functionality
* **Different CORS Policies**: Voice features may need to allow more origin domains

### 3.2 Namespace Isolation Architecture

MaiAgent runs a dedicated Socket.IO server for the Voice Agent, fully separated from the one that carries general chat messages. The Voice Agent server exposes its own namespace and has an independently configured CORS policy.

**Benefits of isolation**:

1. **Performance Isolation**: High-frequency voice data transmission does not affect general chat message processing
2. **Independent Scaling**: Server resources can be added specifically for voice traffic
3. **Enhanced Security**: Different features use different CORS policies and authentication mechanisms
4. **Fault Isolation**: Voice Agent issues do not affect other features

### 3.3 Cross-Origin Resource Sharing (CORS) Configuration

The Voice Agent's CORS settings are independent from general chat functionality, allowing only authorized origin domains to connect, ensuring the security of voice calls.
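As a rough illustration of an independent allow-list, the sketch below checks an incoming `Origin` header against a per-namespace set. The `/voice-agent` namespace and the two origins are taken from the log examples in section 5.2. The dictionary layout and function names are hypothetical, and the actual origin lists would come from deployment configuration.

```python
from urllib.parse import urlsplit

# Hypothetical per-namespace allow-lists; real values live in deployment config.
ALLOWED_ORIGINS = {
    "/": {"https://maiagent.ai"},
    "/voice-agent": {"https://maiagent.ai", "https://admin.maiagent.ai"},
}


def normalize_origin(origin: str) -> str:
    """Reduce an Origin header to lowercase scheme://host[:port]."""
    parts = urlsplit(origin.strip())
    if not parts.scheme or not parts.hostname:
        return ""  # malformed or opaque origins (e.g. "null") never match
    port = f":{parts.port}" if parts.port else ""
    return f"{parts.scheme.lower()}://{parts.hostname.lower()}{port}"


def is_origin_allowed(namespace: str, origin: str) -> bool:
    """Return True only if the origin is on the namespace's own allow-list."""
    normalized = normalize_origin(origin)
    return bool(normalized) and normalized in ALLOWED_ORIGINS.get(namespace, set())
```

Because each namespace has its own set, widening the voice allow-list does not loosen the policy for general chat.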

***

## 4. Audio Data Streaming

### 4.1 Audio Data Format

Audio data sent for voice processing uses the following specifications:

* **Format**: PCM (Pulse-Code Modulation)
* **Sample Rate**: 16kHz or 48kHz
* **Bit Depth**: 16-bit
* **Channels**: Mono
* **Data Transmission**: Base64 encoded and transmitted via Socket.IO
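The specifications above determine how large each audio message is. The sketch below works through the arithmetic and a Base64 round trip. It assumes signed little-endian samples, which the document does not state, and all function names are illustrative.

```python
import base64
import struct

BYTES_PER_SAMPLE = 2  # 16-bit PCM


def frame_bytes(sample_rate_hz: int, frame_ms: int, channels: int = 1) -> int:
    """Size in bytes of one raw PCM frame of the given duration."""
    samples = sample_rate_hz * frame_ms // 1000
    return samples * BYTES_PER_SAMPLE * channels


def duration_ms(num_bytes: int, sample_rate_hz: int, channels: int = 1) -> float:
    """Playback duration represented by a raw PCM payload."""
    return num_bytes * 1000 / (BYTES_PER_SAMPLE * channels * sample_rate_hz)


def encode_frame(samples: list) -> str:
    """Pack signed 16-bit little-endian samples (assumed) and Base64-encode them."""
    raw = struct.pack(f"<{len(samples)}h", *samples)
    return base64.b64encode(raw).decode("ascii")


def decode_frame(payload: str) -> list:
    """Reverse of encode_frame: Base64 text back to a list of samples."""
    raw = base64.b64decode(payload)
    return list(struct.unpack(f"<{len(raw) // BYTES_PER_SAMPLE}h", raw))
```

A 20ms mono frame is 640 bytes at 16kHz and 1920 bytes at 48kHz. The 3200-byte payload in the section 5.2 log sample corresponds to 100ms of 16kHz mono audio. Base64 grows each payload by about one third (640 raw bytes become 856 characters), which is the cost of sending binary audio as text.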

### 4.2 audioData Event Handling

```mermaid
sequenceDiagram
    participant Browser as Browser
    participant SocketIO as Socket.IO Server
    participant Queue as Audio Queue
    participant STT as Speech Recognition

    loop Continuous Transmission
        Browser->>SocketIO: Send audio data
        Note over SocketIO: Priority processing, bypasses general message queue
        SocketIO->>Queue: Add to voice processing queue
    end

    Queue->>STT: Batch process audio segments
    STT-->>Queue: Return recognized text
    Queue->>SocketIO: Return recognition results
    SocketIO->>Browser: Display real-time captions
```

**Audio Priority Processing Mechanism**:

MaiAgent has optimized audio data transmission to ensure audio data:

* **Does not enter the general message queue**: Avoids being blocked by other messages
* **Has priority processing**: Audio data can jump the queue to reduce latency
* **Has independent routing**: Forwarded directly to the voice processing module
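The three properties above can be pictured with a small dispatcher that keeps audio frames on their own queue and always drains that queue first. This is a conceptual sketch only: the `EventDispatcher` class and its methods are hypothetical, the `audioData` event name is taken from the heading of this section, and it does not reproduce the `ignore_queue` parameter mentioned in section 6.1.

```python
from collections import deque
from typing import Any, Deque, Optional, Tuple


class EventDispatcher:
    """Keeps audio frames out of the general message queue."""

    def __init__(self) -> None:
        self.audio: Deque[Any] = deque()
        self.general: Deque[Tuple[str, Any]] = deque()

    def submit(self, event: str, payload: Any) -> None:
        """Route an incoming event to the queue that matches its type."""
        if event == "audioData":
            self.audio.append(payload)              # independent route for audio
        else:
            self.general.append((event, payload))   # everything else waits here

    def next_item(self) -> Optional[Tuple[str, Any]]:
        """Return the next item to handle, always serving audio first."""
        if self.audio:
            return ("audioData", self.audio.popleft())
        if self.general:
            return self.general.popleft()
        return None
```

With this layout a burst of chat messages cannot delay an audio frame, because audio never waits behind entries in `general`.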

### 4.3 Disconnection Handling

MaiAgent implements graceful disconnection handling:

```mermaid
flowchart TD
    A["Disconnection detected"] --> B{"Disconnect reason?"}
    B -- "User hung up" --> C["Normal resource cleanup"]
    B -- "Network issue" --> D["Attempt reconnection"]
    B -- "Server error" --> E["Log error"]

    C --> F["Release LiveKit room"]
    D --> G{"Reconnection successful?"}
    E --> F

    G -- "Yes" --> H["Resume call"]
    G -- "No" --> I["Notify user<br/>and clean up resources"]

    F --> J["Close Socket.IO connection"]
    H --> K["Continue call"]
    I --> J
    J --> L["Complete"]
```

**Disconnection handling mechanisms**:

* **Heartbeat Detection**: Periodic ping/pong messages to check connection status
* **Automatic Reconnection**: Automatic reconnection attempts during brief disconnections without user intervention
* **Resource Cleanup**: Ensures LiveKit rooms and server resources are properly released
* **State Synchronization**: Restores conversation context after reconnection
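The "Attempt reconnection" branch is commonly implemented as a bounded retry loop with exponential backoff. The sketch below shows one such loop under assumed parameters: the attempt count, delays, and the `connect` callable are illustrative and are not MaiAgent's actual settings.

```python
import time
from typing import Callable


def reconnect_with_backoff(connect: Callable[[], bool],
                           max_attempts: int = 5,
                           base_delay: float = 0.5,
                           max_delay: float = 8.0,
                           sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try to reconnect, doubling the wait after each failed attempt.

    Returns True on success. False means every attempt failed and the
    caller should notify the user and clean up resources.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            if connect():
                return True
        except ConnectionError:
            pass  # treat an exception like a failed attempt and retry
        if attempt < max_attempts:
            sleep(delay)
            delay = min(delay * 2, max_delay)  # cap the wait between attempts
    return False
```

Passing `sleep` as a parameter keeps the function easy to test, and capping the delay keeps the total wait bounded, so the "Notify user and clean up resources" path is reached promptly when the network does not recover.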

***

## 5. Detailed Logging

MaiAgent implements comprehensive logging mechanisms for debugging and monitoring:

### 5.1 Log Types

| Log Type             | Recorded Content                                | Purpose                  |
| -------------------- | ----------------------------------------------- | ------------------------ |
| **Connection Logs**  | Socket.IO connect/disconnect events, source IP  | User session tracking    |
| **Audio Logs**       | Audio data transmission frequency, data size    | Call quality analysis    |
| **Error Logs**       | Exception stack traces, error messages, context | Problem investigation    |
| **Performance Logs** | Processing latency, queue length, CPU usage     | Performance optimization |

### 5.2 Log Examples

```log
[2025-12-23 14:30:15] INFO - Voice Agent Socket.IO Configuration:
  - Namespace: /voice-agent
  - Allowed Origins: ['https://maiagent.ai', ...]
  - Max Connections: 1000
  - Ping Timeout: 60s

[2025-12-23 14:30:45] INFO - Client connected:
  - Socket ID: a1b2c3d4e5f6
  - User ID: user_12345
  - Origin: https://admin.maiagent.ai

[2025-12-23 14:31:02] DEBUG - Audio data received:
  - Size: 3200 bytes
  - Format: PCM 16kHz mono
  - Latency: 45ms

[2025-12-23 14:32:15] WARNING - Connection lost:
  - Socket ID: a1b2c3d4e5f6
  - Reason: Network timeout
  - Duration: 90s
  - Attempting reconnection...
```
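Entries in this layout can be produced with Python's standard `logging` module. The sketch below is one possible configuration. The logger name `voice_agent` and both helper functions are hypothetical, and only the line format is taken from the examples above.

```python
import logging


def build_voice_logger(stream=None) -> logging.Logger:
    """Create a logger whose lines look like '[YYYY-MM-DD HH:MM:SS] LEVEL - message'."""
    logger = logging.getLogger("voice_agent")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()      # avoid duplicate handlers on repeated setup
    logger.propagate = False
    handler = logging.StreamHandler(stream)  # stream=None falls back to stderr
    handler.setFormatter(logging.Formatter(
        "[%(asctime)s] %(levelname)s - %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    ))
    logger.addHandler(handler)
    return logger


def log_audio_frame(logger: logging.Logger, size_bytes: int,
                    audio_format: str, latency_ms: int) -> None:
    """Emit one multi-line audio log entry at DEBUG level."""
    logger.debug(
        "Audio data received:\n  - Size: %d bytes\n  - Format: %s\n  - Latency: %dms",
        size_bytes, audio_format, latency_ms,
    )
```

Per-frame audio entries are emitted at `DEBUG` here, matching the sample above, so they can be switched off in production by raising the logger level.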

***

## 6. Technical Advantages of MaiAgent's LiveKit Integration

### 6.1 Performance Advantages

* **Low-latency Communication**: WebRTC media transport through the LiveKit SFU, with latency under 100ms
* **High Audio Quality**: Supports Opus encoding with dynamic bitrate adjustment based on network conditions
* **Namespace Isolation**: Voice and general messages are processed separately, avoiding interference
* **Audio Queue Optimization**: The `ignore_queue` parameter ensures audio data is processed with priority

### 6.2 Reliability Advantages

* **Automatic Reconnection**: Automatic connection recovery during network fluctuations
* **Resource Cleanup**: Ensures LiveKit rooms are properly released on disconnection
* **Error Isolation**: Voice Agent issues do not affect other features
* **Detailed Logging**: Comprehensive log records for easy issue tracking

### 6.3 Scalability Advantages

* **Horizontal Scaling**: Easily add LiveKit SFU nodes to handle more concurrent calls
* **Distributed Architecture**: Socket.IO and LiveKit can be deployed on different servers
* **Load Balancing**: Supports multiple Socket.IO instances behind a load balancer

### 6.4 Security Advantages

* **Token Authentication**: Every LiveKit connection requires a time-limited JWT Token
* **Encrypted Transport**: WebRTC encrypts media with DTLS/SRTP between each client and the LiveKit SFU
* **CORS Protection**: Strict control over allowed connection origins
* **Connection Tracking**: Records the source and user information for all connections

***

## 7. Related Documentation

* [Voice Customer Service](https://docs.maiagent.ai/application/voicecs) - User guide
* [IVR Customer Service Intent Recognition](https://docs.maiagent.ai/application/voicecs/ivr-ke-fu-yi-tu-bian-shi)
* [Voice Call Summary](https://docs.maiagent.ai/application/voicecs/yu-yin-tong-hua-zhai-yao)

### Reference Links

* [LiveKit Documentation](https://docs.livekit.io/)
* [WebRTC Specification](https://www.w3.org/TR/webrtc/)
* [Socket.IO Documentation](https://socket.io/docs/v4/)
* [Opus Codec](https://opus-codec.org/)

