Large Language Model (LLM)
Key points for selection
When choosing a large language model, consider the following key factors:
Environment: Whether the deployment environment has internet access, which determines whether a cloud model or an on-premise model is appropriate.
Quality: The model's ability to generate responses and how well it follows instructions.
Speed: Text generation speed and latency, to ensure the model meets responsiveness requirements.
Pricing: The model's usage cost; choose a suitable model based on your needs. (Model pricing does not need to be considered on MaiAgent.)
Others: Whether the model supports multimodality and Function calling (a brief sketch of Function calling follows this list).
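
For readers unfamiliar with Function calling, the snippet below is a minimal sketch of what it looks like in practice. It assumes an OpenAI-compatible Chat Completions endpoint and a hypothetical get_weather tool; the model name and tool definition are illustrative assumptions, not MaiAgent-specific settings.

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

# Describe a hypothetical tool the model is allowed to call.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name for illustration
    messages=[{"role": "user", "content": "What's the weather in Taipei?"}],
    tools=tools,
)

# If the model decides a tool is needed, it returns a structured call
# instead of free text; your code executes it and sends the result back.
print(response.choices[0].message.tool_calls)
```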

Large language models supported on MaiAgent
Cloud models (closed-source)
| Model | Characteristics | Reasoning model | Suggested use |
| --- | --- | --- | --- |
| o4-mini | Faster than o3-mini-high; quality slightly lower than o3-mini-high | Yes | A choice that balances high quality and speed |
| o3-mini-high | High quality, moderate speed; performs multi-layered chain-of-thought reasoning before answering to provide more complete and accurate answers | Yes | High-difficulty tasks requiring deep reasoning and creativity |
| o3-mini-medium | Fast speed, moderate quality | Yes | Most commercial applications, simple creation, or routine Q&A |
| o3-mini-low | Fastest speed, basic quality, lacks deep reasoning | Yes | Simple tasks that prioritize speed over deep generation |
| o1-mini 2024-09-12 | The o1 series is trained with reinforcement learning to perform complex reasoning: the model thinks before answering, producing a long internal chain of thought before responding to the user. Slowest speed, high quality | Yes | Very difficult questions, when other LLMs fall short |
| GPT-4o 2024-08-06 | Above-average quality and speed | No | Slightly weaker than Claude 3.5 Sonnet at following instructions and logical reasoning, but faster. A commonly used choice 👍 |
| GPT-4o mini 2024-07-18 | Fast speed, moderate quality; quality slightly lower than Gemini 2.0 Flash | No | Simple tasks; an alternative when Gemini 2.0 Flash cannot be chosen |
| Claude 4 Sonnet | Moderately slow; strong at generating and extracting structured data and particularly skilled at tool calling; logical reasoning and coding surpass Claude 3.7 Sonnet, with a further reduced hallucination rate | Hybrid reasoning model | First choice for Agent mode 👍 Suitable for highly complex tasks, professional-domain applications, and ultra-long conversations |
| Claude 3.7 Sonnet | Moderately slow; good at producing structured data, with stronger logical reasoning than Claude 3.5 Sonnet; low probability of hallucination | Hybrid reasoning model | First choice in most situations 👍 Suitable for high-complexity tasks, professional domains, and long-conversation applications |
| Claude 3.5 Sonnet | Follows role instructions; logical reasoning is weaker than Claude 3.7 Sonnet, but faster; low probability of hallucination | No | If the speed feels too slow, you can switch to Gemini 2.0 Flash |
| Gemini 2.5 Pro | In longer conversations and code-generation scenarios, quality is better than Claude 3.7 Sonnet, but slightly worse in Agent mode and tool invocation | No | Can be used interchangeably with Claude 3.7 Sonnet |
| Gemini 2.0 Pro | Quality similar to Claude 3.5 Sonnet, but slower | No | An alternative to Claude 3.5 Sonnet |
| Gemini 2.5 Flash | Fast, good multimodal capabilities | No | |
| Gemini 2.0 Flash | Fast, moderate quality | No | First choice for simple tasks 👍 |
| DeepSeek V3 | Fast speed, high quality | Yes | Suitable for document retrieval and large-scale database query tasks |
| DeepSeek R1 Distill Llama 70B | High reply quality, moderate speed (slower than DeepSeek V3) | Yes | Suitable for tasks requiring multi-step reasoning and background knowledge |
| DeepSeek R1 | Relatively slow responses, but strong understanding of Chinese and high-quality replies; thinks deeply and genuinely adapts to role instructions | Yes | Scenarios requiring complex multi-turn Chinese conversations; handles complex role instructions 👍 |
On-premise models (open-source)
Below is a comparison table of mainstream open-source models; for the hardware requirements of open-source models, refer to the GPU section.
| Model | Characteristics | Suggested use |
| --- | --- | --- |
| Meta Llama3.3 70B | High quality, moderate speed | Data analysis, copywriting |
| Meta Llama3.3 70B instruct (M2 Ultra) | High quality, fast speed | Voice customer service |
| Meta Llama3.2 90B | Extremely high quality, moderate speed | Professional-domain Q&A, high-precision tasks |
| Llama3-TAIDE-LX-70B-Chat (National Grid Center) | High quality, strong Chinese generation ability, moderate speed | Customer service Q&A, knowledge Q&A |
| TAIDE-LX-70B-Chat (National Grid Center) | High quality, moderate speed | Customer service Q&A, knowledge Q&A |
| Mistral Large (24.07) | Moderate quality, lacks deep reasoning ability, fast speed | Customer service Q&A, simple text generation |
| Meta-Llama 3.1-70B | Moderate quality, moderate compute requirements | Customer service, knowledge Q&A, advanced translation and summarization |
| Meta-Llama 3.1-8B | Acceptable quality, low compute requirements | Translation, summarization |
| Mistral Large 2 | High quality, high hardware requirements | Customer service, knowledge Q&A, advanced translation and summarization |
| Mistral 8x7B | Low quality, fastest speed | Translation, summarization |
| Gemma3 27B (M2 Ultra) | High quality, high hardware requirements | Professional knowledge Q&A, data analysis, complex content generation |
Does a model necessarily need fine-tuning?
With the rapid development of AI technology, language models already have powerful language understanding and generation capabilities and are widely applied in many fields. For example, pre-trained language models can easily handle daily conversations, article generation, and simple Q&A tasks.
However, when models face more challenging professional domain tasks, such as in medical, legal, or technical support fields, relying solely on pre-trained models may not provide the best performance. Some developers choose to perform Fine-tuning, which means additional training for specific domains to improve the model's domain expertise.
However, fine-tuning is not the only solution; there are two other effective methods to achieve this goal: Prompt Engineering and RAG (Retrieval-Augmented Generation).
1. Prompt Engineering: optimizing model performance through precise prompts
Prompt Engineering means designing precise prompts to guide the model toward the desired results. The core of this method is crafting detailed wording based on the task requirements, helping the model narrow the answer space and understand the context and the required output format.
Suppose a language model's goal is to recommend suitable products based on user needs. A prompt such as "I want to buy a high-performance phone" may lead the model to generate imprecise answers, because "high-performance" can be interpreted in many ways, such as processing speed, camera performance, or battery life.
To improve recommendation accuracy, we can apply Prompt Engineering by asking more specific questions or providing additional context that better guides the model to understand the user's needs. For example, revise the prompt to:
"I need a phone with long battery life and a high-efficiency processor, priced between USD 500 and 800. Please recommend several phones that meet these criteria."
2. RAG: combining external knowledge to enhance generation capabilities
RAG combines external knowledge retrieval with the generation process to improve model performance.
In traditional generation tasks, the model relies only on knowledge learned during pre-training. RAG uses a retrieval system to obtain external information in real time and combines that information with the generation model, enabling more accurate answers or text generation.
For example, in the medical field, when a model is asked about a rare disease, RAG can first retrieve relevant materials from professional medical databases and then generate a more accurate answer based on those materials. The advantage of this method is that even if the model itself did not encounter certain materials during training, it can still produce high-quality responses by retrieving existing knowledge. This is especially suitable for scenarios requiring real-time knowledge updates and can greatly expand the model's knowledge scope.
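
The following minimal sketch illustrates the retrieve-then-generate flow. It uses a toy keyword-overlap retriever over an in-memory document list; a production system would use embeddings and a vector database, and the document contents, model name, and API client here are illustrative assumptions.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint and API key

# Toy knowledge base; in practice this would be a vector store of documents.
documents = [
    "Disease X is a rare disorder first described in 2018; typical symptoms include ...",
    "Standard treatment for Disease X combines drug A with physical therapy ...",
    "Company holiday policy: employees receive 12 paid holidays per year ...",
]

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Very naive retrieval: rank documents by word overlap with the query."""
    query_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def answer(query: str) -> str:
    # Assemble the retrieved material into the prompt, then generate.
    context = "\n".join(retrieve(query))
    prompt = (
        "Answer the question using only the reference material below.\n"
        f"Reference material:\n{context}\n\n"
        f"Question: {query}"
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name for illustration
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content

print(answer("What are the symptoms of Disease X?"))
```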
A more detailed introduction to RAG will be in the next chapter "RAG Knowledge Retrieval System Explanation".
In summary, Fine-tuning, Prompt Engineering, and RAG each have their advantages and applicable scopes. Choose the most appropriate strategy based on the application scenario and needs, rather than relying solely on a single fine-tuning method.
Prompt Engineering provides a low-cost and flexible solution by designing precise prompt phrases to guide the model to produce high-quality results.
Meanwhile, RAG provides a method that combines external knowledge and generation capability, enabling more precise answers when dynamic knowledge acquisition is needed.
Fine-tuning can significantly improve model performance in specific domains, but it requires large amounts of domain data and consumes additional resources; it should be regarded as a last resort, implemented only after both Prompt Engineering and RAG have proven insufficient.