Large Language Model (LLM)

Key points for model selection

When choosing a large language model, consider the following key factors:

  1. Environment: Whether the deployment environment has internet access, which determines the choice between cloud and on-premise models.

  2. Quality: The model's ability to generate responses and how well it follows instructions.

  3. Speed: The text-generation speed and latency requirements, to ensure the model is responsive enough.

  4. Pricing: The model's usage cost; choose a suitable model based on your needs. (Model pricing does not need to be considered on MaiAgent.)

  5. Others: Whether it supports multimodality and Function calling (tool use); a brief sketch of function calling follows below.
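For reference, "Function calling" means the model can return a structured request to invoke a tool your application has described, instead of plain text. The sketch below only illustrates the idea in Python; `call_llm` and the `get_weather` schema are hypothetical and do not correspond to a specific MaiAgent or provider API.

```python
# Illustration of function calling (tool use): the application describes a
# tool with a JSON-Schema-style definition, and a model that supports
# function calling can reply with a structured call instead of free text.
# `call_llm` is a hypothetical stand-in for a real provider SDK; here it
# returns a mocked response just to show the expected shape.

weather_tool = {
    "name": "get_weather",
    "description": "Look up the current weather for a given city",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
        },
        "required": ["city"],
    },
}

def call_llm(messages, tools):
    """Hypothetical model call; a real implementation would call your provider's API."""
    # A function-calling-capable model replies with the tool name and the
    # arguments it wants the application to execute.
    return {"tool_call": {"name": "get_weather", "arguments": {"city": "Taipei"}}}

reply = call_llm(
    messages=[{"role": "user", "content": "What's the weather in Taipei right now?"}],
    tools=[weather_tool],
)
# The application runs the requested tool, then sends the result back to the
# model so it can compose the final natural-language answer.
print(reply["tool_call"])
```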

Large language model analysis - Artificial Analysis

Large language models supported on MaiAgent

Cloud models (closed-source)

| Model name | Description | Reasoning model? | Usage scenario |
| --- | --- | --- | --- |
| o4-mini | Faster than o3-mini-high, quality slightly lower than o3-mini-high | Yes | A choice focused on both quality and speed |
| o3-mini-high | High quality, moderate speed; performs multi-layered chain-of-thought reasoning before answering to provide more complete and accurate answers | Yes | High-difficulty tasks requiring deep reasoning and creativity |
| o3-mini-medium | Fast speed, moderate quality | Yes | Most commercial applications, simple creation, or routine Q&A |
| o3-mini-low | Fastest speed, basic quality, lacks deep reasoning | Yes | Simple tasks that prioritize speed over deep generation |
| o1-mini 2024-09-12 | The o1 series is trained with reinforcement learning to perform complex reasoning: the model thinks before answering, producing a long internal chain of thought before responding. Slowest speed, high quality | Yes | Very difficult questions, when other LLMs fall short |
| GPT-4o 2024-08-06 | Above-average quality and speed | No | Slightly weaker than Claude 3.5 Sonnet at instruction following and logic, but faster; a commonly used choice 👍 |
| GPT-4o mini 2024-07-18 | Fast speed, moderate quality; quality slightly lower than Gemini 2.0 Flash | No | Simple tasks; an alternative when Gemini 2.0 Flash cannot be chosen |
| Claude 4 Sonnet | Moderately slow; strong at generating and extracting structured data and particularly skilled at tool calling; logical reasoning and coding surpass Claude 3.7 Sonnet, with a further reduced hallucination rate | Hybrid reasoning model | First choice for Agent mode 👍 Suitable for highly complex tasks, professional-domain applications, and ultra-long conversations |
| Claude 3.7 Sonnet | Moderately slow; good at producing structured data, with stronger logical reasoning than Claude 3.5 Sonnet; low probability of hallucination | Hybrid reasoning model | First choice in most situations 👍 Suitable for high-complexity tasks, professional domains, and long-conversation applications |
| Claude 3.5 Sonnet | Follows role instructions well; logical reasoning is weaker than Claude 3.7 Sonnet, but faster; low probability of hallucination | No | If its speed feels too slow, you can switch to Gemini 2.0 Flash |
| Gemini 2.5 Pro | In longer conversations and code generation, quality is better than Claude 3.7 Sonnet, but slightly weaker in Agent mode and tool invocation | No | Can be used interchangeably with Claude 3.7 Sonnet |
| Gemini 2.0 Pro | Quality similar to Claude 3.5 Sonnet, but slower | No | An alternative to Claude 3.5 Sonnet |
| Gemini 2.5 Flash | Fast, good multimodal capabilities | No | |
| Gemini 2.0 Flash | Fast, moderate quality | No | First choice for simple tasks 👍 |
| DeepSeek V3 | Fast speed, high quality | Yes | Document retrieval and large-scale database query tasks |
| DeepSeek R1 Distill Llama 70B | High reply quality, moderate speed (slower than DeepSeek V3) | Yes | Tasks requiring multi-step reasoning and background knowledge |
| DeepSeek R1 | Relatively slow replies, but strong understanding of Chinese and high-quality responses; thinks deeply and genuinely adjusts to role instructions | Yes | Complex multi-turn Chinese conversations; handles complex role instructions 👍 |

On-premise models (open-source)

Below is a comparison table of mainstream open-source models; for the hardware requirements of open-source models, refer to the GPU section.

| Model name | Description | Usage scenario |
| --- | --- | --- |
| Meta Llama3.3 70B | High quality, moderate speed | Data analysis, copywriting |
| Meta Llama3.3 70B instruct (M2 Ultra) | High quality, fast speed | Voice customer service |
| Meta Llama3.2 90B | Extremely high quality, moderate speed | Professional-domain Q&A, high-precision tasks |
| Llama3-TAIDE-LX-70B-Chat (National Grid Center) | High quality, strong Chinese generation ability, moderate speed | Customer service Q&A, knowledge Q&A |
| TAIDE-LX-70B-Chat (National Grid Center) | High quality, moderate speed | Customer service Q&A, knowledge Q&A |
| Mistral Large (24.07) | Moderate quality, lacks deep reasoning ability, fast speed | Customer service Q&A, simple text generation |
| Meta-Llama 3.1-70B | Moderate quality, moderate compute requirements | Customer service, knowledge Q&A, advanced translation and summarization |
| Meta-Llama 3.1-8B | Acceptable quality, low compute requirements | Translation, summarization |
| Mistral Large 2 | High quality, high hardware requirements | Customer service, knowledge Q&A, advanced translation and summarization |
| Mistral 8x7B | Low quality, fastest speed | Translation, summarization |
| Gemma3 27B (M2 Ultra) | High quality, high hardware requirements | Professional knowledge Q&A, data analysis, complex content generation |

Does a model necessarily need fine-tuning?

With the rapid development of AI technology, language models already have powerful language understanding and generation capabilities and are widely applied in many fields. For example, pre-trained language models can easily handle daily conversations, article generation, and simple Q&A tasks.

However, when models face more challenging professional domain tasks, such as in medical, legal, or technical support fields, relying solely on pre-trained models may not provide the best performance. Some developers choose to perform Fine-tuning, which means additional training for specific domains to improve the model's domain expertise.

Fine-tuning, however, is not the only solution; two other effective methods can achieve the same goal: Prompt Engineering and RAG (Retrieval-Augmented Generation).

1. Prompt Engineering: optimizing model performance through precise prompts

Prompt Engineering means designing precise prompts that guide the model toward the desired results. The core of the method is to phrase the request in enough detail, based on the task requirements, that the model can narrow the answer range and understand the context and the required output format.

Suppose there is a language model whose goal is to recommend suitable products based on user needs. In this case, expressing "I want to buy a high-performance phone" may lead the model to generate imprecise answers, because "high-performance" can be interpreted in many ways, such as processing speed, camera performance, battery life, etc.

To improve recommendation accuracy, we can perform Prompt Engineering, designing more specific questions or providing additional context to better guide the model to understand user needs. For example, change to the following prompt:

"I need a phone withlong battery lifeanda high-efficiency processorand the phone should be priced between USD 500 to 800, please recommend several phones that meet these criteria."

2. RAG: combining external knowledge to enhance generation capabilities

RAG is the approach of combining external knowledge retrieval with the generation process to improve model performance.

In traditional generation tasks, the model relies only on knowledge learned during pre-training. RAG uses a retrieval system to obtain external information in real time and combines that information with the generation model, enabling more accurate answers or text generation.

For example, in the medical field, when a model is asked about a rare disease, RAG can first retrieve relevant materials from professional medical databases and then generate a more accurate answer based on those materials. The advantage of this method is that even if the model itself did not encounter certain materials during training, it can still produce high-quality responses by retrieving existing knowledge. This is especially suitable for scenarios requiring real-time knowledge updates and can greatly expand the model's knowledge scope.
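A minimal sketch of this retrieve-then-generate flow is shown below, assuming a toy keyword retriever in place of a real retrieval system (such as a vector database) and a hypothetical `generate_answer` placeholder for the chosen LLM:

```python
import re

# Minimal sketch of Retrieval-Augmented Generation (RAG):
# 1) retrieve relevant passages, 2) put them into the prompt, 3) generate.
# KNOWLEDGE_BASE and the word-overlap retriever are toy stand-ins.

KNOWLEDGE_BASE = [
    "Disease X: a rare condition; the first-line treatment was updated in 2024.",
    "Disease Y: a common condition, usually managed with standard outpatient care.",
]

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def search_knowledge_base(query: str, top_k: int = 1) -> list[str]:
    """Toy retriever: rank passages by word overlap with the query."""
    q = tokens(query)
    ranked = sorted(KNOWLEDGE_BASE, key=lambda p: len(q & tokens(p)), reverse=True)
    return ranked[:top_k]

def generate_answer(prompt: str) -> str:
    """Placeholder for the actual LLM call on your chosen model."""
    raise NotImplementedError("call your selected model here")

def rag_answer(question: str) -> str:
    context = "\n\n".join(search_knowledge_base(question))
    prompt = (
        "Answer the question using only the reference material below. "
        "If the material does not contain the answer, say so.\n\n"
        f"Reference material:\n{context}\n\n"
        f"Question: {question}"
    )
    return generate_answer(prompt)

# Because the references are fetched at query time, the model can answer about
# material it never saw during pre-training, such as a professional knowledge
# base that is updated daily.
```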

In summary, Fine-tuning, Prompt Engineering, and RAG each have their advantages and applicable scopes. Choose the most appropriate strategy based on the application scenario and needs, rather than relying solely on a single fine-tuning method.

Prompt Engineering provides a low-cost and flexible solution by designing precise prompt phrases to guide the model to produce high-quality results.

Meanwhile, RAG provides a method that combines external knowledge and generation capability, enabling more precise answers when dynamic knowledge acquisition is needed.

Fine-tuning can significantly improve model performance in specific domains, but it requires large amounts of domain data and consumes additional resources; it should be treated as a last resort, adopted only when Prompt Engineering and RAG both prove insufficient.
