GPU compute hardware planning
Llama3 inference speed on GPU (tokens/sec)

| GPU | Memory (VRAM) | 8B Q4_K_M | 8B F16 | 70B Q4_K_M | 70B F16 |
| --- | --- | --- | --- | --- | --- |
| RTX 4090 | 24 GB | 127.74 | 54.34 | Exceeds memory | Exceeds memory |
| RTX A6000 | 48 GB | 102.22 | 40.25 | 14.58 | Exceeds memory |
| L40S | 48 GB | 113.60 | 43.42 | 15.31 | Exceeds memory |
| RTX 6000 Ada | 48 GB | 130.99 | 51.97 | 18.36 | Exceeds memory |
| A100 | 80 GB | 138.31 | 54.56 | 22.11 | Exceeds memory |
| H100 | 80 GB | 144.49 | 67.79 | 25.01 | Exceeds memory |
| M2 Ultra | 192 GB | 76.28 | 36.25 | 12.13 | 4.71 |
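Throughput numbers like these translate directly into response latency. A minimal sketch (the helper name and the 500-token example are illustrative, not from the source) converts the table's tokens/sec into seconds per response:

```python
def generation_seconds(num_tokens: float, tokens_per_sec: float) -> float:
    """Approximate wall-clock time to generate num_tokens at a steady decode rate."""
    return num_tokens / tokens_per_sec

# e.g. a 500-token answer from Llama3 70B Q4_K_M on an H100 (25.01 tok/s from the table):
print(round(generation_seconds(500, 25.01), 1))  # ~20.0 seconds
```

Note this ignores prompt-processing (prefill) time, which adds to the total for long inputs.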
VRAM required by Llama3 models
| Model | Q4_K_M (quantized) | F16 (original) |
| --- | --- | --- |
| Llama3 8B | 4.58 GB | 14.96 GB |
| Llama3 70B | 39.59 GB | 131.42 GB |
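These figures can be roughly cross-checked from first principles: weight memory is approximately parameter count times bits per parameter. A minimal sketch (the helper is illustrative; the ~4.8 bits/weight average for Q4_K_M is an assumption, and KV cache and activations are excluded):

```python
def estimate_vram_gb(num_params: float, bits_per_param: float) -> float:
    """Rough weight-only memory estimate in GB (excludes KV cache and activations)."""
    return num_params * bits_per_param / 8 / 1e9

# F16 is 16 bits per weight: 70B parameters -> ~140 GB,
# in the same ballpark as the 131.42 GB in the table (which is likely GiB).
print(round(estimate_vram_gb(70e9, 16), 1))

# Q4_K_M averages roughly 4.8 bits per weight (assumption):
print(round(estimate_vram_gb(70e9, 4.8), 1))  # ~42 GB vs. 39.59 GB in the table
```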
Hardware configuration recommendations
MaiAgent recommends two configurations suited to different budgets:

- Two H100 (80GB): higher budget, prioritizing quality and performance
- Two L40S (48GB) or two RTX 6000 Ada (48GB): moderate budget, prioritizing cost-effectiveness
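The recommendations above follow from comparing model size against total VRAM across GPUs. A minimal fit-check sketch, assuming the VRAM figures from the tables above and an illustrative 10% headroom reserved for KV cache and runtime overhead:

```python
GPUS_GB = {"H100": 80, "L40S": 48, "RTX 6000 Ada": 48}
MODELS_GB = {
    "Llama3 8B F16": 14.96,
    "Llama3 70B Q4_K_M": 39.59,
    "Llama3 70B F16": 131.42,
}

def fits(model_gb: float, gpu_gb: float, count: int, headroom: float = 0.9) -> bool:
    """True if the model's weights fit in the usable VRAM of `count` GPUs."""
    return model_gb <= gpu_gb * count * headroom

# Two H100s (160 GB total) can hold even Llama3 70B F16:
print(fits(MODELS_GB["Llama3 70B F16"], GPUS_GB["H100"], 2))  # True

# Two L40S (96 GB total) handle 70B only in quantized form:
print(fits(MODELS_GB["Llama3 70B F16"], GPUS_GB["L40S"], 2))      # False
print(fits(MODELS_GB["Llama3 70B Q4_K_M"], GPUS_GB["L40S"], 2))   # True
```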
If you need more detailed information, feel free to contact MaiAgent's professional consultants at [email protected].