GPU Computing Hardware Planning
Llama3 Inference Speed on GPUs (tokens/second)

| GPU | Memory (VRAM) | 8B Q4_K_M | 8B F16 | 70B Q4_K_M | 70B F16 |
| --- | --- | --- | --- | --- | --- |
| RTX 4090 | 24 GB | 127.74 | 54.34 | Out of Memory | Out of Memory |
| RTX A6000 | 48 GB | 102.22 | 40.25 | 14.58 | Out of Memory |
| L40S | 48 GB | 113.60 | 43.42 | 15.31 | Out of Memory |
| RTX 6000 Ada | 48 GB | 130.99 | 51.97 | 18.36 | Out of Memory |
| A100 | 80 GB | 138.31 | 54.56 | 22.11 | Out of Memory |
| H100 | 80 GB | 144.49 | 67.79 | 25.01 | Out of Memory |
| M2 Ultra | 192 GB | 76.28 | 36.25 | 12.13 | 4.71 |
VRAM Requirements for Llama3 Models
| Model | Q4_K_M (Quantized) | F16 (Original) |
| --- | --- | --- |
| Llama3 8B | 4.58 GB | 14.96 GB |
| Llama3 70B | 39.59 GB | 131.42 GB |
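The VRAM requirements above can be turned into a quick feasibility check. The sketch below is illustrative only: the model sizes come from the table, but the 1.2× runtime overhead factor (KV cache, activations, framework buffers) is an assumption, not a measured value, and actual headroom varies with context length and inference engine.

```python
# Rough check of whether a Llama3 variant fits in a given GPU's VRAM.
# Model file sizes (GB) are taken from the VRAM requirements table above.
MODEL_VRAM_GB = {
    ("8B", "Q4_K_M"): 4.58,
    ("8B", "F16"): 14.96,
    ("70B", "Q4_K_M"): 39.59,
    ("70B", "F16"): 131.42,
}

def fits_in_vram(model: str, precision: str, gpu_vram_gb: float,
                 overhead: float = 1.2) -> bool:
    """Return True if model weights plus estimated runtime overhead fit.

    The 1.2x overhead factor is an assumed allowance for KV cache and
    activations; tune it for your context length and inference stack.
    """
    return MODEL_VRAM_GB[(model, precision)] * overhead <= gpu_vram_gb

# Consistent with the benchmark table:
print(fits_in_vram("70B", "Q4_K_M", 48))  # 48 GB cards run 70B Q4_K_M: True
print(fits_in_vram("70B", "Q4_K_M", 24))  # RTX 4090 (24 GB): False
print(fits_in_vram("70B", "F16", 80))     # Even 80 GB cards OOM on F16: False
```

This mirrors the benchmark results: 70B Q4_K_M (39.59 GB) fits on 48 GB cards but not the 24 GB RTX 4090, and 70B F16 (131.42 GB) exceeds every single GPU tested except the 192 GB M2 Ultra.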
Hardware Configuration Recommendations
MaiAgent recommends two hardware combinations suited to different budgets:

- Two H100 (80 GB): higher budget, prioritizing quality and performance
- L40S (48 GB) and RTX 6000 Ada (48 GB): standard budget, focusing on cost-effectiveness
For more detailed information, please contact MaiAgent's professional consultants at [email protected]