GPU compute hardware planning
Llama3 inference speed on GPU (tokens/sec)

| GPU | Memory (VRAM) | 8B Q4_K_M | 8B F16 | 70B Q4_K_M | 70B F16 |
| --- | --- | --- | --- | --- | --- |
| RTX 4090 | 24 GB | 127.74 | 54.34 | Exceeds memory | Exceeds memory |
| RTX A6000 | 48 GB | 102.22 | 40.25 | 14.58 | Exceeds memory |
| L40S | 48 GB | 113.60 | 43.42 | 15.31 | Exceeds memory |
| RTX 6000 Ada | 48 GB | 130.99 | 51.97 | 18.36 | Exceeds memory |
| A100 | 80 GB | 138.31 | 54.56 | 22.11 | Exceeds memory |
| H100 | 80 GB | 144.49 | 67.79 | 25.01 | Exceeds memory |
| M2 Ultra | 192 GB | 76.28 | 36.25 | 12.13 | 4.71 |
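Throughput numbers like these translate directly into response latency. A minimal sketch (the helper name and the 500-token example are illustrative, not from the source) converts the table's tokens/sec into seconds per response:

```python
def generation_seconds(num_tokens: float, tokens_per_sec: float) -> float:
    """Approximate wall-clock time to generate num_tokens at a steady decode rate."""
    return num_tokens / tokens_per_sec

# e.g. a 500-token answer from Llama3 70B Q4_K_M on an H100 (25.01 tok/s from the table):
print(round(generation_seconds(500, 25.01), 1))  # ~20.0 seconds
```

Note this ignores prompt-processing (prefill) time, which adds to the total for long inputs.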
VRAM required by Llama3 models
| Model | Q4_K_M (quantized) | F16 (original) |
| --- | --- | --- |
| Llama3 8B | 4.58 GB | 14.96 GB |
| Llama3 70B | 39.59 GB | 131.42 GB |
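These figures can be roughly cross-checked from first principles: weight memory is approximately parameter count times bits per parameter. A minimal sketch (the helper is illustrative; the ~4.8 bits/weight average for Q4_K_M is an assumption, and KV cache and activations are excluded):

```python
def estimate_vram_gb(num_params: float, bits_per_param: float) -> float:
    """Rough weight-only memory estimate in GB (excludes KV cache and activations)."""
    return num_params * bits_per_param / 8 / 1e9

# F16 is 16 bits per weight: 70B parameters -> ~140 GB,
# in the same ballpark as the 131.42 GB in the table (which is likely GiB).
print(round(estimate_vram_gb(70e9, 16), 1))

# Q4_K_M averages roughly 4.8 bits per weight (assumption):
print(round(estimate_vram_gb(70e9, 4.8), 1))  # ~42 GB vs. 39.59 GB in the table
```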
Hardware configuration recommendations
MaiAgent recommends two configurations suited to different budgets:

- Two H100 (80GB): higher budget, prioritizing quality and performance
- Two L40S (48GB) or two RTX 6000 Ada (48GB): moderate budget, prioritizing cost-effectiveness
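The recommendations above follow from comparing model size against total VRAM across GPUs. A minimal fit-check sketch, assuming the VRAM figures from the tables above and an illustrative 10% headroom reserved for KV cache and runtime overhead:

```python
GPUS_GB = {"H100": 80, "L40S": 48, "RTX 6000 Ada": 48}
MODELS_GB = {
    "Llama3 8B F16": 14.96,
    "Llama3 70B Q4_K_M": 39.59,
    "Llama3 70B F16": 131.42,
}

def fits(model_gb: float, gpu_gb: float, count: int, headroom: float = 0.9) -> bool:
    """True if the model's weights fit in the usable VRAM of `count` GPUs."""
    return model_gb <= gpu_gb * count * headroom

# Two H100s (160 GB total) can hold even Llama3 70B F16:
print(fits(MODELS_GB["Llama3 70B F16"], GPUS_GB["H100"], 2))  # True

# Two L40S (96 GB total) handle 70B only in quantized form:
print(fits(MODELS_GB["Llama3 70B F16"], GPUS_GB["L40S"], 2))      # False
print(fits(MODELS_GB["Llama3 70B Q4_K_M"], GPUS_GB["L40S"], 2))   # True
```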
If you need more detailed information, feel free to contact MaiAgent's professional consultants at [email protected].