GPU compute hardware planning

Llama3 inference speed on GPU (tokens/sec)

Performance comparison of mainstream GPUs for Llama3 8B / 70B (tokens/sec; "Exceeds memory" means the model does not fit in that GPU's VRAM):

| GPU | Memory (VRAM) | 8B Q4_K_M | 8B F16 | 70B Q4_K_M | 70B F16 |
| --- | --- | --- | --- | --- | --- |
| RTX 4090 | 24 GB | 127.74 | 54.34 | Exceeds memory | Exceeds memory |
| RTX A6000 | 48 GB | 102.22 | 40.25 | 14.58 | Exceeds memory |
| L40S | 48 GB | 113.60 | 43.42 | 15.31 | Exceeds memory |
| RTX 6000 Ada | 48 GB | 130.99 | 51.97 | 18.36 | Exceeds memory |
| A100 | 80 GB | 138.31 | 54.56 | 22.11 | Exceeds memory |
| H100 | 80 GB | 144.49 | 67.79 | 25.01 | Exceeds memory |
| M2 Ultra | 192 GB | 76.28 | 36.25 | 12.13 | 4.71 |
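The "Exceeds memory" entries above follow directly from comparing each model's weight footprint (from the VRAM table on this page) against the GPU's VRAM. A minimal sketch of that check, using only the figures published here (a real check should also budget headroom for the KV cache and activations):

```python
# VRAM per GPU and per model/quantization, taken from the tables on this page.
GPU_VRAM_GB = {
    "RTX 4090": 24, "RTX A6000": 48, "L40S": 48,
    "RTX 6000 Ada": 48, "A100": 80, "H100": 80, "M2 Ultra": 192,
}
MODEL_VRAM_GB = {
    ("8B", "Q4_K_M"): 4.58, ("8B", "F16"): 14.96,
    ("70B", "Q4_K_M"): 39.59, ("70B", "F16"): 131.42,
}

def fits(gpu: str, model: str, quant: str) -> bool:
    """True if the model's weights alone fit in the GPU's VRAM."""
    return MODEL_VRAM_GB[(model, quant)] <= GPU_VRAM_GB[gpu]

print(fits("RTX 4090", "70B", "Q4_K_M"))  # False: 39.59 GB > 24 GB
print(fits("H100", "70B", "Q4_K_M"))      # True: 39.59 GB <= 80 GB
```

This is why only the M2 Ultra (192 GB unified memory) runs 70B at F16 in the benchmark above, despite its lower tokens/sec.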


VRAM required by Llama3 models

| Model | Q4_K_M (quantized) | F16 (original) |
| --- | --- | --- |
| Llama3 8B | 4.58 GB | 14.96 GB |
| Llama3 70B | 39.59 GB | 131.42 GB |
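These figures are roughly parameter count times bits per weight. A sketch of that estimate for the weights alone, ignoring KV-cache and activation overhead; the parameter count (~8.03B for Llama3 8B) and the ~4.9 average bits per weight for Q4_K_M are assumptions not stated on this page:

```python
def weight_vram_gib(n_params: float, bits_per_weight: float) -> float:
    """Estimate VRAM (GiB) for model weights alone."""
    return n_params * bits_per_weight / 8 / 2**30

# Llama3 8B: ~8.03e9 parameters (assumed), 16 bits per weight at F16.
print(round(weight_vram_gib(8.03e9, 16), 2))   # ~14.96, matching the table
# Q4_K_M averages roughly 4.9 bits per weight (assumed).
print(round(weight_vram_gib(8.03e9, 4.9), 2))  # ~4.58
```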


Hardware configuration recommendations

MaiAgent recommends two hardware combinations, suited to different budgets:

  1. Two H100 (80GB): higher budget, prioritizing quality and performance

  2. Two L40S (48GB) or RTX 6000 Ada (48GB): moderate budget, prioritizing cost-effectiveness

If you need more detailed information, feel free to contact MaiAgent's professional consultants at [email protected].
