GPU Computing Hardware Planning

Llama3 Inference Speed on GPUs (tokens/second)

| GPU | Memory (VRAM) | 8B Q4_K_M | 8B F16 | 70B Q4_K_M | 70B F16 |
| --- | --- | --- | --- | --- | --- |
| RTX 4090 | 24GB | 127.74 | 54.34 | Out of Memory | Out of Memory |
| RTX A6000 | 48GB | 102.22 | 40.25 | 14.58 | Out of Memory |
| L40S | 48GB | 113.60 | 43.42 | 15.31 | Out of Memory |
| RTX 6000 Ada | 48GB | 130.99 | 51.97 | 18.36 | Out of Memory |
| A100 | 80GB | 138.31 | 54.56 | 22.11 | Out of Memory |
| H100 | 80GB | 144.49 | 67.79 | 25.01 | Out of Memory |
| M2 Ultra | 192GB | 76.28 | 36.25 | 12.13 | 4.71 |
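To make the throughput figures easier to relate to user-facing latency, the following minimal sketch converts tokens/second into an estimated generation time. It assumes throughput stays constant over the whole response and ignores prompt processing time, so the results are rough estimates only. The function name `seconds_to_generate` and the 500-token response length are hypothetical choices for illustration.

```python
# Generation speeds (tokens/second) for Llama3 70B Q4_K_M, copied from the table above.
# None marks a configuration reported as "Out of Memory".
SPEED_70B_Q4 = {
    "RTX 4090": None,
    "RTX A6000": 14.58,
    "L40S": 15.31,
    "RTX 6000 Ada": 18.36,
    "A100": 22.11,
    "H100": 25.01,
    "M2 Ultra": 12.13,
}


def seconds_to_generate(num_tokens, tokens_per_second):
    """Estimate generation time, assuming throughput stays constant (an assumption)."""
    if tokens_per_second is None:
        return None  # the model did not fit in memory
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    response_tokens = 500  # hypothetical response length
    for gpu, speed in SPEED_70B_Q4.items():
        t = seconds_to_generate(response_tokens, speed)
        label = "out of memory" if t is None else f"{t:.1f} s"
        print(f"{gpu:<14}{label}")
```

Under these assumptions, a 500-token answer from the 70B Q4_K_M model takes roughly 20 seconds on an H100 and roughly 34 seconds on an RTX A6000.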

VRAM Requirements for Llama3 Models

| Model | Q4_K_M (Quantized) | F16 (Original) |
| --- | --- | --- |
| Llama3 8B | 4.58 GB | 14.96 GB |
| Llama3 70B | 39.59 GB | 131.42 GB |
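The "Out of Memory" entries in the speed table follow directly from these sizes. As a rough sketch, the check below compares a model's size against a card's VRAM. The `headroom_gb` parameter is a hypothetical allowance for runtime overhead such as the KV cache; the source gives no figure for it, so it defaults to zero here.

```python
# Model sizes in GB, copied from the VRAM requirements table.
MODEL_SIZE_GB = {
    ("Llama3 8B", "Q4_K_M"): 4.58,
    ("Llama3 8B", "F16"): 14.96,
    ("Llama3 70B", "Q4_K_M"): 39.59,
    ("Llama3 70B", "F16"): 131.42,
}


def fits(vram_gb, model, precision, headroom_gb=0.0):
    """Return True if the model weights plus an assumed headroom fit in vram_gb."""
    return MODEL_SIZE_GB[(model, precision)] + headroom_gb <= vram_gb


if __name__ == "__main__":
    for vram in (24, 48, 80, 192):
        for (model, precision) in MODEL_SIZE_GB:
            verdict = "fits" if fits(vram, model, precision) else "out of memory"
            print(f"{vram:>3} GB  {model} {precision:<7} {verdict}")
```

With zero headroom this matches the benchmark table: the 70B Q4_K_M model (39.59 GB) does not fit on a 24GB card but fits on 48GB cards, and the 70B F16 model (131.42 GB) fits only in the 192GB M2 Ultra.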

Source

GitHub - XiongjieDai/GPU-Benchmarks-on-LLM-Inference: Multiple NVIDIA GPUs or Apple Silicon for Large Language Model Inference?

Hardware Configuration Recommendations

MaiAgent recommends two configurations, depending on budget:

- Two H100 (80GB): higher budget, prioritizing quality and performance
- L40S (48GB) and RTX 6000 Ada (48GB): standard budget, focusing on cost-effectiveness
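One way to relate these configurations to the VRAM requirements is to add up the memory across cards. The sketch below is illustrative only. It assumes the model weights can be split across all GPUs in a configuration, it ignores KV cache and other runtime overhead, and it reads the second option as one L40S paired with one RTX 6000 Ada, which the source does not state explicitly. The helper names are hypothetical.

```python
# Llama3 70B sizes (GB) and per-card VRAM (GB), copied from the tables on this page.
LLAMA3_70B_GB = {"Q4_K_M": 39.59, "F16": 131.42}
GPU_VRAM_GB = {"H100": 80, "L40S": 48, "RTX 6000 Ada": 48}


def total_vram(gpus):
    """Sum VRAM across cards, assuming the weights can be split across them."""
    return sum(GPU_VRAM_GB[g] for g in gpus)


def largest_70b_variant(gpus):
    """Return the highest-precision 70B variant whose weights fit in pooled VRAM."""
    total = total_vram(gpus)
    for precision in ("F16", "Q4_K_M"):
        if LLAMA3_70B_GB[precision] <= total:
            return precision
    return None


if __name__ == "__main__":
    for config in (["H100", "H100"], ["L40S", "RTX 6000 Ada"]):
        print(config, total_vram(config), "GB ->", largest_70b_variant(config))
```

Under these assumptions, two H100 cards pool 160 GB, enough for the 70B F16 weights (131.42 GB), while an L40S plus an RTX 6000 Ada pool 96 GB, enough for 70B Q4_K_M (39.59 GB) but not for F16.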

For more detailed information, please contact MaiAgent's professional consultants at [email protected]


Last updated 4 months ago
