H100 vs A100: The Hidden Economics of LLM Training at Scale

Real-world performance analysis of NVIDIA H100 vs A100 for large language model training, including cost analysis, infrastructure requirements, and scaling bottlenecks.


NVIDIA's specifications put the H100 at roughly 3.17x the peak TFLOPS of the A100, yet real-world LLM training shows only about a 1.67x speedup. Are we approaching memory bandwidth limitations that make compute improvements irrelevant?

Additional Context

Large language models spend significant time in memory-bound operations such as attention. Even with a 73% memory bandwidth improvement, we're still bottlenecked. Will future AI accelerators need to focus entirely on memory subsystems rather than compute units?
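For intuition on where the observed number lands, here is a minimal back-of-envelope sketch in Python: treat a training step as a mix of compute-bound and memory-bound time and scale each part by the corresponding hardware ratio (the 3.17x compute and roughly 1.73x bandwidth figures cited above). The memory-bound fraction is a placeholder assumption; in practice you would take it from a profiler trace.

# Amdahl-style blended-speedup sketch using the spec ratios cited in the question.
# Assumption: step time splits cleanly into a compute-bound part (scales with peak
# TFLOPS) and a memory-bound part (scales with HBM bandwidth); the split itself is
# a placeholder you would measure with a profiler, not a known constant.

compute_ratio = 3.17    # H100 / A100 peak TFLOPS ratio cited above
bandwidth_ratio = 1.73  # the 73% memory bandwidth improvement cited above

def blended_speedup(memory_bound_fraction):
    """Overall H100-vs-A100 speedup for a given memory-bound time fraction."""
    f = memory_bound_fraction
    return 1.0 / (f / bandwidth_ratio + (1.0 - f) / compute_ratio)

for f in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"memory-bound fraction {f:.2f}: projected speedup {blended_speedup(f):.2f}x")

Even a fully memory-bound workload projects about 1.73x under this model, so an observed 1.67x suggests overheads beyond HBM bandwidth (interconnect, host, kernel launch) are also in play.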

Asked by:
Dr. Raj Kumar
ML Infrastructure Engineer, OpenAI

Responses (1)

Dr. Raj Kumar

ML Infrastructure Engineer, OpenAI
The memory wall is real for LLM training. Future AI accelerators will need fundamentally different architectures: processing-in-memory, near-data computing, or specialized attention units. Simply increasing FLOPS without addressing memory bandwidth and latency will yield diminishing returns. Google's TPU v5 and Cerebras' wafer-scale approach hint at this direction.
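One way to quantify this diminishing-returns point, as a rough roofline sketch (my addition, using approximate public datasheet figures that should be treated as assumptions): the ridge point, peak FLOPS divided by peak HBM bandwidth, is the arithmetic intensity a kernel needs before extra compute helps at all, and it rises between generations.

# Roofline ridge-point sketch. Peak figures are approximate dense-BF16 datasheet
# values and depend on the exact SKU; they are listed here only to show the trend.

specs = {
    # name: (peak TFLOPS, peak HBM bandwidth in TB/s), approximate
    "A100 SXM 80GB": (312.0, 1.94),
    "H100 SXM": (989.0, 3.35),
}

for name, (tflops, tb_per_s) in specs.items():
    ridge = tflops / tb_per_s  # FLOP per byte of HBM traffic
    print(f"{name}: ridge point ~{ridge:.0f} FLOP/byte")

Under these figures the ridge point roughly doubles (about 160 to about 295 FLOP/byte), so kernels that were already bandwidth-bound on A100 (softmax, layernorm, KV-cache reads in attention) see only the bandwidth gain on H100, which is the pattern the memory-wall argument above describes.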