H100 vs A100: The Hidden Economics of LLM Training at Scale

Real-world performance analysis of NVIDIA H100 vs A100 for large language model training, including cost analysis, infrastructure requirements, and scaling bottlenecks.


NVIDIA's specifications put the H100 at roughly 3.17x the peak TFLOPS of the A100, yet real-world LLM training shows only about a 1.67x speedup. Are we approaching memory bandwidth limitations that make compute improvements irrelevant?

Additional Context

Large language models spend significant time in memory-bound operations such as attention. Even with a 73% memory bandwidth improvement, we're still bottlenecked. Will future AI accelerators need to focus entirely on memory subsystems rather than compute units?
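For intuition on where the observed number lands, here is a minimal back-of-envelope sketch in Python: treat a training step as a mix of compute-bound and memory-bound time and scale each part by the corresponding hardware ratio (the 3.17x compute and roughly 1.73x bandwidth figures cited above). The memory-bound fraction is a placeholder assumption; in practice you would take it from a profiler trace.

# Amdahl-style blended-speedup sketch using the spec ratios cited in the question.
# Assumption: step time splits cleanly into a compute-bound part (scales with peak
# TFLOPS) and a memory-bound part (scales with HBM bandwidth); the split itself is
# a placeholder you would measure with a profiler, not a known constant.

compute_ratio = 3.17    # H100 / A100 peak TFLOPS ratio cited above
bandwidth_ratio = 1.73  # the 73% memory bandwidth improvement cited above

def blended_speedup(memory_bound_fraction):
    """Overall H100-vs-A100 speedup for a given memory-bound time fraction."""
    f = memory_bound_fraction
    return 1.0 / (f / bandwidth_ratio + (1.0 - f) / compute_ratio)

for f in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"memory-bound fraction {f:.2f}: projected speedup {blended_speedup(f):.2f}x")

Even a fully memory-bound workload projects about 1.73x under this model, so an observed 1.67x suggests overheads beyond HBM bandwidth (interconnect, host, kernel launch) are also in play.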

Asked by:
Dr. Raj Kumar
ML Infrastructure Engineer, OpenAI

Responses (1)

Dr. Raj Kumar

ML Infrastructure Engineer, OpenAI
The memory wall is real for LLM training. Future AI accelerators will need fundamentally different architectures: processing-in-memory, near-data computing, or specialized attention units. Simply increasing FLOPS without addressing memory bandwidth and latency will yield diminishing returns. Google's TPU v5 and Cerebras' wafer-scale approach hint at this direction.
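One way to quantify this diminishing-returns point, as a rough roofline sketch (my addition, using approximate public datasheet figures that should be treated as assumptions): the ridge point, peak FLOPS divided by peak HBM bandwidth, is the arithmetic intensity a kernel needs before extra compute helps at all, and it rises between generations.

# Roofline ridge-point sketch. Peak figures are approximate dense-BF16 datasheet
# values and depend on the exact SKU; they are listed here only to show the trend.

specs = {
    # name: (peak TFLOPS, peak HBM bandwidth in TB/s), approximate
    "A100 SXM 80GB": (312.0, 1.94),
    "H100 SXM": (989.0, 3.35),
}

for name, (tflops, tb_per_s) in specs.items():
    ridge = tflops / tb_per_s  # FLOP per byte of HBM traffic
    print(f"{name}: ridge point ~{ridge:.0f} FLOP/byte")

Under these figures the ridge point roughly doubles (about 160 to about 295 FLOP/byte), so kernels that were already bandwidth-bound on A100 (softmax, layernorm, KV-cache reads in attention) see only the bandwidth gain on H100, which is the pattern the memory-wall argument above describes.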