H100 vs A100: The Hidden Economics of LLM Training at Scale
Real-world performance analysis of NVIDIA H100 vs A100 for large language model training, including cost analysis, infrastructure requirements, and scaling bottlenecks.
NVIDIA claims a 3.17x TFLOPS improvement for the H100 over the A100, but real-world LLM training shows only a 1.67x speedup. Are we approaching memory bandwidth limitations that make compute improvements irrelevant?
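A rough way to reconcile the two numbers is an Amdahl-style mix of compute-bound and bandwidth-bound time. The sketch below is illustrative only; the time split is an assumption and the ratios are the ones quoted in this question.

# Illustrative sketch, not measured data: split a training step into a
# compute-bound fraction f and a bandwidth-bound fraction (1 - f), and
# combine the per-GPU improvement ratios Amdahl-style.

def end_to_end_speedup(f_compute, compute_ratio, bandwidth_ratio):
    """Approximate step speedup when f_compute of the old step time is
    compute-bound and the rest is memory-bandwidth-bound."""
    new_time = f_compute / compute_ratio + (1 - f_compute) / bandwidth_ratio
    return 1.0 / new_time

COMPUTE_RATIO = 3.17    # peak TFLOPS ratio from the question
BANDWIDTH_RATIO = 1.73  # the ~73% HBM bandwidth improvement

for f in (1.0, 0.5, 0.0):
    print(f"f_compute={f:.1f}: {end_to_end_speedup(f, COMPUTE_RATIO, BANDWIDTH_RATIO):.2f}x")

# Prints 3.17x, 2.24x, 1.73x. Even a fully bandwidth-bound step caps at
# 1.73x, so an observed 1.67x suggests the workload is dominated by
# memory-bound kernels plus overheads (kernel launch, communication)
# that this two-term model does not capture.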
Additional Context
Large language models spend significant time in memory-bound operations such as attention mechanisms. Even with a 73% memory bandwidth improvement, we're still bottlenecked. Will future AI accelerators need to focus entirely on memory subsystems rather than compute units?
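For context, a back-of-the-envelope machine-balance check makes the memory-bound point concrete. The absolute peak figures below are assumptions that vary by SKU; only the ratios come from this question.

# Machine balance: FLOP per byte of HBM traffic needed to stay compute-bound.
PEAK_TFLOPS = {"A100": 312.0, "H100": 312.0 * 3.17}  # BF16 dense, assumed
PEAK_TBPS = {"A100": 2.0, "H100": 2.0 * 1.73}        # HBM bandwidth, assumed

for gpu in ("A100", "H100"):
    balance = PEAK_TFLOPS[gpu] / PEAK_TBPS[gpu]  # TFLOP/TB == FLOP/byte
    print(f"{gpu}: ~{balance:.0f} FLOP per byte to stay compute-bound")

# A kernel whose arithmetic intensity falls below this balance is
# bandwidth-bound. Softmax, layernorm, and other elementwise attention glue
# sit around 1-10 FLOP/byte, far under either GPU's balance, so they scale
# with the 1.73x bandwidth gain rather than the 3.17x compute gain -- and
# the H100's higher balance widens that gap.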
Asked by:
Dr. Raj Kumar • ML Infrastructure Engineer, OpenAI
Responses (1)
Dr. Raj Kumar • ML Infrastructure Engineer, OpenAI
The memory wall is real for LLM training. Future AI accelerators will need fundamentally different architectures: processing-in-memory, near-data computing, or specialized attention units. Simply increasing FLOPS without addressing memory bandwidth and latency will yield diminishing returns. Google's TPU v5 and Cerebras' wafer-scale approach hint at this direction.
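To put a number on the diminishing returns: in the same two-term mix sketched above, holding the bandwidth gain fixed and letting the compute gain grow, the end-to-end speedup saturates. The 50/50 compute/bandwidth split here is an assumed illustration, not a measurement.

# Diminishing returns from compute alone: fix a 1.73x bandwidth gain, assume
# half the step time is bandwidth-bound, and sweep the compute ratio.
F_COMPUTE = 0.5
BANDWIDTH_RATIO = 1.73

for compute_ratio in (3.17, 6.0, 100.0):
    speedup = 1.0 / (F_COMPUTE / compute_ratio + (1 - F_COMPUTE) / BANDWIDTH_RATIO)
    print(f"{compute_ratio:>6.2f}x compute -> {speedup:.2f}x end-to-end")

# Prints 2.24x, 2.69x, 3.40x. The limit as compute grows without bound is
# BANDWIDTH_RATIO / (1 - F_COMPUTE) = 3.46x here: without more bandwidth, or
# architectures that avoid the HBM round-trips entirely, extra FLOPS stop
# paying off.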