While interconnect bandwidth gets most of the attention, memory capacity per rack is a primary constraint for large models. Even with advanced interconnects, if a model's total parameters and KV cache exceed the memory available within a single scale-up domain (rack), pipeline parallelism becomes necessary purely to distribute the memory footprint across racks, even when it does nothing to improve latency.
Impact: High. This highlights that memory capacity, not just bandwidth, is a critical bottleneck for scaling AI models, driving architectural decisions like pipelining that exist to manage resource constraints rather than to improve performance.
In the source video, this keypoint occurs from 01:01:17 to 01:02:11.
Sources in support: Dwarkesh Patel (Host)
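The capacity argument above can be made concrete with a back-of-the-envelope calculation. The sketch below (all numbers are illustrative assumptions, not figures from the video) estimates how many pipeline stages are forced by memory capacity alone, given a model size, KV cache budget, and per-rack memory:

```python
# Illustrative sketch: how many pipeline stages does memory capacity force?
# All hardware numbers below are hypothetical assumptions for the example.
import math

def pipeline_stages_needed(
    params_billions: float,   # total model parameters, in billions
    bytes_per_param: int,     # e.g. 2 for bf16/fp16 weights
    kv_cache_gb: float,       # total KV cache across concurrent requests
    gpus_per_rack: int,       # size of the scale-up domain
    hbm_per_gpu_gb: float,    # memory per accelerator
) -> int:
    model_gb = params_billions * bytes_per_param   # 1e9 params * bytes ~= GB
    total_gb = model_gb + kv_cache_gb
    rack_gb = gpus_per_rack * hbm_per_gpu_gb
    # If the total footprint exceeds one rack, split it across pipeline
    # stages, one scale-up domain per stage.
    return max(1, math.ceil(total_gb / rack_gb))

# Hypothetical example: a 2T-parameter bf16 model (~4 TB of weights) plus
# ~2 TB of KV cache, on racks of 8 GPUs with 141 GB of HBM each.
stages = pipeline_stages_needed(2000, 2, 2048, 8, 141)
print(stages)  # memory alone forces a multi-stage pipeline here
```

Note that this says nothing about throughput or latency; it only captures the point from the video that pipelining can be dictated by capacity even when it yields no speed benefit.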

