Inter-rack communication introduces latency costs (a few milliseconds per hop) that stack up sequentially during decode: decoding is autoregressive, so every generated token pays the per-hop cost again, and those hops cannot be overlapped with useful work. While pipelining helps fit larger models, larger scale-up domains are what improve the aggregate memory bandwidth available per token, which in turn supports longer context lengths (larger KV caches) and lower inference latency, as the sketch below illustrates.
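As a rough illustration of how these two terms trade off, the following sketch models per-token decode latency as a memory-bandwidth term plus a sequential communication term. All numbers in it (a 1T-parameter model at 8-bit weights, 3.35 TB/s of HBM per chip, four sequential hops per token, the hop latencies) are hypothetical assumptions chosen for illustration, not figures from the video.

```python
# Back-of-envelope model of per-token decode latency: a memory-bandwidth
# term (reading the sharded weights) plus a sequential communication term.
# All constants below are illustrative assumptions, not measured values.

def decode_token_latency_s(
    param_bytes: float,      # model weight bytes read per token
    hbm_bw_per_chip: float,  # HBM bandwidth per chip, bytes/s
    chips: int,              # size of the scale-up domain
    hops_per_token: int,     # sequential communication rounds per token
    hop_latency_s: float,    # latency of one inter-rack hop, seconds
) -> float:
    # Weights are sharded across the domain, so the effective bandwidth
    # scales with the number of chips in the scale-up domain.
    memory_time = param_bytes / (hbm_bw_per_chip * chips)
    # During decode the hops happen sequentially, so their latencies add
    # rather than overlap with compute.
    comm_time = hops_per_token * hop_latency_s
    return memory_time + comm_time

# Hypothetical scenario: 1T parameters at 8-bit weights, 3.35 TB/s of
# HBM per chip, four sequential hops per generated token.
for chips, hop_latency_s in [(8, 2e-3), (64, 2e-3), (64, 2e-6)]:
    t = decode_token_latency_s(
        param_bytes=1e12,
        hbm_bw_per_chip=3.35e12,
        chips=chips,
        hops_per_token=4,
        hop_latency_s=hop_latency_s,
    )
    print(f"{chips:3d} chips, {hop_latency_s * 1e3:7.3f} ms/hop -> "
          f"{t * 1e3:6.2f} ms/token (~{1 / t:4.0f} tok/s)")
```

Under these assumptions, growing the scale-up domain shrinks the memory-bound term, but the fixed per-hop latency then becomes the floor on per-token time; that floor only moves when hop latency itself drops.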
Impact: High. The interplay between communication latency and memory bandwidth determines both the cost and the responsiveness of large-scale inference deployments. Sizing the scale-up domain correctly is essential for overcoming bandwidth limitations and enabling more capable, lower-latency models.
In the source video, this keypoint occurs from 01:15:03 to 01:18:37.
Sources in support: Dwarkesh Patel (Host)

