When a model exceeds a single rack's capacity, pipeline parallelism can span multiple racks: each rack holds a contiguous block of layers, and activations flow through them in sequence. This introduces 'bubbles' (idle time) during training, but it sharply reduces the memory capacity required per rack, which makes it beneficial for inference and enables larger models.
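The bubble overhead mentioned above can be quantified with a standard back-of-envelope formula. This is a minimal sketch assuming a simple GPipe-style schedule with equal-cost microbatches; the stage count and microbatch numbers below are illustrative, not taken from the video.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Fraction of pipeline time spent idle ('bubbles').

    With S equal-cost stages and M microbatches, a naive pipeline
    takes M + S - 1 time steps per stage, of which S - 1 are idle,
    so the bubble fraction is (S - 1) / (M + S - 1).
    """
    s, m = num_stages, num_microbatches
    return (s - 1) / (m + s - 1)

if __name__ == "__main__":
    # Hypothetical example: 4 racks as pipeline stages, 16 microbatches.
    print(f"{bubble_fraction(4, 16):.2%}")
```

Note that increasing the number of microbatches shrinks the bubble fraction, which is why training frameworks split each batch finely when pipelining.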
Impact: High. Pipeline parallelism is a critical strategy for scaling beyond single-rack limits, trading some computational efficiency for large memory savings, a trade-off that is especially favorable for inference workloads.
In the source video, this keypoint occurs from 00:48:00 to 00:51:00.
Sources in support: Dwarkesh Patel (Host)

