Pipeline parallelism requires splitting each batch into micro-batches to keep the pipeline stages busy, which can make training less efficient: fixed per-step costs such as weight loading and gradient calculations are no longer amortized over a full batch (see the sketch below). Smaller batches improve gradient freshness, but they increase this system overhead. In inference, pipelining is largely neutral for latency but offers memory capacity benefits.
Impact: Medium. Understanding the implications of micro-batching is key to optimizing training performance and to managing the trade-off between model convergence and system throughput under pipeline parallelism.
In the source video, this keypoint occurs from 00:56:00 to 00:59:00.
Sources in support: Dwarkesh Patel (Host)
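
To make the training trade-off concrete, here is a minimal Python sketch (not from the video; the stage count, the per-micro-batch overhead constant, and the function names are illustrative assumptions). It uses the standard GPipe-style bubble fraction (p - 1) / (m + p - 1) for p pipeline stages and m micro-batches: more micro-batches shrink the idle pipeline bubble, but each micro-batch re-incurs fixed costs that a full batch would amortize, so total step time is U-shaped in m.

```python
# Minimal sketch of the micro-batching trade-off in pipeline parallelism.
# All constants are illustrative assumptions, not measurements from the source.

def bubble_fraction(p: int, m: int) -> float:
    """Idle 'bubble' share of a GPipe-style schedule: (p - 1) / (m + p - 1)."""
    return (p - 1) / (m + p - 1)

def relative_step_time(p: int, m: int, per_microbatch_overhead: float = 0.03) -> float:
    """Time for one global batch, normalized so useful compute = 1.0.

    Fixed per-micro-batch costs (e.g., weight loads, kernel launches) are
    not amortized across the batch, so they grow linearly with m, while
    the bubble cost shrinks as m grows.
    """
    useful = 1.0
    f = bubble_fraction(p, m)
    bubble = useful * f / (1.0 - f)          # idle time implied by fraction f
    overhead = m * per_microbatch_overhead   # un-amortized per-micro-batch cost
    return useful + bubble + overhead

if __name__ == "__main__":
    p = 8  # assumed pipeline depth
    for m in (1, 2, 4, 8, 16, 32, 64, 128):
        print(f"m={m:4d}  bubble={bubble_fraction(p, m):6.1%}  "
              f"rel. step time={relative_step_time(p, m):5.2f}")
```

Under these assumed numbers the minimum step time falls near m = 16; the real optimum depends on the actual per-micro-batch overheads of a given system.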

