Increasing the number of pipeline stages significantly reduces the per-GPU memory footprint for model weights, but it does not comparably reduce the memory needed for activations and KV caches: each stage must keep enough microbatches or sequences in flight to stay utilized, so that memory stays roughly constant per GPU. Beyond a certain point, pipelining therefore offers diminishing returns for memory savings, with the KV cache becoming the dominant memory consumer.
Impact: High. This finding challenges the assumption that more pipelining is always better for memory efficiency. It highlights that KV cache size is a fundamental constraint that pipelining alone cannot solve, necessitating other architectural considerations.
In the source video, this keypoint occurs from 01:11:22 to 01:13:15.
Sources in support: Dwarkesh Patel (Host)
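To make the scaling concrete, here is a back-of-the-envelope sketch of the argument. All numbers (model size, layer count, KV bytes per layer) are illustrative assumptions, not figures from the talk; the point is only the shape of the curves: weight memory per GPU falls as 1/P, while KV-cache memory per GPU stays flat because each stage holds KV for fewer layers but for proportionally more in-flight sequences.

```python
def per_gpu_memory_gb(stages: int,
                      total_weight_gb: float = 140.0,
                      layers: int = 80,
                      kv_gb_per_layer_per_seq: float = 0.02):
    """Rough per-GPU memory for a hypothetical 70B-class model.

    Assumptions (illustrative only):
      - weights shard evenly across pipeline stages (1/P scaling)
      - keeping the pipeline busy needs ~P sequences in flight
      - each stage caches KV only for its own layers
    """
    weight_gb = total_weight_gb / stages
    layers_per_stage = layers / stages
    in_flight_seqs = stages  # ~P concurrent sequences to fill the pipeline
    kv_gb = layers_per_stage * in_flight_seqs * kv_gb_per_layer_per_seq
    return weight_gb, kv_gb

for p in (1, 2, 4, 8):
    w, kv = per_gpu_memory_gb(p)
    print(f"stages={p}: weights={w:6.1f} GB  kv_cache={kv:.2f} GB")
```

Under these assumptions, weight memory drops from 140 GB to 17.5 GB across eight stages while the KV cache stays fixed, so its relative share of per-GPU memory grows with every added stage.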

