The optimal batch size, the point where compute time and memory (weight-loading) time are balanced, is primarily determined by model sparsity, not just scale. Per step, compute time grows with active parameters × batch size, while weight-loading time grows with total parameters; balancing the two gives a required batch size of roughly a hardware constant (peak FLOPs / memory bandwidth) multiplied by a sparsity ratio (total parameters / active parameters). This implies that highly sparse models, such as mixture-of-experts models, need larger batches to stay compute-bound.
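A minimal sketch of this balance point, assuming a simple roofline-style model in which compute time is 2 × active parameters × batch FLOPs and memory time is the cost of streaming every weight once per step. The function name, bytes-per-parameter default, and the hardware and model numbers below are illustrative assumptions, not figures from the video:

```python
def critical_batch_size(peak_flops: float,
                        mem_bandwidth_bytes: float,
                        total_params: float,
                        active_params: float,
                        bytes_per_param: float = 2.0) -> float:
    """Batch size at which per-step compute time equals weight-loading time.

    Compute time ~ 2 * active_params * batch / peak_flops
    Memory time  ~ total_params * bytes_per_param / mem_bandwidth_bytes
    Setting the two equal and solving for batch gives the expression below:
    (FLOPs / bandwidth) * (total / active) * bytes_per_param / 2.
    """
    hardware_ratio = peak_flops / mem_bandwidth_bytes  # FLOPs per byte moved
    sparsity_ratio = total_params / active_params      # >= 1; 1 for dense models
    return hardware_ratio * sparsity_ratio * bytes_per_param / 2.0


# Illustrative example with assumed H100-class numbers
# (~1e15 FLOP/s bf16, ~3.35e12 bytes/s HBM bandwidth):
dense = critical_batch_size(1e15, 3.35e12, total_params=70e9, active_params=70e9)
sparse = critical_batch_size(1e15, 3.35e12, total_params=70e9, active_params=10e9)
print(f"dense model (70B total / 70B active):  ~{dense:.0f} tokens per step")
print(f"sparse model (70B total / 10B active): ~{sparse:.0f} tokens per step")
```

With these assumed numbers, the 7× sparser model needs roughly a 7× larger batch to reach the compute/memory balance point, which is exactly the scaling the keypoint describes.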
Impact: High. This finding reveals a critical yet often overlooked factor in scaling AI: the relationship between sparsity and the batch size needed for efficient inference.
In the source video, this keypoint occurs from 00:17:20 to 00:20:40.
Sources in support: Reiner Pope (CEO of MatX, former TPU architect at Google)

