For inference, the strategy leans heavily on expert parallelism, scaling it up to the size of the scale-up domain while minimizing pipeline parallelism. Tensor parallelism, once useful for splitting large experts across devices, is now less profitable because expert sizes have shrunk. This approach is preferred unless the model is too large to fit in a single rack's memory.
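A minimal sketch of that layout rule, using hypothetical names and illustrative sizes (plan_inference, model_bytes, rack_bytes, scaleup_domain are assumptions for illustration, not anything from the source): expert parallelism is pushed up to the scale-up domain, tensor parallelism is kept at 1 because experts are small, and pipeline stages are added only when the model spills past one rack's memory.

```python
from dataclasses import dataclass
import math


@dataclass
class InferencePlan:
    expert_parallel: int   # experts sharded across devices in the scale-up domain
    tensor_parallel: int   # kept at 1 when individual experts are small
    pipeline_stages: int   # >1 only if the model cannot fit in a single rack


def plan_inference(
    model_bytes: float,    # total parameter footprint of the model
    rack_bytes: float,     # aggregate accelerator memory in one rack
    num_experts: int,      # experts per MoE layer
    scaleup_domain: int,   # devices reachable over the fast scale-up fabric
) -> InferencePlan:
    # Expert parallelism first: spread experts across the scale-up domain,
    # but never use more devices than there are experts to place.
    expert_parallel = min(scaleup_domain, num_experts)

    # With small experts, splitting each expert further (tensor parallelism)
    # buys little and adds collective-communication overhead, so keep it at 1.
    tensor_parallel = 1

    # Pipeline only as a last resort: add stages when the model is too large
    # for a single rack's memory.
    pipeline_stages = max(1, math.ceil(model_bytes / rack_bytes))

    return InferencePlan(expert_parallel, tensor_parallel, pipeline_stages)


if __name__ == "__main__":
    # Illustrative numbers only: a ~1 TB model, 2 TB of memory per rack,
    # 128 experts, and a 72-device scale-up domain.
    plan = plan_inference(
        model_bytes=1e12,
        rack_bytes=2e12,
        num_experts=128,
        scaleup_domain=72,
    )
    print(plan)  # InferencePlan(expert_parallel=72, tensor_parallel=1, pipeline_stages=1)
```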
Impact: High. This choice optimizes inference by matching the parallelism strategy to model size and latency requirements rather than to memory capacity alone, a pragmatic approach to serving large models efficiently.
In the source video, this keypoint occurs from 01:13:00 to 01:14:43.
Sources in support: Dwarkesh Patel (Host), Jane Street (Trading firm)

