The cost per token is minimized when compute and memory fetches are amortized over a large batch. At small batch sizes, each decode step is dominated by fetching the model weights from memory, and that fixed fetch cost is spread over only a few tokens, so cost per token is high. As batch size grows, the weight fetches are amortized across more tokens until compute time becomes the limiting factor, which establishes a lower bound on cost per token.
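The argument can be sketched as a simple roofline-style cost model. The hardware and model numbers below are illustrative assumptions (not from the source): a 70B-parameter model in 16-bit weights on an accelerator with 2 TB/s of memory bandwidth and 300 TFLOP/s of compute, rented at $2/hr. Each decode step takes the larger of the weight-fetch time and the compute time, and that step time is shared by the whole batch.

```python
# Illustrative, assumed numbers -- not figures from the talk.
WEIGHT_BYTES = 70e9 * 2          # bytes of weights fetched per decode step
BANDWIDTH = 2e12                 # memory bandwidth, bytes/s
FLOPS = 300e12                   # peak compute, FLOP/s
FLOP_PER_TOKEN = 2 * 70e9        # ~2 FLOPs per parameter per generated token
PRICE_PER_SEC = 2.0 / 3600      # machine rental, $/s

def cost_per_token(batch: int) -> float:
    """Dollar cost to generate one token at a given batch size.

    A decode step either waits on weight fetches (memory-bound, small
    batch) or on matmuls (compute-bound, large batch); the step time is
    the max of the two, amortized over the batch.
    """
    step_time = max(WEIGHT_BYTES / BANDWIDTH,
                    batch * FLOP_PER_TOKEN / FLOPS)
    return PRICE_PER_SEC * step_time / batch

for b in (1, 8, 64, 512):
    print(f"batch {b:4d}: ${cost_per_token(b):.2e} per token")
```

With these numbers, cost per token falls roughly linearly with batch size while memory-bound, then flattens at the compute-set floor of `PRICE_PER_SEC * FLOP_PER_TOKEN / FLOPS` once the batch is large enough to saturate the arithmetic units.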
Impact: High. This analysis clarifies why 'slow modes' (serving at batch sizes too small to amortize weight fetches) are economically unviable, and it establishes the minimum achievable cost of inference, directly influencing pricing strategies.
In the source video, this keypoint occurs from 00:13:12 to 00:17:20.
Sources in support: Reiner Pope (CEO of MatX, former TPU architect at Google)

