Assuming an equalization point at 200k tokens, i.e., the context length at which KV-cache memory-read time matches weight memory-read time, and treating weight memory time as otherwise negligible, one can back out the KV-cache bytes per token. At the equalization point the KV cache occupies roughly as many bytes as the activated parameters, so dividing that footprint by 200k tokens yields a plausible figure of about two kilobytes per token, which aligns with typical dense attention mechanisms (see the arithmetic sketch below).
Impact: Medium. This calculation provides a tangible metric for understanding memory requirements and informs architectural choices for efficient LLM deployment.
In the source video, this keypoint occurs from 01:37:18 to 01:41:15.
Sources in support: Dwarkesh Patel (Host)
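
A minimal back-of-the-envelope sketch of this calculation, in Python. The activated-parameter footprint (~200M parameters at 2 bytes each, about 400 MB) is an illustrative assumption chosen so the numbers land near the stated result; the source only fixes the 200k-token equalization point and the ~2 KB/token outcome.

```python
# Back-of-the-envelope: KV-cache bytes per token implied by a 200k-token
# equalization point between KV-cache reads and weight reads.

EQUALIZATION_TOKENS = 200_000  # context length where KV-cache read time ≈ weight read time

# Hypothetical activated-parameter footprint (assumption, not from the source):
# ~200M activated parameters stored at 2 bytes each (FP16/BF16) ≈ 400 MB.
activated_params = 200e6
bytes_per_param = 2
weight_bytes = activated_params * bytes_per_param

# At the equalization point the KV cache is read in the same time as the
# weights, so (at matched bandwidth) its size is comparable; dividing by the
# context length gives the per-token KV-cache footprint.
kv_bytes_per_token = weight_bytes / EQUALIZATION_TOKENS

print(f"KV-cache bytes per token ≈ {kv_bytes_per_token:.0f} B "
      f"(~{kv_bytes_per_token / 1000:.1f} kB)")
# -> ≈ 2000 B per token, i.e. around two kilobytes, matching the keypoint.
```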

