Dwarkesh Patel · April 30, 2026
How GPT-5, Claude, and Gemini are actually trained and served – Reiner Pope
Duration: 2:13:40

Bytes Per Token Calculation — Dwarkesh Patel

From How GPT-5, Claude, and Gemini are actually trained and served – Reiner Pope. Category: Tech. Format: Commentary. This is a single keypoint from the analysis.

Assuming the equalization point sits at a 200k-token context and that weight memory time is negligible, one can back out the KV cache's size in bytes per token: set the per-step time to read the KV cache from memory equal to the per-token compute time implied by the activated parameter count, then solve for bytes per token. The result is a plausible figure of roughly two kilobytes, consistent with a typical dense attention mechanism.
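
A minimal sketch of how such a back-of-envelope calculation could run, assuming the equalization is between per-token KV-cache read time and per-token matmul compute time. The accelerator figures (roughly H100-class compute and bandwidth) and the 60B activated-parameter count are illustrative assumptions, not numbers from the episode:

```python
# Back-of-envelope: bytes per token of KV cache, inferred from an
# equalization point at 200k tokens of context.
# All hardware and model figures below are illustrative assumptions.

equalization_tokens = 200_000  # context length where the two cost terms equalize (from the keypoint)
activated_params = 60e9        # assumed activated parameter count
flops_per_s = 1.0e15           # assumed accelerator compute throughput, FLOP/s
mem_bytes_per_s = 3.35e12      # assumed HBM bandwidth, bytes/s

# Per-token compute time for the activated parameters (~2 FLOPs per
# parameter per token); weight-streaming time is treated as negligible.
compute_time_s = 2 * activated_params / flops_per_s

# At the equalization point, one decode step spends the same time reading
# the KV cache: equalization_tokens * bytes_per_token / mem_bytes_per_s.
bytes_per_token = compute_time_s * mem_bytes_per_s / equalization_tokens

print(f"KV cache: ~{bytes_per_token:,.0f} bytes per token")  # ~2,000 bytes
```

With these assumed figures the sketch lands near two kilobytes per token; a different compute-to-bandwidth ratio or activated-parameter count shifts the answer proportionally.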

Impact: Medium. The calculation turns opaque serving costs into a tangible memory metric and can inform architectural choices, such as attention design, for efficient LLM deployment.

In the source video, this keypoint occurs from 01:37:18 to 01:41:15.

Sources in support: Dwarkesh Patel (Host)

For the full credibility analysis, key takeaways, and other keypoints from this video, see the full analysis on skim.

This keypoint analysis was generated by skim (skim.plus), an AI-powered content analysis platform by Credible AI.