Dwarkesh Patel · April 30, 2026
How GPT-5, Claude, and Gemini are actually trained and served – Reiner Pope
Duration: 2:13:40
MoE Layer Architecture and Expert Parallelism — Dwarkesh Patel

From How GPT-5, Claude, and Gemini are actually trained and served – Reiner Pope. Category: Tech. Format: Commentary. This is a single keypoint from the analysis.

Mixture of Experts (MoE) layers route each token to a small subset of specialized "experts," typically activating only a fraction of them, e.g. 1 of 32, per token. This is mapped to hardware via expert parallelism: different experts reside on different GPUs, and tokens are dispatched to whichever GPU holds their assigned expert, then gathered back. The communication cost of this dispatch-and-return traffic is the key constraint, and the design goal is to keep it from becoming a bottleneck.
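The routing described above can be sketched in a few lines. This is a hedged, single-process illustration using NumPy, not the actual implementation from the talk: all names (`router_w`, `experts`, the top-1 routing rule) are assumptions, and the per-expert loop stands in for the all-to-all communication that expert parallelism would perform across GPUs.

```python
import numpy as np

rng = np.random.default_rng(0)

n_tokens, d_model = 8, 16
n_experts, top_k = 32, 1   # route each token to 1 of 32 experts, as in the keypoint

tokens = rng.standard_normal((n_tokens, d_model))

# Router: a learned linear layer scoring each token against each expert.
router_w = rng.standard_normal((d_model, n_experts))
logits = tokens @ router_w
expert_ids = np.argmax(logits, axis=-1)   # top-1 expert index per token

# Each expert is its own feed-forward network; here, one weight matrix each.
# Under expert parallelism these would live on different GPUs.
experts = rng.standard_normal((n_experts, d_model, d_model))

# Dispatch/return: gather each expert's assigned tokens, apply its FFN,
# and scatter the results back to the original token positions. On real
# hardware this gather/scatter is the all-to-all communication step.
out = np.zeros_like(tokens)
for e in range(n_experts):
    mask = expert_ids == e
    if mask.any():
        out[mask] = tokens[mask] @ experts[e]
```

Because only 1 of 32 experts runs per token, total compute per token stays close to that of a much smaller dense layer, which is what makes the large total parameter count affordable.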

Impact: High. MoE is the foundational strategy for scaling models beyond dense architectures: it grows total parameter count while keeping per-token compute roughly constant. The efficiency of the routing and its associated communication is therefore paramount for both training and serving performance.

In the source video, this keypoint occurs from 00:32:00 to 00:37:00.

Sources in support: Dwarkesh Patel (Host)

For the full credibility analysis, key takeaways, and other keypoints from this video, see the full analysis on skim.

This keypoint analysis was generated by skim (skim.plus), an AI-powered content analysis platform by Credible AI.