DeepSeek V4 Training Data Doubled to 33T Tokens, Triggering Instability That Delayed Release

Gate News, April 24: DeepSeek's V4 technical report reveals that V4-Flash and V4-Pro were pre-trained on 32T and 33T tokens respectively, roughly double the approximately 15T tokens used for V3. The report acknowledges "significant instability challenges" during training. Loss spikes occurred repeatedly because of anomalies in the Mixture-of-Experts (MoE) layer, the routing mechanism itself exacerbated those anomalies, and a simple rollback could not resolve the issue.

DeepSeek adopted two mitigations, both of which are now used in actual training. Anticipatory Routing decouples routing index computation from backbone network updates and is triggered automatically only when a loss spike is detected, adding approximately 20% overhead. SwiGLU Clamping suppresses anomalies directly by clamping activation values to a fixed range. The report states that both approaches are effective but admits that "the underlying principles remain insufficiently understood."
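
To make the clamping idea concrete, here is a minimal Python sketch. It is not DeepSeek's implementation: the article says only that activation values are clamped to a fixed range, so the function names (`swiglu`, `clamped_swiglu`), the choice to clamp both the gate and the up-projection inputs, and the bound `limit=10.0` are all assumptions made for illustration.

```python
import math


def silu(x: float) -> float:
    """SiLU (swish) activation, x * sigmoid(x), in a numerically stable form."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def clamp(x: float, limit: float) -> float:
    """Restrict x to the fixed range [-limit, limit]."""
    return max(-limit, min(limit, x))


def swiglu(gate: list[float], up: list[float]) -> list[float]:
    """Plain SwiGLU combination: silu(gate) * up, elementwise."""
    return [silu(g) * u for g, u in zip(gate, up)]


def clamped_swiglu(gate: list[float], up: list[float], limit: float = 10.0) -> list[float]:
    """SwiGLU with both inputs clamped first.

    Assumption: where the clamp sits and the value of `limit` are
    illustrative choices, not details taken from the report.
    """
    gate_c = [clamp(g, limit) for g in gate]
    up_c = [clamp(u, limit) for u in up]
    return swiglu(gate_c, up_c)


# One ordinary pair of values and one outlier pair.
gate = [0.5, 200.0]
up = [1.0, -300.0]
plain = swiglu(gate, up)
bounded = clamped_swiglu(gate, up)
```

In this sketch the ordinary value passes through unchanged, while the outlier's magnitude can never exceed `limit ** 2`, because the magnitude of `silu(g)` is at most that of `g`. That bound is the sense in which a clamp suppresses an anomaly directly without explaining where it came from.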

Susan Zhang, a Google DeepMind researcher who previously worked at Meta AI and OpenAI, commented that the instability triggered by doubling training data "explains the delay." She described the two solutions as "band-aids" while acknowledging DeepSeek's technical transparency.
