Gate News message, April 24: DeepSeek's V4 technical report reveals that V4-Flash and V4-Pro were pre-trained on 32T and 33T tokens respectively, roughly double the approximately 15T tokens used for V3. The report acknowledges "significant instability challenges" during training: loss spikes occurred repeatedly because of anomalies in the Mixture-of-Experts (MoE) layers, the routing mechanism itself exacerbated those anomalies, and a simple rollback could not resolve the issue.
DeepSeek adopted two fixes that are now used in actual training. The first, Anticipatory Routing, decouples routing index computation from backbone network updates and is triggered automatically only when a loss spike is detected, adding approximately 20% overhead while active. The second, SwiGLU Clamping, suppresses anomalies directly by clamping activation values to a fixed range. The report states that both approaches are effective but admits "the underlying principles remain insufficiently understood."
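For readers unfamiliar with the second technique, the following is a minimal sketch of what clamping a SwiGLU activation to a fixed range can look like. It is an illustration only, not DeepSeek's implementation: the function names, the choice to clamp both the gate and the linear branch before they are multiplied, and the `limit` value of 10.0 are all assumptions, since this article does not give the report's actual range or say which tensors are clamped.

```python
import math


def silu(x):
    """Numerically stable SiLU (x * sigmoid(x)), the gate activation in SwiGLU."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def clamp(x, limit):
    """Restrict x to the fixed range [-limit, limit]."""
    return max(-limit, min(limit, x))


def swiglu_clamped(gate, up, limit=10.0):
    """Element-wise SwiGLU with both branches clamped before the product.

    `gate` and `up` stand for the outputs of the two input projections of a
    feed-forward block. `limit` is a hypothetical value chosen for this sketch.
    """
    return [silu(clamp(g, limit)) * clamp(u, limit) for g, u in zip(gate, up)]


# An outlier pair (1000, 1000) is bounded by roughly limit * limit,
# while ordinary values pass through unchanged.
outputs = swiglu_clamped([1000.0, 0.5], [1000.0, 2.0], limit=10.0)
```

Without the clamp, the first pair would produce a value near one million; with it, the output stays below 100. That bound on extreme activations is the kind of direct suppression the report describes.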
Susan Zhang, a Google DeepMind researcher who previously worked at Meta AI and OpenAI, commented that the instability triggered by doubling training data "explains the delay." She described the two solutions as "band-aids" while acknowledging DeepSeek's technical transparency.