Gate News, April 22 — Princeton PhD student Yifan Zhang disclosed complete technical specifications for DeepSeek V4 on X, following a preview posted on April 19. V4 reportedly has 1.6 trillion total parameters, alongside a lightweight variant, V4-Lite, with 285 billion parameters.
The model employs the DSA2 attention mechanism, which combines DeepSeek's earlier DSA (DeepSeek Sparse Attention) from V3.2 with NSA (Native Sparse Attention) using 512-dimensional head embeddings, paired with sparse Multi-Query Attention (MQA) and Sliding Window Attention (SWA). The MoE (Mixture of Experts) layer contains 384 experts, with 6 activated per forward pass, executed through a fused MoE mega-kernel. Residual connections use the Hyper-Connections architecture.
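The reported expert configuration (384 experts, 6 active) corresponds to standard top-k token routing. Below is a minimal PyTorch sketch of such a router; the class name, hidden size, and softmax-over-selected-experts normalization are illustrative assumptions, not details from the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Illustrative top-k MoE gate: 384 experts, 6 active per token.

    The hidden size (d_model) and normalization scheme are hypothetical;
    only the expert counts come from the reported specs.
    """
    def __init__(self, d_model: int = 4096, n_experts: int = 384, top_k: int = 6):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, d_model) -> routing scores over all experts
        scores = self.gate(x)                               # (num_tokens, n_experts)
        weights, expert_idx = scores.topk(self.top_k, -1)   # keep the 6 best-scoring experts
        weights = F.softmax(weights, dim=-1)                # renormalize over the selected experts
        return weights, expert_idx
```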
Training details revealed for the first time include the use of the Muon optimizer (which applies Newton-Schulz orthogonalization to momentum updates), a 32K-token pre-training context window, and GRPO (Group Relative Policy Optimization) with KL divergence correction during reinforcement learning. The final context window extends to 1 million tokens. The model is text-only.
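For reference, the Newton-Schulz orthogonalization at the core of Muon is typically implemented as a short fixed-coefficient quintic iteration applied to each 2-D momentum matrix. The sketch below follows the publicly available Muon reference implementation; the coefficients and step count come from that public code, not from the V4 disclosure.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2-D momentum/update matrix G.

    Quintic Newton-Schulz iteration as used in public Muon implementations;
    the coefficients are tuned so the singular values converge quickly toward 1.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)           # normalize so the spectral norm is roughly <= 1
    transposed = X.size(0) > X.size(1)
    if transposed:                      # iterate on the wide orientation for efficiency
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X               # X <- a*X + b*(XX^T)X + c*(XX^T)^2 X
    return X.T if transposed else X

# In Muon, the orthogonalized matrix is then scaled and applied in place of the
# raw SGD-momentum step for 2-D weight matrices.
```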
Zhang is not employed by DeepSeek, and the company has not officially commented on the disclosed information.