
DeepSeek-V4 on Day 0: From Fast Inference to Verified RL with SGLang and Miles

The SGLang and Miles Team · 15 min read
#deepseek-v4 · #sglang · #miles · #speculative decoding · #hybrid attention
⚡ TL;DR · AI summary

SGLang and Miles have launched Day-0 support for DeepSeek-V4, enabling fast inference and reinforcement learning training for its 1.6T Pro and 284B Flash models. The system leverages novel technologies like ShadowRadix for hybrid attention caching, HiSparse for CPU-extended KV offloading, and speculative decoding with in-graph metadata. Optimizations include Flash Compressor, Lightning TopK, and kernel integrations for Blackwell and Hopper GPUs. Full parallelism and FP8 training are supported in RL workflows via the Miles framework.

Original article: Lmsys · The SGLang and Miles Team
Full article excerpt

Contents
- TL;DR
- Model Key Features & New Capabilities
- Designs, Features and Performance Optimizations
- ShadowRadix: Native Prefix Caching for Hybrid Attention
- Speculative Decoding
- HiSparse: Turbocharging Sparse Attention with Hierarchical Memory
- Fast Kernel Integrations
- Various Kernel Optimizations
- Flash Compressor: IO-aware Exact Compression
- Lightning TopK
- Parallelism and Deployment
- Hierarchical Multi-Stream Overlap
- Reinforcement Learning -- Miles Support
- Training backend: DeepSeek-V4 Modeling in Megatron-LM
- Parallelism: DP/TP/SP/EP/PP/CP supported
- Kernels
- RL Features
- Numerical Precision on a Mixed-Precision Stack
- Training Result
- Benchmark Notes
- Roadmap: The Path Ahead
- Acknowledgement
- Citation

The SGLang and Miles Team · April 25, 2026

We are thrilled to announce Day-0 support for DeepSeek-V4 across both inference and RL training. SGLang and Miles form the first open-source stack to serve and train DeepSeek-V4 on launch day, with systems purpose-built for its hybrid sparse-attention architecture, manifold-constrained hyper-connections (mHC), and FP4 expert weights.

Figure 1. Decode throughput of SGLang vs. the other OSS engine on a 30K-token prompt truncated from "Dream of the Red Chamber". We tried the best-effort speculative-decoding configuration for each engine based on its official recipe. See benchmark notes for details.

TL;DR

SGLang and Miles ship Day-0 inference and RL for DeepSeek-V4 (1.6T Pro, 284B Flash).
- Inference (caching & attention): ShadowRadix prefix cache, HiSparse CPU-extended KV, MTP speculative decoding with in-graph metadata, Flash Compressor, Lightning TopK, hierarchical multi-stream overlap.
- Inference (kernels & deployment): fast kernel integrations (FlashMLA, FlashInfer TRTLLM-Gen MoE, DeepGEMM Mega MoE, TileLang mHC), DP/TP/CP attention, EP MoE on DeepEP, PD disaggregation.
- RL training: full parallelism (DP/TP/SP/EP/PP/CP), TileLang attention, enhanced stability, FP8 training.
- Hardware: Hopper, Blackwell, Grace Blackwell, AMD, NPU.
- Launch commands: SGLang Cookbook.

Model Key Features & New Capabilities

DeepSeek-V4 (1.6T Pro, 284B Flash) extends its predecessor DeepSeek-V3.2 along three axes:
- Hybrid sparse attention: each layer mixes sliding window attention with one of two compression mechanisms (4:1 top-k or 128:1 dense), keeping the 1M-token context window tractable.
- mHC (Manifold-Constrained Hyper-Connections): a generalization of standard residual connections that improves gradient flow and representation quality.
- FP4 expert weights: native FP4 MoE experts for efficient serving on the latest Blackwell hardware.

Designs, Features and Performance Optimizations

ShadowRadix: Native Prefix Caching for Hybrid Attention

Every layer of DeepSeek-V4 combines SWA (sliding window attention over the last 128 raw tokens) with either C4 (top-512 sparse attention over 4:1-compressed KV) or C128 (dense attention over 128:1-compressed KV). In addition, to maintain the in-flight compressed KV slots, each compression layer keeps a state pool that stores its in-progress compression state. This mechanism breaks the assumptions of traditional prefix caching: three heterogeneous KV pools and two compression-state pools must stay coherent across prefill, decode, and speculative decoding. The following figure shows the per-layer hybrid attention scope for an N = 1024 example. To solve this coherence problem, we introduce ShadowRadix -- a native prefix caching mechanism for hybrid attention. One…
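As a rough illustration of the coherence problem described above, here is a hypothetical Python sketch, not SGLang's actual ShadowRadix code: using the window size and compression ratios quoted in the excerpt, it shows how a single cached prefix fans out across three heterogeneous KV pools and two compression-state pools. The names HybridPrefixEntry and reusable_slots are invented for this sketch.

```python
# Hypothetical illustration only; NOT SGLang's ShadowRadix implementation.
from dataclasses import dataclass, field

SWA_WINDOW = 128   # sliding window attention covers the last 128 raw tokens
C4_RATIO = 4       # 4:1 compressed KV, queried with top-512 sparse attention
C128_RATIO = 128   # 128:1 compressed KV, queried densely


@dataclass
class HybridPrefixEntry:
    """State a hybrid-attention prefix cache must keep coherent per cached prefix."""
    prefix_len: int                                           # raw tokens covered
    swa_block_ids: list[int] = field(default_factory=list)    # raw-KV pool (SWA)
    c4_block_ids: list[int] = field(default_factory=list)     # 4:1 compressed-KV pool
    c128_block_ids: list[int] = field(default_factory=list)   # 128:1 compressed-KV pool
    c4_state_id: int | None = None    # in-progress compression state, C4 layers
    c128_state_id: int | None = None  # in-progress compression state, C128 layers


def reusable_slots(prefix_len: int) -> dict[str, int]:
    """Estimate how much of each pool a cached prefix of `prefix_len` tokens reuses.

    Raw SWA KV only matters for the trailing window, the compressed pools cover
    the part of the prefix that divides evenly by their ratio, and the remainder
    must be recovered from the per-layer compression-state pools.
    """
    return {
        "swa_raw_tokens": min(prefix_len, SWA_WINDOW),
        "c4_compressed_slots": prefix_len // C4_RATIO,
        "c4_pending_raw_tokens": prefix_len % C4_RATIO,       # held in the C4 state pool
        "c128_compressed_slots": prefix_len // C128_RATIO,
        "c128_pending_raw_tokens": prefix_len % C128_RATIO,   # held in the C128 state pool
    }


if __name__ == "__main__":
    # The N = 1024 example mentioned above.
    print(reusable_slots(1024))
```

In this toy accounting, a 1024-token prefix yields 128 raw SWA tokens, 256 C4 slots, 8 C128 slots, and no pending compression state, and it is exactly this multi-pool state that must remain consistent across prefill, decode, and speculative decoding.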

This excerpt is published under fair use for community discussion. Read the full article at Lmsys.


