DeepSeek-V4 on Day 0: From Fast Inference to Verified RL with SGLang and Miles
SGLang and Miles have launched Day-0 support for DeepSeek-V4, enabling fast inference and reinforcement learning training for its 1.6T Pro and 284B Flash models. The system leverages novel technologies like ShadowRadix for hybrid attention caching, HiSparse for CPU-extended KV offloading, and speculative decoding with in-graph metadata. Optimizations include Flash Compressor, Lightning TopK, and kernel integrations for Blackwell and Hopper GPUs. Full parallelism and FP8 training are supported in RL workflows via the Miles framework.
- SGLang and Miles provide Day-0 inference and RL training support for DeepSeek-V4's hybrid sparse-attention architecture and FP4 expert weights.
- ShadowRadix enables coherent prefix caching across multiple KV and compression-state pools in DeepSeek-V4's hybrid attention layers.
- HiSparse boosts throughput by offloading inactive C4 KV cache to CPU memory, increasing capacity and efficiency for long-context generation (a minimal offload sketch follows this list).
- Speculative decoding is accelerated via in-graph metadata preparation and hierarchical multi-stream overlap to reduce launch overhead.
- The stack supports DP/TP/SP/EP/PP/CP parallelism, FP8 training, and optimized kernels like FlashMLA, FlashInfer TRTLLM-Gen, and DeepGEMM Mega MoE.
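To picture the HiSparse idea from the list above, here is a minimal two-tier cache sketch. It is not the SGLang implementation: the class and method names (CpuOffloadedKV, offload, fetch_topk) are invented for illustration, and the sketch only shows the data movement implied by the bullet, i.e. parking inactive C4 KV blocks in pinned host memory and copying back just the blocks that the sparse top-k step selects.

```python
import torch


class CpuOffloadedKV:
    """Hypothetical two-tier store for C4 (4:1-compressed) KV blocks.

    Hot blocks stay on the GPU; inactive blocks are parked in pinned host
    memory so long contexts do not exhaust device memory. Only the blocks
    selected by the sparse top-k step are copied back for attention.
    Stream/event management for true async overlap is omitted.
    """

    def __init__(self, num_blocks: int, block_tokens: int, dim: int, device="cuda"):
        self.device = device
        # CPU tier: pinned so host<->device copies can be issued asynchronously.
        self.cpu_kv = torch.empty(num_blocks, block_tokens, dim, pin_memory=True)
        self.on_gpu: dict[int, torch.Tensor] = {}  # block_id -> GPU-resident block

    def offload(self, block_id: int) -> None:
        """Evict an inactive block from the GPU tier to the CPU tier."""
        block = self.on_gpu.pop(block_id, None)
        if block is not None:
            self.cpu_kv[block_id].copy_(block, non_blocking=True)

    def fetch_topk(self, block_ids: torch.Tensor) -> torch.Tensor:
        """Bring back only the blocks chosen by the top-k selection."""
        out = []
        for bid in block_ids.tolist():
            if bid not in self.on_gpu:
                self.on_gpu[bid] = self.cpu_kv[bid].to(self.device, non_blocking=True)
            out.append(self.on_gpu[bid])
        return torch.stack(out)  # [k, block_tokens, dim], ready for attention
```

In the real system, block granularity, the eviction policy, and how transfers overlap with compute are what determine the throughput gains; the sketch deliberately leaves those out.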
Contents: TL;DR · Model Key Features & New Capabilities · Designs, Features and Performance Optimizations · ShadowRadix: Native Prefix Caching for Hybrid Attention · Speculative Decoding · HiSparse: Turbocharging Sparse Attention with Hierarchical Memory · Fast Kernel Integrations · Various Kernel Optimizations · Flash Compressor: IO-aware Exact Compression · Lightning TopK · Parallelism and Deployment · Hierarchical Multi-Stream Overlap · Reinforcement Learning -- Miles Support · Training backend: DeepSeek-V4 Modeling in Megatron-LM · Parallelism: DP/TP/SP/EP/PP/CP supported · Kernels · RL Features · Numerical Precision on a Mixed-Precision Stack · Training Result · Benchmark Notes · Roadmap: The Path Ahead · Acknowledgement · Citation

The SGLang and Miles Team, April 25, 2026

We are thrilled to announce Day-0 support for DeepSeek-V4 across both inference and RL training. SGLang and Miles form the first open-source stack to serve and train DeepSeek-V4 on launch day, with systems purpose-built for its hybrid sparse-attention architecture, manifold-constrained hyper-connections (mHC), and FP4 expert weights.

Figure 1. Decode throughput of SGLang vs. the other OSS engine on a 30K-token prompt truncated from "Dream of the Red Chamber". We tried the best-effort speculative-decoding configuration for each engine based on its official recipe. See the benchmark notes for details.

TL;DR

SGLang and Miles ship Day-0 inference and RL for DeepSeek-V4 (1.6T Pro, 284B Flash).

- Inference (caching & attention): ShadowRadix prefix cache, HiSparse CPU-extended KV, MTP speculative decoding with in-graph metadata, Flash Compressor, Lightning TopK, hierarchical multi-stream overlap.
- Inference (kernels & deployment): fast kernel integrations (FlashMLA, FlashInfer TRTLLM-Gen MoE, DeepGEMM Mega MoE, TileLang mHC), DP/TP/CP attention, EP MoE on DeepEP, PD disaggregation.
- RL training: full parallelism (DP/TP/SP/EP/PP/CP), TileLang attention, enhanced stability, FP8 training.
- Hardware: Hopper, Blackwell, Grace Blackwell, AMD, NPU.
- Launch commands: SGLang Cookbook.

Model Key Features & New Capabilities

DeepSeek-V4 (1.6T Pro, 284B Flash) extends its predecessor DeepSeek-V3.2 along three axes:

- Hybrid sparse attention: each layer mixes sliding window attention with one of two compression mechanisms (4:1 top-k or 128:1 dense), keeping the 1M-token context window tractable.
- mHC (Manifold-Constrained Hyper-Connections): a generalization of standard residual connections that improves gradient flow and representation quality.
- FP4 expert weights: native FP4 MoE experts for efficient serving on the latest Blackwell hardware.

Designs, Features and Performance Optimizations

ShadowRadix: Native Prefix Caching for Hybrid Attention

Every layer of DeepSeek-V4 combines SWA (sliding window attention over the last 128 raw tokens) with either C4 (top-512 sparse attention over 4:1-compressed KV) or C128 (dense attention over 128:1-compressed KV). In addition, because compression of KV slots happens in flight, each compression layer keeps a state pool that stores its in-progress compression state. This mechanism breaks the assumptions of traditional prefix caching: three heterogeneous KV pools and two compression-state pools must stay coherent across prefill, decode, and speculative decoding. The following figure shows the per-layer hybrid attention scope for an N = 1024 example. To solve this coherence problem, we introduce ShadowRadix -- a native prefix caching mechanism for hybrid attention. One…
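To make the per-layer attention scopes above concrete, the following back-of-the-envelope sketch computes how many KV slots each branch touches at a given context length, using only the numbers quoted in the post (128-token SWA window, top-512 over 4:1-compressed KV, dense over 128:1-compressed KV). It is illustrative arithmetic, not code from the serving stack.

```python
import math


def attention_scope(n_tokens: int) -> dict[str, int]:
    """Rough per-layer KV scope of DeepSeek-V4's hybrid attention branches,
    using the figures quoted in the post (illustrative only)."""
    return {
        # SWA: sliding window over the last 128 raw tokens.
        "SWA": min(n_tokens, 128),
        # C4: keys compressed 4:1, then the top-512 compressed slots attended.
        "C4": min(math.ceil(n_tokens / 4), 512),
        # C128: dense attention over 128:1-compressed slots.
        "C128": math.ceil(n_tokens / 128),
    }


print(attention_scope(1024))       # {'SWA': 128, 'C4': 256, 'C128': 8}
print(attention_scope(1_000_000))  # {'SWA': 128, 'C4': 512, 'C128': 7813}
```

Even at the 1M-token limit, the SWA and C4 branches stay bounded and only the C128 branch grows, and it grows 128x more slowly than the raw context, which is what keeps long-context decode tractable.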
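The excerpt cuts off before describing ShadowRadix itself, but the coherence requirement it states (three heterogeneous KV pools plus two compression-state pools that must be cached, reused, and evicted together) can be sketched as a radix-style prefix node whose cache entry holds handles into every pool at once. This is a speculative illustration of the stated problem, not the ShadowRadix design; every class, field, and pool name below is invented.

```python
from dataclasses import dataclass, field


class MemoryPool:
    """Trivial free-list allocator standing in for a real KV/state slot pool."""

    def __init__(self, capacity: int):
        self.free_slots = list(range(capacity))

    def alloc(self, n: int) -> list[int]:
        slots, self.free_slots = self.free_slots[:n], self.free_slots[n:]
        return slots

    def free(self, slots: list[int]) -> None:
        self.free_slots.extend(slots)


@dataclass
class HybridCacheEntry:
    """Per-prefix handles into every pool that must stay coherent: raw SWA KV,
    C4 and C128 compressed KV, and the two in-progress compression states."""
    swa_kv: list[int]
    c4_kv: list[int]
    c128_kv: list[int]
    c4_state: list[int]
    c128_state: list[int]


@dataclass
class PrefixNode:
    """Radix-tree-style node keyed by a token span. Reusing the prefix reuses
    the handles in all pools; evicting it frees them all, so no pool can drift
    out of sync with the others."""
    tokens: tuple[int, ...]
    entry: HybridCacheEntry
    children: dict[int, "PrefixNode"] = field(default_factory=dict)

    def evict(self, pools: dict[str, MemoryPool]) -> None:
        # Free all five pools together; freeing only some of them would leave
        # a prefix that looks cached but cannot actually be served.
        for name, slots in vars(self.entry).items():
            pools[name].free(slots)


# Tiny usage example: allocate a prefix across all pools, then evict it.
pools = {name: MemoryPool(capacity=1024)
         for name in ("swa_kv", "c4_kv", "c128_kv", "c4_state", "c128_state")}
entry = HybridCacheEntry(**{name: pool.alloc(4) for name, pool in pools.items()})
node = PrefixNode(tokens=(1, 2, 3, 4), entry=entry)
node.evict(pools)  # all five pools get their slots back in one step
```

The only point of the sketch is that a cached prefix is reusable exactly when all five pools still hold its slots, which is the invariant ShadowRadix has to maintain across prefill, decode, and speculative decoding.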
This excerpt is published under fair use for community discussion. Read the full article at Lmsys.