
DeepSeek-V4 on Day 0: From Fast Inference to Verified RL with SGLang and Miles

The SGLang and Miles Team · 15 min read
#deepseek-v4 · #sglang · #miles · #speculative decoding · #hybrid attention
⚡ TL;DR · AI summary

SGLang and Miles have launched Day-0 support for DeepSeek-V4, enabling fast inference and reinforcement learning training for its 1.6T Pro and 284B Flash models. The system leverages novel technologies like ShadowRadix for hybrid attention caching, HiSparse for CPU-extended KV offloading, and speculative decoding with in-graph metadata. Optimizations include Flash Compressor, Lightning TopK, and kernel integrations for Blackwell and Hopper GPUs. Full parallelism and FP8 training are supported in RL workflows via the Miles framework.

Original article: Lmsys · The SGLang and Miles Team
Full article excerpt

Contents
- TL;DR
- Model Key Features & New Capabilities
- Designs, Features and Performance Optimizations
- ShadowRadix: Native Prefix Caching for Hybrid Attention
- Speculative Decoding
- HiSparse: Turbocharging Sparse Attention with Hierarchical Memory
- Fast Kernel Integrations
- Various Kernel Optimizations
- Flash Compressor: IO-aware Exact Compression
- Lightning TopK
- Parallelism and Deployment
- Hierarchical Multi-Stream Overlap
- Reinforcement Learning -- Miles Support
- Training backend: DeepSeek-V4 Modeling in Megatron-LM
- Parallelism: DP/TP/SP/EP/PP/CP supported
- Kernels
- RL Features
- Numerical Precision on a Mixed-Precision Stack
- Training Result
- Benchmark Notes
- Roadmap: The Path Ahead
- Acknowledgement
- Citation

The SGLang and Miles Team · April 25, 2026

We are thrilled to announce Day-0 support for DeepSeek-V4 across both inference and RL training. SGLang and Miles form the first open-source stack to serve and train DeepSeek-V4 on launch day, with systems purpose-built for its hybrid sparse-attention architecture, manifold-constrained hyper-connections (mHC), and FP4 expert weights.

Figure 1. Decode throughput of SGLang vs. the other OSS engine on a 30K-token prompt truncated from "Dream of the Red Chamber". We tried the best-effort speculative-decoding configuration for each engine based on its official recipe. See benchmark notes for details.

TL;DR

SGLang and Miles ship Day-0 inference and RL for DeepSeek-V4 (1.6T Pro, 284B Flash).
- Inference (caching & attention): ShadowRadix prefix cache, HiSparse CPU-extended KV, MTP speculative decoding with in-graph metadata, Flash Compressor, Lightning TopK, hierarchical multi-stream overlap.
- Inference (kernels & deployment): fast kernel integrations (FlashMLA, FlashInfer TRTLLM-Gen MoE, DeepGEMM Mega MoE, TileLang mHC), DP/TP/CP attention, EP MoE on DeepEP, PD disaggregation.
- RL training: full parallelism (DP/TP/SP/EP/PP/CP), TileLang attention, enhanced stability, FP8 training.
- Hardware: Hopper, Blackwell, Grace Blackwell, AMD, NPU.
- Launch commands: SGLang Cookbook.

Model Key Features & New Capabilities

DeepSeek-V4 (1.6T Pro, 284B Flash) extends its predecessor DeepSeek-V3.2 along three axes:
- Hybrid sparse attention: each layer mixes sliding window attention with one of two compression mechanisms (4:1 top-k or 128:1 dense), keeping the 1M-token context window tractable.
- mHC (Manifold-Constrained Hyper-Connections): a generalization of standard residual connections that improves gradient flow and representation quality.
- FP4 expert weights: native FP4 MoE experts for efficient serving on the latest Blackwell hardware.

Designs, Features and Performance Optimizations

ShadowRadix: Native Prefix Caching for Hybrid Attention

Every layer of DeepSeek-V4 combines SWA (sliding window attention over the last 128 raw tokens) with either C4 (top-512 sparse attention over 4:1-compressed KV) or C128 (dense attention over 128:1-compressed KV). In addition, to maintain the in-flight compressed KV slots, each compression layer keeps a state pool that stores its in-progress compression state. This mechanism breaks the assumptions of traditional prefix caching: three heterogeneous KV pools and two compression-state pools must stay coherent across prefill, decode, and speculative decoding. The following figure shows the per-layer hybrid attention scope for an N = 1024 example. To solve this coherence problem, we introduce ShadowRadix -- a native prefix caching mechanism for hybrid attention. One…
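As a rough illustration of the coherence problem described above, here is a hypothetical Python sketch, not SGLang's actual ShadowRadix code: using the window size and compression ratios quoted in the excerpt, it shows how a single cached prefix fans out across three heterogeneous KV pools and two compression-state pools. The names HybridPrefixEntry and reusable_slots are invented for this sketch.

```python
# Hypothetical illustration only; NOT SGLang's ShadowRadix implementation.
from dataclasses import dataclass, field

SWA_WINDOW = 128   # sliding window attention covers the last 128 raw tokens
C4_RATIO = 4       # 4:1 compressed KV, queried with top-512 sparse attention
C128_RATIO = 128   # 128:1 compressed KV, queried densely


@dataclass
class HybridPrefixEntry:
    """State a hybrid-attention prefix cache must keep coherent per cached prefix."""
    prefix_len: int                                           # raw tokens covered
    swa_block_ids: list[int] = field(default_factory=list)    # raw-KV pool (SWA)
    c4_block_ids: list[int] = field(default_factory=list)     # 4:1 compressed-KV pool
    c128_block_ids: list[int] = field(default_factory=list)   # 128:1 compressed-KV pool
    c4_state_id: int | None = None    # in-progress compression state, C4 layers
    c128_state_id: int | None = None  # in-progress compression state, C128 layers


def reusable_slots(prefix_len: int) -> dict[str, int]:
    """Estimate how much of each pool a cached prefix of `prefix_len` tokens reuses.

    Raw SWA KV only matters for the trailing window, the compressed pools cover
    the part of the prefix that divides evenly by their ratio, and the remainder
    must be recovered from the per-layer compression-state pools.
    """
    return {
        "swa_raw_tokens": min(prefix_len, SWA_WINDOW),
        "c4_compressed_slots": prefix_len // C4_RATIO,
        "c4_pending_raw_tokens": prefix_len % C4_RATIO,       # held in the C4 state pool
        "c128_compressed_slots": prefix_len // C128_RATIO,
        "c128_pending_raw_tokens": prefix_len % C128_RATIO,   # held in the C128 state pool
    }


if __name__ == "__main__":
    # The N = 1024 example mentioned above.
    print(reusable_slots(1024))
```

In this toy accounting, a 1024-token prefix yields 128 raw SWA tokens, 256 C4 slots, 8 C128 slots, and no pending compression state, and it is exactly this multi-pool state that must remain consistent across prefill, decode, and speculative decoding.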

This excerpt is published under fair use for community discussion. Read the full article at Lmsys.


