Disaggregated Serving for Hybrid SSM Models in vLLM
Hybrid models that interleave Mamba-style state-space model (SSM) layers with full-attention (FA) layers, such as NVIDIA Nemotron-H, are gaining traction because they pair the linear-time efficiency of SSMs with the expressiveness of attention. vLLM now supports disaggregated prefill/decode serving for these models by extending its NIXL-based KV connector to handle the fundamentally different state formats of FA and SSM layers. The design introduces dual descriptor views, bridging between physical and logical block sizes, and a 3-descriptor conv state transfer, all without changing the existing workflow for standard transformers.
- Hybrid SSM-FA models like Nemotron-H combine the linear-time efficiency of state-space models with the expressiveness of attention mechanisms.
- The NIXL KV connector in vLLM was extended to support disaggregated serving by managing different state layouts and sizes for FA and SSM layers.
- Key innovations include dual descriptor views, 3-descriptor conv state transfer, and support for heterogeneous tensor parallelism without data reshuffling on the sender side.
- This functionality is available in vLLM version 0.20.0 and later, requiring no changes to the existing workflow for standard transformer models.
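
Since the summary says the existing workflow is unchanged, a launch for a hybrid model would presumably look like the usual NIXL-based setup. The sketch below is only an illustration under that assumption: it builds the `vllm serve` command lines for one prefill and one decode instance using the `--kv-transfer-config` flag with the `NixlConnector`. The model id and port numbers are placeholders chosen for the example, not values taken from the article, and any flags specific to hybrid models are not shown because the summary does not name them.

```python
import json
import shlex

# Assumption: placeholder model id; substitute the hybrid checkpoint you serve.
MODEL = "nvidia/Nemotron-H-8B-Base-8K"


def build_serve_command(model: str, port: int) -> list[str]:
    """Build a `vllm serve` argv that enables the NIXL KV connector."""
    # "kv_both" lets the same instance act as KV producer or consumer.
    kv_transfer_config = {
        "kv_connector": "NixlConnector",
        "kv_role": "kv_both",
    }
    return [
        "vllm", "serve", model,
        "--port", str(port),
        "--kv-transfer-config", json.dumps(kv_transfer_config),
    ]


# Assumption: ports 8100/8200 are arbitrary example values.
prefill_cmd = build_serve_command(MODEL, port=8100)
decode_cmd = build_serve_command(MODEL, port=8200)

if __name__ == "__main__":
    # Print shell-quoted commands that could be pasted into two terminals.
    print(shlex.join(prefill_cmd))
    print(shlex.join(decode_cmd))
```

A real deployment also needs per-instance settings that are omitted here, such as GPU assignment and a router in front of the two instances that sends each request to prefill first and then to decode.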
April 21, 2026 · 15 min read
By Nicolò Lucchesi, Zhanqiu Hu (Red Hat), and the vLLM team
Tags: #disaggregation, #mamba

Table of Contents
- Introduction
- Background: The NIXL KV Transfer Workflow
- The Challenge: FA and SSM State Are Fundamentally Different
- The HMA Shared-Tensor Layout
- Dual Descriptor Views
- Physical vs. Logical Block Sizes
- The 3-Descriptors Conv Transfer
- The DS Layout Solution
- Zero-Overhead: No Extra Buffers, No Permutation
- Putting It Together: Nemotron-H Example
- Performance
- Getting Started
- Limitations and Future Work
- Acknowledgments

Introduction
Hybrid architectures that interleave Mamba-style SSM layers with standard full-attention (FA) layers, such as NVIDIA Nemotron-H, are gaining traction as a way to combine the linear-time efficiency of state-space…