Disaggregated Serving for Hybrid SSM Models in vLLM

vLLM Team · Apr 28, 2026 · 8:33 PM UTC · 13 min read

#machine learning #model serving #state-space models #transformer models #distributed systems

⚡ TL;DR · AI summary

Hybrid models that interleave Mamba-style SSM layers with full-attention (FA) layers, such as NVIDIA Nemotron-H, are increasingly used because they combine the linear-time efficiency of state-space models with the expressiveness of attention. vLLM now supports disaggregated prefill/decode serving for these models by extending its NIXL-based KV connector to handle the two fundamentally different state formats. The solution introduces dual descriptor views, physical/logical block bridging, and a 3-descriptor conv transfer, and it leaves the existing workflow for standard transformers unchanged.
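To make the "fundamentally different state formats" concrete, here is a minimal sketch comparing how much per-layer state a prefill worker would have to hand to a decode worker. It relies only on the general shape of the two layer types: an FA layer caches keys and values for every token, while a Mamba-style layer keeps a fixed-size convolution window plus a recurrent SSM state. The class names, field names, and dimensions are illustrative assumptions, not Nemotron-H's configuration and not vLLM's internal layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FullAttentionLayer:
    """Hypothetical FA layer description (names are illustrative)."""
    num_kv_heads: int
    head_dim: int
    dtype_bytes: int = 2  # assumes a 16-bit cache dtype

    def state_bytes(self, num_tokens: int) -> int:
        # One key and one value vector per token per KV head:
        # the cache grows linearly with sequence length.
        return 2 * num_tokens * self.num_kv_heads * self.head_dim * self.dtype_bytes


@dataclass(frozen=True)
class MambaLayer:
    """Hypothetical Mamba-style layer description (names are illustrative)."""
    conv_dim: int     # channels passing through the short causal convolution
    conv_kernel: int  # kernel width; the last (kernel - 1) inputs are retained
    num_heads: int
    head_dim: int
    state_dim: int
    dtype_bytes: int = 2  # assumption; real deployments may keep SSM state in fp32

    def state_bytes(self, num_tokens: int) -> int:
        # Both pieces are fixed-size: num_tokens does not appear in the result.
        conv_state = self.conv_dim * (self.conv_kernel - 1)
        ssm_state = self.num_heads * self.head_dim * self.state_dim
        return (conv_state + ssm_state) * self.dtype_bytes


fa = FullAttentionLayer(num_kv_heads=8, head_dim=128)
mamba = MambaLayer(conv_dim=1024, conv_kernel=4, num_heads=16, head_dim=64, state_dim=128)

for n in (1_000, 2_000):
    print(n, fa.state_bytes(n), mamba.state_bytes(n))
```

Doubling the prompt length doubles the FA figure and leaves the Mamba figure unchanged, which is one reason a transfer path built around per-token KV blocks cannot be reused as-is for SSM state.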

Key facts

▪ Hybrid SSM-FA models like Nemotron-H combine the linear-time efficiency of state-space models with the expressiveness of attention mechanisms.
▪ The NIXL KV connector in vLLM was extended to support disaggregated serving by managing different state layouts and sizes for FA and SSM layers.
▪ Key innovations include dual descriptor views, 3-descriptor conv state transfer, and support for heterogeneous tensor parallelism without data reshuffling on the sender side.
▪ This functionality is available in vLLM version 0.20.0 and later, requiring no changes to the existing workflow for standard transformer models.
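The version requirement in the last key fact can be checked programmatically before a deployment script enables hybrid-model disaggregation. The sketch below uses only the Python standard library; the helper names (`parse_release`, `supports_hybrid_disagg`, `installed_vllm_supports_it`) are hypothetical and not part of vLLM. It compares only the numeric release segment, so a pre-release such as `0.20.0rc1` is treated as `0.20.0`.

```python
import re
from importlib import metadata

# Minimum version taken from the summary: "vLLM version 0.20.0 and later".
MIN_VERSION = (0, 20, 0)


def parse_release(version: str) -> tuple[int, ...]:
    """Return the leading numeric release segment, padded to three parts.

    Examples: "0.20.1+cu128" -> (0, 20, 1), "0.20" -> (0, 20, 0).
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = tuple(int(p) for p in match.group().split("."))
    # Pad short versions so tuple comparison behaves as expected.
    return parts + (0,) * (3 - len(parts))


def supports_hybrid_disagg(version: str) -> bool:
    """True if the given version string meets the stated minimum."""
    return parse_release(version) >= MIN_VERSION


def installed_vllm_supports_it() -> bool:
    """Check the locally installed vLLM package, if any."""
    try:
        return supports_hybrid_disagg(metadata.version("vllm"))
    except metadata.PackageNotFoundError:
        return False
```

When the `packaging` library is available, `packaging.version.Version` is a more robust choice because it orders pre-releases and local version tags correctly.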

Original article

Vercel · vLLM Team

Opening excerpt (first ~120 words)

April 21, 2026 · 15 min read · Nicolò Lucchesi, Zhanqiu Hu (Red Hat), and the vLLM team

#disaggregation #mamba

Table of Contents

▪ Introduction
▪ Background: The NIXL KV Transfer Workflow
▪ The Challenge: FA and SSM State Are Fundamentally Different
▪ The HMA Shared-Tensor Layout
▪ Dual Descriptor Views
▪ Physical vs. Logical Block Sizes
▪ The 3-Descriptors Conv Transfer
▪ The DS Layout Solution
▪ Zero-Overhead: No Extra Buffers, No Permutation
▪ Putting It Together: Nemotron-H Example
▪ Performance
▪ Getting Started
▪ Limitations and Future Work
▪ Acknowledgments

Introduction

Hybrid architectures that interleave Mamba-style SSM layers with standard full-attention (FA) layers, such as NVIDIA Nemotron-H, are gaining traction as a way to combine the linear-time efficiency of state-space…

Excerpt limited to ~120 words for fair-use compliance. The full article is at Vercel.
