
Disaggregated Serving for Hybrid SSM Models in vLLM

vLLM Team · 13 min read

Original article: vLLM Team, hosted on Vercel

Opening excerpt (first ~120 words)

Disaggregated Serving for Hybrid SSM Models in vLLM

April 21, 2026 · 15 min read

Nicolò Lucchesi, Zhanqiu Hu (Red Hat), and the vLLM team

Tags: #disaggregation #mamba

Table of Contents:
- Introduction
- Background: The NIXL KV Transfer Workflow
- The Challenge: FA and SSM State Are Fundamentally Different
- The HMA Shared-Tensor Layout
- Dual Descriptor Views
- Physical vs. Logical Block Sizes
- The 3-Descriptors Conv Transfer
- The DS Layout Solution
- Zero-Overhead: No Extra Buffers, No Permutation
- Putting It Together: Nemotron-H Example
- Performance
- Getting Started
- Limitations and Future Work
- Acknowledgments

Introduction: Hybrid architectures that interleave Mamba-style SSM layers with standard full-attention (FA) layers — such as NVIDIA Nemotron-H — are gaining traction as a way to combine the linear-time efficiency of state-space…

Excerpt limited to ~120 words for fair-use compliance. The full article is at Vercel.
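The excerpt describes hybrid models that interleave Mamba-style SSM layers with occasional full-attention (FA) layers. As an illustration only (this is not the actual Nemotron-H layer schedule, and `hybrid_layer_pattern` and `attn_every` are hypothetical names), a minimal sketch of such an interleaving might look like:

```python
# Hypothetical sketch of a hybrid SSM/attention layer schedule.
# Not the real Nemotron-H config: the period and pattern here are
# illustrative assumptions, not taken from the article.
def hybrid_layer_pattern(num_layers: int, attn_every: int = 4) -> list[str]:
    """Return a layer-type list where every `attn_every`-th layer is
    full attention ("FA") and the rest are SSM ("SSM") layers."""
    return [
        "FA" if (i + 1) % attn_every == 0 else "SSM"
        for i in range(num_layers)
    ]

print(hybrid_layer_pattern(8))
# → ['SSM', 'SSM', 'SSM', 'FA', 'SSM', 'SSM', 'SSM', 'FA']
```

Because most layers are SSM (constant-size recurrent state) and only a few are FA (KV cache growing with sequence length), such models inherit near-linear-time decoding while keeping some exact-attention capacity, which is the trade-off the article's intro points to.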
