SSV: Sparse Speculative Verification for Efficient LLM Inference
The paper presents SSV, a framework for making long-context LLM inference more efficient. SSV integrates speculative decoding with dynamic sparse attention and addresses the structural mismatches that limit performance when the two are combined. Experiments on NVIDIA H100 GPUs show higher end-to-end throughput and kernel-level speedups.
- SSV combines overlap-aware grouped-query execution and profile-guided prompt-adaptive orchestration.
- The framework improves cross-query reuse and reduces selected-index and branch-fusion overheads.
- Experiments show SSV achieves up to 3.49x end-to-end throughput over autoregressive NSA decoding.
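
The cross-query reuse mentioned above can be illustrated with a small Python sketch. It is not the paper's implementation: the function name `kv_block_loads`, the block ids, and the assumption that each verifier query selects a set of KV-cache blocks are hypothetical and exist only for illustration. The sketch counts how many blocks are loaded when each query is served independently versus when queries with overlapping selections are grouped so that shared blocks are loaded once.

```python
from typing import Dict, List, Set


def kv_block_loads(selected: List[Set[int]]) -> Dict[str, float]:
    """Compare per-query and grouped KV-block loads for one verification step.

    `selected[i]` is the set of KV-cache block ids that verifier query i
    attends to under sparse attention (hypothetical input, for illustration).
    """
    # Independent execution: every query loads all of its selected blocks.
    independent = sum(len(blocks) for blocks in selected)
    # Grouped execution: a block shared by several queries is loaded once.
    grouped = len(set().union(*selected))
    reuse = independent / grouped if grouped else 0.0
    return {"independent": independent, "grouped": grouped, "reuse_factor": reuse}


# Four draft-token queries whose selections largely overlap (made-up ids).
queries = [{0, 1, 2, 7}, {0, 1, 2, 8}, {0, 1, 3, 8}, {0, 1, 3, 9}]
stats = kv_block_loads(queries)
print(stats)
```

The more the selections overlap, the larger `reuse_factor` becomes. When selections are disjoint it falls to 1.0 and grouping saves nothing, which is one plausible reading of why an overlap-aware strategy is needed.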
Opening excerpt (first ~120 words)
arXiv:2605.19893 (Computer Science > Operating Systems)
Submitted on 19 May 2026 (v1); last revised 20 May 2026 (v2)
Title: SSV: Sparse Speculative Verification for Efficient LLM Inference
Authors: Zhibin Wang, Ziyu Zhong, Nuo Shen, Yuhang Zhou, Rong Gu, Sheng Zhong
Abstract: Speculative decoding and dynamic sparse attention are two complementary approaches for accelerating long-context LLM inference: the former amortizes target-model execution across multiple verifier queries, while the latter reduces each query's KV-cache working set.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv.org.