SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference

May 19, 2026 · 1:23 AM UTC

#technology #artificial intelligence #machine learning

via Github

⚡ TL;DR · AI summary

SuperInfer is an LLM inference system designed to stay responsive while meeting stringent latency Service Level Objectives (SLOs). It pairs a proactive rotary scheduler, RotaSched, with an optimized rotation engine, and targets Superchips such as the NVIDIA GH200. Evaluations report a 74.7% improvement in TTFT SLO attainment rates over state-of-the-art systems, with throughput comparable to existing systems.

Key facts

▪ SuperInfer addresses the tension between latency SLOs and limited GPU memory capacity.
▪ The system includes RotaSched, a proactive scheduler that maintains responsiveness on Superchips.
▪ Evaluations show a 74.7% improvement in TTFT SLO attainment rates compared to state-of-the-art systems.
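
This summary does not define "SLO attainment rate". A common reading is the fraction of requests whose latency stays within the target, and the sketch below computes it under that assumption for both TTFT and TBT. The `RequestTrace` layout, the function names, and the thresholds are hypothetical and are not taken from SuperInfer; the sketch only illustrates the metric and does not reproduce the reported 74.7% figure.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Timestamps in seconds for one served request (hypothetical layout)."""
    arrival: float
    token_times: list[float]  # time at which each output token was emitted


def ttft(trace: RequestTrace) -> float:
    """Time-To-First-Token: delay from arrival to the first output token."""
    return trace.token_times[0] - trace.arrival


def max_tbt(trace: RequestTrace) -> float:
    """Largest gap between consecutive tokens (0.0 if only one token)."""
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return max(gaps, default=0.0)


def slo_attainment(traces: list[RequestTrace],
                   ttft_slo: float,
                   tbt_slo: float) -> tuple[float, float]:
    """Return (TTFT attainment, TBT attainment) as fractions in [0, 1].

    Assumption: a request attains the TBT SLO only if every
    inter-token gap is within the target.
    """
    if not traces:
        raise ValueError("need at least one request trace")
    ttft_ok = sum(ttft(t) <= ttft_slo for t in traces)
    tbt_ok = sum(max_tbt(t) <= tbt_slo for t in traces)
    return ttft_ok / len(traces), tbt_ok / len(traces)


# Illustrative traces, not measured data.
traces = [
    RequestTrace(arrival=0.0, token_times=[0.5, 0.75, 1.0]),   # fast start, steady
    RequestTrace(arrival=1.0, token_times=[3.0, 3.25]),        # slow first token
    RequestTrace(arrival=2.0, token_times=[2.5, 2.75, 4.75]),  # one long stall
    RequestTrace(arrival=3.0, token_times=[6.0, 8.0]),         # misses both
]
ttft_rate, tbt_rate = slo_attainment(traces, ttft_slo=1.0, tbt_slo=0.5)
print(f"TTFT attainment: {ttft_rate:.0%}, TBT attainment: {tbt_rate:.0%}")
```

Under this definition, a request blocked behind others while the KV cache is full counts against TTFT attainment even if its tokens stream quickly once it starts.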

Original article

Opening excerpt (first ~120 words)

Large Language Model (LLM) serving faces a fundamental tension between stringent latency Service Level Objectives (SLOs) and limited GPU memory capacity. When high request rates exhaust the KV cache budget, existing LLM inference systems often suffer severe head-of-line (HOL) blocking. While prior work explored PCIe-based offloading, these approaches cannot sustain responsiveness under high request rates, often failing to meet tight Time-To-First-Token (TTFT) and Time-Between-Tokens (TBT) SLOs. To address these issues, we present SuperInfer, a high-performance LLM inference system designed for emerging Superchips (e.g., NVIDIA GH200) with tightly coupled GPU-CPU architecture via NVLink-C2C.

…

Excerpt limited to ~120 words for fair-use compliance. The full article is at Github.
