SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference
SuperInfer is an LLM inference system designed to stay responsive while meeting stringent latency Service Level Objectives (SLOs). It pairs a proactive rotary scheduler, RotaSched, with an optimized rotation engine targeting Superchips such as the NVIDIA GH200. Evaluations indicate that SuperInfer improves Time-To-First-Token (TTFT) SLO attainment while maintaining throughput comparable to existing systems.
- SuperInfer addresses the tension between latency SLOs and limited GPU memory capacity.
- The system includes RotaSched, a proactive scheduler that maintains responsiveness on Superchips.
- Evaluations show a 74.7% improvement in TTFT SLO attainment rates compared to state-of-the-art systems.
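
To make the "SLO attainment rate" metric concrete, the following is a minimal sketch of how such a rate could be computed from per-request token timestamps. It is not taken from SuperInfer: the `RequestTrace` type, the function names, and the choice to judge Time-Between-Tokens (TBT) by the worst gap in each request are all assumptions made for illustration, and the evaluation summarized here may define attainment differently (for example, with a percentile).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RequestTrace:
    """Hypothetical per-request record; not part of SuperInfer."""
    arrival: float            # time the request arrived (seconds)
    token_times: List[float]  # emission time of each output token (seconds)


def ttft(trace: RequestTrace) -> float:
    """Time-To-First-Token: delay from arrival to the first output token."""
    return trace.token_times[0] - trace.arrival


def max_tbt(trace: RequestTrace) -> float:
    """Largest gap between consecutive tokens (0.0 if only one token)."""
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return max(gaps, default=0.0)


def slo_attainment(
    traces: List[RequestTrace], ttft_slo: float, tbt_slo: float
) -> Tuple[float, float]:
    """Fraction of requests meeting the TTFT SLO and the TBT SLO.

    Assumption: a request meets the TBT SLO only if its worst
    inter-token gap is within the bound.
    """
    n = len(traces)
    ttft_ok = sum(ttft(t) <= ttft_slo for t in traces)
    tbt_ok = sum(max_tbt(t) <= tbt_slo for t in traces)
    return ttft_ok / n, tbt_ok / n
```

Under this sketch, a request that waits behind others for KV cache space (head-of-line blocking) shows up as a large `ttft` value and lowers the TTFT attainment rate, even if its tokens then stream at a steady pace.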
Opening excerpt (first ~120 words)
Large Language Model (LLM) serving faces a fundamental tension between stringent latency Service Level Objectives (SLOs) and limited GPU memory capacity. When high request rates exhaust the KV cache budget, existing LLM inference systems often suffer severe head-of-line (HOL) blocking. While prior work explored PCIe-based offloading, these approaches cannot sustain responsiveness under high request rates, often failing to meet tight Time-To-First-Token (TTFT) and Time-Between-Tokens (TBT) SLOs. To address these issues, we present SuperInfer, a high-performance LLM inference system designed for emerging Superchips (e.g., NVIDIA GH200) with tightly coupled GPU-CPU architecture via NVLink-C2C.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at GitHub.