WeSearch

SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference

⚡ TL;DR · AI summary

SuperInfer is an LLM inference system for Superchips such as the NVIDIA GH200, designed to stay responsive while meeting stringent latency Service Level Objectives (SLOs). It combines a proactive rotary scheduler with an optimized rotation engine. Evaluations reportedly show significantly improved latency metrics with throughput comparable to existing systems.

Opening excerpt (first ~120 words)

Large Language Model (LLM) serving faces a fundamental tension between stringent latency Service Level Objectives (SLOs) and limited GPU memory capacity. When high request rates exhaust the KV cache budget, existing LLM inference systems often suffer severe head-of-line (HOL) blocking. Prior work has explored PCIe-based offloading, but these approaches cannot sustain responsiveness under high request rates and often fail to meet tight Time-To-First-Token (TTFT) and Time-Between-Tokens (TBT) SLOs. To address these issues, we present SuperInfer, a high-performance LLM inference system designed for emerging Superchips (e.g., NVIDIA GH200), which tightly couple the GPU and CPU via NVLink-C2C.
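
To make the two SLO metrics named in the excerpt concrete, here is a minimal sketch of how TTFT and TBT could be computed from per-token timestamps. It is not taken from SuperInfer: the `RequestTrace` type, the function names, and the SLO thresholds are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical record of one request's timing (all times in seconds)."""
    arrival: float            # when the request entered the system
    token_times: List[float]  # when each output token was emitted


def ttft(trace: RequestTrace) -> float:
    """Time-To-First-Token: delay from arrival to the first output token."""
    return trace.token_times[0] - trace.arrival


def tbt(trace: RequestTrace) -> List[float]:
    """Time-Between-Tokens: gap between each pair of consecutive tokens."""
    t = trace.token_times
    return [later - earlier for earlier, later in zip(t, t[1:])]


def meets_slo(trace: RequestTrace, ttft_slo: float, tbt_slo: float) -> bool:
    """True if the first token and every inter-token gap are within budget."""
    return ttft(trace) <= ttft_slo and all(gap <= tbt_slo for gap in tbt(trace))


# Example: the last token is delayed, e.g. because the request was stalled.
trace = RequestTrace(arrival=0.0, token_times=[0.5, 0.75, 1.0, 2.0])
first_token_delay = ttft(trace)
gaps = tbt(trace)
ok_tight = meets_slo(trace, ttft_slo=1.0, tbt_slo=0.5)
ok_loose = meets_slo(trace, ttft_slo=1.0, tbt_slo=1.0)
```

In this sketch a single long gap is enough to violate the TBT SLO, which is why a request that stalls mid-generation counts as a miss even when its first token arrived quickly.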

Excerpt limited to ~120 words for fair-use compliance. The full article is at Github.
