Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism

May 19, 2026 · 1:24 AM UTC · 10 min read

#technology #artificial intelligence #machine learning

via

MLSys @ WukLab - Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism

TL;DR · WeSearch summary

Nitsum is a new serving system that uses adaptive tensor parallelism to handle tiered LLM requests. It dynamically reconfigures GPU resources to meet the differing latency requirements of different workloads, improving goodput by up to 5.3 times over existing systems.
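Goodput is commonly understood as the rate of requests that finish within their latency target, so a tiered workload is scored against a different target per tier. The following is a minimal Python sketch of that counting, not Nitsum's own metric code. The tier names, SLO values, `Request` type, and `goodput` function are all hypothetical, since the excerpt does not give them.

```python
from dataclasses import dataclass


@dataclass
class Request:
    tier: str          # hypothetical tier label
    latency_s: float   # observed end-to-end latency in seconds


# Hypothetical per-tier latency SLOs in seconds (not taken from the article).
SLO_S = {"interactive": 1.0, "background": 30.0}


def goodput(requests, window_s):
    """Requests per second that finished within their tier's SLO."""
    met = sum(1 for r in requests if r.latency_s <= SLO_S[r.tier])
    return met / window_s


requests = [
    Request("interactive", 0.4),
    Request("interactive", 2.5),   # misses the 1.0 s SLO
    Request("background", 12.0),
    Request("background", 45.0),   # misses the 30.0 s SLO
]

# Two of the four requests meet their SLO in a 2-second window.
print(goodput(requests, window_s=2.0))  # prints 1.0
```

Under this definition, a request that completes but misses its tier's target contributes nothing, which is why raw throughput and goodput can diverge on mixed workloads.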

Key facts

▪ Nitsum treats tensor parallelism as a runtime control surface rather than a fixed deployment choice.
▪ The system improves service-level objective compliance by dynamically adjusting GPU configurations based on workload changes.
▪ Nitsum can serve a mix of latency-critical and relaxed background jobs under a fixed GPU budget.

Original article

Read full at MLSys @ WukLab - Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism →

Opening excerpt (first ~120 words)

Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism

May 16, 2026 - 12 mins read

LLM, Serving, Tensor Parallelism

Authors: Vikranth Srivatsa, Zijian He, Pu Guo, Dongming Li, and Yiying Zhang

TLDR: A single LLM deployment now serves everything from latency-critical chat to relaxed background jobs under a fixed GPU budget, creating a tiered-SLO serving problem. We designed Nitsum [arXiv ’26], the first serving system that treats tensor parallelism (TP) as a runtime control surface instead of a fixed deployment choice.

…

Excerpt limited to ~120 words for fair-use compliance. The full article is at MLSys @ WukLab - Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism.
