Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism
Nitsum is a serving system for tiered LLM requests that uses adaptive tensor parallelism. It dynamically reconfigures GPU resources to meet the differing latency requirements of different workloads, improving goodput by up to 5.3× over existing systems.
- Nitsum treats tensor parallelism as a runtime control surface rather than a fixed deployment choice.
- The system improves service-level objective compliance by dynamically adjusting GPU configurations based on workload changes.
- Nitsum can serve a mix of latency-critical and relaxed background jobs under a fixed GPU budget.
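To make the idea of tensor parallelism (TP) as a runtime knob concrete, here is a minimal Python sketch of re-planning TP degrees under a fixed GPU budget when the workload shifts. It is not Nitsum's algorithm. The latency model, the allowed TP degrees, the `Tier` fields, and the greedy replica rule are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

TP_DEGREES = (1, 2, 4, 8)  # assumed set of allowed tensor-parallel degrees


@dataclass(frozen=True)
class Tier:
    name: str
    latency_slo_ms: float  # hypothetical per-token latency target
    load: float            # hypothetical demand (requests per second)


def token_latency_ms(tp: int, base_ms: float = 40.0, comm_ms: float = 2.0) -> float:
    """Toy model: compute time shrinks with TP, communication cost grows."""
    return base_ms / tp + comm_ms * (tp - 1)


def min_tp_for(slo_ms: float):
    """Smallest TP degree whose modeled latency meets the SLO, or None."""
    for tp in TP_DEGREES:
        if token_latency_ms(tp) <= slo_ms:
            return tp
    return None


def plan(tiers, gpu_budget: int):
    """Give each tier one instance at its minimal TP degree, then spend
    leftover GPUs on extra replicas for the tier with the most load per replica."""
    alloc = {}
    remaining = gpu_budget
    for tier in sorted(tiers, key=lambda t: t.latency_slo_ms):
        tp = min_tp_for(tier.latency_slo_ms)
        if tp is None or tp > remaining:
            raise ValueError(f"cannot satisfy tier {tier.name!r} within budget")
        alloc[tier.name] = {"tp": tp, "replicas": 1}
        remaining -= tp

    by_name = {t.name: t for t in tiers}
    while True:
        # Only tiers whose instance size still fits in the leftover GPUs.
        fits = [n for n, a in alloc.items() if a["tp"] <= remaining]
        if not fits:
            break
        name = max(fits, key=lambda n: by_name[n].load / alloc[n]["replicas"])
        alloc[name]["replicas"] += 1
        remaining -= alloc[name]["tp"]
    return alloc


# Same 8-GPU budget, two different workload mixes.
before = plan([Tier("chat", 18.0, 4.0), Tier("batch", 50.0, 10.0)], gpu_budget=8)
after = plan([Tier("chat", 25.0, 12.0), Tier("batch", 50.0, 2.0)], gpu_budget=8)
print(before)
print(after)
```

Under this toy model, the first mix ends up with one TP=4 instance for chat and four TP=1 instances for batch, while the second mix shifts to three TP=2 chat instances and two TP=1 batch instances. The budget stays at 8 GPUs in both cases; only the way it is partitioned changes. A real system would also have to account for the cost of switching configurations, which this sketch ignores.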
May 16, 2026 - 12 min read
Tags: LLM, Serving, Tensor Parallelism
Authors: Vikranth Srivatsa, Zijian He, Pu Guo, Dongming Li, and Yiying Zhang
TLDR: A single LLM deployment now serves everything from latency-critical chat to relaxed background jobs under a fixed GPU budget, creating a tiered-SLO serving problem. We designed Nitsum [arXiv '26], the first serving system that treats tensor parallelism (TP) as a runtime control surface instead of a fixed deployment choice.
…
The full article is at MLSys @ WukLab - Nitsum: Serving Tiered LLM Requests with Adaptive Tensor Parallelism.