PALS: Power-Aware LLM Serving for Mixture-of-Experts Models

#artificial intelligence #energy efficiency #machine learning
⚡ TL;DR · AI summary

The paper presents PALS, a power-aware runtime for serving Mixture-of-Experts large language models (LLMs) that treats GPU power as a controllable resource and optimizes it alongside software parameters. The system aims to improve energy efficiency while meeting throughput targets, without requiring model retraining. The reported results show energy-efficiency improvements of up to 26.3% and a significant reduction in quality-of-service (QoS) violations under power constraints.
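
To make the idea of tuning GPU power together with software parameters concrete, here is a minimal sketch of one way such a selection could look. It is not the PALS algorithm, which this summary does not describe. It assumes a hypothetical offline profile of candidate operating points, uses a GPU power cap as the power knob and batch size as the software parameter, and picks the most energy-efficient point that still meets a throughput target. The `Config` and `pick_config` names and all numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Config:
    """One candidate operating point (hypothetical, not from the paper)."""
    power_cap_w: float    # GPU power cap in watts (assumed power knob)
    batch_size: int       # serving batch size (assumed software parameter)
    tokens_per_s: float   # throughput observed at this operating point
    avg_power_w: float    # average GPU power draw at this operating point

    @property
    def tokens_per_joule(self) -> float:
        # Tokens per second divided by watts (joules per second) gives tokens per joule.
        return self.tokens_per_s / self.avg_power_w


def pick_config(candidates: Iterable[Config], min_tokens_per_s: float) -> Optional[Config]:
    """Return the most energy-efficient candidate that meets the throughput target."""
    feasible = [c for c in candidates if c.tokens_per_s >= min_tokens_per_s]
    if not feasible:
        return None  # no operating point satisfies the target
    return max(feasible, key=lambda c: c.tokens_per_joule)


# Made-up profile for illustration only; these are not measurements from the paper.
PROFILE = [
    Config(power_cap_w=400, batch_size=32, tokens_per_s=2000.0, avg_power_w=390.0),
    Config(power_cap_w=300, batch_size=32, tokens_per_s=1800.0, avg_power_w=295.0),
    Config(power_cap_w=300, batch_size=64, tokens_per_s=2100.0, avg_power_w=298.0),
    Config(power_cap_w=200, batch_size=64, tokens_per_s=1500.0, avg_power_w=199.0),
]

best = pick_config(PROFILE, min_tokens_per_s=1700.0)
if best is not None:
    print(best.power_cap_w, best.batch_size)
```

In this toy profile the 200 W point has the best tokens per joule but misses the 1700 tokens/s target, so the search falls back to the best feasible point. A real runtime would presumably measure and adjust these values online rather than read them from a fixed table.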

Original article
arXiv cs.AI
Opening excerpt (first ~120 words)

arXiv:2605.21427 (cs) [Submitted on 20 May 2026]

Title: PALS: Power-Aware LLM Serving for Mixture-of-Experts Models

Authors: Can Hankendi, Rana Shahout, Minlan Yu, Ayse K. Coskun

Abstract: Large language model (LLM) inference has become a dominant workload in modern data centers, driving significant GPU utilization and energy consumption. While prior systems optimize throughput and latency by batching, scheduling, and parallelism, they largely treat GPU power as a static constraint rather than a controllable resource.

Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.
