
Quant.npu: Enabling Efficient Mobile NPU Inference for on-device LLMs via Fully Static Quantization

⚡ TL;DR · AI summary

The paper presents Quant.npu, a framework for efficient mobile NPU inference of large language models through fully static quantization. It targets the limitations of existing post-training quantization methods, which rely on dynamic activation quantization. The proposed method achieves accuracy comparable to state-of-the-art techniques while significantly reducing inference latency.
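To make the distinction concrete, here is a minimal, generic sketch of static versus dynamic activation quantization. It is not the Quant.npu method, which this summary does not describe. The symmetric int8 scheme, the abs-max calibration rule, the sample values, and all function names are assumptions chosen for illustration.

```python
def quantize(values, scale):
    """Symmetric int8 quantization: round(v / scale), clamped to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def absmax_scale(values):
    """Scale that maps the largest magnitude in `values` onto 127."""
    return max(abs(v) for v in values) / 127.0


# Hypothetical activations collected offline during calibration.
calibration = [[0.5, -1.0, 2.0], [1.5, -4.0, 0.25]]

# Static: one scale is fixed before deployment and reused for every input.
static_scale = max(absmax_scale(batch) for batch in calibration)


def static_quantize(activations):
    # No statistics are computed from the input at run time.
    return quantize(activations, static_scale), static_scale


def dynamic_quantize(activations):
    # Dynamic: the scale is recomputed from each input at run time.
    scale = absmax_scale(activations)
    return quantize(activations, scale), scale


runtime_input = [0.1, -0.3, 0.4]
static_codes, s_scale = static_quantize(runtime_input)
dynamic_codes, d_scale = dynamic_quantize(runtime_input)

print("static :", static_codes, s_scale)
print("dynamic:", dynamic_codes, d_scale)
```

In this sketch the dynamic path adapts its scale to each input, so small activations use more of the int8 range, but it needs a reduction over the input on every call. The static path skips that run-time step and gives up resolution whenever an input is much smaller than the calibrated range. Closing that accuracy gap is the kind of problem a fully static scheme has to address.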

Original article: arXiv cs.AI
Opening excerpt (first ~120 words)

Computer Science > Machine Learning
arXiv:2605.20295 (cs) [Submitted on 19 May 2026]
Authors: Jinghe Zhang, Daliang Xu, Chenghua Wang, Weikai Xie, Tao Qi, Yun Ma, Mengwei Xu, Gang Huang
Abstract: Large language models (LLMs) are increasingly deployed on mobile devices, where Neural Processing Units (NPUs) necessitate fully static quantization for optimal inference efficiency.

Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.
