Quant.npu: Enabling Efficient Mobile NPU Inference for on-device LLMs via Fully Static Quantization

May 22, 2026 · 4:00 AM UTC · 3 min read

#machine learning #artificial intelligence #quantization

TL;DR · WeSearch summary

The paper presents a framework called Quant.npu for efficient mobile NPU inference of large language models through fully static quantization. It addresses the limitations of existing post-training quantization methods that rely on dynamic activation quantization. The proposed method achieves comparable accuracy to state-of-the-art techniques while significantly reducing inference latency.
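To make the static versus dynamic distinction concrete, here is a minimal sketch in plain Python. It is not the Quant.npu method: it assumes simple symmetric per-tensor int8 quantization, omits the learnable parameters and rotation matrices the paper describes, and every function name in it is hypothetical. The only point it illustrates is where the activation scale comes from: a static scale is fixed offline from calibration data, while a dynamic scale is recomputed from each incoming tensor at run time.

```python
def _qmax(num_bits):
    """Largest positive integer representable in a signed num_bits format."""
    return 2 ** (num_bits - 1) - 1


def calibrate_static_scale(calibration_batches, num_bits=8):
    """Static: compute one scale offline from calibration activations.

    The returned scale is then frozen and reused for every inference input.
    """
    max_abs = max(abs(v) for batch in calibration_batches for v in batch)
    return max_abs / _qmax(num_bits) if max_abs > 0 else 1.0


def compute_dynamic_scale(activations, num_bits=8):
    """Dynamic: compute the scale at run time from the current tensor."""
    max_abs = max(abs(v) for v in activations)
    return max_abs / _qmax(num_bits) if max_abs > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Round to the integer grid and clamp to the signed range."""
    qmax = _qmax(num_bits)
    return [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Map integers back to approximate real values."""
    return [q * scale for q in q_values]


if __name__ == "__main__":
    # Illustrative numbers only, not data from the paper.
    calibration = [[0.5, -1.0, 0.25], [2.0, -0.75, 1.5]]
    static_scale = calibrate_static_scale(calibration)

    x = [0.1, -0.4, 0.3]
    print("static :", quantize(x, static_scale))
    print("dynamic:", quantize(x, compute_dynamic_scale(x)))
```

In this toy setup the static path needs no per-input reduction over the activations at inference time, but it uses the integer range less fully when an input is much smaller than the calibration maximum, and it clamps inputs that exceed it.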

Key facts

▪ Quant.npu enables efficient inference for large language models on mobile devices using fully static quantization.
▪ The framework incorporates learnable quantization parameters and rotation matrices to optimize performance.
▪ Experiments show that Quant.npu can reduce inference latency by up to 15.1% while maintaining accuracy.

Original article

arXiv cs.AI


Opening excerpt (first ~120 words)

Computer Science > Machine Learning
arXiv:2605.20295 (cs)
[Submitted on 19 May 2026]

Title: Quant.npu: Enabling Efficient Mobile NPU Inference for on-device LLMs via Fully Static Quantization

Authors: Jinghe Zhang, Daliang Xu, Chenghua Wang, Weikai Xie, Tao Qi, Yun Ma, Mengwei Xu, Gang Huang

Abstract: Large language models (LLMs) are increasingly deployed on mobile devices, where Neural Processing Units (NPUs) necessitate fully static quantization for optimal inference efficiency.

…

Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.
