Quant.npu: Enabling Efficient Mobile NPU Inference for on-device LLMs via Fully Static Quantization
The paper presents Quant.npu, a framework for efficient mobile NPU inference of large language models (LLMs) through fully static quantization. It addresses the limitations of existing post-training quantization methods that rely on dynamic activation quantization. The method achieves accuracy comparable to state-of-the-art techniques while reducing inference latency by up to 15.1%.
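To make the distinction between dynamic and fully static activation quantization concrete, here is a minimal sketch in plain Python. It assumes symmetric per-tensor int8 quantization; the helper names (`scale_from`, `calibrate_static_scale`, and so on) and the sample values are hypothetical and are not taken from the paper, which may use a different scheme.

```python
from typing import List, Tuple

QMAX = 127  # symmetric signed 8-bit range [-127, 127] (assumed for illustration)


def scale_from(values: List[float]) -> float:
    """Symmetric per-tensor scale: the largest magnitude maps to QMAX."""
    peak = max(abs(v) for v in values)
    return peak / QMAX if peak > 0 else 1.0


def quantize(values: List[float], scale: float) -> List[int]:
    """Round to the nearest integer step, then clamp to the int8 range."""
    return [max(-QMAX, min(QMAX, round(v / scale))) for v in values]


def dynamic_quantize(x: List[float]) -> Tuple[List[int], float]:
    """Dynamic: the scale is recomputed from the live tensor at inference time."""
    scale = scale_from(x)
    return quantize(x, scale), scale


def calibrate_static_scale(batches: List[List[float]]) -> float:
    """Static: the scale is fixed offline from calibration data and never changes."""
    return scale_from([v for batch in batches for v in batch])


# Hypothetical calibration set; its largest magnitude is 2.54.
calibration = [[0.5, -1.0, 2.0], [1.5, -2.54, 0.1]]
static_scale = calibrate_static_scale(calibration)

# A runtime activation whose outlier (5.08) exceeds the calibrated range.
x = [0.4, -1.2, 5.08]
q_static = quantize(x, static_scale)      # the outlier saturates at QMAX
q_dynamic, dynamic_scale = dynamic_quantize(x)  # the scale adapts to the outlier

print("static :", q_static, "scale =", static_scale)
print("dynamic:", q_dynamic, "scale =", dynamic_scale)
```

The static path needs no per-inference reduction over the activation tensor, which is the property that suits fixed-function accelerators; the cost is clipping when runtime values fall outside the calibrated range.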
- Quant.npu enables efficient inference for large language models on mobile devices using fully static quantization.
- The framework incorporates learnable quantization parameters and rotation matrices to optimize performance.
- Experiments show that Quant.npu can reduce inference latency by up to 15.1% while maintaining accuracy.
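The summary does not say how Quant.npu learns or applies its rotation matrices. As a generic illustration of why rotations help quantization in general, the sketch below shows the two properties such methods usually rely on: an orthogonal rotation can be folded into the weights without changing the layer output, and it can spread an outlier channel across dimensions so the activation's peak magnitude shrinks. The 2x2 matrices and the 45-degree angle are hypothetical values chosen for readability.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m: Matrix) -> Matrix:
    return [list(row) for row in zip(*m)]


def rotation(theta: float) -> Matrix:
    """2D rotation matrix; it is orthogonal, so its inverse is its transpose."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


R = rotation(math.pi / 4)
W = [[1.0, 2.0], [3.0, 4.0]]  # hypothetical weight matrix
x = [[4.0], [0.0]]            # column vector with one outlier channel

x_rot = matmul(R, x)             # rotated activation
W_rot = matmul(W, transpose(R))  # inverse rotation folded into the weights

y = matmul(W, x)
y_rot = matmul(W_rot, x_rot)     # equals y up to floating-point error

peak_before = max(abs(v[0]) for v in x)
peak_after = max(abs(v[0]) for v in x_rot)
print("output preserved:", all(abs(a[0] - b[0]) < 1e-9 for a, b in zip(y, y_rot)))
print("peak magnitude:", peak_before, "->", peak_after)
```

A smaller peak magnitude means a fixed, precomputed scale wastes less of the integer range on rare outliers, which is one plausible reason to pair rotations with static quantization.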
Opening excerpt (first ~120 words)
arXiv:2605.20295 (cs), Computer Science > Machine Learning
Submitted on 19 May 2026
Title: Quant.npu: Enabling Efficient Mobile NPU Inference for on-device LLMs via Fully Static Quantization
Authors: Jinghe Zhang, Daliang Xu, Chenghua Wang, Weikai Xie, Tao Qi, Yun Ma, Mengwei Xu, Gang Huang
Abstract: Large language models (LLMs) are increasingly deployed on mobile devices, where Neural Processing Units (NPUs) necessitate fully static quantization for optimal inference efficiency.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.