Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU
The paper presents Llamas on the Web (LlamaWeb), a WebGPU backend designed for efficient language model inference in web browsers. It addresses challenges related to memory constraints and hardware variability, achieving significant reductions in memory usage and improvements in performance. The evaluation shows LlamaWeb outperforms existing frameworks in both memory efficiency and decode throughput across multiple devices.
- LlamaWeb enables memory-efficient and performance-portable LLM inference across various model weight formats in the browser.
- The design reduces memory overhead by 29-33% compared to existing browser-based LLM frameworks.
- LlamaWeb increases decode throughput by 45-69% across four GPUs from different vendors.
Opening excerpt (first ~120 words)
arXiv:2605.20706 (cs), Computer Science > Distributed, Parallel, and Cluster Computing
Submitted on 20 May 2026
Title: Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU
Authors: Reese Levine, Rithik Sharma, Nikhil Jain, Abhijit Ramesh, Zheyuan Chen, Neha Abbas, James Contini, Tyler Sorensen
Abstract: Running language models in the browser presents a unique opportunity to build efficient, private, and portable AI applications, but requires contending with constrained memory availability and heterogeneous hardware targets.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.