Rotary GPU: Exploring Local Execution for Large MoE Models Under Limited VRAM
The paper 'Rotary GPU' explores whether large Mixture-of-Experts (MoE) models can run locally on hardware with limited GPU memory. Its motivation is to make advanced models accessible to organizations constrained by hardware and budget. The findings are not presented as definitive, but they suggest that local execution paths could widen deployment access to large models.
- The study investigates how large models can be made more accessible in environments with limited hardware resources.
- Rotary GPU is an exploratory execution approach that was validated using a Mixture-of-Experts model on a consumer laptop with an RTX 4060 GPU.
- The system maintained approximately 6.3 GB of VRAM usage while generating 2048 output tokens at a throughput of 21.06 tokens per second.
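To put the reported throughput in perspective, the arithmetic below converts it into an approximate generation time. This is a minimal sketch, not something taken from the paper: the helper name `generation_seconds` is hypothetical, and it assumes the 21.06 tokens per second figure is an average over the whole 2048-token run and excludes prompt processing.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Return decode time in seconds, assuming a constant average throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


# Figures reported in the summary: 2048 output tokens at 21.06 tokens/s.
seconds = generation_seconds(2048, 21.06)
print(f"{seconds:.1f} s")  # prints "97.2 s"
```

Under those assumptions, the 2048-token run corresponds to a little over a minute and a half of generation.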
Opening excerpt (first ~120 words)
Computer Science > Performance
arXiv:2605.29135 (cs)
[Submitted on 27 May 2026]
Title: Rotary GPU: Exploring Local Execution Paths for Large Mixture-of-Experts Models Under Limited GPU Memory
Authors: Myeong Jun Jo
Abstract: Large language models have achieved remarkable capabilities through scaling, and this paper does not challenge that. It instead investigates a different question: once large models already exist, can they become more accessible to environments with substantially smaller hardware resources? The motivation came from deployment concerns rather than architecture research.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv.org.