First Gemma 4 ExecuTorch Deployment on Raspberry Pi 5, and Why It's 7.7× Slower Than llama.cpp
This post documents the first deployment of Gemma 4 on ExecuTorch on a Raspberry Pi 5 and finds performance well below what ARM's published benchmarks suggest. The deployment produced bit-exact output, but decode ran 7.7 times slower than llama.cpp on the same board. Several issues surfaced along the way, which shows how hard it is to deploy on hardware without SME2.
- ARM announced Gemma 4 optimised for ARM devices via XNNPACK + KleidiAI, reporting a 5.5× prefill speedup and 1.6× faster decode on Armv9 chips with SME2 (flagship phone silicon).
- On the Raspberry Pi 5, the ExecuTorch deployment decoded at 0.87 tokens per second, against 6.71 tokens per second for llama.cpp.
- The performance gap is attributed to kernel fusion issues in the ExecuTorch XNNPACK backend on aarch64 systems.
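
As a quick check on the headline figure, the sketch below recomputes the slowdown from the two decode rates quoted above. The 100-token reply length is a hypothetical value chosen only to make the gap tangible; it does not come from the article.

```python
# Decode throughput quoted in the summary (tokens per second).
EXECUTORCH_TOK_PER_S = 0.87
LLAMA_CPP_TOK_PER_S = 6.71


def slowdown(baseline_tok_per_s: float, candidate_tok_per_s: float) -> float:
    """Return how many times slower the candidate decodes than the baseline."""
    return baseline_tok_per_s / candidate_tok_per_s


def seconds_for(tokens: int, tok_per_s: float) -> float:
    """Return the wall-clock seconds needed to decode `tokens` at a given rate."""
    return tokens / tok_per_s


ratio = slowdown(LLAMA_CPP_TOK_PER_S, EXECUTORCH_TOK_PER_S)
print(f"ExecuTorch decode is {ratio:.1f}x slower than llama.cpp")

# Hypothetical reply length, for illustration only.
REPLY_TOKENS = 100
print(f"ExecuTorch: {seconds_for(REPLY_TOKENS, EXECUTORCH_TOK_PER_S):.0f} s")
print(f"llama.cpp:  {seconds_for(REPLY_TOKENS, LLAMA_CPP_TOK_PER_S):.0f} s")
```

The ratio of the two rates rounds to the 7.7 in the title, so the headline number refers to decode throughput, not prefill.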
Viik, posted on May 25

#executorch #edgeai #raspberrypi #gemma4

On April 2, ARM published a blog post announcing Gemma 4 optimised for ARM devices via XNNPACK + KleidiAI, reporting 5.5× prefill speedup and 1.6× faster decode. Those numbers target Armv9 chips with SME2, that is, flagship phone silicon. I wanted to see what happens on the broader ARM ecosystem.
…
Excerpt truncated. The full article is at DEV.to.