First Gemma 4 ExecuTorch Deployment on Raspberry Pi 5 — and Why It's 7.7× Slower Than llama.cpp

May 25, 2026 · 11:19 AM UTC · 5 min read

#executorch #edgeai #raspberrypi #gemma4

TL;DR · WeSearch summary

The article documents the first deployment of Gemma 4 with ExecuTorch on a Raspberry Pi 5, and the results differ sharply from the benchmarks ARM published for SME2-capable chips. The deployment produced bit-exact output but decoded 7.7 times slower than llama.cpp (0.87 versus 6.71 tokens per second). Several issues surfaced along the way, underlining how hard deployment is on hardware without SME2.
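Because the results hinge on whether the hardware has SME2, it can help to check a board before comparing numbers. The sketch below parses the `Features` line of `/proc/cpuinfo` on a Linux aarch64 system. It assumes the kernel reports the extension under the flag name `sme2`, which should hold on recent arm64 kernels but is an assumption here; the function names and the sample text are illustrative and do not come from the article.

```python
def cpu_features(cpuinfo_text: str) -> set[str]:
    """Collect the flags listed on every 'Features' line of /proc/cpuinfo text."""
    features: set[str] = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "features":
            features.update(value.split())
    return features


def has_sme2(cpuinfo_text: str) -> bool:
    """True if the kernel lists the 'sme2' flag (assumed flag name)."""
    return "sme2" in cpu_features(cpuinfo_text)


# Illustrative sample only, not a verbatim dump from a Raspberry Pi 5.
SAMPLE_WITHOUT_SME2 = (
    "processor\t: 0\n"
    "Features\t: fp asimd evtstrm crc32 atomics\n"
)

print(has_sme2(SAMPLE_WITHOUT_SME2))  # False
```

On a real device the same check would be `has_sme2(open("/proc/cpuinfo").read())`. A kernel that predates SME2 support would not list the flag even on capable silicon, so a negative result is only a hint.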

Key facts

▪ ARM announced Gemma 4 optimized for ARM devices via XNNPACK + KleidiAI, reporting a 5.5× prefill speedup and 1.6× faster decode on Armv9 chips with SME2 (flagship phone silicon).
▪ On the Raspberry Pi 5, the ExecuTorch deployment decoded at 0.87 tokens per second, versus 6.71 tokens per second for llama.cpp, a gap of about 7.7×.
▪ The performance gap is attributed to kernel fusion issues in the ExecuTorch XNNPACK backend on aarch64 systems.
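The 7.7× figure follows directly from the two decode rates quoted above. A minimal sketch of the arithmetic, using only those two numbers (the function and constant names are illustrative):

```python
def slowdown(baseline_tps: float, candidate_tps: float) -> float:
    """How many times slower the candidate is than the baseline."""
    if baseline_tps <= 0 or candidate_tps <= 0:
        raise ValueError("throughput must be positive")
    return baseline_tps / candidate_tps


LLAMA_CPP_TPS = 6.71   # decode tokens/s reported for llama.cpp
EXECUTORCH_TPS = 0.87  # decode tokens/s reported for ExecuTorch

ratio = slowdown(LLAMA_CPP_TPS, EXECUTORCH_TPS)
print(f"{ratio:.1f}x slower")  # 7.7x slower

# The same gap expressed as time per generated token.
print(f"ExecuTorch: {1000 / EXECUTORCH_TPS:.0f} ms/token")  # 1149 ms/token
print(f"llama.cpp:  {1000 / LLAMA_CPP_TPS:.0f} ms/token")   # 149 ms/token
```

Put as latency, each decoded token takes a little over a second with ExecuTorch against roughly 150 ms with llama.cpp.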

Original article

DEV.to (Top)

Opening excerpt (first ~120 words)

Viik · Posted on May 25

On April 2, ARM published a blog post announcing Gemma 4 optimised for ARM devices via XNNPACK + KleidiAI, reporting 5.5× prefill speedup and 1.6× faster decode. Those numbers target Armv9 chips with SME2 — flagship phone silicon. I wanted to see what happens on the broader ARM ecosystem.

…

Excerpt limited to ~120 words for fair-use compliance. The full article is at DEV.to (Top).
