Most efficient way of running Gemma 4 E4B with multimodal capabilities on a laptop?
The Gemma 4 E4B and E2B models have built-in multimodal capabilities. However, as far as I am aware, llama.cpp does not yet properly support vision and audio inputs (especially audio) for these models. I was able to extract the audio encoder from the official model repository on Hugging Face and vibe-code a bridge that passes the audio embeddings directly to the model, and it actually works. This setup uses Unsloth's GGUF version at Q4 and the audio encoder at
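The bridge idea can be sketched in miniature. Everything below is an illustrative assumption, not the poster's actual code: the dimensions, the `audio_encoder` stand-in, and the projection matrix are all placeholders, and a real setup would use the extracted Gemma audio encoder plus llama.cpp's embedding-input path. The sketch just shows the core move: encode audio to frame embeddings, project them to the LLM's hidden size, and splice them into the prompt's token-embedding sequence in place of a placeholder token.

```python
import numpy as np

HIDDEN = 2048      # assumed LLM embedding width (illustrative)
AUDIO_DIM = 1536   # assumed audio-encoder output width (illustrative)
rng = np.random.default_rng(0)

def audio_encoder(waveform: np.ndarray) -> np.ndarray:
    """Stand-in for the extracted audio encoder: waveform -> (frames, AUDIO_DIM)."""
    frames = max(1, len(waveform) // 1600)  # ~one frame per 100 ms at 16 kHz
    return rng.standard_normal((frames, AUDIO_DIM))

# Projection from encoder space into the LLM's embedding space
# (random here; in practice this comes from the model's own adapter weights).
W_proj = rng.standard_normal((AUDIO_DIM, HIDDEN)) / np.sqrt(AUDIO_DIM)

def bridge(prompt_embeds: np.ndarray, waveform: np.ndarray,
           placeholder_idx: int) -> np.ndarray:
    """Replace one placeholder token's embedding with the projected audio frames."""
    audio_embeds = audio_encoder(waveform) @ W_proj  # (frames, HIDDEN)
    return np.concatenate([prompt_embeds[:placeholder_idx],
                           audio_embeds,
                           prompt_embeds[placeholder_idx + 1:]])

prompt = rng.standard_normal((10, HIDDEN))  # 10 text-token embeddings
wave = rng.standard_normal(16000)           # 1 s of dummy 16 kHz audio
seq = bridge(prompt, wave, placeholder_idx=4)
print(seq.shape)  # 9 remaining text embeddings + 10 audio frames
```

The spliced sequence is then what gets fed to the decoder instead of plain token embeddings, which is why the GGUF model never needs native audio support.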
Reddit