FP4 inference in llama.cpp (NVFP4) and ik_llama.cpp (MXFP4) landed - Finally

Apr 25, 2026 · 3:42 PM UTC · 0 reactions · 0 comments · 12 views

Both llama.cpp and ik_llama.cpp now have FP4 support — but with different flavors worth knowing about. llama.cpp recently merged NVFP4 (Nvidia's block-scaled FP4, `GGML_TYPE_NVFP4 = 40`), with CUDA kernels landing in `mmq.cuh`, `mmvq.cu`, `convert.cu` and others. ik_llama.cpp has had MXFP4 (`GGML_TYPE_MXFP4 = 39`) since PR #682 — the MX-standard FP4 used in gpt-oss models. Coverage is actually broader: CPU (AVX2, NEON, Zen4), CUDA, are all implemented. They're not the same wire format — NVFP4 is

Original article

Read full at Reddit →

Anonymous · no account needed

Discussion

0 comments

FP4 inference in llama.cpp (NVFP4) and ik_llama.cpp (MXFP4) landed - Finally

Discussion

More from Reddit