Benchmarking llama.cpp's new MTP support on Strix Halo
The article covers the new Multi-Token Prediction (MTP) support in llama.cpp, which speeds up decoding for models that ship with an MTP head. Benchmarks showed significant speed improvements on Strix Halo and RTX 3090 hardware, including a 1.81× speedup on Strix Halo with Qwen3.6 27B. With MTP, a model drafts several tokens per step, so output arrives faster without sacrificing accuracy.
- MTP support was added to llama.cpp on May 16, 2026.
- Benchmarking showed a 1.81× speedup on Strix Halo with the Qwen3.6 27B model using MTP with n=3.
- MTP lets a model draft several tokens at once, reducing the time needed to produce output.
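The draft-and-verify idea behind that last point can be illustrated with a toy sketch. This is generic greedy speculative decoding under simplifying assumptions, not llama.cpp's implementation: `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical names, the "models" are plain functions over integer tokens, and the draft function stands in for the role an MTP head would play.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def speculative_step(target_next: NextToken, draft_next: NextToken,
                     context: List[Token], n_draft: int) -> List[Token]:
    """Run one draft-and-verify step and return the tokens it emits."""
    # 1. Draft: the cheap predictor proposes n_draft tokens in a row.
    drafted: List[Token] = []
    for _ in range(n_draft):
        drafted.append(draft_next(context + drafted))

    # 2. Verify: compare each drafted token with the target's own choice.
    #    A real engine would score all positions in one batched pass;
    #    this toy calls the target once per position for clarity.
    emitted: List[Token] = []
    for tok in drafted:
        expected = target_next(context + emitted)
        if tok != expected:
            emitted.append(expected)  # first mismatch: keep the target's token
            return emitted
        emitted.append(tok)

    # 3. Every draft was accepted, so the target contributes one extra token.
    emitted.append(target_next(context + emitted))
    return emitted


def generate(target_next: NextToken, draft_next: NextToken,
             prompt: List[Token], max_new: int,
             n_draft: int) -> Tuple[List[Token], int]:
    """Generate max_new tokens; also return how many verify steps were used."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(target_next, draft_next, out, n_draft))
        steps += 1
    return out[:len(prompt) + max_new], steps


def toy_target(ctx: List[Token]) -> Token:
    # Hypothetical "full model": counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    # Hypothetical draft predictor: agrees with the target except after a 4.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    tokens, steps = generate(toy_target, toy_draft, [0], max_new=12, n_draft=3)
    print(tokens, steps)
```

Because every emitted token is either confirmed or supplied by the target, the output matches what the target alone would produce under greedy decoding; the gain comes from needing fewer verification steps than tokens whenever drafts are accepted.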
Opening excerpt (first ~120 words)
Benchmarking llama.cpp's brand-new MTP support on Strix Halo
2026-05-18 19:30:00
Author: Caleb

PR #22673 landed in llama.cpp on May 16. It adds first-class Multi-Token Prediction (MTP) speculative decoding for models that ship with an MTP head, including Qwen3.6 27B dense and the 35B-A3B MoE. The author posted ~2.5× speedups on a DGX Spark. I have a Strix Halo Framework Desktop and an RTX 3090, so I built llama.cpp from master a few hours after the merge and ran my speed-bench harness against both. Most wrappers (lemonade, ollama, LM Studio) won't have MTP for a while, so this is from-source territory.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at calebcoffie.com.