
Benchmarking llama.cpp's new MTP support on Strix Halo

Caleb Coffie · 10 min read
#technology #artificial intelligence #machine learning
⚡ TL;DR · AI summary

The article covers the new Multi-Token Prediction (MTP) support in llama.cpp, which speeds up decoding for models that ship with an MTP head. The author benchmarked it on a Strix Halo machine and an RTX 3090 and reports significant speedups. MTP is used here for speculative decoding: the model drafts several tokens ahead and then verifies them, so throughput rises without a loss of accuracy.
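To make the "draft several tokens, keep accuracy" idea concrete, here is a minimal toy sketch of greedy draft-and-verify decoding. It is not llama.cpp's implementation or API: `target_next`, `draft_next`, and `speculative_decode` are hypothetical stand-ins, the "models" are small deterministic functions, and only greedy decoding is assumed. In a real engine the verification loop would be a single batched forward pass of the full model, which is where the speedup would come from.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def target_next(ctx: List[Token]) -> Token:
    """Hypothetical stand-in for the full model's greedy next token."""
    return (ctx[-1] * 3 + len(ctx)) % 7


def draft_next(ctx: List[Token]) -> Token:
    """Hypothetical stand-in for a cheap draft head.

    It agrees with the target except when the context length is a
    multiple of 4, where it deliberately guesses wrong.
    """
    guess = target_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 7


def greedy_decode(prompt: List[Token], n_new: int, nxt: NextToken) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(nxt(out))
    return out


def speculative_decode(
    prompt: List[Token], n_new: int, target: NextToken, draft: NextToken, k: int = 3
) -> Tuple[List[Token], int]:
    """Draft k tokens, verify them against the target, keep the agreeing prefix.

    Returns the decoded sequence and the number of verification passes.
    """
    out = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(out) < goal:
        # 1. Draft k tokens cheaply.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2. Verify. Counted as one pass because a real engine would score
        #    all k + 1 positions in a single batched forward pass.
        passes += 1
        accepted: List[Token] = []
        for i in range(k):
            t = target(out + accepted)
            accepted.append(t)  # the target's token is always the one kept
            if t != proposal[i]:
                break  # first disagreement: discard the rest of the draft
        else:
            # Every drafted token matched, so the pass yields a bonus token.
            accepted.append(target(out + accepted))

        out.extend(accepted)
    return out[:goal], passes


prompt = [1, 2]
baseline = greedy_decode(prompt, 20, target_next)
spec, passes = speculative_decode(prompt, 20, target_next, draft_next, k=3)
print(spec == baseline, passes)
```

Every token that is kept comes from the target, so the output matches plain greedy decoding exactly; the draft only changes how many target passes are needed. How much this helps in practice depends on how often the drafts are accepted.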

Original article
calebcoffie.com · Caleb Coffie
Opening excerpt (first ~120 words)

Benchmarking llama.cpp's brand-new MTP support on Strix Halo

2026-05-18 19:30:00 · Author: Caleb

PR #22673 landed in llama.cpp on May 16. It adds first-class Multi-Token Prediction (MTP) speculative decoding for models that ship with an MTP head, including Qwen3.6 27B dense and the 35B-A3B MoE. The author posted ~2.5× speedups on a DGX Spark. I have a Strix Halo Framework Desktop and an RTX 3090, so I built llama.cpp from master a few hours after the merge and ran my speed-bench harness against both. Most wrappers (lemonade, ollama, LM Studio) won't have MTP for a while, so this is from-source territory.

Excerpt limited to ~120 words for fair-use compliance. The full article is at calebcoffie.com.
