Qwen 3.6-35B-A3B KV cache bench: f16 vs q8_0 vs turbo3 vs turbo4 from 0 to 1M context on M5 Max

Apr 28, 2026 · 5:17 PM UTC

via LocalLlama

I took TheTom's TurboQuant Metal fork of llama.cpp (github.com/TheTom/llama-cpp-turboquant, the feature/turboquant-kv-cache branch) and ran a depth sweep on Qwen 3.6-35B-A3B Q8. TheTom had already published M5 Max numbers up to 32K; I wanted to see what the curves look like once you push past that. Hardware: MacBook Pro M5 Max, 128 GB unified memory. Built the fork with cmake -B build -DGGML_METAL=ON. Benchmarked with llama-bench: 3 reps per cell, flash-attn on, mlock on, 8 hours wall-clock overnight. Cache types: f
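
For anyone who wants to try a similar sweep, here is a minimal sketch of how the grid of llama-bench runs could be generated. It is not the script used for this post. The model path and the depth list are hypothetical placeholders, the cache type names come from the post title, and setting the K and V cache to the same type is an assumption. The flags used (-m, -ctk, -ctv, -d, -r, -fa) are the upstream llama-bench ones; the fork is assumed to accept turbo3 and turbo4 as cache type values. No mlock option is included, since the exact flag is not given in the post.

```python
from itertools import product

# Hypothetical placeholders: neither the model path nor the depth list
# comes from the post; substitute your own.
MODEL = "models/qwen-q8_0.gguf"
CACHE_TYPES = ["f16", "q8_0", "turbo3", "turbo4"]  # names from the post title
DEPTHS = [0, 32768, 131072]


def bench_commands(model=MODEL, cache_types=CACHE_TYPES, depths=DEPTHS, reps=3):
    """Return one llama-bench argument list per (cache type, depth) cell."""
    cmds = []
    for ctype, depth in product(cache_types, depths):
        cmds.append([
            "./build/bin/llama-bench",
            "-m", model,
            "-ctk", ctype,      # K cache type
            "-ctv", ctype,      # V cache type (assumed same as K)
            "-d", str(depth),   # context depth to prefill before measuring
            "-r", str(reps),    # repetitions per cell
            "-fa", "1",         # flash attention on
        ])
    return cmds


if __name__ == "__main__":
    # Print the commands instead of running them, so the grid can be reviewed first.
    for cmd in bench_commands():
        print(" ".join(cmd))
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; piping the output into a shell, or swapping the print for subprocess.run, would turn it into an actual overnight job.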
