Qwen 3.6-35B-A3B KV cache bench: f16 vs q8_0 vs turbo3 vs turbo4 from 0 to 1M context on M5 Max
Took TheTom's TurboQuant Metal fork of llama.cpp (github.com/TheTom/llama-cpp-turboquant, the feature/turboquant-kv-cache branch) and ran a depth sweep on Qwen 3.6-35B-A3B Q8. TheTom had already published M5 Max numbers up to 32K; I wanted to see what the curves look like once you push well past that. Hardware: MacBook Pro M5 Max, 128 GB unified memory. Built the fork with cmake -B build -DGGML_METAL=ON. Benchmarked with llama-bench: 3 reps per cell, flash-attn on, mlock on, 8 hours wall-clock overnight. Cache types: f
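For a rough sense of why cache type matters more as depth grows, here is a back-of-envelope sketch of KV cache size for the two stock ggml types in the sweep. The layer count, KV head count and head dimension are placeholders I made up for illustration, not the real Qwen 3.6-35B-A3B config, and only layers that actually keep a KV cache should be counted. turbo3 and turbo4 are left out because their storage layout isn't described here. The one hard fact used is that q8_0 stores blocks of 32 int8 values plus one f16 scale, so 34 bytes per 32 elements.

```python
# Rough KV cache size estimate for f16 vs q8_0 cache types.
# f16 uses 2 bytes per element; q8_0 packs 32 int8 values plus one
# f16 scale into 34 bytes, i.e. 34/32 bytes per element.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, cache_type):
    """Bytes needed to hold K and V for n_ctx tokens.

    Counts only layers that keep a per-token KV cache, and assumes
    K and V use the same cache type.
    """
    elems_per_token = 2 * n_layers * n_kv_heads * head_dim  # K plus V
    return n_ctx * elems_per_token * BYTES_PER_ELEM[cache_type]


# HYPOTHETICAL model dimensions, for illustration only.
# Replace with the values from the model's actual config.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 40, 4, 128

if __name__ == "__main__":
    for depth in (32_768, 262_144, 1_048_576):
        for cache_type in ("f16", "q8_0"):
            size = kv_cache_bytes(depth, N_LAYERS, N_KV_HEADS, HEAD_DIM, cache_type)
            print(f"{cache_type:>5} @ {depth:>9} tokens: {size / 2**30:7.2f} GiB")
```

Whatever the real dimensions are, the size scales linearly with depth, and q8_0 comes out at 34/64 (about 53%) of the f16 footprint.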
LocalLlama