KV cache quantization: ignorance, or malice?

May 2, 2026 · 3:34 PM UTC

via LocalLlama

I run Qwen-3.6 27B FP8 on vLLM for long-horizon agentic coding harness workloads with a large context window and concurrent sub-agents. On two 3090s that aren’t used for anything else, it seems reasonable to expect a good balance between speed and reliability. I want to raise a particular point of contention about this optimization process. I have an extensive software engineering background but am relatively new to this, so feel free to correct me if I’m not on the right track. It seems like co
