KV Cache Locality: The Hidden Variable in Your LLM Serving Cost
Every time your load balancer sends a request to the wrong GPU, that GPU recomputes a prefill it already computed somewhere else. The KV cache for that 4,000-token system prompt exists. It’s just sitting on a different card. Your load balancer doesn’t know. It can’t know. It’s counting connections, not tokens.
Apr 30, 2026 • Minds Aspire

That recomputation takes real time and real money. On Llama 3.1 70B at half precision, a 4,000-token prefill takes over a second. If eight GPUs each recompute the same system prompt independently because round-robin sent one request to each, you just paid for the same work eight times. Multiply that by every request, every hour, every day.
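A minimal sketch of the alternative this points at: routing on the prompt prefix instead of connection counts, so requests sharing a system prompt land on the GPU that already holds its KV cache. The `route` helper and the 512-character prefix window are illustrative assumptions, not the article's implementation:

```python
import hashlib

# Illustrative assumption: route on the first chunk of the prompt,
# which typically contains the shared system prompt.
PREFIX_CHARS = 512

def route(prompt: str, num_gpus: int) -> int:
    """Pick a GPU by hashing the prompt prefix, not by counting connections."""
    digest = hashlib.sha256(prompt[:PREFIX_CHARS].encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_gpus

# Two requests sharing a long system prompt hash to the same GPU,
# so the prefill for that prefix is computed once, not once per GPU.
system_prompt = "You are a helpful assistant. " * 100
gpu_a = route(system_prompt + "user question one", 8)
gpu_b = route(system_prompt + "user question two", 8)
assert gpu_a == gpu_b
```

A real serving stack would layer cache-aware load balancing on top of this (falling back to the least-loaded GPU when the preferred one is saturated), but the core idea is the same: the router must see tokens, not just connections.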
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at Ranvier.