
KV Cache Locality: The Hidden Variable in Your LLM Serving Cost

Minds Aspire · Apr 30, 2026 · 7 min read

Every time your load balancer sends a request to the wrong GPU, that GPU recomputes a prefill it already computed somewhere else. The KV cache for that 4,000-token system prompt exists. It’s just sitting on a different card. Your load balancer doesn’t know. It can’t know. It’s counting connections, not tokens.
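The fix this implies is routing on prompt content rather than connection counts: requests that share a prefix should land on the GPU that already holds that prefix's KV cache. The following is a minimal hypothetical sketch of that idea (the function names, GPU list, and chars-per-token estimate are all illustrative assumptions, not the article's implementation):

```python
import hashlib

def least_connections(connections: dict[str, int]) -> str:
    """Classic connection-count routing: picks the least-loaded GPU,
    with no knowledge of which KV caches live where."""
    return min(connections, key=connections.get)

def prefix_affinity(prompt: str, gpu_ids: list[str], prefix_tokens: int = 4000) -> str:
    """Hash an approximate shared prefix (e.g. the system prompt) so that
    identical prefixes always map to the same GPU, where their KV cache
    may already be resident."""
    prefix = prompt[: prefix_tokens * 4]  # rough ~4 chars/token estimate
    digest = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
    return gpu_ids[int(digest, 16) % len(gpu_ids)]
```

With affinity routing, every request carrying the same 4,000-token system prompt hits the same card, so the prefill is computed once instead of once per GPU; the tradeoff is load skew when one prefix dominates traffic.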


That recomputation takes real time and real money. On a Llama 3.1 70B at half precision, a 4,000-token prefill takes over a second. If eight GPUs each recompute the same system prompt independently because round-robin sent one request to each, you just paid for the same work eight times. Multiply by every request, every hour, every day.

Excerpt limited to ~120 words for fair-use compliance. The full article is at Ranvier.
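The arithmetic in the excerpt can be sketched as a back-of-envelope estimate. The 1-second prefill for a 4,000-token prompt and the eight GPUs come from the article; the request rate and GPU-hour price below are hypothetical assumptions added purely for illustration:

```python
GPU_HOUR_USD = 2.0  # assumed on-demand price per GPU-hour (not from the article)

def redundant_prefill_seconds(prefill_s: float, num_gpus: int) -> float:
    """Round-robin spreads requests with the same system prompt across all
    GPUs, so each GPU prefills it once; all but the first are redundant."""
    return prefill_s * (num_gpus - 1)

def wasted_cost_per_hour(prefill_s: float, num_gpus: int, prompts_per_hour: int) -> float:
    """Price the duplicated prefill GPU-seconds at the assumed hourly rate."""
    wasted_s = redundant_prefill_seconds(prefill_s, num_gpus) * prompts_per_hour
    return wasted_s / 3600 * GPU_HOUR_USD

print(redundant_prefill_seconds(1.0, 8))  # 7.0 redundant seconds per shared prompt
print(round(wasted_cost_per_hour(1.0, 8, 1000), 2))
```

Even at these modest assumed numbers, seven of every eight identical prefills are pure waste, which is the "paid for the same work eight times" multiplier the article describes.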
