
KV Cache Locality: The Hidden Variable in Your LLM Serving Cost

Minds Aspire · Apr 30, 2026 · 7 min read

Every time your load balancer sends a request to the wrong GPU, that GPU recomputes a prefill it already computed somewhere else. The KV cache for that 4,000-token system prompt exists. It’s just sitting on a different card. Your load balancer doesn’t know. It can’t know. It’s counting connections, not tokens.
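The fix this implies is routing on prompt content rather than connection counts: requests that share a prefix should land on the GPU that already holds that prefix's KV cache. The following is a minimal hypothetical sketch of that idea (the function names, GPU list, and chars-per-token estimate are all illustrative assumptions, not the article's implementation):

```python
import hashlib

def least_connections(connections: dict[str, int]) -> str:
    """Classic connection-count routing: picks the least-loaded GPU,
    with no knowledge of which KV caches live where."""
    return min(connections, key=connections.get)

def prefix_affinity(prompt: str, gpu_ids: list[str], prefix_tokens: int = 4000) -> str:
    """Hash an approximate shared prefix (e.g. the system prompt) so that
    identical prefixes always map to the same GPU, where their KV cache
    may already be resident."""
    prefix = prompt[: prefix_tokens * 4]  # rough ~4 chars/token estimate
    digest = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
    return gpu_ids[int(digest, 16) % len(gpu_ids)]
```

With affinity routing, every request carrying the same 4,000-token system prompt hits the same card, so the prefill is computed once instead of once per GPU; the tradeoff is load skew when one prefix dominates traffic.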


That recomputation takes real time and real money. On a Llama 3.1 70B at half precision, a 4,000-token prefill takes over a second. If eight GPUs each recompute the same system prompt independently because round-robin sent one request to each, you just paid for the same work eight times. Multiply by every request, every hour, every day.

Excerpt limited to ~120 words for fair-use compliance. The full article is at Ranvier.
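The arithmetic in the excerpt can be sketched as a back-of-envelope estimate. The 1-second prefill for a 4,000-token prompt and the eight GPUs come from the article; the request rate and GPU-hour price below are hypothetical assumptions added purely for illustration:

```python
GPU_HOUR_USD = 2.0  # assumed on-demand price per GPU-hour (not from the article)

def redundant_prefill_seconds(prefill_s: float, num_gpus: int) -> float:
    """Round-robin spreads requests with the same system prompt across all
    GPUs, so each GPU prefills it once; all but the first are redundant."""
    return prefill_s * (num_gpus - 1)

def wasted_cost_per_hour(prefill_s: float, num_gpus: int, prompts_per_hour: int) -> float:
    """Price the duplicated prefill GPU-seconds at the assumed hourly rate."""
    wasted_s = redundant_prefill_seconds(prefill_s, num_gpus) * prompts_per_hour
    return wasted_s / 3600 * GPU_HOUR_USD

print(redundant_prefill_seconds(1.0, 8))  # 7.0 redundant seconds per shared prompt
print(round(wasted_cost_per_hour(1.0, 8, 1000), 2))
```

Even at these modest assumed numbers, seven of every eight identical prefills are pure waste, which is the "paid for the same work eight times" multiplier the article describes.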
