ObjectCache: Layerwise Object-Storage Retrieval for KV Cache Reuse

May 25, 2026 · 4:00 AM UTC · 3 min read

#computer science #distributed computing #artificial intelligence

TL;DR · WeSearch summary

The paper presents ObjectCache, an approach to prefix KV caching for large language model serving that uses S3-compatible object storage and retrieves the cache layer by layer. The goal is to relieve the capacity limits of GPU memory and local DRAM while keeping the added latency small. By overlapping data transfer with computation, the system reduces the time to first token (TTFT) that remote retrieval would otherwise add.

Key facts

▪ ObjectCache co-designs the storage protocol and transfer schedule to optimize KV cache retrieval.
▪ The system adds only 5.6% latency for 64K contexts compared to local DRAM.
▪ Under shared bandwidth caps, ObjectCache reduces added time to first token by 1.2x to 1.8x.
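
The summary does not describe the actual protocol or schedule, so the following is only a toy timing model of why layerwise retrieval can hide transfer time behind computation. It assumes one transfer in flight at a time and that layer i can start computing as soon as its own KV data has arrived. The function names and the per-layer timings are hypothetical and are not taken from the paper.

```python
def sequential_time(fetch, compute):
    """Fetch the KV data for every layer first, then run all layers."""
    return sum(fetch) + sum(compute)


def pipelined_time(fetch, compute):
    """Fetch layer i+1 while layer i computes (one transfer at a time)."""
    fetch_done = 0.0    # time at which the current layer's KV has arrived
    compute_done = 0.0  # time at which the previous layer finished computing
    for f, c in zip(fetch, compute):
        fetch_done += f  # transfers run back to back
        # A layer starts once its KV is present and the previous layer is done.
        compute_done = max(compute_done, fetch_done) + c
    return compute_done


layers = 4
fetch = [2.0] * layers    # hypothetical per-layer transfer time (ms)
compute = [3.0] * layers  # hypothetical per-layer compute time (ms)

baseline = sum(compute)   # KV already local: nothing to wait for
print(sequential_time(fetch, compute) - baseline)  # added latency, no overlap
print(pipelined_time(fetch, compute) - baseline)   # added latency, with overlap
```

In this model, when per-layer compute is at least as long as per-layer transfer, only the first layer's fetch is exposed. When transfer is slower than compute, the pipeline is bound by the transfer time instead.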

Original article

arXiv cs.AI

Opening excerpt (first ~120 words)

Computer Science > Distributed, Parallel, and Cluster Computing
arXiv:2605.22850 (cs)
[Submitted on 16 May 2026]

Title: ObjectCache: Layerwise Object-Storage Retrieval for KV Cache Reuse
Authors: Yu Zhu, Aditya Dhakal, Yunming Xiao, Dejan Milojicic, Gustavo Alonso

Abstract: Prefix KV caching has become a key mechanism in LLM serving: it reduces time to first token (TTFT) by avoiding redundant computation across requests that share a prefix (i.e., the system prompt). However, the accumulated KV cache is often larger than what GPU memory and local DRAM can hold.

…

Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv cs.AI.
