DeepSeek-V4 KV Cache Explained: Why 1M Context Uses Less VRAM
DeepSeek-V4 introduces a hybrid compressed attention mechanism that significantly reduces the KV Cache memory required during inference. By compressing multiple historical tokens into fewer KV entries, it relieves VRAM pressure while maintaining long-context performance. As a result, the model can process up to 1 million tokens with less memory than previous architectures.
- In a 1M-token setting, DeepSeek-V4's KV Cache is about 10% the size of DeepSeek-V3.2's and 2% the size of a common bf16 GQA architecture's.
- The model compresses old context into blocks, so it stores summaries of earlier tokens instead of a full KV entry for every distant token.
- Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) are the two mechanisms DeepSeek-V4 uses to optimize memory usage.
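To make the block idea concrete, here is a minimal counting sketch. It is not DeepSeek-V4's actual CSA or HCA algorithm: the function name `cache_entries`, the `recent_window` of uncompressed tokens, and the `block_size` are all hypothetical values chosen for illustration. It only counts how many KV entries remain when each block of distant tokens is replaced by a single summary entry.

```python
import math


def cache_entries(n_tokens: int, recent_window: int, block_size: int) -> int:
    """Count KV entries when distant tokens are stored one entry per block.

    Assumption (not from the article): the most recent `recent_window`
    tokens keep a full entry each, and every older group of `block_size`
    tokens is collapsed into one summary entry.
    """
    recent = min(n_tokens, recent_window)
    distant = n_tokens - recent
    return recent + math.ceil(distant / block_size)


if __name__ == "__main__":
    n = 1_000_000
    # Hypothetical settings, picked only to show the shape of the saving.
    entries = cache_entries(n, recent_window=1024, block_size=16)
    print(f"uncompressed entries: {n}")
    print(f"compressed entries:   {entries} ({entries / n:.2%} of original)")
```

The entry count falls roughly in proportion to the block size, which is why heavier compression of distant context translates directly into a smaller cache. The real memory saving also depends on how large each summary entry is, which this sketch does not model.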
Opening excerpt (first ~120 words)
A comparison of DeepSeek-V4's CSA/HCA hybrid compressed attention with traditional MHA, GQA, and MLA, explaining why DeepSeek-V4 can greatly reduce KV Cache memory for 1M-token context.
2026-05-18

The real cost of long-context models is often not whether they can accept one million tokens, but how much VRAM the KV Cache consumes during inference. During Transformer decoding, every newly generated token needs access to the Key and Value states of previous tokens. The longer the context, the larger the KV Cache. A larger KV Cache puts pressure on VRAM, memory bandwidth, time to first token, and throughput.
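The linear growth described above can be checked with the standard size estimate for an uncompressed cache: two tensors (K and V) per layer, each holding `n_kv_heads * head_dim` values per token. The sketch below uses a hypothetical GQA configuration (60 layers, 8 KV heads, head dimension 128, bf16); these numbers are assumptions for illustration and do not describe any specific DeepSeek model.

```python
def kv_cache_bytes(
    seq_len: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_elem: int = 2,  # bf16 stores each value in 2 bytes
) -> int:
    """Estimate uncompressed KV Cache size in bytes for one sequence."""
    # The leading factor of 2 accounts for storing both Keys and Values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical GQA model; not taken from the article.
    cfg = dict(n_layers=60, n_kv_heads=8, head_dim=128)
    for tokens in (8_000, 128_000, 1_000_000):
        size_gb = kv_cache_bytes(tokens, **cfg) / 1e9
        print(f"{tokens:>9} tokens -> {size_gb:8.2f} GB")

    baseline = kv_cache_bytes(1_000_000, **cfg)
    # "About 2%" is the ratio quoted in the summary for a bf16 GQA baseline;
    # applying it to this made-up config is illustrative, not a measurement.
    print(f"at 2% of baseline: {0.02 * baseline / 1e9:.2f} GB")
```

Because every term except `seq_len` is fixed by the architecture, the cache grows strictly in proportion to context length, so a 1M-token context costs 125 times as much cache as an 8K-token one under this estimate.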
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at KnightLi Blog.