Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention
Recent open-weight large language models increasingly target long-context efficiency through architectural changes. Gemma 4, Laguna XS.2, ZAYA1-8B, and DeepSeek V4 use techniques such as KV sharing, layer-wise attention budgeting, compressed attention, and mHC to reduce memory and compute costs. These optimizations target the transformer's attention mechanism and its KV cache, the two components whose cost grows with the number of tokens kept in context.
- Gemma 4 introduces KV sharing and per-layer embeddings to reduce KV-cache size and memory traffic in long-context scenarios.
- ZAYA1-8B employs compressed convolutional attention to lower attention computation costs.
- DeepSeek V4 combines mHC (manifold-constrained hyper-connections) with compressed attention for improved efficiency.
- Laguna XS.2 implements layer-wise attention budgeting to manage computational load across layers.
- These architectural changes are designed to support reasoning models and agent workflows that maintain long token contexts.
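To make the KV-cache pressure concrete, here is a minimal Python sketch of the standard cache-size arithmetic and of what sharing K/V tensors across layers would save. The function name `kv_cache_bytes`, the `share_group` parameter, and every configuration value are hypothetical and chosen only for illustration. They are not Gemma 4's actual settings, and the grouping of consecutive layers is an assumption, since the summary does not say how Gemma 4 assigns shared K/V tensors.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, share_group=1):
    """Approximate KV-cache size in bytes for a single sequence.

    share_group is a hypothetical knob: the number of consecutive layers
    that reuse one set of K/V tensors (1 means no sharing).
    """
    if share_group < 1 or n_layers % share_group != 0:
        raise ValueError("share_group must be a positive divisor of n_layers")
    # Only one layer per group stores its own keys and values.
    layers_with_own_kv = n_layers // share_group
    # Factor 2: one tensor for keys plus one for values.
    return (2 * layers_with_own_kv * n_kv_heads * head_dim
            * seq_len * bytes_per_elem)


# Illustrative configuration (assumed values, not taken from any real model).
config = dict(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=128_000)

baseline = kv_cache_bytes(**config)                 # every layer caches K/V
shared = kv_cache_bytes(**config, share_group=2)    # pairs of layers share K/V

print(f"baseline: {baseline / 2**30:.3f} GiB")
print(f"shared:   {shared / 2**30:.3f} GiB")
```

Under these assumptions the cache shrinks in direct proportion to `share_group`, because the size is linear in the number of layers that store their own K/V tensors. It is also linear in `seq_len`, which is why long-running reasoning and agent workloads hit this limit first.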
Opening excerpt (first ~120 words)
Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention
From Gemma 4 to DeepSeek V4, How New Open-Weight LLMs Are Reducing Long-Context Costs
Sebastian Raschka, PhD
May 16, 2026
After a short family break, I am excited to be back and catching up on a busy few weeks of open-weight LLM releases. The thing that stood out to me is how much newer architectures are focused on long-context efficiency.
As reasoning models and agent workflows keep more tokens around (for longer), KV-cache size, memory traffic, and attention cost quickly become the main constraints, and LLM developers are adding a growing number of architecture tricks to reduce those costs.
The main examples I want to look at are KV sharing and per-layer embeddings in Gemma 4, layer-wise attention…
Excerpt limited to ~120 words for fair-use compliance. The full article is at Hacker News (AI / LLM).