Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention

Sebastian Raschka, PhD · May 16, 2026 · 2:52 PM UTC · 23 min read

#llm architecture #attention mechanisms #memory efficiency #transformer optimization #ai research

⚡ TL;DR · AI summary

Recent open-weight large language models focus on long-context efficiency through architectural changes. Gemma 4, ZAYA1-8B, and DeepSeek V4 use techniques such as KV sharing, compressed attention, and mHC to reduce memory and compute costs. These optimizations target the transformer's attention mechanism and KV cache, which become the main constraints once reasoning models and agent workflows keep many tokens in context.

Key facts

▪ Gemma 4 introduces KV sharing and per-layer embeddings to reduce KV-cache size and memory traffic in long-context scenarios.
▪ ZAYA1-8B employs compressed convolutional attention to lower attention computation costs.
▪ DeepSeek V4 combines mHC (manifold-constrained hyper-connections) with compressed attention for improved efficiency.
▪ Laguna XS.2 implements layer-wise attention budgeting to manage computational load across layers.
▪ These architectural changes are designed to support reasoning models and agent workflows that maintain long token contexts.
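
To make the KV-cache point concrete, here is a rough back-of-the-envelope sketch of how cache size scales with context length, and how sharing one set of keys and values across groups of layers shrinks it. This is a generic illustration of cross-layer KV sharing, not Gemma 4's actual scheme; the function name, the `layers_per_kv_group` parameter, and all configuration numbers are hypothetical and chosen only for the arithmetic.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, layers_per_kv_group=1):
    """Estimate the KV-cache size in bytes for a single sequence.

    layers_per_kv_group=1 means every layer stores its own keys and values.
    A value of g means g consecutive layers reuse one stored K/V set
    (a simplified, hypothetical model of KV sharing).
    """
    if num_layers % layers_per_kv_group != 0:
        raise ValueError("num_layers must be divisible by layers_per_kv_group")
    kv_sets = num_layers // layers_per_kv_group
    # The factor 2 accounts for storing both keys and values.
    return 2 * kv_sets * num_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128,
# a 128,000-token context, and 16-bit (2-byte) cache entries.
baseline = kv_cache_bytes(32, 8, 128, 128_000)
shared = kv_cache_bytes(32, 8, 128, 128_000, layers_per_kv_group=2)

print(f"baseline: {baseline / 2**30:.2f} GiB")
print(f"shared:   {shared / 2**30:.2f} GiB")
```

Under these assumed numbers, letting every two layers share one K/V set halves the cache. The cache also grows linearly with `seq_len`, which is why long-context reasoning and agent workloads make it a primary constraint.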

Original article

Hacker News (AI / LLM) · Sebastian Raschka, PhD

Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention

From Gemma 4 to DeepSeek V4, How New Open-Weight LLMs Are Reducing Long-Context Costs

Sebastian Raschka, PhD · May 16, 2026

After a short family break, I am excited to be back and catching up on a busy few weeks of open-weight LLM releases. The thing that stood out to me is how much newer architectures are focused on long-context efficiency.

As reasoning models and agent workflows keep more tokens around (for longer), KV-cache size, memory traffic, and attention cost quickly become the main constraints, and LLM developers are adding a growing number of architecture tricks to reduce those costs.

The main examples I want to look at are KV sharing and per-layer embeddings in Gemma 4, layer-wise attention…

Excerpt limited to ~120 words for fair-use compliance. The full article is at Hacker News (AI / LLM).
