Introducing RadixAttention to Trellis

Jun 3, 2026 · 7:16 AM UTC · 5 min read

#technology #artificial intelligence #machine learning

TL;DR · WeSearch summary

Trellis was built to make LLM inference broadly accessible without compromising user data privacy. It can be deployed on hardware users already own, and it speeds up the prefill phase of LLM inference with a caching strategy. RadixAttention improves efficiency by caching the prefixes that chat-based sessions have in common.

Key facts

▪ Trellis enables LLM inference on user-owned hardware like laptops and servers.
▪ RadixAttention uses a radix tree to optimize storage for shared string prefixes.
▪ The KV caching mechanism in Trellis enhances concurrency by allowing multiple sessions to share prefix blocks.
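
To make the radix-tree idea concrete, here is a minimal sketch of prefix matching over token IDs. It is not Trellis's implementation: the names `Node`, `PrefixCache`, `insert`, and `match` are hypothetical, and the tree stores only token runs, whereas a real engine would presumably attach KV-cache blocks to each node and handle eviction.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Edge label: the run of token IDs leading from the parent to this node.
    # (Assumption: a real engine would also reference KV-cache blocks here.)
    tokens: tuple = ()
    children: dict = field(default_factory=dict)  # first token of edge -> Node


def _common_len(a, b):
    """Length of the longest common prefix of two token sequences."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


class PrefixCache:
    """Hypothetical radix tree over token IDs (illustration only)."""

    def __init__(self):
        self.root = Node()

    def match(self, tokens):
        """Return how many leading tokens of `tokens` are already stored."""
        tokens = tuple(tokens)
        node, matched = self.root, 0
        while matched < len(tokens):
            child = node.children.get(tokens[matched])
            if child is None:
                break
            n = _common_len(child.tokens, tokens[matched:])
            matched += n
            if n < len(child.tokens):
                break  # diverged partway along this edge
            node = child
        return matched

    def insert(self, tokens):
        """Store `tokens`, splitting an edge where the sequence diverges."""
        tokens = tuple(tokens)
        node, i = self.root, 0
        while i < len(tokens):
            child = node.children.get(tokens[i])
            if child is None:
                # No shared edge: hang the remaining tokens off this node.
                node.children[tokens[i]] = Node(tokens[i:])
                return
            n = _common_len(child.tokens, tokens[i:])
            if n < len(child.tokens):
                # Split the edge so the shared part becomes its own node.
                mid = Node(child.tokens[:n], {child.tokens[n]: child})
                child.tokens = child.tokens[n:]
                node.children[tokens[i]] = mid
                child = mid
            node = child
            i += n


# Two sessions that share a 4-token system prompt.
system_prompt = [1, 2, 3, 4]
cache = PrefixCache()
cache.insert(system_prompt + [10, 11])
reusable = cache.match(system_prompt + [20, 21])  # 4: only the shared prompt
```

In this sketch, the value returned by `match` is the number of prompt tokens whose prefill work a second session could skip; only the tokens after that point would need to be computed.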

Original article

Trellis


Opening excerpt (first ~120 words)

We created Trellis to democratize LLM inference without making compromises on the data privacy of its users. Towards that goal, we built a system that users can deploy on hardware they already own and operate, i.e. laptops, workstations and servers alike. To meet users where they are, we must accommodate more or less performant hardware, and therefore take optimization opportunities whenever possible. In this post, we will focus on how we optimized the prefill phase of LLM inference in Trellis. We'll start by giving a brief background on the problem, discuss our implementation of a caching strategy that is relevant for chat-based and agentic LLM sessions, and conclude by showing some benchmark results.

…

Excerpt limited to ~120 words for fair-use compliance. The full article is at Trellis.
