Stream2LLM: Overlap Context Streaming and Prefill for Reduced TTFT
Stream2LLM streams context to large language models (LLMs) as it arrives and overlaps that streaming with prefill, instead of waiting for retrieval to finish. It extends vLLM to do this for multiple concurrent requests, with scheduling policies that manage memory contention, and reports up to an 11x improvement in time-to-first-token (TTFT) on real-world web crawling and vector search traces. The system must manage memory carefully to avoid increasing tail latency.
- Stream2LLM extends vLLM to support concurrent streaming of context for multiple requests.
- The system can achieve up to 11x faster TTFT while maintaining throughput parity.
- Effective memory scheduling is crucial to prevent increased tail latency when handling multiple requests.
Opening excerpt (first ~120 words)
tl;dr: Streaming context to an LLM as it arrives, rather than waiting for complete retrieval, reduces latency dramatically. But prior systems only handle one request at a time. Stream2LLM extends vLLM with concurrent streaming support, introducing scheduling policies that manage memory contention and dynamic input changes across concurrent requests. Evaluated on real-world web crawling and vector search traces, it achieves up to 11x TTFT improvement while maintaining throughput parity.
A user asks a question. Behind the scenes, a web crawler fetches pages to build context over about 10 seconds, with each page arriving roughly 700 milliseconds apart. Without streaming, the user stares at a blank screen the entire time, because the model cannot start until every page has arrived.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at @rajveerbach’s blog.
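The crawl scenario in the excerpt can be put into rough numbers with a toy timing model. The sketch below is not Stream2LLM's scheduler and does not reproduce the reported 11x figure. It only compares, for a single request, when prefill would finish if the model waits for every page versus prefilling each page as it arrives. The page count (14 pages at 700 ms spacing, about 10 seconds in total) is read off the excerpt; the per-page prefill cost and the function names are assumptions chosen for illustration.

```python
# Toy single-request timing model. All times are integer milliseconds
# measured from the moment the user submits the question.

# 14 pages arriving 700 ms apart: 700, 1400, ..., 9800 (roughly 10 seconds).
ARRIVALS_MS = [700 * (i + 1) for i in range(14)]

# Hypothetical cost to prefill one page of context; not taken from the article.
PREFILL_MS_PER_PAGE = 100


def blocking_prefill_done(arrivals_ms, prefill_ms):
    """Wait for the last page, then prefill all pages back to back."""
    return max(arrivals_ms) + len(arrivals_ms) * prefill_ms


def streaming_prefill_done(arrivals_ms, prefill_ms):
    """Prefill each page as soon as it has arrived and the model is free."""
    done = 0
    for arrival in sorted(arrivals_ms):
        # Start when both the page is available and the previous prefill ended.
        done = max(done, arrival) + prefill_ms
    return done


if __name__ == "__main__":
    blocking = blocking_prefill_done(ARRIVALS_MS, PREFILL_MS_PER_PAGE)
    streaming = streaming_prefill_done(ARRIVALS_MS, PREFILL_MS_PER_PAGE)
    last_page = max(ARRIVALS_MS)
    print(f"blocking:  prefill done at {blocking} ms "
          f"({blocking - last_page} ms after the last page)")
    print(f"streaming: prefill done at {streaming} ms "
          f"({streaming - last_page} ms after the last page)")
```

In this model, streaming hides all but the final page's prefill behind the crawl whenever prefill is faster than the page interval. When prefill is slower than the interval, the work queues up, and streaming still finishes earlier because it starts earlier. The model ignores memory contention between concurrent requests, which is the problem the summary says Stream2LLM's scheduling policies address.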