Asynchronicity in Continuous Batching
Synchronous batching in LLM inference causes inefficiencies by making CPU and GPU work alternately, leading to significant idle time. Asynchronous batching allows CPU and GPU to operate in parallel, improving GPU utilization and reducing overall inference time. This approach requires careful coordination of hardware and data flow but can yield substantial performance gains without modifying models or kernels.
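The difference between the two schedules can be made concrete with a small timing model. This is a rough sketch, not the article's measurement method: it assumes every step has a fixed CPU preparation cost and a fixed GPU compute cost, and the function names, the step count of 8192, and the even per-step split are hypothetical. The 300.6 s total and the 24% CPU share are the figures quoted in this summary.

```python
def sync_time(n_steps, cpu_s, gpu_s):
    """CPU and GPU take turns, so every step pays for both phases."""
    return n_steps * (cpu_s + gpu_s)


def async_time(n_steps, cpu_s, gpu_s):
    """CPU prepares batch i+1 while the GPU computes batch i.

    Two-stage pipeline: the first preparation and the last compute are
    exposed; every step in between costs the slower of the two phases.
    """
    if n_steps == 0:
        return 0.0
    return cpu_s + (n_steps - 1) * max(cpu_s, gpu_s) + gpu_s


# Figures quoted in the summary: 300.6 s total, GPU idle about 24% of the time.
TOTAL_S = 300.6
CPU_SHARE = 0.24
N_STEPS = 8192  # hypothetical: one step per generated token

cpu_s = CPU_SHARE * TOTAL_S / N_STEPS        # assumed per-step CPU cost
gpu_s = (1 - CPU_SHARE) * TOTAL_S / N_STEPS  # assumed per-step GPU cost

print(f"synchronous:  {sync_time(N_STEPS, cpu_s, gpu_s):.1f} s")
print(f"asynchronous: {async_time(N_STEPS, cpu_s, gpu_s):.1f} s")  # about 228.5 s
```

Under these assumptions the GPU is the slower phase, so almost all of the CPU time is hidden behind GPU compute and only the very first batch preparation remains exposed.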
- Synchronous batching results in the CPU and GPU taking turns, leaving the GPU idle for nearly 24% of total runtime.
- Asynchronous batching enables parallel CPU batch preparation and GPU computation, reducing wasted compute time.
- The implementation relies on CUDA streams and events to manage concurrency and ensure data readiness between CPU and GPU tasks.
- Eliminating CPU-induced delays could reduce generation time by 24%, from 300.6 seconds to approximately 228 seconds for 8K token generation.
- No changes to model architecture or kernels are needed to implement asynchronous batching, only proper hardware coordination.
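The stream-and-event coordination mentioned in the list above can be sketched with PyTorch's CUDA API. This is a minimal illustration under stated assumptions, not the article's implementation: the helper names `cuda_ready`, `overlapped_step`, and `run_demo`, the two-stream layout, and the tensor shapes are all hypothetical. It assumes PyTorch is installed and a CUDA GPU is present, and does nothing otherwise.

```python
# Sketch only: requires PyTorch and a CUDA GPU; otherwise run_demo() returns None.
try:
    import torch
except ImportError:  # PyTorch is not installed
    torch = None


def cuda_ready():
    """True only when PyTorch is importable and a CUDA device is usable."""
    return torch is not None and torch.cuda.is_available()


def overlapped_step(host_batch, weight):
    """Copy a batch on one stream, compute on another, ordered by an event.

    host_batch: CPU tensor of shape (n, d).
    weight: CUDA tensor of shape (d, k).
    """
    copy_stream = torch.cuda.Stream()
    compute_stream = torch.cuda.Stream()
    batch_ready = torch.cuda.Event()

    # Make sure work already queued on the current stream (for example the
    # initialization of `weight`) finishes before the compute stream reads it.
    compute_stream.wait_stream(torch.cuda.current_stream())

    pinned = host_batch.pin_memory()  # page-locked memory allows async copies
    with torch.cuda.stream(copy_stream):
        dev_batch = pinned.to("cuda", non_blocking=True)
        batch_ready.record(copy_stream)  # marks "the copy is done"

    # The compute stream waits on the event, not on the CPU thread.
    compute_stream.wait_event(batch_ready)
    with torch.cuda.stream(compute_stream):
        out = dev_batch @ weight
    # Tell the caching allocator that dev_batch is also used on compute_stream.
    dev_batch.record_stream(compute_stream)

    compute_stream.synchronize()  # block the CPU until the result is ready
    return out


def run_demo():
    """Run one overlapped step if CUDA is available, else return None."""
    if not cuda_ready():
        return None
    host_batch = torch.randn(4, 8)
    weight = torch.randn(8, 2, device="cuda")
    return tuple(overlapped_step(host_batch, weight).shape)


if __name__ == "__main__":
    print(run_demo())
```

The event is what lets the CPU keep going: it enqueues the copy, records the event, enqueues the dependent compute, and is free to prepare the next batch until it actually needs the result.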
Unlocking asynchronicity in continuous batching
Published May 14, 2026
Rémi Ouazan Reboul (ror), Pedro Cuenca (pcuenq), Aritra Roy Gosthipaty (ariG23498)
Sections: Synchronous batching; Creating concurrency; What is a CUDA stream?; Default and non-default streams; Back to Continuous Batching; Enforcing synchronization; What is a CUDA event?; Using events in Continuous Batching; Filling the vacuum; Race conditions; Carry-over; The full async loop; Does it actually work?; Conclusion
TL;DR: we explain how to separate CPU and GPU workloads to get a massive performance boost for inference. This is the second post in a series on efficient LLM inference. The first post covered continuous batching from first principles.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is on Hugging Face.