Ada-MK: Adaptive MegaKernel Optimization via Automated DAG-based Search for LLM Inference
Ada-MK is a novel optimization framework for large language model (LLM) inference that reduces latency by eliminating kernel launch overhead through operator fusion into a single persistent kernel. It introduces a compile-time DAG-based search to determine the optimal execution path, removing runtime branching and improving efficiency on resource-constrained GPUs. The system has been successfully deployed in a commercial online advertising setting, demonstrating consistent performance gains over existing inference engines.
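The idea of resolving the execution path at compile time can be illustrated with a minimal sketch. This is not the paper's MLIR-based search. It assumes, for illustration only, that each candidate fusion choice is an edge in a DAG with a pre-measured cost, and it picks the cheapest path once, offline. The node names, the costs, and the `cheapest_path` helper are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def cheapest_path(edges, source, sink):
    """Return (cost, path) for the cheapest source -> sink path in a DAG.

    `edges` maps node -> {successor: cost}. Costs stand in for offline
    profiling results and are made up for this example.
    """
    # graphlib expects node -> predecessors, so invert the adjacency map.
    preds = {node: set() for node in edges}
    for node, succs in edges.items():
        for succ in succs:
            preds.setdefault(succ, set()).add(node)

    best = {source: (0.0, [source])}
    # Visiting in topological order means a node's best cost is final
    # by the time its outgoing edges are relaxed.
    for node in TopologicalSorter(preds).static_order():
        if node not in best:
            continue  # not reachable from the source
        cost, path = best[node]
        for succ, step_cost in edges.get(node, {}).items():
            candidate = cost + step_cost
            if succ not in best or candidate < best[succ][0]:
                best[succ] = (candidate, path + [succ])
    return best[sink]


# Hypothetical search space: two ways to run QKV, two ways to run the MLP.
EDGES = {
    "start": {"fused_qkv": 3.0, "split_qkv": 5.0},
    "fused_qkv": {"attn": 4.0},
    "split_qkv": {"attn": 1.0},
    "attn": {"mlp_fused": 6.0, "mlp_split": 7.5},
    "mlp_fused": {"end": 0.0},
    "mlp_split": {"end": 0.0},
    "end": {},
}

# Done once at build time; the runtime only replays SCHEDULE in order.
COST, SCHEDULE = cheapest_path(EDGES, "start", "end")
print(COST, SCHEDULE)
```

Because `SCHEDULE` is fixed before deployment, the serving loop in this sketch has no per-request decision to make, which is the property the summary describes as removing runtime branching.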
- Ada-MK reduces peak shared-memory usage by 50% using a three-dimensional shared-memory constraint model and K-dimension splitting.
- The framework uses an MLIR-based offline DAG search to eliminate runtime branching, which improves performance in latency-sensitive applications.
- Ada-MK integrates with TensorRT-LLM as a plugin, combining a high-throughput prefill phase with a low-latency decode phase.
- On an NVIDIA L20 GPU, Ada-MK achieves up to 23.6% higher throughput than TensorRT-LLM and 50.2% higher than vLLM.
- It marks the first industrial deployment of MegaKernel technology in a commercial online advertising system.
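The link between K-dimension splitting and shared-memory footprint can be shown with simple arithmetic. The sketch below is a toy model, not the paper's three-dimensional constraint model. It assumes a tiled GEMM that stages one A tile and one B tile in shared memory, with hypothetical tile sizes and fp16 operands. The helper names `staging_bytes` and `matmul_split_k` are invented for this example.

```python
def staging_bytes(block_m, block_n, block_k, dtype_bytes=2, stages=1):
    """Bytes needed to stage an A tile (block_m x block_k) and a
    B tile (block_k x block_n) for each pipeline stage."""
    return stages * (block_m + block_n) * block_k * dtype_bytes


def matmul_split_k(a, b, k_chunk):
    """Plain-Python matmul that consumes K in chunks of `k_chunk`.

    Only one K chunk of A and B is needed at a time, and the partial
    products are accumulated into the same output tile.
    """
    m, k, n = len(a), len(b), len(b[0])
    out = [[0] * n for _ in range(m)]
    for k0 in range(0, k, k_chunk):
        k1 = min(k0 + k_chunk, k)
        for i in range(m):
            for j in range(n):
                out[i][j] += sum(a[i][kk] * b[kk][j] for kk in range(k0, k1))
    return out


# Assumed tile shape: 64 x 128 output tile, K extent 128, fp16 (2 bytes).
FULL_K = staging_bytes(64, 128, 128)   # whole K extent resident at once
HALF_K = staging_bytes(64, 128, 64)    # K split into two chunks
print(FULL_K, HALF_K)                  # 49152 24576

# Splitting K changes the footprint, not the result.
A = [[1, 2, 3, 4], [5, 6, 7, 8]]
B = [[1, 0], [0, 1], [1, 0], [0, 1]]
print(matmul_split_k(A, B, 2) == matmul_split_k(A, B, 4))  # True
```

In this toy model the staging buffer scales linearly with the K extent, so halving the chunk halves the buffer. That is arithmetic on assumed tile sizes and does not reproduce the 50% figure reported for Ada-MK.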
Opening excerpt (first ~120 words)
arXiv:2605.11581 (cs), Computer Science, Computation and Language. Submitted on 12 May 2026.
Title: Ada-MK: Adaptive MegaKernel Optimization via Automated DAG-based Search for LLM Inference
Authors: Wenxin Dong, Mingqing Hu, Guanghui Yu, Qiang Fu, Peng Xu, Hui Xu, Yue Xing, Xuewu Jiao, Shuanglong Li, Lin Liu
Abstract: When large language models (LLMs) serve real-time inference in commercial online advertising systems, end-to-end latency must be strictly bounded to the millisecond range.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv.org.