Ada-MK: Adaptive MegaKernel Optimization via DAG-Based Search for LLM Inference

May 16, 2026 · 4:54 PM UTC · 3 min read

#machine learning #gpu optimization #llm inference #compiler design #high performance computing #Ada-MK #Wenxin Dong #Mingqing Hu #Guanghui Yu #Qiang Fu #Peng Xu #Hui Xu #Yue Xing

TL;DR · WeSearch summary

Ada-MK is an optimization framework for large language model (LLM) inference that cuts latency by fusing operators into a single persistent kernel, which eliminates kernel launch overhead. A compile-time DAG-based search selects the optimal execution path ahead of time, removing runtime branching and improving efficiency on resource-constrained GPUs. The system is deployed in a commercial online advertising setting, where it shows consistent performance gains over existing inference engines.
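To illustrate the idea of settling the execution path before runtime: the summary does not describe the paper's MLIR-based search in detail, so the following is only a minimal sketch of a generic offline search over a DAG of candidate kernel variants. The node names, edge costs, and the `cheapest_path` helper are all hypothetical and are not taken from the paper.

```python
import math
from functools import lru_cache

# Hypothetical DAG of kernel variants: node -> [(next_node, cost_in_microseconds)].
# Names and costs are illustrative assumptions, not measurements from the paper.
GRAPH = {
    "start": [("attn_fused", 5.0), ("attn_split", 3.0)],
    "attn_fused": [("mlp_tile64", 4.0)],
    "attn_split": [("mlp_tile64", 7.0), ("mlp_tile128", 4.0)],
    "mlp_tile64": [("end", 1.0)],
    "mlp_tile128": [("end", 2.0)],
    "end": [],
}


def cheapest_path(graph, source, sink):
    """Return (total_cost, path) for the lowest-cost route from source to sink."""

    @lru_cache(maxsize=None)
    def best(node):
        if node == sink:
            return 0.0, (sink,)
        candidates = []
        for nxt, cost in graph[node]:
            sub_cost, sub_path = best(nxt)
            candidates.append((cost + sub_cost, (node,) + sub_path))
        if not candidates:
            # Dead end: this node cannot reach the sink.
            return math.inf, (node,)
        return min(candidates)

    return best(source)


cost, path = cheapest_path(GRAPH, "start", "end")
print(cost, path)
```

Because the search runs once offline, the chosen path can be baked into the generated kernel as a fixed sequence, so nothing has to be decided per request at inference time.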

Key facts

▪ Ada-MK reduces peak shared-memory usage by 50% using a three-dimensional shared-memory constraint model and K-dimension splitting.
▪ The framework uses an MLIR-based offline DAG search to eliminate runtime branching, which improves performance in latency-sensitive applications.
▪ Ada-MK integrates with TensorRT-LLM as a plugin, combining a high-throughput prefill phase with a low-latency decode phase.
▪ On an NVIDIA L20 GPU, Ada-MK achieves up to 23.6% higher throughput than TensorRT-LLM and up to 50.2% higher than vLLM.
▪ Ada-MK is described as the first industrial deployment of MegaKernel technology in a commercial online advertising system.
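The shared-memory figure in the first bullet can be made concrete with generic GEMM tiling arithmetic. This is a sketch under stated assumptions, not the paper's constraint model: it assumes a tiled matrix multiply that stages one A tile (block_m × block_k) and one B tile (block_k × block_n) in shared memory, with hypothetical tile sizes and 2-byte elements. Under those assumptions, halving the K extent of the tile halves the staged footprint.

```python
def tile_smem_bytes(block_m, block_n, block_k, bytes_per_elem=2, stages=1):
    """Shared memory needed to stage one A tile and one B tile of a tiled GEMM.

    All parameters are illustrative assumptions; the paper's actual
    three-dimensional constraint model is not described in this summary.
    """
    a_tile = block_m * block_k  # elements of the A tile
    b_tile = block_k * block_n  # elements of the B tile
    return (a_tile + b_tile) * bytes_per_elem * stages


full = tile_smem_bytes(128, 128, 64)  # unsplit K tile
half = tile_smem_bytes(128, 128, 32)  # K extent split in two
reduction = 1 - half / full

print(full, half, reduction)
```

Both tiles scale linearly with block_k, which is why splitting along K shrinks the footprint without changing the output tile shape (block_m × block_n).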

Original article

arXiv.org


Opening excerpt (first ~120 words)

Computer Science > Computation and Language

arXiv:2605.11581 (cs) [Submitted on 12 May 2026]

Title: Ada-MK: Adaptive MegaKernel Optimization via Automated DAG-based Search for LLM Inference

Authors: Wenxin Dong, Mingqing Hu, Guanghui Yu, Qiang Fu, Peng Xu, Hui Xu, Yue Xing, Xuewu Jiao, Shuanglong Li, Lin Liu

Abstract: When large language models (LLMs) serve real-time inference in commercial online advertising systems, end-to-end latency must be strictly bounded to the millisecond range.

…

Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv.org.
