CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs
The paper introduces CODA, a GPU kernel abstraction for Transformer block computations. CODA expresses these computations as GEMM-plus-epilogue programs, with the aim of reducing the time training systems spend in memory-bound operators. The reported results suggest gains in both productivity and efficiency for machine learning frameworks.
- CODA addresses the inefficiencies caused by memory-bound operators in Transformer training systems.
- The abstraction allows computations to execute while GEMM output tiles stay on chip.
- Both human- and LLM-authored CODA kernels demonstrate high performance across various Transformer workloads.
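The second point can be illustrated with a small sketch. This is not CODA's API or the authors' implementation; it is a pure-Python stand-in that assumes a bias-plus-ReLU epilogue, and the names `gemm_unfused`, `gemm_with_epilogue`, and `tile` are hypothetical. It shows the general GEMM-plus-epilogue idea: the elementwise work is applied to each output tile while that tile is still in a local buffer, so the result is written once instead of being written, re-read, and rewritten.

```python
def gemm_unfused(a, b, bias):
    """Reference path: materialize the full GEMM output, then run a separate pass."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
         for i in range(m)]
    # Second pass re-reads every element of c; on a GPU this is the memory-bound step.
    return [[max(c[i][j] + bias[j], 0.0) for j in range(n)] for i in range(m)]


def gemm_with_epilogue(a, b, bias, tile=2):
    """Tiled GEMM that applies the epilogue to each tile before writing it out."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            rows = range(i0, min(i0 + tile, m))
            cols = range(j0, min(j0 + tile, n))
            # Accumulate one output tile in a local buffer
            # (a stand-in for registers or shared memory on a GPU).
            acc = [[sum(a[i][p] * b[p][j] for p in range(k)) for j in cols]
                   for i in rows]
            # Epilogue (assumed here: bias add + ReLU) runs on the tile
            # before its single write to the output.
            for ti, i in enumerate(rows):
                for tj, j in enumerate(cols):
                    out[i][j] = max(acc[ti][tj] + bias[j], 0.0)
    return out
```

Both functions compute the same values; the difference is only where the epilogue runs relative to the output write, which is the property the bullet describes.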
Opening excerpt (first ~120 words)
arXiv:2605.19269 (cs), Computer Science > Machine Learning. Submitted on 19 May 2026 (v1), last revised 20 May 2026 (v2).
Title: CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs
Authors: Han Guo, Jack Zhang, Arjun Menon, Driss Guessous, Vijay Thakkar, Yoon Kim, Tri Dao
Abstract: Transformer training systems are built around dense linear algebra, yet a nontrivial fraction of end-to-end time is spent on surrounding memory-bound operators.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv.org.