CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs

May 22, 2026 · 4:54 AM UTC · 3 min read

#machine learning #transformers #gpu #optimization

TL;DR · WeSearch summary

The paper introduces CODA, a GPU kernel abstraction that rewrites Transformer block computations as GEMM-plus-epilogue programs. The goal is to cut the share of training time spent in the memory-bound operators that surround dense linear algebra, by running that work while GEMM output tiles are still on chip. The authors report that both human- and LLM-authored CODA kernels perform well across Transformer workloads.

Key facts

▪ CODA targets the inefficiency caused by memory-bound operators in Transformer training systems.
▪ The abstraction executes these computations while the GEMM output tiles are still on chip.
▪ Both human- and LLM-authored CODA kernels demonstrate high performance across various Transformer workloads.
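
To make the GEMM-plus-epilogue idea concrete, here is a minimal CPU sketch in plain Python. It is not CODA's API and does not come from the paper: the function names (`gemm_with_epilogue`, `bias_gelu`, `unfused_linear_gelu`), the tile sizes, and the choice of bias-add plus GELU as the epilogue are all assumptions made for illustration. A local list stands in for the on-chip output tile. The epilogue is applied to it before the single write-back, whereas the baseline makes three separate passes over a full output matrix.

```python
import math


def gelu(x):
    """tanh approximation of GELU for a single float."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def unfused_linear_gelu(a, b, bias):
    """Baseline: GEMM, bias-add, and GELU as three passes over a full M x N matrix."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)] for i in range(m)]
    c = [[c[i][j] + bias[j] for j in range(n)] for i in range(m)]
    return [[gelu(v) for v in row] for row in c]


def gemm_with_epilogue(a, b, epilogue, tile_m=2, tile_n=2):
    """Tiled GEMM that applies `epilogue(value, row, col)` to each output tile
    before writing it back (hypothetical helper, not a CODA function)."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile_m):
        for j0 in range(0, n, tile_n):
            rows = range(i0, min(i0 + tile_m, m))
            cols = range(j0, min(j0 + tile_n, n))
            # Accumulator tile: stands in for the GEMM output tile held on chip.
            acc = [[sum(a[i][p] * b[p][j] for p in range(k)) for j in cols] for i in rows]
            # The epilogue runs on the tile; each output element is written once.
            for ti, i in enumerate(rows):
                for tj, j in enumerate(cols):
                    out[i][j] = epilogue(acc[ti][tj], i, j)
    return out


def bias_gelu(bias):
    """Build an epilogue that adds a per-column bias and then applies GELU."""
    return lambda value, row, col: gelu(value + bias[col])


# Usage: the fused and unfused paths compute the same function.
a = [[1.0, 2.0, -1.0], [0.5, 0.0, 3.0], [-2.0, 1.0, 1.0]]
b = [[0.2, -0.4, 1.0, 0.0], [1.5, 0.3, -0.7, 2.0], [0.0, 1.0, 0.5, -1.0]]
bias = [0.1, -0.2, 0.0, 0.3]
fused = gemm_with_epilogue(a, b, bias_gelu(bias))
reference = unfused_linear_gelu(a, b, bias)
```

On a GPU the benefit would come from skipping the intermediate round trips to off-chip memory. This sketch only shows the restructuring of the computation, not any speedup.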

Original article

arXiv.org

Opening excerpt (first ~120 words)

arXiv:2605.19269 (cs) · Computer Science > Machine Learning
Submitted on 19 May 2026 (v1), last revised 20 May 2026 (this version, v2)

Title: CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs
Authors: Han Guo, Jack Zhang, Arjun Menon, Driss Guessous, Vijay Thakkar, Yoon Kim, Tri Dao

Abstract: Transformer training systems are built around dense linear algebra, yet a nontrivial fraction of end-to-end time is spent on surrounding memory-bound operators.

…

Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv.org.
