A Case for Tracing Based DSL Kernel Languages
The article contrasts two ways of embedding a GPU kernel DSL in Python for NVIDIA hardware: parsing the kernel's source AST, as Triton and CuTe-DSL do, and tracing the function under abstract values, as Pallas does. It surveys the wave of Pythonic DSLs that followed Triton, looks at what tends to go wrong with each approach, and makes the case for tracing.
- NVIDIA's GPU programming has evolved from using only CUDA to incorporating several Pythonic DSLs like Triton and CuTe-DSL.
- Most of these DSLs aim to lower tile-oriented programs into PTX or LLVM-IR, with varying methods of embedding into Python.
- The article advocates for a tracing-based approach, suggesting it can be more advantageous than parsing in certain scenarios.
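The parsing/tracing distinction summarized above can be illustrated with a toy sketch in plain Python. This is not code from the article or from any of the DSLs it names: `KERNEL_SRC`, `parse_frontend`, `AbstractValue`, and `trace_frontend` are hypothetical names invented for illustration, and the "IR" is just a list of tuples. The parsing frontend reads the syntax tree without running anything; the tracing frontend runs the function on placeholder values that record each operation.

```python
import ast

# Hypothetical kernel body, kept as source text so both frontends see the same code.
KERNEL_SRC = """
def scale(x, y):
    acc = x
    for _ in range(3):
        acc = acc * y
    return acc
"""


def parse_frontend(src):
    """Parsing: inspect the syntax tree without ever running the function."""
    tree = ast.parse(src)
    binops = [type(n.op).__name__ for n in ast.walk(tree) if isinstance(n, ast.BinOp)]
    loops = sum(isinstance(n, ast.For) for n in ast.walk(tree))
    return {"binops": binops, "loops": loops}


class AbstractValue:
    """Toy abstract value: records operations into a shared trace instead of computing."""

    def __init__(self, name, trace):
        self.name = name
        self.trace = trace

    def __mul__(self, other):
        # Name the result after its position in the trace, then record the op.
        out = AbstractValue(f"t{len(self.trace)}", self.trace)
        self.trace.append((out.name, "mul", self.name, other.name))
        return out


def trace_frontend(src, fn_name, *arg_names):
    """Tracing: run the function on abstract values and collect what it did."""
    namespace = {}
    exec(src, namespace)  # define the kernel as an ordinary Python function
    trace = []
    args = [AbstractValue(name, trace) for name in arg_names]
    namespace[fn_name](*args)
    return trace


parsed = parse_frontend(KERNEL_SRC)
traced = trace_frontend(KERNEL_SRC, "scale", "x", "y")
print(parsed)
print(traced)
```

In this sketch the parser reports one multiply and one `for` loop, because it sees the program as written. The tracer reports three multiplies and no loop, because the Python loop has already executed by the time the operations are recorded. Real frontends are far more involved, but this is the basic shape of the divide.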
Opening excerpt (first ~120 words)
On the architectural divide between parsing and tracing kernel DSLs, and what tends to go wrong in each. The language for writing NVIDIA GPU kernels was always exclusively CUDA, but since Triton appeared, a wave of Pythonic DSLs has followed: CuTe-DSL, cuTile, Pallas, Gluon, Warp, and the more recent TileLang used in DeepSeek’s DeepGEMM. Most of these systems share the same goal of lowering a tile-oriented program into PTX or LLVM-IR, and are embedded in Python. The question is how to embed the DSL into Python. Triton and CuTe-DSL parse the source AST. Pallas runs the function under abstract values and traces the resulting operations.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at George's Blog.