Elusive order of async GPU kernels: scheduling, abstractions, DSL implications
The article examines why scheduling asynchronous GPU kernels is hard and surveys the approaches used to manage it. It distinguishes static, temporal, and spatial scheduling, and ties the choice among them to what the hardware can do. It also explores the difficulty of designing domain-specific languages (DSLs) for kernel writing that match how the hardware actually behaves.
- There are three main approaches to scheduling GPU kernels: static, temporal, and spatial.
- Different hardware architectures influence the choice of scheduling method, with Nvidia GPUs introducing warp specialization.
- Libraries like CUTLASS and ThunderKittens help streamline kernel writing by packaging common patterns and managing synchronization.
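
As a rough CPU-side analogy for the spatial approach (the idea behind warp specialization mentioned above), the sketch below splits a fixed set of workers into roles: one producer that only loads tiles and several consumers that only compute. It is not GPU code and is not taken from the article. Python threads stand in for warps, a bounded `queue.Queue` stands in for a shared-memory staging buffer, and `run_specialized`, `num_consumers`, and `depth` are hypothetical names chosen for illustration.

```python
import queue
import threading


def run_specialized(tiles, num_consumers=3, depth=2):
    """Sum of squares over all tiles, using role-specialized workers.

    Assumption: this is a toy model. Threads play the part of warps and
    the bounded queue plays the part of a small staging buffer.
    """
    ring = queue.Queue(maxsize=depth)   # small buffer between the two roles
    partial = [0] * num_consumers       # one accumulator slot per consumer

    def producer():
        # Role 1: only moves data. put() blocks while the buffer is full.
        for tile in tiles:
            ring.put(tile)
        # One sentinel per consumer tells each of them to stop.
        for _ in range(num_consumers):
            ring.put(None)

    def consumer(idx):
        # Role 2: only computes. get() blocks while the buffer is empty.
        while True:
            tile = ring.get()
            if tile is None:
                break
            partial[idx] += sum(x * x for x in tile)

    workers = [threading.Thread(target=producer)]
    workers += [
        threading.Thread(target=consumer, args=(i,))
        for i in range(num_consumers)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return sum(partial)


if __name__ == "__main__":
    print(run_specialized([[1, 2], [3, 4], [5]]))
```

The order in which tiles reach each consumer is not fixed, but the total is, because every tile is consumed exactly once. The blocking `put` and `get` calls are the synchronization here, which is the kind of bookkeeping the summary says libraries such as CUTLASS and ThunderKittens package up for real kernels.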
Opening excerpt (first ~120 words)
The elusive order of things
Written by Ian in ML Infrastructure, posts

SIMT offered a fantastic bargain. You write a straight-line program, the machine runs a lot of copies of it, and when one waits for memory the hardware swaps in others. You look with disdain on the less enlightened thread programmers dealing with deadlocks and concurrency etc. etc.

Choosing what to run where and when is a scheduling problem, and there have been three effective approaches to that so far. You can schedule statically: decide ahead of time what all the units should do each tick. You can schedule temporally: swapping in different phases of workers via a pipeline. Or you can schedule spatially: divide the resources of the machine into different roles.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at Ian’s Blog.