WeSearch

Finding deadlocks in CuTe kernels with SPIN

#gpu #debugging #model-checking #synchronization #programming
⚡ TL;DR · AI summary

The article describes using the SPIN model checker to find deadlocks, or prove their absence, in CuTe DSL kernels on NVIDIA B200. It highlights how hard synchronization bugs in GPU kernels are to debug, particularly when a deadlocked barrier provides no useful error information. The author proposes encoding the kernel's synchronization model in Promela, SPIN's modeling language, to make debugging faster and more reliable.
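To make the idea in the summary concrete, here is a minimal sketch of explicit-state deadlock search, written in Python rather than Promela. It is not the author's cute2promela lowering and not SPIN's algorithm. It assumes a toy model in which each barrier is a plain counter with hypothetical `signal` and `wait` operations, which is much simpler than real GPU barrier semantics. The names `find_deadlock`, `PRODUCER`, and `CONSUMER` are invented for illustration.

```python
from collections import deque

# Toy programs: each step is (operation, barrier_name).
# "signal" increments a barrier counter; "wait" blocks until the counter
# is positive, then decrements it. This is an assumed simplification.
PRODUCER = [("wait", "empty"), ("signal", "full")] * 2
CONSUMER = [("wait", "full"), ("signal", "empty")] * 2


def find_deadlock(programs, initial_counts):
    """Breadth-first search over all interleavings of the given programs.

    Returns (program_counters, barrier_counts) for the first state in which
    some program is unfinished and no program can take a step, or None if
    no such state is reachable.
    """
    start = (tuple(0 for _ in programs), tuple(sorted(initial_counts.items())))
    seen = {start}
    queue = deque([start])
    while queue:
        pcs, count_items = queue.popleft()
        counts = dict(count_items)
        successors = []
        for i, prog in enumerate(programs):
            if pcs[i] == len(prog):
                continue  # this program has finished
            op, bar = prog[pcs[i]]
            new_counts = dict(counts)
            if op == "signal":
                new_counts[bar] += 1
            elif op == "wait":
                if counts[bar] == 0:
                    continue  # blocked: cannot step from this state
                new_counts[bar] -= 1
            else:
                raise ValueError(f"unknown operation: {op}")
            new_pcs = pcs[:i] + (pcs[i] + 1,) + pcs[i + 1:]
            successors.append((new_pcs, tuple(sorted(new_counts.items()))))
        all_done = all(pcs[i] == len(p) for i, p in enumerate(programs))
        if not successors and not all_done:
            return pcs, counts  # deadlock: unfinished work, nobody can move
        for state in successors:
            if state not in seen:
                seen.add(state)
                queue.append(state)
    return None
```

In this toy model, starting with `{"empty": 0, "full": 0}` deadlocks immediately because each side waits for the other to signal first, while pre-arming the buffer with `{"empty": 1, "full": 0}` lets every interleaving run to completion. A model checker such as SPIN explores the same kind of state space exhaustively, which is what allows it to prove absence of deadlock in the model rather than merely fail to observe one.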

Original article
George's Blog
Opening excerpt (first ~120 words)

Using the SPIN model checker to statically find, or prove the absence of, deadlocks in CuTe DSL kernels on NVIDIA B200, and presenting a proof-of-concept lowering from CuTe to SPIN: github.com/cheshire/cute2promela. Synchronization bugs in GPU kernels are hard to debug. When a barrier deadlocks, the hardware yields no stack trace and no error code; nothing is reported until the benchmark times out. Each iteration of the debug loop can therefore cost tens of minutes. While working on the FlashInfer MLSYS Challenge (our solution took 1st place in the mixture-of-experts track), we had to iterate on a persistent fused mixture-of-experts kernel for DeepSeek-V3, written in CUTLASS’s CuTe DSL for an NVIDIA B200 and stitched together from FF1, SwiGLU, and FF2 stages across clusters of CTAs.

Excerpt limited to ~120 words for fair-use compliance. The full article is at George's Blog.
