Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs
The paper evaluates NVIDIA's CUDA Tile (CuTile) for AI workloads on Hopper and Blackwell GPUs. It compares CuTile's performance against established baselines such as cuBLAS and Triton across a range of AI tasks. Results indicate that CuTile's effectiveness varies by workload and architecture, with notable performance gaps between GPU models.
- CuTile simplifies GPU kernel development while maintaining efficiency on modern GPUs.
- On the Blackwell B200, CuTile achieves up to 1007 TFLOP/s for fused attention, outperforming FlashAttention-2 by 2.5x.
- CuTile reaches 52-79% of cuBLAS performance for GEMM, making it a practical alternative to hand-written CUDA kernels.
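
To put the headline numbers in context, the arithmetic behind them can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the derived FlashAttention-2 figure assumes the 2.5x speedup was measured against the same configuration as the 1007 TFLOP/s peak, which the summary does not confirm.

```python
def implied_baseline(throughput_tflops: float, speedup: float) -> float:
    """Baseline throughput implied by a reported throughput and speedup.

    Assumption: both numbers refer to the same problem configuration.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return throughput_tflops / speedup


def fraction_of_reference(candidate_tflops: float, reference_tflops: float) -> float:
    """Candidate throughput expressed as a fraction of a reference library."""
    if reference_tflops <= 0:
        raise ValueError("reference throughput must be positive")
    return candidate_tflops / reference_tflops


# Figures quoted in the summary: 1007 TFLOP/s at a 2.5x speedup.
fa2_estimate = implied_baseline(1007.0, 2.5)
print(f"Implied FlashAttention-2 throughput: {fa2_estimate:.1f} TFLOP/s")
```

Under that assumption the FlashAttention-2 baseline would be roughly 403 TFLOP/s. The same ratio helper expresses the GEMM result: a value between 0.52 and 0.79 corresponds to the reported 52-79% of cuBLAS.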
Opening excerpt (first ~120 words)
arXiv:2604.23466 (cs), Computer Science > Machine Learning. Submitted on 25 Apr 2026.
Title: Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs
Authors: Divakar Kumar Yadav, Tian Zhao, Deepak Kumar
Abstract: NVIDIA's CUDA Tile (CuTile) introduces a Python-based, tile-centric abstraction for GPU kernel development that aims to simplify programming while retaining Tensor Core and Tensor Memory Accelerator (TMA) efficiency on modern GPUs.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv.org.