Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

Apr 29, 2026 · 12:07 AM UTC · 3 min read

#machine learning #artificial intelligence #hardware architecture

⚡ TL;DR · AI summary

The paper evaluates NVIDIA's CUDA Tile (CuTile) for AI workloads on Hopper and Blackwell GPUs. It compares CuTile's performance against established baselines such as cuBLAS and Triton across a range of AI tasks. The results indicate that CuTile's effectiveness varies by workload and architecture, with notable performance gaps between GPU models.

Key facts

▪ CuTile is a Python-based, tile-centric abstraction that aims to simplify GPU kernel development while retaining Tensor Core and Tensor Memory Accelerator (TMA) efficiency on modern GPUs.
▪ On the Blackwell B200, CuTile achieves up to 1007 TFLOP/s for fused attention, outperforming FlashAttention-2 by 2.5x.
▪ CuTile reaches 52-79% of cuBLAS performance for GEMM, making it a practical alternative to hand-written CUDA kernels.
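
The ratios in the key facts can be turned into absolute terms with a little arithmetic. The sketch below is illustrative only: the three input figures come from the summary above, while the derived values are implied rather than reported. It also assumes the 2.5x speedup applies to the same configuration as the 1007 TFLOP/s peak, which the summary does not confirm. The helper names are hypothetical.

```python
# Figures quoted in the summary; everything derived from them is implied, not reported.
CUTILE_ATTENTION_TFLOPS = 1007.0        # peak fused-attention throughput on B200
SPEEDUP_OVER_FA2 = 2.5                  # quoted speedup over FlashAttention-2
GEMM_FRACTION_OF_CUBLAS = (0.52, 0.79)  # CuTile GEMM as a fraction of cuBLAS


def implied_baseline(throughput: float, speedup: float) -> float:
    """Baseline throughput implied by a measured throughput and a speedup ratio."""
    return throughput / speedup


def slowdown_factor(fraction: float) -> float:
    """How many times slower a kernel is when it reaches `fraction` of a reference."""
    return 1.0 / fraction


# Assumption: the 2.5x ratio and the 1007 TFLOP/s peak describe the same run.
fa2_tflops = implied_baseline(CUTILE_ATTENTION_TFLOPS, SPEEDUP_OVER_FA2)
gemm_slowdowns = tuple(slowdown_factor(f) for f in GEMM_FRACTION_OF_CUBLAS)

print(f"Implied FlashAttention-2 throughput: {fa2_tflops:.0f} TFLOP/s")
print("CuTile GEMM slowdown vs cuBLAS: "
      f"{gemm_slowdowns[1]:.2f}x to {gemm_slowdowns[0]:.2f}x")
```

Under that assumption, the FlashAttention-2 baseline works out to roughly 403 TFLOP/s, and reaching 52-79% of cuBLAS corresponds to CuTile GEMM kernels running about 1.27x to 1.92x slower than cuBLAS.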

Original article

arXiv.org

arXiv:2604.23466 · Computer Science > Machine Learning · Submitted on 25 Apr 2026

Title: Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

Authors: Divakar Kumar Yadav, Tian Zhao, Deepak Kumar

Abstract: NVIDIA's CUDA Tile (CuTile) introduces a Python-based, tile-centric abstraction for GPU kernel development that aims to simplify programming while retaining Tensor Core and Tensor Memory Accelerator (TMA) efficiency on modern GPUs.

…

Excerpt limited to ~120 words for fair-use compliance. The full article is at arXiv.org.
