Show HN: Utilyze – an open source GPU monitoring tool more accurate than nvtop

The standard GPU utilization metric reported by nvidia-smi, nvtop, Weights & Biases, Amazon CloudWatch, Google Cloud Monitoring, and Azure Monitor is highly misleading. It reports the fraction of time that any kernel is running on the GPU, which means a GPU can report 100% utilization even if only a small portion of its compute capacity is actually being used. In practice, we've seen workloads with ~1–10% real compute throughput while dashboards show 100%. This becomes a problem when teams rely…
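
For context on what that number actually measures, here is a minimal sketch of reading the same metric these tools report, via NVIDIA's NVML Python bindings (pynvml). NVML documents the `gpu` field as the percentage of the sample window during which one or more kernels were executing, which is exactly the coarse definition at issue.

```python
# A minimal read of the standard metric via NVML (pip install nvidia-ml-py).
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU on the machine

# NVML defines util.gpu as the percent of the sampling window during which
# one or more kernels were executing on the device. A kernel occupying a
# single SM keeps the whole window "busy", so this can read 100% while the
# chip does a few percent of its possible work.
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
print(f"reported GPU 'utilization': {util.gpu}%")
print(f"memory controller activity: {util.memory}%")

pynvml.nvmlShutdown()
```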

Full article excerpt:

Before describing how Utilyze works, let's unpack why accurate GPU utilization is a technically difficult measurement problem. GPUs have two fundamentally different types of compute resources: CUDA cores for general floating-point math, and Tensor Cores for matrix multiplication. They also have multiple levels of memory: off-chip HBM (high-bandwidth memory), the L2 cache, shared memory inside each SM, and registers local to each thread. Each of these resources can be a bottleneck independently: a workload can drive its Tensor Cores at full capacity while memory bandwidth sits nearly idle, or vice versa. A single percentage cannot represent this two-dimensional reality.

Put differently, every AI operation on a GPU is constrained by two physical limits: how fast the math units can execute arithmetic (compute throughput), and how fast data can move between memory and the math units (memory bandwidth). Every kernel hits one of these limits first, and that limit determines its maximum possible performance.

This brings us to the framework that actually captures GPU utilization accurately: the Speed-of-Light (SOL) model. SOL measures how close a kernel gets to the GPU's theoretical hardware ceiling, reporting two key numbers: Compute SOL % (achieved FLOPs ÷ peak FLOPs) and Memory SOL % (achieved bandwidth ÷ peak bandwidth). The model derives from the roofline model, in which every kernel is bounded by either compute or memory; the higher of the two SOL percentages identifies the binding constraint.

Utilyze reports exactly these two headline numbers, Compute SOL % and Memory SOL %, both live. The numerators come from direct measurement of each compute engine (e.g., Tensor Cores and the FP32/FP64/INT32 pipelines) and each memory subsystem (e.g., HBM bandwidth, L2, L1); NVIDIA exposes each as a percentage of that hardware unit's theoretical maximum. The denominator is the SOL itself: the hardware peak. Together, these give you an accurate, live picture of GPU utilization that no other tool provides. If the compute number dominates, your workload is compute-bound; if the memory number dominates, you're memory-bound, and optimizations should target data movement first.

But it doesn't end there. Here's something important that raw SOL % doesn't tell you on its own: 100% is not a realistic target. The theoretical hardware peak (on an H100, 2,000 TFLOPS of compute and 3.4 TB/s of memory bandwidth) is a physical limit that no real AI workload can reach. Kernel launches have overhead. Data moves between levels of the memory hierarchy. Thread synchronization takes cycles. In multi-GPU setups, communication between GPUs consumes time that could otherwise be spent on computation. For Mixture-of-Experts models, routing tokens to different experts creates irregular memory access patterns that reduce effective throughput. None of these are signs of poor optimization; they're structural properties of real deployments.

Every deployment therefore has a natural ceiling below 100% that reflects its specific combination of model architecture, hardware, parallelism strategy, and batch size. We call this ceiling the Attainable Compute SOL %, hereafter Attainable SOL %. The gap between your current SOL % and the Attainable SOL % is your optimization budget. The gap between the Attainable SOL % and 100% is the physics of your deployment; you can't close it by tuning. For instance, if you're running a…
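
To ground the excerpt's claim that every kernel hits one of the two limits first, here is a minimal roofline sketch using the H100 peak figures quoted in the excerpt. This is illustrative only, not Utilyze's code; the 0.25 FLOPs/byte kernel is a hypothetical elementwise op, not a measured number.

```python
# Roofline sketch with the excerpt's H100 peaks (illustrative only).
PEAK_FLOPS = 2000e12        # ~2,000 TFLOPS theoretical compute ceiling
PEAK_BYTES_PER_S = 3.4e12   # ~3.4 TB/s theoretical HBM bandwidth ceiling

# The "ridge point": below this arithmetic intensity (FLOPs per byte moved),
# memory bandwidth is the binding limit; above it, compute throughput is.
ridge = PEAK_FLOPS / PEAK_BYTES_PER_S
print(f"ridge point: {ridge:.0f} FLOPs/byte")   # ~588 FLOPs/byte

def roofline_ceiling(intensity_flops_per_byte: float) -> float:
    """Best-case FLOP/s for a kernel with the given arithmetic intensity."""
    return min(PEAK_FLOPS, intensity_flops_per_byte * PEAK_BYTES_PER_S)

# A hypothetical elementwise kernel (~0.25 FLOPs/byte) can never get near peak:
print(f"{roofline_ceiling(0.25) / 1e12:.2f} TFLOPS ceiling")  # 0.85 TFLOPS
```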
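
The SOL definitions themselves reduce to two ratios. Below is a small illustrative sketch, not Utilyze's actual implementation; the achieved figures are hypothetical measurements of the kind per-engine throughput counters would report.

```python
# Illustrative SOL arithmetic (hypothetical measurements, H100 peaks as above).
PEAK_TFLOPS = 2000.0
PEAK_BW_TBS = 3.4

def sol_percentages(achieved_tflops: float, achieved_bw_tbs: float):
    """Speed-of-Light ratios: achieved throughput over the hardware peak."""
    compute_sol = 100.0 * achieved_tflops / PEAK_TFLOPS
    memory_sol = 100.0 * achieved_bw_tbs / PEAK_BW_TBS
    # The higher ratio names the resource the kernel is pressed against.
    bound = "compute-bound" if compute_sol >= memory_sol else "memory-bound"
    return compute_sol, memory_sol, bound

# A hypothetical kernel that moves a lot of data but does little math
# (the kind that reads 100% in nvidia-smi while barely computing):
c, m, bound = sol_percentages(achieved_tflops=120.0, achieved_bw_tbs=2.9)
print(f"Compute SOL: {c:.0f}%  Memory SOL: {m:.0f}%  -> {bound}")
# Compute SOL: 6%  Memory SOL: 85%  -> memory-bound
```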
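
Finally, the Attainable SOL % framing is simple gap arithmetic once the ceiling is known. A hedged sketch that treats the attainable ceiling as a given input, since this excerpt doesn't cover how Utilyze estimates it:

```python
def optimization_budget(current_sol_pct: float, attainable_sol_pct: float):
    """Split the distance to 100% into a tunable budget and deployment physics.

    attainable_sol_pct is the deployment-specific ceiling the article
    describes; how Utilyze derives it is not covered in this excerpt,
    so it is treated as an input here.
    """
    budget = max(0.0, attainable_sol_pct - current_sol_pct)  # closable by tuning
    physics = 100.0 - attainable_sol_pct                     # structural, not closable
    return budget, physics

# Hypothetical deployment: 38% measured Compute SOL, 62% attainable ceiling.
budget, physics = optimization_budget(38.0, 62.0)
print(f"optimization budget: {budget:.0f} pts; structural gap: {physics:.0f} pts")
# optimization budget: 24 pts; structural gap: 38 pts
```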

This excerpt is published under fair use for community discussion. Read the full article at Systalyze.
