Running PyTorch Models on Apple Silicon GPUs with the ExecuTorch MLX Delegate

May 18, 2026 · 9:20 PM UTC · 4 min read

#technology #machine learning #apple silicon

via PyTorch

TL;DR · WeSearch summary

The ExecuTorch MLX Delegate is a new backend that enables optimized, GPU-accelerated inference for PyTorch models on Apple Silicon Macs. It integrates with the PyTorch 2 export stack and supports a range of quantization options. The delegate is currently experimental; on generative AI workloads it is reported to reach 3-6x higher throughput than existing ExecuTorch delegates.

Key facts

▪ The MLX delegate allows PyTorch models to run on Apple Silicon GPUs using Apple's MLX framework.
▪ It supports various quantization options and a range of models, including dense transformers and speech-to-text models.
▪ The MLX delegate achieves 3-6x higher throughput on generative AI workloads compared to existing ExecuTorch delegates.
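
The quantization options mentioned above include 2/4/8-bit affine formats. As a rough illustration of what "affine" means here, the sketch below maps a small list of floats to integer codes with one scale and one offset, then maps them back. It is a generic, pure-Python illustration and not the delegate's actual kernel: the function names (`affine_quantize`, `affine_dequantize`, `max_abs_error`) and the sample weights are hypothetical, and real implementations typically quantize weights in groups on the GPU.

```python
from typing import Dict, List, Tuple


def affine_quantize(values: List[float], bits: int) -> Tuple[List[int], float, float]:
    """Map floats to integer codes in [0, 2**bits - 1] using one scale and one offset."""
    levels = (1 << bits) - 1          # highest integer code, e.g. 15 for 4 bits
    lo, hi = min(values), max(values)
    # Guard against a constant input, where hi == lo would give a zero scale.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def affine_dequantize(codes: List[int], scale: float, offset: float) -> List[float]:
    """Reconstruct approximate floats from integer codes."""
    return [c * scale + offset for c in codes]


def max_abs_error(values: List[float], bits: int) -> float:
    """Largest round-trip error for a given bit width."""
    codes, scale, offset = affine_quantize(values, bits)
    restored = affine_dequantize(codes, scale, offset)
    return max(abs(a - b) for a, b in zip(values, restored))


# Hypothetical sample weights, chosen only for illustration.
weights = [-0.82, -0.31, 0.0, 0.12, 0.47, 0.95]

# Round-trip error at the three bit widths named in the article.
errors: Dict[int, float] = {bits: max_abs_error(weights, bits) for bits in (2, 4, 8)}

if __name__ == "__main__":
    for bits, err in errors.items():
        print(f"{bits}-bit affine: max abs error = {err:.5f}")
```

The round-trip error is bounded by half the scale, so fewer bits means a coarser grid and a larger worst-case error in exchange for a smaller memory footprint.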

Original article

Opening excerpt (first ~120 words)

TL;DR: Introducing the ExecuTorch MLX Delegate

The new MLX delegate enables optimized, GPU-accelerated inference for PyTorch models on Apple Silicon Macs, using Apple’s MLX framework. The delegate seamlessly integrates with the PyTorch 2 export stack and supports a wide range of quantization options (BF16, FP16, FP32, 2/4/8-bit affine, NVFP4). It supports various models, including dense transformers (Llama, Qwen, Gemma), sparse Mixture-of-Experts, and speech-to-text models (Whisper, Voxtral, Parakeet) for both offline and real-time transcription.

Note: The MLX delegate is currently experimental.

Apple Silicon has become a popular platform for running large language models locally.

…

Excerpt limited to ~120 words for fair-use compliance. The full article is at PyTorch.
