Running PyTorch Models on Apple Silicon GPUs with the ExecuTorch MLX Delegate
The ExecuTorch MLX Delegate enables optimized, GPU-accelerated inference for PyTorch models on Apple Silicon Macs. The new backend integrates with the PyTorch 2 export stack and supports a variety of quantization options. Although still experimental, it delivers substantially higher throughput on generative AI workloads than existing ExecuTorch delegates.
- The MLX delegate allows PyTorch models to run on Apple Silicon GPUs using Apple's MLX framework.
- It supports various quantization options and a range of models, including dense transformers and speech-to-text models.
- The MLX delegate achieves 3-6x higher throughput on generative AI workloads compared to existing ExecuTorch delegates.
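Among the quantization options the article lists is 2/4/8-bit affine quantization. As a rough illustration of what "affine" means here, the sketch below maps a group of floating-point weights onto integer codes using a scale and an offset, then reconstructs them. It is a generic, pure-Python sketch and not the MLX delegate's implementation: the function names `quantize_affine` and `dequantize_affine` are hypothetical, and real backends also split weights into fixed-size groups and pack the codes into machine words, which is omitted here.

```python
def quantize_affine(values, bits=4):
    """Map floats to integer codes in [0, 2**bits - 1] with a scale and bias.

    Hypothetical helper for illustration; not an ExecuTorch or MLX API.
    """
    qmax = (1 << bits) - 1
    lo, hi = min(values), max(values)
    # Guard against a constant group, where hi == lo would give a zero scale.
    scale = (hi - lo) / qmax if hi > lo else 1.0
    codes = [min(qmax, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_affine(codes, scale, bias):
    """Reconstruct approximate floats: value ~= code * scale + bias."""
    return [c * scale + bias for c in codes]


weights = [-1.0, -0.5, 0.0, 0.5, 1.0, 0.25, -0.75, 0.9]
codes, scale, bias = quantize_affine(weights, bits=4)
restored = dequantize_affine(codes, scale, bias)

# Worst-case reconstruction error is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, bias, max_error)
```

Fewer bits mean a larger step size (`scale`) and therefore a larger worst-case error, which is the usual trade-off between model size and accuracy when choosing among 2, 4, and 8 bits.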
Opening excerpt (first ~120 words)
TL;DR: Introducing the ExecuTorch MLX Delegate
The new MLX delegate enables optimized, GPU-accelerated inference for PyTorch models on Apple Silicon Macs, using Apple’s MLX framework. The delegate seamlessly integrates with the PyTorch 2 export stack and supports a wide range of quantization options (BF16, FP16, FP32, 2/4/8-bit affine, NVFP4). It supports various models, including dense transformers (Llama, Qwen, Gemma), sparse Mixture-of-Experts, and speech-to-text models (Whisper, Voxtral, Parakeet) for both offline and real-time transcription.
Note: The MLX delegate is currently experimental.
Apple Silicon has become a popular platform for running large language models locally.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at PyTorch.