Molecular Dynamics on Apple M4
A developer implemented 15 Lennard-Jones molecular dynamics force kernels, centered on the Apple M4, to compare performance across hardware units including the CPU (NEON), the Metal GPU, and the Neural Engine. The project reached up to 810 GFLOPS on the Metal GPU and showed large speedups from optimizations such as cell lists and tiling. It was built in two days, a pace that allowed quick exploration of hardware-specific optimizations for N-body simulations.
- The M4 Metal GPU achieved a peak of 810 GFLOPS, about 19% of its theoretical 4.26 TFLOPS.
- Cell lists reduced force-evaluation complexity from O(N²) to O(N), making the Metal cell-list kernel (Metal+CL) 45× faster than all-pairs at 70K particles.
- An ANE direct kernel used reverse-engineered private APIs to run Lennard-Jones forces on the Neural Engine with FP16 precision.
- With cell lists, OpenMP on 4 P-cores outperformed the Metal GPU above 32,000 particles.
- Double-precision NEON (f64) was 2.2× slower than single-precision (f32), with negligible accuracy gains for simulation stability.
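To make the cell-list point concrete, here is a minimal pure-Python sketch that computes Lennard-Jones forces two ways: an all-pairs O(N²) loop and a cell-list loop that only visits the 27 neighboring cells of each particle. It is an illustration under stated assumptions, not the project's kernels: the function names (`lj_pair`, `forces_all_pairs`, `forces_cell_list`, `make_lattice`), the reduced units, the 2.5σ cutoff, and the non-periodic jittered lattice are all hypothetical choices made for this example.

```python
import random
from collections import defaultdict
from itertools import product

# Assumed parameters: reduced LJ units and a 2.5-sigma cutoff (not from the source).
EPS, SIGMA, RCUT = 1.0, 1.0, 2.5


def lj_pair(dx, dy, dz):
    """LJ force on particle i from j, given r_i - r_j; None beyond the cutoff."""
    r2 = dx * dx + dy * dy + dz * dz
    if r2 >= RCUT * RCUT:
        return None
    s6 = (SIGMA * SIGMA / r2) ** 3                    # (sigma / r)^6
    scale = 24.0 * EPS * (2.0 * s6 * s6 - s6) / r2   # F = scale * (r_i - r_j)
    return scale * dx, scale * dy, scale * dz


def _accumulate(f, pos, i, j):
    """Add the i-j pair force to both particles (Newton's third law)."""
    fij = lj_pair(*(pos[i][k] - pos[j][k] for k in range(3)))
    if fij is not None:
        for k in range(3):
            f[i][k] += fij[k]
            f[j][k] -= fij[k]


def forces_all_pairs(pos):
    """O(N^2): test every unordered pair against the cutoff."""
    f = [[0.0, 0.0, 0.0] for _ in pos]
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            _accumulate(f, pos, i, j)
    return f


def forces_cell_list(pos):
    """O(N) at fixed density: bin into cells of side RCUT, scan 27 neighbor cells."""
    cells = defaultdict(list)
    for idx, p in enumerate(pos):
        cells[tuple(int(c // RCUT) for c in p)].append(idx)
    f = [[0.0, 0.0, 0.0] for _ in pos]
    for (cx, cy, cz), members in cells.items():
        for ox, oy, oz in product((-1, 0, 1), repeat=3):
            neighbors = cells.get((cx + ox, cy + oy, cz + oz), ())
            for i in members:
                for j in neighbors:
                    if j > i:                         # visit each pair exactly once
                        _accumulate(f, pos, i, j)
    return f


def make_lattice(n_side, spacing=1.2, jitter=0.05, seed=0):
    """Jittered cubic lattice, so no two particles start unphysically close."""
    rng = random.Random(seed)
    return [
        [spacing * c + rng.uniform(-jitter, jitter) for c in cell]
        for cell in product(range(n_side), repeat=3)
    ]


pos = make_lattice(6)
fa, fc = forces_all_pairs(pos), forces_cell_list(pos)
max_diff = max(abs(a - b) for ra, rc in zip(fa, fc) for a, b in zip(ra, rc))
print(f"{len(pos)} particles, max force difference: {max_diff:.2e}")
```

Both functions apply the same cutoff, so they should agree to floating-point rounding; the only difference is how many candidate pairs get examined. The cell side equals the cutoff here, which guarantees every interacting pair lies in the same or an adjacent cell. A GPU or SIMD implementation would organize the same idea very differently (sorted particle arrays, tiles or clusters), which this sketch does not attempt.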
Opening excerpt (first ~120 words)
moleqular: Molecular dynamics on Apple M4, pushing every compute path to its limits. LJ (Lennard-Jones) N-body simulation with 15 force kernels targeting different hardware units on Apple Silicon. Same physics, same particles, wildly different performance characteristics. Built in 2 days. 15 kernels across 5 architectures (M4 NEON, Metal GPU, M4 Neural Engine, NVIDIA CUDA, GCP Axion SVE2). A real-time Metal particle renderer. A quantized BVH. A GROMACS-style NBNXM cluster pair kernel. A direct ANE kernel bypassing CoreML via reverse-engineered private APIs. Cross-compiled and benchmarked on cloud GPUs.
…
Excerpt limited to ~120 words for fair-use compliance. The full article is at GitHub.