Hi,
Can I use the M2 Pro chip’s GPU for data science with R/RStudio and/or Python?
Thanks for your attention.
At this time, I believe not. We do not yet have OpenCL for Apple Silicon GPUs. Other members of the @asahi-sig may have more details on the timeline for that kind of enablement.
Could the M1 and M2’s neural engine also be used for these kinds of applications?
I see there’s a GitHub repository for the neural engine driver, but I haven’t been able to find much information on its current state or usability.
Edit: It seems it’s set to be merged into asahi soon.
The neural engine is basically an FP16 convolution accelerator. So it can run certain low-precision ML models and do general computations like FFTs and matrix multiplications. If that sounds useful for your specific application, then yes, though whatever framework you’re using would have to be ported to support it (there is no common API for ML accelerators). It’s not a general purpose Turing-complete unit like GPU shaders are, so it can’t run everything.
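If it helps to see why a convolution engine also covers matrix multiplication, here is a minimal sketch in plain PyTorch (CPU, FP32 for portability; the real ANE works in FP16, and nothing here actually touches the ANE):

```python
import torch
import torch.nn.functional as F

# A (m x k) @ (k x n) matmul is just a 1x1 convolution over k input channels,
# which is why a "convolution accelerator" can still do matrix products.
m, k, n = 8, 16, 4
A = torch.randn(m, k)
B = torch.randn(k, n)

x = A.t().reshape(1, k, m, 1)            # input: k channels, m "pixels"
w = B.t().reshape(n, k, 1, 1)            # weights: n output channels, 1x1 kernel
out = F.conv2d(x, w).reshape(n, m).t()   # back to an (m, n) result

assert torch.allclose(out, A @ B, atol=1e-4)
```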
I believe the GPU cannot do FP64 (it only goes up to FP32), so if double precision is something you need, the GPU will not help. In my own experience playing around with the M1 GPU in PyTorch, it’s much slower for this kind of workload than even a really low-end discrete Nvidia GPU would be, so I’m not sure how practical even the M2 Pro would be here.
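For reference, a quick way to see the FP32 ceiling from PyTorch. This assumes macOS with the MPS backend and a recent PyTorch (>= 1.12); on Asahi there is currently no PyTorch GPU backend at all, so the check simply reports that:

```python
import torch

# The Apple GPU tops out at FP32, so FP64 work has to stay on the CPU.
if torch.backends.mps.is_available():
    dev = torch.device("mps")
    a = torch.randn(2048, 2048, device=dev)   # FP32: fine, runs on the GPU
    b = a @ a
    try:
        a.double()                            # FP64: rejected by the backend
    except Exception as exc:
        print("No FP64 on this GPU:", exc)
else:
    print("MPS backend not available (e.g. under Asahi Linux).")
```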
You can only use the neural engine if you put in the work. Since the ANE lacks instructions and primitive data structures, it cannot execute arbitrary code the way GPU shaders can. To run a computation graph on the ANE (the best-supported notation being a PyTorch nn.Module), the graph must be compiled into the ANE “microcode”; the driver really just loads this per-model microcode binary. In other words, anything you want to run on the ANE you must 1) know at compile time and 2) compile. This is where it gets its speed. This repo streamlines the compilation part, hopefully, but it’s neither packaged nor integrated yet. So not really, not yet. Getting to that.
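To make “know at compile time” concrete, here is roughly what the PyTorch side of such a static graph looks like. The module and shapes are only illustrative, and the actual compile/load step belongs to the ANE toolchain and isn’t shown:

```python
import torch
import torch.nn as nn

# A fixed nn.Module traced at a fixed input shape: everything about the graph
# is known ahead of time, which is what a per-model compiler needs.
class Small(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.conv(x))

example = torch.randn(1, 3, 224, 224)           # input shape fixed up front
traced = torch.jit.trace(Small().eval(), example)
traced.save("small_fixed_shape.pt")             # this static graph is what gets compiled
```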
Python being anything but static, there’s no way to answer libraries’ dynamic API calls and act as a seamless/hidden computation backend (unless we fork ATen to embed kernels for matmul_2x2, matmul_2x3, matmul_2x4, and so on). That’s how the hardware is, so the same applies to macOS and Asahi. CUDA has its own limitations too, e.g. “[CUDA] only supports powers of 2 signal length in every transformed dimension” for torch.fft.fft(). Torch could probably benefit from embedding popular kernels, but I don’t really have the power to do that. Apple does, but they don’t want to (see the GPU comment by joel2 above).
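As a toy sketch of that limitation (the shape table and dispatch function are made up for illustration, not a real API):

```python
import torch

# A backend built from precompiled, fixed-shape kernels can only answer calls
# whose shapes were known at compile time. The (m, k, n) keys stand in for
# hypothetical matmul_2x2, matmul_2x3, ... binaries baked in ahead of time.
PRECOMPILED_SHAPES = {(2, 2, 2), (2, 2, 3), (2, 2, 4)}

def matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    key = (a.shape[0], a.shape[1], b.shape[1])
    if key in PRECOMPILED_SHAPES:
        print(f"shape {key}: would dispatch to a baked-in ANE kernel")
    else:
        print(f"shape {key}: no precompiled kernel, falling back to the CPU")
    return a @ b                      # the math itself is the same either way

matmul(torch.randn(2, 2), torch.randn(2, 3))   # hits a baked-in shape
matmul(torch.randn(5, 7), torch.randn(7, 9))   # misses -> fallback
```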
It was reverse-engineered six months ago. It’s taking long because we’re trying to make an inherently sucky process suck as little as possible. I’m assuming the applications in question are popular trained neural nets; those run well. I’ve been meaning to get a Stable Diffusion demo running, which works, but without the weights loaded, because I hit the RAM ceiling. 8GB is cruel.