1.) Apple Silicon currently can't compete with Nvidia GPUs in terms of raw compute power, but they're already way ahead on energy efficiency. Training a small deep learning model on battery power on a laptop could actually be a thing now.
Edit: I've been informed that for matrix math, Apple Silicon isn't actually ahead in efficiency
2.) Apple Silicon will probably compete directly with Nvidia GPUs on raw compute power in future generations of products like the Mac Studio and Mac Pro, which is very exciting. Competition in this space is incredibly good for consumers.
3.) At $4800, an M1 Ultra Mac Studio appears to be far and away the cheapest machine you can buy with 128GB of GPU memory. With proper PyTorch support, we'll actually be able to use this memory for training big models or using big batch sizes. For the kind of DL work I do where dataloading is much more of a bottleneck than actual raw compute power, Mac Studio is now looking very enticing.
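With the MPS-enabled PyTorch builds, using that unified memory pool from Python looks roughly like this (a minimal sketch; the model and batch sizes are arbitrary, and it falls back to CPU where the MPS backend isn't available):

```python
import torch

# Prefer Apple's Metal Performance Shaders backend when present
# (PyTorch >= 1.12 nightlies); otherwise fall back to CPU.
if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

# Toy forward pass -- on an M1 Ultra the weights, the batch, and the
# activations all live in the same unified memory pool.
model = torch.nn.Linear(1024, 10).to(device)
x = torch.randn(32, 1024, device=device)
out = model(x)
print(out.shape, device)
```

The same code runs unchanged on CUDA by swapping the device string, which is what makes the nightly support easy to try.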
1) Nope. For neural network training that's not the case: https://tlkh.dev/benchmarking-the-apple-m1-max
And that's with the 3090 set at a very high 400W power limit; it gets far more efficient when clocked lower.
(Which is expected, notably because the M1 GPU has no dedicated matrix math accelerators.)
2) We'll see, hopefully Apple thinks that the market is worth bothering with... (which would be great)
3) Indeed, if you need a giant pool of VRAM above everything else at a relatively low price tag, Apple is a quite enticing option. If you can stand Metal for your use case, of course.
For the raw compute you need for ML training, the M1's efficiency doesn't matter. Under the hood, at the hardware level, there is a direct mapping from compute circuit activation to power consumption that you really can't get around.
The general efficiency of the M1 comes from its architecture and how it fits normal consumer use: less work in instruction decode, more efficient reordering, less energy wasted moving data around thanks to the shared memory architecture, etc.
The efficiency comparison between Apple and Nvidia here is a bit misleading, because it compares Apple's general-purpose ALUs to Nvidia's specialized ALUs. For a more direct efficiency comparison, one would need to compare the Tensor Cores against the AMX or ANE coprocessors.
As to how Apple achieves such high efficiency, nobody knows. Being on a 5nm node might help, but there must be something special about the ALU design as well. My speculation is that the ALUs are wider and much simpler than in other GPUs, which directly translates into efficiency wins.
Apple is simply behind in the GPU space.
> At $4800, an M1 Ultra Mac Studio appears to be far and away the cheapest machine you can buy with 128GB of GPU memory. With proper PyTorch support, we'll actually be able to use this memory for training big models or using big batch sizes. For the kind of DL work I do where dataloading is much more of a bottleneck than actual raw compute power, Mac Studio is now looking very enticing.
The reason it's cheaper is that its memory has only a fraction (around 20-35%) of the bandwidth of an equivalent 128GB GPU setup, and that bandwidth also has to be shared with the CPU. This is an unavoidable bottleneck of shared memory systems, and for a great many applications it is a terminal performance bottleneck.
That's why you don't see a GPU with 128GB of ordinary DDR5: it would just be quite bandwidth-limited. For some cases it can still be useful, though.
Here's some info about M1 memory bandwidth: https://www.anandtech.com/show/17024/apple-m1-max-performanc...
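The bandwidth point can be made concrete with a back-of-the-envelope roofline check: an operation is memory-bound when its arithmetic intensity (FLOPs per byte moved) falls below the machine's compute-to-bandwidth ratio. A sketch with illustrative placeholder peaks, not measured numbers:

```python
def arithmetic_intensity_matmul(n, dtype_bytes=4):
    """FLOPs per byte for an n x n matmul: 2*n^3 FLOPs over three
    n x n matrices each read/written once (ignores cache reuse)."""
    return (2 * n ** 3) / (3 * n ** 2 * dtype_bytes)

def is_memory_bound(intensity, peak_flops, peak_bandwidth_bytes):
    """Memory-bound when intensity is below the machine's balance point."""
    return intensity < peak_flops / peak_bandwidth_bytes

# Illustrative peaks only: ~10 TFLOP/s fp32 compute, ~400 GB/s shared bandwidth.
PEAK_FLOPS, PEAK_BW = 10e12, 400e9
for n in (32, 256, 4096):
    ai = arithmetic_intensity_matmul(n)
    bound = "memory-bound" if is_memory_bound(ai, PEAK_FLOPS, PEAK_BW) else "compute-bound"
    print(n, round(ai, 1), bound)
```

Small matrices fall below the balance point and stall on memory, which is where lower shared-memory bandwidth hurts most.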
It seems like this is ideal as an accelerator for already-trained models; one can imagine Photoshop using it for deep-learning-based inpainting.
I've done training on battery with a laptop that had a 1080; I've trained models on an airplane while totally unplugged and still had enough power to web-surf afterwards.
- Apple undoubtedly has the densest nodes, and will fight tooth-and-nail for first dibs on whatever silicon TSMC has coming next.
- Apple's current GPU design philosophy relies on horizontally scaling the tech they already use, whereas Nvidia has been scaling vertically, albeit slowly.
- Nvidia has insane engineers. Despite using silicon that's more than twice as large by area compared to Apple's, they're still doubling their numbers across the board. And that's their last-gen tech too; the comparison once they're on 5nm later this summer is going to be insane.
I expect things to be very heated by the end of this year, with new Nvidia, Intel and potentially new Apple GPUs.
Interesting observation. I wonder what the biggest-memory iGPU configuration you can get on the x86 side is?
Better would be MobileNets, EfficientNets, NFNets, vision transformers, or almost anything that's come out in the 8 years since VGG was published (great work though it was at the time!).
Why not? It's still good for simple classification tasks. We use it as an encoder for a segmentation model in some cases. Most ResNet variants are much heavier.
https://www.kaggle.com/code/jhoward/which-image-models-are-b...
Those slow and inaccurate models at the bottom of the graph are the VGG models. A resnet34 is faster and more accurate than any VGG model. And there are better options now -- for example resnet34d is as fast as resnet34, and more accurate. And then convnext is dramatically better still.
https://github.com/jcjohnson/cnn-benchmarks#:~:text=ResNet%2....
Probably because it makes the hardware look good.
It makes me feel like I'm missing something! Is it still used as a backbone in the same way legacy code is everywhere, or is it something else entirely?
pip3 install --pre torch==1.12.0.dev20220518 --extra-index-url https://download.pytorch.org/whl/nightly/cpu
In both cases the unified memory machines outperformed much larger machines in specific use cases.
- It needs extremely high-bandwidth controllers, which severely limits the amount of memory you can use (Intel Macs could be configured with an order of magnitude more RAM in their server chips)
- ECC is still off-the-table on M1 apparently
- Most workloads aren't really constrained by memory access in modern programs/kernels/compilers. Problems only show up when you want to run a GPU off the same memory, which is what these new Macs account for.
- Most of the so-called "specific workloads" that you're outlining aren't very general applications. So far I've only seen ARM outrun x86 in some low-precision physics demos, which is... fine, I guess? I still don't foresee meteorologists dropping their Intel rigs to buy a Mac Studio anytime soon.
I'm curious how the benchmarks change with this new release!
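A minimal timing harness for rerunning such a comparison might look like this (a sketch only; matrix size and iteration count are arbitrary, and copying the result back to the CPU serves as a synchronization point for the asynchronous MPS backend):

```python
import time
import torch

def time_matmul(device, n=2048, iters=10):
    """Average seconds per n x n matmul on the given device."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    a @ b  # warm-up so lazy initialization isn't timed
    start = time.perf_counter()
    for _ in range(iters):
        c = a @ b
    c = c.cpu()  # forces completion on async backends before stopping the clock
    return (time.perf_counter() - start) / iters

use_mps = getattr(torch.backends, "mps", None) and torch.backends.mps.is_available()
dev = "mps" if use_mps else "cpu"
print(f"{dev}: {time_matmul(dev) * 1e3:.2f} ms per 2048x2048 matmul")
```

Without the final copy (or an explicit sync), the loop only measures how fast ops are queued, not how fast they run.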
$ conda install pytorch torchvision torchaudio -c pytorch-nightly
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
PackagesNotFoundError: The following packages are not available from current channels:
- torchaudio
And the pip install variant installs an old version of torchaudio that is broken:
OSError: dlopen(/opt/homebrew/Caskroom/miniforge/base/envs/test123/lib/python3.10/site-packages/torchaudio/lib/libtorchaudio.so, 0x0006): Symbol not found: __ZN2at14RecordFunctionC1ENS_11RecordScopeEb
pip3 install pytorch
worked for me. I think it's something with your brew installation.
fragmede@samairmac:~$ python
Python 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:24:02)
[Clang 11.1.0 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__file__
'/Users/fragmede/projects/miniforge3/lib/python3.9/site-packages/torch/__init__.py'
>>>
The neural engine is small and inference-only. It's also only exposed through a far higher-level interface, CoreML.
Where it could still make sense is if you have a small VRAM pool on the dGPU and a big one on the M1, but with the price of a Mac, not sure that makes a lot of sense either in most scenarios compared to paying for a big dGPU.
That's because the M1 has a dedicated matrix math accelerator called AMX [1]. I've used it with both Swift and pure C.
https://medium.com/swlh/apples-m1-secret-coprocessor-6599492...
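On macOS, NumPy built against Apple's Accelerate framework routes BLAS calls through AMX without any AMX-specific code: you just call a BLAS-backed op. A sketch (on other platforms the same code simply runs on whatever BLAS is linked):

```python
import numpy as np

# np.show_config() reveals which BLAS NumPy was built against;
# "accelerate" indicates Apple's framework (and hence AMX dispatch).
a = np.random.rand(512, 512).astype(np.float32)
b = np.random.rand(512, 512).astype(np.float32)
c = a @ b  # dispatched to sgemm in the linked BLAS library
print(c.shape)
```

That's also why AMX shows up in plain matmul benchmarks even though nothing in user code mentions it.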
Why is it inference only? At least the operations are the same...just a bunch of linear algebra
It's not really comparable on a steps-per-second level, but the power consumption, and now the GPU memory, will make it pretty enticing.
(pytorch_env) ~/dev/ai/ python -c "import torch"
Note that for channel counts below roughly 512 (varies by GPU and hardware) you tend to be limited by memory transfer speed, not compute.
So unfortunately depthwise convolutions end up having terrible performance.
Note that pointwise 1x1 convolutions are a special case of group convolutions and actually I think they might be specially optimized in PyTorch (I’d have to run some benchmarks to test it though).
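The pointwise case is easy to sanity-check: a 1x1 convolution is just a channel-mixing matrix multiply applied at every spatial location, which is what lets it hit the fast GEMM path. A quick NumPy check (shapes are arbitrary):

```python
import numpy as np

n, c_in, c_out, h, w = 2, 8, 16, 5, 5
x = np.random.rand(n, c_in, h, w).astype(np.float32)
weight = np.random.rand(c_out, c_in).astype(np.float32)  # a 1x1 conv kernel

# 1x1 conv computed directly: at every pixel, mix the input channels.
conv = np.einsum("oc,nchw->nohw", weight, x)

# The same thing as one batched GEMM: flatten spatial dims, matmul, reshape.
gemm = (weight @ x.reshape(n, c_in, h * w)).reshape(n, c_out, h, w)

print(np.allclose(conv, gemm, atol=1e-4))
```

Depthwise convolutions have no such reduction to one big GEMM, which is part of why their performance suffers on hardware tuned for dense matmul.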
Since there are no hard benchmarks against other GPUs, here's a Geekbench against an RTX 3080 Mobile laptop I have [1]. Looks like it's about 2x slower--the RTX laptop absolutely rips for gaming, I love it.
[1] https://browser.geekbench.com/v5/compute/compare/4140651?bas...
What do shaders have to do with it? Deep learning is a mature field now, it shouldn't need to borrow compute architecture from the gaming/entertainment field. Anyone else find this disconcerting?
Why is that concerning to you?
People new to CG are likely to intuit “shaders” as something related to, well, shading, but vertex shaders et al have nothing to do with the color of a pixel or a polygon.
And there are "GPUs" today that can't do graphics at all (the AMD MI100/MI200 generations) or only in a restricted way (Hopper GH100, which keeps the fixed-function pipeline on just two TPCs for compatibility, running very slowly as a result).
Concessions towards compute: a C++ programming language for device code (totally unlike what's done for most graphics APIs!)
Concessions towards graphics: no single-source programming model at all for example...