1.) Apple Silicon currently can't compete with Nvidia GPUs in terms of raw compute power, but they're already way ahead on energy efficiency. Training a small deep learning model on battery power on a laptop could actually be a thing now.
Edit: I've been informed that for matrix math, Apple Silicon isn't actually ahead in efficiency
2.) Apple Silicon will probably compete directly with Nvidia GPUs on raw compute power in future generations of products like the Mac Studio and Mac Pro, which is very exciting. Competition in this space is incredibly good for consumers.
3.) At $4800, an M1 Ultra Mac Studio appears to be far and away the cheapest machine you can buy with 128GB of GPU memory. With proper PyTorch support, we'll actually be able to use this memory for training big models or using big batch sizes. For the kind of DL work I do where dataloading is much more of a bottleneck than actual raw compute power, Mac Studio is now looking very enticing.
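With the MPS-enabled PyTorch builds, using that unified memory pool from Python looks roughly like this (a minimal sketch; the model and batch sizes are arbitrary, and it falls back to CPU where the MPS backend isn't available):

```python
import torch

# Prefer Apple's Metal Performance Shaders backend when present
# (PyTorch >= 1.12 nightlies); otherwise fall back to CPU.
if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

# Toy forward pass -- on an M1 Ultra the weights, the batch, and the
# activations all live in the same unified memory pool.
model = torch.nn.Linear(1024, 10).to(device)
x = torch.randn(32, 1024, device=device)
out = model(x)
print(out.shape, device)
```

The same code runs unchanged on CUDA by swapping the device string, which is what makes the nightly support easy to try.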
1) Nope. For neural network training that's not the case: https://tlkh.dev/benchmarking-the-apple-m1-max
And that's with the 3090 set at a very high 400W power limit; it gets far more efficient when clocked lower.
(Which is expected, notably because the M1 GPU has no dedicated matrix math accelerators.)
2) We'll see, hopefully Apple thinks that the market is worth bothering with... (which would be great)
3) Indeed, if you need a giant pool of VRAM above everything else at a relatively low price tag, Apple is a quite enticing option. If you can stand Metal for your use case, of course.
For the raw compute you need for ML training, the M1's efficiency doesn't matter. Under the hood, at the hardware level, there is a direct mapping from compute circuit activation to power consumption that you really can't get around.
The general efficiency of the M1 comes from its architecture and how it fits normal consumer use: less work in instruction decode, more efficient reordering, less energy wasted moving data around thanks to the shared memory architecture, etc.
The efficiency comparison between Apple and Nvidia here is a bit misleading, because it compares Apple's general-purpose ALUs to Nvidia's specialized ALUs. For a more direct efficiency comparison, one would need to compare the Tensor Cores against the AMX or ANE coprocessors.
As to how Apple achieves such high efficiency, nobody knows. Being on a 5nm node might help, but there must be something special about the ALU design as well. My speculation is that the ALUs are wider and much simpler than in other GPUs, which directly translates into efficiency wins.
Apple is simply behind in the GPU space.
> At $4800, an M1 Ultra Mac Studio appears to be far and away the cheapest machine you can buy with 128GB of GPU memory. With proper PyTorch support, we'll actually be able to use this memory for training big models or using big batch sizes. For the kind of DL work I do where dataloading is much more of a bottleneck than actual raw compute power, Mac Studio is now looking very enticing.
The reason it's cheaper is that its memory has only a fraction (around 20-35%) of the bandwidth of an equivalent 128GB GPU setup, and that bandwidth also has to be shared with the CPU. This is an unavoidable bottleneck of shared memory systems, and for a great many applications it is a terminal performance bottleneck.
That's why you don't see a GPU with 128GB of ordinary DDR5: it would just be quite bandwidth-limited. For some cases it can still be useful, though.
Here's some info about M1 memory bandwidth: https://www.anandtech.com/show/17024/apple-m1-max-performanc...
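The bandwidth point can be made concrete with a back-of-the-envelope roofline check: an operation is memory-bound when its arithmetic intensity (FLOPs per byte moved) falls below the machine's compute-to-bandwidth ratio. A sketch with illustrative placeholder peaks, not measured numbers:

```python
def arithmetic_intensity_matmul(n, dtype_bytes=4):
    """FLOPs per byte for an n x n matmul: 2*n^3 FLOPs over three
    n x n matrices each read/written once (ignores cache reuse)."""
    return (2 * n ** 3) / (3 * n ** 2 * dtype_bytes)

def is_memory_bound(intensity, peak_flops, peak_bandwidth_bytes):
    """Memory-bound when intensity is below the machine's balance point."""
    return intensity < peak_flops / peak_bandwidth_bytes

# Illustrative peaks only: ~10 TFLOP/s fp32 compute, ~400 GB/s shared bandwidth.
PEAK_FLOPS, PEAK_BW = 10e12, 400e9
for n in (32, 256, 4096):
    ai = arithmetic_intensity_matmul(n)
    bound = "memory-bound" if is_memory_bound(ai, PEAK_FLOPS, PEAK_BW) else "compute-bound"
    print(n, round(ai, 1), bound)
```

Small matrices fall below the balance point and stall on memory, which is where lower shared-memory bandwidth hurts most.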
It seems like this is ideal as an accelerator for already-trained models; one can imagine Photoshop using it for deep-learning-based inpainting.
I've done training on battery with a laptop that had a 1080; I've trained models on an airplane while totally unplugged and still had enough power to web-surf afterwards.
- Apple undoubtedly has the densest nodes, and will fight tooth-and-nail for first dibs on whatever silicon TSMC has coming next.
- Apple's current GPU design philosophy relies on horizontally scaling the tech they already use, whereas Nvidia has been scaling vertically, albeit slowly.
- Nvidia has insane engineers. Despite using silicon that's more than twice as large by area compared to Apple's, they're still doubling their numbers across the board. And that's their last-gen tech too; the comparison once they're on 5nm later this summer is going to be insane.
I expect things to be very heated by the end of this year, with new Nvidia, Intel and potentially new Apple GPUs.
Interesting observation. I wonder what the biggest-memory iGPU configuration you can get on the x86 side is?
Better would be MobileNets, EfficientNets, NFNets, vision transformers, or almost anything that's come out in the 8 years since VGG was published (great work though it was at the time!).
Why not? It's still good for simple classification tasks. We use it as an encoder for a segmentation model in some cases. Most ResNet variants are much heavier.
https://www.kaggle.com/code/jhoward/which-image-models-are-b...
Those slow and inaccurate models at the bottom of the graph are the VGG models. A resnet34 is faster and more accurate than any VGG model. And there are better options now -- for example resnet34d is as fast as resnet34, and more accurate. And then convnext is dramatically better still.
https://github.com/jcjohnson/cnn-benchmarks#:~:text=ResNet%2....
Probably because it makes the hardware look good.
It makes me feel like I'm missing something! Is it still used as a backbone in the same way legacy code is everywhere, or is it something else entirely?
pip3 install --pre torch==1.12.0.dev20220518 --extra-index-url https://download.pytorch.org/whl/nightly/cpu
In both cases the unified memory machines outperformed much larger machines in specific use cases.
- It needs extremely high-bandwidth controllers, which severely limits the amount of memory you can use (Intel Macs could be configured with an order of magnitude more RAM in their server chips)
- ECC is still off-the-table on M1 apparently
- Most workloads aren't really constrained by memory access in modern programs/kernels/compilers. Problems only show up when you want to run a GPU off the same memory, which is what these new Macs account for.
- Most of the so-called "specific workloads" that you're outlining aren't very general applications. So far I've only seen ARM outrun x86 in some low-precision physics demos, which is... fine, I guess? I still don't foresee meteorologists dropping their Intel rigs to buy a Mac Studio anytime soon.
I'm curious how the benchmarks change with this new release!
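A minimal timing harness for rerunning such a comparison might look like this (a sketch only; matrix size and iteration count are arbitrary, and copying the result back to the CPU serves as a synchronization point for the asynchronous MPS backend):

```python
import time
import torch

def time_matmul(device, n=2048, iters=10):
    """Average seconds per n x n matmul on the given device."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    a @ b  # warm-up so lazy initialization isn't timed
    start = time.perf_counter()
    for _ in range(iters):
        c = a @ b
    c = c.cpu()  # forces completion on async backends before stopping the clock
    return (time.perf_counter() - start) / iters

use_mps = getattr(torch.backends, "mps", None) and torch.backends.mps.is_available()
dev = "mps" if use_mps else "cpu"
print(f"{dev}: {time_matmul(dev) * 1e3:.2f} ms per 2048x2048 matmul")
```

Without the final copy (or an explicit sync), the loop only measures how fast ops are queued, not how fast they run.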
$ conda install pytorch torchvision torchaudio -c pytorch-nightly
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
PackagesNotFoundError: The following packages are not available from current channels:
- torchaudio
And the pip install variant installs an old version of torchaudio that is broken:
OSError: dlopen(/opt/homebrew/Caskroom/miniforge/base/envs/test123/lib/python3.10/site-packages/torchaudio/lib/libtorchaudio.so, 0x0006): Symbol not found: __ZN2at14RecordFunctionC1ENS_11RecordScopeEb
pip3 install pytorch
worked for me. I think it's something with your brew installation.
fragmede@samairmac:~$ python
Python 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 19:24:02)
[Clang 11.1.0 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__file__
'/Users/fragmede/projects/miniforge3/lib/python3.9/site-packages/torch/__init__.py'
>>>
The neural engine is small and inference-only. It's also only exposed through a far higher-level interface, CoreML.
Where it could still make sense is if you have a small VRAM pool on the dGPU and a big one on the M1, but with the price of a Mac, not sure that makes a lot of sense either in most scenarios compared to paying for a big dGPU.
That's because the M1 has a dedicated matrix math accelerator called AMX [1]. I've used it with both Swift and pure C.
https://medium.com/swlh/apples-m1-secret-coprocessor-6599492...
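On macOS, NumPy built against Apple's Accelerate framework routes BLAS calls through AMX without any AMX-specific code: you just call a BLAS-backed op. A sketch (on other platforms the same code simply runs on whatever BLAS is linked):

```python
import numpy as np

# np.show_config() reveals which BLAS NumPy was built against;
# "accelerate" indicates Apple's framework (and hence AMX dispatch).
a = np.random.rand(512, 512).astype(np.float32)
b = np.random.rand(512, 512).astype(np.float32)
c = a @ b  # dispatched to sgemm in the linked BLAS library
print(c.shape)
```

That's also why AMX shows up in plain matmul benchmarks even though nothing in user code mentions it.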
Why is it inference only? At least the operations are the same...just a bunch of linear algebra
It's not really comparable on a steps-per-second level, but the power consumption, and now the GPU memory, will make it pretty enticing.
(pytorch_env) ~/dev/ai/ python -c "import torch"
Note that for channel counts below roughly 512 (varies by GPU and hardware) you tend to be limited by memory transfer speed, not compute.
So unfortunately depthwise convolutions end up having terrible performance.
Note that pointwise 1x1 convolutions are a special case of group convolutions and actually I think they might be specially optimized in PyTorch (I’d have to run some benchmarks to test it though).
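The pointwise case is easy to sanity-check: a 1x1 convolution is just a channel-mixing matrix multiply applied at every spatial location, which is what lets it hit the fast GEMM path. A quick NumPy check (shapes are arbitrary):

```python
import numpy as np

n, c_in, c_out, h, w = 2, 8, 16, 5, 5
x = np.random.rand(n, c_in, h, w).astype(np.float32)
weight = np.random.rand(c_out, c_in).astype(np.float32)  # a 1x1 conv kernel

# 1x1 conv computed directly: at every pixel, mix the input channels.
conv = np.einsum("oc,nchw->nohw", weight, x)

# The same thing as one batched GEMM: flatten spatial dims, matmul, reshape.
gemm = (weight @ x.reshape(n, c_in, h * w)).reshape(n, c_out, h, w)

print(np.allclose(conv, gemm, atol=1e-4))
```

Depthwise convolutions have no such reduction to one big GEMM, which is part of why their performance suffers on hardware tuned for dense matmul.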
Since there are no hard benchmarks against other GPUs, here's a Geekbench against an RTX 3080 Mobile laptop I have [1]. Looks like it's about 2x slower--the RTX laptop absolutely rips for gaming, I love it.
[1] https://browser.geekbench.com/v5/compute/compare/4140651?bas...
What do shaders have to do with it? Deep learning is a mature field now, it shouldn't need to borrow compute architecture from the gaming/entertainment field. Anyone else find this disconcerting?
Why is that concerning to you?
People new to CG are likely to intuit “shaders” as something related to, well, shading, but vertex shaders et al have nothing to do with the color of a pixel or a polygon.
And there are "GPUs" today that can't do graphics at all (the AMD MI100/MI200 generations) or only in a restricted way (Hopper GH100, which keeps the fixed-function pipeline on just two TPCs for compatibility, running very slowly as a result).
Concessions towards compute: a C++ programming language for device code (totally unlike what's done for most graphics APIs!)
Concessions towards graphics: no single-source programming model at all for example...