undefined | Better HN

0 pointsfotcorn2mo ago0 comments

The memory bandwith on M4 Max is 546 GB/s, M5 Max is 614GB/s, so not a huge jump.

The new tensor cores, sorry, "Neural Accelerator" only really help with prompt preprocessing aka prefill, and not with token generation. Token generation is memory bound.

Hopefully the Ultra version (if it exists) has a bigger jump in memory bandwidth and maximum RAM.

0 comments

anentropic2mo ago

Do any frameworks manage to use the neural engine cores for that?

Most stuff ends up running Metal -> GPU I thought

abhikul02mo ago

It's referring to the neural cores(for matrix mul) in the GPU itself, not the NPU.

https://creativestrategies.com/research/m5-apple-silicon-its...

sumek832mo ago

https://github.com/maderix/ANE

irusensei2mo ago

I noticed that even on my M3 MLX tends to do prefill it a lot faster than llama.cpp and GGML models. Anyone knows how they do it?

j / k navigate · click thread line to collapse