"Discover new Metal profiling tools for M3 and A17 Pro"
https://developer.apple.com/videos/play/tech-talks/111374/
"Learn performance best practices for Metal shaders"
https://developer.apple.com/videos/play/tech-talks/111373/
"Bring your high-end game to iPhone 15 Pro"
These are two excellent posts that go deep on this:
The Shader Permutation Problem - Part 1: How Did We Get Here?
The Shader Permutation Problem - Part 2: How Do We Fix It?
In particular, the second post has the line:
We probably should not expect any magic workarounds for static register allocation: if a callable shader requires many registers, we can likely expect for the occupancy of the entire batch to suffer. It’s possible that GPUs could diverge from this model in the future, but that could come with all kinds of potential pitfalls (it’s not like they’re going to start spilling to a stack when executing thousands of pixel shader waves).
... And some kind of 'magic workaround for static register allocation' is pretty much what has been done.
https://therealmjp.github.io/posts/shader-permutations-part1...
https://therealmjp.github.io/posts/shader-permutations-part2...
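The occupancy cost the quote describes can be sketched numerically. A minimal sketch, assuming a hypothetical register-file size and hardware thread limit (loosely modeled on desktop-GPU figures, not any specific chip):

```python
# Why static register allocation limits occupancy: the register file is a
# fixed pool per SM/core, so the more registers each thread needs, the
# fewer threads can be resident at once. All numbers here are assumptions.

def max_resident_threads(regfile_regs_per_sm: int,
                         regs_per_thread: int,
                         hw_thread_limit: int) -> int:
    """Occupancy is capped by whichever runs out first:
    register-file capacity or the hardware's resident-thread limit."""
    by_registers = regfile_regs_per_sm // regs_per_thread
    return min(by_registers, hw_thread_limit)

# A light shader (32 regs/thread) vs. a heavy callable shader (128 regs):
light = max_resident_threads(65536, 32, 2048)   # hardware-limited
heavy = max_resident_threads(65536, 128, 2048)  # register-limited
```

With these assumed numbers, the heavy shader cuts the resident thread count to a quarter, which is the "occupancy of the entire batch suffers" effect the post describes.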
So how does one translate to an equivalent in "CUDA cores" type terminology?
For example, a GTX 1080 GPU has 20 streaming multiprocessors (SMs), each containing 128 cores, each of which supports 16 threads.
Meanwhile Apple describes the M1 GPU as having 8 cores, where “each core is split into 16 Execution Units, which each contain eight Arithmetic Logic Units (ALUs). In total, the M1 GPU contains up to 128 Execution units or 1024 ALUs, which Apple says can execute up to 24,576 threads simultaneously and which have a maximum floating point (FP32) performance of 2.6 TFLOPs.”
So one option to get a single number for a rough comparison is to count threads. The GTX 1080 supports 40,960 threads while the M1 supports 24,576 threads.
There's obviously a lot more to a GPU: varying clock speeds, ALUs with different capabilities, memory bandwidth, and so on. But at least counting threads gives a better idea of the processing bandwidth than talking about cores.
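The arithmetic behind those two thread counts, using only the per-unit figures quoted above (the M1 total is the number Apple states directly):

```python
# Resident-thread comparison from the figures quoted in this thread.
# These are the commenters' numbers, not re-verified against vendor specs.

gtx_1080_threads = 20 * 128 * 16   # SMs x cores/SM x threads/core
m1_threads = 24_576                # total quoted by Apple for the M1 GPU
ratio = gtx_1080_threads / m1_threads
```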
The fact that each SM can host up to 2,048 resident threads (the maximum CUDA block size on that card is a separate, lower limit of 1,024) doesn't do much for the theoretical FLOPS. Only a fraction of those threads can be actively executing at a time; the others are idling or waiting on their memory requests. This hides a lot of the I/O latency.
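A back-of-envelope sketch of that latency-hiding argument, with illustrative (not measured) latency and issue numbers:

```python
# If each warp does some arithmetic, issues a memory load, and then
# stalls until the load returns, the scheduler needs enough other warps
# resident to keep issuing in the meantime. Both numbers are assumptions
# chosen for illustration, not measurements of any real GPU.

mem_latency = 400    # cycles for a DRAM access to return (assumed)
compute_cycles = 10  # arithmetic cycles a warp runs between loads (assumed)

# Roughly this many warps must be resident so one is always ready:
warps_to_hide_latency = mem_latency // compute_cycles
```

With these assumptions, 40 resident warps suffice; oversubscribing the SM with far more threads than can execute at once is what makes this work.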
Apple has always been really good at exploiting parallelism, which is a big part of how they get so much performance at lower power consumption.
The formal TFLOPS comparison, as a result, is most sensible between pre-M3 designs, the AMD 6000 series (RDNA 2), and Nvidia's 2000 series (Turing). After that it gets really murky: AMD's "TFLOPS" look nearly 2x more than they are actually worth by the standards of prior architectures, followed by Nvidia (some coefficient lower than 2, but still high), followed by the M3, which from the looks of it is basically 1.0x on this scale, so long as we're talking FP32 TFLOPS specifically as those are formally defined.
You can see this effect most easily by comparing the performance and TFLOPS of the AMD 6000 series and the Nvidia 3000 series. They were released at nearly the same time, but the AMD 6000 series is one generation before the "near-fake doubling", while Nvidia's 3000 series is the first generation with it: with a little effort you'll find GPUs between these two lines that perform very similarly (and have very similar DRAM bandwidth), but the Ampere counterpart has almost 2x the FP32 TFLOPS.
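For concreteness, here is the peak-FP32 arithmetic for two cards often used in that comparison. The lane counts and boost clocks are the published specs; the "on paper" doubling on Ampere comes from counting both halves of its dual FP32 datapath as lanes:

```python
# Peak FP32 TFLOPS = lanes x 2 ops per FMA x clock. The formula is the
# standard one; which units get counted as "lanes" is where the
# near-doubling comes from.

def peak_tflops(fp32_lanes: int, clock_ghz: float) -> float:
    return fp32_lanes * 2 * clock_ghz / 1000  # FMA = 2 ops/cycle/lane

# RX 6800 XT (RDNA 2): 4608 stream processors at ~2.25 GHz boost
rdna2 = peak_tflops(4608, 2.25)
# RTX 3080 (Ampere): 8704 "CUDA cores" at ~1.71 GHz boost
ampere = peak_tflops(8704, 1.71)
```

The two cards land close to each other in many raster benchmarks despite the RTX 3080's roughly 1.4x higher paper TFLOPS, which is the gap between nominal and "actually worth" FP32 throughput the comment describes.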
I don't really think we can, even if we knew exactly what is in an M3 GPU core, which we don't. Both architectures are very different, and different again from AMD GPUs. We have to count TFLOPS.
I watched the full video and thought it was excellent. I wish other CPU/GPU manufacturers made technical overview videos like this. I've never programmed graphics targeting metal before, but I feel much more inclined after watching this so I guess it was good advertising.
Have you dived into them?
There are advanced (for me) sessions like:
https://developer.apple.com/videos/play/wwdc2023/10127/
https://developer.apple.com/videos/play/wwdc2023/10042/
Although it's true that they don't go into hardware-level detail.
It will still be years before it is practical for Linux developers to target these features. Eventually, the rate of change in GPU design will slow and Linux will catch up once and for all. But it's hard not to drool over the hardware that proprietary OSes get to use today.
There is some hyperbole interjected about how incredible the performance is, but that's only in between the useful data. (I did chuckle at the… enthusiasm of the speaker, though.)
This isn't a technical document for GPU designers. Apple doesn't really need or want you to understand exactly how the implementation works, because that's basically a trade secret for them. This is aimed at letting app/game developers know how they should optimize for the new GPUs, since previously Apple just made some ambiguous remarks about some of this new technology ("Dynamic Caching") without explaining what they meant.
But yes, I do like how the Asahi folks tend to end up documenting a lot of how this hardware works. They also only have public information like this to start from, though, so talks like this are still useful for them.
Somewhere in the ballpark of 5:30-6:00 it describes the prior hardware design of Apple's shader core, and starting around 7:00 it goes into the hardware design of the new M3/A17 Pro shader core. It's actually surprisingly detailed; e.g., Nvidia's whitepapers provide less detail on the actual organization of their SMs.
"The new Max"? He clearly meant "the new Macs".
Kinda weird that Apple can't properly transcribe its own content.
I suspect they have the videos transcribed externally, and don’t check the transcription (or only do so in a cursory manner).
For YT vids, especially shorts, it's because churning out shorts/reels/tiktoks of clips from longer-form videos (and/or with split-screen gameplay of some mobile game or a Minecraft platforming run) is now a common tactic for trying to gain tons of views on your account for monetisation later.
[1] https://developer.apple.com/videos/play/tech-talks/111374?ti...
Not going to happen, what's more likely is a Proton-like layer above macOS APIs to simplify porting games over. Also see "Game Porting Toolkit" here: https://developer.apple.com/games/
https://forums.macrumors.com/threads/m1-m2-flickering-ghosti...
https://www.benq.com/en-us/knowledge-center/knowledge/how-to...
https://www.howtogeek.com/805459/mac-flickering-external-scr...
In fact, even in discrete GPUs, the display scanout engine is generally a nearly completely independent block relative to the rest of the GPU.
[1] https://gist.github.com/GetVladimir/c89a26df1806001543bef4c8...
One works perfectly fine and is automatically RGB. The other flickers, and when switched to RGB mode it turns lime green.
I wonder why this is not a problem with Intel Macs, and whether the M3 fixes these bugs.
Maybe it is Apple's way of selling more of their own monitors:
make sure other high-end brands do not work with macOS.
It does not make sense that this kind of bug has been active for 3 years.
edit: Found the article: https://www.ea.com/frostbite/news/circular-separable-convolu...
> Does this mean there, effectively, are no registers?
I can only point out, just for context, that if by any chance you're asking whether the registers are implemented as actual hardware-design "registers" - individually routed and individually accessible small strings of flip-flops or D-latches - then the history of the question is actually "it never was registers in the first place": architectural (ISA) registers in GPUs are implemented as a chunk of addressable, ported SRAM, with an address bus, a data bus, a limited number of simultaneous accesses, and limited bandwidth [1].
[1] see the diagram at https://www.renesas.com/us/en/products/memory-logic/multi-po...
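A toy model of that point, with made-up sizes and port counts: the "register file" is just an addressable array, and reads beyond the port budget get serialized over extra cycles, which is the banked/ported-SRAM behavior described above.

```python
# Toy model: GPU "registers" as a ported SRAM bank rather than
# individually wired flip-flops. Sizes and port counts are illustrative.

class RegisterFileBank:
    def __init__(self, num_regs: int, read_ports: int):
        self.storage = [0] * num_regs  # the addressable SRAM array itself
        self.read_ports = read_ports   # simultaneous reads served per cycle

    def read_cycles(self, addresses: list) -> int:
        """Cycles needed to serve all requested reads; requests beyond
        the port count spill into additional cycles (serialization)."""
        return -(-len(addresses) // self.read_ports)  # ceil division

bank = RegisterFileBank(num_regs=256, read_ports=2)
# An FMA needs 3 source operands, but this bank has only 2 read ports,
# so its operand reads take 2 cycles instead of 1.
fma_cycles = bank.read_cycles([4, 5, 6])
```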
Anyone else literally cannot compete: they don't have billions in pocket change that they don't know how to spend otherwise, so they'll have to wait until the exclusivity agreement expires.
Your parent comment's example is literally Google: world-class experts at burning money on developers producing a million dead-end products and abandoning them a year later.
If Google got some sensible leadership, focused on a few core products, and stuck with them for a decade, they'd have just as much money to spend. But "focus" and "Google" seem to have become opposites.
My point: the 'winning formula' of Apple is laser-sharp focus: have a few products, do them as well as anyone else or better, and only introduce a new product when it is mature-ish and very profitable. (We'll see how the Vision headset fits in here.)
So it's something they took advantage of after they grew (well, which company at their scale wouldn't ask for the best wholesale deals?), but not what made them big in the first place.
"Is it that hard to just copy the winning formula?"
Yes it is, thanks to IP law. And back in the day, Steve Jobs already wanted to go thermonuclear war on Samsung because he felt their flagship at the time was too close to the iPhone.
Apple has a style, and he speaks in that style very well, keeping the listener from drifting off.