"Discover new Metal profiling tools for M3 and A17 Pro"
https://developer.apple.com/videos/play/tech-talks/111374/
"Learn performance best practices for Metal shaders"
https://developer.apple.com/videos/play/tech-talks/111373/
"Bring your high-end game to iPhone 15 Pro"
These are two excellent posts that go deep on this:
The Shader Permutation Problem - Part 1: How Did We Get Here?
The Shader Permutation Problem - Part 2: How Do We Fix It?
In particular, the second post has the line:
We probably should not expect any magic workarounds for static register allocation: if a callable shader requires many registers, we can likely expect for the occupancy of the entire batch to suffer. It’s possible that GPUs could diverge from this model in the future, but that could come with all kinds of potential pitfalls (it’s not like they’re going to start spilling to a stack when executing thousands of pixel shader waves).
... And some kind of 'magic workaround for static register allocation' is pretty much what has been done.
https://therealmjp.github.io/posts/shader-permutations-part1...
https://therealmjp.github.io/posts/shader-permutations-part2...
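The occupancy cost the quote describes can be sketched numerically. A minimal sketch, assuming a hypothetical register-file size and hardware thread limit (loosely modeled on desktop-GPU figures, not any specific chip):

```python
# Why static register allocation limits occupancy: the register file is a
# fixed pool per SM/core, so the more registers each thread needs, the
# fewer threads can be resident at once. All numbers here are assumptions.

def max_resident_threads(regfile_regs_per_sm: int,
                         regs_per_thread: int,
                         hw_thread_limit: int) -> int:
    """Occupancy is capped by whichever runs out first:
    register-file capacity or the hardware's resident-thread limit."""
    by_registers = regfile_regs_per_sm // regs_per_thread
    return min(by_registers, hw_thread_limit)

# A light shader (32 regs/thread) vs. a heavy callable shader (128 regs):
light = max_resident_threads(65536, 32, 2048)   # hardware-limited
heavy = max_resident_threads(65536, 128, 2048)  # register-limited
```

With these assumed numbers, the heavy shader cuts the resident thread count to a quarter, which is the "occupancy of the entire batch suffers" effect the post describes.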
So how does one translate to an equivalent in "CUDA cores" type terminology?
For example, a GTX 1080 GPU has 20 streaming multiprocessors (SMs), each containing 128 cores, each of which supports 16 threads.
Meanwhile Apple describes the M1 GPU as having 8 cores, where “each core is split into 16 Execution Units, which each contain eight Arithmetic Logic Units (ALUs). In total, the M1 GPU contains up to 128 Execution units or 1024 ALUs, which Apple says can execute up to 24,576 threads simultaneously and which have a maximum floating point (FP32) performance of 2.6 TFLOPs.”
So one option to get a single number for a rough comparison is to count threads. The GTX 1080 supports 40,960 threads while the M1 supports 24,576 threads.
There's obviously a lot more to a GPU: varying clock speeds, ALUs with different capabilities, memory bandwidth, and so on. But at least counting threads gives a better idea of the processing bandwidth than talking about cores.
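The arithmetic behind those two thread counts, using only the per-unit figures quoted above (the M1 total is the number Apple states directly):

```python
# Resident-thread comparison from the figures quoted in this thread.
# These are the commenters' numbers, not re-verified against vendor specs.

gtx_1080_threads = 20 * 128 * 16   # SMs x cores/SM x threads/core
m1_threads = 24_576                # total quoted by Apple for the M1 GPU
ratio = gtx_1080_threads / m1_threads
```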
The fact that each SM can host up to 2,048 resident threads (the maximum CUDA block size on that card is a separate, lower limit of 1,024) doesn't do much for the theoretical FLOPS. Only a fraction of those threads can be actively executing at a time; the others are idling or waiting on their memory requests. This hides a lot of the I/O latency.
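A back-of-envelope sketch of that latency-hiding argument, with illustrative (not measured) latency and issue numbers:

```python
# If each warp does some arithmetic, issues a memory load, and then
# stalls until the load returns, the scheduler needs enough other warps
# resident to keep issuing in the meantime. Both numbers are assumptions
# chosen for illustration, not measurements of any real GPU.

mem_latency = 400    # cycles for a DRAM access to return (assumed)
compute_cycles = 10  # arithmetic cycles a warp runs between loads (assumed)

# Roughly this many warps must be resident so one is always ready:
warps_to_hide_latency = mem_latency // compute_cycles
```

With these assumptions, 40 resident warps suffice; oversubscribing the SM with far more threads than can execute at once is what makes this work.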
Apple has always been really good at exploiting parallelism, which is a big part of how they get so much performance at lower power consumption.
The formal TFLOPS comparison, as a result, is most sensible between pre-M3 designs, the AMD 6000 series (RDNA 2), and Nvidia's 2000 series (Turing). After that it gets really murky: AMD's "TFLOPS" look nearly 2x more than they are actually worth by the standards of prior architectures, followed by Nvidia (some coefficient lower than 2, but still high), followed by the M3, which from the looks of it is basically 1.0x on this scale, so long as we're talking FP32 TFLOPS specifically as those are formally defined.
You can see this effect most easily by comparing the performance and TFLOPS of the AMD 6000 series and the Nvidia 3000 series. They were released at nearly the same time, but the AMD 6000 series is one generation before the "near-fake doubling", while Nvidia's 3000 series is the first generation with it: with a little effort you'll find GPUs between these two lines that perform very similarly (and have very similar DRAM bandwidth), but the Ampere counterpart has almost 2x the FP32 TFLOPS.
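For concreteness, here is the peak-FP32 arithmetic for two cards often used in that comparison. The lane counts and boost clocks are the published specs; the "on paper" doubling on Ampere comes from counting both halves of its dual FP32 datapath as lanes:

```python
# Peak FP32 TFLOPS = lanes x 2 ops per FMA x clock. The formula is the
# standard one; which units get counted as "lanes" is where the
# near-doubling comes from.

def peak_tflops(fp32_lanes: int, clock_ghz: float) -> float:
    return fp32_lanes * 2 * clock_ghz / 1000  # FMA = 2 ops/cycle/lane

# RX 6800 XT (RDNA 2): 4608 stream processors at ~2.25 GHz boost
rdna2 = peak_tflops(4608, 2.25)
# RTX 3080 (Ampere): 8704 "CUDA cores" at ~1.71 GHz boost
ampere = peak_tflops(8704, 1.71)
```

The two cards land close to each other in many raster benchmarks despite the RTX 3080's roughly 1.4x higher paper TFLOPS, which is the gap between nominal and "actually worth" FP32 throughput the comment describes.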
I don't really think we can, even if we knew exactly what is in an M3 GPU core, which we don't. Both architectures are very different, and different again from AMD GPUs. We have to count TFLOPS.
I watched the full video and thought it was excellent. I wish other CPU/GPU manufacturers made technical overview videos like this. I've never programmed graphics targeting metal before, but I feel much more inclined after watching this so I guess it was good advertising.
Have you dived into them?
There are advanced (for me) sessions like:
https://developer.apple.com/videos/play/wwdc2023/10127/
https://developer.apple.com/videos/play/wwdc2023/10042/
Although it's true that they don't go into hardware-level detail.
It will still be years before it is practical for Linux developers to target these features. Eventually, the rate of change in GPU design will slow and Linux will catch up once and for all. But it's hard not to drool over the hardware that proprietary OSes get to use today.
There is some hyperbole interjected about how incredible the performance is, but that's only in between the useful data. (I did chuckle at the… enthusiasm of the speaker, though.)
This isn't a technical document for GPU designers. Apple doesn't really need or want you to understand exactly how the implementation works, because that's basically a trade secret for them. This is aimed at letting app/game developers know how they should optimize for the new GPUs, since previously Apple just made some ambiguous remarks about some of this new technology ("Dynamic Caching") without explaining what they meant.
But yes, I do like how the Asahi folks tend to end up documenting a lot of how this hardware works. They also only have public information like this to start from, though, so talks like this are still useful for them.
Somewhere in the ballpark of 5:30-6:00 it describes the prior hardware design of Apple's shader core, and starting around 7:00 it goes into the hardware design of the new M3/A17 Pro shader core. It's actually surprisingly detailed; e.g., Nvidia's whitepapers provide less detail on the actual organization of their SMs.
"The new Max"? He clearly meant "the new Macs".
Kinda weird that Apple can't properly transcribe its own content.
I suspect they have the videos transcribed externally, and don’t check the transcription (or only do so in a cursory manner).
For YT vids, especially shorts, it's because churning out shorts/reels/tiktoks of clips from longer-form videos (and/or with split-screen gameplay of some mobile game or a Minecraft platforming run) is now a common tactic for trying to gain tons of views on your account for monetisation later.
[1] https://developer.apple.com/videos/play/tech-talks/111374?ti...
Not going to happen, what's more likely is a Proton-like layer above macOS APIs to simplify porting games over. Also see "Game Porting Toolkit" here: https://developer.apple.com/games/
https://forums.macrumors.com/threads/m1-m2-flickering-ghosti...
https://www.benq.com/en-us/knowledge-center/knowledge/how-to...
https://www.howtogeek.com/805459/mac-flickering-external-scr...
In fact, even in discrete GPUs, the display scanout engine is generally a nearly completely independent block relative to the rest of the GPU.
[1] https://gist.github.com/GetVladimir/c89a26df1806001543bef4c8...
One works perfectly fine and is automatically RGB. The other flickers, and when switched to RGB mode it turns lime green.
I wonder why this is not a problem with Intel Macs, and whether the M3 fixes these bugs.
Maybe it is Apple's way of selling more of their own monitors:
make sure other high-end brands do not work with macOS.
It does not make sense that this kind of bug has been active for 3 years.
edit: Found the article: https://www.ea.com/frostbite/news/circular-separable-convolu...
> Does this mean there, effectively, are no registers?
I can only point out, just for context, that if by any chance you're asking whether the registers are implemented as actual hardware-design "registers" - individually routed and individually accessible small strings of flip-flops or D-latches - then the history of the question is actually "it never was registers in the first place": architectural (ISA) registers in GPUs are implemented as a chunk of addressable, ported SRAM, with an address bus, a data bus, a limited number of simultaneous accesses, and limited bandwidth [1].
[1] see the diagram at https://www.renesas.com/us/en/products/memory-logic/multi-po...
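A toy model of that point, with made-up sizes and port counts: the "register file" is just an addressable array, and reads beyond the port budget get serialized over extra cycles, which is the banked/ported-SRAM behavior described above.

```python
# Toy model: GPU "registers" as a ported SRAM bank rather than
# individually wired flip-flops. Sizes and port counts are illustrative.

class RegisterFileBank:
    def __init__(self, num_regs: int, read_ports: int):
        self.storage = [0] * num_regs  # the addressable SRAM array itself
        self.read_ports = read_ports   # simultaneous reads served per cycle

    def read_cycles(self, addresses: list) -> int:
        """Cycles needed to serve all requested reads; requests beyond
        the port count spill into additional cycles (serialization)."""
        return -(-len(addresses) // self.read_ports)  # ceil division

bank = RegisterFileBank(num_regs=256, read_ports=2)
# An FMA needs 3 source operands, but this bank has only 2 read ports,
# so its operand reads take 2 cycles instead of 1.
fma_cycles = bank.read_cycles([4, 5, 6])
```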
Anyone else literally cannot compete: they don't have billions in pocket change that they don't know how to spend otherwise, so they'll have to wait until the exclusivity agreement expires.
Your parent comment's example is literally Google: world-class experts at burning money on developers producing a million dead-end products and abandoning them a year later.
If Google got some sensible leadership, focused on a few core products, and stuck with them for a decade, they'd have just as much money to spend. But "focus" and "Google" seem to have become opposites.
My point: the 'winning formula' of Apple is laser-sharp focus: have a few products, do them as well as anyone else or better, and only introduce a new product when it is mature-ish and very profitable. (We'll see how the Vision headset fits in here.)
So it's something they took advantage of after they grew (well, which company at their scale wouldn't ask for the best wholesale deals?), but not what made them big in the first place.
"Is it that hard to just copy the winning formula?"
Yes it is, thanks to IP law. And back in the day, Steve Jobs already wanted to go thermonuclear war on Samsung because he felt their flagship at the time was too close to the iPhone.
Apple has a style, and he speaks in that style very well, keeping the listener from drifting off.