> First, you claimed that there was no L3 BW test.
I claimed that they did not provide figures for L3 cache bandwidth. They did not.
> Now, I am not even sure if you're trolling me or lacking knowledge or what at this point?
You should be grateful that a professional is taking time out of his day to explain things that you do not understand.
> Please do tell what you consider a "proper test of L3 cache"? And why do you consider their test invalid?
You cannot measure L3 cache performance by measuring bandwidth over a region of memory larger than the L3 cache. Once the working set exceeds L3, part of the traffic is served by DRAM, so what they ran was a partially cached test, and it does not necessarily reflect true L3 cache performance.
> I am curious because triggering 32 physical core threads to run over 32 independent chunks of data (totaling 3G and not 128M) seems like a pretty valid read BW experiment to me.
You just described a generic memory bandwidth test that does not test L3 cache bandwidth at all. Chips and Cheese's graphs plot bandwidth against working-set size to show the behavior of the memory hierarchy: when the test size exceeds the capacity of a given cache level, performance drops to that of the next level down. They ran the benchmark at many different sizes to get the points in their graph and connected them into a curve.
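To illustrate what such a sweep looks like, here is a minimal single-threaded sketch (my own illustration, not Chips and Cheese's harness; the buffer sizes, pass counts, and the simple sum loop are all assumptions):

```c
/* Working-set sweep: time repeated reads over buffers of increasing
 * size and watch measured bandwidth step down as each cache level is
 * exceeded. Compile with: cc -O2 sweep.c */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    /* Sweep from 16 KiB (fits in L1) to 1 GiB (well past any L3). */
    for (size_t bytes = 16 * 1024; bytes <= ((size_t)1 << 30); bytes *= 2) {
        size_t n = bytes / sizeof(uint64_t);
        uint64_t *buf = malloc(bytes);
        if (!buf) return 1;
        for (size_t i = 0; i < n; i++) buf[i] = i;

        /* Read ~4 GiB in total at each size so timings are stable. */
        size_t passes = ((size_t)4 << 30) / bytes;
        volatile uint64_t sink = 0;
        double t0 = now();
        for (size_t p = 0; p < passes; p++) {
            uint64_t acc = 0;
            for (size_t i = 0; i < n; i++) acc += buf[i];
            sink += acc;  /* keep the loop from being optimized away */
        }
        double secs = now() - t0;
        printf("%9zu KiB: %7.2f GB/s\n", bytes / 1024,
               (double)bytes * passes / secs / 1e9);
        free(buf);
    }
    return 0;
}
```

On real hardware, the flat regions of the resulting curve correspond to L1, L2, L3 and DRAM, with transitions near the cache capacities, which is exactly the shape of their graphs.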
> Right, but you missed the part that former is configured for the server market and the latter for the client market. Two different things, two different chips, different memory controllers if you wish. That's why you cannot compare one to each other directly without caveats.
The Xeon Max chips, with their HBM2e memory, are the one place where 2 AVX-512 loads per cycle could be expected to be useful, but due to internal bottlenecks, they are not.
Also, for what it is worth, Intel treats AVX-512 as a server-only feature these days, so if you are talking about Intel CPUs and AVX-512, you are talking about servers.
> But that's what you said trying to refute the fact why Intel was in a lead over AMD up until zen5? You're claiming that AVX-512 workloads and load-store BW are largely irrelevant because CPUs are anyway bottlenecked by the system memory bandwidth.
I never claimed AVX-512 workloads were irrelevant. I claimed doing more than 1 load per cycle on AVX-512 was not very useful for performance.
Intel losing its lead in the desktop space to AMD is due to entirely different reasons than how many AVX-512 loads per cycle AMD hardware can do. This is obvious when you consider that most desktop workloads do not touch AVX-512. Certainly, no desktop workloads on Intel CPUs touch AVX-512 these days because Intel no longer ships AVX-512 support on desktop CPUs.
To be clear, when you can use AVX-512, it is useful, but the ability to do 2 loads per cycle does not add to the usefulness very much.
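To put rough numbers on why (the clock speed here is an assumption for illustration): at ~4 GHz, a single 64-byte AVX-512 load per cycle already demands 4 × 10⁹ × 64 B ≈ 256 GB/s from one core, while a dual-channel DDR5-5600 desktop platform delivers roughly 90 GB/s for the entire chip. Even one load per cycle can only be sustained out of cache, so a second load port pays off only in the minority of kernels whose working sets stay cache resident.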
> I am all ears to hear what datacenter workloads you have in mind that are CPU-bound?
This is not a well-formed question. See my remarks further down in this reply, where I address your fabricated 99% figure, for the reason why.
> Any workload besides the most simplest one is at some point bound by the memory BW.
Simple workloads are bottlenecked by memory bandwidth (e.g. BLAS levels 1 and 2). Complex workloads are bottlenecked by compute (e.g. BLAS level 3). A compiler, for example, is compute bound, not memory bound.
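The distinction comes down to arithmetic intensity. A dot product (BLAS level 1) performs about 2n flops while reading 16n bytes, roughly 0.125 flops per byte, so no amount of compute helps; it is memory bound. A matrix multiplication (BLAS level 3) performs about 2n³ flops on 3n² values, so its flops-per-byte ratio grows with n, and with cache blocking it becomes compute bound.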
> You're contradicting your own claims by saying that cache is there to hide (cut) the latency but then you continue to say that this is irrelevant. Not sure what else to say here.
There is no contradiction. The cache is there to hide latency. The TACC explanation of how queuing theory applies to CPUs makes it very obvious that memory bandwidth is inversely proportional to memory access times, which is why the cache has more bandwidth than system RAM. The extra bandwidth is a side effect of the actual purpose, which is to reduce memory latency, and that in turn is an attempt to reduce the von Neumann bottleneck.
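In queuing terms, this is Little's law: sustained bandwidth ≈ outstanding requests × request size / latency. As a rough illustration (the specific numbers are assumptions, not measurements): a core that can keep 16 cache line fills of 64 B in flight against ~80 ns of DRAM latency can sustain at most 16 × 64 B / 80 ns ≈ 13 GB/s, while the same 16 requests against a ~10 ns cache hit sustain over 100 GB/s. Cutting latency directly buys bandwidth.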
To give a concrete example, consider linked lists. Traversing a linked list requires walking scattered memory locations: you have a pointer to the first item, and you cannot go to the second item without reading the first. This is really slow. If the list is accessed frequently enough to stay in cache, the cache will hide the access times and make traversal much faster.
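Here is a hypothetical sketch of that effect (the sizes and node layout are my own assumptions): nodes are linked in random order across a buffer much larger than L3, so every hop is a dependent load that pays close to full memory latency.

```c
/* Pointer chase: each load depends on the previous one, so traversal
 * speed is set by memory latency, not bandwidth. Shrink n so the list
 * fits in cache and the same loop runs many times faster. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct node { struct node *next; char pad[56]; }; /* one 64 B line per node */

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    size_t n = (size_t)1 << 24;             /* 16M nodes = 1 GiB, past L3 */
    struct node *nodes = malloc(n * sizeof *nodes);
    size_t *order = malloc(n * sizeof *order);
    if (!nodes || !order) return 1;
    for (size_t i = 0; i < n; i++) order[i] = i;
    for (size_t i = n - 1; i > 0; i--) {    /* shuffle (biased rand(), fine for a sketch) */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i + 1 < n; i++)      /* link nodes in shuffled order */
        nodes[order[i]].next = &nodes[order[i + 1]];
    nodes[order[n - 1]].next = &nodes[order[0]];

    struct node *p = &nodes[order[0]];
    double t0 = now();
    for (size_t i = 0; i < n; i++) p = p->next;  /* the dependent load chain */
    double secs = now() - t0;
    printf("%.1f ns per hop (%p)\n", secs / n * 1e9, (void *)p);
    return 0;
}
```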
> 99% of the datacenter machines are not attached to the GPU. Does that mean that 99% of datacenter workloads are not "truly exceptional" for whatever the definition of that formulation and they are therefore mostly CPU bound?
99% is a number you fabricated. Asking if something is CPU bound only makes sense when you have a GPU or some other accelerator attached to the CPU that needs to wait on commands from the CPU. When there is no such thing, asking if it is CPU bound is nonsensical. People instead discuss being compute bound, memory bandwidth bound or IO bound. Technically, there are three ways to be IO bound: memory, storage and network. Since I was already discussing memory bandwidth bound workloads, my inclusion of IO bound as a category refers to the other two subcategories.
By the way, while memory bandwidth bound workloads are better run on GPUs than CPUs, that does not mean all workloads on GPUs are memory bandwidth bound. Compute bound workloads with minimal branching are better done on GPUs than CPUs too.