> First, you claimed that there was no L3 BW test.
I claimed that they did not provide figures for L3 cache bandwidth. They did not.
> Now, I am not even sure if you're trolling me or lacking knowledge or what at this point?
You should be grateful that a professional is taking time out of his day to explain things that you do not understand.
> Please do tell what you consider a "proper test of L3 cache"? And why do you consider their test invalid?
You cannot measure L3 cache performance by measuring bandwidth over a region of memory larger than the L3 cache. Once the working set exceeds L3, part of the traffic is served by DRAM, so what they ran was a partially cached test, and it does not necessarily reflect true L3 cache performance.
> I am curious because triggering 32 physical core threads to run over 32 independent chunks of data (totaling 3G and not 128M) seems like a pretty valid read BW experiment to me.
You just described a generic memory bandwidth test that does not test L3 cache bandwidth at all. Chips and Cheese's graphs plot bandwidth against working-set size to show the behavior of the memory hierarchy: when the test size exceeds the capacity of a given cache level, performance drops to that of the next level down. They ran the benchmark at many different sizes to get the points in their graph and connected them into a curve.
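To illustrate what such a sweep looks like, here is a minimal single-threaded sketch (my own illustration, not Chips and Cheese's harness; the buffer sizes, pass counts, and the simple sum loop are all assumptions):

```c
/* Working-set sweep: time repeated reads over buffers of increasing
 * size and watch measured bandwidth step down as each cache level is
 * exceeded. Compile with: cc -O2 sweep.c */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    /* Sweep from 16 KiB (fits in L1) to 1 GiB (well past any L3). */
    for (size_t bytes = 16 * 1024; bytes <= ((size_t)1 << 30); bytes *= 2) {
        size_t n = bytes / sizeof(uint64_t);
        uint64_t *buf = malloc(bytes);
        if (!buf) return 1;
        for (size_t i = 0; i < n; i++) buf[i] = i;

        /* Read ~4 GiB in total at each size so timings are stable. */
        size_t passes = ((size_t)4 << 30) / bytes;
        volatile uint64_t sink = 0;
        double t0 = now();
        for (size_t p = 0; p < passes; p++) {
            uint64_t acc = 0;
            for (size_t i = 0; i < n; i++) acc += buf[i];
            sink += acc;  /* keep the loop from being optimized away */
        }
        double secs = now() - t0;
        printf("%9zu KiB: %7.2f GB/s\n", bytes / 1024,
               (double)bytes * passes / secs / 1e9);
        free(buf);
    }
    return 0;
}
```

On real hardware, the flat regions of the resulting curve correspond to L1, L2, L3 and DRAM, with transitions near the cache capacities, which is exactly the shape of their graphs.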
> Right, but you missed the part that former is configured for the server market and the latter for the client market. Two different things, two different chips, different memory controllers if you wish. That's why you cannot compare one to each other directly without caveats.
The Xeon Max chips, with their HBM2e memory, are the one place where 2 AVX-512 loads per cycle could be expected to be useful, but due to internal bottlenecks, they are not.
Also, for what it is worth, Intel treats AVX-512 as a server-only feature these days, so if you are talking about Intel CPUs and AVX-512, you are talking about servers.
> But that's what you said trying to refute the fact why Intel was in a lead over AMD up until zen5? You're claiming that AVX-512 workloads and load-store BW are largely irrelevant because CPUs are anyway bottlenecked by the system memory bandwidth.
I never claimed AVX-512 workloads were irrelevant. I claimed doing more than 1 load per cycle on AVX-512 was not very useful for performance.
Intel losing its lead in the desktop space to AMD is due to entirely different reasons than how many AVX-512 loads per cycle AMD hardware can do. This is obvious when you consider that most desktop workloads do not touch AVX-512. Certainly, no desktop workloads on Intel CPUs touch AVX-512 these days because Intel no longer ships AVX-512 support on desktop CPUs.
To be clear, when you can use AVX-512, it is useful, but the ability to do 2 loads per cycle does not add to the usefulness very much.
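To put rough numbers on why (the clock speed here is an assumption for illustration): at ~4 GHz, a single 64-byte AVX-512 load per cycle already demands 4 × 10⁹ × 64 B ≈ 256 GB/s from one core, while a dual-channel DDR5-5600 desktop platform delivers roughly 90 GB/s for the entire chip. Even one load per cycle can only be sustained out of cache, so a second load port pays off only in the minority of kernels whose working sets stay cache resident.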
> I am all ears to hear what datacenter workloads you have in mind that are CPU-bound?
This is not a well-formed question. See my remarks further down in this reply, where I address your fabricated 99% figure, for the reason why.
> Any workload besides the most simplest one is at some point bound by the memory BW.
Simple workloads are bottlenecked by memory bandwidth (e.g. BLAS levels 1 and 2). Complex workloads are bottlenecked by compute (e.g. BLAS level 3). A compiler, for example, is compute bound, not memory bound.
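The distinction comes down to arithmetic intensity. A dot product (BLAS level 1) performs about 2n flops while reading 16n bytes, roughly 0.125 flops per byte, so no amount of compute helps; it is memory bound. A matrix multiplication (BLAS level 3) performs about 2n³ flops on 3n² values, so its flops-per-byte ratio grows with n, and with cache blocking it becomes compute bound.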
> You're contradicting your own claims by saying that cache is there to hide (cut) the latency but then you continue to say that this is irrelevant. Not sure what else to say here.
There is no contradiction. The cache is there to hide latency. The TACC explanation of how queuing theory applies to CPUs makes it very obvious that memory bandwidth is inversely proportional to memory access times, which is why the cache has more bandwidth than system RAM. The extra bandwidth is a side effect of the actual purpose, which is to reduce memory latency, and that in turn is an attempt to reduce the von Neumann bottleneck.
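In queuing terms, this is Little's law: sustained bandwidth ≈ outstanding requests × request size / latency. As a rough illustration (the specific numbers are assumptions, not measurements): a core that can keep 16 cache line fills of 64 B in flight against ~80 ns of DRAM latency can sustain at most 16 × 64 B / 80 ns ≈ 13 GB/s, while the same 16 requests against a ~10 ns cache hit sustain over 100 GB/s. Cutting latency directly buys bandwidth.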
To give a concrete example, consider linked lists. Traversing a linked list requires walking scattered memory locations: you have a pointer to the first item, and you cannot go to the second item without reading the first. This is really slow. If the list is accessed frequently enough to stay in cache, the cache will hide the access times and make traversal much faster.
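Here is a hypothetical sketch of that effect (the sizes and node layout are my own assumptions): nodes are linked in random order across a buffer much larger than L3, so every hop is a dependent load that pays close to full memory latency.

```c
/* Pointer chase: each load depends on the previous one, so traversal
 * speed is set by memory latency, not bandwidth. Shrink n so the list
 * fits in cache and the same loop runs many times faster. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct node { struct node *next; char pad[56]; }; /* one 64 B line per node */

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    size_t n = (size_t)1 << 24;             /* 16M nodes = 1 GiB, past L3 */
    struct node *nodes = malloc(n * sizeof *nodes);
    size_t *order = malloc(n * sizeof *order);
    if (!nodes || !order) return 1;
    for (size_t i = 0; i < n; i++) order[i] = i;
    for (size_t i = n - 1; i > 0; i--) {    /* shuffle (biased rand(), fine for a sketch) */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i + 1 < n; i++)      /* link nodes in shuffled order */
        nodes[order[i]].next = &nodes[order[i + 1]];
    nodes[order[n - 1]].next = &nodes[order[0]];

    struct node *p = &nodes[order[0]];
    double t0 = now();
    for (size_t i = 0; i < n; i++) p = p->next;  /* the dependent load chain */
    double secs = now() - t0;
    printf("%.1f ns per hop (%p)\n", secs / n * 1e9, (void *)p);
    return 0;
}
```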
> 99% of the datacenter machines are not attached to the GPU. Does that mean that 99% of datacenter workloads are not "truly exceptional" for whatever the definition of that formulation and they are therefore mostly CPU bound?
99% is a number you fabricated. Asking if something is CPU bound only makes sense when you have a GPU or some other accelerator attached to the CPU that needs to wait on commands from the CPU. When there is no such thing, asking if it is CPU bound is nonsensical. People instead discuss being compute bound, memory bandwidth bound or IO bound. Technically, there are three ways to be IO bound: memory, storage and network. Since I was already discussing memory bandwidth bound workloads, my inclusion of IO bound as a category refers to the other two subcategories.
By the way, while memory bandwidth bound workloads are better run on GPUs than CPUs, that does not mean all workloads on GPUs are memory bandwidth bound. Compute bound workloads with minimal branching are better done on GPUs than CPUs too.