On WSE-3s, however, there is enough on-chip memory that the model weights can be stored entirely on-chip, provided you have a sufficient number of them; 20 are enough for some of the largest open models.
Depending on how it's set up, this lets the available logic spend its time actually doing computation instead of loading and unloading weights, which can potentially make such a system much more efficient than a GPU.
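A quick sanity check of the "fits on-chip" claim can be sketched as follows. The 44 GB of SRAM per WSE-3 is Cerebras' published figure; the model size is an illustrative assumption (a ~400B-parameter model at 16-bit weights), not a measured number.

```python
# Rough check: do the weights fit in the aggregate on-chip SRAM of 20 wafers?
# 44 GB/wafer is Cerebras' published WSE-3 spec; the model size is an
# illustrative assumption, not a measurement.

sram_per_wafer_gb = 44
wafers = 20
total_sram_gb = sram_per_wafer_gb * wafers     # aggregate SRAM across the cluster

params_billion = 400        # e.g. a ~400B-parameter open model (assumption)
bytes_per_param = 2         # 16-bit weights
weights_gb = params_billion * bytes_per_param  # total weight storage needed

print(f"aggregate SRAM: {total_sram_gb} GB, weights: {weights_gb} GB, "
      f"fits: {weights_gb <= total_sram_gb}")
```

With these assumptions the weights (800 GB) just squeeze into the 880 GB of aggregate SRAM, which is why roughly 20 wafers are needed rather than one or two.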
It doesn't matter whether Mistral are small fish or not. I don't agree that they are, but either way, they are experts and very capable people. They didn't choose Cerebras to be different; they chose it because they believe it's the best way to do inference.
If you do the math, you will find that Cerebras loses in all of them. They need 460 kW across 20 CS-3 nodes to do inference for Llama 4 Maverick, while a single DGX B200 node needs only 14.4 kW. If you buy 32 DGX B200 nodes so that power consumption is the same and naively give each a full copy of the model, you get 32,000 T/sec aggregate at batch size 1, while the 20-node CS-3 cluster gets only 2,500 T/sec aggregate at batch size 1. That is having spent only $16 million for the 32 DGX B200 nodes versus $40 million for the 20 CS-3 nodes.

Each DGX B200 node has 1.4 TB of memory, while the entire CS-3 cluster has only 880 GB, so the CS-3 cluster will run out of memory as you scale the batch size and context length. If you buy another 15 CS-3 nodes, you could match the memory of a single DGX B200 node, but then you could just store partial models on each DGX B200, the way Cerebras stores partial models on each CS-3, and suddenly you have more memory to scale to higher batch sizes on the Nvidia hardware. At some point you will likely become compute bound and unable to keep scaling the batch size, but that is hard to predict without actually testing it. The prediction of what the CS-3 could do based on advertised memory bandwidth was off by a factor of >1000 when given real data, so it seems reasonable to think that its real-world compute will similarly fall well below the theoretical capability.
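The comparison above reduces to a short back-of-envelope calculation. All figures are taken from the paragraph above (vendor-advertised or estimated numbers), not independently measured:

```python
# Back-of-envelope comparison using the figures cited above.
# All numbers come from the comment (vendor specs / estimates),
# not independent measurements.

# Nvidia setup: 32 DGX B200 nodes, each with a full model copy
dgx_nodes = 32
dgx_power_kw = 14.4 * dgx_nodes   # ~460 kW, matching the CS-3 cluster
dgx_cost_m = 16                   # $16M total
dgx_tok_s = 32_000                # aggregate tokens/sec at batch size 1
dgx_mem_tb = 1.4 * dgx_nodes      # 1.4 TB per node

# Cerebras setup: 20 CS-3 nodes sharing one model
cs3_power_kw = 460
cs3_cost_m = 40                   # $40M total
cs3_tok_s = 2_500                 # aggregate tokens/sec at batch size 1
cs3_mem_tb = 0.88                 # 880 GB aggregate on-chip SRAM

print(f"tokens/sec per kW:  DGX {dgx_tok_s / dgx_power_kw:.1f}  "
      f"CS-3 {cs3_tok_s / cs3_power_kw:.1f}")
print(f"tokens/sec per $M:  DGX {dgx_tok_s / dgx_cost_m:.0f}  "
      f"CS-3 {cs3_tok_s / cs3_cost_m:.1f}")
print(f"memory: DGX {dgx_mem_tb:.1f} TB  vs  CS-3 {cs3_mem_tb:.2f} TB")
```

On these numbers the DGX cluster wins every ratio: roughly an order of magnitude more tokens per kW, far more tokens per dollar, and tens of times more memory for batch and context scaling.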
Note that my numbers for power consumption were from Cerebras:
https://www.cerebras.ai/blog/cerebras-cs-3-vs-nvidia-b200-20...
Interestingly, Cerebras' peak figure for the DGX B200 is based on its power supplies, and is 0.1 kW higher than Nvidia's own specification, which puts it at 14.3 kW:
https://docs.nvidia.com/dgx/dgxb200-user-guide/introduction-...
PSU peak output is always in excess of the maximum power the hardware can actually draw, but since I did not know how Cerebras determined their 23 kW per-node figure for the CS-3, I used Cerebras' figure for Nvidia's hardware as well, even though I know it is unrealistically high. This likely handicapped Nvidia in the comparison, so reality is even more in Nvidia's favor.
Calling Cerebras' hardware the best way of doing inference is ridiculous. We are talking about doing mostly linear algebra; there is no single best way of doing it. Pointing at Mistral to say that Cerebras has the best way is an absurd appeal to authority. None of the major players use Cerebras, since it is incapable of handling their needs. The instant responses are nice and give Mistral a way to differentiate itself, but their models are not as good as those from others and few people use them, which is why Cerebras has the capacity to handle their needs.
From a historical standpoint, Cerebras is very similar to Thinking Machines Corporation, which went out of business after 11 years because it could not secure business when a market downturn hit. Cerebras is hemorrhaging money and is only in business because they found investors willing to cover their losses. Once they run out of people willing to give them money (likely during the next AI winter), they will become insolvent, no matter how good their technology is. When the next AI winter hits, Mistral will likely become insolvent too, since they are similarly hemorrhaging money and are only in business because they found investors willing to cover their losses.
By the way, you are lecturing someone who actually has worked on code for doing inference:
I will have to think through your comment, but won't be able to do so properly this month.