4T transistors, one giant chip (Cerebras WSE-3) [video] (opens in new tab)

(youtube.com)

118 pointsasdfasdf12y ago86 comments

86 comments

According to the company, the new chip will enable training of AI models with up to 24 trillion parameters. Let me repeat that, in case you're as excited as I am: 24. Trillion. Parameters. For comparison, the largest AI models currently in use have around 0.5 trillion parameters, around 48x times smaller.

Each parameter is a connection between artificial neurons. For example, inside an AI model, a linear layer that transforms an input vector with 1024 elements to an output vector with 2048 elements has 1024×2048 = ~2M parameters in a weight matrix. Each parameter specifies by how much each element in the input vector contributes to or subtracts from each element in the output vector. Each output vector element is a weighted sum (AKA a linear combination), of each input vector element.

A human brain has an estimated 100-500 trillion synapses connecting biological neurons. Each synapse is quite a complicated biological structure[a], but if we oversimplify things and assume that every synapse can be modeled as a single parameter in a weight matrix, then the largest AI models in use today have approximately 100T to 500T ÷ 0.5T = 200x to 1000x fewer connections between neurons that the human brain. If the company's claims prove true, this new chip will enable training of AI models that have only 4x to 20x fewer connections that the human brain.

We sure live in interesting times!

---

[a] https://en.wikipedia.org/wiki/Synapse

mlyle2y ago

> but if we oversimplify things and assume that every synapse can be modeled as a single parameter in a weight matrix

Which, it probably can't... but offsetting those simplifications and 4-20x difference is the massive difference in how quickly those synapses can be activated.

topspin2y ago

> that have only 4x to 20x fewer connections that the human brain

So only 4-20 of these systems are necessary to match the human brain. No?

ipsum22y ago

Fun fact, I can also train a 24 trillion parameter model on my laptop! Just need to offload weights to the cloud every layer.

...

It's meaningless to say something can train a model that has 24 trillion parameters without specifying the dataset size and time it takes to train.

cryptonector2y ago

I dare say this thing will be many times faster than thrashing your 24T parameters to the cloud.

ipsum22y ago

Yeah, but it'll be slower than the equivalent Nvidia GPU cluster.

brucethemoose22y ago

Reposting the CS-2 teardown in case anyone missed it. The thermal and electrical engineering is absolutely nuts:

https://vimeo.com/853557623

https://web.archive.org/web/20230812020202/https://www.youtu...

(Vimeo/Archive because the original video was taken down from YouTube)

bsder2y ago

As always, IBM did it first: https://www.righto.com/2021/03/logic-chip-teardown-from-vint...

bitwrangler2y ago

20,000 amps

200,000 electrical contacts

850,000 cores

and that's the "old" one. wow.

Keyframe2y ago

This is something I'm clueless about and can't really understand. They say this is 24kW hungry. How does CPU power consumption really work on electrical level, what warrants that much power, even for regular CPUs? Like, from the basics level.. is it resistance of the material with frequency of switching or what is really going on there? Where does the power go on such a relatively small surface?

edit: thanks people, makes sense now!

magicalhippo2y ago

Modern CPUs are built using CMOS MOSFET transistors[1]. The gate, which controls if the transistor conducts or not, is effectively a small capacitor. The gate capacitor has to be charged up for the transistor to conduct[2], ie you have to stuff some electrons into it to turn the transistor on.

Once you've done that, the transistor is on until the gate capacitor is discharged. This requires getting rid of the electrons you stuffed into it. The easiest is to just connect the gate to ground, essentially throwing the electrons away.

So for each time the transistor goes through an on-off cycle, you need to "spend" some electrons, which in turn need to be supplied from the power supply. Thus higher frequency means more current just from more on-off cycles per second.

There's also resistive losses and leakage currents and such.

Now in theory I suppose you could recycle some of these electrons (using a charge pump arrangement[3]), reducing the overall demand. But that would require relatively large capacitors, and on-chip capacitors take a lot of chip area which could have been used for many transistors instead.

[1]: https://en.wikipedia.org/wiki/CMOS

[2]: https://en.wikipedia.org/wiki/MOSFET#Modes_of_operation

[3]: https://en.wikipedia.org/wiki/Charge_pump

Workaccount22y ago

It simply takes a non-zero amount of energy to turn a transistor on and off.

Add up trillions of transistors, flicking on and off billions of times a second, and you get enormous power draws.

What is actually drawing power is the gate capacitance of the transistors. If the transistor were a physical switch, the gate capacitance is the "weight" that must be put on the switch to flip it. Of course this weight gets smaller as the switches shrink and as the tech improves, but it will always be non-zero.

None of this accounts for resistive losses either, which is just the cost of doing business for a CPU.

danbruc2y ago

Yes, the power consumption comes from the resistance of the circuit. In case of CMOS circuits there would ideally be no current flow when no signal changes but transistors are not perfect and have leakage currents. When signals change, primarily triggered by the clock rising or falling, there is a short time in which the supply rail and ground rail are essentially shorted out.

Each gate has logically two transistors of which exactly one is always conducting, either connecting the output to the supply rail making the output a one, or connecting the output to the ground rail making the output a zero. When the output of the gate changes, both transistors have to switch in order to connect the output to the other rail than before. While this happens both transistors are conducting at the same time allowing current to flow from the supply rail to the ground rail.

In addition to that the input capacitances of subsequent gates get charged from the supply rail when the output goes high and discharged into the ground rail when input goes low. So every signal change pumps some charge from the supply rail through the input capacitances to the ground rail.

1 more reply

bgnn2y ago

a lot of good answers. just couple points to add: - Charging a capacitor from 0 to Vdd volts takes E=CVdd^2 amount of energy. To find the power (energy/second) you need to multiply this with how frequently you do this, i.e. clock frequency. So, P=FCVdd^2. So most digitak circuits power scales linearly with frequency, and quadratically with power supply voltage Vdd.

- Half of themis power is resistive loses. This is kinda a fundamental thing. Charging a capacitor with a resistor is "lossy".

- Some of the charge is reused within the circuits

magicalhippo2y ago

Just to get some sense of perspective, the Zen 4-based Ryzen 7950X3D is built on TMSC's 5nm node and is listed[1] as being two 71mm^2 dies. The 5nm node uses a 300mm wafer[2], which means roughly 900 dies or 450 7950X3D's on one wafer, for a total of 8 trillion transistors.

The peak power average of the 7950X3D is roughly 150W[3], which means if you could somehow run all 450 CPUs (900 dies) at peak, they'd consume around 68kW.

edit: I forgot about the IO die which contains the memory controller, so that will suck some power as well. So if we say 50W for that and 50W for the CPU dies, that's 45kW.

That's assuming you get a "clean wafer" with all dies working, not "just" the 80% yield or so.

[1]: https://www.techpowerup.com/cpu-specs/ryzen-9-7950x3d.c3024

[2]: https://www.anandtech.com/show/15219/early-tsmc-5nm-test-chi...

[3]: https://www.tomshardware.com/reviews/amd-ryzen-9-7950x3d-cpu...

brucethemoose22y ago

And even mundane details are so difficult... thermal expansion, the sheer number of pins that have to line up. This thing is a marvel.

Rexxar2y ago

It seems the vimeo video has been removed now too.

dougmwne2y ago

I want this woman running my postAI-apocalypse hardware research lab.

nachexnachex2y ago

She's got strong merit to be spared by the Roko's basilisk ai.

fxj2y ago

It has its own programming language CSL

https://www.cerebras.net/blog/whats-new-in-r0.6-of-the-cereb...

"CSL allows for compile time execution of code blocks that take compile-time constant objects as input, a powerful feature it inherits from Zig, on which CSL is based. CSL will be largely familiar to anyone who is comfortable with C/C++, but there are some new capabilities on top of the C-derived basics."

https://github.com/Cerebras/csl-examples

rbanffy2y ago

It is oddly reminiscent of the Thinking Machines CM-1/2 series, but with CSL as the official language instead of Lisp.

And far fewer blinking lights.

GaryNumanVevo2y ago

Due to their architecture (non-stationary dataflow) their compiler is approaching something close to a place and route FPGA algorithm

RetroTechie2y ago

If you were to add up all transistors fabricated worldwide, up until <year>, such that total roughly matches the # on this beast, what year would you arrive? Hell, throw in discrete transistors if you want.

How many early supercomputers / workstations etc would that include? How much progress did humanity make using all those early machines (or any transistorized device!) combined?

itishappy2y ago

Rough guess: mid 1980s

4004 from the 1970s used 2300 transistors, so it would have needed to sell billions.

Pentium from 1990s had 3M transistors, so it could hit our target by selling a million units.

I'm betting (without much research) that the Pentium line alone sold millions, and the industry as a whole could hit those numbers about 5 years earlier.

ortusdux2y ago

Not trying to sound critical, but is there a reason to use 4B,000 vs 4T?

wincy2y ago

My guess is the titles get auto adjusted by Hacker News, but the script that does it doesn’t have logic for a trillion and only goes up to a billion, hence the weirdness of a string match and replace

yalok2y ago

because Billion is ambiguous -

Quote:

Billion is a word for a large number, and it has two distinct definitions:

1,000,000,000, i.e. one thousand million, or 10^9 (ten to the ninth power), as defined on the short scale. This is now the most common sense of the word in all varieties of English; it has long been established in American English and has since become common in Britain and other English-speaking countries as well.

1,000,000,000,000, i.e. one million million, or 10^12 (ten to the twelfth power), as defined on the long scale. This number is the historical sense of the word and remains the established sense of the word in other European languages. Though displaced by the short scale definition relatively early in US English, it remained the most common sense of the word in Britain until the 1950s and still remains in occasional use there.

https://en.wikipedia.org/wiki/Billion

bee_rider2y ago

It looks like it has now been switched to just have the number. I wonder if there was just some auto-formatting error.

geph20212y ago

original title is:

"4,000,000,000,000 Transistors, One Giant Chip (Cerebras WSE-3)"

So I guess they're trying to stay true to it.

ortusdux2y ago

There are times where diverting from normal conventions make sense. The average consumer might not know that 1Gb/s is faster than 750Mb/s. That being said, I don't think I've ever seen anything along the lines of 1G,000b/s.

gosub1002y ago

at least a trillion means a trillion. Unlike the "tebi vs tera"-byte marketing-speak in storage and ram.

leptons2y ago

Made you click!

imbusy1112y ago

I wish they dug into how this monstrosity is powered. Assuming 1V and 24kW, that's 24kAmps.

crotchfire2y ago

Backside power delivery using through-silicon vias.

I'm sure those TSVs connect to a huge array of switching power supplies, so the 24kW doesn't travel very far at such low voltages.

geph20212y ago

and cooling!

Imagine the heat sink on that thing. Would look like a cast-iron Dutch oven :)

tivert2y ago

The one of the videos posted here gets into that: https://news.ycombinator.com/item?id=39693930: https://vimeo.com/853557623

geph20212y ago

Thanks for sharing! Very interesting.

"We call it the engine block because it somewhat resembles a three cylinder motorcycle engine"

(referring to the power distribution and cooling)

jandrese2y ago

24kW translates to like 32hp, so you could imagine this thing with a liquid cooling loop hooked up to something that looks like a car radiator.

dist-epoch2y ago

24kW is on the lower end of a home heating gas boiler.

1 more reply

MenhirMike2y ago

Finally, a chip that outmatches the Noctua NH-D15!

asdfasdf1OP2y ago

https://www.cerebras.net/press-release/cerebras-announces-th...

https://www.cerebras.net/product-chip/

Rexxar2y ago

Is there a reason it's not roughly a disc if they use the whole wafer ? They could have 50% more surface.

terafo2y ago

To quote their official response "If the WSE weren't rectangular, the complexity of power delivery, I/O, mechanical integrity and cooling become much more difficult, to the point of impracticality.".

londons_explore2y ago

Packaging method can't handle non-square dies?

modeless2y ago

As I understand it, WSE-2 was kind of handicapped because its performance could only really be harnessed if the neural net fit in the on-chip SRAM. Bandwidth to off-chip memory (normalized to FLOPS) was not as high as Nvidia. Is that improved with WSE-3? Seems like the SRAM is only 10% bigger, so that's not helping.

In the days before LLMs 44 GB of SRAM sounded like a lot, but these days it's practically nothing. It's possible that novel architectures could be built for Cerebras that leverage the unique capabilities, but the inaccessibility of the hardware is a problem. So few people will ever get to play with one that it's unlikely new architectures will be developed for it.

txyx3032y ago

That was more of a WSE-1 problem maybe? They switched to a new compute paradigm (details on their site if you look up "weight streaming") where they basically store the activation on the wafer instead of the whole model. For something very large (say, 32K context and 16k hidden dimension) this would make an activation layer only 1-2GB (16 bit or 32 bit). As I understand it, this was one of the key changes needed to go from single system boxes to these super computing clusters they have been able to deploy.

The Nvidia bandwidth to compute ratio is more necessary because they are moving things around all the time. By keeping all the outputs on the wafer and only streaming the weights, you have a much more favorable requirement for BW to compute. And the number of layers becomes less impactful because they are storing transient outputs.

This is probably one of the primary reasons they didn't need to increase SRAM for WSE-3. WSE-2 was developed based on the old "fit the whole model on the chip" paradigm but models eclipsed 1TB so the new solution is more scalable.

brucethemoose22y ago

As I understand it, the WSE-2's interconnect is actually quite good, and models are split across chips kinda like GPUs.

And keep in mind that these nodes are hilariously "fat" compared to a GPU node (or even an 8x GPU node), meaning less congestion and overhead from the topology.

imtringued2y ago

One thing I don't understand about their architecture is that they have spent so much effort building this monster of a chip, but if you are going to do something crazy, why not work on processing in memory instead? At least for transformers you will primarily be bottlenecked on matrix multiplication and almost nothing else, so you only need to add a simple matrix vector unit behind your address decoder and then almost every AI accelerator will become obsolete over night. I wouldn't suggest this to a random startup though.

TheDudeMan2y ago

FWIW, this chip has 44 GB of on-chip memory.

marmaduke2y ago

Hm, let's wait and see what the gemm/W perf is, and how many programmer hours it takes to implement say an mlp. Wafer scale data flow may not be a solved problem?

tivert2y ago

Interesting. I know there's a lot of attempts to hobble China by limiting their access to cutting edge chips and semiconductor manufacturing technology, but could something like this be a workaround for them, at least for datacenter-type jobs?

Maybe it wouldn't be as powerful as one of these, due to their less capable fabs, but something that's good enough to get the job done in spite of the embargoes.

eternauta3k2y ago

What do you mean by "this", and how does it work around the restrictions? Do you mean just making bigger chips instead of shrinking the transistors?

tivert2y ago

> Do you mean just making bigger chips instead of shrinking the transistors?

Yes.

asdfasdf1OP2y ago

WHITE PAPER Training Giant Neural Networks Using Weight Streaming on Cerebras Wafer-Scale Clusters

https://f.hubspotusercontent30.net/hubfs/8968533/Virtual%20B...

asdfasdf1OP2y ago

- Interconnect between WSE-2's chips in the cluster was 150GB/s, much lower than NVIDIA's 900GB/s.

- non-sparse fp16 in WSE-2 was 7.5 tflops (about 8 H100s, 10x worse performance per dollar)

Does anyone know the WSE-3 numbers? Datasheet seems lacking loads of details

Also, 2.5 million USD for 1 x WSE-3, why just 44GB tho???

xcv1232y ago

>> why just 44GB tho???

You can order one with 1.2 Petabytes of external memory. Is that enough?

"External memory: 1.5TB, 12TB, or 1.2PB"

https://www.cerebras.net/press-release/cerebras-announces-th...

"214Pb/s Interconnect Bandwidth"

https://www.cerebras.net/product-system/

acchow2y ago

I can't find the memory bandwidth to that external memory. Did they publish this?

1 more reply

Tuna-Fish2y ago

44GB is the SRAM on a single device, comparable to the 50MB of L2 on the H100. There is also a lot of directly attached DRAM.

terafo2y ago

No, it's comparable to 230Mb of SRAM on Groq chip, since both of them are SRAM-only chips that can't really use external memory.

terafo2y ago

Because SRAM stopped getting smaller with recent nodes.

bee_rider2y ago

Is that 150GB/s between elements that expect to run tightly coupled processes together? Maybe the bandwidth between chips is less important.

I mean, in a cluster you might have a bunch of nodes with 8x GPUs hanging off each, if this thing replaces a whole node rather than a single GPU, which I assume is the case, it is not really a useful comparison, right?

holoduke2y ago

Better sell all nvidia stocks. Once these chips are common there is no need anymore for GPUs in training super large AI models.

anon2912y ago

Ex Cerebras engineer. In my opinion, this is not going to be the case. The WSE-2 was a b** to program and debug. Their compilation strategy is a dead end, and they invest very little into developer ease. My two cents.

1 more reply

incrudible2y ago

This chip does not outperform NVIDIA on key metrics. Economics of scale are unfavorable. Software is exotic.

I trust that gamers will outlast every hype, be it crypto or AI.

nickpsecurity2y ago

Two of you have a take on this that sounds similar to prior projects, like the Cell processor. They lost in the long run. Not a good sign.

imtringued2y ago

I would be more worried about the fact that next year every CPU is going to ship with some kind of AI accelerator already integrated to the die, which means the only competitive differentiation boils down to how much SRAM and memory bandwidth your AI accelerator is going to have. TOPS or FLOPS will become an irrelevant differentiator.

terafo2y ago

This thing targets training, which isn't affected by tiny accelerators inside CPUs.

TradingPlaces2y ago

Near-100% yield is some dark magic.

api2y ago

I'm surprised we haven't seen wafer scale many-core CPUs for cloud data centers yet.

GaryNumanVevo2y ago

Power limitations, a standard rack might only support 18kW - 30kW this is 24kW for ONE chip.

beautifulfreak2y ago

So it's increased from 2.6 to 4 trillion transistors over the previous version.

tedivm2y ago

The missing numbers that I really want to see-

* Power Usage

* Rack Size (last one I played with was 17u)

* Cooling requirements

tibbydudeza2y ago

Wow - it as bigger than my kitchen tiles - who uses them ???. NSA ???.

pgraf2y ago

related discussion (2021): https://news.ycombinator.com/item?id=27459466

hashtag-til2y ago

Any idea on what’s the yield on these chips?

wtallis2y ago

Previous versions have had basically 100% yield, because when you're working with the whole wafer it's pretty easy to squeeze in enough redundancy to route around defects unless you get a really unlucky cluster of defects.

wizardforhire2y ago

But can it run doom?

mlhpdx2y ago

Does it come in a mobile/laptop version?

pmontra2y ago

It's 215 x 215 mm so it fits in a large laptop, some 15" and definitely 17" ones. The keyboard could get a little warm and battery life doesn't look good.

terafo2y ago

Not right now.

whyenot2y ago

Imagine setting up a Beowulf cluster of these /s

AdamH121132y ago

Title should be either "4,000,000,000,000 Transistors" (as in the actual video title) or "4 Trillion Transistors" or maybe "4T Transistors". "4B,000" ("four billion thousand"?) looks like 48,000 (forty-eight thousand).

1 more reply

j / k navigate · click thread line to collapse

86 comments

cs7022y ago

We sure live in interesting times!

---

[a] https://en.wikipedia.org/wiki/Synapse

mlyle2y ago

> but if we oversimplify things and assume that every synapse can be modeled as a single parameter in a weight matrix

Which, it probably can't... but offsetting those simplifications and 4-20x difference is the massive difference in how quickly those synapses can be activated.

topspin2y ago

> that have only 4x to 20x fewer connections that the human brain

So only 4-20 of these systems are necessary to match the human brain. No?

ipsum22y ago

Fun fact, I can also train a 24 trillion parameter model on my laptop! Just need to offload weights to the cloud every layer.

...

It's meaningless to say something can train a model that has 24 trillion parameters without specifying the dataset size and time it takes to train.

cryptonector2y ago

I dare say this thing will be many times faster than thrashing your 24T parameters to the cloud.

ipsum22y ago

Yeah, but it'll be slower than the equivalent Nvidia GPU cluster.

brucethemoose22y ago

Reposting the CS-2 teardown in case anyone missed it. The thermal and electrical engineering is absolutely nuts:

https://vimeo.com/853557623

https://web.archive.org/web/20230812020202/https://www.youtu...

(Vimeo/Archive because the original video was taken down from YouTube)

bsder2y ago

As always, IBM did it first: https://www.righto.com/2021/03/logic-chip-teardown-from-vint...

bitwrangler2y ago

20,000 amps

200,000 electrical contacts

850,000 cores

and that's the "old" one. wow.

Keyframe2y ago

edit: thanks people, makes sense now!

magicalhippo2y ago

There's also resistive losses and leakage currents and such.

[1]: https://en.wikipedia.org/wiki/CMOS

[2]: https://en.wikipedia.org/wiki/MOSFET#Modes_of_operation

[3]: https://en.wikipedia.org/wiki/Charge_pump

Workaccount22y ago

It simply takes a non-zero amount of energy to turn a transistor on and off.

Add up trillions of transistors, flicking on and off billions of times a second, and you get enormous power draws.

None of this accounts for resistive losses either, which is just the cost of doing business for a CPU.

danbruc2y ago

1 more reply

bgnn2y ago

- Half of themis power is resistive loses. This is kinda a fundamental thing. Charging a capacitor with a resistor is "lossy".

- Some of the charge is reused within the circuits

magicalhippo2y ago

The peak power average of the 7950X3D is roughly 150W[3], which means if you could somehow run all 450 CPUs (900 dies) at peak, they'd consume around 68kW.

edit: I forgot about the IO die which contains the memory controller, so that will suck some power as well. So if we say 50W for that and 50W for the CPU dies, that's 45kW.

That's assuming you get a "clean wafer" with all dies working, not "just" the 80% yield or so.

[1]: https://www.techpowerup.com/cpu-specs/ryzen-9-7950x3d.c3024

[2]: https://www.anandtech.com/show/15219/early-tsmc-5nm-test-chi...

[3]: https://www.tomshardware.com/reviews/amd-ryzen-9-7950x3d-cpu...

brucethemoose22y ago

And even mundane details are so difficult... thermal expansion, the sheer number of pins that have to line up. This thing is a marvel.

Rexxar2y ago

It seems the vimeo video has been removed now too.

dougmwne2y ago

I want this woman running my postAI-apocalypse hardware research lab.

nachexnachex2y ago

She's got strong merit to be spared by the Roko's basilisk ai.

fxj2y ago

It has its own programming language CSL

https://www.cerebras.net/blog/whats-new-in-r0.6-of-the-cereb...

https://github.com/Cerebras/csl-examples

rbanffy2y ago

It is oddly reminiscent of the Thinking Machines CM-1/2 series, but with CSL as the official language instead of Lisp.

And far fewer blinking lights.

GaryNumanVevo2y ago

Due to their architecture (non-stationary dataflow) their compiler is approaching something close to a place and route FPGA algorithm

RetroTechie2y ago

How many early supercomputers / workstations etc would that include? How much progress did humanity make using all those early machines (or any transistorized device!) combined?

itishappy2y ago

Rough guess: mid 1980s

4004 from the 1970s used 2300 transistors, so it would have needed to sell billions.

Pentium from 1990s had 3M transistors, so it could hit our target by selling a million units.

I'm betting (without much research) that the Pentium line alone sold millions, and the industry as a whole could hit those numbers about 5 years earlier.

ortusdux2y ago

Not trying to sound critical, but is there a reason to use 4B,000 vs 4T?

wincy2y ago

yalok2y ago

because Billion is ambiguous -

Quote:

Billion is a word for a large number, and it has two distinct definitions:

https://en.wikipedia.org/wiki/Billion

bee_rider2y ago

It looks like it has now been switched to just have the number. I wonder if there was just some auto-formatting error.

geph20212y ago

original title is:

"4,000,000,000,000 Transistors, One Giant Chip (Cerebras WSE-3)"

So I guess they're trying to stay true to it.

ortusdux2y ago

gosub1002y ago

at least a trillion means a trillion. Unlike the "tebi vs tera"-byte marketing-speak in storage and ram.

leptons2y ago

Made you click!

imbusy1112y ago

I wish they dug into how this monstrosity is powered. Assuming 1V and 24kW, that's 24kAmps.

crotchfire2y ago

Backside power delivery using through-silicon vias.

I'm sure those TSVs connect to a huge array of switching power supplies, so the 24kW doesn't travel very far at such low voltages.

geph20212y ago

and cooling!

Imagine the heat sink on that thing. Would look like a cast-iron Dutch oven :)

tivert2y ago

The one of the videos posted here gets into that: https://news.ycombinator.com/item?id=39693930: https://vimeo.com/853557623

geph20212y ago

Thanks for sharing! Very interesting.

"We call it the engine block because it somewhat resembles a three cylinder motorcycle engine"

(referring to the power distribution and cooling)

jandrese2y ago

24kW translates to like 32hp, so you could imagine this thing with a liquid cooling loop hooked up to something that looks like a car radiator.

dist-epoch2y ago

24kW is on the lower end of a home heating gas boiler.

1 more reply

MenhirMike2y ago

Finally, a chip that outmatches the Noctua NH-D15!

asdfasdf1OP2y ago

https://www.cerebras.net/press-release/cerebras-announces-th...

https://www.cerebras.net/product-chip/

Rexxar2y ago

Is there a reason it's not roughly a disc if they use the whole wafer ? They could have 50% more surface.

terafo2y ago

To quote their official response "If the WSE weren't rectangular, the complexity of power delivery, I/O, mechanical integrity and cooling become much more difficult, to the point of impracticality.".

londons_explore2y ago

Packaging method can't handle non-square dies?

modeless2y ago

txyx3032y ago

brucethemoose22y ago

As I understand it, the WSE-2's interconnect is actually quite good, and models are split across chips kinda like GPUs.

And keep in mind that these nodes are hilariously "fat" compared to a GPU node (or even an 8x GPU node), meaning less congestion and overhead from the topology.

imtringued2y ago

TheDudeMan2y ago

FWIW, this chip has 44 GB of on-chip memory.

marmaduke2y ago

Hm, let's wait and see what the gemm/W perf is, and how many programmer hours it takes to implement say an mlp. Wafer scale data flow may not be a solved problem?

tivert2y ago

Maybe it wouldn't be as powerful as one of these, due to their less capable fabs, but something that's good enough to get the job done in spite of the embargoes.

eternauta3k2y ago

What do you mean by "this", and how does it work around the restrictions? Do you mean just making bigger chips instead of shrinking the transistors?

tivert2y ago

> Do you mean just making bigger chips instead of shrinking the transistors?

Yes.

asdfasdf1OP2y ago

WHITE PAPER Training Giant Neural Networks Using Weight Streaming on Cerebras Wafer-Scale Clusters

https://f.hubspotusercontent30.net/hubfs/8968533/Virtual%20B...

asdfasdf1OP2y ago

- Interconnect between WSE-2's chips in the cluster was 150GB/s, much lower than NVIDIA's 900GB/s.

- non-sparse fp16 in WSE-2 was 7.5 tflops (about 8 H100s, 10x worse performance per dollar)

Does anyone know the WSE-3 numbers? Datasheet seems lacking loads of details

Also, 2.5 million USD for 1 x WSE-3, why just 44GB tho???

xcv1232y ago

>> why just 44GB tho???

You can order one with 1.2 Petabytes of external memory. Is that enough?

"External memory: 1.5TB, 12TB, or 1.2PB"

https://www.cerebras.net/press-release/cerebras-announces-th...

"214Pb/s Interconnect Bandwidth"

https://www.cerebras.net/product-system/

acchow2y ago

I can't find the memory bandwidth to that external memory. Did they publish this?

1 more reply

Tuna-Fish2y ago

44GB is the SRAM on a single device, comparable to the 50MB of L2 on the H100. There is also a lot of directly attached DRAM.

terafo2y ago

No, it's comparable to 230Mb of SRAM on Groq chip, since both of them are SRAM-only chips that can't really use external memory.

terafo2y ago

Because SRAM stopped getting smaller with recent nodes.

bee_rider2y ago

Is that 150GB/s between elements that expect to run tightly coupled processes together? Maybe the bandwidth between chips is less important.

holoduke2y ago

Better sell all nvidia stocks. Once these chips are common there is no need anymore for GPUs in training super large AI models.

anon2912y ago

1 more reply

incrudible2y ago

This chip does not outperform NVIDIA on key metrics. Economics of scale are unfavorable. Software is exotic.

I trust that gamers will outlast every hype, be it crypto or AI.

nickpsecurity2y ago

Two of you have a take on this that sounds similar to prior projects, like the Cell processor. They lost in the long run. Not a good sign.

imtringued2y ago

terafo2y ago

This thing targets training, which isn't affected by tiny accelerators inside CPUs.

TradingPlaces2y ago

Near-100% yield is some dark magic.

api2y ago

I'm surprised we haven't seen wafer scale many-core CPUs for cloud data centers yet.

GaryNumanVevo2y ago

Power limitations, a standard rack might only support 18kW - 30kW this is 24kW for ONE chip.

beautifulfreak2y ago

So it's increased from 2.6 to 4 trillion transistors over the previous version.

tedivm2y ago

The missing numbers that I really want to see-

* Power Usage

* Rack Size (last one I played with was 17u)

* Cooling requirements

tibbydudeza2y ago

Wow - it as bigger than my kitchen tiles - who uses them ???. NSA ???.

pgraf2y ago

related discussion (2021): https://news.ycombinator.com/item?id=27459466

hashtag-til2y ago

Any idea on what’s the yield on these chips?

wtallis2y ago

wizardforhire2y ago

But can it run doom?

mlhpdx2y ago

Does it come in a mobile/laptop version?

pmontra2y ago

It's 215 x 215 mm so it fits in a large laptop, some 15" and definitely 17" ones. The keyboard could get a little warm and battery life doesn't look good.

terafo2y ago

Not right now.

whyenot2y ago

Imagine setting up a Beowulf cluster of these /s

AdamH121132y ago

1 more reply

j / k navigate · click thread line to collapse