Edit: There were a number of papers and conference proceedings published back then but not much shows up when searching Google. Here's one discussing the issues and results of field stitching https://fdocuments.in/document/ieee-comput-soc-press-1992-in...
From 1992, so yeah, field stitching is not a recent invention.
In other words, "how is this effort new to the universe?"
I would say it's certainly at a different scale and a different time. And we should be super thankful that the commercial interest is such that we can try out new chip designs in a different domain now; you can really imagine a rethink for the kinds of things that are possible once you're really at scale here.
I tend to be optimistic, so take my prediction with a grain of salt: I bet within 7 or 8 years there will be an inexpensive device that will blow away what we have now. There are so many applications for much larger end-to-end models that will put pressure on the market for something much better than what we have now. BTW, the ability to efficiently run models on my new iPhone 11 Pro is impressive, and I have to wonder if the market for super fast hardware for training models might match the smartphone market. For this to happen, we need a "deep learning rules the world" shift. BTW, off topic, but I don't think deep learning gets us to AGI.
Specifically, gradient descent is a post hoc approach to network tuning, while human neural connections are reinforced simultaneously as they fire together. The post hoc approach restricts the scope of the latent representations a network learns, because such representations must serve a specific purpose (descending the gradient), while the human mind works by generating representations spontaneously at multiple levels of abstraction without any specific or immediate purpose in mind.
I believe the brain's ability to spontaneously generate latent representations capable of interacting with one another in a shared latent space is functionally enabled by the paradigm of neurons 'firing and wiring' together. I also believe it is the brain's ability to spontaneously generate hierarchically abstract representations in a shared space that is the key to AGI. We must therefore move away from gradient descent.
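To make the contrast concrete, here's a toy numpy sketch of the two update rules (purely illustrative numbers and shapes, not anyone's actual training code):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=3)      # presynaptic activity / input
    w = rng.normal(size=3)      # connection weights
    target, lr = 1.0, 0.1

    # Gradient descent: the update is computed post hoc, from the
    # derivative of an explicit loss the representation must serve.
    y = w @ x
    w_gd = w - lr * 2 * (y - target) * x     # d/dw of (y - target)^2

    # Hebbian-style update: purely local, no loss function in sight; the
    # weight strengthens because pre- and post-synaptic activity coincide.
    post = w @ x
    w_hebb = w + lr * post * x

    print(w_gd, w_hebb)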
I did the numbers a while ago and honestly I don't think we need smaller transistors to match the computational volume of our mushy brains -- although ofc, more and smaller transistors are always very nice. I believe the only thing stopping AGI at this point is architecture -- we really have no idea how to connect and structure something as complex as our brains -- and cognitive maturity. The last part is my way of saying "two weeks for training a NN? Wait until you have a kid and have to work on training the little human for decades....".
TBH, the ethical implications of AGI seem insurmountable to me. Life is a game -- meaning the universe doesn't care about us, nor do we owe anything to it -- and for now, it's our game. So I would rather we put all that computing toward improving human life -- including mind uploading -- and put AGI right there with nuclear weapons.
Recently Intel acquired Habana Labs for $2B [1], and Intel could possibly integrate this into upcoming CPUs (Intel certainly sells more CPUs than Apple sells iPhones). However, that acquisition was on the inference side - unlike Cerebras, which is making training faster. The most likely products to benefit from this would be Azure or AWS.
1. https://newsroom.intel.com/news-releases/intel-ai-acquisitio...
Amazon/Google/Microsoft will gladly take your money for time on their nVidia GPU instances, but they charge tens of cents per hour.
Which is to say that graphs written to run specifically on Cerebras's giant chip will smash deep learning's speed barrier for graphs written to run best on Cerebras's giant chip. And that's great, but it won't be every graph, there is no free lunch. Hear me now, believe me later(tm).
But if we can cut the cost of interconnect by putting a figurative datacenter's worth of processors on a chip, that's genuinely interesting, and it has applications far beyond the multiplies and adds of AI. But be very wary of anyone wielding the term "sparse", for it is a massively overloaded term and every single one of its definitions is a beautiful and unique snowflake w/r to efficient execution on bespoke HW.
Also, is this something that will likely scale up, or will this style of design hit a wall(power dissipation?) faster than, say, silicon-interconnect fabric?
Time will tell if this is the new path forward or just a curious footnote in the history of semiconductors.
https://spectrum.ieee.org/computing/software/winner-multicor...
Why would they make a chip this big when AMD has shown a chiplet design approach is cheaper and more scalable on so many levels? Let alone the yields.
Equally, Arm's approach to utilising the back of the chip for power delivery: https://spectrum.ieee.org/nanoclast/semiconductors/design/ar...
A wafer-scale chip like this, using that approach, would save so much power. But again, yields will be a factor, and I can imagine this is not a cutting-edge process node, since yields improve as nodes mature. So an older node would have better yield and be more suitable for such wafer-scale chips. But again, there's no mention of what is used. I had read in the past that it would use Intel's 10nm, but this article mentions TSMC. Another article says they used a 16nm node ( https://fuse.wikichip.org/news/3010/a-look-at-cerebras-wafer... ), which, given what I said about node maturity, is understandable.
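To put rough numbers on why node maturity matters so much at this die area, the classic first-order yield model Y = exp(-D0 * A) is enough (the defect densities and areas below are illustrative guesses, not TSMC's or Cerebras's actual figures):

    from math import exp

    def poisson_yield(d0_per_cm2, area_cm2):
        # First-order Poisson yield model: Y = exp(-D0 * A)
        return exp(-d0_per_cm2 * area_cm2)

    gpu_die = 8.15        # cm^2, roughly a big GPU die
    wafer_scale = 462.0   # cm^2, roughly the Cerebras wafer-scale die

    for d0 in (0.05, 0.10, 0.20):   # defects/cm^2; mature nodes sit lower
        print(d0, poisson_yield(d0, gpu_die), poisson_yield(d0, wafer_scale))

Even at the lowest assumed defect density, the naive yield of a non-repairable wafer-scale die is effectively zero, which is why the core redundancy discussed elsewhere in the thread matters even more than the node choice.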
Right now I'm hosting some DGX's, and only one datacenter in the bay area had the ability to power a full rack of them. Power density is going to be a real issue for these systems.
Equally, the cooling capacity of the datacenter comes into play with such systems. Given the power density, the amount of heat being generated would equally be above your normal rack output.
They're taking a radically different approach, and hoping that they'll be able to route around defects, unlike AMD where a defect in the uncore kills the whole chiplet.
Images of the whole computer were published, you can see the massive cooling system: https://www.tomshardware.com/news/worlds-largest-chip-gets-a...
There's a picture in the article.
>Why would they make a chip this big
Did you read the article?
>this article mentions TSMC. Another article I read that they used a 16nm node
Yes, 16nm/TSMC.
Yes - hardly helpful ones, as you get a picture of a wafer and a box with no breakdown beyond that - hence I had a look and found other articles with much more detail that answer the questions I raised in relation to the lack of pictures, like the cooling aspect, in which you snipped my quote and removed that lovely thing we call context.
>Did you read the article?
Yes, and had you read what I said you would see that the article does not answer the aspects I was asking about - see what you did there.
>Yes, 16nm/TSMC
Yes - I found that in another article - which I also linked, you're welcome.
The way they paint it, it sounds like they're putting in redundant cores to account for failure of what I would call the 'first line' cores, i.e. there are cores that are only used if some primary ones aren't working?
But sort of intuitively that doesn't make a whole lot of sense given the parallel nature. Maybe they are just putting in 101% of specified cores, and if there's a ~1% hopefully uniform-ish core failure rate then it's all gucci?
I guess my question is probably similar to yours, what are you giving up with yield-enhancing redundancy of a behemoth die vs integrating a bunch of confirmed working chiplets together?
"Cerebras approached the problem using redundancy by adding extra cores throughout the chip that would be used as backup in the event that an error appeared in that core’s neighborhood on the wafer. “You have to hold only 1%, 1.5% of these guys aside,” Feldman explained to me. Leaving extra cores allows the chip to essentially self-heal, routing around the lithography error and making a whole-wafer silicon chip viable."
https://techcrunch.com/2019/08/19/the-five-technical-challen...
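A quick back-of-the-envelope check of that 1-1.5% figure (the per-core defect rate here is my assumption, not Cerebras's real data; the core count is the roughly 400,000 widely reported for the wafer):

    import numpy as np

    rng = np.random.default_rng(1)
    n_cores = 400_000
    p_defect = 0.005          # assumed independent per-core defect rate
    spare_fraction = 0.015    # spares held aside, per the quote above

    defective = rng.binomial(n_cores, p_defect, size=10_000)
    repairable = (defective <= spare_fraction * n_cores).mean()
    print(f"fraction of wafers repairable by routing around: {repairable:.3f}")

Under those assumptions the spare budget comfortably exceeds the expected number of bad cores, which is roughly the argument in the quote.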
Maybe something like Intel's EMIB technology where they have small interposers along edges of chips rather than having a giant interposer might help here.
Yields are probably fairly good if they design for manufacturing by placing extra cores / wires to route around failures as I am sure they are.
From discussion at a demo the yield is good, since they are using a large node. Their hardware rerouting also mitigates defects on most chips.
Does anyone here have expertise in this area? Is this the model for a successful company in this area?
* No DRAM or caches, everything is in SRAM, and all local SRAM loads are 1 cycle.
* Model parallel alone is full performance, no need for data parallel if you size to fit.
* Defects are handled in hardware; any latency differences are hidden & not in load path anyway.
* Fully asynchronous/dataflow by default, only need minimal synchronization between forward/backward passes.
I genuinely don't know how you'd build a simpler system than this.
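As a mental model of the "size to fit, weights stay put" point above (this says nothing about Cerebras's actual toolchain; the layer sizes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    layer_sizes = [784, 512, 256, 10]

    # Each region of the fabric permanently holds one layer's weights in
    # local memory (the SRAM point above); weights are never reloaded.
    regions = [rng.normal(size=(m, n)) * 0.01
               for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

    def forward(x):
        # Only the activations flow region to region (dataflow); nothing
        # as heavy as a weight tensor ever moves.
        for w in regions:
            x = np.maximum(x @ w, 0.0)
        return x

    print(forward(rng.normal(size=784)).shape)   # -> (10,)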
While I'd generally be skeptical, it seems like the compilation for the rerouting could be done at a single low level, below whatever their assembler is, so the chip could just look like a regular array of cores - a single array that translates from i to the i-th "real" core, plus similar structures, seems like it could be enough.
Edit: I mean, if they're smart, it seems like they'd make the thing look as much as possible like a generic GPU capable of OpenCL. I have no idea if they'll do that, but since they have size, they won't have to sell their stuff with an otherwise custom approach.
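Something like this indirection table is what I have in mind - entirely hypothetical, not how their compiler actually works:

    # Logical-to-physical core remap: software only ever sees a dense
    # range of logical ids; defective cores never appear in the table.
    n_physical = 16
    defective = {3, 9}           # cores found bad at wafer test (made up)

    remap = [p for p in range(n_physical) if p not in defective]

    def physical_core(logical_id):
        return remap[logical_id]

    print(physical_core(3))      # -> 4: logical core 3 lands on physical 4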
The issue with using ‘industry standard’ benchmarks is that it's like measuring a bus' efficiency by shuttling around a single person at a time. The CS-1 is just bigger than that; the workloads that it provides the most value on are ones that are sized to fit, and specifically built for the device.
This does make it hard to evaluate as outsiders (certainly for similar reasons I never liked Graphcore), but I don't think it means anything as grim as you say. The recipe fits.
To me as a practitioner a meaningful metric would be "it trains an ImageNet classifier to e.g. 80% top1 in a minute". If it's not suitable for CNNs, do BERT or something else non-convolutional. Even better if I can replicate this result in a public cloud somewhere. They know this, and yet all we have is a single mention of a customer under an NDA and no public benchmarks of any kind, let alone any verifiable ones. If it did excel at those, we'd already know.
At best their solution is on par with GPUs in a performance per watt/dollar sense. At worst they're scammers looking for a sucker.
Even if that doesn't work out, most of the people on this team have built companies that were acquired by either AMD or another chip maker.
Although, at this stage, Cerebras does not care about the mass market yet.
https://arxiv.org/pdf/1905.00416.pdf
It would be cool to see the GraphBLAS API ported to this chip, which from what I can tell comes with sparse matrix processing units. As networks become bigger, deeper, but sparser, a chip like this will have some demonstrable advantages over dense numeric processors like GPUs.
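A rough scipy sketch of the arithmetic behind that claim (nothing Cerebras- or GraphBLAS-specific; the size and density are made up):

    import numpy as np
    from scipy import sparse

    n = 10_000
    density = 0.001                 # assume 0.1% of weights are nonzero
    a = sparse.random(n, n, density=density, format="csr", random_state=0)
    x = np.ones(n)

    y = a @ x                       # work scales with nnz, not with n^2
    print(f"dense ~{2 * n * n:.1e} flops vs sparse ~{2 * a.nnz:.1e} flops")

A dense accelerator pays the n^2 cost regardless; hardware that only touches the nonzeros is where the advantage would show up.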
Deep Neural Nets are somewhat of a brute force approach to machine learning. Training efficiency is horrible as compared with other ML approaches, but hey, as long as we can trade +5% of classification performance for +500% of NN complexity and throw more money at the problem, who cares?
I see a dystopian future where much better and much more efficient approaches to ML exist, but nobody's paying attention because we have Deep Neural Nets in hardware and decades of infrastructure supporting it.
If the algorithm is indeed better, how can DNNs dominate and turn this into a dystopia?
The alternative algorithm would be better than DNNs if the same amount of effort were put into creating special-purpose hardware, libraries, and so on for it; but in the dystopia, it's not fully refined DNN vs fully refined alternative algorithm, it's fully refined DNN vs an alternative algorithm running on hardware and software optimized for DNNs.
The alternative algorithm always looks unappealing because the playing field historically favors DNN, and so doesn't take off in the dystopia.
From TFA: "Cerebras hasn’t released MLPerf results or any other independently verifiable apples-to-apples comparisons."
That's all you really need to know.
Less than stellar benchmarks will ruin the "magic"
https://medium.com/predict/cerebras-trounces-moores-law-with...
If this were trying to aim at solid-state physics and materials research, then maybe one could be cautiously optimistic about a genuine breakthrough via something like room-temperature, standard-pressure superconductivity. As it stands, I call blind hype.
And for the curious, a good example of this is radium. For a while it was a miracle cure-all, put in everything from lipstick to jock straps. That did not work out well: https://www.theatlantic.com/health/archive/2013/03/how-we-re...
DeepMind have for place&route IIRC.
Without defending the article, it is nevertheless the case that simply scaling up chip size has nontrivial problems. For example, will the piece of silicon warp or shatter if one side happens to get hotter than the other?