Edit: There were a number of papers and conference proceedings published back then but not much shows up when searching Google. Here's one discussing the issues and results of field stitching https://fdocuments.in/document/ieee-comput-soc-press-1992-in...
From 1992, so yeah, field stitching is not a recent invention.
In other words, "how is this effort new to the universe?"
I would say it's certainly at a different scale and a different time. And we should be super thankful that the commercial interest is such that we can try out new chip designs in a different domain now; you can really imagine a rethink for the kinds of things that are possible once you're really at scale here.
I tend to be optimistic, so take my prediction with a grain of salt: I bet within 7 or 8 years there will be an inexpensive device that will blow away what we have now. There are so many applications for much larger end-to-end models that will put pressure on the market for something much better than what we have now. BTW, the ability to efficiently run models on my new iPhone 11 Pro is impressive, and I have to wonder if the market for super fast hardware for training models might match the smartphone market. For this to happen, we need a "deep learning rules the world" shift. BTW, off topic, but I don't think deep learning gets us to AGI.
Specifically, gradient descent is a post hoc approach to network tuning, while human neural connections are reinforced simultaneously as they fire together. The post hoc approach restricts the scope of the latent representations a network learns, because such representations must serve a specific purpose (descending the gradient), while the human mind works by generating representations spontaneously at multiple levels of abstraction without any specific or immediate purpose in mind.
I believe the brain's ability to spontaneously generate latent representations capable of interacting with one another in a shared latent space is functionally enabled by the paradigm of neurons 'firing and wiring' together. I also believe it is the brain's ability to spontaneously generate hierarchically abstract representations in a shared space that is the key to AGI. We must therefore move away from gradient descent.
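To make the contrast concrete, here's a toy numpy sketch of the two update rules (purely illustrative numbers and shapes, not anyone's actual training code):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=3)      # presynaptic activity / input
    w = rng.normal(size=3)      # connection weights
    target, lr = 1.0, 0.1

    # Gradient descent: the update is computed post hoc, from the
    # derivative of an explicit loss the representation must serve.
    y = w @ x
    w_gd = w - lr * 2 * (y - target) * x     # d/dw of (y - target)^2

    # Hebbian-style update: purely local, no loss function in sight; the
    # weight strengthens because pre- and post-synaptic activity coincide.
    post = w @ x
    w_hebb = w + lr * post * x

    print(w_gd, w_hebb)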
I did the numbers a while ago and honestly I don't think we need smaller transistors to match the computational volume of our mushy brains -- although ofc, more and smaller transistors are always very nice. I believe the only thing stopping AGI at this point is architecture -- we really have no idea how to connect and structure something as complex as our brains -- and cognitive maturity. The last part is my way of saying "two weeks for training a NN? Wait until you have a kid and have to work on training the little human for decades....".
TBH, the ethical implications of AGI seem insurmountable to me. Life is a game -- meaning the universe doesn't care about us, nor do we owe anything to it -- and for now, it's our game. So I would rather we put all that computing toward improving human life -- including mind uploading -- and put AGI right there with nuclear weapons.
Recently Intel acquired Habana Labs for $2B [1], and Intel could possibly integrate this into upcoming CPUs (Intel certainly sells more CPUs than Apple sells iPhones). However, that acquisition was on the inference side - unlike Cerebras, which is making training faster. The most likely products to benefit from this would be Azure or AWS.
1. https://newsroom.intel.com/news-releases/intel-ai-acquisitio...
Amazon/Google/Microsoft will gladly take your money for time on their nVidia GPU instances, but they charge tens of cents per hour.
Which is to say that graphs written to run specifically on Cerebras's giant chip will smash deep learning's speed barrier for graphs written to run best on Cerebras's giant chip. And that's great, but it won't be every graph, there is no free lunch. Hear me now, believe me later(tm).
But if we can cut the cost of interconnect by putting a figurative datacenter's worth of processors on a chip, that's genuinely interesting, and it has applications far beyond the multiplies and adds of AI. But be very wary of anyone wielding the term "sparse", for it is a massively overloaded term and every single one of its definitions is a beautiful and unique snowflake w/r to efficient execution on bespoke HW.
Also, is this something that will likely scale up, or will this style of design hit a wall(power dissipation?) faster than, say, silicon-interconnect fabric?
Time will tell if this is the new path forward or just a curious footnote in the history of semiconductors.
https://spectrum.ieee.org/computing/software/winner-multicor...
Why would they make a chip this big when AMD has shown a chiplet design approach is cheaper and more scalable on so many levels? Let alone the yields.
Equally, Arm's approach to utilising the back of the chip for power delivery: https://spectrum.ieee.org/nanoclast/semiconductors/design/ar...
A wafer-scale chip like this, using that approach, would save so much power. But again, yields will be a factor, and I can imagine this is not a cutting-edge process node, since yields improve as nodes mature. So an older node would have better yield and be more suitable for such wafer-scale chips. But again, there's no mention of what is used. I had read in the past that it would use Intel's 10nm, but this article mentions TSMC. Another article says they used a 16nm node ( https://fuse.wikichip.org/news/3010/a-look-at-cerebras-wafer... ), which, given what I said about node maturity, is understandable.
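To put rough numbers on why node maturity matters so much at this die area, the classic first-order yield model Y = exp(-D0 * A) is enough (the defect densities and areas below are illustrative guesses, not TSMC's or Cerebras's actual figures):

    from math import exp

    def poisson_yield(d0_per_cm2, area_cm2):
        # First-order Poisson yield model: Y = exp(-D0 * A)
        return exp(-d0_per_cm2 * area_cm2)

    gpu_die = 8.15        # cm^2, roughly a big GPU die
    wafer_scale = 462.0   # cm^2, roughly the Cerebras wafer-scale die

    for d0 in (0.05, 0.10, 0.20):   # defects/cm^2; mature nodes sit lower
        print(d0, poisson_yield(d0, gpu_die), poisson_yield(d0, wafer_scale))

Even at the lowest assumed defect density, the naive yield of a non-repairable wafer-scale die is effectively zero, which is why the core redundancy discussed elsewhere in the thread matters even more than the node choice.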
Right now I'm hosting some DGX's, and only one datacenter in the bay area had the ability to power a full rack of them. Power density is going to be a real issue for these systems.
Equally, the cooling capacity of the datacenter comes into play with such systems. Given the power density, the amount of heat being generated would equally be above your normal rack output.
They're taking a radically different approach, and hoping that they'll be able to route around defects, unlike AMD where a defect in the uncore kills the whole chiplet.
Images of the whole computer were published, you can see the massive cooling system: https://www.tomshardware.com/news/worlds-largest-chip-gets-a...
There's a picture in the article.
>Why would they make a chip this big
Did you read the article?
>this article mentions TSMC. Another article I read that they used a 16nm node
Yes, 16nm/TSMC.
Yes - hardly helpful ones, as you get a picture of a wafer and a box with no breakdown beyond that - hence I had a look and found other articles with much more detail that answer the questions I raised in relation to the lack of pictures, like the cooling aspect, in which you snipped my quote and removed that lovely thing we call context.
>Did you read the article?
Yes, and had you read what I said you would see that the article does not answer the aspects I was asking about - see what you did there.
>Yes, 16nm/TSMC
Yes - I found that in another article - which I also linked, you're welcome.
The way they paint it, it sounds like they're putting in redundant cores to account for failure of what I would call the 'first line' cores, i.e. there are cores that are only used if some primary ones aren't working?
But sort of intuitively that doesn't make a whole lot of sense given the parallel nature. Maybe they are just putting in 101% of specified cores, and if there's a ~1% hopefully uniform-ish core failure rate then it's all gucci?
I guess my question is probably similar to yours, what are you giving up with yield-enhancing redundancy of a behemoth die vs integrating a bunch of confirmed working chiplets together?
"Cerebras approached the problem using redundancy by adding extra cores throughout the chip that would be used as backup in the event that an error appeared in that core’s neighborhood on the wafer. “You have to hold only 1%, 1.5% of these guys aside,” Feldman explained to me. Leaving extra cores allows the chip to essentially self-heal, routing around the lithography error and making a whole-wafer silicon chip viable."
https://techcrunch.com/2019/08/19/the-five-technical-challen...
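A quick back-of-the-envelope check of that 1-1.5% figure (the per-core defect rate here is my assumption, not Cerebras's real data; the core count is the roughly 400,000 widely reported for the wafer):

    import numpy as np

    rng = np.random.default_rng(1)
    n_cores = 400_000
    p_defect = 0.005          # assumed independent per-core defect rate
    spare_fraction = 0.015    # spares held aside, per the quote above

    defective = rng.binomial(n_cores, p_defect, size=10_000)
    repairable = (defective <= spare_fraction * n_cores).mean()
    print(f"fraction of wafers repairable by routing around: {repairable:.3f}")

Under those assumptions the spare budget comfortably exceeds the expected number of bad cores, which is roughly the argument in the quote.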
Maybe something like Intel's EMIB technology where they have small interposers along edges of chips rather than having a giant interposer might help here.
Yields are probably fairly good if they design for manufacturing by placing extra cores / wires to route around failures as I am sure they are.
From discussion at a demo the yield is good, since they are using a large node. Their hardware rerouting also mitigates defects on most chips.
Does anyone here have expertise in this area? Is this the model for a successful company in this area?
* No DRAM or caches, everything is in SRAM, and all local SRAM loads are 1 cycle.
* Model parallel alone is full performance, no need for data parallel if you size to fit.
* Defects are handled in hardware; any latency differences are hidden & not in load path anyway.
* Fully asynchronous/dataflow by default, only need minimal synchronization between forward/backward passes.
I genuinely don't know how you'd build a simpler system than this.
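As a mental model of the "size to fit, weights stay put" point above (this says nothing about Cerebras's actual toolchain; the layer sizes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    layer_sizes = [784, 512, 256, 10]

    # Each region of the fabric permanently holds one layer's weights in
    # local memory (the SRAM point above); weights are never reloaded.
    regions = [rng.normal(size=(m, n)) * 0.01
               for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

    def forward(x):
        # Only the activations flow region to region (dataflow); nothing
        # as heavy as a weight tensor ever moves.
        for w in regions:
            x = np.maximum(x @ w, 0.0)
        return x

    print(forward(rng.normal(size=784)).shape)   # -> (10,)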
While I'd generally be skeptical, it seems like the compilation for the rerouting could be done at a single low level, below whatever their assembler is, so the chip could just look like a regular array of cores - a single array that translates from i to the i-th "real" core, plus similar structures, seems like it could be enough.
Edit: I mean, if they're smart, it seems like they'd make the thing look as much as possible like a generic GPU capable of OpenCL. I have no idea if they'll do that, but since they have size, they won't have to sell their stuff with an otherwise custom approach.
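Something like this indirection table is what I have in mind - entirely hypothetical, not how their compiler actually works:

    # Logical-to-physical core remap: software only ever sees a dense
    # range of logical ids; defective cores never appear in the table.
    n_physical = 16
    defective = {3, 9}           # cores found bad at wafer test (made up)

    remap = [p for p in range(n_physical) if p not in defective]

    def physical_core(logical_id):
        return remap[logical_id]

    print(physical_core(3))      # -> 4: logical core 3 lands on physical 4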
The issue with using ‘industry standard’ benchmarks is that it's like measuring a bus' efficiency by shuttling around a single person at a time. The CS-1 is just bigger than that; the workloads that it provides the most value on are ones that are sized to fit, and specifically built for the device.
This does make it hard to evaluate as outsiders (certainly for similar reasons I never liked Graphcore), but I don't think it means anything as grim as you say. The recipe fits.
To me as a practitioner a meaningful metric would be "it trains an ImageNet classifier to e.g. 80% top1 in a minute". If it's not suitable for CNNs, do BERT or something else non-convolutional. Even better if I can replicate this result in a public cloud somewhere. They know this, and yet all we have is a single mention of a customer under an NDA and no public benchmarks of any kind, let alone any verifiable ones. If it did excel at those, we'd already know.
At best their solution is on par with GPUs in a performance per watt/dollar sense. At worst they're scammers looking for a sucker.
Even if that doesn't work out, most of the people on this team have built companies that were acquired by either AMD or another chip maker.
Although, at this stage, Cerebras does not care about the mass market yet.
https://arxiv.org/pdf/1905.00416.pdf
It would be cool to see the GraphBLAS API ported to this chip, which from what I can tell comes with sparse matrix processing units. As networks become bigger, deeper, but sparser, a chip like this will have some demonstrable advantages over dense numeric processors like GPUs.
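A rough scipy sketch of the arithmetic behind that claim (nothing Cerebras- or GraphBLAS-specific; the size and density are made up):

    import numpy as np
    from scipy import sparse

    n = 10_000
    density = 0.001                 # assume 0.1% of weights are nonzero
    a = sparse.random(n, n, density=density, format="csr", random_state=0)
    x = np.ones(n)

    y = a @ x                       # work scales with nnz, not with n^2
    print(f"dense ~{2 * n * n:.1e} flops vs sparse ~{2 * a.nnz:.1e} flops")

A dense accelerator pays the n^2 cost regardless; hardware that only touches the nonzeros is where the advantage would show up.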
Deep Neural Nets are somewhat of a brute force approach to machine learning. Training efficiency is horrible as compared with other ML approaches, but hey, as long as we can trade +5% of classification performance for +500% of NN complexity and throw more money at the problem, who cares?
I see a dystopian future where much better and much more efficient approaches to ML exist, but nobody's paying attention because we have Deep Neural Nets in hardware and decades of infrastructure supporting it.
If the algorithm is indeed better, how can DNNs dominate and turn this into a dystopia?
The alternative algorithm would be better than DNNs if the same amount of effort were put into creating special-purpose hardware, libraries, and so on for it; but in the dystopia, it's not fully refined DNN vs fully refined alternative algorithm, it's fully refined DNN vs an alternative algorithm running on hardware and software optimized for DNNs.
The alternative algorithm always looks unappealing because the playing field historically favors DNN, and so doesn't take off in the dystopia.
From TFA: "Cerebras hasn’t released MLPerf results or any other independently verifiable apples-to-apples comparisons."
That's all you really need to know.
Less than stellar benchmarks will ruin the "magic"
https://medium.com/predict/cerebras-trounces-moores-law-with...
If this were trying to aim at solid-state physics and materials research, then maybe one could be cautiously optimistic about a genuine breakthrough via something like room-temperature, standard-pressure superconductivity. As it stands, I call blind hype.
And for the curious, a good example of this is radium. For a while it was a miracle cure-all, put in everything from lipstick to jock straps. That did not work out well: https://www.theatlantic.com/health/archive/2013/03/how-we-re...
DeepMind have for place&route IIRC.
Without defending the article, it is nevertheless the case that simply scaling up chip size has nontrivial problems. For example, will the piece of silicon warp or shatter if one side happens to get hotter than the other?