We've had some really great progress that we hope to share in the near future, so stay tuned.
EDIT: Since this article is over a year old, we have made a lot of progress and have recently taped out our first chip. We haven't officially posted a job opening, but very shortly we are going to be looking for software engineers who would love to work on our architecture. Feel free to shoot me an email if you're interested!
Firstly, your claims about virtual memory in general-purpose CPUs are misleading: its purpose is memory virtualization, and I wouldn't want a system without it in the presence of multiple processes (how can you trust every process not to shoot down another by accidentally accessing the wrong memory location?).
Ultimately, our hardware will become more specialized/heterogeneous, and we'll have many accelerators for various tasks, but there will likely always be a general purpose CPU at the heart of the system (that will have virtual memory, caches, etc.); for an overview, I enjoyed [1]. I see what you're building as another accelerator for inherently parallel latency-insensitive workloads (like you find in HPC). In a way, GPUs (+ Xeon Phi) cater to these markets today (benchmarks against these would be useful).
Second, I remember the previous post [2], where you claimed the system you are building relies on a RISC ISA, but now you claim it has changed to VLIW. You said yourself before "[...] stick to RISC, instead of some crazy VLIW or very long pipeline scheme. In doing this, we limit compiler complexity while still having very simple/efficient core design, and thus hopefully keeping every core's pipeline full and without hazards [...]"
What is the rationale behind this? Do you think you'll be able to manage compiler complexity now?
Any response is much appreciated!
As for why we think "this time is different": it's a combination of good ideas and timing. I 100% agree with you that in the 50 years of von Neumann derivatives, basically all the low-hanging fruit (and much of it higher up) has been attempted, and thankfully I can say I've learned from a lot of those attempts. Rather than being an entirely new concept, I think we have gone back to some fairly old ideas from the time before hardware-managed caches, and thought about simplicity in terms of what it actually takes to accomplish computational goals. A lot of the hardware complexity that started being added in the mid/late 80s around the memory system (our big focus at REX) came before much attention was put into the compiler. While I am proud of what we have done on the hardware side, I think most of the credit will go to the compiler and software tools if we are successful, as that is what enables us to have such a powerful and efficient architecture. Ergo, we have the advantage of ~30 years of compiler advancements (plus a good amount of our own), where we have the luxury of remaking the decision for software complexity over hardware complexity... plus 30 years of fabrication improvements. Couple that with Intel's declining revenues, the end of easy CMOS scaling, and established portability tools (e.g. LLVM, which we have used as the basis for our toolchain), and I think this is the best time possible for us.
When it comes to virtual memory: why would you need your memory space to be virtualized (which requires address translation) in order to have segmentation? We use physical addresses since it saves a lot of time and energy at the hardware level, but that doesn't mean software can't implement the same features and benefits that virtual memory, garbage collection, etc., provide. The way our memory system as a whole (and in particular our Network on Chip) behaves, and its system of constraints, plays a very large role in this, but I can't/don't want to go into those details publicly right now. It may seem a bit hand-wavy, but we do not see this as a limitation or a real concern for us, and unless you want to write everything in assembly, the toolchain will make this no different from C/C++ code running on today's machines.
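To make the idea concrete, here is a purely hypothetical sketch (the class, names, and address layout are invented for illustration, not REX's actual design) of how a toolchain could enforce segment-style protection in software on a flat physical address space, with no hardware address translation:

```python
# Hypothetical illustration: segment-style protection enforced in software
# on a flat physical address space. A real toolchain would inline these
# checks (and elide ones it can prove safe), rather than use a class.

class Segment:
    """A contiguous region of physical memory granted to one task."""
    def __init__(self, base, size):
        self.base = base
        self.size = size

    def check(self, addr, length=1):
        # Compiler-inserted bounds check: trap instead of silently
        # corrupting another task's memory.
        if addr < self.base or addr + length > self.base + self.size:
            raise MemoryError(f"access at {addr:#x} outside segment")
        return addr  # the physical address is used directly, untranslated

task_a = Segment(base=0x1000, size=0x1000)
task_a.check(0x1800)        # fine: inside the segment
try:
    task_a.check(0x2400)    # would touch memory outside the segment
except MemoryError as e:
    print("trapped:", e)
```

The point is only that isolation between processes can be a software property checked at compile/run time, rather than a hardware property paid for on every access.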
In the case of GPGPUs for HPC, we have the advantage of being truly MIMD rather than SIMD, plus a big improvement in power efficiency, programmability, and cost. We'd win in the same areas (I guess tie on programmability) against the Xeon Phi for benchmarks like LINPACK and STREAM, but the one benchmark I am especially looking forward to is HPCG (and anything else that stresses the memory system along with compute). While NVIDIA and Intel systems on the TOP500 list struggle to get 2% of their LINPACK score on HPCG [0], we should be performing 25x+ better. Based on our simulations, we should perform roughly equally across all 3 BLAS levels, which has been unheard of in HPC since the days of the original (Seymour Cray-designed) Cray machines.
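A quick back-of-the-envelope check of what that claim implies (the figures are just the rough numbers quoted above, not measurements):

```python
# If incumbent systems reach ~2% of their LINPACK (HPL) score on HPCG,
# then performing "25x+ better" on that ratio implies reaching ~50% of
# LINPACK throughput on HPCG.
incumbent_hpcg_fraction = 0.02   # ~2% of HPL, per the TOP500 HPCG results
claimed_improvement = 25         # "25x+ better"

implied_fraction = incumbent_hpcg_fraction * claimed_improvement
print(f"implied HPCG/HPL ratio: {implied_fraction:.0%}")  # → 50%
```

In other words, the claim is that memory-bound HPCG performance would sit within a factor of two of dense compute-bound LINPACK performance, rather than two orders of magnitude below it.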
Of course, my naivety from 2 years ago haunts me now ;) When the linked comment was written, I had yet to "see the light". Only once I understood (through my co-founder, the brilliant Paul Sebexen) the 'magic' that is possible when a toolchain has enough information to make good compilation decisions did I realize that the simplicity of a VLIW decoding scheme made the most sense (and gave us a lot of extra abilities). It was about three months after I made that comment that we started down this path; early prototyping applied to existing VLIW and scratchpad-based systems led to our DARPA and later seed funding. It is only because our hardware is so simple (and mathematically elegant in its organization) that the compiler can efficiently schedule instructions and memory movement. While I've only lived through a small fraction of the last 50 years of computer architecture, I think of myself as a very avid historian of it, and it really shocks me that no one has gone about thinking of the memory system quite like we have. I totally agree with my younger self on long pipelines, though.
TL;DR: We think we'll succeed because we are combining old hardware ideas with new software ideas to make (in our opinion) the best architecture, plus this is the best time for a new fabless semiconductor startup. We have actually built the mythical "sufficiently smart compiler" due to some very clever (but simple) hardware that enables people to actually effectively program for this. We think we will be more energy efficient, performant, and easier to program for than our competition in our target areas (HPC, high end DSP).
[0] http://www.hpcg-benchmark.org/downloads/sc15/hpcg_sc15_updat...
Why 3 interchip links? What network topology are you planning to use to scale to large numbers of chips? If you're still using parallel I/O, how are you planning to communicate beyond a single PCB?
What memory interface are you using? The article seems to confuse your interchip links with your memory controller.
Most of the information in the linked article is very outdated (~16 months old); we have since decided to ditch the idea of having separate DRAM and "external I/O" interfaces and instead put our chip-to-chip links on all four sides of the chip. The chip-to-chip interface uses the same protocol as our Network on Chip and extends the same 2D mesh. We are also looking into (with a sketched-out plan) how to directly interface this I/O with HBM dies in the same MCM package. As for supporting other memories/I/O, we are leaning towards "adapter chips" that would convert our chip-to-chip interface to DDR4, Ethernet, InfiniBand, etc.
As for bandwidth numbers, the aggregate bandwidth for the test chip we have just taped out (16 cores + 2 chip-to-chip I/O macros on TSMC 28nm, 12mm^2 in size) is 60GB/s, though the planned production chip will be over 256GB/s. I have a good feeling we will come in a fair margin higher than that, but I would rather under-promise and over-deliver.
I have read a couple of times that you got funding from various government agencies. Most of these funding agencies publish RFP responses or slide decks unless you insisted on an NDA and it was approved. I couldn't find any documents discussing your work in depth.
I am in the HPC space (academic, research) and am genuinely interested in learning more about your work.
Think you're doing some very exciting work with REX and would love to be a part of the team :).
It would be great if there were some Raspberry Pi-like board with your chip included. I think this could speed up adoption.
Did I get it right?
As a lay processor designer, I couldn't agree more. I don't like VLIW, but this architecture makes a lot of sense. I think it took until now for compiler technology to catch up with what is possible in hardware.
Almost all the good ideas in computing were mined out long ago, the trick I think is to get the computing world to give up on those which are holding things back (cold dead hands if necessary).
also a speaking engagement: http://insidehpc.com/2016/01/call-for-papers-supercomputing-...
and a comment elsewhere that mentions another approach: the "Mill CPU of Mill Computing"
As I recollect (perhaps quite wrongly), Itanium (VLIW) failed because compiler writers couldn't really be bothered, or couldn't mount the learning curve. So I'm most curious about what progress is being made on the compiler side.
You can read my comments on the Mill architecture elsewhere on HN (not a fan of stack machines), but my biggest disappointment with them is that they have been working on the Mill for ~10 years with a team ranging from 5 to 20 people (from what I have heard) and have yet to get to silicon, while we have gone from a completely custom architectural idea to tapeout in ~11 months from closing our first seed funding.
The big technical failure point for Itanium (in my opinion) is that Intel took the relatively pure VLIW research by Josh Fisher at HP Labs and tried to add a ridiculous number of features (and attempted x86 compatibility) that impacted the ability to statically schedule instructions. The resulting bastard architecture, which Intel called "EPIC" (rather than VLIW), made it very difficult for the compiler to generate instruction-parallel code, since Intel added a huge amount of indeterminism that goes against the original VLIW tenets. If your compiler has to assume the worst-case latency for all instructions and memory operations, you are going to have a bad time.
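A toy illustration of why indeterminate latencies hurt a static scheduler (the latency figures are invented for illustration, not from any real chip):

```python
# A VLIW compiler must place each dependent instruction at least `latency`
# cycles after its producer. With a deterministic memory system it can
# schedule against the true latency; with an indeterminate one it must
# schedule against the worst case, stretching the whole static schedule.

def schedule_length(num_dependent_ops, assumed_latency):
    # Cycles to issue a chain of data-dependent operations back-to-back.
    return num_dependent_ops * assumed_latency

# Deterministic 3-cycle scratchpad accesses the compiler can rely on:
print(schedule_length(10, 3))    # → 30 cycles
# Forced to assume a 30-cycle worst case (e.g. a possible cache miss):
print(schedule_length(10, 30))   # → 300 cycles, a 10x longer schedule
```

A dynamically scheduled (out-of-order) core hides this at runtime; a statically scheduled VLIW pays for every cycle of uncertainty at compile time, which is why determinism in the memory system matters so much to this class of design.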
To my understanding, the Mill project is not financed. They're enthusiasts working for sweat equity, and are likely going to seek (non-controlling?) investment to finally hit silicon when they're ready.
For the scope of what they're doing, I think it's a defensible enough approach. It's not something that can be created in evolutionary stages; all designs of all parts need to be working together properly for there to be benefit from any part, and it's quite complex while also trying out tons of novel designs.
(and the Mill isn't stack-based or stack-related. It's basically a crossbar of recent ALU/Load results being fed into further ALU/Store inputs in parallel. The belt is just some way to represent the set of recent results.)
The only moderately successful general-purpose VLIWs are Transmeta's Crusoe and the related NVIDIA Denver, and they use a runtime translation layer to collect the required dynamic information.
It's interesting to note that convolutional neural nets (CNNs) are one solution to the software challenge. It's an imperfect solution, in the sense that CNNs are not as general purpose (at the same efficiency) and have strict data requirements for training, but it is a solution, and the big N are investing heavily to the point of designing ASICs.
Eventually, though, we need to solve the software problem. That will require rethinking programming languages.
Some concepts, like how to manage concurrent data processing and thread communications, need to be handled carefully, but that's more at the level of 'standard library' than the compiler. There is a clear pathway to getting C working on the architecture, and a reasonable direction (that will need some fleshing out) to getting performance-enhancing optimization of something like LLVM IR.
In protected mode (i.e., what the kernel uses), won't an Intel processor also disable virtual memory lookup? Couldn't we just recompile scientific software for a protected-mode environment to get those same benefits?
Also, I think it is more useful and fair to compare against a GPU than a general purpose CPU.
(As an aside, I don't see where the reduced latency gives such a big advantage. There will be latency anyway, so in any case your software has to deal with waiting in an efficient way (doing useful stuff in the mean time). Shaving off some latency will only help if your software design was bad to begin with.)
From the article, the power density is (4 W)/ (0.1mm^2), or 40W/mm^2. Intel's Haswell chip has a TDP of ~ 65W, an area of 14.7mm^2, for a power density of 4.4W/mm^2.
Is this power density a cooling challenge?
After tapeout of our first test chip, the final size for one of our cores is 0.27mm^2 (including the SRAM that makes up the scratchpad memory) on TSMC's 28nm process. We actually came in using fewer gates than originally anticipated, and our size without SRAM is a little less than 0.01mm^2.
Now, just going by what is in the linked article: the diagram comparing sizes is for single cores (a 0.1mm^2 estimate back then for a Neo core, 14.5mm^2 for a single Intel Haswell core). The power numbers in the table below it are for entire chips. You are quoting 65W for a single core, which is incorrect... The 65W Haswell chip I believe you are referring to is the 4770S, which is 4 cores at 65 watts and has a die size of 177mm^2.
Calculating this out using our current numbers, our planned full 256 core chip has changed a bit (doubled the performance since last year, doubled the power due to adding more stuff) and we estimate the TDP to now be 8 Watts and ~100mm^2, which gives us a power density of 0.08W/mm^2. Intel would then have 65W / 177mm^2 = 0.367W/mm^2.
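Working that arithmetic through with the numbers above:

```python
# Power density from the corrected whole-chip numbers quoted above.
rex_density = 8 / 100      # planned 256-core chip: 8 W TDP over ~100 mm^2
intel_density = 65 / 177   # 4-core Haswell (4770S): 65 W over 177 mm^2

print(f"REX:   {rex_density:.3f} W/mm^2")    # → 0.080
print(f"Intel: {intel_density:.3f} W/mm^2")  # → 0.367
```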
As would make sense in the case where we are claiming lower power operation, our power density is also lower.
The power density is impressively low, indeed. Looking forward to more info in Sept.
Let's say it wasn't well received.