The article started talking about the VAX and how it was the gold standard everybody competed against.
The VAX is little endian.
Little endian is not a hack. It's a natural way to represent numbers. It's just that most languages on Earth write words left to right while writing numbers right to left.
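To make the byte-order point concrete, here is a minimal sketch (the value 0x0A0B0C0D and the helper name are just illustrative choices, not from the article). In little-endian order the byte at offset i carries weight 256**i, so the array index and the significance line up:

```python
# Illustrative only: lay out the same 32-bit value in both byte orders.
value = 0x0A0B0C0D

little = value.to_bytes(4, byteorder="little")  # least significant byte first
big = value.to_bytes(4, byteorder="big")        # most significant byte first


def rebuild_little(data: bytes) -> int:
    """Rebuild an integer from little-endian bytes: byte i has weight 256**i."""
    return sum(b << (8 * i) for i, b in enumerate(data))


print(little.hex(" "))  # 0d 0c 0b 0a
print(big.hex(" "))     # 0a 0b 0c 0d
```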
CTC talked to Intel and Texas Instruments to see if the processor could be put onto VLSI chips to replace the board of TTL chips. Texas Instruments produced the TMX 1795 processor, shortly followed by the Intel 8008, both processors cloning the Datapoint 2200's instruction set and architecture including little-endian. CTC rejected both processors and stuck with TTL. TI couldn't find another customer for the TMX 1795 and it vanished from history. Intel successfully marketed the 8008 as a general-purpose microprocessor. Its architecture was copied for the 8080 and then modified for the 16-bit 8086, leading to the x86 architecture that rules the desktop and server market. As a result, x86 has the little-endian architecture and other features of the Datapoint 2200. I consider the Datapoint 2200 to be one of the most influential processors ever, even though it's almost completely forgotten.
A funny thing about the 8008 is that Intel's manual for its instruction set is unnecessarily shitty — even if you didn't know the history with Datapoint, the Intel manual is obviously not by the people who designed the instruction set because it's in hexadecimal, a tradition sadly followed by the 8080 and 8086 manuals. The Datapoint manuals, by contrast, are all in octal, making the machine code enormously easier to understand. (The H8 I grew up with used an Intel chip, but the front panel monitor program used octal.)
The "PDP-endian" is only a quirk due to its Floating Point Unit's long integer and double-precision floating point formats. The FPU was an extra module attached to the processor, and the original PDP-11 did not even have an FPU. It only appeared on later models: on low-end machines a simplified FPU with limited functionality was available for separate purchase, and only high-end models had the full FPU. On a system without an FPU installed, you basically don't need to worry about "PDP-endian"; it's a pure little-endian machine. But for convenience, the Unix C compiler always stored long integers in PDP-endian to avoid swapping byte order. Because the same Unix and C software ran on all machines with or without an FPU, all Unix programmers needed to worry about it, thus the PDP-endian folklore.
But why did the PDP-11 FPU use this strange format? @aka_pugs from Twitter did some digging, and found that PDP-endian was already in use as a softfloat format by DEC's PDP-11 Fortran compiler. So the FPU was made compatible with that...
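For anyone who hasn't seen the layout, a rough sketch of what "PDP-endian" means for a 32-bit long: two 16-bit words, each stored little-endian, but with the high word first. The function names and the sample value are my own, purely for illustration:

```python
import struct


def pdp_endian_bytes(value: int) -> bytes:
    """Store a 32-bit value as two little-endian 16-bit words, high word first."""
    high = (value >> 16) & 0xFFFF
    low = value & 0xFFFF
    return struct.pack("<HH", high, low)


def little_endian_bytes(value: int) -> bytes:
    """Plain little-endian layout of the same 32-bit value, for comparison."""
    return struct.pack("<I", value)


print(pdp_endian_bytes(0x0A0B0C0D).hex(" "))     # 0b 0a 0d 0c
print(little_endian_bytes(0x0A0B0C0D).hex(" "))  # 0d 0c 0b 0a
```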
If you spell out numbers in little-endian, once you start you’re committed to spelling it out in full, whereas big endian lets you stop at basically any point you feel like.
For representing integers, sure; for real numbers it's quite weird.
<noobermin> X is Y
<baryphonic> You're absolutely right. X is totally not Y
Fair enough. It's an 80's debate really. ISA is probably by a long margin not the most important factor in these comparisons.
But he says it makes no difference at all without evidence.
And he ignores that there is a whole world of simpler (especially in-order) cores where ISA probably does matter a lot.
> The Cortex-A53 is the most widely used architecture for mobile SoCs since 2014 to the present day, making it one of the longest-running ARM processors for mobile devices. It is currently featured in most entry-level and lower mid-range SoCs, while higher-end SoCs used the newer ARM Cortex-A55. The latest SoCs still using the Cortex-A53 are MediaTek Helio G37, both of which are entry-level SoCs designed for budget smartphones.
These may not be the most exciting CPUs, but does the ISA matter here? Yes, probably quite a bit. Does it matter that they are compatible with beefier OoO cores in, say, a big.LITTLE configuration? Yes, it does.
I wouldn't mind too much if he had said I'm just talking about high end. But he implicitly claims to cover everything except 100k transistor CPUs.
?
> The interesting parts of CPU evolution are the three decades from 1964 with IBM's System/360 mainframe and 2007 with Apple's iPhone. The issue was a 32-bit core with memory-protection allowing isolation among different programs with virtual memory. These were real computers, from the modern perspective: real computers have at least 32-bit and an MMU (memory management unit).
Complexity needs to be justified, and the article does a very poor job there.
I know Dhrystone isn't real, but https://www.realworldtech.com/arms-race/2/ says an 8 MHz Archimedes got 4901 Dhrystones per second to the 16 MHz 386's 3626 Dhrystones per second. https://en.wikipedia.org/wiki/Instructions_per_second gives the ARM2 4 [Dhrystone] MIPS at 8 MHz and the "i386DX" 2.15 MIPS at 16 MHz. In fact, the cacheless ARM2 even beat the CISC 68020, which did have a tiny cache!
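Normalising those RealWorldTech figures by clock speed makes the gap more obvious. This is just back-of-the-envelope arithmetic on the numbers quoted above; the variable names are mine:

```python
# Figures as quoted above (Dhrystones per second, clock in MHz).
archimedes_dps, archimedes_mhz = 4901, 8
i386_dps, i386_mhz = 3626, 16

# Work done per MHz of clock on each machine.
archimedes_per_mhz = archimedes_dps / archimedes_mhz
i386_per_mhz = i386_dps / i386_mhz

# How many times more work per clock the ARM machine did.
per_clock_ratio = archimedes_per_mhz / i386_per_mhz

print(f"Archimedes: {archimedes_per_mhz:.1f} Dhrystones/s per MHz")
print(f"386:        {i386_per_mhz:.1f} Dhrystones/s per MHz")
print(f"per-clock advantage: {per_clock_ratio:.2f}x")
```

So on these (admittedly synthetic) numbers the ARM did roughly 2.7 times the work per clock cycle.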
Also, I think squished RISC instruction encodings like Thumb, MIPS16, and RVC seem pretty competitive with popular CISCs on code density; RVC even seems to best them. So even if your data access is competing with instruction fetch for memory bandwidth because you don't have an icache, you'd probably still get more instructions per memory cycle out of RVC than out of i386 or AMD64.
In fact I don't think code size was that much bigger for these designs so cache was probably less important than they initially thought.
The Arm team recognised that memory bandwidth was key for a cacheless design, so they designed to maximise it and make the most of it, hence the outperformance.
[1] https://thechipletter.substack.com/p/the-first-risc-john-coc...
So how did they fail sooo badly at breaking into the mobile CPU market? Their Android phones were notoriously slow and inefficient.
Also isn't one of the reasons the M1 is so fast because it has so many instruction decoders which is much easier because of the ISA?
The author clearly knows a lot of history but it wasn't an especially convincing argument. Especially the idiotic ranting about what makes something a "real" computer.
Through the RISC story we pay a cultural debt we owe to RISC. It is story telling, about a time long gone, and the tale is mythical in nature. In opposition to the myth, as the article states, RISC by itself is no longer an ideal worth pursuing.
This is relevant to the other Big Myth of our tech times, the Unix Story, and by extension to Linux. UNIX is mythical, having birthed OS and file abstractions, as well as C. It was a big idea event. But its design is antithetical to what a common user today needs, owning many devices and installing software that can't be trusted, at all, yet needs to be cooperative.
When Unix was born, many users had to share the same machine, and resources were scarce to the point that there was an urgent need to share them between users. Unix created the system administrator concept and glorified him. But today Unix botches the ideals it was once born of, the ideals of software modularity and reusability. Package managers are a thing, yet people seem blind to the fact they actually bubble up from hell. Many PMs have come already and none will ever cure the disease.
Despite this the younger generations see Unix through rosy glasses, as the pinnacle of software design, kinda like a Statue of Liberty, instead of the destruction of creative forces it actually results in. I posit Linux's contribution to the world is actually negative now. We don't articulate the challenges ahead, we're just procrastinating on Linux. It's the only game in town. But the money is still flowing, servers are still a thing, and so the myth is still alive.
The Unix Myth has become a toxic lie, and as collateral Linus has become a playmate for the tech titans. I'm waiting for him to come out and do the right thing, for it is evil for the Myth to continue to govern today's reality.
You talk about network effects, as Linux is the only game in town, currently. Implicitly I talk about that too, that's why I mention Linus, I expect leadership. Why develop Linux? There's not much ROI, unless change is the keyword.
Indeed the challenge is to create a new operating system suited to the demands that exist today. Fuchsia is a step, the only thing I can point to right now, but it is hardly accessible.
Note that Android works to overcome the need for a system administrator. That's because Linux works against what it needs to be: an invisible OS. The most prolific use of Linux isn't really a success story.
Furthermore, you suggest "power". Perhaps you talk about piping and shell tools. These are indeed part of the myth, good ideas but, pardon me, terribly executed. They do not compose, they are not scalable. Again because of the time frame they were conceived in, this was impossible. But that is the refrain throughout. As a result everything is just messy, resulting in huge time sinks.
Indeed I hope to one day have the disposition to truly go back to basics, and make a runtime, a substrate if you will, that will run fully inspectable code, with an execution that can be visualized, questioned and reasoned about. That way users and tools can adjust any process, repeatedly, or just once. Incrementally so.
All machine details would be hidden, and that includes (obviously, for me) everything about binaries: compilation and ABIs. That is something current OSes (not counting the Web) don't do.
In the end the best OS is an invisible OS.
It gets a lot of details simply wrong. For example, the 68030 wasn't "around 100000 transistors", it was 273000 [1]. The 80386 was very similar at 275000 [2]. By comparison, the ARM1 was around 25000 transistors[3], and yet delivered comparable or better performance. That's a factor of 10! So RISC wasn't just a slight re-allocation of available resources, it was a massive leap.
Furthermore, the problem with the complex addressing modes in CISC machines wasn't just a matter of a tradeoff vs. other things this machinery could be used for, the problem was that compilers weren't using these addressing modes at all. And since the vast majority of software was written in high-level language and thus via compilers, the chip area and instruction space dedicated to those complex instructions was simply wasted. And one of the reasons that compilers used sequences of simple instructions instead of one complex instruction was that even on CISCs, the sequence of simple instructions was often faster than the single complex instruction.
Calling the seminal book by Turing award winners Patterson and Hennessy "horrible" without any discernible justification is ... well it's an opinion, and everybody is entitled to their opinion, I guess. However, when claiming that "Everything you know about RISC is wrong", you might want to actually provide some evidence for your opinions...
Or this one: "These 32-bit Unix systems from the early 1980s still lagged behind DEC's VAX in performance. " What "early 1980s" 32-bit Unix systems were these? The Mac came out in 1984, and it had the 16 bit 68000 CPU. The 68020 was only launched in 1984, I doubt many 32 bit designs based on it made it out the door "early 1980s". The first 32 bit Sun, the 68020-based Sun-3 was launched in September of 1985, so second half of the 1980s, don't think that qualifies as "early". And of course the Sun-3 was faster than the VAX 11. The VAX 8600 and later were introduced around the same time as the Sun-3.
Or "it's the thing that nobody talks about: horizontal microcode". Hmm...actually everybody talked about the RISC CPUs not having microcode, at least at the time. So I guess it's technically true that "nobody" talked about horizontal microcode...
He seems to completely miss one of the major simplifying benefits of a load/store architecture: simplified page fault handling. When you have a complex instruction with possibly multiple references to memory, each of those references can cause a fault, so you need complex logic to back out of and restart those instructions at different stages. With a load/store architecture, the instruction that faults is a load. Or a store. And that's all it does.
It also isn't true that it was the Pentium and OoO that beat the competing RISCs. Intel was already doing that earlier, with the 386 and 486. What allowed Intel to beat superior architectures was that Intel was always at least one fab generation ahead. And being one fab generation ahead meant that they had more transistors to play with (Moore's Law) and those transistors were faster/used less power (Dennard scaling). Their money generated an advantage that sustained the money that sustained the advantage.
As stated above, the 386 had 10x the transistors of the ARM1. It also ran at a significantly faster clock speed (16 MHz-25 MHz vs. 8 MHz), with comparable performance. But comparable performance was more than good enough when you had the entire software ecosystem behind you, efficiency be damned. Advantage Wintel.
Now that Dennard scaling has been dead and buried for a while, Moore's law is slowing and Intel is no longer one fab generation ahead, x86 is behind ARM and not by a little either. Superior architecture can finally show its superiority in general purpose computing and not just in extremely power sensitive applications. (Well part of the reason is that power-consumption has a way of dominating even general purpose computing).
That doesn't mean that everything he writes is wrong, it certainly is true that a complex OoO Pentium and a complex OoO PowerPC were very similar, and only a small percent of the overall logic was decode.
But I don't think his overall conclusion is warranted, and with so much of what he writes being simply wrong, the rest that is more hand-wavy doesn't convince. Just because instruction decode is not a big part doesn't mean it can't be important for performance. For example, it is claimed that one of the reasons the M1 is comparatively faster than x86 designs is that it has one more instruction decode unit. And the reason for that is not so much that it takes so much less space, but that the units can operate independently, whereas with a variable length instruction stream you need all sorts of interconnects between the decode units, and these interconnects add significant complexity and latency.
Right now, RISC, in the form of ARM in general and Apple's MX CPUs in particular, is eating x86's lunch, and no, it's not a coincidence.
I just returned my Intel Macbook to my former employer and good riddance. My M1 is sooooo much better in just about every respect that it's not even funny.
[1] https://en.wikipedia.org/wiki/Motorola_68030
[2] https://en.wikipedia.org/wiki/I386
[3] https://www.righto.com/2015/12/reverse-engineering-arm1-ance...
At least in the 80s, microcomputer compilers were very primitive compared to what we have now, which maintained a strong need for ASM. Dev tools used to be very expensive and proprietary too.
GCC slowly started to change that from 1987 onward.
So there was a time when software started to be mainly written in compiled high-level languages, but with stupid compilers, and CPU designers had to live with that.
I find it worth noting that this is not always the case.
E.g. the RISC-V C extension provides variable-length instructions, but they're still either 16 or 32 bit.
Special care has been put into making the decoding overhead of dealing with this situation negligible, and it is indeed so. There's a benefit, transistor-budget-wise, the moment there's any on-die cache or on-die ROM. Any chip that's smaller than that is going to be very specialized and can simply omit C. In any chip that's larger, C is a net benefit.
As a practical example, the RISC-V based Ascalon by Jim Keller's team is an 8-wide (like M1), 10-issue CPU.
However, you're absolutely right that the wild sort of variable instruction length seen in CISC architectures like x86 is a huge issue that massively complicates implementations and outright imposes a practical limit on decoder width.
OTOH in aarch64, the adoption of a fixed instruction size, thus tanking code density, was unenlightened to the point of brain-dead, we see the cache sizes M1/M2 need just to deal with this, and I'm afraid ARM will be gone for other reasons (non-technical, to do with mismanagement) before they have a chance to correct course and re-introduce compressed instructions.
As for the rest of the article, I generally agree with you that it presents outright wrong information as facts and then tries to push the wrong conclusion. It is utter bull, practically nothing of value can be found in there. I'm not even surprised, as it is pretty much the norm in RISC opposition.
It's more than that. In RISC-V, you only need the first two bits of each instruction to determine whether it's a 16 bit or 32 bit instruction; you don't need to decode an instruction to know its length.
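A minimal sketch of that length check, assuming only the standard 16-bit and 32-bit encodings are present (the longer reserved encodings are ignored here). The function names and the sample byte stream are mine:

```python
def rv_instruction_length(first_halfword: int) -> int:
    """Return the instruction length in bytes from its first 16 bits.

    Assumption: only standard 16/32-bit encodings appear. Low bits 0b11
    mean a 32-bit instruction; anything else is a 16-bit compressed one.
    """
    return 4 if (first_halfword & 0b11) == 0b11 else 2


def split_instructions(code: bytes) -> list:
    """Walk a little-endian code stream, returning (offset, length) pairs."""
    boundaries = []
    offset = 0
    while offset + 2 <= len(code):
        halfword = int.from_bytes(code[offset:offset + 2], "little")
        length = rv_instruction_length(halfword)
        boundaries.append((offset, length))
        offset += length
    return boundaries


# c.nop (0x0001) followed by the 32-bit nop, addi x0, x0, 0 (0x00000013)
stream = bytes.fromhex("0100") + bytes.fromhex("13000000")
print(split_instructions(stream))  # [(0, 2), (2, 4)]
```

The point is that the length decision looks at two bits and nothing else, so a wide decoder can mark candidate boundaries at every halfword in parallel without decoding anything first.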
> [...] we see the cache sizes M1/M2 need just to deal with this, [...]
Do the M1/M2 need these cache sizes, or do they have these cache sizes because they can have these cache sizes, due to having a 4x larger page size by default? (Normally, page size wouldn't be that much of a problem for instruction caches, but for x86 it is because the x86 ISAs don't require explicit instruction cache invalidation on self-modifying code; x86 processors would likely have larger L1 instruction cache sizes if they could get away with it.)
I wonder if Rust or similar could make the MMU transistors and energy budget redundant.
Disclaimer: I am a 68k fan.
The MMU does two things:
- shields processes from each other
- creates the illusion that the machine has more memory than it has
To do away with the former without giving up its benefits, the CPU would, somehow, have to know the code it runs won’t interfere with other processes. It could trust a particular compiler to produce code that’s safe, and rust could provide such a compiler, but then, the CPU would have to prevent Mallory (https://en.wikipedia.org/wiki/Alice_and_Bob#Cast_of_characte...) from producing a binary that he claims was created by that rust compiler, but isn’t.
One way would be to make the CPU run that compiler. The CPU then would not be able to run anything else than code compiled by that rust compiler. That may be seen as prohibitive.
Even if it isn’t, the CPU probably would not want to commit to being tied to one particular compiler. Checking that the actual output of the compiler is safe may be easier. That’s one reason why byte codes were invented. They decrease the coupling between programming language and CPU, allowing evolution of a compiler (often even compiler_s_, supporting multiple languages) independent of the byte code.
So, yes, you could use rust for the first item, but you probably don’t want to. Technologies such as the JVM, Microsoft CLR, or WASM are more suited for that kind of stuff.
Also, if you want to give processes the illusion that the machine has more memory than it has, you would still need an MMU. It could be a bit simpler, but it still would be an MMU.
No, those concerns are completely independent of each other. Rust's memory safety protects from accidentally accessing the wrong memory within the same address space, while the MMU protects against accessing (accidentally or intentionally) any memory in other address spaces.
In addition, the address translation done by the MMU has many more applications, like swapping, memory-mapped files, shared memory, copy-on-write after fork, or stack guard pages, none of which can be done by software alone.
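A toy model of the translation step may help show why this is a separate concern from language-level safety. This is only a sketch: the 4 KiB page size, the dict-based single-level page tables and all names are assumptions for illustration, not how any real MMU is organised:

```python
PAGE_SIZE = 4096  # assumed page size for this toy model


class PageFault(Exception):
    """Raised when a virtual page has no mapping in the current address space."""


def translate(page_table: dict, virtual_address: int) -> int:
    """Map a virtual address to a physical one via a per-process page table."""
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    if page_number not in page_table:
        raise PageFault(f"no mapping for virtual page {page_number}")
    return page_table[page_number] * PAGE_SIZE + offset


# Two processes use the same virtual page 1 but land in different frames.
process_a = {1: 7}
process_b = {1: 9}

print(hex(translate(process_a, 0x1234)))  # 0x7234
print(hex(translate(process_b, 0x1234)))  # 0x9234
```

Neither process can even name the other's frame, and a missing entry (the fault) is exactly the hook that swapping, copy-on-write and guard pages hang off.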
On systems without MMU there's only one shared address space (like on the Amiga, you only had lightweight processes/threads called Exec Tasks which all ran in the same global address space).
Rust could definitely help to isolate memory accesses of applications that all run in the same address space.
Tbh, and maybe this is just the limits of my imagination, but I'm not sure what effect Rust's guarantees would have on the ISA level; they usually concern safety on the application level. Systems programming in general still needs loads of unsafe blocks to actually work (see the debate a few weeks ago where Linus Torvalds critiqued a patch where Rust folks wanted to change memory allocators in Linux so they could play nicer with safe Rust code).
Like, ownership and move semantics are really a higher-level concept, and on today's machines the MMU does not care about anything that happens within a single page, so this wouldn't be a small evolution but a completely different kind of arch. Again, maybe I'm just too uninformed or lack the imagination.
> For example, they added a lot of great JavaScript features, cognizant of the ton of online and semi-offline apps that are written in JavaScript. In contrast, Intel attempts to optimize a chip simultaneously for laptops, desktops, and servers, leading poorly optimizations for laptops.
Now there is a JS feature in the M1. It's the FJCVTZS instruction "Floating-point Javascript Convert to Signed fixed-point, rounding toward Zero" which ensures this conversion follows the JS specification. [1]
And this does indeed improve JS performance for Arm CPUs. But why does JS behave this way? Because it was specified to follow what x86 does!
So to say that 'M1 is optimised for JS but x86 isn't etc' is just plain wrong.
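For reference, the conversion in question is JavaScript's double-to-int32 rule: truncate toward zero, then wrap modulo 2^32 into the signed range, with NaN and infinities mapping to 0. A rough Python model of that rule follows. The function name is mine, and this models only the numeric result, not the condition flags the instruction also sets:

```python
import math


def js_to_int32(x: float) -> int:
    """Model of the JS double -> int32 conversion (numeric result only)."""
    if math.isnan(x) or math.isinf(x):
        return 0
    n = math.trunc(x)   # round toward zero
    n %= 2 ** 32        # wrap modulo 2^32 (non-negative in Python)
    return n - 2 ** 32 if n >= 2 ** 31 else n


print(js_to_int32(3.7))          # 3
print(js_to_int32(-3.7))         # -3
print(js_to_int32(2.0 ** 31))    # -2147483648
print(js_to_int32(float("nan"))) # 0
```

Without a dedicated instruction, the out-of-range wrapping case has to be handled by extra code around the ordinary float-to-int conversion.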
Also:
- Apple didn't do it, Arm did.
- It has nothing whatsoever to do with memory management.
[1] https://stackoverflow.com/questions/50966676/why-do-arm-chip...
How do you figure? The article outlines MMU development since well before C, it's not like people came up with memory protection because of C.
(On the other hand, the B5000 had hardware memory protection despite being programmed in Algol. The B5000 inspired Smalltalk, which inspired Oberon and Java. But its memory protection didn't use an MMU.)
GUI environments of the 80s all brought multitasking with them, and system stability was mediocre to very bad... All writers pointed to memory protection as the cure for all this. See also Mac OS history for a more detailed use case.
Rising software size and complexity made the industry abandon assembly programming for higher level languages and for GUI apps this quickly meant C/C++.
How else might processor topology design dogma be hindering the performance we could get by having better compilers? This is especially important now the transistor budget isn't nearly so flexible.
What work does OoO execution displace to the compiler? I thought that OoO CPUs get better performance on the exact same programs compared to in-order CPUs.
But if the 68k was really a 16-bit design, then the Z-80 was really a 4-bit chip, because that was the size of its ALU. What matters, really, is the register size, and how much work you can do in one instruction. Federico Faggin ("fajjeen", btw) recognized that the Z-80 did not need its 8-bit result in the next clock cycle anyway, so it took two 4-bit cycles, and nobody was the wiser.
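The nibble trick is easy to model: do the low 4 bits first, carry into the high 4 bits on the second pass. This is only a behavioural sketch with made-up function names, not a description of the actual Z-80 circuitry:

```python
def add4(a: int, b: int, carry_in: int):
    """4-bit adder: return (4-bit sum, carry out)."""
    total = (a & 0xF) + (b & 0xF) + carry_in
    return total & 0xF, total >> 4


def add8_via_nibbles(a: int, b: int):
    """8-bit add done as two passes through the 4-bit adder."""
    low, carry = add4(a, b, 0)                 # first pass: low nibbles
    high, carry = add4(a >> 4, b >> 4, carry)  # second pass: high nibbles
    return (high << 4) | low, carry


print(add8_via_nibbles(0x3C, 0x5A))  # (150, 0), i.e. 0x96
print(add8_via_nibbles(0xF0, 0x20))  # (16, 1), i.e. 0x10 with carry out
```

From the programmer's side the result is indistinguishable from a single 8-bit add, which is the whole point.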
The point remains that a simplified ISA that’s easy to decode (and, more recently, easy to implement dynamic reordering for) will always have an edge by freeing up resources that can be dedicated to executing the workload rather than to housekeeping (as in resolving all inter-instruction dependencies).
OTOH, going too far in that direction gives you VLIW, which has proven itself to be a pain more often than not.
That's intentional. A straw man tends to be easier to attack.