Why Registers Are Fast and RAM Is Slow

(mikeash.com)

225 points | anandabits | 12y ago | 90 comments

yan12y ago

For a way more detailed look at memory architectures and implementation, check out Ulrich Drepper's classic paper "What Every Programmer Should Know About Memory"[1]

[1] http://www.akkadia.org/drepper/cpumemory.pdf

demallien12y ago

Or on a more light-hearted note: http://folklore.org/StoryView.py?project=Macintosh&story=Sou...

Which just goes to show, hitting memory is a Bad Thing(tm) even when you're running on a slow (from today's perspective) processor like a 68000.

raverbashing12y ago

Very impressive

Doing 22kHz generation on a Macintosh is very close to the limit

derleth12y ago

It wasn't always thus: On the 6502, which the early Apple II machines were built around, it was possible to access RAM at only a one- or two-cycle penalty compared to doing everything in registers and immediate values. This was only the case if you used zero-page memory without indexing, however, so you couldn't have a lot of stuff in RAM without incurring more speed penalties.

http://www.6502.org/tutorials/6502opcodes.html

(Zero-page memory on the 6502 was the memory accessed via addresses with a high byte of 0x00. Since the 6502 had sixteen-bit RAM addressing, this meant each page was 256 bytes large, so the zero-page was almost as good as having 256 single-byte registers.)
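To put rough numbers on the zero-page point, here is a small sketch using the documented NMOS 6502 cycle counts for the LDA instruction. The table and helper name are only illustrative; other instructions have their own timings, so treat this as one data point rather than a full timing model.

```python
# Documented NMOS 6502 cycle counts for LDA in a few addressing modes.
LDA_CYCLES = {
    "immediate": 2,    # LDA #$10
    "zero_page": 3,    # LDA $10
    "zero_page_x": 4,  # LDA $10,X
    "absolute": 4,     # LDA $1234
}


def penalty_vs_immediate(mode):
    """Extra cycles a load costs compared with an immediate operand."""
    return LDA_CYCLES[mode] - LDA_CYCLES["immediate"]


for mode in LDA_CYCLES:
    print(f"LDA {mode}: {LDA_CYCLES[mode]} cycles, "
          f"penalty {penalty_vs_immediate(mode)}")
```

Unindexed zero-page access costs one extra cycle here, while indexing or a full sixteen-bit address costs two.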

bloodorange12y ago

This, ladies and gentlemen, is a particularly detailed and good read. Please do give it a glance if you haven't already.

solarexplorer12y ago

If you want to buy a good book on the topic: "Memory Systems: Cache, DRAM, Disk" by Bruce Jacob, Spencer Ng, David Wang

yan12y ago

A few days late, but thanks

codex12y ago

A simple thought experiment suffices here. What is the shape which holds the most physical bits while minimizing the overall latency for random access? It's a sphere. Each bit occupies a space packed within that sphere. The radius of the sphere is the distance that light must traverse, and thus corresponds to latency.

"slow" elements of the memory hierarchy are on the outside of the sphere, while faster elements (cache, registers, etc) are layered on the inside, like an onion. Since those spheres are smaller they must, by definition, hold fewer bits, but they are, by definition, faster.

The total number of bits you can store is a function of the volume of the sphere. For a given latency level, it's a function of the surface area of the sphere at a given radius.

The volume of a sphere is (4/3)πr^3. Because latency is a function of the radius (how long it takes light to travel to the edge of the sphere and back), latency must rise at least as the cube root of the number of bits you want to store. That is the best possible bound.

This implies that no algorithm is ever O(1) time for an asymptotically large number of elements accessed randomly--not even hash tables or pointer dereferences. They're at best O(n^(1/3)).
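Written out, the argument above goes roughly as follows. The symbol ρ (bits stored per unit volume, taken as a constant) and the name t for round-trip latency are assumptions introduced here for the sketch; c is the speed of light.

```latex
\begin{align}
N &= \rho \cdot \tfrac{4}{3}\pi r^{3}
  \quad\Longrightarrow\quad
  r = \left(\frac{3N}{4\pi\rho}\right)^{1/3}, \\
t &\ge \frac{2r}{c}
  = \frac{2}{c}\left(\frac{3N}{4\pi\rho}\right)^{1/3}
  = \Theta\!\left(N^{1/3}\right).
\end{align}
```

With ρ and c fixed, storing eight times as many bits at least doubles the worst-case round trip.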

ajuc12y ago

> This implies that no algorithm is ever O(1) for an asymptotically large number of elements--not even hash tables or pointer dereferences.

O(1) is about number of operations required by algorithm to finish for given data size, not about the time. So latency doesn't matter.

Also: if the amount of information that can be kept in universe is finite (most probably it is) - then you can make algorithm that takes the same amount of operations no matter data size (just always add dummy data to fill up the data to the physical limit). Thus every algorithm is technically O(1).

Proof: let N be the number of bits that we can keep in memory. Every deterministic algorithm either loops forever or finishes after at most 2^N changes of state (otherwise it would be in the same state twice with different follow-ups, which it can't be, because it's deterministic). So if we design an algorithm that, for every input fitting into memory, calculates the result and then busy-loops for the remaining steps until step 2^N, this algorithm is O(1) no matter what it does.

There's probably a hole in my understanding somewhere, cause algorithmic complexity would be a really useless definition if that was true :)
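The pigeonhole step in that proof can be tried out on a toy machine. This is only a sketch: the function name and the convention that `step` returns `None` to halt are made up for illustration, and the whole machine state is modelled as a single integer of `n_bits` bits.

```python
def run_until_halt_or_repeat(step, state, n_bits):
    """Run a deterministic machine whose entire state fits in n_bits bits.

    step(state) returns the next state, or None to halt.
    Returns ("halted", steps) or ("loops", steps); in both cases
    steps is at most 2**n_bits, because no state is visited twice.
    """
    seen = set()
    steps = 0
    while state is not None:
        if not 0 <= state < 2 ** n_bits:
            raise ValueError("state does not fit in n_bits")
        if state in seen:
            # Same state again means the same future again: an infinite loop.
            return ("loops", steps)
        seen.add(state)
        state = step(state)
        steps += 1
    return ("halted", steps)


# A 3-bit counter that halts once it reaches 5.
halting = run_until_halt_or_repeat(lambda s: None if s == 5 else s + 1, 0, 3)
# A 3-bit machine that cycles through 0, 2, 4, 6 forever.
looping = run_until_halt_or_repeat(lambda s: (s + 2) % 8, 0, 3)
print(halting, looping)
```

Either outcome is reached within 2^N steps, which is the bound the proof relies on.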

gizmo68612y ago

I think the hole in your understanding is assuming that math (in this case big-O) actually maps to reality. Big-O (and algorithms themselves) is defined entirely in mathematical terms. This model can allow input to be arbitrarily large, and can allow operations to take a constant time. If you want to, you can talk about the algorithmic complexity of an algorithm assuming prime factorization in constant time. Maybe not useful, but no reason we cannot talk about it.

codex12y ago

Usually the implicit assumption with O notation is that n may go to infinity.

Time and the number of operations are equivalent here: as proof, just define the operation as "move an information-carrying photon a tiny distance epsilon". That must take a finite amount of time, as the speed of light is finite, and the number of those operations must increase with the number of randomly accessed elements you're working with, as they're necessary simply to retrieve the element from memory.

Scaevolus12y ago

This is right for theoretical limits, but modern chips are fabricated as stacked 2D layers, forming planes rather than spheres. This changes information density gain per distance from the core from cubic to quadratic-- in Nehalem, the 64KB of L1 cache has 4 cycle latency, while 256KB of L2 cache (4x more) has 10 cycle latency (~2x slower).
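A quick way to check the cubic-versus-quadratic claim against the Nehalem figures quoted above. This is a back-of-the-envelope sketch that assumes capacity grows like distance raised to the number of dimensions and that latency tracks distance; the function name is made up.

```python
def latency_ratio(capacity_ratio, dimensions):
    """Predicted latency growth when capacity grows by capacity_ratio.

    Assumes capacity ~ distance**dimensions and latency ~ distance.
    """
    return capacity_ratio ** (1.0 / dimensions)


planar = latency_ratio(4, 2)       # stacked 2D layers
volumetric = latency_ratio(4, 3)   # ideal sphere
observed = 10 / 4                  # L2 cycles / L1 cycles, from the comment

print(f"2D: {planar:.2f}x, 3D: {volumetric:.2f}x, observed: {observed:.2f}x")
```

The planar prediction lands closer to the observed ratio than the spherical one, although real latencies also include tag lookup and other overheads this model ignores.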

joosters12y ago

Has anyone done a study on the optimal number of registers to have?

The website answers the register question well, but leads to a further question: If registers are so great, why stick with just 16/32/64/n registers? Why not have more? After all, x86-64 and ARM64 decided that having more suited them.

In the end it must come down to a compromise, with the downsides of having more registers possibly being some of the following:

* Increased instruction set size (having to encode a larger register space in the bit patterns of each instruction)

* Increased latency for interrupts? e.g. if your CPU has 1000 registers and an interrupt occurs, you're going to end up having to save all those 1000 registers somewhere. There could be some HW-assist but you'll pay the price somewhere.

* Extra cost for saving registers in functions. Sure, depends upon the ABI as some registers will be 'scratch' and not preserved between function calls, but if you've got more registers you'll end up wanting to save more of them.

* Algorithms might not need all the registers. I wonder what algorithm uses 20 live variables? 50? 100? etc. At some point, those extra registers could be unused.

* Registers still need to be 'spilled' to memory. In an extreme case, you could imagine compiling a small program where every variable maps to a unique register. Ultimate speed! But asides from that optimal case, you'll end up still having to write registers back to memory. It makes no difference having 100 registers if you store the results of every computation...

Anyway, that's all speculation. I was wondering if someone had done a study. You could construct a virtual, bespoke CPU with n registers, then make gcc compile some SPEC benchmarks using the ISA and model it to see how efficient having an extra register makes it. You could graph registers vs simulated runtime and see where the sweet spot is.

Scaevolus12y ago

Yes, it's been studied. You rapidly run into diminishing returns.

http://arxiv.org/ftp/arxiv/papers/1205/1205.1871.pdf

Here's a good thread discussing this: https://groups.google.com/forum/#!searchin/comp.arch/number$...

joosters12y ago

Awesome! Thank you for the link.

brigade12y ago

The studies would vary over time because CPU design and bottlenecks have changed. Early designs were of course limited by transistor count; now we have OoOe, and physical registers are limited by muxers and latency (see the presentations by the Mill CPU guy [1]).

Saving registers in functions is mostly irrelevant - you only save what you'd use, so saving more means fewer spills within the function.

Saving on context switches (interrupts alone aren't a big deal) was indeed a problem back when AltiVec was designed, thus it has a special register to keep track of which registers need to be saved. In modern designs this is less of a problem, between higher frequencies, multiple cores, and the other effects of a context switch dominating (effective flush of l1 cache and predictors).

The interesting bits nowadays are that load/store is expensive power-wise, which was what ARM identified as the major motivation behind having 32 registers (fewer spills in functions) and OoOe designs.

[1] http://m.youtube.com/watch?v=QGw-cy0ylCc&desktop_uri=%2Fwatc...

joosters12y ago

> Saving registers in functions is mostly irrelevant - you only save what you'd use, so saving more means fewer spills within the function.

Ah, but I'm sure that if you have more registers available, you'd use more registers. Up to a certain point. But what point? Just how many registers?

masklinn12y ago

> The website answers the register question well, but leads to a further question: If registers are so great, why stick with just 16/32/64/n registers?

TFA gives at least one reason:

> Registers use an expensive and power-hungry active design. They're continuously powered, and when reading them, this means that their value can be read quickly. Reading a register bit is a matter of activating the right transistor and then waiting a short time for the powerful register hardware to push the read line to the appropriate state.

Registers use up a lot of silicon, and consume a lot of energy to power it. They also need to stay physically close to computing circuits, otherwise you end up with an L1 cache more than a register.

Furthermore, although an ISA exposes a number of registers A, OOO architectures (and their friends, parallel and speculative execution) pretty much require the CPU to have > A registers and do register renaming, which lowers the number of registers the ISA can define. For instance the Alpha ISA defines 32 integer registers, but the Alpha 21264 had 80 physical integer registers.

joosters12y ago

That's definitely another factor. Again though, I doubt it's the limiting one. No-one (as far as I know) has produced a power-hungry CPU with (say) 5000 registers on it.

solarexplorer12y ago

Register windows are a way to put 1000 registers in a CPU. See the SPARC and Itanium instruction sets for how this can be done. There are also plenty of studies about both.

Vector registers are another way to use 1000 registers.

But directly coding 1000 registers into each instruction does not seem to be such a good idea. You might as well use a 1st level cache. The difference between the cache and the register file is mostly how the instruction set architecture references it. Registers are usually easier to access because each one has a single name and the CPU can detect dependencies and conflicts easily. Memory accesses and caches are more complex because you need to calculate the addresses before you can detect dependencies/conflicts.

PD: Yet another way to use 1000 registers is massive multi-threading like the Tera MTA.

terrymah12y ago

It's complicated, but modern processors actually do have many more registers than you can name in the instructions. They use things like "register renaming" to avoid false conflicts between instructions.

Registers that you name in assembly != physical registers. And when you use a register in two different instructions, you won't necessarily get the same physical register each time.

edderly12y ago

I thought this was an interesting insight in to that: http://ootbcomp.com/docs/belt/index.html

vilda12y ago

Note that the actual number of registers is considerably different than the number of registers you can access through instruction set. They are used via register renaming and optimizations of complex instructions.

joosters12y ago

Yes. As other commenters have said, if you are doing out-of-order execution well, the CPU will have many more 'hidden' registers and do register renaming to use them. But this has an interesting interaction with compilers.

Say you have a simple function that is going to add 1 to a bunch of variables. In an ARM-like assembly code, this could be written as:

  LDR r1, [r0, #0]
  ADD r1, r1, #1
  STR r1, [r0, #0]
  LDR r1, [r0, #4]
  ADD r1, r1, #1
  STR r1, [r0, #4]
  LDR r1, [r0, #8]
  ADD r1, r1, #1
  STR r1, [r0, #8]

Now, if your CPU can do OoOE, it can spot that register r1 is used for three independent loads, adds and stores, and can internally use three different registers for them, allowing the operations to be done in parallel. But, equally, the compiler could have written the code as:

  LDR r1, [r0, #0]
  ADD r1, r1, #1
  STR r1, [r0, #0]
  LDR r2, [r0, #4]
  ADD r2, r2, #1
  STR r2, [r0, #4]
  LDR r3, [r0, #8]
  ADD r3, r3, #1
  STR r3, [r0, #8]

Compilers and register renaming are fighting each other. In traditional compiler writing, you try to minimise the register usage and output the first code listing. But if you have plenty of registers, you could output the second code instead, and let the CPU do parallel execution without the need for register renaming.

In other words, once you have enough 'real' registers does it get rid of the need for register renaming? Intel added it to their Pentiums to improve existing x86 code, but I wonder if it has that much of a benefit with newer ISAs that have 'enough' registers and properly tuned compilers?
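For anyone who wants to see what renaming does to the first listing, here is a toy model. It is not how any particular CPU implements it: the tuple format, the `rename` function, and the `p0`, `p1`, ... physical register names are all invented for the sketch, and it hands out a fresh physical register on every write without ever recycling one.

```python
from itertools import count

# The first listing, as (opcode, destination or None, source registers).
PROGRAM = [
    ("LDR", "r1", ("r0",)),
    ("ADD", "r1", ("r1",)),
    ("STR", None, ("r1", "r0")),
] * 3


def rename(instructions, arch_regs=("r0", "r1")):
    """Map architectural registers onto fresh physical registers.

    Every write allocates a new physical register, so reusing r1 for
    unrelated values no longer creates a false dependency.
    """
    fresh = count()
    mapping = {reg: f"p{next(fresh)}" for reg in arch_regs}
    renamed = []
    for op, dest, srcs in instructions:
        new_srcs = tuple(mapping[s] for s in srcs)  # read current mappings
        new_dest = None
        if dest is not None:
            mapping[dest] = f"p{next(fresh)}"       # allocate on write
            new_dest = mapping[dest]
        renamed.append((op, new_dest, new_srcs))
    return renamed


renamed = rename(PROGRAM)
for op, dest, srcs in renamed:
    print(op, dest, srcs)
```

After renaming, each load/add/store group works on its own physical registers, which is the same independence the second listing spells out by hand.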

brigade12y ago

You still need OoOe to execute your second example optimally since you didn't schedule the instructions, which points to why OoOe isn't going away - there are going to be code sequences that the compiler cannot schedule optimally, particularly around branches. Additionally, cache misses are impossible to predict statically, and OoOe helps hide those.

And no one does OoOe without register renaming.

Symmetry12y ago

Most high performance CPUs will have ~100 physical registers or so, possibly divided up in multiple segments.

But abstracting those you have your architectural registers that are presented by your ISA, and the CPU uses register renaming to map those onto the physical registers.

The tradeoffs involving ISA registers are more intense. You have to load and store all of them on thread swaps, but that's pretty trivial. More importantly the bits you have to use to specify which register you're using are bits that you're paying in every single instruction you have, increasing the size of your executable and the pressure on your caches.

Different sorts of architectures have their sweet spots at different places. In-order processors doing lots of matrix math and such benefit from lots of architectural registers; the Itanium had 128 integer and 128 floating point registers, and that was the right amount for a VLIW architecture with its features. Modern GPUs are similar.

On the other hand, your typical OoO CPU will have either 16 or 32 registers you can address at a time, and that seems to be close to optimal. It's hard to say since instructions come in discrete chunks and your number of registers has to be a power of 2 as a practical matter.
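The encoding cost mentioned above is easy to tabulate. A minimal sketch, assuming a three-operand instruction format where every operand is a register field; the function name is made up.

```python
def operand_field_bits(num_registers, operands=3):
    """Bits spent naming registers in one instruction.

    Each field needs enough bits to distinguish num_registers names.
    """
    bits_per_field = (num_registers - 1).bit_length()
    return bits_per_field * operands


for regs in (16, 32, 128, 1000):
    print(f"{regs:5d} registers -> {operand_field_bits(regs)} bits of operand fields")
```

With 1000 addressable registers, a three-operand instruction would spend 30 bits on register names alone, which leaves almost nothing for the opcode in a 32-bit instruction word.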

codex12y ago

Fundamentally, having more registers increases the speed of light delays in accessing the register. If it did not, we would just operate on main memory itself. However, too few registers and you lose the ability to perform complex computations efficiently. So I believe it is, indeed, a compromise between speed and a need to maintain scratch state. I would be surprised if Intel and AMD didn't constantly run simulations of common computations in an effort to find the optimal size of all on-chip structures.

joosters12y ago

That's definitely another factor but I suspect it isn't the limiting factor. Sure, design a chip with a million registers and you'll end up constructing them like RAM. But with orders-of-magnitude fewer registers, 16 or 32 or whatever, the size of the register banks on the CPU can't be that significant to incur speed-of-light style delays, surely?

Theodores12y ago

Perhaps it would clarify things with analogy:

Let's say Bubba's watching the Super Bowl. The table in front of him are his registers, the fridge is cache, and the corner shop a quick walk away is memory.

He looks and sees he doesn't have any beer on the table. So he goes to the fridge and gets what he wants, and comes back to the couch. Later, Bubba runs out of beer (useful data). This is a cache miss, so he has to go down to the corner store and get some beer. Instead of just getting what he wants, maybe he gets some Hungry-Man frozen dinners, in case he'll want some later. He goes back, puts the beer and TV dinner in the fridge, and brings some beers with him to the table. Next time he runs out of beer, he goes to the corner store, but they're all out of beer. So he buys some seed, tills the fields, and grows his own barley. This is disk access.

http://ucb.class.cs61c.narkive.com/pKzt4z6G/the-doe-library-...

terrymah12y ago

Hmph. You forgot the part where Bubba's friends are watching him drink beer and eat Hungry-Mans, and if they want some, they can force Bubba to throw out all his food and pour all his beer down the drain, and everyone has to go back to the store.

revelation12y ago

There's something in between, which you will find on microcontrollers: SRAM. If you use simple architectures, like AVR, you also get completely deterministic timings for a load from SRAM (e.g. 2 cycles for AVR).

Edit: Chill, everyone. Yes, it's "implementation detail of the substrate", but it is a very important implementation detail given that it is directly exposed to the programmer as memory, not in some automagically managed cache.

georgecmu12y ago

SRAM is used in every CPU, not just microcontrollers. Registers and cache are usually implemented as SRAM. The false distinction this article makes between registers and RAM is misleading and indicative of the author's general ignorance of computer architecture.

masklinn12y ago

It's not misleading in the least unless you're a pedantic smartass who wants something to complain about. TFA uses terminology which "Reader Daniel Hooper" will understand, and in which RAM is a synonym for "main memory". Which is the colloquial meaning of RAM outside of hardware design labs and pedantic smartassery.

masklinn12y ago

That's the implementation detail of the substrate, TFA uses "RAM" in the sense of "main memory" which is the colloquial meaning of the acronym. Registers can be implemented in SRAM. So can CPU-level caches or various hardware buffers.

zwieback12y ago

That was my thought when I was reading the article. On-chip SRAM on microcontrollers feels different because on general purpose CPUs the generic programming model has registers and RAM with the cache managed for us by others. On MCUs you almost always end up being aware of on-chip SRAM and off-chip SRAM or DRAM. The lines are blurry for larger MCUs but for lower end stuff like Cortex M, AVR or MSP430 it's definitely a good idea to look over instruction timing for all the different flavors of storage.

simias12y ago

Most ARM SoCs have a few hundred kilobytes of "internal RAM" (which is obviously SRAM) used mainly by the ROM and bootloader before the memory controller is initialized and can usually be accessed with the same latency as the L2 cache.

It's usually unused once the kernel has started but it can be mapped by the kernel later on if there's a use for it.

makomk12y ago

Modern x86 chips generally allow the onboard cache to be used as RAM during early boot for the same reason, too.

sigkill12y ago

So, I'm a bit confused. Are registers SRAM? Or are they faster than SRAM?

ramchip12y ago

Any of these computer architecture concepts: register file, L1/L2/L3 cache, main memory

Can be implemented with any of these components: DRAM, SRAM, D-FF (flip-flops)

It's common for main memory (in embedded systems) and register files to use SRAM. But you can also implement the registers with flip-flop banks, and get something bulkier but faster. I'm not sure what Intel/AMD does.

mikeash12y ago

This is great stuff to know. Not relevant to my audience, I think, but it's something I wasn't quite aware of before, and I'm happy you pointed it out.

twoodfin12y ago

Do any current ARM implementations do register renaming over a physical register set larger than the architected set?

Obviously Intel has been doing this for a while: Haswell has something like 168 integer registers, while the x86-64 ISA only exposes 16.

EDIT: Some Googling tells me that at least the Cortex-A9 mapped 32 architectural registers to 56 physical: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc....

Symmetry12y ago

Basically anybody doing out of order execution these days is going to be doing register remapping at some level.

Symmetry12y ago

That article did a lot of simplifying, but probably simplifying that was needed for the person who asked that question.

An interesting thing about Apple's take on AArch64 in particular, which some people have been speculating about, is how the memory subsystem of Apple's Cyclone core works. ARM cores usually use the physical (post-MMU) address of data to determine where in the cache data lives, but if you stick with a page size as big as or bigger than the L1 size you can start your L1 lookup at the same time you do your TLB lookup, and save a lot of latency. Apple's control of the OS is what lets them force 64K page sizes.
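The page-size condition can be written as a one-line check. This sketch uses the usual textbook form of the rule, which compares the page size against the size of one cache way (total size divided by associativity); for a direct-mapped cache that is the same as the wording above. The cache and page sizes in the examples are hypothetical and are not claims about any specific core.

```python
def can_index_before_translation(cache_bytes, ways, page_bytes):
    """True if the cache index fits entirely inside the page offset.

    When it does, the index bits are identical in the virtual and
    physical address, so the set lookup can start in parallel with
    the TLB lookup.
    """
    way_bytes = cache_bytes // ways
    return way_bytes <= page_bytes


KIB = 1024
print(can_index_before_translation(32 * KIB, 8, 4 * KIB))    # small ways, small pages
print(can_index_before_translation(64 * KIB, 4, 4 * KIB))    # ways larger than a page
print(can_index_before_translation(64 * KIB, 4, 16 * KIB))   # larger pages fix it
```

Larger pages let a designer grow the L1 without adding ways, which is one reading of why page size matters for the design discussed above.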

eksith12y ago

This part stood out : "The ideal of 3-4 instructions per clock cycle can only be achieved in code that has a lot of independent instructions."

And a bit later : "3. Modern CPUs do crazy things internally and will happily execute your instruction stream in an order that's wildly different from how it appears in the code."

This may potentially explain why a smaller executable isn't necessarily faster when executing. I guess a lot of compiler gymnastics are devoted to breaking down complex instructions to take advantage of this.

knappador12y ago

In some ways, the actual execution of code is opaque to compilers. Modern x86 processors further divide their instructions into op-codes in the instruction translation units. AMD and Intel both have their approaches to this internal instruction set deeply ingrained into every CPU since perhaps K7 for AMD and Pentium Pro for Intel. Pentium M and later the Core architecture contained op-code fusing where instead of just rearranging op-codes, the op-codes were combined into composite op-codes that could be executed in one step. The opcode fusing + out-of-order execution basically makes the CPU act like a compiler internally for binary. It's like a JIT run-time for binary that's implemented in hardware.

As far as executable size and performance, compiling with -Os in GCC will occasionally yield a performance increase that might even change across CPUs and architectures as the memory sub-systems hit a good rhythm or there are fewer misses overall. Usually smaller is better for this. -O3 will occasionally unroll gigantic loops. Using compiler-directed optimization to analyze which parts of a binary benefit from unrolling, versus fewer misses from a smaller executable, can yield even better agreement between memory subsystem performance and execution speed.

Architectures like MIPS have further blind alleys such as branch-delay slots, whose instructions will finish execution even if a branch instruction -before- the slots is taken. This is an out-of-order program, but putting the burden on the compiler instead of implementing the reordering in hardware actually became a nuisance because the architecture couldn't change how it expected instructions without breaking binary compatibility and the compiler wouldn't have been able to tweak for different CPUs without a fat-binary approach.

arohner12y ago

Depends on the app, and your use case, and the CPU, etc. YMMV.

For a long time though, the Linux kernel has been compiled to optimize code-size rather than 'performance' (according to GCC). Why? Because the kernel gets involved in every syscall the OS makes, so the kernel code gets paged in and out very frequently. Loading a little less code from RAM means everything goes faster.

jff12y ago

Well, it's going to be faster if the smaller executable can keep its entire text segment in memory.

I've done the instruction scheduling stuff by hand on paper; it's pretty interesting. We did Tomasulo scheduling, which is hardly modern, being developed in 1967, but it'll execute your instructions all sorts of ways.

eliteraspberrie12y ago

Great explanation for folks without a hardware background. I also enjoyed his previous article about ARM64. Thanks for sharing.

ams611012y ago

As a kid I had a TI 99/4a. The TMS9900 processor didn't have any registers, it had a "workspace pointer" which let you treat a block of RAM as your registers. This was slow, but in theory allowed for convenient context switches as you'd just load a new workspace pointer.

Do any modern CPUs still use an approach like this?
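The workspace-pointer idea is simple enough to model in a few lines. This is a toy sketch, not an emulator: the class and method names are made up, and memory is word-addressed for simplicity, whereas the real TMS9900 is byte-addressed with its sixteen 16-bit registers at WP + 2n.

```python
class WorkspaceCPU:
    """Toy CPU whose 'registers' are just a window into main memory."""

    NUM_REGS = 16

    def __init__(self, memory_words):
        self.memory = [0] * memory_words  # word-addressed for simplicity
        self.wp = 0                       # workspace pointer

    def _address(self, n):
        if not 0 <= n < self.NUM_REGS:
            raise IndexError("register number out of range")
        return self.wp + n

    def read_reg(self, n):
        return self.memory[self._address(n)]

    def write_reg(self, n, value):
        self.memory[self._address(n)] = value & 0xFFFF  # 16-bit registers

    def switch_context(self, new_wp):
        """A context switch is just a new workspace pointer."""
        old_wp, self.wp = self.wp, new_wp
        return old_wp


cpu = WorkspaceCPU(64)
cpu.write_reg(3, 0x1234)          # task A's R3
task_a = cpu.switch_context(32)   # switch to task B's workspace
cpu.write_reg(3, 0xBEEF)          # task B's R3 lives elsewhere in RAM
cpu.switch_context(task_a)        # back to task A, nothing to restore
print(hex(cpu.read_reg(3)))
```

No register values are copied on the switch, which is the convenience described above; the cost is that every register access is a memory access.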

Symmetry12y ago

Not if the CPU runs faster than 10 MHz or so. Fundamentally the speed of CPUs has gone up much, much faster than the speed of RAM for the reasons listed in the article. Some micro-controllers can still do things like you describe, but anything you'd think of as a modern CPU uses some form of caching that makes things more complicated than that.

rlu12y ago

It's funny reading this and then remembering that on top of all this, there's paging (i.e. fetching from hard drive).

It's like registers are refrigerators, RAM is like the grocery store around the corner, and page faults are like the grocery stores in a neighboring town.

woooooo memory!

grogenaut12y ago

Don't forget DMA which is like drop shipping with a guaranteed delivery date, like 3 days later, but they just shove it directly into the shelf

taeric12y ago

While I personally love this answer, I have to admit a basic physical metaphor works. If you remember an answer, it is practically immediate. The further back in your records you have to go to find something, the slower it will be.

We have faster ways of recalling notes today than we did in the past, you might say? Well, yeah. In many respects our RAM is faster than registers of early computers, too. That all things have gotten faster doesn't change that things which were faster are still faster. (I'd be delighted to know examples where this radically changed somehow.)

grogenaut12y ago

Jebus christ, it's because they're close. Like IN the cpu. Not nuzzled not ON, not NEXT TO.

Hell, if you know and optimize for registers and don't know why they're fast, you should be shot. Otherwise you're using a language that doesn't really give you control over registers why do you care?

Okay okay, I like reading about the blackbird and I know that I know nothing about how it really works other than lots of fuel. Still. Okay, I'm a Hypokrite.

ars12y ago

While the CPU is waiting for data to load from RAM, is the operating system smart enough to give it a different task to execute?

dexen12y ago

Glad you asked ;-) It's called Hyper-threading [1] and works best when the scheduler is aware [2]. Provided in some of Intel's CPUs (2 threads per core) and in Sun's (later Oracle's) UltraSPARC T1...T5 (8 threads per core).

[1] http://en.wikipedia.org/wiki/Hyper-threading

[2] for example, CONFIG_SCHED_SMT in linux

mikeash12y ago

The overhead of task switching is too great for that to be useful, plus the OS would probably need to talk to RAM as part of the whole process anyway.

However, this is part of what hyperthreading accomplishes. The OS gives the CPU two tasks ahead of time, then when one task stalls, the CPU can switch over to the other one and work on it for a while.
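A toy cycle counter shows the effect. This is a deliberately crude sketch with made-up names: one instruction issues per cycle, a "miss" blocks only its own thread for `miss_latency` cycles, the core always prefers the lowest-numbered ready thread, and the 10-cycle latency is chosen to keep the numbers small rather than to match real DRAM.

```python
def cycles_to_finish(threads, miss_latency):
    """Cycles for one core to run all threads, issuing one op per cycle.

    Each thread is a list of "alu" or "miss" ops executed in order.
    A thread that misses cannot issue again until the miss returns,
    but another ready thread may use the issue slot in the meantime.
    """
    pcs = [0] * len(threads)
    ready_at = [0] * len(threads)
    cycle = 0
    while any(pc < len(t) for pc, t in zip(pcs, threads)):
        for i, ops in enumerate(threads):
            if pcs[i] < len(ops) and ready_at[i] <= cycle:
                op = ops[pcs[i]]
                pcs[i] += 1
                ready_at[i] = cycle + (miss_latency if op == "miss" else 1)
                break  # only one issue slot per cycle
        cycle += 1
    return max(max(ready_at), cycle)


thread = ["miss", "alu", "alu"]
alone = cycles_to_finish([thread], miss_latency=10)
together = cycles_to_finish([thread, thread], miss_latency=10)
print(alone, together)
```

In this model, running two copies side by side takes 14 cycles instead of the 24 needed back to back, because the second thread's miss overlaps with the first one's.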

solarexplorer12y ago

This is actually what hyperthreading is all about: cache misses. I missed that in the article. There are more things missing actually, but I guess it would be too much to explain it all in a single article. Things like caches, coherence protocols, prefetching, memory disambiguation. Registers are also much more complex because you have things like register renaming, result forwarding etc. In the end there are simply far fewer registers than memory locations; that's why you can build faster registers than memory.

cfallin12y ago

In addition to what others said about the overhead of context switching just for a DRAM access stall (at today's DRAM latencies, which are ~200 to 400 cycles), there's an architectural issue with the idea, too. Consider that from software's point of view, missing the cache and going to DRAM is "invisible": it happens as part of executing a single instruction. Software doesn't know the cache miss happened; architecturally, the result of the memory load is the same whether it came from cache or DRAM. So to allow the OS to do something clever, the processor would have to define a way of notifying the software that a cache miss occurred, probably by raising an exception and aborting the instruction, to be resumed later (like a page fault). So it would take a nontrivial amount of effort by CPU architects to enable such an OS feature.

Interestingly, there is at least one academic proposal to do something like this [1], but I'm not aware of any real implementations.

[1] http://dl.acm.org/citation.cfm?id=891494

masklinn12y ago

The OS does not. That's the job of the out of order architecture (load prediction and reordering of later non-dependent instructions).

And when that fails, it's exactly the use case for hyperthreading.

markb13912y ago

But electrical signals don't propagate at the speed of light in a vacuum (I didn't read past that point). The signals travel at about 2/3 the speed of light. This is very significant when you look at path lengths.
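To see how much the 2/3 factor matters, here is the distance a signal covers in one clock cycle. A small sketch: the 3 GHz clock is an arbitrary example and the function name is made up; the velocity factor of 2/3 is the figure from the comment above.

```python
C_VACUUM_M_PER_S = 299_792_458  # speed of light in a vacuum


def distance_per_cycle_cm(clock_hz, velocity_factor=1.0):
    """One-way distance a signal travels during one clock cycle, in cm."""
    return C_VACUUM_M_PER_S * velocity_factor / clock_hz * 100


vacuum_cm = distance_per_cycle_cm(3e9)
wire_cm = distance_per_cycle_cm(3e9, velocity_factor=2 / 3)
print(f"vacuum: {vacuum_cm:.2f} cm/cycle, at 2/3 c: {wire_cm:.2f} cm/cycle")
```

These are one-way figures; a request and reply within a single cycle can only reach half that far.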

gonzo12y ago

I'll just leave this here. https://www.usenix.org/system/files/1309_14-17_mickens.pdf

bane12y ago

So what's the state of research in breaking out of the Von Neumann approach and going with a RAM-free architecture where the CPU has m(b)illions of registers you just do everything in? Of course it's expensive, but let's say you have effectively infinite dollars, is this a good idea?

knz4212y ago

Where would your processor fetch its program code from, if not RAM?

Assuming you place code also into registers...

If you squint hard enough registers are also a form of RAM, just closer to the processor and faster. A machine with only instruction execute and registers would still have a Harvard/Von Neumann architecture.

The reason why processors don't have more registers is because they are quite power hungry and they are not very dense. For a given chip area, D-RAM gives you more than 6x the capacity for less than half the power use. And no, you can't make registers with the same technology as D-RAM.

bane12y ago

Right, registers are a kind of very small working memory, the only place where "work" operations can happen. Most program code eventually has to go through the register bank anyway, except it all has to be MOV in and out of the registers, eating up unbelievable amount of time.

I've always viewed RAM as a kind of register cache, necessitated because registers are expensive to build and RAM, though expensive, is cheaper. I've heard registers these days are just a small bit of SRAM, but reaching into my way back machine in college, I seem to remember them being a different kind of memory element.

But RAM and all the caches these days leading up to registers all require a fetch from somewhere, a store in the register, doing the work, then writing back the result somewhere (even if the instruction set obfuscates that). If you had enough registers, the fetch and store parts of that work are pretty much gone, turning something like

  mov 0xaddressh-1 RegA
  mov 0xaddressh-2 RegB
  add RegA RegB RegC
  mov RegC 0xaddressh-3

into

  add 0xReg-1 0xReg-2 0xReg-3

where each mov we do today introduces a cascade down the cache and memory stack (perhaps even dipping into on-disk VM) just to copy a few bytes into a register. And we have to do that 3 times here. The number of adds we could do in the time it takes to do a mov is probably pretty high, but we simply can't do them because we're waiting on bits moving from one place to another.

So suppose money, power etc. weren't considered issues and engineering effort was put into a register-only approach, how much faster would that be? (One of the reasons that the Von Neumann architecture became "the" way to do things was that registers were considered expensive to build, but what if we didn't care about money?)

I'd bet a general purpose system built this way would be an order of magnitude faster than anything we have today. But you're right, it would be an enormous resource hog and be as expensive as a medium-sized mega yacht.

grogenaut12y ago

georgecmu12y ago

Cites distance and cost/power as reasons why "RAM is slow, registers are fast", not a mention of differences between SRAM and (S)DRAM.

Not worth reading.

jff12y ago

He clearly describes SRAM in the following paragraph, then contrasts it with DRAM in the rest:

Registers use an expensive and power-hungry active design. They're continuously powered, and when reading them, this means that their value can be read quickly. Reading a register bit is a matter of activating the right transistor and then waiting a short time for the powerful register hardware to push the read line to the appropriate state.

georgecmu12y ago

Yeah, I did not even recognize this as description of SRAM, thanks for pointing this out.

from http://www.differencebetween.net/technology/difference-betwe...

Because of its lower price, DRAM has become the mainstream in computer main memory despite being slower and more power hungry compared to SRAM. SRAM memory is still used in a lot of devices where speed is more crucial than capacity. The most prominent use of SRAM is in the cache memory of processors where speed is very essential, and the low power consumption translates to less heat that needs to be dissipated.

In any case, SRAM can be more power hungry than DRAM (per bit) but it can also be vastly less. SRAM power consumption is not at all driven by the fact that registers are "continuously powered". Accessing SRAM is the power hungry operation, but powering requirements otherwise are negligible. If anything, it's the DRAM that requires constant powering (refreshing).

alain9404012y ago

Agreed. Completely misses on-die caches for instance. Article might have been accurate in the 6502 days, but not today.

masklinn12y ago

> Agreed. Completely misses on-die caches for instance.

Absolutely, except for the part where he mentions them

> If you're really lucky and the value is in L1 cache, it'll only take a few cycles.

codex12y ago

The total number of bits you can store is a function of the volume of the sphere. For a given latency level, it's a function of the surface area of the sphere at a given radius.

This implies that no algorithm is ever O(1) time for an asymptotically large number of elements accessed randomly, not even hash tables or pointer dereferences. They're at best O(n^(1/3)).
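
A quick Python sketch of that scaling, assuming bits are packed at uniform density into a sphere and that worst-case latency is proportional to the radius. The function names and the unit bit volume are assumptions for illustration only.

```python
import math


def min_radius(n_bits, bit_volume=1.0):
    """Radius of the smallest sphere that holds n_bits of the given volume each.

    Solves (4/3) * pi * r**3 = n_bits * bit_volume for r.
    """
    return (3 * n_bits * bit_volume / (4 * math.pi)) ** (1 / 3)


def latency_growth(n_small, n_large):
    """Factor by which worst-case latency grows when capacity goes
    from n_small bits to n_large bits (latency taken as proportional to radius)."""
    return min_radius(n_large) / min_radius(n_small)


if __name__ == "__main__":
    print(latency_growth(1_000_000, 8_000_000))
```

Storing 8 times as many bits doubles the radius, and with it the worst-case latency, which is the cube-root growth described above.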

ajuc12y ago

> This implies that no algorithm is ever O(1) for an asymptotically large number of elements--not even hash tables or pointer dereferences.

O(1) is about the number of operations an algorithm needs to finish for a given data size, not about the time. So latency doesn't matter.

There's probably a hole in my understanding somewhere, cause algorithmic complexity would be a really useless definition if that was true :)

gizmo68612y ago

codex12y ago

Usually the implicit assumption with O notation is that n may go to infinity.

Scaevolus12y ago

joosters12y ago

Has anyone done a study on the optimal number of registers to have?

In the end it must come down to a compromise, with the downsides of having more registers possibly being some of the following:

* Increased instruction set size (having to encode a larger register space in the bit patterns of each instruction)

* Algorithms might not need all the registers. I wonder what algorithm uses 20 live variables? 50? 100? etc. At some point, those extra registers could be unused.
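
The encoding cost in the first point can be sketched in a few lines of Python. This assumes a three-operand instruction format (destination plus two sources) in which every operand names a register directly; the function names are hypothetical, and real ISAs use many tricks that this ignores.

```python
def register_field_bits(num_registers):
    """Bits needed to name one register out of num_registers."""
    if num_registers < 1:
        raise ValueError("need at least one register")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without float rounding.
    return (num_registers - 1).bit_length()


def operand_bits(num_registers, operands=3):
    """Total bits an instruction spends on register operands (assumed 3-operand form)."""
    return operands * register_field_bits(num_registers)


if __name__ == "__main__":
    for n in (16, 32, 5000):
        print(n, operand_bits(n))
```

Under these assumptions 16 registers need 12 bits of operand fields and 32 need 15, while a hypothetical 5000-register file would need 39 bits, more than an entire 32-bit instruction word.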

Scaevolus12y ago

Yes, it's been studied. You rapidly run into diminishing returns.

http://arxiv.org/ftp/arxiv/papers/1205/1205.1871.pdf

Here's a good thread discussing this: https://groups.google.com/forum/#!searchin/comp.arch/number$...

joosters12y ago

Awesome! Thank you for the link.

brigade12y ago

Saving registers in functions is mostly irrelevant - you only save what you'd use, so saving more means fewer spills within the function.

[1] http://m.youtube.com/watch?v=QGw-cy0ylCc&desktop_uri=%2Fwatc...

joosters12y ago

> Saving registers in functions is mostly irrelevant - you only save what you'd use, so saving more means fewer spills within the function.

Ah, but I'm sure that if you have more registers available, you'd use more registers. Up to a certain point. But what point? Just how many registers?

masklinn12y ago

> The website answers the register question well, but leads to a further question: If registers are so great, why stick with just 16/32/64/n registers?

TFA gives at least one reason:

Registers use up a lot of silicon, and consume a lot of energy to power it. They also need to stay physically close to computing circuits, otherwise you end up with an L1 cache more than a register.

joosters12y ago

That's definitely another factor. Again though, I doubt it's the limiting one. No-one (as far as I know) has produced a power-hungry CPU with (say) 5000 registers on it.

solarexplorer12y ago

Register windows are a way to put 1000 registers in a CPU. See the SPARC and Itanium instruction sets for how this can be done. There are also plenty of studies about both.

Vector registers are another way to use 1000 registers.

PS: Yet another way to use 1000 registers is massive multi-threading like the Tera MTA.

terrymah12y ago

Registers that you name in assembly != physical registers. And when you use a register in two different instructions, you won't necessarily get the same physical register each time.

edderly12y ago

I thought this was an interesting insight into that: http://ootbcomp.com/docs/belt/index.html

vilda12y ago

joosters12y ago

Say you have a simple function that is going to add 1 to a bunch of variables. In an ARM-like assembly code, this could be written as:

  LDR r1, [r0, #0]
  ADD r1, r1, #1
  STR r1, [r0, #0]
  LDR r1, [r0, #4]
  ADD r1, r1, #1
  STR r1, [r0, #4]
  LDR r1, [r0, #8]
  ADD r1, r1, #1
  STR r1, [r0, #8]

  LDR r1, [r0, #0]
  ADD r1, r1, #1
  STR r1, [r0, #0]
  LDR r2, [r0, #4]
  ADD r2, r2, #1
  STR r2, [r0, #4]
  LDR r3, [r0, #8]
  ADD r3, r3, #1
  STR r3, [r0, #8]
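
A toy Python renamer shows what register renaming would do to the first listing, the one that reuses r1 throughout. This is an assumed, heavily simplified model (every register write gets a brand-new physical register and nothing is ever freed), not a description of any real core. The tuple encoding of the instructions and the p0, p1, ... names are made up for the example.

```python
from itertools import count

# First listing above, encoded as (destination, sources).
# Stores write memory, not a register, so their destination is None.
REUSED_R1 = [
    ("r1", ("r0",)), ("r1", ("r1",)), (None, ("r1", "r0")),  # LDR / ADD / STR at #0
    ("r1", ("r0",)), ("r1", ("r1",)), (None, ("r1", "r0")),  # LDR / ADD / STR at #4
    ("r1", ("r0",)), ("r1", ("r1",)), (None, ("r1", "r0")),  # LDR / ADD / STR at #8
]


def rename(instructions, arch_regs=("r0", "r1", "r2", "r3")):
    """Rewrite architectural register names into physical register names."""
    fresh = count()
    # Each architectural register starts out backed by its own physical register.
    mapping = {reg: f"p{next(fresh)}" for reg in arch_regs}
    renamed = []
    for dest, sources in instructions:
        # Sources are looked up before the destination is remapped.
        phys_sources = tuple(mapping[src] for src in sources)
        phys_dest = None
        if dest is not None:
            # Every write is given a physical register nobody else is using.
            mapping[dest] = f"p{next(fresh)}"
            phys_dest = mapping[dest]
        renamed.append((phys_dest, phys_sources))
    return renamed


if __name__ == "__main__":
    for dest, sources in rename(REUSED_R1):
        print(dest, sources)
```

In this toy model the three LDR results land in p4, p6 and p8, so the three load/add/store groups no longer share a register even though the assembly names r1 each time.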

brigade12y ago

And no one does OoOe without register renaming.

Symmetry12y ago

Most high performance CPUs will have ~100 physical registers or so, possibly divided up in multiple segments.

But abstracted above those are the architectural registers presented by your ISA, and the CPU uses register renaming to map them onto the physical registers.

codex12y ago

joosters12y ago

Theodores12y ago

Perhaps an analogy would clarify things:

Let's say Bubba's watching the Super Bowl. The table in front of him is his registers, the fridge is cache, and the corner shop a quick walk away is memory.

http://ucb.class.cs61c.narkive.com/pKzt4z6G/the-doe-library-...

terrymah12y ago

revelation12y ago

georgecmu12y ago

masklinn12y ago

zwieback12y ago

simias12y ago

It's usually unused once the kernel has started but it can be mapped by the kernel later on if there's a use for it.

makomk12y ago

Modern x86 chips generally allow the onboard cache to be used as RAM during early boot for the same reason, too.

sigkill12y ago

So, I'm a bit confused. Are registers SRAM? Or are they faster than SRAM?

ramchip12y ago

Any of these computer architecture concepts: register file, L1/L2/L3 cache, main memory

Can be implemented with any of these components: DRAM, SRAM, D-FF (flip-flops)

mikeash12y ago

This is great stuff to know. Not relevant to my audience, I think, but it's something I wasn't quite aware of before, and I'm happy you pointed it out.

twoodfin12y ago

Do any current ARM implementations do register renaming over a physical register set larger than the architected set?

Obviously Intel has been doing this for a while: Haswell has something like 168 integer registers, while the x86-64 ISA only exposes 16.

EDIT: Some Googling tells me that at least the Cortex-A9 mapped 32 architectural registers to 56 physical: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc....

Symmetry12y ago

Basically anybody doing out of order execution these days is going to be doing register remapping at some level.

Symmetry12y ago

That article did a lot of simplifying, but probably simplifying that was needed for the person who asked that question.

eksith12y ago

This part stood out: "The ideal of 3-4 instructions per clock cycle can only be achieved in code that has a lot of independent instructions."

And a bit later: "3. Modern CPUs do crazy things internally and will happily execute your instruction stream in an order that's wildly different from how it appears in the code."

knappador12y ago

arohner12y ago

Depends on the app, and your use case, and the CPU, etc. YMMV.

jff12y ago

Well, it's going to be faster if the smaller executable can keep its entire text segment in memory.

eliteraspberrie12y ago

Great explanation for folks without a hardware background. I also enjoyed his previous article about ARM64. Thanks for sharing.

ams611012y ago

Do any modern CPUs still use an approach like this?

Symmetry12y ago

rlu12y ago

It's funny reading this and then remembering that on top of all this, there's paging (i.e. fetching from hard drive).

It's like registers are refrigerators, RAM is the grocery store around the corner, and page faults are the grocery stores in a neighboring town.

woooooo memory!

grogenaut12y ago

Don't forget DMA which is like drop shipping with a guaranteed delivery date, like 3 days later, but they just shove it directly into the shelf

taeric12y ago

grogenaut12y ago

Jebus christ, it's because they're close. Like IN the cpu. Not nuzzled not ON, not NEXT TO.

Okay okay, I like reading about the blackbird and I know that I know nothing about how it really works other than lots of fuel. Still. Okay, I'm a Hypokrite.

ars12y ago

While the CPU is waiting for data to load from RAM, is the operating system smart enough to give it a different task to execute?

dexen12y ago

[1] http://en.wikipedia.org/wiki/Hyper-threading

[2] for example, CONFIG_SCHED_SMT in linux

mikeash12y ago

The overhead of task switching is too great for that to be useful, plus the OS would probably need to talk to RAM as part of the whole process anyway.

solarexplorer12y ago

cfallin12y ago

Interestingly, there is at least one academic proposal to do something like this [1], but I'm not aware of any real implementations.

[1] http://dl.acm.org/citation.cfm?id=891494

masklinn12y ago

The OS does not. That's the job of the out of order architecture (load prediction and reordering of later non-dependent instructions).

And when that fails, it's exactly the use case for hyperthreading.

markb13912y ago
