Why Intel is adding instructions to speed up non-volatile memory

(danluu.com)

118 points · joe_bleau · 11y ago · 43 comments

There are several storage class memories that are nearing commercialization. Intel is betting big on at least one of them. Most technologies in this class are orders of magnitude faster and have orders of magnitude better endurance than flash memory, while being only slightly slower than DRAM, yet non-volatile.

It is plausible that with another layer of in-package cache they could eliminate DRAM altogether, replacing it with ultrafast NVM. Imagine the resume/suspend speed and power savings of a machine whose state is always stored in NVM.

runeks · 11y ago

> There are several storage class memories that are nearing commercialization.

I'm very interested in this. Could you point out which technologies are nearly ready for commercialization?

My understanding is that the current cost is orders of magnitude higher per unit of storage for these new technologies compared to NAND flash or even DDR3 RAM. But of course, a dedicated fab could change that very quickly.

jhallenworld · 11y ago

Well, nvDIMMs are available right now (from companies like Netlist, Agigatech, Viking, Smart, and Micron). These are DRAM with an analog switch, a controller, and flash memory. When power is lost, the DRAM is disconnected from the processor and its contents are copied to the flash. The newer technologies might be cheaper, but as far as I know their write performance is not yet as good as DRAM's.

The issue is the cache: the data is not non-volatile until it has been written back to DRAM. Even then, you need some advance warning of a power outage for it all to work.
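A toy Python model of the point above, assuming an nvDIMM that copies its DRAM to flash whenever it gets enough warning of a power loss. The class and method names are hypothetical. A store that has not been written back from the CPU cache is simply lost, even though DRAM itself is saved.

```python
from dataclasses import dataclass, field


@dataclass
class NvdimmSystem:
    """Hypothetical model: a volatile write-back CPU cache in front of an nvDIMM."""

    dram: dict = field(default_factory=dict)   # copied to flash on power loss
    cache: dict = field(default_factory=dict)  # volatile dirty lines

    def store(self, addr: int, value: int) -> None:
        # With a write-back cache, an ordinary store only updates the cache.
        self.cache[addr] = value

    def load(self, addr: int) -> int:
        # The CPU sees the newest value, wherever it currently lives.
        return self.cache.get(addr, self.dram.get(addr, 0))

    def write_back(self, addr: int) -> None:
        # Copy a cached line to DRAM; the line may stay in the cache (clean).
        if addr in self.cache:
            self.dram[addr] = self.cache[addr]

    def power_loss(self) -> dict:
        # Assumes the early warning arrived in time for the nvDIMM to copy
        # DRAM to flash. Nothing saves the cache, so its contents are lost.
        self.cache.clear()
        return dict(self.dram)


system = NvdimmSystem()
system.store(0x40, 1)
system.store(0x80, 2)
system.write_back(0x40)
recovered = system.power_loss()
print(recovered)  # only address 0x40 survives
```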

Unibus (bus for PDP-11 core memory systems) had an early warning signal, to give the memory controller a chance to write back the previous (destructive) read.

bahahah · 11y ago

Components are available on the market now based on PCM, MRAM, and FRAM. I know that Intel has large productization, not research, teams working on a variant of SCM. Near means 2-3 years though. Research exit to market ready is always a 3-5 year cycle when process engineering is involved.

fleitz · 11y ago

Is this basically memristors coming to market or are memristors still a few years off?

the8472 · 11y ago

This should be useful for any type of NVRAM, be it battery-backed DRAM, MRAM, memristors or DMA-mapped flash.

sweis · 11y ago

I've heard predictions that a significant portion of new x86 servers will be using non-volatile memory within the next 5-7 years.

Memory is becoming the new disk. This could have major security implications, as memory contents are unencrypted in general.

Fortunately, Intel CPUs will have hardware support to encrypt SGX enclaves. Perhaps that support can be used for general memory access as well.

ams6110 · 11y ago

if non-volatile memory is becoming the new disk, why is it any more or less likely to be encrypted than current disk storage (mostly not, as far as I've seen).

sweis · 11y ago

Long story short, memory bandwidth is much higher than the best x86 crypto implementations can handle.

Encrypting disks or network is no problem today, but we'll need architectural changes to support full memory encryption without a performance hit.

WallWextra · 11y ago

This could be done mostly transparently, with the encryption in the memory controller. Addresses and data are already scrambled with a (non-cryptographic) scrambling code for EMI reasons. Of course, a sufficiently fast hardware crypto core would be required.

EDIT: Also, I forgot that the last generation of consoles (and I assume the current) have transparent encryption of main memory.

eloff · 11y ago

How do you square that with the performance of the AES-NI instructions? That is theoretically 16 bytes per cycle from the manual. Per core. That is way in excess of memory bandwidth, even with DDR4.
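The comparison can be checked with back-of-envelope arithmetic. The 16 bytes per cycle is the commenter's theoretical figure; the 3 GHz clock and dual-channel DDR4-2400 are assumptions picked for illustration. Sustained AES throughput in practice depends on the cipher mode, key size, and microarchitecture, and is typically lower than this peak.

```python
# Back-of-envelope numbers for the comparison above (assumed configuration).
CLOCK_HZ = 3.0e9             # assumed core clock (3 GHz)
AES_BYTES_PER_CYCLE = 16     # commenter's theoretical per-core figure

DDR4_TRANSFERS_PER_S = 2400e6  # assumed DDR4-2400
BYTES_PER_TRANSFER = 8         # one 64-bit channel
CHANNELS = 2                   # assumed dual-channel setup


def gb_per_s(bytes_per_s: float) -> float:
    """Convert bytes/s to decimal gigabytes/s."""
    return bytes_per_s / 1e9


aes_per_core = AES_BYTES_PER_CYCLE * CLOCK_HZ
dram_peak = DDR4_TRANSFERS_PER_S * BYTES_PER_TRANSFER * CHANNELS

print(f"AES, one core at the claimed rate: {gb_per_s(aes_per_core):.1f} GB/s")  # 48.0
print(f"DDR4-2400, two channels, peak:     {gb_per_s(dram_peak):.1f} GB/s")     # 38.4
```

Put the other way round, a single core would need to sustain about 12.8 bytes per cycle (38.4 GB/s divided by 3 GHz) to keep up with this memory configuration on its own; spreading the work over several cores lowers the per-core requirement proportionally.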

crest · 11y ago

The VIA C7 AES implementation could keep up with memory (ca. 20Gb/s). With suitable cipher modes you can use multiple pipelined units in parallel with negligible overhead.
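Why the cipher mode matters: in counter (CTR) mode each block's keystream depends only on the key, a nonce, and the block index, so blocks can be encrypted independently and in any order. The sketch below shows that structure. It uses truncated HMAC-SHA256 as a stand-in for the AES block function and the function names are illustrative, so treat it as a demonstration of the mode, not a cipher to deploy.

```python
import hashlib
import hmac
from concurrent.futures import ThreadPoolExecutor

BLOCK = 16  # bytes per block, as with AES


def keystream_block(key: bytes, nonce: bytes, counter: int) -> bytes:
    # Stand-in for AES_k(nonce || counter): truncated HMAC-SHA256.
    msg = nonce + counter.to_bytes(8, "big")
    return hmac.new(key, msg, hashlib.sha256).digest()[:BLOCK]


def crypt_block(key: bytes, nonce: bytes, data: bytes, index: int) -> bytes:
    # Depends only on (key, nonce, index) and this block's own bytes,
    # so any block can be processed without touching the others.
    chunk = data[index * BLOCK:(index + 1) * BLOCK]
    stream = keystream_block(key, nonce, index)
    return bytes(a ^ b for a, b in zip(chunk, stream))


def num_blocks(data: bytes) -> int:
    return (len(data) + BLOCK - 1) // BLOCK


def ctr_sequential(key: bytes, nonce: bytes, data: bytes) -> bytes:
    return b"".join(crypt_block(key, nonce, data, i) for i in range(num_blocks(data)))


def ctr_parallel(key: bytes, nonce: bytes, data: bytes, lanes: int = 4) -> bytes:
    # Threads stand in for independent hardware pipelines; map() keeps order.
    with ThreadPoolExecutor(max_workers=lanes) as pool:
        parts = pool.map(lambda i: crypt_block(key, nonce, data, i), range(num_blocks(data)))
        return b"".join(parts)


key, nonce = b"k" * 32, b"nonce-01"
message = bytes(range(100))
print(ctr_parallel(key, nonce, message) == ctr_sequential(key, nonce, message))  # True
```

Decryption is the same XOR operation. Chained modes such as CBC encryption cannot be split this way, because each block needs the previous ciphertext block first.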

AlyssaRowan · 11y ago

Or fast, strong, pipelined hardware encryption.

AES is not the best you could do there.

justcommenting · 11y ago

three words for you: cold boot attacks

AlyssaRowan · 11y ago

No.

Remanence attacks are pointless against non-volatile media. You use them against volatile media in a physical attack in an attempt to sneak under/manipulate the limits of that volatility to cause violations of security assumptions, such as "the keys are in RAM" (true) > "RAM is instantly volatile on shutdown" (not quite true) > "keys are instantly zeroised on shutdown" (not this easily they're not).

Some RAM is much more volatile than conventional bulk SRAM or DRAM (for example, L1/L2 caches on CPUs are frequently impractical to exploit). Properly encrypt bulk data held in high-remanence or non-volatile RAM with a key held in such low-remanence RAM, and your security problem is solved.
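A minimal sketch of that arrangement, assuming the key lives only in low-remanence storage (modeled as an attribute cleared at power-off) and everything written to the non-volatile region is ciphertext. The class is hypothetical and the keystream uses HMAC-SHA256 as a stand-in for a real memory-encryption cipher; a per-address version counter keeps a rewrite from reusing the same keystream.

```python
import hashlib
import hmac
import os


def _keystream(key: bytes, addr: int, version: int, length: int) -> bytes:
    # Stand-in PRF, not a vetted memory cipher; yields at most 32 bytes.
    msg = addr.to_bytes(8, "big") + version.to_bytes(8, "big")
    return hmac.new(key, msg, hashlib.sha256).digest()[:length]


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


class EncryptedNvm:
    """Hypothetical model: ciphertext in NVM, key only in volatile storage."""

    def __init__(self) -> None:
        self.nvm: dict = {}  # addr -> (version, ciphertext); survives power-off
        self._key = None     # volatile; assumed to be low-remanence

    def power_on(self, key: bytes) -> None:
        self._key = key

    def power_off(self) -> None:
        self._key = None     # the key is gone; only ciphertext remains

    def _require_key(self) -> bytes:
        if self._key is None:
            raise RuntimeError("locked: no key loaded")
        return self._key

    def write(self, addr: int, data: bytes) -> None:
        if len(data) > 32:
            raise ValueError("toy model stores at most 32 bytes per address")
        key = self._require_key()
        version = self.nvm.get(addr, (0, b""))[0] + 1  # fresh keystream per write
        self.nvm[addr] = (version, _xor(data, _keystream(key, addr, version, len(data))))

    def read(self, addr: int) -> bytes:
        key = self._require_key()
        version, ciphertext = self.nvm[addr]
        return _xor(ciphertext, _keystream(key, addr, version, len(ciphertext)))


key = os.urandom(32)
mem = EncryptedNvm()
mem.power_on(key)
mem.write(0x1000, b"secret data")
mem.power_off()
print(mem.nvm[0x1000])  # what an attacker reading the NVM would see
```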

gizmo686 · 11y ago

That still doesn't answer the question. If you treat non-volatile memory as a disk, then the data would never touch it unencrypted, so a cold boot attack is useless against the non-volatile memory. Of course, you could still launch a cold boot attack on the volatile memory, but we can do that already.

Animats · 11y ago

Computing really hasn't figured out how to handle non-volatile memory as yet. It's almost always used to emulate rotating disks, with file systems, named files, and a trip through the OS to access anything. Access times for non-volatile memory are orders of magnitude faster than disk access times, so small accesses are feasible. But that's not how it's treated under existing operating systems.

There are alternatives. Non-volatile memory could be treated as a key/value store, or a tree, with a storage controller between the CPU and the memory device. With appropriate protection hardware, this could be accessed from user space through special instructions. That's what I thought this article indicated. But no. This is just better cache management for the OS.

zurn · 11y ago

There have been systems where everything is memory mapped and disks are just used to emulate more memory.

It's called "single-level store" in System/38 and descendants. File access in Multics was all memory mapped.

There's nothing inherently rotating-disky about current filesystem APIs from the user's point of view; they just provide a database interface with a certain kind of namespace for access. The block-level part is largely invisible to filesystem users (modulo leaky abstractions).
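For comparison, this is roughly what memory-mapped file access looks like through a conventional OS interface today: once the file is mapped, the program updates it with ordinary stores into the mapping and then asks the kernel to write the dirty pages back. The function name and temporary path are illustrative; on POSIX systems `mmap.flush()` is backed by `msync()`.

```python
import mmap
import os
import tempfile


def update_through_mapping(path: str) -> bytes:
    """Write to a file via plain memory stores, then read it back normally."""
    with open(path, "wb") as f:
        f.truncate(mmap.PAGESIZE)  # size the file to one page
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), mmap.PAGESIZE) as mm:
            mm[0:5] = b"hello"  # a store into the mapped region, no write() call
            mm.flush()          # ask the OS to write dirty pages back to the file
    with open(path, "rb") as f:
        return f.read(5)


with tempfile.TemporaryDirectory() as tmp:
    print(update_through_mapping(os.path.join(tmp, "data.bin")))  # b'hello'
```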

wereHamster · 11y ago

It is already treated as a k/v store, where the key is an LBA and the value is a 512- or 4096-byte block. The OS builds everything else (i.e. filesystems) on top of that. Applications can already access the raw k/v store directly if they wish (open /dev/sd? directly, permissions allowing).
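The block-device-as-key/value-store view can be made concrete with positional reads and writes: the key is the LBA, and the value lives at byte offset LBA × block size. The 4096-byte block size is an assumption, the helper names are illustrative, `os.pread`/`os.pwrite` are Unix-only, and the sketch uses a scratch file instead of a real /dev/sd? node, since writing to a raw disk destroys its contents.

```python
import os
import tempfile

BLOCK_SIZE = 4096  # assumed logical block size; some devices use 512


def read_block(fd: int, lba: int, block_size: int = BLOCK_SIZE) -> bytes:
    """Look up the 'value' (one block) stored under the 'key' lba."""
    data = os.pread(fd, block_size, lba * block_size)
    if len(data) != block_size:
        raise OSError(f"short read at LBA {lba}")
    return data


def write_block(fd: int, lba: int, data: bytes, block_size: int = BLOCK_SIZE) -> None:
    """Store exactly one block under the 'key' lba."""
    if len(data) != block_size:
        raise ValueError("data must be exactly one block")
    if os.pwrite(fd, data, lba * block_size) != block_size:
        raise OSError(f"short write at LBA {lba}")


with tempfile.TemporaryDirectory() as tmp:
    fd = os.open(os.path.join(tmp, "disk.img"), os.O_RDWR | os.O_CREAT, 0o600)
    try:
        write_block(fd, 3, b"x" * BLOCK_SIZE)
        block = read_block(fd, 3)
        hole = read_block(fd, 0)  # never written: reads back as zeros
    finally:
        os.close(fd)
```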

dfryer · 11y ago

This is not (specifically) for the OS. This is for non-volatile memory that is directly attached to the memory bus. The OS can then directly map NVRAM into the address space of a user-space process; the application could use these instructions to efficiently ensure the crash consistency of its persistent data.
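A toy model of why ordering instructions matter for crash consistency. It assumes any dirty cache line may be evicted (and so become durable) at any moment, and it collapses "write back the line, then fence" into a single hypothetical `persist()` step; on real hardware that would be CLWB or CLFLUSHOPT plus whatever fencing the platform requires. The protocol writes the payload, persists it, and only then sets and persists a commit flag; the checker enumerates every state a crash could leave behind.

```python
from itertools import combinations

FLAG = 100  # hypothetical address of the commit flag


class PmemModel:
    """Hypothetical model of stores to persistent memory through a volatile cache."""

    def __init__(self) -> None:
        self.media: dict = {}  # contents guaranteed durable
        self.dirty: dict = {}  # cached lines that may or may not reach media

    def store(self, addr: int, value: int) -> None:
        self.dirty[addr] = value

    def persist(self, addr: int) -> None:
        # Stands in for a cache-line write-back followed by a fence.
        if addr in self.dirty:
            self.media[addr] = self.dirty.pop(addr)

    def possible_crash_states(self):
        # Any subset of dirty lines may have been evicted before the crash.
        items = list(self.dirty.items())
        for r in range(len(items) + 1):
            for evicted in combinations(items, r):
                state = dict(self.media)
                state.update(evicted)
                yield state


def unordered_update(pm: PmemModel, values):
    for addr, value in enumerate(values):
        pm.store(addr, value)
        yield
    pm.store(FLAG, 1)
    yield


def ordered_update(pm: PmemModel, values):
    for addr, value in enumerate(values):
        pm.store(addr, value)
        yield
    for addr in range(len(values)):
        pm.persist(addr)  # payload must be durable before the flag is set
        yield
    pm.store(FLAG, 1)
    yield
    pm.persist(FLAG)
    yield


def crash_safe(protocol, values) -> bool:
    """True if no crash point can expose a set flag with an incomplete payload."""
    pm = PmemModel()
    for _ in protocol(pm, values):  # a crash may happen after any step
        for state in pm.possible_crash_states():
            payload_ok = all(state.get(a) == v for a, v in enumerate(values))
            if state.get(FLAG) == 1 and not payload_ok:
                return False
    return True


print(crash_safe(unordered_update, [7, 8, 9]))  # False
print(crash_safe(ordered_update, [7, 8, 9]))    # True
```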

soamv · 11y ago

Well, this is about userspace access to nvram, with the nvram mapped as memory. It just so happens that cache management is one of the hard parts of doing that, so that's what these new instructions are for.

rbanffy · 11y ago

Don't forget core memory existed before rotating disks. The first internet routers were shipped with their program already loaded.

harshreality · 11y ago

Can't use current flash chips in that way, because of write endurance.

rbanffy · 11y ago

Also, current flash memories do not allow single-address writes. At least the write endurance problem could be addressed by adding wear leveling to an address translation layer. The single-address problem could be addressed by a caching/grouping layer that interacts with the leveling mechanisms. Add to that an all-core state dump to a block write and you can recover to an internally consistent state after a power failure.
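A minimal sketch of wear leveling in an address translation layer, under simplifying assumptions: every rewrite goes to the least-worn free physical block, and the block holding the stale copy is erased and returned to the free pool. The class is hypothetical; real translation layers are considerably more involved.

```python
class WearLevelingMap:
    """Hypothetical translation layer: logical block -> physical block."""

    def __init__(self, physical_blocks: int) -> None:
        self.erase_counts = [0] * physical_blocks
        self.data = [None] * physical_blocks
        self.free = set(range(physical_blocks))
        self.logical_to_physical: dict = {}

    def write(self, logical: int, value) -> None:
        if not self.free:
            raise RuntimeError("no free physical blocks (needs spare capacity)")
        # Pick the least-worn free block; ties go to the lowest index.
        target = min(self.free, key=lambda p: (self.erase_counts[p], p))
        self.free.remove(target)
        self.data[target] = value
        old = self.logical_to_physical.get(logical)
        if old is not None:
            # The stale copy is erased (one more wear cycle) and reused later.
            self.erase_counts[old] += 1
            self.data[old] = None
            self.free.add(old)
        self.logical_to_physical[logical] = target

    def read(self, logical: int):
        return self.data[self.logical_to_physical[logical]]


ftl = WearLevelingMap(physical_blocks=4)
for i in range(100):
    ftl.write(0, i)  # hammer a single logical address
print(ftl.read(0), ftl.erase_counts)  # 99 [25, 25, 25, 24]
```

Without the translation layer, all 99 erases in this example would land on a single physical block.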

unwind · 11y ago

Very interesting! It's always fun to see "external" development in the general field of computer architecture affect low-level stuff like a CPU's cache and memory subsystems.

It wasn't super-easy to figure out who in the grand ecosystem view of things is going to have to care about these instructions, but I guess database and OS folks.

Also, if the author reads this, the first block quote with instruction descriptions has an editing fail, it repeats the same paragraph three times (text begins "CLWB instruction is ordered only by store-fencing operations").

0x0 · 11y ago

Isn't there a higher risk of data loss, if your "hard drive" is 100% memory mapped - all it would take is one buggy kernel driver writing to an invalid pointer or memset'ing the whole thing to 0?

JoeAltmaier · 11y ago

Certainly damage can happen faster, since the NVRAM is faster. But my buggy driver could write the whole disk to 0 already.

signa11 · 11y ago

Well, the same is true now as well, right? For example, a buggy driver can overwrite a buffer-cache pointer with something else, and then you are hosed. If you are playing in kernel-land and not careful enough, you are courting disaster...

0x0 · 11y ago

True, but if it overruns a buffer, it still needs to maintain a valid SCSI/ATAPI/whatever command packet format and submit the packet to the controller with repeatedly increasing block numbers - that's a lot of instructions, while something that clears the entire address space could probably be done in 1-2 assembly instructions (mov rcx, -1; rep stosq)

jhallenworld · 11y ago

Support for non-volatile memory needs to be added to Linux. For example, one should be able to map the non-volatile memory into user space and directly access it. There needs to be some BIOS-OS interaction so that the OS doesn't treat the non-volatile memory as general memory (for the likely case where only some of the memory is non-volatile). Alternatively, the non-volatile memory should be usable as a block device.

The non-volatile memory needs a layer of RAID-like volume management. For example, when you transfer the memory from one system to another, there should be a way to determine that the memory is inserted in the correct slots (remember there is RAID like interleaving/striping across memory modules).
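A sketch of both halves of that problem, assuming simple round-robin interleaving with a 4 KiB granule (real memory controllers may use more complex mappings): `locate()` maps a physical address to a module and an offset within it, and `check_slots()` compares a label each module is assumed to carry against the slot it is found in. `ModuleLabel`, its fields, and both functions are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

GRANULE = 4096  # assumed interleave granule in bytes


def locate(addr: int, n_modules: int, granule: int = GRANULE) -> Tuple[int, int]:
    """Round-robin striping: return (module index, byte offset in that module)."""
    chunk, within = divmod(addr, granule)
    row, module = divmod(chunk, n_modules)
    return module, row * granule + within


@dataclass(frozen=True)
class ModuleLabel:
    """Hypothetical label each module is assumed to store about itself."""

    set_id: str    # identifies the interleave set the module belongs to
    position: int  # slot index the module occupied when the set was created
    set_size: int  # number of modules in the set


def check_slots(labels_by_slot: List[ModuleLabel]) -> List[str]:
    """Report anything that would make the striped data unreadable."""
    problems = []
    if len({label.set_id for label in labels_by_slot}) > 1:
        problems.append("modules come from different interleave sets")
    for slot, label in enumerate(labels_by_slot):
        if label.set_size != len(labels_by_slot):
            problems.append(f"slot {slot}: set has {label.set_size} modules")
        if label.position != slot:
            problems.append(f"slot {slot}: holds the module for slot {label.position}")
    return problems


labels = [ModuleLabel("set-A", slot, 4) for slot in range(4)]
labels[0], labels[1] = labels[1], labels[0]  # two DIMMs swapped during the move
print(locate(5 * GRANULE + 10, 4))  # (1, 4106)
print(check_slots(labels))
```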

shaurz · 11y ago

How to solve the context switch overhead issue: https://www.destroyallsoftware.com/talks/the-birth-and-death...

JoeAltmaier · 11y ago

How about a CPU that has scores of hyperthreads? They don't block in the kernel; they stall on a semaphore register bitmask. That mask can include conditions such as a timer register matching another register, an interrupt completing, or an event being signaled.

Now I can do almost all of my I/O, timer and inter-process synchronization without ever entering a kernel or swapping out thread context. I've been waiting for this chip since the Z80.

rbanffy · 11y ago

While not exactly a chip (it never reached board stage) I designed a processor in college where the register file was keyed to a task-id register. This way, context switches could take no longer than an unconditional jump.

I dropped this feature when I switched to a single-task stack-based machine (inspired by my adventures with GraFORTH - thank you, Paul Lutus). This ended up being my graduation project.
