There's a huge shared buffer between two threads: 256 * 4K. One thread speculatively reads a byte of kernel memory, literally any byte it wants, then uses that byte's value as an index to read one of the 256 4K pages in the buffer, pulling that one page into the cache. At some point the CPU determines that the thread shouldn't be permitted to access the kernel memory location and rolls back all of that speculative execution, but the cached page isn't affected by the rollback.
The other thread iterates through those 256 pages, timing how long it takes to read from each one; the single page that Thread A touched will have a noticeably shorter access time because it's already cached. The index of that page is the value of the kernel byte. It now knows one byte of kernel memory that it shouldn't. That's just one byte, but the whole process is so fast that it's easy to just go nuts on the whole kernel address space.
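The probe side boils down to an argmin over 256 timings. A sketch of just that selection step (the function name is mine; the measurement itself would wrap each load in something like rdtsc, which I've left out since it's noisy and hardware-specific):

```c
#include <stdint.h>

/* Given one access time per probe page, the guessed kernel byte is
 * the index of the fastest page, i.e. the one already in the cache. */
static int guess_byte(const uint64_t timings[256]) {
    int best = 0;
    for (int i = 1; i < 256; i++)
        if (timings[i] < timings[best])
            best = i;
    return best;
}
```

In practice you'd compare each timing against a cached/uncached threshold rather than take a plain minimum, to tolerate noise and prefetcher interference.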
So what would the fixes be? Disable speculative execution? Only do it if the target memory location is within userspace, or within the same space as the executing address? Plug all of the sideband information leak mechanisms? I dunno.
The real problem with Meltdown seems to occur when: 1) The offending instruction is NOT really executed because it is in a branch which is not actually taken. 2) The offending instruction is executed but within a transaction, which leads to an exception-free rollback (with leaked information left in cache though).
AFAIK neither is (or can be made) visible to the kernel (which could explain the very large PTI patch), but I do wonder if they are events that can be handled at the microcode level, in which case a microcode update from Intel could mitigate them.
When a load is found to be illegal, an exception flag is set so that if the instruction is retired (ie. the speculated execution is found to be the actual path taken), a page fault exception can be raised. To prevent MELTDOWN, at the same time that the flag is raised you can set the result of the load to zero.
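A toy model of that fix (all names are mine, and the real logic lives in the load unit, not in software): the illegal load still sets the exception flag for retirement, but forwards zero to dependent uops, so any dependent cache access can only ever touch page 0 and carries no information.

```c
#include <stdint.h>

typedef struct {
    uint8_t data;   /* value forwarded to dependent uops          */
    int     fault;  /* raise a page fault if this load retires    */
} load_result;

/* Model of the proposed MELTDOWN fix: a supervisor-only page read
 * from user mode is flagged, and the forwarded result is zeroed. */
static load_result speculative_load(uint8_t memory_value,
                                    int page_is_supervisor,
                                    int cpu_is_supervisor) {
    load_result r = { memory_value, 0 };
    if (page_is_supervisor && !cpu_is_supervisor) {
        r.fault = 1;  /* exception deferred until retirement      */
        r.data  = 0;  /* never forward the real byte speculatively */
    }
    return r;
}
```

If the speculation turns out to be the real path, the fault fires as before; if it's rolled back, nothing secret ever reached the cache-indexing uop.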
SPECTRE is the really hard one to deal with. Part of the solution might be providing a way for software to flush the branch predictor state.
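For reference, the SPECTRE shape that makes it so hard is the bounds-check bypass: code that is perfectly correct architecturally still leaks speculatively once the predictor has been trained to assume the branch is in-bounds (the names here are mine, loosely following the pattern in the paper):

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE 4096
static uint8_t array1[16];
static volatile uint8_t probe2[256 * PAGE];
static size_t array1_size = sizeof array1;

/* Architecturally safe: an out-of-bounds x takes the else path and
 * touches nothing. Speculatively unsafe: a mistrained predictor runs
 * the body anyway, and probe2 gets a cache fill keyed on array1[x],
 * where a malicious x can point array1[x] at any byte in memory. */
static uint8_t victim(size_t x) {
    if (x < array1_size)
        return probe2[array1[x] * PAGE];
    return 0;
}
```

Flushing predictor state at privilege or process boundaries would stop an attacker from training that branch from outside the victim.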
1. Badly written code where bugs are being masked by the handler. 2. Any kind of virtualization?
So, for cloud providers it looks like a 30% performance hit, but for the rest of us I would rather have a patch that stops applications from handling the SEGV trap.
> or within the same space as the executing address
That's probably a good place to start from. I'm guessing there would still be issues here with JITed code coming from an untrusted source.
But no solution leaps to mind for the problem of speculative code leaking things via the cache, short of entirely preventing speculative code from loading things into the cache in the first place. If nobody can come up with a better answer, closing that side channel is going to cost us something. Not sure how much, though, without a really thorough profiling run.
And I'd put my metaphorical $5 down on someone finding another side channel from the speculative code; interactions with changing processor flags in a speculative execution, or interaction with some forgotten feature [2] where the speculation ends up incorrectly undone or something.
Yuck.
[1]: https://blog.booking.com/hardening-perls-hash-function.html
[2]: https://www.youtube.com/watch?v=lR0nh-TdpVg - The Memory Sinkhole - Unleashing An X86 Design Flaw Allowing Universal Privilege Escalation (Dec 29, 2015)
It's going to be really hard to give up the real world gains from branch prediction. Branch prediction can make a lot of real world code (read: "not the finest code in the world") run at reasonable speeds. Another common pattern to give up would be eliding (branch predicting away) nil reference checks.
> short of entirely preventing speculating code from being able to load things into the cache
Some new server processors allow us to partition the cache (to prevent noisy neighbors) [1,2]. I don't have experience working with this technology, but everything I read makes me believe this mechanism can work on a per process basis.
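For what it's worth, on Linux Intel's cache allocation (CAT) is exposed through the resctrl filesystem, and allocation really is per task group: you create a group, give it a mask of cache ways, and assign PIDs to it. Roughly (group name, mask, and PID here are illustrative):

```
mount -t resctrl resctrl /sys/fs/resctrl
mkdir /sys/fs/resctrl/isolated
echo "L3:0=0x00f" > /sys/fs/resctrl/isolated/schemata   # give the group 4 L3 ways
echo 1234 > /sys/fs/resctrl/isolated/tasks              # confine PID 1234 to them
```

Note this partitions capacity, not visibility: it limits eviction-based interference between groups, which is not by itself a complete defense against these attacks.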
If that kind of complexity is already possible in CPU cache hierarchy I wonder if it's possible to implement per process cache encryption. New processors (EPYC) can already use different encryption keys for each VM, so it might be a matter of time till this is extended further.