Even if you hand-wrote some assembly, carefully managing where data is stored and wiping registers after use, you still end up with information leakage. Typically the CPU cache hierarchy is going to end up with some copies of keys and plaintext. You knew that? OK, then did you know that typically a "cache invalidate" operation doesn't actually zero its data SRAMs, and just resets the tag SRAMs? There are instructions on most platforms to read these back (if you're at the right privilege level). Timing attacks are also possible unless you hand-wrote that assembly knowing exactly which platform it's going to run on. Intel et al have a habit of making things like multiply-add take a "fast path" depending on the input values, so you end up leaking the magnitude of inputs.
Leaving aside timing attacks (which are just an algorithm and instruction selection problem), the right solution is isolation. Often people go for physical isolation: hardware security modules (HSMs). A much less expensive solution is sandboxing: stick these functions in their own process, with a thin channel of communication. If you want to blow away all its state, then wipe every page that was allocated to it.
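A minimal sketch of that sandboxing idea (all names are mine, and the XOR "cipher" is a stand-in for real crypto): the key only ever exists in a forked child process, and the parent can only ask it to transform buffers over a pipe. Killing the child blows away every page that held the key.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

/* Hypothetical sketch: the key never exists in the parent's address space.
   The child is the "crypto process"; the pipes are the thin channel.
   XOR stands in for a real cipher purely for illustration. */
#define BUF_LEN 16

static int crypt_in_child(const uint8_t *in, uint8_t *out) {
    int to_child[2], from_child[2];
    if (pipe(to_child) || pipe(from_child)) return -1;
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {                        /* child: owns the secret key */
        close(to_child[1]);
        close(from_child[0]);
        uint8_t key[BUF_LEN], buf[BUF_LEN];
        memset(key, 0x5A, sizeof key);     /* stand-in key material */
        (void)read(to_child[0], buf, sizeof buf);
        for (size_t i = 0; i < sizeof buf; i++) buf[i] ^= key[i];
        (void)write(from_child[1], buf, sizeof buf);
        _exit(0);                          /* all the child's pages die here */
    }
    close(to_child[0]);
    close(from_child[1]);
    (void)write(to_child[1], in, BUF_LEN);
    (void)read(from_child[0], out, BUF_LEN);
    waitpid(pid, NULL, 0);
    return 0;
}
```

Obviously a real version would keep the child alive across requests and speak a real protocol; the point is just that the only way secrets cross the boundary is through that narrow channel.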
Trying to tackle this without platform support is futile. Even if you have language support. I've always frowned at attempts to make userland crypto libraries "cover their tracks" because it's an attempt to protect a process from itself. That engineering effort would have been better spent making some actual, hardware supported separation, such as process isolation.
I wonder if current VM implementations are doing this systematically.
A kernel API to request "secure" memory, with the kernel guaranteeing zeroing, seems like it would be useful. Without this I'm wondering if it's even possible for a process to ensure that physical memory is zeroed, since it can only work with virtual memory.
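Absent such an API, about the best a process can do is keep the secret's pages resident and wipe them itself. A sketch (POSIX; names are mine; this says nothing about cache lines, registers, or DMA): mlock(2) at least stops the pages being swapped out, and a volatile-qualified store loop does the wiping.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

/* Sketch: best-effort "secure" buffer without kernel support.
   mlock keeps the pages resident so the secret never hits swap;
   it does nothing about cache lines, registers, or DMA. */
enum { SECRET_LEN = 64 };

static unsigned char *secret_alloc(void) {
    unsigned char *p = malloc(SECRET_LEN);
    if (!p) return NULL;
    (void)mlock(p, SECRET_LEN);  /* may fail under RLIMIT_MEMLOCK; best-effort */
    return p;
}

static void secret_wipe(unsigned char *p) {
    /* volatile-qualified stores cannot be elided as dead by the compiler */
    volatile unsigned char *vp = p;
    for (int i = 0; i < SECRET_LEN; i++) vp[i] = 0;
}

static void secret_free(unsigned char *p) {
    if (!p) return;
    secret_wipe(p);
    (void)munlock(p, SECRET_LEN);
    free(p);
}
```

This only guarantees something about the virtual pages the process can see, which is exactly the limitation the comment above is pointing at.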
Apologies to everyone suffering Mill fatigue, but we've tried to address this not at a language level but a machine level.
As mitigation, we have a stack whose rubble you cannot browse, and no ... No registers!
But the real strong security comes from the Mill's strong memory protection.
It is cheap and easy to create isolated protection silos - we call them "turfs" - so you can tightly control the access between components. E.g. you can cheaply handle encryption in a turf that has the secrets it needs, whilst handling each client in a dedicated sandbox turf of its own that can only ask the encryption turf to encrypt/decrypt buffers, not access any of that turf's secrets.
More in this talk http://millcomputing.com/docs/security/ and others on same site.
Wow. There's the Wheel of Reincarnation [1] in action. The Intel iAPX 432 microprocessor had similar ideas.[2] E.g. no programmer visible general purpose registers, "capability-based addressing" to control access to memory.
That was a mere 30+ years ago. Let's hope you're more successful than they were.
[1] http://www.catb.org/jargon/html/W/wheel-of-reincarnation.htm... [2] http://en.wikipedia.org/wiki/IAPX432#Object-oriented_memory_...
This processor didn't have general purpose registers, it had "workspaces" in RAM that served as register sets. The processor had a workspace pointer register that pointed to the workspace currently in use. This was cool because it meant that a context switch could be achieved by just changing the workspace pointer. However the downside is that RAM access is slower than register access.
There is no public SDK yet, and hardware is also under development.
We've had a simulator for a long time, and we show it off a bit in the Specification talk:
Alas: microcode, and unreadability, and the difficulty of carrying a provably correct implementation all the way down to bare metal by hand.
The proposed compiler extension, however, makes sense to me. Let's get it added to LLVM & GCC?
In other words, if you write a crypto library in x86 assembler, Intel don't guarantee that they won't introduce a side channel in their next chip model or stepping.
Until then, we do the best we can with turtles all the way down. Software running under that same undocumented pipeline is going to find it very hard to access or leak (accidentally or otherwise) internal registers, at least.
For the other avenue of attack (cold-boot attacks), it's also notable that registers, at least, have extremely fast remanence compared to cache, or DRAM - bit-fade is a very complex process, but broadly speaking, faster memory usually fades faster.
Digression along that vein: I basically pulled off a cold-boot attack on my Atari 520 STe in the early 1990s (due to my wanting to, ahem, 'debug' a pesky piece of software that played shenanigans with the reset vector and debug interrupts), with Servisol freezer spray pointed directly at the SIMMs in my Xtra-RAM Deluxe RAM expansion (and no, cold-boot attacks are not new, GCHQ's known about them for at least 3 decades and change under the peculiarly-descriptive cover name NONSTOP, I believe?). It just seemed sensible to me: cold things move slower, and they had a particularly long (and very pretty) remanence - I was able to get plenty of data intact, including finding where I needed to jump to avoid the offending routine and continue my analysis with a saner technique (i.e. one that didn't make me worry about blowing up the power supply or electrocuting myself)! It's harder these days - faster memory - but the technique incredibly still works and was independently rediscovered as such more recently: very much a "wait, this still works on modern RAM?" moment for me. (By the way, when I accidentally pulled out the SIMMs with the internal RAM disabled - whoops - and rebooted the Atari on my first try, it actually powered up with an effect that I can only describe as "pretty rainbow noise with double-height scrolling bombs" that would not have looked out of place in a demoscreen! I don't know if that was just mine, but... the ROM probably never expected to find RAM not working, and I guess the error-plotting routine had a very pretty and unusual error in that event?)
I've never seen or heard of anyone pulling off a NONSTOP on a register in a CPU, or actually even on an L1, L2 or L3 cache (maybe an L2 or L3 might be possible, depending on design?). They're fast - ns->µs remanence? - and cooling doesn't help much. I don't know if it's possible at all, but I'd tentatively suggest that it might be beyond practical attack - unless the attacker has decapped the processor and it's already in their lab (in which case you're fucked, no matter what!). That's what suggests that approaches like TRESOR (abuse spare CPU debug registers to store encryption key; use that key to encrypt keys in RAM), despite being diabolical hacks, actually work.
If you fancy giving it a try in the wild by the way, I think a Raspberry Pi might be a good modern test subject - the RAM's exposed on top of the SoC, there are no access problems, and it's cheap so if it dies for science, it's not such a problem. (Of course, you'd probably want to change bootcode.bin so that it dumps the RAM after it enables it but before it clears it.) The VideoCore IV is kind of a beast - and is frustratingly close to being able to do ChaCha20 extraordinarily efficiently, if I can just figure out how to access the diagonal vectors... or whether I even can, or whether I can fake it.
As an alternative, maybe write crypto algorithms in LLVM IR?
What I'd really like to see is qhasm put on github along with the syntax files he or others create. q files aren't really useful without the syntax files they were made for, and without a central repo, custom made syntaxes will be a mish-mash of random decisions and instructions.
For the stack, if you can guess how large the function's stack allocation can be (shouldn't be too hard for most functions), you could, after returning from it, call a separate assembly function which allocates a larger stack frame and wipes it (don't forget about the redzone too!). IIRC, openssl tries to do that, using a horrible-looking piece of voodoo code.
For the registers, the same stack-wiping function could also zero all the ones the ABI says a called function can overwrite. The others, if used at all by the cryptographic function, have already been restored before returning to the caller.
Yes, it's not completely portable due to the tiny amount of assembly; but the usefulness of portable code comes not from it being 100% portable, but from reducing the amount of machine- and compiler-specific code to a minimum. Write one stack- and register-wipe function in assembly, one "memset and I mean it" function using either inline assembly or a separate assembly file, and the rest of your code doesn't have to change at all when porting to a new system.
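One way to write that "memset and I mean it" function without any assembly at all is a volatile function pointer (a sketch; the names are mine):

```c
#include <assert.h>
#include <string.h>

/* The compiler cannot prove what a volatile function pointer points to at
   the moment of the call, so it cannot treat the call as a dead store and
   elide it the way it can a plain memset before a free/return. */
static void *(*const volatile memset_ptr)(void *, int, size_t) = memset;

static void secure_memzero(void *p, size_t n) {
    memset_ptr(p, 0, n);
}
```

Same idea as the inline-asm barrier discussed elsewhere in the thread, just expressed in (almost) portable C; the cost is an indirect call per wipe.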
So stop doing that. Have a low-level system service (e.g., a hypervisor with well-defined isolation) do your crypto operations. Physically isolate the machines that need to do this, and carefully control their communication to other machines (PCI requires this for credit card processing, btw). Do end-to-end encryption of things like card numbers, at the point of entry by the user, and use short lifetime keys in environments you don't control very well.
The problem is much, much wider than a compiler extension.
So how do you get that isolated box onto everyone's computer and phone? How do you move these users' sensitive information onto that isolated box without leaving a trace on their non-isolated computer? How do you move their keys around?
When you use two systems to process sensitive information, you have at least two problems to solve...
And sometimes they don't worry enough either. They ought to fail a FIPS-style audit for that. But, well... they ought not to contain proprietary LFSR "crypto" algorithms, either. They are not as well audited, or as publicly designed, as they ought to be: many are as black-box closed-source as they could possibly be.
They tend to be based on extraordinarily old architectures with new bits glued on - think Intel 8051, that kind of era. If you're really lucky you might get an ARM, or at least a Thumb. People making them are notoriously hyper-conservative (most don't support ECC yet, and many don't even go above RSA-2048 or SHA-1 without going to firmware), and minimise any changes, perhaps for cost reasons, the effects of which are not always positive (actually, CFRG are discussing that general area right now in the context of side-channel defences for elliptic-curve crypto).
So, how would you think that environment translates to writing secure firmware, or designing secure, state-of-the-art hardware? ;-)
This is incorrect. The AES key schedule is bijective, which makes recovering the last round key as dangerous as recovering the first.
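To illustrate (a sketch of my own: AES-128, FIPS-197 key schedule, S-box computed on the fly): expand a key, throw everything away except the last round key, then walk the same recurrence backwards to recover the original key.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint8_t SBOX[256];

static uint8_t xtime(uint8_t a) {
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
}

static void init_sbox(void) {
    uint8_t expt[256], logt[256];
    uint8_t x = 1;
    for (int i = 0; i < 255; i++) {        /* log/antilog tables, generator 0x03 */
        expt[i] = x;
        logt[x] = (uint8_t)i;
        x ^= xtime(x);
    }
    for (int b = 0; b < 256; b++) {
        uint8_t inv = b ? expt[(255 - logt[b]) % 255] : 0;  /* GF(2^8) inverse */
        uint8_t s = 0x63;                  /* affine transform: b ^ rotl1..4 ^ 0x63 */
        for (int i = 0; i < 5; i++)
            s ^= (uint8_t)((inv << i) | (inv >> (8 - i)));
        SBOX[b] = s;
    }
}

static const uint8_t RCON[10] = {1, 2, 4, 8, 16, 32, 64, 128, 0x1B, 0x36};

static uint32_t subrot(uint32_t w, uint8_t rcon) {
    w = (w << 8) | (w >> 24);              /* RotWord */
    uint32_t r = 0;
    for (int i = 0; i < 4; i++)            /* SubWord */
        r |= (uint32_t)SBOX[(w >> (24 - 8 * i)) & 0xFF] << (24 - 8 * i);
    return r ^ ((uint32_t)rcon << 24);     /* Rcon into the top byte */
}

/* Expand an example key, keep only W[40..43] (the last round key), invert
   the schedule; returns 1 iff the recovered key matches the original. */
static int last_round_key_recovers_first(void) {
    init_sbox();
    uint32_t W[44], R[44];
    for (int i = 0; i < 4; i++)            /* example key 00 01 02 ... 0f */
        W[i] = (uint32_t)(4u * i) << 24 | (4u * i + 1) << 16
             | (4u * i + 2) << 8 | (4u * i + 3);
    for (int j = 4; j < 44; j++) {
        uint32_t t = W[j - 1];
        if (j % 4 == 0) t = subrot(t, RCON[j / 4 - 1]);
        W[j] = W[j - 4] ^ t;
    }
    memcpy(&R[40], &W[40], 4 * sizeof(uint32_t));
    for (int j = 43; j >= 4; j--) {        /* the recurrence runs backwards too */
        uint32_t t = R[j - 1];
        if (j % 4 == 0) t = subrot(t, RCON[j / 4 - 1]);
        R[j - 4] = R[j] ^ t;
    }
    return memcmp(R, W, 4 * sizeof(uint32_t)) == 0;
}
```

The backward loop is literally the forward recurrence solved for W[j-4], which is why zeroing only the original key buffer while leaving any round key behind buys you nothing.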
Afaict there's no generic solution to these problems. 99.9% of what these code paths handle is just non-sensitive, so applying some kind of "secure tag" to them is just unworkable, and they're easily used without knowing it... it only takes one ancillary library to touch your data.
Similarly, if you encrypt all of your information from within a safe library before handing it out to unsafe libraries, they can't leak anything. This can add overhead and redundant encryption (and you still need to trust that the remote server processing your data is safe), but there are steps you can take to be safer.
I don't know enough of modern hardware, but on CPUs with register renaming, is that even possible from assembly?
I am thinking of the case where the CPU, instead of clearing register X in process P, renames another register to X and clears it.
After that, program Q might get back the old value of register X in program P by XOR-ing another register with some value (or just by reading it, but that might be a different case (I know little of hardware specifics)), if the CPU decides to reuse the bits used to store the value of register X in P.
Even if that isn't the case, clearing registers still is fairly difficult in multi-core systems. A thread might move between CPUs between the time it writes X and the time it clears it. That is less risky, as the context switch will overwrite most state, but, for example, floating point register state may not be restored if a process hasn't used floating point instructions yet.
Computer programs of all kinds are being executed on top of increasingly complicated abstractions. E.g., once upon a time, memory was memory; today it is an abstraction. The proposed attribute seems workable if you compile and execute a C program in the "normal" way. But what if, say, you compile C into asm.js?
Saying, "So don't do that" doesn't cut it. In not too many years I might compile my OS and run the result on some cloud instance sitting on top of who-knows-what abstraction written in who-knows-what language. Then someone downloads a carefully constructed security-related program and runs it on that OS. And this proposed ironclad security attribute becomes meaningless.
So I'm thinking we need to do better. But I don't know how that might happen.
I think we need, right at the base metal, a way of saying "this data needs to not be copied" and/or "if you do copy it you must remember all copy locations so we can sanitize them all." And then we require every abstraction on up to have a way of maintaining this, the same way all the abstractions are required to, say, let us read data.
Or I guess this is part of what HSMs are supposed to do -- do all your "secure" work in something that is very strictly controlled.
The point is, if you want security you need to look at the whole system and in the situation you describe you can't guarantee it, no.
I'm not going to say "So don't do that", but I am going to say "If you're going to do things like that, please realise that the assumptions the system security was built on no longer hold true".
I think to do it better we just need to pay a bit more attention. And try not to let ourselves get into situations (cough heartbleed cough) where memory zeroing is actually an important feature. I.e., by the time the attacker is able to read your process memory, you're probably already screwed.
Now, if your decryption hardware was an actual separate box, where the user inserts their keys via some mechanism and you can't run any software on it, but simply say "please decrypt this data with key X", then we'd be on to something. It could be just a small SoC which plugs into your USB port.
Or you could have a special crypto machine kept completely unconnected to anything, in a Faraday cage. You take the encrypted data, you enter your key in the machine, you enter the data and you copy the decrypted data back. No chance of keys leaking in any way.
These are dedicated boxes that just do crypto. You keep them on the network or attached via a serial port or... whatever. Accessible to your machines but not the outside world. Then you send them messages to ask them to encrypt and decrypt data for you. That way the keys never leave the box. The HSM doesn't accept new software, nor does it ever expose the keys to anyone.
They are, however, quite expensive.
Assembly is the simplest language you can write a computer program in, for a certain very textbook definition of "simple" - it's just that you actually have to do everything by hand that you normally wouldn't. And yes, that can be a pain in the ass, and yes, you do have to watch out for not seeing the wood for the trees - but one thing it most definitely is, is auditable.
Bearing in mind, say, the utter briar-patch that is OpenSSL: a crufty intractably complex library written in a high-level language with myriad compiler bug workarounds, compatibility kludges and where - despite it being open source, and "many eyes making bugs shallow" - few eyes ever actually looked, or saw, or wanted to see, and when attention was finally paid to it, it was found wanting... might not assembly be perhaps better for a compact, high-assurance crypto library? Radical, I know, but perhaps an approach that's worthy of consideration.
I understand you may well be more familiar with high-level languages, and I don't know if you're confident about your ability to audit that - but I must point out, if you're auditing it from source, you're trusting the compiler to faithfully translate it. So to actually audit the code, you need to include the compiler in that audit. Compilers have (lots of) bugs and oversights too (lots of OpenSSL cruft is compiler bug workarounds, it seems?): as the article points out, existing compilers just weren't really designed to accommodate writing secure code.
Meanwhile an assembler makes a direct translation from source assembly to object machine code - that is deterministic (a perniciously-hard process with compilers) and much more easily, and automatically, auditable and indeed directly reversible.
To be clear, I'm not suggesting we replace, say, libsodium with something written in assembly language tomorrow! There are good high-level language implementations. And inline assembly is already used in some places for certain functions, including this exact one (zeroing memory), to try to minimise the compiler second-guessing us. But as the article points out, that approach only takes us so far, and it's something we need to be guarded against when trying to write secure code.
Botan, for example, has something called a "SecureVector" which I have never actually verified as being secure, but it's the same idea.
#include <string.h>
void bar(void *s, size_t count)
{
memset(s, 0, count);
__asm__ ("" : "=r" (s) : "0" (s)); /* pretend to "use" s, so the memset can't be optimised away */
}
int main(void)
{
char foo[128];
bar(foo, sizeof(foo));
return 0;
}
gcc -O2 -o foo foo.c -g
gdb ./foo
...
(gdb) disassemble main
Dump of assembler code for function main:
0x00000000004003d0 <+0>: sub $0x88,%rsp
0x00000000004003d7 <+7>: mov $0x80,%esi
0x00000000004003dc <+12>: mov %rsp,%rdi
0x00000000004003df <+15>: callq 0x400500 <bar>
0x00000000004003e4 <+20>: xor %eax,%eax
0x00000000004003e6 <+22>: add $0x88,%rsp
0x00000000004003ed <+29>: retq
End of assembler dump.
(gdb) disassemble bar
Dump of assembler code for function bar:
0x0000000000400500 <+0>: sub $0x8,%rsp
0x0000000000400504 <+4>: mov %rsi,%rdx
0x0000000000400507 <+7>: xor %esi,%esi
0x0000000000400509 <+9>: callq 0x4003b0 <memset@plt>
0x000000000040050e <+14>: add $0x8,%rsp
0x0000000000400512 <+18>: retq
End of assembler dump.

However, they could indeed circumvent the permission system by figuring out what sensitive data your program left behind in uninitialized memory and in CPU registers.
Not leaving traces behind then becomes a serious issue. Could the kernel be tasked with clearing registers and clearing re-assigned memory before giving these resources to another program? The kernel knows exactly when it is doing that, no?
It would be a better solution than trying to fix all possible compilers and scripting engines in use. Fixing these tools smells like picking the wrong level to solve this problem ...
Malicious programs running with your program's privileges are a different scenario altogether, and usually they can do a lot of damage. Want sensitive information out of another process? Try gdb.
But yes, it is trivial for the kernel to zero a page before handing it out.
Also, wouldn't a wrapper function that performs the AES decryption and then manually zeroes the registers be a good enough work around?
Yes, you probably ought to be clearing xmm* registers touched by it, and that would I hope be good enough.
The point in the article is completely valid, though: compiled code very seldom touches the xmm* registers, so if you don't wipe them - is wiping them currently common practice? I haven't checked, but that feels like something that needs checking! - whatever's in there hangs around, and you might leak it.
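A sketch of what clearing them might look like (GCC/Clang inline asm, x86-64 only, compiled to a no-op elsewhere; the function names are mine, and I haven't surveyed what real libraries actually do):

```c
#include <stdint.h>

/* Sketch: explicitly zero xmm0-xmm15 after a crypto call.
   x86-64 with GCC/Clang inline asm only; does nothing on other targets. */
static void wipe_xmm(void) {
#if defined(__x86_64__)
    __asm__ volatile (
        "pxor %%xmm0,%%xmm0\n\t"   "pxor %%xmm1,%%xmm1\n\t"
        "pxor %%xmm2,%%xmm2\n\t"   "pxor %%xmm3,%%xmm3\n\t"
        "pxor %%xmm4,%%xmm4\n\t"   "pxor %%xmm5,%%xmm5\n\t"
        "pxor %%xmm6,%%xmm6\n\t"   "pxor %%xmm7,%%xmm7\n\t"
        "pxor %%xmm8,%%xmm8\n\t"   "pxor %%xmm9,%%xmm9\n\t"
        "pxor %%xmm10,%%xmm10\n\t" "pxor %%xmm11,%%xmm11\n\t"
        "pxor %%xmm12,%%xmm12\n\t" "pxor %%xmm13,%%xmm13\n\t"
        "pxor %%xmm14,%%xmm14\n\t" "pxor %%xmm15,%%xmm15"
        : : : "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7",
              "xmm8", "xmm9", "xmm10", "xmm11", "xmm12", "xmm13", "xmm14", "xmm15");
#endif
}

/* Plant a value in xmm0, wipe, and read it back; returns 0 iff the wipe
   worked (trivially 0 on non-x86-64 targets). */
static uint64_t xmm0_after_wipe(void) {
#if defined(__x86_64__)
    uint64_t v = 0xDEADBEEFCAFEBABEULL, out;
    __asm__ volatile ("movq %0, %%xmm0" : : "r"(v) : "xmm0");
    wipe_xmm();
    __asm__ volatile ("movq %%xmm0, %0" : "=r"(out));
    return out;
#else
    return 0;
#endif
}
```

The clobber list matters: without it, the compiler is free to keep its own live values in those registers across the asm statement.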
I googled pretty hard for real-life examples of a timing attack being used, and of stale data in registers being used, but couldn't find anything. Does anyone know of examples of this actually being done?
These types of attacks might also become more of a problem as more sensitive computation is done on shared machines (i.e. cloud compute).
So, while there's no reason to panic just because these security features are hardly implemented anywhere, you can't let the issues sit unaddressed for long periods of time.
So, I've seen a lot of (conceptually) trivial exploits and combinations of trivial exploits, but I would love to see a real world example of someone collecting enough information from a 'bad RNG', registers, or timing, to do anything with it.
Some of those are fixed now, but the history stealing link redrawing one is still an issue as far as I know (or at least, this bug is still open https://bugzilla.mozilla.org/show_bug.cgi?id=884270 ).
"Remote Timing Attacks are Practical" https://crypto.stanford.edu/~dabo/papers/ssl-timing.pdf
"RSA Key Extraction via Low-Bandwidth Acoustic Cryptanalysis" http://www.tau.ac.il/~tromer/acoustic/
Because the compiler is perfectly within its rights to optimize that out!
The proposal seems good.
It would be consequential (but utterly impractical) to add another C-level primitive to prevent OS-level task suspension during critical code paths. Good luck getting that into a kernel without opening a huge DoS surface :)
If someone has the root privs to peek at your memory, they can also stop your process at any time and examine all the registers, whether they were swapped out to disk or not.
Moving the crypto code into the kernel and running with disabled interrupts doesn't help because the attacker is already assumed to have super-user privileges (they can peek at arbitrary RAM, after all). There are also non-maskable interrupts.
You basically cannot hide the machine state from someone who controls the machine: not without splitting the machine itself into additional privilege levels, such that there is a security level that is not accessible even to the OS kernel. The sensitive crypto routines run in that level. The manufacturer of the SoC provides these as firmware, and the regular kernel has no visibility to the internals.
ARM has a security model that supports this.
There is also something even more paranoid called TrustZone: http://en.wikipedia.org/wiki/ARM_architecture#Security_exten...
1) Zero the buffer.
2) Check that the buffer is completely zeroed.
3) If you found any non-zeros in the buffer, return an error.
Is the compiler still allowed to optimize away the zeroing in this case?
Yes, completely. In the snippet below, the compiler is allowed to eliminate all code after “leave secrets in array c”.
{
char c[2];
... /* leave secrets in array c */
memset(c, 0, 2);
c[0] = 0;
c[1] = 0;
memset(c, 0, 2);
if (c[0] || c[1]) exit(1);
}
The compiler is also allowed to compile the last three statements below as if they were “return 0;”:
{
char c[2];
... /* leave secrets in array c */
c[0] = 0;
c[1] = 0;
return c[0] + c[1];
}

gcc 4.4.5 doesn't, though (-O3): it still clears the stack once and performs the comparison.
I believe these optimizations can be defeated by declaring a global
volatile char fill = 0;
and using that instead of 0 in memset().

With 'volatile', generally not, modulo bugs. Without volatile, it would never return an error.
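A sketch of the volatile-fill trick described above (names are mine): because `fill` is volatile, the compiler must actually load it at runtime, so it can't prove the buffer will be all zeros and can't fold away either the memset or the check.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The compiler cannot assume fill is still 0 at the call site, so the
   memset and the subsequent comparison both survive optimisation. */
static volatile char fill = 0;

static int wipe_and_check(char *buf, size_t n) {
    memset(buf, fill, n);
    for (size_t i = 0; i < n; i++)
        if (buf[i]) return 1;   /* error: wipe did not stick */
    return 0;
}
```

The cost is one extra volatile load per wipe, which is about as cheap as these workarounds get.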
I am still uncertain why people want to just 'zero' it. Filling with random data (just one random() call), then using an inline PRNG, summing the result, and storing it globally in a volatile would reliably 'zero' the data, but it's quite CPU-intensive.
User logs in by sending password. System transitions to authorized state. System wants to wipe password to avoid later leak. If you reboot at this point, the user will no longer be authorized.
What about Rust?