Do other ISAs have anything similar?
Perhaps this indicates a missing concept in how we compile and execute code: metadata demanding that a section not be optimized.
It doesn't carry any security guarantees (Rust's documentation on their wrapper for it explicitly strikes out "cryptographic or security" purposes) but one could imagine a similar intrinsic with stronger abilities.
> 1. Compilers are applying optimization techniques that are heuristically good for performance on general software, but happen to leak information through timing-based sidechannels when used on secret data. Since general-purpose CPUs and compilers are applied to a wide variety of tasks, compilers have no reason to stop doing such optimizations.
> 2. Constant-time coding techniques mostly aim at fooling the compiler, to prevent it from applying these optimizations. Since compilers keep getting smarter, these techniques lose their efficiency.
> 3. Just-in-time (JIT) compilers have access to runtime data; they can, and will, use such information to perform extra optimizations that static compilers cannot do, and further destroy the developer’s attempt at achieving constant-time code.
> 4. JIT compilation is becoming pervasive and is now employed in various contexts, in particular inside CPUs. Such compilers can do the same optimizations as “external” JIT compilers, but are not publicly documented, and the output of their “machine” code translation cannot be inspected to detect deviations from constant-time execution.
> 5. Modern hardware and software stacks use a very layered structure, with each layer striving to hide its internal functioning details from upper layers. The socio-economic context which allowed such hardware to exist inherently relies on such abstraction (under the name of “industrial secrecy”). An effective constant-time coding process would require a model of computation with strong guarantees on timing-related characteristics, that would be maintained through all these layers, down to the semiconductors that implement the logic gates. This industry-wide vertical cooperation is unlikely to happen.
I imagine it’s not easy, because it requires every layer to cooperate, but if there’s demand I think it would be done.
IPv6 is in a similar situation in that it requires widespread adoption, but I also suspect most people have issues with it and there isn’t much demand, because NAT and whatever support exists for cellular networks have made it largely unnecessary.
Compilers like Clang actually generate terrible code on purpose: a sufficiently smart optimizer (of which LLVM is one) is expected to clean it up anyway, so Clang makes no attempt to emit good code itself. Rust is similar. For example, a simple for-loop's induction variable is stored to and loaded from an alloca (i.e. a stack slot) on every use rather than being an SSA variable, so one of the first passes in the optimization pipeline (mem2reg/SROA) promotes those allocas to SSA registers. Disabling just that would cost a ton of perf, never mind the knock-on impact on instruction combining, value tracking, and scalar evolution; and crypto is pretty perf-sensitive, right after security.
BTW, Clang/LLVM already has such a function-level attribute, `optnone`, which was actually added to support LTO. But it's all or nothing; LLVM IR/Clang doesn't carry the information needed to know which instructions are timing-sensitive.
Note that memset_s() is an existing special case already. https://en.cppreference.com/w/c/string/byte/memset
> Modern hardware and software stacks use a very layered structure, with each layer striving to hide its internal functioning details from upper layers. The socio-economic context which allowed such hardware to exist inherently relies on such abstraction (under the name of “industrial secrecy”). An effective constant-time coding process would require a model of computation with strong guarantees on timing-related characteristics, that would be maintained through all these layers, down to the semiconductors that implement the logic gates. This industry-wide vertical cooperation is unlikely to happen
It's not mentioned in the paper, but hardware description languages provide strong guarantees on timing-related characteristics on the semiconductor level. Hardware companies are willing to provide user access to FPGAs, as evidenced by the userspace FPGA subsystem in the Linux kernel as well.[1] You can buy Versal SOCs right now that combine ARM CPUs with FPGAs on one die, run Linux on it, and outsource your crypto tasks to the FPGA.
I see a future where we give up on constant-time coding for CPUs and move crypto libraries to FPGAs, given the hundreds of billions of dollars in our economy reliant on strong encryption. That'd be a world in which every corporate laptop is required to have an FPGA for compliance purposes.
It'd be interesting to know if anyone has tried this.
[1]https://www.phoronix.com/news/FPGA-User-Space-Interface-AMD
Many high-frequency trading companies use FPGAs over ASICs for similar reasons. FPGAs are more expensive but give them full control over the implemented algorithms and don't require handing secret information to a foundry.
In other words, eliminate the impedance mismatch between responsibility and control for developing a secure system.
It'll be cheaper to implement cryptography on an ASIC. But the author of this paper wants to control every single aspect of the encryption/decryption process. Said developer can confidently say the system is secure. You can't say you've delivered a secure system if you're getting your ASICs from another company that doesn't provide implementation details because it'd (justifiably) give others a competitive advantage.
Question I'd have is whether the cost difference between ASICs/FPGAs is worth it for the majority of applications. $1 or $10 on every CPU might mean a world in which every laptop has an FPGA, but $100? What about for server-side applications? Would a hyperscaler spend $1000 extra on every rack if it allowed guaranteed constant-time encryption?
Meanwhile Windows is requiring that every laptop have a small crypto processor for its own compliance processes (i.e. bitlocker).
In terms of moving the data over to the FPGA, I have no idea. But if it's all on the same die, it doesn't seem like there's a big physical attack surface that's different than just doing stuff on the CPU.
Abstractions cannot send messages before they receive them. So any delay you add at the top must be magnified as you go down. The exception to this is if the contents of the message require different behaviors for different payloads, in which case they are correct. But encrypted payloads are opaque to the layers they are sent through. You can observe who the message was sent to, and know the maximum amount of data the conversation might have contained. But not a very clear idea of how long it took to build the message, unless you’ve already compromised the machine.
So practice your martial arts on the bow of a rowboat. Balance, Danielsan.
The whole timing thing is essentially an issue of getting off-balance as in martial arts and the solution is that there are actions you simply never do, and ones that look like they lack economy of motion because so much of the work is in keeping your center of gravity from shifting outside of your base. Tai chi is infinitely better at protecting you from concussing yourself on black ice than it is at protecting you from a mugger.
So if you have a conditional move for one calculation, then move a to b if !c and move it to d if c. And when the compilers chase that down in another generation, or someone notices that the L1 cache timing is different if d gets cold, then use both calculations in the output. Calculate complements where the runtime of a + !a = 1 to several decimal points.
Will it work for eternity? Unlikely. Will it work for a decade? Probably.
ChaCha20 or Kyber might be using that, although I don’t know if any cryptographic algorithm depends on vector-only calculation as a basis for its strength.
Alternatively, I suppose that specialised cryptographic CPUs could be made to follow strict standards ensuring constant-time execution.
Is that actually guaranteed? Having to do the whole vector doesn't necessarily mean the time can't vary for different values for complex operations. (like division)
I can't find the details, but https://gleissen.github.io/papers/iodine.pdf at least mentions "Dually, if timing variability is unavoidable, e.g., in SIMD or floating-point units, making this variability explicit can better inform..." so it sounds like simd is at least situation specific here.
So I would think if you treat all of the bytes as one SIMD instruction, do a SIMD compare, and then fire another operation as a result… if you could guarantee that that follow-up operation would have to fire for at least one of the vector values every time, then I think you’d have your solution.
The mention of CMOV is good. Additional tricks are available through vectorization - necessarily all elements of the vector have to execute at the same speed, which gives you extra options to put crafted data in a different column to force computation to proceed at a known rate.
We should not forget that some CPUs offer dedicated instructions - e.g. Intel AES-NI.
Another solution is to just stop trying to do it on the general purpose processor altogether and move it to some sort of HSM or TPM. Put your eggs in a smaller, more defensible basket. In a world of GPUs and AI accelerators it's easier to ask for a crypto accelerator (or rather decelerator!).
Is there any good work on constant-time crypto on the GPU?
If anyone manages to make homomorphic encryption viable, that provides another solution, at a huge performance penalty.
AES-NI, the SHA instructions, and constant-time subsets of instructions are generally good enough that you can do this in assembly.
1. Clock how long the operation takes on any type of data. By operation, it would be a function call maybe for a whole block or buffer.
2. Add a small margin of time to that.
3. Modify the function call to make sure it has waited that long before returning a response. That is true whether it finished early or not.
The result prevents the function call from leaking timing information.
High-assurance, security certification in the 1980's also required mitigating covert, storage channels. That would require, at a minimum, overwriting any memory used during the call and all the CPU registers before returning to the caller.
https://security.stackexchange.com/a/96493/271113
For example, consider the case of a cache-timing leak (rather than the classical "string compare" one). You'd have to have a lot of control over the behavior of your operating environment to guarantee it doesn't leak bits of your secret from cache misses.
When you add a delay to your processing, unless the same tasks are being executed, I would expect power-analysis leakage and random jitter from the kernel's scheduler to reveal that fact. It might be a more costly analysis to detect, but it's still there.
Generally, I think this is a doomed line of reasoning.
The previous comment was suggesting making sure that every code path takes the same amount of time, not adding a random delay (which doesn't work).
And while I agree that power-analysis attacks etc. are still going to apply, the over-arching context here is just timing analysis.
Most of the points brought up in the article have been happening for a decade or more already; it's not something that 'will soon' happen.
I wonder if they’re fast enough…
That doesn't solve the hardware-JIT-level problem, but it would appear to solve the compiler-level problem.
ETA: I only read up to p. 3, but I did search for the text "volatile" and didn't find it.
This would be something like a 'secret' keyword that would qualify integer types; i.e. in C you could have a variable of type 'secret unsigned int' and the compiler would reject optimizations that would reveal timing information about the variable's contents, while optimizing the rest of the program's non-secret variables. Values can be promoted to secret but not demoted. Code that intends to reveal secrets (e.g. to write them out to secure storage) must explicitly cast away the secretness.
AFAIK Golang has crypto/subtle, but that's more special functions to do certain things in a data-independent way and not pervasive language support for it (and, presumably, they have to keep updating the compiler to not optimize that specific module).
My guess is that bitslicing only gets you so far.