I've been on the team working on this project over the past ~2 years. AMA!
Here is the GitHub repository: https://github.com/facebookexperimental/hermit
- How does this compare with rr?
- Out of curiosity, have you ever looked at libTAS? (I realize it has a very different intended use case.)
- Have you had any issues with behavior differences between CPUs? I know there is architecture-level undefined behavior that can differ between CPUs; on the other hand, it sounds like you’re primarily interested in running well-behaved executables that wouldn’t intentionally invoke such behavior.
- I haven't seen libTAS until now, so I can't comment on it much. At first glance, it does look similar!
- See @rrnewton's reply on this one.
We're big fans of rr!
Hermit is different in that it creates a deterministic OS semantics, rather than recording whatever nondeterministic behavior occurs under normal Linux. BUT, there's a lot of overlap. And indeed `hermit record` is straight up RnR (record & replay).
But hermit's RnR is not nearly as developed as rr's. We integrate with gdb/lldb as an (RSP) debugger backend, just like rr. You can attach a debugger to any failing execution you create with hermit. But our support is very preliminary, and you'll probably find rough edges. Also, we don't support backwards stepping yet (except by running again).
If we invest more in using Hermit as a debugger (rather than for finding and analyzing concurrency bugs), then there should be some advantages over traditional RnR. These would relate to the fact that deterministically executing is different than recording. For example, process and thread IDs and memory addresses all stay the same across multiple runs of the program, even as you begin adding printfs and modifying the program to fix the bug. With traditional RnR, you can play the same recording as many times as you like, but as soon as you take a second recording all bets are off wrt what is the same or different compared to the prior recording. (That includes losing the "mental state" of things like tids & memory addresses, which is a good point Robert O'Callahan makes about the benefits of RnR when accessing the same recording multiple times.)
(2) libTAS - no we haven't! Checking it out now.
(3) Yes, definitely issues with CPU portability.
In general, we are interested in not just determinism on the same machine, but portability between machines in our fleet. As with any tech company that uses the cloud, at Meta people are usually trying to debug an issue on a different machine than where the problem occurred. I.e. taking a crash from a production or CI machine to a local dev machine.
The way we do this is mostly by reporting a fairly old CPU to the guest, which disables certain features IF the guest is well behaved.
With the current processor tech, I don't think there's any way we can stop an adversarial program which, for example, would execute CPUID, find that RDRAND is not supported on the processor, but then execute RDRAND anyway. We could build a much more invasive binary-instrumentation based emulator that would be able to enforce these kinds of rules at instruction granularity, but it would have higher overhead, especially startup overhead. The nice thing about Reverie, though, is that we (or others) can add different instrumentation backends while keeping the same instrumentation API. So we could have a "hardened" backend that was more about sandboxing and reverse-engineering adversarial software, making a different tradeoff with respect to performance overhead.
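To make that adversarial case concrete, here's a tiny illustration (my own sketch, not anything Hermit ships; x86_64 Linux assumed): a program that asks CPUID whether RDRAND exists and then executes RDRAND regardless of the answer. Without instruction-level instrumentation there is nothing for a ptrace-based sandbox to trap on the second step.

    // Illustration only (not Hermit code; x86_64 assumed): check CPUID,
    // then execute RDRAND anyway, as an adversarial guest might.
    use std::arch::x86_64::__cpuid;

    fn main() {
        // CPUID leaf 1, ECX bit 30 advertises RDRAND support.
        let ecx = unsafe { __cpuid(1) }.ecx;
        println!("CPUID says RDRAND supported: {}", (ecx >> 30) & 1 == 1);

        // A well-behaved program stops here if the feature is absent.
        // An adversarial one issues the instruction regardless, and the
        // resulting bytes are nondeterminism the sandbox never sees.
        let mut val: u64 = 0;
        unsafe { std::arch::asm!("rdrand {}", out(reg) val) };
        println!("nondeterministic value: {val:#x}");
    }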
- https://github.com/dettrace/dettrace
And one cool part of it is this Rust program instrumentation layer:
- https://github.com/facebookexperimental/reverie
It's good for building OS-emulator style projects or tracing tools.
Does Hermit support running different processes in parallel in certain situations?
It seems like this should be possible to do as long as they are appropriately isolated from each other.
Specifically, if two processes don't share any writable memory mappings then I'm thinking it might be possible for Hermit to exploit the underlying existing isolation between them so that they could run code in-between system calls in parallel (while still making sure that the actual system calls happen in a deterministic order).
Perhaps it would even be possible for the two processes to execute system calls in parallel if they are unrelated and they do not affect deterministic execution (such as if they are doing I/O to different files or directories, or one process doing I/O while another is doing something completely unrelated like getting the time, etc).
Although I guess this latter point (of executing syscalls in parallel) would require doing rewind/replay because it wouldn't be possible to know in advance whether the system calls are going to be related or not, as the two processes might not do the syscalls at exactly the same time.
Is Hermit doing this (executing code in-between syscalls in parallel, at least)? Or do you think it would be possible to do it?
This could significantly increase the usefulness of Hermit for distros that want to do deterministic builds of packages, as most compilation could happen in parallel, i.e. use several cores at the same time!
I'm thinking about huge package builds such as Chromium, Firefox, the Linux kernel, etc. which would perhaps take days to build if they were completely serialized into a single CPU core.
Our earlier (dettrace) prototype allowed syscall-free regions in separate processes to run physically in parallel. Hermit hasn't added any process parallelism yet, but it's designed to actually go further than dettrace in this respect.
Specifically, Hermit is architected so that the thread-local syscall handler "checks out" resources from the central scheduler. Resources include things like paths on the file system, contents of files, shared memory, and permission to perform external side effects. Right now, all requests wait for the scheduler to background all other threads and make the current thread the only runnable one. But the idea is: in the future we will keep the same semantically identical log of linear "commits" (linearization), but will simply background the current thread while it uses the resources it checked out, move forward to the next scheduler iteration, and only block the next runnable thread until its requested resources are freed, not until ALL other threads have finished their timeslice and gone back to waiting on the scheduler.
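Here's a toy sketch of that check-out idea (my own illustration, not Hermit's actual types or API): handlers declare the resources they need, the scheduler keeps a deterministic commit log, and requests whose resource sets are disjoint are exactly the ones that could safely overlap in the future design.

    // Toy sketch (not Hermit's API) of "checking out" resources from a
    // central scheduler with a deterministic linearization of commits.
    use std::collections::{BTreeSet, HashSet};

    #[derive(Clone, Hash, PartialEq, Eq, PartialOrd, Ord, Debug)]
    enum Resource {
        Path(String),       // a path in the file system
        FileContents(u64),  // contents of a (virtual) inode
        SharedMem(u64),     // a shared memory mapping
        ExternalSideEffect, // permission to perform an external side effect
    }

    #[derive(Default)]
    struct Scheduler {
        in_use: HashSet<Resource>,
        commit_log: Vec<(u64, BTreeSet<Resource>)>, // deterministic log of commits
        turn: u64,
    }

    impl Scheduler {
        /// Grant the request only if every wanted resource is free, and
        /// record a commit. Today Hermit runs one thread at a time, which
        /// trivially satisfies this; the sketch shows where finer-grained
        /// parallelism could slot in.
        fn check_out(&mut self, tid: u64, wanted: BTreeSet<Resource>) -> bool {
            if wanted.iter().any(|r| self.in_use.contains(r)) {
                return false; // caller stays parked until a later turn
            }
            self.in_use.extend(wanted.iter().cloned());
            self.turn += 1;
            self.commit_log.push((tid, wanted));
            true
        }

        fn check_in(&mut self, resources: &BTreeSet<Resource>) {
            for r in resources {
                self.in_use.remove(r);
            }
        }
    }

    fn main() {
        let mut sched = Scheduler::default();
        let a: BTreeSet<_> = [Resource::Path("/tmp/a".into())].into();
        let b: BTreeSet<_> = [Resource::Path("/tmp/b".into())].into();
        assert!(sched.check_out(1, a.clone())); // disjoint resource sets:
        assert!(sched.check_out(2, b));         // both could proceed in parallel
        sched.check_in(&a);
        println!("commits: {:?}", sched.commit_log);
    }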
I was going to ask if it could do the flip side - instead of stabilizing the scheduler, make it less predictable.
AFAICT, it can! Awesome, looking forward to giving it a try.
hermit run --chaos --seed-from=SystemRandom ./target/debug/hello_race
That concurrency testing capability is a pretty well-studied area and we implement a couple of existing algorithms. The first is our adaptation of the PCT algorithm (ASPLOS'10 https://www.microsoft.com/en-us/research/wp-content/uploads/...). That's what you get by default with `--chaos`.
But we also have variations on a straight-up randomized scheduler (random thread selection at each time step).
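For a concrete flavor of the kind of bug this targets, here's a toy ordering bug in the spirit of the hello_race example above (this version is just an illustration written here, not the actual test in the repo): the assertion only fails when the spawned thread loses the race, which a stock OS scheduler makes rare and a chaos/PCT scheduler goes looking for.

    // hello_race.rs (illustrative toy): main assumes the worker has already
    // run, but nothing enforces that ordering. Under the default OS
    // scheduler the bad interleaving is rare; a chaos scheduler hunts it.
    use std::sync::atomic::{AtomicBool, Ordering};
    use std::thread;

    static READY: AtomicBool = AtomicBool::new(false);

    fn main() {
        let worker = thread::spawn(|| {
            // Intended to run "before" main reads the flag.
            READY.store(true, Ordering::SeqCst);
        });

        // Ordering bug: fails only under particular schedules.
        assert!(READY.load(Ordering::SeqCst), "lost the race!");

        worker.join().unwrap();
    }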
rr chaos mode has its own take on this: https://robert.ocallahan.org/2016/02/introducing-rr-chaos-mo...
This study compares a few approaches - http://www.doc.ic.ac.uk/~afd/homepages/papers/pdfs/2016/TOPC....
But to the second part of your question about deterministic benchmarking, that is really a separate matter. Hermit defines a deterministic notion of virtual time, which is based on the branches retired and system calls executed by all threads. When you run hermit with `--summary`, it reports a total "Elapsed virtual global time", which is completely deterministic:
$ hermit run --summary /bin/date
...
Elapsed virtual global (cpu) time: 5_039_700ns
Therefore, any program that runs under hermit can get this deterministic notion of performance. We figured that could be useful for setting performance regression tests with very small regression margins (<1%), which you can't do on normal noisy systems. Compilers are one place I've worked where we wanted smaller performance regression alarms (for generated code) than we could achieve in practice. We haven't actually explored this application yet, though. There's a whole small field of people studying performance modeling and prediction, and if one wanted to try this deterministic benchmarking approach, they might want to take some of that knowledge and build a performance model that is more accurate (better correlated with wall time) than Hermit's current virtual time.
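As a sketch of what that could look like (hypothetical CI-side code, not part of Hermit; it assumes the `--summary` line format shown above): parse the virtual time and fail the build if it drifts more than 1% past a checked-in baseline.

    // Hypothetical CI check: extract "Elapsed virtual global (cpu) time"
    // from hermit's --summary output and compare against a stored baseline.
    fn parse_virtual_ns(summary: &str) -> Option<u64> {
        summary
            .lines()
            .find(|l| l.contains("Elapsed virtual global"))?
            .rsplit(':')
            .next()?
            .trim()
            .trim_end_matches("ns")
            .replace('_', "")
            .parse()
            .ok()
    }

    fn main() {
        let baseline_ns: u64 = 5_039_700; // checked-in baseline for this test
        let summary = "...\nElapsed virtual global (cpu) time: 5_065_000ns\n";
        let measured = parse_virtual_ns(summary).expect("no summary line");
        // Deterministic time makes a tight 1% margin meaningful.
        assert!(
            measured as f64 <= baseline_ns as f64 * 1.01,
            "virtual-time regression: {measured}ns vs baseline {baseline_ns}ns"
        );
    }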
One question: How do you avoid the program being affected by things like overall system load and memory pressure?
Eager to try it but encountering the build error here - https://github.com/facebookexperimental/hermit/issues/11
Do you have a reference build log / environment you can share? Last known good commit sha and/or output from "rustup show"?
Personally I'd also be interested in this for academic work -- anything which makes it easier to be sure an experiment can be reproduced later (a week, year, or decade later, in increasing order of difficulty) is good to have.
https://www.acm.org/publications/policies/artifact-review-ba...
But it's been pretty frustrating. As an author, my PLDI 2014 artifact stopped working less than 5 years later (Docker image binary incompatibility). And when I was co-chair of an Artifact Evaluation Committee in 2017, reproducibility of the submitted artifacts was not great either.
If you package a VM (freezing the Linux kernel), and are pretty sure that VM will run in 10 years, PLUS you determinize the execution itself... that should allow durable bitwise reproducibility. Maybe Hermit could be one ingredient of that.
For scientific reproducibility, there is a lot of other tooling to build too, and I know some folks have been working in that area:
I have a few questions. Hermit's README says:
> Instead, in order to provide complete determinism, the user should provide a fixed file system base image (e.g., with Docker)
How are you sanitizing the result of stat(), for example?
I'm guessing you're already aware of this, but fixed file system images are not sufficient to guarantee deterministic behavior by filesystems.
Specifically, inode numbers are not guaranteed to be assigned in any order. So how do you return deterministic inode numbers in stat() calls?
I'm sure you're also aware of this, but other filesystem results are also not guaranteed to be deterministic. For example, the `st_nblocks` and `st_blksize` fields in `struct stat` can change in-between stat() calls when nothing else happens from the user-space point of view except time passing (this happens in ZFS, for example, due to delayed allocation).
statvfs() is another problematic call which would have to be heavily sanitized.
As another example, there are filesystems that generate a random seed for every directory that is created, to prevent applications from exploiting hash table weaknesses and causing a denial-of-service when creating many directory entries that hash to the same value.
This random seed can have the side effect of readdir() calls returning directory entries in a different order based on the seed, even if the directory was created in the exact same way (with identical directory entries, created in the same order).
It seems like this would require reading the entire directory and then sorting the results, even when doing a single readdir() call to read a single directory entry.
Similarly, telldir() / seekdir() can also be problematic because the cookie value returned by the filesystem in telldir() is opaque and would be different for different directory seed values.
This seems like it would require Hermit to maintain a full in-memory view of all the directory entries of a directory that is opened and is being read, which also means that this in-memory view can become out-of-date in-between readdir() / telldir() / seekdir() calls by one process and calls that modify the directory by another process (Edit: I guess Hermit can synchronize this view easily since it can assume it has full control of all processes inside the container).
It also seems like Hermit would need to re-read the entire directory when rewinddir() is called, to avoid introducing non-previously-existing concurrency bugs like those described in the above paragraph (Edit: again, this is probably not necessary since Hermit can maintain a synchronized view of a directory on its own).
Reading the directory also seems to imply that you'd have to assign deterministic inode numbers for all directory entries, which also seems non-trivial.
Do you think this is an accurate assessment of the situation? How does Hermit handle all of this?
Also, CPUs can (according to the manuals, at least) behave non-deterministically when executing instructions in certain undefined conditions. How do you handle this?
Last question: how do you sanitize the RDTSC CPU instruction?
> How are you sanitizing the result of stat(), for example?
Ah, that part's the same as it was described in the ASPLOS'20 paper (https://dl.acm.org/doi/10.1145/3373376.3378519). Briefly, we present a somewhat sanitized version of the file system. Like if you do `hermit run -- ls -l`, you'll see that you are root, and everything owned by you is owned by root (and everything else is owned by nfsnobody currently). The file mod/access times are all set to whatever you provide for `--epoch`, because we think you mainly care about the file system contents, not the mod times. (Mod times do CHANGE as execution progresses though, otherwise programs like `make` become confused.)
> fixed file system images are not sufficient to guarantee deterministic behavior by filesystems.
Indeed! Hence we present only the sanitized view of the filesystem, so that we can treat each filesystem as its abstract contents (files as bitstrings, and directories as sorted lists of files). For inodes, we translate to and from virtual inodes, rather than reveal the real inodes of the file system.
If there's a flaw in this abstraction of the file system... we'd love to find it. Mainly we've been using it with zfs and Meta's "edenfs" virtual file system so far.
> `st_nblocks` and `st_blksize` fields in `struct stat` can change in-between stat() calls when nothing else happens
Yes, now you feel our pain. We've been plugging these kinds of things, and there are surely some we missed. We report boring constant values where we can get away with it (for st_blksize). Irreproducible "counterexample" programs are a welcome form of contribution ;-)!
> As another example, there are filesystems that generate a random seed
Ah, that's an interesting one. Since we determinize (only) userspace, any random bytes that get into guest memory are deterministic (getrandom, /dev/random, etc), but we won't see internal nondeterminism in the filesystem; instead we rely on sanitizing the way the filesystem appears to userspace. But if it only affects order, we should be ok, because we sort before returning from the getdents syscall.
> reading the entire directory and then sorting the results, even when doing a single readdir() call to read a single directory entry.
Yes, unfortunately. That's not great for performance, and we need to implement a caching scheme to at least amortize the overhead of this sort for pathological cases. Still, we will need to trigger the sort even on a single read as you say. So there's a pathological case there -- listing one item from a ginormous directory -- that would run much slower than the native/nondeterministic version.
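For the curious, here's a standalone sketch of those two ingredients together (my own illustration, not Hermit's implementation): read the whole directory, hand out stable virtual inode numbers, and return entries sorted by name so the guest never observes the filesystem's native (seed-dependent) ordering.

    // Sketch only: deterministic directory listing with virtual inodes.
    use std::collections::HashMap;
    use std::io;
    use std::os::unix::fs::DirEntryExt; // for DirEntry::ino()
    use std::path::Path;

    #[derive(Default)]
    struct InodeTable {
        map: HashMap<u64, u64>, // real inode -> virtual inode
        next: u64,
    }

    impl InodeTable {
        fn virtualize(&mut self, real: u64) -> u64 {
            if let Some(&v) = self.map.get(&real) {
                return v;
            }
            self.next += 1;
            self.map.insert(real, self.next);
            self.next
        }
    }

    /// Deterministic replacement for a directory read: names are sorted, and
    /// inode numbers come from the virtual table, not the filesystem.
    fn list_dir_det(dir: &Path, inodes: &mut InodeTable) -> io::Result<Vec<(String, u64)>> {
        let mut entries: Vec<(String, u64)> = std::fs::read_dir(dir)?
            .map(|e| -> io::Result<(String, u64)> {
                let e = e?;
                Ok((e.file_name().to_string_lossy().into_owned(), e.ino()))
            })
            .collect::<io::Result<_>>()?;
        entries.sort(); // sort by name: kills readdir-order nondeterminism
        Ok(entries
            .into_iter()
            .map(|(name, real)| (name, inodes.virtualize(real)))
            .collect())
    }

    fn main() -> io::Result<()> {
        let mut inodes = InodeTable::default();
        for (name, vino) in list_dir_det(Path::new("."), &mut inodes)? {
            println!("{vino:>6}  {name}");
        }
        Ok(())
    }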
> telldir() / seekdir()
Well these (and readdir) are not syscalls. So if we've handled getdents correctly we should be ok here. But this is a good suggestion for tests we need to add that stress these!
> in-memory view can become out-of-date in-between readdir() / telldir() / seekdir() calls by one process
Yes, we have a "non-interference" assumption that either you're running with a container-private file system (e.g. inside docker) or none of your directories are concurrently messed with by other processes outside the container.
> assign deterministic inode numbers for all directory entries, which also seems non-trivial.
Yep, that's what we do!
> Also, CPUs can (according to the manuals, at least) behave non-deterministically when executing instructions in certain undefined conditions. How do you handle this?
We require a certain level of good behavior by the program. To be practical, we aim to work for realistic programs but not necessarily adversarial ones. One way that you can break our sandboxing, for example, is to run CPUID, learn that the processor does not support instruction X, but then execute X anyway. For example, we can trap RDTSC in userspace, but not RDRAND.
If someone wants to use Hermit for, say, reverse engineering malware, then we need a Reverie backend that is hardened by emulating/rewriting the instruction stream carefully to protect against unsupported instructions or undefined conditions.
> Last question: how do you sanitize the RDTSC CPU instruction?
That one we can trap in userspace and then we return deterministic virtual time. Currently that time is a linear combination of the branches and system calls executed by all threads up to the current moment in time. For example, if you do RDTSC in a loop, you will see time advancing.
But using rdtsc to recreate fine-grained timers and implement side-channel attacks is impossible under Hermit. (Side channel attacks in general are only possible by first breaking the sandboxing in some way, and we don't know of a specific attack that will do the trick yet.)
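To illustrate the shape of that virtual clock (the weights below are made up for illustration; they are not Hermit's actual coefficients):

    // Illustrative only: virtual time as a linear combination of retired
    // branches and executed syscalls, summed over all threads.
    struct ThreadCounters {
        branches_retired: u64,
        syscalls_executed: u64,
    }

    const NS_PER_BRANCH: u64 = 1;    // hypothetical weight
    const NS_PER_SYSCALL: u64 = 500; // hypothetical weight

    fn virtual_global_time_ns(threads: &[ThreadCounters]) -> u64 {
        threads
            .iter()
            .map(|t| t.branches_retired * NS_PER_BRANCH + t.syscalls_executed * NS_PER_SYSCALL)
            .sum()
    }

    fn main() {
        let threads = vec![
            ThreadCounters { branches_retired: 4_000_000, syscalls_executed: 120 },
            ThreadCounters { branches_retired: 1_000_000, syscalls_executed: 40 },
        ];
        // Same program, same inputs, same schedule => same counters => same time.
        println!("{}ns", virtual_global_time_ns(&threads));
    }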
The underlying Reverie instrumentation layer works on ARM, but Hermit isn't ported yet, and we haven't touched RISC-V yet at all. (Contributions welcome!)
One thing we haven't tried yet is just putting a whole emulator (qemu etc) underneath Hermit. That would address any sources of irreproducibility that the emulator lets through from the host (threads, RNG, etc).
I'd like to think we could help someone somehow with public relations, but I don't think we can ;-). Actually, I don't think any of the big techs are leaning all that much into recruiting right now though....
And I hate to say it, but it's working on me. I've seen some really cool projects come out of Facebook lately!
> IO-intensive software builds have an average overhead of 3.49x, while a compute-bound bioinformatics workflow is under 2%.
That's still roughly accurate because Hermit is, today, still ptrace-powered. I'll quote my reply from elsewhere about the WIP high-perf backend:
> The `experimental/reverie-sabre` directory in the Reverie repo contains our high performance backend, but it's still work-in-progress. It uses binary instrumentation and in our early experiments is 10X faster than our current backend in the worst case (i.e. strace is >10X faster when rewritten with reverie-sabre and run on a program that does nothing but syscalls).
Indeed, releasing a faster drop-in "riptrace" strace replacement is one of the goals ;-).
(1) internal networking
If you run a test like `rust/network_hello_world.rs` under Hermit, then the communication between threads is part of the "deterministic bubble" that we're running inside of. When one thread blocks on a network call, the Hermit scheduler takes the thread out of the run pool, and it has to deterministically decide when it is ready to rejoin the run-pool by waking up. The scheduler proceeds in linear turns (labeled "COMMIT" in the logs), and if thread 5 unblocks from a network read at turn 100 in one run, it must unblock at that same point in time in all other runs.
Sometimes we use a precise model of the blocking operation (like with futexes) and other times we depend on sending Linux a non-blocking version of the syscall as a way to poll the IO and see if it is ready to complete (given the history of every operation that has committed on turns 1..N-1).
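As a sketch of that polling idea (not Hermit's code; it uses the libc crate's poll(2) with a zero timeout as the readiness probe): the handler effectively asks the kernel a non-blocking question each turn and only unparks the guest thread once the answer is "ready".

    // Sketch: probe fd readiness without blocking, once per scheduler turn.
    use std::os::unix::io::RawFd;

    /// Returns true if `fd` is readable right now, without blocking.
    fn readable_now(fd: RawFd) -> bool {
        let mut pfd = libc::pollfd {
            fd,
            events: libc::POLLIN,
            revents: 0,
        };
        // timeout = 0: never block; just report current readiness.
        let n = unsafe { libc::poll(&mut pfd, 1, 0) };
        n == 1 && (pfd.revents & libc::POLLIN) != 0
    }

    fn main() {
        // Probe stdin (fd 0). In a Hermit-like scheduler this check would
        // run each deterministic turn until the blocked read can complete.
        println!("stdin readable: {}", readable_now(0));
    }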
(2) external networking
This is impossible to determinize, of course. Unless you suck the whole network including both hosts into the deterministic bubble, as the DDOS fork of Linux experimented with in ~2013. That was kind of a negative result IMO because performance was pretty bad, but the paper is here:
https://www.dcc.fc.up.pt/~ines/aulas/1314/SDM/papers/DDOS.pdf
That's where record-replay comes in. `hermit record` can record network calls, but is in a pretty early state and doesn't support many programs. `hermit run` can just allow networking through and hope for the best, but in the future we plan to add features to record just network calls (and no other syscalls), so that you can mix and match different external-network-responses with different thread schedules. That is, you could "pin" the network responses with network-only recording, and then mess around with other parameters or even modify the program.
Good to see Meta making more practical Open Source tools like this (and BOLT).
Another related effort is antithesis.com, which also seems to use a hypervisor approach rather than Hermit's Linux-syscall-API level approach.
We could maybe have shared this logic to intercept syscalls and redirect them to user space code serving as the kernel. That is, we could have shared the Reverie layer. We saw ourselves as headed towards an in-guest binary instrumentation model (like rr’s syscall buffer). And so one factor is that Rust is a better fit than Go for injecting code into guest processes.
Regarding the actual gVisor user space kernel... we could have started with that and forked it to start adding determinism features to that kernel. At first glance that would seem to save on implementation work, but “implement futexes deterministically” is a pretty different requirement than “implement futexes”, so it’s not clear how much savings could have been had.
We could still have a go at reusing their kvm setup to implement a Reverie backend. But there’s some impedance matching to do across the FFI there, with the Reverie API relying on Rust’s async concurrency model and Tokio. Hopefully we could cleanly manage separate thread pools for the Go threads taking syscalls vs the Tokio thread pool hosting Reverie handler tasks. Or maybe it would be possible to reuse their solution without delivering each syscall to Go code.
It doesn't. It only requires privileges to access /dev/kvm.
Overall, we only have two ptrace stops: one before the syscall is executed and one after. We have a "tail_inject" optimization that can avoid the second ptrace stop, and it results in about a 40% speedup, but in my observations we usually do care about the result of the syscall and must do the second ptrace stop for correctness. Perhaps ptrace's SYSEMU combined with seccomp could lead to a speedup, but I just haven't looked into it yet.
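For readers who haven't written a ptrace tracer, here is a minimal strace-like loop showing those two stops per syscall (a sketch of the classic ptrace pattern, not Reverie's backend; it assumes the nix crate with its ptrace/process/wait features, on x86_64 Linux):

    // Minimal tracer: one stop at syscall entry, one at syscall exit.
    use nix::sys::ptrace;
    use nix::sys::wait::{waitpid, WaitStatus};
    use nix::unistd::{execvp, fork, ForkResult, Pid};
    use std::ffi::CString;

    fn trace(child: Pid) -> nix::Result<()> {
        // First stop: the child stops on exec because it called traceme().
        waitpid(child, None)?;
        loop {
            // Resume until the syscall-enter stop.
            ptrace::syscall(child, None)?;
            if let WaitStatus::Exited(_, _) = waitpid(child, None)? { break; }
            let regs = ptrace::getregs(child)?;
            eprintln!("enter: syscall {}", regs.orig_rax);

            // Resume until the syscall-exit stop -- the second stop we
            // usually need in order to observe (or rewrite) the result.
            ptrace::syscall(child, None)?;
            if let WaitStatus::Exited(_, _) = waitpid(child, None)? { break; }
            let regs = ptrace::getregs(child)?;
            eprintln!("exit:  syscall {} -> {}", regs.orig_rax, regs.rax as i64);
        }
        Ok(())
    }

    fn main() -> nix::Result<()> {
        match unsafe { fork()? } {
            ForkResult::Child => {
                ptrace::traceme()?;
                let prog = CString::new("/bin/true").unwrap();
                execvp(&prog, &[prog.clone()])?;
                unreachable!()
            }
            ForkResult::Parent { child } => trace(child),
        }
    }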
Undo, like rr and Microsoft TTD, records basically all syscalls, but doesn’t determinize anything in the original execution, only in the replay. A “hermit run” call is like a 0-byte recording -- no nondeterminism, so you can “replay” by just running again.
On the overhead, the largest factor is the means of program instrumentation. I don’t know where rr sits, but I’ve heard Microsoft’s solution is quite performant on Windows.
The Coyote project at Microsoft is a concurrency testing project with some similarities to Hermit. For the reasons above, they say in their docs to use a constant seed for CI regression testing, but use random exploration for bug finding:
https://www.microsoft.com/en-us/research/project/coyote/
Still, it does feel like wasted resources to test the same points in the (exponentially large) schedule space again and again. Kind of like some exploration/exploitation tradeoff.
We don't do it yet, but I would consider doing a randomized exploration during CI, while making the observable semantics the fixed version. If the randomized one fails, send that over to the "bug finding" component for further study, while quickly retrying with the known-good seed for the CI-visible regression test results.
I don't think there's one right policy here. But having control over these knobs lets us be intentional about it.
P.S. Taking the random schedules the OS gives us is kind of "free fuzzing", but it is very BAD free fuzzing. It over-samples the probable, boring schedules and under-samples the more extreme corner cases. Hence concurrency bugs lurk until the machine is under load in production and edge cases emerge.
But we can tell when our `--chaos` stress tests cease to produce crashes in reasonable numbers of runs. And when we do achieve a crash we can use our analysis phase to identify the racing operations.
It's both a pro and a con of the approach that we work with real crashes/failures. This means it's a less sensitive instrument than tools like TSAN (which can detect data races that never cause a failure in an actual run), but conversely we don't have to worry about false positives, because we can present evidence that a particular order of events definitely causes a failure. Also we catch a much more general category of concurrency bugs (ordering problems between arbitrary instructions/syscalls, even between processes and in multiple languages).
In practice, one challenge we have is bridging between the runtime view of the software (as a debugger would see it) -- raw machine instructions and system calls -- and the static view that you would get from analyzing the source code.
Sanitizers, for example (ASAN, TSAN, etc.), are typically implemented in the compiler as program instrumentations. If we integrated binary instrumentation tools like Intel Pin or DynamoRIO, we could perform additional dynamic analysis, but still at the machine code rather than source code level, which is a big gap from how symbolic execution normally happens, at the source code / AST level.