I've been on the team working on this project over the past ~2 years. AMA!
Here is the GitHub repository: https://github.com/facebookexperimental/hermit
- How does this compare with rr?
- Out of curiosity, have you ever looked at libTAS? (I realize it has a very different intended use case.)
- Have you had any issues with behavior differences between CPUs? I know there is architecture-level undefined behavior that can differ between CPUs; on the other hand, it sounds like you’re primarily interested in running well-behaved executables that wouldn’t intentionally invoke such behavior.
- I haven't seen libTAS until now, so I can't comment on it much. At first glance, it does look similar!
- See @rrnewton's reply on this one.
We're big fans of rr!
Hermit is different in that it creates a deterministic OS semantics, rather than recording whatever nondeterministic behavior occurs under normal Linux. BUT, there's a lot of overlap. And indeed `hermit record` is straight up RnR (record & replay).
But hermit's RnR is not nearly as developed as rr's. We integrate with gdb/lldb as an (RSP) debugger backend, just like rr. You can attach a debugger to any failing execution you create with hermit. But our support is very preliminary, and you'll probably find rough edges. Also, we don't support backwards stepping yet (except by running again).
If we invest more in using Hermit as a debugger (rather than for finding and analyzing concurrency bugs), then there should be some advantages over traditional RnR. These would relate to the fact that deterministically executing is different than recording. For example, process and thread IDs and memory addresses all stay the same across multiple runs of the program, even as you begin adding printfs and modifying the program to fix the bug. With traditional RnR, you can play the same recording as many times as you like, but as soon as you take a second recording all bets are off wrt what is the same or different compared to the prior recording. (That includes losing the "mental state" of things like tids & memory addresses, which is a good point Robert O'Callahan makes about the benefits of RnR when accessing the same recording multiple times.)
(2) libTAS - no we haven't! Checking it out now.
(3) Yes, definitely issues with CPU portability.
In general, we are interested in not just determinism on the same machine, but portability between machines in our fleet. As with any tech company that uses the cloud, at Meta people are usually trying to debug an issue on a different machine than where the problem occurred. I.e. taking a crash from a production or CI machine to a local dev machine.
The way we do this is mostly by reporting a fairly old CPU to the guest, which disables certain features IF the guest is well behaved.
With the current processor tech, I don't think there's any way we can stop an adversarial program which, for example, would execute CPUID, find that RDRAND is not supported on the processor, but then execute RDRAND anyway. We could build a much more invasive binary-instrumentation based emulator that would be able to enforce these kinds of rules at instruction granularity, but it would have higher overhead, especially startup overhead. The nice thing about Reverie, though, is that we (or others) can add different instrumentation backends while keeping the same instrumentation API. So we could have a "hardened" backend that was more about sandboxing and reverse-engineering adversarial software, making a different tradeoff with respect to performance overhead.
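To make that adversarial case concrete, here's a tiny illustration (my own sketch, not anything Hermit ships; x86_64 Linux assumed): a program that asks CPUID whether RDRAND exists and then executes RDRAND regardless of the answer. Without instruction-level instrumentation there is nothing for a ptrace-based sandbox to trap on the second step.

    // Illustration only (not Hermit code; x86_64 assumed): check CPUID,
    // then execute RDRAND anyway, as an adversarial guest might.
    use std::arch::x86_64::__cpuid;

    fn main() {
        // CPUID leaf 1, ECX bit 30 advertises RDRAND support.
        let ecx = unsafe { __cpuid(1) }.ecx;
        println!("CPUID says RDRAND supported: {}", (ecx >> 30) & 1 == 1);

        // A well-behaved program stops here if the feature is absent.
        // An adversarial one issues the instruction regardless, and the
        // resulting bytes are nondeterminism the sandbox never sees.
        let mut val: u64 = 0;
        unsafe { std::arch::asm!("rdrand {}", out(reg) val) };
        println!("nondeterministic value: {val:#x}");
    }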
- https://github.com/dettrace/dettrace
And one cool part of it is this Rust program instrumentation layer:
- https://github.com/facebookexperimental/reverie
It's good for building OS-emulator style projects or tracing tools.
Does Hermit support running different processes in parallel in certain situations?
It seems like this should be possible to do as long as they are appropriately isolated from each other.
Specifically, if two processes don't share any writable memory mappings then I'm thinking it might be possible for Hermit to exploit the underlying existing isolation between them so that they could run code in-between system calls in parallel (while still making sure that the actual system calls happen in a deterministic order).
Perhaps it would even be possible for the two processes to execute system calls in parallel if they are unrelated and they do not affect deterministic execution (such as if they are doing I/O to different files or directories, or one process doing I/O while another is doing something completely unrelated like getting the time, etc).
Although I guess this latter point (of executing syscalls in parallel) would require doing rewind/replay because it wouldn't be possible to know in advance whether the system calls are going to be related or not, as the two processes might not do the syscalls at exactly the same time.
Is Hermit doing this (executing code in-between syscalls in parallel, at least)? Or do you think it would be possible to do it?
This could significantly increase the usefulness of Hermit for distros that want to do deterministic builds of packages, as most compilation could happen in parallel, i.e. use several cores at the same time!
I'm thinking about huge package builds such as Chromium, Firefox, the Linux kernel, etc. which would perhaps take days to build if they were completely serialized into a single CPU core.
Our earlier (dettrace) prototype allowed syscall-free regions in separate processes to run physically in parallel. Hermit hasn't added any process parallelism yet, but it's designed to actually go further than dettrace in this respect.
Specifically, Hermit is architected so that the thread-local syscall handler "checks out" resources from the central scheduler. Resources include things like paths on the file system, contents of files, shared memory, and permission to perform external side effects. Right now, all requests wait for the scheduler to background all other threads and make the current thread the only runnable one. But the idea is: in the future we will keep the same semantically identical log of linear "commits" (linearization), but will simply background the current thread while it uses the resources it checked out, move forward to the next scheduler iteration, and only block the next runnable thread until its requested resources are freed, not until ALL other threads have finished their timeslice and gone back to waiting on the scheduler.
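Here's a toy sketch of that check-out idea (my own illustration, not Hermit's actual types or API): handlers declare the resources they need, the scheduler keeps a deterministic commit log, and requests whose resource sets are disjoint are exactly the ones that could safely overlap in the future design.

    // Toy sketch (not Hermit's API) of "checking out" resources from a
    // central scheduler with a deterministic linearization of commits.
    use std::collections::{BTreeSet, HashSet};

    #[derive(Clone, Hash, PartialEq, Eq, PartialOrd, Ord, Debug)]
    enum Resource {
        Path(String),       // a path in the file system
        FileContents(u64),  // contents of a (virtual) inode
        SharedMem(u64),     // a shared memory mapping
        ExternalSideEffect, // permission to perform an external side effect
    }

    #[derive(Default)]
    struct Scheduler {
        in_use: HashSet<Resource>,
        commit_log: Vec<(u64, BTreeSet<Resource>)>, // deterministic log of commits
        turn: u64,
    }

    impl Scheduler {
        /// Grant the request only if every wanted resource is free, and
        /// record a commit. Today Hermit runs one thread at a time, which
        /// trivially satisfies this; the sketch shows where finer-grained
        /// parallelism could slot in.
        fn check_out(&mut self, tid: u64, wanted: BTreeSet<Resource>) -> bool {
            if wanted.iter().any(|r| self.in_use.contains(r)) {
                return false; // caller stays parked until a later turn
            }
            self.in_use.extend(wanted.iter().cloned());
            self.turn += 1;
            self.commit_log.push((tid, wanted));
            true
        }

        fn check_in(&mut self, resources: &BTreeSet<Resource>) {
            for r in resources {
                self.in_use.remove(r);
            }
        }
    }

    fn main() {
        let mut sched = Scheduler::default();
        let a: BTreeSet<_> = [Resource::Path("/tmp/a".into())].into();
        let b: BTreeSet<_> = [Resource::Path("/tmp/b".into())].into();
        assert!(sched.check_out(1, a.clone())); // disjoint resource sets:
        assert!(sched.check_out(2, b));         // both could proceed in parallel
        sched.check_in(&a);
        println!("commits: {:?}", sched.commit_log);
    }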
I was going to ask if it could do the flip side - instead of stabilizing the scheduler, make it less predictable.
AFAICT, it can! Awesome, looking forward to giving it a try.
hermit run --chaos --seed-from=SystemRandom ./target/debug/hello_race
That concurrency testing capability is a pretty well-studied area and we implement a couple of existing algorithms. The first is our adaptation of the PCT algorithm (ASPLOS'10 https://www.microsoft.com/en-us/research/wp-content/uploads/...). That's what you get by default with `--chaos`.
But we also have variations on a straight-up randomized scheduler (random thread selection at each time step).
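For a concrete flavor of the kind of bug this targets, here's a toy ordering bug in the spirit of the hello_race example above (this version is just an illustration written here, not the actual test in the repo): the assertion only fails when the spawned thread loses the race, which a stock OS scheduler makes rare and a chaos/PCT scheduler goes looking for.

    // hello_race.rs (illustrative toy): main assumes the worker has already
    // run, but nothing enforces that ordering. Under the default OS
    // scheduler the bad interleaving is rare; a chaos scheduler hunts it.
    use std::sync::atomic::{AtomicBool, Ordering};
    use std::thread;

    static READY: AtomicBool = AtomicBool::new(false);

    fn main() {
        let worker = thread::spawn(|| {
            // Intended to run "before" main reads the flag.
            READY.store(true, Ordering::SeqCst);
        });

        // Ordering bug: fails only under particular schedules.
        assert!(READY.load(Ordering::SeqCst), "lost the race!");

        worker.join().unwrap();
    }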
rr chaos mode has its own take on this: https://robert.ocallahan.org/2016/02/introducing-rr-chaos-mo...
This study compares a few approaches - http://www.doc.ic.ac.uk/~afd/homepages/papers/pdfs/2016/TOPC....
But to the second part of your question about deterministic benchmarking, that is really a separate matter. Hermit defines a deterministic notion of virtual time, which is based on the branches retired and system calls executed by all threads. When you run hermit with `--summary`, it reports a total "Elapsed virtual global time", which is completely deterministic:
$ hermit run --summary /bin/date
...
Elapsed virtual global (cpu) time: 5_039_700ns
Therefore, any program that runs under hermit can get this deterministic notion of performance. We figured that could be useful for setting performance regression tests with very small regression margins (<1%), which you can't do on normal noisy systems. Compilers are one place I've worked where we wanted smaller performance regression alarms (for generated code) than we could achieve in practice. We haven't actually explored this application yet, though. There's a whole small field of people studying performance modeling and prediction, and if one wanted to try this deterministic benchmarking approach, they might want to take some of that knowledge and build a performance model that is more accurate (better correlated with wall time) than Hermit's current virtual time.
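As a sketch of what that could look like (hypothetical CI-side code, not part of Hermit; it assumes the `--summary` line format shown above): parse the virtual time and fail the build if it drifts more than 1% past a checked-in baseline.

    // Hypothetical CI check: extract "Elapsed virtual global (cpu) time"
    // from hermit's --summary output and compare against a stored baseline.
    fn parse_virtual_ns(summary: &str) -> Option<u64> {
        summary
            .lines()
            .find(|l| l.contains("Elapsed virtual global"))?
            .rsplit(':')
            .next()?
            .trim()
            .trim_end_matches("ns")
            .replace('_', "")
            .parse()
            .ok()
    }

    fn main() {
        let baseline_ns: u64 = 5_039_700; // checked-in baseline for this test
        let summary = "...\nElapsed virtual global (cpu) time: 5_065_000ns\n";
        let measured = parse_virtual_ns(summary).expect("no summary line");
        // Deterministic time makes a tight 1% margin meaningful.
        assert!(
            measured as f64 <= baseline_ns as f64 * 1.01,
            "virtual-time regression: {measured}ns vs baseline {baseline_ns}ns"
        );
    }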
One question: How do you avoid the program being affected by things like overall system load and memory pressure?
Eager to try it but encountering the build error here - https://github.com/facebookexperimental/hermit/issues/11
Do you have a reference build log / environment you can share? Last known good commit sha and/or output from "rustup show"?
Personally I'd also be interested in this for academic work -- anything which makes it easier to be sure an experiment can be reproduced later (a week, year, or decade later, in increasing order of difficulty) is good to have.
https://www.acm.org/publications/policies/artifact-review-ba...
But it's been pretty frustrating. As an author, my PLDI 2014 artifact stopped working less than 5 years later (Docker image binary incompatibility). And when I was co-chair of an Artifact Evaluation Committee in 2017, reproducibility of the submitted artifacts was not great either.
If you package a VM (freezing the Linux kernel), and are pretty sure that VM will run in 10 years, PLUS you determinize the execution itself... that should allow durable bitwise reproducibility. Maybe Hermit could be one ingredient of that.
For scientific reproducibility, there is a lot of other tooling to build too, and I know some folks have been working in that area:
I have a few questions. Hermit's README says:
> Instead, in order to provide complete determinism, the user should provide a fixed file system base image (e.g., with Docker)
How are you sanitizing the result of stat(), for example?
I'm guessing you're already aware of this, but fixed file system images are not sufficient to guarantee deterministic behavior by filesystems.
Specifically, inode numbers are not guaranteed to be assigned in any order. So how do you return deterministic inode numbers in stat() calls?
I'm sure you're also aware of this, but other filesystem results are also not guaranteed to be deterministic. For example, the `st_nblocks` and `st_blksize` fields in `struct stat` can change in-between stat() calls when nothing else happens from the user-space point of view except time passing (this happens in ZFS, for example, due to delayed allocation).
statvfs() is another problematic call which would have to be heavily sanitized.
As another example, there are filesystems that generate a random seed for every directory that is created, to prevent applications from exploiting hash table weaknesses and causing a denial-of-service when creating many directory entries that hash to the same value.
This random seed can have the side effect of readdir() calls returning directory entries in a different order based on the seed, even if the directory was created in the exact same way (with identical directory entries, created in the same order).
It seems like this would require reading the entire directory and then sorting the results, even when doing a single readdir() call to read a single directory entry.
Similarly, telldir() / seekdir() can also be problematic because the cookie value returned by the filesystem in telldir() is opaque and would be different for different directory seed values.
This seems like it would require Hermit to maintain a full in-memory view of all the directory entries of a directory that is opened and is being read, which also means that this in-memory view can become out-of-date in-between readdir() / telldir() / seekdir() calls by one process and calls that modify the directory by another process (Edit: I guess Hermit can synchronize this view easily since it can assume it has full control of all processes inside the container).
It also seems like Hermit would need to re-read the entire directory when rewinddir() is called, to avoid introducing non-previously-existing concurrency bugs like those described in the above paragraph (Edit: again, this is probably not necessary since Hermit can maintain a synchronized view of a directory on its own).
Reading the directory also seems to imply that you'd have to assign deterministic inode numbers for all directory entries, which also seems non-trivial.
Do you think this is an accurate assessment of the situation? How does Hermit handle all of this?
Also, CPUs can (according to the manuals, at least) behave non-deterministically when executing instructions in certain undefined conditions. How do you handle this?
Last question: how do you sanitize the RDTSC CPU instruction?
> How are you sanitizing the result of stat(), for example?
Ah, that part's the same as it was described in the ASPLOS'20 paper (https://dl.acm.org/doi/10.1145/3373376.3378519). Briefly, we present a somewhat sanitized version of the file system. Like if you do `hermit run -- ls -l`, you'll see that you are root, and everything owned by you is owned by root (and everything else is owned by nfsnobody currently). The file mod/access times are all set to whatever you provide for `--epoch`, because we think you mainly care about the file system contents, not the mod times. (Mod times do CHANGE as execution progresses though, otherwise programs like `make` become confused.)
> fixed file system images are not sufficient to guarantee deterministic behavior by filesystems.
Indeed! Hence we present only the sanitized view of the filesystem, so that we can treat each filesystem as its abstract contents (files as bitstrings, and directories as sorted lists of files). For inodes, we translate to and from virtual inodes, rather than reveal the real inodes of the file system.
If there's a flaw in this abstraction of the file system... we'd love to find it. Mainly we've been using it with zfs and Meta's "edenfs" virtual file system so far.
> `st_nblocks` and `st_blksize` fields in `struct stat` can change in-between stat() calls when nothing else happens
Yes, now you feel our pain. We've been plugging these kinds of things, and there are surely some we missed. We report boring constant values where we can get away with it (for st_blksize). Irreproducible "counterexample" programs are a welcome form of contribution ;-)!
> As another example, there are filesystems that generate a random seed
Ah, that's an interesting one. Since we determinize (only) userspace, any random bytes that get into guest memory are deterministic (getrandom, /dev/random, etc), but we won't see internal nondeterminism in the filesystem; instead we rely on sanitizing the way the filesystem appears to userspace. But if it only affects order, we should be ok, because we sort before returning from the getdents syscall.
> reading the entire directory and then sorting the results, even when doing a single readdir() call to read a single directory entry.
Yes, unfortunately. That's not great for performance, and we need to implement a caching scheme to at least amortize the overhead of this sort for pathological cases. Still, we will need to trigger the sort even on a single read as you say. So there's a pathological case there -- listing one item from a ginormous directory -- that would run much slower than the native/nondeterministic version.
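For the curious, here's a standalone sketch of those two ingredients together (my own illustration, not Hermit's implementation): read the whole directory, hand out stable virtual inode numbers, and return entries sorted by name so the guest never observes the filesystem's native (seed-dependent) ordering.

    // Sketch only: deterministic directory listing with virtual inodes.
    use std::collections::HashMap;
    use std::io;
    use std::os::unix::fs::DirEntryExt; // for DirEntry::ino()
    use std::path::Path;

    #[derive(Default)]
    struct InodeTable {
        map: HashMap<u64, u64>, // real inode -> virtual inode
        next: u64,
    }

    impl InodeTable {
        fn virtualize(&mut self, real: u64) -> u64 {
            if let Some(&v) = self.map.get(&real) {
                return v;
            }
            self.next += 1;
            self.map.insert(real, self.next);
            self.next
        }
    }

    /// Deterministic replacement for a directory read: names are sorted, and
    /// inode numbers come from the virtual table, not the filesystem.
    fn list_dir_det(dir: &Path, inodes: &mut InodeTable) -> io::Result<Vec<(String, u64)>> {
        let mut entries: Vec<(String, u64)> = std::fs::read_dir(dir)?
            .map(|e| -> io::Result<(String, u64)> {
                let e = e?;
                Ok((e.file_name().to_string_lossy().into_owned(), e.ino()))
            })
            .collect::<io::Result<_>>()?;
        entries.sort(); // sort by name: kills readdir-order nondeterminism
        Ok(entries
            .into_iter()
            .map(|(name, real)| (name, inodes.virtualize(real)))
            .collect())
    }

    fn main() -> io::Result<()> {
        let mut inodes = InodeTable::default();
        for (name, vino) in list_dir_det(Path::new("."), &mut inodes)? {
            println!("{vino:>6}  {name}");
        }
        Ok(())
    }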
> telldir() / seekdir()
Well these (and readdir) are not syscalls. So if we've handled getdents correctly we should be ok here. But this is a good suggestion for tests we need to add that stress these!
> in-memory view can become out-of-date in-between readdir() / telldir() / seekdir() calls by one process
Yes, we have a "non-interference" assumption that either you're running with a container-private file system (e.g. inside docker) or none of your directories are concurrently messed with by other processes outside the container.
> assign deterministic inode numbers for all directory entries, which also seems non-trivial.
Yep, that's what we do!
> Also, CPUs can (according to the manuals, at least) behave non-deterministically when executing instructions in certain undefined conditions. How do you handle this?
We require a certain level of good behavior by the program. To be practical, we aim to work for realistic programs but not necessarily adversarial ones. One way that you can break our sandboxing, for example, is to run CPUID, learn that the processor does not support instruction X, but then execute X anyway. For example, we can trap RDTSC in userspace, but not RDRAND.
If someone wants to use Hermit for, say, reverse engineering malware, then we need a Reverie backend that is hardened by emulating/rewriting the instruction stream carefully to protect against unsupported instructions or undefined conditions.
> Last question: how do you sanitize the RDTSC CPU instruction?
That one we can trap in userspace and then we return deterministic virtual time. Currently that time is a linear combination of the branches and system calls executed by all threads up to the current moment in time. For example, if you do RDTSC in a loop, you will see time advancing.
But using rdtsc to recreate fine-grained timers and implement side-channel attacks is impossible under Hermit. (Side channel attacks in general are only possible by first breaking the sandboxing in some way, and we don't know of a specific attack that will do the trick yet.)
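To illustrate the shape of that virtual clock (the weights below are made up for illustration; they are not Hermit's actual coefficients):

    // Illustrative only: virtual time as a linear combination of retired
    // branches and executed syscalls, summed over all threads.
    struct ThreadCounters {
        branches_retired: u64,
        syscalls_executed: u64,
    }

    const NS_PER_BRANCH: u64 = 1;    // hypothetical weight
    const NS_PER_SYSCALL: u64 = 500; // hypothetical weight

    fn virtual_global_time_ns(threads: &[ThreadCounters]) -> u64 {
        threads
            .iter()
            .map(|t| t.branches_retired * NS_PER_BRANCH + t.syscalls_executed * NS_PER_SYSCALL)
            .sum()
    }

    fn main() {
        let threads = vec![
            ThreadCounters { branches_retired: 4_000_000, syscalls_executed: 120 },
            ThreadCounters { branches_retired: 1_000_000, syscalls_executed: 40 },
        ];
        // Same program, same inputs, same schedule => same counters => same time.
        println!("{}ns", virtual_global_time_ns(&threads));
    }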
The underlying Reverie instrumentation layer works on ARM, but Hermit isn't ported yet, and we haven't touched RISC-V yet at all. (Contributions welcome!)
One thing we haven't tried yet is just putting a whole emulator (qemu etc) underneath Hermit. That would address any sources of irreproducibility that the emulator lets through from the host (threads, RNG, etc).
I'd like to think we could help someone somehow with public relations, but I don't think we can ;-). Actually, I don't think any of the big techs are leaning all that much into recruiting right now though....
And I hate to say it, but it's working on me. I've seen some really cool projects come out of Facebook lately!
> IO-intensive software builds have an average overhead of 3.49x, while a compute-bound bioinformatics workflow is under 2%.
That's still roughly accurate because Hermit is, today, still ptrace-powered. I'll quote my reply from elsewhere about the WIP high-perf backend:
> The `experimental/reverie-sabre` directory in the Reverie repo contains our high performance backend, but it's still work-in-progress. It uses binary instrumentation and in our early experiments is 10X faster than our current backend in the worst case (i.e. strace is >10X faster when rewritten with reverie-sabre and run on a program that does nothing but syscalls).
Indeed, releasing a faster drop-in "riptrace" strace replacement is one of the goals ;-).
(1) internal networking
If you run a test like `rust/network_hello_world.rs` under Hermit, then the communication between threads is part of the "deterministic bubble" that we're running inside of. When one thread blocks on a network call, the Hermit scheduler takes the thread out of the run pool, and it has to deterministically decide when it is ready to rejoin the run-pool by waking up. The scheduler proceeds in linear turns (labeled "COMMIT" in the logs), and if thread 5 unblocks from a network read at turn 100 in one run, it must unblock at that same point in time in all other runs.
Sometimes we use a precise model of the blocking operation (like with futexes) and other times we depend on sending Linux a non-blocking version of the syscall as a way to poll the IO and see if it is ready to complete (given the history of every operation that has committed on turns 1..N-1).
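As a sketch of that polling idea (not Hermit's code; it uses the libc crate's poll(2) with a zero timeout as the readiness probe): the handler effectively asks the kernel a non-blocking question each turn and only unparks the guest thread once the answer is "ready".

    // Sketch: probe fd readiness without blocking, once per scheduler turn.
    use std::os::unix::io::RawFd;

    /// Returns true if `fd` is readable right now, without blocking.
    fn readable_now(fd: RawFd) -> bool {
        let mut pfd = libc::pollfd {
            fd,
            events: libc::POLLIN,
            revents: 0,
        };
        // timeout = 0: never block; just report current readiness.
        let n = unsafe { libc::poll(&mut pfd, 1, 0) };
        n == 1 && (pfd.revents & libc::POLLIN) != 0
    }

    fn main() {
        // Probe stdin (fd 0). In a Hermit-like scheduler this check would
        // run each deterministic turn until the blocked read can complete.
        println!("stdin readable: {}", readable_now(0));
    }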
(2) external networking
This is impossible to determinize, of course. Unless you suck the whole network including both hosts into the deterministic bubble, as the DDOS fork of Linux experimented with in ~2013. That was kind of a negative result IMO because performance was pretty bad, but the paper is here:
https://www.dcc.fc.up.pt/~ines/aulas/1314/SDM/papers/DDOS.pdf
That's where record-replay comes in. `hermit record` can record network calls, but is in a pretty early state and doesn't support many programs. `hermit run` can just allow networking through and hope for the best, but in the future we plan to add features to record just network calls (and no other syscalls), so that you can mix and match different external-network-responses with different thread schedules. That is, you could "pin" the network responses with network-only recording, and then mess around with other parameters or even modify the program.
Good to see Meta making more practical Open Source tools like this (and BOLT).
Another related effort is antithesis.com, which also seems to use a hypervisor approach rather than Hermit's Linux-syscall-API level approach.
We could maybe have shared this logic to intercept syscalls and redirect them to user space code serving as the kernel. That is, we could have shared the Reverie layer. We saw ourselves as headed towards an in-guest binary instrumentation model (like rr’s syscall buffer). And so one factor is that Rust is a better fit than Go for injecting code into guest processes.
Regarding the actual gVisor user space kernel... we could have started with that and forked it to start adding determinism features to that kernel. At first glance that would seem to save on implementation work, but “implement futexes deterministically” is a pretty different requirement than “implement futexes”, so it’s not clear how much savings could have been had.
We could still have a go at reusing their kvm setup to implement a Reverie backend. But there’s some impedance matching to do across the FFI there, with the Reverie API relying on Rust’s async concurrency model and Tokio. Hopefully we could cleanly manage separate thread pools for the Go threads taking syscalls vs the Tokio thread pool hosting Reverie handler tasks. Or maybe it would be possible to reuse their solution without delivering each syscall to Go code.
It doesn't. It only requires privileges to access /dev/kvm.
Overall, we only have two ptrace stops: one before the syscall is executed and one after. We have a "tail_inject" optimization that can avoid the second ptrace stop, and it results in about a 40% speedup, but in my observations we usually do care about the result of the syscall and must do the second ptrace stop for correctness. Perhaps ptrace's SYSEMU combined with seccomp could lead to a speedup, but I just haven't looked into it yet.
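For readers who haven't written a ptrace tracer, here is a minimal strace-like loop showing those two stops per syscall (a sketch of the classic ptrace pattern, not Reverie's backend; it assumes the nix crate with its ptrace/process/wait features, on x86_64 Linux):

    // Minimal tracer: one stop at syscall entry, one at syscall exit.
    use nix::sys::ptrace;
    use nix::sys::wait::{waitpid, WaitStatus};
    use nix::unistd::{execvp, fork, ForkResult, Pid};
    use std::ffi::CString;

    fn trace(child: Pid) -> nix::Result<()> {
        // First stop: the child stops on exec because it called traceme().
        waitpid(child, None)?;
        loop {
            // Resume until the syscall-enter stop.
            ptrace::syscall(child, None)?;
            if let WaitStatus::Exited(_, _) = waitpid(child, None)? { break; }
            let regs = ptrace::getregs(child)?;
            eprintln!("enter: syscall {}", regs.orig_rax);

            // Resume until the syscall-exit stop -- the second stop we
            // usually need in order to observe (or rewrite) the result.
            ptrace::syscall(child, None)?;
            if let WaitStatus::Exited(_, _) = waitpid(child, None)? { break; }
            let regs = ptrace::getregs(child)?;
            eprintln!("exit:  syscall {} -> {}", regs.orig_rax, regs.rax as i64);
        }
        Ok(())
    }

    fn main() -> nix::Result<()> {
        match unsafe { fork()? } {
            ForkResult::Child => {
                ptrace::traceme()?;
                let prog = CString::new("/bin/true").unwrap();
                execvp(&prog, &[prog.clone()])?;
                unreachable!()
            }
            ForkResult::Parent { child } => trace(child),
        }
    }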
Undo, like rr and Microsoft TTD, records basically all syscalls, but doesn’t determinize anything in the original execution, only in the replay. A “hermit run” call is like a 0-byte recording -- no nondeterminism, so you can “replay” by just running again.
On the overhead, the largest factor is the means of program instrumentation. I don’t know where rr sits, but I’ve heard Microsoft’s solution is quite performant on Windows.
The Coyote project at Microsoft is a concurrency testing project with some similarities to Hermit. For the reasons above, they say in their docs to use a constant seed for CI regression testing, but use random exploration for bug finding:
https://www.microsoft.com/en-us/research/project/coyote/
Still, it does feel like wasted resources to test the same points in the (exponentially large) schedule space again and again. Kind of like some exploration/exploitation tradeoff.
We don't do it yet, but I would consider doing a randomized exploration during CI, while making the observable semantics the fixed version. If the randomized one fails, send that over to the "bug finding" component for further study, while quickly retrying with the known-good seed for the CI-visible regression test results.
I don't think there's one right policy here. But having control over these knobs lets us be intentional about it.
P.S. Taking the random schedules the OS gives us is kind of "free fuzzing", but it is very BAD free fuzzing. It over-samples the probable, boring schedules and under-samples the more extreme corner cases. Hence concurrency bugs lurk until the machine is under load in production and edge cases emerge.
But we can tell when our `--chaos` stress tests cease to produce crashes in reasonable numbers of runs. And when we do achieve a crash we can use our analysis phase to identify the racing operations.
It's both a pro and a con of the approach that we work with real crashes/failures. This means it's a less sensitive instrument than tools like TSAN (which can detect data races that never cause a failure in an actual run), but conversely we don't have to worry about false positives, because we can present evidence that a particular order of events definitely causes a failure. Also we catch a much more general category of concurrency bugs (ordering problems between arbitrary instructions/syscalls, even between processes and in multiple languages).
In practice, one challenge we have is bridging between the runtime view of the software (as a debugger would see it) -- raw machine instructions and system calls -- and the static view that you would get from analyzing the source code.
Sanitizers, for example (ASAN, TSAN, etc.), are typically implemented in the compiler as program instrumentations. If we integrated binary instrumentation tools like Intel Pin or DynamoRIO, we could perform additional dynamic analysis, but still at the machine code rather than source code level, which is a big gap from how symbolic execution normally happens, at the source code / AST level.