A couple minor notes:
- If your program is very light on syscalls (i.e. basically entirely in-memory computation), rr can go to a basically 1.0x slowdown. In particular this means you can run benchmarks in it at full capacity, provided that I/O is outside of the repeated part (e.g. if sometimes the bench is noticably slower, you can replay and see if some important loads/stores crossed a cacheline/page). You can even "perf record" / "perf stat" a replay if you want to! (none of this is too useful, but it's fun! Gathering repeated stats over the same execution for more resolution might be useful with proper tooling though)
- rr does have an in-memory buffer of recording data.
- rr recordings should be portable within the architecture, as long as the replay hardware has the extensions the recorder did (or if replayer-unsupported features are disabled at record-time).