Sure, there will always be some races/timing issues that just won't repro under recording (Heisenberg principle and all that), but in fact most races are _more likely_ to occur under recording. Part of this is because recording slows down the process being recorded, which is equivalent to speeding up the outside world.
And of course, when you do have your gnarly timing issue captured in a recording, it's usually trivial to root-cause exactly what happened. Our customers tell us that races and timing issues are a major use-case.
The former basically only exists for embedded boards, and the latter does not exist (at, say, less than a 10x slowdown) for Linux or any other common desktop operating system, as far as I am aware.
You can trace hardware that exposes trace functionality, usually via a debug port of some kind. Many chips have trace functionality in their production design, but off-the-shelf boards omit the physical debug connector (to reduce manufacturing cost). You can usually physically modify the board to get access to this functionality, which is routinely done when porting software to a new chip/board.
Trace functionality comes in two major flavors: control flow trace and memory trace. Control flow trace only records control flow, so the contents of memory are unknown, which is not very useful for your desired use case. Memory trace records memory accesses, so the contents of memory are known. Unfortunately, memory trace is very resource intensive, so most systems that support trace only implement control flow trace. As far as I am aware, it is very unlikely that any desktop or server CPU has memory trace.
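To make the difference concrete, here is a toy sketch in Python. It is not how any real trace hardware encodes its output; `run`, `branch_trace`, and `memory_trace` are names made up for illustration. It only shows that two runs can produce identical control flow traces while holding different data, which is why control flow trace alone cannot tell you what was in memory.

```python
def run(inputs):
    """Process a list of values while recording two kinds of (toy) trace."""
    branch_trace = []   # control flow trace: only which way each branch went
    memory_trace = []   # memory trace: the actual values that were read
    total = 0
    for value in inputs:
        memory_trace.append(("read", value))
        taken = value > 10
        branch_trace.append(taken)
        if taken:
            total += value
    return branch_trace, memory_trace, total


branches_a, memory_a, total_a = run([3, 20, 7])
branches_b, memory_b, total_b = run([5, 99, 1])

# Both runs take exactly the same branches...
same_control_flow = branches_a == branches_b
# ...but the data they read, and the result they computed, differ.
same_data = memory_a == memory_b
```

Here `same_control_flow` is true and `same_data` is false: from the branch record alone you could not say whether the program ended with a total of 20 or 99.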
The major manufacturers of trace probes and solutions that I know of are Green Hills Software, Lauterbach, and Segger.
With native multithreading, data can pass from thread to thread millions of times per second, and you're much less likely to hit obscure interactions when limited instead to maybe a couple hundred context switches per second.
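A rough way to see this effect is a small simulation. This is a sketch under strong assumptions, not a model of any real scheduler or recorder: two simulated threads each do a non-atomic increment (load, then store) on a shared counter, and `switch_prob` is a made-up stand-in for how often execution hops between threads. The function counts stores that clobber a value the other thread changed after the load.

```python
import random


def count_stale_stores(switch_prob, increments=1000, seed=0):
    """Count stores that overwrite a value changed since it was loaded."""
    rng = random.Random(seed)   # seeded so repeated runs give the same result
    shared = 0
    loaded = [None, None]       # value each thread has loaded but not yet stored
    done = [0, 0]               # completed increments per thread
    stale = 0
    current = 0
    while done[0] < increments or done[1] < increments:
        if done[current] == increments:
            current = 1 - current            # this thread finished; run the other
            continue
        if loaded[current] is None:
            loaded[current] = shared         # step 1: load
        else:
            if shared != loaded[current]:
                stale += 1                   # the other thread wrote in between
            shared = loaded[current] + 1     # step 2: store
            loaded[current] = None
            done[current] += 1
        if rng.random() < switch_prob:
            current = 1 - current            # simulated context switch
    return stale


if __name__ == "__main__":
    for p in (0.5, 0.01, 0.0001):
        print(p, count_stale_stores(p))
```

With `switch_prob` set to 0 the threads run back to back and no store is ever stale. As the switch rate drops toward that, there are simply fewer opportunities for the bad interleaving to show up, which is the point about a couple hundred context switches per second versus millions of cross-thread handoffs.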