Since it's 12x faster than real time on a 4090, I wonder how fast it would be on a small-form-factor device (an SBC). I gather this is using CUDA, so I really wonder how it would perform on my Nvidia Xavier NX (and the more common Nanos out there)...!