I think where that idea goes wrong is in order to compile it unmodified for nvptx, you need to use a toolchain which knows hip and nvptx, which the cuda toolchain does not. Clang can mostly compile cuda successfully but it's far less polished than the cuda toolchain. ROCm probably has the nvptx backend disabled, and even if it's built in, best case it'll work as well as clang upstream does.
What I'm told does work is keeping all the source as cuda and using hipify as part of a build process when using amdgpu - something like `cat foo.cu | hipify | clang -x hip -` - though I can't personally vouch for that working.
The original idea was people would write in opencl instead of cuda but that really didn't work out.