The ease of implementation CUDA offers comes at a cost: your code is wedded to it for life, because it is no longer valid C/C++ unless you totally litter it with #ifdefs to special-case for CUDA. In my own proprietary AI inference pipeline I've ended up code-generating to a number of different backends (OpenCL SPIR-V, Metal, CUDA, HLSL, CPU with OpenMP), giving no special treatment to CUDA, and the resulting code is much cleaner and builds with standard open source toolchains.