There is indeed something excellent about CUDA from a user perspective that is hard to beat. I do high-level DNN work, and it is not clear to me what that something is or why it is so hard to match. Any time I have worked on optimizing for mobile hardware (not Jetson, but actual phones or accelerators), it has been a world of hurt and incompatibilities. This notion that operators or subgraphs can be accelerated by lower-level closed blobs... I wonder if that is part of the issue. But then why doesn't OpenCL just work? I thought it gave a CUDA-kernel-like general-purpose abstraction.
I just don't know the details well enough to understand why things are problematic without CUDA :(