Also, until all of these libraries are made amenable to kernel fusion, or at least get prologue/epilogue hooks, they can be beaten on memory-bound work by fairly unoptimized custom kernels that simply keep the intermediates out of global memory.
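To make the fusion point concrete, here's a rough sketch. The op chain (scale, add bias, ReLU) and all the names are just my toy assumptions, not anything from a real library. The unfused path moves about 4n floats through global memory (read x, write tmp, read tmp, write out); the fused kernel moves about 2n and keeps the intermediate in a register.

```cuda
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Unfused step 1: tmp = a * x, with the intermediate written to global memory.
__global__ void scale(const float* x, float* tmp, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = a * x[i];
}

// Unfused step 2: out = max(tmp + b, 0), re-reading the intermediate.
__global__ void bias_relu(const float* tmp, float* out, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = fmaxf(tmp[i] + b, 0.0f);
}

// Fused: same math, but the intermediate never leaves a register.
__global__ void scale_bias_relu(const float* x, float* out, float a, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = a * x[i];
        out[i] = fmaxf(t + b, 0.0f);
    }
}

// Hypothetical helper: runs both variants on the same input and returns the
// largest absolute difference between their outputs (-1 on a CUDA error).
float max_abs_diff_fused_vs_unfused(int n) {
    std::vector<float> h_x(n), h_unfused(n), h_fused(n);
    // Inputs and constants are exactly representable, so mul+add and FMA agree.
    for (int i = 0; i < n; ++i) h_x[i] = 0.5f * (i % 64) - 16.0f;
    const float a = 2.0f, b = 1.0f;

    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *d_x = nullptr, *d_tmp = nullptr, *d_out = nullptr;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_tmp, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_x, h_x.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Two launches, with one extra write and one extra read of n floats.
    scale<<<blocks, threads>>>(d_x, d_tmp, a, n);
    bias_relu<<<blocks, threads>>>(d_tmp, d_out, b, n);
    cudaMemcpy(h_unfused.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    // One launch, no intermediate buffer touched.
    scale_bias_relu<<<blocks, threads>>>(d_x, d_out, a, b, n);
    cudaMemcpy(h_fused.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    const bool ok = (cudaGetLastError() == cudaSuccess);
    cudaFree(d_x);
    cudaFree(d_tmp);
    cudaFree(d_out);
    if (!ok) return -1.0f;

    float max_diff = 0.0f;
    for (int i = 0; i < n; ++i)
        max_diff = std::max(max_diff, std::fabs(h_unfused[i] - h_fused[i]));
    return max_diff;
}

int main() {
    printf("max |unfused - fused| = %g\n", max_abs_diff_fused_vs_unfused(1 << 20));
    return 0;
}
```

Neither kernel is tuned at all. For a purely memory-bound chain like this, the traffic you skip is usually what matters, though how much you actually gain depends on the hardware and the problem size.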
I'm very glad cuFFT and cuBLAS are getting 'device' (Dx) versions, and that NVIDIA is getting wiser on the kernel-fusion track. The Dx libraries are amazingly fast and game-changing, but they still don't cover a big chunk of what the original libraries do.
Also, a lot of problems that are amenable to GPU compute aren't expressed in terms of BLAS/DNN primitives, yet can be written very, very simply as CUDA code and still get huge performance gains over CPUs, with no chance that the Olympians will ever take an interest in your problem space.
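As an illustration of "very simply expressed", here's a minimal sketch of an embarrassingly parallel problem with no BLAS/DNN shape at all: counting Collatz steps for each starting value. The problem choice and every name here are my own assumptions, picked only because the whole thing fits in one tiny kernel.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Number of Collatz steps needed to reach 1 from `start` (requires start >= 1).
// Compiled for both host and device so the CPU can act as a reference.
__host__ __device__ unsigned int collatz_steps(unsigned long long start) {
    unsigned int steps = 0;
    while (start != 1ull) {
        start = (start & 1ull) ? 3ull * start + 1ull : start >> 1;
        ++steps;
    }
    return steps;
}

// One thread per starting value: thread i handles start = i + 1.
__global__ void collatz_kernel(unsigned int* out, unsigned int n) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = collatz_steps(i + 1ull);
}

// Hypothetical helper: returns the step counts for start values 1..n.
std::vector<unsigned int> collatz_on_gpu(unsigned int n) {
    std::vector<unsigned int> h_out(n, 0u);
    const size_t bytes = static_cast<size_t>(n) * sizeof(unsigned int);
    unsigned int* d_out = nullptr;
    cudaMalloc(&d_out, bytes);

    const unsigned int threads = 256;
    const unsigned int blocks = (n + threads - 1) / threads;
    collatz_kernel<<<blocks, threads>>>(d_out, n);

    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}

int main() {
    const unsigned int n = 1u << 16;
    std::vector<unsigned int> gpu = collatz_on_gpu(n);

    // Compare every GPU result against the same function run on the CPU.
    unsigned int mismatches = 0;
    for (unsigned int i = 0; i < n; ++i)
        if (gpu[i] != collatz_steps(i + 1ull)) ++mismatches;

    printf("checked %u start values, %u mismatches vs CPU reference\n", n, mismatches);
    return mismatches == 0 ? 0 : 1;
}
```

The data-dependent loop makes threads in a warp diverge, so this is nowhere near optimal. The point is only that the kernel is a few lines long and needs no library support.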