I must be in a niche where we're consistently crushing cuBLAS, cuSOLVER, and cuDNN (sometimes CUTLASS) with internship-level competency, mostly because our problem sizes are not in the cone of optimization of the Olympians at NVIDIA: large batches of small matrices, specific matrix forms, long kernel pipelines...
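To make the "large batches of small matrices" point concrete, here's a toy sketch (names and sizes are mine, not from any NVIDIA library) of the kind of kernel that's trivial to write by hand and sits outside the big libraries' sweet spot: one thread per tiny matrix, with the whole operand held in registers.

```cuda
// Hypothetical sketch: one thread per 3x3 matrix-vector product over a
// large batch. The entire 3x3 matrix stays in registers -- a shape that
// general-purpose GEMM kernels tuned for large matrices handle poorly.
__global__ void batched_matvec3(float* __restrict__ y,
                                const float* __restrict__ A,
                                const float* __restrict__ x,
                                int batch) {
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= batch) return;
    const float* Ab = A + 9 * b;  // row-major 3x3 for batch element b
    const float* xb = x + 3 * b;
    float*       yb = y + 3 * b;
    for (int r = 0; r < 3; ++r)
        yb[r] = Ab[3*r+0]*xb[0] + Ab[3*r+1]*xb[1] + Ab[3*r+2]*xb[2];
}
```

At these sizes the kernel is purely bandwidth-bound, so there's almost nothing a library can do better than this naive loop.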
Also, until all of these libraries support kernel fusion, or at least prologue/epilogue customization, they can be beaten on memory bandwidth by fairly unoptimized fused kernels that produce no intermediate global-memory traffic.
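A minimal sketch of what fusion buys you (kernel and parameter names are illustrative, not from any library): unfused, a scale followed by a bias-add is two launches and two full round-trips through global memory; fused, each element is read and written exactly once.

```cuda
// Hypothetical sketch: a "prologue" (scale) and "epilogue" (bias-add)
// fused into one kernel. The intermediate value never touches global
// memory -- it lives in a register between the two steps.
__global__ void scale_bias_fused(float* __restrict__ out,
                                 const float* __restrict__ in,
                                 float alpha, float beta, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[i] * alpha;  // prologue, stays in registers
        out[i]  = v + beta;       // epilogue, one read + one write total
    }
}
```

For bandwidth-bound pipelines, halving the global-memory traffic roughly halves the runtime, regardless of how polished the individual unfused kernels are.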
I'm very glad cuFFT and cuBLAS are getting 'device' (Dx) versions, and NVIDIA is getting wiser on the kernel-fusion track. They're amazingly fast and game-changing, but they still don't cover a big chunk of the original libraries' functionality.
Also, a lot of problems that are amenable to GPU compute aren't expressible in BLAS/DNN terms, yet can be expressed very simply as CUDA code and still extract huge performance gains over CPUs, with no chance that the Olympians will ever take an interest in your problem space.
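A classic example of this category (my own illustration, under the assumption of a byte-valued input): a histogram. There is no BLAS or DNN call for it, the CUDA version is a handful of lines, and on large inputs it typically dwarfs a CPU loop.

```cuda
// Hypothetical sketch: 256-bin histogram via a grid-stride loop.
// Not expressible as a blas/dnn routine, yet trivially written in CUDA.
// `bins` must be zero-initialized (e.g. cudaMemset) before launch.
__global__ void hist256(unsigned int* __restrict__ bins,
                        const unsigned char* __restrict__ data,
                        size_t n) {
    size_t i      = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n; i += stride)
        atomicAdd(&bins[data[i]], 1u);  // global atomics; shared-memory
                                        // privatization would go further
}
```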