Some things are not heavily optimized by NVIDIA, and that's fine; it's a good thing that they focus their effort on what's useful to the broader community.
What I'm saying is that, quite often, a naive hand-written kernel, tuned by a non-expert over a few months, can beat library code that simply isn't optimized for niche use cases. Which is a testament to how easy it is to get good or OK (not optimal) performance.
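To be concrete about what "naive" means here, a sketch of the kind of intern-level kernel I'm describing: a fused bias-add + ReLU over a flat tensor, one thread per element, no shared memory or vectorized loads. The op choice and all names are illustrative, not from any library:

```cuda
// Naive fused bias-add + ReLU: one thread per element, nothing clever.
// Assumes row-major layout where the channel index is (i % channels);
// this is a hypothetical example, not a real library kernel.
__global__ void fused_bias_relu(const float* in, const float* bias,
                                float* out, int n, int channels) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[i] + bias[i % channels];
        out[i] = v > 0.0f ? v : 0.0f;  // ReLU
    }
}

// Launch with a common default block size of 256 threads:
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;
//   fused_bias_relu<<<blocks, threads>>>(d_in, d_bias, d_out, n, channels);
```

A kernel like this can still win over two separate library calls simply because it fuses the ops and avoids one round trip through global memory, even with zero tuning.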
I don't know about PyTorch (I was talking about niche use cases), but TensorRT allows custom kernels, and it's worth using them: plug in a house-implemented kernel if you know where your bottleneck is and no one has bothered writing a less generic version yet. Again, intern-level competency (not senior CUDA optimizer).