Some things are not heavily optimized by NVIDIA, and that's fine; it's a good thing that they focus their effort on what's useful to the broader community.
What I'm saying is that, quite often, a naive hand-written kernel, tuned by a non-expert over a few months, can beat library code that simply isn't optimized for niche use cases. Which is a testament to how easy it is to get good or OK (not optimal) performance.
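To be concrete about what "naive" means here, a sketch of the kind of intern-level kernel I'm describing: a fused bias-add + ReLU over a flat tensor, one thread per element, no shared memory or vectorized loads. The op choice and all names are illustrative, not from any library:

```cuda
// Naive fused bias-add + ReLU: one thread per element, nothing clever.
// Assumes row-major layout where the channel index is (i % channels);
// this is a hypothetical example, not a real library kernel.
__global__ void fused_bias_relu(const float* in, const float* bias,
                                float* out, int n, int channels) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[i] + bias[i % channels];
        out[i] = v > 0.0f ? v : 0.0f;  // ReLU
    }
}

// Launch with a common default block size of 256 threads:
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;
//   fused_bias_relu<<<blocks, threads>>>(d_in, d_bias, d_out, n, channels);
```

A kernel like this can still win over two separate library calls simply because it fuses the ops and avoids one round trip through global memory, even with zero tuning.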
I don't know about PyTorch (I was talking about niche use cases), but TensorRT allows custom kernels, and it's worth using them: plug in a house-implemented kernel if you know where your bottleneck is and no one has bothered writing a less generic version yet. Again, intern-level competency (not senior CUDA optimizer).