Lol that's absolutely not true. What you're describing is literally impossible for any company that has more than one product family on the market since each product has different scratch sizes, number of vector registers, data types supported/emulated etc.
Outside of trade show demos, kernels are codegened. What is true is there are recurring "themes/patterns" that are handled by engineers for a class of products. Lately this is flash attention...
> Again, it is trivial to write a correct opencl matrix multiply, but that's never going to be the highest performance.
I guess you work at AMD. The reason AMD ships a whole bunch of binary kernels is not because someone tuned/designed each one but because AMD doesn't have a PTX/SASS equivalent. So each kernel has to be compiled at build time for each device (it's also why they can't have LTS support for architectures).