I _really_ want an alternative, but the architecture churn imposed by targeting ROCm for, say, an MI350X is brutal. The way their wavefronts work (64-wide on CDNA vs. NVIDIA's 32-wide warps) is different enough that if you're trying to get last-mile perf (which for GPUs unfortunately still yawns back into the 2-5x stretch), you're eating a lot of pain to get the same cost-efficiency out of AMD hardware.
FPGAs aren't really any more cost-effective unless $/kWh goes into the stratosphere, which is a hypothetical I don't care to contemplate.
PyTorch, JAX, and TensorFlow are all examples, to me, of very capable products that compete very well in the ML space.
But more broadly, work like XLA and IREE is very interesting: toolkits for mapping a huge variety of computation onto many types of hardware. PyTorch et al. are fine example applications, things you can build; XLA is the Big Tent idea, the toolkit that erodes not just specific CUDA use cases but allows hardware in general to be more broadly useful.
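To make that concrete, here's a minimal sketch (assuming `jax` is installed) of what "mapping computation onto hardware" looks like in practice: you write the math once, `jax.jit` traces it, and XLA lowers it to whatever backend is present (CPU, GPU, TPU) with no CUDA-specific code in sight. The tanh-approximation GELU below is just an arbitrary example function; the constants are the standard √(2/π) ≈ 0.7978845608 and 0.044715.

```python
import jax
import jax.numpy as jnp

@jax.jit  # traced once, then compiled by XLA for the available backend
def gelu(x):
    # tanh-approximation GELU, written as plain array math
    return 0.5 * x * (1.0 + jnp.tanh(0.7978845608 * (x + 0.044715 * x**3)))

x = jnp.linspace(-3.0, 3.0, 8)
print(gelu(x))

# The portable intermediate representation XLA consumes is inspectable,
# which is the point: the hardware-specific codegen lives below this line.
print(jax.jit(gelu).lower(x).as_text()[:300])
```

The same source runs unmodified on an NVIDIA GPU, an AMD GPU (via the ROCm JAX build), or a TPU, which is exactly the CUDA-eroding property the paragraph above is gesturing at.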