I do research in the ML hw field: there are currently a couple hundred designs for running convolutional NN inference, and a couple dozen have actually been built. They use pretty different underlying technologies (CMOS, floating gate, ReRAM/memristors, etc), different ideas (systolic arrays, analog crossbars, cache organization, lookup tables, data reuse, TDM, using spikes, etc), and have wildly different power (from microwatts to hundreds of watts), size, speed, precision, flexibility, cost, ease of use/integration, etc. And this is just convnet inference. Lots more is needed to do training in hw, again with multiple choices on how to do it.
So which one design do you suggest we all use for all our ML needs?
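The rhetorical point above can be made concrete with a tiny Pareto-frontier sketch. The design names and numbers below are entirely made up for illustration; the point is just that when designs trade off power, latency, and cost, several of them survive a dominance filter, so there is no single "best" accelerator to standardize on:

```python
def pareto_front(designs, keys):
    """Return the designs not dominated on every metric in `keys`.

    All metrics are oriented so that lower is better. A design is
    dominated if some other design is at least as good on every
    metric and strictly better on at least one.
    """
    front = []
    for d in designs:
        dominated = any(
            all(o[k] <= d[k] for k in keys) and any(o[k] < d[k] for k in keys)
            for o in designs if o is not d
        )
        if not dominated:
            front.append(d)
    return front

# Hypothetical design points (watts, ms per inference, unit cost in $).
designs = [
    {"name": "microwatt-analog", "power_w": 1e-4,  "latency_ms": 50.0, "cost": 5},
    {"name": "edge-cmos",        "power_w": 2.0,   "latency_ms": 5.0,  "cost": 20},
    {"name": "datacenter-asic",  "power_w": 300.0, "latency_ms": 0.1,  "cost": 5000},
    {"name": "strictly-worse",   "power_w": 400.0, "latency_ms": 1.0,  "cost": 6000},
]

front = pareto_front(designs, ["power_w", "latency_ms", "cost"])
print(sorted(d["name"] for d in front))
# -> ['datacenter-asic', 'edge-cmos', 'microwatt-analog']
```

Only the strictly dominated design drops out; the other three each win on some axis, which is exactly why "which one design?" has no single answer.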