For one TensorFlow is not a generic framework like CUDA is, so you lose a whole bunch of the configurability you have with CUDA
Why make generalizations like this? It's not true, and we've devolved back into the "nu uh" we originally started with.
This is trivial to do on a GPU, and is built into the library
Yes, I'm sure there are hardwired operations that are trivial to do on GPUs. That's not exactly a +1 in favor of generic programmability. There are also operations that are trivial to do on TPUs, such as CrossReplicaSum across a massive cluster of cores, or the various special-case Adam operations. This doesn't seem related to the claim that TPUs are less flexible.
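For concreteness, CrossReplicaSum is just an all-reduce sum: every replica contributes a value and every replica receives the total. A plain-Python sketch of those semantics (the function name is illustrative, not a real TPU API, and no special hardware is involved):

```python
# Simulate the semantics of a cross-replica sum (all-reduce) in plain Python.
# Hypothetical helper for illustration only -- on a TPU pod this reduction
# happens in hardware across the whole cluster of cores.

def cross_replica_sum(per_replica_values):
    """Each replica contributes a vector; every replica receives the
    elementwise sum over all replicas."""
    total = [sum(col) for col in zip(*per_replica_values)]
    # Every replica ends up holding the same reduced result.
    return [list(total) for _ in per_replica_values]

# Example: 4 simulated replicas, each holding a 2-element gradient shard.
replicas = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
reduced = cross_replica_sum(replicas)
# every replica now holds [16.0, 20.0]
```

In JAX the equivalent collective is `jax.lax.psum` inside a `pmap`; the point is only that "has a fast built-in collective" cuts both ways and says nothing about flexibility.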
The raw functions it provides are not direct access to the hardware and memory subsystem.
Not true. The tf.raw_ops module exposes the low-level kernels directly, including in-place ops: https://www.tensorflow.org/api_docs/python/tf/raw_ops/Inplac...
Jax is also going to be giving even lower-level access than TF, which may interest you.
You did not give an example of something GPUs can't do. All you said was that TPUs are faster for a specific function in your case.
Well yeah, I care about achieving goals in my specific case, as you do yours. And simply getting together a VM that can feed 500 examples/sec to a set of GPUs is a massive undertaking in and of itself. TPUs make it more or less "easy" in comparison. (I won't say effortless, since it does take some effort to get yourself into the TPU programming mindset.)
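To make the 500 examples/sec point concrete, here is a quick way to sanity-check whether an input pipeline can sustain a target rate, sketched in plain Python (the helper name and the target number are illustrative, not part of any framework):

```python
import time

def measure_throughput(example_iter, n_examples):
    """Pull n_examples from an iterator and return examples/sec.
    Illustrative helper, not a real TF/TPU API."""
    start = time.perf_counter()
    for _ in range(n_examples):
        next(example_iter)
    elapsed = time.perf_counter() - start
    return n_examples / max(elapsed, 1e-9)  # guard against a zero timer delta

# Example: a trivially fast in-memory "pipeline" stands in for real data
# loading; swap in your actual example iterator to measure the real rate.
rate = measure_throughput(iter(range(10_000)), 5_000)
# Compare `rate` against your target (e.g. 500 examples/sec per accelerator).
```

With real data loading (decoding, augmentation, host-to-device transfer), keeping this number high is exactly the undertaking described above, and it is the part the TPU tooling largely handles for you.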