The central reason TPUs feel less flexible is Google's awful mistake of encouraging everyone to use TPUEstimator as the One True API For Doing TPU Programming. Getting off that API was the single biggest boost to my TPU skills.
You can see an example of how to do that here: https://github.com/shawwn/ml-notes/blob/master/train_runner.... The repo can train GPT-2 1.5B at 10 examples/sec on a TPUv3-8 (i.e., around 10k tokens/sec).
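For what it's worth, the examples/sec to tokens/sec conversion works out if you assume each training example is a full GPT-2 context window of 1024 tokens (the sequence length isn't stated above, so that's an assumption):

```python
# Throughput conversion: examples/sec -> tokens/sec.
# ASSUMPTION: each example is one full 1024-token GPT-2 context
# window; the comment above doesn't state the sequence length.
SEQ_LEN = 1024          # GPT-2 context window, in tokens
EXAMPLES_PER_SEC = 10   # reported TPUv3-8 throughput

tokens_per_sec = EXAMPLES_PER_SEC * SEQ_LEN
print(tokens_per_sec)   # 10240, i.e. "around 10k tokens/sec"
```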
Happy to answer any specific questions or peek at codebases you're hoping to run on TPUs.