You have to stash more information from the forward pass in order to calculate the gradients during backprop. You can't just naively drop an inference accelerator into a training pipeline: inference-only execution gets to discard each intermediate activation as soon as the next layer has consumed it.
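To make that concrete, here's a toy two-layer net in plain numpy (not any particular framework's API): the training-mode forward pass stashes the inputs to each matmul because backprop needs them for the weight gradients, which is exactly the memory an inference-only pass could free immediately.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 8))
W2 = rng.standard_normal((8, 2))

def forward_for_training(x):
    h = np.maximum(x @ W1, 0.0)   # ReLU hidden layer
    y = h @ W2
    stash = (x, h)                # must stay alive until backward
    return y, stash

def backward(grad_y, stash):
    x, h = stash
    dW2 = h.T @ grad_y            # needs the stashed activation h
    dh = grad_y @ W2.T
    dh[h <= 0.0] = 0.0            # ReLU gradient mask, needs h again
    dW1 = x.T @ dh                # needs the stashed input x
    return dW1, dW2

x = rng.standard_normal((16, 4))
y, stash = forward_for_training(x)
dW1, dW2 = backward(np.ones_like(y), stash)
print(dW1.shape, dW2.shape)       # gradients match the weight shapes
```

An inference pass would compute `y` the same way but never build `stash`; that's the structural difference, independent of precision.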
(Also, many inference accelerators run at lower numerical precision than training typically requires.)
There are tricks for using inference to accelerate training, such as one we developed that focuses backprop on the examples the model is likely handling poorly: https://arxiv.org/abs/1910.00762
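A rough sketch of that flavor of idea (the paper's actual selection policy is more involved, and names like `keep_frac` and `per_example_loss` are made up here): use a cheap inference-mode forward pass to score each example by loss, then spend the expensive activation-stashing forward+backward only on the hardest fraction of the batch.

```python
import numpy as np

rng = np.random.default_rng(0)

def per_example_loss(batch):
    # Stand-in for an inference-mode forward pass: no activations
    # are stashed, we only need one loss value per example.
    return rng.random(len(batch))

def select_hardest(batch, keep_frac=0.25):
    losses = per_example_loss(batch)
    k = max(1, int(len(batch) * keep_frac))
    idx = np.argsort(losses)[-k:]   # indices of the k highest losses
    return batch[idx]

batch = np.arange(64)
hard = select_hardest(batch)
print(len(hard))                    # only these examples go through backprop
```

The payoff is that the cheap scoring pass can run on inference-style hardware, while the full training step only ever sees the selected subset.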