As a concrete example, on a camera you might want to run a facial detector so the camera can automatically adjust its focus when it sees a human face. Or you might want a person detector that can detect the outline of the person in the shot, so that you can blur/change their background in something like a Zoom call. All of these applications are going to work better if you can run your model at, say, 60 Hz instead of 20 Hz. Optimizing hardware to do inference tasks like this as fast as possible with the least possible power usage is pretty different from optimizing for all the things a GPU needs to do, so you might end up with hardware that has both and uses them for different tasks.
When I learned and used gradient descent, you had to analytically determine your own gradients (https://web.archive.org/web/20161028022707/https://genomics....). I went to grad school to learn how to determine my own gradients. Unfortunately, in my realm, loss landscapes have multiple minima, and gradient descent just gets trapped in local minima.
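To make the local-minima point concrete, here's a minimal sketch (the function and step size are illustrative assumptions, not from the linked material): gradient descent with a hand-derived analytic gradient on a function with two minima, where the starting point decides which minimum you end up in.

```python
# Sketch: gradient descent with an analytic gradient on a
# function with two minima. Which minimum you reach depends
# entirely on where you start.

def f(x):
    # f(x) = x^4 - 3x^2 + x has two local minima
    return x**4 - 3*x**2 + x

def grad_f(x):
    # hand-derived (analytic) gradient: f'(x) = 4x^3 - 6x + 1
    return 4*x**3 - 6*x + 1

def gradient_descent(x0, lr=0.01, steps=1000):
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

# Different starting points converge to different minima:
left = gradient_descent(-2.0)   # reaches the deeper (global) minimum
right = gradient_descent(2.0)   # trapped in the shallower local minimum
```

Starting on the right, the method never sees the better minimum on the left; that's the trap, and it's why people reach for restarts, momentum, or global methods when the loss landscape is multimodal.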
If you don't mind skipping the programming parts, it's got a lot of beginner/intermediate concepts clearly explained. If you do dive into the programming examples, you get to play around with a few architectures and ideas, and you're left ready to step into the more advanced material knowing what you're doing.
I'm not sure when or why this started.
The model is literally "inferring" something about its inputs: e.g., these pixels denote a hot dog, those don't.
Training is learning the weights (millions or billions of parameters) that control the model's behavior; inference is "running" the trained model on user data.
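The training/inference split can be sketched on the smallest possible model, a line fit (the toy data and variable names here are my own illustrative assumptions, not anything from the comment):

```python
# Sketch: training vs. inference on a tiny linear model.
# Toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

# --- Training: learn the weights (here just w and b) ---
w, b = 0.0, 0.0
lr = 0.05
for _ in range(2000):
    # gradients of mean squared error w.r.t. w and b
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad_w
    b -= lr * grad_b

# --- Inference: run the frozen weights on new data ---
def predict(x):
    return w * x + b
```

The expensive loop is training; inference is just the cheap forward pass `predict`, which is why the two get optimized (and priced) so differently at scale.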