Even worse, even if you DO get an improvement you are likely to find that it was a waste of time in a month or two when the next upgraded versions of the underlying models are released.
The places it makes sense, from what I can tell, are mainly when you are running so many prompts that the cost savings from running a smaller, cheaper model outweigh the labor and infrastructure costs involved in getting it to work. If your token spend isn't in the tens (probably hundreds) of thousands of dollars you're unlikely to save money this way.
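The break-even point here is simple arithmetic. A minimal sketch, with entirely hypothetical prices and overhead figures (none of these numbers come from a real vendor):

```python
# Back-of-envelope: when does fine-tuning a smaller, cheaper model pay off?
# All prices and costs below are hypothetical placeholders.

def breakeven_monthly_tokens(big_price_per_mtok, small_price_per_mtok,
                             monthly_overhead):
    """Monthly token volume above which the cheaper model saves money,
    given a fixed monthly labor + infrastructure cost for the fine-tune."""
    saving_per_mtok = big_price_per_mtok - small_price_per_mtok
    # tokens = overhead / saving-per-token
    return monthly_overhead / saving_per_mtok * 1_000_000

# e.g. a $10/Mtok frontier model vs a $0.50/Mtok fine-tuned small model,
# with $20k/month of engineering and serving overhead:
tokens = breakeven_monthly_tokens(10.0, 0.50, 20_000)
# ~2.1 billion tokens per month before the fine-tune breaks even
```

Under those made-up numbers you need to be pushing billions of tokens a month before the smaller model pays for itself, which is why this mainly applies at very large scale.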
If it's not about cost savings, the other reasons are latency and being able to achieve something the root model just couldn't do.
Datadog reported a latency improvement because fine-tuning let them run a much smaller (and hence faster) model. That's a credible reason if you are building high-value features that a human being is waiting on, like live-typing features.
The most likely cases I've heard of for getting the model to do something it just couldn't do before mainly involve vision LLMs, which makes sense to me: training a model to classify images that weren't in the training set might make more sense than stuffing more example images into the prompt (though models like Gemini will accept dozens if not hundreds of comparable images in the prompt, which can then benefit from prompt caching).
The last category is actually teaching the model a new skill. The best examples here are low-resource programming languages: Jane Street and OCaml, or Morgan Stanley and Q, for example.
Jane Street OCaml: https://www.youtube.com/watch?v=0ML7ZLMdcl4
Morgan Stanley Q: https://huggingface.co/morganstanley/qqWen-1.5B-SFT