Otherwise, you should just use GPT-5.
Preparing a few thousand training examples and pressing fine-tune can improve the base LLM in a few situations, but it can also make the LLM worse at other tasks in hard-to-understand ways that only show up in production, because you didn’t build evals good enough to catch them. It also has all of the failure modes of deep learning. There is a reason why deep learning training never took off like LLMs did, despite many attempts at building startups around it.
Andrej Karpathy has a rant about it that captures some of the failure modes of fine-tuning: https://karpathy.github.io/2019/04/25/recipe/
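
To make the eval point a bit more concrete, here is a rough sketch of the kind of regression check I mean: score the base and fine-tuned models on the task you tuned for and on a few unrelated tasks, then flag anything that got worse. Everything here is made up for illustration. `find_regressions`, the exact-match scoring, the `tolerance` value and the stub models are all assumptions, and real evals would need far more examples and better scoring than string equality.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]      # (prompt, expected answer)
Model = Callable[[str], str]   # anything that maps a prompt to a completion


def accuracy(model: Model, examples: List[Example]) -> float:
    """Exact-match accuracy of a model on one eval suite."""
    correct = sum(1 for prompt, expected in examples
                  if model(prompt).strip() == expected)
    return correct / len(examples)


def find_regressions(base: Model, tuned: Model,
                     suites: Dict[str, List[Example]],
                     tolerance: float = 0.02) -> Dict[str, Tuple[float, float]]:
    """Return {task: (base_acc, tuned_acc)} for tasks where the tuned model
    is worse than the base model by more than `tolerance`."""
    regressions = {}
    for task, examples in suites.items():
        base_acc = accuracy(base, examples)
        tuned_acc = accuracy(tuned, examples)
        if tuned_acc < base_acc - tolerance:
            regressions[task] = (base_acc, tuned_acc)
    return regressions


# Stub "models" standing in for real API calls (hypothetical data).
BASE_ANSWERS = {"label: refund request": "other",
                "2+2": "4",
                "capital of France": "Paris"}
TUNED_ANSWERS = {"label: refund request": "billing",  # target task improved
                 "2+2": "5",                          # unrelated task broke
                 "capital of France": "Paris"}


def base_model(prompt: str) -> str:
    return BASE_ANSWERS.get(prompt, "")


def tuned_model(prompt: str) -> str:
    return TUNED_ANSWERS.get(prompt, "")


suites = {
    "target_task": [("label: refund request", "billing")],
    "arithmetic": [("2+2", "4")],
    "geography": [("capital of France", "Paris")],
}

regressions = find_regressions(base_model, tuned_model, suites)
print(regressions)
```

With these stubs the target task improves while the arithmetic suite gets flagged, which is the kind of silent regression that otherwise only shows up in production.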