[1] Reinforcement learning from human feedback; basically participants got two model responses and had to judge them on multiple criteria relative to the prompt
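To make that concrete, here's a hypothetical sketch of what one of those comparison records might look like; the field names and criteria are illustrative, not the provider's actual schema:

    # Hypothetical shape of a single pairwise RLHF judgment.
    # Field names and criteria are made up for illustration.
    judgment = {
        "prompt": "Explain what a mutex is to a junior developer.",
        "response_a": "...",  # first model output shown to the rater
        "response_b": "...",  # second model output shown to the rater
        "ratings": {
            "accuracy":    "A",  # which response is more factually correct
            "helpfulness": "A",  # which better addresses the prompt
            "style":       "B",  # which is better written
        },
        "overall_preference": "A",
    }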
I suspect in part because the provider also didn't want to create an easy cop-out for the people working on the fine-tuning part (a lot of my work was auditing and reviewing output, and there was indeed a lot of really sloppy work, up to and including copy-pasting output from other LLMs - we know because, on more than one occasion, I caught people who had managed to include part of Claude's website footer in their answer...)
I upgraded to a new model (gpt-4o-mini to grok-4.1-fast), and suddenly all my workflows were broken. I was like "this new model is shit!" - then I looked into my prompts and realized the new model was actually better at following instructions, and my instructions were wrong or contradictory.
After I fixed my prompts it did exactly what I asked for.
Maybe models should have another tunable parameter for how strictly they should follow the user prompt. This reminds me of imagegen models, where you can choose the CFG/guidance scale or diffusion strength.
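For anyone who hasn't played with those imagegen knobs, here's a minimal sketch of how a guidance scale trades off prompt adherence in classifier-free guidance. The names are illustrative; the real thing runs inside the denoising loop of a diffusion pipeline:

    import torch

    def apply_guidance(noise_uncond: torch.Tensor,
                       noise_cond: torch.Tensor,
                       guidance_scale: float) -> torch.Tensor:
        # Classifier-free guidance: the model predicts noise once without
        # the prompt and once with it; the guidance scale pushes the
        # combined prediction toward the prompt-conditioned one.
        # scale = 1.0 -> just the conditional prediction;
        # higher values -> stricter prompt adherence.
        return noise_uncond + guidance_scale * (noise_cond - noise_uncond)

A hypothetical text-model equivalent would be a dial for how literally the model treats your instructions versus how much it fills in with its own defaults.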