In that sense it's not that surprising that, on a pure text extraction task with little "thinking" required, a 7B model does well and outperforms other models after fine-tuning. For the "noshotsfired" label, GPT-4 is even accused of overthinking it.
It is interesting that fine-tuned mistral-7b and llama3-7b outperform fine-tuned gpt3.5-turbo. I would tend to attribute that to those models being newer and "more advanced" despite their low parameter count, but maybe that's reading too much into a small score difference.