The largest models like GPT-4 have the interesting property of really,
really finishing what you started. If your prompt contains flaws of any kind, they will faithfully continue producing them. The inverse is true as well.
This is a documented effect[1], and it's something larger models are actually worse (better?) at: they score higher and higher on the loss function (did I predict the next token correctly?), while their utility (does the output actually work?) goes down.
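A toy way to see the loss/utility split (my own sketch, not from the paper): a model that perfectly predicts a flawed training text achieves zero loss, yet greedy decoding from it reproduces the flaw verbatim.

```python
from collections import defaultdict
import math

# Training text contains a bug: the function subtracts instead of adding.
corpus = "def add(a, b): return a - b".split()

# "Train" a bigram language model by counting next-word frequencies.
counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict(prev):
    # Greedy decoding: pick the most frequent continuation.
    return max(counts[prev], key=counts[prev].get)

def avg_loss():
    # Average cross-entropy of the model on its own training text.
    total = 0.0
    for prev, nxt in zip(corpus, corpus[1:]):
        p = counts[prev][nxt] / sum(counts[prev].values())
        total += -math.log(p)
    return total / (len(corpus) - 1)

# Prediction is "perfect"...
print(avg_loss())  # 0.0 — every next word is predicted with probability 1
# ...but the model faithfully completes the buggy prefix.
print(predict("return"), predict("a"))  # a -
```

Scaled-up models do the same thing with far more capacity: better prediction of the prompt's distribution, flaws included.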
Just thought it was noteworthy.
[1] https://arxiv.org/abs/2102.03896