“This indicates that while CoT can improve performance on difficult questions, it can also introduce variability that causes errors on ‘easy’ questions the model would otherwise answer correctly.”
Another response to the strawberry example: there are 25,000 people employed globally who repair broken responses and create training data, a big whack-a-mole effort to remediate embarrassing errors.