Source: we benchmark this sort of stuff at my company, and for the past year or so frontier models with a modest reasoning budget have typically succeeded at arithmetic problems (except for multiplication/division problems with many decimal places, which this isn't).