That Apple paper mainly demonstrated that "reasoning" LLMs - with no access to additional tools - can't solve problems that deliberately exceed their token context length.
I don't think it has much relevance at all to a conversation about how good LLMs are at solving programming problems by running tools in a loop.
I keep seeing this idea that LLMs can't handle problems that aren't in their training data, and it's frustrating because anyone who has spent significant time working with these systems knows that it obviously isn't true.