They did for some problems. If you gave me five iterations on a problem like this in brainfuck:
> "Read a string S and produce its run-length encoding: for each maximal block of identical characters, output the character followed immediately by the length of the block as a decimal integer. Concatenate all blocks and output the resulting string.
I'd do absolutely awfully at it.
And to be clear, that's not "five runs from scratch, repeatedly trying it"; it's five iterations, so at most five attempts at writing the solution and seeing the results.
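For reference, the algorithm itself is trivial in a high-level language; the difficulty is entirely the brainfuck target. Here's a minimal Python sketch of the spec above (naming is mine, not from any particular solution):

```python
from itertools import groupby

def rle(s: str) -> str:
    # For each maximal block of identical characters, emit the
    # character followed by the block length as a decimal integer.
    return "".join(ch + str(len(list(grp))) for ch, grp in groupby(s))

print(rle("aaabbbbc"))  # -> "a3b4c1"
```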
I'd also note that when they can iterate, with feedback from the output, they get it right much more often than with "n zero-shot attempts". That doesn't seem consistent with a lack of reasoning to me.
Give them new frameworks or libraries and they can absolutely build things in them with some instructions or docs. So they're not just outputting previously seen things verbatim; whatever is happening operates at the level of patterns, not literal words.
edit -
I play Clues by Sam, a logical-reasoning puzzle. The solutions are unlikely to be available online, and per this benchmark the training cutoff seems to predate the puzzle's launch entirely:
https://www.nicksypteras.com/blog/cbs-benchmark.html
Frankly, just watching them debug something makes it hard for me to say there's no reasoning happening at all.