Agreed! While these results are very promising, there's still a lot to explore in this space.
In addition to the "prompt consistency" and "thought-control" ideas mentioned in the post, I'm curious how well this performs on more complex structured data (things like codegen).