Something interesting I noticed in the responses: for shorter puzzles, the model would make deductions, building up a set of additional "clues" for itself before answering the question. For harder puzzles with more clues, however, it would often just repeat all the given clues and then try to answer the question directly.
Maybe some form of curriculum learning would help, starting with easier puzzles and progressing to more challenging ones.
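As a rough illustration of what that could look like, here is a minimal curriculum sketch. It assumes puzzles are dicts with a `"clues"` list and uses clue count as a crude difficulty proxy; the stage count and batching are placeholder choices, not anything tuned:

```python
import random

def make_curriculum(puzzles, num_stages=3):
    """Split puzzles into difficulty stages, using clue count as a rough proxy."""
    ordered = sorted(puzzles, key=lambda p: len(p["clues"]))
    stage_size = len(ordered) // num_stages
    return [ordered[i * stage_size:(i + 1) * stage_size] for i in range(num_stages)]

def curriculum_batches(stages, steps_per_stage, batch_size):
    """Yield training batches, moving from easier to harder stages."""
    for stage in stages:
        for _ in range(steps_per_stage):
            yield random.sample(stage, min(batch_size, len(stage)))
```

A fancier version might mix in a fraction of harder puzzles early on, or advance stages based on reward rather than a fixed step count.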
Other ideas to explore include:
- Distilling responses from stronger models
- Encouraging exploration with entropy regularization or reward shaping (see the sketch after this list)
- Training from base models instead of instruct models, like DeepSeek-R1-Zero
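To make the entropy-regularization idea concrete, here is a minimal sketch of adding an entropy bonus to a policy-gradient loss. It assumes a token-level REINFORCE-style setup with per-token advantages; the coefficient `beta` is an arbitrary placeholder that would need tuning, and this is not the exact objective used in the experiments above:

```python
import torch

def pg_loss_with_entropy_bonus(logits, actions, advantages, beta=0.01):
    """Policy-gradient loss with an entropy bonus to encourage exploration.

    logits:     (batch, vocab) raw model outputs at each sampled position
    actions:    (batch,) sampled token ids
    advantages: (batch,) advantage estimates
    beta:       entropy coefficient (assumed value; needs tuning)
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    action_log_probs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    # Standard REINFORCE-style term: push up log-probs of high-advantage actions.
    pg = -(advantages * action_log_probs).mean()
    # Policy entropy; subtracting beta * entropy from the loss rewards
    # flatter distributions, keeping the model from collapsing too early.
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    return pg - beta * entropy
```

The same idea carries over to clipped objectives like PPO or GRPO by adding the `- beta * entropy` term to whatever surrogate loss is already in use.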