What you need is good image understanding (at least GPT-5 tier), training on general-purpose reasoning over images, and then some domain-specific training, or at least some few-shot guidance to get the model to adopt the correct reasoning patterns.
If I had to guess which model would do it best out of the box, with only few-shot prompting, I'd say Gemini 3 Pro.
Nothing prevents an autoregressive LLM from revisiting the images and rewriting the texts as new clues come in. This is how such models can solve puzzles like sudoku.
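To make that concrete, here's a toy sketch in Python. It is only an analogy, not a claim about how any LLM works internally: a 4x4 sudoku solver whose transcript is append-only, like autoregressive output. It never edits an earlier line; it keeps revisiting cells and appends a revised candidate list whenever the neighbors supply a new clue. The names `peers` and `solve_append_only`, and the example puzzle, are made up for illustration, and the solver only handles puzzles that simple elimination can finish.

```python
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, int]


def peers(cell: Cell, n: int = 4, box: int = 2) -> Set[Cell]:
    """Cells sharing a row, column, or box with `cell` (hypothetical helper)."""
    r, c = cell
    same_row = {(r, j) for j in range(n)}
    same_col = {(i, c) for i in range(n)}
    br, bc = r - r % box, c - c % box
    same_box = {(br + i, bc + j) for i in range(box) for j in range(box)}
    return (same_row | same_col | same_box) - {cell}


def solve_append_only(grid: List[List[int]]) -> Tuple[List[List[int]], List[str]]:
    """Solve a 4x4 sudoku by elimination, logging every revision.

    The transcript is only ever appended to. The "current answer" for a
    cell is whatever the most recent revision says, which is the sense in
    which an append-only process can still rewrite earlier conclusions.
    """
    n = len(grid)
    transcript: List[str] = []
    # Initial belief: given digits are fixed, blanks (0) could be anything.
    belief: Dict[Cell, Set[int]] = {
        (r, c): {grid[r][c]} if grid[r][c] else set(range(1, n + 1))
        for r in range(n)
        for c in range(n)
    }

    changed = True
    while changed:  # keep revisiting until a full pass yields no new clue
        changed = False
        for cell in belief:
            if len(belief[cell]) == 1:
                continue
            # Digits already settled in neighboring cells act as new clues.
            fixed = {next(iter(belief[p])) for p in peers(cell, n) if len(belief[p]) == 1}
            revised = belief[cell] - fixed
            if revised != belief[cell]:
                transcript.append(
                    f"revise {cell}: {sorted(belief[cell])} -> {sorted(revised)}"
                )
                belief[cell] = revised
                changed = True

    # Unresolved cells are reported as 0.
    solved = [
        [next(iter(belief[(r, c)])) if len(belief[(r, c)]) == 1 else 0 for c in range(n)]
        for r in range(n)
    ]
    return solved, transcript


if __name__ == "__main__":
    puzzle = [
        [1, 0, 3, 4],
        [3, 4, 0, 2],
        [0, 1, 4, 0],
        [4, 0, 2, 1],
    ]
    solved, transcript = solve_append_only(puzzle)
    for line in transcript:
        print(line)
    for row in solved:
        print(row)
```

Every line in the transcript is a "rewrite" of an earlier belief, yet nothing already written is ever modified. That's roughly the shape of what a model does when it re-reads an image and emits a corrected reading later in its output.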