If you train a model where the input is an integer between 1 and 10, and the output is a specific image from a set of ten, the model will be able to get zero loss on the task. That is what's happening here.
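To make the memorization point concrete, here's a minimal sketch (hypothetical shapes, numpy only): with ten distinct inputs and ten fixed target images, even a plain linear least-squares fit acts as a lookup table and drives training loss to machine-precision zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: ten "images" (flattened 8x8 arrays), one per integer label.
images = rng.random((10, 64))

# One-hot encode the integers 1..10; a linear map from one-hots is just a lookup table.
X = np.eye(10)
W, *_ = np.linalg.lstsq(X, images, rcond=None)

recon = X @ W
mse = np.mean((recon - images) ** 2)
print(f"training MSE: {mse:.2e}")  # effectively zero: pure memorization, not generalization
```

The zero loss tells you nothing about held-out data; that's the whole issue being debated here.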
Are you saying the demonstrated results are all in-sample? Because that's definitely not true for the out-of-sample data, and the GP comment implies there is in fact a validation/holdout set.
It's still a legitimate direction to pursue. Once you get to large enough training sets, it's basically the same way our own brains work. We don't perceive or remember all the details of a building - just "building, style 19B", plus a few extra generic parameters like distance, angle, color and so on. Totally manageable for deep learning to recognize, and perhaps even combine.
We performed visual reconstruction from fMRI signals using LDM in three simple steps as follows (Figure 2, middle). The only training required in our method is to construct linear models that map fMRI signals to each LDM component, and no training or fine-tuning of deep-learning models is needed. We used the default parameters of image-to-image and text-to-image codes provided by the authors of LDM [2], including the parameters used for the DDIM sampler. See Appendix A for details.
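The quoted passage says the only fitting is a linear map from fMRI signals to each LDM component (e.g. the image latent), with the diffusion model itself frozen. A hedged sketch of that step on synthetic data, using closed-form ridge regression (the shapes and regularizer are assumptions, not the paper's values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: trials of fMRI voxel activity -> LDM latent components.
n_trials, n_voxels, n_latent = 200, 500, 64
fmri = rng.standard_normal((n_trials, n_voxels))
true_map = rng.standard_normal((n_voxels, n_latent)) * 0.1
latents = fmri @ true_map + 0.01 * rng.standard_normal((n_trials, n_latent))

# Ridge regression in closed form: (X'X + lam*I) W = X'Y.
# This linear fit is the only "training"; the LDM stays untouched.
lam = 1.0
W = np.linalg.solve(fmri.T @ fmri + lam * np.eye(n_voxels), fmri.T @ latents)

pred = fmri @ W  # predicted latents, which would then be decoded by the frozen LDM
r = np.corrcoef(pred.ravel(), latents.ravel())[0, 1]
print(f"training-set correlation: {r:.3f}")
```

Note this reports fit quality on the training trials; whether the same map predicts held-out trials is exactly the in-sample vs. out-of-sample question raised above.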
I am pretty sure this is just per person. So all it does is categorize one person's complex brain patterns into 10 category numbers and then jump through some hoops to display those numbers as images.