> I immediately found the results suspect, and I think I have found what is actually going on. The dataset it was trained on was 2,770 images, minus 982 of those used for validation. I posit that the system did not actually read any pictures from the brains, but simply overfitted all the training images into the network itself. For example, if you looked at a picture of a teddy bear, you'd get an overfitted picture of another teddy bear from the training dataset instead.
> The best evidence for this is a picture(1) from page 6 of the paper. Look at the second row. The buildings generated by 'mind reading' subjects 2 and 4 look strikingly similar to each other, but not very similar to the ground truth! By manually combing through the training dataset, I found a picture of a building that does look like that, and by scaling it down and cropping it exactly in the middle, it overlays rather closely(2) on the output that was ostensibly generated for an unrelated image.
> If so, at most they found that looking at similar subjects lights up similar regions of the brain, and putting Stable Diffusion on top of it serves no purpose. At worst it's entirely cherry-picked coincidences.
> 1. https://i.imgur.com/ILCD2Mu.png
So we are doing both reconstruction and retrieval.
The reconstruction achieves SOTA results. The retrieval demonstrates that the image embeddings contain fine-grained information, not just "this is a picture of a teddy bear", with the diffusion model then generating some random teddy bear picture.
I think the zebra example really highlights that. The generated image embedding matches the exact zebra image that was seen by the person. If the model could only say "this is a zebra picture", it wouldn't be able to do that. The model is picking up on fine-grained info present in the fMRI signal.
The blog post has more information and the paper itself has even more information so please check it out! :)
Reconstruction is the primary and difficult aim, but it is what you want and expect when people talk about such "mind reading". Classifying stimuli from brain activity has long been solved and is not difficult; it is almost trivial with modern data sizes and quality. With 80 categories and data from higher visual areas you could even use an SVM as the basic classifier, plus some method for getting a similar blob shape from the activity (V1-V3 are map-like), and get good results.
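For concreteness, here is a minimal sketch of that classification baseline, with placeholder random data standing in for real voxel responses (shapes and regularization strength are illustrative assumptions, not anyone's actual pipeline):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 5000))   # trials x voxels (placeholder data)
y = rng.integers(0, 80, size=2000)      # one of 80 stimulus categories

# Standardize voxel responses, then fit a linear SVM over all 80 classes.
clf = make_pipeline(StandardScaler(), LinearSVC(C=0.1, max_iter=5000))
print(cross_val_score(clf, X, y, cv=5))  # chance level is 1/80 = 1.25%
```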
If you ignore the question of whether you are just doing classification, you can easily get too-good-to-be-true results. With these newer methods relying on pretrained features, this classification case can hide deep inside the model too, and can easily be missed.
The community is currently discussing to what extent this applies to these newer papers (start with the original post): https://twitter.com/ykamit/status/1677872648590864385?s=20
One thing they showed is that the 80 categories of that data collapse to just 40 clusters in the semantic space.
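For a rough idea of how such a collapse can be measured, here is an illustrative sketch (not the procedure from that thread; the embedding model, the threshold, and the truncated label list are all assumptions):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

# The 80 category labels would go here; a few shown for brevity.
labels = ["zebra", "giraffe", "teddy bear", "bed", "train", "boat"]

emb = SentenceTransformer("all-MiniLM-L6-v2").encode(labels)
# Merge labels whose semantic embeddings fall under a hand-picked distance.
clusters = AgglomerativeClustering(
    n_clusters=None, distance_threshold=1.0).fit_predict(emb)
print(len(set(clusters)), "distinct semantic clusters")
```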
(Kamitani has been working on the reconstruction question for a long time and knows all these traps quite well.)
The deeprecon dataset proposed as an alternative has been around for a few years and has been used in multiple reconstruction papers. It has many more classes, out-of-distribution "abstract" images, and no class overlap between train and test images, so it's quite suitable for proving that it is actually reconstruction. But it's also one order of magnitude smaller than the NSD data used for the newer reconstruction studies. If you modify the 80-class NSD data to remove the train-test class overlap, the two diffusion methods tested there do not work as well, but they still look like they do some reconstruction.
On deeprecon the two tested diffusion methods fail at reconstructing the abstract OOD images (which NSD does not have), something previous reconstruction methods could do.
As the abstract says, "In particular, MindEye can retrieve the exact original image even among highly similar candidates indicating that its brain embeddings retain fine-grained image-specific information. This allows us to accurately retrieve images even from large-scale databases like LAION-5B. We demonstrate through ablations that MindEye's performance improvements over previous methods result from specialized submodules for retrieval and reconstruction, improved training techniques, and training models with orders of magnitude more parameters."
Note that LAION-5B has five billion images.
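To make the retrieval claim concrete, here is a hedged sketch of the core operation, assuming the brain signal and the candidate images have already been mapped into a shared embedding space. Real retrieval at LAION-5B scale would use an approximate index (e.g. FAISS) rather than this brute-force scan:

```python
import torch
import torch.nn.functional as F

def retrieve_top_k(brain_emb, image_embs, k=5):
    # Cosine similarity between one brain embedding and every candidate.
    b = F.normalize(brain_emb, dim=-1)       # shape: (d,)
    db = F.normalize(image_embs, dim=-1)     # shape: (n_images, d)
    scores = db @ b
    return scores.topk(k).indices  # indices of the k best-matching images
```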
You can think of contrastive learning as two separate models that take different inputs and produce vectors of the same length as outputs. This is achieved by training both models on pairs of training data (in this case fMRI recordings and the observed images).
What the LAION-5B work shows is that they did a good enough job of this training that the models are really good at creating similar vectors for nearly any matching image and fMRI pair.
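A minimal CLIP-style sketch of that setup (the encoder architectures and dimensions here are illustrative assumptions, not the paper's actual models):

```python
import torch
import torch.nn.functional as F

# Placeholder encoders; real ones would be deep networks.
fmri_encoder = torch.nn.Linear(15000, 512)   # flattened voxels -> embedding
image_encoder = torch.nn.Linear(768, 512)    # image features  -> embedding

def contrastive_loss(fmri_batch, image_batch, temperature=0.07):
    f = F.normalize(fmri_encoder(fmri_batch), dim=-1)
    i = F.normalize(image_encoder(image_batch), dim=-1)
    logits = f @ i.T / temperature        # all-pairs cosine similarities
    targets = torch.arange(len(f))        # matching pairs on the diagonal
    # Symmetric cross-entropy: each fMRI should pick out its own image,
    # and each image its own fMRI.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```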
Then they make a prior model, which basically says: "our fMRI vectors are essentially image vectors with an arbitrary amount of randomness in them (representing the difference between the contrastive learning models). Let's train a model to remove that randomness, and then we have image vectors."
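Continuing the sketch: the actual prior is a diffusion model, but a plain regression conveys the same "remove the randomness" idea (everything below is an assumption for illustration):

```python
import torch
import torch.nn.functional as F

# Map fMRI-space embeddings onto clean image embeddings.
prior = torch.nn.Sequential(
    torch.nn.Linear(512, 2048),
    torch.nn.GELU(),
    torch.nn.Linear(2048, 512),
)
optimizer = torch.optim.Adam(prior.parameters(), lr=1e-4)

def prior_step(fmri_emb, image_emb):
    optimizer.zero_grad()
    # Pull the noisy fMRI-space vector toward its paired image vector.
    loss = F.mse_loss(prior(fmri_emb), image_emb)
    loss.backward()
    optimizer.step()
    return loss.item()
```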
So yes, this is an impressive result at first glance and not some overfitting trick.
It’s also sort of bread and butter at this point (replace fMRI with “text” and that’s just what Stable Diffusion is).
There'll be lots of these sorts of results coming out soon.
One of the biggest issues with any attempt to extract information from an fMRI scan is resolution, both spatial and temporal. This study used 1.8mm voxels, which is a TON of neurons (also recall that fMRI scans blood flow, not neuron activity; we just count on those things being correlated). Temporally, fMRI sampling frequencies are often <1 Hz. I didn't see that they mentioned a specific frequency, but they showed images to the subject for 3 seconds at a time, so I'd guess that's designed to ensure you get at least a frame or three while the subject is looking at the image. You can sort of trade voxel size for sampling frequency: you can get more voxels, or more samples, but not both. So detecting things that happen quickly (like, say, moving images or speech) would probably be quite hard (even if you could design an AI thingy that could do it, getting the raw data at the resolution you'd need is not currently possible with existing scanners).
Also, not all brain functions are as clearly localized as vision. The visual cortex areas in the back of the brain map pretty directly to certain kinds of visual stimulus, while other kinds of stimulus and activity are much less localized (there isn't a clear "lighting up" of an area). You can get better resolution if you only scan part of the brain (e.g. the visual cortex) (I don't know if that's what they did for this study), but that's obviously only useful for activity happening in a small part of the brain.
ANYWAY SO COOL!!! I wonder if you could use this to draw people's faces with a subject who is imagining looking at a face? fMRI police sketch? How do brains even work!?
The fMRI dataset includes signals from the whole brain, but we only use the data from the visual cortex for this study.
People will suddenly realize that language is primarily serial simply because it must be conveyed as a series of sounds. There will likely be a new type of visual language used via BCI "telepathy". It may have some ordering but will not rely so heavily on serializing information, since the world is quite multidimensional.
Something else that's interesting about language is that it's just a form of compression for thoughts; I think of a concept, then I put it into words (compression) that you the listener then have to interpret and understand (decompression) and then fit your brain state to the new data you've received. It's overall a very lossy medium compared to what brains can do. It would be much easier to beam my thoughts and the images and videos in my mind directly to you.
Unless you or I have aphantasia, of course.
There's definitely room for direct transfer of concrete unrolled information. But at the same time we would still need some forms of abstraction in many cases.
I think the biggest issue with the compression of natural language is that the loss is different for each person, since everyone's "codec" varies. In other words, people often interpret language in different ways.
But suppose that humans or AIs or AI-enhanced humans could have exactly the same base dictionary or interpretive network or "codec" or whatever for a (visual or word-based) language. Then we could get away from many of the disputes and misunderstandings that arise purely from different interpretations.
It does seem more efficient than sound.
Here's a radiograph of the primary visual cortex created in 1982 by projecting a pattern onto a macaque's retina: https://web.archive.org/web/20100814085656im_/http://hubel.m...
An injection of radioactive sugar lets you see where the neurons were firing away and metabolizing the sugar.
Fig. 4 shows the letter M on the cortical surface, where the stimulus accounted for the effects of foveal magnification (foveal vision gets more cortical space). Keep in mind that we now have stronger magnets, better head coils (the part that picks up the image information), and better sequences (the software that manipulates the magnets to produce the images), so in theory we could do even better these days.
- Each participant in the dataset spent up to 40 hours in the MRI machine to gather sufficient training data.
- Models were trained separately for every participant and are not generalizable across people.
- Image limitations: MindEye is limited to the kinds of natural scenes used for training the model. For other image distributions, additional data collection and specialized generative models would be needed.
- … Non-invasive neuroimaging methods like fMRI not only require participant compliance but also full concentration on following instructions during the lengthy scan process. …
Those things are tight; not "look where you are going" tight, but you absolutely need to tell people beforehand that they will feel very uncomfortable inside, remind them that they can get out at any time, and show them how, because they will not like being in there.
I wouldn’t spend 20 minutes in one if it were not important. I’d seriously push back on an hour. 40 hours is something I’d only do if that is absolutely necessary.
That said, and coming from a background in neuroimaging 20 years ago, what’s the applicability? MRI hasn’t gotten that much more cost effective for more widespread uses. Magnets are expensive.
> Models were trained separately for every participant and are not generalizable across people.
You'd be quite surprised.
This is exactly what you'd want to use ControlNet for: mapping semantic information onto the perceived structure.
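A hedged sketch of that idea with the diffusers library, using the public Canny ControlNet checkpoint (the input file name is hypothetical, and nothing here comes from the paper itself):

```python
import cv2
import numpy as np
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

# Structural condition: Canny edges of some layout image (hypothetical file).
edges = cv2.Canny(np.array(Image.open("layout.png")), 100, 200)
edge_map = Image.fromarray(np.stack([edges] * 3, axis=-1))

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

# The prompt carries the semantics; the edge map constrains the structure.
result = pipe("a photo of a brick building", image=edge_map).images[0]
```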