> I immediately found the results suspect, and I think I have found what is actually going on. The dataset it was trained on was 2,770 images, minus 982 of those used for validation. I posit that the system did not actually read any pictures from the brains, but simply overfitted all the training images into the network itself. For example, if you looked at a picture of a teddy bear, you'd get an overfitted picture of another teddy bear from the training dataset instead.
> The best evidence for this is a picture(1) from page 6 of the paper. Look at the second row. The buildings generated by 'mind reading' subjects 2 and 4 look strikingly similar to each other, but not very similar to the ground truth! By manually combing through the training dataset, I found a picture of a building that does look like that, and by scaling it down and cropping it exactly in the middle, it overlays rather closely(2) on the output that was ostensibly generated for an unrelated image.
> If so, at most they found that looking at similar subjects lights up similar regions of the brain, and putting Stable Diffusion on top of it serves no purpose. At worst it's entirely cherry-picked coincidences.
> 1. https://i.imgur.com/ILCD2Mu.png
So we are doing both reconstruction and retrieval.
The reconstruction achieves SOTA results. The retrieval demonstrates that the image embeddings contain fine-grained information, not just "this is a picture of a teddy bear", with the diffusion model then generating some random teddy bear picture.
I think the zebra example really highlights that. The generated image embedding matches the exact zebra image that was seen by the person. If the model could only say "this is a zebra picture", it wouldn't be able to do that. The model is picking up on fine-grained info present in the fMRI signal.
The blog post has more information and the paper itself has even more information so please check it out! :)
Reconstruction is the primary and difficult aim, but it is what you want and expect when people talk about such "mind reading". Classifying stimuli from brain activity has long been solved and is not difficult; it is almost trivial with modern data sizes and quality. With 80 categories and data from higher visual areas you could even use an SVM as the basic classifier, plus some method for getting a similar blob shape from the activity (V1-V3 are map-like), and get good results.
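For concreteness, here is a minimal sketch of that classification baseline, with placeholder random data standing in for real voxel responses (shapes and regularization strength are illustrative assumptions, not anyone's actual pipeline):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 5000))   # trials x voxels (placeholder data)
y = rng.integers(0, 80, size=2000)      # one of 80 stimulus categories

# Standardize voxel responses, then fit a linear SVM over all 80 classes.
clf = make_pipeline(StandardScaler(), LinearSVC(C=0.1, max_iter=5000))
print(cross_val_score(clf, X, y, cv=5))  # chance level is 1/80 = 1.25%
```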
If you ignore the question of whether you are just doing classification, you can easily get too-good-to-be-true results. With these newer methods relying on pretrained features, this classification case can hide deep inside the model too, and can easily be missed.
The community is currently discussing to what extent this applies to these newer papers (start with the original post): https://twitter.com/ykamit/status/1677872648590864385?s=20
One thing they showed is that the 80 categories of that data collapse to just 40 clusters in the semantic space.
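For a rough idea of how such a collapse can be measured, here is an illustrative sketch (not the procedure from that thread; the embedding model, the threshold, and the truncated label list are all assumptions):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

# The 80 category labels would go here; a few shown for brevity.
labels = ["zebra", "giraffe", "teddy bear", "bed", "train", "boat"]

emb = SentenceTransformer("all-MiniLM-L6-v2").encode(labels)
# Merge labels whose semantic embeddings fall under a hand-picked distance.
clusters = AgglomerativeClustering(
    n_clusters=None, distance_threshold=1.0).fit_predict(emb)
print(len(set(clusters)), "distinct semantic clusters")
```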
(Kamitani has been working on the reconstruction question for a long time and knows all these traps quite well.)
The deeprecon dataset proposed as an alternative has been around for a few years and has been used in multiple reconstruction papers. It has many more classes, out-of-distribution "abstract" images, and no class overlap between train and test images, so it's quite suitable for proving that it is actually reconstruction. But it's also one order of magnitude smaller than the NSD data used for the newer reconstruction studies. If you modify the 80-class NSD data to remove the train-test class overlap, the two diffusion methods tested there do not work as well, but they still look like they do some reconstruction.
On deeprecon the two tested diffusion methods fail at reconstructing the abstract OOD images (which NSD does not have), something previous reconstruction methods could do.
As the abstract says, "In particular, MindEye can retrieve the exact original image even among highly similar candidates indicating that its brain embeddings retain fine-grained image-specific information. This allows us to accurately retrieve images even from large-scale databases like LAION-5B. We demonstrate through ablations that MindEye's performance improvements over previous methods result from specialized submodules for retrieval and reconstruction, improved training techniques, and training models with orders of magnitude more parameters."
Note that LAION-5B has five billion images.
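To make the retrieval claim concrete, here is a hedged sketch of the core operation, assuming the brain signal and the candidate images have already been mapped into a shared embedding space. Real retrieval at LAION-5B scale would use an approximate index (e.g. FAISS) rather than this brute-force scan:

```python
import torch
import torch.nn.functional as F

def retrieve_top_k(brain_emb, image_embs, k=5):
    # Cosine similarity between one brain embedding and every candidate.
    b = F.normalize(brain_emb, dim=-1)       # shape: (d,)
    db = F.normalize(image_embs, dim=-1)     # shape: (n_images, d)
    scores = db @ b
    return scores.topk(k).indices  # indices of the k best-matching images
```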
You can think of contrastive learning as two separate models that take different inputs and produce vectors of the same length as outputs. This is achieved by training both models on pairs of training data (in this case fMRI recordings and the observed images).
What the LAION-5B work shows is that they did a good enough job of this training that the models are really good at creating similar vectors for nearly any matching image and fMRI pair.
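A minimal CLIP-style sketch of that setup (the encoder architectures and dimensions here are illustrative assumptions, not the paper's actual models):

```python
import torch
import torch.nn.functional as F

# Placeholder encoders; real ones would be deep networks.
fmri_encoder = torch.nn.Linear(15000, 512)   # flattened voxels -> embedding
image_encoder = torch.nn.Linear(768, 512)    # image features  -> embedding

def contrastive_loss(fmri_batch, image_batch, temperature=0.07):
    f = F.normalize(fmri_encoder(fmri_batch), dim=-1)
    i = F.normalize(image_encoder(image_batch), dim=-1)
    logits = f @ i.T / temperature        # all-pairs cosine similarities
    targets = torch.arange(len(f))        # matching pairs on the diagonal
    # Symmetric cross-entropy: each fMRI should pick out its own image,
    # and each image its own fMRI.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```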
Then they make a prior model, which basically says: "our fMRI vectors are essentially image vectors with an arbitrary amount of randomness in them (representing the difference between the contrastive learning models). Let's train a model to remove that randomness, and then we have image vectors."
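Continuing the sketch: the actual prior is a diffusion model, but a plain regression conveys the same "remove the randomness" idea (everything below is an assumption for illustration):

```python
import torch
import torch.nn.functional as F

# Map fMRI-space embeddings onto clean image embeddings.
prior = torch.nn.Sequential(
    torch.nn.Linear(512, 2048),
    torch.nn.GELU(),
    torch.nn.Linear(2048, 512),
)
optimizer = torch.optim.Adam(prior.parameters(), lr=1e-4)

def prior_step(fmri_emb, image_emb):
    optimizer.zero_grad()
    # Pull the noisy fMRI-space vector toward its paired image vector.
    loss = F.mse_loss(prior(fmri_emb), image_emb)
    loss.backward()
    optimizer.step()
    return loss.item()
```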
So yes, this is an impressive result at first glance and not some overfitting trick.
It’s also sort of bread and butter at this point (replace fMRI with “text” and that’s just what Stable Diffusion is).
There'll be lots of these sorts of results coming out soon.
One of the biggest issues with any attempt to extract information from an fMRI scan is resolution, both spatial and temporal. This study used 1.8mm voxels, which is a TON of neurons (also recall that fMRI scans blood flow, not neuron activity; we just count on those things being correlated). Temporally, fMRI sampling frequencies are often <1 Hz. I didn't see that they mentioned a specific frequency, but they showed images to the subject for 3 seconds at a time, so I'd guess that's designed to ensure you get at least a frame or three while the subject is looking at the image. You can sort of trade voxel size for sampling frequency: you can get more voxels, or more samples, but not both. So detecting things that happen quickly (like, say, moving images or speech) would probably be quite hard (even if you could design an AI thingy that could do it, getting the raw data at the resolution you'd need is not currently possible with existing scanners).
Also, not all brain functions are as clearly localized as vision. The visual cortex areas in the back of the brain map pretty directly to certain kinds of visual stimulus, while other kinds of stimulus and activity are much less localized (there isn't a clear "lighting up" of an area). You can get better resolution if you only scan part of the brain (e.g. the visual cortex) (I don't know if that's what they did for this study), but that's obviously only useful for activity happening in a small part of the brain.
ANYWAY SO COOL!!! I wonder if you could use this to draw people's faces with a subject who is imagining looking at a face? fMRI police sketch? How do brains even work!?
The fMRI dataset includes signals from the whole brain, but we only use the data from the visual cortex for this study.
People will suddenly realize that language is primarily serial simply because it must be conveyed as a series of sounds. There will likely be a new type of visual language used via BCI "telepathy". It may have some ordering but will not rely so heavily on serializing information, since the world is quite multidimensional.
Something else that's interesting about language is that it's just a form of compression for thoughts; I think of a concept, then I put it into words (compression) that you the listener then have to interpret and understand (decompression) and then fit your brain state to the new data you've received. It's overall a very lossy medium compared to what brains can do. It would be much easier to beam my thoughts and the images and videos in my mind directly to you.
Unless you or I have aphantasia, of course.
There's definitely room for direct transfer of concrete unrolled information. But at the same time we would still need some forms of abstraction in many cases.
I think the biggest issue with the compression of natural language is that the loss is different for each person, since everyone's "codec" varies. In other words, people often interpret language in different ways.
But suppose that humans or AIs or AI-enhanced humans could have exactly the same base dictionary or interpretive network or "codec" or whatever for a (visual or word-based) language. Then we could get away from many of the disputes and misunderstandings that arise purely from different interpretations.
It does seem more efficient than sound.
Here's a radiograph of the primary visual cortex created in 1982 by projecting a pattern onto a macaque's retina: https://web.archive.org/web/20100814085656im_/http://hubel.m...
An injection of radioactive sugar lets you see where the neurons were firing away and metabolizing the sugar.
Fig. 4 shows the letter M on the cortical surface, where the stimulus accounted for the effects of foveal magnification (foveal vision gets more cortical space). Keep in mind that we now have stronger magnets, better head coils (the part that picks up the image information), and better sequences (the software that manipulates the magnets to produce the images), so in theory we could do even better these days.
- Each participant in the dataset spent up to 40 hours in the MRI machine to gather sufficient training data.
- Models were trained separately for every participant and are not generalizable across people.
- Image limitations: MindEye is limited to the kinds of natural scenes used for training the model. For other image distributions, additional data collection and specialized generative models would be needed.
- … Non-invasive neuroimaging methods like fMRI not only require participant compliance but also full concentration on following instructions during the lengthy scan process. …
Those things are tight; not "look where you are going" tight, but you absolutely need to tell people beforehand that they will feel very uncomfortable inside, remind them that they can get out at any time, and show them how, because they will not like being in there.
I wouldn’t spend 20 minutes in one if it were not important. I’d seriously push back on an hour. 40 hours is something I’d only do if that is absolutely necessary.
That said, and coming from a background in neuroimaging 20 years ago, what’s the applicability? MRI hasn’t gotten that much more cost effective for more widespread uses. Magnets are expensive.
> Models were trained separately for every participant and are not generalizable across people.
You'd be quite surprised.
This is exactly what you'd want to use ControlNet for: mapping semantic information onto the perceived structure.
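A hedged sketch of that idea with the diffusers library, using the public Canny ControlNet checkpoint (the input file name is hypothetical, and nothing here comes from the paper itself):

```python
import cv2
import numpy as np
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

# Structural condition: Canny edges of some layout image (hypothetical file).
edges = cv2.Canny(np.array(Image.open("layout.png")), 100, 200)
edge_map = Image.fromarray(np.stack([edges] * 3, axis=-1))

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

# The prompt carries the semantics; the edge map constrains the structure.
result = pipe("a photo of a brick building", image=edge_map).images[0]
```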