Reconstructing the Mind’s Eye: fMRI-to-Image with Contrastive Learning and Diffusion Priors

MindEye overall schematic depicts the retrieval and reconstruction submodules alongside an independent low-level perceptual pipeline meant to enhance reconstruction fidelity.

I really identify with the spirit of LAION (Large-scale Artificial Intelligence Open Network). Their motto is "Truly Open AI", a dig at OpenAI, and they've released several important datasets and models created purely by unpaid volunteers. This project was initially incubated on their Discord, and when I joined there was mostly just a Jupyter notebook trying to reconstruct the images a person was looking at inside an fMRI scanner using only the fMRI signal. That notebook didn't work very well at all compared to the image above.

After making some tweaks to that initial notebook that improved things slightly, we ended up getting adopted by Stability AI, which meant we moved over to the newly created MedARC Discord and got access to a huge number of A100 GPUs.

I worked on both the retrieval model (trained via contrastive learning) and the reconstruction model (a diffusion prior similar to the one used by DALL·E 2). The quality of the reconstructed images improved quite a bit when I was able to use the pre-trained diffusion prior weights released by another LAION project dedicated to reproducing OpenAI's DALL·E 2 model. It really highlighted to me the effectiveness of the open science movement, as we would never have been able to get these results had those weights not been publicly released.

As time went on, people from Princeton Neuroscience Institute, Ecole Normale Supérieure, PSL University, University of Toronto, the Hebrew University of Jerusalem, EleutherAI, and Stability AI all chipped in to eventually set a new state-of-the-art result for this task. I felt early on in this project that a large part of our contribution would be not just the results, but the open source code and documentation that would allow others to build on our work. I'm glad to have had a hand in making our code easily installable and fairly easy to read, especially compared to code released by similar papers.

The full paper is here, and if you'd like to get involved, the project page with the roadmap and recorded meetings is here. The paper sat near the top of the Hacker News front page for a while and was later accepted as a spotlight at NeurIPS 2023.

MindEye on Hacker News

What follows is a summary of the paper originally written by my wonderful co-author Paul Scotti at the Princeton Neuroscience Institute and lightly edited by me.

Introduction

Functional magnetic resonance imaging (fMRI) measures brain activity by detecting changes in oxygenated blood flow. It is used to study which parts of the brain handle different functions and to help evaluate treatments for brain disorders. MindEye was trained and evaluated on the Natural Scenes Dataset [1], an offline fMRI dataset containing data from human participants who each agreed to spend up to 40 hours viewing a series of static images, for a few seconds each, inside the MRI machine.

MindEye achieves state-of-the-art performance across both image retrieval and reconstruction. That is, given a sample of fMRI activity from a participant viewing an image, MindEye can either identify which image out of a pool of possible image candidates was the original seen image (retrieval), or it can recreate the image that was seen (reconstruction).

To achieve the goals of retrieval and reconstruction with a single model trained end-to-end, we adopt a novel approach of using two parallel submodules that are specialized for retrieval (using contrastive learning) and reconstruction (using a diffusion prior).
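
To make that layout concrete, here is a minimal PyTorch sketch of a shared backbone feeding the two submodules. The module name, layer sizes, and voxel count are illustrative assumptions rather than the paper's exact configuration, and the diffusion prior is left as a placeholder.

```python
import torch
import torch.nn as nn

class MindEyeSketch(nn.Module):
    """Illustrative two-submodule layout: one shared backbone, two heads."""
    def __init__(self, num_voxels=15000, clip_tokens=257, clip_dim=768, hidden=4096):
        super().__init__()
        self.clip_tokens, self.clip_dim = clip_tokens, clip_dim
        # Shared dense MLP backbone: flattened voxels -> CLIP-shaped embedding.
        self.backbone = nn.Sequential(
            nn.Linear(num_voxels, hidden),
            nn.GELU(),
            nn.Linear(hidden, clip_tokens * clip_dim),
        )
        # Retrieval submodule: lightweight projector trained with contrastive loss.
        self.projector = nn.Sequential(
            nn.LayerNorm(clip_dim),
            nn.Linear(clip_dim, clip_dim),
        )
        # Reconstruction submodule: diffusion prior (placeholder stand-in here).
        self.diffusion_prior = nn.Identity()

    def forward(self, voxels):
        b = voxels.shape[0]
        backbone_out = self.backbone(voxels).view(b, self.clip_tokens, self.clip_dim)
        retrieval_emb = self.projector(backbone_out)              # -> contrastive retrieval
        reconstruction_emb = self.diffusion_prior(backbone_out)   # -> reconstruction
        return retrieval_emb, reconstruction_emb
```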

Each unique image in the dataset was viewed three times, for three seconds at a time. Corresponding fMRI activity (flattened spatial patterns across 1.8mm cubes of cortical tissue called “voxels”) was collected for each image presentation. fMRI activity across the three same-image viewings was averaged together and input to MindEye to retrieve and reconstruct the original image.
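
In code, this trial averaging is a one-liner. The tiny sketch below assumes the data have already been loaded as an (images × repetitions × voxels) tensor; the shapes are placeholders, not a description of the actual NSD preprocessing pipeline.

```python
import torch

# Hypothetical shape: (num_images, 3 repetitions, num_voxels of visual cortex).
voxels_per_trial = torch.randn(100, 3, 15000)  # stand-in for real single-trial data

# Average the three same-image presentations before feeding MindEye.
voxels_averaged = voxels_per_trial.mean(dim=1)  # -> (num_images, num_voxels)
```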

MindEye pipeline

Retrieval

For retrieval, MindEye finds the exact (top-1) match in a pool of test samples with >90% accuracy for both image and brain retrieval, outperforming previous work, which reported <50% retrieval accuracy. This suggests that MindEye brain embeddings retain a fine-grained, image-specific signal.

MindEye image retrieval. Given a pool of candidate images, nearest-neighbor search in CLIP space enables retrieval of the original image based on brain activity.

We accomplish this feat through contrastive learning. First, fMRI activity from brain regions receptive to visual information is flattened and fed through a dense 940M-parameter multilayer perceptron (MLP). This outputs brain embeddings with the same dimensionality as the outputs of the last hidden layer of CLIP ViT-L/14 [2] (although any multimodal latent space could be used). These brain embeddings are fed through a lightweight MLP projector, and we then use a novel bidirectional implementation of mixup contrastive data augmentation with a CLIP loss to train the model to map brain embeddings into the same space as pre-trained, frozen CLIP image embeddings.
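
The sketch below shows one way such a bidirectional mixup contrastive objective could look in PyTorch, with the embeddings treated as flat vectors for simplicity. The function name, temperature, and Beta parameter are illustrative assumptions, and the paper's full training recipe involves additional components not reproduced here.

```python
import torch
import torch.nn.functional as F

def bimixco_loss(voxels, clip_img_emb, brain_encoder, temp=0.006, alpha=0.15):
    """Sketch of a bidirectional mixup contrastive (MixCo-style) loss.

    voxels:        (B, num_voxels) trial-averaged fMRI activity
    clip_img_emb:  (B, D) frozen CLIP image embeddings of the seen images
    brain_encoder: callable mapping voxels -> (B, D) brain embeddings
    """
    B, device = voxels.shape[0], voxels.device
    perm = torch.randperm(B, device=device)
    lam = torch.distributions.Beta(alpha, alpha).sample((B,)).to(device)

    # Mixup directly on the fMRI inputs: each sample is blended with a partner.
    mixed_voxels = lam[:, None] * voxels + (1 - lam[:, None]) * voxels[perm]

    brain = F.normalize(brain_encoder(mixed_voxels), dim=-1)
    image = F.normalize(clip_img_emb, dim=-1)
    logits = brain @ image.T / temp  # (B, B) cosine-similarity matrix

    # Soft targets: a mixed brain sample "belongs" to two images, weighted by lam.
    targets = torch.zeros(B, B, device=device)
    targets[torch.arange(B, device=device), torch.arange(B, device=device)] = lam
    targets[torch.arange(B, device=device), perm] += 1 - lam

    def soft_cross_entropy(lg, tgt):
        tgt = tgt / tgt.sum(dim=-1, keepdim=True)  # normalize target rows
        return -(tgt * F.log_softmax(lg, dim=-1)).sum(dim=-1).mean()

    # Bidirectional: brain -> image and image -> brain directions.
    return 0.5 * (soft_cross_entropy(logits, targets)
                  + soft_cross_entropy(logits.T, targets.T))
```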

For inference, you can simply compute the CLIP image embedding for every possible image in the pool of image candidates and see which one has the highest cosine similarity with the brain embedding output from MindEye. We found that this approach worked even when scaling up to the billions of image candidates contained in LAION-5B.
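
As a sketch, the retrieval step reduces to a cosine-similarity argmax over the candidate pool; the function name and shapes below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve_top1(brain_emb, candidate_clip_embs):
    """Return the index of the candidate image whose frozen CLIP embedding is
    closest (by cosine similarity) to the MindEye brain embedding.

    brain_emb:           (D,)   brain embedding for one fMRI sample
    candidate_clip_embs: (N, D) CLIP image embeddings of the candidate pool
    """
    brain = F.normalize(brain_emb, dim=-1)
    candidates = F.normalize(candidate_clip_embs, dim=-1)
    similarities = candidates @ brain  # cosine similarity to every candidate
    return similarities.argmax().item()
```

For a pool as large as LAION-5B, an exhaustive comparison like this is impractical, so in practice one would query an approximate nearest-neighbor index over pre-computed CLIP embeddings rather than scanning every candidate.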

Reconstruction

Side-by-side comparison of reconstructions from fMRI-to-Image NSD papers.

For reconstructions, we take the outputs from the dense MLP backbone mentioned above and feed them through a diffusion prior trained from scratch to better align brain embeddings to CLIP image space. This is the same approach used by DALL·E 2 to align CLIP text embeddings to CLIP image space before feeding the aligned embeddings through another diffusion model to generate images. As visualized by UMAP dimensionality reduction, the brain embeddings output by the MLP backbone are clearly disjoint from the CLIP image embeddings (left subplot below), but they are well-aligned after passing through the diffusion prior (right subplot).
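
For intuition, here is a heavily simplified sketch of how such a prior could be trained: noise the target CLIP image embedding with a standard diffusion schedule and train a conditioned network to recover it from the brain embedding. The tiny MLP, flat embedding shapes, and hyperparameters are stand-ins, not the actual prior architecture used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyPrior(nn.Module):
    """Toy stand-in for the diffusion prior: predicts the clean CLIP image
    embedding from a noised embedding, a timestep, and a brain embedding."""
    def __init__(self, dim=768, hidden=2048, num_timesteps=1000):
        super().__init__()
        self.timestep_emb = nn.Embedding(num_timesteps, dim)
        self.net = nn.Sequential(nn.Linear(dim * 3, hidden), nn.GELU(),
                                 nn.Linear(hidden, dim))

    def forward(self, noised_clip, t, brain_emb):
        x = torch.cat([noised_clip, self.timestep_emb(t), brain_emb], dim=-1)
        return self.net(x)  # predicted clean CLIP image embedding

def prior_training_step(prior, brain_emb, clip_img_emb, alphas_cumprod):
    """One simplified step: noise the target embedding at a random timestep and
    regress the prior's prediction back onto the clean embedding.
    alphas_cumprod: (T,) cumulative noise schedule on the same device as the data."""
    B = clip_img_emb.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (B,), device=clip_img_emb.device)
    a = alphas_cumprod[t][:, None]
    noise = torch.randn_like(clip_img_emb)
    noised = a.sqrt() * clip_img_emb + (1 - a).sqrt() * noise
    return F.mse_loss(prior(noised, t, brain_emb), clip_img_emb)
```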

UMAP plots depict CLIP image latents (blue), MindEye MLP backbone latents (orange), MindEye MLP projector latents (green), and MindEye diffusion prior latents (red).

This alignment allows us to substitute MindEye brain latents for CLIP image latents. We can simply take any pre-trained generative model that accepts CLIP image latents as input and feed the model a brain latent instead (no fine-tuning required!). This flexibility suggests that MindEye reconstructions will continue to improve as newer, more powerful image generation models are released.
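
A minimal sketch of that swap, reusing the `MindEyeSketch` module from above and a hypothetical `image_generator` standing in for any frozen model that accepts CLIP image latents (e.g. an unCLIP-style or image-variation pipeline):

```python
import torch

@torch.no_grad()
def reconstruct_from_brain(voxels, mindeye, image_generator):
    """Feed the diffusion-prior-aligned brain latent to a pre-trained generator
    in place of a CLIP image latent; the generator itself stays frozen."""
    _, brain_latent = mindeye(voxels)                  # aligned to CLIP image space
    return image_generator(image_embeds=brain_latent)  # hypothetical keyword argument
```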