A self-supervised deep neural network for image completion resembles early visual cortex fMRI activity patterns for occluded scenes

Abstract The promise of artificial intelligence in understanding biological vision relies on comparing computational models with brain data, with the goal of capturing functional principles of visual information processing. Convolutional neural networks (CNNs) have successfully matched the transformations in hierarchical processing occurring along the brain's feedforward visual pathway, extending into ventral temporal cortex. However, it remains to be seen whether CNNs can successfully describe feedback processes in early visual cortex. Here, we investigated similarities between human early visual cortex and a CNN with an encoder/decoder architecture, trained with self-supervised learning to fill occlusions and reconstruct an unseen image. Using representational similarity analysis (RSA), we compared 3T functional magnetic resonance imaging (fMRI) data from a non-stimulated patch of early visual cortex in human participants viewing partially occluded images with the activations of the different CNN layers for the same images. The results show that our self-supervised image-completion network outperforms a classical object-recognition supervised network (VGG16) in terms of similarity to fMRI data. This work provides additional evidence that optimal models of the visual system might come from less feedforward architectures trained with less supervision. We also find that CNN decoder-pathway activations are more similar to brain processing than encoder activations, suggesting an integration of mid-level and low-level features in early visual cortex. Challenging an artificial intelligence model to learn natural image representations via self-supervised learning and comparing them with brain data can help us constrain our understanding of information processing, such as neuronal predictive coding.


VGG16 representation similarity
Similarity results between VGG16 layer activations and brain data are reported in Figure 1.

Figure 1: Comparison between brain and VGG16 RDMs (correlation with brain representations, Kendall's tau-a). Averaged results across quadrants are shown.
In accordance with [Cichy et al., 2016, Güçlü and van Gerven, 2015], the results across areas in Fig. 1 show a decrease in similarity when going deeper in the network. The first layer, implementing mainly edge and color contrast detectors (low-level features), is overall the most similar to the brain. From the second layer onward, performance differs for every quadrant, which makes the correlation across the entire image low. Note that only the first three layers have receptive fields small enough for quadrants to be separable; from the fourth layer onward, it is no longer possible to disentangle the quadrants.
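For reference, RSA reduces each set of response patterns to a representational dissimilarity matrix (RDM) and then correlates the RDMs with Kendall's tau-a. A minimal sketch follows; the array shapes, the correlation-distance RDM, and the helper names `rdm` and `kendall_tau_a` are illustrative assumptions, not the paper's exact code:

```python
import numpy as np
from scipy.spatial.distance import pdist

def rdm(patterns):
    """Condensed RDM from a (n_conditions, n_features) array,
    using correlation distance (1 - Pearson r) between conditions."""
    return pdist(patterns, metric='correlation')

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs,
    with tied pairs counted as zero."""
    n = len(x)
    sx = np.sign(x[:, None] - x[None, :])  # sign of all pairwise differences
    sy = np.sign(y[:, None] - y[None, :])
    return np.sum(sx * sy) / (n * (n - 1))

# hypothetical usage: 24 stimuli, fMRI voxel patterns vs. one CNN layer
brain = np.random.randn(24, 500)    # stand-in for voxel patterns
layer = np.random.randn(24, 4096)   # stand-in for flattened layer activations
print(kendall_tau_a(rdm(brain), rdm(layer)))
```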

Encoder/decoder layers detail
Comparisons between brain RDMs and aggregated encoder/decoder network RDMs are reported in Figure 2.

Visual results on MNIST dataset
For the sake of completeness, we trained and tested a model on the MNIST dataset with occlusions. The steps performed are the same as in the manuscript: we downloaded the dataset, applied the occlusions, trained the model, and tested it on a held-out set (in this example too, we split the data into training/validation/test sets). The results in Figure 3 show that the method can learn the data distribution well, as long as it is trained and tested on the same population. Network outputs are very similar to the original images and nearly indistinguishable at a glance, although zooming in makes it possible to spot small differences. These good results derive from the task being relatively simple and from the high capacity of the model.
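As an illustration of this preprocessing, here is a minimal sketch of the occlusion and splitting steps; the occluder size and position, the split sizes, and the loader are illustrative assumptions, not necessarily those used for Figure 3:

```python
import numpy as np
from tensorflow.keras.datasets import mnist  # any MNIST loader works

def occlude(images, size=10, value=0):
    """Mask a size x size square at the centre of each image
    (occluder position/size here are illustrative, not the paper's)."""
    out = images.copy()
    h, w = images.shape[1:3]
    top, left = (h - size) // 2, (w - size) // 2
    out[:, top:top + size, left:left + size] = value
    return out

(x_trainval, _), (x_test, _) = mnist.load_data()
x_trainval = x_trainval.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0

# split off a validation set from the training images
n_val = 5000
x_val, x_train = x_trainval[:n_val], x_trainval[n_val:]

# network input = occluded image, target = original image
x_train_occ, x_val_occ, x_test_occ = map(occlude, (x_train, x_val, x_test))
```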

Training and testing approach
The procedure followed to train our model takes advantage of two networks: a generator (our encoder/decoder in Figure 2), which takes an occluded image as input and produces a reconstructed image as output, and a discriminator, which has to detect whether the synthesised image is real or fake. In testing, only the generator is used: starting from an occluded image, it produces the reconstructed output.
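The exact architectures are given in Table 2; below is a minimal sketch of one adversarial training step, assuming a standard GAN loss plus an L1 reconstruction term (a common choice for inpainting; the actual losses, optimizers, and toy models here are assumptions, not the paper's configuration):

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# toy stand-ins for the generator (encoder/decoder) and discriminator
def make_generator():
    inp = layers.Input((28, 28, 1))
    x = layers.Conv2D(32, 3, padding='same', activation='relu')(inp)
    out = layers.Conv2D(1, 3, padding='same', activation='sigmoid')(x)
    return Model(inp, out)

def make_discriminator():
    inp = layers.Input((28, 28, 1))
    x = layers.Conv2D(32, 3, strides=2, activation='relu')(inp)
    x = layers.Flatten()(x)
    out = layers.Dense(1)(x)  # logit: real vs. fake
    return Model(inp, out)

gen, disc = make_generator(), make_discriminator()
g_opt = tf.keras.optimizers.Adam(2e-4)
d_opt = tf.keras.optimizers.Adam(2e-4)
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

@tf.function
def train_step(occluded, original):
    with tf.GradientTape() as gt, tf.GradientTape() as dt:
        fake = gen(occluded, training=True)
        real_logits = disc(original, training=True)
        fake_logits = disc(fake, training=True)
        # discriminator: push real images towards 1, reconstructions towards 0
        d_loss = bce(tf.ones_like(real_logits), real_logits) + \
                 bce(tf.zeros_like(fake_logits), fake_logits)
        # generator: fool the discriminator + stay close to the original (L1)
        g_loss = bce(tf.ones_like(fake_logits), fake_logits) + \
                 tf.reduce_mean(tf.abs(fake - original))
    d_opt.apply_gradients(zip(dt.gradient(d_loss, disc.trainable_variables),
                              disc.trainable_variables))
    g_opt.apply_gradients(zip(gt.gradient(g_loss, gen.trainable_variables),
                              gen.trainable_variables))

# e.g., with the MNIST arrays from the previous sketch:
# train_step(x_train_occ[:64, ..., None], x_train[:64, ..., None])
```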

Increase and decrease of similarity in spatial encoders and decoders
Here we performed a statistical analysis to support the claim on page 12: "progressive encoding layers become increasingly better descriptions of brain activity patterns". To test the significance of an increase (or decrease), we fit a linear regression model and report the p-value of the t-test on the hypothesis that the corresponding coefficient (the slope) is zero. Figure 6 shows the results of this analysis; the p-values obtained are displayed in the top right corners, coloured green or red for passing or failing the t-test, respectively (p < 0.05). As an example, the p-value of the t-statistic for the top left subplot (V1, non-occluded, spatial encoder) is p = 1.6449e-13, which is smaller than 0.05, so this term is significant at the 5% significance level. As can be seen, in every condition (V1 and V2, occluded and non-occluded, 4/4) we have a statistically significant increase in similarity for the spatial encoders. For the spatial decoders, we have significant decreases in three cases out of four (V2 occluded is not significant). For the latent feature vectors, we do not have any significant results, which means that the slopes found are not statistically different from zero.
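This test can be reproduced with a standard least-squares fit: `scipy.stats.linregress` reports exactly this two-sided t-test p-value on the slope. A minimal sketch with stand-in data (the real analysis uses the per-subject correlation values plotted in Figure 6):

```python
import numpy as np
from scipy.stats import linregress

# hypothetical data: per-subject correlations for each of the 8 encoder layers
# rows = subjects, columns = layers (stand-in values, not the paper's data)
corr = np.random.randn(10, 8) * 0.05 + np.linspace(0.0, 0.2, 8)

layer_idx = np.tile(np.arange(1, 9), corr.shape[0])  # x: layer position
values = corr.ravel()                                # y: correlation values

res = linregress(layer_idx, values)
# linregress reports the two-sided p-value of the t-test on slope == 0
print(f"slope = {res.slope:.4f}, p = {res.pvalue:.3e},",
      "significant" if res.pvalue < 0.05 else "not significant")
```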

Experiment images
The 24 images used in the experiment are reported in Figure 7.

Model summaries VGG16
In Table 1, VGG16 network details, with activation and receptive field dimensions, are presented. The analysed layers are highlighted in blue.
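As a cross-check on the receptive field column, the values can be recomputed with the standard recursion r_out = r_in + (k - 1) * j_in and j_out = j_in * s, where k is the kernel size, s the stride, and j the cumulative stride (jump). A minimal sketch over the public VGG16 configuration (layer list and kernel/stride values from the standard architecture, not copied from Table 1):

```python
# (layer name, kernel size, stride) for the standard VGG16 configuration
vgg16 = [
    ('block1_conv1', 3, 1), ('block1_conv2', 3, 1), ('pool1', 2, 2),
    ('block2_conv1', 3, 1), ('block2_conv2', 3, 1), ('pool2', 2, 2),
    ('block3_conv1', 3, 1), ('block3_conv2', 3, 1), ('block3_conv3', 3, 1),
    ('pool3', 2, 2),
    ('block4_conv1', 3, 1), ('block4_conv2', 3, 1), ('block4_conv3', 3, 1),
    ('pool4', 2, 2),
    ('block5_conv1', 3, 1), ('block5_conv2', 3, 1), ('block5_conv3', 3, 1),
    ('pool5', 2, 2),
]

r, j = 1, 1  # receptive field size and jump (stride product) at the input
for name, k, s in vgg16:
    r += (k - 1) * j
    j *= s
    print(f"{name:14s} receptive field = {r:3d} px")
```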

Encoder/decoder
In Table 2, encoder/decoder network details, with activation and receptive field dimensions, are presented.

Figure 6: Statistical analysis to support the sentence: "progressive encoding layers become increasingly better descriptions of brain activity patterns". Every row displays results for a specific visual area and quadrant tested: V1 non-occluded, V1 occluded, V2 non-occluded, and V2 occluded. Each column shows results for spatial encoders, latent feature vectors, and spatial decoders, respectively. To test the significance of an increase (or decrease), we fit a linear regression model (red line) using correlation data (every 'x' is a subject) and display the p-value of the t-test on the hypothesis that the corresponding coefficient (the slope) is zero. The p-values are coloured green or red for passing or failing the t-test, respectively (p < 0.05).

Figure 7: The 24 images (of 6 categories: forests, mountains, highways, industry, beaches, and buildings) used in the experiment. Images are taken from the SUN database [Xiao et al., 2010].

Layer visualisation encoder/decoder
Figures 8 to 15 show layer activations for encoder_1 to encoder_8. Every figure corresponds to a specific layer and has two subplots. (A) Once a specific channel of the analysed layer is selected (randomly; the channel number is reported at the top), its top five activations are shown in a column. (B) depicts the images these patches are taken from, with red bounding boxes indicating the location of the patches in the image.

Layer visualisation VGG16
Figures 16 to 20 show layer activations for block1_conv2, block2_conv2, block3_conv3, block4_conv3, and block5_conv3. Every figure corresponds to a specific layer and has two subplots. (A) Once a specific channel of the analysed layer is selected (randomly; the channel number is reported at the top), its top five activations are shown in a column. (B) depicts the images these patches are taken from, with red bounding boxes indicating the location of the patches in the image.
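For reference, the patch-selection step behind these figures (both encoder/decoder and VGG16) can be sketched as follows: pick a channel at random, rank images by their peak activation in that channel, and map each peak back to its receptive field in the input. This is a minimal illustration with stand-in images (the receptive field size and stride for block3_conv3 follow the recursion sketched above), not the exact plotting code:

```python
import numpy as np
import tensorflow as tf

model = tf.keras.applications.VGG16(weights='imagenet', include_top=False)
feat = tf.keras.Model(model.input, model.get_layer('block3_conv3').output)

images = np.random.rand(24, 224, 224, 3).astype('float32')  # stand-in stimuli
acts = feat.predict(images)                                  # (24, H, W, C)

rng = np.random.default_rng(0)
ch = rng.integers(acts.shape[-1])          # randomly selected channel
flat = acts[..., ch].reshape(len(images), -1)
best = np.argsort(flat.max(axis=1))[-5:]   # the five top-activating images

rf, stride = 40, 4  # receptive field size / jump of block3_conv3
for i in best[::-1]:
    y, x = np.unravel_index(flat[i].argmax(), acts.shape[1:3])
    print(f"image {i}: peak at ({y}, {x}) -> input patch near "
          f"({y * stride}, {x * stride}), size ~{rf} px")
```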