Net2Brain: A Toolbox to compare artificial vision models with human brain responses

We introduce Net2Brain, a graphical and command-line user interface toolbox for comparing the representational spaces of artificial deep neural networks (DNNs) and human brain recordings. While other toolboxes facilitate only single functionalities or focus only on a small subset of supervised image classification models, Net2Brain allows the extraction of activations of more than 600 DNNs trained to perform a diverse range of vision-related tasks (e.g., semantic segmentation, depth estimation, and action recognition) over both image and video datasets. The toolbox computes the representational dissimilarity matrices (RDMs) over those activations and compares them to brain recordings using representational similarity analysis (RSA) and weighted RSA, both in specific regions of interest (ROIs) and with searchlight analysis. In addition, it is possible to add a new dataset of stimuli and brain recordings to the toolbox for evaluation. We demonstrate the functionality and advantages of Net2Brain with an example showcasing how it can be used to test hypotheses of cognitive computational neuroscience.


Introduction
Several studies have demonstrated the potential of DNNs to serve as state-of-the-art computational models of the primate visual cortex (Khaligh-Razavi & Kriegeskorte, 2014; Yamins et al., 2014; Guclu & van Gerven, 2015; Cichy, Khosla, Pantazis, Torralba, & Oliva, 2016). In the last decade, DNNs trained to perform visual tasks have been able to resemble, predict, and explain neural activity in the visual cortex. Different implementations of these models (varying, for example, their architecture, objective function, or training algorithm) have been compared to uncover the computational principles, algorithms, and neurobiological mechanisms behind visual processing (Richards, 2019).
To promote this line of research, new benchmarks, datasets, and challenges relevant to cognitive neuroscience experiments have been developed (Cichy, Roig, Andonian, et al., 2019; Cichy, Roig, & Oliva, 2019; Cichy et al., 2021; Schrimpf et al., 2018). However, to take full advantage of these models and frameworks, a toolbox for efficiently comparing the representational spaces of state-of-the-art DNNs and brain responses is needed. Some toolboxes have been developed to facilitate the use of DNNs, but they tend to focus only on a small subset of supervised image classification models, even though studies have shown that DNNs trained for different tasks can also provide new information about the visual cortex (Tang, LeBel, & Huth, 2021; Dwivedi, Bonner, Cichy, & Roig, 2021).
We therefore introduce Net2Brain, an easy-to-use toolbox that allows neuroscientists to efficiently incorporate over 600 DNNs, trained with different objective functions, datasets, etc., into their research. We release it as open source to promote its continual growth over time.

Related Work
In the past, deep learning models have been adopted across scientific fields to answer domain-specific questions (Raghu & Schmidt, 2020). This was greatly facilitated by open-source software that allows the straightforward usage and development of DNNs, such as PyTorch (Paszke et al., 2019), TensorFlow (Abadi et al., 2015), Caffe (Jia et al., 2014), and Keras (Chollet et al., 2015). With such a variety of libraries at hand and the increasing use of deep learning models in neuroscience research, recent toolboxes have been developed to facilitate synergy between both fields. The rsatoolbox (Nili et al., 2014) provides functions for comparing the representational space of computational models and brain responses. This software library expects the user to provide the already extracted activations of a DNN as input. BrainScore (Schrimpf et al., 2018, 2020) and THINGSvision (Muttenthaler & Hebart, 2021) are toolboxes that extend this functionality: they compute feature representations from some DNNs and compare them with brain recordings. However, these libraries implement DNNs that were mainly developed for image classification tasks. This sub-selection limits the use of this approach when examining the neural representations of humans performing other perceptual and cognitive functions. Net2Brain expands the DNNs available for comparison from supervised models trained on image classification, instance and panoptic segmentation, 3D scene understanding, and action recognition tasks, to self-supervised models (Caron et al., 2020; He, Fan, Wu, Xie, & Girshick, 2020) and multimodal DNNs (Radford et al., 2021). We further recognize the importance of video datasets, which could provide new insights into the human processing of motion and event understanding.

Net2Brain
Net2Brain is based on the ideas and goals of the Algonauts project (Cichy, Roig, Andonian, et al., 2019). The toolbox provides all the functionality needed for rapidly extracting the representations of a variety of DNNs, computing their representational dissimilarity matrices (RDMs), and comparing them to brain datasets. It employs RSA and weighted RSA to make this comparison, and provides an in-depth examination of the correlation between the representational space of brain datasets and DNNs, for specific ROIs or in searchlight fashion. In addition, Net2Brain reports on the quality of the brain recording being inspected and provides the flexibility to add new datasets and DNNs for analysis. Users can test a new hypothesis with a few commands in the command-line interface, which is well suited to hosted environments such as Google Colab, or with a few clicks in the GUI.
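The RDM and RSA steps described above can be illustrated with a minimal NumPy sketch. It is not Net2Brain's implementation: the function names (`compute_rdm`, `upper_triangle`, `rank`, `rsa_score`) are hypothetical, the choice of 1 minus Pearson correlation as the dissimilarity and of a squared Spearman correlation as the comparison score is an assumption, and the data are synthetic.

```python
import numpy as np

def compute_rdm(activations):
    """Return an (n_stimuli, n_stimuli) RDM as 1 - Pearson correlation.

    `activations` has one row per stimulus and one column per feature.
    The dissimilarity measure is an assumption made for illustration.
    """
    acts = np.asarray(activations, dtype=float)
    return 1.0 - np.corrcoef(acts)

def upper_triangle(rdm):
    """Vectorize the off-diagonal upper triangle of a symmetric RDM."""
    rows, cols = np.triu_indices(rdm.shape[0], k=1)
    return rdm[rows, cols]

def rank(values):
    """Ordinal ranks; ties are not averaged, which is fine for continuous data."""
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks

def rsa_score(model_rdm, brain_rdm):
    """Squared Spearman correlation between two RDMs (assumed metric)."""
    a = rank(upper_triangle(model_rdm))
    b = rank(upper_triangle(brain_rdm))
    r = np.corrcoef(a, b)[0, 1]
    return float(r ** 2)

# Synthetic stand-ins: 10 stimuli, 50 DNN features, 30 "voxels".
rng = np.random.default_rng(0)
layer_activations = rng.normal(size=(10, 50))
brain_patterns = layer_activations @ rng.normal(size=(50, 30))

model_rdm = compute_rdm(layer_activations)
brain_rdm = compute_rdm(brain_patterns)
score = rsa_score(model_rdm, brain_rdm)
```

Only the upper triangle enters the comparison because an RDM is symmetric with a zero diagonal, so the remaining entries carry no additional information.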
Using these models, features can be generated to be compared with available brain datasets. The evaluation function of Net2Brain allows the simultaneous comparison of the RDMs of multiple DNNs and brain datasets using RSA and weighted RSA. As an output of this step, the toolbox supplies a graph with the squared correlation coefficient per layer obtained through the analysis, along with a measure of statistical significance, and an estimate of the lower and upper noise ceiling of the brain responses. The computed data and the resulting graph are automatically stored in the filesystem to be easily accessed. The toolbox can be downloaded from GitHub (https://github.com/ToastyDom/Net2Brain.git) and also contains the fMRI and MEG datasets used in the 2019 Algonauts challenge (Cichy, Roig, Andonian, et al., 2019), provided in RDM format. Providing these datasets enables the user to immediately test the functionality of the program, and intuitively shows how to add new brain recordings to the toolbox.
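The lower and upper noise ceiling mentioned above can be estimated in several ways. The sketch below shows one common leave-one-subject-out scheme and should be read as an assumption rather than as the toolbox's exact procedure: the function name `noise_ceiling`, the use of a squared Pearson correlation, and the synthetic subject data are all illustrative.

```python
import numpy as np

def noise_ceiling(subject_rdms):
    """Estimate (lower, upper) noise ceiling from vectorized subject RDMs.

    `subject_rdms` has shape (n_subjects, n_pairs): one vectorized upper
    triangle per subject. Lower bound: each subject is correlated with the
    mean of the *other* subjects. Upper bound: each subject is correlated
    with the mean of *all* subjects, itself included.
    """
    rdms = np.asarray(subject_rdms, dtype=float)
    grand_mean = rdms.mean(axis=0)
    lower, upper = [], []
    for i in range(rdms.shape[0]):
        others_mean = np.delete(rdms, i, axis=0).mean(axis=0)
        lower.append(np.corrcoef(rdms[i], others_mean)[0, 1] ** 2)
        upper.append(np.corrcoef(rdms[i], grand_mean)[0, 1] ** 2)
    return float(np.mean(lower)), float(np.mean(upper))

# Synthetic example: 5 subjects share one signal and add independent noise.
rng = np.random.default_rng(1)
shared_signal = rng.normal(size=190)  # 190 pairs = 20 stimuli
subjects = shared_signal + 0.5 * rng.normal(size=(5, 190))
lower_bound, upper_bound = noise_ceiling(subjects)
```

A model whose score falls inside this range explains the brain RDMs about as well as the between-subject variability allows.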

Prediction of brain responses using multimodal DNNs
In the last few years, the field of deep learning has shown that DNNs trained on multi-sensory input, which are capable of creating multimodal representations, achieve better generalization and overall performance. In parallel, there is much debate in cognitive neuroscience about the multimodal nature of cortical representations, and about the idea that brain areas higher up in the hierarchy might need to encode these types of representations to carry out more abstract computations (Tang et al., 2021). Combining both fields, this hypothesis could be tested by analyzing whether brain representations are more similar to multimodal DNNs than to unimodal ones. As an exploratory study, we used Net2Brain to compare the responses of the multimodal CLIP-ResNet50 and CLIP-ViT-B/32, self-supervised DNNs trained on image-text pairs (Radford et al., 2021), and of their unimodal counterparts ResNet50 and ViT-B/32, supervised DNNs trained to perform object recognition on ImageNet, to human functional magnetic resonance imaging (fMRI) recordings from the dataset of Bonner and Epstein (2017).
As illustrated in Fig. 1, we found that the multimodal CLIP-ResNet50 predicts responses in the regions of interest (ROIs), which are displayed in Fig. 2, significantly better than its unimodal counterpart ResNet50 throughout all presented layers. This can be seen as a prelude to research that examines whether the inclusion of captions allows the encoding of spatial relations and how other modalities could improve predictability.
Another observable pattern is that, although CLIP-ViT and the standard ViT behave similarly, both predict the regions better than ResNet50. This invites further exploration of brain regions using DNNs other than CNNs, and suggests that comparing different architectures can help to understand the structure of the visual cortex.
In sum, Net2Brain facilitates investigating correlations between different DNNs and brain ROIs and reveals exciting patterns that can be further explored.
Figure 1: Prediction of brain responses using multimodal DNNs vs their unimodal counterparts in the ROIs in Fig. 2 and a table displaying the layers with the highest correlation. The range from lower to upper noise ceiling is indicated by the gray box and the asterisk above the bars indicates the significance of the calculated data. The error bar represents the standard error across subjects.
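To make the error bars and the model comparison concrete, the following sketch computes a standard error across subjects and a paired sign-flip permutation test between two models. The text does not specify which statistical test underlies the reported significance, so the test shown here is an assumption; the function names and the per-subject scores are invented for illustration and are not results from this work.

```python
import numpy as np

def standard_error(scores):
    """Standard error of the mean across subjects (sample std / sqrt(n))."""
    scores = np.asarray(scores, dtype=float)
    return float(scores.std(ddof=1) / np.sqrt(len(scores)))

def sign_flip_test(scores_a, scores_b, n_permutations=10000, seed=0):
    """Two-sided paired permutation test on per-subject score differences.

    Under the null hypothesis the sign of each subject's difference is
    arbitrary, so signs are flipped at random to build a null distribution.
    """
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = abs(diffs.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_permutations, len(diffs)))
    null = np.abs((signs * diffs).mean(axis=1))
    return float((np.sum(null >= observed) + 1) / (n_permutations + 1))

# Made-up per-subject scores for two hypothetical models (not real data).
multimodal = [0.30, 0.32, 0.28, 0.35, 0.31, 0.33, 0.29, 0.34]
unimodal = [0.21, 0.20, 0.19, 0.24, 0.22, 0.21, 0.18, 0.25]

sem_multimodal = standard_error(multimodal)
p_value = sign_flip_test(multimodal, unimodal)
```

A paired test is used because the same subjects contribute a score for both models, so between-subject variability cancels out in the differences.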

Conclusion
We have introduced Net2Brain, a toolbox for comparing the responses of artificial neural networks and the human visual cortex using representational similarity analysis. Our toolbox facilitates the adoption of DNNs in cognitive neuroscience research, lowers the knowledge barrier for newcomers who want to use these tools, and gives users the flexibility to carry out these analyses with their own computational models and brain datasets. We have also demonstrated the simplicity of using Net2Brain for testing a hypothesis from cognitive computational neuroscience. In the future, the toolbox will include more brain datasets and functions for carrying out common analyses in neuroscience research, such as variance partitioning analysis and encoding models.