Deep neural networks and visuo-semantic models explain complementary components of human ventral-stream representational dynamics

Deep neural networks (DNNs) are promising models of the cortical computations supporting human object recognition. However, despite their ability to explain a significant portion of variance in neural data, the agreement between models and brain representational dynamics is far from perfect. We address this issue by asking which representational features are currently unaccounted for in neural timeseries data, estimated for multiple areas of the ventral stream via source-reconstructed magnetoencephalography (MEG) data acquired in human participants (9 females, 6 males) during object viewing. We focus on the ability of visuo-semantic models, consisting of human-generated labels of object features and categories, to explain variance beyond the explanatory power of DNNs alone. We report a gradual reversal in the relative importance of DNN versus visuo-semantic features as ventral-stream object representations unfold over space and time. While lower-level visual areas are better explained by DNN features starting early in time (at 66 ms after stimulus onset), higher-level cortical dynamics are best accounted for by visuo-semantic features starting later in time (at 146 ms after stimulus onset). Among the visuo-semantic features, object parts and basic categories drive the advantage over DNNs. These results show that a significant component of the variance unexplained by DNNs in higher-level cortical dynamics is structured, and can be explained by readily nameable aspects of the objects. We conclude that current DNNs fail to fully capture dynamic representations in higher-level human visual cortex and suggest a path toward more accurate models of ventral stream computations.


Introduction
When we view objects in our visual environment, the neural representation of these objects dynamically unfolds over time across the cortical hierarchy of the ventral visual stream. In brain recordings from both humans and nonhuman primates, this dynamic representational unfolding can be quantified from neural population activity, showing a staggered emergence of ecologically relevant object information such as facial features, followed by object categories, and then the individuation of these inputs into specific exemplars (Sugase et al., 1999; Hung et al., 2005; Meyers et al., 2008; Carlson et al., 2013; Clarke et al., 2013; Cichy et al., 2014; Isik et al., 2014; Ghuman et al., 2014; Hebart et al., 2018; Kietzmann et al., 2019b). These neural reverberations are thought to reflect the cortical computations that support object recognition.
To better understand these computations, we enriched our modeling strategy with visuo-semantic object information. By "visuo-semantic", we mean nameable properties of visual objects. Our visuo-semantic models consist of object labels generated by human observers, describing lower-level object features such as "green", higher-level object features such as "eye", and categories such as "face". The visuo-semantic labels can be interpreted as vectors in a space defined by humans at the behavioral level. In contrast to DNNs, our visuo-semantic models are not image-computable. However, they provide unique benchmarks for comparison with image-computable models. Prior work indicates that visuo-semantic labels explain significant amounts of response variance in higher-level primate visual cortex (Tanaka, 1996; Yamane et al., 2008; Freiwald et al., 2009; Issa and DiCarlo, 2012; Kanwisher et al., 1997; Epstein and Kanwisher, 1998; Downing et al., 2001; Haxby et al., 2001; Kriegeskorte et al., 2008; Huth et al., 2012; Mur et al., 2012; Jozwik et al., 2016, 2018). Moreover, visuo-semantic models outperform DNNs (AlexNet (Krizhevsky et al., 2012) and VGG (Simonyan and Zisserman, 2014) architectures) at predicting perceived object similarity in humans (Jozwik et al., 2017). In addition, a recent functional magnetic resonance imaging (fMRI) study showed that combining DNNs with a semantic feature model is beneficial for explaining visual object representations at advanced processing stages of the ventral visual stream (Devereux et al., 2018). Given these findings, we hypothesized that visuo-semantic models capture representational features in ventral-stream neural dynamics that DNNs fail to account for.
We tested this hypothesis on temporally resolved magnetoencephalography (MEG) data, which can capture representational dynamics at a millisecond timescale. Human brain data acquired at this rapid sampling rate provide rich information about temporal dynamics, and by extension, about the underlying neural computations. For example, in an MEG study that used source reconstruction to localize time series to distinct areas of the ventral visual stream, time series analyses revealed temporal inter-dependencies between areas suggestive of recurrent information processing (Kietzmann et al., 2019b).
In this work, we used representational similarity analysis (RSA) to test both DNNs and visuo-semantic models for their ability to explain representational dynamics observed across multiple ventral stream areas in the human brain. As DNNs, we used feedforward CORnet-Z and locally recurrent CORnet-R, which are inspired by the anatomy of monkey visual cortex (Kubilius et al., 2018). As visuo-semantic models, we used existing human-generated labels of object features and categories (Jozwik et al., 2016). We analyzed previously published source-reconstructed MEG data acquired in healthy human participants while they were viewing object images from a range of categories (Kietzmann et al., 2019b; Cichy et al., 2014). We investigated three distinct stages of processing in the ventral cortical hierarchy: lower-level visual areas V1-3, intermediate visual areas V4t/LO, and higher-level visual areas IT/PHC. At each stage of processing, we tested both model classes for their ability to explain variance in the temporally evolving representations. This strategy allowed us to test what visuo-semantic object information is unaccounted for by DNNs as ventral-stream processing unfolds over space and time.

Stimuli
Stimuli were 92 colored images of real-world objects spanning a range of categories, including humans, nonhuman animals, natural objects, and manmade objects. The visuo-semantic models consist of human-generated feature and category labels for these images (Jozwik et al., 2016); model dimensions were retained such that all pairwise correlations between dimensions were below threshold. The final full feature and category models consisted of 119 and 110 dimensions, respectively.

Construction of the visuo-semantic representational dissimilarity matrices
To compare the models to the measured brain representations, the models and the data should reside in the same representational space. This motivates transforming our models to representational dissimilarity matrix (RDM) space. For each model dimension, we computed, for each pair of images, the squared difference between their values on that dimension. The squared difference reflects the dissimilarity between the two images in a pair. Given that a specific feature or category can either be present or absent in a particular image, image dissimilarities along a single model dimension are binary: they are zero if a feature or category is present or absent in both images, and one if a feature or category is present in one image but absent in the other. The dissimilarities were stored in an RDM, yielding as many RDMs as model dimensions. The full visuo-semantic model consists of 229 RDM predictors (119 feature predictors and 110 category predictors).
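Below is a minimal sketch of this construction, assuming a binary images-by-dimensions label matrix; the variable names and random example data are illustrative, not taken from the original analysis code.

```python
import numpy as np

def single_dimension_rdms(labels):
    """Build one binary RDM per model dimension.

    labels: (n_images, n_dims) array of 0/1 entries indicating whether a
    feature or category label applies to each image.
    Returns a (n_dims, n_images, n_images) stack in which entry [d, i, j]
    is the squared difference between images i and j on dimension d
    (0 or 1 for binary labels).
    """
    n_images, n_dims = labels.shape
    rdms = np.empty((n_dims, n_images, n_images))
    for d in range(n_dims):
        v = labels[:, d].astype(float)
        rdms[d] = (v[:, None] - v[None, :]) ** 2
    return rdms

# 92 images x 229 labels -> 229 single-dimension RDM predictors
labels = np.random.randint(0, 2, size=(92, 229))
rdms = single_dimension_rdms(labels)
assert rdms.shape == (229, 92, 92)
```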

Deep neural networks
CORnet-Z and CORnet-R architectures have been described in (Kubilius et al., 2018), where further details can be found.
Architecture and training-We used feedforward (CORnet-Z) and locally recurrent (CORnet-R) models in our analyses. The architectures of the two DNNs are schematically represented in Figure 1b. The architecture of CORnets is inspired by the anatomy of monkey visual cortex. Each processing stage in the model is thought to correspond to a cortical visual area, so that the four model layers correspond to areas V1, V2, V4, and IT, respectively (Kubilius et al., 2018). The output of the last model layer is mapped to the model's behavioral choices using a linear decoder. We chose the two CORnets because they have similar architectures but one is purely feedforward and the other is feedforward plus locally recurrent, they are among the best models at predicting visual responses in monkey and human IT (Kubilius et al., 2018; Jozwik et al., 2019a,b), and their architectures are relatively simple compared to other DNNs. Each "visual area" in CORnet-Z ("Zero") consists of a single convolution, followed by a ReLU nonlinearity and max pooling. CORnet-R ("Recurrent") introduces local recurrent dynamics within an area. The recurrence occurs only within an area; there are no bypass or feedback connections between areas. For each area, the input is down-scaled twofold and the number of channels is increased twofold by passing the input through a convolution, followed by group normalization (Wu and He, 2018) and a ReLU nonlinearity. The area's internal state (initially zero) is added to the result and passed through another convolution, again followed by group normalization and a ReLU nonlinearity, resulting in the new internal state of the area. At time step "t0" there is no input to "V2" and beyond, and as a consequence no image-elicited activity is present beyond "V1". From time step "t1" onwards, image-elicited activity is present in all "visual areas", as the output of the previous area is immediately propagated forward. CORnet-R was trained using five time steps ("t0"-"t4"). Both DNNs were trained on 1.2 million images from the 2012 ILSVRC database (Russakovsky et al., 2015). The ILSVRC database provides annotations that contain a category label for each image, assigning the object in an image to one out of 1,000 categories, e.g., "daisy", "macaque", and "speedboat". The networks' task is to classify each object image into one of the 1,000 categories.
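To make the recurrent update concrete, the following is a schematic PyTorch sketch of a single locally recurrent "visual area", reconstructed from the description above. It is not the published CORnet-R implementation; kernel sizes, group counts, and channel numbers are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RecurrentArea(nn.Module):
    """One locally recurrent "visual area" (schematic sketch)."""

    def __init__(self, in_ch, out_ch, groups=32):
        super().__init__()
        # Input convolution: downscale 2x spatially, double the channels
        self.conv_in = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
        self.norm_in = nn.GroupNorm(groups, out_ch)
        # State convolution: updates the area's internal state
        self.conv_state = nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1)
        self.norm_state = nn.GroupNorm(groups, out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x, state=None):
        x = self.relu(self.norm_in(self.conv_in(x)))
        if state is None:
            state = torch.zeros_like(x)  # internal state is initially zero
        # Add the internal state to the result and pass it through the second
        # conv block; the output is the area's new internal state
        state = self.relu(self.norm_state(self.conv_state(x + state)))
        return state

area = RecurrentArea(in_ch=32, out_ch=64)
x = torch.randn(1, 32, 56, 56)
state_t0 = area(x)             # first time step, zero initial state
state_t1 = area(x, state_t0)   # recurrent update at the next time step
```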

Construction of the DNN representational dissimilarity matrices
DNN representations of the 92 images were computed from the layer activations of CORnet-Z and CORnet-R. For CORnet-Z, we included the decoder layer and the final processing stage (output) from each "visual area" layer, which resulted in five layers. For CORnet-R, we included the decoder layer and the final processing stage from each "visual area" layer for each time step, which resulted in 21 layers. For each layer of CORnet-Z and CORnet-R, we extracted the unit activations in response to the images and converted these into one activation vector per image. For each pair of images, we computed the dissimilarity (1 minus Spearman's correlation) between the activation vectors. This yielded an RDM for each DNN layer. The resulting RDMs capture which stimulus information is emphasized and which is de-emphasized by the DNNs at different stages of processing.
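As a minimal sketch of this step, assuming one flattened activation vector per image (the layer size and random data are illustrative):

```python
import numpy as np
from scipy.stats import spearmanr

def layer_rdm(activations):
    """RDM (1 - Spearman correlation) from one layer's unit activations.

    activations: (n_images, n_units) array, one activation vector per image.
    """
    # Transpose so that each column is an image; spearmanr then returns the
    # full image-by-image rank-correlation matrix.
    rho, _ = spearmanr(activations.T)
    return 1.0 - rho

acts = np.random.rand(92, 1024)    # illustrative layer size
rdm = layer_rdm(acts)
assert rdm.shape == (92, 92)
```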

MEG source-reconstructed data
Acquisition and analysis of the MEG data have been described in (Cichy et al., 2014), where further details can be found. The source reconstruction of the MEG data has been described in (Kietzmann et al., 2019b), where further details can be found.
Participants-Sixteen healthy human volunteers participated in the MEG experiment (mean age = 26, 10 females). MEG source reconstruction analyses were performed for the subset of 15 participants for whom structural and functional MRI data were acquired. Participants had normal or corrected-to-normal vision. Before scanning, participants received information about the procedure of the experiment and gave written informed consent. The experiment was approved by the Massachusetts Institute of Technology Institutional Review Board and conducted in accordance with the Declaration of Helsinki.
Experimental design and task-Stimuli were presented at the center of the screen for 500 ms, while participants performed a paper clip detection task. Stimuli were overlaid with a light gray fixation cross and displayed at a width of 2.9° visual angle. Participants completed 10 to 14 runs. Each image was presented twice in every run in random order. Participants were asked to press a button and blink their eyes in response to a paper clip image shown randomly every 3 to 5 trials. These trials were excluded from further analyses. Each participant completed two MEG sessions.
MEG data acquisition and preprocessing-MEG signals were acquired from 306 channels (204 planar gradiometers, 102 magnetometers) using an Elekta Neuromag TRIUX system (Elekta) at a sampling rate of 1,000 Hz. The data were bandpass filtered between 0.03 and 330 Hz, cleaned using spatiotemporal filtering, and down-sampled to 500 Hz. Baseline correction was performed using a time window of 100 ms before stimulus onset.
MEG source reconstruction-The source reconstructions were performed using the MNE Python toolbox (Gramfort et al., 2013). We used participants' individual structural T1 scans to obtain volume conduction estimates using single-layer boundary element models (BEMs) based on the inner skull boundary. Because BEMs extracted with the FreeSurfer watershed algorithm, as originally used in the MNE Python toolbox, yielded poor reconstruction results, we instead extracted BEMs using FieldTrip. The source space consisted of 10,242 source points per hemisphere, positioned along the gray/white matter boundary as estimated via FreeSurfer. We defined source orientations as surface normals with a loose orientation constraint. For MEG/MRI alignment, we used an iterative closest point procedure based on fiducials and digitizer points along the head surface, after initial alignment based on fiducials. We estimated the sensor noise covariance matrix from the baseline period (100 ms to 0 ms before stimulus onset) and regularized it according to the Ledoit-Wolf procedure (Ledoit and Wolf, 2004). We projected source activations onto the surface normal, obtaining one activation estimate per point in source space and time. Source reconstruction allowed us to estimate temporal dynamics in specific brain regions. Note, however, that source reconstruction provides an estimate of which brain regions the signal is coming from rather than a direct measurement of representations in different brain regions (see Hauk et al., 2022, for a discussion).
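The pipeline described above can be outlined with standard MNE-Python calls. The sketch below is schematic: file names and parameter values are illustrative assumptions, and in the actual analysis the BEM surfaces came from FieldTrip rather than being computed within MNE-Python.

```python
import mne
from mne.minimum_norm import make_inverse_operator, apply_inverse

subject, subjects_dir = "sub-01", "/data/freesurfer"       # hypothetical paths

# ~10,242 source points per hemisphere along the gray/white matter boundary
src = mne.setup_source_space(subject, spacing="ico5", subjects_dir=subjects_dir)

# Single-layer BEM based on the inner skull boundary
model = mne.make_bem_model(subject, conductivity=(0.3,), subjects_dir=subjects_dir)
bem = mne.make_bem_solution(model)

evoked = mne.read_evokeds("sub-01-ave.fif")[0]             # hypothetical file
fwd = mne.make_forward_solution(evoked.info, trans="sub-01-trans.fif",
                                src=src, bem=bem)

# Noise covariance from the prestimulus baseline, Ledoit-Wolf regularized
epochs = mne.read_epochs("sub-01-epo.fif")                 # hypothetical file
noise_cov = mne.compute_covariance(epochs, tmin=-0.1, tmax=0.0,
                                   method="ledoit_wolf")

# Loose orientation constraint; activations projected onto the surface normal
inv = make_inverse_operator(evoked.info, fwd, noise_cov, loose=0.2)
stc = apply_inverse(evoked, inv, lambda2=1.0 / 9.0, method="MNE",
                    pick_ori="normal")
```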
Construction of the MEG representational dissimilarity matrices-We computed temporally changing RDM movies from the source-reconstructed MEG data for each participant, ROI, hemisphere, and session. We first extracted a trial-averaged multivariate source time series for each stimulus. We then computed an RDM at each time point by estimating the pattern distance between all pairs of images using correlation distance (1 minus Pearson correlation). The RDM movies were averaged across hemispheres and sessions, resulting in one RDM movie for each participant and ROI.
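As a minimal sketch, assuming a trial-averaged source array of shape images x sources x time (all sizes illustrative):

```python
import numpy as np

def rdm_movie(source_ts):
    """RDM at every time point (1 - Pearson correlation across sources).

    source_ts: (n_images, n_sources, n_times) trial-averaged source time
    series for one participant, ROI, hemisphere, and session.
    """
    n_images, _, n_times = source_ts.shape
    movie = np.empty((n_times, n_images, n_images))
    for t in range(n_times):
        # np.corrcoef treats rows as variables -> image-by-image matrix
        movie[t] = 1.0 - np.corrcoef(source_ts[:, :, t])
    return movie

ts = np.random.randn(92, 500, 601)   # illustrative sizes
movie = rdm_movie(ts)                # then average across hemispheres/sessions
assert movie.shape == (601, 92, 92)
```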

Evaluating and comparing model performance
To assess the performance of the models at explaining variance in the source-reconstructed MEG data, we performed first- and second-level model fitting as described below. Model fitting within the RSA framework has been described in (Khaligh-Razavi and Kriegeskorte, 2014; Jozwik et al., 2016, 2017; Storrs et al., 2020a; Kaniuth and Hebart, 2021; Kietzmann et al., 2019b), where further details can be found.

First-level model fitting: obtaining cross-validated model predictions
We could predict the brain representations by assuming that each model dimension, i.e. each visuo-semantic object label or each DNN layer, contributes equally to the representation. Our visuo-semantic models use the squared Euclidean distance as the representational dissimilarity measure, which is the sum across dimensions of the squared response differences for a given pair of stimuli. Because the squared differences simply sum across dimensions, the model prediction would be the sum of the single-dimension model RDMs. A similar reasoning applies to our DNN model, which uses the correlation distance as the representational dissimilarity measure; the correlation distance is proportional to the squared Euclidean distance between normalized patterns. However, we do not expect all model dimensions to contribute equally to brain representations. To improve model performance, we therefore linearly combined the different model dimensions to yield an object representation that best predicts the source-reconstructed MEG data. Because the squared differences sum across dimensions in the squared Euclidean distance, weighting the dimensions and then computing the RDM is equivalent to computing a weighted sum of the single-dimension RDMs: when a dimension is multiplied by weight w, the squared differences along that dimension are multiplied by w^2. We can therefore perform the fitting directly on the RDMs. We performed model fitting for the DNN model (26 predictors), the visuo-semantic model (229 predictors), and for the following visuo-semantic submodels: color (10 predictors), texture (12 predictors), shape (15 predictors), object parts (82 predictors), subordinate categories (38 predictors), basic categories (67 predictors), and superordinate categories (5 predictors). We included a constant term in each model to account for homogeneous changes in dissimilarity across the whole RDM. For each model, we estimated the model weights using regularized (L2) linear regression, implemented in MATLAB using Glmnet (https://hastie.su.domains/glmnet_matlab/). We standardized the predictors before fitting and constrained the weights to be nonnegative. To prevent biased model predictions due to overfitting to the images, model predictions were estimated by cross-validation: predictions were generated for subsets of images held out during fitting. For each cross-validation fold, we randomly selected 84 of the 92 images as the training set and eight images as the test set, with the constraint that test images had to contain four animate objects (two faces and two body parts) and four inanimate objects. We used the pairwise dissimilarities of the training images to estimate the model weights. The model weights were then used to predict the pairwise dissimilarities of the eight held-out images. This procedure was repeated until predictions were obtained for all pairwise dissimilarities. For each cross-validation fold, we determined the best regularization parameter (i.e. the one with the minimum squared error between prediction and data) using nested cross-validation on held-out images within the training set. We performed the first-level fitting procedure for each participant, ROI, and time point.
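The published analysis used the Glmnet implementation in MATLAB. As a rough Python sketch of the same kind of fit, the ridge problem with nonnegativity constraints can be solved via scipy's nonnegative least squares on an augmented design matrix. Predictor standardization and the nested cross-validation over the regularization parameter are omitted for brevity; all names and sizes are illustrative.

```python
import numpy as np
from scipy.optimize import nnls

def fit_nonneg_ridge(X, y, lam):
    """Nonnegative ridge: min ||Xw - y||^2 + lam * ||w||^2, subject to w >= 0.

    X: (n_pairs, n_predictors) vectorized single-dimension model RDMs
       (standardized, with a constant column); y: (n_pairs,) data RDM.
    Solved as nonnegative least squares on an augmented system; note that
    this simplification penalizes the constant column as well.
    """
    n_pred = X.shape[1]
    A = np.vstack([X, np.sqrt(lam) * np.eye(n_pred)])
    b = np.concatenate([y, np.zeros(n_pred)])
    w, _ = nnls(A, b)
    return w

# Cross-validation over images (schematic): estimate w from the pairs among
# the 84 training images, then predict the dissimilarities for pairs
# involving the 8 held-out images from their single-dimension RDM entries.
rng = np.random.default_rng(0)
X = rng.random((3486, 230))      # pairs among 84 training images, 229 + constant
y = rng.random(3486)
w = fit_nonneg_ridge(X, y, lam=1.0)
```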
Second-level model fitting: estimating model performance
We estimated model performance using a second-level general linear model (GLM) approach. We used the cross-validated RDM predictions from the first-level model fitting as GLM predictors and included a constant term in the GLM to account for homogeneous changes in dissimilarity across the whole RDM. We fit the GLM predictors to the source-reconstructed MEG data using nonnegative least squares. We first estimated the variance explained by each individual model when fit in isolation (reduced GLM). We next estimated the variance explained by the visuo-semantic and DNN models when fit simultaneously (full GLM). We then computed the unique variance explained by each model by subtracting the variance explained by the reduced GLM from the variance explained by the full GLM. For example, to compute the unique variance explained by the visuo-semantic model, we subtracted the variance explained by the DNN model from the variance explained by the full GLM. This approach allowed us to address whether visuo-semantic models capture representational features in ventral-stream dynamics that DNNs fail to account for, and vice versa. We also estimated the unique variance explained in the source-reconstructed MEG data by the visuo-semantic submodels in the presence of the DNN model, again by fitting a full GLM (all models included) and a reduced GLM (excluding the model of interest). We performed the second-level GLM fitting procedure for each participant, ROI, and time point.
Statistical inference on model performance
To evaluate the significance of the (unique) variance explained by each model across participants, we first subtracted an estimate of the prestimulus baseline in each participant and then performed a one-sided Wilcoxon signed-rank test against 0. The prestimulus baseline was defined as the average (unique) variance explained between 200 and 0 ms before stimulus onset. We also tested if and when the (unique) variance explained differed between the visuo-semantic and DNN models using a two-sided Wilcoxon signed-rank test. We controlled the expected false discovery rate at 0.05 across time points for each model evaluation, model comparison, and ROI. We used a continuity criterion (minimally 10 consecutive significant time points sampled every 2 ms = 20 ms) to report significant time points in the manuscript text. For completeness, Figures 2 and 3 show significant time points both before and after applying the continuity criterion. Lines shown in Figures 2 and 3 were low-pass filtered at 80 Hz (Butterworth IIR filter, order 6) for better visibility. Statistical inference is based on unsmoothed data.
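The second-level variance partitioning can be illustrated with a short sketch. The inputs below are hypothetical, and variance explained is computed here in one simple way, as 1 minus the ratio of residual variance to data variance.

```python
import numpy as np
from scipy.optimize import nnls

def variance_explained(predictors, y):
    """Variance explained by a nonnegative least-squares GLM fit."""
    w, _ = nnls(predictors, y)
    resid = y - predictors @ w
    return 1.0 - resid.var() / y.var()

# Hypothetical cross-validated RDM predictions (vectorized), one column per
# model class, plus a constant column.
rng = np.random.default_rng(0)
y = rng.random(4186)                       # 92 * 91 / 2 image pairs
dnn_pred, vs_pred = rng.random((4186, 1)), rng.random((4186, 1))
const = np.ones((4186, 1))

full = variance_explained(np.column_stack([dnn_pred, vs_pred, const]), y)
reduced_no_vs = variance_explained(np.column_stack([dnn_pred, const]), y)
unique_vs = full - reduced_no_vs           # unique contribution of the
                                           # visuo-semantic model
```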

DNNs better explain lower-level visual representations, visuo-semantic models better explain higher-level visual representations
We first evaluated the overall ability of the DNN and visuo-semantic models to explain the time course of information processing along the human ventral visual stream. We hypothesized that visuo-semantic models capture representational features in neural data that DNNs may fail to account for. Figure 1 shows an overview of our approach. We computed RDM movies from the source-reconstructed MEG data to characterize how the ventral-stream object representations evolved over time in each participant. We computed an RDM movie for each participant and ROI and explained variance in the movies using a DNN model and a visuo-semantic model. The DNN model consisted of internal object representations in layers of CORnet-Z, a purely feedforward model, and CORnet-R, a locally recurrent variant (Kubilius et al., 2018), to account for both feedforward and locally recurrent computations. The visuo-semantic model consisted of human-generated labels of object features (e.g., "brown", "furry", "round", "ear"; 119 labels) and categories (e.g., "great dane", "dog", "organism"; 110 labels) for the object images presented during the MEG experiment (Jozwik et al., 2016). We computed model predictions by linearly combining either all DNN layers or all visuo-semantic labels to best explain variance in the RDM movies across time. We evaluated the model predictions on data for images left out during fitting. For each model, we tested if and when the variance explained in the RDM movies exceeded the prestimulus baseline using a one-sided Wilcoxon signed-rank test. We also tested if and when the amounts of explained variance differed between the two models using a two-sided Wilcoxon signed-rank test. We controlled the expected false discovery rate at 0.05 across time points. We applied a continuity criterion (20 ms) for reporting results in the text.
For lower-level visual cortex (V1-3), the DNN model explained significant amounts of variance between 60 and 638 ms, and between 818 and 884 ms after stimulus onset, while the visuo-semantic model did so between 118 and 660 ms after stimulus onset (118-142 ms, 146-178 ms, 194-256 ms, 264-414 ms, 430-458 ms, 486-520 ms, 570-598 ms, 608-660 ms; Figure 2a). The DNN model explained more variance than the visuo-semantic model during the early (66-128 ms) as well as the late (422-516 ms, 520-544 ms, 820-844 ms) phases of the response. For intermediate visual cortex (V4t/LO), the DNN model explained variance predominantly between 62 and 610 ms after stimulus onset (62-562 ms, 590-610 ms, 820-848 ms, 854-874 ms, 952-976 ms), while the visuo-semantic model explained variance predominantly between 110 and 562 ms after stimulus onset (110-478 ms, 482-562 ms, 832-854 ms; Figure 2a). The amount of explained variance did not significantly differ between the two models. The results for lower-level visual cortex indicate that the DNN model outperformed the visuo-semantic model at explaining object representations during the early phase of the response (< 128 ms after stimulus onset) as well as the late phase of the response (> 422 ms after stimulus onset). In contrast, for higher-level visual cortex (IT/PHC), the visuo-semantic model outperformed the DNN model. The DNN model explained variance only between 182 and 270 ms after stimulus onset (Figure 2a). The visuo-semantic model explained variance during a longer time window, between 96 and 658 ms after stimulus onset (96-464 ms, 468-500 ms, 542-578 ms, 606-658 ms; Figure 2a). Furthermore, the visuo-semantic model explained more variance than the DNN model between 146 and 488 ms after stimulus onset (specifically 146-188 ms, 196-234 ms, 326-344 ms, 348-402 ms, 412-464 ms, 468-488 ms). In summary, the results across the ventral-stream regions show a reversal in which model best explains variance in the RDM movies: from the DNN model in lower-level visual cortex, starting at 66 ms after stimulus onset, to the visuo-semantic model in higher-level visual cortex, starting at 146 ms after stimulus onset.

Visuo-semantic models explain unique variance in higher-level visual representations
Our results suggest that DNNs and visuo-semantic models explain complementary components of human ventral-stream representational dynamics. To explicitly test this hypothesis, we assessed the unique contributions of the two models. For this, we first computed the best RDM predictions for each model class, and then used the resulting cross-validated RDM predictions in a second-level GLM in which we combined the two model classes. We computed the unique contribution of a model class by subtracting the variance explained by the reduced model (i.e. the GLM without the model class of interest) from the variance explained by the full model (including both model classes). For lower-level visual cortex (V1-3), the DNN model explained unique variance between 60 and 638 ms, and between 818 and 884 ms after stimulus onset, while the visuo-semantic model did so between 124 and 654 ms after stimulus onset (124-142 ms, 148-170 ms, 228-246 ms, 298-364 ms, 368-412 ms, 612-654 ms; Figure 2b). For intermediate visual cortex (V4t/LO), the DNN model explained unique variance predominantly between 62 and 610 ms after stimulus onset (62-558 ms, 590-610 ms, 820-848 ms, 952-976 ms), while the visuo-semantic model did so predominantly between 118 and 546 ms after stimulus onset (118-478 ms, 490-546 ms, 832-854 ms; Figure 2b). These results indicate that the DNN and visuo-semantic models each explained a significant amount of unique variance in lower-level and intermediate visual cortex compared to the baseline period. However, for lower-level visual cortex, the DNN model explained more unique variance than the visuo-semantic model during the early (66-128 ms) as well as the late phases of the response (422-516 ms, 520-544 ms, 820-844 ms). For intermediate visual cortex, the unique variance explained did not significantly differ between the two models. For higher-level visual cortex (IT/PHC), only the visuo-semantic model explained unique variance, between 104 and 640 ms after stimulus onset (specifically 104-464 ms, 468-500 ms, 542-578 ms, and 608-640 ms). Furthermore, the visuo-semantic model explained significantly more unique variance than the DNN model between 146 and 488 ms after stimulus onset (specifically 146-188 ms, 196-234 ms, 326-344 ms, 348-402 ms, 412-464 ms, 468-488 ms; Figure 2b). These results indicate that, in the context of a visuo-semantic predictor, the tested DNNs explain unique variance at lower-level but not higher-level stages of visual processing, which instead show a unique contribution of visuo-semantic models. Visuo-semantic models appear to explain components of the higher-level visual representations that DNNs fail to fully capture, starting at 146 ms after stimulus onset.

Object parts and basic categories contribute to the unique variance explained by visuo-semantic models in higher-level visual representations
To better understand which components of the visuo-semantic model contribute to explaining unique variance in the higher-level visual representations, we repeated our analyses separately for subsets of object features and subsets of categories. We grouped the visuo-semantic labels into the following subsets: color, texture, shape, and object parts, as well as subordinate, basic, and superordinate categories (Figure 1b). The dimensionality of the submodels was naturally smaller than that of the full visuo-semantic model, which consisted of 229 object labels. The number of dimensions for the submodels was as follows: color (10), texture (12), shape (15), object parts (82), subordinate categories (38), basic categories (67), and superordinate categories (5). Some of the submodels explained a similar amount of variance as the full visuo-semantic model (Figure 3a,b), which indicates that including fewer dimensions did not necessarily reduce model performance. A more in-depth understanding of the relationship between model dimensionality and performance remains an important objective for future study. Here we found that, among the object features, only object parts explained variance in higher-level visual cortex (IT/PHC) (Figure 3a). Furthermore, object parts explained unique variance in higher-level visual cortex, while the DNN model did not (Figure 3b). Among the categories, subordinate and basic categories explained variance in higher-level visual cortex (Figure 3a). Furthermore, each of these models explained unique variance in higher-level visual cortex, while the DNN model did not (Figure 3b). We next evaluated the three best predictors among the object features and categories together in the context of the DNN predictor. While object parts, subordinate categories, basic categories, and DNNs all explained variance in higher-level visual cortex, only object parts and basic categories explained unique variance (Figure 3b).

Discussion
Neural representations of visual objects dynamically unfold over time as we make sense of the visual world around us. These representational dynamics are thought to reflect the cortical computations that support human object recognition. Here we show that DNNs and human-derived visuo-semantic models explain complementary components of representational dynamics in the human ventral visual stream, estimated via source-reconstructed MEG data. We report a gradual reversal in the importance of DNN and visuo-semantic features from lower- to higher-level visual areas. DNN features explain variance over and above visuo-semantic features in lower-level visual areas V1-3 starting early in time (at 66 ms after stimulus onset). In contrast, visuo-semantic features explain variance over and above DNN features in higher-level visual areas IT/PHC starting later in time (at 146 ms after stimulus onset). Among the visuo-semantic features, object parts and basic categories drive the advantage over DNNs. Our results suggest that a significant component of the variance unexplained by DNNs in higher-level visual areas is structured, and can be explained by relatively simple, readily nameable aspects of the images. Figure 4 shows a visual summary of our results. Consistent with our hypothesis, our findings suggest that current DNNs fail to fully capture the visuo-semantic features represented in higher-level human visual cortex, and suggest a path towards more accurate models of ventral stream computations.
Our finding that DNNs outperform visuo-semantic models at explaining lower-level cortical dynamics replicates and extends prior fMRI work, which showed that DNNs explain response variance across all stages of the ventral stream while visuo-semantic models predominantly explain response variance in higher-level visual cortex (Khaligh-Razavi and Kriegeskorte, 2014; Güçlü and van Gerven, 2015; Huth et al., 2012; Jozwik et al., 2018; Devereux et al., 2018). Using source-reconstructed MEG data, we show that the advantage of DNNs over visuo-semantic models in V1-3 emerges early in time, starting within 70 ms after stimulus onset. The early advantage lasts for approximately 60 ms. During this early time window, the response is likely dominated by feedforward and local recurrent processing as opposed to top-down feedback signals from higher-level areas (Isik et al., 2014; Kietzmann et al., 2019a). DNNs also outperform visuo-semantic models in V1-3 late in time, starting around 420 ms after stimulus onset. The late advantage lasts for approximately 120 ms. Prior analysis of the same source-reconstructed MEG data showed a relative increase in the explanatory power of lower-level visual features (GIST model; Oliva and Torralba, 2001) and interspecies face clustering in V1-3 during this late time window (Kietzmann et al., 2019b). These effects were observed in the presence of a slightly elevated noise ceiling. During the late time window, the response may reflect an interplay between bottom-up stimulus processing and top-down feedback signals. Our results show the importance of analyzing temporally resolved neuroimaging data for revealing when in time competing models account for the rapid dynamic unfolding of human ventral-stream representations.
Our findings show that DNNs, despite reaching human-level performance on large-scale object recognition tasks, fail to fully capture visuo-semantic features represented in higher-level human visual cortex, in particular object parts and basic categories. Higher-level visual representations in dynamic MEG data instead more closely resemble human perceptual judgements of object properties. In line with our results, prior fMRI work showed that DNNs only adequately accounted for higher-level visual representations after adding new representational features (Khaligh-Razavi and Kriegeskorte, 2014; Devereux et al., 2018; Storrs et al., 2020a,b). The new features were either explicit semantic features (Devereux et al., 2018) or were created by linearly combining DNN features to emphasize categorical divisions observed in the higher-level visual representations, including the division between faces and nonfaces and between animate and inanimate objects (Khaligh-Razavi and Kriegeskorte, 2014; Storrs et al., 2020a). Our results show that visuo-semantic models start outperforming DNNs in higher-level visual areas around 150 ms after stimulus onset. This timeline coincides with the emergence of animate clustering in these areas (Kietzmann et al., 2019b) as well as with the emergence of conceptual object representations as reported in prior MEG work. Our results are also consistent with an earlier MEG study which showed that adding semantic features to a simpler HMAX model was beneficial for modeling object representations in visual cortex starting around 200 ms after stimulus onset (Clarke et al., 2015). DNNs may, at least in part, use different object features for object recognition than humans do. This conclusion is consistent with prior reports that DNNs rely more strongly on lower-level image features such as texture for object categorization (Geirhos et al., 2019).
While we refer to both DNNs and visuo-semantic object labels as 'models', there are substantial differences between the two. DNNs are image-computable, which means that they can compute a representation for any image. In contrast, visuo-semantic object labels are generated by human observers. How the human brain computes these labels remains unknown. This can be considered a disadvantage relative to DNNs, which are computationally explicit, i.e. we have full knowledge of their computational units and of the transformations applied to the image at each processing stage. However, it is challenging to pinpoint what these processing stages represent and how they may differ from those in humans. Visuo-semantic object labels, on the other hand, are easy to interpret. By comparing DNNs and visuo-semantic models in their ability to capture human ventral-stream representational dynamics, we can identify features in the data that DNNs fail to account for and use the outcomes to guide model improvement.
Our results can be considered consistent with theories that propose an integral role for feedback in visual perception (Rao and Ballard, 1999; Bar, 2003; Ahissar and Hochstein, 2004). As summarized in Figure 4, within the first 120 ms of stimulus processing, we observe a peak in the relative contribution of DNNs in lower-level and intermediate visual cortex, followed by a peak in the relative contribution of visuo-semantic models in higher-level visual cortex. These peaks may reflect a feedforward sweep of initial stimulus processing, which is thought to support perception of the gist of the visual scene and initial analysis of category information (Oliva and Torralba, 2001; Lowe et al., 2018; Kirchner and Thorpe, 2006; Liu et al., 2009). The initial peaks are followed by a visuo-semantic peak in intermediate visual cortex around 150 ms after stimulus onset, which appears after a period of possible feedback information flow from higher-level to intermediate visual cortex (Kietzmann et al., 2019b), and additional fluctuations in relative model performance as time unfolds. These fluctuations include a re-appearance of the advantage of DNNs over visuo-semantic models in lower-level visual cortex around 420 ms after stimulus onset. The observed sequence of events is consistent with the reverse hierarchy theory of visual perception, which proposes an initial feedforward analysis for vision at a glance, followed by explicit feedback signalling for vision with scrutiny (Ahissar and Hochstein, 2004). Future research should study visual perception under challenging viewing conditions, including occlusion and clutter, which are expected to strongly engage feedback signals and recurrent computation (Lamme and Roelfsema, 2000; O'Reilly et al., 2013; Spoerer et al., 2017; Tang et al., 2018; Kar et al., 2019; Rajaei et al., 2019; Kietzmann et al., 2019a).
Our study makes several important contributions to the existing body of work on modeling ventral-stream computations with DNNs. First, our results suggest that introducing locally recurrent connections to DNNs, to more closely match the architecture of the ventral visual stream, is not sufficient to fully capture the representational dynamics observed in higher-level human visual cortex. Second, our results tie together space and time through analysis of source-reconstructed MEG data. We show that DNNs outperform visuo-semantic models in lower-level visual areas V1-3 starting at 66 ms after image onset, while visuo-semantic models outperform DNNs in higher-level visual areas IT/PHC starting at 146 ms after image onset. Third, we show that a significant component of the unexplained variance in higher-level cortical dynamics is structured, and can be explained by readily nameable aspects of object images, specifically object parts and basic categories. In prior behavioral work using the same image set and visuo-semantic labels, we showed that category labels, but not object parts, outperformed DNNs at explaining object similarity judgements (Jozwik et al., 2017). These results suggest that, compared to responses in ventral visual cortex, behavioral similarity judgements may more strongly emphasize semantic object information (Mur et al., 2013; Jozwik et al., 2017; Groen et al., 2018). Future studies should extend this work to richer stimulus and model sets.
To build more accurate models of human ventral stream computations, we need to provide DNNs with a more human-like learning experience. Two important areas for improvement are visual diet and learning objectives. Each of these shapes the internal object representations that develop during visual learning. Humans have a rich visual diet and learn to distinguish between ecologically relevant categories at multiple levels of abstraction, including faces, humans, and animals (Mur et al., 2013; Jozwik et al., 2016). DNNs have a more constrained visual diet and are trained on category divisions that do not entirely match the ones that humans learn in the real world. For example, the most common large-scale image dataset for training DNNs with category supervision (Russakovsky et al., 2015; Khaligh-Razavi and Kriegeskorte, 2014; Güçlü and van Gerven, 2015; Cichy et al., 2017; Kubilius et al., 2018; Schrimpf et al., 2018; Jozwik et al., 2019b; Storrs et al., 2020a,b), the ILSVRC 2012 dataset (Russakovsky et al., 2015), contains subordinate categories that most humans would not be able to distinguish, including dog breeds such as "schipperke" and "groenendael", and lacks some higher-level categories relevant to humans, including "face" and "animal". The path forward is unfolding along two main directions. The first is enrichment of the visual diet of DNNs by better matching the visual variability present in the real world, for example by increasing variability in viewpoint or by training on videos instead of static images (Barbu et al., 2019; Zhuang et al., 2019). The second is to more closely match human learning objectives, for example by introducing more human-like category objectives or unsupervised objectives (Mehrer et al., 2021; Higgins et al., 2020; Zhuang et al., 2021; Konkle and Alvarez, 2020). Training DNNs on more human-like visual diets and learning objectives may give rise to representational features that more closely match the visuo-semantic features represented in human higher-level visual cortex.

Significance Statement
When we view objects such as faces and cars in our visual environment, their neural representations dynamically unfold over time at a millisecond scale. These dynamics reflect the cortical computations that support fast and robust object recognition. Deep neural networks (DNNs) have emerged as a promising framework for modeling these computations but cannot yet fully account for the neural dynamics. Using magnetoencephalography data acquired in human observers during object viewing, we show that readily nameable aspects of objects, such as "eye", "wheel", and "face", can account for variance in the neural dynamics over and above DNNs. These findings suggest that DNNs and humans may in part rely on different object features for visual recognition and provide guidelines for model improvement.

Figure 2. a) Variance explained by the DNNs (green) and visuo-semantic models (blue) in the source-reconstructed MEG data. For each model class, we fit the model predictors to the data using nonnegative least squares regression. Variance explained was computed as the variance explained by the model predictions in data for images left out during fitting. Significant variance explained is indicated by green and blue points above the graph (one-sided Wilcoxon signed-rank test, p < 0.05 corrected). Significant differences between models in variance explained are indicated by grey points above the graph (two-sided Wilcoxon signed-rank test, p < 0.05 corrected). Lighter colors indicate individually significant time points, and darker colors indicate time points that additionally satisfy a continuity criterion (minimally 20 ms of consecutive significant time points). The shaded area around the lines shows the standard error of the mean across participants. The x axis shows time relative to stimulus onset. The gray horizontal bar on the x axis indicates the stimulus duration. b) Unique variance explained by the DNNs and visuo-semantic models in the source-reconstructed MEG data. To estimate the unique variance explained by each model class, we used a second-level general linear model (GLM) and fit the cross-validated model predictions to the data using nonnegative least squares. Unique variance explained was computed by subtracting the variance explained by the reduced GLM (excluding the model class of interest) from the total variance explained by the full GLM (including both model classes). Conventions are the same as in panel a.

Figure 4. To summarize our findings, we computed a model difference score based on the results shown in Figure 2b. We subtracted the unique variance explained by the visuo-semantic models from that explained by the DNNs in the dynamic ventral-stream representations. Difference scores are shown for each ROI during the first 600 ms of stimulus processing. Results show a gradual reversal in the relative importance of DNN versus visuo-semantic features in explaining the visual representations as they unfold over space and time. Between 66 and 128 ms after stimulus onset, DNNs outperform visuo-semantic models in lower-level areas V1-3 (grey line, positive deflection). This early time window is thought to be dominated by feedforward and local recurrent processing. In contrast, starting 146 ms after stimulus onset, visuo-semantic models outperform DNNs in higher-level visual areas IT/PHC (red line, negative deflection).
The same pattern of complementary contributions of DNNs and visuo-semantic models seems to re-appear during the late phase of the response, starting around 400 ms after stimulus onset, when responses may reflect interactions between visual areas. These results show that DNNs fail to account for a significant component of variance in higher-level cortical dynamics, which is instead accounted for by visuo-semantic features, in particular object parts and basic categories. The peak of visuo-semantic model performance in higher-level areas (red vertical line) precedes the peak in intermediate areas (blue vertical line). This sequence of events aligns with the timing of possible feedback information flow from higher-level to intermediate areas (light grey rectangle and arrow) as reported in (Kietzmann et al., 2019b). The shaded area around the lines shows the standard error of the mean across participants.