Manipulating and measuring variation in deep neural network (DNN) representations of objects

We explore how DNNs can be used to develop a computational understanding of individual differences in high-level visual cognition given their ability to generate rich meaningful object representations informed by their architecture, experience, and training protocols. As a first step to quantifying individual differences in DNN representations, we systematically explored the robustness of a variety of representational similarity measures: Representational Similarity Analysis (RSA), Centered Kernel Alignment (CKA), and Projection-Weighted Canonical Correlation Analysis (PWCCA), with an eye to how these measures are used in cognitive science, cognitive neuroscience, and vision science. To manipulate object representations, we next created a large set of models varying in random initial weights and random training image order, training image frequencies, training category frequencies, and model size and architecture, and measured the representational variation caused by each manipulation. We examined both small (All-CNN-C) and commonly used large (VGG and ResNet) DNN architectures. To provide a comparison for the magnitude of representational differences, we established a baseline based on the representational variation caused by the image-augmentation techniques used to train those DNNs. We found that representational variation due to model randomization and model size never exceeded this baseline. By contrast, differences in training image frequency and training category frequency caused representational variation that exceeded baseline, with training category frequency manipulations exceeding baseline earlier in the networks. These findings provide insights into the magnitude of representational variation that can be expected from a range of manipulations and provide a springboard for further exploration of systematic model variations aimed at modeling individual differences in high-level visual cognition.

Deep neural networks (DNNs) have emerged as useful tools for understanding visual cognition. While DNNs were inspired by the neural networks first developed to understand the brain (e.g., McClelland et al., 1986; Rosenblatt, 1958) and have features important for biological vision, such as image-computability, hierarchical structure, convolution and pooling operations, and increasing receptive field sizes, these state-of-the-art DNN models were not intended to be models of the brain. Their goal was to solve object classification tasks for computer vision (Krizhevsky et al., 2012; Russakovsky et al., 2015), and they were generalized to tasks like object localization (e.g., Redmon et al., 2016) and image generation (e.g., Goodfellow et al., 2020). Nonetheless, these DNN models have become useful models for capturing high-level object representations in the ventral visual stream (e.g., Cichy et al., 2016; Khaligh-Razavi & Kriegeskorte, 2014; Yamins et al., 2014; but see Ayzenberg & Behrmann, 2022; see also Sexton & Love, 2022). Interest in using DNNs to understand human and non-human primate vision has rapidly expanded in recent years, with vision researchers applying them to ask questions about perceptual invariance (e.g., Nanda et al., 2021; Rajaei et al., 2019), expertise effects (e.g., Blauch et al., 2021; Xu et al., 2021), categorization (e.g., Guest & Love, 2019; Lake et al., 2015), and visual search (e.g., Eckstein et al., 2017; Yu et al., 2019). Even when accounting for potential caveats in using DNNs as models of biological vision (e.g., Baker et al., 2020; Heinke et al., 2021; Tuli et al., 2021), DNNs prove to be a valuable tool in the wider theoretical modeling toolkit.
In parallel with the recent growth in using DNNs, there has been recent growth in using individual differences techniques to characterize high-level visual cognition. Individual differences approaches seek to leverage the natural variation in a population to explain the organization of underlying mechanisms (Wilmer, 2008). Bivariate correlation approaches, alongside more complex statistical techniques like factor analysis and structural equation modeling (Loehlin, 2004), are used to search for associations between distinct tasks (suggesting some overlapping mechanisms) and dissociations between similar tasks (suggesting some distinct mechanisms). With the development of reliable high-level visual tests (e.g., Dennett et al., 2012; Duchaine & Nakayama, 2006; Growns et al., 2022; Richler et al., 2017; Smithson et al., 2023), a growing web of associations and dissociations has demonstrated the existence of a general object recognition ability (Richler et al., 2019), how these domain-general abilities may interact with domain-specific visual recognition abilities (e.g., Chang & Gauthier, 2021; Jastrzębski et al., 2021; Ventura et al., 2022), and how individual differences in visual recognition ability may be related to non-visual recognition abilities (Chow et al., 2022). However, individual differences techniques can only suggest potential overlapping and distinct representations and processes, not what those representations and processes might be, how they might vary, and why.
Cognitive modeling can answer questions about mechanisms using highly constrained cognitive models that have explainable components and interpretable parameters. These models explain mechanisms and account for behavior across a wide range of tasks, such as memory (e.g., Polyn et al., 2009), categorization (e.g., Nosofsky & Palmeri, 1997), decision making (e.g., Ratcliff & Rouder, 1998), and skilled action (e.g., Logan, 2018). These models are often evaluated at the individual subject level and account for individual differences (Shen & Palmeri, 2016). Despite their success, these models often fall short in modeling the mechanisms converting visual images into high-level object representations. In most cognitive modeling work, these representational processes are either obviated by using simple experimenter-controlled stimuli with low-dimensional psychophysical representations, or abstracted away by using statistical approaches to simulate representations or psychological scaling procedures to estimate complex object representations. Unlike DNN models, cognitive models are typically not image computable.
A potentially fruitful approach to developing a mechanistic understanding of individual differences in visual cognition over a wide range of tasks (Shen & Palmeri, 2016) would combine complex object representations produced by an image-computable DNN with cognitive models of visual memory, knowledge, and decision making (e.g., Annis et al., 2020; Battleday et al., 2020; Sanders & Nosofsky, 2020). Individual differences in performance across tasks would be predicted by variation in modeled representations from a DNN and variation in modeled processes from a cognitive model (Shen & Palmeri, 2016). Annis et al. (2020) used DNNs to generate object representations of complex novel objects as input to a cognitive model to account for errors and response times in an object matching task. In their case, individual differences in performance were modeled solely within the cognitive model via variation in its parameters. Potential variation in representations from the DNN was not considered because multiple trained DNNs were not available and because it was not feasible, either practically or technically, to train multiple (large) DNNs. It was also not entirely clear how to incorporate individual differences into DNNs to produce variation in object representations of sufficient magnitude to account for individual differences in performance.
Given that context, our focus in this article is on manipulating and measuring variation in DNN representations. The insights gained here can then be used in future work to consider how variation in DNN representations might account for individual differences in visual cognition; they can also potentially be used in future work to understand variation in neural representations.
DNNs can be used as a sandbox of sorts to test possible sources of individual differences in object representations, which may then manifest as individual differences in behaviors. Perhaps some of the differences in representations are merely idiosyncratic. On average, people might be exposed to the same distribution of objects, but there are random differences in the initial conditions of their neural substrates prior to learning: in a DNN, this would constitute different randomizations of the initial weights and biases in different trained networks. Or people might be exposed to the same distribution of objects, but there are random differences in the order in which they experience them: in a DNN, this would constitute different random orders of training images for different trained networks.
Perhaps some of the differences in representations are because of differences in experience. People might be exposed to different objects from the same categories: in a DNN, the sampling and frequency of objects from trained categories might vary across different trained networks. Or people might be exposed to a different distribution of categories: in a DNN, some categories of objects might be experienced more frequently than other categories for different trained networks. And perhaps, of course, there are qualitative or quantitative differences in the neural substrates supporting object representations: in a DNN, this would mean that differences between people could be simulated by qualitative or quantitative differences in neural network architectures (see Chow, 2023).
Individual differences are not merely statistical variation. They are typically more than the variation one might observe testing the same individual twice. They are also typically stable in that someone performing poorly on a task on one occasion will likely perform poorly on that task on another occasion. So in addition to measuring how different types of variation in a DNN manifest as differences in representations, we develop a baseline for assessing whether a manipulation has produced differences of sufficient magnitude to be considered meaningful individual differences: meaningful in the sense of being more than just random fluctuations, say from variation in how an image falls on the retina, and of a magnitude that might be considered something more akin to a stable trait rather than a state. The kinds of meaningful individual differences we seek to explain are those that consistently affect behavior in high-level visual cognition tasks as a result of variation in traits like neural morphology or development, or variation in experience, and that result in differences in object representations of a magnitude beyond that of the state differences causing variation in performance.
Fundamental to using DNNs to study individual differences is measuring differences in representations between different trained networks subject to different kinds of manipulated variation. Whether comparing the representations between two DNNs, between a DNN and voxels or neurons in an area of cortex, or between an area of cortex in one brain and an area of cortex in another brain, one common approach is as follows (e.g., Kriegeskorte, 2008): Each system (a model or an organism) is presented the same set of inputs (images of a set of objects in our case) and the representations they form for each input (activations from a layer of a DNN, voxels or neurons in an area of cortex) are recorded. The result for each system is a representation matrix, in which the rows index the different inputs (images of objects) and the columns index the representations produced for each input (across units in a layer of a DNN, across voxels or neurons in an area of cortex). To compare the representations produced by two different systems, the different matrices must be compared. How similar or different are they? Despite the inputs to the two systems being the same, it is likely that the two representation matrices are not commensurate with one another. The representations produced by the two systems could be completely different, whether in the number of features (number of units in a DNN layer, number of voxels or neurons in an area of cortex) or in that their features encode completely different kinds of information (even if they happen to have the same number of features).
Measures of representational similarity are used in cognitive neuroscience and machine learning to solve the problem of comparing representations produced by two different systems. A challenge we and other researchers face is picking the right method to measure this similarity. Representational Similarity Analysis (RSA; Kriegeskorte, 2008), Centered Kernel Alignment (CKA; Kornblith et al., 2019), and Projection-Weighted Canonical Correlation Analysis (PWCCA; Morcos et al., 2018; Raghu et al., 2017) are used by different researchers and different literatures, are often described as providing converging evidence, and sometimes are described as if they are largely interchangeable with one another (e.g., Mehrer et al., 2020; van Assen et al., 2020), despite yielding distinct results and having important practical differences in their usage. Having an appropriate measure of representational similarity is foundational for us to characterize the factors that cause variation in representations produced by DNNs and ultimately to use these factors in mechanistic models of individual differences in visual cognition.
In the first section of this article, we evaluated the various similarity measures for DNN representations in terms of their effectiveness, robustness, and practicality, with a particular eye towards testing hypotheses about the mechanisms underlying individual differences in visual cognition.
In the second section, armed with a preferred measure of representational similarity, we measured variation in representations produced by DNNs subject to a range of manipulations.In comparing the impacts of different kinds of manipulations on the resulting representations, we developed and justified a set of image augmentation baselines for determining what level of variation in DNN representations would be sufficient to be responsible for consistent individual differences in visual cognition.

Evaluating measures of representational similarity
What makes a measure good for quantifying the differences in object representations produced by two different DNNs? When we speak of object representations in a DNN, we are considering the pattern of unit activations along one particular layer of a DNN, whether it be the first layer after the image input, an intermediate layer, or the penultimate layer before the classification layer. Each image input to the DNN generates a pattern of activation along a layer, and the concatenation of multiple patterns from multiple images constitutes the DNN's representation matrix for those images. These representation matrices contain the features from the layer that represent each image. Given two representation matrices generated from the same images but different DNNs, we need to calculate a scalar similarity score. When comparing representation matrices from two different DNNs, the number of features may not match across models, and even if they did, each feature may not represent the same information. Simply measuring feature-wise similarity of representation matrices produced by the two DNNs does not work.
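For concreteness, extracting a representation matrix from a trained network can be sketched in a few lines of TensorFlow/Keras (the framework we used); the model, layer name, and image array below are hypothetical placeholders rather than our actual training code:

```python
import tensorflow as tf

def representation_matrix(model, layer_name, images):
    """Return an (n_images, n_features) representation matrix for one layer.

    `model` is any trained tf.keras model; `layer_name` and `images` are
    placeholders for a layer of interest and a stimulus set.
    """
    extractor = tf.keras.Model(inputs=model.input,
                               outputs=model.get_layer(layer_name).output)
    activations = extractor.predict(images, verbose=0)
    # Flatten any spatial dimensions so each image yields one feature vector.
    return activations.reshape(len(images), -1)
```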
Representational Similarity Analysis (RSA), Centered Kernel Alignment (CKA), and Projection-Weighted Canonical Correlation Analysis (PWCCA) have all been used to measure similarity in representations. RSA is commonly used in cognitive neuroscience to measure similarity between DNN representations and cortical representations or between DNN representations and behavior. CKA and PWCCA are commonly used in machine learning to measure similarity between DNN representations. While these measures are often characterized as largely interchangeable measures of similarity, they differ mathematically and in their practical utility, and there is mixed evidence regarding which of these measures performs better overall (Ding et al., 2021; Kornblith et al., 2019; Williams et al., 2021). For studying visual cognition, especially individual differences, which of these measures should be used? This first part of our article aims to help answer that question. A brief description of each measure is provided below, with a more detailed description of each in Appendix A.
RSA creates representational dissimilarity matrices (RDMs) for each DNN, formed by calculating the pairwise dissimilarity between image representation vectors in the representation matrices (Kriegeskorte, 2008). The dissimilarities that make up an RDM can be formed using any metric that compares pairs of vectors. We tested three metrics variously used in the literature: Pearson correlation distance, Spearman correlation distance, and Euclidean distance. The final similarity score was calculated by comparing the RDMs. We followed one common convention (Kriegeskorte, 2008; Mehrer et al., 2020) and used a squared Pearson correlation to compare RDMs.
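As an illustrative sketch of this pipeline (not our exact code), the following assumes numpy/scipy and two hypothetical (n_images, n_features) matrices rep_a and rep_b:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

def rdm_vector(rep, metric="euclidean"):
    """Upper triangle of the RDM: pairwise dissimilarities between image vectors.

    metric="euclidean" gives Euclidean RSA; metric="correlation" gives Pearson
    correlation distance; rank-transforming each row first and then using
    "correlation" approximates Spearman correlation distance.
    """
    return pdist(rep, metric=metric)

def rsa_similarity(rep_a, rep_b, metric="euclidean"):
    """Squared Pearson correlation between the two RDMs (Kriegeskorte, 2008)."""
    r, _ = pearsonr(rdm_vector(rep_a, metric), rdm_vector(rep_b, metric))
    return r ** 2
```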
CKA (Kornblith et al., 2019) is commonly characterized as simply another measure of representational similarity, but it follows from RSA. Following Kornblith et al. (2019), we used the dot product between representation matrices to form RDMs and the normalized Hilbert-Schmidt independence criterion (Gretton et al., 2005) to compare the RDMs.
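For the linear (dot-product) variant, CKA reduces to a few matrix operations. A minimal sketch following the Kornblith et al. (2019) formulation, using the same hypothetical matrices as above:

```python
import numpy as np

def linear_cka(rep_a, rep_b):
    """Linear CKA between two (n_images, n_features) representation matrices.

    Centering each feature and normalizing the Hilbert-Schmidt independence
    criterion yields a similarity score between 0 and 1.
    """
    a = rep_a - rep_a.mean(axis=0, keepdims=True)
    b = rep_b - rep_b.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(b.T @ a, ord="fro") ** 2          # ||B^T A||_F^2
    return hsic / (np.linalg.norm(a.T @ a, ord="fro") *
                   np.linalg.norm(b.T @ b, ord="fro"))
```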
PWCCA searches for successive linear transformations of each of the representation matrices that maximize the correlation between the matrices. This results in a collection of correlation coefficients that can be summarized into a single similarity value. We implemented the projection-weighted version of canonical correlation analysis to summarize the correlations, where the weights were determined by how much of the original representation was maintained after the linear transformation was applied (Morcos et al., 2018). PWCCA requires that the number of features be less than the number of images and therefore imposes an important practical limitation that we will detail later in our simulations.
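A compact sketch of PWCCA, assuming centered, full-column-rank matrices with more images than features; the projection weighting follows Morcos et al. (2018), but this is an illustrative reimplementation, not the reference code:

```python
import numpy as np

def pwcca(rep_a, rep_b):
    """Projection-weighted CCA similarity (sketch; requires n_images > n_features)."""
    a = rep_a - rep_a.mean(axis=0)
    b = rep_b - rep_b.mean(axis=0)
    qa, _ = np.linalg.qr(a)                  # orthonormal basis for each view
    qb, _ = np.linalg.qr(b)
    u, rho, _ = np.linalg.svd(qa.T @ qb)     # rho: canonical correlations
    h = qa @ u                               # canonical variates for view A
    # Weight each canonical direction by how much of A it accounts for.
    alpha = np.abs(h.T @ a).sum(axis=1)
    alpha = alpha / alpha.sum()
    k = min(a.shape[1], b.shape[1])
    return float(np.sum(alpha[:k] * rho[:k]))
```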
The measures we chose to test here do not exhaustively represent all of the possible ways to measure representational similarity (Sucholutsky et al., 2023). Other methods are available that could be better suited to compare representations from different types of sources (e.g., DNN to neural representations with linear fits; Schrimpf et al., 2018; Yamins et al., 2014), to directly align units from different DNNs (e.g., orthogonal Procrustes; Ding et al., 2021), or to focus on the descriptive statistics of representations between DNNs (for a recent review see Klabunde et al., 2023). Our focus here is on the methods commonly used to compare representation matrices between DNNs and on their potential use in characterizing individual differences across DNNs. The measures should be relatively computationally and data efficient (so as to be able to calculate similarity across many "individuals") and fit well with the goal of comparing perceptual representations (as opposed to semantic or relational representations).
To summarize, we evaluated variants of RSA assuming three different vector dissimilarity metrics (Euclidean, Pearson, and Spearman) as well as versions of CKA and PWCCA as outlined above, with an eye towards how these measures can be used in the context of studying individual differences in visual cognition. Some of these evaluations involved representations produced by different trained DNNs or by different layers of the same trained DNN and measuring representational similarity. Other evaluations involved simulated representation matrices that approximated the statistical distributions of real representation matrices produced by a DNN and measuring representational similarity after some manipulation. For consistency, we will use s to generally represent a scalar similarity score regardless of the measure. We specifically evaluated the different measures with the following tests: (1) a layer correspondence test, asking whether a measure indicates that the same layer in two different DNNs trained on the same image set are maximally similar to one another; (2) an image:feature ratio test, asking how a measure responds to changes in the ratio of the number of test images to the number of features in a representation matrix; (3) a permutation invariance test, asking whether a measure is invariant to consistently permuting the order of features in a representation matrix (preserving representational structure); (4) a randomized shuffling sensitivity test, asking whether a measure is sensitive to random shuffling of features in a representation matrix (destroying representational structure); (5) a robustness to noise test, asking how measures respond to the presence of added noise in representations; and (6) a robustness to feature loss test, asking how measures respond to the removal of features in a representation. We performed all of these tests on representation matrices produced by small DNNs, as we needed many instances of a trained model in the correspondence test and many repetitions in the other tests; because larger models are more commonly used in visual cognition, we also performed several tests with representation matrices from large models where practically possible (where a single instance of a pretrained network was sufficient for the test and where the output representation matrices were not too large for the measure to be calculated).
We start with a brief outline of the methods used to generate the representation matrices for these tests. Then, the details of each test and the results obtained will be presented together, one at a time.

General DNN methods

Models and datasets
We tested all measures on representation matrices from small models. We generated representation matrices from the All-CNN-C architecture (Springenberg et al., 2014). The All-CNN-C architecture primarily uses convolutional layers with non-linear activation functions (rectified linear units, ReLUs) and increasing receptive fields with feature bottlenecks, which are common building blocks of convolutional DNNs. These models were trained on CIFAR-10 (Krizhevsky & Hinton, 2009). When generating representation matrices from these models for testing, we randomly selected 100 images from each of the ten categories of the CIFAR-10 test set for a total of 1000 images.
We additionally tested each similarity measure on large pretrained models. Because we needed multiple instances of the same model for the layer correspondence test (details below), we used pretrained models (Mehrer et al., 2021) trained on ImageNet-1K (Russakovsky et al., 2015). A set of 10 models used the AlexNet architecture (Krizhevsky et al., 2012) and another set of 10 used the vNet architecture. Both used the same feedforward convolutional DNN architecture style, with similar building blocks as our smaller networks. For the other simulation tests, we chose models with more architectural diversity: VGG16 (Simonyan & Zisserman, 2014; for parameter count) and ResNet50 (He et al., 2015; for architecture complexity). Both models were trained on ImageNet-1K. The exact details of these architectures are not of critical importance, only that they are large models that use larger images to generate much larger representation matrices than the All-CNN-C architecture. When generating representation matrices from these models for testing, we randomly selected one image from each of the 1000 categories of the ILSVRC-2016 test set for a total of 1000 images. Some simulations cannot be practically applied to representation matrices from these models (as detailed later).
We implemented and trained our models in TensorFlow.

Simulating representations
In some tests, we simulated vectors of representations. To approximate the statistical properties of real representation matrices from our DNNs, we generated representation matrices given our testing datasets from the penultimate layer in instances of our models and then used bootstrap resampling to create many simulated representation matrices. These procedures meant that we had simulated representation matrices that matched the statistics of real DNN representation matrices but did not necessarily encode information about specific images. To ensure that our results were not specific to a particular randomization of representations, we repeated the resampling process 1000 times for each simulation and each measure we tested to index uncertainty in the results.
All-CNN-C-simulated representation matrices came from the penultimate layer, a convolutional layer with global average pooling applied to it, with 10 features. VGG16-simulated representation matrices came from the penultimate layer, a fully-connected layer with ReLUs, with 4096 features. ResNet50-simulated representation matrices came from the penultimate layer, a convolutional layer with a global average pooling operation applied to it, with 2048 features. For the larger models, we repeated the resampling procedures 10 times for each simulation.
The use of bootstrapped simulated representations ensured that our representations followed the statistical properties of real representations, including the range of possible values (e.g., activations from ReLUs having a lower bound of 0), the distribution of values (e.g., most values are relatively small, with few large values), and the number of features overall. These procedures improve the generalizability of our findings to real use cases and are critical in some simulations.
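One plausible implementation of this resampling, assuming entry-wise bootstrap sampling from a real representation matrix (the exact resampling unit is a simplifying assumption here):

```python
import numpy as np

def bootstrap_representation(real_rep, n_images=None, n_features=None, seed=None):
    """Simulate a representation matrix by resampling entries of a real one.

    The simulated matrix matches the marginal statistics of `real_rep`
    (e.g., the ReLU floor at 0 and the heavy right tail) without encoding
    information about specific images.
    """
    rng = np.random.default_rng(seed)
    shape = (n_images or real_rep.shape[0], n_features or real_rep.shape[1])
    return rng.choice(real_rep.ravel(), size=shape, replace=True)
```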

Layer correspondence test
In this test we asked whether a measure can match corresponding layers between two trained DNNs with the same architecture. For example, consider two instances of All-CNN-C trained on the same dataset, differing only in the randomization of the initial weights. This is known to cause measurable differences in representations, despite different networks performing similarly on the trained task (Mehrer et al., 2020). Regardless of differences in representations, the corresponding layers between instances of the same architecture should arguably still be the most similar, relative to other layers (Kornblith et al., 2019).
We tested each measure using a layer correspondence test proposed by Kornblith et al. (2019). Given a DNN, we generated representation matrices from a layer and then measured the similarity of those representation matrices to representation matrices from every layer of another instance of a DNN with the same architecture. A measure would succeed on the correspondence test if the most similar representation matrices were from the same layer in both instances of the network. That is not to say that no other layers could be similar, but that the most similar layers should be the corresponding ones between instances. The layer correspondence test was not applied to the VGG and ResNet models because we did not have multiple instances with identical architecture/training procedures for those large model architectures.
Fig. 1 illustrates the confusion matrices between layers using each of our measures, where each axis represents the model layers (1–10) of two compared All-CNN-C networks, with each cell representing the proportion of comparisons in which those layers were matched with the highest similarity score. For 100 trained networks, we measured the proportion of comparisons where the representation in layer p of one network was most similar (per the representational similarity measure being used) to layer q of another network. Each cell (p, q) is color coded according to that proportion. A perfect confusion matrix would include only ones (100% correspondence) along the diagonal with zeroes everywhere else. Pearson RSA, Euclidean RSA, and CKA yielded nearly perfect confusion matrices, with >99% of corresponding layers matched. Spearman RSA performed worse, with 79% of corresponding layers matched, particularly in the intermediate layers. Notably, PWCCA performed considerably worse than the others, with only 24% of the corresponding layers matched. Early and intermediate layers were often most similar to layer 1 and layer 4. These are the first layers within a block of convolutional layers, demarcated by dropout and pooling layers; a similar pattern was seen in the final convolutional layers.
As before, Fig. 2 illustrates the confusion matrices between layers for the AlexNet and vNet architectures using the three RSA variants. All variations of RSA matched most of the corresponding layers (>97% of layers matched) on both model architectures. We could not perform the layer correspondence test with PWCCA and CKA on the larger models, as these models often have more features than the number of images we used to perform the test (details below). We could not perform the layer correspondence test with the VGG and ResNet models, as we only had one instance of each model architecture.

Image:feature ratio sensitivity test
In this simulation test, we asked how sensitive each measure is to the ratio between the number of images used in the test and the number of features in a representation matrix. Cognitive science, cognitive neuroscience, and vision science applications often have limited image datasets constrained by what is practical for data collection (from humans or non-human primates); datasets usually include fewer than 1000 images (sometimes far fewer). Considering that the penultimate layer of large DNNs has thousands of units (e.g., 4096 in VGG16), the number of features in their output representation matrices often greatly exceeds the number of images used in visual cognition research. Knowing whether a measure can adequately measure representational similarity given a particular image:feature ratio is important.
We generated simulated representation matrices by systematically varying the number of features using bootstrap random sampling from features produced by an instance of our All-CNN-C model. We created simulated random representation matrices ranging from 950 features to 50 features (in steps of 50; each step repeated 100 times), while holding the number of images constant at 1000. This resulted in simulated representation matrices with image:feature ratios ranging from nearly 1:1 to 20:1. We also systematically varied the number of test images in the same range, while holding the number of features constant at 1000 with the same bootstrap sampling procedures. This resulted in image:feature ratios from nearly 1:1 to 1:20. We measured the similarity (using each measure) between pairs of differently randomized representation matrices at the same image:feature ratio. Because all representation matrices were randomly sampled using a bootstrapping procedure, we expected the similarity score to be zero no matter the image:feature ratio.
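With the hypothetical helpers sketched earlier (bootstrap_representation, rsa_similarity, linear_cka, and a real representation matrix real_rep), the manipulation amounts to a loop over feature counts:

```python
# Sweep the image:feature ratio from ~1:1 to 20:1 with 1000 images held fixed.
for n_feat in range(950, 0, -50):
    a = bootstrap_representation(real_rep, n_images=1000, n_features=n_feat, seed=1)
    b = bootstrap_representation(real_rep, n_images=1000, n_features=n_feat, seed=2)
    # Randomized matrices share no structure, so an ideal measure returns ~0.
    print(1000 / n_feat, rsa_similarity(a, b), linear_cka(a, b))
```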
Fig. 3 shows similarity scores between pairs of bootstrap-simulated representations (y-axis) across a range of image:feature ratios between 1:1 and 20:1 (x-axis) for each measure (lines). PWCCA (yellow line) and CKA (orange line) appear to be sensitive to the image:feature ratio, while the RSA measures (blue line; collapsed across all variations of RSA because the results were essentially identical) were invariant.

Fig. 3. Results for the image:feature ratio sensitivity test in small (All-CNN-C) models. Average similarity scores (y-axis) are calculated using each measure (RSA, PWCCA, CKA, represented by colored lines) between pairs of randomized representation matrices varying in the ratio between images and features (x-axis); the three RSA variants had the same results and are represented by the same line. The number of test images was held constant at 1000 and the number of simulated features ranged from 950 (nearly 1:1 ratio) to 50 (20:1 ratio). Ideally, the similarity scores should remain at 0 regardless of the ratio. There was little variability in the results, therefore uncertainty bands are omitted.
It is known (though perhaps not widely known) that PWCCA is very sensitive to the ratio between the number of images and the number of features (Raghu et al., 2017). As this ratio approaches equality, PWCCA greatly overestimates similarity. Further, when the number of features exceeds the number of images, PWCCA cannot be calculated. As expected, with PWCCA, similarity scores approached s = 1.0 near the 1:1 ratio despite comparing completely randomized representation matrices that should have zero similarity.
CKA was also sensitive to the image:feature ratio, only beginning to asymptote at a 10:1 image:feature ratio (s = 0.09), and even at 20:1 it yielded only s = 0.05 (not s = 0.0). This poses a significant problem in using CKA to measure differences in representation matrices between DNNs using the datasets commonly found in visual cognition research. Using CKA, measures of representational similarity could be vastly overestimated if there are too few images relative to the size of the DNN.
Unsurprisingly, these results (not shown in Fig. 3) held true when manipulating the number of images (holding the number of features constant at 1000) with the ratio ranging from nearly 1:1 to 1:20. In this range, PWCCA is impossible to calculate. As above, CKA erroneously yielded s = 1.0 throughout the range because the number of features exceeded the number of images. The only way to use PWCCA or CKA in most applications in visual cognition would be to apply some kind of dimensionality reduction to the feature space.
RSA (irrespective of whether a Euclidean, Pearson, or Spearman metric was used) was invariant to the image:feature ratio. As shown in Fig. 3, RSA yielded s = 0.0 (as expected for randomized representations) across image:feature ratios; this also held across ratios ranging from 1:1 to 1:20.
Based on these results, for the remaining tests we did not test PWCCA and CKA on representation matrices from the larger models, as the penultimate layers of these models had more units than the number of images we used. Using PWCCA and CKA with these models would likely yield high similarity no matter the test because of how many features they produce relative to the number of images we used.

Permutation invariance test
A successful measure of representational similarity needs to be invariant to feature order permutations (permutations of the columns in a representation matrix) because in that case the representational structure is preserved. As a permutation invariance test, we permuted the order of features (columns) of simulated representation matrices while maintaining the order of the images (rows). The similarity between the original simulated representation matrices and the permuted representation matrices was calculated; the correct result would be perfect representational similarity (s = 1).
For each bootstrap-simulated representation, we performed the permutation invariance test by comparing it with the same simulated representation matrix with the feature order (columns) permuted. This ensured that our results were not specific to the content of the representation matrices or to the feature order.
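With the helpers sketched earlier and a hypothetical matrix rep, the permutation test is a one-line manipulation:

```python
import numpy as np

rng = np.random.default_rng(0)
permuted = rep[:, rng.permutation(rep.shape[1])]  # reorder columns, keep structure
print(rsa_similarity(rep, permuted))              # should be ~1 for a good measure
```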
All measures succeeded in the permutation invariance test for representation matrices from the All-CNN-C model. All measures also succeeded in this test for ResNet50-simulated representations. However, Spearman RSA yielded s = 0 when the features were permuted in the VGG16-simulated representations. This is because these representation matrices were from a fully-connected layer with the rectified linear activation function, which resulted in many zeros in the representations. These zeros caused ties in ranks that disrupted the calculation of a Spearman correlation when forming the RDMs. This did not occur for the other simulated representation matrices, as they were extracted from a convolutional layer with global average pooling applied, resulting in few if any values at exactly 0.

Randomized shuffling sensitivity test
On the opposite end from permutation invariance, if all the units in a representation matrix are randomly shuffled (destroying any column structure), then a representational similarity measure should yield s = 0. In a randomized shuffling sensitivity test, we compared the simulated representation matrices with the same representation matrices with every value randomly shuffled. There should be no shared information between the representation matrices, despite their matching overall statistical properties. As in the permutation invariance test, we used bootstrap-simulated representation matrices, each of which was compared to itself with all of the features randomly shuffled.
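The corresponding shuffling control, again using the hypothetical rep and rsa_similarity from the earlier sketches:

```python
shuffled = rng.permutation(rep.ravel()).reshape(rep.shape)  # destroy all structure
print(rsa_similarity(rep, shuffled))                        # should be ~0
```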
For All-CNN-C-simulated representations, PWCCA detected small but spurious similarity (s = 0.11) between representation matrices that were completely shuffled. Given that PWCCA successively searches for transformations to maximize correlations, the measure seems to find spurious signals even in the absence of structure (shuffled representations). With All-CNN-C, CKA performed far better (s = 0.01) and all RSA measures performed nearly perfectly (s < 0.001). For ResNet50 and VGG16, all RSA measures performed nearly perfectly as well.

Robustness to noise test
We tested how measures of representational similarity responded to a range of noise added to representations. Across a range of added noise, when do measures start to fail, and how robust are they to the presence of noise? With the wider goal of using DNNs to model individual differences, robustness to differences in representation matrices caused by certain amounts of noise is an important consideration for selecting a measure of representational similarity (and using it to assess the magnitude of representational differences across simulated individuals). And given that cognitive science, cognitive neuroscience, and vision science research works with data that are rarely noiseless and sometimes involve missing data, robust measures of representational similarity are necessary to compare models to models, models to behavior, or models to brains.
In a robustness to noise test, we added Gaussian noise with a mean of zero and a standard deviation that was parametrically manipulated. Specifically, the standard deviation of the original representation matrices was calculated, and the standard deviation of the noise was that value multiplied by a scalar ranging from 0 (no noise) to 4 (high noise) in steps of 0.01 (steps of 0.5 for large model representations). We anticipated that for a modest amount of noise (e.g., noise multiplier = 0.5), a successful measure of representational similarity would continue to yield relatively high similarity scores. Obviously, at the highest levels of noise (e.g., noise multiplier = 4.0), measures should yield relatively low similarity scores because the added noise would overwhelm any signal in the representations. As a whole, the expected pattern of results was a monotonic decrease in similarity, with nearly perfect similarity when the noise multiplier was very low (e.g., noise multiplier = 0.01 should result in s = 1) and complete dissimilarity when the noise multiplier was very high (e.g., noise multiplier = 4.0 should result in s = 0). As in the previous tests, we used bootstrap-simulated representation matrices from each model. Each simulated representation matrix was compared to itself with added Gaussian noise at each level of noise intensity.
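A sketch of the noise manipulation, reusing the hypothetical rep and rsa_similarity from earlier (with a coarser sweep than the 0.01 steps described above):

```python
import numpy as np

def add_scaled_noise(rep, multiplier, seed=None):
    """Add zero-mean Gaussian noise with SD = multiplier * SD(rep)."""
    rng = np.random.default_rng(seed)
    return rep + rng.normal(0.0, multiplier * rep.std(), size=rep.shape)

for m in (0.0, 0.5, 1.0, 2.0, 4.0):
    print(m, rsa_similarity(rep, add_scaled_noise(rep, m, seed=0)))
```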
Fig. 4 shows representational similarity scores (y-axis) across a range of noise (x-axis) for each measure (lines) on 1000 sets of simulated representation matrices from the small All-CNN-C architecture. Compared to the other RSA measures, Spearman RSA (purple line) was particularly vulnerable to noise, dropping rapidly to zero. Pearson RSA (green line) performed only slightly better than Spearman RSA. Of the RSA measures, Euclidean RSA (blue line) was the best. CKA (orange line) performed somewhat better than Euclidean RSA, and PWCCA (yellow line) performed even better. An important caveat to the performance of PWCCA, however, is that PWCCA also yielded non-zero similarity with randomized shuffling, so some of this robustness to noise is perhaps suspect. Euclidean RSA appeared to be a useful compromise, maintaining relatively high similarity for low to moderate levels of noise while gradually dropping to zero for higher levels of noise.
Results of the robustness to noise test were similar with the large model representations, with Pearson RSA and Euclidean RSA performing reasonably well in all cases but Spearman RSA yielding nearly complete dissimilarity with the addition of noise to VGG16 representation matrices (see Fig. S1), likely due to the ReLU rank ties once again.

Robustness to feature loss test
The final test measured the effects of pruning (removing) units from a layer, akin to ablating a proportion of neurons in a region of the brain generating representations. For the robustness to feature loss test, we simulated the random removal of a proportion of units from a layer (10%–70% in steps of 10%). To the extent a unit represents unique features, removal of a proportion of units would roughly equate to the same proportion of lost information. We expected the similarity scores to monotonically track the loss of features, ranging from perfect similarity with no features lost (s = 1) to low similarity when 70% of the features were removed. As with previous tests, we used bootstrap-simulated representation matrices. Each simulated representation matrix was compared to itself with a proportion (repeated across the whole range between 10% and 70%) of features removed.
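A sketch of the ablation, again reusing the hypothetical helpers; note that RSA accommodates matrices of unequal width because RDMs are always n_images × n_images:

```python
import numpy as np

def drop_features(rep, proportion, seed=None):
    """Randomly remove a proportion of feature columns (simulated ablation)."""
    rng = np.random.default_rng(seed)
    keep = rng.choice(rep.shape[1],
                      size=int(round(rep.shape[1] * (1 - proportion))),
                      replace=False)
    return rep[:, keep]

for p in np.arange(0.1, 0.71, 0.1):
    print(round(p, 1), rsa_similarity(rep, drop_features(rep, p, seed=0)))
```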
Fig. 5 shows representational similarity scores (y-axis) as a function of the proportion of information loss (x-axis) for each measure (lines) using representation matrices simulated from the small All-CNN-C architecture. All measures except PWCCA (yellow line) decreased sensibly, in a roughly linear fashion, as a function of the proportion of features removed. CKA (orange line) was the most robust to information loss, followed by Euclidean RSA (blue line), and then the other RSA variants (green line for Pearson RSA and purple line for Spearman RSA). PWCCA remained at its maximum value irrespective of the proportion of features removed; this insensitivity is due to the nature of the measure, which is calculated according to the number of features in the smaller of the pair of representation matrices being compared.
For representation matrices from the large models (both VGG16 and ResNet50), the Euclidean and Pearson RSA measures decreased roughly linearly with the proportion of features lost (see Fig. S2). For Spearman RSA applied to VGG16 representations, any feature loss led to complete dissimilarity; the reasons for this failure are the same as the reasons for the failure of Spearman RSA in the permutation invariance test described earlier, as losing any features fundamentally changes the feature order.

Interim summary
In this first section, we evaluated several measures of representational similarity (PWCCA, CKA, and three variants of RSA), each on several tests using representation matrices from both small networks (All-CNN-C) and large networks (VGG16 and ResNet50). These alternative measures are sometimes described as options for measuring representational similarity in ways that suggest they may be interchangeable with one another. They are not, especially when used in cognitive neuroscience, vision science, and cognitive science applications.
PWCCA performed poorly. It failed to select corresponding layers based on how different models represent the same images at each layer. PWCCA also detected spurious similarities when features in representation matrices were randomly shuffled, when high amounts of noise were added to representations, and when a significant proportion of features were removed. In addition to these limitations, PWCCA requires significantly more (as in 10× more) images than features in the representation matrices to avoid spurious similarities. PWCCA cannot be used with large networks (like VGG and ResNet) with thousands of units in a layer for applications (like those in visual cognition) where there are often relatively small numbers of images.
CKA performed very well in most tests. CKA was robust in sensible ways, such as being robust to noise while not overestimating similarity when noise began to overwhelm the signal, unlike PWCCA. Its main limitation is the same as that for PWCCA. While it is not as sensitive to the image:feature ratio as PWCCA, it still requires more images than features in the representation matrices, making CKA impractical for many applications in visual cognition.
The three versions of RSA varied in the metric used to compute dissimilarity between vector representations. Euclidean RSA performed well on all tests. The correlation-based RSA variants (Pearson and Spearman) were more susceptible to noise than Euclidean RSA. Spearman RSA failed in some tests, especially with large networks with ReLUs (e.g., VGG); in the case of feature permutations and noise perturbations, the large number of zero activation values from the ReLUs produced rank ties that disrupted the calculation of a Spearman correlation. RSA as a family of methods offers further variations to better handle specific methodological concerns, such as the presence of high multivariate noise in brain data (Walther et al., 2016). While Euclidean RSA worked best for our purposes here, there may be other options to explore for other specific use cases.
Based on these evaluations of measures of representational similarity, we used Euclidean RSA in the following section on manipulating variation in representations.

Manipulating variation in DNN representations
Because DNNs are image-computable, because they learn, and because their architecture and training sets can be systematically manipulated, DNNs provide a unique sandbox to explore factors that impact variation in object representations that could cause individual differences in high-level visual cognition. In this next section, we trained sets of models with different manipulations to measure the resulting representational variation.
Our methods build on and extend recent work by Mehrer et al. (2020), which to our knowledge is the only other article to systematically examine individual differences in representations created by DNNs. They varied the random initialization of network weights, explored how learned representations across layers of those networks varied, and characterized the nature of that variation. They observed that differences in the randomization of initial weights can propagate to relatively large differences in representations (especially in later layers) and called into question the common practice in cognitive neuroscience of using a single trained network to derive insights into the neural and behavioral bases of visual cognition.
We systematically explored a range of manipulations that could affect DNN representations, including randomization of initial network weights (controlling for randomization of training image order), randomization of training image order (controlling for randomization of initial weights), a range of relative frequencies of training images within each category, a range of relative frequencies of training categories, and model size. Our focus is on variation in the representation matrices produced by DNNs, not on variation in performance by DNNs subject to these manipulations. In many cases, different networks trained in different ways achieved similar overall performance but differed in the representations they produced given the same images.
Given our motivation to systematically manipulate DNNs to model human individual differences, we considered what differences in representations might constitute more stable individual differences (akin to trait differences) versus more fleeting idiosyncratic differences (akin to state differences). Representations might differ according to the different ways network training is manipulated, but what kinds of manipulations produce relatively large differences in representations? Any criterion to distinguish small from large, state from trait, is admittedly rather arbitrary. For a baseline, we measured the representational variation caused simply by the image-augmentation techniques commonly used to train DNNs (such as translation, reflection, scaling, and color channel manipulations). Consider biological individual differences: differences in representations of a magnitude similar to or smaller than that caused by variations in how the image falls across the retina (a state difference) should probably not be considered of a magnitude sufficient to characterize true individual differences in visual cognition (a trait difference).
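As a sketch of the baseline idea, one can compare representations of the same test images before and after a single training-style augmentation; the flip-plus-translation below is a simplified stand-in for the full augmentation set (np.roll wraps pixels around rather than gray-filling, and the scaling/color-channel manipulations are omitted), and rep_of stands in for the representation-matrix extraction sketched earlier:

```python
import numpy as np

def augment(images, seed=None):
    """Apply one training-style augmentation: horizontal flip + small translation.

    `images` is assumed to be an (n, height, width, channels) numpy array.
    """
    rng = np.random.default_rng(seed)
    out = images[:, :, ::-1, :] if rng.random() < 0.5 else images.copy()
    dy, dx = rng.integers(-5, 6, size=2)
    return np.roll(out, shift=(dy, dx), axis=(1, 2))

# baseline = rsa_similarity(rep_of(images), rep_of(augment(images, seed=0)))
```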
Most of our explorations in this section used a small (All-CNN-C) network because it allowed for training and simulating large numbers of networks, measuring differences in the resulting representation matrices caused by a range of manipulations, and comparing these to baseline. We also explored differences between large pretrained networks of the same type (a set of AlexNets and vNets trained on two different datasets, two versions of VGG, two versions of ResNet; details below), which tests differences in training dataset, model size, and architecture, again comparing differences in representation matrices to baseline.

DNNs and datasets
We used the All-CNN-C architecture (Springenberg et al., 2014) to test the effects of different manipulations. This architecture has the same building blocks as the seminal AlexNet architecture (Krizhevsky et al., 2012) that characterize most feedforward DNNs used in visual cognition (which often prefers architectural simplicity over squeezing out slightly better performance by adding complexity, as is sometimes the case in computer vision). The simplicity of All-CNN-C, alongside its small size and training on a small dataset, allowed us to create many instances of the same architecture with different manipulations, allowing us to quantify the resulting representational variability caused by different manipulations. The All-CNN-C has 8 convolutional layers split across 3 blocks, with a dropout layer (dropout frequency of 50%) between blocks and a fully-connected classification softmax layer at the end.

The All-CNN-C is named as such because it only uses convolutional layers. Springenberg and colleagues (2014) demonstrated that the different types of layers in the typical AlexNet-style architecture were not necessary and could all be replaced by convolutional layers. The max pooling layers were replaced by convolutional layers with stride 2 (causing the convolutional filters to step over every other unit, as opposed to the typical stride 1 where the filter steps to each unit), achieving the same feature bottleneck effect as a max pooling layer. Fully-connected layers were replaced by convolutional layers culminating in a 1 × 1 filter size that covers the entire image given the large receptive fields by the end of the network, achieving the same high-level representations used for classification. Regardless of these replacements, the unifying principle of an AlexNet-like feedforward architecture of convolutional layers with pooling layers for feature bottlenecks leading to a classification layer was preserved.
To train the All-CNN-C architecture, we used the CIFAR-10 dataset (Krizhevsky & Hinton, 2009). We used images in batches of 128 to train the model using stochastic gradient descent with a learning rate of 0.1 and momentum of 0.9 over 350 epochs. The learning rate was reduced by a factor of 0.1 at the 200th, 250th, and 300th epochs. During training, images were randomly augmented with image translation (5 px maximum in each direction, filling empty parts of the image with gray pixels) and horizontal reflection.
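A sketch of this schedule in tf.keras; the model and data objects are placeholders rather than our actual training script:

```python
import tensorflow as tf

def lr_schedule(epoch, lr):
    # Reduce the learning rate by a factor of 0.1 at epochs 200, 250, and 300.
    return lr * 0.1 if epoch in (200, 250, 300) else lr

# model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9),
#               loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, batch_size=128, epochs=350,
#           callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)])
```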
As in our simulations in the previous section, we randomly chose 100 images from each category in the CIFAR-10 test set. These 1000 images were used for every All-CNN-C model to extract representation matrices; therefore, any variation in representations between models would be due to differences between the models, as the same images were used for every model.
We also tested variation in representations between variants of large models. Because we could not train large models from scratch, we used pretrained models. To perform analyses similar to those in our small models, we used a set of pretrained models from Mehrer et al. (2021). These models used one of two architectures: the classic AlexNet architecture (Krizhevsky et al., 2012) and vNet, a novel architecture designed to match the receptive field sizes along the human ventral stream. The AlexNet architecture used the eponymous style of architecture with 5 blocks containing 1–3 convolutional layers followed by a max pooling layer. After the blocks of convolutional layers there were three fully-connected layers, with the last layer being the classification layer. The vNet architecture contained 8 blocks matching V1, V2, V3, hV4, LO, TO, pFUS, and mFUS. Each block contained only 1 convolutional layer with a group normalization layer acting on its net output before the activation function (a ReLU); some blocks contained a max pooling layer before the convolutional layer. After the 8 convolutional blocks, there were 3 fully-connected layers, with the last layer being the classification layer.
The AlexNet and vNet models were trained on either the ImageNet-1K dataset from the ILSVRC2016 (Russakovsky et al., 2015) or the Ecoset dataset (Mehrer et al., 2021), a dataset with a similar number of images to ImageNet-1K but with categories at the basic level of abstraction. The training procedures for both models on either dataset were like the original AlexNet training procedures (Krizhevsky et al., 2012).
For comparisons between more distinct architectures with even more layers and parameters, we used pretrained variations of the VGG (VGG16 and VGG19; Simonyan & Zisserman, 2014) and ResNet (ResNet50 and ResNet101; He et al., 2015) architectures. Each VGG model contains 5 blocks, each containing 2–4 convolutional layers with a max pooling layer at the end. After the convolutional blocks, there are three fully-connected layers, with the last layer being the classification layer. The ResNet architecture followed a similar blocking structure of increasing receptive fields, but each block contained groups of 3 layers that formed a "residual" group, with skip connections adding the input of a group of layers to the output of the group. Each ResNet model begins with a single convolutional layer with a max pooling layer, followed by 4 blocks of 3–23 residual layer groups. After the convolutional blocks, there was a single fully-connected classification layer. The difference between each pair of models with the same architecture was chiefly the number of layers (size) of the network.
The large models were all trained on the ImageNet-1K dataset from the ILSVRC2016 (Russakovsky et al., 2015). Their training procedures were like those used in training the All-CNN-C architecture (see He et al., 2015; Simonyan & Zisserman, 2014).
As in our simulations in the previous section, we randomly chose one image from each of the 1000 categories in the ImageNet-1K test set. These images were used to extract representation matrices from each of the large models. To provide a measure of consistency, we repeated the analysis with 50 versions of this dataset by randomly selecting different images from each category for each repeat.

Manipulating representations learned by DNNs
For All-CNN-C, we manipulated the randomization of initial network weights and biases (holding the randomization of training image order constant), the randomization of training image order (holding the randomization of initial weights and biases constant), the relative frequency of training images within each category, and the relative frequency of training categories. For each manipulation, we trained multiple instances of each model variation and measured the representational similarity (using Euclidean RSA) between all pairs of instances with the same model variation.
We carefully controlled the random number seeds governing the randomization of initial weights and the randomization of training image order. We selected ten seeds for initial weights and ten seeds for training image order. We crossed these two types of seeds, resulting in 100 trained models. To analyze the variation in representation matrices from different models varying on a single type of randomization, we averaged across the other type of randomization.
To explore the impact of the relative frequency of training images within each category on representations, we created new training sets from the CIFAR-10 set. New training sets were created by randomly sampling images from the CIFAR-10 set with replacement until we created a dataset with the same number of images as the original. To manipulate the relative frequency of specific images, we pseudorandomly generated probabilities for each individual image to be selected. We tested three levels of relative frequency manipulations based on the maximum relative probability difference between images. The relative probability between the most common and the least common image was set to a maximum of 3×, 10×, or 100×. The number of images in each category was held equal (albeit with some images repeated within a category). We trained 50 models at each level of relative image frequency differences, for a total of 150 models varying on relative image frequency in their training dataset, each trained with the same initial weight and dataset randomization seed. To index model classification performance over the course of training, we used the original CIFAR-10 test set.
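A sketch of this resampling, assuming a 1-D integer label array aligned with the image array; because the per-image probabilities were pseudorandomly generated, the linearly spaced weights below are just one concrete choice:

```python
import numpy as np

def resample_by_image_frequency(images, labels, max_ratio, seed=None):
    """Rebuild a same-sized training set with unequal within-category image frequencies.

    Within each category, images are drawn with replacement; the most probable
    image is max_ratio (3, 10, or 100) times as likely as the least probable.
    """
    rng = np.random.default_rng(seed)
    new_idx = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        weights = rng.permutation(np.linspace(1.0, max_ratio, len(idx)))
        new_idx.append(rng.choice(idx, size=len(idx), replace=True,
                                  p=weights / weights.sum()))
    new_idx = np.concatenate(new_idx)
    return images[new_idx], labels[new_idx]
```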
For relative category frequency differences, we similarly generated new datasets from the CIFAR-10 set. As above, we created new datasets by sampling with replacement until we had a dataset with the same number of images as the original. To manipulate the relative frequency between categories, we sampled more images from some categories than from others, so that the total number of images per category differed. We tested three levels of relative frequency based on the maximum relative frequency between categories: the relative frequency of images between the most common and the least common category was set to a maximum of 3×, 10×, or 100×. Every image within a category had an equal probability of being selected. This resulted in a unique dataset for each model that differed in the number of images per category (with any individual image within a category appearing at a similar frequency). We trained 50 models at each level of relative category frequency difference, for a total of 150 models varying on the relative category frequencies of their training dataset, each trained with the same initial-weight and dataset-randomization seeds. To index model classification performance over the course of training, we used the original CIFAR-10 test set. A sketch of this resampling scheme follows the footnote below.

⁶ The All-CNN-C is so named because it uses only convolutional layers. Springenberg and colleagues (2014) demonstrated that the different types of layers in the typical AlexNet-style architecture were not necessary and could all be replaced by convolutional layers. Max pooling layers were replaced by convolutional layers with stride 2 (causing the convolutional filters to step over every other unit, as opposed to the typical stride 1, where the filter steps to each unit), achieving the same feature-bottleneck effect as a max pooling layer. Fully-connected layers were replaced by convolutional layers culminating in a 1 × 1 filter that covers the entire image, given the large receptive fields by the end of the network, achieving the same high-level representations used for classification. Despite these replacements, the unifying principle of an AlexNet-like feedforward architecture of convolutional layers with pooling-like feature bottlenecks leading to a classification layer was preserved.
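Returning to the category frequency manipulation, a companion sketch (using the same assumed uniform-weight generator as above) follows.

```python
import numpy as np

def resample_category_frequencies(labels, max_ratio, rng):
    """Resample with replacement so that whole categories appear at unequal
    frequencies, with the most- to least-frequent category ratio bounded at
    max_ratio; images within a category remain equally likely. The weight
    scheme (uniform in [1, max_ratio]) is again an assumption."""
    categories = np.unique(labels)
    weights = rng.uniform(1.0, max_ratio, size=len(categories))
    cat_probs = weights / weights.sum()
    n_total = len(labels)
    # Draw a category for every slot in the new dataset, then draw an image
    # uniformly at random from within that category.
    drawn_cats = rng.choice(categories, size=n_total, replace=True, p=cat_probs)
    indices = np.empty(n_total, dtype=int)
    for c in categories:
        members = np.where(labels == c)[0]
        slots = np.where(drawn_cats == c)[0]
        indices[slots] = rng.choice(members, size=len(slots), replace=True)
    return indices

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 5000)  # toy stand-in for CIFAR-10 labels
new_indices = resample_category_frequencies(labels, max_ratio=10, rng=rng)
```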
Between the pretrained AlexNets and vNets, the two architectures and the two training datasets were factorially combined, resulting in four model variations (AlexNet trained on ImageNet-1K, AlexNet trained on Ecoset, vNet trained on ImageNet-1K, and vNet trained on Ecoset). Ten instances of each model variant were trained with different model randomization seeds (akin to the randomization of initial weights and training dataset order in our All-CNN-C models). We first compared each architecture separately: we measured the impact of model randomization at each layer across models with the same architecture trained on the same dataset, and then measured the impact of model randomization combined with training on different datasets. We also compared models across architectures, trained either on the same dataset or on different datasets. When comparing across architectures, the models differed modestly in model size and architecture design.
We did not train new instances of the VGGs or ResNets. The difference between two variants of the same architecture (VGG16 vs VGG19 or ResNet50 vs ResNet101) is essentially only model size (albeit a much larger difference in size than the AlexNet vs vNet difference), with the models matched on general architecture design, training dataset, and training procedures. When comparing across architectures, the models differed in model size, general architecture design, and training procedures, while the training dataset was matched. These comparisons thus involved much larger models that differed more in their architecture design (namely the architectural additions introduced with ResNet models, such as the residual blocks).

Measuring variation in DNN representations
For each kind of manipulation, we measured pairwise representational similarity, using the same test set, between all models sharing that manipulation. We measured representational similarity using Euclidean RSA.
For the small models (the All-CNN-C architecture), we used images from the CIFAR-10 test set to extract representations from each of the 8 convolutional layers. For each layer, we applied global average pooling to each filter and then flattened the outputs into vectors to use as the representations of that layer. We computed the squared Euclidean RSA similarity between corresponding layers for each kind of model training manipulation. This resulted in a distribution of representational similarity scores indexing the range of differences in representation matrices caused by each type of manipulation at each level of that manipulation.
We followed similar procedures to measure representational similarity in the AlexNets and vNets. We used the 1000 images from the ImageNet-1K test set to extract representations from each layer with a ReLU activation function. For layers with more than one dimension, we applied global average pooling and then flattened the outputs into one-dimensional vectors to use as the representations of those layers. When comparing models with the same architecture, we calculated Euclidean RSA similarity between corresponding layers. A minimal sketch of this computation follows.
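Both procedures reduce to the same computation: pool, flatten, form a squared-Euclidean RDM per model, and compare RDMs. In the sketch below, the Spearman rank correlation between RDMs is an assumption (one common convention); the comparison statistic is not spelled out in this section.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def layer_representation(feature_maps):
    """Global-average-pool each filter map to one value per filter, yielding
    one vector per image. feature_maps: (n_images, height, width, n_filters)."""
    return feature_maps.mean(axis=(1, 2))

def euclidean_rsa(rep_a, rep_b):
    """RSA with squared-Euclidean RDMs over images. The comparison between
    the two RDMs is assumed here to be a Spearman rank correlation."""
    rdm_a = pdist(rep_a, metric="sqeuclidean")  # condensed upper triangle
    rdm_b = pdist(rep_b, metric="sqeuclidean")
    rho, _ = spearmanr(rdm_a, rdm_b)
    return rho

# Toy feature maps standing in for the same layer of two trained models.
rng = np.random.default_rng(0)
maps_a = rng.normal(size=(100, 8, 8, 32))
maps_b = rng.normal(size=(100, 8, 8, 32))
print(euclidean_rsa(layer_representation(maps_a), layer_representation(maps_b)))
```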
When comparing models across architectures (AlexNet vs vNet and VGG vs ResNet), we selected representative layers in each model instance based on its architecture, so as to have commensurate layers to compare between incommensurate architectures. For the early layer, we selected the output of the first block of convolutional layers, after the pooling. For the late layer, we selected the penultimate layer, right before the classification layer. For the "middle" layer, the selection criteria were less obvious, so we used three different "middles" defined by different criteria: (1) architectural landmark: the output of the second-to-last block of layers; (2) receptive field size: the output of the layer whose receptive field covered approximately a quarter of the size of the original input image; (3) parameter count: the output of the layer at which the cumulative number of trainable parameters was approximately half of the total number of model parameters. The selected layers for each large model are presented in Table 1.
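As an illustration of the parameter-count criterion, the sketch below walks a Keras model and picks the first layer at which cumulative parameters reach half the total. The use of VGG16 and the exact bookkeeping (e.g., counting all parameters per layer) are assumptions for illustration, not our exact procedure.

```python
import numpy as np
from tensorflow.keras.applications import VGG16

model = VGG16(weights=None)  # weights are irrelevant for counting parameters

# Cumulative parameter count, layer by layer, through the network.
cumulative = np.cumsum([layer.count_params() for layer in model.layers])
half_total = cumulative[-1] / 2

# First layer at which cumulative parameters reach half the model total.
middle_index = int(np.argmax(cumulative >= half_total))
print(model.layers[middle_index].name)  # candidate "middle" layer
```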

Baselines for assessing variation in DNN representations
We generated baselines for representational variation using the image transformations that were employed as augmentation during training of a given model. Image augmentation techniques are typically used in training DNNs by applying identity-preserving, randomized image transformations to the training images. Commonly used augmentation transformations include horizontal reflection, image translation, image scaling, and color shifting.⁷ To use these augmentation techniques to form a baseline, we applied the relevant transformations to the images used to extract representation matrices for comparisons between models. Instead of comparing representation matrices from different models, we calculated the similarity between the representation matrices for the original images and the representation matrices for the transformed images from the same model. We considered the resulting similarity score a baseline level of representational difference due to identity-preserving image transformations that the model itself should be trained to handle.
For each model manipulation's baseline, we used the types of image augmentation transformations that were used during training of those models, at magnitudes reasonable for the size of the images and relevant to the presentation variations that humans encounter day-to-day. By using the same augmentation transformations used during training of a specific model, we indexed the typical difference in representation matrices caused by image-level transformations that the model was trained to handle. Any representational variation caused by these transformations would arguably be akin to state differences rather than to stable, trait-like individual differences.
Our All-CNN-C models were trained with two types of image augmentation: horizontal reflection and image translation. For each model, we randomly transformed the test set: there was a 50% chance that the dataset would be horizontally reflected, and the images were translated by between −5 and 5 pixels on each axis (the extent of the translation on each axis was drawn from a uniform distribution). Notably, these are exactly the procedures that would normally be used to transform each image for data augmentation, but we applied the randomly chosen transformation consistently to the entire dataset; a minimal sketch follows the footnote below. As the image transformations were stochastic, we repeated this procedure 10 times for each model (resulting in 10 × 100 repeats for the random-seed manipulation models, 10 × 50 repeats for the relative image frequency models, and 10 × 50 repeats for the relative category frequency models) and averaged the results to form the baseline level of representational variation. We found that the baselines for all of the All-CNN-C models were quite similar, so we averaged across them at each layer regardless of manipulation.

⁷ These image transformations may be familiar to vision scientists as invariances that the biological visual system must contend with to be robust, but the initial motivation for using such transformations in training DNNs was primarily the effective use of limited datasets and the avoidance of overfitting to specific images. In principle, if datasets were sufficiently large and varied (and therefore contained many variations in presentation), image augmentation would not be necessary in machine learning.
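As referenced above, here is a minimal sketch of the whole-dataset baseline transformation; integer pixel shifts and zero padding of vacated pixels are assumptions (interpolation and padding details are not specified in the text).

```python
import numpy as np

def baseline_transform_dataset(images, rng, max_shift=5):
    """Apply ONE randomly drawn, identity-preserving transformation to the
    whole test set (rather than independently per image, as during training):
    a 50% chance of horizontal reflection plus an integer translation in
    [-max_shift, max_shift] on each axis. images: (n, height, width, channels)."""
    out = images.copy()
    if rng.random() < 0.5:
        out = out[:, :, ::-1, :]  # horizontal reflection
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.roll(out, shift=(dy, dx), axis=(1, 2))
    # Zero the wrapped-around borders so the roll acts as a true translation.
    if dy > 0:
        out[:, :dy] = 0
    elif dy < 0:
        out[:, dy:] = 0
    if dx > 0:
        out[:, :, :dx] = 0
    elif dx < 0:
        out[:, :, dx:] = 0
    return out

rng = np.random.default_rng(0)
test_images = rng.uniform(size=(100, 32, 32, 3))  # toy CIFAR-sized stand-in
transformed = baseline_transform_dataset(test_images, rng)
```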
The large models were trained with horizontal reflection, size scaling, random cropping, and PCA color shifting. Following their training augmentation procedures, we randomly transformed the test set: (1) first, there was a 50% chance that the dataset would be horizontally reflected; (2) we then randomly scaled the images to a square size sampled from a uniform distribution between [256, 512] for VGG models, [256, 480] for ResNet and AlexNet models, and [140, 224] for vNet models; (3) we then took a random crop of 224 × 224 (128 × 128 for vNet models) within the scaled image, with all possible crops within the image having equal probability; (4) finally, for PCA color shifting (Krizhevsky et al., 2012), we performed PCA on the color channels of the ImageNet training set and then applied multiples of each principal component to each pixel, with magnitudes proportional to the corresponding eigenvalues multiplied by a random variable drawn from a normal distribution with a mean of 0 and a standard deviation of 0.1. Once again, these are exactly the types of augmentation used on individual images during training, but we applied the same set of randomly chosen image transformations to the whole dataset to form the baseline. We averaged 100 repetitions of these procedures at each of the chosen layers in the large models to act as the baseline for those networks. These baselines are intended to provide context for the representational variation between different model variants, so the baseline calculated for each pair of model variants was again averaged.
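Of these steps, PCA color shifting is the least standard, so a minimal sketch is given below. For self-containment, the sketch computes the channel PCA on the images passed in rather than on the full ImageNet training set, and assumes images scaled to [0, 1]; both are simplifications of the procedure described above.

```python
import numpy as np

def pca_color_shift(images, rng, sigma=0.1):
    """AlexNet-style PCA color augmentation (Krizhevsky et al., 2012),
    applied with one random draw for the whole dataset, as in our baseline.
    images: float array (n, height, width, 3) scaled to [0, 1]."""
    pixels = images.reshape(-1, 3)
    cov = np.cov(pixels, rowvar=False)       # 3x3 channel covariance
    eigvals, eigvecs = np.linalg.eigh(cov)   # principal components (columns)
    alphas = rng.normal(0.0, sigma, size=3)  # one draw per component
    shift = eigvecs @ (alphas * eigvals)     # same RGB offset for every pixel
    return np.clip(images + shift, 0.0, 1.0)

rng = np.random.default_rng(0)
test_images = rng.uniform(size=(100, 224, 224, 3))  # toy stand-in images
shifted = pca_color_shift(test_images, rng)
```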

Statistical comparisons between manipulations and baselines
The typical statistical tools in vision science are ill-suited to comparing representational variability across different manipulations and against the baseline. Chiefly, because of the relatively large sample sizes (50-100 models per manipulation and 500-1000 baseline repetitions) and the relatively small variance around the means, most statistical tests (e.g., t-tests or ANOVAs on means) would produce significant results even when the practical difference between means is quite small (<0.01). Comparing confidence intervals and effect sizes would similarly produce misleading results (confidence intervals of ±0.001 around the mean and extremely large effect sizes) if we followed typical conventions. We opted instead to use the normalized median absolute deviation (MAD) as an index of dispersion around the mean, and used MAD intervals around the means to determine whether means were similar or different. We used a robust measure of dispersion instead of the standard deviation because of the presence of outliers in the baseline as well as obvious violations of normality; removing the outliers and using the standard deviation as the measure of dispersion does not drastically change our interpretations. The choice of descriptive statistic used to index dispersion and to decide whether there are appreciable differences between means is somewhat arbitrary, but we believe that using MAD, in conjunction with our baseline method, produces statistically robust results and a sense of what might be consistent and useful differences.
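For concreteness, the sketch below shows the dispersion measure and the overlap criterion we have in mind. The 1.4826 scaling constant, which makes MAD consistent with the standard deviation under normality, is the usual convention and an assumption about what "normalized" means here.

```python
import numpy as np

def normalized_mad(x):
    """Median absolute deviation scaled by 1.4826 so that it matches the
    standard deviation under normality (the scaling is an assumed convention)."""
    x = np.asarray(x)
    return 1.4826 * np.median(np.abs(x - np.median(x)))

def intervals_overlap(a, b):
    """Treat two conditions as similar when their mean +/- MAD intervals
    overlap; this is a descriptive criterion, not a hypothesis test."""
    lo_a, hi_a = np.mean(a) - normalized_mad(a), np.mean(a) + normalized_mad(a)
    lo_b, hi_b = np.mean(b) - normalized_mad(b), np.mean(b) + normalized_mad(b)
    return lo_a <= hi_b and lo_b <= hi_a
```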

DNN model training
All trained models of the All-CNN-C architecture reached high classification accuracy on the validation set, with relatively consistent accuracy across the types of manipulated variation. Fig. 6 shows the training trajectories (one line per model) based on validation accuracy (y-axis) at each epoch (x-axis). Training trajectories were relatively similar between models, with most models converging within the first 100 epochs. Nonetheless, we trained all models for a full 350 epochs to ensure that model parameters had stabilized, despite relatively little change in validation accuracy over the last 100 epochs. All models converged to similar levels of validation accuracy by the end of training. Differences in early training trajectories do not mean that there will necessarily be differences in how the models represent objects, nor do similar validation accuracies after training mean that there will be no variability in representations (Mehrer et al., 2020). We needed to examine representation matrices from the trained models.

Manipulation of representational similarity: Within-manipulation comparison
We separately examined the representational variation in the All-CNN-C models caused by each type of manipulation. As each type of manipulation had different instantiations (different sources of randomization or different relative frequencies of images and categories), we wanted to quantify and compare how different instantiations within a manipulation might cause different amounts of representational variation. We also wanted to examine whether representational variation differed across layers of the network.
Fig. 7a shows average Euclidean RSA similarity scores (y-axis) at each network layer (x-axis) for the two types of model randomization manipulation (lines) in the left panel, and the distribution of similarity scores in the final layer as kernel densities in the right panel. These manipulations (random initial weights, random training image sequence) generated similar magnitudes of representational similarity at each layer (overlapping MAD intervals in Fig. 7a). We generally replicated previous work (Mehrer et al., 2020) showing a monotonic decrease in representational similarity across layers when manipulating initial weights. We additionally demonstrated that manipulating training image order produced nearly identical effects on representational similarity. Because of this, we collapse across the two types of randomization manipulation for brevity.
Fig. 7b shows average Euclidean RSA similarity scores (y-axis) at each layer (x-axis) for each level of maximum relative image frequency (lines) in the left panel, and the distribution of similarity scores in the final layer as kernel densities in the right panel. While representational similarity decreased as a function of layer depth, there were relatively small differences in similarity as a function of frequency, despite the frequency differences spanning a wide range (3 vs 10 vs 100; the orange, blue, and green lines, respectively, have overlapping MAD intervals). The absence of representational variability caused by variation in relative image frequency may be due to the size of the training set, in that repetitions of individual images had relatively little effect compared to the number of images in the overall set. For the remainder of our analyses and comparisons, we use the intermediate relative frequency difference of 10.

Fig. 7c shows average Euclidean RSA similarity scores (y-axis) at each layer (x-axis) for each level of relative category frequency (lines) in the left panel, and the distribution of similarity scores in the final layer as kernel densities in the right panel. Representational similarity decreased as a function of layer depth, but now greater relative frequency differences resulted in lower representational similarity, most obviously when comparing max 3 (orange line) vs max 100 (green line) at layer 8. However, the MAD intervals still overlap. This suggests that a great deal of variation can be caused by manipulating the relative frequency between categories within the training set, but some instantiations of this manipulation may not produce as much variation. For the remainder of our analyses, we use the intermediate maximum relative category frequency difference of 10.

Manipulation of representational similarity: Between-manipulation comparison
We next compared different kinds of manipulations and their impact on representational similarity and compared these against the baseline.In the context of modeling individual differences, such comparisons provide insight on which manipulations might be more useful for characterizing the sources of stable individual differences.
In Fig. 8, the left panel shows representational similarity (y-axis) at each layer (x-axis) for each kind of model manipulation (lines). Representational similarity for each kind of manipulation (colored lines) did not drop below the baseline (black dotted line) until layer 7. Both the relative image frequency (orange line) and relative category frequency (green line) manipulations caused more representational variation than the baseline starting at layer 7. Model randomization (of initial weights and training order) never exceeded baseline.
The right panel of Fig. 8 shows the distribution of representational similarity scores (y-axis) for each kind of manipulation and for the baseline (black and colored curves) as kernel density estimates. The distribution of similarity scores for the model randomization manipulations (blue curve) at layer 8 overlapped a large portion of the baseline distribution, whereas the relative image and category frequency manipulation distributions (orange and green curves, respectively) did not. Note that the relative category frequency manipulation produced a longer tail because some pairs of models exhibited far greater representational variation.
In Fig. 9, the left panel shows representational similarity (y-axis) at each layer (x-axis) for each kind of model manipulation (lines). For both models and both types of model manipulation, representational variation did not consistently exceed the baselines, except at layer 2 for AlexNets differing on the training dataset. In contrast to the results from the small models, representational similarity did not decrease monotonically across layers. AlexNets differing in training dataset generated more representational variation than AlexNets differing only in model randomization. Likewise, vNets differing in training dataset generated more representational variation than vNets differing only in model randomization, though the two converged at layer 7 and, most notably, at layer 10.
The right panel of Fig. 9 shows the distribution of representational similarity scores (y-axis) for each kind of manipulation and for the baseline (black and colored curves) as kernel density estimates. Given the wide range of possible image manipulations for the baseline, there was a wide range of possible representational variation. Nonetheless, even using just the mean representational variation caused by the baseline to compare to the model manipulations would not change the interpretation of the results.
Fig. 10 shows representational similarity (y-axis) at three representative layers (early, middle, late; x-axis) between AlexNet and vNet models trained on either ImageNet-1K or Ecoset. No between-model comparison yielded representational variation that exceeded baseline. This was true at every layer, regardless of whether the pairs of models were trained on the same dataset or different datasets.
Fig. 11 shows representational similarity (y-axis) at three representative layers (x-axis) between the VGG and ResNet model architectures (color) and for the baseline (gray and black dashed), with three different definitions of the middle layer (line style). Neither the difference between VGG16 and VGG19 nor that between ResNet50 and ResNet101 exceeded baseline: these within-architecture differences in size (number of layers and number of parameters) did not produce representational variation that exceeded baseline. When comparing across the two architectures, however, representational variation exceeded baseline at the middle and late layers. The different criteria used to define the middle layer yielded quantitatively different representational similarity, but the qualitative pattern relative to baseline was the same.

Interim summary
We examined how different kinds of manipulations of DNNs caused different levels of representational similarity across different networks, and how these differences compared to baseline differences in representational similarity produced by mere image augmentation of the sort used during network training. This was motivated by our broader goal of exploring how DNNs can be used to develop mechanistic models of individual differences in high-level visual cognition.
Using the All-CNN-C architecture, we could examine manipulations of how different networks were trained and, because of the small size and tractability of All-CNN-C, could do so for relatively large numbers of trained networks. Randomization of initial network weights and training image order produced similar levels of representational similarity, with less representational similarity at later layers than at earlier layers of a network, replicating recent results (Mehrer et al., 2020). But these kinds of small randomizations only produced small differences between models, resulting in small effects on representational similarity that never significantly exceeded baseline, though there was a clear trend at the penultimate layer of a network. Overall, mere randomization of initial weights and training image sequence produces differences of a magnitude more akin to state differences arising from manipulations of images and their viewing conditions than to training differences that might explain more stable individual differences in visual cognition ability.
By contrast, manipulations of relative image frequency (holding category frequency constant) and manipulations of relative category frequency had a larger impact on representational similarity. While for both kinds of manipulation representational variation did not exceed baseline in the early and middle layers of the network, it did in the later layers, with a larger impact from manipulations of relative category frequency than of relative image frequency.
For larger models, we could not train new networks from scratch given our computational limitations and instead leveraged pretrained models. We compared a set of pretrained AlexNet and vNet models trained on either ImageNet-1K or Ecoset (Mehrer et al., 2021). This allowed us to perform analyses similar to those on our small models, finding that model randomization, differences in dataset, and modest differences in architecture did not yield representational variation that exceeded baseline. To test more varied architectures, we used both within- and between-architecture comparisons of two versions of VGG (VGG16 and VGG19) and two versions of ResNet (ResNet50 and ResNet101). For the within-architecture comparisons, while the overall structure of the models was the same and they were trained on the same image set, they differed in size, whether measured by layers or by parameters. Neither VGGs nor ResNets exhibited representational variation exceeding the image-augmentation baseline. The same was not true for the between-architecture comparisons, where representational variation was close to baseline in the earliest layer but far exceeded it in the middle and late layers of the networks.
Taken together, the results from the large models demonstrate that more substantial manipulations are necessary to cause differences in representations that exceed baseline. The difference between the ImageNet-1K and Ecoset datasets, which differ largely in the exact make-up of their balanced categories, did not lower representational similarity below baseline. This contrasts with the more substantial changes to the distribution of categories in the dataset that we implemented in the smaller models, which did cause greater representational differences. Similarly, modest differences in architecture, such as those between AlexNet and vNet, did not lower representational similarity below baseline, whereas more substantial changes in architecture, such as the design differences between VGGs and ResNets, could. These results demonstrate that the type of difference between models (e.g., dataset or architecture differences) matters less than the magnitude of those differences (e.g., large differences in category frequencies or large architectural design differences).

General discussion
Deep neural networks (DNNs) offer a potential tool for modeling individual differences in high-level visual cognition. Individual differences could arise from differences in initial conditions, learning history, distributions of visual experience, or neural architecture, all of which can be explored using simulations of DNN models.
To the extent that trained DNN models instantiating these differences develop different internal representations, we need to know which measure of representational similarity to use. That was explored in the first part of this article. To understand the conditions under which differences in internal representations might be sufficiently large to explain individual differences in visual cognition, we need to explore whether differences in initial conditions, learning history, distributions of experience, and architecture lead to differences in representations that exceed those from mere variation in image viewing. That was explored in the second part of this article.
Representational Similarity Analysis (RSA), Centered Kernel Alignment (CKA), and Canonical Correlation Analysis (CCA) are different measures of representational similarity that are often discussed as if they were largely interchangeable. Previous systematic comparisons of different measures (Kornblith et al., 2019; Williams et al., 2021) did not focus on cognitive science, cognitive neuroscience, or vision science applications. We found that RSA using Euclidean distance to form Representational Dissimilarity Matrices (RDMs) performed best in our evaluations.
Perhaps most critically, Euclidean RSA (like other variants of RSA) is robust to the ratio of the number of images to the number of features in the representation matrices, unlike either CKA or PWCCA. When the number of images does not exceed the number of features by at least a factor of 10, CKA and PWCCA yield erroneously high similarity scores. At lower ratios, even completely randomized representations, which should have no similarity, appear to have substantially non-zero representational similarity under CKA or PWCCA. This is critical for applications aimed at modeling visual cognition because behavioral and neural experiments typically use only hundreds of images, whereas commonly-used DNNs (e.g., VGG16) have thousands of features in a network layer; the ratio of images to features is far too low to use CKA or PWCCA. In those cases where there are many more images than features, for example when simply examining the properties of models given a large set of images without an eye toward comparing them to behavioral or neural data from a small set of images, CKA can be a useful measure of representational similarity because it is computationally more efficient than RSA, which requires calculating pairwise similarities among many very large image representations.
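The pathology is easy to demonstrate: with far fewer images than features, linear CKA between two completely independent random representations comes out close to 1. The sketch below uses the standard linear CKA formula (Kornblith et al., 2019), which may differ in detail from the variant used in our evaluations.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between representation matrices (n_images x n_features),
    computed on column-centered features."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    hsic = np.linalg.norm(y.T @ x, "fro") ** 2
    return hsic / (np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro"))

rng = np.random.default_rng(0)
n_images, n_features = 100, 4096  # e.g., hundreds of stimuli vs a VGG16 fc layer
a = rng.normal(size=(n_images, n_features))
b = rng.normal(size=(n_images, n_features))
print(linear_cka(a, b))  # close to 1 despite completely independent inputs
```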
Euclidean RSA also passed the other tests in our evaluation: the layer correspondence test, which checks whether a measure indicates that the same layer in two different DNNs trained on the same image set is maximally similar; the permutation invariance test, which checks whether a measure is invariant to consistently permuting the order of features; the randomized shuffling sensitivity test, which checks whether a measure is sensitive to random shuffling of features; the robustness to noise test, which examines how measures respond to added noise; and the robustness to feature loss test, which examines how measures respond to the removal of features. Both PWCCA and variants of RSA using correlation measures to form RDMs failed some of these tests, sometimes rather strikingly. Like Euclidean RSA, CKA performed well on these tests, but CKA can only be used when the number of images greatly exceeds the number of features.
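As one illustration of these tests, the sketch below shows the logic of the permutation invariance test for Euclidean RSA: consistently reordering features leaves the RDM, and hence the similarity score, unchanged. This is a schematic of the test's logic, not our exact protocol.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
rep = rng.normal(size=(100, 64))    # n_images x n_features
perm = rng.permutation(rep.shape[1])
rep_permuted = rep[:, perm]         # consistently reorder all features

# Squared-Euclidean distances between images ignore feature order,
# so the two RDMs are identical and the RSA score is exactly 1.
rdm = pdist(rep, metric="sqeuclidean")
rdm_perm = pdist(rep_permuted, metric="sqeuclidean")
rho, _ = spearmanr(rdm, rdm_perm)
print(rho)  # 1.0
```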
We next measured differences in representational similarity caused by different kinds of manipulations and compared these to one another and to baseline. In general, all manipulations caused larger differences in representational similarity in later layers than in earlier layers, and no manipulation caused differences that significantly exceeded baseline in the earlier layers. DNNs are known to converge on low-level features like edge or center-surround detectors in early layers with small receptive fields (Oquab et al., 2014). So it makes sense that, no matter the manipulation, from randomization to relative frequencies in the dataset, DNNs trained to categorize images of objects form similar kinds of representations in earlier layers, causing relatively small differences in representational similarity in those layers. These learned low-level features are building blocks of perception and should be relatively consistent across individuals. Individuals can differ in representations of low-level features, such as differences in low-level feature discrimination (Kieseler et al., 2022; Riddoch & Humphreys, 1993), but those differences are likely caused by experience (training) on low-level perceptual learning tasks (e.g., Fine & Jacobs, 2002; Li et al., 2004), not complex object categorization, by differences in the quality of retinal inputs (caused by illness, damage, or developmental differences in the eye or visual pathways), or by architectural differences (caused by a range of factors from genetics to environment and their interactions). Arguably, every visual system (including DNN models) needs edge detectors, but not every visual system needs representations for detecting animal fur or vehicle wheels in the same way, or at all. It is sensible, then, that we generally observe more difference in representational similarity in later layers than in earlier layers.
Because our focus is primarily on modeling mechanisms of high-level visual cognition, the lack of differences in representational similarity beyond baseline in the earlier layers has practical utility for developing models. Training a single full DNN, especially a large one, from scratch is not only computationally challenging and time-consuming, but details of how large models were originally trained are often not reported in sufficient detail to allow reproducibility (e.g., Hutson, 2018). Training dozens or hundreds of large DNNs from scratch is impossible for all but those with the most resources. Transfer learning (Oquab et al., 2014), where the initial layers of a pretrained network are carried over to a new network and held fixed while later layers of the new network are trained, has for many years been used as a means of generalizing from previously trained networks to new networks and new kinds of problems (e.g., Kolesnikov et al., 2020). At least for modeling individual differences in high-level visual cognition (e.g., Richler et al., 2017, 2019; Smithson et al., 2023), different instances (individuals) could share earlier layers and differ only in later layers. This confers significant practical advantages in computational resources and time. Of course, the question remains where to make the cut in deciding which early layers to fix across multiple individuals; as we showed in the final set of simulations, there can be significant differences in representational similarity in the middle layers of networks with different architectures.
Conversely, our results suggest that if the goal is to study individual differences in low-level perception, the DNNs we evaluated may not be the right tool. Certainly, individual differences in performance could propagate from differences in representations very early on, such as in low-level sensation and perception. However, such differences are unlikely to arise in DNNs from broad differences in training data or differences in starting points, given the fundamental nature of such low-level features (Yosinski et al., 2015). They might result instead from architectural differences due to differences in maturation, development, and early experience.
It is notable that the baseline similarity decreases deeper into the models. There is evidence that earlier layers in DNNs generally detect low-level features (like edges) while later layers form more object-like representations (like parts of objects; Yosinski et al., 2015). The pattern of results for the baseline could have been the reverse, with less similarity in early layers (because low-level features are changed by the image presentation manipulations) and more similarity in later layers (because the objects themselves have not changed, just their presentation). But this could only be true if the models were creating presentation-invariant object representations, and our results showed that this is not the case. In the baseline, modifying the presentation of the images had an effect well into the final layers (albeit smaller than the between-model manipulations). It should be noted that the decrease in similarity in the later layers does plateau, whereas the model manipulations continue to decrease in similarity, suggesting that these later representations are higher-level and no longer as vulnerable to low-level differences in image presentation. This provides evidence that the representations in the later layers are indeed the higher-level representations that we would want to use to model individual differences in high-level visual cognition. It is possible that more complex models with different training regimes could exhibit a different pattern of results, but we leave that for future work.
Manipulations of the image set used to train a DNN, including manipulations of the relative frequency of individual images within each category and manipulations of the relative frequency of categories within the set of categories, caused larger differences in representational similarity than randomization of initial network weights or randomization of the sequence of training images. Neither kind of randomization manipulation caused differences in representational similarity exceeding those produced by the baseline image augmentations used to train a network. For large networks, modest manipulations, such as model randomization, model size, small architecture differences, and different datasets of similar quality (though notably different levels of abstraction), were insufficient to produce large differences in representational similarity between models. However, we did observe large differences in representational similarity when comparing more distinct architectures (VGG vs ResNet).
Small manipulations of randomized weights and training sequence do produce differences in representational similarity, just not of a magnitude likely to explain the behavioral differences observed in measures of visual cognition (e.g., Richler et al., 2017, 2019; Smithson et al., 2023). Differences in the kinds of experience, here instantiated as manipulations of relative image and category frequency, produced more sizeable differences in representational similarity. In larger, more complex models, differences between datasets of similar quality (but different make-up) may not matter as much as substantial architectural differences. Using models varying on such manipulations could allow modeling pursuits in which each DNN is treated as an individual participant, as in behavioral experiments with humans. To be clear, we only examined differences in representations here, not whether those differences would manifest as the behavioral differences that characterize individual differences; we leave that for future work.
We used only strictly supervised feedforward convolutional DNNs throughout our experiments, but other types of DNNs are of interest. DNNs containing feedback or recurrent connections could be important for modeling the complex dynamics of representations that are critical for some high-level visual cognition tasks (Kar et al., 2019; Rajaei et al., 2019; Spoerer et al., 2017). Weakly supervised or unsupervised DNNs may better reflect the types of experience that biological systems receive during development, which may create representations that better predict brain representations (Zhuang et al., 2021).
There has also been growing interest in leveraging transformer architectures in modeling brain and behavior, but there may be theoretical concerns and practical limitations that need to be addressed first. Chiefly, while vision transformers perform remarkably well on visual tasks, they do not possess the kinds of inductive biases (such as convolutional operations, pooling, and hierarchical architecture) that lend validity to convolutional DNNs as models of biological systems. How necessary these features are in DNN models of biological systems remains unknown. There is evidence that vision transformers better match human error patterns relative to convolutional DNNs (Tuli et al., 2021), but further work is necessary to quantify how well their internal representations model biological representations. On the practical side, vision transformers are infamously data hungry (cf. Pandey et al., 2023), with many more trained parameters than most convolutional DNNs used by computational modelers in cognitive neuroscience. This means it would be difficult to train and use many instances of vision transformers to model individual differences. Nonetheless, as research into vision transformers as models of the brain grows, these theoretical and practical concerns may be resolved, making them usable in models of visual cognition.
Standard DNN models by themselves can only perform the single task they were trained to do, such as categorizing objects at a specific level of abstraction. Individual differences in human participants are measured using a wide range of tasks, such as same-different discrimination or recognition memory (e.g., Duchaine & Nakayama, 2006; Richler et al., 2017). Nonetheless, participants maintain the same perceptual front-end to extract information from input across all tasks. To instantiate similar flexibility with DNNs, we must use techniques that can reuse the parts of the DNNs that represent general perceptual processes across disparate tasks. Fully modeling observed results on individual differences would require coupling the representations from DNN models with other simulated mechanisms involved in task representations, short-term and long-term memory representations, and decision processes. The DNN models would act as the perceptual front-end, a unique one for each participant used across all tasks, feeding into simulated mechanisms that are unique to each task with different parameterizations for each participant. For example, Annis et al. (2020) used DNNs to create representations of novel objects the network had never been trained on, used the representations of a studied object and a test object to simulate a sequential same-different task, and used the measured similarity between those representations as evidence to drive an evidence-accumulation model of decision making that predicted observed response probabilities and response times. While individual differences can come from differences in DNN representations, they can also come from parameterized differences in the quality of memory representations, criteria for judging same vs. different, variability in the decision process, and response boundaries for the accumulated evidence used to make a decision (see also Annis & Palmeri, 2019; Ratcliff et al., 2011; Shen & Palmeri, 2016). Arriving at a complete account of individual differences in visual cognition will require examining differences in the object representations created by a DNN and differences in how those representations are used to solve a visual cognition task, ultimately accounting for the qualitative and quantitative differences in behavior observed across individuals (Duchaine & Nakayama, 2006; McGugin et al., 2020; McGugin et al., 2023; Shelton & Gabrieli, 2004; Shen & Palmeri, 2016).

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 1 .
Fig. 1. Results from the Layer Correspondence Test for small (All-CNN-C) models. Each panel shows a confusion matrix for each type of representational similarity measure, with axes representing network layers (1-10) for two compared networks trained on the same dataset. Each cell is the proportion of times that layer was matched as the most similar between the two DNNs. Cell color represents the proportion of layers matched with the highest similarity score, with darker colors indicating higher proportions. A perfect confusion matrix would have ones (dark) along the diagonal and zeroes (white) everywhere else.

Fig. 2 .
Fig. 2. Results from the Layer Correspondence Test for the large models. Each panel shows a confusion matrix for each type of representational similarity measure, with axes representing network layers for two compared networks trained on the same dataset. Each cell is the proportion of times that layer was matched as the most similar between the two DNNs. Cell color represents the proportion of layers matched with the highest similarity score, with darker colors indicating higher proportions. A perfect confusion matrix would have ones (dark) along the diagonal and zeroes (white) everywhere else.

Fig. 4 .
Fig. 4. Results for the Robustness to Noise Test with small (All-CNN-C) models. Average similarity scores (y-axis) are calculated using each measure (colored lines) between pairs of identical simulated representations with normally-distributed noise added at different intensities (x-axis). The normally-distributed noise had a standard deviation equal to the standard deviation of the representations multiplied by a scalar from 0 to 4.0 in steps of 0.01. Bands represent normalized median absolute deviation around the mean.

Fig. 5 .
Fig. 5. Results for the Robustness to Feature Loss Test in small (All-CNN-C) models. Average similarity scores (y-axis) are calculated using each measure (colored lines) between a simulated representation and itself with a range of features removed (x-axis); for example, a simulated representation was compared to the same representation with 0% (an identical representation matrix) to 70% (the same matrix with 7 columns removed) of features removed. Bands represent normalized median absolute deviation around the mean.

Fig. 6 .
Fig. 6. Validation accuracy (y-axis) over epochs (x-axis) for all small (All-CNN-C) models trained with different instantiations of the same manipulation; each model's trajectory over training is an individual line. Plots are truncated at 250 epochs, as no significant changes occurred in the last 100 epochs of training. a. Models differing on initial weight randomization and training image order randomization (no significant difference between the two types of randomization was observed). b. Models differing on relative image frequency; each model was trained on a resampled dataset that allowed individual images to appear a maximum of 3, 10, or 100× more often than another image in the dataset, each plotted using a different line color. c. Models differing on relative category frequency; each model was trained on a resampled dataset that allowed categories to appear a maximum of 3, 10, or 100× more often than another category in the dataset, each plotted using a different line color.

Fig. 7 .
Fig. 7. Average Euclidean RSA similarity scores (y-axis) between small (All-CNN-C) models at each layer using different instantiations of the same manipulation (lines). In the left panels, representational similarity scores are shown at each layer (x-axis). In the right panels, the layer 8 similarity score distribution for each manipulation and the baseline are visualized using kernel density estimation (x-axis). a. Models differing on either initial weight randomization or training image order randomization; the lines index average similarity at each layer between models that differed on one type of model randomization while holding the other constant. b. Models trained on resampled datasets with individual images repeating at different frequencies; each model was trained on a resampled dataset that allowed individual images to appear a maximum of 3, 10, or 100× more often than another image in the dataset, plotted with different line colors. c. Models trained on resampled datasets with different relative category frequencies; each model was trained on a resampled dataset that allowed categories to appear a maximum of 3, 10, or 100× more often than another category in the dataset, plotted with different line colors. Bands represent normalized median absolute deviation around the mean.

Fig. 8 .
Fig. 8. Average representational variation (y-axis) due to each manipulation (colored lines) compared against baseline (black dotted line) in small (All-CNN-C) models. The baseline similarity was averaged across all model variations, as the magnitude of the representational differences caused by baseline manipulations was similar for all model variations. In the left panel, representational similarity scores are shown at each layer (x-axis); each colored line represents a different model manipulation, and bands represent normalized median absolute deviation around the mean. In the right panel, the layer 8 similarity score distribution for each manipulation and the baseline are visualized using kernel density estimation (x-axis).

Fig. 9 .
Fig. 9. Average representational variation (y-axis) due to each manipulation (colored lines) compared against baseline (black dotted line) in AlexNet and vNet models. The baseline was averaged across different model manipulations, as the magnitude of the representational differences caused by baseline manipulations was similar for all models. In the left panels, representational similarity scores are shown at each layer (x-axis); each colored line represents a different model manipulation, and bands represent normalized median absolute deviation around the mean. In the right panel, the final layer similarity score distribution for each manipulation and the baseline are visualized using kernel density estimation (x-axis).

Fig. 10 .
Fig. 10. Average representational variation (y-axis) between AlexNet and vNet trained on either the same dataset or different datasets (colored lines) compared against baseline (black dotted line), where the definition of the "middle" layer varied based on different criteria (line type). The baseline similarity was averaged across all model architectures and training sets, as the magnitude of the representational differences caused by baseline manipulations was similar for all models. Bands represent normalized median absolute deviation around the mean.

Fig. 11 .
Fig. 11. Average representational variation (y-axis) between the VGG and ResNet model architectures (colored lines) compared against baseline (black dotted line), where the definition of the "middle" layer varied based on different criteria (line type). The baseline similarity was averaged across all model architectures, as the magnitude of the representational differences caused by baseline manipulations was similar for all four models. Bands represent normalized median absolute deviation around the mean.

Table 1
Selected layers to compare across architectures. Note. The layer names are those used in Mehrer et al. (2021) for AlexNet and vNet and in TensorFlow for VGGs and ResNets.