The unreliable influence of multivariate noise normalization on the reliability of neural dissimilarity

Representational similarity analysis (RSA) is a key element in the multivariate pattern analysis toolkit. The central construct of the method is the representational dissimilarity matrix (RDM), which can be generated for datasets from different modalities (neuroimaging, behavior, and computational models) and directly correlated in order to evaluate their second-order similarity. Given the inherent noisiness of neuroimaging signals, it is important to evaluate the reliability of neuroimaging RDMs in order to determine whether these comparisons are meaningful. Recently, multivariate noise normalization (NNM) has been proposed as a widely applicable method for boosting signal estimates for RSA, regardless of the choice of dissimilarity metric, based on evidence that the analysis improves the within-subject reliability of RDMs (Guggenmos et al. 2018; Walther et al. 2016). We revisited this issue with three fMRI datasets and evaluated the impact of NNM on within- and between-subject reliability and RSA effect sizes using multiple dissimilarity metrics. We also assessed its impact across regions of interest from the same dataset, its interaction with spatial smoothing, and compared it to GLMdenoise, which has also been proposed as a method that improves signal estimates for RSA (Charest et al. 2018). We found that across these tests the impact of NNM was highly variable, as also seems to be the case for other analysis choices. Overall, we suggest being conservative before adding steps and complexities to the (pre)processing pipeline for RSA.

Representational similarity analysis (RSA) has become a staple of the multivariate pattern analysis (MVPA) toolkit in cognitive neuroscience. Methodologically, the core construct of the approach, the representational dissimilarity matrix (RDM), provides a common and straightforward format for summarizing and directly comparing datasets from different modalities to evaluate their second-order similarities. By converting multivariate signals in condition-rich experiments into RDMs, neural data acquired with fMRI, EEG/MEG, or cellular recordings can be directly compared both to each other and to RDMs derived from behavioral judgments and computational models (Kriegeskorte, Mur, and Bandettini, 2008a). Theoretically, by focusing attention on the "representational geometry" of multivariate datasets (Kriegeskorte and Kievit, 2013), RSA has its roots in the long tradition of psychological theories and methods that characterize the relationship between mental representations in terms of similarity structure (Attneave, 1950; Shepard, 1964). Thus, not only does RSA provide a unified analytic framework for formatting and comparing datasets; it also promises a means for bridging the gap between psychological constructs and their neural implementation.
As with any neuroimaging method, the viability of RSA is constrained by data quality. Although all neuroimaging data are noisy, for RSA the issue is especially pressing in light of the massive number of comparisons that are sometimes necessary to construct an RDM. For example, the classic study of Kriegeskorte et al. (2008b) included 92 image conditions, which requires calculating 4186 pairwise neural dissimilarity values to construct a single RDM. In principle, one approach to determining the reliability of an RDM would be to evaluate the stability of each of these comparisons individually (Bobadilla-Suarez et al. 2019; Ritchie and Op de Beeck, 2019). However, in practice, researchers have focused on more global properties of RDMs when evaluating their within- and between-subject reliability. For example, since different samples for the same condition should be more similar than samples from different conditions if activity patterns are reliable, a common procedure is to calculate all pairwise similarity correlations between independent splits of the data. If the on-diagonal correlations, reflecting self-similarity of conditions, are greater than the off-diagonal values, this suggests that on average the conditions can be differentiated (Haxby et al. 2001; Nili et al. 2020; Ritchie, Bracci, and Op de Beeck, 2017). However, since it is the off-diagonal values that are compared when carrying out RSA, another common method is to estimate between-subject reliability of these values by using a leave-one-subject-out procedure: RDMs for all but one subject are averaged and correlated with that of the remaining subject. The average across folds then gives a point estimate of the "noise ceiling"; that is, an upper bound of how much another RDM can on average correlate with individual neural RDMs (Nili et al. 2014).
The importance of reporting the reliability of neural dissimilarity has naturally led to proposals for how it can be improved. One major focus has been on the choice of dissimilarity metric, with different distance metrics having been proposed as alternatives to the standard 1 - r correlation distance (Allefeld and Haynes, 2014; Guggenmos, Sterzer, and Cichy, 2018; Nili et al. 2014; Nili et al. 2020). Walther et al. (2016) go a step further, not only proposing cross-validated Mahalanobis distance as a superior dissimilarity metric, but also recommending that within-subject reliability of neural RDMs can be improved by multivariate noise normalization (NNM). In standard fMRI pipelines, a GLM is fit to the BOLD signal of individual voxels and the activity patterns that are analyzed with RSA are the beta estimates of this model. However, the BOLD signal can be influenced by many sources of noise that are not captured by the GLM. Some of these sources of noise may have a spatial component (Friston, Jezzard, and Turner, 1994), in which case estimating the structure intrinsic to the noise can be used to improve the estimates of the signals of interest, and thereby improve the reliability of the data used to construct neural RDMs. NNM offers a method for improving the reliability of fMRI data by improving the beta estimates using information about voxel noise gleaned from the residuals of the GLM that is standardly used in first-level analysis. More specifically, NNM normalizes the beta weights by the covariance of the run-specific noise, in contrast to univariate noise normalization, which uses only the variance (Misaki et al. 2010).
Through both simulation and reanalysis of four datasets, Walther et al. found that NNM improves the split-half within-subject reliability of the off-diagonal values of neural RDMs regardless of the choice of dissimilarity metric. Applying the NNM approach to MEG data, Guggenmos, Sterzer, and Cichy (2018) also found a marked improvement. However, not all results have been positive. Charest, Kriegeskorte, and Kay (2018) compared the effects of NNM to those obtained with GLMdenoise, which estimates the number of noise predictors using a data-driven cross-validation procedure. Contrary to the previous studies, they found that NNM in fact made reliability worse, and only revealed a benefit when combined with the noise estimates derived from GLMdenoise. Even more concerning, Liu et al. (2021) found that, after carrying out NN, an observed interaction in representational dissimilarity for adjacent fingers and age was no longer significant. Taken as a whole, these studies suggest there is at present equivocal support for the effectiveness of NNM for improving the reliability of neural RDMs.
In the present study we revisited the issue of the effectiveness of NNM at improving the reliability of neural dissimilarity estimates regardless of the choice of dissimilarity metric. First, we attempted to replicate the findings of Walther et al. with three fMRI datasets from previous studies (Bracci and Op de Beeck, 2016; Ritchie and Op de Beeck, 2019). Second, unlike the previous work on NNM, we did not restrict ourselves to sensory-motor ROIs or a single ROI per dataset. Third, we also evaluated the impact of NNM on both between-subject reliability, or the noise ceiling, and RSA effect sizes. Fourth, we compared the effect of NNM with and without spatial smoothing, which has a minor positive effect on RSA, at least when using the 1 - r correlation as the dissimilarity metric (Hendriks et al. 2017; Op de Beeck, 2010). Finally, we attempted to reproduce some of the findings of Charest et al. by comparing results with NNM to those obtained with GLMdenoise or with NNM when using the noise estimates of GLMdenoise.

Datasets
We reanalyzed fMRI datasets from three previously published studies, described below. In each case, the experiments were approved by the ethics committee of UZ/KU Leuven and all methods were performed in accordance with the relevant guidelines and regulations. All studies were carried out on a 3T Philips scanner with a 32-channel coil at the Department of Radiology of UZ Leuven. Functional MRI volumes were collected with T2*-weighted echo planar imaging (EPI) scans with virtually identical parameters (Table 1). Preprocessing, including slice time correction and motion correction, was carried out with SPM8 or SPM12. First-level analysis was also carried out with SPM, and the GLM included all of the stimulus conditions as well as six motion correction parameters (translation and rotation in the x, y, and z axes). We note that in a standardly constructed GLM, although a single design matrix is used to model all the data, beta estimates are specific to each run and independent of each other. Further differences in the preprocessing of the images are noted below, and full details of the analysis pipelines can be found in the original studies. Stimuli of the datasets, and the ROIs used in the present study, are depicted in Fig. 1.
Dataset 1 (D1). The first dataset came from a study (N = 10) investigating the role of abstraction in category learning behavior using activity patterns from early visual cortex (Ritchie and Op de Beeck, 2019). Stimuli consisted of 16 square-wave annular gratings varying in four levels of spatial frequency and orientation (Fig. 1). Subjects completed 12 runs of a rapid event-related design in which each image appeared and flashed for 2 s (phase reversing at 4 Hz) followed by 2 s of fixation. The region of interest (Fig. 1) was anatomically defined V1 (Benson et al. 2012). For the present study the data were also smoothed at two levels: 6 mm and 9 mm FWHM. The original data were not normalized, and analysis was carried out within the native brain space of individual participants. In the present work, the data were additionally analyzed after transformation to a normalized brain space. All analyses on the normalized-space data were identical to those carried out on the native-space data, except that the normalized ROI images were thresholded so that the increase in the number of voxels within an ROI, relative to the native-space ROI image, was minimal (mean increase = 44 voxels; SD = 29). For RSA the model RDM was based upon the pairwise similarity judgments for the grating stimuli from both the in-scanner judgments of participants and a separate group of participants (N = 10) who performed the task offline.
Dataset 2 (D2). The second dataset came from a study (N = 14) contrasting activity patterns for object category vs shape in multiple regions of the ventral and dorsal visual pathways (Bracci and Op de Beeck, 2016). Stimuli consisted of 54 greyscale natural images of 6 object types and 9 orthogonal shape types (Fig. 1). Across two sessions, subjects completed 16 (in one case 14) runs of a rapid event-related design in which two repeats of each image appeared for 1.5 s followed by 1.5 s of fixation, in a pseudorandom order. The main ROI considered was bilateral object-selective lateral occipitotemporal cortex (LOTC), defined by a functional contrast of chairs > scrambled images based on separate localizer runs (Fig. 1). We also consider two other ROIs from the study: superior parietal lobe (SPL) and early visual cortex (BA17). Both regions were defined by a contrast of all localizer conditions > baseline. For the present study the data were smoothed at two further levels: 6 mm and 9 mm FWHM. Data were also normalized. For RSA the model RDM included the similarity ratings for shape based on a multiple arrangement task.
Dataset 3 (D3). The third dataset came from a study (N = 21) contrasting response patterns for social vs non-social actions across a large number of brain regions (Lee et al.). Stimuli consisted of 75 videos (3 s each), with 39 depicting human-to-human touch interactions and 36 showing human-to-object interactions (Fig. 1). The stimulus set included 3 videos depicting the same interaction with different actor pairs. Here we keep these conditions separate, as in the original study. Subjects completed 6 long runs of a rapid event-related design in which each video was shown followed by 3 s of fixation in pseudorandom order. The main ROI considered was the bilateral temporoparietal junction (TPJ), which was defined by a contrast of observed touch vs baseline within an anatomical mask (Fig. 1). We also considered early visual cortex (BA17) and a lateral occipital region (BA37) as ROIs. Both were defined by the same contrast within anatomically defined masks. Data were normalized, and up-sampled to 2 mm resolution from 2.7 mm (Table 1). We also considered two levels of smoothing: 4 mm and 6 mm FWHM. For RSA the model RDM was the binary model matrix for the social/non-social division of the videos.

Multivariate noise normalization
Procedurally, NNM was carried out as follows. Let Y be a matrix of size S (number of total scans across runs) x V (number of voxels in an ROI) consisting of the raw BOLD signal amplitudes after preprocessing, and let X be an S x P (number of predictors) design matrix from a GLM. We used the design matrix from SPM, which consists of separate columns for each experimental condition (for each run) as well as run-wise nuisance predictors for the six head motion parameters and run-specific constants. The ordinary least squares estimate of the beta values for each predictor of X across all acquisitions in Y is then:

B = (X^T X)^(-1) X^T Y,    (1)

where B is a P x V matrix of the beta weights for each predictor (rows) for each voxel (columns) of the ROI. In turn, the estimate of the residuals of the model is:

R = Y - XB,    (2)

where R is an S x V matrix reflecting the residual information of each scan (rows) for each voxel (columns). Following Walther et al. (2016, Eq. 4), the V x V variance-covariance matrix for run k is then:

Σ_k = (1 / (S_k - 1)) R_k^T R_k,    (3)

where R_k is the portion of the residual matrix R corresponding to the S_k scans in run k. Because the number of features (voxels) may be greater than the number of samples (scans), Σ_k can be rank deficient and non-invertible. To address this, previous studies have used the optimal shrinkage factor of Ledoit and Wolf (2004) to regularize Σ_k towards the diagonal matrix. We used the oracle approximating shrinkage (OAS) factor, which outperforms the method of Ledoit and Wolf when the number of samples is much smaller than the number of features (Chen et al. 2010). After shrinkage, the multivariate noise normalized versions of the beta weights are calculated as:

B_k* = B_k Σ_k^(-1/2),    (4)

where B_k is a C (number of conditions) x V matrix corresponding to the portion of B for run k that only includes the beta weights for the experimental predictors (i.e. excluding the nuisance predictors).
More generally, B * is then a P ' x V matrix, where P ' is the number of conditions multiplied by the number of runs. As the estimates of the residuals and NN M are carried out independently for each run, the values in B * can then be used to estimate neural dissimilarity values based on the crossvalidated metrics described below.
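As an illustration, the run-wise procedure above can be sketched in numpy. The original analyses were implemented in Matlab with SPM functions; the function names below and the OAS formula (our reconstruction from Chen et al., 2010, shrinking toward a scaled identity) are ours, not the paper's code:

```python
import numpy as np

def oas_shrinkage(S, n):
    """Oracle approximating shrinkage toward a scaled identity
    (our reading of Chen et al. 2010; the published code may differ)."""
    p = S.shape[0]
    mu = np.trace(S) / p
    tr_S2 = np.sum(S * S)               # trace(S @ S) for symmetric S
    tr2 = np.trace(S) ** 2
    num = (1.0 - 2.0 / p) * tr_S2 + tr2
    den = (n + 1.0 - 2.0 / p) * (tr_S2 - tr2 / p)
    rho = 1.0 if den <= 0 else min(1.0, num / den)
    return (1.0 - rho) * S + rho * mu * np.eye(p)

def noise_normalize_run(Y_k, X_k, n_cond):
    """Multivariate noise normalization for one run.
    Y_k: scans x voxels; X_k: scans x predictors (conditions first)."""
    B_k = np.linalg.pinv(X_k) @ Y_k             # OLS betas (Eq. 1)
    R_k = Y_k - X_k @ B_k                       # residuals (Eq. 2)
    S_k = (R_k.T @ R_k) / (R_k.shape[0] - 1)    # noise covariance (Eq. 3)
    S_k = oas_shrinkage(S_k, R_k.shape[0])
    # inverse square root via eigendecomposition of the symmetric matrix
    w, V = np.linalg.eigh(S_k)
    S_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    return B_k[:n_cond] @ S_inv_sqrt            # whitened condition betas
```

Applying `noise_normalize_run` to each run independently and stacking the outputs yields the P' x V matrix of whitened betas described above.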
This procedure for NN M was carried out on all three datasets for the (smoothed and unsmoothed) imaging data, which was masked to select only the voxels within the specified ROIs. For D1, the same analysis was also carried out on the first 6, 8, 10, and 12 runs of the unsmoothed data.
Another option for normalizing the responses of each voxel is to down-weight the beta estimates of noisier voxels based on the standard deviation of their noise:

b*_{p,k} = b_{p,k} / σ_{p,k},

where b_{p,k} is the beta estimate of voxel p for some predictor in run k, and σ_{p,k} is the standard deviation of its residual, or the square root of the diagonal value in Σ_k for voxel p (Eq. 3). This univariate noise normalization (NNU) is similar to using t-values instead of beta estimates in order to improve classifier performance and RDM reliability (Charest et al. 2018; Misaki et al. 2010). This alternative form of normalization was also carried out on all three datasets, for the unsmoothed data, masked by the ROI images.
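The univariate variant reduces to an elementwise division; a minimal numpy sketch (hypothetical naming, not the paper's code):

```python
import numpy as np

def univariate_noise_normalize(B_k, Sigma_k):
    """Down-weight each voxel's betas by the standard deviation of its
    residuals, i.e. the square root of the diagonal of Sigma_k."""
    sd = np.sqrt(np.diag(Sigma_k))
    return B_k / sd          # broadcasts across the condition rows
```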
Implementation of both NN M and NN U was carried out in Matlab using inbuilt SPM functions and custom code.

Dissimilarity metrics
An RDM is typically constructed as a C x C matrix of all pairwise estimates of dissimilarity between C conditions across features (voxels) based on some metric. We considered four different dissimilarity metrics in our analysis, the first three of which were evaluated by Walther et al. (2016) .
The first metric is the commonly used 1 - r, or correlation, distance (Cor), which has been a mainstay of RSA since its initial development (Haxby et al., 2001; Kriegeskorte, Mur, and Bandettini, 2008; Op de Beeck et al., 2008; Kietzmann et al., 2019). To compute this metric, the portions of B for the conditions i and j are averaged across runs, and the resulting beta row vectors b̄_i and b̄_j are linearly correlated and the Pearson's correlation coefficient, r_ij, is subtracted from 1. Note that it is common practice to subtract the mean pattern from beta vectors before correlating them, as was the case for the study from which D3 was selected (Lee et al.). However, this has the potential to distort the relationships between conditions. We elected not to perform mean subtraction, which entails that the correlation distance can be influenced by global signal differences across voxels. However, we note that preliminary analysis found that run-wise mean subtraction had no appreciable influence on the results when NNM was employed. It is also possible to use a cross-validated version (Guggenmos et al. 2018); however, we opted for the simpler, and clearly most common, method.
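For concreteness, the Cor RDM as described (no mean-pattern subtraction) can be computed in one line of numpy; the function name is ours:

```python
import numpy as np

def correlation_distance_rdm(B):
    """1 - Pearson r between condition patterns (rows of B),
    without subtracting the mean pattern first."""
    return 1.0 - np.corrcoef(B)
```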
The second metric is pairwise classifier accuracy (Cla), which is the first of the cross-validated metrics we consider. For this we used linear discriminant analysis (LDA), which maps samples onto a discriminant axis that maximizes the between-class variance while minimizing the within-class variance. A decision value is then positioned orthogonally to the discriminant, and decisions about the label for test data are based on the position of a sample on the discriminant relative to the decision value. For LDA (and the other metrics described below) we used a leave-one-run-out cross-validation procedure: the data from the training runs were averaged and used to estimate the discriminant and decision value, which were then used to label the test data. This was carried out for all cross-validation folds, and the dissimilarity metric is then the pairwise classifier accuracy. LDA makes an assumption of homoscedasticity: that the beta weight vectors for i and j have identical multivariate Gaussian distributions that differ only in their mean; that is, they have the same within-class variance-covariance matrix Σ. These details are elaborated on below. Linear support vector machines (SVMs) are another popular algorithm for estimating neural dissimilarity, which do not make the same distributional assumptions and in some cases may be superior in performance to LDA (Misaki et al. 2010). However, we preferred LDA because it is analytically simpler (and therefore computationally faster), and because it is mathematically closely related to the other two distance metrics we utilized, which have been emphasized in previous work on NNM (Guggenmos et al. 2018; Walther et al. 2016).
The third metric was the pairwise cross-validated squared Euclidean distance (Euc), which can be expressed as:

d_ij = (b̄_i^TR - b̄_j^TR)(b_i^TS - b_j^TS)^T,    (5)

where b̄_i^TR is the average beta weight vector for condition i from the training runs, denoted TR, and b_i^TS is the beta weight vector from the test run, denoted TS. Procedurally, the data for conditions i and j were demeaned and split into training and test partitions (leave-one-run-out); the run-wise averages were computed for the training partition, and the difference between these averages was multiplied by the transpose of the difference between the beta weight vectors in the test partition. The final dissimilarity value was the result of this procedure averaged across cross-validation folds.
Finally, when the inverse of the variance-covariance matrix of the training data, or within-class "scatter", is inserted into Eq. 5, we get what Walther et al. (2016) call the cross-validated squared Mahalanobis distance (Mal) as a metric of dissimilarity:

d_ij = (b̄_i^TR - b̄_j^TR) Σ_TR^(-1) (b_i^TS - b_j^TS)^T.    (6)

The within-class scatter, Σ_TR, is just the equally weighted average of the variance-covariance matrices for the training samples of beta weights for conditions i and j (Misaki et al. 2010):

Σ_TR = (1/2)(Σ_i + Σ_j), with Σ_i = cov(B_i),    (7)

where B_i is a matrix of all the samples for condition i in the training data. Procedurally, these variance-covariance matrices can also be rank deficient, and so they too were regularized using OAS (Chen et al. 2010). When defined in this way, Mal is closely related to LDA, since the pairwise difference along the discriminant, as a portion of Eq. 6, describes the feature weights of an LDA classifier:

w = (b̄_i^TR - b̄_j^TR) Σ_TR^(-1).    (8)

The decision value along the discriminant can then be calculated from the mean beta weight patterns for the two conditions in the training data:

c = w (b̄_i^TR + b̄_j^TR)^T / 2.    (9)

Then for any beta weight vector b from the test set, we can compute the value:

v = w b^T - c.    (10)

If v > 0, the Fisher discriminant rule says to guess that b is from class i; if v < 0, the rule says to guess that b is from class j.
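The close relationship between Euc and Mal is easy to see in code: one fold of each differs only in whether the training-mean difference is whitened by the within-class scatter. A numpy sketch (our naming), following Eqs. 5 and 6:

```python
import numpy as np

def cv_distances(Bi_train, Bj_train, bi_test, bj_test, Sigma_TR):
    """One cross-validation fold of the cross-validated squared Euclidean
    (Eq. 5) and squared Mahalanobis (Eq. 6) distances."""
    d_train = Bi_train.mean(axis=0) - Bj_train.mean(axis=0)
    d_test = bi_test - bj_test
    euc = d_train @ d_test                               # Eq. 5
    mal = d_train @ np.linalg.inv(Sigma_TR) @ d_test     # Eq. 6
    return euc, mal
```

With Sigma_TR equal to the identity, the two values coincide, which is one way to see why replacing the scatter with the identity (as in Guggenmos et al. 2018) collapses Mal onto Euc.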
Two clarifications are worth making about Mal as a dissimilarity metric before continuing. First, Walther et al. (2016) describe their preferred metric, the linear discriminant contrast (LDC), or "crossnobis" distance, as identical to Mal (Walther et al. 2016, Eq. 9; see also Nili et al. 2020, p. 6). This assertion is misleading, since LDC is defined as the value one obtains when one first carries out NNM and then uses Euc as a dissimilarity metric (Diedrichsen et al. 2016, Eq. 4; Walther et al. 2016, Eq. 7). However, the variance-covariance matrices used for NNM (Eq. 3) and Mal (Eq. 7) are clearly not the same, and Mal can be used without NNM (van Meel and Op de Beeck, 2020; Ritchie and Op de Beeck, 2019; Ritchie et al. 2020). To avoid confusion, we treat Euc and Mal as wholly distinct metrics. Since the assumption that NNM somehow converts Euc to Mal is obscure, we did not follow Guggenmos et al. (2018) in replacing the within-class scatter from Eq. 7 with the identity matrix. Second, LDA is essentially a discretized version of Mal, and so will necessarily contain less information than a continuous metric. To address this drawback, one possibility is to use the distance between v and c as a dissimilarity metric, or the equivalent decision value for other classifiers, to weight the classifier accuracies (Guggenmos et al. 2018, Eq. 3). However, it is unclear to us what the advantage of this metric is compared to simply calculating Mal directly, and so we do not consider it in the present work.
The calculation of the dissimilarity values using all these metrics was carried out with custom code in Matlab along with the CoSMoMVPA toolbox ( Oosterhof et al. 2016 ).

Estimating reliability
The reliability of individual RDMs was estimated in a number of ways. First, following the analysis of Walther et al., our primary form of evaluation was within-subject reliability. For this we split the data of individual subjects into the odd and even runs and then constructed RDMs for these partitions based on the different metrics described above. The two RDMs were then Pearson's r correlated with each other to determine their reliability (Charest et al. 2018; Guggenmos et al. 2018; Walther et al. 2016). Other measures of reliability are possible. For example, Walther et al. (2016) propose the sum-of-squares difference for testing the reliability between two sets of RDMs because it is sensitive to scaling factors and so provides a better estimate for distance metrics. We opted for the Pearson's r correlation as our estimate of reliability for a number of reasons. First, it is the most familiar method. Second, it is the same method used for estimating between-subject reliability. Third, unlike the sum-of-squares difference, it can be used with Cor. And fourth, it preserves continuity with the analyses that are typically used to evaluate RSA effect sizes (e.g. multiple regression).
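The split-half computation reduces to correlating the off-diagonal values of the odd- and even-run RDMs; a numpy sketch (our naming):

```python
import numpy as np

def split_half_reliability(rdm_odd, rdm_even):
    """Pearson r between the lower-triangle (off-diagonal) values of the
    RDMs built from the odd and even runs of one subject."""
    idx = np.tril_indices_from(rdm_odd, k=-1)
    return np.corrcoef(rdm_odd[idx], rdm_even[idx])[0, 1]
```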
Second, for the between-subject reliability we calculated the noise ceiling (Cronbach, 1949; Nili et al. 2014; Op de Beeck et al. 2008). Individual RDMs were constructed using the above dissimilarity metrics for the full datasets, and then a leave-one-subject-out procedure was employed: the RDMs for all but one subject were averaged and then (Pearson's r) correlated with the left-out subject's RDM, and the coefficients were averaged across all iterations of the procedure. This has sometimes been described as the "lower" bound of reliability, with the upper bound defined by the same procedure except that the data of the left-out subject is also included in the group average (Nili et al. 2014). However, this latter procedure is explicitly described as generating overestimates of the reliability, while the former procedure is not. Since it is not clear how it provides an accurate description of the relevant quantity, which is the explainable variance, we do not consider this overfitting estimate. Crucially, while in standard applications the noise ceiling is treated as a point estimate, there is of course a distribution based on the leave-one-subject-out folds, which is reflected in the results to follow.
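The leave-one-subject-out procedure can be sketched in a few lines (our naming; `rdm_vectors` is a subjects x pairs array of vectorized off-diagonal RDM values):

```python
import numpy as np

def noise_ceiling(rdm_vectors):
    """Lower-bound noise ceiling: correlate each subject's RDM vector
    with the mean RDM vector of the remaining subjects, then average.
    Returns the point estimate and the per-fold distribution."""
    rdm_vectors = np.asarray(rdm_vectors)
    rs = []
    for s in range(rdm_vectors.shape[0]):
        rest = np.delete(rdm_vectors, s, axis=0).mean(axis=0)
        rs.append(np.corrcoef(rdm_vectors[s], rest)[0, 1])
    return float(np.mean(rs)), rs
```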
Third, for Cor we also evaluated the relationship between the on- and off-diagonal values, or "exemplar discriminability index" (EDI) as it has also been called (Haxby et al. 2001; Nili et al. 2020). The data of individual subjects were again split in half into odd and even runs, and then the pairwise correlations were computed between the two splits. In principle, the patterns for a condition should be self-similar across the two splits, in which case the diagonal values will be higher than the off-diagonal values (Ritchie, Bracci, and Op de Beeck, 2017). The on- and off-diagonal values are then averaged, and their difference score is reported (on minus off). Positive values indicate that the patterns for each condition are on average reliably self-similar. Testing for the significance of this index of reliability assumes that the distribution of the difference scores is 0-mean normal under the null hypothesis. Although this assumption is not in fact true of the distribution, the simulations of Nili et al. (2020) suggest that the false positive rate is in fact low, and so the test remains valid.
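The EDI itself is a simple difference of means over the split-half correlation matrix; a numpy sketch (our naming):

```python
import numpy as np

def exemplar_discriminability_index(split_corr):
    """On-minus-off difference for the condition x condition matrix of
    correlations between the odd and even splits."""
    on = np.mean(np.diag(split_corr))
    off_mask = ~np.eye(split_corr.shape[0], dtype=bool)
    return on - np.mean(split_corr[off_mask])
```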

Comparison to model RDMs
To evaluate the impact of NNM on the effect sizes for RSA, we compared individual subject RDMs to distinct model RDMs for each dataset (described above). The bottom half of each individual RDM (based on the full data, as constructed for the between-subject reliability analysis) was converted to a column vector and rank-order correlated (Spearman's ρ) with the lower half of the model RDM, which was also converted to a column vector. For discussion of why rank-order correlations such as Spearman's ρ are typically used to test RSA effects, rather than Pearson's r, see Kriegeskorte, Mur, and Bandettini (2008, Appendix). Notably, when binary coding is used, as with the social/non-social touch model for D3, dissimilarity values in effect have a Bernoulli distribution (taking values 0 or 1 at different frequencies), and so assuming linearity is inappropriate.
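A numpy-only sketch of this comparison (our naming; ties, as in a binary model RDM, receive average ranks):

```python
import numpy as np

def rankdata_avg(x):
    """Average ranks; tied values share the mean of their ranks."""
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x))
    sx = x[order]
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and sx[j + 1] == sx[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def spearman_rsa(neural_rdm, model_rdm):
    """Spearman correlation between the lower triangles of a neural and
    a model RDM."""
    idx = np.tril_indices_from(neural_rdm, k=-1)
    a = rankdata_avg(neural_rdm[idx])
    b = rankdata_avg(model_rdm[idx])
    return np.corrcoef(a, b)[0, 1]
```

Because only ranks enter the correlation, any monotonic transformation of the neural dissimilarities leaves the effect size unchanged, which is the property that makes the rank-order statistic appropriate here.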

GLMdenoise
GLMdenoise is a method that estimates beta values and automatically derives nuisance predictors, and the optimal number of these predictors, directly from an fMRI dataset via cross-validation procedures (Kay et al. 2013). The guiding idea behind the pipeline is that this procedure can improve the beta estimates by iteratively determining how much of the variance between conditions and noise predictors can be captured. Unlike SPM, GLMdenoise analyzes the data from all runs together, and so outputs only a single beta weight for each condition. Although each run has its own set of noise regressors, the number of these regressors is fixed across runs and is determined by the steps described below. GLMdenoise also includes a number of polynomial regressors to characterize the baseline signal level, which shifts over time in each run. Polynomials of degrees 0 to round(L/2) are included, where L is the length of the runs in minutes. Thus, given the relatively short run length, there were three such regressors for D1 and five for D2 and D3, which had comparatively longer run lengths.
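The degrees-0-to-round(L/2) rule can be made concrete with a small sketch (our own function, not part of GLMdenoise; note that Python's round() uses banker's rounding at the halfway point, which may differ from the Matlab implementation):

```python
import numpy as np

def polynomial_drift_regressors(n_scans, tr_seconds):
    """Polynomial baseline regressors of degrees 0..round(L/2),
    where L is the run length in minutes."""
    run_minutes = n_scans * tr_seconds / 60.0
    max_degree = round(run_minutes / 2.0)
    t = np.linspace(-1.0, 1.0, n_scans)   # centered time axis for the run
    return np.column_stack([t ** d for d in range(max_degree + 1)])
```

For example, a run of roughly five minutes yields three regressors (degrees 0, 1, 2), while a run of roughly eight minutes yields five, matching the counts reported for the datasets above.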
GLMdenoise has a number of steps, which we briefly summarize (for full details see Kay et al. 2013). First, a seed HRF for each condition is generated based on the stimulus durations (2, 1.5, and 3 s). Second, the HRF is estimated by alternately holding the HRF or the nuisance regressors fixed while computing the ordinary least squares estimate of the other, until the parameter estimates converge (defined as R² between the current and previous HRF iterations reaching 99%). Third, having set the form of the HRF, a leave-one-run-out cross-validation procedure is used to determine the best fit of the GLM for each voxel. The predictions from the folds are combined and the total GLM model is compared to the data using R². Fourth, voxels are selected for the noise pool based on R² < 0, as determined at the previous step. Although this step can be used to exclude voxels outside the brain, we truncated the raw data using the whole-brain mask generated by the GLM from SPM (Charest, Kay, and Kriegeskorte, 2018). Fifth, PCA is run on the noise pool, while projecting out the polynomial regressors. It is the resulting principal components that constitute the noise regressors for each run. Sixth, cross-validation is again used to evaluate the model fit while varying the number of noise regressors, by iteratively increasing the number of principal components from the previous step that are used as predictors. The only way this step impacts the beta estimates is if there are correlations between the noise and condition regressors. This procedure also allows for greater variance to be captured. Seventh, the number of PCs is selected based on the median R² across the runs. A final step, a bootstrap procedure to estimate error bars, was not performed, to conserve computational resources.
We used the resulting denoised data in four ways. First, we applied the analysis separately to the odd and even splits in order to determine the impact on within-subject reliability. Note that this allowed the number of PCs to differ between the odd and even splits. Also, since the predictors are concatenated, resulting in a single beta estimate for a condition from all runs, we could only estimate the within-subject reliability for the Cor measure, since cross-validation was not possible. Second, we applied GLMdenoise to the full datasets to evaluate the between-subject reliability and RSA effect sizes, again only using RDMs constructed with the Cor metric. Third, we attempted to replicate the analysis of Charest et al., in which the design matrix generated by GLMdenoise is used to carry out NN, by plugging the matrix into Eqs. 1 and 2. However, this procedure required restructuring the design matrices in order to ensure run-specific estimates of beta values for each condition, both for NNM and to make utilization of the cross-validated dissimilarity metrics possible. To this end the estimates for each condition were assigned a distinct column in the design matrix (i.e. were no longer concatenated). No other modification of the design matrices was carried out. We note that, as GLMdenoise already cross-validates across runs, the assumption of independence between runs made by NNM and the cross-validated metrics is violated. However, the reconfigured design matrices were only used to evaluate the within-subject reliability, where the cross-validation at least occurs independently for the odd and even splits. Finally, we compared the results to those obtained when the SPM design matrices were split for the odd and even runs and the predictors for the experimental conditions concatenated across runs.
This allowed us to assess whether any improvement in within-subject reliability obtained with GLMdenoise could also be achieved with SPM by changing the design matrix to only estimate a single beta value per condition.
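Eqs. 1 and 2 are not reproduced in this section; the sketch below assumes the standard form of multivariate noise normalization, in which condition patterns are whitened by a shrinkage-regularized estimate of the residual covariance. The function name and the fixed shrinkage value are illustrative stand-ins (the literature typically uses an optimal shrinkage estimator).

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def whiten_betas(betas, residuals, shrinkage=0.4):
    """Multivariate noise normalization sketch: whiten condition patterns
    by the residual covariance of the same run's GLM.

    betas:     (conditions, voxels) pattern estimates
    residuals: (timepoints, voxels) GLM residuals
    shrinkage: fixed shrinkage toward the diagonal (illustrative only)
    """
    r = residuals - residuals.mean(axis=0)
    sigma = r.T @ r / (r.shape[0] - 1)            # residual covariance
    sigma = (1 - shrinkage) * sigma + shrinkage * np.diag(np.diag(sigma))
    w = fractional_matrix_power(sigma, -0.5)      # Sigma^(-1/2)
    return betas @ w.real
```

Run-specific design matrices matter here because the residual covariance, and hence the whitening matrix, is estimated per run before patterns are compared across runs.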

Statistical analyses
To statistically assess the impact of NNM on reliabilities and RSA effect sizes we modeled the individual correlation coefficients using linear mixed effects (LME) models with subject as a grouping factor and NNM, dissimilarity metric, and their interaction as fixed effects. There were random effects for the intercept, NNM, and metric for each level of the grouping variable (that is, subject). To assess the overall fit of the models we report the R². Because the metric predictor included four classes, the LME estimates the effect of each metric and its interaction with NNM. Thus, to also assess the main effects of choice of dissimilarity metric and the interactions we applied an ANOVA to the LME. Although this set of statistical procedures carries the common (and commonly violated) assumption that the data are normally distributed, we utilized LME in this case because, by grouping the data by subject, it provides a more appropriate way of testing the impact of NNM and metric, as the data for a subject across metrics and NNM are not independent; rather, they reflect the same data being analyzed in a number of different ways. A main effect of NNM indicates that, across metrics, NNM changes the correlation values at the level of individual subjects, not just the population mean; a main effect of metric indicates an effect of metric regardless of NNM; and an interaction indicates that these fixed effects influence each other. We also tested the paired differences (or difference scores; see Fig. 3 C) in mean correlation coefficients for each metric, when calculated with and without NNM, using paired t-tests.
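A model of this form could be specified with statsmodels. The sketch below uses synthetic split-half correlations and, for brevity, only a random intercept and NNM slope per subject rather than the full random-effects structure described above; all variable names and the effect sizes are invented for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# 20 synthetic subjects x 4 metrics x 2 NN conditions
subjects = np.repeat(np.arange(20), 8)
metric = np.tile(np.repeat(["Cor", "Euc", "Mal", "Cla"], 2), 20)
nn = np.tile([0, 1], 80)

# split-half correlations with a small NN effect and per-subject offsets
r = (0.4 + 0.05 * nn
     + np.repeat(rng.normal(0, 0.1, 20), 8)
     + rng.normal(0, 0.05, 160))
df = pd.DataFrame({"subject": subjects, "metric": metric, "nn": nn, "r": r})

# NN-by-metric fixed effects, subject as grouping factor,
# random intercept and NN slope per subject
model = smf.mixedlm("r ~ nn * metric", df, groups=df["subject"], re_formula="~nn")
fit = model.fit()
print(fit.params["nn"])
```

The interaction terms give the metric-specific NNM effects relative to the reference metric, which is why an ANOVA over the fitted model is needed to summarize the main effect of metric.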
When evaluating the impact of NNM when run number was manipulated for D1, we included run number as a fixed effect. When the data were smoothed, we further added level of smoothing (none, Level 1, or Level 2) as a fixed effect in the LME. When evaluating the impact of NNM on the EDI, the LME model only included NNM and smoothing as fixed effects, since only Cor was used as a dissimilarity metric. When evaluating the impact on within-subject reliability, smoothing was included along with NNM and metric as fixed effects, and therefore the model contained multiple two-way interactions and a three-way interaction term. As before, we evaluated main effects using an ANOVA applied to the LME and carried out paired t-tests of the mean individual correlations for each metric, with and without NNM. Since for the smoothed-data analysis this resulted in far more statistical tests for each dataset, we controlled for multiple comparisons by reporting FDR-adjusted p-values. For GLMdenoise, we carried out paired t-tests of the mean correlations obtained with GLMdenoise relative to those for baseline and NNM. The same tests were performed when comparing NNM with the GLMdenoise design matrices against NNM with the SPM design matrices and against baseline.
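The text does not specify which FDR procedure was used; assuming the common Benjamini-Hochberg step-up procedure, adjusted p-values can be computed directly:

```python
import numpy as np

def fdr_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values for a family of tests."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    # scale each sorted p-value by n / rank
    ranked = p[order] * n / np.arange(1, n + 1)
    # enforce monotonicity from the largest p-value downward
    adj = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(adj, 0, 1)
    return out
```

An adjusted p-value below the chosen threshold (e.g. 0.05) then controls the expected proportion of false discoveries across the family of paired t-tests.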

Exploratory analysis of residual variance-covariance matrices
In an exploratory analysis we investigated the covariance structure of the run-wise variance-covariance matrices derived from the GLM residuals based on the spatial distance between voxels in an ROI as well as the goodness-of-fit (GoF) of the GLMs for each voxel.
For each subject we constructed matrices for the initial three ROIs based on the pairwise Euclidean distance between voxels in voxel space. Since covariance between voxels plausibly decreases with their distance, we used the exponentially decaying distance between voxels as an estimate of similarity in spatial position:

s_ij = a · exp(−d_ij / λ) (Eq. 11)

where d_ij is the pairwise Euclidean distance between voxels i and j, a is the starting value, and λ controls the rate of the decay. Van Bergen and Jehee (2018) also consider Eq. 11 as a predictor of noise correlations between voxels and vary the rate and starting value of the decay in order to better fit their data. For the present preliminary analysis both values were set to 1. The off-diagonal values of the decaying distance matrix for each subject were correlated (Pearson's r ) with the covariance values of each of the run-specific variance-covariance matrices and then averaged. The group averages of these correlations were then tested for significance with a two-sided t -test. The between-run reliability of the run-specific covariance values was also estimated using a procedure similar to that for calculating the between-subject reliability of neural RDMs: for each run, the off-diagonal covariance values were correlated with the average values of the remaining runs, and then the across-run average of these correlations was calculated and tested for significance using a two-sided t -test.
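The comparison between spatial proximity and residual covariance can be sketched as follows (with the rate and starting value of Eq. 11 fixed to 1, as in the text; the function name is ours):

```python
import numpy as np

def distance_covariance_corr(coords, resid_cov):
    """Correlate exp(-d_ij) spatial similarity with residual covariance
    over off-diagonal voxel pairs.

    coords:    (voxels, 3) voxel-space coordinates
    resid_cov: (voxels, voxels) run-specific residual covariance matrix
    """
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    s = np.exp(-d)                           # decaying-distance similarity
    iu = np.triu_indices(len(coords), k=1)   # off-diagonal pairs only
    return np.corrcoef(s[iu], resid_cov[iu])[0, 1]
```

In the analysis above, this coefficient would be computed per run and then averaged within each subject before the group-level t-test.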
We considered whether the above correlations (between the decaying distance matrices and variance-covariance matrices) might predict whether NNM improved within-subject reliability. For this analysis we focused solely on Euc as a measure since, as reported below, it produced the largest improvements in within-subject reliability for D1 and D2. For all three datasets we subtracted individual within-subject reliability coefficients without NNM from those obtained with NNM. We also considered whether these difference scores might be predicted by the ratio between voxels and experimental conditions, which describes the size of the variance-covariance matrices and relates to the possibility of rank deficiency described above. Pooling across all three datasets, we rank-order correlated (Spearman's ρ) all three variables with each other: (i) the average correlations between the decaying distances between voxels and the run-specific covariance values from the residuals; (ii) the change in within-subject reliability after NNM when using Euc as the dissimilarity metric; and (iii) the ratio between voxels and experimental conditions.
We also evaluated whether the GoF of the GLMs for each voxel might bear a relationship to the covariance between voxels and predict whether NNM improved within-subject reliability. The measure of goodness-of-fit we used was the model-based SNR:

SNR_mb = Var(ŷ) / Var(y − ŷ)

which divides the variance of the explained BOLD signal by the variance of the unexplained BOLD signal ( Welvaert and Rosseel, 2013 ). Calculation of the SNR_mb was carried out using code adapted from the MACS SPM toolbox ( Soch and Allefeld, 2018 ). We evaluated this GoF measure in two ways. First, we determined the proportion of voxels in an individual subject ROI where the explained signal was greater than the unexplained signal (i.e. SNR_mb > 1). Second, we constructed a dissimilarity matrix based on the pairwise absolute difference in GoF, which was correlated with the run-averaged residual variance-covariance matrix. Both of these sets of values derived from the GoF were then rank-order correlated (Spearman's ρ) with the change in within-subject reliability after NNM when Euc was used as the metric.
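A minimal per-voxel version of this GoF measure might look like the following sketch (an ordinary least squares stand-in for the MACS toolbox computation):

```python
import numpy as np

def snr_mb(X, y):
    """Model-based SNR per voxel: variance of the fitted BOLD signal
    divided by the variance of the residual (Welvaert and Rosseel, 2013).

    X: (time, regressors) design matrix
    y: (time, voxels) data
    """
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    fitted = X @ beta
    resid = y - fitted
    return fitted.var(axis=0) / resid.var(axis=0)
```

A voxel with SNR_mb > 1 is one where the design explains more signal variance than remains in the residual, which is the criterion used above for the first evaluation.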

Results
We investigated whether NNM improves the estimate of condition-specific changes in the fMRI BOLD signal, and thereby the reliability of RDMs, and whether any effect of NNM interacts with the choice of dissimilarity metric. We reanalyzed three datasets (D1-D3) from previous studies in order to evaluate the impact of NNM on the within- and between-subject reliability of neural RDMs constructed using four different metrics, as well as on RSA effect sizes based on correlations with dataset-specific model RDMs. We further evaluated the results of NNM for other ROIs from two of these datasets, compared its impact across different levels of spatial smoothing, and finally compared it to results obtained with GLMdenoise, which has also been suggested as a method for improving signal estimates for the purpose of RSA. Because many statistical tests were performed for each portion of the results, they are reported in accompanying tables.

Multivariate noise normalization does not consistently improve the within- or between-subject reliability or RSA effect sizes
We began with within-subject analyses following the approach of Walther et al. For each dataset we calculated the correlation between the odd- and even-run RDMs of each subject, both with and without NNM, across dissimilarity metrics. For the four datasets they analyzed, Walther et al. considered regions from primary motor and sensory cortex (M1/S1) as well as high-level visual cortex. Thus, D1, with V1 as an ROI and gratings as stimuli, was included to have an equivalent early sensory area in our analysis. Similarly, D2 was included, with LOTC as an ROI and natural object images as stimuli, to have a dataset similar to two of those considered by Walther et al. However, Walther et al. only evaluated sensorimotor or visual areas and did not consider any ROI that is known to be responsive to more abstract or cognitive relationships between experimental conditions. This was our motivation for including D3, with its large number of social/non-social touch videos, and for initially focusing on an area like TPJ, which is known to be recruited by social cognition and theory of mind ( Saxe and Kanwisher, 2003 ). Inclusion of D3 was especially important for evaluating how well NNM applies more broadly.
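The split-half procedure can be sketched for the Cor metric as follows (a minimal numpy illustration; the function name is ours):

```python
import numpy as np

def split_half_reliability(patterns_odd, patterns_even):
    """Within-subject reliability: correlate the lower triangles of RDMs
    built separately from the odd- and even-run patterns.

    patterns_odd, patterns_even: (conditions, voxels) mean patterns
    Uses 1 - Pearson r (the Cor metric) as the dissimilarity.
    """
    rdm_odd = 1 - np.corrcoef(patterns_odd)
    rdm_even = 1 - np.corrcoef(patterns_even)
    il = np.tril_indices(len(patterns_odd), k=-1)  # unique condition pairs
    return np.corrcoef(rdm_odd[il], rdm_even[il])[0, 1]
```

For the cross-validated metrics the same logic applies, except that each split's RDM is itself built from run pairs within that split.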
For D1, the results were well described by the LME model, with significant main effects of NNM and metric and a significant interaction ( Table 2 A). All paired t-tests for the metrics were also significant ( Fig. 2 A). So, for D1, across metrics, NNM improved within-subject reliability, though the improvement tended to vary with choice of metric. For D2, the results were well described by the LME model, with significant main effects of NNM and metric and a significant interaction ( Table 2 A). Only the paired t-tests for Euc and Mal were significant ( Fig. 2 A). So, for D2, NNM only improved within-subject reliability when using the two distance metrics. For D3, the results were well described by the LME model and there was a significant main effect of NNM, but no effect of metric or interaction ( Table 2 A). However, the effect of NNM for D3 was in the wrong direction, though only the paired t-test for Cor was significant ( Fig. 2 A). So, for D3, NNM made within-subject reliability worse.
Next, we evaluated the impact of NNM on between-subject reliability by calculating the noise ceiling with and without NNM for each of the four metrics. For D1, the results were well described by the LME model and there were significant main effects of NNM and metric and a significant interaction ( Table 2 B). Only the paired t-test for Cla was not significant ( Fig. 2 B). So, for D1, NNM improved between-subject reliability, but the size of this improvement varied with metric ( Fig. 2 B). For D2, the results were well described by the LME model and there were significant effects of NNM, metric, and the interaction ( Table 2 B). Only the paired t-test for Cla was not significant ( Fig. 2 B). So, as with D1, for D2 NNM tended to improve between-subject reliability, but the size of this improvement varied with metric ( Fig. 2 B). For D3, the results were also well described by the LME model, with significant effects of NNM and metric, but no interaction ( Table 2 B). All paired tests were also significant but, unlike with D1 and D2, the differences were in the wrong direction, with NNM consistently lowering the between-subject reliability ( Fig. 2 B). So, for D3, NNM also made the between-subject reliability worse.
Finally, we considered how NNM might influence actual RSA effect sizes, since it was conceivable that it might impact the explainable variance without changing the mean correlations with model RDMs. For D1, the results were well described by the LME model, with significant main effects of NNM and metric and a significant interaction ( Table 2 C). The paired differences were only significant for Cor and Euc ( Fig. 2 C). Although the mean correlations were much lower for D2 than D1, the same pattern was observed: the LME model explained most of the variance, there were main effects of NNM and metric and an interaction ( Table 2 C), and the paired differences were again only significant for Cor and Euc. So, for both D1 and D2, NNM only seemed to increase the RSA effect sizes when using two of the metrics. For D3, the LME model was again significant, and the same pattern was observed as for the within- and between-subject reliability: there were main effects of NNM and metric and an interaction ( Table 2 C). But the effect of NNM was once again in the wrong direction, with mean correlations significantly lower across all metrics ( Fig. 2 C).
Two observations are worth making about the findings summarized so far. First, regarding the positive results for D1 and D2: among metrics, NNM had very little influence on any of the mean effect sizes when Cla was used, while it had the greatest impact when Euc was used. This substantial improvement seemed to occur because the correlations for Euc were low to begin with. In contrast, the mean reliabilities were already much higher when Cor and Mal were used as metrics and tended to be at least slightly improved by NNM. However, if one chose the metric based solely on the magnitude of the baseline RSA effect sizes when NNM was not utilized, then Mal should be selected, as NNM did not significantly increase its mean correlations. Second, it is notable that the baseline within-subject reliability was considerably lower for D3 than for the other two datasets. This is likely the case for a number of reasons: TPJ is a cognitive region; the study included a much larger number of conditions; and there were far fewer runs (only half as many as for D1). In part for these reasons we next considered the impact of NNM on other ROIs.

Univariate noise normalization has less impact on within-subject reliability
We next assessed whether it was specifically the normalization by the covariance between voxels that produced the discrepant results across datasets. To do this, we performed univariate noise normalization (NNU) and compared its impact on the within-subject reliability across all three datasets.
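Assuming NNU takes its standard form, dividing each voxel's condition estimates by that voxel's residual standard deviation while ignoring covariance between voxels, a minimal sketch is:

```python
import numpy as np

def univariate_nn(betas, residuals):
    """Univariate noise normalization: scale each voxel's condition
    estimates by the standard deviation of that voxel's GLM residuals.

    betas:     (conditions, voxels) pattern estimates
    residuals: (timepoints, voxels) GLM residuals
    """
    return betas / residuals.std(axis=0, ddof=1)
```

This is equivalent to multivariate noise normalization with the off-diagonal covariance terms set to zero, which is exactly the contrast the analysis above is after.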
For D1, we found that NNU also significantly improved reliability compared to baseline across all four of the dissimilarity metrics, but to a significantly lesser degree than NNM when Euc and Mal were used to construct the RDMs ( Fig. 3 A). For D2, NNU only improved reliability when Cor was used, with no significant difference for any of the other metrics, for which reliability was effectively unchanged from baseline. Thus, NNM also significantly improved reliability compared to the results with NNU when either Euc or Mal was used as the metric. Finally, for D3, NNU had no significant impact on the within-subject reliability, regardless of the metric used. Crucially, unlike NNM, there was no general trend of decreasing the reliability either.
These results suggest that the differential impact of noise normalization across datasets and metrics is specifically a result of applying NNM and normalizing by the run-wise covariance between the residuals of the voxels in an ROI, as NNU had comparatively far less influence (positive or negative) on within-subject reliability.

The improvement in within-subject reliability from multivariate noise normalization is consistent across number of runs
Another question is whether the difference in findings we observed between D1 and D2 compared to D3 is a result of the difference in the amount of data per subject. D1 and D2 contained 12 and 16 (or 14) runs per subject, while D3 only contained 6 runs. To assess the effect of the number of runs on within-subject reliability, we further analyzed the data of D1. For each subject we carried out the same analysis as before, splitting the odd and even runs, based on the first 6, 8, 10, or full 12 runs of a subject ( Fig. 3 B). For D1, the results were well described by the LME model, with significant main effects of NNM, metric, and run number, and a significant interaction between NNM and metric ( Table 3 ). With fewer runs, there was no significant increase in reliability when Cor was used as a metric, while the reverse pattern, of decreasing differences, was suggested when Cla was used. Regardless of the number of runs, NNM resulted in a substantial improvement when Euc and Mal were used as dissimilarity metrics. Notably, the baseline reliability achieved with Cor with only 6 runs, even without NNM, was much higher than for any of the cross-validated metrics. Furthermore, even with only 6 runs, the reliability for D1, even without NNM, tended to be higher than for D3, and similar to or higher than what was observed for D2.

[Figure caption fragment: conventions as in Fig. 2 A. (C) Bars indicate the difference of the within-subject reliabilities obtained when data were in subjects' native brain space vs. a normalized brain space; errors are the standard error of the mean. The scatter plot indicates the difference in within-subject reliability as a result of NNM in either the native subject space (Diff_native) or the normalized space (Diff_norm).]
These results show that it is possible to obtain substantial improvement in the within-subject reliability of individual RDMs with as few as 6 runs, which was the number of runs in D3. Thus, the discrepant results we observed for that dataset are unlikely to be merely a result of run sample size.

The improvement in within-subject reliability from multivariate noise normalization is not impacted by normalizing brain space
Another possibility is that the relatively consistent results observed for D1, relative to D2 and D3, were a result of the fact that its analysis was carried out on data in the native brain space of each individual subject and not in a normalized brain space, as was the case for the other two datasets. It is conceivable that, to the extent that the noise information present in the residuals has a spatial component, this component might be distorted through normalization.
To test this, we first normalized the data for D1 before carrying out the exact same analyses: constructing a GLM for the data, applying NNM, and evaluating the impact of the method on within-subject reliability across dissimilarity metrics. When the within-subject reliabilities obtained for the normalized-space data were subtracted from those for the native-space data, these difference scores were never significantly different from zero ( Fig. 3 C). The lack of influence of normalization is also apparent when looking at the individual data points, which largely fall along the diagonal in Fig. 3 C, indicating that there was no appreciable difference in the impact of NNM whether data from a normalized or native brain space were used.
These results suggest that the difference in results for D1 compared to D2 and D3 is unlikely to simply be a result of whether data are analyzed in a subject's native brain space.

The impact of multivariate noise normalization is consistently inconsistent across regions of interest
For the three datasets we considered, and the three evaluations we performed, the impact of NNM was most consistently beneficial for D1, more mixed for D2, and generally quite negative for D3. Unlike D1, D2 and D3 both came from studies that considered a large number of ROIs. Given our discrepant findings, we next asked whether similar results might be obtained for a selection of the other ROIs from these studies. Since the ROI for D1 was a portion of V1, we also considered the equivalent ROIs for D2 and D3 to determine whether the impact of NNM would be consistently more positive in early visual cortex across datasets; in both cases these regions were labeled BA17 in the original studies. For D3 we initially focused on a non-visual region, and this raised the question of whether this might play a role in the negative influence of NNM we observed. Thus, for D2, we also considered an ROI outside of the ventral visual pathway, SPL. For D2 we had initially focused on a high-level visual region, and so we also considered a similar ROI for D3: a portion of lateral occipitotemporal cortex, labeled as BA37 in the original study. For the pairs of new ROIs for D2 and D3, we then again carried out the same analyses as before. Results are depicted in Fig. 4 and statistical tests are reported in Table 4 .
For both datasets, the within-subject reliability correlations for neural RDMs for BA17 were well described by the LME model. There were significant main effects of metric and the interaction for D2, and only of metric for D3 ( Table 4 A). However, for D2 there was only a significant increase for Euc, and there were no significant differences in mean reliability for D3 ( Fig. 4 A). Thus, for D2, NNM seemed to have very little positive impact on the mean within-subject reliability of neural RDMs for BA17, and for D3, NNM seemed to have no impact, though there was possibly a slight trend in a positive direction. For D2 (SPL) the LME model was significant, and there were main effects of NNM and metric, but no interaction ( Table 4 A). As with TPJ for D3, NNM decreased within-subject reliability for all metrics, though only the effect for Cor was significant. The results for D3 (BA37) were qualitatively similar to those for BA17: although the LME model was not significant, there were significant effects of NNM, metric, and the interaction, with only a significant pairwise difference for Cor and no effect, or possibly a positive trend, otherwise ( Fig. 4 A).
When it came to the results for the between-subject reliability, for both datasets the correlations for all ROIs were well described by the LME model and all main effects were significant, with the exception of NNM for D2 (SPL) and D3 (BA17) ( Table 4 B; Fig. 4 B). Thus, across metrics, the impact of NNM on between-subject reliability was also quite variable when considering the early visual ROIs for D2 and D3. For D2 (SPL) there were significant main effects of metric and the interaction, but not NNM ( Table 4 B). Only when Euc was used as a metric was there a significant increase in the mean correlations due to NNM. For D3 (BA37) the pattern of results was similar to that for TPJ, with significant main effects, but with significant decreases in reliability for all metrics except Euc ( Fig. 4 B). Finally, when it came to the RSA effect sizes, the correlations for D2 (BA17) and D3 (BA37) were well described by the LME model and all main effects were significant ( Table 4 C). For D2 (BA17) there were only significant increases in the mean correlations for Cor and Euc, while for D3 (BA37) NNM resulted in significant decreases in mean effect sizes for all metrics.

Table 4
Summary of linear mixed effects (LME) modeling of the results depicted in Fig. 4 .

Table 5
Summary of linear mixed effects (LME) modeling of the results depicted in Fig. 5 .

These results further suggest that the mixed findings we previously observed were not merely a result of the ROIs that were initially selected. Although the R² values appeared similar for the other two ROIs, the LME did not significantly capture any variance for D2 (SPL) or D3 (BA17), for which there was very little variation ( Fig. 4 C). These null findings were somewhat expected, given the weak (D2) or non-existent (D3) findings from the original studies ( Bracci and Op de Beeck, 2016 ).

Multivariate noise normalization interacts with spatial smoothing in different ways, for different datasets
Previous work suggests that multivariate BOLD signals can be modestly enhanced by spatial smoothing, which can improve the EDI values and the reliability of RDMs for RSA ( Hendriks et al. 2017; Op de Beeck, 2010 ). We therefore sought to evaluate how spatial smoothing and NNM might interact, given that the noise targeted by NNM is also presumed to be spatially distributed and so may be influenced by any enhancement of the spatially distributed signal. To this end, we investigated the on-off diagonal difference (EDI) and the within-subject reliability at different levels of smoothing (with and without NNM). For this analysis we focused on the initial trio of ROIs. Results are depicted in Fig. 5 and LME modeling statistics are summarized in Table 5 .
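The EDI computation is not spelled out in this section; the sketch below assumes the usual exemplar-discriminability form, in which a full cross-split dissimilarity matrix is built between odd- and even-run patterns and the mean on-diagonal (same-condition) value is subtracted from the mean off-diagonal value:

```python
import numpy as np

def edi(patterns_odd, patterns_even):
    """On-off diagonal difference (exemplar discriminability index) sketch.

    patterns_odd, patterns_even: (conditions, voxels) mean patterns.
    Uses 1 - Pearson r; positive values indicate that a condition is more
    similar to itself across splits than to other conditions.
    """
    n = len(patterns_odd)
    c = np.corrcoef(patterns_odd, patterns_even)[:n, n:]  # cross-split block
    rdm = 1 - c
    off = rdm[~np.eye(n, dtype=bool)].mean()
    return off - np.diag(rdm).mean()
```

Because the on-diagonal terms index the replicability of individual condition patterns, the EDI serves as the per-condition reliability counterpart to the RDM-level split-half correlations reported elsewhere.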
For D1, the correlations were well captured by the LME model, with a significant main effect of NNM, but no effect of smoothing or interaction ( Table 5 A). There were significant pairwise differences in the mean EDI with no smoothing and at Level 1 ( Fig. 5 A). For D2, the correlations were well captured by the LME model, with a significant main effect of NNM and a significant interaction, but no effect of smoothing ( Table 5 A). There were significant pairwise differences in the mean EDI with no smoothing and at Level 1 ( Fig. 5 A). For D3, the correlations were well captured by the LME model and all main effects were significant ( Table 5 A). There were significant pairwise differences in the mean EDI at all three levels of smoothing ( Fig. 5 A). Notably, in all cases where there was a pairwise difference in mean EDI, this was because NNM tended to decrease the values of this index of on-diagonal reliability of individual RDMs, though this decrease tended to diminish with increased levels of smoothing.
Next, we evaluated the impact of NNM and smoothing on within-subject reliability. For D1, the correlations were well captured by the LME model; there were significant main effects for all three fixed effects (NNM, metric, and smoothing), and significant interactions between NNM and metric and between NNM and smoothing level ( Table 5 B). All pairwise tests were significant, with within-subject reliability appearing to decrease with greater levels of smoothing and NNM undoing this effect ( Fig. 5 B). For D2, the correlations were well captured by the LME model, and there were significant main effects for all three fixed effects ( Table 5 B). The interactions between NNM and metric and between NNM and smoothing level were also significant ( Table 5 B). Once smoothing was introduced, all pairwise comparisons were significant and the pattern observed for D1 was reversed: for D2, the baseline within-subject reliability was not influenced by smoothing, but consistently increased with level of smoothing when NNM was performed ( Fig. 5 B). Finally, for D3, the correlations were well captured by the LME model, with significant main effects for all three fixed effects ( Table 5 B). There were also significant interactions between NNM and metric and between NNM and smoothing. A significant decrease in mean within-subject reliability was only observed for the first two levels of smoothing for Cor ( Fig. 5 B). However, smoothing seemed to undo the negative impact of NNM.
In summary, although NN M and spatial smoothing are in principle motivated by similar considerations about the spatial distribution of noise and signal, respectively, these two factors tended to interact in different ways depending on the dataset in question. We return to the topic of the spatial properties of the residual variance-covariance matrices below.

GLMdenoise has a more consistently beneficial impact on the reliability of neural dissimilarity and RSA effect sizes than multivariate noise normalization

Charest et al. (2018) propose GLMdenoise as a method for improving signal estimates, and thus the reliability of neural RDMs. Their results suggest that, when preprocessing is carried out with GLMdenoise, superior results are obtained compared to NNM. Furthermore, they found that NNM could be improved by using the design matrix of GLMdenoise to estimate the residuals. We revisited both findings of Charest et al. (2018), again using the initial trio of ROIs. First, we compared the results obtained with Cor for within- and between-subject reliability of neural RDMs and RSA effect sizes with those obtained when employing GLMdenoise. Second, we compared the results obtained with NNM for within-subject reliability when using GLMdenoise design matrices relative to those from SPM. Since the difference afforded by GLMdenoise was of primary interest, we did not model the results with LME models and instead simply compared the pairwise relationships using t-tests. The baseline and NNM results in the accompanying figures are the same as plotted in Fig. 2 and are replotted below for visual comparison.

Fig. 6. GLMdenoise has a consistently more beneficial impact on neural RDM reliability and RSA effect sizes than multivariate noise normalization. (A) Bars indicate the group mean of within-subject split-half reliabilities when Cor is the dissimilarity metric, either with or without NNM, or with GLMdenoise (GD). Error bars are the standard error of the mean. (B) Bars indicate the group mean of the between-subject reliability or "noise ceiling". (C) Bars indicate the mean rank-order correlations between individual neural RDMs and target model RDMs for each dataset. All conventions are the same as in Fig. 1 A.
When it came to within-subject reliability, for D1 GLMdenoise did not significantly increase the mean within-subject reliability relative to the mean correlations obtained either with or without NNM ( Fig. 6 A). Since NNM did improve the within-subject reliability before ( Fig. 2 A), we cannot entirely rule out a positive impact of GLMdenoise and a false negative test result ( Nieuwenhuis, Forstmann, and Wagenmakers, 2011 ). For D2, and most notably for D3, the results were by comparison unambiguously positive: GLMdenoise significantly improved mean reliability compared to baseline, in contrast to NNM, which had no effect or made reliability worse ( Fig. 6 A). For both D1 and D2, GLMdenoise significantly improved the between-subject reliability of neural RDMs compared to baseline, and to a similar level as NNM. For D3, GLMdenoise again provided a substantive improvement in reliability relative to baseline, unlike NNM ( Fig. 6 B). Finally, for RSA effect sizes, the impact of GLMdenoise was mixed ( Fig. 6 C). For D1 the mean effect was not significantly higher than baseline or lower than what was obtained with NNM, though again, the difference between baseline and NNM was significant ( Fig. 2 C). For D2, the improvement afforded by GLMdenoise was significantly higher than baseline, and not significantly different from the mean effect obtained after NNM. Once more, and most notably, GLMdenoise significantly improved the effect sizes for D3 relative to both the mean effect obtained with NNM and baseline.
When NNM was carried out with the GLMdenoise design matrix ( Fig. 7 A), there was still a significant increase in within-subject reliability for D1 when Euc and Mal were used as metrics, but the improvement was not significantly different from the mean reliabilities when NNM was performed with the SPM design matrix. For D2, there were only improvements for Euc, but these were again not significantly different from those obtained using the SPM design matrices. Finally, for D3, reliability was not significantly higher than for the SPM design matrix, while there was still a significant drop relative to baseline when Cor was used as a metric. When compared to the results depicted in Fig. 6 , it is notable that for D2 and D3 there was no longer a substantial improvement when using Cor as a metric.
Given that the only alteration of note to the analysis was the expansion of the design matrix, these results suggest that the improvement provided by GLMdenoise reported above ( Fig. 6 ) may have less to do with the data-driven method for deriving noise regressors, and more to do with the fact that the analysis derives single estimates per condition due to concatenation of the experimental condition predictors across runs. For if the apparent benefit of GLMdenoise were due to superior noise estimates, then one would predict a similar (though perhaps diminished) improvement in reliability for D2 and D3 even when averaging across multiple runs when using Cor as the metric. The fact that the results for this measure differ between Fig. 5 and Fig. 6 A suggests that this may not be the case. To test this, we altered the SPM design matrices for the odd and even partitions so that they also concatenated across runs, resulting in single beta estimates per condition for each partition ( Fig. 6 B). While for D1 there was again no significant increase in within-subject reliability relative to baseline or the mean after GLMdenoise, there was a significant increase for D2 and D3 (though in the former case, less than what was obtained with GLMdenoise). Thus, the relative improvement observed with GLMdenoise compared to the results obtained with NNM may partially depend on differences in the structure of the design matrix. We note, though, that Charest et al. (2018) also concatenated their design matrices across runs, and still found a relative increase in within-subject reliability when using GLMdenoise.

The covariance structure derived from GLM residuals is related to the distance between voxels, but not goodness-of-fit
The results so far invite the question: what sort of structure do the residual variance-covariance matrices have, and how might it predict the impact of NNM? We carried out two sets of exploratory analyses as a step towards answering this question.
First, a fundamental assumption of NNM is that the noise information contained in the residuals of a GLM is spatially distributed, which suggests that the noise covariance may reflect the relative position of voxels inside a volume. Such a possibility is consistent with the high degree of between-run reliability, across datasets, in the residual covariance values for the initial trio of ROIs (Fig. 6A). For each subject we constructed matrices of the exponentially decaying distances between voxels and correlated them with the off-diagonal values of the run-specific variance-covariance matrices. The resulting coefficients were then averaged across runs. We found that for all three datasets the distance matrices, on average, correlated significantly with the covariance values (Fig. 8A). This was especially true for D3, where over a third of the variance was, on average, accounted for by the distance relations between voxels, though these correlations may be inflated by the spatial upsampling that was performed for that dataset. While in all cases the ROIs were bilateral, the distance between the hemispheric components was much greater for LOTC and TPJ, and as these ROIs were defined in terms of functional contrasts, they also tended to be asymmetric. Both properties can be seen in Fig. 8B, which depicts the mean variance-covariance matrix for a representative subject from D1-D3 and the corresponding decaying distance matrix. Especially for the subjects from D2 and D3, the spatial component of the covariance structure is apparent from visual inspection alone.
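The distance-covariance analysis just described can be sketched as follows (a minimal illustration on synthetic data; the voxel count, coordinate range, and unit decay constant are hypothetical choices, not the exact parameters used in our analysis):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

# Hypothetical inputs: voxel coordinates (n_vox x 3, in mm) and
# GLM residual time courses for one run (n_tr x n_vox)
n_vox = 50
coords = rng.uniform(0, 30, size=(n_vox, 3))
resid = rng.standard_normal((200, n_vox))

# Run-specific residual variance-covariance matrix
cov = np.cov(resid, rowvar=False)

# Exponentially decaying distance matrix: exp(-d_ij)
decay = np.exp(-squareform(pdist(coords)))

# Correlate the off-diagonal entries of the two matrices
iu = np.triu_indices(n_vox, k=1)
r, _ = pearsonr(decay[iu], cov[iu])
```

In the actual analysis the resulting coefficient r would be computed per run and averaged; squaring it gives the proportion of covariance variance accounted for by inter-voxel distance.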
We next considered how the correspondence between inter-voxel distance and the covariance of residuals might relate to any improvement in within-subject reliability resulting from the application of NNM. We focused on the results obtained with Euc, since NNM produced the most substantial improvement when Euc was used as a metric (Fig. 1A). For each subject we calculated the change in within-subject reliability after NNM by subtracting the baseline correlation values. As anticipated by the results depicted in Fig. 2A and 8A, when data were pooled across datasets the change in reliability due to NNM correlated negatively with the correspondence between voxel distance and covariance (Fig. 8C). We also correlated the changes in within-subject reliability with the voxel/condition ratio for each subject, which showed a positive correlation (Fig. 8C).
Second, the structure of the residual variance-covariance matrices depends on how well the constructed GLMs account for the signal fluctuations of each voxel. Thus, it is possible that the impact of NNM may be related to the goodness-of-fit (GoF) of the voxels in an ROI. For each of the initial ROIs, we determined the proportion of voxels for which the explained variance of the BOLD signal was greater than the unexplained variance (SNRmb > 1). We also constructed dissimilarity values based on the absolute difference in GoF between voxels, which were correlated with the run-averaged residual variance-covariance matrix of each subject.
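These GoF quantities can be sketched in a few lines (again a minimal illustration on synthetic data; the design matrix and data sizes are hypothetical, and the variance-ratio definition of SNRmb is the one stated above):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
n_tr, n_vox = 200, 40
X = rng.standard_normal((n_tr, 6))        # toy design matrix
Y = rng.standard_normal((n_tr, n_vox))    # toy BOLD time courses

beta = np.linalg.lstsq(X, Y, rcond=None)[0]
fitted = X @ beta
resid = Y - fitted

# GoF per voxel: explained variance relative to unexplained variance
snr = fitted.var(axis=0) / resid.var(axis=0)
prop_good = np.mean(snr > 1)              # proportion of voxels with SNRmb > 1

# GoF dissimilarity |snr_i - snr_j|, correlated with residual covariance
gof_dis = np.abs(snr[:, None] - snr[None, :])
cov = np.cov(resid, rowvar=False)
iu = np.triu_indices(n_vox, k=1)
r, _ = pearsonr(gof_dis[iu], cov[iu])
```

In the actual analysis, cov would be the run-averaged residual variance-covariance matrix for each subject.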
For all three datasets, the proportion of voxels with greater explained BOLD signal variance was < 0.5, indicating that for the majority of voxels there was greater unexplained variance in the signal (Fig. 8D). However, only for D1 was there a significant positive correlation between the GoF dissimilarities and the pairwise residual covariance between voxels (Fig. 8D). Nor did the proportion of voxels with SNRmb > 1 predict individual variation in the change in within-subject reliability afforded by NNM when pooling across datasets (Fig. 8E). However, there was a positive correlation between changes in within-subject reliability due to NNM and the correlations between the variance-covariance matrices and the SNRmb dissimilarity matrices.
Although we again emphasize that these analyses were exploratory, the preliminary results are nonetheless instructive. First, they confirm that part of the covariance structure is well captured by the relative position of voxels within a volume. Second, they suggest that these spatial relations may be predictive of the influence of NNM on within-subject reliability. Third, they also suggest that the relationship between the number of voxels and conditions, or features and samples, which influences the invertibility of the variance-covariance matrices, may be predictive of whether NNM has a beneficial effect. In particular, having more features per sample may partially account for whether NNM improves reliability. Finally, GoF was not a good predictor of either the covariance structure between voxels or the impact of NNM on within-subject reliability. Based on these results, future research might more systematically investigate how the covariance structure derived from residuals varies across datasets and across ROIs within datasets.

Discussion
NNM has been billed as a useful method for improving the within-subject reliability of neural RDMs irrespective of the choice of dissimilarity metric (Guggenmos et al. 2018; Kriegeskorte and Diedrichsen, 2019; Nili et al. 2020; Walther et al. 2016). At the same time, other results suggest that it may be less beneficial than originally proposed, and even inferior to GLMdenoise when it comes to boosting RDM reliability (Charest et al. 2018). We revisited this issue with three of our own datasets. We also evaluated the impact of NNM on between-subject reliability and RSA effect sizes, the influence of NNM across ROIs, the influence of spatial smoothing when carrying out NNM, and finally we compared the results to those generated when using GLMdenoise. The results of our investigation were mixed. Our findings have implications for: (i) whether NNM should be prescribed to boost reliability when performing RSA; (ii) whether developing pipeline tools like NNM is well-motivated in the first place; and (iii) how the choice of whether to use NNM may depend on differing methodological and theoretical motivations for using RSA.

Multivariate noise normalization and principles of beneficence and non-maleficence in data processing
NNM comes strongly recommended. Walther et al. (2016, p. 197) state that: "Activation patterns (usually formed by regression coefficients) should be subjected to multivariate noise normalization to improve RDM reliability, regardless of dissimilarity measure." Similarly, Guggenmos et al. (2018, p. 444) offer that for time-series RSA: "multivariate noise normalization is a highly recommended preprocessing step irrespective of other analytic choices." Do our results support these prescriptions?
Before answering this question, it is worth highlighting that any prescribed intervention for data analysis should meet two normative principles, which mirror those used to evaluate the ethical implications of biomedical research (National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research, 1979). The first is one of beneficence: does the intervention tend to improve the signal estimate for the purpose of detecting the effects of interest? Clearly, promotion of the effectiveness of NNM has largely centered on whether the method is beneficial in some way. However, equally important is the principle of non-maleficence: does the intervention tend not to worsen the signal estimate for the purpose of detecting the effects of interest? In other words, with data as with health: first, do no harm.
We believe our results call into question whether NNM satisfies either principle, and therefore do not support the above prescriptions. Consider beneficence first. The original finding of Walther et al. (2016) was that NNM tended to improve within-subject reliability; however, it is unclear that such a finding provides sufficient evidence of a benefit for RSA. First, they did not show that it had any discernible influence on between-subject reliability or RSA effect sizes. Second, they did not show that it had a consistent benefit across multiple ROIs within a dataset, even though NNM would presumably be applied uniformly when a study considers multiple regions throughout the brain. When we carried out these further analyses, across just a few additional ROIs, we found that NNM was only consistently beneficial when using Euc and Cor as metrics. It has been suggested that Cla is a less desirable metric because it is discrete (Guggenmos et al. 2018; Walther et al. 2016). In line with previous findings, we found that use of Cla resulted in the lowest mean reliabilities and effect sizes even when NNM was applied. Reliabilities and effect sizes were also not consistently improved when using its continuous cousin Mal as a metric. This latter result is significant when compared to those obtained when using Euc paired with NNM, which is closest to the "crossnobis" distance that has been promoted as the preferred metric for improving reliability (Guggenmos et al. 2018; Nili et al. 2020; Walther et al. 2016). Indeed, we found little difference in between-subject reliability and effect sizes when using this approach vs. Mal without NNM. Thus, our results do not support the claim that NNM is of clear benefit regardless of metric.
Moreover, these varied positive results were only observed for D1 and D2. When it comes to non-maleficence, the negative impact of NNM across analyses for D3 was perhaps our most consistent finding. Only when using Euc and Mal as metrics to construct RDMs for BA17 did we see any positive impact of NNM for this dataset. We note that this outcome may not be entirely surprising. The fundamental assumption of NNM is that the noise information in the residuals of the GLM has a spatial component, which itself may be connected to topographic organization in a region. Our exploratory analysis showed that the structure of the covariance values for bilateral TPJ was especially well captured by the distance between voxels within the ROI (Fig. 8A). And yet, NNM consistently reduced the reliabilities and effect sizes observed for this region. This is unlikely to be solely a result of the hemispheric components of the bilateral ROI being far apart, since the same was true of the LOTC ROI for D2. Instead, for D3 the spatial structure did not appear to contain information that was meaningful for improving BOLD signal estimation. When stimuli are videos, any topographic response to an image frame will be combined with the responses to all of the other video frames, and so the noise may have a more complex spatiotemporal structure. It may therefore be that NNM is less suitable when dynamic stimuli are used, in part because standard GLMs do not capture the dynamic nature of video stimuli either. However, our results for D2 (SPL) also raise the question of whether NNM is suitable at all once one leaves the cortical realm of sensorimotor regions. Thus, it may also be that NNM is not suitable for all regions one may investigate. In that case, if NNM is intended to be applied uniformly across ROIs, then it does not satisfy the principle of non-maleficence, since there is the risk that in some regions one may simply be multiplying beta estimates with spatially unstructured noise.
More cautiously, our results suggest that further analytic trials would be required to evaluate the effectiveness of NNM and to determine when and where it should be utilized. In contrast, the results obtained with GLMdenoise were more reliably beneficial. Indeed, the clearest positive impact of the approach was seen for D3. Thus, our findings may be more consistent with the recommendation of Charest, Kriegeskorte, and Kay (2018) to use this boutique approach to first-level analysis prior to carrying out RSA. However, a number of considerations suggest that further analysis is still necessary here. First, we found that the positive influence of GLMdenoise may have more to do with the single beta estimates generated by the analysis than with the more complex procedure for choosing noise regressors, as similar results are achievable simply by restructuring standard design matrices so that only a single beta estimate is produced. Second, GLMdenoise is not compatible with cross-validated metrics even if one reorganizes the design matrix, since the cross-validation steps these metrics depend on require independence between runs. As an alternative to Cor, GLMdenoise would also be compatible with non-cross-validated forms of either Mal or Euc as metrics (Guggenmos et al. 2018; Ritchie et al. 2020). However, the fact would remain that, if one believes cross-validation provides more accurate estimates of dissimilarity relations (Bobadilla-Suarez et al. 2019; Walther et al. 2016), then GLMdenoise is not a viable analytic option. We note that leveraging other methods, such as Bayesian RSA (Cai et al. 2019), may also help to further clarify the relationship between GLMdenoise and NNM, and more general concerns about how noise in BOLD signals might be estimated and exploited in the service of carrying out RSA.

Questioning the motivation for multivariate noise normalization
Our results provide reason to doubt whether NNM is as widely applicable as has been proposed. However, we believe that they also point to more fundamental issues with NNM, which call into question the motivation for introducing, and promoting, such new analytical tonics as a curative for the ills of noisy data.
The first issue is that it remains not entirely clear when and why NNM works. The underlying assumption is that the noise structure contained in the residuals has a spatial component that can be leveraged to improve the estimate of condition-specific variation in the BOLD signal. There are parts of the brain where such an assumption seems eminently plausible, such as early visual or motor cortex, and indeed in such regions NNM seems to produce the best results. Our preliminary results also suggest that some of the structure of the variance-covariance matrices central to NNM simply reflects the distance between voxels. However, the reality is that without further analysis the form of the noise remains unknown, and where there is no spatial component to be found, one is again simply multiplying signal estimates with unstructured noise. In such cases, NNM may have all the benefit of bloodletting. The risk posed by the unknown is not unique to NNM. In recommending GLMdenoise as a first-level analysis for RSA, Charest et al. acknowledge that, since GLMdenoise is data-driven, the noise that is removed by the approach is left unspecified unless further analyses are undertaken. They further acknowledge that the noise may vary across experiments and participants. The risk with NNM is that, since it is not data-driven, it may vary even across areas and stimulus types. Furthermore, whereas GLMdenoise is part of an open-source Matlab toolbox, the application of NNM is not as yet standardized, though one implementation can be found in the Decoding Toolbox (Hebart et al. 2015).
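To make concrete what "multiplying signal estimates" amounts to, NNM can be sketched as whitening the condition patterns by the inverse square root of a shrinkage-regularized residual covariance (a minimal illustration on synthetic data; the shrinkage value and data sizes are hypothetical, and published implementations typically use a data-driven shrinkage estimate rather than a fixed constant):

```python
import numpy as np

def mnn_whiten(betas, residuals, shrink=0.4):
    """Whiten condition patterns (n_cond x n_vox) by the inverse
    square root of a shrinkage-regularized residual covariance."""
    cov = np.cov(residuals, rowvar=False)
    # Shrink toward the diagonal so the matrix is well-conditioned
    cov = (1 - shrink) * cov + shrink * np.diag(np.diag(cov))
    # Inverse matrix square root via eigendecomposition
    vals, vecs = np.linalg.eigh(cov)
    w = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return betas @ w

rng = np.random.default_rng(3)
betas = rng.standard_normal((10, 25))      # 10 conditions, 25 voxels
resid = rng.standard_normal((300, 25))     # GLM residual time courses
betas_w = mnn_whiten(betas, resid)
```

If the residual covariance carries no meaningful spatial structure, the transformation w simply remixes the patterns with noise, which is the concern raised above.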
The fact that NNM may inject even more uncertainty into how we interpret neuroimaging results is problematic when we consider the trend towards more transparent, replicable processing pipelines (Esteban et al. 2019). This point is well illustrated by the results of Botvinik-Nezer et al. (2020), who investigated the effect of pipeline flexibility on neuroimaging findings. In their study, 70 teams analyzed the same neuroimaging dataset to test the same collection of hypotheses. The variation in analysis approaches, and in the subsequent results, was striking: no two teams used identical pipelines, and even where the statistical brain maps were correlated at some stages of the processing pipeline, the reported significant results still differed. Of the nine hypotheses tested, there was near consensus on a negative result for three of them (< 10% of teams found an effect), while there was majority agreement on a significant result for only one of them (> 80% of teams reported an effect). For the remaining five, only ~20-40% of teams reported significant results. Now imagine a similar study where RSA was the method of choice. In such a case, NNM would introduce yet another degree of freedom into the choice of pipeline, while the appropriateness of its application would remain uncertain.
The preceding hypothetical helps to motivate the second issue, which is whether NNM is intended to be a method that impacts what effects are detected. On the one hand, the cautious response may be to say that it is simply intended as a method to improve within-subject reliability. However, if NNM has no material impact on the explainable variance, then we have little motivation to apply the analysis at all. But if the goal is to increase the explainable variance, then this is itself only of interest because it might make a difference to what effects are observed. In which case, the utility of NNM is, after all, that it might change what effects are observed. On the other hand, if the goal of NNM as an intervention is to influence what effects are found, then the known unknowns about NNM are again a source of concern. In our analysis we only used a single model RDM to assess the impact of NNM on RSA effect sizes. However, it is typical to test multiple model RDMs, with models tested against the null hypothesis, against each other, or jointly used to model dissimilarity values using multiple regression or "variance partitioning" variants such as commonality analysis (Groen et al. 2018; Hebart et al. 2018; Newton and Spurrell, 1967). Therefore, it is as yet unknown whether carrying out NNM might boost some effects at the expense of others.
This second issue does not militate against the utility of NNM, but we believe it is worth highlighting for researchers deliberating on whether to apply the method. A different perspective is to favor carrying out multiple analyses of a single dataset to show that some target result is robust across all of them, in order to improve methodological transparency. For example, Steegen et al. (2016) propose a "multiverse" approach in which one tests for results in all datasets generated across all possible combinations of data processing choices. In one application of this approach, Moors and Hesselmann (2019) found that only 14% of pipelines revealed apparent effects when analyzing a dataset for evidence of unconscious arithmetic. Applying the same logic to pipelines for RSA, NNM is again one more degree of freedom that one might implement. But whether a desired effect is observed should not depend on whether NNM is carried out; rather, the effect should be robust across many data processing choices. If one follows this alternative approach, then one can again wonder whether prescriptions like NNM are even in principle well motivated.

Distinguishing "modest" vs. "ambitious" applications of representational similarity analysis
Despite being relatively critical of both the prescriptions to use NNM and its underlying motivation, we are nonetheless reluctant to recommend against using NNM. In their discussion, Walther et al. suggest that the choice of dissimilarity metric may depend on the question one is asking: is the goal to determine how discriminable patterns are, or their dissimilarity regardless of its shape? In the former case it may be more desirable to use a distance metric like Mal; in the latter, Cor. Similarly, we believe that whether or not one should use NNM may depend on the research question and the precise way in which RSA is being employed. NNM has been introduced as a salve for improving signal estimates and thus the reliability of neural RDMs. As we have seen, however, its effectiveness is plausibly restricted by whether we have prior reason to believe that the noise contained in the residuals of a GLM is spatially structured. Whether or not this is the case is more akin to an issue of measurement than of reliability. As pointed out by Bobadilla-Suarez et al. (2019), when it comes to RSA and our choice of dissimilarity metric, the issue of measurement is not the same as that of RDM reliability. Different dissimilarity metrics differ not simply in how reliable their estimates are, but also in what kinds of relationships they estimate. Depending on the sort of similarity structure one hypothesizes to be latent in a brain region, different metrics will be appropriate (Ramirez, 2017). This consideration points to a more fundamental distinction: theories are not built directly upon the backs of data; rather, our analyses of data are revealing of the phenomena that we wish to explain (Bogen and Woodward, 1988; Woodward, 2011). This distinction, emphasized in the philosophy of science, is also germane to neuroimaging methods like RSA (Carlson et al. 2018). NNM is not simply a method for improving reliability, but depends on assumptions about the form of the signal being detected and its spatial extent. So, whether NNM is applicable will, as with the choice of metric, depend on prior assumptions about the signal being measured and the hypotheses being tested.
Here we believe a distinction between two types of applications of RSA may help researchers decide whether to utilize NNM. As we emphasized at the outset, RSA has both methodological and theoretical virtues. Many uses of RSA clearly aim to take advantage of the former: comparing neural RDMs from multiple brain regions to those derived from computational models or behavior. In such cases we suspect that researchers are content to use Cor as a measure, since whether dissimilarity values solely reflect pattern discriminability, or partially depend on mean signal amplitude, may not be of interest. Instead, what is of chief importance is that RSA allows for direct comparison of many different data types where condition-rich designs are used. In such "modest" applications of RSA, where the topographic structure of a large number of ROIs is likely unknown, we believe NNM may do more harm than good, and GLMdenoise may be more appropriate for boosting signal estimates. Insofar as we suspect that most studies using RSA aim for such modesty, NNM may therefore have a limited scope of application. In other cases, however, the use of RSA is driven more by its theoretical benefits. For example, studies that use RDMs to construct a low-dimensional representation of activity patterns in order to directly compare models of behavior may depend on very particular assumptions about both dissimilarity relations and the structure of the neural population code in a region (Davis, Love, and Preston, 2012; Op de Beeck, Wagemans, and Vogels, 2001; Ramírez, 2018; Ritchie and Op de Beeck, 2019). In such "ambitious" applications of RSA, NNM may indeed be useful, insofar as the design of such studies attends not just to the reliability of neural RDMs, but also to precise hypotheses about the structure of the activation space being measured.
Another case where NNM may be useful is studies where within-subject reliability is of particular importance. For example, several studies have found that individual differences in the structure of neural RDMs are reliable and meaningful (Charest et al. 2014; Feilong et al. 2018). For such applications NNM may be particularly well suited, depending on the experimental conditions and ROIs.
In summary, the broad prescriptions supporting the use of NNM suggest it is appropriate for modest uses of RSA. We believe our results show that such a recommendation is not yet supported. More positively, however, we believe that NNM may be applicable when the underlying signal, and accompanying noise, have a known structure and this fact is being leveraged in more ambitious uses of RSA.

Conclusion and future directions
NNM has been proposed as a method for boosting the reliability of neural RDMs when carrying out RSA. As we have seen, the method in fact produces mixed results, with differences between datasets that are related to the stimuli and the choice of regions. What factors account for these differing results remains unclear, even though the datasets and analyses considered here already explore some obvious candidates. Further research may clarify when NNM is an appropriate analysis step. Empirically, we believe promising directions for such research include larger numbers of datasets and comparisons of randomly selected regions of cortex, as could be achieved with searchlight methods (Etzel et al. 2013). Alternatively, important insights might be gained from further simulation studies based on brain areas where the spatiotemporal properties of the BOLD signal (and noise) are well understood and can be compared to neural data. At the same time, such modeling approaches can only take us so far, given that the spatiotemporal profile of the BOLD signal is likely heterogeneous throughout the cortex. Were such profiles well understood, they would likely obviate the need for a method like NNM in the first place. Furthermore, as we have argued, the very motivation for NNM and its prescriptive nature can also be questioned.

Data and code availability statement
The (raw) fMRI data of all three datasets are available via the Open Science Framework (D1: https://osf.io/b9svy/; D2: https://osf.io/qp54f/; D3: https://osf.io/hpwjx/), as well as code for running multivariate noise normalization on a token subject from D1 (https://osf.io/hnq6m/). Further code for running the analyses reported here will be made available by the first author upon request.