Insight and inference for DVARS

Estimates of functional connectivity using resting state functional Magnetic Resonance Imaging (rs-fMRI) are acutely sensitive to artifacts and large scale nuisance variation. As a result much effort is dedicated to preprocessing rs-fMRI data and using diagnostic measures to identify bad scans. One such diagnostic measure is DVARS, the spatial root mean square of the data after temporal differencing. A limitation of DVARS however is the lack of concrete interpretation of the absolute values of DVARS, and finding a threshold to distinguish bad scans from good. In this work we describe a sum of squares decomposition of the entire 4D dataset that shows DVARS to be just one of three sources of variation we refer to as D-var (closely linked to DVARS), S-var and E-var. D-var and S-var partition the sum of squares at adjacent time points, while E-var accounts for edge effects; each can be used to make spatial and temporal summary diagnostic measures. Extending the partitioning to global (and non-global) signal leads to a rs-fMRI DSE table, which decomposes the total and global variability into fast (D-var), slow (S-var) and edge (E-var) components. We find expected values for each component under nominal models, showing how D-var (and thus DVARS) scales with overall variability and is diminished by temporal autocorrelation. Finally we propose a null sampling distribution for DVARS-squared and robust methods to estimate this null model, allowing computation of DVARS p-values. We propose that these diagnostic time series, images, p-values and DSE table will provide a succinct summary of the quality of a rs-fMRI dataset that will support comparisons of datasets over preprocessing steps and between subjects.


Introduction
Functional connectivity obtained with resting state functional magnetic resonance imaging (rs-fMRI) is typically computed by correlation coefficients between different brain regions, or with a multivariate decomposition like Independent Components Analysis (Cole et al., 2010). Both approaches can be corrupted by artifacts due to head motion or physiological effects, and much effort is dedicated to preprocessing rs-fMRI data and using diagnostic measure to identify bad scans. Smyser et al. (2011) proposed and Power et al. (2012) popularized a measure to characterize the quality of fMRI data, an image-wide summary that produces a time series that can detect problem scans. They called their measure DVARS, defined as the spatial standard deviation of successive difference images. In fact, DVARS can be linked to old statistical methods developed to estimate noise variance in the presence of drift (see Appendix A for DVARS history).
While DVARS appears to perform well at the task of detecting bad scansbad pairs of scansit does not have any absolute units nor a reference null distribution from which to obtain p-values. In particular, the typical "good" values of DVARS varies over sites and protocols which makes it difficult to create comparable summaries of data quality across data sets. The emergence of the large scale data sets such as the Human Connectome Project (HCP) (>1k subjects) and the UK Biobank (>10k subjects) further motivates the need for automated, yet reliable, quantitative techniques to control and improve the data quality.
The purpose of this work is to provide a formal description of DVARS as part of a sum of squares (SS) decomposition of the data, propose more interpretable standardized versions of DVARS, and compute DVARS pvalues for a null hypothesis of homogeneity.
The remainder of this work is organized as follows. We first describe the sum of squares decomposition for the 4D data and how this relates to traditional DVARS, and other new diagnostic measures it suggests. Then we describe a sampling distribution for DVARS under the null hypothesis, and mechanisms for estimating the parameters of this null distribution. We establish the validity and sensitivity of the DVARS test with simulations, and use two different fRMI cohorts to demonstrate how both the DVARS test and our'DSE' decomposition are useful to identify problem subjects and diagnose the source of artifacts within individual subjects.

Theory
Here we state our results concisely relegating full derivations to Appendices.

Notation
For T time-points and I voxels, let the original raw rs-fMRI data at voxel i and t be Y R it . Denote the mean at voxel i as M R i ¼ 1 T P t Y R it , and by m R some type of overall mean value (i.e. a summary of the mean image fM R i g, like median or mean). We take as our starting point for all calculations the centered and scaled data: The scaling ensures that typical brain values before centering are around 100 and are comparable across datasets.

DSE decomposition
Denote the total ("all") variability at scan t as (2) Fig. 1. Illustration of the DSE decomposition, where A t (green) is the total sum-of-squares at each scan, D t (blue) is the sum-of-squares of the half difference of adjacent scans, S t (yellow) is the sum-of-squares of the average of adjacent scans, and E t is the edge sum-of-squares at times 1 and T; ffiffiffiffiffi D t p is proportional to DVARS. The D and S components for index t (D t and S t ) sum to A averaged between t and t þ 1 (ðA t þ A tþ1 Þ=2). Note how the S and D time series allow insight to the behavior of the total sum-of-squares: The excursion of A around t ¼ 2; 3 arise from fast DSE component while the rise for t ! 6 is due to the slow component. For perfectly clean, i.e. independent data, D and S will converge and each explain approximately half of A. Table 1 Expressions that make up the time series visualization of the DSE SS-decomposition. A-var is to the total SS at time point t, D-var, S-var and E-var correspond to the fast, slow and edge variability terms. Global and non-global variance components sum to the total components. All of these terms, given as mean squared quantities, are best reported and plotted in root mean squared (RMS) units (see Appendix B for more on plotting global SS).

Name
Notation Value Range X-axis Loc.
A-var At Non-Global A-var ANt 1 I P i ðYit À Yt Þ 2 t ¼ 1; …; T t Non-Global D-var DNt 1 I P i ðYit À Yi;tþ1 À ðYt À Ytþ1ÞÞ 2 =4 t ¼ 1; …; T À 1 t þ 1 2 Non-Global S-var SNt 1 I P i ðYit þ Yi;tþ1 À ðYt þ Ytþ1ÞÞ 2 =4 t ¼ 1; …; T À 1 t þ 1 2 Non-Global E-var ENt We interchangeably use the terms variability and sum of squares, not to be confused with variance. Define two mean squared terms, one for fast ("differenced") variability the half difference between time t and t þ 1 at each voxel, squared and averaged over space, and one for slow variability the average between t and t þ 1 at each voxel, squared and averaged over space.
We then have the following decomposition of the average SS at time points t and t þ 1, A t;tþ1 ¼ ðA t þ A tþ1 Þ=2.
for t ¼ 1; …; T À 1. This has a particularly intuitive graphical interpretation: If we plot D t and S t at t þ 1=2, they sum to the midpoint between A t and A tþ1 found at t þ 1=2 (see Fig. 1). Since we now have a concrete interpretation for DVARS, with DVARS 2 t =4 being the "fast" DSE component in the average SS at t and t þ 1.
This also leads to a decomposition of the total sum of squares over all scans: With averages A, D, S and E defined in Table S1 (row 1) we have the following "DSE" decomposition That is, the total variability ("A-var") in the 4D dataset is the sum of terms attributable to fast ("D-var"), slow ("S-var") and edge variability ("E-var"). D is also 1=4 the average mean squared difference (MSSD; see Appendix A). Each term in the "DSE" decomposition can be split into global and non-global components, as shown in Table S1, rows 2-3 (as also noted by Burgess et al. (2016) for D t ).
Elements of the DSE decomposition can be visualized as time series (see Table 1) or as images. For example, just as a variability image with it =T is useful, we find that a D-var image, D i ¼ offer more informative views of the noise structure.

DSE table & reference values
We see the arrangement of DSE values in Table S1 as a variant of an Analysis of Variance (ANOVA) table that summarizes contributions from fast, slow, end, global and non-global components to the total mean-squares in a 4D dataset. Traditionally ANOVA tables use sum-ofsquares to partition variability, but we instead focus on root mean squared (RMS) or mean squared (MS) values to leverage intuition on typical noise standard deviation (or variance) of resting state fMRI data. To avoid any confusion, we call this variant 'DSE table'.
We calculate expected values for each of the DSE values for artifactfree data using different null models. In Appendix D we detail the most arbitrary version of this model, based only on time-constant spatial ; the upper two panels (A & B) use variance based on hIQR with d ¼ 1, the lower two panels (C & D) use hIQR with d ¼ 3. P-P plots and histograms of Z scores show that only use ofμ DVARS 0 gives reliable inferences, and that the power transformation parameter d seems to have little effect.
covariance, Σ S . Another model is based on time-space-separable correlation; this noise model assumes data with arbitrary spatial covariance Σ S but a common temporal autocorrelation for all voxels with a constant lag-1 autocorrelation ρ. While this is a less restrictive time series model than AR(1), in practice temporal autocorrelation varies widely over space, and we stress we only consider this as a working model to gain intuition on the DSE table. (Our null model for DVARS p-values, below, does not assume time-space separability). We also consider the idealized model of "perfect" data with completely independent and identically distributed (IID) 4D data. Power of the DVARS hypothesis test to detect artifactual spikes. Plots show sensitivity (% true spikes detected) versus number of true spikes as a percentage of time series length T, for varying degrees of temporal autocorrelations (line color). Different T (rows) and degree of spatial variance heterogeneity (columns) are considered. These results show hat power increases with autocorrelation but falls with increasing prevalance of spikes; for up to 10% spikes we have excellent power, and for 20% spikes we have satisfactory power (60-90% sensitivity). the average voxel-wise variance σ 2 , and as temporal autocorrelation ρ increases D-var shrinks and S-var grows. The global components are seen to depend on σ 2 , the average of the I 2 elements of Σ S . This indicates, intuitively, that the greater the spatial structure in the data the more variability that is explained by the global.
The next pair of rows in Table 2 show the expected MS values normalized to the expected A-var term. The A-var-normalized D-var and S-var diverge from 1/2 exactly depending on ρ, specifically S À D ¼ ρðT À 1Þ=T. The global terms here depend on the ratio of average spatial covariance and average variance, σ 2 =σ 2 .
The final pair of rows shows expected values under the most  Table 3; Δ%D-var is marked with a practical significance threshold of 5%, and ZðDVARSÞ with the one-sided level 5% Bonferroni significance threshold for 1200 scans. Vertical grey stripes mark scans that only attain statistical significance, while orange stripes mark those with both statistical and practical significance. The bottom two plots show the 4 DSE components, total A t (green), fast D t (blue), S t slow (yellow), and edge E t (purple), for minimally preprocessed (upper) and fully preprocessed (lower) data. For minimally preprocessed data D-var is about 25% of A-var (see right axis), far below S-var. For fully preprocessed data D-var and S-var converge to 50%A-var. The handy result on the S À D approximating ρ generalizes beyond the time-space-separable model: For an arbitrary model, both S À D and S t À D t normalized to A estimate a weighted average of the lag-1 temporal autocorrelations (see Appendix D.8). Hence, the convergence of Dvar and S-var we observe as data is cleaned up has the specific interpretation of reduction in the average lag-1 autocorrelation.
These reference models provide a means to provide DSE values in three useful forms. For each A-var, D-var, S-var and E-var term we present: 1. RMS, the square root of the mean squared quantity, 2. %A-var, a variability as a percentage of total mean-square A, and 3. Relative IID, A-var-normalized values in ratio to nominal IID values.
noting that we normalize to A and not A G . We note that the fast and slow components can be defined as responses of linear time-invariant filters. The slow component corresponds to an integrator filter with power transfer function HSðωÞj 2 ¼ 2ð1 þ cosðωΔTÞÞ and the fast component corresponds to a differentiator filter with power transfer function HDðωÞj 2 ¼ 2ð1 À cosðωΔTÞÞ, where ω is angular frequency and ΔT is the repetition time (TR). In other words, in time domain, S t can be interpreted as average of convolved BOLD signals Table 3 Form and interpretation of various DVARS variants, expressed as functions of original DVARSt . Here fYit g are the 4D data, A is the overall mean square variance, μ 0 is the expected DVARS 2 t under a null model, PðDVARSt Þ is the p-value for DVARS 2 , and Φ À1 is the inverse cumulative distribution function of a normal. p DVARS as a multiple of null mean Z(D-var) Φ À1 ð1 À PðDVARSt ÞÞ DVARS p-value as Z-score Table 4 Spatial effective degrees of freedom (EDF) for HCP subject 115320. As more spatial structure is removed with preprocessing, spatial EDF rises, but never to more than 5% of the actual number of voxels.

Inference for DVARS
We seek a significance test for the null hypothesis where μ 0 is the mean under artifact-free conditions. Note this is equivalent to a null of homogeneity for DVARS t or D t . If we further assume that the null data are normally distributed, we can create a χ 2 test statistic approximately following a χ 2 ν distribution with ν ¼ 2b μ 2 0 =b σ 2 0 degrees of freedom, where σ 2 0 is the null variance (see Appendix E).
What remains is finding estimates of μ 0 and σ 2 0 . The null mean of DVARS t is the average differenced data variance, where σ 2 Di is the variance of the differenced time series at voxel i. To avoid sensitivity to outliers, we robustly estimate each σ 2 where IQR 0 ¼ ðΦ À1 ð0:75Þ À Φ À1 ð0:25ÞÞ % 1:349 is the IQR of a standard normal, and Φ À1 ð⋅Þ is the inverse cumulative distribution function of the standard normal. Below we evaluate alternate estimates of μ 0 , including the median of fb σ 2 Di g and directly as the median of fDVARS 2 t g. The variance of DVARS 2 t unfortunately depends on the full spatial covariance, and thus we're left to robustly estimating sample variance of fDVARS 2 t g directly. We consider several estimates based on IQR and evaluate each with simulations below. Since the IQR-to-standard deviation ratio depends on a normality assumption, and we consider various power transformations before IQR-based variance estimation (see Appendix F). We also consider a "half IQR" estimate of variance where hIQR is the difference between the median and first quartile, and hIQR 0 ¼ IQR 0 =2. This provides additional robustness against contamination of the variance estimate from upward spikes. Finally, the XðDVARS t Þ values can be converted to p-values PðDVARS t Þ with reference to a χ 2 ν distribution, and subsequently converted into equivalent Z scores, Note that for extremely large values of DVARS t numerical underflow will result in p-values of zero; in such cases an approximate Z score can be obtained directly as ZðDVARS t Þ ¼ ðDVARS 2 t À μ 0 Þ=σ 0 .  Fig. 7. Cleaning has brought S t slow variability into line with D t fast variability, each explaining about 50% of total sum-of-squares. While some scans are still flagged as significant, %D-var (D as a % of A-var, right y axis) never rises above about 55%, indicating Δ%D-vars of 5% or less lack of practical significance.
Under complete spatial independence the degrees of freedom will equal the number of voxels I, and so ν can be thought of an effective number of spatial elements; large scale structure will decrease ν while larger ν should be found with cleaner data. Though we caution that estimates of ν will be very sensitive to the particular estimators used for μ 0 and σ 2 0 .

Standardized DVARS
For intra-cohort investigation of corruptions, we propose that our Dvar time series, D t ¼ DVARS 2 t =4, is a more interpretable variant of DVARS, as it represents a particular "fast" portion of noise variance, and when added to "slow" mean-square, S t , gives the total mean-square of the 4D data A t;tþ1 . However, these components are not suitable for intercohort comparisons, as the variance characteristics may vary with acquisition or scanner differences. In this section we propose a set of transformations which makes the inter-cohort comparison of the DSE components (including DVARS) possible.
First consider the percent D-var explained at a single time point. Eqn. (5) could be used to find, in sums-of-squares units, the percent variability attributable to D-var at t, t þ 1: 100: However, problem scans can inflate A t and could mask spikes. Hence we instead propose to replace A t;tþ1 with its average A and compute percent D-var at time t as This has the merit of being interpretable across datasets, regardless of total sum of squares. This is just percent normalization to A as discussed above.
While %D-var can be more interpretable than unnormalized D-var, its overall mean is still influenced by the temporal autocorrelation. For example, if %D-var is overall around 30% and at one point there is a spike up to 50%, what is interesting is the 20 percentage point change, not 30% or 50% individually. Hence another useful alternative is change in percent D-var from baseline Δ%D À var : interpretable as the excess fast variability as a percentage of average sum of squares. Later in Section 4.2.1, we show how Δ%D-var is used as measure of "practical significance" to complement DVARS p-values. We previously have proposed scaling DVARS relative to its null mean (Nichols, 2013), (While we had called this "Standardized DVARS", a better label is "Relative DVARS.") This gives a positive quantity that is near 1 for good scans and substantially larger than one for bad ones. However, there is no special interpretation "how large" as the units (multiples of μ À1=2 0 ) are arbitrary; as noted above, DVARS falls with increased temporal correlation, making the comparison of these values between datasets difficult.
Finally the Z-score ZðDVARS t Þ or Àlog 10 PðDVARS t Þ may be useful summaries of evidence for anomalies.

Simulations
To validate our null distribution and p-values for DVARS we simulate 4D data as completely independent 4D normally distributed noise for σ i drawn uniformly between σ min and σ max for each i, I ¼ 90; 000.
We manipulate two aspects in our simulations, time series length and heterogeneity of variance over voxels. We consider T of 100, 200, 600 and 1200 data-points, reflecting typical lengths as well as those in the Human Connectome Project. We use three variance scenarios, homogeneous with σ min ¼ σ max ¼ 200, low heterogeneity σ min ¼ 200 and σ max ¼ 250, and high heterogeneity σ min ¼ 200 and σ max ¼ 500.
We consider four estimates of μ 0 . First is the very non-robust sample Table 5 List of all statistically significant Dt fast SS components in the fully pre-processed HCP 115320. Spikes which represent the highest (index 1177) and lowest (index 1035) are marked in bold.  Di (Eqn. (13)), the robust IQR-based estimates of differenced data variance at each voxel. Finally we also consider the empirical median of fDVARS 2 t g, μ DVARS 0 . For σ 2 0 , all estimates were based directly on fDVARS 2 t g; for comparative purposes we considered the (non-robust) sample variance of fDVARS 2 t g, b σ 2 0 , and IQR-based and hIQR-based estimates of variance with power transformations d of 1, 1/2, 1/3 and 1/4, denoted genericallyσ 2 0 ; note d ¼ 3 is theoretically optimal for χ 2 (see Appendix F).
For p-value evaluations, we only evaluate the most promising null moment estimators,μ D 0 andμ DVARS 0 for μ 0 , andσ 2 0 with hIQR, d ¼ 1 and hIQR, d ¼ 3. We measure the bias our estimators in percentage terms, as ðb μ 0 À μ 0 Þ=μ 0 Â 100 and ðb σ 2 0 À σ 2 0 Þ=σ 2 0 Â 100, where the true value are i =I 2 (as per Appendix E when Σ S ¼ I). For each method we obtain P-values and create log P-P plots (probabilityprobability plots) and histograms of equivalent Z-scores. Similar simulation settings are used to evaluate the power of the DVARS hypothesis test, except we consider 4 different autocorrelation levels ρ ¼ f0; 0:2; 0:4; 0:6g. This range is chosen to reflect observed estimates of lag-1 autocorrelation coefficients in the HCP cohort. Inferences are assessed in terms of sensitivity and specificity.
All simulations use 1000 realisations. Fig. 9. Distribution of temporal lag-1 autocorrelation across three pre-processing levels. First three rows show maps of autocorrelation for raw, minimally preprocessed and fully preprocessed, respectively, for one subject (only positive values); bottom row shows dot plots of autocorrelation for that same subject and two other subjects (random selection of 1% voxels plotted for better visualization). Fully preprocessed data has median correlation near zero, consistent with converging S-var and D-var.

Analysis of functional connectivity
We evaluate the impact of the DVARS test as a tool for "scrubbing" (scan deletion) on functional connectivity (FC) measued with Pearson's correlation coefficient. We consider FC between all possible pairs of Region of Interests (ROI) in each subject for a given ROI atlas. The mean time series of each ROI is obtained by averaging all the time series within a ROI. To parcellate the brain, we use two data-driven atlases; Power Atlas (Power et al., 2011) which is constructed of 264 non-neighboring cortical and sub-cortical ROIs and each ROIs has 81 voxels (is case of 2 mm isotropic volumes) and Gordon Atlas (Gordon et al., 2014) which is constructed of 333 cortical regions of interests with different sizes.
We use two popular methods to evaluate the effect of the DVARS inference on functional connectivity. First, we use the QC-FC analysis which begins by creating per-edge, intersubject scores, the correlation of the number of removed volumes and FC; these scores are plotted against the inter-ROI distance (in mm). We then use LOESS smoothing method (with span window of %1) to summarize the association for each method.   box-plot right-outliers, 1.5 IQRs above the 75%ile. Note that the first three approaches used a fixed threshold, while the FSL approach gives a run-specific threshold.
The objective of this FC analysis is to investigate whether DVARS inference test performs as well as the available thresholding methods (such as arbitrary thresholding of FD (Power et al., 2012) and DVARS (Burgess et al., 2016)) and if so, whether it delivers the optimal results while sacrificing the fewest temporal degree of freedom as possible. Therefore, we only present the results for the Minimally pre-processed data sets.

Real data
We use two publicly available data-sets to demonstrate the results of methods proposed in this paper on real-data. First, we use 100 subjects from "100 Unrelated" package in the Human Connectome Project (HCP, S1200 release). We chose this dataset due to the high quality and long sessions of the data Glasser et al., 2013). Second, we used first 25 healthy subjects from the New York University (NYU) cohort of the Autism Brain Imaging Data Exchange (ABIDE) consortium via Preprocessed Connectome Project (PCP) (Craddock et al., 2013). We selected this cohort for its high signal-to-noise ratio and the more typical (shorter) time series length (Di Martino et al., 2014).

Human Connectome Project (HCP)
For full details see Glasser et al., 2013); in brief, 15 min eyes-open resting acquisitions were taken on a Siemens customized Connectome 3T scanner with a gradient-echo EPI sequence, TR ¼ 720ms, TE ¼ 33.1 ms, flip angle ¼ 52 and 2 mm 3 isotropic voxels. For each subject, we used the first session, left to right phase encoding direction (See Table S2 for full details of subjects). We considered each subject's data in three states of pre-processing: unprocessed, minimally pre-processed and ICA-FIXed processed. Unprocessed refers to the raw data as acquired from the machine without any pre-processing step performed, useful as a reference to see how the DSE components change with preprocessing steps. Minimally pre-processed (MPP) data have undergone a range of conventional pre-processing steps such as correction of gradient-nonlinearity-induced distortion, realignment aiming to correct the head movements, registration of the scans to the structural (T1w) images, modest (2000s) high pass filtering and finally transformation of the images to the MNI standard space.
Finally, after regressing out the 24-motion parameters, an ICA-based clean up algorithm called ICA-FIX (Salimi-Khorshidi et al., 2014) is applied, where artifactual ICA components, such as movement, Fig. 12. Square root D-var (fast) and S-var (slow) variability images of four subjects, for minimally (left sub-panels) and fully preprocessed data (right sub-panels). Subject 151627 appears to have been successfully cleaned, others less so; see text for detailed interpretation with respect to Figs. 10 and 11. physiological noises of the heart beat and respiration, are regressed out non-aggressively. Due to extent of the FIX denoising and an ongoing debate regarding the nature of the global signal, we did not consider global signal regression with the HCP data. From now on, we call this stage 'fully pre-processed (FPP)' to be consistent with the ABIDE-NYU cohort we describe in the following.

Autism Brain Imaging Data Exchange (ABIDE)
We use use 20 healthy subjects of New York University (NYU) dataset. For full details visit Pre-processed Connectome Project website http://preprocessed-connectomes-project.org/; in brief, 6 min eyesclosed resting acquisitions were taken on an Allegra 3T scanner with a gradient echo EPI sequence, TR ¼ 2000ms, TE ¼ 15ms, flip angle ¼ 90 , and 3 mm isotropic voxels (See Table S2 for full details of subjects). In this study, each subject was analyzed using Configurable Pipeline for the Analysis of Connectomes (C-PAC) pipeline, in three stages; unprocessed, minimally pre-processed and fully pre-processed. The unprocessed data are raw except for brain extraction with FSL's BET. Minimally preprocessed data were only corrected for slice timing, motion by realignment and then the data were transformed into a template with 3 mm 3 isotropic voxels. Fully pre-processed data additionally had residualisation with respect to 24-motion-parameters, signals from white matter (WM) and cerebrospinal fluid (CSF), and linear and quadratic lowfrequency drifts. Conventionally this pipeline deletes the first three volumes to account for T1 equilibration effects, but we examine the impact of omitting this step for the raw data.
Further, we also use all healthy subject of ABIDE (530 subjects) to show how DSE decomposition can be used to compare the data-sets, cohorts and pipelines. Fig. 2 shows the percentage bias for the null expected value μ 0 (left panel) and variance σ 2 0 (right panel) for different levels of variance heterogeneity and time series length.

Simulations
The direct estimates of the μ 0 based on the DVARS 2 t time series perform best on this clean, artifact-free data, while μ 0 estimated on Fig. 13. Investigation of S-var, slow variability artifacts. When S t and D t coincide, like at t ¼ 871 (Panel A), the S-var image shows no particular structure. In contrast, we find multiple S-var excursions correspond to a common pattern of vascular variability across the acquisition, with time points t ¼ 591, 202 and 1030 shown in panels B, C and D, respectively. variance of the differenced data (b μ D 0 andμ D 0 ) degrades with increasing heterogeneity. The estimates of variance have relatively less bias but it is difficult to identify one particular best method, save for IQR often (but not always) having less bias than hIQR, and lower d generally associated with less bias.
On balance, given the generally equivocal results and concerns about robustness, for further consideration we focus onμ DVARS 0 (median of fDVARS 2 t g) andμ D 0 (median of b σ 2 Di ) as promising candidates for μ 0 , and hIQR with d ¼ 1 and hIQR with d ¼ 3 for σ 2 0 . Fig. 3 shows log P-P plots for χ 2 p-values and histograms of approximate Z scores, ðDVARS 2 t À μ 0 Þ=σ 0 ; values above the identity in the P-P plot correspond to valid behavior. While all methods have good performance under homogeneous data,μ D 0 (panels A & C) is not robust to variance heterogeneity and results in inflated significance. In contrast, μ DVARS 0 (panels B & D) has good performance over all, for variance estimated with either d ¼ 1 or d ¼ 3 (top and bottom panels, respectively), and also yields good approximate Z-scores. On the basis of these results, we elected to useμ DVARS 0 as the only reliable option for the mean, and hIQR, d ¼ 3 as a variance estimate, and use these settings going forward. Fig. 4 shows the results of the power simulation. For all sample sizes and autocorrelation parameters, and for the 1% and 10% artifact rates, power was always above 80% and often %100%. Increased autocorrelation resulted in improvements in power, while higher artifact rates reduced power. For the 20% artifact rate power was adequate (% 80%), but falls to zero for the 30% artifact rate. These results suggest that, at the highest spike rate, the artifacts start to be become indistinguishable from the overall noise (see Fig. S2 for one realization). However, the distribution of DVARS values (Fig. S1) suggest that the constituent null and artifact components are distinguishable even at the highest spike rate, but would require yet more robust methods for estimating the null component than we have employed.

Real data
We first focus on selected results of two HCP subjects, then later summarize results for all HCP and ABIDE subjects. Fig. 5 shows different standardized DVARS measures, as introduced in section 2.5, as well as the other DSE components for subject 118730 of the HCP cohort (See Figs. S4, S5 and S6 for more results.). The first six plots corresponds to the variants listed in Table 3; the bottom two plots show "DSE plots," plots of A t , D t , S t and E t components, upper plot with minimal pre-processing, lower with full pre-processing. The gray and magenta stripes indicate 19 data points identified as having significant DVARS after Bonferroni correction, with magenta indicating time-points that are additionally practically significant by the criterion Δ%D À var > 5%. In Fig. 5, the largest D t occurs at index 7 (i.e. 7th and 8th data points) and has ffiffiffiffiffi D t p ¼ 4.07, large in terms of being %D-var ¼ 70.16% of average variability, Z ¼ 36.33 indicating extreme evidence for a spike, and having Δ%D-var ¼ 41.20% more sum-ofsquares variability than expected. The least significant D t occurs at index 726, with ffiffiffiffiffi D t p ¼ 2.83; while its Z ¼ 4.36 is not a small Z-score, with just Δ%D-var ¼ 4.95% excess variation, it is a relatively modest disturbance. In contrast, we find that the values of original DVARS or relative DVARS do not offer a meaningful interpretation. Table S4 shows values for all significant scans.

Temporal diagnostics: DVARS inference and standardized measures
The bottom panel of Fig. 5 shows the DSE plot for fully pre-processed data. This data now exhibits the idealized behavior of IID data, with Dvar and S-var components converging at 50% of average variability (see right-hand y-axis). However, interestingly, the change is not similar for all DSE components. Note how ffiffiffiffiffi D t p is around 2.6 before clean up, and 2.5 after clean up, while ffiffiffiffi S t p falls dramatically with cleaning, indicating that nuisance variance removed was largely of a "slow" variability component. Also observe that cleaned up results drops in total variability componentA t , where artifacts were found, indicating variance is removed by FIX.
Finally, Table 4 explores the use of the estimated χ 2 degrees of freedom ν as an index of spatial effective degrees of freedom. Raw data, exhibiting substantial spatial structure, has ν ¼ 287, which increases to ν ¼ 11; 086 for fully preprocessed data, still only about 5% of the actual number of voxels.

Effect of DVARS inference testing on functional connectivity
FC evaluations based on 55,278 unique edges from the Gordon atlas are shown in Fig. 6 (see Fig. S9 for Power Atlas results). Panel A shows the QC-FC analysis of five thresholding methods, compared to unscrubbed QC-FC. The results from the DVARS test appear comparable to the other methods, but Panel B of Fig. 6 show that the DVARS test removes many fewer scans on average, preserving temporal degrees of freedom. A related evaluation, comparing DVARS hypothesis test scrubbing to random scrubbing, finds that FC is significant impacted by the DVARS scrubbing (Figs. S10 and S11).
We note that the sole purpose of preceeding QC-FC analysis is to ensure that the DVARS inference test outperforms other arbitrary thresholds available in literature, and therefore we do not show the similar results for FPP data.

Temporal diagnostics: before and after clean-up
Figs. 7 and 8 shows the minimally and fully pre-processed DSE decompositions, respectively, of HCP subject 115320. Fig. 7, upper panel, shows that if the strict FD threshold, 0.2 mm (Power et al., 2014b), were used 47% of scans would be flagged, while the lenient threshold, 0.5 mm (Power et al., 2014b), appears to miss several important events. For example, around scans 775 and 875 there are two surges in ffiffiffiffiffi D t p , rising to about 60% and 40% average sum-of-squares (excesses of 30% and 10%, respectively, from a baseline of about 30%) while FD remains low. The lower panel's pie chart shows that S-var explains just under 75% of total, and almost all of global sum-of-squares. The Edge component is also 1.5 above its expectation.
In Fig. 8, the fully preprocessed data-set shows roughly equal fast and slow components, as reflected in the overlapping D t and S t sum-ofsquares time series (blue and yellow, respectively) and the pie and bar charts for total sum-of-squares. Edge component E-var has also dropped to fall in line with IID expectations. However, this convergence is not homogeneous over scans and excursions of S-var are still found after scan 650. However, these are much reduced relative to MPP data (no more than 75% of average sum-of-squares, compared to over 150% in Fig. 7).
Note that while significant DVARS are found in the FPP data, they are small in magnitude: Table 5 lists the 10 significant tests, none with Δ%Dvar greater than 6%. If we used a Δ%D-var of 5% we would still mark 4 of these 10 significant; while we might hope for better performance from the FIX method, note the severe problems detected towards the end of the scan (Fig. 7).
The smallest significant Δ%D-var is 2.66%, which is smaller than the least significant scan detected in the minimally preprocessed data, 3.78%. This indicates the increased sensitivity in our procedure as the background noise in the data is reduced. Note that the majority of the spikes detected in Fig. 7 has been removed by ICA-FIX (Fig. 8), however the algorithm has left down-spikes which could be detected via a twosided version of the test explained in section 2.4.
Temporal diagnostics of before and after clean-up for three other subjects (HCP subject 118730, NYU-ABIDE subjects 51050 and 51050) also reported in Supplementary Materials. See Figs. S12 and S13 for HCP subject 118730, Figs. S14 and S15 for NYU-ABIDE subject 51050 and Figs. S16 and S17 for NYU-ABIDE 51055.
The DSE tables for minimally and fully preprocessed (Table 6) gives concise summaries of the data quality. The RMS values provide concrete values that can be used to build intuition for data from a given scanner or protocol. The total noise standard deviation falls from 5.015 to 3.437 with clean-up, but it is notable that the fast component, D-var, falls only slightly from 2.598 to 2.406 (in RMS units), while slow variability falls dramatically from about 4.287 to 2.454. This indicates that much of the variance reduction in "cleaning" comes from removal of low frequency drifts and other slowly-varying effects. The magnitude of temporally structured noise is reflected by S-var explaining 73% of total sum-ofsquares, and after clean-up S-var and D-var fall into line around 50%. A measure of the spatially structured noise is the global A G -var that, while small as a percentage, is seen to be about 1500 that expected with IID before preprocessing, and falling to about 275 relative to IID after preprocessing. That the majority of A G -var is due to S G -var indicates that the global signal is generally low frequency in nature.
We also show the DSE tables for three other subjects; HCP subject 118730 in Table S5, NYU-ABIDE subject 51050 and 51055 in Tables S6 and S7, respectively.
We observe that the cleaned data has D t % S t , which implies that the average lag-1 autocorrelation is close to zero (Sec. Appendix D.8). However, temporal autocorrelation is a ubiquitous feature of fMRI data, suggesting a contradiction. To address this, Fig. 9 shows maps of the lag-1 temporal autocorrelation across the pre-processing steps. For raw data, the autocorrelation coefficient is between 0.4 and 0.6, but with successive pre-processing steps, the autocorrelation coefficient decreases until the FFP level where the median of voxel-wise autocorrelation coefficients is approximately zero. (See Fig. S18 for similar results on 20 HCP subjects).
Thus, while temporal autocorrelation is present in the data, we find that the lag-1 autocorrelation coefficients do get close to zero with cleaned data, indicating that the D t % S t heuristic is correctly indicating negligible average autocorrelation. Fig. 10 illustrates the use of the DSE decomposition to summarize the DSE components of 100 unrelated subjects in the HCP cohort, normalized as a percentage of total variability, A-var, to be maximally comparable across subjects. (See Fig. S22 for same results for ABIDE-NYU cohort). A non-normalized version of this plot (Fig. S23) is useful for viewing absolute changes, showing that S-var dramatically drops with preprocessing while DS-var is relatively stable.
For the raw data, %D-var ranges from just over 5%-40%, and S-var varies between 60% and 96%; the EÀvar only ever explains a negligible portion of the sum-of-squares, 0.027 to 0:50% across all three preprocessing levels. For all but two subjects the %D-var and %S-var components successively converge to 50% AE5% for FPP data.
Considering only the global variability, the slow %S G -var is small, usually falling well below 1%, and fast %D G -var is negligible, never exceeding 0.1%. This reflects the low frequency nature of the global signal. Similarly, the global edge component, %E G -var only explains a small proportion of the global variabilities.
To demonstrate the utility of the DSE decomposition in data quality control, we isolate four subjects and observe how their DSE values change with successive preprocessing.
Subject 151627, marked with a square, is one of the most extreme subjects for S-var and D-var in raw and MPP data, but has one of the smallest %S-var À %D-var differences for FPP data. This dramatic reduction in autocorrelation is confirmed in Fig. 11-A, showing the cumulative distribution of lag-1 autocorrelation, and is likely linked to physiological noise around brain stem and other inferior regions (Fig. 12-A1) successfully removed by ICA-FIX (Fig. 12-A2).
Subject 122620, marked with an triangle, has small %S-var À %D-var differences for all versions of the data, also reflected in its distribution of autocorrelation ( Fig. 11-B). However, there is still some notable spatial structure in the S-var and D-var images even after clean up (Fig. 12-B2). This illustrates that if a small portion of the image possess problems, it may not be detected in any simple summary.
Subject 135932, marked with circle, has absolutely typical S-var and D-var among the 100 subjects in the raw and MPP data, but in the FPP data it has one of worst %S-var À %D-var differences. The distribution of autocorrelation coefficients reflects this (Fig. 11-C), with FPP (red line) having more large values of ρ than the other subjects. Inspection of the raw data S-var map (Fig. reffig:VarImg-C1) shows evidence of substantial structured noise that is, by in large, mostly removed by ICA-FIX correction ( Fig. 12-C2). However the FPP S-var map shows vascular structure, likely a branch of the posterior cerebral artery near the lingual gryus; this is likely an element of physiological noise that ICA-FIX would have ideally removed but missed. Note also that this subject has low movement as measured by median FD (Fig. S21), eliminating motion as the likely source of the problem.
Finally, subject 101107, marked as a diamond, has the worst quality as measured by divergent %S-var and %D-var across preprocessing levels, with FPP level having S-var¼ 77% and D-var¼ 23%, and reflected in the largest autocorrelation values among the four subjects ( Fig. 11-D). Images of S-var show substantial structured variability that remains even in the FPP data ( Fig. 12-D), while the D-var image is improves notably with ICA-FIX. (This was a high-motion subject; note loss of ventromedial prefrontal cortex).
DSE time series plots of these four subjects confirm these findings, with 122620 and 151627 having flat and converged S-var and D-var time series, while 135932 and especially 101107 have structured and diverged S-var time series (Figs. S19 and S20).
To demonstrate the value of the S-var time series, Fig. 13 explores time points where S t is particularly large and small for subject 135932.
Four "S t images" are shown, ðY it þ Y i;tþ1 Þ 2 =4 for voxel i, the constituents of S t (Eqn. (4)). Panel A of Fig. 13 shows a 'clean' time point, with a minimum of structured noise apparent, while panels B-D all show a similar vascular pattern. Examination of the ICA components fed into FIX finds 3 components that reflect this vascular structure that were classified as 'good' (Fig. S24). This demonstrates the value of the DSE decomposition to identify subtle structured noise in the data.
Finally, in addition to using DSE plots to investigate the quality of scans across pre-processing levels, they can also be used as a universal measure to compare the quality of scans across cohorts, data-sets and pipelines. We computed the DSE decomposition of 530 healthy subjects across 20 acquisition sites in the ABIDE dataset (Fig. S25), identifying particular sites (e.g. NYU & OHSU) and the CPAC preprocessing pipeline generally to have minimal temporal autocorrelation as reflected in S-var/ D-var divergence.

Discussion
We have provided a formal context for the diagnostic measure DVARS, showing DVARS 2 t to be part of a decomposition of sum-ofsquares at each successive scan pair and over the whole 4D data. We have proposed a significance test for the DVARS measure which, when detected scans are removed based on p-values, we found to address corruptions of FC while preserving temporal degrees of freedom better than other arbitrary approaches. We have also proposed the DSE decomposition which is particularly useful for summarizing data quality via DSE plots and DSE tables. These tools concisely summarize the interplay of the fast, slow, total and global sum-of-squares, and our derived nominal expected values for each table entry facilitates the identification spatial and temporal artifacts.
Our analysis shows that D-var (and DVARS) scales with overall noise variance, and is deflated by temporal autocorrelation. We observe that as data becomes cleaner, and the background noise falls, we have greater power to identify DVARS 2 t spikes. Therefore, to avoid 'over-cleaning' the data we complement the statistical significance of DVARS p-values with the practical significance of Δ%D-var, a standardized measure of the excess variance explained by a spike as a percentage of average variance. Consequently, the final candidate time-points to be scrubbed is a conjunction of statistical and practical significance; we choose a 5% familywise error rate significance level via Bonferroni and a 5% Δ%D-var cut-off; this practical significance threshold worked adequately in the HCP data we examined but may need to be recalibrated for other data sources.
Yet one more advantage of using χ 2 tests, proposed in this work, is that we can estimate the effective spatial degrees of freedom which may prove to be a useful index of spatial structure in the data, but we stress this particular χ 2 degrees-of-freedom ν is specific to this setting and is unlikely to be useful in other contexts (e.g. as a Bonferroni correction over space).
Besides Δ%D-var, we have introduced two standardised measures which facilitate the inter-cohort comparison of the fast (or DVARS) component regardless of intensity normalisation used in the preprocessing pipelines. For example, standardised measure %D-var shows the proportion of variability which can be explained via fast component while %S-var shows the similar proportion for the slow variability in data.
The DSE plots allow D-var to be judged relative to S-var, checking for convergence to approximately 50% of A-var as data approaches temporal independence, and consequently the level of autocorrelation as measure of corruption can be tightly monitored across pre-processing steps.
Using DSE plots we found two HCP subjects (101107 & 136932) where the motion-parameter regression and further ICA-FIX algorithm failed to clean the data and clearly stand out from others in the 100 unrelated subject cohort. We have used the DSE variability images to temporally and spatially locate the corruptions. It is important to note that the DSE decomposition technique should only be used before any form of resting-state bandpass filtering (such as 0.01Hz-0.1Hz) and autocorrelation modelling (such as FILM pre-whitening techniques).
Finally, we stress that we don not believe there is any one strategy can address all fMRI artifacts. Each method used in this work has it is merits and pitfalls. For example, while scrubbing was shown to be useful to remove the head motion induced spikes, it fails to remove the nuisance due to physiological signals on it is own and requires alternatives like ICA-based methods. Regardless of method, we still see value of using DSE plots and images throughout the analysis to choose a right combination of methods; see Ciric et al. (2016) for a recent comparison of various combinations of artifact methods.

Limitations and future work
Our DVARS p-values depend critically on accurate estimates of μ 0 and σ 2 0 . Despite finding exact expressions for the null mean and variance, we found the most practical and reliable estimates to be based on the sample DVARS 2 t time series itself, using median for μ 0 and hIQR to find σ 0 .Of course this indicates that our inference procedure can only infer relative to the background noise level of the data, picking out extreme values that are inconsistent with our approximating χ 2 approximation.
There are two essentials avenues as continuation of this work. First is to study the effect of global signal regression via DSE decompositions. As regressing out the global mean deflates the global segment of each variability component, the DSE decomposition can be used to investigate whether global signal regression is helpful to suppress the spatial artifacts. Second, both cleaning algorithms used in this work, scrubbing and ICA-FIX, leave down-spikes (or dips) after regressing out the nuisance. These down-spikes may also affect the FC and could be detected with a two-sided variant of our hypothesis test.

Software and reproducibility
In this work majority of the analysis have been done on MATLAB versions 2015b and 2016b, supported by FSL 5.0.9 for neuroimaging analysis.
Inference on DVARS as well as DSE decomposition techniques proposed in this paper are available via MATLAB scripts, found at http:// www.github.com/asoroosh/DVARS. Also, a dedicated web page, http://www.nisox.org/Software/DSE/, present the DSE decompositions of HCP and ABIDE cohort and is regularly updated with new publicly available resting-state data sets.
Results and figure scripts presented in this work are publicly available for reproducibility purposes (Diggle, 2015) on http://www.github.com/ asoroosh/DVARS_Paper17.
can be plotted. These set of three time series may seem arbitrary, but have the feature of the sum of squares of G Dt and G St sum to the mean-square of G At and G At;tþ1 .

Appendix C. Derivation of DSE SS-decomposition
The decomposition of the average sum of squares at time t and t þ 1, Eqn. (5), is based on a simple algebraic identity; for variables a and b, This justifies a decomposition of the average SS at each voxel i, for each time t ¼ 1; …; T À 1, Averaging this expression over voxels i ¼ 1; …; I gives the decomposition for scan pair variability A t;tþ1 in Eqn. (5). Summing image variability A t;tþ1 over t, however, misses 1/2 of edge terms, which are added to produce the fundamental DSE decomposition in Eqn. (7).

Appendix D. Derivation of DSE table Mean Squares
Here we set out the least restrictive model possible to justify our expected values for the DSE table (Table S1). While the DSE table and decompositions A ¼ D þ S þ E and A G ¼ D G þ S G þ E G are in mean-square (MS) units, below we develop the results in terms of sum-of-squares (SS) that, in each case, can be divided by I Â T to obtain the MS.
All of the results follow from application of rules for expectations and variances of quadratic forms of mean zero vectors. For reference, if w is a mean zero random vector with covariance Σ, and B is a square matrix, then Eðw > BwÞ ¼ trðBΣÞ and Vðw > BwÞ ¼ 2trðBΣBΣÞ.

Appendix D.1Model
In defining the joint distribution of all I Â T elements of the 4D data fY it g, we will always assume is that Y it is mean zero and has constant variance over time, VðY it Þ ¼ VðY it' Þ for t 6 ¼ t', but allow variance to vary over space. For data organized as time series, length-T vectors Y i , let where Σ S is the I Â I spatial covariance matrix, common to all time points, and ðΣ S Þ ii is the variance at the ith voxel, Σ T ii is the T Â T temporal autocorrelation matrix for voxel i, ℂð⋅Þ denotes covariance, and Σ T ii' is the T Â T temporal cross correlation matrix for voxels i and i'. This implies that, for data organized as images, length-I vectors Y t , When a time-space separable covariance structure is assumed then Σ T ii' ¼ Σ T for all i; i'. is the ðT À 1Þ Â T finite difference matrix. We have where notably the last expression only depends on the lag-1 temporal autocorrelations. To obtain more interpretable results we further assume that there is a constant lag-1 autocorrelation at each voxel, ρ i ¼ ðΣ T ii Þ t;tþ1 ; for t ¼ 1; …; T À 1, which reduces (D.5) to 2ðT À 1ÞðΣ S Þ ii ð1 À ρ i Þ. This gives the expected total D-var SS as If we yet further assume constant temporal autocorrelation ρ, corresponding to our separable model, this SS simplifies to trðΣ S ÞðT À 1Þð1 À ρÞ=2.
The expected SS for S-var is follows the same arguments with differencing matrix replaced with a running sum matrix absðDÞ, negating the three negative terms in Eqn. (D.5), and reducing to trðΣ S ÞðT À 1Þð1 þ ρÞ=2 under spatially and temporally homogeneous lag-1 temporal autocorrelation.

Appendix D.4E-var Expected SS
The total SS E-var is

D.5. A G -var Expected SS
The global time series is Y t and total SS due to global is where 1 is a vector of ones. The expectation of the squared term is Vð1 > Y t Þ ¼ 1 > Σ S 1, and thus the expected SS is where ∘ is the Hadamard product and Σ ST t;tþ1 is the spatiotemporal covariance matrix, elements extracted from the temporal cross correlation matrix as per ðΣ ST tt' Þ ii' ¼ ðΣ T ii' Þ t;t' , and that The final expression for the expected SS is then, with successive assumptions ¼ ðT À 1Þ1'Σ S 1ð1 À ρÞ=ð2IÞ; (D.13) where first equality comes from assuming a separable covariance structure and the second from a common lag-1 autocorrelation. The result for S G -var follows similarly.
That is, the expected value of DVARS 2 t is simply the variance of each voxel in the differenced data, averaged over voxels. The natural estimator of this is the sample mean (or robust equivalent) of the sample variance image (or robust equivalent) of the differenced 4D data.
The variance is more involved in particular depending on the entirety of the I Â I difference image variance matrix. For the most restrictive assumptions considered above VðY D t Þ ¼ 2ð1 À ρÞΣ S and thus V À DVARS 2 t Á ¼ 8ð1 À ρÞ 2 trðΣ S Σ S Þ This dependence on the full spatial covariance demands the empirical approaches to variance estimations taken in the body of the paper. Only at this point do we invoke a normality assumption, and make use of the classic chi-square approximation for sums-of-squared normal variates (Satterthwaite, 1946). In this approach we equate the mean and variance of c Â DVARS 2 t (cμ 0 & c 2 σ 2 0 ) and χ 2 ν (ν & 2ν) and solve for c and ν, giving the multiplier c ¼ 2μ 0 =σ 2 0 and degrees-of-freedom ν ¼ 2μ 2 0 =σ 2 0 as found in Section 2.4.

Appendix F. Power Transformations to Improve DVARS Variance Estimation
The robust IQR-based variance estimate reflects a normality assumption, equating the sample IQR with that of a standard normal. DVARS 2 t , as a sumof-squares and as reflected by its χ 2 approximation, may exhibit positive skew. Hence we consider power transformations of DVARS 2 t that may improve symmetry and the accuracy of the IQR variance estimate. While the asymptotically optimal power transformation to normality for χ 2 is known to be the d ¼ 3 cube-root transformation (Hernandez and Johnson, 1980), our test statistic is only approximately χ 2 and, in particular, variance heterogeneity can worsen the approximation.
To obtain a quantity that should be more symmetric consider the power transformation IQR-based estimates of the variance of W, σ 2 W , will hopefully be more accurate than such estimates on DVARS 2 . However, ultimately we seek estimates of the variance of DVARS 2 , and so for a given d we compute where the last expression is the delta method variance of W 1=d t , and μ W is the mean of W t (which we robustly estimate with the median of W t ).