Effective Degrees of Freedom of the Pearson’s Correlation Coefficient under Serial Correlation

The dependence between pairs of time series is commonly quantified by Pearson's correlation. However, if the time series are themselves dependent (i.e. exhibit temporal autocorrelation), the effective degrees of freedom (EDF) are reduced and the standard error is biased. This issue is vital in resting-state functional connectivity (rsFC), since fMRI time series are notoriously autocorrelated and therefore rsFC inferences will be biased if not adjusted. We find that the available EDF estimators make restrictive assumptions that are not supported by the data, resulting in biased rsFC inferences that lead to distorted topological descriptions of the connectome. We propose a practical method that accounts not only for distinct autocorrelation in each time series, but also for instantaneous and lagged cross-correlation. In addition to extensive synthetic and real data validations, we investigate the impact of this correction on rsFC measures in data from the Young Adult Human Connectome Project.


Introduction
Resting-state functional connectivity (rsFC), the neuronal interaction between brain regions in the absence of any external stimuli, has proven to be an effective technique for gaining a deeper understanding of the human brain. Many rsFC methods make use of temporal correlation estimated with the Pearson product-moment correlation coefficient, often converted to a Z-score via Fisher's transformation. However, standard results for the variance of Pearson's correlation (before or after Fisher's transformation) depend on independence between successive observations. fMRI time series are autocorrelated (i.e. correlated with temporally lagged versions of themselves), and autocorrelation inflates the sampling variance of Pearson's correlation. While prewhitening toolboxes (e.g. FSL's FILM) address autocorrelation, they are designed for voxelwise or vertex-wise analyses and would need to be adapted to region-of-interest (ROI) data.
The concern about the effect of autocorrelation on Pearson's correlation has a long history in spatial statistics (Haining, 1991), econometrics (Orcutt and James, 1948) and climate sciences (Bretherton et al., 1999), but the fundamental work is Bartlett (1935), who first asserted that the lack of independence (between observations) is a bigger challenge than non-Gaussianity. In his 1935 paper, Bartlett proposes a variance estimator of sample correlation coefficients based on an AR(1) model, but he acknowledges that a limitation of the work is that it assumes zero cross-correlation. In later work he proposed a more general estimator that accounts for higher-order AR models (Bartlett, 1946) but still fails to account for cross-correlation.
Two extensions to this work have been proposed, by Quenouille (1947) and Bayley and Hammersley (1946): the former adapts the estimator to cases where the autocorrelation functions are not identical across the two time series, and the latter down-weights the autocorrelation at long lags. Several years later, Clifford et al. (1989) also proposed a reformulation of Bayley and Hammersley (1946). We have found little comparative evaluation of these methods in the literature, save Pyper and Peterman (1998), which compared false positive rates on low-order autoregressive models of uncorrelated time series.
Importantly, save for the work of Roy (1989), all of the methods we have discussed so far have been derived under the rsFC null hypothesis (i.e. independence between the two series). This null encompasses both zero instantaneous and zero lagged cross-correlations. This is problematic for rsFC, as typically the challenge is not only to detect edges but also to measure the strength of the connectivity. Moreover, as we show below, ignoring non-null lagged cross-correlation induces biases in estimation of the variance, even in the absence of autocorrelation. We identify the confounding responsible for this effect (see Section D).
In this work we propose a variance estimator, which we call xDF, which imposes no assumptions on the (intra-time-series) autocorrelation or (inter-time-series) correlation of the time series. To motivate and introduce our results, Figure 1 shows correlations of the BOLD time series of the Left Dorsal Prefrontal Cortex (PFCd) from one HCP subject with all 114 ROIs (parcellated with the Yeo atlas) of a different HCP subject.
Because the two time series come from different subjects there is no true correlation, and we expect only 5% of the 114 correlation coefficients to be significant; instead, the standard Fisher's z result (see Section 2.4) finds 35.7% of the regions significant controlling the false discovery rate (FDR) at 5% (Figure 1.D). After application of our xDF correction, no regions survive FDR correction (2.6% of nodes were found to be significant at the nominal level of 5%). A plot of xDF-adjusted z-scores against Naive z-scores shows the dramatic difference in values (Figure 1.D).
Observe how L-SoMotCent (blue marker) is incorrectly detected and is highly auto- and cross-correlated (blue ACF, Figure 1.E). This causes the z-score of the correlation to be greatly reduced. In contrast, R-Insula (red marker) has essentially the same Naive and xDF-adjusted z-score and has nearly zero autocorrelation (red ACF, Figure 1.E).

The remainder of this work is organised as follows. We first present a concise overview of the model and the proposed estimator. Second, we discuss the importance of accounting for unequal autocorrelation between the pair of variables, i.e. node-specific autocorrelation, by showing how the autocorrelation structures are spatially heterogeneous and dependent on parcellation schemes. Third, we discuss how ignoring such effects may result in spurious significant correlations and topological features. Fourth, using Monte Carlo simulations and real data, we show how xDF outperforms all existing available variance estimators. And finally, we show how using xDF may change the interpretation of the rsFC of the human brain. The potential impact of such corrections on the interpretation of rsFC is investigated for conventional thresholding methods (i.e. statistical and proportional) as well as un-thresholded functional connectivity maps.

Notation
Without loss of generality, we assume mean-zero, unit-variance time series X = {x_1, ..., x_N} and Y = {y_1, ..., y_N}, each of length N, with V(X) = Σ_X and V(Y) = Σ_Y; we write the covariance between X and Y as Σ_XY = Σ_YX^T. We assume stationarity, and thus have Toeplitz Σ_X, Σ_Y and Σ_XY, and denote the autocorrelations of X and Y at lag k by ρ_XX,k and ρ_YY,k, and the cross-correlation by ρ_XY,k. The key rsFC parameter is ρ_XY,0, the cross-correlation at lag 0, which we refer to as simply ρ going forward. Note we do not assume that cross-correlations are symmetric, i.e. ρ_XY,k and ρ_XY,−k are distinct.

Variance of Sample Correlation Coefficients
For the sample correlation coefficient for mean-zero data, ρ̂ = X^T Y / √(X^T X · Y^T Y), in Appendix B we derive a general expression for its variance (Eq. 1). For stationary covariance (see Appendix C), we can rewrite Eq. 1 as Eq. 2, where w_k = N − 2 − k. While Eq. 2 takes the same form as the estimator of Roy (1989), we obtain our result from finite-sample as opposed to asymptotic arguments (see Appendix B).
It is also useful to discuss two special cases of Eq. 1. First, suppose the two time series X and Y are both white but correlated, such that Σ_X = Σ_Y = I and Σ_XY = Iρ. Eq. 1 then reduces to the well-known result for the variance of the sample correlation coefficient between two white-noise time series, V(ρ̂) ≈ (1 − ρ²)²/N (see §5.4 in Lehmann (1999b)).
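As a sanity check on this white-noise special case, a minimal Monte Carlo sketch (our code, not from the paper; variable names are ours) can confirm that the sampling variance of ρ̂ is close to the classical (1 − ρ²)²/N:

```python
import numpy as np

rng = np.random.default_rng(0)
N, rho, reps = 500, 0.5, 4000

# Simulate correlated white-noise pairs: y = rho*x + sqrt(1 - rho^2)*e
rhats = np.empty(reps)
for i in range(reps):
    x = rng.standard_normal(N)
    e = rng.standard_normal(N)
    y = rho * x + np.sqrt(1 - rho**2) * e
    rhats[i] = np.corrcoef(x, y)[0, 1]

emp_var = rhats.var()
theory = (1 - rho**2) ** 2 / N
print(emp_var, theory)  # the two should agree closely
```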
Second, suppose X and Y are autocorrelated but uncorrelated with each other, with non-trivial Σ_X and Σ_Y but Σ_XY = 0. Then, Eq. 1 reduces to a result on the variance inflation of ρ̂ in spatial statistics, proposed by Clifford et al. (1989) and also discussed as the variance of the inner product of two random vectors in Brown and Rutemiller (1977). Written in summation form (see Appendix C), this expression is a result proposed much earlier by Bayley and Hammersley (1946) which has found use in neuroimaging (Nicosia et al., 2013; Valencia et al., 2009). A closely related form (Dutilleul et al., 1993) that adjusts for mean centering has also been used in neuroimaging (Nevado et al., 2012; Pannunzi et al., 2018), though for typical time series lengths (i.e. N ≥ 20) there should be little difference from the original result.
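A common finite-sample statement of this result, as given by Pyper and Peterman (1998), is 1/N̂ = 1/N + (2/N²) Σ_{k=1}^{N−1} (N − k) ρ_XX,k ρ_YY,k. The sketch below (our notation, not the paper's code) checks it against Monte Carlo for two independent AR(1) series:

```python
import numpy as np

rng = np.random.default_rng(1)
N, phi, reps = 400, 0.6, 4000

def ar1(N, phi, rng):
    """Stationary AR(1) series with coefficient phi."""
    x = np.empty(N)
    x[0] = rng.standard_normal() / np.sqrt(1 - phi**2)
    for t in range(1, N):
        x[t] = phi * x[t - 1] + rng.standard_normal()
    return x

# Bayley & Hammersley (1946) effective sample size, using the true
# AR(1) ACFs rho_k = phi^k for both series:
k = np.arange(1, N)
inv_neff = 1.0 / N + (2.0 / N**2) * np.sum((N - k) * phi**k * phi**k)
neff = 1.0 / inv_neff

# Monte Carlo variance of r between independent AR(1) series (rho = 0)
rhats = np.array([np.corrcoef(ar1(N, phi, rng), ar1(N, phi, rng))[0, 1]
                  for _ in range(reps)])
print(rhats.var(), 1.0 / neff)  # both well above the naive 1/N = 0.0025
```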

Effective Degrees of Freedom for the Correlation Coefficient
One way of dealing with autocorrelation is to modify a variance result that assumes no autocorrelation, replacing N with a deflated EDF N̂. This can be done in terms of ρ̂ (e.g. Eq. 3) or after Fisher's transformation; here we consider EDFs for ρ̂ and return to Fisher's transformation in Section 2.4.
Different corrections have been proposed to estimate N̂. One of the earliest results is due to Bartlett (1935), who proposed an EDF for uncorrelated (ρ = 0) AR(1) time series; we refer to this EDF estimator as B35.
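Since the paper's equation numbering is not reproduced here, we restate Bartlett's AR(1) result in the form it commonly takes in the literature (e.g. in Pyper and Peterman's review): for two independent AR(1) series with lag-1 autocorrelations φ_X and φ_Y,

```latex
\hat{N}_{\mathrm{B35}} \;=\; N \,\frac{1-\varphi_X \varphi_Y}{1+\varphi_X \varphi_Y},
\qquad\text{so that}\qquad
\mathbb{V}(\hat\rho)\;\approx\;\frac{1}{\hat{N}_{\mathrm{B35}}}\quad (\rho=0).
```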
Building on the work of Bartlett (1946), Quenouille (1947) proposed a more general EDF that allowed for higher-order autocorrelation structures, though still assuming ρ = 0. We refer to this EDF estimator as Q47.
In neuroimaging, a global form of Eq. 7 has been used, where a single ACF ρ_GG,k is computed, averaged across voxels or ROIs for each subject, or even over subjects (Fox et al., 2005; Dijk et al., 2010); it is then used in place of the node-specific ACFs. We refer to this EDF as G-Q47.

The estimator of Bayley and Hammersley (1946), which down-weights the autocorrelation at long lags (cf. Eqs. 4 & 5), gives an EDF N̂ still under an independence assumption ρ = 0. We refer to this EDF as BH.
Whether defined with infinite or finite sums, some sort of truncation or ACF regularisation is required to use these results in practice, which we consider in Section 2.3.

Monte Carlo Parametric Simulations
The one other approach we evaluate is Monte Carlo parametric simulation (MCPS) (Ripley, 2009). In this approach the variance of the sample correlation is estimated from surrogate data, simulated to match the original data in some way. If a common autocorrelation model and parameters are assumed over variables and subjects, this can be a computationally efficient approach. For example, the FSLnets toolbox (2) for analysis of functional connectivity assumes an AR(1) model with the AR coefficient chosen globally for all subjects and node pairs. While MCPS avoids any approximations for a given model, it can only be as accurate as the assumed model.
We evaluate the method used by FSLnets, which sets the number of realisations equal to the number of nodes. We refer to this as AR1MCPS.
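A minimal sketch of an AR1MCPS-style correction (ours, under stated assumptions; not FSLnets' actual implementation): simulate surrogate AR(1) pairs with a single global coefficient and take the empirical standard deviation of their correlations as the null standard error.

```python
import numpy as np

def ar1_sim(N, phi, rng, burn=100):
    """Stationary AR(1) series via recursion, discarding a burn-in."""
    e = rng.standard_normal(N + burn)
    x = np.empty_like(e)
    x[0] = e[0]
    for t in range(1, len(e)):
        x[t] = phi * x[t - 1] + e[t]
    return x[burn:]

def ar1_mcps_std(N, phi_global, n_real, rng):
    """Null standard deviation of Pearson's r from AR(1) surrogates."""
    r = [np.corrcoef(ar1_sim(N, phi_global, rng),
                     ar1_sim(N, phi_global, rng))[0, 1]
         for _ in range(n_real)]
    return float(np.std(r))

rng = np.random.default_rng(2)
sd = ar1_mcps_std(N=300, phi_global=0.4, n_real=2000, rng=rng)
print(sd)  # exceeds the white-noise value 1/sqrt(300), about 0.058
```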

Regularising Autocorrelation Estimates
All of the advanced correction methods described depend on the true ACFs ρ_XX,k, ρ_YY,k, and some on the cross-correlations ρ_XY,k. We expect true ACFs and cross-correlations to diminish to zero with increasing lags, but sampling variability means that substantial non-zero ACF estimates will occur even when the true values are zero. Thus all ACF-based methods use a strategy to regularise the ACF, by zeroing or reducing ACF estimates at large lags.

[Footnote 2: http://fsl.fmrib.ox.ac.uk/fsl/fslwiki/FSLNets; visited on 18 September 2015]
Several arbitrary rules have been suggested for truncating ACFs, zeroing the ACF above a certain lag. For example, Anderson (1983) suggests that estimators should only consider the first N/4 lags, while Pyper and Peterman (1998) found that truncating at N/5 lags is optimal. Hence in this work we use the N/5 cut-off for our truncated ACF estimator.
Smoothly scaling ACF estimates to zero is known as tapering. Chatfield (2016) suggests tapering methods using Tukey or Parzen windows. For example, for Tukey tapering, the raw ACF estimate is scaled by (1/2)(1 + cos(πk/M)) for k ≤ M and zeroed for k > M. As with truncation, finding the optimal M appears to be cumbersome; Chatfield (2016) suggests M = 2√N, while Woolrich et al. (2001) propose the more stringent √N; for a detailed comparison of tapering methods see Woolrich et al. (2001).
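For concreteness, Tukey tapering of a sample ACF can be sketched as follows (our minimal implementation, assuming the biased 1/N ACF estimator):

```python
import numpy as np

def sample_acf(x):
    """Biased sample ACF at lags 0..N-1."""
    x = np.asarray(x, float) - np.mean(x)
    full = np.correlate(x, x, mode="full")[len(x) - 1:]
    return full / full[0]

def tukey_taper(rho, M):
    """Scale ACF by (1/2)(1 + cos(pi*k/M)) for k <= M, zero for k > M."""
    k = np.arange(len(rho))
    w = np.where(k <= M, 0.5 * (1.0 + np.cos(np.pi * k / M)), 0.0)
    return rho * w

rng = np.random.default_rng(3)
x = rng.standard_normal(600)
M = int(2 * np.sqrt(600))            # Chatfield's suggested cut-off
rho_t = tukey_taper(sample_acf(x), M)
print(rho_t[0], rho_t[-1])           # lag 0 untouched (weight 1); long lags 0
```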
In this study we consider fixed truncation and Tukey tapering, as well as an adaptive truncation method.
For our adaptive method, we zero the ACF at lags k ≥ M, where M is the smallest lag at which the null hypothesis is not rejected at uncorrected level α = 0.05, based on approximate normality of the ACF estimate with sampling variance 1/N. We base the truncation of the cross-correlation ρ_XY,k on the ACFs of X and Y, choosing the larger M found for either time series.
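A sketch of this adaptive rule (our implementation; α = 0.05 two-sided gives the 1.96/√N bound):

```python
import numpy as np

def sample_acf(x):
    """Biased sample ACF at lags 0..N-1."""
    x = np.asarray(x, float) - np.mean(x)
    full = np.correlate(x, x, mode="full")[len(x) - 1:]
    return full / full[0]

def adaptive_truncate(rho, N, z=1.96):
    """Zero the ACF at lags k >= M, where M is the smallest lag whose
    estimate is not significant under the approximate N(0, 1/N) null.
    For a cross-correlation, use the larger M of the two series."""
    rho = np.array(rho, float)
    bound = z / np.sqrt(N)
    M = len(rho)
    for k in range(1, len(rho)):
        if abs(rho[k]) < bound:
            M = k
            break
    rho[M:] = 0.0
    return rho, M

rng = np.random.default_rng(4)
x = rng.standard_normal(500)              # white noise
rho_t, M = adaptive_truncate(sample_acf(x), 500)
print(M)  # truncation kicks in at a small lag for white data
```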
Following practice in the field, we do not consider inference using ρ̂ directly but focus our evaluations on Fisher's transformed sample correlations, i.e. Z-scores. The particular correction method used determines V(ρ̂). For xDF we use Eq. 1, while for all other methods we use the nominal variance with an EDF, i.e. V(ρ̂) = (1 − ρ²)²/N̂; Naive has N̂ = N − 3, and each other EDF method uses its respective estimate N̂, as described in Section 2.2.1.
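The z-score construction can be sketched as follows (our code; the delta-method step converts any estimate of V(ρ̂), e.g. xDF's, into a variance for atanh(ρ̂)):

```python
import numpy as np

def z_naive(r, N):
    """Naive Fisher z: V(atanh(r)) = 1/(N - 3) under independent samples."""
    return np.arctanh(r) * np.sqrt(N - 3)

def z_from_var(r, var_rho):
    """Z-score from any estimate of V(rho-hat) via the delta method:
    V(atanh(rho-hat)) ~= V(rho-hat) / (1 - rho^2)^2."""
    return np.arctanh(r) * (1.0 - r**2) / np.sqrt(var_rho)

r, N = 0.3, 1200
zn = z_naive(r, N)
zv = z_from_var(r, (1.0 - r**2) ** 2 / N)  # nominal variance recovers Naive
print(zn, zv)
```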

Simulations and Real Data Analysis
The xDF is validated and compared with other existing estimators via a series of Monte Carlo simulations and real data experiments. We simulate time series with complex autocorrelation, under both uncorrelated and correlated conditions, using ACF parameters estimated from one HCP subject (see Section 3.4). We generate null realisations with real data by randomly exchanging nodes between subjects (see Section S3.6). From both of these sources of null data we evaluate the distribution of z-scores and false positive rates.
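The inter-subject scrambling null can be sketched as follows (our toy code on synthetic stand-in data; the array shapes are assumptions):

```python
import numpy as np

def scrambled_null_pair(data, rng):
    """Draw a null pair by taking node time series from two *different*
    subjects: the true cross-correlation is zero, yet each series keeps
    its realistic autocorrelation. data: (n_subjects, n_nodes, N) array."""
    S, P, N = data.shape
    s1, s2 = rng.choice(S, size=2, replace=False)
    return data[s1, rng.integers(P)], data[s2, rng.integers(P)]

rng = np.random.default_rng(6)
fake = rng.standard_normal((5, 4, 200))   # synthetic stand-in for HCP data
x, y = scrambled_null_pair(fake, rng)
r = np.corrcoef(x, y)[0, 1]
print(x.shape, r)  # one null pair's worth of data; r is near 0 on average
```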
To investigate sensitivity as well as specificity, we simulate correlation matrices, transformed to z-scores with each method, with a known proportion of non-null (ρ > 0) edges (see Section S3.5). Sensitivity, specificity, accuracy (the sum of detections at non-null edges and non-detections at null edges) and area under the ROC curve are computed for each method.
We consider graph metrics computed on real data, with z-scores computed with each method. We use one session of resting-state data from each of the 100 unrelated HCP subjects. These data were pre-processed (Section S4) and we created P × P resting-state functional connectivity matrices (z-scores), where P is the number of ROIs.

Autocorrelation Length
To summarise the strength of the autocorrelation effect on a time series X of length N, we define the autocorrelation length (ACL), τ_X.

Results

Autocorrelation and Parcellation Schemes
We find that the degree of autocorrelation of resting-state data is highly heterogeneous over the brain: Figure 2.A shows a map of voxel-wise autocorrelation length, averaged across 100 subjects, with values ranging from 1 to 4. We found that parcellation schemes not only fail to reduce this spatial heterogeneity, but instead magnify the autocorrelation effects (Figure 2.B), showing the dramatic increase in autocorrelation that comes with averaging within ROIs (see Appendix E for the likely origin of this effect).
More generally, we find that the size of an ROI predicts its autocorrelation in both volumetric and surface-based parcellation schemes (Figure 2.C). Further, not only the size but also the location of the ROI influences autocorrelation. Using the Power atlas, where all ROIs have identical volume (81 2mm³ voxels), the autocorrelation in subcortical structures is weaker than in cortical structures, as summarised by plotting autocorrelation length vs. distance to the Thalamus (Figure 2.D). While differences in BOLD characteristics between subcortical and cortical voxels could contribute to the autocorrelation structure, it is more likely that the higher noise levels (far from the surface coils, and more susceptible to problems of acceleration-reconstruction, in this HCP data) explain the lower autocorrelation length in subcortical regions.
We also quantify the heterogeneity of autocorrelation length using the variance explained by the inter-subject and inter-node mean profiles (Figure 2.E). For time series extracted using the Power atlas, only 36% of inter-subject variance is explained, while for ICA200 time series up to 73% of inter-subject variance is explained, showing that the autocorrelation profiles are highly dependent on parcellation scheme. On the other hand, the variance explained of inter-node autocorrelation length is below 12% across all four parcellation schemes, suggesting the importance of addressing autocorrelation node-specifically.

[Figure 2 caption, panels C-E] Panel C plots the ACL, averaged over subjects of the HCP 100 unrelated-subjects package, vs. region size for ICA time series and three atlases, where ICA and one atlas (MMP) are surface-based. There is a strong relationship between ACL and ROI size. The "ROI size" for ICA is defined as the number of voxels in each component above an arbitrary threshold of 50. For MMP, the ROI size is defined as the number of vertices comprising an ROI. Panel D considers the Power atlas, which has identically sized spherical ROIs, plotting ACL vs. distance to a voxel in the thalamus. Cortical ROIs have systematically larger ACL than subcortical ROIs. Panel E shows variance explained by inter-subject and inter-node ACL profiles for the Gordon, ICA200, Power and Yeo atlases; the large variance explained by the inter-subject mean indicates substantial consistency in ACL over subjects.

Real-data and Monte Carlo Evaluations
Using inter-subject scrambling of 100 HCP subjects, parcellated with the Yeo atlas, to create null realisations with realistic correlation structure, we compare the different EDF methods in terms of FPR and the distribution of Fisher's Z-scores, both visually via QQ plots and by Kolmogorov-Smirnov (KS) statistics of observed Z-scores vs. a standard normal distribution. Results on node-specific autocorrelation corrections in Figure 3.A show that Naive and B35 have greatly inflated FPR, while BH and its approximation, Q47, successfully preserve the FPR at the 5% level, and the distributions of both methods closely follow the normal distribution (i.e. -log10(KS) = 2.54). Similar results for other FPR levels (1% and 10%) are illustrated in Figure S3.
Since the majority of methods in use are not node-specific but instead treat autocorrelation as a global effect (Fox et al., 2005; Zhang et al., 2008; Hale et al., 2016), we evaluate them under their homogeneity assumption, using simulated correlation matrices comprised of uncorrelated time series with strong autocorrelation measured from one particular HCP subject (see Section S3.5 for details). Figure 3.B compares the node-specific methods (B35 excepted, due to its poor FPR control) to the global methods AR1MCPS and G-Q47. Both global correction methods fail to achieve the desired FPR level and their KS statistics are also amongst the lowest; this poor performance is likely due to the simple AR correlation model used by each of these methods. In contrast, the node-specific methods (xDF, BH and Q47) remarkably improve the FPR and KS statistics. However, for uncorrelated time series, BH and G-Q47 outperform xDF. Similar results for other FPR levels (1% and 10%) are illustrated in Figure S4.
We also repeated the FPR and KS analysis for another set of simulated correlation matrices, this time with autocorrelation structures drawn from a different subject than above. Results show similar FPR and KS statistics, except for AR1MCPS, which now almost meets the FPR level. This clearly suggests that the performance of the global measures, especially AR1MCPS, is subject-dependent.
We further complement these validations with FPR and ROC analyses. Using the same simulation techniques, discussed in Section S3.1, we compare FPR levels for pair-wise uncorrelated time series. Figure 3.C illustrates the FPR of each method at α-level 5%, and suggests that the Naive correction (first column) can only maintain the desired FPR level when at least one of the time series is white; otherwise, the FPR can approach 50% in cases where both time series are highly autocorrelated.
The second and third columns of Figure 3.C show the FPR levels after the degrees of freedom are corrected via the BH and Q47 methods, where results show a remarkable improvement. Despite both methods having conservative FPR on short time series, both successfully maintain the FPR level for larger N. The FPR results for xDF suggest that for short time series the method fails to contain the FPR level, especially for highly autocorrelated time series; however, as N grows, the FPR approaches the nominal α-level, until N = 2000 where xDF has the closest FPR level. Finally, we evaluate the FPR of B35, where, for the majority of autocorrelation structures, the method fails to control the FPR regardless of sample size. For example, for time series with AR1-AR14 structure, the FPR is as conservative as 2%, while for AR14-AR20 the level exceeds 7%.

[Figure 3 caption] Panel A shows results using real data and inter-subject scrambling of the HCP 100 unrelated package with the Yeo atlas ROIs, comprising 235,500 distinct z-scores (see Figure S5 for the same results with other atlases). Left shows the QQ plot of Z-scores of each method, top right shows the -log10 KS statistics (larger is better, i.e. more similar to Gaussian), and bottom right the FPR, all of which show that Naive and B35 have very poor performance. Panel B depicts a similar evaluation, except a single ACF is used to simulate all time series with identical autocorrelation (see Section S3.5), again under the null; here we additionally consider two "global" correction methods that assume a common ACF between the nodes, G-Q47 and AR1MCPS. Here the Naive and the two global methods have poor false positive control. Panel C reports the FPR at nominal α-level 5% across five methods (columns) for identical (top row) and different (bottom row) ACFs, over a range of time series lengths. Naive (note different y-axis limits) and B35 have poor FPR control, while BH, Q47 and xDF all have good performance for long time series, with xDF having some inflation for the most severe autocorrelation structures with short time series.
Results presented in Figure 3 are for highly autocorrelated, yet uncorrelated, time series (ρ = 0). However, in rsFC it is the highly correlated time series that are of interest. This motivates us to investigate the accuracy of standard errors for ρ̂ for highly autocorrelated time series with non-zero cross-correlation, simulated following the model described in Section S3.1. We further investigate the percentage biases for B35, BH and Q47 in Figure 4.B-D, respectively. While the BH and Q47 corrections give mostly unbiased standard errors in the case of independence (ρ = 0), there is substantial bias as the correlation ρ grows, for both short and long time series. For example, for ρ = 0.5, the BH and Q47 corrections overestimate V(ρ̂) by more than 30%. The bias of the B35 standard errors shows a similar pattern, but with biases more sensitive to the particular autocorrelation structure. A notable finding from these ρ ≠ 0 results is for white data ("W-W", ρ_XX = ρ_YY; blue triangles): for the B35, BH and Q47 methods, this 'easy' case of no autocorrelation gives just as bad performance as severe autocorrelation. We identified the source of this problem, a confounding of autocorrelation with cross-correlation; see Appendix D for details.
Results from the Oracle simulation (Figure S2.A) also confirm the confounding of autocorrelation with cross-correlation, as Q47 and BH are both biased even when the true parameters are used, while xDF shows negligible bias across different autocorrelation structures and sample sizes.
We also use the simulations to evaluate ρ̂ standard error bias across the different regularisation methods (see Section 2.3). Figure S7 suggests that, despite similarities between the methods for low- and mid-range correlations, they differ for higher correlation coefficients, where unregularised, Tukey-tapered (M = √N) and truncated (M = N/5 lags) autocorrelation functions overestimate the variances, while shrinking and Tukey tapering with M = 2√N maintain the lowest biases. Although Tukey tapering with cut-off at √N lags and adaptive truncation appear to have very similar biases, we notice that adaptive truncation has less bias for short time series. Moreover, adaptive truncation is immune to the arbitrary choice of lag cut-off. We therefore use adaptive truncation as the ACF regularisation method for the remainder of this work.

[Figure 5 caption] From top to bottom, specificity, sensitivity and accuracy (sum of detections at non-null edges and non-detections at null edges) are shown. Specificity (i.e. FPR control) is good for xDF, BH, Q47 and G-Q47, and sensitivity increases with time series length; accuracy is best for xDF, closely followed by BH and Q47.
The FPR analyses presented in Figure 3.C concern only the null case of uncorrelated time series. To summarise performance in the presence of correlation, ρ > 0, we evaluate the sensitivity and specificity of each method via ROC analysis on simulated correlation matrices, discussed in Section S3.5, where the time series are highly dependent and autocorrelated. Figure 5 illustrates the sensitivity, specificity and accuracy measures for each method across three different sample sizes. For correlation matrices comprised of short and long sample sizes, xDF outperforms all other existing methods, delivering the highest sensitivity while controlling false positive rates. For shorter time series, B35 and AR1MCPS are runners-up to xDF.
AUC analyses showed virtually no difference between the methods for FPR < 10% (Fig. S6).

Figure 6.A suggests that, as expected, the z-scores of the functional connections either remained unchanged or were reduced due to unbiased estimation of the variance with the xDF correction. For example, the functional connection between nodes 37 and 94 (both series almost white; see Figure 1.E) experienced almost no change, z_xDF(37, 94) = 3.61 and z_Naive(37, 94) = 3.65, while the functional connection between nodes 13 and 25 (both series highly autocorrelated; see Figure 1.E) was reduced almost three-fold: z_xDF(13, 25) = 2.08, z_Naive(13, 25) = 6.02.

Effect of Autocorrelation Correction on Functional Connectivity
Naturally, such drastic changes in z-scores are also reflected in the p-values of statistical inferences for each connection. Figure 6.C illustrates these changes (orange dots) between the FDR-corrected p-values of the Naive correction (left y-axis) and the corresponding statistics of the xDF correction (x-axis), where, broadly speaking, a large number of connections with significant p-values no longer meet the 5% α-level.
Since the changes in z-scores due to xDF are spatially heterogeneous, both statistical and proportional thresholding methods are dramatically affected. In statistical thresholding (ST), after xDF correction, the FDR critical values (shown as dotted lines in Figure 6.A) increase slightly, from 2.064 to 2.14. With the use of xDF, 13.66% of the edges change from being marked FDR-significant to being non-significant; i.e. over 10% of the edges would be incorrectly selected with the Naive method. Similarly, proportional thresholding (PT) is also affected, since the cost-efficient densities (shown as solid lines in Figure 6.A; see Section S4.4 for more details on cost-efficient densities) decrease from 35% to 27.5%. These changes in cost-efficient density result in 16.61% false positive edges, meaning these edges were found to be significant merely due to the autocorrelation effect.
The same analysis on another HCP subject finds very similar changes in functional connectivity (Figure S9): in ST and PT, the critical values and cost-efficiencies were reduced by more than 50% and 26%, respectively, due to xDF correction. This results in 16% FP edges in ST and 19% FP edges in PT.
While Figure 6.A shows that there is a profound effect of xDF relative to no correction (Naive), it is of interest to see how xDF compares to an existing method that does attempt to correct for autocorrelation.
For this we compare xDF z-scores to BH z-scores (Figure 6.A); recall that the BH correction does not account for the cross-correlation Σ_XY and, due to confounding of cross-correlation and autocorrelation, can over-estimate ρ̂ standard errors. When the z-scores are low (corresponding to weak correlation) there is little difference between the approaches, while for stronger effects the difference between the two correction methods becomes clearer; see, for example, the z-scores for the edge between nodes 103 and 104 (green dot in Figure 6.A).

Functional connectivity changes reported so far were limited to single subjects. In Figure 6.D we show the impact of autocorrelation correction on the mean value of rsFC z-scores for edges included in proportional (left) and FDR-based statistical (right) thresholding. Reflecting what was found in simulation, autocorrelation correction reduces z-scores (reflecting Naive's underestimation of V(ρ̂)), but the BH method has appreciably smaller z-scores (attributable to the confounding problem). The dramatic impact of the correction method on each graph metric is quantified in Figure 7. We also evaluate changes in the global measures. Figure 7.C shows the changes in network density (left), where the density of the Naively corrected networks was significantly reduced after xDF and BH correction; however, there is a slight, yet significant, difference between the densities of the xDF-corrected and BH-corrected networks. Similarly, the global efficiency of the networks (Figure 7.C) is significantly reduced after accounting for autocorrelation. Figure 7.C also suggests that overestimation of variance due to the correlation-autocorrelation confounding may further reduce the local efficiency.
For the global measures, a similar pattern of changes is also found for CE-based proportionally thresholded networks, as the density (Figure 7.D) of xDF-corrected networks is significantly reduced. However, in spite of changes in weighted degree, the confounding effect may not affect the network densities. Finally, the global efficiency of xDF-corrected networks shows a significant reduction. Despite a small difference, the confounding effect has also reduced the global efficiency.
We repeat this analysis for the Gordon and the Power atlases. For the Power atlas (Figure S12), as with the Yeo atlas, betweenness centrality shows the highest resilience towards these changes. Finally, Figure S13 shows similar results for subjects parcellated using ICA200.

On real-data (non-null) rsFC we replicate the simulation findings, with Naive z-scores dramatically inflated relative to the xDF, BH and Q47 z-scores, which are smaller in magnitude. The differences between the methods, however, are node-specific, reflecting how xDF adjusts for autocorrelation in each node pair. We recommend that all rsFC analyses based on z-scores, whether thresholded arbitrarily or, say, using a mixture modelling approach (Bielczyk et al., 2018), use the xDF correction to obtain the most accurate inferences possible.
We show that graph analysis measures are dramatically impacted by the use of xDF, relative to either Naive or BH corrections. Broadly speaking, accounting for autocorrelation results in lower z-scores and lower rsFC densities. These heterogeneous changes alter the topological features of the functional connectome; however, the changes are not uniform across resting-state networks: in the HCP data, we find that nodal strengths and local efficiencies in parts of the sub-cortical regions show smaller changes compared to nodes from the frontoparietal and default mode networks, which are among the most affected areas. The pattern of changes suggests that the nodal degree and efficiency hubs are among the most affected. In contrast, results for betweenness centrality suggest no systematic pattern, with relatively smaller changes.
We provide a comprehensive review of the literature on autocorrelation corrections for the variance of the sample correlation, usually cast as estimation of the effective degrees of freedom. In the neuroimaging community this is sometimes referred to as the "Bartlett Correction Factor (BCF)", though it has been only defined as "the integral across time of the square of the autocorrelation function" (Dijk et al., 2010; Fox et al., 2005), with citation to a textbook, and used as a global correction over subjects and ROIs (Hale et al., 2016). We emphasise the importance of truncation or tapering of ACFs, and that such sums (not integrals) should reflect the symmetry of ACFs, doubling a sum that counts only positive lags, as well as computing a correction for each node pair.
We note the strong influence of ROI size on the strength of autocorrelation: at one extreme, voxel-level data have the weakest autocorrelation, which increases in strength with the size of the ROI. We also showed that, even for an ROI atlas with identically sized regions (i.e. the Power atlas), autocorrelation can vary substantially over the brain depending on region location. These factors could become a significant source of bias in z-score-based analyses of rsFC if not otherwise corrected. We stress that these findings hold for both volumetric and surface-based analyses (Figure 2.C).
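The mechanism behind this size effect (formalised in Appendix E) can be illustrated with a toy simulation; all settings below (an AR(1) shared signal, white voxel-specific noise, R = 50 voxels) are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
n, R, phi = 2000, 50, 0.6          # time points, voxels per ROI, signal AR(1)
sigma_s, sigma_v = 1.0, 2.0        # signal and voxel-noise standard deviations

# shared AR(1) "signal", scaled to unit stationary variance
s = np.empty(n)
s[0] = rng.standard_normal()
for t in range(1, n):
    s[t] = phi * s[t - 1] + np.sqrt(1 - phi**2) * rng.standard_normal()
s *= sigma_s

# voxel data: common signal plus white voxel-specific noise
Y = s[:, None] + sigma_v * rng.standard_normal((n, R))

def acf1(x):
    """Lag-1 sample autocorrelation."""
    x = x - x.mean()
    return np.dot(x[:-1], x[1:]) / np.dot(x, x)

voxel_lag1 = np.mean([acf1(Y[:, i]) for i in range(R)])  # single voxels
roi_lag1 = acf1(Y.mean(axis=1))                          # ROI average
# Averaging shrinks the white component's variance by R, so the ROI series
# inherits the signal's autocorrelation:
#   lag-1 ACF = sigma_s^2 * phi / (sigma_s^2 + sigma_v^2 / R)
# i.e. about 0.12 for a single voxel versus about 0.56 for this 50-voxel ROI.
```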
We do not wish, however, to say that all analyses of rsFC must be conducted in z-score units. Our derivation shows that Pearson's correlation is approximately unbiased for the correlation ρ in the data (Appendix B). Hence researchers need to consider whether an r-units or z-units analysis is most appropriate for their question; if z-units, computing the standard deviation with xDF is essential. We note that this assertion of the unbiasedness of correlation appears to be at odds with other recent works (Arbabshirani et al., 2014; Davey et al., 2013). This is not inconsistent: both of these works start their analysis with a pair of white noise variables with instantaneous correlation ρ and then assess the impact of inducing autocorrelation on those variables. In particular, they note that if different autocorrelation structures are induced, then a bias in the estimate of ρ can arise. Instead, we study the auto- and cross-correlation of the presented data (X, Y), without reference to an (unobserved) autocorrelation-free signal; in this setting Pearson's correlation is (approximately) unbiased regardless of differential autocorrelation. We believe our empirical approach is more appropriate, as the BOLD signal is not white, and hence inference on the correlation of the presented (and not latent white) signals is of primary interest.
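This distinction can be checked with a small simulation, assuming (for illustration only) identical AR(1) structure in both series: the sample correlation stays centred on the instantaneous correlation of the presented data, while its sampling variance is inflated well beyond the independent-samples value.

```python
import numpy as np

rng = np.random.default_rng(0)
n, phi, rho, n_sim, burn = 400, 0.8, 0.5, 2000, 100

# innovations with instantaneous correlation rho; the *same* AR(1) filter
# is applied to both, so the presented series also have correlation rho
e = rng.standard_normal((n + burn, n_sim, 2))
e[..., 1] = rho * e[..., 0] + np.sqrt(1 - rho**2) * e[..., 1]
x = np.empty_like(e)
x[0] = e[0]
for t in range(1, n + burn):
    x[t] = phi * x[t - 1] + e[t]
x = x[burn:]

xc = x[..., 0] - x[..., 0].mean(axis=0)
yc = x[..., 1] - x[..., 1].mean(axis=0)
r = (xc * yc).sum(axis=0) / np.sqrt((xc**2).sum(axis=0) * (yc**2).sum(axis=0))

naive_var = (1 - rho**2) ** 2 / n   # sampling variance under independence
# r.mean() stays close to rho (approximately unbiased), while r.var() is
# several times naive_var (standard error badly biased if ignored)
```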
We note that other authors have proposed pre-whitening as a solution to improve inference on correlation (Bright et al., 2017), and when conducting system identification of the cross-correlation function ρ_XY,k it is recommended to first whiten each variable (Priestley, 1983). However, we still see the value of the no-whitening-plus-xDF approach. First, pre-whitening requires accurate estimates of Σ_X and Σ_Y, and careful evaluation is required to establish that the FPR is valid and the standard error unbiased over a range of settings. Second, pre-whitening changes the definition of ρ, from the instantaneous correlation of the observed time series to that of the (unobserved, latent) white time series. And, finally, perhaps most important for moving-window time-varying rsFC, pre-whitening mixes data from distant time points with neighbouring ones, challenging the interpretation of individual time points as pertaining to a precise moment in time.
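For concreteness, a minimal pre-whitening sketch (illustrative only; it assumes a known AR(1) covariance, whereas in practice Σ_X must be estimated, which is precisely the first concern above):

```python
import numpy as np

def prewhiten_ar1(x, phi):
    """Whiten a series under an assumed-known AR(1) autocorrelation phi.

    In practice Sigma_X must be estimated, and estimation error is exactly
    the first concern raised above; phi is treated as known here purely
    for illustration."""
    n = len(x)
    lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    sigma = phi ** lags                 # AR(1) autocorrelation matrix
    L = np.linalg.cholesky(sigma)
    return np.linalg.solve(L, x)        # L^{-1} x has identity covariance
```

For this AR(1) case L⁻¹ is bidiagonal, reducing to the familiar x_t − φ x_{t−1} differencing (up to scale); for a general estimated Σ_X, however, L⁻¹ combines many time points, which is the moving-window concern noted above.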
One immediate extension of the current work is to adapt the xDF estimator to partial correlations, which have recently drawn substantial attention after being shown to be effective in resting-state analysis (Marrelec et al., 2006; Smith et al., 2011). Further, recent studies have shown that the autocorrelation of fMRI BOLD time series is remarkably sensitive to the sampling rate (Bollmann et al., 2018); evaluating the proposed methods at different sampling rates would therefore be useful. We have not attempted to investigate how the changes in z-scores we describe would affect between-group comparisons, but this would be a useful extension, as Váša et al. (2018) did for schizophrenia in the context of wavelet EDF.
Finally, it is important to note that application of xDF is not confined to the rsFC of fMRI time series; it can also be used in other modalities, such as EEG and MEG, both of which have been shown to suffer from dependence amongst their data points.

Software Availability and Reproducibility
Analyses presented in this paper were performed in MATLAB 2015b and R 3.1.0. Graph theoretical analyses were done using the Brain Connectivity Toolbox (accessed: 15/1/2017) (Rubinov and Sporns, 2010).

A Results for Joint Distribution of Time Series X and Y
Here we provide basic results, required for the next appendix, on moments of inner and cross products of X and Y.
Theorem 1 (Covariance of Quadratic Forms of a Bivariate Gaussian Distribution). For fixed matrices A and B, if G is a random vector such that G ∼ N(0, Φ), then

C(GᵀAG, GᵀBG) = 2 tr(ΦAΦB).

Proof. The result follows from the definition of covariance, the expectation of a product of quadratic forms for Gaussian variates (Petersen et al., 2008), E(GᵀAG GᵀBG) = 2 tr(AΦBΦ) + tr(AΦ) tr(BΦ), and the expectation of quadratic forms, E(GᵀAG) = tr(AΦ) and E(GᵀBG) = tr(BΦ).
The Gaussian assumption can be relaxed, but then an additional term arises to account for departures from Gaussian kurtosis.
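Theorem 1 is easy to check numerically; the sketch below (illustrative, with an arbitrary well-conditioned Φ and fixed symmetric A and B) compares the closed form 2 tr(ΦAΦB) against a Monte Carlo estimate over simulated draws of G:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4

M = rng.standard_normal((p, p))
Phi = M @ M.T + p * np.eye(p)          # an arbitrary well-conditioned covariance
A = np.eye(p)                          # two fixed symmetric matrices
B = np.diag(np.arange(1.0, p + 1))

theory = 2 * np.trace(Phi @ A @ Phi @ B)

# Monte Carlo: covariance of the two quadratic forms over draws of G
G = rng.multivariate_normal(np.zeros(p), Phi, size=200_000)
qa = np.einsum('ij,jk,ik->i', G, A, G)   # G'AG for each draw
qb = np.einsum('ij,jk,ik->i', G, B, G)   # G'BG for each draw
mc = np.cov(qa, qb)[0, 1]
# mc agrees with theory to within Monte Carlo error
```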
The inner products of X and Y can now be represented in terms of quadratic forms of the stacked vector G = (Xᵀ, Yᵀ)ᵀ, such that XᵀX = GᵀA_XX G, YᵀY = GᵀA_YY G, and XᵀY = GᵀA_XY G.

B xDF: Variance of the Sample Correlation Coefficient for Arbitrary Dependence
For mean-zero, length-N random vectors X and Y with (N × N) variance matrices Σ_X and Σ_Y and cross-covariance Σ_XY, we develop the variance of the sample correlation. Following Lehmann (1999a) and Hunter (2014), we derive an approximation for the sampling variance of Pearson's correlation. Starting with the 3-dimensional sufficient statistic W = (XᵀX, YᵀY, XᵀY)ᵀ, note that the function f(W) = W_3 / √(W_1 W_2) generates the correlation coefficient ρ̂. Then the first-order expansion of f about E(W) requires the gradient ∇f(E(W)) and the covariance matrix V(W), whose elements follow from Theorem 1:

V(XᵀX) = 2 tr(Σ_X²),
V(YᵀY) = 2 tr(Σ_Y²),
C(XᵀX, YᵀY) = 2 tr(Σ_YX Σ_XY),
C(XᵀX, XᵀY) = tr(Σ_X Σ_XY) + tr(Σ_YX Σ_X),
C(YᵀY, XᵀY) = tr(Σ_XY Σ_Y) + tr(Σ_Y Σ_YX),
V(XᵀY) = tr(Σ_X Σ_Y) + tr(Σ_XY²).
Based on these expressions, evaluating the matrix product (∇f(E(W)))ᵀ V(W) ∇f(E(W)) gives the result in Eq. 1. While very similar, this derivation is not a standard delta-method result, as we do not have independent observations.
While the cross-covariance matrices are asymmetric, Σ_XY = Σ_YXᵀ, and hence tr(Σ_XY²) = tr(Σ_YX²); this trace can be re-written as a vector operation, and similarly for the other terms of Eq. 1 (see Eq. 2).
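The flavour of this rewriting can be sketched for the tr(Σ_X Σ_Y) term: assuming stationarity, so that the covariance matrices are Toeplitz, the dense trace collapses to a lag-weighted vector operation, tr(Σ_X Σ_Y) = N γ_X,0 γ_Y,0 + 2 Σ_{k=1}^{N−1} (N−k) γ_X,k γ_Y,k. The autocovariance sequences below are arbitrary illustrative choices:

```python
import numpy as np

n = 50
k = np.arange(n)
gx = 0.8 ** k                     # autocovariances of X (AR(1)-like, arbitrary)
gy = 0.5 ** k                     # autocovariances of Y

lags = np.abs(np.subtract.outer(k, k))
Sx, Sy = gx[lags], gy[lags]       # Toeplitz covariance matrices

dense = np.trace(Sx @ Sy)
# vector form: lag 0 counted once, positive lags doubled (ACF symmetry),
# each lag weighted by the number of index pairs at that lag, N - k
vector = n * gx[0] * gy[0] + 2 * np.sum((n - k[1:]) * gx[1:] * gy[1:])
# dense and vector agree exactly
```

The vector form avoids building and multiplying two N × N matrices per node pair, which matters when the correction is computed for every edge of a connectome.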

D Confounding of Serial and Cross-correlation Estimates
The majority of the EDF estimators discussed have the term tr(Σ_X Σ_Y), which depends on the product of autocorrelation functions, ρ_XX,k ρ_YY,k (see Appendix C). Some unexpected results, such as poor variance estimation for the seemingly easy case of time series with no autocorrelation (see Section 3.2), were found to be caused by confounding between serial and cross-correlation. Observe that for two white but dependent time series (ρ_XX,k = ρ_YY,k = 0 for k > 0, but ρ_XY,0 ≠ 0), the product of the sample autocorrelation functions, ρ̂_XX,k ρ̂_YY,k, is proportional to

( Σ_{1≤t≤N−k} x_t x_{t+k} ) ( Σ_{1≤t′≤N−k} y_{t′} y_{t′+k} ) = Σ_{1≤t≤N−k} x_t y_t x_{t+k} y_{t+k} + Σ_{1≤t≠t′≤N−k} x_t y_{t′} x_{t+k} y_{t′+k}.   (17)

In this last expression note that each summand of the first sum has an expected value of ρ², while the second sum has an expected value of zero. As a result, even when there is no (or very light) serial correlation, methods dependent on tr(Σ_X Σ_Y) or ρ_XX,k ρ_YY,k alone can be substantially biased by non-zero cross-correlation. This includes BH and all its variations listed in Pyper and Peterman (1998). Our xDF, on the other hand, is immune to this effect thanks to the cancelling cross-covariance terms in Eq. 2.
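The bias implied by Eq. 17 is easy to reproduce numerically (the settings below are arbitrary): for white but cross-correlated series, the mean lag-k sample-ACF product sits near ρ²/N rather than zero.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_sim, lag = 200, 4000, 1

def sample_acf(x, k):
    """Sample autocorrelation at lag k."""
    x = x - x.mean()
    return np.dot(x[:-k], x[k:]) / np.dot(x, x)

def mean_acf_product(rho):
    """Mean of the lag-k sample-ACF product over simulated white pairs
    with instantaneous correlation rho."""
    prods = np.empty(n_sim)
    for s in range(n_sim):
        e = rng.standard_normal((n, 2))
        x = e[:, 0]
        y = rho * e[:, 0] + np.sqrt(1 - rho**2) * e[:, 1]
        prods[s] = sample_acf(x, lag) * sample_acf(y, lag)
    return prods.mean()

biased = mean_acf_product(0.8)    # near rho^2 / n, despite both series white
unbiased = mean_acf_product(0.0)  # near zero when there is no cross-correlation
```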

E Autocorrelation in Regions of Interest
We observed substantial differences in the severity of autocorrelation in voxel-wise data as opposed to data derived from regions of interest (ROIs). This section proposes a model that explains this effect.
Suppose that an ROI contains R voxels, and that the data for time t and voxel i = 1, ..., R can be modelled as

Y_ti = S_t + V_ti,

where S_t is a common "signal" shared across all the voxels within the ROI (i.e. an ROI-specific global signal) and V_ti is the voxel-specific component of the data. Assuming that the two components are uncorrelated, the autocorrelation of a single voxel's time series is

ρ_YY,k = (σ_S² ρ_SS,k + σ_V² ρ_VV,k) / (σ_S² + σ_V²),

where σ_V² and σ_S² are the variances of V and S, respectively, and ρ_VV,k and ρ_SS,k are the autocorrelation coefficients of V and S at lag k, respectively. Now, for the ROI-averaged time series, Ȳ_t = Σ_{i=1}^R Y_ti / R, the autocorrelation is: