Statistical tests for non-independent partitions of large autocorrelated datasets

Large sets of autocorrelated data are common in fields such as remote sensing and genomics. For example, remote sensing can produce maps of information for millions of pixels, and the information from nearby pixels will likely be spatially autocorrelated. Although there are well-established statistical methods for testing hypotheses using autocorrelated data, these methods become computationally impractical for large datasets. • The method developed here makes it feasible to perform F-tests, likelihood ratio tests, and t-tests for large autocorrelated datasets. The method involves subsetting the dataset into partitions, analyzing each partition separately, and then combining the separate tests to give an overall test. • The separate statistical tests on partitions are non-independent, because the points in different partitions are not independent. Therefore, combining separate analyses of partitions requires accounting for the non-independence of the test statistics among partitions. • The methods can be applied to a wide range of data, including not only purely spatial data but also spatiotemporal data. For spatiotemporal data, it is possible to estimate coefficients from time-series models at different spatial locations and then analyze the spatial distribution of the estimates. The spatial analysis can be simplified by estimating spatial autocorrelation directly from the spatial autocorrelation among time series.


Specifications
Subject area: Environmental Science
More specific subject area: Statistics
Method name: Method for performing statistical tests using non-independent data partitions
Name and reference of original method: These are standard statistical methods documented in most statistical textbooks, for example, Neter et al. [12] and Judge et al. [10]
Resource availability: The methods are implemented in the package remotePARTS in the R programming language. This is available at https://github.com/morrowcj/remotePARTS

Overview
The method developed here uses a conceptually simple approach to perform statistical estimation and hypothesis tests on large autocorrelated datasets [1,4,8,11,13,14]. The method was developed specifically for remote-sensing datasets, so it will be described in this context, although it could be applied to other types of large datasets, such as genome samples from different subjects in which base-pair similarity is likely to be greater for base pairs located nearby on a chromosome due to limited recombination. The approach divides the dataset into non-overlapping partitions, and statistical hypothesis tests are conducted on each partition. The results of these tests are not independent, because the data points from different partitions are not independent. Nonetheless, it is possible to calculate the correlation between the statistical test scores from different partitions and combine the scores for an overall test. The selection of sampling schemes for partitions is arbitrary, but here we focus on random sampling to produce partitions. This approach has the advantage of guaranteeing that each partition is representative of the entire dataset. The method is central to the analysis of spatiotemporal data given in Ives et al. [9].
We apply this approach to three common tests used for regression and ANOVA on Gaussian data [12]: the F-test, the likelihood ratio test (LRT), and the t-test. The F-test and LRT involve hypotheses comparing a full statistical model with a reduced model, thereby allowing tests of hypotheses on multiple coefficients in a model. The t-test is used for inference about individual coefficients. These tests can be performed for datasets with autocorrelated errors using generalized least squares (GLS) regression [10]. In GLS, a correlation matrix is specified to describe the autocorrelation of errors. We apply the approach to both purely spatial and spatiotemporal data.

Mathematical derivations
The specific problem addressed here involves the regression of a response (dependent) variable y_l on p predictor (independent) variables for l = 1, 2, ..., N locations in a spatial map. The p predictor variables for location l are contained in a 1 x p vector X_l; at the minimum, X_l contains the value 1 to correspond to an intercept. The regression model for location l is

y_l = X_l B + γ_l   (1)

where γ_l is the remaining random error, and B is a p x 1 vector containing regression coefficients for the p variables in X_l. We assume that the parameters B are the same for all locations l. We further assume that the correlation between γ_l and γ_k for locations l and k depends on the distance between them. For example, if the correlation diminishes exponentially with distance, then cor[γ_l, γ_k] = exp(-d_lk/r), where d_lk is the distance between locations l and k, and the parameter r gives the "range" of the correlation, with larger values of r giving correlations over greater spatial distances. Thus, the covariance matrix for the errors Γ = (γ_1, ..., γ_N)′ (where ′ denotes transpose) is the N x N covariance matrix V = σ²C, where C is the correlation matrix containing the values of cor[γ_l, γ_k] for all pairs of locations l and k, and σ² is a common variance.
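As a concrete illustration of this setup, the following Python sketch (ours, not part of remotePARTS) builds the exponential correlation matrix with elements exp(-d_lk/r) for a small grid of pixel coordinates; the function and variable names are our own.

```python
import numpy as np

def exp_correlation(coords, r):
    """C with elements exp(-d_lk / r) from an (N, 2) array of coordinates."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    return np.exp(-d / r)

# a 5 x 5 pixel map with coordinates scaled to the unit square
side = 5
xy = np.array([(i / side, j / side) for i in range(side) for j in range(side)])
C = exp_correlation(xy, r=0.1)   # r = 10% of the map scale
```

The same construction extends to any distance-decay function v(d_lk) by swapping the final line of the helper.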

GLS analysis
A single dataset, or a subset of a larger dataset, in which errors are autocorrelated can be analyzed by GLS when V is known [10]. Thus, let Y denote an N x 1 vector of response variables, and X denote an N x p matrix of predictor variables including a column of ones for the intercept. The p x 1 vector B̂ containing the estimates of the regression coefficients B is

B̂ = (X′V⁻¹X)⁻¹ X′V⁻¹ Y   (2)

The sum-of-squared error is then

SSE = (Y - XB̂)′ V⁻¹ (Y - XB̂)   (3)

An F-test is based on the sum-of-squared error of the full model (SSE) and a reduced model (SSE_0) defined by the hypothesis to be tested; for example, the null hypothesis that some of the predictor variables have no effect on the response variable is tested using the reduced model with these predictor variables removed. Under the null hypothesis, and letting SSR_1-0 = SSR_1 - SSR_0 = SSE_0 - SSE denote the difference in sum-of-squared regression between full and reduced models, we have

(SSR_1-0 / df_1) / (SSE / df_2) ~ F(df_1, df_2)   (4)

Thus, the ratio of SSR_1-0/df_1 to SSE/df_2 follows an F distribution, where df_1 is the degrees of freedom equaling the difference in the number of predictor variables between full and reduced models, and df_2 = N - p is the degrees of freedom for the sum-of-squared error in the full model. Strategic selection of the reduced model makes it possible to test a wide range of hypotheses, not only about whether one or more of the coefficients in B are different from zero, but also whether some or all of the coefficients are equal, by generating orthogonal contrasts [12].
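The estimator (2), the sum-of-squared error (3), and the F-ratio (4) can be sketched in a few lines; the toy example below is ours (with V = I for simplicity, so GLS reduces to ordinary least squares), not the remotePARTS implementation.

```python
import numpy as np
from numpy.linalg import solve

def gls_fit(X, y, V):
    """GLS estimates (Eq. 2) and sum-of-squared error (Eq. 3) for known V."""
    Vi_X = solve(V, X)
    B = solve(X.T @ Vi_X, Vi_X.T @ y)
    resid = y - X @ B
    sse = resid @ solve(V, resid)
    return B, sse

rng = np.random.default_rng(0)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N)])  # intercept + 1 predictor
V = np.eye(N)                       # no autocorrelation in this toy example
y = X @ np.array([1.0, 0.5]) + rng.normal(size=N)

B, sse = gls_fit(X, y, V)           # full model
B0, sse0 = gls_fit(X[:, :1], y, V)  # reduced model: intercept only
df1, df2 = 1, N - 2
F = ((sse0 - sse) / df1) / (sse / df2)   # F-statistic of Eq. (4)
```

Using `solve` rather than explicitly inverting V is the usual numerical practice; for large N, V's Cholesky factor would be computed once and reused.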
For GLS, the LRT is closely related to the F-test. Specifically, the log-likelihood ratio between the full and reduced models is SSR_1-0 for the case when the full and reduced models have the same covariance matrix V. The LRT uses the asymptotic approximation

2 SSR_1-0 ~ χ²_df1   (5)

Thus, twice the difference between the SSRs from the full and reduced models is χ² distributed with df_1 degrees of freedom. The distribution for the F-test (Eq. 4) will converge to the distribution for the LRT when df_2 approaches infinity. For large datasets, df_2 will be large, and the F-test and LRT will give very similar results.
Like the LRT, the t-test is closely related to the F-test. Letting MSE = SSE/df_2 be the mean squared error, the estimated covariance matrix for the estimates B̂ is

côv[B̂] = MSE (X′V⁻¹X)⁻¹   (6)

The standard error SE[b̂_h] of the estimate of a coefficient b_h is the square root of the h-th diagonal element of this matrix, and a test of the null hypothesis that the coefficient is zero is performed with the t-distribution:

b̂_h / SE[b̂_h] ~ t_df2   (7)

To simplify the following developments, it is useful to recast the GLS model above by transforming variables. Specifically, let D be the matrix such that DVD′ = I. Transforming the variables as Z = DY, U = DX, and A = DΓ recasts the GLS model (1) as Z = UB + A, in which the errors A are independent with unit variance, and the estimator of the coefficients becomes

B̂ = (U′U)⁻¹ U′ Z   (8)
Further, let H denote the hat matrix for U, defined by

H = U(U′U)⁻¹U′   (11)

where, analogously, H_0 is the hat matrix for the reduced model derived from U_0 = DX_0, and X_0 is the matrix of predictor variables in the reduced model. In these terms, SSE = A′(I - H)A and, under the null hypothesis, SSR_1-0 = A′(H - H_0)A; that is, both sums of squares are quadratic forms in the transformed errors.
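The transformation D and the hat matrix H can be obtained from a Cholesky factorization of V. The snippet below (our sketch, with an arbitrary positive-definite V) checks numerically that DVD′ = I and that H from Eq. (11) is idempotent with trace p.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 50, 2
X = np.column_stack([np.ones(N), rng.normal(size=N)])

M = rng.normal(size=(N, N))
V = M @ M.T + N * np.eye(N)   # an arbitrary positive-definite covariance

L = np.linalg.cholesky(V)     # V = L L'
D = np.linalg.inv(L)          # so that D V D' = I
U = D @ X                     # transformed predictors
H = U @ np.linalg.solve(U.T @ U, U.T)   # hat matrix of Eq. (11)
```

In practice one would apply D via triangular solves rather than forming the inverse explicitly; the explicit inverse is used here only to make the identity DVD′ = I easy to verify.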
Correlations among partitions for SSR_1-0, SSE, and B̂

The primary computational burden of GLS is inverting V (or computing its Cholesky decomposition). A map with 10⁶ pixels would require inverting a 10⁶ x 10⁶ matrix V, and the computation time for this inversion scales roughly with N³. If a map is partitioned into n_p subsets of size m and each subset is analyzed separately, then the computational burden will scale linearly with N for a fixed subset size m. Using the computational advantage of analyzing partitions, the method below makes it feasible to combine the results from the n_p separate analyses.
For the F-test, the method requires calculating the correlation between sums-of-squares computed for the n_p different subsets. Specifically, it is necessary to calculate the correlation between SSR_i and SSR_j, and between SSE_i and SSE_j; for notational convenience, SSR_i denotes SSR_1-0 (the difference between the SSRs for the full and reduced models) for partition i, and SSE_i denotes the SSE of the full model for partition i. Further, let Z_i, U_i, and A_i denote the transformed response and predictor variables, and the transformed errors, for partition i. It is possible to show that, for any m x m matrices S_i and S_j,

cov[A_i′S_iA_i, A_j′S_jA_j] = vec(S_i)′ cov[vec(A_iA_i′), vec(A_jA_j′)] vec(S_j)   (12)

where vec is the vec operator that stacks the columns of a matrix on top of each other to form a vector. The matrix cov[vec(A_iA_i′), vec(A_jA_j′)] can be expressed, using the Kronecker product ⊗, in terms of the matrix V_ij = σ²C_ij containing the covariances between errors γ_i,l and γ_j,k from partitions i and j. Writing W_ij = D_iV_ijD_j′ for the cross-covariance of the transformed errors A_i and A_j, and applying Eq. (12) with the quadratic forms for SSR and SSE, gives

cor[SSR_i, SSR_j] = tr((H_i - H_0,i) W_ij (H_j - H_0,j) W_ij′) / df_1   (13)

cor[SSE_i, SSE_j] = tr((I - H_i) W_ij (I - H_j) W_ij′) / sqrt(df_2,i df_2,j)   (14)

where the appearance of df_1 arises by noting that var[SSR_i] = 2 df_1; Eq. (13) was simplified using an empirically confirmed identity involving the vec operator and Kronecker products. The LRT depends on SSR_1-0, and therefore the method requires calculating the correlations between values of SSR_i from the n_p partitions, as is already given for the F-test. For t-tests on the coefficient values, it is necessary to calculate the correlations between the estimators of the coefficients given by Eq. (8), which are obtained from

cov[B̂_i, B̂_j] = (U_i′U_i)⁻¹ U_i′ W_ij U_j (U_j′U_j)⁻¹   (15)

where B̂_i is the estimator of the coefficients from partition i.
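The trace formula for cor[SSE_i, SSE_j] (Eq. 14) can be checked by simulation. The sketch below is our own construction, using an artificially simple cross-covariance W_ij = ρI between the transformed errors of two partitions, and compares the formula with the empirical correlation over many simulated error vectors.

```python
import numpy as np

def hat(X):
    """Ordinary hat matrix H = X (X'X)^{-1} X'."""
    return X @ np.linalg.solve(X.T @ X, X.T)

rng = np.random.default_rng(2)
m, p = 15, 2
Hi = hat(np.column_stack([np.ones(m), rng.normal(size=m)]))
Hj = hat(np.column_stack([np.ones(m), rng.normal(size=m)]))

rho = 0.6
W = rho * np.eye(m)   # assumed cross-covariance of transformed errors
df2 = m - p
theory = np.trace((np.eye(m) - Hi) @ W @ (np.eye(m) - Hj) @ W.T) / df2

# simulate pairs (A_i, A_j) with unit variance and cross-covariance W
reps = 40000
Ai = rng.normal(size=(reps, m))
Aj = rho * Ai + np.sqrt(1 - rho**2) * rng.normal(size=(reps, m))
sse_i = np.einsum('ri,ij,rj->r', Ai, np.eye(m) - Hi, Ai)
sse_j = np.einsum('ri,ij,rj->r', Aj, np.eye(m) - Hj, Aj)
empirical = np.corrcoef(sse_i, sse_j)[0, 1]
```

In an actual analysis W_ij would come from the spatial model (W_ij = D_iV_ijD_j′); the identity structure here is only to keep the joint distribution easy to simulate.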

Combining tests from the partitions
From the values of SSR i and SSE i calculated for each partition, i = 1, 2, ..., n p , and the correlations between SSR i and SSR j , and between SSE i and SSE j , it is possible to compute an overall F -test for the data. The procedure for the LRT is similar to that for the F -test, and below they are presented together. The procedure for the t -test is somewhat different and is described after the F -test and LRT.
For a given partition i, 2 SSR_i follows a χ² distribution with df_1 degrees of freedom. A χ² distribution with df_1 degrees of freedom is the sum of df_1 squared Gaussian variables with mean 0 and variance 1. Thus, SSR_i can be expressed as a sum of df_1 squared standard Gaussian variables G_i,μ (divided by 2). The overall test statistic depends on the sum SSR = Σ_i SSR_i from all partitions i = 1, 2, ..., n_p. Let G denote the (n_p df_1) x 1 vector of the values G_i,μ (μ = 1, ..., df_1) from all partitions. By construction (Eqs. 8-11), G_i,μ and G_i,ν (μ ≠ ν) within the same partition i are independent. To satisfy the identity in Eq. (13) for the correlations among the SSR_i, the correlations between G_i,μ and G_j,ν from different partitions i and j are sqrt(cor[SSR_i, SSR_j]) for μ = ν and zero otherwise, because the correlation between the squares of two standard Gaussian variables with correlation ρ is ρ². Letting P denote the correlation matrix of G with these elements, 2 SSR is distributed as G′PG for a vector G of independent standard Gaussian variables; that is, it follows a quadratic Gaussian distribution. The probability density function of this quadratic Gaussian can be computed directly [5,7] to give the probability of G′PG being greater than an observed value of 2 SSR, which produces the P-value of the LRT.
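The tail probability of the quadratic Gaussian G′PG can be obtained exactly by the methods cited in [5,7]; the Python sketch below (all names ours) instead uses simple Monte Carlo over the eigenvalues of P, which is enough to illustrate the computation.

```python
import numpy as np

def quadratic_form_pvalue(P, observed, reps=200000, seed=0):
    """P(G'PG >= observed) for standard Gaussian G, via eigenvalues of P."""
    lam = np.linalg.eigvalsh(P)                     # G'PG =d sum(lam_k * chi2_1)
    rng = np.random.default_rng(seed)
    draws = rng.chisquare(1, size=(reps, len(lam))) @ lam
    return np.mean(draws >= observed)

# toy example: 2 partitions, df1 = 1
p_indep = quadratic_form_pvalue(np.eye(2), 6.0)     # reduces to a chi2_2 tail
P = np.array([[1.0, 0.5], [0.5, 1.0]])              # cross-partition correlation
p_corr = quadratic_form_pvalue(P, 6.0)
```

Positive cross-partition correlation spreads the eigenvalues of P, fattening the upper tail; ignoring it (using the identity matrix) therefore understates the P-value.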
The F-test depends on both SSR and SSE, where SSE is defined like SSR as SSE = Σ_i SSE_i, and each SSE_i has df_2,i degrees of freedom. Because partitions may differ in size, df_2,i may differ among partitions. Because the values of df_2,i will be large (certainly > 100), the values of SSE_i/df_2,i will be approximately Gaussian distributed with mean 1 and variance 2/df_2,i. The correlation between SSE_i and SSE_j is given by Eq. (14). Thus, the F-score is approximated as the quadratic Gaussian distribution for SSR divided by the (correlated) Gaussian distribution for SSE. There is no closed-form expression for this distribution, and therefore it is obtained by simulating a large number (e.g., 10⁵) of values to generate the approximate (parametric bootstrapped) test distribution.
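The parametric bootstrap for the F-score can be sketched as follows. For brevity this toy version (ours) uses a single partition, so the numerator is exactly χ² and SSE/df_2 is drawn from its Gaussian approximation; with several partitions, correlated draws for the numerator and denominator would replace the two independent draws.

```python
import numpy as np

rng = np.random.default_rng(3)
df1, df2 = 2, 500
reps = 200000

ssr = rng.chisquare(df1, size=reps)                          # SSR under H0
sse_over_df2 = rng.normal(1.0, np.sqrt(2 / df2), size=reps)  # SSE/df2, approx Gaussian
f_null = (ssr / df1) / sse_over_df2                          # bootstrapped null F-scores

observed_f = 4.0
p_boot = np.mean(f_null >= observed_f)                       # bootstrap P-value
```

With a single partition the bootstrap distribution should closely match the exact F(df_1, df_2) distribution, which provides a convenient sanity check.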
For the t-test, the estimator b̂_h,i of the coefficient b_h from partition i follows a Gaussian distribution, and the test statistic is the mean of the b̂_h,i from all partitions divided by its standard error; Eq. (15) gives the correlation between b̂_h,i and b̂_h,j from partitions i and j, which is needed to compute the variance of the mean. The denominator involves the pooled error sum of squares, which is the sum of n_p random variables each having a χ² distribution with df_2,i degrees of freedom. From this expression, the test statistic is approximately distributed as a t-distribution with Σ_i df_2,i degrees of freedom; for large degrees of freedom df_2,i, this will approach a Gaussian distribution with mean zero and variance one.
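Combining the per-partition estimates into a single t-statistic amounts to averaging the b̂_h,i and computing the standard error of the average from their full covariance matrix, whose off-diagonal elements come from Eq. (15). A sketch with made-up numbers:

```python
import numpy as np

def combined_t(bhat, cov_bhat):
    """t-statistic for the mean of correlated per-partition estimates."""
    n_p = len(bhat)
    var_mean = cov_bhat.sum() / n_p**2   # var of the average of the estimates
    return bhat.mean() / np.sqrt(var_mean)

# hypothetical estimates from 4 partitions and an assumed covariance matrix;
# the off-diagonal terms play the role of Eq. (15)
bhat = np.array([0.12, 0.10, 0.15, 0.11])
cov_bhat = 0.002 * (0.6 * np.eye(4) + 0.4 * np.ones((4, 4)))
t_stat = combined_t(bhat, cov_bhat)
```

With these numbers, ignoring the between-partition correlations (setting the off-diagonal terms to zero) would give roughly 5.4 instead, overstating significance; the positive correlations inflate the variance of the average.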

Spatiotemporal analyses
The spatiotemporal analyses follow the approach presented in Ives et al. [9] for analyzing time trends in remote-sensing data. The approach involves first fitting a time-series model to the time series in each pixel on a map and obtaining the estimate of the time trend. As described below, the correlations between the residuals obtained from the fitted time-series model approximate the spatial autocorrelation of the estimated coefficients of the time trends. Therefore, the spatial autocorrelation matrix required for the GLS spatial analysis can be estimated before the GLS analysis is performed. Although this approach can be used for different time-series models, here we focus on two: least-squares (LS) regression, and regression with AR(1) errors estimated using restricted maximum likelihood (REML). The former is useful because it allows analytical solutions, whereas the latter has better statistical properties and is therefore preferentially used for the analyses in Ives et al. [9].
To explain the approach, we use a specific spatiotemporal model, and we also use this model in the validation. The model for the time series within pixel l is

z_l(t) = a_l + c_l t + ε_l(t),  ε_l(t) = β_l ε_l(t - 1) + δ_l(t)   (18)

Here, z_l(t) is the value of interest in pixel l at time t (t = 1, 2, ..., T), a_l is the intercept, and c_l is the time trend. The random errors ε_l(t) follow a stationary first-order Gaussian autoregressive process with mean zero, generated from the Gaussian random variable δ_l(t) that has mean zero and variance σ², with values independent through time so that E[δ_l(t)δ_l(s)] = 0 for s ≠ t. Thus, the vector (ε_l(1), ..., ε_l(T))′ has distribution N(0, σ²/(1 - β_l²) Σ_l), where the elements of the correlation matrix Σ_l are cor[ε_l(t), ε_l(s)] = β_l^|t-s| for all t and s. To include spatial autocorrelation, we assume that the Gaussian random variables δ_l(t) and δ_k(t) at the same time t from pixels l and k are correlated, with parameter cor_δ = cor[δ_l(t), δ_k(t)], but values of δ_l(t) and δ_k(s) are independent when s ≠ t.
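A quick simulation check (ours) of the error model in Eq. (18): when β_l = β_k, the contemporaneous correlation of the AR(1) errors equals cor_δ, which is the basis for estimating spatial autocorrelation from time-series residuals.

```python
import numpy as np

rng = np.random.default_rng(4)
T, burn = 100000, 500
beta, cor_delta = 0.2, 0.5

# innovations delta for two pixels, correlated at the same time step only
innov = rng.multivariate_normal([0.0, 0.0],
                                [[1.0, cor_delta], [cor_delta, 1.0]],
                                size=T + burn)
eps = np.zeros_like(innov)
for t in range(1, T + burn):
    eps[t] = beta * eps[t - 1] + innov[t]   # AR(1) recursion of Eq. (18)
eps = eps[burn:]                            # discard transient

emp_cor = np.corrcoef(eps[:, 0], eps[:, 1])[0, 1]
```

The burn-in discards the influence of the zero initial condition so the series are effectively drawn from their stationary distributions, matching the assumption used in the derivation below.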
The procedure for spatiotemporal data collapses the temporal information from the time series into two quantities: the pixel-specific estimates of the time trend c_l and the correlations between the temporal errors ε_l(t) and ε_k(t) from pixels l and k. Note that the estimates ĉ_l of c_l are now the dependent variable in the spatial model, rather than the data z_l(t). For LS regression, the exact relationship between the correlation cor[ĉ_l, ĉ_k] and cor[ε_l(t), ε_k(t)] can be derived analytically under the assumption that the temporal autocorrelation coefficients β_l and β_k are known. The T x T covariance matrix W_lk, whose t,s-element (t = 1, ..., T; s = 1, ..., T) is the covariance between ε_l(t) and ε_k(s), can be constructed from the backward-shift representation of the AR(1) process as W_lk = cor_δ σ² (I - β_l B)⁻¹ Λ_lk [(I - β_k B)⁻¹]′, where I is the T x T identity matrix and B is the backward-shift operator [3]. Under the assumption that the time series are sampled from their stationary distributions, Λ_lk is a diagonal matrix whose first diagonal element is the value of cov[ε_l(t), ε_k(t)] at the stationary distribution, and whose other diagonal elements are one. The stationary distribution of E(t) = (ε_l(t), ε_k(t))′ follows a bivariate Gaussian distribution, from which cor[ε_l(t), ε_k(t)] and cor[ĉ_l, ĉ_k] can be calculated analytically (Eqs. 19 and 20). Both cor[ĉ_l, ĉ_k] and cor[ε_l(t), ε_k(t)] are slightly lower than cor_δ, although they are equal to each other (within the uncertainty of the simulation) (Table 2). Therefore, when β_l = β_k, cor[ε_l(t), ε_k(t)] is an excellent approximation for cor[ĉ_l, ĉ_k]. The results in Table 2 for AR(1) regression suggest that using cor[ε_l(t), ε_k(t)] to approximate cor[ĉ_l, ĉ_k] will lead to a small overestimate of cor[ĉ_l, ĉ_k], although this will not change the conclusions; this is discussed in detail in Ives et al. [9].
To build the correlation matrix C for the GLS spatial analysis, we assume that cor[ĉ_l, ĉ_k] decays according to some function v(d_lk) of the distance d_lk between pixels l and k. For example, Ives et al. [9] use an exponential-power function v(d_lk) = exp(-(d_lk/r)^g), where the range parameter r and shape parameter g govern the decay of correlation with distance.

Table 1. Relationship between the correlation for estimates of the time trend parameter, cor[ĉ_l, ĉ_k], and the correlation for residuals from least-squares (LS) regression, cor[ε_l(t), ε_k(t)].
To explore extreme differences in the strength of temporal autocorrelation between time series, β_l = 0.8 for one time series and β_k = -0.4, 0, 0.4, or 0.8 for the other, with time-series length T = 10, 30, or 100. The values of cor[ε_l(t), ε_k(t)] and cor[ĉ_l, ĉ_k] were calculated analytically using Eqs. (19) and (20), respectively. Throughout, the parameter governing the correlation between δ_l(t) and δ_k(t), cor_δ, was 0.5.

Implementation
The partition approach for performing statistical tests can be implemented directly using the derivations above. When using random partitions, for large datasets it is not necessary to calculate the pairwise correlations between SSR_i (or SSE_i) from all pairs of partitions (Eqs. 13 and 14), or the pairwise correlations between the estimates of the coefficients from all pairs of partitions (Eq. 15). Because these correlations vary little between different pairs of partitions, they can be computed for a subset of partition pairs and the average applied to all pairs; the size of the subset can be set after examining the variation in correlations for specific datasets. This approach saves considerable computational time for large datasets with many partitions.
Although the methods are presented assuming that the correlation matrices C_i are known, in many applications C_i will contain one or more parameters to be estimated. For example, spatial models will often contain a "nugget" to capture local (non-spatially autocorrelated) variation [4,9]. In the spatiotemporal analysis outlined above, other parameters giving the spatial extent of the autocorrelation (e.g., r in v(d_lk)) can be estimated from the residuals of the time-series analyses; for the purely spatial model, these would instead be estimated during the GLS spatial analysis. Any parameters of C_i can be estimated for each partition separately and the formulae applied with the estimated matrices C_i. The resulting omnibus test statistics are then conditional on the parameter estimates of C_i.
The methods are implemented as a part of the package remotePARTS in the R programming language. It is available at https://github.com/morrowcj/remotePARTS .

Spatial model
To assess type I error rates and power, we performed a simulation study using the regression model (1) with a single predictor x_l (plus an intercept), in which the N spatial errors γ_l follow a multivariate Gaussian distribution with correlation matrix C, N(0, σ²_γ C). The simulation was performed on a 60 x 60 pixel map (N = 3,600), which was small enough to perform a GLS analysis (Eqs. 1-7) without partitioning the dataset. Spatial autocorrelation was introduced by assuming that the elements of C equal exp(-d_lk/r). Values of r were standardized to the scale of the map, so that r = 0.03 and 0.1 correspond to 3% and 10% of the maximum distance on the map (corner to corner). Values of x_l were given by the "latitude", defined as the row number in the 60 x 60 map divided by 60. For each value of b = 0, 2, 4, ..., 20, five hundred simulations were performed, and the null hypothesis H0: b = 0 was tested for each at significance levels of α = 0.05 and 0.01. We performed the partition analyses with eight partitions, and computed the pairwise correlations between SSR_i (or SSE_i) (Eqs. 13 and 14) as the average from a random subset of six partitions (a subset of 15 of the 28 total pairwise correlations). Fig. 1 gives the proportion of the simulations for which H0: b = 0 was rejected, with the significance level α given by the black dotted line. The tests were based on different methods for producing P-values. First, GLS (Eq. 4) was used to perform an F-test (P_GLS), which gives the "gold standard", since GLS is the best linear unbiased estimator (BLUE) [10]. Second, the partition method developed here was applied using eight partitions to give an F-test (P_F), a LRT (P_LRT), and a t-test (P_t). Third, the lowest P-value from the eight partitions from F-tests was selected and adjusted for eight multiple comparisons using either the Hochberg adjustment (P_hoch) [6] or the False Discovery Rate adjustment (P_fdr) [2].
Finally, a single partition was selected at random and its P -value was used ( P single ). As expected, all three values using the method developed here ( P F , P LRT , and P t ) gave very similar results. Type I error rates were appropriate, with close to 5% and 1% of simulations rejected at significance levels of α = 0.05 and 0.01, respectively. There was slight loss of power in comparison to the GLS ( P GLS ). Nonetheless, this loss of power was less than that of either Hochberg or FDR adjustments for multiple comparisons ( P hoch and P fdr ). In fact, when autocorrelation was high ( r = 0.1), the Hochberg or FDR adjustments had lower power than the randomly chosen partition ( P single ).
Because partitions were created randomly, there is variation in the test statistics depending on the partitions created. To illustrate this, Fig. 2 gives the distributions of P-values for a partition LRT of H0: b = 0 for two simulated datasets constructed as described above with r = 0.03 (Fig. 2a) and r = 0.1 (Fig. 2b). The variation in P_LRT with eight partitions is created solely by the selection of different partitions, because the datasets were the same for all analyses in the same panel. It is interesting that the variation in P-values is less for the more-highly autocorrelated dataset (r = 0.1, Fig. 2b), which is likely due to the greater correlations between SSR_i from different partitions, making the test scores from the separate partitions more similar. Finally, note that these results are for "small" datasets compared to those that the method was designed to analyze; the variation in P-values from different random partitions is less for larger remote-sensing datasets [9]. For smaller datasets, the analyses can be run multiple times with different random partitions, and the overall P-value selected as the median of the multiple values computed.
To compare the statistical results for different numbers (and hence sizes) of partitions, we used the same simulation model on a 60 x 60 pixel map to investigate the power to reject H0: b = 0 using a LRT when the true value is b = 10 (Table 3). For 8 and 16 partitions, a subset of six partitions was chosen at random to compute the pairwise correlations between SSR_i among partitions and SSE_i among partitions (Eqs. 13 and 14). The proportion of the simulations for which H0: b = 0 was rejected changed little with the number of partitions, n_p = 1, 2, 4, 8, and 16. This suggests that the loss of power when partitioning data is insensitive to the number of partitions.

Table 3
Power of the partition method for a given number of partitions. Five hundred simulations were performed on a 60 x 60 map using Eq.

Spatiotemporal model
We performed a simulation study using the time-series model given in Eq. (18) on a map of 40 x 40 pixels. Each time series was 30 data points long, with moderate temporal autocorrelation (β_l = 0.2). Spatial autocorrelation in the matrix C was assumed to have the form v(d_lk) = (1 - nugget) exp(-d_lk/r) for l ≠ k; thus, a proportion nugget of the error variance is "local" to a pixel. Spatial autocorrelation was assumed to be either moderate or strong (r = 0.03 or r = 0.1). We estimated r from the correlation among residuals, while nugget was estimated during the GLS spatial analysis via maximum likelihood. Thus, unlike the spatial example (Fig. 1), the spatiotemporal example included a parameter in matrix C that was estimated.
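The nugget correlation structure used here can be sketched as follows (our code, not remotePARTS): off-diagonal correlations are shrunk by the factor (1 - nugget) while the diagonal remains one, so a proportion nugget of each pixel's variance is uncorrelated with its neighbors.

```python
import numpy as np

def nugget_correlation(coords, r, nugget):
    """Correlation matrix with (1 - nugget) * exp(-d_lk / r) off the diagonal."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    C = (1 - nugget) * np.exp(-d / r)
    np.fill_diagonal(C, 1.0)   # each pixel is perfectly correlated with itself
    return C

side = 4
xy = np.array([(i / side, j / side) for i in range(side) for j in range(side)])
C = nugget_correlation(xy, r=0.1, nugget=0.2)
```

Adding a nugget is equivalent to mixing the exponential kernel with an identity matrix, which also improves the conditioning of C and keeps it positive definite.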
We assumed that the map was divided into 16 squares (10 x 10 pixels each) assigned to four land-cover classes (e.g., Fig. 4 in [9]). The time trends in the four land-cover classes were c_1 = 0, c_2 = c, c_3 = 2c, c_4 = 3c, with values of c = 0, 0.04, ..., 0.20. To scale the time trends, the time variable t ranged from 0 to 1 over the 30-year simulated time series, and the standard deviation σ of δ_l(t) was 1; a value of c = 0.12, for example, therefore represents a change in the mean value of z_l(t) over 30 years of 0.12 σ(1 - β_l²)^(-0.5). To include pixel-scale (non-spatially autocorrelated) variation, we added a Gaussian random variable with mean zero and variance 0.16 to c for each pixel. We simulated 500 datasets for each parameter combination and tested the hypothesis that there were no differences in trends among land-cover classes, H0: c_1 = c_2 = c_3 = c_4, and the hypothesis that there was no overall time trend, H0: c = 0. The hypotheses were tested using (i) a GLS analysis applied to all of the data (P_GLS); (ii) a GLS applied to eight random partitions of the data with the log-likelihood ratios combined among partitions as described in the section Combining tests from the partitions (P_part); (iii) the lowest P-value from the GLS analyses of the eight partitions, corrected for multiple comparisons using a Hochberg (P_hoch) or False Discovery Rate (P_fdr) adjustment; and (iv) a randomly selected partition (P_single).
For both tests of differences among land-cover classes (Fig. 3a, b) and tests for an overall trend (Fig. 3c, d), the GLS analyses of the entire map had the highest statistical power to reject the null hypothesis. The method combining statistical results among partitions had the second-highest power (P_part), while the methods using adjustments for multiple comparisons had lower power (P_hoch and P_fdr). Finally, in general, picking a single partition at random had the lowest power (P_single). Spatial autocorrelation reduced the statistical power of all methods (r = 0.1; Fig. 3b, d). The method combining statistical results among partitions (P_part) had somewhat inflated type I error rates for the analyses of land-cover classes, rejecting 9% of the simulated datasets when the null hypothesis was true. These inflated type I error rates, however, were the result of the relatively small size of the simulated map (40 x 40 pixels), necessitated by the application of the GLS analysis to the entire map (P_GLS). Repeating the same analysis on a 60 x 60 pixel map gave rejection rates of 5.4% (r = 0.03) and 5.2% (r = 0.1). The inflated type I errors for the smaller map occurred due to the estimation of the nugget, because GLS in which no parameters in the correlation matrix C are estimated does not give inflated type I errors [10].

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.