REML estimation with intrinsic Mat´ern dependence in the spatial linear mixed model

: We present a new matrix-free residual maximum likelihood (REML) analysis for irregularly spaced spatial data, where observations usually represent average values over very small regions that are interpreted as points. The REML analysis is obtained after embedding the sampling locations in a ﬁne scale rectangular lattice, treating unobserved sites as missing data. The spatial random ﬁelds considered here are based on fractional Laplacian diﬀerencing on the lattice and they are unique in approximating continuum intrinsic Mat´ern dependence. Here, using the h-likelihood method, we derive REML estimating equations that allow for singular precision matrices, estimation of covariate eﬀects, prediction of unobserved spatial eﬀects and REML estimation of precision parameters as a solution to an explicit gamma non-linear model. Furthermore, we devise a sophisticated computational algorithm that enables us to achieve scalable matrix-free statistical computations. In particular, these matrix-free computations include the use of (1) the two-dimensional discrete co-sine transformation that arises in the spectral decomposition of the precision matrix of our spatial random ﬁelds and that allows fast matrix-free matrix-vector multiplication, (2) a matrix-free pre-conditioned Lanczos al-gorithm that solves non-sparse matrix equations with linear constraints, (3) a matrix-free Hutchinson’s trace estimator that stochastically approximates the trace of a matrix, (4) a robust trust region method that always ﬁnds a local maximum of the non-concave residual log-likelihood function and (5) some preliminary computations of the log REML likelihood function based on Taylor series approximation. Using computer experiments, we provide further understanding on not just the number and values but also the basins of attraction of the local and global maxima of the REML function. This understanding signiﬁcantly simpliﬁes the problem of ﬁnding global maxima. We further demonstrate through computer experiments that our matrix-free REML estimators attain both eﬃciency and geosta-tistical inference


Introduction
In recent years, interest in connecting lattice-based Gaussian random fields with geostatistical models has increased significantly, as researchers begin to explore the extent to which lattice-based models offer an adequate fit or explanation to continuum spatial variations we see in real data. At the forefront of these discussions are the works of McCullagh (2002), Besag (2002), Besag and Mondal (2005), McCullagh and Clifford (2006), Lindgren et al. (2011), Dutta and Mondal (2015a), Mondal (2013), and many others. In particular, the work of McCullagh (2002) lays a theoretical foundation for the use of the de Wijs process or the logarithmic covariance model for spatial analysis. Besag and Mondal (2005) derive the connection between first-order intrinsic autoregression and the de Wijs process in which the latter arises as the scaling limit of the former. Using stochastic partial differential equation representations, Lindgren et al. (2011) (see also Gay and Heyde, 1990;Kelbert et al., 2005) provide useful Gaussian Markov random field approximations to a subclass of Matérn covariance models, and present Galerkin methods for statistical computations. Dutta and Mondal (2015a), on the other hand, derive fast matrix-free computations for spatial mixed models based on Gaussian Markov random fields with nugget effects, enhancing the computations developed in Lindgren et al. (2011). Encouraged by these developments, we undertake here the novel possibility of using the first-order Gaussian intrinsic autoregression on the regular lattice as a building block for constructing fractionally differenced random fields. The latter, as we shall see, is unique in approximating continuum intrinsic Matérn random fields that have an important place in the geostatistical literature. It is these lattice-based fractional random fields that give a fresh perspective to the conceptual framework of spatial Matérn dependence analysis, whose full potential has not been realized. We also develop novel, scalable, easy-to-implement and yet sophisticated matrix-free computations that allow us to argue for their wider relevance, and help us understand many of the strengths and weaknesses of intrinsic Matérn dependence models.
Thus, our purpose in studying Matérn models is distinct from classical works of Mardia and Marshall (1984), Handcock and Stein (1993), Stein (1999), Diggle et al. (2003), Guttorp and Gneiting (2006), Minasny and McBratney (2007), Pardo-Igúzquiza et al. (2009), Anitescu et al. (2012) and many others on conventional fitting of stationary Matérn covariances to spatial data. Furthermore, the purpose here is also distinct from studying Whittle's spectral approximations methods (see e.g., Guyon (1982), Dahlhaus and Künsch (1987), Kent and Mardia (1996), Fuentes (2007) and many others) for estimating parameters of stationary covariances. Necessarily, we avoid direct modeling via covariances or spectral densities, and instead focus on the structure of the precision matrix (i.e., the canonical parameter) that arises through full conditional specifications and scaling limit connections. Furthermore, as geostatstics is largely about differencing and intrinsic random functions (Matheron, 1971(Matheron, , 1973; see also the review of Beran (1992) on long range dependence and the discussion of Diggle et al., 2010), we restrict our attention mostly to fractional differencing and approximations of intrinsic Matérn models. Corresponding methods for the stationary case, if needed, can be obtained easily and will be discussed later in the paper.
In geostatistical applications, observations usually represent average values over very small regions that are interpreted as points. Thus, for brevity of discussion, we assume that we can embed the locations in a fine scale rectangular lattice, treating unobserved sites as missing data. As we shall see in Sections 5 and 6, there is little loss in discretizing the space in the above way and in embedding irregularly spaced locations to a fine regular grid. We can then focus on estimation of parameters in an overall mixed effect formulation, which might include covariate information as well as a white noise component. In other words, we consider a mixed linear model of the form y = Tτ + Fψ + . (1) In the above y is the vector of observations around n sampling locations, T is an n × m matrix of some covariate values with τ as the coefficients, ψ is the vector of latent spatial process coming from a fine r × c regular array on which the sampling locations are embedded, F is a known incidence matrix indicating whether an observation belongs to a particular array cell so that Fψ gives back the vector of latent spatial variable values for the observed y and indicates the vector of residual fluctuations that are left unexplained by the regularly varying spatial process or the covariate values. We assume that the covariate values are so adjusted that T1 = F1 = 1 or F1 = 1 where 1 is the vector of all ones of appropriate dimension and 1 belongs to the column space of T. We interpret y i , the ith entry of the y vector, as the average value on the array cell on which the corresponding sampling location falls. Similarly, we interpret ψ u,v , the (u, v)th entry of ψ, as the average value of the latent spatial random field on the (u, v)th array cell. Furthermore, we embrace the residual fluctuations as independent and identically distributed Gaussian random variables with an unknown precision parameter λ y , and the random spatial effect ψ as an intrinsic Matérn Gaussian random field on an r × c array with singular precision matrix R. In other words, where |R| + denotes the product of nonzero eigenvalues of R. Thus, what assumes importance is how we construct R and how we pursue subsequent statistical computations in order to draw meaningful statistical inference. In the case of first-order Gaussian intrinsic autoregression on the two dimensional integer lattice, which is central to the Markov random field approach of spatial statistics, ψ is defined via the following conditional mean and variance formula where the conditioning variables (namely ...) are all ψ u,v such that (u, v) = (i, j). When restricted on a finite r × c array, this intrinsic autoregression actually follows ψ ∼ |W/(2π)| 1 2 where the singular precision matrix W of ψ is the discrete (graph) Laplacian on the grid with rank(W) = q − 1 and can be written in terms of a quadratic form as In what follows, we shall draw upon the above construction of W to derive an explicit representation to R. Specifically, we shall use with a precision parameter λ ψ and a (long range) dependence parameter ν. The above may look rather simple, but as we shall see, it allows us to describe fractionally differenced random fields on regular arrays and, as lattice spacing diminishes, it allows us to approximate intrinsic Matérn models. Furthermore, it is the use of (2) that allows us to develop fast matrix-free REML computations for the mixed model in (1) and, in the process, allow us to gain new understanding on the statistical estimation and inference of intrinsic Matérn models. It is worthwhile to point out that there is little literature on fitting exact intrinsic Matérn mixed linear model besides McCullagh and Clifford (2006). These exact REML estimation of McCullagh and Clifford (2006) require O(n 3 ) computations and O(n 2 ) storage when data are observed at n spatial locations. It is also worthwhile to point out that in Lindgren et al. (2011) the focus was on the values ν = 1, 2, . . ., which correspond to a subclass of conditional autoregressions with sparse precision matrices and which allow them to use certain sparse matrix computations that are of O(n 3/2 ). However, the sparse approximations and the fast computations of Lindgren et al. (2011) are not extensible for any arbitrary positive values of ν (e.g., for ν = 5/3 that arises in the study of turbulence). In comparison, the effective order of matrix-free computations developed by Dutta and Mondal (2015a) for spatial mixed linear models based on a sparse precision matrix (i.e., ν = 1, 2, . . .) and nugget effects is just O(n log n) with O(n) storage. In this paper, we not only extend the estimation to arbitrary ν > 0, but also the computations that we develop here for any ν > 0 effectively require only O(n(log n) 2 ) computations and O(n) storage. Other than non-sparsity, the statistical problem considered here has many challenges. As, for example, the intrinsic Matérn dependence leads to not just a non-sparse singular precision matrix but also an ill-conditioned one and finding an useful preconditioning method is a challenge. Similarly, non-integer values of ν give rise to non-convexity in the log REML function which poses an important optimization challenge. We offer here new and interesting solutions to these challenges without resorting to dimension reduction, tapering, or procedures such as block averaging. We also demonstrate that there is little loss of statistical efficiency (in terms of achieving Cramer-Rao lower bound) due to our matrix-free scalable computations and we can obtain answers that are as good as exact answers.
The rest of the paper is laid out as follows. In Section 2, we provide interpretation of (2) based on the fractional Laplacian difference equations and derive some justifications for its validity using both the frequency domain and the spatial domain results. Specifically, we show that 1) as the lattice spacing diminishes the eigenvalues of the singular precision matrix from (2) trace out the eigenvalues of an intrinsic Matérn process, and 2) the differences between the variograms of the fractionally Laplacian differenced process at a moderately fine lattice and the continuum Matérn process are just a small constant. Thus, at a moderately fine lattice scale, our approximations work remarkably well by the routine use of a random error term along with the spatial component. In Sections 3 and 4, we develop REML estimation procedure of the mixed model in (1). Specifically, we derive of matrix-free statistical computations of best linear unbiased estimators (BLUEs) of the contrasts of fixed effect τ , best linear unbiased predictors (BLUPs) of the contrasts of random effect ψ, and REML estimation of θ = (λ y , λ ψ , ν) T . We also provide explicit characterization of the non-convex nature of REML estimation. Our computational steps consist of (1) matrix vector multiplications using the discrete cosine transformations (2) linear equations solving with a non-sparse Lanczos algorithm, (3) derivation of effective matrix-free preconditioner using an incomplete Cholesky factor decomposition, (4) calculating traces of matrices using Hutchinson's trace estimators and (5) optimization using a robust matrix-free trust region method. We also include some preliminary computations of the log REML likelihood function based on Taylor series approximation. In Section 5, we present through simulations the performance of the REML estimators. These simulation runs suggest that despite non-convex nature of REML optimization, the trust region method can pick the global estimates, at least when n = O(rc) and n is large. We also run simulations to demonstrate that there is little loss in embedding irregularly spaced locations to a fine regular grid. In particular, we show numeric consistency of results and highlight robustness of inference to changes of lattice spacing. We further demonstrate that our matrix-free REML estimators attain efficiency properties that are very close to those of the exact REML estimators. In terms of practical gain, we also report actual computational times that are better than those from Lindgren et al. (2011). In Section 6, we provide an extensive application to mapping ground water arsenic contamination in Bangladesh and again highlight robustness of statistical inference to the changes of scales. Finally, in Section 7, we close the paper with some discussions on computations of the stationary case, conditional simulations, and on potential challenges ahead in accommodating spatial anisotropy and heterogeneity within the framework developed here.

Fractionally differenced random fields and intrinsic Matérn processes
The objective of this section is to provide some details of the fractional Laplacian differenced random fields on finer and finer regular arrays as approximations to continuum intrinsic Matérn random fields. Here, we shall consider some frequency domain results, exact variogram calculations and precise algebraic eigenvalue expressions that lead to (2). We also discuss certain meaning and significance to its use. First, as discussed by Mondal (2011) with regard to the paper of Lindgren et al. (2011), we consider a sequence of Gaussian random fields {Z (m) (u, v)} on regular two dimensional sub-lattices Z 2 m with spacing 1/m, m = 1, 2, . . ., which have individual spectral densities of the form with ω, η ∈ (−πm, πm], σ m > 0 and ν > 0. In the above, ν = 1 corresponds to the first-order intrinsic autoregression, ν = 2 suggests a Whittle's simultaneous intrinsic autoregression and so on. The integer values of ν, namely, ν = 1, 2, 3, . . . were the focus of Lindgren et al. (2011) and they correspond to various latticebased intrinsic autoregressions with sparse precision matrices. In contrast, the non-integer values of ν, which is the primary focus of this paper, lead to fractionally differenced random fields. Typically, they correspond to random fields with non-sparse precision matrices and offer greater flexibility in modeling long range spatial dependencies. Specifically, following Duffin (1953), let D m be the Laplace difference operator on the sub-lattice Z 2 m , i.e., where f is any real valued function defined at the lattice points of Z 2 m . Then, by extending the results of Hosking (1981), we can represent {Z (m) (u, v)} as where ξ u,v is a Gaussian white noise random field on the sub-lattice Z 2 m with var ξ u,v = σ 2 m /m 2 . Thus, {Z (m) (u, v)} can be interpreted as a fractional Laplacian differenced random field on the sub-lattice Z 2 m and they enjoy properties similar to those of fractionally differenced time series. Now, just like the fractionally differenced time series, the fractionally differenced random fields {Z (m) (u, v)} can be thought of as discrete space analogue of certain continuum fractional random fields. In fact, these continuum random fields, say, {Z(u, v)} can be obtained by taking scaling limits of {Z (m) (u, v)}.
To elaborate on this, we assume that, as m → ∞, m ν−1 σ m → σ/2 ν . Then, it is not difficult to see that f m (ω, η) converges to The above gives the spectral density formula of a continuum Gaussian intrinsic Matérn random field and thus, from the above convergence result, it follows that continuum Gaussian intrinsic Matérn random fields are scaling limits of fractionally differenced random fields. Furthermore, the above convergence result explicitly describes the rescaling of parameters needed, particularly when we want to choose a suitable sub-lattice (e.g., to embed irregular sampling locations into a grid or to approximate irregular regions by unions of grid cells when observations themselves are aggregates over such regions). As we shall see, this will become important to show numeric consistency and robustness of inference to changes of lattice spacing.
The key question is, as approximations to the continuum intrinsic Matérn process, how good are the above lattice-based model {Z (m) (u, v)} for small values of m. For brevity of discussion, we focus on the case 1 < ν < 2. The intrinsic case with ν = 1 is discussed in details in Mondal (2005), Besag and Mondal (2005) and Dutta and Mondal (2015a). The intrinsic case with ν ≥ 2 requires use of higher-order contrasts and some calculations are outlined in Appendix and in the supplement to the paper. The intrinsic case with 0 < ν < 1 is similar to the case ν = 1 in the sense it also requires calculations using (contrasts of) averages over nun-null regions, and some calculations are given in the supplement to the paper. If 1 < ν < 2, following Matheron (1973), it is well known that the continuum intrinsic Matérn random field {Z(u, v)} has the following exact power variogram formula where h = √ s 2 + t 2 . On the other hand, the variogram of the fractional Laplacian differenced random field on Z 2 m at lag (s, t) is given by the double integral formula which is computed numerically with great accuracy by a two-dimensional quadrature method discussed in Dutta and Mondal (2015b). For large m, γ m (s, t) will be very close to γ(s, t). However, we want to know whether one can work with {Z (m) (u, v)} for moderately small values of m instead of working with the continuum process {Z(u, v)} and essentially get the same result. To address this question, Figure 1 plots the difference {γ(s, t) − γ m (s, t)} against the lag distance √ s 2 + t 2 for ν = 1.25 and ν = 1.5 and σ 2 = 1. When ν = 1.25, we see that the difference decreases as m increases and this difference becomes almost a positive constant at m = 8. Thus, in this case it is to our advantage to add a small nugget effect to the process {Z (m) (u, v)} so that statistical analyses based on the fractional Laplacian differenced random field on sub-lattices Z 2 m are essentially equivalent to the same based on continuum power variogram model on integer lattice Z 2 . On the other hand, when ν = 1.5, the difference of the variograms is negative, increases with respect to m and essentially becomes a small negative constant when m = 8. In fact, notice that this difference is only 1.5% of γ(0, 1) when m = 8. Thus, the lattice-based approximation would be good for moderate values of m even if a little nugget effect is present in the data.
In general, we have observed similar phenomena for other values of ν. In particular, for other reasonable values of 1 < ν < 2, we found that the difference γ(s, t) − γ m (s, t) is essentially a small constant (positive or negative) for moderately small values of m. Thus, it not unreasonable to anticipate that lattice based approximations are enhanced by routine use of a random error term along with the spatial component. In fact, these results are typical to the ones found in Mondal (2005), Besag and Mondal (2005) and Dutta and Mondal (2015a) and they can be further justified by writing down the asymptotic expansions of γ m (s, t) using results from Duffin and Shaffer (1960), as done in Mondal (2005), Besag and Mondal (2005) and Dutta and Mondal (2015a) for the case of ν = 1. Overall, a fractional Laplacian differenced random field plus a white noise at a reasonable sub-lattice is a very good approximation of the intrinsic Matérn plus white noise model on the actual lattice.
Next we explore how the finite grid representation in (2) is connected with the infinite lattice representation in (3). To this end, suppose that observations are made on a finite r × c regular array with r = p 1 × m, c = p 2 × m and lattice spacing 1/m. We can then approximate the Laplace differenced operator D m by the discrete (graph) Laplacian matrix of this grid. However, note that the matrix −W in (2) is the Laplacian matrix on this finite grid. This leads to saying that the spatial lattice process Z (m) u,v on a finite integer lattice are approximately zero mean Gaussian random variables with precision matrix σ −2 m W ν . This gives rise to (2) with λ ψ = σ −2 m . Furthermore, suppose for k = r or k = c that M k is a k × k matrix corresponding to the discrete cosine transformation with entries . . , k, j = 1, . . . , k. Suppose also that D k is a diagonal matrix with ith diagonal entry equal to It then follows that M k is orthogonal and M = M c ⊗ M r diagonalizes W. Specifically, It is now obvious that, for any m = 1, 2, . . ., the eigenvalues of R − , the Moore-Penrose inverse of R, exactly trace out the spectral density (3) at the discrete cosine frequencies, further justifying the validity of (2). It is also obvious that with the increase of m, i.e., as the lattice spacing diminishes, the eigenvalues of R − trace out the limiting continuum spectral density (4) better. The spatial formulation in (2) as representation of the infinite lattice fractional random fields in (3) is enhanced further if we allow a few extra layers of conceptual 'border' grid cells when we embed sampling locations. The idea was put forward by Besag and Higdon (1999), purely as a computational ploy, to alleviate edge effects, and, in practice, its implementation requires just a straightforward adjustment to F.

Estimation for the mixed model
The singularity of R presents some difficulty in interpreting the mixed model in (1). Put another way, (1) is interpreted in terms of contrasts of y, i.e., C 0 y = C 0 Tτ + C 0 Fψ + C 0 , where rows of C 0 are vector of contrasts and rank(C 0 ) is equal to rank(FR − F T ). Thus, contrasts of τ are estimable and contrasts of ψ are predictable. In such settings, typical REML estimation goes as follows. We estimate τ from marginal distribution of C 0 y. We estimate BLUPs of the contrasts of ψ using conditional mean formula given data differences C 0 y. Separate from the estimation of τ , we obtain estimates of λ y , λ ψ and ν from maximizing the log REML function. However, unless carefully done, such estimation procedures based on data differencing do not lead to any simplification nor do they provide any insight into the REML estimation problem. Thus, in what follows, we draw upon the work of Henderson (1950Henderson ( , 1975, Nelder (1996, 2001), and Dutta and Mondal (2015a) and derive exact REML estimation using h-likelihood formulation.

Estimation of fixed effects and prediction of spatial effects
Let B denote the matrix of orthogonal contrasts formed by the last rc − 1 eigenvectors of W, and let G denote the diagonal matrix formed by the corresponding eigenvalues of W, i.e., where d 01,i and d 10,i are the ith diagonal entries of D 01 and D 10 respectively. It then follows that Bψ has an rc − 1 dimensional Gaussian distribution with zero mean and covariance matrix λ −1 ψ G −1 . Extending Henderson (1950Henderson ( , 1975, Nelder (1996, 2001) and Dutta and Mondal (2015a), we represent the mixed model equations in (1)-(3) by where and ζ are independent centered Gaussian distributions random vectors with covariance matrices λ −1 y I and λ −1 ψ G −1 respectively. For notational brevity, we further define Then we can express the mixed model in (1) as a linear regression model . Under the assumption that T1 = F1 = 1, the design matrix X has one rank deficiency; that is there is a non-zero vector such that X = 0. For fixed parameters λ y , λ ψ and ν, we then estimate β by solving the normal equation or in more succinct notation subject to the constraint A = 0. Let β be the estimate of β that we obtain solving the normal equation (6). Let τ and ψ be the corresponding estimates of τ and ψ, which we obtain from β. Extending the argument given in Dutta and Mondal (2015a), one can then see that the contrasts of τ coincide with the best linear unbiased estimators of the contrasts of τ . Similarly, the contrasts of ψ coincide with the best linear unbiased predictors of the contrasts of ψ. Furthermore, even if the symmetric coefficient matrix A = X T QX is not sparse, we shall see that the above normal equation is solved using a novel and efficient matrix-free Lanczos algorithm.

REML estimation for precision parameters
Extending Nelder (1996, 2001) and Dutta and Mondal (2015a), the log-likelihood function of the residuals obtained from the regression model (5) takes the form of The derivations provided in Dutta and Mondal (2015a) further show that the traditional REML log-likelihood arising from C 0 y and (7) is actually same up to an additive constant that depends only on the contrast matrix C 0 . Thus, maximizing (7) gives rise to REML precision parameter estimate of θ = (λ y , λ ψ , ν). Traditional maximization of the log REML function uses score equations which is obtained by equating the gradient of r to zero. Thus suppose Q 1 = ∂Q/∂λ y , Q 2 = ∂Q/∂λ ψ , and Q 3 = ∂Q/∂ν. The score equations that maximize the log-REML function in (7) are then given by for i = 1, 2 and 3. It is not difficult to see that the above score equations can also be written as where H = X(X T QX) −1 X T Q denotes the 'hat'-matrix of the linear regression model (5). Typically, Fisher's scoring method is used to solve the score equations and to obtain REML estimates. However, this also requires computation of the second derivatives of the log REML function or the information matrix I whose (i, j)th entry is equal to and whose inverse is used to produce estimates for dispersion of the REML estimators of the precision parameter.

Nonconvexity in REML estimation
Neither the log REML function (7), nor the scoring equations (8) give insight into the exact non-convex nature of the optimization problem. Here, instead of maximizing the REML function directly, we consider an alternative approach that allows us to characterize the REML optimization problem in terms of a non-linear gamma regression. This alternative approach follows the work of Nelder (1996, 2001) and Dutta and Mondal (2015a). The basic idea is as follows. Rather than using traditional REML computations that separate estimations of β and θ, we can consider an iterative approach. Thus, starting with an initial estimate β, we compute the residuals e = z − X β and use them to estimate the precision parameters λ y , λ ψ and ν. Then, once estimates λ y , λ ψ and ν are obtained, we update β by computing the normal equation from (6). The process continues until the estimates converge numerically. It is just that the final estimates are actually REML estimates. Specifically, suppose h jj is the jth diagonal element of the hat matrix H, and q jj is the jth diagonal entry of Q. Then .
Accordingly the score equations in (8) then can be written as where q (i) jj denote the jth diagonal entry of Q i . The above score equations coincide with the estimating equations of a nonlinear gamma regression, where we assume that the adjusted residuals e 2 j /(1 − h jj )'s are the responses variables that are distributed independently as Gamma random variables, and we have an inverse link, prior weights (1 − h jj )'s and nonlinear predictors q jj = λ y v y,j + λ ψ (v 01,j + v 10,j ) ν .
The above provides a precise characterization of the non-convex nature of REML estimation in that non-convexity arises through the estimation of the dependence parameter ν. In particular, when ν is fixed and known, the optimization becomes a convex optimization in which case any local maximum is also the global maximum.

Matrix-free computations
Typical statistical computations such as REML estimation from Matérn mixed models of spatial data with n irregularly distributed sampling locations require at least O(n 3 ) computations and O(n 2 ) storage space. The main objective here is to reduce computations for REML estimation of our mixed linear model in (1) to O(Kn(log n) 2 ) operations, where K << n (and effectively a constant) and storage to O(n) space. We do this without losing any efficiency properties of the REML estimators. To this end, we divide our task into three parts. First, we discuss the two dimensional discrete cosine transformation that is essential for various matrix-free matrix-vector statistical computations for our mixed model. Second, we indicate how we can compute BLUEs and BLUPs of contrasts using a novel matrix-free Lanczos algorithm. Here we also develop a novel preconditioning method to reduce the order of computations. Third, we develop fast matrix-free ways to obtain REML estimates using stochastic trace approximations and a trust region method. In this paper we do not pursue equation (8) for estimating the precision parameters, which would require computing the diagonal entries of the hat matrix. In a future paper, we shall take up matrix-free computations of such quantities. Furthermore, in this section, we provide some preliminary matrix-free computations of the log REML function based on Taylor series approximations. In our limited experience, it appears that these computations work better than those by Aune et al. (2014). More accurate scalable matrix-free computations of the log REML function will be a matter of future investigation.

DCT and matrix-vector multiplications
For any r × c matrix E = (e i,j ), its two dimensional discrete cosine transformation gives rise to a r × c matrix whose (s, t)th entry is given by where c s = 1/r if s = 1 and 2/r otherwise, and c t = 1/c if t = 1 and 2/c otherwise. On the contrary, the two dimensional inverse discrete cosine transformation on E produces another r × c matrix with (s, t)th entry: The above transformation is very closely related to the two dimensional fast Fourier transformation (FFT). Indeed, following the works of Cooley and Tukey (1965) and Rao and Yip (1990), one can factorize the above computations further, as done in the case of FFT. These factorizations reduce the computational complexities of DCT to O(rc log(rc)). Alternatively, it is also possible to use FFT to compute the DCTs with additional O(rc) pre-and post-processing steps, as shown by Makhoul (1980). Over the decades, highly optimized algorithms for FFT have been developed for various machine architectures. We have found that these algorithms in conjunction with O(rc) pre-and post-processing steps are more time efficient to compute DCTs than directly implementing the Cooley-Tuckey and the Rao-Yip algorithms. In this paper, we follow Frigo and Johnson (2005) that show how to eliminate the redundant operations of the FFT algorithms to compute the DCTs. The codes from Frigo and Johnson (2005) are freely available on the web: http://www.fftw.org/.
In what follows, the two dimensional DCT is used in various matrix-vector multiplications. We encounter these matrix-vector multiplications in solving the normal equation (6), and in computing the REML score equations and in computing the entries of the Fisher information matrix. Essentially, these matrix-vector multiplications would require us to compute either W ν x or W −ν x for certain candidate vector x. The matrix W ν and its pseudo-inverse W −ν are non sparse. However, spectral decomposition allows us to write is obtained by inverting all the non-zero entries of (D 01 + D 10 ) ν . Thus, multiplying a vector with W ν or W −ν essentially reduces to multiplying a vector with M or M T . At this point, we note that obtaining Mv is same as performing a two-dimensional DCT on the r × c matrix formed using elements of v filling each column at a time and then converting the resulting matrix back into a vector by stacking the columns. Similarly computing M T v is same as obtaining a two-dimensional inverse DCT.
It must be noted that, for very large arrays, one can further improve the speed of the two-dimensional DCTs using distributed computing. The main idea comes from the fact that a two-dimensional DCT consists of many one-dimensional DCTs along the columns and then along the rows. Thus, these one-dimensional DCT computations can be distributed equally or parallely among the cores of the processors. This division of computations is particularly useful on a graphical processing unit (GPU) which typically has thousands of processing cores and this division of computations can provide substantial gain in computational time. For further discussions on parallel implementation of DCT, we refer to the website: http://www.fftw.org/parallel/parallel-fftw.html.

Solving linear equations with Lanczos algorithm
First, note that the Lanczos algorithm in Dutta and Mondal (2015a) for solving the sparse system of equations Aβ = b with ν = 1 can be extended to solve a dense system structured by general values of ν > 0 in matrix-free way. This matrix-free extension of the algorithm from sparse to the dense case is possible because all that the Lanczos algorithm uses are matrix vectors products of the form Ax which can be computed via the DCT and the inverse DCT described in the previous section. The Lanczos algorithm proceeds as follows. It computes a set of orthonormal vectors v 1 , v 2 , . . . called the Lanczos vectors, from the span of b, Ab, A 2 b, . . . . At the kth iteration, it then obtain an approximate tri- In the implemented version of the algorithm none of these matrices V k and Δ k is stored but the solution is iteratively updated on the fly by progressively computing the lower bidiagonal Cholesky factorization of Δ k . We refer to Dutta and Mondal (2015a) for all technical details on the algorithm including computing the norm of the residuals in almost no additional cost that gives a practical stopping criterion. The algorithm stores only few vectors of length m + rc and a few scalars thus requiring only O(rc) memory. With the use of the discrete cosine transformation, a matrix-vector multiplication of the form Av incurs a O(rc log(rc)) computational cost. Thus, if the Lanczos algorithm takes K 0 iterations to converge, then the total computational cost of solving Aβ = b becomes O(K 0 rc log(rc)).
Second, we do not solve Aβ = b directly, but rather we carefully construct a preconditioning matrix P and solve a well-conditioned system of linear equations PAP T β = Pb. We then obtain the final solution as β = P T β . The idea is to reduce the number of iterations K 0 as much as possible. The matrix P has the following requirements: 1) we want P such that the matrix-vector products Px and P T x can be computed in matrix-free way at a computational cost of O(rc log(rc)), and 2) the condition number of PAP T will be finite so that Lanczos algorithm for solving PAP T β = Pb converges geometrically fast, i.e., in O(log(rc)) iterations.
To this end, we apply a block preconditioner for A where the first block is a Cholesky factor of the small dimensional matrix λ y T T T and the second block is a preconditioner for the matrix When ν = 1 (or some other positive integer), the above matrix is sparse, in which case Dutta and Mondal (2015a) proposed using its incomplete Cholesky factor as preconditioner. For a general non-integer value of ν, we construct here a practical sparse approximation C of C by thresholding or tapering in such a way that the small eigenvalues of C stay at the same order of the small eigenvalues of C. We then apply an incomplete Cholesky factor of C to obtain an effective matrix-free preconditioner of C.
In order to construct the sparse approximation C of C, we use the following decomposition of the matrix W: where 4W k = M T k D k M k and k is either r or c. In the above the matrix W k is a tridiagonal matrix and W ν k is a non-sparse matrix, but its entries far away from the diagonal are very small in magnitude compared to the ones on or near the diagonal. Let W ν k be a sparse matrix approximation of W ν k obtained by suitable thresholding or tapering. Then a candidate for the matrix C is given by Thus the block diagonal preconditioner for A becomes P = diag{L −1 1 , L −1 2 } where L 1 is the lower triangular Cholesky factor of λ y T T T and L 2 is the lower triangular incomplete Cholesky factor of C. This effectively makes the condition number of the matrix PAP T bounded. It can seen that, if r and c are of the same order, the overall cost of constructing this matrix P is O(rc log(rc)). Furthermore, since the inverse of P sparse, the matrix vector multiplications Pv and P T v incurs a computational cost of at most O(rc log(rc)).
The above preconditioner effectively brings down the computation cost of calculating the BLUEs of the contrasts of τ and the BLUPs of the contrasts of ψ to O(rc(log(rc)) 2 ) operations.
Alternatively, one can obtain a preconditioner of A using Chebyshev polynomial approximations, see e.g., (Saad, 1985), but we did not pursue its use in this work.

Stochastic trace approximation
Equations (8) and (9) (rc)) 2 ) calculations even with the use of discrete cosine transformation and the Lanczos algorithm. In other words, overall trace computations using this method require at least O((rc) 2 (log(rc)) 2 ) operations. Here, instead we apply Hutchinson's method that stochastically approximates the trace of a symmetric matrix using a Monte Carlo average of its quadratic forms in random vectors with zero mean and identity covariance matrix. Thus, for a symmetric matrix E, the Hutchinson's method approximates the trace of E by where u t 's are either i.i.d Rademacher or Gaussian random variables and K is an integer that is much smaller than the dimension of E. Throughout, we only consider i.i.d. Rademacher variables because they minimize the variance of the above estimate among all i.i.d random variables of size K from a distribution with zero mean vector and identity covariance matrix. Although further reduction in variances of the trace approximations are possible if we allow dependent random variables in trace estimation, we do not pursue such stochastic approximations in this paper. It then follows that the REML score equations reduce to the following unbiased estimation equations And the (i, j)th entry of the Hessian matrix becomes Overall the computations of the trace terms, REML score equations and the information matrix require O(Krc(log(rc)) 2 ) flops and O(rc) storage and allow us to pursue numeric optimization of the log REML function via the Newton-Raphson line search method or the trust region methods.

Trust region method
Finding a solution to the REML score equation that minimizes the negative log REML function is a non-convex optimization problem, and in such settings, the traditional Netwon-Rhapson algorithm with line search methods are often perceived to be susceptible in practice. Thus, following Powell (1970Powell ( , 1984 and Nocedal and Wright (1999), we adopt here a trust-region method that can be considered as a global line search method and that allows us to obtain solutions to the REML score equations in a numerically stable way. At any iteration, the trust region method approximates the objective function by a suitable quadratic function, identifies a region within which it perceives the quadratic approximation to be good and then minimizes the quadratic approximation over the identified region around the current value of the variable to obtain the next iterate.
It must be emphasized that the above quadratic function can computed in a matrix free way, as we can do so for computing the gradient and the Hessian using the stochastic trace approximations. Next, the trust region method picks the step size δ k by minimizing p k (δ) in a suitable region. The choice of the suitable region is critical here. If the region is too large then p k (δ) could be a poor approximation to (1/2) ∇g(θ k ) 2 and minimizing p k (δ) could become unsuitable. On the other hand, if the region is too small then the algorithm will pick a tiny step size resulting in little effective gain. Thus, the trust region method adaptively chooses the radius of the region based on the performance of the previous iteration. If there was a significant gain in the previous iteration then the algorithm becomes optimistic and expands the radius of the trust region; a loss on the other hand results in shrinking the radius. Overall, the key steps of the algorithm are then given by:
4. Compute the effectiveness of the step: 5. Decide whether the trust region radius shrinks, expands or stays the same: 6. Decide whether to update the solution: The algorithm stops when either there is no significant change in the θ k 's and the value of the objective function is sufficiently close to zero. Sometimes the algorithm does not converge in the sense that the radius of the trust region becomes very small and yet the value of the objective function remains far from zero. In such a case the algorithm must be restarted with a different initial value.
Furthermore, this algorithm computes the Jacobian matrix of the score equation, which is nothing but the negative of the observed information matrix. Thus, as a by-product of the trust region algorithm, we can obtain standard errors of the θ as the square roots of the diagonal entries of the inverse of the observed information matrix (of the stochastic score equations). Powell (1984) proves convergence to stationary points for φ 0 = 0, i.e. when a step is taken whenever the value of the objective function goes down under some weak assumptions. Moré and Sorensen (1983) and Nocedal and Wright (1999, ch. 4) provide convergence results for φ 0 ∈ (0, 1/4) under stronger assumptions that the objective function is Lipschitz, continuously differentiable and the corresponding Hessian matrix is bounded.

Approximate computations of log REML likelihood function
Some preliminary computations of REML log-likelihood function can be obtained in a matrix-free way using Taylor series expansions. We indicate these computations in the no covariate case (i.e., with T = 0). First, in order to compute the log REML function, we need to find the log determinant of C = λ y F T F+R, where R = λ ψ W ν as given in (2). Using C = (λ y I+R)−λ y (I−F T F) the log determinant of C can then be expressed as: Next, we use the Taylor series expansion and write the second term as where o J is a negligible term. The above is possible because the matrix (I + λ −1 y R) −1 (I − F T F) has spectral radius less than 1. In particular, note that the spectral radii of each of the matrices I − F T F and (I + λ −1 y R) −1 are exactly equal to 1. However, 1 is not an eigenvalue of their product because the matrix λ y F T F + R is non-singular. This also implies that the above Taylor series converges geometrically, a value of J can be set carefully to make the approximation sufficiently accurate.
Next we approximate the trace terms using Huchinson's estimators. Letũ k 's be i.i.d Rademacher vectors. We then approximate the trace as Furthermore, using the two-dimensional DCT, we can compute all of the above quadratic forms in a matrix-free way. This allows us to approximately evaluate the REML log-likelihood function in a matrix-free way. In practice, negative log REML likelihood values can be substantially large, and can cause some numerical instability. However, in practice, we are mostly required to evaluate the difference of the log REML functions at two different parameter values, in which case we can write the difference of the log REML function in terms of term by term differences of quadratic forms. Computing these series of term by term differences is numerically more stable and they converge faster than the individual series of quadratic forms.

Local maxima in REML estimations
The non-linear gamma score equations derived in Section 3.3 characterize the non-convex nature of the REML computations. The trust region method considered in Section 4.4 guarantees convergence to a local maximum, but its convergence to the global maximum requires a more careful investigation, as such convergences are often proved under strong regularity assumptions. Here we present through simulations the performance of the REML estimators, computed using matrix-free trust region method and summarized for two different values of the dependence parameter. Results for further values of the dependence parameter and for varying proportions of missing observations are reported in the supplement to the paper. However, the overall conclusions of these more extensive simulation studies do not change from what we cover here. Thus the simulations summarized below are largely representative of what we expect to see for the entire range of parameter space and provide some critical understanding of the strength and weakness of the REML computations.
The specific details of the simulation runs are as follows. We generate realizations from the spatial mixed linear model (1) on 512 × 512 arrays by setting T = 0, τ = 0, λ ψ = 8, λ y = 4 and using ν = 1.25 and 1.50. This is done in several steps. First, for a fixed value of the dependence parameter ν, a random effect ψ is generated on a 512 × 512 array from a Gaussian distribution with mean 0 and a precision matrix 8W ν with the sum constraint ψ T 1 = 0. Then, each ψ u,v value, independent of others, is removed randomly with probability p = 0.8, and a Gaussian white noise with precision λ y = 4 is added to the remaining ones. This produces a random incidence matrix F and a vector of realization y with an expected sample size of n = 52428. The procedure is then repeated with different values of ν.
Tables 1 and 2 display results of REML computations done on each of the simulated dataset using our matrix-free trust region method with 9 different starting values of the dependence parameter, namely, ν (0) = 0.8, 0.9, 1.0, . . . , 1.6. Throughout we consider K = 50 Rademacher vectors to stochastically approximate the traces in the score equation (11). We then pick the values of λ ψ ) as the starting point we next run the trust region algorithm and obtain final solutions ν, λ y and λ ψ to the REML score equation. In some instances, the estimate of λ y grew increasingly large with successive iterations and we mark it by '∞' in the tables. Thus, in a typical trust-region run '∞' signifies a value that is larger than 10, 000. While the first three columns of each table record the initial values of ν (0) , λ (0) y and λ (0) ψ , the last three columns of each table comprise of the final estimated values ν, λ y and λ ψ .
As we glean through the numbers in these Tables, several conclusions can be drawn: (a) When the initial value ν (0) is close to the true ν value, we find that the estimates of λ ψ get further away from the true parameter values, but the procedure still converges to the same final estimates ν, λ y and λ ψ . There is an interesting pattern here, namely, if we increase the value of ν (0) locally from the true ν value, we force a slightly smoother spatial process, and as a compensation we end up with a larger value of λ (0) ψ but a smaller value of λ (0) y . However, as we run the trust region iterations, the algorithm makes subsequence adjustments to get close to the true values. (c) When the initial value of ν (0) is really far from the true ν value, the non-convex nature of the REML likelihood function takes over, and in some instances, we see that trust region algorithm converges to final values that are far from the true parameter values. Furthermore, it is just that only the global maximum is in the interior of the parameter space and is very close to the true parameter values. However, the second local maximum, found from poor starting points, falls on the boundary of the parameter space with λ y = ∞.

Numeric consistency towards geostatistical models
The purpose of our next computer experiment is to demonstrate that using lattice-based approximations we can obtain inference for the continuum Matérn dependence plus the white noise structures, even when observations are made at irregular sampling locations. The basic idea is to embed the study region and irregular sampling locations into finer and finer lattice arrays by diminishing the lattice spacing and then check for numeric consistency (in terms of convergence in distribution of the estimators of dependence parameters) as we fit the approximations of the continuum Matérn dependence with nugget effect on finer and finer lattices. To this end, we consider the unit square (0, 1) × (0, 1) as the study region and pick 20, 000 points uniformly within this study region as sampling locations. At these randomly generated irregular sampling locations, we then draw a realization from the intrinsic Matérn process with ν = 1.25 and σ −2 = 2/(4π 2 ). We also add a random noise to each observations by generating i.i.d. standard normal random variables. Next, we embed the unit square and the irregular sampling points in an r × r regular square array and estimate parameters using methods detailed in Sections 2, 3 and 4. Table 2 provides these estimates along with their standard errors for various values of r.
The results vindicate the theoretical findings in Section 2. First, the estimates of ν hover around the true value 1.25 as the lattice spacing decreases. Second, the sample size is fixed, but with diminishing lattice spacing, standard errors for ν are essentially constant, which endorse that the procedure is converging in distribution and approximating geostatistical inference. Third, as lattice spacing decreases, λ y gets closer to the true value 1. The slight difference between the lattice-based nugget effect λ y and the geostatistical nugget 1 occurs because of (essentially) constant difference between the variograms of the intrinsic Matérn process and the fractionally differenced process. Fourth, the estimates λ ψ increase with the diminishing lattice spacing. But this increase and its exact nature can be explained by the approximation theory detailed in Section 2. Specifically, it follows from the scaling limit equations (3) that at lattice spacings 1/m and 1/m the ratio (σ 2 m /σ 2 m )(m/m ) 2ν−2 should be close to 1. Thus, for example, between arrays of size 400 × 400 and 500 × 500, this ratio is seen to be (4.946/5.848) × (500/400) 0.5 = 0.946. Overall, the estimates and the standard errors are indicative of numeric convergences of estimators in distribution, and, it can be seen that after appropriate rescaling of parameters, we can obtain geostatistical inference from the lattice-based approximations.

Efficiency of matrix-free REML estimators
In classical statistics, efficiency of an estimator is often judged based on the variance of the estimator and both the Fisher information matrix and the Cramėr-Rao lower bound play a central role in assessing the efficiency of an estimator. In particular, for linear mixed models, it is typical that REML estimators are asymptotically efficient and achieve the Cramėr-Rao lower bound. The question that we ask here is how efficient are our matrix-free estimators. We address  (1) on 256×256 arrays by setting T = 0, τ = 0, λ ψ = 8, λ y = 4 and using ν = 1.25. As in Section 5.1, this is done in several steps. First, for each Monte Carlo sample, a random effect ψ is generated on a 256 × 256 array from a Gaussian distribution with mean 0 and a precision matrix 8W ν with the sum constraint ψ T 1 = 0. Then, each ψ u,v value, independent of others, is removed randomly with probability p = 0.6, and a Gaussian white noise with precision λ y = 4 is added to the remaining ones.
For each Monte Carlo sample, this produces a random incidence matrix F and a vector of realization y with an expected sample size of n = 39322. For each Monte Carlo sample, we then apply our matrix free computations and obtain the corresponding REML estimates ν, λ y and λ ψ and their standard errors by calculating the inverse of observed Fisher information matrix in a matrix-free way. Table 3 provides the Monte Carlo summaries of these estimates and their standard errors. It can be seen that the estimates are very accurate, and the Monte Carlo standard deviations of the estimators match very well with the corresponding Monte Carlo averages of the standard errors of estimates. (There is a slight discrepancy which occurs because only 50 Rademacher vectors are used in approximating the trace terms and this introduces a very slight additional variation in the estimates). We next repeat the above simulation experiment with ν = 1.75 and report the summaries in Table 4. Again we see that Monte Carlo standard deviations of the estimators match very well with the corresponding Monte Carlo averages of the standard errors of estimates. These confirm that there is little loss of statistical efficiency when computations are implemented in a matrix-free way.
The above summaries of computer simulations along with those from Section 5.2 now suggest that we can use lattice-based fractional random fields as proxies for continuum Matérn random fields and when we do so we can still achieve geostatistical inference with little loss of statistical efficiency.

Computational times and practical gains
We now demonstrate a practical gain of our matrix-free method by comparing our actual run times with those of Lindgren et al. (2011) based on iterated nested Laplacian approximations (INLA). It must be noted that the current version of INLA (numbered 0.0-1420281647) can only fit a Matérn dependence model for the first few integer values of the dependence parameter ν. Furthermore, INLA is a Bayesian method where the estimates of parameters can be very sensitive to the choice of prior. Thus, in order to keep the comparison fair, we only consider the run times for the case ν = 1. In a fashion identical to Section 5.1, we simulate observations on r × r arrays with ν = 1, λ y = 4 and λ ψ = 8, and, discard 40% as missing. We then estimate these parameters using our matrix-free method and INLA. Table 5 provides the average run times in seconds of these methods. Specifically, the second column indicates run times for our matrix-free method, and the third and the fourth columns show run times for INLA. These are obtained respectively by running INLA on (1) a laptop with a Intel R i7-4700MQ processor and 8GB of RAM on Linux operating system and (2) a workstation with two Intel R Xeon R E5430 processors and 32GB RAM on a Linux operating system. Furthermore, our matrix-free REML computations are done with some preliminary codes running on MATLAB 8.4 using only a single core of the laptop, while INLA is granted to use four cores on either machine.
We see that the computation time of our matrix-free REML algorithm scales very well with increasing dimension and does so at an approximate rate O(n log n) as suggested by the theory in Section 4. Furthermore because RAM requirement of our matrix-free algorithms scales only linearly with the array size, we find that these computations are possible on an ordinary laptop for moderately large array dimension. In contrast, we see that the computation times of INLA algorithm scales poorly with the array size. And to make things worse, INLA runs out of memory on the laptop even for a moderate array size. Only when it is run on the workstation, it can handle moderately large arrays. Thus, we conclude that we save both time and storage space.
Although we used only a single core to carry out the matrix-free REML optimization, our method can be made much faster with parallel implementation. Furthermore, we can deal with non-integer values of ν, which the sparse computations of Lindgren et al. (2011) can not.

Arsenic mapping in Bangladesh
Arsenic contamination of the groundwater in Bangladesh (and in West Bengal, India) is a serious problem with approximately one in five of the wells used for drinking water currently contaminated with arsenic above the government's drinking water standard (50 mg/L). Following an agreement with the Government of Bangladesh, the British Geological Survey (BGS) and the Department of Public Health Engineering (DPHE) conducted an extensive investigation of the arsenic problem during the period 1998 to 2001 and their website http://www. bgs.ac.uk/research/groundwater/health/arsenic/Bangladesh/data.html provides a rich treasury of groundwater arsenic concentration data. See also BGS and DPHE (2001) for a full report of the survey. The primary focus of the survey was to assess the scale of the groundwater arsenic contamination so that appropriate arsenic mitigation program could be developed. Consequently, a key component of the study was to construct an extensive geographic map of arsenic contamination. Over the years, numerous important works on the cause and damage of arsenic contamination and on the mechanism of arsenic mobilization came out of this study. These include Nickson et al. (2000), Chowdhury et al. (2000), McArthur et al. (2001), Smith et al. (2000), Harvey et al. (2002) and Neumann et al. (2009). Although till date the exact cause of arsenic contamination remained sketchy, scientists speculate that it is the supply of excess oxygen during pumping the tube wells that accelerates hydrolysis of precipitated arsenate and that releases soluble arsenous acid into groundwater.
Here, using methods developed in Sections 3, 4 and 5, we explore mapping groundwater arsenic concentration data collected by BGS andDPHE during 1998 and1999. These data are from 3534 tube wells located at irregularly distributed sites at 61 out of 64 districts in Bangladesh and are measured in parts per billion (ppb which is same as micrograms per liter). The three districts in the south-eastern Bangladesh that are left out of the study are known to be arsenic safe. Overall the arsenic concentration measurements in the study region have a large range with values varying from less than 0.5 ppb to 1660 ppb, and a fraction of measurements are truncated on the right at the instrumental detection limit. The left panel of Figure 2 displays these data after grouping into different categories. Notice that, 0.5ppb is the machine detection limit, 10ppb is the WHO permissible limit, 50ppb is the Bangladesh Governtment permissible limit and 150ppb is believed to be the threshhold above which cancer mortality appears (Lamm et al., 2006). We see arsenic contamination is endemic in the middle southern part of Bangladesh. The south-west and the parts of northern Bangladesh region also have a patchy high arsenic concentration values. Overall, about 42% wells sampled had arsenic concentration more than the World Health Organization's permissible limit of 10 ppb and about a quarter of the wells sampled had arsenic more than the Bangladesh Government permissible limit of 50 ppb.
In order to map groundwater log arsenic concentration, we then embed the study region with latitudes between 20 and 27 degrees north (approximately 778 km in length) and longitudes between 88 to 93 degrees east (approximately between 495 and 522 km in width) into a 500 × 300 array. Thus each array cell is approximately 2.64 square kilometers 1 in area and we tally arsenic concentrations in n = 3224 of these cells, (which is only about 2.15% of total array cells). Moreover, among these n = 3224 array cells, 2941 contain only one sampling location each, 258 contain two sampling locations each, 23 contain three sampling locations each and only 2 contain four sampling locations. Thus, due to clustering nature of the sampling locations, some array cells contained two or more sampling locations even after placing a such large array. To rectify this problem partially, we then take the average of logarithm of the arsenic concentrations over each cell and obtain our data vector y. However, since a small fraction of cells contained multiple sampling locations, in our preliminary analysis we decided against adjusting the residual or nugget vector using the number of sampling locations over which these averages are calculated. We then fit the mixed model (1) with T = 0 to obtain BLUP of ψ −ψ1 where 1 is the rc−vector of all ones, and to obtain REML estimates of λ y , λ ψ and ν.
First, we fit the data with ν = 1. This is mainly because ν = 1 corresponds to fitting a Gaussian autoregression, which is often the standard practice in spatial

Table 6
Estimates of the precision parameters for the ground water log arsenic concentration data.

Initial value
Final Solution statistics. To this end we fix a sample of 20 Rademacher variables and use the Lanczos algorithm and trust-region algorithm described in Section 3 and obtain global REML estimates λ y = 0.906 (with a s.e. 0.074) and λ ψ = 1.527 (with a s.e. 0.106). The left panel in Figure 3 provide the image plots of the BLUP of ψ −ψ1 and the estimates of the residuals. These plots suggests that an intrinsic autoregression model captures the local variation of arsenic log-concentration fairly well. Overall the BLUP of ψ−ψ1 highlights the high arsenic concentration areas in northern Bangladesh, central and southwest Bangladesh. Next we fit the data with an arbitrary dependence parameter ν > 0. Here, we use the same 20 Rademacher variables, but pursue REML computations using 11 different starting values of ν as in the simulation study. We summarize both the initial set of estimates and the final REML estimates in Table 6. As was the case with our simulation study, we found two local maxima; one corresponds to no nugget model with λ y = ∞, ν = 0.858 (with a s.e. 0.016) and λ ψ = 0.660 (with a s.e. 0.020); and other with nugget effect where λ y = 0.596 (with a s.e. 0.040), ν = 1.24 (with a s.e. 0.054) and λ ψ = 4.607 (with a s.e. 1.260). The center and right panels in Figure 3 provide corresponding image plots of the BLUP of ψ −ψ1 at these local maxima. Next we compute the difference of log REML function at these two local maxima and found this difference to be −49.82. This confirms that second set of the estimates are the global REML estimates and they corresponds to a model that includes nugget effects or residual values to capture the small scale variations. Next we reanalyze the data at different lattice resolutions and check for numeric consistency of the results. In particular, we embed the study region and irregular sampling locations into finer and finer lattice arrays by diminishing the lattice spacing but by keeping the aspect ratio (r/c) of the arrays fixed. We then fit the lattice approximations of continuum Matérn dependence with nugget effect to the observed arsenic contamination data. Table 7 provides the estimates of the precision parameters, and their standard errors for various lattice sizes. We find that, at different lattice resolutions, there are little changes in the estimates of ν and λ y after accounting for their uncertainties. The estimates of λ ψ , however, increases with diminishing lattice spacings, but this increase can further be explained by the approximation theory laid out in Section 2. Specifically, the scaling limit equation (3) suggests that at lattice spacings 1/m and 1/m the ratio (σ 2 m /σ 2 m ) × (m/m ) 2ν−2 should be close to 1. Thus, for example, by inserting ν = 1.322, in arrays of size 500 × 300 and 550 × 330, we find that this ratio is equal to 0.923. Overall, these results are largely consistent with what saw in computer generated experiments in Section 5. and we can conclude that the inference drawn from lattice-based approximations actually mimic the inference from the usual continuum intrinsic Matérn dependence structures with nugget effects.
We now extend the arsenic mapping problem further to study various covariate effects. It is believed that water from old and/or shallow tube wells tend to have higher arsenic concentration than water from new and deeper tube-wells. The plots of arsenic concentration against depth (in meters) and age (in years) shown in Figure 4 graphically supports this view. In other words, there is an apparent negative correlation between the arsenic concentration and the depth of the tube-wells and an apparent positive correlation between the arsenic concentration and the age of these tube-wells. Thus, we include the depth and the age of the tube-wells as covariate information into our spatial mixed linear model and make some preliminary investigation on the significance of these covariate  effects and robustness of such analyses to the changes of scales. To this end, Table 8 provides the BLUEs of the age and depth effects, REML estimates of the precision parameters, and the corresponding standard errors for two array sizes. These estimates are obtained by using the same sample of 20 Rademacher variables that we used before for the same array sizes to obtain our REML estimates for the kriging problem. As expected, we find that these covariate effects are significant. But the most striking observation here is that the BLUEs of the covariate effects appear to be robust to any changes of scales. This exact same phenomenon was observed also in Dutta and Mondal (2015a) in the context of agricultural variety trial, and, in the current context this observation reaffirms that even for a moderately small lattice spacings, a fractionally differenced random field plus a white noise on a lattice grid is a very good approximation to the intrinsic Matérn process plus a little white noise. This offers further justifications for the use of lattice-based approximations. Finally, some words should be added on why we did not fit a stationary Matérn covariance to the arsenic concentration data and instead fit a limiting intrinsic version. This question of stationary vs. intrinsic has dominated many discussions of spatial statistics in the past; see e.g., Beran (1992), Besag and Kooperberg (1995), Besag (2002), McCullagh and Clifford (2006) and many others that address the inadequacy of stationary models in various spatial applications. In the arsenic example, we first need to acknowledge that these observations are not point-wise measurements. Rather they corresponds to average arsenic concentrations over water intake areas of the wells. This adds difficulties to fitting a stationary Matérn covariance model. Second, even if we assume that data are point-referenced, we could not estimate MLEs of the covariance parameters as fitting a stationary Matérn model to these data runs into boundary and other numerical problems. Similar to the computer experiments in Section 5.1, we tried to explore basins of attraction to local maxima by doing a grid search on few parameters. But these efforts proved futile and the log-likelihood surface revealed a long flat ridge. Presence of such flat ridges not only complicate the numerical optimization problem but is also indicative of parameter non-interpretability and redundancy.

Discussion
We open the discussion by noting the practical benefits of implementing our lattice-based approximations of continuum spatial models. As we have seen through simulations and REML analysis of arsenic contamination, there is little loss in turning a geo-statistical dataset into an areal one by embedding the study region on to a fine regular grid. In this context, we must recognize that we never measure a spatial random variable at a point in space, but as an average value (or an integral) over a small non-null, perhaps infinitesimal, region. Thus, conceptually, there is no problem in discretizing the space, and treating the observed values as averages over discretized regions. Furthermore, in most spatial applications, the scale of sampling is neither infinitesimally small nor infinitely great. Thus, in a range of applications, it should not be difficult to implement a lattice-based model at a reasonably fine grid and still obtain the same inference that we would have obtained had we fit the corresponding continuum geostatistical model.
As statisticians our rule of computations is simple. We should pursue exact computations when possible. When we can not pursue exact computations directly (for example if the sample size is large), we can look for alternatives that will give us answers that are as good as exact answers. In this context, our matrix-free scalable REML computations are relevant, as it is difficult to pursue exact REML computations in large datasets. Numerical experiments in Section 5.3 show that there is little loss of statistical efficiency due to our matrixfree scalable computations and we can obtain answers that are as good as exact answers.
We believe that the fractional Laplacian differenced random fields on regular arrays (that are presented in this paper) are of interest on their own. Lattice based models, particularly Markov random fields, have played a fundamental role in the development of spatial statistics and the use of fractional differencing will widen the overall scope of lattice models.
In the current paper, we did not consider any stationary models, but they deserve some attention here. Typically, for stationary models, one can pursue REML computations first by embedding the sampling locations in a finer rectangular grid and then embedding the rectangular grid into a much larger torus lattice using block circulant embedding, as proposed in Newsam (1993, 1997) and Wood and Chan (1994). This allows for the use of station-ary models on a torus lattice, where one can take computational advantages of fast Fourier transform (Besag and Moran, 1975) and derive matrix-free scalable REML computations within our h-likelihood framework. From a computational point of view, this h-likelihood framework will be easier to implement than directly maximizing the REML function by adapting the works of Anitescu et al. (2012). Furthermore, the h-likelihood framework will allow us to characterize non-convexity in the optimization in terms of gamma non-linear models and one will be able to obtain useful circulant or other preconditioners. However, the main advantage here is that these computations will be not just matrix-free and scalable but also statistically efficient (in terms of achieving Cramer-Rao lower bound) and thus will yield qualitatively better estimates than those by Fuentes (2007).
In the above context, we must note that, even for stationary models, to date there is no known Whiitle likelihood method that can deal with irregularly sampled observations and produce asymptotically efficient estimators. Thus, the work of Anitescu et al. (2012), Dutta and Mondal (2015a), this paper and the computations mentioned above for the stationary case are a significant step forward.
In arsenic data example, Gaussian distribution is used as an approximation, as only a small fraction of the data is truncated to the right. If there is a concern that this right-censoring can affect the Gaussianity assumption, one can further consider a non-linear model using truncated Gaussian distribution and obtain appropriate statistical estimation and inference. In this regard, it must be noted that REML estimation applies only for Gaussian models and not for truncated Gaussian models. Thus, although REML estimation can not be extended to a non-linear truncated Gaussian model, one can still derive meaningful statistical estimation and inference extending the computations we have developed in this paper.
It is worthwhile to mention that there are many interesting directions in which we can take our research forward in the future. One direction is to develop matrix-free methods for conditional simulations of the spatial effect ψ. The works of Borici (2000), Schneider and Willsky (2003), Parker and Fox (2012) and Aune et al. (2013) have paved ways for matrix-free Lanczos methods for solving equations of the form A 1/2 x = b, where A is a positive definite matrix that can be multiplied with a vector in matrix-free way in less than O(n 2 ) computations. By applying these works, we can thus advance conditional simulations of ψ. Specifically note that the conditional precision matrix of ψ is λ y F T F + λ ψ W ν , which can be multiplied with a vector in matrix-free way in O(n log n) computations. Thus such conditional simulations will allow us to make various statistical inference about the underlying latent spatial random field. At the same time, they would help us develop novel Bayesian computations, as such simulations can be implemented for block Gibbs updates or within Matropolis-Hasting steps or other Markov chain Monte Carlo computations.
Another direction will be to combine the current work with that of Mondal (2013) to develop fractionally differenced conditional autoregressive spatial models. This will allow for advancement of Box-Jenkins type methodology in spatial statistics.
A third direction is to develop more complex spatial models that can accommodate anisotropy and heterogeneity. In reality, there are several reasons why substantial anisotropy and heterogeneity may be present in arsenic concentration. For example, aquifers may not be homogeneous, the ages and the depths of the wells can vary from place to place. Similarly, demographic variables such as population density, agricultural practices and industrial variables may also affect local arsenic contamination. Furthermore, it is typical that older or deeper wells supply more oxygen to the aquifers and accelerate hydrolysis arsenates to a greater extent. It is also typical that the underground water usage is high where population density is high or where the land is used for agriculture and these factors have an adverse effect on contamination. Thus it would be of interest to think how we can collect more covariate information and how we can develop a more elaborate spatial model that can accommodate different sources of anisotropy and heterogeneity.
We refer to the supplement of the paper for numeric comparisons of γ D (s, t) and γ D,m (s, t).