Likelihood Approximation With Hierarchical Matrices For Large Spatial Datasets

We use available measurements to estimate the unknown parameters (variance, smoothness parameter, and covariance length) of a covariance function by maximizing the joint Gaussian log-likelihood function. To overcome the cubic complexity in the linear algebra, we approximate the discretized covariance function in the hierarchical (H-) matrix format. The H-matrix format has a log-linear computational cost and storage $O(kn \log n)$, where the rank $k$ is a small integer and $n$ is the number of locations. The H-matrix technique allows us to work with general covariance matrices in an efficient way, since H-matrices can approximate inhomogeneous covariance functions on a fairly general mesh that is not necessarily axes-parallel, and neither the covariance matrix itself nor its inverse has to be sparse. We demonstrate our method with Monte Carlo simulations and an application to soil moisture data. The C and C++ codes and the data are freely available.


Introduction
The number of measurements that must be processed for statistical modeling in environmental applications is usually very large, and these measurements may be located irregularly across a given geographical region. This makes the computing procedure expensive and the data difficult to manage. These data are frequently modeled as a realization from a stationary Gaussian spatial random field. Specifically, we let $Z = \{Z(s_1), \dots, Z(s_n)\}^\top$, where $Z(s)$ is a Gaussian random field indexed by a spatial location $s \in \mathbb{R}^d$, $d \geq 1$. Then, we assume that $Z$ has mean zero and a stationary parametric covariance function $C(h; \theta) = \mathrm{cov}\{Z(s), Z(s + h)\}$, where $h \in \mathbb{R}^d$ is a spatial lag vector and $\theta \in \mathbb{R}^q$ is the unknown parameter vector of interest. Statistical inferences about $\theta$ are often based on the Gaussian log-likelihood function:
$$\mathcal{L}(\theta) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log|C(\theta)| - \frac{1}{2} Z^\top C(\theta)^{-1} Z, \qquad (1)$$
where the covariance matrix $C(\theta)$ has entries $C(s_i - s_j; \theta)$, $i, j = 1, \dots, n$. The maximum likelihood estimator of $\theta$ is the value $\hat\theta$ that maximizes (1). When the sample size $n$ is large, the evaluation of (1) becomes computationally prohibitive, since the required Cholesky factorization costs $O(n^3)$ operations and $O(n^2)$ memory. The H-matrix format alleviates this: there are efficient rank-truncation techniques to keep the rank small after matrix operations; for instance, the Schur complement and the matrix inverse can again be approximated in the H-matrix format.
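For illustration, (1) can be evaluated through a Cholesky factorization $C = LL^\top$, using $\log|C| = 2\sum_i \log L_{ii}$ and $Z^\top C^{-1} Z = \|v\|^2$ with $Lv = Z$. The following is a minimal dense-matrix sketch in Python (a toy illustrating the formula, not the H-matrix implementation used in this paper; the function name is ours):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def gaussian_loglik(C, z):
    """Evaluate the Gaussian log-likelihood (1) via C = L L^T:
    log|C| = 2*sum(log L_ii) and z^T C^{-1} z = ||v||^2, where L v = z."""
    L = cholesky(C, lower=True)              # O(n^3) for a dense matrix
    v = solve_triangular(L, z, lower=True)   # forward substitution
    n = len(z)
    return -0.5 * n * np.log(2.0 * np.pi) \
           - np.sum(np.log(np.diag(L))) - 0.5 * (v @ v)
```

The $O(n^3)$ cost of this dense factorization is precisely what the H-matrix approximation reduces to log-linear complexity.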
Figure 2(a) shows an H-matrix approximation of the covariance matrix from a discretized (n = 16,641) exponential covariance function on the unit square, its Cholesky approximation in 2(b), and its H-matrix inverse in 2(c). The dark (red) blocks indicate dense matrices and the grey (green) blocks indicate rank-k matrices; the number inside each block is its rank. The steps inside the blocks show the decay of the singular values on a log scale. The white blocks are empty. In recent years, there has been great interest in numerical methods for representing and approximating large covariance matrices in the applied mathematics community [61,12,68,57,2,3,70,6]. Recently, the maximum likelihood estimator for parameter-fitting Gaussian observations with a Matérn covariance matrix was computed via a framework for unstructured observations in two spatial dimensions, which allowed the evaluation of the log-likelihood and its gradient with computational complexity $O(n^{3/2})$; the method relied on the recursive skeletonization factorization procedure [36,54]. However, the consequences of the approximation on the maximum likelihood estimator were not studied.
In [12], the authors computed the exact solution of a Gaussian process regression by replacing the kernel matrix with a data-sparse approximation, the so-called H²-matrix technique, cf. [46]. It is more complicated than the H-matrix technique, but has computational complexity and storage cost of O(kn).
The same H²-matrix technique was demonstrated in [2] for solving large-scale stochastic linear inverse problems with applications in subsurface modeling. The authors explored the sparsity of the underlying measurement operator, demonstrated the effectiveness of their approach by solving a realistic crosswell tomography problem, quantified the uncertainty in the solution, and provided an optimal capture geometry for this problem. Their algorithm was implemented in C++ and is available online.
The authors of [70] observed that the structure of shift-invariant kernels changes from low-rank to block-diagonal (without any low-rank structure) as the scale parameter varies. Based on this observation, they proposed a new kernel approximation algorithm, the Memory-Efficient Kernel Approximation, which considers both the low-rank and the clustering structure of the kernel matrix. They also showed that the resulting algorithm outperforms state-of-the-art low-rank kernel approximation methods in speed, approximation error, and memory usage. The BigQUIC method for sparse inverse covariance estimation with a million variables was introduced in [37]. This method can solve ℓ₁-regularized Gaussian maximum likelihood estimation (MLE) problems of dimension one million. In [6], the authors estimated the covariance matrix of a set of normally distributed random vectors. To overcome numerical issues in the high-dimensional regime, they computed the (dense) inverses of sparse covariance matrices using H-matrices. This explicit representation enabled them to ensure the positive definiteness of each Newton-like iterate in the resulting convex optimization problem. In conclusion, the authors compared their new H-QUIC method with the existing BigQUIC method [37]. In [35], the authors proposed using H-matrices for the approximation of random fields. In particular, they approximated Matérn covariance matrices in the H-matrix format, suggested the pivoted H-Cholesky method, and provided an a posteriori error estimate in the trace norm. In [68,66], the authors applied H-matrices to linear inverse problems in large-scale geostatistics. They started with a detailed explanation of the H-matrix technique, then reduced the cost of dense matrix-vector products, combined H-matrices with a matrix-free Krylov subspace method, and solved the system of equations that arises from the geostatistical approach.
They illustrated the performance of their algorithm on an application, for monitoring CO 2 concentrations using crosswell seismic tomography. The largest problem size was n = 10 6 . The code is available online.
In [3], the authors considered a new matrix factorization for Hierarchical Off-Diagonal Low-Rank (HODLR) matrices. A HODLR matrix is an H-matrix with the weak admissibility condition (see Appendix B). All off-diagonal sub-blocks of a HODLR matrix are low-rank matrices, and this fact significantly simplifies many algorithms. The authors showed that typical covariance functions can be hierarchically factored into a product of block low-rank updates of the identity matrix, yielding an O(n log² n) algorithm for inversion, and used this product to evaluate the determinant at a cost of O(n log n), with further direct calculation of probabilities in high dimensions. Additionally, they used this HODLR factorization to speed up prediction and to infer unknown hyper-parameters. The provided formulas for the factorization and the determinant hold only for HODLR matrices. How to extend them to general H-matrices, where off-diagonal blocks may themselves be H-matrices, is not clear. For instance, conditional covariance matrices may easily lose the HODLR structure. The largest demonstrated problem size was n = 10⁶. The authors made their code freely available on GitHub.
Summarizing the literature review, we conclude that H-matrices are already known to the statistical community. It is well known that Matérn covariance functions can be approximated in hierarchical matrix formats (H, H², HODLR). In [6], the authors used H-matrices to compute derivatives of the likelihood function. Approximating the inverse and the Cholesky factorization works in practice, but more theoretical analysis of the error estimates is needed (the existing theory is mostly developed for elliptic PDEs and integral equations). The influence of the H-matrix approximation error on the quality of the estimated unknown parameters is not yet well studied. Investigating this influence with general H-matrices (and not only with the simpler HODLR format) is the main contribution of this work.
The rest of this paper is structured as follows. In Section 2, we review the H-matrix technique and the H-matrix approximation of Matérn covariance functions. Section 3 contains the hierarchical approximation of the Gaussian likelihood function, the algorithm for parameter estimation, and a study of the theoretical properties and complexity of the approximate log-likelihood [29,39]. Results from Monte Carlo simulations, where the true parameters are known, are reported in Section 4. An application of the hierarchical log-likelihood methodology to soil moisture data, where the parameters are unknown, is considered in Section 5. We end the paper with a discussion in Section 6. The derivations of the theoretical results are provided in Appendix A, and more details about H-matrices in Appendix B.

Hierarchical Approximation of Covariance Matrices
Hierarchical matrices have been described in detail in [32,31,27,33,47]. Applications of the H-matrix technique to covariance matrices can be found in [4,66,68,12,35,2,3,43]. Many implementations exist. To the best of our knowledge, the HLIB library 1 is no longer supported, but it remains well suited for academic purposes. The H2Lib library 2 operates with H²-matrices, is actively supported, and contains new ideas such as a parallel implementation on GPUs. The HLIBPro library 3 is an actively supported, robust, parallel, highly tuned, and well-tested commercial library, but it is not open source. It contains about 10 open-source examples, which can be modified for the user's needs. There are other implementations (e.g., from M. Bebendorf, or Matlab implementations), but we do not have experience with them. We recommend that new users start with the free open-source HLIB or H2Lib libraries and then move to HLIBPro, which is free for academic purposes.

Hierarchical matrices
In this section, we review the definition of H-matrices and show how to approximate covariance matrices using the H-matrix format. The H-matrix technique is defined as a hierarchical partitioning of a given matrix into sub-blocks followed by the further approximation of the majority of these sub-blocks by low-rank matrices (see Fig. 2). To define which sub-blocks can be approximated well by low-rank matrices and which cannot, a so-called admissibility condition is used (see more details in Appendix B). Different admissibility conditions are possible: weak, strong, domain-decomposition based, and others (Fig. 5). The admissibility condition may depend on parameters (e.g., the covariance length).
To define an H-matrix, some other notions are needed: an index set I, clusters (denoted by τ and σ), a cluster tree T I, a block cluster tree T I×I, and an admissibility condition (see Appendix B). We start with the index set I = {0, . . . , n − 1}, which corresponds to the available measurements at n locations. After the hierarchical decomposition of the index set I into sub-index sets has been completed (in other words, after a cluster tree T I has been constructed), the block cluster tree (denoted by T I×I; see more details in Appendix B) together with the admissibility condition decides which sub-blocks can be approximated by low-rank matrices. For definitions and examples of cluster trees and the corresponding block cluster trees (block partitionings), see Appendix B and Fig. 6.
In the first step, the matrix is divided into four sub-blocks; the block cluster tree T I×I determines how to divide it. Then each (or some) sub-block(s) is (are) divided again and again until the sub-blocks are sufficiently small, so the resulting division is hierarchical. The procedure stops when either a sub-block reaches the size n min or smaller (n min ≤ 128), or the sub-block can be approximated by a low-rank matrix.
Another important question is how to compute these low-rank approximations. The HLIB library uses a well-known method, the Adaptive Cross Approximation (ACA) algorithm [26,10,13], which performs the approximation with linear complexity O(kn), in contrast to O(n³) for the SVD.
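A simplified sketch of ACA with partial pivoting is given below (Python/NumPy; a basic textbook variant of the algorithm in [26,10,13] with naive pivoting and stopping rules, not the HLIB implementation):

```python
import numpy as np

def aca(get_entry, n, m, eps=1e-8, max_rank=50):
    """Adaptive Cross Approximation with partial pivoting (sketch).
    Builds U (n x k) and V (m x k) with A ~ U @ V.T, touching only
    O(k(n + m)) entries of A instead of forming the full block."""
    U, V = [], []
    i = 0  # current row pivot
    for _ in range(max_rank):
        # residual of row i: A[i, :] minus the rank-1 terms built so far
        row = np.array([get_entry(i, j) for j in range(m)])
        for u, v in zip(U, V):
            row = row - u[i] * v
        j = int(np.argmax(np.abs(row)))
        if abs(row[j]) < 1e-14:
            break
        col = np.array([get_entry(k_, j) for k_ in range(n)])
        for u, v in zip(U, V):
            col = col - v[j] * u
        U.append(col / row[j])
        V.append(row)
        # stop when the last rank-1 update is below the tolerance
        if np.linalg.norm(U[-1]) * np.linalg.norm(V[-1]) < eps:
            break
        mask = np.abs(U[-1]).copy()
        mask[i] = 0.0
        i = int(np.argmax(mask))  # next row pivot
    return np.column_stack(U), np.column_stack(V)
```

On an admissible block (kernel evaluated on well-separated point sets), the residual decays rapidly with the rank, so only a few crosses are needed.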
Remark 2.1. Errors in the H-matrix approximation may destroy the symmetry of a symmetric positive definite covariance matrix, causing symmetric blocks to have different ranks. As a result, the standard algorithms used to compute the Cholesky decomposition may fail. A remedy is to redefine $C := \frac{1}{2}(C + C^\top)$.
Remark 2.2. Errors in the H-matrix approximation may destroy the positive definiteness of the covariance matrix, especially for matrices with very small (very close to zero) eigenvalues. Remedies are: 1) to use a more robust algorithm, e.g., a block H-Cholesky; 2) to use the $LDL^\top$ factorization instead of $LL^\top$; or 3) to add a positive diagonal $\tau^2 I$ to C.
Remark 2.3. We use both notations $C \in \mathbb{R}^{n\times n}$ and $C \in \mathbb{R}^{I\times I}$. In this work they coincide, since |I| = n. The notation $C \in \mathbb{R}^{I\times I}$ is useful when it is important to distinguish between $C \in \mathbb{R}^{I\times I}$ and $C \in \mathbb{R}^{J\times J}$, where I and J are two different index sets of the same size.
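The remedies of Remarks 2.1 and 2.2, symmetrization and adding a nugget to the diagonal, can be sketched as follows (Python/NumPy; the function name and the default nugget value are ours):

```python
import numpy as np

def repair_spd(C_h, nugget=1e-6):
    """Remedies of Remarks 2.1-2.2: re-symmetrize an (approximate) covariance
    matrix and add a nugget tau^2 * I to push the eigenvalues away from zero."""
    C_sym = 0.5 * (C_h + C_h.T)                   # Remark 2.1: C := (C + C^T)/2
    return C_sym + nugget * np.eye(C_h.shape[0])  # Remark 2.2, remedy 3)
```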
To define the class of H-matrices, we assume that the cluster tree T I and block cluster tree T I×I are already constructed.
Definition 2.1. We let I be an index set and T I×I a hierarchical division of the index set product I × I into sub-blocks. The set of H-matrices with maximal sub-block rank k is
$$\mathcal{H}(T_{I\times I}; k) := \{C \in \mathbb{R}^{I\times I} : \operatorname{rank}(C|_b) \le k \text{ for all admissible sub-blocks } b \in T_{I\times I}\},$$
where k is the maximum rank. Here, $C|_b = (c_{ij})_{(i,j)\in b}$ denotes the matrix block of $C = (c_{ij})_{i,j\in I}$ corresponding to the sub-block b ∈ T I×I.
Blocks that satisfy the admissibility condition can be approximated by low-rank matrices; see [31]. An H-matrix approximation of C is denoted by $\tilde{C}$. The storage requirement of $\tilde{C}$ and the cost of a matrix-vector multiplication are O(kn log n), the matrix-matrix addition costs O(k²n log n), and the matrix-matrix product and the matrix inverse cost O(k²n log² n); see [31].
Remark 2.4. There are two different H-matrix approximation strategies [13,32]. 1) The fixed-rank strategy (Fig. 2), i.e., each sub-block has maximal rank k. This rank may or may not be sufficient, but it simplifies the theoretical estimates of computing time and storage. In this case, we write $\tilde{C} \in \mathcal{H}(T_{I\times I}; k)$. 2) The adaptive-rank strategy (Fig. 5, left sub-block), i.e., each sub-block M is approximated by a low-rank matrix $\tilde{M} \in \mathcal{R}(\tilde{k}, n, m)$ of the smallest rank $\tilde{k}$ such that the absolute accuracy (in the spectral norm) is ε or better (smaller), $\|M - \tilde{M}\|_2 \le \varepsilon$, where $\mathcal{R}(\tilde{k}, n, m) := \{M \in \mathbb{R}^{n\times m} \mid \operatorname{rank}(M) \le \tilde{k}\}$ is the set of low-rank matrices of size n × m with maximal rank $\tilde{k}$. The second strategy allows an adaptive rank in each sub-block, but it makes the total computing cost and storage difficult to estimate. In this case, we write $\tilde{C} \in \mathcal{H}(T_{I\times I}; \varepsilon)$.
The fixed rank strategy is useful for a priori evaluations of the computational resources and storage memory. The adaptive rank strategy is preferable for practical approximations and is useful when the accuracy in each sub-block is crucial. Similarly, we introduce L(θ; k) and L(θ; ε) as H-matrix approximations of the log-likelihood L from (1). Sometimes we skip k (or ε) and write L(θ).
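On the level of a single sub-block M, both strategies amount to a truncated singular value decomposition: the fixed-rank strategy keeps the leading k terms, while the adaptive-rank strategy keeps the smallest rank meeting the spectral-norm tolerance ε. A Python/NumPy sketch (in practice, H-matrix libraries use cheaper truncation techniques such as ACA; the function name is ours):

```python
import numpy as np

def truncate(M, eps=None, k=None):
    """Truncated SVD of a sub-block M: fixed rank k, or adaptive rank
    chosen as the smallest rank with spectral-norm error at most eps."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    if k is None:
        k = int(np.sum(s > eps))           # dropped part has 2-norm s[k] <= eps
    return U[:, :k] * s[:k], Vt[:k, :]     # M ~ A @ B, A is n x k, B is k x m
```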
Thus, we conclude that different H-matrix formats exist. For example, weakly admissible matrices (HODLR) result in a simpler block structure (Fig. 5, right), but the H-matrix ranks can be larger than those obtained with the standard admissibility condition. In [33], the authors observed a factor of ≈ 3 between the ranks of weakly and standard admissible sub-blocks. In [47], the author observed that the HODLR matrix format may fail or result in very large ranks for 2D/3D computational domains or when computing the Schur complement. The H² matrix format removes the log factor in the complexity and storage, but it is harder to understand and implement. We note that conversions from the H to the H² matrix format are implemented (see the HLIB and H2Lib libraries).

Matérn covariance functions
Among the many covariance models available, the Matérn family [55] has gained widespread interest in recent years. The Matérn form of spatial correlations was introduced into statistics as a flexible parametric class, with one parameter determining the smoothness of the underlying spatial random field [34]. The varied history of this family of models can be found in [30].
The Matérn covariance depends only on the distance $h := \|s - s'\|$, where $s$ and $s'$ are any two spatial locations. The Matérn class of covariance functions is defined as
$$C(h; \theta) = \frac{\sigma^2}{2^{\nu-1}\Gamma(\nu)} \left(\frac{h}{\ell}\right)^{\nu} K_\nu\!\left(\frac{h}{\ell}\right), \qquad (2)$$
with $\theta = (\sigma^2, \ell, \nu)^\top$, where $\sigma^2$ is the variance; $\nu > 0$ controls the smoothness of the random field, with larger values of $\nu$ corresponding to smoother fields; and $\ell > 0$ is a spatial range parameter that measures how quickly the correlation of the random field decays with distance, with larger $\ell$ corresponding to a slower decay (keeping $\nu$ fixed) [35]. Here, $K_\nu$ denotes the modified Bessel function of the second kind of order $\nu$. When $\nu = 1/2$, the Matérn covariance function reduces to the exponential covariance model and describes a rough field. The value $\nu = \infty$ corresponds to a Gaussian covariance model that describes a very smooth, infinitely differentiable field. Random fields with a Matérn covariance function are $\lceil\nu\rceil - 1$ times mean square differentiable.
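For reference, the Matérn model (2) can be evaluated with SciPy's modified Bessel function $K_\nu$ (a direct transcription of the formula; the function name is ours):

```python
import numpy as np
from scipy.special import gamma, kv

def matern(h, sigma2=1.0, ell=0.1, nu=0.5):
    """Matern covariance (2): sigma2 * 2^{1-nu}/Gamma(nu) * (h/ell)^nu * K_nu(h/ell),
    with the limit C(0) = sigma2 handled separately."""
    h = np.atleast_1d(np.asarray(h, dtype=float))
    r = h / ell
    out = np.full_like(r, sigma2)   # C(0) = sigma^2
    nz = r > 0
    out[nz] = sigma2 * 2.0**(1.0 - nu) / gamma(nu) * r[nz]**nu * kv(nu, r[nz])
    return out
```

For $\nu = 1/2$ this reduces to the exponential model $\sigma^2 \exp(-h/\ell)$, since $K_{1/2}(x) = \sqrt{\pi/(2x)}\, e^{-x}$.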

Matérn covariance matrix approximation
By definition, covariance matrices are symmetric and positive semi-definite. The decay of the eigenvalues (as well as the H-matrix approximation) depends on the type of the covariance matrix and its smoothness, covariance length, computational geometry, and dimensionality. In this section, we perform numerical experiments with H-matrix approximations.
To underline the fact that H-matrix approximations can be applied to irregular sets of locations, we use irregularly spaced locations in the unit square (only for Tables 2 and 3):
$$\left(\frac{i - 0.5 + X_{ij}}{m},\ \frac{j - 0.5 + Y_{ij}}{m}\right), \qquad i, j = 1, \dots, m, \qquad (3)$$
where $X_{ij}, Y_{ij}$ are i.i.d. uniform on (−0.4, 0.4), and $n = m^2$ is the total number of observations; see [77]. The observations are ordered lexicographically by the ordered pairs (i, j).
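The perturbed grid (3) can be generated as follows (Python/NumPy sketch; the function name and seed handling are ours):

```python
import numpy as np

def perturbed_grid(m, seed=0):
    """n = m^2 irregular locations in the unit square, as in (3): a regular
    m x m grid jittered by i.i.d. Uniform(-0.4, 0.4) offsets in grid units."""
    rng = np.random.default_rng(seed)
    i, j = np.meshgrid(np.arange(1, m + 1), np.arange(1, m + 1), indexing='ij')
    X = rng.uniform(-0.4, 0.4, size=(m, m))
    Y = rng.uniform(-0.4, 0.4, size=(m, m))
    return np.column_stack([((i - 0.5 + X) / m).ravel(),   # lexicographic
                            ((j - 0.5 + Y) / m).ravel()])  # order in (i, j)
```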
All of the numerical experiments herein are performed on a Dell workstation with 20 processors (40 cores) and a total of 128 GB RAM. The parallel H-matrix library HLIBPro [45] was used to build the Matérn covariance matrix, compute the Cholesky factorization, solve a linear system, and calculate the determinant and the quadratic form. HLIBPro is fast and efficient; see the theoretical parallel scalability in Table 1. Here, V(T) denotes the set of vertices, L(T) the set of leaves in the block cluster tree T = T I×I, and n min the size of a block at which we stop further division into sub-blocks (see Section 2.1). Usually n min ∈ {32, 64, 128}, since a very deep hierarchy slows down computations.

Table 1: Sequential and parallel complexities of the main H-matrix operations; see [45].
Table 2 shows the H-matrix approximation errors for $\log|\tilde{C}|$, $\tilde{C}$, and the inverse $(\tilde{L}\tilde{L}^\top)^{-1}$, where $\tilde{L}$ is the H-Cholesky factor. The errors are computed in the Frobenius and spectral norms, where C is the exact Matérn covariance matrix with ν = 0.5 and σ² = 1. The local accuracy in each sub-block is ε, and the number of locations is n = 16,641. The last column shows the total compression ratio, c.r., which is equal to $1 - \mathrm{size}(\tilde{C})/\mathrm{size}(C)$. The exact values are log |C| = 2.63 for ℓ = 0.0334 and log |C| = 6.36 for ℓ = 0.2337. The locations are uniformly distributed mesh points in the unit square, perturbed as in (3).

Convergence of the H-matrix error vs. the rank k
In Table 3 we show the dependence of the Kullback-Leibler divergence (KLD) and two matrix errors on the H-matrix rank k for the Matérn covariance function with parameters ℓ = {0.25, 0.75} and ν = 1.5, computed on the domain G = [0, 1]². The KLD between the zero-mean Gaussian densities with covariance matrices C and $\tilde{C}$ was computed as
$$\mathrm{KLD}(C, \tilde{C}) = \frac{1}{2}\left\{\operatorname{trace}(\tilde{C}^{-1} C) - n + \log\frac{|\tilde{C}|}{|C|}\right\}.$$
We can bound the relative error for the approximation of the inverse as
$$\frac{\|C^{-1} - \tilde{C}^{-1}\|_2}{\|C^{-1}\|_2} \le \|I - \tilde{C}^{-1} C\|_2,$$
since $C^{-1} - \tilde{C}^{-1} = (I - \tilde{C}^{-1}C)\,C^{-1}$. The spectral norm of $(I - \tilde{C}^{-1} C)$ can be estimated by a few steps of the power iteration method (the HLIB library uses 10 steps). A rank k ≤ 20 is not sufficient to approximate the inverse, and the resulting error $\|C\tilde{C}^{-1} - I\|_2$ is large. One remedy is to increase the rank, but this may not help for ill-conditioned matrices. The spectral norms are $\|C_{\ell=0.25}\|_2 = 720$ and $\|C_{\ell=0.75}\|_2 = 1068$.
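For moderate n, the KLD used in Table 3 can be computed directly with dense linear algebra (Python/NumPy sketch; the function name is ours):

```python
import numpy as np

def kld_gaussian(C, C_approx):
    """KLD between N(0, C) and N(0, C_approx):
    0.5 * ( trace(C_approx^{-1} C) - n + log|C_approx| - log|C| )."""
    n = C.shape[0]
    S = np.linalg.solve(C_approx, C)            # C_approx^{-1} C
    _, logdet_a = np.linalg.slogdet(C_approx)
    _, logdet = np.linalg.slogdet(C)
    return 0.5 * (np.trace(S) - n + logdet_a - logdet)
```

The KLD is zero exactly when the two covariance matrices coincide, and strictly positive otherwise.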
Remark 2.5. Very often, the nugget $\tau^2 I$ is a part of the model or is just added to the main diagonal of C to stabilize numerical calculations, i.e., $C := C + \tau^2 I$ [52]. By adding a nugget, we "push" all the singular values away from zero. Adding the nugget effect to the list of unknown parameters, $\theta = (\ell, \nu, \sigma^2, \tau)^\top$, is straightforward. See the modified procedure in the GitHub repository [48].

Parameter estimation
We use the H-matrix technique to approximate the Gaussian likelihood function. The H-matrix approximation of the exact log-likelihood $\mathcal{L}(\theta)$ defined in (1) is denoted by $\tilde{\mathcal{L}}(\theta; k)$:
$$\tilde{\mathcal{L}}(\theta; k) = -\frac{n}{2}\log(2\pi) - \sum_{i=1}^{n}\log \tilde{L}_{ii}(\theta; k) - \frac{1}{2}\|v(\theta)\|_2^2, \qquad (4)$$
where $\tilde{L}(\theta; k)$ is the rank-k H-matrix approximation of the Cholesky factor $L(\theta)$ in $C(\theta) = L(\theta)L(\theta)^\top$, and $v(\theta)$ is the solution of the linear system $\tilde{L}(\theta; k)v(\theta) = Z$. Analogously, we can define $\tilde{\mathcal{L}}(\theta; \varepsilon)$.
To maximize $\tilde{\mathcal{L}}(\theta; k)$ in (4), we use Brent's method [14,60], also known as Brent-Dekker 4. The Brent-Dekker algorithm first uses the fast-converging secant method or inverse quadratic interpolation to maximize $\tilde{\mathcal{L}}(\theta; \cdot)$. If those do not work, it falls back to the more robust bisection method. We note that the maximization of the log-likelihood function is an ill-posed problem, since even very small perturbations in the covariance matrix $C(\theta)$ may result in huge perturbations in the log-determinant and the log-likelihood. An alternative to $C(\theta) = L(\theta)L(\theta)^\top$ is the $L(\theta)D(\theta)L(\theta)^\top$ decomposition, which is more stable since it avoids extracting square roots of the diagonal elements, i.e., $LDL^\top = (LD^{1/2})(LD^{1/2})^\top$. Very small negative diagonal elements can appear due to, e.g., round-off errors.
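As an illustration of the one-dimensional search, the sketch below minimizes the negative exponential-model (ν = 1/2) log-likelihood in ℓ using SciPy's bounded Brent-type minimizer (a dense toy in place of the H-matrix arithmetic; SciPy's 'bounded' method is a bracketed Brent-style search, not the exact Brent-Dekker routine of [14,60], and the function names are ours):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.linalg import cholesky, solve_triangular

def neg_loglik_ell(ell, s, z, sigma2=1.0, nugget=1e-8):
    """Negative log-likelihood (up to a constant) of an exponential
    (Matern nu = 1/2) model in 1D, as a function of the range ell."""
    h = np.abs(s[:, None] - s[None, :])
    C = sigma2 * np.exp(-h / ell) + nugget * np.eye(len(s))
    L = cholesky(C, lower=True)
    v = solve_triangular(L, z, lower=True)
    return np.sum(np.log(np.diag(L))) + 0.5 * (v @ v)

def fit_ell(s, z, bounds=(1e-3, 2.0)):
    """Bracketed 1-D search over ell (one coordinate of theta)."""
    res = minimize_scalar(lambda t: neg_loglik_ell(t, s, z),
                          bounds=bounds, method='bounded')
    return res.x
```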

Computational complexity and accuracy
We let $C(\theta) \in \mathbb{R}^{n\times n}$ be approximated by an H-matrix $\tilde{C}(\theta; k)$ with maximal rank k. The H-Cholesky decomposition of $\tilde{C}(\theta; k)$ costs O(k²n log² n). The solution of the linear system $\tilde{L}(\theta; k)v(\theta) = Z$ costs O(k²n log² n). The log-determinant $\log|\tilde{C}(\theta; k)| = 2\sum_{i=1}^{n} \log \tilde{L}_{ii}(\theta; k)$ is available for free. The cost of computing the log-likelihood function $\tilde{\mathcal{L}}(\theta; k)$ is O(k²n log² n), and the cost of computing the MLE in m iterations is O(mk²n log² n).
Once we observe a realization z of the random vector Z, we can quantify the accuracy of the H-matrix approximation of the log-likelihood function. Our main theoretical result is formulated below in Theorem 1.

Theorem 1. (Accuracy of the hierarchical log-likelihood)
We let $\tilde{C}(\theta)$ be an H-matrix approximation of the matrix $C(\theta) \in \mathbb{R}^{n\times n}$, and Z = z a data vector. We let also the spectral radius $\rho(\tilde{C}(\theta)^{-1}C(\theta) - I) < \varepsilon < 1$. Then bounds of order nε on the approximation errors of the log-determinant, the quadratic form, and hence the log-likelihood $|\mathcal{L}(\theta) - \tilde{\mathcal{L}}(\theta)|$ hold. Proof: See Appendix A. The estimate in Remark A.1 states how fast the probability decays as the norm of Z increases. For simplicity, in the following we assume $\|z\|_2 \le c_0$. In [6], the authors observed that the bound n in the inequalities above is pessimistic and is hardly observed in numerical simulations. Though Theorem 1 is shown for the fixed-rank arithmetic, it also holds for the adaptive-rank arithmetic with $\tilde{C}(\theta; \varepsilon)$ and $\tilde{L}(\theta; \varepsilon)$.

Monte Carlo Simulations
We performed numerical experiments with simulated data to recover the true values of the parameters of the Matérn covariance matrix, known to be $\theta^* := (\ell^*, \nu^*, \sigma^{*2}) = (0.7, 0.9, 1.0)$. In the first step, we constructed 50 independent data sets (replicates) of size n ∈ {2, 4, . . . , 64, 128} × 1,000 by multiplying the Cholesky factor $\tilde{L}(\theta^*; \varepsilon = 10^{-10})$, where $C(\theta^*) = L(\theta^*)L(\theta^*)^\top$, by a Gaussian random vector W ∼ N(0, I). We took the locations (not the data) from the daily moisture example, which is described below in Sect. 5. After that, we ran the optimization algorithm and tried to identify (recover) the "unknown" parameters $(\ell^*, \nu^*, \sigma^{*2})$. The boxplots for each parameter over the 50 replicates are plotted against n in Figure 3. We are able to estimate the unknown parameters of a Matérn covariance function on an irregular grid with a certain accuracy. The parameter ν (Fig. 3(b)) was identified more accurately than the parameters ℓ (Fig. 3(a)) and σ² (Fig. 3(c)). It is difficult to say why we do not see clear patterns for ℓ and σ². We believe there are a few reasons. First, the iterative optimization method had difficulty converging for large n (we needed to significantly decrease the threshold and increase the maximal number of iterations), which is very expensive in computing time. Second, the log-likelihood is often very flat, and there are multiple solutions that fulfill the threshold requirements. We do not think the reason is the H-matrix accuracy, since we already used a very fine accuracy ε = 10⁻¹⁰. Third, we fixed the locations for a given size n. Finally, under infill asymptotics, the parameters ℓ, ν, σ² cannot be estimated consistently, but the quantity $\sigma^2/\ell^{2\nu}$ can, as seen in Fig. 3(d).
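The replicate generation described above, Z = L̃W with W ∼ N(0, I), can be sketched as follows (dense Python/NumPy toy; in our experiments the factor is the H-Cholesky factor computed by HLIBPro, and the function name is ours):

```python
import numpy as np
from scipy.linalg import cholesky

def simulate_replicates(C, n_rep=50, seed=0):
    """Draw n_rep replicates Z = L W with W ~ N(0, I), where C = L L^T,
    so that each column of Z has covariance matrix C."""
    rng = np.random.default_rng(seed)
    L = cholesky(C, lower=True)
    return L @ rng.standard_normal((C.shape[0], n_rep))
```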
In Fig. 4 we present functional boxplots [76] of the estimated parameters as a function of the accuracy ε, based on 50 replicates with n = {4,000, 8,000, 32,000} observations. The true values to be identified are $\theta^* = (\ell, \nu, \sigma^2, \sigma^2/\ell^{2\nu}) = (0.7, 0.9, 1.0, 1.9)$, denoted by the dotted green line. The functional boxplots in (a), (b), (c) identify ℓ; those in (d), (e), (f) identify ν; those in (g), (h), (i) identify σ²; and those in (j), (k), (l) identify $\sigma^2/\ell^{2\nu}$. One can see that we are able to identify all parameters with good accuracy: the median, denoted by the solid black curve, is very close to the dotted green line. The parameter ν and the auxiliary coefficient $\sigma^2/\ell^{2\nu}$ are identified better than ℓ and σ², similarly to Fig. 3. The size of the middle box in each boxplot indicates the variability of the estimates; it decreases when n increases from 4,000 to 32,000.
The functional boxplots in Fig. 4 demonstrate that the estimates are fairly insensitive to the choice of the accuracy ε. Plots of the estimated parameters' curves for 20 replicates with n = 64,000 are presented in Fig. 7 in Appendix C. All these plots suggest when to stop decreasing ε further. The MLE estimates are already good enough for relatively large ε. In other words, the H-matrix approximations are accurate enough, and to improve the estimates of the parameters, one should improve the optimization procedure (the initial guess, the threshold, the maximal number of iterations).

Application to Soil Moisture Data
First, we introduce the daily soil moisture data set, which comes from a numerical model. Then we demonstrate how to apply H-matrices to fit a Matérn covariance model; we use just one replicate sampled at n locations. We then perform numerical tests to see which n is sufficient and explain how to choose an appropriate H-matrix accuracy. We provide the corresponding computational times and storage costs.

Problem and data description
In climate and weather studies, numerical models play an important role in improving our knowledge of the characteristics of the climate system and the causes behind climate variations. These numerical models describe the evolution of many variables, such as temperature, wind speed, precipitation, humidity, and pressure, by solving a set of equations. The process is usually very complicated, involving physical parameterization, initial condition configurations, numerical integration, and data output schemes. In this section, we use the proposed H-matrix methodology to investigate the spatial variability of soil moisture data, as generated by numerical models. Soil moisture is a key factor in evaluating the state of the hydrological process and has a broad range of applications in weather forecasting, crop yield prediction, and early warnings of flood and drought. It has been shown that a better characterization of soil moisture can significantly improve weather forecasting. However, numerical models often generate very large datasets due to the high spatial resolutions of the measured data, which makes the computation of the widely used Gaussian process   models infeasible. Consequently, the whole region of interest must be divided into blocks of smaller size in order to fit the Gaussian process models independently to each block; alternatively, the size of the dataset must be reduced by averaging to a lower spatial resolution. Compared to fitting a consistent Gaussian process model to the entire region, it is unclear how much statistical efficiency is lost by such an approximation. Since our proposed H-matrix technique can handle large covariance matrix computations, and the parallel implementation of the algorithm significantly reduces the computational time, we are able to fit Gaussian process models to a set of large subsamples in a given spatial region. Therefore, next we explore the effect of sample size on the statistical efficiency. 
We consider high-resolution soil moisture data from January 1, 2014, measured in the topsoil layer of the Mississippi River basin, U.S.A. The spatial resolution is 0.0083 degrees, and the distance of one-degree difference in this region is approximately 87.5 km. The grid consists of 1830 × 1329 = 2,432,070 locations with 2,000,000 observations and 432,070 missing values. Therefore, the available spatial data are not on a regular grid. We use the same model for the mean process as in Huang and Sun (2018), and fit a zero-mean Gaussian process model with a Matérn covariance function to the residuals; see Huang and Sun (2018) for more details on the data description and exploratory data analysis.

Estimation, computing times and required storage costs
In the next experiment, we estimated the unknown parameters ℓ, ν, and σ² for different sample sizes n; each time, n samples were chosen randomly from the whole set. Table 4 shows the computational time and storage of the H-matrix approximations for n = {2,000, . . . , 2,000,000}. All computations are done with the parallel H-matrix toolbox HLIBPro on 40 computing cores with 128 GB RAM. It is important to note that the computing time (columns 2 and 5) and the storage cost (columns 3 and 6) grow nearly linearly with n. These numerical computations illustrate the theoretical formulas from Table 1. Additionally, we provide the accuracy of the H-Cholesky inverse and the estimated parameters. The choice of the starting value is important. First, we ran the optimization procedure with n = 2,000 randomly sampled locations, the starting value $(\ell_0, \nu_0, \sigma_0^2) = (3, 0.5, 1)$, ε = 10⁻⁷, a residual threshold of 10⁻⁷, and a maximum of 1,000 iterations. After 78 iterations the threshold was reached, and the solution is $(\hat{\ell}, \hat{\nu}, \hat{\sigma}^2) = (2.71, 0.257, 1.07)$. This is the starting value for the experiment with n = 4,000. The estimated parameters, calculated with n = 4,000 after 80 iterations, are $(\hat{\ell}, \hat{\nu}, \hat{\sigma}^2) = (3, 0.228, 1.02)$. Second, we take these values as the new starting values for another randomly chosen n = 8,000 locations. After 60 iterations, the residual threshold is again reached, and the new estimated values are $(\hat{\ell}, \hat{\nu}, \hat{\sigma}^2) = (3.6, 0.223, 1.03)$. We repeat this procedure for each new n until n = 2,000,000, for which we obtain $(\hat{\ell}, \hat{\nu}, \hat{\sigma}^2) = (1.38, 0.34, 1.1)$.
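The warm-start strategy of this section (fit on a small random subsample, then reuse the estimate as the starting value for a larger one) can be sketched as follows (Python sketch; `neg_loglik` is a hypothetical user-supplied negative log-likelihood, and a generic optimizer stands in for the Brent-Dekker search used in our experiments):

```python
import numpy as np
from scipy.optimize import minimize

def warm_start_fit(locations, data, neg_loglik, theta0,
                   sizes=(2000, 4000, 8000), seed=0):
    """Estimate theta on random subsamples of increasing size, reusing each
    estimate as the starting value for the next (larger) subsample."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for n in sizes:
        idx = rng.choice(len(data), size=min(n, len(data)), replace=False)
        res = minimize(neg_loglik, theta,
                       args=(locations[idx], data[idx]),
                       method='Nelder-Mead')
        theta = res.x  # warm start for the next sample size
    return theta
```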
In Table 5 we list the computing time and the storage cost (total and divided by n) vs. the H-matrix accuracy ε in each sub-block. This table connects the H-matrix sub-block accuracy, the required computing resources, and the inversion error $\|I - (\tilde{L}\tilde{L}^\top)^{-1}C\|_2$. It provides some intuition about how many computational resources are required for each ε at fixed n. We started with ε = 10⁻⁷ (the first row) and then multiplied it each time by a factor of 4. The last row corresponds to ε = 6.4 · 10⁻³, which is sufficient to compute $\tilde{C}$, but not sufficient to compute the H-Cholesky factor $\tilde{L}$ (the H-Cholesky procedure fails). We see that the storage and time increase only moderately with decreasing ε.
In Table 6 we use the whole daily moisture data set with 2 · 10⁶ locations to estimate the unknown parameters. We provide the computing times to set up the matrix C̃ and to compute its H-Cholesky factor L̃. Additionally, we list the total storage requirement and the storage divided by n for both C̃ and L̃.
Table 4: Estimation of unknown parameters for various n (see columns 1, 8, 9, 10). Computing times and storage costs are given for one iteration. The H-matrix accuracy is ε = 10⁻⁷.
We note that the choice of the initial point is very important and may strongly influence the final solution (the log-likelihood can be very flat around the optimum). We started our computations with a rough H-matrix accuracy ε = 10⁻⁴, for which the H-Cholesky procedure fails. We then tried ε = 10⁻⁵ and obtained (ℓ̂, ν̂, σ̂²) = (1.39, 0.325, 1.01). We took these values as the initial condition for ε = 10⁻⁶ and obtained very similar values, and so on. To find out which H-matrix accuracy is sufficient, we ran tests for ε ∈ {10⁻⁴, 10⁻⁵, 10⁻⁶, 10⁻⁷, 10⁻⁸}. One can see that ε = 10⁻⁴ is not sufficient (the H-Cholesky procedure fails), whereas ε = 10⁻⁵ gives results similar to ε ∈ {10⁻⁶, 10⁻⁷, 10⁻⁸}. Therefore, our recommendation is to start with a rather coarse H-matrix accuracy, use the obtained θ̂ as the new initial guess, then decrease ε, and repeat this a few times.

Reproducibility of the numerical results
To reproduce the presented numerical simulations, the first step is to download the HLIB (www.hlib.org) or HLIBPro (www.hlibpro.com) library. Both are free for academic purposes. HLIB is a sequential, open-source C library that is relatively easy to use and understand. HLIBPro is a highly tuned parallel library, of which only the header and object files are accessible.
After signing the H-matrix (HLIB or HLIBPro) license and downloading and installing the libraries, our modules for approximating covariance matrices and computing log-likelihoods, together with the soil moisture data, can be downloaded from https://github.com/litvinen/HLIBCov.git.
HLIB requires not only the set of locations but also a corresponding triangulation of the computational domain [47, 42]. The vertices of the triangles serve as the location points. To construct a triangulation, we recommend the routines in MATLAB, R, or other third-party software. HLIBPro does not require a triangulation; only the coordinates of the locations are needed [49].
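Since HLIBPro only needs the raw coordinates, preparing an input set can be as simple as dumping one location per line. The snippet below writes random 2-D locations in a whitespace-separated "x y" layout; this layout is our own convention for illustration, not a format prescribed by HLIBPro, so adapt it to whatever reader your driver code uses.

```cpp
#include <cstdio>
#include <random>

// Write n random 2-D locations in the unit square to a text file,
// one "x y" pair per line (our own illustrative format).
// Returns 0 on success, -1 if the file could not be opened.
int write_locations(const char* path, int n, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> u(0.0, 1.0);
    std::FILE* f = std::fopen(path, "w");
    if (!f) return -1;
    for (int i = 0; i < n; ++i)
        std::fprintf(f, "%.10f %.10f\n", u(gen), u(gen));
    std::fclose(f);
    return 0;
}
```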

Conclusion and Discussion
We have applied a well-known tool from linear algebra, the H-matrix technique, to spatial statistics. This technique makes it possible to work with large datasets of measurements observed on unstructured grids and, in particular, to estimate the unknown parameters of a covariance model. The statistical model considered yields Matérn covariance matrices with three unknown parameters: ℓ, ν, and σ². We applied the H-matrix technique to approximate multivariate Gaussian log-likelihood functions and Matérn covariance matrices. The H-matrix technique allowed us to reduce the required memory and computing time drastically. Within the H-matrix approximation, we can increase the spatial resolution, take more measurements into account, and consider larger regions with larger datasets. H-matrices require neither axes-parallel grids nor homogeneity of the covariance function.
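For the half-integer smoothness values that are most common in practice, the Matérn covariance with parameters ℓ, ν, and σ² has simple closed forms, which is how one sub-block entry C(s_i − s_j; θ) is evaluated before H-matrix compression. The sketch below (our own helper, using the √(2ν) scaling so that ν = 0.5 reduces to the exponential model σ² exp(−h/ℓ)) covers ν ∈ {0.5, 1.5, 2.5}; general ν requires the modified Bessel function K_ν, e.g. via C++17 std::cyl_bessel_k or GSL.

```cpp
#include <cmath>
#include <stdexcept>

// Matérn covariance C(h) for half-integer smoothness nu (our sketch;
// general nu needs the modified Bessel function K_nu).
// With r = sqrt(2*nu)*h/ell, the closed forms are:
//   nu = 0.5: sigma2 * exp(-r)
//   nu = 1.5: sigma2 * (1 + r) * exp(-r)
//   nu = 2.5: sigma2 * (1 + r + r^2/3) * exp(-r)
double matern(double h, double ell, double nu, double sigma2) {
    const double r = std::sqrt(2.0 * nu) * h / ell;
    if (nu == 0.5) return sigma2 * std::exp(-r);
    if (nu == 1.5) return sigma2 * (1.0 + r) * std::exp(-r);
    if (nu == 2.5) return sigma2 * (1.0 + r + r * r / 3.0) * std::exp(-r);
    throw std::invalid_argument("only nu in {0.5, 1.5, 2.5} implemented");
}
```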
From the H-matrix approximation of the log-likelihood, we computed the H-Cholesky factorization, the KLD, the log-determinant, and the quadratic form Z⊤C⁻¹Z (Tables 2, 3). We demonstrated the computing time, storage requirements, relative errors, and convergence of the H-matrix technique (Tables 4, 5, 6).
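For a small dense C, these three ingredients — Cholesky factor, log-determinant, and quadratic form — can be computed directly; the O(n³) sketch below is the textbook baseline that the H-Cholesky replaces at log-linear cost (our own names, plain nested vectors instead of any H-matrix structure).

```cpp
#include <cmath>
#include <vector>

// Dense Cholesky factorization C = L * L^T (L lower triangular), O(n^3).
std::vector<std::vector<double>> cholesky(std::vector<std::vector<double>> C) {
    const int n = static_cast<int>(C.size());
    std::vector<std::vector<double>> L(n, std::vector<double>(n, 0.0));
    for (int j = 0; j < n; ++j) {
        double d = C[j][j];
        for (int k = 0; k < j; ++k) d -= L[j][k] * L[j][k];
        L[j][j] = std::sqrt(d);
        for (int i = j + 1; i < n; ++i) {
            double s = C[i][j];
            for (int k = 0; k < j; ++k) s -= L[i][k] * L[j][k];
            L[i][j] = s / L[j][j];
        }
    }
    return L;
}

// Gaussian log-likelihood -(n*log(2*pi) + log|C| + z^T C^{-1} z)/2,
// via log|C| = 2*sum_i log(L_ii) and z^T C^{-1} z = ||L^{-1} z||^2
// (one forward substitution, no explicit inverse).
double log_likelihood(const std::vector<std::vector<double>>& C,
                      const std::vector<double>& z) {
    const int n = static_cast<int>(z.size());
    const double pi = std::acos(-1.0);
    auto L = cholesky(C);
    double logdet = 0.0;
    for (int i = 0; i < n; ++i) logdet += 2.0 * std::log(L[i][i]);
    std::vector<double> y(z);                 // forward solve L y = z
    for (int i = 0; i < n; ++i) {
        for (int k = 0; k < i; ++k) y[i] -= L[i][k] * y[k];
        y[i] /= L[i][i];
    }
    double quad = 0.0;
    for (int i = 0; i < n; ++i) quad += y[i] * y[i];
    return -0.5 * (n * std::log(2.0 * pi) + logdet + quad);
}
```

In the H-matrix version, only the factorization and the triangular solves change; the log-determinant is still read off the diagonal of the (approximate) Cholesky factor.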
We reduced the cost of computing the log-likelihood from cubic to log-linear by employing H-matrix approximations, without changing the log-likelihood significantly. We considered both simulated examples, where we recovered the known parameters, and a large, real (soil moisture) data example, where the parameters were unknown. We were able to compute the maximum likelihood estimate (ℓ̂, ν̂, σ̂²) in all examples. We studied the impact of the H-matrix approximation error and the statistical error on the total error (see Fig. 7). For both examples we repeated the calculations for 50 replicates and computed box plots. We analyzed the dependence of (ℓ̂, ν̂, σ̂²) on ε and on the sample size n.
With the parallel H-matrix library HLIBPro, we were able to compute the log-likelihood function for 2,000,000 locations in a few minutes (Fig. 4) on a desktop machine that is about 5 years old and costs around 5,000 USD today. At the same time, computing the maximum likelihood estimates is much more expensive: it depends on the number of optimization iterations and can take from a few hours to a few days. In total, the algorithm may need 100-200 iterations (or more, depending on the initial guess and the threshold). If each iteration takes 10 minutes, then we may need 24 hours to obtain (ℓ̂, ν̂, σ̂²). Possible extensions of this work are: 1) reducing the number of required iterations by implementing the first and second derivatives of the likelihood; 2) adding a nugget to the set of unknown parameters; 3) combining the current framework with the domain decomposition method.
Finally, we conjecture that the H-matrix approximation may have links with the Vecchia approximation [40, 79], but this appears to be a difficult question. Indeed, the H-matrix approximation is tuned with respect to either the accuracy or the rank in sub-blocks of the covariance matrix, whereas the Vecchia approximation is tuned through the selection and the number of neighbors. Hence, without a clear understanding of the link between the two methods, it is not currently possible to tune them in a way that makes them comparable. Further research is needed to investigate their possible relationship.
Remark A.1. Let Z ∼ N(0, σ²I). Theorem 2.2.7 in [25] provides estimates for a norm of a Gaussian vector Z, for instance for sup_{t∈T} |Z(t)|. The assumption ‖C⁻¹‖ ≤ c₁ is strong. The estimate depends on the matrix C, the smoothness parameter ν, the covariance length ℓ, and the H-matrix rank k. First, we find an appropriate block decomposition of the covariance matrix C. Second, we estimate the ranks of the sub-blocks of C. Third, we prove that the inverse and the Cholesky factor can also be approximated in the H-matrix format, and we estimate the ranks for the inverse C⁻¹ and the Cholesky factor L. Finally, we estimate the H-matrix approximation accuracies; see [32]. In the worst case, the rank k will be of order n. We also note that some covariance matrices are singular, so that C⁻¹ and L may not exist. The computation of log |C̃| can be an ill-posed problem, in the sense that small perturbations in C result in large changes in log |C|.
Lemma A.4. Let z, z̃ ∈ Rⁿ, and let the conditions of Lemmas A.2-A.3 hold.
Proof: Adding and subtracting an additional term and applying Lemmas A.2-A.3, we obtain the claim. Thus, if the approximations of the matrices C, C⁻¹, and L are accurate (the error tends to zero sufficiently fast with increasing rank), then the H-matrix approximation error in the quadratic form also tends to zero as k increases or ε_L decreases. A more rigorous proof establishing the order of convergence is beyond the scope of this paper. The dependence on the condition number is studied in [11, 8] (only for matrices that originate from elliptic partial differential equations and integral equations). In Table 7 we list the Frobenius and spectral norms of C, L, and C⁻¹ vs. n. The table shows that ‖C‖₂ grows with n, whereas ‖C⁻¹‖₂ barely grows with n.
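The "adding and subtracting" step behind Lemma A.4 amounts, in essence, to the following triangle-inequality bound (our schematic reconstruction under the stated assumptions, not the paper's exact statement; C̃ and z̃ denote the approximated matrix and vector):

```latex
\[
\bigl| z^{\top} C^{-1} z - \tilde z^{\top} \tilde C^{-1} \tilde z \bigr|
\le \bigl| z^{\top} \bigl( C^{-1} - \tilde C^{-1} \bigr) z \bigr|
  + \bigl| z^{\top} \tilde C^{-1} z - \tilde z^{\top} \tilde C^{-1} \tilde z \bigr|
\le \| z \|_2^{2} \, \bigl\| C^{-1} - \tilde C^{-1} \bigr\|_2
  + \bigl\| \tilde C^{-1} \bigr\|_2 \bigl( \| z \|_2 + \| \tilde z \|_2 \bigr) \| z - \tilde z \|_2 .
\]
```

Both right-hand terms vanish as the H-matrix accuracy improves, which is the convergence statement made in the text.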

B. Appendix: Admissibility condition and cluster and block-cluster trees
The admissibility condition (criterion) is used to subdivide a given matrix into sub-blocks and to decide which sub-blocks can be approximated well by low-rank matrices and which cannot. There are many different admissibility criteria (see Fig. 5); typical ones are the strong, the weak, and the domain-decomposition-based criteria [32]. Users can also develop their own admissibility criteria, which may depend, for example, on the covariance length. The admissibility condition regulates, for instance, the block partitioning, the size of the largest block, and the depth of the hierarchical block partitioning. Large low-rank blocks are desirable, but they may also require larger ranks. Figure 6 shows examples of cluster trees (1-2) and block cluster trees (3-4). Let I be the index set of all locations. For each index i ∈ I corresponding to a basis function b_i (e.g., the "hat" function), denote the support G_i := supp b_i ⊂ R^d, where d ∈ {1, 2, 3} is the spatial dimension. We now define the two trees that are necessary for the definition of hierarchical matrices. These are labeled trees, where the label of a vertex t is denoted by t̂.
Definition B.1. (Cluster Tree T_I) [31, 27] A finite tree T_I is a cluster tree over the index set I if the following conditions hold: • I is the root of T_I, and t̂ ⊆ I holds for all t ∈ T_I.
• If t ∈ T_I is not a leaf, then the set of sons, sons(t), contains disjoint subsets of I, and t̂ is the disjoint union of its sons, t̂ = ⋃_{s∈sons(t)} ŝ.
• If t ∈ T_I is a leaf, then |t̂| ≤ n_min for a fixed number n_min.
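Definition B.1 can be realized by recursive bisection: sort a cluster's indices by coordinate, split at the median, and recurse until at most n_min indices remain. The 1-D sketch below is our own simplified version (HLIB/HLIBPro use more elaborate geometric or cardinality-balanced clusterings, and higher spatial dimensions split along the longest box direction); all names are ours.

```cpp
#include <algorithm>
#include <memory>
#include <vector>

// Node of a cluster tree T_I: stores its label t-hat (an index set)
// and its sons (both null for a leaf).
struct Cluster {
    std::vector<int> idx;                 // t-hat
    std::unique_ptr<Cluster> left, right; // sons(t)
};

// Build T_I over 1-D locations x by recursive median bisection; stop
// when |t-hat| <= n_min (third condition of Definition B.1). The two
// sons are disjoint and their union is t-hat (second condition).
std::unique_ptr<Cluster> build_cluster_tree(std::vector<int> idx,
                                            const std::vector<double>& x,
                                            std::size_t n_min) {
    auto node = std::make_unique<Cluster>();
    node->idx = idx;
    if (idx.size() <= n_min) return node;           // leaf
    std::sort(idx.begin(), idx.end(),
              [&x](int a, int b) { return x[a] < x[b]; });
    const std::size_t mid = idx.size() / 2;         // median split
    node->left  = build_cluster_tree(
        std::vector<int>(idx.begin(), idx.begin() + mid), x, n_min);
    node->right = build_cluster_tree(
        std::vector<int>(idx.begin() + mid, idx.end()), x, n_min);
    return node;
}
```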
We generalise G_i to clusters τ ∈ T_I by setting G_τ := ⋃_{i∈τ} G_i, i.e., G_τ is the minimal subset of R^d that contains the supports of all basis functions b_i with i ∈ τ.
Definition B.2. (Block Cluster Tree T_{I×I}) [31, 27] Let T_I be a cluster tree over the index set I. A finite tree T_{I×I} is a block cluster tree based on T_I if the following condition holds: • root(T_{I×I}) = I × I.
By construction, root(T_{I×I}) = I × I, which implies that the set of leaves of T_{I×I} is a partition of I × I.
Definition B.3. The standard admissibility condition (Adm_η) for two domains B_τ and B_σ (corresponding to two clusters τ and σ) is defined as min{diam(B_τ), diam(B_σ)} ≤ η · dist(B_τ, B_σ), where B_τ, B_σ ⊂ R^d are axis-parallel bounding boxes of the clusters τ and σ such that G_τ ⊂ B_τ and G_σ ⊂ B_σ; diam and dist denote the usual diameter and distance, and by default η = 2.0.
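Once bounding boxes are available, the condition of Definition B.3 takes only a few lines to check. The sketch below (our own names, fixed to d = 2) marks a block (τ, σ) admissible when min{diam(B_τ), diam(B_σ)} ≤ η · dist(B_τ, B_σ):

```cpp
#include <algorithm>
#include <array>
#include <cmath>

// Axis-parallel bounding box in d = 2 dimensions.
struct Box {
    std::array<double, 2> lo, hi;
};

// Euclidean diameter of a box.
double diam(const Box& b) {
    double s = 0.0;
    for (int k = 0; k < 2; ++k)
        s += (b.hi[k] - b.lo[k]) * (b.hi[k] - b.lo[k]);
    return std::sqrt(s);
}

// Euclidean distance between two boxes (0 if they touch or overlap).
double dist(const Box& a, const Box& b) {
    double s = 0.0;
    for (int k = 0; k < 2; ++k) {
        const double g = std::max({0.0, a.lo[k] - b.hi[k], b.lo[k] - a.hi[k]});
        s += g * g;
    }
    return std::sqrt(s);
}

// Standard admissibility Adm_eta: min diameter <= eta * distance.
bool admissible(const Box& t, const Box& s, double eta = 2.0) {
    return std::min(diam(t), diam(s)) <= eta * dist(t, s);
}
```

Touching or overlapping boxes have dist = 0 and are therefore inadmissible (unless degenerate), which is exactly why near-diagonal blocks of the covariance matrix are stored densely while well-separated blocks get low-rank approximations.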