HLIBCov: Parallel hierarchical matrix approximation of large covariance matrices and likelihoods with applications in parameter identification

Graphical abstract


Introduction
HLIBpro is a very fast and efficient parallel H-matrices library. This is an auxiliary technical paper, which contains technical details to our previous paper [31]. In [31] we used the gradient-free optimization method to estimate the unknown parameters of a covariance function using HLIB and HLIBpro.
Parameter estimation and problem settings. We let n be the number of spatial measurements Z located irregularly across a given geographical region at locations s :¼ fs 1 ; . . . ; s n g 2 R d , d ! 1. We also let Z = {Z(s 1 ), . . . , Z(s n )} > , where Z(s) is a stationary Gaussian random field. Then, we assume that Z has mean zero and a stationary parametric covariance function C(h;u) = cov{Z(s), Z(s + h)}, where h 2 R d is a spatial distance and vector u 2 R q denotes q unknown parameters. To infer u, we maximize the joint Gaussian log-likelihood function, where C(u) ij = C(s i À s j ;u), i, j = 1, . . . , n. Let us assume thatû maximizes (2.1). When the sample size n is large, the evaluation of (2.1) becomes challenging, due to Oðn 3 Þ computational cost of the Cholesky factorization. Hence, scalable and efficient methods that can process larger n are needed. For this, the hierarchical matrices (H-matrix) technique is used, which approximates sub-blocks of the dense matrix by a low-rank representation of either a given rank k or a given accuracy e > 0 (see Section 3.2).
To maximizeLðu; kÞ in (2.2), we use the Brent-Dekker method [9,33]. It could be used with or without derivatives.
An additional difficulty is the ill-posedness of the optimization problem. Even a small perturbation in the covariance matrix C(u) may result in large perturbations in the log-determinant and the loglikelihood. A possible remedy, which may or may not help, is to take a higher rank k.
Features of the H-matrix approximation. Other advantages of applying the H-matrix technique are the following: 1. The H-matrix class is large, including low-rank and sparse matrix classes; 2. C(u) À1 , C(u) 1/2 , |C(u)|, Cholesky decomposition, the Schur complement, and many others can be computed in the H-matrix format [16]; 3. Since the H-matrix technique has been well studied, there are many examples, multiple sequential and parallel implementations and a solid theory already available. Therefore, no specific MPI or OpenMP knowledge is needed; 4. The H-matrix cost and accuracy is controlled by k; 5. The H-Cholesky factor and the H-inverse have often moderate ranks.
Structure of the paper. After introduction and problem setting in Section 2, we explain the H-matrix approximation of Matérn covariance matrices in Section 3. In Section 4, we describe the software installation details, the input, and output of the HLIBCov code. In Section 5, we provide dependence of the storage and computing costs vs. the H-matrix rank k. We visualize the shapes of the log-likelihood functions for different n. Finally, we demonstrate how to estimate three unknown parameters u ¼ ð'; n; s 2 Þ > of C(u). Best practices are listed in Section 6. We end the paper with a conclusion in Section 7. The auxiliary H-matrix and log-likelihood details are provided in the Appendices A and B.
For any two spatial locations s and s 0 and the distance h : = ||s À s 0 ||, the Matérn class of covariance functions is defined as where u ¼ ð'; n; s 2 Þ > ; ' > 0 is a spatial range parameter; n > 0 is the smoothness, with larger values of n corresponding to smoother random fields; and s 2 is the variance. Here, K n denotes a modified Bessel function of the second kind of order n, and G(Á) denotes the Gamma function. The values n = 1/2 and n = 1 correspond to the exponential and Gaussian covariance functions respectively.
The H-matrix technique was originally introduced by Hackbusch (1999) for the approximation of stiffness matrices and their inverses coming from partial differential and integral equations [8,12,15]. Briefly, the key idea of the H-matrix technique is to divide the initial matrix into sub-blocks in a specific way, identify those sub-blocks which can be approximated by low-rank matrices and compute the corresponding low-rank approximations.
The partitioning of the matrix into sub-blocks starts by recursively dividing the rows and columns into disjoint sub-sets, e.g., splitting the set of all rows into two (equal sized) sub-sets, which are again divided. This yields a cluster tree where each sub-set of rows/columns is called a cluster. By multiplying the cluster trees for the rows and the columns, a hierarchical partitioning of the matrix index set is obtained, the so called block cluster tree or H. Within this block cluster tree, low-rank approximable blocks are identified using an admissibility condition. Such admissible blocks are not further refined into sub-blocks, i.e., the corresponding sub-tree is not computed or stored. For all admissible blocks a lowrank approximation of the initial matrix is computed, either with a given rank k (fixed-rank strategy) or an accuracy e > 0 (fixed-accuracy strategy). The result of this computation is called an H-matrix.
This process is also shown in Fig. 1.
Definition 3.1. Let I be an index set (representing the rows/columns) and T I be a cluster tree based on I. Furthermore, let T IÂI be a block cluster tree based on T I and an admissibility condition adm : T IÂI → {true, false}. Then the set of H-matrices with maximal rank k is defined as HðT IÂI ; kÞ :¼ fC 2 R IÂI j rankðCj tÂs Þ k for all ðt; sÞ of T IÂI with admðt; sÞ ¼ trueg: Here the block cluster (t, s) is an element (a vertex) of the block cluster tree T IÂI . Various partitioning strategies for the rows and columns of the matrix and admissibility conditions have been developed to approximate different types of matrices. Typical admissibility conditions are strong (also called standard), weak and based on domain decomposition [16], for which examples are shown in Fig. 2. The red blocks indicate dense or in-admissible blocks whereas green blocks are identified as admissible. The maximal size of the dense blocks (i.e., how deep the hierarchical subdivision into sub-blocks is) is regulated by the parameter "n min ", whose value affects the storage Fig. 1. Examples of a cluster tree T I (left) and a block cluster tree T IÂI (right). The decomposition of the matrix into sub-blocks is defined by T IÂI and the admissibility condition. size and the runtime of the H-matrix arithmetic, e.g., a smaller value leads to less storage but is often in-efficient with respect to CPU performance. Typically values of n min are in the range 20 to 150.

Remark 1.
With a more appropriate choice of the admissibility condition (criteria), one can influence the depth of the hierarchy, the size of the sub-blocks, their number, the number of empty sub-blocks. The choice of the admissibility criteria is not crucial here; the H-matrix rank (or accuracy) in each sub-block is much more crucial. If the H-matrix rank in each block is sufficiently large, then any admissibility criteria will work. In our numerical tests, we used the standard admissibility criteria.
For the computation of the low-rank approximation for admissible sub-blocks many different methods are available, e.g., adaptive cross approximation (ACA), hybrid cross approximation (HCA), rank-revealing QR, randomized SVD [3,5,7,8,11,19,22]. For the fixed-rank strategy, the resulting lowrank matrix is of rank at most k. In case of the fixed-accuracy strategy with a given e > 0, the low-rank approximationM of the sub-block M is computed such that kM ÀM k ekMk. The storage size of the resulting H-matrix is of order O knlogn ð Þ [12]. In Fig. 3 (left), an example of an H-matrix approximation to C(u) can be found. There, the local ranks and the decay of singular values in the admissible blocks (green) in logarithmic scale are shown.
In addition to efficient matrix approximation, H-matrices also permit full matrix arithmetic, e.g., matrix addition, matrix multiplication, inversion or factorization. However, similar to matrix compression, H-matrix arithmetic is approximate to maintain log-linear complexity. The approximation during arithmetic is again either of a fixed-rank or a fixed-accuracy [12]. In this work, we make use of the H-Cholesky factorization of C(u) (see Fig. 3).
For C(u), the predefined rank (or accuracy e) defines the accuracy of the H-matrix approximation, for the initial approximation of C(u) as well as for the Cholesky factorization for C(u) À1 .

Parallel hierarchical-matrix technique
We used the parallel H-matrix library HLIBpro [13,25,27,28], which implements H-matrix approximation and arithmetic functions using a task-based approach to make use of todays many-core architectures. For this, the mathematical operation is decomposed into small atomic tasks with corresponding incoming and outgoing data dependencies. This set of tasks and dependencies forms a directed acyclic graph (DAG), which is used for scheduling the tasks to the CPU cores, e.g., if all incoming data dependencies are met, the corresponding task is executed on the next free CPU core available.
The computational complexity of the different H-matrix operations is shown in Table 1. Here, |V(T)| denotes the number of vertices, |L(T)| is the number of leaves in the block-cluster tree T = T IÂI . The sequential terms in those estimates are typically due to the sequential behaviour of the corresponding algorithm, e.g., strictly following the diagonal during Cholesky factorization, but usually do not show in practical applications since the majority of the computation work is parallelized.

HLIBCov and HLIBpro installation
This section contains a summary of the information provided at https://www.hlibpro.com and https://github.com/litvinen/HLIBCov.git. HLIBpro supports both shared and distributed memory architectures, though in this work we only use the shared memory version. For the implementation of the task-parallel approach, Intel's Threading Building Blocks (TBB) is used, Table 2. HLIBpro is free for academic purposes, and is distributed in a pre-compiled form (no source code available). Originally, HLIBpro was developed for solving FEM and BEM problems [13,28]. In this work, we extend the applicability of HLIBpro to dense covariance matrices and log-likelihood functions.
Installation: HLIBCov uses the functionality of HLIBpro; therefore, HLIBpro must be installed first. All functionality implemented by HLIBCov is based on HLIBpro, i.e., no extra software is needed in addition to the libraries needed by HLIBpro. This also holds for the Matérn kernel, which uses Bessel functions and maximization algorithms, both being provided by the GNU Scientific Library (GSL) and also used by HLIBpro. The reader can easily replace GSL with his own optimization library. The Bessel functions are also available in other packages.
To install HLIBpro on MacOS and Windows, we refer the reader to www.HLIBpro.com for further details.  Hardware. All of the numerical experiments herein are performed on a Dell workstation with two Intel(R) Xeon(R) E5-2680 v2 CPUs (2.80GHz, 10 cores/20 threads) and 128 GB main memory.
Adding HLIBCov to HLIBpro. The easiest form of compiling HLIBCov is by using the compilation system of HLIBpro. If the iterative process is converging, then the last row will contain the solution u Ã ¼ ð' Ã ; n Ã ; s Ã2 Þ > . When computing error boxes, the output file will contain M solutions (n, '*, n*, s Ã2 ), where M is the number of replicates: The name of the output file can be found in the main() procedure in loglikelihood.cc.

Numerical experiments
We perform several numerical tests. First, we investigate how the approximation errors and the Kullback-Leibler divergence depend on the maximal rank k. Second, we show how the memory requirement for the matrixC depends on ' and n. Third, we compute the variances of ' and n vs. k and n. Fourth, we estimate the unknown parameters.

Convergence errors and memory requirement
The Kullback-Leibler divergence (KLD) D KL (P||Q) is a measure of information loss when a distribution Q is used to approximate P. For the multivariate normal distributions (m 0 , C) and ðm 1 ;C Þ, it is defined as follows: kCðC Þ À1 À Ik 2 is large. Relatively often, the H-matrix procedure, which computes the H-Cholesky factorL or the H-inverse, produces "NaN" (not a number) and terminates. One possible cause is that some of the diagonal elements can be very close to zero, and their inverse is not defined. This may happen when two locations are very close to each other and, as a result, two columns (rows) are linearly dependent. To avoid such cases, the available data should be preprocessed to remove duplicate locations. Very often, the nugget t 2 I is added to the main diagonal to stabilize numerical calculations (see more in Section 5.5), i.e.,C :¼C þ t 2 I. In Tables 3 and 4, the nugget is equal to zero.    The dependence of n on the problem size, e.g., the number of measurements is also tested with the results shown in Fig. 5 (right). As the results demonstrate, with a larger number n the estimation of the parameter n is getting better. This experiment is also showing that the estimations, obtained on a smaller data set, say with n = 2000 measurement, could be used as a starting value in the next experiment with a larger data set, say n = 4000.   Fig. 6, we illustrate the dependence of ÀL=n on the parameters ' (left, with n = 0.5, s 2 = 1), and n (right, with ' = 0.0864 and s 2 = 1). Both figures demonstrate the smooth dependence ofL=n on ' and n.

Identification of unknown parameters
We generate a "synthetic" data set with parameters ð' Ã ; n Ã ; s Ã2 Þ ¼ ð0:0864; 0:5; 1:0Þ and then try to infer these parameters. To build M various datasets (M replicates) with n 2 {64, . . . ,4,2} Â 1000 locations, we generate a large vector Z 0 with n 0 = 2 Â10 6 locations, and randomly sample n points from it. We note that if the locations are very close to each other, then the covariance matrix may be singular or the Cholesky factorization will be very difficult to compute.
To generate the random data Z 0 2 R n 0 , we compute the H-Cholesky factorization of C ð0:086; 0:5; 1:0Þ ¼LL > . Then, we evaluate Z 0 ¼Lj, where j 2 R n 0 is a normal vector with zero mean and unit variance. We generate Z 0 only once. Next, we run our optimization algorithm and try to  identify (recover) the "unknown" parameters ð'; n; s 2 Þ > . The resulting boxplots for ' and s 2 over M = 100 replicates are illustrated in Fig. 7. We see that the variance (or uncertainty) decreases with increasing n. The green line indicates the true values.
To identify all three parameters simultaneously, we solve a three-dimensional optimization problem. The maximal number of iterations is set to 200, and the residual is 10 À6 . The behavior and accuracy of the boxplots depend on the H-matrix rank, the maximum number of iterations to achieve a certain threshold, the threshold (or residual) itself, the initial guess, the step size in each parameter of the maximization algorithm, and the maximization algorithm. All replicates of Z are sampled from the same generated vector of size n 0 = 2 Â10 6 .
In Table 5, we present the almost-linear storage cost (columns 3 and 6) and the computing time (columns 2 and 5).
The shape of the negative log-likelihood function and its components are illustrated in Fig. 8. This helps us to understand the behavior of the iterative optimization method, and the contributions of the log-determinant and the quadratic functional. We see that the log-likelihood is almost flat, and that it may be necessary to perform many iterations in order to find the minimum.

Adding nugget t 2
When diagonal values ofC are very close to zero, the H-Cholesky procedure becomes unstable. It produces negative entries on the diagonal during computation. By adding a diagonal matrix with small positive numbers, all the singular values become larger and move away from zero. However, by adding Table 5 Computing time and storage vs n. The number of parallel computing cores is 40,n ¼ 0: 33   From (5.1), we see that the relative error on the left-hand side of (5.1) depends on the norm kC À1 k, i.e., the relative error is inversely proportional to the smallest singular value ofC . This may explain a possible failing of approximating matrices, where the smallest singular values tend towards zero. The estimates for the H-Cholesky and the Schur complement for general sparse positive-definite matrices are given in [4]. The approximation errors are proportional to the kðCÞ, i.e., matrices with a very large condition number may require a very large H-matrix rank. Fig. 9 (left) demonstrates three negative log-likelihood functions computed with the nuggets 0.01, 0.005, and 0.001. For this particular example, the behavior of the likelihood is preserved, and the  minimum does not change (or changes very slightly). Fig. 9 (right) is just a zoomed version of the picture on the left.

Best practices (HLIBCov)
In this section, we list our recommendations and warnings. For practical computations, use adaptive-rank arithmetic since it produces smaller matrices and faster runtime.
For the input, it is sufficient to define a file by three columns in 2D and four columns in 3D: location coordinates (x, y, z) and the observed value; no triangles or edges are required.
If two locations coincide or are very close to each other, then the matrix will be close to singular or singular. As a result, it will be hard to compute the Cholesky factorization. Our suggested remedy is to improve the quality of the locations by preprocessing the data.
By default, the H-Cholesky or HÀLU factorizations use a task-based approach employing a DAG (directed acyclic graph). For sequential computations this can be turned off to revert to a slightly faster recursive implementation by setting HLIB::CFG::Arith::use_dag = false By default, HLIBpro uses all available computing cores. To perform computations on 16 cores, use HLIB::CFG::set_nthreads(16) at the beginning of the program (after command INIT()).
Since HLIBpro is working in d = 1, 2, 3, .. .-dimensional case, only very minor changes are required to move from 1D locations to 2D or 3D (or higher). Replace dim= 2 with dim= 3 in TCoordinate coord(vertices, dim); then add "> >z" to in > >x > >y > >z > >v; The H-matrix data format is a rather complicated data structure (class) in HLIBpro. Therefore, the H-matrix objects (or the pointers on them) are neither the input nor the output parameters. Instead, the input parameters for the HLIBpro C++ routines are: a vector (array) of locations and a vector of observations Z. The triangulation (a list of triangles/edges) is not needed. The output parameters are either scalar values or a vector; for example, the determinant, the trace, a norm, the result of the matrix-vector product, and an approximation error.

Conclusion
We extended functionality of the parallel H-matrix library HLIBpro to infer unknown parameters for applications in spatial statistics. This new extension allows us to work with large covariance matrices. We approximated the joint multivariate Gaussian likelihood function and found its maxima in the H-matrix format. These maxima were used to estimate the unknown parameters (', n, and s 2 ) of a covariance model. The new code is parallel, highly efficient, and written in C++ language.
With the H-matrix technique, we reduced the storage cost and the computing cost (Tables 4, 5) of the log-likelihood function dramatically, from cubic to almost linear. We demonstrated these advantages in a synthetic example, where we were able to identify the true parameters of the covariance model. We were also able to compute the log-likelihood function for 2, 000, 000 locations in just a few minutes on a desktop machine ( Table 5). The H-matrix technique allowed us to increase the spatial resolution, handle more measurements, consider larger regions, and identify more parameters simultaneously. kx À yk ÀjaþbjÀg ; a; b 2 N; holds for constants C 1 , C 0 and g 2 R. This estimation is used to control the error e q from the Taylor expansion xðx; yÞ ¼ X a2N d 0 ;jaj q ðx À x 0 Þ a 1 a! @ a x xðx 0 ; yÞ þ e q : Suppose that χ k (x, y) is an approximation of χ in G t Â G d of the separate form (e.g., Taylor or Lagrange polynomials): x k ðx; yÞ ¼ where h is some positive number. The admissibility condition indicates blocks that allow rank-k approximation and those that do not. Admissible blocks are either very small (and computed exactly) or are approximated by rank-k matrices. All other (inadmissible) blocks are computed as usual.
Rank-k Adaptive Cross Approximation (ACA): An H-matrix contains many sub-blocks, which can be well approximated by low-rank matrices. These low-rank approximations can be computed accurately by truncated singular value decomposition (SVD), but it is very slow. HLIBpro uses the Adaptive Cross Approximation method (ACA) [11] and its improved modifications such as ACA+ and HACA [3,5,7].
Remark B.1. Further optimization of the ACA algorithm can be achieved by a recompression using low-rank SVD. If we suppose that a factorization of the matrix R = AB > , A 2 R pÂK , B 2 R qÂK , is found by ACA and that the actual rank of R is k, k < K. Then we can apply the low-rank SVD algorithm to compute R = USV > in Oððp þ qÞK 2 þ K 3 Þ time.