Sparse approximate matrix-matrix multiplication for density matrix purification with error control

We propose a method for strict error control in sparse approximate matrix-matrix multiplication. The method combines an error bound and a parameter sweep to select an appropriate threshold value. The scheme for error control and the sparse approximate multiplication are implemented using the Chunks and Tasks parallel programming model. We demonstrate the performance of the method in parallel linear scaling electronic structure calculations using density matrix purification with rigorous error control.

Sparse matrix-matrix multiplication is a key operation in linear scaling electronic structure calculations based on, for example, Hartree-Fock or Kohn-Sham density functional theory. This operation has therefore received a lot of attention from method and software developers in this field [5]. This includes the development of sparse data structures [9,26,21,15], approximation techniques taking advantage of the special properties of the matrices involved [16,17,3], and different approaches to parallelization [4,19,12,28,7]. Sparse matrix-matrix multiplication is used in the construction of the density matrix defined by

    D = θ(µI − F),    (1)

where θ(x) is the Heaviside step function, µ is the chemical potential and F is the Fock or Kohn-Sham matrix. A number of different methods for the computation of the density matrix, including minimization and recursive polynomial expansion methods, use a sequence of matrix-matrix multiplications. Recursive polynomial expansion methods are also referred to as density matrix purification. The excellent performance of these methods for large systems can be attributed to the decay properties of the density matrix and of any matrices that occur during the course of its computation. In exact arithmetic, the matrices involved contain many entries of small magnitude. Efficient implementations exploit this by removing small matrix entries from the matrix representation, meaning that they are treated as zeros [6,27,25,12]. A key issue and a common topic of discussion is how this is done and what implications it has for performance and accuracy. In the recursive polynomial expansion methods it is possible to strictly control the accuracy of the final result if the norm of the matrix consisting of removed matrix entries can be controlled. This procedure is formalized in Algorithm 1, where the removal of small entries is written as the addition of an error matrix E_i in each iteration.
The starting matrix X_0 is given by a linear transformation of the Fock/Kohn-Sham matrix, and f_i, i = 1, 2, ..., are low order polynomials chosen so that their recursive application tends to the desired step function in (1). It is shown in refs 22, 14 how to choose the error tolerances δ_i, i = 1, 2, ..., so that the error in the density matrix is controlled, i.e. so that ||D~ − D|| < ε, where D~ is the computed approximate density matrix. Several methods to select small matrix entries for removal without violating the condition ||E_i|| < δ_i have been proposed, making use of different matrix norms [17,23,14].
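To make the procedure concrete, the following is a minimal dense-NumPy sketch of Algorithm-1-style purification with controlled truncation. It is illustrative only: the model matrix, the tolerances, and the choice of the McWeeny polynomial f(x) = 3x^2 − 2x^3 as the low order polynomial are assumptions for this sketch, not the paper's implementation (which is sparse and parallel).

```python
import numpy as np

# Sketch: X_i = f_i(X_{i-1}) + E_i with ||E_i||_F < delta_i (synthetic data).
rng = np.random.default_rng(0)
n, n_occ = 40, 12

# Model Fock matrix with a clear gap: occupied eigenvalues in [-2, -1],
# virtual eigenvalues in [1, 2], so the chemical potential mu = 0.
w = np.concatenate([np.linspace(-2.0, -1.0, n_occ), np.linspace(1.0, 2.0, n - n_occ)])
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
F = V @ np.diag(w) @ V.T
mu = 0.0
D_exact = V @ np.diag((w < mu).astype(float)) @ V.T   # D = theta(mu*I - F)

# X_0: linear transformation mapping the spectrum into [0, 1],
# with occupied eigenvalues above 1/2.
r = max(mu - w.min(), w.max() - mu)
X = 0.5 * np.eye(n) + (mu * np.eye(n) - F) / (2 * r)

def truncate(X, delta):
    """Remove the smallest matrix entries while keeping ||E||_F < delta."""
    order = np.argsort(np.abs(X), axis=None)   # entries sorted by magnitude
    csum = np.cumsum(np.abs(X).flat[order] ** 2)
    k = np.searchsorted(csum, delta ** 2)      # largest k fitting under delta
    Xt = X.copy()
    Xt.flat[order[:k]] = 0.0
    return Xt

for delta in [1e-8] * 30:
    X = 3 * X @ X - 2 * X @ X @ X    # f_i(X): McWeeny polynomial (example)
    X = truncate(X, delta)           # X_i = f_i(X_{i-1}) + E_i

assert np.linalg.norm(X - D_exact) < 1e-5
```

The truncation step selects the largest set of smallest entries whose combined Frobenius norm stays strictly below the tolerance, mirroring the condition ||E_i|| < δ_i.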

Algorithm 1 Purification process
1: procedure Purification(X_0, f_i, δ_i, n)
2:   for i = 1, ..., n do
3:     X_i = f_i(X_{i-1}) + E_i, with ||E_i|| < δ_i
4:   end for
5: end procedure

A drawback of the approach described above, where the truncation is performed separately from the multiplication, is that the multiplication may result in a large increase in the number of nonzero matrix entries [13,24]. Often, many of the newly introduced nonzeros have small magnitude and are removed anyway in the subsequent truncation. This means that computational resources are used to compute and temporarily store those matrix entries for no purpose. Several remedies for this issue have been proposed. For block-sparse matrix representations it has been proposed to skip submatrix products of blocks with small norm [8,20]. In the cutoff radius approach, all matrix entries that correspond to distances between nuclei or basis function centers larger than some cutoff radius are excluded from the representation [16,6]. Since, in this case, the nonzero pattern is known in advance, the product may be computed directly in truncated form.
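The fill-in issue can be seen in a small numerical experiment: squaring a matrix with exponentially decaying elements produces many entries that fall below the truncation threshold and are discarded immediately afterwards. A dense-NumPy sketch, with synthetic data and an illustrative threshold:

```python
import numpy as np

rng = np.random.default_rng(1)
n, thresh = 200, 1e-8
d = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
A = np.exp(-1.5 * d) * rng.standard_normal((n, n))
A[np.abs(A) < thresh] = 0.0        # stored sparsity pattern of the input

C = A @ A                          # the product pattern fills in ...
kept = np.count_nonzero(np.abs(C) >= thresh)
computed = np.count_nonzero(C)     # ... but many new entries are tiny
assert computed > kept             # entries computed only to be truncated away
```

The gap between `computed` and `kept` is exactly the wasted work that the approaches discussed below aim to avoid.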
In the present work we are particularly interested in sparse matrix representations that make use of sparse quaternary trees (quadtrees) to store matrices, where any identically zero submatrix quadrant is left out of the representation [29]. In the quadtree representation a matrix X is either 1) stored in a data structure used for small enough matrices, or 2) flagged as identically zero, or 3) composed of four quadrants, each a matrix recursively represented as a quadtree. The data structure used for small matrices may be a regular column- or row-wise dense matrix representation or some sparse matrix format. In the matrix product, zero branches in the quadtree are skipped. In the SpAMM approach [10,3] one also skips submatrix products whose norm is known to be small. Such skipping is performed at each level of the quadtree, see Algorithm 2. The approaches described above alleviate the issue of fill-in but do not offer error control. In this letter, we show how fill-in can be avoided while strictly controlling the error in the product. We make use of the SpAMM algorithm but add a preceding step that carefully selects the SpAMM tolerance τ so that the error in the product satisfies ||SpAMM(A, B, τ) − AB||_F < δ for given δ. This gives us a method for approximate evaluation f~(X) of low order polynomials f(X) such that ||f~(X) − f(X)||_F < δ for a given predefined tolerance δ, as required to compute the density matrix with strict error control.
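The norm-based skipping can be sketched in dense NumPy as follows. This is illustrative only: it uses explicit submatrix slicing on power-of-two sizes instead of a quadtree data structure, and the leaf size is an assumption.

```python
import numpy as np

LEAF = 2   # lowest-level block size (an assumption for this sketch)

def spamm(A, B, tau):
    """SpAMM on 2^k-by-2^k matrices: skip any submatrix product with
    ||A||_F * ||B||_F < tau, at every level of the recursion."""
    if np.linalg.norm(A) * np.linalg.norm(B) < tau:
        return np.zeros((A.shape[0], B.shape[1]))   # skipped product
    if A.shape[0] <= LEAF:
        return A @ B                                # lowest level: exact
    h = A.shape[0] // 2
    C = np.zeros((A.shape[0], B.shape[1]))
    for i in (0, 1):
        for j in (0, 1):
            C[i*h:(i+1)*h, j*h:(j+1)*h] = (
                spamm(A[i*h:(i+1)*h, :h], B[:h, j*h:(j+1)*h], tau)
              + spamm(A[i*h:(i+1)*h, h:], B[h:, j*h:(j+1)*h], tau))
    return C

rng = np.random.default_rng(2)
n = 16
d = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
A = np.exp(-0.8 * d) * rng.standard_normal((n, n))   # decaying elements
B = np.exp(-0.8 * d) * rng.standard_normal((n, n))

assert np.allclose(spamm(A, B, 0.0), A @ B)             # tau = 0: exact product
assert np.linalg.norm(spamm(A, B, 1e-3) - A @ B) < 1.0  # small error for small tau
```

For matrices with rapid off-diagonal decay, most of the skipped products sit far from the diagonal, which is where the fill-in of an exact product would occur.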
Our method to select the SpAMM tolerance combines an error bound with a parameter sweep. We will now show how a bound on the SpAMM product error can be computed for given input matrices A and B and a given SpAMM tolerance τ. Let us consider how the multiplication of 2-by-2 block matrices is performed with Algorithm 2.

Algorithm 2 SpAMM(A, B, τ)
1: if ||A||_F ||B||_F < τ then
2:   return C = 0
3: end if
4: if lowest level then
5:   return C = AB
6: end if
7: for i = 0, 1 do
8:   for j = 0, 1 do
9:     C_ij = SpAMM(A_i0, B_0j, τ) + SpAMM(A_i1, B_1j, τ)
10:   end for
11: end for
12: return C

Let A and B be partitioned into quadrants,

    A = [A_00 A_01; A_10 A_11],  B = [B_00 B_01; B_10 B_11].    (2)

Then the product matrix C = AB is given by

    C_ij = A_i0 B_0j + A_i1 B_1j,  i, j = 0, 1.    (3)

Assume that the SpAMM tolerance τ_1 is such that the whole procedure is not performed, because ||A||_F ||B||_F < τ_1. Then, clearly, the error matrix E = −AB and

    ||E||_F = ||AB||_F ≤ ||A||_F ||B||_F < τ_1.    (4)

So the error norm is bounded by the product of the multiplicands' norms.
Suppose that we multiply the same matrices approximately with some other tolerance τ_2 and that three of the sub-multiplications, A_00 B_00, A_10 B_01 and A_11 B_11, are skipped because the product of norms is too small. The result of this operation is the matrix

    C~ = [A_01 B_10, A_00 B_01 + A_01 B_11; A_10 B_00 + A_11 B_10, 0].    (5)

Then, the error matrix E = C~ − C and its Frobenius norm can be bounded from above as

    ||E||_F^2 = ||A_00 B_00||_F^2 + ||A_10 B_01 + A_11 B_11||_F^2
              ≤ (||A_00||_F ||B_00||_F)^2 + (||A_10||_F ||B_01||_F + ||A_11||_F ||B_11||_F)^2.    (6)

The idea of our algorithm for finding the optimal SpAMM tolerance is based on the observation outlined above: each skipped multiplication introduces an error in the product matrix, and this error can be bounded at any level of the matrix hierarchy. The summation of the errors from the underlying multiplications can be done as in (6): within each block the bounds of the two sub-multiplications add, and across blocks they combine in a root sum of squares. This gives the error bound for given A, B, and SpAMM tolerance τ.
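The inequality in (6) can be checked numerically. The sketch below, with random blocks chosen purely for illustration, forms the error left by the three skipped sub-multiplications and compares it with the bound:

```python
import numpy as np

rng = np.random.default_rng(3)
h = 8
A = {(i, j): rng.standard_normal((h, h)) for i in (0, 1) for j in (0, 1)}
B = {(i, j): rng.standard_normal((h, h)) for i in (0, 1) for j in (0, 1)}

# Skipping A_00 B_00, A_10 B_01 and A_11 B_11 leaves an error matrix with
# -A_00 B_00 in the (0,0) block and -(A_10 B_01 + A_11 B_11) in (1,1).
E00 = A[0, 0] @ B[0, 0]
E11 = A[1, 0] @ B[0, 1] + A[1, 1] @ B[1, 1]
err = np.sqrt(np.linalg.norm(E00) ** 2 + np.linalg.norm(E11) ** 2)

nf = np.linalg.norm   # Frobenius norm by default
bound = np.sqrt((nf(A[0, 0]) * nf(B[0, 0])) ** 2
                + (nf(A[1, 0]) * nf(B[0, 1]) + nf(A[1, 1]) * nf(B[1, 1])) ** 2)
assert err <= bound   # inequality (6) holds
```

The bound only needs the Frobenius norms of the quadrants, which is what makes it cheap to evaluate before the product itself is computed.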
In Algorithm 3 we combine the error bound with a parameter sweep. This algorithm computes a bound of the SpAMM product error ||SpAMM(A, B, τ) − AB||_F for each of N candidates (τ_1, ..., τ_N) for the SpAMM tolerance. Once an error bound is known for each τ_i, it is straightforward to pick the right SpAMM tolerance: the one whose error bound is the largest one below the tolerance for the product error.

Algorithm 3 Computation of SpAMM errors (CSE)
1: procedure CSE(A, B, (τ_1, ..., τ_N))
2:   if lowest level then
3:     for k = 1, ..., N do
4:       Errors_k = ||A||_F ||B||_F if ||A||_F ||B||_F < τ_k, otherwise 0
5:     end for
6:     return Errors
7:   end if
8:   for i = 0, 1 do
9:     for j = 0, 1 do
10:      E^{ij} = CSE(A_i0, B_0j, (τ_1, ..., τ_N)) + CSE(A_i1, B_1j, (τ_1, ..., τ_N))
11:    end for
12:  end for
13:  for k = 1, ..., N do
14:    if ||A||_F ||B||_F < τ_k then
15:      Errors_k = ||A||_F ||B||_F
16:    else
17:      Errors_k = (Σ_{i,j} (E^{ij}_k)^2)^{1/2}
18:    end if
19:  end for
20:  return Errors
21: end procedure

We evaluate our method in the context of density matrix purification with rigorous error control. In this evaluation, we consider two variants of Algorithm 1. In the first variant, we use the new approximate evaluation of the matrix polynomials with error control but do not perform any subsequent truncation of the product, see Algorithm 4. In the second variant, we also include the subsequent truncation, see Algorithm 5. Note that in all three algorithms the error in each iteration, measured by the norm of E_i = X_i − f_i(X_{i−1}), is kept below δ_i.

Algorithm 4 SpAMM-based purification process
1: procedure Purification(X_0, f_i, δ_i, n)
2:   for i = 1, ..., n do
3:     Select the SpAMM tolerance τ_i with CSE so that the error bound is below δ_i
4:     X_i = approximate evaluation of f_i(X_{i−1}) using SpAMM with tolerance τ_i
5:   end for
6: end procedure

Algorithm 5 Hybrid purification process
1: procedure Purification(X_0, f_i, δ_i, n)
2:   for i = 1, ..., n do
3:     Evaluate f_i(X_{i−1}) approximately using SpAMM with tolerance selected by CSE
4:     Truncate the result to obtain X_i, keeping the total error below δ_i
5:   end for
6: end procedure

We implement the algorithms using the Chunks and Tasks parallel programming model and library [18]. We use the Chunks and Tasks matrix library [19,2] and the hierarchical block-sparse leaf-level library of ref 1. The matrix leaf-level size is 2048, whereas the leaf internal block size is 16.
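The combination of error bound and parameter sweep can be sketched in dense NumPy as follows. This is illustrative only: the function names and leaf size are assumptions, and for brevity the recursion returns the SpAMM product together with its accumulated bound, sweeping the candidates outside the recursion rather than inside it as Algorithm 3 does.

```python
import numpy as np

def spamm_with_bound(A, B, tau, leaf=2):
    """Return (C, bound): the SpAMM product and an upper bound on
    ||C - AB||_F, accumulated as in (6): within a block the bounds of the
    two sub-multiplications add; across blocks they combine root-sum-square."""
    na, nb = np.linalg.norm(A), np.linalg.norm(B)
    if na * nb < tau:
        return np.zeros((A.shape[0], B.shape[1])), na * nb   # skipped product
    if A.shape[0] <= leaf:
        return A @ B, 0.0                                    # exact at leaf
    h = A.shape[0] // 2
    C = np.zeros((A.shape[0], B.shape[1]))
    sq = 0.0
    for i in (0, 1):
        for j in (0, 1):
            C0, b0 = spamm_with_bound(A[i*h:(i+1)*h, :h], B[:h, j*h:(j+1)*h], tau, leaf)
            C1, b1 = spamm_with_bound(A[i*h:(i+1)*h, h:], B[h:, j*h:(j+1)*h], tau, leaf)
            C[i*h:(i+1)*h, j*h:(j+1)*h] = C0 + C1
            sq += (b0 + b1) ** 2
    return C, np.sqrt(sq)

def select_tau(A, B, candidates, delta):
    """Parameter sweep: largest candidate tolerance whose bound is below delta."""
    chosen = min(candidates)
    for tau in sorted(candidates):
        if spamm_with_bound(A, B, tau)[1] < delta:
            chosen = tau
    return chosen

rng = np.random.default_rng(4)
n = 16
d = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
A = np.exp(-d) * rng.standard_normal((n, n))
B = np.exp(-d) * rng.standard_normal((n, n))

delta = 1e-4
tau = select_tau(A, B, [10.0 ** -k for k in range(1, 12)], delta)
C, bound = spamm_with_bound(A, B, tau)
assert bound < delta                                   # selection respects delta
assert np.linalg.norm(C - A @ B) <= bound + 1e-12      # true error within the bound
```

An induction over the recursion shows why the bound is valid: a skipped product contributes at most ||A||_F ||B||_F, an exact leaf contributes nothing, and the per-block and cross-block combinations follow the triangle inequality and (6).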
The computations are performed on the Beskow cluster located at the PDC center at KTH Royal Institute of Technology in Stockholm, Sweden. The system is a Cray machine with 2060 nodes, each carrying two 16-core Intel Xeon E5-2698v3 CPUs and 64 gigabytes of RAM. The base operating frequency is 2.3 GHz. The nodes are connected by the Cray Aries network with the Dragonfly topology. The code is compiled with the GCC 8.3.0 compiler, Cray MPICH 7.7.0 and OpenBLAS 0.2.20, the latter built from source with multi-threading disabled. We let a worker process occupy a whole computational node. The 32 available cores are split into two groups: 31 execute tasks in parallel where possible, and one is dedicated to communication.
In our evaluation, we perform two purification iterations with a converged density matrix using each of the three algorithms, Algorithms 1, 4, and 5. We use a density matrix computed with the Ergo software [25] for a cluster of 7947 water molecules with the 3-21G basis set, which gives a matrix size of 71253. For a given tolerance, the approximate matrix square is computed. Then the process is repeated with the same tolerance, using the approximate square from the previous stage as input. In the end, we compute the exact square of that input matrix to verify that the error does not exceed the tolerance. We use timings from the second iteration only.
We refer to multiplication of matrices and then truncation as truncmul, sparse approximate multiplication as SpAMM and their combination as hybrid.
We present wall times for the different parts of each of the three approaches in Figure 1. The total wall times of the SpAMM and hybrid approaches with error control proposed here are less sensitive to the choice of error tolerance and clearly outperform the truncmul approach for small tolerances. The hybrid variant outperforms the pure SpAMM variant due to less time spent on the parameter sweep used to select the threshold value (T_CSE).
The matrix sparsity for the matrices involved is shown in Figure 2. The left panel clearly shows the issue discussed earlier with many nonzero entries in X_i computed for no purpose. Up to around 85% of the nonzero entries in X_i are removed in the subsequent truncation. This issue is mitigated in the SpAMM and hybrid approaches, resulting in smaller memory usage.

We measure the error between the square of the input matrix of the second iteration of the purification process and its approximate counterpart computed with each of the approximate multiplication variants, to verify that the error control works as expected. The results can be found in Figure 3. All three variants give an error matrix norm below the desired tolerance. One can also notice that the truncmul approach provides the sharpest results in terms of how close the error norm is to the tolerance, whereas the SpAMM variant provides the least sharp results.
In summary, we have presented a method to control the Frobenius norm of the error matrix in sparse approximate matrix-matrix multiplication for matrices with exponential decay of elements, and tested it in the context of the density matrix purification method. The results show that the standard routine, see Algorithm 1, which can be described as "multiply-and-truncate", can be improved by applying the multiplication operation approximately with a properly chosen threshold. One can build the purification process exclusively on approximate multiplication, see Algorithm 4, or combine it with a subsequent truncation as done in Algorithm 5. Our results indicate that the latter combination is the best option.
Although the new SpAMM and hybrid approaches with error control clearly outperform the truncmul approach, there is also room for improvement. The routines utilizing the SpAMM algorithm require an extra step that selects the best truncation threshold from a given vector of candidate values, and the overhead of this operation is comparable to the cost of the approximate multiplication itself for the variant built exclusively on SpAMM. The hybrid variant has a lower overhead for the selection routine, due to a smaller number of nonzero elements. The cost of the selection routine depends on the structure of the matrix: the more zero blocks it has, the faster the routine works. Another way to reduce the cost is to manipulate the vector of candidate threshold values, for example by altering its length and starting value.
While representing a significant improvement, the hybrid approach still involves the computation of a significant number of matrix entries that are thrown away in the subsequent truncation. This can, at least partially, be explained by an overestimation of the error by the CSE algorithm, which in Figure 3 manifests itself as an error more than an order of magnitude below the chosen tolerance. Besides improving the error bound in the Frobenius norm, both with respect to sharpness and speed, as discussed above, future work could also address error control for SpAMM in other norms.
We note that asymptotic error analyses with respect to both the SpAMM tolerance and system size have been carried out in earlier work [1,11]. Here, we have proposed a scheme to select the SpAMM tolerance so that the error in a unitarily invariant norm is below a predefined tolerance, as needed in density matrix purification with rigorous error control.