Parallel Implementation of Large-Scale Linear Scaling Density Functional Theory Calculations With Numerical Atomic Orbitals in HONPAS

Linear-scaling density functional theory (DFT) is an efficient method to describe the electronic structures of molecules, semiconductors, and insulators to avoid the high cubic-scaling cost in conventional DFT calculations. Here, we present a parallel implementation of linear-scaling density matrix trace correcting (TC) purification algorithm to solve the Kohn–Sham (KS) equations with the numerical atomic orbitals in the HONPAS package. Such a linear-scaling density matrix purification algorithm is based on the Kohn's nearsightedness principle, resulting in a sparse Hamiltonian matrix with localized basis sets in the DFT calculations. Therefore, sparse matrix multiplication is the most time-consuming step in the density matrix purification algorithm for linear-scaling DFT calculations. We propose to use the MPI_Allgather function for parallel programming to deal with the sparse matrix multiplication within the compressed sparse row (CSR) format, which can scale up to hundreds of processing cores on modern heterogeneous supercomputers. We demonstrate the computational accuracy and efficiency of this parallel density matrix purification algorithm by performing large-scale DFT calculations on boron nitrogen nanotubes containing tens of thousands of atoms.


INTRODUCTION
The Kohn-Sham density functional theory (DFT) (Hohenberg and Kohn, 1964;Kohn and Sham, 1965) has been successfully applied to perform first-principles calculations for describing the electronic structures of both molecules and solids. However, conventional DFT calculations based on direct diagonalization methods for solving the KS equations have a high cubic-scaling cost (Goedecker, 1999), which can usually be used to study medium-scale systems containing up to hundreds of atoms. Therefore, it is difficult to achieve massive parallelism for these conventional cubic-scaling methods due to complex communication issues. To avoid the bottleneck arising from the computational cost and memory usage of directly diagonalizing the Hamiltonian matrix in conventional DFT calculations, linear-scaling methods using local basis functions have been proposed (Goedecker, 1999), strongly promoting the applications of DFT calculations in large-scale systems containing thousands of atoms.
In general, linear-scaling methods include direct, variational, and purification methods (Bowler and Miyazaki, 2012). The direct methods are featured by direct evaluation of density matrix using various approximations, including divide and conquer (Yang, 1991;Yang and Lee, 1995) and Fermi operator expansion (Goedecker and Colombo, 1994;Goedecker and Teter, 1995;Liang et al., 2003). The variational methods minimize the total energy with respect to the auxiliary density matrix or Wannier-like orbitals, covering density matrix minimization method (Daw, 1993;Li et al., 1993;Nunes and Vanderbilt, 1994) and orbital minimization method (OMM) (Galli and Parrinello, 1992;Mauri and Galli, 1994;Kim et al., 1995;Ordejón et al., 1995). The third scheme exploits the purification polynomial and iterative solution, which is known as density matrix purification method (Palser and Manolopoulos, 1998;Niklasson, 2002;Niklasson et al., 2003). Nearly all of the linear-scaling methods are based on the Kohn's nearsightedness principle with localized basis sets, such as Gaussian type orbitals (GTOs) (Frisch et al., 1984) and numerical atomic orbitals (Shang et al., 2010) (NAOs), resulting in the sparsity of density matrix with a number of non-zero entries that increase linearly with the system size, so the linear-scaling matrix-matrix multiplication can be achieved (VandeVondele et al., 2012;Kim and Jung, 2016). In particular, the density matrix purification algorithms without prior knowledge of the chemical potential, including the trace-preserving canonical purification scheme of Palser and Manolopoulos (PM) (Palser and Manolopoulos, 1998;Daniels and Scuseria, 1999), the trace-correcting purification (TC) (Niklasson, 2002), and the trace resetting density matrix purification (TRS) (Niklasson et al., 2003), have been demonstrated as accurate and efficient linear-scaling methods to describe the electronic structures of molecules, semiconductors, and insulators. However, almost all of the developed linear-scaling techniques (direct, variational, and purification methods) assume the presence of a non-zero gap in the electronic structure, which prevents them from treating metallic systems. Recently, Suryanarayana (2017) have employed the O(N) Spectral Quadrature (SQ) method (Suryanarayana, 2013;Pratapa et al., 2016) to study the locality of electronic interactions in aluminum (a prototypical metallic system) as a function of smearing/electronic temperature. They have found exponential convergence accompanied by a rate that increases sub-linearly with smearing. It is also worth mentioning that all these linearscaling methods based on Kohn's nearsightedness principle are limited to the localization of density matrix (Bowler and Miyazaki, 2012). A recently published innovative version of PEXSI scheme named iPEXSI (Etter, 2020), which does not rely on the nearsightedness principle, can scale provably better than cubically even in the absence of localization of density matrix. The iPEXSI algorithm utilizes a localization property of triangular factorization, which could extend the usable range of linear-scaling method to metallic system without the constraint of finite electronic temperature. Nowadays, with the rapid development of modern heterogeneous supercomputers, the high-performance computing (HPC) has become a powerful tool for accelerating the DFT calculations to deal with large-scale systems. Several highly efficient DFT software based on low-scaling methods have been developed, such as SIESTA (Soler et al., 2002), OPENMX (Ozaki and Kino, 2005), CP2K (Kühne et al., 2020), CONQUEST (Gillan et al., 2007), PROFESS (Ho et al., 2008), FREEON (Challacombe, 2014), ONETEP (Skylaris et al., 2005), BigDFT (Genovese et al., 2008;Mohr et al., 2014), FHI-aims (Blum et al., 2009), ABACUS (Chen et al., 2010(Chen et al., , 2011, HONPAS (Qin et al., 2015), and DGDFT (Lin et al., 2012;Hu et al., 2015a,b;Banerjee et al., 2016;Zhang et al., 2017), which are capable to make full advantage of the massive parallelism available on HPC architectures beneting from the local data communication of sparse Hamiltonian matrix generated with local basis sets. In linear-scaling DFT calculations, the kernel for HPC is to parallel sparse matrix-matrix multiplication. In order to realize the HPC parallelism, two massively parallel libraries of BCSR (Borštnik et al., 2014) and NTPOLY (Dawson and Nakajima, 2018) have been developed, which have shown a high performance for the density matrix purification algorithms implemented in the CP2K (Kühne et al., 2020) and CONQUEST (Gillan et al., 2007) packages.
In this work, we present a parallel implementation of linearscaling density matrix second-order trace-correcting purification (TC2) algorithm (Niklasson, 2002) to solve the KS equations with the NAOs in the HONPAS package (Qin et al., 2015). We propose to use the MPI_Allgather function for parallel programming to deal with such sparse matrix multiplication within the CSR format, which can be scaled linearly up to hundreds of processing cores on modern heterogeneous supercomputers. We demonstrate the computational accuracy and efficiency of this linear-scaling density matrix purification method by performing large-scale DFT calculations on boron nitrogen nanotubes containing thousands of atoms.

Density Functional Theory
We first give a brief review of Kohn-Sham density functional theory (KS-DFT). The key spirit of KS-DFT is to solve the KS equations defined aŝ whereĤ is the Hamiltonian operator, ψ i is the ith molecular orbital, and ǫ i is the corresponding orbital energy.T is the kinetic operator,V ion is the ionic potential operator, andV H is the Hartree potential operator defined aŝ where the electron density is given by In the approximation of linear combination of atomic orbitals (LCAO) (Mulliken, 1955), the ψ i is expanded on a set of NAOs where c µi is the expansion coefficient at the µth atomic orbital and N b is the number of NAOs. Then, the KS equations can be rewritten into matrix notations as where C is coefficient matrix and E is the corresponding orbital energy. H and S are the Hamiltonian and overlap matrices over the NAOs The default choice in the SIESTA package is to use the direct diagonalization method though the LAPACK and ScaLAPACK libraries to solve this eigenvalue problem with a high cubicscaling cost. Therefore, the computational cost and memory usage of such DFT calculations increase rapidly as the system size, which are only limited to small systems containing hundreds of atoms. In order to overcome this limitation, several linearscaling methods have been implemented in the SIESTA package, such as the Kim-Mauri-Galli (KMG) orbital minimization (OMM) method (Galli and Parrinello, 1992;Mauri and Galli, 1994;Kim et al., 1995;Corsetti, 2014) and divide and conquer method (Cankurtaran et al., 2008). The KMG requires a initial approximate Wannier functions and a prior knowledge of the chemical potential. In the HONPAS-SIESTA package (Qin et al., 2015), we implement the density matrix purification algorithms, including the trace-preserving canonical purification scheme of PM (Palser and Manolopoulos, 1998;Daniels and Scuseria, 1999), the trace-correcting purification (TC) (Niklasson, 2002), and the trace resetting density matrix purification (TRS) (Niklasson et al., 2003).

Linear-Scaling Density Matrix Purification Algorithms
After constructing the Hamiltonian matrix, the density matrix can be obtained by directly diagonalizing the Hamiltonian matrix with cubic-scaling cost. In order to avoid the high cost of explicit diagonalization, we implement three density matrix purification algorithms, without prior knowledge of the chemical potential for linear-scaling DFT calculations, including the trace-preserving canonical purification scheme of PM, the trace-correcting purification (TC) (Niklasson, 2002), and the trace resetting density matrix purification (TRS) (Niklasson et al., 2003), in the HONPAS package (Qin et al., 2015). In this work, we use the second-order trace-correcting purification (TC2) (Niklasson, 2002) algorithm with orthogonal basis sets to illustrate our parallel algorithms. In the coordinate presentation, the general form of density matrix can be given by with the chemical potential µ and the inverse temperature β = 1/k B T. Within the LCAO method, we can transform the density matrix from coordinate presentation to the basis presentation, then the density matrix element P µν becomes: If NAOs are orthogonal, the density matrix element P µν can be written as Note that ε i is relative to the eigenvalue ofĤψ i = ε i ψ i , so P µν can be rewritten as which implies that P is commutative with H, namely [H, P] = 0. Another substantial property of the appropriate density matrix is particle conservation, Tr(P) = N e /2. When the electronic temperature is zero, f (ε i ) = 1 and the density matrix of insulator can be written as Frontiers in Chemistry | www.frontiersin.org FIGURE 1 | The flowchart of density matrix purification TC2 method. There are four time-consuming parts in the TC2 method, including constructing the Hamiltonian matrix, initializing the density matrix P from the Hamiltonian matrix with Cholesky and Lanzos methods, updating the density matrix with parallel sparse matrix-matrix multiplication, and computing total energy and atomic forces after SCF iterations.
which must satisfy the so-called idempotency PP = P with an orthogonal basis.
The solution of eigenvalue problem is under three restricted conditions of the density matrix mentioned above, which is known as the purification method. The trace-preserving canonical purification scheme of PM (Palser and Manolopoulos, 1998), which imposes commutation relation and tracepreserving condition, works with a predefined occupation and does not need the input or adjustment of the chemical potential. Trace-conserving spectral projections are performed during each iteration, until the density matrix P n converges to the correct one that satisfies the idempotency condition. This method is inefficient at low and high partial occupancies (Palser and Manolopoulos, 1998;Daniels and Scuseria, 1999). A subsequent strategy proposed by Niklasson named TC algorithm (Niklasson, 2002). Its second-order form is called the second-order tracecorrecting purification (TC2) method (Niklasson et al., 2003). The higher order TC2 requires additional matrix multiplications, which pursues a more rapid reduction of errors and a less step of purification iterations (Kim and Jung, 2016).
The TC2 purification algorithm is simple, robust, and rapidly convergent for closed-shell systems, and more efficient in orthogonal basis sets (Xiang et al., 2005). In this work, H denotes the Hamiltonian matrix under the presentation of orthogonal basis sets. Reasonably in the preparatory step, a transformation H = Z T H AO Z is required, here the matrix Z is obtained by solving out the inverse square root of overlap matrix S by the Cholesky factorization (Cholesky, 2005). The idempotency and Algorithm 1 The pseudocode of TC2 algorithm, where N e is the number of electrons, E is the energy-density matrix, ε min (H) and ε max (H) denote the minimal and maximum eigenvalue of the Hamiltonian matrix H, respectively.
During each iteration step, the trace of P n+1 is corrected by P 2 n (Tr(P n ) ≤ N e /2) or 2P n − P 2 n (Tr(P n ) > N e /2). Then, matrix elements less than a numerical threshold δ filter (10 −4 or 10 −6 ) are dropped to zero, thus maintaining the sparsity of density matrix. The pseudocode of the TC2 algorithm is given in Algorithm 1 and its corresponding owchart is shown in Figure 1.
The density matrix purification method in HONPAS is based on the fact that both the density matrix and Hamiltonian matrix are sparse with NAOs. Therefore, sparse matrix multiplication is the most expensive step in the density matrix purification method. Figure 2 shows the sparsity of the density matrix P saved as CSR format for the BN nanotubes consisting of 100 and 1,000 atoms (BNNT100 and BNNT1000) with different basis sets [single-ζ (SZ), double-ζ (DZ), and double-ζ plus polarization (DZP)] and thresholds (δ filter = 10 −4 and 10 −6 ) of non-zero elements in the density matrix P.
When the selected basis sets are strictly localized, the coefficient matrix C for BNNT100 and BNNT1000 systems formed in a similar block diagonal matrix and arbitrary row of the resultant P is occupied by same number of non-zero elements since P = CC T . Therefore, the total number of nonzero elements grows linearly with the system size under tight binding approximation, which is the substantial precondition of almost all linear scaling algorithms. If we have an observation onto the first two columns of Figure 2, matrices show blockmultidiagonal patterns and sparse degree of BNNT 1000 with the SZ basis set under δ filter = 10 −4 is 11.6%, which is obviously less than that of BNNT100 (87.7%). Variation trend of sparse degree is consistent with the character mentioned above from a perspective of image. Moreover, the cutoff radius r c of the DZP basis set is higher, so non-zero elements per row distribute more intensively FIGURE 2 | The sparsity of the density matrix P in the compressed sparse row (CSR) format of BN nanotubes (BNNTs) consisting of 100 and 1,000 atoms (BNNT100 and BNNT1000). Percentage under the system name indicates sparse degree of P. The matrix elements exceeding 10 −2 are labeled as light blue and those exceeding 10 −1 are marked as red. White area is remained for elements with tiny values or zero. Density matrix becomes less occupied with its elements gathering close to diagonal when the threshold after multiplication 100× tighter (10 −6 ) than that of higher occupied matrix (10 −4 ). When the basis sets become larger from SZ to DZP, sparse degree varies and δ filter strongly affects the sparsity pattern.
Frontiers in Chemistry | www.frontiersin.org but their value is smaller than that of SZ and DZ. Sparse degree of BNNT1000 also decreases significantly from that of BNNT100 in the case of DZP basis set, such as 86.6% of BNNT100 and 16.4% of BNNT1000 at the same threshold, respectively.
On the perspective of threshold as shown in Figure 2, we observe that, as the system size increases, the influence of δ filter is more significant, either the patterns or the sparse degree display an obvious difference under δ filter = 10 −4 and 10 −6 . For example, the density matrix of BNNT1000 with the DZP basis set is 32.3% occupied under δ filter = 10 −6 and 16.4% occupied under δ filter = 10 −4 . On the other hand, δ filter has a slight influence on the sparsity of BNNT100 compared with BNNT1000, which implies that the distribution of numerical value of matrix elements is shifted to higher level than that of BNNT1000. Just as elements dotted with deep color in BNNT100 are much more intensive than those of BNNT1000. It should be noted that dropping matrix elements less than δ filter and using strictly truncated NAOs both contribute to the sparsity of P.

Parallel Implementation of Sparse Matrix Multiplication
The time required to process matrix-matrix multiplications during each iteration step accounts for a major part of total time. Note that there are some additional steps such as data communication and matrix addition. Fortunately, all matrices we need to deal with are sparse, so that the number of dot products reduces. The linear scaling cost arises from the fact that all matrix operations are performed on sparse matrices, which has a number of non-zero entries that increase linearly with the system size (VandeVondele et al., 2012).
The sparsity of matrix also causes unexpected drawbacks. An apparent disadvantage is, the matrix multiplication step would change the sparsity pattern during the self-consistent field (SCF) iterations, resulting in a load imbalance between matrix computation and data commutation among different processing cores. Since each matrix is distributed on a series of processes in advance, the instability of sparsity pattern will occur at each iteration, thus we also need to modify the data distribution after each iteration or exploit a block-cyclic distribution scheme. Apart from those, dropping matrix elements with the numerical value less than a threshold can reduce the number of dot products. But the computational accuracy of total energy and atomic forces is sacrificed inevitably under a loose threshold. The parallel version of TC2 algorithm in HONPAS is based on CSR data format and message-passing interface (MPI), which is capable of performing massive parallelism on modern heterogeneous supercomputers. We employ the SPARSEKIT library to manipulate and deal with sparse matrices, which provides programs for converting data structures, filtering out elements, and performing basic linear algebra operations with sparse matrix (Saad, 1994).
In the parallel TC2 module, there is a single hierarchical structure of parallelization that consists of single type of data distribution and communication scheme. The TC2 module utilizes the MPI parallel programming to deal with data communications between different MPI processes. In our work, the MPI processes are organized in 1D row MPI grids. The FIGURE 3 | Parallel data distribution and communication of sparse matrix multiplication when N p = 4. The density matrix is partitioned into four row block local matrices with 1D row BN nanotube (BNNT) grid parallelism (1p, 2p, 3p, and 4p). Each local matrix is stored in the compressed sparse row (CSR) format. MPI_Allgather is invoked to integrate these four row block local matrices into a global matrix in the CSR format. density matrix is distributed by 1D row blocks across MPI processes, and each process saves N b /N p rows of global matrix. Thus, such local and global sparse matrix-matrix multiplication does not require additional data communication. Individual process computes its part of the multiplication, processing a row block of np (n = 1, 2, . . ., N p ) at a time. After the local multiplication has been processed, each processor just gathers a local subset of global density matrix. We use the MPI_Allgather function to gather local matrices into global density matrix in each MPI process, similar to the case of MPI_Gather and then MPI_Bcast, then performing local sparse matrix-matrix multiplication at the next iteration step. Figure 3 illustrates the schematic diagram of MPI communication on CSR data format, in which we set N p = 4 to simplify the discussion.

RESULTS AND DISCUSSION
In this section, we demonstrate the computational accuracy and efficiency of our parallel TC2 algorithm. We implement this method in the HONPAS package (Qin et al., 2015), which has been written in the Fortran programming language with the MPI for parallelism. We use the norm-conserving Troullier-Martins pseudopotentials (Troullier and Martins, 1991) to represent interaction between core and valence electrons. We use the exchange-correlation functional of local density approximation of Goedecker-Teter-Hutter (LDA-PZ) (Goedecker et al., 1996) to describe the electronic structures of these BNNTs with a grid cutoff of 100 Ry. In our calculations, the NAOs are generated by default parameters in SIESTA. We utilize the linear-scaling density matrix TC2 purification algorithm to calculate the electronic structures of a series of boron nitride nanotubes (BNNTs), containing 100-18,000 atoms (labeled by BNNT100-BNNT18000). The details of the input parameters and atomic structures of BNNTs used in this work as well as the performance data are shown in the Supplementary Materials. 1 | Absolute error of total energy E tot (eV/atom) and the maximum of root mean square error of atomic forces F max (eV/Å) of the TC2 method with varying thresholds of δ filter = 10 −4 and 10 −6 for the BNNT100 system with SZ, DZ, and DZP basis sets. The reference results are computed by the direct diagonalization method.

Accuracy
We benchmark the computational accuracy of parallel TC2 method by comparing the results with those obtained from the diagnonalization method. We consider the effects of both the size of basis sets (SZ, DZ, and DZP) and different values of thresholds (δ filter = 10 −4 to 10 −6 ) on the computational accuracy of density matrix TC2 purification algorithm. We define the errors of total energy and atomic forces, respectively, as where N A is the total number of atoms and I is the atom index.
In the HONPAS calculations, the default convergence accuracy for total energy and atomic forces are 10 −4 eV/atom and 0.02 eV/Å, respectively. Table 1 shows that the TC2 purification calculation for total energy is performed very well when choosing a tight dropping threshold, and δ filter = 10 −6 can yield a total energy accuracy of 10 −5 eV/atom at least. On the other hand, strictly truncated NAOs can yield the sparsity without loss of accuracy simultaneously (Shang et al., 2010). We compute the total energy and atomic forces under different basis sets using a variable threshold. As shown in Figure 4, the errors of atomic forces from TC2 and those obtained from direct diagonalization method are indistinguishable. For all tested systems, the accuracy of the TC2 method can be obviously improved by tightening the threshold (10 −4 to 10 −6 ). In particular, when the threshold is set to 10 −4 , F max with the most general case of DZP basis set reaches 10 −1 eV/Å, which is already comparable to the magnitude of atomic force itself. In contrast, F max = 4.30 × 10 −4 eV/Å under δ filter = 10 −6 with the same basis set. The noticeable error arises from the lack of information in density matrix when too many elements are neglected after each iteration step and the information of Hamiltonian matrix just included in the initial step. However, the relative error of energy per atom is less than 10 −7 when threshold is set to 10 −6 in the case of SZ basis set, which indicates that the computational accuracy of TC2 method is still guaranteed. On the perspective of basis sets, high accuracy is ensured when we employ rigorously localized basis sets (SZ). Note that systems with the DZP basis set have relatively larger errors, since the information of polarization orbital is partly omitted by dropping matrix elements. For instance, when δ filter is set to 10 −4 , the energy error for SZ is 3.36 × 10 −4 but that for DZP is 8.80 × 10 −3 (still can achieve the converged accuracy). As we have mentioned in section 2, non-zero elements those hold relatively small value distribute more intensively in the case of DZP basis set, and the physical information can be seriously lacking under relatively large δ filter .

Efficiency
We demonstrate the computational efficiency and parallel scalability of linear-scaling TC2 method by checking the weak and strong scaling performance on BNNT systems with the SZ basis set and a threshold of δ filter = 10 −4 . We illustrate the total time of the main time-consuming parts as shown in Figure 1: (a) Construction of Hamiltonian matrix, (b) evaluation of density matrix P from Hamiltonian matrix by Cholesky factorization following Lanczos method, and (c) purification with matrix multiplication and addition. It should be noticed that the data communication via MPI interface also occupies numerous time resource while performing massive parallelization over plenty of processing cores. Practical tests on the computational efficiency and parallel scalability are performed in the case of BNNT systems with MPI parallelism on modern heterogeneous supercomputers, including comparison of TC2 and diagonalization methods with respect to different system sizes and process counts, as shown in Figures 5, 6, respectively. There are some additional steps such as computing total energy and atomic forces, which are all included in the total wall clock time of outer SCF iterations in the TC2 method.
Since the computational cost of linear-scaling TC2 method grows linearly with respect to the system size, a noteworthy speed up is supposed to be observed. We choose all tested systems with the SZ basis set to illustrate strong and weak scaling behaviors, since it is more strictly localized, resulting in a relatively small change of sparsity pattern after each iteration step. The variation of total time with respect to the system size is plotted in Figure 5. We can see that the scaling of TC2 is fitted to O(N) due to the linearly growing sparse degree of P, and the number required to perform multiplication has the same trend. Linear scaling behavior is obtained with various systems containing 2,000-9,000 FIGURE 5 | Weak scaling of wall clock time per SCF iteration with respect to the number of atoms with the message-passing interface (MPI) parallelism for BN nanotubes (BNNTs) with 2,000 and 9,000 atoms (BNNT2000 and BNNT9000) computed with the TC2 and diagonalization methods. atoms under serial mode (N p = 1), and it continues to scale further up to 500 cores at least, which benefits from the efficient parallel implementation of matrix multiplications based on the CSR formatted sparse density matrix. A speed-up of 4.7 can be achieved for 9,000 atoms (500 cores) and could be larger for more atoms. The fitted scaling for explicit diagonalization is just O(N 1.9 ) with number of atoms fewer than 5,000 when the number of processors is relatively large, arising from the load imbalance that problem size (number of computational tasks) distributed on each process is not adequate and some cores remain idle. If the processors keep increasing, low efficiency of parallelization is going to happen. When the size of system grows sufficiently or processing cores have a relevant scale, fitted scaling turns back to O(N 3 ) due to the cubic scaling of conventional diagonalization step. As a conclusion, linear-scaling TC2 method outperforms explicit diagonalization in terms of expansibility to large systems and massive parallel implementation. Figure 6 compares the parallel scalability of TC2 to diagonalization methods. As it can be seen, the parallel scalability of both methods is unsatisfactory, especially with the smaller system size. This issue arises in the load imbalance caused by idle processors since computational tasks are inadequate compared with hundreds of cores. Test for 18,000 atoms with diagonalization is not represented due to a memory overflow problem (the dimension of matrix is 72,000). Unlike the diagonalization method, test for TC2 has been performed since the utilization of CSR data format reduces the memory requirement. TC2 method demonstrates just scaling up to 600 cores, since the 1D processes layout prevents it from massive parallelization. The performance of global MPI communications such as MPI_Allgather is strongly impacted by the physical distance of remote processing cores, which prompts us to utilize BCSR storage format and 2D block-cyclic processor layout.

CONCLUSION AND OUTLOOK
In summary, we present a parallel implementation of linearscaling density matrix trace correcting (TC) purification algorithm to solve KS equations with numerical atomic orbitals in the HONPAS package. We use the MPI_Allgather function for parallel programming to deal with the sparse matrix multiplication within the CSR format, which can scale up to hundreds of processing cores on modern heterogeneous supercomputers. We demonstrate the computational accuracy and efficiency of this linear-scaling density matrix purification algorithm by performing large-scale DFT calculations on boron nitrogen nanotubes containing tens of thousands of atoms.
However, our parallel implementation of TC2 method in HONPAS is inferior to that of BigDFT (Genovese et al., 2008;Mohr et al., 2014), ONETEP (Skylaris et al., 2005), and CONQUEST (Gillan et al., 2007). They exploit more than one level of organization and data distribution schemes resembling the BCSR format to handle the groups of atoms, which achieve high flexibility in load balancing (Bowler et al., 2002) with high performance on modern heterogeneous supercomputers. In the future, We plan to implement a massively parallel algorithm based on the NTPoly library (Dawson and Nakajima, 2018) in HONPAS. The NTPoly library utilizes the 3D sparse matrix multiplication algorithm, that is, the processors are organized into a three dimensional, cube-shaped virtual topology. In this case, density matrix purification algorithms can scale up to thousands of processing cores on modern heterogeneous supercomputers.

DATA AVAILABILITY STATEMENT
All datasets presented in this study are included in the article/Supplementary Material.

AUTHOR CONTRIBUTIONS
ZL has tested the computational accuracy and efficiency of the TC2 method in HONPAS package, written manuscript and drawn figures. WH and XQ provide the research direction and revise the manuscript. WH has provided codes of the parallel version of TC2 method in HONPAS and directive methods for tests of computational accuracy and efficiency. LW has offered help for the modification of formulas. All authors contributed to the article and approved the submitted version.