Accelerated Cyclic Reduction: A Distributed-Memory Fast Solver for Structured Linear Systems

We present Accelerated Cyclic Reduction (ACR), a distributed-memory fast direct solver for rank-compressible block tridiagonal linear systems arising from the discretization of elliptic operators, developed here for three dimensions. Algorithmic synergies between Cyclic Reduction and hierarchical matrix arithmetic operations result in a solver that has $O(k~N \log N~(\log N + k^2))$ arithmetic complexity and $O(k~N \log N)$ memory footprint, where $N$ is the number of degrees of freedom and $k$ is the rank of a typical off-diagonal block, and which exhibits substantial concurrency. We provide a baseline for performance and applicability by comparing with the multifrontal method where hierarchical semi-separable matrices are used for compressing the fronts, and with algebraic multigrid. Over a set of large-scale elliptic systems with features of nonsymmetry and indefiniteness, the robustness of the direct solvers extends beyond that of the multigrid solver, and relative to the multifrontal approach ACR has lower or comparable execution time and memory footprint. ACR exhibits good strong and weak scaling in a distributed context and, as with any direct solver, is advantageous for problems that require the solution of multiple right-hand sides.


Introduction
Cyclic reduction, introduced in [1], is a direct solver for tridiagonal linear systems. It is effective for the solution of (block) Toeplitz and (block) tridiagonal matrices that arise from the discretization of elliptic PDEs [2,3]. For the constant-coefficient Poisson equation, since each of the blocks of the discretized system is Fourier diagonalizable, cyclic reduction can be used in combination with the fast Fourier transform (FFT) to deliver optimal complexity, as proposed in the FACR method [4]. However, in the presence of variable coefficients, the FFT-enabled version of cyclic reduction can not be used. The purpose of this work is to address the time and memory complexity growth in the presence of heterogeneous blocks with a variant called Accelerated Cyclic Reduction (ACR). The main observation is that elliptic operators have a hierarchical structure of off-diagonal blocks that can be approximated with lowrank matrices. Thus we approximate appropriate blocks of the initially sparse matrix with hierarchical matrices and operate on these blocks with hierarchical matrix arithmetics, instead of the usual dense operations, to obtain a direct solver of log-linear arithmetic and memory complexities. This philosophy follows recent work discussed below, but to our knowledge, this is the first demonstration of the utility of complexity-reducing hierarchical substitution in the context of cyclic reduction.
Cyclic reduction can be thought of as a direct Gaussian elimination on a permuted system that recursively computes the Schur complement of half of the unknowns until a single block remains or the system is small enough to be inverted directly. Schur complement computations have a complexity that is dominated by the cost of the inverse; by applying a red/black re-ordering of the unknowns, the linear system separates into two halves with block diagonal structure. This decoupling addresses the most expensive step of the Schur complement computation regarding operation complexity and does so in a way that launches independent subproblems. This concurrency feature, in the form of recursive bisection, can be naturally implemented in a distributed-memorẙ parallel environment. The stability of the block cyclic reduction has been studied in [5], where the author presents error bounds for strictly and nonstrictly diagonally dominant matrices.
In order to simplify the description of the algorithm, in this work we consider structured linear systems arising from the discretizations or scalar PDEs on three-dimensional Cartesian grids. For three-dimensional problems of size N " n 3 , where n is the number of discretization points in the linear dimension of the target domain, the synergy of cyclic reduction and hierarchical matrices leads to a parallel fast direct solver of Opk N log N plog N`k 2 qq arithmetic complexity, and Opk N log N q memory footprint, where k ! N represents the numerical rank of compressed blocks. This is in contrast to OpN 2 q and OpN 1.5 q respectively, if hierarchically low-rank matrices matrices are not used.
In this manuscript, we present ACR and its distributed-memory implementation, and we demonstrate its performance on a set of problems with various symmetry and spectral properties in three dimensions. These problems include the Poisson equation, the convection-diffusion equation, and the indefinite Helmholtz equation. We show that ACR is competitive in memory consumption and time-to-solution when compared to methods that rely on a global factorization and do not exploit the cyclic reduction structure.

Related work
Recent years have seen increasing interest in the use of hierarchical low-rank approximations to accelerate the direct solution of linear systems. In this section, we briefly describe some of this literature focusing primarily on efforts that target distributed-memory environments.
Arguably the most common approach for using hierarchical matrix representations in matrix factorizations is to use low-rank approximations to compress the dense frontal blocks that arise in the multifrontal variant of Gaussian elimination. The enabling property is that under proper ordering, many of the off-diagonal blocks of the Schur complement of discretized elliptic PDEs have an effective low-rank approximation [6] that improves the memory and arithmetic estimates of conventional multifrontal solvers [7]. Furthermore, there are efficient low-rank approximation methods to perform the necessary arithmetic operations and preserve the low-rank representation during the factorization and solution stages of the solver. Within this general approach, various methods that differ in the particular data-sparse format used and in the algorithms for the computation of low-rank approximations have been developed.
In Wang et al. [8] the authors investigate the use of the HSS format [9] to accelerate the parallel geometric multifrontal method, which results in a method known as the HSS-structured multifrontal solver (HSSMF). The general approach uses intra-node parallel HSS operations within a distributed-memory implementation of the multifrontal sparse factorization. This approach lowers the complexity of both arithmetic operations and memory consumption of the resulting HSS-structured multifrontal solver by leveraging the underlying numerically low-rank structure of the intermediate dense matrices appearing within the factorization process, driven by an optimal nested dissection ordering.
In a similar line of work, Ghysels et al. [10] also investigate a combination of the multifrontal method and the HSS-structured hierarchical format, extending the range of applicability of the solver to general non-symmetric matrices. Using the task-based parallelism paradigm, they introduce randomized sampling compression [11] and fast ULV HSS factorization [12]. Under the assumption of the existence of an underlying low-rank structure of the frontal matrices, randomized methods deliver almost linear complexity; this reduces the asymptotic complexity of the solver, which is mainly attributed to the frontal matrices near the root of the elimination tree. The effectiveness of these task-based algorithms in combination with a distributed-memory implementation of the multifrontal method is available in an early stage software release of the package STRUMPACK [10], which we will consider in the numerical experiments section of this article. The HSS format assumes a weak admissibility condition (see section 2.1.1), which in practice requires the use of large numerical ranks even for approximations with modest relative accuracy. Consequently, this stresses the memory requirements and increases overall execution time.
The hierarchical interpolative factorization [13,14] is another method for finding low-rank approximations that has proved to be a fast solver for symmetric elliptic PDEs and integral equations. This decomposition relies on a "skeletonization" procedure to eliminate a redundant set of points from a symmetric matrix to further compress the dense fronts. The key step in skeletonization uses the interpolative decomposition of low-rank matrices to achieve a quasi-linear overall complexity in factorization. The performance of hierarchical interpolative decomposition in a distributed-memory environment is reported in [15].
A fast direct method for high-order discretizations of elliptic PDEs has been proposed by Martinsson et al. [16,17,18]. The method is based on a multidomain spectral collocation discretization scheme and a hierarchy of nested grids, similar to nested dissection. It exploits analytical properties of elliptic PDEs to build Dirichlet-to-Neumann operators, by hierarchically merging these operators originating from smaller grids. When computations are done using the HSS data-sparse format, an asymptotic complexity of OpN 4{3 q can be reached. The high-order discretizations used in this method makes it quite powerful in practice as they allow it reach the same accuracy with fewer degrees of freedom compared to second order discretizations. A distributed-memory implementation of this algorithm is in progress.
Even though this approach has larger asymptotic estimates than the log-linear performance of the methods above, because of the high-order discretization of the PDE, this method is quite powerful in practice as they can reach the same accuracy with fewer degrees of freedom as compared to second order discretizations. A distributedmemory implementation of this algorithm is in progress.
The BLR format [19] has also been used to compress blocks into low-rank approximations to accelerate the factorization process of the multifrontal method. This format is compatible with numerical pivoting and is wellsuited for the reuse of existing high-performance implementations of dense linear algebra kernels. Even though this format is not hierarchical, it has proven to be useful for a wide range of problems [20] within the distributed-memory implementation of the multifrontal method provided by the MUMPS library [21,22].
Rather than compressing and identifying individual blocks of the decomposition, another hierarchy-exploiting approach considers the system as a whole and seeks to construct a holistic decomposition of the full linear system. An example of such decomposition is the recursive computation of the inverse of a hierarchical matrix [23,24], or the computation of its Cholesky or LU factorization [25,26]. These methods have generally much higher prefactors than methods that compress individual matrix blocks of the factorizations and are not usually competitive for large-scale problems; as an example, we refer the reader to [23] for a discussion of the challenges of scaling the construction of the inverse of a hierarchical matrix.
Pouransari et al. approximate fill-in via low-rank approximations with the H 2 format; see [27]. This format guarantees linear complexity provided that blocks correspond to well-separated clusters and have a data-sparse property. The algorithm starts by recursively bisecting the computational domain, implicitly forming a binary tree. The leaf nodes correspond to independent subdomains, and the internal nodes correspond to Schur complements to computed with low-rank arithmetic operations. The bottom-up elimination process is performed with a procedure referred to as "extended sparsification" in which the original matrix dimension grows by introducing auxiliary variables but nonetheless remains sparse. Alternatively, elimination can be performed with an in-place algorithm that keeps the matrix size constant. A related method with similar strategies as in this work is the so-called "compress and eliminate" solver [28]. A recent extension of this line of work into a distributed memory environment documented in [29], demonstrates that concurrent processors can work on independent subdomains defined by their corresponding subgraphs, where interior vertices are eliminated concurrently. Communication is needed at the boundary vertices, but additional concurrency at the boundary is exploited trough graph coloring.

Contributions
The contribution of this work is the development of a parallel, robust and efficient method for the solution of block tridiagonal linear systems, with emphasis on systems that arise from the discretization of elliptic PDEs. ACR is a fast solver in the sense that it has a log-linear arithmetic complexity in operations count and memory consumption. The algorithm arrives at the solution in a finite number of steps, rather than iteratively converging to a solution, which makes it a direct solver with a tunable accuracy. The fact that ACR is entirely algebraic extends its range of applicability to problems with an arbitrary coefficient structure including nonsymmetry within the block tridiagonal sparsity structure, subject to their amenability to rank compression. This entirely algebraic property gives the method robustness on problems that are challenging for iterative methods, while still maintaining asymptotic efficiency.
Two key features of the algorithm from a computational perspective are the simplicity of its parallelization and the regularity of its communication patterns in a distributed memory environment. The communication pattern is well-established beforehand and it is based on recursive bisection, as opposed to nested dissection with different block sizes at different levels of the factorizations. The amount of inter-node concurrency is proportional to the size of the blocks and it fits readily into a distributed-memory parallel environment. The algorithm also exhibits substantial intra-node concurrency, both in processing multiple blocks and within its hierarchical operations on individual blocks, which fits the multi-core architecture of modern supercomputers.
We demonstrate that our implementation is well suited for modern parallel multi-core systems and scalable in a distributed-memory environment. We also compare our implementation against other state-of-the-art direct solvers over a relevant class of problems and show competitive time to solution and memory requirements.

Preliminaries
In this section, we review the building blocks of the proposed solver, namely hierarchical low-rank approximations and the cyclic reduction algorithm.

Hierarchical matrices
A hierarchical matrix is a data-sparse representation that enables fast linear algebraic operations by using a hierarchy of off-diagonal blocks, each represented by a low-rank approximation, that can be tuned to guarantee an arbitrary precision. The approximation, sometimes referred to as compression, is performed via singular value decomposition, or with a related method that delivers a low-rank approximation with fewer arithmetic operations than the traditional SVD method. For the representation to be effective in terms of arithmetic operations and memory requirements, numerical ranks significantly smaller than the sizes of the various matrix blocks are required.
There are several hierarchical and non-hierarchical low-rank approximation formats available in the literature. In this work, we consider the H-matrix format introduced by Hackbusch et al. in [30]. Being modular by design, ACR is not limited to the H-format. In fact, the use of the H 2 -format would immediately translate to an additional reduction of one logarithmic factor in terms of arithmetic and memory complexity estimates, from Opk N log N plog N`k 2 qq to Opk N log N q in terms of operations, and Opk 2 N log N q to Opk N q in terms of memory requirements, however, we require a complete set of hierarchical matrix operations and fast construction, which at the time of this publication is still ongoing work within our group. Our implementation uses the H-format arithmetic and its arithmetics operations provided by the HLibPro library. We refer to the reader to [31,32] for a discussion of the sharedmemory scalability of these hierarchical matrix operations, their relative costs, and their performance on modern manycore architectures. HLibPro does not feature a distributed memory solver. We use its shared-memory kernels in combination with MPI to orchestrate parallel workload across nodes in a distributed memory environment as we will discuss in section 4.

H-matrix construction
The structure of a hierarchical matrix in the H format can be described by four components: an index set, a cluster tree, a block cluster tree, and the choice of an admissibility condition. The index set I " t0, 1, . . . , N´1u represents the number of degrees of freedom N . The cluster tree represents row/column groupings, and it is constructed by recursively subdividing the index set. Once the cluster tree is formed, the block cluster tree defines matrix sub-blocks over the index IˆI. Its leaves are either low-rank blocks or small dense ones. Finally, the admissibility condition determines whether a given block should be represented as a low-rank approximation or a dense block 1 .
The first step for the construction of an H-matrix is the definition of the cluster tree of unknowns. In this work, since each block row of the sparse matrix represents a plane from a three-dimensional regular discretization, we leverage the geometry information by selecting a binary space partitioning strategy to cluster the unknowns considering the two-dimensional domain representing the planes.
The next step is the definition of a block cluster tree for these two-dimensional domains, which together with the admissibility condition determines the structure of the hierarchical representation of the plane-block. We chose a standard admissibility condition, as opposed to the simpler weak admissibility condition, because it provides the flexibility of selecting a range of coarser or finer blocks, tuned by an admissibility parameter η. Weak admissibility refers to a matrix decomposition where the p1, 2q and p2, 1q blocks are single low-rank blocks and the p1, 1q and p2, 2q blocks are recursively decomposed in a similar way. On the other hand, standard admissibility allows a more refined blocking of the matrix; the η parameter appears in the inequality minpdiameterpτ q, diameterpσqq ď η¨distancepτ, σq, where τ and σ denote two geometric regions defined as the convex hulls of two separate point sets t and s (nodes in cluster tree). A matrix block A ts satisfying the previous inequality is represented in a low rank form.
The motivation for choosing a standard admissibility condition is that, by further refining the off-diagonals blocks, it is possible to achieve a similar accuracy with smaller numerical ranks, that are crucial to ensure economic memory consumption and overall high performance. The impact of the admissibility condition is illustrated in Figure 1, which depicts the H-inverse of the variable-coefficient two-dimensional Poisson operator discretized on a N " 64ˆ64 grid using a finite difference scheme. In the right panel, the use of a few small dense blocks in the off-diagonal regions allows much smaller ranks to be used in the remaining low-rank blocks, without compromising accuracy.

Ranks bin Rank frequency
Weak admissibility Standard admissibility (c) Ranks histogram with respect to admissibility condition. Figure 1: H-inverse of the 2D Poisson operator, discretized with N " 64ˆ64 grid points, using two different admissibility conditions. The number in each block is the numerical rank necessary to achieve a compression accuracy of 1E-4. The color map is determined by the ratio of the numerical rank and the size of the block, deep blue indicates an effective low-rank approximation, while red depict dense blocks. Table 1 shows the storage gains by representing the inverse of a 2D Poisson problem with an H-matrix with weak admissibility versus standard admissibility. Table 1 also shows the difference in terms of number of operations between these two structures. The cost of the H-matrix inversion requires 56C 3 sp knplog n`1q 2`1 84C sp k 3 nplog n`1q operations, where k represents the average rank of the low-rank blocks, and C sp represents the sparsity of the structure of the hierarchical matrix inverse, see [34]. Since the weak admissibility condition requires larger ranks than the standard admissibility condition, at scale, this tends to increase the memory requirements and the number of floating-point operations.  Table 1: H-inverse of the 2D Poisson operator for N " 256 2 grid points, using two different admissibility conditions. We document the memory and floating-point operations to build the H-matrix inverse with weak and standard admissibility. The weak admissibility condition tends to require large ranks, which lead to increased memory requirements and more arithmetic operations than the standard admissibility condition. (Equivalent dense storage and arithmetic operations of the inverse operation would have required 2,147 MB and 1.0E13 operations.). *As a matter of comparison we show the equivalent storage requirements of the same problem by using the nested-basis HSS format which for this particular problem has an optimal complexity.
A low-rank approximation for a given off-diagonal block can be found in a variety of ways. Several strategies, ranging from randomized algorithms to heuristics for pivoting, are available in the literature. Every block of the Hmatrix stored as a low-rank approximation has the form of an outer product AB T . The goal of efficient hierarchical matrix processing is to construct the best possible low-rank factorization as matrix operations are performed. This routine is often referred to as the compression step. For a comprehensive discussion of the construction of H-matrices and its arithmetics, we refer the reader to [34].

Cyclic reduction
This section reviews the cyclic reduction algorithm in preparation for the following section describing the accelerated cyclic reduction variant that improves its arithmetic and memory complexity growth.

Model problem
Consider the seven-point stencil finite difference discretization with Dirichlet boundary conditions of the threedimensional variable-coefficient Poisson equation on the unit cube.
This discretization leads to a block tridiagonal linear system of N " n 3 unknowns. This corresponds to a matrix A composed of 3n´2 blocks of size n 2ˆn2 .
Block cyclic reduction can be used to solve the system defined in Equation 2. The algorithm consists of two phases: elimination and back substitution.

Elimination
The first step is to rearrange the linear system via matrix permutation pP AP T qpP uq " P f . The permutation matrix P corresponds to a red/black (even/odd) ordering of the blocks. For illustration, we choose n " 8 and consider a 2ˆ2 partition of the permuted system as shown in Equations 3 and 4. Superscripts indicate step number, where at each step a Schur complementation of a permuted system is performed to reduce the number of unknowns by half. (3) » -- The Schur complement computations of the partitioned system are shown in equation 5: Since the upper-left block A 11 is block-diagonal, its inverse can be computed as the inverse of each individual block (in this case: 4 , and D p0q 6 ), in parallel. All computations for the generation of the Schur complement at step i`1, whose size is half of the step i problem, are also done at block-level granularity as show in Equation 6, which applies to j odds only. There is a slight abuse of notation in Equation 6 to handle the case of the last plane that has one neighbor, the computations involving the plane j`1 are not performed. We use a polymorphic notation for the matrix addition, matrix subtraction, matrix-matrix multiplication, matrix-vector multiplication, and matrix inversion (A`H B, A´H B, A¨H B, A¨H b, A´H), depending on whether the matrices are represented in the regular sparse format or the H-matrix format, as we will later refer back when describing the H-matrix accelerated cyclic reduction method.
This process of permuting and Schur complementation is recursive. It finishes when a single block is left, or when the remaining system is small enough to be inverted directly. Recursion is possible because the Schur complement of a tridiagonal matrix is tridiagonal. This property can be seen in the structure of the matrix at the next step shown in Equation 7 and illustrating the remaining (originally odd) unknowns after they have been renumbered sequentially.
The algorithm proceeds to apply the red/black permutation followed by a Schur complementation for two more steps to compute the last single block D p3q 0 .

Back-substitution
Once elimination is completed, the solve stage starts from the last block of unknowns, as shown in equation 8: Once the solution at the last step u p3q 0 is computed, it is propagated backward in the hierarchy of the elimination tree.
The formula to compute the solution at step q is given by u piq " pD piq q´H¨H pf piq´Epiq¨H u pi`1q´F piq¨H u pi`1q q.
This procedure continues until the solution of the entire linear system is computed. Back-substitution is much more lightweight than the elimination algorithm regarding computation and communication volume, because it communicates parts of the solution in the form of vectors, and the only matrix operation performed is a matrix-vector multiplication. For large scale problems, this makes the solve phase orders of magnitude faster than the elimination phase. As with other direct solvers, the ability to efficiently solve for a given right-hand side given a factorization motivates the use of ACR for multiple right-hand sides at a minimal cost per new forcing term.

Accelerated Cyclic Reduction
This section describes how cyclic reduction can be used in combination with hierarchical matrices to result in a variant that improves the computational complexity and memory requirements of the classical cyclic reduction method.

Block-wise H-matrix approximation
ACR approximates each D i , E i and F i block of the original block tridiagonal matrix A given in Equation 2 with a hierarchical matrix, and then proceeds with the cyclic reduction algorithm, as described in the previous section, by using hierarchical matrix arithmetics instead of the conventional dense linear algebra arithmetic operations.
In generating the structure of the hierarchical matrix representations of the blocks, we exploit the fact the domain is subdivided into n planes each consisting of n 2 grid points and block rows of the matrix are identified with the planes of the discretization grid. We consider this geometry and use a two-dimensional planar bisection clustering when constructing each H-matrix.
Cyclic reduction requires hierarchical matrix addition, subtraction, matrix-matrix multiplication, matrix-vector multiplication and matrix inversion. The relative accuracy of the approximation is specified during the compression of each block and while performing hierarchical matrix arithmetic operations. Committing to a given tolerance ensures that the numerical ranks are adjusted to preserve the specified accuracy during the elimination and solve phases. It is at the block level that the improvements in the complexity estimates take place. Table 2 summarizes the advantages of a block-wise approximation of matrix blocks with H-matrices in the computation of the inverse of a block, and its storage, as compared to their equivalent dense counterparts.

General algorithm
To simplify the exposition, we assume the size of the linear system is a power of two; the number of steps required by ACR is thus q " log N . The size of the blocks for 2D problems is n 2 .
As mentioned in Section 2.2, two procedures define cyclic reduction: elimination and back-substitution. The high-level algorithm of elimination is shown in listing 1, whereas the high-level algorithm for back-substitution is shown in listing 2. Even though Algorithms 1 and 2 show permutations and matrix operations at the level of the global system, our implementation operates at a per-block granularity, which means that permutations are part of the implementation's logic and that linear algebraic operations are performed block by block as shown in Equation 6. This is possible since cyclic reduction preserves the block tridiagonal structure during elimination. u piq " pA piq 11 q´H¨H pf piq´A piq 12¨H u pi`1q q 4: end for

Sequential complexity estimates
Every cyclic reduction step requires two matrix-matrix multiplications, one matrix inversion and one matrix addition per block being eliminated. These kernels have arithmetic complexity of Opk n log n plog n`k 2 qq operations [34]. For a problem size of N " n 3 with n " 2 q , ACR requires n{2`n{4`n{8`. . . « n steps to perform elimination. The most expensive computation in each step is the computation of an inverse of a block of size n 2ˆn2 , which in H-format has a complexity of Opk n 2 log n plog n`k 2 qq, therefore, ACR results in a Opk N log N plog N`k 2 qq overall algorithm, with Opk N log N q memory requirements. Table 3 summarizes the complexity estimates of each of the H matrix operations involved in ACR. Table 4 summarizes the complexity estimates of the classical cyclic reduction algorithms without exploitation of equal blocks versus ACR.

Operation Complexity
A`H B Opk 2 n log nq A¨H B Opk n log n plog n`k 2 qq A´H Opk n log n plog n`k 2 qq A¨H b Opk n log nq

Method Operations Memory Cyclic Reduction (CR)
OpN 2 q OpN 1.5 log N q Accelerated Cyclic Reduction (ACR) Opk N log N plog N`k 2 qq Opk N log N q Table 4: Summary of the sequential complexity estimates of the classic cyclic reduction method and the proposed variant, accelerated cyclic reduction; k represents the numerical rank of the approximation.
Because ACR effectively uses hierarchical representations only for a set of regular two-dimensional problems, the resulting constants appearing in the asymptotic complexity estimates tend to be smaller, as a virtue of lower rank requirements, and make it feasible to perform large scale computations. For instance, our limited experiments show that for the 3D Poisson problem (Table 5) ACR requires substantially lower numerical ranks than the ranks reported in the HSSMF literature [8,10].
In terms of practical usage, ACR has different concurrency properties than H-LU or multifrontal HSS, enabling different amounts of independent work to be performed. The regularity of the computational patterns of ACR is valuable in terms of the ability to efficiently use current and future hardware architectures.

Parallel accelerated cyclic reduction
This section describes how to leverage the concurrency features of the accelerated cyclic reduction method in a distributed-memory parallel environment.

Parallel implementation
The parallel ACR elimination and back-substitution algorithms are listed in Algorithms 3 and 4, respectively.

Algorithm 3 Parallel ACR elimination
1: j= Processor number 2: parallel for at all processors j, j P 0 : 2 q´1

3:
Block-wise conversion to H-matrix of A = tridiagonal(E parallel for at j odd, j P 0 : 2 q´i´1´1 12: parallel for at j, j P 0 : 2 q´i´1

5:
Compute u piq j from Equation 9 6: Communicate u piq j to processors j´1 7: Communicate u piq j to processors j´1 8: end parallel for 9: end for A number of concurrency features of the algorithms are evident. Each block row, identified by a plane in the discretization, is assigned to an MPI rank. This decomposition allows the initial conversion of each block into an H-matrix in an embarrassingly parallel manner. The q " log n levels of Schur complement computation exploit concurrent execution in two ways: • The inverse of the block A 11 of Equation 4 can be computed concurrently in a block-wise fashion since A 11 is block diagonal. This computation is embarrassingly parallel.
• Computing the Schur complement requires two matrix-matrix multiplications and one matrix addition. Since the linear system partition is formed out of matrix blocks, the computation of these block matrix-matrix multiplications and block matrix-addition can also be computed concurrently. Figure 2 depicts the concurrency through the various levels in ACR elimination. We note here that the ACR decomposition strategy bears a similarity to the slice decomposition [35], and also relate to the sweeping preconditioner strategy [36], with the key distinction being that rather than sweeping through the domain, ACR eliminates several planes at once, concurrently.

Inter-node communication
In the current implementation, each plane is assigned to an MPI rank, and multiple planes are assigned to compute nodes. Let p be the number of physical compute nodes each storing n{p planes at the beginning of the factorization. After r steps of ACR, each compute node holds n{p2 r pq planes. At level r " logpn{pq, a coarse level called the C-level, every node holds a single plane only. The remaining log p steps of ACR beyond the C-level leave some compute nodes idle as illustrated in Figure 3. Distributed-memory communication occurs just at inter-node boundaries thanks to sorting at every step of the factorization, as the computation of the Schur complement for plane P j just requires planes P j´1 and P j`1 , see Figure 3. Thus up to the C-level there are Oppq communication messages per step, each transmitting planes of size Opk n 2 log nq. Beyond the C-level, there are Opp{2`¨¨¨`1q « Oppq communications messages, adding up to a total communication volume of Opk p n 2 log n plog n p`1 qq for ACR. The communication pattern with its bottom-up binary tree structure is depicted in Fig. 4. Step 1. Blocks conversion into Step 2. Elimination Step 3. Back-substitution

Parallel time complexity
The regularity of the ACR algorithm makes it straightforward to estimate the parallel time of the factorization and assess its scalability characteristics. Consider the longest computing node which executes log n ACR steps. In the logpn{pq steps preceding the C-level, this node processes n{p2pq`n{p4pq`¨¨¨`1 block rows in sequence. Beyond the C-level, it processes a single block row in every one of the sequential log p steps. This results in an asymptotic parallel time complexity for ACR of O`kn 2 log nplog n`k 2 qpn{p`log pq˘. The sequential computational time gets reduced by the number of parallel compute nodes p, but at the expense of an additional log p factor that inhibits perfect strong scaling. Fortunately, the amount of work above the C-level that introduces this log p factor left is small and grows only as n 2 " N 2{3 .
Finally, we note that beyond the parallelism across distributed computing nodes, there is additional concurrency available at the node level. This additional level of parallelism is possible, not only because elimination and backsubstitution for multiple block rows can proceed concurrently, but also because parallel variants of the hierarchical matrix arithmetics can be used in performing operations on individual blocks. The two levels of intra-node parallelism are shown schematically in Figure 5. In practice, programming models based on tasks and directed acyclic graphs have proven to be effective to parallelize hierarchical matrix arithmetics [32,10], but the optimal allocation of the multiple cores of a compute node to either block row processing or to individual operations on single blocks requires tuning. We do not describe this aspect of the parallel implementation further here.

Distributed-memory planes
Distributed-memory communication Hierarchical matrix Figure 5: Parallel ACR elimination tree depicting two-levels of concurrency using distributed-memory parallelism to distribute concurrent work across compute nodes, and shared-memory parallelism to perform H-matrix operations within the nodes.

Numerical results
This section documents the parallel performance and scalability of ACR in a distributed-memory environment. The source code is written in C and compiled with the Intel C compiler v15. External libraries utilized in the reference implementation include HLIBpro v2.2 with Intel TBB [31,37], and the sequential version of the Intel Math Kernel Library [38]. Experiments are conducted on the Cray XC40 Shaheen supercomputer at the King Abdullah University of Science & Technology. Each node has 128GB of RAM and two Intel Haswell processors, each with 16 cores clocked at 2.3Ghz.
To provide a baseline of performance we consider the solution of the same linear systems with STRUMPACK [10] v1.0.3, the open-source implementation of the HSS-structured multifrontal solver (HSSMF) developed at the Lawrence Berkeley National Laboratory. The HSSMF method can solve a broader class of linear systems compared to ACR, but the comparison is still of interest, as STRUMPACK is among the few available implementations of distributed-memory fast direct solvers that exploit hierarchically low-rank approximations.
The tuning parameters of ACR include the choice of the leaf node size n min for the H matrices, the threshold parameter η used to decide which blocks will be approximated with a low-rank factorization, or as a dense, full-rank, block, and the accuracy of the approximation H for the construction and algebraic operations of the H matrices. The tuning parameters for STRUMPACK include how many matrices from the nested-dissection elimination tree will be approximated as HSS, which is controlled by specifying the threshold at which frontal matrices will represented as HSS matrices, the compression accuracy for the HSS matrices, and the minimum leaf size of the HSS frontal matrices. We recall here that the HSS matrix format uses the so-called weak admissibility condition, whereas ACR uses a standard admissibility condition, which does not limit the use of dense blocks exclusively at the matrix diagonal. Additionally, we also consider the algebraic multigrid (AMG) implementation of hypre [39,40]. Comparison experiments are set to deliver a solution with a relative error tolerance as ||Ax´b|| 2 {||b|| 2 « 10´2. For further comparisons, we also consider the multifrontal (MF) implementation of STRUMPACK, and our cyclic reduction (CR) implementation with dense matrix blocks.

Poisson equation
We consider a sequence of Poisson problems of up to 512 3 « 134M unknowns, which is considered very large for this type of "direct" (as opposed to iterative) methods. We feature the Poisson equation with homogeneous Dirichlet boundary conditions in the unit cube, i.e.
discretized with the 7-point finite-difference star stencil, which leads to a symmetric positive definite linear system. Although this problem can be solved with ACR, other methods such as multigrid or FFTs are ordinarily used instead; we consider it to report on a standard and well-known problem, to facilitate the exposition of ACR.   (f) Choice of H-matrix structure to represent planes. Blue indicates low-rank blocks, whereas red indicates dense blocks.    Furthermore, the discretization of the Poisson equation has all positive eigenvalues with rapid decay in off diagonal, making it also an ideal case for hierarchically low-rank approximations analysis. Figures 6a and 6b show the total time in seconds for the factorization and solve phases of ACR in a strong scaling setting; dashed lines indicate ideal scaling. Ideal scaling of the factorization stage deteriorates at large processor counts as factors such as communication volume and hardware latency begin to play a significant role; the same factors tend to dominate even more during the solve phase, being the latter a sequence of fast H-matrix-vector multiplications with limited availability of communication/computation overlap. Figures 6c and 6d depict the results of a weak scaling test for ACR with different numbers of degrees of freedom per processor, along with the ideal weak scaling reference lines depicted as dashed curves. The timings deviate from the ideal scaling due to the inherently load imbalance of the recursive bisection strategy of cyclic reduction as some processors become idle towards the end of the reduction. Communication latency further impacts the solve stage at large core counts due to the lower arithmetic intensity of this stage. Figure 6e depicts the memory requirements to store the ACR factorization, together with the expected asymptotic memory usage as OpN log N q. We stress that the maximum rank of the factored matrices varies from 5 to 10 within all the combinations of problem sizes/number of processors considered in the strong and weak scaling tests (data not shown). Figure 6f depicts the structure of the H-matrices used to represent each plane, with the choice of standard admissibility condition. Dark blue blocks denote a low ratio between the numerical rank of the approximation and the full rank of the block, whereas red block indicates non-admissible blocks stored in dense format. For visualization purposes, the figure was taken from the N " 32 3 problem, and represents the last diagonal block during the elimination phase of ACR. The prevalence of dark blue blocks indicate a good relative compression of each block, since the ratio of numerical rank of the approximation and the actual block size is very small. Most of the red blocks are clustered near the diagonal, where the smallest blocks reside. Figure 7 compares all solvers under consideration for a set of Poisson problems that range from N " 32 3 to N " 512 3 unknowns, with processor counts increased from 256 to 4,096 respectively. We document the execution parameters, obtained relative residual, and ranks of the ACR and HSSMF factorization in Tables 5 and 6. We report factorization times in Figures 7a showing that ACR can competitively tackle these problems. Similarly, the solve timings in Figure 7b show that ACR is able to solve for a given right-hand size in comparable times as the other methods under consideration. Figure 7c documents the size of the factors required by the factorizations, and it shows that the cyclic reduction method (CR) cannot solve problems as small as N " 128 3 due to memory limitations. Additionally, we report the peak memory utilization of each solver using the library PAPI v5.5 [41], which shows the largest memory usage that each solver required to produce the factorization. Also, the experiments confirm that the HSSMF method requires less memory to store its factors than the multifrontal method (MF). However, as Figure 7d shows, the HSSMF method requires higher ranks than ACR, which translated into a larger size of the factors and prohibited the execution of HSSMF for problems of N " 256 3 and above. The experiments show that ACR requires only Op1q ranks, as opposed to the Opnq rank requirements of the HSSMF factorization.
As expected for this particular problem, multigrid is the method of choice concerning performance and memory footprint for a single right-hand-side. However, for multiple right-hand-sides, the ability to reuse the factorization could give the advantage to solvers based on factorization. The factorization times for ACR and HSSMF are    comparable, with the setup stage of HSSMF being faster for smaller problems; the smaller ranks required by ACR instead lead to a faster factorization step with large problem sizes and faster time to solution.
While ACR and HSSMF solvers can deliver a more accurate solution as direct solvers (i.e. without iterative procedures), this comes at the expense of more time and memory; it is common practice that this factorization is then used as a preconditioner or passed to an iterative refinement procedure. Numerical experiments confirm that ACR could be used as a direct solver if we tune its parameters with a higher accuracy for its H-matrix representations and operations, as depicted in Figure 8, at the expense of modest rank increases, albeit with higher memory requirements and time to solution. However, as Table 7 shows, a low-accuracy factorization in combination with an iterative procedure is best to minimize the total time-to-solution.
To demonstrate the robustness of ACR and HSSMF for this problem, we fix the number of degrees of freedom at N " 128 3 and we increase the dominance of the convective term; results are reported in Figure 9. Consistently with the Poisson problem, multigrid methods remains the method of choice for diffusion dominated problems in terms of time to solution; however, when α is increased, the performance of AMG deteriorates. On the other hand, both ACR and HSSMF prove to be able to solve convection-dominated problems, with ACR being consistently faster than HSSMF particularly in the back-substitution phase. The size of the factors generated by ACR and HSSMF are comparable, with ACR using significantly smaller ranks.

Helmholtz equation
We finally consider the indefinite Helmholtz equation with Dirichlet boundary conditions on the unit cube, i.e. p∇ 2 u`κ 2 uq " 1, Ω " r0, 1s 3 ,     discretized with the 27-point trilinear finite element scheme on hexahedra. Results for ACR and HSSMF are reported in Figure 10. The parameter κ is chosen to obtain a sampling rate of approximately 12 points per wavelength, specifically κ " t16, 32, 64u respectively, corresponding to approximately 10ˆ10ˆ10 for the N " 128 3 problem. As opposed to the positive definite Helmholtz equation which models phenomena similar to diffusion, the indefinite variant, commonly denoted as the wave Helmholtz equation, has a solution that is oscillatory in nature. Multigrid methods are known to diverge without specific customizations for high-frequency Helmholtz problems, which we also confirmed via experimentation. For a detailed examination of the difficulties of solving the Helmholtz equation with classical iterative methods we refer the reader to [44]. We document the execution parameters, obtained relative residual, and ranks of the ACR and HSSMF factorization in Tables 8 and 9. Numerical experiments show that ACR features consistently lower factorization and solve times than HSSMF, as can be seen in Figure 10a and 10b. The size of the factors of ACR and HSSMF are comparable, with a slightly higher memory requirements of ACR due to performance-oriented tuning, see Figure  10c. Furthermore, as also shown in section 5.1, HSSMF required less memory than MF, and CR quickly runs out of memory for problems larger than N " 64 3 . Finally, the largest rank of ACR is consistently lower than that of HSSMF, even though both solvers require Opnq ranks, as shown in Figure 10d. Nevertheless, lower ranks lead to faster time-to-solution in favor of ACR.

Conclusions and future work
We present a novel fast direct solver, Accelerated Cyclic Reduction, for block tridiagonal linear systems which commonly arise in the discretization of elliptic operators. The elimination strategy is based on a red/black ordering of the blocks that logically divides the grid into planes, approximates matrix blocks representing these planes with H-matrices, and proceeds with elimination using hierarchical matrix operations. ACR achieves log-linear arithmetic complexity of Opk N log N plog N`k 2 qq and memory requirements of Opk N log N q by approximating each block with a hierarchical matrix whose structure is refined using a spatial partitioning of the planar grid sections, employing a strong admissibility criterion that effectively limits the ranks of individual low rank blocks in the hierarchical matrix representations, and operating with hierarchical matrix arithmetics throughout. The average rank k of the blocks inside the hierarchical matrix representations controls the accuracy of the approximation and grows only modestly with problem size. A fair agreement with the rank estimate of [6] was found for the 3D Poisson equation of Op1q (Table 5), and for the 3D Helmholtz equation Opnq (Table 8).
The concurrency features of ACR are among its most important strengths. The regularity and structure of the decompositions allow efficient load balance. These features are demonstrated in a distributed-memory environment with numerical experiments that study the strong and weak scalability of our implementation. We provide a reference for performance and memory consumption using comparisons with state-of-the-art open-source implementations of the HSS-structured multifrontal solver from the STRUMPACK library, and algebraic multigrid from hypre.
ACR, being essentially a direct solver with tunable accuracy, can tackle problems that lack definiteness, such as the indefinite high-frequency Helmholtz equation, or symmetry, such as the convection-diffusion equation. For these problems, stock versions of algebraic multigrid fail to produce convergent schemes. We demonstrated the robustness of ACR in dealing with such problems over a range of problem sizes and parameters.
While multigrid methods are generally superior for scalar problems possessing smoothness and definiteness, direct factorization methods such as ACR and HSSMF benefit where multiple right-hand sides are involved, as the time to solve per extra forcing term is orders of magnitude smaller than the factorization, which can be reused. The smaller ranks k of ACR result in solution times per new right-hand side that are smaller than those of HSSMF.
Although having the same asymptotic complexity as other solvers that use general hierarchical matrix representations in their factorizations, such as H-LU, ACR has fundamentally different algorithmic roots which enable a novel alternative for a relevant class of problems with competitive performance, increasing concurrency as the problem grows and almost optimal memory requirements. Moreover, to the best of our knowledge, this is the first distributed-memory implementation of the synergies of cyclic reduction and hierarchical matrices, which scales up to 8,192 cores for problems up to N " 512 3 degrees of freedom.
ACR has been demonstrated for a regular grid discretization, but the generalization to arbitrary grids is possible and we intend to explore it in the future. Such a generalization would require an ordering of the mesh that produces a sequence of thin elongated regions (in 2D or 3D) where every region has only two neighbors so that the block tridiagonal structure is preserved. Such an ordering might be produced via a breadth-first search traversal of the mesh as shown in Figure 11. In the unstructured case, the diagonal blocks do not necessarily have the same size, and the off-diagonal blocks might be of rectangular shape. The main algorithmic implication is that each block will now have its own hierarchical matrix structure generated from the geometry of the region it represents. Computationally however, the structure generation represents a small portion in the overall computation.
In addition, because of the tunable accuracy characteristics of ACR, there are complexity-accuracy trade-offs that would naturally lead to the development of a new scalable preconditioner which we present at [45]. Figure 11: Partitioning of an unstructured mesh that produces a block tridiagonal matrix structure, for the application of ACR.