The Sherman–Morrison–Woodbury formula for generalized linear matrix equations and applications

We discuss the use of a matrix-oriented approach for numerically solving the dense matrix equation AX + XA^T + M_1XN_1 + … + M_ℓXN_ℓ = F, with ℓ ≥ 1, and M_i, N_i, i = 1, …, ℓ, of low rank. The approach relies on the Sherman–Morrison–Woodbury formula formally defined in the vectorized form of the problem, but applied in the matrix setting. This allows one to solve medium-size dense problems with computational costs and memory requirements dramatically lower than with a Kronecker formulation. Application problems leading to medium-size equations of this form are illustrated, and the performance of the matrix-oriented method is reported. The application of the procedure as the core step in the solution of the large-scale problem is also shown. In addition, a new explicit method for linear tensor equations is proposed that uses the discussed matrix equation procedure as a key building block.


INTRODUCTION
We are interested in solving dense linear matrix equations of the form

AX + XA^T + MXM^T = F,    (1)

where A, F ∈ R^{n×n} and M is of rank s ≪ n. We assume that A has no pair of eigenvalues λ_i, λ_j such that λ_i = −λ_j, and that X can be uniquely determined. Equation (1) is a simplified version of the following more general linear equation (see, e.g., References 1, [2, ch. 12]),

AX + XA^T + ∑_{i=1}^{ℓ} M_i X N_i = F,    (2)

with M_i, N_i ∈ R^{n×n}, i = 1, …, ℓ, not necessarily symmetric, having ranks s_{M_i}, s_{N_i} ≪ n, respectively. These matrix equations are characterized by the presence of a Lyapunov operator X ↦ AX + XA^T and of extra terms ∑_{i=1}^{ℓ} M_i X N_i, which make the closed form of the solution hard to formulate solely in terms of the given coefficient matrices. In particular, even in the simplest case (1) of three terms, that is ℓ = 1, no closed form based on the Schur decomposition exists, which is instead the case for ℓ = 0. Solvability conditions associated with the relative roles of the two operators X ↦ AX + XA^T and X ↦ ∑_{i=1}^{ℓ} M_i X N_i, and geometric properties of the solutions, are studied, for example, in Reference [3, sec. 3.4]. This type of equation classically arises in Control, for instance as the Gramian equations of bilinear dynamical systems; see, for example, Reference 4. The problem has recently been attacked in the large-scale case, which provides additional challenges. 3,[5][6][7] Multiterm equations of the form (2) may also arise in discretizing elliptic equations using a matrix-oriented approach, 8,9 or in parametric and time-dependent problems. 10,11 In all these settings, the low-rank framework depends on the problem hypotheses.
For large-scale matrix equations with sparse data, a small number of methods are available in the literature, based on the efficient matrix-oriented application of standard vector iterative methods; these include, for example, References 5-7,12,13, alternating least squares, 14 and subspace projection [5, sec. 5.2]. 15 In general, these methods all rely on the possibility of determining a good low-rank approximation to the solution, a problem analyzed, for instance, in Reference 5. Methods that implicitly exploit the Kronecker form of the problem have also been analyzed, [16][17][18] and these are close to the method we are going to discuss.
We aim at analyzing a strategy for solving dense or banded small and medium size problems of type (1), and at generalizing it to (2); we make no rank assumption on the sought-after numerical solution. As we will see, the approach can then be used as a building block for large-scale projection-type methods, or for solving linear tensor equations.
The problem in (1) can be stated in a more familiar form using the Kronecker version of the problem. Given two matrices A = (A_{i,j}) and B = (B_{i,j}), the Kronecker product is defined in block form as A ⊗ B = (A_{i,j}B)_{i,j}. This matrix operator satisfies vec(AXB) = (B^T ⊗ A)vec(X), where vec(X) stacks the columns of X one below the other. Using this property in (1) with 𝒜 = A ⊗ I + I ⊗ A, f = vec(F), and M = UV^T, so that we can define 𝒰 = U ⊗ U and 𝒱 = V ⊗ V, the matrix equation can be written in vectorized form as (𝒜 + 𝒰𝒱^T)x = f, with x = vec(X) and rank(𝒰) = s^2 = rank(𝒱). Note that 𝒰 = [u_1, …, u_{s^2}] with u_t = u_k ⊗ u_i, where t = (k − 1)s + i. If s^2 is still much lower than n, the vector x can be effectively determined by means of the Sherman-Morrison-Woodbury (SMW) formula [19][20][21] as follows:

x = 𝒜^{−1}f − 𝒜^{−1}𝒰(I + 𝒱^T𝒜^{−1}𝒰)^{−1}𝒱^T𝒜^{−1}f.    (4)

Hager 22 described many properties of this formula and illustrated a variety of applications where the low-rank update naturally arises, such as statistics, matrix partitioning, networks, linear programming, discretization of partial differential equations (PDEs), and physics. In particular, this formula is a building block of many numerical linear algebra methods, such as preconditioning and quasi-Newton methods; see, for example, References 23-25, [26, sec. 6.1]. It has classically been used in bordering, partitioning, tearing, and modifications of sparse linear systems, 27 [28,29, sec. 11.6], among others. In general, the formula is quite appealing every time the modification 𝒰𝒱^T destroys some computationally convenient structure of 𝒜, which can still be exploited in the SMW formula. Classical notes of caution regarding the use of the SMW formula, however, state that the updating method (4) cannot be expected to be numerically stable in all cases. In particular, problems will arise when the initial problem is more ill-conditioned than the modified one. In other words, κ(𝒜) could be significantly larger than κ(𝒜 + 𝒰𝒱^T), hence possibly worse accuracy is known to be expected in some cases when using the SMW formula. In the following, we will assume that the method is stable, that is, all the steps associated with the application of the formula are stable. 30 Indeed, in our treatment we are mainly interested in devising an implementation of the SMW formula that takes into account the matrix equation structure, assuming the whole computation is stable.

The use of the Kronecker form within the SMW formula has a serious computational drawback, which makes the formula hard to apply. The matrix 𝒜 has very large dimensions even for moderate n, since 𝒜 is n^2 × n^2. In particular, if A is not sparse, the matrix 𝒜 is going to be rather full, and thus impossible to store. Fortunately, the solution X can be directly obtained by only relying on the original matrices A, U, V, and F in (1), by maintaining and exploiting their structure as much as possible. The SMW formula is still employed, but at the matrix level, with great advantages both in terms of memory consumption and computational costs. Possibly the first description of a general procedure that avoids the Kronecker product in the symmetric case can be found in Reference 31. There, however, there is no explicit derivation of the SMW formula associated with a possibly low-rank structure as in our setting. Nonetheless, it is acknowledged that the overall cost can be significantly lower than that of the vectorized form as long as the rank of U, V is moderate.
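Before turning to the matrix-oriented version, the following minimal MATLAB sketch records the vectorized formula (4) applied to (1); it forms 𝒜 explicitly, which is precisely the memory bottleneck discussed above, so it is viable only for small n. The data are random placeholders, and the shift in A is there only to keep the Lyapunov part nonsingular.

```matlab
% Minimal sketch of the vectorized SMW solve (4) for equation (1).
% Builds the n^2 x n^2 Kronecker matrix explicitly: small n only.
n = 50; s = 2;
A = randn(n)/sqrt(n) - 2*eye(n);      % no eigenvalue pair with lam_i = -lam_j
U = randn(n,s)/sqrt(n); V = randn(n,s)/sqrt(n); F = randn(n);
Abig = kron(eye(n), A) + kron(A, eye(n));  % Kronecker form of X -> AX + XA'
Ubig = kron(U, U);  Vbig = kron(V, V);     % factors of kron(M,M), M = U*V'
f = F(:);
w = Abig \ f;                         % step 1
P = Abig \ Ubig;                      % step 2 (s^2 solves)
H = eye(s^2) + Vbig' * P;             % step 3
g = H \ (Vbig' * w);                  % step 4
x = w - P * g;                        % step 5
X = reshape(x, n, n);
M = U * V';
norm(A*X + X*A' + M*X*M' - F, 'fro')  % residual check, ~1e-13
```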
In Reference 32 a generalization to systems of matrix equations stemming from Control applications is given. These procedures target small size problems, for which dense computations can be carried out. A later work by Damm 16 explicitly focuses on the symmetric version of (2), that is N_i = M_i^T, in a way that makes the use of the SMW formula apparent. The implementation differs from ours, and we provide a computational cost comparison showing the benefits of a fully matrix-oriented implementation. Finally, more recently, the authors of Reference 17 (Massei et al.) have developed the same implementation that we are going to discuss; however, their focus is on (very) large problems (1) with a highly structured symmetric matrix A, and the role of s is not analyzed (in fact, s = 1 is used); the computational cost of the overall method is not discussed. A similar procedure is employed in Reference 18 for preconditioning purposes, without a specialized analysis, and in Reference 33 for a related problem.
We aim to provide a detailed analysis of the matrix-oriented implementation of the SMW formula for solving dense or structured (e.g., banded A) equations of type (2), with particular attention to the role of the ranks s_i and of the number of terms. We also discuss how this implementation can be used as a building block for methods for solving more complex problems.
This methodology can be extended in a straightforward manner to the case where the Lyapunov operator X ↦ AX + XA^T is replaced by the Sylvester operator X ↦ AX + XB, with no major change in the algorithm. On the other hand, dealing with the Lyapunov operator allows us to lighten the presentation, by using a single matrix A. We refrain from repeating the same arguments for the Sylvester operator, and we directly use this variant whenever needed.
The synopsis of the article is as follows. In Section 2 we discuss the SMW formula in matrix form for solving the small and medium size linear matrix equation, and then extend it to the general problem in Section 3. Considerations on stability are briefly discussed in Section 4. In Section 5, some numerical experiments are given to illustrate the performance of our strategy on generally random data, while in Section 6 we illustrate the applicability of this strategy to moderate size matrices stemming from the discretization of certain PDEs. More general numerical linear algebra problems where this methodology can be effectively employed are discussed in subsequent sections. In particular, in Section 7 we consider using this approach for solving the reduced problem associated with a Galerkin projection method for the large-scale multiterm linear equation. Moreover, in Section 8 we illustrate the applicability of our scheme to a new solution strategy for linear tensor equations. Finally, a brief conclusion is given in Section 9.

THE SMW FORMULA IN MATRIX FORM
The standard SMW formula (4) in vectorized form is applied using the following sequence of computations (Algorithm 0):

1. Solve 𝒜w = f
2. Solve 𝒜𝒫 = 𝒰, that is, 𝒜p_j = u_j, j = 1, …, s^2
3. Compute H = I + 𝒱^T𝒫
4. Solve Hg = 𝒱^T w
5. Compute x = w − 𝒫g
We next illustrate that all computations associated with n^2-long vectors can be transformed into matrix-matrix operations of size n throughout the whole procedure. The first step satisfies

𝒜w = f  ⇔  AW + WA^T = F,

that is, a Lyapunov equation of size n × n can be solved to obtain W such that w = vec(W). A Schur decomposition-based method such as the Bartels-Stewart algorithm can be used to this end if A is dense and of moderate size. 34 Regarding step 2, for each u_j, j = 1, …, s^2, noticing that u_j = vec(u_i u_k^T) for k, i such that j = (k − 1)s + i, we have

𝒜p_j = u_j  ⇔  AP_j + P_jA^T = u_i u_k^T,  p_j = vec(P_j).    (5)

Therefore, once again a sequence of small dense Lyapunov equations can be solved. The third step determines H = I + 𝒱^T𝒫, which can equivalently be computed by using

H_{t,j} = e_t^T e_j + v_i^T P_j v_k,  t = (k − 1)s + i.

Analogously, the solution of the s^2 × s^2 linear system Hg = 𝒱^T w requires computing the right-hand side d = 𝒱^T w. This can be obtained in terms of the original data and of (5) as follows:

d_t = v_i^T W v_k,  t = (k − 1)s + i.

The final step can equivalently be obtained as X = W − ∑_{j=1}^{s^2} P_j (g)_j, although the vectorized version x = w − 𝒫g may give better efficiency.
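The correspondence between the vectorized and the matrix-oriented quantities can be verified directly; the short MATLAB check below (random placeholder data) confirms the vec identities used above.

```matlab
% Check the identities used to rewrite Algorithm 0 in matrix form.
n = 5; s = 2;
A = randn(n); F = randn(n); U = randn(n,s); V = randn(n,s);
Abig = kron(eye(n), A) + kron(A, eye(n));
w = Abig \ F(:);  W = reshape(w, n, n);
norm(A*W + W*A' - F, 'fro')                       % W solves AW + WA' = F
i = 1; k = 2; t = (k-1)*s + i;
Ubig = kron(U, U);  Vbig = kron(V, V);
norm(Ubig(:,t) - reshape(U(:,i)*U(:,k)', [], 1))  % u_t = vec(u_i u_k^T)
norm(Vbig(:,t)'*w - V(:,i)'*W*V(:,k))             % d_t = v_i^T W v_k
```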
The complete procedure is summarized in Algorithm 1.
In Algorithm 1, step 5 solves the Lyapunov equation AW + WA^T = F, step 6 solves the s^2 Lyapunov equations (5), and step 7 assembles the entries H_{t,j} = e_t^T e_j + v_i^T P_j v_k and d_t = v_i^T W v_k, with j = (k − 1)s + i, t = 1, …, s^2, and e_j the jth column of the identity matrix. Step 8 solves Hg = d for g, and step 9 forms the solution X = W − ∑_{j=1}^{s^2} P_j (g)_j.

We explicitly observe that the computations within each cycle in steps 6 and 7 are completely independent, and can be performed in parallel. In step 6, the procedure requires the solution of s^2 Lyapunov equations, together with the solution of a Lyapunov equation in step 5. This computation can be significantly accelerated by first performing a single (real) Schur decomposition A = QRQ^T of the matrix A: the right-hand side is written in terms of the Schur orthogonal basis, and only backward substitutions with quasi-triangular matrices need to be carried out to solve the Lyapunov/Sylvester equation. This step is particularly efficient whenever A is normal, since in this case the Schur decomposition leads to a diagonal matrix, so that all remaining computations can be performed elementwise. Indeed, the matrix equations to be solved in the orthogonal basis have the form

R P̃ + P̃ R^T = C̃,  with P = Q P̃ Q^T,  C̃ = Q^T C Q.

Matrix R is diagonal for A normal, and upper quasi-triangular when A is nonnormal, so different solvers are adopted. For R diagonal, it holds

P̃ = C̃ ⊘ (r 1^T + 1 r^T),  r = diag(R),

where a ⊘ b denotes element-by-element division, and 1 is the vector of all ones. When R is upper quasi-triangular, a recursive triangular solver can be used; see, for example, Reference 35 and its references*.

TABLE 1  Computational cost of Algorithm 1

Step                       Computational cost
Schur decomposition of A   25n^3
Computation of H           s^2(2n^3 + n^2) + s^4(2n^2 + n − 1) + 4n^2s + s^2 − 2ns
The main computational costs of Algorithm 1 are summarized in Table 1, giving a total of (35 + 2s^2)n^3 + (2s^4 + 5s^2 + 4s − 4)n^2 + (s^4 + s^2 − 2s)n + (2/3)s^6 floating-point operations for determining the final solution. The cost for computing the Schur decomposition of A can be significantly alleviated if A has further structure. For instance, this cost becomes O(n^2) if A is symmetric and tridiagonal. We numerically experimented with different matrix structures to show the impact of this computation on the overall procedure. Moreover, additional sparsity of some of the quantities, such as in U_i, V_i, may be exploited during the computation. These are clearly problem-dependent issues and are not taken into account in the general (worst case) cost description.
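A compact MATLAB sketch of Algorithm 1 follows. For brevity it calls lyap (Control System Toolbox) for each Lyapunov solve, rather than reusing a single Schur decomposition of A as described above; the data are random placeholders.

```matlab
% Sketch of the matrix-oriented SMW procedure (Algorithm 1).
n = 200; s = 3;
A = randn(n)/sqrt(n) - 2*eye(n);       % eigenvalues in the left half-plane
U = randn(n,s)/sqrt(n); V = randn(n,s)/sqrt(n); F = randn(n);
W = lyap(A, -F);                       % step 5: AW + WA' = F
P = cell(s^2,1); H = eye(s^2); d = zeros(s^2,1);
for k = 1:s
  for i = 1:s
    j = (k-1)*s + i;
    P{j} = lyap(A, -U(:,i)*U(:,k)');   % step 6: AP_j + P_j A' = u_i u_k'
  end
end
for k = 1:s                            % step 7: assemble H and d
  for i = 1:s
    t = (k-1)*s + i;
    d(t) = V(:,i)'*W*V(:,k);
    for j = 1:s^2
      H(t,j) = H(t,j) + V(:,i)'*P{j}*V(:,k);
    end
  end
end
g = H \ d;                             % step 8
X = W;                                 % step 9: X = W - sum_j P_j (g)_j
for j = 1:s^2, X = X - P{j}*g(j); end
M = U*V';
norm(A*X + X*A' + M*X*M' - F, 'fro')   % residual check
```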
In the actual implementation, the Lyapunov equation solution for all u_i, u_k is split into different stages, and the matrix H is generated during the solution of these Lyapunov equations in the eigenvector basis, so that the actual costs are spread through the algorithm. In Appendix A we report a typical implementation, pointing to the most expensive steps. The algorithm's leading cost is associated with the solution of s^2 Lyapunov equations with the same matrix A and different right-hand sides. For s = O(1) this cost is largely dominant, and it may be convenient to simplify the algorithm and just use the vectorized form (Algorithm 0 above), limiting the exploitation of the matrix structure in step 2 to the solution of the s^2 Lyapunov equations instead of the large system with 𝒜. Indeed, taking into account the underlying structure of 𝒰 appears to be irrelevant for s = O(1). We refer to this simplified implementation as a hybrid version. For this reason, in our experimental analysis we consider larger values of s, for which a fully matrix-oriented version becomes relevant. The same occurs when more than one low-rank term appears in (2), that is ℓ > 1, in which case the dimensions of 𝒰, 𝒱, and thus that of H, quickly increase, as will be discussed in the following section.
Remark 1. Though we mainly focus on solving (2) for dense and moderate size problems, it is of interest to linger over the large-scale case. For large A, the low rank of the right-hand side terms u_i u_k^T allows one to employ powerful recently developed iterative methods. 15 Since s^2 + 1 such equations need to be solved, however, this cost is affordable as long as s = O(1), and in this case the hybrid approach could be used. In Section 7, we explore an ad hoc large-scale approach for (2) based on the projection onto a single approximation space, which uses the SMW formula as a core solver for the occurring small size problem.
We conclude this cost analysis by reporting on the comparison with a similar implementation proposed by Damm, 16 whose algorithm is reported in Appendix B, together with the computational costs of the main floating-point operations. For a fair comparison, we only considered the operations marked with ⋆, since the remaining ones handle nongeneric situations. The leading total cost of the implementation proposed by Damm differs from ours mainly in the coefficient of the n^3 term. Figure 1 shows typical cost curves of the two approaches as s (left) and n (right) vary. The factors in n^3 are dominant for s small, thus being in favor of our implementation, whereas for large values of s the costs of the two implementations become more comparable, also due to the presence of similar factors in high powers of s.

* In our Matlab implementation we did not exploit this more efficient version; we simply used the lyap function with triangular coefficient matrices.

THE SMW FORMULA FOR THE GENERAL CASE
The strategy in Section 2 can be extended to the following nonsymmetric problem with more than three linear terms in X,

AX + XA^T + ∑_{i=1}^{ℓ} M_i X N_i = F,    (11)

with M_i = U_i^{(1)}V_i^{(1)T} and N_i^T = U_i^{(2)}V_i^{(2)T} of low rank, where we notice the change of notation for the rank, to adhere to the adopted notation for the low-rank matrices U_iV_i^T. The considered procedure is also applicable to the case of even more terms, and it is effective as long as the total rank of the U_i's and the V_i's remains moderate with respect to n. To pass to the Kronecker form of the problem, we use vec(M_iXN_i) = (N_i^T ⊗ M_i)vec(X) and define

𝒰 = [U_1^{(2)} ⊗ U_1^{(1)}, …, U_ℓ^{(2)} ⊗ U_ℓ^{(1)}],  𝒱 = [V_1^{(2)} ⊗ V_1^{(1)}, …, V_ℓ^{(2)} ⊗ V_ℓ^{(1)}],

so that (11) can be written as

(𝒜 + 𝒰𝒱^T)x = f,    (13)

where 𝒜 and f are defined as in Section 2. Thus, the problem (13) goes back to the problem (4). The matrix-oriented strategy proceeds as in Section 2, with one Lyapunov equation per column of 𝒰, and with H and the right-hand side d assembled from the corresponding low-rank factors. A Matlab 36 implementation of this algorithm is reported in Appendix A for completeness†.
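The low-rank Kronecker factorization underlying the definition of 𝒰 and 𝒱 can be checked numerically; the MATLAB snippet below (random placeholder data, factor names ours) verifies it for a single term.

```matlab
% Verify kron(N',M) = kron(U2,U1)*kron(V2,V1)' for M = U1*V1', N' = U2*V2'.
n = 6; s1 = 2; s2 = 3;
U1 = randn(n,s1); V1 = randn(n,s1);    % M   = U1*V1'
U2 = randn(n,s2); V2 = randn(n,s2);    % N^T = U2*V2'
M = U1*V1';  N = V2*U2';
norm(kron(N', M) - kron(U2,U1)*kron(V2,V1)', 'fro')   % ~1e-15
```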

CONSIDERATIONS ON STABILITY
It is well known that using the SMW formula in the vectorized case may have disastrous stability effects depending on the choices of 𝒰, 𝒱; 22 Yip 30 gives suggestions on how these two low-rank matrices could be selected to lower these effects, in case these matrices can be tuned. The stability considerations in Reference 30 focus on the conditioning of the matrix I + 𝒱^T𝒜^{−1}𝒰, which in turn also depends on the stability properties of the equation associated with 𝒜. Our computation differs from the vectorized approach in the use of Sylvester/Lyapunov matrix equations, instead of algebraic (vector) systems, to determine the quantities in (4). Higham [37, chapter 15] gives a thorough account of the accuracy and stability properties associated with the solution of linear matrix equations. Let AX + XB = C be the matrix equation to be solved, and assume that a perturbation of the data occurs, with

||ΔA||_F ≤ εα,  ||ΔB||_F ≤ εβ,  ||ΔC||_F ≤ εγ,

where || ⋅ ||_F denotes the Frobenius norm. In Reference [37, section 15.3] it is shown that the solution is perturbed by ΔX satisfying

||ΔX||_F / ||X||_F ≤ √3 Ψ ε,

and the bound is sharp as a first-order bound in ε; here, in our notation,

Ψ = || P^{−1} [α(X^T ⊗ I), β(I ⊗ X), γI] ||_2 / ||X||_F,  P = I ⊗ A + B^T ⊗ I.

The quantity Ψ plays the role of the condition number for the Sylvester equation. A weaker bound, which actually corresponds to applying standard perturbation theory to the Kronecker form of the problem, is given by Higham [37, section 15.3] as

||ΔX||_F / ||X||_F ≤ √3 Φ ε,  Φ = ||P^{−1}||_2 ((α + β)||X||_F + γ) / ||X||_F.
Hence, comparing Ψ and Φ corresponds to measuring the relative solution sensitivity with respect to the input data, associated with using either the vector or the matrix approach for the problem with 𝒜. It was shown in Reference [37, section 15.3] that Φ and Ψ "can differ by an arbitrary factor." That is, although from their definitions we have Ψ ≤ Φ and they are in general of similar magnitude, it may happen that Ψ ≪ Φ. Thus, the analysis seems to favor the matrix equation approach. An alternative analysis could be performed that makes use of the sep function. 35,37 These results provide insights into the different behavior of the two formulations under perturbations in the data. It would be interesting to analyze whether similar differences carry over to finite precision computations. A thorough analysis remains an open problem.
We next linger over the effect of computational perturbations on the final solution accuracy. Let x_f = 𝒜^{−1}f + e_f be the numerical solution obtained with the chosen method to solve 𝒜x = f, and analogously, let 𝒫_e = 𝒜^{−1}𝒰 + E_𝒰. Then the numerical solution to (𝒜 + 𝒰𝒱^T)x = f can be written as

x̂ = x_f − 𝒫_e(I + 𝒱^T𝒫_e)^{−1}𝒱^T x_f,

where we assume that sums and products among matrices and vectors in the formula are exact. Note that both ||e_f|| and ||E_𝒰|| are related to the accuracy of the method used to solve the Sylvester equation with A. Let

H_e = I + 𝒱^T𝒫_e,  H = I + 𝒱^T𝒜^{−1}𝒰.

Then, assuming for simplicity that the system with H or H_e is solved exactly, we can write

x̂ − x = (I − 𝒫_e H_e^{−1}𝒱^T)e_f − (I − 𝒜^{−1}𝒰H^{−1}𝒱^T)E_𝒰 H_e^{−1}𝒱^T𝒜^{−1}f,

so that

||x̂ − x|| ≤ (c_1 ||I − 𝒫_e H_e^{−1}𝒱^T|| + c_2 ||I − 𝒜^{−1}𝒰H^{−1}𝒱^T|| ||H_e^{−1}𝒱^T𝒜^{−1}f||) u,

where we assumed that ||e_f|| ≤ c_1 u, ||E_𝒰|| ≤ c_2 u, in which u is the machine precision and the constants c_1, c_2 depend on the data. The two matrices (I − 𝒫_e H_e^{−1}𝒱^T) and (I − 𝒜^{−1}𝒰H^{−1}𝒱^T) in the bound above are related to the quality of the data, and to the conditioning of the obtained H_e and H, respectively. Some calculations would recover a bound similar to that of Yip. 30

NUMERICAL EXPERIMENTS WITH DENSE OR STRUCTURED DATA
In this section, we illustrate the performance of Algorithm 1 when using dense and banded random data. This random setting provides a frame of reference for the computational costs associated with the methods in the worst scenario, that of dense data. All considered methods can also appropriately exploit sparsity whenever available, in different ways, and this will be explored in later sections. In all displayed tables we report the CPU time, and the relative error and residual norms

||X − X⋆||_F / ||X⋆||_F  and  ||F − AX − XA^T − ∑_{i=1}^{ℓ} M_i X N_i||_F / ||F||_F,

where X⋆ is our "true" solution, computed as a random matrix with uniform elements in the interval (0, 1), so that the right-hand side is determined explicitly: for instance, for (1) it is given as F = AX⋆ + X⋆A^T + MX⋆M^T. Computational comparisons are principally performed between the matrix-oriented and vectorized forms of the SMW formula. For the sake of completeness and as a reference target, in all our tables we also report on the performance of a direct method for explicitly solving (𝒜 + 𝒰𝒱^T)x = f (backslash in Matlab); we notice that this is a built-in (compiled) code, which is expected to be (much) faster than an interpreted code implemented within Matlab. In spite of this, our results show that the matrix-oriented method is able to outperform the direct solver in practically all cases, and in some instances with orders of magnitude better CPU times, both in the dense and sparse cases.

Example 1. Symmetric and dense matrix A for various s's.
We consider a problem of the form (2), where A is symmetric and dense, built from a full matrix A_0 with random entries uniformly distributed in (0, 1); U_1, V_1 and U_3, V_3 are also random matrices from the same distribution. Due to the symmetry of A, the eigendecomposition of A was computed once at the beginning, and the Lyapunov equations were solved as described in Section 2. The computational results for increasing n and s_1, s_3 are reported in Table 2, and show the great advantages of the matrix setting, gaining several orders of magnitude in CPU time, while maintaining at least the same accuracy as the vectorized approach.

Example 2. Symmetric and banded matrix A.
For the same matrix equation as in Example 1, we first consider A a random and symmetric pentadiagonal matrix, and U_1, V_1 and U_3, V_3 random matrices with s_1 = 3 and s_3 = 5 columns, respectively. The method performance is reported in Table 3, where "-" stands for excessive computational time. We also slightly modified the problem, by imposing that the U_i's and V_i's have orthonormal columns. This latter setting showed better accuracy of the obtained solution, as reported in the bottom rows of Table 3; this is in full agreement with the results in Reference 30.
Example 3. Random, nonsymmetric and dense matrix A. Let A, U_1, V_1 and U_3, V_3 be random matrices for the same problem as in Example 1. The numerical results are listed in Table 4. As expected, the vectorized form pays the price of a fully populated and nonsymmetric matrix. The nonsymmetry also affects the performance of the matrix method, since the full Schur form needs to be used, and Lyapunov equations with triangular matrices solved in all instances. The total costs remain very modest, though.

Example 4. Dense A and four term equations.
We consider a four-term equation of type (2), with U_i, V_i random matrices with s_i columns, and A a random matrix (all entries uniformly distributed in (0, 1)). The results are reported in Table 5, and are consistent with our previous findings, also when several terms occur.

APPLICATION TO DISCRETIZED PDES
In this section, we report on the computational solution of the small-scale problem (2) in the context of matrix-oriented discretization of PDEs, and in problems naturally described in terms of multiterm matrix equations including low-rank coefficient matrices. In all cases, the coefficient matrices are sparse, and all methods have been implemented so as to make the best use of sparsity. Once again, the solution with the (sparse) direct method is reported as a reference, taking into account that CPU time comparisons are biased by the different implementation settings (see the related discussion in Section 5).
Our first example was discussed in Reference [5, sec.6.2] in the equation solving context, and then employed for comparison purposes in Reference 7 in the large-scale case.

Example 5.
A nonlinear RC circuit. The data stem from the model reduction of a nonlinear system representing a scalable RC ladder with k resistors, with an exponential dependence between voltage and current, giving rise to a problem of type (1) of dimension n = k(k + 1), with N_1 = M_1^T and ℓ = 1. The matrix M_1 = U_1V_1^T has rank s = k. To explore how the performance depends on the rank, in our experiments we considered portions of these matrices by taking the first r_1 columns of U_1, V_1, with r_1 up to k. Our numerical results are listed in Table 6 and illustrate the effectiveness of the matrix-oriented approach with respect to the vector setting, at comparable residual accuracy.
The next examples stem from the discretization of two-dimensional convection-diffusion equations on rectangular domains, with homogeneous Dirichlet boundary conditions. The matrix-oriented discretization of this class of PDEs was treated in Reference 9 for separable coefficient functions. Here, we show that the matrix setting described in (2) can be obtained, for instance, when some of the coefficient functions have a small nonzero support in the given domain, both in the separable and nonseparable cases. To the best of our knowledge, the reported treatment of the nonseparable case is new.

Example 6. Separable coefficients. We consider the linear PDE

−Δu + φ(x, y)u_y = 1,  (x, y) ∈ Ω,    (15)

with Ω = (0, 1) × (0, 1), where the coefficient φ(x, y) = φ_1(x)φ_2(y) is supported in Ω_1, a small square centered at (x, y) = (1/2, 1/2) with 2r_s + 1 grid points on each side. Finite difference discretization of this problem 9 leads to the matrix equation

AU + UA + M_1UN_1 = F,  N_1 = BN_0,

where A = (n + 1)^2 tridiag(−1, 2, −1) ∈ R^{n×n} and B = ((n + 1)/2) tridiag(1, 0, −1) ∈ R^{n×n}, while M_1 and N_0 are diagonal matrices containing the nonzero values of the functions φ_1(x) = x + 1 and φ_2(y) = 2y + 1, respectively, for (x, y) ∈ Ω_1, and zero elsewhere; the grid nodes are ordered lexicographically. 9 Due to the small function support, the only nonzero entries are the ith ones, with n/2 − r_s ≤ i ≤ n/2 + r_s. The parameter r_s thus allows us to control the rank of the matrices M_1, N_1 = BN_0 in our numerical experiments. Note that the matrices U_1, V_1 in M_1 = U_1V_1^T can be determined at no computational cost, since U_1 can collect the nonzero columns of M_1, while V_1 collects the corresponding columns of the identity matrix. The numerical results are reported in Table 7, and show that the matrix-oriented version of the SMW formula is very attractive when compared with the vectorized one. In spite of the problem's high sparsity, the reference (compiled) direct method is not superior to the matrix-oriented method, except for the largest dimension and rank.
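A small MATLAB sketch of this construction follows; the scaling of B and the indexing of the support are as reconstructed above, so they should be taken as assumptions rather than a verbatim transcription of the original setup.

```matlab
% Sketch of the Example 6 data (assumed scalings).
n = 100; rs = 2; h = 1/(n+1);
e = ones(n,1);
A = (n+1)^2 * spdiags([-e 2*e -e], -1:1, n, n);  % 3-point Laplacian stencil
B = (n+1)/2 * spdiags([e 0*e -e], -1:1, n, n);   % centered first derivative
x = (1:n)'*h;  y = x;
idx = (round(n/2)-rs):(round(n/2)+rs);           % grid support of Omega_1
phi1 = zeros(n,1); phi1(idx) = x(idx) + 1;       % phi_1(x) = x + 1 on Omega_1
phi2 = zeros(n,1); phi2(idx) = 2*y(idx) + 1;     % phi_2(y) = 2y + 1 on Omega_1
M1 = spdiags(phi1, 0, n, n);  N0 = spdiags(phi2, 0, n, n);
% Low-rank factors at no cost: nonzero columns of M1 and matching
% columns of the identity.
I = speye(n);
U1 = M1(:, idx);  V1 = I(:, idx);
norm(M1 - U1*V1', 'fro')                         % exact factorization
```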
In the following, we consider the case where the coefficient φ(x, y) is not a separable function. For the sake of the description, we work with the linear PDE −Δu + φ(x, y)u = f. The finite difference discretization of the zero-order term φ(x, y)u(x, y) on a rectangular grid leads to an addend in the matrix equation of the form C ∘ U, where ∘ denotes the elementwise (Hadamard) product. Here, (C)_{i,j} = φ(x_i, y_j), where (x_i, y_j) are the grid nodes, hence the entries of C are nonzero only at the grid points included in the support of φ. In the vectorized case we would obtain vec(C ∘ U) = diag(c)vec(U), where the vector c would contain all values of φ(x, y) at the nodes. The discretization of the given PDE yields the matrix equation

AU + UA + C ∘ U = F.    (17)

The inclusion of the uncommon Hadamard product term makes the solution of the matrix equation more challenging; see, for example, Reference 8. Nonetheless, the high sparsity of C allows us to efficiently employ the matrix-oriented SMW formula. Matrix C can be written as C = GH^T with G, H of rank ℓ ≤ n, such that G, H maintain the sparsity of C. For instance, G can collect the nonzero columns of C, while H is a selection of the identity matrix columns corresponding to the nonzero columns of C. Let G = [g_1, …, g_ℓ] and H = [h_1, …, h_ℓ], with ℓ ≤ n. Using the following property of the Hadamard product,

(g h^T) ∘ U = diag(g) U diag(h),

we can rewrite Equation (17) as

AU + UA + ∑_{i=1}^{ℓ} diag(g_i) U diag(h_i) = F.

With this definition of G and H, the number of nonzero columns in C gives the number ℓ of summation terms. A different selection strategy for G, H can be considered if C is known to have rank lower than the number of its nonzero columns. From now on we can proceed as was done in the case of a separable function, except that ℓ ≥ 1. We observe that in our setting G, H are very sparse, so that diag(g_i) and diag(h_i) are diagonal matrices of very low rank. Therefore, setting diag(g_i) = U_{L,i}V_{L,i}^T and diag(h_i) = U_{R,i}V_{R,i}^T, we have

AU + UA + ∑_{i=1}^{ℓ} U_{L,i}V_{L,i}^T U U_{R,i}V_{R,i}^T = F.

The new notation U_{*,i}, V_{*,i} is only used here to emphasize the generality of the description. In the following we employ the derivation above to discretize a linear PDE with a convective term, hence the extra factor B will appear, as in Example 6.

Example 7. We consider the same Equation (15) as in Example 6, where this time the derivatives were discretized with a fourth-order finite difference formula, leading to a pentadiagonal structure in A and B, and we employed a different function φ(x, y). Let Ω_h be a uniform discretization of Ω = (0, 1) × (0, 1), with nodes (x_i, y_j), i, j = 1, …, n, and h = 1/(n + 1). We consider two distinct cases for φ(x, y), denoted φ_1 and φ_2 in the following. Repeating the described derivation in the presence of the Hadamard term, we obtain a linear matrix equation in which the Hadamard term is rewritten as a sum of ℓ low-rank products. We note that φ_2 yields a larger value of ℓ than φ_1, and thus a larger final rank in the SMW formula, as reported in our tables. The numerical results are listed in Table 8 (for φ_1) and in Table 9 (for φ_2). All results consistently report the good performance of the matrix-oriented version of the SMW formula with respect to the vectorized one, with comparable error and residual norms. The approach also compares rather well with the compiled sparse direct method.
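Returning to the Hadamard rewriting used above, the identity C ∘ U = ∑_i diag(g_i) U diag(h_i) for C = GH^T is easily checked in MATLAB (random placeholder data):

```matlab
% Verify (G*H') .* U == sum_i diag(G(:,i)) * U * diag(H(:,i)).
n = 8; p = 3;                     % p plays the role of ell
G = randn(n,p); H = randn(n,p); U = randn(n);
S = zeros(n);
for i = 1:p
  S = S + diag(G(:,i)) * U * diag(H(:,i));
end
norm((G*H') .* U - S, 'fro')      % ~1e-15
```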

APPLICATION TO LARGE-SCALE EQUATIONS
In many applications Equation (1) has large dimensions; typically A and M are sparse and large, while F has very low rank. In this setting, the SMW formula may be prohibitively expensive, both in vectorized and matrix form, unless ℓ, s_i = O(1), as in References 17,18. Indeed, the matrix-oriented formula, or its hybrid version (see Remark 1), requires solving at least as many Lyapunov equations as the number of columns of 𝒰 in Algorithm 0. So, for instance, if ℓ = 1 and s_1 = 15 (see Example 5), a total of s^2 = 225 large-scale Lyapunov equations with nonsymmetric right-hand side need to be solved to compute H = I + 𝒱^T𝒜^{−1}𝒰. Significantly larger dimensions of H may arise for ℓ > 1.
In this challenging context, the matrix-oriented approach may still be appealing as a core method within a projection strategy to solve (1), whenever the rank of M is modest, say up to a few tens.
The Galerkin method is a general approximation methodology widely used in several analytical and numerical procedures, such as in the variational characterization and discretization of PDEs, in algebraic linear systems and eigenvalue problems, and more recently in linear matrix equations; see Reference 15 for a recent account concerning this last context. In our setting the strategy proceeds as follows; for the sake of the presentation we restrict to the symmetric case‡ (see, e.g., Reference 15). Assume that F is symmetric and low rank, that is, F = F_1F_1^T. Given an approximation space V ⊂ R^n of dimension m_k and a matrix V_k ∈ R^{n×m_k} whose orthonormal columns span V, we are interested in an approximation to X of the form

X_k = V_k Y_k V_k^T,

where Y_k is to be determined. The matrix X_k has rank at most m_k and, under certain hypotheses on the data, it may be a good approximation to X; see, for example, Reference 5. To determine Y_k, the following matrix orthogonality (Galerkin) condition is imposed on the residual matrix S_k := AX_k + X_kA^T + MX_kM^T − F:

V_k^T S_k V_k = 0,

that is, S_k is orthogonal to the approximation space in the matrix inner product. Substituting S_k in this constraint equation and recalling that V_k^TV_k = I, we get

A_kY_k + Y_kA_k^T + M_kY_kM_k^T = F_k.    (21)

The matrices A_k = V_k^TAV_k, M_k = V_k^TMV_k, and F_k = V_k^TFV_k have dimension m_k, so that the new equation is of the same type as the original one, but of much smaller dimension, as long as m_k ≪ n. For s < m_k, where s is the rank of M, the matrix M_k can again be written as the product of two low-rank matrices§, and our method can be applied to determine the sought-after reduced solution Y_k. If the approximation is not good enough, the approximation space can be enlarged and a new reduced solution Y_k can be obtained. For the existence of Y_k the reduced matrix equation must be solvable. This can be ensured by requiring that the eigenvalues of the coefficient matrix in the Kronecker form of (21) are all contained in one half of the complex plane. This last condition is satisfied if, for instance, the field of values of the coefficient matrix in the Kronecker form of the original problem, that is of the matrix 𝒜 + 𝒰𝒱^T, is contained in one half of the complex plane. 15

To complete the algorithm, the approximation space must be selected. Given the expansion property outlined above, a natural choice is a nested space, that is, V_k ⊆ V_{k+1}, where k indicates the iteration, so that the residual matrix is enforced to lie in a smaller and smaller space as k grows. So far the procedure is very general and can be applied to a variety of linear and quadratic matrix equations. 15 However, to the best of our knowledge no Galerkin method has been designed to directly attack the specific problem (1) of large dimension. To make the Galerkin methodology practical, the space selection plays a crucial role.

‡ In the nonsymmetric case, a left and a right projection space may be required, as for the standard Sylvester equation. In case F is not low rank, low-rank approximations can be considered.
§ A rank reduction is performed in practice, if M_k is found to be low rank, so as to exploit the matrix-oriented inner solver as soon as possible.
Let M = U_1V_1^T. We propose the column orthogonalization of V_0 = [F_1, U_1] as initial basis. If U_1 is thin, then V_0 indeed has few columns. Letting V_0 = V_1R be the reduced QR factorization of V_0, the columns of V_1 span V_1. The space is expanded with the two vectors

(A − σ_kI)^{−1}v  and  (A + M)^{−1}v.    (22)

This choice is in agreement with similar selections for multiterm linear equations in the literature. 10,11 Here, after k iterations, v represents the kth vector in the already generated basis, which spans a space of dimension not smaller than k. Due again to the SMW formula, the vector (A + M)^{−1}v is a linear combination of A^{−1}v and the columns of A^{−1}U_1 (computed once and for all at the beginning of the algorithm), hence after a number of iterations corresponding to the number of columns of U_1, this second vector in (22) will no longer be needed to expand the space, and only the vector (A − σ_kI)^{−1}v will be added. Nonetheless, we found that adding this term in the first few iterations was crucial for the overall convergence. We stress that, assuming the spectrum of A lies in one half of the complex plane, the shifts σ_k are computed so as to be in the other half of the complex plane, ensuring that A − σ_kI is nonsingular. We refer the reader to Reference 15 for a general discussion of Galerkin-type procedures for linear matrix equations, of which this is a typical example, and of the shift selection. As opposed to other Galerkin projection methods for multiterm equations, here the space is mainly based on the coefficient matrix A, since the terms involving M are dealt with explicitly. The stopping criterion is based on the relative residual norm,

||S_k||_F / ||F||_F ≤ tol.

In our experiments we used tol = 10^{−6}. The residual matrix should not be explicitly computed, as it would require too many memory allocations. We thus adopted a low-rank representation of the residual, S_k = G J G^T with G tall and J small and symmetric, as is commonly done in this setting. Letting G = Q_1R_1 be the reduced QR decomposition of G, the norm of the residual S_k simplifies to ||S_k||_F = ||R_1 J R_1^T||_F. The QR decomposition is not computed from scratch at each iteration, but updated as new basis columns are included.
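A schematic MATLAB realization of this Galerkin loop is sketched below. The fixed shift σ_k = 1, the explicit residual computation, and the dense Kronecker solve of the reduced equation (21) are simplifications introduced only to keep the sketch short; the article's method uses adaptive shifts, the low-rank QR residual update, and the matrix-oriented SMW solver for (21).

```matlab
% Schematic Galerkin projection for A*X + X*A' + M*X*M' = F1*F1'.
n = 400; s = 2;
A  = -gallery('poisson', 20);           % symmetric negative definite, n = 400
U1 = randn(n, s)/n;  M = U1*U1';        % low-rank symmetric M
F1 = randn(n, 1);
Vk = orth([F1, U1]);                    % initial basis from V_0 = [F1, U1]
for it = 1:40
  Ak = Vk'*A*Vk;  Mk = Vk'*M*Vk;  fk = Vk'*F1;
  m  = size(Vk, 2);
  % reduced equation (21) via its small Kronecker form (placeholder for
  % the matrix-oriented SMW solver)
  y  = (kron(eye(m), Ak) + kron(Ak, eye(m)) + kron(Mk, Mk)) \ reshape(fk*fk', [], 1);
  Yk = reshape(y, m, m);
  Xk = Vk*Yk*Vk';
  res = norm(A*Xk + Xk*A' + M*Xk*M' - F1*F1', 'fro')/norm(F1*F1', 'fro');
  if res < 1e-6, break, end
  v  = Vk(:, end);
  w  = [(A - speye(n))\v, (A + M)\v];   % expansion vectors (22), sigma_k = 1
  Vk = orth([Vk, w]);                   % expand and reorthogonalize
end
fprintf('space dim = %d, rel. residual = %.2e\n', size(Vk,2), res)
```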
Remark 2. Although the derivation of this section is focused on the symmetric case, the procedure can be easily generalized to the form (2) with matrices of large dimensions. The reduced problem to be solved will use the general approach described in Section 3.
In the following, we report on two sets of numerical experiments. We consider symmetric matrices, for which Algorithm 1 has shown its greatest effectiveness. We also compare with a state-of-the-art method, GLEK, 7 which is appropriate for any sparse M, not necessarily of very low rank. Being based on a matrix splitting condition, it may have convergence problems on data that do not satisfy a certain matrix norm requirement; see Reference [7, th. 3.1]. GLEK was shown in Reference 7 to be competitive with respect to other methods for the same class of problems.

Example 8. We consider an academic example¶ where A corresponds to the finite difference discretization of the operator L(u) = (exp(−xy)u_x)_x + (exp(xy)u_y)_y on (0, 1) × (0, 1), and M_1 = −U_1U_1^T, with U_1 = [U_{11}; 0] and U_{11} = 𝟙 ⊗ I_4, where 𝟙 is the vector of all ones of length 10, so that U_1 has four orthogonal columns, each with a sequence of ones at different locations. Here, F_1 is a vector with random entries uniformly distributed in (0, 1). The convergence history of the Galerkin approach as the subspace dimension grows is reported in Figure 2 for n = 100^2 and n = 200^2. We notice that the convergence is barely sensitive to the problem dimension, as has been observed in other application problems for rational Krylov subspaces. More details on the performance of the method are reported in Table 10, together with a comparison with the method GLEK, 7 as the weight of the low-rank term grows. For large values of this weight the Lyapunov operator becomes less dominant, and the splitting-based method GLEK has convergence problems. The projection method converges in all cases, though its computational costs are higher# than those of GLEK, whenever the latter converges. Note also that the performance of the new solver improves as this weight grows: this seems to be due to the fact that the more dominant part of the problem is then the portion that is solved explicitly by including U_1 in the space.

Example 9. The data stem from a heat equation with Robin boundary conditions, where the parameter d weighs the boundary term, and the low-rank factor U_1 is associated with the control input, hence problem dependent. This example was also reported in Reference [7, example 5.1], with a larger rank U_1. The numerical results are listed in Table 11 for d ∈ {0.5, 0.8, 1}. As long as the Lyapunov operator remains dominant, the method GLEK is superior to the Galerkin approach in terms of CPU time, while its memory requirements are slightly higher than those of the projection method. As soon as the term involving the Robin boundary conditions becomes dominant, problems for the splitting-based method arise, as in the previous example.
We conclude with a comment on this example, illustrating Remark 1. The solution of each single Lyapunov equation AP + PA^T = u_iu_j^T for n = 40,000 and d = 0.5 requires about 3.3 s (2.6 s) to reach a relative residual norm below 10^{−7} in only 15 iterations (26 iterations), using the adaptive rational Krylov subspace method 38 (using KPIK with a prefactorized A 39). If one were to use Algorithm 1 directly on the problem, the total time to approximate each of the 100 columns of 𝒰 would be largely over 200 s. A more advanced implementation might be able to reduce these large timings and make the SMW matrix-oriented approach directly applicable to the large-scale problem for ℓ > 1.

APPLICATION TO LINEAR TENSOR EQUATIONS
Consider the tensor equation (25) in the tensor variable X, with H, H_3, M_3 nonsingular and A_1, M_1 of low rank; here, the matrix H is in bold to emphasize that it is the same matrix appearing in two different terms. This tensor equation is representative of a large class of problems that can be described by means of tensors. This is the case, for instance, for discretized three-dimensional PDEs when a tensor basis is used for the discretization; see, for example, Reference [9, section 5] and references therein. More generally, tensor equations have become a key ingredient in the numerical treatment of mathematical models dealing with uncertainty quantification, and of parameter-dependent model order reduction methodologies; the literature on these topics is quite vast.
The analysis of tensors and the development of numerical methods have recently generated a large amount of literature. Different tensor representations have been analyzed, as discussed for instance in Reference 48; we also refer the reader to the literature survey in Reference 49. In most cases, the authors have been interested in numerical methods dealing with the presence of many summands and many Kronecker products in a sparse context, leading necessarily to iterative approaches. Nonetheless, available explicit methods deriving the solution in some closed form with low memory requirements are very scarce even for few summands. Here, we aim to contribute to filling this gap; we refer to Reference 50 for another contribution towards this aim for a tensor equation with different properties.
The following result yields a new procedure for computing the solution tensor X to Equation (25). To the best of our knowledge, this is the first method that explicitly determines the solution in closed form, without using the Kronecker form of the problem.
We recall that a tensor X ∈ R^{n_1×n_2×n_3} can be written using the mode-1 unfolding X^{(1)} ∈ R^{n_1×n_2n_3} (see, e.g., Reference 48).

Example 10. We consider Equation (25) with random data (entries uniformly distributed in (0, 1)). The numerical results for increasing n and s_i, i = 1, 2, are reported in Table 12, where the cost and residual when solving with Matlab backslash are also reported. Due to the high density of the problem, the direct method could only be used for the smallest dimensional problem. On the other hand, the new approach can solve the problem in a few seconds of CPU time, even in the largest considered case.
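For reference, the mode-1 unfolding can be realized with a single reshape in MATLAB; the helper name unfold1 below is ours, not from the article.

```matlab
% Mode-1 unfolding: X1(i1, i2 + (i3-1)*n2) = X(i1, i2, i3); cf. Reference 48.
unfold1 = @(X) reshape(X, size(X,1), []);
X  = rand(3, 4, 5);
X1 = unfold1(X);          % 3 x 20
% vec(X) and vec(X1) coincide, which links the unfolded matrix
% equation to the Kronecker form of the tensor equation.
norm(X(:) - X1(:))        % = 0
```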
Under certain hypotheses on the data, the case of the tensor equation with more than three terms can be treated similarly. As an example, we report the result for the equation with four terms.

Corollary 1. Consider the tensor equation obtained from (25) by adding the term with Kronecker form A_4 ⊗ M_4 ⊗ H, and let X^{(1)} ∈ R^{n_1×n_2n_3} be its mode-1 unfolded solution. Let (H^{−1}A_3)^T = QRQ^* be the Schur decomposition of the given matrix. Then X^{(1)} = ([ẑ_1, …, ẑ_{n_1}]Q^*)^T, where ẑ_j = vec(Ẑ_j) and Ẑ_j solves a four-term matrix equation: (i) one equation for j = 1; (ii) one equation for each j = 2, …, n_1.

Proof. The proof proceeds as that of the previous proposition, with the extra term A_4 ⊗ M_4 ⊗ H carried over. ▪

Each of the two matrix equations in Corollary 1 contains four terms. However, two of these terms can be collected together under certain hypotheses on the data. For instance, if A_4 = H_3, then the two corresponding terms in the first matrix equation merge into a single Sylvester-type term.

In our final experiment the coefficient matrices are built from T = (n + 1)^2 tridiag(−1, 2, −1), again the n × n 3-point stencil discretization of the second-order operator in one dimension, with I the identity matrix of conforming size. In addition, M_1, A_1 are diagonal matrices with nonzero diagonal entries only at the grid values associated with the support Ω_1 of the two functions φ_1(x), φ_2(y), respectively. Using the formulation in (29), for j = 1 the equation for the transformed solution Z̃_1 has a structure that naturally adapts to our setting; for j > 1 the equation changes accordingly. The numerical results are reported in Table 13 and illustrate the potential of the matrix approach for solving the tensor problem, compared with the use of the sparse direct solver applied to the Kronecker version of the original tensor equation. Equation (32) could also be solved with methods other than the matrix-oriented SMW formula; comparisons similar to those reported in previous sections would arise, and we refrain from repeating that discussion. Here, we focus on the major advantages obtained by unfolding the tensor problem.

CONCLUSIONS
We have analyzed in detail a fully matrix-oriented procedure for solving general small and medium scale multiterm linear matrix equations, when some of the terms have low rank. The procedure relies on a matrix-oriented use of the classical vector SMW formula, while avoiding the generation of the large dense matrices necessary to deal with the original formula. By means of a variety of small and medium size application problems, we have shown that the matrix-oriented method can achieve order of magnitude CPU time improvements over the vectorized method, and can solve in a few seconds dense problems that cannot be treated by the vectorized SMW formula. We have shown how to employ this procedure in the large-scale case as the reduced equation solver within a Galerkin projection method. Moreover, we have illustrated that the new procedure is a crucial ingredient for a new effective method that explicitly solves a class of linear tensor equations.