Thick-restarted joint Lanczos bidiagonalization for the GSVD

The computation of the partial generalized singular value decomposition (GSVD) of large-scale matrix pairs can be approached by means of iterative methods based on expanding subspaces, particularly Krylov subspaces. We consider the joint Lanczos bidiagonalization method, and analyze the feasibility of adapting the thick restart technique that is being used successfully in the context of other linear algebra problems. Numerical experiments illustrate the effectiveness of the proposed method. We also compare the new method with an alternative solution via equivalent eigenvalue problems, considering accuracy as well as computational performance. The analysis is done using a parallel implementation in the SLEPc library.


Introduction
The generalized singular value decomposition (GSVD) of two matrices was introduced by Van Loan [21], with subsequent developments by Paige and Saunders [22]. Given two real matrices A and B with the same number of columns, the GSVD of the pair {A, B} is given by

    A = U_A C G^{-1},   B = U_B S G^{-1},   (1)

where U_A, U_B are orthogonal matrices, G is a square nonsingular matrix, and C, S are diagonal matrices in the simplest case. A more precise definition of the problem will be given in section 2.1. This factorization, as an extension of the usual singular value decomposition (SVD), is finding its way into an increasing number of applications, for instance the solution of constrained least squares problems [11] and of discrete ill-posed problems via Tikhonov regularization [20,19], as well as many other contexts [9]. The problem of computing the GSVD of a small dense matrix pair is well understood, and a robust implementation is available in LAPACK [1, §2.3.5.3]. However, the case of large sparse matrix pairs is still the subject of active research. Normally, the computation of the large-scale GSVD is addressed by means of iterative methods based on expanding subspaces. This is the case of the Jacobi-Davidson method proposed by Hochstenbach [15]. In this paper, we focus on Krylov methods, which still need to incorporate much of the knowledge and innovation that has been successfully applied in similar linear algebra problems such as the SVD. One such example is an effective restart mechanism, which is the main focus of this paper. The ultimate goal is to provide reliable software implementations of the methods that are readily available to users from the field of scientific computing or any other discipline that requires computing a GSVD. In our case, the developments presented in this paper are included in one of the solvers of SLEPc, the Scalable Library for Eigenvalue Problem Computations [13].
Zha [27] was the first to propose a Lanczos method for the GSVD. His method relies on a joint bidiagonalization of the two matrices, A and B, much as Lanczos methods for the SVD have a single bidiagonalization at their foundation. The joint bidiagonalization in Zha's method results in two small upper bidiagonal matrices. Later, Kilmer and coauthors [20] proposed a variant in which one of the two bidiagonal matrices generated by the joint bidiagonalization is lower instead of upper. The latter has since been the most popular approach, and we focus mainly on that variant for our developments.
As in any Lanczos method, a finite-precision implementation suffers from various numerical pitfalls, such as loss of orthogonality in the generated Krylov basis vectors. Jia and Li [18] have studied numerical error in the context of joint bidiagonalization for the GSVD, showing that loss of orthogonality can be prevented by simply enforcing semiorthogonality of the Lanczos vectors, e.g., with a partial reorthogonalization technique [17], as is done in Lanczos methods for other linear algebra problems. Whether a semiorthogonality scheme is pursued or full reorthogonalization is applied (as we do to avoid loss of orthogonality), a consequence is that all Lanczos vectors must be kept throughout the computation, with a consequent increase in storage and computational costs.
Thick restart is an effective mechanism to keep the size of the Krylov basis bounded, that has been applied to Lanczos methods in many different contexts such as the symmetric eigenproblem [26] or the SVD [2,14]. Compared to explicit restart, the thick restart scheme is much more effective because it compresses the currently available Krylov subspace into another Krylov subspace of smaller dimension (not a single vector), that retains the wanted spectral information while purging the components associated with unwanted eigenvalues (or singular values). The thick restart methodology can be summarized in two stages. In the first stage, one or more Lanczos recurrences are used to build one or more sets of Lanczos vectors, in a way that the small-sized problem resulting from the projection of the original problem retains the properties of the original problem (structure preservation). For instance, for symmetric-indefinite matrix pencils it is possible to employ a pseudo-Lanczos recurrence that results in a symmetric-indefinite projected problem [6]. In the second stage, the built factorization is truncated to a smaller size decomposition, in a way that it is feasible to extend it again using the same Lanczos recurrences.
In this paper, we work out the details that are needed for thick-restarting the Lanczos recurrences associated with the joint bidiagonalization for the GSVD, so that the projected problem is a small-size GSVD, and this structure is preserved whenever the restart truncates the involved decompositions and they are extended again.
The rest of the paper is organized as follows. Section 2 presents all the background material that is required for section 3, which presents the new developments related to thick restart for the GSVD. In section 4 we provide a few details of how the proposed method has been implemented in the SLEPc library. Section 5 illustrates how the solver performs when applied to several test problems. Finally, we wrap up with some concluding remarks in section 6. Throughout the paper, the presentation is done for real matrices, although the extension to the complex case is straightforward. In fact, our implemented solver supports both real and complex arithmetic.

Background
In this section, we review a number of concepts that are required for the developments of subsequent sections. Many of the concepts are also discussed for the case of the SVD, aiming at facilitating the understanding of the GSVD case, which is more involved.

The SVD and the GSVD
Recall that the (standard) SVD of a matrix A ∈ R^{m×n} is written as

    A = U Σ V^T,   (2)

where U = [u_1, ..., u_m] ∈ R^{m×m} and V = [v_1, ..., v_n] ∈ R^{n×n} are orthogonal matrices, and Σ ∈ R^{m×n} is a diagonal matrix with real non-negative diagonal entries Σ_ii = σ_i, i = 1, ..., min{m, n}.
The vectors u_i and v_i are called the left and right singular vectors, respectively, and the σ_i are the singular values. It is customary to write the decomposition so that the singular values are sorted in non-increasing order, σ_1 ≥ σ_2 ≥ ... ≥ σ_r > σ_{r+1} = ... = σ_n = 0, where r = rank(A). We can write the decomposition (2) as a sum of outer product matrices,

    A = Σ_{i=1}^{r} σ_i u_i v_i^T.   (3)

It is well known that if only k < r terms in (3) are considered, the resulting matrix is the best rank-k approximation of the matrix A in the least squares sense. This so-called truncated SVD of A is what one can usually afford to compute in the large-scale, sparse case. More generally, we will consider the case where the k < r terms taken in (3) correspond to either the largest or the smallest singular values, and we will refer to this decomposition as the partial SVD.

Now consider two matrices with the same column dimension, A ∈ R^{m×n} and B ∈ R^{p×n}. The GSVD (1) can also be written as

    A G = U_A C,   B G = U_B S,   (4)

with U_A ∈ R^{m×m} and U_B ∈ R^{p×p} orthogonal and G ∈ R^{n×n} nonsingular. For our purpose, we can think of C and S as being diagonal matrices, but in the general case this needs a detailed discussion, which we summarize below. In (1) and (4), we are assuming that the pair {A, B} is regular, which means that the matrix obtained by stacking A and B has full column rank, and hence the triangular factor of its QR decomposition,

    Z = [A^T  B^T]^T = Q R,   Q = [Q_A^T  Q_B^T]^T,   (5)

is nonsingular, where R ∈ R^{n×n} is upper triangular and Q ∈ R^{(m+p)×n} has orthonormal columns, with Q_A denoting the first m rows of Q and Q_B the last p rows. Then the structure of C and S is

    C = diag(I_q, Ĉ, O),   S = diag(O, Ŝ, I_{n−q−ℓ}),   (6)

where Ĉ and Ŝ are square diagonal matrices, I_q and I_{n−q−ℓ} are identity matrices of the indicated size, and O represents a rectangular block of zeros with the appropriate dimensions. Writing Ĉ = diag(c_{q+1}, ..., c_{q+ℓ}) with c_{q+1} ≥ ... ≥ c_{q+ℓ} > 0, and Ŝ = diag(s_{q+1}, ..., s_{q+ℓ}) with 0 < s_{q+1} ≤ ... ≤ s_{q+ℓ}, we have that c_i² + s_i² = 1, i = q+1, ..., q+ℓ, and the generalized singular values σ(A, B) are the ratios of these quantities,

    σ_i = c_i / s_i,   i = q+1, ..., q+ℓ,   (7)

together with q infinite generalized singular values and n − q − ℓ zero ones. In order to better understand the structure of C and S, it is helpful to note that the zero blocks in (6) may have zero rows or columns, and also that the number of nonzero rows of C is equal to rank(A), and similarly for S with respect to rank(B). For instance, Figure 1 shows an example of GSVD for the case m > n and p < n.

Figure 1: Scheme of the GSVD of two matrices A and B, for the case m > n and p < n.

Assuming that A and B have full column and row rank, respectively, there are q = n − p infinite generalized singular values. In order to simplify the presentation, in the sequel we will suppose that there are no infinite or zero generalized singular values, that is, q = 0 and ℓ = n. In section 3.3 we will discuss what happens when our method is applied to problems with infinite or zero σ's, such as when either A or B has fewer rows than columns.
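As a sanity check of these definitions, the following pure-Python sketch works out the c_i, s_i pairs and the ratios σ_i = c_i/s_i for a pair of diagonal matrices, the one case where the GSVD is available in closed form (the matrices and their values are ours, chosen for illustration only):

```python
import math

# For diagonal A = diag(a) and B = diag(b), each column pair (a_i, b_i) is
# normalized so that c_i^2 + s_i^2 = 1, and the generalized singular values
# are the ratios sigma_i = c_i / s_i = a_i / b_i.
a = [3.0, 2.0, 0.5]   # diagonal of A
b = [1.0, 4.0, 0.5]   # diagonal of B

c = [ai / math.hypot(ai, bi) for ai, bi in zip(a, b)]
s = [bi / math.hypot(ai, bi) for ai, bi in zip(a, b)]
sigma = [ci / si for ci, si in zip(c, s)]
```

For diagonal matrices the generalized singular values reduce to the elementwise ratios a_i/b_i, which makes the normalization property easy to verify by hand.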
As in the case of the SVD, for large-scale problems we will consider a partial GSVD, that is, we employ methods that compute approximations of k quadruples (σ i , u A i , u B i , g i ) corresponding to either the largest or the smallest generalized singular values σ i . We call u A i and u B i the left generalized singular vectors, while g i are the right generalized singular vectors. Note that if σ i are the generalized singular values of {A, B}, then σ −1 i are the generalized singular values of {B, A}.

Equivalent eigenvalue problems
The solution of the two problems presented in the previous section can be approached by formulating a related eigenvalue problem. More precisely, there are two possible strategies, that we will refer to as cross and cyclic. We start by discussing this in the context of the SVD and then extend it to the GSVD.
The SVD relation (2) can be written as AV = UΣ or as A^T U = VΣ^T. Suppose that m ≥ n; then, equating columns, we have

    A v_i = σ_i u_i,   i = 1, ..., n,   (8)
    A^T u_i = σ_i v_i,   i = 1, ..., n,   A^T u_i = 0,   i = n+1, ..., m.   (9)
Premultiplying (8) by A^T and using (9) results in the relation

    A^T A v_i = σ_i² v_i,   i = 1, ..., n,   (11)

that is, the v_i are the eigenvectors of the symmetric matrix A^T A corresponding to the eigenvalues σ_i². If the corresponding left singular vectors are also required, they can be computed as u_i = (1/σ_i) A v_i from (8). Alternatively, it is possible to compute the left vectors first, via

    A A^T u_i = σ_i² u_i,   i = 1, ..., m,   (12)

and then the right ones as v_i = (1/σ_i) A^T u_i, but care must be taken that (12) has at least m − n zero eigenvalues. In practice, one would generally use (11) if m ≥ n and (12) otherwise. We will call this approach the cross product eigenproblem.
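The cross product approach can be sketched in a few lines of pure Python; here plain power iteration stands in for the Lanczos process, and the small matrix is ours, for illustration only:

```python
import math, random

def matvec(M, x):
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]

def norm(x):
    return math.sqrt(sum(xi * xi for xi in x))

# Power iteration on A^T A: the dominant eigenpair gives sigma_1^2 and the
# right singular vector v_1; the left vector follows as u_1 = A v_1 / sigma_1.
A = [[2.0, 0.0], [1.0, 1.0], [0.0, 3.0]]            # small 3x2 example (ours)
At = [list(col) for col in zip(*A)]

random.seed(0)
v = [random.random() for _ in range(2)]
for _ in range(200):
    w = matvec(At, matvec(A, v))                    # w = A^T A v
    nw = norm(w)
    v = [wi / nw for wi in w]

sigma1 = math.sqrt(norm(matvec(At, matvec(A, v))))  # ||A^T A v_1|| = sigma_1^2
u = [avi / sigma1 for avi in matvec(A, v)]          # u_1 = A v_1 / sigma_1
back_err = norm([ti - sigma1 * vi for ti, vi in zip(matvec(At, u), v)])
```

The quantity back_err checks the second relation in (9), A^T u_1 = σ_1 v_1, which holds once the iteration has converged.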
The second strategy is the cyclic eigenproblem. Consider the symmetric matrix of order m + n

    H(A) = [ O    A ]
           [ A^T  O ],   (13)

which has eigenvalues ±σ_i, i = 1, ..., r, together with m + n − 2r zero eigenvalues, where r = rank(A). The normalized eigenvectors corresponding to ±σ_i are (1/√2) [u_i^T, ±v_i^T]^T. Hence we can extract the singular triplets (σ_i, u_i, v_i) of A directly from the eigenpairs of H(A). Note that in this case the singular values are not squared, so the computed smallest singular values will not suffer from the severe loss of accuracy of the cross product approach. The drawback in this case is that the small eigenvalues are located in the interior of the spectrum.
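The cyclic approach can be sketched similarly. Since +σ_1 and −σ_1 tie as dominant eigenvalues of H(A), the sketch below applies a positive shift to break the tie before running power iteration (the matrix and the shift value are ours):

```python
import math, random

def matvec(M, x):
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]

def norm(x):
    return math.sqrt(sum(xi * xi for xi in x))

# Build H(A) = [[0, A], [A^T, 0]] and find its largest eigenvalue, which
# should match the largest singular value sigma_1 of A.
A = [[2.0, 0.0], [1.0, 1.0], [0.0, 3.0]]
m, n = len(A), len(A[0])
H = [[0.0] * (m + n) for _ in range(m + n)]
for i in range(m):
    for j in range(n):
        H[i][m + j] = A[i][j]      # top-right block: A
        H[m + j][i] = A[i][j]      # bottom-left block: A^T

shift = 4.0                        # any positive shift breaks the +-sigma_1 tie
random.seed(1)
x = [random.random() for _ in range(m + n)]
for _ in range(400):
    y = [hi + shift * xi for hi, xi in zip(matvec(H, x), x)]
    ny = norm(y)
    x = [yi / ny for yi in y]
sigma1 = sum(xi * yi for xi, yi in zip(x, matvec(H, x)))   # Rayleigh quotient
```

The converged eigenvector splits into a top block of length m and a bottom block of length n, of equal norm, matching the (1/√2)[u_i^T, ±v_i^T]^T structure above.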
The cross and cyclic schemes can also be applied to the GSVD (1). The columns of G satisfy

    A^T A g_i = σ_i² B^T B g_i,   i = 1, ..., n,   (14)

so solving a symmetric-definite generalized eigenvalue problem for the pencil (A^T A, B^T B) provides us with the generalized singular values and right generalized singular vectors. From g_i we can compute u_A_i = (1/c_i) A g_i and u_B_i = (1/s_i) B g_i. This is the analog of the cross product eigenproblem for the SVD (11). It has the same concerns regarding the loss of accuracy in the smallest generalized singular values, but in this case one may consider computing these values as the reciprocals of the largest eigenvalues of the reversed pencil (B^T B, A^T A).
Likewise, the formulation analogous to the eigenproblem associated with the cyclic matrix (13) is to solve the symmetric-definite generalized eigenvalue problem defined by either of the matrix pencils

    ( [ O    A ]   [ I_m  O     ] )         ( [ O    B ]   [ I_p  O     ] )
    ( [ A^T  O ] , [ O    B^T B ] )   and   ( [ B^T  O ] , [ O    A^T A ] ),   (15)

of dimensions m + n and p + n, respectively. The nonzero eigenvalues of the first pencil are ±σ_i, while those of the second pencil are ±σ_i^{−1}, so the latter is likely to be preferred when the smallest generalized singular values are required. The generalized singular vectors can be obtained from the eigenvectors corresponding to the nonzero eigenvalues ±σ_i of the first pencil; see [16] for considerations on which of the two pencils is best in finite precision computations, based on the conditioning of A and B.
Note that the symmetric-definite generalized eigenproblems discussed in this section may in fact be semi-definite, e.g., if B T B is singular because p < n. This may cause numerical difficulties when solving the problem via the cross and cyclic approaches.

SVD via Lanczos bidiagonalization
In this section we give an overview of the computation of the partial SVD by means of Lanczos recurrences. Additional details can be found, e.g., in [14]. Without loss of generality, we will assume that A is tall, i.e., m ≥ n.
In the same way that the solution of the cross product eigenproblem for A^T A can be approached by first computing an orthogonal similarity transformation to tridiagonal form, bidiagonalization methods for the SVD rely on an orthogonal reduction to bidiagonal form,

    A = P_n J_n Q_n^T,   (16)

with P_n ∈ R^{m×n} and Q_n ∈ R^{n×n} having orthonormal columns and J_n ∈ R^{n×n} being upper bidiagonal, so that J_n^T J_n is tridiagonal with eigenvalues σ_i², that is, J_n has the same singular values as A.
In a Lanczos method, we compute a partial bidiagonalization instead of the full one. From (16) we can establish the two equalities A Q_n = P_n J_n and A^T P_n = Q_n J_n^T, and equating the first k < n columns we obtain the Lanczos relations

    A Q_k = P_k J_k,   (17)
    A^T P_k = Q_k J_k^T + β_k q_{k+1} e_k^T,   (18)

where J_k denotes the k × k leading principal submatrix of J_n, and we have used the notation P_k = [p_1, ..., p_k], Q_k = [q_1, ..., q_k], with α_j and β_j denoting the diagonal and superdiagonal entries of J_k. Equating the jth column of (17)-(18) gives the familiar double Lanczos recurrence that is shown in algorithmic form in Algorithm 1.

Algorithm 1: Lanczos bidiagonalization
1: Choose a unit-norm vector q_1; set p_0 = 0, β_0 = 0
2: for j = 1, 2, ..., k do
3:     p_j = A q_j − β_{j−1} p_{j−1}
4:     Normalize: α_j = ‖p_j‖_2, p_j = p_j/α_j
5:     q_{j+1} = A^T p_j − α_j q_j
6:     Normalize: β_j = ‖q_{j+1}‖_2, q_{j+1} = q_{j+1}/β_j
7: end for
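A minimal pure-Python transcription of Algorithm 1, without any reorthogonalization (the test matrix, starting vector and number of steps are ours):

```python
import math

def dot(x, y): return sum(a * b for a, b in zip(x, y))
def norm(x): return math.sqrt(dot(x, x))
def matvec(M, x): return [dot(row, x) for row in M]

A = [[2.0, 0.0, 1.0], [1.0, 1.0, 0.0], [0.0, 3.0, 1.0], [1.0, 0.0, 2.0]]
At = [list(col) for col in zip(*A)]
m, n, k = len(A), len(A[0]), 2

Q = [[1.0 / math.sqrt(n)] * n]        # unit-norm starting vector q_1
P, alpha, beta = [], [], []
for j in range(k):
    p = matvec(A, Q[j])               # line 3: p_j = A q_j - beta_{j-1} p_{j-1}
    if j > 0:
        p = [pi - beta[j - 1] * pp for pi, pp in zip(p, P[j - 1])]
    alpha.append(norm(p))             # line 4: normalize
    P.append([pi / alpha[j] for pi in p])
    q = matvec(At, P[j])              # line 5: q_{j+1} = A^T p_j - alpha_j q_j
    q = [qi - alpha[j] * qq for qi, qq in zip(q, Q[j])]
    beta.append(norm(q))              # line 6: normalize
    Q.append([qi / beta[j] for qi in q])
```

After k steps, alpha and beta hold the diagonal and superdiagonal of J_k, while P and Q hold the left and right Lanczos bases.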
From the discussion above, it is clear that Ritz approximations of the singular triplets of A can be obtained. After k Lanczos steps, the Ritz values σ̃_i (approximate singular values of A) are computed as the singular values of J_k, and the Ritz vectors are ũ_i = P_k x_i and ṽ_i = Q_k y_i, where x_i and y_i are the left and right singular vectors of J_k, respectively.

Residual and stopping criterion
As the number of Lanczos steps increases, the Ritz approximations become increasingly accurate. A criterion is required to determine when a certain approximate singular triplet (σ̃_i, ũ_i, ṽ_i) can be declared converged, and for this we need to define a residual.
Due to the equivalence discussed in section 2.2, we can define the residual in terms of an equivalent eigenproblem. In the case of the cross product eigenproblem, (11) or (12), only one of the singular vectors would appear, so it is better to define the residual vector in terms of the eigenproblem associated with the cyclic matrix (13),

    r_i = [ (A ṽ_i − σ̃_i ũ_i)^T, (A^T ũ_i − σ̃_i ṽ_i)^T ]^T,   (19)

whose norm is

    ‖r_i‖_2 = ( ‖A ṽ_i − σ̃_i ũ_i‖_2² + ‖A^T ũ_i − σ̃_i ṽ_i‖_2² )^{1/2}.   (20)

This residual norm can be used in a posteriori error bounds. In particular, there exists a singular value σ of A such that |σ − σ̃_i| ≤ ‖r_i‖_2. In the context of Lanczos, the residual (20) can be computed cheaply as follows. If the Lanczos relations (17) and (18) are multiplied on the right respectively by y_i and x_i, the right and left singular vectors of J_k, then the first term of (20) vanishes and the second one reduces to β_k (e_k^T x_i) q_{k+1}, so that

    ‖r_i‖_2 = β_k |e_k^T x_i|.   (21)

The residual estimate (21) can be used in the stopping criterion in practical implementations of Lanczos bidiagonalization.
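The identity behind the cheap estimate (21) — the residual of any coefficient vector lives entirely in the direction of q_{k+1}, scaled by β_k times its last coefficient — can be checked numerically. The following pure-Python sketch reruns a two-step bidiagonalization on a toy matrix of ours and tests the identity for an arbitrary coefficient vector x:

```python
import math

def dot(x, y): return sum(a * b for a, b in zip(x, y))
def norm(x): return math.sqrt(dot(x, x))
def matvec(M, x): return [dot(row, x) for row in M]

# Two steps of Golub-Kahan-Lanczos bidiagonalization, then verify
#   A^T P_k x - Q_k J_k^T x = beta_k (e_k^T x) q_{k+1}
# for an arbitrary x; taking x as a left singular vector of J_k then makes
# the residual norm exactly beta_k |e_k^T x|, as in (21).
A = [[2.0, 0.0, 1.0], [1.0, 1.0, 0.0], [0.0, 3.0, 1.0], [1.0, 0.0, 2.0]]
At = [list(col) for col in zip(*A)]
n, k = len(A[0]), 2
Q, P, alpha, beta = [[1.0 / math.sqrt(n)] * n], [], [], []
for j in range(k):
    p = matvec(A, Q[j])
    if j > 0:
        p = [pi - beta[j - 1] * pp for pi, pp in zip(p, P[j - 1])]
    alpha.append(norm(p))
    P.append([pi / alpha[j] for pi in p])
    q = matvec(At, P[j])
    q = [qi - alpha[j] * qq for qi, qq in zip(q, Q[j])]
    beta.append(norm(q))
    Q.append([qi / beta[j] for qi in q])

x = [0.3, 0.9]                                             # arbitrary coefficients
Px = [dot([P[0][i], P[1][i]], x) for i in range(len(A))]
Jtx = [alpha[0] * x[0], beta[0] * x[0] + alpha[1] * x[1]]  # J_k^T x for k = 2
lhs = [ai - Q[0][i] * Jtx[0] - Q[1][i] * Jtx[1]
       for i, ai in enumerate(matvec(At, Px))]
rhs = [beta[1] * x[1] * qi for qi in Q[2]]
gap = norm([a - b for a, b in zip(lhs, rhs)])
```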

Thick restart for the SVD
As in any Lanczos process run in finite precision arithmetic, loss of orthogonality among the Lanczos vectors occurs in Algorithm 1 whenever the Ritz values start to converge. The simplest cure is full reorthogonalization, which we consider in this work. This solution is implemented by replacing line 3 of Algorithm 1 with p_j = orthog(A q_j, P_{j−1}), which applies Gram-Schmidt to explicitly orthogonalize the vector A q_j against the columns of P_{j−1}, and similarly for line 5.
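A sketch of such an orthog kernel (our own minimal version, with a second Gram-Schmidt pass for robustness; the kernel actually used in SLEPc is more sophisticated):

```python
import math

def dot(x, y): return sum(a * b for a, b in zip(x, y))

def orthog(w, V):
    """Orthogonalize w against the vectors in V (two Gram-Schmidt passes),
    then normalize; returns the new vector and the norm of the
    orthogonalized vector before normalization."""
    w = list(w)
    for _ in range(2):                # "twice is enough" reorthogonalization
        for v in V:
            h = dot(w, v)
            w = [wi - h * vi for wi, vi in zip(w, v)]
    nrm = math.sqrt(dot(w, w))
    return [wi / nrm for wi in w], nrm

V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
w, nrm = orthog([1.0, 1.0, 1e-8], V)  # input lies almost entirely in span(V)
```

Even though the input vector is nearly contained in span(V), the returned vector is orthogonal to V to machine precision, which is what full reorthogonalization relies on.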
Full reorthogonalization requires keeping all previously computed (left and right) Lanczos vectors, and this justifies the need of a restart technique that limits the number of vectors, not only due to storage requirements but also because the computational cost of full reorthogonalization is proportional to the number of involved vectors.
Thick restart is very effective compared to explicit restart, because instead of explicitly building a new initial vector to rerun the algorithm, it keeps a smaller dimensional subspace that retains the most relevant spectral information computed so far. The computation is thus a sequence of subspace expansions and contractions, until the subspace contains enough converged solutions. The key is to purge unwanted components during the contraction, which is beneficial for overall convergence.
Suppose we have built the Lanczos relations (17)-(18) of order k and we want to compress them to size r < k. We start by transforming the decomposition so that the approximate singular values and the residual norms (21) appear explicitly in the equations, as follows. First, compute the SVD of the bidiagonal matrix,

    J_k = X_k Σ̃_k Y_k^T,   (22)

and substitute it into (17)-(18) to get

    A Ṽ_k = Ũ_k Σ̃_k,   A^T Ũ_k = Ṽ_k Σ̃_k + β_k q_{k+1} e_k^T X_k,   (23)

where Ũ_k = P_k X_k and Ṽ_k = Q_k Y_k, whose columns are the left and right Ritz vectors. This would be an exact partial SVD of A if it were not for the second summand of the right-hand side of (23), which is related to the residual norms of the k Ritz approximations. Equation (23) can also be written as

    A^T Ũ_k = Ṽ_k Σ̃_k + q_{k+1} [ρ_1, ρ_2, ..., ρ_k],   (24)

where ρ_i := β_k e_k^T x_i. Note that |ρ_i| is equal to the residual norm (21). Equation (24) is represented graphically in Figure 2 (step 2).
Since Σ̃_k in (22)-(23) is diagonal, it is possible to truncate this decomposition to any size r. For this, we must permute it so that the leading principal submatrix of Σ̃_k contains the approximations of the wanted singular values (typically the largest or smallest ones). Figure 2 (step 3) shows in dark gray the part of the decomposition that will be discarded. The new decomposition after truncation is

    A Ṽ_r = Ũ_r Σ̃_r,   A^T Ũ_r = Ṽ_r Σ̃_r + q_{k+1} [ρ_1, ρ_2, ..., ρ_r],

as shown in Figure 2 (step 4). It only remains to extend the decomposition by running a modified version of Algorithm 1 that starts the loop at j = r + 1. Note that the first newly generated left Lanczos vector is computed as p_{r+1} = orthog(A q_{k+1}, Ũ_r), whose orthogonalization coefficients are precisely the ρ_i's. When the algorithm stops at iteration j = k, the new J_k matrix has the shape depicted in Figure 2 (step 5): bidiagonal except for the leading part, which has an arrowhead form.

GSVD via joint Lanczos bidiagonalization
We now turn our attention to the GSVD, and consider Lanczos methods that compute a joint bidiagonalization of the two matrices, A and B. In this context, the stacked matrix Z (5) is relevant and will appear in the algorithms. Also, the matrices Q A and Q B in (5), whose row dimensions are the same as A and B, respectively, will be used for deriving the methods, but need not be formed explicitly as justified later.
The matrices C and S in (6) can be obtained from the CS decomposition of the pair {Q_A, Q_B},

    Q_A = U_A C W^T,   Q_B = U_B S W^T,   (25)

where U_A, U_B are the same as in (4), and W ∈ R^{n×n} is also orthogonal. Note that the first equation in (25) can be seen as the conventional SVD of Q_A, with the singular values sorted in non-increasing order, while the second equation is an SVD-like relation where the singular values appear in non-decreasing order, with the largest value at the bottom-right corner of S. If Z has full column rank, the GSVD of {A, B} is given by

    A = U_A C G^{−1},   B = U_B S G^{−1},   (26)

where G = R^{−1} W, with R as defined in (5). An approach to compute the CS decomposition of {Q_A, Q_B} is to perform a bidiagonalization of both matrices, i.e., a decomposition of the form

    Q_A = U J V^T,   Q_B = Û J̄ V^T,   (27)

with J, J̄ (upper or lower) bidiagonal matrices such that the stacked matrix [J^T  J̄^T]^T has orthonormal columns, that is, J^T J + J̄^T J̄ = I. Then, the problem is reduced to the CS decomposition of the bidiagonal matrices {J, J̄},

    J = X C Y^T,   J̄ = X̄ S Y^T.

Substituting into (27), we get the CS decomposition (25) with U_A = U X, U_B = Û X̄ and W = V Y. As mentioned in section 2.1, if A, B are very large and sparse, the interest is normally in computing a partial decomposition, i.e., a few (extreme) generalized singular values and vectors. In that case, a partial bidiagonalization of the matrices Q_A and Q_B is computed using Lanczos recurrences, as described next.
Zha [27] presented an algorithm for the joint bidiagonalization of Q A and Q B , in which both matrices are reduced to upper bidiagonal form without explicitly computing Q A or Q B . Later, Kilmer et al. [20] proposed a variation of the joint bidiagonalization where Q A and Q B are transformed to lower and upper bidiagonal forms, respectively. To keep the presentation short, we focus on the latter variant from now on, although our solver also includes an implementation of Zha's variant.
Although it does not form Q_A or Q_B explicitly, Kilmer's joint bidiagonalization is based on applying the lower and upper Lanczos bidiagonalization algorithms of [23] to Q_A and Q_B, respectively, yielding

    Q_A V_k = U_{k+1} J_k,   Q_A^T U_{k+1} = V_k J_k^T + α_{k+1} v_{k+1} e_{k+1}^T,   (28)
    Q_B V̄_k = Û_k J̄_k,   Q_B^T Û_k = V̄_k J̄_k^T + β̄_k v̄_{k+1} e_k^T,   (29)

with the column-orthonormal matrices

    U_{k+1} = [u_1, ..., u_{k+1}],   V_k = [v_1, ..., v_k],   Û_k = [û_1, ..., û_k],   V̄_k = [v̄_1, ..., v̄_k],

and the lower and upper bidiagonal matrices

    J_k = [ α_1                ]
          [ β_1  α_2           ]
          [      ...  ...      ]   ∈ R^{(k+1)×k},
          [      β_{k−1}  α_k  ]
          [             β_k    ]

    J̄_k = [ ᾱ_1  β̄_1               ]
          [      ᾱ_2  ...           ]   ∈ R^{k×k}.
          [           ...  β̄_{k−1}  ]
          [                ᾱ_k      ]
Note that the bottom Lanczos relations (29) are analogous to those used for the bidiagonalization in the SVD case, (17)-(18). However, the top Lanczos relations (28) differ in that the associated bidiagonal matrix J_k is lower bidiagonal with one more row than columns, and the basis of left Lanczos vectors U_{k+1} contains one more vector than in the other case. The reason is that the lower bidiagonalization algorithm in [23] starts the recurrence with the left vectors instead of the right ones. Zha's method generates two upper bidiagonal matrices by applying Algorithm 1 to both Q_A and Q_B with the same initial vector. In contrast, Kilmer's method uses the two types of bidiagonalization and connects them by using the first right Lanczos vector v_1 generated in (28) as the initial vector v̄_1 in (29).
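The requirement that the stacked bidiagonal matrices have orthonormal columns mirrors the identity Q_A^T Q_A + Q_B^T Q_B = I satisfied by the blocks Q_A, Q_B of Q. A pure-Python check with our own toy matrices, computing the thin QR of Z by modified Gram-Schmidt:

```python
import math

def dot(x, y): return sum(a * b for a, b in zip(x, y))

A = [[2.0, 0.0], [1.0, 1.0]]
B = [[0.0, 3.0], [1.0, 0.0], [1.0, 1.0]]
m, p, n = len(A), len(B), len(A[0])
Z = A + B                                # stack rows: Z is (m+p) x n

cols = [[Z[i][j] for i in range(m + p)] for j in range(n)]
Qc = []                                  # columns of Q, by modified Gram-Schmidt
for w in cols:
    w = list(w)
    for v in Qc:
        h = dot(w, v)
        w = [wi - h * vi for wi, vi in zip(w, v)]
    nrm = math.sqrt(dot(w, w))
    Qc.append([wi / nrm for wi in w])

# Q_A^T Q_A + Q_B^T Q_B, splitting each column of Q at row m
gram = [[dot(Qc[i][:m], Qc[j][:m]) + dot(Qc[i][m:], Qc[j][m:])
         for j in range(n)] for i in range(n)]
```

The matrix gram comes out as the identity, since splitting each inner product at row m simply partitions Q^T Q = I into the Q_A and Q_B contributions.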
It can be shown [27,20] that if v̄_1 = v_1, the joint bidiagonalization given by (28)-(29) verifies

    V̄_k = V_k,   (30)

that is, both recurrences generate the same right Lanczos vectors. Thus, (28)-(29) can be rewritten as

    Q_A V_k = U_{k+1} J_k,   Q_A^T U_{k+1} = V_k J_k^T + α_{k+1} v_{k+1} e_{k+1}^T,   (31)
    Q_B V_k = Û_k J̄_k,   Q_B^T Û_k = V_k J̄_k^T + β̄_k v_{k+1} e_k^T.   (32)

Taking into account that Q_A = [I_m  O] Q and Q_B = [O  I_p] Q, premultiplying the equalities on the right side of (31)-(32) by Q, and defining Ṽ_k = Q V_k, we get

    [I_m  O] Ṽ_k = U_{k+1} J_k,   Q Q^T [U_{k+1}^T  O]^T = Ṽ_k J_k^T + α_{k+1} ṽ_{k+1} e_{k+1}^T,   (33)
    [O  I_p] Ṽ_k = Û_k J̄_k,   Q Q^T [O  Û_k^T]^T = Ṽ_k J̄_k^T + β̄_k ṽ_{k+1} e_k^T.   (34)

The joint bidiagonalization process in [20] uses the two Lanczos relations in (33) and the first one in (34) to compute the matrices U_{k+1}, Û_k, Ṽ_k, J_k, J̄_k. Recall that the vectors u_j, û_j, ṽ_j have lengths m, p and m + p, respectively. Algorithm 2 gives the details of Kilmer's joint bidiagonalization, where in lines 2 and 9 expand(A, B, u_{j+1}) denotes the operation that generates new Krylov directions from the A and B matrices as follows. Each step j of the joint bidiagonalization requires computing Q Q^T ũ_{j+1}, where ũ_{j+1} = [u_{j+1}^T, 0^T]^T. Note that this is the orthogonal projection of ũ_{j+1} onto the column space of Z, which means that Q Q^T ũ_{j+1} = Z x_{j+1}, where x_{j+1} is the solution of the least squares problem

    min_x ‖Z x − ũ_{j+1}‖_2.   (35)

The expand(A, B, u_{j+1}) operation first pads u_{j+1} below with p zeros, then computes the solution of the least squares problem (35), and finally performs an additional multiplication by Z. If Z is a large sparse matrix, the least squares problem is solved by means of an iterative solver such as the LSQR algorithm [23]. Thus, the bidiagonalization process does not need to compute the QR factorization of Z.
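The expand operation can be sketched in pure Python. For this toy dense problem the least squares solve uses the normal equations, whereas a large sparse Z would call for LSQR as noted above (the matrices and the input vector are ours):

```python
import math

def dot(x, y): return sum(a * b for a, b in zip(x, y))

A = [[2.0, 0.0], [1.0, 1.0]]
B = [[0.0, 3.0], [1.0, 0.0]]
m, p, n = len(A), len(B), len(A[0])
Z = A + B                                 # stacked matrix, (m+p) x n
u = [1.0, -1.0]                           # current left Lanczos vector (ours)
u_tilde = u + [0.0] * p                   # pad with p zeros

# solve min ||Z x - u_tilde||_2 via the 2x2 normal equations (Cramer's rule)
cols = [[Z[i][j] for i in range(m + p)] for j in range(n)]
g11, g12, g22 = dot(cols[0], cols[0]), dot(cols[0], cols[1]), dot(cols[1], cols[1])
r1, r2 = dot(cols[0], u_tilde), dot(cols[1], u_tilde)
det = g11 * g22 - g12 * g12
x = [(r1 * g22 - r2 * g12) / det, (g11 * r2 - g12 * r1) / det]
proj = [dot(row, x) for row in Z]         # Z x = Q Q^T u_tilde

res = [ui - pi for ui, pi in zip(u_tilde, proj)]
orth = max(abs(dot(res, col)) for col in cols)   # residual is orthogonal to range(Z)
```

The quantity orth being at machine-precision level confirms that Z x is indeed the orthogonal projection of the padded vector onto the column space of Z, with Q never formed.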
Let the CS decomposition of the bidiagonal matrices {J_k, J̄_k} be

    J_k = X C Y^T,   J̄_k = X̄ S Y^T.   (36)

Using the CS decomposition (36), the Lanczos relations (31)-(32) become

    Q_A V_k Y = U_{k+1} X C,   (37)
    Q_B V_k Y = Û_k X̄ S.   (38)

Defining u_A_i, u_B_i and w_i as the ith columns of U_{k+1} X, Û_k X̄ and V_k Y, respectively, and c_i, s_i as the ith diagonal elements of C, S, we have

    Q_A w_i = c_i u_A_i,   Q_B w_i = s_i u_B_i,

for i = 1, 2, ..., k. Thus, u_A_i, u_B_i are approximations of the left generalized singular vectors for A and B, respectively, while the approximate right generalized singular vectors g_i = R^{−1} w_i can be computed as the solution of the least squares problem Z g_i = w̃_i, where w̃_i is the ith column of Ṽ_k Y.

Thick restart for the GSVD
With all the ingredients presented in the previous section, we are now able to adapt the thick restart technique of section 2.5 to the case of Kilmer's joint bidiagonalization for the GSVD. As was done for the SVD, the goal is to successively compress and expand the decomposition until the working Lanczos bases contain sufficiently good approximations to the wanted solutions. The compression is carried out by transforming the decomposition computed by Algorithm 2 to a form where the generalized singular values (or, more precisely, the c i and s i values) appear explicitly, and then truncating it in a way that keeps the most relevant part. This requires computing a small-sized GSVD (or CS decomposition), as in (36). Once the decomposition is truncated, it should be possible to extend it again by a slightly modified variant of Algorithm 2 whose loop starts at the current size after truncation.

Restarting Kilmer's joint bidiagonalization
Substituting (36) into (33)-(34), we have

    [I_m  O] Ṽ_k Y = U_{k+1} X C,   Q Q^T [U_{k+1}^T  O]^T X = Ṽ_k Y C^T + α_{k+1} ṽ_{k+1} e_{k+1}^T X,   (39)
    [O  I_p] Ṽ_k Y = Û_k X̄ S,   Q Q^T [O  Û_k^T]^T X̄ = Ṽ_k Y S^T + β̄_k ṽ_{k+1} e_k^T X̄.   (40)

This decomposition is similar to (37)-(38), but expressed in terms of the Ṽ_k basis instead of V_k. In order to truncate the equations above, we define Y_r and X̄_r as the matrices formed by taking the first r columns of Y and X̄, respectively, C_r and S_r as the leading principal r × r submatrices of C and S, and X_{r+1} = [x_1, x_2, ..., x_r, x_{k+1}], where x_j is column j of X. We can then write

    [I_m  O] Ṽ'_r = U'_{r+1} [C_r^T  O]^T,   Q Q^T [(U'_{r+1})^T  O]^T = Ṽ'_r [C_r  O] + ṽ_{k+1} b_{r+1}^T,   (41)
    [O  I_p] Ṽ'_r = Û'_r S_r,   Q Q^T [O  (Û'_r)^T]^T = Ṽ'_r S_r + ṽ_{k+1} b̄_r^T,   (42)

where b_{r+1} = α_{k+1} X_{r+1}^T e_{k+1} and b̄_r = β̄_k X̄_r^T e_k, while the orthonormal vector bases have been updated according to U'_{r+1} = U_{k+1} X_{r+1}, Û'_r = Û_k X̄_r and Ṽ'_r = Ṽ_k Y_r. The bidiagonalization process continues with the updated vector bases, taking ṽ_{k+1} as the next vector of the Ṽ basis. That is, Algorithm 2 is run with the loop starting at j = r + 1.
The process is illustrated graphically in Figure 3, in a similar way as was done for the SVD in Figure 2. This time, the pictures show the right Lanczos basis V k , together with both bidiagonals J k andJ k , and depict how these matrices evolve when the compression and extension are carried out. In the last panel (step 5), we can see how after restart both the upper and lower bidiagonals have the leading part with an arrowhead shape, with the spike pointing to the left in both cases.

Residual estimates
We now discuss a convergence criterion to be used in the computation of the partial GSVD with the method of section 3.1 to decide when an approximate generalized singular quadruple (σ i ,ũ A i ,ũ B i ,g i ), withσ i =c i /s i , can be considered converged. In the remainder of this section we will omit the tildes to simplify the notation. We must derive formulas for the estimates of residual norms, using the information obtained during the bidiagonalization, without expensive additional computation. The iteration stops when the number of converged solutions reaches the number requested by the user.
As in the case of the SVD (see section 2.4), there are two alternatives to define the residual associated with an approximate solution, in terms of either the cross product eigenproblem (14) or the cyclic eigenproblem (15): a cross residual r_i^CROSS (43), and cyclic residuals r_i^CYCLIC,A and r_i^CYCLIC,B (44)-(45), corresponding to the two pencils in (15). The residual norm ‖r_i^CROSS‖_2 was used by Zha [27], who showed, in the context of joint bidiagonalization with both bidiagonal matrices in upper form, an upper bound for it involving the quantity e_k^T y_i, where y_i is the ith right singular vector of the bidiagonal matrix J_k resulting from a k-step joint bidiagonalization process. In the case of joint bidiagonalization with lower-upper bidiagonal forms, the inequality must be modified slightly. The matrix R appearing in the bound is the triangular factor of the QR decomposition (5), so its 2-norm is not readily available, but it can be bounded by √(m+p) max{‖A‖_∞, ‖B‖_∞}.
The residual r_i^CROSS (43) does not account for errors in the left vectors u_A_i, u_B_i, so the residual r_i^CYCLIC,A (44) may be more appropriate. This is the residual employed in the Jacobi-Davidson method [15], and it refers to u_A_i. Similarly, the residual r_i^CYCLIC,B (45), stemming from the second cyclic eigenproblem in (15), refers to u_B_i. In the following, we define a convergence criterion that combines the two cyclic residuals so that both u_A_i and u_B_i are taken into consideration. From the definition of the GSVD (26) we have, in particular,

    A g_i = c_i u_A_i,   (47)
    B g_i = s_i u_B_i.   (48)

Using (48) in (49), and in the same way using (47) in (49), we get two relations, (50) and (51), and we define a residual r_i^GSVD that accounts for both of them, with norm ‖r_i^GSVD‖_2 given by (52). We can derive bounds for this residual from the quantities computed in the joint bidiagonalization. From (39)-(40) we obtain the relations (53). Using (53), and taking into account that c_i² + s_i² = 1, the residual norm associated with (50) can be bounded as in (54). In fact, this is a bound for s_i² ‖r_i^CYCLIC,A‖_2 for the cyclic residual (44), since the upper part of that residual is exactly zero due to the left relation in (53). Analogously, the residual associated with (51) satisfies a similar bound, related to c_i² ‖r_i^CYCLIC,B‖_2. Combining the two bounds yields an inexpensive estimate of ‖r_i^GSVD‖_2. It is interesting to see that ‖r_i^GSVD‖_2 is equal to the residual norm of (49), as follows from (47) and (48), leading to the bound (55). In summary, our criterion to accept a computed generalized singular quadruple as converged during the restart of the joint bidiagonalization is that the estimate of ‖r_i^GSVD‖_2 fall below tol, where tol is the user-defined tolerance. According to (55), this can be considered a criterion relative to the norm of the problem matrices.

Generalized singular values equal to zero or infinity
Remember from (7) that some of the generalized singular values can be 0 or ∞. These values represent a problem for the joint bidiagonalization algorithm, and should be avoided. Given the CS decomposition (25), and taking into account (6), the subset of generalized singular values corresponding to Ĉ, Ŝ satisfies

    Q_A w_i = c_i u_A_i,   Q_B w_i = s_i u_B_i,   0 < c_i, s_i < 1.   (56)

The rest of the generalized singular values are 0 or ∞. A generalized singular value equal to ∞ corresponds to vectors w_i, u_A_i satisfying

    Q_A w_i = u_A_i,   Q_B w_i = 0,   (57)

and a zero generalized singular value corresponds to vectors w_i, u_B_i satisfying

    Q_A w_i = 0,   Q_B w_i = u_B_i.   (58)

In addition, there can be left vectors u_B_i lying in the null space of Q_B^T,

    Q_B^T u_B_i = 0,   (59)

and, finally, there can be a number of left singular vectors for which

    Q_A^T u_A_i = 0.   (60)

The joint bidiagonalization algorithm should avoid converging to any of the generalized singular vectors in (57)-(60). In exact arithmetic, if the start vector u_1 of the algorithm is a linear combination of the generalized singular vectors u_A_i of (56), all the generated vectors v_i, u_i and û_i will also be combinations of the vectors w_i, u_A_i and u_B_i, respectively, of (56). Such an initial vector u_1 can be generated with the following steps:

1. Start with a random vector û_0 of p elements. That vector will be a combination of the singular vectors u_B_i of (56), (58) and (59).

2. Compute ṽ_0 = Q Q^T [0^T, û_0^T]^T, the orthogonal projection of [0^T, û_0^T]^T onto the column space of Z, by solving a least squares problem as in (35). Since ṽ_0 = Q v_0 with v_0 = Q_B^T û_0, the components corresponding to (59) are removed.
3. Compute u_1 by normalizing [I_m  O] ṽ_0. Then u_1 = Q_A v_0 and, because Q_A w_i = 0 in (58), the vector u_1 will be a linear combination only of the vectors u_A_i of (56).

Unfortunately, when working in finite precision, rounding errors may cause components of the unwanted singular vectors to appear even if a suitable start vector is used. The algorithm effectively avoids components of the generalized singular vectors from (58), related to the null space of Q_A; this comes from the fact that, according to (33), the computation of each vector ṽ_i starts by forming the vector Q Q^T [u_i^T, 0^T]^T = Q x, where x is a vector in the range space of Q_A^T, and thus orthogonal to the null space of Q_A. However, the same is not true for components of vectors from (57) or (60). In particular, if the null space of Q_B is non-trivial, such as when p < n, unwanted components of the singular vectors of (57) may appear when computing the largest generalized singular values. Similarly, if the null space of Q_A^T is non-trivial, such as when m > n, unwanted components of the singular vectors of (60) may appear when computing either the largest or the smallest generalized singular values.
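The three steps above can be sketched in pure Python (toy matrices ours; the projections are done via dense normal equations, which is fine at this size). By construction u_1 = Q_A v_0 lies in range(Q_A) = range(A), which the sketch checks by fitting u_1 against the columns of A; a generic m-vector fails that test when m > n:

```python
import math, random

def dot(x, y): return sum(a * b for a, b in zip(x, y))
def norm(x): return math.sqrt(dot(x, x))

def lstsq2(cols, rhs):
    """Solve min ||[c1 c2] x - rhs||_2 via 2x2 normal equations (Cramer)."""
    g11, g12, g22 = dot(cols[0], cols[0]), dot(cols[0], cols[1]), dot(cols[1], cols[1])
    r1, r2 = dot(cols[0], rhs), dot(cols[1], rhs)
    det = g11 * g22 - g12 * g12
    return [(r1 * g22 - r2 * g12) / det, (g11 * r2 - g12 * r1) / det]

A = [[2.0, 0.0], [1.0, 1.0], [0.0, 3.0]]        # m = 3 > n = 2
B = [[1.0, 0.0], [0.0, 2.0]]                    # p = 2
m, p, n = len(A), len(B), len(A[0])
Z = A + B
zcols = [[Z[i][j] for i in range(m + p)] for j in range(n)]

random.seed(3)
u_hat0 = [random.random() for _ in range(p)]    # step 1: random p-vector
padded = [0.0] * m + u_hat0
x = lstsq2(zcols, padded)                       # step 2: project onto range(Z)
v_tilde0 = [dot(row, x) for row in Z]           #   = Q Q^T [0; u_hat0]
top = v_tilde0[:m]                              # step 3: keep first m entries
u1 = [ti / norm(top) for ti in top]             #   and normalize -> u_1

# u_1 lies in range(A): its least squares fit against A has ~zero residual,
# whereas a generic vector such as e_1 does not (since m > n)
acols = [[A[i][j] for i in range(m)] for j in range(n)]
y = lstsq2(acols, u1)
fit_res = norm([u1i - dot(row, y) for u1i, row in zip(u1, A)])
e1 = [1.0, 0.0, 0.0]
y2 = lstsq2(acols, e1)
gen_res = norm([ei - dot(row, y2) for ei, row in zip(e1, A)])
```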

Using a scale factor
During the development of the solver, we have noticed that scaling one of the matrices may result in a significant improvement of convergence. When the user specifies a scale factor γ, the method is applied to the pair {A, γB} instead of {A, B}. As a consequence, the algorithm works with a modified coefficient matrix for the least squares problem (61), in which Q_{A,γ} and Q_{B,γ} are different from those obtained without scaling. The results of section 5 will illustrate how large the impact of scaling on performance can be. An analysis of why this is the case is outside the scope of this paper. If we use (c_i, s_i, u_Ai, u_Bi, g_i) to denote the solution for the pair {A, B} as in (4), we can write the relations for the scaled problem in terms of certain weights ω_i, so after solving (61) we can retrieve the solution of the original problem (4) by multiplying the generalized singular values σ_i by γ and also multiplying the g_i vectors by ω_i^{-1}, where ω_i can be obtained from a relation involving c_i, s_i, corresponding to the original problem, which in turn can be computed from σ_i.
When computing the largest generalized singular values, a good choice for the scale factor is a value similar to σ_1, but of course this value is not known a priori. In cases where no good initial guess of σ_1 is available, we have devised a dynamic scaling mechanism: the user specifies a threshold γ_0 so that, at the time of restart, the solver checks whether σ̃_1 > γ_0, where σ̃_1 is the current approximation to the largest generalized singular value; if this condition is met, the factorization is thrown away and restarted from scratch using the scale factor γ = σ̃_1. This scale factor is cumulative over several restarts in case γ_0 is small, and still the wasted computation usually pays off compared to a run without scaling.
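The effect of the scale factor on the computed values can be illustrated with a small dense sketch. As a stand-in for the solver, the generalized singular values are obtained here through the cross-product pencil (A^T A, B^T B), a simplification valid when B^T B is positive definite; for the pair {A, γB} the eigenvalues become σ²/γ², so multiplying the computed values by γ recovers those of the original pair, as stated above.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
m, p, n = 12, 10, 8
A = rng.standard_normal((m, n))
B = rng.standard_normal((p, n))

def gsv_cross(A, B):
    # generalized singular values via the cross-product generalized
    # eigenproblem A^T A g = sigma^2 B^T B g (B^T B positive definite here)
    lam = eigh(A.T @ A, B.T @ B, eigvals_only=True)
    return np.sqrt(np.maximum(lam, 0.0))

gamma = 100.0
sigma = gsv_cross(A, B)              # original pair {A, B}
sigma_g = gsv_cross(A, gamma * B)    # scaled pair {A, gamma*B}
sigma_rec = gamma * sigma_g          # multiply back by gamma to recover
```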

One-sided orthogonalization
When the thick restart technique was introduced for the SVD in section 2.5, we pointed out that the loss of orthogonality among Lanczos vectors can be avoided by means of full orthogonalization, that is, enforcing the orthogonality of both the left and right Lanczos bases via explicit orthogonalization. In practice, lines 3 and 5 of Algorithm 1 are replaced with p_j = orthog(Aq_j, P_{j-1}) and q_{j+1} = orthog(A^T p_j, Q_j). However, Simon and Zha [25] showed that it is not necessary to explicitly orthogonalize both bases: it is enough to orthogonalize one of them, since the levels of orthogonality of the left and right Lanczos vectors go hand in hand. This technique is called one-sided orthogonalization, and it is implemented in SLEPc's thick-restart Lanczos solver for the SVD [14], reducing the cost of each iteration significantly.
The one-sided orthogonalization technique can also be applied to the GSVD case. In Kilmer's joint bidiagonalization (Algorithm 2), lines 5, 7, and 9 involve a full orthogonalization that would be implemented with a call to orthog(). However, two of these three orthogonalizations can be avoided, that is, instead of explicitly orthogonalizing against the full basis, only a local orthogonalization with respect to the previous vector is performed. In particular, we have found that it is enough to explicitly orthogonalize the U_{k+1} basis (line 7). This saves about two thirds of the orthogonalization cost, as will be illustrated in the computational results of section 5.
One-sided orthogonalization in the context of joint Lanczos bidiagonalization is discussed in [18], where the authors show that it is sufficient to explicitly orthogonalize U_{k+1} and V_k, provided that the bidiagonal matrix J_k does not become too ill-conditioned. Note that our one-sided orthogonalization strategy is more aggressive, as it orthogonalizes just one basis instead of two. That is why in our solver one-sided orthogonalization is an option that the user can activate, whereas by default full orthogonalization of the three bases is carried out. However, in our tests we have not found any situation where the one-sided variant leads to large residual norms or a poor level of orthogonality of the computed bases.
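The effect exploited by one-sided orthogonalization can be reproduced on the plain Golub-Kahan bidiagonalization of the SVD setting (Algorithm 1, not the joint GSVD variant). The sketch below reorthogonalizes only the right basis Q and then measures the orthogonality of both bases; matrix dimensions, the number of steps, and the well-conditioned random data are arbitrary choices for illustration.

```python
import numpy as np

def gk_bidiag_one_sided(A, k, seed=0):
    """Golub-Kahan bidiagonalization in which only the right Lanczos basis Q
    is explicitly reorthogonalized; the left basis P receives just the local
    three-term recurrence update."""
    m, n = A.shape
    rng = np.random.default_rng(seed)
    q = rng.standard_normal(n)
    q /= np.linalg.norm(q)
    Q, P = [q], []
    beta = 0.0
    for _ in range(k):
        p = A @ Q[-1] - (beta * P[-1] if P else 0.0)
        alpha = np.linalg.norm(p)
        P.append(p / alpha)
        q = A.T @ P[-1] - alpha * Q[-1]
        for qi in Q:                     # explicit orthogonalization, Q side only
            q -= (qi @ q) * qi
        beta = np.linalg.norm(q)
        Q.append(q / beta)
    return np.column_stack(P), np.column_stack(Q)

A = np.random.default_rng(1).standard_normal((200, 150))
P, Q = gk_bidiag_one_sided(A, 30)
errP = np.linalg.norm(P.T @ P - np.eye(30))   # left basis, not reorthogonalized
errQ = np.linalg.norm(Q.T @ Q - np.eye(31))   # right basis, reorthogonalized
```

In line with the observation of Simon and Zha, the left basis remains orthogonal to working accuracy even though only the right basis was explicitly orthogonalized.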

Details of the implementation in SLEPc
We now discuss a few relevant details of our implementation, with which the results shown in section 5 have been obtained. The implementation is included in one of the solvers of SLEPc, the Scalable Library for Eigenvalue Problem Computations [13]. We first give an overview of this library, before going into the details.
SLEPc is a parallel library intended for solving large-scale eigenvalue problems, mainly by iterative methods, that is, in cases where only part of the spectrum is required. It consists of several modules, each one for a different type of problem, including linear eigenproblems, polynomial eigenproblems, general nonlinear eigenproblems, and singular-value-type decompositions. The latter module is called SVD and is the one relevant for this work.
SLEPc can be seen as an extension of PETSc, the Portable, Extensible Toolkit for Scientific Computation [4]. PETSc provides a lot of functionality related to data structures and solvers useful for large-scale scientific applications modeled by partial differential equations. Apart from the basic data objects for working with matrices and vectors, the only PETSc modules that are relevant for this work are those implementing iterative methods for linear systems of equations (KSP) and preconditioners (PC).
The model of parallelism of PETSc/SLEPc is message-passing (MPI), where every object is created with respect to an MPI communicator, implying that certain operations are carried out collectively by all processes belonging to that communicator. A second level of parallelism is also available to further accelerate the local computations either via CPU threads or by launching kernels to a GPU, but we do not consider them in this paper.
The solver modules are essentially collections of solvers with a common user interface, where the user can choose the solver to be employed. This selection can be done at run time via command-line options, which confers a lot of flexibility, since it also allows specifying solver parameters such as the dimension of the subspace, the tolerance, and many more. The SVD module contains a number of solvers, from which we highlight the following:
• cross. This solver originally contained a gateway for computing the SVD via the linear eigenvalue problem module, in particular with the equivalent cross product matrix eigenproblem, (11) or (12). In this work, we have extended it for the GSVD via the equivalent eigenproblem (14).
• cyclic. Similarly to the previous one, this solver had code for the SVD formulated via the cyclic eigenproblem (13), and we have extended it for the GSVD using the equivalent cyclic eigenproblems (15).
• trlanczos. This is the thick-restart Lanczos solver. The implementation for the SVD is described in [24]. The extension to the GSVD, following the methodology detailed in section 3, constitutes the main contribution of this paper.
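As a quick sanity check of the cross route, the generalized singular values of a pair {A, B} are the square roots of the eigenvalues of the pencil (A^T A, B^T B) whenever B^T B is positive definite; in particular, for B = I they reduce to the ordinary singular values of A. A minimal dense sketch of this fact (with scipy, rather than the sparse large-scale setting of the solver):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(2)
A = rng.standard_normal((15, 10))
B = np.eye(10)

# cross-product pencil: A^T A g = sigma^2 B^T B g
lam = eigh(A.T @ A, B.T @ B, eigvals_only=True)
gsv = np.sqrt(np.maximum(lam, 0.0))[::-1]     # sort descending

sv = np.linalg.svd(A, compute_uv=False)       # ordinary singular values of A
```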
At the core of the thick-restart Lanczos solver is the lower-upper joint bidiagonalization of Algorithm 2. One of the main operations in this algorithm is the orthogonalization of vectors, which in SLEPc is carried out with an iterated classical Gram-Schmidt variant that is both efficient in parallel and numerically stable [12]. Another notable operation is the expansion of the V Lanczos basis, which implies the solution of the least squares problem (35). This is done using PETSc's functionality, as described next.
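The orthogonalization kernel just mentioned can be sketched as follows. Classical Gram-Schmidt amounts to two tall matrix-vector products, which parallelize well, and a second pass restores the numerical stability that a single pass lacks. The fixed two-pass variant below is a simplification of SLEPc's actual implementation, which decides on the second pass with a criterion [12].

```python
import numpy as np

def orthog_cgs2(Q, v):
    """Orthogonalize v against the orthonormal columns of Q with iterated
    (two-pass) classical Gram-Schmidt, then normalize."""
    for _ in range(2):            # "twice is enough" reorthogonalization
        v = v - Q @ (Q.T @ v)     # one CGS pass: two matrix-vector products
    nrm = np.linalg.norm(v)
    return v / nrm, nrm

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((500, 50)))
q, _ = orthog_cgs2(Q, rng.standard_normal(500))
err = np.linalg.norm(Q.T @ q)     # orthogonality level of the result
```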
PETSc allows combining iterative methods from KSP with preconditioners from PC, for instance gmres and jacobi. Direct linear solvers can be used with a special KSP type that means precondition only, i.e., preonly combined with, e.g., lu. Only a limited number of KSP and PC instances can handle rectangular matrices, so the possibilities for least squares problems are:
• A purely iterative method, i.e., without preconditioning, via the KSP solver lsqr that implements the LSQR method [23], or cgls, which is equivalent to applying the conjugate gradient method to the normal equations.
• A direct method, with preonly and qr. This requires installing PETSc together with SuiteSparse, which implements a sparse QR factorization [7].
• A combination of the two, i.e., lsqr and qr. Since qr is a direct method, the lsqr iteration will finish in just one step. Note that LSQR requires the preconditioner to be built for the normal equations matrix Z^T Z, which means that in the case of qr the preconditioner application consists of a forward triangular solve followed by a backward triangular solve with R, the computed triangular factor. Even though the number of iterations is one, due to how the method is implemented the number of preconditioner applications is two. Still, this is usually cheaper than preonly with qr, because the latter needs to apply the orthogonal factor of the QR decomposition, which is avoided with lsqr.
We will consider lsqr without preconditioner and with the qr preconditioner. The latter case requires that the matrix Z is built explicitly, by stacking the user matrices A and B, while the unpreconditioned lsqr can operate with A and B independently. Hence, we have added a user parameter explicitmatrix that can be turned on if necessary. This option also applies to the cross and cyclic solvers, to explicitly build the matrices of (14) and (15), which is necessary whenever a direct linear solver is to be used in the solution of the eigenproblem, e.g., to factorize the matrix B^T B.
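The distinction between the explicitly stacked matrix and operating with A and B separately can be sketched with scipy as a stand-in for the PETSc objects; SciPy's lsqr implements the same method [23], and a LinearOperator plays the role of the unassembled Z, needing only products with Z and Z^T.

```python
import numpy as np
from scipy.sparse.linalg import lsqr, LinearOperator

rng = np.random.default_rng(0)
m, p, n = 40, 30, 20
A = rng.standard_normal((m, n))
B = rng.standard_normal((p, n))
b = rng.standard_normal(m + p)

# direct route: explicitly stack Z = [A; B] and solve min ||Z x - b||
Z = np.vstack([A, B])
x_dir, *_ = np.linalg.lstsq(Z, b, rcond=None)

# iterative route: LSQR needs only products with Z and Z^T, so A and B
# can be kept as separate (possibly distributed) objects
Zop = LinearOperator(
    (m + p, n),
    matvec=lambda x: np.concatenate([A @ x, B @ x]),
    rmatvec=lambda y: A.T @ y[:m] + B.T @ y[m:],
)
x_it = lsqr(Zop, b, atol=1e-12, btol=1e-12, iter_lim=200)[0]
```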
Apart from the joint bidiagonalization of Algorithm 2, the main ingredient of the solver is the thick restart discussed in section 3. In SLEPc, the small-sized dense projected problem is handled within an auxiliary object called DS, with operations to solve the projected problem, sort the computed solution according to a certain criterion, and truncate it to a smaller size, among others. For the GSVD, we have implemented the computation of the CS decomposition (36) using LAPACK's subroutine GGSVD3, as well as the truncation operation that manages the arrowhead shapes of Figure 3 using a compact storage.
A common parameter in all SLEPc solvers is ncv, the number of column vectors, that is, the maximum allowed dimension of the subspaces. In SVD-type computations, the user can also choose whether to compute the largest (default) or smallest (generalized) singular values and vectors. The user interface of the GSVD thick-restart solver includes two additional parameters: the scale factor, discussed in section 3.4, and the restart parameter, indicating the proportion of basis vectors that must be kept after restart (the default is 50%).
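As a rough illustration, a run-time configuration of the GSVD solver combining the parameters mentioned above could look like the following command line. The option names follow SLEPc's usual prefixes, but they are given here only as an assumed example; the `-help` output of the installed release is the authoritative reference, and the executable name is a placeholder.

```shell
# placeholder application: creates the SVD object, passes the pair with
# SVDSetOperators(svd,A,B), and calls SVDSetFromOptions()/SVDSolve()
./gsvd_example \
  -svd_type trlanczos \
  -svd_nsv 20 -svd_ncv 40 \
  -svd_trlanczos_restart 0.5 \
  -svd_trlanczos_scale 10 \
  -svd_trlanczos_ksp_type lsqr
```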
A final comment concerns locking. The absolute values of the spikes of the arrowhead matrices in Figure 3 are precisely the residual bounds used in the convergence criterion (55). As soon as generalized singular quadruples converge, the leading values of these spikes become small (below the tolerance). Then one could set these values to zero explicitly, with the consequent decoupling from the rest of the bidiagonalization. This is called locking, a form of deflation in which the converged solutions are used in the orthogonalization but are excluded from the rest of the operations. In our solver, locking is active by default, but the user can deactivate it so that the full bidiagonalization is updated at restart (including converged vectors). All results discussed in the next section use locking.

Computational results
In this section we present results of some computational experiments to assess the performance of the solvers in terms of accuracy, convergence, and parallel scalability. The executions have been carried out on the Tirant III computer, which consists of 336 computing nodes, each of them with two Intel Xeon SandyBridge E5-2670 processors (16 cores each) running at 2.6 GHz with 32 GB of memory, connected with an Infiniband network. We allocated at most 4 MPI processes per node. The presented results correspond to SLEPc version 3.18, together with PETSc 3.18, SuiteSparse 5.12 and MUMPS 5.5. All software has been compiled with Intel C and Fortran compilers (version 18) and Intel MPI. Table 1 lists the problems used in the experiments, summarizing some properties and parameters. Here is a short description of the problems:
• The invinterp problem is intended to test iterative regularization methods for linear inverse problems, a context where the GSVD can be used. In this case, the A matrix comes from the interpolation relations of n points at random locations with respect to a 2D regular grid of side √n, and B is the regularization matrix. For matrix B we use the last case listed in [10, §4.2] (2D Laplacian with zero derivative enforced on one of the boundaries), except that we do not compute the compact QR decomposition of this matrix.
• In the 3elt, gemat12 and onetone1 problems, the A matrix is the matrix with the same name taken from the SuiteSparse matrix collection [8], while the B matrix is a bidiagonal matrix with one more row than columns and values −1 and 1 on the subdiagonal and diagonal, respectively. The hardesty2 matrix also belongs to the SuiteSparse matrix collection, but instead of using it with a bidiagonal matrix, we take the bottom square block as matrix B and the remaining top rows as matrix A.
All results in this section are obtained with the explicitmatrix flag set, except for the diagonal, 3elt, gemat12 and onetone1 problems when using LSQR. Table 2 illustrates the performance of our solver with the first three test cases. In sequential runs we use the sparse QR factorization to solve the least squares problems, but in parallel we have to resort to LSQR, which may need many iterations (in fact, for the hardesty2 problem the maximum number of iterations is exceeded). From the table, we can see that the invinterp problem needs more time in parallel than sequentially, because LSQR is slow (this is also related to the scale factor, as discussed below). Still, parallelism allows solving very large problems for which computing a sequential sparse QR factorization is not viable.
To assess the accuracy of the computed generalized singular quadruples, we use the residual norm (52) relative to the norm of the stacked matrix (5), ||r_GSVD||_2 / ||Z||_∞. The default tolerance for the thick-restart Lanczos solver is 10^-8, and we see in Table 2 that the error is below the tolerance in all cases except for invinterp with LSQR. This can be fixed by requesting a more stringent tolerance for the LSQR stopping criterion (which we have set to 10^-10); otherwise, the poor accuracy of the inner iteration prevents attaining the requested accuracy in the outer one.
In the case of a sparse QR factorization, the total number of iterations of the linear solver reported in Table 2 can be interpreted as the number of least squares problems that need to be solved. To understand this value, we have to take into account that the default value of ncv (number of column vectors) is max{2·nsv, 10}, that is, 40 in the case of the test problems analyzed in Table 2. In addition to the least squares problems required during the Lanczos iteration, the postprocessing to obtain the final right singular vectors g_i = R^{-1} w_i requires additional least squares solves.
The scale factor discussed in section 3.4 may have a significant impact on the performance of the thick-restart Lanczos solver, as it has an influence on the number of Lanczos iterations as well as on the number of iterations needed by LSQR. Figure 4 compares the execution time for the last problems of Table 1 when the scale factor is changed. The tolerance used for LSQR in this case is 10^-12. We can draw several conclusions. First, in all these problems scaling is beneficial, and execution time is reduced whenever the scale factor γ is increased, up to a point where it stabilizes. This is also true, but with the reciprocal of γ, if the smallest generalized singular values are computed (onetone1 problem). Second, in these problems it is much cheaper to use a sparse QR factorization for the least squares problems than to solve them with LSQR, as the latter may require many iterations (it does not converge in the onetone1 problem). Still, LSQR is currently necessary for parallel runs, and in Figure 5 we study how the number of iterations of LSQR changes with respect to the scale factor. As the scale factor increases, the LSQR method has more difficulty converging, but still the overall time goes down (although relatively less compared to the sparse QR cases).
Figure 6 analyzes the performance of the dynamic scaling mechanism described in section 3.4. As we increase the value of the dynamic scale threshold γ_0, we reduce the number of re-scale operations but increase the amount of discarded iterations (when re-scaling, the previous iterations are discarded). The cost of each re-scale operation is not negligible in this case, because we are using a sparse QR factorization, which has to be recomputed after each re-scale. As a result, the best execution time is achieved at intermediate values of the threshold (10 or 20).
Figure 7 shows the execution time of the thick-restart Lanczos solver compared with the cross and cyclic solvers discussed in section 2.2.
In all cases we use a direct method for the linear solves (SuiteSparse for Lanczos, and MUMPS for cross and cyclic). There is no clear winner, but note that for Lanczos we are using the scale factor γ that gave the smallest time in our tests for each problem; otherwise, the Lanczos solver is not competitive in general. However, it is worth noting that for the hardesty2 problem the Lanczos solver is the only one that provides a suitable solution in terms of the residual. In this case, the matrix B^T B in the cross and cyclic methods is singular and, because the generalized symmetric-definite Lanczos eigensolver fails, one has to solve the equivalent eigenproblem as non-symmetric, resulting in much lower accuracy (≈ 1·10^-4), compared to the Lanczos GSVD solver (≈ 6·10^-11). In the other cases, the error of the computed solutions is similar in all solvers.
We conclude this section by analyzing parallel performance. Figure 8 plots the parallel execution time of the Lanczos solver for the diagonal problem with up to 64 MPI processes. We can see that the scaling is very good, close to the ideal one. The figure shows the one-sided variant together with the default one (two-sided). The one-sided variant is always faster, but the difference is not too significant because orthogonalization amounts to only a modest percentage of the overall cost. To better appreciate the gain of one-sided orthogonalization, the right panel of Figure 8 shows only the time spent in orthogonalization, with an improvement factor of about 2.5 of the one-sided scheme with respect to the default. The diagonal problem is very favorable for parallel computing, as the matrix-vector product is trivially parallelizable. At the other extreme, the invinterp problem has a very disadvantageous sparsity pattern (nonzeros located at random positions), so the speedup shown in Figure 9 is good up to 16 MPI processes but deteriorates afterwards.

Concluding remarks
We have developed a thick restart mechanism for the joint Lanczos bidiagonalization to compute a partial GSVD of a large-scale matrix pair. This is a key ingredient to make the solver usable in the context of real applications. We have developed a fully-fledged implementation in the SLEPc library which, in addition to the restart, includes other interesting features such as one-sided orthogonalization, scaling, and locking. The solver is very flexible, in the sense that the user can indicate at run time whether the associated least squares problems must be solved with a direct or an iterative method, as well as specify many other settings such as the dimension of the Krylov subspace or the scale factor.
We have conducted a number of computational experiments, showing that our solver is numerically robust, computationally efficient, and scalable in parallel runs. If an appropriate scale factor is used, the performance of the Lanczos method is on a par with that of the cross and cyclic solvers for the GSVD, which we have also developed during this work.
In parallel executions for large-scale problems, the only possibility at the moment for the least squares problem is to use LSQR without preconditioner, since PETSc does not currently provide parallel preconditioners for this case. This represents a severe performance bottleneck, especially if a large scale factor is used, as this hinders the convergence of LSQR. As future work, we may consider implementing a parallel preconditioner for least squares problems [5].