A Jacobi–Davidson Method for Large Scale Canonical Correlation Analysis

In the large scale canonical correlation analysis arising from multi-view learning applications, one needs to compute the canonical weight vectors corresponding to a few of the largest canonical correlations. For such a task, we propose a Jacobi–Davidson type algorithm that computes the canonical weight vectors by transforming the problem into the so-called canonical correlation generalized eigenvalue problem. Convergence results are established and reveal the accuracy of the approximate canonical weight vectors. Numerical examples are presented to support the effectiveness of the proposed method.


Introduction
Canonical correlation analysis (CCA) is one of the most representative two-view multivariate statistical techniques and has been applied to a wide range of machine learning problems, including classification, retrieval, regression and clustering [1][2][3]. It seeks a pair of linear transformations of two views of high-dimensional features such that the associated low-dimensional projections are maximally correlated. Denote by S_a ∈ R^{m×d} and S_b ∈ R^{n×d} the data matrices from two data sets with m and n features, respectively, where d is the number of samples. Without loss of generality, we assume S_a and S_b are centered, i.e., S_a 1_d = 0 and S_b 1_d = 0, where 1_d ∈ R^d is the vector of all ones; otherwise, we can preprocess S_a and S_b as S_a ← S_a − (1/d)(S_a 1_d) 1_d^T and S_b ← S_b − (1/d)(S_b 1_d) 1_d^T, respectively. CCA aims to find a pair of canonical weight vectors x ∈ R^m and y ∈ R^n that maximize the canonical correlation

max x^T C y, subject to x^T A x = 1 and y^T B y = 1, (1)

where

A = S_a S_a^T, B = S_b S_b^T and C = S_a S_b^T, (2)

and then projects the high-dimensional data S_a and S_b onto the low-dimensional subspaces spanned by x and y, respectively, to achieve the purpose of dimensionality reduction. In most cases [1,4,5], one pair of canonical weight vectors is not enough, since it means the dimension of the low-dimensional subspaces is just one. When a set of canonical weight vectors is required, the single-vector CCA (1) is extended to obtain a pair of canonical weight matrices X ∈ R^{m×k} and Y ∈ R^{n×k} by solving the optimization problem

max trace(X^T C Y), subject to X^T A X = I_k and Y^T B Y = I_k.
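For concreteness, the small dense case of (1)–(2) can be sketched in Python with NumPy (an illustration with random stand-in data, not the paper's code): the problem is whitened by Cholesky factors of A and B and solved by an SVD.

```python
import numpy as np

# Two centered views S_a (m x d) and S_b (n x d); random stand-ins for real data.
rng = np.random.default_rng(0)
m, n, d = 6, 4, 50
S_a = rng.standard_normal((m, d)); S_a -= S_a.mean(axis=1, keepdims=True)
S_b = rng.standard_normal((n, d)); S_b -= S_b.mean(axis=1, keepdims=True)

A = S_a @ S_a.T   # within-view covariance of the first view
B = S_b @ S_b.T   # within-view covariance of the second view
C = S_a @ S_b.T   # cross-covariance between the two views

# Whiten with Cholesky factors A = R_a^T R_a and B = R_b^T R_b, then take an SVD
# of R_a^{-T} C R_b^{-1}; its singular values are the canonical correlations.
R_a = np.linalg.cholesky(A).T
R_b = np.linalg.cholesky(B).T
P, lam, QT = np.linalg.svd(np.linalg.solve(R_a.T, C) @ np.linalg.inv(R_b))
x = np.linalg.solve(R_a, P[:, 0])   # first pair of canonical weight vectors
y = np.linalg.solve(R_b, QT[0, :])
# (x, y) satisfies the constraints of (1) and attains correlation lam[0].
```

This dense approach is exactly what becomes infeasible at large scale, which motivates the decomposition-free method developed later.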
Usually, both A and B are symmetric positive definite. However, there are cases, such as the under-sampled problem [6], where A and B may be only semi-definite. In such a case, regularization techniques [7][8][9][10], which add a multiple of the identity matrix to A and B, are applied to find the optimal solution of

max trace(X^T C Y), subject to X^T (A + κ_a I_m) X = I_k and Y^T (B + κ_b I_n) Y = I_k, (3)

where κ_a and κ_b are called regularization parameters and are usually chosen to maximize the cross-validation score [11]. In other words, A and B are replaced by A + κ_a I_m and B + κ_b I_n to keep them invertible. Hence, in this paper, by default we assume A and B are both positive definite and m ≥ n unless explicitly stated otherwise. As shown in [4], the optimization problem (3) can be equivalently transformed into the generalized eigenvalue problem

K z = λ M z with K = [ 0, C; C^T, 0 ], M = [ A, 0; 0, B ] and z = [x^T, y^T]^T, (4)

where the positive definiteness of the matrices A and B implies that M is positive definite. The generalized eigenvalue problem (4) is referred to as the Canonical Correlation Generalized Eigenvalue Problem (CCGEP) in this paper. Define

𝒞 := R_A^{-T} C R_B^{-1}, where A = R_A^T R_A and B = R_B^T R_B (5)

are their Cholesky decompositions. It is easy to verify that, with p = R_A x and q = R_B y, the CCGEP (4) is equivalent to

𝒞 q = λ p and 𝒞^T p = λ q. (6)
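The equivalence between the pencil {K, M} and the singular value problem of the whitened matrix R_A^{-T} C R_B^{-1} can be checked numerically on a toy problem (a sketch with random data, not part of the paper's experiments):

```python
import numpy as np
import scipy.linalg as sla

rng = np.random.default_rng(1)
m, n, d = 5, 3, 40
S_a = rng.standard_normal((m, d)); S_b = rng.standard_normal((n, d))
A, B, C = S_a @ S_a.T, S_b @ S_b.T, S_a @ S_b.T

# CCGEP pencil: K z = lambda M z with the 2x2 block structure of (4).
K = np.block([[np.zeros((m, m)), C], [C.T, np.zeros((n, n))]])
M = sla.block_diag(A, B)
eigvals = sla.eigh(K, M, eigvals_only=True)          # ascending order

# Singular values of the whitened cross-covariance R_A^{-T} C R_B^{-1}.
R_a, R_b = np.linalg.cholesky(A).T, np.linalg.cholesky(B).T
lam = np.linalg.svd(np.linalg.solve(R_a.T, C) @ np.linalg.inv(R_b),
                    compute_uv=False)

# The pencil's eigenvalues are +/- lam_i together with m - n zeros.
expected = np.sort(np.concatenate([lam, -lam, np.zeros(m - n)]))
```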
It means that the eigenpairs of (4) can be obtained by computing the singular values and the associated left and right singular vectors of 𝒞. This method works well when the sample size d and the feature dimensions m and n are of moderate size, but it becomes very slow and numerically unstable for large-scale datasets, which are ubiquitous in the age of "Big Data" [12]. For large-scale datasets, the equivalence between (4) and (6) makes it possible to simply adapt subspace type algorithms for computing the partial singular value decomposition, such as Lanczos type algorithms [13,14] and Davidson type algorithms [15,16], and then translate them for CCGEP (4). However, in practice, the decompositions of the sample covariance matrices A and B are usually unavailable in the large scale case. The reason is that the decomposition is too expensive to compute explicitly for large scale problems, and may destroy sparsity and other structural information. Furthermore, sometimes the sample covariance matrices A and B should never be explicitly formed, such as in online learning systems.
Meanwhile, in [17], it is suggested to solve CCGEP (4) by considering the large scale symmetric positive definite pencil {K, M}. Some subspace type numerical algorithms have also been generalized to compute partial eigenpairs of {K, M}; see [18,19]. However, these generic algorithms do not make use of the special structure in (4), and they are usually less efficient than custom-made algorithms. Therefore, existing algorithms either cannot avoid the decomposition of the covariance matrices, or do not take the structure of CCGEP into account.
In this paper, we focus on a Jacobi–Davidson type subspace method for canonical correlation analysis. The idea of the Jacobi–Davidson algorithm proposed in [20] is Jacobi's approach combined with a Davidson type subspace method. Its essence is to construct a correction for a given eigenvector approximation in a subspace orthogonal to that approximation. An approximate eigenpair is extracted from a given subspace in a Davidson manner, and the subspace is then expanded by solving the corresponding correction equation. Due to its significant improvement in convergence, the Jacobi–Davidson method has become one of the most powerful algorithms for matrix eigenvalue problems, and has been generalized to almost all fields of matrix computation. For example, in [15,21], Hochstenbach presented Jacobi–Davidson methods for singular value problems and generalized singular value problems, respectively. In [22,23], Jacobi–Davidson methods are developed to solve nonlinear and two-parameter eigenvalue problems, respectively. Some other recent work on Jacobi–Davidson methods can be found in [24][25][26][27][28][29]. Motivated by these facts, we continue this line of work by extending the Jacobi–Davidson method to canonical correlation analysis. The main contribution is that the algorithm directly tackles CCGEP (4) without involving any large matrix decomposition, and does take advantage of the special structure of K and M; the transformation of (4) into (6) is needed only in our theoretical developments.
The rest of this paper is organized as follows. Section 2 collects some notations and a basic result for CCGEP that are essential to our later development. Our main algorithm is given and analyzed in detail in Section 3. We present some numerical examples in Section 4 to show the behaviors of our proposed algorithm and to support our analysis. Finally, conclusions are made in Section 5.

Preliminaries
Throughout this paper, R^{m×n} is the set of all m × n real matrices, R^m = R^{m×1}, and R = R^1. I_n is the n × n identity matrix. The superscript "^T" takes transpose only, and ‖·‖_1 denotes the 1-norm of a vector or matrix. For any matrix N ∈ R^{m×n} with m ≥ n, σ_i(N) for i = 1, . . . , n denotes the singular values of N in descending order.
For vectors x, y ∈ R^n, the usual inner product and its induced norm are defined by ⟨x, y⟩ := y^T x and ‖x‖_2 := ⟨x, x⟩^{1/2}. With them, the usual acute angle ∠(x, y) between x and y can be defined by

∠(x, y) := arccos( |⟨x, y⟩| / (‖x‖_2 ‖y‖_2) ).
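A minimal NumPy sketch of these angles, including the weighted W-inner product introduced next (illustrative helper functions, not from the paper):

```python
import numpy as np

# W-inner product, W-norm and W-acute angle for symmetric positive definite W;
# W = I recovers the usual Euclidean inner product and acute angle.
def w_inner(x, y, W):
    return y @ W @ x

def w_angle(x, y, W):
    c = abs(w_inner(x, y, W)) / np.sqrt(w_inner(x, x, W) * w_inner(y, y, W))
    return np.arccos(min(c, 1.0))   # clamp guards against roundoff above 1

W = np.array([[2.0, 0.5], [0.5, 1.0]])
x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
# x and y are orthogonal in the Euclidean sense, yet <x, y>_W = y^T W x = 0.5,
# so they are not W-orthogonal.
```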
Similarly, given any symmetric positive definite W ∈ R^{n×n}, the W-inner product and its induced W-norm are defined by ⟨x, y⟩_W := y^T W x and ‖x‖_W := ⟨x, x⟩_W^{1/2}. If ⟨x, y⟩_W = 0, then we say x ⊥_W y or y ⊥_W x. The W-acute angle ∠_W(x, y) between x and y can then be defined by

∠_W(x, y) := arccos( |⟨x, y⟩_W| / (‖x‖_W ‖y‖_W) ).

Let the singular value decomposition of 𝒞 be 𝒞 = P Λ Q^T, where P = [p_1, . . . , p_m] ∈ R^{m×m} and Q = [q_1, . . . , q_n] ∈ R^{n×n} are orthonormal, i.e., P^T P = I_m and Q^T Q = I_n, and

Λ = diag(λ_1, . . . , λ_n) ∈ R^{m×n} with λ_1 ≥ λ_2 ≥ · · · ≥ λ_n ≥ 0 (9)

is a leading diagonal matrix. The singular value decomposition of 𝒞 closely relates to the eigendecomposition of the following symmetric matrix [30] (p. 32):

[ 0, 𝒞; 𝒞^T, 0 ],

whose eigenvalues are ±λ_i for i = 1, . . . , n plus m − n zeros, with associated eigenvectors (1/√2)[p_i^T, ±q_i^T]^T for i = 1, 2, . . . , n, and [p_i^T, 0]^T for i = n + 1, . . . , m, respectively. The equivalence between (4) and (6) implies that the eigenvalues of CCGEP (4) are ±λ_i for i = 1, . . . , n plus m − n zeros, and the corresponding eigenvectors are (1/√2)[x_i^T, ±y_i^T]^T for i = 1, 2, . . . , n, and [x_i^T, 0]^T for i = n + 1, . . . , m, where

x_i = R_A^{-1} p_i and y_i = R_B^{-1} q_i. (10)

Let X = [x_1, . . . , x_m] and Y = [y_1, . . . , y_n]. Then, the A- and B-orthonormality constraints on X and Y, respectively, i.e.,

X^T A X = I_m and Y^T B Y = I_n, (11)

follow from P^T P = I_m and Q^T Q = I_n. Here, the first few x_i and y_i for i = 1, 2, . . . , k with k < n are the wanted canonical weight vectors. Furthermore, their corresponding eigenvalues satisfy the following maximization principle, which is critical to our later developments. For the proof see Appendix A.1.

Theorem 1.
The following equality holds:

∑_{i=1}^{min{k,ℓ}} λ_i = max { ∑_{i=1}^{min{k,ℓ}} σ_i(U^T C V) : U^T A U = I_k, V^T B V = I_ℓ }, (12)

where the maximum is taken over all U ∈ R^{m×k} and V ∈ R^{n×ℓ}, A, B and C are defined in (2), λ_i is defined in (9), and σ_i(U^T C V) for 1 ≤ i ≤ min{k, ℓ} are the singular values of U^T C V.
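The maximization principle can be sanity-checked numerically: for any random A- and B-orthonormal U and V, the sum of the singular values of U^T C V never exceeds the sum of the leading λ_i (a sketch with random stand-in data; `ab_orthonormal` is a hypothetical helper, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k, ell = 6, 4, 2, 3
S_a = rng.standard_normal((m, 30)); S_b = rng.standard_normal((n, 30))
A, B, C = S_a @ S_a.T, S_b @ S_b.T, S_a @ S_b.T

R_a, R_b = np.linalg.cholesky(A).T, np.linalg.cholesky(B).T
lam = np.linalg.svd(np.linalg.solve(R_a.T, C) @ np.linalg.inv(R_b),
                    compute_uv=False)

def ab_orthonormal(R, X):
    # Orthonormalize the columns of X in the metric R^T R via a QR of R X.
    Q, _ = np.linalg.qr(R @ X)
    return np.linalg.solve(R, Q)

U = ab_orthonormal(R_a, rng.standard_normal((m, k)))    # U^T A U = I_k
V = ab_orthonormal(R_b, rng.standard_normal((n, ell)))  # V^T B V = I_ell
sigma = np.linalg.svd(U.T @ C @ V, compute_uv=False)
# Theorem 1: sum_i sigma_i(U^T C V) <= sum_{i <= min(k, ell)} lam_i.
```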

The Main Algorithm
The idea of the Jacobi–Davidson method [20] is to iteratively construct approximations to certain eigenpairs. It expands the search subspace by solving a correction equation, and finds approximate eigenpairs as the best approximations in this search subspace.

Subspace Extraction
Let 𝒳 ⊆ R^m and 𝒴 ⊆ R^n with dim(𝒳) = k and dim(𝒴) = ℓ, respectively. As stated in [31], we call {𝒳, 𝒴} a pair of deflating subspaces of CCGEP (4) if

C 𝒴 ⊆ A 𝒳 and C^T 𝒳 ⊆ B 𝒴. (13)

Let X ∈ R^{m×k} and Y ∈ R^{n×ℓ} be A- and B-orthonormal basis matrices of the subspaces 𝒳 and 𝒴, respectively, i.e., X^T A X = I_k and Y^T B Y = I_ℓ.
The equality (13) implies that there exist D_A ∈ R^{k×ℓ} and D_B ∈ R^{ℓ×k} [32] (Equation (2.11)) such that

C Y = A X D_A and C^T X = B Y D_B. (14)

Consequently, if (λ, [x̂^T, ŷ^T]^T) is an eigenpair of the small pencil [ 0, D_A; D_B, 0 ], i.e., D_A ŷ = λ x̂ and D_B x̂ = λ ŷ, then (λ, z) gives an eigenpair of (4), where z = [x^T, y^T]^T, x = X x̂ and y = Y ŷ. This is because

C y = C Y ŷ = A X D_A ŷ = λ A X x̂ = λ A x,

and similarly C^T x = λ B y. That means K z = λ M z. Hence, we have shown that a pair of deflating subspaces {𝒳, 𝒴} can be used to recover the eigenpairs of CCGEP (4) associated with it. In practice, pairs of exact deflating subspaces are usually not available, and one usually uses Lanczos type methods [14] or Davidson type methods [15] to generate approximate ones, such as the Krylov subspaces in the Lanczos method [33]. Next, we discuss how to extract best approximate eigenpairs from a given pair of approximate deflating subspaces.
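For exact deflating subspaces spanned by leading weight vectors, the relations (14) hold with D_A = D_B = diag(λ_1, . . . , λ_k), which a short NumPy sketch confirms (random stand-in data):

```python
import numpy as np

rng = np.random.default_rng(6)
m, n, k = 6, 4, 2
S_a = rng.standard_normal((m, 30)); S_b = rng.standard_normal((n, 30))
A, B, C = S_a @ S_a.T, S_b @ S_b.T, S_a @ S_b.T
R_a, R_b = np.linalg.cholesky(A).T, np.linalg.cholesky(B).T
P, lam, QT = np.linalg.svd(np.linalg.solve(R_a.T, C) @ np.linalg.inv(R_b))

# Exact deflating subspaces spanned by the first k weight vectors of (10):
# x_i = R_a^{-1} p_i and y_i = R_b^{-1} q_i.
X = np.linalg.solve(R_a, P[:, :k])
Y = np.linalg.solve(R_b, QT[:k, :].T)
D_A = np.diag(lam[:k])   # here D_A = D_B = diag(lam_1, ..., lam_k)
# (14): C Y = A X D_A and C^T X = B Y D_A hold up to roundoff.
```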
In what follows, we consider the simple case k = ℓ. Let U ∈ R^{m×k} and V ∈ R^{n×k} be the A- and B-orthonormal basis matrices of the subspaces 𝒰 and 𝒱, respectively. Denote by θ_i, i = 1, 2, . . . , k, the singular values of U^T C V in descending order, with associated left and right singular vectors û_i and v̂_i, respectively, i.e.,

(U^T C V) v̂_i = θ_i û_i and (U^T C V)^T û_i = θ_i v̂_i.

Even though U and V as A- and B-orthonormal basis matrices are not unique, the θ_i are. Motivated by the maximization principle in Theorem 1, we seek the best approximations associated with the pair of approximate deflating subspaces {𝒰, 𝒱} to the eigenpairs (λ_i, z_i) for 1 ≤ i ≤ j through

max { trace(U_j^T C V_j) : U_j^T A U_j = I_j, V_j^T B V_j = I_j } (15)

for any U_j ∈ R^{m×j} and V_j ∈ R^{n×j} satisfying span(U_j) ⊆ 𝒰 and span(V_j) ⊆ 𝒱. We claim that the quantity in (15) is given by ∑_{i=1}^{j} θ_i. To see this, we notice that any U_j and V_j in (15) can be written as

U_j = U Û_j and V_j = V V̂_j with Û_j^T Û_j = I_j and V̂_j^T V̂_j = I_j,

and vice versa. Therefore the quantity in (15) is equal to

max { trace(Û_j^T (U^T C V) V̂_j) : Û_j^T Û_j = I_j, V̂_j^T V̂_j = I_j },

which is ∑_{i=1}^{j} θ_i by the properties of the singular value decomposition of U^T C V [30]. The maximum is attained at Û_j = [û_1, û_2, . . . , û_j] and V̂_j = [v̂_1, v̂_2, . . . , v̂_j]. Therefore, naturally, the best approximations to (λ_i, z_i) (1 ≤ i ≤ j) in the sense of (15) are given by

(θ_i, z̃_i) with z̃_i = [x̃_i^T, ỹ_i^T]^T, x̃_i = U û_i and ỹ_i = V v̂_i. (16)

Define the residual

r^{(i)} = K z̃_i − θ_i M z̃_i = [ (r_a^{(i)})^T, (r_b^{(i)})^T ]^T with r_a^{(i)} = C ỹ_i − θ_i A x̃_i and r_b^{(i)} = C^T x̃_i − θ_i B ỹ_i, (17)

where K and M are defined in (4). We summarize what we do in this subsection in the following theorem.
Theorem 2.

Let U ∈ R^{m×k} and V ∈ R^{n×k} be the A- and B-orthonormal basis matrices of the subspaces 𝒰 and 𝒱, respectively, and denote by θ_i, i = 1, 2, . . . , k, the singular values of U^T C V in descending order. Then, for any j ≤ k, the approximate eigenpairs (θ_i, z̃_i) (1 ≤ i ≤ j) given by (16) and the associated residuals defined in (17) satisfy r_a^{(i)} ⊥ 𝒰 and r_b^{(i)} ⊥ 𝒱.
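The orthogonality of the residuals asserted in the theorem is easy to verify numerically (a NumPy sketch with random data; `ab_orthonormal` is a hypothetical helper producing an A- or B-orthonormal basis via a QR factorization):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, k = 6, 4, 2
S_a = rng.standard_normal((m, 30)); S_b = rng.standard_normal((n, 30))
A, B, C = S_a @ S_a.T, S_b @ S_b.T, S_a @ S_b.T
R_a, R_b = np.linalg.cholesky(A).T, np.linalg.cholesky(B).T

def ab_orthonormal(R, X):
    Q, _ = np.linalg.qr(R @ X)
    return np.linalg.solve(R, Q)

U = ab_orthonormal(R_a, rng.standard_normal((m, k)))  # U^T A U = I_k
V = ab_orthonormal(R_b, rng.standard_normal((n, k)))  # V^T B V = I_k

# Extraction (16): SVD of the small projected matrix U^T C V.
Uhat, theta, VhatT = np.linalg.svd(U.T @ C @ V)
x_t, y_t = U @ Uhat[:, 0], V @ VhatT[0, :]

# Residual blocks of (17) for the first approximate eigenpair.
r_a = C @ y_t - theta[0] * (A @ x_t)
r_b = C.T @ x_t - theta[0] * (B @ y_t)
# Theorem 2: r_a is orthogonal to span(U) and r_b to span(V).
```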

Correction Equation
In this subsection, we turn to construct a correction equation for a given eigenpair approximation.
Suppose that (θ, z̃) with z̃ = [x̃^T, ỹ^T]^T is an approximate eigenpair of CCGEP (4) extracted as in the previous subsection, and that r = [r_a^T, r_b^T]^T is the associated residual. We seek A- and B-orthogonal corrections s and t to x̃ and ỹ, respectively, such that

C(ỹ + t) = λ A(x̃ + s) and C^T(x̃ + s) = λ B(ỹ + t), (18)

where s ⊥_A x̃ and t ⊥_B ỹ. Then, by (18), we have

C t − λ A s = (λ − θ) A x̃ − r_a and C^T s − λ B t = (λ − θ) B ỹ − r_b. (19)

Notice that r_a ⊥ x̃ and r_b ⊥ ỹ by Theorem 2, which gives rise to

(I_m − A x̃ x̃^T)(C t − λ A s) = −r_a and (I_n − B ỹ ỹ^T)(C^T s − λ B t) = −r_b. (20)

Because s ⊥_A x̃ and t ⊥_B ỹ, Equation (20) can be rewritten as

[ I_m − A x̃ x̃^T, 0; 0, I_n − B ỹ ỹ^T ] (K − λ M) [ I_m − x̃ x̃^T A, 0; 0, I_n − ỹ ỹ^T B ] [ s; t ] = −r. (21)

However, we do not know λ here. It is natural to replace λ by θ in (21) to get the final correction equation, i.e.,

[ I_m − A x̃ x̃^T, 0; 0, I_n − B ỹ ỹ^T ] (K − θ M) [ I_m − x̃ x̃^T A, 0; 0, I_n − ỹ ỹ^T B ] [ s; t ] = −r with s ⊥_A x̃ and t ⊥_B ỹ. (22)

We summarize what we have so far into Algorithm 1, and make a few comments on it.
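A NumPy sketch of the projected correction equation as reconstructed above (the exact projector layout is an assumption); the singular projected system is solved in the least-squares sense:

```python
import numpy as np
import scipy.linalg as sla

rng = np.random.default_rng(4)
m, n = 6, 4
S_a = rng.standard_normal((m, 30)); S_b = rng.standard_normal((n, 30))
A, B, C = S_a @ S_a.T, S_b @ S_b.T, S_a @ S_b.T
K = np.block([[np.zeros((m, m)), C], [C.T, np.zeros((n, n))]])
M = sla.block_diag(A, B)

# A crude approximate pair with x_t^T A x_t = y_t^T B y_t = 1; taking
# theta = x_t^T C y_t makes r_a ⟂ x_t and r_b ⟂ y_t hold automatically.
x_t = rng.standard_normal(m); x_t /= np.sqrt(x_t @ A @ x_t)
y_t = rng.standard_normal(n); y_t /= np.sqrt(y_t @ B @ y_t)
theta = x_t @ C @ y_t
z = np.concatenate([x_t, y_t])
r = K @ z - theta * (M @ z)

# Left and right block projectors of the correction equation.
P_left = sla.block_diag(np.eye(m) - np.outer(A @ x_t, x_t),
                        np.eye(n) - np.outer(B @ y_t, y_t))
P_right = sla.block_diag(np.eye(m) - np.outer(x_t, A @ x_t),
                         np.eye(n) - np.outer(y_t, B @ y_t))
Op = P_left @ (K - theta * M) @ P_right

# Least-squares solve (the operator is singular along the excluded directions),
# then re-project to enforce the constraints s ⟂_A x_t and t ⟂_B y_t.
w, *_ = np.linalg.lstsq(Op, -r, rcond=None)
w = P_right @ w
s, t = w[:m], w[m:]
```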
(1) At step 2, A- and B-orthogonalization procedures are applied to ensure that U^T A s̃ = 0 and V^T B t̃ = 0.
(2) At step 7, in most cases, the correction equation need not be solved exactly. A few steps of an iterative method for symmetric linear systems, such as the linear conjugate gradient method (CG) [34] or the minimum residual method (MINRES) [35], are sufficient. Usually, more steps in solving the correction equation lead to fewer outer iterations. This will be shown in the numerical examples.

(3) For the convergence test, we use the relative residual norms

η(θ_i, z̃_i) = ‖K z̃_i − θ_i M z̃_i‖_1 / ( (‖K‖_1 + θ_i ‖M‖_1) ‖z̃_i‖_1 ) (23)

to determine whether an approximate eigenpair (θ_i, z̃_i) has converged to a desired accuracy. In addition, in a practical implementation, once one or several approximate eigenpairs converge to a preset accuracy, they should be deflated so that they are not re-computed in the following iterations. Suppose λ_i for 1 ≤ i ≤ j, X_j = [x_1, . . . , x_j] and Y_j = [y_1, . . . , y_j] have been computed, where j ≤ k. We can consider the generalized eigenvalue problem

K_j z = λ M z with K_j = [ 0, C_j; C_j^T, 0 ], (24)

where

C_j = (I_m − A X_j X_j^T) C (I_n − Y_j Y_j^T B). (25)

By (11), it is clear that the eigenvalues of (24) consist of two groups: those associated with the eigenvectors [x_i^T, ±y_i^T]^T for 1 ≤ i ≤ j are shifted to zero, and the others remain unchanged. Furthermore, for the correction equation, we find s and t subject to additional A- and B-orthogonality constraints on s and t against X_j and Y_j, respectively. By a derivation similar to that of (22), the correction equation after deflation becomes

[ I_m − A X̃_j X̃_j^T, 0; 0, I_n − B Ỹ_j Ỹ_j^T ] (K − θ M) [ I_m − X̃_j X̃_j^T A, 0; 0, I_n − Ỹ_j Ỹ_j^T B ] [ s; t ] = −r, (26)

where X̃_j = [X_j, x̃] and Ỹ_j = [Y_j, ỹ]. Notice that s ⊥_A X_j and t ⊥_B Y_j mean 𝒰 ⊥_A X_j and 𝒱 ⊥_B Y_j in Algorithm 1, respectively. It follows that U^T C_j V = U^T C V.

(4) At step 5, LAPACK's routine xGESVD can be used to solve the singular value problem of U^T C V because of its small size. Thanks to the refinement at step 8, U^T C V takes the form of a diagonal matrix bordered by the newly updated row and column, and this form is preserved in the algorithm during the refinement of the bases U and V at step 8. The new basis matrices U Û and V V̂ are reassigned to U and V, respectively. Although a few extra costs are incurred, this refinement is necessary in order to achieve faster convergence of the eigenvectors, as stated in [36,37].
Furthermore, a restart is easily executed by keeping the first s_min columns of U and V when the dimension of the subspaces span{U} and span{V} exceeds s_max. The restart technique appears at step 8 to keep the size of U, V and U^T C V small. There are many ways to specify s_max and s_min. In our numerical examples, we simply take s_max = 3k and s_min = k.

Algorithm 1: A Jacobi–Davidson type method for CCGEP (4) (JDCCA).

1: Choose initial vectors s = u_0 and t = v_0, and set U = [ ], V = [ ]. for iter = 1, 2, . . . do
2: A- and B-orthogonalize s and t against U and V, respectively, to obtain s̃ and t̃.
3: Expand the basis matrices U = [U, s̃/‖s̃‖_A] and V = [V, t̃/‖t̃‖_B].
4: Update the corresponding column and row of U^T C V.
5: Compute the singular value decomposition of U^T C V, i.e., U^T C V = Û Θ V̂^T.
6: Compute the wanted approximate eigenpairs (θ_i, [x̃_i^T, ỹ_i^T]^T) by (16) and the corresponding residuals r_a^{(i)} and r_b^{(i)}.
7: Solve the correction Equation (22) for s and t.
8: Update U = U Û and V = V V̂. If the dimension of U and V exceeds s_max, then replace U and V with U_(1:s_min) and V_(1:s_min), respectively.
9: end for
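The whole loop can be sketched compactly in Python/NumPy (a toy reconstruction, not the authors' MATLAB implementation: the correction equation is solved exactly by least squares, and the restart of step 8 is omitted since the problem is tiny):

```python
import numpy as np
import scipy.linalg as sla

rng = np.random.default_rng(7)
m, n = 6, 4
S_a = rng.standard_normal((m, 30)); S_b = rng.standard_normal((n, 30))
A, B, C = S_a @ S_a.T, S_b @ S_b.T, S_a @ S_b.T
K = np.block([[np.zeros((m, m)), C], [C.T, np.zeros((n, n))]])
M = sla.block_diag(A, B)
R_a, R_b = np.linalg.cholesky(A).T, np.linalg.cholesky(B).T
lam = np.linalg.svd(np.linalg.solve(R_a.T, C) @ np.linalg.inv(R_b),
                    compute_uv=False)            # reference canonical correlations

def expand(W, v, G):
    """Steps 2-3: G-orthonormalize v against the columns of W and append it."""
    if W.shape[1] == W.shape[0]:                 # subspace already spans everything
        return W
    for _ in range(2):                           # two Gram-Schmidt passes
        if W.shape[1]:
            v = v - W @ (W.T @ (G @ v))
    nv = np.sqrt(v @ G @ v)
    if nv < 1e-10:                               # direction already in span(W):
        return expand(W, rng.standard_normal(len(v)), G)   # random fallback
    return np.hstack([W, (v / nv)[:, None]])

U, V = np.empty((m, 0)), np.empty((n, 0))
s, t = np.ones(m), np.ones(n)                    # initial vectors u_0 and v_0
for _ in range(8):
    U, V = expand(U, s, A), expand(V, t, B)      # steps 2-3
    Uh, th, VhT = np.linalg.svd(U.T @ C @ V)     # step 5
    theta = th[0]                                # step 6: extraction
    x_t, y_t = U @ Uh[:, 0], V @ VhT[0, :]
    z = np.concatenate([x_t, y_t])
    r = K @ z - theta * (M @ z)
    P_l = sla.block_diag(np.eye(m) - np.outer(A @ x_t, x_t),
                         np.eye(n) - np.outer(B @ y_t, y_t))
    P_r = sla.block_diag(np.eye(m) - np.outer(x_t, A @ x_t),
                         np.eye(n) - np.outer(y_t, B @ y_t))
    w, *_ = np.linalg.lstsq(P_l @ (K - theta * M) @ P_r, -r,
                            rcond=None)          # step 7: exact correction solve
    s, t = (P_r @ w)[:m], (P_r @ w)[m:]
# theta now approximates the largest canonical correlation lam[0].
```

On a problem this small the search subspaces fill the whole space after a few iterations, so theta matches lam[0] to machine precision; at realistic scale the loop would stop on the relative residual norm instead.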

Convergence
The convergence theories of the Jacobi–Davidson method for the eigenvalue and singular value problems are given in [15,38], respectively. Here we prove a similar convergence result for the Jacobi–Davidson method for CCGEP based on the following lemma. Specifically, if we solve the correction Equation (22) exactly, and x̃ and ỹ are close enough to x and y, respectively, then the approximate eigenvectors can be expected to converge cubically. For the proofs see Appendices A.2 and A.3.
Lemma 1.

Let (λ, [x^T, y^T]^T) be an eigenpair of CCGEP (4) associated with a simple nonzero eigenvalue λ. Then the operator on the left-hand side of the correction Equation (22), with (θ, x̃, ỹ) replaced by (λ, x, y), is a bijection from span(x)^⊥_A × span(y)^⊥_B onto itself, where span(x)^⊥_A and span(y)^⊥_B are the A- and B-orthogonal complements of span(x) and span(y), respectively.

Numerical Examples
In this section, we present some numerical examples to illustrate the effectiveness of Algorithm 1. Our goal is to compute the first few canonical weight vectors. A computed approximate eigenpair (θ_i, z̃_i) is considered converged when its relative residual norm η(θ_i, z̃_i) falls below a prescribed tolerance. All the experiments in this paper are executed on a Ubuntu 12.04 (64 bit) desktop with an Intel(R) Core(TM) i7-6700 CPU@3.40 GHz and 32 GB of RAM, using MATLAB 2010a, with machine epsilon 2.22 × 10^{-16} in double-precision floating point arithmetic.

Example 1.
We first examine Theorem 3 by using two pairs of data matrices S_a and S_b from a publicly available handwritten numerals dataset (https://archive.ics.uci.edu/ml/datasets/Multiple+Features). It consists of features of handwritten numerals ('0'–'9'), and each digit has 200 patterns. Each pattern is represented by six different feature sets, i.e., Fou, Fac, Kar, Pix, Zer and Mor. Two pairs of feature sets, Fou–Zer and Pix–Fou, are chosen for S_a and S_b, respectively, such that S_a ∈ R^{76×d} and S_b ∈ R^{47×d} in Fou–Zer, and S_a ∈ R^{240×d} and S_b ∈ R^{76×d} in Pix–Fou, with d = 2000. To make the numerical example repeatable, the initial vectors are set to u_0 = x_1 + 10^{-3} × ones(m, 1) and v_0 = y_1 + 10^{-3} × ones(n, 1), where m and n are the row dimensions of S_a and S_b, respectively, ones is the MATLAB built-in function, and [x_1^T, y_1^T]^T is computed by MATLAB's function eig on (4) and considered to be the "exact" eigenvector for testing purposes. The correction Equation (22) in Algorithm 1 is solved by direct methods, such as Gaussian elimination, and the solution [s^T, t^T]^T obtained by such methods is regarded as "exact" in this example. Figure 1 plots sin ∠_A(x_1, x̃_1) and sin ∠_B(y_1, ỹ_1) in the first three iterations of Algorithm 1 for computing the first canonical weight vector of Fou–Zer and Pix–Fou. Figure 1 clearly shows that the convergence of Algorithm 1 is very fast when the initial vectors are close enough to the exact vectors, and the cubic convergence of Algorithm 1 appears in the third iteration. For truly large scale problems, however, such direct solvers are too expensive; in that case, iterative methods, such as the MINRES method, which is simply GMRES [39] applied to symmetric linear systems, are usually preferred.
Example 2.

In this example, we report the effect of the number of steps in the solution of the correction equation, denoted by n_g, on the total number of matrix-vector products (denoted by "#mvp"), the number of outer iterations (denoted by "#iter"), and the CPU time in seconds for Algorithm 1 to compute the first 10 canonical weight vectors of the test problems listed in Table 1. Following [40], the ORL dataset is partitioned into two groups: we select the first five images per individual as the first view to generate the data matrix S_a, and the remaining images for S_b. Similarly, we obtain the data matrices S_a and S_b for the FERET and Yale datasets. The numbers of rows and columns of S_a and S_b are detailed in Table 1. In this example, we set the initial vectors u_0 = ones(m, 1) and v_0 = ones(n, 1), with s_max = 30 and s_min = 10 for restarting, and simply take the regularization parameters κ_a = κ_b = 10^{-4}. We let the number of MINRES steps n_g vary from 5 to 40, and collect the numerical results in Figure 2. As expected, the total number of outer iterations decreases as n_g increases, while the total number of matrix-vector products does not change monotonically with n_g; it depends on how much the increase of n_g reduces the number of outer iterations. In addition, Figure 2 shows that the total #mvp is not the only deciding factor for the total CPU time: when n_g is larger, the significantly reduced #iter leads to a smaller total CPU time. For these three test examples, MINRES steps n_g of around 15 to 25 appear to be cost-effective, and further increasing n_g beyond 40 usually does not have a significant effect. The least efficient case is when n_g is too small.

Example 3.
In this example, we compare Algorithm 1, i.e., JDCCA, with the Jacobi–Davidson QZ type method [41] (JDQZ) for the large scale symmetric positive definite pencil {K, M} defined in (4), computing the first 10 canonical weight vectors of the test problems listed in Table 1 with n_g = 20 MINRES steps. We take u_0 = ones(m, 1) and v_0 = ones(n, 1) in Algorithm 1 and the initial vector ones(m + n, 1) for the JDQZ algorithm, and compute the same relative residual norms η(θ_i, z̃_i). The corresponding numerical results are plotted in Figure 3. For these three test problems, Figure 3 suggests that Algorithm 1 always outperforms the JDQZ algorithm. Other experiments with different test problems and numbers of MINRES steps, not reported here, also illustrate this point.

Conclusions
To analyze the correlations between two data sets, several numerical algorithms are available to find the canonical correlations and the associated canonical weight vectors; however, there is very little discussion of the large scale sparse and structured matrix cases in the literature. In this paper, a Jacobi–Davidson type method, i.e., Algorithm 1, is presented for large scale canonical correlation analysis by computing a small portion of the eigenpairs of the canonical correlation generalized eigenvalue problem (4). The theoretical result established in Theorem 3 demonstrates the cubic convergence of the approximate eigenvectors when the correction equation is solved exactly and the approximate eigenvectors of the previous step are close enough to the exact ones. The corresponding numerical results confirm the asymptotic convergence rate provided by Theorem 3, and demonstrate that Algorithm 1 performs far better than the JDQZ method applied to the large scale symmetric positive definite pencil {K, M}.
Notice that the main computational task in every iteration of Algorithm 1 is solving the correction Equation (22). In our numerical examples, we only considered the plain version of MINRES, i.e., without any preconditioner. However, it is not hard to see that incorporating a preconditioner presents no difficulty and can improve the numerical performance if a preconditioner is available. In addition, multi-set canonical correlation analysis (MCCA) [42], which analyzes linear relationships among more than two data sets, can be equivalently transformed into a generalized eigenvalue problem of an analogous block form, in which the cross-covariance blocks C_ij = S_i S_j^T (with S_i the data matrix of the i-th set) play the role of C in (4); the development of efficient Jacobi–Davidson methods for treating such large scale MCCA problems will be a subject of our future study.

Appendix A.1

Proof of Theorem 1. To prove (12), for any U ∈ R^{m×k} and V ∈ R^{n×ℓ} satisfying U^T A U = I_k and V^T B V = I_ℓ, respectively, we first consider the augmented matrices of 𝒞 and U^T C V, i.e.,

[ 0, 𝒞; 𝒞^T, 0 ] and [ 0, U^T C V; V^T C^T U, 0 ],

whose eigenvalues are ±λ_i for i = 1, . . . , n plus m − n zeros, and ±σ_i(U^T C V) for i = 1, . . . , min{k, ℓ} plus k + ℓ − 2 min{k, ℓ} zeros, respectively. Notice that

U^T C V = (R_A U)^T 𝒞 (R_B V),

where R_A and R_B are defined in (5), and R_A U and R_B V satisfy (R_A U)^T R_A U = U^T A U = I_k and (R_B V)^T R_B V = V^T B V = I_ℓ, respectively. Hence, applying Cauchy's interlacing inequalities [30] (Corollary 4.4) for the symmetric eigenvalue problem to the matrices [ 0, 𝒞; 𝒞^T, 0 ] and [ 0, U^T C V; V^T C^T U, 0 ] yields

σ_i(U^T C V) ≤ λ_i for 1 ≤ i ≤ min{k, ℓ} (A1)

for any U ∈ R^{m×k} and V ∈ R^{n×ℓ} such that U^T A U = I_k and V^T B V = I_ℓ. On the other hand, let U = [x_1, x_2, . . . , x_k] and V = [y_1, y_2, . . . , y_ℓ], where x_i and y_i are defined in (10). Then, by (11), we have U^T A U = I_k and V^T B V = I_ℓ. Furthermore,

U^T C V = [p_1, p_2, . . . , p_k]^T 𝒞 [q_1, q_2, . . . , q_ℓ] = diag(λ_1, . . . , λ_{min{k,ℓ}}) ∈ R^{k×ℓ},

which gives σ_i(U^T C V) = λ_i for 1 ≤ i ≤ min{k, ℓ}, and thus

∑_{i=1}^{min{k,ℓ}} σ_i(U^T C V) = ∑_{i=1}^{min{k,ℓ}} λ_i. (A2)

Equation (12) is a consequence of (A1) and (A2).