Quantum-inspired canonical correlation analysis for exponentially large dimensional data

Canonical correlation analysis (CCA) is a technique to find statistical dependencies between two sets of multivariate data. However, its application to high dimensional data is limited by the resulting time complexity. While the conventional CCA algorithm requires time polynomial in the input dimensionality, we have developed an algorithm that approximates CCA with computational time proportional to the logarithm of the input dimensionality using quantum-inspired computation. The computational efficiency and approximation performance of the proposed quantum-inspired CCA (qiCCA) algorithm are experimentally demonstrated. Furthermore, the fast computation of qiCCA allows us to directly apply CCA even after nonlinearly mapping raw input data into very high dimensional spaces. Experiments performed using a benchmark dataset demonstrated that, by mapping the raw input data into high dimensional spaces with second-order monomials, the proposed qiCCA extracted more correlations than linear CCA and was comparable to deep CCA and kernel CCA. These results suggest that qiCCA is highly useful and that quantum-inspired computation has the potential to unlock a new field in which exponentially large dimensional data can be analyzed.

In this study, we present an algorithm that approximates CCA with computational time proportional to the logarithm of the input dimensionality. The proposed algorithm is based on quantum-inspired algorithms, which are recently developed randomized algorithms for linear algebra computations (17)(18)(19)(20)(21). In this paper, we first explain our proposed quantum-inspired CCA (qiCCA) algorithm and then experimentally demonstrate its computational efficiency and approximation performance.
Furthermore, the fast computation of our algorithm enables us to directly apply CCA even after nonlinearly mapping raw input data into high dimensional spaces. Previous studies have shown that kernel CCA (22)(23)(24)(25)(26) and deep CCA (27,28) can extract more correlations than linear CCA by nonlinearly mapping raw input data into high dimensional spaces. While kernel CCA and deep CCA utilize the kernel trick and neural networks for their nonlinear transformations, mapping raw input data using a set of multivariate basis functions, such as polynomial or Fourier basis functions, is a simple and straightforward approach; however, after nonlinear mapping, the dimensionality increases exponentially with the dimensionality of the raw input data. This makes it computationally infeasible to apply the conventional CCA. In an experiment with a benchmark dataset, we mapped the raw input data into high dimensional spaces by taking the products of the input features for all possible feature pairs and applied qiCCA to the resultant high dimensional data. The results demonstrated that qiCCA extracted more correlations than linear CCA and was comparable to kernel and deep CCA. This suggests that the combination of qiCCA and nonlinear mapping using a large number of basis functions can provide an alternative way to reveal nonlinear correlations.

Algorithm
This section is organized as follows. First, we explain the mathematical notation. Then, we introduce the conventional CCA algorithm in which singular value decomposition (SVD) is used as a subroutine. In the following subsections, we explain quantum-inspired SVD (qiSVD) and the data structure that allows us to perform qiSVD in logarithmic time. Finally, we explain the qiCCA algorithm. Note that a Python implementation of qiCCA is available in our GitHub repository.

Canonical correlation analysis
We introduce the CCA algorithm in which SVD is used as a subroutine. We assume a pair of data matrices, X ∈ ℝ^(N×d1) and Y ∈ ℝ^(N×d2), where N is the number of samples and d1 and d2 are the numbers of input dimensions. The algorithm consists of the following steps.
1. Perform SVD on X and Y: X = U_X Σ_X V_X^T and Y = U_Y Σ_Y V_Y^T.
2. Perform SVD on U_X^T U_Y. Denote the results by U Σ V^T.
3. Output U_X U and U_Y V as the canonical variates, and V_X Σ_X^(-1) U and V_Y Σ_Y^(-1) V as the weight matrices that extract the canonical variates from X and Y, respectively.
As explained in the last subsection, the qiCCA algorithm is obtained by replacing the SVD in step 1 of the above algorithm with qiSVD.
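As a concrete illustration, the SVD-based CCA above can be sketched in a few lines of NumPy. This is a minimal sketch, assuming the input matrices are already mean-centered and have full column rank; the function and variable names are ours, not from the reference implementation.

```python
import numpy as np

def cca_via_svd(X, Y, n_components):
    """Sketch of CCA computed through SVD (X, Y assumed mean-centered).

    Returns the canonical variates, the weight matrices, and the
    canonical correlations for the two views.
    """
    # Step 1: SVD of each data matrix.
    Ux, sx, Vxt = np.linalg.svd(X, full_matrices=False)
    Uy, sy, Vyt = np.linalg.svd(Y, full_matrices=False)
    # Step 2: SVD of the product of the left singular vectors.
    U, s, Vt = np.linalg.svd(Ux.T @ Uy)
    U, V = U[:, :n_components], Vt[:n_components].T
    # Step 3: canonical variates and the weight matrices that extract them.
    Zx, Zy = Ux @ U, Uy @ V
    Wx = Vxt.T @ np.diag(1.0 / sx) @ U
    Wy = Vyt.T @ np.diag(1.0 / sy) @ V
    return Zx, Zy, Wx, Wy, s[:n_components]
```

The singular values of U_X^T U_Y are the canonical correlations, and by construction Zx equals X @ Wx, which is the identity exploited later when projecting new data.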

Quantum-inspired singular value decomposition
We explain qiSVD based on a previous study (17). For a given matrix A ∈ ℝ^(m×n), the standard SVD algorithm provides a low-rank (rank-K) approximation of the given matrix as U_K Σ_K V_K^T.
Here, Σ_K ∈ ℝ^(K×K) is the diagonal matrix whose (k, k)-entry is the k-th largest singular value, and U_K ∈ ℝ^(m×K) and V_K ∈ ℝ^(n×K) are the matrices whose columns are the left- and right-singular vectors corresponding to the top K singular values, respectively. qiSVD computes a "description" of V_K (17,29). A description of V_K is the set of a matrix R ∈ ℝ^(P×n) and vectors w_k ∈ ℝ^P (k = 1, 2, ⋯, K) that approximate V_K(:, k) by R^T w_k, where P is a parameter that controls the trade-off between computational time and approximation performance. Once a description of V_K is obtained, we can access any entry of the approximation of V_K in constant time regardless of the size of the input matrix. The qiSVD procedure for a given matrix A ∈ ℝ^(m×n) is as follows.
1. Sample a row index i with probability ‖A(i, :)‖^2 / ‖A‖_F^2. Repeat this P times and denote the results by i_1, i_2, ⋯, i_P.
2. Uniformly and randomly select an index from {i_1, i_2, ⋯, i_P}, and sample a column index j from the selected row with probability A(i, j)^2 / ‖A(i, :)‖^2. Repeat this P times and denote the results by j_1, j_2, ⋯, j_P.
3. Form the matrix R ∈ ℝ^(P×n) whose s-th row is the sampled row A(i_s, :) rescaled by its sampling probability, and form the matrix C ∈ ℝ^(P×P) whose t-th column is the sampled column R(:, j_t) rescaled by its sampling probability.
4. Perform SVD on C. Denote the top K left singular vectors by ŵ_1, ŵ_2, ⋯, ŵ_K and the corresponding singular values by σ_1, σ_2, ⋯, σ_K.
5. Orthonormalize {R^T ŵ_k} (k = 1, 2, ⋯, K) by applying the Gram-Schmidt process, and obtain a new set of vectors {w_k} (k = 1, 2, ⋯, K) such that {R^T w_k} is an orthonormal set (see Appendix for a complete explanation).

Output
R and w_k (k = 1, 2, ⋯, K) as a description of the right singular vectors of A.
The above algorithm can be performed in O(log(mn)) time when we use the input data structure explained in the next subsection. In steps 1 and 2 of the above algorithm, we perform random sampling, which can be interpreted as analogous to quantum computation. In quantum computation, a superposition of states is represented by a linear combination of the corresponding vectors, and we can sample each state with probability proportional to its squared coefficient in constant time regardless of the dimensionality of the vectors. Utilizing this, quantum computation provides quantum algorithms that compute several basic algebraic calculations (e.g., taking the inner product between two vectors) faster than their classical counterparts, which has led to several quantum machine learning algorithms (30)(31)(32)(33).
The original qiSVD algorithm proposed by Tang (2019) does not include step 5 of the above algorithm and does not provide an orthonormal set of vectors. The orthonormalization step was introduced to obtain an orthonormal set of vectors that approximates singular vectors used in CCA. The full explanation of this step is given in the appendix. The approximation performance of this algorithm has been investigated analytically in previous studies (17,29).
We also demonstrate the approximation performance of qiSVD numerically in the results section of this paper.
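The sampling-based procedure above can be sketched in dense NumPy. This is a simplified sketch in the style of the classical sampling scheme (FKV) that qiSVD builds on: it materializes sampling probabilities as dense vectors instead of binary trees, and it replaces the Gram-Schmidt step with an equivalent Cholesky-based orthonormalization; all function and variable names are ours.

```python
import numpy as np

def qi_svd(A, P, K, rng):
    """Sketch of qiSVD: returns R (P x n) and W (P x K) such that the
    columns of R.T @ W approximate the top-K right singular vectors of A."""
    m, n = A.shape
    frob2 = np.sum(A ** 2)
    # Step 1: sample P row indices with probability ||A(i,:)||^2 / ||A||_F^2.
    row_p = np.sum(A ** 2, axis=1) / frob2
    rows = rng.choice(m, size=P, p=row_p)
    # Rescale so that R^T R approximates A^T A in expectation.
    R = A[rows] / np.sqrt(P * row_p[rows])[:, None]
    # Step 2: pick a sampled row uniformly at random, then sample a column
    # index with probability A(i,j)^2 / ||A(i,:)||^2; repeat P times.
    cols = np.empty(P, dtype=int)
    for t in range(P):
        i = rows[rng.integers(P)]
        cols[t] = rng.choice(n, p=A[i] ** 2 / np.sum(A[i] ** 2))
    # Step 3: form the rescaled P x P submatrix C.
    col_p = np.sum(R ** 2, axis=0) / frob2   # marginal column probabilities
    C = R[:, cols] / np.sqrt(P * col_p[cols])[None, :]
    # Step 4: SVD of the small matrix C.
    U_c, s_c, _ = np.linalg.svd(C)
    B = U_c[:, :K] / s_c[:K]                 # candidate coefficient vectors
    # Step 5: orthonormalize {R^T b_k} in coefficient space; this Cholesky
    # factorization plays the role of the Gram-Schmidt step.
    M = B.T @ (R @ R.T) @ B
    W = B @ np.linalg.inv(np.linalg.cholesky(M)).T
    return R, W
```

Only the P x P matrix C is ever decomposed, which is why the cost of the decomposition itself does not depend on m or n.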

Data structure
Here, we introduce the data structure that we use for qiSVD and qiCCA. By using a binary search tree with N leaves for an N-dimensional vector v ∈ ℝ^N (Figure 1A), we can sample an index i from the categorical distribution that takes i (i = 1, 2, ⋯, N) with probability (v_i)^2 / ‖v‖^2 in O(log N) time, which is faster than naïve sampling algorithms without binary search trees, such as that adopted in SciPy (Figure 1B). To perform the sampling operations required in step 1 of qiSVD for an input matrix A ∈ ℝ^(m×n), we prepared a binary search tree with m leaves that stored the squared norms of the individual rows. This allows us to perform the required samplings in O(log m) time. To perform the sampling operations required in step 2, we prepared m binary search trees with n leaves that stored the individual rows of the matrix. In step 2, we uniformly and randomly selected an index from {i_1, i_2, ⋯, i_P} and then sampled a column index from the categorical distribution using the binary search tree corresponding to the row specified by the selected index. This is equivalent to the sampling required in step 2 of qiSVD and can be done in O(log n) time. A total of (m + 1) binary search trees were used to store the matrix. Variants of this binary tree data structure are described in the literature (17,34).
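A minimal version of this data structure can be written as a complete binary tree (segment tree) over the squared entries of the vector. This is a sketch for illustration; the `SampleTree` class and its method names are ours.

```python
import math
import random

class SampleTree:
    """Binary tree over the squared entries of a vector v.  Internal nodes
    store subtree sums, so a single root-to-leaf walk draws an index i with
    probability v_i^2 / ||v||^2 in O(log N) time."""

    def __init__(self, v):
        self.n = (1 << math.ceil(math.log2(len(v)))) if len(v) > 1 else 1
        self.tree = [0.0] * (2 * self.n)
        for i, x in enumerate(v):
            self.tree[self.n + i] = x * x
        for i in range(self.n - 1, 0, -1):
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def sample(self, rng=random):
        node, r = 1, rng.random() * self.tree[1]
        while node < self.n:        # descend: go left if r falls within the
            node *= 2               # left subtree's mass, otherwise go right
            if r >= self.tree[node]:
                r -= self.tree[node]
                node += 1
        return node - self.n

    def update(self, i, x):
        """Change v_i to x, refreshing the sums on the path to the root."""
        node = self.n + i
        self.tree[node] = x * x
        while node > 1:
            node //= 2
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]
```

The `update` method shows why the tree is preferable to precomputing a cumulative distribution: a single entry change costs O(log N) rather than O(N).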

Quantum-inspired canonical correlation analysis
For qiCCA, we assume a pair of input matrices X ∈ ℝ^(N×d1) and Y ∈ ℝ^(N×d2), where N is the number of samples and d1 and d2 indicate the numbers of input dimensions. X^T and Y^T are stored using the binary tree data structure explained in the previous subsection. In the qiCCA algorithm, a pair of descriptions that approximate the first K canonical variates and a pair of descriptions that approximate the linear weight matrices for the canonical variates are computed as follows.
1. Perform qiSVD on X^T. Denote the resultant description of the left singular vectors of X by R^(1) and w_1^(1), w_2^(1), ⋯, w_K^(1). Denote the indices sampled in the first step of qiSVD by i_1^(1), i_2^(1), ⋯, i_P^(1).
2. Perform qiSVD on Y^T. Denote the resultant description of the left singular vectors of Y by R^(2) and w_1^(2), w_2^(2), ⋯, w_K^(2). Denote the indices sampled in the first step of qiSVD by i_1^(2), i_2^(2), ⋯, i_P^(2).
3. Compute (w_1^(1), w_2^(1), ⋯, w_K^(1))^T R^(1) R^(2)T (w_1^(2), w_2^(2), ⋯, w_K^(2)) and perform SVD on it. Denote the results by U Σ V^T.
4. Output R^(1) and (w_1^(1), w_2^(1), ⋯, w_K^(1))U as a description of the first K canonical variates from X.
5. Output R^(2) and (w_1^(2), w_2^(2), ⋯, w_K^(2))V as a description of the first K canonical variates from Y.
6. Prepare the d1 × P binary matrix S^(1) whose (i, s)-entry is 1 if i_s^(1) = i and 0 otherwise. Prepare the d2 × P binary matrix S^(2) whose (i, s)-entry is 1 if i_s^(2) = i and 0 otherwise.
7. Output S^(1) and D^(1)(w_1^(1), w_2^(1), ⋯, w_K^(1))U, where D^(1) is the diagonal matrix of the rescaling factors used to form R^(1), as a description of the weight vectors to extract the canonical variates from X. Output S^(2) and D^(2)(w_1^(2), w_2^(2), ⋯, w_K^(2))V, where D^(2) is defined analogously for R^(2), as a description of the weight vectors to extract the canonical variates from Y.
The results of the above algorithm are equivalent to those of CCA when the input matrices are low-rank and qiSVD provides the exact left singular vectors. Thus, the qiCCA algorithm can be considered to approximate CCA. Estimating the approximation performance of this algorithm analytically is difficult; thus, we evaluate it numerically in the results section. In the subsequent numerical experiments, we set P^(1) = P^(2) = 1000, and K^(1) and K^(2) were set to 20% larger than the ranks of X and Y.
When we focus on the relationship between the computational cost of the algorithm and the input dimensionalities, the above algorithm takes O(log(d1 d2)) time. In addition, we can efficiently obtain the canonical variates for new data using the weight vectors derived from the above algorithm. To obtain the first K canonical variates for a new data matrix X_new ∈ ℝ^(N_new×d1), we calculate X_new S^(1) W̃^(1), where S^(1) is the sparse binary matrix and W̃^(1) is the coefficient matrix output as the weight description by the above algorithm. By utilizing the fact that S^(1) is a sparse matrix, we can compute each entry of X_new S^(1) W̃^(1) in constant time regardless of the input dimensionality (d1). Furthermore, while we primarily emphasize computational efficiency with respect to the input dimensionalities in this study, the computational time of the above algorithm with respect to the number of input samples (N) can also be reduced from O(N) to O(log N) by a randomized algorithm. In step 3 of qiCCA, we compute (w_1^(1), ⋯, w_K^(1))^T R^(1) R^(2)T (w_1^(2), ⋯, w_K^(2)), which requires computational time proportional to N because each row of R^(1) and R^(2) is an N-dimensional vector. By using the randomized algorithm to compute the inner product (explained in the Appendix), each entry of R^(1) R^(2)T can be approximately computed in O(log N) time.
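The following toy example illustrates why projecting new data is cheap: a binary selection matrix of this kind merely picks out the P sampled input dimensions, so the product is equivalent to fancy-indexing P columns of X_new, regardless of d1. All sizes and variable names here are hypothetical stand-ins, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
d1, P, K, n_new = 100000, 50, 3, 4           # hypothetical sizes
idx = rng.choice(d1, size=P, replace=False)  # stand-in for the sampled indices
W = rng.normal(size=(P, K))                  # stand-in coefficient matrix

X_new = rng.normal(size=(n_new, d1))

# Dense route: build the d1 x P binary selection matrix S explicitly.
S = np.zeros((d1, P))
S[idx, np.arange(P)] = 1.0
Z_dense = X_new @ S @ W

# Sparse route: multiplying by S just selects the P sampled columns of
# X_new, so the projection never touches the other d1 - P dimensions.
Z_sparse = X_new[:, idx] @ W

assert np.allclose(Z_dense, Z_sparse)
```

Each entry of the projection therefore costs O(PK) operations, independent of d1.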

Results
First, we evaluated the computational time and approximation performance of qiSVD, and then evaluated the computational time and performance of qiCCA. We also experimentally demonstrated that the computational efficiency of qiCCA enables us to perform CCA even after nonlinearly mapping original input data into high dimensional spaces, which provides a way to extract more correlations than linear CCA. These analyses were all performed using virtual machines on Amazon Elastic Compute Cloud (EC2). The instance type p2.xlarge was used when we performed the computation with deep CCA, and r5.12xlarge was used for all other experiments.
We evaluated the computational time of qiSVD. In our numerical experiment, the input matrix A ∈ ℝ^(m×n) was synthesized as A = BC, where B and C are (m × 100)- and (100 × n)-matrices whose entries were sampled from the standard normal distribution.
n was set to a fixed number (10000). The computational time of the conventional SVD and qiSVD algorithms was measured while changing the number of rows m of the input matrix (Figure 2A). The qiSVD algorithm was exponentially faster than the conventional SVD algorithm.
Next, we evaluated the approximation performance of qiSVD. Using the above input matrix, we computed ‖A − A V_K V_K^T‖_F / ‖A‖_F, where V_K is the matrix that consists of the (exact or approximated) right singular vectors corresponding to the top K singular values. This value measures the difference between the input matrix and its low-rank approximation obtained with V_K. We compared this value for conventional SVD and qiSVD while changing K (Figure 2B).
qiSVD showed results similar to those of the conventional SVD, indicating that qiSVD can provide a set of vectors that span a subspace similar to the subspace spanned by the exact right singular vectors.
We then evaluated the computational time and performance of qiCCA. In a numerical experiment, we prepared a pair of input matrices X ∈ ℝ^(N×d1) and Y ∈ ℝ^(N×d2) by following the standard assumption of probabilistic CCA (35): X = Z B_1 + 0.5 E_1 and Y = Z B_2 + 0.5 E_2, where Z ∈ ℝ^(N×K), B_1 ∈ ℝ^(K×d1), B_2 ∈ ℝ^(K×d2), E_1 ∈ ℝ^(N×d1), and E_2 ∈ ℝ^(N×d2) are matrices whose entries were sampled from the standard normal distribution. N and K were set to 10000 and 100, respectively. d1 and d2 were set to the same number and varied across experimental conditions. The computational time and extracted correlations of qiCCA were compared with those of conventional CCA (Figure 3).
For each of a pair of input matrices, we concatenated the raw input features and the products of the input features for all possible feature pairs and applied qiCCA. We refer to this method as "qiCCA + 2nd-order." In this section, the method in which qiCCA is applied only to the raw input features is referred to as "qiCCA + 1st-order." We compared the correlations extracted by qiCCA + 1st-order, qiCCA + 2nd-order, kernel CCA, and deep CCA using the MNIST dataset (36). The MNIST dataset has been used as a benchmark to test variants of CCA in several previous studies (27,28). Following the procedure used in a previous study (27), the pixel values of the left and right 14 columns of the 28 × 28 handwritten digit images were treated as a pair of input data. The number of raw input dimensions in each view is 392, and the number of input dimensions for qiCCA + 2nd-order is 77420, a number of dimensions that conventional CCA could not handle even after more than two days of computation in the same computational environment. Following the procedure used in the previous study (27), we fitted each CCA model using 50000 training samples and evaluated generalization performance on 10000 test samples.
For kernel CCA and deep CCA, the hyperparameters and the architecture were optimized using the same procedure as in the previous study (27). qiCCA + 2nd-order extracted more correlations than qiCCA + 1st-order and was comparable with kernel and deep CCA (Figure 4).
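The second-order feature map used for qiCCA + 2nd-order can be sketched as follows. This helper is ours, not the authors' code; for 392 raw dimensions it yields 392 + 392·393/2 = 77420 features, matching the count reported above (the pairwise products include the squares, i.e., the i = j pairs).

```python
import numpy as np
from itertools import combinations_with_replacement

def second_order_features(X):
    """Concatenate the raw features with the products of the input
    features for all possible feature pairs (second-order monomials,
    including squares)."""
    n, d = X.shape
    pairs = list(combinations_with_replacement(range(d), 2))
    prods = np.empty((n, len(pairs)))
    for c, (i, j) in enumerate(pairs):
        prods[:, c] = X[:, i] * X[:, j]
    return np.hstack([X, prods])
```

For a view with d raw dimensions, the output has d + d(d + 1)/2 columns, which is why conventional CCA quickly becomes infeasible while qiCCA, with its logarithmic dependence on the input dimensionality, is not affected much.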

Discussion
We developed qiCCA and numerically demonstrated its computational efficiency (Figure 3A) and performance (Figure 3B) compared to conventional CCA. We have also demonstrated that, by combining it with nonlinear mappings to high dimensional spaces, qiCCA extracted more correlations than linear CCA and was comparable to kernel and deep CCA on a benchmark dataset (Figure 4).
In an experiment with a benchmark dataset, we mapped the raw input data into high dimensional spaces by taking the second-order monomials. By using a complete set of basis functions, such as Fourier basis functions or monomials, we can approximate an arbitrary function as a linear combination of them, which enables us to extract nonlinear correlations by applying linear CCA. However, the number of monomials whose order is less than a given number, or the number of Fourier basis functions whose frequency is less than a given value, rapidly increases as the input dimensionality of the raw data increases, which makes it computationally infeasible to apply conventional CCA. The kernel trick can treat such nonlinear transformations implicitly. However, training kernel CCA requires computational time polynomial in the number of training samples, because it manipulates Gram matrices whose size grows with the number of samples. The computational efficiency of qiCCA and quantum-inspired computation would be helpful when training a model on a large number of training samples is required. Taken together, our results suggest that qiCCA is highly useful and that quantum-inspired computation has the potential to unlock a new field in which exponentially large dimensional data can be analyzed.

Appendix: Randomized algorithm for inner product computation
For a pair of vectors x, y ∈ ℝ^N, we can approximate the inner product ⟨x, y⟩ as follows.
1. Sample an index h with probability (x_h)^2 / ‖x‖^2. Repeat this P times. Denote the results by h_1, h_2, ⋯, h_P.

Output
(1/P) Σ_(p=1)^P ‖x‖^2 y_(h_p) / x_(h_p) as an approximation of the inner product.
By simple analysis, it can be confirmed that the mean and variance of the above output are ⟨x, y⟩ and {‖x‖^2 ‖y‖^2 − ⟨x, y⟩^2}/P, respectively. The error of the approximation is independent of the dimensionality of the vectors. When we use the binary search tree data structure explained in the Algorithm section for the input vectors, we can perform this randomized algorithm in O(log N) time.
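The Appendix estimator can be sketched in a few lines of NumPy (a minimal version; `inner_product_estimate` is our name, and the sampling here is done directly rather than through a binary tree):

```python
import numpy as np

def inner_product_estimate(x, y, P, rng):
    """Estimate <x, y> by sampling indices h with probability
    x_h^2 / ||x||^2 and averaging ||x||^2 * y_h / x_h over P draws."""
    sq = x ** 2
    norm2 = np.sum(sq)
    # Indices with x_h = 0 have zero probability, so y[h] / x[h] is safe.
    h = rng.choice(len(x), size=P, p=sq / norm2)
    return float(np.mean(norm2 * y[h] / x[h]))
```

Note that when y = x, every draw contributes exactly ‖x‖^2, so the estimator returns the squared norm with zero variance, consistent with the variance formula {‖x‖^2‖y‖^2 − ⟨x, y⟩^2}/P above.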