Learning multi-kernel multi-view canonical correlations for image recognition

In this paper, we propose a multi-kernel multi-view canonical correlations (M²CCs) framework for subspace learning. In the proposed framework, the input data of each original view are mapped into multiple higher dimensional feature spaces by multiple nonlinear mappings determined by different kernels. This enables M²CC to discover multiple kinds of useful information of each original view in the feature spaces. Within the framework, we further provide a specific multi-view feature learning method based on the direct summation kernel strategy, together with its regularized version. Experimental results on visual recognition tasks demonstrate the effectiveness and robustness of the proposed method.

Multiset canonical correlation analysis (MCCA) is a powerful technique for finding the linear correlations among multiple (more than two) high dimensional random vectors. It has been applied to various real-world applications such as blind source separation [3], functional magnetic resonance imaging (fMRI) analysis [4,5], remote sensing image analysis [6], and target recognition [7].
In recent years, generalizations of MCCA have attracted increasing attention and some impressive results have been obtained. Among all the extensions, an attractive direction is the nonlinear one. Bach and Jordan [8] proposed a kernel MCCA (KMCCA) method which minimizes the minimal eigenvalue of the correlation matrix of the projected univariate random variables. Later, Yu et al. [9] presented a weighted KMCCA¹ to extract low dimensional projections from heterogeneous datasets for data visualization and classification tasks. Recently, Rupnik and Shawe-Taylor [12] developed another KMCCA method directly based on the sum of correlations criterion [2], which can be regarded as a natural extension of kernel CCA (KCCA) [8,13] and has been demonstrated to be effective in cross-lingual information retrieval.
However, in practice KMCCA faces two main issues. The first is how to select the types and parameters of the kernels for good performance. Although the choice of kernel types and parameters can usually be made by cross validation [14], such methods incur expensive computational costs when handling a large number of candidate kernel types and parameters. Second, KMCCA is essentially a single-kernel-based learning method, i.e., it uses only one kernel function for each view. As pointed out in Ref. [15], a single kernel can only characterize some, but not all, of the geometrical structures of the original data. Thus, KMCCA does not sufficiently exploit the geometrical information hidden in each view, which may render it inapplicable to data with complex multi-view structures.

¹ Although the authors of Ref. [9] refer to their method as weighted multiple kernel CCA, it should be pointed out that "multiple kernel" there means using m kernel functions for all m views, i.e., only one kernel per view, rather than conventional multiple kernel learning in the popular literature [10,11].
Over the past few years, researchers have shown the necessity of considering multiple kernels rather than a single fixed kernel in practical applications; see, for example, Refs. [10,11,16,17]. Multiple kernel learning (MKL), proposed by Lanckriet et al. [10] in the context of support vector machines (SVM), refers to the process of learning the optimal combination of multiple pre-specified kernel matrices. Using the idea of MKL, Kim et al. [18] proposed to learn an optimal kernel over a given convex set of kernels for discriminant analysis, while Yan et al. [19] presented a non-sparse multiple kernel Fisher discriminant analysis, which imposes a general ℓp-norm regularization on the kernel weights. Lin et al. [20] generalized the framework of MKL to a set of manifold-based dimensionality reduction algorithms. These investigations have shown that learning performance can be significantly enhanced when multiple kernel functions or kernel matrices are considered.

This paper is an extended version of our previous work [21]. Compared with that work, in this paper we present a general multi-kernel multi-view canonical correlations (M²CCs) framework for joint image representation and recognition, and show its connections to other kernel learning based canonical correlation methods. In the proposed framework, the input data of each view are mapped into multiple higher dimensional feature spaces by implicit nonlinear mappings determined by different kernels. This enables M²CC to uncover multiple kinds of characteristics of each original view in the feature spaces. Moreover, the M²CC framework can be employed as a general platform for developing new multi-view feature learning algorithms. Based on the M²CC framework, we present an example algorithm for multi-view learning, and further suggest its regularized version, which avoids the singularity problem, prevents overfitting, and provides flexibility in real-world applications. In addition, more experiments are conducted to evaluate the effectiveness of the proposed method.

Kernel MCCA
KMCCA [12,22] can be considered not only as a nonlinear variant of MCCA, but also as a multi-view extension of KCCA. Specifically, given m views, let X^(i) = (x_1^(i), ..., x_n^(i)) ∈ R^{p_i×n} denote the data matrix of the ith view, containing p_i-dimensional sample vectors in its columns. Assume there is a nonlinear mapping φ_i for each view X^(i) which implicitly projects the original data into a higher dimensional feature space F_i. KMCCA aims to compute one set of projection vectors {α^(i) ∈ F_i}_{i=1}^{m} by the following optimization problem:
$$\max_{\alpha^{(1)},\ldots,\alpha^{(m)}}\ \sum_{1\le i<j\le m} \mathrm{corr}\big(\alpha^{(i)\mathrm{T}}\phi_i(X^{(i)}),\ \alpha^{(j)\mathrm{T}}\phi_j(X^{(j)})\big) \tag{1}$$
Note that we assume every φ_i(X^(i)) in Eq. (1) has been centered, i.e., Σ_{j=1}^{n} φ_i(x_j^(i)) = 0, i = 1, 2, ..., m. Details of the data centering process can be found in Ref. [23].
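The centering assumption above can be enforced entirely on the Gram matrix without explicit feature maps. As an illustrative sketch (our own, not the paper's code), assuming a kernel matrix K over n samples:

```python
import numpy as np

def center_gram(K):
    """Center a Gram matrix so the implicitly mapped data have zero
    mean in feature space: Kc = H K H with H = I - (1/n) 1 1^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

# Toy check: a centered Gram matrix has (numerically) zero row sums.
X = np.random.default_rng(0).normal(size=(5, 20))
K = X @ X.T                              # linear kernel on 5 samples
Kc = center_gram(K)
print(np.allclose(Kc.sum(axis=0), 0))    # True
```

The same trick applies to any of the kernel matrices used later, since centering commutes with the kernel evaluation only at the Gram-matrix level.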
Taking advantage of the following two equations:
$$\mathrm{cov}\big(\alpha^{(i)\mathrm{T}}\phi_i(X^{(i)}),\ \alpha^{(j)\mathrm{T}}\phi_j(X^{(j)})\big) = \alpha^{(i)\mathrm{T}}\phi_i(X^{(i)})\phi_j(X^{(j)})^{\mathrm{T}}\alpha^{(j)}$$
and
$$\mathrm{var}\big(\alpha^{(i)\mathrm{T}}\phi_i(X^{(i)})\big) = \alpha^{(i)\mathrm{T}}\phi_i(X^{(i)})\phi_i(X^{(i)})^{\mathrm{T}}\alpha^{(i)},$$
we can equivalently transform the optimization problem in Eq. (1) into the following:
$$\max_{\alpha^{(1)},\ldots,\alpha^{(m)}} \sum_{1\le i<j\le m} \alpha^{(i)\mathrm{T}}\phi_i(X^{(i)})\phi_j(X^{(j)})^{\mathrm{T}}\alpha^{(j)} \quad \text{s.t.}\ \alpha^{(i)\mathrm{T}}\phi_i(X^{(i)})\phi_i(X^{(i)})^{\mathrm{T}}\alpha^{(i)} = 1,\ i = 1, 2, \ldots, m. \tag{2}$$
Let α^(i) = φ_i(X^(i))β^(i) with β^(i) ∈ R^n. By means of the kernel trick [8,23], the problem in Eq. (2) can be reformulated as
$$\max_{\beta^{(1)},\ldots,\beta^{(m)}} \sum_{1\le i<j\le m} \beta^{(i)\mathrm{T}} K^{(i)} K^{(j)} \beta^{(j)} \quad \text{s.t.}\ \beta^{(i)\mathrm{T}} (K^{(i)})^{2} \beta^{(i)} = 1,\ i = 1, 2, \ldots, m, \tag{3}$$
where K^(i) = φ_i(X^(i))^T φ_i(X^(i)) ∈ R^{n×n} is the kernel Gram matrix determined by a certain kernel function.
Using the Lagrange multiplier technique, we can solve the problem in Eq. (3) via the following multivariate eigenvalue problem (MEP) [24]:
$$\sum_{j\ne i} K^{(i)} K^{(j)} \beta^{(j)} = \lambda_i (K^{(i)})^{2} \beta^{(i)}, \quad i = 1, 2, \ldots, m. \tag{4}$$
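In practice, the MEP above is often relaxed to a single generalized eigenvalue problem over the stacked dual vectors, with a common multiplier λ across views. The sketch below is our own illustration of that relaxation (the `eps` ridge is a hypothetical stabilizer, not part of the paper's formulation):

```python
import numpy as np
from scipy.linalg import eigh

def kmcca(Ks, eps=1e-6):
    """Relaxed KMCCA for m centered n-by-n Gram matrices Ks.
    Solves A beta = lambda B beta, where block (i, j) of A is
    K_i K_j (i != j) and B = diag(K_i^2 + eps I). Returns the dual
    vectors of each view, one column per generalized eigenvector,
    ordered by decreasing eigenvalue (i.e., decreasing correlation)."""
    m, n = len(Ks), Ks[0].shape[0]
    A = np.zeros((m * n, m * n))
    B = np.zeros((m * n, m * n))
    for i in range(m):
        B[i*n:(i+1)*n, i*n:(i+1)*n] = Ks[i] @ Ks[i] + eps * np.eye(n)
        for j in range(m):
            if i != j:
                A[i*n:(i+1)*n, j*n:(j+1)*n] = Ks[i] @ Ks[j]
    w, V = eigh(A, B)          # ascending generalized eigenvalues
    beta = V[:, ::-1]          # largest correlations first
    return [beta[i*n:(i+1)*n, :] for i in range(m)]

# Toy usage: three views of 8 samples with linear kernels.
rng = np.random.default_rng(0)
Ks = [X @ X.T for X in (rng.normal(size=(8, 5)) for _ in range(3))]
betas = kmcca(Ks)
print([b.shape for b in betas])  # [(8, 24), (8, 24), (8, 24)]
```

Note that A is block-symmetric because each (K_i K_j)^T = K_j K_i, so the symmetric solver `eigh` applies once B is made positive definite.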

Multi-kernel multi-view canonical correlations framework
In this section, we use the idea of MKL to build a multi-kernel multi-view canonical correlations (M²CCs) framework, in which each set of original data is mapped into multiple high dimensional feature spaces.

Motivation
As discussed in Section 1, on one hand, it is very time-consuming for KMCCA to choose appropriate kernel types and parameters for optimal performance in practical classification applications. Moreover, KMCCA employs only one kernel function for each of the multiple views; in essence, it is a single-kernel-based subspace learning method. This makes it difficult for KMCCA to discover multiple kinds of geometrical structure information of each original view in the higher dimensional Hilbert space. On the other hand, many studies [15,18-20] show that MKL can significantly improve learning performance for classification tasks and is capable of uncovering a variety of different geometrical structures of the original data. Moreover, MKL can help kernel-based algorithms relax the selection of kernel types and parameters. Motivated by these advantages of MKL, we consider multiple kernel functions for each original view and propose a multi-kernel multi-view canonical correlations framework for multi-view feature learning, which provides a unified formulation for a set of kernel canonical correlation methods. To the best of our knowledge, such an MKL framework for MCCA is novel.

Formulation
Suppose m-view features from the same n images are given as X^(i) = (x_1^(i), ..., x_n^(i)) ∈ R^{p_i×n}, where X^(i) denotes the data matrix of the ith view and p_i denotes the dimensionality of its samples. For each view X^(i), assume there are n_i ≥ 1 nonlinear mappings {φ_k^(i)}_{k=1}^{n_i} which implicitly map the original data X^(i) into n_i different higher dimensional feature spaces, respectively. Note that the number of nonlinear mappings, n_i, may differ across views. Let f_i denote an ensemble function of the n_i nonlinear mappings of the ith view, and write φ_{f_i}(X^(i)) = (f_i(x_1^(i)), ..., f_i(x_n^(i))), i = 1, 2, ..., m and k = 1, 2, ..., n. Let α^(i) be the projection axis of φ_{f_i}(X^(i)) in the feature space; then the M²CC framework can be defined as
$$\max_{\alpha^{(1)},\ldots,\alpha^{(m)}} g\big(\alpha^{(1)\mathrm{T}}\phi_{f_1}(X^{(1)}), \ldots, \alpha^{(m)\mathrm{T}}\phi_{f_m}(X^{(m)})\big) \quad \text{s.t.}\ \alpha^{(i)\mathrm{T}}\phi_{f_i}(X^{(i)})\phi_{f_i}(X^{(i)})^{\mathrm{T}}\alpha^{(i)} = 1, \tag{6}$$
where g is a multi-view correlation criterion function. Note that we assume each φ_{f_i}(X^(i)) has been centered. As can be seen from Eq. (6), many classical kernel canonical correlation methods can be subsumed into the M²CC framework by defining different multi-view correlation criteria and ensemble mappings {f_i}_{i=1}^{m}, if we impose that the number of nonlinear mappings in each view equals one, i.e., n_1 = n_2 = ... = n_m = 1. For example:
• Reduction to KMCCA. When the multi-view correlation criterion function g is defined as the sum of correlations between every pair of views, with f_i(x_k^(i)) = φ_i(x_k^(i)) for i = 1, 2, ..., m and k = 1, 2, ..., n, M²CC reduces to KMCCA. In other words, KMCCA can be viewed as a special case of M²CC.
• Reduction to KCCA. When m = 2 and the multi-view correlation criterion g is defined as the correlation between the two views, with f_i(x_k^(i)) = φ_i(x_k^(i)) for k = 1, 2, ..., n and i = 1, 2, M²CC becomes KCCA.
As a result, one can design new multi-view data learning algorithms by defining different multi-view correlation criterion functions and ensemble mappings {f_i}_{i=1}^{m}.

Example algorithm: direct sum based M²CC
In this section, we present a specific multi-view learning algorithm in which all nonlinear mappings for each view share the same weight. We also present its regularized version, which can prevent overfitting and avoid matrix singularity.

Model
Following the idea of the sum of correlations [12,22], our direct summation based M²CC model can be defined as
$$\max_{\alpha^{(1)},\ldots,\alpha^{(m)}} \sum_{1\le i<j\le m} \alpha^{(i)\mathrm{T}}\phi_{f_i}(X^{(i)})\phi_{f_j}(X^{(j)})^{\mathrm{T}}\alpha^{(j)} \quad \text{s.t.}\ \alpha^{(i)\mathrm{T}}\phi_{f_i}(X^{(i)})\phi_{f_i}(X^{(i)})^{\mathrm{T}}\alpha^{(i)} = 1. \tag{7}$$
Using the dual representation theorem, we have
$$\alpha^{(i)} = \phi_{f_i}(X^{(i)})\beta^{(i)}, \tag{8}$$
where β^(i) = (β_1^(i), ..., β_n^(i))^T ∈ R^n is referred to as the dual vector. With Eq. (8), the optimization problem in Eq. (7) can be reformulated as
$$\max_{\beta^{(1)},\ldots,\beta^{(m)}} \sum_{1\le i<j\le m} \beta^{(i)\mathrm{T}}\phi_{f_i}(X^{(i)})^{\mathrm{T}}\phi_{f_i}(X^{(i)})\phi_{f_j}(X^{(j)})^{\mathrm{T}}\phi_{f_j}(X^{(j)})\beta^{(j)} \quad \text{s.t.}\ \beta^{(i)\mathrm{T}}\big(\phi_{f_i}(X^{(i)})^{\mathrm{T}}\phi_{f_i}(X^{(i)})\big)^{2}\beta^{(i)} = 1. \tag{9}$$
As we can see, different ensemble mappings in Eq. (9) result in different models. Thus, in this paper we define these ensemble mappings as the direct sums
$$f_i(x) = \big(\phi_1^{(i)}(x)^{\mathrm{T}}, \phi_2^{(i)}(x)^{\mathrm{T}}, \ldots, \phi_{n_i}^{(i)}(x)^{\mathrm{T}}\big)^{\mathrm{T}}, \quad i = 1, 2, \ldots, m. \tag{10}$$
According to Eq. (10), the optimization problem in Eq. (9) can be further converted as
$$\max_{\beta^{(1)},\ldots,\beta^{(m)}} \sum_{1\le i<j\le m} \beta^{(i)\mathrm{T}}\Big(\sum_{k=1}^{n_i}\phi_k^{(i)}(X^{(i)})^{\mathrm{T}}\phi_k^{(i)}(X^{(i)})\Big)\Big(\sum_{l=1}^{n_j}\phi_l^{(j)}(X^{(j)})^{\mathrm{T}}\phi_l^{(j)}(X^{(j)})\Big)\beta^{(j)} \quad \text{s.t.}\ \beta^{(i)\mathrm{T}}\Big(\sum_{k=1}^{n_i}\phi_k^{(i)}(X^{(i)})^{\mathrm{T}}\phi_k^{(i)}(X^{(i)})\Big)^{2}\beta^{(i)} = 1. \tag{11}$$
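The computational consequence of the direct-sum choice in Eq. (10) is that inner products in the concatenated feature space reduce to sums of the individual kernels, so the combined Gram matrix of each view is simply the sum of its per-kernel Gram matrices. A minimal sketch (our own toy check, using explicit feature maps to verify the identity):

```python
import numpy as np

def direct_sum_gram(grams):
    """Gram matrix of the direct-sum (stacked) feature map: stacking
    phi_1, ..., phi_k makes inner products add, so the combined Gram
    matrix is the sum of the per-kernel Gram matrices."""
    return np.sum(grams, axis=0)

# Toy check with explicit feature maps: phi(x) = [x, x^2] yields the
# same Gram matrix as summing the linear and squared kernels.
x = np.array([[1.0], [2.0], [3.0]])
K_lin = x @ x.T
K_sq = (x**2) @ (x**2).T
Phi = np.hstack([x, x**2])
print(np.allclose(Phi @ Phi.T, direct_sum_gram([K_lin, K_sq])))  # True
```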

Algorithmic derivation
To solve the optimization problem in Eq. (11), we define, via the kernel trick [21], K_k^(i) = φ_k^(i)(X^(i))^T φ_k^(i)(X^(i)) ∈ R^{n×n}, where K_k^(i) denotes the kernel matrix corresponding to the kth nonlinear mapping in the ith view, k = 1, 2, ..., n_i. Now, the problem in Eq. (11) can be formulated equivalently as
$$\max_{\beta^{(1)},\ldots,\beta^{(m)}} \sum_{1\le i<j\le m} \beta^{(i)\mathrm{T}}\Big(\sum_{k=1}^{n_i}K_k^{(i)}\Big)\Big(\sum_{l=1}^{n_j}K_l^{(j)}\Big)\beta^{(j)} \quad \text{s.t.}\ \beta^{(i)\mathrm{T}}\Big(\sum_{k=1}^{n_i}K_k^{(i)}\Big)^{2}\beta^{(i)} = 1. \tag{12}$$
Let us denote
$$K^{(ij)} = \Big(\sum_{k=1}^{n_i}K_k^{(i)}\Big)\Big(\sum_{l=1}^{n_j}K_l^{(j)}\Big), \quad i, j = 1, 2, \ldots, m. \tag{13}$$
By the Lagrange multiplier technique, we can solve the problem in Eq. (12) via the following generalized eigenvalue problem:
$$\begin{pmatrix} 0 & K^{(12)} & \cdots & K^{(1m)} \\ K^{(21)} & 0 & \cdots & K^{(2m)} \\ \vdots & \vdots & \ddots & \vdots \\ K^{(m1)} & K^{(m2)} & \cdots & 0 \end{pmatrix} \begin{pmatrix} \beta^{(1)} \\ \beta^{(2)} \\ \vdots \\ \beta^{(m)} \end{pmatrix} = \lambda\, \mathrm{diag}\big(K^{(11)}, K^{(22)}, \ldots, K^{(mm)}\big) \begin{pmatrix} \beta^{(1)} \\ \beta^{(2)} \\ \vdots \\ \beta^{(m)} \end{pmatrix}. \tag{14}$$
It is clear that the objective function in Eq. (12) can be maximized directly by computing the eigenvectors of eigen-equation (14). Thus, we choose the eigenvectors corresponding to the first d largest eigenvalues as the dual solution vectors of our method. Once the dual solution vectors are obtained, we can perform multi-view feature extraction for a given multi-view observation (x^(1), ..., x^(m)) as
$$y^{(i)} = \beta^{(i)\mathrm{T}} \sum_{j=1}^{n_i} \big(k_j^{(i)}(x_1^{(i)}, x^{(i)}), \ldots, k_j^{(i)}(x_n^{(i)}, x^{(i)})\big)^{\mathrm{T}},$$
with k_j^(i) denoting the jth kernel function in the ith view, i = 1, 2, ..., m.
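The per-view feature extraction step can be sketched as follows; this is our own illustration with hypothetical helper names (`project_view`, toy lambdas for the kernels), and it omits the centering of the test kernel vector for brevity:

```python
import numpy as np

def project_view(beta, kernel_fns, X_train, x_new):
    """Project a new sample of one view onto the learned subspace.
    beta: (n, d) dual vectors for this view; kernel_fns: the view's
    kernel functions; X_train: (n, p) training data of this view."""
    # Summed kernel evaluations between x_new and all training samples.
    kvec = sum(np.array([k(xi, x_new) for xi in X_train]) for k in kernel_fns)
    return beta.T @ kvec            # d-dimensional extracted feature

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))                       # 6 training samples
beta = rng.normal(size=(6, 2))                    # 2 dual vectors
lin = lambda a, b: a @ b
rbf = lambda a, b: np.exp(-np.sum((a - b)**2) / 2.0)
y = project_view(beta, [lin, rbf], X, rng.normal(size=4))
print(y.shape)  # (2,)
```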

Regularization
In real-world applications, the matrix diag(K^(11), K^(22), ..., K^(mm)) in Eq. (14) may be singular. In that case, the classical algorithm cannot be directly used to solve the generalized eigenvalue problem. Thus, to avoid the singularity and prevent overfitting, we build a regularized version by adding penalty terms λ_i‖α^(i)‖² to the constraints, which leads to the following generalized eigenvalue problem:
$$\begin{pmatrix} 0 & K^{(12)} & \cdots & K^{(1m)} \\ K^{(21)} & 0 & \cdots & K^{(2m)} \\ \vdots & \vdots & \ddots & \vdots \\ K^{(m1)} & K^{(m2)} & \cdots & 0 \end{pmatrix} \begin{pmatrix} \beta^{(1)} \\ \vdots \\ \beta^{(m)} \end{pmatrix} = \lambda\, \mathrm{diag}\Big(K^{(11)} + \lambda_1 \sum_{k=1}^{n_1} K_k^{(1)},\ \ldots,\ K^{(mm)} + \lambda_m \sum_{k=1}^{n_m} K_k^{(m)}\Big) \begin{pmatrix} \beta^{(1)} \\ \vdots \\ \beta^{(m)} \end{pmatrix}, \tag{17}$$
where {λ_i}_{i=1}^{m} are the regularization parameters and ‖·‖ denotes the 2-norm of vectors.
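To see why the regularization helps, note that duplicated (or linearly dependent) samples make a Gram matrix, and hence its square, rank deficient. A small sketch of this effect (our own; `lam` and `eps` are hypothetical illustrative values):

```python
import numpy as np

# A rank-deficient Gram matrix (duplicated sample) makes K^2 singular;
# the regularized block K^2 + lam*K + eps*I is positive definite, so
# the generalized eigenvalue problem becomes solvable.
X = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # duplicated row
K = X @ X.T
lam, eps = 1e-2, 1e-8
Kii = K @ K
Kii_reg = Kii + lam * K + eps * np.eye(3)
print(np.linalg.matrix_rank(Kii) < 3)                 # True: singular
print(np.all(np.linalg.eigvalsh(Kii_reg) > 0))        # True: positive definite
```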
If the singularity or overfitting problem occurs, or an application needs to control the flexibility of the proposed method, we can use Eq. (17) instead of Eq. (14) to calculate the dual vectors {β^(i)}_{i=1}^{m}.

Experiments
In this section, we conduct two face recognition experiments to test the performance of the proposed method on the well-known AT&T and Yale databases. Moreover, we compare the proposed method with kernel PCA (KPCA) and KMCCA to show its effectiveness. In all experiments, nearest neighbor (NN) classifiers with Euclidean distance and cosine distance metrics are used for the recognition tasks.

Candidate kernels
In our experiments, we adopt three views in total from the same face images, and for each view we use three kinds of kernel functions in our proposed method: a linear kernel, an RBF kernel as used in Ref. [15], and polynomial kernels. In KMCCA, we use the above three kinds of kernels with the same parameters, i.e., the linear kernel for the first view, the RBF kernel for the second, and the polynomial kernel for the last.
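For reference, the three kernel families can be computed as Gram matrices as follows. This is a generic sketch with hypothetical parameter choices (`sigma`, `degree`); the paper's exact settings are tuned per experiment and are not reproduced here:

```python
import numpy as np

def linear_kernel(X):
    """Linear kernel: k(x, y) = x^T y."""
    return X @ X.T

def rbf_kernel(X, sigma=1.0):
    """RBF (Gaussian) kernel: k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * (X @ X.T)
    return np.exp(-d2 / (2 * sigma**2))

def poly_kernel(X, degree=2):
    """Homogeneous polynomial kernel: k(x, y) = (x^T y)^degree."""
    return (X @ X.T) ** degree

X = np.random.default_rng(2).normal(size=(4, 3))
for K in (linear_kernel(X), rbf_kernel(X), poly_kernel(X, 3)):
    print(K.shape, np.allclose(K, K.T))  # (4, 4) True, for each kernel
```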
In addition, for a fair comparison with KMCCA and our proposed method, we perform KPCA by first stacking the three views into a single view and then applying one of the above-described kernels.

Compared methods
To demonstrate how recognition performance can be improved by our method, we compare the following nine methods:
• KPCA_Lin, which uses a linear kernel.
• KPCA_RBF, which uses an RBF kernel.
• KPCA_PolA, which uses a polynomial kernel of order A, where A takes 2, 3, and 4 respectively.
• KMCCA_PolA, where one of the three views uses the polynomial kernel of order A, and A takes 2, 3, and 4 respectively.
• Our method, the new one proposed in this paper.

Experiment on the AT&T database
The AT&T database contains 400 face images of 40 persons, with 10 grayscale images per person at a resolution of 92 × 112. For some persons, the images were taken at different times. The lighting, facial expressions, and facial details also vary. The images were taken with a tolerance for tilting and rotation of the face of up to 20 degrees, and with scale variation of up to about 10%. Ten images of one person are shown in Fig. 1.
In this experiment, we employ the same preprocessing technique as used in Refs. [25][26][27] to obtain three-view data. That is, we first perform Coiflets, Daubechies, and Symlets orthonormal wavelet transforms to obtain three sets of lowfrequency sub-images (i.e., three views) from original face images, respectively. Then, the K-L transform is employed to reduce the dimensionality of each view to 150. The final formed three views, each with 150 dimensions, are used in our experiment.
In this experiment, N images (N = 4, 5, 6, and 7) per person are randomly chosen for training, while the remaining 10−N images are used for testing. For each N, we perform 10 independent recognition tests to evaluate the performance of KPCA, KMCCA, and our method. Tables 1-4 show the average recognition rates of each method under the NN classifiers with Euclidean and cosine distances, together with the corresponding standard deviations. From Tables 1-4, we can see that the proposed method outperforms KMCCA and the baseline algorithm KPCA, regardless of how many training samples per person are used. The improvement is especially pronounced when fewer training samples are available. On the whole, KMCCA achieves better recognition results than KPCA. Moreover, KPCA with the RBF kernel performs better than with the linear and polynomial kernels.

Experiment on the Yale database
The Yale database [28] contains 165 grayscale images of 15 persons. Each person has 11 images with different facial expressions and lighting conditions, i.e., center-light, with glasses, happy, left-light, without glasses, normal, right-light, sad, sleepy, surprised, and wink. Each image is cropped and resized to 100 × 80 pixels. Figure 2 shows the eleven images of one person. In this experiment, the Coiflets, Daubechies, and Symlets wavelet transforms are again applied to the original face images to form three-view data, and the dimensionality of each view is reduced to 75 by the K-L transform. For each person, five images are randomly selected for training and the remaining six for testing; thus, there are 75 training samples and 90 testing samples in total. Ten-run tests are performed to examine the recognition performance of each method. Figures 3 and 4 show the average recognition results of each method under the Euclidean and cosine NN classifiers. As can be seen from Figs. 3 and 4, the proposed method is superior to KPCA and KMCCA. KPCA performs the worst, and the RBF kernel is again more effective than the other kernels in KPCA. These conclusions are overall consistent with those drawn in Section 5.3.

Conclusions
In this paper, we have proposed an M²CC framework for multi-view image recognition. The central idea of M²CC is to map each of multiple views into multiple higher dimensional feature spaces by multiple nonlinear mappings determined by different kernels. This enables M²CC to discover multiple kinds of useful information of each original view in the feature spaces. In addition, the M²CC framework can be used as a general platform for developing new algorithms related to both MKL and MCCA. As an example, we have proposed a new specific multi-view feature learning algorithm in which all the nonlinear mappings for each view are treated equally. Two face recognition experiments demonstrate the effectiveness of our method.
Quan-Sen Sun received his Ph.D. degree in pattern recognition and intelligent systems from Nanjing University of Science and Technology (NUST), China, in 2006. He is a professor in the Department of Computer Science at NUST. He visited the Department of Computer Science and Engineering, the Chinese University of Hong Kong, in 2004 and 2005. He has published more than 100 scientific papers. His current interests include pattern recognition, image processing, remote sensing information systems, and medical image analysis.
Open Access The articles published in this journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http:// creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.