Abstract

Recently, cross-view feature learning has been a hot topic in machine learning due to the wide applications of multiview data. Nevertheless, the distribution discrepancy between views means that instances of the same class from different views can lie farther apart than instances of different classes within the same view. To address this problem, in this paper we develop a novel cross-view discriminative feature subspace learning method inspired by human layered visual perception. Firstly, the proposed method utilizes a separable low-rank self-representation model to disentangle the class and view structure layers. Secondly, a local alignment is constructed with two designed graphs to guide the subspace decomposition in a pairwise way. Finally, a global discriminative constraint on the distribution centers in each view is designed to further improve the alignment. Extensive cross-view classification experiments on several public datasets show that our proposed method is more effective than existing feature learning methods.

1. Introduction

Against the background of modern technology, many artificial intelligence methods are inspired by nature, such as machine learning [1–4], reinforcement learning [5], and artificial immune recognition [6]. Among them, machine learning can effectively deal with image recognition problems. However, some studies have indicated that the adaptive ability of traditional machine learning drops sharply when the learned images have a large distribution discrepancy, as in cross-view data [2]. This discrepancy means that the data variance across the view space is larger than the variance across the class space, so that view differences become the major factor affecting recognition. Therefore, in this paper we mainly focus on cross-view subspace learning to deal with distribution discrepancy problems.

In recent years, subspace learning (SL) has made great contributions to the field of machine learning and has wide applications in computer vision, data mining, and so on [7–21]. One of the most typical methods is principal component analysis (PCA) [22], which uses an orthogonal transformation to reduce dimensionality while preserving the unique information (principal components) of the data. However, PCA is an unsupervised dimensionality reduction method, which disregards the discriminative information attached to the semantic component. Hence, a supervised dimensionality reduction method, linear discriminant analysis (LDA), which uses the semantic component, was proposed in [23]. LDA learns a supervised linear combination to adjust the spatial dispersion, but it tends to overfit when processing noisy data. To overcome corruption, rank minimization techniques have attracted attention in recent years. Candès et al. enforced low-rank and sparse constraints to eliminate corrupted information in the data [24]. After that, low-rank representation (LRR) was proposed to recover clean data through dictionary representation over multiple subspaces [7]. In the last decade, the LRR model has achieved satisfactory results in various fields [8–16, 25–35], such as domain adaptation [8], clustering [9], transfer learning [25], and low-rank texture structure [26]. However, Liu et al. pointed out that the dictionary of LRR may fail when the data are insufficient. To solve this problem, latent LRR (LatLRR) was proposed to enhance subspace learning with latent information [10]. However, LatLRR is still unsupervised. Inspired by LRR models and LDA, Li et al. unified the linear discriminant constraint and low-rank representation into subspace learning to enhance the learned low-dimensional features [11].
Afterwards, the low-rank embedding (LRE) proposed in [12] provides a robust embedding subspace learning framework that suppresses reconstruction errors by adopting an ℓ2,1-norm constraint on the projection residual. The latent low-rank and sparse embedding (LLRSE) developed in [13] builds on LRE by additionally introducing a reconstructed orthogonal matrix, which makes the projection space contain more unique features.

Recently, many algorithms dedicated to cross-view feature learning based on the above methods have been developed [27–31]. The low-rank common subspace (LRCS) method proposed in [27] finds a view-common subspace by LRR. Nevertheless, LRCS only considers the view label and ignores the discriminative power of the class label. From the perspective of multiview data structure, a supervised subspace learning method, namely multiview discriminant analysis (MVDA), was proposed in [28], using the discriminative information in the different views. After that, the multiview manifold learning with locality alignment (MVML-LA) framework proposed in [29] provides a discriminative low-dimensional latent space. Most recently, robust cross-view learning (RCVL) was designed to learn a common view-invariant discriminative subspace by adopting a novel rank minimization technique [30]. However, RCVL ignores the global discriminative information.

Human visual perception works through the visual circuit, whose function is to understand visual signals in a layer-wise way: our brain uses small feature extractors (the layers of the visual circuit) to obtain simple features from the complex real signal. Drawing inspiration from this layered processing in the human visual system, we represent cross-view data in two different structure layers, a class structure layer and a view structure layer, in order to construct a view-consistent feature learning model. Hence, we design two novel discriminative alignment constraints from simultaneous local and global viewpoints, which not only disentangle the class and view layers but also bridge the gap in cross-view data. Our contributions are as follows:
(1) A dual low-rank representation model is set up to discover the two latent structures in cross-view data, the view and class structures, respectively. These two distinct structures are conducive to discovering potential features for the cross-view classification task.
(2) A local alignment constraint based on two designed local graphs is utilized to transfer the neighbouring relationships between each pair of samples into the learned subspace. This constraint makes the view and class structures separate effectively.
(3) A global alignment constraint designed in our framework further reduces the view discrepancy. The projected samples from the view and class subspaces are used to compose the discriminative constraint in the global alignment by enforcing constraints on the mean distance between classes in different views.

Figure 1 illustrates how to learn an aligned subspace in which samples have a large between-class distance and a small within-class distance from both the class and the view perspectives. The remainder of this paper is organized as follows. Section 2 briefly reviews the baselines of our work. Section 3 presents the proposed model and its solution process. Section 4 reports the comparison and parameter experiments. Finally, Section 5 concludes this paper.

2. Related Works

Our method is closely connected with the following two methods: (1) low-rank representation and (2) linear discriminant analysis.

2.1. Low-Rank Representation

LRR is robust to errors and can explore the underlying structure of data. Suppose that X = [x_1, x_2, …, x_n] ∈ R^{d×n} is a matrix of natural data from c classes. The model of LRR can be expressed as

min_{Z,E} ‖Z‖_* + λ‖E‖_{2,1}   s.t.   X = XZ + E,   (1)

where Z is the low-rank linear-combination coefficient matrix of the data X. The matrix E with the ℓ2,1-norm models the sample-specific corruption found in real data, and λ is used to balance the level of corruption. Therefore, LRR, which simulates corrupted data within a representation framework, is a feasible technique for handling cross-view data.
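As a concrete illustration, in the noiseless case (E = 0, with X itself as the dictionary) the LRR problem has a well-known closed-form minimizer, the shape interaction matrix built from the right singular vectors of X. A minimal NumPy sketch (function and variable names are ours, not the paper's):

```python
import numpy as np

def lrr_noiseless(X):
    """Closed-form noiseless LRR: min ||Z||_* s.t. X = XZ.
    With X itself as the dictionary, the minimizer is the shape
    interaction matrix Vr @ Vr.T, where X = U S Vr.T is the
    skinny SVD of X restricted to the nonzero singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = int(np.sum(s > 1e-10))     # numerical rank of X
    Vr = Vt[:r].T                  # n x r right singular vectors
    return Vr @ Vr.T               # n x n low-rank coefficient matrix

# toy check: columns drawn from two rank-1 subspaces
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 1)) @ rng.standard_normal((1, 4))
B = rng.standard_normal((5, 1)) @ rng.standard_normal((1, 4))
X = np.hstack([A, B])
Z = lrr_noiseless(X)               # satisfies X @ Z == X exactly
```

The resulting Z is symmetric and reconstructs X without error, which is why LRR-style coefficients are a natural starting point for the representation models below.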

2.2. Linear Discriminant Analysis

LDA is a well-known supervised method for dimensionality reduction, constrained by discriminative semantic information. The main principle of LDA is to find a discriminative subspace with the largest interclass variance and the smallest intraclass variance. Assume that the n samples from c classes are {(x_i, y_i)}_{i=1}^n, where x_i represents a sample and y_i its label. In addition, m and m_k denote the center of all samples and of the samples belonging to the kth class, respectively. Hence, the between-class scatter S_b and the within-class scatter S_w are as follows:

S_b = Σ_{k=1}^{c} n_k (m_k − m)(m_k − m)^T,   S_w = Σ_{k=1}^{c} Σ_{y_i = k} (x_i − m_k)(x_i − m_k)^T,   (2)

where n_k is the number of samples of the kth class. Therefore, by the Fisher discriminant criterion, the generalized Rayleigh quotient can be described as

max_W tr(W^T S_b W) / tr(W^T S_w W),   (3)

where W is a projection matrix and tr(·) denotes the trace operator. Furthermore, the solution of equation (3) is relatively complicated, so we transform it into the following trace-difference problem [36]:

max_W tr(W^T (S_b − λ S_w) W).   (4)
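The scatter construction above can be sketched in a few lines of NumPy. The small regularizer added to S_w and the toy data are our own choices for numerical stability and illustration, not part of the original method:

```python
import numpy as np

def lda_fit(X, y, dim):
    """Fisher LDA sketch: build Sb and Sw from d x n data X with
    labels y, then take the top eigenvectors of inv(Sw) @ Sb."""
    d, _ = X.shape
    mu = X.mean(axis=1, keepdims=True)            # global center m
    Sb = np.zeros((d, d)); Sw = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[:, y == c]
        mc = Xc.mean(axis=1, keepdims=True)       # class center m_k
        Sb += Xc.shape[1] * (mc - mu) @ (mc - mu).T
        Sw += (Xc - mc) @ (Xc - mc).T
    # generalized eigenproblem Sb w = lambda Sw w (lightly regularized)
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(d), Sb))
    order = np.argsort(-evals.real)
    return evecs[:, order[:dim]].real

# two well-separated 2-D classes project onto a discriminative line
X = np.array([[0., 1., 0., 5., 6., 5.],
              [0., 0., 1., 5., 5., 6.]])
y = np.array([0, 0, 0, 1, 1, 1])
W = lda_fit(X, y, 1)
```

On this toy set the learned direction separates the two class means by a large margin in the 1-D projection.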

LDA not only retains as much information for recognition as possible while reducing dimensionality but also removes superfluous and dependent features that are detrimental to the classification task. Nevertheless, due to the distribution of cross-view data, the performance of LDA, which considers only class semantic information, is not outstanding.

3. Our Proposed Method

This section contains four parts. The first part specifies the symbols used in our algorithm. The second part gives a detailed introduction to our framework. The third develops a numerical scheme to obtain approximate solutions iteratively. The last part discusses the computational complexity of the proposed algorithm.

3.1. Notations

Assume that X1 ∈ R^{d×n} and X2 ∈ R^{d×n} are two matrices of different views from the same classes, where n and d denote the number and dimensionality of all training samples, respectively, and X = [X1, X2]. The class structure Zc and the view structure Zv are two linear combination matrices, which are included in the local graph framework to discover the view-invariant structure of cross-view data. P ∈ R^{d×p} is a basis transformation (projection) matrix, where p is the dimensionality of the projected data. E is a matrix of error data designed to obtain a subspace robust to noise. In addition, G1, G2, H1, and H2 are four constant coefficient matrices utilized for aligning the global information of cross-view data.

3.2. Objective Function

To address cross-view discriminative analysis, we formulate our subspace learning model with simultaneous local and global alignments as the sum of three terms: the first is a low-rank framework that models the two potential manifolds in cross-view data; the second enforces the view-specific discriminative local neighbour relationships among instances; and the third performs the discriminative clustering and separation of the global structure in the class manifold and the view structure, respectively. In the following, these terms are described in detail.

3.2.1. Dual Low-Rank Representations

Methods based on the low-rank model use only one linear combination matrix, constrained by rank minimization. Nevertheless, a unitary low-rank structure gives rise to the failure of linear representation due to the distribution differences between views, which endow cross-view data from the same class with a large divergence. Therefore, two structure matrices, Zc for the class structure and Zv for the view structure, are adopted to handle this problem, in which between-view samples from the same class lie far apart while within-view samples from different classes lie closer. The first term is defined with dual low-rank representations to disentangle the class and view structures as follows:

min_{Zc, Zv, E} ‖Zc‖_* + ‖Zv‖_* + λ‖E‖_{2,1}   s.t.   X = XZc + XZv + E,

where ‖·‖_* denotes the nuclear norm, a convex relaxation of the rank minimization problem whose solution is relatively convenient. Assuming that part of real-world data contains corruption, we adopt the ℓ2,1-norm to make the matrix E exhibit the structured sparsity of noisy data; the ℓ2,1-norm can effectively remove the corruption of specific samples while keeping the other clean samples intact. λ is used to balance the corruption.

3.2.2. Graph-Based Discriminative Local Alignment

To introduce the local discriminative constraint, two graph-based constraints are constructed on each pair of synthetic samples, P^T X Zc and P^T X Zv, from the class and view subspaces, respectively, which can better cluster intraclass samples and disperse interclass ones. Here, P^T X zc_i and P^T X zv_i denote the ith projected samples of the cross-view data from the class space and the view space; correspondingly, P^T X zc_j and P^T X zv_j denote the jth projected samples. Ww and Wb denote graph weight matrices defined as follows: (Ww)_{ij} = 1 if l_i = l_j and x_i belongs to the k-nearest-neighbour set of x_j, and 0 otherwise; (Wb)_{ij} = 1 if l_i ≠ l_j and x_i belongs to the k-nearest-neighbour set of the same-view sample x_j, and 0 otherwise, where l_i and l_j are the labels of samples x_i and x_j. Hence, Ww weighs the distance between two similar samples from the same class, and Wb weighs the distance between two similar samples from different classes. With the help of the Fisher criterion, the pairwise local discriminative constraint can be rewritten in trace form as

tr(P^T X Zc Lw Zc^T X^T P) − β tr(P^T X Zv Lb Zv^T X^T P),

where Lw and Lb denote the graph Laplacians of Ww and Wb and β is a balance parameter. Minimizing this trace difference weakens the impact of view information and separates the class structure from the view structure.
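The two graphs can be sketched in NumPy as follows. The exact neighbourhood and weighting rules here are a hypothetical reading of the paper's definitions (which are only partially recoverable from the text), and the function name is ours:

```python
import numpy as np

def knn_graphs(X, labels, k=2):
    """Build two binary kNN graphs on the columns of X: Ww links
    k-nearest neighbours that share a class label, Wb links k-nearest
    neighbours with different labels. Returns both graphs and their
    (unnormalized) graph Laplacians L = D - W."""
    n = X.shape[1]
    D = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)  # pairwise distances
    Ww = np.zeros((n, n)); Wb = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(D[i])
        for j in order[order != i][:k]:            # k nearest, excluding self
            if labels[i] == labels[j]:
                Ww[i, j] = Ww[j, i] = 1.0
            else:
                Wb[i, j] = Wb[j, i] = 1.0
    Lw = np.diag(Ww.sum(axis=1)) - Ww
    Lb = np.diag(Wb.sum(axis=1)) - Wb
    return Ww, Wb, Lw, Lb

# two tight clusters: same-class neighbours land in Ww only
X = np.array([[0., 0.1, 5., 5.1],
              [0., 0.,  0., 0. ]])
labels = np.array([0, 0, 1, 1])
Ww, Wb, Lw, Lb = knn_graphs(X, labels, k=1)
```

By construction each Laplacian row sums to zero, which is what makes the trace-difference constraint above a sum of weighted pairwise distances.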

3.2.3. Discriminative Global Alignment

The discriminative projection of all pairwise samples through the local term can reduce the impact between views, but the differences between learned features from different classes are still not significant enough. Hence, to further enhance the separation of the two manifolds, we design a global discriminative constraint for cross-view analysis as the third term. It is built from the within-class and between-view scatter matrix in the class manifold and the within-view and between-class scatter matrix in the view manifold. These scatter matrices are formulated from the mean features, where m^(v) denotes the overall mean feature of the vth view and m_k^(v) denotes the mean feature of the kth class from the vth view. In this way, the within-class view margin in the class structure is reduced, and the margin of between-class data from the same view in the view structure is magnified. The third term is framed with two kinds of coefficient matrices, G^(v) for the within-class mean features of the vth view and H^(v) for the overall mean feature of the vth view. In detail, (G^(v))_{ij} = 1/n_k^(v) only if x_i and x_j both belong to the kth class of the vth view, where n_k^(v) denotes the number of samples of the kth class from the vth view; in other cases, (G^(v))_{ij} = 0. (H^(v))_{ij} = 1/n^(v) only if x_i and x_j both belong to the vth view, where n^(v) denotes the number of samples from the vth view; in other cases, (H^(v))_{ij} = 0. γ is a trade-off parameter. This term achieves global alignment through the mean vectors of the joint synthetic samples from the global representation and further enforces the view-invariance constraint on the same class.
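The mean-coefficient matrices can be illustrated with a small NumPy sketch. The function name and the exact indexing convention (entries 1/n_c on same-class pairs, so that right-multiplication replaces every sample by its class mean) are our assumptions about the construction described above:

```python
import numpy as np

def mean_coefficient_matrix(labels):
    """Hypothetical coefficient matrix G: entry (i, j) is 1/n_c
    whenever samples i and j share class c, and 0 otherwise, so
    that X @ G replaces every column of X by its class mean."""
    labels = np.asarray(labels)
    n = labels.size
    G = np.zeros((n, n))
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        G[np.ix_(idx, idx)] = 1.0 / idx.size
    return G

X = np.array([[0., 2., 10., 12.]])        # 1-D toy "view"
G = mean_coefficient_matrix([0, 0, 1, 1]) # class means are 1 and 11
```

An analogous matrix with a single "class" per view plays the role of H^(v), producing the overall view mean.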

In addition, we add an orthogonal constraint P^T P = I to avoid trivial solutions. In the end, we rewrite equation (5) with all the terms as equation (13).

3.3. Optimization Scheme

To obtain feasible solutions for Zc and Zv, we adopt two auxiliary variables, J = Zc and S = Zv. Then, equation (13) can be transformed into the following separable form:

For optimization problems with equality constraints, the Augmented Lagrangian method is an effective solution. The Augmented Lagrangian form of equation (14) is as follows, where Y1, Y2, and Y3 are the Lagrange multipliers and μ is the penalty parameter. We use an alternating strategy to optimize all variables iteratively, and we denote by the subscript t the solution at the tth iteration.

First, by ignoring all variables except P, equation (15) becomes

We obtain the projection matrix P column by column, because P is an orthogonal matrix. For the kth column of P, the objective function is rewritten as

We set the derivative of function (17) with respect to the kth column of P to zero.

Therefore, the kth column of P is the kth eigenvector of the matrix in equation (18) and can be obtained directly.

Update J:

Singular value thresholding is an approximate method for solving the above nuclear norm minimization problems [37].
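A minimal sketch of singular value thresholding, the proximal operator used for nuclear-norm subproblems of this kind (the function name is ours):

```python
import numpy as np

def svt(A, tau):
    """Singular value thresholding: the proximal operator of
    tau * nuclear norm, i.e. argmin_Z tau*||Z||_* + 0.5*||Z - A||_F^2."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt  # soft-threshold the spectrum

A = np.diag([3.0, 1.0])
Z = svt(A, 2.0)   # singular values (3, 1) shrink to (1, 0)
```

Soft-thresholding the spectrum both lowers the rank (small singular values vanish) and shrinks the remaining ones, which is what drives the nuclear-norm term toward low-rank solutions.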

Update S:

Equation (20) can be addressed in the same way as equation (19).

We set the derivative of equation (15) with respect to Zc to zero.

It is obvious that equation (21) is a Sylvester equation, which can be solved efficiently by the method in [38]. Similarly, we set the derivative of equation (15) with respect to Zv to zero.

Equation (22) can be addressed in the same way as equation (21).
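Assuming SciPy is available, a Sylvester equation of the form AX + XB = Q, as arises in these updates, can be solved directly; the matrices here are random stand-ins for the update's actual coefficients:

```python
import numpy as np
from scipy.linalg import solve_sylvester

# Build a well-posed random instance of A X + X B = Q.
rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3)); A = A @ A.T + 3.0 * np.eye(3)  # SPD, well-conditioned
B = rng.standard_normal((2, 2)); B = B @ B.T + 3.0 * np.eye(2)
Q = rng.standard_normal((3, 2))
X = solve_sylvester(A, B, Q)   # Bartels-Stewart solver
```

This direct solver costs cubic time in the matrix dimension, consistent with the complexity analysis in Section 3.4.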

Update E:

The above equation is an ℓ2,1-norm minimization problem whose solution is given in [39].
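The ℓ2,1-norm proximal step has a standard column-wise shrinkage solution; a small sketch (the function name is ours):

```python
import numpy as np

def prox_l21(Q, lam):
    """Solve min_E lam*||E||_{2,1} + 0.5*||E - Q||_F^2 column-wise:
    each column q_j is scaled by max(1 - lam/||q_j||_2, 0)."""
    norms = np.linalg.norm(Q, axis=0)
    scale = np.maximum(1.0 - lam / np.maximum(norms, 1e-12), 0.0)
    return Q * scale   # broadcasting scales each column independently

E = prox_l21(np.array([[3.0], [4.0]]), 1.0)  # column norm 5 -> scaled by 0.8
```

Columns with norm below the threshold are zeroed entirely, which is exactly the sample-specific corruption removal described for E.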

The entire numerical iterative scheme for equation (14) is shown in Algorithm 1, where the penalty-related parameters (μ, ρ, μ_max, and ε) are set empirically. Moreover, the matrices Zc, Zv, J, S, E, Y1, Y2, and Y3 are initialized to 0, and the trade-off parameters λ, β, and γ are tuned by experiments.

Input: data matrix X, parameters λ, β, and γ
Initialize: Zc = Zv = J = S = E = 0; Y1 = Y2 = Y3 = 0; μ, μ_max, ρ, ε; t = 0.
while not converged do
(1) Solving P by equation (18);
(2) Solving J by equation (19);
(3) Solving S by equation (20);
(4) Solving Zc by equation (21);
(5) Solving Zv by equation (22);
(6) Solving E by equation (23);
(7) Updating Y1 by Y1 = Y1 + μ(X − XZc − XZv − E);
(8) Updating Y2 by Y2 = Y2 + μ(Zc − J);
(9) Updating Y3 by Y3 = Y3 + μ(Zv − S);
(10) Updating the parameter μ by μ = min(ρμ, μ_max);
(11) Checking convergence by max(‖X − XZc − XZv − E‖_∞, ‖Zc − J‖_∞, ‖Zv − S‖_∞) < ε;
(12) t = t + 1.
end while
Output: the projection matrix P
3.4. Complexity Analysis

According to the above computational process and Algorithm 1, we discuss the computational complexity of the proposed algorithm in detail. In Algorithm 1, the complexity is dominated by Steps 1–5. Equation (18) is a typical characteristic equation, which costs O(n^3), where n is the number of training samples. The SVD in Steps 2 and 3 takes about O(n^3), but it can be reduced to O(rn^2) thanks to the low-rank matrices Zc and Zv, where r is the rank of the low-rank matrix. Equations (21) and (22) are two Sylvester equations whose computational complexity is O(n^3). In summary, the computational complexity of our proposed method is O(n^3) per iteration.

4. Experiments

In this section, we evaluate the performance of our proposed method on classification tasks. Firstly, we introduce the four cross-view datasets (a face database, two object databases, and an image-text database) and the experimental setup. Secondly, we adopt several strong subspace learning algorithms for comparison with ours. All unknown parameters are tuned to obtain the best experimental results; the analysis of parameters is shown in Figure 2. In addition, each experiment is repeated 10 times and the average classification results are reported.

4.1. Experimental Datasets

The CMU-PIE face dataset contains face images of 68 people in 9 poses and 21 illumination conditions. In our experiment, we chose four poses: P05, P09, P27, and P29. To enhance the efficiency of our algorithm, the 300-dimensional principal features extracted by PCA from the cropped face images were adopted in our experiment.

The Wikipedia dataset, an image-text bimodal dataset, consists of 2866 pairwise samples from 10 classes. The dimensions of the image and text features are 4096 and 100, respectively. Due to the inconsistent dimensionality of the two features, we use PCA to reduce the image dimension.

The COIL-20 object dataset is composed of 20 objects captured over a full horizontal 360-degree view, with 5° between every two adjacent images, so each category has 72 samples. We divide the 72 images into two groups, G1 and G2: G1 is composed of samples from V1 [0°, 85°] and V2 [185°, 265°], and similarly, G2 is composed of samples from V3 [90°, 175°] and V4 [270°, 355°].

The COIL-100 object dataset is an extension of COIL-20; the only difference is that COIL-100 is composed of 100 objects captured over the same full 360-degree view. Therefore, the setup of the COIL-100 dataset is analogous to that of COIL-20.

4.2. Experimental Results and Analysis

In the experiments, we do not use any information about the test set, including class and view labels. We select several subspace learning methods for comparison, namely, PCA, LDA, locality preserving projections (LPP) [40], LatLRR, SRRS, and RCVL. After extracting features with each method, we uniformly choose KNN as the classifier to evaluate performance. In addition, we also add 10 percent and 20 percent random noise to part of the datasets to demonstrate the adaptability of our subspace learning algorithm to different levels of corruption; some instances are shown in Figure 3.
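For reference, the KNN scoring step can be sketched in plain NumPy (a simple majority-vote implementation; the actual experiments may differ in distance metric or tie handling):

```python
import numpy as np

def knn_predict(train_X, train_y, test_X, k=1):
    """k-NN on projected features, with samples stored as columns.
    Each test column is assigned the majority label among its k
    nearest training columns (Euclidean distance)."""
    # D[i, j]: distance from training sample i to test sample j
    D = np.linalg.norm(train_X[:, :, None] - test_X[:, None, :], axis=0)
    preds = []
    for j in range(test_X.shape[1]):
        nn = np.argsort(D[:, j])[:k]                    # indices of k nearest
        vals, counts = np.unique(train_y[nn], return_counts=True)
        preds.append(vals[np.argmax(counts)])           # majority vote
    return np.array(preds)

# toy check on 1-D features with two separated classes
train_X = np.array([[0., 1., 10., 11.]])
train_y = np.array([0, 0, 1, 1])
test_X = np.array([[0.5, 10.5]])
pred = knn_predict(train_X, train_y, test_X, k=1)
```

Using the same classifier for every compared method keeps the evaluation focused on the learned features rather than the classifier.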

For CMU-PIE, we randomly perform cross-view subspace learning on pairs of poses, giving a total of 6 experimental groups: C1{P05,P09}, C2{P05,P27}, C3{P05,P29}, C4{P09,P27}, C5{P09,P29}, and C6{P27,P29}. Tables 1–3 respectively show the classification results of all experimental algorithms on the original data, 10% noisy data, and 20% noisy data. For the COIL-20 and COIL-100 object datasets, we select two sets of samples from G1 and G2 as a cross-view training set and the others as a test set, yielding 4 experimental groups for each dataset: C1{V1,V3}, C2{V1,V4}, C3{V2,V4}, and C4{V2,V3}. Tables 4–6 display the classification results of all experimental algorithms on the original data, 10% noisy data, and 20% noisy data from COIL-20. Figures 2, 4, and 5 show the experimental results of the four groups from the original COIL-100 dataset, the 10% corrupted COIL-100 dataset, and the 20% corrupted COIL-100 dataset. For Wikipedia, we use the dimensionality-reduced image features and text features as the two views, and Figure 6 displays the results of the comparison experiments on the original data, 10% noisy data, and 20% noisy data.

The experimental results show that our method consistently achieves higher classification accuracy than the other methods. On noisy data, the classification results of most LRR-based methods are more robust than those of the other methods. This is because the low-rank representation framework can recover the underlying information from corrupted data by learning latent structure. Moreover, the methods designed for cross-view data outperform the remaining baselines. Our proposed method projects data into a discriminative view-invariant subspace via the dual low-rank representation framework, so it can better learn from cross-view data.

4.3. Performance Evaluations

In this part, we examine which parameter values give the best performance of our method. Then, we show the convergence of our algorithm.

There are three tunable parameters, λ, β, and γ, in our framework. We evaluate their effect on COIL-20 C1. β and γ are the two parameters that adjust the discriminative local alignment and the discriminative global alignment, respectively; Figures 7(a) and 7(b) show where our method obtains its best result over their value ranges. Furthermore, λ is the parameter that constrains the corrupted data, and the classification result is optimal around λ = 1.

In the end, we show the convergence analysis of our method on different datasets: the original and the 10% corrupted COIL-20 C1, and the CMU-PIE C1. The maximum residual among the three equality constraints is used as the convergence criterion in each iteration. The variation of this maximum value with the number of iterations is shown in Figure 8. The curves indicate that the proposed algorithm converges steadily and efficiently after 20 iterations.

5. Conclusions

We proposed a subspace learning algorithm with discriminative constraints via low-rank representation to address the cross-view recognition task. Our method learns a distribution-invariant subspace from cross-view data by designing two substantial structures with dual low-rank constraints. We also integrate local and global alignments into our framework to eliminate the interference caused by the view manifold in the subspace. Meanwhile, we design a feasible iterative scheme to ensure that the model converges and obtains an optimal solution. Extensive experiments on several public datasets demonstrate that our proposed method has strong robustness and stability for cross-view classification tasks.

Data Availability

The datasets used in this paper can be downloaded through the following links: (1) CMU-PIE face dataset: http://vasc.ri.cmu.edu/idb/html/face/. (2) Wikipedia dataset: http://www.svcl.ucsd.edu/projects/crossmodal/. (3) COIL-20 dataset: https://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php. (4) COIL-100 dataset: https://www.cs.columbia.edu/CAVE/software/softlib/coil-100.php.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (Grants 61501147 and 62071157), University Nursing Program for Young Scholars with Creative Talents in Heilongjiang Province (Grant UNPYSCT-2018203), Natural Science Foundation of Heilongjiang Province (Grant YQ2019F011), Fundamental Research Foundation for University of Heilongjiang Province (Grant LGYC2018JQ013), and Postdoctoral Foundation of Heilongjiang Province (Grant LBH-Q19112).