Video based object representation and classification using multiple covariance matrices

Video-based object recognition and classification have been widely studied in computer vision and image processing. One main issue in this task is developing an effective representation for video, a problem that can generally be formulated as image set representation. In this paper, we present a new method called Multiple Covariance Discriminative Learning (MCDL) for image set representation and classification. The core idea of MCDL is to represent an image set using multiple covariance matrices, with each covariance matrix representing one cluster of images. First, we use Nonnegative Matrix Factorization (NMF) to cluster the images within each image set, and then apply Covariance Discriminative Learning to each cluster (subset) of images. Finally, we adopt Kernel LDA (KLDA) and nearest neighbor classification for image set classification. Promising experimental results on several datasets show the effectiveness of our MCDL method.


Introduction
With recent developments in imaging techniques, multiple images of an object are often available, for example in video-based surveillance and multi-view camera networks. Object recognition from these multiple images is formulated as an image set (video) classification problem and has attracted increasing attention in computer vision and machine learning in recent years [1,2,3,4,5,6,7]. This technique can be used in many computer vision problems. For example, in the visual object search task [8,9,10], one can use multiple images to retrieve and recognize similar visual objects. In face recognition [11], multiple face images can likewise be used for person identification. Compared with traditional single-image object recognition and learning, a video model generally contains more visual appearance content and thus performs more robustly and effectively [12,13,14,15,16,17,18].
One of the main challenges in video-based object recognition is to develop an effective method for representing an image set or sequence. Other problems include image set classifier development, image set clustering and so on. In this paper, we focus on image set representation. In recent years, many methods have been proposed for image set representation and classification. Kim et al. [19] proposed Discriminant-analysis of Canonical Correlations (DCC), which represents an image set using a single linear subspace. Hamm et al. [3] proposed Grassmann Discriminant Analysis (GDA), which uses multiple local linear subspaces to represent an image set. Besides linear subspaces, nonlinear subspace methods have also been used. Wang et al. [20] represented an image set with nonlinear manifolds and used the Manifold-Manifold Distance (MMD) for image set representation and classification. Wang et al. [21] also proposed Manifold Discriminant Analysis (MDA) to obtain a more discriminative feature space for representing a set of images. In addition to the above methods, probabilistic models have also been used. Shakhnarovich et al. [4] used a single Gaussian model for set modeling. Arandjelovic et al. [1] further used Gaussian Mixture Models (GMM) for image set representation. Wang et al. [11] proposed Discriminant Analysis on Riemannian Manifold of Gaussian Distributions (DARG) to learn a discriminative representation for image sets. As one of the probabilistic methods, Covariance Discriminative Learning (CDL) [5] has been widely used for image set representation. The core idea of the CDL method is to represent an image set using a single covariance matrix.
One benefit of the CDL representation is that it makes no assumption about the distribution of the set data and thus provides a simple and effective representation for an image set with any kind of features. However, when data samples are drawn from a union of multiple subspaces, traditional CDL with its single covariance matrix generally fails to provide an accurate and reliable representation.
In this paper, we present a new image set representation method called Multiple Covariance Discriminative Learning (MCDL), which represents an image set using multiple covariance matrices, with each covariance matrix representing one cluster of images. Compared with the previous single-covariance CDL method [5], MCDL better exploits the distribution of data over multiple subspaces and thus provides a more faithful representation. To do so, we first use Nonnegative Matrix Factorization (NMF) to cluster the samples into their respective subspaces. Then, we apply Covariance Discriminative Learning (CDL) to represent each cluster (subset) of images, which lies in a single subspace, as shown in Fig 1. Note that covariance-based visual representations have been used in many applications [22,23]. Different from these works, we focus on a multiple covariance matrices representation, which takes the multi-subspace property of image set data into account and thus provides a more effective descriptor. For set classification, we first define a method to measure the similarity between image sets based on MCDL and then adopt KLDA and the nearest neighbor classification method [5]. Experimental results on several datasets show the effectiveness and benefits of the proposed MCDL method.
The remainder of this paper is organized as follows. In the Materials and methods section, we introduce the nonnegative matrix factorization (NMF) data clustering method and propose our multiple covariance matrices representation and Kernel LDA classification method. Finally, we apply the MCDL method to several datasets to evaluate its effectiveness.

Materials and methods
The experimental data in our study were acquired legitimately from international standard databases, and this study was approved by the Local Ethics Committee of Wuhan University of Technology.

NMF clustering
Nonnegative Matrix Factorization (NMF) [24,25] is a matrix factorization algorithm that has been widely used in many machine learning problems. Let $X = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^{p \times n}$ be $n$ data points in $p$-dimensional space. The aim of NMF is to find two smaller nonnegative matrices $F \in \mathbb{R}^{p \times k}$ and $G \in \mathbb{R}^{n \times k}$ whose product approximates the original matrix $X$ as closely as possible, i.e.,
$$X \approx F G^{T}.$$
Using the Euclidean distance (Frobenius norm) as the residual function, this approximation problem can be formulated as the following optimization,
$$\min_{F, G} \; \|X - F G^{T}\|_F^2 \quad \text{s.t.} \quad F \ge 0, \; G \ge 0.$$
Although the objective function is convex in $F$ alone or in $G$ alone, it is not convex in both variables jointly. Thus, it is difficult to develop an algorithm that finds the globally optimal solution of this problem. Lee and Seung [18] presented an effective algorithm that iteratively updates the current solution as follows,
$$F_{ik} \leftarrow F_{ik} \frac{(X G)_{ik}}{(F G^{T} G)_{ik}}, \qquad G_{jk} \leftarrow G_{jk} \frac{(X^{T} F)_{jk}}{(G F^{T} F)_{jk}}.$$
It has been proven that this update algorithm converges to a locally optimal solution. The NMF model has been widely used in many applications. One important aspect of NMF is that it can be used for data clustering. Let $(F^{*}, G^{*})$ be the optimal solution of the above optimization problem. Then the columns $f^{*}_i$, $i = 1, \ldots, k$, of $F^{*}$ can be regarded as the cluster centroids, and $G^{*}_{ik}$ can be viewed as the continuous coefficient of data point $x_i$ belonging to cluster $c_k$. In the clustering process, the cluster label of $x_i$ is given by its maximum coefficient, i.e., $\arg\max_k G^{*}_{ik}$.
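As an illustration only, the multiplicative updates and the maximum-coefficient labelling described above can be sketched in a few lines of NumPy. The function name nmf_cluster, the iteration count, the random initialization and the small constant eps that guards against division by zero are assumptions of this sketch, not settings reported in the paper.

```python
import numpy as np

def nmf_cluster(X, k, n_iter=200, eps=1e-10, seed=0):
    """Cluster the columns of a nonnegative matrix X (p x n) into k groups.

    Runs multiplicative updates for min ||X - F G^T||_F^2 with F, G >= 0.
    All default values here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    p, n = X.shape
    F = rng.random((p, k)) + eps  # columns play the role of cluster centroids
    G = rng.random((n, k)) + eps  # row i holds the coefficients of x_i
    for _ in range(n_iter):
        # Element-wise multiplicative updates keep F and G nonnegative.
        F *= (X @ G) / (F @ (G.T @ G) + eps)
        G *= (X.T @ F) / (G @ (F.T @ F) + eps)
    labels = np.argmax(G, axis=1)  # largest coefficient decides the cluster
    return F, G, labels
```

Here each column of X would hold one vectorized image, so labels[i] gives the cluster assigned to the i-th image of the set.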

Image set modeling with multiple covariance matrices
In this section, an effective method is proposed to represent image sets by using multiple covariance matrices. Based on this representation, a similarity metric between two image sets is further computed.
A. Image set representation. We first propose an effective image set representation by using multiple covariance matrices, called Multiple Covariance Discriminative Learning (MCDL). Formally, given a video (or image set) $X = (x_1, x_2, \ldots, x_n)$, we first use the above NMF method to do clustering on $X$ and obtain clustering results $X = \{X_1, X_2, \ldots, X_k\}$ with $k$ clusters. Here, $X_i$ is the image subset belonging to the $i$-th cluster.
Then, each cluster $X_i$ is represented by a $d \times d$ covariance matrix as follows,
$$C_i = \frac{1}{n_i - 1} \sum_{h=1}^{n_i} \left(X_i(h) - \bar{X}_i\right)\left(X_i(h) - \bar{X}_i\right)^{T},$$
where $n_i$ is the number of images in the $i$-th cluster, $\bar{X}_i$ is the mean of the $i$-th cluster, and $X_i(h)$ is the $h$-th element of cluster $X_i$. Finally, the whole image set $X$ is represented by a set of covariance matrices,
$$C = \{C_1, C_2, \ldots, C_k\}.$$
When $k = 1$, our MCDL reduces to the traditional CDL method [5]. Therefore, the proposed MCDL can be regarded as a generalization of the CDL representation. Compared with CDL, MCDL can represent the variations among the images in an image set more fully while retaining the benefits of the CDL representation.
B. Similarity metric for MCDL. Based on the above MCDL representation, we define a similarity metric between two image sets $S_1$ and $S_2$ whose covariance matrix representations are $C^1$ and $C^2$, respectively. Formally, let $C^1 = \{C^1_1, C^1_2, \ldots, C^1_k\}$ and $C^2 = \{C^2_1, C^2_2, \ldots, C^2_k\}$. Each covariance matrix $C^1_h$ or $C^2_l$ is symmetric positive definite (SPD). SPD matrices do not lie in a Euclidean space but on a Riemannian manifold [5,26]. Therefore, it is necessary to map $C^1_h$ and $C^2_l$ from the Riemannian manifold to a Euclidean space using the following logarithm operator [5],
$$\mathcal{C}_{M \to E}: \; C \mapsto \log(C),$$
where $\log(C) = \{\log(C_1), \log(C_2), \ldots, \log(C_k)\}$ and, for an SPD matrix with eigen-decomposition $C_i = U \Sigma U^{T}$, the matrix logarithm is $\log(C_i) = U \log(\Sigma) U^{T}$. Using this mapping $\mathcal{C}_{M \to E}$, the similarity between two image sets $S_1$ and $S_2$ is computed in the following three main steps.
Step 1. Compute the similarity between covariance matrices $C^1_i$ and $C^2_j$ as the inner product of their matrix logarithms, i.e.,
$$k(C^1_i, C^2_j) = \mathrm{Tr}\left(\log(C^1_i) \, \log(C^2_j)\right),$$
where $\mathrm{Tr}(A)$ is the trace of matrix $A$.
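Step 2. Compute the optimal mapping $f$ between the two covariance matrix sets $C^1$ and $C^2$ by solving the following optimization,
$$\max_{f} \; \sum_{i=1}^{k} k\left(C^1_i, C^2_{f(i)}\right),$$
where $f$ ranges over the one-to-one mappings from $\{1, \ldots, k\}$ to itself. This problem is known as the bipartite graph matching problem and can be solved efficiently by the Hungarian algorithm.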
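Step 3. Calculate the similarity between the covariance matrix sets $C^1$ and $C^2$ under the optimal mapping $f$ from Step 2 as follows,
$$K(C^1, C^2) = \sum_{i=1}^{k} k\left(C^1_i, C^2_{f(i)}\right). \qquad (10)$$
Note that the similarity function $K$ is a combination of the linear kernel functions $k(C^1_i, C^2_j)$. Therefore, it also serves as the desired kernel function.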
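The representation and the three similarity steps above can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the helper names (cluster_covariances, spd_log, set_similarity) are hypothetical, a small ridge reg is added to each covariance matrix so that its logarithm is well defined, clusters with fewer than two images are skipped, and the set-level similarity is taken as the plain sum of the matched pairwise kernels. The matching uses scipy.optimize.linear_sum_assignment, which solves the same linear assignment problem that the Hungarian algorithm addresses.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cluster_covariances(X, labels, reg=1e-3):
    """One d x d covariance matrix per cluster of the columns of X (d x n)."""
    covs = []
    for c in np.unique(labels):
        Xc = X[:, labels == c]
        if Xc.shape[1] < 2:
            continue  # a single image has no sample covariance (assumption)
        Xc = Xc - Xc.mean(axis=1, keepdims=True)
        C = Xc @ Xc.T / (Xc.shape[1] - 1)
        # Ridge term (assumption) keeps the matrix strictly positive definite.
        covs.append(C + reg * np.eye(X.shape[0]))
    return covs

def spd_log(C):
    """Matrix logarithm of an SPD matrix via its eigen-decomposition."""
    w, U = np.linalg.eigh(C)
    return (U * np.log(w)) @ U.T

def set_similarity(covs1, covs2):
    """Similarity between two sets of SPD matrices; returns (K, matching)."""
    logs1 = [spd_log(C) for C in covs1]
    logs2 = [spd_log(C) for C in covs2]
    # Step 1: k(C1_i, C2_j) = Tr(log(C1_i) log(C2_j)); for symmetric matrices
    # this trace equals the sum of element-wise products.
    k = np.array([[np.sum(A * B) for B in logs2] for A in logs1])
    # Step 2: one-to-one matching with maximum total similarity
    # (negated because linear_sum_assignment minimizes cost).
    rows, cols = linear_sum_assignment(-k)
    # Step 3: sum the kernels of the matched pairs.
    return float(k[rows, cols].sum()), cols
```

With labels taken from the NMF clustering of each image set, set_similarity(cluster_covariances(X1, labels1), cluster_covariances(X2, labels2)) would give one entry of the kernel matrix used for classification.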

Image set classification
Based on the above MCDL representation and the associated similarity metric, we can build an effective classification method for image sets. Our classification method contains two main steps. First, we use Kernel Linear Discriminant Analysis (KLDA) [27,5] to extract discriminative features from our MCDL representation. Then, we use nearest neighbor classification to classify the image sets. Let $X = \{X_1, X_2, \ldots, X_m\}$ be $m$ image sets belonging to $c$ classes. For each pair of sets $X_i$ and $X_j$, we extract their MCDL representations and then compute the similarity kernel function $K(i, j)$ between them. The aim of KLDA is to solve the following optimization,
$$\max_{p} \; \frac{p^{T} K L K p}{p^{T} K K p},$$
where $K$ is the $m \times m$ kernel matrix computed using Eq (10), and $L$ is the class label matrix defined as
$$L_{ij} = \begin{cases} 1/m_k, & \text{if } X_i \text{ and } X_j \text{ both belong to class } k, \\ 0, & \text{otherwise,} \end{cases}$$
where $m_k$ is the number of data points (image sets) belonging to class $k$ and $\sum_{k=1}^{c} m_k = m$. It is well known that the optimal solution $p$ is the eigenvector corresponding to the largest eigenvalue of the generalized eigenproblem $K L K p = \lambda K K p$. By grouping the eigenvectors of the $c - 1$ largest eigenvalues, we obtain $P = [p_1, \ldots, p_{c-1}]$ and compute the $(c-1)$-dimensional projected feature vector of the $i$-th image set by
$$z_i = P^{T} K_{\cdot i},$$
where $K_{\cdot i}$ denotes the $i$-th column of $K$. After KLDA projection, we use nearest neighbor classification to classify the image sets [5].
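The two classification steps can be sketched as below, assuming a precomputed symmetric kernel matrix K between the training image sets. The function names (klda_fit, klda_transform, nearest_neighbor) are hypothetical, and the ridge term added to $K K$ is an assumption of this sketch that keeps the generalized eigenproblem numerically well posed; the paper does not specify a regularizer.

```python
import numpy as np
from scipy.linalg import eigh

def klda_fit(K, y, reg=1e-6):
    """Return P (m x (c-1)) from an m x m kernel matrix K and labels y."""
    y = np.asarray(y)
    m = K.shape[0]
    classes = np.unique(y)
    # Class label matrix: L_ij = 1/m_k if sets i and j both belong to class k.
    L = np.zeros((m, m))
    for c in classes:
        idx = np.flatnonzero(y == c)
        L[np.ix_(idx, idx)] = 1.0 / idx.size
    A = K @ L @ K
    B = K @ K
    B = B + reg * np.trace(B) / m * np.eye(m)  # ridge (assumption)
    # Generalized symmetric eigenproblem A p = lambda B p, ascending eigenvalues.
    _, V = eigh(A, B)
    return V[:, ::-1][:, :classes.size - 1]

def klda_transform(P, K_cols):
    """Project sets given their kernel values to the m training sets (m x q)."""
    return (P.T @ K_cols).T

def nearest_neighbor(Z_gallery, y_gallery, Z_probe):
    """Label each probe with the class of its closest gallery feature."""
    d = np.linalg.norm(Z_probe[:, None, :] - Z_gallery[None, :, :], axis=2)
    return np.asarray(y_gallery)[np.argmin(d, axis=1)]
```

For a probe image set, K_cols would hold its MCDL similarities to every gallery set, so the same projection P applies to gallery and probe features alike.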

Experiments and results
In this section, we implement and test our MCDL method on several datasets to evaluate its effectiveness and benefits. The datasets, which have been widely used to evaluate other methods, are introduced below. We compare our MCDL method with several other methods, including traditional Covariance Discriminative Learning (CDL) [5], Set to Set Distance Metric Learning (SSDML) [28], Manifold Discriminant Analysis (MDA) [21], Manifold-Manifold Distance (MMD) [20] and Discriminant-analysis of Canonical Correlations (DCC) [19].
• The YTC [30] dataset contains 1910 face videos of 47 subjects. Each video contains several hundred frames.
• The Cambridge-Gesture [31] dataset has 900 video sequences of 9 gestures in total. Each gesture has 100 videos. We divide the dataset into five sets.
All the images used in these four datasets have been resized to 20 × 20 intensity images so that all samples have the same dimension.

Classification results
We conduct the classification experiments in three settings, randomly selecting 50%, 70% and 90% of the image sets, respectively, as the gallery and using the remaining image sets as probes. For fair comparison, the important parameters of each method were empirically tuned according to the recommendations in the original references. Fig 2 shows the classification results of all methods on the four datasets.
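The random gallery/probe selection described above can be sketched as follows. The sketch assumes that the split is drawn over all image sets of a dataset at once, since the text does not state whether the sampling is stratified by class; the function name split_gallery_probe and the seed argument are hypothetical.

```python
import numpy as np

def split_gallery_probe(n_sets, gallery_fraction, seed=0):
    """Randomly split image set indices 0..n_sets-1 into gallery and probe."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_sets)
    n_gallery = int(round(gallery_fraction * n_sets))
    # Sorted index arrays make the split easy to inspect and reuse.
    return np.sort(perm[:n_gallery]), np.sort(perm[n_gallery:])

# The three settings used in the experiments: 50%, 70% and 90% for gallery.
splits = {frac: split_gallery_probe(10, frac) for frac in (0.5, 0.7, 0.9)}
```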
Here, we can observe the following. (1) CDL generally performs better than the other compared methods, which indicates the effectiveness and benefits of the CDL method for image set classification tasks. (2) MCDL performs clearly better than the other methods on the Gesture dataset, because the images in each set of this dataset usually lie on multiple subspaces. As discussed before, MCDL is better suited to data lying on multiple subspaces because it uses a multiple covariance matrices representation instead of a traditional single-model representation such as MMD, MDA or DCC. (3) MCDL outperforms the traditional CDL method and obtains the best performance on the four datasets. This indicates the robustness and effectiveness of the proposed MCDL method for image set representation and classification. (4) MCDL clearly outperforms CDL on the Gesture dataset. For this dataset, the variations among the images in each set are very large due to the different gestures, so the images can be divided into several clusters. In this case, the proposed MCDL captures these variations more fully than single CDL.
We also test our method under the standard setting, which is summarized as follows. For each person in the CMU MoBo [29] dataset, one set is used for gallery and the rest for probe. For the YTC [30] dataset, we randomly choose 3 sets for gallery and 6 sets for probe. For the Cambridge-Gesture [31] dataset, the first set is used for gallery and the remaining four sets for probe. For each category in the ETH-80 [32] dataset, five objects are selected for gallery and the remaining five objects for probe. The results are summarized in Table 1. Our method performs better than the other compared methods, which further demonstrates the robustness of the proposed MCDL method on image set classification tasks.
The classification accuracies of the different methods on the four datasets are summarized in Table 1. Compared with the other baseline methods, CDL performs better, which indicates the effectiveness and benefits of the CDL method. Our MCDL generally outperforms the traditional CDL method and obtains the best performance, which indicates the robustness and effectiveness of the proposed MCDL method for image set representation and classification.

Conclusion
In this paper, we present a new image set representation method called Multiple Covariance Discriminative Learning (MCDL). The aim of MCDL is to represent an image set using multiple covariance matrices, with each covariance matrix representing one cluster of images. To do so, we first use Nonnegative Matrix Factorization (NMF) to cluster the images within each image set. Then, we adopt Covariance Discriminative Learning (CDL) to represent each cluster (subset) of images. For classification, we first define a method to measure the similarity between image sets based on MCDL and then adopt KLDA and the nearest neighbor classification method for image set classification. Experimental results show the effectiveness and benefits of the proposed method.