General Image Categorization Using Collaborative Mean Attraction

In this paper, we apply the Collaborative Mean Attraction (CMA) method, originally developed for the person re-identification problem, to the general image categorization problem. Experimental results on the Caltech101 and Caltech256 datasets show that CMA achieves better categorization accuracy than conventional methods, particularly when the amount of training data is small. Furthermore, we discuss the parameter settings of CMA through several experiments.


Introduction
Recently, studies of general object recognition have made great progress. In this paper, we address the general image categorization problem, in which each image is classified into one of a set of known categories.
In general object recognition, Deep Learning methods show excellent results. A Deep Learning model not only classifies a given input image but can also extract features from it; these extracted features can be used for general-purpose object recognition and yield excellent recognition accuracy. To accomplish this, however, Deep Learning demands a huge number of training images.
However, in the general object categorization problem, there are cases in which only a small number of training images can be given. For example, when a user wants to organize his/her personal photos according to personal criteria, the user can supply only a few example images to serve as training data. In such cases, we need a categorization method that works well with a small number of training images.
In this paper, we apply the Collaborative Mean Attraction (CMA) method to the general image categorization problem.1 CMA was developed for the person re-identification problem, in which a person appearing in query images is classified as one of the known persons in the gallery (training) images. Since gallery images are taken while the person passes through the field of view of a surveillance camera, only a few gallery images are available. CMA shows good results on the person re-identification problem.
In our experiments, we apply the CMA method and several comparison methods to Caltech101 and Caltech256, two public general image categorization datasets, and discuss the characteristics of the CMA method.

CMA method
CMA is an identification method that classifies unknown test data into one of the known categories. Each known category is given as a set of feature vectors extracted from one or more training images belonging to that category. The test data is likewise given as features of test images; it is also possible to give a set of test images as one group and classify the whole group into one of the known categories at once.
The CMA method consists of two stages: an optimization stage and a classification stage. In the optimization stage, we generate a representative point of the test data and approximate it with all training images. In the classification stage, we select the category that contributed most to the approximation.

Optimization stage
The representative point and its approximate point are represented as linear combinations of the test data and the training data, respectively. In the optimization stage, we obtain the coefficients of the linear combinations constituting the representative point and the approximate point.
The coefficients are determined so that the distance between the representative point and the approximate point becomes small. At the same time, a constraint is imposed that the representative point and the approximate point should remain close to the averages of their respective data, so that the properties of each data set are retained.
A test data matrix $Y \in \mathbb{R}^{m \times k}$ and training data matrices $X_i \in \mathbb{R}^{m \times n_i}$ ($i \in \{1, \ldots, n\}$) are given, where $m$ is the dimensionality of the feature vector, $k$ is the number of test images, $n$ is the number of known categories, and $n_i$ is the number of training images in the $i$-th category. We denote by $X = (X_1 \cdots X_n) \in \mathbb{R}^{m \times N}$ the matrix that concatenates all the training data matrices, where $N = \sum_i n_i$ is the total number of training images. We obtain the coefficient vectors $\alpha \in \mathbb{R}^k$ and $\beta \in \mathbb{R}^N$ by minimizing the following objective $f(\alpha, \beta)$:

$$f(\alpha, \beta) = \|Y\alpha - X\beta\|_2^2 + \lambda_1 \left\| Y\alpha - \tfrac{1}{k} Y 1_k \right\|_2^2 + \lambda_2 \left\| X\beta - \tfrac{1}{N} X 1_N \right\|_2^2, \quad (1)$$

where $1_k$ is the $k$-dimensional vector whose elements are all 1 and $\|\cdot\|_2$ is the $\ell_2$ norm of a vector. $Y\alpha$ and $X\beta$ in Equation (1) are the representative point and the approximate point, respectively.
The first term minimizes the distance between the representative point and the approximate point. The second and third terms are regularization terms that keep the representative point and the approximate point close to their respective average points; $\lambda_1$ and $\lambda_2$ are the weight parameters of these regularization terms.
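Since the objective is a sum of three quadratic terms, the coefficients can be obtained with a standard linear least-squares solver. The following is a minimal NumPy sketch of the optimization stage; the function name and the stacked least-squares formulation are our own, as the paper does not specify a solver.

```python
import numpy as np

def cma_optimize(Y, X, lam1=16.0, lam2=16.0):
    """Sketch of the CMA optimization stage.

    Y : (m, k) test data matrix; X : (m, N) concatenated training data.
    Returns the coefficient vectors alpha (length k) and beta (length N)
    minimizing
      ||Y a - X b||^2 + lam1 ||Y a - mean(Y)||^2 + lam2 ||X b - mean(X)||^2.
    """
    m, k = Y.shape
    _, N = X.shape
    y_bar = Y.mean(axis=1)              # average of the test data
    x_bar = X.mean(axis=1)              # average of the training data
    s1, s2 = np.sqrt(lam1), np.sqrt(lam2)
    # Stack the three quadratic terms into one least-squares system A z = b
    # with z = [alpha; beta].
    A = np.block([
        [Y,                -X],
        [s1 * Y,            np.zeros((m, N))],
        [np.zeros((m, k)),  s2 * X],
    ])
    b = np.concatenate([np.zeros(m), s1 * y_bar, s2 * x_bar])
    z, *_ = np.linalg.lstsq(A, b, rcond=None)
    return z[:k], z[k:]
```

Because the problem is an unconstrained quadratic, the least-squares solution is the global minimizer of the objective.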

Classification stage
Using the coefficients $\alpha$ and $\beta$ obtained in the optimization stage, we find the category that contributed most to constructing the approximate point and classify the test data into that category. Each dimension of the coefficient vector $\beta$ corresponds to one training image, so $\beta$ can be decomposed into sub-vectors $\beta_i$ holding the coefficients of the training images in the $i$-th category. We then calculate the following score $d_i$ for each category $i$ and classify the test data into the category with the minimum $d_i$:

$$d_i = \frac{\|Y\alpha - X_i \beta_i\|_2}{(\|\beta_i\|_2 / \|\beta\|_2)\,\|X_i\|_*}. \quad (2)$$

The smaller $\|Y\alpha - X_i \beta_i\|_2$ is, the better the representative point can be approximated using only the $i$-th category. Similarly, the larger $\|\beta_i\|_2 / \|\beta\|_2$ is, the larger the role the coefficients of the $i$-th category play among all known categories. $\|\cdot\|_*$ is the nuclear norm of a matrix (the sum of its singular values); the larger the variation of the data, the larger the nuclear norm becomes, so it serves to weight the variance of each category.
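The classification stage can be sketched as follows. The exact form of the score is our reconstruction from the description above: the per-category residual, divided by the relative coefficient energy $\|\beta_i\|/\|\beta\|$ and by the nuclear norm $\|X_i\|_*$ that weights the variance of each category. The function name is our own.

```python
import numpy as np

def cma_classify(Y, Xs, alpha, beta):
    """Sketch of the CMA classification stage.

    Xs is the list [X_1, ..., X_n] of per-category training matrices whose
    horizontal concatenation is X; alpha and beta come from the
    optimization stage.  Returns the index of the winning category and
    the list of per-category scores.
    """
    rep = Y @ alpha                      # representative point Y alpha
    scores, start = [], 0
    for Xi in Xs:
        ni = Xi.shape[1]
        bi = beta[start:start + ni]      # coefficients of category i
        start += ni
        residual = np.linalg.norm(rep - Xi @ bi)
        weight = (np.linalg.norm(bi) / np.linalg.norm(beta)) \
                 * np.linalg.norm(Xi, ord='nuc')   # nuclear norm of X_i
        scores.append(residual / (weight + 1e-12))  # avoid divide-by-zero
    return int(np.argmin(scores)), scores
```

A category whose training images both reconstruct the representative point well and receive a large share of the coefficient energy obtains a small score and wins.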

Dataset
We used the Caltech101 and Caltech256 datasets for evaluation.2 The Caltech101 and Caltech256 datasets consist of images in 101 and 256 categories, respectively. The number of images per category ranges from 31 to 800 in Caltech101, and is at least 80 in Caltech256. Overall, Caltech256 is a more challenging dataset than Caltech101.

Feature extraction
For feature extraction, we use a model (bvlc_reference_caffenet)3 trained on ImageNet images with Caffe,4,5 a Deep Learning framework. This model was trained on the 1,000 categories of ILSVRC2012 using 1.2 million training images. The output of the fully connected layer is taken as the feature of the image; the dimensionality of the extracted feature is 4096. For implementation, the OpenCV 3.1 Deep Neural Network module was used.

Comparison methods
To discuss the characteristics of the CMA method, we compare it with three conventional methods.
The first method is the Support Vector Machine (SVM), a standard method in current pattern recognition. SVM is basically a two-class classifier; to apply it to multi-class classification, we employ the "one-versus-one" approach, which constructs an SVM for every pair of classes and classifies by majority vote over their results. We use a linear kernel. Since SVM basically cannot handle test data as a group, it was used only in the experiments with single test data. We used libsvm for the implementation.6
The second method is Center Point Distance (CPD). In CPD, the distance between a category and the test data is defined as the distance between the average of the training data of that category and the average of the test data. The test data is classified into the category with the minimum distance.
The third method is Minimum Point Distance (MPD). First, we compute the distance between every pair of training and test data points. Then we take the minimum value as the distance between the category and the test data and classify the test data into the category with the minimum distance. Whereas CPD considers only the center of each group, MPD is based on individual data points, so the classification result can reflect the distribution of the data.
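The two distance-based baselines, CPD and MPD, can be written in a few lines of NumPy. This is a minimal sketch under our own naming; features are stored column-wise as in the CMA notation, with one column per image.

```python
import numpy as np

def cpd_distance(train, test):
    """Center Point Distance: distance between the mean of one category's
    training features (m, n_i) and the mean of the test features (m, l)."""
    return np.linalg.norm(train.mean(axis=1) - test.mean(axis=1))

def mpd_distance(train, test):
    """Minimum Point Distance: smallest pairwise distance between any
    training point and any test point of the group."""
    diff = train[:, :, None] - test[:, None, :]      # (m, n_i, l)
    return np.sqrt((diff ** 2).sum(axis=0)).min()

def classify(categories, test, dist):
    """Assign the test group to the category with the minimum distance."""
    return min(range(len(categories)), key=lambda i: dist(categories[i], test))
```

Both baselines plug into the same `classify` helper, which makes the comparison with CMA straightforward to run.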

Experiment for single test data
In this experiment, each test datum is given as a single image, and we investigate how the identification result changes as the number of training images varies.
We randomly extract k images from each category in the dataset as training images, extract their features, and use them as training data. All remaining images are used as the evaluation dataset. Test data are extracted one by one from the evaluation dataset and categorized by each method. After obtaining the accuracy on the evaluation dataset for each category, we calculate the overall accuracy by averaging over all categories.
The number of training images is k = 1, 2, 4, 8, 16, 30. In order to avoid the influence of the particular images selected for training, we compute the accuracy for 8 different sets of training images and take the average as the experimental result. The same eight training sets are used for all methods. The results are shown in Fig. 1.
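The evaluation protocol above can be sketched as follows; the function names and data layout (one row per image) are our own illustration of the split and the per-category averaged accuracy.

```python
import numpy as np

def split_k_shot(features_by_cat, k, rng):
    """Draw k training images per category at random; all remaining
    images form the evaluation set.  `features_by_cat` maps a category
    index to an (n_images, dim) feature array."""
    train, evaluation = {}, {}
    for c, feats in features_by_cat.items():
        idx = rng.permutation(len(feats))
        train[c] = feats[idx[:k]]
        evaluation[c] = feats[idx[k:]]
    return train, evaluation

def mean_per_category_accuracy(evaluation, predict):
    """Overall accuracy = average of the per-category accuracies.
    `predict` maps one feature vector to a category index."""
    accs = []
    for c, feats in evaluation.items():
        correct = sum(predict(f) == c for f in feats)
        accs.append(correct / len(feats))
    return float(np.mean(accs))
```

Averaging per-category accuracies, rather than pooling all test images, prevents large categories from dominating the overall score.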
The experimental results show that CMA achieves higher accuracy than all comparison methods for every k and on both datasets. In particular, on Caltech101 with k = 1, CMA is 12.7 points more accurate than the comparison methods, and its advantage is relatively larger in the range where the number of training images is small.

Experiment for group test data
Next, we give the test data as groups of images. Training data are created in the same way as in Sec. 4.1, and the remaining images are used as the evaluation dataset. In this experiment, plural images are extracted from the evaluation dataset as a group and used as one test datum. The number of images in each group is l = 2, 4, 8, 16, 30. In order to avoid the influence of the particular images selected for testing, we conduct 3 trials for each l and average the results. The results are shown in Fig. 2.
Even with group test data, CMA shows higher accuracy than the comparison methods for k = 1 and 2. On the whole, accuracy increases as the number l of test images increases, which suggests that CMA has an advantage in being able to classify a set of plural test images as one group. However, for k = 4, 8, 16, and 30, CPD shows accuracy equal to or higher than that of CMA.

Experiment on regularization parameters
In the previous experiments, the regularization parameters were set to $\lambda_1 = \lambda_2 = 16$. In order to observe the influence of the regularization parameters, we varied the values of $\lambda_1$ and $\lambda_2$ over 32, 16, and 8 and investigated how the accuracy changes. The Caltech256 dataset is used in this experiment. The number of training images is k = 1, 30 and the number of test images in a group is l = 30. The results are shown in Table 1.
It was found that the lower the value of $\lambda_2$ is, the higher the accuracy is. This result suggests that, since the training data contains many different kinds of objects, the approximate point should not be pulled too close to the average of the training data. This analysis of the CMA parameters suggests that the gap between person re-identification and general image categorization can be reduced by setting the parameters appropriately. A method for determining the optimal parameter values is left as future work.

Figure 1: Accuracy for single test data.
Figure 2: Accuracy for group test data.