A Multimodal Constellation Model for Object Image Classification

We present an efficient method for object image classification. The method is an extension of the constellation model, which is a part-based model. The constellation model generally has two weak points. (1) It is essentially a unimodal model, which makes it unsuitable for categories with many types of appearances. (2) The probability function that represents the constellation model is expensive to compute. We introduce multimodalization and a speed-up technique to the constellation model to overcome these weak points. The proposed model consists of multiple subordinate constellation models, so that the diverse types of appearances of an object category can each be described by one of them, which increases description accuracy and consequently improves classification performance. In this paper, we show how to describe each type of appearance as a subordinate constellation model without any prior knowledge of the types of appearances, and how to train the extended model in realistic time. In experiments, we confirmed the effectiveness of the proposed model by comparison with methods using BoF, and confirmed that model training can be completed in realistic time.


Introduction
In this paper, we consider the problem of recognizing semantic categories with many types of appearances, such as Car, Chair, and Dog, under environment changes such as the direction of objects, the distance to objects, illumination, and backgrounds. This recognition task is challenging because object appearances vary widely, both across the objects within a semantic category and with environment changes, which complicates feature selection, model construction, and training dataset construction. One application of this recognition task is image retrieval.
For these recognition tasks, a part-based approach, which uses many distinctive partial images as local features, is widely employed. By focusing on partial areas, this approach can handle a broad variety of object appearances. Typical well-known methods include a scheme using Bag of Features (BoF) [1] and Fergus's constellation model [2]. BoF is an analogy to the "Bag of Words" model originally proposed in the natural language processing field. Approaches using BoF have been proposed, using classifiers such as SVM (e.g., [3][4][5]) and document analysis methods such as probabilistic Latent Semantic Analysis (pLSA), Latent Dirichlet Allocation (LDA), and Hierarchical Dirichlet Processes (HDPs) (e.g., [6][7][8]).
On the other hand, the constellation model represents target categories by probability functions that describe both the local features of the regions common to objects in a target category (see endnote 1) and the spatial relationships between those local features. This model belongs to the "pictorial structure" approach introduced in [9]. The details are given in Section 2.1.
The constellation model has the following three advantages (see endnote 2).
(a) Adding or changing the target categories is easy. In this research field, recognition methods are often categorized as either a "generative model" or a "discriminative approach (discriminative model + discriminant function)" [10]. This advantage comes from the fact that the constellation model is a generative model. A generative model builds a model for each target category individually. Therefore, the training process for adding target categories does not affect the existing target categories. To change the existing target categories, it is only necessary to change the models used in the task; no further training is necessary. On the other hand, discriminative approaches, which optimize a decision boundary to classify all target categories, have to relearn the decision boundary each time target categories are added or changed. In terms of recognition performance, however, the discriminative approach generally outperforms the generative model.
(b) Description accuracy is higher than that of BoF due to continuous value expression. Category representation by BoF is a discrete expression by a histogram formed by the numbers of local features corresponding to each codeword. On the other hand, since the constellation model is a continuous value expression by a probability function, the description accuracy is higher than BoF.
(c) Position and scale information can be used effectively. BoF ignores the spatial information of local features to avoid complicated descriptions of spatial relationships (see endnote 3). On the other hand, the constellation model uses a probability function to represent rough spatial relationships as one piece of information to describe the target categories.
In spite of the advantages, the constellation model has the following weak points.
(1) Since it is essentially a unimodal model, it has low description accuracy when objects in the target categories have various appearances.
(2) The probability function that represents the constellation model requires high computational cost.
In this paper, we propose a model that improves on the weak points of the constellation model. For weak point (1), we extend the constellation model to a multimodal model. A unimodal model has to represent several types of appearances with one component. By extension to a multimodal model, however, the different appearances can be described cooperatively by the components of the model, improving the accuracy of category description. This improvement is analogous to replacing a single Gaussian distribution with a Gaussian Mixture Model in local feature representation. In addition, we speed up the calculation of the probability function to address weak point (2).
Another constellation model was proposed before Fergus's constellation model in [11]. Multimodalization of this model was done in [12], but the structure of these models differs considerably from Fergus's constellation model, and they have the following three weak points compared with Fergus's model.
(i) They do not have the advantage (b) of Fergus's constellation model since the way to use local features is close to BoF.
(ii) They do not use the information of common regions' scale.
(iii) They cannot learn appearance and position simultaneously since the learning of them is not independent.
However, Fergus's constellation model requires a high computational cost to calculate the probability function that represents the model, and estimating the parameters of that function is correspondingly expensive, so multimodalizing the model as it stands is unrealistic. We therefore realize the multimodalization of Fergus's constellation model together with a speed-up of the calculation of the probability function. Fergus's constellation model was also improved in [13], but those improvements allow the model to use many sorts of local features and modify the expression of positional relationships. For clarity, in this paper we focus on the basic Fergus's constellation model.
Image classification tasks can be classified into the following two types.
(1) Classify images in which the target object occupies most of the image area and the object scales are similar (e.g., Caltech101/256).
(2) Classify images in which the target object occupies only part of the image area and the object scales may differ (e.g., Graz, PASCAL).
The method proposed in this paper targets type (1) images. It can, however, also handle type (2) images by first applying a method such as the sliding window method and then treating the resulting windows as type (1) images.
The remainder of this paper is structured as follows. In Section 2, we describe the Multimodal Constellation Model, the speeding-up techniques, and the training algorithm. In Section 3, we explain the classification method and describe the experiments in Section 4. Finally, we conclude the paper in Section 5.
Note that this paper is an extended version of our work [14]. It includes additional experiments and discussions on the number of effective components (part of Section 4.3), the object appearances described in each component (Section 4.5), and a comparison with Fergus's model (Section 4.6).

Multimodal Constellation Model
In this section, we describe Fergus's constellation model [2], then explain its multimodalization, and finally describe the speeding-up techniques for the calculation. The constellation model describes categories by focusing on the common object regions in each category. The regions and the positional relationships are expressed by Gaussian distributions.

Fergus's Constellation Model
The model is described by the following equation:

$$p(I \mid \Theta) = \sum_{h \in H} p(A, X, S, h \mid \Theta), \qquad (1)$$

where $I$ is an input image and $\Theta$ is the set of model parameters. Image $I$ is expressed as a set of local features extracted from $I$ by a local feature detector (e.g., [15]) [2]. The part of the equation that exhaustively evaluates all combinations of correspondences between the local features and the regions of the model ($\sum_{h \in H}$) is a summation. However, the part of the equation that describes a target category, $p(A, X, S, h \mid \Theta)$, is in substance a product of Gaussian distributions. Therefore, Fergus's constellation model can be considered a unimodal model.

Multimodalization.
To improve the description accuracy, we extend the constellation model from a unimodal model to a multimodal model. We formulate the proposed "Multimodal Constellation Model" as follows:

$$p(I \mid \Theta) = \sum_{k=1}^{K} \left\{ \prod_{l=1}^{L} G\left(x_l \mid \theta_{k,\hat{r}_{k,l}}\right) \right\} \cdot \pi_k, \qquad \hat{r}_{k,l} = \arg\max_{r} G\left(x_l \mid \theta_{k,r}\right), \qquad (2)$$

where $K$ is the number of components. If $K \geq 2$, the model is multimodal. Each type of appearance in a target object category is described by one of the components, so the description accuracy is expected to improve. $L$ is the number of local features extracted from image $I$, and $G(\cdot)$ is the Gaussian distribution. Also, $\Theta = \{\theta_{k,r}, \pi_k\}$, $\theta = \{\mu, \Sigma\}$, $I = \{x_l\}$, and $x = (A, X, S)$. $\theta_{k,r}$ is the set of parameters of the Gaussian distribution of region $r$ in component $k$. $x_l$ is the feature vector of the $l$th local feature. $A$, $X$, and $S$, which are the feature vectors of appearance, position, and scale, respectively, are subvectors of $x$. $\pi_k$ is the existence probability of component $k$, which satisfies $0 \leq \pi_k \leq 1$ and $\sum_{k}^{K} \pi_k = 1$. $\hat{r}_{k,l}$ is the index of the region in component $k$ that is most similar to local feature $l$ of image $I$. Moreover, $R$ (the number of regions) is a hyperparameter of the model, though it does not appear explicitly in the equation.

Speeding-Up Techniques.
Since the probability function that represents Fergus's constellation model is expensive to compute, estimating the model parameters is also time consuming. This complicates multimodalization, because multimodalization increases the number of parameters, and completing the training in realistic time then becomes impossible. Here we describe two speeding-up techniques.
Simplifying Matrix Calculation. For simplification, we approximate all covariance matrices as diagonal. This is equivalent to assuming independence among the dimensions of the feature vector. This modification considerably decreases the cost of computing $(x - \mu)^{t} \Sigma^{-1} (x - \mu)$ and $|\Sigma|$, which are needed to evaluate the Gaussian distributions. The total calculation cost is reduced from $O(D^3)$ to $O(D)$ for $D \times D$ matrices. Although the approximation decreases the description accuracy of each individual component, we expect multimodalization to increase the overall description accuracy. In particular, when $\Sigma$ is a diagonal matrix whose diagonal components are $\sigma_d^2$, the Gaussian distribution becomes

$$G(x \mid \theta) = \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi\sigma_d^2}} \exp\left( -\frac{(x_d - \mu_d)^2}{2\sigma_d^2} \right).$$

Modifying $\sum_{h \in H}$ to $\prod_{l}^{L}$ and $\arg\max_{r}$. The order of $\sum_{h \in H}$ in equation (1) is $O(L^R)$, where $L$ is the number of local features and $R$ is the number of regions. In practice, even though the A* search method is used for speeding-up in [2], the total calculation cost is still large. In the proposed method we changed $\sum_{h \in H}$ to $\prod_{l}^{L}$ and $\arg\max_{r}$. As a result, the cost is reduced to $O(LR)$. This approach is the same as the calculation cost reduction in [16], which targeted the classification of car images captured from an identical view angle by a fixed camera and modified the constellation model for that task.
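The effect of the two speed-ups can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `log_gaussian_diag` and `hypothesis_counts` are hypothetical, the density is evaluated in the log domain for numerical stability (an assumption, not something stated in the paper), and the counts only mirror the orders $O(L^R)$ and $O(LR)$ given above.

```python
import math


def log_gaussian_diag(x, mu, var):
    """Log-density of a Gaussian with diagonal covariance, computed in O(D).

    x, mu and var are equal-length sequences; var holds the diagonal
    entries sigma_d^2 of the covariance matrix.
    """
    total = 0.0
    for x_d, mu_d, var_d in zip(x, mu, var):
        # Each dimension contributes independently, so no matrix inverse
        # or determinant is needed.
        total += -0.5 * (math.log(2.0 * math.pi * var_d) + (x_d - mu_d) ** 2 / var_d)
    return total


def hypothesis_counts(num_features, num_regions):
    """Orders of the two schemes: exhaustive assignment (L^R) and
    per-feature arg max over regions (L * R)."""
    return num_features ** num_regions, num_features * num_regions
```

For example, with the illustrative values L = 20 and R = 6, `hypothesis_counts(20, 6)` gives 64,000,000 terms for the exhaustive scheme against 120 Gaussian evaluations for the per-feature scheme.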
Here we compare the expressions of the two models. Fergus's model exhaustively calculates the probabilities of all combinations of correspondences between regions and local features. The final probability is the sum of these probabilities ($\sum_{h \in H}$). In contrast, our model calculates the final probability using all the local features at once, which is expressed as $\prod_{l}^{L}$. After the region most similar to each local feature is selected ($\arg\max_{r}$), the probability of each local feature under that region is calculated. The final probability is the product of these probabilities. For details of the modification, refer to [16].
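The procedure just described (select the best region for each local feature, multiply over features, then combine the components with their weights) might be sketched as follows. This is a sketch under assumptions: the data layout (a list of components, each a list of `(mu, var)` region pairs) and the function names are hypothetical, covariances are taken to be diagonal, at least one component weight is assumed positive, and everything is computed in the log domain with a log-sum-exp to avoid underflow.

```python
import math


def log_gaussian_diag(x, mu, var):
    # Log-density of a diagonal-covariance Gaussian (sum over dimensions).
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (a - m) ** 2 / v)
               for a, m, v in zip(x, mu, var))


def component_log_scores(features, components, weights):
    """Return log(pi_k * prod_l G(x_l | theta_{k, r_hat})) for every component k."""
    scores = []
    for regions, pi_k in zip(components, weights):
        log_prod = 0.0
        for x in features:
            # arg max over regions: only the best-matching region contributes.
            log_prod += max(log_gaussian_diag(x, mu, var) for mu, var in regions)
        log_pi = math.log(pi_k) if pi_k > 0.0 else float("-inf")
        scores.append(log_pi + log_prod)
    return scores


def log_image_likelihood(features, components, weights):
    """log p(I | Theta): log-sum-exp over the per-component scores."""
    scores = component_log_scores(features, components, weights)
    top = max(scores)
    return top + math.log(sum(math.exp(s - top) for s in scores))
```

The cost is one Gaussian evaluation per (feature, region) pair in each component, which corresponds to the $O(LR)$ order stated above.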

Parameter Estimation.
Model parameter estimation is carried out using the EM algorithm [17]. Algorithm 1 shows the model parameter estimation algorithm for the Multimodal Constellation Model. $N$ denotes the number of training images, and $n$ denotes the index of a training image. $x_{n,l}$ denotes the feature vector of local feature $l$ in training image $n$. $\hat{r}_{k,n,l}$ denotes $\hat{r}_{k,l}$ in training image $n$.
(1) Initialization: set the initial values of the parameters.
(2) E step: calculate the degree of belonging $q_{k,n}$ of each training image $n$ to each component $k$.
(3) M step: update the parameters, where $Q_{k,r} = \sum_{n}^{N} \sum_{l:(\hat{r}_{k,n,l} = r)} q_{k,n}$.
(4) If the parameter updates have converged, the estimation process is finished and $p(k) = \pi_k$; otherwise return to (2).
Algorithm 1: Model parameter estimation algorithm for the multimodal constellation model.
We now explain the initial values used in initialization (1). The initial values of $\mu$ and $\Sigma$ (a diagonal matrix with diagonal elements $\sigma^2$) are set based on the range of the feature values: $\mu$ is initialized with random values chosen with regard to that range, and $\Sigma$ is initialized with fixed values that also reflect that range. $\pi$ is initialized as $1/K$.
One difference from the general EM algorithm for the Gaussian Mixture Model is that the data that update $\mu$ and $\Sigma$ are not per image but per local feature extracted from the images. The degree of belonging $q_{k,n}$ of training image $n$ to component $k$ is calculated in the E step, and then all local features extracted from training image $n$ participate in the updating of $\mu$ and $\Sigma$ according to the value of $q_{k,n}$. In addition, local feature $l$ participates in the updating of $\mu$ and $\Sigma$ of only the region $\hat{r}_{k,n,l}$ to which local feature $l$ corresponds.
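One iteration of such a training loop might look like the Python sketch below. The exact update equations of Algorithm 1 are not reproduced in this text, so the sketch assumes the standard weighted maximum-likelihood updates of a Gaussian mixture, adapted as described above: responsibilities are computed per image, and each local feature updates only its best-matching region. The function name `em_iteration`, the variance floor `min_var`, and the update of $\pi_k$ as the mean responsibility are all assumptions.

```python
import math


def log_gaussian_diag(x, mu, var):
    # Log-density of a diagonal-covariance Gaussian (sum over dimensions).
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (a - m) ** 2 / v)
               for a, m, v in zip(x, mu, var))


def em_iteration(images, mus, variances, pis, min_var=1e-6):
    """One EM iteration. images[n][l] is a feature vector; mus[k][r] and
    variances[k][r] are per-dimension lists; pis[k] is a component weight."""
    K, R, D = len(mus), len(mus[0]), len(mus[0][0])
    resp, assign = [], []
    # E step: degree of belonging q[n][k] of image n to component k.
    for feats in images:
        logs, best_regions = [], []
        for k in range(K):
            best = [max(range(R),
                        key=lambda r: log_gaussian_diag(x, mus[k][r], variances[k][r]))
                    for x in feats]
            log_prod = sum(log_gaussian_diag(x, mus[k][r], variances[k][r])
                           for x, r in zip(feats, best))
            logs.append(math.log(max(pis[k], 1e-300)) + log_prod)
            best_regions.append(best)
        top = max(logs)
        unnorm = [math.exp(v - top) for v in logs]
        total = sum(unnorm)
        resp.append([u / total for u in unnorm])
        assign.append(best_regions)
    # M step: every local feature updates only its assigned region,
    # weighted by the responsibility of its image (assumed update rule).
    new_mus = [[list(m) for m in comp] for comp in mus]
    new_vars = [[list(v) for v in comp] for comp in variances]
    new_pis = [sum(q[k] for q in resp) / len(images) for k in range(K)]
    for k in range(K):
        for r in range(R):
            members = [(resp[n][k], x)
                       for n, feats in enumerate(images)
                       for x, r_hat in zip(feats, assign[n][k]) if r_hat == r]
            q_sum = sum(q for q, _ in members)  # plays the role of Q_{k,r}
            if q_sum <= 0.0:
                continue  # region received no features: keep old parameters
            for d in range(D):
                mean = sum(q * x[d] for q, x in members) / q_sum
                var = sum(q * (x[d] - mean) ** 2 for q, x in members) / q_sum
                new_mus[k][r][d] = mean
                new_vars[k][r][d] = max(var, min_var)
    return new_mus, new_vars, new_pis
```

A caller would repeat `em_iteration` until the parameters stop changing by more than a chosen tolerance; the convergence criterion is left open here because the text does not specify one.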

Classification
The classification is performed by the following equation:

$$\hat{c} = \arg\max_{c} \; p(I \mid \Theta_c)\, p(c), \qquad (4)$$

where $\hat{c}$ is the resultant category, $c$ is a candidate category for classification, and $p(c)$ is the prior probability of category $c$, which is calculated as the ratio of the number of training images of category $c$ to the number of training images of all candidate categories. Since the constellation model is a generative model, it is easy to add categories or change candidate categories: training is needed only once, independently for each category, when that category is first added. To change already learnt candidate categories, it is only necessary to change the models used in the task. On the other hand, discriminative approaches build one classifier using all of the data for all candidate categories. They therefore have the following two weak points: a training process is needed every time candidate categories are added or changed, and all of the training data need to be kept for relearning.
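The decision rule can be sketched in a few lines of Python. The function name `classify` and the dictionary-based interface are hypothetical; the per-category log-likelihoods are assumed to come from separately trained models, and the prior is computed over the current candidate categories only, as described above.

```python
import math


def classify(log_likelihoods, train_counts):
    """Return the category maximizing log p(I | Theta_c) + log p(c).

    log_likelihoods maps each candidate category to log p(I | Theta_c);
    train_counts maps categories to their number of training images.
    """
    # The prior is the share of training images among the candidates only.
    total = sum(train_counts[c] for c in log_likelihoods)
    scores = {c: log_likelihoods[c] + math.log(train_counts[c] / total)
              for c in log_likelihoods}
    return max(scores, key=scores.get)
```

Adding a candidate category amounts to adding one entry to `log_likelihoods`; the models of the other categories are not touched, which reflects the generative-model property discussed above.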

Conditions.
We evaluate the effectiveness of multimodalization for constellation models by comparing two models: the Multimodal Constellation Model ("Multi-CM") and the Unimodal Constellation Model ("Uni-CM"). Uni-CM is equivalent to the proposed model with K = 1 (unimodal).
We also compare the proposed model's performance with two methods using BoF. "LDA + BoF" is a method using LDA. Each category $c$ is described individually by an LDA probabilistic model ($p(I \mid \Theta_c)$, like a model for bag of words), and an image $I$ is classified by (4). "SVM + BoF" is a method using SVM. In the feature space of BoF (whose dimension is the codebook size), SVM classifies an image $I$ described by a BoF feature vector. Multi-CM, Uni-CM, and LDA + BoF are generative models, SVM + BoF is a discriminative approach, and LDA is a multimodal model.
Next, we discuss the influence of the hyperparameters K and R on the classification rate, compare the proposed model's performance with that of Fergus's model under a restricted setting made necessary by the computation time of Fergus's model, and quantitatively validate the two previously mentioned advantages (b) and (c) of the constellation model.
Two image datasets were used for the experiments. The first is the Caltech Database [2] ("Caltech"), and the other is the dataset used in the PASCAL Visual Object Classes Challenge 2006 [18] ("Pascal"). In preparation for the experiments, object areas were clipped from the images as target images using the object area information available in the datasets, because these datasets are not designed for the task targeted in this paper (classifying images in which the target object occupies most of the image area into the correct categories). We defined the task as classifying target images into the correct categories (i.e., for a ten-category dataset, it is ten-class classification). The classification process was carried out separately for each dataset (see endnote 4). Half of the target images were used for training and the rest for testing.
Caltech consists of four categories. Figure 1 shows examples of the target images. The directions of the objects in these images are roughly aligned, but their appearances vary widely. Table 1 shows the number of object areas in each category. Pascal has ten categories. Figure 2 shows examples of the target images. The direction and the appearance of objects in Pascal vary widely. Furthermore, the poses of objects in some categories (e.g., Cat, Dog, and Person) vary considerably. Therefore, classification of Pascal images is considered more difficult than that of Caltech images. Table 2 shows the number of object areas in each category.
The same local feature data are used for all the methods compared here, to exclude the influence of differences in local features on the classification rate. In addition, we repeated each experiment ten times with randomly changed training and test images and used the average classification rate over the ten runs for comparison.
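The evaluation protocol (half of the images for training, the rest for testing, repeated ten times with random splits and averaged) could be sketched as below. The function name `repeated_half_split`, the `evaluate` callback, and the fixed seed are hypothetical; whether the original split was stratified per category is not stated, so this sketch uses a plain random split.

```python
import random


def repeated_half_split(samples, evaluate, trials=10, seed=0):
    """Average evaluate(train, test) over random half/half splits.

    evaluate is a caller-supplied function that trains on `train`,
    classifies `test`, and returns a classification rate.
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(trials):
        shuffled = list(samples)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        rates.append(evaluate(shuffled[:half], shuffled[half:]))
    return sum(rates) / len(rates)
```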
In this paper, we empirically set K (number of components) to 5 and R (number of regions) to 21. For the local features, we used the KB detector [15] for detection and the Discrete Cosine Transform (DCT) for description. The KB detector outputs the positions and scales of local features. Patch images are extracted using this information and are described by the first 20 DCT coefficients, excluding the DC component. Therefore, the dimension of a feature vector x is 23 (A: 20, X: 2, S: 1).
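How a 23-dimensional feature vector of this kind could be assembled is sketched below. Several details are assumptions rather than facts from the paper: the patch is taken to be a small square grayscale array (8 x 8 in the usage note), the DCT is the orthonormal 2D DCT-II, and "the first 20 coefficients" are taken in a zigzag-like order of increasing frequency. The function names are hypothetical.

```python
import math


def dct2(patch):
    """Orthonormal 2D DCT-II of a square patch given as a list of rows."""
    n = len(patch)

    def alpha(u):
        return math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)

    return [[alpha(u) * alpha(v) * sum(
                patch[y][x]
                * math.cos((2 * y + 1) * u * math.pi / (2 * n))
                * math.cos((2 * x + 1) * v * math.pi / (2 * n))
                for y in range(n) for x in range(n))
             for v in range(n)]
            for u in range(n)]


def feature_vector(patch, position, scale, num_coeffs=20):
    """Concatenate appearance A (DCT coefficients without DC),
    position X (2 values) and scale S (1 value)."""
    coeffs = dct2(patch)
    n = len(patch)
    # Assumed ordering: increasing u + v, i.e. low frequencies first.
    order = sorted(((u, v) for u in range(n) for v in range(n)),
                   key=lambda uv: (uv[0] + uv[1], uv[0]))
    # order[0] is (0, 0), the DC coefficient, which is skipped.
    appearance = [coeffs[u][v] for u, v in order[1:num_coeffs + 1]]
    return appearance + [position[0], position[1], scale]
```

For an 8 x 8 patch, `feature_vector(patch, (x, y), s)` returns 20 appearance values followed by the two position values and the scale, matching the 23 dimensions stated above.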

Effectiveness of Multimodalization and Comparison to BoF.
To validate the effectiveness of multimodalization, we compared the classification rates of Multi-CM and Uni-CM and applied Student's t-test to verify the difference. We also compared the proposed method with LDA + BoF and SVM + BoF, which are related methods. These related methods have a hyperparameter that represents the codebook size (k of k-means) for BoF. The number of assumed topics for LDA corresponds to the number of components K of Multi-CM. In the following results, we show the best classification rates obtained by changing these hyperparameters. Table 3 shows the classification rates of Multi-CM and Uni-CM together with the standard deviations over ten trials. In addition, we verified by Student's t-test that the difference between Multi-CM and Uni-CM is significant for both datasets (P < 0.01). We attribute this to the fact that multimodalization of a constellation model is effective for datasets such as Caltech and Pascal, which contain various appearances within a category (e.g., Caltech-Faces: different persons, Pascal-Bicycle: direction of bicycles).
Since the proposed model shows a better classification rate than both LDA + BoF (generative model) and SVM + BoF (discriminative approach), the results indicate that the constellation model has better classification ability than the methods based on BoF, whether generative or discriminative.

Influence of the Number of Components K.
Here we discuss the influence of K, one of the hyperparameters of the proposed method, on the classification rate. K is varied from 1 to 9 in increments of 2, and the classification rates at each K are compared. K = 1 corresponds to Uni-CM, and K ≥ 2 corresponds to Multi-CM. The number of regions R is fixed to 21. Figure 3 shows the results. Note that the scale of the vertical axis differs between the graphs because the difficulty of the datasets differs greatly. By comparing the graphs, we can see that the classification rates roughly saturate at K = 5 (Caltech) and K = 7 (Pascal). This is understandable because the appearance variation of objects is larger in Pascal than in Caltech. However, we can choose K = 5 as a common setting because the classification rates differ only slightly when K ≥ 2. In addition, the fact that the classification rates for K ≥ 2 are better than for K = 1 shows the effect of multimodalization.

EURASIP Journal on Image and Video Processing
Next, we discuss the number of effective components for each category. We define an effective component as a component that satisfies $\pi_k > (1/K) \cdot 0.9$. Here $1/K$ is the value of $\pi_k$ when all components are effective and equally so. We use this value as the minimum and apply the factor 0.9 to allow some variation. Figure 4 shows graphs whose horizontal axes are the number of components K and whose vertical axes are the number of effective components. The graphs show all categories for Caltech and some categories for Pascal. From the graphs, we can see that the number of effective components saturates at a certain point and that the number of effective components varies by category. We consider that this value roughly indicates the number of object appearances in each category. The result also shows that if K is increased beyond what is necessary, the number of components learnt as effective components does not change.
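The effective-component criterion is simple enough to state as code. The sketch below is only an illustration; the function name `count_effective_components` and the example weights in the usage note are hypothetical, not values from the experiments.

```python
def count_effective_components(weights, tolerance=0.9):
    """Count components whose weight pi_k exceeds tolerance * (1 / K)."""
    threshold = tolerance / len(weights)
    return sum(1 for w in weights if w > threshold)
```

With made-up weights `[0.3, 0.3, 0.3, 0.05, 0.05]` and K = 5, the threshold is 0.18, so three components count as effective.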
Moreover, from this result, we can see that the variation of appearance in Pascal is generally larger than that in Caltech. Actually, when K = 9, the average numbers of effective components for all categories are 3.2 for Caltech and 4.0 for Pascal.

Influence of the Number of Regions R.
To discuss the influence of R, another hyperparameter of the proposed method, on the classification rate, we evaluated the classification rates while increasing R from 3 to 21 in increments of 3. The classification rate at each R is shown in Figure 5. The number of components K is fixed to 5. The results show the classification rates of both Uni-CM and Multi-CM.
The improvement of classification rates saturates at around R = 9 for Caltech and at R = 21 for Pascal. In addition, at all R, the classification rates of Multi-CM are higher than those for Uni-CM, so the effectivity of multimodalization is also confirmed here.
For Fergus's constellation model, R = 6-7 is the largest setting for which training can be finished in realistic time. Thanks to the speed-up techniques, the proposed method could increase R (number of regions) until the improvement of the classification rate saturated while still training in realistic time. Therefore the proposed speeding-up techniques contributed not only to the realization of multimodalization but also to the improvement of the classification performance.

Object Appearances Described in Each Component.
We discuss the object appearances described by the model components. The model was trained with K = 10 to make it easy to see which appearances are learnt by each component. We apply the learnt multimodal constellation model to test images of the same category and calculate, for each test image and each component, the contribution rate $\{\prod_{l}^{L} G(x_l \mid \theta_{k,\hat{r}_{k,l}})\} \cdot \pi_k$. The component with the largest contribution rate is taken as the component to which the test image belongs. Figures 6 and 7 show example images belonging to each component; five dominant components out of the ten are shown. In Caltech-Cars Rear, the groups seem to be formed mainly by differences in car type. In contrast, the groups in Caltech-Motorbikes seem to be formed by differences in background appearance, because the differences among the backgrounds are larger than those among the bike appearances. In Pascal-Car, the direction of the objects and the luminance of the object bodies seem to have affected the grouping. Luminance affects the grouping because the DCT of luminance is used for local feature description. In Pascal-Motorbike, the direction of the objects affects the grouping. Pascal-Cow and Pascal-Cat have wide appearance variation and are difficult to group, but the direction of the bodies and the texture roughly form groups.
For reference, we also compare with the computation time reported in [2]. Note that this is not an exact comparison because the experimental conditions (the computers used and the implementations) probably do not match. According to [2], Fergus's model takes 24-36 hours per model for R = 6-7 and L = 20-30 per image, using 400 training images. In contrast, our model with K = 1 (unimodal) takes around ten seconds per model under the same conditions. In addition, even when K ≥ 2 (multimodal), it takes only a few tens of seconds.

Validation of the Advantage of the Constellation Model.
Here, we quantitatively validate the advantages of the constellation model described in Section 1; (b) Description accuracy is higher than BoF due to continuous value expression, and (c) position and scale information ignored by BoF can be used effectively.
First, advantage (b) is validated. BoF and the constellation model should be compared under conditions in which the only difference is whether the expression is continuous (a probability function) or discrete (a histogram formed by the numbers of local features corresponding to each codeword). Therefore we compared LDA + BoF, which, like the constellation model, is a generative multimodal model, with Multi-CM without the position and scale information that LDA + BoF does not use ("Multi-CM no-X,S"). Next, to validate advantage (c), we compared Multi-CM no-X,S with the normal Multimodal Constellation Model. Table 5 shows the classification rates of these three methods. The classification rate of Multi-CM no-X,S is better than that of LDA + BoF, demonstrating the superiority of continuous value expression. The classification rate of Multi-CM is in turn better than that of Multi-CM no-X,S. This shows that the constellation model can make adequate use of position and scale information.

Conclusion
We proposed a multimodal constellation model for object category recognition. Our proposed method can train and classify faster than Fergus's constellation model and describe categories with a high degree of accuracy even when the objects in the target categories have various appearances.
The experimental results show the following benefits of the proposed method: (i) performance improvement by multimodalization, and (ii) performance improvement by the speeding-up techniques, which enable the use of more regions in realistic time.
We also compared Multi-CM to the methods using BoF, LDA + BoF, and SVM + BoF. Multi-CM showed higher performance than these methods. We also compared Multi-CM in the unimodal condition with Fergus's model and confirmed that the simplification of the model structure for the speeding-up in the proposed model does not affect the classification performance. Furthermore, we quantitatively verified the advantages of the constellation model; (b) Description accuracy is higher than BoF due to continuous value expression, and (c) position and scale information ignored by BoF can be used effectively. In Sections 1 and 3, by comparing generative and discriminative approaches, we also showed that the advantage (a) of the constellation model is that candidate categories can be easily added and changed.
In future work, we will apply our method to object detection and investigate in depth the relationship between the hyperparameters and the appearance variations, which seem to differ for each category.

Endnotes
1. The number of regions is assumed to be five to seven.
2. Since advantages (b) and (c) are not often described in other papers, we validate them quantitatively in Section 4.8.
3. There are some extended BoF methods that consider spatial information (e.g., [19,20]).
4. Caltech101 and Caltech256 exist as datasets designed for the task targeted in this paper, but they are not suitable for the experiments in this paper because the number of images in each category is small.
5. Fergus's original paper [2] set R to 6-7, but we set R to 3 because of the computational cost. For evaluation, [2] calculated only one classification rate, whereas we used the average rate over ten classification runs; thus R = 6-7 was not a realistic setting for our experiments.