Contextual object categorization with energy-based model

Object categorization is an active topic in image mining. Contextual information between objects is an important part of the semantic knowledge of an image. However, previous research on object categorization has not made full use of this contextual information, especially the spatial relations between objects. In addition, object categorization methods generally use probabilistic graphical models to combine contextual information with the appearance of objects, and therefore almost inevitably have to evaluate an intractable partition function for normalization. In this work, we introduce fully-connected fuzzy spatial relations, including directional, distance and topological relations between object regions, so that the spatial relational information can be fully utilized. The spatial relations are then considered together with the co-occurrence and appearance of objects in an energy-based model, where the energy function is defined in terms of the region-object association potential and the configuration potential of objects. By minimizing the energy function of the whole image arrangement, we obtain the optimal label set for the image regions and avoid evaluating the intractable partition function of conditional random fields. Experimental results show the validity and reliability of the proposed method.


Introduction
Object categorization, which classifies image regions into semantic concepts, is generally performed based on the appearance (i.e. visual features) of the regions. As is well known, there is a semantic gap between low-level visual features and high-level semantics, so object categorization is a challenging task. Human beings, however, correctly recognize and classify objects through domain knowledge about certain scenes, which includes contextual information as well as the appearance of objects. Contextual information about objects in images is important knowledge that helps object recognition and classification [1,9]. For example, when an object is isolated, as shown in (a) of Figure 1, its appearance is not enough to recognize it as a specific category. In (b) of Figure 1, where the whole scene is presented, we can recognize it as a boat using the contextual information about the object in the scene.
Recently, many researchers have attempted to improve the accuracy of image categorization by combining contextual information with the appearance of objects [2,3,4,5,6,7,8]. They have combined co-occurrence and spatial arrangement with the appearance of local objects. However, most of them have not made full use of the spatial arrangement of objects in an image. How to efficiently consider the spatial context together with the co-occurrence and appearance of objects is therefore a challenging and significant problem for object categorization. This is the first motivation for the approach to contextual object categorization proposed in the rest of this paper. In addition, probabilistic graphical models (e.g. Markov Random Fields (MRFs) and Conditional Random Fields (CRFs)) have been employed in many previous works to combine contextual information with the appearance of objects [3,10,11,12,13,14]. However, normalizing these probabilistic models requires evaluating an intractable partition function. Energy-based models (EBMs) capture dependencies between variables by associating a scalar energy with each configuration of the variables [15,16,18]. This allows us to relax the strict probabilistic modeling and to avoid evaluating the intractable partition function. How to appropriately define the energy function over the configurations of the variables is therefore our second motivation.
Concretely, we propose a two-phase strategy for object categorization, which consists of appearance-based initial labeling and context-based label refinement. By combining contextual information with the appearance of objects, the categorization accuracy can be improved. In our categorization model, the spatial relations, defined as fully-connected fuzzy spatial relations including directional, distance and topological relations between object regions, are considered together with the co-occurrences by employing an EBM, for which we define an energy function with the region-object association potential and the configuration potential of objects. By minimizing the energy function of the whole image arrangement, we obtain the optimal label set for the image regions without evaluating the intractable partition function of probabilistic graphical models. Existing contextual object categorization models either use only the co-occurrence [3,16], or use only fixed spatial relations with simple frequency counting [12,14]. In comparison, we investigate the spatial arrangement in an image more thoroughly by defining fuzzy contextual relations. Moreover, the proposed EBM-based object categorization model overcomes the defects of the related works that rely on undirected graphical models and often lead to intractable partition functions [3,12]. EBMs for object categorization were also used in [16,18], similarly to this work. However, those models did not fully reflect the spatial arrangement of an image. The contributions of our paper are as follows:
1) We propose fuzzy spatial relations for object categorization and so overcome the defects of related works that use only predefined and fixed spatial relations and consider simple frequency counting as the spatial constraints between objects. This effectively reflects the contextual arrangement of an image and improves the categorization accuracy of objects.
2) We propose an EBM-based model for object categorization, which combines appearance with the co-occurrence and spatial constraints of objects in an image. This overcomes the defects of the related works that rely on undirected graphical models and often lead to intractable partition functions.

Related work
As mentioned above, traditional research on object categorization has mainly relied on appearance features of local image regions. Appearance features such as color, texture, edge and shape cues can discriminate between object categories to a certain extent. As the field has matured, researchers have paid increasing attention to the important role of contextual information in object categorization. Detailed surveys of various contextual models for object categorization are presented in [17,18,22].
There are two levels of context, namely global context and local context. Global context, which takes into account the object-scene interactions, can be used to restrict the objects that may appear in the scene [11,24], while local context takes into account the interactions between objects [12], patches [2], or pixels [13]. Compared with global context, local context is easily accessible from training data without expensive computations [17]. Moreover, when many different objects appear in an image, the contextual interactions between objects are the most beneficial for capturing information about the objects. Therefore, in this work we deal with an object-level contextual method.
There are three types of context, namely semantic context (co-occurrence of objects), spatial context (spatial relations between objects) and scale context (size constraints for objects) [17]. One of the simplest and most widely used contextual methods is to model the co-occurrence frequency of objects. Rabinovich et al. [3] and Felzenszwalb et al. [25] took into account how frequently a target object co-occurs with other objects and adjusted the initial labels that local detectors assigned to segments. Another contextual method is to model the spatial relations between objects as well as the co-occurrence information. Object-level spatial contextual techniques use the spatial configuration of objects as one of the sources from which the object categories can be inferred. There are many approaches that exploit the spatial constraints between objects [10,12,14]. Zhou et al. [8] extended the contextual methods to encode spatial relations between objects, where the spatial relations were quantized into four predefined relations, namely above, below, inside and around. However, most spatial contextual methods use only predefined and fixed spatial relations and consider simple frequency counting as the spatial constraints between pairs of objects. Therefore, in this work we deal with fully-connected fuzzy spatial relations as well as the co-occurrence relations between objects in an image.
Generally, there is no causal relationship between image regions, so undirected graphical models such as MRFs or CRFs are more suitable for modeling the interactions between object regions. Carbonetto et al. [2] considered the relations between image patches and proposed an MRF model that combines appearance feature vectors with spatial relations for object recognition. Heesch et al. [14] modeled the spatial and topological relations between objects in an image by employing MRFs with asymmetric Markov parameters. To exploit both local features and contextual information, Torralba et al. [20] introduced Boosted Random Fields (BRFs), which use boosting to learn the graph structure and local evidence of a CRF. Yuan et al. [10] employed simple grid-structured graphical models to describe the spatial dependencies between objects in an image. The CRF model provides a way to combine appearance and contextual information, and can directly predict the categories of regions by modeling the conditional distribution. Therefore, to model the contextual interactions between objects, in this work we also introduce a fully-connected CRF model, where the nodes correspond to the regions and the edges correspond to the spatial relations between the object regions, and employ the EBM to handle the partition function of the CRF.
On the other hand, a post-processing strategy has often been adopted to improve the accuracy of object categorization, in which the outputs of appearance-based object classifiers (i.e. the initial labels) are refined by applying different contextual models. Saathoff et al. [4] proposed a fuzzy constraint satisfaction problem (FCSP) formulation for exploiting spatial prototypes and refined the initial labeling results of an appearance-based SVM classification. Papadopoulos et al. [18] introduced a genetic algorithm for refining region labels using the confidence degrees from SVM classifiers as well as the spatial constraints between objects. Galleguillos et al. [12] first classified the image regions using a bag-of-features classifier, which assigned several candidate labels to each region, and then picked a single label for each region using a CRF model that combines the confidence degrees of regions for object concepts with the co-occurrence and spatial relations between objects. In this work, we also adopt a two-phase strategy, namely appearance-based initial labeling and context-based label refinement, to improve the accuracy of object categorization.
Finally, we mention other recent approaches for exploring contextual information. Lee et al. [21] introduced an object-graph descriptor for the discovery of unknown object categories, which encodes the object-level co-occurrence patterns. Singaraju et al. [27] developed a random field model for joint categorization and segmentation of objects, in which higher-order potentials encode the classification cost of a histogram extracted from objects belonging to different categories. Jain et al. [28] proposed a latent CRF model that captures the relations between features and visual words, the relations between visual words and object categories, and the spatial relations between visual words. Angin et al. [29] proposed a simple iterative algorithm for object categorization that exploits the global co-occurrence frequencies of objects. Sun et al. [30] proposed an object categorization method that combines local feature context with SVM classifiers.
Compared with the research mentioned above, the method proposed in this work is easier to implement, describes the context more comprehensively and categorizes objects more accurately. Concretely, in order to improve the object categorization accuracy, we adopt a two-phase strategy and use fuzzy SVM classifiers [23] for appearance-based initial labeling. Then, we propose fuzzy spatial relations between objects to overcome the defects of related works that use only predefined and fixed spatial relations and consider simple frequency counting as the spatial constraints between objects. For object categorization, we employ a CRF model that combines appearance with the co-occurrence and spatial constraints of objects in an image. In addition, we introduce an EBM, in which we define the region-object association potential and the configuration potential of objects, to overcome the defects of the related works that rely on undirected graphical models and often lead to intractable partition functions. Experiments on four benchmark image datasets (LabelMe, SCEF, MSRC v2 and PASCAL VOC2010) show that the proposed method improves the performance of object categorization compared with the state-of-the-art results.
2. Methodology

2.1. Problem formulation
Figure 2 depicts the overall flowchart of the contextual object categorization proposed in this work. The overall framework consists of two phases. The first phase is the appearance-based region labeling. First, the test images are manually or automatically segmented into object regions. In this work, we use both manual and automatic segmentation. Since image segmentation is not the main topic of this work, we employ the stability-based clustering method proposed by Rabinovich et al. [3] for automatic segmentation. Then, the visual features of each segmented region are extracted. Afterward, SVM classifiers are employed to classify the image regions into semantic categories. The results of this classification are the initial labels and the confidence degrees of regions for object concepts (cf. Section 2.2).
The second phase is the refinement of the initial labels by using the contextual model. In this work, we employ an undirected graphical structure (i.e. a CRF) to combine the appearance with the co-occurrence and spatial relations between objects, and infer the final label of each region by using an energy-based model, which avoids the estimation of the intractable partition function in CRFs (cf. Section 2.4). Here, we take into account the fuzzy spatial relations in the form of a spatial context matrix, which makes full use of the spatial arrangement of objects in an image (cf. Section 2.3).

Figure 2. The overall flowchart of contextual object categorization

2.2. Appearance-based initial labeling
The Support Vector Machine (SVM) is one of the most widely used techniques in semantic image processing owing to its strong classification ability and its suitability for high-dimensional data. For initial region labeling, i.e. the assignment of object concepts to segmented regions based solely on appearance features, we employ SVM classifiers. The visual feature vector of each segmented region $s_i$ is the input of the SVMs, from which we obtain the concepts (i.e. candidate labels) $c_l$. Here, we use a non-linear Gaussian RBF kernel because of its performance in other pattern recognition applications [26]. The concrete process is as follows.
The conventional SVM was originally proposed for the binary classification problem. Therefore, we adopt $n$ one-to-rest SVM classifiers for multiclass classification. For the $l$-th binary SVM, the separating hyperplane, with decision function $D_l(v)$, classifies the data $v$ into class $c_l$ or into the remaining classes. To handle unclassifiable data in the multiclass SVM, we introduce a fuzzy membership degree that denotes the degree to which the data $v$ belongs to class $c_l$. When $-1 \le D_l(v) \le 1$, the data $v$ is partially classified into class $c_l$, with a fuzzy membership degree between 0 and 1. In particular, data lying on the hyperplane of the $l$-th SVM have a membership degree of 0.5. These cases are integrated in equation (1). The unclassifiable regions include two cases: in the first, the data may be classified into several classes, i.e. the membership degrees of the data $v$ are 1 for several classes; in the second, the data may not be classified into any class, i.e. the membership degrees of the data $v$ are all 0. To resolve unclassifiable regions, we modify the membership degree as in equation (2). In the following, we refer to the modified membership degree simply as the membership degree; it denotes the degree to which the data $v$ belongs to class $c_l$. Using this fuzzy labeling method, we determine the membership degrees of each class by equations (1) and (2). Thereby, the initial labels
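The mapping from SVM decision values to membership degrees described above can be sketched as follows. This is a minimal illustration and not the paper's equations (1) and (2): the linear ramp on [-1, 1] is an assumption chosen only because it gives 0.5 on the hyperplane and saturates at 0 and 1 outside the margin, the modified membership degree of equation (2) is not reproduced, and the names `membership` and `initial_labels` are hypothetical.

```python
def membership(decision_value: float) -> float:
    """Map a one-to-rest SVM decision value D_l(v) to a degree in [0, 1].

    Assumption: a linear ramp between the margins. The text only fixes
    the value 0.5 on the hyperplane (D = 0) and the values 1 and 0
    beyond the margins.
    """
    if decision_value >= 1.0:
        return 1.0
    if decision_value <= -1.0:
        return 0.0
    return (decision_value + 1.0) / 2.0


def initial_labels(decision_values: dict, top_n: int = 3) -> list:
    """Return the top-n candidate labels, sorted by membership degree."""
    scored = [(label, membership(d)) for label, d in decision_values.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical decision values of four one-to-rest SVMs for one region.
candidates = initial_labels(
    {"sky": 1.4, "sea": 0.2, "boat": -0.6, "road": -2.0}, top_n=2
)
# candidates now holds the labels "sky" and "sea" with their degrees.
```

Keeping several candidate labels per region, rather than only the best one, is what lets the later context-based refinement change an initial decision.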

Fuzzy spatial relations between objects
The spatial relations between two objects can be divided into three classes: directional relations $R_1$, distance relations $R_2$ and topological relations $R_3$. The directional relations include above, below, left and right; the distance relations include near and far; the topological relations include disjointed, bordering, invaded by and surrounded by [31]. Moreover, these spatial relations can be combined, because the spatial relation between two object regions can be described by several overlapping relations, e.g. invaded by from the left, right and near, etc.
Considering the characteristics of natural scene images, namely that left and right do not affect the object categorization, the directional relations can be divided into above, below and beside. Moreover, near and far are inverse to each other: the higher the degree of near, the lower the degree of far. Therefore, the distance relations can be described by near (or far) alone. Similarly, the topological relations can be described by surrounded by alone. A degree of surrounded by equal to 0 means disjointed or bordering. If the degree of surrounded by is greater than 0, the topological relation becomes invaded by or surrounded by. In particular, if the degree of surrounded by is 1, the topological relation is completely surrounded by. Figure 3 shows the spatial relations between objects in an image.

The spatial relation between two object regions $s_i$ and $s_j$ is described as

$$R_{ij} = \big(\mu_{R_1}(\theta_{ij}),\ \mu_{R_2}(d_{ij}),\ \mu_{R_3}(\rho_{ij})\big) \qquad (3)$$

where $\theta_{ij}$ denotes the angle between the horizontal axis and the line joining the centers of the two object regions; $d_{ij}$ denotes the minimum distance between the boundary pixels of the two object regions; and $\rho_{ij}$ denotes the ratio of the common perimeter between the two object regions to the perimeter of the first object region. In fact, the spatial relation between objects is a fuzzy relation. In equation (3), $\mu_{R_1}$, $\mu_{R_2}$ and $\mu_{R_3}$ are the fuzzy membership degrees of the spatial relation between the objects with respect to the directional, distance and topological relations, respectively. The membership degrees of the five fuzzy spatial relations (above, below, beside, near and surrounded by) are computed from these spatial relation descriptors. The membership functions use two parameters that determine the crispness of the fuzzy membership degrees for the distance relations and the topological relations, respectively, one cut-off value that divides the distance relations into the near and far fuzzy relations, and a second cut-off value that determines the surrounded by fuzzy relation.
Finally, the concrete directional relation between objects is determined by the maximum membership principle, i.e. the directional relation with the largest membership degree is selected.
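As an illustration of how membership degrees for the five fuzzy relations could be computed from the three descriptors (angle, minimum boundary distance, and common-perimeter ratio), the following sketch uses assumed functional forms: squared sine and cosine for the directional relations, and logistic curves with a crispness and a cut-off parameter for near and surrounded by. These forms, the angle convention (degrees, positive when the second region lies above the first) and all function names are assumptions made for this example and are not taken from the paper.

```python
import math


def directional_memberships(theta_deg: float) -> dict:
    """Degrees of above, below and beside from the center-line angle.

    Assumed form: the three degrees sum to 1 for any angle.
    """
    s = math.sin(math.radians(theta_deg))
    c = math.cos(math.radians(theta_deg))
    return {
        "above": max(0.0, s) ** 2,
        "below": max(0.0, -s) ** 2,
        "beside": c ** 2,
    }


def near_membership(distance: float, crispness: float, cutoff: float) -> float:
    """Degree of near: decreases with distance, equals 0.5 at the cut-off."""
    return 1.0 / (1.0 + math.exp(crispness * (distance - cutoff)))


def surrounded_membership(ratio: float, crispness: float, cutoff: float) -> float:
    """Degree of surrounded by: increases with the common-perimeter ratio."""
    return 1.0 / (1.0 + math.exp(-crispness * (ratio - cutoff)))


def dominant_direction(theta_deg: float) -> str:
    """Maximum membership principle over the directional relations."""
    degrees = directional_memberships(theta_deg)
    return max(degrees, key=degrees.get)
```

A larger crispness value makes the logistic curves approach a hard threshold at the cut-off, which recovers the fixed, crisp relations used in earlier work as a limiting case.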

Region labeling with energy-based model
EBMs capture the dependencies between variables by associating a scalar energy with each configuration of the variables. Correct configurations of the variables yield low energy, while incorrect configurations yield higher energy. Therefore, the goal of the training phase is to obtain an energy function that assigns low energy to correct configurations of the variables and higher energy to incorrect ones. The inference phase finds the values of the unobserved variables that minimize the energy. EBMs provide a unified framework for probabilistic graphical approaches (e.g. MRFs and CRFs). In contrast with MRFs and CRFs, EBMs do not require proper normalization and thus avoid evaluating the intractable partition function associated with estimating the normalization factor. Moreover, the absence of the partition function provides more flexibility in the modeling.
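The reason minimizing the energy sidesteps the partition function can be made explicit with the standard Gibbs form of a CRF. The notation below ($A$ for a label configuration, $I$ for the image, $E$ for the energy) is chosen for this illustration and is an assumption about how the model is written.

```latex
P(A \mid I) = \frac{\exp\big(-E(A, I)\big)}{Z(I)},
\qquad
Z(I) = \sum_{A'} \exp\big(-E(A', I)\big)

A^{*} = \arg\max_{A} P(A \mid I) = \arg\min_{A} E(A, I)
```

Because $Z(I)$ does not depend on $A$, it cancels in the comparison between configurations. Evaluating it directly would require a sum over all $n^k$ configurations for $k$ regions with $n$ candidate labels each.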
In this work, the region labeling is transformed into an energy minimization problem by the EBM, where the energy function integrates the appearance, co-occurrence and spatial arrangement information of objects. An image $I$ that includes $k$ regions can be described as an undirected graph, where the nodes correspond to image regions and the edges correspond to dependencies between regions. Here, all possible connections between nodes are considered. Through the region classification by appearance features, each region in an image is associated with $n$ candidate labels. Figure 4 shows the model for $k = 5$. Using the EBM, the goal of the region labeling of image $I$ is to find the best value set $A^*$ of association variables, which minimizes the energy.

Figure 4. Energy-based model with fully-connected CRF. The region-object association potential is defined on the vertical lines, and the configuration potential is defined on the horizontal lines.
In order to find the best value set $A^*$, i.e. the best configuration of labels for image $I$, we define an energy function that measures the suitability of any configuration of labels. The energy function contains a factor that adjusts the influence of the appearance-based classifiers relative to the concept occurrence. Using discriminative classifiers such as SVMs with appearance features of local regions, every candidate label $c_l$ for each region $s_i$ is given a corresponding belief degree $\mu_{c_l}(s_i)$, which denotes the degree to which the region belongs to that class. The posterior probability given by the appearance-based object classifiers is therefore naturally derived from these belief degrees. The configuration potential denotes the contextual interactions between objects, which include the spatial context and the semantic context (i.e. co-occurrence). There are three adjustable parameters in the energy function. These parameters are selected by trial on a validation dataset.
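To make the structure of such an energy function concrete, the sketch below combines a region-object association term with a configuration term over all region pairs. It is only an illustration under stated assumptions, not the paper's energy function: the negative-log form of both potentials, the smoothing constant `EPS`, the single `weight` parameter (the paper has three adjustable parameters), and the way fuzzy relation degrees are matched against a learned per-pair relation prior are all choices made for this example, and every name is hypothetical.

```python
import math

EPS = 1e-6  # smoothing so that log(0) never occurs (assumption)


def association_potential(labels, posteriors):
    """Sum of -log appearance posterior of the chosen label per region."""
    return -sum(
        math.log(posteriors[i].get(label, 0.0) + EPS)
        for i, label in enumerate(labels)
    )


def configuration_potential(labels, cooccurrence, spatial_prior, relations):
    """Sum over all region pairs of a fully connected graph.

    relations[(i, j)] maps relation names to fuzzy degrees observed in
    the image; spatial_prior[(c_i, c_j)] maps relation names to values
    learned from training data (both structures are assumptions).
    """
    total = 0.0
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            pair = (labels[i], labels[j])
            prior = spatial_prior.get(pair, {})
            compatibility = sum(
                degree * prior.get(name, 0.0)
                for name, degree in relations[(i, j)].items()
            )
            total -= math.log(cooccurrence.get(pair, 0.0) + EPS)
            total -= math.log(compatibility + EPS)
    return total


def energy(labels, posteriors, cooccurrence, spatial_prior, relations, weight=1.0):
    """Total energy of one label configuration (lower is better)."""
    return weight * association_potential(labels, posteriors) + (
        configuration_potential(labels, cooccurrence, spatial_prior, relations)
    )
```

With this form, a configuration whose labels rarely co-occur, or whose observed spatial relation contradicts the learned prior, receives a large energy even when the appearance posteriors favor it.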
The region labeling thus reduces to finding the set of concepts that minimizes the energy in equation (14). At the inference stage, the inputs of the EBM are the appearance-based classification results (i.e. the posterior probabilities) and the fuzzy spatial relations between object regions. A particular object concept is then assigned to every region such that the overall energy is minimized. The Iterated Conditional Modes (ICM) algorithm [19], a method commonly used in graphical models such as CRFs and MRFs, is employed to find the minimum-energy configuration.
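The ICM procedure can be sketched as a coordinate-wise search: each region in turn is given the candidate label that lowers the total energy while all other labels are held fixed, and sweeps are repeated until nothing changes. The code below is a generic sketch that accepts any energy callable; the function names, the `max_sweeps` limit and the toy energy are assumptions for illustration. ICM converges to a local minimum, which is not necessarily the global one.

```python
def icm(candidates, energy, init=None, max_sweeps=20):
    """Iterated Conditional Modes over per-region candidate labels.

    candidates[i] is the list of candidate labels of region i and
    energy(labels) returns a float. The search starts from `init`, or
    from the first candidate of every region if `init` is omitted.
    """
    labels = list(init) if init is not None else [c[0] for c in candidates]
    best = energy(labels)
    for _ in range(max_sweeps):
        changed = False
        for i, options in enumerate(candidates):
            for option in options:
                if option == labels[i]:
                    continue
                trial = labels[:i] + [option] + labels[i + 1:]
                value = energy(trial)
                if value < best:  # accept only strict improvements
                    labels, best, changed = trial, value, True
        if not changed:
            break  # no single-region change lowers the energy
    return labels, best


def toy_energy(labels):
    """Hypothetical two-region energy: unary costs plus a same-label penalty."""
    unary = [{"sky": 0.1, "sea": 2.5}, {"sky": 0.9, "sea": 0.3}]
    cost = sum(unary[i][label] for i, label in enumerate(labels))
    if labels[0] == labels[1]:
        cost += 2.0
    return cost


final_labels, final_energy = icm([["sky", "sea"], ["sky", "sea"]], toy_energy)
# final_labels is ["sky", "sea"] for this toy energy.
```

Each sweep costs on the order of the number of regions times the number of candidate labels energy evaluations, which is why the number of candidate labels kept per region affects the running time.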

Experimental setup
In order to verify the effectiveness of the object categorization method proposed in this work, we carried out experiments in two settings: manually annotated regions and automatically segmented regions. We used four benchmark image datasets: LabelMe, SCEF, MSRC v2 and PASCAL VOC2010. We divided each dataset into two parts, i.e. a training set and a test set. Table 1 lists the sizes of the total, training and test sets, the number of segmented regions and the supported object concepts for the four datasets.
Table 1. The sizes of the total, training and test sets, the number of segmented regions and the supported object concepts for the four datasets used in these experiments

For each dataset, we used the training set to train the multiclass fuzzy SVM classifiers. The probabilities of co-occurrence and the fuzzy spatial relations between object concepts were computed using the training set as well. As mentioned in Section 2.2, we used SVM classifiers with a non-linear Gaussian RBF kernel to obtain the initial labels and the corresponding belief degrees. The kernel radii and cost parameters of the SVMs were chosen with reference to [26]. As noted in Section 2.1, automatic segmentation uses the stability-based clustering method of Rabinovich et al. [3], after which the visual features of each segmented region are extracted. For the LabelMe dataset, the visual feature vector is the concatenation of a 54-bin linear HSV color histogram, an 8-bin edge direction histogram and 24 features of the gray-level co-occurrence matrix, namely contrast, energy, entropy, homogeneity, inverse difference moment and correlation, each computed for four displacements. The performance is evaluated by the categorization accuracy, defined as the percentage of correctly classified regions. The reported results are average values.
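The evaluation measure, the percentage of correctly classified regions, can be computed overall and per category as in the sketch below. The function names are hypothetical, and how the paper averages its reported values is not specified here, so no averaging scheme is implied.

```python
from collections import defaultdict


def categorization_accuracy(true_labels, predicted_labels):
    """Percentage of regions whose predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)


def per_category_accuracy(true_labels, predicted_labels):
    """Accuracy in percent for each true category separately."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for t, p in zip(true_labels, predicted_labels):
        totals[t] += 1
        hits[t] += int(t == p)
    return {label: 100.0 * hits[label] / totals[label] for label in totals}
```

For example, with true labels `["sky", "sea", "boat", "sky"]` and predictions `["sky", "sea", "sea", "sky"]`, three of the four regions are correct, so the overall accuracy is 75%, while the per-category accuracy for "boat" is 0%.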

Results and discussion
In order to evaluate the importance of contextual information in an image and the effectiveness of the EBM-based contextual model, we compared the results of the proposed method with those of a non-contextual method and a CRF-based method on the LabelMe dataset. Figure 6 illustrates the comparison of the object classification accuracies of these three methods. It can be seen that the EBM-based method outperforms both the non-contextual method and the CRF-based method. The average categorization accuracy is 50.09% for the EBM-based method, compared with 46.19% for the non-contextual method and 49.53% for the CRF-based method. Specifically, compared with the non-contextual method, the categorization accuracy increased by 1% to 12% in most of the 11 categories when contextual information was used, except for the category "snow", which showed a small decrease. Compared with the CRF-based method, the average accuracy increased by about 0.6%; the gain is small because both methods use the co-occurrence and spatial information between objects together with appearance features.

Figure 6. Comparison of object categorization accuracies for LabelMe dataset
In order to verify the effectiveness of the proposed method, we compared the object categorization results with several state-of-the-art baselines, namely Galleguillos et al. [12], Escalante et al. [16] and Papadopoulos et al. [18], on the SCEF, MSRC v2 and PASCAL VOC2010 image datasets. Here, we adopted the MPEG-7 and SIFT descriptors used in the baselines as the visual features extracted from each segmented region. Table 3 shows the comparison of average categorization accuracies with the baselines. In the CRF-based method of Galleguillos et al. [12], the contextual information, i.e. location and co-occurrence, is combined with the appearance of objects by using a CRF model. They used only four fixed spatial relations (Above, Below, Inside and Around) and approximated the partition function by Monte Carlo integration. In contrast, Escalante et al. [16] and Papadopoulos et al. [18] introduced EBM-based models, which avoid the evaluation of the intractable partition function. In Escalante et al., the energy function integrates the appearance-based observation and simple concept co-occurrence statistics. In Papadopoulos et al., fuzzy spatial relations between objects are used to define the energy function, together with the co-occurrence and appearance of objects. However, they used only eight directional relations between objects (Above, Right, Below, Left, Below-Right, Below-Left, Above-Right and Above-Left) and did not fully reflect the spatial arrangement of an image. In the proposed method, the use of fuzzy spatial relations and the EBM-based categorization model overcomes the defects of related works in incorporating spatial contextual information for object categorization. As shown in Table 3, the proposed method outperforms the state-of-the-art baselines. Figure 7 illustrates the categorization accuracies per object category for the SCEF, MSRC v2 and PASCAL VOC2010 datasets.
It can also be seen that several categories, such as "road" and "sailing-boat" in the SCEF dataset and "aeroplane" in the MSRC v2 dataset, exhibit low categorization accuracies, i.e. lower than 50%. This is because the appearance-based classification results for these categories are insufficient. For the PASCAL dataset, the effect of incorporating contextual information between objects is not obvious, and for most categories the categorization accuracies are relatively low. This is because the images of this dataset contain very few objects, in which case the contextual relations between objects are not significant. Thus, contextual information is most effective in improving the appearance-based initial labeling results when many objects are present in an image.

Figure 7. Categorization accuracies per object category for SCEF, MSRC v2 and PASCAL VOC2010 datasets

Finally, we investigated the effect of the number of candidate labels on the final categorization accuracies. Figure 8 illustrates the curves of average accuracies obtained by taking into account the top n labels sorted by confidence degree. The experimental results show that, when the number of candidate labels is less than 3, the accuracies improve as the number increases for all image datasets, and when the number of candidate labels is greater than 5 (or 8 for the MSRC dataset), the accuracies converge. However, increasing the number of candidate labels also increases the computational complexity. This illustrates that a suitable number of candidate labels leads to the desired results with the proposed contextual model. Through this comparative analysis of the experiments, we have verified the suitability and effectiveness of the proposed contextual object categorization method with an energy-based model.
The proposed method makes full use of the spatial relational information between objects in an image. In addition, it employs the EBM which incorporates appearance with co-occurrence and spatial constraints of objects. Therefore, it improves the performance of object categorization.