KGEARSRG: Kernel Graph Embedding on Attributed Relational SIFT-Based Regions Graph

In real-world applications, binary classification is often affected by imbalanced classes. In this paper, a new methodology to solve the class imbalance problem that occurs in image classification is proposed. A digital image is described through a novel vector-based representation called Kernel Graph Embedding on Attributed Relational Scale-Invariant Feature Transform-based Regions Graph (KGEARSRG). The classification stage uses a procedure based on support vector machines (SVMs). The methodology is evaluated through a series of experiments performed on images from art painting datasets with varying imbalance percentages. Experimental results show that the proposed approach consistently outperforms the competitors.


Introduction
Support vector machines (SVMs) [1] have found applications in different fields such as image retrieval [2], handwriting recognition [3] and text classification [4]. With imbalanced data, in which the negative patterns, which are easier to identify and classify, significantly outnumber the positive patterns, which are more difficult to identify and classify, the performance of SVM drops considerably. In general, classifiers perform poorly on imbalanced datasets because they are designed to generalize from sample data and to output the simplest hypothesis that best fits the data. With imbalanced data, the simplest hypothesis is often that all patterns are negative; in essence, the positive patterns are not detected by the classifier and are wrongly included among the patterns classified as negative. Another important issue, which makes the classifier too specific, is sensitivity to noise, which in many cases leads to a wrong hypothesis. To counter this, some researchers modify the behavior of existing algorithms in order to make them more immune to noisy instances. These approaches are designed for balanced datasets and, with highly imbalanced datasets, every pattern is classified as negative. Furthermore, the positive patterns may be treated as noise and completely ignored by the classifier. A widely used approach is to bias the classifier so that it pays more attention to the positive patterns.
In this paper, the image classification problem applied to imbalanced datasets is addressed. In particular, an image is represented by a data structure called the Attributed Relational Scale-Invariant Feature Transform-based Regions Graph (ARSRG) [5]. This approach is used in order to capture local and structural image features. Moreover, ARSRG structures are mapped into a vector space through a graph kernel application.
Graph kernels aim at bridging the gap between the high representational power and the flexibility of graphs in terms of feature vector representation. The images to be classified, also called target images, are encoded through a set of distances with model images. The final representation is called Kernel Graph Embedding on Attributed Relational Scale-Invariant Feature Transform-based Regions Graph (KGEARSRG). The classification stage is managed through SVM, using the One versus All (OvA) paradigm, with the appropriate kernel modification called Asymmetric Kernel Scaling (AKS) [6].
The paper is organized as follows: Section 2 is dedicated to related literature. In Section 3 the proposed framework is described. Results and conclusions are, respectively, reported in Sections 4 and 5.

Related Work
One of the biggest challenges in the design of a classifier is resolving the imbalance between the number of images belonging to the class of interest to the user and the number of images in the other classes that share some features with it. The imbalance problem has been investigated in the literature on image classification.
In [7] the authors compare the performance of artificial immune system-based image classification algorithms to the performance of Gaussian kernel-based SVM in problems with a high degree of class imbalance.
In [8] a resampling-based methodology is introduced to solve the class imbalance problem in the classification of thin-layer chromatography (TLC) images. In addition, two approaches are proposed for image classification: one based on a hierarchical classifier and another using a multiclassifier system. Both classifiers are trained and tested using balanced datasets.
In [9] an approach for building a classification system for imbalanced data based on a combination of several classifiers is presented. A final classification system combining all the single trained classifiers is built. The approach can be defined as a sort of local undersampling, since each classifier uses a part of the majority samples, or global oversampling since the minority class is replicated M times.
In [10] two genetic programming (GP) methods for image classification problems with class imbalance are developed and compared. The first works on adapting a fitness function in GP in order to evolve classifiers with good individual class accuracy, while the second implements a multi-objective approach to simultaneously evolve a set of classifiers along a trade-off surface representing minority and majority class accuracies.
In [11] a methodological approach to classification of pigmented skin lesions in dermoscopy images is presented. SVM is used for the classification step. The class imbalance problem is addressed using various sampling strategies and through Monte Carlo cross validation.
In [12] the problem of diagnosing genetic abnormalities is addressed by classifying a small, imbalanced image database of fluorescence in situ hybridization signals whose types have different frequencies of occurrence.
Finally, in [13] the authors investigate how class imbalance in the available set of training cases can impact the performance of the resulting classifier as well as the properties of the selected set. The test phase is performed on a dataset for the problem of detecting breast masses in screening mammograms. Binary and k-nearest neighbor classifiers are adopted.

Kernel Graph Embedding on Attributed Relational SIFT-Based Regions Graph (KGEARSRG)
In this section a novel kernel graph with reference to the ARSRG structure is introduced. This representation is called Kernel Graph Embedding on Attributed Relational Scale-Invariant Feature Transform-based Regions Graph (KGEARSRG). Let $F = \{ARSRG_1, \dots, ARSRG_N\} \subset \mathbb{R}^D$ be a dataset composed of N ARSRG structures in a D-dimensional space. The aim is to reduce the dataset $F \subset \mathbb{R}^D$ to a low-dimensional space $y \in \mathbb{R}^d$ ($D \gg d$), such that ARSRG topological information is preserved. The framework attempts to find the optimal low-dimensional vector representation that best characterizes the similarity relationship between the node pairs in ARSRG structures.

Graph-Based Image Representation
The Attributed Relational SIFT-based Regions Graph (ARSRG) [5] is adopted to represent images and is built in two phases. The first phase, named feature extraction, extracts regions of interest (ROIs) from images by means of a segmentation technique and constructs a Region Adjacency Graph (RAG) [14] to encode spatial relations between the extracted regions. The second phase, named graph construction, builds the ARSRG itself, a graph formed by three levels: root node, RAG nodes and leaf nodes. The root node encodes the whole image and is connected to all the RAG nodes at the second level. RAG nodes encode geometric information among different image regions. Thus, spatially adjacent regions in the image are represented by connected nodes. Finally, leaf nodes represent the set of SIFT [15] descriptors extracted from the image in order to ensure invariance to different conditions (view-point, illumination and scale). This level provides two types of configurations: region based, in which a keypoint is associated with a region based on its spatial coordinates, and region graph based, in which keypoints belonging to the same region are connected by edges to encode spatial information. The ARSRG is created based on these two leaf node configurations.

Graph Embedding
The goal is to provide a fixed-dimensional vector space image representation in order to process the data for classification purposes. To this end, the concept of graph embedding is introduced. Consider a labeled set of sample graphs $T = \{g_1, \dots, g_n\}$ and a graph dissimilarity measure $d(g_i, g_j)$. T can be any kind of graph set and $d(g_i, g_j)$ can be any kind of dissimilarity measure. Subsequently, based on a selected set $P = \{p_1, \dots, p_m\}$ of m = n prototypes from T, the dissimilarity of a given input graph g to each prototype $p \in P$ is computed. This leads to m dissimilarities, $d_1 = d(g, p_1), \dots, d_m = d(g, p_m)$, which can be represented as an m-dimensional vector $(d_1, \dots, d_m)$. In this way, any graph, from the training set as well as from any other graph set, can be transformed into a vector of real numbers. More formally, let G be a graph domain,

$$T = \{g_1, \dots, g_n\} \subseteq G \qquad (1)$$

the training set of graphs, subject to the next mapping phase, and

$$P = \{p_1, \dots, p_m\} \subseteq G \qquad (2)$$

a set of prototype graphs. The vector of mapping between T and P is defined as

$$\Phi^{P}(g) = \big(d(g, p_1), \dots, d(g, p_m)\big) \qquad (3)$$

where $d(g, p_i)$ is any graph dissimilarity measure between graph g and the ith prototype. The distance d(·) does not obey the triangular inequality. Considering the graphs $g_1$, $g_2$, $g_3$ and the triangular inequality

$$d(g_1, g_2) + d(g_2, g_3) \ge d(g_1, g_3) \qquad (4)$$

Equation (4) is not verified, because the distance deals with complex structures in which nodes and edges can be associated in different ways during the matching, and its value varies according to the chosen metric. This is a very complex problem that cannot be generalized using the triangular inequality.
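The prototype-based embedding described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the adjacency-dictionary graph type, the function names and, in particular, `toy_dissimilarity` (a crude stand-in for a real graph dissimilarity measure) are assumptions introduced only for this example.

```python
from typing import Callable, Dict, List, Set

# Assumption: a graph is stored as an adjacency mapping, node -> set of neighbours.
Graph = Dict[str, Set[str]]


def toy_dissimilarity(g1: Graph, g2: Graph) -> float:
    """Hypothetical d(g_i, g_j): difference in node count plus difference in edge count."""
    def edge_count(g: Graph) -> int:
        # Every undirected edge appears twice in the adjacency mapping.
        return sum(len(neighbours) for neighbours in g.values()) // 2

    return abs(len(g1) - len(g2)) + abs(edge_count(g1) - edge_count(g2))


def embed(g: Graph, prototypes: List[Graph],
          d: Callable[[Graph, Graph], float]) -> List[float]:
    """Map graph g to the vector (d(g, p_1), ..., d(g, p_m))."""
    return [d(g, p) for p in prototypes]


triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
path3 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
single = {"a": set()}

# Embed the triangle with respect to two prototypes.
vector = embed(triangle, [path3, single], toy_dissimilarity)
print(vector)  # [1, 5]
```

Any other dissimilarity can be passed as `d` without changing `embed`, which mirrors the statement that the measure may be of any kind.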

Kernel Graph Embedding
Kernel graph embedding is a framework that extends the dot product space from the linear to the nonlinear case using the kernel trick. The goal is to map data from the original input space to an alternative higher dimensional space as

$$k(g_1, g_2) = \langle \phi(g_1), \phi(g_2) \rangle$$

with φ being the implicit pairwise embedding between $g_1$ and $g_2$. The concept of kernel graph embedding is applied in a particular way. Firstly, ARSRG structures are extracted from the images stored in the entire dataset. Then, the ARSRG set is divided into two subsets: the first is composed of target images and plays the role of the training set T in Equation (1), while the second is composed of model images and plays the role of the prototype set P in Equation (2). Now, the distance vector representing each ARSRG belonging to the target set is built as follows:

$$v_{t_j} = \big(k(ARSRG_{t_j}, ARSRG_{m_1}), \dots, k(ARSRG_{t_j}, ARSRG_{m_{|P|}})\big) \qquad (7)$$

The vector components encode the distance between $ARSRG_{t_j}$ and all ARSRGs contained in the prototype subset. Distance values are calculated through the graph kernel in [16] applied to ARSRG pairs, in particular

$$k_{all\,paths}(ARSRG_{t_j}, ARSRG_{m_i}) = \sum_{p_1 \in Path(ARSRG_{t_j})} \; \sum_{p_2 \in Path(ARSRG_{m_i})} k_{path}(p_1, p_2) \qquad (8)$$

where $Path(ARSRG_{t_j})$ and $Path(ARSRG_{m_i})$ are the sets of all paths in $ARSRG_{t_j}$ and $ARSRG_{m_i}$, located at the third level of the structures in the form of SIFT Nearest-Neighbor Graphs (SNNGs). An SNNG is defined as follows:

$$SNNG = (VF_{SIFT}, E_{SIFT}) \qquad (9)$$

where:
• $VF_{SIFT}$ is the set of nodes associated with SIFT keypoints;
• $E_{SIFT}$ is the set of edges: for each $v_i \in VF_{SIFT}$, an edge $(v_i, v_p)$ exists if and only if $dist(v_i, v_p) < \tau$.
Here $dist(v_i, v_p)$ is a Euclidean distance applied to the x and y positions of keypoints in the image, τ is a threshold value and p ranges from 1 to k, k being the size of $VF_{SIFT}$.
A path is, as usual, defined as a sequence of nodes, consisting of at least one node and without any repetitions of nodes. Defining paths as sequences of neighboring pairwise distinct edges makes it possible to define kernels based on subpaths. In this context, edge walk and edge path are defined. Given a graph $G = (V, E)$, an edge walk is defined as a sequence of edges $(e_1, \dots, e_l)$, where $e_i$, with $1 \le i < l$, is a neighbor of $e_{i+1}$, i.e., the two edges share an end node. An edge path p is defined as an edge walk without repetitions of the same edge. An edge path may contain the same node multiple times but every edge only once. An edge path p is an Euler path in the graph consisting exactly of the edges of p. In this case, edge paths are used. Moreover, $k_{path}$ is a positive definite kernel on two paths, defined as the product of kernels on edges and nodes along the paths. To this end, a relation $R(x', x'', x)$ is defined, where $x'$ is a path and $x''$ and $x$ are graphs. $R(x', x'', x) = 1$ if $x''$ is the graph created by removing all edges in $x'$ from $x$. $R^{-1}(x)$ is then the set of all possible decompositions of the graph $x$ via R into $x'$ and $x''$. R is finite because the length of a path is upper bounded by the number of edges, so the graph contains a finite number of paths. Now, a kernel $k_{path}$ on paths is defined as a product of kernels on nodes and edges in these paths, also named tensor product kernel. Moreover, a trivial graph kernel $k_{one} = 1$ is defined for all pairs of graphs. Now, an all-paths kernel is defined as a positive definite R-convolution as

$$k_{all\,paths}(ARSRG_{t_j}, ARSRG_{m_i}) = \sum_{p_1 \in Path(ARSRG_{t_j})} \; \sum_{p_2 \in Path(ARSRG_{m_i})} k_{path}(p_1, p_2)$$

where $Path(ARSRG_{t_j})$ and $Path(ARSRG_{m_i})$ are the sets of all paths in $ARSRG_{t_j}$ and $ARSRG_{m_i}$. In this case, the application of the graph kernel requires a preprocessing step. ARSRGs are compared through the algorithm in [5], which provides SNNG pairs for the final application of the graph kernel.
Essentially, edge path search is performed on graphs contained in the ARSRG regions. Based on this procedure, $k_{path}(p_1, p_2)$, in Equation (8), encodes the distance between SNNG pairs belonging to regions in the matching set. Finally, the kernel matrix K may be expressed as:

$$K = [K_{ji}], \qquad K_{ji} = k_{all\,paths}(ARSRG_{t_j}, ARSRG_{m_i})$$

where j indexes the target ARSRGs and i indexes the model ARSRGs. More precisely, given the sets of ARSRG targets and ARSRG models, matrix K encodes all pairwise distances between them. In particular, each row of the matrix K corresponds to the vector-based representation of an ARSRG in the target set, as in Equation (7). This shows how the vector-based representation can be adopted for the kernel matrix K and, subsequently, in the classification of ARSRG structures. Figure 1 shows an example of an image represented by KGEARSRG.
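As a rough illustration of the two ingredients described in this section, the following Python sketch builds an SNNG by thresholding Euclidean distances between keypoint positions and then evaluates an all-paths style kernel as a double sum over paths. It rests on several simplifying assumptions that are not from the paper: paths are enumerated as node sequences without repeated nodes rather than as edge paths, each path and its reverse are counted separately, and `k_path` is a toy kernel that only compares path lengths. Keypoint coordinates and the threshold are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Set, Tuple

Point = Tuple[float, float]
Graph = Dict[int, Set[int]]
Path = Tuple[int, ...]


def build_snng(keypoints: List[Point], tau: float) -> Graph:
    """Connect keypoints i and p whenever their Euclidean distance is below tau."""
    graph: Graph = {i: set() for i in range(len(keypoints))}
    for i, p in combinations(range(len(keypoints)), 2):
        if math.dist(keypoints[i], keypoints[p]) < tau:
            graph[i].add(p)
            graph[p].add(i)
    return graph


def all_paths(graph: Graph) -> List[Path]:
    """Enumerate every node sequence of length >= 1 without repeated nodes."""
    found: List[Path] = []

    def extend(path: Path) -> None:
        found.append(path)
        for nxt in graph[path[-1]]:
            if nxt not in path:
                extend(path + (nxt,))

    for start in graph:
        extend((start,))
    return found


def k_path(p1: Path, p2: Path) -> float:
    """Toy path kernel (assumption): 1 if both paths have the same number of nodes."""
    return 1.0 if len(p1) == len(p2) else 0.0


def all_paths_kernel(g1: Graph, g2: Graph) -> float:
    """Double sum of k_path over all pairs of paths from the two graphs."""
    return sum(k_path(p1, p2) for p1 in all_paths(g1) for p2 in all_paths(g2))


# Hypothetical keypoint positions for one target region and one model region.
target = build_snng([(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)], tau=1.5)
model = build_snng([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)], tau=1.5)
kernel_value = all_paths_kernel(target, model)
print(target, kernel_value)
```

Exhaustive path enumeration grows quickly with graph size, so this sketch is only practical for the small graphs used here.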

Computational Cost
The computational cost related to KGEARSRG can be divided into different parts:

1. The computational cost of extracting SNNG pairs between image regions through SIFT matching with graph matching.
2. Kernel graph computation, which involves:
(a) The direct product graph, whose size is upper bounded by $n^2$, where n is the number of nodes.
(b) The inversion of the adjacency matrix of this direct product graph; standard algorithms for the inversion of an $x \times x$ matrix require $x^3$ time.
(c) The shortest-path kernel, which requires a Floyd-transformation algorithm that can be performed in $n^3$ time. The number of edges in the transformed graph is $n^2$ when the original graph is connected. Pairwise comparison of all edges in both transformed graphs is required to determine the kernel value. $n^2 \cdot n^2$ pairs of edges are considered, which results in a total runtime of $n^4$.
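To make the cost of item (c) concrete, here is a small Python sketch of a Floyd transformation followed by a pairwise comparison of all edges of the two transformed graphs. The adjacency-matrix encoding and the delta kernel on shortest-path lengths are assumptions chosen for brevity, not details taken from the paper.

```python
from typing import List

INF = float("inf")
Matrix = List[List[float]]


def floyd_transform(adj: Matrix) -> Matrix:
    """All-pairs shortest-path lengths (Floyd-Warshall): three nested loops, n^3 time."""
    n = len(adj)
    dist = [[0.0 if i == j else (adj[i][j] if adj[i][j] > 0 else INF)
             for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist


def transformed_edges(adj: Matrix) -> List[float]:
    """Edges of the transformed graph: one per connected node pair, at most n(n-1)/2."""
    dist = floyd_transform(adj)
    n = len(dist)
    return [dist[i][j] for i in range(n) for j in range(i + 1, n) if dist[i][j] < INF]


def shortest_path_kernel(adj1: Matrix, adj2: Matrix) -> int:
    """Compare every edge pair (order n^2 * n^2 pairs) with a delta kernel on lengths."""
    edges1, edges2 = transformed_edges(adj1), transformed_edges(adj2)
    return sum(1 for a in edges1 for b in edges2 if a == b)


# Two tiny unweighted graphs: a 3-node path and a triangle.
path_graph = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(shortest_path_kernel(path_graph, triangle))  # 6
```

The nested comprehension in `shortest_path_kernel` is where the quartic term comes from: each transformed graph contributes a number of edges quadratic in n.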

Experimental Results
The proposed algorithm has been applied to the art painting classification problem [17]. The datasets adopted are publicly available and contain digitized images of paintings together with the corresponding ground truth data. The testing phase is organized into different blocks. First, a comparison with standard SVM [1] is performed. The second phase involves a comparison with C4.5 [18], RIPPER [19], L2 Loss SVM [20], L2 Regularized Logistic Regression [21] and ripple-down rule learner (RDR) [22]. Finally, performance is evaluated in terms of Adjusted F-measure (AGF) [6]. With reference to Equation (9), it is important to focus attention on two parameters. τ determines whether two SIFT keypoints are connected and is estimated based on the density and spatial proximity of keypoints. Clearly, closer spatial proximity leads to a smaller value. k represents the number of SIFT keypoints, i.e., the size of $VF_{SIFT}$, to be analyzed in order to construct the SNNG. Moreover, during classification, with reference to Equation (14), the values $k_1$ and $k_2$ are used to transform the kernel and are connected to the size of the input data of each class. The range adopted is variable, as they are very sensitive parameters, and can be read from the x and y axes of the graphs in Figures 2 and 3.
The framework has been developed in two different programming languages. The kernel was developed in MATLAB, while the part related to classification was developed in Java, with integration in the Waikato Environment for Knowledge Analysis (WEKA) (https://www.cs.waikato.ac.nz/ml/weka/).

Asymmetric Kernel Scaling (AKS) for Support Vector Machines
The main idea is to apply a conformal transformation, which preserves local angles. In [23], with the purpose of improving SVM performance, a (quasi) conformal transformation of the kernel is adopted. The aim is to increase the separation between the two classes near the boundary and therefore to widen the resolution in this area. The general form of the transformation is:

$$\tilde{K}(x, x') = D(x)\, K(x, x')\, D(x') \qquad (13)$$

where D(x) is defined to be positive. $\tilde{K}$ is Gaussian, and meets Mercer's conditions, if D(x) and K are Gaussian. The improvement consists in managing the training instances of the two classes separately [6]. The goal is to enlarge the areas located on the two sides of the separation boundary surface, in order to compensate for the asymmetry related to the minority instances. In the first step, SVM provides an approximate boundary position. Subsequently, the points are segmented into two sets, the negative $\chi^-$ and the positive $\chi^+$. In the second step, the following kernel transformation is applied:

$$D(x) = \begin{cases} e^{-k_1 f(x)^2}, & x \in \chi^+ \\ e^{-k_2 f(x)^2}, & x \in \chi^- \end{cases} \qquad (14)$$

where $k_1$ and $k_2$ are free parameters and f(x) is given by:

$$f(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b \qquad (15)$$

The class of an instance depends on the side of the hyperplane on which it falls, specifically on the sign of f(x). Support Vectors (SVs) are by definition the $x_i$ such that $\alpha_i > 0$, and b represents the bias. In this way, the space is enlarged by different amounts on the two sides of the boundary surface, which makes it possible to compensate for the imbalance in the data. Proper values for $k_1$ and $k_2$ are adopted to manage the transformed kernel during classification and are estimated based on the size of the input data of each class.
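The second AKS step can be sketched in a few lines of Python. The sketch assumes that the scaling function has the exponential form D(x) = exp(-k f(x)^2), with k = k_1 where f(x) is non-negative and k = k_2 otherwise, and that the decision values f(x) of the first-stage SVM are already available. The function names, the `gamma` parameter of the Gaussian kernel and all numeric values are hypothetical.

```python
import math
from typing import List, Sequence


def gaussian_kernel(x: Sequence[float], z: Sequence[float], gamma: float = 0.5) -> float:
    """Base Gaussian kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def scaling_factor(f_value: float, k1: float, k2: float) -> float:
    """Assumed D(x): k1 on the positive side of the boundary, k2 on the negative side."""
    k = k1 if f_value >= 0 else k2
    return math.exp(-k * f_value ** 2)


def transformed_kernel_matrix(points: List[Sequence[float]], f_values: List[float],
                              k1: float, k2: float, gamma: float = 0.5) -> List[List[float]]:
    """Conformally transformed kernel: D(x_i) * K(x_i, x_j) * D(x_j)."""
    d = [scaling_factor(f, k1, k2) for f in f_values]
    n = len(points)
    return [[d[i] * gaussian_kernel(points[i], points[j], gamma) * d[j]
             for j in range(n)] for i in range(n)]


# Hypothetical data: one point on the boundary (f = 0) and one on the negative side.
points = [(0.0,), (1.0,)]
f_values = [0.0, -1.0]  # decision values from a first-stage SVM (assumed given)
K = transformed_kernel_matrix(points, f_values, k1=0.5, k2=2.0)
print(K)
```

Because D(x) equals 1 on the boundary and decays with the distance from it, the transformed kernel keeps its resolution near the boundary, and choosing k_1 different from k_2 makes the effect asymmetric between the two classes.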

OvA Classification Setting
The classification performance of the AKS classifier is tested using the standard OvA paradigm on image classification problems with low, medium and high imbalance. One of the simplest multiclass classification schemes is to create N different binary classifiers, each one trained to distinguish the patterns of a single class from the patterns in the remaining classes. When a new pattern is ready for classification, the N classifiers are run and the classifier that outputs the largest positive value is chosen. In the case considered here, the dataset is skewed: a positive minority class of patterns has to be recognized against a negative majority class. At this point, the application of the OvA scheme is simple: each image of the target set, encoded by the KGEARSRG procedure, is submitted to the classification process; the membership class associated with this image is then considered against the others present in the target set. This step is performed for all membership classes in order to handle the multiclass case.
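The OvA decision rule described above can be sketched as follows. The binary scorers here are hypothetical centroid-based functions standing in for the N trained classifiers; class names, centroids and the query point are invented for the example.

```python
from typing import Callable, Dict, Sequence

Scorer = Callable[[Sequence[float]], float]


def ova_predict(x: Sequence[float], scorers: Dict[str, Scorer]) -> str:
    """Run every binary classifier and return the class with the largest output."""
    return max(scorers, key=lambda label: scorers[label](x))


def make_centroid_scorer(centroid: Sequence[float], margin: float = 1.0) -> Scorer:
    """Hypothetical binary scorer: positive inside a ball of radius `margin`."""
    def score(x: Sequence[float]) -> float:
        distance = sum((a - b) ** 2 for a, b in zip(x, centroid)) ** 0.5
        return margin - distance

    return score


# One scorer per class, each separating its own class from all the others.
scorers = {
    "painting_A": make_centroid_scorer((0.0, 0.0)),
    "painting_B": make_centroid_scorer((3.0, 0.0)),
    "painting_C": make_centroid_scorer((0.0, 3.0)),
}
predicted = ova_predict((2.6, 0.2), scorers)
print(predicted)  # painting_B
```

In each of the N binary problems, one class is positive and the remaining N - 1 classes together form the negative set, which is why OvA problems tend to be imbalanced.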

Datasets
The first dataset [24] is composed of two sets of images belonging to 16 classes. The first set contains 15 paintings belonging to Olga's gallery (http://www.abcgallery.com/index.html) and is named the originals set. The second set contains 100 photos of paintings, taken by tourists with different digital cameras, available on Travel Webshots (http://travel.webshots.com) and is named the photographs set. We adopt the originals set as the model set and the photographs set as the target set, with images to be classified. The second dataset [5] is composed of 99 painting photos taken from the Cantor Arts Center (http://museum.stanford.edu/). The images are divided into 33 classes. In addition, in order to apply image classification tasks, 10 additional images, belonging to the 33 classes, have been added. We adopt the first 99 painting photos as the model set and the 10 additional images as the target set, i.e., the images to be classified. Subsequently, the imbalance rates used to formulate the different classification problems are calculated. Tables 1 and 2 show the configuration of the datasets; the last column reports the imbalance rate (IR), computed according to Equation (16).
IR is defined as the ratio between the percentage of images belonging to the majority class and the percentage of images belonging to the minority class:

$$IR = \frac{\%maj}{\%min} \qquad (16)$$
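As a quick arithmetic check of this definition, the following Python snippet computes IR from class counts. The counts are hypothetical and are not taken from Tables 1 and 2.

```python
def imbalance_rate(n_majority: int, n_minority: int) -> float:
    """IR = percentage of majority-class images / percentage of minority-class images."""
    total = n_majority + n_minority
    pct_majority = 100.0 * n_majority / total
    pct_minority = 100.0 * n_minority / total
    return pct_majority / pct_minority


# Hypothetical OvA problem: 10 positive images against 90 negative ones.
print(imbalance_rate(90, 10))  # 9.0
```

Since both percentages share the same denominator, IR reduces to the ratio of the two class counts.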

AKS vs. SVM
This section describes the comparison between AKS and SVM. The experiments are conducted by first performing a standard SVM classification, using a Gaussian kernel with base 0.5 and C = 10, tuned through a grid search. Subsequently, the two-step AKS described above is applied. In the second step of AKS, the classification with the transformed kernel is performed. Grid search and fivefold cross validation have been applied to find the optimal values of the parameters, and AGF has been used as the performance measure.
It can be seen in Figure 2 that the proposed method needs a wide search of the parameter space for fine tuning, and that performance is very sensitive to the choice of parameters. Outside the narrow interval of $k_1$ and $k_2$ that yields an effective improvement, performance tends to drop quickly. With a careful choice of parameters, the proposed approach consistently dominates standard SVM. In Figure 3, by contrast, performance exceeds that of standard SVM only at a single peak.

Comparison Results
Further tests have been conducted in order to perform a comparison with C4.5 [18], RIPPER [19], L2 Loss SVM [20], L2 Regularized Logistic Regression [21] and ripple-down rule learner (RDR) [22] for a complete set of OvA classification problems. The parameters of the competitors are submitted to a tuning procedure using a very wide range with respect to the parameters of the proposed method and are subsequently initialized randomly. The best performances are shown below. The results on the two datasets differ because of their imbalance rates. In the dataset in [24], the configurations include low, medium and high rates. This makes it well suited to a robust testing phase, because it covers the full range of class imbalance cases. In the dataset in [5], imbalance rates are identical for all configurations. The behavior of the classifiers can be analyzed through Tables 3 and 4. Table 3 shows the results on the dataset in [24]. It can be seen that performance is significantly higher than that of the comparison methods. The improvement provided by AKS lies in the accuracy of the classification of patterns belonging to the minority (positive) class, which have a greater weight during the relevance feedback evaluation. Indeed, these patterns are more difficult to classify than those belonging to the majority (negative) class. The results reach a high level of correct classification. This indicates that the improvements over existing techniques can be associated with two aspects. The first involves the vector-based image representation extracted through KGEARSRG. The second concerns the use of the AKS method for the classification stage. Table 4 shows, in the same way, the improvement introduced by AKS. Finally, for both datasets the results indicate that the classification performance of AKS on the minority class is significantly higher than the corresponding performance of the other classifiers.
These results suggest that our approach is able to address extremely imbalanced classification problems more effectively. In other words, the AKS classifier retains the ability to correctly recognize patterns originating from the minority class in the presence of the majority class.

Conclusions
Classification with imbalanced data is a common challenge in many fields such as pattern recognition, bioinformatics and data mining. In this paper, the imbalance problem in image classification is addressed. A novel way to represent a digital image based on KGEARSRG is presented. This representation proved to be very useful for mapping a dense graph space to a reduced space. The classification task is managed through the SVM method based on a transformed kernel, named AKS. Experimental results indicate that the combined approach of KGEARSRG and AKS deals more effectively with highly imbalanced datasets. Specifically, the method identifies instances from the minority class more effectively than other classifiers in the same domain. In the future, we will provide a comparison against MLPs, k-NN and a mixture of models for different classification problems having a serious degree of class imbalance.
Funding: This research received no external funding.