Supervised distance metric learning through maximization of the Jeffrey divergence
Introduction
Distance metric learning consists of adapting a distance metric using information contained in the training data. The resulting distance metric is used to improve the performance of metric-based methods, such as k-nearest neighbors (k-NN) classification [1] or k-means clustering [2]. Depending on the problem of interest, an appropriate distance metric can yield substantial improvements over the commonly used Euclidean distance metric [3], [4]. Learning a good distance metric for a specific problem may be the key to the successful application of metric-based methods. For this reason, distance metric learning plays a crucial role in metric-related pattern recognition tasks (see recent surveys [5], [6]), such as classification [3], [7], regression [8], clustering [4], [9], [10], feature selection [11], [12], and ranking [13].

Depending on the availability of training instances, distance metric learning methods can be divided into three categories: supervised, semi-supervised, and unsupervised distance metric learning. Supervised methods for classification use the heuristic that instances belonging to the same class should be close to each other, while those from different classes should be farther apart [3], [14]. Semi-supervised methods for information retrieval and clustering use information in the form of pairwise similarity or dissimilarity constraints [9], [15]. Unsupervised methods learn a distance metric that preserves the geometric relationships (i.e., distances) among most of the training data for the purpose of unsupervised dimensionality reduction [16], [17].
In a supervised setting, we focus on the Mahalanobis distance metric due to its wide use in many application domains and because it provides a flexible way of learning an appropriate distance metric for complex problems [5], [6]. The Mahalanobis distance metric is parametrized by a symmetric positive semidefinite matrix $\mathbf{M} \in \mathbb{R}^{D \times D}$, and the distance between two points $\mathbf{x}$ and $\mathbf{y}$ in $\mathbb{R}^{D}$ is computed as
$$d_{\mathbf{M}}(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})^{\top} \mathbf{M} (\mathbf{x} - \mathbf{y})}.$$
Since the matrix $\mathbf{M}$ is symmetric positive semidefinite, it can be factorized as $\mathbf{M} = \mathbf{L}^{\top} \mathbf{L}$, where $\mathbf{L} \in \mathbb{R}^{d \times D}$ and $d \leq D$. Thus, the Mahalanobis distance between $\mathbf{x}$ and $\mathbf{y}$ is equal to the Euclidean distance between $\mathbf{L}\mathbf{x}$ and $\mathbf{L}\mathbf{y}$. In other words, learning a Mahalanobis distance metric is equivalent to learning a linear transformation.
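To make the equivalence concrete, here is a minimal NumPy sketch (the matrix and points are illustrative, not taken from the paper) verifying that the Mahalanobis distance under a factorization of the form described above equals the Euclidean distance after applying the corresponding linear map:

```python
import numpy as np

def mahalanobis(x, y, M):
    """Mahalanobis distance d_M(x, y) = sqrt((x - y)^T M (x - y))."""
    d = x - y
    return float(np.sqrt(d @ M @ d))

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 2))
M = A.T @ A + np.eye(2)          # a symmetric positive definite matrix
L = np.linalg.cholesky(M).T      # one factorization with M = L^T L

x = np.array([1.0, 2.0])
y = np.array([0.0, -1.0])

# The Mahalanobis distance under M equals the Euclidean distance
# between the transformed points Lx and Ly.
assert np.isclose(mahalanobis(x, y, M), np.linalg.norm(L @ x - L @ y))
```

Here the Cholesky factor is used for convenience; any factorization satisfying the same identity yields the same distances.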
We begin by introducing a simple two-class classification problem that motivates the key ideas of the proposed method. For this purpose, we construct a two-dimensional data set containing 100 positive and 100 negative instances (see Fig. 1(a)). The two classes follow Gaussian distributions with different means and a shared covariance matrix. The training accuracy of 5-NN using the Euclidean distance metric on this data set is very poor. However, this performance can be dramatically improved by applying a linear transformation to the original data. In particular, using our method (as we will describe later), we obtain a linear transformation under which the training accuracy increases substantially (see the resulting transformed data in Fig. 1(b)).
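The spirit of this toy example can be reproduced with a short NumPy sketch. The class means, covariance, and rescaling matrix below are hypothetical stand-ins, since the exact values behind Fig. 1 are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical class means and shared covariance: the classes are separated
# along the first axis but swamped by the large variance of the second.
mu_pos, mu_neg = np.array([0.0, 0.0]), np.array([1.0, 0.0])
cov = np.array([[0.05, 0.0], [0.0, 5.0]])

X = np.vstack([rng.multivariate_normal(mu_pos, cov, 100),
               rng.multivariate_normal(mu_neg, cov, 100)])
y = np.array([1] * 100 + [-1] * 100)

def knn_train_accuracy(X, y, k=5):
    """Leave-one-out training accuracy of k-NN with the Euclidean metric."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(D, np.inf)            # exclude each point itself
    neighbors = np.argsort(D, axis=1)[:, :k]
    votes = y[neighbors].sum(axis=1)
    return float(np.mean(np.sign(votes) == y))

acc_euclidean = knn_train_accuracy(X, y)

# A linear transformation that rescales the axes restores the separation.
# This matrix is a hypothetical stand-in, not the one learned in the paper.
Lmat = np.diag([5.0, 0.2])
acc_transformed = knn_train_accuracy(X @ Lmat.T, y)
assert acc_transformed > acc_euclidean
```

With the Euclidean metric, the uninformative high-variance axis dominates the neighbor search; after rescaling, the discriminative axis dominates and the 5-NN training accuracy rises sharply.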
The key question is how to find such a linear transformation (or, equivalently, the corresponding Mahalanobis matrix). Some insight can be obtained by carefully observing how the differences between instances are distributed. Let us informally define the positive (resp. negative) difference space as the set of all differences between an instance and its nearest neighbors from the same (resp. a different) class (see Section 2 for the formal definitions). Here, we use five nearest neighbors with the same class label and five nearest neighbors with different class labels for each training instance. Fig. 2 shows the probability density function of data belonging to the positive (Fig. 2(a)) and negative (Fig. 2(b)) difference spaces. It allows us to see how the differences are distributed before applying the linear transformation.
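Following the informal definition above, the two difference spaces could be built as in the sketch below (the paper's formal construction in Section 2 may differ in detail):

```python
import numpy as np

def difference_spaces(X, y, k=5):
    """Collect x_i - x_j for the k nearest same-class neighbors (positive
    space) and the k nearest different-class neighbors (negative space)
    of every training instance."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(D, np.inf)                  # never pair a point with itself
    pos, neg = [], []
    for i in range(len(X)):
        same = np.flatnonzero(y == y[i])
        same = same[same != i]                   # same class, excluding i
        diff = np.flatnonzero(y != y[i])
        for j in same[np.argsort(D[i, same])[:k]]:
            pos.append(X[i] - X[j])
        for j in diff[np.argsort(D[i, diff])[:k]]:
            neg.append(X[i] - X[j])
    return np.array(pos), np.array(neg)

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 2))
y = np.array([0] * 5 + [1] * 5)
pos, neg = difference_spaces(X, y, k=2)          # shapes: (10*2, 2) each
```

Each training instance contributes k differences to each space, so with n instances both spaces contain nk difference vectors.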
There is a slight difference between these two distributions. However, this difference clearly reveals itself after applying the linear transformation above (see Fig. 3). Note that our illustration here is based on k=5, but the same phenomenon occurs for other values of k. This particular example suggests a way to find such a linear transformation, namely the one that maximizes the difference between these two distributions. The intuition is based on a two-class classification problem; however, it can also be used for multi-class classification problems, since the difference spaces are built independently for any number of classes. In the rest of this paper, we develop this idea. In short, our main contributions are the following.
- (i) We propose a novel distance metric learning method aimed at finding a linear transformation that maximizes the Jeffrey divergence between two multivariate Gaussian distributions derived from local pairwise constraints. We formulate this task as an unconstrained optimization problem and show that it can be solved analytically (Section 3.1).
- (ii) While the proposed method is limited to learning a global linear transformation, we extend it to a kernelized version to tackle non-linear problems. We show that the kernelized version of the proposed method is more efficient and highly flexible thanks to the "kernel trick" (Section 3.2).
- (iii) The resulting distance metric, when used in conjunction with k-NN, leads to significant improvements in classification accuracy. We provide an extensive experimental validation to support this claim (Section 5). Several state-of-the-art distance metric learning methods (Section 4) are used for a fair comparison.
Section snippets
Notations
The following notations are used throughout the paper. Vectors are denoted by boldface lowercase letters, such as $\mathbf{x}$ and $\mathbf{y}$. Matrices are denoted by boldface capital letters, such as $\mathbf{A}$ and $\mathbf{M}$. The inner product between two vectors $\mathbf{x}$ and $\mathbf{y}$ is denoted as $\langle \mathbf{x}, \mathbf{y} \rangle$. All scalars are denoted by lowercase or uppercase letters, such as k, n, or D. Sets are denoted by calligraphic uppercase letters. The trace of a matrix $\mathbf{A}$ is denoted as $\operatorname{tr}(\mathbf{A})$. The multivariate Gaussian density…
Proposed method
Motivated by the toy example above, the proposed method is based on learning a linear transformation that maximizes the difference between the probability distribution on the positive difference space and that on the negative difference space. Such a difference is often measured by the well-known Kullback–Leibler divergence [18], which is widely used in many machine learning applications, such as information retrieval [19], text categorization [20], particularly in the classification of…
Related work
In order to take into account the positive semidefiniteness constraint, distance metric learning methods are mostly formulated as convex semidefinite programs. However, standard semidefinite programming solvers [34] do not scale well when the number of instances or the dimensionality is high, due to the expensive computational cost of each iteration. A number of methods have been proposed to reduce this heavy computational burden. Weinberger and Saul [3] suggested an efficient solver based on…
Experiments
In this section, we describe some experiments to evaluate the effectiveness of distance metric learning methods. We compare the proposed methods to the baseline Euclidean distance metric and four state-of-the-art distance metric learning methods, including ITML [7], LMNN [3], DML-eig [35], and SCM [36] (all acronyms are explained in Table 1). First, we use 27 data sets of different sizes to evaluate the linear distance metric learning methods. Second, we conduct an experiment to evaluate the…
Conclusion
In this paper, we have developed a novel linear transformation method for distance metric learning. We have shown that learning a linear transformation can be formulated as maximizing the Jeffrey divergence between two distributions derived from local pairwise constraints. Then we have demonstrated that this problem is equivalent to solving a generalized eigenvalue decomposition problem with a closed-form solution. We have also developed the kernelized version of the proposed method to handle…
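As a sketch of the closed-form solution mentioned above: with Gaussian models of the two difference spaces, maximizing the Jeffrey divergence reduces to a generalized eigenvalue problem on their covariance matrices. The covariances and the eigenvector-ranking rule below are illustrative; the paper's exact selection criterion is not reproduced here:

```python
import numpy as np
from scipy.linalg import eigh

# Illustrative covariance matrices of the positive and negative difference
# distributions (in the actual method, estimated from the difference spaces).
S_pos = np.array([[1.0, 0.3], [0.3, 0.5]])
S_neg = np.array([[2.0, -0.4], [-0.4, 3.0]])

# Generalized eigenproblem S_neg v = lambda * S_pos v: directions with large
# eigenvalues spread negative differences relative to positive ones.
eigvals, eigvecs = eigh(S_neg, S_pos)
order = np.argsort(eigvals)[::-1]      # sort by decreasing eigenvalue
L = eigvecs[:, order].T                # rows form a candidate transformation
```

Because the solution comes from a single eigendecomposition, no iterative semidefinite programming is needed, which is what makes the method analytic.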
Conflict of interest
None declared.
Bac Nguyen received his B.Sc. and M.Sc. (summa cum laude) degrees in Computer Science from the Universidad Central de Las Villas, Cuba, in 2014 and 2015, respectively. He is currently a Ph.D. candidate in the Department of Mathematical Modelling, Statistics and Bioinformatics, Ghent University, Belgium. His research interests are in the areas of data mining, machine learning, and their applications.
References (53)
- Nearest neighbor pattern classification, IEEE Trans. Inf. Theory (1967)
- Least squares quantization in PCM, IEEE Trans. Inf. Theory (1982)
- Distance metric learning for large margin nearest neighbor classification, J. Mach. Learn. Res. (2009)
- Learning a Mahalanobis distance metric for data clustering and classification, Pattern Recognit. (2008)
- Metric learning: a survey, Found. Trends Mach. Learn. (2012)
- Metric learning, Synth. Lect. Artif. Intell. Mach. Learn. (2015)
- J.V. Davis, B. Kulis, P. Jain, S. Sra, I.S. Dhillon, Information-theoretic metric learning, in: Proceedings of the 24th...
- Large-scale distance metric learning for k-nearest neighbors regression, Neurocomputing (2016)
- E.P. Xing, M.I. Jordan, S. Russell, A. Ng, Distance metric learning with application to clustering with...
- Distance metric learning by knowledge embedding, Pattern Recognit. (2004)
- Large margin subspace learning for feature selection, Pattern Recognit.
- Generalized iterative RELIEF for supervised distance metric learning, Pattern Recognit.
- Positive semidefinite metric learning using boosting-like algorithms, J. Mach. Learn. Res.
- Semi-supervised metric learning via topology preserving multiple semi-supervised assumptions, Pattern Recognit.
- A global geometric framework for nonlinear dimensionality reduction, Science
- Think globally, fit locally: unsupervised learning of low dimensional manifolds, J. Mach. Learn. Res.
- On information and sufficiency, Ann. Math. Stat.
- A fuzzy decision strategy for topic identification and dynamic selection of language models, Signal Process.
- Logdet divergence-based metric learning with triplet constraints and its applications, IEEE Trans. Image Process.
- Pattern Classification
Carlos Morell received his B.Sc. degree in Computer Science and his Ph.D. in Artificial Intelligence from the Universidad Central de Las Villas, in 1995 and 2005, respectively. Currently, he is a Professor in the Department of Computer Science at the same university. In addition, he leads the Artificial Intelligence Research Laboratory. His teaching and research interests include machine learning, soft computing and programming languages.
Bernard De Baets holds an M.Sc. in Maths (1988), a postgraduate degree in Knowledge Technology (1991) and a Ph.D. in Maths (1995), all summa cum laude from Ghent University (Belgium). He is a Full Professor in Applied Maths (1999) at Ghent University, where he is leading KERMIT, the Research Unit Knowledge-Based Systems. He is a Government of Canada Award holder (1988), an Honorary Professor of Budapest Tech (2006) and an IFSA Fellow (2011). His publications comprise more than 400 papers in international journals and about 60 book chapters. He serves on the editorial boards of various international journals, in particular as Co-Editor-in-Chief of Fuzzy Sets and Systems. B. De Baets is a Member of the Board of Directors of EUSFLAT and of the Administrative Board of the Belgian OR Society.