
Pattern Recognition

Volume 64, April 2017, Pages 215-225

Supervised distance metric learning through maximization of the Jeffrey divergence

https://doi.org/10.1016/j.patcog.2016.11.010

Highlights

  • We propose a novel distance metric learning method (DMLMJ) for classification.

  • DMLMJ is simple to implement and it can be solved analytically.

  • We extend DMLMJ into a kernelized version to tackle non-linear problems.

  • Experiments on several data sets show the effectiveness of the proposed method.

Abstract

Over the past decades, distance metric learning has attracted a lot of interest in machine learning and related fields. In this work, we propose an optimization framework for distance metric learning via linear transformations by maximizing the Jeffrey divergence between two multivariate Gaussian distributions derived from local pairwise constraints. In our method, the distance metric is trained on positive and negative difference spaces, which are built from the neighborhood of each training instance, so that the local discriminative information is preserved. We show how to solve this problem with a closed-form solution rather than using tedious optimization procedures. The solution is easy to implement, and tractable for large-scale problems. Experimental results are presented for both a linear and a kernelized version of the proposed method for k-nearest neighbors classification. We obtain classification accuracies superior to those of state-of-the-art distance metric learning methods in several cases, while being competitive in others.

Introduction

Distance metric learning consists in adapting a distance metric using information contained in the training data. The resulting distance metric is used to improve the performance of metric-based methods, such as k-nearest neighbors classification (k-NN) [1], or k-means clustering [2]. Depending on the problem of interest, an appropriate distance metric can yield substantial improvements over the commonly used Euclidean distance metric [3], [4]. Learning a good distance metric for a specific problem may be the key to the successful application of metric-based methods. For this reason, distance metric learning plays a crucial role in metric-related pattern recognition tasks (see recent surveys [5], [6]), such as classification [3], [7], regression [8], clustering [4], [9], [10], feature selection [11], [12], and ranking [13]. Depending on the availability of training instances, distance metric learning methods can be divided into three categories: supervised, semi-supervised, and unsupervised distance metric learning. Supervised methods for classification use the heuristic that instances belonging to the same class should be close to each other, while those from different classes should be farther apart [3], [14]. Semi-supervised methods for information retrieval and clustering use the information in the form of pairwise similarity or dissimilarity constraints [9], [15]. Unsupervised methods learn a distance metric that preserves the geometric relationships (i.e., distance) between most of the training data for the purpose of unsupervised dimensionality reduction [16], [17].

In a supervised setting, we focus on the Mahalanobis distance metric due to its wide use in many application domains and because it provides a flexible way of learning an appropriate distance metric for complex problems [5], [6]. The Mahalanobis distance metric is parametrized by a symmetric positive semidefinite matrix $\mathbf{M} \in \mathbb{R}^{D \times D}$, where the distance between two points $\mathbf{u}$ and $\mathbf{v}$ in $\mathbb{R}^{D}$ is computed as
$$d_{\mathbf{M}}(\mathbf{u},\mathbf{v}) = \sqrt{(\mathbf{u}-\mathbf{v})^{\top}\mathbf{M}(\mathbf{u}-\mathbf{v})}\,.$$
Since the matrix $\mathbf{M}$ is symmetric positive semidefinite, it can be factorized as $\mathbf{M} = \mathbf{A}\mathbf{A}^{\top}$, where $\mathbf{A} \in \mathbb{R}^{D \times m}$ and $m \le D$. Thus, the Mahalanobis distance between $\mathbf{u}$ and $\mathbf{v}$ is equal to the Euclidean distance between $\mathbf{A}^{\top}\mathbf{u}$ and $\mathbf{A}^{\top}\mathbf{v}$. In other words, learning a Mahalanobis distance metric is equivalent to learning a linear transformation.
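As a quick illustration of this equivalence, a minimal NumPy sketch with arbitrary values (not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
D, m = 5, 3
A = rng.standard_normal((D, m))   # hypothetical transformation matrix (arbitrary values)
M = A @ A.T                       # induced symmetric positive semidefinite Mahalanobis matrix

u = rng.standard_normal(D)
v = rng.standard_normal(D)

# Mahalanobis distance under M ...
d_mahalanobis = np.sqrt((u - v) @ M @ (u - v))
# ... equals the Euclidean distance between the transformed points A^T u and A^T v
d_euclidean = np.linalg.norm(A.T @ u - A.T @ v)

assert np.isclose(d_mahalanobis, d_euclidean)
```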

We begin by introducing a simple two-class classification problem that motivates the key ideas in the proposed method. For this purpose, we construct a two-dimensional data set containing 100 positive instances and 100 negative instances (see Fig. 1(a)). Both the positive and the negative instances follow a Gaussian distribution with means $\mu_1 = (1.250;\ 0.205)$ and $\mu_2 = (0.60;\ 0.07)$, respectively, and the same covariance matrix $\Sigma = \begin{pmatrix} 1.96 & 0.55 \\ 0.55 & 0.16 \end{pmatrix}$. The training accuracy of 5-NN using the Euclidean distance metric on this data set is very poor, only 64.0%. However, this performance can be dramatically improved by applying a linear transformation to the original data. In particular, using our method (as we will describe later) we obtain the linear transformation
$$\mathbf{A} = \begin{pmatrix} 20.11 & 3.02 \\ 70.22 & 0.63 \end{pmatrix},$$
and consequently, the training accuracy is increased to 97.5% (see the resulting transformed data in Fig. 1(b)).
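This setup can be reproduced along the following lines (a sketch, not the authors' script; the parameter values are taken as printed above, and the whitening map A_hat is only a placeholder for whichever transformation is actually learned):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
mu1, mu2 = np.array([1.250, 0.205]), np.array([0.60, 0.07])
Sigma = np.array([[1.96, 0.55], [0.55, 0.16]])

X = np.vstack([rng.multivariate_normal(mu1, Sigma, 100),
               rng.multivariate_normal(mu2, Sigma, 100)])
y = np.array([1] * 100 + [0] * 100)

knn = KNeighborsClassifier(n_neighbors=5)
acc_euclidean = knn.fit(X, y).score(X, y)        # training accuracy with the Euclidean metric

# Placeholder linear map: whitening by the shared covariance (NOT the learned A of the paper)
A_hat = np.linalg.inv(np.linalg.cholesky(Sigma)).T
acc_transformed = knn.fit(X @ A_hat, y).score(X @ A_hat, y)
print(acc_euclidean, acc_transformed)
```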

The key question is how to find such a linear transformation $\mathbf{A}$ (or, equivalently, the corresponding Mahalanobis matrix $\mathbf{M}$). Some insights can be obtained when carefully observing how the differences are distributed. Let us informally define the positive (resp. negative) difference space as the set of all differences $(\mathbf{x}_i - \mathbf{x}_j)$ between an instance $\mathbf{x}_i$ and its nearest neighbors $\mathbf{x}_j$ from the same (resp. different) class (see Section 2 for the formal definitions). Here, we use five nearest neighbors with the same class label and five nearest neighbors with different class labels for each training instance. Fig. 2 shows the probability density function of data belonging to the positive (Fig. 2(a)) and negative (Fig. 2(b)) difference spaces. It allows us to see how the differences are distributed before applying the linear transformation.
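Under the informal definition above, the two difference spaces can be assembled with a few lines of NumPy (an illustrative sketch; the function name and the brute-force neighbor search are ours):

```python
import numpy as np

def difference_spaces(X, y, k=5):
    """Collect x_i - x_j for the k nearest same-class neighbors (positive space)
    and the k nearest different-class neighbors (negative space) of each instance."""
    positive, negative = [], []
    n = len(X)
    for i in range(n):
        dist = np.linalg.norm(X - X[i], axis=1)
        same = np.where((y == y[i]) & (np.arange(n) != i))[0]
        other = np.where(y != y[i])[0]
        for j in same[np.argsort(dist[same])[:k]]:
            positive.append(X[i] - X[j])
        for j in other[np.argsort(dist[other])[:k]]:
            negative.append(X[i] - X[j])
    return np.array(positive), np.array(negative)
```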

There is a slight difference between these two distributions. However, this difference clearly reveals itself after applying the linear transformation specified by $\mathbf{A}$ (see Fig. 3). Note that our illustration here is based on $k=5$, but the same phenomenon occurs for other values of $k$. This particular example suggests a way to find such a linear transformation, namely the one that maximizes the difference between these two distributions. The intuition is based on a two-class classification problem; however, it also applies to multi-class classification problems, since the difference spaces are built independently of the number of classes. In the rest of this paper, we develop this idea. In short, our main contributions are the following.

  • (i)

    We propose a novel distance metric learning method aimed at finding a linear transformation that maximizes the Jeffrey divergence between two multivariate Gaussian distributions derived from local pairwise constraints. We formulate this task as an unconstrained optimization problem and show that it can be solved analytically (Section 3.1); an illustrative code sketch of this closed-form solution follows this list.

  • (ii)

    While the proposed method is limited to learning a global linear transformation, we extend it into a kernelized version to tackle non-linear problems. We show that, by using the “kernel trick”, the kernelized version of the proposed method is efficient and highly flexible (Section 3.2).

  • (iii)

    The resulting distance metric, when used in conjunction with k-NN, leads to significant improvements in the classification accuracy. We provide an extensive experimental validation to support this claim (Section 5). Several state-of-the-art distance metric learning methods (Section 4) have been used for a fair comparison.
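The closed-form solution referred to in contribution (i) can be sketched as follows. This is our reconstruction from the generalized eigenvalue formulation mentioned in the conclusion; the zero-mean Gaussian modelling of the difference spaces and the $\lambda + 1/\lambda$ selection rule are assumptions of this sketch, and all names are ours:

```python
import numpy as np
from scipy.linalg import eigh

def dmlmj_sketch(positive, negative, m):
    """Sketch of the closed-form solution: take as columns of A the m generalized
    eigenvectors of (Sigma_neg, Sigma_pos) whose eigenvalues maximize
    lambda + 1/lambda, i.e. the directions along which the two difference
    distributions disagree the most."""
    Sigma_pos = positive.T @ positive / len(positive)   # covariance of the positive difference space
    Sigma_neg = negative.T @ negative / len(negative)   # covariance of the negative difference space
    # generalized symmetric eigenproblem: Sigma_neg v = lambda * Sigma_pos v
    eigvals, eigvecs = eigh(Sigma_neg, Sigma_pos)
    eigvals = np.maximum(eigvals, 1e-12)                # guard against numerical zeros
    order = np.argsort(eigvals + 1.0 / eigvals)[::-1]
    return eigvecs[:, order[:m]]                        # linear map A (D x m)
```

A learned map would then be applied as `X @ dmlmj_sketch(positive, negative, m)` before running k-NN in the transformed space.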

Section snippets

Notations

The following notations are used throughout the paper. Vectors are denoted by boldface lowercase letters, such as $\mathbf{x}$, $\mathbf{y}$ and $\mathbf{z}$. Matrices are denoted by boldface capital letters, such as $\mathbf{A}$, $\mathbf{B}$ and $\mathbf{C}$. The inner product between two vectors $\mathbf{u}$ and $\mathbf{v}$ is denoted as $\langle \mathbf{u}, \mathbf{v} \rangle = \mathbf{u}^{\top}\mathbf{v}$. All scalars are denoted by lowercase or uppercase letters, such as $k$, $n$, or $D$. Sets are denoted by calligraphic uppercase letters, such as $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{V}$. The trace of a matrix $\mathbf{A}$ is denoted as $\mathrm{tr}(\mathbf{A})$. The multivariate Gaussian density

Proposed method

Motivated by the toy example above, the proposed method is based on learning a linear transformation that maximizes the difference between the probability distribution on the positive difference space and that on the negative difference space. Such difference is often measured by the well-known Kullback–Leibler divergence [18], which is widely used in many machine learning applications, such as information retrieval [19], text categorization [20], particularly in the classification of
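For reference (standard definitions, restated here for convenience since the snippet above is truncated), the Kullback–Leibler divergence between two multivariate Gaussian densities $p = \mathcal{N}(\mu_p, \Sigma_p)$ and $q = \mathcal{N}(\mu_q, \Sigma_q)$ in $\mathbb{R}^d$ is
$$\mathrm{KL}(p \,\|\, q) = \tfrac{1}{2}\Big(\operatorname{tr}\!\big(\Sigma_q^{-1}\Sigma_p\big) + (\mu_q-\mu_p)^{\top}\Sigma_q^{-1}(\mu_q-\mu_p) - d + \ln\frac{\det\Sigma_q}{\det\Sigma_p}\Big),$$
and the (symmetric) Jeffrey divergence maximized in this work is the sum of both directions, $J(p,q) = \mathrm{KL}(p\,\|\,q) + \mathrm{KL}(q\,\|\,p)$.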

Related work

In order to take into account the positive semidefiniteness constraint, distance metric learning methods are mostly formulated as convex semidefinite programs. However, standard semidefinite programming solvers [34] do not scale well when the number of instances or the dimensionality is high, due to the expensive computational cost in each iteration. A number of methods have been proposed to reduce this heavy computational burden. Weinberger and Saul [3] suggested an efficient solver based on

Experiments

In this section, we describe some experiments to evaluate the effectiveness of distance metric learning methods. We compare the proposed methods to the baseline Euclidean distance metric and four state-of-the-art distance metric learning methods, including ITML [7], LMNN [3], DML-eig [35] and SCM [36] (all acronyms are explained in Table 1). First, we use 27 data sets of different sizes to evaluate the linear distance metric learning methods. Second, we conduct an experiment to evaluate the
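A minimal sketch of such an evaluation protocol is given below (illustrative only; the hold-out split, the learn_metric placeholder and all names are ours, not the paper's exact protocol):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def knn_accuracy(X, y, learn_metric=None, k=5, seed=0):
    """k-NN test accuracy with the Euclidean metric (learn_metric=None) or with a
    linear map learned on the training split only (e.g. the DMLMJ sketch above)."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=seed, stratify=y)
    if learn_metric is not None:
        A = learn_metric(X_tr, y_tr)            # returns a D x m linear map
        X_tr, X_te = X_tr @ A, X_te @ A
    return KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_te, y_te)
```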

Conclusion

In this paper, we have developed a novel linear transformation method for distance metric learning. We have shown that learning a linear transformation can be formulated as maximizing the Jeffrey divergence between two distributions derived from local pairwise constraints. Then we have demonstrated that this problem is equivalent to solving a generalized eigenvalue decomposition problem with a closed-form solution. We have also developed the kernelized version of the proposed method to handle

Conflict of interest

None declared.


References (53)

  • T.M. Cover et al.

    Nearest neighbor pattern classification

    IEEE Trans. Inf. Theory

    (1967)
  • S.P. Lloyd

    Least squares quantization in PCM

    IEEE Trans. Inf. Theory

    (1982)
  • K.Q. Weinberger et al.

    Distance metric learning for large margin nearest neighbor classification

    J. Mach. Learn. Res.

    (2009)
  • S. Xiang et al.

    Learning a Mahalanobis distance metric for data clustering and classification

    Pattern Recognit.

    (2008)
  • B. Kulis

    Metric learning: a survey

    Found. Trends Mach. Learn.

    (2012)
  • A. Bellet et al.

    Metric learning

    Synth. Lect. Artif. Intell. Mach. Learn.

    (2015)
  • J.V. Davis, B. Kulis, P. Jain, S. Sra, I.S. Dhillon, Information-theoretic metric learning, in: Proceedings of the 24th...
  • B. Nguyen et al.

    Large-scale distance metric learning for k-nearest neighbors regression

    Neurocomputing

    (2016)
  • E.P. Xing, M.I. Jordan, S. Russell, A. Ng, Distance metric learning with application to clustering with...
  • Y.G. Zhang et al.

    Distance metric learning by knowledge embedding

    Pattern Recognit.

    (2004)
  • B. Liu et al.

    Large margin subspace learning for feature selection

    Pattern Recognit.

    (2013)
  • C.-C. Chang

    Generalized iterative RELIEF for supervised distance metric learning

    Pattern Recognit.

    (2010)
  • B. McFee, G. Lanckriet, Metric learning to rank, in: Proceedings of the 27th International Conference on Machine...
  • C. Shen et al.

    Positive semidefinite metric learning using boosting-like algorithms

    J. Mach. Learn. Res.

    (2012)
  • Q. Wang et al.

    Semi-supervised metric learning via topology preserving multiple semi-supervised assumptions

    Pattern Recognit.

    (2013)
  • J.B. Tenenbaum et al.

    A global geometric framework for nonlinear dimensionality reduction

    Science

    (2000)
  • L.K. Saul et al.

    Think globally, fit locally: unsupervised learning of low dimensional manifolds

    J. Mach. Learn. Res.

    (2003)
  • S. Kullback et al.

    On information and sufficiency

    Ann. Math. Stat.

    (1951)
  • B. Bigi et al.

    A fuzzy decision strategy for topic identification and dynamic selection of language models

    Signal Process.

    (2000)
  • B. Bigi, Using Kullback–Leibler distance for text categorization, in: Proceedings of the 25th European Conference on IR...
  • P.J. Moreno, P.P. Ho, N. Vasconcelos, A Kullback–Leibler divergence based kernel for SVM classification in multimedia...
  • G.-J. Qi, J. Tang, Z.-J. Zha, T.-S. Chua, H.-J. Zhang, An efficient sparse metric learning in high-dimensional space...
  • P. Jain, B. Kulis, I.S. Dhillon, K. Grauman, Online metric learning and fast similarity search, in: Advances in Neural...
  • J. Mei et al.

    Logdet divergence-based metric learning with triplet constraints and its applications

    IEEE Trans. Image Process.

    (2014)
  • A. Globerson, S.T. Roweis, Metric learning by collapsing classes, in: Advances in Neural Information Processing...
  • R.O. Duda et al.

    Pattern Classification

    (2000)

    Bac Nguyen received his B.Sc. and M.Sc. (summa cum laude) degrees in Computer Science from the Universidad Central de Las Villas, Cuba, in 2014 and 2015, respectively. He is currently a Ph.D. candidate in the Department of Mathematical Modelling, Statistics and Bioinformatics, Ghent University, Belgium. His research interests are in the areas of data mining, machine learning, and their applications.

    Carlos Morell received his B.Sc. degree in Computer Science and his Ph.D. in Artificial Intelligence from the Universidad Central de Las Villas, in 1995 and 2005, respectively. Currently, he is a Professor in the Department of Computer Science at the same university. In addition, he leads the Artificial Intelligence Research Laboratory. His teaching and research interests include machine learning, soft computing and programming languages.

    Bernard De Baets holds an M.Sc. in Maths (1988), a postgraduate degree in Knowledge Technology (1991) and a Ph.D. in Maths (1995), all summa cum laude from Ghent University (Belgium). He is a Full Professor in Applied Maths (1999) at Ghent University, where he is leading KERMIT, the Research Unit Knowledge-Based Systems. He is a Government of Canada Award holder (1988), an Honorary Professor of Budapest Tech (2006) and an IFSA Fellow (2011). His publications comprise more than 400 papers in international journals and about 60 book chapters. He serves on the editorial boards of various international journals, in particular as Co-Editor-in-Chief of Fuzzy Sets and Systems. B. De Baets is a Member of the Board of Directors of EUSFLAT and of the Administrative Board of the Belgian OR Society.
