
Pattern Recognition

Volume 47, Issue 11, November 2014, Pages 3641-3655

Learning kernel logistic regression in the presence of class label noise

https://doi.org/10.1016/j.patcog.2014.05.007

Highlights

  • We propose an algorithm to learn a new robust multiple kernel logistic regression model.

  • The algorithm bypasses cross-validation, which is sub-optimal in noisy-label settings.

  • The algorithm is significantly faster than the traditional cross-validation approach.

  • We empirically show that symmetric label noise can be as harmful as asymmetric noise.

Abstract

The classical machinery of supervised learning machines relies on a correct set of training labels. Unfortunately, there is no guarantee that all of the labels are correct. Labelling errors are increasingly noticeable in today's classification tasks, as the scale and difficulty of these tasks increase to the point where perfect label assignment becomes nearly impossible. Several algorithms have been proposed to alleviate the problem, of which a robust Kernel Fisher Discriminant is a successful example. However, for classification, discriminative models are of primary interest and, rather curiously, the very few existing label-robust discriminative classifiers are limited to linear problems.

In this paper, we build on the widely used and successful kernelising technique to introduce a label-noise robust Kernel Logistic Regression classifier. The main difficulty that we need to bypass is how to determine the model complexity parameters when no trusted validation set is available. We propose to adapt the Multiple Kernel Learning approach for this new purpose, together with a Bayesian regularisation scheme. Empirical results on 13 benchmark data sets and two real-world applications demonstrate the success of our approach.

Introduction

Traditional supervised learning machines rely on a correct set of class labels. In practice, however, there is no guarantee that all the labels will be correct, whether due to the scale of the labelling task, the lack of information available to determine the class labels, or the subjectivity of the labelling experts.

The presence of class label noise inherent in training samples has been reported to deteriorate the performance of existing classifiers in a broad range of classification problems, including biomedical data analysis [20], [30] and image classification [24], [47]. More recently, class label noise has emerged as a side effect of crowdsourcing practices, where annotators of different backgrounds are asked to perform labelling tasks; Amazon's Mechanical Turk, citizen science projects and Galaxy Zoo are just a few examples. Although the problem posed by the presence of class label noise is acknowledged in the literature, it is often naively ignored in practice. Part of the reason for this may be that uniform/symmetric label noise is relatively harmless [21], [22], [12], [27].

There is an increasing research literature that aims to address the issues related to learning from samples with noisy class label assignments. A seemingly straightforward approach is data preprocessing, where suspect samples are removed or relabelled [7], [1], [29], [37], [31], [18]. However, such approaches carry the risk of also removing useful data, which is detrimental to classification performance, especially when the number of training examples is limited (e.g. in biomedical domains). Most previous approaches try to detect mislabelled instances based on various heuristics; very few take a principled modelling approach, with the notable exceptions of [32], [24], [25], [36].

Lawrence and Schölkopf [24] incorporated a probabilistic model of random label flipping into their robust Kernel Fisher Discriminant (rKFD) for binary classification. Based on the same model, Li et al. [25] conducted extensive experiments on more complex data sets, convincingly demonstrating the value of explicit modelling. The rKFD was later extended to the multi-class setting by [3], and this has further motivated the recent development of a label noise-tolerant Hidden Markov Model to improve segmentation [15].

While all these works demonstrate the great potential and flexibility of a model-based approach, most existing work falls into the category of generative methods. For classification problems, discriminative methods are of primary interest, and similar algorithmic developments for discriminative classifiers are still limited. For example, Magder et al. [28] studied logistic regression with known label-flip probabilities and acknowledged the difficulties that arise when these probabilities are unknown. Hausman et al. [17] laid the foundation of a statistical model for the binary classification problem but provided no algorithmic solution for learning the label noise parameters.

Recently, Raykar et al. [36] proposed an EM algorithm to learn a latent variable model extension of logistic regression for data with multiple sets of noisy labels. Our initial work [4] suggested a more efficient gradient-based algorithm to optimise a similar latent variable model for problems where only a single set of labels is available; a sparse extension of the model was also developed in [4]. However, all of these developments are limited to linear problems. In this paper we focus on non-linear classification with labelling errors, which is not as trivial as it might look at first.

Since the introduction of the kernel trick, many linear classifiers have been equipped with the ability to solve non-linear problems, extending their use to a much wider range of applications. Deploying a kernel machine generally also involves determining good kernel parameters, and Cross-Validation (CV) has long been the established standard approach. However, when class label noise is present, it is unclear why CV would be a good approach, since all candidate models are then validated against noisy class labels. The issue has also been briefly discussed in [24], [6]. In [24], the authors resort to using a ‘trusted validation set’ to select optimal kernel parameters. The trusted set must be labelled carefully, which seriously restricts the applicability of the method: in crowdsourcing, for example, it would be very difficult (if not impossible) to construct such a set.

We start by straightforwardly formulating a robust Kernel Logistic Regression (rKLR) as an extension of the robust Logistic Regression (rLR). We present a simple yet effective algorithm to learn the classifier and investigate whether or not CV is a reasonable approach to model selection in the presence of labelling errors. As we shall see, performing CV in noisy environments gives rise to a slightly under-fitted model. We then propose a robust Multiple Kernel Logistic Regression algorithm (rMKLR) based on the so-called Multiple Kernel Learning (MKL) framework (an extensive survey of recent advances in MKL is given in [16]) and the Bayesian regularisation technique [9] to automate the model selection step without any cross-validation. From this we obtain improvements in both generalisation performance and learning speed. The genealogy of the proposed methods is summarised in Fig. 1, which serves as a roadmap for the next section.
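To make the rLR building block concrete, the following is a minimal sketch of the label-flip latent variable likelihood that underlies it: the clean logistic posterior is marginalised through a 2x2 flipping matrix to give the probability of the observed label. The function names, parameterisation of the flip rates and the choice of optimiser are illustrative assumptions, not the authors' exact algorithm.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid

def robust_lr_nll(params, X, y_noisy):
    """Negative log-likelihood of a label-flip logistic regression.

    params packs [w_1..w_m, g01, g10], where (illustratively)
    g01 = p(observed 1 | true 0) and g10 = p(observed 0 | true 1).
    """
    w, g01, g10 = params[:-2], params[-2], params[-1]
    p1 = expit(X @ w)                          # p(true y = 1 | x, w)
    # Marginalise the latent true label through the flip matrix
    q1 = g01 * (1.0 - p1) + (1.0 - g10) * p1   # p(observed y = 1 | x)
    q1 = np.clip(q1, 1e-12, 1.0 - 1e-12)
    return -np.sum(y_noisy * np.log(q1) + (1 - y_noisy) * np.log(1 - q1))

def fit_robust_lr(X, y_noisy):
    """Illustrative fit: the flip rates are box-constrained below 0.5 so the
    noise model cannot swap the meaning of the two classes."""
    x0 = np.r_[np.zeros(X.shape[1]), 0.05, 0.05]
    bounds = [(None, None)] * X.shape[1] + [(1e-4, 0.499)] * 2
    res = minimize(robust_lr_nll, x0, args=(X, y_noisy),
                   method="L-BFGS-B", bounds=bounds)
    return res.x[:-2], res.x[-2], res.x[-1]
```

Replacing the raw inputs X with kernel expansions of the data gives the kernelised variant discussed next.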

Throughout this work, similar to the related work above, we will focus on label noise occurring at random – the flipping of labels is assumed to be independent of the contents of the data features. The reason for this is simplicity and generic applicability. Alternative models of label noise are discussed after the Experiments section.

Section snippets

Robust kernel logistic regression

Consider a set of training samples $D=\{(x_n,\tilde{y}_n)\}_{n=1}^{N}$, where $x_n \in \mathbb{R}^m$ and $\tilde{y}_n \in \{0,1\}$ denotes the observed (possibly noisy) label of $x_n$. Kernel logistic regression produces a non-linear decision boundary, $f(x)$, by forming a linear decision boundary in the space of the non-linearly transformed input vectors. By the representer theorem [19], the optimal $f$ has the form
$$f(\cdot)=\sum_{n=1}^{N} w_n\,\kappa(\cdot,x_n)$$
where $\kappa(\cdot,\cdot)$ is a positive definite reproducing kernel that gives an inner product in the transformed space.
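As a small illustration of the representer-theorem form above, the sketch below evaluates $f$ at new points as a weighted sum of kernel evaluations against the training inputs. The Gaussian RBF kernel and the parameter `gamma` are assumptions chosen for concreteness; any positive definite kernel would serve.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian RBF kernel matrix: K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def klr_decision(X_new, X_train, w, gamma=1.0):
    """f(x) = sum_n w_n * kappa(x, x_n); the class posterior is sigmoid(f)."""
    return rbf_kernel(np.atleast_2d(X_new), X_train, gamma) @ w
```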

Denoting

Experiments

We conducted extensive experiments to answer three main research questions:

  • Firstly, we ask whether rKLR improves on KLR in terms of robustness against labelling errors, as measured by classification performance. To answer this question, we also study the relative harm of two common types of label noise: symmetric and asymmetric. Symmetric noise flips the same percentage of class labels in each direction, while asymmetric noise flips labels from one class into the other (a minimal noise-injection sketch follows this list)
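The following is a minimal sketch of feature-independent label-noise injection of the two kinds just described; the function name and rates are illustrative, not taken from the paper's experimental code.

```python
import numpy as np

def inject_label_noise(y, rate_0to1, rate_1to0, seed=0):
    """Flip binary labels at random, independently of the features.

    Symmetric noise:  rate_0to1 == rate_1to0.
    Asymmetric noise: the two rates differ (e.g. rate_1to0 = 0).
    """
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    flip0 = (y == 0) & (rng.random(y.shape) < rate_0to1)
    flip1 = (y == 1) & (rng.random(y.shape) < rate_1to0)
    y_noisy[flip0] = 1
    y_noisy[flip1] = 0
    return y_noisy
```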

Extension to multi-class problems

The proposed multi-kernel approach with the Bayesian regularisation technique can be straightforwardly extended to multi-class problems. In the multi-class setting, the class posterior of the true label is typically modelled by the softmax function:
$$p(y=k\,|\,\kappa(\cdot,x_n),w_k)=\frac{\exp(w_k^{\top}\kappa(\cdot,x_n))}{\sum_{j=0}^{K-1}\exp(w_j^{\top}\kappa(\cdot,x_n))}$$

Using this, we can write the likelihood of the observed label as
$$p(\tilde{y}=k\,|\,\kappa(\cdot,x_n),\Theta)=\sum_{j=0}^{K-1}\omega_{jk}\;p(y=j\,|\,\kappa(\cdot,x_n),w_j)$$
where $\omega_{jk}$ denotes the probability that a true label $j$ is observed as $k$, which brings us to the objective of the ‘robust Multi-Class Multi-Kernel
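In code, the two equations above amount to a softmax over kernel scores followed by a mixture through the row-stochastic flip matrix. A minimal sketch, with the shapes below as assumptions (one dual weight column per class, Omega[j, k] = p(observed k | true j)):

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax of a score matrix Z of shape (N, K)."""
    Z = Z - Z.max(axis=1, keepdims=True)   # subtract row max for stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def observed_label_probs(K_mat, W, Omega):
    """p(y_tilde = k | x_n) = sum_j Omega[j, k] * p(y = j | x_n).

    K_mat : (N, N) kernel matrix whose n-th row is kappa(., x_n)
    W     : (N, K) dual weights, one column per class
    Omega : (K, K) flip matrix, each row summing to one
    """
    P_true = softmax(K_mat @ W)   # clean class posteriors, shape (N, K)
    return P_true @ Omega         # observed-label probabilities, (N, K)
```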

Conclusions

We proposed a novel algorithm to learn a label-noise robust Kernel Logistic Regression model in which the optimal hyper-parameters are automatically determined using Multiple Kernel Learning and Bayesian regularisation techniques. The experimental results show that the latent variable model used is robust against mislabelling, while the proposed learning algorithm is faster and has better predictive performance than traditional approaches. In comparisons with three state-of-the-art kernel

Conflict of interest

None declared.


References (47)

  • J. Bootkrajang, A. Kabán, Multi-class classification in the presence of labelling errors, in: Proceedings of the...
  • J. Bootkrajang, A. Kabán, Label-noise robust logistic regression and its applications, in: ECML/PKDD (1), 2012, pp....
  • J. Bootkrajang et al., Classification of mislabelled microarrays using robust sparse logistic regression, Bioinformatics (2013)
  • C.E. Brodley et al., Identifying mislabeled training data, J. Artif. Intell. Res. (1999)
  • G.C. Cawley et al., On over-fitting in model selection and subsequent selection bias in performance evaluation, J. Mach. Learn. Res. (2010)
  • G.C. Cawley et al., Preventing over-fitting during model selection via Bayesian regularisation of the hyper-parameters, J. Mach. Learn. Res. (2007)
  • G. Celeux et al., A component-wise EM algorithm for mixtures, J. Comput. Graph. Stat. (2001)
  • C.-C. Chang et al., LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. (2011)
  • R.S. Chhikara et al., Linear discriminant analysis with misallocation in training samples, J. Am. Stat. Assoc. (1984)
  • N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: IEEE Computer Society Conference on...
  • B. Frénay, G. de Lannoy, M. Verleysen, Label noise-tolerant hidden Markov models for segmentation: application to ECGs,...
  • M. Gönen et al., Multiple kernel learning algorithms, J. Mach. Learn. Res. (2011)
  • Y. Jiang, Z.-H. Zhou, Editing training data for kNN classifiers with neural network ensemble, in: Advances in Neural...

Jakramate Bootkrajang is a lecturer in the Department of Computer Science of Chiang Mai University. He received his B.Sc. (2007) and M.Sc. (2009) degrees in Computer Science from Seoul National University, Republic of Korea and the Ph.D. (2013) degree in Computer Science from the University of Birmingham. His interests concern statistical machine learning and probabilistic modelling of data. His current research focuses on supervised learning from unreliable annotated data.

Ata Kabán is a lecturer in the School of Computer Science of the University of Birmingham. Her current interests concern statistical machine learning, high dimensional data analysis, probabilistic modelling of data, and Bayesian inference. She received her B.Sc. degree with honours (1999) in Computer Science from the University Babes-Bolyai of Cluj-Napoca, Romania, and the Ph.D. degree in Computer Science (2001) from the University of Paisley, UK. She has been a visiting researcher at Helsinki University of Technology (June–December 2000 and in the summer of 2003) and at HIIT BRU, University of Helsinki (September 2005). Prior to her career in Computer Science, she received the B.A. degree in musical composition (1994) and the M.A. (1995) and the Ph.D. (1999) degrees in musicology from the Music Academy Gh. Dima of Cluj-Napoca, Romania.
