Learning kernel logistic regression in the presence of class label noise
Introduction
Traditional supervised learning machines rely on a correct set of class labels. There is, however, no guarantee that all the labels will be correct in practice, whether due to the scale of the labelling task, the lack of information available for determining the class labels, or the subjectivity of the labelling experts.
The presence of class label noise in training samples has been reported to deteriorate the performance of existing classifiers in a broad range of classification problems including biomedical data analysis [20], [30] and image classification [24], [47]. More recently, class label noise has emerged as a side effect of crowdsourcing practices, where annotators of different backgrounds are asked to perform labelling tasks; examples include Amazon's Mechanical Turk, citizen science projects and Galaxy Zoo, to name just a few. Although the problem posed by the presence of class label noise is acknowledged in the literature, it is often naively ignored in practice. Part of the reason for this may be that uniform/symmetric label noise is relatively harmless [21], [22], [12], [27].
There is a growing research literature that aims to address the issues of learning from samples with noisy class label assignments. The seemingly straightforward approach is data preprocessing, where any suspect samples are removed or relabelled [7], [1], [29], [37], [31], [18]. However, these approaches carry the risk of removing useful data as well, which is detrimental to classification performance, especially when the number of training examples is limited (e.g. in biomedical domains). Most previous approaches try to detect mislabelled instances based on various heuristics, and very few take a principled modelling approach, with the notable exceptions of [32], [24], [25], [36].
Lawrence and Schölkopf [24] incorporated a probabilistic model of random label flipping into their robust Kernel Fisher Discriminant (rKFD) for binary classification. Based on the same model, Li et al. [25] conducted extensive experiments on more complex data sets, which convincingly demonstrated the value of explicit modelling. The rKFD was later extended to the multi-class setting in [3], and this has further motivated the recent development of a label noise-tolerant Hidden Markov Model to improve segmentation [15].
While all these works demonstrate the great potential and flexibility of a model-based approach, most existing work falls in the category of generative methods. For classification problems, discriminative methods are of interest, and similar algorithmic developments for discriminative classifiers are still limited. For example, Magder et al. [28] studied logistic regression with known label flip probabilities and acknowledged difficulties when these probabilities are unknown. Hausman et al. [17] laid the foundations of a statistical model for the binary classification problem but provided no algorithmic solution for learning the label noise parameters.
Recently, Raykar et al. [36] proposed an EM algorithm to learn a latent variable model extension of logistic regression for data with multiple sets of noisy labels. Our initial work [4] suggested a more efficient gradient-based algorithm to optimise a similar latent variable model for problems where only a single set of labels is available; a sparse extension of the model was also developed in [4]. However, all of these developments are limited to linear problems. In this paper we focus on non-linear classification with labelling errors, which is not as trivial as it might look at first.
Since the introduction of the kernel trick, many linear classifiers have been equipped with the ability to solve non-linear problems, extending their use to a wider range of applications. Generally, deploying a kernel machine also involves determining good kernel parameters, and Cross-Validation (CV) has long been the established standard approach. However, when class label noise is present, it is no longer clear that CV is a good approach, since all candidate models are then validated against noisy class labels. The issue has also been briefly discussed in [24], [6]. In [24], the authors resort to using a 'trusted validation set' to select optimal kernel parameters. The trusted set must be labelled carefully, which seriously restricts the applicability of the method. For example, in crowdsourcing it would be very difficult (if not impossible) to construct such a trusted set.
We start by straightforwardly formulating a robust Kernel Logistic Regression (rKLR) as an extension of the robust Logistic Regression (rLR). We present a simple yet effective algorithm to learn the classifier and investigate whether or not CV is a reasonable approach for model selection in the presence of labelling errors. We find that performing CV in noisy environments gives rise to a slightly under-fitted model. We then propose a robust Multiple Kernel Logistic Regression algorithm (rMKLR) based on the so-called Multiple Kernel Learning (MKL) framework (an extensive survey of recent advances in MKL is given in [16]) and the Bayesian regularisation technique [9] to automate the model selection step without any cross-validation. From this we obtain improvements in both generalisation performance and learning speed. The genealogy of the proposed methods is summarised in Fig. 1, which serves as a roadmap for the next section.
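As a concrete illustration of the kernel combination underlying the MKL framework, the sketch below builds a convex combination of RBF base kernels. The function names and the choice of base kernels are illustrative, not those of the paper:

```python
import numpy as np

def rbf_kernel(X, Z, gamma):
    """Gaussian RBF kernel matrix between the rows of X and Z."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def combined_kernel(X, Z, gammas, betas):
    """Convex combination of base kernels, as in MKL:
    k(x, z) = sum_m beta_m * k_m(x, z), with beta_m >= 0, sum_m beta_m = 1."""
    betas = np.asarray(betas, dtype=float)
    betas = betas / betas.sum()              # project weights onto the simplex
    return sum(b * rbf_kernel(X, Z, g) for b, g in zip(betas, gammas))

# Toy data: a combined kernel over three RBF widths
X = np.random.RandomState(0).randn(5, 2)
K = combined_kernel(X, X, gammas=[0.1, 1.0, 10.0], betas=[0.2, 0.5, 0.3])
```

In MKL the weights `betas` are learned jointly with the classifier; here they are fixed only to show the construction of the combined Gram matrix.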
Throughout this work, similar to the related work above, we will focus on label noise occurring at random – the flipping of labels is assumed to be independent of the contents of the data features. The reason for this is simplicity and generic applicability. Alternative models of label noise are discussed after the Experiments section.
Section snippets
Robust kernel logistic regression
Consider a set of training samples $S = \{(\mathbf{x}_n, \tilde{y}_n)\}_{n=1}^{N}$, where $\mathbf{x}_n \in \mathbb{R}^d$ and $\tilde{y}_n \in \{0,1\}$ denotes the observed (possibly noisy) label of $\mathbf{x}_n$. Kernel logistic regression produces a non-linear decision boundary, $f(\mathbf{x})$, by forming a linear decision boundary in the space of the non-linearly transformed input vectors. By the representer theorem [19], the optimal $f$ has the form
$$f(\mathbf{x}) = \sum_{n=1}^{N} \alpha_n k(\mathbf{x}_n, \mathbf{x}),$$
where $k(\cdot,\cdot)$ is a positive definite reproducing kernel that gives an inner product in the transformed space.
Denoting
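The latent-variable likelihood at the heart of robust (kernel) logistic regression can be sketched in a few lines of NumPy. This is a minimal sketch, assuming a two-by-two flip-probability parameterisation in which `gamma[j, k]` stands for P(observed label = k | true label = j); the function and variable names are illustrative, not the paper's notation:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def rklr_neg_log_lik(alpha, K, y_obs, gamma):
    """Negative log-likelihood of the label-noise model (sketch).

    p(y_obs = 1 | x) = gamma[0, 1] * (1 - s) + gamma[1, 1] * s,
    where s = sigmoid(f(x)) is the posterior of the true label being 1
    and gamma[j, k] = P(y_obs = k | y_true = j) are flip probabilities.
    """
    s = sigmoid(K @ alpha)                      # P(y_true = 1 | x) per sample
    p1 = gamma[0, 1] * (1 - s) + gamma[1, 1] * s
    p1 = np.clip(p1, 1e-12, 1 - 1e-12)          # guard the logarithms
    return -np.sum(y_obs * np.log(p1) + (1 - y_obs) * np.log(1 - p1))

# With gamma = I (no label noise assumed) this reduces to ordinary KLR:
K = np.eye(4)                                   # toy kernel matrix
y_obs = np.array([0, 1, 1, 0])
nll = rklr_neg_log_lik(np.zeros(4), K, y_obs, np.eye(2))
```

In practice this objective would be minimised with a gradient-based optimiser over both `alpha` and `gamma`, with a regularisation term added; none of that machinery is shown here.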
Experiments
We conducted extensive experiments to answer three main research questions:
- Firstly, we ask if rKLR improves on KLR in terms of robustness against labelling errors, as measured by classification performance. To answer this question, we also study the relative harm of two common types of label noise: symmetric and asymmetric noise. Symmetric noise is when the same percentage of class labels flips from one class into the other, while asymmetric noise is when labels from one class flip into the other
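The two noise types above can be simulated with a small helper. This is a hypothetical sketch for binary labels; `p01` and `p10` are illustrative names for the two flip probabilities:

```python
import numpy as np

def flip_labels(y, p01, p10, rng):
    """Inject random label noise into binary labels y in {0, 1}.

    p01 = P(observed 1 | true 0), p10 = P(observed 0 | true 1).
    Symmetric noise: p01 == p10.  Asymmetric noise: p01 != p10.
    Flipping is independent of the features, matching the noise-at-random model.
    """
    y = np.asarray(y)
    u = rng.random(y.shape)
    flip = np.where(y == 0, u < p01, u < p10)
    return np.where(flip, 1 - y, y)

rng = np.random.default_rng(0)
y = np.array([0, 0, 1, 1, 1])
y_sym = flip_labels(y, 0.2, 0.2, rng)    # symmetric: both classes flip at 20%
y_asym = flip_labels(y, 0.0, 0.4, rng)   # asymmetric: only class 1 flips
```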
Extension to multi-class problems
The proposed multi-kernel approach with the Bayesian regularisation technique can be straightforwardly extended to multi-class problems. In the multi-class setting the class posterior of the true label is typically modelled by the softmax function:
$$P(y = k \mid \mathbf{x}) = \frac{\exp(f_k(\mathbf{x}))}{\sum_{j} \exp(f_j(\mathbf{x}))}.$$
Using this we can write the likelihood of the observed label as the following:
$$P(\tilde{y} = j \mid \mathbf{x}) = \sum_{k} \gamma_{kj}\, P(y = k \mid \mathbf{x}),$$
where $\gamma_{kj} = P(\tilde{y} = j \mid y = k)$ is the probability that true class $k$ is observed as class $j$, which brings us to the objective of the 'robust Multi-Class Multi-Kernel
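A minimal sketch of this construction, assuming a row-stochastic flip matrix `Gamma` with `Gamma[k, j]` standing for P(observed = j | true = k); the names and the example numbers are illustrative:

```python
import numpy as np

def softmax(A):
    """Row-wise softmax of a score matrix (samples x classes)."""
    A = A - A.max(axis=1, keepdims=True)   # subtract max for numerical stability
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def noisy_label_likelihood(P_true, Gamma):
    """Likelihood of the observed labels under the flipping model (sketch):
    P(y_obs = j | x) = sum_k Gamma[k, j] * P(y_true = k | x)."""
    return P_true @ Gamma

scores = np.array([[2.0, 1.0, 0.5],
                   [0.0, 0.0, 0.0]])
P_true = softmax(scores)                   # class posteriors of the true label
Gamma = np.array([[0.9, 0.05, 0.05],       # rows sum to one
                  [0.1, 0.8,  0.1 ],
                  [0.0, 0.2,  0.8 ]])
P_obs = noisy_label_likelihood(P_true, Gamma)
```

Because each row of `Gamma` sums to one, each row of `P_obs` is again a valid distribution over the observed classes.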
Conclusions
We proposed a novel algorithm to learn a label-noise robust Kernel Logistic Regression model in which the optimal hyper-parameters are automatically determined using Multiple Kernel Learning and Bayesian regularisation techniques. The experimental results show that the latent variable model used is robust against mislabelling, while the proposed learning algorithm is faster and has superior predictive ability compared with traditional approaches. In comparisons with three state-of-the-art kernel
Conflict of interest
None declared.
Jakramate Bootkrajang is a lecturer in the Department of Computer Science of Chiang Mai University. He received his B.Sc. (2007) and M.Sc. (2009) degrees in Computer Science from Seoul National University, Republic of Korea and the Ph.D. (2013) degree in Computer Science from the University of Birmingham. His interests concern statistical machine learning and probabilistic modelling of data. His current research focuses on supervised learning from unreliable annotated data.
References (47)
- et al., Robust supervised classification with mixture models: learning from data with uncertain labels, Pattern Recognit. (2009)
- et al., Pattern recognition with a Bayesian kernel combination machine, Pattern Recognit. Lett. (2009)
- et al., Misclassification of the dependent variable in a discrete-response setting, J. Econom. (1998)
- et al., Some results on Tchebycheffian spline functions, J. Math. Anal. Appl. (1971)
- et al., Efficiency of discriminant analysis when initial samples are classified stochastically, Pattern Recognit. (1990)
- et al., Classification in the presence of class noise using a probabilistic kernel Fisher method, Pattern Recognit. (2007)
- Learning with an unreliable teacher, Pattern Recognit. (1992)
- et al., Analysis of new techniques to obtain quality training sets, Pattern Recognit. Lett. (2003)
- R. Barandela, E. Gasca, Decontamination of training samples for supervised pattern recognition methods, in: Advances in...
- et al., Support vector machines under adversarial label noise, J. Mach. Learn. Res.: Proc. Track (2011)
- Classification of mislabelled microarrays using robust sparse logistic regression, Bioinformatics
- Identifying mislabeled training data, J. Artif. Intell. Res.
- On over-fitting in model selection and subsequent selection bias in performance evaluation, J. Mach. Learn. Res.
- Preventing over-fitting during model selection via Bayesian regularisation of the hyper-parameters, J. Mach. Learn. Res.
- A component-wise EM algorithm for mixtures, J. Comput. Graph. Stat.
- LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol.
- Linear discriminant analysis with misallocation in training samples, J. Am. Stat. Assoc.
- Multiple kernel learning algorithms, J. Mach. Learn. Res.
Ata Kabán is a lecturer in the School of Computer Science of the University of Birmingham. Her current interests concern statistical machine learning, high dimensional data analysis, probabilistic modelling of data, and Bayesian inference. She received her B.Sc. degree with honours (1999) in Computer Science from the University Babes-Bolyai of Cluj-Napoca, Romania, and the Ph.D. degree in Computer Science (2001) from the University of Paisley, UK. She has been a visiting researcher at Helsinki University of Technology (June–December 2000 and in the summer of 2003) and at HIIT BRU, University of Helsinki (September 2005). Prior to her career in Computer Science, she received the B.A. degree in musical composition (1994) and the M.A. (1995) and the Ph.D. (1999) degrees in musicology from the Music Academy Gh. Dima of Cluj-Napoca, Romania.