Information Sciences, Volume 558, May 2021, Pages 229-245

Cost-sensitive positive and unlabeled learning

https://doi.org/10.1016/j.ins.2021.01.002

Abstract

Positive and Unlabeled learning (PU learning) aims to train a binary classifier solely from positively labeled and unlabeled data when negatively labeled data are absent or distributed too diversely. However, none of the existing PU learning methods takes the class imbalance problem into account; as a consequence, the minority class is largely neglected and a biased classifier is likely to be generated. Therefore, this paper proposes a novel algorithm termed “Cost-Sensitive Positive and Unlabeled learning” (CSPU), which imposes different misclassification costs on different classes when conducting PU classification. Specifically, we assign distinct weights to the losses caused by false negative and false positive examples, and employ the double hinge loss to build our CSPU algorithm under the framework of empirical risk minimization. Theoretically, we analyze the computational complexity, and also derive a generalization error bound of CSPU which guarantees the good performance of our algorithm on test data. Empirically, we compare CSPU with state-of-the-art PU learning methods on a synthetic dataset, OpenML benchmark datasets, and real-world datasets. The results clearly demonstrate the superiority of the proposed CSPU over its competitors in dealing with class imbalanced tasks.

Introduction

Positive and Unlabeled learning (PU learning) [1] has gained increasing popularity in recent years due to its usefulness and effectiveness in practical applications; its target is to train a binary classifier from only positive and unlabeled data. Here the unlabeled data might be positive or negative, but the learning algorithm does not know their groundtruth labels during the training stage.

Since the training of a PU classifier does not depend on the explicit negative examples, it is preferred when the negative data are absent or distributed too diversely. For example, in information retrieval, the user-provided information constitutes the positive data, while the databases are regarded as unlabeled as they contain both similar and dissimilar information to the user’s query [2]. In this application, negative examples are unavailable, and thus PU learning can be utilized to find the user’s interest in the unlabeled set. In addition, in a remotely-sensed hyperspectral image, we may only be interested in identifying one specific land-cover type for certain use without considering other types [3]. In this case, we may directly treat the type-of-interest as positive and leave the remaining ones as negative, so PU learning can be employed to detect the image regions of positive land-cover type.

Existing PU learning algorithms can be mainly divided into three categories based on how the unlabeled data are treated. The first category [4], [5] initially identifies some reliable negative data within the unlabeled data, and then invokes a traditional classifier to perform ordinary supervised learning. The result of such a two-step framework depends heavily on the precision of the identified negative data; that is, if the detection of negative data is inaccurate, the final outcome can be disastrous. To handle this shortcoming, the second category [6], [7], [8] directly treats all unlabeled data as negative and casts PU learning as a label noise learning problem (the definition of label noise learning can be found in [9]), in which the positive examples hidden in the unlabeled set are deemed mislabeled as negative. The last but also the most prevalent category [10], [11], [12] in recent years focuses on designing various unbiased risk estimators. The approaches of this category apply distinct loss functions that satisfy specific conditions to PU risk estimators, resulting in various unbiased risk estimators. A breakthrough in this direction is [10], which proposed the first unbiased risk estimator with a nonconvex loss function $\ell(z)$ satisfying $\ell(z)+\ell(-z)=1$ (e.g., the ramp loss $\ell_R(z)=\frac{1}{2}\max(0,\min(2,1-z))$), with $z$ being the margin variable. Furthermore, a more general and consistent unbiased estimator was proposed in [11], which advances a novel “double hinge loss” $\ell_{DH}(z)=\max(-z,\max(0,\frac{1}{2}-\frac{1}{2}z))$ so that the composite loss $\tilde{\ell}(z)=\ell_{DH}(z)-\ell_{DH}(-z)$ satisfies $\tilde{\ell}(z)=-z$. After that, a nonnegative unbiased risk estimator suggested in [12] clips the negative part of the empirical risk in [11] to zero to avoid overfitting.
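
For concreteness, both losses can be checked numerically. The following minimal sketch (our illustration, not code from the paper) implements the ramp loss and the double hinge loss and verifies the two identities stated above:

    import numpy as np

    def ramp_loss(z):
        # Ramp loss of [10]: (1/2) * max(0, min(2, 1 - z)).
        return 0.5 * np.maximum(0.0, np.minimum(2.0, 1.0 - z))

    def double_hinge_loss(z):
        # Double hinge loss of [11]: max(-z, max(0, 1/2 - z/2)).
        return np.maximum(-z, np.maximum(0.0, 0.5 - 0.5 * z))

    z = np.linspace(-3.0, 3.0, 61)
    # Symmetry condition required by the estimator of [10].
    assert np.allclose(ramp_loss(z) + ramp_loss(-z), 1.0)
    # The composite loss of [11] is exactly linear: l_DH(z) - l_DH(-z) = -z.
    assert np.allclose(double_hinge_loss(z) - double_hinge_loss(-z), -z)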

Although the methods mentioned above have achieved encouraging performance on various datasets and tasks, they fail when encountering class imbalanced situations. In practical applications, class imbalance is prevalent, for example in credit card fraud detection, disease diagnosis, and outlier detection. In outlier detection, the very few outliers identified by a primitive detector constitute the positive set, and the remaining data points are deemed unlabeled because some outliers are probably hidden among them. Moreover, the outliers usually occupy only a small part of the entire dataset when compared with the inliers, which results in a class imbalanced PU learning problem. Unfortunately, none of the existing PU learning methods takes the class imbalance problem into consideration, so they are all likely to classify every example into the majority class (e.g., inlier) to acquire high classification accuracy. As a result, the influence of the minority class (e.g., outlier) will be overwhelmed by the majority class [13] in determining the decision function, and thus a biased classifier will be generated. This is obviously undesirable, as the minority class usually contains our primary interest.

To make PU learning applicable to imbalanced data, in this paper we propose a novel algorithm dubbed “Cost-Sensitive Positive and Unlabeled learning” (CSPU), which is convex and builds on the widely-used unbiased double hinge loss [11]. To be specific, we cast PU learning as an empirical risk minimization problem in which the losses incurred by false negative and false positive examples are assigned distinct weights. As a result, the generated decision boundary can be calibrated to the potentially correct one. We show that our CSPU algorithm can be converted into a traditional Quadratic Programming (QP) problem, so it can be easily solved via an off-the-shelf QP optimization toolbox. Theoretically, we analyze the computational complexity of our CSPU algorithm, and derive a generalization error bound of the algorithm based on its Rademacher complexity. Thorough experiments on various practical imbalanced datasets demonstrate that the proposed CSPU is superior to state-of-the-art PU methods in terms of the F-measure metric [14], [15]. The main contributions of our work are summarized as follows:

  • We propose a novel learning setting called “Cost-Sensitive PU learning” (CSPU) to model the practical problems where the absence of negative data and the class imbalance problem co-occur.

  • We design a novel algorithm to address the CSPU learning problem, which introduces a convex empirical risk estimator based on the double hinge loss; an efficient optimization method is also provided to solve our algorithm.

  • We analyze the computational complexity of our algorithm, which takes $O(9n^3+15n^2+7n+1)$ operations. We also derive a generalization error bound of the algorithm based on its Rademacher complexity, which reveals that the empirical risk converges to the expected classification risk at the rate $O(1/\sqrt{n_p}+1/\sqrt{n_u}+1/\sqrt{n})$, where $n$, $n_p$, and $n_u$ are the amounts of training data, positive data, and unlabeled data, respectively.

  • We achieve the state-of-the-art results when compared with other PU learning methods in dealing with class imbalanced PU learning problem.

The rest of this paper is organized as follows. In Section 2, the related works of PU learning and imbalanced data learning are reviewed. Section 3 introduces the proposed CSPU algorithm. The optimal solution of our CSPU is given in Section 4. Section 5 studies the computational complexity and derives a generalization error bound of the proposed algorithm. The experimental results of our CSPU and other representative PU comparators are presented in Section 6. Finally, we draw a conclusion in Section 7.

Related work

In this section, we review the representative works of PU learning and imbalanced data learning, as these two learning frameworks are very relevant to the topic of this paper.

The proposed algorithm

The target of PU learning is to train a binary classifier from only positive and unlabeled data. Our proposed algorithm aims to address the situations where the absence of negative training data and the class imbalance problem co-occur. These phenomena are prevalent in many real-world cases, such as outlier detection. In this section, we first provide the formal setting for the PU learning problem, and then propose our CSPU classification algorithm.
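
Before the formal development, it is worth seeing where the cost weights enter. The standard unbiased PU risk estimator of [11] reads

$\hat{R}_{PU}(g)=\frac{\pi}{n_p}\sum_{i=1}^{n_p}\ell(g(x_i^p))+\frac{1}{n_u}\sum_{j=1}^{n_u}\ell(-g(x_j^u))-\frac{\pi}{n_p}\sum_{i=1}^{n_p}\ell(-g(x_i^p)),$

where $\pi$ denotes the class prior, $x_i^p$ the positive examples, and $x_j^u$ the unlabeled examples. As an illustrative sketch of the idea (the exact weighting used by CSPU is given by the formulation (14)–(20) solved in the next section), a cost-sensitive variant scales the term penalizing errors on positives by a cost $c_{+1}$ and the terms penalizing errors on (pseudo-)negatives by a cost $c_{-1}$:

$\hat{R}_{CSPU}(g)=c_{+1}\frac{\pi}{n_p}\sum_{i=1}^{n_p}\ell(g(x_i^p))+c_{-1}\Big[\frac{1}{n_u}\sum_{j=1}^{n_u}\ell(-g(x_j^u))-\frac{\pi}{n_p}\sum_{i=1}^{n_p}\ell(-g(x_i^p))\Big].$

Choosing $c_{+1}>c_{-1}$ makes false negatives costlier than false positives, shifting the decision boundary toward the majority class so that the minority (positive) class is no longer overwhelmed.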

Optimization

In this section, we solve our algorithm presented in (14)–(20), which falls into the scope of Quadratic Programming (QP) of the form

$\min_{\gamma}\ \tfrac{1}{2}\gamma^{\top}H\gamma+f^{\top}\gamma \quad \mathrm{s.t.} \quad L\gamma\le k,\ q\le\gamma.$

In our algorithm, we let

$\gamma=\begin{bmatrix}\alpha_{(n+1)\times 1}\\ \eta_{n_p\times 1}\\ \xi_{n_u\times 1}\end{bmatrix}.$

Then $H$ is defined as

$H=\begin{bmatrix}\lambda K^{\top}K & O_{(n+1)\times n_p} & O_{(n+1)\times n_u}\\ O_{n_p\times(n+1)} & O_{n_p\times n_p} & O_{n_p\times n_u}\\ O_{n_u\times(n+1)} & O_{n_u\times n_p} & O_{n_u\times n_u}\end{bmatrix},$

where $O_{(n+1)\times n_p}$ is a zero matrix of size $(n+1)\times n_p$. Accordingly, the coefficient $f$ in (14) is constituted of

$f=\begin{bmatrix}0_{(n+1)\times 1}\\ \frac{\pi}{n_p}\mathbf{1}_{n_p\times 1}\\ \frac{c_{-1}}{n_u}\mathbf{1}_{n_u\times 1}\end{bmatrix}.$

Similarly, the $q$ in the constraint of (14) is

$q=\begin{bmatrix}-\infty_{(n+1)\times 1}\\ 0_{n_p\times 1}\\ 0_{n_u\times 1}\end{bmatrix},$

that is, the slack blocks $\eta$ and $\xi$ are required to be nonnegative while $\alpha$ is unbounded from below.
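
To make the reduction concrete, the following sketch (our illustration, not the authors' released code) solves a QP of the above form with the generic CVXOPT solver, assuming $H$, $f$, $L$, $k$, and $q$ have been prebuilt as NumPy arrays following the block definitions above:

    import numpy as np
    from cvxopt import matrix, solvers

    def solve_cspu_qp(H, f, L, k, q):
        # Solve min_gamma (1/2) gamma' H gamma + f' gamma
        # subject to L gamma <= k and gamma >= q.
        d = H.shape[0]
        # Rewrite the lower bound gamma >= q as -I gamma <= -q, keeping
        # only rows where q is finite (the alpha block is unbounded below).
        finite = np.isfinite(q)
        G = np.vstack([L, -np.eye(d)[finite]])
        h = np.concatenate([k, -q[finite]])
        # A tiny ridge keeps the zero blocks of H numerically positive
        # semidefinite for the interior-point solver.
        P = H + 1e-8 * np.eye(d)
        sol = solvers.qp(matrix(P), matrix(f), matrix(G), matrix(h))
        return np.array(sol['x']).ravel()

In practice, any QP solver accepting linear inequality constraints and lower bounds can be substituted; the problem has $n+1+n_p+n_u$ variables, which is what drives the cubic term in the complexity analysis of Section 5.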

Theoretical analyses

This section provides the theoretical analyses on CSPU. We firstly analyze the computational complexity of Algorithm 1, and then theoretically derive a generalization error bound of CSPU.
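
Schematically, the derived bound has the standard Rademacher-complexity shape: with probability at least $1-\delta$, the learned classifier $\hat{g}$ satisfies

$R(\hat{g}) \le \hat{R}(\hat{g}) + O\!\left(\frac{1}{\sqrt{n_p}}+\frac{1}{\sqrt{n_u}}+\frac{1}{\sqrt{n}}\right),$

where $R$ and $\hat{R}$ denote the expected and empirical risks, respectively; the exact constants and the confidence term are spelled out in the formal theorem.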

Experiments

In this section, we test the performance of our proposed CSPU by performing exhaustive experiments on one synthetic dataset, four publicly available benchmark datasets, and two real-world datasets. To demonstrate the superiority of CSPU, we compare it with several state-of-the-art PU learning algorithms including Weighted SVM (W-SVM) [19], Unbiased PU learning (UPU) [11], Multi-Layer Perceptron with Non-Negative PU risk estimator (NNPU-MLP) [12], and Linear classifier with Non-Negative PU risk estimator (NNPU-Linear) [12].
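
All methods are evaluated with the F-measure [14], [15], which balances precision and recall on the minority class and is therefore informative under class imbalance. For reference, a minimal computation from binary predictions with labels in {+1, -1} (an illustrative helper of ours, not the paper's evaluation code):

    import numpy as np

    def f_measure(y_true, y_pred):
        # F = 2TP / (2TP + FP + FN), the harmonic mean of precision and recall.
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == -1))
        fn = np.sum((y_pred == -1) & (y_true == 1))
        return 2.0 * tp / (2.0 * tp + fp + fn) if tp > 0 else 0.0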

Conclusion

In this paper, we propose a novel PU learning algorithm named “Cost-Sensitive PU learning” (CSPU) to deal with the class imbalance problem, which imposes distinct weights on the losses regarding false negative and false positive examples. PU learning is then formulated as an empirical risk minimization problem with respect to the unbiased double hinge loss, which makes the empirical risk convex. The proposed algorithm can be easily solved via an off-the-shelf quadratic programming optimization toolbox.

CRediT authorship contribution statement

Xiuhua Chen: Conceptualization, Data curation, Investigation, Methodology, Validation, Writing - original draft. Chen Gong: Formal analysis, Validation, Writing - review & editing, Supervision. Jian Yang: Writing - review & editing, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

We would like to thank Dr. Tongliang Liu from the University of Sydney for helping proofread this paper and all anonymous reviewers for the valuable comments to improve our paper. This work was supported by the NSF of China (Nos: 61973162, U1713208), the Fundamental Research Funds for the Central Universities (No: 30920032202), CCF-Tencent Open Fund (No: RAGR20200101), the “Young Elite Scientists Sponsorship Program” by CAST (No: 2018QNRC001), and Hong Kong Scholars Program (No: XJ2019036).

References (50)

  • B. Liu, W.S. Lee, P.S. Yu, X. Li, Partially supervised classification of text documents, in: International Conference on Machine Learning, 2002.
  • X. Li, B. Liu, Learning to classify texts using positive and unlabeled data, in: International Joint Conference on Artificial Intelligence, 2003.
  • W.S. Lee, B. Liu, Learning with positive and unlabeled examples using weighted logistic regression, in: International Conference on Machine Learning, 2003.
  • H. Shi, S. Pan, J. Yang, C. Gong, Positive and unlabeled learning via loss decomposition and centroid estimation, in: International Joint Conference on Artificial Intelligence, 2018.
  • F. He, T. Liu, G.I. Webb, D. Tao, Instance-dependent PU learning by bayesian optimal relabeling, arXiv preprint, 2018.
  • B. Frénay, M. Verleysen, Classification in the presence of label noise: a survey, IEEE Trans. Neural Netw. Learn. Syst., 2014.
  • M.C. du Plessis, G. Niu, M. Sugiyama, Analysis of learning from positive and unlabeled data, in: Advances in Neural Information Processing Systems, 2014.
  • M. du Plessis, G. Niu, M. Sugiyama, Convex formulation for learning from positive and unlabeled data, in: International Conference on Machine Learning, 2015.
  • R. Kiryo, G. Niu, M.C. du Plessis, M. Sugiyama, Positive-unlabeled learning with non-negative risk estimator, in: Advances in Neural Information Processing Systems, 2017.
  • N.V. Chawla et al., Special issue on learning from imbalanced data sets, ACM SIGKDD Explorations Newslett., 2004.
  • M. Liu et al., Cost-sensitive feature selection by optimizing F-measures, IEEE Trans. Image Process., 2017.
  • S.P. Parambath, N. Usunier, Y. Grandvalet, Optimizing F-measures by cost-sensitive classification, in: Advances in Neural Information Processing Systems, 2014.
  • B. Liu, Y. Dai, X. Li, W.S. Lee, S.Y. Philip, Building text classifiers using positive and unlabeled examples, in: IEEE International Conference on Data Mining, 2003.
  • H. Yu, J. Han, K.C.-C. Chang, PEBL: positive example based learning for web page classification using SVM, in: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.
  • C. Gong et al., Loss decomposition and centroid estimation for positive and unlabeled learning, IEEE Trans. Pattern Anal. Mach. Intell., 2019.