1 Introduction

Positive-unlabeled (PU) classification is the problem of training a binary classifier from only positive and unlabeled data (Letouzey et al., 2000; Elkan & Noto, 2008). This problem is important when it is difficult to gather negative data, and appears in many applications, such as inlier-based outlier detection (Blanchard et al., 2010), land-cover classification (Li et al., 2011), matrix completion (Hsieh et al., 2015), and sequential data classification (Li et al., 2009; Nguyen et al., 2011). Several heuristic approaches for PU classification have been proposed in the past (Liu et al., 2003; Li & Liu, 2003), which aim to identify negative samples in the unlabeled dataset, yet they rely heavily on heuristic strategies and on the assumption that the data are separable. One of the most theoretically and practically effective methods for PU classification was established by Plessis et al. (2014, 2015), called unbiased PU classification. It rewrites the classification risk in terms of the distributions over positive and unlabeled samples, and obtains an unbiased estimator of the risk without negative samples. Although unbiased PU classification works well with simple models such as linear-in-parameter models, it easily suffers from overfitting with flexible models such as deep neural networks. To overcome this problem, a non-negative risk estimator (Kiryo et al., 2017) for PU classification was proposed.

Besides unbiased PU classification, various approaches for PU classification have also been proposed recently. For example, generative adversarial networks (GAN) have been applied to PU classification by Hou et al. (2018), which allows one to learn from a small number of positive samples. Zhang et al. (2019) introduced ambiguity to unlabeled samples and performed PU label disambiguation (PULD) based on margin maximization to determine the true labels of all unlabeled examples. A variational approach and a data augmentation method based on Mixup (Zhang et al., 2018) were proposed by Chen et al. (2020) for PU classification without explicit estimation of the class-prior of the training data.

One of the drawbacks of these approaches is that the distribution of the test data must be identical to that of the training data, which may be violated in practice (Quionero-Candela et al., 2009). For example, the class-prior (the ratio of positive data) in the training unlabeled dataset might be different from that of the test data, known as the class-prior shift problem. To cope with this problem, Charoenphakdee and Sugiyama (2019) showed that classification under class-prior shift can be written as cost-sensitive classification, and proposed a risk-minimization approach and a density ratio estimation (Sugiyama et al., 2012) approach. In their study, both the class-priors of the training and test data are assumed to be given in advance, but this assumption does not hold in many practical cases. Therefore, we need to estimate them with the training and test data.

However, it is usually hard to obtain samples from the test distribution at training time, and assuming their availability is unnatural because we do not know in advance whether a prior shift will occur. Furthermore, the training data may become inaccessible once training has been completed, especially in privacy-sensitive applications such as click analysis (McMahan et al., 2013), purchase prediction (Martínez et al., 2018), and voting prediction (Coletto et al., 2015). In such applications, the model is trained on data that include personal information, and only the trained model is kept while the data must be discarded. This implies that we are not allowed to use the training data when a classifier is adapted to an unknown test distribution.

Table 1 Comparisons of representative existing PU classification methods. uPU was proposed by Plessis et al. (2014, 2015), nnPU was proposed by Kiryo et al. (2017), GenPU was proposed by Hou et al. (2018), PULD was proposed by Zhang et al. (2019), VPU was proposed by Chen et al. (2020), and PUa was proposed by Charoenphakdee and Sugiyama (2019)

To overcome these problems, we propose an approach based on density ratio estimation (Sugiyama et al., 2012). Density ratio estimation for PU classification has appeared in several existing works (Charoenphakdee & Sugiyama, 2019; Kato et al., 2019), yet those studies provide no theoretical guarantees relating binary classification to density ratio estimation. Our proposed method can train a classifier without knowledge of the class-priors, and can adapt to test-time class-prior shift without access to the training data. Table 1 summarizes comparisons of representative existing methods and our proposed method. Our main contributions are: (i) We propose a method for PU classification under test-time class-prior shift with unknown class-priors, (ii) We theoretically justify the proposed method, and (iii) Experimental results show the effectiveness of the proposed method.

2 Preliminaries

In this section, we introduce the notations, and review the concepts of unbiased/non-negative PU classification, cost-sensitive classification, and density ratio estimation.

2.1 Problem formulation

Let \(X \in {\mathbb {R}}^d\) and \(Y \in \{\pm \, 1\}\) be the input and output random variables, where d denotes the dimensionality of the input variable. Let \(p(x, y)\) be the underlying joint density of (X, Y) and p(x) be the input marginal density. We denote the positive and negative class-conditional densities as

$$\begin{aligned} \begin{aligned} p_+(x)&=p(x \mid Y=+1) \\ p_-(x)&=p(x \mid Y=-1). \end{aligned} \end{aligned}$$
(1)

Let \(\pi = p(Y=+1)\) be the positive class-prior probability. Assume we have i.i.d. sample sets \(\mathcal {X}_\mathrm {P}\) and \(\mathcal {X}_\mathrm {U}\) from \(p_+(x)\) and p(x) respectively, and let \(n_\mathrm {P}= \left|\mathcal {X}_\mathrm {P} \right|\) and \(n_\mathrm {U}= \left|\mathcal {X}_\mathrm {U} \right|\), where \(\left|\cdot \right|\) denotes the cardinality of a set. We denote the expectations over each class-conditional density as

$$\begin{aligned} \begin{aligned} \mathbb {E}_{\mathrm {P}} \left[ \cdot \right]&= \mathbb {E}_{X \sim p_+} \left[ \cdot \right] \\ \mathbb {E}_{\mathrm {N}} \left[ \cdot \right]&= \mathbb {E}_{X \sim p_-} \left[ \cdot \right] \\ \mathbb {E}_{\mathrm {U}} \left[ \cdot \right]&= \mathbb {E}_{X \sim p} \left[ \cdot \right] = \mathbb {E}_{X} \left[ \cdot \right] , \end{aligned} \end{aligned}$$
(2)

and their empirical counterparts as

$$\begin{aligned} \begin{aligned} \widehat{\mathbb {E}}_{\mathrm {P}} \left[ f(X) \right]&= \frac{1}{n_\mathrm {P}} \sum _{x \in \mathcal {X}_\mathrm {P}} f(x) \\ \widehat{\mathbb {E}}_{\mathrm {U}} \left[ f(X) \right]&= \frac{1}{n_\mathrm {U}} \sum _{x \in \mathcal {X}_\mathrm {U}} f(x), \end{aligned} \end{aligned}$$
(3)

where f is an arbitrary function of \(x \in {\mathbb {R}}^d\). Let \(g : {\mathbb {R}}^d \rightarrow {\mathbb {R}}\) be a real-valued decision function. The purpose of binary classification is to minimize the expected classification risk

$$\begin{aligned} R(g) = \mathbb {E}_{X, Y} \left[ 1\{\mathrm {sign}(g(X)) \ne Y\} \right] , \end{aligned}$$
(4)

where \(1\{\cdot \}\) denotes the indicator function. Since the optimization problem based on the zero-one loss is computationally infeasible (Arora et al., 1997; Bartlett et al., 2006), a surrogate loss function \(\ell : {\mathbb {R}} \times \{\pm \, 1\} \rightarrow {\mathbb {R}}\) is used in practice. The classification risk with respect to the surrogate loss is defined as

$$\begin{aligned} R_\ell (g) = \mathbb {E}_{X, Y} \left[ \ell (g(X), Y) \right] . \end{aligned}$$
(5)

2.2 Unbiased/non-negative PU classification

The surrogate classification risk can be written as

$$\begin{aligned} R_\ell (g) = \pi \mathbb {E}_{\mathrm {P}} \left[ \ell (g(X), +1) \right] + (1 - \pi ) \mathbb {E}_{\mathrm {N}} \left[ \ell (g(X), -1) \right] . \end{aligned}$$
(6)

Since negative samples are unavailable in PU classification, we rewrite the expectation over the negative class-conditional distribution as

$$\begin{aligned} (1 - \pi ) \mathbb {E}_{\mathrm {N}} \left[ \ell (g(X), -1) \right] = \mathbb {E}_{\mathrm {U}} \left[ \ell (g(X), -1) \right] - \pi \mathbb {E}_{\mathrm {P}} \left[ \ell (g(X), -1) \right] , \end{aligned}$$
(7)

where \(p(x) = \pi p_+(x) + (1 - \pi ) p_-(x)\) is used (Plessis et al., 2014). Then, the risk can be approximated directly with \(\mathcal {X}_\mathrm {P}\) and \(\mathcal {X}_\mathrm {U}\) as

$$\begin{aligned} \widehat{R}_\ell (g) = \pi \widehat{\mathbb {E}}_{\mathrm {P}} \left[ \ell (g(X), +1) \right] - \pi \widehat{\mathbb {E}}_{\mathrm {P}} \left[ \ell (g(X), -1) \right] + \widehat{\mathbb {E}}_{\mathrm {U}} \left[ \ell (g(X), -1) \right] . \end{aligned}$$
(8)

The empirical risk estimator \(\widehat{R}_\ell (g)\) is unbiased and consistent (Niu et al., 2016), i.e., \(\mathbb {E} \left[ \widehat{R}_\ell (g) \right] = R_\ell (g)\) where the expectation \({\mathbb {E}}\) is taken over all of the samples, and \(\widehat{R}_\ell (g) \rightarrow R_\ell (g)\) as \(n_\mathrm {P}, n_\mathrm {U}\rightarrow \infty \).
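To make the estimator concrete, the following numpy sketch evaluates Eq. (8) from decision-function outputs on the positive and unlabeled samples. The logistic surrogate loss and the function names are our illustrative choices, not part of the original method:

```python
import numpy as np

def logistic_loss(margin):
    # l(z, y) = log(1 + exp(-y z)), written in terms of the margin y * z
    return np.logaddexp(0.0, -margin)

def unbiased_pu_risk(g_pos, g_unl, pi):
    """Unbiased PU risk estimate of Eq. (8).

    g_pos, g_unl: decision-function outputs g(x) on the positive and
    unlabeled samples; pi: the positive class-prior.
    """
    r_pos_as_pos = np.mean(logistic_loss(+g_pos))   # E_P[l(g(X), +1)]
    r_pos_as_neg = np.mean(logistic_loss(-g_pos))   # E_P[l(g(X), -1)]
    r_unl_as_neg = np.mean(logistic_loss(-g_unl))   # E_U[l(g(X), -1)]
    return pi * r_pos_as_pos - pi * r_pos_as_neg + r_unl_as_neg
```

Note that nothing prevents this estimate from being negative; a model that confidently scores all unlabeled points as negative drives it below zero, which is exactly the overfitting symptom discussed next.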

Unbiased PU classification easily suffers from overfitting when we use a flexible model such as neural networks, because the model can be so powerful that it fits all of the given samples, and then the empirical risk goes negative (Kiryo et al., 2017). To mitigate this problem, a non-negative risk correction approach was proposed (Kiryo et al., 2017). Since

$$\begin{aligned} \mathbb {E}_{\mathrm {U}} \left[ \ell (g(X), -1) \right] - \pi \mathbb {E}_{\mathrm {P}} \left[ \ell (g(X), -1) \right] = (1 - \pi )\mathbb {E}_{\mathrm {N}} \left[ \ell (g(X), -1) \right] \ge 0 \end{aligned}$$
(9)

holds for any non-negative loss function, we correct the corresponding part of the expected risk to be non-negative. Approximating the expectations by sample averages gives the non-negative risk estimator:

$$\begin{aligned} \widetilde{R}_\ell (g) = \pi \widehat{\mathbb {E}}_{\mathrm {P}} \left[ \ell (g(X), +1) \right] + \left( \widehat{\mathbb {E}}_{\mathrm {U}} \left[ \ell (g(X), -1) \right] - \pi \widehat{\mathbb {E}}_{\mathrm {P}} \left[ \ell (g(X), -1) \right] \right) _+, \end{aligned}$$
(10)

where \((\cdot )_+ = \max (0, \cdot )\). The non-negative risk estimator is biased yet consistent, and its bias decreases exponentially with respect to \(n_\mathrm {P}+ n_\mathrm {U}\) (Kiryo et al., 2017).
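The correction amounts to clipping the negative-class part of the risk at zero before adding it to the positive part. A minimal numpy sketch of Eq. (10), again with the logistic surrogate as an assumed loss:

```python
import numpy as np

def logistic_loss(margin):
    # l(z, y) = log(1 + exp(-y z)) in terms of the margin y * z
    return np.logaddexp(0.0, -margin)

def nn_pu_risk(g_pos, g_unl, pi):
    """Non-negative PU risk estimate of Eq. (10): the estimated
    negative-class risk is clipped at zero."""
    pos_part = pi * np.mean(logistic_loss(+g_pos))
    neg_part = (np.mean(logistic_loss(-g_unl))
                - pi * np.mean(logistic_loss(-g_pos)))
    return pos_part + max(0.0, neg_part)
```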

2.3 Cost-sensitive classification

For an arbitrary false-positive cost parameter \(c \in (0, 1)\), cost-sensitive classification is the problem of minimizing the following risk (Elkan, 2001; Scott, 2012):

$$\begin{aligned} R_{\pi , c}(g) = (1 - c) \pi \mathbb {E}_{\mathrm {P}} \left[ 1\{\mathrm {sign}(g(X)) \ne +1\} \right] + c(1 - \pi ) \mathbb {E}_{\mathrm {N}} \left[ 1\{\mathrm {sign}(g(X)) \ne -1\} \right] . \end{aligned}$$
(11)

When \(c= 1/2\), cost-sensitive classification reduces to ordinary binary classification, up to the unessential scaling factor 1/2. Charoenphakdee and Sugiyama (2019) showed that classification under class-prior shift can be formulated as cost-sensitive classification. For example, let \(\pi ^\prime \in (0, 1)\) be the class-prior of the test distribution; then \(R_{\pi ^\prime , 1/2} \propto R_{\pi , c}\) with \(c= \frac{\pi (1 - \pi ^\prime )}{\pi (1 - \pi ^\prime ) + (1 - \pi )\pi ^\prime }\).
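The shifted cost is a closed-form function of the two class-priors. A small sketch (the function name `shifted_cost` is ours):

```python
def shifted_cost(pi_train, pi_test):
    """False-positive cost c such that R_{pi, c} is proportional to the
    ordinary classification risk R_{pi', 1/2} under the test prior pi'."""
    num = pi_train * (1.0 - pi_test)
    return num / (num + (1.0 - pi_train) * pi_test)
```

When the test prior equals the training prior, the cost reduces to 1/2 and ordinary binary classification is recovered.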

2.4 Class-prior estimation

In unbiased/non-negative PU classification, the class-prior is assumed to be given, which does not hold in many practical cases. Unfortunately, we cannot treat \(\pi \) as a hyperparameter to be tuned, because there exists a trivial solution such as \(\pi = 0\) and \(g(x) \equiv \mathrm {argmin}_v \ell (v, -1)\). One possible solution is to estimate the training and test class-priors by existing methods, using the positive, training-unlabeled, and test-unlabeled datasets. However, class-prior estimation is known to be an ill-posed problem without additional assumptions (Blanchard et al., 2010; Scott et al., 2013). For example, if

$$\begin{aligned} p(x) = \kappa p_+(x) + (1 - \kappa ) p_-(x) \end{aligned}$$
(12)

holds, then there exists a density \(p_-^\prime (x)\) such that

$$\begin{aligned} p(x) = (\kappa - \delta )p_+(x) + (1 - \kappa + \delta )p_-^\prime (x) \end{aligned}$$
(13)

for any \(0 \le \delta \le \kappa \), so the decomposition is not unique. In practice, the alternative goal of estimating the maximum mixture proportion

$$\begin{aligned} \kappa ^* = \max \{ \kappa \in [0, 1] : \exists p_- \ \mathrm {s.t.} \ p(x) = \kappa p_+(x) + (1 - \kappa )p_-(x) \} \end{aligned}$$
(14)

is pursued (Blanchard et al., 2010; Scott et al., 2013). The irreducibility assumption (Blanchard et al., 2010; Scott et al., 2013) gives a constraint on the true underlying densities which ensures that \(\kappa ^*\) is the unique solution of prior estimation.

Definition 1

(Irreducibility (Blanchard et al., 2010; Scott et al., 2013)) Let G and H be probability distributions on \(({\mathbb {R}}^d, \mathfrak {S})\) where \(\mathfrak {S}\) is a Borel algebra on \({\mathbb {R}}^d\). We say that G is irreducible with respect to H, if there is no decomposition of the form \(G = \kappa H + (1 - \kappa ) H^\prime \) where \(H^\prime \) is some probability distribution and \(0 < \kappa \le 1\).

Let P, \(P_+\), and \(P_-\) be the cumulative distribution functions of p, \(p_+\), and \(p_-\) respectively. Under the irreducibility assumption, the class-prior is identical to the maximum mixture proportion.

Proposition 1

(Blanchard et al., 2010; Scott et al., 2013) Let P, \(P_+\), and \(P_-\) be probability distributions on \(({\mathbb {R}}^d, \mathfrak {S})\). If \(P = \pi P_+ + (1 - \pi ) P_-\) and \(P_-\) is irreducible with respect to \(P_+\), then

$$\begin{aligned} \begin{aligned} \pi&= \max \{\kappa \in [0, 1] : \exists Q \ \mathrm {s.t.} \ P = \kappa P_+ + (1 - \kappa )Q \} \\&= \inf _{S \in \mathfrak {S}, P_+(S) > 0} \frac{P(S)}{P_+(S)}. \end{aligned} \end{aligned}$$
(15)

Note that the set \(S \in \mathfrak {S}\) corresponds to a measurable hypothesis \(h : {\mathbb {R}}^d \rightarrow \{-1, +1\}\) bijectively. Based on these facts, several works for class-prior estimation have been proposed (Blanchard et al., 2010; Scott et al., 2013; Scott, 2015; Ramaswamy et al., 2016; Plessis et al., 2016). However, they usually work with kernel methods which are computationally hard to apply to large-scale and high-dimensional data. Furthermore, since the unbiased/non-negative risk estimators depend on the class-prior, an estimation error of the class-prior directly affects the optimization. In addition, we usually do not have a sample set from the test distribution at the training time, and thus cannot even estimate the class-prior of the test data by such existing methods.

2.5 Density ratio estimation

The ratio of two probability densities has attracted attention in various problems (Sugiyama et al., 2009, 2012). Density ratio estimation (DRE) (Sugiyama et al., 2012) aims to directly estimate the ratio of two probabilities, instead of estimating the two densities separately. Sugiyama et al. (2011) showed that various existing DRE methods (Sugiyama et al., 2008; Kanamori et al., 2009; Kato et al., 2019) can be unified from the viewpoint of Bregman divergence minimization, so we consider the DRE problem as a Bregman divergence minimization problem.

Here we consider estimating the ratio of the positive class-conditional density to the input marginal density. Let \(r^*(x) = p_+(x)/p(x)\) be the true density ratio and \(r : {\mathbb {R}}^d \rightarrow [0, \infty )\) be a density ratio model. For a convex and differentiable function \(f : [0, \infty ) \rightarrow {\mathbb {R}}\), the expected Bregman divergence, which measures the discrepancy from \(r^*\) to r, is defined as

$$\begin{aligned} \begin{aligned} \mathrm {BR}_{f} \left( {r^*} \parallel {r} \right)&= \int \left( f(r^*(x)) - f(r(x)) - f^\prime (r(x))(r^*(x)-r(x)) \right) p(x)\mathrm {d}x \\&= \mathbb {E}_{\mathrm {P}} \left[ -f^\prime (r(X)) \right] + \mathbb {E}_{\mathrm {U}} \left[ f^\prime (r(X))r(X) - f(r(X)) \right] + \mathrm {const.}, \end{aligned} \end{aligned}$$
(16)

where the constant term does not include r. The function f is called the generator function of the Bregman divergence (Menon & Ong, 2016). We can see that the Bregman divergence of DRE does not contain the class-prior \(\pi \), and can be approximated by taking empirical averages over the positive and unlabeled datasets, except the constant term.

Similarly to the case of unbiased PU classification, it was revealed that empirical Bregman divergence minimization often suffers from severe overfitting when we use a highly flexible model (Kato & Teshima, 2021). To mitigate this problem, non-negative risk correction for the Bregman divergence minimization was proposed, based on the idea of non-negative PU classification (Kiryo et al., 2017).

The objective function for Bregman divergence minimization is defined by Eq. (16) without the constant term

$$\begin{aligned} \mathcal {L}_f(r) = \mathbb {E}_{\mathrm {P}} \left[ -f^\prime (r(X)) \right] + \mathbb {E}_{\mathrm {U}} \left[ f^\prime (r(X))r(X) - f(r(X)) \right] . \end{aligned}$$
(17)

We also consider its empirical counterpart \(\widehat{\mathcal {L}}_f(r)\). Let us denote

$$\begin{aligned} \begin{aligned} f^*(t)&= tf^\prime (t) - f(t) \\ \mathfrak {F}(t)&= f^*(t) - f^*(0), \end{aligned} \end{aligned}$$
(18)

then \(\mathfrak {F}\) is non-negative on \([0, \infty )\), because \((f^*)^\prime (t) = tf^{\prime \prime }(t) \ge 0\) by the convexity of f. Picking \(\alpha \) to be a lower bound of \(\pi \), we have

$$\begin{aligned} \mathbb {E}_{\mathrm {U}} \left[ \mathfrak {F}(r(X)) \right] - \alpha \mathbb {E}_{\mathrm {P}} \left[ \mathfrak {F}(r(X)) \right] \ge (1 - \pi ) \mathbb {E}_{\mathrm {N}} \left[ \mathfrak {F}(r(X)) \right] \ge 0. \end{aligned}$$
(19)

Thus, we define the corrected empirical estimator of \(\mathcal {L}\) as

$$\begin{aligned} \begin{aligned} \widetilde{\mathcal {L}}_f(r)&= \widehat{\mathbb {E}}_{\mathrm {P}} \left[ -f^\prime (r(X)) + \alpha \mathfrak {F}(r(X)) \right] \\&\quad + \left( \widehat{\mathbb {E}}_{\mathrm {U}} \left[ \mathfrak {F}(r(X)) \right] - \alpha \widehat{\mathbb {E}}_{\mathrm {P}} \left[ \mathfrak {F}(r(X)) \right] \right) _+ + f^*(0), \end{aligned} \end{aligned}$$
(20)

where \((\cdot )_+ = \max (0, \cdot )\). \(\widetilde{\mathcal {L}}_f\) is consistent as long as \(0 \le \alpha \le \pi \), and its bias decreases exponentially with respect to \(n_\mathrm {P}+ n_\mathrm {U}\). Even without knowledge of \(\pi \), we can tune \(\alpha \) as a hyperparameter on positive and unlabeled validation datasets, using as the criterion the empirical estimator without non-negative correction, \(\widehat{\mathcal {L}}_f(r)\), which contains neither \(\pi \) nor \(\alpha \).
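The corrected objective of Eq. (20) is straightforward to evaluate once the generator f and its derivative are supplied. Below is a numpy sketch with hypothetical function names; the quadratic (LSIF) instance \(f(t) = t^2/2\), used later in the paper, is given as an example:

```python
import numpy as np

def corrected_dre_loss(r_pos, r_unl, alpha, f, f_prime):
    """Non-negatively corrected empirical Bregman objective (Eq. 20).

    r_pos, r_unl: density-ratio model outputs r(x) on the positive and
    unlabeled samples; alpha: a lower bound of the class-prior pi;
    f, f_prime: the Bregman generator and its derivative.
    """
    f_star = lambda t: t * f_prime(t) - f(t)      # f*(t) = t f'(t) - f(t)
    F = lambda t: f_star(t) - f_star(0.0)         # the non-negative part
    pos_term = np.mean(-f_prime(r_pos) + alpha * F(r_pos))
    gap = np.mean(F(r_unl)) - alpha * np.mean(F(r_pos))
    return pos_term + max(0.0, gap) + f_star(0.0)

# LSIF instance: f(t) = t^2 / 2, so F(t) = t^2 / 2 and f*(0) = 0
lsif = dict(f=lambda t: t ** 2 / 2, f_prime=lambda t: t)
```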

3 Density ratio estimation for PU learning

In this section, we formulate a cost-sensitive binary classification problem as a density ratio estimation problem, and propose a method of Density Ratio estimation for PU learning (DRPU). All proofs are given in Appendix A.

3.1 Excess risk bounds

From Bayes’ rule, we have \(p(Y=+1 \mid X=x) = \pi p_+(x)/p(x) = \pi r^*(x)\). Therefore, the optimal solution of the Bregman divergence minimization, \(r = r^*\), gives a Bayes optimal classifier by thresholding \(p(Y=+1 \mid X=x)\) at 1/2, which motivates us to use r for binary classification. However, this statement only covers the optimal solution, and it is unclear how the classification risk grows with the Bregman divergence. To cope with this problem, we interpret DRE as minimization of an upper bound of the excess classification risk. Although the relationship between DRE and class probability estimation has been studied by Menon and Ong (2016), our work differs in that it focuses on the ratio of the densities of positive and unlabeled data, so that the value of the Bregman divergence does not depend on the class-prior \(\pi \).

We denote the Bayes optimal risk as \(R_{\pi , c}^* = \inf _{g \in \mathcal {F}} R_{\pi , c}(g)\), where \(\mathcal {F}\) is the set of all measurable functions from \(\mathbb {R}^d\) to \(\mathbb {R}\), and the difference \(R_{\pi , c}(g) - R_{\pi , c}^*\) is called the excess risk for \(R_{\pi , c}\). The following theorem associates DRE with cost-sensitive classification under a strong convexity assumption on f, justifying solving binary classification via DRE.

Theorem 1

Let f be a \(\mu \)-strongly convex function, i.e., \(\mu = \inf _{t \in [0, \infty )}f^{\prime \prime }(t) > 0\). Then, for any \(\pi \in (0, 1)\), \(c\in (0, 1)\), \(r : \mathbb {R}^d \rightarrow [0, \infty )\), and \(h_\theta = \mathrm {sign}(r - \theta )\), we have

$$\begin{aligned} R_{\pi , c}(h_\theta ) - R_{\pi , c}^* \le \pi \sqrt{\frac{2}{\mu } \mathrm {BR}_{f} \left( {r^*} \parallel {r} \right) }, \end{aligned}$$
(21)

where \(\theta = c/ \pi \).

As we have already seen in Sect. 2.3, the class-prior shift problem can be transformed into a cost-sensitive classification problem. Next, we extend the excess risk bound in Theorem 1 to the case of prior shift. Let \(\pi ^\prime \) be the test-time class-prior and \(c^\prime \) be the test-time false positive cost. Note that \(c^\prime = 1/2\) corresponds to standard binary classification. The classification risk with respect to the test distribution is defined as

$$\begin{aligned} \begin{aligned} R_{\pi ^\prime , c^\prime }(g)&= (1 - c^\prime )\pi ^\prime \mathbb {E}_{\mathrm {P}} \left[ 1\{\mathrm {sign}(g(X)) \ne +1 \} \right] \\&\quad + c^\prime (1 - \pi ^\prime ) \mathbb {E}_{\mathrm {N}} \left[ 1\{\mathrm {sign}(g(X)) \ne -1 \} \right] . \end{aligned} \end{aligned}$$
(22)

The following theorem gives an excess risk bound for \(R_{\pi ^\prime , c^\prime }\).

Theorem 2

Let f be a \(\mu \)-strongly convex function. Then, for any \(\pi , \pi ^\prime \in (0, 1)\), \(c^\prime \in (0, 1)\), \(r : \mathbb {R}^d \rightarrow [0, \infty )\), and \(h_{\theta } = \mathrm {sign}(r - \theta )\), we have

$$\begin{aligned} R_{\pi ^\prime , c^\prime }(h_{\theta }) - R_{\pi ^\prime , c^\prime }^* \le C \sqrt{\frac{2}{\mu }\mathrm {BR}_{f} \left( {r^*} \parallel {r} \right) }, \end{aligned}$$
(23)

where \(c_0 = \frac{c^\prime \pi (1 - \pi ^\prime )}{(1 - c^\prime )(1 - \pi )\pi ^\prime + c^\prime \pi (1 - \pi ^\prime )}\), \(\theta = c_0 / \pi \), and \(C = \pi \frac{c^\prime + \pi ^\prime - 2c^\prime \pi ^\prime }{c_0 + \pi - 2c_0 \pi }\).

Note that this is a generalized version of Theorem 1, which is the case of \(\pi ^\prime = \pi \) and \(c^\prime = c\). This result shows that even when the class-prior and cost are shifted, by changing the classification threshold to \(c_0\), the classification risk can still be bounded by the Bregman divergence of DRE.
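The adjusted threshold \(\theta = c_0 / \pi\) of Theorem 2 is a closed-form function of the priors and the test-time cost, so adaptation requires no retraining. A small sketch (the function name is ours):

```python
def shifted_threshold(pi_train, pi_test, c_test=0.5):
    """Classification threshold theta = c0 / pi of Theorem 2 for the
    classifier h(x) = sign(r(x) - theta) under test-time prior shift."""
    num = c_test * pi_train * (1.0 - pi_test)
    c0 = num / ((1.0 - c_test) * (1.0 - pi_train) * pi_test + num)
    return c0 / pi_train
```

With \(\pi^\prime = \pi\) and \(c^\prime = 1/2\) this reduces to \(1/(2\pi)\), i.e., thresholding \(\pi r(x)\) at 1/2; a larger test-time prior lowers the threshold, accepting more points as positive.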

3.2 Estimating the class-priors

Although we can train a model r and deal with a prior shift problem without the knowledge of the class-priors, we still need to estimate them to determine the classification threshold \(c_0 / \pi \). Based on Proposition 1, we propose the following estimator of \(\pi \):

$$\begin{aligned} \hat{\pi }(r) = \inf _{h \in \mathcal {H}_r} \frac{\widehat{P}(h)}{\widehat{P}_+(h)}, \end{aligned}$$
(24)

where

$$\begin{aligned} \begin{aligned} \widehat{P}(h)&= \widehat{\mathbb {E}}_{\mathrm {U}} \left[ 1\{h(X) = +1\} \right] \\ \widehat{P}_+(h)&= \widehat{\mathbb {E}}_{\mathrm {P}} \left[ 1\{h(X) = +1\} \right] \end{aligned} \end{aligned}$$
(25)

and

$$\begin{aligned} \mathcal {H}_r = \big \{ h : \mathbb {R}^d \rightarrow \{\pm \, 1\} \mid \exists \theta \in \mathbb {R}, h(x) = \mathrm {sign}(r(x) - \theta ) \wedge \widehat{P}_+(h) > \bar{\gamma } \big \}. \end{aligned}$$
(26)

Here,

$$\begin{aligned} \bar{\gamma } = \frac{1}{\gamma } \mathrm {max} (\varepsilon (n_\mathrm {P}, 1/n_\mathrm {P}), \varepsilon (n_\mathrm {U}, 1/n_\mathrm {U})) \end{aligned}$$
(27)

and,

$$\begin{aligned} \varepsilon (n, \delta ) = \sqrt{\frac{4 \log (en / 2)}{n}} + \sqrt{\frac{\log (2 / \delta )}{2n}} \end{aligned}$$
(28)

for \(n > 0\) and \(0< \delta < 1\), where \(0< \gamma < 1\) is an arbitrarily fixed constant. The main difference from the estimators proposed by Blanchard et al. (2010) and Scott et al. (2013) is that the hypothesis h is obtained by thresholding the trained density ratio model r, so no additional training of a model is needed.
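A direct implementation of the estimator in Eqs. (24)-(28) scans the thresholds induced by the trained model's outputs. The sketch below is ours; it uses a brute-force scan over all observed score values:

```python
import numpy as np

def epsilon(n, delta):
    # The deviation term of Eq. (28)
    return (np.sqrt(4.0 * np.log(np.e * n / 2.0) / n)
            + np.sqrt(np.log(2.0 / delta) / (2.0 * n)))

def estimate_prior(r_pos, r_unl, gamma=0.5):
    """Class-prior estimator of Eq. (24): take the minimum ratio of
    empirical positive rates over thresholded hypotheses h = sign(r - theta),
    restricted to h with P_+(h) greater than gamma_bar (Eq. 27)."""
    n_p, n_u = len(r_pos), len(r_unl)
    gamma_bar = max(epsilon(n_p, 1.0 / n_p), epsilon(n_u, 1.0 / n_u)) / gamma
    best = 1.0  # the trivial hypothesis h = +1 gives ratio 1
    for theta in np.unique(np.concatenate([r_pos, r_unl])):
        p_plus = np.mean(r_pos > theta)   # empirical P_+(h)
        if p_plus > gamma_bar:
            best = min(best, np.mean(r_unl > theta) / p_plus)
    return best
```

In the test below the unlabeled scores are a synthetic mixture with true prior 0.3, which the scan recovers approximately.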

To consider convergence of the proposed estimator, we introduce the concept of the Area Under the receiver operating characteristic Curve (AUC), which is a criterion to measure the performance of a score function for bipartite ranking (Menon & Williamson, 2016). For any real-valued score function \(s : \mathbb {R}^d \rightarrow \mathbb {R}\), AUC is defined as

$$\begin{aligned} \mathrm {AUC} (s) = \mathbb {E} \left[ 1\{(Y - Y^\prime )(s(X) - s(X^\prime )) > 0\} \mid Y \ne Y^\prime \right] , \end{aligned}$$
(29)

where the expectation is taken over X, \(X^\prime \), Y, and \(Y^\prime \). In addition, we define the AUC risk and the optimal AUC risk as \(R_\mathrm {AUC}(r) = 1 - \mathrm {AUC}(r)\) and \(R_\mathrm {AUC}^* = \inf _{s \in \mathcal {F}}R_\mathrm {AUC}(s)\), where \(\mathcal {F}\) is the set of all measurable functions from \(\mathbb {R}^d\) to \(\mathbb {R}\). Then, the following theorem gives a convergence guarantee of the estimator.

Theorem 3

For \(\hat{\pi }(r)\) defined by Eq. (24), with probability at least \((1 - 1/n_\mathrm {P})(1 - 1/n_\mathrm {U})\), we have

$$\begin{aligned} \left|\hat{\pi }(r) - \pi \right| \le \xi \left( R_\mathrm {AUC}(r) - R_\mathrm {AUC}^* \right) + \mathcal {O} \left( \sqrt{\frac{\log n_\mathrm {P}}{n_\mathrm {P}}} + \sqrt{\frac{\log n_\mathrm {U}}{n_\mathrm {U}}} \right) . \end{aligned}$$
(30)

Here, \(\xi \) is an increasing function such that

$$\begin{aligned} \xi \left( R_\mathrm {AUC}(r) - R_\mathrm {AUC}^* \right) \le \frac{2(1 - \pi )}{1 - \bar{\gamma }^2} R_\mathrm {AUC}(r), \end{aligned}$$
(31)

and \(\xi (0) \rightarrow 0\) as \(\bar{\gamma } \rightarrow 0\).

This result shows that a better score function in the sense of AUC tends to give a better estimate of \(\pi \). Furthermore, the scale of r is not important for the estimator \(\hat{\pi }(r)\); r just needs to be a good score function, and therefore it can be used to estimate not only \(\pi \) but also \(\pi ^\prime \). Given a sample set \(\mathcal {X}_\mathrm {U}^\prime \) from the test density \(p^\prime (x) = \pi ^\prime p_+(x) + (1 - \pi ^\prime ) p_-(x)\), we propose the following estimator of the test prior \(\pi ^\prime \):

$$\begin{aligned} \hat{\pi }^\prime (r) = \inf _{h \in \mathcal {H}_r} \frac{\widehat{P}^\prime (h)}{\widehat{P}_+(h)}, \end{aligned}$$
(32)

where \(\widehat{P}^\prime (h) = \widehat{\mathbb {E}}_{\mathrm {U}^\prime } \left[ 1\{h(X) = +1\} \right] \). Replacing \(\pi \) by \(\pi ^\prime \), \(\hat{\pi }\) by \(\hat{\pi }^\prime \), and \(n_\mathrm {U}\) by \(n_\mathrm {U}^\prime = \left|\mathcal {X}_\mathrm {U}^\prime \right|\) in Theorem 3, we obtain an error bound for \(\hat{\pi }^\prime \).

Eq. (32) requires samples from the positive class-conditional distribution and from the test-time input marginal distribution. As described in Sects. 1 and 2, we sometimes do not have access to the training data at test time. Fortunately, Eq. (32) only needs the value of \(\widehat{P}_+(h)\) for each h, not the samples themselves. Moreover, \(\widehat{P}_+(h)\) takes ascending piecewise-constant values from 0 to 1 with step \(1/n_\mathrm {P}\). Hence, by preserving at training time the list of intervals \(\{\Theta _i\}_{i=0}^{n_\mathrm {P}}\) such that \(\widehat{P}_+(\mathrm {sign}(r(X)-\theta )) = i/n_\mathrm {P}\) for all \(\theta \in \Theta _i\), we can estimate the test-time class-prior without preserving the training data themselves.
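The interval-list idea can be sketched as follows: at training time we keep only the sorted positive scores, whose order statistics delimit the intervals \(\Theta_i\); at test time we evaluate Eq. (32) from that summary alone. Function names are ours, the scores are assumed distinct, and \(\bar{\gamma}\) is passed in precomputed as in Eq. (27):

```python
import numpy as np

def positive_score_summary(r_pos):
    """Training time: keep only the sorted scores r(x) on X_P; the i-th
    order statistic delimits the interval where P_+ equals i / n_P."""
    return np.sort(r_pos)

def estimate_test_prior(summary, r_test_unl, gamma_bar):
    """Test time: estimate pi' (Eq. 32) from the preserved summary and
    the test-time unlabeled scores, without the training data."""
    n_p = len(summary)
    best = 1.0  # the trivial hypothesis h = +1 gives ratio 1
    for i, theta in enumerate(summary):
        # number of positive scores strictly above summary[i],
        # assuming distinct scores
        p_plus = (n_p - i - 1) / n_p
        if p_plus > gamma_bar:
            best = min(best, np.mean(r_test_unl > theta) / p_plus)
    return best
```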

3.3 Practical implementation

The entire flow of our proposed method is described in Algorithm 1. Since strong convexity of the generator f of the Bregman divergence is desired, we may employ the quadratic function \(f(t) = t^2 / 2\). In this case, the DRE method is called Least-Squares Importance Fitting (LSIF) (Kanamori et al., 2009). As a parametric model r, a deep neural network may be used and optimized by stochastic gradient descent. Details of the stochastic optimization method for \(\widetilde{\mathcal {L}}_f\) are described in Algorithm 2. In the prior estimation step, it is recommended to use data that were not used in the training step, to avoid overfitting, especially with flexible models. Hence, we split the given data into training and validation sets, and use the validation set to tune hyperparameters and estimate the class-priors.

(Algorithm 1 and Algorithm 2 appear here as figures: the overall flow of the proposed method and the stochastic optimization procedure for \(\widetilde{\mathcal {L}}_f\), respectively.)
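To illustrate the LSIF instance end to end, the following sketch trains a linear-in-parameters ratio model \(r(x) = \max(w^\top x + b, 0)\) by full-batch gradient descent on the corrected objective, flipping the gradient of the clipped term when it goes negative, in the spirit of the non-negative correction (Kiryo et al., 2017). The model, the optimizer, and all names are our simplifications; the paper uses deep networks trained by minibatch SGD:

```python
import numpy as np

def train_lsif_drpu(x_pos, x_unl, alpha=0.1, lr=0.05, epochs=200, seed=0):
    """Full-batch gradient descent for the corrected LSIF objective
    (f(t) = t^2 / 2, hence F(t) = t^2 / 2) with a linear ratio model
    r(x) = max(w.x + b, 0). When the clipped term of Eq. (20) goes
    negative, the sign of its gradient is flipped."""
    rng = np.random.default_rng(seed)
    w = 0.01 * rng.normal(size=x_pos.shape[1])
    b = 0.5
    for _ in range(epochs):
        z_p, z_u = x_pos @ w + b, x_unl @ w + b
        r_p, r_u = np.maximum(z_p, 0.0), np.maximum(z_u, 0.0)
        m_p, m_u = (z_p > 0).astype(float), (z_u > 0).astype(float)  # ReLU masks
        gap = np.mean(r_u ** 2) / 2 - alpha * np.mean(r_p ** 2) / 2
        sign = 1.0 if gap >= 0 else -1.0   # ascend on a negative gap
        # per-sample dL/dz, chained through the ReLU masks
        g_p = (-1.0 + (1.0 - sign) * alpha * r_p) * m_p / len(z_p)
        g_u = sign * r_u * m_u / len(z_u)
        w -= lr * (x_pos.T @ g_p + x_unl.T @ g_u)
        b -= lr * (g_p.sum() + g_u.sum())
    return w, b
```

After training, the classifier of Sect. 3.2 is obtained by thresholding r at the estimated \(c_0 / \hat{\pi}\); on well-separated synthetic data the learned ratio is high on the positive region and near zero on the negative region.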

4 Discussions

In this section, we provide further theoretical analysis and compare the convergence rate of the proposed method to that of unbiased/non-negative PU classification.

4.1 Selection of the Bregman generator function

Theorem 2 requires the Bregman generator function f to be strongly convex. We used a quadratic function in the proposed method; nevertheless, other choices of f are possible. The following proposition shows that the tightest excess risk bound is achieved by a quadratic function.

Proposition 2

Let f be a strongly convex function and \(\mu = \inf _{t \in [0, \infty )} f^{\prime \prime }(t)\). Then the quadratic function \(f_{\mathrm {S}}(t) = \mu t^2 / 2\) satisfies

$$\begin{aligned} \mathrm {BR}_{f_\mathrm {S}} \left( {r^*} \parallel {r} \right) \le \mathrm {BR}_{f} \left( {r^*} \parallel {r} \right) . \end{aligned}$$
(33)

Furthermore, Bregman divergence with respect to the quadratic function can be related to the excess classification risk with respect to the squared loss function. Let us denote the classification risk w.r.t. the squared loss as

$$\begin{aligned} R_\mathrm {sq}(g) = \mathbb {E}_{X, Y} \left[ \frac{1}{4}(Yg(X) - 1)^2 \right] , \end{aligned}$$
(34)

where \(g: \mathbb {R}^d \rightarrow [-1, 1]\), and the optimal risk as

$$\begin{aligned} R_\mathrm {sq}^* = \inf _g R_\mathrm {sq}(g) = R_\mathrm {sq}(2\eta - 1), \end{aligned}$$
(35)

where \(\eta (x) = P(Y = +1 \mid X = x)\). Then, the Bregman divergence is decomposed into the excess risk w.r.t. the squared loss and a superfluous term.

Proposition 3

Let \(g_r = 2 \min (\pi r, 1) - 1\) for any \(r: \mathbb {R}^d \rightarrow [0, \infty )\). Then,

$$\begin{aligned} \frac{2 \pi ^2}{\mu } \mathrm {BR}_{f_\mathrm {S}} \left( {r^*} \parallel {r} \right) = R_\mathrm {sq}(g_r) - R_\mathrm {sq}^* + \chi _r {\mathfrak {D}}_r, \end{aligned}$$
(36)

where \(\chi _r = (1 / \pi ^2) \mathbb {E}_{X \mid \pi r(X) > 1} \left[ (\pi r(X) - 1)(\pi r(X) - 2 \eta (X) + 2) \right] \) and \({\mathfrak {D}}_r = P(\pi r(X) > 1)\).

If r is bounded above by \(1 / \pi \), the superfluous term is canceled and the Bregman divergence corresponds to the excess risk w.r.t. the squared loss, up to the scaling factor.

4.2 Excess risk bound for AUC

Here we consider the relationship between AUC optimization and DRE. The optimal density ratio \(r^*\) is clearly an optimal score function (Menon & Williamson, 2016), and, as in Theorem 1, we can bound the excess AUC risk by the Bregman divergence of DRE as follows:

Theorem 4

Let f be a \(\mu \)-strongly convex function. Then, for any \(r : \mathbb {R}^d \rightarrow [0, \infty )\), we have

$$\begin{aligned} R_\mathrm {AUC}(r) - R_\mathrm {AUC}^* \le \frac{1}{1 - \pi } \sqrt{\frac{2}{\mu } \mathrm {BR}_{f} \left( {r^*} \parallel {r} \right) }. \end{aligned}$$
(37)

Theorem 4 implies that a better estimation of the density ratio in the sense of the Bregman divergence tends to result in a better score function in the sense of AUC.
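For intuition, the AUC risk appearing in Theorems 3 and 4 can be estimated empirically as the fraction of misranked positive-negative pairs in Eq. (29), with ties counted as half. A minimal numpy sketch (the function name is ours):

```python
import numpy as np

def empirical_auc_risk(s_pos, s_neg):
    """Empirical AUC risk 1 - AUC: the fraction of (positive, negative)
    pairs that the score function ranks incorrectly, ties counted as half."""
    diff = s_pos[:, None] - s_neg[None, :]   # all pairwise score differences
    return np.mean((diff < 0) + 0.5 * (diff == 0))
```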

4.3 Excess risk bound with the estimated threshold

Theorem 2 gives an excess risk bound with the optimal threshold. However, in practice, we need to use an estimated threshold. Here we also consider an excess risk bound for that case. Let \(\theta \) be the true classification threshold for \(h = \mathrm {sign}(r - \theta )\), defined as \(\theta = c_0 / \pi \), and \(\hat{\theta }\) be the empirical version of \(\theta \), obtained from \(\hat{\pi }(r)\) and \(\hat{\pi }^\prime (r)\). Then, we have the following excess risk bound.

Theorem 5

Let f be a \(\mu \)-strongly convex function and \(\theta = c_0 / \pi \) where \(c_0\) is defined in Theorem 2. Then for \(\hat{\theta } \in (0, 1)\) and \(h_{\hat{\theta }} = \mathrm {sign}(r - \hat{\theta })\), we have

$$\begin{aligned} R_{\pi ^\prime , c^\prime }(h_{\hat{\theta }}) - R_{\pi ^\prime , c^\prime }^* \le \left( C + \pi ^\prime \omega _{\hat{\theta }} \right) \sqrt{\frac{2}{\mu } \mathrm {BR}_{f} \left( {r^*} \parallel {r} \right) } + \pi ^\prime \left|\hat{\theta } - \theta \right|, \end{aligned}$$
(38)

where C is defined in Theorem 2 and \(\omega _{\hat{\theta }}\) is a constant such that \(0 \le \omega _{\hat{\theta }} \le 1\) and \(\omega _{\theta } = 0\).

Theorem 5 reduces to Theorem 2 when \(\hat{\theta } = \theta \). We can also prove that the estimation error of the threshold decays at the linear order of the estimation error of the class-priors as follows:

Proposition 4

Let \(\hat{\pi }\), \(\hat{\pi }^\prime \) be estimated class-priors and \(\hat{\theta }\) be an estimated threshold by \(\hat{\pi }\), \(\hat{\pi }^\prime \). Then,

$$\begin{aligned} \left|\hat{\theta } - \theta \right| \le {\mathcal {O}}\left( \left|\hat{\pi } - \pi \right| + \left|\hat{\pi }^\prime - \pi ^\prime \right| \right) \quad \mathrm {as} \ \left|\hat{\pi } - \pi \right|, \left|\hat{\pi }^\prime - \pi ^\prime \right| \rightarrow 0. \end{aligned}$$
(39)

Combining Theorem 5 and Proposition 4, we can see that the excess risk decays at the linear order of the estimation error of the class-priors.
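The plug-in construction behind Theorem 5 is simple to state in code. Below is a minimal sketch (assuming the constant \(c_0\) of Theorem 2, which is not reproduced in this excerpt, is available; function names are ours):

```python
import numpy as np

def plugin_threshold(c0, pi_hat):
    """theta_hat = c0 / pi_hat: plug an estimated class-prior into the
    true threshold theta = c0 / pi."""
    return c0 / pi_hat

def threshold_classifier(r_x, theta_hat):
    """h_theta(x) = sign(r(x) - theta_hat), taking sign(0) as +1."""
    return np.where(r_x >= theta_hat, 1, -1)
```

By Theorem 5 and Proposition 4, the extra excess risk incurred by using `plugin_threshold` instead of the true \(\theta \) is at most linear in the class-prior estimation errors.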

4.4 Convergence rate comparison to unbiased/non-negative PU classification

From the above results and the theoretical analysis of non-negative Bregman divergence minimization provided by Kato and Teshima (2021), we can derive the convergence rate of our proposed method. Let \({\mathcal {H}}\) be a hypothesis space of density ratio models \(r : \mathbb {R}^d \rightarrow [0, \infty )\) and let us denote the minimizer of the empirical risk by \({\hat{r}} = \mathrm {argmin}_{r\in {\mathcal {H}}} \widetilde{{\mathcal {L}}}_f(r)\), where \({\mathcal {L}}_f\) and \(\widetilde{{\mathcal {L}}}_f\) are defined in Sect. 3.3. Theorem 1 in Kato and Teshima (2021) states that if f satisfies some appropriate conditions and the Rademacher complexity of \({\mathcal {H}}\) decays at \({\mathcal {O}}(1/\sqrt{n})\) w.r.t. the sample size n, as holds, for example, for linear-in-parameter models with a bounded norm or neural networks with a bounded Frobenius norm (Golowich et al., 2018; Lu et al., 2020), then the estimation error \({\mathcal {L}}_f({\hat{r}}) - \inf _{r \in {\mathcal {H}}} {\mathcal {L}}_f(r)\) decays at \({\mathcal {O}}(1/\sqrt{n_\mathrm {P}} + 1/\sqrt{n_\mathrm {U}})\) with high probability. Applying this to Theorem 5 and Proposition 4, the following corollary is obtained.

Corollary 1

Let f be a \(\mu \)-strongly convex function and satisfy Assumption 3 in Kato and Teshima (2021). Then, for \({\hat{h}}_{\hat{\theta }} = \mathrm {sign}({\hat{r}} - \hat{\theta })\), with probability at least \(1 - \delta \), we have

$$\begin{aligned} R_{\pi ^\prime , c^\prime }({\hat{h}}_{\hat{\theta }}) - R_{\pi ^\prime , c^\prime }^* \le A_{\mathcal {H}} + D_\delta \cdot {\mathcal {O}} \left( \frac{1}{n_\mathrm {P}^{1/4}} + \frac{1}{n_\mathrm {U}^{1/4}} \right) + {\mathcal {O}} \left( \left|\hat{\pi } - \pi \right| + \left|\hat{\pi }^\prime - \pi ^\prime \right| \right) , \end{aligned}$$
(40)

where \(A_{\mathcal {H}} = (C + \pi ^\prime ) \sqrt{\frac{2}{\mu } \left( \inf _{r \in {\mathcal {H}}}{\mathcal {L}}_f(r) - {\mathcal {L}}_f(r^*) \right) }\) with the constant C defined in Theorem 2 and \(D_\delta = \sqrt{\log (1 / \delta )}\).

For comparison, we consider the convergence of the excess risk based on the theoretical analysis of unbiased/non-negative PU classification provided by Kiryo et al. (2017) and the properties of classification-calibrated loss functions provided by Bartlett et al. (2006) and Scott (2012). Let \({\mathcal {G}}\) be a hypothesis space of decision functions \(g : \mathbb {R}^d \rightarrow \mathbb {R}\) and let us denote the minimizer of the empirical risk by \({\hat{g}} = \mathrm {argmin}_{g \in {\mathcal {G}}}{\widetilde{R}}_\ell (g)\), where \(R_\ell \) and \({\widetilde{R}}_\ell \) are defined in Sect. 2. Assume that the loss function \(\ell \) satisfies some appropriate conditions; then, if the Rademacher complexity of \({\mathcal {G}}\) decays at \({\mathcal {O}}(1/\sqrt{n})\), the estimation error \(R_\ell ({\hat{g}}) - \inf _{g \in {\mathcal {G}}} R_\ell (g)\) decays at \({\mathcal {O}}(1/\sqrt{n_\mathrm {P}} + 1/\sqrt{n_\mathrm {U}})\) with high probability. In addition, if \(\ell \) is classification calibrated (Bartlett et al., 2006; Scott, 2012), there exists a strictly increasing function \(\psi \) such that the excess risk w.r.t. the zero-one loss is bounded above by \(\psi ^{-1}\) applied to the surrogate excess risk. That is, with probability at least \(1 - \delta \), we have

$$\begin{aligned} R_{\pi , c}({\hat{g}}) - R_{\pi , c}^* \le \psi ^{-1} \left( A_{\mathcal {G}} + D_\delta \cdot {\mathcal {O}} \left( \frac{1}{\sqrt{n_\mathrm {P}}} + \frac{1}{\sqrt{n_\mathrm {U}}} \right) \right) , \end{aligned}$$
(41)

where \(A_{\mathcal {G}} = \inf _{g \in {\mathcal {G}}} R_\ell (g) - R_\ell ^*\) and \(D_\delta = \sqrt{\log (1 / \delta )}\).

For specific loss functions such as the hinge loss or the sigmoid loss, \(\psi \) is the identity function (Bartlett et al., 2006; Steinwart, 2007), hence the convergence rate of unbiased/non-negative PU classification would be faster than that of the density ratio estimation approach for PU classification (DRPU). This result is intuitively reasonable, because a method solving a specific problem tends to have better performance than a method solving more general problems (Vapnik, 1995). That is, the hinge loss and sigmoid loss are not proper losses in the context of class-posterior probability estimation (Buja et al., 2005; Reid & Williamson, 2009), and risk minimization with respect to these losses allows one to bypass the estimation of the posterior probability and obtain a classifier directly, while DRE does not.

Based on the above discussion, nnPU should be chosen when the class-prior of the training data is known and it is assured that no class-prior shift occurs in the test phase. In other cases, DRPU could be a better choice for solving PU classification more stably. Note also that nnPU with the sigmoid loss or the hinge loss does not provide class-posterior probabilities.

5 Experiments

In this section, we report our experimental results. All the experiments were done with PyTorch (Paszke et al., 2019).

5.1 Test with synthetic data

We conducted experiments with synthetic data to confirm the effectiveness of the proposed method via numerical visualization. Firstly, we define \(p_+(x) = \mathcal {N}(+1, 1)\) and \(p_-(x) = \mathcal {N}(-1, 1)\) where \(\mathcal {N}(\mu , \sigma ^2)\) denotes the univariate Gaussian distribution with mean \(\mu \) and variance \(\sigma ^2\), and \(p(x) = \pi p_+(x) + (1 - \pi ) p_-(x)\). We generated samples from \(p_+(x)\) and p(x) with \(\pi = 0.4\) for training data, and from p(x) with \(\pi ^\prime = 0.6\) for test data.

The training dataset contained 200 positively labeled samples and 1000 unlabeled samples, and the validation dataset contained 100 positively labeled samples and 500 unlabeled samples. The test dataset consisted of 1000 samples. As a parametric model, a linear-in-parameter model with Gaussian basis functions \(\varphi _i(x) = \exp (-(x - x_i)^2 / 2)\), where \(\{x_1, \ldots , x_{n_\mathrm {U}}\} = \mathcal {X}_\mathrm {U}\), was used. Adam with momentum parameters \(\beta _1 = 0.5\), \(\beta _2 = 0.999\) and \(\ell _2\) regularization parameter 0.1 was used as the optimizer. Training was performed for 200 epochs with the batch size 200 and the learning rate \(2 \times 10^{-5}\).
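The synthetic setup above can be reproduced in a few lines of numpy (a sketch with our own seed and helper name; sample sizes and priors follow the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(n, pi, rng):
    """Draw n points from p(x) = pi*N(+1,1) + (1-pi)*N(-1,1)."""
    is_pos = rng.random(n) < pi
    return rng.normal(np.where(is_pos, 1.0, -1.0), 1.0)

x_p = rng.normal(1.0, 1.0, size=200)      # 200 labeled positives ~ N(+1, 1)
x_u = sample_mixture(1000, 0.4, rng)      # 1000 unlabeled, training prior 0.4
x_test = sample_mixture(1000, 0.6, rng)   # test data with shifted prior 0.6
```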

We ran experiments with unbiased PU learning (uPU) (Plessis et al., 2014, 2015) with the logistic loss and our proposed method (DRPU) with LSIF. In uPU, the class-prior of the training data was estimated by KM2 (Ramaswamy et al., 2016), and for DRPU, the test unlabeled dataset was used to estimate the test prior \(\pi ^\prime \). The left-hand side of Fig. 1 shows the obtained classification boundaries; the boundary of DRPU was closer to the optimal one than that of uPU.
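For reference, the classical LSIF objective with a linear-in-parameter model admits an analytic solution. The sketch below (our own; the experiment above instead optimized with Adam, and the non-negative Bregman correction is omitted) fits such a model with Gaussian basis functions:

```python
import numpy as np

def gaussian_design(x, centers, sigma=1.0):
    """Design matrix Phi[i, j] = exp(-(x_i - c_j)^2 / (2 sigma^2))."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * sigma**2))

def lsif_fit(x_p, x_u, centers, lam=0.1, sigma=1.0):
    """Minimize (1/2) E_U[r(x)^2] - E_P[r(x)] + (lam/2)||alpha||^2 over
    r(x) = Phi(x) @ alpha; the minimizer solves (H + lam*I) alpha = h."""
    Phi_u = gaussian_design(x_u, centers, sigma)
    H = Phi_u.T @ Phi_u / len(x_u)                       # E_U[phi phi^T]
    h = gaussian_design(x_p, centers, sigma).mean(axis=0)  # E_P[phi]
    alpha = np.linalg.solve(H + lam * np.eye(len(centers)), h)
    # clip at zero so the fitted ratio stays non-negative
    return lambda x: np.maximum(gaussian_design(x, centers, sigma) @ alpha, 0.0)

rng = np.random.default_rng(1)
x_p = rng.normal(1.0, 1.0, 200)                       # positives ~ N(+1, 1)
labels = rng.random(1000) < 0.4                       # training prior 0.4
x_u = rng.normal(np.where(labels, 1.0, -1.0), 1.0)    # unlabeled mixture
r_hat = lsif_fit(x_p, x_u, centers=x_u[:50])
```

The fitted \({\hat{r}}\) should be large where \(p_+\) dominates p (around \(x = +1\)) and small on the negative side, which is all the downstream classifier needs after thresholding.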

Secondly, we tested the case where the irreducibility assumption does not hold. Let \(p_+(x) = 0.8 \mathcal {N}(+1, 1) + 0.2 \mathcal {N}(-1, 1)\), \(p_-(x) = 0.2 \mathcal {N}(+1, 1) + 0.8 \mathcal {N}(-1, 1)\), and set the training prior \(\pi = 0.6\) and the test prior \(\pi ^\prime = 0.4\). The result is illustrated in the right-hand side of Fig. 1. Class-prior estimation by KM2 was inaccurate since the irreducibility assumption did not hold, and thus uPU incurred a large error. DRPU also gave inaccurate estimates of the class-priors, but since they do not affect the training step, the influence of the estimation error was relatively mitigated.

Fig. 1

Visualized classification boundaries of uPU and DRPU, averaged over 10 trials. Each vertical line is a classification boundary and each colored area shows the standard deviation. “Oracle” is the optimal classification boundary. The red curve shows the probability density \(p_+\) scaled by \(\pi ^\prime \), and the blue curve shows \(p_-\) scaled by \(1 - \pi ^\prime \). The upper graph corresponds to the case where the irreducibility assumption holds, while the lower one does not (Color figure online)

5.2 Test with benchmark data

Table 2 The means and standard deviations of the classification accuracy in percent on benchmark datasets over 10 trials. “Train” and “Test” denote the class-priors of the training and test data respectively. “Avg” is the averaged accuracy of the four results with different priors. The best results with respect to the one-sided t-test at the significance level 0.05 are highlighted in boldface (for the “Avg” case, just picking the highest one)
Fig. 2

The means and standard deviations of the classification error as functions of the training epoch

We also measured the performances of nnPU (Kiryo et al., 2017), PUa (Charoenphakdee & Sugiyama, 2019), VPU (Chen et al., 2020), and DRPU (the proposed method) on MNIST (Lecun et al., 1998), Fashion-MNIST (Xiao et al., 2017), Kuzushiji-MNIST (Lamb et al., 2018), and CIFAR-10 (Krizhevsky, 2012). Here we summarize the descriptions of the datasets and the training settings.

  • MNIST (Lecun et al., 1998) is a gray-scale 28 \(\times \) 28 image dataset of handwritten digits from 0 to 9, which contains 60000 training samples and 10000 test samples. Since it has 10 classes, we treated the even digits as the positive class and the odd digits as the negative class. We prepared 2500 positively labeled (P) samples and 50000 unlabeled (U) samples as the training data, and 500 P samples and 10000 U samples as the validation data. The test dataset was made up of 5000 samples for each of the test distributions with different class-priors. As a parametric model, a 5-layer MLP : 784-300-300-300-1 with ReLU activation was used, trained by Adam with default momentum parameters \(\beta _1 = 0.9\), \(\beta _2 = 0.999\) and \(\ell _2\) regularization parameter \(5 \times 10^{-3}\). Training was performed for 50 epochs with the batch size 500. The learning rate was set to \(10^{-4}\) for nnPU/PUa and \(2 \times 10^{-5}\) for VPU/DRPU, and was halved every 20 epochs. In VPU, we set the Mixup hyperparameters as \(\alpha =0.3\) and \(\lambda =2.0\). In DRPU, we set the hyperparameter for non-negative correction as \(\alpha =0.475\).

  • Fashion-MNIST (Xiao et al., 2017) is a gray-scale 28 \(\times \) 28 image dataset of 10 kinds of fashion items, which contains 60000 training samples and 10000 test samples. We treated ‘Pullover’, ‘Dress’, ‘Coat’, ‘Sandal’, ‘Bag’, and ‘Ankle boot’ as the positive class, and ‘T-shirt’, ‘Trouser’, ‘Shirt’, and ‘Sneaker’ as the negative class. We prepared 2500 P samples and 50000 U samples as the training data, and 500 P samples and 10000 U samples as the validation data. The test dataset was made up of 5000 samples for each of the test distributions with different class-priors. As a parametric model, we used a LeNet (Lecun et al., 1998) -based CNN : (1 \(\times \) 32 \(\times \) 32) - C(6, 5 \(\times \) 5, pad=2) - MP(2) - C(16, 5 \(\times \) 5, pad=2) - MP(2) - C(120, 5 \(\times \) 5) - 120 - 84 - 1, where C(c, \(h \times w\), pad=p) means c channels of \(h \times w\) convolutions with zero-padding p (omitted if \(p=0\)) followed by an activation function (ReLU), and MP(k) means \(k \times k\) max pooling. Batch normalization was applied after the first fully-connected layer. The model was trained by Adam, with the same settings as in the case of MNIST. Training was performed for 100 epochs with the batch size 500 and the learning rate \(2 \times 10^{-5}\), which was halved every 20 epochs. In VPU, we set the Mixup hyperparameters as \(\alpha =0.3\) and \(\lambda =0.5\). In DRPU, we set the hyperparameter for non-negative correction as \(\alpha =0.6\).

  • Kuzushiji-MNIST (Lamb et al., 2018) is a gray-scale 28 \(\times \) 28 image dataset of 10 kinds of cursive Japanese characters, which contains 60000 training samples and 10000 test samples. We treated ‘o’, ‘ki’, ‘re’, and ‘wo’ as the positive class, and ‘su’, ‘tsu’, ‘na’, ‘ha’, ‘ma’, and ‘ya’ as the negative class. We prepared 2500 P samples and 50000 U samples as the training data, and 500 P samples and 10000 U samples as the validation data. The test dataset was made up of 5000 samples for each of the test distributions with different class-priors. The model and optimization settings were the same as in the case of Fashion-MNIST. In VPU, we set the Mixup hyperparameters as \(\alpha =0.3\) and \(\lambda =0.5\). In DRPU, we set the hyperparameter for non-negative correction as \(\alpha =0.375\).

  • CIFAR-10 (Krizhevsky, 2012) is a colored 32 \(\times \) 32 image dataset, which contains 50000 training samples and 10000 test samples. We treated ‘airplane’, ‘automobile’, ‘ship’, and ‘truck’ as the positive class, and ‘bird’, ‘cat’, ‘deer’, ‘dog’, ‘frog’, and ‘horse’ as the negative class. We prepared 2500 P samples and 45000 U samples as the training data, and 500 P samples and 5000 U samples as the validation data. The test dataset was made up of 5000 samples for each of the test distributions with different class-priors. As a parametric model, we used the CNN introduced in Springenberg et al. (2015) : (3 \(\times \) 32 \(\times \) 32) - C(96, 5 \(\times \) 5, pad=2) - MP(2 \(\times \) 2) - C(96, 5 \(\times \) 5, pad=2) - MP(2 \(\times \) 2) - C(192, 5 \(\times \) 5, pad=2) - C(192, 3 \(\times \) 3, pad=1) - C(192, 1 \(\times \) 1) - C(10, 1 \(\times \) 1) with ReLU activation. Batch normalization was applied after the max pooling layers and the third, fourth, and fifth convolution layers. The model was trained by Adam, with the same settings as in the case of MNIST. Training was performed for 100 epochs with the batch size 500 and the learning rate \(10^{-5}\), which was halved every 20 epochs. In VPU, we set the Mixup hyperparameters as \(\alpha =0.3\) and \(\lambda =4.0\). In DRPU, we set the hyperparameter for non-negative correction as \(\alpha =0.425\).
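As an illustration of the MNIST model described above, here is a minimal PyTorch sketch of the 784-300-300-300-1 MLP. The output nonlinearity is our assumption: the text only specifies ReLU hidden activations, and a softplus output is one plausible way to keep the ratio model non-negative.

```python
import torch
import torch.nn as nn

class RatioMLP(nn.Module):
    """784-300-300-300-1 MLP with ReLU, as used for MNIST above."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(784, 300), nn.ReLU(),
            nn.Linear(300, 300), nn.ReLU(),
            nn.Linear(300, 300), nn.ReLU(),
            nn.Linear(300, 1), nn.Softplus(),  # keep r(x) >= 0 (our choice)
        )

    def forward(self, x):
        # flatten 28x28 images into 784-dimensional vectors
        return self.net(x.flatten(1))
```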

Table 3 The means and standard deviations of the AUC on benchmark datasets over 10 trials. The best results with respect to the one-sided t-test at the significance level 0.05 are highlighted in boldface

In nnPU, the class-prior of the training data was estimated by KM2 (Ramaswamy et al., 2016). In PUa, we estimated both the training and test priors by KM2, then performed cost-sensitive non-negative PU classification (Charoenphakdee & Sugiyama, 2019). Note that in this setting, PUa needs the unlabeled test dataset at training time to estimate the test prior by KM2, while DRPU needs it only at test time. Moreover, PUa needs to train a model each time the test prior changes. In nnPU and PUa, the sigmoid loss was used as the loss function.

Table 2 shows the results of the experiments. nnPU and PUa unintentionally achieved high accuracy in some cases because of estimation errors of the class-priors, while they performed poorly in the other cases. VPU achieved good results in several cases where the scale of the class-prior shift was small, since it does not need the class-prior in the training phase, but it could not adapt to large class-prior shifts. DRPU outperformed the other methods in almost all cases, and was the most robust to the test-time class-prior shift. Figure 2 gives the classification errors in the experiments. For example, in the case of Fashion-MNIST with \(\pi ^\prime = 0.8\), nnPU and PUa suffered from overfitting due to the estimation error of the training class-prior. Also, as seen in the case of Kuzushiji-MNIST with \(\pi ^\prime = 0.2\), DRPU gave a better result than the other methods and was the most stable, i.e., it had the smallest variance.

In addition, Table 3 reports the AUC values computed in the experiments for each of the methods. The results were picked from the \(\pi ^\prime = 0.6\) case. DRPU performed slightly worse than VPU on MNIST and Fashion-MNIST, while it performed well on Kuzushiji-MNIST and CIFAR-10. Table 4 summarizes the absolute error of the class-prior estimation by KM2 and by our method described in Sect. 3.2. For KM2, we used 2000 positive samples from the training dataset and 2000 unlabeled samples from the test dataset. The inputs were transformed into 50 dimensions by PCA (Jolliffe & Cadima, 2016). For our method, we used 500 positive samples from the validation dataset and 5000 unlabeled samples from the test dataset. It is observed that our class-prior estimation method outperformed KM2 in almost all cases.

Table 4 The means and standard deviations of the absolute error of the class-prior estimation on benchmark datasets over 10 trials. The best results with respect to the one-sided t-test at the significance level 0.05 are highlighted in boldface

5.3 Comparisons under different numbers of labeled samples

To numerically verify the theoretical insight provided in Sect. 4.4, we compared nnPU and DRPU with different sizes of the positively labeled dataset. In this experiment, we assumed that the true class-prior \(\pi \) was known and that no class-prior shift would occur. We ran nnPU and DRPU on MNIST and CIFAR-10, with \(n_\mathrm {P}\in \{500, 1000, 2000, 4000\}\). Note that we skipped the class-prior estimation step of DRPU because the class-priors were given. Figure 3 shows the results of the experiments. On MNIST, the performance of DRPU was comparable to that of nnPU when \(n_\mathrm {P}\) was small, yet it was outperformed for larger \(n_\mathrm {P}\). On CIFAR-10, unlike the MNIST case, the difference in the classification error was larger when \(n_\mathrm {P}\) was smaller. On the whole, nnPU stably outperformed DRPU, and this experimental result supports the theoretical discussion in Sect. 4.4.

Fig. 3

Classification errors of nnPU and DRPU on MNIST and CIFAR-10, averaged over 10 trials for each setting of the number of labeled samples. The vertical bars at each point indicate the standard deviations

6 Conclusions

In this paper, we investigated positive-unlabeled (PU) classification from the perspective of density ratio estimation, and proposed a novel PU classification method based on density ratio estimation. The proposed method does not require the class-priors in the training phase, and it can cope with class-prior shift in the test phase. We provided theoretical analysis of the proposed method and demonstrated its effectiveness in experiments. Extending our work to other weakly-supervised learning problems (Lu et al., 2019; Bao et al., 2018) or to multi-class classification settings (Xu et al., 2017) is a promising direction for future work.