Positive unlabeled learning with tensor networks

Positive unlabeled learning is a binary classification problem with positive and unlabeled data. It is common in domains where negative labels are costly or impossible to obtain, e.g., medicine and personalized advertising. Most approaches to positive unlabeled learning apply to specific data types (e.g., images, categorical data) and cannot generate new positive and negative samples. This work introduces a feature-space distance-based tensor-network approach to the positive unlabeled learning problem. The presented method is not domain-specific and significantly improves the state-of-the-art results on the MNIST image dataset and 15 categorical/mixed datasets. The trained tensor-network model is also a generative model and enables the generation of new positive and negative instances.


Introduction
Positive unlabeled learning (PUL) is a binary classification problem where only some positive samples are labeled, and the remaining positive and all negative ones are unlabeled [1]. This setting is natural in many domains, where the labeling of one class is expensive, laborious, or not possible, e.g., in medicine (disease gene identification [2], identifying protein complexes [3]), drug discovery [4], remote sensing [3], recommender systems [5], personalized advertising [1], and more. The PUL problem is related to machine learning tasks where not all data is labeled, in particular one-class learning and semi-supervised learning. The main difference from the former is that PUL explicitly uses unlabeled data. In contrast, the semi-supervised learning problem assumes some labels for all classes.
Most PUL approaches are applicable to either text data [3,6,7], images [8,9], or categorical data [10]. Methods applicable to text typically rely on metric-based approaches (e.g., cosine distance) that do not apply to categorical data. Similarly, GAN approaches, which are state-of-the-art on many image-related tasks, struggle with categorical datasets. Recently, several methods based on conditional GANs have been developed and applied to categorical tasks [11]. However, they have not been applied to the PUL problem, which would make the training and the architecture significantly more complicated [12]. Similarly, the state-of-the-art approaches on categorical data [13] are in principle applicable also to continuous data but have so far not been tested on images.
In this paper, we introduce a tensor-network approach to PUL. The central part of our model is a tensor network (TN) called a locally purified state (LPS) [14,15]. Tensor networks are widely used in physics to model many-body quantum systems [16,17,18,19]. Recently, they have been adopted for machine learning problems, in particular classification [20,21,22], generative modelling [23], image segmentation [24], anomaly detection [14], and rule learning [21,25]. Although TNs provide competitive results, they rarely achieve the state-of-the-art. A notable exception is anomaly detection with tensor networks [14], which is the basis of our approach. Following [14], we classify samples into the positive/negative class based on a feature-space distance to the reference positive/negative state. The tightness of the models is ensured by minimizing the Frobenius norms of the reference states. Besides classification, our model enables efficient unbiased sampling of positive and negative samples.
Our model is tested on synthetic point datasets, the MNIST image dataset, and 15 categorical/mixed datasets from the UCI machine learning repository. The presented results are the first where a TN model outperforms a state-of-the-art deep neural network (in this case, a generative adversarial network, GAN) on image, categorical, and mixed datasets. The model is also applicable to samples with missing attributes and in a more general semi-supervised setting with an arbitrary number of classes. Besides a new TN approach to PUL, a new metric is introduced, which applies to model selection and hyperparameter tuning with unlabeled data.
Summary of main contributions:
• The anomaly detection model is adapted to the positive unlabeled problem. The modification includes adding a new reference state, modifying the loss terms L1,2,3 to handle labeled and unlabeled data, modifying the Frobenius loss term L4 to balance the norms of the positive and the negative reference states, and adding a term L5 to solve the class collapse problem.
• In contrast to [14], the introduced TN model enables the generation of new positive and negative samples and has a natural probabilistic interpretation of all loss terms.
• A new metric based on the fraction of matching labels between best-performing models is introduced. It is applicable to model selection and hyperparameter tuning with unlabeled data.
• The introduced model significantly improves the positive unlabeled learning state-of-the-art results on the MNIST image and 15 categorical datasets.
The paper is organized as follows. Related work and models are reviewed in Section 2. The model is introduced in Section 3 and the results are presented in Section 4. Finally, conclusions and future research directions are discussed in Section 5.

Related work
Most PUL approaches fall into one of the four categories [1]: two-step techniques, biased learning, class prior incorporation techniques, and generative adversarial networks.
The two-step approaches first identify reliable negative samples and then use (semi-)supervised techniques to train a classifier. To perform the first step, we implicitly assume that close samples are labeled similarly and identify negative samples as unlabeled samples far from any labeled positive sample. We perform this identification with a non-traditional classifier or a particular distance metric, e.g., cosine distance or term frequency-inverse document frequency. Different distance measures apply to different data types and are domain-restricted. Most applications have focused on text [26,6,7], and only a few on categorical data [10] or even mixed (categorical and numeric) data [13]. As discussed in Section 3.6, our approach has some features of a two-stage approach.
Biased learning techniques treat the unlabeled data as negative and assume that the negative labels have a large amount of noise, i.e., many negative labeled samples are positive [1]. We learn from such datasets using traditional binary classification methods but with a higher weight on positive samples. Example approaches in this class are weighted SVM-based methods [3,6] and probabilistic Latent Semantic Analysis based methods [7]. Several methods, e.g., [6], use a higher weight for positive samples in combination with identifying reliable negative samples (a two-stage technique). The presented approach also uses a higher weight for labeled positive samples, thus incorporating some biased learning techniques (see Section 3.3).
The third category, class prior techniques, incorporates knowledge about the labeling mechanism. We do this by adjusting a non-traditional classifier by the label frequency (postprocessing) [3], weighting the data by the label frequency and then training the classifier [27], or changing the learning algorithm to include the label frequency [28]. It is possible to incorporate the class prior into the learning objective of the tensor-network model; however, this remains an open problem for future research.
Finally, generative adversarial network (GAN) techniques have been adopted for the PUL problem. Most of the GAN approaches are two-stage techniques, where we first train a generator of negative (and sometimes also positive) samples and then use it to train a binary classifier on generated negative and labeled positive data [8,9,29]. Recently, a single-stage technique [12] has been proposed, which simultaneously trains the generator and the classifier network. Despite being a generative model, the presented single-stage tensor network method is much simpler than the GAN approaches, can generate positive and negative samples, and applies to categorical datasets.
Some positive unlabeled learning methods (e.g., [30,31]) are based on the idea of co-training [32]. In co-training, we simultaneously train two models on labeled and unlabeled data (typically in a semi-supervised manner). The goal is to obtain two models that predict identical labels. We use a similar idea for model selection and hyperparameter tuning (see Section 3.5). The main difference is that we train the models independently and use the label agreement fraction as a proxy metric for accuracy (a co-labeling metric). Incorporating original co-training ideas into the workflow remains for future research.
We evaluate our approach on the MNIST image and 15 categorical datasets. Accordingly, we use different model sets for comparison/evaluation. We compare the accuracy of the presented model on the MNIST dataset with several GAN approaches:
• Generative positive and unlabeled framework (GenPU) [8]: a series of five neural networks: two generators and three discriminators evaluating the positive, negative, and unlabeled distributions.
• Positive GAN (PGAN) [9]: Uses the original GAN architecture and assumes that unlabeled data are mostly negative samples.
• Divergent GAN (DGAN) [29]: Standard GAN with the addition of a biased positive-unlabeled risk. DGAN can generate only negative samples.
• Conditional generative PU framework (CGenPU) [12]: Uses the auxiliary classifier GAN with a new PU loss discriminating positive and negative samples. The trained generator network is then used to generate training samples for the final binary classifier (two-stage approach).
• Conditional generative PU framework with an auxiliary classifier (CGenPU-AC) [12]: The same as CGenPU, but uses an additional classifier to classify the test data (single-stage approach).
On categorical data, the GAN approaches do not perform well. Therefore, we use the approaches evaluated in [10] and [13] as baselines:
• Positive Naive Bayes (PNB) [30,33,1]: calculates the conditional probability for the negative class by using the prior-weighted difference between the attribute probability and the conditional attribute probability for the positive class.
• Average Positive Naive Bayes (APNB) [30,33,1]: Differs from PNB in estimating prior probability for the negative class. PNB uses the unlabeled set directly, while the APNB estimates the uncertainty with a Beta distribution.
• Positive Tree Augmented Naive Bayes (PTAN) [34]: builds on PNB by adding the information about the conditional mutual information between attributes i and k for structure learning.
• Average Positive Tree Augmented Naive Bayes (APTAN) [34]: Differs from PTAN in estimating the prior probability for the negative class. PTAN uses the unlabeled set directly, while APTAN estimates the uncertainty with a Beta distribution.
• Positive Unlabeled Learning for Categorical datasEt (Pulce) approach [10]: a two-stage PUL approach that uses a trainable distance measure Distance Learning for Categorical Attributes (DILCA) to find reliable negative samples. In the second stage, we use a k-NN classifier to determine the class.
• Generative Positive-Unlabeled (GPU) approach [13,1]: Learns a generative model (typically by using probabilistic graphical models (PGMs)) from labeled positive samples. We determine reliable negative samples as the ones with the lowest probability given by the trained generative model. Finally, we train a binary classifier (typically a support vector machine) on labeled positive and reliable negative samples. This approach has also been extended by aggregating many PGMs in an ensemble [13].
Our main technical tool is a tensor network called the locally purified state (LPS) [15], recently applied to the related anomaly detection problem [14]. We use the model in [14] as a starting point and extend it by considering two tensor networks and adapting the loss to the positive unlabeled setting. We also introduce an additional loss term necessary to avoid class collapse. Finally, we use the trained models to generate new positive and negative samples, which has not been discussed in [14].
The most common application of tensor networks in machine learning is as a linear model in the exponentially large feature space [20,21,22,23,24,14]. Recently, a deep tensor network with arbitrary nonlinearities has been proposed [21] and applied to study the grokking phenomenon [36,25]. Another important application of tensor networks in machine learning is the compression of large deep-network models [37,38]. Finally, tensor networks are used in big data applications [39], e.g., dimensionality reduction [40], dynamically weighted directed networks [41,42], and tensor completion [43].

Model
In this section, we will first provide a concise overview of the model, which we shall present in detail in the following two subsections. Then, we will discuss the generation of new positive and negative samples. Finally, we will discuss the model selection and hyperparameter tuning in the positive unlabeled setting.

Overview
Our model is a tensor-network kernel method inspired by the tensor-network anomaly detection model [14]. We show a schematic representation of the model in Fig. 1. We first embed the raw inputs x of size N with an embedding function/layer Φ onto a unit sphere in an exponentially large vector space V ⊂ R^(d^N). The dimension d is an embedding parameter. Since the embedded space is high-dimensional, we separate the data by projections P_{p,n} onto a positive and a negative subspace W_{p,n}. Hence, we transform the inputs in two ways as ŷ_{p,n}(x) = P_{p,n} Φ(x). The positive map ŷ_p "projects" positive instances onto a hypersphere of radius e^{μ_0} centered at the origin. Negative instances have a large overlap with the kernel of the positive map ŷ_p and are mapped close to the origin (the center of the positive hypersphere). Similarly, the negative map ŷ_n "projects" negative instances onto a hypersphere of radius e^{μ_0} centered at the origin. Positive instances have a large overlap with the kernel of the negative map ŷ_n and are mapped close to the origin (the center of the negative hypersphere). An instance is recognized as positive if the norm of its positive projection ŷ_p is larger than the norm of its negative projection ŷ_n, namely ||ŷ_p(x)||² > ||ŷ_n(x)||². We denote actual positive samples with x_p and actual negative samples with x_n and concisely write the action of the model as

||ŷ_p(x_p)||² > ||ŷ_n(x_p)||²,  ||ŷ_p(x_n)||² < ||ŷ_n(x_n)||².  (1)

The positive and negative subspaces W_{p,n} are still exponentially large. Yet, their dimension is much smaller than the dimension of the embedding vector space, dim(W_{p,n}) ≪ dim(V) = d^N. We map to lower-dimensional spaces to ensure that the kernels of ŷ_p and ŷ_n are sufficiently large to contain the corresponding negative and positive distributions, respectively.

Tensor network architecture
We use tensor networks to make the norm calculation in Eq. 1 tractable. As shown in Fig. 1, the total model consists of one embedding layer and two projectors P_{p,n}, which we represent by LPS tensor networks. We first discuss the embedding, then the projectors. Finally, we explain the generation of new positive and negative samples.
Embedding We assume that the inputs x are real vectors of size N with elements in the unit interval. If one or more attributes of the input do not have this form, we transform them in a preprocessing step. We define the embedding with a local vector transformation φ, which maps each input vector element into a local vector space of dimension d. We use a one-parameter Fourier (cosine) embedding, orthonormal on the unit interval,

φ_1(x) = 1,  φ_j(x) = √2 cos((j−1)πx),  j = 2, …, d,  (2)

where the parameter d determines the dimension of the embedding vector space. The complete embedding transformation Φ is then given by the tensor product of local embeddings,

Φ(x) = φ(x_1) ⊗ φ(x_2) ⊗ … ⊗ φ(x_N).  (3)

We show the tensor-network calculations in a diagrammatic notation [20,22], in which the local and the full embedding are represented as tensors with one and N open legs, respectively (Eq. 4; diagram omitted). Instead of the cosine, we sometimes use the sine basis functions, and we may use distinct basis functions for different attributes (elements of the input x). An essential property of the local embedding functions is the element-wise orthonormality on the domain, ∫₀¹ φ_i(x) φ_j(x) dx = δ_{ij}, which enables efficient sampling [20]. In the case of an unbounded domain, we apply a finite set of functions orthonormal on that domain without changing the properties of the proposed model. We interpret the embedding functions as basis functions encoding (expanding) the data probability distribution; the expansion coefficients are determined by the tensor-network projectors P_{p,n} discussed in the next section.
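The local embedding and its orthonormality can be checked numerically. A minimal sketch assuming the orthonormal cosine basis of Eq. 2 (the paper's exact basis functions may differ in normalization; function names are ours):

```python
import numpy as np

def local_embedding(x, d):
    """Orthonormal cosine (Fourier) embedding of a scalar x in [0, 1].

    Assumed form of Eq. 2: phi_1(x) = 1, phi_j(x) = sqrt(2) cos((j-1) pi x).
    """
    j = np.arange(d)
    phi = np.sqrt(2.0) * np.cos(j * np.pi * x)
    phi[0] = 1.0
    return phi

def embed(x, d):
    """Embed a vector x of size N as N local d-dimensional vectors.

    The full embedding Phi(x) (Eq. 3) is the tensor product of the rows;
    we keep the factored (N, d) form, which is all that tensor-network
    contractions need.
    """
    return np.stack([local_embedding(xi, d) for xi in x])
```

The orthonormality of Eq. 4 is what later makes marginalization a simple leg contraction during sampling.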
Tensor-network projector We use two separate LPS tensor networks representing the positive/negative projectors P_{p,n}. An LPS is a one-dimensional tensor network consisting of three-dimensional parameter tensors A_i ∈ R^{D×D×d} and four-dimensional parameter tensors B_i ∈ R^{D×D×d×d}. It represents an exponentially large non-square real matrix, written diagrammatically in Eq. 5 (diagram omitted), where S is a hyperparameter of the LPS. The bottom legs act on the input space V, and the upper legs act on the output space W. The number of top legs is S times smaller than the number of bottom legs, which guarantees that dim(W) = d^{N/S} is exponentially smaller than dim(V) = d^N. The exponentially smaller output space ensures that the kernel is sufficiently large to contain the data distribution of the opposite class. We calculate the norm of ŷ associated with an LPS by contracting the tensor network in Eq. 6 (diagram omitted).

Contraction complexity We perform the contraction of Eq. 6 efficiently in O(N D²(D + d)(d/S + 1)) operations (see, e.g., [14]). Besides the maps ŷ_{p,n}, we also need the Frobenius norms of the projectors P_{p,n}, which we calculate efficiently by contracting a similar tensor network for ||P||²_F (Eq. 7; diagram omitted) [14].

Workflow The entire production workflow of the proposed tensor-network model is summarized as follows.
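The norm contraction of Eq. 6 can be sketched as a left-to-right transfer-matrix loop over a D × D environment, giving the cost linear in N quoted above. A minimal sketch treating the LPS as an MPO with an output leg on every S-th site and trivial boundary bonds; the purification structure of a full LPS is omitted for brevity:

```python
import numpy as np

def lps_norm_sq(tensors, phi):
    """Contract ||y(x)||^2 = ||P Phi(x)||^2 site by site.

    tensors: list of site tensors, either A with shape (Dl, Dr, d)
             (no output leg) or B with shape (Dl, Dr, d, d_out)
             (with an output leg, appearing on every S-th site).
    phi:     array of shape (N, d) holding the local embeddings phi(x_i).
    The environment E carries the joint contraction of the ket and bra
    copies of the network, so the exponential vector is never formed.
    """
    E = np.ones((1, 1))  # trivial left boundary
    for t, v in zip(tensors, phi):
        if t.ndim == 3:  # A-tensor: contract only the input leg
            M = np.einsum('abk,k->ab', t, v)
            E = np.einsum('ab,ac,bd->cd', E, M, M)
        else:            # B-tensor: also sum the output leg of ket and bra
            M = np.einsum('abko,k->abo', t, v)
            E = np.einsum('ab,aco,bdo->cd', E, M, M)
    return float(E[0, 0])
```

By construction the result is a sum of squares of the components of ŷ(x), hence non-negative.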

1. Flatten If the input is multi-dimensional, we flatten it such that the first dimension is the batch dimension and the second is the feature dimension; N denotes the number of features. After flattening, the working tensor is of size n_batch × N, where n_batch is the batch size.

2. Normalization We normalize the data to the domain of the embedding map. Since we use an embedding with a unit-interval domain for each feature, we normalize all features to the unit interval. Categorical features are first represented as integers and then normalized.
3. Embedding We transform each feature by an embedding function, which may in general vary between features; unless stated otherwise, we use the same embedding function for all features (see Eq. 2). The embedded tensor has dimensions n_batch × N × d, where d is the dimension of the local embedding.
4. Log-norm Finally, we calculate the norms of the positive and negative projections by contracting the embedding tensor with the LPS tensor networks (see Eq. 6). To avoid numerical overflow, we calculate the log-norms of the positive and negative projections. The class is positive if the log-norm of the positive projection is larger than the log-norm of the negative projection.
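For a tiny input size, the workflow steps can be sketched end to end. This is a sketch under two assumptions: the cosine embedding form of Eq. 2, and dense projector matrices of shape (K, d^N) standing in for the LPS networks (the paper contracts tensor networks instead, so the exponential embedding vector is never materialized); all function names are ours:

```python
import numpy as np

def local_embedding(x, d):
    # orthonormal cosine basis on [0, 1] (assumed embedding form, Eq. 2)
    j = np.arange(d)
    phi = np.sqrt(2.0) * np.cos(j * np.pi * x)
    phi[0] = 1.0
    return phi

def embed_full(x, d):
    # step 3: full product embedding Phi(x); dense, for illustration only
    phi = np.ones(1)
    for xi in x:
        phi = np.kron(phi, local_embedding(xi, d))
    return phi

def predict(X, P_pos, P_neg, d):
    """Steps 2-4 of the workflow on already-flattened inputs in [0, 1]^N.

    P_pos, P_neg are dense stand-ins for the LPS projectors.
    Returns 1 ('positive') or 0 ('negative') per row of X.
    """
    preds = []
    for x in X:
        phi = embed_full(x, d)
        # step 4: compare the log-norms of the two projections
        log_p = np.log(np.sum((P_pos @ phi) ** 2) + 1e-300)
        log_n = np.log(np.sum((P_neg @ phi) ** 2) + 1e-300)
        preds.append(1 if log_p > log_n else 0)
    return np.array(preds)
```

With rank-one projectors built from a positive and a negative prototype, the rule reduces to comparing squared overlaps with the two prototypes, which illustrates the feature-space distance picture of Section 3.1.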

Loss
We construct a loss that pushes the positive labeled/classified instances towards the boundary of the positive hypersphere and into the kernel of the negative map ŷ_n (terms L1 and L2), and the opposite for negative classified samples (term L3). To achieve a tight positive/negative subspace W_{p,n}, we penalize the norm of the matrices P_{p,n} (term L4). Finally, we add a term that prevents the collapse of the classifier to one class (term L5). At each stage of training, we separate a data batch into three groups: (i) labeled positive samples D_l, (ii) unlabeled samples classified as positive D_p, and (iii) unlabeled samples classified as negative D_n. For labeled positive samples, i.e., x ∈ D_l, we define the loss

L1 = (1/|D_l|) Σ_{x∈D_l} [ λ1 (log ||ŷ_p(x)|| − μ0)² + λ2 (log ||ŷ_n(x)|| − μ1)² ],

where λ1,2 are loss parameters. We fix μ0 = 5 and μ1 = −50. We denote by |D| the number of elements in the batch/dataset D. The first term of the loss L1 ensures that the positive projections ŷ_p(x) have a norm close to exp(μ0). The second term of L1 pushes the labeled samples towards the kernel of the map ŷ_n. We use the logarithm of the norms to stabilize the training. The second loss term L2 concerns the positive classified samples, i.e., x ∈ D_p, which we treat as labeled positive samples. Therefore, we define

L2 = (1/|D_p|) Σ_{x∈D_p} [ λ3 (log ||ŷ_p(x)|| − μ0)² + λ4 (log ||ŷ_n(x)|| − μ1)² ],

with new loss parameters λ3,4 and the same constants μ0 = 5 and μ1 = −50.
In the case of negative classified samples, i.e., x ∈ D_n, we reverse the roles of the positive and negative maps ŷ_{p,n}. The vector ŷ_n(x) should have a norm close to exp(μ0), while the norm of ŷ_p(x) should be close to zero. Therefore, we define

L3 = (1/|D_n|) Σ_{x∈D_n} [ λ5 (log ||ŷ_n(x)|| − μ0)² + λ6 (log ||ŷ_p(x)|| − μ1)² ],

with new loss parameters λ5,6 and the same constants μ0,1.
To learn a tight fit of the positive and negative distributions, we want the Frobenius norms of the projectors P_{p,n} to be as close to one as possible. Moreover, to avoid a collapse to one class, we want the Frobenius norms of the two projectors to be close to each other. We encourage both objectives with the fourth loss term

L4 = λ7 [ (log ||P_p||_F)² + (log ||P_n||_F)² ].

Since the last term does not always prevent a class collapse, we add a fifth term, which ensures the classification of at least some samples in each class. It makes the batch averages of ||ŷ_{p,n}||² close to each other, namely

L5 = λ8 [ log Σ_{x∈D} ||ŷ_p(x)||² − log Σ_{x∈D} ||ŷ_n(x)||² ]².

The parameter λ8 is our last loss hyperparameter. The entire loss optimized during training is the sum L = L1 + L2 + L3 + L4 + L5 and has eight hyperparameters λ1, …, λ8. Removing any of the presented loss terms would significantly degrade the performance or lead to a class collapse.
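The five terms combine into a single scalar as sketched below. This follows the reconstructed form of the loss above (the exact functional form in the paper may differ slightly); inputs are the per-sample log-norms log ||ŷ_{p,n}(x)|| and the log-Frobenius norms of the projectors:

```python
import numpy as np

MU0, MU1 = 5.0, -50.0  # constants fixed in the text

def pu_loss(logn_p, logn_n, log_frob_p, log_frob_n, labeled, lam):
    """Sketch of the five-term PU loss (reconstructed form).

    logn_p, logn_n : arrays of log||y_p(x)||, log||y_n(x)|| per sample
    labeled        : boolean mask of labeled positive samples
    lam            : dict of hyperparameters lambda_1..lambda_8
    """
    # Split the batch: labeled positives D_l, and unlabeled samples
    # classified as positive (D_p) or negative (D_n).
    pred_pos = logn_p > logn_n
    Dl = labeled
    Dp = (~labeled) & pred_pos
    Dn = (~labeled) & ~pred_pos

    def push(mask, to_mu0, to_mu1, la, lb):
        # pull log-norms toward mu0 (hypersphere) and mu1 (kernel)
        if not mask.any():
            return 0.0
        return np.mean(la * (to_mu0[mask] - MU0) ** 2
                       + lb * (to_mu1[mask] - MU1) ** 2)

    L1 = push(Dl, logn_p, logn_n, lam[1], lam[2])
    L2 = push(Dp, logn_p, logn_n, lam[3], lam[4])
    L3 = push(Dn, logn_n, logn_p, lam[5], lam[6])
    # L4: keep both Frobenius norms close to 1 (logs close to 0)
    L4 = lam[7] * (log_frob_p ** 2 + log_frob_n ** 2)
    # L5: balance the batch-averaged squared norms to avoid class collapse
    L5 = lam[8] * (np.log(np.exp(2 * logn_p).sum())
                   - np.log(np.exp(2 * logn_n).sum())) ** 2
    return L1 + L2 + L3 + L4 + L5
```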
The positive and negative tensor-network projectors determine probability amplitudes (square roots) of the approximate positive and negative distributions (see Section 3.4). Therefore, we provide an intuitive probabilistic interpretation of the proposed loss. The first summand in L1 (proportional to λ1) minimizes the Kullback-Leibler (KL) divergence between the positive-labeled data distribution and the (positive) tensor-network approximation ŷ_p. In contrast, the second summand in L1 (proportional to λ2) maximizes the KL divergence between the positive-labeled data and the (negative) tensor-network approximation ŷ_n. The second and third loss terms L2 and L3 have similar interpretations: we increase the KL divergence between the opposite-class distributions and decrease the KL divergence between the same-class distributions. The fourth term normalizes the positive/negative tensor-network distributions and captures the remaining parts of the KL divergence in the losses L1-3. The final, fifth loss L5 ensures that the positive and negative tensor-network distributions have the same distance (KL divergence) to the uniform distribution on the unlabeled data. We can incorporate prior information into our objective by replacing the uniform prior in the fifth loss L5 with another distribution.
As described in Section 4, we fix the loss hyperparameters once and keep them constant for all experiments presented in the paper. We find that the best parameter set has a higher weight for labeled samples than for the positive/negative classified samples, akin to the biased learning techniques discussed in Section 2.

Sampling
We now reformulate the first part of the loss terms L1,2,3 and show that they minimize a distance between the data distribution and the positive/negative "quantum" probabilities given by ρ_{p,n} = (P_{p,n})^T P_{p,n}. We represent a dataset D as an unnormalized "quantum" probability by using the diagonal ensemble

ρ(D) = Σ_{x∈D} Φ(x) Φ(x)^T,

where Φ(x) denotes a column-vector embedding and Φ(x)^T a row-vector embedding. We now promote our embedding vector space to a Hilbert space with a Hilbert-Schmidt inner product and interpret the terms in the loss functions L1,2,3 as the inner product between the actual data distribution and the modeled distributions ρ_{p,n},

Σ_{x∈D} ||ŷ_{p,n}(x)||² = Tr(ρ(D) ρ_{p,n}).

Therefore, the final matrices P_{p,n} define the "square roots" of the approximate positive/negative data distributions.
A distinct feature of many tensor-network models is that they enable efficient unbiased sampling [20,23], which is also true for probability distributions determined by LPS states. In a tensor network, we typically sample sequentially by constructing a local probability density. Let us assume that we have already sampled all positions up to i; in other words, we know the values x_1, x_2, …, x_i, while the values at positions i+1, i+2, …, N are yet to be determined. We sample the attribute at position i+1 from the conditional probability density

p(u | x_1, …, x_i) ∝ [Φ(x_{≤i}) ⊗ φ(u)]^T ρ̃ [Φ(x_{≤i}) ⊗ φ(u)],

where ρ̃ is obtained from ρ_{p,n} by marginalizing the positions i+2, …, N (Eq. 16; diagram omitted). Due to the orthonormality of the embedding functions, we replace the marginalization integrals or sums over the variables x_{i+2}, x_{i+3}, …, x_N by a simple contraction. The tensor network in Eq. 16 can be efficiently evaluated and represents an embedding-basis expansion of the unnormalized probability density of the current variable u. After we sample the current position, we continue with the next one to the right.
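Sequential sampling can be illustrated for N = 2 with a dense coefficient matrix C in the orthonormal basis, q(x1, x2) = (Σ_{jk} C_jk φ_j(x1) φ_k(x2))²: marginalizing x2 then reduces to a sum over the second index of C, which is exactly the simplification orthonormality buys. The dense matrix and the grid-based inverse sampling are illustrative stand-ins for the LPS contraction:

```python
import numpy as np

def basis(u, d):
    # orthonormal cosine basis on [0, 1] (assumed embedding form)
    j = np.arange(d)
    phi = np.sqrt(2.0) * np.cos(j * np.pi * u)
    phi[0] = 1.0
    return phi

def sample_sequential(C, grid=2000, rng=None):
    """Draw (x1, x2) from q(x) = (sum_jk C[j,k] phi_j(x1) phi_k(x2))^2.

    Minimal N = 2 sketch of the sequential sampling scheme; the LPS
    version contracts tensors instead of using a dense matrix C.
    """
    rng = np.random.default_rng(rng)
    us = (np.arange(grid) + 0.5) / grid
    B = np.stack([basis(u, C.shape[0]) for u in us])  # (grid, d)

    def draw(p):
        p = p / p.sum()
        return us[rng.choice(grid, p=p)]

    # marginal of x1: integrating out x2 sums over the columns of B @ C
    amp1 = B @ C                                      # (grid, d)
    x1 = draw((amp1 ** 2).sum(axis=1))
    # conditional of x2 given x1: fix the first leg at the sampled x1
    v = basis(x1, C.shape[0]) @ C                     # (d,)
    x2 = draw((B @ v) ** 2)
    return x1, x2
```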

Ensemble sampling
The linearity of the model also enables efficient ensemble sampling. At a given position, we first construct the probability densities corresponding to each LPS in the ensemble and then sample according to the average probability density. Finally, we assign the sampled value to each LPS in the ensemble.
Missing attributes The LPS model also naturally processes samples with missing attributes. We handle missing attributes by calculating the overlap with the approximate marginal probability distribution, where we marginalize over the missing attributes. Specifically, if one or more attributes are missing, we contract the corresponding top and bottom LPS tensors, analogously to the normalization in Eq. 16. The marginalization over missing attributes can be applied during both the training and the prediction phase.

Model selection and parameter tuning
Before calculating the evaluation metric (accuracy or F1-score), we perform a model selection step. After training several models, we compare the label predictions on the training dataset for each model pair and select the models which agree most. We find that the agreement fraction is a reliable estimate of the final test accuracy of the models (see Section 4.2). We calculate the evaluation metric only for selected models. Further, we show that the agreement fraction is also a good metric for hyperparameter tuning with unlabeled data (see Section 4.3). Accordingly, we use the same strategy to determine the remaining model hyperparameters: bond dimension D, embedding dimension d, number of training epochs, and learning rate schedule.
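The selection step reduces to comparing predicted label vectors pairwise and keeping the most-agreeing pair; the agreement fraction doubles as the accuracy estimate. A minimal sketch (function name ours):

```python
import numpy as np
from itertools import combinations

def select_by_agreement(predictions):
    """Select the pair of models whose predicted labels agree most.

    predictions: dict mapping model name -> array of 0/1 labels on the
    (unlabeled) training set. Returns the best pair and its agreement
    fraction, which serves as the estimated accuracy (co-labeling metric).
    """
    best, best_frac = None, -1.0
    for a, b in combinations(sorted(predictions), 2):
        frac = float(np.mean(predictions[a] == predictions[b]))
        if frac > best_frac:
            best, best_frac = (a, b), frac
    return best, best_frac
```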

Summary
We introduce a one-stage tensor-network generative model for positive unlabeled learning. Our model is inspired by the tensor-network anomaly detection model [14] but has several key differences. First, we add a second projector that approximates the distribution of negative samples. Second, besides the loss for labeled samples L1, we add a loss for probably-positive (L2) and probably-negative (L3) samples, as well as the "convergence" loss L5, which ensures that the model does not collapse to one class. Third, we present an efficient unbiased scheme for generating new positive and negative samples and show how to process samples with missing attributes. Finally, we introduce a novel model selection and parameter-tuning strategy applicable to unlabeled data. The proposed setup is similar to a two-stage PUL approach since we use only reliable negative and positive samples for training a classifier. The difference is that we use the same model as a classifier and a generator. Therefore, we omit the second stage, similar to the CGenPU-AC approach.

Results
Our approach was tested on three synthetic point datasets, the MNIST image dataset, and 15 categorical/mixed datasets. In all cases, the Adam optimizer was used. The loss parameters were fixed to λ1 = λ2 = λ8 = 4, λ4 = λ5 = 2, and λ3 = λ6 = 1. The loss hyperparameters were determined by manual tuning on the point datasets. The exception is the hyperparameter λ7, which was determined dynamically after each training epoch. If all labeled samples in the epoch were correctly classified, λ7 was increased by a factor k_inc = 1.1, up to a maximum value fixed at 10. In contrast, if the accuracy on the labeled samples in the epoch was smaller than 0.95, λ7 was decreased by a factor k_dec = 0.9, down to a minimal value fixed at 0.1. Finally, the factors k_inc/dec were updated at each change from a decrease to an increase of the hyperparameter λ7 by setting their new values to (k_inc/dec)^0.8. The hyperparameters of the λ7 dynamics were fixed by observing the training on toy datasets; the training dynamics do not depend significantly on this choice. Similarly, the particular choice of the constants μ0 = 5 and μ1 = −50 does not significantly change the results on toy datasets as long as the values are aligned with our objective, namely μ0 > μ1.
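The dynamic λ7 schedule described above can be written as a small state machine. The numerical values (factors 1.1 and 0.9, bounds 10 and 0.1, exponent 0.8, accuracy threshold 0.95) follow the text; the stateful implementation details are our assumptions:

```python
class Lambda7Scheduler:
    """Per-epoch schedule for the Frobenius-loss weight lambda_7 (sketch)."""

    def __init__(self, lam7=1.0, k_inc=1.1, k_dec=0.9,
                 lam_max=10.0, lam_min=0.1):
        self.lam7, self.k_inc, self.k_dec = lam7, k_inc, k_dec
        self.lam_max, self.lam_min = lam_max, lam_min
        self.last_move = None

    def step(self, labeled_accuracy):
        if labeled_accuracy == 1.0:
            if self.last_move == "dec":
                # soften both factors at each decrease -> increase switch
                self.k_inc **= 0.8
                self.k_dec **= 0.8
            self.lam7 = min(self.lam7 * self.k_inc, self.lam_max)
            self.last_move = "inc"
        elif labeled_accuracy < 0.95:
            self.lam7 = max(self.lam7 * self.k_dec, self.lam_min)
            self.last_move = "dec"
        return self.lam7
```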

Synthetic datasets
The three point datasets, namely the two-moons dataset, the circles dataset, and the blobs dataset, available through the scikit-learn API, were used to test the generative properties of our approach. In all cases, 1000 training samples were used, where half of the samples were positive, of which 100 were labeled. To obtain a more expressive model, the input was first repeated nine times, resulting in 18-dimensional feature vectors that were processed as discussed in Section 3. Local basis functions were randomly chosen between sine and cosine, and S = 3 and D = d = 12 were used. The models were trained with the Adam optimizer and a learning rate of 0.1. After training, sampling from an ensemble was performed by sequential sampling from the average local probability density. Samples were then accepted only if all models predicted a high probability that the sample is in the correct class; for positive samples, the threshold was set to log ||ŷ_p(x)|| − log ||ŷ_n(x)|| > 20. The sampled distribution was further improved by thresholding and removing obvious outliers (see Fig. 2), although the initial samples were already close to the original distribution. Visually, our method reproduces the original distribution better (especially after thresholding) than the state-of-the-art GAN results [8,12].

MNIST
On the MNIST image dataset, the one-vs-one classification for each class pair and the one-vs-all classification for each of the ten classes were performed. The Adam optimizer was used with a learning rate of 0.01 and a batch size of 256. The images were cropped to 20 × 20 pixels, and random rotation (by an angle of up to 0.05π) and random zoom (with a factor in the range [0.8, 1.2]) were applied as a data augmentation step. Balanced datasets were used for training and testing on both tasks. Experiments were conducted with N_p = 100, 10, 1 labeled positive samples. The model parameters were determined by adopting the settings of previous studies of TN classification on the MNIST dataset [14,21] and fixed to S = 10, d = 6, and D = 20.
Batch generation In a typical PUL scenario, the number of labeled positive samples is much smaller than the number of unlabeled samples. Hence, many batches would contain no labeled samples during training. We solve this problem by appending to each batch a subsample (with repetition) of the labeled positive samples. The number of appended labeled samples equals the number of samples in the original batch, which contains randomly selected labeled and unlabeled samples.
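The batch-generation fix amounts to appending a with-replacement subsample of labeled indices of the same size as the original batch. A minimal sketch (function name ours):

```python
import numpy as np

def augment_batch(batch_idx, labeled_idx, rng=None):
    """Append a with-replacement subsample of labeled positives to a batch.

    The number of appended labeled indices equals the original batch size,
    so every batch contains labeled samples even when they are rare in the
    full dataset.
    """
    rng = np.random.default_rng(rng)
    extra = rng.choice(labeled_idx, size=len(batch_idx), replace=True)
    return np.concatenate([batch_idx, extra])
```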

Model selection and accuracy estimate
As discussed in Section 3.5, the model selection has been performed based on the fraction of matching labels. From several trained models, the ones with the largest overlap of predicted labels were evaluated. We refer to the fraction of the matching labels as the estimated accuracy. The effectiveness of this metric on the MNIST dataset is established by comparison with the actual test accuracy. In Fig. 3, we show the histogram of the difference between the estimated and the test accuracy of the best/selected models on the one-vs-one task. The total number of models equals the total number of class pairs. We perform the comparison for different numbers of labeled samples N p = 1, 10, 100. As expected, the estimated accuracy is closer to the test accuracy for more labeled training samples N p . Interestingly, in most cases, the estimated accuracy underestimates the test accuracy. We also observe that for N p = 1, 10 more than 90% of the differences in Fig. 3 are within one standard deviation of the average test accuracy for the corresponding number of labeled samples N p reported in Table 1. In the case N p = 100, approximately 50% of the differences are within one standard deviation of the average test accuracy reported in Table 1. However, in this case, the standard deviation is small, namely 0.01, and for all class pairs, the estimated accuracy underestimates the actual test accuracy.

Comparison with GAN approaches
As discussed in the introduction, we evaluate the performance of the TN models on the image datasets by comparing them with the state-of-the-art GAN approaches. We follow [12] and evaluate the model on two tasks. In the one-vs-one task, we select two classes and use one as positive and the other as negative; we repeat this for all possible pairs and report the average metric. In the one-vs-rest task, we select one class as positive and all remaining classes as negative. In both cases, we balance the train and test datasets to contain the same number of positive and negative samples. Since the datasets are balanced, we use accuracy to evaluate the models. We report the results for the one-vs-one classification task in Table 1 and for the one-vs-rest classification task in Table 2. Our model performs significantly better than the GAN approaches [12]. Notably, in the one-vs-one setting, our approach with only one labeled sample is better than all presented GAN approaches with 100 labeled samples. On the one-vs-rest task, we compare our model only with the CGenPU-AC, PGAN, DGAN, and GenPU approaches [12]. Also in this setting, our approach restricted to only one labeled sample is, on average, comparable to the CGenPU-AC model with 50 labeled samples, which is the state-of-the-art GAN approach. Further, if we increase the number of labeled samples in the one-vs-rest setting to 10, we significantly improve on the state-of-the-art CGenPU-AC results trained with 50 labeled samples.

Statistical tests
Since the baseline data for the one-vs-one task is unavailable, we perform the statistical analysis only for the one-vs-rest MNIST task. However, considering the average test accuracy, our improvement over the CGenPU-AC model [12] is larger than the improvement of CGenPU-AC over the GenPU model. Further, our model reduces the error by a factor of 1.6 (N_p = 1) to 11 (N_p = 100) compared with the previous state of the art. Finally, the average accuracy difference between the CGenPU-AC model and our model is larger than four standard deviations, except in the extreme case of N_p = 1 (see Table 1). On the one-vs-rest task, we assess the statistical significance of the results on MNIST data by using the Friedman statistic and the Nemenyi test [44].

Table 1: Average test accuracy on the one-vs-one classification task on the MNIST dataset. A comparison of the GAN approaches and the proposed TN approach. The data for GAN approaches are taken from [12].

Table 2: Average test accuracy on the one-vs-rest task on the MNIST dataset. The largest accuracy is in bold. The data for GAN approaches are taken from [12].
In the Friedman test, the null hypothesis is that all methods perform equally well; it is rejected when the χ²_F statistic exceeds the critical value of the chi-square distribution with k − 1 degrees of freedom. For the data reported in Table 2, the Friedman statistic is χ²_F = 33.7 with p-value = 9 · 10⁻⁷. Therefore, we reject the null hypothesis of the Friedman test.
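For illustration, the Friedman statistic can be computed directly from an N × k matrix of scores (a simplified sketch that assumes no tied scores within a task):

```python
import numpy as np

def friedman_statistic(scores):
    """Friedman chi-square over an (N tasks x k methods) score matrix.

    chi2_F = 12N/(k(k+1)) * (sum_j Rbar_j^2 - k(k+1)^2/4),
    where Rbar_j is the average rank of method j (rank 1 = highest score).
    Assumes no ties, purely for illustration.
    """
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    # double argsort turns scores into within-row ranks (1 = best)
    ranks = (-scores).argsort(axis=1).argsort(axis=1) + 1
    avg = ranks.mean(axis=0)
    return 12 * n / (k * (k + 1)) * (np.sum(avg ** 2) - k * (k + 1) ** 2 / 4)

# toy example: one method dominates on all four tasks
print(friedman_statistic([[0.9, 0.5, 0.1]] * 4))  # 8.0
```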
In the post hoc analysis, we use the Nemenyi test, which compares the average-rank differences of the classifiers with the critical distance CD = q_α √(k(k + 1)/(6N)), where k = 5 is the number of models, N = 10 is the number of tasks (here the ten one-vs-rest problems), α is the significance level, and q_α is the critical value for the two-tailed Nemenyi test [44]. For q_{α=0.05} = 2.727774 the critical distance is CD = 1.93. Our method is, therefore, statistically significantly better than all approaches but CGenPU-AC (see Fig. 4 for details). The results differ from [12] since we used significantly fewer experimental runs.
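The critical distance CD = q_α √(k(k + 1)/(6N)) is easy to check numerically; a small sketch reproducing the value quoted above:

```python
import math

def nemenyi_cd(q_alpha, k, n):
    """Nemenyi critical distance: CD = q_alpha * sqrt(k(k+1) / (6N)).

    k is the number of compared methods, n the number of tasks,
    q_alpha the critical value of the two-tailed Nemenyi test.
    """
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))

print(round(nemenyi_cd(2.727774, 5, 10), 2))  # 1.93
```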

Categorical and mixed dataset
In this section, we evaluate our approach to PUL on categorical and mixed datasets. We follow the setup of [10] and consider 15 datasets². For each dataset, we consider three PUL tasks corresponding to different fractions of labeled positive samples, namely 30%, 40%, and 50% of all positive samples. Negative samples and the remaining positive samples comprise the unlabeled samples. We thus evaluate our model on 45 different PUL problems. As in [10], we take the class with the most instances as positive and the class with the second most instances as negative; in the multi-class setting, we discard the remaining classes. In contrast to the MNIST experiments, the train and test datasets are not balanced. The preferred metric in the case of unlabeled and unbalanced data is the F1 score [10], which we use to evaluate the different approaches to PUL on categorical datasets.

Training details
The preprocessing of the mixed/categorical datasets includes several steps. First, the categorical attributes were converted to numbers. Then, all attributes were normalized to the unit interval. Finally, the embedding functions discussed in Section 3.2 were applied. For simplicity, discrete basis/embedding functions for categorical attributes were not used. Since the considered datasets are small, single-batch training without the repetition of labeled samples was performed. The loss hyperparameters were the same as on the synthetic and MNIST datasets.
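A sketch of the first two preprocessing steps (integer encoding plus min-max scaling; function names are illustrative and the embedding step of Section 3.2 is omitted):

```python
import numpy as np

def preprocess(columns):
    """Encode categorical columns as integers, then scale each attribute
    to the unit interval. `columns` is a list of 1-D arrays, one per
    attribute; constant attributes are mapped to zeros (the paper removes
    them after preprocessing).
    """
    out = []
    for col in columns:
        col = np.asarray(col)
        if col.dtype.kind in "OUS":                 # strings -> integer codes
            _, col = np.unique(col, return_inverse=True)
        col = col.astype(float)
        span = col.max() - col.min()
        out.append((col - col.min()) / span if span > 0 else np.zeros_like(col))
    return np.stack(out, axis=1)

X = preprocess([["red", "blue", "red"], [10.0, 20.0, 30.0]])
print(X.shape)  # (3, 2)
```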
Hyperparameter tuning For each dataset, the embedding dimension d, the bond dimension D, the number of training epochs, and the learning rate schedule were determined by hyperparameter tuning as described in Section 3.5. In all cases, hyperparameter tuning was performed by random search over the following sets: d ∈ {4, 12, 20}, D ∈ {2, 6, 12, 20}, epochs ∈ {10, 20, 50, 150, 210, 400}. The patience parameter was tied to the number of epochs (see Table 3). The tuned hyperparameters were the same for all three considered fractions of labeled positive samples. The skip size S and the number of repetitions of the input attributes were determined by the number of attributes in the dataset; the main guiding principle was that the dimension of the kernel and the rank of the projectors should be sufficiently large, as discussed in Section 3.2. Table 3 shows the dataset characteristics and the chosen hyperparameters. Constant attributes were removed after the preprocessing steps; therefore, the number of attributes reported in the table differs from the number reported in [10].
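The random search can be sketched as follows (the `train_and_score` objective is a hypothetical stand-in for training a TN model and returning its estimated accuracy; the toy objective below is only for demonstration):

```python
import random

random.seed(0)

# search sets used in the paper
grid = {
    "d": [4, 12, 20],
    "D": [2, 6, 12, 20],
    "epochs": [10, 20, 50, 150, 210, 400],
}

def random_search(train_and_score, n_trials=50):
    """Draw random hyperparameter tuples and keep the best-scoring one."""
    best, best_score = None, -1.0
    for _ in range(n_trials):
        cfg = {name: random.choice(values) for name, values in grid.items()}
        score = train_and_score(cfg)
        if score > best_score:
            best, best_score = cfg, score
    return best

# toy objective that prefers a larger bond dimension D
best = random_search(lambda cfg: cfg["D"] / 20)
print(best)
```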

Estimated accuracy and model selection
To validate the model agreement fraction as a reliable accuracy estimate and as a metric for model selection and hyperparameter tuning, we compare the estimated accuracy with the F1 score on the test dataset of the 10-fold cross-validation. The accuracy estimate was calculated as follows. First, a set of 50 hyperparameter tuples was created. Then, ten models were trained for each tuple and each fold in the cross-validation. Out of the ten trained models, only the best model according to the agreement fraction was chosen for the evaluation of the hyperparameter performance. Therefore, for each of the 50 parameter tuples, 100 models were trained in the hyperparameter tuning step, of which only ten were chosen for evaluation, one for each fold in the cross-validation. The test F1 score was calculated only for experimental validation of the estimated-accuracy metric and is not applicable for hyperparameter tuning due to the missing labels.
In Fig. 5, we show the relation between the estimated accuracy (highest agreement fraction between trained models) and the test F1 score. We observe a monotonic relationship on almost all considered datasets. The only two datasets where the monotonic relation is not visually clear are the Cancer and Hepatitis datasets; even there, the hyperparameter settings with high estimated accuracy have a high F1 score. To quantify the relation between the estimated accuracy and the test F1 score, we calculate the Spearman coefficient ρ and the corresponding p-value. Of the 15 datasets, 13 have significant correlations, i.e., p-value < 0.05. Averaging only the significant values, we obtain the Spearman coefficient ρ = 0.81 ± 0.17. This experiment demonstrates the utility and reliability of the estimated accuracy for hyperparameter tuning and model selection in the case of missing labels.
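The Spearman coefficient used here is the Pearson correlation of the ranks; a minimal sketch (assuming no ties, for illustration):

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation of two samples (no ties assumed)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    rx = x.argsort().argsort().astype(float)   # ranks of x
    ry = y.argsort().argsort().astype(float)   # ranks of y
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx * ry).sum() / np.sqrt((rx ** 2).sum() * (ry ** 2).sum()))

# a perfectly monotonic relation gives rho = 1
print(spearman_rho([0.1, 0.4, 0.7, 0.9], [0.2, 0.5, 0.6, 0.95]))  # 1.0
```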
Comparison with other methods As discussed in the introduction, we compare the results of our TN model with seven different approaches to positive unlabeled learning on categorical datasets. In Table 4, we show the F1 score from the 10-fold cross-validation for each of the considered PUL tasks and models. The results for the models PNB, PTAN, APNB, Pulce, and LUHC are taken from [10]³. The results for the GPU model were reproduced using the code accompanying the paper [13]. The GPU model was reevaluated since it represented the best previous model for the PUL task on categorical datasets.
In Table 4, we show the F1 score on the test dataset for all dataset-model pairs. Our TN model obtains the best results in 32 out of 45 PUL tasks, considerably more than the next-best GPU model with eight best results. Our model performs worst on the Hepatitis and Pima datasets; however, even on those datasets, it still performs better than the overall second-best GPU model. On the remaining datasets where our model is not the best-performing model, its F1 scores are within one standard deviation of the best-performing model. On the datasets chess, heart-c, nursery, spambase, and vote, the F1 scores obtained by our model are more than one standard deviation higher than those of the next-best model, which is not always the same method. On the remaining datasets, our model has the highest F1 score, and the second-best F1 score is within one standard deviation. Finally, our model also significantly improves the average F1 score, from 0.81 to 0.90.

Statistical tests
As in the MNIST case, we assess the statistical significance of the results on categorical data using the Friedman statistic and the Nemenyi test [44,10]. For the data reported in Table 4, the Friedman statistic is χ²_F = 146 with p-value = 2 · 10⁻²⁸. Therefore, we reject the null hypothesis of the Friedman test.
In the post hoc analysis, we use the Nemenyi test. For q_{α=0.05} = 3.030879, N = 45, and k = 8, the critical distance is CD = 1.56. Our method is therefore significantly better than all approaches but GPU (see Fig. 6 for details).

Conclusions and outlook
We propose a tensor-network approach to the positive unlabeled learning problem and obtain state-of-the-art results on the MNIST image dataset and 15 categorical datasets. To date, no tensor-network approach has outperformed the best neural-network methods on image datasets. However, to use the model on larger images, additional preprocessing is needed, e.g., patch-based embedding. Although the proposed loss is complex, all its terms are well motivated and necessary to obtain good performance and avoid class collapse. We used a co-training-inspired metric (the estimated accuracy) for model selection and hyperparameter tuning. We also demonstrated the generation of new positive and negative samples on simple synthetic datasets. Besides missing labels, our model also naturally handles samples with missing attributes.
We envision two extensions of the proposed approach. First, we can adapt the approach to the general semi-supervised learning problem with any number of classes. Second, based on the surprisingly good results with very few (down to one) labeled samples, we expect that the presented approach can be used as an unsupervised clustering method. The idea is to train many one-sample PUL TN classifiers with different labeled samples. We can then compare the trained models with the proposed co-training-like metric to find the number of classes and cluster the dataset.

Table 4: Mean F1 score of the 10-fold cross-validation for different PUL classification methods on categorical (and mixed) data. Scores in white columns are reported in [10], which does not include the standard deviation, and are shown for convenience. The values for the GPU method are reproduced using the implementation accompanying the paper [13]. The last column, TN, refers to the presented tensor network approach.

Figure 6: Critical difference graph of the Nemenyi test for the categorical datasets, computed using the average rankings of the methods at the 0.05 significance level.