Semi-supervised Anomaly Detection Algorithm using Probabilistic Labeling (SAD-PL)

To detect abnormal data via semi-supervised learning, unlabeled data are generally assumed to be normal. This assumption, however, causes inevitable performance degradation when even a small fraction of abnormal data is included in the unlabeled dataset. To overcome this degradation and maintain stable detection performance, we propose a semi-supervised anomaly detection algorithm using probabilistic labeling (SAD-PL) for unlabeled data. The proposed SAD-PL is composed of two steps: (1) estimating local outlier factor (LOF) scores of latent vectors from both labeled and unlabeled data and (2) estimating the labeling probability on the unlabeled data by using the prior missing probability of the labeled data via the Neyman-Pearson (NP) criterion. The SAD-PL runs iteratively with the proposed complementary learning functions until the rate of label changes falls below a predefined threshold. Experimental results reveal that the SAD-PL achieves higher detection probability than existing algorithms and stable performance regardless of the normal-to-abnormal data ratio in the unlabeled data and the statistical mismatch between the unlabeled and labeled data.


I. INTRODUCTION
Anomaly detection is used for detecting abnormal samples that deviate from a predefined normality [1]. It has various applications in medicine, security, and manufacturing [1]. Further applications include intrusion detection in cybersecurity [2]-[4], industrial fault and damage detection in monitoring sensor data [5]-[7], and acoustic novelty detection for audio surveillance and underwater sonar systems [8]-[13]. Typical anomaly detection methods based on unsupervised learning assume that most of the samples are normal. The unsupervised approaches, primarily treated as a one-class classification problem, learn features of normal samples [14]-[18]. Classical methods such as the one-class support vector machine (OC-SVM) [14] and support vector data description (SVDD) [15] attempt to learn compact descriptions of normal samples. Recent deep learning approaches have shown outstanding performance by overcoming the problems of shallow learning on high-dimensional data [16]-[18]. Deep SVDD [16], a representative deep approach, trains a neural network while minimizing the volume of a hypersphere that encloses normal samples in latent space. However, because most unsupervised approaches are not trained on abnormalities, their detection performance is limited.
Some labeled data, as well as unlabeled data, may be available in real-world applications; in particular, a small number of anomalous samples in the labeled data can be exploited. Song et al. [19] and Akcay et al. [20] proposed semi-supervised anomaly detection models that use reliable normal samples in unlabeled data for training. However, since these models, like the unsupervised approaches, do not learn abnormalities, their performance is limited. Ruff et al. [21] proposed deep semi-supervised anomaly detection (Deep SAD), which learns anomalous samples in labeled data. By assuming that most unlabeled samples are normal, Deep SAD trains the normal data to be concentrated at the center of the latent space and then trains the labeled abnormal samples to move away from the center. An unsupervised Outlier Exposure (OE) approach for learning OE data and unlabeled data has been proposed based on a similar assumption of semi-supervised anomaly detection [22]. The anomaly detection method based on unsupervised OE learning trains a binary classifier on the OE data and the unlabeled abnormal and normal data. Hypersphere classification (HSC) [23] is a classification algorithm based on unsupervised OE learning that uses the relative distance in latent space for training. Deep SAD and HSC are advanced anomaly detection algorithms since they consider abnormal samples or OE data in the training process. However, their assumption that most unlabeled samples are normal inevitably causes performance degradation as the number of abnormal samples in the unlabeled dataset increases. To overcome this degradation, a new semi-supervised anomaly detection algorithm that can learn from unlabeled normal and abnormal data efficiently is required.
In this paper, we propose an anomaly detection algorithm based on probabilistic normal or abnormal labeling for each sample in unlabeled data. The proposed algorithm, denoted as the semi-supervised anomaly detection algorithm using probabilistic labeling (SAD-PL), involves a two-step probabilistic labeling process: (1) computing the local outlier factor (LOF) score of latent vectors from both labeled and unlabeled data and (2) estimating the labeling probability on the unlabeled data by using the missing probability of the labeled normal data via the Neyman-Pearson (NP) criterion. The SAD-PL runs until the labeling change rate becomes lower than a preset threshold.
Our paper is organized as follows: Section 2 describes the SAD-PL and evaluates its performance using toy data. In Section 3, experimental results on image datasets are described in comparison with existing algorithms. Conclusions are presented in Section 4.

II. SAD-PL
In this section, we introduce the SAD-PL based on semi-supervised learning with probabilistic labeling. Figure 1 shows the proposed algorithm along with the existing semi-supervised algorithms, the Deep SAD and HSC. The proposed SAD-PL uses feature representations φ(x) obtained through autoencoder pretraining and, without the decoding network, learns to map the encoded normal samples close to the centroid c in the latent space. Then, the SAD-PL is trained according to the proposed probabilistic labeling, which uses the LOF score s = LOF_k{φ(x)}. The LOF, which scores the density based on the relative distance between neighboring samples, is known to show robust performance in the multimodal normality case [24]. To detect a small number of grouped anomaly samples, the SAD-PL sets the number of neighbors k large enough to cover the relative distance between normal and abnormal samples. The probability used for labeling is computed by using the missing probability α of the labeled normal data as

α = ∫_τ^∞ f(s | H₀) ds,   (1)

where f(s | H₀) denotes the probability density function of s = LOF_k{φ(x)} obtained from the labeled normal data (hypothesis H₀), and τ denotes the threshold for the given α. Note that f(s | H₀) varies as the SAD-PL is trained. These changes in τ cause the probabilistic labels ŷᵢ^(t) to change at each training epoch t. However, the labels ŷᵢ^(t), i ≤ n, for the n labeled samples are fixed as follows:

ŷᵢ^(t) = 1 for all normal xᵢ,  ŷᵢ^(t) = 0 for all abnormal xᵢ.   (2)

For the unlabeled data xᵢ, the labels ŷᵢ^(t), i > n, are defined as follows:

ŷᵢ^(t) = P(s > sᵢ | H₀) = ∫_{sᵢ}^∞ f(s | H₀) ds,   (3)

where sᵢ represents the LOF score of xᵢ. According to the τ determined by the NP criterion, ŷᵢ^(t) has a corresponding probability and consequently implies the probabilistic label for the unlabeled xᵢ. For network training, the SAD-PL uses the datasets {xᵢ, ŷᵢ^(t), i ≤ n} and {xᵢ, ŷᵢ^(t), i > n}, which are composed of the n labeled and (N − n) unlabeled data with the corresponding probabilistic labels estimated in (2) and (3).
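The two-step labeling described above can be sketched numerically. The following Python snippet is a minimal illustration under stated assumptions (synthetic 2-D latent vectors, an empirical quantile in place of the analytic NP threshold, and an empirical tail probability in place of the integral over f(s | H₀)); it is not the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Hypothetical latent vectors: labeled normal data (H0) and unlabeled data
# that is mostly normal with a few grouped anomalies far from the bulk.
z_labeled = rng.normal(0.0, 1.0, size=(500, 2))
z_unlabeled = np.vstack([rng.normal(0.0, 1.0, size=(95, 2)),
                         rng.normal(6.0, 1.0, size=(5, 2))])

# Step 1: LOF scores over labeled and unlabeled latent vectors jointly,
# with the neighborhood size k set large enough to cover grouped anomalies.
z_all = np.vstack([z_labeled, z_unlabeled])
lof = LocalOutlierFactor(n_neighbors=100)
lof.fit(z_all)
s_all = -lof.negative_outlier_factor_      # larger score = more anomalous
s0 = s_all[:len(z_labeled)]                # scores of labeled normal data (H0)
s_unlab = s_all[len(z_labeled):]

# Step 2: empirical NP threshold tau such that P(s > tau | H0) = alpha.
alpha = 0.04
tau = np.quantile(s0, 1.0 - alpha)

# Probabilistic label: empirical tail probability P(s > s_i | H0); it is
# near 1 for normal-looking samples (small LOF) and near 0 for anomalies.
y_soft = (s0[None, :] > s_unlab[:, None]).mean(axis=1)
# Hard-labeling variant: binary labels from the threshold tau.
y_hard = (s_unlab <= tau).astype(float)
```

In this toy run, the grouped anomalies receive labels near 0 while most unlabeled normal samples receive labels near 1, mirroring the intended behavior of (2) and (3).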
The resultant objective function of the SAD-PL can be described as follows:

min_W (1/N) Σᵢ₌₁^N [ ŷᵢ^(t) ℓ(φ(xᵢ; W)) + (1 − ŷᵢ^(t))(1 − ℓ(φ(xᵢ; W))) ] + (λ/2) Σ_{l=1}^L ‖Wˡ‖_F²,   (4)

where Wˡ is the weight matrix of layer l ∈ {1, ⋯, L}, ‖•‖_F denotes the Frobenius norm, and ℓ(φ(x)) is a function of the distance from the centroid c using the Geman-McClure loss, defined as follows:

ℓ(φ(x)) = ‖φ(x) − c‖² / (‖φ(x) − c‖² + 1).   (5)

The proposed SAD-PL objective function consists of a weight decay regularizer on W, multiplied by the hyperparameter λ > 0 to prevent overfitting, and two complementary learning functions of symmetrical losses for normal and abnormal samples, which are multiplied by the complementary probabilistic labels ŷᵢ^(t) and (1 − ŷᵢ^(t)). For the labeled data, the SAD-PL learns to map the normal samples closer to c and the abnormal samples away from c. For unlabeled data training, the network uses the complementary losses with the probabilistic labels ŷᵢ^(t). To avoid a trivial solution in which the network maps all inputs to c, the bias is not updated during the learning process. By adopting the Geman-McClure loss in (5), the SAD-PL can prevent divergence in the learning process, since the loss is bounded between 0 and 1, and obtains robust stability against mislabeled data [25]. Note that the SAD-PL objective function has both regression and classification properties, since it learns by computing a distance loss based on the estimated probabilistic labels ŷᵢ^(t) for the unlabeled data.
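The complementary objective and the bounded Geman-McClure distance loss described above can be written compactly. The sketch below uses plain NumPy with hypothetical function names; it evaluates the loss for given latent vectors rather than training a network.

```python
import numpy as np

def geman_mcclure(z, c):
    """Bounded distance loss l(phi(x)) = d^2 / (d^2 + 1), d = ||phi(x) - c||.
    Values lie in [0, 1), so the loss cannot diverge."""
    d2 = np.sum((z - c) ** 2, axis=-1)
    return d2 / (d2 + 1.0)

def sad_pl_objective(z, y, c, weights, lam):
    """Complementary objective: y * l + (1 - y) * (1 - l), plus weight decay.
    y is 1 (normal), 0 (abnormal), or a probabilistic label in between."""
    l = geman_mcclure(z, c)
    data_term = np.mean(y * l + (1.0 - y) * (1.0 - l))
    reg = 0.5 * lam * sum(np.sum(W ** 2) for W in weights)  # squared Frobenius norms
    return data_term + reg
```

A normal sample (y = 1) near c and an abnormal sample (y = 0) far from c both yield a small loss, which is exactly the complementary pull-toward/push-away behavior described above.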
Soft Labeling: To find the optimum α for unlabeled data labeling, the SAD-PL adopts Q ensemble networks, where the q-th network is the model trained with α(q). To estimate the optimum α, the SAD-PL uses the fact that the LOF difference between normal and abnormal samples in well-trained networks becomes relatively larger than that in ill-trained networks. To determine the optimum α(q), we compute the LOF difference Δ(q) for each q as follows:

Δ(q) = s̄₁(q) − s̄₀(q),   (6)

where s̄₀(q) and s̄₁(q) represent the average LOFs of the labeled normal and abnormal samples, respectively. Then, from the Q differences [Δ(1), ⋯, Δ(Q)], the optimum α(q) is determined by finding the maximum Δ(q) at a given epoch t. The SAD-PL then learns the model by using the corresponding threshold.
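The ensemble selection step reduces to an argmax over the per-candidate LOF differences. The helper below is an illustrative sketch: the arrays of precomputed LOF scores per ensemble member are hypothetical inputs, not part of the paper's code.

```python
import numpy as np

def select_alpha(lof_normal, lof_abnormal, alphas):
    """Pick alpha(q*) maximizing Delta(q) = mean LOF of labeled abnormal
    samples minus mean LOF of labeled normal samples under the q-th model.
    lof_normal[q], lof_abnormal[q]: hypothetical precomputed score arrays."""
    deltas = np.array([np.mean(s1) - np.mean(s0)
                       for s0, s1 in zip(lof_normal, lof_abnormal)])
    q_star = int(np.argmax(deltas))
    return alphas[q_star], deltas
```

A larger Δ(q) indicates better normal/abnormal separation, so the winning candidate's α(q) is the one used for subsequent labeling.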
Hard Labeling: By fixing α and quantizing the labels ŷᵢ^(t) to 0 or 1, the SAD-PL can reduce the computational cost in memory and time. This version of the SAD-PL is referred to as hard labeling and learns the network by using the fixed threshold shown in Figure 1.
The proposed SAD-PL runs until the labeling change rate Δℓ(t) at iteration t is lower than the preset threshold ε. The Δℓ(t) is computed as

Δℓ(t) = (1/(N − n)) Σᵢ₌ₙ₊₁^N |ŷᵢ^(t) − ŷᵢ^(t−1)|.   (7)

Algorithm 1 shows the overall learning procedure of the SAD-PL.
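One plausible form of this stopping criterion, assumed here, is the mean absolute change of the unlabeled labels between consecutive epochs compared against the preset threshold ε; the sketch below is illustrative.

```python
import numpy as np

def label_change_rate(y_prev, y_curr):
    """Mean absolute change of the (probabilistic) labels between epochs."""
    return float(np.mean(np.abs(np.asarray(y_curr) - np.asarray(y_prev))))

def converged(y_prev, y_curr, eps=0.001):
    """Stop training once the label change rate falls below epsilon."""
    return label_change_rate(y_prev, y_curr) < eps
```

With hard labels this reduces to the fraction of unlabeled samples whose label flipped in the last epoch.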

A. COMPARISON OF ANOMALY DETECTION MODELS
We compare the anomaly detection models described above on toy examples. The training data consist of a total of 10000 samples, of which 10% are labeled data and 90% are unlabeled data. Five percent of the labeled data and 1% of the unlabeled data consist of abnormal samples. We generate the normal samples in the training data as two-dimensional big moon and small moon patterns. We also add Gaussian noise of variance 0.2 and 0.25 to the labeled and unlabeled data, respectively. The abnormal samples in the labeled data are located so as to be clearly distinguished from the normal samples. On the other hand, the abnormal samples in the unlabeled data are located adjacent to the boundaries of the normal samples. The test data consist of 1000 normal and abnormal samples. The normal samples contain Gaussian noise with a variance of 0.3 in the big moon and small moon patterns, and the abnormal samples are generated from a uniform distribution. The models for comparison use the same encoding network from the pretrained autoencoder. The encoding network consists of two hidden layers of 100 nodes, followed by ELU activations, and represents a two-dimensional input in a two-dimensional latent space. For autoencoder pretraining, we employ the above architecture for the encoding network and then construct the decoding network symmetrically. We set η = 1 for the Deep SAD, and k = 100 and ε = 0.0001 for the SAD-PL. We use α = 0.04 obtained from soft labeling training. For all models, we set λ = 10⁻⁶ and use the Adam optimizer with a learning rate of 10⁻⁵. In addition, we train all models except the SAD-PL for 300 epochs. Figure 2 shows the decision boundaries with the training data and the test AUC (area under the receiver operating characteristic curve) of the anomaly detection models. The decision boundaries in Figure 2 are represented with an upper bound on 10% of the anomaly score normalized via min-max scaling. The anomaly score is computed as ‖φ(x) − c‖₂ for the Deep SAD and the SAD-PL and as ‖φ(x)‖₂ for the HSC.
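A toy dataset along these lines can be generated as follows. This sketch approximates the "big moon and small moon" patterns with scikit-learn's equal-sized two-moons generator and places the abnormal samples at illustrative positions, so it reproduces the spirit of the setup rather than the exact data.

```python
import numpy as np
from sklearn.datasets import make_moons

rng = np.random.default_rng(0)

# Normal training samples: 10000 total, 10% labeled (5% abnormal) and
# 90% unlabeled (1% abnormal). make_moons' noise argument is a standard
# deviation, so variances 0.2 and 0.25 become sqrt(0.2) and sqrt(0.25).
X_lab, _ = make_moons(n_samples=950, noise=np.sqrt(0.2), random_state=0)
X_unlab, _ = make_moons(n_samples=8910, noise=np.sqrt(0.25), random_state=1)

# Abnormal samples: well separated for the labeled data, adjacent to the
# normal boundary for the unlabeled data (illustrative placements).
X_lab_abn = rng.uniform(-3.0, 4.0, size=(50, 2))
X_unlab_abn = X_unlab[:90] + rng.normal(0.0, 0.1, size=(90, 2))
```

The sample counts follow the stated split: 1000 labeled samples (950 normal, 50 abnormal) and 9000 unlabeled samples (8910 normal, 90 abnormal).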
In (a) and (c) of Figure 2, the Deep SAD and HSC create decision boundaries along the perimeters of the big moon and small moon patterns because the unlabeled data are assumed to be normal. Therefore, the abnormal samples close to the normal samples in the unlabeled data fall within the decision boundaries, and the boundaries become wider in regions where no labeled abnormal samples exist. In contrast, the SAD-PL produces tight decision boundaries through complementary learning with the labeling probability of the unlabeled data in (e) of Figure 2. In particular, the SAD-PL has higher anomaly scores than the existing methods in the regions separated inside the big moon and small moon patterns. In (b) of Figure 2, the supervised Deep SAD is trained with only the normal samples in the unlabeled data, since labels are not used on the unlabeled term in the objective function. For this reason, the supervised Deep SAD represents compact decision boundaries compared to the semi-supervised Deep SAD. However, the supervised Deep SAD, which learns relatively few abnormal samples, creates wider decision boundaries than the SAD-PL. (d) and (f) in Figure 2 show the decision boundaries and test AUCs for the supervised HSC and SAD-PL learning the unlabeled data with labels. The supervised HSC and SAD-PL have extremely tight decision boundaries and high AUCs of 99.25% and 99.28%, respectively. These results show improved AUCs of more than 4.02% compared to the semi-supervised HSC, but only up to a 0.62% difference in AUC compared to the semi-supervised SAD-PL. Figure 3 shows the training AUCs of the comparative models according to epoch and the Δℓ(t) in the SAD-PL learning. In Figure 3, the HSC, which learns distance based on a radial basis function, achieves higher learning efficiency than the Deep SAD.
However, the HSC and Deep SAD, which assume the unlabeled data are normal, show slower learning caused by adversarial learning between the abnormal samples in the labeled and unlabeled data. In contrast, the SAD-PL achieves high learning efficiency with complementary learning based on probabilistic labeling of the unlabeled data. We can also predict the completion of learning, as Δℓ(t) converges to zero as the epoch increases.

B. CHARACTERISTICS OF PROBABILISTIC LABELING
The SAD-PL sets the labeling probability on unlabeled data with the missing probability α of the labeled data via the NP criterion. It determines τ by Equation (1) and sets the labeling probability by (3). Typically, unlabeled data consist of a larger number of samples with more varied statistics, such as variance, than the labeled data. Therefore, the NP condition may not be met on the unlabeled data for the τ set on the labeled data. The proposed algorithm repeats the probabilistic labeling over the learning iterations so that the normal samples in the labeled and unlabeled data converge to a similar probability distribution of LOF scores. In this learning process, the missing probability for the normal unlabeled samples converges to the predefined α. Figure 4 shows histograms of LOF scores for the normal samples over the training iterations. In (a) and (b) of Figure 4, the probability distributions of LOF scores for the normal samples in the labeled and unlabeled data become similar with each iteration of training. As a result, the missing probability for the normal unlabeled samples varies from 20.25% to 4.02%, close to the predefined α = 0.04. Figure 5 shows that the labeling accuracy for the unlabeled data, which is mostly composed of normal samples, reaches approximately 96%.
To verify the soft probabilistic labeling, we compute the difference in LOF scores Δ(q) according to epoch. Figure 6 shows three Δ(q)s for three values of α(q), along with the test AUCs. As shown in Figure 6, we can observe that a larger Δ(q) corresponds to a larger AUC; thus, we can determine a suitable α for the unlabeled data via the proposed soft probabilistic labeling procedure.

III. EXPERIMENTS
We evaluate the SAD-PL on the well-known MNIST [26], CIFAR10 [27], and MNIST-C [28] datasets. The SAD-PL is compared with methods based on one-class classification. We present results from the unsupervised methods of SVDD [15] and Deep SVDD [16] and the semi-supervised methods of Deep SAD [21] and HSC [23]. We implement the semi-supervised HSC by replacing the OE data with the labeled data. We also present the results of the supervised Deep SAD, HSC, and SAD-PL by using the labeling information of the unlabeled data. The supervised Deep SAD learns only the normal samples in the unlabeled data since labels are not used on the unlabeled term in the objective function. We run all experiments for ν ∈ {0.1, 0.25, 0.5} of the SVDD with a Gaussian kernel and show the corresponding results. The deep models use the same encoding network structure as in the pretrained autoencoder. We employ LeNet-type convolutional neural networks (CNNs), where each convolutional module consists of a convolutional layer followed by leaky ReLU activations and 2×2 max-pooling. In the MNIST and MNIST-C experiments, we employ a CNN with two modules, 8×(5×5) filters followed by 4×(5×5) filters, and a final dense layer of 32 units. In the CIFAR10 experiments, we employ a CNN with three modules, 32×(5×5), 64×(5×5), and 128×(5×5) filters, followed by a final dense layer of 128 units. For the pretraining autoencoder, we employ identical encoding networks and then construct the decoding networks symmetrically, where we replace max-pooling with simple upsampling and convolutions with deconvolutions. We use a batch size of 200 and set λ = 10⁻⁶. We also use the Adam optimizer with a learning rate of 10⁻⁵. For the experiments on Deep SVDD, we use the one-class Deep SVDD model [16]. We run all experiments for η ∈ {0.01, 0.1, 1, 10, 100} of the Deep SAD and show the best results. The SAD-PL is evaluated with two separate hard and soft labelings. We set k = 200 and complete learning when Δℓ(t) is less than ε = 0.001.
In soft labeling, we set Q = 101 by varying α from 0 to 0.2 with an interval of 0.002 and select the α(q) for which Δ(q) is maximum at epoch t = 10. We train the deep models for 300 epochs. The image data used in the experiments are normalized through min-max scaling.
We use the typical one-vs-rest evaluation method on the MNIST and CIFAR10 datasets [29]. On MNIST and CIFAR10, we set each of the ten classes in turn to be the normal class and let the remaining nine classes represent anomalies. We use the original training and test data. In the training data, we constitute most of the data from the normal class and replace a small amount with data from the abnormal classes according to the experimental scenario. We also divide the training data into labeled and unlabeled data, while drawing the abnormal samples in the labeled data from a single anomalous class. The abnormal samples in the unlabeled data, however, contain samples of all anomalous classes equally. The anomalous class for the abnormal samples in the labeled data is randomly determined in each experiment. This gives training set sizes of approximately 6000 for MNIST and 5000 for CIFAR10. Both test sets have 10000 samples, including samples from the nine anomalous classes for each setup. On MNIST-C, we set the original images to be normal and the corrupted images to be abnormal. We use the pre-configured training and test data. We likewise make the training data mostly normal samples with a few abnormal samples according to the scenario. The training data are divided into labeled and unlabeled data. We organize the abnormal samples in the labeled data using corrupted images of the same type, whereas the abnormal samples in the unlabeled data include all types of corrupted images equally. The type of corrupted image for the abnormal samples in the labeled data is randomly determined in each experiment. This configuration gives a training set size of approximately 60000 and a test set size of 160000 for MNIST-C.
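The one-vs-rest training pool can be assembled with a small helper. The function below is an illustrative sketch (the name and exact mixing rule are assumptions): it keeps one class as normal and mixes in anomalies from the remaining classes at a target ratio.

```python
import numpy as np

def build_training_pool(y, normal_class, anomaly_ratio=0.01, seed=0):
    """One-vs-rest setup: one class is normal, the rest are anomalies.
    Returns indices of a pool that is mostly normal with roughly
    `anomaly_ratio` anomalies drawn at random from the remaining classes."""
    rng = np.random.default_rng(seed)
    normal_idx = np.flatnonzero(y == normal_class)
    abnormal_idx = np.flatnonzero(y != normal_class)
    # Number of anomalies so that they make up anomaly_ratio of the pool.
    n_abn = int(round(anomaly_ratio * len(normal_idx) / (1.0 - anomaly_ratio)))
    chosen = rng.choice(abnormal_idx, size=n_abn, replace=False)
    pool = np.concatenate([normal_idx, chosen])
    rng.shuffle(pool)
    return pool
```

The resulting pool can then be split into labeled and unlabeled portions according to the scenario.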
We use 5% of the training data as labeled data and 95% as unlabeled data. We constitute 98% normal and 2% abnormal samples in the labeled data, along with 99% normal and 1% abnormal samples in the unlabeled data. We present the evaluation results as AUCs averaged over 30 runs. Table 1 shows the evaluation results on the MNIST dataset. In Table 1, SAD-PL (Hard) and SAD-PL (Soft) indicate the evaluation results for hard and soft labeling, respectively. The SAD-PL achieves better performance than the existing unsupervised and semi-supervised anomaly detection models, with an average AUC of at least 93.65%. The SAD-PL also shows an average AUC that is at least 2.69% higher than that of the supervised Deep SAD by training on the abnormal samples in the unlabeled data. The semi-supervised SAD-PL has an average AUC improved by 1.97% via soft labeling, which is only 3.86% lower than that of the supervised SAD-PL. Additionally, in the experiment in which digit '0' is the normal class on MNIST, the SAD-PL is trained with α = 0.04 determined by soft labeling. Figure 7 shows the labeling accuracy and Δℓ(t) over the iterations of SAD-PL training with soft labeling. As shown in Figure 7, the labeling accuracy reaches approximately 96%, corresponding to α = 0.04, and Δℓ(t) converges to 0.1% or less as the epoch increases. Tables 2 and 3 show the evaluation results of the anomaly detection models on CIFAR10 and MNIST-C, respectively. In Table 2, the proposed SAD-PL achieves better detection performance than the existing unsupervised and semi-supervised methods, with an average AUC of at least 64.37%. The SAD-PL with hard labeling also shows an average AUC that is 1.69% higher than that of the supervised Deep SAD. We can see that the SAD-PL with soft labeling has a 5.22% improvement in the average AUC compared to the SAD-PL with hard labeling. This result represents an average AUC that is only 2.43% lower than that of the supervised SAD-PL.
In Table 3, the SAD-PL similarly achieves higher performance than the existing unsupervised and semi-supervised models, with an average AUC of at least 90.05% in the experiment using the MNIST-C dataset. The SAD-PL also shows an average AUC that is at least 5.22% higher than that of the supervised Deep SAD. The proposed method with soft labeling has a 2.38% improvement in the average AUC compared to the SAD-PL with hard labeling. Next, we investigate the effect of including labeled data during training on the MNIST, CIFAR10, and MNIST-C datasets by increasing the ratio of labeled training data from 0% to 10% and presenting AUCs averaged over 30 runs. We compute the average AUCs according to the one-vs-rest evaluation method on the MNIST and CIFAR10 datasets. The ratios of abnormal samples in the labeled and unlabeled data are maintained at 5% and 1%, as in the previous experiments. Figure 8 shows the variation in the average AUCs of the semi-supervised models according to the ratio of labeled data during training. The HSC, a classification model, shows a large improvement in the average AUC as the labeled data increase. In particular, the HSC presents a higher average AUC than the Deep SAD when the ratio of labeled data is 5% or more. The Deep SAD, a regression model trained with a distance loss, shows robust performance even with a small amount of labeled data. However, the SAD-PL, having both regression and classification properties by learning with complementary objectives, shows a high average AUC compared to the existing HSC and Deep SAD. The SAD-PL with hard labeling has an average AUC similar to that of the HSC as the ratio of labeled data increases. Note that the SAD-PL with soft labeling maintains a remarkably high average AUC, even with a high proportion of labeled data during training.
Similar to the experiment above, we investigate the effect of including abnormal samples in the unlabeled data during training with the MNIST, CIFAR10, and MNIST-C datasets. To do this, we increase the anomaly ratio, which is the proportion of abnormal samples in the unlabeled training data, from 0% to 10% and report AUCs averaged over 30 runs. We again use the one-vs-rest evaluation method on the MNIST and CIFAR10 datasets. For this experiment, we keep the labeled data at 5% of the training data with 98% normal and 2% abnormal samples. Figure 9 shows the variation in the average AUCs of the unsupervised and semi-supervised models according to the anomaly ratio in the unlabeled data during training. In Figure 9, the unsupervised and semi-supervised approaches, which assume that the unlabeled data consist of only normal samples, show performance degradation in the average AUCs as the anomaly ratio in the unlabeled data increases. Nevertheless, the semi-supervised models achieve higher average AUCs than the unsupervised models by learning the abnormal samples in the labeled data. In contrast, the SAD-PL shows improved average AUCs as the anomaly ratio in the unlabeled data increases, owing to the probabilistic labeling. This performance variation appears in both hard and soft labeling.

IV. CONCLUSION
Unlike existing semi-supervised anomaly detection algorithms, which are trained by assuming that most of the samples in unlabeled data are normal, we propose the SAD-PL, which can be applied when abnormal samples are included in the unlabeled data. The proposed SAD-PL uses LOF scores obtained from both labeled and unlabeled data and then estimates the labeling probability on the unlabeled data by using those LOF scores. Because of the probabilistic labeling and the complementary objective function, the SAD-PL has both regression and classification properties. Through experiments, we show that the SAD-PL attains higher average AUCs, displays tighter decision boundaries, and achieves higher learning efficiency than the existing algorithms. Additionally, the SAD-PL shows improved average AUCs as the abnormal data ratio in the unlabeled data increases, whereas the existing algorithms show performance degradation. Therefore, the SAD-PL is a good candidate for providing stable detection performance, regardless of the existence of abnormal samples in unlabeled data.