An Active Learning Method Based on Variational Autoencoder and DBSCAN Clustering

Active learning aims to sample the most informative data from the unlabeled pool, and diverse clustering methods have been applied to it. However, distance-based clustering methods usually cannot perform well in high dimensions and may even fail outright. In this paper, we propose a new active learning method that combines a variational autoencoder (VAE) with density-based spatial clustering of applications with noise (DBSCAN). It overcomes the difficulty of representing distances in high dimensions and avoids the distance concentration phenomenon that the computational learning literature has observed for high-dimensional p-norms. Finally, we compare our method with four common active learning methods and with two other clustering algorithms combined with VAE on three datasets. The results demonstrate that our approach achieves competitive performance, and it is a new batch-mode active learning algorithm designed for neural networks with a relatively small query batch size.


Introduction
In practical settings, the amount of labeled data is relatively small and the vast majority of data is unlabeled. Since the annotation budget is limited, it is often unrealistic to spend large amounts of human effort and money annotating unlabeled data. Active learning [1] grew out of this problem and is a field that tries to address the difficulties of data labeling. It assumes that collecting data is relatively easy but that the labeling process is costly. It tackles the question of which samples we should label to obtain the highest improvement in test accuracy under a fixed labeling budget. Existing active learning algorithms mainly fall into two categories: query-synthesizing [2][3][4] and query-acquiring [5]. Query-synthesizing approaches use generative models to generate informative samples, whereas query-acquiring algorithms use different sampling strategies to select the most informative samples.
In this paper, we mainly focus on query-acquiring methods. One family of query-acquiring algorithms is the uncertainty-based methods [6,7]. Settles [8] described how the classifier assigns every unlabeled sample a probability score representing the uncertainty about its class, and the data with the highest uncertainty are then chosen. Lewis and Gale [9] argued that uncertainty-based methods perform well on a large and diverse set of datasets. Gal et al. [10] proposed a Bayesian active learning framework in which Bayesian neural networks [11] are used to estimate uncertainty. Later, Gissin and Shamir proposed a discriminative active learning method [12] in which an uncertainty idea is also used to sample the unlabeled data with the top-K highest scores when the batch size is relatively large. The other type of query-acquiring method is the representation-based approach, which selects a few examples by increasing diversity within a given batch. Sener and Savarese [13] and Jain and Grauman [14] adopted this idea in their experiments. However, distance-based representation methods such as Core Set appear to be ineffective for high-dimensional data, because of the distance concentration phenomenon observed in the computational learning literature for p-norms in high dimensions [15].
To address the difficulty of high-dimensional distance representation, Sinha et al. proposed a variational adversarial method [5] that learns a latent space using a VAE [16,17]. The VAE is regarded as an effective representation learning method for high-dimensional data and has been proved effective in practice. Therefore, we adopt a VAE in our approach to resolve the difficulty of distance representation, especially for digital images [18]. The VAE plays a central role in our method: it maps high-dimensional data to a low-dimensional representation in latent space, preventing the p-norm distance concentration phenomenon of high-dimensional data. Among common clustering approaches [19][20][21][22], DBSCAN [23] is good at identifying noise and at discovering arbitrarily shaped clusters without knowing the number of clusters in advance. Therefore, we propose a new active learning strategy that combines VAE and DBSCAN clustering to exploit both of their advantages. Moreover, excellent active learning models are usually evaluated with large query batch sizes, which may overlook performance with a relatively small query batch size on small-sample problems. We therefore conduct our experiments with a relatively small query batch size to examine this setting. The rest of this paper is organized as follows. Section 2 illustrates the problem setting and briefly defines the related methods. Section 3 then introduces our proposed model and algorithms. Subsequently, the experimental results with a relatively small query batch size are provided in Section 4. Finally, we draw conclusions in Section 5.

Problem Definition.
We define the problem of active learning formally as follows. Given a labeled pool (X_L, Y_L) and a much larger unlabeled pool X_U, we aim to sample the most informative unlabeled data x_U from X_U by iteratively querying within a fixed sampling budget, in order to train the most label-efficient model for an active learning task. During this process, n unlabeled samples are selected by an acquisition function and annotated by the oracle. This process is repeated until a certain stopping criterion is reached, such as a desired number of samples or a target test accuracy.
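The iterative setting above can be sketched as a generic pool-based loop. The names `acquire`, `train`, and `oracle_label` below are hypothetical stand-ins for the acquisition function, the task learner, and the annotator; this is a minimal sketch, not the implementation used in the experiments.

```python
import random

def active_learning_loop(labeled, unlabeled, oracle_label, train, acquire,
                         query_size, budget):
    """Generic pool-based active learning loop (sketch).

    labeled:   list of (x, y) pairs
    unlabeled: list of x
    oracle_label(x) -> y        annotation by the oracle
    train(labeled) -> model     task learner
    acquire(model, pool, k)     returns the k most informative samples
    """
    model = train(labeled)
    spent = 0
    while spent < budget and unlabeled:
        query = acquire(model, unlabeled, min(query_size, budget - spent))
        for x in query:
            unlabeled.remove(x)
            labeled.append((x, oracle_label(x)))
        spent += len(query)
        model = train(labeled)          # retrain after each query round
    return model, labeled

# Toy run: random acquisition and a placeholder "model" (just the pool size).
random.seed(0)
model, labeled = active_learning_loop(
    labeled=[(0, 0)],
    unlabeled=list(range(1, 20)),
    oracle_label=lambda x: x % 2,
    train=lambda data: len(data),
    acquire=lambda m, pool, k: random.sample(pool, k),
    query_size=5, budget=10)
```

With a budget of 10 and a query batch size of 5, the loop performs two query rounds, growing the labeled pool from 1 to 11 samples.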

Related Methods.
With respect to the VAE, the parameters are trained with two loss functions, given N training samples. One is the reconstruction loss L_1, which forces the reconstructed sample x̂_i to match the original input sample x_i. Here, we use the cross entropy as the measure:

L_1 = -Σ_{i=1}^{N} [x_i log x̂_i + (1 - x_i) log(1 - x̂_i)].

In other words, we aim to make x̂_i as close as possible to x_i itself, so that decoding can retrieve as much of the original information as possible. The other is the regularization loss L_2, which helps to learn a well-structured latent space and reduces overfitting on the training data:

L_2 = -(1/2) Σ_{i=1}^{N} (1 + log σ_i^2 - μ_i^2 - σ_i^2).

In the formula above, the mean μ_i and the variance σ_i^2 of each input are computed by an encoder. The encoder thus learns a low-dimensional latent space for the underlying distribution N(μ_i, σ_i^2) using a Gaussian prior. Therefore, we obtain the total objective function for the VAE:

L_VAE = L_1 + L_2.

We then minimize the total loss L_VAE, and an efficient variational autoencoder model is obtained after rounds of optimization. It is worth mentioning that the reparameterization trick is adopted: a value ε is sampled from a standard normal distribution N(0, 1), and the low-dimensional data z_i can then be sampled in the latent space as

z_i = μ_i + σ_i ⊙ ε,

in which ε is a small random tensor and ⊙ is the element-wise product.
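The two loss terms and the reparameterization trick can be sketched numerically as follows. This is a minimal NumPy illustration of the standard formulas, not the training code used in the paper; the function names are ours.

```python
import numpy as np

def vae_losses(x, x_hat, mu, log_var):
    """Per-sample VAE objective (sketch): binary cross-entropy
    reconstruction loss L1 plus the KL regularizer L2 against N(0, 1)."""
    eps = 1e-7                                   # numerical safety for log
    l1 = -np.sum(x * np.log(x_hat + eps) + (1 - x) * np.log(1 - x_hat + eps))
    # Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dims.
    l2 = -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))
    return l1 + l2

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, 1), so sampling stays
    differentiable with respect to the encoder outputs."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
mu, log_var = np.zeros(2), np.zeros(2)           # a standard normal latent
z = reparameterize(mu, log_var, rng)
loss = vae_losses(np.array([0.0, 1.0]), np.array([0.1, 0.9]), mu, log_var)
```

Note that with mu = 0 and log_var = 0 the KL term L_2 vanishes, so the total loss reduces to the reconstruction cross entropy alone.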
In our approach, clusters in DBSCAN are defined as the largest sets of points connected by density. A region with sufficient density is grouped into a cluster, and points that do not belong to any cluster are called noise. The basic process of DBSCAN is shown in Figure 1.
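The density-based procedure can be rendered as a compact pure-Python sketch. The parameter names `eps` and `min_pts` correspond to the usual DBSCAN neighborhood radius and density threshold; this quadratic-time version is for illustration only, not the implementation used in the experiments.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN (sketch): returns one cluster label per point,
    with -1 marking noise. O(n^2); for illustration only."""
    def neighbors(i):
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:          # not a core point: noise (for now)
            labels[i] = -1
            continue
        cluster += 1                      # start a new cluster at core point i
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:           # border point previously marked noise
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_seeds = neighbors(j)
            if len(j_seeds) >= min_pts:   # j is core: expand through it
                queue.extend(j_seeds)
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(pts, eps=1.5, min_pts=3)  # two dense clusters, one noise point
```

On this toy input the two dense triples form clusters 0 and 1, while the isolated point at (50, 50) is labeled -1 (noise), matching the description above.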

e Combination of VAE and DBSCAN Clustering.
In this paper, we propose a new active learning method based on VAE and DBSCAN clustering with a relatively small query batch size. We motivate our method with a simple idea. First, the VAE learns a valid low-dimensional latent feature space for the underlying distribution, using a Gaussian prior, from the labeled and unlabeled pools; this space is a mixture of the latent features. Then we adopt the density-reachable clustering method DBSCAN to remove noise from the initial clusters, and we sample the most valuable unlabeled data in high-density regions across different clusters of the latent space. The framework of our model is shown in Figure 2.
In detail, the VAE model in our experiment is composed of a convolutional neural network and a deconvolutional neural network. The convolutional one, named the encoder, consists of four convolutional layers, one flatten layer, and three fully connected layers, in order. The deconvolutional one, named the decoder, is relatively simple, consisting of a fully connected layer and two convolutional layers. We update the VAE by stochastic gradient descent. After iterating and updating, we obtain the trained parameters θ_VAE and thus the trained VAE model. In the end, the VAE has efficiently learned a two-dimensional latent feature space. The low-dimensional feature vectors learned by the VAE serve as the input data for the subsequent clustering methods. In DBSCAN, we start sampling unlabeled points in the clustered classes after an initial unlabeled core object is computed. Note that we remove the noise in the first step. Supposing the needed amount of unlabeled data is C, the total number of unlabeled density-reachable points across all clusters found by DBSCAN is required to be as close to C as possible.
The purpose is to ensure that as many different types of high-density unlabeled data as possible are retrieved. After that, the C corresponding original unlabeled samples are selected by the acquisition function and annotated by the oracle. We then add them to the labeled pool and use the task learner to measure the mean accuracy of our proposed model.
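The selection step above can be sketched as follows: drop the noise points, then draw queries round-robin across the discovered clusters so that every high-density region of the latent space is represented. The function and variable names are illustrative, not taken from the paper's code.

```python
def select_queries(latent_labels, budget):
    """Given DBSCAN labels over latent points (index -> label, -1 = noise),
    pick `budget` point indices spread evenly across the clusters."""
    clusters = {}
    for idx, lab in enumerate(latent_labels):
        if lab != -1:                       # step 1: remove noise
            clusters.setdefault(lab, []).append(idx)
    # Step 2: round-robin over clusters so the batch stays diverse.
    picked, round_i = [], 0
    while len(picked) < budget and any(round_i < len(m)
                                       for m in clusters.values()):
        for members in clusters.values():
            if round_i < len(members) and len(picked) < budget:
                picked.append(members[round_i])
        round_i += 1
    return picked

# Three clusters (0, 1, 2) and two noise points (-1); request 4 queries.
labels = [0, 0, 1, -1, 1, 0, 2, -1, 2]
queries = select_queries(labels, budget=4)
```

The selected indices would then be mapped back to the corresponding original high-dimensional samples and sent to the oracle for annotation.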

Algorithms.
For clarity, we describe our method in Algorithms 1 and 2.

Dataset and Task Module.
We evaluated our method on three typical image datasets: MNIST [24,25], Fashion-MNIST, and CIFAR-10. The first, MNIST, contains 60,000/10,000 (train/test) handwritten digit images (0-9) with a resolution of 28 × 28. Fashion-MNIST is an alternative to MNIST of the same size, with frontal images of items from 10 categories such as shirts, pants, sandals, and bags. Its images are also single-channel with the same size of 28 × 28.
CIFAR-10 contains 10 categories of images such as airplanes, automobiles, birds, and cats. Unlike the two datasets above, images in CIFAR-10 are 3-channel RGB color images with a size of 32 × 32. Not only is the noise large, but the proportions and characteristics of the objects also vary, which makes recognition considerably harder.
We used the classic LeNet architecture [26] as our task module for MNIST and Fashion-MNIST, while for the CIFAR-10 task we used VGG-16 (Simonyan and Zisserman, 2014). We also took a simple convolutional network similar to the LeNet architecture as the second task module when comparing the different clustering-based active learning algorithms.
This module consists of three convolutional layers, two max-pooling layers, a flatten layer, and two fully connected layers, in order.

Baseline Algorithms.
To demonstrate the effectiveness of our method, we took the following four common active learning methods, together with random sampling, as the baseline algorithms. They are briefly described as follows:
(i) Random: the query batch is chosen uniformly at random.
(ii) Uncertainty: uncertainty sampling with minimal top confidence.
(iii) Core Set: in every query round, sample the unlabeled data that are farthest from the labeled set and then add them to the labeled pool.
(iv) EGL: estimated gradient length.
(v) Bayesian: Bayesian uncertainty sampling with minimal top confidence.
Furthermore, in order to compare the performance of different clustering algorithms [27] in the active learning task, we replaced DBSCAN with two classic and widely used clustering algorithms, K-means [28] and Mean shift [29], which are also compared with random sampling. These clustering methods in our experiment operate on a two-dimensional feature dataset learned by the VAE from the three datasets listed above. Their brief descriptions are as follows:
(i) K-means with VAE: after the VAE, we sample the two-dimensional data around the cluster centers according to the K-means clustering rule and a given radius.
(ii) Mean shift with VAE: after the VAE, solve for a vector that moves the center of a circle in the direction of the highest density of the dataset; in each iteration, the average position of the points inside the circle becomes the new center. We choose the points around these centers.
(iii) DBSCAN with VAE: our method, detailed above.
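The Uncertainty baseline (ii) can be sketched as least-confidence sampling over predicted class probabilities: the pool samples whose highest class probability is lowest are queried first. This is a generic sketch of the standard technique, with illustrative names.

```python
import numpy as np

def least_confidence_query(probs, k):
    """Pick the k pool samples whose top class probability is lowest,
    i.e., those the classifier is least confident about.

    probs: (n_samples, n_classes) array of predicted class probabilities.
    """
    top = probs.max(axis=1)                 # confidence in the predicted class
    return np.argsort(top)[:k].tolist()     # ascending: least confident first

# Four pool samples over two classes; samples 1 and 2 are the most uncertain.
probs = np.array([[0.90, 0.10],
                  [0.55, 0.45],
                  [0.60, 0.40],
                  [0.99, 0.01]])
picked = least_confidence_query(probs, k=2)
```

Here the query batch would be samples 1 and 2, whose top confidences (0.55 and 0.60) are the lowest in the pool.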

Implementation Details.
In our VAE model, the learning rate was set to the default α = 1 × 10^-3. We used the Sigmoid function as the activation of the decoder output. With a relatively small query batch size, we observed that after about 400 epochs the learning loss almost stops decreasing and the distribution of the latent feature space learned by the VAE becomes stable. We therefore set the number of training epochs to 400 to prevent overfitting. Figure 3 shows the two-dimensional feature points learned by the VAE at different epochs. After the VAE is trained, a certain number of low-dimensional data samples are extracted by its well-trained encoder. We checked some of the corresponding original images, as shown in Figure 4 (taking Fashion-MNIST as an example).

Algorithm 1 (sampling with VAE and DBSCAN).
Input: (X_L, Y_L), X_U, query batch size k, DBSCAN parameters Eps and MinPts, total budget N, and miniquery size n.
(1) Sample ε ~ N(0, 1).
(2) Sample {z_i}_{i=1}^N from the underlying distribution using equation (4).
(3) Sample (z_1, z_2, ..., z_k) as P randomly from {z_i}_{i=1}^N and shuffle, k < N.
(4) Cluster P by adjusting Eps and MinPts.
(5) Remove noise.
(6) Sample the set C of all density-reachable unlabeled points in all clusters.
(7) for j = 1, 2, ..., N/n do
(8) Sample the needed amount of (x_1, x_2, ..., x_j) ∈ X_j′ from C randomly, and find the corresponding original high-dimensional samples X_j.
(9) ...
To observe the performance of the active learning algorithms with a relatively small query batch size, we sampled 100 low-dimensional data points at random from the latent space learned by the VAE encoder as our clustering dataset for DBSCAN, comprising 10 labeled and 90 unlabeled points, and we set our query batch size to 5, which is relatively small. Finally, we obtained the needed number of unlabeled density-reachable points in the different clusters after removing the noise. Note in particular that we adjust the parameters so that the total number of unlabeled points across all density-reachable clusters is as close as possible to the number of data points needed for annotation.

Results.
In our experiment, we used test accuracy as the metric to evaluate performance. The results are averaged over 20 runs to ensure statistical validity. Using the results in Table 1, we plot the test accuracy of the four different active learning methods on MNIST in Figure 5. The results demonstrate that, on MNIST, our method, DBSCAN with VAE, performs better at a small query batch size than the active learning methods listed above. Figure 5 also shows that some methods perform on par with or worse than random sampling. This may be because the sample size is relatively small and the randomness is strong, which narrows the gap between each sampling method and random sampling. It may also be because some methods are better suited to large query batch sizes.
We also notice that, except for EGL performing better than random sampling at very small batch sizes below 20, most of the algorithms listed above, including random sampling, consistently outperform EGL. A possible explanation for this discrepancy is the architecture used for the different tasks, because EGL uses the gradient with respect to the model parameters as its score function [12]. Among all the methods above, Bayesian performs the poorest, possibly because Bayesian methods are better suited to large training datasets.
Next, we ran a comparison test on CIFAR-10. The results are given in Table 2 and plotted in Figure 6.
As can be seen from Figure 6, on CIFAR-10 Core Set often performs worse than random sampling, again possibly because the advantage of randomness outweighs Core Set when the sample size is small. When the size is below 25, our approach shows little advantage over the other methods, but it clearly leads once the number of labeled samples exceeds 25.
Furthermore, we replaced DBSCAN with K-means and Mean shift to study the effects of different clustering methods. We also combined them with the VAE and ran experiments on MNIST and Fashion-MNIST. Detailed experimental results are listed in Tables 3 and 4, respectively, and their performance is plotted in Figures 7 and 8.
However, we may notice in Figure 7 that K-means with VAE outperforms our method when the number of labeled data reaches 30 on MNIST. DBSCAN and K-means perform similarly until size 35, and DBSCAN performs clearly better after size 40. The reason may be that, at this sample size, many of the samples selected by K-means are closer to the latent characteristics of the labeled data than those selected by DBSCAN. Alternatively, this situation may be caused by the uneven distribution of the samples when the high-dimensional image data are mapped into the two-dimensional latent space. After size 40, DBSCAN may be more sensitive in detecting outlier points in the latent space and preventing them from being clustered, so its performance exceeds that of K-means when the size is greater than 40. Of course, this is a hypothetical explanation; it is difficult to know for certain why K-means starts performing worse after size 40. On Fashion-MNIST in Figure 8, however, compared with random sampling and the other two clustering methods, DBSCAN with VAE behaves better and its accuracy increases more stably. It is reasonably clear from the results that our proposed method performs well overall on MNIST and Fashion-MNIST.
To present our experimental results more clearly, the percentage improvements of the different clustering methods over random sampling are listed in Tables 5 and 6, and their performance is plotted in Figures 9 and 10. On MNIST, our method, DBSCAN with VAE, achieves the best result when the number of labeled samples reaches 20, increasing the accuracy by 11.30% over random sampling. Generally speaking, our approach also performs well at the other labeled-data quantities. On Fashion-MNIST, our approach performs clearly better than the other two clustering methods at every labeled size in our experiment, achieving its best improvement of 6.32% when the labeled sample size reaches 45.
Finally, we computed the average accuracy improvement of each clustering method combined with VAE to see their overall improvement over random sampling, plotted in Figure 11. On MNIST, K-means and Mean shift achieve improvements of 3.26% and 1.86%, respectively, while our method achieves the best performance at 4.48%. On Fashion-MNIST, the three clustering-based active learning methods improve over random sampling by 3.23%, 1.36%, and 4.23%, in the same order. In short, our proposed algorithm, DBSCAN with VAE, shows its superiority over K-means and Mean shift when combined with VAE at a relatively small query batch size on MNIST and Fashion-MNIST.

Conclusion and Discussion
In this paper, we proposed a new active learning method based on VAE and DBSCAN, designed for neural networks with a relatively small query batch size. It overcomes the difficulty of distance representation in high dimensions and avoids the distance concentration phenomenon that the computational learning literature has observed for high-dimensional p-norms. Based on the results on MNIST, Fashion-MNIST, and CIFAR-10, we empirically show that our method achieves competitive performance compared with the four common active learning methods listed above, and that it is also superior to the other two clustering-based active learning methods for image classification when the query batch size is relatively small. Since active learning models are usually evaluated with relatively large query batch sizes, our small-query-batch approach can be regarded as a supplement to previous studies. In addition, our method is simple to implement and can conceptually be extended to other domains, and we see it as a welcome addition to the arsenal of methods in use today. A shortcoming of our method is that manual parameter adjustment is required in DBSCAN to make the total number of unlabeled points in all density-reachable clusters as close as possible to the number of data points needed for annotation. We will continue to study the work presented in this paper in the future.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.