Representation Learning Based on Autoencoder and Deep Adaptive Clustering for Image Clustering



Introduction
Clustering is a basic unsupervised learning problem whose purpose is to divide data into several subgroups. Generally speaking, the elements in the same subgroup are similar to each other and different from the elements of the other subgroups [1]. Image clustering is one of the fundamental high-dimensional data clustering tasks in computer vision and machine learning [2]. Despite decades of development, reliable clustering of image data remains an open problem [3,4].
From the perspective of image representation, there are two types of image clustering methods: traditional image clustering methods and deep clustering methods [5]. The traditional image clustering methods group data on handcrafted features and treat feature extraction and clustering separately [6]. Based on this insight, many attempts have been dedicated to developing feature extraction techniques suitable for clustering, such as manually designed feature descriptors, including Bag of Words (BoW) [7], Histogram of Oriented Gradients (HOG) [8], Principal Component Analysis (PCA) [9], and Scale-Invariant Feature Transform (SIFT) [10]. However, the representation ability of handcrafted features is limited because they do not depend on the distribution of the input data. How to establish an effective feature representation is therefore a crucial problem in image clustering.
In recent years, deep neural networks have been successfully applied in various supervised learning tasks [11,12]. The reason for their success is that they learn a more essential representation of images by constructing a network with multiple hidden layers and training it on a large amount of data [13]. Motivated by the success of deep neural networks in supervised learning, some unsupervised deep learning methods have been applied to image clustering. These methods are called deep clustering [14]. Most previous deep clustering studies are two-stage training schemes based on an autoencoder. First, they train an autoencoder to reduce the dimension of the image data. Then, the encoder acts as a feature extractor, and a clustering algorithm is applied to the extracted features. The two-stage clustering methods have been widely studied and successfully applied in many works [15][16][17][18][19]. The reason for the effectiveness of autoencoder-based methods is that they can preserve some properties of the data by adding prior knowledge to the objective. Thus, the encoder constructs a feature representation that can comprehensively describe the image information. However, since no clustering-driven objective participates in the training of two-stage clustering methods, the learned encoder may not be suitable for clustering.
Later, one-stage clustering methods that jointly accomplish feature transformation and clustering came into being. Deep adaptive image clustering (DAC) is a typical one-stage image clustering algorithm [20]. It defines an effective objective and proposes a self-learning scheme to realize image clustering. The defined objective function is used to update the parameters of a convolutional network by selecting highly confident image pairs, and the cluster assignment is integrated into the classification labels. However, there are two crucial factors that affect the stability and effectiveness of the DAC algorithm. On the one hand, the initialization of the convolutional network is an important factor affecting the performance of DAC. On the other hand, as the training of DAC proceeds, the preservation of the local structure of the representation cannot be guaranteed. Thus, the image representation in the distorted feature space will hurt the clustering performance.
To overcome the problems of DAC, we present an image clustering representation learning method based on an autoencoder (AE) [21] and deep adaptive image clustering (DAC) [20]. Specifically, to obtain the essential features of the image and provide initial parameters for DAC, we incorporate a fully convolutional autoencoder into the DAC algorithm. As a clustering algorithm, DAC helps to train the autoencoder to obtain clustering-friendly features, while the autoencoder guarantees that the feature space is not distorted. The proposed method can learn an image representation suitable for clustering and simultaneously find the cluster label of each image. Extensive experiments verify the effectiveness of the proposed algorithm.
The main contributions of this paper can be summarized in three aspects:
(1) We propose a novel system based on an AE and DAC and use it to learn an informative image representation.
(2) Since the AE and DAC can complement each other, we use the learned image representation to realize image clustering.
(3) We conduct extensive experiments on four real-world datasets to verify the effectiveness of the proposed algorithm.
The rest of the paper is organized as follows. Section 2 briefly introduces the related work. Section 3 proposes the clustering algorithm and gives some details of the algorithm. Section 4 provides a series of experiments to verify the effectiveness of the proposed algorithm. The last section briefly concludes the paper.

Deep Clustering.
Deep clustering refers to clustering with algorithms based on deep neural networks [22]. Existing deep clustering algorithms mainly seek effective ways to combine deep feature learning with traditional clustering methods, and they fall into two categories: (I) two-stage methods that apply clustering after a representation is learned; (II) one-stage methods that jointly optimize the representation learning and clustering [23].
Two-stage methods usually train an autoencoder at the first stage. Then, the encoder acts as a feature extractor, and a clustering algorithm is used to obtain the clustering results. The autoencoder (AE) is a classical feature learning method that is based on deep neural networks and an image reconstruction loss function [18]. Recently, many image clustering algorithms attempt to regularize the representation learning of the autoencoder with the loss function of a traditional clustering algorithm. For instance, Deep Embedding Clustering (DEC) utilizes KL-divergence as a loss function to measure the distance between the distribution of the image features and the target distribution [24]. Ghasedi Dizaji et al. propose a Stacked Auto-Encoder (SAE) to learn a deep learning-based latent feature representation and use it to improve classification [18]. Peng et al. propose a novel clustering method by minimizing the discrepancy between pairwise sample assignments for each data point [25]. Gaussian Mixture Variational Autoencoders (GMVAE) is a representative generation-based clustering algorithm that incorporates a Gaussian distribution into the variational autoencoder [16]. The advantage of the AE is that it keeps the essential information of the features during the training of the clustering algorithm, so the learned representation is more suitable for clustering tasks. Thus, it can avoid degenerate solutions and improve clustering performance. The disadvantage of the two-stage methods is the mismatch between image representation and clustering. Specifically, the clustering algorithm does not participate in representation learning, so the representation is learned without any guidance from the clustering task.
One-stage methods combine image representation learning with clustering. For instance, deep adaptive image clustering (DAC) is a typical one-stage image clustering algorithm [20]. It defines an effective objective and proposes an adaptive mechanism to realize image clustering. Guo et al. propose the Improved Deep Embedded Clustering (IDEC) algorithm to preserve the local structure of the data [26]. IDEC trains the AE and performs self-training simultaneously to realize local feature preservation. CatGAN uses a generative adversarial network (GAN) with an entropy-based loss function to realize data clustering [27]. JULE proposes a recurrent framework for joint unsupervised learning of deep representations and image clusters [17]. The effectiveness of these learning schemes has been demonstrated in theory and in practical experiments. However, there are two crucial factors that affect the stability and effectiveness of these algorithms. On the one hand, the initialization of the convolutional network is an important factor. On the other hand, as training proceeds, the preservation of the local structure of the representation cannot be guaranteed.

Image Clustering Based on AE and DAC
3.1. Autoencoder. An AE is a type of artificial neural network that is used to learn efficient data codings in an unsupervised manner [28]. Generally speaking, the objective of an AE is to extract a feature (encoding) of the input data. In the field of computer vision, it is usually used to learn image representations and reduce image dimension [16,29].
Consider $X := \{x_i\}_{i=1}^{n}$ being a set of images, where $n$ denotes the number of images. An AE reduces the dimension of the images from a high-dimensional space $\mathbb{R}^{D}$ to a low-dimensional space $\mathbb{R}^{d}$ with $d < D$. The embedding of the dataset in $\mathbb{R}^{d}$ is denoted by $Y := \{y_i\}_{i=1}^{n}$, and the function that performs the embedding is denoted by $f_{\theta}$. Thus, $y_i = f_{\theta}(x_i)$, $i = 1, \ldots, n$. To guarantee that the learned representation $Y$ adequately represents the input image information, the following reconstruction loss is used to train the autoencoder network:
$$L_r = \frac{1}{n} \sum_{i=1}^{n} \left\| x_i - g_w\big(f_{\theta}(x_i)\big) \right\|_2^2,$$
where $g_w(\cdot)$ is the decoder that maps the representation to the output. In our algorithm, to extract essential features and preserve the spatial locality of images, we adopt a fully convolutional autoencoder to realize the image feature extraction stream.
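As a minimal illustration of the reconstruction loss described above, the following Python sketch computes the mean squared error between a batch of flattened images and their reconstructions. The function name `reconstruction_loss`, the averaging over the batch, and the toy values are assumptions made for illustration; the encoder and decoder themselves are not modelled here.

```python
def reconstruction_loss(images, reconstructions):
    """Mean squared reconstruction error over a batch of flattened images."""
    if len(images) != len(reconstructions):
        raise ValueError("batch sizes differ")
    total = 0.0
    for x, x_hat in zip(images, reconstructions):
        # Squared Euclidean distance between an image and its reconstruction.
        total += sum((a - b) ** 2 for a, b in zip(x, x_hat))
    # Averaging over the batch is an assumption; a plain sum is also common.
    return total / len(images)


# Toy batch: two "images" flattened to three pixels each (illustrative values).
batch = [[0.0, 0.5, 1.0], [1.0, 1.0, 0.0]]
decoded = [[0.0, 0.5, 1.0], [0.5, 1.0, 0.0]]
loss = reconstruction_loss(batch, decoded)
```

In a real pipeline, `decoded` would be the decoder output for the encoded batch, and the loss would be minimized with respect to the encoder and decoder parameters.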

Deep Adaptive Image Clustering.
DAC is a clustering algorithm that is realized by a convolutional neural network (CNN) and an adaptive training mechanism [20]. It employs some constraints on the classification output and generates a feature for image clustering.
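Let us assume that $x_i$ and $x_j$ are two unlabeled images, and $r_{ij}$ denotes an unknown binary label, where $r_{ij} = 1$ if $x_i$ and $x_j$ belong to the same cluster and $r_{ij} = 0$ otherwise. In this case,
$$s(x_i, x_j; \theta') = f_{\theta'}(x_i) \cdot f_{\theta'}(x_j) = z_i \cdot z_j$$
is the dot product of $z_i$ and $z_j$, and it indicates the similarity, where $f_{\theta'}(\cdot)$ is a classification network. Based on the similarity of the input image features, the binary labels are defined as follows:
$$r_{ij} = \begin{cases} 1, & \text{if } s(x_i, x_j; \theta') \geq u(\xi), \\ 0, & \text{if } s(x_i, x_j; \theta') < l(\xi), \end{cases}$$
where $\xi$ is an adaptive parameter, and $u(\cdot)$ and $l(\cdot)$ are two learnable thresholds. Pairs whose similarity falls between the two thresholds receive no label.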
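The pairwise labeling rule described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function names, the explicit L2 normalization of the feature vectors, and the toy feature values are assumptions, and the thresholds are passed in as plain numbers.

```python
import math


def l2_normalize(vector):
    """Scale a feature vector to unit length so dot products lie in [-1, 1]."""
    norm = math.sqrt(sum(a * a for a in vector))
    return [a / norm for a in vector]


def pairwise_labels(features, upper, lower):
    """Assign 1 (same cluster) or 0 (different cluster) to confident pairs.

    Pairs whose similarity lies between `lower` and `upper` are ambiguous
    and are left out of the returned dictionary.
    """
    z = [l2_normalize(f) for f in features]
    labels = {}
    for i in range(len(z)):
        for j in range(i + 1, len(z)):
            similarity = sum(a * b for a, b in zip(z[i], z[j]))
            if similarity >= upper:
                labels[(i, j)] = 1
            elif similarity < lower:
                labels[(i, j)] = 0
    return labels


# Hypothetical 3-way classification outputs for four images.
features = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.0, 0.1, 0.9], [0.6, 0.4, 0.0]]
labels = pairwise_labels(features, upper=0.95, lower=0.5)
```

In this toy example, images 0 and 1 are confidently labeled as similar, image 2 is labeled as dissimilar to all others, and the pairs (0, 3) and (1, 3) stay unlabeled because their similarity lies between the two thresholds.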
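For network training, the objective function of DAC is defined as follows:
$$\min_{\theta'} E(\theta') = \sum_{i,j} v_{ij} \, L\big(r_{ij}, s(x_i, x_j; \theta')\big),$$
where $s(\cdot)$ denotes the estimated similarity of $x_i$ and $x_j$ under the classification network parameter $\theta'$, and $v_{ij}$ is an indicator coefficient that selects the training samples: $v_{ij} = 1$ means that the pair is selected to train the network, and $v_{ij} = 0$ otherwise. $v_{ij}$ is defined as follows:
$$v_{ij} = \begin{cases} 1, & \text{if } s(x_i, x_j; \theta') \geq u(\xi) \text{ or } s(x_i, x_j; \theta') < l(\xi), \\ 0, & \text{otherwise}. \end{cases}$$
$L(\cdot)$ is the loss function and is defined as follows:
$$L\big(r_{ij}, s(x_i, x_j; \theta')\big) = -r_{ij} \log s(x_i, x_j; \theta') - (1 - r_{ij}) \log\big(1 - s(x_i, x_j; \theta')\big).$$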
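A small Python sketch of this objective is given below, assuming the per-pair loss is the binary cross-entropy used in DAC [20] and that unselected pairs simply contribute nothing. The function name `dac_loss`, the clipping constant `eps`, and the toy similarities are illustrative assumptions.

```python
import math


def dac_loss(similarities, labels, eps=1e-7):
    """Sum of binary cross-entropy terms over the selected image pairs.

    `similarities` maps (i, j) to a similarity in [0, 1]; `labels` maps the
    selected pairs to 0 or 1. Pairs missing from `labels` are skipped, which
    plays the role of an indicator coefficient equal to zero.
    """
    total = 0.0
    for pair, r in labels.items():
        # Clip to avoid log(0) for saturated similarities.
        s = min(max(similarities[pair], eps), 1.0 - eps)
        total += -r * math.log(s) - (1 - r) * math.log(1.0 - s)
    return total


# Hypothetical similarities for three pairs; pair (1, 2) is not selected.
similarities = {(0, 1): 0.99, (0, 2): 0.01, (1, 2): 0.7}
selected = {(0, 1): 1, (0, 2): 0}
loss = dac_loss(similarities, selected)
```

The loss is small here because both selected pairs already agree with their pseudo-labels; a similar pair with low similarity would be penalized heavily.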

Network Architecture.
The network architecture of the proposed clustering algorithm is shown in Figure 1. There are two streams in our network: the autoencoder stream and the DAC stream. The autoencoder stream is realized by several fully convolutional layers, and the DAC stream is composed of the autoencoder's encoder and several fully connected layers. Consider a dataset $D := \{x_i\}_{i=1}^{n}$ with $n$ input samples that need to be clustered. The number of clusters $K$ is known a priori. Let $Z := \{z_i\}_{i=1}^{n}$ be the output of the encoder. Thus, the encoder can be defined as a nonlinear mapping $f_{\theta_1}: x_i \to z_i$, and the decoder is $g_w: z_i \to \hat{x}_i$, where $\theta_1$ and $w$ are the parameters of the encoder and decoder, respectively, and $\hat{x}_i$ denotes the output of the decoder. The output of DAC can be represented as $f_{\theta_1, \theta_2}(x_i)$, which is a classification network, where $\theta_2$ denotes the parameters of the fully connected layers.

Loss Function.
We aim to seek an encoder that makes the extracted feature more suitable for clustering. To this end, the reconstruction loss of the autoencoder is added to the initialization and training process of the DAC algorithm. On the one hand, the reconstruction loss assists the learning of the image representation, as it can learn the essential features of the input images and avoid the distortion of the feature space during the training of DAC. On the other hand, the loss of an autoencoder focuses only on image reconstruction, which loses useful information needed for clustering; the DAC loss can guide it toward a representation better suited to clustering. Thus, we define the complete loss function as a weighted combination of the reconstruction loss $L_r$ and the clustering loss $L_c$, where $\alpha$ is a balance coefficient. The final objective minimizes this complete loss over the network parameters, where $L_r$ is the reconstruction loss of the autoencoder stream and $L_c$ is the objective of the DAC stream.
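To make the role of the balance coefficient concrete, the following Python sketch combines the two stream losses as a weighted sum. The exact weighting scheme is an assumption: the text names a balance coefficient, but which term it multiplies is not stated here, so the sketch exposes one hypothetical weight per term (`recon_weight` and `cluster_weight`). The loss values are also illustrative.

```python
def combined_loss(recon_loss, cluster_loss, recon_weight=1.0, cluster_weight=0.5):
    """Weighted sum of the two stream losses (the weighting scheme is assumed)."""
    return recon_weight * recon_loss + cluster_weight * cluster_loss


# Hypothetical per-batch loss values, chosen only for illustration.
l_r, l_c = 0.125, 0.75

# With the clustering weight set to zero, only the autoencoder is trained,
# which corresponds to a pretraining phase driven by the reconstruction loss.
pretrain_objective = combined_loss(l_r, l_c, cluster_weight=0.0)

# With both weights positive, the two streams are trained jointly.
joint_objective = combined_loss(l_r, l_c)
```

Setting either weight to zero recovers one of the two original objectives, which is why the choice of weights controls how strongly the reconstruction term constrains the features learned for clustering.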

Network Training.
In this section, we present the whole training process of our algorithm. To minimize the loss function proposed in (7), we first set aside the DAC stream and pretrain a convolutional autoencoder by using the loss $L_r$. The trained encoder provides initial parameters for the DAC algorithm. Thus, the DAC algorithm will select more accurately labeled image pairs in the initial stage. Then, we simultaneously train the autoencoder stream and the DAC stream by minimizing (7). The detailed algorithm is formalized as Algorithm 1.
Once the autoencoder is pretrained, it might seem sufficient to obtain a clustering-friendly representation by fine-tuning the autoencoder's encoder within the DAC algorithm. In that scheme, the encoder is connected with a number of fully connected layers to form a classification network, and the labels are calculated by the algorithm proposed in [20]. However, we suppose that this kind of fine-tuning can distort the representation space, which may weaken the expressive power and thereby hurt the clustering results. For this reason, while training the DAC stream, the autoencoder also needs to be trained so that the DAC algorithm continues to obtain highly confident labeled image pairs.

Experiments
In this section, we carry out a series of experiments to verify the effectiveness of our algorithm. All the experiments are carried out in a TensorFlow and Keras environment running on Ubuntu 14.04, with an Intel(R) Core i7-4790 CPU at 3.6 GHz and a Titan X GPU with 12 GB of memory.

Datasets.
In this part, four challenging image datasets, namely Fashion-MNIST, Cifar-10, Cifar-100, and STL-10, are selected to verify the effectiveness of our algorithm. We first briefly introduce the datasets.

Fashion-MNIST.
Fashion-MNIST is a dataset of Zalando's article images, which includes a training set of 60,000 examples and a test set of 10,000 examples [30]. In the Fashion-MNIST dataset, each example is a 28 × 28 grayscale image, associated with a label from 10 classes.

Cifar-10 and Cifar-100.
Cifar-10 contains 50,000 training images and 10,000 test images from 10 classes [31]. Each image has a size of 32 × 32. Cifar-100 is similar to Cifar-10, except that it has 10 times fewer images per class.

STL-10.
The STL-10 dataset is an image dataset used to develop unsupervised feature learning, deep learning, and self-supervised learning algorithms [32]. It is inspired by the Cifar-10 dataset but with some modifications. Its higher resolution (96 × 96) makes it a challenging benchmark for developing more scalable unsupervised learning methods.
In our experiments, the training set and validation set of each dataset are jointly utilized. In particular, the 20 superclasses of the Cifar-100 dataset are considered in all the experiments. We summarize the detailed information of each dataset in Table 1.

Evaluation Metrics.
Three commonly used clustering metrics, namely accuracy (ACC), normalized mutual information (NMI), and adjusted Rand index (ARI), are adopted to evaluate the performance of the clustering algorithms. These metrics reflect the clustering performance from different perspectives. ACC measures the best matching between the clustering labels and the ground truth labels. NMI measures the similarity between pairs of clusters [33]. ARI establishes a baseline by using the expected similarity of all pairwise comparisons between clusters specified by a random model [34].
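As an illustration of how two of these metrics can be computed, the sketch below implements ACC with a brute-force search over one-to-one label mappings and NMI with arithmetic-mean normalization. Both choices are assumptions made for the sketch: the brute-force search is only practical for a handful of clusters (an assignment solver such as the Hungarian algorithm is the usual choice at scale), and other NMI normalizations exist. The label vectors are toy values.

```python
import math
from collections import Counter
from itertools import permutations


def clustering_accuracy(y_true, y_pred):
    """Best accuracy over all one-to-one mappings of cluster ids to class ids."""
    classes = sorted(set(y_true) | set(y_pred))
    counts = Counter(zip(y_pred, y_true))
    best_hits = 0
    for perm in permutations(classes):
        mapping = dict(zip(classes, perm))  # cluster id -> class id
        hits = sum(counts[(p, mapping[p])] for p in classes)
        best_hits = max(best_hits, hits)
    return best_hits / len(y_true)


def entropy(labels):
    """Shannon entropy (natural log) of a label vector."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(y_true, y_pred):
    """Mutual information divided by the mean of the two label entropies."""
    n = len(y_true)
    joint = Counter(zip(y_true, y_pred))
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    mi = sum(
        (c / n) * math.log(c * n / (true_counts[t] * pred_counts[p]))
        for (t, p), c in joint.items()
    )
    denom = entropy(y_true) + entropy(y_pred)
    return 2.0 * mi / denom if denom > 0 else 1.0


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 0]  # cluster ids are arbitrary; one sample is misplaced
acc = clustering_accuracy(y_true, y_pred)
nmi = normalized_mutual_information(y_true, y_pred)
```

Because cluster ids carry no meaning, both metrics are invariant to relabeling the clusters; only the grouping of the samples matters.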

Traditional Image Clustering Methods.
K-means++ and SSC: these methods first use Bag of Words (BoW) to encode the images, and then the image features are clustered to achieve image clustering.

Deep Clustering Methods.
GLWTDN: it first trains an AE model to extract image features and then uses the K-means algorithm to cluster the image features [37].
CILR: it first adopts consistent inference of latent representations (CILR) to generate latent labeled data points of the inputs. Then, CILR is derived to pretrain DNNs by minimizing the distance between latent labeled data points to realize image clustering [15].
GMVAE: it uses the Gaussian mixture model as a prior distribution to improve the traditional variational autoencoder. It uses the improved latent vector as the image representation and then clusters the representation to realize image clustering [16].
CatGAN: it uses a generative adversarial network (GAN) with an entropy-based loss function to realize data clustering [27].
JULE: it proposes a recurrent framework for joint unsupervised learning of deep representations and image clustering [17].
DEC: it first learns image representations from an AE. Then, clusters are obtained by utilizing a typical K-means algorithm [24].
DAC: it formulates image clustering as a binary pairwise classification problem and identifies the pairs of images that should belong to the same cluster [20].

Experimental Settings.
Following the setting in DAC, we set the initial thresholds that construct the highly confident pseudolabels to $u(\xi) = 0.95 - \xi$ and $l(\xi) = 0.5 + 0.1\xi$, respectively. The parameter $\xi$ is initialized to 0. Considering the convergence of our algorithm, we set the learning rate of $\xi$ to $\eta = 0.005$. The balance coefficient is 0.5. For training, we adopt the well-known Adam optimizer with an initial learning rate of 0.0001. In addition, the batch size is set to $b_s = 128$ in all the experiments. The detailed network architecture used for each dataset is shown in Table 3.
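The threshold schedule implied by these settings can be traced with a short Python sketch. The update rule for ξ is an assumption borrowed from our reading of DAC [20], where the adaptive parameter is updated by gradient descent on the gap between the upper and lower thresholds and training stops once the two thresholds meet; the function names are hypothetical.

```python
def upper_threshold(xi):
    """Threshold above which a pair is labeled as belonging to the same cluster."""
    return 0.95 - xi


def lower_threshold(xi):
    """Threshold below which a pair is labeled as belonging to different clusters."""
    return 0.5 + 0.1 * xi


def threshold_schedule(eta=0.005, xi=0.0):
    """List (xi, upper, lower) for every round until the thresholds meet."""
    # Derivative of u(xi) - l(xi) with respect to xi for the linear forms above.
    gap_gradient = -1.0 - 0.1
    schedule = []
    while upper_threshold(xi) > lower_threshold(xi):
        schedule.append((xi, upper_threshold(xi), lower_threshold(xi)))
        # Assumed gradient-descent step on the gap, which increases xi.
        xi -= eta * gap_gradient
    return schedule


schedule = threshold_schedule()
first_round, last_round = schedule[0], schedule[-1]
```

Under this assumed rule, the selection starts strict (only very similar or clearly dissimilar pairs are used) and gradually admits more pairs as the two thresholds approach each other.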
As shown in Table 2, for each dataset, the performance of the deep clustering methods is superior to that of the traditional clustering algorithms. An autoencoder-based method such as AE outperforms the traditional K-means algorithm by a large margin, which demonstrates the potential of the autoencoder in clustering tasks. Furthermore, the proposed method outperforms the other algorithms on all datasets. In particular, the clustering accuracy of our algorithm exceeds all competitive baselines, with margins of 4.79%, 4.47%, 4.95%, and 3.94% on Fashion-MNIST, Cifar-10, Cifar-100, and STL-10, respectively. These results verify the effectiveness of our method in image clustering tasks. Figure 2 shows the confusion matrices of the clustering results for the Cifar-10 and STL-10 datasets. The values along the diagonal represent the percentage of samples correctly classified into the corresponding categories. We can find that the per-category clustering accuracy is balanced and stable for these two datasets. This indicates that our method does not aggregate samples into a few categories and can effectively avoid the degenerate solution problem.

Visualization.
In this part, we use two methods to visualize the clustering results of our algorithm. The first visualization experiment is conducted on the Fashion-MNIST dataset. We randomly sample 10,000 samples of the representation $z_i$ and map them to 2-dimensional vectors by using t-SNE [30]. The experimental results are shown in Figure 3. In Figures 3(a)-3(f), different colors indicate different clusters, and the corresponding clustering accuracies are reported. The experimental results show that the proposed algorithm can effectively improve the separability of the data, which helps to improve the clustering accuracy of the images.
In the second visualization experiment, we qualitatively analyze the clustering results on the Cifar-10 dataset. For each category, we first randomly select an image as the original image. Then, we pick several samples from the same cluster that have the smallest Euclidean distance to the original image. Finally, we pick the incorrectly clustered samples that are closest to the original image. All the picked images are shown in Figure 4, and we mark the incorrect samples with red boxes. From the visualization results, we can find that the successful cases not only depend on texture information but also contain some semantic information of the categories. The failure cases also contain a lot of texture information similar to the source images. This implies that our method captures not only image appearance information but also some abstract image information for image clustering. This is the reason why the proposed method can precisely discover cluster assignments.
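The selection procedure in this experiment can be sketched as a nearest-neighbour query inside one cluster. The sketch below reflects one reading of the procedure and is not the authors' code: it assumes distances are measured between feature vectors, that "incorrect" samples are those assigned to the query's cluster while carrying a different ground-truth label, and it uses the hypothetical name `nearest_in_cluster` with toy data.

```python
import math


def nearest_in_cluster(query, features, clusters, truths, k=2):
    """Return the k nearest correct and k nearest incorrect cluster members.

    A member of the query's cluster counts as correct when its ground-truth
    label matches that of the query, and as incorrect otherwise.
    """
    correct, incorrect = [], []
    for idx, feature in enumerate(features):
        if idx == query or clusters[idx] != clusters[query]:
            continue
        distance = math.dist(features[query], feature)
        bucket = correct if truths[idx] == truths[query] else incorrect
        bucket.append((distance, idx))
    correct.sort()
    incorrect.sort()
    return [i for _, i in correct[:k]], [i for _, i in incorrect[:k]]


# Toy 2-D features: samples 0-3 share a cluster, sample 3 has another true label.
features = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.1], [0.15, 0.05], [5.0, 5.0]]
clusters = [0, 0, 0, 0, 1]
truths = [3, 3, 3, 7, 7]
successes, failures = nearest_in_cluster(0, features, clusters, truths)
```

Inspecting the nearest failures alongside the nearest successes shows which visual cues the representation confuses, since failures that lie close to the query share much of its appearance.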

On the Effect of Number of Clusters.
In this part, we study the effect of the number of clusters on our algorithm and compare the results with the DAC algorithm. For each dataset, we conduct 6 experiments with different numbers of clusters. For the Cifar-10 and STL-10 datasets, the number of clusters varies in the range of [5, 10] at equal intervals. For the Cifar-100 dataset, the number of clusters varies in the range of [10, 20] with an interval of 2. We report the variation curves of the clustering accuracy with the number of clusters in Figure 5. The detailed numerical results are shown in Table 4. As shown in Figure 5 and Table 4, the clustering accuracy decreases gradually as the number of clusters increases. For all datasets, the clustering accuracy of our method is always higher than that of the DAC algorithm for every number of clusters. In addition, the results for the other two metrics also show the superiority of the proposed algorithm. This is because the autoencoder can reduce the

Table note: Most of the results are excerpted from [20]. The best and second-best results are marked in bold and underlined, respectively.
Mathematical Problems in Engineering

On the Effect of the Parameters α and β.
In this experiment, we mainly study the effect of the parameters α and β. The values of (α, β) are selected by a grid search over the region [0.1, 0.9] × [0.1, 0.9] with a step size of 0.1. In Figure 6, we report the clustering accuracies for different (α, β). From the results, we can find that the clustering accuracy is highest when α and β are close to each other. This is mainly because the autoencoder can guarantee the local structure of the image representation and prevent the distortion of the feature space. It also means that the autoencoder can promote the clustering performance of DAC, which explains why our algorithm is effective.
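The grid search described here can be sketched as follows. The `evaluate` callable is a hypothetical stand-in for training the model with a given (α, β) and returning its clustering accuracy; the toy scoring function used below is invented purely so that the sketch runs and does not reflect any reported result.

```python
def make_grid(low=0.1, high=0.9, step=0.1):
    """Evenly spaced parameter values from low to high, rounded to avoid drift."""
    count = int(round((high - low) / step)) + 1
    return [round(low + i * step, 10) for i in range(count)]


def grid_search(evaluate, values):
    """Evaluate every (alpha, beta) pair and return the best pair with its score."""
    best = None
    for alpha in values:
        for beta in values:
            score = evaluate(alpha, beta)
            if best is None or score > best[2]:
                best = (alpha, beta, score)
    return best


def toy_accuracy(alpha, beta):
    """Hypothetical stand-in: a real run would train and evaluate the model."""
    return 1.0 - abs(alpha - beta) - 0.1 * abs(alpha - 0.5)


values = make_grid()
best_alpha, best_beta, best_score = grid_search(toy_accuracy, values)
```

With nine values per parameter, the search covers 81 configurations, each of which would require a full training run in practice.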

Conclusion
In this paper, we present a novel representation learning method and use it to solve the image clustering problem. To generate more informative representations for clustering, we borrow the DAC algorithm and use it to train a fully convolutional autoencoder. The proposed algorithm was evaluated on unsupervised clustering tasks using popular datasets, achieving competitive results compared to the current state of the art. Furthermore, the proposed algorithm may be improved by applying other deep feature extraction models, e.g., the Variational AutoEncoder (VAE) and Generative Adversarial Networks (GANs). Learning the distribution of the image data instead of reconstructing the image is an interesting direction for improving our method. We leave this for future work.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.