A COMPARATIVE ANALYSIS OF UNSUPERVISED AND SEMI-SUPERVISED REPRESENTATION LEARNING FOR REMOTE SENSING IMAGE CATEGORIZATION

This work investigates unsupervised and semi-supervised representation learning methods based on generative adversarial networks for remote sensing scene classification. The work introduces a novel approach, which consists of a semi-supervised extension of a prior unsupervised method known as MARTA-GAN. The proposed approach was compared experimentally with two baselines on two public datasets, UC-MERCED and NWPU-RESISC45. The experiments assessed the performance of each approach under different amounts of labeled data. The impact of fine-tuning was also investigated. In our analysis, the proposed method delivered the best overall accuracy under scarce labeled samples, both in terms of absolute value and in terms of variability across multiple runs.


INTRODUCTION
Over the last decades, much of the effort involved in deploying automatic image classification algorithms has been invested in designing and manually selecting custom features for a target application. In this sense, the use of Bag-of-Visual-Words (BoVW) was one of the first attempts in the field (Yang, Newsam, 2010), followed later by different classifiers like Random Forest (RF) and Support Vector Machines (SVM) (Helber et al., 2017). Recently, Deep Learning (DL) techniques have become the dominant trend in image classification (Simonyan, Zisserman, 2014, Szegedy et al., 2015, Cheng et al., 2018), mainly due to their ability to automatically learn discriminative features directly from data (LeCun et al., 2015, Krizhevsky et al., 2012, Penatti et al., 2015, Nogueira et al., 2017) when labeled samples are abundant.
Although recent years have witnessed an increase of Earth observation data, remote sensing labeled data still falls short of the demands imposed by DL-based techniques, mainly because of the high costs involved in field surveys and the labor-intensive visual interpretation required.
In this sense, transfer learning (Pan, Yang, 2010, Weiss et al., 2016) and unsupervised deep learning techniques, such as Stacked Denoising Autoencoders, Convolutional Autoencoders and Deep Belief Networks (Liang et al., 2017, Romero et al., 2016, Zou et al., 2015), emerged as attractive alternatives. In transfer learning, networks already trained on huge datasets are reused in problems where the labeled data is limited, by fine-tuning certain layers (Nogueira et al., 2017). Unsupervised methods, on the other hand, do not require any labeled data for the learning process.
In the last few years, Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) have attracted the community's attention due to their ability to learn data distributions through an unsupervised two-player min-max game performed by two different networks: a generator and a discriminator.
Considering the power of GANs for unsupervised learning, Lin et al. (2017) proposed a Multiple-Layer Feature-Matching GAN architecture (MARTA-GAN) for feature learning. In short, MARTA-GANs capture latent features from the discriminator network, which can later be used as input to a classifier. This method presented substantial improvements in comparison with other unsupervised feature learning models.
Aiming to exploit cases where few labeled samples are available, Springenberg (2015) proposed to work with semi-supervised GAN (SS-GAN) algorithms. More specifically, that work introduced the categorical generative adversarial networks (CatGANs) for image classification. This model was extended in (Salimans et al., 2016) to improve its convergence. Specifically, they proposed the feature matching term and the mini-batch discrimination concept, among other modifications. Later, the SS-GAN approach was adapted to remote sensing data applications, such as object detection (Chen et al., 2018), pixel-wise PolSAR image classification (Liu et al., 2018) and hyperspectral image classification (He et al., 2017, Zhan et al., 2018). However, despite the efforts of (Salimans et al., 2016), SS-GANs still present some convergence problems, mostly when the number of unlabeled samples is much larger than the number of labeled ones.
Motivated by this scenario, we introduce in this paper a Semi-Supervised Representation Learning GAN (SSRL-GAN), which, although conceptually similar to SS-GANs, presents a different training strategy and adaptations in architecture. In short, SSRL-GANs include an external classifier, allowing the use of binary cross-entropy cost functions for the supervised and unsupervised stages. With these changes, we observed an improvement in the convergence of the model and in the classification performance, mainly when fewer labeled samples were used to train the model.
We further analyze and compare different alternatives for remote sensing image categorization when a limited number of labeled samples is available. First, we take the MARTA-GAN (Lin et al., 2017), an unsupervised learning method, as baseline. Then, we compare it with two semi-supervised approaches: the Semi-Supervised GANs, as presented in (Salimans et al., 2016), and the Semi-Supervised Representation Learning GAN proposed in this work. Additionally, we evaluate how these methods behave when more labeled samples are added to the training set. Finally, we adopt a classic fine-tuning approach, using only labeled data, to investigate whether their performance can still be enhanced.
The rest of this paper is organized as follows. Section 2 briefly describes the fundamentals underlying GANs. A detailed description of each assessed method is the subject of Section 3. The experimental protocol is reported in Section 4, while Section 5 shows the results obtained in the experiments. Finally, Section 6 summarizes the main conclusions and indicates future directions.

GENERATIVE ADVERSARIAL NETWORKS (GANS)
GANs, introduced by (Goodfellow et al., 2014), constitute a class of unsupervised machine learning models composed of two neural networks: the generator, which synthesizes realistic images, and the discriminator, which tries to correctly discern between synthesized and real images.
A min-max game procedure is used to train these neural networks. The Generator learns a function G that maps samples of a known random distribution p(z) into samples of a distribution p_model(x), which the Discriminator D can hardly distinguish from samples of a given data distribution p_data(x).
The Discriminator, in turn, is trained to learn a function D that distinguishes whether a sample comes from p_data(x) or p_model(x). The optimal mapping function G* can be found by solving the following equation:

$$G^{*} = \arg\min_{G} \max_{D} L(G, D) \tag{1}$$

where L(G, D) is the GAN loss function defined by

$$L(G, D) = \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p(z)}\left[\log\left(1 - D(G(z))\right)\right] \tag{2}$$

where E and log are the expectation and logarithmic operators, respectively, and z is a random noise vector, which follows a known noise distribution p(z), typically uniform or Gaussian.
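As a minimal sketch of how the GAN loss is estimated in practice, the two expectations can be replaced by averages over a mini-batch of discriminator outputs. The function name `gan_loss` and the toy probabilities below are illustrative assumptions, not part of the original work.

```python
import math

def gan_loss(d_real, d_fake):
    """Mini-batch estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in (0, 1)
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1)
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot tell real from fake outputs 0.5 everywhere,
# which gives log(0.5) + log(0.5) = -2 log 2.
confused = gan_loss([0.5, 0.5], [0.5, 0.5])

# A confident and correct discriminator pushes the value towards 0,
# the upper bound of this objective.
confident = gan_loss([0.99, 0.98], [0.01, 0.02])
```

The Discriminator tries to increase this value, while the Generator tries to decrease it by making D(G(z)) larger.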

EVALUATED METHODS
This section presents the four methods assessed in this paper for remote sensing image categorization with few labeled samples available. In the following, we describe the unsupervised MARTA-GAN, the Semi-Supervised GAN, the Semi-Supervised Representation Learning GAN, and the Fine Tuning applied to the Discriminator of all methods.

Multiple-Layer Feature Matching GANs (MARTA-GANs)
MARTA-GAN (Lin et al., 2017) is an unsupervised representation learning algorithm that relies on the same min-max game as GANs to learn discriminative features f(x). As in Deep Convolutional GANs (Radford et al., 2015), the Generator and the Discriminator are convolutional networks, trained to minimize a modified loss function L(G, D) given by the equation:

$$L(G, D) = \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p(z)}\left[\log\left(1 - D(G(z))\right)\right] + \left\| \mathbb{E}_{x \sim p_{data}(x)} f(x) - \mathbb{E}_{z \sim p(z)} f(G(z)) \right\|_2^2 \tag{3}$$

The third term, called feature matching loss, is added to the GAN loss function to favor similarity between the generated and real images. The learned features f(x), named multi-feature layer in (Lin et al., 2017), result from concatenating the outputs of the last three convolutional layers of the discriminator network.
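The feature matching idea can be illustrated with a small sketch. It assumes the common formulation in which the loss is the squared L2 distance between the batch means of the features of real and generated images. The helper names and toy feature values are hypothetical.

```python
def multi_feature(f1, f2, f3):
    """Concatenate three flattened feature maps into one feature vector f(x)."""
    return list(f1) + list(f2) + list(f3)

def mean_vector(batch):
    """Average a batch of equal-length feature vectors component-wise."""
    n = len(batch)
    return [sum(column) / n for column in zip(*batch)]

def feature_matching_loss(real_feats, fake_feats):
    """Squared L2 distance between the mean real and mean generated features."""
    mu_real = mean_vector(real_feats)
    mu_fake = mean_vector(fake_feats)
    return sum((r - f) ** 2 for r, f in zip(mu_real, mu_fake))

# Toy batches of two samples each; every sample has three flattened maps.
real = [multi_feature([1.0, 2.0], [0.0], [3.0]),
        multi_feature([3.0, 2.0], [2.0], [1.0])]
fake = [multi_feature([2.0, 2.0], [1.0], [0.0]),
        multi_feature([2.0, 2.0], [1.0], [2.0])]

fm = feature_matching_loss(real, fake)
```

Because only the batch means are compared, the loss vanishes whenever the generated features match the real ones on average, even if individual samples differ.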

Semi-Supervised GANs (SS-GANs)
SS-GANs (Salimans et al., 2016) exploit the available labeled data together with the unlabeled data to perform semi-supervised learning. The Discriminator output is changed from 1 neuron to K + 1 neurons, where the first K neurons are used to classify the real labeled samples into one of the K classes present in the dataset, and the (K + 1)-th neuron indicates whether the input sample is real or fake, i.e., synthesized by the GAN. The training function for the SS-GANs becomes:

$$L(G, D) = L_{supervised} + L_{unsupervised} \tag{4}$$

where

$$L_{supervised} = -\mathbb{E}_{x, y \sim p_{data}(x, y)}\left[\log D(x, y \mid y < K + 1)\right] \tag{5}$$

and L_unsupervised is the unsupervised loss. Observe that L(G, D) is a composition of the standard supervised loss function L_supervised with the unsupervised loss L_unsupervised, which represents the standard GAN min-max game, including the well-known feature matching loss. The optimal solution can be found by minimizing these two losses jointly.
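The role of the K + 1 outputs can be sketched as follows. The code assumes the formulation of (Salimans et al., 2016), in which a softmax is taken over the K + 1 logits and the last output is read as the probability of the "fake" class. The feature matching term is left out for brevity, and all function names are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def supervised_loss(logits, label, k):
    """-log p(y = label | x, y < K + 1): cross-entropy over the first K outputs."""
    class_probs = softmax(logits[:k])
    return -math.log(class_probs[label])

def unsupervised_loss(real_logits, fake_logits, k):
    """-log(1 - p(fake | real x)) - log p(fake | G(z)), with output index k as 'fake'."""
    p_fake_given_real = softmax(real_logits)[k]
    p_fake_given_fake = softmax(fake_logits)[k]
    return -math.log(1.0 - p_fake_given_real) - math.log(p_fake_given_fake)

K = 3
# Equal class logits give a uniform distribution over the K real classes,
# regardless of the value of the (K + 1)-th logit.
sup = supervised_loss([0.0, 0.0, 0.0, 5.0], label=1, k=K)
unsup = unsupervised_loss([0.0] * (K + 1), [0.0] * (K + 1), k=K)
```

Conditioning on y < K + 1 amounts to renormalizing the softmax over the first K logits, which is why the supervised term ignores the last output.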

Semi-Supervised Representation Learning GANs (SSRL-GANs)
The proposed SSRL-GAN differs from the SS-GAN by an auxiliary classifier that is not embedded in the Discriminator. Thus, the Discriminator is responsible for verifying whether the input sample is real or fake, whereas the Classifier evaluates how good the features at the multi-feature layer are for the classification of the available labeled samples. The architecture of the SSRL-GAN is shown in Figure 1 and involves three networks: Generator, Discriminator and Classifier.
The training process is divided into two consecutive stages, unsupervised and supervised, depending on whether the training data is labeled or not. In the first, only unlabeled data is used in each mini-batch, while in the second only labeled samples are employed. The Generator is trained in the same way in both stages, since it does not rely on labels. Thus, while the parameters of the Discriminator are fixed, the parameters of the Generator are updated to synthesize images realistic enough to fool the Discriminator. Formally, this amounts to minimizing a cost function that also includes the feature matching loss term. Analogously, while the Discriminator is being trained, the Generator parameters are kept fixed. Thus, in the unsupervised stage, the Discriminator parameters are updated so that the function L_D is maximized for real samples and minimized for synthetic ones. In the supervised stage, the function L_D is modified to include a new term that tries to maximize the probabilities C(f(x), y) assigned by the Classifier to the real class y of each sample x, as shown in Equation 9.
Aiming to minimize this expression, the Discriminator will tend to produce more discriminative and representative features.
Since the Classifier network requires label information for training, it is not used in the unsupervised stage. In the supervised stage, it is trained using the features f(x) learned by the Discriminator, considering only the real labeled data. In summary, the whole method can be described mathematically by a single GAN objective function L(G, D, C) defined over the Generator, the Discriminator and the Classifier.
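The difference between the two training stages can be sketched with a toy Discriminator objective. This is one plausible reading of the description above, not the authors' exact formulation: it assumes log-probability terms, an objective written for maximization, and a unit weight on the classification term. The function names are hypothetical.

```python
import math

def mean(values):
    return sum(values) / len(values)

def discriminator_objective(d_real, d_fake, class_probs=None):
    """Value the Discriminator tries to increase on one mini-batch.

    d_real, d_fake: D(x) and D(G(z)), each in (0, 1)
    class_probs:    C(f(x), y) for the true class y of each labeled sample,
                    or None in the unsupervised stage (assumed convention)
    """
    value = mean([math.log(p) for p in d_real])
    value += mean([math.log(1.0 - p) for p in d_fake])
    if class_probs is not None:
        # Supervised stage: reward features that the Classifier can separate.
        value += mean([math.log(p) for p in class_probs])
    return value

d_real, d_fake = [0.9, 0.8], [0.2, 0.1]
unsupervised_value = discriminator_objective(d_real, d_fake)
supervised_value = discriminator_objective(d_real, d_fake, class_probs=[0.5, 0.5])
```

In the supervised stage the extra term is zero only when the Classifier assigns probability one to the true class, so the Discriminator is pushed towards features that make the labeled samples easier to classify.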

Fine Tuning
We further tested if the features learned by the aforementioned methods could be improved by a subsequent fine-tuning step.
For MARTA-GAN and SSRL-GAN the original classification layer was replaced by a softmax multiclass classification layer.
For SS-GAN, we kept the first K neurons of the Discriminator output layer. Then, a new supervised training was carried out using the available labeled samples.
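Keeping the first K neurons of the SS-GAN output layer can be pictured as slicing the weights of a fully connected layer. The sketch below assumes one weight row and one bias per output neuron. The helper `keep_class_neurons` is hypothetical.

```python
def keep_class_neurons(weights, biases, k):
    """Drop the (K + 1)-th output neuron of a fully connected layer.

    weights: one row of input weights per output neuron
    biases:  one bias per output neuron
    """
    if len(weights) != k + 1 or len(biases) != k + 1:
        raise ValueError("expected a layer with K + 1 output neurons")
    return [row[:] for row in weights[:k]], biases[:k]

# Toy layer with K = 2 classes plus the real/fake neuron, and 3 inputs.
weights = [[0.1, 0.2, 0.3],
           [0.4, 0.5, 0.6],
           [0.7, 0.8, 0.9]]
biases = [0.0, 0.1, 0.2]

class_weights, class_biases = keep_class_neurons(weights, biases, k=2)
```

The remaining K outputs can then be trained as an ordinary multiclass classifier on the labeled samples.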

EXPERIMENTAL ANALYSIS
The experiments performed in this work aimed to evaluate the representations learned by the methods described above, specifically MARTA-GAN, SS-GAN, SSRL-GAN and the fine-tuned versions of these algorithms.
Once the methods were trained, we took the features extracted from their respective multi-feature layers for image categorization. As in (Lin et al., 2017), we used a Support Vector Machine (SVM) (Hearst et al., 1998) for this purpose.
The SVM was trained on the same labeled samples available in the training set.

Datasets
We assessed the methods using two public datasets for remote sensing image categorization.
The first dataset was the UC MERCED Land Use Dataset (Yang, Newsam, 2010). It comprises 21 land-use classes. Each 256×256 pixel image has a spatial resolution of 0.3 m per pixel. For each class, 100 images were manually extracted from large images downloaded from the USGS National Map of different urban areas around the United States. Some image samples of this dataset are shown in Figure 2(a).
The second dataset used in our experiments was the NWPU-RESISC45 (Cheng et al., 2017). This dataset contains 31500 remote sensing images of size 256×256 pixels, with spatial resolution ranging from about 30 m to 0.2 m per pixel for most classes. A total of 45 scene classes are represented in the dataset. For each class, 700 images were extracted from Google Earth by experts in the remote sensing field. Figure 2(b) shows samples of these images.

Network Architectures
The architectures of the Generator and Discriminator networks were essentially the same as those of the MARTA-GAN (Lin et al., 2017). The Classifier, used only in the SSRL-GANs, was a Multi-Layer Perceptron (MLP) network, which took as input the feature vector at the multi-feature layer of the Discriminator and propagated it into a hidden layer with 512 units, chosen empirically, using a rectified linear unit (ReLU) as activation function. Its output layer implemented a softmax function and had as many units as the number of classes in the dataset.
Table 2. Architecture of the Generator for the three methods.
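A forward pass through a Classifier of this shape can be sketched in plain Python. The hidden size of 512 and the 21 output units (the number of UC MERCED classes) follow the text. The input size of 16, the weight initialization and all helper names are arbitrary assumptions made only for illustration.

```python
import math
import random

def dense(x, weights, biases):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (an arbitrary illustrative choice)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def classifier_forward(features, hidden_layer, output_layer):
    hidden = relu(dense(features, *hidden_layer))
    return softmax(dense(hidden, *output_layer))

rng = random.Random(0)
n_features, n_hidden, n_classes = 16, 512, 21  # 16 is a toy feature size
hidden_layer = init_layer(n_features, n_hidden, rng)
output_layer = init_layer(n_hidden, n_classes, rng)
features = [rng.random() for _ in range(n_features)]
probs = classifier_forward(features, hidden_layer, output_layer)
```

The output is a probability vector over the classes. In the actual method the input would be the much larger multi-feature vector taken from the Discriminator.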
The three network architectures (Classifier, Generator, and Discriminator) are described in more detail in Tables 1, 2 and 3. The symbols denote, for each layer, convolution (C), deconvolution (D), batch normalization (B), ReLU (A1), Leaky ReLU (A2), MaxPooling (P), Flatten (F) and Fully Connected (Fc). The number of filters, the filter dimension and the convolution stride are indicated in parentheses. All filters were square and the stride was equal in the horizontal and vertical directions. The multi-feature layer resulted from the concatenation of F1, F2 and F3, which were the product of a flattening operation over feature maps at different scales in the network.
All methods were trained with a batch size of 64 samples using the Adam optimizer (Kingma, Ba, 2014), whose learning rate and momentum β1 were set to 0.0002 and 0.5, respectively. The α parameter in the Leaky ReLU activation function was set to 0.2. The terms that make up the cost functions of all methods had the same relevance, i.e., each weighting coefficient was set to one. As in (Lin et al., 2017), we scaled the input images to the range [−1, 1] before training and testing. Also, we applied the early stopping regularization procedure to avoid overfitting. The patience parameter, which controls the number of epochs without improvement in the validation loss, was set to 10. Each experiment was executed 5 times in order to evaluate the sensitivity of the methods to the initialization of the trainable parameters.
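The training settings above can be gathered in a short sketch. The hyperparameter values come from the text, whereas the 8-bit input assumption in `scale_pixel`, the dictionary keys and the `EarlyStopping` class are illustrative choices.

```python
TRAIN_CONFIG = {
    "batch_size": 64,
    "learning_rate": 0.0002,
    "adam_beta1": 0.5,
    "leaky_relu_alpha": 0.2,
    "patience": 10,
    "runs": 5,
}

def scale_pixel(value):
    """Map an intensity in [0, 255] (assumed 8-bit input) to [-1, 1]."""
    return value / 127.5 - 1.0

class EarlyStopping:
    """Stop after `patience` consecutive epochs without a lower validation loss."""

    def __init__(self, patience=TRAIN_CONFIG["patience"]):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Toy run with a short patience so that the stop is visible.
stopper = EarlyStopping(patience=3)
decisions = [stopper.should_stop(v) for v in [1.0, 0.9, 0.95, 0.91, 0.92]]
```

In this toy run the best loss (0.9) is reached at the second epoch, and training stops after three further epochs without improvement.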
To verify the influence of the number of labeled samples on the performance of each method, our experiments were carried out under two different protocols. We used the same Train set in both protocols in the unsupervised learning stage. The protocols differed in the number of labeled samples used for the supervised training stage of SS-GANs and SSRL-GANs, and also for training the SVM.
In Protocol 1, we used the Aux set, as described before, for the supervised stage. In Protocol 2, we applied vertical and horizontal flips, rotations and data replication to augment the number of labeled samples. This way, the number of labeled samples in Protocol 2 was about seven times larger than in Protocol 1. The methods were implemented in TensorLayer³ on an NVIDIA Titan XP GPU.
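The geometric part of the Protocol 2 augmentation can be sketched on images stored as nested lists. The choice of 90, 180 and 270 degree rotations is an assumption, since the rotation angles and the replication scheme are not detailed in the text. All function names are illustrative.

```python
def hflip(img):
    """Mirror each row (horizontal flip)."""
    return [row[::-1] for row in img]

def vflip(img):
    """Reverse the row order (vertical flip)."""
    return [row[:] for row in img[::-1]]

def rot90(img):
    """Rotate the image clockwise by 90 degrees."""
    return [list(row) for row in zip(*img[::-1])]

def augment(img):
    """Return the original image plus two flips and three rotations."""
    r90 = rot90(img)
    r180 = rot90(r90)
    r270 = rot90(r180)
    return [img, hflip(img), vflip(img), r90, r180, r270]

tile = [[1, 2],
        [3, 4]]
variants = augment(tile)
```

This particular combination yields six variants per labeled image. The roughly sevenfold increase reported for Protocol 2 would also depend on the data replication step, which is not modeled here.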

RESULTS
Figure 3 summarizes the results for the UC-MERCED and NWPU-RESISC45 datasets in terms of Overall Accuracy (OA). The bar plots in Figures 3a and 3b refer to UC-MERCED, whereas Figures 3c and 3d relate to NWPU-RESISC45.
The results for the fine-tuned versions of the evaluated methods are presented in Figures 3b and 3d for UC-MERCED and NWPU-RESISC45, respectively. In these figures, the suffix FT denotes the results obtained after fine-tuning. Each bar indicates the median OA over all runs for each method and protocol. The plots also show, in black, the highest and the lowest OA values recorded in our experiments in each case.
³ https://tensorlayer.readthedocs.io/en/stable/
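The statistics shown in each bar can be computed as in the following sketch. The accuracy values are placeholders chosen only to exercise the code and are not results from the experiments. `summarize_runs` is a hypothetical helper.

```python
import statistics

def summarize_runs(accuracies):
    """Median, lowest and highest Overall Accuracy over repeated runs."""
    return {
        "median": statistics.median(accuracies),
        "min": min(accuracies),
        "max": max(accuracies),
    }

# Placeholder OA values (%) for five runs of one method under one protocol.
example_runs = [80.0, 82.5, 81.0, 79.5, 83.0]
summary = summarize_runs(example_runs)
```

The gap between the highest and lowest values gives a simple measure of the variability across runs discussed in this section.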
As expected, the augmentation of labeled data improved the accuracy, in some cases remarkably. This can be seen by comparing corresponding bars within each plot. Data augmentation favorably affected even the results of MARTA-GAN, an unsupervised representation learning method. The improvement for this method came from the SVM classifier, which profited from the extra labeled samples. The gain brought by labeled data augmentation ranged from 4.6% for MARTA-GAN on UC-MERCED to 19.2% for SS-GAN FT on NWPU-RESISC45.
A comparison of plots related to the same dataset reveals that fine-tuning also improved the accuracy consistently. The variability of the results across multiple runs also decreased thanks to fine-tuning. The improvement in terms of OA ranged from 0.3% for SS-GAN in Protocol 1 to 4.8% for MARTA-GAN in Protocol 2.
However, the key issue in this analysis is the comparison of the three methods in each scenario. The proposed method, SSRL-GAN, was consistently superior to MARTA-GAN in all experiments. Data augmentation and fine-tuning affected MARTA-GAN and SSRL-GAN performance similarly on both datasets. Even so, the proposed method always outperformed its unsupervised counterpart. Thus, the exploitation of labeled samples in MARTA-GAN, as proposed in SSRL-GAN, was beneficial in all variants tested in our experiments. SS-GAN presented a unique behavior. In all experiments conducted under Protocol 1, it presented the worst results among all methods, both in terms of absolute values and in terms of variability. However, when we increased the number of labeled samples by moving to Protocol 2, SS-GAN consistently became the best performing method.
These results indicate that SS-GAN was, among all tested methods, the most sensitive to the so-called small sample size problem. In other words, the experiments indicated that under conditions of greater scarcity of labeled data, the SSRL-GAN presented the best results among all analyzed methods on both datasets. Conversely, when labeled data was more abundant, the proposed method was outperformed by SS-GAN.

CONCLUSIONS
In this work, we performed a comparative analysis of semi-supervised representation learning methods for remote sensing scene classification. We further introduced a novel semi-supervised approach based on Generative Adversarial Networks (GANs).
The methods were evaluated on two public datasets. We took as baselines an unsupervised and a semi-supervised method, both based on GANs. The experimental analysis indicated that the features learned by the proposed method yielded better accuracy than the baselines when the amount of labeled data was small. The experimental analysis also revealed that a fine-tuning step further improved the results of all tested methods.
Figure 1. Overview of the SSRL-GAN method. The Generator (G) learns to synthesize images to fool the Discriminator (D), which learns to distinguish between real and synthesized images. The semi-supervised procedure is performed by switching between unlabeled and labeled real images. When labeled images are used, features f(x) are extracted from the multi-feature layer and used as input to the Classifier (C), which influences the GAN objective function.

Figure 3. Overall Accuracy results (in %). FT in the plots on the right indicates fine-tuning.

As a continuation of the present research, we intend to explore the conclusions drawn from this work for solutions based on GANs for other applications.