Feature Dimension Reduction Using Stacked Sparse Auto-Encoders for Crop Classification with Multi-Temporal, Quad-Pol SAR Data

Crop classification in agriculture is one of important applications for polarimetric synthetic aperture radar (PolSAR) data. For agricultural crop discrimination, compared with single-temporal data, multi-temporal data can dramatically increase crop classification accuracies since the same crop shows different external phenomena as it grows up. In practice, the utilization of multi-temporal data encounters a serious problem known as a “dimension disaster”. Aiming to solve this problem and raise the classification accuracy, this study developed a feature dimension reduction method using stacked sparse auto-encoders (S-SAEs) for crop classification. First, various incoherent scattering decomposition algorithms were employed to extract a variety of detailed and quantitative parameters from multi-temporal PolSAR data. Second, based on analyzing the configuration and main parameters for constructing an S-SAE, a three-hidden-layer S-SAE network was built to reduce the dimensionality and extract effective features to manage the “dimension disaster” caused by excessive scattering parameters, especially for multi-temporal, quad-pol SAR images. Third, a convolutional neural network (CNN) was constructed and employed to further enhance the crop classification performance. Finally, the performances of the proposed strategy were assessed with the simulated multi-temporal Sentinel-1 data for two experimental sites established by the European Space Agency (ESA). The experimental results showed that the overall accuracy with the proposed method was raised by at least 17% compared with the long short-term memory (LSTM) method in the case of a 1% training ratio. Meanwhile, for a CNN classifier, the overall accuracy was almost 4% higher than those of the principle component analysis (PCA) and locally linear embedded (LLE) methods. The comparison studies clearly demonstrated the advantage of the proposed multi-temporal crop classification methodology in terms of classification accuracy, even with small training ratios.


Introduction
It is very important to acquire crop type information in crop growing monitoring, biomass estimation, crop yield prediction, etc. [1][2][3]. Different from the conventional optical remote sensing, polarimetric synthetic aperture radar (PolSAR) is an active remote sensing technology that can work in all weather and all time conditions. During the past few decades, crop classification with PolSAR data has attracted strongly increasing research interest [4][5][6].
Currently, a variety of classification algorithms have been proposed and developed for PolSAR data. Generally, the development mainly underwent the following three phases [7]: 1) Lee investigated the regularities of distribution and applied statistical models to discriminate different objects [8].
2) With the deeper research on the mechanisms of PolSAR data, the scattering mechanisms of electromagnetic waves were introduced into the analysis and application of PolSAR data [9,10]. Many studies have proved that retrieved polarimetric features by both coherent and incoherent decomposition algorithms can improve the recognition and classification accuracy [11,12], especially for quad-pol SAR images [13,14]. 3) Currently, several deep learning architectures have been developed and notable results were attained [15,16]. These knowledge-based algorithms [17,18] have opened a new horizon in this area. However, most of existing methods only concentrate on single-date PolSAR data. For agricultural-crop-type discrimination, single-temporal PolSAR images cannot provide sufficient information since the same crop shows different external phenomena as it grows up [19]. Additionally, the date of data collection is quite crucial. For instance, it is very challenging to discriminate different crops within the sowing periods. Accordingly, it is necessary to take the advantage of multi-temporal PolSAR images to produce classification results with a high accuracy [14]. Fortunately, with the rapid developments of SAR techniques, increasing numbers of spaceborne SAR systems have been launched and operate in orbits, which can collect a large amount of multi-temporal SAR data for Earth observations. Nowadays, several representative systems are available for civilian applications, including C-band Sentinel-1 systems [20,21], RADARSAT-2 and Radarsat Constellation Mission (RCM) [22,23], L-band Advanced Land Observing Satellite (ALOS) ALOS-PALSAR/PALSAR-2 [24,25], X-band Tandem-X [26], and X-band Constellation of Small Satellites for Mediterranean basin Observation (COMSMO) COMSMO -SkyMed constellation [27]. Based on these operational systems, large amounts of multi-temporal PolSAR data can be collected and adopted for use in crop classification and other applications [28][29][30][31][32][33][34][35][36].
For multi-temporal remote sensing data, a key point is how to take full advantage of the wealth of multi-temporal characteristics. Currently, a variety of classification algorithms with time-series information have been proposed. These algorithms can be mainly split into two methods: on one hand, long short-term memory (LSTM) networks have been adopted for the recognition and classification of multi-temporal data. For instance, Teimouri et al. combined an Fully Convolutional Networks (FCN) and a ConvLSTM network to classify and map land covers with multi-temporal Sentinel-1 data [29]. In order to further raise the performances of LSTM, the input features of LSTM are artificially modified with the fusion of high-resolution optical and SAR images. Zhou et al. used extracted SAR features in multiple deep convolutional networks (DCNs) as the input features of LSTM to improve the classification accuracy [30]. On the other hand, time-series or multi-temporal information is extracted manually from multi-temporal and multi-source data [31][32][33][34][35][36]. For example, Zhong et al. designed a one-dimensional convolution network to extract time-series features and discriminate different ground objects [31]. Yang et al. combined Normalized Difference Vegetation Index (NDVI) of optical images with SAR data to raise the accuracy of classifying paddy-rice in mountainous areas [32]. Guo et al. defined a new parameter based on differential characteristics of Cloude scattering parameters to improve the crop classification accuracy [33]. It can be seen that there is a rapid development in the analysis and application of multi-temporal optical and SAR data. These existing methods have the following limitations: First, it is difficult to discriminate multiple crop types by only using the time-series information due to the similarity between various crops. Second, the performances of the LSTM networks depend heavily on the input features and excessive features Remote Sens. 2020, 12, 321 3 of 26 or sparse time-series data seriously deteriorate its performances. Consequently, for both the LSTM networks and time-series-information-based methods, most of the algorithms still concentrate on extracting more effective features. However, since SAR data represent the compositive interaction between radar signals and vegetation and soil [21], a variety of famous decomposition algorithms have been proposed to efficiently extract polarimetric scattering features [37][38][39][40]. However, direct utilization of many polarimetric features will cause a "dimension disaster" for most of the classification methods. Therefore, effective feature dimension reduction from these redundant ones is a very important task.
In the area of data dimension reduction, early representative algorithms mainly include principle component analysis (PCA) [41] and locally linear embedded (LLE) methods [42]. However, since most engineering problems are nonlinear, PCA performs poorly due to its assumption that the processed data are linear [43]. In addition, LLE can automatically extract low-dimensional nonlinear feature representations from high-dimensional data and is easy to implement, but they are not robust to outliers [44]. More recently, artificial neural network (ANN) and deep learning theory have realized considerable development in terms of image classification [45], target recognition [46], data dimensionality reduction [47], and other machine vision fields. With more than three hidden layers, deep learning models contain sufficient complexity to learn effective features from the data itself [4]. Currently, deep learning has also been proven to be an extremely powerful tool in remote sensing data analysis [48]. For instance, convolutional neural networks (CNN) and deep stack auto-encoder (S-AE) are the most successful network architectures, showing excellent performances in image classification and feature learning [16,49,50].
To make full use of multi-temporal PolSAR images and manage the serious problem of a "dimension disaster", this study first employed several common scattering decomposition algorithms to extract detailed and quantitative parameters. Second, a three-hidden-layer stacked sparse auto-encoder (S-SAE) network was built to effectively extract polarimetric features and reduce the feature dimensionality. Third, the architecture of a deep CNN classifier was proposed to enhance the crop classification performance with limited training ratios. The main contribution of this paper was to apply S-SAEs to effectively reduce the feature dimension for crop classification improvements with multi-temporal, quad-pol SAR images.
The remaining sections are arranged as follows. Section 2 reviews the PolSAR data structure and polarimetric feature extraction. Furthermore, the main concept of an S-SAE and the architecture of CNN classifier is also proposed and discussed in this section. In Section 3, the main configuration and optimization of an S-SAE are analyzed and the performances of the proposed processing framework are assessed using simulated Sentinel-1 data. Section 4 is the conclusion of this paper.

Methodology
For the purpose of crop classification with multi-temporal remote sensing data, this study employed stacked sparse auto-encoders to learn low-dimensional features from a number of decomposed scattering signatures, which were fed to CNN classifiers to achieve classification results with high accuracies. The flowchart of the proposed method is shown in Figure 1, which was mainly composed of three steps including scattering feature extraction, feature dimensionality reduction with an S-SAE, and crop classification with CNN. It should be emphasized that the second step was the key of this study. Its aim was to acquire the optimal low-dimension features to represent sufficient information contained in the original multi-temporal data. In this section, the PolSAR data structure and features, along with the stacked sparse auto-encoders, are reviewed, and the architecture of a proposed deep CNN classifier is proposed and discussed.

PolSAR Data Structure and Features
In quad-pol SAR systems, the measured vector data can be expressed using a 2 × 2 complex scattering matrix as the following format: where S HH , S VH , S HV , and S VV are the four scattering elements from four independent polarization channels, with "H" and "V" standing for the horizontal and vertical linear polarizations. With the assumption of reciprocal backscattering, S HV is approximately equal to S VH and the polarimetric scattering matrix can be rewritten as the Lexicographic basis vector: where the superscript "T" is the matrix transpose. Then, a covariance matrix can be constructed as: Additionally, the Pauli-based scattering matrix can also be obtained as: where T m is the one-look coherency matrix for the mth pixel, where (·) * denotes conjugate complex of (·). Actually, PolSAR images are multi-look processed to suppress speckles such that the coherency matrix is spatially averaged to become: Remote Sens. 2020, 12, 321 5 of 26 where N indicates the number of equivalent looks. It has also been proven that the covariance and coherency matrices are linearly related and can be easily converted to each other [51]. According to the covariance and coherency matrices, several obvious features, including the polarization intensities of |S HH |, |S HV |, and |S VV |, can be directly obtained, where |·| denotes the absolute value of ·. With the deeper research on the mechanisms of PolSAR data, various coherent and incoherent scattering decomposition algorithms based on the covariance and coherency matrices have been employed to extract a variety of detailed and quantitative parameters from multi-temporal PolSAR data. Consequently, several features can be extracted for single and multiple temporal PolSAR images. Bai summarized a 123-dimensional feature vector extracted from single temporal quad-pol SAR images [52]. Furthermore, some new features have also been proposed recently [15]. These tremendous features encounter the great difficulty of the curse of dimensionality at high dimensionalities. To take full advantage of the wealth of multiple temporal PolSAR images for crop classification, feature dimension reduction is essential and important.

Auto-Encoder
In the past few years, feature learning with neural network architectures has attracted increasing attention, which can be used to extract optimal features from high-dimension data. An auto-encoder (AE) is an unsupervised learning algorithm and its goal is to set the target values to be approximately equal to the inputs. A single-layer AE network comprises three main steps (i.e., encoder, activation, and decoder), having one visible input layer (x) of w units, one hidden layer (y) of s units, and one reconstruction layer (z) of w units, which is shown in Figure 2, where f (·) and g(·) denote the activation functions.   Considering the input data x i ∈ R w , where i is the index of ith data point, the AE network first maps it to the latent representation y i ∈ R s . This process is named the encoding step and can be mathematically represented as:

Introduction
where W y ∈ R s×w is the encoding matrix and b y ∈ R s is the bias. Here, the logistic sigmoid function of The decoder has a similar structure to the encoder and maps the compressed data to a reconstruction z i ∈ R w with the weight matrix W z ∈ R w×s , bias b z ∈ R w , and an activation function g(I) = f (I) = (1 + e −I ) −1 , which can be represented as: For simplification, the assumption of the weights strategy W y = W T z = W is used. Consequently, the three parameters W, b y , b z need to determined. An auto-encoder can be trained by minimizing the cost function (i.e., the error between the inputs and reconstructed ones): where n is the number of training samples. In addition, a weight attenuation term can be added in the cost function to control the degree of the weight reduction such that the term inhibits the influence of noise on the irrelevant components in the target and weight vector, significantly improves the generalization ability of the network, and effectively avoids over fitting. The cost function is defined as: where λ is used to control the regularization strength, u is the number of hidden layers, and Ω weights is the weight attenuation term named L2 regularization. By using the Stochastic gradient descent (SGD) algorithm, the weight matrix and bias are trained and optimized [53].

Stacked Sparse Auto-Encoder
A sparse auto-encoder is the foundation of an S-SAE, which is developed from an AE. By adding sparsity constraints to an AE, the sparsity constraints act on hidden layer units. This method can preferably express high-dimensional features [54]. In order to realize the inhibitory effects, an SAE uses a Kullback-Leibler (KL) divergence to force it to be close to a given sparse value ρ (sparsity parameter) by constraining the average activation valueρ of the hidden layer neuron output. KL divergence is added to the cost function as a penalty term. Therefore, the cost function of an SAE can be updated in accordance with Equation (11). Furthermore, the penalty term is called a sparse regularization term, which is expressed in Equation (12).
In Equation (12), the average activation valuesρ i of all training samples on neurons i of the hidden layer are defined as:ρ where x j is the jth training samples, w i is ith row of the weight matrix W in the first layer, and b (1) i is the ith entry of the bias vector.
KL divergence is adopted because it can measure the difference between two different distributions wel. Ifρ i = ρ i , KL(ρ i ρ i ) = 0. If the difference between ρ andρ is large, KL divergence will force them to be close. In Equation (11), β (0 < β < 1) is used to control the weight of the sparse regularization term. In order to inhibit most neurons, the value is generally close to 0. If the value is 0.03, the average activation value of each neuron on the SAE will be close to 0.03 through this constraint.
S-SAE is a deep learning architecture constructed by linking several sparse auto-encoders in series, in which the outputs of each layer are fed to the following layer. The training of an S-SAE is through greedy training layer by layer. An S-SAE with the Softmax regression neural network [49] can train and adjust network parameters again, which is called fine-tuning. The structure and training process of an S-SAE are given in Figures 3 and 4, respectively.  series, in which the outputs of each layer are fed to the following layer. The training of an S-SAE is through greedy training layer by layer. An S-SAE with the Softmax regression neural network [49] can train and adjust network parameters again, which is called fine-tuning. The structure and training process of an S-SAE are given in Figures 3 and 4, respectively. Figure 3. An S-SAE network structure.
In the network, the number of labels is equal to the dimension of the output vector ( ) u y . Here, is finely tuned to determine the parameters of W and b using backpropagation.

Architecture of the Proposed Deep CNN Classifier
A convolutional neural network (CNN) is one of the most successful network architectures in deep learning methods. The learning process of CNNs is computationally efficient and insensitive to shifts in data like image translation, making CNN a leading model to recognize 2D patterns in images [31]. In remote sensing studies, a deep learning network based on CNN has been widely used in remote sensing, such as in large-scale image recognition, semantic segmentation, and target classification [48]. In the field of remote sensing target classification, since the ground truth data collection is always a hard task, the aim of this paper was to construct a lightweight CNN network to achieve classification results with a high accuracy and the classical LetNet was selected as the basic structure. Additionally, it is difficult to obtain better classification performance by deepening the In the network, the number of labels is equal to the dimension of the output vector y (u) . Here, W (l) and b (l) , l ∈ {1, 2, 3 . . . u}, are the parameters of the lth layer, where W (l) is the weight matrix of the lth layer and b (l) is the bias vector. According to the known labels, the whole S-SAE network is finely tuned to determine the parameters of W and b using backpropagation.

Architecture of the Proposed Deep CNN Classifier
A convolutional neural network (CNN) is one of the most successful network architectures in deep learning methods. The learning process of CNNs is computationally efficient and insensitive to shifts in data like image translation, making CNN a leading model to recognize 2D patterns in images [31]. In remote sensing studies, a deep learning network based on CNN has been widely used in remote sensing, such as in large-scale image recognition, semantic segmentation, and target classification [48]. In the field of remote sensing target classification, since the ground truth data collection is always a hard task, the aim of this paper was to construct a lightweight CNN network to achieve classification results with a high accuracy and the classical LetNet was selected as the basic structure. Additionally, it is difficult to obtain better classification performance by deepening the network due to the small size of the input data cubic. Consequently, this paper promoted the LetNet configuration using two parallel branches inspired by the characteristics of GoogLeNet [55]. The proposed structure of a CNN classifier is given in Figure 5. It mainly comprises four convolutional layers, one average-pooling Remote Sens. 2020, 12, 321 8 of 26 layer, one addlayer layer, one fully connected layer, and a Softmax classifier. Basically, it retains the same elements as GoogLeNet and utilizes the average pooling layer to reduce the number of fully connected layers. Furthermore, the network uses two convolution branches with different depths and the addlayer structure to fuse the features from different depths, enhancing the relevant features to achieve a fast convergence. The dimension of the input data is 15 × 15 × M where M indicates the number of input features. First, the input is through one convolution layer with 32 filters whose size is 5 × 5 and the stride is 1. The dimension of the generated feature cubic is 15 × 15 × 32 using zero padding. Second, the generated feature maps are fed into two pathways, where one pathway contains two convolution layers and another is one convolution layer. These three convolution layers have the same 64 filters with the same size of 3 × 3 and the strides are 2, 1, and 2. Third, the feature maps from the upper pathway are activated using a Rectified Linear Unit (ReLU) function and combined with the feature from the lower pathway. Thus, the feature maps with dimension of 8 × 8 × 64 are obtained. Fourth, the combined maps are downsampled by one average-pooling layer with the size of 2 × 2 and the stride is 2. Finally, in order to use Softmax to calculate the probability for each class, one fully connected layer is applied to map the data to a vector.

Experimental Sites and PolSAR Data
The performances of the proposed method were assessed with two data sets from two experimental sites. The first experimental site was an approximate 14 km × 19 km rectangular region located in the town of Indian Head (103 • 66 87.3"W, 50 • 53 18.1"N) in southeastern Saskatchewan, Canada [56,57]. The second one was a 13 km × 9 km rectangular region located in Flevoland (5 • 33 53.67"W, 52 • 26 45.73"N), Netherlands. The location maps of the study areas are shown in Figure 6. These two sites were established by the European Space Agency (ESA) to evaluate the performances of crop classification with Sentinel-1 data. Both study sites contained various crop types. There are mainly 14 classes of crops and the corresponding planting areas are summarized in Table 1.
The two experimental PolSAR data sets were both simulated with Sentinel-1 system parameters from real RADARSAT-2 data by ESA before launching real Sentinel-1 systems. The real RADARSAT-2 data were collected from April to September (i.e., Indian Head data set: 21 April, 15  The multi-temporal images have already been coregistered and filtered for speckle noise suppression using an averaged structure Lee filter [58]. For the performance evaluation, optical images of high resolutions and ground surveys were combined to establish the ground truth maps, which are shown in Figure 7.   For the subsequent experiments, polarimetric scattering features were first derived from seven time-series PolSAR images separately using various methods. Some of these features were based on the measured data and directly obtained, and others were calculated with Freeman decomposition [37], Huynen decomposition [38], Yamaguchi decomposition [39], and Cloude decomposition [40]. It should be noted that the parameter "A" from Cloude decomposition was substituted by the theta parameter proposed in Ji et al. [34] and the null angle parameters were from Chen and Tao [15]. In total, 252 features from 7 time series were prepared, which are summarized in Table 2 and some typical features (i.e., amplitude of HH-VV correlation where HH and VV stands for the complex SAR images from horizontal and vertical channels respectively, phase difference of HH-VV, co-polarized ratio, and null angle parameters) are shown in Figure 8.

Feature Extraction Methods Features Dimension
Features based on measured data Polarization intensities (|S HH |, |S HV |, |S VV |) 3 Amplitude of HH-VV correlation Phase difference of HH-VV (atan Im S HH S * VV /Re S HH S * VV ) 1 Co-polarized ratio (10 log 10 |S HV | 2 /|S VV | 2 ) 1 Cross-polarized ratio (10 log 10 |S VV | 2 /|S HH | 2 ) 1 Co-polarization ratio (10 log 10 |S HV | 2 /|S HH | 2 ) 1 Degrees of polarization   Unsupervised pretraining is the first key step to constructing an S-SAE. The ratio of selected training samples significantly affects the performance and computation time. For simplification, a single-layer sparse auto-encoder with the regularization term λ of 0.001, sparsity regularization term β of 4.5, and the sparsity parameter ρ of 0.25 was used to compress 252 input features down to 9 features. To investigate the performances of an S-SAE with different training samples, this section compares the performances of different training sample ratios of 0.3%, 1%, 5%, 10%, 20%, and 50%. The pretraining epochs were 400. After the dimensionality reduction, the nine reduced features were fed into a support vector machine (SVM) classifier. The ratio of the supervised training samples for the SVM classifier was 1% and randomly selected, and the remaining 99% were for testing. In the experiments, crop classifications were carried out ten times and the averaged classification accuracy was utilized for the evaluation. The overall accuracy (OA) and kappa coefficient of the classification results are plotted in Figure 9a. Figure 9b gives the training calculation time with different ratios of unsupervised training samples. From Figure 9, it can be seen that the classification accuracy increased with the increase of unsupervised pretraining samples. For the case of a 50% training ratio, the OA and kappa were raised by about 6% and 5%, respectively, relative to those for the 0.3% training ratio. However, the training time also became longer. Therefore, taking the computational efficiency into account, the ratio of the unsupervised pretraining samples was fixed to 0.3%, 1%, and 5% in the following experiments.

Depth and supervised fine-tuning of the parameters of an S-SAE
In order to investigate the performance of an S-SAE with different depths, this section sets five AEs with one to five layers (i.e., AE, S-AE2, S-AE3, S-AE4, and S-AE5), where AE is a single layer auto-encoder, S-AE is stack auto-encoder. These five S-SAEs were first pretrained, as discussed in the last section, and then finely tuned with supervised training samples. The ratios of the pretraining samples were 0.3%, 1% and 5%. The ratio of the supervised training samples for fine-tuning was set to be 1%. In order to reduce the disturbance of other parameters, λ, β, and ρ of the five S-AEs were all set to 0.001, 0, and 0. The number of epochs of pretraining and fine-tuning training were 400 and 800, respectively. Again, the nine reduced features were fed into the SVM classifier after the dimensionality reduction. The OA and kappa of the classification results with five different S-AEs are shown in Figure 10 (1% pretraining ratio), and the evaluation parameter values were calculated and are listed in Table 3. From Figure 10 and Table 3, it is clear to see that: 1) The accuracy of the classification results increased with the increase of pretraining samples, which is consistent with the conclusion of previous step. 2) Comparing Table 3 and Figure 9a, the parameters of λ, β, and ρ improved the accuracy, especially in the case of the 0.3% pretraining samples. 3) The fine-tuning of S-SAE raised the classification accuracy dramatically. However, the power of improvement became weaker with the increase of pretraining samples, which was mainly due to the insufficient training caused by the decreased ratio of fine-tuning samples to pretraining ones. 4) As the number of layers increased, the classification accuracies became gradually higher [55]. However, when the network contained more than three layers, the performance did not increase. Therefore, the network architecture of 252-100-50-9 was configured as the optimal one and used in the following experiments.

Size of the hidden layers
For an AE network, the number of learned features is decided by the hidden layer. To investigate the effects of the hidden layers on crop classification, a group of experiments were conducted. Here, the goal was still to reduce the 252 input features to 9, where the optimal network with the hidden neurons of 252-100-50-9 in the previous step was chosen as the original network configuration. Let L 1 and L 2 denote the number of units contained in the two hidden layers and the searching intervals were [35,260] and [20,200], respectively. In the experiment, the grid search strategy with a step of 15 was employed to determine the optimal number of hidden neurons. The values of λ, β, and ρ were still set to 0.001, 0, and 0. The nine compressed features were fed into the SVM classifiers trained with 1% supervised samples. The OA and kappa of the classification results are shown in Figure 11. From Figure 11, it is clear that the effects of the size of the hidden layer were strong. The highest OA was approximately 10% higher than the lowest. The zone with the higher classification accuracy was mainly distributed in the top-left area, appearing as an inverted triangle, marked by a yellow line in Figure 11. In this area, the optimal interval for the size of the second hidden layer shrunk gradually with the decrease of the size of the first hidden layer. Consequently, the size of the first hidden layer was much bigger than that of the second one. In addition, when the size of the first hidden layer was close to the size of the input data, a better performance of the dimension reduction was obtained, which was mainly because having more neurons could raise the ability of feature learning and improve the performance of the whole S-SAE. Finally, the best classification accuracy with an OA of 80.45% and kappa of 75.42% was achieved when L 1 and L 2 were 215 and 95, and thus the S-SAE with the structure of 252-215-95-9 was the optimal configuration.

Parameters of λ, β, and ρ
The determination of λ, β, and ρ is the final step to constructing an optimal S-SAE. In order to study the effects of these three parameters, this study carried out some comparison experiments for a single-layer AE network.
For λ, the grid search of ∆ log 10 λ was carried out with a stride of 1 and an interval of [−7, 1]. The other remaining implementation was similar with that in the previous step. The OA and kappa of the classification results with an SVM classifier are plotted in Figure 12, where the highest accuracy was obtained with a λ of 0.1 (i.e., log 10 λ = −1), and a λ of 10 generated the lowest accuracy.   For β and ρ, two groups of experiments with a log 10 λ of −2 and −1 were compared. For β, the searching interval and step were [0.5, 4.5] and 1. For ρ, the searching interval and step were set to [0.05, 0.45] and 0.1. The AEs were configured with these parameters and the reduced features were fed into the SVM classifiers. The OA and kappa were charted in Figure 13, where it was evident that: 1) the values of λ, β, and ρ had a great impact on the recognition performances; and 2) the best results were generated when λ, β, and ρ were set to 0.1, 2.5, and 0.45. The highest OA and kappa were 80.62% and 75.84%, which were both 5% higher than that of an SAE with the original parameters (i.e., λ = 0.001, β = 0, ρ = 0). Additionally, clear regulations during the process of determining optimal parameters were not found in these experiments.   Finally, after a variety of experiments, a group of optimal parameters were determined and are listed in Table 4 for the S-SAE with the structure of 252-215-95-9. Consequently, the OA and kappa of the classification results with the SVM were 81.77% and 77.20%, which were slightly higher than those in the previous section.

Comparison of Classification Results with the Different Methods
In this section, nine reduced features were extracted from the original 252 multi-temporal features using different methods, including PCA, LLE, a single-layer AE(1) of 252-9, a finely tuned AE(2) of 252-9, a finely tuned S-AE(1) of 252-100-50-9, a finely tuned S-AE(2) of 252-215-95-9, and a finely tuned S-SAE of 252-215-95-9 with the optimal parameters listed in Table 4. The λ, β, and ρ of AE(1), AE(2), S-AE(1), and S-AE(2) were all set to 0.001, 0, and 0. The ratio of the supervised fine-tuning samples for AE was 1%. For LLE, the number of neighbors was 300 and the max embedding dimensionality was 9. Then, the extracted nine-dimensional features were fed into the SVM and the proposed CNN classifiers. For the CNN classifier, the initial parameters were randomly selected and renewed with a learning rate of 0.01, momentum parameter of 0.9, and weight decay of 0.0004.
First, the ratio of the training samples for the SVM and CNN classifiers were 1% and the other 99% were for testing. The 14 types of crops were classified, including lentil, durum wheat, spring wheat, field pea, oat, canola, grass, mixed pasture, mixed hay, barley, summer fallow, flax, canary seed, and chemical fallow. Compared with the ground truth, some of the classification results along with the error map are shown in Figure 14, and the OA and kappa for the SVM and CNN classifiers, along with the correct classification ratio for each class, were calculated and are summarized in Tables 5 and 6, respectively. The crop classification results showed that the optimal S-SAE had the best performances for both the SVM and CNN classifiers. For the SVM classifier, OA from the S-SAE was 16.77% and 14.56% higher than those from PCA and AE (1). For the CNN classifier, OA from the S-SAE was raised by 7.55% and 7.58% relative to those from PCA and AE (1). The combination of S-SAE and CNN achieved the best classification results, of which the OA and kappa were up to 95.44% and 94.51% Second, the classification methods with different training ratios (i.e., 1%, 5%, and 10%) for training the SVM and CNN classifiers were compared. The quantitative evaluation parameters were calculated and are listed in Table 7. Meanwhile, some results with the conventional Complex Wishart classifier, LSTM in Zhong et al. [31] and Chen method [15] are also included. The 36 × 7 decomposed features were the input of the LSTM and 14 decomposed roll-invariant features from 7 time series were taken as the input of the Chen method.
From Figure 14, we can conclude that: 1) for feature dimension reduction, the SVM and CNN with nine features from the S-SAE achieved the best classification results; 2) with the increase of the training ratio, the recognition accuracy for all methods became better, as expected; 3) compared with the SVM classifier, the CNN classifier improved the classification performance of the dimensionality reduction features, especially for the crops with small planting areas; 4) among all the compared methods, the proposed method achieved the highest classification accuracy. The LSTM wass not effective in discriminating crop species with similar external morphologies, such as barley and durum wheat. Moreover, the proposed method of S-SAE + CNN achieved excellent crop classification accuracies, even with finite training ratios.

Comparison of the Dimensionality Reduction Features with Different Methods
In order to further demonstrate the advantages of this method regarding feature extraction, this section compares the dimensionality reduction features with different dimensionality reduction methods for the following two aspects. First, according to the contribution of PCA and visual quality of LLE and S-SAE, the first four extracted features from these three methods were visualized and are shown in Figure 15, from which it can be seen that the features with S-SAE exhibited the best visual effects. Second, the values of the standard deviation between different crops and within the same class were calculated for quantitative comparison. The six main crops with the highest planting area in the Indian Head site were selected for comparison: lentil, spring wheat, field pea, canola, barley, and flax. Then, six local windows with the size of 100 × 100 were used to choose samples for the six main crops, and the standard deviations of the optimal feature (i.e., the first column in Figure 15) from each method were calculated and plotted in Figure 16 (in Figure 16a, the number of the horizontal ordinate stands for the six selected crops; in Figure 16b, the number of the horizontal ordinate stands for three crops, including the current and neighboring crops), which shows the maximum standard deviation between different crops and the minimum within the same crop. Consequently, the features with the S-SAE had a better classification ability and are expected to improve the accuracy of crop recognition and classification.

Results and Analysis for the Flevoland Site
In order to further verify the reliability of the proposed method, another experiment with multitemporal data from Flevoland was carried out and is reported on in this section. The methods of dimensionality reduction and classification maintained the same setup as the previous one. For a fair comparison, according to the contribution rate (i.e., more than 90%) of reduced parameters using PCA method, the original 252 features were all reduced to 12 features.

Results and Analysis for the Flevoland Site
In order to further verify the reliability of the proposed method, another experiment with multi-temporal data from Flevoland was carried out and is reported on in this section. The methods of dimensionality reduction and classification maintained the same setup as the previous one. For a fair comparison, according to the contribution rate (i.e., more than 90%) of reduced parameters using PCA method, the original 252 features were all reduced to 12 features.
First, various dimensionality reduction methods, including PCA, LLE, a single-layer AE(1) of 252-12, a finely tuned AE(2) of 252-12, a finely tuned S-AE(1) of 252-100-50-12, a finely tuned S-AE(2) of 252-215-95-12, and an S-SAE of 252-215-95-12 were used to extract 12 low-dimension features from the original 252 multi-temporal features. In addition, the comparison methods also included a finely tuned SAE of 252-12 with λ, β, and ρ being 0.1, 2.5, and 0.45. The determined optimal parameters of an S-SAE of 252-215-95-12 are listed in Table 8. Compared with the ground truth, some of the classification results and the error maps are shown in Figure 17, and the OA and kappa, along with the correct classification ratio for each class, were calculated and are summarized in Table 9, from which it can be seen that the optimal S-SAE still had the best performances. The combination of the S-SAE and CNN achieved the best classification results, of which OA and kappa were up to 91.63% and 90.11%, respectively. Furthermore, the AE with a deeper structure and fine-tuning parameters improved the data dimensionality reduction ability. Secondly, the classification methods with different training ratios (i.e., 1%, 5% and 10%) for training SVM and CNN classifiers are compared. The quantitative evaluation parameters are calculated and listed in Table 10. The experimental results show that the classification accuracy with Chen method is a little lower than that in Indian Head data. Other main conclusions are the same as those in the first study site.

Discussions
The accuracy of the crop recognition and classification can be raised dramatically with multi-temporal quad-pol SAR data. Recently, an increasing number of spaceborne SAR systems launched into orbit around the Earth can acquire a large amount of real data, providing a great opportunity for multi-temporal data analysis [28]. Meanwhile, deep learning has shown powerful abilities in the fields of remote sensing [48]. Based on these two considerations, this study attempted to combine the effective feature dimension methods of AE and CNN classifiers to achieve crop classification results with a high accuracy. Based on the theoretical analysis and experimental results above, we provide the following discussions.

Contribution of Multi-Temporal SAR Data and Decomposed Features
Different from other ground objects, agricultural crops usually hold a stable and regular growing cycle. Generally, the crop categories would not change during various growing stages. In this case, single-temporal PolSAR images cannot provide sufficient information since the same crop shows different external phenomena as it grows up. Compared with single temporal data, multi-temporal remote sensing data can provide much richer information to improve the classification accuracy [29,30].
In order to better understand the polarimetric scattering mechanisms of various objects, a variety of polarimetric feature decomposition techniques have been developed [37][38][39][40]59]. With these polarimetric decomposition algorithms, a huge number of features can be extracted from PolSAR data that are used to raise the classification accuracy. For multi-temporal data from M time series, the number of features becomes M times greater. If a deep learning algorithm is directly applied, these tremendous features encounter the serious difficulty of the curse of dimensionality at high dimensionality [50], especially for quad-pol SAR images. Therefore, data dimension reduction and effective feature extraction are very essential and important issues.

Network Construction and Parameter Optimization of an S-SAE
In this study, the S-SAE was employed to extract effective features from multi-temporal polarimetric features. Furthermore, the effects of the training ratios, network configurations, size of the hidden layers, and main parameters were studied in Section 3.2.2.
For the training ratios, the increasing training ratio of unsupervised training samples increased the performances of S-SAE. However, the power of improvement became weaker with the increase in pretraining samples, which was mainly due to the insufficient training caused by the decreased ratio of fine-tuning samples to pretraining ones. In addition, the increasing training samples brought about a heavy computation burden. Therefore, an appropriate training ratio is of paramount importance.
For network depths and the size of hidden layers, as the network deepened, the classification accuracies became higher. However, the network depth was not as deep as possible [50,53]. When the network contained more than three layers, the performance did not increase any more. Additionally, by setting the step size of the searching grid to be smaller, this study provided new discoveries. If the number of neurons in the first hidden layer approximated the number of input features and the number of neurons in the second hidden layer kept a certain gap with the former and following layers, a better performance of the data dimension reduction could be obtained.
For the parameters of λ, β and ρ, the regularization of λ and the sparsity terms of β and ρ had uncertain impacts on the performances of the S-SAE. Appropriate values of these parameters improved the OA of classification results by 5%, which was even higher than that of the optimal structure of 252-215-95-9. Unfortunately, clear regulations were not found in these experiments, which needs further study in the future.

Differences from Existing Works
Regarding the purpose of classification with multi-temporal PolSAR data, Lee investigated the regularities of distribution and applied statistical models to discriminate between different objects [8]. The proposed complex Wishart classifier only uses the statistical distribution of the covariance matrix and can be easily extended to the case of multi-band or multi-temporal PolSAR images.
With the development of polarimetric decomposition theory, various decomposed features can be calculated, and it has been proven that the performances of target recognition and classification can be improved greatly with these meaningful parameters [15,33]. Meanwhile, the CNN classifier was also applied to classification with PolSAR images since it has shown great success in computer vision and other fields [15,48]. In order to take full advantage of the polarimetric decomposition and the CNN classifier, a feature dimension reduction was necessary. In this case, the conventional PCA and LLE were used to extract effective information expressed by the reduced parameters.
In Section 3.2.3, for fair comparison, according to the contribution rate of the reduced parameters using the PCA method, the original 252 features were all reduced to 9 features. The experimental results show that the S-SAE performed best and the accuracy of the classification result obtained using the S-SAE+CNN was the highest. For further comparison, PCA and S-SAE were used to reduce the original 252 features to 3, 6, 9, and 12 low-dimensional features for the Indian Head site, and then the reduced features were fed into the SVM classifier. The OA and kappa with different reduced features are shown in Figure 18. It can be seen clearly that the classification accuracy increased as the number of reduced features increased, as we expected. Regardless of the dimension of the reduced features, S-SAE had a better performance than the conventional PCA method.

of 27
With the development of polarimetric decomposition theory, various decomposed features can be calculated, and it has been proven that the performances of target recognition and classification can be improved greatly with these meaningful parameters [15,33]. Meanwhile, the CNN classifier was also applied to classification with PolSAR images since it has shown great success in computer vision and other fields [15,48]. In order to take full advantage of the polarimetric decomposition and the CNN classifier, a feature dimension reduction was necessary. In this case, the conventional PCA and LLE were used to extract effective information expressed by the reduced parameters.
In Section 3.2.3, for fair comparison, according to the contribution rate of the reduced parameters using the PCA method, the original 252 features were all reduced to 9 features. The experimental results show that the S-SAE performed best and the accuracy of the classification result obtained using the S-SAE+CNN was the highest. For further comparison, PCA and S-SAE were used to reduce the original 252 features to 3, 6, 9, and 12 low-dimensional features for the Indian Head site, and then the reduced features were fed into the SVM classifier. The OA and kappa with different reduced features are shown in Figure 18. It can be seen clearly that the classification accuracy increased as the number of reduced features increased, as we expected. Regardless of the dimension of the reduced features, S-SAE had a better performance than the conventional PCA method.

Limitations and Future Work
First, only a limited number of polarimetric decomposition algorithms for quad-pol SAR images were used in this paper. Polarimetric decomposition is still a rapidly developing area and many other algorithms can also be used together to form a rather high-dimensional feature vector [52]. Second, the proposed S-SAE feature dimension reduction method had a heavy computation burden, although it performed much better than PCA. Third, the training ratio, network depth, size of the hidden layer, and other main parameters can impact the performances of data dimension reduction, especially for a multi-layer deep S-SAE. Finally, the proposed method was only assessed with Sentinel-1 data. The performances with multiple SAR sensors and combination of SAR and optical data will be investigated in future work.

Conclusion
To deal with the problem of the dimension disaster, this paper proposed an S-SAE to reduce the data dimension of scattering features extracted from multi-temporal PolSAR images. To validate the performances of the proposed S-SAE + CNN strategy, the Sentinel-1 data, along with the established ground truth maps for two experimental sites, were used. Combined with the traditional SVM and constructed CNN classifiers, the classification results were compared and evaluated. The experimental results showed that the S-SAE could extract low-dimension features from original ones. For the CNN classifier, this method could significantly improve the classification accuracy of small sample crops compared with the traditional methods. With the increasing availability of PolSAR data,

Limitations and Future Work
First, only a limited number of polarimetric decomposition algorithms for quad-pol SAR images were used in this paper. Polarimetric decomposition is still a rapidly developing area and many other algorithms can also be used together to form a rather high-dimensional feature vector [52]. Second, the proposed S-SAE feature dimension reduction method had a heavy computation burden, although it performed much better than PCA. Third, the training ratio, network depth, size of the hidden layer, and other main parameters can impact the performances of data dimension reduction, especially for a multi-layer deep S-SAE. Finally, the proposed method was only assessed with Sentinel-1 data. The performances with multiple SAR sensors and combination of SAR and optical data will be investigated in future work.

Conclusions
To deal with the problem of the dimension disaster, this paper proposed an S-SAE to reduce the data dimension of scattering features extracted from multi-temporal PolSAR images. To validate the performances of the proposed S-SAE + CNN strategy, the Sentinel-1 data, along with the established ground truth maps for two experimental sites, were used. Combined with the traditional SVM and constructed CNN classifiers, the classification results were compared and evaluated. The experimental results showed that the S-SAE could extract low-dimension features from original ones. For the CNN classifier, this method could significantly improve the classification accuracy of small sample crops compared with the traditional methods. With the increasing availability of PolSAR data, the proposed method also provides an efficient way for crop classification with multi-temporal PolSAR data.
In order to construct an optimal S-SAE, this paper also investigated the effects of the training ratio, network depth, size of the hidden layer, network configuration, and other main parameters. The experimental results showed that these factors could greatly impact the performance of the feature dimension reduction. For this specific purpose, the configuration and main parameters of an S-SAE should be optimized. It should be noted that the single-layer SAE could achieve a dimension reduction performance that was close to the optimized S-SAE when the training parameters were set properly. Due to the complexity of optimizing the parameters of a multi-layer S-SAE, the potential of multi-layer S-SAE is not fully utilized. Methods to effectively determine the configuration and parameters for constructing an optimal S-SAE will be discussed in a separate paper.