Multimodal Remote Sensing Benchmark Datasets for Land Cover Classification with A Shared and Specific Feature Learning Model

As remote sensing (RS) data obtained from different sensors become largely and openly available, multimodal data processing and analysis techniques have been garnering increasing interest in the RS and geoscience communities. However, due to the gap between different modalities in terms of imaging sensors, resolutions, and contents, embedding their complementary information into a consistent, compact, accurate, and discriminative representation, to a great extent, remains challenging. To this end, we propose a shared and specific feature learning (S2FL) model. S2FL is capable of decomposing multimodal RS data into modality-shared and modality-specific components, enabling the blending of information from multiple modalities more effectively, particularly for heterogeneous data sources. Moreover, to better assess multimodal baselines and the newly-proposed S2FL model, three multimodal RS benchmark datasets, i.e., Houston2013 -- hyperspectral and multispectral data, Berlin -- hyperspectral and synthetic aperture radar (SAR) data, and Augsburg -- hyperspectral, SAR, and digital surface model (DSM) data, are released and used for land cover classification. Extensive experiments conducted on the three datasets demonstrate the superiority and advancement of our S2FL model in the task of land cover classification in comparison with previously-proposed state-of-the-art baselines. Furthermore, the baseline codes and datasets used in this paper will be made freely available at https://github.com/danfenghong/ISPRS_S2FL.


Introduction
The rapid development of remotely sensed imaging techniques enables the measurement and monitoring of Earth on the land surface and beneath (e.g., identification of underground minerals (Bishop et al., 2011), geological environment survey and monitoring (Van der Meer et al., 2012), volcanic terrain component analysis (Amici et al., 2013)), of the quality of air and water, and of the health of humans, plants, and animals (Nativi et al., 2015). Remote sensing (RS) is one of the most important contact-free sensing means for Earth observation (EO), extracting relevant information about the physical properties of the Earth and environment system from spaceborne and airborne platforms. With the ever-growing availability of RS data sources from both satellite and airborne sensors on a large and even global scale, multimodal RS image processing and analysis techniques have been garnering growing attention in various EO-related tasks (Schmitt and Zhu, 2016), such as land cover change detection (Liu et al., 2017, 2019), disaster monitoring and management (Zhu et al., 2019; Liu et al., 2020), urban planning (Weng, 2009; Xie and Weng, 2017), and mineral exploration (Hong et al., 2019b; Siebels et al., 2020).
The data acquired by different platforms can provide diverse and complementary information: light detection and ranging (LiDAR) or digital surface models (DSMs) provide height information about the ground elevation, synthetic aperture radar (SAR) provides structural information about Earth's surface, and multispectral (MS) or hyperspectral (HS) data provide detailed content information of the sensed materials. The joint exploitation of different RS data has therefore proven useful to further enhance our understanding of Earth and the environment. In a complex urban scene, the ability of spectral data (e.g., RGB, MS, HS) to finely identify land cover categories usually remains limited, particularly for categories with extremely similar spatial structures or spectral signatures. For example, the material "Asphalt" on a road or on a roof can hardly be classified by only observing subtle discrepancies in their spectral profiles (Heiden et al., 2012). Fortunately, the height information provided by DSM data enables the fine-grained recognition of such similar materials at a higher accuracy than single modalities alone. It is well known that HS images are characterized by nearly continuous spectral properties, while MS images can provide finer spatial information. The fusion of MS and HS images naturally becomes a feasible solution to obtain an image product of high spatial and spectral resolutions. Another common example is cloud removal. Optical RS data (e.g., RGB, MS, HS) tend to suffer from cloud occlusion in the imaging process, bringing the risk of important information loss. SAR data or DSM data generated from LiDAR are insensitive to cloud coverage, as they capture intrinsic structure or elevation information that is not closely associated with clouds (Gao et al., 2020).
In recent years, many works have been proposed in an attempt to boost the development of multimodal RS techniques for land cover classification. Nevertheless, the ability of these approaches to represent multimodal RS data remains limited, further limiting the performance gain of subsequent high-level applications. This can be explained by two possible factors as follows.
• Multimodal RS Benchmark Datasets. Despite the increased availability of multimodal RS data, different contexts, structures, sensors, resolutions, and imaging conditions pose a great challenge in data acquisition and processing for to-be-studied scenes (Dalla Mura et al., 2015). Consequently, the lack of multimodal RS benchmark datasets, to a large extent, limits the development of the corresponding methodologies and the practical application of land cover classification.
• Multimodal Feature Learning Models. Feature extraction (FE), as one of the most important steps prior to high-level data analysis, also plays a key role in multimodal RS. Most previous FE methods, either hand-crafted or learning-based (Rasti et al., 2020), extract or learn multimodal features in a concatenation fashion. Such a manner can lead to inadequate information fusion and even hurt or destroy the original components of each modality, due to the highly coupled information between multimodal RS data (particularly heterogeneous data).
Motivated by the aforementioned challenges, we aim in this paper to build diversified multimodal RS benchmark datasets and devise novel feature learning models for land cover classification. For this purpose, we first collect multiple RS data with different resolutions, modalities, and sensors, together with well-labeled land cover maps. Supported by these datasets, we propose a novel multimodal feature learning (MFL) model that decomposes different modalities into shared and specific representations, in order to fuse diverse information from multimodal RS data in a compact and discriminative way. The proposed model establishes explicit mapping relations between the to-be-learned features and the original multimodal RS data, providing an interpretable MFL model. More specifically, the contributions of this paper can be highlighted as follows.
• Three multimodal RS benchmark datasets are prepared and built with application to land cover classification. They are diversified, including the homogeneous HS-MS Houston2013 datasets, the heterogeneous HS-SAR Berlin datasets, and the heterogeneous HS-SAR-DSM Augsburg datasets. We will make them freely and openly available after a possible publication, contributing to the RS and information fusion communities. Currently, heterogeneous RS feature learning and fusion techniques are less investigated, particularly on three-modality datasets. To the best of our knowledge, this is the first time that heterogeneous RS benchmark datasets simultaneously involving HS, SAR, and DSM data have been opened for land cover classification.
• A shared and specific feature learning (S2FL) model is devised to extract diagnostic features from multimodal RS data. S2FL is capable of decoupling different modalities into shared and specific feature spaces, respectively, by aligning the common components between multi-modalities on manifolds. Such separable properties tend to capture fine-grained differences between categories, further yielding better classification results.
• An alternating direction method of multipliers (ADMM)-based optimization framework is customized for the fast and accurate solutions of the proposed S2FL model.
The rest of this paper is organized as follows. In Section II, an in-depth literature review of related work in MFL is given. Section III then elaborates on the methodology of our proposed S2FL model as well as the ADMM-based model optimization process. In Section IV, extensive experiments are conducted on three multimodal RS benchmarks in comparison with several state-of-the-art baselines. Finally, Section V summarizes this paper with some important conclusions and hints at potential future research trends.

Related Work
Image-level fusion is a straightforward way to enhance certain information (e.g., spatial or spectral resolutions) of homogeneous data, such as RS image pansharpening (Ehlers et al., 2010) and HS-MS fusion (Wei et al., 2015). However, there are more heterogeneous RS data in reality, typically optical and SAR data, which can provide richer and more complementary information. Image-level fusion cannot fully meet the demands of the heterogeneous data fusion task; feature-level learning and fusion are needed. For decades, extensive efforts have been made by researchers to develop a variety of MFL algorithms for land cover classification of RS data. These existing methods can be roughly categorized into two main groups from the perspective of their fusion strategies, i.e., concatenation-based MFL and alignment-based MFL.

Concatenation-based MFL Models
As the name suggests, concatenation-based MFL models obtain the fused features by

• first stacking the input multimodal RS images and then passing them through a certain feature extractor or learner;

• or first extracting or learning the feature representations for each modality and then stacking them as the input of a certain classifier.
There have been many classic and state-of-the-art models related to concatenation-based MFL in RS. Morphological operators (Fauvel et al., 2008), as the main member of the feature extractor family, have been widely and successfully applied to multimodal RS image feature extraction and classification. For instance, Liao et al.
generalized the graph embedding model (Hong et al., 2020b) for the fusion of the morphological profiles of HS and LiDAR data in land cover classification (Liao et al., 2014). In (Rasti et al., 2017), a novel component analysis model based on total variation was designed to further refine the feature representations of extinction profiles (Fang et al., 2017) obtained from HS and LiDAR data. Yokoya et al. extracted morphological features of time-series MS data and corresponding OpenStreetMap data and concatenated them as the classifier input (Yokoya et al., 2017). Ma et al. used multisource RS data for "Ghost City" phenomenon identification (Ma et al., 2018). The authors of (Chen et al., 2017) proposed to fuse multisource RS data by means of deep networks for accurate land cover classification. Ref. (Xia et al., 2019) developed a semi-supervised graph fusion model for HS and LiDAR data classification. Inspired by the recent advancement of deep learning (DL) techniques in data representation, numerous learning-based approaches have been developed for multimodal RS feature fusion (Zhu et al., 2017). Hong et al. (2021) proposed, for the first time, a general and unified deep learning framework for multimodal RS image classification. In (Hang et al., 2020), a coupled convolutional neural network was employed to fuse heterogeneous RS data at the feature level. However, there is room for improvement in the interpretability of DL-based methods; interpretable knowledge embedding can guide the network learning towards better solutions (or results) more effectively. Furthermore, although the past decades have witnessed a favorable development of concatenation-based MFL in the RS community, the ability to fully take advantage of the diverse information of different modalities, especially heterogeneous data, remains limited.

Alignment-based MFL Models
Unlike the above feature concatenation strategy, alignment-based MFL methods seek to learn a common feature set shared by multiple modalities by means of well-known manifold alignment (MA) techniques (Wang and Mahadevan, 2011). Introducing MA into RS applications, Tuia et al. aligned multi-view MS data to a consistent representation in a semi-supervised way (Tuia et al., 2014). (Tuia and Camps-Valls, 2016) projected the multimodal data to a higher-dimensional kernel space, where different data sources can be better aligned. Moreover, the semi-supervised model in (Tuia et al., 2014) was improved by using a mapper-induced graph structure for the fusion of optical and polarimetric SAR data (Hu et al., 2019). Inspired by the MA idea, Hong et al. proposed an MA-regularized representation learning model, simultaneously involving subspace learning and ridge regression, CoSpace for short (Hong et al., 2019a). CoSpace is capable of learning aligned representations across multi-modalities, thereby yielding more effective information fusion. The same investigators further extended the CoSpace model by replacing ridge regression with sparse regression, generating an ℓ1-norm version of the CoSpace model, i.e., ℓ1-CoSpace (Hong et al., 2020a). Pournemat et al. (Pournemat et al., 2020) proposed a semi-supervised charting approach for multimodal MA in the spectral domain. There are also some MA-based variants successfully applied to other RS-related applications, such as visualization (Liao et al., 2016), dimensionality reduction, and cross-modality retrieval (Hong et al., 2019c). Admittedly, the alignment-based strategy is capable of performing well in heterogeneous data fusion by means of an information-sharing mechanism. Yet capturing only the shared information across multi-modalities is hardly sufficient to learn better multimodal feature representations, due to the lack of modeling or fusing modality-specific properties.

Method Overview
To enhance the representation ability of multimodal data fusion and reduce the information loss (possibly due to only considering modality-shared properties and ignoring the modality-specific ones), we seek to find a more discriminative feature
space by disentangling different data sources into shared and specific domains. Using the to-be-learned features, a better decision boundary is expected to be obtained in the classification task. For this purpose, a shared and specific feature learning model, called S2FL, is devised by aligning shared components between multi-modalities on the latent manifold subspace and simultaneously separating out their specific information. Such a modeling strategy is interpretable and effective for learning multimodal RS feature representations, further yielding great potential in land cover classification. Fig. 1 illustrates the flowchart of the proposed S2FL model.

Notation
Let X_k ∈ R^{d_k×N} be the unfolded matrix with respect to the k-th modality with d_k channels by N pixels, and K be the number of all considered modalities. Y ∈ R^{C×N} denotes the one-hot encoded label matrix, where C is the number of categories. Θ_0 ∈ R^{d_s×Σ_{k=1}^{K} d_k} and Θ_k ∈ R^{d_s×d_k} denote the shared subspace projection and the specific subspace projection with respect to the k-th modality, respectively, and d_s is the feature (or subspace) dimension. P ∈ R^{C×d_s} is defined as the regression matrix that connects the subspace and the label information (i.e., Y). I, ‖X‖_F, and tr(X) denote the identity matrix, the Frobenius norm of the matrix X, and the trace of the matrix X, respectively. Moreover, the Laplacian matrix is denoted by L, which can be computed by D − W, where W is the adjacency matrix of X and D is a diagonal degree matrix whose diagonal elements are given by D_ii = Σ_{j≠i} W_{i,j}. Table 1 gives the detailed definitions of these variables used in the S2FL model.

Table 1: The symbols of variables used in the proposed S2FL model as well as their description and size, where we take the bi-modality as an example, i.e., K = 2.

Symbols | Description | Size
d_s | the dimension of the feature space or subspace | 1 × 1
Θ_0 | the shared subspace projection for all considered modalities | d_s × (d_1 + d_2)
Θ_k | the specific subspace projection for the k-th modality | d_s × d_k
W | the adjacency matrix of the to-be-aligned modalities | 2N × 2N
D | the degree matrix of the matrix W | 2N × 2N
tr(X) | the trace of the matrix X | 1 × 1
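To make the notation concrete, the following minimal numpy sketch builds the unfolded matrices X_k, the one-hot label matrix Y, and a Laplacian L = D − W; all shapes and values are toy examples of our own, not the paper's data.

```python
import numpy as np

# Toy bi-modality setup (K = 2): a 4-band and a 2-band modality unfolded
# to d_k x N matrices, with N = 6 pixels and C = 3 classes.
rng = np.random.default_rng(0)
N, C = 6, 3
X1 = rng.standard_normal((4, N))   # modality 1, d_1 x N
X2 = rng.standard_normal((2, N))   # modality 2, d_2 x N

labels = np.array([0, 1, 2, 0, 1, 2])
Y = np.eye(C)[labels].T            # one-hot label matrix, C x N

# Laplacian L = D - W for a symmetric adjacency W with zero diagonal;
# the degree matrix D collects the row sums D_ii = sum_{j != i} W_ij.
W = rng.random((N, N)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
D = np.diag(W.sum(axis=1))
L = D - W
```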

Problem Formulation
With the aforementioned problem statement and the given definitions of variables, we model the S2FL problem as follows:

min_{P, Θ_0, {Θ_k}_{k=1}^{K}} (1/2)‖[Y_1, ..., Y_K] − P(Θ_0 + Θ)X‖_F^2 + (α/2)‖P‖_F^2 + (β/2) tr(Θ_0 X L (Θ_0 X)^T),
s.t. Θ_0 Θ_0^T = I, Θ_k Θ_k^T = I, k = 1, ..., K,   (1)

where X = blkdiag(X_1, ..., X_K) stacks the modalities block-diagonally, Θ = [Θ_1, ..., Θ_K] concatenates the specific projections, and α and β denote the penalty parameters to balance the importance of different terms.
Please note that Y 1 , Y 2 , ..., and Y K only aim to show the groups from 1 to K corresponding to different modalities, and actually represent the same labels, i.e., Y.
The optimization problem of the first term in Eq. (1) is highly ill-posed due to its large number of degrees of freedom (e.g., the simultaneous estimation of the coupled variables P and Θ). To this end,

• we regularize the variable P with its Frobenius norm, denoted as ‖P‖_F, in order to stabilize the convergence process and improve the generalization ability of the model;

• the modality-shared and modality-specific components can be learned and separated on a latent subspace by means of one common projection Θ_0 for all modalities and K characteristic projections {Θ_k}_{k=1}^{K} for each individual modality. This information-sharing process can be performed by using the MA regularization, i.e., tr(Θ_0 X L (Θ_0 X)^T), and the remaining information is then naturally irrelevant and unique between different modalities. Moreover, the joint graph W in L consists of K^2 subgraphs, as illustrated in Fig. 2. In W, the block-diagonal part is the intra-modality subgraph, which can be obtained by using the Gaussian kernel function with width σ as follows:

W_{i,j} = exp(−‖x_i − x_j‖_2^2 / (2σ^2)) if x_j belongs to the q nearest neighbors of x_i, and W_{i,j} = 0 otherwise,   (2)

and the rest is the inter-modality subgraph, which can be directly constructed by means of the label information, leading to the following discriminative graph structure (Hong et al., 2019a):

W_{i,j} = 1/N_C if x_i and x_j both belong to the C-th class, and W_{i,j} = 0 otherwise,   (3)

where N_C is the number of samples for the C-th class;

• the orthogonal constraints with respect to the variables Θ_0 and {Θ_k}_{k=1}^{K} are added to the S2FL model to reduce the degrees of freedom and shrink the solution space effectively, thereby finding better local optimal solutions.
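The joint graph construction above can be sketched in numpy as follows. The toy data, the helper names, and the choice q = 3 are our own; the intra-modality subgraphs use a Gaussian kernel over each sample's q nearest neighbors, and the inter-modality subgraph uses the label-driven 1/N_C entries.

```python
import numpy as np

def intra_graph(X, q=3, sigma=1.0):
    """N x N Gaussian-kernel graph over each sample's q nearest neighbours."""
    N = X.shape[1]
    d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)  # squared distances
    W = np.zeros((N, N))
    for i in range(N):
        nn = np.argsort(d2[i])[1:q + 1]                      # skip self
        W[i, nn] = np.exp(-d2[i, nn] / (2 * sigma ** 2))
    return np.maximum(W, W.T)                                # symmetrise

def inter_graph(labels):
    """N x N label-driven graph: W_ij = 1/N_c if i and j share class c."""
    counts = np.bincount(labels)
    same = labels[:, None] == labels[None, :]
    return np.where(same, 1.0 / counts[labels][:, None], 0.0)

# Joint 2N x 2N graph for K = 2 modalities (toy data, illustrative only):
# intra-modality subgraphs on the block diagonal, label graph elsewhere.
rng = np.random.default_rng(1)
N = 8
X1, X2 = rng.standard_normal((4, N)), rng.standard_normal((2, N))
labels = rng.integers(0, 2, N)
Wc = inter_graph(labels)
W = np.block([[intra_graph(X1), Wc], [Wc.T, intra_graph(X2)]])
L = np.diag(W.sum(axis=1)) - W                               # Laplacian L = D - W
```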

Model Optimization
The optimization problem of Eq. (1) is typically non-convex, and its global minimum is usually hard to find. We can, however, expect to obtain local optimal solutions by alternately optimizing separable convex subproblems with respect to the to-be-estimated variables P, Θ_0, and {Θ_k}_{k=1}^{K}. Algorithm 1 gives an overview of the optimization process of the proposed S2FL model.

Learning Linear Regression Matrix -P
The optimization with respect to the variable P is nothing but a least-squares regression problem with a common Tikhonov-Phillips regularization. This subproblem can then be written as

min_P (1/2)‖Ỹ − PS‖_F^2 + (α/2)‖P‖_F^2,   (4)

where Ỹ = [Y_1, ..., Y_K] and S = (Θ_0 + Θ)X, which has the following analytical solution:

P = ỸS^T(SS^T + αI)^{−1}.   (5)

Algorithm 1 S2FL: Global optimization
Require: Y, X, L, and the parameters α, β, σ, and maxIter.
1: Initialize the parameters {Θ_k}_{k=1}^{K} using locality preserving projections (LPP) (He and Niyogi, 2004) and Θ_0 = 0. The iteration starts with t = 1 and the tolerated error is set to ζ = 10^{−4}.
2: Compute the adjacency matrix W using Eqs. (2) and (3) as well as the Laplacian matrix L.
3: for t = 1 to maxIter do
4:   Learn the linear regression matrix P using Eq. (5).
5:   Learn the shared subspace projection matrix Θ_0 by Algorithm 2.
6:   Learn the specific subspace projection Θ_k for each modality (k = 1, ..., K).
7:   if t > 1 then
8:     Calculate the losses of the objective function in the t-th and (t − 1)-th iterations, denoted as E_t and E_{t−1}, and check the stopping condition on their relative change against ζ.
9:     if the condition is satisfied then
10:      Stop iteration.
11:    else
12:      t ← t + 1.
13:    end if
14:  end if
15: end for
Ensure: Linear regression matrix P, shared subspace projection Θ_0 for all considered modalities, and modality-specific subspace projections {Θ_k}_{k=1}^{K}.
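The closed-form ridge-regression update of P (Eq. (5)) can be sketched as below; the shapes are toy values, and S stands for the projected subspace features under our reading of the model.

```python
import numpy as np

# Tikhonov-regularised least squares: P = Y S^T (S S^T + alpha I)^(-1),
# with Y the C x N one-hot labels and S the d_s x N projected features.
rng = np.random.default_rng(2)
C, ds, N, alpha = 3, 5, 20, 1e-2
Y = np.eye(C)[rng.integers(0, C, N)].T     # C x N one-hot labels
S = rng.standard_normal((ds, N))           # d_s x N subspace features

P = Y @ S.T @ np.linalg.inv(S @ S.T + alpha * np.eye(ds))
```

The solution satisfies the stationarity condition P(SS^T + αI) = YS^T, i.e., the gradient of the regularized objective vanishes.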

Learning Shared Subspace Projection Matrix -Θ 0
The constrained optimization problem with respect to the variable Θ_0 can be formulated as follows:

min_{Θ_0} (1/2)‖Ỹ − P(Θ_0 X + ΘX)‖_F^2 + (β/2) tr(Θ_0 X L (Θ_0 X)^T), s.t. Θ_0 Θ_0^T = I.   (6)

Problem (6) can be optimized by designing an ADMM-based solver (Boyd et al., 2011). More specifically, two auxiliary variables H and G are introduced into Eq. (6) to replace Θ_0 X in the first term and Θ_0 in the constraint term, respectively; we then have the following equivalent form of Eq. (6):

min_{Θ_0, H, G} (1/2)‖Ỹ − P(H + ΘX)‖_F^2 + (β/2) tr(H L H^T), s.t. H = Θ_0 X, G = Θ_0, GG^T = I.   (7)

We further rewrite Eq. (7) into its augmented Lagrangian function:

L(Θ_0, H, G, Λ_1, Λ_2) = (1/2)‖Ỹ − P(H + ΘX)‖_F^2 + (β/2) tr(H L H^T) + ⟨Λ_1, H − Θ_0 X⟩ + ⟨Λ_2, G − Θ_0⟩ + (μ/2)(‖H − Θ_0 X‖_F^2 + ‖G − Θ_0‖_F^2),   (8)

where the variables Λ_1 and Λ_2 denote the Lagrange multipliers and μ is the regularization parameter.
Under the ADMM optimization framework, problem (8) can be effectively solved by successively minimizing the objective function L with respect to the variables Θ_0, H, G, Λ_1, and Λ_2, respectively, while the other variables are fixed.
Optimization with respect to Θ_0: The optimization problem of the variable Θ_0 can be formulated as follows:

min_{Θ_0} ⟨Λ_1, H − Θ_0 X⟩ + ⟨Λ_2, G − Θ_0⟩ + (μ/2)(‖H − Θ_0 X‖_F^2 + ‖G − Θ_0‖_F^2),   (9)

hence its closed-form solution is

Θ_0 = (HX^T + Λ_1 X^T/μ + G + Λ_2/μ)(XX^T + I)^{−1}.   (10)

Optimization with respect to H: We estimate the variable H by solving the following optimization problem:

min_H (1/2)‖Ỹ − P(H + ΘX)‖_F^2 + (β/2) tr(H L H^T) + ⟨Λ_1, H − Θ_0 X⟩ + (μ/2)‖H − Θ_0 X‖_F^2.   (11)

For Eq. (11), it is straightforward to derive the analytical solution from the stationarity condition

(P^T P + μI)H + βHL = P^T(Ỹ − PΘX) + μΘ_0 X − Λ_1,   (12)

which is a standard Sylvester equation in H.

Optimization with respect to G: The optimization problem with the orthogonal constraint for the variable G is

min_G ⟨Λ_2, G − Θ_0⟩ + (μ/2)‖G − Θ_0‖_F^2, s.t. GG^T = I,   (13)

whose solution can be obtained by means of a well-known solver, i.e., splitting orthogonality constraints (SOC) (Lai and Osher, 2014). The SOC method solves the orthogonality-constrained problem in two steps:

• 1) Problem (13) is equivalently recast as the nearest-orthogonal-matrix problem min_G ‖G − (Θ_0 − Λ_2/μ)‖_F^2, s.t. GG^T = I;

• 2) the orthogonality is satisfied when the variable G is updated by

G = UV^T, where UΣV^T is the singular value decomposition of Θ_0 − Λ_2/μ.   (14)

Updating the Lagrange multipliers Λ_1 and Λ_2:

Λ_1 ← Λ_1 + μ(H − Θ_0 X), Λ_2 ← Λ_2 + μ(G − Θ_0).   (15)

Updating the regularization parameter μ:

μ ← min(ρμ, μ_max),   (16)

where ρ > 1 and μ_max denote the scaling factor and the maximal value of μ within finite steps, respectively. The specific optimization procedures for estimating the variable Θ_0, i.e., solving problem (6), are summarized in Algorithm 2.

Algorithm 2 Subproblem optimization with respect to the variable Θ_0
Require: Y, P, X, L, the parameter β, and maxIter.
1: Initialize Θ_0 = 0 and the auxiliary variables, e.g., H = G = Λ_1 = Λ_2 = 0. The regularization parameter is μ = 10^{−3}, whose maximal limitation and scaling factor are μ_max = 10^6 and ρ = 1.5, respectively. The iteration starts with t = 1 and the tolerated error of the variable is set to ε = 10^{−6}.
2: for t = 1 to maxIter do
3:   Fix the other variables and update H using Eq. (12).
4:   Fix the other variables and update Θ_0 using Eq. (10).
5:   Fix the other variables and update G using Eq. (14).
6:   Update the Lagrange multipliers Λ_1 and Λ_2 using Eq. (15) and the parameter μ using Eq. (16).
7:   if the variable change is below ε then
8:     Stop iteration.
9:   else
10:    t ← t + 1.
11:  end if
12: end for
Ensure: Shared subspace projection Θ_0.
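A single ADMM pass for the Θ_0 subproblem (Eq. (6)) can be sketched as follows; this is a toy-shape illustration under our reading of the updates, where the residual R, the identity placeholder for the Laplacian, and all sizes are assumptions rather than the paper's data.

```python
import numpy as np
from scipy.linalg import solve_sylvester, svd

# Toy problem data: fixed P and residual R = Y - P*Theta*X.
rng = np.random.default_rng(3)
C, ds, d, N = 3, 4, 6, 10
X = rng.standard_normal((d, N))            # stacked modality data
P = rng.standard_normal((C, ds))           # fixed regression matrix
R = rng.standard_normal((C, N))            # residual held fixed here
Lap = np.eye(N)                            # toy Laplacian placeholder
beta, mu = 1e-1, 1e-3

Theta0 = np.zeros((ds, d))
H = np.zeros((ds, N)); G = np.zeros((ds, d))
Lam1 = np.zeros((ds, N)); Lam2 = np.zeros((ds, d))

# H-step: (P^T P + mu I) H + beta H Lap = P^T R + mu Theta0 X - Lam1,
# a Sylvester equation solved in closed form.
H = solve_sylvester(P.T @ P + mu * np.eye(ds), beta * Lap,
                    P.T @ R + mu * Theta0 @ X - Lam1)
# Theta0-step: least-squares coupling of H and G.
Theta0 = (H @ X.T + Lam1 @ X.T / mu + G + Lam2 / mu) \
         @ np.linalg.inv(X @ X.T + np.eye(d))
# G-step (SOC): project Theta0 - Lam2/mu onto the orthogonal matrices via SVD.
U, _, Vt = svd(Theta0 - Lam2 / mu, full_matrices=False)
G = U @ Vt
# Multiplier and penalty updates.
Lam1 += mu * (H - Theta0 @ X)
Lam2 += mu * (G - Theta0)
mu = min(1.5 * mu, 1e6)
```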

Learning Specific Subspace Projection Matrices
The solution to the variable Θ_k can be obtained using the same solver as for problem (6). The only difference is that there is no MA term, i.e., (β/2) tr(Θ_0 X L (Θ_0 X)^T), in the optimization problem of {Θ_k}_{k=1}^{K}. As a result, Algorithm 2 can be directly applied to estimate the variables {Θ_k}_{k=1}^{K} as well.

Convergence Analysis and Computational Cost
The global optimization given in Algorithm 1 can be performed by alternating minimization. Block coordinate descent (BCD) is a commonly-used strategy to solve such a problem, and BCD converges well in theory as long as the convexity of each subproblem is met (Bazaraa et al., 2013). The ADMM solver in Algorithm 2 is actually a variant of the inexact Augmented Lagrange Multiplier (ALM) method (Lin et al., 2010), which has been successfully used for solving multi-block ADMM-based optimization problems (Chen et al., 2016; Deng et al., 2017; Wang et al., 2018). Up to the present, the convergence of the multi-block ADMM in some common cases (e.g., Algorithm 2) has been well studied in various practical settings (Zhou et al., 2016; Xu et al., 2019; Yao et al., 2019), although a strict mathematical proof remains to be perfected. Moreover, we experimentally record the relative loss of the objective function in each iteration to draw the convergence curves of the proposed S2FL model on the three datasets used in the experiment section, as shown in Fig. 3.
Furthermore, it is clear that the overall computational cost of the optimization problem (1) (i.e., our S2FL model) is mainly dominated by classic matrix algebra, such as matrix multiplication and matrix inversion. In detail, the update of the linear regression matrix P is dominated by forming and inverting the d_s × d_s matrix SS^T + αI. For each iteration of Algorithm 2 in solving subproblem (6), optimizing Θ_0 and H yields comparable costs, governed by the matrix products with X and the solution of the corresponding linear systems.

Experiments

HS-MS Houston2013 Data

The scene consists of HS and MS data, which is a typical homogeneous dataset. The original HS image is available from the IEEE GRSS data fusion contest 2013 and has been widely used for land cover classification. This image covers the campus area of the University of Houston, Texas, USA, with 349 × 1905 pixels and 144 channels spanning the spectral range from 0.38 µm to 1.05 µm. To make full use of the high spectral information of the HS image and the high spatial information of the MS image, we generate the HS-MS Houston2013 benchmark datasets by degrading the original HS image in the spatial and spectral domains. More specifically:

Spectral degradation: The low spectral resolution MS image is obtained by using the spectral response functions (SRFs) of the Sentinel-2 sensor. The resulting MS image has the same size as the original HS image and 8 spectral bands at a ground sampling distance (GSD) of 2.5 m.

Spatial degradation: The low spatial resolution HS image with a 10 m GSD is generated by means of bilinear interpolation. To meet the pixel-to-pixel correspondences between the HS and MS images, the degraded HS image is re-upsampled to the size of the MS image, i.e., 349 × 1905, by using nearest neighbor interpolation.
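The spatial-degradation step described above can be sketched with scipy as follows; the toy cube, the factor r = 4, and the sizes are illustrative only, and the SRF-based spectral degradation with the real Sentinel-2 SRFs is not reproduced here.

```python
import numpy as np
from scipy.ndimage import zoom

# Bilinear downsampling of an HS cube by a factor r, then nearest-neighbour
# re-upsampling to restore pixel-to-pixel correspondence with the MS grid.
rng = np.random.default_rng(4)
hs = rng.random((12, 16, 20))                       # bands x rows x cols (toy)
r = 4                                               # e.g. 2.5 m -> 10 m GSD
lr_hs = zoom(hs, (1, 1 / r, 1 / r), order=1)        # order=1: bilinear
hs_up = zoom(lr_hs, (1, r, r), order=0)             # order=0: nearest neighbour
```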
Table 2 lists the types of ground objects and the numbers of training and test samples used for the classification task on the HS-MS Houston2013 datasets; correspondingly, Fig. 4 visualizes the false-color images and the distribution of the training and test sets.

HS-SAR Berlin Data
The simulated EnMAP data, synthesized from HyMap HS data, covers the urban area of Berlin and its rural neighborhood at a 30 m GSD and can be freely downloaded from the website. In detail, there are 797 × 220 pixels in this scene, where 244 spectral bands are given in the wavelength range of 0.4 µm to 2.5 µm. More details can be found in (Okujeni et al., 2016). To obtain corresponding SAR data of the same region, we download a Sentinel-1 dual-Pol (VV-VH) single look complex (SLC) product from ESA, collected under the Interferometric Wide (IW) swath mode. With the help of the ESA toolbox SNAP, we build a pre-processing workflow to turn the SLC product into an analysis-ready SAR image. The workflow includes orbit file application, radiometric calibration, deburst, speckle reduction, terrain correction, and region-of-interest extraction. The analysis-ready SAR image used in this paper is geocoded in the UTM/WGS84 coordinate system; it is spatially averaged during speckle reduction and saved as a 2 × 2 PolSAR covariance matrix. Since the azimuth resolution of the Sentinel-1 data is approximately 13 m, the processed SAR image has a 13.89 m GSD and consists of 1723 × 476 pixels. Similar to the first dataset, nearest neighbor interpolation is performed on the HS image, enabling the same image size as the SAR data. The ground reference data for land cover classification is generated using OpenStreetMap data (Haklay and Weber, 2008); information regarding the training and test sets is shown in Fig. 4 and Table 3.

HS-SAR-DSM Augsburg Data
This dataset consists of three different data sources, including a spaceborne HS image, a dual-Pol PolSAR image, and a DSM image, over the city of Augsburg, Germany, where the HS and DSM data are acquired by DAS-EOC, DLR, and the PolSAR data are collected from the Sentinel-1 platform. They are collected by the HySpex sensor (Baumgartner et al., 2012), the Sentinel-1 sensor, and the DLR-3K system (Kurz et al., 2011), respectively. To evaluate the performance of multimodal fusion classification effectively, we downsample the spatial resolution of all images to a unified 30 m GSD. The scene comprises 332 × 485 pixels, with 180 spectral bands ranging from 0.4 µm to 2.5 µm for the HS image, 1 band for the DSM image, and 4 features from the dual-Pol (VV-VH) SAR image. Note that the SAR data is preprocessed in the same way as the SAR component of the HS-SAR Berlin data. The 4 features are the VV intensity, the VH intensity, and the real and imaginary parts of the off-diagonal element of the PolSAR covariance matrix. The ground reference data is generated from OpenStreetMap data. Detailed information regarding the training and test sets is shown in Fig. 4 and Table 4.
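Extracting the four SAR features from a per-pixel 2 × 2 dual-Pol covariance matrix can be sketched as below; the toy complex scattering values are ours, and the per-pixel covariance construction is a simplified reading that omits the spatial averaging.

```python
import numpy as np

# Per-pixel dual-Pol covariance C = [[VV VV*, VV VH*], [VH VV*, VH VH*]];
# the 4 features are VV intensity, VH intensity, and Re/Im of C[0, 1].
rng = np.random.default_rng(5)
rows, cols = 3, 4
vv = rng.standard_normal((rows, cols)) + 1j * rng.standard_normal((rows, cols))
vh = rng.standard_normal((rows, cols)) + 1j * rng.standard_normal((rows, cols))

cov = np.empty((rows, cols, 2, 2), dtype=complex)
cov[..., 0, 0] = vv * np.conj(vv)
cov[..., 0, 1] = vv * np.conj(vh)
cov[..., 1, 0] = vh * np.conj(vv)
cov[..., 1, 1] = vh * np.conj(vh)

features = np.stack([cov[..., 0, 0].real,        # VV intensity
                     cov[..., 1, 1].real,        # VH intensity
                     cov[..., 0, 1].real,        # Re of off-diagonal
                     cov[..., 0, 1].imag])       # Im of off-diagonal
```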

Experimental Setup and Preparation

Evaluation Metrics
In the experiments, land cover classification is regarded as a potential application to evaluate the quality of the multimodal feature representations learned by the proposed S2FL model.More specifically, we compute three widely-used criteria, i.e., Overall Accuracy (OA), Average Accuracy (AA), and Kappa Coefficient (κ) to make a quantitative performance comparison using a nearest neighbor (NN) classifier.
In our evaluation pipeline, we first apply the proposed S2FL model to extract or learn the multimodal feature representations and then feed them into a classifier (NN in our case). The main reason for selecting the NN classifier is that we expect the performance gain to be owed to the features learned by our proposed method rather than to an advanced classifier, e.g., a support vector machine (SVM), random forest (RF), or deep learning-based classifier.
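The three criteria can be computed from a confusion matrix as in the following sketch; the definitions are the standard ones, while the helper name and the toy labels are ours.

```python
import numpy as np

def classification_metrics(y_true, y_pred, C):
    """OA, AA, and Cohen's kappa from a C x C confusion matrix."""
    M = np.zeros((C, C))
    for t, p in zip(y_true, y_pred):
        M[t, p] += 1
    n = M.sum()
    oa = np.trace(M) / n                              # overall accuracy
    aa = np.mean(np.diag(M) / M.sum(axis=1))          # mean per-class recall
    pe = (M.sum(axis=1) @ M.sum(axis=0)) / n ** 2     # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

oa, aa, kappa = classification_metrics([0, 0, 1, 1, 2, 2],
                                       [0, 0, 1, 0, 2, 2], C=3)
```

On this toy example, one misclassified sample of class 1 gives OA = AA = 5/6 and κ = 0.75.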

Implementation Details
The algorithm performance, to a great extent, depends on parameter tuning, e.g., the regularization parameters in CoSpace and our S2FL. As a result, selecting a proper range for these parameters is of great significance in practical applications. For this reason, a 10-fold cross validation is conducted on the training set to determine the parameter combination of the different methods. These parameters are tuned within a given range to maximize the classification accuracy. For example, the number of nearest neighbors (q) and the parameter σ for computing the adjacency matrix W in USMA and S2FL are selected from {5, 10, ..., 50} and {10^−2, 10^−1, 10^0, 10^1, 10^2}, respectively. The regularization parameters, e.g., α and β in CoSpace and S2FL, are determined in the range {10^−3, 10^−2, 10^−1, 10^0, 10^1, 10^2}. Moreover, the feature dimension is also a key parameter, which has a great effect on the quality of the final learned feature representations. The optimal feature dimension (d) is searched from 5 to 50 at an interval of 5, according to the best classification performance on the training set.
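The grid search described above can be sketched as follows; `evaluate` is a hypothetical stand-in for the 10-fold cross-validated training accuracy, and its placeholder score is purely illustrative.

```python
import itertools

# Parameter grids as described in the text.
q_grid     = range(5, 55, 5)                          # nearest neighbours q
sigma_grid = [10.0 ** e for e in range(-2, 3)]        # kernel width sigma
ab_grid    = [10.0 ** e for e in range(-3, 3)]        # alpha and beta
d_grid     = range(5, 55, 5)                          # feature dimension d

def evaluate(q, sigma, alpha, beta, d):
    # Hypothetical scorer; a real run would cross-validate the model here.
    return -abs(alpha - 1e-2) - abs(beta - 1.0)

# Exhaustive search for the best-scoring parameter combination.
best = max(itertools.product(q_grid, sigma_grid, ab_grid, ab_grid, d_grid),
           key=lambda p: evaluate(*p))
```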
It should be noted, however, that the DSM image only holds one feature band in the Augsburg datasets, hence attribute profiles are first extracted to fully utilize the spatial information of the DSM image, extending the feature bands to 3, before performing feature learning and classification.

Results and Analysis on Houston2013 Datasets
Table 5 lists the quantitative classification results of the compared feature learning methods on the HS-MS Houston2013 datasets, including OA, AA, κ, and the accuracy for each class. Overall, the joint use of multiple modalities (e.g., HS+MS) performs obviously better than single modalities in the three main indices, i.e., OA, AA, and κ. The classification accuracies obtained by JDR-PCA are basically the same as those obtained by jointly using HS and MS data. MA-based approaches (e.g., USMA, SMA) fail to classify the materials well, due to their sensitivity to various complex noises; this indirectly leads to relatively poor performance compared to PCA and HS+MS. Owing to the use of label (supervised) information, the classification accuracy of SMA is higher than that of USMA, bringing an increase of about 2 percentage points in OA. Moreover, the joint learning-based group, e.g., ℓ2-CoSpace, ℓ1-CoSpace, and S2FL, dramatically outperforms the other competitors, both in the main indices (OA, AA, and κ) and in the accuracy for the majority of categories. The performance of ℓ1-CoSpace is superior to that of ℓ2-CoSpace (approximately 2 percentage points improvement), since a feature selection strategy is performed in ℓ1-CoSpace by means of the sparsity-promoting ℓ1-norm term. More remarkably, our proposed S2FL model obtains a higher classification result, with a gain of around 5 percentage points in OA over ℓ1-CoSpace, which ranks second. In addition, S2FL also plays a dominant role in the classification accuracy of each class; that is, S2FL achieves the best classification performance in many categories (e.g., Soil, Commercial, Road, Highway, Railway, Tennis Court, and Running Track), which demonstrates its effectiveness and superiority in the land cover classification task.
Visually, Fig. 5 shows the integral classification maps of the compared algorithms in the given scene. There is a basically identical trend in the quantitative and qualitative comparisons (between Fig. 5 and Table 5). It is worth noting that the classification map obtained by the proposed S2FL model has not only more structurally geometric information for man-made materials, e.g., Residential, Commercial, Parking Lot, etc., but also more detailed textural information, e.g., for the ground objects Grass and Tree.

Parameter Analysis of S2FL Model
Parameter (or model) selection is a key factor that wields significant influence over the performance gain of a feature learning method. It is necessary, therefore, to discuss and analyze the parameter sensitivity. There are five main parameters in the S2FL model, i.e., the number of nearest neighbors (q) and the parameter σ in Eq. (2), the regularization parameters (α and β) in Eq. (1), and the feature dimension (d). As shown in Fig. 6, the regularization parameter α and the feature dimension d yield a higher change in terms of OA for different input values, compared to the other three parameters. Moreover, the classification accuracy is relatively insensitive to the number of nearest neighbors q, while the parameters σ and β moderately control the change of OA. From an overall perspective, the OA value remains stable as long as these parameters are selected in a proper range. The optimal parameter combination is (q, σ, α, β, d) = (45, 10^0, 10^{-2}, 10^0, 30), which is basically identical to that obtained by cross-validation on the training set. This, to some extent, demonstrates the acceptable practicability of the proposed S2FL model in real applications.
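The cross-validated parameter selection can be sketched as an exhaustive grid search. In the sketch below, `cv_score` is a toy surrogate (a real run would train and evaluate S2FL per fold), and the grid values are assumed ranges rather than the exact ones used in Fig. 6:

```python
import itertools

# hypothetical grids spanning the ranges explored in the sensitivity analysis
grid = {
    "q":     [15, 30, 45, 60],
    "sigma": [0.1, 1.0, 10.0],
    "alpha": [1e-3, 1e-2, 1e-1],
    "beta":  [0.1, 1.0, 10.0],
    "d":     [10, 20, 30, 40],
}

def cv_score(params):
    """Placeholder for the cross-validated OA of S2FL on the training set.
    A toy surrogate peaking at (45, 1.0, 1e-2, 1.0, 30) stands in; a real
    implementation would fit and evaluate the model for each fold."""
    target = {"q": 45, "sigma": 1.0, "alpha": 1e-2, "beta": 1.0, "d": 30}
    return -sum(params[k] != target[k] for k in grid)

best = max(
    (dict(zip(grid, combo)) for combo in itertools.product(*grid.values())),
    key=cv_score,
)
print(best)  # {'q': 45, 'sigma': 1.0, 'alpha': 0.01, 'beta': 1.0, 'd': 30}
```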

Ablation Analysis of the Proposed S2FL Model
To verify the effectiveness of the proposed idea (i.e., shared and specific feature disentangling) in the S2FL model, we investigate the performance gain when using different components, i.e., only modality-shared features (obtained via Θ0) or only modality-specific features (obtained via Θk). The quantitative results in terms of the OA, AA, and κ indices on the Houston2013 datasets are reported in Table 7. More specifically, the results of S2FL using only one modality (e.g., HS, MS) are reported, which are better than those without feature learning (the first two rows in Table 7). We also consider the cases of using only shared features or only specific features in our S2FL model for land cover classification. As can be seen from Table 7, the performance of S2FL with multimodal input is superior to that using only single modalities, while the modality-specific information is more important than the modality-shared information (OA: 83.11 vs. 78.77) in the classification task. Moreover, we found that the performance degrades dramatically without the orthogonal constraint in the S2FL model. Not unexpectedly, our S2FL with a joint combination of shared and specific features achieves the best results.
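The shared/specific split with an orthogonal constraint can be illustrated with random stand-in projections: a QR factorization yields column-orthonormal bases, so the shared block Θ0 and the specific block Θk satisfy Θ0ᵀΘk = 0 by construction. All names and dimensions below are illustrative, not the actual learned subspaces.

```python
import numpy as np

rng = np.random.default_rng(1)
n_bands, d_shared, d_spec = 50, 10, 10

# stand-in subspace bases: orthonormal columns via (reduced) QR, so the
# shared projection theta0 and the specific projection theta_k are
# mutually orthogonal, mirroring the constraint in simplified form
Q, _ = np.linalg.qr(rng.standard_normal((n_bands, d_shared + d_spec)))
theta0, theta_k = Q[:, :d_shared], Q[:, d_shared:]
assert np.allclose(theta0.T @ theta_k, 0.0, atol=1e-10)

# project a toy modality (pixels x bands) and fuse both components
X = rng.standard_normal((100, n_bands))
fused = np.concatenate([X @ theta0, X @ theta_k], axis=1)
print(fused.shape)  # (100, 20)
```

The orthogonality ensures the two components carry non-redundant information, which is what makes their concatenation more discriminative than either part alone.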

Cross-modality Experiments: A Special Case of Multi-modality
Taking the bi-modality case as an example, cross-modality learning (CML) refers, for simplicity, to training a model using two modalities while one modality is absent in the testing phase, or vice versa (only one modality is available for training and bi-modality for testing) (Ngiam et al., 2011). Such a CML problem exists widely in various RS tasks and is more applicable to real-world cases. In recent years, some works have been proposed to investigate the CML issue and apply it to land cover classification. More details regarding the CML setting can be found in (Hong et al., 2019a,c). Table 8 lists the quantitative comparison between the single modalities and the CML cases using our S2FL. There is a basically consistent trend for both HS and MS data. That is, the classification accuracies of directly using the original spectral features (either HS or MS) are lower than those using learned features (e.g., via S2FL). In addition, the features learned under the S2FL-CML setting can better classify the land cover materials compared to those learned directly from the single modalities (e.g., S2FL-HS or S2FL-MS), showing the effectiveness of the proposed S2FL model for both multi-modality learning and the CML case.
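A minimal sketch of the CML inference setting, with random matrices standing in for the projections that S2FL would learn from the bi-modal training data (all dimensions and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n_train, n_test = 200, 50
b_hs, b_ms, d = 50, 8, 15

# stand-ins for projections assumed to have been learned jointly on the
# bi-modal (HS + MS) training set
theta_hs = rng.standard_normal((b_hs, d))
theta_ms = rng.standard_normal((b_ms, d))

# training: both modalities are embedded into the common d-dim subspace
X_hs_tr = rng.standard_normal((n_train, b_hs))
X_ms_tr = rng.standard_normal((n_train, b_ms))
Z_train = np.concatenate([X_hs_tr @ theta_hs, X_ms_tr @ theta_ms], axis=0)

# testing under CML: only HS is available, yet it maps into the same
# subspace, so a classifier trained on Z_train still applies
X_hs_te = rng.standard_normal((n_test, b_hs))
Z_test = X_hs_te @ theta_hs
print(Z_train.shape, Z_test.shape)  # (400, 15) (50, 15)
```

The key point is that the missing modality never has to be reconstructed: both modalities share one latent space, so any single modality suffices at test time.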

Results and Analysis on Berlin Datasets
Unlike the homogeneous HS-MS data, the heterogeneity between HS and SAR data remains challenging for feature learning and fusion. Therefore, the quantitative comparison conducted on the Berlin datasets (see Table 6) is meaningful for the performance assessment of MFL models. As can be seen in Table 6, the heterogeneous datasets make it difficult to perform land cover classification, yielding a sharp decrease in classification performance compared to the results on the Houston2013 datasets. Nevertheless, the overall trend across the different algorithms is similar. The classification accuracy using the concatenation of HS and SAR data is still higher than that using only single modalities. PCA-based joint feature learning obtains a result basically the same as that of HS+SAR. Similarly, USMA and SMA fail to align the heterogeneous data well. The reasons are two-fold: the sensitivity to noise, and the consideration of only the shared component representations across modalities. The CoSpace family, i.e., ℓ2-CoSpace and ℓ1-CoSpace, is robust to noise effects by learning a latent subspace to bridge the multimodal data and label information, bringing increments of 5.84 and 8.01 percentage points in OA, respectively, over SMA. Not unexpectedly, our S2FL method achieves the best performance, with further improvements of 3.39, 3.83, and 3.06 percentage points with respect to OA, AA, and κ over ℓ1-CoSpace. Additionally, the S2FL model also obtains desirable classification accuracy for each class in comparison with the other approaches.
Fig. 7 shares a similar visual comparison with the quantitative results. Using only SAR data yields a poor classification map with extensive noisy points. Notably, our method is capable of fully blending the HS and SAR information by means of the interpretable shared and specific feature learning mechanism, thereby reducing the noisy pixels and generating smoother classification maps. Particularly for those ground objects that hold rich texture information, e.g., Forest, Low Plants, the S2FL model tends to capture their subtle differences against noise by means of the specific information of each modality. Furthermore, the learned common components can depict the structural information, thereby further identifying the materials Residential and Commercial effectively.

Results and Analysis on Augsburg Datasets
We further investigate the generalization ability and effectiveness of the proposed S2FL model in the case of three-modality data, i.e., HS, SAR, and DSM. Table 9 details the classification results of the different compared algorithms. Generally, there is an obvious improvement (around 4 percentage points) in OA when jointly using the three modalities, compared to using HS, SAR, and DSM independently. When the number of considered modalities increases to three, the performance of the MA-based models dramatically declines, especially USMA, which lacks label guidance. It is worth noting, however, that the CoSpace-based approaches, modeled by either the ℓ2-norm or the ℓ1-norm, can be effectively extended to the case of three modalities, yielding significant improvements in classification accuracy. Feature selection works well in handling the issue of multiple heterogeneous data; that is, the classification result of ℓ1-CoSpace is higher than that of ℓ2-CoSpace, with an increase of 5.32 percentage points in OA. As expected, S2FL performs better than ℓ1-CoSpace, with an increase of approximately 2 percentage points in OA. More importantly, the AA value obtained by S2FL is far higher than those of the other competitors, where almost every class achieves the highest classification results. This, to a great extent, shows the superiority of our proposed shared and specific learning strategy (i.e., the S2FL model).
Furthermore, the classification maps shown in Fig. 8 also give strong support to the aforementioned conclusion. By means of multiple modalities, S2FL can obtain more realistic material identification in land cover mapping. In particular, materials such as Forest, Residential Area, and Low Plants are classified in a smoother fashion, showing semantically meaningful structure.

Ablation Study on the Use of Multiple Modalities
To further verify the effectiveness and superiority of the proposed S2FL model in multimodal RS data feature learning and fusion, we investigate the performance gain obtained by using different combinations of multiple modalities. More specifically, three high-performance MFL algorithms, i.e., ℓ2-CoSpace, ℓ1-CoSpace, and S2FL, are used for quantitative comparison on the HS-SAR-DSM Augsburg datasets, as detailed in Table 10. On the whole, several important and intuitive conclusions can be summarized as follows:
• The joint exploitation of multiple modalities can break the performance bottleneck in land cover classification. For example, HS+SAR+DSM usually obtains better classification results than the use of only two modalities.
• Characterized by rich spectral information, the HS image tends to identify the materials more accurately than SAR and DSM.
• The HS+SAR results obtained by ℓ2-CoSpace are even slightly better than those of HS+SAR+DSM. This indicates that, to some extent, the ℓ2-CoSpace method fails to effectively fuse the multimodal information as the number of modalities increases.
• Feature selection guided by the sparsity-promoting ℓ1-norm is an effective strategy for MFL. The resulting ℓ1-CoSpace observably outperforms ℓ2-CoSpace for different modality combinations.
• By decoupling the multimodal data into shared and specific components, S2FL is capable of better learning feature representations of multimodal data (cf. ℓ2-CoSpace and ℓ1-CoSpace), further yielding higher classification performance with either two or three modalities.
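For reference, the two- and three-modality combinations compared in this ablation can be enumerated programmatically; a trivial sketch:

```python
from itertools import combinations

modalities = ["HS", "SAR", "DSM"]
combos = [list(c) for r in (2, 3) for c in combinations(modalities, r)]
print(combos)
# [['HS', 'SAR'], ['HS', 'DSM'], ['SAR', 'DSM'], ['HS', 'SAR', 'DSM']]
```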
It is worth noting that currently-developed DL-based approaches have shown great potential in the fusion and representation learning of multimodal RS data, yet these methods inevitably suffer from possible performance degradation. Facing these problems, the proposed S2FL model could, to a great extent, offer capabilities that DL methods do not provide in terms of robustness, interpretability, and insensitivity to the training set size.

Conclusion
Land cover classification has long been considered a main research topic in the RS and geoscience community. As a crucial step, feature extraction has received much attention from researchers. However, the representation ability of features extracted from only a single RS data source remains limited. Fortunately, the rapid development of RS imaging techniques makes multimodal RS data available on a large scale. To speed up the development of multimodal RS data processing and analysis, in this paper we open three multimodal RS benchmark datasets: the homogeneous HS-MS Houston2013, the heterogeneous HS-SAR Berlin, and the three-modality HS-SAR-DSM Augsburg datasets. Furthermore, we propose a novel MFL model, called S2FL, yielding more discriminative and compact feature blending by learning modality-shared and modality-specific representations. Compared with previously-proposed advanced MFL methods, the S2FL model obtains the best classification performance on the three datasets, being obviously superior to the other competitors. We will release the three benchmark datasets and the MFL toolbox, including the newly-proposed S2FL model, contributing to the RS and information fusion community. In future work, we would like to extend these datasets to a larger scale and develop corresponding feature learning models, e.g., based on more powerful deep learning techniques, by embedding more interpretable knowledge or priors to guide the network optimization in the multimodal feature learning task.

Figure 1 :
Figure 1: An illustration to clarify the learning process for the shared and specific subspaces (or features) of multimodal RS data in the proposed S2FL model. The to-be-estimated variables Θ0, {Θk} (k = 1, 2), and P denote the shared subspace projection, the specific subspace projections, and the regression matrix, respectively. Note that we take the bi-modality case as an example here.

Figure 2 :
Figure 2: An example to illustrate the adjacency matrix (W).

Figure 3 :
Figure 3: Convergence analysis of the S2FL model on three multimodal RS benchmark datasets: HS-MS Houston2013, HS-SAR Berlin, and HS-SAR-DSM Augsburg.

Figure 4 :
Figure 4: False-color visualization as well as the spatial distribution of training and test samples of three multimodal RS benchmark datasets: HS-MS Houston2013, HS-SAR Berlin, and HS-SAR-DSM Augsburg.
respectively, while updating the variable G requires computing an SVD with cost on the order of O(min(d_k^2 d_s K^2, d_s^2 d_k K)). The final update of the specific subspace projection matrices {Θk} (k = 1, ..., K) bears the same complexity as that of Algorithm 2.

Figure 5 :
Figure 5: Classification maps obtained by different MFL algorithms on the Houston2013 datasets.

Figure 6 :
Figure 6: Parameter sensitivity analysis of the S2FL model conducted on the HS-MS Houston2013 datasets, e.g., the number of nearest neighbors and σ in Eq. (2), the regularization parameters α and β in Eq. (1), and the feature dimension.

Table 8 :
Quantitative comparison using the S2FL model under the cross-modality setting. "Only HS (MS)" means directly using the original HS (MS) data as features, "S2FL-HS (MS)" means single-modality feature learning (e.g., HS or MS), and "S2FL-CML-HS (MS)" means the CML setting, i.e., training on HS-MS data with only the HS (MS) data available in the testing phase.

Figure 7 :
Figure 7: Classification maps obtained by different MFL algorithms on the Berlin datasets.

Figure 8 :
Figure 8: Classification maps obtained by different MFL algorithms on the Augsburg datasets.

Table 2 :
Description of the investigated HS-MS Houston2013 datasets, including the types of ground objects and the corresponding number of training and test samples.

Table 3 :
Description of the investigated HS-SAR Berlin datasets, including the types of ground objects and the corresponding number of training and test samples.

Table 4 :
Description of the investigated HS-SAR-DSM Augsburg datasets, including the types of ground objects and the corresponding number of training and test samples.

Table 5 :
Quantitative results of the different compared approaches in terms of OA, AA, and κ, as well as the accuracy for each class, on the HS-MS Houston2013 datasets using the NN classifier, where the parameters are determined by cross-validation on the training sets. The best one is shown in bold.

Table 6 :
Quantitative results of the different compared approaches in terms of OA, AA, and κ, as well as the accuracy for each class, on the HS-SAR Berlin datasets using the NN classifier, where the parameters are determined by cross-validation on the training sets. The best one is shown in bold.

Table 7 :
Ablation analysis of the proposed S2FL model on the Houston2013 datasets.

Table 9 :
Quantitative results of the different compared approaches in terms of OA, AA, and κ, as well as the accuracy for each class, on the HS-SAR-DSM Augsburg datasets using the NN classifier, where the parameters are determined by cross-validation on the training sets. The best one is shown in bold.