X-ModalNet: A Semi-Supervised Deep Cross-Modal Network for Classification of Remote Sensing Data

This paper addresses the problem of semi-supervised transfer learning with limited cross-modality data in remote sensing. A large amount of multi-modal earth observation images, such as multispectral imagery (MSI) or synthetic aperture radar (SAR) data, are openly available on a global scale, enabling parsing global urban scenes through remote sensing imagery. However, their ability in identifying materials (pixel-wise classification) remains limited, due to the noisy collection environment and poor discriminative information as well as limited number of well-annotated training images. To this end, we propose a novel cross-modal deep-learning framework, called X-ModalNet, with three well-designed modules: self-adversarial module, interactive learning module, and label propagation module, by learning to transfer more discriminative information from a small-scale hyperspectral image (HSI) into the classification task using a large-scale MSI or SAR data. Significantly, X-ModalNet generalizes well, owing to propagating labels on an updatable graph constructed by high-level features on the top of the network, yielding semi-supervised cross-modality learning. We evaluate X-ModalNet on two multi-modal remote sensing datasets (HSI-MSI and HSI-SAR) and achieve a significant improvement in comparison with several state-of-the-art methods.


Introduction
Currently operational radar (e.g., Sentinel-1) and optical broadband (multispectral) satellites (e.g., Sentinel-2 and Landsat-8) make synthetic aperture radar (SAR) [43] and multispectral image (MSI) [84] data openly available on a global scale. Consequently, there has been growing interest in understanding our environment through remote sensing (RS) images, which benefits many applications such as image classification [70,26,68,9], object and change detection [87,75,86,74], mineral exploration [19,38,34,80], and multi-modality data analysis [36,39,79,30], to name a few. In particular, RS data classification is a fundamental but still challenging problem across the computer vision and RS fields. It aims to assign a semantic category to each pixel in a studied urban scene. For example, in [20], spectral-spatial information is used to suppress the influence of noise in dimensionality reduction, and the proposed method is effective in extracting nonlinear features and improving classification accuracy.
Recently, enormous efforts have been made to develop deep learning (DL)-based approaches [46], such as deep neural networks (DNNs) and convolutional neural networks (CNNs), to parse urban scenes using street view images. Yet this task is less investigated at the level of satellite-borne or aerial images. Bridging advanced learning-based techniques or vision algorithms with RS imagery could enable a variety of new applications on a larger, even global, scale. A qualitative comparison is given in Table 1 to highlight the differences, advantages, and disadvantages of the classification task when using different scene images (e.g., street view or RS images).

Motivation and Objective
We clarify our motivation by answering the following three "why" questions: 1) Why classify or parse RS images? 2) Why use multimodal data? 3) Why learn the cross-modal representation?
• From Street to Earth Vision Remotely sensed imagery can provide new insight for global urban scene understanding. The data in Earth Vision, on the one hand, benefit from a "bird's perspective," providing structure-related multiview surface information; on the other hand, they are acquired on a wider and even global scale.
• From Unimodal to Multimodal Data Limited by low image resolution and a handful of labeled samples, unimodal RS data inevitably hit a bottleneck in performance gain, despite being openly and widely available. Therefore, an alternative way to maximize the classification accuracy is to jointly leverage multimodal data.
• From Multimodal to Crossmodal Learning In reality, large amounts of information-rich data, such as hyperspectral imagery (HSI), can hardly be collected due to technical limitations of satellite sensors. Thus, only limited multimodal correspondences can be used to train a model, while one modality is absent in the test phase. This is a typical cross-modality learning (CML) issue.
Figure 1: Top: Given a large-scale urban area in yellow, both SAR in magenta and MSI in chestnut are openly and widely available with a high spatial resolution but limited by poor feature discrimination power, while the HSI in red is information-rich but available only at a small scale, as shown in area 1 overlapped with SAR (or MSI). Bottom: The model is trained on multiple modalities (e.g., HSI-MSI or HSI-SAR) with sparse training labels, and one modality is absent in the process of predicting.
Fig. 1 illustrates the problem to be solved and a potential solution, where MSI in magenta or SAR in cyan is freely available at a large and even global scale but is limited by relatively poor feature representation ability, while the HSI in red is characterized by rich spectral information but cannot be acquired over a large area. This naturally leads to a general but interesting question: can a limited amount of spectrally discriminative HSI improve the parsing performance of a large amount of low-quality data (SAR or MSI) in a large-scale classification or mapping task? A feasible solution to this problem is CML.
Motivated by the above analysis, the CML issue that we aim at tackling can be further generalized to three specific challenges related to computer vision or machine learning.
• RS images acquired from the satellites or airplanes inevitably suffer from various variations caused by environmental conditions (e.g., illumination and topology changes, atmospheric effects) and instrumental configurations (e.g., sensor noise).
• Multimodal RS data are usually characterized by different properties. Blending multi/cross-modal representations in a more effective and compact way is still an important challenge in our case.
• RS images in Earth Vision can provide a larger-scale visual field. This tends to lead to costly labeling and noisy annotations in the process of data preparation.
In view of these three factors, our objective is to develop novel approaches, or improve existing ones, that yield a more discriminative multimodality blending and are robust against the various variabilities in RS images, given a limited number of training annotations.

Method Overview and Contributions
Towards the aforementioned goals, a novel cross-modal DL framework, called X-ModalNet, is proposed in a semi-supervised fashion for RS image classification. As illustrated in Fig. 2, a three-stream network is developed to learn the multimodal joint representation while taking unlabeled samples into account, where the network parameters are shared among streams of the same modality. Moreover, an interactive learning strategy is modeled across the two modalities to blend information more effectively. Prior to the interactive learning (IL) module, we also embed a self-adversarial (SA) module that is robust against noise attacks, thereby enhancing the model's generalization capability. To make full use of unlabeled samples, we iteratively update pseudo-labels by label propagation (LP) on the graph constructed from high-level hidden representations. Extensive experiments are conducted on two multimodal datasets (HSI-MSI and HSI-SAR), showing the effectiveness and superiority of the proposed X-ModalNet in the RS data classification task.
The main contributions can be highlighted as follows:
• To the best of our knowledge, this is the first work to investigate HSI-aided CML by designing a deep cross-modal network (X-ModalNet) in the RS field, with the goal of improving the classification accuracy obtained using only MSI or SAR with the aid of a limited amount of HSI samples.
• Given the high spatial resolution of MSI (SAR) and the high spectral resolution of HSI, X-ModalNet is a novel and promising network architecture that takes a hybrid network as its backbone, that is, a CNN for MSI or SAR and a DNN for HSI. Such a design makes full use of the high spatial information of MSI or SAR and the rich spectral information of HSI.
• We propose two novel plug-and-play modules, the SA module and the IL module, aiming at improving the robustness and discrimination of the multimodal representation. On the one hand, we modularize the idea of generative adversarial networks (GANs) [23] into the network to generate robust feature representations by simultaneously learning original features and adversarial features in the SA module. On the other hand, we design the IL module for better information blending across modalities by interactively sharing the network weights to generate more discriminative and compact features.
• We embed an updatable LP mechanism into the proposed end-to-end network, progressively optimizing pseudo-labels to find a better decision boundary.
• We validate the superiority and effectiveness of X-ModalNet on two cross-modal datasets with extensive ablation analysis, where we collected and processed the Sentinel-1 SAR data for the second dataset.

Scene Parsing
Most recently, research on scene parsing has made unprecedented progress, owing to powerful DNNs [44]. Most of the state-of-the-art DL-based frameworks for scene parsing [81,56,76,89,47,85,12,61] are closely associated with two seminal works built on the deep CNN prototype: the fully convolutional network [49] and DeepLab [12]. However, the nearly horizontal field of view of street-level images makes it difficult to parse a large urban area without extremely diverse training samples. Therefore, RS images might be a feasible and desirable alternative.
RS imagery has attracted increasing interest in the computer vision field [45,77,51], as it generally holds a diversified and structured source of information, which can be used for better scene understanding and may enable a significant breakthrough in global urban monitoring and planning [15]. Chen et al. [14] fed a vector-based input into a DNN to predict the category labels in the HSI. They extended their work by training a CNN for spatial-spectral HSI classification [13]. Hang et al. [27] utilized a cascaded RNN to parse HSI scenes. Notably, scene parsing in Earth Vision is normally performed by training an end-to-end network with a vector-based or patch-based input, as the sparse labels (see Fig. 1) are insufficient to train an FCN-like model. As listed in Table 1, RS images are noisy and of low resolution, and their labeling is relatively expensive and time-consuming, which limits the performance improvement. A feasible solution to this issue is to introduce other modalities (e.g., HSI) with more discriminative information, yielding multimodal data analysis.

Multi/Cross-Modal Learning
Multimodal representation learning related to DNN can be categorized into two aspects [6].

Joint Representation Learning
The basic idea is to find a joint space in which a discriminative feature representation is learned over multiple modalities with multilayered neural networks. Although some recent works have attempted to address the CML issue using a joint representation learning strategy, e.g., [35,30], these methods remain limited in data representation and fusion, particularly for heterogeneous data, due to their linearized modeling. A representative work in multimodal deep learning (MDL) was proposed by Ngiam et al. [54], in which the high-level features of each modality are extracted using a stacked denoising autoencoder (SDAE) and then jointly learned into a multimodal representation by an additional encoder layer. The work in [64] extended this approach to a semi-supervised version by adding a term to the loss function that predicts the labels. Similarly, Srivastava et al. utilized the deep belief network [66] and deep Boltzmann machines [67] to explain multimodal data fusion or learning from the perspective of probabilistic graphical models. In [60], a multimodal DL method with cross weights (MDL-CW) is proposed to interactively represent the multimodal features for more effective information blending. Besides, follow-up works have been proposed to learn the joint feature representation more effectively and efficiently [57,73,59,63,50,48].

Coordinated Representation Learning
This strategy builds disjoint subnetworks to learn the discriminative features independently for each modality and couples them by enforcing various structured constraints on the resulting encoder layers. These structures can be measured by similarity [18,17], correlation [11], sequentiality [72], etc.
In recent years, some tentative work has been proposed for multimodal data analysis in RS [22,42,52,2,3,83,21]. Related to ours for scene parsing with multimodal deep networks, an early deep fusion architecture, which simply stacks all modalities as input, is used for semantic segmentation of urban RS images [42]. In [3], optical and OpenStreetMap [25] data are jointly used with a two-stream deep network to obtain a faster and better semantic map. Audebert et al. [4] parsed urban scenes under a SegNet-like architecture [5] using MSI and LiDAR. Similarly, Ghosh et al. [21] proposed stacked U-Nets for material segmentation of RS imagery. Nevertheless, these methods are mostly developed with optical (MSI or RGB) or LiDAR data for coarse-grained scene parsing (only a few categories) and fail to perform sufficiently well in a complex urban scene due to the relatively poor feature representation ability of the networks, especially in CML [54].

Semi-Supervised Learning
Considering that labeling is very expensive, particularly for RS images, the use of unlabeled samples has attracted increasing attention as a feasible way to further improve the classification performance of RS data. There have been many non-DL-based semi-supervised learning approaches in a variety of RS-related applications, such as regression-based multitask learning [33,29], manifold alignment [71,40], and factor analysis [88]. Yet this topic is less investigated with DL-based approaches. Cao et al. [10] integrated CNNs and active learning to better utilize the unlabeled samples for hyperspectral image classification. Riese et al. [62] developed a semi-supervised shallow network, a self-organizing map framework, to classify and estimate physical parameters from MSI and HSI. Nevertheless, how to embed semi-supervised techniques into deep networks more effectively remains challenging.

The Proposed X-ModalNet
The CML problem setting drives us to develop a robust and discriminative network for pixel-wise classification of RS images in complex scenes. Fig. 2 illustrates the architecture overview of X-ModalNet, which is built upon a three-stream deep architecture. The IL module is designed for highly compact feature blending before the features of each modality are fed into the joint representation. We also equip the network with the SA module and an iterative LP mechanism to improve the robustness and generalization ability of the proposed X-ModalNet, particularly in the presence of noisy samples.

Network Architecture
The bimodal deep autoencoder (DAE) in [54] is a well-known work in MDL, and we advance it to the proposed X-ModalNet for classification of RS imagery.The differences and improvements mainly lie in four aspects.

Hybrid Network Architecture
Similarly to [8], we propose a hybrid-stream network architecture in a bimodal DAE fashion, including two CNN-streams on the labeled MSI (SAR) and the unlabeled one, and a DNN-stream on HSI, to exploit the high spatial information of MSI/SAR data and the high spectral information of HSI more effectively. This design is motivated by the fact that hyperspectral imaging enables discrimination between spectrally similar classes (high spectral resolution), but its swath width from space is narrow compared to that of multispectral or SAR sensors (high spatial resolution). More specifically, we take the patches centered at pixels as the input of the CNN-streams for labeled and unlabeled MSIs (SARs), and the spectral signatures of the corresponding pixels as the input of the DNN-stream for the labeled HSI. Moreover, the reconstructed patches (CNN-streams) and spectral signatures (DNN-stream) of all pixels, as well as the one-hot encoded labels, can be regarded as the network outputs.
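The input construction described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `extract_inputs`, the patch width of 7, and the reflect padding at image borders are all assumptions made for the example.

```python
import numpy as np

def extract_inputs(msi, hsi, rows, cols, patch=7):
    """Build CNN-stream patches and DNN-stream spectra for given pixels.

    msi: (H, W, B_msi) array; hsi: (H, W, B_hsi) array on the same grid.
    rows, cols: integer arrays with the pixel coordinates to sample.
    patch: odd patch width (hypothetical value, chosen for illustration).
    """
    r = patch // 2
    # Reflect-pad spatially so that border pixels also get full patches.
    padded = np.pad(msi, ((r, r), (r, r), (0, 0)), mode="reflect")
    # Pixel (i, j) sits at (i + r, j + r) in the padded image, so the
    # patch centered on it is padded[i:i + patch, j:j + patch].
    patches = np.stack(
        [padded[i:i + patch, j:j + patch] for i, j in zip(rows, cols)]
    )
    spectra = hsi[rows, cols]  # one spectral signature per sampled pixel
    return patches, spectra
```

Each returned patch has the sampled pixel at its center, and the matching row of `spectra` holds the HSI signature of that same pixel, which is the pairing the hybrid streams rely on.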

Self-Adversarial Module
Due to environmental factors (e.g., illumination, physical and chemical atmospheric effects) and instrumental errors, some distortions in RS imaging are inevitable. These noisy images tend to generate attacked samples, thereby hurting the network performance [69,24,53]. Unlike previous adversarial training approaches [16,7], which first generate adversarial samples and then feed them into a new network for training, we learn the adversarial information at the feature level rather than the sample level, within an end-to-end learning process. This might lead to a more robust feature representation in accord with the learned network parameters. As illustrated in Fig. 3(a), given a vector-based feature input of the module, the network is first split into two streams (NS). It is well known that the discriminator in GANs enables the generation of adversarial examples to fool the networks. Inspired by this, we assume that in our SA module one stream extracts or generates the high-level features of the input, while the other correspondingly learns the adversarial features by imposing an adversarial loss on the top layer (AL). In this process, the discriminator can be regarded as a constraint that enforces this behavior. This strategy has also been shown to be effective in [82].
Finally, the features from the two subnetworks are concatenated as the module output (FC), in order to generate more robust feature representations by considering both the original features and their adversarial counterparts during network training. Moreover, the superiority of our SA module mainly lies in the fact that its parameters are end-to-end trainable within the whole X-ModalNet, which makes the learned adversarial features more suitable for our classification task. By contrast, if we first generated adversarial samples with an independent GAN and fed them into the classification network together with the existing real samples, the generated adversarial samples could introduce uncertainty into the classification performance. The main reason is that samples generated by an independent GAN might suit the GAN but not the classification network, because the two are trained separately.

Interactive Learning Module
We found that in the layer of multimodal joint representation, massive connections occur among variables from the same modality, but few neurons across the modalities are activated, even if each modality passes through multiple individual hidden layers before being fed into the joint layer. Different from the hard-interactive mapping learning in [60,55], which additionally learns weights across the different modalities, we propose a soft-interactive learning strategy that directly copies the weights learned from one modality to the other without additional computational cost or information loss, and then fuses them on the top layer with a simple addition operation, as illustrated in Fig. 3(b). This makes it possible to learn the inter-modality correlations both effectively and efficiently by reducing the gap between the modalities, yielding a smooth blending of the multi-stream networks.
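One possible reading of the soft-interactive strategy is sketched below in NumPy. It is an interpretation under stated assumptions rather than the authors' implementation: the function name `soft_interactive`, the single-layer ReLU mappings, and the equal feature dimension of the two modalities are all hypothetical choices made for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def soft_interactive(x1, x2, W1, W2):
    """Fuse two modalities by reusing each other's weights (illustrative).

    x1, x2: (batch, d) features of the two modalities (same d assumed).
    W1, W2: (d, k) weights learned for modality 1 and modality 2.
    """
    # Each modality mapped with its own weights.
    own = relu(x1 @ W1) + relu(x2 @ W2)
    # The same weights copied to the other modality: no new parameters.
    cross = relu(x1 @ W2) + relu(x2 @ W1)
    # Fusion on the top layer by simple addition.
    return own + cross
```

Because the cross terms reuse `W1` and `W2` instead of introducing separate cross-modal weights, the parameter count matches that of two independent streams, which is the property the text attributes to the soft-interactive design.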

Label Propagation Module
Beyond supervised learning, we also consider the unlabeled samples by incorporating label propagation (see Fig. 3(c)) into the networks to further improve the model's generalization. The main workflow in the LP module is detailed as follows:
• We first train a classifier on the training set (SVMs are used in our case) and predict the unlabeled samples using the trained classifier. These predicted results (pseudo-labels) can be regarded as the network ground truth of the unlabeled data stream, which is considered together with the real labels in the network training for multitask learning.
• Next, we train our networks until convergence. We call this process one-round network training. Once one-round network training has been completed, the high-level features extracted from the top of the network (see Fig. 2) are used to update the pseudo-labels using the graph-based LP [90]. The LP algorithm consists of the following two steps.
- Step 1: construct similarity matrix. The similarity matrix $S$ between any two samples [31], e.g., $x_i$ and $x_j$, either labeled or unlabeled, is computed by
$$S_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{\sigma^2}\right), \quad (1)$$
where $\sigma$ is a hyperparameter selected from {0.001, 0.01, 0.1, 1, 10, 100} by cross-validation on the training set.
- Step 2: propagate labels over all samples. Before carrying out LP, a label transfer matrix ($P$), e.g., from sample $i$ to sample $j$, is defined as
$$P_{ij} = \frac{S_{ij}}{\sum_{k=1}^{N} S_{ik}}, \quad (2)$$
where $N$ is the number of samples. Given $M$ labeled and $N - M$ unlabeled samples with $C$ categories, a soft label matrix $Y \in \mathbb{R}^{N \times C}$ is constructed, which consists of a labeled matrix $Y_l \in \mathbb{R}^{M \times C}$ and an unlabeled matrix $Y_u \in \mathbb{R}^{(N-M) \times C}$, both obtained by one-hot encoding. Our goal is to update the matrix $Y$. The update rule in the $t$-th ($t \geq 1$) iteration is as follows: 1) update $Y^t$ by $P Y^{t-1}$; 2) reset $Y^t_l$ in $Y^t$ to the original labels, i.e., $Y^t_l = Y_l$; 3) repeat steps 1) and 2) until convergence.
We feed these updated pseudo-labels, i.e., $Y_u$, into the next round of network training. The workflow is repeated until the pseudo-labels no longer change. We found experimentally that three to four repetitions are usually enough for the model to converge.
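The two LP steps can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' code: it assumes a Gaussian similarity with bandwidth `sigma`, a row-normalized transfer matrix, and that the labeled samples occupy the first rows of the feature matrix; the function name and the stopping tolerance are hypothetical.

```python
import numpy as np

def label_propagation(X, Y_l, Y_u, sigma=1.0, max_iter=1000, tol=1e-6):
    """Graph-based label propagation with clamped labeled rows.

    X:   (N, D) high-level features, the M labeled samples first.
    Y_l: (M, C) one-hot labels; Y_u: (N - M, C) initial pseudo-labels.
    Returns the updated soft labels of the unlabeled samples.
    """
    M = Y_l.shape[0]
    # Step 1: pairwise squared distances and Gaussian similarity.
    # (Dense O(N^2 D) computation; fine for a sketch, costly for large N.)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    S = np.exp(-d2 / sigma ** 2)
    # Step 2: row-normalize S to get the label transfer matrix P.
    P = S / S.sum(axis=1, keepdims=True)
    Y = np.vstack([Y_l, Y_u]).astype(float)
    for _ in range(max_iter):
        Y_next = P @ Y          # 1) propagate labels
        Y_next[:M] = Y_l        # 2) reset the labeled rows
        converged = np.abs(Y_next - Y).max() < tol
        Y = Y_next
        if converged:           # 3) stop once the labels settle
            break
    return Y[M:]
```

Taking `argmax` over each returned row gives the hard pseudo-labels that would be fed into the next round of network training.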

Objective Function
Let $x_{SA}$ and $z_{SA}$ be the input and output of the SA module; their relation is given by Eq. (3), where $G$ is the generative subnetwork that consists of several encoder, batch normalization (BN) [41], and dropout [65] layers (see Fig. 3). Given the inputs of the two modalities, $x^1_{IL}$ and $x^2_{IL}$, in the IL module, its output ($z_{IL}$) is formulated in terms of $MLP$, namely the multi-layer perceptron, which has the same structure as $G$ in Eq. (3), as illustrated in Fig. 3. We denote the different modalities by $x_i$, where $i \in \{o, t, u\}$ stands for the first modality, the second modality, and the unlabeled samples, respectively, and the corresponding $l$-th hidden layer by $z^{(l)}_i$. Accordingly, the network parameters are updated by jointly optimizing an overall loss function that comprises the following terms.
$L_l$ is the cross-entropy loss for labeled samples, while $L_{pl}$ is the cross-entropy loss for pseudo-labeled samples. In addition to these two loss functions, which connect the input data with labels (or pseudo-labels), we consider the reconstruction loss ($L_{rec}$) for each modality as well as for the unlabeled samples, where $\hat{x}_i$ denotes the reconstructed data of $x_i$. The adversarial loss ($L_{adv}$) acts on the SA module and is formulated based on GANs, where $D_i$ represents the discriminator in adversarial training. Linking with Eq. (3), $z^r_i = (z^1_{SA})_i$ and $z^f_i = (z^2_{SA})_i$ are a real/fake pair of data representations on the last layers of the SA module.
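For orientation, the loss terms described above could be written as follows. These forms are assumptions and are not taken from the source: the overall loss is taken to be an unweighted sum, the reconstruction loss a squared error, the adversarial loss the original GAN objective, and the SA output the concatenation of its two stream outputs.

```latex
% Assumed forms, for illustration only.
\begin{align}
  z_{SA} &= \left[ z^{1}_{SA},\; z^{2}_{SA} \right]
    && \text{(concatenation of the two SA streams)} \\
  \mathcal{L} &= \mathcal{L}_{l} + \mathcal{L}_{pl}
    + \mathcal{L}_{rec} + \mathcal{L}_{adv}
    && \text{(unweighted sum, assumed)} \\
  \mathcal{L}_{rec} &= \sum_{i \in \{o,t,u\}} \left\| x_i - \hat{x}_i \right\|_2^2
    && \text{(squared error, assumed)} \\
  \mathcal{L}_{adv} &= \sum_{i} \mathbb{E}\!\left[ \log D_i\!\left(z^{r}_i\right) \right]
    + \mathbb{E}\!\left[ \log\!\left( 1 - D_i\!\left(z^{f}_i\right) \right) \right]
    && \text{(standard GAN form, assumed)}
\end{align}
```

Under the standard GAN convention, the discriminators $D_i$ maximize the adversarial term while the feature-generating streams minimize it; any weighting between the four terms would be an additional hyperparameter.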

Model Architecture
X-ModalNet starts with a feature extractor: two convolution layers with 5×5 and 3×3 convolutional kernels for the MSI or SAR pathway and two fully-connected layers for the HSI pathway. The features then pass through the SA module with two fully-connected layers. Following it, an IL module with two fully-connected layers is connected to the previous outputs. Finally, four fully-connected layers with an additional soft-max layer bridge the hidden layers with the one-hot encoded labels. Table 2 details the network configuration for each layer in X-ModalNet.

Data Description
We evaluate the performance of X-ModalNet on two different datasets. Fig. 4 shows the false-color images for both datasets as well as the corresponding training and test ground truth maps, while the scene categories and the numbers of training and test samples are detailed in Table 3. Two things are particularly noteworthy in our CML setting: 1) the input is vector (or patch)-based due to the sparse ground truth maps; 2) we assume that the HSI is present only during training and is absent in the test phase.

Homogeneous HSI-MSI Dataset
The HSI scene that has been widely used in many works [37,32] consists of 349×1905 pixels with 144 spectral bands in the wavelength range from 380 nm to 1050 nm at a ground sampling distance (GSD) of 10 m (low spatial-resolution), while the aligned MSI with the dimensions of 349×1905×8 is obtained at a GSD of 2.5 m (high spatial-resolution).
Spectral simulation is performed to generate the low-spectral resolution MSI by degrading the reference HSI in the spectral domain using the MS spectral response functions of Sentinel-2 as filters.Using this, the MSI consists of 349×1905 pixels with eight spectral bands at a GSD of 2.5 m.
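The spectral degradation step can be sketched as a weighted average over hyperspectral bands. The NumPy snippet below is an illustration under stated assumptions: the function name `simulate_msi` is hypothetical, and the response matrix `srf` stands in for the actual Sentinel-2 spectral response functions, which are not reproduced here.

```python
import numpy as np

def simulate_msi(hsi, srf):
    """Degrade an HSI cube in the spectral domain.

    hsi: (H, W, B_h) hyperspectral cube.
    srf: (B_m, B_h) non-negative response weights, one row per MS band
         (placeholder for real sensor response curves).
    """
    # Normalize each row so every MS band is a weighted band average.
    srf = srf / srf.sum(axis=1, keepdims=True)
    # Contract the HSI band axis with the response weights -> (H, W, B_m).
    return np.tensordot(hsi, srf.T, axes=([2], [0]))
```

With a 144-band input and an 8-row response matrix, the output has eight bands on the same spatial grid, matching the band counts quoted in the text.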
Spatial simulation is performed to generate the low-spatial resolution HSI by degrading the reference HSI in the spatial domain using an isotropic Gaussian point spread function, thus yielding the HSI with the dimensions of 349 × 1905 × 144 at a GSD of 10 m by upsampling to the MSI's size.

Heterogeneous HSI-SAR Dataset
The EnMap benchmark HSI covering the Berlin urban area is freely available from the website¹. This image consists of 797 × 220 pixels with a GSD of 30 m and 244 spectral bands ranging from 400 nm to 2500 nm. According to the geographic coordinates, we downloaded the SAR image of the same scene from the Sentinel-1 satellite, with a size of 1723 × 476 pixels at a GSD of 13 m and four polarimetric bands [78]. The SAR image is dual-polarimetric data collected in interferometric wide swath mode. It is organized as a commonly used four-component PolSAR covariance matrix (four bands) [78]. Note that we upsample the HSI to the same size as the SAR image by nearest-neighbor interpolation.
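Nearest-neighbor upsampling of the HSI to the SAR grid can be sketched as below. This is a minimal NumPy illustration with a hypothetical function name; it uses a pixel-center index mapping, which is one common convention and not necessarily the one used by the authors.

```python
import numpy as np

def nn_resize(img, out_h, out_w):
    """Resize an (H, W) or (H, W, B) array by nearest-neighbor lookup."""
    h, w = img.shape[:2]
    # Map each output pixel center back to a source row/column index.
    rows = np.minimum((np.arange(out_h) + 0.5) * h / out_h, h - 1).astype(int)
    cols = np.minimum((np.arange(out_w) + 0.5) * w / out_w, w - 1).astype(int)
    # Broadcasted fancy indexing keeps any trailing band axis intact.
    return img[rows[:, None], cols[None, :]]
```

For the dataset above, `nn_resize(hsi, 1723, 476)` would bring a 797 × 220 HSI onto the SAR grid without creating new spectral values, since every output pixel copies an existing signature.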

Implementation Details
Our approach is implemented on the TensorFlow framework [1]. The network configuration, to our knowledge, always plays a critical role in a practical DL system. The model is trained on the training set, and the hyper-parameters are determined using a grid search on the validation set². In the training phase, we adopt the Adam optimizer with the "poly" learning rate policy [12]. The current learning rate is obtained by multiplying the base one by $(1 - \frac{iter}{maxIter})^{power}$, where the base learning rate and power are set to 0.0005 and 0.98, respectively. We use the DAE to pretrain the subnetworks for each modality, which greatly reduces the training time of the model and makes it easier to find a better local optimum. Also, the momentum is set to 0.9.
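The "poly" policy is simple enough to state in a few lines of Python. The function name `poly_lr` is hypothetical; the default base rate and power are the values quoted above.

```python
def poly_lr(iteration, max_iter, base_lr=0.0005, power=0.98):
    """Learning rate under the 'poly' policy at a given iteration."""
    # Decays from base_lr at iteration 0 to 0 at iteration max_iter.
    return base_lr * (1.0 - iteration / max_iter) ** power
```

With a power just below 1, the schedule is close to a linear decay from the base rate to zero over the course of training.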
To facilitate network training and reduce overfitting, BN and dropout are applied sequentially, prior to the activation functions, in all DL-based methods. Model training ends after 150 epochs for the homogeneous HSI-MSI dataset and 200 epochs for the heterogeneous HSI-SAR dataset, with a minibatch size of 300. Both labeled and unlabeled samples in SAR or MSI share the same network parameters during model optimization.
In the experiments, we found that when the unlabeled samples are drawn from neither the training nor the test set, at approximately the same scale as the test set, the final classification results are similar to those obtained by directly using the test set. Admittedly, the full use of unlabeled samples enables further improvement in classification performance, but there is a trade-off between the limited performance improvement and the exponentially increasing cost of data storage, transmission, and computation. Moreover, we expect to see a performance gain when using the proposed modules, thereby demonstrating their effectiveness and superiority. As a result, for simplicity and for a fair comparison, we select the test set as the unlabeled set for all semi-supervised methods compared.
Furthermore, two commonly used indices: Pixel-wise Accuracy (Pixel Acc.) and mean Intersection over Union (mIoU) are calculated to quantitatively evaluate the parsing performance by collecting all pixel-wise predictions of the test set.Due to random initialization, both metrics show the average accuracy and the variation of the results out of 10 runs.
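Both indices can be derived from a confusion matrix over the test pixels. The NumPy sketch below is illustrative; the function name is hypothetical, and classes absent from both the ground truth and the predictions are excluded from the mIoU average, which is one common convention and an assumption here.

```python
import numpy as np

def pixel_acc_and_miou(y_true, y_pred, num_classes):
    """Pixel-wise accuracy and mean IoU from integer class labels."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    # Confusion matrix: rows are true classes, columns are predictions.
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)
    acc = np.trace(cm) / cm.sum()
    inter = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter
    valid = union > 0  # skip classes that never occur
    miou = (inter[valid] / union[valid]).mean()
    return acc, miou
```

Averaging the two returned values over repeated runs with different random initializations, together with their standard deviation, gives the kind of summary reported in the result tables.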

Comparison with State-of-the-art
Several state-of-the-art baselines closely related to our task (CML) are selected for comparison:
1) Baseline: We train a linear SVM classifier directly on the original pixel-based MSI or SAR features. Note that the hyperparameters of the SVM are determined by 10-fold cross-validation on the training set.
2) Canonical Correlation Analysis (CCA) [28]: We learn a shared latent subspace from the two modalities on the training set, and project the test samples from either of the two modalities into the subspace. This is a typical cross-modal feature learning approach. Finally, the learned features are fed into a linear SVM. We used the code from the website³.
3) Unimodal DAE [14]: This is a classical deep autoencoder. We train a DAE on the target modality (MSI or SAR) in an unsupervised way, and fine-tune it using labels. The hidden representation of the encoder layer is used for the final classification. The code we used is from the website⁴.
4) Bimodal DAE [54]: As a pioneering DL work for multimodal applications, it learns a joint feature representation over the encoder layers generated by the AEs of each modality.
5) Bimodal SDAE [63]: This is a semi-supervised version of the Bimodal DAE that considers the reconstruction loss of all unlabeled samples for each modality and adds a soft-max layer over the encoder layer for the limited labeled data.
Figure 6: Classification maps of ROI on the HSI-SAR dataset. The OpenStreetMap [25] is used as the ground truth generator for this area.

Results on the Homogeneous Datasets
Table 4 shows the quantitative performance comparison in terms of Pixel Acc. and mIoU. Limited by the feature diversity, the baseline yields poor classification performance, while the unimodal DAE brings an improvement of about 2% due to the powerful learning ability of DL-based techniques. For the homogeneous HSI-MSI correspondences, the linearized CCA is more likely to capture the shared features and obtains better classification results. The features can be fused even better over the hidden representations of the two modalities; therefore, the bimodal DAE improves the performance by 2% over CCA. The accuracy of the bimodal SDAE further increases to around 79%, since it trains an end-to-end multimodal network to generate more discriminative features.
Different from the previous strategies, Corr-AE and CorrNet couple two subnetworks by enforcing a structural measurement on the hidden layers, such as Euclidean similarity and correlation, which allows a more effective pixel-wise classification. MDL-CW, which learns cross weights, can facilitate multimodal information fusion, thus achieving better classification results than Corr-AE and CorrNet. As expected, X-ModalNet outperforms these state-of-the-art methods, demonstrating its superiority and effectiveness with a large improvement of at least 6% in Pixel Acc. and mIoU over CorrNet (the second best method).

Results on the Heterogeneous Datasets
As for the former dataset, we quantitatively evaluate the performance on the heterogeneous HSI-SAR scene. The two assessment indices (Pixel Acc. and mIoU) for the different algorithms are summarized in Table 5. The trend in performance improvement across algorithms is basically consistent with that on the first dataset. That is, the performance of X-ModalNet is significantly superior to that of the others, and the methods that use hyperspectral information perform better than those that do not, such as Baseline and Unimodal DAE. It is worth noting that the proposed X-ModalNet brings increments of about 9% Pixel Acc. and 10% mIoU over CorrNet. Moreover, CCA fails to linearly represent the heterogeneous data, leading to a worse parsing result that is even lower than the baseline. Additionally, the gap (or heterogeneity) between SAR and optical data can be effectively reduced by mutually learning weights. This might explain why MDL-CW clearly exceeds most compared methods without such an interactive module (nearly 20% over the baseline), e.g., Bimodal DAE and its semi-supervised version (Bimodal SDAE) as well as CorrNet.

Visual Comparison
Apart from the quantitative assessment, we also make a visual comparison by highlighting a salient region overshadowed by cloud in the Houston2013 dataset. As shown in Fig. 5, our method is capable of identifying various materials more effectively, particularly the material Commercial in the upper-right of the predicted maps. In addition, a trend can be observed: the methods with multi-modal input achieve smoother parsing results than those with single-modality input.
Similarly, we show the classification maps of the compared algorithms in a region of interest in the EnMap dataset in Fig. 6. We can see that X-ModalNet gives a more competitive and realistic parsing result, especially in classifying Soil and Plants, which is closer to the real scene.

Ablation Studies
We analyze the performance gain of X-ModalNet by adding the different components (or modules) step by step. Table 6 lists the progressive performance improvement obtained by gradually embedding the different modules, while Fig. 7 correspondingly visualizes the learned features in the latent space (top encoder layer). It is clear that successively adding each component to X-ModalNet yields more discriminative features.
We also investigate the importance of dropout and BN in avoiding overfitting and improving network performance. As can be seen in Table 6, turning off dropout hinders X-ModalNet from generalizing well, yielding a performance degradation. Worse still, the classification accuracy drops sharply without BN. This could result from inefficient gradient propagation, which hurts the learning ability of the network. Moreover, we can observe from Table 6 that the classification performance without any of the proposed modules is limited, yielding only about 83.14% and 64.44% Pixel Acc. on the two datasets. It is worth noting that the results improve noticeably (by around 2% to 3%) after plugging in the IL module. By introducing the semi-supervised mechanism, our LP module brings increments of 1.5% and 2% Pixel Acc. over using only the IL module for HSI-MSI and HSI-SAR, respectively. Remarkably, when the SA module is added on top of the IL and LP modules, X-ModalNet obtains a further substantial improvement in classification accuracy. These results demonstrate, to a great extent, the effectiveness of the proposed modules and their positive effects on the classification performance.

Robustness to Noises
Neural networks have been shown to be vulnerable to adversarial samples generated by slight perturbations, e.g., imperceptible noise. To study the effectiveness of our SA module against noise or perturbation attacks, we simulate corrupted input by adding Gaussian white noise with different signal-to-noise ratios (SNRs) ranging from 10 dB to 40 dB at 10 dB intervals. Fig. 8 shows a quantitative comparison in terms of Pixel Acc. before and after triggering the SA module.
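The corruption procedure can be sketched as follows. This is a minimal illustration, assuming that the SNR is defined with respect to the mean power of the clean input, so that the noise variance is P_signal / 10^(SNR/10). The helper names and the synthetic sine-wave input are hypothetical stand-ins for the actual image data.

```python
import math
import random
from typing import List, Sequence


def add_awgn(signal: Sequence[float], snr_db: float,
             rng: random.Random) -> List[float]:
    """Add zero-mean Gaussian white noise so the result has roughly snr_db SNR."""
    signal_power = sum(x * x for x in signal) / len(signal)
    # Assumption: SNR(dB) = 10 * log10(signal_power / noise_power).
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean: Sequence[float], noisy: Sequence[float]) -> float:
    """Empirical SNR in dB between a clean signal and its corrupted version."""
    signal_energy = sum(c * c for c in clean)
    noise_energy = sum((n - c) ** 2 for c, n in zip(clean, noisy))
    return 10.0 * math.log10(signal_energy / noise_energy)


# Sweep the SNR levels used in the experiment: 10 dB to 40 dB in 10 dB steps.
rng = random.Random(0)
clean = [math.sin(0.01 * i) for i in range(20000)]  # synthetic stand-in input
corrupted = {snr: add_awgn(clean, snr, rng) for snr in (10, 20, 30, 40)}
```

A lower SNR corresponds to stronger noise, so the 10 dB setting is the harshest of the four conditions.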

Conclusion
In this paper, we investigate the cross-modal classification task by utilizing multimodal satellite or aerial images (RS data). In reality, HSI can only be collected over small local areas owing to the limitations of the imaging system, while MSI and SAR are openly available on a global scale. This motivates us to learn to transfer the HSI knowledge into large-scale MSI or SAR by training the model on both modalities and predicting on only one modality. To address the CML issue in RS, we propose a novel DL-based model, X-ModalNet, with two well-designed components (IL and SA modules) that learn a more discriminative feature representation and robustly resist noise attacks, respectively, and with an iteratively updated LP mechanism that further improves the network performance. In future work, we would like to introduce the physical mechanism of spectral imaging into network learning for earth observation tasks.

Figure 1: Our proposed solution (bottom) for the cross-modality learning problem in RS (top). Top: Given a large-scale urban area in yellow, both SAR in magenta and MSI in chestnut are openly and widely available with a high spatial resolution but are limited by poor feature discrimination power, while the HSI in red is information-rich but available only at a small scale, as shown in area 1 overlapped with SAR (or MSI). Bottom: The model is trained on multiple modalities (e.g., HSI-MSI or HSI-SAR) with sparse training labels, and one modality is absent during prediction.

Figure 2: An overview of the proposed X-ModalNet. It mainly consists of three modules: (a) SA module, (b) IL module, and (c) LP module, installed in a hybrid (MSI or SAR: CNN and HSI: DNN) semi-supervised multimodal DL framework.

Figure 3: An illustration of the three proposed modules in X-ModalNet: (a) SA module, (b) IL module, and (c) LP module. The arrowed solid lines denote the to-be-learned parameters, and their colors indicate the different streams in (a) or modalities in (b). Note that MLP is the abbreviation of multi-layer perceptron [58]. For example, in (b), modality 1 reaches the hidden layer (orange) through the parameters, while modality 2 reaches the hidden layer (green) through the same parameters.

Figure 4: Exemplary datasets for HSI-MSI and HSI-SAR: false-color images and corresponding training and test labels.

Figure 7: t-SNE visualization of the learned multimodal features in the latent space using X-ModalNet with different modules on the two different datasets.

Figure 8: Resistance analysis to noise attack using the proposed X-ModalNet with and without SA module on the two datasets.

Table 1: Qualitative comparison of urban scene parsing using street view images and RS images in terms of goal, acquisition perspective, scene covering scale, spatial resolution, feature diversity, data accessibility, and ground truth maps used for training.

Table 2: Network configuration in each layer of X-ModalNet. FC, Conv, and BN are abbreviations of fully connected, convolution, and batch normalization, respectively. The symbols '↔' and '-' represent parameter sharing and no operation, respectively. Moreover, d_1 and d_2 denote the dimensions of MSI / SAR and HSI, and C is the number of classes. Please note that the reconstruction happens after passing through the first block of the prediction module.

Table 3: The number of training and test samples on two datasets.

Table 4: Quantitative performance comparison with baseline models on the HSI-MSI dataset. The best one is shown in bold.

Table 5: Quantitative performance comparison with baseline models on the HSI-SAR datasets. The best one is shown in bold.

Corr-AE [17]: Coupled AEs are first used to learn a shared high-level feature representation by enforcing a similarity constraint between the encoder layers of the two modalities. The learned features are then fed into a classifier.

8) CorrNet [11]: Similar to Corr-AE, an AE is responsible for extracting the features of each modality, while CCA serves as a link between the features by maximizing their correlations. The code is available from the website (footnote 5).

Classification maps of ROI on HSI-MSI datasets. The ground truth in this highlighted area is manually labelled.

Table 6: Ablation analysis of X-ModalNet with a combination of different modules in terms of Pixel Acc. on two datasets. Moreover, an importance analysis in the presence and absence of BN and dropout operations is discussed as well.