Learnable Manifold Alignment (LeMA) : A Semi-supervised Cross-modality Learning Framework for Land Cover and Land Use Classification

In this paper, we aim at tackling a general but interesting cross-modality feature learning question in the remote sensing community: can a limited amount of highly-discriminative (e.g., hyperspectral) training data improve the performance of a classification task that uses a large amount of poorly-discriminative (e.g., multispectral) data? Traditional semi-supervised manifold alignment methods do not perform sufficiently well for such problems, since hyperspectral data, unlike multispectral data, are too expensive to collect at a large scale owing to the trade-off between coverage and acquisition cost. To this end, we propose a novel semi-supervised cross-modality learning framework, called learnable manifold alignment (LeMA). LeMA learns a joint graph structure directly from the data instead of using a fixed graph defined by a Gaussian kernel function. With the learned graph, we can further capture the data distribution by graph-based label propagation, which enables finding a more accurate decision boundary. Additionally, an optimization strategy based on the alternating direction method of multipliers (ADMM) is designed to solve the proposed model. Extensive experiments on two hyperspectral-multispectral datasets demonstrate the superiority and effectiveness of the proposed method in comparison with several state-of-the-art methods.


Introduction
Multispectral (MS) imagery has been receiving increasing interest in urban areas (e.g., large-scale land-cover mapping [1] [2], building localization [3]), agriculture [4], and mineral mapping [5], as operational optical broadband (multispectral) satellites (e.g., Sentinel-2 and Landsat-8 [6]) make multispectral imagery openly available on a global scale. In general, a reliable classifier needs to be trained on a large number of labeled, discriminative, and high-quality samples. Unfortunately, labeling data, in particular large-scale data, is very grueling and time-consuming. A natural alternative is to draw on abundant unlabeled data, yielding semi-supervised learning. On the other hand, MS data fails to spectrally discriminate similar classes due to its broad spectral bandwidth. A simple remedy is to improve the data quality by fusing highly-discriminative hyperspectral (HS) data [6]. Although such data is expensive to collect, we may expect a small amount of it to be available. These two points motivate us to raise a question related to transfer learning and cross-modality learning: can a limited amount of HS training data, partially overlapping the MS data, improve the performance of a classification task using a large coverage of MS testing data?
Over the past decades, land-cover and land-use classification of optical remote sensing imagery has received increasing attention in unsupervised [7] [8] [9], supervised [10] [11], and semi-supervised settings [12] [13]. To the best of our knowledge, the classification ability of unsupervised learning (or dimensionality reduction) remains limited, due to the missing label information. By fully considering the intra-class and inter-class variability encoded in the labels, supervised learning is able to perform the classification task better. In reality, however, a limited number of labeled samples usually prevents the trained classifier from reaching a high classification performance, and can further lead to failure in some challenging classification or transfer tasks owing to the lack of generalization and representability. Alternatively, semi-supervised learning brings plenty of unlabeled data into the learning process, which makes it possible to better capture the distribution of the different categories and thereby find an accurate decision boundary.
On the other hand, considerable work related to transfer learning (TL) or domain adaptation (DA) has been successfully developed and applied in the remote sensing community [14,15,16,17,18,19]. According to the transferred objects, the TL or DA approaches can be roughly categorized into three groups: parameter adaptation, instance-based transfer, and feature-based alignment or representation.
The seminal work on parameter adaptation was presented in [20] and [21], aiming at transferring an existing classifier (or its parameters) trained on the source domain to the target domain. Differently, instance-based transfer techniques transfer knowledge by reweighting [22] or resampling [23] the samples of the source domain toward those of the target domain. A similar idea based on active learning [24] has also been proposed to address this issue, by selecting the most informative samples in the target domain to replace those samples of the source domain that do not match the data distribution of the target domain [25].
For the final group, feature-based alignment or representation, manifold alignment (MA) is one of the most popular semi-supervised learning frameworks [26] that facilitate transfer learning. MA has been successfully applied to various tasks in the remote sensing community, e.g., classification [27], data visualization [28], and multi-modality data analysis [13]. The key idea of MA can be summarized as learning a common (or shared) subspace in which different data can be aligned to learn a joint feature representation. Generally, existing MA methods can be roughly categorized into unsupervised, supervised, and semi-supervised approaches. The unsupervised approach usually fails to align multimodal data sufficiently well, as the corresponding low-dimensional embeddings may be quite diverse [29]. In the supervised case, aligning only the limited number of training samples to learn a common subspace leads to weak transferability. By preserving a joint manifold structure created from both labeled and unlabeled data, semi-supervised alignment allows different data sources to be better transformed into the common subspace [30].
Although the joint manifold structure used in conventional semi-supervised MA approaches can relate features or instances, the poor connection between the common subspace and the label information still hinders the low-dimensional feature representation from being more discriminative. More importantly, in most graph-based semi-supervised learning algorithms (e.g., graph-based label propagation (GLP) [31] and semi-supervised manifold alignment (S-SMA [13]) [30]), the topology of the unlabeled samples is merely given by a fixed Gaussian kernel function, which is computed in the original space rather than in the common space. This makes it difficult to adaptively transfer unlabeled samples into the learned common subspace, particularly for multimodal data with different numbers of dimensions. To address these issues, we propose a learnable manifold alignment (LeMA) with a data-driven graph learned directly in the common subspace, so as to make the multimodal data comparable and to improve the explainability of the learned common subspace, which further results in better transferability. More specifically, our contributions can be summarized as follows:

• We propose a novel semi-supervised cross-modality learning framework called learnable manifold alignment (LeMA) for a large-scale land-cover classification task. One spectrally-poor MS image and one spectrally-rich HS image are considered as two different modalities, where the spatial extent of the former is a true superset of that of the latter.

• Unlike joint feature learning, in which the model is both trained and tested on complete HS-MS correspondences, LeMA learns an aligned feature subspace from the labeled HS-MS correspondences and partially unlabeled MS data, and allows out-of-sample data to be identified using either MS or HS data; such a learnt subspace is a good fit for our case of cross-modality learning.

• Instead of directly computing the graph structure with a Gaussian kernel function, a data-driven graph learning method is exploited in LeMA in order to strengthen its transfer and generalization abilities.

• An optimization framework based on the alternating direction method of multipliers (ADMM) is designed to solve the proposed model quickly and effectively.
The remainder of this paper is organized as follows. Section II elaborates on our motivation, the proposed LeMA methodology, and the corresponding optimization algorithm. In Section III, we present the experimental results on two HS-MS datasets over the areas of the University of Houston and Chikusei, respectively, and discuss the qualitative and quantitative analysis. Section IV concludes with a summary.

Learnable Manifold Alignment (LeMA)
In this section, the cross-modality learning problem is first cast and our motivation is stated. We then formulate the methodology of the proposed LeMA and elucidate an ADMM-based optimization algorithm to solve it.

Problem Statement and Motivation
For many high-level data analysis tasks in the remote sensing community, such as land-cover classification, data collection plays an important role, since information-rich training samples enable us to easily find an optimal decision boundary.
There is, however, a typical bottleneck in collecting a large amount of labeled and discriminative data. Although MS data is available at a global scale from the Sentinel-2 and Landsat-8 satellites, the identification and discrimination of materials from MS data alone are unattainable at a sufficient accuracy level, owing to its poor spectral information. On the contrary, HS data is characterized by rich spectral information, but it can only be acquired over very small areas, due to the limitations of the imaging sensors. This naturally guides us to jointly utilize the HS and MS bi-modal data, leading to the following interesting and challenging question: can a limited number of HS training samples contribute to the classification of large-scale MS data? A feasible solution can be unfolded into two parts: 1) cross-modality learning: learning a common subspace whose features are expected to absorb the different properties of the HS-MS modalities and in which the HS and MS data can be transferred to each other; 2) semi-supervised learning: embedding massive unlabeled MS samples, which are available in large quantities and easy to collect, so as to learn a more discriminative feature representation. Fig. 1 illustrates the workflow of LeMA.

Problem Formulation
To effectively model the aforementioned issue, we intend to develop a joint learning framework that learns a discriminative common subspace from high-quality HS data and low-quality MS data. Intuitively, such a common subspace can be shaped by selectively absorbing the benefits of both the high-quality data with more spectral detail and the low-quality data with more structural information. Therefore, following a popular joint learning framework [32], we formulate the common subspace learning problem as

min_{Θ,P} (1/2)||Y − PΘX||_F² + (α/2)||P||_F² + (β/2) tr(ΘXLX^TΘ^T),  s.t. ΘΘ^T = I,   (1)

where Y = [Y_l, Y_l] ∈ R^(d×2N) stacks two copies of the label matrix Y_l ∈ R^(d×N) in one-hot encoding (the HS-MS correspondences share the same labels), X = blkdiag(X_H, X_M) ∈ R^((d_H+d_M)×2N), where X_H and X_M stand for the data from the hyperspectral and multispectral domains, respectively, Θ = [Θ_H, Θ_M] is the common subspace projection, and P is the linear projection bridging the common subspace and the label information. L = D − W ∈ R^(2N×2N) stands for a joint Laplacian matrix, where W is the adjacency matrix that measures the similarity between samples and D is the corresponding degree matrix with D_(i,i) = Σ_j W_(i,j). With the orthogonal constraint (ΘΘ^T = I), the globally optimal solutions with respect to the variables Θ and P can be theoretically guaranteed [32].
The first term of Eq. (1) is a fidelity term, and the regularization term (α/2)||P||_F² parameterized by α aims to achieve reliable generalization of the proposed model. The third term acts as supervised manifold alignment (SMA) [26]. We refer to this framework for joint common subspace learning as CoSpace.
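As a concrete reference, the objective of Eq. (1) can be evaluated in a few lines of NumPy. This is an illustrative sketch under our own naming (the paper provides no code); the function simply sums the three terms described above.

```python
import numpy as np

def cospace_objective(Y, X, Theta, P, L, alpha, beta):
    """Evaluate the CoSpace objective (fidelity + ridge + alignment).

    Y     : d x 2N stacked one-hot label matrix
    X     : (d_H + d_M) x 2N block-diagonal HS/MS data matrix
    Theta : d_s x (d_H + d_M) common-subspace projection (Theta Theta^T = I)
    P     : d x d_s regression matrix linking subspace and labels
    L     : 2N x 2N joint graph Laplacian (L = D - W)
    """
    E = Theta @ X                                    # features in the common subspace
    fidelity = 0.5 * np.linalg.norm(Y - P @ E, 'fro') ** 2
    ridge = 0.5 * alpha * np.linalg.norm(P, 'fro') ** 2
    alignment = 0.5 * beta * np.trace(E @ L @ E.T)   # supervised manifold-alignment term
    return fidelity + ridge + alignment
```

Since L is positive semi-definite for a non-negative symmetric W, all three terms are non-negative.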
To further exploit the information of the unlabeled samples, we extend CoSpace in Eq. (1) to LeMA by additionally learning the joint Laplacian matrix, which can be formulated, with extra constraints encoding the necessary conditions on L, as

min_{Θ,P,L} (1/2)||Y − PΘX||_F² + (α/2)||P||_F² + (β/2) tr(ΘX̃LX̃^TΘ^T),
s.t. ΘΘ^T = I, L = D − W, W = W^T, W ≥ 0, diag(W) = 0, ||W||_(1,1) = s,   (2)

where X̃ augments X with the unlabeled MS samples X_U ∈ R^(d_M×N_U), and s > 0 controls the scale of the graph. Note that a feasible and effective way to choose the unlabeled data is to group all samples other than the training samples into a set of landmarks (cluster centers). These landmarks are used as the unlabeled data, which fully exploits the available information while effectively reducing the computational cost. Owing to the use of this clustering step, we experimentally and empirically set the ratio of labeled to unlabeled data to approximately 1:1.
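The landmark selection described above can be sketched with plain k-means. The paper does not specify which clustering variant is used, so the following (with our own function name) simply runs Lloyd iterations and returns the cluster centers as landmarks:

```python
import numpy as np

def select_landmarks(X_unlabeled, n_landmarks, n_iters=20, seed=0):
    """Group unlabeled MS samples into cluster centers (landmarks).

    X_unlabeled : n x d_M array of unlabeled multispectral pixels.
    Returns an n_landmarks x d_M array of cluster centers.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X_unlabeled), n_landmarks, replace=False)
    centers = X_unlabeled[idx].astype(float)         # random initial centers
    for _ in range(n_iters):
        # assign each sample to its nearest center
        d = ((X_unlabeled[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for k in range(n_landmarks):
            members = X_unlabeled[labels == k]
            if len(members):                         # keep old center if cluster is empty
                centers[k] = members.mean(0)
    return centers
```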
The model in Eq. (2) can be simplified by optimizing the adjacency matrix W instead of directly solving the hard optimization problem in L, using the identity

tr(HLH^T) = (1/2)||W ⊙ Z||_(1,1),   (3)

where H = ΘX̃ denotes the subspace features, Z is the pairwise squared Euclidean distance matrix with Z_(i,j) = ||H_i − H_j||₂², and ⊙ denotes the Schur-Hadamard (termwise) product.
Using Eq. (3), we can equivalently convert the smooth-manifold optimization problem in (2) into a graph-sparsity problem,

min_{Θ,P,W} (1/2)||Y − PΘX||_F² + (α/2)||P||_F² + (β/4)||W ⊙ Z||_(1,1),
s.t. ΘΘ^T = I, W = W^T, W ≥ 0, diag(W) = 0, ||W||_(1,1) = s,   (4)

where ||W ⊙ Z||_(1,1) can be interpreted as a weighted ℓ1-norm of W which enforces weighted sparsity.
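The trace-to-weighted-ℓ1 conversion rests on the standard Laplacian identity tr(HLH^T) = (1/2) Σ_{i,j} W_{i,j} ||H_i − H_j||², which can be checked numerically; a quick sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 4
H = rng.standard_normal((d, n))          # subspace features, one column per sample
W = rng.random((n, n))
W = 0.5 * (W + W.T)                      # symmetric non-negative adjacency
np.fill_diagonal(W, 0)
L = np.diag(W.sum(axis=1)) - W           # graph Laplacian L = D - W

# Z holds squared pairwise Euclidean distances between the columns of H
Z = ((H[:, :, None] - H[:, None, :]) ** 2).sum(axis=0)

lhs = np.trace(H @ L @ H.T)
rhs = 0.5 * (W * Z).sum()                # (1/2) * ||W ∘ Z||_{1,1}
assert np.isclose(lhs, rhs)
```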
We further elaborate the relationship between the proposed LeMA model and our motivation in an intuitive way. In general, we aim at finding a common subspace by learning a pair of projections (Θ_M and Θ_H) corresponding to the two different modalities (e.g., MS and HS), respectively. In order to effectively improve the discriminative ability of the learned subspace, we make a connection between the subspace and the label information by jointly estimating the regression coefficient P and the common projections Θ, as formulated in Eq. (1). Moreover, the alignment behavior of the different modalities is represented by the connectivity of W: if the i-th sample X_i and the j-th sample X_j are connected (W_(i,j) = 1), then the two samples belong to the same class, and vice versa.

Algorithm 2 (solving the subproblem for Θ; inputs: Y, P, J, X, X_U, L, β, maxIter) iteratively updates Θ and the Lagrange multipliers, updates the penalty parameter by µ = min(ρµ, µ_max), and checks the convergence conditions before stopping.
Besides, we construct an extra adjacency sub-matrix over the unlabeled samples in order to globally capture the data distribution. This sub-matrix is usually obtained by a Gaussian kernel function (as in semi-supervised CoSpace), but it can also be learned from the data (as in LeMA, formulated in Eq. (2)).

Model Optimization
Considering the complexity of the non-convex problem (4), an iterative alternating optimization strategy is adopted, solving a convex subproblem for each of the variables P, Θ, and W. An implementation of LeMA is given in Algorithm 1.
Optimization with respect to P: This is a typical least-squares problem with Tikhonov regularization,

min_P (1/2)||Y − PE||_F² + (α/2)||P||_F²,   (5)

which has the closed-form solution

P = YE^T(EE^T + αI)^(-1),   (6)

where E = ΘX.
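This closed-form update is ordinary ridge regression and can be written directly. An illustrative sketch (the function name `update_P` is our own); at the solution, the gradient −(Y − PE)E^T + αP vanishes:

```python
import numpy as np

def update_P(Y, E, alpha):
    """Closed-form ridge update: argmin_P (1/2)||Y - P E||_F^2 + (alpha/2)||P||_F^2.

    E = Theta X holds the current subspace features; the solution is
    P = Y E^T (E E^T + alpha I)^{-1}.
    """
    d_s = E.shape[0]
    return Y @ E.T @ np.linalg.inv(E @ E.T + alpha * np.eye(d_s))
```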
Optimization with respect to Θ: for fixed P and W, the optimization problem for Θ can be formulated as

min_Θ (1/2)||Y − PΘX||_F² + (β/2) tr(ΘX̃LX̃^TΘ^T),  s.t. ΘΘ^T = I.   (7)

In order to solve (7) effectively with ADMM, we consider an equivalent form (8) obtained by introducing auxiliary variables J and G to replace ΘX̃ and Θ, respectively. Algorithm 2 lists the detailed procedure for solving problem (8).
Optimization with respect to W: W is a joint adjacency matrix and consists of nine parts, as shown in Fig. 2. Among them, W_HH, W_HM, W_MH, and W_MM can be directly inferred from the label information in the form of the LDA-like graph [33], i.e., W_(i,j) = 1/N_k if samples i and j both belong to class k (and 0 otherwise), where N_k is the number of labeled samples in class k. Given the symmetry of W (i.e., W_HM = W_MH^T, W_HU = W_UH^T, and W_MU = W_UM^T), we only need to update three of the nine parts, namely W_HU, W_MU, and W_UU. The subproblems for W_HU and W_MU amount to minimizing the corresponding weighted ℓ1 terms ||W_HU ⊙ Z_HU||_(1,1) and ||W_MU ⊙ Z_MU||_(1,1) under the constraints of problem (4), which can be solved by ADMM. More details can be found in Algorithm 3, where Z_H(M) and Z_U represent the subspace features of X_H(M) and X_U, respectively, and prox stands for the proximal operator associated with ||W||_(1,1) = s [34]. We additionally impose the constraint W_(i,j) ≤ 1/N_k so that the learned entries share the same unit level as the LDA-like graph.
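A labeled block of the LDA-like graph can be sketched as follows. This is an illustration of the construction described above; the helper name and the choice of normalizing by the class size counted in the first label vector are our assumptions, not a detail given in the text:

```python
import numpy as np

def lda_like_block(labels_a, labels_b):
    """Build one labeled block of the joint adjacency matrix W.

    Two labeled samples are connected with weight 1/N_k (N_k = class size,
    here counted in labels_a) when they share class k, and 0 otherwise, so
    every class contributes at the same unit level.
    """
    W = np.zeros((len(labels_a), len(labels_b)))
    for k in np.unique(labels_a):
        n_k = np.count_nonzero(labels_a == k)            # class size N_k
        mask = np.outer(labels_a == k, labels_b == k)    # same-class pairs
        W[mask] = 1.0 / n_k
    return W
```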
For W_UU, the objective function can be written analogously, minimizing the weighted ℓ1 term over the unlabeled-unlabeled block under the same symmetry, non-negativity, and scale constraints, and it can be effectively solved using Algorithm 4. Each iteration of Algorithm 4 (inputs: Z_U, W, γ, maxIter; output: W) alternately updates W and the auxiliary variables U, M, and S (with S = prox(W − Λ_4/µ)), then updates the Lagrange multipliers, updates the penalty parameter by µ = min(ρµ, µ_max), and checks the convergence conditions.
Finally, we repeat these optimization procedures until a stopping criterion is satisfied.

Convergence Analysis
The alternating strategy used in Algorithm 1 is essentially block coordinate descent (BCD), which is theoretically guaranteed to converge to a stationary point as long as each subproblem in Eq. (4) is exactly minimized [35]. As observed, the subproblems with respect to the variables P, Θ, and W are strongly convex, and hence each of them can find a unique minimum when the Lagrangian parameters are updated within finitely many iterative steps [36]. Besides, the ADMM used in each subproblem optimization is actually a generalization of the inexact Augmented Lagrange Multiplier (ALM) method [37], whose convergence has been well studied when the number of blocks is at most two [38] (e.g., Algorithm 2). Although there is still no general and rigorous theoretical proof in the multi-block case, the convergence of common cases such as our Algorithm 3 and Algorithm 4 has been analyzed in [39, 40, 41, 42]. We also experimentally record the objective function values at each iteration to draw the convergence curves of LeMA on the two HS-MS datasets (see Fig. 3).

Experiments
In this section, we quantitatively and qualitatively evaluate the performance of the proposed method on two simulated HS-MS datasets (University of Houston and Chikusei) and a real multispectral-lidar and hyperspectral dataset provided by the 2018 IEEE GRSS data fusion contest (DFC2018), in the form of classification using two commonly used and high-performance classifiers, namely linear support vector machines (LSVM) and canonical correlation forests (CCF) [43]. Three indices, namely overall accuracy (OA), average accuracy (AA), and the kappa coefficient (κ), are calculated to quantitatively assess the classification performance. Moreover, we compare the performance of the proposed LeMA with several other state-of-the-art algorithms, i.e., GLP [31], SMA, S-SMA [29], CoSpace, and semi-supervised CoSpace (S-CoSpace). The original MS data is used as a baseline. SMA constructs an LDA-like joint graph using label information. Besides label information, S-SMA also uses unlabeled samples to generate the joint graph by computing similarities based on the Euclidean distance. The same graph construction strategy is adopted for CoSpace and S-CoSpace.
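For reference, the three indices follow directly from the confusion matrix; a compact sketch (our own helper, not the authors' evaluation code):

```python
import numpy as np

def classification_scores(y_true, y_pred, n_classes):
    """Compute OA, AA, and the kappa coefficient from predicted labels."""
    C = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1                               # rows: true class, cols: predicted
    n = C.sum()
    oa = np.trace(C) / n                           # overall accuracy
    aa = np.mean(np.diag(C) / C.sum(1))            # mean per-class accuracy
    pe = (C.sum(0) * C.sum(1)).sum() / n ** 2      # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```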

The Simulated MS-HS Datasets over the University of Houston

Data Description
The HS data in the simulated Houston MS-HS datasets was acquired by the ITRES-CASI-1500 sensor, with a size of 349 × 1905 pixels at a ground sampling distance (GSD) of 2.5 m, over the University of Houston campus and its neighboring urban areas. This data was provided for the 2013 IEEE GRSS data fusion contest and comprises 144 bands covering the wavelength range from 364 nm to 1046 nm. Spectral simulation is performed to generate the MS image by degrading the HS image in the spectral domain, using the MS spectral response functions (SRFs) of Sentinel-2 as filters (for more details refer to [6]). The resulting MS data has dimensions of 349 × 1905 × 10.
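The spectral degradation step can be sketched as an SRF-weighted average over the HS bands. The SRF matrix below is a toy stand-in, not the actual Sentinel-2 response curves referenced in [6]:

```python
import numpy as np

def degrade_to_ms(hs_cube, srf):
    """Spectrally degrade an HS cube to simulate MS bands.

    hs_cube : H x W x B_hs hyperspectral image
    srf     : B_ms x B_hs matrix of spectral response functions, one row per
              simulated MS band (rows assumed normalized to sum to 1)
    Each output band is the SRF-weighted average of the HS bands.
    """
    h, w, b = hs_cube.shape
    return (hs_cube.reshape(-1, b) @ srf.T).reshape(h, w, -1)
```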

Experimental Setup
To meet our problem setting, an HS image partially overlapping the MS image and the whole MS image are used in our experiments, and the corresponding training and test samples are re-assigned accordingly, as shown in Fig. 4. In detail, since the full ground truth is available, we seek out a region in which all classes are represented. The labels in this region are selected as the training set and the rest serve as the test set, as shown in Fig. 4 and quantified in Table 1.
The parameters of the different methods are determined by 10-fold cross-validation on the training data. More specifically, we tune the parameters of the different algorithms to maximize their performance, e.g., the dimension (d), the penalty parameters (α, β), etc. The dimension (d) is a parameter common to all compared algorithms, and it is determined over the range from 10 to 50 at intervals of 10. The number of nearest neighbors (k) and the standard deviation of the Gaussian kernel function (σ), used in manually computing the adjacency matrix (W) of GLP, SMA, and S-SMA, are selected from {10, 20, ..., 50} and {10⁻², 10⁻¹, 10⁰, 10¹, 10²}, respectively. Similarly, for CoSpace, S-CoSpace, and LeMA, the two regularization parameters (α, β) are selected from {10⁻², 10⁻¹, 10⁰, 10¹, 10²}.

Results and Analysis
Fig. 5 shows the classification maps of the compared algorithms using the LSVM and CCF classifiers, while Table 2 lists the corresponding quantitative assessment results with the optimal parameters obtained by 10-fold cross-validation.
Overall, the methods based on manifold alignment outperform the baseline and GLP with both classifiers. This means that a limited amount of HS data can guide the corresponding MS data towards more discriminative feature representations. More specifically, compared with S-SMA, SMA yields a relatively poor performance, since it only considers the correspondences of the labeled MS-HS data. This indicates that reasonably embedding unlabeled samples into the manifold alignment framework can effectively help capture the real data distribution and thereby obtain more accurate decision boundaries. Unfortunately, these approaches only attempt to align the different data in a common subspace and hardly take the connections between the common subspace and the label information into account, which leads to a lack of discriminative ability. In this regard, our proposed joint learning framework CoSpace and its semi-supervised version S-CoSpace achieve the desired results on the given MS-HS datasets. By fully considering the connections among the common subspace, the label information, and the unlabeled information encoded by the learned graph structure, LeMA performs clearly better than all the other methods, as can be observed in Table 2. This demonstrates that LeMA is likely to learn a more discriminative feature representation and to find a better decision boundary.
As observed from Fig. 4 and Table 2, the training samples are relatively few and the distribution across the different classes is extremely unbalanced. While training the classifier, more attention is paid to the classes with many samples, while some small classes may contribute little or nothing. For this reason, we propose to exploit the large-scale unlabeled data, yielding semi-supervised learning. Using this strategy, the semi-supervised methods, i.e., GLP, S-SMA, and S-CoSpace, clearly perform better than the baseline and their supervised counterparts (SMA and CoSpace). Moreover, we can see from Table 2 that there is a significant improvement of the classification performance for some classes (e.g., Stressed Grass, Water) after accounting for unlabeled samples, particularly between SMA and S-SMA as well as between CoSpace and S-CoSpace. However, these semi-supervised methods carry out label propagation on a given graph manually computed by a Gaussian kernel function, limiting the adaptiveness and discriminability of the algorithms. LeMA can adaptively learn a data-driven graph structure on which the labels tend to spread more smoothly, which results in more effective material identification for challenging classes with few training samples, such as Trees, Residential, Railway, and Parking Lot1. In addition, we can also observe an easily overlooked phenomenon: LeMA's ability to identify certain classes still remains limited, such as Parking Lot2 (only 1.78%) and Railway (49.96%). Parking Lot2 is basically classified as Commercial and Parking Lot1, while Railway is largely identified as Road and Commercial. This might be explained by the limited number of training samples as well as the fairly similar spectral properties of these classes.

Experimental Setup
Fig. 6 shows the corresponding MS and partial HS images as well as the selected training and test labels. Again, the overlapped region between MS and HS, which should include all the classes listed in Table 1, is chosen based on the given ground truth [44]. Additionally, the parameter configuration of all algorithms is completed adaptively by 10-fold cross-validation on the training set, which generalizes well across datasets. Regarding how to run the cross-validation for the parameter setting, please refer to Section 3.1.2 for more details.

Results and Analysis
We assess the classification performance of the different algorithms for the Chikusei MS-HS data both quantitatively and visually, as shown in Fig. 7 and Table 3.
Similarly to the University of Houston MS-HS data, a basically consistent trend is observed for the different algorithms on the Chikusei MS-HS data. On the whole, the original MS data (baseline) fails to identify some specific materials such as Plastic House, Manmade (Dark), Rice Field (Grown), Bare Soil (Farmland), and Forest, due to its poor spectral information and the limited number of training samples. GLP utilizes the unlabeled samples to augment the training samples in a semi-supervised way, yet it is still limited by the low-discriminative spectral signatures. By aligning the MS and HS data, the alignment-based approaches (e.g., SMA, S-SMA, CoSpace, S-CoSpace, and LeMA) are able to find a common subspace in which the learnt features are expected to absorb the different properties of the two modalities, resulting in better performance. Compared to the supervised methods (SMA and CoSpace), their semi-supervised versions (S-SMA and S-CoSpace) obtain higher classification accuracies with both classifiers, as detailed in Table 3. As expected, the performance of LeMA is significantly superior to that of the others, thanks to the joint contributions of common subspace learning from MS-HS data, data-driven graph learning, and the semi-supervised learning strategy. Despite this, LeMA still fails to recognize some challenging classes, such as Weeds in Farmland, Row Crops, Plastic House, and Asphalt. The reasons could be two-fold. On the one hand, the performance of LeMA is limited, to some extent, by the unbalanced datasets. On the other hand, LeMA's transfer ability degrades sharply when there is great spectral variability between the training and test samples.

The Real Multispectral-Lidar and Hyperspectral Datasets in DFC2018
Although we followed strict simulation procedures, the two MS-HS datasets used above (Houston and Chikusei) essentially originate from a similar data source (homogeneous), which means there is a strong correlation in their spectral features. This makes the information of the different modalities transfer more effectively, but could limit the generalization ability in practice. To this end, we apply a real bi-modal dataset, multispectral-lidar and hyperspectral (heterogeneous), provided by the latest IEEE GRSS data fusion contest, DFC2018.

Data Description
Multi-source optical remote sensing data, including multispectral-lidar data, hyperspectral data, and very high-resolution RGB data, is provided in the contest. More specifically, the multispectral-lidar imagery consists of 1202 × 4768 pixels with 7 bands (3 intensity bands and 4 DSM-related bands [45]) collected at 1550 nm, 1064 nm, and 532 nm at a 0.5 m GSD, while the hyperspectral data comprises 48 bands covering a spectral range from 380 nm to 1050 nm at a 1 m GSD, with a size of 601 × 2384 pixels. In our case, the LeMA model is trained on partial multispectral-lidar and hyperspectral correspondences and tested only on multispectral-lidar data, in order to meet the requirement of our cross-modality learning task. The first row of Fig. 8 shows the RGB image of this scene and the labeled ground truth image.

Experimental Setup
Our aim is, once again, to investigate whether a limited amount of hyperspectral data can improve the performance on another modality, e.g., multispectral data (homogeneous) or multispectral-lidar data (heterogeneous). Therefore, we randomly assign 10% of the total labeled samples as the training set and the rest as the test set in this experiment. Moreover, 16 main classes are selected out of 20 (see Fig. 8), by removing several small classes with too few samples, e.g., Artificial Turf, Water, Crosswalks, and Unpaved Parking Lots. Likewise, we automatically configure the parameters of the proposed LeMA and the compared algorithms by 10-fold cross-validation on the training set, as detailed in Section 3.1.2.

Results and Analysis
We report the results of the different algorithms averaged over 10 runs to obtain a relatively stable and meaningful performance comparison, because the training and test sets are randomly generated from the total samples in each round, as listed in Table 4. Correspondingly, Fig. 8 visually highlights the differences between the classification maps of the different methods.
Generally speaking, embedding hyperspectral information can effectively improve the classification performance on the multispectral-lidar data, which implies that the models based on common subspace learning (e.g., SMA, S-SMA, CoSpace, S-CoSpace, and LeMA) can, to some extent, transfer knowledge from one modality to another. We also observe from Table 4 that the semi-supervised methods that consider the unlabeled samples (e.g., GLP, S-SMA, S-CoSpace, and LeMA) always perform better than the purely supervised ones. Not unexpectedly, LeMA, integrating rich spectral information and unlabeled samples, achieves superior performance, which demonstrates that the learning-based graph structure is better suited to capturing the data distribution and, further, to finding a potentially optimal decision boundary.
One thing to be noted, however, is that compared with the performance of the different algorithms on the simulated MS-HS datasets from similar sources (homogeneous), the knowledge transfer ability of these algorithms on the real multispectral-lidar and hyperspectral datasets from different sources (heterogeneous) remains limited, since all the listed methods, including our LeMA, are modeled in a linear way. A single linear transformation fails to bridge the gap between heterogeneous modalities well, yielding only a limited performance improvement.

Conclusions
In real-world problems, a large amount of low-quality data (e.g., MS data) can often be easily collected. On the contrary, high-quality data (e.g., HS data) are usually expensive and difficult to obtain. This motivates us to investigate whether a limited amount of high-quality data can contribute to relevant tasks performed on a large amount of low-quality data. For this purpose, we propose a novel semi-supervised learning framework called LeMA, which effectively connects the common subspace and the label information, and automatically embeds the unlabeled information into the framework by adaptively learning a Laplacian matrix from the data. Extensive experiments are conducted with LeMA on two homologous MS-HS simulated datasets and a heterogeneous multispectral-lidar and hyperspectral real dataset, in comparison with other state-of-the-art algorithms, demonstrating the superiority and effectiveness of LeMA in knowledge transfer. We have to admit, however, that despite the significant performance improvement brought by LeMA, its representation ability is still limited by the linear modeling, especially when facing highly nonlinear heterogeneous data. To address this issue, we will continue to extend our model to a nonlinear version and simultaneously consider spatial information (e.g., morphological profiles) to further strengthen the feature representation ability.

Figure 1: An illustration of the proposed LeMA method.

Figure 2: An example of the joint adjacency matrix W.


Figure 3: Convergence analysis of LeMA, experimentally performed on the two MS-HS datasets.

Figure 4: The multispectral image and its corresponding hyperspectral image that partially covers the same area, as well as training and testing labels, for the University of Houston dataset.

Figure 5: Classification maps of the different algorithms obtained using two kinds of classifiers on the University of Houston dataset.

Figure 6: The multispectral image and its corresponding hyperspectral image that partially covers the same area, as well as training and testing labels, for the Chikusei dataset.

The Simulated MS-HS Datasets over Chikusei

Data Description
Similarly to the Houston data, the MS data with dimensions of 2517 × 2335 × 10 at a GSD of 2.5 m was simulated from the HS data acquired by the Headwall Hyperspec-VNIR-C sensor over the Chikusei area, Ibaraki, Japan. The HS data consists of 128 bands in the spectral range from 363 nm to 1018 nm with a 10 nm spectral resolution. The dataset has been made available to the scientific community [44].

Figure 7: Classification maps of the different algorithms obtained using two kinds of classifiers on the Chikusei dataset.

Figure 8: Classification maps of the different algorithms obtained using two kinds of classifiers on the real dataset of DFC2018 (multispectral-lidar and hyperspectral data).

Table 1: The number of training and testing samples for the two MS-HS datasets.

Table 2: Quantitative performance comparison of the different algorithms on the University of Houston data. The best result is shown in bold.

Table 3: Quantitative performance comparison of the different algorithms on the Chikusei data. The best result is shown in bold.

Table 4: Quantitative performance comparison of the different algorithms on the DFC2018 data. The best result is shown in bold.