Learning to Propagate Labels on Graphs: An Iterative Multitask Regression Framework for Semi-supervised Hyperspectral Dimensionality Reduction

Hyperspectral dimensionality reduction (HDR), an important preprocessing step prior to high-level data analysis, has been garnering growing attention in the remote sensing community. Although a variety of methods, both unsupervised and supervised, have been proposed for this task, the discriminative ability in feature representation still remains limited due to the lack of a powerful tool that effectively exploits the labeled and unlabeled data in the HDR process. A semi-supervised HDR approach, called iterative multitask regression (IMR), is proposed in this paper to address this need. IMR aims at learning a low-dimensional subspace by jointly considering the labeled and unlabeled data, and by bridging the learned subspace with two regression tasks: labels and pseudo-labels initialized by a given classifier. More significantly, IMR dynamically propagates the labels on a learnable graph and progressively refines the pseudo-labels, yielding a well-conditioned feedback system. Experiments conducted on three widely-used hyperspectral image datasets demonstrate that the dimension-reduced features learned by the proposed IMR framework are superior, with respect to classification or recognition accuracy, to those of related state-of-the-art HDR approaches.


…, 2019), large-scale urban or agriculture mapping (Dell'Acqua et al., 2004; Yang et al., 2013; Fan et al., 2015; Xie and Weng, 2017), spectral unmixing (Henrot et al., 2016; Hong et al., 2017; Zhong et al., 2016; Hong et al., 2019a), object detection (McCann et al., 2017; Wu et al., 2018; Li et al., 2018; Wu et al., 2019), and multimodal scene interpretation (Tuia et al., 2016; Yokoya et al., 2018; Zhu et al., 2019; Liu et al., 2019), as forthcoming spaceborne spectroscopy imaging satellites (e.g., EnMAP (Guanter et al., 2015)) make hyperspectral imagery (HSI) available on a larger scale. Although HSI features richer spectral information than RGB (Kang et al., 2018) and multispectral (MS) data (Hong et al., 2015), yielding more accurate and discriminative detection and identification of unknown materials, the very high dimensionality of HSI also introduces some crucial drawbacks that need to be taken seriously: high storage cost, information redundancy, and the performance degradation resulting from the curse of dimensionality, to name a few. A general but effective solution to these issues is dimensionality reduction, also referred to as subspace learning. In this process, we expect to compress the HSI to a low-dimensional subspace along the spectral dimension while preserving the highest possible spectral discrimination.

With significant support in both theory and practice, as well as the fact that learning-based strategies are somewhat superior to manually-designed feature extraction (Hong et al., 2016a), a considerable number of subspace learning approaches have been designed and applied to hyperspectral data processing and analysis in the past decades (Licciardi et al., 2009; Huang and Yang, 2015; Hong et al., 2016b; Luo et al., 2016; Liu et al., 2017; Xu et al., 2018a; Xu et al., 2019), particularly hyperspectral dimensionality reduction (HDR) (Gao et al., 2017a; Hong et al., 2017; Gao et al., 2017b) and spectral band selection (Sun et al., 2015; Sun et al., 2017a). Depending on their learning strategies, HDR techniques are roughly categorized as unsupervised, supervised, or semi-supervised.

The classic principal component analysis (PCA) (Martínez and Kak, 2001) is a user-friendly dimensionality reduction method but is limited in capturing the underlying topology of the data. Rather, manifold learning techniques (e.g., locally linear embedding (LLE) (Roweis and Saul, 2000), Laplacian eigenmaps (LE) (Belkin and Niyogi, 2003), local tangent space alignment (LTSA) (Zhang and Zha, 2004), and their variants: locality preserving projections (LPP) (He and Niyogi, 2004), neighborhood preserving embedding (NPE) (He et al., 2005), large-scale LLE (Hong et al., 2016c), enhanced local tangent space alignment (ENH-LTSA) (Sun et al., 2014)), by and large, follow the graph embedding framework presented in Yan et al. (2007). This framework starts with the construction of a graph (topology) structure and aims at learning a low-dimensional data embedding while preserving the topological structure. Some popular and advanced methods have been proposed for HDR based on the graph embedding framework. For example, Ma et al. (2010) proposed to locally embed the intrinsic structure of the hyperspectral data into a low-dimensional subspace for hyperspectral image classification. Li et al. (2012) modeled the locally neighboring relations between hyperspectral data in a linearized system for HDR. In Huang et al. (2019), a multi-feature manifold discriminant analysis was developed on the basis of the graph embedding framework for hyperspectral image classification. The authors of Sun et al. (2014) upgraded the existing landmark isometric mapping approach for fast and nonlinear HDR. The same investigators (Sun et al., 2017b) further extended their work to linearly extract the low-dimensional representation with sparse and low-rank attribute embeddings for HSI classification. In Hong et al. (2017), a joint spatial-spectral manifold embedding was developed to extract discriminative dimension-reduced features. Subsequently, Huang et al. (2019) proposed a general spatial-spectral manifold learning framework to reduce the dimension of hyperspectral imagery.

In supervised HDR strategies, the main consideration is the discrimination between intra-class and inter-class samples, where different discriminative rules are followed: linear discriminant analysis (LDA) (Martínez and Kak, 2001), local Fisher discriminant analysis (LFDA) (Sugiyama, 2007), sparse discriminant analysis (Huang and Yang, 2015), noise-adjusted discriminant analysis (Li et al., 2013), feature space discriminant analysis (Imani and Ghassemian, 2015), and so on. Despite their superior class separability, these methods might still fail to robustly represent the features due to their sensitivity to various complex noises and ill-conditioned statistical assumptions, especially in the case of small-scale samples. Unlike the aforementioned approaches that seek to project the original data directly into a discriminative subspace, Ji and Ye (2009) simultaneously performed dimensionality reduction and classification under a regression-based framework, in order to find an optimal latent subspace where the decision boundary is expected to be better determined. With a local manifold regularization in the projected subspace, this strategy has been successfully applied and extended to learn discriminative representations for supervised HDR (Hong et al., 2018).

Most previously-proposed HDR methods adhere to either the unsupervised or the supervised strategy, and the labeled and unlabeled information is rarely taken into consideration jointly. A straightforward way to consider the unlabeled samples is graph-based label propagation (GLP) (Zhu et al., 2003), which has been successfully applied to semi-supervised HSI classification (Li et al., 2016) together with the support vector machine (SVM) classifier. To effectively improve the discrimination and generalization of dimension-reduced features, some semi-supervised HDR works have been proposed that attempt to preserve the potentially global data structure lying in the whole high-dimensional space. For example, Ma et al. (2015) followed a graph-based semi-supervised learning paradigm for HDR and classification, where the graphs are constructed by different local manifold learning approaches. A general but effective work integrating LDA with LPP, called semi-supervised local discriminant analysis (SELD), was proposed in Liao et al. (2013) for semi-supervised hyperspectral feature extraction. Inspired by GLP, Zhao et al. (2014) enhanced the performance of LDA by jointly utilizing the labels and the "soft-labels" predicted by GLP for semi-supervised subspace dimensionality reduction. Wu and Prasad (2018) proposed a similar approach to achieve semi-supervised discriminative dimensionality reduction of HSI by embedding pseudo-labels (instead of the similarity measurement in LPP (Liao et al., 2013)) into LFDA rather than into the LDA of Zhao et al. (2014).


Motivation and objectives

Although these semi-supervised approaches have been proven to be effective in handling the issue of HDR to some extent, their graph structures for unlabeled samples are constructed either from a similarity measurement (e.g., using RBF) or from the pseudo-labels inferred by GLP or a pre-trained classifier. The features resulting from this type of graph construction strategy are neither robust nor generalized, due to the noisy data and labels as well as the scarcity of labeled samples. Also, these semi-supervised algorithms, as often as not, attempt to find a single transformation that connects the original data and the subspace to be estimated. On account of the complexity of the learning process, the optimal subspace search is hardly accomplished by a single transformation alone. On the other hand, in spite of being guided by label information, there is still a lack of an explicit and direct connection between the learned subspace and the label space in a subspace learning strategy interpreted by a single projection, further causing a performance bottleneck. In addition, these subspace-learning-based models are commonly treated as a disjunct feature learning step before classification. In other words, it is unknown what kinds of features in the learning process may be capable of improving classification accuracy.

According to these factors, our objectives in this paper can be summarized as follows: 1) to bridge the to-be-estimated subspace with the label information more explicitly and effectively; 2) to introduce many unlabeled samples for improving the model's generalization ability; and 3) to refine the quality of the class indicators of unlabeled samples for highly discriminative HDR.


Method overview and contributions

Towards the aforementioned goals, a novel regression-induced learning model motivated by the joint learning (JL) framework (Ji and Ye, 2009; Hong et al., 2018) is proposed, which seeks to learn an optimal subspace by considering the correspondences between the training samples and labels on a to-be-estimated latent subspace. We further extend the JL framework to a multitask regression model with the joint embedding of labeled and unlabeled samples. In the multitask framework, we also propose to adaptively learn a soft-graph structure from the data rather than utilizing a hard graph (fixed graph) constructed manually or generated by additional algorithms, yielding a high-performance and more generalized label propagation. In the meantime, to facilitate the use of pseudo-labels more effectively, the learned graph can be updated after each outer iteration ends, and the pseudo-labels accordingly refined, thereby enabling the learned features to be progressively optimized. More specifically, the main contributions of this work can be highlighted as follows.

• An alternating direction method of multipliers (ADMM) optimizer is developed for the solution of our proposed IMR method.
The proposed methodology

In this section, we start with a brief review of our model's cornerstone, the JL framework, and then extend it to a variant of multitask learning by synchronously regressing the labeled and unlabeled data. We further introduce the proposed iterative multitask regression (IMR) model by integrating the JL framework with an advanced graph learning technique, which more effectively propagates labels. Finally, an ADMM-based optimizer is used for the IMR solution. Fig. 1 illustrates the workflow of the proposed IMR method.

Review of the JL model

Let $\mathbf{X}_l \in \mathbb{R}^{d \times N}$ denote the labeled hyperspectral data with $d$ bands and $N$ pixels, and let $\mathbf{Y}_l \in \mathbb{R}^{l \times N}$ be the corresponding one-hot encoded label matrix with $l$ classes by $N$ pixels. We model the original JL problem (Ji and Ye, 2009) as

$$\min_{\mathbf{A},\mathbf{S}} \frac{1}{2}\|\mathbf{Y}_l - \mathbf{A}\mathbf{S}\mathbf{X}_l\|_{\mathrm{F}}^2 + \frac{\alpha}{2}\|\mathbf{A}\|_{\mathrm{F}}^2 \quad \mathrm{s.t.}\ \mathbf{S}\mathbf{S}^{\mathrm{T}} = \mathbf{I}, \tag{1}$$

and its manifold-regularized extension (Hong et al., 2018) as

$$\min_{\mathbf{A},\mathbf{S}} \frac{1}{2}\|\mathbf{Y}_l - \mathbf{A}\mathbf{S}\mathbf{X}_l\|_{\mathrm{F}}^2 + \frac{\alpha}{2}\|\mathbf{A}\|_{\mathrm{F}}^2 + \frac{\beta}{2}\mathrm{tr}(\mathbf{S}\mathbf{X}_l\mathbf{L}_l\mathbf{X}_l^{\mathrm{T}}\mathbf{S}^{\mathrm{T}}) \quad \mathrm{s.t.}\ \mathbf{S}\mathbf{S}^{\mathrm{T}} = \mathbf{I}, \tag{2}$$

where $\mathbf{L}_l = \mathbf{D}_l - \mathbf{W}_l$ is the Laplacian matrix of the graph $\mathbf{W}_l$ built on the labeled samples, and $\mathbf{D}_l$ with $\mathbf{D}_l(i,i) = \sum_j \mathbf{W}_l(i,j)$ is the corresponding degree matrix. The term $\mathrm{tr}(\cdot)$ denotes the trace of a matrix, and $\alpha$ and $\beta$ are regularization parameters. The JL-based models in Eqs. (1) and (2) have been proven to be effectively solved with the ADMM optimizer (Hong et al., 2019b). Once the projection matrix $\mathbf{S}$ is learned, the subspace features can be computed by $\mathbf{S}\mathbf{X}$.
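For intuition, the graph Laplacian used in the manifold term above can be assembled directly from an adjacency matrix; a minimal NumPy sketch (the toy adjacency values are illustrative, not taken from the paper):

```python
import numpy as np

# Toy symmetric adjacency matrix W for 4 labeled samples (illustrative values).
W = np.array([[0.0, 0.8, 0.1, 0.0],
              [0.8, 0.0, 0.3, 0.0],
              [0.1, 0.3, 0.0, 0.5],
              [0.0, 0.0, 0.5, 0.0]])

# Degree matrix: D(i, i) = sum_j W(i, j).
D = np.diag(W.sum(axis=1))

# Graph Laplacian: L = D - W.
L = D - W

# Each row of L sums to zero, and L is symmetric positive semi-definite.
assert np.allclose(L.sum(axis=1), 0)
assert np.all(np.linalg.eigvalsh(L) >= -1e-10)
```

These two properties (zero row sums and positive semi-definiteness) are what make the trace regularizer a valid smoothness penalty on the projected features.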


Iterative Multitask Regression (IMR)

Labeling in Earth Vision is extremely costly and time-consuming, as remote sensing images have a larger-scale and more complex visual field. This leads to a limited number of labeled samples, which further hinders improvement of the model's learning and generalization capability. To this end, we effectively utilize the information of the largely available unlabeled samples by making a regression between the unlabeled samples and their pseudo-labels.

Multitask regression with graph learning

Fig. 1. An overview of the proposed IMR framework. Each iteration ($t$-step) starts with the input of labeled and unlabeled data and ends with the output of the subspace projections ($\mathbf{S}^{(t)}$), the regression matrix ($\mathbf{A}^{(t)}$), and the learned graph ($\mathbf{W}^{(t)}$) aligning the labeled with the unlabeled samples.

In the multitask framework, we propose a learning-based graph regularization instead of a fixed graph artificially constructed with known kernels (e.g., the Gaussian kernel function), in order to depict the connectivity (or similarity) between samples. Accordingly, a multitask regression framework is proposed for semi-supervised HDR by optimizing the following objective function.
$$\min_{\mathbf{A},\mathbf{S},\mathbf{L}} \frac{1}{2}\|\mathbf{Y}_l - \mathbf{A}\mathbf{S}\mathbf{X}_l\|_{\mathrm{F}}^2 + \frac{\eta}{2}\|\mathbf{Y}_{pl} - \mathbf{A}\mathbf{S}\mathbf{X}_{pl}\|_{\mathrm{F}}^2 + \frac{\alpha}{2}\|\mathbf{A}\|_{\mathrm{F}}^2 + \frac{\beta}{2}\mathrm{tr}(\mathbf{S}\mathbf{X}\mathbf{L}\mathbf{X}^{\mathrm{T}}\mathbf{S}^{\mathrm{T}})$$
$$\mathrm{s.t.}\ \mathbf{S}\mathbf{S}^{\mathrm{T}} = \mathbf{I},\ \mathbf{L}_{i,j} = \mathbf{L}_{j,i} \leq 0\ (i \neq j),\ \mathbf{L}_{i,i} \geq 0,\ \mathrm{tr}(\mathbf{L}) = s, \tag{3}$$

where $\mathbf{X}_{pl} \in \mathbb{R}^{d \times M}$ and $\mathbf{Y}_{pl} \in \mathbb{R}^{l \times M}$ denote the unlabeled hyperspectral data and the one-hot encoded pseudo-label matrix, respectively, while $\mathbf{X} = [\mathbf{X}_l, \mathbf{X}_{pl}] \in \mathbb{R}^{d \times (N+M)}$ and $\mathbf{L} \in \mathbb{R}^{(N+M) \times (N+M)}$ is a joint Laplacian matrix. The term $s > 0$ is a constant to control the scale. Furthermore, the two fidelity terms in multitask learning are balanced by a penalty parameter $\eta$.

To solve (3) effectively, we rewrite the trace term as

$$\mathrm{tr}(\mathbf{S}\mathbf{X}\mathbf{L}\mathbf{X}^{\mathrm{T}}\mathbf{S}^{\mathrm{T}}) = \frac{1}{2}\mathrm{tr}(\mathbf{W}^{\mathrm{T}}\mathbf{Z}) = \frac{1}{2}\|\mathbf{W} \odot \mathbf{Z}\|_{1,1}, \tag{4}$$

where $\mathbf{W} \in \mathbb{R}^{(N+M) \times (N+M)}$ is the to-be-learned joint adjacency matrix (see Fig. 2 in red). In $\mathbf{W}$, the similarities between the samples of $\mathbf{X}$ can be measured by a pair-wise distance matrix $\mathbf{Z} \in \mathbb{R}^{(N+M) \times (N+M)}$ in the Euclidean space; this matrix can be computed by $\mathbf{Z}_{i,j} = \|\mathbf{S}\mathbf{X}_i - \mathbf{S}\mathbf{X}_j\|_2^2$. Moreover, the operator $\odot$ is interpreted as the term-wise Schur-Hadamard product. By means of Eq. (4), the optimization problem (3) on a smooth manifold can be equivalently converted to one on a sparse graph as follows.

$$\min_{\mathbf{A},\mathbf{S},\mathbf{W}} \frac{1}{2}\|\mathbf{Y}_l - \mathbf{A}\mathbf{S}\mathbf{X}_l\|_{\mathrm{F}}^2 + \frac{\eta}{2}\|\mathbf{Y}_{pl} - \mathbf{A}\mathbf{S}\mathbf{X}_{pl}\|_{\mathrm{F}}^2 + \frac{\alpha}{2}\|\mathbf{A}\|_{\mathrm{F}}^2 + \frac{\beta}{4}\|\mathbf{W} \odot \mathbf{Z}\|_{1,1}$$
$$\mathrm{s.t.}\ \mathbf{S}\mathbf{S}^{\mathrm{T}} = \mathbf{I},\ \mathbf{W} = \mathbf{W}^{\mathrm{T}},\ \mathbf{W}_{i,j} \geq 0,\ \|\mathbf{W}\|_{1,1} = s. \tag{5}$$

Algorithm 1. Iterative Multitask Regression (IMR)
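The identity in Eq. (4) can be checked numerically. The sketch below builds a random small problem (sizes are arbitrary) and compares the trace term with the weighted pairwise-distance form, assuming $\mathbf{L} = \mathbf{D} - \mathbf{W}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_sub = 6, 5, 3

X = rng.standard_normal((d, n))                          # data (columns are samples)
S = np.linalg.qr(rng.standard_normal((d, d_sub)))[0].T   # orthonormal rows: S S^T = I
W = rng.random((n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)  # adjacency
L = np.diag(W.sum(axis=1)) - W                           # Laplacian

# Left-hand side of Eq. (4): tr(S X L X^T S^T).
lhs = np.trace(S @ X @ L @ X.T @ S.T)

# Right-hand side: (1/2) sum_ij W_ij ||S x_i - S x_j||^2 = (1/2) ||W ⊙ Z||_{1,1}.
SX = S @ X
Z = ((SX[:, :, None] - SX[:, None, :]) ** 2).sum(axis=0)  # pairwise squared distances
rhs = 0.5 * np.sum(W * Z)

assert np.isclose(lhs, rhs)
```

This is the standard Laplacian smoothness identity; it is what allows the manifold term in (3) to be rewritten as the Hadamard-product term in (5).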


Fig. 2. A showcase for the joint adjacency matrix ($\mathbf{W}$), where $\mathbf{W}_L$ is an LDA-like graph constructed by labels.


Optimization with graph-based label propagation

In Eq. (3), the pseudo-labels are predicted by using a trained classifier, e.g., SVM or random forest. Although the model's performance can be moderately improved through the use of unlabeled samples and their initial pseudo-labels, it remains limited by only regressing the static pseudo-labels. For this reason, the labels are dynamically propagated on the learned graph using GLP; the updated pseudo-labels, together with the other inputs of $\mathbf{X}_l$, $\mathbf{X}_{pl}$, and $\mathbf{Y}_l$, can then be re-fed into the next round of model training, thus progressively improving the learning and generalization ability. Unlike previous work (Ma et al., 2010; Sun et al., 2014; Hong et al., 2017; Huang et al., 2019) that solves the low-dimensional embedding as a generalized eigenvalue decomposition (Yan et al., 2007), our model learning process iteratively and alternately optimizes several convex subproblems with respect to the variables $\mathbf{A}$, $\mathbf{S}$, and $\mathbf{W}$ as well as the to-be-updated $\mathbf{Y}_{pl}$, instead of directly solving the non-convex problem (5) by a separable strategy over the variables. An implementation of the proposed IMR is summarized in Algorithm 1. Such an optimization strategy has been proven to be effective for solving the aforementioned issue (Bertsekas, 1997; Boyd et al., 2011) and has been successfully applied in many real cases (Ji and Ye, 2009; Hong et al., 2018; Hong et al., 2019b; Hong et al., 2019c).
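The alternating scheme can be sketched as a simple outer loop. The update functions below are illustrative stand-ins (not the paper's exact ADMM solvers), kept trivial so that the control flow and the stopping rule on $\mathbf{W}$ are the focus:

```python
import numpy as np

def imr_outer_loop(update_A, update_S, update_W, propagate_labels,
                   W0, Y0, eps=1e-6, max_iter=50):
    # Generic IMR-style outer loop: alternately solve the subproblems for
    # A (Eq. 6), S (Eq. 8), and W (Eq. 9), then refine the pseudo-labels,
    # stopping once the learned adjacency matrix W stabilizes.
    W, Y_pl = W0, Y0
    for t in range(1, max_iter + 1):
        A = update_A(Y_pl)
        S = update_S(A, W)
        W_new = update_W(S, W)
        Y_pl = propagate_labels(W_new, Y_pl)
        res = np.linalg.norm(W_new - W)    # residual between consecutive graphs
        W = W_new
        if res < eps:
            break
    return A, S, W, Y_pl, t

# Toy stand-ins: W is pulled halfway toward a fixed target each step, so the
# residual shrinks geometrically and the loop terminates well before max_iter.
target = np.eye(3)
A, S, W, Y_pl, steps = imr_outer_loop(
    update_A=lambda Y: Y,
    update_S=lambda A, W: W,
    update_W=lambda S, W: 0.5 * (W + target),
    propagate_labels=lambda W, Y: Y,
    W0=np.ones((3, 3)), Y0=np.zeros((3, 2)))

assert steps < 50 and np.allclose(W, target, atol=1e-5)
```

The actual subproblem solvers are detailed in the following subsections; only the alternation and the $\|\mathbf{W}^{(t)} - \mathbf{W}^{(t-1)}\|$ stopping criterion are meant to be taken from this sketch.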


Learning regression matrix (A)

Intuitively, the optimization problem for solving the variable A is a Tikhonov-regularized least square regression, which is formulated as follows.
$$\min_{\mathbf{A}} \frac{1}{2}\|\mathbf{Y}_l - \mathbf{A}\mathbf{S}\mathbf{X}_l\|_{\mathrm{F}}^2 + \frac{\eta}{2}\|\mathbf{Y}_{pl} - \mathbf{A}\mathbf{S}\mathbf{X}_{pl}\|_{\mathrm{F}}^2 + \frac{\alpha}{2}\|\mathbf{A}\|_{\mathrm{F}}^2. \tag{6}$$

A closed-form solution of Eq. (6) is given by

$$\mathbf{A} = (\mathbf{Y}_l\mathbf{H}_l^{\mathrm{T}} + \eta\mathbf{Y}_{pl}\mathbf{H}_{pl}^{\mathrm{T}})(\mathbf{H}_l\mathbf{H}_l^{\mathrm{T}} + \eta\mathbf{H}_{pl}\mathbf{H}_{pl}^{\mathrm{T}} + \alpha\mathbf{I})^{-1}, \tag{7}$$

where $\mathbf{H}_l = \mathbf{S}\mathbf{X}_l$ and $\mathbf{H}_{pl} = \mathbf{S}\mathbf{X}_{pl}$.
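As a sanity check, the closed form above (with $\mathbf{H}_l = \mathbf{S}\mathbf{X}_l$ and $\mathbf{H}_{pl} = \mathbf{S}\mathbf{X}_{pl}$) can be compared against the zero-gradient condition of the objective; the sizes and random data below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_sub, l, N, M = 8, 4, 3, 10, 15
eta, alpha = 0.8, 0.01

S = np.linalg.qr(rng.standard_normal((d, d_sub)))[0].T   # orthonormal rows
X_l, X_pl = rng.standard_normal((d, N)), rng.standard_normal((d, M))
Y_l, Y_pl = rng.standard_normal((l, N)), rng.standard_normal((l, M))

H_l, H_pl = S @ X_l, S @ X_pl

# Closed-form solution of the Tikhonov-regularized regression for A (Eq. 7).
A = (Y_l @ H_l.T + eta * Y_pl @ H_pl.T) @ np.linalg.inv(
    H_l @ H_l.T + eta * H_pl @ H_pl.T + alpha * np.eye(d_sub))

# The gradient of the objective in Eq. (6) should vanish at this A.
grad = (A @ H_l - Y_l) @ H_l.T + eta * (A @ H_pl - Y_pl) @ H_pl.T + alpha * A
assert np.allclose(grad, 0, atol=1e-10)
```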

Learning subspace projections (S)

The variable $\mathbf{S}$ can be estimated by solving the following optimization problem.

$$\min_{\mathbf{S}} \frac{1}{2}\|\mathbf{Y}_l - \mathbf{A}\mathbf{S}\mathbf{X}_l\|_{\mathrm{F}}^2 + \frac{\eta}{2}\|\mathbf{Y}_{pl} - \mathbf{A}\mathbf{S}\mathbf{X}_{pl}\|_{\mathrm{F}}^2 + \frac{\beta}{2}\mathrm{tr}(\mathbf{S}\mathbf{X}\mathbf{L}\mathbf{X}^{\mathrm{T}}\mathbf{S}^{\mathrm{T}}) \quad \mathrm{s.t.}\ \mathbf{S}\mathbf{S}^{\mathrm{T}} = \mathbf{I}. \tag{8}$$

The orthogonality-constrained regression problem in Eq. (8) has been effectively solved by using an ADMM-based optimization algorithm (Hong et al., 2019b).
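One common building block for such orthogonality-constrained subproblems (a generic device, not necessarily the exact splitting of Hong et al., 2019b) is projecting an unconstrained iterate onto the set $\{\mathbf{S} : \mathbf{S}\mathbf{S}^{\mathrm{T}} = \mathbf{I}\}$ via the SVD:

```python
import numpy as np

def project_onto_stiefel(G):
    # Nearest matrix (in Frobenius norm) with orthonormal rows:
    # G = U Σ V^T  →  U V^T.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(2)
G = rng.standard_normal((3, 8))        # arbitrary unconstrained iterate
S = project_onto_stiefel(G)

assert np.allclose(S @ S.T, np.eye(3), atol=1e-10)
```

In an ADMM splitting, a projection of this kind typically serves as the update of the auxiliary variable that carries the $\mathbf{S}\mathbf{S}^{\mathrm{T}} = \mathbf{I}$ constraint, while the data-fidelity terms are handled by an unconstrained least-squares step.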


Learning graph structure (W)

In this sub-problem, we learn the connectivity (or similarity) between samples from the data rather than using certain existing distance measurements. Therefore, the resulting optimization problem can be formulated as

$$\min_{\mathbf{W}} \frac{\beta}{4}\|\mathbf{W} \odot \mathbf{Z}\|_{1,1} \quad \mathrm{s.t.}\ \mathbf{W} = \mathbf{W}^{\mathrm{T}},\ \mathbf{W}_{i,j} \geq 0,\ \|\mathbf{W}\|_{1,1} = s, \tag{9}$$
whose solution has been obtained with an effective ADMM as well, as presented in Hong et al. (2019c). Please note that for those samples with labels, we construct a graph based on local discriminant analysis (LDA) (Belkin and Niyogi, 2003) in place of the corresponding part of the learned graph $\mathbf{W}$, as shown in Fig. 2. This LDA-like graph is defined as

$$\mathbf{W}_L(i,j) = \begin{cases} 1/N_k, & \text{if } \mathbf{X}_i \text{ and } \mathbf{X}_j \text{ are samples belonging to the } k\text{-th class;} \\ 0, & \text{otherwise,} \end{cases} \tag{10}$$

where $N_k$ denotes the number of samples belonging to the $k$-th class.¹

¹ Given the inputs of $\mathbf{X}_l$ and $\mathbf{X}_{pl}$ as well as $\mathbf{Y}_l$ and $\mathbf{Y}_{pl}$, we estimate the variables $\mathbf{A}$, $\mathbf{S}$, and $\mathbf{L}$ by solving problem (3). This process is defined as a "step" or, in our case, an "iteration".
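Eq. (10) is straightforward to materialize from a label vector; a small sketch:

```python
import numpy as np

def lda_like_graph(labels):
    # W_L(i, j) = 1 / N_k if samples i and j share class k, else 0  (Eq. 10).
    labels = np.asarray(labels)
    same_class = labels[:, None] == labels[None, :]
    counts = np.array([np.sum(labels == c) for c in labels])  # N_k for each sample
    return same_class / counts[:, None]

W_L = lda_like_graph([0, 0, 1, 1, 1])

# Each row sums to 1, and entries within a class equal 1/N_k.
assert np.allclose(W_L.sum(axis=1), 1)
assert np.isclose(W_L[2, 3], 1 / 3)
```

The resulting block-diagonal structure (one block per class) is the labeled part of the joint adjacency matrix illustrated in Fig. 2.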


Updating pseudo-labels (Y pl )

Given the learned variables and the labeled ($\mathbf{X}_l$) and unlabeled ($\mathbf{X}_{pl}$) samples, we can correspondingly learn the joint graph structure ($\mathbf{W}^{(t)}$) in the $t$-th step from the $t$-th latent feature space ($\mathbf{Z}^{(t)}$). The learned $\mathbf{W}^{(t)}$ can then be further applied to infer the pseudo-labels of the next step ($\mathbf{Y}_{pl}^{(t+1)}$) by label propagation, and the updated pseudo-labels can then be fed into the next round of model training. Please note that the model's iteration will be suspended as soon as the to-be-learned adjacency matrix $\mathbf{W}$ does not change, i.e., the residual error between the current $\mathbf{W}^{(t)}$ and the former step's $\mathbf{W}^{(t-1)}$ is close to zero (e.g., $10^{-6}$).
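The GLP step (in the spirit of Zhu et al., 2003) can be sketched as iterating a row-normalized graph diffusion while clamping the labeled rows; this minimal version is an illustration, not the paper's exact propagation scheme:

```python
import numpy as np

def propagate_labels(W, Y, labeled_mask, n_iter=100):
    # Graph label propagation: F <- P F with labeled rows clamped to Y,
    # where P is the row-normalized adjacency matrix.
    P = W / W.sum(axis=1, keepdims=True)
    F = Y.copy()
    for _ in range(n_iter):
        F = P @ F
        F[labeled_mask] = Y[labeled_mask]   # clamp the known labels
    return F

# Chain graph 0-1-2-3: nodes 0 and 3 are labeled with opposite classes.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Y = np.zeros((4, 2))
Y[0, 0] = 1.0    # node 0 -> class 0
Y[3, 1] = 1.0    # node 3 -> class 1
F = propagate_labels(W, Y, labeled_mask=np.array([True, False, False, True]))

# Node 1 ends up closer to class 0, node 2 closer to class 1.
assert F[1, 0] > F[1, 1] and F[2, 1] > F[2, 0]
```

Taking the row-wise argmax of the converged scores (here, the harmonic solution on the chain) yields the one-hot pseudo-label matrix for the next IMR iteration.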


Convergence analysis and computational complexity

Considering the non-convexity of Eq. (5) when all variables are considered simultaneously, a common and effective solution for the optimization problem is block coordinate descent (BCD), which optimizes the subproblems with respect to $\mathbf{A}$, $\mathbf{S}$, and $\mathbf{W}$ in an alternating strategy. BCD has been guaranteed in theory to converge to a stationary point when each to-be-estimated variable in Eq. (5) can be exactly minimized (Bertsekas, 1997). Owing to the convexity of each independent task, a unique minimum can be ideally found in our case when the Lagrangian parameters used in ADMM are updated within finitely many iterative steps (Boyd et al., 2011). The same or similar criterion has been successfully applied in various practical applications (Hong et al., 2017; Zhou et al., 2017; Xu et al., 2018b; Hong and Zhu, 2018). In addition, we also draw the convergence curves corresponding to the three used datasets by recording the relative loss of the objective function of Eq. (5) in each iteration, as shown in Fig. 4. As can be seen from the figure, our model is able to reach the state of convergence quickly, within more or less 20 steps.

As observed in Section 2.3 (Model Learning), the computational cost of our IMR model is mainly dominated by matrix products, where the most costly step lies in solving $\mathbf{S}$, yielding an overall $\mathcal{O}(td(N+M)^2)$ computational cost for Eq. (5), with $t$ the number of iterations.


Experiments


Data description

Three popular and promising HSI datasets -Indian Pines (Baumgardner et al., 2015), Houston2018 (Le Saux et al., 2018), and Berlin EnMap (Okujeni et al., 2016) -are used to assess the quantitative and qualitative performance of the IMR method, as briefly described below.


Indian pines dataset

The hyperspectral scene, located in northwestern Indiana, USA, has been widely used in various HSI-related tasks, such as dimensionality reduction (Hong et al., 2016b; Hong et al., 2018) and classification (Dópido et al., 2012). It consists of 145 × 145 pixels with 220 spectral bands covering the wavelength range from 400 nm to 2500 nm at intervals of 10 nm. There are 16 classes in the scene, mostly vegetation, as detailed in Table 1 along with the number of training and test samples. Fig. 6 shows the false-color image of the studied scene as well as the distribution of the training and test samples used in Ghamisi et al. (2014) and Hong et al. (2018).


Houston2018 dataset

This dataset is multi-modal data provided for the 2018 IEEE GRSS data fusion contest, where the HSI was acquired by an ITRES CASI 1500 sensor. The HSI, with dimensions of 601 × 2384 × 50, has a ground sampling distance (GSD) of 1 m. This is a complex city scene with 20 challenging classes (see Fig. 7 and Table 1 for more details, including the specific training and test information). Note that we downsampled the ground truth map to the same GSD as the HSI by nearest-neighbor interpolation.


Berlin EnMap dataset

The EnMap HSI with a GSD of 30 m was simulated from the corresponding HyMap data (Mueller et al., 2002) over a hybrid area that includes urban, rural, and vegetated parts of Berlin, Germany; this data is openly and freely available from the website². The image consists of 797 × 220 pixels and 244 spectral bands in the wavelength range from 400 nm to 2500 nm. The ground truth in the scene was generated from OpenStreetMap (Haklay and Weber, 2008) in the form of land cover and land use, and further refined and corrected by means of Google Earth.


Experimental configuration


Evaluation metrics

With the input of the different dimension-reduced features, we adopt pixel-wise classification as a potential application for quantitative evaluation in terms of classification or recognition accuracy. More specifically, three commonly-used indices, Overall Accuracy (OA), Average Accuracy (AA), and Kappa Coefficient ($\kappa$), are computed to quantify the experimental results using two simple but effective classifiers: nearest neighbor (NN) and linear SVM (LSVM). In our case, these two classifiers were selected because more powerful classifiers (e.g., kernel SVM, random forest, deep neural network) tend to result in a confusing evaluation, as it would be unknown whether a performance improvement originates from the advanced classifier or from the features themselves.
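The three indices can be computed from a confusion matrix; a compact NumPy sketch (the tiny label vectors are made up for illustration):

```python
import numpy as np

def classification_scores(y_true, y_pred, n_classes):
    # Confusion matrix C[i, j]: number of samples of class i predicted as class j.
    C = np.zeros((n_classes, n_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    n = C.sum()
    oa = np.trace(C) / n                                # Overall Accuracy
    aa = np.mean(np.diag(C) / C.sum(axis=1))            # Average Accuracy (mean recall)
    pe = np.sum(C.sum(axis=0) * C.sum(axis=1)) / n**2   # chance agreement
    kappa = (oa - pe) / (1 - pe)                        # Kappa Coefficient
    return oa, aa, kappa

y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 0]
oa, aa, kappa = classification_scores(y_true, y_pred, 3)

assert np.isclose(oa, 6 / 8)
assert np.isclose(aa, (2/3 + 1 + 2/3) / 3)
assert np.isclose(kappa, 27 / 43)
```

OA weights every test pixel equally, whereas AA averages per-class recalls, which is why the two diverge under the unbalanced class distributions discussed later.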


Comparison with state-of-the-art baselines

We evaluate the performance of the proposed IMR model visually and quantitatively in comparison with eight state-of-the-art baselines, including:

• Non-HDR: original spectral features (OSF); • Supervised HDR: feature space discriminant analysis (FSDA) (Imani and Ghassemian, 2015), joint learning (JL) (Hong et al., 2019b);

• Semi-supervised subspace learning for HDR: semi-supervised local discriminant analysis (SELD) (Liao et al., 2013), collaborative discriminative manifold embedding (CDME) (Lv et al., 2017);

• GLP-based semi-supervised HDR: soft-label LDA (SL-LDA) (Zhao et al., 2014), semi-supervised Fisher local discriminant analysis (SSFLDA) (Wu and Prasad, 2018).


Implementation preparation

The parameter settings for the algorithms play a key role in performance assessment. A common tactic for model selection is to run cross-validation on the training set. Following that, we conducted a 10-fold cross-validation to determine the optimal parameter combination for the different algorithms. In detail, the parameters that need to be tuned to maximize the classification performance on the training set were: the subspace dimension³ ($d_{sub}$), selected from 5 to 50 at intervals of 5; the number of nearest neighbors ($k$) and the standard deviation ($\sigma$) in SELD and SSLFDA, with $k$ ranging over $\{10, 20, \ldots, 50\}$ and $\sigma$ over five logarithmically-spaced values; and the regularization parameters, selected from $\{0.1, 0.2, \ldots, 0.9\}$. Moreover, initializing the adjacency matrix ($\mathbf{W}$) and the pseudo-labels ($\mathbf{Y}_{pl}$) in IMR is also an important factor in determining the model's performance. We first predict the unlabeled samples using a pre-trained classifier on the training set; the predicted results can then be naturally input into the model as pseudo-labels. Likewise, the initial $\mathbf{W}$ can be given by the labels and pseudo-labels. In addition, note that a clustering technique (e.g., K-means) is applied to handle the high computational complexity caused by the large quantity of unlabeled samples during model learning. As a trade-off, the number of cluster centers used in our case is set to be approximately the same as the number of training samples.
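The clustering-based reduction of the unlabeled pool can be sketched with a minimal K-means (a plain Lloyd iteration with deterministic initialization, for illustration only; any standard implementation would do):

```python
import numpy as np

def kmeans_centers(X, k, n_iter=20):
    # Plain Lloyd's algorithm: returns k cluster centers of the rows of X,
    # used here to stand in for a large pool of unlabeled samples.
    centers = X[np.linspace(0, len(X) - 1, k, dtype=int)].copy()  # spread-out init
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return centers

# Two well-separated toy blobs -> two centers, one at each blob mean.
X = np.vstack([np.zeros((50, 3)), 10 + np.zeros((50, 3))])
centers = kmeans_centers(X, k=2)

assert centers.shape == (2, 3)
assert np.allclose(np.sort(centers[:, 0]), [0.0, 10.0])
```

The resulting centers replace the raw unlabeled pixels as $\mathbf{X}_{pl}$, keeping the $(N+M) \times (N+M)$ graph matrices tractable.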


The number of iterations in the proposed IMR

According to the model's stopping criteria in Algorithm 1, our IMR method generally converges to a desirable solution, corresponding to a well-learned adjacency matrix ($\mathbf{W}$), within three or four iterations. To support this result more effectively, we further investigate the effect of assigning different numbers of iterations in IMR for the three datasets. Fig. 5 gives both visual and quantitative results as the number of IMR iterations increases⁴. Note that IMR with iteration 0 degrades to a version without label propagation. The OAs are clearly much lower without the iterative strategy to update pseudo-labels (iteration 0) than with several iterations. Intuitively, this proves the superiority of the iterative strategy of gradually optimizing the pseudo-labels. It is worth noting, however, that the performance gain starts to slow down after two iterations and then remains essentially stable in the follow-up iterations, as the variable $\mathbf{W}$ hardly changes any further. Similarly, for the different numbers of iterations, there is a consistent trend in the compactness of intra-class samples and the separability of inter-class samples. To summarize, we set the number of iterations in the IMR to 3 (IMR-3 for short); this setting is used for comparison in the following experiments.


Results and analysis


The Indian pines dataset

Fig. 6 presents the classification maps for the different HDR methods compared, obtained with the optimal parameter combination.

Using the NN classifier, there is basically the same classification performance for OSF and FSDA. Despite an improved supervised criterion, FSDA still yields poor classification accuracy, since directly projecting the original data into a discriminative subspace with a limited amount of labeled samples is very challenging, especially when dealing with noisy data (e.g., HSI) exhibiting various spectral variabilities. Overall, the classification performance when considering the unlabeled samples is better than that without considering them. It should be noted that, inspired by latent subspace learning, the JL model not only dramatically outperforms FSDA (more than 10% improvement), but also improves the OAs by around 4%, 6%, 2%, and 1%, respectively, compared to the semi-supervised HDR approaches (SELD, CDME, SL-LDA, and SSLFDA). This intuitively indicates the superiority of the regression-based JL model for feature learning. Following the JL-like models, the proposed IMR framework achieves the best performance, owing to the multitask learning framework, where the labeled and unlabeled samples can be jointly regressed, and to the iterative updating strategy for pseudo-labels. There is a similar trend in classification performance using the LSVM classifier, yet its performance is relatively weaker than that with the NN classifier. A possible reason is the few training samples available, which leads to a poor estimation of the decision boundary for SVM-like classifier learning.

Furthermore, we can observe from Table 2 that our IMR not only outperforms the other methods in terms of OA, AA, and $\kappa$, but also obtains highly competitive results for each class, particularly for those classes with a relatively limited number of training samples compared to the number of test samples, such as Corn-Notill, Grass-Trees, Soybean-Notill, Soybean-Mintill, Soybean-Clean, and Wheat. This provides powerful evidence of the effectiveness of transferring the unlabeled samples to the learned subspace and of the superiority of iteratively optimizing the pseudo-labels.


The Houston2018 dataset

Classification performance using the different low-dimensional feature representations is evaluated on the Houston2018 dataset both visually and quantitatively, as shown in Fig. 7 and listed in Table 3, respectively. The optimal parameters used for the different compared methods are given in Table 3 as well. Likewise, due to the more challenging categories in this scene and the small-scale training set, the ability of the LSVM to classify the materials is limited. This might explain a phenomenon in Table 3, namely why the NN-based classifier, to some extent, performs better than the SVM-based one for many of the compared methods.

More specifically, OSF yields a poor classification performance, due to the highly redundant spectral information and its sensitivity to noise. Unlike OSF, which directly uses the original spectral features as the input, FSDA and JL are more apt to discriminate the materials due to the utilization of the label information. Further, taking the unlabeled samples into account is of great benefit in finding a better decision boundary, yielding a possible performance improvement, as shown by the subspace-learning-based semi-supervised HDR methods (e.g., SELD, CDME). It is worth noting that the regression-based JL model provides nearly identical performance to those semi-supervised HDR approaches using both the NN and LSVM classifiers, even though the latter utilize the powerful GLP (e.g., SL-LDA, SSFLDA). As expected, the performance of the IMR framework, which optimizes the pseudo-labels in an iterative fashion, is dramatically superior to that of the others, with OA increases of approximately 8% (NN) and 5% (LSVM). More intuitively, the proposed IMR performs better at identifying each material than the other methods. In particular, when facing the extremely unbalanced sample distribution (see Table 1), our method gradually improves the quality of the pseudo-labels, thereby giving the model a more powerful learning ability. Table 3 also reveals an interesting but unsurprising result: for those classes with a very limited number of training samples (e.g., Deciduous Trees, Bare Earth, Water, Crosswalks, Highways, Unpaved Parking, and Stadium Seats), the IMR makes a significant performance gain (an increase of at least 50% for these classes) with the aid of iterative pseudo-label learning.

Table 3
Quantitative performance comparison among the different algorithms with the optimal parameters on the Houston2018 dataset in terms of OA, AA, and $\kappa$, as well as accuracy for each class. The best is shown in bold. Note that IMR-3 denotes the IMR with three iterations.


The Berlin EnMap dataset

For the Berlin EnMap dataset, the visual comparison of eight different algorithms in the form of classification maps is shown in Fig. 8. Table 4 details the comparison by means of three quantitative indices: OA, AA, and .

With a very high spectral dimension (244), OSF only attains a 53.97% accuracy when using the NN classifier. The performance of the supervised HDR methods (FSDA and JL) is obviously superior to that of OSF, with an increase of at least 8% using the NN classifier. This reveals the importance of HDR for follow-up hyperspectral data analysis. Furthermore, these methods exhibit balanced accuracies using the LSVM classifier, where JL shows a better classification performance owing to its well-designed architecture for regression-based latent subspace learning. SELD learns the subspace projections by not only considering the label information but also computing the similarities between the unlabeled samples, yielding an effective semi-supervised low-dimensional embedding. However, the similarities between samples are usually measured by certain fixed functions, i.e., the radial basis function (RBF), in the high-dimensional space, leading to poor robustness and generalization ability. CDME implements an automatic similarity measurement by collaboratively representing the connectivity between the samples for the low-dimensional embedding. By means of soft (or pseudo) labels instead of a similarity measurement, SL-LDA and SSFLDA jointly use the labels and pseudo-labels to find a highly discriminative subspace in a semi-supervised embedding approach.

Beyond the two subspace-based (SELD and CDME) and two GLP-based (SL-LDA and SSFLDA) semi-supervised strategies, we propose to iteratively optimize the pseudo-labels and feed them into a multitask regression framework in order to find a latent o

ses in the studied image exceeds the vast majority of the compared methods, except for the Commercial class, thereby further revealing the IMR's advantages in low-dimensional representation learning.


Parameter sensitivity analysis


On the regularization parameters

The quality of the low-dimensional features extracted by the proposed IMR model is, to some extent, sensitive to the selection of the three regularization parameters in Eq. (5). For this reason, we experimentally investigate the effects of different parameter settings in terms of OA via the NN classifier. The resulting analysis on the three datasets is quantified in Fig. 9, where the parameter combinations (0.8, 0.01, 0.1), (0.9, 0.01, 0.01), and (0.8, 0.1, 0.01) obtain the optimal classification performance on the test set for the Indian Pines dataset, the Houston2018 dataset, and the Berlin EnMap dataset, respectively. The results regarding the parameter settings are basically consistent with those obtained by cross-validation on the training set (see Section 3.2.3: Implementation Preparation). Thus, the cross-validation strategy can be effectively used to determine the model's parameters so that other researchers can reproduce the results for their own tasks.
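The cross-validated selection of the regularization triple can be sketched as a plain grid search. Here `train_eval` and the candidate grid are hypothetical stand-ins, since the paper's exact search range is not restated in this section; only the search loop itself is shown.

```python
import itertools
import numpy as np

def grid_search(train_eval, grid=(0.001, 0.01, 0.1, 0.8, 0.9, 1.0)):
    """Pick the regularization triple maximizing cross-validated OA.

    train_eval(params): user-supplied (hypothetical) function that
    trains IMR with the triple `params` and returns a CV overall
    accuracy. Returns the best triple and its score.
    """
    best_params, best_oa = None, -np.inf
    for params in itertools.product(grid, repeat=3):
        oa = train_eval(params)
        if oa > best_oa:
            best_params, best_oa = params, oa
    return best_params, best_oa
```

Exhaustive search over three parameters is cubic in the grid size, which stays affordable here because each IMR fit is itself a fast matrix-based computation.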


On the subspace dimension

Apart from the regularization parameters, we analyze the performance of our IMR method at different subspace dimensions, since a proper subspace dimension tends to reach a trade-off between discrimination and redundancy in the dimension-reduced product. For this purpose, the corresponding experiments are conducted using the NN classifier to observe the classification performance as the subspace dimension varies. As can be seen from Fig. 10, with the increase of the subspace dimension, the IMR's performance sharply increases until around 20 for the first dataset, 30 for the second dataset, and 20 for the last dataset, respectively, then reaches a relatively stable state, and finally decreases with a slight perturbation as the subspace dimension approaches that of the original spectral signature.
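The dimension sweep described above amounts to evaluating a 1-NN classifier on features of increasing dimensionality. A minimal numpy sketch follows, where `project(X, d_sub)` is a hypothetical stand-in for running IMR at a given subspace dimension; only the evaluation harness is real.

```python
import numpy as np

def nn_overall_accuracy(Z_train, y_train, Z_test, y_test):
    """OA of a 1-nearest-neighbour classifier in a learned subspace.

    Z_*: (d_sub, N) dimension-reduced features, one sample per column.
    """
    sq_tr = np.sum(Z_train**2, axis=0)
    sq_te = np.sum(Z_test**2, axis=0)
    d2 = sq_te[:, None] + sq_tr[None, :] - 2.0 * (Z_test.T @ Z_train)
    pred = y_train[np.argmin(d2, axis=1)]
    return np.mean(pred == y_test)

def sweep_dimensions(project, dims, X_train, y_train, X_test, y_test):
    """Evaluate OA for each candidate subspace dimension."""
    return {d: nn_overall_accuracy(project(X_train, d), y_train,
                                   project(X_test, d), y_test)
            for d in dims}
```

Plotting the resulting dictionary against `dims` reproduces the kind of curve shown in Fig. 10: a sharp rise, a plateau, and a mild decline near the full spectral dimension.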


On the training set size

Although the IMR adopts the semi-supervised learning strategy by jointly a

eled samples, yet the HDR's performance is determined to a great extent by the number of training samples. It is, therefore, indispensable to investigate the sensitivity to an increasing training set size. To highlight the effectiveness and superiority of our proposed method for the HDR issue, we arrange the classification task by randomly selecting the training set from all labeled samples over ten runs, with proportions ranging from 5% to 50% at 5% intervals and the rest used as the test set; the average classification accuracies over the ten runs are reported. Fig. 11 shows a similar trend in OAs with the two classifiers (NN and LSVM) on the three different datasets: the classification performance improves with the size of the training set, quickly at first and then basically stabilizing. This also indicates that our semi-supervised method is not heavily dependent on a large-scale training set and can hold a desirable and competitive performance in HDR, even when only small-scale labeled samples are used for training. On the other hand, we can observe an interesting result in Fig. 11 on the first two datasets: the NN classifier outperforms the LSVM one when the training samples are insufficient, e.g., less than around 15% of the total samples. This can be explained by the fact that LSVM is a learning-based classifier that depends on adequate samples to train an effective model, which is also supported by the higher OAs obtained with LSVM than with NN when more training samples are used. Furthermore, as the number of training samples increases, the performance gain gradually slows and meets a bottleneck, probably due to the lack of spatial information modeling.
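The repeated-random-split protocol above can be sketched as follows. A per-class (stratified) sampling variant is assumed here, since the text only states that the training set is randomly selected from all labeled samples; the helper name is our own.

```python
import numpy as np

def random_splits(labels, proportion, n_runs=10, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated random splits.

    Samples `proportion` of each class for training (at least one
    sample per class), repeated n_runs times; the remaining indices
    form the test set, mirroring the 5%-50% protocol described above.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    for _ in range(n_runs):
        train_idx = []
        for c in np.unique(labels):
            idx = np.flatnonzero(labels == c)
            rng.shuffle(idx)
            n_train = max(1, int(round(proportion * idx.size)))
            train_idx.append(idx[:n_train])
        train_idx = np.concatenate(train_idx)
        test_idx = np.setdiff1d(np.arange(labels.size), train_idx)
        yield train_idx, test_idx
```

Averaging the per-run OAs over the generator's ten splits gives the kind of curve reported in Fig. 11 for each proportion.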


Computational cost in different methods

The experiments for HDR conducted by different methods are implemented for simulation on a laptop with an i7-6700HQ CPU (2.60 GHz) and 32 GB of random access memory (RAM). Herein, we assess the operational efficiency of the compared HDR approaches in terms of running time, as listed in Table 5.

In general, the running time of supervised HDR is much less than that of semi-supervised HDR, for example between supervised discriminant analysis (SFDA) and the semi-supervised discriminant analyses (SELD, CDME, SL-LDA, and SSFLDA). The same conclusion applies to the other group, that is, JL and our proposed IMR. Remarkably, although the newly-proposed IMR model seems operationally complex compared to the other HDR methods, it turns out to be computationally efficient with an acceptable time cost, mainly owing to the fast matrix-based computations of regression-based techniques.
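Wall-clock timings of the kind reported in Table 5 can be measured with a small harness like the following; `fit` stands in for any of the compared HDR routines (hypothetical), and averaging over repeats reduces timer noise.

```python
import time

def time_method(fit, X, n_repeats=5):
    """Average wall-clock running time of an HDR method on data X.

    fit: any callable running the method on X (a hypothetical
    stand-in for the compared HDR algorithms). Uses a monotonic
    high-resolution timer and returns the mean over n_repeats runs.
    """
    elapsed = []
    for _ in range(n_repeats):
        t0 = time.perf_counter()
        fit(X)
        elapsed.append(time.perf_counter() - t0)
    return sum(elapsed) / len(elapsed)
```

For fair comparisons of this kind, all methods should be timed on the same machine and the same data, as is done in the experiments above.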


Conclusions

To facilitate the effective and efficient use of unlabeled samples, we propose a novel regression-based semi-supervised HDR model, called iterative multitask regression (IMR), which 1) simultaneously bridges the labeled and unlabeled samples with the labels and pseudo-labels in a multitask regression framework; and 2) progressively updates the pseudo-labels in an iterative fashion. This model provides new insight into the solutions of HDR-related problems. We conducted extensive experiments on three convincing and challenging HSI datasets, demonstrating that our method (IMR) is capable of extracting more discriminative features by taking the unlabeled samples into account and by optimizing the pseudo-labels.

It should be noted, however, that despite the desirable performance boost of IMR, it is still limited to linearly learning the low-dimensional feature representations, which restricts its effectiveness in complex nonlinear cases. For this reason, our future work will address the HDR issue in more complex scenes and extend our framework to a nonlinear one with possible spatial information modeling.



the unfolded hyperspectral data with d bands by N pixels (or samples)


Fig. 3. An illustration of label propagation used for updating the pseudo-labels, where $\mathbf{Z}_l^{(t)} = \mathbf{S}^{(t)}\mathbf{X}_l$ and $\mathbf{Z}_{pl}^{(t)} = \mathbf{S}^{(t)}\mathbf{X}_{pl}$.


Fig. Visual and quantitative (OA) performance analysis with different numbers of iterations in IMR on the three datasets.


Fig. 6. False-color image, the distribution of training and test samples, as well as classification maps of the compared methods using two different classifiers on the Indian Pines dataset. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)


Fig. 7. False-color image, the distribution of training and test samples, as well as classification maps of the compared methods using two different classifiers on the Houston2018 dataset. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)


Fig. 8. False-color image, the distribution of training and test samples, as well as classification maps of the compared methods using two different classifiers on the Berlin EnMap dataset. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)


Fig. 9. Sensitivity analysis on the three regularization parameters of the IMR in Eq. (5).


Fig. 10. Sensitivity analysis on the subspace dimension in the proposed IMR method.




as follows.

$$\min_{\mathbf{A},\mathbf{S}}\ \frac{1}{2}\left\|\mathbf{Y}_l-\mathbf{A}\mathbf{S}\mathbf{X}_l\right\|_F^2+\frac{\alpha}{2}\left\|\mathbf{A}\right\|_F^2\quad \mathrm{s.t.}\ \mathbf{S}\mathbf{S}^T=\mathbf{I},\qquad(1)$$

where $\mathbf{S}\in\mathbb{R}^{d_{sub}\times d}$ and $\mathbf{A}\in\mathbb{R}^{L\times d_{sub}}$ (with $L$ the number of classes) denote the subspace projection and the regression matrix linking the estimated subspace with the label information, respectively, and $d_{sub}$ represents the subspace dimension. $\|\cdot\|_F$ denotes the Frobenius norm and $\alpha$ is the regularization parameter. The manifold (graph) regularization is formulated by optimizing the following objective function:

$$\min_{\mathbf{A},\mathbf{S}}\ \frac{1}{2}\left\|\mathbf{Y}_l-\mathbf{A}\mathbf{S}\mathbf{X}_l\right\|_F^2+\frac{\alpha}{2}\left\|\mathbf{A}\right\|_F^2+\frac{\beta}{2}\,\mathrm{tr}\!\left(\mathbf{S}\mathbf{X}_l\mathbf{L}_l\mathbf{X}_l^T\mathbf{S}^T\right)\quad \mathrm{s.t.}\ \mathbf{S}\mathbf{S}^T=\mathbf{I},\qquad(2)$$

where $\mathbf{L}_l=\mathbf{D}_l-\mathbf{W}_l\in\mathbb{R}^{N_l\times N_l}$ is the Laplacian matrix, $\mathbf{W}_l\in\mathbb{R}^{N_l\times N_l}$ is an adjacency matrix (graph), and $\mathbf{D}_l$ is the diagonal degree matrix with $(\mathbf{D}_l)_{ii}=\sum_j(\mathbf{W}_l)_{ij}$.
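With $\mathbf{S}$ held fixed, minimizing Eq. (1) over $\mathbf{A}$ is a ridge regression with the closed form $\mathbf{A}=\mathbf{Y}_l\mathbf{Z}^T(\mathbf{Z}\mathbf{Z}^T+\alpha\mathbf{I})^{-1}$, $\mathbf{Z}=\mathbf{S}\mathbf{X}_l$, obtained by setting the gradient with respect to $\mathbf{A}$ to zero. Below is a numpy sketch of this sub-step together with the Laplacian construction of Eq. (2); it is an illustrative alternating-style update under the stated shapes, not the paper's complete solver (the $\mathbf{S}$-update under the orthogonality constraint is not shown).

```python
import numpy as np

def update_A(Y_l, S, X_l, alpha):
    """Closed-form ridge update for A in Eq. (1) with S held fixed.

    Setting the gradient of (1) w.r.t. A to zero gives
        A = Y_l Z^T (Z Z^T + alpha I)^{-1},  with Z = S X_l.
    Shapes follow the text: X_l is (d, N_l), S is (d_sub, d),
    Y_l is (L, N_l) with L classes.
    """
    Z = S @ X_l                              # (d_sub, N_l) low-dim features
    d_sub = Z.shape[0]
    G = Z @ Z.T + alpha * np.eye(d_sub)      # regularized Gram matrix
    return Y_l @ Z.T @ np.linalg.inv(G)

def graph_laplacian(W):
    """L = D - W, with D the diagonal degree matrix, as in Eq. (2)."""
    return np.diag(W.sum(axis=1)) - W
```

A quick check of the update is that the returned $\mathbf{A}$ satisfies the stationarity condition $(\mathbf{A}\mathbf{Z}-\mathbf{Y}_l)\mathbf{Z}^T+\alpha\mathbf{A}=\mathbf{0}$, and that the Laplacian's rows sum to zero for a symmetric graph.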