MHST: Multiscale Head Selection Transformer for Hyperspectral and LiDAR Classification

The joint use of hyperspectral image (HSI) and light detection and ranging (LiDAR) data has yielded significant performance gains in land-cover classification. Although spatial–spectral feature learning methods based on convolutional neural networks and transformer networks have achieved prominent advances, contextual information described by fixed convolutional kernels, together with the use of all self-attention heads, limits the ability to characterize the detailed information and nonredundant features of land covers in multimodal data. In this article, a multiscale head selection transformer (MHST) network is proposed to fully explore detailed and nonredundant features in the spatial and spectral dimensions of HSI and LiDAR data. To better acquire detailed spatial and spectral features at different scales, a multiscale spectral–spatial feature extraction module, including cascaded multiscale 3-D and 2-D convolutional layers, is inserted into MHST. Simultaneously, an adaptive global feature extraction module based on a head selection pooling transformer is placed after the transformer encoder module to alleviate token redundancy in an adaptive computation style. Finally, we develop a multimodal–multiscale feature fusion classification module that combines local features with the global class token to exploit a powerful global–local fusion style. Extensive experiments on three popular datasets demonstrate that MHST significantly outperforms other related networks.


I. INTRODUCTION
HYPERSPECTRAL image (HSI) [1], [2], [3], [4], [5] data, one type of remote sensing data, have been widely used in several applications [6], [7], [8], [9], [10] related to land-cover mapping, target detection, mineral exploration, etc., owing to their rich spectral information, which can accurately reflect the spectral reflection characteristics of ground objects [11]. Nevertheless, as land covers with complex structures proliferate, a single remote sensing modality has been unable to meet the requirements of high-precision classification, for example, for ground objects with similar spectral characteristics but different elevation information. Light detection and ranging (LiDAR) data or a digital surface model (DSM) can provide the height of objects on the Earth's surface [12], [13], [14]; therefore, the integration of HSI and LiDAR data opens up the possibility of enhancing land-cover classification performance [15], [16], [17] through multimodal feature fusion and interaction.
Despite these advantages of HSI and LiDAR data, there remain some unique technical challenges in land-cover classification, as follows, that significantly constrain their applicability.
1) For data characteristics, the scale variation of land covers makes it difficult to accurately depict their local characteristics. 2) For feature learning, given the comprehensive consideration of feature redundancy, the global sequence properties of HSI spectral features and LiDAR data limit the improvement in classification accuracy. Owing to the rapid development of deep learning, a multitude of deep neural networks, e.g., convolutional neural networks (CNNs), have exhibited significant promise in joint hyperspectral and LiDAR classification [18], [19], [20], [21] due to their powerful ability to extract local features. Hang et al. [22] utilized coupled CNNs with feature-level and decision-level fusion strategies to acquire distinguishable features from HSI and LiDAR data. Zhang et al. [23] proposed an interleaving perception CNN, an information fusion network, for classifying land covers via hyperspectral and LiDAR data. Moreover, to address weak boundaries and spatially fragmented classification, a dual-tunnel CNN with a hierarchical random walk layer [24] was given and significantly enhanced the classification performance of HSI and LiDAR data. Nevertheless, for land covers with different scales and complex terrain structures, using data blocks at multiple scales as network inputs, e.g., in the global–local transformer network (GLT-Net) [25], or applying multiscale convolution operations is an effective approach. This enables the simultaneous utilization of distinguishable local information at different levels, leading to a better understanding of the high-level semantic features within HSI and LiDAR data.
For global sequence feature learning, the transformer has a significant advantage in capturing long-term dependencies and global deep features. In collaborative land-cover classification using HSI and LiDAR, constructing long-range dependencies can effectively capture spectral information and global information of land covers, as in the binary-tree transformer network [26] and the parallel transformer [27]. Particularly, for multisource remote sensing feature fusion learning, feature redundancy affects model discriminability. A local information interaction transformer model was therefore proposed by Zhang et al. [28] to mine complementary information and address the data imbalance problem of HSI–LiDAR data; simultaneously, this model reduces redundant information by dynamically filtering source components. To address the limitations and gaps of newly acquired Earth observation data from a single source, Feng et al. [29] inserted two effective modules into a spectral–spatial–elevation fusion transformer; it is worth noting that this transformer network can reduce redundant spatial information. Currently, although some existing transformer-based methods consider the redundancy of spatial features, most do not consider the redundancy of global sequence features for HSI and LiDAR data. Therefore, starting from the characteristics of transformer models, it is of great significance to construct an adaptive feature selection mechanism for extracting the global sequence properties of HSI and LiDAR data.
To address the challenge of characterizing the detailed local information and nonredundant global sequence properties of land covers in multimodal data, following the CNNs–transformer feature learning style, a multiscale head selection transformer (MHST) network is proposed. For the spatial and spectral information of HSI, cascaded multiscale 2-D and 3-D convolutions are utilized to fully capture the spatial–spectral details of HSI. For LiDAR data, multiple multiscale 2-D convolutions are employed to fully capture the elevation and spatial information of LiDAR. Furthermore, starting from the transformer structure, an adaptive global feature extraction module based on a head selection pooling transformer, applied to the dual-branch fused features, is introduced after the transformer encoder module to mitigate token redundancy in an adaptive computational style. Finally, the multiscale aggregated spatial–spectral features and nonredundant global sequence properties are fed into the classifier to accomplish land-cover classification. In summary, the proposed MHST makes three main contributions.
1) The embedding of multiscale spectral-spatial feature extraction (MSFE) module can simultaneously capture spatial and spectral features from HSI and LiDAR data at different scales, effectively considering the global structure and local detailed characteristics of various-scale land-covers.
2) The head selection pooling transformer based on a decision network is proposed for learning global and nonredundant spectral features. This is achieved through the sequential stacking of multiple conventional transformer layers and an adaptive head selection pooling transformer. 3) On three publicly available datasets, we validate the impact of different feature fusion weights on MHST and confirm the effectiveness of the proposed MHST. In addition, we publicly provide the code, training weights, and training logs. The rest of the article is organized as follows. Related works are given in Section II. In Section III, we introduce MHST, and we compare its experimental results with those of other related methods in Section IV. Finally, Section V concludes this article.

A. CNNs-Based Methods
In CNNs-based methods, dual-branch or multibranch classification architectures can effectively classify land covers [30]. Xue et al. [31] inserted a hierarchical residual structure, self-calibrated convolution, a self-attention module, and a nonlinear feature fusion style into a multiscale deep learning network with self-calibrated convolution. Xu et al. [17] focused on addressing the challenge of imbalanced multimodal learning and feature interaction and proposed a dual-branch dynamic modulation network. Roy et al. [32] incorporated morphology learning and convolutional features into dual-branch networks to explore powerful joint features. Fang et al. [33] utilized spatial and spectral enhancement modules to enhance spatial and spectral features effectively. To strengthen the collaborative utilization of multisource data, superpixel-guided kernel principal component analysis, 2-D and 3-D Gabor filters, and a weighted majority voting-based decision fusion strategy were incorporated to effectively enhance multisource land-cover classification [34]. Other relevant CNNs-based methods include attention-based CNNs methods [35], [36], [37], [38], [39], a triplet deep neural network [40], a deep encoder–decoder network (EndNet) [41], the MDL-cross method [42], and a feature fusion and extraction framework (FusAtNet) [43]. CNNs-based methods can effectively fuse the rich spectral information of HSI and the elevation information of LiDAR, fully leveraging their complementarity.

B. Transformer-Based Methods
Although CNNs excel at capturing local features and spatial structures in imagery, global spatial–spectral association is crucial for classifying HSI and LiDAR data. GLT-Net [25] introduced multiscale convolutional and transformer-based spectral feature learning modules; thus, the complete exploration and collaborative utilization of complementary multimodal data, as well as local and global spectral–spatial details, can be achieved. Meanwhile, a fusion encoder known as cross-token attention [44] was created to merge the spectral and spatial features of HSI and LiDAR data. Xue et al. [45] and Zhang et al. [46] designed a spatial–spectral hierarchical transformer and a multimodal transformer, respectively, to explore the effectiveness of transformer structures. A transformer and multiscale fusion network [47], including an attention strategy and a scale-based method, was applied to LiDAR and HSI classification. In addition, a number of improved transformer networks have been employed in the HSI classification field, such as [48], [49], [50], [51], [52], [53], [54]. Although the above-mentioned methods exhibit significant advantages in feature-level learning and fusion, they face limitations in effectively capturing the intricate details and distinctive characteristics of land cover in multimodal data. (Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.)

III. METHODOLOGY
The proposed MHST primarily consists of three parts: MSFE, adaptive global–local feature extraction (AGFE), and multimodal–multiscale feature fusion classification (MFFC), as illustrated in Fig. 1.

A. Multiscale Spectral-Spatial Feature Extraction
Since the conventional transformer has deficiencies in local feature expression, a number of models applied to land-cover classification adopt CNNs to preliminarily extract local spatial features from the input images. In previous work, small convolutional kernels have been widely used in CNNs due to their few parameters and high computational efficiency. These CNNs expand the receptive field by stacking multiple small kernels to form a convolutional chain and use downsampling layers to gradually reduce the input size. However, as each convolutional kernel only focuses on a local area, the receptive field may still be limited even after stacking multiple layers, filtering out some subtle but important details in the feature map and thereby affecting the subsequent transformer block's understanding of the global structure of HSI and LiDAR images from the CNN-extracted features. Moreover, the strategy of using relatively small convolutional kernels and gradually increasing the receptive field may have limitations in handling objects of different sizes. To address these issues, we design the MSFE module, which consists of multiscale CNNs that process the input image with parallel kernels of different sizes, expanding the receptive field and capturing information at different levels to improve the model's ability to handle multiscale and complex scenes.
Moreover, there are physical interactions among spectral bands and correlations between spectral features in HSI data. Hence, multiscale 3-D convolutional kernels are inserted into MSFE to simultaneously capture spectral and spatial features from HSI data. The spatial dimensions of the multiscale 3-D convolutional kernels are all set to 1, while the depth dimensions are sequentially set to a, a ∈ {1, 3, 5, 11}. For the multiscale 2-D CNNs used on LiDAR data [55], [56], the sizes of the four levels of convolutional kernels are empirically set to b × b, b ∈ {3, 5, 7, 9}. Notably, spectral channels within each convolution layer are grouped to reduce model computation. Finally, a batch normalization layer and ReLU activation function are applied in MSFE.
MSFE captures features from single-scale HSI and LiDAR image blocks through a spectral–spatial feature encoder (SSFE) and a spatial feature encoder (SFE). Concretely, the SSFE is a 3-D convolutional sequence composed of single-scale and multiscale 3-D CNNs, denoted as the sequence operator E_ssf. An SFE can be characterized as a 2-D convolutional sequence incorporating single-scale and multiscale 2-D CNNs, denoted as the sequence operator E_sf.
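To make the parallel multiscale design concrete, the following NumPy sketch applies mean filters of the four 2-D kernel sizes b ∈ {3, 5, 7, 9} in parallel with "same" (edge) padding and stacks the responses. It is a toy stand-in for the learned multiscale 2-D convolutions in the SFE, not the trained network itself.

```python
import numpy as np

def multiscale_2d(x, sizes=(3, 5, 7, 9)):
    """Parallel mean filters of several kernel sizes with 'same'
    (edge) padding, stacked along a new channel axis -- a toy
    stand-in for the learned multiscale 2-D convolutions."""
    h, w = x.shape
    outs = []
    for b in sizes:
        p = b // 2
        xp = np.pad(x, p, mode="edge")      # pad edge pixels first
        out = np.empty((h, w))
        for i in range(h):
            for j in range(w):
                out[i, j] = xp[i:i + b, j:j + b].mean()
        outs.append(out)
    return np.stack(outs)                   # (len(sizes), H, W)

feats = multiscale_2d(np.arange(49.0).reshape(7, 7))
print(feats.shape)  # (4, 7, 7)
```

Each output channel sees the same pixel at a different receptive-field size, which is the property the MSFE module exploits.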
Consider HSI data X_H ∈ R^{H×W×d_h} and LiDAR data X_L ∈ R^{H×W}, where H × W represents the original spatial size of the HSI and LiDAR data and d_h is the original spectral dimension of the HSI. After padding around the data's edge pixels, patch extraction is performed on each pixel of the HSI and LiDAR data separately, resulting in HSI cubes X_P^H ∈ R^{m×m×d_h} and LiDAR patches X_P^L ∈ R^{m×m}, where m × m denotes the spatial size.
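The per-pixel patch extraction described above can be sketched as follows; the band dimension is omitted for brevity and reflect padding is assumed for illustration.

```python
import numpy as np

def extract_patches(img, m):
    """Return an m x m patch centred on every pixel of a 2-D image,
    after reflect-padding the border (band dimension omitted)."""
    assert m % 2 == 1, "odd patch size keeps the centre pixel defined"
    p = m // 2
    padded = np.pad(img, p, mode="reflect")
    h, w = img.shape
    return np.array([[padded[i:i + m, j:j + m] for j in range(w)]
                     for i in range(h)])    # (H, W, m, m)

patches = extract_patches(np.arange(12.0).reshape(3, 4), m=3)
print(patches.shape)  # (3, 4, 3, 3)
```

Every pixel thus becomes the centre of its own m × m training sample, matching the per-pixel classification setup.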
The HSI cube X_P^H from the training set is employed as the input sample. Initially, it is passed through the SSFE, generating a spectral–spatial signature cube. After flattening along the depth dimension, the cube is input into the SFE to further extract spatial features from the spectral–spatial feature cube. Finally, a max-pooling operation is applied to halve the spatial dimensions of the cube, resulting in the HSI spectral–spatial feature f_ss^h. Similarly, the LiDAR patch X_P^L used for training goes through two identical stages of SFE and max-pooling, resulting in the LiDAR elevation feature f_s^l. The features f_ss^h and f_s^l can thus be obtained as

f_ss^h = MaxPool(E_sf(Flatten(E_ssf(X_P^H))))
f_s^l = MaxPool(E_sf(MaxPool(E_sf(X_P^L)))).

Hence, the multimodal local spectral–spatial feature f_cnn based on CNN extraction can be calculated via

f_cnn = ω · f_ss^h + (1 − ω) · f_s^l

where ω is a weight coefficient that can be manually adjusted. Herein, MSFE can capture spatial and spectral features from HSI and LiDAR data at different scales.

B. Adaptive Global-Local Feature Extraction
Multiple layers of conventional global transformers [57] are utilized to integrate the multimodal features from the local spectral–spatial feature f_cnn and to perform the initial global spectral feature extraction of AGFE. Assuming that z_0 ∈ R^{N×d} represents the features derived from f_cnn after the tokenization strategy, the processing of z_0 through the ViT encoder is as follows:

z'_l = MHSA(LN(z_{l−1})) + z_{l−1}, l = 1, …, L_1
z_l = FFN(LN(z'_l)) + z'_l, l = 1, …, L_1

where L_1 represents the depth of the conventional transformer encoder. Denote by f_vit the spectral sequence attribute feature maps obtained after L_1 conventional transformer blocks. FFN, LN, and MHSA stand for the feedforward network, layer normalization, and multihead self-attention, respectively. The number of heads is empirically set to 4, and the attention of each head is computed as

Attention(Q, K, V) = Softmax(QK^T / √d_k) V

where Q, K, and V are the query, key, and value matrices, respectively, and d_k is the scaling factor. The MHSA employs the same computation to obtain attention scores for each head, then concatenates the scores from the multiple heads and projects them:

MHSA(z) = Concat(head_1, …, head_h) W^O

where h represents the number of attention heads and W^O denotes the projection parameter matrix.
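A minimal NumPy version of the per-head attention and multihead concatenation used in the encoder might look as follows; the projection matrices are random placeholders rather than learned weights, so this illustrates the computation only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mhsa(z, n_heads=4, seed=0):
    """Per-head softmax(Q K^T / sqrt(d_k)) V, concatenation of the
    heads, and an output projection W_O (random placeholder weights)."""
    n, d = z.shape
    d_k = d // n_heads
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
    Q, K, V = z @ Wq, z @ Wk, z @ Wv
    heads = []
    for i in range(n_heads):
        s = slice(i * d_k, (i + 1) * d_k)
        attn = softmax(Q[:, s] @ K[:, s].T / np.sqrt(d_k))  # (N, N) attention
        heads.append(attn @ V[:, s])                        # (N, d_k)
    return np.concatenate(heads, axis=-1) @ Wo              # (N, d)

tokens = np.random.default_rng(1).standard_normal((6, 8))
print(mhsa(tokens).shape)  # (6, 8)
```

Note that each head operates on its own d_k-dimensional slice, which is what later makes discarding individual heads possible.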
While the MHSA mechanism in the conventional transformer can map feature maps into different subspaces to extract global spectral sequence features, the overlap in attention between heads becomes more pronounced as the multilayered transformer deepens. This results in unnecessary information redundancy. Furthermore, because there are substantial variations between feature maps in different spectral bands, the way their long-range dependencies should be managed differs. Therefore, we design an adaptive head selection decision network to learn a usage strategy for the self-attention heads. This network selectively disables specific heads, reducing the model's computational cost and minimizing the processing of redundant information.
Concretely, the decision network consists of a linear layer, a sampling process, and a threshold selection layer Θ(·). The linear layer and sampling process generate a policy probability matrix, and the threshold selection layer sets a probability threshold (0.5) to decide which self-attention heads to keep or discard. For the input z_l at the lth block, the self-attention head usage policy matrix for this block is

π_l = Softmax(Linear(z_l)), l = 1, …, L_2

where L_2 represents the depth of the head selection pooling transformer. Subsequently, the binary decision k (k = 1 means the corresponding head is retained; k = 0, discarded) at the ith head of the lth block is derived as

k = Θ( Softmax( (log π_{l,i} + G_{l,i}) / τ ) )

where G_{l,i} = −log(Exp_{l,i}), in which Exp_{l,i} is sampled from an exponential distribution, and τ controls the sharpness of the output probability distribution [58]. Afterward, the self-attention head selection strategy matrix of the lth block is P_l = [P_{l,1}, …, P_{l,h}]: when P_{l,i} = 1, the ith head at the lth block is retained; when P_{l,i} = 0, it is discarded. Notably, before attention is computed, Q, K, and V are obtained via the pooling operator P(·; φ) to capture spatial relationships between patches in MHSA, aiming to simultaneously consider global spectral and local spatial information:

Q = P(Q̂; φ_Q), K = P(K̂; φ_K), V = P(V̂; φ_V)

where φ = (k, s, p), in which k, s, and p represent the pooling kernel, corresponding stride, and padding, respectively. The operation imposes the constraint s_K ≡ s_V on the pooling operators applied to K and V, while the same padding strategy is utilized to preserve shape.
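The sampling step of the head selection decision can be sketched as below. This is a sigmoid-based simplification of the two-way decision, with the per-head keep logits supplied directly instead of being produced by the learned linear layer; the Gumbel noise G = −log(E), E ~ Exponential(1), and the 0.5 threshold follow the description above.

```python
import numpy as np

def head_selection_policy(logits, tau=1.0, rng=np.random.default_rng(0)):
    """Perturb per-head keep logits with Gumbel noise G = -log(E),
    E ~ Exponential(1), squash with temperature tau, and keep heads
    whose keep probability exceeds the 0.5 threshold (theta)."""
    g = -np.log(rng.exponential(size=np.shape(logits)))  # Gumbel(0, 1) noise
    prob = 1.0 / (1.0 + np.exp(-(np.asarray(logits) + g) / tau))
    return (prob > 0.5).astype(int)  # P_{l,i}: 1 = keep, 0 = discard

keep = head_selection_policy(np.array([4.0, -4.0, 3.5, -3.5]))
print(keep.tolist())  # a 0/1 keep decision per head
```

Because the noise is injected before thresholding, heads with near-zero logits are kept or discarded stochastically during training, while strongly positive or negative logits give stable decisions.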
As shown in Fig. 2, the head selection decision network and pooling operation are inserted into the head selection pooling transformer block. In our model, two head selection strategies are adopted, namely partial discard and complete discard. For partial discard of an attention head, the attention matrix corresponding to that head is replaced by an identity matrix I. The computation of attention in the ith head of the lth block [58] then becomes

head_{l,i} = [ P_{l,i} · Softmax(Q_{l,i} K_{l,i}^T / √d_k) + (1 − P_{l,i}) · I ] V_{l,i}.

Regarding complete discard, the entire head is removed from the MHSA mechanism and does not participate in the computation of self-attention for that layer:

MHSA_l(z) = Concat({head_{l,i} | P_{l,i} = 1}) W^O.

In general, the forward propagation process takes the spectral sequence feature f_vit obtained from the conventional global transformer as input and yields the global–local spectral sequence feature, still denoted f_vit, after L_2 head selection pooling transformer blocks. The class token f_vit^cls ∈ R^{1×d} is extracted from f_vit for the subsequent classification task.
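The partial discard rule can be illustrated for a single head as follows: a discarded head's attention matrix becomes the identity, so the head simply passes its value matrix through. This is an illustrative sketch, not the paper's exact implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def head_output(q, k, v, keep):
    """One head under partial discard: keep=1 gives normal scaled
    dot-product attention; keep=0 replaces the attention matrix with
    the identity, so the head passes V through unchanged."""
    d_k = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d_k)) if keep else np.eye(len(q))
    return attn @ v

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((5, 4)) for _ in range(3))
print(np.allclose(head_output(q, k, v, keep=0), v))  # True
```

Keeping the identity pass-through (rather than zeroing the head) preserves the residual flow of information while skipping the attention computation.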

C. Multimodal-Multiscale Feature Fusion Classification
Feature fusion decision classification is employed in MFFC to fully exploit the local spectral–spatial features and the global–local spectral features. Specifically, f_cnn is classified by a CNNs-based head that outputs classification probabilities, consisting of multiscale 2-D CNNs, batch normalization, and ReLU activation layers, followed by adaptive global average pooling, a fully connected layer, and a softmax function. For f_vit^cls, it is fed into a multilayer perceptron and a softmax layer for classification.
Finally, we make a decision classification based on the classification probability vectors of the CNN and ViT branches, denoted as P_cnn and P_vit, to obtain the final probability vector P_f ∈ R^{1×cls}, where the label corresponding to the maximum probability is assigned as the class for that pixel. P_f can be represented as

P_f = λ · P_vit + (1 − λ) · P_cnn

where λ is the weight coefficient for feature fusion decision classification.
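The decision-level fusion is a simple convex combination of the two branch outputs. With the default λ = 0.7 used later in the experiments and made-up probability vectors:

```python
import numpy as np

lam = 0.7                            # decision weight λ from the text
p_vit = np.array([0.1, 0.7, 0.2])    # transformer-branch probabilities (made up)
p_cnn = np.array([0.3, 0.4, 0.3])    # CNN-branch probabilities (made up)

p_f = lam * p_vit + (1 - lam) * p_cnn
label = int(np.argmax(p_f))          # class with maximum fused probability
print(p_f.round(2).tolist(), label)  # [0.16, 0.61, 0.23] 1
```

Because both inputs are probability vectors and the weights sum to one, P_f remains a valid probability vector without renormalization.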

A. Data Description
In the experiments, three commonly used HSI and LiDAR datasets are utilized to evaluate the effectiveness of MHST.
1) Houston2013 Dataset: The Houston2013 dataset [25] includes an HSI and a LiDAR-based DSM, collected by the National Center for Airborne Laser Mapping in June 2012 using the ITRES CASI-1500 imaging sensor over the campus of the University of Houston. The dataset was provided by the IEEE GRSS Data Fusion Competition. The HSI comprises 144 spectral bands covering a wavelength range from 0.38 to 1.05 μm, while the LiDAR data are provided as a single band. Both HSI and LiDAR data share dimensions of 349 × 1905 pixels with a spatial resolution of 2.5 m. The dataset contains 15 categories, with a total of 15 029 ground-truth samples available.
2) Trento Dataset: The Trento dataset comprises HSI and LiDAR data obtained from southern Trento, Italy. The HSI was acquired by the airborne Eagle hyperspectral imaging sensor and consists of 63 spectral bands with a wavelength range from 0.42 to 0.99 μm [44]. The LiDAR data were gathered using the Optech airborne laser topographic mapping (ALTM) 3100EA sensor with one raster. The scene consists of 166 × 600 pixels with a spatial resolution of 1 m. This dataset contains six land-cover types with a total of 30 214 ground-truth samples.
3) MUUFL Dataset: The MUUFL dataset was acquired in November 2010 over the campus of the University of Southern Mississippi Gulf Park, Long Beach, Mississippi, USA. The HSI data were gathered via the ITRES compact airborne spectral imager (CASI-1500) sensor, initially comprising 72 bands. Due to excessive noise, eight spectral bands at the beginning and end of the range were removed, resulting in 64 available spectral channels ranging from 0.38 to 1.05 μm. The LiDAR data were captured by an ALTM sensor, containing two rasters at a wavelength of 1.06 μm. This dataset consists of 53 687 ground-truth pixels, encompassing 11 different land-cover classes.

B. Experimental Setting
The proposed MHST is implemented in the PyTorch framework. The experiments are performed on an Ubuntu 22.04 platform equipped with an Intel i9-13900K CPU, an NVIDIA RTX 4090Ti GPU, and 32 GB of RAM. We use the AdamW optimizer with a learning-rate decay parameter of 0.9 to optimize the network. In the training phase, the batch size and the number of training epochs are set to 64 and 3000, respectively. Considering that the three datasets have different data scales and spatial resolutions, the initial learning rates are set to 8e-4 (Houston2013 dataset), 5e-4 (Trento dataset), and 4e-4 (MUUFL dataset). In terms of model parameter configuration, the depths of the conventional transformer encoder block and the head selection pooling transformer block are set to 5 and 8, respectively, and the initial values of the feature fusion weight coefficient ω and the decision classification weight coefficient λ are set to 0.6 and 0.7, respectively. For head selection, the complete discard strategy is employed unless otherwise specified. Standard cross-entropy is utilized as the loss function.
Moreover, three evaluation indicators are adopted to quantitatively reflect the classification performance of MHST: overall accuracy (OA), average accuracy (AA), and the Kappa coefficient. Tables I, II, and III provide detailed information on the training and testing samples.

C. Performance Comparison
To validate the effectiveness of our proposed framework, we selected several representative HSI and LiDAR joint classification models for comparison, including EndNet [41], FusAtNet [43], MDL-cross [42], S2E [33], HCT-Net [44], and GLT-Net [25]. Among these models, EndNet, FusAtNet, MDL-cross, and S2E are based on deep CNNs architectures, while HCT-Net and GLT-Net utilize CNNs–transformer architectures. The parameters for these models were set according to their respective reference papers and tuned on the same server. Furthermore, the same training and testing samples were used for fair comparison.
Several comparative methods are evaluated through visual comparisons (as shown in Figs. 3–5) and quantitative metrics, e.g., per-class accuracy, OA, AA, and the Kappa coefficient. Tables I–III present the objective classification results of our proposed method and each comparative method on the Houston2013, Trento, and MUUFL datasets, with the best result in each row highlighted in bold.
1) Quantitative Analysis: Tables I–III display the quantitative classification results of the different methods on the three popular HSI and LiDAR datasets. Our MHST achieves the highest classification scores on all three datasets, surpassing the second-best model by approximately 0.5%, 0.3%, and 3% on the Houston2013, Trento, and MUUFL datasets, respectively. By jointly analyzing the characteristics of the land-cover categories and their corresponding classification accuracies, we draw the following conclusions. Although the S2E method exhibits high classification accuracy, surpassing all CNNs-based networks, our approach, based on the combination of the conventional transformer and the head selection pooling transformer in the global feature extraction module, makes full use of global spectral dependencies; as a result, it achieves still higher classification accuracy and the best performance in five of the 15 land-cover categories compared with S2E. 4) The classification strategy based on decision-level multimodal–multiscale feature fusion can further improve classification accuracy. For instance, MHST with the MFFC module surpasses any single feature-level fusion method on the MUUFL dataset, such as MDL-cross; similarly, GLT-Net, which employs a comparable decision-level feature fusion strategy, achieves the second-best results. 5) The selection of self-attention heads in the transformer helps reduce attention to redundant features. Specifically, for "Yellow curb" on the MUUFL dataset, a class with few samples, despite the complexity and clutter near the pixels of this category, the proposed MHST benefits from the embedded head selection decision network and achieves a classification accuracy of up to 92.68%, about 10% and 13% higher than HCT-Net and GLT-Net, respectively. Furthermore, our MHST achieves the highest classification scores in three of the 11 categories. Based on these advantages, we seamlessly embed the proposed modules into an end-to-end framework that simultaneously considers multiscale, local spectral–spatial features, utilizes the MHSA mechanism combined with the head selection decision network to extract nonredundant global spectral information, and adopts decision-level feature fusion classification to improve performance.
2) Visual Comparison and Analysis: Figs. 3–5 illustrate the classification maps of the different methods on the three datasets, where local zoom operations are used for Trento and MUUFL to display the performance differences more clearly. Compared with the other methods, the visualized classification results of MHST are closer to the ground-truth map, yielding better classification performance and smoother classification maps. On the other hand, several methods, such as EndNet, FusAtNet, and MDL-cross, tend to produce more isolated misclassified points. Specifically, almost all other methods incorrectly classify some "vineyard" as "apple trees" on the Trento dataset, as shown in Fig. 4. In the locally enlarged image, for the "ground" class with fewer samples, some methods misclassify it as the geographically adjacent "vineyard." In contrast, the classification results of MHST are almost completely correct, consistent with the results in Table II. For the MUUFL dataset shown in Fig. 5, the proposed method produces clearer classification boundaries. For example, MHST's results are closest to the ground-truth map at the boundary between "road" and "trees," while the boundaries produced by other methods are not only more blurred and difficult to distinguish but also contain more classification errors, e.g., classifying "trees" as "buildings shadow." In conclusion, MHST demonstrates its effectiveness in both quantitative and visual analysis, showcasing strong classification performance.
3) Computational Complexity Analysis: Table IV shows the computational complexity of the different methods, including the trainable parameters in the backpropagation phase and the testing time on the MUUFL dataset. The number of trainable parameters in CNNs–transformer-based networks is higher than in most CNNs-based networks. The EndNet method has the fewest trainable parameters, while FusAtNet uses the most. The proposed model's parameter count and testing time are slightly higher than those of GLT-Net, which has a similar network structure, because we embed multiple head selection pooling transformer blocks after the conventional transformer encoder, increasing the computational cost while improving land-cover classification accuracy. Within an acceptable range of testing time and computational cost, MHST exhibits the best classification performance.

D. Parameters Analysis
1) Weight Coefficient: Different feature fusion weights ω and decision classification weights λ influence classification performance. We set the default values of ω and λ to 0.5, keep the other hyperparameters fixed, and vary ω and λ from 0.1 to 0.9 in fixed increments of 0.1. Figs. 6 and 7 show the different values of ω and λ and the corresponding OA, AA, and Kappa on the three datasets. It can be observed that when ω is less than 0.5, i.e., with fewer HSI features fed in, MHST achieves optimal classification performance on two datasets (Houston2013 and Trento). Meanwhile, when λ exceeds 0.5, indicating that the classification decision relies more on the features extracted by the head selection pooling transformer, the proposed model delivers the best classification results on all three datasets. In addition, when λ is too large or too small, classification performance declines to varying degrees. These observations collectively underscore the importance of integrating fused HSI and LiDAR features for further spectral feature extraction, while also affirming the efficacy of the designed head selection pooling transformer.
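A sweep of the kind described above can be sketched as follows, using a hypothetical scoring function in place of retraining MHST for every (ω, λ) pair; the toy score peaking at (0.6, 0.7) is an assumption for illustration, matching the defaults reported earlier.

```python
import numpy as np

def sweep(score, grid=tuple(np.round(np.arange(0.1, 1.0, 0.1), 1))):
    """Grid-search omega, lambda over {0.1, ..., 0.9} and return the
    pair with the highest score."""
    best = max((score(w, l), w, l) for w in grid for l in grid)
    return float(best[1]), float(best[2])

# toy stand-in for OA as a function of the two weights (hypothetical)
toy_oa = lambda w, l: -((w - 0.6) ** 2 + (l - 0.7) ** 2)
print(sweep(toy_oa))  # → (0.6, 0.7)
```

In practice each grid point would require training and evaluating the model, which is why the paper reports the sweep only at a 0.1 granularity.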

2) Training Samples (TS) Number:
To better validate the robustness and generalizability of MHST, we systematically vary the number of TS and analyze the corresponding trends in overall classification accuracy. Specifically, we randomly select different numbers of samples from each land-cover category for training, and the remaining samples are used for testing. Table V shows the classification performance of MHST under different TS sizes. On all three datasets, the three evaluation metrics exhibit a trend of fluctuating improvement with increasing TS. Taking the Houston2013 dataset as an example, the performance of the proposed network notably improves as the TS increase from 40 to 100, benefiting from additional samples providing more feature and interfeature relationship information. Nevertheless, the performance gain from 100 to 200 samples is only about 20% of the gain observed from 40 to 100 samples. This may be because the model has already captured the most crucial features with 100 TS per category, so further increasing the TS size does not bring the same level of information gain. The occasional marginal reduction in classification performance could be attributed to potential model overfitting at particular training data sizes, leading to fluctuations rather than a consistent increase. In general, with more TS fed in, the proposed MHST is able to extract strong features, improve classification accuracy, and generalize across varying data volumes, resulting in consistently strong classification performance.
3) Depth of Conventional Transformer Encoder: The depth L1 of the conventional transformer encoder affects the model's feature representation ability. As shown in Fig. 8, by varying L1 on the three datasets to explore its impact on classification performance, it can be observed that the proposed model achieves the best classification performance when the encoder depth is 5. In addition, increasing the depth does not necessarily lead to better classification performance: as the depth grows further, the model's classification performance on the three datasets shows a decreasing trend. On the other hand, a conventional transformer encoder that is too shallow may be insufficient to capture complex data patterns and features, which can also degrade model performance. This indicates the importance of determining the optimal encoder depth through experiments to balance representation ability and prevent overfitting.

E. Ablation Studies
Since some trainable parameters are located within the head selection pooling transformer module, we conducted ablation experiments to specifically assess the effectiveness of the head selection decision network and the pooling operation. The experimental results, based on three objective metrics (OA, AA, and Kappa) from the three datasets, are depicted in Fig. 9. The discrepancy in classification evaluation metrics among the datasets signifies the varying advantages of the two modules in feature extraction. For the Houston2013 dataset, with its higher scene complexity, the second variant benefits from the integration of the pooling operation in the transformer: it can focus on global features while also extracting local features, thereby paying more attention to information near the target classified pixels. Consequently, compared with the first variant, which omits the pooling operation, the second variant achieves slightly higher classification accuracy. Conversely, the Trento dataset has lower scene complexity, which limits the local feature information, so the contribution of the local features extracted by the pooling operation to classification accuracy is reduced. Since the head selection decision network enhances the global feature extraction capability, the first variant achieves a higher score than the second variant on this dataset, and its score is only slightly lower than that of using both modules together.
For the MUUFL dataset, using only the head selection decision network or only the pooling operation leads to lower classification accuracy, by around 1.3%, compared with the combined use of both modules. Overall, the experimental findings and objective analysis suggest that both the head selection decision network and the pooling operation have a positive impact on land-cover classification.

V. CONCLUSION
This article focuses on the detailed and nonredundant features in the spatial and spectral dimensions for efficient HSI and LiDAR classification. The MSFE module effectively accounts for the overall patterns and intricate local attributes of land covers at various scales. The adaptive global feature extraction module adaptively selects the heads in the transformer to avoid the feature redundancy caused by the participation of all heads. Furthermore, we validated the effectiveness of MHST under different feature fusion ratios and verified the performance of the proposed MHST from multiple dimensions, such as different TS and ablation experiments. Nevertheless, some details of MHST deserve improvement. For example, structural features are crucial for HSI and LiDAR data classification, so one key focus of future work is how to fully utilize the selected structural features of HSI. Specifically, we will further explore how to improve the head selection decision network to generate a more effective head selection strategy, reasonably allocate weights to the retained heads, and let the model further select the retained nonredundant features after filtering out redundant ones. Simultaneously, since word tokens carry more specific information, the model could pay more detailed attention to local features. Therefore, another main research direction in the future is to explore how to more effectively integrate word tokens in order to optimize the model's understanding of HSI and LiDAR data.
Due to the distinct spectral reflection information of the same land cover in different bands, these bands provide unique and complementary information for land-cover classification. Building upon this, we propose multiscale 3-D CNNs designed to simultaneously consider multiple correlated spectral bands during multiscale local spatial feature extraction, allowing the model to comprehensively capture the rich spectral information in HSI data and thereby improving classification accuracy. In particular, multiscale 3-D CNNs with four different levels of
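The multiscale spectral idea above can be sketched as follows. This is an illustrative simplification, not the paper's exact architecture: the four spectral extents (1, 3, 5, 7), the fixed averaging kernels (standing in for learned 3-D convolution filters), and the function name are assumptions. It only shows how kernels of different spectral depths over the same HSI patch yield a stack of multiscale responses:

```python
import numpy as np

def multiscale_spectral_conv(cube, depths=(1, 3, 5, 7)):
    """Convolve the spectral axis of an HSI patch (H, W, B) with
    averaging kernels of several odd spectral extents and stack the
    per-scale responses along a new leading axis."""
    h, w, b = cube.shape
    feats = []
    for d in depths:
        pad = d // 2
        # 'same'-style padding along the spectral axis only.
        padded = np.pad(cube, ((0, 0), (0, 0), (pad, pad)), mode="edge")
        kernel = np.ones(d) / d  # placeholder for a learned filter
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="valid"), 2, padded)
        feats.append(out)
    return np.stack(feats, axis=0)  # shape: (len(depths), H, W, B)
```

A wider spectral extent aggregates more correlated neighboring bands per output feature, which is the intuition behind using several extents in parallel.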

Fig. 2. Illustration of the head selection pooling transformer block. We insert a head selection decision network before each vision transformer block. For the input initial features extracted by multiple layers of conventional transformer blocks, the decision network generates a usage strategy for self-attention heads. These instance-specific usage strategies aim to reduce the processing of redundant information and lower computational costs. Simultaneously, we utilize the pooling operator P(·; φ) in the multihead self-attention mechanism to capture relationships between patches, aiming to consider both global spectral and local spatial information. See the text for further details.
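The mechanism in Fig. 2 can be sketched in simplified form. This is a hedged illustration under several assumptions: the head gate is given as a precomputed 0/1 vector (in the paper it comes from the learned decision network), the pooling operator P(·; φ) is approximated by parameter-free average pooling over the token axis, and the gated head's output is simply zeroed rather than skipped in an optimized kernel:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def head_selection_pooling_attention(x, wq, wk, wv, head_gate, pool=2):
    """Gated multihead self-attention with pooled keys/values.
    x: (n, d) tokens; wq/wk/wv: (d, d); head_gate: (h,) 0/1 vector
    from the decision network; pool: stride of average pooling."""
    n, d = x.shape
    h = head_gate.shape[0]
    dh = d // h
    q = (x @ wq).reshape(n, h, dh)
    # Pooling operator P(.; phi), here plain average pooling: it
    # shortens the key/value sequence to capture patch-level context.
    kv = x.reshape(n // pool, pool, d).mean(axis=1)
    k = (kv @ wk).reshape(-1, h, dh)
    v = (kv @ wv).reshape(-1, h, dh)
    out = np.zeros((n, h, dh))
    for i in range(h):
        if head_gate[i] == 0:
            continue  # redundant head: attention not computed at all
        attn = softmax(q[:, i] @ k[:, i].T / np.sqrt(dh))
        out[:, i] = attn @ v[:, i]
    return out.reshape(n, d)
```

Skipping gated-off heads is where the computational saving comes from: only the retained heads pay for the attention matrix, and the pooled keys/values shrink that matrix further.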

TABLE I. COMPARISON OF CLASSIFICATION PERFORMANCES OBTAINED BY DIFFERENT METHODS FOR HOUSTON2013 DATA

TABLE II. COMPARISON OF CLASSIFICATION PERFORMANCES OBTAINED BY DIFFERENT METHODS FOR TRENTO DATA

TABLE III. COMPARISON OF CLASSIFICATION PERFORMANCES OBTAINED BY DIFFERENT METHODS FOR MUUFL DATA

TABLE IV. COMPARISON OF COMPUTATIONAL COMPLEXITY AND TESTING TIME (S) OF DIFFERENT METHODS ON THE MUUFL DATA

TABLE V. IMPACT OF THE NUMBER OF TS PER CLASS ON MHST