Hybrid Fully Connected Tensorized Compression Network for Hyperspectral Image Classification

Deep learning models, such as convolutional neural networks (CNNs), have made significant progress in hyperspectral image (HSI) classification. However, these models require a large number of parameters, which occupy substantial storage space and are prone to overfitting, thus resulting in performance loss. To solve the above problems, in this article, we propose a new compression network [namely, a Hybrid Fully Connected Tensorized Compression Network (HybridFCTCN)] by considering the high dimensionality of HSI data. First, using the low-rank fully connected tensor network decomposition (FCTND), three novel units, i.e., FCTN-FC, FCTNConv2D, and FCTNConv3D, are designed to compress the weight tensor of the standard fully connected (FC) layer and the kernel tensor of the convolutional layer, reducing their parameters. In the novel units, the intrinsic correlation of the decomposed factors is adequately exploited by the FC structures, which enhances their feature extraction and classification abilities. Then, benefiting from the hybrid network backbone composed of the FCTNConv3D and FCTNConv2D units, HybridFCTCN can extract more discriminative features with fewer parameters, while it has great generalization capability and robustness, enabling better HSI classification. Finally, the rank of the above-designed units is defined, and its determination is discussed to facilitate the application of the proposed model. Extensive experiments on three widely used HSI datasets reveal that the proposed model achieves state-of-the-art classification performance for different training sample sizes with a very small number of parameters.


I. INTRODUCTION
Hyperspectral images (HSIs) contain very rich spatial-spectral information, thus providing a substantial opportunity to explore and understand earth surface characteristics. Among existing HSI processing technologies, classification is by far the most well-established field and has been widely applied to various scenarios in Earth science, such as monitoring coastal wetlands [1], precision agriculture [2], water quality analysis [3], and other fields [4], [5].
Most HSI classification methods stem from computer vision [6] and, in the early stages, were mainly based on traditional machine learning techniques [7]. For example, the k-means [8], iterative self-organizing data analysis technique (ISODATA) [9], and fuzzy c-means (FCM) [10] algorithms were first adopted due to their fast and unsupervised nature. Subsequently, some supervised algorithms, such as support vector machine (SVM) [11], random forest (RF) [12], sparse representation classifier (SRC) [13], [14], and k-nearest neighbor (KNN) [15], generally perform better by leveraging the labeled samples. However, these traditional classification methods are usually constructed with handcrafted features, whose design is dependent on human proficiency, thus limiting their classification performance under different scenarios.
Deep learning (DL) methods, which automatically learn discriminative features, have recently made significant breakthroughs in various fields, such as face recognition [16], fault detection [17], and posture estimation [18]. This has stimulated interest among researchers in the community of HSI classification. Representative DL methods include convolutional neural networks (CNNs) [19], recurrent neural networks (RNNs) [20], [21], graph convolutional networks (GCNs) [22], [23], and capsule networks (CapsNets) [24], among which CNNs have been widely applied to HSI classification. Hu et al. [25] and Chen et al. [26] first employed CNNs for HSI classification in the spectral, spatial, and spatial-spectral domains. In [27], different portions of HSIs were fed into the diverse region-based 2-D-CNN to identify the multiscale contextual interactional features. A 3-D-CNN that implements a border mirroring strategy to learn joint spatial-spectral information was proposed in [28]. Li et al. [29] proposed a two-stream model to extract the spectral, local spatial, and global spatial features simultaneously. Roy et al. [30] constructed a hybrid spectral CNN (HybridSN), which combines the advantages of both 3-D-CNN and 2-D-CNN, strengthening spatial feature learning. Meanwhile, some residual structures and attention blocks were integrated into the standard CNNs to improve feature extraction and classification performance. In [31], a spatial-spectral residual network (SSRN) made use of residual connections to alleviate the decline in classification accuracy caused by deeper networks. The work in [32] constructed a pyramidal residual network by stacking the bottleneck residual units with the gradually increased feature channels. Li et al.
[33] proposed a two-stream spectral feature fusion network, where two branches generate local and interlocal spectral correlation features that are adaptively integrated via dual-channel attention and decision fusion to achieve better classification results. In [34], a double-branch dual-attention (DBDA) mechanism network was presented, which adds some attention blocks after the spectral and spatial branches (separately) to adaptively optimize the feature maps. By using both residual and attention blocks, a residual spatial-spectral attention network (RSSAN) was further proposed in [35]. In addition, other DL networks containing convolution operations are also utilized for HSI classification. Hu et al. [36] developed a spatial-spectral ConvLSTM 2-D neural network (SSCL2DNN), which can model the long-range dependency of the spectral dimension for feature extraction. By introducing convolutional capsule layers and the maximum correntropy criterion, a 3-D CapsNet [37] alleviated the influence of noise and outliers on HSI classification, yielding more robust performance. Although these DL-based models have achieved satisfactory performance in HSI classification, they have a plethora of trainable parameters and subsequently require large storage costs, and their training is more prone to overfitting.
Recently, some efforts have been made to construct efficient structures for network compression to solve the aforementioned problems. A fast and compact 3-D-CNN with few parameters was developed in [38]. Some efficient convolution operations have been explored to reduce the number of network parameters. In [39], multikernel depthwise convolution and group convolution were utilized for lightweight feature fusion. A lightweight model was designed by replacing the standard convolution with a depthwise convolution and a pointwise convolution in an effort to reduce the complexity of the whole model [40]. Based on this work, LiteSCANet [41] improved the computational efficiency by using a residual double-branch structure for HSI classification. Cui et al. [42] also considered depthwise separable convolution and decreased the number of channels in the feature maps to reduce model complexity. Meanwhile, designing different operators to replace the convolution kernels is another popular way. For instance, a random patches network (RPNet) [43] directly utilized random patches from HSIs as convolution kernels to extract hierarchical features. The ESSINet [44] featured a new lightweight involution kernel on channel interactions and incorporated spatial information via a dual-pooling layer. Although the convolutional layers have been compressed by these compact structures, the fully connected (FC) layers still have a large number of parameters [45].
Differing from the above studies, tensor decomposition is an alternative for network compression. By exploiting low rankness, the classical Tucker and CANDECOMP/PARAFAC (CP) decompositions have been successfully applied to tensor completion [46], [47] and HSI denoising [48], and they have also been used to approximate convolution kernel tensors for network compression [49], [50]. However, these works follow the mapping approach, which is more suitable for model acceleration and has poor compression performance. Hence, Novikov et al. [51] adopted the tensorized approach and reformulated the weight matrices of FC layers in the tensor train (TT) [52] format, which reduces a large number of parameters in FC layers while maintaining their expression ability. Inspired by Novikov et al. [51], Garipov et al. [53] reshaped 2-D convolution kernels into high-order tensors and factorized them by TT decomposition (TTD), and then, Wang et al. [54] extended this work to the 3-D form, achieving a better compression effect. Finally, TTD has been introduced to HSI classification for network compression. Hu et al. [55] proposed a spatial-spectral TT-ConvLSTM 2-D neural network (SSTTCL2DNN), which only lightweights the convolution kernels of ConvLSTM, without dealing with the FC layers.
This model illustrates the compression effectiveness of tensor decomposition but sacrifices classification accuracy due to the limited expressive power of TTD. At the same time, the tensor ring decomposition (TRD) [56] was introduced as a general form of TTD for compressing the convolutional and FC layers, which relaxes the condition over the rank and has an enhanced representation ability [57]. Nevertheless, these two decompositions have a limited ability to characterize the correlations within tensors and are sensitive to the permutation of tensor modes. Unlike the aforementioned tensor decompositions, the FC tensor network decomposition (FCTND) [58] has an FC structure and breaks through the limitations of TTD and TRD. In fact, various tensor decompositions were compared in [59] and [60], and we find that FCTND can better recover the original data from a small amount of data in tensor completion and exhibits stronger representation ability.
In this article, we make the first attempt to use FCTND for the purpose of network compression. Specifically, we newly construct the FCTND-based FC and convolutional units, with which a new Hybrid FC Tensorized Compression Network (HybridFCTCN) is proposed to fully capture the joint spatial-spectral information (with lower storage requirement) for improving the HSI classification performance. The main contributions of this work are listed as follows.
1) By exploiting the low-rank FCTND, three novel units, i.e., FCTN-FC, FCTNConv2D, and FCTNConv3D, are designed to compress the weight tensor of the standard FC layer and the kernel tensor of the convolutional layer, in which the FC structure can make better use of the intrinsic correlation between any two factors to enhance feature extraction and classification abilities.
2) HybridFCTCN adopts the hybrid structure of spatial-spectral feature learning with FCTNConv3D followed by spatial feature enhancement with FCTNConv2D. As such, the proposed model can not only learn highly discriminative features but also achieve a more lightweight network, thus being less prone to overfitting under small sample sizes and obtaining state-of-the-art classification performance.
3) In terms of the existence of FCTND with equal rank, the rank of the FCTND-based units is defined, and then, rank determination in different layers of HybridFCTCN is discussed, thus facilitating its practical application.
The remainder of this article is organized as follows. Section II introduces some background information about tensors and presents three tensor decompositions. Section III describes three novel compression units (FCTN-FC, FCTNConv2D, and FCTNConv3D), the architecture of the proposed model, and the determination of the ranks in HybridFCTCN. Section IV reports the obtained experimental results. Section V concludes this article with some remarks.

II. TENSOR PRELIMINARIES AND TENSOR DECOMPOSITIONS
In this section, some preliminaries including notations and tensor operations, as well as related tensor decompositions, are concisely presented for the self-contained purpose.
A. Preliminaries
1) Notations: Throughout this article, scalars, vectors, matrices, and tensors are denoted by italic letters x, bold lowercase letters x, bold capital letters X, and calligraphic letters X, respectively. Fig. 1 shows an illustration of their tensor diagrams (i.e., a graphical representation of multidimensional data), where the number of edges extending from a node denotes the tensor order.
2) Tensor Operations: Two basic tensor operations, tensor contraction and tensor convolution, are illustrated in Fig. 2, with different line styles distinguishing the two operations. Specifically, the tensor contraction operation is the process of removing the matching dimensions between two tensors, represented by the solid line in Fig. 2(a). For example, by performing the tensor contraction operation between a three-order tensor M ∈ R^(I_0×I_1×I_2) and another three-order tensor N ∈ R^(J_0×J_1×J_2) with I_2 = J_1, we obtain a four-order tensor X ∈ R^(I_0×I_1×J_0×J_2). Fig. 2(b) gives the tensor convolution operation with the dashed line, under which the convolution of two three-order tensors generates a five-order tensor.
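As a concrete illustration, both operations from Fig. 2 can be reproduced with NumPy; the sizes below are hypothetical and chosen only to make the shapes visible.

```python
import numpy as np

# Tensor contraction (Fig. 2(a)): contracting mode 2 of M with mode 1 of N
# (I2 = J1 = 4) removes the matching dimension and yields a four-order tensor.
I0, I1, I2, J0, J2 = 2, 3, 4, 5, 6
M = np.random.randn(I0, I1, I2)
N = np.random.randn(J0, I2, J2)              # J1 = I2
X = np.einsum('abi,cid->abcd', M, N)
assert X.shape == (I0, I1, J0, J2)

# Tensor convolution (Fig. 2(b)): convolving one mode of A with one mode of B
# leaves 2 + 2 free modes, so two three-order tensors give a five-order tensor.
A = np.random.randn(2, 3, 4)                 # convolve along the last mode of A
B = np.random.randn(5, 4, 6)                 # with the middle mode of B
C = np.zeros((2, 3, 5, 6, 4 + 4 - 1))        # 'full' convolution length
for a in range(2):
    for b in range(3):
        for c in range(5):
            for d in range(6):
                C[a, b, c, d, :] = np.convolve(A[a, b, :], B[c, :, d])
assert C.shape == (2, 3, 5, 6, 7)
```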
B. Tensor Decompositions
1) Tensor Train Decomposition [52]: As depicted in Fig. 3, TTD factorizes a large-sized tensor into a set of sequentially connected small-sized tensors, where the side factors are matrices, and the others are three-order tensors. The mathematical expression of TTD can be written as follows:

X(l_1, l_2, ..., l_d) = Σ_{r_1=1}^{R_1} ··· Σ_{r_{d-1}=1}^{R_{d-1}} G^(1)(l_1, r_1) G^(2)(r_1, l_2, r_2) ··· G^(d)(r_{d-1}, l_d)    (1)

where X ∈ R^(L_1×···×L_d) stands for the large-sized tensor, G^(1) ∈ R^(L_1×R_1) and G^(d) ∈ R^(R_{d-1}×L_d) are the side matrices, G^(k) ∈ R^(R_{k-1}×L_k×R_k) (k = 2, ..., d-1) are the three-order factors, and (R_1, R_2, ..., R_{d-1}) is the rank of TTD.
2) Tensor Ring Decomposition [56]: TRD links the side factors of TTD to construct a ring-like form, whose sketch map is shown in Fig. 4. It can be regarded as a linear combination of TTD, having the properties of good representation and cyclic invariance. Given a d-order tensor X, its decomposition form is expressed as

X(l_1, l_2, ..., l_d) = Σ_{r_0=1}^{R_0} Σ_{r_1=1}^{R_1} ··· Σ_{r_{d-1}=1}^{R_{d-1}} G^(1)(r_0, l_1, r_1) G^(2)(r_1, l_2, r_2) ··· G^(d)(r_{d-1}, l_d, r_0)    (2)

where G^(k) ∈ R^(R_{k-1}×L_k×R_k) (k = 1, 2, ..., d) are three-order factors, R_0 = R_d = R, and R > 1.
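Both decompositions can be checked numerically; the sketch below rebuilds a small four-order tensor from randomly initialized TT cores and TR cores (all ranks set to 2, a hypothetical choice), matching the chain and ring contraction patterns of Figs. 3 and 4.

```python
import numpy as np

# TT: side factors are matrices, inner cores are three-order tensors.
L = (2, 3, 4, 5)
G1 = np.random.randn(L[0], 2)            # (L1, R1)
G2 = np.random.randn(2, L[1], 2)         # (R1, L2, R2)
G3 = np.random.randn(2, L[2], 2)         # (R2, L3, R3)
G4 = np.random.randn(2, L[3])            # (R3, L4)
X_tt = np.einsum('ai,ibj,jck,kd->abcd', G1, G2, G3, G4)
assert X_tt.shape == L

# TR: the chain is closed, so every core is three-order and the border
# rank R0 = R4 = R > 1 is summed (traced) out.
C1 = np.random.randn(2, L[0], 2)         # (R0, L1, R1)
C2 = np.random.randn(2, L[1], 2)
C3 = np.random.randn(2, L[2], 2)
C4 = np.random.randn(2, L[3], 2)         # (R3, L4, R0)
X_tr = np.einsum('rai,ibj,jck,kdr->abcd', C1, C2, C3, C4)
assert X_tr.shape == L
```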
3) Fully Connected Tensor Network Decomposition [58]: As shown in Fig. 5, a large d-order tensor can be decomposed into a series of small d-order factors that build links between each other through FCTND. The advantages of this decomposition lie in the outstanding ability to characterize the intrinsic correlation between any two factors G^(k_1) and G^(k_2), and the essential invariance for transposition. The mathematical expression of FCTND is written as follows:

X(l_1, l_2, ..., l_d) = Σ_{r_{1,2}=1}^{R_{1,2}} Σ_{r_{1,3}=1}^{R_{1,3}} ··· Σ_{r_{d-1,d}=1}^{R_{d-1,d}} Π_{k=1}^{d} G^(k)(r_{1,k}, ..., r_{k-1,k}, l_k, r_{k,k+1}, ..., r_{k,d})    (3)

where G^(k) ∈ R^(R_{1,k}×···×R_{k-1,k}×L_k×R_{k,k+1}×···×R_{k,d}) (k = 1, 2, ..., d) are the small d-order factors and (R_{1,2}, R_{1,3}, ..., R_{d-1,d}) is the rank of FCTND. With the FC structure, FCTND can address the problem that TTD and TRD are highly sensitive to the transposition of tensor modes caused by only connecting adjacent (rather than arbitrary) two factors. Note that FCTND can be degraded into TRD and TTD by setting certain ranks to 1.
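The FC structure is easiest to see at d = 3, where every pair of factors shares a rank edge (R_{1,2}, R_{1,3}, R_{2,3}); the sketch below uses hypothetical mode sizes and a uniform rank of 2.

```python
import numpy as np

# FCTND for d = 3: each factor is itself three-order and carries one
# rank edge to every other factor, unlike TT/TR where only neighbors link.
L1, L2, L3 = 3, 4, 5
R12 = R13 = R23 = 2
G1 = np.random.randn(L1, R12, R13)       # (L1, r12, r13)
G2 = np.random.randn(R12, L2, R23)       # (r12, L2, r23)
G3 = np.random.randn(R13, R23, L3)       # (r13, r23, L3)
X = np.einsum('aij,ibk,jkc->abc', G1, G2, G3)
assert X.shape == (L1, L2, L3)
# Setting the non-adjacent edge R13 = 1 degrades this FCTND to a TR-like chain.
```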

III. HYBRIDFCTCN
This section first illustrates the compression of the FC and convolutional layers based on FCTND. Then, with three newly designed units, the whole framework of the proposed Hybrid-FCTCN model (as shown in Fig. 6) for HSI classification is described in detail. Finally, the determination of the ranks in HybridFCTCN is discussed.

A. Design of FCTN-FC
In an FC layer, the input feature map x ∈ R^I is transformed into the output vector y ∈ R^O via a large weight matrix W ∈ R^(I×O). The mathematical expression is defined as

y = W^T x.    (4)

Under the assumption of low rankness, W can be tensorized and then decomposed to reduce model redundancy. Tensorizing x, y, and W into their high-order formats X ∈ R^(I_1×···×I_m), Y ∈ R^(O_1×···×O_n), and W ∈ R^(I_1×···×I_m×O_1×···×O_n), where I = I_1 ··· I_m and O = O_1 ··· O_n, (4) can be rewritten as

Y(o_1, ..., o_n) = Σ_{i_1, ..., i_m} W(i_1, ..., i_m, o_1, ..., o_n) X(i_1, ..., i_m).    (5)

Using FCTND, the weight tensor W is decomposed into the form of (3) to compress the FC layer, with (m + n) factors G^(1), G^(2), ..., G^(m+n) multiplied in sequence

W = FCTND(G^(1), G^(2), ..., G^(m+n)).    (6)

As a result, (5) can be replaced with its decomposed version

Y(o_1, ..., o_n) = Σ_{i_1, ..., i_m} FCTND(G^(1), ..., G^(m+n))(i_1, ..., i_m, o_1, ..., o_n) X(i_1, ..., i_m).    (7)

Fig. 7(a) gives a graphical representation of the above operation, where the input vector x is first reshaped into a high-order tensor X, then multiplied by the input and output factors of the weight tensor, and, finally, transformed into the output vector y. For convenience, we name the above tensorized FC layer the FCTN-FC unit. Compared to the original FC layers, the number of parameters in the FCTN-FC units is reduced, with the compression ratio calculated as

C_FC = (I · O) / (Σ_{k=1}^{m+n} L_k Π_{j<k} R_{j,k} Π_{j>k} R_{k,j})    (8)

where C_FC stands for the complexity saving in parameters, and L_k runs over the factorized dimensions I_1, ..., I_m, O_1, ..., O_n.
Here, the compression effect of the FCTN-FC unit correlates inversely with its ranks; the smaller the ranks, the higher the compression ratio. To achieve a better compression effect, the ranks of the FCTN-FC units should be appropriately set to a small number in the experiments.
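A minimal sketch of an FCTN-FC forward pass, assuming m = n = 2 and all ranks equal to 2 (the shapes here are hypothetical, not the paper's settings): the input vector is reshaped into a tensor, contracted against the four FCTND factors, and flattened back, which matches the uncompressed layer built from the same factors.

```python
import numpy as np

# FCTN-FC sketch: I = 4*3 inputs, O = 2*5 outputs, weight tensor W in
# R^(4x3x2x5) stored as four FCTND factors with a uniform rank of 2.
rng = np.random.default_rng(0)
I1, I2, O1, O2, R = 4, 3, 2, 5, 2
G1 = rng.standard_normal((I1, R, R, R))      # (I1, r12, r13, r14)
G2 = rng.standard_normal((R, I2, R, R))      # (r12, I2, r23, r24)
G3 = rng.standard_normal((R, R, O1, R))      # (r13, r23, O1, r34)
G4 = rng.standard_normal((R, R, R, O2))      # (r14, r24, r34, O2)

x = rng.standard_normal(I1 * I2)
X = x.reshape(I1, I2)
# Contract the input tensor with all factors (einsum picks the order).
Y = np.einsum('ab,aijk,iblm,jlcn,kmnd->cd', X, G1, G2, G3, G4)
y = Y.reshape(O1 * O2)

# Cross-check against the uncompressed layer y = W^T x.
W = np.einsum('aijk,iblm,jlcn,kmnd->abcd', G1, G2, G3, G4).reshape(I1 * I2, O1 * O2)
assert np.allclose(y, x @ W)

# Parameter saving: the factors store fewer entries than the dense weight.
n_fctn = sum(g.size for g in (G1, G2, G3, G4))
assert n_fctn < W.size
```

At realistic layer sizes the gap between the factor storage and the dense weight grows rapidly, since each factor scales only linearly in its mode size.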

B. Design of FCTNConv
To reduce the trainable parameters in a 2-D convolutional layer, the four-order convolution kernel K ∈ R^(L×L×I×O) is directly decomposed by the FCTND and then can be mathematically expressed as

K = FCTND(S, P, Q)    (9)

where S, P, and Q stand for the spatial, input channel, and output channel factors of K. Particularly, the two dimensions with size L in the convolution kernel K are retained in the factor S to ensure spatial feature extraction. After substituting (9) into the original 2-D convolution kernel, the approximate evaluation of the standard convolutional layer can be expressed by the following three consecutive steps:

U = TensorContraction(X, P)    (10)
V = U ∗ S    (11)
Y = TensorContraction(V, Q)    (12)

where (10) is a tensor contraction operation between the input tensor X ∈ R^(H×W×I) and the input channel factor P; (11) describes a 2-D convolution operation between the intermediate feature U and the spatial factor S; and (12) is also a tensor contraction operation to obtain the final output tensor Y.

Algorithm 1 FCTNConv2D Unit
1: P ← G^(1)
2: for i = 2 to m do
3:    P ← TensorContraction(P, G^(i))
4: end for
5: Increase the feature channels via (10)
6: Extract the features via (11)
7: Q ← G^(m+1)
8: for j = m + 2 to m + n do
9:    Q ← TensorContraction(Q, G^(j))
10: end for
11: Reduce the feature channels via (12)
12: return Y

Fig. 6. Graphical illustration of the proposed HybridFCTCN model. HybridFCTCN is composed of data preprocessing, spatial-spectral feature learning, spatial feature enhancement, and HSI classification. Data preprocessing includes PCA and normalization. Spatial-spectral feature learning contains three 3-D units in which the features with the spatial-spectral structure (red block) are propagated. Spatial feature enhancement has an FCTNConv2D unit where the features are propagated in the spatial structure (yellow block). For HSI classification, a GAP layer, two FCTN-FC units, and a softmax layer are utilized to predict the sample attribute.
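The three consecutive steps can be sketched in PyTorch with already-recovered factors; the factor shapes P (I × R_p), S (L × L × R_p × R_q), and Q (R_q × O) and all sizes below are illustrative assumptions, not the paper's exact parameterization.

```python
import torch
import torch.nn.functional as F

# Hypothetical recovered factors of a 3x3 kernel with 16 -> 32 channels.
B, H, W, I, O, L, Rp, Rq = 1, 9, 9, 16, 32, 3, 2, 2
x = torch.randn(B, I, H, W)
P = torch.randn(I, Rp)
S = torch.randn(L, L, Rp, Rq)
Q = torch.randn(Rq, O)

# Step (10): contract over the input channels (acts like a 1x1 convolution).
u = torch.einsum('bihw,ir->brhw', x, P)
# Step (11): standard 2-D convolution with the small spatial factor S.
v = F.conv2d(u, S.permute(3, 2, 0, 1))       # conv weight (Rq, Rp, L, L)
# Step (12): contract over the rank channels to recover O output channels.
y = torch.einsum('brhw,ro->bohw', v, Q)
assert y.shape == (B, O, H - L + 1, W - L + 1)
```

Only the tiny factor S enters the sliding-window convolution, which is where the parameter and compute saving of the unit comes from.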
We name this unit FCTNConv2D, whose sketch map is shown in Fig. 7(b) (the solid line denotes the tensor contraction operation, while the dashed line denotes the tensor convolution operation) and whose computation scheme is described in Algorithm 1. To be specific, the channel factors P and Q are recovered from the decomposed factors by tensor contraction operations, and then, they are used together with the spatial factor S to extract features from the input tensor X. According to the design of the FCTNConv2D unit, its capability in compression can be expressed as

C_Conv2D = (L^2 · I · O) / (Σ_{k=0}^{m+n} L_k Π_{j<k} R_{j,k} Π_{j>k} R_{k,j})    (13)

where L_0 = L^2 corresponds to the spatial factor S, and L_1, ..., L_{m+n} correspond to the factorized input and output channel dimensions. With the purpose of further extracting the spatial-spectral features for HSI classification, the FCTNConv2D unit is extended to a 3-D form called the FCTNConv3D unit. The difference between these two units lies in the data dimensionality. In other words, the uncompressed convolution kernel K_3D ∈ R^(L×L×T×I×O) has the spectral dimension T, which is added to the factor S of the FCTNConv2D unit to construct S_3D in the FCTNConv3D unit. The mathematical expression of FCTNConv3D is given by

K_3D = FCTND(S_3D, P, Q)    (14)

where the spatial-spectral factor S_3D retains the two spatial dimensions of size L and the spectral dimension of size T.
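To make the compression capability concrete, the count below assumes the full FCTND topology with a uniform rank R, where each factor stores L_k · R^(d-1) entries (the exact edges retained in the paper's units may differ); the mode sizes follow the FCTNConv2D unit used later for Indian Pines (kernel 3 × 3 × 576 × 256 with I factorized as 8 × 8 × 9 and O as 4 × 8 × 8).

```python
# Parameter count of an FCTND with uniform rank R: each of the d factors
# keeps its mode size L_k plus one rank edge to every other factor.
R = 2
mode_sizes = [3 * 3, 8, 8, 9, 4, 8, 8]   # spatial factor + channel factors
d = len(mode_sizes)
compressed = sum(L_k * R ** (d - 1) for L_k in mode_sizes)
dense = 3 * 3 * 576 * 256                # uncompressed 2-D kernel
print(compressed, dense, dense / compressed)
```

Even at rank 2 the factorized unit stores a few thousand values in place of over a million, which is the regime where the compression ratio (13) becomes large.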

C. Framework of the Proposed HybridFCTCN
HSI contains hundreds of continuous and narrowband image data, which can be naturally represented by a three-order tensor. To completely and efficiently explore both spatial and spectral information, a lightweight tensorized neural network model, i.e., HybridFCTCN, is proposed for HSI classification. The overall network architecture of HybridFCTCN is shown in Fig. 6, which mainly includes four modules: data preprocessing, spatial-spectral feature learning, spatial feature enhancement, and classification. In particular, the underlying hybrid structure can integrate the complementary information provided by FCTNConv3D and FCTNConv2D for highly discriminative feature extraction.
In the data preprocessing, principal component analysis (PCA) is utilized for spectral redundancy reduction. For each pixel, its neighboring patch is extracted as the spatial-spectral information and used as the input of the proposed model. In order to facilitate understanding, we take the Indian Pines dataset as an example. The input sample of HybridFCTCN has a size of (15 × 15 × 30, 1), where the first two numbers in the parentheses are the spatial dimensions, the third denotes the spectral dimension, and the last represents the number of feature channels.
The spatial-spectral feature learning module consists of a standard convolutional layer and two FCTNConv3D units. Initially, the above input sample is fed into a standard 3-D convolutional layer with a size of (3 × 3 × 7 × 1 × 8), where the numbers in the parentheses sequentially correspond to the length, width, and height of the kernel, as well as the number of input and output channels. Then, the output passes through the first FCTNConv3D unit with a size of (3 × 3 × 5 × 8 × 16), in which the number of input channels is factorized into (2 × 4) and the output into (4 × 4). For the second FCTNConv3D unit with a size of (3 × 3 × 3 × 16 × 32), its output channels are decomposed into (4 × 8), and the obtained output is (9 × 9 × 18, 32). In each FCTNConv3D unit, the factor S 3D contains both spatial and spectral dimensions to extract the spatial-spectral features via 3-D convolution operations, thus maintaining the completeness of HSI data in the network. Since the FCTNConv3D unit has more feature channels, by stacking two such units, the network can extract features with richer high-level semantic information.
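The shape walk-through above can be verified with plain Conv3d layers standing in for the FCTNConv3D units (only the tensor shapes are being checked, not the factorized weights; PyTorch orders the kernel as depth × height × width, so the paper's 3 × 3 × 7 kernel becomes kernel_size=(7, 3, 3)).

```python
import torch
import torch.nn as nn

# Input patch (15 x 15 x 30, 1): (batch, channels, bands, height, width).
x = torch.randn(1, 1, 30, 15, 15)
conv1 = nn.Conv3d(1, 8, kernel_size=(7, 3, 3))    # standard 3-D conv layer
conv2 = nn.Conv3d(8, 16, kernel_size=(5, 3, 3))   # FCTNConv3D stand-in
conv3 = nn.Conv3d(16, 32, kernel_size=(3, 3, 3))  # FCTNConv3D stand-in
y = conv3(conv2(conv1(x)))
# Bands: 30 -> 24 -> 20 -> 18; spatial: 15 -> 13 -> 11 -> 9.
assert tuple(y.shape) == (1, 32, 18, 9, 9)        # the (9 x 9 x 18, 32) feature
```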
Considering that spatial information is important for HSI classification, an FCTNConv2D unit is followed by the spatial-spectral feature learning module to construct a hybrid network structure for extracting more discriminative spatial features. After merging the spectral and channel dimensions from the output of the second FCTNConv3D unit, the feature is fed into the FCTNConv2D unit, where the kernel size is (3 × 3 × 576 × 256) and the dimensions of the input and output channels are decomposed into (8 × 8 × 9) and (4 × 8 × 8), respectively. As elaborated on in Algorithm 1, the diversity of the spatial features can be enriched via this unit.
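The merge of the spectral and channel dimensions can be sketched as a reshape followed by a 2-D convolution (a plain Conv2d stands in for the FCTNConv2D unit; only the shapes are checked).

```python
import torch
import torch.nn as nn

# Output of the second FCTNConv3D unit: (batch, 32 channels, 18 bands, 9, 9).
f3d = torch.randn(1, 32, 18, 9, 9)
f2d = f3d.reshape(1, 32 * 18, 9, 9)      # merge bands and channels: 576
conv = nn.Conv2d(576, 256, kernel_size=3)
out = conv(f2d)
assert tuple(out.shape) == (1, 256, 7, 7)
```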
Finally, the output of the former module passes sequentially to a global average pooling (GAP) layer, two FCTN-FC units, and a softmax layer for HSI classification. Particularly, the decomposed output dimensions of these two FCTN-FC units are, respectively, (4 × 4 × 8) and (4 × 4), where (4 × 4) is consistent with the number of categories in the Indian Pines dataset. With the FC structure of the FCTN-FC unit, information is adequately propagated among the decomposed factors, which leads to better classification performance.
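A sketch of the classification head, with plain Linear layers standing in for the two FCTN-FC units; the hidden size 128 = 4 × 4 × 8 and the output size 16 = 4 × 4 follow the decomposed dimensions given above, while the 256-channel, 7 × 7 input feature is an assumption for illustration.

```python
import torch
import torch.nn as nn

feat = torch.randn(1, 256, 7, 7)
gap = nn.AdaptiveAvgPool2d(1)                     # global average pooling
fc1, fc2 = nn.Linear(256, 128), nn.Linear(128, 16)  # FCTN-FC stand-ins
logits = fc2(torch.relu(fc1(gap(feat).flatten(1))))
probs = torch.softmax(logits, dim=1)              # 16-way class prediction
assert logits.shape == (1, 16)
assert torch.allclose(probs.sum(), torch.tensor(1.0))
```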
In HybridFCTCN, the newly designed units take the place of the original FC and convolutional layers, which greatly compresses the scale of the model parameters. Note that batch normalization (BN) and the ReLU activation function are added after each convolutional unit. The detailed parameter settings of HybridFCTCN are reported in Table I.

D. Determination of the Ranks in HybridFCTCN
Rank determination is vitally important in tensor decompositions and has a considerable effect on the classification performance and complexity of the proposed model. Before elaborating on the rank determination in HybridFCTCN, Theorem 1 needs to be introduced.
Theorem 1: Let X ∈ R^(L_1×L_2×···×L_d) be a d-order tensor and L_s be the second largest number in the set {L_1, L_2, ..., L_d}; then, there exists an R ≤ L_s such that X has an FCTND format with Rank_FCTND(X) = (R, R, ..., R).
Proof: See the Appendix. ■
The newly designed FCTN-FC and FCTNConv units consist of the FCTND factors of the original weight tensors. On this basis, a novel unit rank is defined to solve the problem of rank determination.
Definition 1: Suppose that there is a series of decomposed factors G^(k) ∈ R^(R_{1,k}×···×R_{k−1,k}×L_k×R_{k,k+1}×···×R_{k,d}) with k = 1, 2, ..., d in the FCTN-FC and FCTNConv units, and let X ∈ R^(L_1×L_2×···×L_d) be their corresponding uncompressed weight tensor. According to Theorem 1, there exists an R such that X has an FCTND format with Rank_FCTND(X) = (R, R, ..., R). Then, we define R_u = R as the unit rank.
In HybridFCTCN, there are two FCTN-FC and three FCTNConv units, whose ranks can be determined by using the defined unit rank. Based on the above analysis, the bounds of the ranks in HybridFCTCN for the Indian Pines dataset are found, and by the same method, those for the University of Pavia and Houston datasets can also be obtained. The experiments to determine the specific rank in both the FCTN-FC and FCTNConv units for the three datasets are conducted in Section IV-B.
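In practice, candidate unit ranks can be screened by their parameter cost before running accuracy experiments; the helper below assumes the full FCTND topology with a uniform rank R (L_k · R^(d-1) entries per factor) and uses a hypothetical mode split for illustration.

```python
# Parameter count of an FCTND unit with uniform rank R.
def fctnd_params(mode_sizes, R):
    d = len(mode_sizes)
    return sum(L_k * R ** (d - 1) for L_k in mode_sizes)

# Enumerate the candidate ranks considered in Section IV-B for a
# hypothetical three-mode factorization (e.g., an output split 4 x 4 x 8).
mode_sizes = [4, 4, 8]
for R in (2, 3, 4):
    print(R, fctnd_params(mode_sizes, R))
```

The smallest rank whose classification accuracy is satisfactory can then be kept, which is how rank 2 is selected for the FCTN-FC units in the experiments.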

IV. EXPERIMENTAL RESULTS
To verify the performance of the proposed HybridFCTCN model, SVM [61], 3D-CNN [6], SSRN [31], HybridSN [30], MANet [38], Hybrid3D-2DCNN [62], SSCL2DNN [36], and SSTTCL2DNN [55] are selected as comparative models. The evaluation indexes adopted are the overall accuracy (OA), average accuracy (AA), and Kappa coefficient (κ). All of the experiments are conducted ten times to eliminate any bias caused by randomly choosing training samples. We use a desktop PC with an Intel Core i7-10700K CPU and an NVIDIA GeForce RTX 2080 SUPER GPU. The code of HybridFCTCN is implemented in Python 3.7.0 with PyTorch 1.11.0 and developed based on [63].

A. Dataset Description
Three widely used HSI datasets, i.e., Indian Pines, University of Pavia, and Houston, are adopted in experiments.
1) Indian Pines: The Indian Pines dataset was collected by the Airborne Visible Infrared Imaging Spectrometer (AVIRIS) sensor in northwest Indiana, USA. This dataset has a spatial size of 145 × 145 pixels with a resolution of 20 meters per pixel (mpp), and the number of spectral bands is 200 over the range from 0.4 to 2.5 µm. The ground-truth map covers 16 land types, including regular and irregular agricultural areas.
2) University of Pavia: The University of Pavia dataset was acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor over the campus of the University of Pavia, northern Italy. It consists of 610 × 340 pixels with a spatial resolution of 1.3 mpp and 103 spectral bands ranging from 0.43 to 0.86 µm, and its ground-truth map covers nine land-cover classes.
3) Houston: The Houston dataset was acquired by the Compact Airborne Spectrographic Imager (CASI) sensor over the University of Houston and its neighboring area. It is published by the IEEE Geoscience and Remote Sensing Society (GRSS) in the 2013 Data Fusion Contest (DFC). It comprises 349 × 1905 pixels with a spatial resolution of 2.5 mpp, 144 spectral bands ranging from 0.38 to 1.05 µm, and 15 land-cover classes. The dataset is available online from http://dase.grss-ieee.org.

B. Experimental Settings
In the following experiments, the HSI datasets are divided into the training and testing sets, where 10% samples in the Indian Pines and 1% in the University of Pavia datasets are randomly selected as the training set, respectively, and the rest for testing. For the Houston dataset, the given training and testing samples in the 2013 DFC are used. Tables II-IV list the detailed train-test splits for each dataset.
The parameters of the 3D-CNN, SSRN, HybridSN, MANet, Hybrid3D-2DCNN, SSCL2DNN, and SSTTCL2DNN models are consistent with the original settings to reproduce their reported performance. For the proposed HybridFCTCN model, there are some hyperparameters that need to be determined, i.e., the size of the convolution kernels, the ranks in the FCTNConv and FCTN-FC units, the optimal number (K) of principal components, and the window size (S × S) of the input. Initially, the convolution kernels are set in accordance with the values in [30], which are 3 × 3 × 7, 3 × 3 × 5, 3 × 3 × 3, and 3 × 3 in the different convolution units. Then, according to Section III-D, the experiments to determine the ranks in HybridFCTCN are conducted. Table V gives the OA values of different ranks in the FCTN-FC units generated from {2, 3, 4}, {2, 3, 4}, and {2, 3, 4, 5} for the three datasets, respectively. Considering the performance and complexity of the model comprehensively, the specific rank in all FCTN-FC units is set to 2. Due to the limited experimental environment, the ranks in all FCTNConv units are uniformly set to 2, which also achieves satisfactory performance for HSI classification. In data preprocessing, PCA is employed as a dimensionality reduction method to reduce redundant spectral information.
The optimal number of principal components and the input window size are also determined experimentally. In the following experiments, the learning rate of HybridFCTCN is set to 0.001 for all 500 training epochs, with the batch size set to 64. Adam is the selected optimizer, and cross-entropy is the loss function.

C. Classification Performance
According to the above experimental settings, Tables VIII-X present a quantitative assessment of HybridFCTCN and the other comparative models for the Indian Pines, University of Pavia, and Houston datasets, respectively. It can be observed that the proposed model achieves the best performance in terms of OA, AA, and κ with the fewest parameters. The experimental results also show that SVM is outperformed by all DL models because it is unable to make use of the spatial information. 3D-CNN can utilize the spatial-spectral information to reach good classification accuracy in the three HSI datasets, but it requires millions of parameters. Although SSRN obtains good overall performance and achieves the best results for certain classes, the ability of its residual blocks to learn discriminative features is not demonstrated in classes containing very few training samples (e.g., the Grass-pasture-mowed and Oats classes in the Indian Pines dataset). SSTTCL2DNN introduced TTD into the convolutional layers of SSCL2DNN to compress the model, sacrificing classification performance. Compared with MANet, HybridSN and Hybrid3D-2DCNN adopt a 2-D convolutional layer to strengthen spatial feature learning, contributing to better performance but at the expense of a large number of parameters. The proposed model achieves excellent performance with fewer parameters for the three considered HSI scenarios, for two reasons. On the one hand, the FC structures in the FCTN-FC and FCTNConv units boost the information flow across the feature channels, strengthening their representation ability. On the other hand, since the novel units allow more feature channels, HybridFCTCN extracts richer spatial and spectral features that contain high-level semantic information for better HSI classification. With these efficient feature extraction designs, the proposed compression model produces satisfactory results with a very small number of parameters.
Compared with 3D-CNN, which achieves good performance under different scenarios, the proposed model obtains improvements of 1.17%, 0.84%, and 2.54% in terms of OA for the Indian Pines, University of Pavia, and Houston datasets, respectively. As far as the HybridSN model is concerned, the gains in OA obtained by our model are 0.57%, 4.17%, and 11.13% for the three considered HSI datasets. The relatively wide margin between HybridFCTCN and HybridSN in the  University of Pavia and Houston datasets may be caused by the presence of more heterogeneous classes in those scenes.
Some of the classification maps achieved by HybridFCTCN and other comparative models for the three considered datasets (together with the ground-truth maps of the original scenes) are displayed in Figs. 8-10. As it can be seen in these figures, HybridFCTCN provides the smoothest classification maps and is also the most similar to the ground-truth maps in the Indian Pines, University of Pavia, and Houston datasets, respectively.
Other models can roughly distinguish different classes in the three datasets, but there are still some misclassifications, e.g., the Corn-notill class in the Indian Pines dataset, the Bare Soil class in the University of Pavia dataset, and the Highway class in the Houston dataset.

D. Comparison and Analysis Using Small Training Samples
In practice, labeling HSIs is time-consuming and challenging, which restricts the amount of labeled data available for training. To the best of our knowledge, the performance degradation of HybridSN could be related to overfitting caused by too many parameters and too few training samples. The superiority of HybridFCTCN in scenarios dominated by small training samples lies in its strong representation ability with very few parameters. On the one hand, low-rank regularization is applied to the model weights by FCTND, which reduces the noise in the data and extracts more expressive features. On the other hand, since the proposed model has few parameters, its training is less prone to overfitting, ensuring great performance.
To further validate the generalization of HybridFCTCN, 20, 30, and 40 labeled samples are randomly selected from each class as training sets for the Indian Pines, University of Pavia, and Houston datasets, respectively. It should be noted that, due to the limited labeled samples in the Grass-pasture-mowed and Oats classes of the Indian Pines dataset, the number of training samples for these two classes is uniformly set to 10 in the following experiments. Fig. 11 depicts the OA curves achieved by all models using different numbers of samples in the three datasets. It can be seen that the OA metrics of all models improve as the number of training samples increases. The proposed HybridFCTCN model always maintains the highest accuracy among all comparative models, which verifies its generalization ability in scenarios dominated by small training samples.
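The per-class sampling protocol described above (a fixed budget per class, with a smaller budget for the two scarce classes) can be sketched as follows. The function name and the `overrides` mechanism are illustrative conveniences, not part of the original implementation.

```python
import numpy as np

def sample_training_set(labels, n_per_class, overrides=None, seed=0):
    """Randomly draw a fixed number of labeled pixels per class.

    labels: 1-D integer array of class IDs, with 0 marking unlabeled
    background pixels (as is common for HSI ground-truth maps).
    n_per_class: default training budget per class.
    overrides: optional {class_id: budget} dict for scarce classes,
    e.g. {grass_pasture_mowed_id: 10, oats_id: 10}.
    Returns the indices of the selected training pixels.
    """
    rng = np.random.default_rng(seed)
    overrides = overrides or {}
    train_idx = []
    for c in np.unique(labels):
        if c == 0:  # skip unlabeled background
            continue
        idx = np.flatnonzero(labels == c)
        n = min(overrides.get(int(c), n_per_class), len(idx))
        train_idx.append(rng.choice(idx, size=n, replace=False))
    return np.concatenate(train_idx)
```

The remaining labeled pixels would then form the test set, so that no pixel appears in both splits.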

E. Analysis of the Number of Parameters
In addition to the performance analysis, the number of parameters in HybridFCTCN and the other comparative models is further investigated. Table XIV compares the distributions of model parameters in the Indian Pines dataset, from which it is obvious that HybridFCTCN requires a significantly lower number of parameters in both the convolutional and FC layers, while obtaining higher classification accuracy, as analyzed before.
Because of the large-sized input data, the 3D-CNN model has the largest number of parameters in the convolutional layers. In HybridFCTCN, both the convolutional and FC layers are compressed by the tensor decomposition. The number of parameters in the convolutional layers is reduced via the novel units (i.e., FCTNConv2D and FCTNConv3D) from a few hundred thousand to several thousand, and the FCTN-FC units, combined with a GAP layer, obtain a similar effect in the FC layers. It should be noted that there is only one FC layer in SSRN, while we use two in the HybridFCTCN model to improve accuracy, which slightly increases the number of parameters in the FC layers. In general, the proposed HybridFCTCN model has a remarkably small number of parameters while achieving outstanding performance in terms of HSI classification.
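The source of this compression can be illustrated with a rough parameter count. In FCTND, each of the N factors is connected to all N-1 others, so under a uniform rank R the factor for mode k holds I_k · R^(N-1) entries. The kernel shape and rank below are illustrative, not the paper's actual configuration, and the formula assumes a single uniform rank rather than mode-dependent ranks.

```python
from math import prod

def conv3d_params(kernel_dims):
    """Parameters of a dense Conv3D kernel (C_in, k_d, k_h, k_w, C_out)."""
    return prod(kernel_dims)

def fctn_params(kernel_dims, rank):
    """Parameters after FCTN decomposition with a uniform rank.

    Each of the N factors connects to all N-1 others, so the factor
    for a mode of size I_k holds I_k * rank**(N-1) entries.
    """
    n = len(kernel_dims)
    return sum(i * rank ** (n - 1) for i in kernel_dims)

dims = (16, 3, 3, 3, 32)      # illustrative 5th-order Conv3D kernel
dense = conv3d_params(dims)   # 13,824 weights in the dense kernel
low_rank = fctn_params(dims, 2)  # 912 weights across the FCTN factors
```

Because the dense count grows multiplicatively in the mode sizes while the factor count grows only additively, the savings become far larger for the wide kernels used in practice.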

F. Comparison of Different Tensor Decompositions
HybridFCTCN is further compared with the models based on TTD [64] and TRD [57]. For a fair comparison, only the FCTN-FC, FCTNConv2D, and FCTNConv3D units are replaced with the corresponding TTD-based or TRD-based units. Also, the ranks of the comparative tensor decompositions are set according to the number of parameters in HybridFCTCN, i.e., 6 or 7 for the TTD-based and 5 or 6 for the TRD-based models.
The experimental results of the models based on different tensor decompositions are listed in Table XV. It can be seen that, with a similar number of parameters, the proposed model has the best OA values and only moderately increases the testing time. To be specific, compared to the TTD-based models, HybridFCTCN improves by 2.28% over TTD-6 and 1.69% over TTD-7 in the Indian Pines dataset. Meanwhile, as far as the TRD-based models are concerned, our model is 0.98% more accurate than TRD-5 and 0.69% more accurate than TRD-6. Regarding computational complexity, the testing times of the TRD-based models are about twice those of the TTD-based ones, and the time is further increased in HybridFCTCN. The reason for these results is that the FC structure in the FCTND-based units is able to adequately characterize the correlation between any two factors and has transposition invariance, yielding better accuracy but also higher computational complexity. For the other comparative models, the chain- and ring-like structures in the tensorized units, which only connect adjacent factors, sacrifice classification performance. Overall, compared to other tensor decomposition models, HybridFCTCN achieves satisfactory results with acceptable testing time.
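The differing parameter budgets of the three decompositions follow from their connectivity: a TT factor links only to its two chain neighbors (with boundary ranks of 1), a TR factor closes the chain into a ring, and an FCTN factor links to all other factors. The following counts assume a single uniform rank per decomposition and an illustrative kernel shape; the paper's models may use mode-dependent ranks.

```python
def tt_params(dims, r):
    """Tensor-train: factor k has shape (r_{k-1}, I_k, r_k), with
    boundary ranks r_0 = r_N = 1 closing the chain at both ends."""
    n = len(dims)
    total = 0
    for k, i in enumerate(dims):
        r_left = 1 if k == 0 else r
        r_right = 1 if k == n - 1 else r
        total += r_left * i * r_right
    return total

def tr_params(dims, r):
    """Tensor-ring: every factor has shape (r, I_k, r); the chain
    closes on itself, so there are no rank-1 boundary factors."""
    return sum(i * r * r for i in dims)

def fctn_params(dims, r):
    """FCTN: each factor is linked to all N-1 others, so the factor
    for a mode of size I_k holds I_k * r**(N-1) entries."""
    n = len(dims)
    return sum(i * r ** (n - 1) for i in dims)

dims = (16, 3, 3, 3, 32)  # illustrative 5th-order kernel shape
counts = {"TT-7": tt_params(dims, 7),
          "TR-6": tr_params(dims, 6),
          "FCTN-2": fctn_params(dims, 2)}
```

This also makes the rank choices in the comparison plausible: because the FCTN factor size grows as r**(N-1) rather than r**2, a small FCTN rank can match the parameter count of TT or TR models with noticeably larger ranks.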

V. CONCLUSION
In this article, a new compression model called HybridFCTCN is proposed for HSI classification, achieving outstanding performance with significantly fewer parameters than other comparative models. Owing to the FC structures, three novel units with fewer parameters, i.e., FCTN-FC, FCTNConv2D, and FCTNConv3D, effectively achieve information flow across the channels, which improves their feature extraction and classification abilities. By making use of a hybrid network architecture, the proposed model combines complementary spatial-spectral and spatial information with a lower storage requirement for better HSI classification. Moreover, the rank determination in HybridFCTCN has been discussed, which facilitates the practical application of the proposed model. A series of experiments on three widely used HSI datasets demonstrates the superiority of our model, especially in scenarios dominated by small training samples.