GhoMR: Multi-Receptive Lightweight Residual Modules for Hyperspectral Classification

In recent years, hyperspectral images (HSIs) have attracted considerable attention in computer vision (CV) due to their wide utility in remote sensing. Unlike images with three or fewer channels, HSIs have a large number of spectral bands. Recent works demonstrate the use of modern deep learning based CV techniques, such as convolutional neural networks (CNNs), for analyzing HSIs. CNNs have receptive fields (RFs) fueled by learnable weights, which are trained to extract useful features from images. In this work, a novel multi-receptive CNN module called GhoMR is proposed for HSI classification. GhoMR utilizes blocks containing several RFs, extracting features in a residual fashion. Each RF extracts features that are used by other RFs to extract more complex features in a hierarchical manner. However, the more RFs there are, the greater the number of associated weights, and thus the heavier the network. Most complex architectures suffer from this shortcoming. To tackle this, the recently introduced Ghost module is used as the basic building unit. Ghost modules address feature redundancy in CNNs by extracting only a limited set of features and performing cheap transformations on them, thus reducing the overall number of parameters in the network. To test the discriminative potential of GhoMR, a simple network called GhoMR-Net is constructed using GhoMR modules, and experiments are performed on three public HSI data sets—Indian Pines, University of Pavia, and Salinas Scene. The classification performance is measured using three metrics—overall accuracy (OA), Kappa coefficient (Kappa), and average accuracy (AA). Comparisons with ten state-of-the-art architectures are shown to further demonstrate the effectiveness of the method. Although lightweight, the proposed GhoMR-Net provides comparable or better performance than other networks. The PyTorch code for this study is made available at the iamarijit/GhoMR GitHub repository.


Introduction
Hyperspectral images (HSIs) are image cubes where each pixel is measured as one near-continuous spectrum. Unlike RGB images, HSIs have hundreds of spectral bands, containing knowledge of wavelengths beyond the visible spectrum. These cubes contain both spatial and spectral information, which can be widely utilized in remote sensing for analyzing a scene of interest. Hyperspectral imaging also finds applications in agriculture [1], forestry [2,3], archaeology [4], medical analysis [5], food quality control [6], military defense [7], forensics [8], and several other domains. Thus, research in HSI processing and analysis is growing rapidly, and many studies have been published in past years. Often, the high spectral dimensionality of an HSI poses a challenge for such analysis.

The above shortcoming in earlier works inspired us to propose the multi-receptive lightweight residual block called GhoMR. A single GhoMR uses a complex strategy inspired by Res2Net [64] to extract information from HSI data. Each module contains multiple RFs, where each RF extracts features in a hierarchical fashion using information from other RFs in the same module. These RFs are connected with residual-like connections. However, with an increase in complexity, the number of learnable weights increases. Thus, to ensure a lightweight architecture, the Ghost module (GM) is used as the basic building unit. A single receptive layer of a CNN has multiple convolutional kernels that generate several feature maps. Research has shown [65] that many of these feature maps are similar and can easily be constructed by transforming other features. GMs take advantage of this feature redundancy in CNNs. Inside a GM, only a very limited number of features are extracted from the input using a convolutional layer; more features are then generated from the existing ones using cheap linear operations.
This strategy reduces the number of parameters, giving rise to a lightweight feature extraction module. The GM was first used in GhostNet [65], published in CVPR 2020, and later it became a backbone for many methods. Recently, an architecture based on GM called Improved GhostNet [66] was used for remote sensing classification as well. However, the proposed GhoMR is the first to use GM on HSIs. Stacking four such GhoMR modules, a classification network called GhoMR-Net is constructed, which is tested on three benchmark datasets and compared with state-of-the-art architectures.
The main contributions of this research can be summarized as follows:

1.
A novel lightweight multi-receptive feature extraction module called GhoMR is proposed for HSI classification;

2.
GhoMR uses a complex feature extraction strategy with several internal RFs connected in a residual fashion;

3.
To reduce the number of trainable parameters, Ghost modules are used, which apply low-cost transformations to address feature redundancy in CNNs;

4.
An architecture called GhoMR-Net is designed using multiple GhoMR blocks to perform experiments on three public HSI datasets;

5.
Comparisons are shown, which verify that the proposed GhoMR gives better or comparable results than state-of-the-art techniques.
The rest of the paper is organized as follows. Section 2 describes the proposed methodology, Section 3 describes the datasets used and discusses the experiments, comparisons, and visualizations performed on them, while Section 4 concludes our research.

Brief Description of Ghost Modules
CNNs are driven by receptive kernels, or filters, having randomly initialized weights. These kernels traverse an input (image or feature maps) and perform element-wise multiplication with the underlying pixels, followed by summation, to extract features. This operation is termed convolution. During training, sufficient examples are fed to the network, and over many iterations these weights are updated using backpropagation, so that the network learns to generalize to unseen examples. However, CNN architectures use several kernels to extract a wide variety of feature maps. This increases the number of trainable weights, thus demanding heavy computational costs and expensive hardware to train and store them.
Let I ∈ R^(W×H×C) be the input to a single convolutional block, where W and H are the spatial dimensions, while C is the number of channels. To extract a unique feature map y_i from I, a kernel k_i ∈ R^(s×s×C) is used to perform the convolution, where s < W and s < H. The convolution operation can be represented as

y_i = I ∗ k_i,

where ∗ denotes convolution. Similarly, a set of C′ kernels {k_1, k_2, k_3, ..., k_C′} is used to generate different feature maps, which are stacked to produce a feature block Y ∈ R^(W×H×C′), which becomes the input for another set of kernels. This operation involves s × s × C × C′ parameters, which can be very large owing to large values of C and C′. Thus, to reduce parameters, the number of kernels C′ must be optimized (assuming that C is constant). Prior research has shown that many feature maps derived by these kernels are similar to each other, so they can be generated by transforming existing ones rather than using separate kernels. To exploit this redundancy, the Ghost module (GM) [65] was recently proposed.
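The parameter savings can be made concrete with a small back-of-the-envelope calculation. The sketch below follows the counts given in the text; the symbols s, C, C′ (written C_out below), the number of Ghost transformations T, and the Ghost filter size K_T match the notation above, while the concrete values are illustrative assumptions, not numbers from the paper.

```python
# Parameter-count comparison: standard convolution vs. a Ghost module.
# Illustrative values (assumptions, not from the paper):
s, C, C_out, T, K_T = 3, 64, 128, 2, 3

# Standard convolution: one s x s x C kernel per output feature map.
standard_params = s * s * C * C_out

# Ghost module: only C_out / T intrinsic maps come from ordinary
# convolution; the rest are derived by cheap K_T x K_T depthwise filters.
intrinsic = C_out // T
ghost_params = s * s * C * intrinsic + (T - 1) * intrinsic * K_T * K_T

print(standard_params, ghost_params)  # 73728 37440
```

Even with only T = 2 transformations, the Ghost module here needs roughly half the weights of the ordinary layer, and the gap widens as T grows.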
A GM reduces the number of kernels while keeping the loss of information minimal. Feature extraction in a GM is done in two steps:

1.
The first step involves simple convolutional operations as described above. Keeping all other hyperparameters constant, C̃ kernels are used to generate a set of intrinsic feature maps Y = {y_1, y_2, y_3, ..., y_C̃}, where C̃ ≪ C′. As a result, the total number of parameters in the network reduces to s × s × C × C̃.

2.
The reduction of parameters alone would lead to the loss of significant information. To make up for the remaining C′ − C̃ features, new feature maps are derived from each of the existing features by performing T low-cost operations (Ghost transformations) on them. These derived features are called Ghost features. This can be represented as

y^g_ij = θ_ij(y_i), 1 ≤ i ≤ C̃, 1 ≤ j ≤ T,

where y_i is the ith feature map in Y and θ_ij is the jth linear operation deriving a Ghost feature y^g_ij from y_i. Among the T Ghost transformations applied on y_i, one operation θ_i1 is kept as the identity to retain the original feature map; the remaining T − 1 operations generate the Ghost features. Thus, a total of C̃ × T features are generated, such that C̃ × T ∼ C′. Figure 1 shows a simple illustration of the Ghost module. For the transformation function θ, convolutional filters of size K_T × K_T are used instead of hand-crafted low-cost linear operations. These filters are called Ghost filters. This is done to utilize the learning capability of convolution to perform the most appropriate transformations. Moreover, it gives the flexibility to experiment with different values of K_T, since kernels of different spatial dimensions extract different types of features. Note that the computational complexity of θ is much less than that of ordinary convolution; a detailed analysis is given in the founding manuscript [65].

Proposed GhoMR Module
Figure 2 shows the diagram of a single GhoMR module, which is the proposed backbone for HSI classification. A GhoMR uses multiple internal GMs to extract features in a residual hierarchical fashion. This strategy is inspired by Res2Net [64] and is useful for extracting complex details from the HSI cube. Let the input for an arbitrary GhoMR module be I ∈ R^(W×H×C), where W, H, and C are the width, height, and channels, respectively. Feature extraction from this cube is done in three steps:

1.
At first, a GM with 1 × 1 kernels is used to extract a feature block Y_1 ∈ R^(W×H×N).
Note that these 1 × 1 kernels are not the Ghost filters, but are used to generate the original feature maps. For the Ghost filters, experiments with different sizes (K_T) are performed, as discussed in Section 3.

2.
In the next step, the N feature maps of Y_1 are split into four subsets, denoted by n_i, where 1 ≤ i ≤ 4. Except for n_1, each subset is passed through a 3 × 3 GM. The output of the previous GM, o_{i−1}, is fused hierarchically using element-wise summation with the current subset n_i to produce the feature set o_i:

o_1 = n_1,
o_i = GM(n_i + o_{i−1}), 2 ≤ i ≤ 4,

where + refers to element-wise summation. Note that the GM for the first split n_1 is omitted in order to reuse features and reduce parameters in the module.

3.
Finally, the output maps o_1, o_2, o_3, and o_4 are concatenated along their depth to form a single feature block containing all the information. This block is further passed through a 1 × 1 GM and fused with the input I through a residual connection to produce the final output O:

O = GM(o_1 ⊕ o_2 ⊕ o_3 ⊕ o_4) + I,

where ⊕ refers to concatenation and + denotes element-wise summation.
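The three steps above can be sketched in PyTorch. The code below is a minimal illustration under stated assumptions, not the authors' released implementation (available in the iamarijit/GhoMR repository): the Ghost transformations are realized as depthwise convolutions, and a 1 × 1 projection is added on the residual branch so that the sum is valid when the input and output channel counts differ; the paper's exact wiring may vary.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Minimal Ghost module sketch: a primary convolution produces a few
    intrinsic maps; cheap depthwise (Ghost) filters derive the rest."""
    def __init__(self, in_ch, out_ch, kernel_size=1, T=2, K_T=3):
        super().__init__()
        intrinsic = out_ch // T  # number of intrinsic maps (C-tilde)
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, intrinsic, kernel_size,
                      padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(intrinsic), nn.ReLU(inplace=True))
        # (T - 1) cheap transformations per intrinsic map, done depthwise.
        self.cheap = nn.Sequential(
            nn.Conv2d(intrinsic, intrinsic * (T - 1), K_T,
                      padding=K_T // 2, groups=intrinsic, bias=False),
            nn.BatchNorm2d(intrinsic * (T - 1)), nn.ReLU(inplace=True))

    def forward(self, x):
        y = self.primary(x)                          # intrinsic features
        return torch.cat([y, self.cheap(y)], dim=1)  # identity + Ghost maps

class GhoMR(nn.Module):
    """One GhoMR block: 1x1 GM -> four splits with hierarchical 3x3 GMs
    (Res2Net-style) -> depth concat -> 1x1 GM -> residual sum."""
    def __init__(self, in_ch, out_ch, mid_ch=48, T=2, K_T=3):
        super().__init__()
        self.entry = GhostModule(in_ch, mid_ch, 1, T, K_T)
        split = mid_ch // 4
        self.gms = nn.ModuleList(
            [GhostModule(split, split, 3, T, K_T) for _ in range(3)])
        self.exit = GhostModule(mid_ch, out_ch, 1, T, K_T)
        # Assumption: a 1x1 projection so the residual connection matches
        # channel counts when in_ch != out_ch.
        self.proj = (nn.Identity() if in_ch == out_ch
                     else nn.Conv2d(in_ch, out_ch, 1, bias=False))

    def forward(self, x):
        n = torch.chunk(self.entry(x), 4, dim=1)  # subsets n_1..n_4
        o = [n[0]]                                # o_1 = n_1 (no GM)
        for i in range(3):                        # o_i = GM(n_i + o_{i-1})
            o.append(self.gms[i](n[i + 1] + o[-1]))
        out = self.exit(torch.cat(o, dim=1))      # concat on depth, 1x1 GM
        return out + self.proj(x)                 # residual connection

x = torch.randn(2, 24, 15, 15)
print(GhoMR(24, 36)(x).shape)  # torch.Size([2, 36, 15, 15])
```

The channel widths (mid_ch = 48, four splits of 12) follow the GhoMR-Net configuration described in the experimental section.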

Datasets
The proposed methodology is evaluated on three public HSI datasets (http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes). These datasets are described as follows:

1.
Indian Pines (IP)-The images in this dataset were collected in 1992, over the Indian Pines test site in north-western Indiana using the AVIRIS [67] sensor. The HSI cube has a spatial dimension of 145 × 145 pixels with 224 spectral bands in the wavelength range of 400 to 2500 nm, among which 24 bands corresponding to regions of water absorption were eliminated. Among the 21,025 pixels, 10,249 are annotated with ground truth from a set of 16 different vegetation classes.

2.
University of Pavia (UP)-This dataset was acquired in 2001, over the university campus at Pavia, Northern Italy, using the ROSIS sensor. It has a spatial dimension of 610 × 340 pixels and 103 spectral bands in the wavelength range of 430 to 860 nm. The ground truth is a set of 9 urban land-cover classes, and approx. 20% of the total 207,400 pixels are annotated with this information.

3.
Salinas Scene (SA)-This dataset was collected over Salinas Valley, California, in 1998 using the AVIRIS sensor. The spatial dimension is 512 × 217 pixels, and the spectral information is encoded in 224 bands in the wavelength range of 360 to 2500 nm. Similar to IP, 20 spectral bands due to water absorption are discarded. The ground truth contains 16 different classes from vegetables, bare soils, and vineyard fields.

Experimental Protocols
Using several GhoMRs, a network called GhoMR-Net is proposed, as shown in Figure 3. At first, the input is fed to a simple convolutional layer of 24 kernels. The output is then passed through a series of four GhoMR modules, which produce 24, 36, 48, and 60 feature maps, respectively. Inside each GhoMR, the first 1 × 1 GM generates 48 feature maps from the input, which are split into four parts of 12 features each. The 3 × 3 GMs operating on each split (n_i) extract 12 feature maps, which are concatenated again into a single block of 48. This block is fed to the final 1 × 1 GM, which outputs the set of features for the next GhoMR block. To improve efficiency, batch normalization [68] and ReLU activation are applied after every GM. On the features extracted by the final GhoMR, global average pooling (GAP) [69] is performed, and the resulting vector is fed to a fully-connected (FC) layer to output the class probabilities. The class with the maximum probability is the predicted class.
The above architecture is trained to classify each pixel of an HSI cube C_H. This 3D image cube has hundreds of spectral channels containing redundant information, which makes classification difficult and increases computational costs. Thus, principal component analysis (PCA) is performed along the spectral axis. The PCA-reduced cube C_P retains the spatial information and reduces the channels to S, where S is 30 for IP and 15 for UP and SA. Now, C_P is divided into spatially overlapping 3D patches D ∈ R^(W×W×S), where W is the spatial dimension of a patch. The ground truth Y_T ∈ R^(N_C×1) assigned to each patch is that of the central pixel in the patch. These 3D patches are fed to the proposed GhoMR-Net, which outputs a vector Y_P ∈ R^(N_C×1), where N_C is the number of classes.
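The PCA reduction and patch extraction described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and reflect-padding at the image border is an assumption (the released implementation may pad differently or discard border pixels).

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_reduce(cube, n_components):
    """Reduce the spectral axis of an (H, W, B) HSI cube to n_components
    channels with PCA, as described in the text."""
    h, w, b = cube.shape
    flat = cube.reshape(-1, b)  # one spectrum per row
    reduced = PCA(n_components=n_components).fit_transform(flat)
    return reduced.reshape(h, w, n_components)

def extract_patches(cube, window):
    """Cut spatially overlapping window x window patches, one per pixel;
    each patch is labelled with the ground truth of its central pixel."""
    pad = window // 2
    padded = np.pad(cube, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    h, w, _ = cube.shape
    return np.stack([padded[i:i + window, j:j + window]
                     for i in range(h) for j in range(w)])

# Toy stand-in for an HSI cube (real cubes have hundreds of bands).
cube = np.random.rand(20, 20, 50)
patches = extract_patches(pca_reduce(cube, 15), 15)
print(patches.shape)  # (400, 15, 15, 15)
```

With S = 15 and W = 15, each sample fed to GhoMR-Net is a 15 × 15 × 15 cube, one per labelled pixel.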
The cross-entropy loss is then calculated between Y_T and Y_P, and the network is trained to minimize this loss. As discussed in Section 2, the GMs used in the GhoMR blocks have two hyperparameters: the number of Ghost transformations (T) and the spatial size of the Ghost filters (K_T). With an increase in T, fewer raw features are extracted from the input and more are derived using Ghost operations, thus reducing the number of parameters. A larger value of K_T, on the other hand, means a greater filter dimension and thus more trainable parameters. Performance with different combinations of T and K_T is discussed in the next subsection, along with experiments on different spatial sizes (W) of input patches and different training ratios. All the experiments are done using PyTorch 1.6.0 with CUDA 10.1 in the GPU environment of Google Colaboratory. The architecture is trained using the Adam [70] optimizer for 100 epochs, keeping a batch size of 100 and a learning rate of 0.001. The code for this research is available at https://github.com/iamarijit/GhoMR.
To measure the performance, three standard evaluation metrics are used: overall accuracy (OA), average accuracy (AA), and Kappa coefficient. OA measures the fraction of correctly classified samples in the test set, AA calculates the average of the class-wise accuracies, and Kappa measures the degree of agreement between the ground-truth and predicted classification maps. The OA, AA, and Kappa for each experiment are calculated five times and are written as mean ± std. Based on these metrics and the above-mentioned hyperparameters, five sets of analysis are carried out to demonstrate the classification potential and lightweight nature of the proposed GhoMR-Net:

1.
The first experiment calculates the class-wise accuracies, OA, AA, and Kappa for the IP, UP, and SA datasets using 10% and 20% training data. The 3D spectral-spatial inputs have a spatial dimension of 15 × 15 for all three datasets. The values of T and K_T are kept at 2 and 3, respectively.

2.
In the second experiment, OA, AA, and Kappa are measured on the three datasets for different values of T and K_T, such that T ∈ {2, 4} and K_T ∈ {3, 5, 7}. A comparative study between all six combinations of T and K_T is performed. This experiment is conducted on 10% training data with 3D input cubes of spatial dimension 15 × 15.

3.
In the third experiment, GhoMR-Net is compared with state-of-the-art architectures. Comparisons are shown for both 10% and 20% training data, keeping an input spatial dimension of 15 × 15.

4.
The fourth experiment measures the OA, AA, and Kappa with less training data (5% and 3%) and smaller spatial dimensions (13 × 13 and 11 × 11) of input patches. The parameters T and K_T are kept at 2 and 3, respectively.

5.
The final experiment demonstrates the effectiveness of GhoMR-Net using t-SNE visualization [71] and confusion matrices. Moreover, the number of trainable parameters in the network is compared with other state-of-the-art architectures.
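The three metrics used throughout these experiments can be computed with scikit-learn. This is a small sketch, not the authors' evaluation script; the helper name oa_aa_kappa is hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, confusion_matrix

def oa_aa_kappa(y_true, y_pred):
    """Compute the three metrics described above: overall accuracy (OA),
    average accuracy (AA, the mean of per-class accuracies), and Kappa."""
    oa = accuracy_score(y_true, y_pred)
    cm = confusion_matrix(y_true, y_pred)
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))  # per-class accuracy, averaged
    kappa = cohen_kappa_score(y_true, y_pred)
    return oa, aa, kappa

# Toy labels: one class-1 sample is misclassified as class 0.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 2, 2]
oa, aa, kappa = oa_aa_kappa(y_true, y_pred)
print(round(oa, 3), round(aa, 3), round(kappa, 3))  # 0.833 0.833 0.75
```

Because AA averages over classes regardless of their size, it penalizes errors on rare classes more heavily than OA, which is why both are reported for the imbalanced IP dataset.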

Classification Results and Visualizations
The first experiment was conducted to calculate the class-wise accuracies for the three datasets, using hyperspectral inputs of spatial dimension 15 × 15. The results are shown in Tables 1 and 2 for 20% and 10% training data, respectively. For each dataset, the first three columns contain the class labels and data distribution (training and test samples), while the fourth column shows the accuracy (in %) for each class. The last four rows of each table report the overall accuracy (OA), Kappa coefficient, average accuracy (AA), and training time. For 20% training data, the OAs obtained are 99.54%, 99.90%, and 99.99%, while on 10% data they are 98.64%, 99.75%, and 99.98% for IP, UP, and SA, respectively. The proposed GhoMR-Net performs worse on IP than on SA and UP, which can be explained by the fewer training examples and significant class imbalance in IP. To better understand the results, the ground-truth and predicted classification maps for IP, UP, and SA are shown in Figures 4-6, respectively.

In the second set of experiments, the dependence on the hyperparameters T and K_T is explored. The OAs, Kappas, and AAs for different combinations of T and K_T are given in Table 3. On IP and SA, the model performs best when T = 2 and K_T = 3, i.e., two Ghost transformations using 3 × 3 filters. Unlike IP and SA, the performance on UP improves when K_T is increased. A larger K_T increases the number of parameters; since IP and SA have more classes (16) and, on average, fewer training samples per class, the tendency to overfit grows with increasing K_T, and performance on the test set decreases. Fixing the values of T and K_T at 2 and 3, respectively, GhoMR-Net is compared with ten state-of-the-art techniques using 10% and 20% training samples. The spatial window dimensions of the input are kept the same as in the prior experiments.
For IP, the method outperforms FuSENet, SSRN, and HybridSN with an increase in OA of 0.53%, 0.31%, and 0.07%, respectively, on 20% training data. Improvements or comparable results are obtained on SA and UP as well, as reported in Table 4. In spite of having very few parameters, the satisfactory classification results of GhoMR-Net can be explained by the multi-receptive feature extraction strategy of the GhoMR modules.

In the next experiment, the robustness of the approach and the influence of the input spatial dimensions are explored. This is performed with fewer training samples, i.e., 5% and 3%, using inputs of spatial size 13 × 13 and 11 × 11. The OAs, AAs, and Kappas given in Table 5 show that performance deteriorates for all three datasets, which is expected; the classification maps for IP given in Figure 7 further verify this. It is observed that, on increasing the spatial size, the performance on IP and SA improves, since more spatial context is captured. However, in UP, as shown in Figure 5, the labelled patches are short and discontinuous, unlike in IP and SA; increasing the spatial dimensions thus captures more noise, which reduces the classification accuracies.

Figure 7. Predicted classification maps for IP with 11 × 11 and 13 × 13 input spatial size for (a,b) 5% training data and (c,d) 3% training data, respectively.

Finally, a set of visualizations is performed to demonstrate the discriminative power of GhoMR-Net. The higher-dimensional features from the GAP layer of the network are extracted for each sample in the test set and reduced to two-dimensional coordinates via t-SNE. These coordinates are plotted in Figure 8 for the three datasets. It is clearly observed that features representing pixels with the same ground truth form nearby clusters, which are represented by similar colors. Moreover, the confusion matrices are obtained on 90% test data and are given in Figure 9.
Furthermore, the total number of trainable parameters is compared with seven of the above-mentioned architectures: 3D-CNN [52], M3D-CNN [56], Two-CNN [55], HybridSN [59], SENet [63], FuSENet [63], and SSRN [58]. As shown in Figure 10, the proposed network has only 32,704 trainable parameters, far fewer than HybridSN, SSRN, and FuSENet, which have 5,122,176, 500,384, and 128,848 parameters, respectively.
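The trainable-parameter counts compared in Figure 10 can be obtained in PyTorch with a one-liner. The toy model below is a stand-in for illustration; applying the same helper to the full GhoMR-Net would report the 32,704 figure quoted above.

```python
import torch.nn as nn

def count_trainable(model):
    """Count the trainable parameters of a model, the quantity
    compared across architectures in Figure 10."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Toy stand-in model (an assumption, not GhoMR-Net itself):
# Conv2d(3, 8, 3) has 3*3*3*8 weights + 8 biases = 224 parameters,
# Linear(8, 4) has 8*4 weights + 4 biases = 36 parameters.
toy = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Linear(8, 4))
print(count_trainable(toy))  # 260
```

Counting only parameters with requires_grad=True matters when some layers are frozen; for a fully trainable network like GhoMR-Net, it equals the total parameter count.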

Conclusions
In this study, a lightweight multi-receptive module called GhoMR is proposed for hyperspectral image (HSI) classification. It contains several internally connected receptive fields (RFs) that extract complex features from HSIs in a hierarchical fashion. Unlike other approaches that use plain convolutional layers, the recently introduced Ghost modules are used as RFs; these extract only a handful of features from the input and derive the remaining ones from the existing features. Using GhoMR blocks, a simple lightweight architecture called GhoMR-Net is designed to perform experiments on three standard datasets. The classification results are measured using three metrics and compared with other state-of-the-art techniques. Experiments with less training data and smaller input spatial sizes are also performed, along with several visualizations and plots, to better understand the discriminative potential of the architecture.