Abstract

Hyperspectral imagery and remote sensing have become central topics in recent imaging science and technology. Current intelligent techniques, such as support vector machines, sparse representations, active learning, extreme learning machines, transfer learning, and deep learning, are all grounded in machine learning. These techniques enrich the processing of such three-dimensional, multiband, high-resolution images with precision and fidelity. This article presents an extensive survey of the contributions of machine-dependent technologies and deep learning to landcover classification based on hyperspectral images. The objective of this study is three-fold. First, after reviewing a large pool of Web of Science (WoS), Scopus, SCI, and SCIE-indexed and SCIE-related articles, we provide a novel, entirely systematic approach to review work that helps identify research gaps and formulate the embedded research questions. Second, we emphasize contemporary advances in machine learning (ML) methods for classifying hyperspectral images, with a brief, organized overview and a thorough assessment of the literature involved. Finally, we draw conclusions that assist researchers in expanding their understanding of the relationship between machine learning and hyperspectral images for future research.

1. Introduction

Hyperspectral imagery is one of the most significant discoveries in remote sensing imaging science and technology. Hyperspectral imagery (HSI) is the technology that depicts the perfect combination of Geographic Information Systems (GIS) and remote sensing. Besides, HSI serves several applications such as ecological protection, security, agriculture and horticulture, crop specification and monitoring, medical diagnosis, identification, and quantification [1]. RGB images are made up of three dimensions: width, height, and 3 color bands or channels carrying the color information, that is, red, green, and blue. They are stored as a 3D byte array that explicitly holds a color value for each pixel in the image, a combination of RGB intensities laid down onto a color plane. In contrast, an HSI comprises a hypercube of hundreds of contiguous bands and hence possesses a large resolution and an enormous amount of embedded information of all kinds: spectral, spatial, and temporal. This information enables various applications to detect and characterize land covers, which are the most significantly explored [2]. RGB images are captured by digital RGB cameras capable of characterizing objects only by their shape and color. Moreover, the embedded information is minimal since only three visible bands are available in the human visibility range. HSI, on the other hand, is captured by specialized hyperspectral sensors mounted on aircraft or artificial satellites, that is, imaging spectrometers. They cover a broad range of scenes by acquiring large numbers of consecutive bands, not confined to the visible light spectrum, and through a wider spectral band-pass. However, compared to a digital sensor that absorbs light in just three wide channels, a hyperspectral sensor's channel width is much narrower, making the spectral resolution and data volume much higher and creating hurdles in storing, mining, and managing the data [3]. Furthermore, processing these data with a massive number of bands imposes many obstacles such as calibration-related noise, geometric distortion, noisy labels, and limited or unbalanced labeled training samples [4–6], as well as the Hughes phenomenon and dimensionality-reduction-related artifacts: overfitting, redundancy, spectral variability, loss of significant features between the channels, etc. [7].

Classifying HSIs is considered an intrinsically nonlinear problem [8], and the initial approaches based on linear-transformation statistical techniques, such as the principal component methods, that is, principal component analysis (PCA) [9] and independent component analysis (ICA) [10]; the discriminant analysis methods, that is, linear [11] and Fisher [12]; wavelet transforms [13]; and composite [14], probabilistic [15], and generalized [16] kernel methods, had shown promising outcomes. Still, their focus was limited to spatial information. These approaches relied on feature extraction techniques assisted by basic classifiers, which introduced complexity in terms of cost, space, and time and were not sufficiently accurate. After the success of these traditional methodical techniques for HSI classification, researchers became keenly interested in applying the most recent emerging, less tedious computer-based methods that made the entire process smoother and closer to perfection. Study advancements suggest that the last decade can be considered the most escalating era of computer-based technologies due to the emergence of machine learning (ML). ML is an algorithmic and powerful tool that resembles the human brain's cognition. It represents a complex system by means of abstraction. Hence, it can reduce complexity and peer into the insights of the vast amount of HS data to extract the hidden discriminative features, both spectral and spatial [17]. Thus, it overcomes the stumbling blocks to achieving the desired accuracy in identifying the classes to which the objects of the target HSI data belong. Hence, these methods act as all-in-one techniques that can serve the purpose without further assistance. Keeping this in mind, we conducted an extensive survey of the various discriminative machine and deep learning (ML, DL) models for HSI. In most of the literature, the HSI datasets commonly used for landcover classification are AVIRIS Indian Pines (IP), Kennedy Space Center (KSC), Salinas Valley (SV), and ROSIS-03 University of Pavia (UP), along with the less frequently used Pavia Center, Botswana, University of Houston (HU), etc. They are pre-refined and made publicly available on [18] for download and processing.

The motivation of our work is divided into three parts. First, a novel methodology is proposed for the review work that is entirely systematic and helps form the research gaps and embedded questions after going through a large pool of research articles. Second, this work focuses on the current advancements of ML technologies for classifying HSI, with a brief, methodical description of each and a detailed review of the literature involved with them. Finally, inferences are drawn to help researchers boost their knowledge for future research. The key contributions made to the research field of hyperspectral imagery by our effort are as follows:
(1) A thorough revision of the analytical and classification work carried out to date on HS imagery by employing ML/DL techniques.
(2) Emphasis on the categorized methods explored and practiced most frequently so far, including a brief interpretation of the most recent technologies and the highlighted hybrid techniques.
(3) An open knowledge base that acts as a reservoir of relevant information, interpreting all research on each mentioned technique in terms of methodology, convenience and limitations, and future strategies. This illustration may guide the choice of objectives for further research in the field of HSIs.
(4) An explicit picture of the growth of interest in the concerned field that would attract researchers, with a coherent, substantial specification (benefits and drawbacks) of each method individually that informs researchers about the favorable results and the difficulties of a chosen technique.
(5) A concise rendition of the most recent research on HSIs that signifies the currently adopted technologies as hot spots, with a focus on research areas of shared interest, that is, the hybridized methods popular among researchers for addressing the problem and achieving the desired experimental results.

The rest of the article is arranged as follows: Section 2 briefly explains the constraints faced by the researchers in dealing with HSI; Section 3 represents the methodology for the research along with the motive behind this review; Section 4 describes seven ML techniques, namely, support vector machine (SVM), sparse representation (SR), Markov random field (MRF), extreme learning machine (ELM), active learning (AL), deep learning (DL), and transfer learning (TL); Section 5 presents the complete summary of the literature review work in the form of answers to the research questions; Section 6 depicts the conclusions; and Section 7 explains the limitations and future work.

2. Constraints of HSI Classification

Since their emergence, several difficulties have caused issues in analyzing and performing operations on hyperspectral images. Initially, the field suffered from immature spectroscopy technology, poor-quality hyperspectral sensors, and insufficient data. Along with the advancement of applied science, things have become easier, but some well-known persistent hitches still need to be overcome. Some of them are stated as follows:
(a) Lack of high-resolution Earth observation (EO) noiseless images: During the initial stage of the development of spectrometers, they were not very efficient. Due to this, noise caused by water vapor, atmospheric pollutants, and other atmospheric perturbations modifies the signals coming from the Earth's surface for Earth observations. Several efforts have been made over the last decades to produce high-quality hyperspectral data for Earth observation and to develop a wide range of high-performance spectrometers that combine the power of digital imaging, spectroscopy, and the extraction of numerous embedded spatial-spectral features [19].
(b) Hindrances in the extraction of features: During data gathering, redundancy across contiguous spectral bands results in the availability of duplicated information, both spatially and spectrally, obstructing the optimal and discriminative retrieval of spatial-spectral characteristics [7].
(c) The large spatial variability and interclass similarity: The collected hyperspectral dataset contains unusable noisy bands due to mistakes in the acquisition that result in information loss in terms of the unique identity, that is, the spectral signatures, and excessive intraclass variability. Furthermore, with poor resolution, each pixel covers a broad spatial region on the Earth's surface, generating spectral signature mixing and contributing to enhanced interclass similarity in border regions, thus creating inconsistencies and uncertainties for the employed classification algorithms [19].
(d) Limitation of available training samples and insufficient labeled data: Aerial spectrometers cover significantly smaller areas, so they can only collect a limited amount of hyperspectral data. This restricts the number of training samples for classification models [20]. In addition, HSIs typically contain classes that correspond to a single scene, and the learning procedures of available classification models require labeled data. However, labeling each pixel requires human skill, which is arduous and time-consuming [21].
(e) Lack of balance among interclass samples: Class imbalance problems, where each class sample has a wide range of occurrences, diminish the usefulness of many existing algorithms in terms of enhancing minority-class accuracy without compromising majority-class accuracy, which is a difficult task in and of itself [22].
(f) The higher dimensionality: Due to incorporating more information in multiple channels, such high-band pictures increase estimation errors. The curse of dimensionality is a significant drawback for supervised classification algorithms, as it significantly impacts their performance and accuracy [23].

The possible solutions to the above limitations that also represent the possible operations that are performed to analyze and comprehend the HSIs can be (1) technological advancement to make versatile and robust hardware for the spectrometers to capture the scenes more accurately, (2) spectral unmixing and resolution enhancement for better feature extraction and distinguishing capability of the embedded objects, (3) image compression-restoration and dimensionality reduction for addressing the high-dimensions and lack of data, and (4) use of robust classifiers that are capable of dealing with the above issues as well as promote fast computation ability [7].

These hurdles were very prominent for methods that classify HSI based on explicit feature extraction from the images. After ML/DL came onto the scene, operations on HSI became far easier, as explicit feature extraction is not needed, and these methods also offer advantages such as good handling of noise and favorable time complexity. However, despite many positive aspects, ML/DL has a few drawbacks under specific criteria [19], including parameter tuning and numerous local-minima problems in the training and compression procedures [20], as well as overfitting, optimization, and convergence problems.

3. Research Methodology

This section is divided into three categories that will assist in understanding the review procedure and its ambition.

3.1. Planning of the Review

Three systematic steps comprise the planning behind our work. First, based on efficacy and frequency of applicability to classifying HSIs, the seven most recently used ML techniques have been chosen in this article for review, which establishes their operational relationship and compatibility with the issue of categorizing the land covers of a particular scene captured as an HSI. Second, this relationship reveals all the shortfalls and benefits of those methods and their potential possibilities. Finally, we identified the limitations of our present review work and how to rectify them in the future.

3.2. Conducting the Review

The entire review work has been conducted in the following steps:
(a) Collection of literature: The literature studies have been collected based on the keywords "Hyperspectral image classification," "Machine learning techniques," and "Deep learning techniques," from the most relevant search engine, that is, Google (Google Scholar), which provides the scholarly articles for the concerned topic. These literature studies include Web of Science (WoS), Scopus, SCI, and SCIE-indexed and SCIE-related articles, both journals and conferences. Several methods are utilized throughout the literature to assist the classification of hyperspectral data, of which ML techniques seem to be the most convenient and promising.
(b) Screening: The collected research papers represent raw data, sorted categorically according to the chronological order of the ML techniques used over the periods. The screening was accomplished based on the following constraints:
(i) Time period: Studies published in the range 2010–2021 are included in this work. Studies published before 2010 are not included.
(ii) Methodology: Studies on HSI's analytical operations (denoising, spectral unmixing, etc.) other than classifying the underlying land covers are rejected.
(iii) Type: Studies that deal with the hyperspectral images of a particular land scene are considered, discarding medical hyperspectral imagery, water reservoirs, etc.
(iv) Design of study: Studies comprising experimental outcomes and the elaboration of the models are accepted; other literary-based articles or review papers are used only for primary knowledge gain.
(v) Language used: Only studies written in the English language are considered.
Figure 1 represents the total number of literary studies screened individually for each category of chosen ML techniques in the form of pie charts with a percent-wise pattern. Figure 2 is a standard graphical depiction of the number of most recent articles that we screened for each chosen ML-based method in the period ranging from 2015 to 2021.
(c) Selection: Out of all the papers screened based on the abovementioned criteria, the few most eligible are handpicked. The selection has been made on specific parameters: the modeling strategy and algorithm, its suitability for the modern technological scenario, and the reported corresponding overall accuracy (COA) for each dataset used, with preference given to journals with a good citation index.
(d) Analysis and inference: The selected papers are thoroughly reviewed to determine their contributions, restrictions, and future propositions. Based on this analysis, deductions are drawn to show the pathway for further research.

3.3. Research Investigations (RI)

The analysis raises the following queries:
RI 1: What is the significance of traditional ML and DL for analyzing HSI?
RI 2: How is ML/DL more impactful on HSI than other non-ML strategies?
RI 3: What are the advantages and challenges faced by the researchers for the chosen ML/DL-based algorithm for HSI classification?
RI 4: What are the emerging literary works of ML/DL on HSI classification in the year 2021?
RI 5: How are ML- and DL-based hybrid techniques helping scientists in HSI classification?
RI 6: What are the latest emerging techniques associated with addressing classifying HSIs?

3.4. Datasets

The HSI datasets are pre-refined and made publicly available for download and processing. Six datasets are described here in a concise manner:
(i) AVIRIS Indian Pines: This dataset was taken by the airborne visible infrared imaging spectrometer (AVIRIS) sensor on June 12, 1992. The scene captured is the Indian Pines test site in North-Western Indiana, USA, and contains an agricultural area exemplified by crops of regular geometry and some irregular forest zones. It consists of 145 ∗ 145 pixels with a spectral resolution of 10 nm, a spatial resolution of 20 mpp, and 224 spectral reflectance bands in the wavelength range 0.4–2.5 μm, out of which 24 noisy bands are removed due to a low signal-to-noise ratio. The scene contains 16 different classes of land covers.
(ii) Salinas Valley: This scene was obtained by the AVIRIS sensor over various agricultural fields of Salinas Valley, California, USA, in 1998. The scene is characterized by a high spatial resolution of 3.7 mpp and a spectral resolution of 10 nm. The area is covered by 512 ∗ 217 spectral samples with a wavelength range of 0.4–2.5 μm. Out of 224 reflectance bands, 20 noisy bands are discarded due to water absorption coverage. The scene comprises 16 different land classes.
(iii) Pavia Center: This scene was captured by a reflective optics system imaging spectrometer (ROSIS-03) sensor during a flight campaign over Pavia, northern Italy. It possesses 115 spectral bands, out of which only 102 are useful. Its spectral coverage is 0.43–0.86 μm, with a spectral resolution of 4 nm and a spatial resolution of 1.3 mpp, defined by 1096 ∗ 1096 pixels. There are 9 different land cover classes in the area.
(iv) Pavia University: This scene was also captured by the same sensor at the same time as Pavia Center, over the University of Pavia in 2001. It has the same structural features as Pavia Center, differing only in that 103 of the 115 bands, with a size of 610 ∗ 340, are kept after discarding 12 noisy bands. The scene contains 9 classes with urban environmental constructions.
(v) Kennedy Space Center: This scene was acquired by the NASA AVIRIS sensor over Kennedy Space Center, Florida, USA, on March 23, 1996. It was taken from an altitude of approximately 20 kilometres, having a spatial resolution of 18 metres and a spectral resolution of 10 nm. The wavelength range of the scene is 0.4–2.5 μm with a spatial size of 512 ∗ 614 pixels; 24 of 48 bands were removed for a low signal-to-noise ratio. The ground contains 13 classes predefined by the center personnel.
(vi) Botswana: The scene was obtained by the Hyperion sensor placed on the NASA EO-1 satellite over the Okavango Delta, Botswana, southern Africa, on May 31, 2001. It has a spatial resolution of 30 metres and a spectral resolution of 10 nm while taken at an altitude of 7.7 kilometres. Out of 242 bands containing 1476 ∗ 256 pixels, with a wavelength range of 400–2500 nm, 97 bands are considered to be water-corrupted and noisy; hence, the 145 remaining are useful. The scene comprises 14 land cover classes.
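For readers who wish to reproduce the surveyed experiments, the snippet below is a minimal sketch of loading one of these public scenes (Indian Pines) in Python; the file names and .mat keys are assumed to follow the commonly distributed copies of the dataset and may differ in other versions.

```python
# Minimal loading sketch for the Indian Pines scene; file names and .mat keys are
# assumptions matching the commonly distributed copies of the public dataset.
import numpy as np
from scipy.io import loadmat

cube = loadmat("Indian_pines_corrected.mat")["indian_pines_corrected"]  # (145, 145, 200)
gt = loadmat("Indian_pines_gt.mat")["indian_pines_gt"]                  # (145, 145)

# Flatten to a pixel-by-band matrix and keep only labeled pixels (label 0 = background).
X = cube.reshape(-1, cube.shape[-1]).astype(np.float32)
y = gt.reshape(-1)
X, y = X[y > 0], y[y > 0]
print(X.shape, np.unique(y))   # pixels x bands, and the 16 land-cover classes
```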

4. Machine Learning-Based Techniques for HSI Classification

ML technologies are not only intelligent and cognitive, but their accuracy is also skyrocketing due to their embedded mechanical abilities such as extraction, selection, and reduction of joint spatial-spectral features as well as contextual ones [24–26]. Moreover, the hidden dense layers with various allocated functions of the extensive networks work as intelligent learners by creating dictionaries or learning spaces to store deterministic information and then separate the landcover classes through their classification units [27–29]. The latest ML techniques that assist in classifying hyperspectral data, that is, SVM, SRC, ELM, MRF, AL, DL, and TL, are shown categorically in Figure 3 and are discussed hereafter in detail.

4.1. Support Vector Machine (SVM)

SVM is an innovative pattern-recognition technique rooted in the principle of statistical learning. The rudimentary concept of SVM-based training is to unravel the ideal linear hyperplane so that the predicted classification error is mitigated, be it for binary or multiclass purposes [30], as depicted in Figure 4. For linearly separable binary classification, let (xi, yi) be the standard set of linearly separable samples with xi ∈ RN and yi ∈ {−1, +1}. The general form of the linear decision function in n-dimensional space, with the classification hyperplane w ⋅ x + b = 0, is

f(x) = sgn(w ⋅ x + b),

where w is the weight (normal) vector of the hyperplane and b is its bias. A separating hyperplane with margin 2/||w|| in the canonical form must satisfy the following constraints:

yi(w ⋅ xi + b) ≥ 1, i = 1, …, N.

For nonlinearly separable and multiclass scenarios, we transform the data points to S, a possibly infinite-dimensional space, by a mapping function ψ defined, for example, as ψ(x) = (x1², x2², √2·x1x2) for x = (x1, x2). Linear operations performed in S resemble nonlinear processes in the original input space. Let K(xi, xj) = ψ(xi)Tψ(xj) be the kernel function, which remaps the inner products of the training dataset.

Constructing the SVM requires the values of the constants, that is, the Lagrange multipliers α = (α1, …, αN), so that

W(α) = Σi=1..N αi − (1/2) Σi=1..N Σj=1..N αi αj yi yj K(xi, xj)

is maximized with the constraints with respect to α:

Σi=1..N αi yi = 0 and 0 ≤ αi ≤ C, i = 1, …, N.

Because most αi are supposedly equal to zero, the samples corresponding to nonzero αi are the support vectors. In terms of the support vectors, the modified optimal classification function is

f(x) = sgn( Σi∈SV αi yi K(xi, x) + b ),

where SV denotes the set of support vectors.

The application of SVM for classifying HSI started two decades ago [31, 32]. Starting from the critical issue of applying binary SVMs [33], SVM evolved into fuzzy-based variants such as the fuzzy input-fuzzy output support vector machine (F2-SVM) [34] and was combined with dimensionality reduction and morphological details [35]. It was also coupled with particle swarm optimization (PSO) [36] and with wavelet analysis and semi-parametric estimation in the "wavelet SVM" (WSVM) classifier [37]. Table 1 summarizes the research carried out so far for the classification purpose of HSI using SVM.
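To illustrate how such an SVM baseline is typically set up, the following sketch trains a pixelwise RBF-kernel SVM with scikit-learn on the spectra loaded in Section 3.4; the training fraction and the C/gamma grid are illustrative choices, not the settings of any particular surveyed paper.

```python
# Pixelwise RBF-kernel SVM baseline; X (pixels x bands) and y (labels) are assumed
# to come from the loading sketch in Section 3.4. The hyperparameter grid is illustrative.
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.1, stratify=y, random_state=0)   # small training set, typical for HSI

model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = {"svc__C": [1, 10, 100], "svc__gamma": ["scale", 0.01, 0.1]}
search = GridSearchCV(model, grid, cv=3).fit(X_train, y_train)
print("overall accuracy:", search.score(X_test, y_test))
```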

4.2. Sparse Representation and Classification (SRC)

The sparse method depends on dictionary learning, which updates the parameter values based on the current training observations while accumulating the knowledge of previous observations as a prior. It then generates the sparse coefficient vector using sparse coding. This method is supremely efficient as it embeds dictionary learning to extract the rich features embedded inside the HSI dataset. SR can classify images pixelwise by representing the patches around each pixel with a linear combination of several elements taken from the dictionary. The generalization of SRC called multiple SRC (mSRC) has three chief parameters: patch size, sparsity level, and dictionary size. Dictionary learning is the first step of SRC, using the K-SVD algorithm. Let Y = [y1, y2, …, yN] be a matrix of L2-normalized training samples yi ∈ Rm [45–47].

The dictionary is learned from the patches around the pixels by solving

min over D, B of ||Y − DB||F², subject to ||bi||0 ≤ S for every i,

where D ∈ Rm×n is the learned overcomplete dictionary with n > m atoms, B = [b1, b2, …, bN] represents the matrix of corresponding sparse coding vectors bi ∈ Rn, and ||⋅||F is the Frobenius norm. The sparsity level S limits the number of nonzero coefficients in each bi. The next step, sparse coding, is provided with the dictionary D and represents y as a linear combination y = Db, where b is sparse. For the final classification step, suppose that for each class j ∈ {1, …, M} of an image, a dictionary Dj is trained. Then, the classification of a new patch ytest is achieved by estimating a representation error Ej = ||ytest − Dj bj||2 for each class. The class assignment rule [47] is calculated through a pseudoprobability measure P(Cj) for each class error Ej, and ytest is assigned to the class with the largest P(Cj), that is, the smallest representation error.

mSRC obtains the residuals of disjoint sparse representations of ytest for all classes j. Each dictionary Dj is updated by eliminating the nonzero atoms selected in each of the k iterations, and ytest is assigned to the class with the smallest accumulated residual over Q total iterations.

Sparse representation is an essential and efficient machine-dependent method in many areas, including denoising, restoration, target identification, recognition, and monitoring. It may grow even more vital when associated with logistic regression, adaptivity, and super-pixels to extricate the joint features globally and locally. SR has a very high potential of being associated with methods such as PCA, ICA, Markov random fields, conditional random fields, extreme learning machines, and DL methods such as CNN and graphical convolutional network. Table 2 gives a summary of the research performed so far for the classification purpose of HSI employing SRC.
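To make the residual-based decision rule above concrete, the sketch below implements a simplified SRC classifier in which each class dictionary is simply the set of normalized training spectra of that class and the sparse code is obtained with orthogonal matching pursuit; full K-SVD dictionary learning and the mSRC iterations are omitted, and the X_train/y_train split from the earlier SVM sketch is assumed.

```python
# Simplified SRC: assign a test spectrum to the class whose dictionary reconstructs it with
# the smallest residual. Dictionaries are raw l2-normalized training spectra (no K-SVD).
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def src_predict(x_test, class_dicts, sparsity=5):
    """x_test: (bands,) spectrum; class_dicts: {label: D_j of shape (bands, atoms)}."""
    errors = {}
    for label, D in class_dicts.items():
        omp = OrthogonalMatchingPursuit(n_nonzero_coefs=sparsity, fit_intercept=False)
        omp.fit(D, x_test)                      # solve x ~ D b with ||b||_0 <= sparsity
        errors[label] = np.linalg.norm(x_test - D @ omp.coef_)
    return min(errors, key=errors.get)          # argmin_j ||x - D_j b_j||_2

# Per-class dictionaries built from normalized training pixels (split assumed from above).
class_dicts = {}
for c in np.unique(y_train):
    D = X_train[y_train == c].T
    class_dicts[c] = D / np.linalg.norm(D, axis=0, keepdims=True)

pred = src_predict(X_test[0] / np.linalg.norm(X_test[0]), class_dicts)
```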

4.3. Markov Random Field (MRF)

MRF describes a set of random variables satisfying the Markov property, depicted by an undirected graph. It is similar to the Bayesian network but, unlike it, undirected and cyclic. An MRF is represented as a graphical model of a joint probability distribution, as defined in Figure 5. The undirected graph of an MRF is G = (V, E), in which V is the set of nodes representing the random variables and E is the set of edges encoding the dependencies between them.

Based on the Markov properties [57], the neighborhood set Nc of a node c is defined as

Nc = {v ∈ V : (c, v) ∈ E}.

The conditional probability of Yc depends only on its neighborhood, and these local conditionals decide the joint distribution of Y as

P(Yc | YV∖{c}) = P(Yc | YNc).

To complete the construction, the graph G admits a Gibbs distribution over the maximal cliques C in G:

P(Y) = (1/Z) ∏c∈C ψc(Yc),

where Z is the partition function. Therefore, equation (11) can be rewritten as

P(Y) = (1/Z) exp(−U(Y)/T),

where T is the temperature, whose value is generally 1, and U(Y) represents the energy.

Markov models depict a stochastic process represented by an undirected graph and have the acute advantage that upcoming future states do not depend on all past states, which suits a randomly variable dataset such as HSIs. The variants of Markov random fields are adaptive, hierarchical, cascaded, and probabilistic, blended with the Gaussian mixture model, joint sparse representation, transfer learning, etc., whose outcomes are quite successful. Hidden Markov random fields are highly suitable for the unsupervised classification of HSIs, where the model parameters are estimated to make each pixel belong to its appropriate cluster [58], leading to precise classification. Table 3 lists out the research carried out so far for the classification purpose of HSI employing MRF.
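As a toy illustration of how an MRF prior is typically exploited for HSI maps, the sketch below applies iterated conditional modes (ICM) to smooth the per-pixel class probabilities produced by any spectral classifier; the Potts-style pairwise penalty and the value of beta are illustrative assumptions rather than the formulation of any specific surveyed paper.

```python
# MRF-style spatial regularization by ICM: minimize a unary term (negative log-probability)
# plus a Potts pairwise term that penalizes label disagreement with the 4-neighborhood.
import numpy as np

def icm_smooth(prob_map, beta=1.0, n_iter=5):
    """prob_map: (H, W, n_classes) class probabilities; returns a smoothed label map (H, W)."""
    H, W, K = prob_map.shape
    unary = -np.log(prob_map + 1e-12)
    labels = prob_map.argmax(axis=-1)
    for _ in range(n_iter):
        for i in range(H):
            for j in range(W):
                neigh = [labels[a, b] for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                         if 0 <= a < H and 0 <= b < W]
                # energy of assigning class k = unary cost + beta * (# disagreeing neighbors)
                energy = unary[i, j] + beta * np.array(
                    [sum(n != k for n in neigh) for k in range(K)])
                labels[i, j] = int(energy.argmin())
    return labels
```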

4.4. Extreme Learning Machine (ELM)

An efficacious learning algorithm based on the single hidden layer feedforward neural network (SLFNN), ELM is applied to pattern classification and regression. Let (xi, pi) ∈ Rn × Rm be N arbitrary distinct samples, where xi = [xi1, …, xin]T ∈ Rn and pi = [pi1, …, pim]T ∈ Rm [72]. The standard SLFNN having Ñ hidden nodes and activation function f(x) is modeled mathematically as

Σi=1..Ñ αi f(wi ⋅ xj + bi) = oj,  j = 1, …, N.

Here, wi = [wi1, …, win]T is the weight vector connecting the input nodes to the ith hidden node, αi = [αi1, …, αim]T represents the weight vector connecting the ith hidden node to the output nodes, bi is the bias of the ith hidden node, and wi ⋅ xj represents the inner product. The zero-error condition for the N samples can be written in the matrix form Aα = P, where A(w1, …, wÑ, b1, …, bÑ, x1, …, xN) is the hidden-layer output matrix of the neural network; the ith column of A is the ith hidden node's output with respect to the inputs x1, …, xN. The training of the SLFNN is based on finding specific α̂, ŵi, and b̂i (i = 1, …, Ñ) [73] such that

||A(ŵ1, …, ŵÑ, b̂1, …, b̂Ñ)α̂ − P|| = min over wi, bi, α of ||A(w1, …, wÑ, b1, …, bÑ)α − P||.

This equation denotes the cost function to be minimized. By using gradient-based algorithms, the set of weights (αi, wi) and biases bi, collected in a parameter vector W, is tuned over epochs as

Wk = Wk−1 − η ∂E(W)/∂W.

The learning rate η must be set accurately for better convergence, and Ñ << N is required for better generalization performance.

Extreme learning machines were proposed to overcome the disadvantages of the single hidden layer feedforward neural network and to improve learning ability and generalization performance. ELM is a supervised method, but extensions to its semi-supervised and unsupervised versions are highly recommended for dealing with huge amounts of data such as HSIs, which are primarily unlabeled and suffer from a lack of training samples. Great potential lies with its other variants beyond those mentioned here [74], such as the two-hidden-layer ELM, multilayer ELM, feature-mapping-based ELM, incremental ELM, and deep ELM, to achieve superior precision in classifying HSIs. Table 4 underneath provides the summary of the research executed so far for the classification purpose of HSI utilizing ELM.
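The following NumPy sketch mirrors the ELM formulation above: the input weights wi and biases bi are drawn randomly and left fixed, and only the output weights α are obtained in closed form from the pseudo-inverse of the hidden-layer output matrix A rather than by gradient descent; the hidden-layer size and the sigmoid activation are illustrative choices.

```python
# Minimal ELM classifier: random fixed hidden layer, output weights via pseudo-inverse.
import numpy as np

class SimpleELM:
    def __init__(self, n_hidden=500, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))        # sigmoid activation f

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        P = (y[:, None] == self.classes_[None, :]).astype(float)   # one-hot targets
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden)) # random input weights w_i
        self.b = self.rng.normal(size=self.n_hidden)               # random biases b_i
        self.alpha = np.linalg.pinv(self._hidden(X)) @ P           # alpha = A^+ P
        return self

    def predict(self, X):
        return self.classes_[(self._hidden(X) @ self.alpha).argmax(axis=1)]

# Usage on a standardized train/test split (e.g., the one from the earlier SVM sketch):
# preds = SimpleELM(n_hidden=500).fit(X_train, y_train).predict(X_test)
```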

4.5. Active Learning (AL)

It is a special type of supervised ML approach that builds a high-performance classifier while minimizing the size of the training dataset by actively selecting the most valuable data points. The general structure of AL can be understood from Figure 6. There are three categories of AL: stream-based selective sampling, where each unlabeled instance is examined one at a time to decide whether or not to query its label; pool-based sampling, where the whole dataset is under consideration before selecting the best set of queries; and membership query synthesis, which involves data augmentation to create user-selected labeling. The decision to select the most informative data points depends on the uncertainty measure used in the selection. In an active learning scenario, the most informative data points are those the classifier is least sure about. The uncertainty measures for a data point x [88] are as follows:
Least Confidence (LC): selects the data point whose most likely class the classifier is least certain about. With y as the most likely label sequence and φ as the learning model, LC is represented as
LC(x) = 1 − P(y | x; φ).
Smallest Margin Uncertainty (SMU): the difference between the classification probability of the most likely class (y1∗) and that of the second-best class (y2∗), written mathematically as
SMU(x) = P(y1∗ | x; φ) − P(y2∗ | x; φ).
Largest Margin Uncertainty (LMU): the difference between the classification probability of the most likely class (y1∗) and that of the least likely class (ymin), written mathematically as
LMU(x) = P(y1∗ | x; φ) − P(ymin | x; φ).
Sequence Entropy (SE): detects the measure of disorder in a system; a higher entropy implies a more disordered condition. The denotation of SE is
SE(x) = −Σŷ P(ŷ | x; φ) log P(ŷ | x; φ),
with ŷ ranging over all possible label sequences for input x.

Although not considered customary and coherent, AL is quite capable of reducing human effort, time, and processing cost for a large batch of unlabeled data. This method relies on prioritizing the data that need to be labeled in a huge pool of unlabeled data to have the highest impact on training. A desired supervised model keeps being trained through active queries, improving itself to predict the class of each remaining data point. AL is advantageous for its dynamic and incremental approach to training the model so that it learns the most suitable label for each data cluster [89]. Table 5 lists out the research performed so far for the classification purpose of HSI using AL.
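A schematic pool-based loop with the least-confidence criterion is sketched below; the base classifier (logistic regression), the seed-set size, the batch size, and the number of rounds are placeholders, the oracle is simulated by the known ground truth, and the X_train/y_train split from the earlier SVM sketch is assumed.

```python
# Pool-based active learning with least-confidence sampling around a probabilistic classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
labeled = rng.choice(len(X_train), size=50, replace=False)          # small labeled seed set
pool = np.setdiff1d(np.arange(len(X_train)), labeled)

for _ in range(10):                                                 # query rounds
    clf = LogisticRegression(max_iter=1000).fit(X_train[labeled], y_train[labeled])
    proba = clf.predict_proba(X_train[pool])
    lc = 1.0 - proba.max(axis=1)                                    # LC(x) = 1 - P(y* | x)
    query = pool[np.argsort(lc)[-20:]]                              # 20 most uncertain pixels
    labeled = np.concatenate([labeled, query])                      # "oracle" labels them
    pool = np.setdiff1d(pool, query)

clf = LogisticRegression(max_iter=1000).fit(X_train[labeled], y_train[labeled])
print("accuracy after active learning:", clf.score(X_test, y_test))
```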

4.6. Deep Learning (DL)

Deep learning is the most renowned ML technology in terms of application and accuracy. Although it is considered the next tread of ML, it also borrows concepts from artificial intelligence. DL is the mother of algorithms that resemble human brain simulations, that is, creativity, enhanced analysis, and proper decision-making, based on pure or hybrid large networks for any given real-life problem. It has enhanced the throughput of computer-based tasks, especially unsupervised ones, in practical technology-based applications such as automated machine translation, image reconstruction and classification, computer vision, and automated analysis [104]. The basic structure of any DL model possesses a three-type-layered architecture: it contains one input layer through which input data are fed to the next layer(s), known as the intermediate hidden layer(s), responsible for all the computations based on the problem given, which pass their generated data to the final layer, that is, the output layer, which provides the desired ultimate output. The steps involved in DL models are as follows: having proper knowledge and understanding of the problem, collecting the input database, selecting the most appropriate algorithm, training the model with the sample source database, and finally testing on the target database [105].

DL models are more efficient and advantageous than other ML models for the following reasons [19]:
(1) Their capability to extract hidden and complicated structures from raw data is inextricably linked to their ability to form internal representations and generalize any form of knowledge.
(2) They can accommodate a wide range of data types, for example, 2D imagery data and complex 3D data such as medical imagery and remote sensing. In addition, they can use HSI data's spectral and spatial domains in both standalone and linked ways [106–108].
(3) They provide architects a lot of versatility in terms of layer types, blocks, units, and depth.
(4) Furthermore, their learning approach can be tailored to various learning strategies, from unsupervised to supervised, with intermediate strategies.
(5) Additionally, developments in processing techniques, including batch partitioning and high-performance computation, especially on distributed and parallel architectures, have enabled DL models to find better opportunities and solutions when coping with enormous volumes of data [109].

The models that are broadly used for HSI classification are described as follows.

(a) Autoencoder (AE): AEs are the fundamental unsupervised deep model based on the backpropagation rule. AEs consist of two fragments: the encoder, connecting the input vector to the hidden layer by a weight matrix, and the decoder, which reconstructs the input from the hidden-layer output via a specific weight matrix. SAEs are AEs with multiple hidden layers, where the output of every hidden layer is fed to the successive hidden layer as input. Training comprises three steps: (1) the first AE is trained to fetch the learned feature vector; (2) the former layer's feature vector is taken as input to the next layer, and this process is repeated until the completion of training; (3) after all the hidden layers have been trained, backpropagation is used to reduce the cost function, and the weights are updated with a labeled training set to obtain fine-tuning [110]. The architecture of SAE is depicted in Figure 7.
Let xn ∈ Rm, n = 1, 2, …, N, represent the unlabeled input dataset, En be the hidden encoder vector computed from xn, and yn be the decoder vector of the output layer [111]. Then
En = g(Wi xn + bi), with g the encoding function, Wi the encoder weight matrix, and bi the encoder bias vector;
yn = f(Wj En + bj), with f the decoding function, Wj the decoder weight matrix, and bj the decoder bias vector.
The reconstruction error in SAE is denoted as
E(x, y) = (1/N) Σn=1..N ||xn − yn||².
AEs are unsupervised neural networks that embed several convolutional hidden layers based on nonlinear activation functions and transformations [112]. There are high risks of data loss during training, but specialized training handles the model well for specific data types. There are AEs for every purpose, such as convolutional, sparse, variational, deep, contractive, and denoising, applied to data compression, noise removal, feature extraction, image augmentation, and image coloring. AE inevitably provides a vast platform for further research on its various applications and its capability to participate in hybridization. Table 6 describes a few research works on AEs.
(b) Convolutional Neural Network (CNN): A famous deep neural network that works like the human visual cortex, with many interconnected layers, applied widely in image, speech, and signal processing. It assigns learnable and modifiable weights and biases to the input image to identify various objects or patterns with differentiable features. As shown in Figure 8, each layer of a CNN possesses filtering capabilities of ascending complexity: the first layer learns to filter corners and edges, intermediate layers learn to filter object parts, and the last layer learns to filter out the entire object in different locations and shapes. The comparison between the layers in terms of several parameters is shown in Table 7. A CNN consists of four layer types [117, 118]:
(1) Convolution: This operation gives CNN its name, that is, a dot product of the original pixel values with the weights identified in the filter or kernel of the image. The findings are compiled into one number representing all the pixels covered by the filter. Assume I is the hyper-input-cube of dimension p × q × r, where p × q denotes the spatial size of I and r the number of bands, and ik is the kth feature map of I. Let d filters be present in each convolutional layer, and let the weight Wm and bias bm represent the mth filter. The mth convolutional layer output with transformation function h is denoted as
om = h(Σk ik ∗ Wm + bm),
where ∗ denotes the convolution operation.
(2) Activation: The convolution layer produces a matrix significantly smaller than the actual image. The matrix is passed through an activation layer (generally a rectified linear unit, aka ReLU), adding nonlinearity that enables the network to train itself through backpropagation.
(3) Pooling: The method of further downsampling and reduction of the matrix size. A filter is applied over the results obtained by the previous layer and chooses one number from each set of values (generally the maximum, in max-pooling), which allows the network to train much more quickly, concentrating on the most valuable information in each image feature. For an m × m square window neighborhood S with N elements and activation value zij at location (i, j), the average pooling is formulated as
p = (1/N) Σ(i,j)∈S zij.
(4) Fully Connected (FC): A typical multilayer perceptron structure. Its input is a single-dimensional vector representing the output of the layers above; its output is a probability list for the various possible labels attached to the image, and the classification decision is the label that receives the highest likelihood. With transformation function h, for N samples of inputs X″ with outputs Y″, weight matrix W, and bias constant b, it is represented as
Y″ = h(W X″ + b).
CNN is the most in-demand and widely explored model among all DL models. The functional units of the convolutional layers are kernels, which specialize in extracting the most relevant and enriched spatial and spectral features from the given dataset through automated filtering by the convolution operation [119]; reference [119] provides an in-depth description of CNNs. The most popular variants are attention-based CNN, ResNet, CapsNet, LeNet, AlexNet, VGG, etc. Some of them are still unexplored in classifying HSI. The detailed research work on CNN for dealing with HSI classification is listed in Table 8.
(c) Recurrent Neural Network (RNN): A very efficient DL approach that follows a sequential framework with a definite timestamp t. "Recurrent" refers to performing the same task for each sequence element, with the output depending on the preceding computations. In other words, the network has a "memory" that enfolds information about the computations so far: the output of a particular recurrent neuron is fed back as input to the same node, which leads the network to efficiently predict the output. This is represented in Figure 9, where the RNN is unrolled, that is, the complete sequence of the entire network structure is shown neuron by neuron. It consists of the following steps:
(1) X = […, xt−1, xt, xt+1, …] is the input vector, where xt represents the input at timestamp t.
(2) ht is the "memory of the network," the hidden state at timestamp t. Preliminarily, h−1 is initialized to the zero vector to calculate the first hidden step. The current step ht is calculated based on the previous hidden step ht−1, formulated by [132]
ht = f(W [xt, ht−1]),
where f denotes a nonlinear function, that is, tanh or ReLU, and W is the weight matrix applied to the current input and the previous hidden state.
(3) Y = […, yt−1, yt, yt+1, …] is the output vector, where yt represents the output at timestamp t, generally obtained through a softmax function: yt = softmax(Q ht).
RNN is an efficient deep model with large potential. The recurrent looping structure enables it to store relevant information about the spatial-spectral relationships between the pixels and their neighbors. There are several RNN architectures based on inputs/outputs as stated in [133], and based on LSTM, there are five categories [134]. These variants can be well utilized in collaboration with other methods such as MRF and PCA to test their accuracy. The literature studies based on RNN are cataloged in Table 9.
(d) Deep Belief Network (DBN): DBNs are formed by greedily stacking and training restricted Boltzmann machines (RBMs), an unsupervised learning algorithm based on "contrastive divergence." As neural networks, RBMs take a probabilistic approach and are thus called stochastic neural networks. Each RBM is made of three parts: a visible unit (input layer), an invisible unit (hidden layer), and a bias unit. The general structure of a DBN is depicted in Figure 10.
For a DBN, the joint distribution of the input vector X with n hidden layers h1, …, hn is defined as [137]
P(X, h1, …, hn) = ( ∏i=1..n−1 P(hi−1 | hi) ) P(hn−1, hn),
where X = h0, P(hi−1 | hi) is the conditional distribution of the visible units given the hidden RBM units at level i, and P(hn−1, hn) is the hidden-visible joint distribution in the top-level RBM. A DBN has two phases: the pretraining phase stacks numerous layers of RBMs, and the fine-tuning phase is simply a feedforward NN.
DBN is a generative graphical representation; that is, it creates all the distinct outcomes that can be produced for a particular case and learns to disentangle a deep hierarchical depiction of the sample training data. DBNs are structurally more capable than RNNs as they lack loops, are pretrained in an unsupervised way, and are computationally eminent particularly for classification problems. Minor modifications or collaborations can improve DBNs functionally and in accuracy. Table 10 depicts a list of works done on DBN.
(e) Generative Adversarial Network (GAN): One of the most recent DL models, which is rapidly gaining ground in technical research. The GAN model is trained using two kinds of neural networks: the "generative network" or "generator," which learns to generate new viable samples, and the "discriminatory network" or "discriminator," which learns to discriminate generated instances from existing instances. Discriminative algorithms seek to classify the input data, which is given as a collection of certain features; the algorithm maps features to labels [140]. In contrast, generative algorithms attempt to construct the input data: given a set of features, they will not classify it but will attempt to create features that match a certain label. The generator tries to get better at deluding the discriminator during training, and the discriminator tries to catch the counterfeits produced by the generator; thus, the training procedure is termed adversarial training. The generator and discriminator should each be trained against a static opponent, keeping the discriminator constant while training the generator and keeping the generator constant while training the discriminator, which helps to understand the gradients better.

In a GAN model, let D and G denote the discriminator and the generator units, where G maps a noise data space θ to the real and original data space x. G(θ) denotes the fake output generated by G, and D(y) and D(G(θ)) are D's outputs for real and fake training samples, respectively. Pθ(θ) and Pd(y) represent the input noise distribution and the original data distribution, respectively, with θ ∼ Pθ [141], as shown in Figure 11.

Combining equations (28) and (29), the total loss over the entire dataset, represented by the min-max value function, is given by

min over G, max over D of V(D, G) = E y∼Pd(y)[log D(y)] + E θ∼Pθ(θ)[log(1 − D(G(θ)))].

GAN is a generative modeling neural network architecture based on the concept of adversarial training that utilizes a model to build new instances conceivably derived from an existing sample distribution. Hence, GANs are new favorites for classifying HSIs as they compensate for the lack-of-data problem and classify the data proficiently. There are several types of GANs: conditional GAN, vanilla GAN, and deep convolutional GAN (simple types); and Pix2Pix GAN, CycleGAN, StackGAN, and InfoGAN (complex types) [142]. These may be very useful for images like HSIs as they can deal with related issues. The research works based on the GAN are listed in Table 11.
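To make the adversarial min-max game above concrete, the PyTorch sketch below alternates discriminator and generator updates for 1-D spectra; the network sizes, learning rates, and the non-saturating generator loss are illustrative assumptions, and the GAN-based HSI classifiers in the surveyed papers typically add a classification head to the discriminator on top of this basic loop.

```python
# Basic adversarial training step for 1-D spectra: D scores real vs. fake, G maps noise to spectra.
import torch
import torch.nn as nn

n_bands, z_dim = 200, 32
G = nn.Sequential(nn.Linear(z_dim, 128), nn.ReLU(), nn.Linear(128, n_bands), nn.Tanh())
D = nn.Sequential(nn.Linear(n_bands, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real):                         # real: (batch, n_bands) spectra scaled to [-1, 1]
    batch = real.size(0)
    fake = G(torch.randn(batch, z_dim))

    # Discriminator step: maximize log D(y) + log(1 - D(G(theta))).
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(batch, 1)) + bce(D(fake.detach()), torch.zeros(batch, 1))
    loss_d.backward()
    opt_d.step()

    # Generator step: fool the discriminator (non-saturating form of the min-max objective).
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(batch, 1))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```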

4.7. Transfer Learning (TL)

It is the most current hot topic in interactive learning, and there is more of it yet to be explored. It is an approach where the information gained in one or more source tasks is transferred and used to enhance the learning of a similar target task. TL can be represented diagrammatically by Figure 12 and mathematically as follows:

Domain, D, is represented as {X, P(X)}, X = {x1, …, xn}, xi ∈ X; X denotes the feature space, and P(X) symbolizes the marginal probability of sample data point X [149].

Task T is depicted as {Y, P(Y|X)} = {Y, Φ}, Y = {y1, …, yn}, yi ∈ Y; Y is the label space, and Φ is the prognostic objective function, learned from (feature vector, label) pairs (xi, yi), xi ∈ X, yi ∈ Y, and calculated as the conditional probability.

Also, for every feature vector in D, Φ predicts its corresponding label as Φ(xi) = yi.

Let DS and DT be the source and target domains, and TS and TT be the source and target tasks, respectively, with DS ≠ DT and TS ≠ TT. TL aims to learn P(YT|XT), that is, the target conditional probability distribution in DT, with the knowledge obtained from DS and TS.

Traditional learning is segregated and based solely on particular tasks, datasets, and different independent models working on them. No information that could be transferred from one model to another is preserved. On the contrary, TL possesses the human-like capability of transferring knowledge; that is, knowledge can be leveraged from previously trained models to train new models, a process that is faster, more accurate, and requires a limited amount of training data. Table 12 represents brief details about the research works on transfer learning.
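As a schematic example of this knowledge transfer, the sketch below reuses a small spectral feature extractor assumed to be pretrained on a source scene (DS, TS), freezes it, and retrains only a new classification head on the few labeled pixels of the target scene (DT, TT); the layer sizes, class count, and checkpoint name are hypothetical placeholders.

```python
# Transfer-learning sketch: frozen pretrained feature extractor + new target classification head.
import torch
import torch.nn as nn

feature_extractor = nn.Sequential(            # assumed pretrained on the source domain D_S
    nn.Linear(200, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU())
# feature_extractor.load_state_dict(torch.load("source_pretrained.pt"))  # hypothetical checkpoint

for p in feature_extractor.parameters():      # keep the transferred knowledge fixed
    p.requires_grad = False

target_head = nn.Linear(128, 9)               # e.g., 9 land-cover classes in the Pavia scenes
model = nn.Sequential(feature_extractor, target_head)
optimizer = torch.optim.Adam(target_head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def finetune_step(x_target, y_target):        # one mini-batch of labeled target pixels
    optimizer.zero_grad()
    loss = criterion(model(x_target), y_target)
    loss.backward()
    optimizer.step()
    return loss.item()
```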

5. Discussion

Based on the reviewed articles, we can draw the desired inferences that provide answers to the investigative questions raised in Section 3.3 and show the clear motive and benefits of this review.

RI 1: What is the significance of traditional ML and DL for analyzing HSI?

Ans: Hyperspectral data have certain restrictions, as cited in Section 1. Statistical classifiers initially addressed them, but the operations and analysis became much easier and more accurate after the advent of ML/DL strategies in a machine-dependent way [155, 156]. The general advantages that ML/DL algorithms provide to researchers dealing with HSIs are as follows: (i) easy handling of high-dimensional data, that is, the troubles of the Hughes phenomenon are removed [115, 125]; (ii) equal ability to handle labeled and unlabeled samples [99, 150]; (iii) precise and meticulous choice of features [51, 127]; (iv) high-end, precise models to deal with real hypercubes, hence top-notch classification accuracy [119, 154]; (v) removal of overfitting, noise, and other hurdles to a much greater extent [120, 147]; (vi) embedded spatial-spectral feature extraction and selection units [119, 133]; (vii) mimicking of the human brain to solve multiclass problems [136, 138].

RI 2: How are ML/DL more impactful on HSI than other non-ML strategies?

Ans: The initial discovery of hyperspectral data suffered due to its limitations. In the preliminary research stage, scientists followed the traditional methodology for classifying HSIs, that is, preprocessing (if required), extraction and selection of discriminative characteristics, and then running a classifier on those features to identify the land cover groups. Hence, they emphasized feature extractor techniques such as PCA [9], ICA [10], and wavelets [13], assisted by some basic classifiers such as extended morphological profiles [2, 157], NN [158, 159], logistic regression [160], edge-preserving filters [10, 161], density functions/matrices [162], and the Bayes law of classification [163, 164]. These classic mathematics-oriented techniques, being simple in structure and design and easy to implement, were not enough to deal with such a huge amount of data as HSI. They also could not handle multiclass problems well enough, which is essential for a dataset like HSI, whose land covers belong to multiple classes of regions. Also, these methods were not accurate in feature selection and extraction or in dealing with the storage of such bulk data. These reasons made it a struggle for researchers to properly analyze, process, and classify HSIs. On the contrary, the advancements of ML/DL technologies have opened a broad gateway of research that researchers are still exploring and combining in different groupings to address the HSI classification problem in real life, dealing with the limitations mentioned above [26, 131]. A tabular depiction of the advantages and disadvantages of the ML and non-ML strategies applied for HSI classification is shown in Table 13.

RI 3: What are the advantages and challenges faced by the researchers for the chosen ML/DL-based algorithm for HSI classification?

Ans: We added the advantages and challenges of the ML- and DL-based techniques in Table 13.

RI 4: What are the emerging literary works of ML/DL on HSI classification in the year 2021?

Ans: Among the ongoing years, 2021 seems the most promising in terms of technical advancements for the problem concerned. New techniques, along with hybrid ones, are emerging to take the solution to a whole new level; their methodologies and accuracy are outlined here. Recent work on MRF with a band-weighted discrete spectral mixture model (MRF-BDSMM) in a Bayesian framework has been proposed in [165], an unsupervised adaptive approach to accommodate heterogeneous noise and find the abundant labeled subpixels to extract joint features. A collaboration of kernel-based ELM with PCA, local binary pattern (LBP), and the gray-wolf optimization algorithm (PLG) is proposed as a novel methodology; they help reduce huge dimensions, seek global and local spatial features, and optimize the KELM parameters to obtain the class labels [166]. A variant of SRC is proposed in [167], dual sparse representation graph-based collaborative propagation (DSRG-CP), which separates spatial and spectral dimensions with respective graphs to improve the labeling scheme for limited samples by collaborating the outcomes. AL has been one of the hot topics so far, as it integrates with a Fredholm kernel regularized model (AMKFL) that enables better labeling than manual ones, even for noisy images [168]. It ties in with DL through the augmentation of training samples to label the uncertain hypercubes accurately (ADL-UL) [169], facilitates iterative training sample augmentation by expanding the hypercubes and adding discriminative joint features (ITSA-AL-SS) [170], and extracts locally unique spatial multiscale characteristics from super-pixels (MSAL) [171]. A novel idea of attention-based CNNs is proposed in [172, 173]; the former (SSAtt-CNN) combines two attention subnetworks, spatial and spectral, with CNN as the base, and the latter (FADCNN) is a dense spectral-spatial CNN with a feedback attention technique that perfectly poses the band weights for better mining and utilization of dominant features. GAN is one of the most exploited methods to date, and [174] proposes the full utilization of shallow features from the unlabeled bands through a multitasking network (MTGAN); in [175], the discriminator is based upon a capsule network and convolutional long short-term memory to extract less visible features and integrate them to build high-profile contextual characteristics (CCAPS-GAN); 1D and 2D CapsGAN together form a dual-channel spectral-spatial fusion capsule GAN (DcCaps-GAN), shown in [176]; and generative adversarial minority oversampling for 3D-hypercubes (3D-HyperGAMO) is depicted in [177], which focuses on the minor class features, using existing ones to label and classify them properly.

RI 5: How are ML- and DL-based hybrid techniques helping scientists in HSI classification?

Ans: Since the dawn of HSIs, their analysis and information extraction have faced many hurdles. The large number of highly correlated bands and the rich spatial-spectral signatures imprinted by the electromagnetic spectrum embedded in them have always been a major point of attention. Thus, finding an appropriate technology for the classification of such interconnected, feature-dense, high-dimensional images is a very tedious and strenuous matter. The classification methods chosen so far have mostly been either supervised, which require a sufficient number of quality labeled samples, or unsupervised, in which the lack of coherence between the spectral clusters and the target regions causes a failure to obtain the desired accuracy. A semi-supervised method, as a combination of supervised and unsupervised methods and named the hybrid method, is needed to overcome such problems. A hybrid method is always advantageous in robustness and flexibility towards high-dimensional data.

The hybrid methods have the following benefits:
(i) They are specifically designed to overcome the limitations and take advantage of the methodologies involved in the concerned hybrid to achieve a deep, rich, and insightful conclusion (general).
(ii) They address and resolve multiple issues regarding handling and analyzing the HSI data at a time, depending upon the methods chosen for mixing/hybridizing [179–183].
(iii) Coherence in time, space, and cost complexities [184–186].
(iv) Better interpretability, quality, and effectivity, leading to the construction of a more refined framework [180, 182, 183, 187–194].
(v) Deterministic spectral, spatial, and contextual feature extraction, reduction, and selection, combined to achieve the desired accuracy and performance [182, 183, 187, 188, 195–197].

ML, being a standard versatile technology, can merge with traditional techniques like PCA to its benefit. As stated in [195, 198], PCA is exploited at its best for feature extraction, selection, and reduction to achieve higher accuracy and performance quality. PCA is one of the best preprocessing methods considered to date for improved spectral dimension reduction [180], proper selection of spectral bands and their multiscale features in a segmented format [181, 199], noise-reduced spectral analysis [27], and feature extraction [130, 196]. PCA, in collaboration with SVM [195, 200], with DL for feature reduction and better classification [182, 183], with CNN for multiscale feature extraction [188, 189], and with sparse tensor technology [190], has been highly appreciated as substantial research. All these recent collaborations, along with the merging of ICA-DCT with CNN cited in [191], are evidence that although PCA is categorized as a traditional method, it is supremely relevant for its significant usefulness in handling HSIs.

Some other hybridizations have also been explored by researchers, such as SRC with a mathematical index of divergence-correlation [192], the Gabor-cube filter [193], and ELM [83, 85]; ELM with CNN [86] and TL [26]; AL based on super-pixel profiles [201, 202]; AL with CNN [203], CapsNet [204], CNN [204, 205], and TL [151, 184]; CNN with attention-aided methodology [172, 173, 185] and GAN [186]; GAN with a dynamic neighborhood majority voting mechanism [194, 197] and CapsNet [175, 176, 206, 207]; and TL with MRF [70]. These articles report highly robust performance and a clear mitigation of the computational complexity imposed by raw HSI data, building strong and enhanced models that achieve higher accuracy than ever.

RQ 6: What are the latest emerging techniques for classifying HSIs?

Ans: The following are the most recent research studies that have opened a new path for this purpose:

(i) DSVM: This latest and novel concept incorporates DL facilities into the traditional kernel SVM. It combines four deep layers of kernels, namely the exponential and Gaussian radial basis functions (ERBF and GRBF), neural, and polynomial kernels, with SVMs acting as the hidden-layer units [208]. This approach has outperformed several efficient DL methods, with nearly 100% accuracy on the IP and UP datasets.

(ii) Conditional Random Fields (CRFs): These are a structured generalization of multinomial logistic regression in the form of graphical models; based on an a priori continuity assumption, neighboring pixels with analogous spectral signatures are taken to share the same labels. They extensively explore hidden spectral-contextual information. In [146], a CRF is incorporated with a semi-supervised GAN whose trained discriminator produces softmax predictions that are guided by dense CRF graph constraints to improve HSI classification maps. A collaboration between 3D-CNN and CRF has been proposed in [209] to build a deep CRF capable of extracting semantic correlations between patches of hypercubes through the CNN's unary and pairwise potential functions. A semi-supervised approach is depicted in [210], embedding subspace learning and a 3D convolutional autoencoder to remove redundancy in joint features and obtain class sets using an iterative algorithm. In [211], a CRF with Gaussian edge potentials, associated with deep metric learning (DML), classifies HSI data pixelwise using the geographical distances between pixels and the Euclidean distances between features. A novel framework using an HSI feature-learning network (HSINet) with a CRF is proposed in [212]: a trainable end-to-end DL model with backpropagation that extracts joint features, edges, and colors at the subpixel, pixel, and super-pixel levels. In [213], a decision fusion model including CRF and MRF is built based on sparse unmixing and the outputs of soft classifiers.

(iii) Random Forest (RF): This is an efficient algorithm that ensembles regression and classification trees. It makes the HSI classification model noise-tolerant, inherently suited to multiclass problems, robust in parallelism, and fast. In [214], RF is compared with a DL algorithm, which outshone it in classification accuracy. A new framework of cascaded RF is shown in [215], which uses a boosting strategy to generate and train base classifiers and a hierarchical random subspace method to select features and suitable base classifiers based on the diversity of the features. A novel collaboration of semi-supervised learning, AL, and RF is featured in [216], where queries based on spatial information are fed to AL and the labeled samples are then classified by RF through semi-supervision. [217, 218] depict a deep cube CNN model that extracts pixelwise joint features, which are then classified by RF.

(iv) Graph Convolutional Network (GCN): A descendant of CNN, this structure is designed to generalize convolution to graph data. It consists of three steps: feature aggregation, feature transformation, and classification. Being adept at graphical modeling, it captures the spatial interrelations between classes particularly well (a minimal single-layer sketch follows this list). In [219], the distinct features collected from CNN and GCN are fused in additive, elementwise, and concatenated ways. A new framework of globally consistent GCN is introduced in [220], which first generates a spatial-spectral locally optimized graph whose global high-order neighbors obtain enriched contextual information by exploiting the graph's topologically consistent connectivity; those global features then determine the classes. [221] presents a dual-GCN network that works with a limited number of training samples, where the first network extracts all significant features and the second learns the label distribution. A novel deep attention GCN is introduced in [222], based on a similarity criterion mixing a kernel spectral angle mapper and spectral information divergence to group analogous spectra. [223] presents a collaboration between CNN and GCN to extract pixel- and super-pixelwise joint features by learning small-scale regular regions and large-scale irregular regions, respectively.
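As a deliberately simplified illustration of the three GCN steps named in item (iv), the following Python sketch implements a single graph convolution layer followed by a linear classifier over super-pixel spectra. The adjacency matrix, feature dimensions, and node construction are placeholder assumptions and are not drawn from the cited papers.

```python
# Minimal single-layer GCN sketch: feature aggregation over a graph of
# HSI super-pixels, feature transformation, and classification.
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One graph convolution: aggregate neighbor features, then transform them."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)   # feature transformation

    def forward(self, x, adj):
        # Add self-loops and symmetrically normalize the adjacency matrix.
        a_hat = adj + torch.eye(adj.size(0))
        d_inv_sqrt = torch.diag(a_hat.sum(dim=1).rsqrt())
        a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt
        return torch.relu(self.linear(a_norm @ x))  # aggregation + transformation

# Toy graph: 6 nodes (e.g., super-pixels) described by 200-band mean spectra.
n_nodes, n_bands, n_classes = 6, 200, 4
x = torch.rand(n_nodes, n_bands)                   # placeholder node features
adj = (torch.rand(n_nodes, n_nodes) > 0.5).float()
adj = ((adj + adj.T) > 0).float()                  # make the graph undirected

layer = SimpleGCNLayer(n_bands, 32)
classifier = nn.Linear(32, n_classes)              # classification step
logits = classifier(layer(x, adj))
print(logits.shape)                                # torch.Size([6, 4])
```

In the cited frameworks, the graph is typically built from spatial-spectral similarity between super-pixels and the layers are trained end-to-end; this sketch only shows the core aggregation-transformation-classification flow.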

6. Conclusion

This article has surveyed the various technologies and procedures used for HSI classification from its invention to date. As noted above, there are many barriers to dealing with high-band data such as HSI. Despite this, many researchers have taken an interest in the field over the last decade, improving existing techniques or inventing new ones. With the considerable improvement in technologies and the introduction of ML into HSI classification, results have become more accurate than those of traditional and contemporary state-of-the-art methodologies. As a result, DL has emerged as the most prominent tool for HSI classification over the last half of this decade. The more researchers have focused on this field, the more they have explored remote sensing and space imagery features.

This review article provides, for every method and its submethods, information about performance, research gaps, and achievements. In addition, it presents a novel research methodology that makes this work more distinctive than others. After going through each methodology in detail, the most significant inferences have been drawn, which add further novelty to our work. It also shows future researchers a path for choosing an appropriate technique and its alternatives, which elevates its creativity and uniqueness above other contemporary review works on this subject. It further details the most recent research on HSI classification and some currently developed techniques that may be acutely useful in future research. Our study holds uniqueness and novelty in several respects: (1) it includes research works carried out in the last decade, that is, 2010–2020, together with the most recent papers of the previous year, i.e., 2021, as mentioned in Section 3; (2) the number of papers referred to here is above 200, outnumbering other review papers; (3) the review is carried out by selecting the most appropriate papers solely dedicated to our subject of interest, that is, machine learning techniques serving the purpose of hyperspectral image classification, and the findings from those works are systematically arranged in tabular format (Tables 1–12); (4) the objective behind this review work is expressed by RQ 1–6, which also provide a clear view of the recent technological advances and applications that researchers are developing; (5) Table 14 provides an explicit idea of the pros and cons of each ML technique described in this manuscript when applied to classifying hyperspectral images, which will help researchers in their future work; and (6) researchers who wish to write a literature review can follow our proposed methodology, which depicts the flow of work in a methodical way [224].

7. Limitations of Present Work and Its Future Scope

The study has some limitations: (i) we have used relatively few keywords in the current research; (ii) we focused only on seven popular ML techniques; (iii) we explain the emerging methodologies only briefly; and (iv) the experimental details are not fully discussed.

As a future proposition, we would like to explore more keywords, more techniques, and more studies that offer a better understanding of other learning methods, both traditional and contemporary. In addition, there are several hybrid strategies, along with some more eminent and recent ML/DL techniques, that we look forward to exploring in both a qualitative and a quantitative manner.

Acronyms

HS:Hyperspectral
HSI:Hyperspectral image
GIS:Geographic Information System
PCA:Principal component analysis
ICA:Independent component analysis
SVM:Support vector machine
SR:Sparse representation
SRC:Sparse representation and classification
MRF:Markov random field
HMRF:Hidden Markov random field
ELM:Extreme learning machine
AL:Active learning
HU:University of Houston
TL:Transfer learning
DL:Deep learning
AE:Autoencoders
SAE:Stacked autoencoders
CNN:Convolutional neural network
RNN:Recurrent neural network
DBN:Deep belief network
GAN:Generative adversarial network
IP:Indian Pines
KSC:Kennedy Space Center
SV:Salinas Valley
UP:University of Pavia.

Data Availability

Publicly available data are used in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

Jana Shafi would like to thank the Deanship of Scientific Research, Prince Sattam bin Abdul Aziz University, for supporting this work. This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (Grant no. 2022R1C1C1004590).