A Comprehensive Survey of Imbalance Correction Techniques for Hyperspectral Data Classification

Land-cover classification is an important topic for remotely sensed hyperspectral (HS) data exploitation. In this regard, HS classifiers have to face important challenges, such as the high spectral redundancy, as well as noise, present in the data, and the fact that obtaining accurate labeled training data for supervised classification is expensive and time-consuming. As a result, the availability of large amounts of training samples, needed to alleviate the so-called Hughes phenomenon, is often unfeasible in practice. The class-imbalance problem, which results from the uneven distribution of labeled samples per class, is also a very challenging factor for HS classifiers. In this article, a comprehensive review of oversampling techniques is provided, which mitigate the aforementioned issues by generating new samples for the minority classes. More specifically, this article pursues a twofold objective. First, it reviews the most relevant oversampling methods that can be adopted according to the nature of HS data. Second, it provides a comprehensive experimental study and comparison, which are useful to derive practical conclusions about the performance of oversampling techniques in different HS image-based applications.

The source will be made publicly available at https://github.com/mhaut/imbalance-review.
Digital Object Identifier 10.1109/JSTARS.2023.3279506

In particular, HS image classification and segmentation are among the most active research domains within the remote sensing community, mainly because they provide accurate estimates of the earth's surface and land-cover predictions, which eventually allow modern society to deal with current and future technological challenges and needs [13], [14], [15]. HS imaging combines conventional spectroscopy techniques with digital imaging to collect detailed spectral and spatial information from an observation area, producing a large data cube ($X$) for each recorded scene. In this context, HS classification/segmentation consists of assigning a single classification label to each pixel (spectral vector) in the data [16]. That is, given the HS scene $X \in \mathbb{R}^{H \times W \times D}$, defined by its height ($H$) and width ($W$) together with the number of spectral bands ($D$), and the set of $C$ possible land-cover classes, the purpose is to obtain the data-label pair $\{x_i, y_i\}$ for every pixel of the scene ($i \in [1, N]$, with $N = WH$), where $y_i \in \{1, \ldots, C\}$ denotes the corresponding label. The improved performance of HS classifiers comes from the capacity of HS instruments to capture images using hundreds of narrow and contiguous spectral bands, which are recorded at different wavelengths of the electromagnetic spectrum, providing detailed spectral-spatial information of the target scene. For instance, one of the most popular sensors within the remote sensing community is the airborne visible/infrared imaging spectrometer (AVIRIS) [17], which collects 224 bands in the spectral range 400-2500 nm with 20 m of spatial resolution. Other popular HS instruments are the reflective optics system imaging spectrometer (ROSIS) [18] and the compact airborne spectrographic imager (CASI) [19], which have also been used to collect multiple remotely sensed data benchmarks [20].
Regardless of the acquisition instrument, HS processing techniques take advantage of the detailed spectral and spatial resolution available to provide a precise material identification over specific earth surface areas of interest [16], [21].
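To make the pixel-wise formulation concrete, the following minimal sketch flattens a toy $H \times W \times D$ cube into $N = WH$ spectral vectors and pairs each one with a label. The function name `flatten_cube`, the toy values, and the labels are illustrative assumptions and do not come from any benchmark scene; plain Python lists are used only to keep the sketch free of dependencies.

```python
def flatten_cube(cube):
    """Turn an H x W x D nested list into N = H*W spectral vectors of length D."""
    return [list(pixel) for row in cube for pixel in row]


# Toy scene with H = 2, W = 3 and D = 4 bands (values are made up).
cube = [[[h, w, h + w, h * w] for w in range(3)] for h in range(2)]
pixels = flatten_cube(cube)

# Hypothetical ground-truth labels, one per pixel, in row-major order.
labels = [1, 1, 2, 1, 3, 1]
pairs = list(zip(pixels, labels))
```

Each element of `pairs` plays the role of one data-label pair $\{x_i, y_i\}$; class 1 is deliberately over-represented to mimic an imbalanced scene.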
In this context, from kernel-based classification methods [22], [23], through statistical models [24], to the most recent deep learning approaches [25], [26], [27], [28], different paradigms have been successfully proposed and applied to process and classify remotely sensed HS data. Following the natural evolution of artificial intelligence and automatic processing algorithms, from traditional machine learning (ML), which tends to rely on hand-crafted features, to the current state of the art dominated by deep learning (DL) models, characterized by their ability to automatically extract deep and abstract features, the scientific community has provided interesting HS classifiers based on supervised, unsupervised, and semisupervised approaches [29], [30]. In spite of their technical differences, all these technologies share a common requirement: they must cope with the intrinsic complexity of the HS image domain. On the one hand, HS data often contain a high level of spectral redundancy, as narrow contiguous bands tend to be highly correlated and prone to spectral leakage during the radiance acquisition process [31]. Moreover, HS data cubes contain noise due to uncontrolled changes in atmospheric conditions and instrumental limitations, which, coupled with significant spectral mixing due to the tradeoff between spectral and spatial resolutions, results in high data variability. On the other hand, obtaining accurate labeled training data is expensive as well as time-consuming. This eventually limits the availability of ground-truth (labeled) HS data, and contrasts with the large amounts of training samples needed to alleviate the so-called Hughes phenomenon [32]. The lack of training samples prevents the classifier from covering both the variability of the data and the complexity introduced by the high dimension of the features.
As a result, the model does not fit properly and its behavior degrades rapidly. Similarly, this shortcoming poses a major challenge in semantic segmentation, given the requirement of pixel-level annotations, which are expensive and time-consuming to obtain. Indeed, training datasets for image semantic segmentation are often small and do not cover the full range of variations that can occur in real-world images.

A. Class Imbalance
In addition to these problems, there is another important aspect that significantly affects remotely sensed HS data collections: the class-imbalance problem [33]. It is characterized by a training set with a highly irregular distribution in terms of the number of samples per class, which may introduce an important bias toward the classes with more samples during the training process. Indeed, processing methods tend to fit more closely to the majority classes, producing large classification errors for minority classes. Regardless of whether these class differences are naturally present over the earth's surface or artificially generated by some external factors, the class-imbalance problem is a challenging factor when it comes to remotely sensed HS image classification [34], [35] and semantic segmentation [36]. With the ongoing developments in HS imaging acquisition and processing technologies, the earth's surface is being characterized at an unprecedented level of detail, providing rich information for multiple purposes, such as fine-grained land-cover classification (with an inherent class asymmetry). As more ambitious land-cover classification taxonomies are proposed, the class-imbalance problem is more likely to occur, since the earth's surface is naturally heterogeneous in its class composition [37]. Different approaches have been developed to tackle the class-imbalance problem, such as cost-sensitive methods, kernel-based methods, and active learning methods [38], [39], [40]. Notwithstanding their positive impact on final accuracy, these approaches have a number of limitations that hinder their performance in real HS scenes, e.g., kernel-based models and active learning have a high computational burden, while cost-sensitive methods have to define misclassification costs that are not usually available in HS datasets.
Generally, patch-based processing methods, such as U-Net models, have enhanced the classification of HS data by using small image patches to capture fine-grained details and reduce the impact of within-class variability (the so-called salt-and-pepper classification noise). Recent works include hyperspectral change detection [41] and generative adversarial minority oversampling (3-D-HyperGAMO) strategies to increase the accuracy [35]. Nonetheless, these approaches can exacerbate data imbalance issues when certain classes are underrepresented in the patches, which can be mitigated by balancing the classes within the patches. Furthermore, patch-based approaches are usually computationally intensive, especially when dealing with high-resolution images or large datasets, and they require preprocessing steps, such as normalization or data augmentation, to improve the performance. Traditionally, HS data dimensionality reduction methods, such as spectral band selection or the popular principal component analysis (PCA), have been used to simplify the feature space, reducing the complexity of the processing models while better separating samples belonging to different classes. Notwithstanding the improved results, these methods do not address the imbalance problem.
Over the past years, extensive research work has been conducted to address the class-imbalance problem [42], [43], [44]. From a general perspective, there are two main trends to deal with imbalanced datasets: 1) preprocessing; and 2) cost-sensitive techniques. Whereas the preprocessing approach focuses on modifying the original data collection to relieve the class-imbalance effect, the cost-sensitive solution proposes to manage these deviations in the classifier itself. Although both frameworks have been shown to obtain competitive results in many different application domains, many works in the literature adopt the preprocessing scheme because of its simplicity and more generic design, as it does not affect the classification process itself [45]. The preprocessing strategy is divided into two main alternatives: 1) oversampling; and 2) under-sampling. Oversampling techniques aim at generating new samples for the minority classes. On the contrary, under-sampling methods focus on eliminating samples from the majority classes, thus alleviating the class-imbalance effect. Both resampling methods have been studied over standard data collections, where the oversampling scheme has become the predominant approach [46]. Nonetheless, the special complexity of HS images makes it difficult to extrapolate general-purpose oversampling results to the remotely sensed HS image domain.

B. Contributions and Article Structure
Given the aforementioned issues, this article pursues a twofold objective. First, it reviews the most relevant oversampling methods that can be adopted according to the nature of HS data. Second, it provides an experimental study and comparison useful to derive practical conclusions about the performance of oversampling techniques in different HS image-based applications. Moreover, this article presents a comprehensive analysis of state-of-the-art oversampling techniques used to improve the classification accuracy of underrepresented classes in HS scenes. In addition, the investigation extends to alternative class balancing methods for semantic segmentation. The contributions of this work provide valuable insights for improving the performance of HS image classification and related tasks, which can have significant implications in various domains, such as remote sensing.

Fig. 1. Oversampling procedure for imbalanced HS scenes. From the imbalanced dataset $X$, minority $X^{\beta}$ and majority $X^{\alpha}$ subsets are extracted. Consequently, each pixel-label pair $(x_i, y_i)$ within $X^{\beta}$ and $X^{\alpha}$ has its corresponding class. An oversampling strategy is then applied to generate a new balanced dataset.
The rest of this article is organized as follows. Section II provides a detailed discussion of popular oversampling techniques that have been widely adopted for HS image classification. Section III reviews loss functions designed to tackle class imbalance. Section IV presents an experimental comparison of these techniques using different widely used classifiers, describing the set of benchmark HS scenes in Section IV-A and the evaluation metrics in Section IV-B. This section also provides best practice recommendations for the selection of oversampling techniques in different application domains. Finally, Section V concludes this article with some remarks and hints at plausible future research lines.

II. OVERSAMPLING FOR HS IMAGE CLASSIFICATION
In recent years, several efforts have been made toward developing novel techniques to effectively classify remotely sensed HS data [16], [21], [47], [48], [49]. Despite all the conducted research, there are still significant challenges to deal with, as airborne and spaceborne image acquisition technologies are continuously improved [14]. In general, the increase of the spectral-spatial resolution of modern HS sensors makes the task of identifying pixel and subpixel components more complex, since more detailed information is available for the study and classification of the earth's surface. In addition, the class-imbalance problem also has a considerable impact on the final land-cover classification performance, mainly because minority classes may not have enough samples to be properly represented and generalized [45]. To this extent, oversampling techniques have proven to be excellent tools to balance the class distribution in the dataset, while guaranteeing a detailed earth surface characterization. In this context, the study in [50] tackled the challenge of landslide classification in remote sensing images through the use of oversampling techniques. Its results highlight the effectiveness of oversampling methods for imbalanced data issues and demonstrate the potential of these techniques in the scope of remote sensing data analysis.
This section describes the most popular oversampling methods in the remotely sensed HS image classification domain, providing their technical details. The overall procedure is shown in Fig. 1.
In this research, multiple oversampling techniques are reviewed, which can be divided into three groups: random-based, SMOTE-based, and adaptive synthetic sampling (ADASYN). First, random oversampling is a naive technique for class balancing based on the replication of existing training samples. Second, SMOTE generates synthetic samples for minority classes; given its good results in several applications, some variations have been proposed to improve its effectiveness. Finally, ADASYN applies adaptive learning to reduce class imbalance by shifting the corresponding decision boundaries toward those minority samples that are harder to learn.

A. Random Oversampling
The random oversampling method (RANDOM) is the most basic technique to balance a data collection. In particular, this method randomly duplicates samples of the minority class until the classes are balanced. From a mathematical point of view, $X$ can be considered an imbalanced dataset comprising $N$ samples with their corresponding labels. Assuming there are two classes, one majority ($X^{\alpha}$) and one minority ($X^{\beta}$), the entire dataset can be divided into two subsets by separating the samples that belong to the majority class (denoted $x^{\alpha}$) from those that belong to the minority class (denoted $x^{\beta}$). According to (1), there must be more samples from the majority class than from the minority class, i.e., $N_{\alpha} \gg N_{\beta}$. In this context, the RANDOM method creates new minority class instances by randomly replicating the original samples of $X^{\beta}$ until $N_{\alpha} = N_{\min}$, where $N_{\min}$ is the new number of samples in the minority class after the oversampling. From a geometric perspective, the newly generated samples for the minority classes are always placed on the coordinates of an existing sample. As a consequence, this method only increases the number of samples in the minority classes, without introducing any data variety in the newly generated samples. Despite its simplicity, the random oversampling approach is quite prone to over-fitting, given that the same information is replicated multiple times within the minority classes. In this regard, some alternatives have been proposed to mitigate over-fitting in classification tasks. For instance, one of the most popular solutions addresses the noise problem that can arise during data oversampling, where generating synthetic data points that are too similar to existing data points can lead to over-fitting. To address this issue, the noise-robust oversampling (NROMM) [51] technique has been proposed for the minority classes.
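The replication step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `random_oversample`, the `n_target` argument (standing in for $N_{\min}$), and the toy two-band samples are hypothetical, and sampling with replacement from the original minority set is one straightforward reading of "randomly replicating".

```python
import random


def random_oversample(minority, n_target, seed=0):
    """Replicate minority samples at random until n_target samples exist."""
    rng = random.Random(seed)
    n_extra = max(0, n_target - len(minority))
    # Every new sample is an exact copy of an existing minority sample.
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return minority + extra


# Toy minority class with three two-band samples; the majority class has 10.
minority = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.4)]
balanced = random_oversample(minority, n_target=10)
```

Because no new coordinates are created, the set of distinct points in `balanced` is identical to the original minority set, which is exactly why this method adds no variety and tends to over-fit.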

B. Synthetic Minority Oversampling Technique (SMOTE)
The synthetic minority oversampling technique (SMOTE) [52] is one of the most popular oversampling methods. In particular, this algorithm generates synthetic samples of the minority classes by using their corresponding nearest neighbors. Assuming the mathematical notation described in Section II-A, where $x^{\alpha}$ denotes any sample from the majority class and $x^{\beta}$ any sample from the minority class (for simplicity, the index $j$ can be omitted), the SMOTE method requires two input parameters: $K$ and $\omega$. Parameter $K$ refers to the number of neighbors that are computed for each minority sample $x^{\beta}$, while parameter $\omega$ indicates the amount of oversampling to be performed.
The first step is to compute the $K$ nearest neighbors for each sample belonging to the minority class $X^{\beta}$, producing a subset of minority nearest neighbors $K^{\beta}$. Note that only minority samples are considered when computing the nearest neighbors $x^{\beta}_k$ of a sample, as expressed in (2). Once the $K$ neighbors are computed for each minority sample $x^{\beta}$, the next step generates new synthetic samples according to the oversampling parameter, $\omega$. Let $S$ be the set of synthetic samples. The number of samples to be generated, $|S|$, is obtained as shown in (3). The generation of synthetic samples is performed by drawing a segment between the selected minority sample, $x^{\beta}$, and a random minority neighbor, $x^{\beta}_k$. Then, the difference between these two samples is multiplied by a random scalar, $\lambda \in [0, 1]$. In this way, if $\lambda$ is lower than 0.5, the new sample will be created closer to the processed minority sample, $x^{\beta}$. Analogously, if $\lambda$ is greater than 0.5, the synthetic sample will be located closer to the corresponding neighbor, $x^{\beta}_k$. Equation (4) represents the generation of a synthetic sample $s$, i.e., $s = x^{\beta} + \lambda (x^{\beta}_k - x^{\beta})$. The number of repetitions depends on parameter $\omega$.
SMOTE seeks to address class imbalance by generating synthetic samples in underrepresented regions of the feature space. By augmenting the density of minority class instances in these regions, SMOTE aims to improve the identification of the underrepresented class. Various extensions of this algorithm have been developed to enhance its consistency and robustness, particularly in challenging scenarios. Four distinct variants of the SMOTE algorithm are discussed below, each with its unique characteristics: BORDERLINE-1, BORDERLINE-2, SVM-SMOTE, and K-Means SMOTE.
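The two SMOTE steps (minority-only neighbor search, then interpolation along the segment to a random neighbor) can be sketched as follows. This is an illustrative sketch, not a reference implementation: the names `k_nearest` and `smote` are hypothetical, Euclidean distance is assumed for the neighbor search, and `n_new` stands in for $|S|$ rather than deriving it from $\omega$.

```python
import math
import random


def k_nearest(x, pool, k):
    """Return the k samples of pool closest to x (Euclidean), excluding x itself."""
    others = [p for p in pool if p is not x]
    return sorted(others, key=lambda p: math.dist(x, p))[:k]


def smote(minority, k, n_new, seed=0):
    """Generate n_new synthetic samples by interpolating toward minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)                        # minority sample to process
        neighbor = rng.choice(k_nearest(x, minority, k))  # one of its K minority neighbors
        lam = rng.random()                              # random scalar lambda in [0, 1)
        # s = x + lambda * (neighbor - x): a point on the segment between both samples.
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, neighbor)))
    return synthetic


# Toy minority class: the four corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
synthetic = smote(minority, k=2, n_new=5)
```

Unlike random oversampling, each synthetic sample generally lands at new coordinates, but always on a segment joining two existing minority samples.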
1) SMOTE BORDERLINE-1 (SMOTEBD1): Many existing classification algorithms estimate accurate decision boundaries among classes in order to obtain high precision and reliability when classifying the input data. Nevertheless, the samples located in boundary regions (known as borderline samples) are more likely to be misclassified, as the level of uncertainty is higher due to the spectral mixture. Precisely, oversampling techniques can take advantage of this to produce more relevant synthetic data.
Contrary to the regular SMOTE version, which does not consider class boundaries, BORDERLINE-1 [53] focuses on the borderline samples of the minority classes to obtain more consistent class separability with the newly generated synthetic data. As in SMOTE, the algorithm requires the $K$ and $\omega$ parameters and works as follows.
The first step of Borderline-1 is to calculate the $K$ nearest neighbors for each minority sample from $X^{\beta}$. Considering $X^{\alpha}_k$ as the subset of neighbors of $x^{\beta} \in X^{\beta}$ belonging to the majority class, $X^{\alpha}$, Borderline-1 distinguishes three kinds of samples: noisy, dangerous, and safe. Based on the number of majority samples, $|X^{\alpha}_k|$, in the neighborhood of $x^{\beta}$, the algorithm applies the following rules.
1) If $|X^{\alpha}_k| = K$, the sample $x^{\beta}$ is noisy, since its whole neighborhood belongs to the majority class.
2) If $K/2 \le |X^{\alpha}_k| < K$, $x^{\beta}$ is dangerous, because most of its neighboring samples are within the majority class.
3) If $0 \le |X^{\alpha}_k| < K/2$, $x^{\beta}$ is safe (or secure), as most of its neighborhood belongs to the minority class.
Let $X^{\beta}_D \subset X^{\beta}$ be the set of dangerous minority samples. For each sample $x^{\beta} \in X^{\beta}_D$, $K$ minority neighbors are computed to obtain the desired $K^{\beta}$. Once the minority neighbors are computed, the number of samples to be generated, $|S|$, is calculated following (5). Then, a random number, $1 \le \theta \le K$, of minority neighbors from $K^{\beta}$ is selected for each danger sample until $|S|$ is reached. As in SMOTE, the generation of synthetic samples is performed according to (4).
As a result, the oversampling process can be conducted only by considering the borderline samples, labeled as danger, to increase the density on the minority class boundaries.
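The noisy/dangerous/safe rule of Borderline-1 can be illustrated with a short sketch. The function name `categorize`, the Euclidean distance, and the toy coordinates are assumptions made for illustration; only the three thresholds on the number of majority neighbors come from the description above.

```python
import math


def categorize(x, minority, majority, k):
    """Label a minority sample from the classes of its k nearest neighbors."""
    # Neighbors are searched among all samples, minority and majority alike.
    pool = [(math.dist(x, p), "min") for p in minority if p is not x]
    pool += [(math.dist(x, p), "maj") for p in majority]
    nearest = sorted(pool)[:k]
    n_maj = sum(1 for _, tag in nearest if tag == "maj")
    if n_maj == k:
        return "noisy"    # whole neighborhood is majority
    if n_maj >= k / 2:
        return "danger"   # most neighbors are majority: borderline sample
    return "safe"         # most neighbors are minority


minority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (1.05, 1.05), (5.0, 5.0)]
majority = [(1.1, 1.0), (1.0, 1.1), (0.9, 1.2), (5.1, 5.0), (5.0, 5.1), (4.9, 5.0), (5.2, 5.2)]

labels = [categorize(x, minority, majority, k=3) for x in minority]
# Only the samples tagged "danger" would be passed on to the SMOTE interpolation.
danger = [x for x, tag in zip(minority, labels) if tag == "danger"]
```

In this toy layout, the isolated minority sample surrounded by majority points is discarded as noisy, the tight minority cluster is safe, and only the two samples sitting next to majority points are kept for oversampling.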

2) SMOTE BORDERLINE-2 (SMOTEBD2):
Inspired by BORDERLINE-1, the BORDERLINE-2 algorithm [53] considers a wider data diversity when generating the new synthetic samples. In order to achieve this goal, the SMOTE BORDERLINE-2 algorithm not only considers elements of the minority class (X β ) when computing the neighborhood of the borderline or danger samples, X β D , but also elements of the majority class (X α ) in order to produce a higher data variation in the minority class. Precisely, this higher variability introduces diversity into the training, which helps to reduce over-fitting.
Whereas the SMOTE BORDERLINE-1 algorithm is designed to produce new samples from the boundaries of minority classes, the SMOTE BORDERLINE-2 extension relaxes this constraint by introducing synthetic samples closer to majority class samples. Generation of new samples is performed as shown in (4) by introducing the random scalar, λ ∈ [0, 0.5] to calculate the location of the new synthetic sample around the minority observations.
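A minimal sketch of the Borderline-2 interpolation step is given below. It assumes the commonly used formulation in which $\lambda$ is restricted to $[0, 0.5]$ only when the selected neighbor belongs to the majority class, while minority neighbors keep the full $[0, 1]$ range; the function name `borderline2_sample` and its `neighbor_is_majority` flag are hypothetical.

```python
import random


def borderline2_sample(x, neighbor, neighbor_is_majority, rng):
    """Interpolate from a danger sample x toward one of its neighbors."""
    if neighbor_is_majority:
        # Stay in the half of the segment nearest to the minority sample.
        lam = rng.uniform(0.0, 0.5)
    else:
        lam = rng.uniform(0.0, 1.0)
    return tuple(a + lam * (b - a) for a, b in zip(x, neighbor))


rng = random.Random(0)
x = (0.0, 0.0)                 # danger minority sample
majority_neighbor = (2.0, 0.0)
samples = [borderline2_sample(x, majority_neighbor, True, rng) for _ in range(100)]
```

With a majority neighbor, every synthetic sample falls in the first half of the segment, so it moves toward the majority class without ever being closer to the majority sample than to the minority one.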
3) SVM-SMOTE: Support vector machines (SVMs) have shown a huge potential to effectively identify class boundaries in many different application domains [12]. Nevertheless, SVMs are not immune to the class-imbalance problem, and the remote sensing HS domain is no exception [23]. Some authors focus on modifying the SVM classification process to manage the class-imbalance problem [54]. SVMs also provide a robust framework to generate new synthetic samples. For instance, Japkowicz and Stephen [55] demonstrated that SVMs are excellent tools to deal with such imbalance issues, since class boundaries are typically based on a small number of support vectors.
In this context, the SVM-SMOTE method [56] is a popular oversampling technique that exploits the robustness of SVMs when dealing with high-dimensional data to generate new synthetic samples of minority classes. Like BORDERLINE-1 and BORDERLINE-2, SVM-SMOTE increases the minority class density in those feature space areas with a high uncertainty level.
Considering the imbalanced dataset described in (1), the first step to apply SVM-SMOTE is training an SVM classifier with all the available training data, i.e., $X$. Thus, the optimal hyperplane that best divides classes $X^{\alpha}$ and $X^{\beta}$ is found. The location of a sample $x$ above or below the hyperplane is described by (6). An optimization algorithm is necessary to obtain the weights, $w$, such that only the support vectors determine the borderline regions between classes. Consequently, the support vectors, which are minority samples located in the vicinity of the class boundary, are used to determine the weights. Let $X^{\beta}_b \subset X^{\beta}$ be the set of minority support vectors and $x^{\beta}_b \in X^{\beta}_b$ a support vector. The computation of the $K$ nearest neighbors that form the neighborhood $K$ is given in (7). In contrast to previous methods, only the borderline minority instances that are approximated by support vectors are oversampled. Consequently, the original SMOTE expression (3) is redefined as (8).
Nonetheless, it must be noted that, when dealing with large imbalance problems, the decision hyperplane that best maximizes the margin between samples of different classes may be biased toward the majority class [57]. This produces two main issues: 1) minority instances lie far from the optimal decision hyperplane; and 2) SVMs are biased toward majority instances when majority and minority observations overlap in feature space. In this regard, an interpolation procedure generates a new sample between two points, as depicted in (9a), when most of the points in the neighborhood $K$ belong to the majority class. Otherwise, extrapolation is conducted, as represented by (9b). One important difference between SVM-SMOTE and SMOTE is that new instances are generated in order, i.e., SVM-SMOTE iterates the neighborhood from the closest sample to the farthest one.
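The choice between interpolation and extrapolation can be sketched as follows. This sketch assumes that the minority support vector and its minority neighbor have already been obtained from a trained SVM, which is outside its scope. The function name `svm_smote_sample`, the tie-handling at exactly $K/2$, and the specific expressions used for the two branches are assumptions based on the usual borderline oversampling formulation, not a transcription of (9a) and (9b).

```python
import random


def svm_smote_sample(sv, minority_neighbor, n_majority, k, rng):
    """Create one synthetic sample from a minority support vector sv."""
    lam = rng.random()  # random scalar in [0, 1)
    if n_majority >= k / 2:
        # Crowded by majority samples: interpolate toward the minority neighbor,
        # which keeps the new sample inside the existing minority region.
        return tuple(a + lam * (b - a) for a, b in zip(sv, minority_neighbor))
    # Mostly minority neighbors: extrapolate away from the neighbor,
    # which expands the minority region outward past the support vector.
    return tuple(a + lam * (a - b) for a, b in zip(sv, minority_neighbor))


rng = random.Random(0)
sv, nb = (1.0, 1.0), (0.0, 0.0)
inside = svm_smote_sample(sv, nb, n_majority=4, k=5, rng=rng)   # interpolation
outside = svm_smote_sample(sv, nb, n_majority=1, k=5, rng=rng)  # extrapolation
```

Here `inside` lies on the segment between the support vector and its neighbor, whereas `outside` lies on the opposite side of the support vector.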

4) K-Means SMOTE:
When addressing data sparsity, oversampling should be applied with care. For certain problems, samples within the same class may not adhere to any discernible pattern, which calls for suitable criteria for partitioning the data. To address this issue, the initial step employs the K-means algorithm to partition the data into $n$ clusters, wherein each observation is assigned to the cluster whose centroid is closest. This process follows Douzas et al. [58], who build K-Means SMOTE on this unsupervised clustering algorithm. Three stages are performed: clustering, filtering, and oversampling.
K-Means SMOTE is applied to an imbalanced dataset as described in (1). The first step consists of clustering the data using the K-Means algorithm. Let $C$ be a set of $n$ clusters as specified in (10). In comparison with SMOTE, K-Means SMOTE requires an additional parameter, the imbalance ratio threshold ($IRT$). This parameter determines whether oversampling is applied to a specific cluster $C_i$, based on the imbalance ratio of the cluster, $IR(C_i)$, calculated as in (11). Then, the set of $m$ clusters, $m \le n$, to be oversampled is defined in (12). Finally, to determine the amount of oversampling to be performed in each cluster, a sampling weight, $SW_m$, is calculated according to the density of minority samples in the feature space for each cluster. In this regard, high sampling weights yield more synthetic samples. The total number of synthetic samples to be generated is given by (13).
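The filtering stage can be illustrated with the sketch below, which takes the cluster assignments as given and only decides which clusters are oversampled. The function names, the smoothed ratio $(n_{maj} + 1)/(n_{min} + 1)$, and the strict comparison against $IRT$ are assumptions borrowed from a common formulation of K-Means SMOTE; the exact expressions in (11) and (12) may differ.

```python
def imbalance_ratio(cluster_labels, minority_label):
    """Smoothed majority-to-minority ratio of one cluster (assumed form)."""
    n_min = sum(1 for y in cluster_labels if y == minority_label)
    n_maj = len(cluster_labels) - n_min
    return (n_maj + 1) / (n_min + 1)


def select_clusters(clusters, minority_label, irt):
    """Keep the clusters whose imbalance ratio is below the threshold irt."""
    return [
        cid
        for cid, cluster_labels in clusters.items()
        if imbalance_ratio(cluster_labels, minority_label) < irt
    ]


# Hypothetical clustering result: cluster id -> class labels of its members.
clusters = {
    0: [1, 1, 1, 0],     # minority-dominated
    1: [0, 0, 0, 0, 1],  # majority-dominated
    2: [1, 1, 0, 0],     # evenly split
}
selected = select_clusters(clusters, minority_label=1, irt=1.0)
```

With this threshold only cluster 0 is retained, so synthetic samples would be generated inside a region already dominated by the minority class rather than across class boundaries.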

C. Adaptive Synthetic (ADASYN) Oversampling
The adaptive synthetic sampling approach (ADASYN) is a popular oversampling approach implemented by He et al. [59]. Specifically, this technique addresses the class imbalance problem by gradually adapting the corresponding decision boundaries to the minority classes.
In addition to the training dataset provided by (1), it is necessary to define some input parameters: $IR_{th}$, $\omega$, and $K$. The first parameter is a threshold used to decide whether synthetic samples are generated. Parameter $\omega$ refers to the desired oversampling ratio. Finally, parameter $K$ refers to the number of neighbors that are computed for each minority sample $x^{\beta} \in X^{\beta}$ during the oversampling process.
As a first step, the imbalance ratio, $IR$, between the minority and majority classes, $X^{\beta}$ and $X^{\alpha}$, respectively, must be calculated as shown in (14). Provided that the obtained value is lower than $IR_{th}$, the ADASYN algorithm proceeds to the next step. Conversely, if the value exceeds $IR_{th}$, oversampling is concluded. The number of synthetic samples to be generated is then calculated as shown in (15). At this stage, the $K$ nearest neighbors must be computed for each minority sample in $X^{\beta}$. In contrast to SMOTE, both majority, $X^{\alpha}$, and minority, $X^{\beta}$, samples are considered. This process is formulated in (16). Following the neighbor calculation, the ratio of majority samples, $R^{\alpha}$, is computed for each minority sample in order to decide the amount of oversampling per minority sample, as shown in (17). Each ratio $r_i$ must be normalized using (18). Once this ratio is computed, the expected number of synthetic samples to be created per sample, denoted by $|S_i|$, can be estimated using (19). The procedure to generate synthetic samples for each minority sample, $x_i \in X^{\beta}$, is the same as in (4).
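The bookkeeping behind ADASYN can be sketched as follows. The formulas are assumptions taken from the widely used formulation of the algorithm ($IR = N_{\beta}/N_{\alpha}$, a total of $(N_{\alpha} - N_{\beta})\,\omega$ synthetic samples, $r_i$ as the fraction of majority neighbors, and a normalization that makes the ratios sum to one); the function names are hypothetical, and the neighbor counts are passed in directly instead of being computed from data.

```python
def n_synthetic(n_min, n_maj, omega, ir_th):
    """Total number of synthetic samples, or 0 if the data are balanced enough."""
    ir = n_min / n_maj
    if ir >= ir_th:
        return 0
    return round((n_maj - n_min) * omega)


def adasyn_allocation(majority_counts, k, n_total):
    """Split n_total among minority samples according to their majority neighbors.

    majority_counts[i] is the number of majority samples among the k nearest
    neighbors of minority sample i.
    """
    ratios = [c / k for c in majority_counts]
    total = sum(ratios)
    if total == 0:
        return [0] * len(ratios)  # no minority sample has majority neighbors
    normalized = [r / total for r in ratios]
    return [round(r * n_total) for r in normalized]


n_total = n_synthetic(n_min=20, n_maj=40, omega=1.0, ir_th=0.75)
alloc = adasyn_allocation([0, 1, 4, 5], k=5, n_total=n_total)  # [0, 2, 8, 10]
```

The minority sample with no majority neighbors receives no synthetic samples, while the one fully surrounded by the majority class receives half of them, which reflects the emphasis on samples that are harder to learn.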
The idea behind the ADASYN algorithm is to use the density ratio $r_i$ to determine the number of synthetic samples required for each minority class sample $x_i$. This differs from the behavior of other oversampling methods, which either consider the position of the sample within the minority class (BORDERLINE) or are based on a random criterion (SMOTE). In practice, the density ratio of ADASYN quantifies the weight of each minority class sample according to its difficulty level in the corresponding learning process. In this context, ADASYN focuses on generating more synthetic samples in the most challenging areas of the dataset, in order to encourage learning features from minority class samples (which are more difficult to detect).

Table I

To better understand how each of the reviewed methods works, a synthetic dataset has been created with three classes (one majority and two minority), as shown in Fig. 2. As previously discussed, when oversampling is performed using random oversampling, new minority class samples are always generated on the coordinates of an existing sample. SMOTE-based methods generate new samples with different patterns. It is interesting to discuss the differences between Borderline1 and Borderline2. The latter generates new samples considering the majority class, which is visible because new samples appear in the center of the axes, whereas Borderline1 is limited to the boundaries of the minority classes. Moving to SVM-SMOTE, new samples are generated taking into account the support vector samples, which is visible because most new samples are generated along a few directions. In the case of K-Means SMOTE, one minority class (displayed in green) clearly shows how two clusters were created and new samples are generated inside them. Finally, inferring the operating mode of ADASYN from the plots alone is more difficult. Nevertheless, newly generated samples are surrounded by samples from other classes, which forces the classifier to learn the boundaries between classes.

III. TACKLING CLASS IMBALANCE BY MEANS OF LOSS FUNCTION
Oversampling techniques have been widely used to tackle the class-imbalance problem, providing competitive results. Another set of methods deserves special attention in this work, due to their promising performance when facing class imbalance and their high impact on the design of the processing method: loss functions. Indeed, great efforts have been invested in designing more descriptive loss functions that help the processing method traverse the objective function surface toward the desired result. In this regard, multiple loss functions have been developed to increase the weight of underrepresented classes, playing a crucial role in enhancing the classification/segmentation performance for those classes. The most common functions are: 1) multiclass cross-entropy loss; 2) focal loss; 3) cyclical focal loss; and 4) asymmetric focal loss.
The multiclass cross-entropy (CE) loss function assumes that all classes in a given dataset X are equally represented, which is often not the case in real-world scenarios. The probability distribution generated by the model represents the likelihood of each pixel belonging to a particular class, i.e., considering $C$ classes, each $P(x_i|y_i = c)$ provides the probability that $x_i$ belongs to the $c$th class $\forall c \in [1, C]$, while the true label $Y_c$ takes the value 1 if $c$ is the correct classification label and 0 otherwise. The loss is minimized across all classes equally, without considering their distribution. For a specific sample $x_i$, where $P(x_i|y_i = c)$ is the predicted class probability, the cross-entropy is calculated as
$$L_{CE}(x_i, y_i) = -\sum_{c=1}^{C} Y_c \log\left(P(x_i|y_i = c)\right). \quad (20)$$
For the sake of simplicity, $x_i$ and $y_i$ can be omitted from (20), simplifying the expression to
$$L_{CE} = -\sum_{c=1}^{C} Y_c \log(P_c).$$
Regarding the focal loss (FL) [60], it weights the loss calculation, assigning higher weights to misclassified samples while reducing the importance of the correctly classified ones (down-weighting). Indeed, focusing on those samples where the processing fails the most encourages the process to improve its results on hard samples over time. This is expressed in (21a), where $\gamma$ is the focusing parameter. Furthermore, an $\alpha$-balanced variant is used as described in (21b), where $\alpha_c$ is the balancing factor
$$L_{FL} = -\sum_{c=1}^{C} (1 - P_c)^{\gamma}\, Y_c \log(P_c) \quad (21a)$$
$$L_{FL} = -\sum_{c=1}^{C} \alpha_c (1 - P_c)^{\gamma}\, Y_c \log(P_c). \quad (21b)$$
This model emphasizes identifying and prioritizing samples that are challenging to classify, while mitigating the influence of easily classifiable samples. The selection of the most suitable loss function depends on the specific objectives of the semantic segmentation task at hand, with the aim of improving the performance of the model and addressing the challenge of class imbalance.
Cyclical focal loss (C-FL) [61] is a novel variant of the focal loss based on the learning-rate scheduler.
The integration of a cyclical learning rate aids in improving model convergence by enabling the network to escape from suboptimal local minima, while simultaneously mitigating the impact of overfitting and improving generalization. The C-FL functionality is based on a linear schedule, $\xi$, defined in terms of the ratio between the current epoch $e$ and the total number of epochs $E$, and a fixed cyclical factor $f_c \geq 1$, which sets the cycles of $\xi$: with $f_c = 1$, $\xi$ has one cycle over the epochs, from a value of 1 at the first epoch to a value of 0 at the last epoch; with $f_c = 2$, $\xi$ has two cycles, from a value of 1 at the first epoch to a value of 0 at epoch $E/2$, rising again to a value of 1 at the last epoch, and so on. Indeed, $\xi$ controls the loss function at every epoch, which is expressed as a $\xi$-weighted combination of two terms, the second one being the FL, each controlled by its corresponding focusing parameter, $\gamma_1$ and $\gamma_2$
$$L_{C\text{-}FL} = -\xi\,(1 + P_c)^{\gamma_1} \log(P_c) - (1 - \xi)\,(1 - P_c)^{\gamma_2} \log(P_c). \quad (22)$$
Lastly, asymmetric focal loss (A-FL) [62] aims to prioritize the learning of harder-to-classify examples, which are typically the minority class examples in such datasets. A-FL achieves this by assigning different weights to the loss function for each class $c$ based on the difficulty of the classification task. In this regard, the loss function assigns a higher weight to minority class samples that are more challenging to classify, while assigning a lower weight to majority class examples that are easier to classify. An approximation of this calculation is shown in the following:
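As a separate numerical illustration of the CE, FL, and C-FL losses discussed earlier in this section (A-FL is not covered), the following NumPy sketch may be useful. It is not the implementation used in the experiments. The function names, the default values of the focusing parameters, the piecewise-linear form of the schedule $\xi$ (chosen only to reproduce the $f_c = 1$ and $f_c = 2$ behaviors described above), and the $(1 + P_c)^{\gamma_1}$ term of the cyclical loss (assumed here to follow [61]) are all assumptions.

```python
import numpy as np

EPS = 1e-12  # lower bound on probabilities, avoids log(0)

def cross_entropy(probs, onehot):
    """CE per sample: -sum_c Y_c log(P_c)."""
    probs = np.clip(probs, EPS, 1.0)
    return -np.sum(onehot * np.log(probs), axis=-1)

def focal_loss(probs, onehot, gamma=2.0, alpha=None):
    """FL per sample; passing per-class alpha gives the alpha-balanced form."""
    probs = np.clip(probs, EPS, 1.0)
    weights = (1.0 - probs) ** gamma
    if alpha is not None:
        weights = weights * np.asarray(alpha, dtype=float)
    return -np.sum(weights * onehot * np.log(probs), axis=-1)

def cyclical_xi(epoch, total_epochs, fc=1.0):
    """Assumed schedule: 1 -> 0 during the first E/fc epochs, then back to 1.

    With fc = 1 the second branch is never reached for epoch <= total_epochs.
    """
    frac = fc * epoch / total_epochs
    if frac <= 1.0:
        return 1.0 - frac
    return (frac - 1.0) / (fc - 1.0)

def cyclical_focal_loss(probs, onehot, xi, gamma1=2.0, gamma2=2.0):
    """xi-weighted mix of an assumed (1 + P)^gamma1 term and the focal term."""
    probs = np.clip(probs, EPS, 1.0)
    confident = (1.0 + probs) ** gamma1  # assumed form of the first term
    focal = (1.0 - probs) ** gamma2      # standard focal modulation
    mix = xi * confident + (1.0 - xi) * focal
    return -np.sum(mix * onehot * np.log(probs), axis=-1)

# Usage: an easy sample (P = 0.9 for the true class) and a hard one (P = 0.2).
probs = np.array([[0.9, 0.05, 0.05], [0.2, 0.5, 0.3]])
onehot = np.array([[1, 0, 0], [1, 0, 0]])
ce = cross_entropy(probs, onehot)
fl = focal_loss(probs, onehot, gamma=2.0)
```

With $\gamma = 2$, the focal term scales the CE of the easy sample by $(1 - 0.9)^2 = 0.01$ and that of the hard sample by $(1 - 0.2)^2 = 0.64$, which is the down-weighting effect described for the FL.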

IV. EXPERIMENTAL RESULTS
A large set of experiments on different real and popular HS datasets, using different classifiers widely known by the scientific community, has been performed in order to evaluate the impact of the oversampling techniques reviewed above. In the following, the description of the HS datasets, the set of metrics used for the evaluation of the experiments, the setting and motivation of the experiments performed, and a detailed discussion of the results obtained are provided.

A. Datasets
Three widely used HS images, with different spatial and spectral characteristics and different numbers of labeled samples, have been used to conduct the experimental validation.
1) The IP scene: the available ground truth comprises 16 mutually exclusive classes. In addition to the original scene, a spatially disjoint train-test scene (DIP) has been used to evaluate the behavior of certain spectral-spatial classifiers (see Fig. 4).
2) The KSC scene [see Fig. 3(b)] was also acquired by AVIRIS during a flight campaign in 1996. The spectral information ranges from 400 to 2500 nm, with 512 × 614 pixels and 176 spectral bands, once some low signal-to-noise ratio (SNR) bands have been removed. The ground truth is divided into 13 mutually exclusive classes, pertaining to upland and wetland areas.
3) The BW dataset (see Fig. 5) was acquired over the Okavango Delta, Botswana, by the Hyperion sensor on the EO-1 satellite. The scene contains 1496 × 256 pixels characterized by 30 m of spatial resolution, and 242 bands in the spectral range 400-2500 nm. It must be noted that 97 uncalibrated and water-corrupted bands have been removed, keeping the remaining 145 spectral bands [35]. The ground truth comprises 14 different and mutually exclusive land-cover classes, including seasonal marshes, occasional swamps, and drier woodlands located in the distal part of the Delta.
4) The AeroRIT scene (see Fig. 6). In order to ensure the reliability of the data, ambiguous and inconsistent pixels were removed, resulting in a scene with a final resolution of 1920 × 3968. The processing hyperparameters and dataset configuration were extracted from the original study.

B. Evaluation Metrics
When extracting knowledge from imbalanced data, it is necessary to adopt evaluation metrics that properly assess model performance. In this regard, Table II provides the metrics considered in this study to evaluate the performance of the different oversampling methods. They are expressed in terms of the confusion matrix of a binary classification problem, i.e., considering the two classes Positive and Negative. In this regard, the measurements collected in Table II are obtained from the distribution of the classifier predictions over the data.
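To make the role of confusion-matrix-based metrics more concrete, the sketch below computes several of the measures used later in this study (F1, G-mean, and overall accuracy) for a binary problem. The exact formulations of Table II are not reproduced here; the standard textbook definitions are assumed, and the function name `binary_metrics` and the counts in the usage example are hypothetical.

```python
import math

def binary_metrics(tp, fn, fp, tn):
    """Standard metrics from the counts of a binary confusion matrix."""
    recall = tp / (tp + fn)        # sensitivity, true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    g_mean = math.sqrt(recall * specificity)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "g_mean": g_mean,
        "accuracy": accuracy,
    }

# Hypothetical imbalanced case: 10 positive and 990 negative samples.
metrics = binary_metrics(tp=5, fn=5, fp=10, tn=980)
```

In this hypothetical case, the accuracy stays at 0.985 even though only half of the positive samples are detected, whereas F1 (0.4) and G-mean (about 0.70) expose the weak performance on the minority class.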

C. Experimental Settings
In this section, a detailed comparison between several oversampling algorithms, i.e., RANDOM sampling, SMOTE, SMOTE-BORDERLINE-1 (SMOTEBD1), SMOTE-BORDERLINE-2 (SMOTEBD2), and SVM-SMOTE, is performed on different classifiers, considering both traditional machine learning algorithms and state-of-the-art deep learning models, to evaluate the impact of oversampling on the final classification results. It must be noted that K-means-SMOTE and ADASYN are not evaluated, as they impose severe restrictions on the minimum number of samples required to generate synthetic data properly. Indeed, these two methods have to run the KNN algorithm to determine whether a minority sample has to be oversampled. The conducted experiments require a minimum of five neighbors to compute distances in the data space. Consequently, when using 5% or less of the labeled samples, some minority classes do not meet the required threshold. For instance, the IP scene has a severe lack of samples for several land-cover classes (such as Oats and Grass/pasture-mowed). Thus, it has been decided to show the behavior of these two methods in Sections II-B4 and II-C, respectively, but not to evaluate them experimentally.
Hyperparameter optimization is conducted using GridSearch and tenfold cross-validation over the whole experimental pipeline, including the oversampling algorithm and the classifier. In this strategy, a wide range of hyperparameters is tested on the original training set over ten partitions for each conducted experiment. As a result, the optimal values for each hyperparameter in the pipeline are estimated. To evaluate the classification results obtained after the application of the remaining oversampling methods, all the measures listed in Table II have been adopted. In order to assess the impact of class imbalance, the study employs a methodology consisting of five Monte Carlo runs. In each run, the same seed is used for all algorithms to ensure consistency and uniformity in the evaluation process. The experimentation aims to analyze the robustness of the results to changes in the selected training data for imbalanced classes.
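The minimum-sample restriction mentioned above for KNN-based oversampling can be illustrated with a short sketch. Several assumptions are made: the helper `classes_below_knn_threshold` is hypothetical, a class is assumed to need at least k + 1 training samples so that each sample has k neighbors of its own class, the per-class training count is assumed to be the floor of the selected percentage, and the class sizes (20 for Oats, 28 for Grass/pasture-mowed, 1428 for Corn-notill) are the commonly reported IP ground-truth counts rather than values taken from this study.

```python
def classes_below_knn_threshold(class_counts, train_percent, k_neighbors=5):
    """Return the classes whose training subset is too small for KNN.

    class_counts: mapping from class name to number of labeled samples.
    train_percent: integer percentage of labeled samples used for training.
    """
    too_small = {}
    for name, total in class_counts.items():
        n_train = total * train_percent // 100  # floor of the percentage
        # Each sample needs k neighbors besides itself (assumed criterion).
        if n_train < k_neighbors + 1:
            too_small[name] = n_train
    return too_small

# Assumed IP class sizes (commonly reported ground-truth counts).
ip_counts = {"Oats": 20, "Grass/pasture-mowed": 28, "Corn-notill": 1428}
blocked = classes_below_knn_threshold(ip_counts, train_percent=5)
```

Under these assumptions, both Oats and Grass/pasture-mowed are left with a single training sample at 5%, far below the six samples that a five-neighbor search would need.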
Regarding the classifiers, two different experiments have been conducted. The first one compares different standard and widely used pixelwise classifiers. Specifically, the following classifiers have been considered to evaluate the behavior of the oversampling methods on traditional machine learning models: multinomial logistic regression (MLR) [4], SVM [22], and shallow and deep multilayer perceptrons (MLP and DMLP) [13]. The same procedure has been followed in all cases to fairly evaluate the oversampling methods. In particular, different amounts of training data (3%, 5%, 10%, 15%, and 20%) are randomly selected from the HS scene. Then, the oversampling algorithms are applied to increase the number of samples within the training sets, producing an augmented set. Finally, the supervised classifiers are trained on the augmented training set, and the obtained inference results reflect the impact of the oversampling strategies. To further explore the impact of the oversampling models, a detailed comparison is provided considering 5% of training data, taking the oversampling technique with the highest G-mean score (OS) and comparing its results with those obtained without oversampling, i.e., training data without oversampling (RAW) and spectrally reduced data based on principal component analysis (PCA) [32].
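The first two steps of the procedure above (selecting a training subset and augmenting it) can be sketched as follows for the RANDOM oversampling case; classifier training is omitted. This is a minimal sketch under stated assumptions: the helper names `stratified_subset` and `random_oversample`, the per-class (stratified) selection, the balancing target (the size of the largest class), and the synthetic data are hypothetical and do not reproduce the exact experimental code.

```python
import numpy as np

def stratified_subset(y, fraction, rng):
    """Randomly pick a fraction of the indices of each class (at least one)."""
    picked = []
    for c in np.unique(y):
        members = np.flatnonzero(y == c)
        n = max(1, int(round(fraction * members.size)))
        picked.append(rng.choice(members, size=n, replace=False))
    return np.concatenate(picked)

def random_oversample(X, y, rng):
    """Replicate minority samples until every class matches the largest one."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    order = [np.arange(y.size)]  # keep all original samples
    for c, n in zip(classes, counts):
        if n < target:
            members = np.flatnonzero(y == c)
            # New samples are exact copies of existing ones.
            order.append(rng.choice(members, size=target - n, replace=True))
    order = np.concatenate(order)
    return X[order], y[order]

# Hypothetical imbalanced data: 3 classes, 8 spectral features per pixel.
rng = np.random.default_rng(0)  # fixed seed, as in a single Monte Carlo run
y_all = np.repeat([0, 1, 2], [1000, 200, 40])
X_all = rng.normal(size=(y_all.size, 8))

train_idx = stratified_subset(y_all, 0.05, rng)
X_aug, y_aug = random_oversample(X_all[train_idx], y_all[train_idx], rng)
```

In this example, the 5% training subset contains 50, 10, and 2 samples per class, and the augmented set contains 50 samples of each class; a classifier would then be trained on `X_aug` and `y_aug` and tested on the remaining pixels.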
The next experiment compares different state-of-the-art deep learning models to evaluate the impact of oversampling techniques. In this sense, an ablation study is performed using the convolutional neural network (CNN) as the main structure [25]. In particular, CNN3-D is used as the baseline classifier. Based on the CNN3-D, the CNN3-D + OV is built by introducing a convex 3-D hyperspectral patch generator unit to oversample the minority classes [35]. The comparison also includes the ssGAN3-D [64], a semisupervised classifier, and the 3-D-HyperGAMO [35], which can be regarded as a combination of the CNN3-D + OV and the ssGAN3-D.
The last experiment evaluates the impact of class imbalance on the performance of semantic segmentation models trained with different loss functions, i.e., cross-entropy (CE), focal loss (FL), asymmetric focal loss (A-FL), and cyclical focal loss (C-FL). To this end, the models were trained on an imbalanced dataset and tested using the mean intersection over union (mIoU) and overall accuracy (OA) metrics. A detailed comparison of these models is conducted for multiple image patch configurations.

D. Evaluation on Standard Machine Learning Classifiers
For each HS dataset, the classification results obtained by standard machine learning algorithms when introducing different oversampling techniques are evaluated. In particular, Figs. 7-9 depict the evolution of the G-mean obtained by the MLR, SVM, MLP, and DMLP in the IP, KSC, and BW scenes when using raw data (no oversampling, none), PCA, RANDOM, SMOTE, SMOTEBD1, SMOTEBD2, and SVM-SMOTE techniques with different amounts of training data, i.e., 3%, 5%, 10%, 15%, and 20%. Furthermore, Tables III-V provide a detailed comparison in terms of F1, G-mean, OA, and AA, using 5% of the labeled data to train the models, and focusing on the oversampling technique with the best G-mean (OS), raw data, and PCA-reduced data.
1) Results on Indian Pines: Fig. 7 provides the G-mean score of each classifier implementing the different oversampling techniques. Furthermore, the respective standard deviations after five Monte Carlo runs are shown. In general, the classifiers improve their results with increasing training data, with the results obtained by the DMLP, MLP, and SVM being superior to those of the MLR. Indeed, it is important to note that the MLR algorithm requires at least 20% of labeled samples to obtain results similar (slightly inferior) to those produced by the other classifiers when training with 10% of labeled samples. In this sense, the DMLP obtains the best G-mean score (84.65%), while classifiers with PCA obtain the lowest results. Regarding the standard deviation, the SVM exhibits the most stable behavior.
It is interesting to note how, for different classifiers, the final classification results with oversampling methods vary only slightly. In fact, the results obtained show that the effectiveness of an oversampling strategy does not vary too much depending on the classifier. The percentage of samples constituting the training set also influences the final results, albeit to a lesser extent. For instance, the MLR achieves better results with the SVM-SMOTE technique when using 3% of labeled data, with SMOTE when using 5%-10%, and with RANDOM oversampling when considering 15%-20%, although the results of these three techniques are quite similar, with slight variations in the variance (SMOTE is more stable in general). Similar behavior can be observed in the SVM among the SVM-SMOTE, SMOTE, and RANDOM methods, with SMOTEBD1 achieving the best results when using 5% (closely followed by SMOTE). For the shallow MLP, however, the RANDOM technique provides the best classification results with few data, closely followed by SVM-SMOTE and SMOTE when using 15%-20% of training data. Finally, for the DMLP, both RANDOM and SVM-SMOTE have a favorable impact on the final results. In all cases, classifiers with PCA obtain the worst G-mean scores, while SVM-SMOTE, SMOTE, and RANDOM obtain the best results.
To further elaborate on these results, Table III provides the classification results for IP in terms of accuracy per class, F1-score, G-mean, OA, and AA. The classifiers have been trained with 5% of labeled samples, considering RAW data, PCA-based data, and the best oversampling strategy (OS). The latter was determined by the G-mean value; thus, the MLR uses SMOTE, the SVM uses SMOTEBD1, and both the MLP and the DMLP use the RANDOM oversampling strategy. Once more, PCA-based classifiers obtain the lowest accuracy, while OS-based classifiers generally obtain the best values, with certain exceptions for the MLR algorithm. Indeed, oversampling enhances the classification performance in terms of F1, G-mean, OA, and AA in the SVM, MLP, and DMLP classifiers, but only in terms of G-mean and AA in the MLR. The DMLP algorithm outperforms the other classification methods, with F1 (75.07%), G-mean (84.65%), OA (75.32%), and AA (75.15%). Focusing on the minority classes 7-Oats and 9-Grass/pasture-mowed, all classifiers significantly enhance the identification and classification of samples of these land-cover types by means of oversampling techniques.
2) Results on Kennedy Space Center: In this section, the RANDOM, SMOTE, SMOTEBD1, SMOTEBD2, and SVM-SMOTE oversampling methods are evaluated on the KSC dataset. Once more, results obtained over RAW data and PCA-based data are included. Classifiers have been trained with 3%, 5%, 10%, 15%, and 20% of randomly selected data. The rest of the data was used for testing.
The obtained G-mean scores are depicted in Fig. 8, together with the respective standard deviations after five Monte Carlo runs. At first glance, it can be seen that KSC requires few samples to characterize the overall scene. Indeed, with 3% of training samples, the G-mean exceeds 85% for all classifiers, while in IP they need roughly 5%-10% of training data. Once more, the G-mean improves as the training set increases. Also, the DMLP obtains the best classification result, closely followed by the SVM when there are few training samples (3%-5%). Indeed, the differences between the DMLP and the SVM are practically negligible.
Similar to the IP scene, the G-mean scores indicate that the effectiveness of an oversampling strategy does not vary too much depending on the classifier. Once more, the percentage of samples constituting the training set influences the final results, albeit to a lesser extent, as the obtained results change proportionally. Nevertheless, the spectral nature of the image does play an important role. While IP is known for its large spectral mixture, KSC is challenging due to the scarcity of labeled samples. In this scene, the PCA-based classifiers obtain better results than RAW, and even than the oversampling methods, in several cases: the MLR (with 15% of labeled data), the MLP (with 5%, 15%, and 20% of the training data), and the DMLP (for all amounts of training data). Moreover, RAW-based results are sometimes very close to those obtained by the oversampling methods, especially in the MLR, where they are very close to the best oversampling method (SVM-SMOTE) and even superior with 3% of training data. Focusing on the MLR: with 3%, RANDOM and RAW-based techniques obtain the best G-mean; with 5%, SVM-SMOTE and RAW-based methods achieve the best results; with 10% of labeled data, SVM-SMOTE and RAW-based methods provide the best score; and with 15%-20%, SVM-SMOTE outperforms the other strategies. Regarding the SVM: with 3%, SMOTEBD2 clearly outperforms the other techniques; however, its performance decreases with 5%, where SMOTEBD1 and SMOTE are the best oversampling methods; with 10%, all the strategies obtain very similar results, with the exception of SMOTEBD2; with 15%, SVM-SMOTE, SMOTE, and RANDOM achieve the best G-mean scores; and finally, with 20% of labeled data, SMOTEBD1, SMOTE, and RANDOM outperform the other sampling strategies.
Regarding the MLP: with 3% of training data, SMOTEBD2 provides the best G-mean, followed by RANDOM oversampling; with 5%, PCA and SMOTE offer the best accuracy; with 10%, RANDOM, PCA, and SMOTE obtain the best G-mean values; with 15%, PCA, SVM-SMOTE, and SMOTE are the best techniques; and finally, with 20% of the training data, PCA, RANDOM, and SMOTE reach the best results. Similar behavior is exhibited by the DMLP, where PCA and SMOTEBD2 stand out with 3% of labeled data; with 5%, PCA is undoubtedly the best technique, followed at a distance by SMOTE; PCA is again the best with 10% of labeled data, followed by SMOTEBD1 and SMOTEBD2; and finally, the SMOTE technique is only surpassed by PCA with 15% and 20% of training data.
To further explore these results, Table IV provides the classification measurements obtained over the KSC scene with 5% of the training data. Consistent with Fig. 8, the best F1, G-mean, and OA values are provided by the MLR with no oversampling method. This is because KSC samples are very sparse, and oversampling techniques based on interpolation may introduce too much variability/noise in the new samples, while RANDOM oversampling makes the information redundant. Also, the MLP and DMLP improve their classification with PCA, which alleviates the overfitting caused by the large spectral dimension, although SMOTE oversampling outperforms the results achieved with RAW data. Focusing on the SVM, the generation of new samples to balance the training set improves the classification results in comparison with the RAW and PCA-based data. Focusing on the minority class 7-Swamp, all classifiers with oversampling techniques improve its identification and classification.
Finally, Table V provides the classification metrics obtained by the spectral classifiers over the BW scene, considering 5% of training data. In general, the application of oversampling methods improves the classification results in comparison with RAW and PCA-data. For instance, the classification measurements obtained by the MLR, SVM, and MLP are noticeably improved when using augmented training. Focusing on F1 and G-mean, the MLR algorithm outperforms the rest of the classifiers. Nevertheless, it is quite interesting that, in the minority class 2-Hippo grass, PCA is more beneficial for some classifiers.
TABLE VI: Classification results of CNN3-D, CNN3-D + OV, ssGAN3-D, and 3-D-HyperGAMO using the disjoint train-test IP dataset and by randomly selecting 5% training samples from the KSC and BW datasets.

E. Experiment on Deep Learning Classifiers
Deep learning models have established themselves as the current state of the art (SoTA) due to the unparalleled results achieved in automatic image processing. In particular, the CNN has stood out in recent years, thanks to its ability to automatically extract descriptive spatial-spectral features from the data. Notwithstanding the impressive classification results achieved by this architecture [30], its performance is significantly degraded by the scarcity of training data and the high variability of the samples. In this sense, oversampling techniques are of great interest to improve the processing of deep networks, and some interesting efforts have been made to implement oversampling techniques for deep models. This experiment compares the performance of the baseline CNN3-D, the CNN3-D + OV (with oversampling), the ssGAN3-D, and the 3-D-HyperGAMO models. Table VI provides the obtained results in terms of OA and AA. The highest values of the different evaluation metrics among classifiers are shown in bold. Focusing on the DIP dataset (see Fig. 4), the comparison ensures that there is no spatial overlap between training and testing samples. It is interesting to note that, despite including an oversampling mechanism (or precisely because of its inclusion), the CNN3-D + OV provides the poorest accuracy results. The complexity of the model, coupled with the sparsity of the data and the large spectral mixture (which increases intraclass variability), prevents the model from achieving better results. Furthermore, the comparison between CNN3-D and CNN3-D + OV highlights the weak performance of the latter on the disjoint IP dataset. On the contrary, the 3-D-HyperGAMO model provides the best accuracy, as it extracts useful information from those pixels adjacent to the minority classes.
Focusing on minority classes, such as 16-Stone steel towers, the synthetic samples generated by the 3-D-HyperGAMO model effectively enhance their classification in comparison with other models, such as the ssGAN3-D. In contrast, poor results are obtained for the 7-Grass/grass-stone class with CNN3-D + OV and ssGAN3-D compared to 3-D-HyperGAMO. This is mainly because these methods (CNN3-D, CNN3-D + OV, and ssGAN3-D) fail to extract information for classes with a low number of training samples, as they do not properly cover the features of minority classes. Finally, the oversampling strategy applied to CNN3-D does not introduce any new information, and thus, its classification results are worse. This fact is aggravated by the high complexity of the IP training set. In addition, Fig. 10 depicts the classification maps produced by the CNN3-D, CNN3-D + OV, ssGAN3-D, and 3-D-HyperGAMO models. The resulting maps tend to smooth the boundaries between different land-cover types. In particular, the 3-D-HyperGAMO attains a visually comprehensible classification map with clear and distinguishable border zones, and its noise is very localized and reduced compared to the other deep models. On the contrary, the CNN3-D and CNN3-D + OV produce noisy classification maps with slight differences between them.
Focusing on the KSC scene, the classifiers have been trained with 5% of labeled samples randomly chosen from the available data. Note that the class imbalance ratio in this scene is lower than in the IP dataset. Nevertheless, the results reported in Table VI indicate that better results are obtained for almost all land-cover classes by alleviating the imbalance problem using oversampling-based models. In this context, the baseline model, i.e., the CNN3-D, achieves the highest accuracy values for classes 1-Scrub, 9-Spartina marsh, and 11-Salt marsh. Nonetheless, regarding minority classes such as 7-Swamp, the 3-D-HyperGAMO provides a large improvement (+13.07%) over the baseline model. In contrast to the IP scene, the results obtained on KSC reveal that the random selection of labeled samples provides a significant improvement in classification performance. Once more, the 3-D-HyperGAMO classifier reports the highest overall metrics, with OA (95.31%) and AA (92.26%). Moreover, Fig. 11 provides the classification maps of the CNN3-D, CNN3-D + OV, ssGAN3-D, and 3-D-HyperGAMO models. The CNN3-D and CNN3-D + OV tend to classify the left zone as 11-Salt marsh, while the ssGAN3-D and 3-D-HyperGAMO models identify the zone as 1-Scrub. Nonetheless, the labeled pixels in the test set remain correctly classified overall, despite the scarcity of ground truth. In addition, CNN3-D and CNN3-D + OV classify the 12-Mud flats and 5-Oak/Broadleaf classes in opposite ways in the center and bottom of the image, although both achieve general improvements compared to ssGAN3-D.
Finally, the performance of the spectral-spatial classifiers on the BW scene is evaluated by randomly selecting 5% of the data for training. It is noteworthy that BW exhibits the lowest class imbalance ratio compared to the IP and KSC scenes. Indeed, the minority and majority classes, i.e., 2-Hippo grass and 9-Acacia woodlands, contain 101 and 314 samples, respectively, a difference of 213 samples. This indicates an imbalance ratio of approximately 3:1. The obtained results are reported in Table VI. Once again, the 3-D-HyperGAMO model outperforms the other classifiers, achieving the best OA (97.43%) and AA (97.4%). Focusing on minority classes, such as 2-Hippo grass, the obtained results show a slight improvement when oversampling is conducted. As in the KSC experiments, the ability to classify minority classes benefits from standard oversampling due to the generation of training data. Fig. 12 depicts the classification maps obtained by the considered classifiers. Similar to previous experiments, the ssGAN3-D and 3-D-HyperGAMO models produce quite similar results, particularly at the left and right areas of the HS image. In contrast, on the left side of the classification maps, the CNN3-D and CNN3-D + OV classifiers identify a large number of pixels belonging to classes 13-Exposed soils and 12-Mixed mopane.
[Figure caption (partially recovered): the first row shows patches obtained using a patch size of 170, the second row shows patches obtained using 197, and the third row displays patches obtained using 703. Patches are extracted from the test samples.]

F. Assessing Imbalance Methods for Semantic Segmentation
As stated before, semantic segmentation is a crucial task in computer vision and machine learning applications, where class imbalance poses a common challenge. Indeed, it is pretty common that certain classes are significantly underrepresented in the training data, which can lead to a poor performance of the segmentation model. To tackle this challenge, several imbalance methods have been evaluated, including loss functions such as focal loss (FL), cross entropy (CE), asymmetric focal loss (A-FL), and cyclical focal loss (C-FL). Next, an evaluation of the effectiveness of these loss functions in semantic segmentation tasks is conducted for the AeroRIT dataset.
The behavior of the aforementioned loss functions is presented in Table VII. The mean intersection over union (mIoU) metric is reported for each class along with the respective number of training (SamplesTR), validation (SamplesVAL), and test samples (SamplesTE). It can be observed that a notable imbalance training percentage (ImbalanceTR) is present for the majority of the classes. Specifically, classes vegetation and roads exhibit an imbalance percentage of 46.82% and 30.93%, respectively, while classes cars and water have significantly less training data (i.e., minority classes). It is pertinent to note that the overall accuracy (OA) metric exhibits similar results across all models, thereby rendering it unsuitable for the comprehensive evaluation of a segmentation model performance. However, balance-aware methods, such as FL, A-FL, and C-FL, significantly improve the mIoU, whereas CE performs the worst among the evaluated models due to its inability to address imbalanced classes.
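The contrast between OA and mIoU noted above can be reproduced with a small confusion-matrix example. This is only a sketch: the functions `overall_accuracy` and `per_class_iou` are hypothetical helpers, the standard definitions of both metrics are assumed, every class is assumed to appear in the ground truth, and the pixel counts are invented for illustration rather than taken from Table VII.

```python
import numpy as np

def overall_accuracy(cm):
    """Fraction of correctly labeled pixels (rows: true, columns: predicted)."""
    cm = np.asarray(cm, dtype=float)
    return np.trace(cm) / cm.sum()

def per_class_iou(cm):
    """IoU_c = TP_c / (TP_c + FP_c + FN_c) for every class c."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp  # predicted as c, but belonging to another class
    fn = cm.sum(axis=1) - tp  # belonging to c, but predicted as another class
    return tp / (tp + fp + fn)

# Invented counts for three classes: vegetation, roads, cars.
# The minority class (cars, 10 pixels) is never predicted.
cm = [[890, 10, 0],
      [10, 80, 0],
      [8, 2, 0]]
oa = overall_accuracy(cm)
iou = per_class_iou(cm)
miou = iou.mean()
```

In this invented case, the OA remains at 0.97 although the minority class is missed completely, while the mIoU drops to roughly 0.58 because the zero IoU of that class weighs as much as the IoU of the majority classes.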
Finally, Fig. 13 presents the prediction patches for the studied loss functions, where the aforementioned benefits in terms of mIoU are observable for the FL, A-FL, and C-FL models. These models show a better representation of the original. Therefore, the obtained findings suggest that balance-aware loss functions should be considered in the development of semantic segmentation models for imbalanced datasets.

V. CONCLUSION
This article provides a review of different oversampling and class imbalance methods for the classification of remotely sensed hyperspectral scenes. Specifically, the goal of these methods is to alleviate the problem of class imbalance. Different oversampling algorithms have been reviewed, i.e., random oversampling, SMOTE, SMOTE BORDERLINE-1, SMOTE BORDERLINE-2, SVM-SMOTE, K-Means SMOTE, and ADASYN. Moreover, comprehensive experiments have been conducted to empirically evaluate the random oversampling, SMOTE, SMOTE BORDERLINE-1, SMOTE BORDERLINE-2, and SVM-SMOTE oversampling methods over widely used machine learning classifiers, such as the MLR, SVM, shallow MLP, and deep MLP, using different amounts of training data. In addition, three deep learning approaches have been tested, i.e., CNN3-D + OV, ssGAN3-D, and 3-D-HyperGAMO. As a result, the impact of oversampling methods during HS data classification has been estimated.
The obtained results demonstrate that the exploitation of oversampling techniques enhances the training procedure, while improving the final classification performance without modifying the operational behaviour of the main classifier. The study has also exposed the limitations of some oversampling mechanisms, such as K-Means SMOTE and ADASYN, which impose restrictive constraints on the minimum number of samples per class. On the other hand, it highlights the need to develop new oversampling mechanisms for deep networks that allow a good tradeoff between the complexity of the architecture and the final results. Additionally, the evaluation of imbalance methods in semantic segmentation has revealed several insights. First, it was observed that the traditional cross-entropy loss function struggles with imbalanced datasets, resulting in poor performance for minority classes. This has highlighted the importance of using balance-aware loss functions for addressing class imbalance. Finally, the study has shown that overall accuracy is not a reliable metric for evaluating performance on imbalanced datasets, and mIoU should be preferred instead.
As future work, it is proposed to extend the study performed to new techniques of both oversampling and undersampling, the latter being of great interest, in order to test the classification capabilities after selecting a subset of samples from the original set.