A Robust Dynamic Classifier Selection Approach for Hyperspectral Images with Imprecise Label Information

Supervised hyperspectral image (HSI) classification relies on accurate label information. However, it is not always possible to collect perfectly accurate labels for training samples. This motivates the development of classifiers that are sufficiently robust to some reasonable amounts of errors in data labels. Despite the growing importance of this aspect, it has not been sufficiently studied in the literature yet. In this paper, we analyze the effect of erroneous sample labels on probability distributions of the principal components of HSIs, and provide in this way a statistical analysis of the resulting uncertainty in classifiers. Building on the theory of imprecise probabilities, we develop a novel robust dynamic classifier selection (R-DCS) model for data classification with erroneous labels. Particularly, spectral and spatial features are extracted from HSIs to construct two individual classifiers for the dynamic selection, respectively. The proposed R-DCS model is based on the robustness of the classifiers’ predictions: the extent to which a classifier can be altered without changing its prediction. We provide three possible selection strategies for the proposed model with different computational complexities and apply them on three benchmark data sets. Experimental results demonstrate that the proposed model outperforms the individual classifiers it selects from and is more robust to errors in labels compared to widely adopted approaches.

Supervised classification plays a vital role in analyzing HSIs by assigning image pixels to distinct categories or classes of interest available in a scene, based on a relatively small number of annotated examples. Recent comprehensive surveys on HSI classification in remote sensing include [11][12][13][14][15]. It is generally agreed that incorporating spatial context together with spectral information leads to better classification results than using spectral information alone [16][17][18][19][20]. Further improvements in the classification accuracy can be obtained by combining multiple data sources, e.g., by augmenting HSI data with Light Detection and Ranging (LiDAR) data [21][22][23], Synthetic Aperture Radar (SAR) data [24,25] and/or high-resolution colour images [26][27][28]. Fusion of these multiple data sources is typically accomplished at feature level [29][30][31], or at decision level [26,32,33]. The concept of multiple classifier systems has been widely studied as a method for designing high performance classification systems at the decision level [34][35][36]. Among these, the Dynamic Classifier Selection (DCS) [37,38] approach selects, for each sample, the classifier with the highest estimated probability of classifying it correctly. By design, the combined classifier is meant to outperform each of the classifiers it is based on [33,39,40]. The key idea of DCS is to identify the best classifier dynamically for each sample from a set of classifiers. This classifier is usually selected based on a local region of the feature space in which the query sample is located. Most works use the K-Nearest Neighbors technique (grouping samples with similar features) to define this local region [41][42][43]. Then, for a given unseen sample, the best classifier is estimated based on some selection criteria [44,45]. In this work, we group samples differently, by incorporating the robustness concept into the model specification.
While current machine learning systems have shown excellent performance in various applications [46][47][48], they are not yet sufficiently robust to perturbations in the data and to model errors to reliably support high-stakes applications [49][50][51]. Therefore, increasing attention is being devoted to various robustness aspects of the models and inference procedures [52][53][54]. The work in [55,56] proposed robust methods by adopting empirical Bayesian learning strategies to parameterize the prior and used this Bayesian perspective for learning autoregressive graphical models and Kronecker graphical models. Earlier work by one of us [57] analyzed the global sensitivity of a maximum a posteriori (MAP) configuration of a discrete probabilistic graphical model (PGM) with respect to perturbations of its parameters, and provided an algorithm for evaluating the robustness of the MAP configuration with respect to those perturbations. For a family of PGMs, obtained by perturbation, the critical perturbation threshold was defined as the maximum perturbation level that does not alter the MAP solution. In classification problems, these thresholds determine the level to which the classifier parameters can be altered without changing its prediction. The experiments in [57] empirically showed that instances with higher perturbation thresholds tend to have a higher chance of being classified correctly (when evaluated on instances with similar perturbation thresholds). We combined this property with DCS and applied it to classification in our earlier work [58], but only as a proof of concept for toy cases with binary classes and two classifiers. In a follow-up work [59] we presented an abstract concept of how the robustness measures can be employed to improve the classification performance of DCS in HSI classification.
Here we develop a novel robust DCS (R-DCS) model in a general setting with multiple classes and multiple classifiers, and use it to take into account the imprecision of the model that is caused by errors in the sample labels. The main novelty lies in interpreting erroneous labels as model imprecision and addressing this problem from the point of view of the robustness of PGMs to model perturbations; this also sets this work apart from our previous, more theoretical work on the robustness of PGMs [57][58][59], which did not consider the problem of erroneous labels. The main issue with erroneous labels, also referred to as noisy labels [60,61], is that they mislead the model training and severely decrease the classification performance [62][63][64]. Recent works that address this problem usually focus on noisy label detection and cleansing [65][66][67]. However, detection of erroneous labels is never entirely reliable, and their correction even less so, especially when the sample labels are relatively scarce or spatially scattered across the image. Thus, it is imperative to study the robustness of classifier models under different levels of label noise and to understand how the performance of different classifiers deteriorates with label noise.
We hypothesize that the framework of imprecise probabilities can offer a viable approach to improve a classifier's robustness to low-to-moderate amounts of label noise. Therefore, we build our robust DCS (R-DCS) model based on an imprecise probabilistic extension of a classical PGM. Particularly, we build on Naive Bayes Classifiers (NBCs), but it is possible to extend the proposed framework to other classification models. We use an adapted version of the Imprecise Dirichlet Model (IDM) [68] to perturb the local probability mass functions in the model to corresponding probability sets. This imprecise probabilistic extension of an NBC is called a Naive Credal Classifier (NCC) [69]. The amount of perturbation of such an NCC is determined by a hyperparameter that specifies the degree of imprecision of the IDM. The maximum value of the hyperparameter under which the NCC still remains determinate (that is, yields a single prediction) is the perturbation threshold of the NCC. Such perturbation thresholds essentially show how much we can vary the local models of the NBC without changing the prediction result, thereby providing us with a framework for dealing with model uncertainty.
The influence of label noise on the classification performance was studied earlier from a different perspective in [70,71]. The work in [70] showed experimentally that NBCs yield favourable performance in the presence of label noise compared to classifiers based on KNN, support vector machine (SVM) and decision trees [72]. These conclusions were based on empirical classification results on thirteen synthetic datasets, designed for several selected problems from the UCI repository [73]. The work in [71] empirically analyzed the effect of class noise on supervised learning on eight medical datasets. Only the end classification results were analysed in both works.
Here we take a different approach and characterise statistically the effect of erroneous labels on the statistical distributions of features and on the estimated spectral signatures of landcover classes. From this perspective, the empirical results clearly explain the considerable robustness of NBCs to label noise. At the same time, they show how erroneous labels affect the actual conditional distributions of features given the class labels, introducing inaccuracies in the classification process. This motivates us to employ the framework of imprecise probabilities to develop a classifier that is more robust to model uncertainties. As a first step, we use this framework to analyze the effect of noisy labels on the robustness of the predictions of NBCs.
Next, we put forward our robust DCS (R-DCS) model and first apply it to HSI classification in the presence of noisy labels. We perform dynamic selection among two classifiers: one based on spectral features and the other based on spatial features. Both of these are formulated as an NBC. Specifically, we apply a principal component analysis (PCA) to extract spectral features from HSIs, where the decorrelating property of PCA justifies the conditional independence assumption that NBC relies on. The spatial features are generated by applying morphological operators on the first five principal components (PCs). We also apply our R-DCS model to a multi-source data set that includes HSI and LiDAR data. For this data set, we perform dynamic selection among three classifiers: the two classifiers corresponding to HSI and a third classifier based on the elevation information in the LiDAR data.
As selection criteria for our R-DCS, we define three selection strategies (R-T, R-LA and R-EU) that differ in computational complexity. R-T simply selects the classifier with the highest perturbation threshold. While computationally efficient, this approach does not always perform well because the exact relation between perturbation thresholds and performance differs from one classifier to another. Two other strategies are proposed to improve upon this by determining empirical relations between the perturbation thresholds of different classifiers and their probabilities of correctly classifying the considered instance. Particularly, the empirical probabilities of correctly classifying the test sample are estimated based on the training samples that are closest to the test sample in terms of a given perturbation distance. In R-LA, the perturbation distance between two data samples is defined by the absolute value of the difference in their perturbation thresholds for a given classifier. R-EU defines this distance as the Euclidean distance in a space spanned by the perturbation thresholds of all the considered classifiers. R-EU is computationally more complex but outperforms the others in most cases of practical importance. Experimental results on three real data sets demonstrate the efficacy of the proposed model for HSI classification in the presence of noisy labels. In the two HSI data sets, the R-EU strategy performs best among the three selection strategies when the label noise is relatively low, while R-T and R-LA offer better performance when the label noise is rather high. In the multi-source data set, the R-EU strategy outperforms the other methods in all cases.
The main contributions of the paper can be summarized as follows:
(1) We characterise statistically the effect of label noise on the estimated spectral signatures of different classes in HSIs and on the resulting probability distributions (both prior probabilities and conditional probabilities given the class label). This analysis provides insights into the robustness of NBC models to erroneous labels, and at the same time it shows in which way errors in data labels introduce uncertainty in the involved statistical distributions of the classifier features.
(2) We further analyze the effect of noisy labels on the robustness of NBCs and, in particular, on perturbation thresholds of the corresponding NCCs. The results show that this robustness decreases as the amount of label noise increases.
(3) In order to cope with this decrease in robustness due to label noise, we propose a robust DCS model, dubbed R-DCS, which selects classifiers that are more robust, using different selection strategies that are based on the critical perturbation thresholds of the involved classifiers. In particular, we provide three possible selection strategies: R-T, R-LA and R-EU. Two of these (R-T and R-LA) already appeared in a preliminary version of this work [59], but in a more abstract set-up, without exploring their performance in the presence of label noise. The third selection strategy, R-EU, proposed here, is computationally more complex, but performs better than the other two in cases with low to moderate levels of label noise (up to 30%), which are of most interest in practice.
(4) The proposed R-DCS models are validated on three real data sets. Compared to our conference paper [59], we incorporate label noise into HSI classification and conduct more experiments to evaluate the proposed model. The results reveal that the proposed model outperforms each of the individual classifiers it selects from and is more robust to errors in labels compared to some common methods using SVM and graph-based feature fusion.
The rest of this paper is organized as follows. Section 2 contains preliminaries, reviewing briefly the basic concept behind NBC, its imprecise-probabilistic extension NCC and the notion of perturbation thresholds that we build upon. Section 3 describes three representative (hyperspectral and hybrid) real data sets that we use in our experiments. The following two sections contain the main contributions of this work: Section 4 is devoted to the statistical characterisation of the influence of noisy labels on spectral signatures, on statistical distributions of the classifier features and on perturbation thresholds. In Section 5, the proposed R-DCS model is described and three possible selection strategies based on imprecise-probabilistic measures are defined. The overall proposed framework for robust dynamic classifier selection for hyperspectral images is presented and discussed. Experimental results on the three real data sets are reported in Section 6. The results demonstrate that the proposed model outperforms each classifier it selects from. Compared to competing approaches such as SVM and graph-based feature fusion, the proposed model proves to be more robust to label errors, inheriting this robustness from the NBCs that are at its core. A detailed discussion of the main results and findings is presented in Section 7, and Section 8 concludes the work.

Preliminaries
We first introduce the basic concept behind Naive Bayes Classifiers (NBCs) in this section. Next, imprecise extensions of NBCs, called Naive Credal Classifiers (NCCs), are introduced and their perturbation thresholds are defined.

Naive Bayes Classifiers
Let $C$ denote the class variable taking values $c$ in a finite set $\mathcal{C}$. We denote by $F_i$ the $i$-th feature variable taking values $f_i$ in a finite set $\mathcal{F}_i$, $i \in \{1, \dots, m\}$, where $m$ is the number of features. For notational convenience, we gather all feature variables in a single vector $F = (F_1, \dots, F_m)$ that takes values $f = (f_1, \dots, f_m)$ in $\mathcal{F}_1 \times \cdots \times \mathcal{F}_m$.
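For any given feature vector $f$, an NBC returns the Maximum a Posteriori (MAP) estimate of the class variable $C$, assuming the conditional independence $P(f \mid c) = \prod_{i=1}^{m} P(f_i \mid c)$. The estimated class is thus:
\[ \hat{c} = \arg\max_{c \in \mathcal{C}} P(c \mid f) = \arg\max_{c \in \mathcal{C}} P(c) \prod_{i=1}^{m} P(f_i \mid c). \tag{1} \]
The involved (conditional) probabilities are typically estimated from data. To avoid falsely estimated zero probabilities due to empirical estimation, we adopt the common method of Laplace smoothing [74][75][76], meaning that for all $i \in \{1, \dots, m\}$, $c \in \mathcal{C}$ and $f_i \in \mathcal{F}_i$:
\[ P(c) = \frac{n(c) + 1}{n + |\mathcal{C}|} \quad \text{and} \quad P(f_i \mid c) = \frac{n(c, f_i) + 1}{n(c) + |\mathcal{F}_i|}, \tag{2} \]
where $n$ is the total number of data points, $n(c)$ is the number of data points with class $c$ and $n(c, f_i)$ is the number of data points with class $c$ and $i$-th feature $f_i$.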
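As an illustration of the estimation and MAP steps described above, the following is a minimal sketch of an NBC on discrete features. It assumes the usual add-one form of Laplace smoothing, zero-based integer codes for classes and feature values, and NumPy; the function names `fit_nbc` and `predict_nbc` are hypothetical and not part of the original work.

```python
import numpy as np

def fit_nbc(features, labels, n_classes, n_values):
    """Estimate Laplace-smoothed log-probabilities of an NBC.

    features: (n, m) int array; feature i takes values in {0, ..., n_values[i]-1}
    labels:   (n,) int array with values in {0, ..., n_classes-1}
    """
    features = np.asarray(features)
    labels = np.asarray(labels)
    n, m = features.shape
    class_counts = np.bincount(labels, minlength=n_classes)       # n(c)
    # Assumed add-one smoothing of the prior: (n(c) + 1) / (n + |C|)
    log_prior = np.log((class_counts + 1) / (n + n_classes))
    log_cond = []
    for i in range(m):
        counts = np.zeros((n_classes, n_values[i]))                # n(c, f_i)
        np.add.at(counts, (labels, features[:, i]), 1)
        # Assumed add-one smoothing: (n(c, f_i) + 1) / (n(c) + |F_i|)
        log_cond.append(np.log((counts + 1) / (class_counts[:, None] + n_values[i])))
    return log_prior, log_cond

def predict_nbc(log_prior, log_cond, f):
    """Return the MAP class for one feature vector f (sequence of m ints)."""
    scores = log_prior + sum(lc[:, v] for lc, v in zip(log_cond, f))
    return int(np.argmax(scores))
```

Working in log-probabilities avoids numerical underflow when the number of features is large; ties are broken in favour of the lowest class index, which is a choice of this sketch.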

Naive Credal Classifiers and Perturbation Thresholds
The Naive Credal Classifier (NCC) [69] is an extension of the Naive Bayes Classifier to the framework of imprecise probabilities that can be used to robustify the inferences of an NBC. Basically, the idea is to consider an NBC whose local probabilities are only partially specified.
Instead of considering a probability mass function $P(C)$ that contains the probabilities $P(c)$ of each of the classes $c \in \mathcal{C}$, an NCC considers a set of such probability mass functions, which we denote by $\mathcal{P}(C)$. Similarly, for every class $c \in \mathcal{C}$ and every $i \in \{1, \dots, m\}$, it considers a set $\mathcal{P}(F_i \mid c)$ of conditional probability mass functions. In general, these local sets can be learned from data, elicited from experts, or obtained by considering neighbourhoods around the local models of an NBC. We here consider the first option. Specifically, we use a version of the Imprecise Dirichlet Model (IDM) [68], suitably adapted such that it is guaranteed to contain the result of Laplace smoothing: $P(C)$ is taken to belong to $\mathcal{P}(C)$ if and only if there is a probability mass function $t$ on $\mathcal{C}$ such that
\[ P(c) = \frac{n(c) + 1 + s\,t(c)}{n + |\mathcal{C}| + s} \quad \text{for all } c \in \mathcal{C}, \]
where $s$ is a fixed hyperparameter that determines the degree of imprecision. For every $i \in \{1, \dots, m\}$ and $c \in \mathcal{C}$, the local set $\mathcal{P}(F_i \mid c)$ is defined similarly.
If we now choose a single probability mass function $P(C)$ in $\mathcal{P}(C)$ and, for every $c \in \mathcal{C}$ and $i \in \{1, \dots, m\}$, a single conditional probability mass function $P(F_i \mid c)$ in $\mathcal{P}(F_i \mid c)$, we obtain a single NBC. By doing this in every possible way, we obtain a set of NBCs. This set is a Naive Credal Classifier (NCC) [69].
Classification for such an NCC is done by performing classification with each of the NBCs it consists of separately. If all these NBCs agree on which class to return, then the output of the NCC will be that class. If they do not agree, the result of the NCC is indeterminate and consists of a set of possible classes, amongst which it is unable to choose.
The maximum value of $s$ for which the result of the NCC is still determinate is a particular case of the critical perturbation threshold defined in [57]. It provides a numerical indication of the robustness of the NBC's prediction with respect to changes to the probabilities that make up the model. Furthermore, it has also been observed that for any given instance, the corresponding critical perturbation threshold serves as a good indicator for the performance of the original NBC: instances with higher thresholds are classified correctly more often [57]. In the following, we denote this perturbation threshold by $s^{(\mathrm{per})}$ and we compute it from the data at hand using the algorithm in [57].
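To make the notion of a perturbation threshold more concrete, the sketch below approximates it by bisection. It is not the algorithm of [57]. It assumes that each local set has the IDM-like form (count + 1 + s·t) / (total + number of values + s) for a probability mass function t, so that the lower and upper probabilities are attained at t = 0 and t = 1. Under that assumption, the class c returned by the NBC stays the unique prediction as long as the smallest possible ratio P(c, f) / P(c', f) exceeds 1 for every other class c', and this ratio decreases with s. All function names are hypothetical.

```python
import numpy as np

def min_log_ratio(s, c, c_alt, f, class_counts, joint_counts):
    """Smallest value of log[P(c, f) / P(c_alt, f)] over the assumed credal sets.

    class_counts[c]       = n(c)
    joint_counts[i][c, v] = number of samples with class c and F_i = v
    """
    # Prior ratio is smallest when all extra mass s is given to the competitor
    # (the common denominator n + |C| + s cancels in the ratio).
    total = np.log(class_counts[c] + 1) - np.log(class_counts[c_alt] + 1 + s)
    for i, v in enumerate(f):
        k = joint_counts[i].shape[1]                                  # |F_i|
        lower = (joint_counts[i][c, v] + 1) / (class_counts[c] + k + s)
        upper = (joint_counts[i][c_alt, v] + 1 + s) / (class_counts[c_alt] + k + s)
        total += np.log(lower) - np.log(upper)
    return total

def is_determinate(s, c, f, class_counts, joint_counts):
    """True if class c dominates every other class for imprecision level s."""
    others = [a for a in range(len(class_counts)) if a != c]
    return all(min_log_ratio(s, c, a, f, class_counts, joint_counts) > 0 for a in others)

def perturbation_threshold(c, f, class_counts, joint_counts, tol=1e-6):
    """Approximate the largest s for which c remains the unique prediction."""
    if not is_determinate(0.0, c, f, class_counts, joint_counts):
        return 0.0
    lo, hi = 0.0, 1.0
    while is_determinate(hi, c, f, class_counts, joint_counts):      # bracket the threshold
        lo, hi = hi, 2.0 * hi
    while hi - lo > tol:                                               # bisection
        mid = 0.5 * (lo + hi)
        if is_determinate(mid, c, f, class_counts, joint_counts):
            lo = mid
        else:
            hi = mid
    return lo
```

Bisection is valid here only because the minimal ratio is monotonically decreasing in s under the assumed form of the local sets; for other credal sets a dedicated algorithm such as the one in [57] would be needed.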

Datasets
We conduct our experiments on two real HSI datasets: Salinas Scene and HYDICE Urban, and a multi-source data set GRSS2013.
The Salinas Scene dataset was gathered by the AVIRIS sensor with 224 bands in 1998 over Salinas Valley, California. The original data set consists of 512 × 217 pixels with a spatial resolution of 3.7 m per pixel. It includes 16 classes in total. For our experiments, we select a typical region of size 100 × 80 shown with a false color image in Figure 1a. There are six classes in this region, as listed in Table 1, which also shows the number of labeled samples per class. Figure 1b shows the ground truth spatial distribution of these classes.

Table 1. Classes and number of labelled samples in the selected region of the Salinas Scene data set.
No.    Class Name               Labelled Samples
1      Brocoli-green-weeds-1    1014
2      Brocoli-green-weeds-2    652
3      Grapes-untrained         1965
4      Lettuce-romaine-4wk      711
5      Lettuce-romaine-5wk      930
6      Lettuce-romaine-6wk      229
Total                           5501

The HYDICE Urban dataset was captured by the HYDICE sensor. The original data contains 307 × 307 pixels, each of which corresponds to a 2 × 2 m² area. There are 210 wavelengths ranging from 400 nm to 2500 nm, resulting in a spectral resolution of 10 nm. In our experiments, we use a part of this image with size 200 × 200 shown in Figure 1c. The number of bands was reduced from 210 to 188 by removing the bands 104-108, 139-151 and 207-210, which were seriously corrupted by atmospheric effects and water absorption. Detailed information on the available classes and the number of samples per class is given in Table 2. The ground truth classification is shown in Figure 1d.
Our third data set, GRSS2013, was a benchmark data set for the 2013 IEEE GRSS data fusion contest [77]. This data set, consisting of HSI and LiDAR data, was acquired over the University of Houston campus and the neighboring urban area in June 2012. The HSI has 144 spectral bands and 349 × 1905 pixels, containing in total 15 classes as shown in Table 3. The false color image is shown in Figure 2a and the ground truth classification is shown in Figure 2b.

Table 2. Classes and number of labelled samples in the HYDICE Urban data set.
No.    Class Name    Labelled Samples
1      Trees         3123
2      Concrete      1410
3      Soil          637
4      Grass         4044
5      Asphalt       912
Total                10,126

Robustness Analysis with Noisy Labels
We now move on to analyze the effect of errors in labels on the estimation of spectral signatures of landcover classes and on the estimated statistical distributions of the classes and of features given the class labels. This analysis gives an insight into how errors in labels introduce uncertainty about the models that various classifiers rely upon. In order to study (and further on improve) the robustness to these model uncertainties, we adopt the framework of imprecise probabilities.

Model Uncertainties Due to Noisy Labels
Here we analyze the influence of label noise on the estimated spectral signatures of different classes in an image as well as the effect on the estimated prior probabilities of the given classes and the conditional probability distributions of the features given the class label. The experiments here are conducted with NBCs. The NCC framework, which was introduced in Section 2.2, will be used further on to define the perturbation thresholds that will be employed in our new model.
We conduct experiments on the two real HSIs described in Section 3 and shown in Figure 1. To reduce the data dimensionality, PCA is commonly applied to the original HSI data. We use discretized principal component values as the features for NBCs. Because PCA decorrelates the features, treating them as conditionally independent given the class label is a reasonable approximation, in line with the assumption underlying NBCs. In all the experiments, the PC values are uniformly discretized into twenty intervals. We define the level of label noise ρ as the proportion of training samples that have wrong labels. These erroneous labels are chosen at random in C \ {c} as in [63], with C the set of class values and c the true class.

Figure 3 illustrates how label noise is introduced in the Salinas Scene dataset. The first PC is shown in Figure 3a. All the labelled samples in Class 1 are highlighted in Figure 3b. Next, we randomly select 50% of the highlighted samples as the training samples for Class 1 (Figure 3c). We also select at random a given portion ρ of the total training samples (from various classes) and flip each of them to one of the remaining classes at random. Figure 3d illustrates an instance of the resulting Class 1 labels for ρ = 0.5. Different colours denote different original classes of the training samples that were flipped to Class 1. Note that this value of ρ is chosen merely for clarity of illustration; a situation with 50% of wrong labels is unlikely to be relevant in practice.

Figure 4 shows the average spectral signatures of each class on two real HSI data sets: Salinas Scene and HYDICE Urban. These average spectral signatures are obtained by taking the mean value of spectral intensities for each class. Without label noise, the spectral signatures of different classes show different trends. In the presence of label noise, the spectral signatures falsely appear to be more similar to each other, which will inevitably affect the classification accuracy.
Observe that when ρ = 0 the spectral responses of Class 1 and Class 2 in Salinas Scene are quite similar. This is because the materials corresponding to these two classes are similar (two types of brocoli). For HYDICE Urban, the spectral signatures without label noise are rather different from each other and the effect of erroneous labels is evident. In both cases, label noise tends to make all the spectral signatures more uniform, as expected, because each of them is now computed from a mixture of different classes. Figure 5 shows average spectral signatures together with their standard deviation regions, for two different classes from HYDICE Urban, without label noise and with label noise ρ = 0.5. The solid line in the middle shows, for a given class, the mean value of the pixel intensities in each spectral band, whereas the shaded region denotes the standard deviation over all the labeled pixels in that class. With erroneous labels, the spread gets larger and the means of the intensities change as well. Figure 6 shows overlays of spectral signatures for ρ = 0 and ρ = 0.5 for the six classes from the other test image (Salinas Scene). We observe again that errors in labels result in wider shaded regions (larger standard deviations) and in decreased differences between the average spectral signatures of different classes due to the mixing effect, echoing the results in Figure 4.
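The noise-injection protocol used in this section (a fraction ρ of the training labels is flipped to one of the other classes, chosen uniformly at random) and the uniform discretization of PC values can be sketched as follows. This is an illustrative sketch only: zero-based class codes, the use of the per-feature minimum and maximum as the discretization range, and the function names are assumptions that are not specified in the text.

```python
import numpy as np

def flip_labels(labels, n_classes, rho, rng):
    """Return a copy of `labels` in which a fraction rho carries a wrong class."""
    labels = np.asarray(labels)
    noisy = labels.copy()
    n_flip = int(round(rho * labels.size))
    idx = rng.choice(labels.size, size=n_flip, replace=False)
    # Adding an offset in {1, ..., n_classes-1} modulo n_classes never returns
    # the original class and is uniform over the remaining classes.
    offsets = rng.integers(1, n_classes, size=n_flip)
    noisy[idx] = (labels[idx] + offsets) % n_classes
    return noisy

def discretize_uniform(values, n_bins=20):
    """Map each column of `values` (n, m) to integer bins of equal width.

    Assumption: the bin range of each feature spans its observed min and max.
    """
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(axis=0), values.max(axis=0)
    width = np.where(hi > lo, (hi - lo) / n_bins, 1.0)
    bins = np.floor((values - lo) / width).astype(int)
    return np.clip(bins, 0, n_bins - 1)      # the maximum falls into the last bin
```

A typical use would draw `rng = np.random.default_rng(seed)` once per run, so that each of the repeated runs uses a different random selection of flipped samples.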
Next, we analyze the effect of label noise on the estimated prior probabilities of different classes and the conditional probability distributions of the features given the class label. The prior probabilities and the conditional probabilities are computed by Equation (2). Each of the results is obtained as an average over 10 runs on different training samples. In each run, 50% of the labelled samples from each class are selected at random as training samples and their labels are perturbed at random according to the given ρ. Figure 7 shows the effect of label noise on prior probabilities of classes with ρ ∈ {0, 0.1, 0.3, 0.5} in the two data sets. While the actual prior probabilities of different classes are significantly different from each other, e.g., for the Salinas Scene data set, p(C = 3) = 0.36 and p(C = 6) = 0.04 for ρ = 0, these differences become smaller when label noise increases. Figures 8 and 9 show the probability distributions of the first two PCs in the two data sets, respectively, given the class label and with different levels of label noise ρ ∈ {0, 0.1, 0.3, 0.5}. For the Salinas Scene data set, when ρ = 0.5, the distribution conditioned on class 6 changes considerably in both PCs, since there are relatively few labelled samples from class 6 in this data set. The distributions conditioned on other classes keep a similar shape when increasing ρ from 0 to 0.5, but the peak values decrease and the distributions become increasingly flat compared to the distributions without label noise. The distributions of the second PC show similar behaviour, becoming flatter when the level of label noise increases. The conditional distributions for the HYDICE Urban data set show similar behaviour to those for the Salinas Scene data set.
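The shrinking differences between the prior probabilities can be understood with a short back-of-the-envelope calculation. The expression below is our own illustrative derivation, not a result taken from the experiments: it assumes that a fraction ρ of the samples is selected uniformly at random, that each selected label is replaced by one of the other |C| − 1 classes with equal probability, and it ignores the small correction due to Laplace smoothing. Here p(c) denotes the noise-free prior and p̃(c) the prior estimated from the noisy labels.

```latex
\[
\mathbb{E}\big[\tilde{p}(c)\big]
  = (1-\rho)\,p(c) + \frac{\rho}{|\mathcal{C}|-1}\,\big(1 - p(c)\big),
\qquad\text{equivalently}\qquad
\mathbb{E}\big[\tilde{p}(c)\big] - \frac{1}{|\mathcal{C}|}
  = \Big(1 - \frac{\rho\,|\mathcal{C}|}{|\mathcal{C}|-1}\Big)
    \Big(p(c) - \frac{1}{|\mathcal{C}|}\Big).
\]
```

Under these assumptions, every deviation from the uniform prior is scaled by the same factor. For six classes and ρ = 0.5 this factor is 0.4, so priors of 0.36 and 0.04 would move to roughly 0.24 and 0.12 in expectation; these numbers are a consequence of the formula and are not read from Figure 7.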
The presented results provide insights into how erroneous labels affect the estimation of the spectral signatures per class as well as the prior probabilities of the classes and the distributions of the classifier features (i.e., the underlying statistical model that governs the operation of NBCs and other related statistical models) conditioned on the class label. The analysis of these results demonstrates clearly that erroneous labels lead to model uncertainties, which will in their turn affect the classification performance. In order to mitigate this, models that are robust to model uncertainty are needed. We build such a robust classifier using the framework of imprecise probabilities. In particular, we adopt NCCs, an extension of NBCs, which allows us to assess how robust an NBC is with respect to model uncertainty. A clever dynamic selection among multiple NBCs will then lead to a robust dynamic classifier selection approach that we advocate in this work.

Impact of Noisy Labels on Perturbation Thresholds
An NCC provides an elegant way to account for model uncertainties by extending the probability mass functions in an NBC to corresponding sets of probabilities. Recall that the perturbation threshold $s^{(\mathrm{per})}$ of an NCC is defined as the maximum value of $s$ under which the NCC remains determinate. Here we analyze the effect of noisy labels on these perturbation thresholds. We conduct experiments on the Salinas Scene data set. Figure 10 shows the relation between the level of label noise and the perturbation thresholds. Let $\mathrm{Por}(s^{(\mathrm{per})}; \rho)$ denote the portion of samples whose perturbation thresholds are larger than $s^{(\mathrm{per})}$ under the label noise level $\rho$. The family of curves $\mathrm{Por}(s^{(\mathrm{per})}; \rho)$ in Figure 10 shows clearly that the portion of samples whose perturbation thresholds are above a given level drops when the level of label noise increases. In other words, the perturbation thresholds for some samples get smaller when the level of label noise increases. This means that, as expected, the robustness of the classification decreases as $\rho$ grows. Given the correlation between robustness and accuracy [57], this will result in a lower accuracy as well. In order to mitigate this drop in robustness (and hence accuracy), we will now develop a method that uses perturbation thresholds to minimize this unwanted effect.

Figure 10. The effect of erroneous labels on perturbation thresholds of NCCs estimated empirically on the Salinas Scene data set. The portion of samples whose perturbation threshold is above a given value $s^{(\mathrm{per})}$ is plotted for different values of $\rho$ and denoted by $\mathrm{Por}(s^{(\mathrm{per})}; \rho)$. Observe that this portion drops when $\rho$ is larger, indicating that with increasing label noise the performance becomes less reliable on more and more samples.
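The portion of samples whose threshold exceeds a given level is simply an empirical survival function of the thresholds. A minimal sketch, assuming the thresholds of all samples for one noise level are already available as a one-dimensional array (the function name `portion_above` is hypothetical):

```python
import numpy as np

def portion_above(thresholds, levels):
    """Fraction of samples whose perturbation threshold exceeds each level.

    thresholds: (n_samples,) perturbation thresholds for one noise level
    levels:     (n_levels,) values at which the curve is evaluated
    """
    thresholds = np.asarray(thresholds, dtype=float)
    levels = np.asarray(levels, dtype=float)
    # Compare every level with every threshold, then average over samples.
    return (thresholds[None, :] > levels[:, None]).mean(axis=1)
```

Evaluating this function on the thresholds obtained for several noise levels, over a common grid of `levels`, would give one curve per noise level.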

Robust Dynamic Classifier Selection (R-DCS)
In this section, we develop a robust dynamic classifier selection (R-DCS) model to improve the classification performance under noisy labels. We first introduce some notation and then propose three possible selection strategies to estimate the best classifier among a set of available ones. Finally, an application of these R-DCS models in hyperspectral image classification is presented.

Notation
Let $\Psi = \{\psi_1, \psi_2, \dots, \psi_L\}$ be a pool of base classifiers that are used for DCS. In particular, each $\psi_l \in \Psi$ is an NBC. Let $X = \{x_i\}$ be a set of training samples and $Y = \{y_i\}$ a set of testing samples. The methods described below can be applied in a general context, but in our hyperspectral image application, the samples $x_i \in \mathbb{R}^m$ and $y_i \in \mathbb{R}^m$ are vectors composed of pixel values at a particular spatial location in $m$ spectral bands. We denote by $s^{(\mathrm{per})}_{l,i}$ the perturbation threshold of the $l$-th classifier ($\psi_l$) for sample $i$.

Selection Strategies for R-DCS
The key idea of a DCS is to try and find the classifier with the highest probability of being correct for a given unseen sample. We propose a classifier selection approach making use of the observation that, for a given fixed classifier, instances with higher perturbation thresholds tend to have a higher chance of being classified correctly. Based on this general concept, we provide three concrete selection strategies employing perturbation thresholds as follows.

The R-T Strategy
In order to select the most competent classifier among a set of available ones, a first idea is simply to choose the classifier with the highest perturbation threshold for each sample. We refer to this strategy as R-T.
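Let $\lambda_j \in \{1, \ldots, L\}$ denote the index of the base classifier that will be assigned to sample $j$. The R-T strategy selects for each test sample $y_j$ the classifier $\psi_{\lambda_j} \in \Psi$ that exhibits the highest perturbation threshold:

$$\lambda_j = \arg\max_{l \in \{1, \ldots, L\}} s^{(\mathrm{per})}_{l,j}.$$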
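The R-T rule can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than our implementation: the function name `select_rt`, the nested-list layout `thresholds[l][j]` and the tie-breaking rule (lowest classifier index wins) are hypothetical choices, and classifier indices are 0-based here.

```python
from typing import List, Sequence


def select_rt(thresholds: Sequence[Sequence[float]]) -> List[int]:
    """R-T selection: for every test sample, pick the classifier with the
    highest perturbation threshold.

    thresholds[l][j] is the perturbation threshold of classifier l on test
    sample j. Ties are broken in favour of the lowest index (an assumption).
    """
    n_classifiers = len(thresholds)
    n_samples = len(thresholds[0])
    selected = []
    for j in range(n_samples):
        # max() returns the first maximiser, hence the lowest index on ties.
        best = max(range(n_classifiers), key=lambda l: thresholds[l][j])
        selected.append(best)
    return selected


# Hypothetical thresholds of two classifiers on three test samples.
example = [
    [3.0, 1.0, 2.0],  # classifier 0
    [2.0, 5.0, 2.0],  # classifier 1
]
chosen = select_rt(example)  # [0, 1, 0]
```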

The R-LA Strategy
Instead of analysing each sample individually, we now take into account a local surrounding region of the image sample. This local surrounding consists of the nearest neighbors of the test sample in terms of a given distance. In a first variant, which we refer to as R-LA, this distance is defined for each classifier separately. In particular, for each classifier we choose the $N$ training samples whose perturbation thresholds are closest to that of the test sample. To that end, we define the perturbation distance $d_l(x_i, x_j)$ between two data samples $x_i$ and $x_j$ as the absolute value of the difference between their perturbation thresholds for the classifier $l$:

$$d_l(x_i, x_j) = \left| s^{(\mathrm{per})}_{l,i} - s^{(\mathrm{per})}_{l,j} \right|.$$

Furthermore, we let $N_{l,j}$ be the set of $N$ training samples that are the nearest neighbors of $y_j$ in terms of $d_l(x_i, y_j)$. For each sample $y_j$ to be classified, we then determine the most competent classifier $\psi_{\lambda_j}$ as follows:

$$\lambda_j = \arg\max_{l \in \{1, \ldots, L\}} \left| \tilde{N}_{l,j} \right|,$$

where $\tilde{N}_{l,j}$ is the subset of $N_{l,j}$ composed of those training samples that are correctly classified by $\psi_l$. Figure 11a illustrates this strategy with an artificial example involving twenty training instances and two classifiers.
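A minimal Python sketch of the R-LA rule for a single test sample is given below. The function name `select_rla`, the data layout and the tie-breaking rules are illustrative assumptions: neighbours at equal distance are taken in training order, and the lowest classifier index wins when several classifiers have the same number of correctly classified neighbours.

```python
from typing import Sequence


def select_rla(
    train_thr: Sequence[Sequence[float]],
    train_correct: Sequence[Sequence[bool]],
    test_thr: Sequence[float],
    n_neighbors: int,
) -> int:
    """R-LA selection for one test sample (0-based classifier index).

    train_thr[l][i]     : perturbation threshold of classifier l on training sample i
    train_correct[l][i] : True if classifier l classifies training sample i correctly
    test_thr[l]         : perturbation threshold of classifier l on the test sample
    """
    best_l, best_count = 0, -1
    for l in range(len(train_thr)):
        # Rank training samples by the per-classifier perturbation distance d_l.
        order = sorted(
            range(len(train_thr[l])),
            key=lambda i: abs(train_thr[l][i] - test_thr[l]),
        )
        neighbours = order[:n_neighbors]
        # Count the neighbours that classifier l classifies correctly.
        count = sum(1 for i in neighbours if train_correct[l][i])
        if count > best_count:
            best_l, best_count = l, count
    return best_l


# Hypothetical data: two classifiers, four training samples.
thr = [[1.0, 2.0, 3.0, 10.0], [1.0, 2.0, 3.0, 10.0]]
ok = [[True, False, False, True], [True, True, True, False]]
picked = select_rla(thr, ok, test_thr=[2.0, 2.0], n_neighbors=3)  # 1
```

Note that each classifier uses its own neighbourhood here, so the neighbour sets of different classifiers generally differ.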

The R-EU Strategy
The R-EU strategy also aims to choose a classifier based on a local surrounding of the image sample in terms of the perturbation thresholds, but now a common set of nearest neighbors is defined based on a single perturbation distance. We define this common perturbation distance as the Euclidean distance in the space spanned by the perturbation thresholds of the different classifiers:

$$d_{\mathrm{eu}}(x_i, x_j) = \sqrt{\sum_{k=1}^{L} \left( s^{(\mathrm{per})}_{k,i} - s^{(\mathrm{per})}_{k,j} \right)^2},$$

where $s^{(\mathrm{per})}_{k,i}$ and $s^{(\mathrm{per})}_{k,j}$ are the threshold values of the $k$-th classifier for the samples $x_i$ and $x_j$, respectively, and $L$ is the total number of classifiers. Now let $N_{\mathrm{eu},j}$ be the set of $N$ training samples that are the nearest neighbors of $y_j$ in terms of $d_{\mathrm{eu}}(x_i, y_j)$. For each sample $y_j$ to be classified, we then estimate the most competent classifier $\psi_{\lambda_j}$ as follows:

$$\lambda_j = \arg\max_{l \in \{1, \ldots, L\}} \left| \tilde{N}_{\mathrm{eu},l,j} \right|,$$

where $\tilde{N}_{\mathrm{eu},l,j}$ is the subset of $N_{\mathrm{eu},j}$ composed of those training samples that are correctly classified by $\psi_l$. Figure 11b illustrates this strategy with a fictitious example involving twenty training samples and two classifiers.
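The R-EU rule can be sketched analogously in Python. As before, the function name `select_reu`, the data layout and the tie-breaking rules (training order for equidistant neighbours, lowest classifier index for equal counts) are assumptions made for illustration only.

```python
import math
from typing import Sequence


def select_reu(
    train_thr: Sequence[Sequence[float]],
    train_correct: Sequence[Sequence[bool]],
    test_thr: Sequence[float],
    n_neighbors: int,
) -> int:
    """R-EU selection for one test sample (0-based classifier index).

    train_thr[l][i]     : perturbation threshold of classifier l on training sample i
    train_correct[l][i] : True if classifier l classifies training sample i correctly
    test_thr[l]         : perturbation threshold of classifier l on the test sample
    """
    n_classifiers = len(train_thr)
    n_train = len(train_thr[0])

    def d_eu(i: int) -> float:
        # Euclidean distance in the space of the L perturbation thresholds.
        return math.sqrt(
            sum((train_thr[k][i] - test_thr[k]) ** 2 for k in range(n_classifiers))
        )

    # One common neighbourhood shared by all classifiers.
    neighbours = sorted(range(n_train), key=d_eu)[:n_neighbors]
    counts = [
        sum(1 for i in neighbours if train_correct[l][i])
        for l in range(n_classifiers)
    ]
    return max(range(n_classifiers), key=lambda l: counts[l])


# Hypothetical data: two classifiers, four training samples.
thr = [[0.0, 1.0, 5.0, 6.0], [0.0, 1.0, 5.0, 6.0]]
ok = [[True, True, False, False], [False, False, True, True]]
near_low = select_reu(thr, ok, test_thr=[0.5, 0.5], n_neighbors=2)   # 0
near_high = select_reu(thr, ok, test_thr=[5.5, 5.5], n_neighbors=2)  # 1
```

In contrast to the R-LA sketch, the neighbourhood is computed once per test sample and shared by all classifiers.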

Discussion of the Proposed Strategies
Among the three proposed selection strategies for the R-DCS model, the R-T strategy is the simplest one. It operates on each sample separately, selecting the best classifier according to the perturbation thresholds of the classifiers for that particular sample. However, it ignores the fact that the exact relation between perturbation thresholds and performance may differ from one classifier to another. The other two strategies, R-LA and R-EU, improve upon this by incorporating information from a local surrounding of the test sample in the space of perturbation thresholds. R-LA, which is based on the absolute distance between the samples, is computationally simpler than R-EU, which is based on Euclidean distances. Notably, R-T is a parameter-free method, and for R-LA and R-EU only a single parameter, N, needs to be chosen, which is particularly convenient from a practical point of view.

R-DCS in Hyperspectral Image Classification
We apply the proposed R-DCS model in hyperspectral image classification. The proposed model can be seen as an ensemble learning method, which consists of classifiers with different input features. We conduct experiments in the subsequent sections on three real remote sensing data sets, including two HSIs and one multimodal data set (HSI+LiDAR). In the experiments with HSIs alone, we employ the spectral and spatial features of HSIs as the inputs of our method, as illustrated in Figure 12. In the experiment with HSI and LiDAR, an additional feature carrying altitude information of objects from LiDAR is used alongside the two HSI features. In the scheme of Figure 12, we construct two classifiers, one operating on spectral and the other on spatial features. PCA is employed for spectral feature extraction and morphological profiles [78] for spatial feature extraction. A morphological profile is constructed by the repeated use of morphological openings and closings with a structuring element of increasing size. In this work, we extract spatial features by morphological profiles composed of morphological openings and closings with partial reconstruction on PCs, similarly as in [78,79]. Note that in the general case, the R-DCS framework allows us to assign an arbitrary number of feature vectors to every pixel and feed each of those feature vectors to its own classifier. In the particular scheme from Figure 12, however, we assign to every pixel in an HSI two feature vectors: one composed of spectral features and the other composed of spatial features. For each of these feature vectors, the corresponding perturbation threshold is calculated, as explained in Section 2.2, using the algorithm of [57]. A dynamic classifier selection is then conducted for each test sample according to one of the selection strategies from Section 5.2.
In the case of more than two types of features from one or more data sources, we need to calculate the perturbation thresholds for L ≥ 2 feature vectors following the same procedure as above and apply the same selection strategies, which are already formulated in general for an arbitrary number L of classifiers. Algorithm 1 shows the entire process of our R-DCS model for the case with L classifiers.

Experimental Results in HSI Classification
We evaluate the performance of our methods on three real data sets: two HSI data sets (Salinas scene and Urban area HYDICE) and a multi-source data set (GRSS2013), which contains HSI and LiDAR data. The details of the three data sets were described in Section 3.
In the following experiments, we extract the first 50 PCs for the spectral features as a compromise between performance and complexity for all the methods. The morphological profiles for spatial features are generated from the first 5 principal components (representing more than 99% of the cumulative variance) of the HSI data with 5 openings and closings, using a disk-shaped structuring element (SE) whose size ranges from 2 to 10 in steps of 2. The morphological profiles for elevation features are generated from the LiDAR data with 25 openings and closings, using a disk-shaped SE whose size ranges from 2 to 50 in steps of 2. The values of each PC and morphological profile are uniformly discretized into 10 intervals.
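The uniform discretization step can be sketched as follows. The text above does not specify the range over which the intervals are defined, so this sketch assumes equal-width intervals between the minimum and maximum observed value of a feature; the function name `discretize_uniform` is likewise hypothetical.

```python
from typing import List, Sequence


def discretize_uniform(values: Sequence[float], n_bins: int = 10) -> List[int]:
    """Map each value to one of n_bins equal-width intervals (indices 0..n_bins-1).

    Assumption: the intervals span the observed range [min(values), max(values)].
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature falls entirely into the first interval.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so it is clipped to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical feature values (e.g., one principal component).
bins = discretize_uniform([0.0, 2.5, 5.5, 9.9, 10.0])  # [0, 2, 5, 9, 9]
```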
We compare the proposed R-DCS model under different selection strategies with the following schemes: (1) NBC implemented with spectral features alone (NBC-Spe), with spatial features alone (NBC-Spa) and with elevation features alone (NBC-LiDAR). (2) K-nearest neighbors (KNN) classifier with spectral features of HSIs. The number of neighbors is obtained by five-fold cross validation over the training samples. (3) Support vector machine (SVM) classifier with polynomial kernel, applied on spectral features. (4) Generalized graph-based fusion (GGF) method of [29], which makes use of all types of features.
Observe that GGF and the proposed R-DCS combine different types of features, while the other methods use only one type of features. The only parameter of the proposed model is the number of neighbors N in the R-LA and R-EU methods. We estimate this parameter by five-fold cross validation over the training samples, as we do for the KNN method. Three widely used performance measures are used for quantitative assessment: overall accuracy (OA), average accuracy (AA) and the Kappa coefficient (κ). Overall accuracy is the ratio between the number of correctly classified testing samples and the total number of testing samples. Average accuracy is obtained by first computing the accuracy for each class and then averaging these accuracies. The Kappa coefficient, finally, measures the level of agreement between the ground truth and the classification result of the classifier [80]: κ = 1 corresponds to complete agreement and hence a perfect classifier, whereas κ = 0 corresponds to a classifier that ignores the feature vector and classifies purely at random. Let $n_{i,j}$ be the number of testing samples in class $i$ that are labeled as class $j$ by the classifier. Then OA, AA and κ are computed as:

$$\mathrm{OA} = \frac{1}{n_t} \sum_{i} n_{i,i}, \qquad \mathrm{AA} = \frac{1}{n_c} \sum_{i} \frac{n_{i,i}}{n_{i,+}}, \qquad \kappa = \frac{\mathrm{OA} - \mathrm{EA}}{1 - \mathrm{EA}},$$

where $n_t := \sum_i \sum_j n_{i,j}$ is the total number of testing samples, $n_c$ is the number of classes, $n_{i,+} := \sum_j n_{i,j}$ is the number of testing samples in class $i$, $\mathrm{EA} = \sum_i (n_{i,+}/n_t)(n_{+,i}/n_t)$ is the expected accuracy of a classifier that ignores the feature vector and $n_{+,i} := \sum_j n_{j,i}$ is the number of testing samples that are classified in class $i$. In the following experiments, 10 percent of the labeled samples are used for training and the rest for testing. The reported experimental results are averages over 10 runs with different training samples.
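The three measures can be computed directly from a confusion matrix, as in the following Python sketch. The function name `accuracy_measures` and the example matrix are hypothetical; the sketch assumes that every class has at least one testing sample and that EA is strictly smaller than 1, so that no division by zero occurs.

```python
from typing import Sequence, Tuple


def accuracy_measures(confusion: Sequence[Sequence[int]]) -> Tuple[float, float, float]:
    """Return (OA, AA, kappa) for a square confusion matrix.

    confusion[i][j] is the number of testing samples of class i that the
    classifier labels as class j.
    """
    n_c = len(confusion)
    n_t = sum(sum(row) for row in confusion)
    correct = [confusion[i][i] for i in range(n_c)]
    row_totals = [sum(confusion[i]) for i in range(n_c)]                          # n_{i,+}
    col_totals = [sum(confusion[j][i] for j in range(n_c)) for i in range(n_c)]   # n_{+,i}

    oa = sum(correct) / n_t
    aa = sum(correct[i] / row_totals[i] for i in range(n_c)) / n_c
    # Expected accuracy of a classifier that ignores the feature vector.
    ea = sum((row_totals[i] / n_t) * (col_totals[i] / n_t) for i in range(n_c))
    kappa = (oa - ea) / (1 - ea)
    return oa, aa, kappa


# Hypothetical two-class confusion matrix with 100 testing samples.
oa, aa, kappa = accuracy_measures([[40, 10], [5, 45]])
# OA = 0.85, AA = 0.85, EA = 0.5 and hence kappa = 0.7 (up to rounding).
```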

Experiments on the HYDICE Urban Data Set
The first experiment is conducted on the HYDICE Urban data set. The reference classes and their corresponding number of labeled training and testing samples are shown in Table 4. The false color image and the ground truth are shown in Figure 1a,b.
The classification results for the HYDICE Urban data set with different levels of label noise are listed in Table 5, where the best result is marked in bold. All three proposed strategies outperform the base classifiers they select from at the different levels of label noise. The proposed R-DCS model with the R-EU strategy outperforms the others in most cases on this data set. When the level of label noise is 0, the GGF method performs best and the two NBC models, NBC-Spe and NBC-Spa, are inferior to all other methods. However, all three proposed strategies improve upon the NBCs; for the R-EU strategy, the improvement is about 3% over NBC-Spe and 5% over NBC-Spa. With increasing levels of label noise, the accuracy of the GGF method decreases heavily from 94% to 64%, and that of SVM similarly from 91% to 58%. Remarkably, all of our methods show only a slight decrease in OA, which stays within 5%. The more complex strategy R-EU outperforms the simpler ones, R-T and R-LA, especially in the cases with fewer errors in the labels. Table 5 also demonstrates that the proposed methods mostly outperform (and the R-EU strategy always outperforms) all the reference methods in terms of all the performance measures (OA, AA and κ) in all cases where label noise exists. Table 6 shows the classification accuracy per class for the HYDICE Urban data set with different classifiers. We compare ρ = 0 and ρ = 0.5 to study the change in accuracy for each class in the presence of label noise. The results show that the class-specific accuracies mostly drop significantly due to the effect of label noise. Class 2 is an exception: label noise there increases the accuracies for NBC-Spe, NBC-Spa, KNN, SVM and R-T. We can also see from Table 6 that when ρ = 0, the GGF method yields the highest accuracy in each class. Our proposed methods R-LA and R-EU outperform NBC-Spe and NBC-Spa in Class 2 with about 20% improvement.
When ρ = 0.5, all the methods perform badly in Class 3 due to the limited number of training samples. Our proposed methods show competitive accuracies in the first three classes and the R-LA method performs best in Class 4 and Class 5.

Table 4. Reference classes for the HYDICE Urban data set.
No.   Class      Training   Testing   Total
1     Trees      312        2811      3123
2     Concrete   141        1269      1410
3     Soil       64         573       637
4     Grass      404        3640      4044
5     Asphalt    91         821       912
      Total      1013       9113      10,126

In Figure 13a we compare the performance of our best strategy for the HYDICE Urban data set, R-EU, with the reference methods. With increasing levels of label noise, the performance of GGF and SVM deteriorates significantly, while NBCs, KNN and the proposed methods are more stable in terms of OA. The classification accuracy of all three analysed strategies is depicted in Figure 13b for different levels of label noise. All the strategies outperform the individual NBCs (NBC-Spe and NBC-Spa) that they select from. When the level of label noise is less than 0.5, the R-EU method performs better, while the R-T and R-LA methods achieve higher accuracy at ρ = 0.5. Compared to NBC with spatial features, the accuracy improvement is above 3% at the different levels of label noise, which is significant in the task of HSI classification. Compared to SVM and GGF, which are among the most competitive methods, an important improvement is obtained in the presence of label noise.

Experiments on the Salinas Scene Data Set
The second experiment is conducted on the Salinas Scene data set. Information about the classes and the number of samples used for training and testing is listed in Table 7. The false color image and ground truth are shown in Figure 1. The classification results for the Salinas Scene data set with different levels of label noise are shown in Table 8 and Figure 14. The three strategies within the proposed method perform similarly due to the better performance of NBCs on this data set. The proposed R-DCS model under any of the three presented strategies outperforms all the other methods at every level of label noise, even when the level of label noise is 0. When the level of label noise increases, the accuracy of SVM drops heavily from 98% to 69%, and similarly the accuracy of GGF drops from 99% to 87%. The accuracy of KNN drops only slightly on this data set, from 94% to 92%. For the proposed model, with any of the three selection strategies, the decrease in accuracy is also only about 2%. It is interesting to notice that at lower levels of label noise, the R-EU strategy performs better than R-T and R-LA, while the opposite is true at larger levels of label noise.

Experiments on GRSS2013 Data Set
The third experiment is conducted on the GRSS2013 data set. Information about the classes and the number of samples used for training and testing is listed in Table 9. The false color image and ground truth are shown in Figure 2.
The classification results for the GRSS2013 data set with different levels of label noise are shown in Table 10 and Figure 15. Our proposed methods and the representative GGF method combine the spectral features, spatial features and elevation features in these experiments. The results show that our R-EU method yields the best performance in terms of OA, AA and κ at each level of label noise. R-LA outperforms the three NBCs, i.e., NBC-Spe, NBC-Spa and NBC-LiDAR, at low levels of label noise (ρ ≤ 0.2), but performs worse than NBC-Spa when ρ > 0.2. When the label noise rises from ρ = 0 to ρ = 0.5, the OA of SVM drops heavily from 85% to 54%, and similarly GGF has an OA decrease of 39%. KNN proves to be much more robust to label noise: its OA decreases by only 3% over the same range of ρ values. Among our methods, R-T does not yield good results on this data set. Compared with R-LA, R-EU performs consistently better, demonstrating the effectiveness of the R-EU strategy.

Performance at Extremely Large Levels of Label Noise
The previous analysis showed that the performance of NBC-based classifiers and of our approach, which is built upon NBCs, remains remarkably stable even at large levels of label noise (ρ = 0.5). To explain this behaviour and to explore at which levels of label noise this performance starts to drop, we also perform experiments with extremely large levels of label noise (ρ > 0.5). For these experiments, we choose the selection strategy that most often yields the best performance (R-T for the two HSI data sets and R-EU for the GRSS2013 data set), and compare it with the other methods. Figure 16 shows the performance of the resulting classifiers under different levels of label noise on the three analysed data sets. In all three data sets, the overall accuracy of NBC-Spe decreases gradually with increasing label noise in the range ρ < 0.5, and drops abruptly afterwards, reaching a value near zero when ρ = 0.9. This sharp decrease can be attributed to the flattening of the conditional densities for larger ρ, as shown in Figures 17 and 18 for the first PC in the two HSI data sets. (Note that the statistical distributions in Figures 17 and 18, as well as the statistical distributions in Section 4, were obtained with 50% of the labeled samples per class in order to allow for a more reliable empirical estimation of the corresponding distributions.) In the two HSI data sets, NBC-Spa and our method R-T show a similar evolution and have a sudden drop in OA around ρ = 0.7. KNN and GGF suffer from a significant performance drop at ρ = 0.6, while SVM shows an approximately linear decrease in both data sets. The trend exhibited by NBCs and our proposed method is preferable to a linear decrease in accuracy: once the accuracy drops below a certain level, the exact values no longer matter, as all methods are useless in that range.
In the GRSS2013 data set, NBCs with different types of features and our method R-EU show a similar evolution and have a sudden drop in OA around ρ = 0.7. KNN shows behaviour that is similar to that of NBC-Spe, while SVM and GGF decrease almost linearly on this data set.

Discussion
The experimental results presented in the previous section, as well as the statistical characterization in Section 4, provide new insights into the effects of label noise on supervised classification of hyperspectral remote sensing images. Empirical conditional probability distributions of HSI features conditioned on class labels, as well as prior distributions of class labels, exhibited graceful flattening with increasing amounts of label noise. Their evolution clarifies why Bayesian classifiers such as the relatively simple NBCs are much more robust to label noise than some other, more complex methods. These results are consistent with previous findings from [70,71], where it was experimentally established that NBCs yield better classification accuracy in the presence of label noise compared to classifiers based on KNN, SVM, and decision trees.
Our experimental analysis also clearly shows that incorporating spatial features into the classification process not only improves the classification accuracy but increases robustness to label noise as well. It is well established that using spatial information typically improves the classification accuracy, and some recent works that addressed HSI classification in the presence of noisy labels [66,81] also incorporate spatial features.
We addressed the effect of label noise from the point of view of the robustness of probabilistic graphical models to model perturbations. We built a robust dynamic classifier selection method upon this reasoning. The proposed approach enjoys remarkable robustness to label noise, inherited from the naive Bayesian classifiers that lie at its core. We instantiated our general approach with three particular selection strategies that have different levels of complexity. The proposed approach improves upon the NBCs that it combines and lends itself to easily incorporating multiple data sources and multiple types of features. Both NBC and the proposed robust dynamic classifier exhibit a characteristic trend in the presence of label noise: the classification accuracy decreases very slowly until the level of label noise becomes excessively high (60% or more erroneous labels) and then it drops abruptly. The evolution of the probability distributions of HSI features and of the estimated priors for class labels explains this behaviour, as was pointed out in the previous section. Interestingly, classifiers based on SVM and on an advanced graph fusion method show a faster decrease of the classification accuracy with increasing levels of label noise. While these more sophisticated classifiers outperform the other analysed ones in the case of ideally correct labels, they appear to be more vulnerable to label noise and become inferior to NBCs, KNN and the proposed approach already when a small percentage of labels are erroneous.
Based on these results and findings, we believe that the following research directions are interesting to explore: (1) Analyzing the performance of more advanced Bayesian classifiers, e.g., based on Markov Random Fields, in the presence of label noise. (2) Exploring which levels of label noise are acceptable for a given tolerance in the classification accuracy and how robust different learning models are in this respect. This can significantly help in practice for optimizing the resources and ensuring the prescribed tolerance levels. (3) Deep learning methods are becoming the dominant technology for supervised classification. It is already well known that these models tend to be extremely susceptible to various degradations in the data, such as noise, and to different adversarial attacks. It would be of interest to study thoroughly their behaviour in the presence of label noise. Motivated by the excellent robustness of Bayesian models to erroneous labels, a natural idea to explore is how Bayesian approaches can be incorporated to improve the robustness of deep learning methods to label noise.

Conclusions
In this work, we started by analysing the effect of errors in data labels on the estimation of spectral signatures of landcover classes, on the estimated statistical distributions of features given the class labels and on the prior probabilities of the given classes. The analysis reveals that NBCs are remarkably robust to label noise but also that erroneous labels introduce uncertainties to models, which inevitably deteriorate the performance of all classifiers, including NBCs. To deal with the imprecision of the model that is caused by errors in the sample labels, we proposed a novel robust dynamic classifier selection model, which we refer to as R-DCS. The R-DCS model is based on imprecise-probabilistic robustness measures and was applied to HSI classification in the presence of errors in data labels. Three possible selection strategies based on the robustness measures are presented for the R-DCS model. All the provided strategies outperform the classifiers they select from, but their performance differs for different levels of label noise. The R-EU strategy performs better than the other two in most cases, while R-T and R-LA enjoy the benefit of lower computational complexity than R-EU. The experimental results also demonstrate that the proposed model is more robust to label noise compared to some common classification approaches such as KNN, SVM and graph-based feature fusion.