Few-Shot Hyperspectral Remote Sensing Image Classification via an Ensemble of Meta-Optimizers with Update Integration

Abstract: Hyperspectral images (HSIs) with abundant spectra and high spatial resolution can satisfy the demand for classifying adjacent homogeneous regions and accurately determining their specific land-cover classes. Due to the potentially large variance within the same class in hyperspectral images, classifying HSIs with limited training samples (i.e., few-shot HSI classification) has become especially difficult. To solve this issue without adding training costs, we propose an ensemble of meta-optimizers that are generated one by one by periodically annealing the learning rate during the meta-training process. Such a combination of meta-learning and ensemble learning demonstrates a powerful ability to optimize a deep network on few-shot HSI training data. To further improve the classification performance, we introduce a novel update integration process to determine the most appropriate update for the network parameters during model training. Compared with popular human-designed optimizers (Adam, AdaGrad, RMSprop, SGD, etc.), our proposed model performed better in convergence speed, final loss value, overall accuracy, average accuracy, and Kappa coefficient on five HSI benchmarks in a few-shot learning setting.


Introduction
Hyperspectral remote sensing is a valuable monitoring technique in geoinformation that typically captures hundreds of narrow-spectral-band sets of information from the same region, revealing unique physical features of ground targets. The high spectral dimensionality and spatial resolution of the hyperspectral imaging (HSI) process make accurate discrimination of different land-cover classes possible [1][2][3]. Since an HSI is formed pixel by pixel, its classification assigns a unique label, representing a specific area, to each HSI pixel. At present, HSI classification has become an indispensable data-driven technique for achieving the United Nations Sustainable Development Goals (SDGs). For example, in agricultural settings, HSI classification is used to accurately capture the status of vegetation health, soil type, and moisture and thus helps to make reasonable decisions on fertilization, irrigation, and management [4,5]; in geological exploration, it is used to identify geological features, such as mineral and rock types, which aids in finding potential mineral deposits [6]; in environmental protection, it is used to monitor desertification processes and changes in land use [7][8][9][10][11]. HSI classification can also monitor natural disasters such as fires, floods, and earthquakes, making disaster response and mitigation efforts simpler and quicker [12,13].
Traditional HSI classifiers rely heavily on hand-crafted features [14-17]. Benediktsson et al. [18] constructed morphological profile features with the help of mathematical morphology. Jia et al. [19] used 3D Gabor filtering to extract spatial structure information. These carefully designed features were then fed into classifiers such as support vector machines (SVMs) [20] and random forests [21]. Feature extraction techniques, such as principal component analysis (PCA) [22], independent component analysis (ICA) [23], and linear discriminant analysis (LDA) [24], were also introduced to reduce redundant spectral information and produce more distinguishable features. By incorporating spatial contextual information, spectral-spatial features can further improve the classification performance [17,25]. Compared with traditional techniques, deep learning can automatically learn and capture high-level features from training data and achieve better classification accuracy [26][27][28][29][30]. Chen et al. [26] fed spectral vectors and spatial features extracted by PCA into a stacked autoencoder to generate high-level joint spectral-spatial features, implemented the classification with a logistic regression layer, and achieved more accurate results than traditional machine learning on the KSC and Pavia HSI datasets. Mei et al. [30] used a 3D convolutional network to extract spatial and spectral features simultaneously and achieved better results on the Indian Pines, Salinas, and Pavia HSI datasets.
Manually labeling HSI samples is not only time-consuming and laborious but also requires much professional expertise. Due to the high annotation cost, only a limited number of annotated samples may be available, leading to a small training dataset. Moreover, in hyperspectral images, the number of pixels for different land-cover classes can be significantly imbalanced, with some classes having far fewer samples than others. Severe overfitting occurs when training neural networks with insufficient samples. Current HSI classification therefore faces the challenge of few-shot learning [31]. As an advance of deep learning, meta-learning [32,33] has become the most promising approach to deal with the issue of few-shot learning.
The idea of meta-learning is to enable models not only to acquire task-specific knowledge but also to learn how to learn new tasks more effectively, allowing them to make better use of limited training samples and demonstrate robust generalization ability on new tasks [34][35][36]. Although meta-learning has begun to be used in few-shot HSI classification [2,37], the impact of the chosen optimizer on the final classification performance is often ignored. Since popular human-designed optimizers, such as stochastic gradient descent (SGD) [38], RMSprop [39], AdaGrad [40], and Adam [41], are designed for general tasks, they cannot improve their performance by mining the unique features of a specific HSI task, and they can lead to serious overfitting, especially when dealing with insufficient training samples [42]. Incorporating prior knowledge of the training task into an optimizer can significantly reduce the size of the required training data and avoid overfitting [43,44]. Such an optimizer is called a meta-optimizer or meta-learning-based optimizer [45].
This study aimed to improve meta-optimizers and apply them to few-shot HSI classification. Since ensemble learning can improve generalization by combining individual learners, thereby reducing overfitting risks and enhancing model robustness, we propose an ensemble of meta-optimizers that are generated one by one by periodically annealing the learning rate during the meta-training process. Such a combination of meta-learning and ensemble learning not only adds no significant training costs but also demonstrates a powerful ability to optimize a deep network on few-shot HSI training data. In each optimization iteration, the meta-optimizer ensemble generates several candidate updates for the parameters of the trained model. To integrate these candidate updates and effectively improve the model performance, we further developed an update integration technique that measures and incorporates the potential of each candidate. Real-world experiments on few-shot HSI classification consistently verified the effectiveness of our meta-optimizer ensemble with update integration over widely used human-designed optimizers across five benchmark HSI datasets.

Related Works
During the HSI classification process, when sufficient labeled samples are provided, accurate classification can easily be implemented under a supervised classification framework [46]. Let a learning task T be specified by a dataset D = {(x_i, y_i)}_{i=1}^n and a loss function L. The dataset D is split into a training set D_train = {(x_i, y_i)}_{i=1}^k and a test set D_test = {(x_i, y_i)}_{i=k+1}^n. The learning target is to obtain a predictive function f with parameters θ by solving the following optimization problem:

θ* = argmin_θ Σ_{(x,y)∈D_train} L(f(x; θ), y),   (1)

where the sample subscript i is omitted for simplicity. The performance of f is finally evaluated on D_test. When only limited labeled HSI samples are provided, Equation (1) is prone to becoming stuck in local optima.
Meta-learning aims to make the model learn how to learn so that it can more flexibly adapt to various learning scenarios. It consists of the learner, which is responsible for learning a specific task, and the meta-learner, which guides the learner's learning process.
The training of the meta-learner typically involves two stages: the first stage is to learn across a set of tasks, and the second stage is to evaluate and adjust the meta-learner on new tasks [42]. The optimal meta-knowledge ω* is found by

ω* = argmin_ω E_{T∼p(T)} L(D; ω),   (2)

where p(T) is a distribution over tasks, and a task specifies a dataset and a loss function, T = {D, L}.
In an actual implementation, only M source tasks D_source = {(D_source^train, D_source^test)^(i)}_{i=1}^M are randomly sampled from the task distribution p(T), so the expectation in the meta-training phase (2) is replaced simply by a summation:

ω* = argmin_ω Σ_{i=1}^M L(D_source^(i); ω).   (3)

This meta-knowledge ω* is further used to train the predictive model f with parameters θ. For each target task i, the meta-testing stage is written as follows:

θ*_(i) = argmin_θ L(θ; ω*, D_target^train(i)).   (4)

Compared with Equation (1) in typical supervised learning, the learning phase of Equation (4) benefits from the learned meta-knowledge ω*.
The meta-knowledge ω* can take the form of initial parameters [32], an optimization strategy [44], etc. By treating the meta-knowledge ω* as the optimization strategy, learning to optimize is proposed to optimize the learning process itself [47]. Specifically, Andrychowicz et al. [43] trained a two-layer LSTM network (LSTM optimizer) to generate dynamic updates. The idea is to replace the regular update with an update generated by an LSTM network. Guided by the meta-knowledge stored in the network parameters, the LSTM optimizer can incorporate historical gradient information to generate updates. Andrychowicz et al. [43] demonstrated that the LSTM optimizer outperforms human-designed optimizers in a variety of tasks, including training neural networks, convex problems, and styling images with neural art. Ravi et al. [44] used LSTM networks to learn update rules for few-shot learning, setting the cell state of the LSTM network to the learner's parameters, with the candidate cell state determining the gradient information for parameter updates. Li et al. [48] proposed a conceptually simple meta-optimizer, Meta-SGD, for few-shot learning. Meta-SGD learns the initialization and learning rate instead of the whole update rule. Compared with the LSTM optimizer, Meta-SGD is more straightforward to implement, but its generalization capacity needs to be improved. Chen et al. [49] trained an RNN-based meta-optimizer for the global optimization of black-box functions. Their meta-optimizer, trained on synthetic functions, can optimize a broad class of black-box functions. Wang et al. [45] designed a meta-optimizer called HyperAdam, which generates a parameter update as an adaptive combination of multiple candidate updates produced by Adam with different decay rates.
Hyperspectral images have high spectral dimensionality and high spatial complexity, which may slow down the convergence of classification models during training. A meta-optimizer can learn knowledge from one HSI dataset and then use the learned knowledge to optimize a model on a new HSI dataset. To the best of our knowledge, no study has focused on learning a meta-optimizer for few-shot HSI classification.

Proposed Method
Hyperspectral images (HSIs) with abundant spectra and high spatial resolution can satisfy the demand for accurately determining specific land-cover classes. Due to the significant variance within the same class in hyperspectral images, few-shot HSI classification becomes especially difficult. To solve this issue without adding training costs, we propose a new approach: an ensemble of meta-optimizers with update integration for few-shot HSI classification.

Ensembles of Meta-Optimizers
The training in conventional machine learning can be expressed as the optimization in Equation (1). Its solution is usually based on stochastic gradient descent and its variants, through the following iterative process:

θ_{t+1} = θ_t − α_t ∇_{θ_t} L(θ_t; D),   (5)

where α_t is the learning rate at time step t. Since Equation (5) usually requires thousands of iterations to find the optimum or a local optimum, the performance of these optimizers becomes very weak if the training samples are insufficient. To overcome this drawback, the meta-optimizer (i.e., an optimizer trained by meta-learning) is used to replace the human-designed stochastic gradient descent algorithms. It may take the form of a two-layer long short-term memory (LSTM) network, which can integrate information from the history of gradients to determine the parameter update. Such a meta-optimizer is also called an LSTM optimizer. Denote the hidden state of the LSTM by h and its output by g. By using the LSTM optimizer G_ϕ to optimize the parameters θ of a model f, the sequence of updates becomes

θ_{t+1} = θ_t + g_t,  [g_t, h_{t+1}] = G_ϕ([∇_{θ_t} L(θ_t; D)]_norm, h_t).   (6)

To address the challenge that different input gradients have very different magnitudes, which makes meta-optimizers difficult to train, we adopted the standard normalization process in [43] to rescale the input gradients:

[∇]_norm = (log(|∇|)/p, sgn(∇)) if |∇| ≥ e^{−p}; (−1, e^p ∇) otherwise,   (7)

where p > 0 is the parameter controlling how small gradients are disregarded.
Minimizing a meta-loss function accumulated along the optimization trajectory yields the optimal parameters ϕ* of the meta-optimizer G, so that G reduces the objective function L(θ_t; D) as much as possible.
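To make these mechanics concrete, the following is a minimal PyTorch sketch of a coordinate-wise two-layer LSTM optimizer with the gradient preprocessing of Equation (7). The names preprocess_grad and LSTMOptimizer, and all defaults, are our own illustration rather than released code.

```python
import torch
import torch.nn as nn

def preprocess_grad(g, p=10.0):
    """Rescale gradients as in Equation (7) (Andrychowicz et al. [43]).

    Each gradient coordinate becomes a 2-D feature: large magnitudes are
    encoded as (log|g|/p, sign(g)); tiny ones as (-1, e^p * g).
    """
    large = g.abs() >= torch.exp(torch.tensor(-p))
    log_part = torch.where(large, g.abs().clamp_min(1e-38).log() / p,
                           torch.full_like(g, -1.0))
    sign_part = torch.where(large, g.sign(), g * torch.exp(torch.tensor(p)))
    return torch.stack([log_part, sign_part], dim=-1)

class LSTMOptimizer(nn.Module):
    """Coordinate-wise two-layer LSTM meta-optimizer G_phi (Equation (6))."""
    def __init__(self, hidden=20):
        super().__init__()
        self.lstm = nn.LSTM(input_size=2, hidden_size=hidden, num_layers=2)
        self.head = nn.Linear(hidden, 1)  # maps hidden state to the update g_t

    def forward(self, grad, state=None):
        # grad: (n_params,) flattened gradients of the predictive model f;
        # each coordinate is treated as an independent "batch" item
        x = preprocess_grad(grad).unsqueeze(0)           # (1, n_params, 2)
        out, state = self.lstm(x, state)
        update = self.head(out).squeeze(0).squeeze(-1)   # (n_params,)
        return update, state                             # theta <- theta + update
```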
To build an ensemble of meta-optimizers without adding training costs, we performed periodic annealing on the learning rate. When the learning rate anneals to its minimum value, the meta-optimizer parameters are saved and added to the ensemble. In this way, we can obtain multiple meta-optimizers with different parameters without increasing the training costs. The learning rate α changes in each annealing cycle according to the following formula:

α = α_min + (1/2)(α_max − α_min)(1 + cos(π t_cur / T_p)),   (8)

where α_max and α_min delimit the learning rate range, and t_cur is the number of iterations since the last restart. A good α_max can accelerate the training process and help the model escape from local minima, while α_min is a small enough number. T_p is a hyper-parameter representing the period of cosine annealing. When the training process starts, t_cur = 0 and α = α_max. After T_p iterations, the learning rate decreases to its minimum α_min, completing one annealing cycle. At the beginning of the next annealing cycle, t_cur is reset to 0, and the learning rate abruptly returns to α_max.
When the learning rate is small, the trained model tends to converge to the closest local minimum [50]. Once the learning rate reaches its minimum α_min, the corresponding meta-optimizer is added to the meta-optimizer ensemble. Then, a large enough learning rate α_max is used to escape the current local minimum and restart a new annealing cycle. After N annealing cycles, we obtain an ensemble of N meta-optimizers, denoted as MOE_N = {G_{ϕ_i}}_{i=1}^N. The ensemble process of meta-optimizers with learning rate annealing is shown in Algorithm 1. The whole meta-training phase is a double-loop structure, which differs from the traditional training process.
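The annealing-and-snapshot procedure can be sketched as follows. Here build_ensemble and meta_train_step are hypothetical names, and the cycle count, cycle length, and learning-rate range are illustrative placeholders rather than the exact values used in our experiments.

```python
import copy
import math

def annealed_lr(t_cur, t_p, lr_min, lr_max):
    """Equation (8): cosine-annealed learning rate within one cycle."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t_cur / t_p))

def build_ensemble(meta_optimizer, meta_train_step, n_cycles=3, t_p=1000,
                   lr_min=1e-4, lr_max=1e-2):
    """Collect one snapshot of the meta-optimizer per annealing cycle.

    `meta_train_step(model, lr)` is an assumed hook running one
    meta-training iteration of Algorithm 1 at learning rate `lr`.
    """
    snapshots = []
    for _ in range(n_cycles):
        for t_cur in range(t_p):
            meta_train_step(meta_optimizer, annealed_lr(t_cur, t_p, lr_min, lr_max))
        # the learning rate is now at lr_min: the meta-optimizer sits in a
        # local minimum, so save it before the next warm restart escapes it
        snapshots.append(copy.deepcopy(meta_optimizer))
    return snapshots
```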

Algorithm 1
The proposed ensemble of meta-optimizers.

Require:
1: The predictive model f with initial parameters θ_0.
2: Source HSI dataset D_source = {(x_i, y_i)}_{i=1}^N; batch size B.
3: Meta-optimizer G with initial parameters ϕ_0; ensemble MOE_N = ∅.
4: Learning rate α; learning rate range α_max and α_min; annealing period T_p.
5: Loss function L; meta-loss function L_meta.
...
8:  α = α_min + (1/2)(α_max − α_min)(1 + cos(π t_cur / T_p))
9:  for t = 1, 2, ..., T do
10:   Randomly draw B training samples D_B = {(x_i, y_i)}_{i=1}^B from D_source
11:   Calculate the current loss value L_t on D_B
12:   Calculate the gradients ∇_{θ_t} L_t of f
13:   Normalize ∇_{θ_t} L_t → [∇_{θ_t} L_t]_norm according to Equation (7)
...
Update Integration

The meta-optimizer ensemble MOE_N produces N candidate updates at every time step. To make full use of these candidates and achieve faster convergence, we propose an update integration algorithm that selects one candidate as the final update by estimating the quality of each update. The algorithm's core lies in how to measure this quality. Typically, a better update at step t results in lower loss values for the predictive model from step t + 1 to the final step T. However, considering too many subsequent loss values is computationally expensive; as a compromise, only the loss value at step t + 1 is used to quantitatively measure an update at step t. Suppose that the predictive model to be updated is f_θ. At time step t, we sample a mini-batch D_mini from the target dataset, calculate the normalized gradients [∇_{θ_t} L]_norm, and input them into MOE_N to obtain N candidate updates g_t^i (i = 1, 2, ..., N), whose qualities are tracked by

m_t^i = β m_{t−1}^i + (L(θ_t + g_t^i; D_mini) − L(θ_t)),   (11)

where β ∈ [0, 1) is a hyper-parameter. Based on the newly sampled D_mini, L(θ_t + g_t^i; D_mini) approximates the loss value L(θ_{t+1}) at step t + 1 under the assumption that g_t^i is the actual update. Consequently, L(θ_t + g_t^i; D_mini) − L(θ_t) measures which candidate update drops the loss value the most. Each m_t^i is a real number iterating over the time step t, with its initial value set to 0. The update integration algorithm determines the optimal update g_t^* at time step t according to the minimum among the m_t^i (i = 1, 2, ..., N):

g_t^* = g_t^{i*},  i* = argmin_i m_t^i.   (12)

When β = 0, Equation (11) becomes m_t^i = L(θ_t + g_t^i; D_mini) − L(θ_t), meaning that the optimal update is determined based only on the current loss reduction. However, loss values can vary widely from one mini-batch to another, so we calculate the loss reductions with previous loss information to suppress this stochasticity, corresponding to the cases of β ≠ 0 in Equation (11).
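A compact sketch of one integration step follows, assuming the parameters and candidate updates are stored as arrays that support addition; the function and variable names are our own illustration.

```python
def integrate_updates(model_loss, theta, candidates, m, beta=0.1):
    """One step of the update integration rule (Equations (11)-(12)).

    `model_loss(theta)` evaluates L(theta; D_mini) on a freshly sampled
    mini-batch; `candidates` holds the N candidate updates g_t^i from the
    ensemble; `m` is the running quality score per candidate (init to 0).
    """
    base = model_loss(theta)
    for i, g in enumerate(candidates):
        # look one step ahead: how much would candidate i drop the loss?
        drop = model_loss(theta + g) - base
        m[i] = beta * m[i] + drop           # Equation (11)
    best = min(range(len(candidates)), key=lambda i: m[i])
    return candidates[best], m              # g_t^*, Equation (12)
```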
Averaging the outputs of an ensemble is a widely used strategy. Compared with the averaging method, our proposed update integration algorithm can flexibly choose an update direction with a lower loss value by looking one step ahead at the loss changes.

Incorporating Update Integration into Meta-Optimizer Ensembles
The framework of the proposed method is shown in Figure 1. In each optimization iteration, we first input a batch of training samples into the neural network to be optimized and perform backpropagation to obtain the gradient information ∇_θ L of the network.
The meta-optimizer ensemble {G_{ϕ_i}}_{i=1}^N then takes the normalized gradients [∇_θ L]_norm as inputs, and each meta-optimizer G_{ϕ_i} independently outputs its suggested update g^i, directed by its parameters ϕ_i. Finally, the candidate updates {g^i}_{i=1}^N are integrated into the final update g* by the proposed update integration algorithm. Since the meta-optimizer ensemble has learned how to optimize from related tasks, it requires fewer training samples, leading to low computational costs and memory requirements and faster convergence in fewer iterations. Algorithm 2 illustrates the complete pseudo-code for using the proposed meta-optimizer ensemble with the update integration algorithm to optimize a model f.
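Putting the pieces together, one MOE-U iteration might look as follows, reusing the preprocess_grad and integrate_updates sketches above; grad_fn, loss_fn, and the propose method are assumed interfaces for this illustration, not the paper's actual API.

```python
def moe_u_step(theta, grad_fn, loss_fn, ensemble, m, beta=0.1):
    """One MOE-U iteration (Figure 1 / Algorithm 2), as a hedged sketch.

    `grad_fn(theta)` returns the loss gradient on the current batch;
    `loss_fn(theta)` evaluates the loss on a mini-batch; `ensemble` is the
    list of snapshot meta-optimizers, each exposing `propose(features)`.
    """
    feats = preprocess_grad(grad_fn(theta))             # Equation (7)
    candidates = [g.propose(feats) for g in ensemble]   # N candidate updates
    g_star, m = integrate_updates(loss_fn, theta, candidates, m, beta)
    return theta + g_star, m                            # apply g_t^*
```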


Algorithm 2
The proposed update integration algorithm.

Require:
1: The trained model f with initial parameters θ_1.

Real-World HSI Classification Experiments
We focused on real-world HSI classification under the N-way 10-shot learning setting, i.e., we randomly picked 10 samples from each of N land-cover classes to form the training set. We illustrate the effectiveness of our proposed method on five real-world HSI datasets: PaviaU, PaviaC, Salinas, SalinasA, and KSC. Due to environmental changes and observation conditions, the spectral features of the same land-cover class may vary at different times or locations. Moreover, hyperspectral images of the same region acquired by different satellites may differ in band configuration, spatial resolution, radiometric calibration, etc. Therefore, we designed the following seven source-target tasks to test the generalization ability and robustness of our proposed method. In each task, we trained the meta-optimizer on the source dataset and then used it to optimize a predictive model on the target dataset with few labeled data. The real-world HSI classification experimental tasks consisted of different source-target combinations: (1) PaviaU-PaviaC; (2) PaviaC-PaviaU; (3) SalinasA-Salinas; (4) PaviaC-SalinasA; (5) SalinasA-PaviaC; (6) PaviaC-KSC; and (7) KSC-PaviaC. The PaviaU and PaviaC datasets were acquired by the ROSIS sensor, so tasks (1) and (2) evaluate the generalization ability across different regions under the same sensor. The SalinasA dataset is a small subset of Salinas, so task (3) tests the generalization ability in dealing with a large number of unseen classes. Robustness measures the ability of the model to maintain its functionality and performance under abnormal scenarios; to demonstrate the robustness of our model, tasks (4)-(7) adopted the strictest abnormal scenarios, where an HSI dataset from one sensor was used to train the model and an HSI dataset from a different sensor was used to test it. Our code is available at https://github.com/lazyhaotao/MOE-U.

Description of Datasets
The PaviaU HSI dataset was captured over Pavia using the ROSIS sensor. Its spatial size is 610 × 340, and the number of spectral bands is 103. The PaviaU dataset contains 42,776 labeled pixels in 9 land-cover classes (Table 1). The PaviaC HSI dataset was also acquired by the ROSIS sensor over the same region as PaviaU. Its spatial size is 1096 × 715, and the number of spectral bands is 102. PaviaC has 148,152 labeled pixels in 9 land-cover classes (Table 2). The spatial resolution of these two datasets is 1.3 m. The Salinas HSI dataset was collected by the 224-band AVIRIS sensor over Salinas Valley. Its spatial size is 512 × 217, and the number of spectral bands is 204. It has 54,129 labeled samples classified into 16 land-cover classes (Table 3). The SalinasA HSI dataset is a small sub-scene of the Salinas dataset. It has a size of 83 × 86 and 6 land-cover classes (Table 4). The spatial resolution of these two datasets is 3.7 m.

The KSC HSI dataset was collected by the AVIRIS sensor over the Kennedy Space Center (KSC), Florida. Its spatial size is 512 × 614, and the number of spectral bands is 176. It has 5211 labeled pixels in 13 land-cover classes (Table 5). The spatial resolution of the KSC dataset is 18 m.

In summary, five HSI datasets (PaviaU, PaviaC, Salinas, SalinasA, and KSC) were used to test the performance of our proposed method. These five datasets from real-world scenarios cover a large number of land-cover classes and have been widely used as test datasets in research on hyperspectral remote sensing image classification. Table 6 shows a comparison of these five real-world HSI datasets.

Experimental Setup
The performance of the various optimizers was evaluated by the convergence speed and the final convergence value of the predictive model's loss function. We also evaluated, on the target HSI datasets, the overall accuracy (OA), average accuracy (AA), and Kappa coefficient (κ), which indirectly reflect the performance of the optimizers and the degree of overfitting. In detail, the OA is the proportion of correctly classified samples among all of the tested samples. The AA is defined as the average of the per-class classification accuracies. Compared with the OA, the AA pays more attention to the classes with fewer samples. The Kappa coefficient is a statistical index evaluating consistency and classification accuracy and is defined as κ = (p_0 − p_e)/(1 − p_e), where p_0 is the OA and p_e is the hypothetical probability of chance agreement.
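These three measures can all be computed from a confusion matrix. The following NumPy sketch (our own helper, assuming every class appears at least once in the test labels) makes the definitions explicit.

```python
import numpy as np

def classification_scores(y_true, y_pred, n_classes):
    """Return OA, AA, and the Kappa coefficient from label arrays."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                                   # confusion matrix
    oa = np.trace(cm) / cm.sum()                        # overall accuracy p_0
    per_class = np.diag(cm) / cm.sum(axis=1)            # per-class accuracy
    aa = per_class.mean()                               # average accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / cm.sum() ** 2
    kappa = (oa - pe) / (1 - pe)                        # chance-corrected
    return oa, aa, kappa
```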
The Adam optimizer with learning rate α = 0.01 was used to train the meta-optimizer ensemble based on Algorithm 1. During the meta-training phase, we set B = 128 and T = 100 and thus obtained an MOE_3 = {G_{ϕ_i}}_{i=1}^3. For the update integration algorithm, all of the experimental results are reported with β = 0.1. The LSTM optimizer consisted of a two-layer LSTM network, with each layer containing 20 hidden units. It was trained by meta-learning without ensemble learning, and its parameters were saved after 1000 meta-training episodes on the source dataset. For the normalization method, we set p = 10 as usual.
The predictive model f trained by the above optimizers on the target dataset was a four-layer fully connected network with sigmoid activation functions, each hidden layer containing 20 nodes. We input each HSI pixel of size 1 × 1 × (number of channels) into the predictive model for classification. The loss function used in our experiments was the cross-entropy loss. Our real-world HSI classification experiments concentrated on the N-way 10-shot learning setting, i.e., we randomly picked 10 samples from each of N land-cover classes to form the training set. We repeated each experiment ten times and report the mean and standard deviation of the various classification performance indicators.
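For reference, a plausible PyTorch rendering of this predictive model, under our reading of "four-layer fully connected" as three hidden layers of 20 sigmoid units plus a linear output layer:

```python
import torch.nn as nn

def make_predictive_model(n_bands, n_classes, hidden=20):
    """Per-pixel classifier f: one spectral vector of length `n_bands` in,
    class logits out (cross-entropy loss is applied to the logits)."""
    return nn.Sequential(
        nn.Linear(n_bands, hidden), nn.Sigmoid(),
        nn.Linear(hidden, hidden), nn.Sigmoid(),
        nn.Linear(hidden, hidden), nn.Sigmoid(),
        nn.Linear(hidden, n_classes),
    )
```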

Experimental Results
For brevity, we refer to the meta-optimizer ensemble with update integration as MOE-U and to averaging all of the candidate updates of a meta-optimizer ensemble as MOE-A. We compared the performance of MOE-U against the LSTM optimizer, MOE-A, and a selection of popular human-designed optimizers: Adam, AdaGrad, RMSprop, SGD, and SGD with momentum, whose learning rates were 0.005, 0.1, 0.001, 0.5, and 0.5, respectively.

PaviaU-PaviaC Task

In this task, we first trained the meta-optimizers, including the LSTM optimizer, MOE-A, and MOE-U, on the PaviaU dataset. The meta-optimizers were then used to train the predictive model f on the PaviaC dataset. To avoid overfitting, we set the number of training iterations to 50. The averaged loss change curves are shown in Figure 2. The meta-optimizers had a much faster convergence speed than the human-designed optimizers, indicating that the prior knowledge learned from another dataset acquired by the same sensor accelerated convergence and significantly reduced the final convergence loss. The final averaged convergence values and their standard deviations (Stds) are shown in Table 7. Our MOE-U also achieved the lowest final convergence value of 0.3786 and the lowest Std of 0.1128, displaying better performance than the other two meta-optimizers. These numerical results confirm that our proposed update integration algorithm can effectively accelerate the convergence process and achieve lower loss simultaneously. The final convergence value of MOE-A was 0.3819, higher than that of MOE-U. The reason may be that averaging different parameter updates can bias the best update direction away from the correct direction, whereas our update integration process can determine the best update direction by estimating the change in loss.

The classification performances of the different optimizers are shown in Table 8, indicating that, compared with human-designed optimizers, meta-optimizers can achieve lower loss and better classification results while avoiding overfitting. The predictive model optimized by MOE-U achieved 0.8791, 0.8324, and 0.8340 on OA, AA, and Kappa, respectively, outperforming the human-designed optimizers. Compared with the LSTM optimizer, our MOE-U increased the OA, AA, and Kappa values by 0.0277, 0.0294, and 0.0357, respectively. Compared with our MOE-A, our MOE-U increased these values by 0.0151, 0.0245, and 0.0249, respectively. In total, the results demonstrate that MOE-U has good generalization ability: it can optimize a predictive model with limited training samples by using knowledge learned from another HSI dataset from the same sensor. The best classification maps of the various methods are displayed in Figure 3, where the number after each optimizer name represents the accuracy on the test set. Our MOE-U also achieved the highest classification accuracy of 0.9408.

PaviaC-PaviaU Task

In this task, we trained the meta-optimizers on PaviaC and tested them on PaviaU. The averaged loss change curves are shown in Figure 4, and the final averaged convergence values and their Stds are given in Table 9. In this scenario, similar experimental results were achieved: MOE-U achieved the fastest convergence speed and the lowest convergence value of 0.6665. The performance comparison of the classification results is shown in Table 10. The OA and Kappa values of our MOE-U were 0.6151 and 0.5195, respectively, outperforming all human-designed optimizers and meta-optimizers. The AA value of our MOE-U was just 0.0033 lower than that of our MOE-A. The distribution of samples across the land-cover classes of PaviaU is highly imbalanced; for instance, meadows (C2) has 18,649 labeled samples, while shadows (C9) has 947. This imbalance made training and testing very difficult, e.g., the Kappa of SGD was just 0.1991. Even in the face of such land-cover class imbalance, the meta-optimizers performed better than the human-designed optimizers, demonstrating that the learned meta-knowledge can effectively help optimize the network. The best classification maps of the various optimizers are displayed in Figure 5.

SalinasA-Salinas Task
In this task, we trained the meta-optimizers on SalinasA and tested their classification performance on Salinas. Since the SalinasA dataset contains only 6 categories, while the Salinas dataset contains 16 categories, this task tests the generalization ability of our method in dealing with many unseen land-cover classes. The averaged loss change curves are shown in Figure 6, and the final averaged convergence values and their Stds are given in Table 11. In this experiment, MOE-U converged significantly faster and reached lower losses than the other optimizers.

The performance comparison of the classification results is shown in Table 12, indicating that the Salinas dataset was so hard to classify with few training samples that all of the human-designed optimizers failed in this experiment. This might be attributed to the considerably large number of classes in the Salinas dataset, whose samples were so similar across classes that they were difficult to distinguish. Although the human-designed optimizers reduced the training loss, they made the predictive model severely overfit, resulting in OA, AA, and Kappa values all near 0.2. In contrast, the OA, AA, and Kappa values of our MOE-U reached 0.5333, 0.5516, and 0.4845, respectively, the best among all of the optimizers. Compared with the single meta-optimizer (LSTM optimizer), our MOE-U increased the OA, AA, and Kappa values by 0.0467, 0.0368, and 0.0480, respectively. Compared with our MOE-A, our MOE-U increased these values by 0.0274, 0.0053, and 0.0233, respectively. These results indicate that our improvement of meta-optimizers can effectively enhance the generalization ability in dealing with many unseen classes. The best classification maps are displayed in Figure 7.

PaviaC-SalinasA Task
The PaviaC dataset and the SalinasA dataset were acquired by different sensors, so they differ completely in the number of bands, spectral range, spatial resolution, and land-cover types. Figure 8 shows the averaged loss change curves, and Table 13 shows the final convergence values. All of the meta-optimizers again performed better in convergence speed and final convergence value; the averaged final convergence value of MOE-U was 0.2822, lower than that of all other meta-optimizers. The classification results are shown in Table 14. All of the meta-optimizers produced better classification results than the human-designed optimizers, and MOE-U achieved the best results in OA, AA, and Kappa again. This demonstrates that the learned knowledge can be generalized to HSI datasets from different sensors. Since the training dataset and test dataset in this task were from different sensors, these results effectively demonstrate the robustness of our method. Finally, the best classification maps are displayed in Figure 9.

SalinasA-PaviaC Task
In this task, we trained the meta-optimizers on SalinasA and tested their classification performance on PaviaC. This task further tests the robustness of our method in the scenario where the source dataset and the target dataset are from different sensors. Figure 10 shows the averaged loss change curves, and the final convergence values are shown in Table 15. Based on the experimental results, we drew a similar conclusion: all of the meta-optimizers performed better than the human-designed optimizers in terms of convergence speed and final loss, and our MOE-U achieved the lowest loss among the meta-optimizers. The classification performance results are shown in Table 16, demonstrating again that the meta-optimizers can generalize well to HSI datasets from different sensors. Our proposed MOE-U achieved 0.8445, 0.8112, and 0.7890 in OA, AA, and Kappa, respectively, outperforming all other optimizers. The best classification maps on PaviaC are displayed in Figure 11, while the best classification maps of the human-designed optimizers are shown in Figure 3.

PaviaC-KSC Task

The PaviaC dataset and the KSC dataset were also acquired by different sensors. Due to the significant difference in spatial resolution (1.3 m and 18 m, respectively), the meta-knowledge learned from one of these HSI datasets is very difficult to use in classifying the other, so this task further tests the robustness of our proposed method. The averaged loss change curves are shown in Figure 12, and the final averaged convergence values and their Stds are reported in Table 17. In this scenario, our MOE-U achieved the fastest convergence speed and the lowest convergence value of 1.5106, and the losses of the human-designed optimizers were all significantly higher than those of the meta-optimizers. The classification results are shown in Table 18. All of the meta-optimizers obtained better results than the human-designed optimizers, and MOE-U again achieved the best OA, AA, and Kappa of 0.5361, 0.4278, and 0.4860, respectively. Among the human-designed optimizers, the highest values were only 0.2757, 0.2131, and 0.2003, respectively, indicating that they caused the network to fall into serious overfitting when the training samples were insufficient, thus failing to make correct classifications. The best classification maps are displayed in Figure 13.

KSC-PaviaC Task
For the KSC-PaviaC task, the averaged loss change curves are shown in Figure 14, and the final averaged convergence values and their Stds are reported in Table 19. In this scenario, all of the meta-optimizers had similar convergence speeds and final convergence values, which were significantly better than those of the human-designed optimizers. The classification results are shown in Table 20. Our MOE-U achieved the highest OA, AA, and Kappa of 0.9024, 0.8388, and 0.8652, respectively. The best classification maps of the meta-optimizers are displayed in Figure 15, and the best classification maps of the human-designed optimizers are shown in Figure 3.


Discussion
Manually labeling HSI samples is not only time-consuming and laborious but also requires much professional expertise, so HSI classification always faces the challenge of few-shot learning. The main difficulty is improving the generalization ability of few-shot classifiers and avoiding overfitting, and traditional human-designed optimizers (e.g., Adam, AdaGrad, SGD) are widely recognized as unable to achieve this aim. As an advance of deep learning, meta-learning has become a powerful tool for dealing with the issue of few-shot learning. It always runs fast, since only very limited samples are used to train the meta-learning model, leading to low computational costs and memory requirements. In this study, we focused on the N-way 10-shot learning setting, i.e., we randomly picked 10 samples from each of N land-cover classes to form the training set. We proposed an improvement of meta-optimizers through an ensemble of meta-optimizers and a novel update integration process. We performed periodic annealing on the learning rate during the meta-training process, which allowed us to build an ensemble of meta-optimizers without adding training costs. The incorporation of the update integration process increased the computational costs during the iterative process but brought a significant enhancement in classification accuracy.
In seven classification tasks on five real-world HSI datasets, we trained meta-optimizers on a source HSI dataset and tested their performance on different datasets from the same or different sensors. The real-world HSI classification experimental results showed that, with the help of the learned knowledge, the meta-optimizers outperformed the human-designed optimizers in terms of convergence speed, final convergence value, OA, AA, and Kappa. Moreover, our proposed MOE-U performed better than the single meta-optimizer (LSTM optimizer) and MOE-A, proving that an ensemble of meta-optimizers can achieve better results with our proposed update integration algorithm. Multiple meta-optimizers in an ensemble contain more parameters than a single meta-optimizer and thus more useful knowledge; when optimizing the predictive model, they have a higher probability of generating good updates, and the proposed update integration algorithm is more likely to choose those good updates. Finally, our MOE-U achieved the best results in all real-world HSI classification experiments.

Conclusions
When training samples are lacking, the widely used human-designed optimizers, such as Adam, AdaGrad, RMSprop, SGD, and SGD with momentum, can cause the model to fall into severe overfitting. To solve this issue, we presented a meta-optimizer ensemble for HSI classification, which learns prior knowledge from a source HSI dataset and then uses the learned knowledge to train the predictive model on a target HSI dataset with limited training samples. By combining the advantages of ensemble learning and meta-learning, the meta-optimizer ensemble improves overall performance and generalization ability. Moreover, we proposed an effective update integration algorithm to incorporate the candidate updates generated by the meta-optimizer ensemble into the final update. The experimental results on multiple few-shot HSI classification tasks demonstrate the superiority and effectiveness of our proposed methods. Compared with the widely used human-designed optimizers, our meta-optimizer ensemble makes the predictive model converge faster on target HSI datasets while achieving better OA, AA, and Kappa coefficient results. Beyond HSI classification, our improvement of meta-optimizers has the potential to improve various few-shot learning models in broad fields, including numerical analysis, computation, medical imaging, industrial detection, and other applications.

Figure 1. The framework of our proposed method.


Figure 2. Loss changes on the PaviaC dataset.

Figure 4. Loss changes on the PaviaU dataset.

Figure 6. Loss changes on the Salinas dataset.

Figure 8. Loss changes on the SalinasA dataset.

Figure 9. Visual classification results on the SalinasA dataset: (a) false-color image; (b) ground-truth map; classification maps obtained by (c) Adam (0.7556), (d) AdaGrad (0.7467), (e) RMSprop (0.7609), (f) SGD (0.6605), (g) SGD with momentum (0.8708), (h) LSTM optimizer (0.9476), (i) MOE-A (0.9139), and (j) MOE-U (0.9277).

Figure 10. Loss changes on the PaviaC dataset.

Figure 11. Visual classification results on the PaviaC dataset: (a) LSTM optimizer (0.9407), (b) MOE-A (0.9060), and (c) MOE-U (0.9194).


Figure 12. Loss changes on the KSC dataset.

Figure 13. Visual classification results on the KSC dataset: (a) false-color image; (b) ground-truth map; classification maps obtained by (c) Adam (0.4481), (d) AdaGrad (0.3373), (e) RMSprop (0.5296), (f) SGD (0.2810), (g) SGD with momentum (0.4123), (h) LSTM optimizer (0.6173), (i) MOE-A (0.5878), and (j) MOE-U (0.6319).

Figure 14. Loss changes on the PaviaC dataset.



Table 1. Land-cover classes with the number of samples per class in the PaviaU dataset.

Table 2. Land-cover classes with the number of samples per class in the PaviaC dataset.

Table 3. Land-cover classes with the number of samples per class in the Salinas dataset.

Table 4. Land-cover classes with the number of samples per class in the SalinasA dataset.

Table 5. Land-cover classes with the number of samples per class in the KSC dataset.

Table 6. Comparison of five real-world HSI datasets.

Table 7. The loss function values of different optimizers on the PaviaC dataset.

Table 8. Classification accuracy by different optimizers on the PaviaC dataset.

Table 9. The convergence values of different optimizers on the PaviaU dataset.

Table 10. Classification accuracy by different optimizers on the PaviaU dataset.

Table 11. The convergence values of different optimizers on the Salinas dataset.

Table 12. Classification accuracy by different optimizers on the Salinas dataset.

Table 13. The convergence values of different optimizers on the SalinasA dataset.

Table 14. Classification accuracy of the predictive model optimized by different optimizers on the SalinasA dataset.

Table 15. The convergence values of different optimizers on the PaviaC dataset.

Table 16. Classification accuracy by different optimizers on the PaviaC dataset.

Table 17. The convergence values of different optimizers on the KSC dataset.

Table 18. Classification accuracy by different optimizers on the KSC dataset.

Table 19. The convergence values of different optimizers on the PaviaC dataset.

Table 20. Classification accuracy by different optimizers on the PaviaC dataset.