Metamodelling of Noise to Image Classification Performance

Machine Learning (ML) has made its way into a wide variety of advanced applications, where high accuracies can be achieved when these ML models are evaluated in the same context as they were trained and validated in. However, when these high-accuracy models are exposed to out-of-distribution points such as noisy inputs, their performance can degrade significantly. Recommending the most suitable ML model, i.e. the one that retains the highest accuracy when exposed to these noisy inputs, can overcome this performance degradation. For this, a mapping between the noise distribution at the input and the resulting accuracy needs to be obtained. However, evaluating this relationship directly is computationally intensive. To minimize this computational cost, we employ metalearning to predict this mapping; that is, the performance of different ML models is predicted given the distribution parameters of the input noise. Although metalearning is an established research field, performance prediction based on noise distribution parameters has not been accomplished before. Hence, this research focuses on predicting the per-class classification performance based on the distribution parameters of the input noise. Our approach is twofold. First, to gain insights into this noise-to-performance relationship, we analyse the per-class performance of well-established convolutional neural networks through our multi-level Monte Carlo simulation. Second, we employ metalearning to learn this relationship between the input noise distribution and the resulting per-class performance in a sample-efficient way by incorporating Latin Hypercube Sampling. The noise performance analyses present novel insights into the per-class performance degradation as gradually increasing noise is added to the input. Additionally, we show that metalearning is capable of accurately predicting the per-class performance based on the noise distribution parameters, and we characterise the relationship between the number of metasamples and the metaprediction accuracy. Consequently, this research enables future work to make accurate classifier recommendations in noisy environments.


I. INTRODUCTION
In recent years, Machine Learning (ML) has made strides in advanced autonomous applications, such as robotics [1], autonomous driving vehicles [2] and inland vessels [3]. The tasks of ML within these applications range from lane detection and follower algorithms to sophisticated object-detection systems covering a wide range of objects [4].
Traditionally, these ML systems have been trained on a set of data points in a certain context and are validated within that same context. This way of training and validation has proven to be very effective for operating and reasoning in many different use case domains [5]. However, problems arise when the ML model operates in a context that was not represented in the dataset, such as noise on the input. The model may then make ambiguous predictions, resulting in a cascade of problems [5].
One potential solution is to include these discrepancies in the learning process, so that the ML algorithm learns to generalise over them. This has been done in previous research, where datasets were augmented with certain noises so that the algorithm learns how to respond when such noise appears [6]. Although this is a common technique in research on the robustness of ML models, it does not fully solve the problem. Data points that fall outside of the learned context, also called out-of-distribution points, would still result in ambiguous outputs [7]. An additional complexity is that it is nearly impossible to take all discrepancies into account, as this implies the need for datasets that cover this wide range of noises and faults.
Another potential solution is to tackle the problem through a divide-and-conquer paradigm, since no single ML algorithm can achieve full coverage over all possible contexts, discrepancies and noises; this is also known as the 'no free lunch' theorem [8], [9]. Instead, one could design different ML models where each performs sufficiently well in one context, but lacks performance in others. Given a certain input with a particular noise characteristic, the model with the best performance on this input could then be chosen in order to provide the best output. Differences in performance between ML models originate from different aspects, such as the hyperparameters of the ML model or the dataset used for training.
Such a decision-making problem for similar algorithms in different contexts has already been addressed by Garcia et al. [10]. The authors analyse the performance of different noise filters in certain contexts where noise is present. As expected, these filters all perform differently in terms of denoising their input. Hence, by recommending algorithms through metamodelling, one could pick the filter with the best expected performance given the characteristics of the present dataset.
In this research, we aim to leverage this metamodelling-based classifier recommendation approach for predicting the per-class image classification performance given the parameters of the noise distribution on the input. Selecting the best-performing classifier can then be based on the per-class performance, which has not been done before. Therefore, in this paper, we propose a twofold methodology for image-classifying artificial neural networks (ANNs). First, their per-class classification performance is analysed by propagating a prescribed range of noise distributions through the models. Second, for each ANN, this analysis is captured in a corresponding metaset and learned by a metamodel, which covers the relation between the applied noise distribution characteristics and the resulting classification performance.
The remainder of this paper is organised as follows. Section II elaborates on related work in the field of noise propagation analyses and metalearning. In Section III, we elaborate on the proposed methodology. Section IV discusses the experimental setup used to validate our approach. Sections V and VI present and discuss the results, respectively. Finally, Sections VII and VIII present the conclusion and possible tracks for future work, respectively.

II. RELATED WORK
In this research, we focus on analysing the relationship between noise at the input of an image classifier and the resulting per-class performance, as well as learning this relationship with metalearning to predict the per-class performance based on the noise distribution parameters. Therefore, this section first elaborates on related work regarding noise propagation methods, after which related research in metalearning is discussed.

A. NOISE PROPAGATION ANALYSIS
Several related works have carried out noise propagation analyses on image classifiers. Nemcovsky et al. [11] and Liu et al. [12] have examined the performance of CIFAR-10 classifiers when exposed to adversarial noise. Hendrycks and Dietterich [13] have compared the mean Corruption Errors of different classifiers to adversarial and common noises on the ImageNet dataset. Liang et al. [14] and Berend et al. [7] compared the classification performance of classifiers when exposed to out-of-distribution samples from CIFAR-10, which includes the augmentation of common noises such as Gaussian and Uniform noise.
To perform noise propagation analyses of neural networks with noise distributions, different methods exist to acquire performance insights. Abdelaziz et al. [15] and Nathwani et al. [16] compared different techniques to perform uncertainty propagation through ANN-based automated speech recognition systems. Reference [15] performed uncertainty propagation with a multivariate Gaussian distribution to approximate the uncertainty of the acoustic score at the output. They compared the usage of layer-wise Unscented Transform and piece-wise exponential approximations, along with network-wise Unscented Transform and Monte Carlo approximations. They concluded that Monte Carlo provided the best approximation in all circumstances, with the network-wise Unscented Transform as the second-best approximator. Reference [16] performed similar experiments and also showed that Monte Carlo consistently outperforms the Unscented Transform approximation. Nemcovsky et al. [11] used Monte Carlo simulation to improve the adversarial robustness of CIFAR-10 image classifiers via randomized smoothing of the classifier. Liu et al. [12] employed an adaptation of Monte Carlo simulation in which each of N clean images is perturbed with corruptions M times. With this, they analysed the robustness of models via the ε-Empirical Noise Insensitivity metric.
Considerable work has already been carried out regarding robustness of image classifiers and noise propagation methods. However, these works have only considered the global robustness and classification performance of the classifier.
To our knowledge, no work has been done regarding the relation between input noise and the robustness and classification performance of the individual classes.

B. METALEARNING
Three major categories exist within metalearning over Artificial Intelligence (AI) models [17]: (i) regression-based prediction of the performance of a certain AI model, (ii) single-label classification-based prediction of the most suitable AI model for a given task, and (iii) multi-label classification-based prediction of a ranking of the most suitable AI models for a given task.
Reif et al. [17] researched the possibility of creating a regression-based metamodel for non-expert end-users. They considered predicting the performance values of a classifier based on dataset characteristics to be most suitable, as this would provide the most valuable insights for the end-users. Garcia et al. [10] tackled the problem of estimating the most suitable filter against noise in classification datasets. They do so by creating a regression-based metamodel which maps the relation between the characteristics of noisy datasets and the induced performance of a classifier. In the work of Muñoz et al. [18], an objective way of assessing the performance of supervised learning methods is proposed. Additionally, they designed a methodology where the relation between dataset characteristics and objective performance measures is captured in a regression-based metamodel. Kotlar et al. [19] elaborate on the performance of anomaly detectors on different datasets. They capture the relationship between characteristics of the anomaly-containing datasets and the performance of detectors in a regression-based metamodel. Eggensperger et al. employ metaregressors to benchmark hyperparameter optimization systems [20] and algorithm configuration procedures [21].
In the field of using metamodels for predicting the best AI model for a given task, several works have been proposed. Lorena et al. [22] proposed a methodology for predicting the best classifier given the complexities of increasingly complex synthetic datasets. Additionally, they propose metalearning techniques for recommending parameters of a certain AI model, and they designed a metaregressor which predicts the expected Normalized Mean Squared Error of a regression-based AI model. Ler et al. [23] elaborate on the selection of algorithms via metamodelling by focusing on the complexity of the dataset itself, without having prior knowledge about the performance of a classifier on it. With this, they create a set of metafeatures which is less model-centric and focuses more on the general complexity. The metaset labels consist of the predictive accuracy of the underlying AI model. Aguiar et al. [8] focus on recommending the most performant multi-target regressor for a given set of dataset characteristics.
In terms of metamodels used in research, different approaches exist. The survey of Hutter et al. [24] notes that different metaregressors have been used for similar performance prediction applications, and that the choice of metamodel depends on the metasamples. Garcia et al. [10] employed metamodels based on the Random Forest (RF) and k-Nearest Neighbours (k-NN) techniques. They concluded that the RF implementation performed best. Eggensperger et al. [20], [21] also selected RF and k-NN, along with Gaussian Processes (GP), Gradient Boosting (GB), Support Vector Regression (SVR) and Ridge Regression (RR). They concluded that GB and RF performed best, with RF performing better on metasets with a larger number of samples and parameters. Horváth et al. [25] used metaregressors based on RF and k-NN to predict hyperparameters of Decision Tree (DT) and Support Vector Machine (SVM) algorithms. Roy et al. [26] discussed the usage of linear metaregressors, along with DTs and non-linear models such as SVRs and Multi-Layer Perceptrons (MLPs).
In order to generate the metasets on which the metamodels are trained, metasamples need to be gathered that originate from the underlying ML model. Jin et al. [27] focus on creating surrogate models of more complex and expensive AI models. As the samples are of key importance to the performance of surrogate models, the authors propose an adaptive sampling strategy based on Latin Hypercube Sampling (LHS), where the performance of the surrogate model guides the process of gathering extra samples. Metta et al. [28] research the generation of surrogate models for feasibility region analysis of complex AI models. For this, they also make use of LHS when building the initial surrogate model. In the research of Duchanoy et al. [29], the authors conducted a compact survey on which sampling technique to use for their metamodelling use case. The candidate options were Monte Carlo simulation, LHS, Uniform Design and Voronoi sampling. The authors chose the latter, as this was most suitable for their research project. Muñoz et al. [18] used LHS to sample instances from a 2D design space.

III. TWOFOLD METHODOLOGY
The proposed methodology is twofold. Firstly, we elaborate on a methodology to propagate noise through an image classifier, such that the relation between the input noise characteristic and the resulting per-class performance can be evaluated. Secondly, we aim to capture this noise-to-performance relation of the classifier in a metamodel. Given the characteristic of the input noise, this metamodel is then able to predict the per-class performance of a certain image classifier.

A. NOISE PROPAGATION THROUGH IMAGE CLASSIFIERS
In this research, we aim to observe the performance of an image classification ANN when exposed to noise on the input. We carry out this observation through our multi-level Monte Carlo simulation, using the whole validation set of the data the network has been trained on. The technique is shown in Figure 1 via steps A to D.
First, in Eq. (1), the original input image X ∈ R^p is augmented with a noise sample E ∈ R^p, sampled from distribution D with parameter tuple P ∈ R^s. The noisy input is then propagated through a classifying ANN F : R^p → R^q, which results in a classification output Y ∈ R^q. This is shown in Figure 1A. Note that this research primarily focuses on additive noise. However, this methodology can also be applied to other types of noise, such as multiplicative noise models.
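A minimal formulation of this step, consistent with the definitions above:

$$ Y = F(X + E), \qquad E \sim \mathcal{D}(P). $$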
The process of propagating a noisy image in Eq. (1) is now performed N times in Eq. (2), with N as a hyperparameter. Each time, a new noise sample is added to X, after which each noisy image is propagated through F. This results in N corresponding classification outputs Y. By averaging all outputs, we get the expected value E[Y], which represents the expected output of F when noise distribution D(P) has been added to image X. By adjusting hyperparameter N, we can adjust the accuracy of the estimate E[Y]. This process is presented in Figure 1B.
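In formula form, this Monte Carlo estimate can be sketched as follows, with E_n denoting the n-th noise draw:

$$ \mathbb{E}[Y] \approx \frac{1}{N} \sum_{n=1}^{N} F(X + E_n), \qquad E_n \sim \mathcal{D}(P). $$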
In Eq. (3) and Figure 1C, the aforementioned process is leveraged over the whole validation set X_1, ..., X_M the model F was validated on, with M as the total number of images in the set. All expected values E[Y]_1, ..., E[Y]_M are combined into an expected value E[C], which represents the expected confusion matrix of F when exposed to noise distribution D(P).
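One way to make this combination explicit (a sketch; the row-as-true-class convention of the confusion matrix is an assumption here) is to accumulate each expected output into the row of its image's true class c_m:

$$ \mathbb{E}[C]_{ij} = \sum_{m=1}^{M} \mathbb{1}[c_m = i]\, \mathbb{E}[Y_m]_j. $$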
Based on this expected confusion matrix, various performance metrics can be derived. For example, the classification accuracy can be derived from the ratio of correct predictions to the total number of predictions. Other examples of metrics include precision, recall and F1-score. With these metrics, the performance of the ANN can be judged for a particular input space along with a certain noise distribution.
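As an illustration, the sketch below derives these metrics from an expected confusion matrix with NumPy; the function name and the row-as-true-class convention are assumptions, not part of the original formulation.

```python
import numpy as np

def per_class_metrics(C):
    """Derive metrics from a (q, q) confusion matrix C, where C[i, j]
    counts images of true class i predicted as class j (assumed convention)."""
    tp = np.diag(C).astype(float)           # correct predictions per class
    fn = C.sum(axis=1) - tp                 # true class i, predicted otherwise
    fp = C.sum(axis=0) - tp                 # predicted as i, other true class
    eps = 1e-12                             # guard against division by zero
    accuracy = tp.sum() / C.sum()           # ratio of correct to total predictions
    precision = tp / np.maximum(tp + fp, eps)
    recall = tp / np.maximum(tp + fn, eps)
    f1 = 2 * precision * recall / np.maximum(precision + recall, eps)
    return accuracy, precision, recall, f1
```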
The next analysis step comprises the performance evaluation for a range of parameter sets of the noise distribution. For example, the degradation of performance for each class can be observed for increasing noise on the input space. This gives a clear sense of which classes of the classifier start to degrade the quickest, and which classes retain a reasonable performance.
For a particular noise distribution D, a parameter space P ∈ R^{r×s} is defined, as shown in Eq. (4). One dimension indexes the s parameters within a tuple P of the distribution, whereas the other indexes the r different parameter tuples. For example, a Gaussian distribution is characterised by a parameter tuple with two parameters: mean µ and standard deviation σ. The second dimension then denotes different settings for µ and σ.
Eqs. (1)-(3) are evaluated for all elements of P, ranging from P_1 to P_r. In this way, we obtain a corresponding set S consisting of expected confusion matrices E[C]_1 to E[C]_r, as defined in Eq. (5), where each of the elements represents the performance of F when subjected to D(P_i).
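Written out, the parameter space and the resulting set of expected confusion matrices take the following form, consistent with the definitions above:

$$ \mathbf{P} = \begin{pmatrix} P_1 \\ \vdots \\ P_r \end{pmatrix} \in \mathbb{R}^{r \times s}, \qquad S = \left\{ \mathbb{E}[C]_1, \ldots, \mathbb{E}[C]_r \right\}. $$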

B. METAMODELLING NOISE-TO-PERFORMANCE RELATION
As the goal is to capture the relation between the noise characteristic on the input and the performance on the output, this defines the inputs and outputs of the metamodel. More specifically, as presented in Eq. (6), the metamodel G serves as a regression function that learns the relationship between the parameters of the input noise distribution, a tuple P from the parameter space P, and the expected performance E[C] of the classifier F.
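In other words, the metamodel is a mapping of the following form (a sketch, with q classes as before):

$$ G : \mathbb{R}^{s} \to \mathbb{R}^{q \times q}, \qquad G(P) \approx \mathbb{E}[C]. $$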
This immediately implies the advantage of using metamodels: the metamodel only needs to be created once, so the analysis on the underlying classifier is also carried out once. Inferring this relation then requires only one evaluation of the regression function. Without metamodels, the noise-to-performance relation could only be inferred by carrying out the noise propagation ad hoc, which is a costly process.
To create such a metamodel, a metaset is needed to fit it on. Figure 2a shows the technique visually. We use a variant of Latin Hypercube Sampling (LHS) that operates within a defined range of parameters for a particular noise distribution. This sampling method provides metasamples, each of which represents a parameter tuple of a noise distribution. With LHS, each dimension of the noise distribution parameter space is divided into k equal parts, after which a random sample is drawn from each part, resulting in k different samples. For example, as shown in Figure 2, LHS could operate over a range of means and standard deviations for Gaussian noise. Each sample then represents a Gaussian distribution with a particular mean and standard deviation.
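As a concrete illustration, the following sketch draws stratified LHS samples over a hypothetical two-dimensional Gaussian parameter space (mean, standard deviation) with SciPy; the bounds and sample count are assumptions for illustration only.

```python
from scipy.stats import qmc

# Hypothetical bounds for a Gaussian noise parameter space (mu, sigma).
l_bounds, u_bounds = [0.0, 0.0], [0.5, 2.0]

sampler = qmc.LatinHypercube(d=2, seed=42)   # one dimension per parameter
unit_samples = sampler.random(n=20)          # k = 20 stratified samples in [0, 1)^2
param_tuples = qmc.scale(unit_samples, l_bounds, u_bounds)
# Each row of param_tuples is one metasample: a parameter tuple (mu, sigma).
```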
For each metasample obtained via LHS, we carry out the Monte Carlo simulation described in Section III-A. This yields a metaset where the inputs are the parameters of the noise distribution and the outputs are the corresponding performance measures for all classes of the classifier.
Next, we fit a regression model on this metaset, which then learns the mapping from the parameters of the noise distribution to the neural network's performance.
Finally, when the metamodel has been trained on the generated metaset, it has learned the relationship between the noise distribution parameters and the expected performance of the ANN. Therefore, as shown in Figure 2b, given a particular set of parameters of the noise distribution, the expected performance can be predicted.
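A minimal sketch of this fit-and-predict step, assuming a Gaussian Process metaregressor from scikit-learn and dummy placeholder metasamples (the variable names and values are hypothetical):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Placeholder metaset: X_meta holds k parameter tuples (here s = 1, a single
# noise width r), y_meta holds the matching per-class accuracies (q = 10).
X_meta = np.linspace(0.0, 2.0, 20).reshape(-1, 1)
y_meta = np.exp(-X_meta) * np.ones((1, 10))        # dummy decaying accuracies

# Fit the metamodel on the metaset (Eq. (6): G(P) ~ E[C]).
gp = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(), normalize_y=True)
gp.fit(X_meta, y_meta)

# Predict the expected per-class accuracy for an unseen noise setting.
y_pred = gp.predict(np.array([[0.8]]))             # e.g. noise parameter r = 0.8
```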

IV. EXPERIMENTAL SETUP
As our methodology consists of two major parts (i.e. the noise propagation analysis and the generation of metamodels), the experiments are set up in these two categories as well. First, we compare and elaborate on the performance of image classifiers when analysed by the process described in Section III-A. Second, we present the performance of the created metamodels and how the different models compare against each other. Finally, the influence of the collected metaset on the prediction performance of the metamodels is investigated.
The experiments are executed on Tesla V100 graphics processing units within an NVIDIA DGX-2.

A. DATASET & MODELS
Regarding the dataset for these experiments, we selected CIFAR-10, as it is sufficiently complex [30] and widely used in other research tracks [31], [32], which implies the generalisability and broad usability of this dataset. Given the CIFAR-10 dataset, we opted for different pre-trained convolutional neural networks (CNNs) that had already undergone hyperparameter optimization and were trained under the same conditions. As shown in Table 1, we opted for three architecturally different CNNs, each with three or four distinct versions.
As mentioned in Section II-B, different metaregressors have been employed in various application fields. As the choice of metaregressor depends on the application and the metasamples at hand [24], there is no straightforward choice of metaregressor for our application. Therefore, we opted to implement and compare the regression models discussed in Section II-B, i.e.: (i) Random Forest, (ii) k-Nearest Neighbours, (iii) Decision Trees, (iv) Gradient Boosting, (v) Gaussian Processes, (vi) Support Vector Regression, and (vii) Ridge Regression. As the performance of these metaregressors depends on their hyperparameter configuration, we conducted a randomized hyperparameter search and selected the best-performing parameters accordingly. These hyperparameter search spaces and best-performing values are shown in Table 2.
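A randomized search of this kind could be set up as in the sketch below, using scikit-learn; the search space shown is hypothetical, as the actual spaces used are those listed in Table 2.

```python
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor

# Hypothetical search space for the RF metaregressor (illustrative only).
param_space = {"n_estimators": [50, 100, 200, 400],
               "max_depth": [None, 5, 10, 20]}

search = RandomizedSearchCV(RandomForestRegressor(), param_space,
                            n_iter=8, cv=3, random_state=0)
# search.fit(X_meta, y_meta) would then yield search.best_params_,
# the configuration used to train the final metamodel.
```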

B. NOISE DISTRIBUTION
For this research, we make use of the frequently used Uniform Distribution [36], [37]. This distribution has two parameters: the start and end points of its range. In our research, however, we use the distribution in a symmetric fashion; that is, it is centred around zero and its width is parametrised by a single parameter r, as shown in Eq. (7). The value u in Eq. (7) represents the sample from the uniform noise distribution D_U. This sample u is a three-dimensional matrix with the same shape as the image X to which the noise is added. For each pixel (i.e., each tuple of RGB channels), three samples are drawn from D_U. As shown in Figure 3, the image from the dataset and the equally-shaped noise sample are added together, resulting in a perturbed image X′. Figure 4 shows how different values of r affect an image sample from each class in the CIFAR-10 dataset. We chose 2.0 as a maximum value, as this already heavily distorts the image; values higher than 2.0 would add little research value. This is also shown experimentally in the results of Section V-A.
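A plausible form of this sampling step, assuming the symmetric range is taken as [-r, r] (the single width parameter described above admits other conventions, e.g. [-r/2, r/2]):

$$ u_{ijk} \sim \mathcal{D}_U = \mathcal{U}(-r,\, r), \qquad X' = X + u. $$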

C. SAMPLING STRATEGY
In order to fit a metamodel on the metasamples that capture the noise-to-performance relationship, these samples need to be generated via the multi-level Monte Carlo simulation presented in Section III-A. As parameter N is a hyperparameter of the Monte Carlo simulation, it is open to adjustment. According to our tests, N = 10 is an acceptable compromise between computational complexity and accuracy. Hence, 10 corresponding classification results of the ANN are collected, which are aggregated into the expected value E[Y]. This process is carried out for the whole validation set of CIFAR-10, which contains 10,000 different images. This in turn adds up to 100,000 forward propagations through the ANN to represent its classification performance for one parameter setting of the noise distribution.
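A minimal PyTorch sketch of this procedure, assuming a classifier `model`, a validation `loader`, and the uniform noise convention U(-r, r) hypothesised earlier (the function and variable names are illustrative):

```python
import torch

@torch.no_grad()
def expected_confusion(model, loader, r, n_mc=10, n_classes=10, device="cpu"):
    """Multi-level Monte Carlo estimate of the expected confusion matrix
    for symmetric uniform input noise with width parameter r."""
    C = torch.zeros(n_classes, n_classes)
    model.eval()
    for images, labels in loader:                  # loop over the validation set
        images = images.to(device)
        for _ in range(n_mc):                      # N noisy copies per image
            noise = (torch.rand_like(images) * 2 - 1) * r   # U(-r, r) per channel
            preds = model(images + noise).argmax(dim=1).cpu()
            for t, p in zip(labels, preds):
                C[t, p] += 1.0 / n_mc              # average over the N draws
    return C
```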
For the analysis process presented in the first part of the Results section below, we gathered 200 equidistant samples, ranging from noise values 0.0 to 2.0. Hence, each noise sample represents an increase of 0.01, which yields a granular and detailed analysis.
As the aforementioned analysis process is computationally expensive (i.e. 20,000,000 forward propagations), a more intelligent way of generating metasamples for the metamodel needs to be devised. Therefore, the methodology presented in Section III-B is designed to cover the distribution parameter space more intelligently. By employing LHS, the number of metasamples s determines the number of equal divisions k of the parameter space. For all k noise distribution settings selected with LHS, we apply the aforementioned multi-level Monte Carlo simulation. In the second part of the results, we vary the number of metasamples s on which the metamodels are trained, which therefore also changes the number of equal divisions k of the parameter space. In this way, we are able to examine the relation between the number of metasamples and the performance of the different metamodels.

V. RESULTS
In this section, the results are presented. The first subsection elaborates on the analysis of noise-to-classification behaviour. In the second subsection, the results of the trained metamodels are presented.

FIGURE 5. The classification performance of the VGG11 network architecture. As classification accuracy alone is not a sufficient metric for judging performance, the F-score is included as a second, complementary metric.

A. ANALYSIS OF IMAGE CLASSIFIERS
The noise-to-classification behaviour analysis is carried out for all ANNs shown in Table 1. For legibility, the results of all analyses are shown on different abstraction levels. First, the results of one particular ANN implementation are thoroughly discussed. Next, the analyses of different implementations of one architecture are compared. Finally, we compare different architectures through aggregations of their different implementations.

Figure 5 shows the performance of VGG11 in terms of its classification accuracy and corresponding F-score. In Figure 5a, we immediately notice that the global trend of the classification accuracy decreases as the added input noise increases. However, some classes deviate from this global trend. First, when no noise is present on the input, all classes but 'cat' and 'dog' are high in accuracy. This classification confusion between 'cat' and 'dog' has also been demonstrated by Yan et al. [38], who show a higher corresponding misclassification rate for said classes. They attribute this confusion to an ambiguous classification boundary, due to similar features in both classes. Second, the progression of classes 'bird' and 'frog' is remarkable: at medium and high noise levels, the network seems to bias towards frogs and birds, respectively. This would seem to indicate that, when a high level of noise is present on the input, these classes are still detected with a high accuracy; the class accuracy for birds has only dropped 6%, from 88% when no noise was present.

However, using only the accuracy metric to judge the performance of an ANN is not good practice. Figure 5b shows the F-score of the network. As this metric combines the precision and recall performance values, it provides context for the observed classification accuracies. We notice that the F-score monotonically decreases for all classes, which indicates a high number of misclassifications. Hence, the accuracy for class 'bird' is high because all images in the dataset are being detected as a bird. This means the image classifier is biased towards the 'bird' class when the input noise level is high and no features can be detected. This could be due to the hyperparameters regarding architecture, initialisation or training procedure, which result in the current local minimum of the neural loss function [39]. Interestingly, the F-scores for 'cat' and 'dog' are lower than the rest, similar to the graphs in Figure 5a. The progression of the other classes shows that the number of misclassifications is high.
As mentioned in the introduction of this paper, we aim to use different classifiers that all solve the same classification problem. As VGG11, shown in Figure 5a, is only one ML model in a range of models, we compare different versions of the same architecture in order to examine differences between them. Figure 6 shows a collection of versions of the VGG architecture. Figures 6a and 6b show VGG11 and VGG13, while VGG16 and VGG19 are shown in Figures 6c and 6d, respectively.

At first glance, we notice that the four figures are quite similar. When no noise is present on the input, the classes 'cat' and 'dog' share the same characteristic: they are consistently lower than the other classes from the dataset, albeit with small differences still present. Next, small differences in accuracy are noticeable when a relatively low noise level is present. For example, the class 'horse' is more stable with respect to uniform noise in VGG11 than in VGG13. At a noise level of 0.4, the accuracy is 67% in VGG11, whereas VGG13 reaches a performance of only 47%. Both VGG16 and VGG19 achieve an accuracy of approximately 51%. At first sight, the lower complexity of VGG11 suggests a higher robustness to uniform noise than the higher-complexity implementations of VGG. Hence, these differences already show that it is feasible to recommend suitable architectures for appropriate classes of interest based on class-specific accuracy for a given measure of the input noise.

When the noise level continues to increase, more noticeable differences start to appear. As mentioned in the previous paragraph, the characteristics of classes 'bird' and 'frog' are remarkable in the presence of high noise levels. As shown in Figures 5a and 6a, the classification accuracies of both classes cross at a noise level of 1.27. However, for VGG13, shown in Figure 6b, this is not the case. The accuracy for 'frog' keeps increasing, while the class 'bird' is monotonically decreasing along with the other classes in the dataset. We confirmed that the F-score for class 'frog' in VGG13 is also consistently lower than in VGG11. The reason for this behaviour is unclear and could be due to several factors. For instance, the random weight initialisation or the optimization during training could lead to another local minimum of the loss function. This, in turn, leads to different outcomes the network defaults to when a high level of noise is present.
In Table 3, we present a more in-depth numeric representation of the results for the four versions of the VGG architecture. First, we look at the performance of the ANN models under increasing noise. When no noise is present on the input, VGG13 is better than the other classifiers for seven out of ten classes. For the other three classes (i.e. 'horse', 'ship' and 'truck'), VGG19 is the best image classifier. However, these differences are only marginal. When the input noise is increased to a low level (0.1 ≤ r ≤ 0.3), there is no single classifier that performs best for all classes. Hence, depending on the class of interest, another image classifier would be recommended. However, when the noise increases further (r ≥ 0.5), we notice VGG11 being better than the other classifiers for most of the classes, with the exception of the class 'frog'. This is an interesting insight, as the complexity of this classifier is smaller than that of the other three networks. This could be attributed to the problem of overfitting, a common issue in training deep ANNs, for which several works have proposed prevention methods [40], [41].

As the previous results only compared different implementations of VGG, they only give insights into the VGG architecture itself. Therefore, in this final paragraph, we compare different architectures relative to VGG. Since the different implementations of a particular architecture have mutual similarities, for legibility these implementations have been aggregated into a single graph showing mean and standard deviation. Hence, the four different implementations shown in Figure 6 are aggregated in Figure 7a. Similarly, Figure 7b shows the aggregation of ResNet18, 34 and 50, and Figure 7c shows the DenseNet architecture with DenseNet121, 161 and 169 aggregated. We notice that the distribution over the different versions is more concentrated for both ResNet and DenseNet than it is for VGG. However, apart from the difference in distributions, the overall characteristics are similar. Again, classes 'cat' and 'dog' are consistently lower than other classes, and classes 'bird' and 'frog' have similar outlier characteristics. Minor differences are present in the classification accuracy of other classes. For example, with VGG and ResNet, class 'ship' performs better than 'deer' for low noise levels, but the opposite is true for DenseNet. This figure shows that the difference in architecture does not result in a significant difference in performance: the robustness to uniform noise behaves in a very similar way across architectures, with only small differences present. Classes 'bird' and 'frog' are outliers and the networks bias towards 'frog', while classes 'cat' and 'dog' are consistently lower than the others. This sheds light on the importance of the dataset on which the networks were trained for each network's respective robustness to uniform noise. This importance of the CIFAR-10 dataset has also been shown in other research: different CNNs experience similar performance drops under benign data distribution shifts [42] and learn similar non-sensical statistical patterns of the dataset [43].

B. EVALUATION OF METAMODELS
To assess the predictive performance of the regressive metamodels, we evaluate the errors between the actual values from the multi-level Monte Carlo analysis described in Section III-A and the values predicted by the metamodels. These errors are presented in the graphs in Figure 8. Each graph represents an average error with standard deviation over four metamodels, which have been trained on metasets of VGG11 through VGG19. As the ResNet and DenseNet architectures were determined to have a similar noise-to-classification performance relationship, and therefore have similarly performing metamodels, we omitted these architectures for legibility. The X-axis denotes the number of metasamples in each set; these samples are collected with LHS as described in Sections III-B and IV-C. This results in a noise distribution parameter space divided into as many parts as samples taken, with a random sample drawn in each part. After training, each metamodel has been evaluated with 8 different validation metasets, where each set consists of 40 metasamples, sampled with LHS in the same way as described before. For each metaset, the Mean Absolute Error (MAE), Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) are calculated. For all validation sets applied to the 4 metamodels, these error measurements are aggregated into a corresponding average and standard deviation. These aggregated MAEs, MSEs and RMSEs are shown in Figures 8a, 8b and 8c, respectively. We use these three different metrics to show the errors with three different nuances: the MAE shows the distribution of the average error in the metamodel predictions, the MSE indicates the presence of outliers as it gives more weight to outlier values, and the RMSE gives an indication of the distribution of the standard deviations of the prediction errors.
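For reference, a minimal sketch of how these three error measures can be computed from a validation metaset (the function name is illustrative):

```python
import numpy as np

def prediction_errors(y_true, y_pred):
    """MAE, MSE and RMSE between the Monte Carlo ground-truth accuracies
    and the metamodel predictions (arrays of per-class accuracies)."""
    err = np.asarray(y_pred) - np.asarray(y_true)
    mae = np.abs(err).mean()
    mse = (err ** 2).mean()
    rmse = np.sqrt(mse)
    return mae, mse, rmse
```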
First, the MAE graphs are presented in Figure 8a. Note that the vertical axis denotes the differences in classification accuracy percentages. Hence, a value of 10^-2 denotes an absolute difference of one percent between the predicted classification accuracy and the actual accuracy coming from the noise propagation analysis. We immediately notice the inaccuracy of the RR metamodel. As this metaregressor is a linear approximator, this shows that our problem is rather non-linear. Next, we notice the inaccuracies of the other metamodels when using a low number of samples. That is, when using 10 LHS-generated samples, each metaregressor performs worse than when using a larger number of samples. Even so, in absolute terms, the MAE of the GP, SVR and k-NN-based metaregressors is still low. The GP metaregressor has an average MAE of approximately 0.00342, or 0.342%, while SVR and k-NN have MAEs of 0.345% and 1.204%, respectively. Next to that, the MAEs of the tree-based metaregressors (i.e. RF and DT) are considerably higher than those of the other metaregressors, aside from RR. This could be due to different factors related to the low size and dimensionality of the metaset, such as overfitting of the regressors or inconsistent trees due to the limited subsampling of data or features [44]. When increasing the number of samples to 40 or more, we notice that the performance of GP and SVR converges to an average MAE of approximately 0.14% and 0.16%, respectively. As the size of the metaset increases, the performance of k-NN improves slightly, but the performance of RF improves rather drastically; RF even performs better than k-NN when employing a metaset size of 160 samples or more. This trend is in line with the research of Noi and Kappas [45]; they show that the performance of SVM is superior to RF and k-NN when using a low number of samples, but that the performance of the latter two improves as the number of samples increases. It should be noted that all metaregressors aside from RR and DT perform similarly when using a large number of samples; the MAE ranges between 0.24% and 0.14%, with GB and GP as worst and best regressor, respectively.

Figure 8b presents the MSE graphs. Again, we notice that RR is the worst predictor due to the non-linear characteristic of our use case. For all other metamodels, we notice that the MSE increases when using a limited metaset (i.e. 10 to 40 samples). This shows that all metamodels exhibit inconsistent prediction accuracies when fitted on a limited metaset. However, when increasing the metaset size to 40 samples or more, all MSEs start to converge. Similarly to the previous figure, GP and SVR show the best performance with the fewest outliers. RF, k-NN and GB show slightly worse MSE values, although the difference is small. The DT metamodel shows worse MSE values, implying that its prediction accuracies are more inconsistent than those of the other metamodels. We should note that the MAE for the classification accuracy is less than one percent, so although outliers are clearly present, they are still within limited ranges.
Finally, the RMSE measurements are shown in the graphs of Figure 8c. Similar to the previous figures, the error range is larger when using a low number of samples, while it starts to converge when using 40 samples or more. The same relation between the metamodels is present as well: RR shows the highest RMSE due to its linear approach, the DT model is second-to-last, and the other metamodels show small differences in RMSE, bounded between GB and GP as highest and lowest RMSE, respectively.

Finally, Figure 9 presents the class accuracy predictions of the different metamodels when trained on 10 samples of the VGG11 network. In this way, it shows how the increased errors of Figure 8 manifest in the actual predictions. As the other versions of VGG and the other network architectures have similar noise-to-performance relationships, and therefore similarly performing metamodels, these other network configurations are omitted for legibility. It is clear that the GP and SVR models already show accurate predictions when trained on 10 metasamples: the global trend of the individual classification accuracies is clearly shown. However, small aberrations are still present, such as in the classification accuracy of most classes at very low noise inputs. Both GP and SVR predict that 'cat' has a higher classification accuracy than 'dog', but this is the other way around in the actual analysis of VGG11. Additionally, the predicted class accuracies tend to have a positive slope towards a maximum, such as for the classes 'deer' and 'truck'. However, in the actual analysis, this slope is flatter, if not negative for some classes. This shows that GP and SVR can accurately predict classification performance with only 10 metasamples, but that small errors are still present. Regarding the other metamodels, it is shown that 10 metasamples do not yet contain enough information for accurate predictions. The interpolation process of k-NN clearly struggles with the limited information, while the other metamodels show even larger errors with their inaccurate step-wise decreases. The RR model is clearly unable to model our non-linear application.
In our use case, it is clear that GP and SVR are sample-efficient metamodels. Using 10 metasamples already results in accurate predictions, but 20 or 40 metasamples are required for minimizing the small aberrations, yielding high-accuracy predictions. k-NN and RF do not achieve the same sample-efficiency, but they start to achieve high prediction accuracies when trained on 40 samples or more. The other metamodels require more metasamples for high prediction accuracies, making them less sample-efficient.

VI. DISCUSSION
When looking at the analysis results of the image classifiers and their performance on CIFAR-10 with uniform noise, an interesting insight arises. Their respective performances for the same classes are very similar: (i) the networks appear to bias towards 'bird' and 'frog' when the noise level increases, and (ii) classes 'cat' and 'dog' consistently have lower accuracies than the other classes, even when no noise is added. This implies that the observed behaviours are more likely due to the characteristics of the dataset on which the classifiers are trained than to the classifier architectures themselves. It has indeed been shown that 'cat' and 'dog' have a more difficult feature set [38], while other classes could be trained on rather non-sensical statistical patterns of the dataset [43]. This shows that our noise propagation analysis is in synergy with prior findings regarding the effect of the CIFAR-10 dataset on image classifiers.

It can be noted, however, that slight differences are present between the classifier architectures. For example, as presented in Figure 6 and Table 3, the four versions of the VGG-based architectures have small differences. VGG11 appears to be consistently better than the other versions for a certain set of classes, especially for the class 'horse'; the difference between VGG11 and the other variants increases considerably as the noise increases. Dodge and Karam [46] report a similar finding: noise on the input is amplified through the convolutional layers, so fewer layers result in less noise amplification. In the case of the DenseNet-based architectures, it is the only set of architectures where the class 'deer' performs better than 'ship', which is reversed for VGG- and ResNet-based architectures. Hence, as each classifier has its own advantages and disadvantages, specific classifiers could be recommended based on the task at hand.

As stated before, metamodels noticeably facilitate the recommendation of classifiers [10]. In our research, we focused on regressive metamodels predicting the accuracy of widely used and researched classifiers, instead of directly recommending one or more networks. We conducted experiments on the number of samples needed for an accurate regressive prediction and noticed that, with a limited number of samples (i.e. 10-40 samples), a wider distribution of errors occurred when predicting the classification accuracy. Although this is true for all tested metamodels, the GP and SVR regressors still performed very accurately, achieving prediction errors of only 0.342% and 0.345%, respectively. This could be attributed to the fact that both are kernel-based metaregressors, which makes them accurate function approximators with only a low number of samples [47], [48]. However, more research is needed to validate this statement. The other metamodels suffer from interpolation aberrations due to the lack of metasamples, resulting in a higher and wider distribution of prediction errors. The worst-performing metamodel is the RR model, which is unable to model our classification accuracy problem in a linear fashion. When employing a medium number of metasamples (i.e. 40-100 samples), all metamodels start to converge towards their best prediction accuracy. GP and SVR reach convergence at 40 samples, at which point most aberrations are minimized.
k-NN, GB and RF reach convergence at around 100 samples. When using 100 metasamples or more, their MAE values are very similar: their MAEs range between 0.245% and 0.142% for GB and GP, respectively. Only the DT and RR metamodels do not reach prediction accuracy convergence before 200 metasamples. RR fails to model our non-linear problem, while DT only achieves a prediction accuracy similar to that of the other metamodels trained on 60 metasamples or fewer. This underlines the sample-efficiency of the other, more accurate metamodels, with GP and SVR as the best-performing metaregressors in our use case.

VII. CONCLUSION
In the field of algorithm recommendation, the most suitable algorithm is selected based on a set of features of interest. In this work, we present a twofold methodology for predicting the performance of classifiers based on the characteristics of the input noise distribution. This, in turn, could be used in algorithm recommendation systems. First, we propose a multi-level Monte Carlo simulation on an image classifier, which yields an in-depth analysis of the classifier's performance. Afterwards, we make use of a combination of LHS and the aforementioned multi-level Monte Carlo simulation to generate metasamples for the metamodel to fit on. Carrying out the noise-to-performance analyses on pre-trained state-of-the-art image classifiers, we notice that they show largely similar behaviour, with only small differences between them.
Regarding the metamodel fitting, we studied the relation between the number of metasamples and the resulting predictive performance for classification accuracies. A low to medium number of metasamples, in combination with sample-efficient metamodels such as GP or SVR, is recommended to accurately predict classification accuracies. Increasing the number of metasamples would only result in marginally better prediction performance, as nearly all metamodels converge towards similar performance with low absolute errors.
By using our metamodelling strategy, algorithm recommendation systems can efficiently select the most suitable algorithm given the parameters of a noise distribution on the input. Without these metamodels, a computationally complex analysis would be required each time a performance prediction needs to be made.

VIII. FUTURE WORK
To improve this research, several tracks of future work follow. First, a more in-depth analysis of per-class performance could be carried out. In this way, more detailed insights can be gathered on class accuracy fluctuations and biases. Secondly, this methodology could be applied to a more diverse set of noise distributions. This could include a synthetic distribution with a larger number of parameters, as well as real-life noise distributions, such as rain interference, overexposure or blur. Finally, our method could be validated on larger and more complex image datasets, such as CIFAR-100 or ImageNet. It could also be of interest to extend this methodology towards non-image-based tasks, such as object recognition algorithms on point clouds [49].
PETER HELLINCKX received the master's degree in computer science and the Ph.D. degree in science from the University of Antwerp, in 2002 and 2008, respectively. He is currently a Professor with the University of Antwerp. He is also the Head of the Department of Electronics-ICT. He supervises more than 20 Ph.D. students in the field of distributed artificial intelligence. He is also teaching third year bachelor's courses advanced programming techniques and artificial intelligence, and the master's courses distributed AI and computer graphics. He is the Co-Founder of the spin-offs Hysopt, Hi10, and Digitrans. His research interests include distributed artificial intelligence for the IoT and cyber physical systems with as main application domains: autonomous driving/shipping, logistics, mobility, Industry 4.0, and smart cities. In this field, he is a reviewer in many scientific project evaluation commissions, both on a national and an international level.