Guiding the retraining of convolutional neural networks against adversarial inputs

Background When using deep learning models, one of the most critical vulnerabilities is their exposure to adversarial inputs, which can cause wrong decisions (e.g., incorrect classification of an image) with minor perturbations. To address this vulnerability, it becomes necessary to retrain the affected model against adversarial inputs as part of the software testing process. In order to make this process energy efficient, data scientists need support to determine the best guidance metrics for reducing the number of adversarial inputs to create and use during testing, as well as optimal dataset configurations. Aim We examined six guidance metrics for retraining deep learning models, specifically with convolutional neural network architectures, and three retraining configurations. Our goal is to improve convolutional neural networks against adversarial inputs with regard to accuracy, resource utilization and execution time, from the point of view of a data scientist, in the context of image classification. Method We conducted an empirical study using five datasets for image classification. We explored: (a) the accuracy, resource utilization and execution time of retraining convolutional neural networks guided by six different guidance metrics (neuron coverage, likelihood-based surprise adequacy, distance-based surprise adequacy, DeepGini, softmax entropy and random), and (b) the accuracy and resource utilization of retraining convolutional neural networks with three different configurations (one-step adversarial retraining, adversarial retraining and adversarial fine-tuning). Results We reveal that adversarial retraining from original model weights, with inputs ordered by uncertainty metrics, gives the best model w.r.t. accuracy, resource utilization and execution time.
Conclusions Although more studies are necessary, we recommend data scientists use the above configuration and metrics to deal with the vulnerability to adversarial inputs of deep learning models, as they can improve their models against adversarial inputs without using many inputs and without creating numerous adversarial inputs. We also show that dataset size has an important impact on the results.


Introduction
In recent years, Deep Learning (DL) systems, defined as software systems with functionalities enabled by at least one DL model, have become a widespread machine learning approach due to their outstanding capacity to solve complex problems. The use of DL systems ranges from applications in autonomous driving systems to applications in medical treatments, among others [6,36,39]. However, given their statistical and black-box nature, DL systems often exhibit unexpected behavior or produce anomalous results when inferring with inputs different from those used in the training stage. These vulnerabilities or system errors can lead to undesirable consequences when integrated into real-world applications already in production.
The existence of vulnerabilities is commonplace in any type of software system. In traditional systems, i.e., those not embedding any DL system, we may find a vast amount of testing approaches to ensure a certain level of robustness. When the tests detect a problem, there are specific solutions to fix it or at least mitigate its effects to a large extent [15]. In contrast, testing DL systems is completely different, due to their non-deterministic nature: the "correct" answer or "expected" behaviour in response to a certain input is often unknown at the time of training the DL model [2,4].
Recently, one of the most worrying vulnerabilities in DL systems is adversarial inputs, which are inputs created with slight modifications to the original inputs of a dataset. These types of inputs are often not distinguishable by humans from the inputs from which they are generated, but they lead the DL system to produce an incorrect output.

arXiv:2207.03689v2 [cs.SE] 12 Jul 2022
Given this scenario, DL system testing has become a relatively new, broad research field in which the number of related publications increases every year [13,26,43], aiming at laying the foundations of tests that allow research results to be compared in a consistent and effective way. Different frameworks have been created to facilitate the use of these methods [1,9,29,32,44], and likewise a number of metrics have been created to compare the properties of these testing methods [16,23,30].
This work focuses on improving the testing and retraining of DL models in the presence of adversarial inputs. To this end, we apply two different approaches: (i) using the information provided by guidance metrics, which try to identify the most useful inputs according to the criteria of each metric, and (ii) adding adversarial inputs to augment the dataset, employing training inputs different from the original training set. Still, questions arise: When and how should a certain metric be used? What are the accuracy, resource utilization and time of using such metrics? When and how should a certain retraining configuration be used? What are the accuracy and resource utilization of using a certain retraining configuration? In this context, this work aims at comparing both guidance metrics and configurations for retraining DL models with adversarial inputs based on their accuracy, resource utilization and time. The results of this research help data scientists test their models' accuracy against adversarial inputs while keeping the trade-off with the resource utilization of the retraining process. We evaluate our method in one typical DL application field, namely image classification. For this reason, we focus on one particular type of DL model, namely Convolutional Neural Networks (CNN), which are widely used for image classification due to their high performance. Since LeCun et al. introduced them [20], there has been significant progress in the literature with respect to their accuracy, even on challenging datasets, as well as the explanation of their behaviour and abstraction [19,42].
The main contributions of this work are: • A comparison, in terms of accuracy, resource utilization and time, of four guidance metrics, including random selection, and three configurations of retraining CNNs with adversarial inputs. • A replication package, available online.1 This document is structured as follows. Section 2 introduces the DL architecture used for the models, the guidance metrics (Neuron Coverage (NC) and two Surprise Adequacy (SA) metrics), and the adversarial attack Fast Gradient Sign Method (FGSM) used to create the adversarial inputs. Section 3 describes related work. Section 4 presents the research questions and the study design. Section 5 presents the results. Section 6 presents the discussion. Section 7 describes the threats to validity, and Section 8 draws conclusions and future work.

Background
In this section, we introduce: (i) the type of DL model we used, namely Convolutional Neural Networks (CNN); (ii) the guidance metrics used in our study, namely Neuron Coverage (NC) and two Surprise Adequacy (SA) metrics: Likelihood-based Surprise Adequacy (LSA) and Distance-based Surprise Adequacy (DSA); and (iii) the adversarial attack used to create adversarial inputs, namely the Fast Gradient Sign Method (FGSM).

1 Please refer to https://doi.org/10.5281/zenodo.5904550

Convolutional Neural Networks (CNN)
This is one of the most used architectures in image classification. A CNN consists of an input layer, an output layer and multiple hidden layers. These hidden layers typically consist of convolutional layers, pooling layers, normalization layers, fully connected layers, or other types of layers used to build more complex models. Each of the layers contains a different level of abstraction for an image dataset, and the weights of the final model are obtained by backpropagation [11,20,21].
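To make the layer types concrete, the following minimal NumPy sketch shows the convolution, activation and pooling operations that such hidden layers chain together. It is an illustrative toy, not the implementation used in this study; the image, kernel and pooling size are assumptions for the example.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution (cross-correlation) of a single-channel image."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; truncates edges not divisible by `size`."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size
    fm = feature_map[:h, :w]
    return fm.reshape(h // size, size, w // size, size).max(axis=(1, 3))

image = np.arange(36, dtype=float).reshape(6, 6)      # toy 6x6 "image"
kernel = np.array([[1.0, -1.0], [1.0, -1.0]])          # toy 2x2 filter
features = max_pool(np.maximum(conv2d(image, kernel), 0.0))  # conv -> ReLU -> pool
```

In a trained CNN the kernel values are learned by backpropagation; here they are fixed only to show the data flow through one convolutional block.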

Guidance metrics
Current approaches to DL systems testing evaluate them according to a number of properties, either functional (such as accuracy or precision) or non-functional (such as interpretability, robustness or efficiency) [43]. In order to improve the DL system behaviour with respect to these properties, it is important to have metrics to compare the behaviors of the models; among the metrics most used and accepted by the community are those related to neuron coverage and to the surprise of new inputs with regard to the model [16,25,30,40].

Neuron Coverage (NC)
Pei et al. proposed the concept of NC in 2017 [30] to measure the coverage of the test data of a DL model and to improve the generation of new inputs, arguing that the more neurons are covered, the more network states can be explored, giving a greater possibility of defect detection [23]. The metric is defined as follows. Let M be a trained DL model composed of a set N of neurons. The neuron coverage of an input x with respect to M is given by

NC(x, M) = |{n ∈ N | act(n, x)}| / |N|,

where act(n, x) is true if and only if neuron n is activated when passing x to M.
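As an illustration, NC can be computed over a matrix of activation values collected from the model. The threshold-based activation predicate and the toy activation values below are assumptions for the example, not part of the study's implementation:

```python
import numpy as np

def neuron_coverage(activations, threshold=0.0):
    """Fraction of neurons activated (above threshold) by at least one input.

    activations: array of shape (n_inputs, n_neurons) holding the activation
    values collected from the model's layers.
    """
    covered = (activations > threshold).any(axis=0)  # act(n, x) for some x
    return covered.sum() / activations.shape[1]

# Toy activations: 2 inputs, 4 neurons; neurons 0 and 2 exceed the threshold.
acts = np.array([[0.9, 0.0, 0.2, 0.0],
                 [0.0, 0.0, 0.7, 0.0]])
nc = neuron_coverage(acts, threshold=0.5)
```

With this data, half of the neurons are covered, so `nc` evaluates to 0.5.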
Likelihood-based Surprise Adequacy (LSA)
LSA estimates how surprising an input x is with respect to the training set T by means of kernel density estimation over the activation traces:

LSA(x) = −log(p̂(x)), where p̂(x) = (1 / |A_ℓ(T_C(x))|) Σ_{x_i ∈ T_C(x)} K_H(α_ℓ(x) − α_ℓ(x_i)),

where α_ℓ(x) is the vector that stores the activation values of the neurons in the ℓ-th layer of M when x is entered, T_C(x) is the subset of T composed of all the inputs of the same class as x, A_ℓ(T_C(x)) = {α_ℓ(x_i) | x_i ∈ T_C(x)}, and K_H is the Gaussian kernel function with bandwidth matrix H [25].

Distance-based Surprise Adequacy (DSA)
DSA measures the surprise of an input x as a ratio of distances in the activation space:

DSA(x) = dist_a / dist_b, with dist_a = ‖α(x) − α(x_a)‖ and dist_b = ‖α(x_a) − α(x_b)‖,

where x_a is the training input of the same class as the predicted class whose activation trace is closest to that of x, x_b is the training input of a different class whose activation trace is closest to that of x_a, C(x) is the predicted class of x by M, and α(x) is the vector of activation values of all neurons of M when confronted with x [25].
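The DSA ratio can be sketched in a few lines of NumPy. The toy activation traces and classes below are illustrative assumptions, not data from the study:

```python
import numpy as np

def dsa(a_x, pred_class, train_acts, train_classes):
    """Distance-based Surprise Adequacy of one input.

    a_x: activation vector of the input under test.
    train_acts / train_classes: activation vectors and classes of the
    training inputs.
    """
    same = train_acts[train_classes == pred_class]
    other = train_acts[train_classes != pred_class]
    x_a = same[np.argmin(np.linalg.norm(same - a_x, axis=1))]    # nearest same-class trace
    x_b = other[np.argmin(np.linalg.norm(other - x_a, axis=1))]  # its nearest other-class trace
    return np.linalg.norm(a_x - x_a) / np.linalg.norm(x_a - x_b)

train_acts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
train_classes = np.array([0, 0, 1])
score = dsa(np.array([0.5, 0.0]), 0, train_acts, train_classes)
```

Here dist_a = 0.5 (to the nearest class-0 trace) and dist_b = 2.0, so `score` is 0.25: a low value, i.e., the input is not surprising with respect to its class.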

Adversarial attack: Fast Gradient Sign Method (FGSM)
There are DL models whose dataset is augmented when there is not enough data. This can be done through techniques such as obtaining new inputs from those that already exist, carrying out transformations on them including cuts and rotations, among others [8,14,38]. Another way is using adversarial examples, which C. Szegedy discovered in 2013, when he noticed that several machine learning and DL models are vulnerable to inputs slightly different from those that are correctly classified, i.e., adversarial inputs [37]. This observation caused concern because it goes against the ability of these models to achieve good generalization. In 2015, Goodfellow introduced the FGSM method to create adversarial inputs in a relatively simple way, forcing the misclassification of the input while controlling the perturbation so that it is not perceptible to a human. It is computed as follows [10]:

x* = x + ε · sign(∇_x J(θ, x, y)),

where J is the cost function used for the training of the model θ in the neighborhood of the training point x for which the adversary wants to force a wrong classification, y is the label of x, and ε controls the magnitude of the perturbation. The adversarial example corresponding to the input x that results from the method is denoted as x*.
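A minimal sketch of the FGSM update follows, using a toy logistic-regression "model" with an analytic gradient in place of a real CNN; the weights, input and ε value are illustrative assumptions:

```python
import numpy as np

def fgsm(x, grad_x, epsilon):
    """x* = x + epsilon * sign(grad of the loss J w.r.t. x)."""
    return x + epsilon * np.sign(grad_x)

# Toy differentiable model: binary logistic regression with cross-entropy loss.
w, b = np.array([2.0, -1.0]), 0.0
x, y = np.array([0.5, 0.5]), 1.0
p = 1.0 / (1.0 + np.exp(-(w @ x + b)))  # model's confidence in the true class
grad_x = (p - y) * w                    # analytic dJ/dx for cross-entropy
x_adv = fgsm(x, grad_x, epsilon=0.1)    # perturbation bounded by epsilon
```

Because the perturbation follows the sign of the loss gradient, the model's confidence in the true class drops for `x_adv`, while each feature moves by at most ε.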

Related work
From the reviewed literature, there are many studies focused on the creation of adversarial inputs in order to uncover weaknesses and vulnerabilities of DL systems or models. Goodfellow et al. [10] were the first to formulate a concrete definition of adversarial inputs and clarify their generalization across different architectures and training sets. Additionally, they defined FGSM, which allowed them to generate adversarial inputs. Other studies centered on detecting such adversarial inputs for different purposes. Feinman et al. [8] aimed at distinguishing adversarial samples from their normal and noisy counterparts. Ma et al. [23] used their coverage criteria to quantify defect detection ability using adversarial inputs created with different adversarial techniques. Kim et al. [16] used their proposed SA criteria to show that they could detect adversarial inputs. Nevertheless, there is scarce research focused on using adversarial inputs in retraining with the purpose of improving the models (see Table 1 for a summary). In these few works, we can only identify one configuration used, similar to configuration 3 presented in Section 4 of this work, in which the authors retrain with only a few adversarial inputs and do not clearly show the steps for the retraining. Kim et al. [16] showed that sampling inputs using SA for retraining can result in higher accuracy, and Pei et al. [30] showed that error-inducing inputs generated by DeepXplore can be used for retraining to improve accuracy. Most recently, Ma et al. [25] proposed to use testing metrics as a retraining guide, looking for an answer on how to select additional training inputs to improve the accuracy of the model. They used the original training set starting from a previously trained model, but ordering these inputs following the metrics' guidance. Compared to that work, we aim to do the retraining using the information of a dataset augmented with adversarial inputs, not only the original training set.
As the use of adversarial inputs has been proved to provide many improvements, we identified a research gap in taking advantage of them during a retraining process. Moreover, we want to understand how to order a dataset augmented with adversarial inputs guided by metrics so that the retraining is efficient (with the highest accuracy using the fewest inputs) for data scientists. To the best of our knowledge, our study is the first one that applies metrics to guide a retraining using a dataset augmented with adversarial inputs in order to improve the model accuracy against adversarial inputs while keeping the trade-off with resource utilization, so as to obtain computational benefits, which is key as the retraining phase is time consuming.
In our study, we consider a dataset augmented with adversarial inputs obtained from the original training set using the FGSM method. Therefore our study provides the following novel contributions: • A comparison between state-of-the-art guidance metrics for a guided retraining. • A comparison of three different configurations for retraining against adversarial inputs.
Empirical study

Research questions
The goal [5] of this research is to analyze guidance metrics and retraining configurations for a retraining process of CNN models with the purpose of comparing them with respect to DL testing properties such as accuracy, resource utilization and time, from the point of view of a data scientist, in the context of image classification. Thus, we aim at answering the following research questions (RQ): • RQ1 - Does the use of guidance metrics impact the accuracy, the resource utilization and the time required for the retraining of a CNN model? • RQ2 - Does the configuration of the retraining of a CNN model impact the accuracy and the resource utilization required for retraining this model?

Variables
Table 2 describes the variables of our study. With respect to the dependent variables, some details follow: • Accuracy, which measures the capability of the retraining phase to provide higher accuracy against adversarial inputs. We measure the accuracy of the CNN models against an augmented test set, composed of the original test set and an adversarial test set obtained from the former by applying FGSM to each one of the original test inputs. • Resource utilization, which measures the input size needed to obtain the highest accuracy during the retraining phase. • Time, which quantifies the time to compute each of the considered metrics for the corresponding dataset.
Another variable that influences our conclusions is the dataset; we experiment with two different datasets. Hence, we define the dataset as a nominal variable indicating the dataset used for training and retraining the models. The datasets are from different domains and have different sizes.

Study design
Figure 1 shows the study design. First, the Data collection and preprocessing, which consists of the collection and preprocessing of raw data. Second, the Model training, which consists of the traditional training and the proposed retraining phase. Finally, we provide answers to the RQs by analyzing our results w.r.t. DL testing properties.

Data collection and preprocessing: datasets
We evaluate the guidance metrics on CNNs using two multi-class, single-image classification datasets. First, the German Traffic Sign Recognition Benchmark (GTSRB), with 43 classes containing 39,208 unique images of real traffic signs in Germany. It has been widely used in DL research [6,22,36]. These images are characterized by large changes in visual appearance due to different causes (e.g., weather conditions). While humans can easily classify them, for DL systems (e.g., autonomous driving systems) this is still a challenge.
Second, the Intel Image Classification dataset, with 6 classes containing 17,034 labeled images of natural scenes around the world, used in many studies as well [31,33,41]. It was provided by Intel Corporation to create another benchmark for image classification tasks such as scene recognition [3]. Scene recognition is a daily task for humans and a widely used application in industries such as tourism or robotics, yet it is still an emerging field for computer vision [27,35].
The adversarial inputs are obtained using the FGSM method from the foolbox library [32]. For the retraining, the adversarial inputs are generated from the original training set (see "Adv. Train" in Fig. 2, Step 1). Another adversarial set is obtained from the original test set (see "Adv. Test" in Fig. 2, Step 1).
Using only one adversarial attack to create the adversarial inputs can be a limitation of our work. Nevertheless, when Goodfellow et al. [10] presented FGSM, they stated that this type of adversarial example generalizes across architectures and training sets. Being aware of other attack methods, we chose this method because it is one of the most practical and is more widely used than other attack methods such as the Basic Iterative Method (BIM), the Jacobian-based Saliency Map Attack (JSMA), Projected Gradient Descent (PGD) or Carlini-Wagner (CW); compared to CW in particular, FGSM is more time efficient. On top of that, we focus our study on the retraining phase rather than on the creation of adversarial inputs.

Model training
The traditional training of a model consists of using an available training set, identified as "Train" in Figure 2, and then evaluating the model against a test set, identified as "Test" in Figure 2. The traditional training is represented in the upper part of Fig. 1, (ii) Model training phase. We add in this phase a retraining process, as represented in the lower part of Fig. 1, (ii) Model training phase. This retraining uses adversarial inputs guided by the metrics and three different configurations to improve the original model, M, against adversarial inputs. After obtaining a retrained model, M*, we evaluate the DL testing properties of this retrained model. The retraining process is shown more extensively in Figure 2. It comprises the following steps: (1) Step 1. Create adversarial inputs for training and testing: we obtain two adversarial sets by applying FGSM. First, to augment the training set, we apply FGSM to a subset of the original training set, "Train", to obtain the "Adv. Train" set; the number of adversarial inputs created corresponds to a small proportion of the entire "Train" set, so that we do not greatly increase the number of artificial inputs used in retraining and the naturalness of the dataset is not diminished either. Putting these two sets together, we obtain the "Train*" set. Second, to augment the original test set, "Test", we apply FGSM to the entire set to obtain the "Adv. Test" set, as it is too small compared to the augmented training set, "Train*". Putting the "Test" and "Adv. Test" sets together, we obtain the "Test*" set. For each configuration, in the retraining phase, we execute a retraining of the models guided by the four metrics, and for each metric we obtain 20 data points, as shown in Figures 3a-3f. Each retraining run for each data point is computed from its respective initial weights. Also, we address the randomness of the resulting values by executing the retraining randomly, and not through incremental training. Figures 3a-3f correspond to the accuracy of the models against the
augmented test set, "Test*". Table 3 shows the accuracy against the same "Test*" set and the resource utilization of the experiments by dataset, configuration and metric used in each case. The "Original accuracy" column shows the accuracy of the original model M against this new augmented dataset, which is low due to the adversarial inputs; the "Accuracy w.r.t. augmented test set" column shows the highest accuracy during its respective retraining; and the "Resource utilization" column shows the input size needed to obtain that highest accuracy. Table 5 shows the time to compute the considered metrics for every input of the corresponding dataset.
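Step 1 and the metric-guided ordering of retraining inputs can be sketched as follows. The attack function, the 10% adversarial fraction and the random toy data are illustrative assumptions standing in for FGSM and the real datasets:

```python
import numpy as np

rng = np.random.default_rng(0)

def build_augmented_sets(x_train, x_test, attack, adv_fraction=0.1):
    """Attack a small subset of Train and the whole Test (Step 1 sketch)."""
    n_adv = int(len(x_train) * adv_fraction)
    idx = rng.choice(len(x_train), size=n_adv, replace=False)
    adv_train = attack(x_train[idx])
    train_star = np.concatenate([x_train, adv_train])     # "Train*"
    test_star = np.concatenate([x_test, attack(x_test)])  # "Test*"
    return train_star, test_star

def order_by_metric(inputs, metric_scores):
    """Order retraining inputs from most to least surprising/uncertain."""
    return inputs[np.argsort(metric_scores)[::-1]]

# Toy stand-ins for FGSM and the image datasets.
toy_attack = lambda batch: batch + 0.05 * np.sign(rng.normal(size=batch.shape))
x_train, x_test = rng.normal(size=(100, 4)), rng.normal(size=(20, 4))
train_star, test_star = build_augmented_sets(x_train, x_test, toy_attack)
```

The retraining then consumes prefixes of the ordered "Train*" set of increasing size, which is how the 20 data points per metric and configuration are produced.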

Data analysis
In order to compare the impact that guidance metrics and retraining configurations have on the accuracy and resource utilization of the retrained models, we select the best model according to accuracy among the 20 data points of each combination of configuration and guidance metric (see the marked points in Figures 3a-3f in Section 5). We obtain the accuracy and resource utilization of these points and report these values in Tables 3-4 to determine whether they can be improved by guiding the retraining with the studied metrics and using the studied retraining configurations. In addition, we evaluate the time it takes to compute the guidance metrics in Table 5.
We answer the RQs from two different angles. First, we compare the accuracy changes against the test set augmented with adversarial inputs in the 20 data points of the retrained models according to each configuration and guidance metric. The test set is composed of the original test set and adversarial inputs created with the same attack method but using inputs of the original test set; this way we ensure that the test data is never used during training or retraining. Second, we compare the input size required to obtain the model with the highest accuracy among those data points according to each configuration and guidance metric.

Results
In this section, we report the results of our empirical study, answering RQ1 and RQ2, and highlighting key takeaways.
To better describe the results, we use Figures 3a-3f and Tables 3, 4 and 5, as explained in Section 4.3.3.

Does the use of guidance metrics impact the accuracy, the resource utilization and the time required for the retraining of a CNN model? (RQ1)
Guidance metrics and accuracy. According to Figures 3a-3f, we found in the experiments that the SA metrics yield the best selection of inputs for retraining, as indicated by the observed accuracy. Five out of six guided retraining runs obtained their best accuracy using SA metrics (highlighted in bold font in Table 3).
Guidance metrics and resource utilization. The impact of the metrics is most noticeable in the results of using C2 with the GTSRB dataset. Clearly, SA metrics can reach a better model with fewer inputs, as few as 14,400 out of 36,366 inputs in the best case, with an accuracy of 0.953, as shown in Figure 3c and Table 3. On the other hand, this is not so clear when using the Intel dataset, which may be due to its size: as there are not enough inputs, the metrics cannot make a difference with just a percentage of an already relatively small dataset. Nevertheless, DSA was the best option according to Figure 3d.
Overall, SA metrics may perform best because they first identify the inputs that are most different from the previous training inputs, so the model learns features different from those previously learned, in order to correctly classify adversarial inputs. On the other hand, it may be more difficult for the NC metric to identify good inputs for retraining, because this metric identifies inputs that cover more neurons according to certain thresholds, which may not be significant in retraining.
Guidance metrics and time. We obtained the time to calculate the metrics as stated in Table 5. Obtaining NC values takes more than 7x the time needed to obtain SA values, which may be due to the library used and to a non-optimized metric computation. In addition, the lower time required to get SA values may also be due to the size of the datasets, as it is known that the cost of computing SA metrics tends to grow quickly [40]. We can observe this trend when comparing the ratio of NC time to DSA time in each dataset. When we consider the Intel dataset, computing NC values takes 34x the DSA time, but when using the GTSRB dataset, this ratio reduces to almost 8x.

Key takeaways for RQ1:
• SA metrics yield the best selection of inputs for retraining, as indicated by the observed accuracy. • SA metrics can reach a better model with fewer inputs.
• Obtaining NC values takes more than 7x the time needed to obtain SA values.

Does the configuration of the retraining of a CNN model impact the accuracy and the resource utilization required for retraining this model? (RQ2)
Configuration of the retraining and accuracy. In terms of the total number of used inputs, as expected, C1 and C2 reached higher values of accuracy than C3. Nevertheless, between these two configurations, the benefits of using C2 are greater because, with fewer inputs (less than half in the best case, see Table 3), it is possible to complete the retraining with high accuracy. Due to the initial weights used in C2, this incremental training creates a bias towards the original training set, which can be the reason why C2 does not worsen its accuracy against inputs from the original test set: the adversarial inputs "Adv. Train" that augmented the dataset make this model better against the adversarial inputs from the augmented test set "Test*". Configuration of the retraining and resource utilization. According to the resource utilization, C2 is the best option of the studied configurations; nevertheless, using C3 shows some benefits. Table 4 shows the same values as Table 3 for C3, but for C2 the accuracy and resource utilization are obtained at a specific data point, when the model has used only the same input size as C3 (e.g., 5,000 inputs when using GTSRB and 3,000 inputs when using the Intel dataset). Although C3 was expected to have lower accuracy than the other two configurations because of the available input size for retraining, Table 4 shows that C3 can be a good option if we want to execute a retraining with just a few inputs. The accuracy values of C3 for both datasets are greater than their respective values for C2.
We can compare C3 with C2 because the starting point is the same original model M and we are executing a retraining with the same number of inputs; however, for C2 the inputs are chosen from the entire training set plus the adversarial set ("Train*"), while for C3 the inputs are only from the adversarial set ("Adv. Train").
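The three configurations can be summarized in a short sketch based on the descriptions above (C1 trains from scratch on "Train*"; C2 continues from the original weights on "Train*"; C3 continues from the original weights on "Adv. Train" only). `ToyModel`, `retrain` and its helpers are placeholders for illustration, not the study's training code:

```python
class ToyModel:
    """Stand-in for a CNN; records which weights and inputs were used."""
    def __init__(self):
        self.weights, self.seen = "random-init", []
    def set_weights(self, w):
        self.weights = w
    def fit(self, data):
        self.seen = list(data)
        return self

def retrain(model_factory, original_weights, train, adv_train, config):
    """C1: from scratch on Train*; C2: from original weights on Train*;
    C3: from original weights on the adversarial set only."""
    augmented = train + adv_train            # "Train*" = "Train" + "Adv. Train"
    model = model_factory()
    if config == "C1":
        return model.fit(augmented)          # training starts from scratch
    model.set_weights(original_weights)      # C2 and C3 start from model M
    if config == "C2":
        return model.fit(augmented)
    if config == "C3":
        return model.fit(adv_train)          # only "Adv. Train" inputs
    raise ValueError(config)

m2 = retrain(ToyModel, "weights-of-M", ["x1", "x2"], ["x1*"], "C2")
m3 = retrain(ToyModel, "weights-of-M", ["x1", "x2"], ["x1*"], "C3")
```

The sketch makes the comparison above explicit: C2 and C3 share the starting weights of M and differ only in which inputs they see.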
Key takeaways for RQ2: • The benefits of using C2 are greater because, with fewer inputs (less than half in the best case), it is possible to execute a retraining obtaining high accuracy. • C3 can be a good option for the retraining in cases in which we prioritize resource utilization.

Discussion
Our experimental results focused on the retraining of models against adversarial inputs, taking into consideration the most used and practical state-of-the-art metrics: LSA, DSA, NC and Random. On top of that, we considered the manner in which the retraining is done.
As observed in the results, we have verified that NC is not a consistent metric for selecting inputs when doing retraining. Previous studies, mainly Harel-Canada et al. [12], have already stated that NC should not be trusted as a guidance metric for DL testing. On the one hand, NC measures the proportion of neurons activated in a model, and it assumes that increasing this proportion improves the quality of a test suite. On the other hand, Harel-Canada et al. showed that increasing the NC value leads to fewer defects detected, less natural inputs and more biased prediction preferences, which is contrary to what we would expect when selecting inputs for a retraining process aimed at obtaining a better model with a smaller input size. In our study, the results confirm that NC should not be trusted as a guidance metric. Furthermore, Table 5 shows an evident disadvantage in time when computing NC. Overall, metrics other than NC should be used when doing guided retraining.
Ma et al. [25] found that uncertainty-based and surprise-based metrics are the best at selecting retraining inputs and lead to improvements in the number of used inputs, up to twice as fast as random selection, in their search for an answer on how to select additional training inputs to improve classification accuracy. Regarding SA metrics, we confirm the results obtained in that study: SA metrics, compared to the baseline of random selection and also to NC, are better and lead to faster improvements with a satisfactory model. We consider this work a complement to the results of the RQ about selection for retraining of Ma et al., as we focused our work on increasing the model's accuracy against adversarial inputs. Although our results show that these metrics improve the model faster (in terms of the input size used for retraining), we have found that the size of the dataset greatly affects the results, because when using small datasets the improvement is not as fast. Ma et al. only considered datasets with more than 50,000 available inputs to use in the retraining (the entire training set), but in real-world applications it is not always possible to have such large datasets. Unlike that work, we have experimented with a smaller dataset in which the improvements are minor (the Intel dataset). Previous studies, our results on SA metrics, and new implementations of these metrics [17,28,40] aim to be a baseline for test methods in DL testing: for data selection in retraining processes, data generation, etc.
Additionally, an important finding concerns the other variable that we considered: the configuration of retraining. Related works have only experimented with their own way of executing the retraining, and sometimes they are not even explicit about how they performed the retraining process. As we observe in the results, the configuration can change the performance of the model considerably. We have thus uncovered a new challenge in finding efficient configurations to retrain deep learning models. Considering the studied configurations, using the best configuration studied in this work (C2) can give data scientists a fast and efficient method of retraining.
When using the combination of the DSA metric and the C2 retraining configuration, we observe results that are aligned with green AI research, which takes into account computational costs to reduce the computational resources spent [7,34]: fewer inputs are needed for retraining, and also less time to compute the values of the metric w.r.t. the evaluated options for both independent variables. We observe greater benefits of using C2 over C3, as we give greater weight to efficiency without diminishing the model's accuracy, and we encourage data scientists to build greener DL models.

Threats to validity
In this section, we report the limitations of our empirical study and some mitigation actions taken to reduce them as much as possible.
Regarding construct validity, we include two original models using the CNN architecture, trained on two different datasets respectively, in order to mitigate mono-operation bias. Regarding the selected metrics, we use LSA, DSA, NC and Random to guide the retraining runs, which may entail potential threats. However, these metrics have been used in previous work, as shown in Section 3, lowering this threat. Concerning threats related to the configurations, we have reviewed the relevant literature and searched for retraining configurations using adversarial inputs.
Concerning conclusion validity, the quality of the DL models and implementations depends on the experience of the developers. To mitigate this, we provide the implementations organized and following the "Cookiecutter Data Science" project structure, making them as simple as possible. To increase the reliability of our study, we detail the procedure to reproduce our work: the process is shown in Section 4, and the datasets and replication package are available online. Also, we address the randomness of our results by starting the retraining runs for each data point from their respective initial weights (from the original model weights for C2 and C3, and from scratch for C1).
Two threats to internal validity are the implementation of the studied DL models and the computation of the metrics. To minimize this risk, we used the replication packages made available by the authors of the metrics, with the same configurations they used in their experiments. Also, different models are used with different datasets, mitigating the risk that the results of our study of guidance metrics and configurations are accidental.
Threats to external validity stem mainly from the number of datasets, models and adversarial generation algorithms considered. Our results also depend on the datasets, the type of architecture considered and the device used for training. We believe our results are applicable to image classification datasets. Some of these threats are addressed by the use of two image datasets, several state-of-the-art metrics and an adversarial attack widely used by the scientific community. Regarding the architecture type, since adversarial inputs can be generalized across different architectures [10] and we use them for the retraining, we believe that results for other architectures could be similar to those for the CNN architecture; however, only further experiments can reduce this threat.

Conclusions and Future Work
In this work we have studied DL testing metrics for guiding the retraining of models. We performed an empirical study with the metrics, considered three different configurations of retraining against adversarial inputs, and compared the metrics and the configurations. In summary, we observe that (i) the models increase their accuracy against the test set augmented with adversarial inputs, as sought in the objectives of this work, and (ii) there are computational benefits to using certain metrics and configurations.
The empirical study showed that the SA metrics (LSA and DSA) are useful to data scientists as guidance for a retraining phase when using the following configuration: an augmented training dataset with adversarial inputs, starting from the original model weights.
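As a reminder of how DSA scores an input, the metric divides the distance from the input's activation trace to its nearest same-class training trace by the distance from that neighbour to the nearest trace of any other class; inputs near a class boundary thus score higher and are retrained on first. A minimal NumPy sketch follows (the extraction of activation traces is model-specific and assumed to be done beforehand; the function name `dsa` is ours, not from the replication packages):

```python
import numpy as np

def dsa(at_x, pred_class, train_ats, train_labels):
    """Distance-based Surprise Adequacy for a single input.

    at_x         : activation trace of the new input, shape (d,)
    pred_class   : class the model predicts for the input
    train_ats    : activation traces of the training set, shape (n, d)
    train_labels : labels of the training set, shape (n,)
    Higher DSA = more "surprising" input (closer to a class boundary).
    """
    same = train_ats[train_labels == pred_class]
    other = train_ats[train_labels != pred_class]
    # Nearest same-class activation trace and its distance (dist_a)
    dists_same = np.linalg.norm(same - at_x, axis=1)
    x_a = same[np.argmin(dists_same)]
    dist_a = dists_same.min()
    # Distance from that neighbour to the nearest other-class trace (dist_b)
    dist_b = np.linalg.norm(other - x_a, axis=1).min()
    return dist_a / dist_b
```

Sorting the augmented training set by descending `dsa` then yields the ordering used by the recommended configuration.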
With the previous configuration and metrics, we can improve the accuracy of the models against adversarial inputs by up to 61.8% on the GTSRB dataset and up to 45.7% on the Intel dataset without needing many inputs: when using DSA, this can be done with 39.6% of the inputs on the former dataset and 83.7% of the inputs on the latter. Random guidance can reach similar levels of accuracy, but with the recommended configuration the computational benefit of using SA is that fewer inputs are required, as quantified above. However, we do not recommend the use of the NC metric, and we warn that another configuration, namely using an augmented dataset with adversarial inputs and starting from scratch, can be time consuming, as almost 100% of the inputs need to be used to obtain accuracy similar to that of the recommended configuration.
Additionally, we revealed that the size of the dataset is important when implementing the recommended metrics and configuration. Taking this into account, we need to assess whether it would be worth calculating the metrics when using only small datasets.
As next steps, the use of other adversarial attacks for the creation of adversarial inputs and the reproduction of our experiments on different datasets of varied sizes, unbalanced datasets and also non-image datasets [18] are required to generalize our findings. In particular, experiments with other DL architectures are required to confirm or reject that our findings hold across different architectures. Also, our results should be compared to guided retraining runs using new refinements of SA [17,28,40] and other testing metrics, such as uncertainty-based metrics.

Step 2. Obtain guidance metrics: Based on the original trained model M, the "Train" set and the new adversarial inputs for training, "Adv. Train", we compute the different metrics for the augmented training set, "Train*".
Step 3. Order inputs w.r.t. the guidance metrics: According to each metric, we order the inputs for the retraining. With this, we expect that the retrained model, M*, will be trained first with the images that are more difficult to classify and more informative according to the metrics' values.
Step 4. Retraining according to the configuration: We implement the retraining in three different ways using the ordered inputs, according to the following configurations (see Figure 2):
(a) C1: Starting from scratch using the new adversarial inputs and the original training set. The new training set, labeled "Train*" in Figure 2 and built in Step 1, is composed of the "Train" and "Adv. Train" sets. The model M* is retrained with this "Train*" set ordered by highest score of LSA, DSA, NC and Random, respectively, starting from scratch.
(b) C2: Starting from the original model M using the new adversarial inputs and the original training set. The new training set is the same "Train*" set as in C1, with the difference that the model M* is retrained from the original model M weights.
(c) C3: Starting from the original model M using only the new adversarial inputs. The new training set is only the "Adv. Train" set. The model M* is retrained with this set, also ordered by highest score of LSA, DSA, NC and Random, respectively, starting from the original model M weights.
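Steps 3 and 4 above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `prepare_retraining` and the `scores_fn` callback (which would wrap LSA, DSA, NC or random scoring) are hypothetical.

```python
import numpy as np

def prepare_retraining(train_x, train_y, adv_x, adv_y, scores_fn, config):
    """Order inputs by a guidance metric and build the retraining set.

    scores_fn : callable returning one guidance score per input
                (e.g. LSA, DSA, NC or random); higher scores come first.
    config    : 'C1' (Train* from scratch), 'C2' (Train* from the original
                weights) or 'C3' (only Adv.Train from the original weights).
    Returns (inputs, labels, reuse_weights).
    """
    if config in ('C1', 'C2'):
        # Train* = original training set augmented with adversarial inputs
        xs = np.concatenate([train_x, adv_x])
        ys = np.concatenate([train_y, adv_y])
    elif config == 'C3':
        xs, ys = adv_x, adv_y
    else:
        raise ValueError(f"unknown configuration: {config}")
    order = np.argsort(-scores_fn(xs))       # highest guidance score first
    reuse_weights = config in ('C2', 'C3')   # only C1 restarts from scratch
    return xs[order], ys[order], reuse_weights
```

The returned `reuse_weights` flag would decide whether the retraining loop loads the original model M's weights before fitting M* on the ordered inputs.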

Figure 3: Accuracy of the trained models.

Table 1: Review of the retraining configurations used in related work (C1, C2 and C3 refer to the configurations presented in Section 4)

Table 3: Accuracy against the augmented test set (original test set and adversarial test set) and resource utilization when the model reaches its maximum accuracy, for C1, C2 and C3 applied to each dataset and configuration of retraining. Note: bold numbers indicate the accuracy and resource utilization of the model with the highest accuracy during retraining.

Table 4: Accuracy against the augmented test set (original test set and adversarial test set) and resource utilization when the model reaches its maximum accuracy, for C2 (data point with the same number of inputs as used in C3) and C3 applied to each dataset

Table 5: Time (hh:mm:ss) to obtain the respective values