Tolerate Failures of the Visual Camera With Robust Image Classifiers

Deep Neural Networks (DNNs) have become an enabling technology for building accurate image classifiers, and are increasingly being applied in many ICT systems such as autonomous vehicles. Unfortunately, classifiers can be deceived by images that are altered due to failures of the visual camera, preventing the proper execution of the classification process. Therefore, it is of utmost importance to build image classifiers that can guarantee accurate classification even in the presence of such camera failures. This study crafts classifiers that are robust to failures of the visual camera by augmenting the training set with artificially altered images that simulate the effects of such failures. Such a data augmentation approach improves classification accuracy with respect to the most common data augmentation approaches, even in the absence of camera failures. To provide experimental evidence for our claims, we exercise three DNN image classifiers on three image datasets, in which we inject the effects of many failures of the visual camera. Finally, we apply eXplainable AI to discuss why classifiers trained with the data augmentation approach proposed in this study can tolerate failures of the visual camera.


I. INTRODUCTION
The popularity of Deep Neural Networks (DNNs) for image classification has grown enormously in the last decade, and they have been successfully applied in multiple domains such as computer vision [21], autonomous driving [12], bioinformatics [25], medical image analysis [1], and climate science [24]. The excellent performance of image classifiers depends heavily on i) the training phase, where the classifier learns how to classify images [27], and ii) the validity of the assumption that the data to be classified at test time comes from the same distribution as that of the training set [36].
When deployed in a real system, DNNs may face unexpected operational conditions that lead to out-of-distribution data due to unexpected or unknown behaviors. The images captured by the visual camera may be altered for a multitude of reasons [32]: this paper accounts for possible failures of the visual camera, which may be due to internal (e.g., failures of the electrical parts), external (e.g., dirt or scratched lenses), or environmental (e.g., rain or icing on lenses) factors. We consider the complete set of visual camera failures known to date in the literature, as identified by a Failure Mode and Effects Analysis [22].
The associate editor coordinating the review of this manuscript and approving it for publication was Wenming Cao.
Visual camera failures may occur frequently when a visual camera is placed in its operational setup, particularly when it is placed outside. Therefore, this study shows how to build image classifiers that are robust to visual camera failures, so that altered images delivered to the DNN will hardly compromise the image classification task. To achieve these goals, we developed a methodology based on data augmentation [2] that builds image classifiers whose classification performance suffers only from minimal degradation when processing altered images. We trained DNNs using a clean training set that contains only images from the original datasets (we refer to them as regular classifiers, which are built using the usual scaling and translation approaches for data augmentation) and an augmented training set that includes images from the original datasets plus altered images in which we injected the effects of visual camera failures (we refer to the resulting classifiers as augmented classifiers).
Our experimental campaign revealed that augmented classifiers have far better classification accuracy than regular classifiers when processing altered images and also on clean images; for example, the accuracy on clean images increased in the three datasets from 0.994 to 1 (from 82 to no misclassifications out of 12570 images), from 0.96 to 0.992 (from 46 to 9 misclassifications out of 1159 images), and from 0.997 to 0.999 (from 7 to 3 misclassifications out of 2505 images). Furthermore, we interpret our results using the LIME [30] framework, which explains why our augmented classifiers output fewer misclassifications than the regular classifiers. Briefly, augmented classifiers select a few strong features, whereas regular classifiers select many weak features that individually do not contribute much to classification.
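As a quick check of the arithmetic above, the following minimal sketch recomputes accuracy from the misclassification counts quoted in the text. The function name and the tuple layout are illustrative choices, not part of the authors' code; the counts reproduce the quoted accuracies up to rounding in the third decimal.

```python
# (total clean test images, errors of regular classifier, errors of augmented classifier)
# Counts are the ones quoted in the text, in the same order.
clean_results = [
    (12570, 82, 0),
    (1159, 46, 9),
    (2505, 7, 3),
]


def accuracy(total: int, errors: int) -> float:
    """Fraction of correctly classified images."""
    return (total - errors) / total


for total, reg_err, aug_err in clean_results:
    print(f"{total:>6} images: regular {accuracy(total, reg_err):.4f}, "
          f"augmented {accuracy(total, aug_err):.4f}")
```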
The remainder of this paper is organized as follows. Section II provides background information on visual camera failures, and Section III reviews related works on DNN robustness. Section IV describes the experimental methodology. Section V presents and discusses the results of the experimental campaign and elaborates on the robustness of regular and augmented image classifiers. Section VI details the limitations of this study, and Section VII concludes the paper and presents future work.

II. BACKGROUND ON VISUAL CAMERA FAILURES
Misbehavior(s) of the visual camera may generate altered images that are delivered to the image classifier. Fig. 1 depicts the main components of a visual camera [23], where camera failures may occur. The lens senses a scene from the environment in the form of light. This light is processed by an Image Sensor, whose photodiodes transform the light into its electrical encoding to produce a raw file [29]. The Bayer Filter, which acts on top of the Image Sensor, colors the photodiodes into a red-green-blue (RGB) pattern. The Image Signal Processor (ISP) processes the raw file to produce a digital image; it has multiple functions, including demosaicing, noise reduction, image sharpness correction, lens distortion correction, chromatic aberration correction, image compression, and JPEG encoding [23].
While many works, such as [35], elaborate on the effect of modified images on the classification process, only a few studies in the literature focus on the effects that failures of the visual camera may have on the produced image and consequently on the image classification. Even if the risk of accidental alterations of the output image of the camera is acknowledged as realistic [37], this consideration is usually ancillary to the main contribution of those works. Examples are [6], where the authors focus on environmental conditions and build a DNN that implements an attention mechanism for performance improvement, and [38], which explores how sensors respond when used in real circumstances and confirms the impact of environmental conditions on driving scenarios.
The most comprehensive work on visual camera failures is [22], where the authors systematically identified the failure modes and effects of a visual camera through the application of a Failure Mode and Effects Analysis (FMEA). The effects of camera failures on the output images are summarized in Fig. 2. Failures are caused by malfunctions of the lens, Image Sensor, Bayer Filter, or ISP, and belong to the following categories.
• Banding (Fig. 2b). A banded image contains many vertical and/or horizontal lines that are visible in the background.
• Black Pixels (Fig. 2c). The frame delivered to the camera body may contain dark spots due to the anomalous behavior of the Image Sensor. The effect is black pixels or black spots visible in the image.
• Blurring (Fig. 2d). If the focus of the lens is incorrect, the image may have visible blur.
• Brightness (Fig. 2e). The brightness of the image can be altered, ranging from a fully black frame (no brightness) to a fully white frame (maximum brightness).
• Broken Lens (Fig. 2f). Scratches, lines, or black areas may appear in the image when one or more lenses of the camera break.
• Condensation (COND, Fig. 2g). Halos may appear in images affected by the condensation failure.
• Dirt (Fig. 2h). Dirt on the external or internal lens of the camera may create a wide variety of alterations in the images; most likely, the image will have scattered black spots or areas in which the colors differ from the clean image.
• Ice (Fig. 2i). Ice crystals may be visible in images when the temperature drops to the freezing point.
• No Bayer Filter (NOBF, Fig. 2j). When the Bayer filter does not work properly, the image will likely end up having altered colors or even being colorless at all.
• No Chromatic Aberration Correction (NCAC, Fig. 2k). Halos appear at the corners and edges of the image, and blur the outer edges.
• No Demosaicing (NDEMOS, Fig. 2l). If the demosaicing process fails, the image becomes pixelated: in this case, each pixel is painted as a mosaic of RGB colors.
• No Noise Reduction (NNR, Fig. 2m). If the noise-removal component is malfunctioning, the image may include excessive noise.
• Rain (Fig. 2n). Drops of water on external or internal lenses can produce images with small circles at random positions.

III. RELATED WORKS ON ROBUST IMAGE CLASSIFIERS
The robustness of a DNN image classifier is its ability to sustain the performance of the model under various image corruptions or alterations [40]. In our study, image corruptions and alterations are due to visual camera failures, leading to the effects discussed in Section II. We now review the robustness approaches, organized into architectural solutions for building robust DNNs, out-of-distribution detection, and data augmentation strategies for robustness and for adversarial defense, explaining how this paper differs from each category.

A. ARCHITECTURAL SOLUTIONS FOR ROBUST DNNs
Several recent proposals have been made for robust image classifiers. In [14], the authors analyzed the classification performance of several DNNs for images degraded by Gaussian noise, blurring, and compression. They proposed a two-step (master-slave) process, in which the master classifies the quality of the degraded input image, which is then used to select the most suitable slave DNN for classification. In [18], the authors provided a review of methods that combine two or more images and pre-process them to delete specific regions of the images, which may cause the DNN to lean towards misclassifications. Experimental results show that this method builds image classifiers that are more robust and have better classification performance, even when dealing with altered images. In [33], the authors estimated the confidence of a DNN in response to unexpected execution contexts.
With respect to the surveyed works, we do not add any additional component in the architecture for the purpose of detecting outliers in the images.

B. OUT-OF-DISTRIBUTION DETECTION
Many studies, such as [34] and [36], have aimed to detect out-of-distribution samples without knowledge of the type of image alteration in the training data. In particular, [34] trained four different DNNs with three different supervisors at various stages of training to detect the point at which supervisors' performance began to decline during training.
This study assumes knowledge of the failure effects at the training time, as discussed in Section II.

C. DATA AUGMENTATION FOR ROBUSTNESS
Another group of studies aimed to achieve DNN robustness through data augmentation strategies [13]. The authors of [15] used data augmentation to improve the generalization capability of DNNs using smoothness regularization against perturbations to improve the classification performance. In another study [16], the authors employed a Pixel Mask to diminish the sensitivity of DNNs to image corruption. Moreover, a previous study [17] proposed a data augmentation pipeline to accelerate MRI reconstruction.
Instead, our study focuses on ICT systems that employ one or more visual cameras. To the best of our knowledge, no study has applied data augmentation to tolerate many failures of the visual camera. To address this gap in the literature, we consider a complete set of failures that may occur due to malfunctions in the visual camera.

D. DATA AUGMENTATION FOR ADVERSARIAL DEFENSE
Many recent studies on data augmentation have set the goal of adversarial defense, that is, defending the encompassing system from images explicitly designed by an attacker to fool a classifier into outputting a wrong class. Briefly, these studies increased the robustness of the target classifier, which was trained using genuine and adversarial images [39].
Even if the approach is conceptually similar to ours, our study does not consider adversarial activity as a potential source of altered images; therefore, these are only marginally related works and are interesting to this study only in terms of the approach they follow.

IV. METHODOLOGY FOR A CLASSIFIER ROBUST TO CAMERA FAILURES
We describe our data augmentation approach with the aid of Fig. 3. Traditionally, an image classifier outputs a set Pα of α probabilities, where each Pi represents the probability of the input image belonging to class i, 0 ≤ i < α. The graphical example in Fig. 3a sketches a 4-class classification problem (α = 4), where the image classifier processes the input and outputs an array Pα of 4 probabilities, the highest of which points to the class the classifier will predict for the animal in the image. Fig. 3b improves the training phase of the DNN, aiming at a more robust model. The training set is augmented with perturbations of the clean images from the training set, providing a broader set of images to the DNN. Perturbations are obtained by injecting the effects of visual camera failures into clean images and saving each corrupted image as a separate item.
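The mapping from the output array Pα to a predicted class can be sketched as follows. This is an illustrative example only: the raw scores are hypothetical, and the helper names `softmax` and `predict` are ours. It assumes the probabilities come from a softmax layer, which is what the DNNs described later in this section use.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities P_0..P_(alpha-1) that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(probabilities):
    """Return the index i with the largest P_i, i.e. the predicted class."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


# Hypothetical raw scores for a 4-class problem (alpha = 4)
logits = [0.2, 2.9, -1.0, 0.5]
P = softmax(logits)
print([round(p, 3) for p in P], "-> predicted class", predict(P))
```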
This approach was implemented as follows. In Section IV-A, we present the three datasets gathered for classification, which contain images of traffic signs. In Section IV-B, we describe how we created altered images from the clean images of the three datasets. The construction of the train and test sets is described in detail in Section IV-C. Section IV-D reports the three DNNs used to perform image classification for traffic sign recognition (TSR). All the code we developed is available at [44]; all the altered images we created are available at [45].

A. DATASETS FOR TRAFFIC SIGN RECOGNITION (TSR)
We selected GTSRB [10], BelgiumTSC [11], and DITS [12], which are public datasets that contain sequences of images of different categories of traffic signs from different countries. The images in the GTSRB dataset have heterogeneous lighting conditions and distance from the camera. The DITS dataset is considered more difficult than others, because it contains images captured under different lighting and environmental conditions. Instead, images of the BelgiumTSC dataset were captured by multiple visual cameras from different viewpoints.
Each dataset has its own categorization of traffic signs based on the color, shape, and content of the traffic sign, which changes depending on the country from which the images were sampled. Overall, we identified 9 overlapping categories of traffic signs, as listed in Table 1. The GTSRB dataset contains images belonging to 8 of the 9 categories in Table 1, missing the blue rectangular traffic sign images (Class 8). The BelgiumTSC dataset has 8 categories of traffic signs, as shown in Table 1, with missing white [...]

B. INJECTION OF VISUAL CAMERA FAILURES
We injected 13 different visual camera failures into the images contained in the three datasets. Each failure had one or more configurations, for a total of 103. We set up a Python script to inject each failure, with the configurations described below.
• Banding. We created 3 different configurations by overlapping the banding images using a blend function.
• Black Pixels. Different configurations, such as a single black line, horizontal and vertical black lines, and sets of [50, 200, 500, 1000] pixels, were randomly selected and turned black in the traffic sign image.
• Blurring. We employed 12 different configurations to produce blurred images with varying degrees of blur using the OpenCV [41] blur function, with settings in the range [25, 35].
• Broken Lens. We simulated the broken-lens effect using 15 superimposed images from [19].
• Condensation (COND). We overlaid three different condensation images onto clean images to simulate the condensation effect.
• Dirt. We used 36 dirt images from [19], which we overlaid onto clean images to simulate the dirt effect.
• Ice. We overlaid different ice images onto clean images to create four different altered images for each clean image.
• No Bayer Filter (NOBF). We simulated this failure by converting each image into grayscale.
• No Chromatic Aberration Correction (NCAC). We created chromatic aberration failures using the code in [26].
• No Noise Reduction (NNR). We used 10 different configurations of speckle noise with different levels of noise.
• Rain. We simulated the rain effect by overlaying 5 different rain images with traffic sign images.
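A few of the injections listed above can be sketched with NumPy. This is a minimal illustration, not the authors' script: the function names, the blend weight `alpha`, the noise standard deviation, and the random stand-in images are all assumptions, and the real overlays come from [19] while blurring relies on OpenCV [41].

```python
import numpy as np


def no_bayer_filter(img):
    """NOBF: convert RGB to grayscale, kept as 3 identical channels."""
    gray = img @ np.array([0.299, 0.587, 0.114])  # standard luma weights
    return np.repeat(gray[..., None], 3, axis=2).round().astype(np.uint8)


def overlay(img, layer, alpha=0.5):
    """Banding, COND, Dirt, Ice, Rain: alpha-blend an overlay image (assumed blend)."""
    mixed = (1.0 - alpha) * img + alpha * layer
    return mixed.round().astype(np.uint8)


def black_pixels(img, n, rng):
    """Black Pixels: set n randomly chosen pixels to black."""
    out = img.copy()
    h, w = out.shape[:2]
    idx = rng.choice(h * w, size=n, replace=False)
    out[idx // w, idx % w] = 0
    return out


def speckle(img, sigma, rng):
    """NNR: multiplicative (speckle) noise, img + img * N(0, sigma)."""
    noise = rng.normal(0.0, sigma, size=img.shape)
    return np.clip(img + img * noise, 0, 255).round().astype(np.uint8)


rng = np.random.default_rng(0)
clean = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)  # stand-in traffic sign
layer = np.full_like(clean, 255)                                 # stand-in overlay image

# Each altered image is stored as a separate item next to the clean one
variants = {
    "NOBF": no_bayer_filter(clean),
    "overlay": overlay(clean, layer, alpha=0.3),
    "black_pixels_50": black_pixels(clean, 50, rng),
    "NNR": speckle(clean, 0.2, rng),
}
for name, image in variants.items():
    print(name, image.shape, image.dtype)
```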

C. BUILDING TRAIN AND TEST SETS FOR CLASSIFIERS
We collected the GTSRB, DITS, and BelgiumTSC datasets from their repositories, and processed each image to inject the 103 failure configurations from Section IV-B. This created 103 variants in addition to the original dataset, for a total of 104 datasets. Each of these 104 datasets was split into training and test sets; the split was identical for all datasets. We refer to train_clean/test_clean as the sets containing the train/test split of the original dataset; both train_clean and test_clean contain only clean images. We call train_altered/test_altered the train/test set that contains train_clean/test_clean plus the train/test splits of all 103 variants of the original dataset.
For the sake of our analysis, we have to understand how each of the six classifiers built for a dataset deals with images corrupted with different configurations of the same visual camera failure. As such, we grouped the 103 variants based on the 13 failures we used to generate them. This allowed the creation of 13 groups of test splits of variants, as follows:
• Failures NOBF, NCAC, and NDEMOS were each used to create a single variant because they had only a single configuration of failure.
• Failures Banding and COND were used to create three variants each.
• Ice, Rain, Black Pixels, Brightness, NNR, Blurring, Broken Lens, and Dirt failures were responsible for generating 4, 5, 6, 8, 9, 11, 15, and 36 variants, respectively.
Each of the 13 groups above contains all the variants generated using one of the 13 visual camera failures. Different groups differ in the number of variants and images they contain, which depends on the number of failure configurations. Noticeably, testing a classifier for each group provides a set of classification metric scores, one for each test split of a variant in the group. For example, testing a classifier with variants in the Ice group will output 4 metric scores, whereas testing a classifier with variants in the Banding group will generate only 3 metric scores because our study considers 3 configurations of the Banding failure.
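The grouping above can be checked and used as in the following sketch. The variant counts are those stated in the text; the accuracy values and the helper name `accuracy_ranges` are hypothetical and only illustrate how one score per variant collapses into a range per failure group.

```python
# Number of variants (failure configurations) per failure group, as stated in the text
VARIANTS_PER_FAILURE = {
    "NOBF": 1, "NCAC": 1, "NDEMOS": 1,
    "Banding": 3, "COND": 3,
    "Ice": 4, "Rain": 5, "Black Pixels": 6, "Brightness": 8,
    "NNR": 9, "Blurring": 11, "Broken Lens": 15, "Dirt": 36,
}

# 13 groups, 103 variants, plus the original dataset = 104 datasets
assert len(VARIANTS_PER_FAILURE) == 13
assert sum(VARIANTS_PER_FAILURE.values()) == 103


def accuracy_ranges(scores):
    """Map {(failure, config_id): accuracy} to {failure: (min, max)}."""
    ranges = {}
    for (failure, _config), acc in scores.items():
        low, high = ranges.get(failure, (acc, acc))
        ranges[failure] = (min(low, acc), max(high, acc))
    return ranges


# Hypothetical accuracies: 4 scores for the Ice group, 1 for the NOBF group
scores = {
    ("Ice", 0): 0.91, ("Ice", 1): 0.97, ("Ice", 2): 0.89, ("Ice", 3): 0.95,
    ("NOBF", 0): 0.99,
}
print(accuracy_ranges(scores))
```

A group with a single variant yields a degenerate range (minimum equal to maximum), which is why such test sets are reported with a single accuracy value.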

D. DNNs FOR TRAFFIC SIGN RECOGNITION
The image classifiers in this study are three DNNs that were pre-trained on the ImageNet dataset: AlexNet (AN, [7]), InceptionV3 (IC3, [8]), and MobileNetV2 (MN2, [9]). These three DNNs have different characteristics: AN is a rather old DNN but is considered a milestone, as in 2012 it achieved relevant results in the ImageNet Large Scale Visual Recognition Challenge [31]. IC3 stems from GoogLeNet and outperforms AN on the ImageNet dataset used for training [31], whereas MN2 is based on an inverted residual structure [9] as opposed to traditional residual models that use expanded input representations.
We adapted these pre-trained image classifiers to our TSR case study using transfer learning with a batch size of 32 and 10 epochs, utilizing the stochastic gradient descent with momentum (sgdm, [20]) optimizer and cross-entropy loss at the softmax layer on each of the GTSRB, DITS, and BelgiumTSC datasets. We retrained each DNN (AN, IC3, and MN2) twice on each of the three datasets as follows:
• regular classifier: We performed transfer learning using the train_clean set of each of the three datasets. We employed classical techniques such as image scaling and translation to limit overfitting.
• augmented classifier: We performed transfer learning using the train_altered set, without image scaling or translation, to compare the robustness of camera failure data augmentation against traditional data augmentation techniques such as scaling and translation.
This process creates a total of 18 classifiers: for each of the 3 datasets, we created a regular and an augmented classifier for the AN, MN2, and IC3 DNNs. We then used the test_clean and test_altered sets to test each classifier and quantify their classification performance.
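The resulting experimental plan can be enumerated as in the sketch below. The `TrainingRun` record and `plan_runs` helper are illustrative names of ours; only the hyperparameters (batch size 32, 10 epochs, sgdm, cross-entropy) and the regular/augmented distinction are taken from the text, and no actual training is performed.

```python
from dataclasses import dataclass
from itertools import product

DATASETS = ("GTSRB", "DITS", "BelgiumTSC")
DNNS = ("AN", "IC3", "MN2")


@dataclass(frozen=True)
class TrainingRun:
    dataset: str
    dnn: str
    kind: str               # "regular" or "augmented"
    train_set: str          # "train_clean" or "train_altered"
    scale_translate: bool   # classical scaling/translation augmentation
    batch_size: int = 32
    epochs: int = 10
    optimizer: str = "sgdm"
    loss: str = "cross-entropy"


def plan_runs():
    """One regular and one augmented classifier per (dataset, DNN) pair."""
    runs = []
    for dataset, dnn in product(DATASETS, DNNS):
        runs.append(TrainingRun(dataset, dnn, "regular", "train_clean", True))
        runs.append(TrainingRun(dataset, dnn, "augmented", "train_altered", False))
    return runs


runs = plan_runs()
print(len(runs), "classifiers")
for run in runs[:2]:
    print(run)
```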

V. EXPERIMENTAL RESULTS
We analyzed the classification performance of regular and augmented classifiers and measured the extent to which the augmented image classifiers were more robust to visual camera failures than regular classifiers. Experiments were performed on an Intel(R) Core (TM) i5-8350U CPU@1.7 GHz 1.9 GHz, using an NVIDIA Quadro RTX 5000 GPU.

A. IMPROVED ROBUSTNESS TO CAMERA FAILURES
We compared the classification performance of regular classifiers against augmented classifiers, as shown in Fig. 4. The figure is composed of two series of box plots. On the left (Fig. 4a), we show the average accuracy of the regular classifiers (AN, MN2, and IC3) on all three datasets against each failure. On the right (Fig. 4b), the accuracy of the augmented classifiers is presented; each figure contains 14 box plots (test_clean and 13 groups of variants from Section IV-C). On the vertical axis, we have accuracy scores: the higher the accuracy, the better the classification performance.
The two plots in Fig. 4 appear very different from each other: the boxes in Fig. 4a span across a wider range of accuracy scores with respect to Fig. 4b. This means that regular classifiers have more variability in their classification performance than augmented classifiers. Most importantly, augmented classifiers have almost maximum accuracy, which is desirable for any classification task.
TABLE 2. Accuracy achieved using regular classifiers on the three datasets. (Accuracy scores report ranges when a failure is injected using multiple configurations.)
TABLE 3. Accuracy achieved using augmented classifiers on the three datasets. (Accuracy scores report ranges when a failure is injected using multiple configurations.)
Two additional trends are evident in Fig. 4. First, the accuracy improvement of the augmented over the regular classifiers is not constant across different test sets (and visual camera failures). The accuracy is almost perfect (∼1) when using augmented classifiers for test_banding, test_blackpixels, test_NDEMOS, and test_rain, whereas regular classifiers struggle. Overall, the boxes in Fig. 4b are narrow and close to the top of the plot. For most of the other test sets, there was still a clear improvement when using the augmented classifiers, whose accuracy was close to 1.0.
Second, the accuracy improved when classifying the test_clean set with no visual camera failures. The last box in Fig. 4a (test_clean) was wider than the corresponding box in Fig. 4b. All augmented classifiers are excellent when processing clean images; instead, the accuracy scores of the regular classifiers have more variance than the augmented classifiers.
We further explored the accuracy scores in Table 2 and Table 3. Both tables report a row for each test set and three groups of columns, one for each dataset. For each dataset, we report the highest accuracy and the DNN that achieved it. Accuracy scores in the tables often report ranges, which occur when a failure is injected using multiple configurations, thus accounting for many dataset variants (see Section IV-C). Instead, test_clean contains no failures, and test_NCAC, test_NOBF, and test_NDEMOS contain a single configuration for a failure; they have a single accuracy value in both Table 2 and Table 3.
The regular classifiers in Table 2 always have inferior accuracy compared with the corresponding augmented classifiers in Table 3. Looking at the GTSRB column in Table 3, we can see that in many cases, even in the presence of visual camera failures, the classification accuracy is perfect (1.0). However, this trend does not apply to the other two datasets, which are harder to classify; neither regular nor augmented classifiers reach 1.0 accuracy on test_clean (first row of Table 3). As such, we cannot expect perfect accuracy for altered images.
Some test sets resulted in low accuracy, regardless of the training dataset and DNN. This is the case for test_brightness, test_NNR, test_ice, and to a lesser extent, test_brokenlens. The range of accuracy of test_ice on the DITS and BelgiumTSC datasets is broad, even when adopting augmented classifiers; for some configurations of the Ice failure, accuracy on DITS drops as low as 0.894 even when using the augmented classifiers. This is because the ice image overlays the image of the traffic sign; in some cases, it covers significant parts of the image, whereas in others it does not. On BelgiumTSC, the Ice failure makes accuracy drop as low as 0.976 for specific configurations, which is a decent accuracy in general terms, but still lower than that achieved in the other test sets, except for test_brightness and test_NNR.
FIGURE 6. Accuracy achieved on test_NNR by regular classifiers (Fig. 6a, on the left) and augmented classifiers (Fig. 6b, on the right) using different DNNs on different datasets with different levels of noise (0 to 5).
Furthermore, Table 2 and Table 3 show which DNN results in the highest accuracy value for each test set using regular and augmented classifiers, respectively (i.e., columns Best DNN in the tables). There was no clear winner for the regular classifiers listed in Table 2. Considering the augmented classifiers in Table 3, the best performing DNN is MN2, which is always the preferred DNN for the DITS dataset and its variants, as well as for the clean (initial) dataset and 10 out of 13 groups of variants of GTSRB (see Table 3). For BelgiumTSC, it appears to be more beneficial to adopt IC3. The regular AN often achieves better classification performance than regular MN2 and IC3; however, the augmented AN classifier shows only marginal improvement in accuracy, whereas augmented MN2 and IC3 outperform regular MN2 and IC3, as well as augmented AN.
Overall, we conclude that the augmented classifiers have significantly better classification performance than the regular classifiers when processing clean and altered images. Their accuracy remains high when dealing with images containing visual camera failures, making them robust to altered images.

B. ON BRIGHTNESS AND NOISE FAILURES
We explore the impact of brightness and noise camera failures that have a major effect on the classification performance of both regular and augmented classifiers.
Brightness Analysis. Fig. 5 plots the accuracy achieved by regular (Fig. 5a) and augmented (Fig. 5b) classifiers when varying brightness levels in the range [0.1; 15], where 0.1 is an image almost entirely black, 1 is the clean image, and 15 is an image that is almost entirely white. Fig. 5a shows that brightness levels in the range [0.3; 1.5] do not have a major impact on the accuracy of regular classifiers, whereas brightness levels lower than 0.3 or greater than 1.5 dramatically reduce accuracy. As shown in Fig. 5b, augmented classifiers are more robust to brightness failures: their accuracy does not decrease significantly with low brightness, and it suffers serious degradation only on DITS and BelgiumTSC at brightness levels greater than 1.5. Even in these cases, the degradation of accuracy is much less evident than that in regular classifiers.
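To make the brightness scale concrete, the sketch below treats the brightness level as a multiplicative gain on 8-bit pixel intensities. This is an assumption on our side (it is consistent with level 1 leaving the image unchanged, but the text does not state the exact transformation), and the function name and sample pixel values are illustrative.

```python
def adjust_brightness(pixels, level):
    """Scale 8-bit intensities by `level` (assumed gain) and clip to [0, 255]."""
    return [min(255, max(0, round(p * level))) for p in pixels]


sample = [10, 60, 128, 200]  # hypothetical pixel intensities
for level in (0.1, 0.3, 1, 1.5, 15):
    print(level, adjust_brightness(sample, level))
```

Under this assumption, a level of 15 saturates every pixel brighter than 17 to pure white, while a level of 0.1 compresses the whole image into the darkest tenth of the intensity range, which matches the "almost entirely white" and "almost entirely black" extremes described above.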
Noise Analysis. Fig. 6 is analogous to Fig. 5 but refers to the NNR failure. The clean image has a noise level of 0; the higher the noise level, the more degraded the image. Fig. 6a shows that regular classifiers struggle to classify noisy images. Specifically, with InceptionV3 (IC3) on GTSRB, accuracy drops to random guessing when the noise level reaches or exceeds 0.6 (in a balanced 8-class classification problem, random guessing yields an accuracy of 1/8 = 12.5%). This is visible in the figure by looking at the solid brown line that immediately falls to the bottom of the plot. Fig. 6b shows that augmented classifiers are, to some extent, robust to NNR; when the noise level is below 1, the accuracy still exceeds 0.90. In general, we observe only a very slight performance degradation: we hypothesize that this is due to the high number of images in the training data and the high resolution of the images.

C. IMPACT OF INDIVIDUAL FAILURES ON TRAINING
We investigated which visual camera failure contributed the most to the improved classification accuracy. For this purpose, we trained a DNN on 13 train variants, each composed of a train_clean set plus images altered with a single visual camera failure. This creates 13 intermediate classifiers which employ a minimal data augmentation process.
We discuss the behavior of MN2 on DITS. We deem it more interesting to explore the DITS dataset instead of GTSRB and BelgiumTSC, because classification on DITS has the lowest accuracy scores; thus, we believe it is the most challenging. MN2 is the DNN that always achieves the highest accuracy on DITS (see Table 3). Table 4 contains a row for each test set. On the columns, we have 15 classifiers: the regular, the 13 obtained using the train variants above, and the augmented. The second column of the table shows the minimum accuracy achieved by the regular MN2 classifier for any configuration of a specific failure, to be used as a reference. The quantities shown in the rest of the table are the differences between the minimum accuracy achieved with the 13 intermediate classifiers and that achieved by the regular classifier. Only positive differences are reported; these are cases in which training with a specific train variant creates a better classifier than the regular one. The highest accuracy gains for each test set (each row in the table) are underlined; they show the train variant that improves classification the most with respect to training on train_clean. Finally, the last column of the table reports the accuracy difference between augmented MN2 and regular MN2.
For all test sets, except test_clean and test_COND, the greatest improvement in accuracy was obtained by adding images altered with the corresponding failure in the training set. Adopting a training set composed of clean images and images with dirt failure increased the accuracy of MN2 from 0.74 to 0.96 (thus a difference of 0.22, see 8th row, 8th column in Table 4) when processing test_dirt. The data augmentation process clearly provides the DNN with information about the effects of a specific visual camera failure, allowing MN2 to learn how to classify even in the presence of that specific failure. Interestingly, the greatest improvement in accuracy for test_clean and test_COND was when adding images altered with broken lens failure to the training set. This shows that a failure may be used to augment the training set and build a classifier that is robust to that specific failure and to others that may be correlated.
Another interesting observation is the accuracy of the clean images (first row of Table 4). The augmented classifier scored 0.993 accuracy with an increase of 0.033 with respect to the regular classifier, which was 0.960. Augmenting the training set with broken lens failure leads to a gain in accuracy of 0.032, whereas training with dirt failure provides an accuracy improvement of 0.030 for test_clean: this shows that specific failures make MN2 learn a model that also has enhanced classification performance with respect to clean images.
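The way a row of Table 4 is derived can be sketched as follows. The helper `positive_gains` is our own illustrative name. The accuracies for test_clean and test_dirt are reconstructed from the values and gains quoted in the text; the "Ice" entry is a hypothetical value added only to show that non-positive differences are dropped.

```python
def positive_gains(regular_min, intermediate_min):
    """Return ({train_variant: gain > 0}, best_variant) for one test set."""
    gains = {name: round(acc - regular_min, 3)
             for name, acc in intermediate_min.items()}
    gains = {name: g for name, g in gains.items() if g > 0}
    best = max(gains, key=gains.get) if gains else None
    return gains, best


# test_clean row: regular MN2 scores 0.960
clean_gains, clean_best = positive_gains(
    0.960,
    {
        "Broken Lens": 0.992,  # quoted gain of 0.032
        "Dirt": 0.990,         # quoted gain of 0.030
        "Ice": 0.955,          # hypothetical: a variant that does not help
    },
)
print(clean_gains, "best:", clean_best)

# test_dirt row: training with dirt images raises accuracy from 0.74 to 0.96
dirt_gains, dirt_best = positive_gains(0.74, {"Dirt": 0.96})
print(dirt_gains, "best:", dirt_best)
```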

D. EXPLAINING PERFORMANCE OF REGULAR AND AUGMENTED CLASSIFIERS ON CLEAN IMAGES
While it is intuitive that the augmented classifiers have better classification accuracy than the regular classifiers on altered images, it is worth exploring why the classification accuracy on clean images improves as well. We rely on the LIME tool for eXplainable AI (XAI) [30] which allows us to examine how a classifier builds its prediction and explains the process behind the outputs of the image classifiers. LIME provides a graphical interface that shows the areas of the input image that have the highest relevance when calculating the output prediction. Fig. 7 shows four images from DITS that were misclassified by the regular MN2 classifier (on the left) but were correctly classified by the augmented MN2 classifier (on the right). Each image in Fig. 7 was processed using LIME, which visualizes a heatmap of the most relevant features used for prediction. Red areas correspond to features that contribute the most to classification, whereas blue areas have negligible to no impact.
The regular MN2 classifier on the left side of Fig. 7 selects relevant features from many areas of the image, while the augmented MN2 classifier selects only a few stronger features. The color scales on the right of each image show the contribution of each feature in the order of 10^-6 to 10^-7 for regular classifiers, whereas the features for the augmented classifier have a larger absolute weight, with a magnitude of 10^-5. These two observations pair well: the augmented classifier selects a few strong features, whereas the regular classifier selects many weak features from different parts of the image. This difference in the models learned from regular and augmented classifiers improves the image classification accuracy; relying on fewer features provides a clear advantage in our experiments.
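The contrast between "a few strong features" and "many weak features" can be made concrete by counting how many feature weights come close to the largest one. The sketch below does not call LIME; it assumes the per-feature weights have already been extracted into a list, and both the weight values and the 0.5 relative threshold are hypothetical, chosen only to match the orders of magnitude quoted above.

```python
def strong_features(weights, rel_threshold=0.5):
    """Indices of features whose absolute weight is at least
    rel_threshold times the largest absolute weight."""
    peak = max(abs(w) for w in weights)
    return [i for i, w in enumerate(weights) if abs(w) >= rel_threshold * peak]


# Hypothetical weights: many comparable values around 1e-7 to 1e-6 for the
# regular classifier, two dominant values around 1e-5 for the augmented one.
regular = [8e-7, 6e-7, 9e-7, 7e-7, 3e-7, 1e-6, 2e-7, 6e-7]
augmented = [1e-5, 2e-7, 9e-6, 1e-7, 3e-7, 1e-7, 2e-7, 1e-7]

print(len(strong_features(regular)), len(strong_features(augmented)))
```

With these illustrative values, most of the regular classifier's features pass the threshold, whereas only two of the augmented classifier's features do, which mirrors the qualitative pattern visible in the heatmaps.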

VI. LIMITATIONS TO VALIDITY
Here, we report possible limitations to the validity and applicability of our study. These are not intended as showstoppers for the conclusions of this study. Instead, they should be interpreted as boundaries or possible future implications that may affect its validity.

A. USAGE OF PUBLIC DATA AND LIBRARIES
The use of public image datasets and of tools that inject visual camera failures enables the reproducibility of our analysis (see also scripts and corrupted data in [44] and [45]). However, the heterogeneity of data sources and potential lack of documentation may limit the understandability of the data. In addition, public datasets are not under our control; therefore, possible actions, such as changing the way the data are generated, are out of consideration. For example, it is not possible to create longer sequences of traffic signs for DITS or to create a time-sequenced version of BelgiumTSC.

B. PARAMETERS OF DEEP IMAGE CLASSIFIERS
Each deep classifier relies on parameters that are applicable to any deep neural network or specific to a DNN model. Finding the optimal values of the parameters is a substantial process that requires sensitivity analyses and is directly linked to the scenario in which the classifier is to be exercised. We tuned these parameters as precisely as we could through grid searches, which ran a classifier with different parameter values and chose the values that maximized the accuracy. This does not guarantee finding the absolute optimum value of a parameter for a given classifier on a given dataset but constitutes a good approximation [46].
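The grid search described above can be sketched generically as follows. This is an illustration rather than the tuning code used in the study: `evaluate` stands for any routine that trains a classifier with the given parameters and returns its accuracy, and the parameter names, candidate values, and the `toy_accuracy` function are hypothetical.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Try every combination of parameter values and keep the most accurate.

    evaluate: function mapping a dict of parameters to an accuracy score.
    grid:     dict mapping each parameter name to its candidate values.
    """
    names = sorted(grid)
    best_params, best_acc = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        acc = evaluate(params)
        if acc > best_acc:
            best_params, best_acc = params, acc
    return best_params, best_acc


def toy_accuracy(params):
    """Hypothetical stand-in for training and testing a classifier."""
    return (0.9
            - abs(params["learning_rate"] - 0.01)
            - abs(params["batch_size"] - 32) / 1000)


grid = {"learning_rate": [0.1, 0.01, 0.001], "batch_size": [16, 32, 64]}
best_params, best_acc = grid_search(toy_accuracy, grid)
print(best_params, best_acc)
```

Because only the listed candidate values are explored, the result is the best point on the grid, which is why the search approximates the optimum without guaranteeing it.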

C. GENERALIZATION BEYOND TRAFFIC SIGN RECOGNITION
External validity is usually concerned with the extent to which the results of a study can be generalized. The idea of injecting the effects of camera failures into images to perform data augmentation goes beyond Traffic Sign Recognition, which is the domain from which our experimental data comes. Therefore, the methodology we used and assessed in this study can be used to build robust image classifiers for other domains in which the data to be provided to the DNNs are captured by visual cameras, such as security cameras, camera trap images, and OCR for car plate recognition.

VII. CONCLUSION AND FUTURE WORKS
This study investigated methods for improving the robustness of image classifiers against visual camera failures, whether due to internal faults or adverse environmental conditions. The image classifiers built using our data augmentation technique tolerated most failures of the visual camera. This data augmentation approach improves not only robustness but also classification accuracy. We showed that images altered using specific failures contribute more than others to improving the accuracy and robustness of classifiers. However, to ensure robustness with respect to the entire failure set, it was necessary to perform data augmentation by enriching the training set with images altered by multiple visual camera failures.
Our future plan is to train a failure detector to exclude images that are corrupted beyond the extent to which our augmented classifier can properly classify them. In other words, we are aiming for an architectural approach in which the images are either discarded by the failure detector or deemed of adequate quality for processing using a robust classifier. Furthermore, we will explore whether our composite approach, which merges a failure detector and data augmentation, is sufficiently general to build safer image classifiers in safety-critical domains other than Traffic Sign Recognition.

ANDREA CECCARELLI is currently an Associate Professor in computer science at the University of Florence, Florence, Italy. His primary research interests include the design, monitoring, and experimental evaluation of dependable and secure systems and systems-of-systems. His scientific activities have produced more than 100 papers that appeared in international conferences, workshops, and journals. He is also a member of the IFIP WG 10.4 on "Dependable Computing and Fault-Tolerance." He has been the PC co-chair of the conferences SRDS and LADC.

TOMMASO ZOPPI is currently a Research Associate at the University of Florence, Italy. He is involved in several European, nationally funded, and industrial projects. His research interests include anomaly detection, security, and safety; he often applies standards to plan, design, develop, and implement appropriate architectures or software in the domain of critical systems. He serves as a member of the program committee at several international conferences.

ANDREA BONDAVALLI (Senior Member, IEEE) is currently a Full Professor in computer science at the University of Florence. His research interests include the design and evaluation of resilient and secure systems and infrastructures. His scientific activities have produced more than 220 papers that appeared in international journals and conferences. He has led various national and European projects and has chaired the program committee of several international conferences. He is a member of the IFIP WG 10.4 on "Dependable Computing and Fault-Tolerance."

Open Access funding provided by 'Università degli Studi di Firenze' within the CRUI CARE Agreement.

VOLUME 11, 2023