A Comprehensive Study on the Robustness of Image Classification and Object Detection in Remote Sensing: Surveying and Benchmarking

Deep neural networks (DNNs) have found widespread applications in interpreting remote sensing (RS) imagery. However, it has been demonstrated in previous works that DNNs are vulnerable to different types of noises, particularly adversarial noises. Surprisingly, there has been a lack of comprehensive studies on the robustness of RS tasks, prompting us to undertake a thorough survey and benchmark on the robustness of image classification and object detection in RS. To the best of our knowledge, this study represents the first comprehensive examination of both natural robustness and adversarial robustness in RS tasks. Specifically, we have curated and made publicly available datasets that contain natural and adversarial noises. These datasets serve as valuable resources for evaluating the robustness of DNNs-based models. To provide a comprehensive assessment of model robustness, we conducted meticulous experiments with numerous different classifiers and detectors, encompassing a wide range of mainstream methods. Through rigorous evaluation, we have uncovered insightful and intriguing findings, which shed light on the relationship between adversarial noise crafting and model training, yield a deeper understanding of the susceptibility and limitations of various models, and provide guidance for the development of more resilient and robust models.


I. INTRODUCTION
The proliferation of remote sensing (RS) technologies has remarkably augmented the volume and fidelity of RS imagery, which is critical and influential for characterizing diverse features of the earth's surface. As a consequence, automated and intelligent processing of satellite and aerial images has become indispensable for earth observation and analysis. RS image (RSI) interpretation, such as image classification and object detection, is of paramount significance and extends to a multitude of applications, encompassing but not limited to environmental monitoring, intelligent transportation, urban planning, and disaster management. In response to the pressing demand for automated analysis and comprehension of optical RSIs, there has been a surge in the development of diverse techniques for aerial detection [1]-[4] over the past few years.
In recent years, algorithms based on deep learning (DL) technologies have emerged as the front-runners on accuracy benchmarks for a range of visual recognition tasks, e.g., image classification [5], [6], object detection [7], [8], semantic segmentation [9], [10], etc., owing to the remarkable feature representation capability of deep neural networks (DNNs). As a natural progression, DNNs have been widely adopted for the processing of optical RS imagery, with particular emphasis on image classification and object detection tasks. Undoubtedly, DNNs-based models [11]-[14] have emerged as a dominant approach, surpassing the performance of previous traditional methods by a significant margin.
However, good fortune brings misfortune in its train. The utilization of DL in intelligent recognition brings forth notable advantages, yet it also introduces substantial security concerns. The black-box nature of DL has been the subject of critique due to its inherent lack of interpretability and transparency. Furthermore, the susceptibility and vulnerability of DL models to adversarial examples have garnered significant attention within the academic community, prompting questions regarding the veracity of these models as reliable predictors. As a result, there are growing concerns that these models may merely be "Clever Hans" predictors, achieving acceptable outcomes via flawed methods, which undermines the credibility and trustworthiness of DNNs-based systems. The temporal progression of publications and citations related to adversarial attacks is shown in Fig. 1. Previous works [15], [16] have demonstrated that DNNs are susceptible and vulnerable to adversarial examples, which involve the addition of carefully crafted imperceptible perturbations to benign images that can lead to erroneous predictions and pose a significant threat to both digital and physical applications [17]-[20] of DL. The research areas that have been threatened by adversarial attacks are detailed and exhibited in Fig. 2. Furthermore, studies [21]-[24] have also shown that DNNs can be easily disturbed by natural noises, indicating that DL systems are not inherently secure and robust.
The phenomena mentioned above underscore the need for delving into the mechanism of adversarial attacks and improving the resilience and reliability of DL systems. In addition, it is beneficial to comprehensively benchmark the robustness of DNNs for better understanding and developing robust DNNs-based models, yet none of the existing surveys and benchmarks provides a comprehensive study on the robustness of image classification and object detection in RS. We summarize the existing surveys and benchmarks as shown in Table I. Specifically, most existing related works [25]-[43] focus on surveying and benchmarking in computer vision (CV). Wei et al. [44] surveyed physical adversarial attacks and defenses in CV and briefly reviewed physical attacks in RS. In [45], the authors discussed the challenges and future trends of AI security for geoscience and RS, but without further study on the robustness of DNNs-based methods in optical RSIs. Work [46] attempts to comprehensively analyze the diversity of adversarial attacks in the context of autonomous aerial imaging and provides a literature review of adversarial attacks on aerial imagery processing, but without further analysis of models' robustness.
adversarial noise and the training of models. These findings contribute to a deeper understanding of the vulnerabilities and limitations inherent in various models and offer guidance for the development of more resilient and robust models in the future.
The main contributions of this article are four-fold:
• Comprehensive survey and benchmark on the robustness of DNNs-based models. To the best of our knowledge, this study represents the first comprehensive survey and benchmark effort to investigate the robustness of image classification and object detection models in the field of RS, specifically in the presence of both natural and adversarial noises (see Table II for the top-5 robust models against different noises).
• Creation of benchmark datasets with various noises. This paper presents publicly available datasets containing a diverse range of 7 natural noises and 9 adversarial noises, specifically designed for image classification and object detection tasks. These datasets have been carefully derived from optical RS imagery, serving as a valuable resource for the research community and facilitating advancements in the field of robustness analysis.
• Rigorous and extensive experiments. To ensure a comprehensive evaluation of the robustness of DNNs-based models, we have conducted a systematic investigation into the performance of 23 image classifiers and 20 object detectors. These models encompass a diverse range of mainstream architectures and have been rigorously evaluated on several large-scale optical RS datasets, as well as their corresponding versions with introduced noises.
• In-depth analyses. Through rigorous and comprehensive experiments, we have derived insightful and intriguing findings that shed light on the potential connection between adversarial perturbation generation and model training. These findings contribute to a deeper understanding of the sensitivity and vulnerability exhibited by various DNNs-based models across different tasks. By uncovering these relationships, our study offers valuable insights into the robustness of DNNs in the face of adversarial attacks and provides a foundation for developing more resilient and robust models in the future.
The remainder of this manuscript is organized as follows: a) we thoroughly survey the robustness of DNNs in CV and RS in Section II; b) the robustness of image classification and object detection in the RS field is comprehensively benchmarked in Section III; c) rigorous and extensive experimental results are provided in Section IV; d) further discussions are presented in Section V; and e) conclusions are drawn in Section VI.

II. SURVEY
The integration of DNNs into safety-critical applications, such as autonomous driving [55], [56], face recognition [57], [58], RS [1], [59], etc., highlights the criticality of enhancing model robustness and developing resilient DL systems. As a result, there is a growing need to comprehensively evaluate the robustness of DL models to better understand the factors affecting their resilience and to facilitate further improvements in DNNs' robustness. In this section, we first introduce the background knowledge of adversarial attacks. Then, the robustness of DNNs-based methods is comprehensively surveyed in CV and RS, respectively.

A. Background Knowledge
The primary objective of DL is to enable models to learn from data in a manner that allows them to perform tasks similar to humans when confronted with new data. Over the last decade, DL has made tremendous strides in numerous significant applications. Although DL has delivered impressive results in practical applications, recent years have revealed a disturbing phenomenon where DL models may make abnormal predictions that are inconsistent with human intuition. For instance, a model could yield significantly different predictions on two visually similar images, with one being perturbed by malicious and imperceptible noises [15], [16], whereas a human's prediction would remain unaffected by such noises. We refer to this phenomenon as the adversarial phenomenon or adversarial attack, signifying the inherent adversarial relationship between DL models and human perception [28].
The discovery of the adversarial phenomenon originated from image classification tasks in the digital realm. As a consequence, the majority of existing research on adversarial attacks has been concentrated on image classification tasks in the digital domain [15], [16], [20], [60]-[64], i.e., the so-called digital attack. In comparison, physical attacks happen in real-world physical scenarios. Consequently, in this section, we provide an overview of adversarial attacks, offering background knowledge on both digital attacks and physical attacks as illustrative examples. Digital attacks involve manipulating image pixel values in the digital domain after capturing an image using an imaging device. On the other hand, physical attacks involve tampering with the target to be disturbed before image capture. Although digital attack methods can easily fool various DL models in the digital domain, the generated digital perturbations lose their effectiveness in the real physical world because they are often imperceptible and cover the entire image, making them invisible to imaging devices. As a result, researchers are increasingly studying adversarial attacks that are applicable in the physical world. Physical attack methods have been proposed and used to attack intelligent systems such as autonomous driving [65]-[68], face recognition [69]-[72], RS [18], [73]-[75], security monitoring [76]-[78], etc.
Typically, digital and physical attacks occur at different stages of an intelligent recognition task, as shown in Fig. 3, which illustrates the difference between digital and physical attacks in the context of RS. It is observed that:
• For physical attacks, the attacker manipulates either the actual targets or the imaging process itself to intentionally induce incorrect predictions;
• For digital attacks, the attacker directly modifies the pixel values of the image data captured by the imaging device to implement the attack.
In addition, adversarial attacks can be classified based on other attack characteristics. Regarding the attacker's access to the victim model's information, adversarial attacks can be categorized into three types: white-box attack, gray-box attack, and black-box attack, as shown in Fig. 4. In white-box attacks, the attacker has full access to the internal information of the model, including its structure, parameters, gradients, and other relevant details. This comprehensive knowledge enables the attacker to craft sophisticated adversarial examples to deceive the model. Gray-box attacks grant the attacker partial access to the internal information of the model. Although not as extensive as in white-box attacks, this limited access still provides valuable insights that can be leveraged for crafting effective adversarial examples. In contrast, black-box attacks present unique challenges as the attacker lacks access to the specific parameters and structural details of the target model. Consequently, alternative techniques must be employed to generate adversarial examples in such scenarios. Transfer-based attacks, where knowledge gained from a substitute model is utilized to craft adversarial examples for the black-box model, are commonly employed. Additionally, gradient estimation methods based on query results can also be utilized to approximate the gradients of the black-box model and guide the generation of effective adversarial examples. These approaches showcase the ingenuity and adaptability of attackers in navigating the constraints imposed by black-box settings. While several white-box attack methods [74], [79]-[81] have been proposed, they typically demand extensive information about the victim model, rendering their practical applicability in real-world attack and defense scenarios quite challenging. As a result, researchers in the field of adversarial machine learning have increasingly directed their attention towards black-box attack methods [20], [71], [82], [83], which are more suitable for real-world adversarial situations where the attacker only has limited knowledge of the target model.
We have categorized adversarial attacks based on their distinct characteristics and the strategies employed, as shown in Fig. 5. Furthermore, we also display different forms of perturbations, as shown in Fig. 6. In this section, we illustrate adversarial attacks from different domains, i.e., digital attacks and physical attacks.
1) Digital attack formulation: Assuming the presence of an image classifier f(x) : x ∈ X → y ∈ Y that generates a prediction y based on an input image x, the primary aim of an adversarial attack is to generate an adversarial example x* that closely resembles the clean example x but causes the image classifier f(x) to make an incorrect prediction y*. From a technical standpoint, adversarial attack methods can be categorized as either non-targeted or targeted, depending on the attacker's motives.
Suppose an input image x is properly classified by a model such that its predicted label is y, i.e., f(x) = y. Non-targeted attack methods are designed to generate adversarial examples x* by adding imperceptible perturbations to clean images x, which mislead the classifier into making an incorrect prediction, i.e., f(x*) ≠ y. Targeted attack methods are designed to manipulate the classifier into predicting a specific label, such that f(x*) = y*, where y* represents the target label specified by the attacker and y* ≠ y. These methods are intended to deceive the classifier into producing a specific output rather than simply causing a misclassification. The L_p norm is typically used as a measure of the visibility of the adversarial noise. In the case of digital attacks, the adversarial noise is required to be imperceptible to human vision, i.e., less than or equal to a certain threshold value ε, expressed as ‖x* − x‖_p ≤ ε, as shown in Fig. 7.
Current adversarial attack methods can be classified into two categories (gradient-based and optimization-based) according to the optimization strategy adopted to generate the adversarial samples. In this article, we present the formulation of non-targeted attack methods; the targeted version can be derived using a similar approach.
Gradient-based methods. Gradient-based methods, such as the fast gradient sign method (FGSM) [16], aim to elaborate adversarial examples x* by maximizing the loss function L(x*, y). FGSM crafts adversarial examples subject to the L∞ norm constraint ‖x* − x‖∞ ≤ ε, mathematically written as:

x* = x + ε · sign(∇_x L(x, y)),

where ∇_x L(x, y) is the gradient of the objective loss L(x, y) w.r.t. the clean image x. An extension of the FGSM algorithm satisfies the L2 norm limitation ‖x* − x‖_2 ≤ ε, mathematically defined as:

x* = x + ε · ∇_x L(x, y) / ‖∇_x L(x, y)‖_2.

The aforementioned gradient-based methods are one-step adversarial attacks. Subsequently, multi-step methods, such as iterative FGSM (I-FGSM) [17], momentum iterative FGSM (MI-FGSM) [19], projected gradient descent (PGD) [100], Nesterov accelerated gradient (NI-FGSM) [101], AutoAttack (AA) [102], etc., iteratively adopt one-step approaches multiple times with a small step size α. The iterative attack method can be expressed as:

x*_0 = x,   x*_{t+1} = x*_t + α · sign(∇_x L(x*_t, y)).

To ensure that the generated adversarial perturbations are imperceptible to human observers, i.e., satisfy the L_p constraint, one can simply clip x*_t into the ε-vicinity of x or simply set α = ε/T with T being the number of iterations.
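For concreteness, the following is a minimal sketch of one-step FGSM and its iterative extension in PyTorch; the model handle, the ε and step-size values, and the assumption that inputs lie in [0, 1] are illustrative choices rather than settings prescribed in this article.

```python
# Minimal sketch of FGSM and an I-FGSM/PGD-style iterative attack,
# assuming a PyTorch classifier `model`, an input batch `x` in [0, 1], and labels `y`.
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8/255):
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # One step along the sign of the gradient, then clip to the valid pixel range.
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0, 1).detach()

def ifgsm(model, x, y, eps=8/255, alpha=2/255, steps=10):
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Project back into the L-infinity ball of radius eps around x.
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv
```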
Optimization-based methods. Optimization-based methods, such as L-BFGS [15], DeepFool [103], C&W [104], etc., directly minimize the distance between the clean and adversarial examples, while ensuring that the adversarial examples are misclassified by the model. This can be mathematically expressed as:

min_{x*} ‖x* − x‖_p − c · L(x*, y),

where L(x*, y) is the objective loss w.r.t. the adversarial example x* and c is a constant balancing the distance term against the misclassification objective. Optimization-based methods directly optimize the distance between an adversarial example and the corresponding benign example; thus, the optimization of the L_p norm does not guarantee that the norm will be less than or equal to a particular threshold value.
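A minimal sketch of this optimization-based formulation in PyTorch is given below; the Adam optimizer, the trade-off constant c, and the step count are illustrative assumptions rather than the exact recipe of L-BFGS, DeepFool, or C&W.

```python
# Minimal sketch of an optimization-based attack in the spirit of L-BFGS/C&W,
# assuming a PyTorch classifier `model`, an input batch `x` in [0, 1], and labels `y`.
import torch
import torch.nn.functional as F

def optimization_attack(model, x, y, c=1.0, steps=200, lr=0.01):
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        x_adv = (x + delta).clamp(0, 1)
        # Minimize the perturbation size while pushing the prediction away from y.
        dist = delta.flatten(1).norm(p=2, dim=1).sum()
        loss = dist - c * F.cross_entropy(model(x_adv), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (x + delta).clamp(0, 1).detach()
```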
To summarize, as shown in Fig. 8, gradient-based adversarial attack methods aim to generate adversarial perturbations that are farthest from the decision boundary within the specified perturbation range. On the other hand, optimization-based methods aim to minimize the size of the adversarial perturbation, i.e., the distance between the adversarial and clean samples, while still achieving the adversarial effect. As a consequence, the adversarial perturbations generated by gradient-based adversarial attack methods are more effective in producing misclassifications, while the perturbations generated by optimization-based methods are more visually imperceptible.

Fig. 6: Adversarial perturbations in different forms: (a) pixel [16], (b) watermark [84], (c) trigger [85], (d) patch [86], (e) viewpoint [87], (f) style [88], (g) erosion [89], (h) sticker [72], (i) light [90], (j) laser [91], (k) color [92], (l) zoom [93], (m) texture [94], (n) 3D object [95], (o) projection [96], (p) makeup [97], (q) PS [98], (r) location [99]. Note that PS represents Pan-Sharpening.

2) Physical attack formulation: As the study of adversarial attack problems has progressed, researchers have found that generating adversarial examples in the digital domain presents considerable difficulties in launching successful attacks in the physical domain. Kurakin et al. [17] first discover that DNNs are also susceptible and vulnerable to adversarial attacks performed in real-world physical scenarios, i.e., physical attacks. Notably, physical attacks carried out in real-world settings are significantly more dangerous than digital ones. Consequently, the practical feasibility of adversarial attack methods in physical contexts has emerged as a crucial area of research in the domain of machine learning security. However, physical attacks still face some challenges when compared to digital attacks, such as:
• Physical attack methods should be able to withstand the impact of the imaging process, which mainly includes optical lenses, image sensors, processors, etc.;
• Adversarial perturbations created using physical attack methods should be robust enough to handle the impact of dynamic environments when they face transformations across different domains, as shown in Fig. 9;
• Adversarial perturbations for physical attacks should be as concealed as possible to avoid attention-grabbing anomalies. Digital attacks typically involve pixel-level modifications to images, which are difficult for human eyes to notice, while it is challenging to make physical attacks unobtrusive.
Consequently, numerous studies have aimed at assessing the physical adversarial robustness of DNNs in response to the concerns mentioned above during the past few years. Physical attacks are executed in practical settings that encompass a diverse range of tasks conducted in physical scenarios. Prior to executing a physical attack, it is imperative to fabricate the adversarial example properly. Attackers frequently prioritize the practicality of a given approach within a real-world setting, taking into account factors such as environmental interference, ease of manufacture, and avoidance of visual detection by human observers. In this paper, we formulate physical attacks in patch form due to the widespread popularity of adversarial patches as an approach for implementing physical attacks in real-world scenarios. We exhibit different forms of adversarial patches in Fig. 10.

Fig. 10: Adversarial patches in different forms: (a) normal [65], (b) background [74], (c) infrared [105], (d) clothes [94], (e) eyeglass [70], (f) mask [67], (g) 3D [106], (h) semantic [107].
In the context of digital adversarial attacks, global perturbations engendered throughout an entire image present substantial impediments to the practical execution of such assaults within real-world environments. In contrast, adversarial patches, which solely manipulate localized pixel regions, offer a more viable alternative. These patches can be conveniently produced via printing methods and directly adhered to the designated targets. A mask is commonly utilized to regulate the geometry of the disrupted area. Upon completing the optimization process for the adversarial patch within the digital domain, the tailored patch is subsequently crafted and strategically situated on the object's exterior surface or background area, as shown in Fig. 11. Mathematically, the adversarial example with adversarial patches can be formulated as:

x* = (1 − M_{p*}) ⊙ x + M_{p*} ⊙ p*,

where ⊙ and p* denote the Hadamard product and the adversarial patch, respectively. The mask matrix M_{p*} is used to constrain the size, shape, and location of the adversarial patch, where the value of the patch position area is 1, and 1 is a unit matrix with the same size as M_{p*}.
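As an illustration, the masking operation above can be written in a few lines; the tensor shapes and names are assumptions for the sketch and not part of the original formulation.

```python
# Minimal sketch of applying an adversarial patch with a binary mask, following
# x* = (1 - M) ⊙ x + M ⊙ p*. Tensor shapes are illustrative assumptions.
import torch

def apply_patch(x: torch.Tensor, patch: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """x: (B, C, H, W) images in [0, 1]; patch: (C, H, W); mask: (1, H, W), 1 inside the patch area."""
    return (1 - mask) * x + mask * patch
```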
To address the challenge of capturing value discrepancies between neighboring pixels with image acquisition devices, the total variation (TV) loss, as delineated in [70], is usually incorporated into the objective function of physical attacks. The inclusion of L_tv serves to ensure that the optimization process favors adversarial patches characterized by smooth patterns and gradual color transitions, as shown in Fig. 12. TV can be mathematically defined as:

L_tv = Σ_{i,j} √((p_{i,j} − p_{i+1,j})² + (p_{i,j} − p_{i,j+1})²),

where p_{i,j} denotes the pixel value situated at the i-th row and j-th column within the adversarial perturbations.
Owing to the color alterations that occur when transitioning the adversarial patch from the digital domain to the physical domain, the non-printability score (NPS) outlined in [70] is frequently employed to evaluate the fidelity with which the colors in the adversarial patch can be reproduced in the physical world. This metric serves as an indicator of the distance between the digital representation of the adversarial patch and its physical manifestation when produced using a standard printer. L_nps is written as:

L_nps = Σ_{p_{i,j} ∈ p*} min_{c_print ∈ C} |p_{i,j} − c_print|,

where c_print represents an individual color within the set of physically printable colors, denoted as C. By incorporating L_nps as part of the loss, the pixel values of the generated adversarial patch are biased towards printable colors from the set C, thereby promoting the reproducibility of the patch in the physical domain. Last but not least, a camouflage loss L_cam can be added to improve the invisibility of the adversarial patches to human visual perception. From an academic standpoint, the rationale for employing camouflage loss stems from the observation that carefully crafted adversarial patches often exhibit vibrant hues and unconventional patterns. By incorporating camouflage loss into the optimization process, it becomes possible to generate adversarial patches that seamlessly blend with natural surroundings, as shown in Fig. 13, ensuring that the resultant perturbations remain inconspicuous while retaining their effectiveness in adversarial settings. Technically, the L_p norm is often used as the camouflage metric to measure the distance between adversarial patches and natural images.
In summary, the total objective function of physical attacks in patch form can be derived from the combination of the aforementioned parts and an adversarial loss L_adv (similar to digital attacks). The total loss is depicted as:

L_total = L_adv + α · L_tv + β · L_nps + γ · L_cam,

where α, β, and γ are adopted to scale different components of the total loss.
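The following sketch assembles these loss terms in PyTorch; the helper names, the camouflage term implemented as an L2 distance to a reference natural image, and the weight values are assumptions for illustration, not the exact losses used by the cited works.

```python
# Minimal sketch of the patch-attack objective L_total = L_adv + α·L_tv + β·L_nps + γ·L_cam,
# assuming PyTorch tensors. `adv_loss`, `printable_colors`, and `natural_image` are
# illustrative placeholders, not names from the paper.
import torch

def tv_loss(patch: torch.Tensor) -> torch.Tensor:
    # Penalize differences between neighboring pixels to favor smooth patterns.
    dh = (patch[:, 1:, :] - patch[:, :-1, :]) ** 2
    dw = (patch[:, :, 1:] - patch[:, :, :-1]) ** 2
    return (dh[:, :, :-1] + dw[:, :-1, :]).clamp(min=1e-8).sqrt().sum()

def nps_loss(patch: torch.Tensor, printable_colors: torch.Tensor) -> torch.Tensor:
    # Distance from each patch pixel to its nearest physically printable color.
    pix = patch.permute(1, 2, 0).reshape(-1, 1, 3)                 # (H*W, 1, 3)
    dist = (pix - printable_colors.view(1, -1, 3)).abs().sum(-1)   # (H*W, n_colors)
    return dist.min(dim=1).values.sum()

def total_loss(adv_loss, patch, printable_colors, natural_image,
               alpha=1.0, beta=0.1, gamma=0.1):
    cam = (patch - natural_image).norm(p=2)  # camouflage term: stay close to a natural image
    return adv_loss + alpha * tv_loss(patch) + beta * nps_loss(patch, printable_colors) + gamma * cam
```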

B. Survey Robustness in CV
In the following subsections, we provide a comprehensive examination of adversarial attacks as they pertain to the domain of CV, encompassing a variety of tasks including, but not limited to, image classification and object detection. By conducting an in-depth review of the pertinent literature, we aim to elucidate the underlying principles, methodologies, and implications of these attacks, thereby contributing to a more robust understanding of their role and significance within the broader context of CV research.
1) Image Classification: In the present section, we provide an overview of adversarial attacks in image classification, with a particular emphasis on both digital and physical attack methodologies. The majority of research on adversarial attacks has focused on the digital domain, as the attacks were initially discovered in this context.
Digital attack. White-box attacks: Szegedy et al. [15] first reveal that DNNs establish input-output associations characterized by a considerable degree of discontinuity. More precisely, their findings indicate that the application of a subtle and imperceptible perturbation, identified by maximizing the network's prediction error, can effectively induce DNNs' misclassification. FGSM [16] was the first gradient-based adversarial attack method, in which only one step was moved from the benign image x following the sign of the gradient with step size ε to obtain the adversarial image x*. In [103], the proposed DeepFool algorithm effectively generates perturbations that deceive DNNs and initially evaluates the resilience of state-of-the-art (SOTA) deep classifiers to adversarial perturbations on large-scale datasets. Papernot et al. [115] present a formalization of the adversarial space w.r.t. DNNs and introduce a novel set of algorithms that generate adversarial examples through a comprehensive comprehension of the input-output mapping of DNNs. [116] achieves targeted deception of high-performance image classifiers through the development of two innovative attack techniques. The first technique (Universal Perturbations for Steering to Exact Targets, UPSET) generates universal perturbations for specific target classes, while the second technique (Antagonistic Network for Generating Rogue Images, ANGRI) generates perturbations that are specific to individual images. The authors of [118] demonstrate the existence of universal (image-agnostic) and invisible adversarial noise, which reveals important geometric correlations among the high-dimensional decision boundary of classifiers. Moreover, the universal adversarial noises can generalize well across DNNs. In [120], the researchers explore the case where the noise is allowed to be visible but confined to a small, localized patch of the image, without covering any of the main object(s) in the image, named Localized and Visible Adversarial Noise. In [136], an approach is introduced to avoid perturbations that are easily spotted by human eyes.
Black-box attacks: To avoid the demands for knowledge of either the model internals or its training data, the authors in [117] introduce the first practical demonstration of an attacker controlling a remotely hosted DNN with no such knowledge by observing labels given by the DNN to chosen inputs. In work [119], Zeroth Order Optimization (ZOO) is devised to directly estimate the gradients of the proxy model for crafting adversarial examples. Specifically, they employ a combination of zeroth-order stochastic coordinate descent, dimension reduction, hierarchical attack, and importance sampling techniques to effectively fool black-box models. By introducing three novel attack algorithms that can successfully penetrate both distilled and undistilled neural networks, Carlini et al. [104] establish that defensive distillation does not notably enhance the resilience of neural networks. To strengthen black-box attack efficacy, Dong et al. [19] propose a momentum iterative FGSM (MI-FGSM) by integrating the momentum term into the iterative process of noise optimization, which can stabilize update directions and escape from poor local maxima during the optimization, resulting in more transferable adversarial examples. Ilyas et al. [121] establish three practical threat models that more precisely reflect the nature of many real-world classifiers: the query-limited model, the partial-information model, and the label-only model.
Furthermore, they propose novel attack strategies that can deceive classifiers under these more restrictive threat models. In the paper [159], the authors propose a novel and data-free method for generating universal adversarial perturbations that can be applied across multiple vision tasks. Technically, their approach involves corrupting the extracted features at multiple layers to achieve fooling, which makes the objective generalizable and applicable to image-agnostic perturbations for various vision tasks, including object recognition, semantic segmentation, and depth estimation. Work [122] proposes a framework that integrates and unifies a substantial portion of the existing research on black-box attacks and shows how to enhance the performance of black-box attacks by introducing gradient priors as a new factor in the problem. Su et al. [123] analyze an attack in an extremely limited scenario where only one pixel can be modified. To achieve this, they propose a novel method for generating one-pixel adversarial perturbations based on differential evolution (DE). Moreover, this method requires minimal adversarial information, making it a black-box attack, and is capable of fooling a wider range of networks due to the inherent characteristics of DE. Article [124] introduces a black-box adversarial attack algorithm that can successfully bypass both standard DNNs and those generated by various recently developed defense techniques, in which adversarial examples are drawn from the probability density distribution over a small region centered around the inputs. In [125], a meta attack is devised to attack a targeted model with few queries. [126] strengthens query efficiency by leveraging the advantages of both transfer-based and score-based approaches and addressing a discretized problem through the utilization of a simple yet highly efficient microbial genetic algorithm (MGA). The authors in [127] present the HopSkipJumpAttack family of algorithms, which rely on a novel estimate of the gradient direction obtained through binary information at the decision boundary. SurFree is presented in [130] to decrease the number of queries by focusing on targeted trials along varied directions, guided by precise indications of the geometric properties of the decision boundaries. The objective of study [61] is to train a generalizable surrogate model, termed "Simulator," capable of emulating the behavior of an unknown target model. To mitigate the query cost, the authors of [132] suggest using feedback information obtained from past attacks, i.e., example-level adversarial transferability. By considering each attack on a benign example as an individual task, they construct a meta-learning framework that involves training a meta-generator to produce perturbations based on specific benign examples. The authors of research [20] introduce a novel framework for conducting query-efficient black-box adversarial attacks by integrating transfer-based and decision-based approaches. They also elucidate the correlation between the present noise and sampling variance, the compression monotonicity of noise, and the impact of transition functions on decision-based attacks. Guo et al. [134] introduce an intermediate-level attack, which establishes a direct linear mapping from the intermediate-level discrepancies, i.e., between adversarial features and benign features, to the prediction loss of the adversarial example.
To strengthen attacks' transferability against black-box defenses, the authors of [89] propose a novel transferable attack capable of defeating various black-box defenses and shed light on their security limitations.
Physical attack. In [17], the authors first demonstrate that machine learning systems are vulnerable to adversarial examples even in physical world scenarios and propose a basic iterative method (BIM). Brown et al. [86] propose a method for generating universal, robust, targeted adversarial perturbations in patch form that can be deployed in the real world. These adversarial patches can be printed, attached, photographed, and then presented to image classifiers for successful attacks. Subsequently, adversarial patches have been broadly adopted in various physical attacks [65], [71]-[74], [106], [160]. To better understand adversarial examples in the physical world, Eykholt et al. [161] propose a general physical attack method, Robust Physical Perturbations (RP2), to elaborate robust visual adversarial perturbations under dynamic physical conditions. [162] provides evidence for the existence of robust 3D adversarial objects and introduces the first algorithm, Expectation Over Transformation (EOT), capable of synthesizing examples that remain adversarial across a chosen distribution of transformations. [163] focuses specifically on the subset of adversarial examples that correspond to meaningful changes in 3D physical properties, such as rotation, translation, illumination conditions, etc. To alleviate unrealistic distortions of adversarial patterns, Duan et al. [164] introduce a novel technique called Adversarial Camouflage (AdvCam), which involves crafting and camouflaging physical-world adversarial examples in natural styles that appear legitimate to human observers. In [165], Feng et al. propose Meta-Attack by formulating physical attacks as a few-shot learning problem to improve the optimization efficiency of physical dynamic simulations. The authors of [90] propose an optical adversarial attack, which uses structured illumination to alter the appearance of the target objects to deceive image classifiers without physically touching the targeted objects, e.g., moving or painting the targets of interest. Duan et al. [91] demonstrate that DNNs can be easily deceived using only a laser beam. Research [166] uncovers the presence of an intriguing category of spatially constrained, physically feasible adversarial examples, i.e., Universal NaTuralistic adversarial paTches (TnTs). TnTs are crafted by examining the full range of spatially bounded adversarial examples and the natural input space within generative adversarial networks (GANs).
2) Object Detection: In this section, we offer a comprehensive examination of adversarial attacks pertaining to object detection, focusing specifically on digital and physical attack strategies. Given the practicality of adversarial attacks in object detection tasks, much of the current research focuses on physical attacks.
Digital attack. White-box attacks: In [172], the authors extend the concept of adversarial examples to the domains of semantic segmentation and object detection, which are notably more challenging tasks. Specifically, they introduce a novel algorithm called Dense Adversary Generation (DAG), which optimizes a loss function over a set of pixels or proposals to generate adversarial perturbations. To reduce the number of perturbed pixels, [173] presents a new technique known as the Diffused Patch Attack (DPAttack), which leverages diffused patches in the form of asteroid-shaped or grid-shaped patterns to deceive object detectors. This attack only modifies a small number of pixels in the image. Research [174] introduces a novel approach called Contextual Adversarial Perturbation (CAP), which targets contextual information of objects in order to degrade the recognition accuracy of object detectors. Zhang et al. [175] introduce a novel Half-Neighbor Masked Projected Gradient Descent (HNM-PGD) approach, capable of generating potent perturbations to deceive various detectors while adhering to stringent limitations. [176] presents a new and distinctive patch configuration comprised of four intersecting lines. The proposed patch shape is shown to be a powerful tool for influencing deep convolutional feature extraction with limited pixel availability. To ensure the stability of the ensemble attack, Huang et al. [177] present a gradient balancing technique that prevents any single detector from being over-optimized during the training process. Furthermore, they propose a novel patch selection and refining mechanism that identifies the most crucial pixels for the attack, while gradually eliminating irrelevant perturbations.
Black-box attacks: Liu et al. [178] introduce DPATCH, a black-box adversarial-patch-based attack designed to target popular object detectors, such as Faster R-CNN [7] and YOLO [8], [179], [180]. In contrast to the original adversarial patch, which only manipulates the image-level classifier, the DPATCH simultaneously targets both the bounding box regression and object classification of the object detector in order to disable their predictions. [181] introduces Efficient Warm Restart Adversarial Attack for Object Detection, which comprises three modules: Efficient Warm Restart Adversarial Attack, which selects the most appropriate top-k pixels for the attack; Connecting Top-k pixels with Lines, which outlines the strategy for connecting two top-k pixels to minimize the number of changed pixels and reduce the number of patches; Adaptive Black Box Optimization, which leverages white box models to improve the performance of the black box adversarial attack. To fool context-aware detectors, Cai et al. [182] introduce the pioneering method for producing context-consistent adversarial attacks that can elude the context-consistency check of black-box object detectors working on intricate and natural scenes.
Physical attack. Lu et al. [184] present a construction that effectively deceives two commonly used detectors, Faster R-CNN [7] and YOLO 9000 [180], in the physical world. [185] extends physical attacks to object detection by implementing a Disappearance Attack, which causes a stop sign to "disappear" either by covering the sign with an adversarial poster or by adding adversarial stickers onto the sign. The work [186] introduces ShapeShifter and demonstrates that the EOT approach, initially proposed to improve the resilience of adversarial perturbations in image classification, can be effectively adapted to the object detection domain. [65] proposes a method for generating adversarial patches that can effectively conceal individuals from person detectors. This method is particularly designed for targets with a high degree of intra-class variety, such as persons. In [187], the authors present an intriguing experimental investigation of physical adversarial attacks on object detectors in real-world scenarios. Specifically, they explore the efficacy of learning a camouflage pattern to obscure vehicles from being detected by SOTA detectors based on DNNs. To generate visually natural patches with strong attacking ability, Liu et al. [169] present a novel Perceptual-Sensitive Generative Adversarial Network (PS-GAN) that can simultaneously enhance the visual authenticity and the attacking potential of the adversarial patch. Wang et al. [77] take the first attempt to implement robust physical-world attacks against person re-identification systems based on DNNs. They propose advPattern to generate adversarial patches on clothes, which can hide people from being detected. In [188], the authors study physical attacks against object detectors in the wild. They propose the Universal Physical Camouflage Attack (UPC), which involves learning an adversarial pattern capable of effectively attacking all instances of a given object category. Wu et al. [76] present a systematic study of the transferability of adversarial attacks on SOTA object detection frameworks. To avoid direct access to targets of interest, [189] presents a novel contactless and translucent patch containing a carefully crafted pattern, which is placed over the lens of the camera to deceive SOTA object detectors. Zhu et al. [190] first demonstrate the feasibility of using two types of patches to launch an attack on YOLOv3-based infrared pedestrian detectors. Following the previous work [190], [105] proposes the infrared adversarial clothing by simulating the process from cloth to clothing in the digital world and then designing the adversarial "QR code" pattern. [191] introduces a novel approach called Adversarial Texture (AdvTexture) for conducting multi-angle attacks against person detectors. AdvTexture enables the coverage of clothes with arbitrary shapes, rendering individuals wearing such clothes invisible to person detectors from various viewing angles. In [192], the authors introduce the Differentiable Transformation Attack (DTA), which enables the creation of patterns that can effectively hide the object from detection, while also taking into account the impact of various transformations that the object may undergo. Wang et al. [193] introduce a novel training pipeline called TransPatch to optimize the training efficiency of adversarial patches. To avoid generating conspicuous and attention-grabbing patterns, [160] proposes to create physical adversarial patches by leveraging the image manifold of a pre-trained GAN.
Inspired by the viewpoint that attention is indicative of the underlying recognition process, [66] proposes the Dual Attention Suppression (DAS) attack to craft visually-natural physical adversarial camouflages. The DAS achieves strong transferability by suppressing both model and human attention, thereby enhancing the efficacy of the attack. In [194], the researchers propose a novel targeted and universal attack against the SOTA object detector using a label-switching technique. The attack aims to fool the object detector into misclassifying a specific target object as another object category chosen by the attacker. Mathov et al. [170] introduce a novel framework that leverages 3D modeling to generate adversarial patches for a pre-existing real-world scene. By employing a 3D digital approximation of the scene, their methodology effectively simulates the real-world environment. To bridge the divide between digital and physical attacks, Wang et al. [106] utilize the entire 3D surface of a vehicle to propose a resilient Full-coverage Camouflage Attack (FCA) that effectively deceives detectors. A universal background adversarial attack method [195] is devised to fool DNNs-based object detectors. The proposed method involves placing target objects onto a universal background image and manipulating the local pixel data surrounding the target objects in a way that renders them unrecognizable by object detectors. The focus of the study [196] is on the lane detection system, a crucial component in numerous autonomous driving applications, such as navigation and lane switching. The researchers design and realize the first physical backdoor attacks on such systems. Zhang et al. [197] propose a novel approach for producing physically feasible adversarial camouflage to achieve transferable attacks on detection models. Study [198] explores a new category of optical adversarial examples, generated by a commonly occurring natural phenomenon, shadows. They aim to employ these shadow-based perturbations to achieve naturalistic and inconspicuous physical-world adversarial attacks in black-box settings. A systematic pipeline is introduced in [199] to produce resilient physical adversarial examples that can effectively deceive real-world object detectors. Zhu et al. [200] present TPatch, a physical adversarial patch that is triggered by acoustic signals. TPatch differs from other adversarial patches in that it remains benign under ordinary circumstances but can be activated to initiate hiding, altering, or creating attacks via a deliberate distortion introduced through signal injection attacks directed at cameras. To improve the optimization stability and efficiency, the study [107] presents a fresh and lightweight framework that generates naturalistic adversarial patches systematically, without relying on GANs. In paper [95], the authors conduct the first investigation towards adversarial attacks that are directed at X-ray prohibited item detection and demonstrate the grave hazards posed by such attacks in this context of paramount safety significance. Finally, we summarize physical attacks against object detection ([65], [76], [94]-[96], [105]-[107], [160], [184]-[190], [192]-[194], [197], [201]-[204]) in Table VI.
3) Face Recognition: In this section, we undertake a thorough assessment of adversarial attacks in the context of face recognition. The practicality of adversarial attacks in face recognition tasks has resulted in a significant focus on physical attacks in current research on this topic.
Zhu et al. [97] introduce a novel method to elaborate adversarial examples for attacking well-trained face recognition models. Their approach involves applying makeup effects to facial images through two GANs-based sub-networks: the Makeup Transfer Sub-network and the Adversarial Attack Sub-network. [205] aims to investigate the robustness of current face recognition models in the decision-based black-box attack scenario. Sharif et al. [70] concentrate on the attack of facial biometric systems, which are extensively used for surveillance and access control. They introduce a new attack method that is both physically realizable and inconspicuous, enabling an attacker to circumvent identification or impersonate another individual. The authors of [206] investigate the possibility of performing real-time physical attacks on face recognition systems through the use of adversarial light projections. In study [67], the researchers conduct a comprehensive evaluation of the robustness of face recognition models against adversarial attacks using patches in the black-box setting. In contrast to previous methods that rely on designing perturbations, Wei et al. [72] achieve physical attacks by manipulating the position and rotation angle of stickers pasted onto faces. Paper [71] addresses the importance of position and perturbation in adversarial attacks by proposing a novel method that optimizes both factors simultaneously. By doing so, they achieve a high attack success rate in the black-box setting. To comprehensively evaluate physical attacks against face recognition systems, [207] introduces a framework that employs 3D-face modeling to simulate complex transformations of faces in the physical world, thus creating a digital counterpart of physical faces. This generic framework enables users to control various face variations and physical conditions, making it possible to conduct reproducible evaluations comprehensively. In study [208], the authors investigate the adversarial robustness of face recognition systems against sticker-based physical attacks, aiming to gain a better understanding of the system's vulnerabilities. To increase the imperceptibility of attacks, Lin et al. [209] propose a physical adversarial attack using full-face makeup, as its presence on the human face is a common occurrence. Singh et al. [210] present a new smoothness loss and a patch-noise combo for the physical attack against face recognition systems. [211] aims to devise a more dependable technique that can holistically assess the adversarial resilience of commercial face recognition systems from end to end. To achieve this goal, they propose the design of Adversarial Textured 3D Meshes (AT3D) with the intricate topology on a human face. The AT3D can be 3D-printed and then worn by the attacker to evade the facial recognition defenses.
Finally, we summarize adversarial attacks against face recognition ([67], [70]-[72], [97], [205]-[212]) in Table VII.
4) Others: To investigate how adversarial examples affect deep product quantization networks (DPQNs), [81] proposes to perturb the probability distribution of centroid assignments for a clean query to attack DPQNs-based retrieval systems. [129] introduces the Attack on Attention (AoA) technique, which exploits the semantic property shared by DNNs. AoA demonstrates a marked increase in transferability when attention loss is employed in place of the traditional cross-entropy loss. Since AoA only modifies the loss function, it can be readily combined with other transferability-enhancing methods to achieve SOTA performance. Work [133] introduces a clean-label approach for the poisoning availability attack, which reveals the intrinsic imperfection of classifiers. Paper [214] highlights how the global reasoning of (scaled) dot-product attention can represent a significant vulnerability when faced with adversarial patch attacks. The study [215] puts forth a novel interactive visual aid, DetectorDetective, which seeks to enhance users' comprehension of a model's behavior during the traversal of adversarial images through an object detector. The primary goal of DetectorDetective is to provide users with a deeper understanding of how object detectors respond to adversarial attacks. Work [170] represents an initial stride towards implementing physically viable adversarial attacks on visual tracking systems in real-life scenarios. Specifically, the authors accomplish this by developing a universal patch that serves to camouflage single-object trackers. To attack depth estimation, Cheng et al. [68] employ an optimization-based technique for systematically creating stealthy physical-object-oriented adversarial patches. Research [216] assesses the effects of the chosen transformations on the efficacy of physical adversarial attacks. Moreover, they measure attack performance under various scenarios, including multiple distances and angles. Finally, we summarize other adversarial attacks ([66], [68], [77], [78], [80], [81], [83], [88], [99], [154], [155], [158], [159], [161], [163], [166], [172], [196], [198]-[200], [214], [217]-[233]) in Table VIII.
C. Survey Robustness in RS
In this subsection, we review adversarial attacks in the field of RS, aiming to provide a systematic and exhaustive analysis of the current literature, thereby fostering a deeper understanding of the principles, techniques, and ramifications of adversarial attacks in the context of RS research.
1) Image Classification: The majority of attacks against RS imagery classifiers stem from the field of CV, thus most of the existing research focuses on digital attacks. Czaja et al. [234] first consider attacks against machine learning algorithms used in RS applications. Specifically, they present a new study of adversarial examples in satellite image classification problems. In [235], the authors investigate the properties of adversarial examples in RSI scene classification. To this end, they create several scenarios by employing two popular attack algorithms, i.e., FGSM and BIM, on various RSI benchmark datasets to fool DNNs. The authors of [236] perform a systematic analysis of the potential threat posed by adversarial examples to DNNs used for RS scene classification. They conduct both targeted and untargeted attacks to generate subtle adversarial perturbations that are imperceptible to human observers but can easily deceive DNNs-based models. Paper [237] proposes a UNet-based [10] GAN to enhance the optimization efficiency and attack efficacy of the generated adversarial examples for Synthetic Aperture Radar Automatic Target Recognition (SAR-ATR) models. [238] aims to provide a thorough evaluation of the effects of adversarial examples on RSI classification. Technically, eight of the most advanced classification DNNs are tested on six RSI benchmarks. These datasets consist of both optical and synthetic-aperture radar (SAR) images with varying spectral and spatial resolutions. The study [239] introduces a novel approach for generating adversarial examples to fool RSI classifiers in black-box conditions by utilizing a variant of the Wasserstein generative adversarial network. To enhance the success rate of adversarial attacks against scene classification, Jiang et al. [240] [245]. The SVA consists of two major modules: an iterative gradient-based perturbation generator and a target region extractor. [246] proposes a novel method to explore the basic characteristics of universal adversarial perturbations (UAPs) of RSIs. The method involves combining an encoder-decoder network with an attention mechanism to generate UAPs of RSIs. Qin et al. [247] present a novel universal adversarial attack method for CNN-SAR image classification. The proposed approach aims to differentiate the target distribution by utilizing a feature dictionary model, without any prior knowledge of the classifier. Finally, we summarize adversarial attacks against image classification in RS ([195], [234]-[248]) in Table IX.
2) Object Detection: Similarly, the adversarial attack methods are divided into digital attacks and physical attacks according to the attacked domain.
Digital attack. The authors of [249] first investigate the use of patch-based adversarial attacks in the context of unmanned aerial surveillance. Specifically, they explore the application of these attacks on large military assets by laying a patch on top of them, which camouflages them from automatic detectors analyzing the imagery. [250] introduces a novel adversarial attack method called Patch-Noobj, which is designed to address the problem of large-scale variation in aircraft in RS imagery. Patch-Noobj is a universal adversarial method that can be used to attack aircraft of different sizes by adaptively scaling the width and height of the patch according to the size of the target aircraft. Du et al. [237] investigate the susceptibility of DL-based cloud detection systems to adversarial attacks. Specifically, they employ an optimization process to create an adversarial pattern that, when overlaid onto a cloudless scene, causes the DNNs to falsely detect clouds in the image. In paper [98], the authors devise a novel approach for generating adversarial pan-sharpened images. To achieve this, a generative network is employed to generate the pan-sharpened images, followed by the application of shape and label loss to carry out the attack task. In the paper [251], the researchers investigate the effectiveness and limitations of adversarial camouflage in the context of overhead imagery. Fu et al. [228] propose Ad2Attack, an Adaptive Adversarial Attack approach against UAV object tracking. Adversarial examples are generated online during the resampling of the search patch image, causing trackers to lose the target in the subsequent frames. Tang et al. [252] propose a novel adversarial patch attack algorithm. In particular, unlike traditional approaches that rely on the final outputs of models, the proposed algorithm uses the intermediate outputs to optimize adversarial patches. The study [253] introduces a novel defense mechanism based on adversarial patches that aims to disable the onboard object detection network of the LSST (Low-Slow-Small Target) recognition system by launching an adversarial attack. [254] introduces a novel framework for generating adversarial pan-sharpened images. The proposed method employs a two-stream network to generate the pan-sharpened images and applies shape loss and label loss to carry out the attack task. To ensure the quality of the pan-sharpened images, a perceptual loss is utilized to balance spectral preservation and attacking performance. Sun et al. [255] concentrate on patch-based attacks (PAs) against optical RSIs and propose a Threatening PA without the sacrifice of visual quality, dubbed TPA.
b) Physical attacks: In work [256], the authors conduct a comprehensive analysis of the universal adversarial patch attack for multi-scale objects in the RS field. Specifically, this study presents a novel adversarial attack method for object detection in RS data that optimizes the adversarial patch to attack as many objects as possible by formulating a joint optimization problem. Furthermore, it introduces a scale factor to generate a universal adversarial patch that can adapt to multi-scale objects, ensuring its validity in real-world scenarios. Du et al. [75] develop new experiments and metrics to assess the effectiveness of physical adversarial attacks on object detectors in aerial scenes, in order to investigate the impact of physical dynamics. In research [73], the authors propose an Adaptive Patch-based Physical Attack (AP-PA), which enables physically practicable attacks using malicious patches under both white-box and black-box settings in real physical scenarios. In [18], Lian et al. made the inaugural effort to execute contextual physical attacks against aerial detection in the physical world. Following their previous work, Lian et al. propose the Contextual Background Attack (CBA) [74], which achieves high attack effectiveness and transferability in real-world scenarios without the need to obscure the target objects. Technically, they extract the saliency of the target of interest as a mask for the adversarial patches and optimize the pixels outside the mask area to closely cover the critical contextual background area for detection. Additionally, the authors devise a novel training strategy in which the patches are forced to remain outside the targets during training. As a consequence, the elaborate perturbations can successfully hide the protected objects both on and outside the adversarial patches from being recognized. The objective of [257] is to create a natural-looking patch with a small perturbation area. This patch can be used in optical RSIs to avoid detection by object detectors while remaining imperceptible to human eyes. Paper [258] presents an approach to adversarially attack satellite RS detection using a patch-based method. The proposed method aims to achieve comparable attack effectiveness in the physical domain to that in the digital domain, without compromising the visual quality of the patch. To achieve this, the approach utilizes a pairwise-distance loss to control the salience of the adversarial patch.
Finally, we summarize adversarial attacks against object detection in RS ([18], [73]-[75], [98], [249]-[258]) in Table X.

III. BENCHMARK

In this study, we introduce a comprehensive benchmark that assesses the robustness of image classification and object detection tasks in optical RSIs, as shown in Fig. 14, which covers the key elements of model robustness. To explore the full range of robustness, we examine a diverse set of natural noises and adversarial noises. Below, we give a detailed introduction to the benchmark on natural robustness and adversarial robustness in Sec. III-A and Sec. III-B, respectively.

A. Natural Robustness
Various sources in the real world, such as weather fluctuations, sensor deterioration, and object deformations, generate natural noise that can be detrimental to DL models. These noises are inevitable, posing a challenge to accurate and reliable artificial intelligence. To thoroughly assess the inherent resilience of RSI classification and detection models against diverse forms of noise, a rigorous and systematic benchmark is necessary. This benchmark should encompass a wide range of noise types and intensities, including those arising from natural environmental factors, sensor degradation, and varying degrees of image distortion. Such an all-encompassing evaluation is key to enhancing the practical viability of DNN-based models and to developing more resilient and adaptable DNN architectures. Although comprehensive benchmarks [21], [23], [87], [259]-[263] on natural noises have been established in the CV field, such benchmarks are still lacking in the area of RS. Consequently, we build the first benchmark and datasets on natural noises for RS tasks. Specifically, we benchmark seven natural noises, including Gaussian noise (G), Poisson noise (P), salt-pepper noise (SP), random noise (RD), rain (R), snow (S), and fog (F), as shown in Fig. 15. Each noise is divided into five different intensities, as shown in Fig. 16.
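To make the corruption process concrete, the following is a minimal sketch of how a few of these noises could be synthesized at several severity levels. The severity parameters and function names are illustrative assumptions, not the exact settings used to construct AID-NN and DOTA-NN.

```python
import numpy as np

# Illustrative severity parameters (five levels); the actual values used to
# build AID-NN / DOTA-NN may differ.
GAUSS_SIGMA = [0.04, 0.08, 0.12, 0.16, 0.20]   # std. dev. for images in [0, 1]
SP_FRACTION = [0.01, 0.03, 0.05, 0.10, 0.15]   # fraction of corrupted pixels

def add_gaussian_noise(img, level):
    """img: float array in [0, 1]; level: 1..5."""
    sigma = GAUSS_SIGMA[level - 1]
    noisy = img + np.random.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0.0, 1.0)

def add_salt_pepper_noise(img, level):
    """Randomly set a fraction of pixels to black (pepper) or white (salt)."""
    frac = SP_FRACTION[level - 1]
    noisy = img.copy()
    mask = np.random.rand(*img.shape[:2])
    noisy[mask < frac / 2] = 0.0          # pepper
    noisy[mask > 1 - frac / 2] = 1.0      # salt
    return noisy

def add_poisson_noise(img, level):
    """Simulate shot noise; fewer photons means stronger noise."""
    photons = [60, 40, 25, 15, 10][level - 1]
    return np.clip(np.random.poisson(img * photons) / photons, 0.0, 1.0)
```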
In the following subsections, we mainly introduce the datasets, models, and metrics in our benchmark on natural robustness for image classification and object detection.
1) Image Classification: Benchmark on natural robustness for image classifiers.
a) Datasets: AID [264] is a large-scale aerial image dataset for classification tasks, comprising sample images acquired from Google Earth imagery. It contains 10,000 aerial images from 30 scene categories, including airport, bare land, baseball field, beach, bridge, center, church, commercial, dense residential, desert, farmland, forest, industrial, meadow, medium residential, mountain, park, parking, playground, pond, port, railway station, resort, river, school, sparse residential, square, stadium, storage tanks, and viaduct. AID is adopted to train aerial image classifiers. To compute the overall accuracy, the ratios of the training and testing sets are fixed at 50% and 50%, respectively. AID-NN is introduced as a large-scale benchmark dataset to evaluate the natural robustness of image classification in aerial images; it is derived from AID by adding the seven natural noises at five levels each. All other information on AID-NN is the same as in the original AID.
b) Classifiers: To comprehensively evaluate and investigate robustness trends across DNN architectures for image classification, our benchmark covers a wide range of architectures, as shown in Table XI. For CNNs, we select renowned and widely recognized classical network architectures, such as the ResNet series (including various versions of ResNet [5], ResNeXt [265], and WRN [266]) and DenseNet [267]. The lightweight models include MobileNetV2 [268], MobileNetV3 [269], and ShuffleNetV2 [270]. As for the prevalent vision Transformers, Swin Transformer [271] and ViT [272] are adopted in this benchmark.
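As an illustration of how such a heterogeneous model zoo might be assembled and scored, the sketch below instantiates a few representative architectures via torchvision and measures Top-1 accuracy. The 30-class head and the evaluation loop are assumptions for illustration rather than our benchmark code.

```python
import torch
from torchvision import models

NUM_CLASSES = 30  # AID scene categories

def build_classifier(name: str) -> torch.nn.Module:
    """Instantiate a subset of the benchmarked architectures with a 30-class head."""
    if name == "resnet50":
        net = models.resnet50(weights=None)
        net.fc = torch.nn.Linear(net.fc.in_features, NUM_CLASSES)
    elif name == "mobilenet_v3_large":
        net = models.mobilenet_v3_large(weights=None)
        net.classifier[-1] = torch.nn.Linear(net.classifier[-1].in_features, NUM_CLASSES)
    elif name == "swin_t":
        net = models.swin_t(weights=None)
        net.head = torch.nn.Linear(net.head.in_features, NUM_CLASSES)
    elif name == "vit_b_16":
        net = models.vit_b_16(weights=None)
        net.heads.head = torch.nn.Linear(net.heads.head.in_features, NUM_CLASSES)
    else:
        raise ValueError(f"unknown architecture: {name}")
    return net

@torch.no_grad()
def top1_accuracy(model, loader, device="cuda"):
    """Evaluate Top-1 accuracy on a clean or corrupted test loader."""
    model.eval().to(device)
    correct = total = 0
    for images, labels in loader:
        logits = model(images.to(device))
        correct += (logits.argmax(dim=1).cpu() == labels).sum().item()
        total += labels.numel()
    return correct / total
```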
c) Metric: Acc. The evaluation metric for image classification is the overall accuracy (Acc), i.e., the fraction of correctly classified test samples:

\mathrm{Acc} = \frac{N_{\mathrm{correct}}}{N_{\mathrm{total}}},

where N_{\mathrm{correct}} is the number of correctly classified samples and N_{\mathrm{total}} is the total number of test samples.

2) Object Detection: Benchmark on natural robustness for object detectors.
a) Datasets: DOTA [273] is a large-scale benchmark dataset for object detection in aerial images, which contains 15 common categories, 2,806 images (with widths ranging from 800 to 4,000 pixels), and 188,282 instances. The proportions of the training, validation, and testing sets in DOTA are 1/2, 1/6, and 1/3, respectively. DOTA is adopted to train aerial detectors after cropping the images to 1024×1024 with the image cropping tool at https://github.com/CAPTAIN-WHU/DOTA_devkit.
DOTA-NN is introduced as a large-scale benchmark dataset to evaluate the natural robustness of object detection in aerial images; it is derived from DOTA (after cropping) by adding the seven natural noises at five levels each. All other information on DOTA-NN is the same as in the original DOTA.
c) Metric: mAP. We use mean average precision (mAP) as the evaluation metric for object detection, defined as

\mathrm{mAP} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{AP}_i,

where n is the number of object categories being detected and \mathrm{AP}_i is the average precision (AP) of the i-th category, calculated as

\mathrm{AP}_i = \frac{1}{|R_i|}\sum_{r \in R_i} p_{\mathrm{interp}}(r),

where R_i is the set of recall levels at which a correct detection occurs, and p_{\mathrm{interp}}(r) is the interpolated precision at recall level r, defined as

p_{\mathrm{interp}}(r) = \max_{\tilde{r} \ge r} p(\tilde{r}).

Here, p(r) is the precision at recall level r, and the interpolation takes the maximum precision over all recall levels greater than or equal to r. The AP is thus obtained by averaging the interpolated precision values at all the recall levels at which there is a correct detection. In practice, the mAP is typically calculated for a range of intersection-over-union (IoU) thresholds and then averaged over those thresholds. For example, mAP@[.50:.05:.95] means that the mAP is obtained by taking the mean of the AP scores at IoU thresholds of 0.50, 0.55, 0.60, ..., 0.95.
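The sketch below implements the two metrics used in this benchmark, Acc and the interpolated AP/mAP. It computes the standard all-point interpolation (area under the interpolated precision-recall curve), which is the common practical realization of the definition above; it is not the exact DOTA evaluation toolkit.

```python
import numpy as np

def accuracy(pred_labels, true_labels):
    """Acc = (# correctly classified samples) / (# samples)."""
    pred_labels, true_labels = np.asarray(pred_labels), np.asarray(true_labels)
    return float((pred_labels == true_labels).mean())

def average_precision(recall, precision):
    """All-point interpolated AP: p_interp(r) = max precision at recall >= r."""
    r = np.concatenate(([0.0], np.asarray(recall), [1.0]))
    p = np.concatenate(([0.0], np.asarray(precision), [0.0]))
    # Make precision monotonically non-increasing (the interpolation step).
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum interpolated precision over the intervals where recall changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    """mAP = (1/n) * sum_i AP_i over the n object categories."""
    return float(np.mean(ap_per_class))
```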

B. Adversarial Robustness
In this section, we mainly introduce the adversarial attacks, datasets, models, and metrics in our benchmark on adversarial robustness for image classification and object detection.
1) Image Classification: Benchmark on adversarial robustness for image classifiers.
a) Attacks: In this benchmark, we evaluate adversarial robustness against five digital attacks: the Fast Gradient Sign Method (FGSM) [16], AutoAttack (AA) [102], Projected Gradient Descent (PGD) [100], C&W [104], and Momentum Iterative FGSM (MIFGSM) [19]. A detailed description of these attack methods is provided in Sec. II-B1. Furthermore, we conduct the aforementioned attacks under both white-box and black-box conditions.
b) Datasets: The AID [264] dataset is used for crafting adversarial examples to test the adversarial robustness of aerial image classifiers.
AID-AN is introduced as a large-scale benchmark dataset to evaluate the adversarial robustness of image classifiers in RS; it is derived from AID by adding four different adversarial noises. All other information on AID-AN is the same as in the original AID.
c) Models and d) Metric: The models and metric are the same as their counterparts for natural robustness described in Sec. III-A1.
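For reference, the sketch below shows how two of the gradient-based white-box attacks listed above (FGSM and PGD) can be mounted in PyTorch. The perturbation budget and step sizes are illustrative assumptions; black-box evaluation would simply craft the examples on a proxy model before testing them on the victim model.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, images, labels, eps=8 / 255):
    """One-step FGSM: x_adv = x + eps * sign(grad_x L)."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    grad = torch.autograd.grad(loss, images)[0]
    return (images + eps * grad.sign()).clamp(0, 1).detach()

def pgd_attack(model, images, labels, eps=8 / 255, alpha=2 / 255, steps=10):
    """PGD: iterative FGSM with projection onto the eps-ball around x."""
    x_adv = images.clone().detach() + torch.empty_like(images).uniform_(-eps, eps)
    x_adv = x_adv.clamp(0, 1)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), labels)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = images + (x_adv - images).clamp(-eps, eps)  # project onto eps-ball
        x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()

# Black-box transfer: craft on a proxy model, then evaluate on the victim model, e.g.
#   x_adv = pgd_attack(proxy_model, x, y)
#   acc   = top1_accuracy(victim_model, adv_loader)
```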
2) Object Detection: Benchmark on adversarial robustness for object detectors.
a) Attacks: We evaluate adversarial robustness against four patch-based attacks: CBA [74], APPA (on) [73], APPA (outside) [73], and the method introduced by Thys et al. [65]. Detailed information on these representative and SOTA attacks against object detection is provided in Sec. II-C2. In addition, we test the aforementioned SOTA methods under both white-box and black-box conditions, and conduct experiments in both the digital and physical domains.
b) Datasets: The DOTA [273] dataset is used for training the victim (white-box) or proxy (black-box) models, i.e., the aerial detectors to be attacked, the same as its role in Sec. III-A2.
In addition, we craft adversarial examples by adding the adversarial patches generated by the aforementioned attack methods to perform digital attacks. The different patch settings for digital attacks are shown in Fig. 17. For physical attacks, the elaborated adversarial patches are printed to disturb the targets of interest in real-world physical scenarios. The different patch settings for physical attacks are shown in Fig. 18.
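As a rough illustration of the digital setting, the sketch below pastes an already-optimized patch onto each target bounding box, scaling the patch with the box size. The scale ratio, placement, and tensor layout are assumptions for illustration and do not reproduce any specific attack method above.

```python
import torch
import torch.nn.functional as F

def paste_patch(image, patch, boxes, scale=0.25):
    """
    image: (3, H, W) tensor in [0, 1]; patch: (3, h, w) tensor in [0, 1];
    boxes: list of (x1, y1, x2, y2) target boxes in pixel coordinates.
    The patch is resized relative to each box and pasted at its center.
    """
    out = image.clone()
    _, H, W = image.shape
    for (x1, y1, x2, y2) in boxes:
        bw, bh = x2 - x1, y2 - y1
        pw, ph = max(1, int(bw * scale)), max(1, int(bh * scale))
        resized = F.interpolate(patch.unsqueeze(0), size=(ph, pw),
                                mode="bilinear", align_corners=False)[0]
        cx, cy = int((x1 + x2) / 2), int((y1 + y2) / 2)
        top, left = max(0, cy - ph // 2), max(0, cx - pw // 2)
        bottom, right = min(H, top + ph), min(W, left + pw)
        # Overwrite the image region with the (possibly clipped) patch.
        out[:, top:bottom, left:right] = resized[:, :bottom - top, :right - left]
    return out
```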
c) Models: The models are the same as their counterparts for natural robustness described in Sec. III-A2.
d) Metric: For digital attacks, we employ the detection results obtained from the clean images as the reference for calculating the AP. Specifically, the AP on the clean dataset is set to 100% to ensure that targets missed by the original detector are not counted as successful attacks.
For physical attacks, we conduct experiments scaled at a 1:400 proportion to verify the attack performance in the physical world. Technically, we train 20 mainstream object detectors as victim or proxy models and record the average confidence of 18 aircraft, with the detection threshold set to 0.2. Targets with detection confidence lower than 0.2 are regarded as unrecognized, because the confidence threshold for object detection tasks is usually set to around 0.45.
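A minimal sketch of this confidence-based bookkeeping is given below; the detector interface is abstracted away, and only the thresholding logic (confidence below 0.2 counts as unrecognized) follows the protocol stated here.

```python
DETECTION_THRESHOLD = 0.2  # targets below this confidence are treated as unrecognized

def evaluate_physical_attack(confidences, num_targets=18):
    """
    confidences: list with one detection confidence per annotated aircraft in the
    attacked scene (0.0 if the detector returns no box for that aircraft).
    Returns the average confidence and how many targets were successfully hidden.
    """
    assert len(confidences) == num_targets
    avg_conf = sum(confidences) / num_targets
    hidden = sum(1 for c in confidences if c < DETECTION_THRESHOLD)
    return avg_conf, hidden

# Example usage with hypothetical confidences recorded from an attacked scene:
# avg_conf, hidden = evaluate_physical_attack([0.05, 0.31, 0.0, 0.12, ...])
```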

IV. EXPERIMENTS
In this part, we present the experimental results and in-depth analysis of benchmarking natural robustness and adversarial robustness in Sec. IV-A and Sec. IV-B, respectively.
A. Natural Robustness

1) Image Classification: In this section, we evaluate the natural robustness of the 23 RSI classifiers introduced in Sec. III-A1 with the AID RSI dataset [264] and its derived version corrupted with various natural noises. We show the classification results in Fig. 19. Please note that all the evaluation results presented in this part are Top-1 accuracy. Based on the experimental results, we have the following observations:
• Noise type. The impact of various types of natural noise on classifiers varies in degree. Specifically, random noise exerts the most significant impact on classification accuracy, resulting in the greatest reduction in model performance. In comparison, the classifiers are more robust to the other noises.
• Noise level. As expected, for both CNNs and Transformers, an increase in the intensity of noise of any type results in a corresponding escalation of its impact on the model, leading to a more pronounced reduction in classification accuracy.

2) Object Detection: In this section, we evaluate the natural robustness of the various mainstream RSI object detectors introduced in Sec. III-A2 with the large-scale aerial detection dataset DOTA and its derived version corrupted with various natural noises. We show the experimental results in Fig. 20 and Fig. 21. Please note that we adopt mAP (mAP@.50 and mAP@[.50:.05:.95]) as the evaluation metric.
Based on the experimental results, we have the following observations:
• Noise type. Similarly, the influence of different types of natural noise on aerial detectors varies to a certain degree. Specifically, random noise and salt-pepper noise exert the most significant impact on aerial detectors, followed by Gaussian noise and Poisson noise. In comparison, rain, snow, and fog have a relatively smaller impact on the performance of the detectors.
• Noise level. Consistent with expectations, all of the aerial detectors exhibit a consistent pattern: as the intensity of noise increases across all types, its impact on the detectors intensifies, resulting in a more significant decline in detection performance. Compared with natural weather noises, changes in the level of the remaining noises have a greater impact on detection accuracy.
• Model type. YOLOv5 and YOLOv3 are significantly more robust than the other detectors while also achieving better detection performance, followed by Swin Transformer, which is slightly more resilient than the remaining aerial detectors. Beyond this, it is hard to discern robustness differences between different families of aerial detectors, such as CNN-based versus Transformer-based, anchor-based versus anchor-free, and one-stage versus two-stage.
• Model size. Generally speaking, when the model structure is held constant, e.g., within the YOLOv5 family, larger model sizes exhibit a greater level of robustness, similar to the image classifiers. However, in several cases YOLOv5l (the second largest detector) outperforms YOLOv5x (the largest detector); overfitting may be a contributing factor to this phenomenon.
B. Adversarial Robustness

1) Image Classification: In this section, we evaluate the adversarial robustness of the 23 RSI classifiers introduced in Sec. III-B1 with the AID RSI dataset [264] and its derived version with various adversarial noises. We show the classification attack results of FGSM, AutoAttack, PGD, C&W, and MIFGSM in Fig. 22 and 23. Based on the experimental results, we have the following observations:
• Noise type. The different adversarial noises affect the classifiers to varying degrees; certain attacks are found to have the least detrimental impact on the classifiers, particularly in black-box scenarios, where they are rendered nearly ineffective.
• Noise level. Consistent with expectations, both CNNs and Transformers suffer increasingly effective attacks as the intensity of adversarial noise escalates, regardless of whether the attacks are conducted in white-box or black-box settings. Specifically, under white-box settings, FGSM can successfully execute attacks while the classification accuracy of most classifiers remains higher than 50%; in black-box conditions, however, FGSM is found to be largely ineffective. As the perturbation amplitude increases, the accuracy of most classifiers drops below 30%, indicating a substantial increase in attack efficacy under white-box conditions.
• Model size. Under black-box settings, when keeping the classifiers' network structure constant, it is intuitive that the bigger the neural network, the stronger the adversarial robustness. However, the most robust model is usually not the biggest version of a classifier but the second largest one; this phenomenon could be attributed to overfitting.

2) Object Detection: In this section, we evaluate the adversarial robustness of the 20 RSI object detectors introduced in Sec. III-A2 with the DOTA RSI dataset [273] and its derived version with various adversarial noises. We illustrate the evolutionary progression of the adversarial patch in Fig. 28, demonstrating its dynamic development over time. The generated adversarial patches are shown in Fig. 29. We show the detection attack results of the four physical attack methods in Fig. 30, 31, 32, and 33, respectively. In addition, the digital attack performance is exhibited in Fig. 34. The evaluation metrics are introduced in detail in Sec. III-B2. Please note that the experimental results presented in this section are partially derived from our previous works [73] and [74].
Based on the experimental results, we have the following observations:
• Digital attack. (i) For attack methods, the attack effects of the four methods exhibit minimal variation in general. However, for YOLOv2, the attack methods that place the patch outside the target, i.e., APPA (outside) and CBA, are less effective than the attack methods that place the patch on the target, i.e., the method of Thys et al. [65] and APPA (on).
Notably, the background patch is positioned outside the targeted objects in the digital test, whereas a portion of the patch area is sacrificed to mask the targeted objects in the physical world. (ii) For detection methods, YOLOv2 is found to be the most vulnerable to attack, even in black-box scenarios. On the other hand, different versions of YOLOv5 demonstrate robustness across diverse attack settings. However, detectors such as Faster R-CNN and SSD are comparatively easier to attack and compromise. In general, the Swin Transformer stands out as the most resilient detector, exhibiting a higher level of resistance against various attacks.
• Physical attack. (i) For attack methods, CBA exhibits a notable physical attack effect, causing a significant number of detectors to fail to detect any objects in real-world scenarios, which is seldom observed for APPA and [65]. In addition, CBA also shows the best attack transferability, even against some robust detectors, e.g., YOLOv3, YOLOv5, and Swin Transformer. (ii) For detection methods, YOLOv5 continues to demonstrate remarkable resilience against attacks compared to other aerial detectors. However, CBA has proven highly effective in impairing the detection performance of YOLOv5 and generalizes well across its different versions. Similar to the digital attack scenario, YOLOv2 remains the most vulnerable detector in the physical world as well; interestingly, it exhibits a certain degree of immunity to adversarial patches placed outside the targets of interest.
• White-box. (i) The contextual background patches demonstrate a remarkable ability to completely impair the detection capability of the victim detectors.

V. DISCUSSIONS
The investigation into the robustness of DNNs has witnessed rapid advancements in recent years, particularly in the field of CV and its related applications such as RS. Despite the significant progress made, several challenging issues remain that demand further examination and discussion. In this section, we delve into these challenges and offer insights into potential research directions for CV and RS as follows:
1) Explain the generation of adversarial perturbations with neural network training.
Given the similarities between perturbation generation and model training, it is worthwhile to explore the effective application of techniques that enhance the performance of DNNs to adversarial attacks. For instance, methods such as "momentum" introduced in [19] and "dropout" discussed in [203] have shown potential in boosting attack efficacy. Investigating how such techniques, including training strategies, test-time augmentations, and so on, can be appropriately utilized in the context of adversarial attacks could provide valuable insights for strengthening attack effectiveness and, in turn, for further improving the security and resilience of DNN models.
4) Bridge the gap between digital and physical attacks.
The majority of existing research primarily concentrates on theoretical analyses of adversarial attacks and their transferability in the digital domain, rendering them ineffective when confronted with real-world physical applications. However, physical attacks raise substantial security concerns due to their potential implications in practical scenarios. Therefore, it becomes imperative to bridge the gap between digital and physical attacks by developing techniques capable of effectively translating digital attack strategies into real-world settings.
5) Bridge the gap between attacks against different tasks. The essence of DNNs-based models for visual perception is extracting features, progressing from shallow to deep concepts and from simple to abstract representations. Consequently, how to interfere with the feature extraction process in various visual tasks to achieve a universal attack effect is an important and promising research direction. By understanding the underlying mechanisms of feature extraction in DNNs, researchers can develop strategies to manipulate and disrupt this process to generate effective adversarial attacks across different visual tasks. This line of research has the potential to uncover vulnerabilities and weaknesses in DNN models, leading to the development of robust defense mechanisms and improved security in various CV applications.
6) The background features matter more than you think. The background features of a target are widely acknowledged to play a crucial role in its correct recognition. However, recent studies [74], [201] have demonstrated that DNNs-based intelligent recognition systems can be easily deceived solely by manipulating the background features of the target, even without distorting the target itself at all. This raises the question of why a well-elaborated intelligent algorithm is so vulnerable to such manipulation and suggests that the influence of background features may be more significant than initially anticipated. Consequently, there is a pressing need to delve deeper into the pivotal role that background features play in CV tasks and to understand their underlying mechanisms. Such research can provide valuable insights to guide the design of more robust visual perception algorithms and models.
7) Background attack in the physical world. The prevailing physical attacks directed at object detectors primarily focus on developing perturbations in patch form. These elaborated adversarial patches are printed and affixed to the surfaces of targeted objects through painting or pasting, thereby compromising the recognition capabilities of intelligent systems operating in real-world environments. However, applying patch-based perturbations in the physical realm incurs significant costs and time requirements. As a viable alternative, background attacks emerge as a promising approach, wherein only the background regions surrounding the targeted objects are manipulated, without any direct alteration of the protected objects themselves. This approach is particularly advantageous for scenarios involving small targets, such as object detection in RS applications, where the effectiveness and practicality of adversarial patches are limited. Furthermore, the practical value of adversarial camouflage in background attacks is of utmost importance to ensure the inconspicuousness of the adversarial perturbations.
VI. CONCLUSIONS

In this study, we present a comprehensive investigation into the robustness of image classification and object detection in the context of RS. Our work encompasses an extensive review of existing literature in both the CV and RS domains, providing a comprehensive understanding of the research landscape in this area. Furthermore, we perform a series of extensive experiments to benchmark the robustness of image classifiers and object detectors specifically designed for RS imagery. We also release the corresponding datasets with various types of noise to facilitate future research and evaluation in this field. To the best of our knowledge, this study represents the first comprehensive review and benchmarking of the robustness of different tasks in optical RS. Additionally, we conduct an in-depth analysis of the experimental results and outline potential future research directions to further enhance the understanding and development of model robustness. Overall, our work offers a systematic perspective on the robustness of RS models, enabling readers to gain a comprehensive overview of this field and guiding the calibration of different approaches to accelerate the advancement of model robustness. We also plan to continually update this work by incorporating more details and the latest advancements in the field, to enrich the benchmarking of model robustness in RS.