Cybersecurity for AI Systems: A Survey

: Recent advances in machine learning have created an opportunity to embed artiﬁcial intelligence in software-intensive systems. These artiﬁcial intelligence systems, however, come with a new set of vulnerabilities making them potential targets for cyberattacks. This research examines the landscape of these cyber attacks and organizes them into a taxonomy. It further explores potential defense mechanisms to counter such attacks and the use of these mechanisms early during the development life cycle to enhance the safety and security of artiﬁcial intelligence systems.


Introduction
Advances in Artificial Intelligence (AI) technology have contributed to the enhancement of cybersecurity capabilities of traditional systems with applications that include detection of intrusion, malware, code vulnerabilities and anomalies. However, these systems with embedded machine learning models have opened themselves to a new set of vulnerabilities, commonly known as AI attacks. Currently, these systems are prime targets for cyberattacks, thus compromising the security and safety of larger systems that encompass them. Modern day AI attacks are not only limited to just coding bugs and errors. They manifest due to the inherent limitations or vulnerabilities of systems [1]. By exploiting the vulnerabilities in the AI system, attackers aim at either manipulating its behavior or obtaining its internal details by tampering with its input, training data, or the machine learning (ML) model. McGraw et al. [2] have classified AI attacks broadly as manipulation and extraction attacks. Based on the inputs given to the system, the training dataset used for learning, and manipulation of the model hyperparameters, attacks on AI systems can manifest in different types, with different degrees of severity. For example, adversarial or evasion attack can be launched by manipulating the input to the AI system, which results in the system producing an unintended outcome. A poisoning or causative attack can be launched by tainting the training dataset, which would result in the AI system exhibiting unethical behavior.
Therefore, it is important that we start thinking about designing security into AI systems, rather than retrofitting it as an afterthought. This research addresses the following research questions: RQ1: What are the cyberattacks that AI systems can be subjected to? RQ2: Can the attacks on AI systems be organized into a taxonomy, to better understand how the vulnerabilities manifest themselves during the system development.
RQ3: What are possible defense mechanisms to prevent AI systems being subjected to cyberattacks?
RQ4: Is it possible to devise a generic defense mechanism against all kinds of AI attacks.
To address these research questions and determine the extent of risk to safety and security of AI systems, we first conducted a systematic literature review looking for AI attacks on systems reported in the literature. We then organized these attacks into a taxonomy to not only understand the types of vulnerabilities, but also the stage in the development of AI systems when these vulnerabilities manifest themselves. We then conducted further literature search looking for any defense mechanisms to counter these attacks and improve the safety and security of AI systems.
This study is organized as follows. In Section 2, we report the results of the systematic literature review and identify the attacks, from an AI system development perspective, and their vulnerabilities. In Section 3, we introduce a taxonomy of AI attacks along with defense mechanisms and countermeasures to mitigate their threats. Section 4 concludes the study and highlights major findings.

Literature Review
This survey was founded on searching, by keywords, to find related articles to cybersecurity of AI systems. The top most used keywords are as follow: cybersecurity, cyberattack, and vulnerabilities. We searched Scopus, an Elsevier abstracts and citation database, for articles having titles that matched the search query ("cyber security" OR "cybersecurity" OR "security" OR "cyberattack" OR "vulnerability" OR "vulnerabilities" OR "threat" OR "attack" OR "AI attack") AND ("AI" OR "ML" OR "Artificial Intelligence" OR "Machine Learning") AND ("system")).
The search resulted in a total of 1366 articles. Within these articles, we looked for those in computer science or computer engineering subject areas that were published in journals in the English language, leaving us with 415 manuscripts. We carefully reviewed the abstracts of the papers to determine their relevance. Only articles that discussed the vulnerabilities of AI systems to attacks and/or their defense mechanisms were considered.
During the learning or training stage, an AI system needs data for training a machine learning model. The training data are subject to manipulation attacks, requiring that their integrity be verified. Ma et al. [3] used a visual analytics framework for explaining and exploring ML model vulnerabilities to data poisoning attacks. Kim and Park [4] proposed a blockchain-based environment that collects and stores learning data whose confidentiality and integrity can be guaranteed. Mozaffari-Kermani et al. [5] focused on data poisoning attacks on, and the defenses for, machine learning algorithms in healthcare.
During the inference or testing stage, an AI system can be subjected to manipulation attacks by presenting falsified data to be classified as legitimate data. Adversarial or evasion attacks and/or potential defenses against such attacks are discussed in [6][7][8][9][10][11][12][13][14]. Chen et al. [15] looked at such attacks in the context of reinforcement learning. Li et al. [16] proposed a low latency decentralized framework for identifying adversarial attacks in deep learning-based industrial AI systems. Garcia-Ceja et al. [17] described how biometric profiles can be generated to impersonate a user by repeatedly querying a classifier and how the learned profiles can be used to attack other classifiers trained on the same dataset. Biggio et al. [18] examined vulnerabilities of biometric recognition systems and their defense mechanisms. Ren et al. [19] also looked at querying-based attacks against black-box machine learning models and potential defense mechanisms against such attacks. Wang et al. [20] looked at a variant, termed the Man-in-the-Middle attack, using generative models for querying. Threats from, and potential defense against, attacks on machine learning models in 5G networks is discussed in [21,22]. Apruzzese et al. [23] provided an approach to mitigating evasion attacks on AI-based network intrusion detection systems. Zhang et al. [24] explored adversarial attacks against commonly used ML-based cybersecurity systems. Liu et al. [25] discussed how to improve robustness of ML-based CAD systems against adversarial attacks. Building malware detection systems that are more resilient to adversarial attacks was the focus of [26,27], and Gardiner and Nagaraja [28] provided a comprehensive survey on vulnerabilities of ML models in malware detection systems. Dasgupta and Collins [29] surveyed game theoretical approaches that can be used to make ML algorithms robust against adversarial attacks.
During the inference or testing stage, extraction attacks are possible using the feature vector of a model for model inversion or reconstruction and gaining access to private data that was used as input or for training an AI system [30].
Hansman and Hunt [31] and Gao et al. [32] proposed a taxonomy of network and computer attacks to categorize different attack types. Their taxonomy includes four dimensions to categorize attacks on AI systems, including attack classes, attack targets, vulnerabilities and exploits used by the attacks and whether the attack has a payload or effect beyond itself. Their taxonomical structure is very comprehensive and can be used to analyze a system for its dependability, reliability and security.
Despite the benefits of machine learning technologies, the learning algorithms can be abused by cybercriminals to conduct illicit and undesirable activities. It was shown in [33,34] that attackers might gain a significant benefit by exploiting vulnerabilities in the learning algorithms, which can sometimes become a weakest link in the security chain. Several studies related to attacks on machine learning algorithms have been reported in the literature using different threat models. Barreno et al. [35,36], Huang et al. [36], Biggio et al. [37] and Munoz-Gonzalez et al. [38] discussed different attack scenarios against machine learning models with different attack models. The frameworks they proposed characterize the attacks according to the attacker's goal, their capabilities to manipulate the data and influence the learning system, familiarity with the algorithms, the data used by the defender and the attacker's strategy. For example, data poisoning attacks, also known as causative attacks, are a major emerging security threat to data-driven technologies. In these types of attacks, it can be assumed that the hacker has control over the training dataset that is being used by the learning algorithm. The hacker can actively influence the training dataset in order to subvert the entire learning process, thus decreasing the overall performance of the system, or to produce particular types of errors in the system output. For example, in a classification task, the hacker may poison the data to modify the decision boundaries learned by the learning algorithm, thus resulting in misclassification of instances, or a higher error rate for a specific type of class. This is a kind of threat that is related to the reliability of the large amount of data collected by the systems [38,39].
This survey is distinct from [31,39] in studying attacks on an AI system from the perspective of a software engineering team, that organizes its work around different stages of an AI system's development life cycle. For these different stages of an AI system, and their corresponding attacks, potential defense mechanisms are also provided. Organizing the literature using this perspective can be valuable to systematically study the design of AI systems for security purposes, to explore the trade offs that result from using different defense mechanisms, and to develop a catalog of patterns and tactics for designing AI systems for security purposes. Table 1 lists various attacks carried out at different stages of the AI system development processes and the countermeasures that are taken against these attacks.

Vulnerabilities Defense Mechanisms
Model poisoning attacks [41][42][43][44] During the training stage Trust ability of the trainer, based on a privately held validation dataset.
Use of pre-trained models that are corrupted.
Securely hosting and disseminating pre-trained models in virtual repositories that guarantee integrity to preclude benevolent models from being manipulated. Identifying backdoors in malevolently trained models acquired from untrustworthy trainers by fine-tuning untrusted models.
Obtain pre-trained models from trusted source. Employ activation-based pruning with different training examples.
Model poisoning in federated learning [41,45,47,48] During the training stage Obstruct the convergence of the execution of the distributed Stochastic Gradient Descent (SGD) algorithm, Robust aggregation methods, robust learning rate.
Model extraction attack [53,54] During Inference and/or training stage Models having similar characteristics (parameters, shape and size, similar features etc.) Hiding or adding noises to the output probabilities while keeping the class label of the instances intact. Suppressing suspicious queries or input data.
Inference attack [55] During Inferencing, Training, and Testing Model Leaking information leading to inferences being made on private data.
Methods proposed in [55] have leveraged heuristic correlations between the records of the public data and attribute values to defending against inference attacks. Modifying the identified k entries that have large correlations with the attribute values to any given target users.
The following section systematically explores attacks on AI systems and their defenses in more detail.

AI Attacks and Defense Mechanisms
Research has been carried out to identify new threats and attacks on different levels of design and implementation of AI systems. Kaloudi and Li [56], stressed the dearth of proper understanding of the malicious intention of the attacks on AI-based systems. The authors introduced 11 use cases divided into five categories: (1) next generation malware, (2) void synthesis, (3) password-based attacks, (4) social bots, and (5) adversarial training. They developed a threat framework to categorize the attacks. Turchin [57] pointed out the lack of desired behaviors of AI systems that could be exploited to design attacks in different phases of system development. The research lists the following modes of failure of AI systems:

•
The need for better resources for self-upgradation of AI systems can be exploited by adversaries • Implementation of malicious goals make the AI systems unfriendly • Flaws in the user-friendly features • Use of different techniques to make different stages of AI free from the boundaries of actions expose the AI systems to adversaries Similar research is carried out by Turchin and Denkenberger [58] where the classification of attacks was based on intelligence levels of AI systems. The authors introduced three levels of AI intelligence with respect to human intelligence: (1) "Narrow AI" which requires human assistance, (2) "Young AI" which has capability a bit better than human, and (3) "Mature AI" whose intelligence is super-human. While classifying the intelligence levels of AI systems, the authors investigated several vulnerabilities during the evolution of capabilities of AI systems. Yampolsky [59] projected a holistic view of tracks as to why an AI system could be malicious, classifying the tracks into two stages: (1) Pre-deployment and (2) Post-deployment. This includes the intrinsic and extrinsic reasons for AI technologies to be malicious, such as design flaws, intentional activities, or environmental factors.

Types of Failures
Shiva Kemar et al. [60] discussed two modes of failures of machine learning (ML) systems. They claimed that AI systems can fail either due to the inherent design of the systems (unintentional failures) or by the hand of an adversary (intentional failures).
Unintentional Failures: The unintentional failure mode leads to the failure of an AI/ML system when the AI/ML system generates formally correct, but completely unsafe, behavior.
Intentional failures: Intentional failures are caused by the attackers attempting to destabilize the system either by (a) misclassifying the results, by introducing private training data, or b) by stealing the foundational algorithmic framework. Depending on the accessibility of information about the system components (i.e., knowledge), intentional failures can be further subdivided into different subcategories.

Categories of Unintentional Failures
Unintentional failures happen when AI/ML systems produce an unwanted or unforeseen outcome from a determined action. It happens mainly due to system failures. In this research we further categorize different types of unintentional failures.

•
Reward Hacking: Reward hacking is a failure mode that an AI/ML system experiences when the underlying framework is a reinforcement learning algorithm. Reward hacking appears when an agent has more return as reward in an unexpected manner in a game environment [61]. This unexpected behavior unsettles the safety of the system. Yuan et al. [62] proposed a new multi-step reinforcement learning framework, where the reward function generates a discounted future reward and, thus, reduces the influence of immediate reward on the current state action pair. The proposed algorithm creates the defense mechanism to mitigate the effect of reward hacking in AI/ML systems. • Distributed Shift: This type of mode appears when an AI/ML model that once performed well in an environment generates dismal performance when deployed to perform in a different environment. One such example is when the training and test data come from two different probability distributions [63]. The distribution shift is further subdivided into three types [64]: 1.
Covariate Shift: The shifting problem arises due to the change in input features (covariates) over time, while the distribution of the conditional labeling function remains the same.

2.
Label Shift: This mode of failure is complementary to covariate shift, such that the distribution of class conditional probability does not change but the label marginal probability distribution changes.

3.
Concept Shift: Concept shift is a failure related to the label shift problem where the definitions of the label (i.e., the posteriori probability) experience spatial or temporal changes.
Subbaswamy and Saria proposed an operator-based hierarchy of solutions that are stable to the distributed shift [65]. There are three operators (i.e., conditioning, intervening and computing counterfactuals) that work on a graph specific to healthcare AI. These operators effectively remove the unstable component of the graph and retain the stable behavior as much as possible. There are also other algorithms to maintain robustness against the distributed shift. Rojas-Carulla et al. [66] proposed a data-driven approach, where the learning of models occurs using data from diverse environments, while Rothenhausler et al. [67] devised bounded magnitude-based robustness, where the shift is assumed to have a known magnitude.
• Natural Adversarial Examples: The natural adversarial examples are real-world examples that are not intentionally modified. Rather, they occur naturally, and result in considerable loss of performance of the machine learning algorithms [68]. The instances are semantically similar to the input, legible and facilitate interpretation (e.g., image data) of the outcome [69]. Deep neural networks are susceptible to natural adversarial examples.

Categories of Intentional Failures
The goal of the adversary is deduced from the type of failure of the model. Chakraborty et al. [70] identify four different classes of adversarial goals, based on the machine learning classifier output, which are the following: (1) confidence reduction, where the target model prediction confidence is reduced to a lower probability of classification, (2) misclassification, where the output class is altered from the original class, (3) output misclassification, which deals with input generation to fix the classifier output into a particular class, and (4) input/output misclassification, where the label of a particular input is forced to have a specific class. Shiv Kumar et al. [60] identified the taxonomy of intentional failures/attacks, based on the knowledge of the adversary. It deals with the extent of knowledge needed to trigger an attack for the AI/ML systems to fail. The adversary is better equipped with more knowledge [70] to perform the attack.
There are three types of classified attacks based on the adversary's access to knowledge about the system.

1
Whiteb ox Attack: In this type of attack, the adversary has access to the parameters of the underlying architecture of the model, the algorithm used for training, weights, training data distribution, and biases [71,72]. The adversary uses this information to find the model's vulnerable feature space. Later, the model is manipulated by modifying an input using adversarial crafting methods. An example of the whitebox attack and adversarial crafting methods are discussed in later sections. The researchers in [73,74] showed that adversarial training of the data, filled with some adversarial instances, actually helps the model/system become robust against whitebox attacks. 2 Blackbox Attack: In blackbox attacks the attacker does not know anything about the ML system. The attacker has access to only two types of information. The first is the hard label, where the adversary obtained only the classifier's predicted label, and the second is confidence, where the adversary obtained the predicted label along with the confidence score. The attacker uses information about the inputs from the past to understand vulnerabilities of the model [70]. Some blackbox attacks are discussed in later sections. Blackbox attacks can further be divided into three categories: • Non-Adaptive Blackbox Attack: In this category of blackbox attack, the adversary has the knowledge of distribution of training data for a model, T. The adversary chooses a procedure, P, for a selected local model, T', and trains the model on known data distribution using P for T' to approximate the already learned T in order to trigger misclassification using whitebox strategies [53,75]. • Adaptive Blackbox Attack: In adaptive blackbox attack the adversary has no knowledge of the training data distribution or the model architecture. Rather, the attacker approaches the target model, T, as an oracle. The attacker generates a selected dataset with a label accessed from adaptive querying of the oracle. A training process, P, is chosen with a model, T', to be trained on the labeled dataset generated by the adversary. The model T' introduces the adversarial instances using whitebox attacks to trigger misclassification by the target model T [70,76]. • Strict Blackbox Attack: In this blackbox attack category, the adversary does not have access to the training data distribution but could have the labeled dataset (x, y) collected from the target model, T. The adversary can perturb the input to identify the changes in the output. This attack would be successful if the adversary has a large set of dataset (x,y) [70,71].

Grayb ox attacks:
In whitebox attacks the adversary is fully informed about the target model, i.e., the adversary has access to the model framework, data distribution, training procedure, and model parameters, while in blackbox attacks, the adversary has no knowledge about the model. The graybox attack is an extended version of either whitebox attack or blackbox attack. In extended whitebox attacks, the adversary is partially knowledgeable about the target model setup, e.g, the model architecture, T, and the training procedure, P, is known, while the data distribution and parameters are unknown. On the other hand, in the extended blackbox attack, the adversarial model is partially trained, has different model architecture and, hence, parameters [77].

Anatomy of Cyberattacks
To build any machine learning model, the data needs to be collected, processed, trained, and tested and can be used to classify new data. The system that takes care of the sequence of data collection, processing, training and testing can be thought of as a generic AI/ML pipeline, termed the attack surface [70]. An attack surface subjected to adversarial intrusion may face poisoning attack, evasion attack, and exploratory attack. These attacks exploit three pillars of the information security, i.e., Confidentiality, Integrity, and Availability, known as the CIA triad [78]. Integrity of a system is compromised by the poisoning and evasion attacks, confidentiality is subject to intrusion by extraction, while availabilty is vulnerable to poisoning attacks. The entire AI pipeline, along with the possible attacks at each step, are shown in Figure 1.

Poisoning Attack
Poisoning attack occurs when the adversary contaminates the training data. Often ML algorithms, such as intrusion detection systems, are retrained on the training dataset. In this type of attack, the adversary cannot access the training dataset, but poisons the data by injecting new data instances [35,37,40] during the model training time. In general, the objective of the adversary is to compromise the AI system to result in the misclassification of objects.
Poisoning attacks can be a result of poisoning the training dataset or the trained model [1]. Adversaries can attack either at the data source, a platform from which a defender extracts its data, or can compromise the database of the defender. They can substitute a genuine model with a tainted model. Poisoning attacks can also exploit the limitations of the underlying learning algorithms. This attack happens in federated learning scenarios where the privacy on individual users' dataset is maintained [47]. The adversary takes advantage of the weakness of federated learning and may take control of both the data and algorithm on an individual user's device to deteriorate the performance of the model on that device [48].

Dataset Poisoning Attacks
The major scenarios of data poisoning attacks are error-agnostic poisoning attacks and error-specific poisoning attacks. In the error-agnostic type of poisoning attack the hacker aims to cause a Denial of Service (DOS) kind of attack. The hacker causes the system to produce errors, but it does not matter what type of error it is. For example, in a multi-class classification task a hacker could poison the data leading to misclassification of the data points irrespective of the class type, thus maximizing the loss function of the learning algorithm. To launch this kind of attack, the hacker needs to manipulate both the features and the labels of the data points. On the other hand, in error-specific poisoning attacks, the hacker causes the system to produce specific misclassification errors, resulting in security violation of both integrity and availability. Here, the hacker aims at misclassifying a small sample of chosen data points in a multi-class classification task. The hacker aims to minimize the loss function of the learning algorithm to serve the purpose, i.e., to force the system into misclassifying specific instances without compromising the normal system operation, ensuring that the attack is undetected [38,39].
A model is built up from a training dataset. So, attacking the dataset results in poisoning the model. By poisoning the dataset, the adversary could manipulate to generate natural adversarial examples, or inject instances with incorrect labels into the training dataset. The model may learn the pattern on misclassified examples in the data that serves the goal of the adversary. The dataset poisoning attacks can be further subdivided into two categories [79].
• Data Modification: The adversary updates or deletes training data. Here, the attacker does not have access to the algorithm. They can only manipulate labels. For instance, the attacker can draw new labels at random from the training pool, or can optimize the labels to cause maximum disruption. • Data Injection: Even if the adversary does not have access to the training data or learning algorithm, he or she can still inject incorrect data into the training set. This is similar to manipulation, but the difference is that the adversary introduces new malicious data into the training pool, not just labels.
Support Vector Machines (SVMs) are widely used classification models for malware identification, intrusion detection systems, and filtering of spam emails, to name a few applications. Biggio, Nelson and Laskov [40] illustrated poisoning attacks on the SVM classifier, with the assumption that the adversary has information about the learning algorithm, and the data distribution. The adversary generates surrogate data from the data distribution and tampers with the training data, by introducing the surrogate data, to drastically reduce the model training accuracy. The test data remains untouched. The authors formed an equation, expressing the adversarial strategy, as: is the validation data. The goal of the adversary is to maximize the loss function A(x) with the surrogate data instance (x, y) to be added into the training set D tr in order to maximally reduce the training accuracy of classification. g i is the status of the margin, influenced by the surrogate data instance (x, y).
Rubinstien et al. [80] presented the attack on SVM learning by exploiting training data confidentiality. The objective is to access the features and the labels of the training data by examining the classification on the test set. Figure 2 explains the poison attack on the SVM classifier. The left sub-figure indicates the decision boundary of the linear SVM classifier, with support vectors and classifier margin. The right sub-figure shows how the decision boundary is drastically changed by tampering with one training data instance without changing the label of the instance. It was observed that the classification accuracy would be reduced by 11% by a mere 3% manipulation of the training set [81]. Nelson et al. [39] showed that an attacker can breach the functionality on the spam filter by poisoning the Bayesian classification model. The filter becomes inoperable under the proposed Usenet dictionary attack, wherein 36% of the messages are misclassified with 1% knowledge regarding the messages in the training set. Munoz-Gonzalez et al. [38] illustrated poisoning attacks on multi-class classification problems. The authors identified two attack scenarios for the multi-class problems: (1) errorgeneric poisoning attacks and (2) error-specific poisoning attacks. In the first scenario, the adversary attacks the bi-level optimization problem [40,82], where the surrogate data is segregated into training and validation sets. The model is learned on the generated surrogate training dataset with the tampered instances. The validation set measures the influence of the tampered instances on the original test set, by maximizing the binary class loss function. It is expressed in the following equation: The surrogate data D is segregated into training D tr and validation sets D val . The model is trained on D tr along with D x (i.e., the tampered instances). D val is used to measure the influence of the tainted samples on the genuine data via the function A D x , σ that explains the loss function, L, with respect to the validation dataset D val and the parameters θ of the surrogate model. In the multi-class scenario, the multi-class loss function is used for error-generic poisoning attacks.
In error-specific poisoning attacks, the objective remains to change the outcome of specific instances in a multi-class scenario. The goal of desired misclassification is expressed with the equation: D * val is the same as the D val with different labels for desired misclassified instances that the adversary chose. The attacker aims to minimize the loss of the chosen misclassified samples.
In separate research, Kloft and Laskov [83] explained the adversarial attack on detection of outliers (anomalies), where the adversary is assumed to have knowledge about the algorithm and the training data. Their work introduced a finite sliding window, while updating the centre of mass iteratively for each new data instance. The objective is to accept the poisoned data instance as a valid data point, and the update on the center of mass is shifted in the direction of the tainted point, that appears to be a valid one. They show that relative displacement, d, of the center of mass under adversarial attack is lower bounded by the following inequality when the training window length is infinite: where i and n are the number of tampered points and number of training points, respectively. The intuition behind the use of anomaly detection is to sanitize the data by removing the anomalous data points, assuming the distribution of the anomalies is different from that of the normal data points. Koh, Steinhardt, and Liang [84] presented data poisoning attacks that outsmart data sanitization defenses for traditional anomaly detection, by nearest neighbors, training loss and singular value decomposition methods. The researchers divided the attacks into two groups: • High Sensitive: An anomaly detector usually considers points as anomalous when the point is far off from its closest neighbors. The anomaly detector cannot identify a specific point as abnormal if it is surrounded by other points, even if that tiny cluster of points are far off from remaining points. So, if an adversary/attacker concentrates poison points in a few anomalous locations, then the anomalous location is considered benign by the detector. • Low Sensitive : An anomaly detector drops all points away from the centroid by a particular distance. Whether the anomaly detector deems a provided point as abnormal does not vary much by addition or deletion of some points, until the centroid of data does not vary considerably.
Attackers can take advantage of this low sensitivity property of detectors and optimize the location of poisoned points such that it satisfies the constraints imposed by the defender.
Shafahi et al. [85] discussed how classification results can be manipulated just by injecting adversarial examples with correct labels. which is known as the clean-label attack. The clean-label attack is executed by changing the normal ("base") instance to reflect the features of another class, as shown in Figure 3. The Gmail image is marked with blue dots and lies on the feature space of the target dataset. This poisoned data is used for training and shifts the decision boundary, as shown in Figure 4.
Due to the shift, the target instance is classified as "base" instance. Here, the adversary tries to craft a poison instance such that it is indistinguishable from the base instance, i.e., the instance looks similar, and also minimizes the feature representation between the target and poison instances so that it triggers misclassification while training. This attack can be crafted using the optimization problem by means of the following equation: where b is the base instance, and t and p are the target and poison instances, respectively. The parameter β identifies the degree to which p appears to be a normal instance to the human expert.  Suciu et al. [86] presented a similar type of attack on neural networks, but with the constraint that at least 12.5% of every mini-batch of training data should have tainted examples.

Data Poisoning Defense Mechanisms
There are studies that propose potential defense mechanisms to resolve the problems related to the data poisoning attacks discussed thus far. Devising a generic defense strategy against all attacks is not possible. The defense strategies are specific to the attack and a defense scheme specific to an attack makes the system susceptible to a different kind of attack. Some advanced defense strategies include:

1.
Adversarial Training : The goal of adversarial training is to inject instances generated by the adversary into the training set to increase the strength of the model [87,88]. The defender follows the same strategy, by generating the crafted samples, using the brute force method, and training the model by feeding the clean and the generated instances. Adversarial training is suitable if the instances are crafted on the original model and not on a locally-trained surrogate model [89,90].

2.
Feature Squee zing: This defense strategy hardens the training models by diminishing the number of features and, hence, the complexity of data [91]. This, in turn, reduces the sensitivity of the data, which evades the tainted data marked by the adversary.

3.
Transferability blocking: The true defense mechanism against blackbox attacks is to obstruct the transferability of the adversarial samples. The transferability enables the usage of adversarial samples in different models trained on different datasets. Null labeling [92] is a procedure that blocks transferability, by introducing null labels into the training dataset, and trains the model to discard the adversarial samples as null labeled data. This approach does not reduce the accuracy of the model with normal data instances.

4.
MagNet: This scheme is used to arrest a range of blackbox attacks through the use of a detector and a reformer [93]. The detector identifies the differences between the normal and the tainted samples by measuring the distance between them with respect to a threshold. The reformer converts a tampered instance to a legitimate one by means of an autoencoder.

5.
Defense-GAN: To stave off both blackbox and whitebox attacks, the capability of General Adversarial Network (GAN) [94] is leveraged [95]. GAN uses a generator to construct the input images by minimizing the reconstruction error. The reconstructed images are fed to the system as input, where the genuine instances are closer to the generator than the tainted instances. Hence, the performance of the attack degrades. 6.
Local Intrinsic Dimensionality: Weerashinghe et al. [96] addressed resistance against data poisoning attack on SVM classifiers during training. They used Local Intrinsic Dimensionality (LID), a metric of computing dimension of local neighborhood subspace for each data instance. They also used K-LID approximation for each sample to find the likelihood ratio of K-LID values from the distribution of benign samples to that from tainted samples. Next, the function of the likelihood ratio is fitted to predict the likelihood ratio for the unseen data points' K-LID values. The technique showed stability against adversarial attacks on label flipping. 7.
Reject On Negative Impact (RONI): The functioning of the RONI technique is very similar to that of the Leave-One-Out (LOO) validation procedure [97]. Although effective, this technique is computationally expensive and may suffer from overfitting if the training dataset used by the algorithm is small compared to the number of features. RONI defense is not well suited for applications that involve deep learning architectures, as those applications would demand a larger training dataset [39]. In [98], a defensive mechanism was proposed based on the k-Nearest Neighbors technique, which recommends relabeling possible malicious data points based on the labels of their neighboring samples in the training dataset. However, this strategy fails to detect attacks in which the subsets of poisoning points are close. An outlier detection scheme was proposed in [99] for classification tasks. In this strategy, the outlier detectors for each class are trained with a small fraction of trusted data points. This strategy is effective in attack scenarios where the hacker does not model specific attack constraints. For example, if the training dataset is poisoned only by flipping the labels, then this strategy can detect those poisoned data points which are far from the genuine ones. Here, it is important to keep in mind that outlier detectors used in this technique need to first be trained on small curated training points that are known to be genuine [99].
In many studies, the defense strategies are for the time of filtering of data during anomaly detection (i.e., before the model is trained). Koh, Steinhardt, and Liang [84] considered data sanitization defenses of five different types, from the perspective of anomaly detection, each with respective anomaly detection parameters β and parametrarized scores S β which identify the degree of anomaly. D clean and D poison are the datasets for clean and poisoned instances D = D clean ∪ D poisin and β is derived from D.
(1) L-2 Defense: This type of defense discards the instances that are distant from the center of the corresponding class they belong to, from the perspective of the L-2 distance measure. The outlier detection parameter and parametrarized score for the L-2 defense are expressed as: (2) Slab Defense: Slab defense [81] draws the projections of the instances on the lines or planes joining the class centers and discards those that are too distant from the centers of the classes. Unlike the L-2 defense, this mechanism considers only the distances between the class centers as pertinent dimensions. The outlier detection parameter and parametrarized score for the slab defense are expressed as: where θ is the learning parameter that minimizes the training loss, x denotes the data point and y is the class.
(3) Loss Defense: Loss defense removes points that are not fitted well by the trained model on D. The feature dimensions are learned based on loss function l. The outlier detection parameter and parametrarized score for the loss defense are expressed as: S β (x, y) = l β (x|y) (4) SVD Defense : SVD defense is the mechanism that works on the basis of sub-space assumption [100]. In this defense mechanism the normal instances are assumed to lie in low-ranked sub-space while the tampered instances have components that are too large to fit into this sub-space. The outlier detection parameter and parametrarized score for the loss defense are expressed as: The term |M| k RSV is the matrix of S β (x, y) = | I − ββ T x | 2 right singular vector of data matrix d.
(5) K-NN Defense: The K-NN defense discards data instances that are distant from the K nearest neighbors. The outlier detection parameter and parametrarized score for the k-NN defense are expressed as: Koh, Steinhardt, and Liang [84] have tested these 5 types of data sanitization defenses on four types of datasets: The MNIST dataset [101], Dogfish [102], Enron spam detection [103] and the IMDB sentiment classification datasets [104]. The first two datasets are image datasets. The results showed that these defenses could still be evaded with concentrated attacks where the instances concentrated in a few locations appear to be normal. However, it was observed that L-2, slab and loss defenses still diminished the test error (which is exploited by the adversary to launch a data poisoning attack) considerably, compared to the SVD and k-NN defenses.
Peri et al. [105] proposed a defense mechanism resisting clean-label poison attacks, based on k-NN, and identified 99% of the poisoned instances, which were eventually discarded before model training. The authors claimed that this scheme, known as Deep K-NN, worked better than the schemes provided by [84], without reducing the model's performance.

Model Poisoning Attacks
Poisoning of models is more like a traditional cyberattack. If attackers breach the AI system, then either they can compromise the existing AI model with the poisoned one or they can execute "A man in the middle" attack [106] to have the wrong model downloaded, while transferring learning.
Model poisoning is generally done using Backdoored Neural Network (BadNet) attack [45]. BadNets are modified neural networks, in which the model is trained on clean and poisoned inputs. In this, the training mechanism is fully or partly outsourced to the adversary, who returns the model with secret backdoor inputs. Secret backdoor inputs are inputs added by the attacker which result in misclassification. The inputs are known only to the attacker. BadNet is categorized into two related classes:

1.
Outsource training attack, when training is outsourced, and 2.
Transfer learning attack, when a pre-trained model is outsourced and used.
In the following subsections, we also explore model poisoning attacks on the federated learning scenario, where the training of the model is distributed on multiple computing devices and the results of the training are aggregated from all the devices to form the final training model. Bhagoji et al. [41] classified the model poisoning attack strategies on federated learning scenarios as: (1) explicit boosting, and (2) alternating minimization.

Outsourced Training Attack
We want to train the parameters of a model, M. using the training data. We outsource the description of M to the trainer who sends the learned parameters back to us β M . Our trustability of the trainer depends on a privately held validation dataset, with a targeted accuracy, or on the service agreement between us and the trainer.
The objective of the adversary is to return a corrupted model with backdoored trained parameters β M . This is different from β M and either should not lower the validation accuracy or decrease the model accuracy of the inputs with a backdoor trigger. Thus, the training attack can be targeted or untargeted. In a targeted attack, the adversary switches the label of the outputs for specific inputs, while in an untargeted attack, the input of the backdoored property remains misclassified to degrade the overall model accuracy.   Figure 6 illustrates a special type of potential BadNet (i.e., BadNet with backdoor detector) which makes use of a parallel link to identify the backdoor trigger. It also uses a combining layer to produce misclassifications if the backdoor appears. This perturbed model would not impact the results on a cleaned dataset, so the user would not be able to identify if the model has been compromised. In terms of defense mechanisms, Backdoor attacks like BadNet happen when we use pre-trained models. So, the less pre-trained the model, the less the attack. However, today, almost all networks are built using pre-trained models.
To make the models robust against backdoor attacks, Gu et al. [42] proposed the following defense strategies: • Securely hosting and disseminating pre-trained models in virtual repositories that guarantee integrity, to preclude benevolent models from being manipulated. The security is characterized by the fact that virtual archives should have digital signatures of the trainer on the pre-trained models with the public key cryptosystem [43]. • Identifying backdoors in malevolently trained models acquired from an untrustworthy trainer by retraining or fine-tuning the untrusted model with some added computational cost [44,46]. These researchers considered fully outsourced training attacks. Another research [107], proposed a defense mechanism with an assumption that the user has access to both clean and backdoored instances.

Transfer Learning Attack
The objective of transfer learning is to save computation time, by transferring the knowledge of an already-trained model to the target model [45]. The models are stored in online repositories from where a user can download them for an AI/ML application. If the downloaded model, M cor , is a corrupted model, then, while transferring learning, the user generates his/her model and parameters based on M cor . In transfer learning attacks, we assume that the newly adapted model, M cor , and the uncorrupted model have the same input dimensions but differ in number of classes. Figure 7 compares a good network (left), that rightly classifies its input, to BadNet (right), that gives misclassifications but has the same architecture as the good network. Figure 8 describes the transfer learning attack setup with backdoor strengthening factor to enhance the impact of weights.
In terms of potential defense mechanisms, the obvious defense strategy is to obtain pre-trained models from trusted online sources, such as Caffe Model Zoo and Keras trained Model Library [108], where a secure cryptographic hashing algorithm (e.g., SHA-1) is used as a reference to verify the downloads. However, the researchers in [42] showed that downloaded BadNet from "secure" online model archives can still hold the backdoor property, even when the user re-trains the model to perform his/her tasks.  Wu et al. [109] devised methodologies to resolve transfer learning attacks related to misclassification. They proposed activation-based pruning [110] and developed the distilled differentiator, based on pruning. To augment strength against attacks, the ensemble construct from the differentiators is implemented. As the individual distilled differentiators are diverse, in activation-based pruning, different training examples promote divergence among the differentiators; hence, increasing the strength of ensemble models. Pruning changes the model structure and arrests the portability of attack from one system to the other [44,46]. Network pruning removes the connectives between the model and generates a sparse model from a dense network model. The sparsity helps in fine tuning the model and eventually discarding the virulence of the attacks. Comprehensive evaluations, based on classification accuracy, success rate, size of the models, and time for learning, regarding the defense strategies suggested by the authors, on image recognition showed the new models, with only five differentiators, to be invulnerable against more than 90% of adversarial inputs, with accuracy loss less than 10%.

Attack on Federated Learning
In the federated learning scenario, each and every individual device has its own model to train, securing the privacy of the data stored in that device [47]. Federated learning algorithms are susceptible to model poisoning if the owner of the device becomes malicious. Research [111,112] introduced a premise for federated learning, where a single adversary attacks the learning by changing the gradient updates to arbitrary values, instead of introducing the backdoor property into the model. The objective of the attacker is to obstruct the convergence of the execution of the distributed Stochastic Gradient Descent (SGD) algorithm. In a similar study, Bagdasaryan et al. [48] proposed a multi-agent framework, where multiple adversaries jointly conspired to replace the model during model covergence. Bhagoji et al. [41] worked on targeted misclassification by introducing a sequence of attacks induced by a single adversary: (1) explicit boosting, and (2) alternating minimization. The underlying algorithm is SGD.
• Explicit Boosting: The adversary updates the boosting steps to void the global aggregated effect of the individual models locally distributed over different devices. The attack is based on running of boosting steps of SGD until the attacker obtains the parameter weight vector, starting from the global weight, to minimize the training loss over the data and the class label. This enables the adversary to obtain the initial update, which is used to determine the final adversarial update. The final update is obtained by the product of the final adversarial update and the inverse of adversarial scaling (i.e., the boosting factor), so that the server cannot identify the adversarial effect. • Alternating Minimization: The authors in [45] showed that, in an explicit boosting attack, the malicious updates on boosting steps could not evade the potential defense related to measuring accuracy. Alternating minimization was introduced to exploit the fact that it is updates related only to the targeted class that need to be boosted. This strategy improves adversarial attack that can bypass the defense mechanism with the goal of minimizing training loss and boosting parameter updates for the adversarial goals and achieved a high success rate.
In terms of potential defense mechanisms, two typical strategies are deployed, depending on the nature of the attacks on federated learning: (1) robust aggregation methods, and (2) robust learning rate.
• Robust aggregation methods: These methods incorporate security into federated learning by exploring different statistical metrics that could replace the average (mean) statistic, while aggregating the effects of the models, such as trimmed mean, geometric median, coordinate-median, etc [47,111,[113][114][115][116]. Introducing the new statistic while aggregating has the primary objective of staving off attacks during model convergence.
Bernstein et al. [117] proposed a sign aggregation technique on the SGD algorithm, distributed over individual machines or devices. The devices interact with the server by communicating the signs of the gradients. The server aggregates the signs and sends this to the individual machines, which use it to update their model weights. The weight update rule can be expressed by the following equation: where ∆ i t is the weight update of the device i at time t. ∆ i t = w k t − w t .w t is the weight the server sent to the set of devices A t at time t and γ is the server learning rate.
This approach is robust against convergence attacks, but susceptible to backdoor attacks in federated learning scenarios.
In a recent study, [118] the authors modified the mean estimator of the aggregate by introducing weight-cutoff and addition of noise [119] during weight update to deter backdoor attacks. In this method, the server snips the weights when the L2 norm of a weight update surpasses a pre-specified threshold, and then aggregates the snipped weights, along with the noise, during aggregation of weights.
• Robust Learning Rate: Ozdayi, Katancioglu, and Gel [120] introduced the defense mechanism by making the model learning rate robust with a pre-specified boundary of malicious agents. With the help of the updated learning rate, the adversarial model weight approaches the direction of the genuine model weight. This work is an extension of the signed aggregation proposed in [117]. The authors proposed a parameter-learning threshold δ. The learning rate for the i-th dimension of the data can be represented as: The server weight update at time t + 1 is where γ δ is the overall learning rate, including all dimensions, and is the feature-wise product operation. ∆ k is the update on the gradient descent update of the k-th player in the system, and k may be the adversary or the regular user.

Model Inversion Attack
The model inversion attack is a way to reconstruct the training data, given the model parameters. This type of attack is a concern for privacy, because there are a growing number of online model repositories. Several studies related to this attack hve been under both the blackbox and whitebox settings. Yang et al. [121] discussed the model inversion attack in the blackbox setting, where the attacker wants to reconstruct an input sample from the confidence score vector determined by the target model. In their study, they demonstrated that it is possible to reconstruct specific input samples from a given model. They trained a model (inversion) on an auxiliary dataset, which functioned as the inverse of the given target model. Their model then took the confidence scores of the target model as input and tried to reconstruct the original input data. In their study, they also demonstrated that their inversion model showed substantial improvement over previously proposed models. On the other hand, in a whitebox setting, Fredrikson et al. [122] proposed a model inversion attack that produces only a representative sample of a training data sample, instead of reconstructing a specific input sample, using the confidence score vector determined by the target model. Several related studies were proposed to infer sensitive attributes [122][123][124][125] or statistical information [126] about the training data by developing an inversion model. Hitaj et al. [71] explored inversion attacks in federated learning where the attacker had whitebox access to the model.
Several defense strategies against the model inversion attack have been explored that include L2 Regularizer [49], Dropout and Model Staking [50], MemGuard [51], and Differential privacy [52]. These defense mechanisms are also well-known for reducing overfitting in the training of deep neural network models.

Model Extraction Attack
A machine learning model extraction attack arises when an attacker obtains blackbox access to the target model and is successful in learning another model that closely resembles. or is exactly the same as, the target model. Reith et al. [54] discussed model extraction against the support vector regression model. Juuti et al. [127] explored neural networks and showed an attack, in which an adversary generates queries for DNNs with simple architectures. Wang et al., in [128], proposed model extraction attacks for stealing hyperparameters against a simple architecture similar to a neural network with three layers. The most elegant attack, in comparison to the others, was shown in [129]. They showed that it is possible to extract a model with higher accuracy than the original model. Using distillation, which is a technique for model compression, the authors in [130,131], executed model extraction attacks against DNNs and CNNs for image classification.
To defend against model extraction attacks, the authors in [53,132,133] proposed either hiding or adding noises to the output probabilities, while keeping the class label of the instances intact. However, such approaches are not very effective in label-based extraction attacks. Several others have proposed monitoring the queries and differentiating suspicious queries from others by analyzing the input distribution or the output entropy [127,134].

Inference Attack
Machine learning models have a tendency to leak information about the individual data records on which they were trained. Shokri et al. [49] discussed the membership inference attack, where one can determine if the data record is part of the model's training dataset or not, given the data record and blackbox access to the model. According to them, this is a concern for privacy breach. If the advisory can learn if the record was used as part of the training, from the model, then such a model is considered to be leaking information. The concern is paramount, as such a privacy beach not only affects a single observation, but the entire population, due to high correlation between the covered and the uncovered dataset [135]. This happens particularly when the model is based on statistical facts about the population.
Studies in [136][137][138] focused on attribute inference attacks. Here an attacker gets access to a set of data about a target user, which is mostly public in nature, and aims to infer the private information of the target user. In this case, the attacker first collects information from users who are willing to disclose it in public, and then uses the information as a training dataset to learn a machine learning classifier which can take a user's public data as input and predict the user's private attribute values.
In terms of potential defense mechanisms, methods proposed in [55,139] leveraged heuristic correlations between the records of the public data and attribute values to defend against attribute inference attacks. They proposed modifying the identified k entries that have large correlations with the attribute values to any given target users. Here k is used to control the privacy-utility trade off. This addresses the membership inference attack.

Conclusions
Using an extensive survey of the literature, this research addresses two research questions regarding attacks on AI systems and their potential defense mechanisms. RQ1: What are the cyberattacks that AI systems can be subjected to?
To answer this question, we discussed different categories of intentional and unintentional failures, along with the details of poisoning attacks on data and machine learning models. We also introduced backdoored neural network (discussing it from the perspective of research carried out on outsourced training attacks, transfer learning attack and federated learning attacks), model inversion, model extraction and inference attacks.
RQ2: Can the attacks on AI systems be organized into a taxonomy, to better understand how the vulnerabilities manifest themselves during system development?
Upon reviewing the literature related to attacks on AI systems, it was evident that, at different stages of the AI/ML pipeline development, vulnerabilities manifest; thus, providing an opportunity to launch attacks on the AI system. Table 1 and Figure 1 organize the AI attacks into a taxonomy, to better understand how vulnerabilities manifest and how attacks can be launched during the entire system development process.
RQ3: What are possible defense mechanisms to defend AI systems from cyberattacks? While addressing the second research question, we reviewed multiple state of the art methods that are used as potential defense mechanisms for each type of attack.
RQ4: Is it possible to device a generic defense mechanism against all kinds of AI attacks? Based on the literature review of cyberattacks on AI systems. it is clearly evident that there is no single. or generic, defense mechanism that can address diverse attacks on AI systems. Vulnerabilities that manifest in AI systems are more specific to the system design and its composition. Therefore, a defense mechanism has to be tailored, or designed, in such a way that it can suit the specific characteristics of the system. This survey sheds light on the different types of cybersecurity attacks and their corresponding defense mechanisms in a detailed and comprehensive manner. Growing threats and attacks in emerging technologies, such as social media, cloud computing, AI/ML systems, data pipelines and other critical infrastructures, often manifest in different forms. It is worth noting that it is challenging to capture all patterns of threats and attacks. Therefore, this survey attempted to capture a common set of general threat and attack patterns that are specifically targeted towards AI/ML systems. Organizing this body of knowledge. from the perspective of an AI system's life cycle, can be useful for software engineering teams when designing and developing intelligent systems. In addition, this survey offers a profound benefit to the research community focused on analyzing the cybersecurity of AI systems. Researchers can implement and replicate these attacks on an AI system, systematically apply defenses against these attacks, understand the trade offs that arise from using defense mechanisms, and create a catalog of patterns or tactics for designing trustworthy AI systems.