Con-Detect: Detecting Adversarially Perturbed Natural Language Inputs to Deep Classifiers Through Holistic Analysis

Deep Learning (DL) algorithms have shown remarkable performance in many Natural Language Processing (NLP) tasks such as language-to-language translation, spam filtering, fake-news detection, and reading comprehension. However, research has shown that the adversarial vulnerabilities of deep learning networks manifest themselves when DL is used for NLP tasks. Most mitigation techniques proposed to date are supervised, relying on adversarial retraining to improve robustness, which is impractical. This work introduces a novel, unsupervised methodology for detecting adversarial inputs to NLP classifiers. In summary, we note that minimally perturbing an input to change a model's output, a major strength of adversarial attacks, is also a weakness that leaves unique statistical marks reflected in the cumulative contribution scores of the input. In particular, we show that the cumulative contribution score, called the CF-score, of adversarial inputs is generally greater than that of clean inputs. We thus propose Con-Detect, a Contribution-based Detection method, for detecting adversarial attacks against NLP classifiers. Con-Detect can be deployed with any classifier without retraining it. We experiment with multiple attackers (Text-bugger, Text-fooler, PWWS) on several architectures (MLP, CNN, LSTM, Hybrid CNN-RNN, BERT) trained for different classification tasks (IMDB sentiment classification, fake-news classification, AG news topic classification) under different threat models (Con-Detect-blind, Con-Detect-aware, and Con-Detect-adaptive attacks), and show that Con-Detect can reduce the attack success rate (ASR) of different attacks from 100% to as low as 0% in the best cases and ≈70% in the worst case. Even in the worst case, we note a 100% increase in the required number of queries and a 50% increase in the number of words perturbed, suggesting that Con-Detect is hard to evade.


I. INTRODUCTION
Machine Learning (ML) techniques, particularly Deep Neural Networks (DNNs), have found applications in autonomous driving [1], [2], medical image analysis [3], healthcare, and IoT devices due to their ability to generalize to complex tasks requiring intelligent decision making (e.g., classification, generation, and regression). One such example is Natural Language Processing (NLP), which exists in myriad forms such as spam filtering, emotion recognition, language-to-language translation, sentence generation, and question answering, to name a few. However, DNNs are over-dependent on large training data, which makes them vulnerable to attacks at both the training [4] and the post-training stages [1], [5].
Recent works show that slight perturbations to an input can cause DNNs to misclassify these inputs [6]-[8]. Such a perturbation is known as an adversarial attack, and DNNs are fooled by these attacks in a range of applications in the image processing [1], speech recognition [9], and NLP domains [10]. This work particularly focuses on adversarial attacks on NLP classifiers. Morris et al. [11] note that adversarial attacks on NLP models generally identify and perturb the most contributing words in an input. The authors thus unify 16 different state-of-the-art attacks under one framework, the TextAttack library, to benchmark the robustness of NLP models.
Researchers are exploring two major strategies to counter the adversarial threat: detecting adversarial inputs, and defending the models by correctly classifying adversarial inputs. Although defending the model is generally preferable, detecting adversarial inputs in itself is quite challenging [5]. To illustrate this, Carlini and Wagner [5] were able to bypass ten different adversarial detection mechanisms. Most adversarial defenses proposed to date require adversarial retraining of a model to improve its robustness. However, adversarial retraining assumes knowledge of the attack algorithm beforehand and incurs large computational and time overheads. Pruthi et al. [12] defend NLP classifiers by correcting misspelled words in the input. However, such defense methods fail against more natural BERT-based attacks [13], [14]. We identify a dire need for exploring new mitigation techniques, both detection and defense methods, to counter adversarial attacks.
In this work, we explore a novel scheme for detecting adversarial inputs based on their contribution scores. Our methodology is motivated by our previous work, in which we analyzed the robustness of NLP classifiers, specifically fake-news detectors, under several configurations and settings [10]. In particular, in our previous work [10], we noted that the contributions of words in adversarial inputs follow a different distribution than those in clean inputs. In this work, we study these signatures more comprehensively based on three commonly used contribution metrics (word removal (see Fig. 1), word substitution, and word gradients) and show that adversarial inputs exhibit significantly higher contribution scores than clean inputs (see Fig. 2). We identify that this behavior is a natural consequence of an attacker's stealthiness (which works by changing the decision of a model while perturbing only a small fraction of the input) and show that this property of adversarial examples, usually considered a strength, is also an inherent weakness, particularly for NLP classification tasks.

Fig. 1: Our methodology for computing the cumulative contribution score, C_F(X), for an input, X, using the word-removal method. For each word in an input, we first measure its contribution as the decrease in the classifier's output before and after removing that word from the input. Absolute contributions are then added to get the cumulative contribution score, C_F(X), for X. (Original decision: fake; adversarial decision: real.)

Building upon the above analysis, we propose Con-Detect, an adversarial detection mechanism that computes the contribution score, C_F, of an input and detects adversarial inputs at runtime based on a pre-defined alarm threshold. Unlike adversarial training, Con-Detect is efficient and can be integrated with any NLP classifier without retraining it, which makes Con-Detect an optimal choice in practical scenarios where retraining large models is generally infeasible because of high computational overheads. Through extensive empirical analysis, we show the efficacy of Con-Detect against the strongest state-of-the-art attacks (Text-bugger, Text-fooler, PWWS) for models of varying architectures (MLP, CNN, LSTM, Hybrid-CNN-RNN, BERT) trained for a variety of binary- and multi-classification scenarios.
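The word-removal scoring of Fig. 1 can be sketched in a few lines of Python. Here `classifier` is a hypothetical callable returning the probability of the originally predicted class for a token list, and `toy_classifier` is purely illustrative, not the paper's model:

```python
# Sketch of the word-removal cumulative contribution score (Fig. 1), assuming
# `classifier(tokens)` returns the probability of the originally predicted class.

def cf_score_removal(tokens, classifier):
    """Sum of absolute probability drops when each word is removed in turn."""
    base = classifier(tokens)
    score = 0.0
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]   # input with word i removed
        score += abs(base - classifier(reduced))
    return score

# Toy stand-in classifier: probability rises with the count of a trigger word.
def toy_classifier(tokens):
    return min(1.0, 0.2 + 0.3 * tokens.count("fake"))
```

With the toy classifier, only removing the trigger word changes the output, so the single high-contribution word dominates the cumulative score, mirroring the intuition in Fig. 1.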
Our contributions are summarized below.
• To the best of our knowledge, we are the first to identify that adversarial inputs have higher contribution scores than clean inputs.
• We are the first to associate this behavior with the attack's stealth, which refers to the attack algorithm's attempt to significantly impact a classifier's output while minimally perturbing the input in a bid to evade detection. Since this stealthy behavior is a common phenomenon inherent in several attack algorithms, our findings generalize to a number of attack methods.
• We propose Con-Detect (Contribution-based Detection), a method for detecting adversarial inputs at runtime. Con-Detect can be incorporated with any classifier without retraining it.

A. Threat Models
Current literature on text-based adversarial attacks and defenses assumes two different threat models [11], detailed as follows: 1) White-box: the white-box threat model assumes a superior attacker with complete knowledge of the model architecture and its parameters. 2) Black-box: the black-box threat model assumes a weak attacker with no knowledge of the model architecture or its parameters. Each of these threat models has unique advantages over the other. The white-box threat model favors an attacker over a defender and is thus generally preferred for evaluating adversarial defenses [15], [16]. However, white-box attacks fail against defenses that hide or suppress the gradients of a model. This problem is well known as gradient obfuscation and is exhibited by many adversarial defenses and detectors [17]. For such cases, Athalye et al. [17] recommend using black-box attacks to evaluate the robustness of a model. Due to the use of embedding layers, deep NLP classifiers naturally exhibit gradient obfuscation: an embedding layer simply maps an input token (word) to a high-dimensional numeric space using a dictionary lookup and thus has no gradients. This is also one of the reasons for the widespread use of the black-box threat model for evaluating the robustness of NLP models [10], [11], [18]-[20].

B. Adversarial Attacks on NLP Classifiers
We specifically focus on four recent attacks, motivated by their use in similar works [20], recommendations [10], strength, efficiency, relevance, and recency [11]. Consider an input, X = {x_1, x_2, ..., x_i, ..., x_n}, containing n words, and a classifier, F. An attacker aims to perturb X to get X_p = (X + ΔX) such that argmax F(X) ≠ argmax F(X_p), where ΔX is the perturbation. To avoid detection, the attacker minimizes ΔX (perturbing just a few words) and retains the grammatical and semantic structure of the sentence by identifying and perturbing the topmost contributing words in X until X_p is misclassified.
Text-bugger [18] uses the Jacobian matrix to identify important words in an input sequence and perturbs them using random character substitutions and synonym replacements. The Text-fooler [21] and BAE [14] attacks identify important words by measuring the change in the output when a word is removed from the input (the word-removal method). Perturbation is achieved by synonym replacement using a pre-defined embedding in Text-fooler and by the BERT masked language model in the BAE attack. PWWS [22] measures the change in the output when a word is replaced by an "unknown" token (the word-substitution method) and perturbs words via synonym replacement using WordNet.
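All four attacks share the same greedy loop: rank words by importance, then substitute them until the label flips. A minimal, library-free skeleton of that shared loop follows; the `importance`, `candidates`, and `target_flip` callables are hypothetical stand-ins for each attack's specific choices (Jacobian ranking, word removal, synonym sets, etc.):

```python
# Generic greedy word-importance attack skeleton (a sketch, not any one
# attack's exact algorithm). `predict` maps tokens to a model output,
# `importance` scores a word's contribution, `candidates` proposes
# replacements, and `target_flip` checks whether the decision has changed.

def greedy_attack(tokens, predict, importance, candidates, target_flip):
    order = sorted(range(len(tokens)), key=lambda i: -importance(tokens, i))
    adv = list(tokens)
    for i in order:                       # perturb most important words first
        for sub in candidates(adv[i]):
            trial = adv[:i] + [sub] + adv[i + 1:]
            if target_flip(predict(trial)):
                return trial              # stop at the first successful flip
        # real attacks keep the best partial substitution here (omitted)
    return None                           # attack failed
```

Because the loop stops as soon as the label flips, only the few highest-importance words get perturbed, which is exactly the stealth property Con-Detect later exploits.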

C. Adversarial Defenses Against Adversarial NLP Attacks
Limited work has been done on mitigating adversarial attacks. Most defense techniques rely on adversarial training [18], [21]-[27], which presumes the attack methods and requires retraining the classifier, limiting its practicality. Other techniques, which analyze lexical features such as spelling and grammar, are rendered ineffective against more natural BERT-based attacks [14], [21].
Recently, Wang et al. [28] proposed TextFirewall, which uses impact scores of individual words to detect adversarial inputs. We notice two similarities with their work: the basic concept of the impact score used by TextFirewall resembles our C_F-scores, and TextFirewall detects adversarial inputs at runtime without retraining the classifier. However, we believe our work is significantly different in the following ways.
• Unlike Con-Detect, TextFirewall is only effective for sentiment classification, where a word may either be positive or negative [28].

Fig. 3: Con-Detect methodology; given an input sequence, X, we compute C_F(X) by adding individual word contributions, C_F(x_i), for all x_i ∈ X. If C_F(X) is greater than the alarm threshold, t, the output of the sigmoid, P_A(X) > 0.5, indicating that X is adversarial.
• TextFirewall is only effective against weak attackers who can perturb at most 2.1% of the words in an input sequence (equal to 5 words).
• TextFirewall ignores neutral words, such as "but", "because", and "when". Although ignorable in sentiment classification, such words play an important role in other tasks like fake-news detection and topic classification.
• The impact score of Wang et al. [28] measures only the local impacts of words without considering the complete input sequence.
• TextFirewall has only been tested on the CNN and LSTM architectures [28].
III. CON-DETECT: CONTRIBUTION-BASED DETECTION MECHANISM FOR ADVERSARIAL ATTACKS

In this section, we first introduce Con-Detect, a novel contribution-score-based scheme for detecting adversarial attacks at runtime, and then specifically highlight our method for computing cumulative contribution scores. Finally, we illustrate the working of the Con-Detect methodology by showing it in action in a case study using a Hybrid-CNN-RNN fake-news classifier [29].

A. Con-Detect Methodology
We hypothesize that adversarial inputs have higher cumulative contribution scores than clean inputs. We formalize and validate this hypothesis in Sections III-B and III-C. Here, we propose Con-Detect, a novel detection method that can capture adversarial samples at runtime based on their contribution scores.
1) Given an input sequence of n words, we compute its cumulative contribution score, C_F(X) (Section III-B). 2) The computed C_F(X) is then passed through a filter with a pre-defined alarm threshold, t. The filter marks the input as adversarial if the score is higher than the alarm threshold, and as clean otherwise. Following some prior works [16], [30], we use a sigmoid function, σ, to implement the filter.
Fig. 4: Our methodology for computing the contribution scores, C_F(x_i), for a word, x_i, in the input sequence, X, where F is the classifier. (a) The substitution score is the absolute change in the probability when a word is substituted with an unknown token. (b) The removal score is the absolute change in the probability when a word is removed from the input.
where σ(x) = 1/(1 + e^{-x}), while the alarm threshold, t, is a hyper-parameter that needs to be tuned. We implement Con-Detect as a custom layer incorporated with the base classifier, as shown in Fig. 3. Unlike other state-of-the-art methods, Con-Detect does not require retraining the classifier and does not presume the attack algorithm during training. Additionally, Con-Detect is scalable in that it can be incorporated with different classifiers of varying architectures. These factors make Con-Detect an attractive choice for practical scenarios.
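The filter described above can be sketched directly: the difference between the C_F-score and the alarm threshold t is squashed by the sigmoid, so P_A(X) > 0.5 exactly when C_F(X) > t. Any additional scaling inside the sigmoid would be an implementation choice not specified here:

```python
import math

# Sketch of the Con-Detect sigmoid filter over a cumulative contribution score.

def p_adversarial(cf_score, t):
    """Probability P_A(X) that the input is adversarial."""
    return 1.0 / (1.0 + math.exp(-(cf_score - t)))

def is_adversarial(cf_score, t):
    """Flag the input when P_A(X) exceeds 0.5, i.e., when C_F(X) > t."""
    return p_adversarial(cf_score, t) > 0.5
```

For example, with the threshold t = 0.378 reported later for the Kaggle classifier (Section V-B), an input whose C_F-score is 1.2 yields P_A(X) > 0.5 and is flagged.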

B. Method of Cumulative Contribution Score
Adversarial attacks aim to change a model's decision while minimally perturbing an input, which compels an attacker to choose highly contributing perturbations. We note that this property can be a major weakness of adversarial attacks, specifically for NLP classification. To explain this, consider an input sequence, X, originally classified by F into the class C_org. An attacker perturbs X into X_p such that X_p gets misclassified into the class C_adv. We divide X into two distinct sets, X_A and X_B, such that (X_A ∪ X_B = X) ∧ (X_A ∩ X_B = ∅), where X_A contains words that remain unperturbed while X_B comprises words that the attacker replaces with X_Δ while attacking. We make the following key observations: 1) In our previous work [10], we note that most words in a clean input positively contribute to a classifier's output. Additionally, X_B has a higher expected contribution to the output than X_A, which is why an attacker chooses to perturb X_B. Thus, for a clean input, the expected contribution of X_B to C_org exceeds that of X_A (Eq. 3), where C_F(x|C_org) denotes the contribution of the word x to the output class C_org. 2) As X_A suggests that F output the class C_org (Eq. 3), X_A expectedly has negative contributions to the class C_adv for an adversarial input.
3) Finally, for a successful attack, X_Δ expectedly has a higher contribution than X_B to overcome the negative effects of X_A (Eq. 4). Stated simply, the expected sum of absolute contributions of words in an input is larger for an adversarial input than for a clean input, given that the attack is stealthy. This signature is a natural consequence of an attacker's inconspicuousness. However, as mentioned previously, estimating these contributions is challenging due to the gradient obfuscation in NLP models [17]. Fortunately, we note that word-substitution, word-removal, and saliency scores are commonly used by attackers and by several explainability methods [31], [32] to estimate a word's contribution to a classifier's output. Detailed methodologies are given below. 1) Word Substitution: For each word, x_i, in an input, X, we replace the word with an "unknown" token, and the absolute change in the probability of the original class indicates the contribution, C_F(x_i|C). Individual word contributions are then added to obtain the cumulative contribution score, called the C_F-substitution score, for X. A detailed methodology is given in Algorithm 1 and illustrated in Fig. 4(a).
2) Word Removal: For each x_i in X, we remove x_i from X, and the absolute change in the probability of the original class measures the contribution, C_F(x_i|C). As before, we add individual word contributions to obtain the cumulative contribution score, called the C_F-removal score. Details are given in Algorithm 1 and illustrated in Fig. 4(b).
3) Word Gradients: For each x_i in X, we compute the gradient of the model's output with respect to the embedding layer and use it as the contribution score, C_F(x_i|C). Individual word contributions are then added to obtain the C_F-gradient score for X. See Algorithm 1 for details.
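As a sketch, the C_F-substitution score differs from the removal variant only in replacing each word with the unknown token rather than deleting it. `classifier` is again a hypothetical callable returning the original-class probability, and the toy model is illustrative only, not the paper's classifier:

```python
# Sketch of the C_F-substitution score from Algorithm 1 / Fig. 4(a),
# assuming `classifier(tokens)` returns the original-class probability.

UNK = "<unk>"

def cf_score_substitution(tokens, classifier):
    """Sum of absolute probability changes when each word is masked out."""
    base = classifier(tokens)
    score = 0.0
    for i in range(len(tokens)):
        masked = tokens[:i] + [UNK] + tokens[i + 1:]  # substitute word i
        score += abs(base - classifier(masked))
    return score

# Toy stand-in classifier: original-class probability driven by a trigger word.
def toy_classifier(tokens):
    return min(1.0, 0.2 + 0.3 * tokens.count("fake"))
```

The gradient variant would instead backpropagate to the embedding outputs (e.g., via TensorFlow's `tf.GradientTape` in the paper's stack) rather than re-querying the model per word.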

C. An Illustrative Case Study of Con-Detect's Cumulative Contribution Score Methodology
In the previous section, we hypothesized that adversarial inputs have higher C_F-scores than clean inputs. To validate this hypothesis, we attack the state-of-the-art Hybrid-CNN-RNN [29] fake-news classifier trained on the Kaggle fake-news dataset. Our attack methods include three state-of-the-art black-box attacks [10], [20] (the Text-fooler [21], Text-bugger [18], and PWWS [22] attacks) from the Text-Attack library. Our experimental setup for this analysis is shown in Fig. 7. Word Substitution. Fig. 5 shows the histograms of C_F-substitution scores for clean and adversarial inputs generated by the three attacks. Adversarial inputs are clearly distinguishable from clean inputs due to their higher C_F-substitution scores, which is consistent with our intuitions.
Being stealthy (perturbing only a fraction of the input) is considered a strength of adversarial attacks [11]. However, this forces an attacker to choose highly impacting perturbations, which raises the C_F-scores. Interestingly, the C_F-scores are similar for different attacks despite major differences in their algorithms, suggesting that Con-Detect is attack-agnostic. Our experiments establish that the iconic stealthiness of adversarial attacks imprints detectable signatures on the inputs of NLP classifiers. Word Removal. Fig. 6 shows histograms of C_F-removal scores for the clean and adversarial inputs. For the same reasons discussed previously, adversarial inputs show higher C_F-scores, making them distinguishable from clean inputs.
We note that the C_F-removal score cannot effectively detect adversarial inputs for the IMDB dataset, as illustrated by the closeness of the clean and adversarial scores in Fig. 6(b). Word Gradients. A similar analysis is performed as with the previous two methods. However, we note that the C_F-gradient score is not effective for detecting adversarial inputs. Detailed investigations are left for future work. In the subsequent experiments, we use the word-substitution and word-removal methods for evaluating Con-Detect.
IV. EXPERIMENTAL SETUP

Our experimental setup is given in Fig. 8. A clean input is perturbed using the Text-Attack library and provided to the Con-Detect-classifier, which gives two outputs, F(X) and P_A(X).

A. Threat Models
Due to the gradient obfuscation problem of NLP classifiers discussed in Section II-A, our experiments assume the black-box threat model. We further consider three different scenarios for the black-box threat model, as shown in Fig. 9.

Fig. 9: The three considered threat models: the Con-Detect-blind setting assumes that the attacker is unaware of Con-Detect; the Con-Detect-aware threat model depicts a scenario where an attacker knows that Con-Detect is deployed; and the Con-Detect-adaptive setting assumes an attacker who adaptively modifies the attack algorithm to fool Con-Detect.
• The Con-Detect-blind threat model assumes that the attacker is unaware of Con-Detect. The attacker only tries to fool the classifier in this case.
• The Con-Detect-aware threat model assumes that the attacker knows about Con-Detect and aims to fool both the classifier and the detector simultaneously.
• The Con-Detect-adaptive threat model assumes that the attacker is aware of Con-Detect and intelligently modifies the attack algorithm to fool Con-Detect.
Methodology of the Con-Detect-adaptive attack. Following the convention of adversarial researchers, we assume an adaptive attacker who modifies the attack algorithm to specifically fool Con-Detect. The proposed modification can be easily extended to all of the black-box attacks used in this paper. Noting that Con-Detect makes it harder for a conventional attacker to estimate the importance of a word, due to the mismatch between the importance scores calculated for Con-Detect and for the classifier, our adaptive attacker unifies the outputs of the classifier and Con-Detect in a wrapper function such that changing the wrapped output ensures that both the classifier and the detector are fooled.
To explain the wrapper function, we assume a binary classification task, e.g., fake-news detection, where an input sequence may either be real or fake. Consider an input, X, originally marked as unperturbed and fake by Con-Detect and the classifier, respectively. The wrapped output is then defined in Eq. 6, where P_A(X) denotes the probability of X being adversarial, and P(real|X) and P(fake|X) denote the probabilities of X being real and fake, respectively. To change the output decision of the Con-Detect-classifier from fake to real, the condition in Eq. 7 must be met. Fig. 10 illustrates Eq. 7 (settings: the input, X, is originally fake and unperturbed); when Eq. 7 is satisfied, P(real|X) > 0.5 and P_A(X) < 0.5, indicating that both the classifier and the detector have been fooled. Although Eq. 6 assumes binary classification, our approach extends simply to the multi-classification scenario via Eq. 8, where O_H is the one-hot encoded representation of argmax F(X) for M classes.
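The adaptive wrapper can be sketched for the binary case. The paper's exact Eq. 6 is not reproduced in this text, so the product form below is an assumption on our part; what it does preserve is the stated property that a high wrapped score requires fooling the classifier and the detector simultaneously (Eq. 7):

```python
# Hypothetical sketch of the adaptive attacker's wrapper (binary case).
# The product form is an assumption; Eq. 6 in the paper may differ.

def wrapped_output(p_real, p_adv):
    """Large only when the classifier says 'real' AND the detector says clean."""
    return p_real * (1.0 - p_adv)

def attack_succeeds(p_real, p_adv):
    """The condition of Eq. 7: misclassified as real and undetected."""
    return p_real > 0.5 and p_adv < 0.5
```

An attacker maximizing `wrapped_output` is pushed toward perturbations that flip the classifier without inflating the C_F-score, which is exactly the joint objective described above.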

B. Datasets
We evaluate our methodology on three different datasets: the Kaggle fake-news dataset, the IMDB dataset, and the AG news dataset. All three datasets are open-source, comprehensive, and a common choice in similar recent works [10], [20], [29]. Kaggle fake-news dataset. The dataset, accessible at the Kaggle website, is a binary classification dataset with 20,800 samples in the training set and 5,200 samples in the test set, for a total of 26,000 samples. Each sample has the fields id, title, author, text, and label. The label may either be 1 or 0, denoting fake or real content, respectively. We use the text field (a string of multiple sentences of an article) as the input and the one-hot encoded label as the output of our classifier. AG news dataset. Each sample has the fields title, description, and class. Following previous conventions [20], we concatenate the title and description fields to use as input to our classifier. The one-hot encoded class field is used as the output.

C. Model Architectures and Frameworks
We use different model architectures for classification in our experiments. Specifically, in addition to the conventionally used Long Short-Term Memory (LSTM) architecture, we experiment with simple multi-layer perceptrons (MLPs), convolutional neural networks (CNNs), and the more recent Hybrid-CNN-RNN [29] and BERT architectures. CNNs have already shown their worth as NLP classifiers [33], [34]. Additionally, we note a general trend towards using hybrid CNN-RNN techniques for NLP classification [29], [35], [36]. We choose the architecture proposed by Nasir et al. [29] due to its recency and generalizability [10], [29]. We fine-tune BERT for our tasks using a pre-trained tokenizer. For all other cases, we train classifiers from scratch using a Keras tokenizer and initialize the embedding layer with the GloVe word embedding [29].
While evaluating our approach against Con-Detect-adaptive attacks, we only use Hybrid-CNN-RNN classifiers trained on the different datasets. Hybrid-CNN-RNN is recent [29], computationally more efficient, and significantly faster to attack than BERT-based classifiers, which considerably reduces the time required for generating the results.
Our implementation of the models and the detection methodology uses the state-of-the-art open-source libraries TensorFlow and Keras. For attacking, we use the attack implementations provided in the state-of-the-art Text-Attack library [11]. In particular, we use three different attacks (Text-bugger [18], Text-fooler [21], and PWWS [22]) in our evaluations, following recent practices [10], [20]. In our previous work, we noted that although BERT-based adversarial attacks can generate more natural adversarial perturbations, they achieve lower attack success rates than the other attack methods, i.e., Text-fooler, Text-bugger, and PWWS. We also show later in this paper that Con-Detect is effective against BERT-based adversarial attacks.

V. RESULTS
In this section, we present a thorough performance evaluation by first describing the accuracy of the classifiers (§V-A) and then considering the effect of the alarm threshold on the performance of Con-Detect (§V-B). We consider different threat models and provide results for the Con-Detect-blind threat model (§V-C), the Con-Detect-aware threat model (§V-D), and the Con-Detect-adaptive threat model (§V-E). Finally, an analysis of the computational overhead of Con-Detect is provided in Section V-F.
To evaluate our methods, we use different metrics adopted in many recent works for robustness evaluation, either individually [20] or in combination [1], [10].
• Attack Success Rate (ASR). The ratio of adversarial inputs incorrectly classified by a classifier to the total number of adversarial inputs. A lower ASR suggests a more robust classifier. (Note that ASR is the complement of accuracy: ASR% = 100 − Accuracy%.)
• Number of queries. The number of times an attacker asks the classifier to make predictions. The higher the required number of queries, the more robust the classifier.
• Percent words perturbed. The percentage of words that an attacker perturbs in an input sequence. For a model to be robust, this percentage should be high.
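The three metrics above can be computed directly from per-sample attack logs; the sketch below uses field names of our own choosing (a boolean success flag, a query count, and a perturbed-word fraction per attacked input):

```python
# Simple aggregation of the three robustness metrics over attack logs.

def attack_success_rate(fooled):
    """Percent of adversarial inputs that flip the classifier's decision."""
    return 100.0 * sum(fooled) / len(fooled)

def mean_queries(queries):
    """Average number of model queries per attacked input."""
    return sum(queries) / len(queries)

def mean_percent_perturbed(fractions):
    """Average fraction of words perturbed, expressed as a percentage."""
    return 100.0 * sum(fractions) / len(fractions)
```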

A. Accuracy of the classifiers
All our classifiers give an accuracy above 90% for the Kaggle and AG news datasets, and around 80% for the IMDB dataset. Fig. 11 compares the vanilla classifiers' accuracy before and after the attack, assuming the black-box threat model. We observe that, in most cases, the accuracy drops to under 20% when a classifier is under attack, suggesting that most classifiers are fragile. However, we find that CNNs are relatively more robust, as already noted in previous work [10].

B. Effect of the Alarm Threshold on Con-Detect
We analyze the effect of changing t on the efficacy of Con-Detect, measured using the F1-score. The base model is the Hybrid-CNN-RNN fake-news classifier trained on the Kaggle fake-news dataset, and the threat model assumed is the Con-Detect-blind threat model. The results of our experiments are provided in Fig. 12 and Fig. 13. Each of these figures reports two different results. 1) The black line records the F1-scores of Con-Detect as the alarm threshold, t, increases from zero to the maximum. 2) The vertical green dashed line marks the best scenario, determined by the largest F1-score. The dashed line is accompanied by a set of values (the best F1-score, the optimal alarm threshold, and the false-negative and false-positive rates) detailing the best scenario. For example, in Fig. 12(a), the best results are achieved when t = 0.378. At this value of t, the F1-score is 94.8% and the false-positive rate is 7.5%.
For each of the two estimation methods, we observe that the F1-score first increases rapidly, reaching a maximum value, and then decreases as the alarm threshold increases. For a too-small t, a high false-positive rate causes a low F1-score; i.e., most clean samples are incorrectly marked as adversarially perturbed by our detector. When t is too large, the detector cannot detect adversarial inputs, which increases the false-negative rate, and thus the F1-score decreases. Interestingly, the rapid increase in the F1-score for small values of t occurs because the C_F-scores of clean inputs are concentrated near zero (Figs. 5 and 6). Now, how should the alarm threshold be chosen automatically? In Figs. 12 and 13, we note that the false-positive rate can serve as a good indicator of the optimal t. In particular, the figures show that, for most cases, the best value of t lies between false-positive rates of 7% and 20%. Although the precise allowable false-positive rate in practical applications varies from case to case, as a general rule we propose using a false-positive rate of 15% as an indicator of the best t.
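The rule of thumb above (target roughly a 15% false-positive rate on clean data) can be turned into a simple threshold-selection procedure. This is our own sketch, not the paper's tuning code; it only needs the C_F-scores of a held-out set of known-clean inputs:

```python
# Pick the smallest alarm threshold t whose false-positive rate on clean
# C_F-scores does not exceed the target (default 15%, per the rule of thumb).

def choose_threshold(clean_scores, target_fpr=0.15):
    for t in sorted(clean_scores):
        fpr = sum(s > t for s in clean_scores) / len(clean_scores)
        if fpr <= target_fpr:
            return t
    return max(clean_scores)
```

Choosing the smallest such t keeps the false-negative rate as low as the false-positive budget allows, matching the trade-off described above.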

C. Evaluation under Con-Detect-blind Threat Model
We first evaluate the efficacy of Con-Detect under Threat Model 1 (Con-Detect-blind attacker); the results are shown in Fig. 14. We note that, for most cases, Con-Detect reduces the ASRs of all three attacks from 100% to 0%. For the Kaggle dataset, the word-removal method detects adversarial inputs more effectively than the word-substitution method, specifically for the LSTM-classifier. For example, the combined ASR reduces from 100% to ≈60% for the word-substitution method and to ≈0% for the word-removal method. We attribute this to a large overlap of the C_F-substitution scores of clean and adversarial inputs, shown in Fig. 16 for the LSTM-classifier. Fig. 14 also shows that Con-Detect is not effective for BERT-classifiers. The reason is that BERT is fairly hard to attack even without a defense deployed. To fool BERT, an attacker has to perturb significantly more words than for other classifiers, which reduces the negative contributions of X_A in Eq. 4 and degrades the stealth of the attack, rendering the attack impractical (§VI-B).
Con-Detect is equally effective against all three adversarial attacks for the AG news dataset, irrespective of the contribution method used for detection. This shows that Con-Detect not only works for binary classifiers but is also effective for multi-class classifiers. Although Con-Detect significantly reduces the efficacy of the attacks, we note that the decrease is not as pronounced for the IMDB dataset as for the Kaggle and AG news datasets, notably for the word-removal method (Fig. 14(c)). We attribute this to the closeness of the C_F-removal scores of clean and adversarial inputs for the IMDB dataset, as illustrated in Fig. 6 under Section III-C. For the IMDB dataset, we note that word substitution-based detection gives better results than the word removal-based approach.

D. Evaluation under Con-Detect-aware Threat Model
We evaluate the efficacy of Con-Detect under Threat Model 2 (Con-Detect-aware attacker). This setting assumes an attacker who is aware of the detection mechanism and perturbs X to change argmax F(X) while keeping P_A(X) < 0.5. The classifiers used in this evaluation are based on the MLP, CNN, LSTM, Hybrid-RNN-CNN, and BERT architectures, and trained on the different datasets (IMDB reviews, Kaggle fake-news, and AG news). Results are shown in Fig. 15.
ASR decreases. Fig. 15(a) provides the combined ASRs of the attacks against classifiers for the different datasets. As previously, the combined ASR is an average of the individual ASRs of all three attacks (Text-bugger, Text-fooler, and PWWS) used in the analysis. As expected, Con-Detect-aware attackers achieve higher ASRs than Con-Detect-blind attackers. However, Con-Detect significantly degrades the attack performance compared to vanilla (no detection) classifiers even for Con-Detect-aware attackers, as shown in the figure. Unlike previously, we note that Con-Detect works best with BERT-classifiers, reducing the ASRs to 0% for all attacks and datasets. This is because Con-Detect makes it harder for an attacker to attack an already hard-to-attack BERT-classifier. Another reason can be the gradient obfuscation caused by a deeper architecture. [...] The figures report the combined averages of ASRs, number of queries, and percent words perturbed by the attackers.
As previously, we observe that when attacking Con-Detect, the ASR of the BAE attack significantly decreases, while the average number of queries made by an attacker and the average percentage of words perturbed are increased. Our experiments establish that Con-Detect-classifiers are harder to evade even for BERT-based attacks.

E. Evaluation under Con-Detect-adaptive Threat Model
Using the methodology proposed in Section IV-A, we adaptively modify the Text-bugger, Text-fooler, and PWWS attacks. Results in Fig. 18(a), (b), and (c) respectively report the combined averages of the ASRs, the number of queries, and the percentage of words perturbed by Con-Detect-adaptive attackers.
We note that the ASR of an adaptive attacker is on average 30% greater than that of a non-adaptive attacker. This superiority of adaptive attacks is due to the clever unification of the classifier and detector outputs proposed in Eq. 8. Fig. 18(a) shows that the proposed adaptive attack generalizes to the multi-class classification scenario, as illustrated by the increased ASRs for the AG News dataset. As previously, the reduced performance on the IMDB dataset is due to a high overlap between the CF-scores of the clean and the adversarial inputs, as noted in Figs. 5(b) and 6(b). However, we note that the ASR of an adaptive attacker against Con-Detect-classifiers is still smaller than that against vanilla classifiers, irrespective of the dataset used.
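A hypothetical sketch of how such a unification might combine the two outputs is shown below. The paper's exact Eq. 8 may differ; the penalty weight `lam` and both callables are assumptions introduced purely for illustration.

```python
def adaptive_objective(x_adv, true_label, predict_proba, detector_proba, lam=1.0):
    """Score for the adaptive attacker to MINIMIZE during its word-level
    search: lower the true-class probability while penalizing detector
    confidence above the 0.5 alarm level."""
    hinge = max(0.0, detector_proba(x_adv) - 0.5)   # zero while undetected
    return predict_proba(x_adv)[true_label] + lam * hinge

# Toy stand-ins (illustrative).
predict_proba = lambda x: [0.3, 0.7] if "!" in x else [0.8, 0.2]
detector_proba = lambda x: 0.2 * x.count("!")

print(adaptive_objective("abcd", 0, predict_proba, detector_proba))     # clean input
print(adaptive_objective("abcd!!!", 0, predict_proba, detector_proba))  # flipped but penalized
```

A candidate perturbation that flips the classifier yet pushes P_A above 0.5 is penalized, steering the search toward perturbations that jointly fool the classifier and the detector.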
In Fig. 18(b), we note a slight reduction in the average number of queries made by an adaptive attacker, suggesting that the cost of the attack is slightly reduced. However, Fig. 18(c) shows that this gain is achieved at the cost of a larger percentage of words perturbed by the attacker. Specifically, an adaptive attacker has to perturb 4% to 6% more words than a non-adaptive attacker. Our findings show that Con-Detect is significantly harder to defeat even for an adaptive attacker.

F. Computational overhead of Con-Detect
We compare the expected time required by Con-Detect-classifiers to process N inputs with that of the vanilla classifiers in Fig. 19. The computational overhead of Con-Detect increases as the number of input samples increases. This is primarily attributed to Con-Detect computing the CF-score for each word in an input sequence and performing the subsequent operations.
Although such large computational overheads can limit the practicality of Con-Detect, we observe that the time required for attacking Con-Detect-classifiers is also several-fold greater than that for the vanilla classifiers. This is mainly attributed to the larger number of queries required by an attacker when attacking Con-Detect-classifiers.

A. Cost of attacking Con-Detect
In practical scenarios, such as online ML services, there is a considerable cost associated with each query/request that a user makes [1]. One such example is the Google APIs, which only allow a limited free quota per day per user. Our results show a large increase in the number of queries required by an attacker when attacking Con-Detect-classifiers as compared to the vanilla classifiers. We attribute this to Con-Detect making it harder for an attacker to estimate the importance of a word, owing to the mismatch between the importance scores calculated for Con-Detect and for the classifier. Our experiments suggest that Con-Detect, in addition to reducing the attack success rate, significantly increases the cost and time of launching an attack.

B. Stealth of adversarial attacks on Con-Detect
Con-Detect detects adversarial perturbations that change a classifier's output by perturbing only a fraction of the input words. A Con-Detect-aware attacker thus has to choose low-impact perturbations and, consequently, perturb a larger percentage of words to achieve the desired result, which significantly degrades the stealth of the adversarial attack.
Fig. 20 compares the adversarial examples generated by the Text-fooler attack against the Con-Detect-Hybrid-CNN-RNN classifier under the Con-Detect-blind, Con-Detect-aware, and Con-Detect-adaptive threat models. We note that the adversarial examples generated under the Con-Detect-blind threat model against Con-Detect-classifiers are identical to those generated against the vanilla classifiers; this is because, in both cases, an attacker's goal is only to fool the classifier. It is evident from Fig. 20 that the adversarial examples generated against Con-Detect-classifiers are highly perturbed, raising suspicion upon manual inspection.
We note many similarities between the adversarial perturbations made by adaptive and aware attackers when attacking similar Con-Detect methods (substitution or removal). For example, in Fig. 20, the phrase "France chose an idealistic, traditional candidate" from the clean input was changed to "France opting an fantasy, accustomed candidate" and "France selecting an fantasy, traditional candidate" by the aware and adaptive attackers, respectively, when attacking the word removal-based Con-Detect-classifier. The same original phrase was changed to "France option an notion, usual candidate" and "France option an notion, traditional candidate" by the aware and adaptive attackers, respectively, when attacking the word substitution-based Con-Detect-classifier. We attribute this to similar methods causing similar word contribution scores.

Reducing the computational overhead. Instead of computing the CF-score for each word, we can divide an input into chunks, where each chunk encloses multiple subsequent words, and compute the CF-score of each chunk, which should considerably reduce the time overhead. However, such a modification to the Con-Detect methodology may increase the error rate, i.e., false positives and false negatives. A detailed investigation is left for future work.

Extension to the image and speech classification tasks. Similar to NLP attacks, image-based attacks aim to change a model's output by minimally perturbing its input image. We hypothesize that such constraints should imprint detectable adversarial features on an input. In the future, we plan to extend the Con-Detect methodology to the audio and visual classification tasks, where inputs are more continuous. However, we note that our assumptions in Section III-B are particular to the NLP classification tasks, and thus our work cannot be trivially extended to other domains without careful adaptation.
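The chunk-based reduction of the per-word CF-score computation could be sketched as follows; the chunk size and the toy classifier are illustrative assumptions, not the paper's implementation.

```python
def cf_removal_score_chunked(words, predict_proba, label, chunk_size=3):
    """Approximate CF-removal score: remove whole chunks of consecutive
    words instead of single words, trading granularity for fewer model
    queries (and a potentially higher detection error rate)."""
    base = predict_proba(words)[label]
    score = 0.0
    for start in range(0, len(words), chunk_size):
        reduced = words[:start] + words[start + chunk_size:]
        score += base - predict_proba(reduced)[label]
    return score

# Hypothetical toy classifier: class-1 probability grows with cue words.
CUES = {"good", "great"}

def toy_predict(words):
    p1 = min(1.0, 0.2 + 0.25 * sum(w in CUES for w in words))
    return [1.0 - p1, p1]

words = ["a", "good", "movie", "with", "great", "acting"]
# 2 chunk-removal queries instead of 6 word-removal queries.
print(cf_removal_score_chunked(words, toy_predict, 1))
```

With chunk_size = 3, the number of model queries drops from len(words) + 1 to ceil(len(words) / 3) + 1, at the possible cost of more false positives and false negatives, as noted above.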

VII. CONCLUSIONS
In this work, we propose Con-Detect, a novel methodology to detect adversarial inputs to natural language processing (NLP) classifiers at runtime. We show that adversarially perturbed inputs can be detected using our proposed cumulative contribution scores, denoted CF-scores, since these scores are higher for adversarial inputs than for clean inputs. We identify the reason underpinning the difference between adversarial and clean NLP inputs as the stealthiness of adversarial attacks: to evade detection, adversarial attacks aim to only minimally perturb an input to change a classifier's decision. These minimal perturbations, however, are also highly contributing perturbations when viewed through the lens of our cumulative contribution scores. We thus leverage the fact that minimalist perturbation, considered a strength of adversarial attacks, can also act as their Achilles' heel and help us detect these attacks. Further, we propose an alarm-threshold mechanism to detect adversarial inputs based on their CF-scores.
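A minimal sketch of such an alarm-threshold rule is shown below, with the threshold chosen to maximize the F1-score on held-out CF-scores in the spirit of the sweep in Fig. 13; the scores, labels, and candidate thresholds are made-up numbers for illustration.

```python
def f1(tp, fp, fn):
    """F1-score from true-positive, false-positive, and false-negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def best_threshold(scores, labels, candidates):
    """Pick the alarm threshold t maximizing F1 on held-out CF-scores.
    labels: 1 for adversarial, 0 for clean; an input is flagged if score > t."""
    best_t, best_f1 = None, -1.0
    for t in candidates:
        preds = [int(s > t) for s in scores]
        tp = sum(p and y for p, y in zip(preds, labels))
        fp = sum(p and not y for p, y in zip(preds, labels))
        fn = sum((not p) and y for p, y in zip(preds, labels))
        score = f1(tp, fp, fn)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1

scores = [0.10, 0.20, 0.80, 0.90]   # made-up CF-scores
labels = [0, 0, 1, 1]               # 1 = adversarial, 0 = clean
print(best_threshold(scores, labels, candidates=[0.0, 0.5, 1.0]))  # -> (0.5, 1.0)
```

At deployment time, only the comparison `cf_score > t` is needed per input; the sweep is a one-off calibration step per dataset.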
In addition, we propose a novel modification to the state-of-the-art attacks, the so-called adaptive attacks, to specifically target Con-Detect. We perform black-box attacks assuming Con-Detect-blind, Con-Detect-aware, and Con-Detect-adaptive attackers and show that Con-Detect can detect adversarial attacks with high precision and accuracy. Although an adaptive attacker may considerably degrade Con-Detect's performance, the attacker only achieves this at the cost of an increased number of queries and percentage of words perturbed, which suggests that Con-Detect is robust.

Fig. 1: (figure) Illustration of the word-removal procedure on a clean input, "The lawyer does half law talk, half partisan spin", and its adversarial counterpart, "The lawyer does half legal talk, half political spin", for the fake-news classifier.

Fig. 2: Comparing histograms of CF-removal scores for clean and adversarial inputs. (Settings: Dataset: Kaggle; Threat model: black-box; Attacks: Text-bugger, Text-fooler, PWWS.) Adversarial inputs show considerably higher CF-removal scores than clean inputs across different classifier architectures.

Fig. 7: Illustration of the computation of CF-scores for clean and adversarial inputs against the Hybrid-CNN-RNN fake-news classifier. Blue arrows represent the clean flow and red arrows represent the adversarial flow.

Fig. 5: Comparing distributions of CF-substitution scores for clean (blue) and adversarial inputs. (Settings: Contribution metric: CF-substitution score; Model: Hybrid-CNN-RNN classifier; Threat model: black-box.) Adversarial inputs have higher CF-substitution scores than clean inputs.

Fig. 10: Illustrating the condition defined in Eq. 7 for an adaptive attacker on the fake-news classification task. The area highlighted in red shows where the condition is false, while the blue area shows where it is true. (Settings: Input X is originally fake and unperturbed.) When Eq. 7 is satisfied, P_A(X) < 0.5 (undetected) and P(real|X) > 0.5 (successful adversarial attack).

Fig. 11: Comparing the performance of our classifiers before and after attacking, for all datasets. (Settings: Classifiers: MLP, CNN, LSTM, Hybrid, BERT; Attacks: Text-bugger, Text-fooler, PWWS; Threat model: black-box.) The accuracy of the classifiers is fairly good for clean inputs; however, for adversarial inputs the accuracy severely degrades (dropping to zero in many cases).

Fig. 13: Analyzing the effect of changing the alarm threshold, t, on the F1-score of Con-Detect for different datasets. (Settings: Score: CF-removal score; Classifier: Hybrid-CNN-RNN; Threat model: black-box.) The CF-removal score is effective for the Kaggle fake-news and AG news datasets; for the IMDB dataset, it is only partially effective.

Fig. 14: Comparing the attack success rates (ASRs) of three state-of-the-art attacks against the vanilla classifiers and Con-Detect-classifiers for a range of deep learning architectures and datasets. (Settings: Architectures: MLP, CNN, LSTM, Hybrid-CNN-RNN, BERT; Datasets: Kaggle, AG News, IMDB; Threat model: black-box; attacker is unaware of Con-Detect.) Con-Detect effectively detects blind adversarial attacks, resulting in low ASRs.

Fig. 19: Combined averages of the time (in seconds) required by Con-Detect-classifiers to process N inputs as compared to the vanilla classifiers. (Settings: Classifiers used for averaging: MLP, CNN, LSTM, and Hybrid-CNN-RNN; Dataset: Kaggle.) Con-Detect-classifiers incur large time overheads compared to vanilla classifiers.