Dual-Targeted Textfooler Attack on Text Classification Systems

Deep neural networks provide good performance on classification tasks such as those for image, audio, and text classification. However, such neural networks are vulnerable to adversarial examples. An adversarial example is a sample created by adding a small adversarial noise to an original data sample in such a way that it will be correctly classified by a human but misclassified by a deep neural network. Studies on adversarial examples have focused mainly on the image field, but research is expanding into the text field as well. Adversarial examples in the text field that are designed with two targets in mind can be useful in certain situations. In a military scenario, for example, if enemy models A and B use a text recognition model, it may be desirable to cause enemy model A tanks to go to the right and enemy model B self-propelled guns to go to the left by using strategically designed adversarial messages. Such a dual-targeted adversarial example could accomplish this by causing different misclassifications in different models, in contrast to single-target adversarial examples produced by existing methods. In this paper, I propose a method for creating a dual-targeted textual adversarial example for attacking a text classification system. Unlike the existing adversarial methods, which are designed for images, the proposed method creates dual-targeted adversarial examples that will be misclassified as a different class by each of two models while maintaining the meaning and grammar of the original sentence, by substituting words of importance. Experiments were conducted using the SNLI dataset and the TensorFlow library. The results demonstrate that the proposed method can generate a dual-targeted adversarial example with an average attack success rate of 82.2% on the two models.


I. INTRODUCTION
Deep neural networks [1] provide good performance in the fields of image classification [2], speech classification [3], text classification [4], and intrusion detection [5], all of which involve machine learning tasks. In particular, the deep neural network known as the ''bidirectional encoder representations from transformers'' (BERT) [6] model provides good performance in the text domain. BERT is not limited to a specific field but performs well across natural language processing tasks. Unlike previous models, BERT is a language model whose pretrained embeddings can improve the performance of task-specific models. It provides good performance on text classification problems and in determinations of similarity between two sentences. However, the BERT model is vulnerable to adversarial examples [7], [8]. Adversarial examples are samples created by adding a small amount of noise to an original data sample; they are correctly classified by humans but will be misclassified by a targeted classification model. Adversarial examples have been studied extensively in the image field, but research is currently being conducted in the text domain as well. Unlike those in the image field, adversarial examples [9], [10] in the text domain consist of text that has the same meaning as the original text but is designed to be misclassified by a targeted model. They are created by selecting important words from text sentences and replacing them with other, similar words. The creation of adversarial examples in the text domain requires a more complex process than that in the image domain. Existing studies [9], [11] on adversarial examples in the text domain have proposed methods of attack that target one model. In some situations, however, it may be desirable to create adversarial examples in the text domain that target two models, inducing a different misclassification in each one. For example, in a military scenario, if enemy models A and B use a text recognition model, it may be desirable to cause enemy model A tanks to go to the right and enemy model B self-propelled guns to go to the left by using strategically designed adversarial messages. Such a dual-targeted adversarial example could accomplish this by causing different misclassifications in different models, in contrast to single-target adversarial examples produced by existing methods.
In this paper, I propose the dual-targeted textfooler method for attacking text recognition systems. This method creates a dual-targeted adversarial example that is designed to be misclassified as a different class by each model when there are two target models. In this method, adversarial examples are created by setting the target class determined by the attacker for each model and replacing important words in the original sentence with similar words in a word-wise manner such that the probability value for the target class is the highest. The contributions of this paper are as follows.
• I propose the dual-targeted textfooler method for attacking a text recognition system. The proposed scheme, designed for use in the text domain, produces an adversarial example that is designed to be misclassified as a different class by each of two models. I explain the basic principle and systemic structure of the proposed method.
• I analyze the attack success rate of the dual-targeted textfooler method and the target models' accuracy on the original data. In addition, I perform a sentence analysis comparing the original sentence and the dual-targeted adversarial sentence.
• I verify the performance of the proposed method using the SNLI dataset [12]. BERT [6], a state-of-the-art text recognition model, was used as the target model.
The remainder of this paper is organized as follows. Section II describes research related to the proposed method. Section III explains the conceptual basis of the proposed method. Section IV describes the proposed method in detail. Section V describes the experiment and presents the evaluation of its results. Section VI discusses various aspects of the proposed method. Section VII concludes the study.

II. RELATED WORK
Adversarial examples, first proposed by Szegedy et al. [13], are samples that are designed to be correctly classified by humans but misclassified by a model; they are created by adding a small amount of noise to an original sample. This section briefly describes the target model, which is the BERT model [6], and additionally deals with the availability of target model information, recognition target, distortion, and methods of generating adversarial examples in the text domain.

A. BERT MODEL
The ''bidirectional encoder representations from transformers'' (BERT) [6] model is constructed by stacking several transformer encoders. Figure 1 shows how the model is trained to output an embedding vector for a specific word according to a given context. The transformer has an encoder-decoder structure. Each encoder (or decoder) is a structure in which several encoder layers (or decoder layers) are stacked. Each encoder layer consists of a self-attention layer and a fully connected layer. The self-attention layer is configured to refer to the meaning of words that exist at different positions when the meaning of the word corresponding to each position in the input sentence is determined. Multi-head attention is implemented by using several self-attention layers simultaneously. Encoding vectors generated by multiple self-attention layers are encoded once again when passing through a fully connected layer.
The BERT model, in which several transformer encoders are stacked, attains good performance by fine-tuning on various subtasks after pretraining, so that effective embeddings are learned from a large number of documents. At its core, the BERT model learns the word embeddings during pretraining. Transformer encoders are stacked in several layers to take into account the context before and after each position in the given sentences, and a certain percentage of the input words are removed to prevent indirect self-reference when a specific word is predicted. A word embedding is learned by predicting, at the output layer, the input word that has been removed. In addition, to learn the relationships not only between words but also between sentences, two sentences are given, and the relationship between the two sentences is predicted. Because the classification [CLS] token is specifically allocated as the first input in the BERT model, this type of prediction is performed with the output vector at the first position. In other words, the output vector corresponding to the [CLS] token is used to create a document-level classification model.
After this pretraining process, the BERT model outputs an embedding vector that fits the context for each word in a given input sentence. These output vectors can be used in various subtasks. In the case of a classification problem, fine-tuning can be performed by using the output vector corresponding to the [CLS] token; in the case of a question-and-answer task, a question and a paragraph about the situation can be given as input and fine-tuned so that the correct answer is produced as an output. In other words, although the structure of the pretrained BERT model does not change and only a few layers that receive the output vectors as inputs are added, a model can be created that performs a specific task.
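As a concrete illustration of this fine-tuning pattern, the following minimal sketch builds a three-class sentence-pair classifier on top of the pooled [CLS] output using the Hugging Face transformers library with TensorFlow. The library, model checkpoint, and example sentences are illustrative assumptions, not the exact configuration used in this paper.

import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

# Pretrained BERT encoder plus a small classification head on the [CLS] vector.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = TFBertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)  # entailment / neutral / contradiction

# A premise-hypothesis pair is encoded as one sequence pair.
inputs = tokenizer("A woman holding a newborn baby",
                   "A woman holds baby",
                   return_tensors="tf", padding=True, truncation=True)
logits = model(dict(inputs)).logits
probs = tf.nn.softmax(logits, axis=-1)  # per-class confidence scores

The resulting probability vector is the kind of per-class confidence score that the attack described later relies on.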

B. AVAILABILITY OF TARGET MODEL INFORMATION
Adversarial examples are classified into white-box attacks and black-box attacks according to the information available on the target model. A white-box attack [7], [14], [15] is an attack in an environment in which the attacker knows everything about the target model, such as its structure, the values of its parameters, and the probability values of each class for a given input. A black-box attack [16], [17], [18] is one in which an attacker can know only the result value for a query and does not have information about the target model itself. In this study, the proposed method is for a white-box attack on two models; it is assumed that the structures, parameters, and classification scores of both models are known.

C. SPECIFICITY OF RECOGNITION TARGET
Adversarial examples are divided into targeted attacks [19], [20] and untargeted attacks [21] according to the specificity of the recognition target. A targeted attack is one in which the adversarial example is misclassified as the specific target class determined by the attacker. An untargeted attack is one in which the adversarial example is misclassified as any incorrect class other than the original class. The method proposed in this paper is for a targeted attack against two models that causes misclassification by each model into the target class determined by the attacker.

D. TYPE OF DISTORTION
In the image field, it is easy to generate adversarial examples [22], [23], [24], [25], [26], [27] because noise can be applied directly to each pixel of the original data sample. In the text domain, however, adversarial examples are generated by changing words, not pixels; thus, a more complex process involving discrete operations is required. In addition, the generated adversarial example must not appear grammatically or semantically problematic to human perception.

E. TEXT DOMAIN METHOD OF ADVERSARIAL EXAMPLE ATTACK
Adversarial examples have been investigated mainly in the image field, but research has recently expanded to include the text domain. In the image field, an adversarial example is created by applying a minimal distortion to each pixel of an image. In the text domain, however, an adversarial example is created by replacing an important word with a similar word, thereby creating a sample that appears the same to humans but will be misclassified by the target model. Zhao et al. [9] proposed a method for generating adversarial examples in the text domain using a generative adversarial net. Their method creates an adversarial sentence similar to the original sentence by changing the latent vector of the input data. Ebrahimi et al. [11] proposed a white-box attack method that generates adversarial examples against the CharCNN-LSTM model by changing specific words. However, it selects the words to be changed randomly rather than prioritizing important words, and it does not consider grammatical aspects. Jin et al. [28] proposed a method to generate adversarial examples for the BERT model in a word-wise manner while preserving grammatical details. This method generates adversarial examples that exhibit no grammatical problems but will be misclassified by the target model; it works by first analyzing an original sentence to identify important words and then creating an adversarial example by replacing one or more of them with similar words.
All of the above methods create an adversarial example that is designed to cause misclassification by one target model. However, a dual-targeted method, which causes different misclassifications in two different models, can be useful in certain scenarios. In this paper, I propose such a method, called the dual-targeted textfooler method. Figure 2 shows the decision boundaries of models A and B with respect to an original sample x and an adversarial example x*. Samples that lie within a specific area demarcated by a decision boundary will be classified as the class corresponding to that area. In the example in the figure, the premise is ''A woman holding a newborn baby'', and the original sample is ''A woman holds baby''. Because the original sample is similar in meaning to the sentence given as the premise, the original sample is located in the entailment class area for both model A and model B. On the other hand, the adversarial example created by the proposed method, ''A woman holds toddlers'', constitutes an entailment with respect to the premise from the perspective of human judgment, and it has no abnormalities in terms of grammar or meaning. However, this adversarial example is misclassified as a contradiction by model A and as neutral by model B. Thus, the proposed method can generate an adversarial example that is similar to the original sample according to human perception and yet will be misclassified differently by two models, A and B.

IV. PROPOSED SCHEME
A. ASSUMPTION
In the proposed scheme, the attacker must know the confidence scores generated by the two models for the input data. The proposed scheme can create textual adversarial examples using only confidence scores, without information on the parameters and structures of the two models. Under this assumption, the proposed method is capable of generating adversarial examples that will be misclassified differently by different models and that do not exhibit any problems evident to human perception. Figure 3 shows an overview of the proposed scheme. The proposed method has two procedures: word importance ranking (WIR) and word transformation. First, in the WIR, the words that have a significant influence on the models' predictions are ranked in order of importance. Second, the synonym extraction (SE) step collects candidate words that can be substituted for the highest-priority words in the WIR. Then, these candidate words are subjected to the part-of-speech (POS) check [29], which keeps words that affect the sentence's grammar unchanged. Next, among the selected candidate words, a group of candidates capable of maintaining the highest similarity to the original sample is found through a semantic similarity check (SSC) [30]. After that, a dual-targeted adversarial example is created by substituting the candidate word that causes the sentence to be misclassified as a different class by each of the two models and that has the highest similarity to the original sample. If the corresponding confidence score is low, processing proceeds to the next selected word, and the same steps are followed.

B. PROPOSED METHOD
The above procedure can be expressed mathematically as follows. First, for the WIR, given a sentence X = {x_1, x_2, ..., x_n} consisting of n words, the proposed method needs to find the key words that most strongly influence the predictions of the models M_k. It therefore selects the words whose removal most strongly changes the final predicted result, while keeping the changes minimal so that semantic similarity is maintained. After a word x_i is deleted from X = {x_1, x_2, ..., x_n}, the change in confidence is calculated as the difference between the prediction score for the original sentence and the score output by the models M_k for the reduced sentence. The importance score I_{x_i} is calculated from this change in the prediction score. When words are ranked by their importance score, stop words such as ''the'' and ''it'' are filtered out so that the grammar is not disrupted.
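The deletion-based importance score can be sketched as follows; predict_proba is an assumed helper that returns a model's class-probability vector for a sentence, and the helper names are illustrative rather than the paper's own.

import numpy as np

def word_importance_ranking(words, orig_label, models, predict_proba):
    # Confidence of the original class for the full sentence, per model.
    base = [predict_proba(m, " ".join(words))[orig_label] for m in models]
    scores = []
    for i in range(len(words)):
        reduced = words[:i] + words[i + 1:]          # delete word x_i
        drop = [b - predict_proba(m, " ".join(reduced))[orig_label]
                for m, b in zip(models, base)]
        scores.append(float(np.mean(drop)))          # importance I_{x_i}
    # Indices of words ranked by importance, highest first.
    return sorted(range(len(words)), key=lambda i: -scores[i]), scores

Stop words can then be removed from the ranked list before the word transformation step.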
Second, in the word transformation step, a mechanism is needed to replace a word of high importance I_{x_i}. To find the word substitution most suitable for creating the dual-targeted adversarial example, three steps are required: SE, the POS check, and SSC.
In the SE step, the set of all possible substitution candidates is assembled for the selected word x_i. The candidates are the N synonyms whose word embeddings have the highest cosine similarity to that of x_i, subject to a minimum similarity threshold. That is, the embedding vector of each vocabulary word is compared with that of x_i, and the N words whose cosine similarity exceeds the threshold are retained. In this study, the threshold was set to 0.7, and N was set to 50; these parameters control the diversity and the semantic similarity of the candidates.
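A sketch of this synonym extraction step is given below, assuming emb is a word-embedding matrix of shape (|V|, d) and vocab/word2id map between words and rows; these names are illustrative assumptions.

import numpy as np

def extract_synonyms(word, emb, vocab, word2id, n=50, threshold=0.7):
    # Return up to N vocabulary words whose cosine similarity to `word`
    # is at least the threshold, excluding the word itself.
    if word not in word2id:
        return []
    v = emb[word2id[word]]
    sims = emb @ v / (np.linalg.norm(emb, axis=1) * np.linalg.norm(v) + 1e-9)
    order = np.argsort(-sims)
    return [vocab[i] for i in order
            if vocab[i] != word and sims[i] >= threshold][:n]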
The POS check is needed to maintain the same part of speech in the substitution candidates for the word x_i and thereby ensure the grammatical consistency of the text.
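One simple way to implement this check, assuming the NLTK part-of-speech tagger (the paper cites [29] for the POS check but does not name a specific tool), is to keep a candidate only if it receives the same tag as the original word in the same sentence position.

import nltk  # requires the 'averaged_perceptron_tagger' resource

def same_pos(words, index, candidate):
    original_tag = nltk.pos_tag(words)[index][1]
    replaced = words[:index] + [candidate] + words[index + 1:]
    return nltk.pos_tag(replaced)[index][1] == original_tag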

Algorithm 1 Dual-Targeted Textfooler
Input: Two models M_k (1 ≤ k ≤ 2), original sample X = {x_1, x_2, ..., x_n}, the original label Y, multiple target classes Y*_k, sentence similarity function s(·), cosine similarity function s_c(·), similarity threshold, word embedding e over the vocabulary V, final candidate set C_final
Dual-targeted textfooler:
for each word x_i in X do
    Calculate the importance score I_{x_i}
end for
Generate a set X_I of all words x_i ∈ X ranked by the importance score I_{x_i}
Filter out the stop words in X_I
for each word x_i in X_I do
    Initiate the set of candidates C by extracting the top N synonyms through s_c(e_{x_i}, e_word) for each word in V
    Filter C with the POS check
    for each candidate c in C do
        Replace x_i with c in X to obtain the adversarial candidate X'
        Query M_1 and M_2 with X' to obtain the prediction scores
        if both models predict the target classes Y*_k and s(X, X') is greater than the threshold then
            Add c to C_final
        end if
    end for
    if C_final is not empty then
        return the adversarial example built from the candidate in C_final with the highest similarity s(X, X')
    end if
end for

In the SSC step, the word x_i in the sentence is replaced by one of the remaining substitution candidates, creating an adversarial example. Next, this adversarial example is provided to the models M_k to obtain prediction scores. Using the universal sentence encoder (USE), the semantic similarity between the original sample and the adversarial example is computed as the cosine similarity of their high-dimensional sentence embeddings. If the semantic similarity is greater than the specified threshold, the substitution candidate is stored in the pool of final candidates.
The adversarial example with the highest similarity score is chosen from among the final candidates. If no final candidate exists, the SE, POS check, and SSC steps are repeated as above for the next selected word.
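The semantic similarity check and final selection can be sketched as follows with the publicly available Universal Sentence Encoder module on TensorFlow Hub; the module handle and function names are assumptions for illustration, not the paper's own code.

import numpy as np
import tensorflow_hub as hub

use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def semantic_similarity(original_sentence, adversarial_sentence):
    # Cosine similarity between the USE embeddings of the two sentences.
    a, b = use([original_sentence, adversarial_sentence]).numpy()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

A candidate replacement is kept in the final candidate pool only if this score exceeds the similarity threshold (0.7 in the experiments) and both models assign the modified sentence to their respective target classes; the final candidate with the highest score is then chosen.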
The details of the above procedure for generating an adversarial example are given in Algorithm 1.

V. EXPERIMENT AND EVALUATION
Experiments were conducted to assess the ability of the proposed method to generate a dual-targeted adversarial example for a text classification system. The TensorFlow library was used as a machine learning library, and an Intel(R) i5-7100 3.90-GHz server was used as hardware. Definitions of the abbreviations used in this section are given in Table 2 in the appendix.

A. EXPERIMENTAL SETUP
SNLI was used as the dataset for the experiment; it consists of 573,000 sentence pairs. Designed for tasks that determine the relationship between two sentences, it can be used to determine whether there is an entailment, contradiction, or neutral relationship between the first and second sentences. 570,000 sentence pairs were used as training data and 3,000 as test data.
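For reference, the dataset can be loaded with TensorFlow Datasets as sketched below; the paper does not state which loader was used, so this is only one convenient option.

import tensorflow_datasets as tfds

snli = tfds.load("snli")                    # splits: 'train', 'validation', 'test'
train_ds, test_ds = snli["train"], snli["test"]
# Each example contains 'premise', 'hypothesis', and 'label'
# (0 = entailment, 1 = neutral, 2 = contradiction).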
As the two text classification models, two BERT models were configured. In each BERT model, the hidden size is 768 and there are 12 hidden layers. The vocabulary size was 30,522 words, the intermediate size was 3072, the maximum number of position embeddings was 512, and GELU [31] was used as the hidden activation function. To configure M_1 and M_2 as different models, each was trained using different parameters, shown in Table 3 in the appendix. Models M_1 and M_2 had 90.4% and 90.1% accuracy, respectively, on the test data after learning the original training data.
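The base configuration above corresponds to the following sketch using Hugging Face's BertConfig; the remaining per-model training hyperparameters (Table 3) are not reproduced here, so treat this as an approximation rather than the exact setup.

from transformers import BertConfig

config = BertConfig(
    vocab_size=30522,
    hidden_size=768,
    num_hidden_layers=12,
    intermediate_size=3072,
    max_position_embeddings=512,
    hidden_act="gelu",
)
# M_1 and M_2 share this architecture but are trained with different
# parameter settings so that their decision boundaries differ.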
For generating the adversarial examples, the number of synonyms was set to 50, and the similarity score threshold was set to 0.7. The maximum sequence length was set to 64, and the batch size was set to 32. The proposed method was used to generate dual-targeted adversarial examples for 500 randomly chosen test samples. Figure 4 shows three example sentence trios, each consisting of an original premise, an original sentence, and a proposed sentence for M_1 and M_2.

B. EXPERIMENTAL RESULTS
Because the dual-targeted adversarial sentence is formed by replacing a specific word with a similar word, no problem is created in terms of the human perception of the sentence's grammar and semantics. However, the new sentence will be misclassified by both models, each of which will assign it to a different incorrect class. Additional original and proposed sentences, together with the performance of the proposed method on them, are presented in the appendix. Figure 5 shows the accuracy on the original samples and the attack success rate of the proposed adversarial examples for the two models M_1 and M_2 according to the number of synonyms. For the original samples, the accuracy of the two models averaged 90.3%. The average success rate of the dual-targeted adversarial examples (the proposed sentence being misclassified by both models) was 82.2%. As the number of synonyms increased, the attack success rate of the dual-targeted adversarial examples increased slightly. The proposed method performs best when the number of synonyms is 50, where the attack success rate of the dual-targeted adversarial example is 87.6%. Thus, it is demonstrated that the proposed adversarial sentence is misclassified by the two models, each model assigning it to a different incorrect class. Figure 6 shows the average percent change and the number of queries needed to generate the proposed adversarial example according to the number of synonyms. As the number of synonyms increased, the average percent change decreased and the number of queries increased. The number of queries required increased proportionally with the number of synonyms. As the number of synonyms increases, the number of possible changes increases, and because more varied attacks become possible, the average percent change decreases slightly, maximizing the similarity between the original sample and the proposed adversarial example.
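For clarity, the dual-targeted attack success rate reported above counts a sample as a success only when both models output their respective attacker-chosen target classes; a minimal sketch, with predict_class as an assumed helper returning a model's predicted label, is:

def dual_attack_success_rate(adv_sentences, targets_1, targets_2, m1, m2, predict_class):
    hits = sum(1 for s, t1, t2 in zip(adv_sentences, targets_1, targets_2)
               if predict_class(m1, s) == t1 and predict_class(m2, s) == t2)
    return hits / len(adv_sentences)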

VI. DISCUSSION
This section discusses attack considerations, word changes, number of synonyms, human perception, applications, and limitations and future research.

A. ATTACK CONSIDERATIONS
The proposed method requires information on the confidence scores of the target models. Based on the confidence scores for each model, it creates an adversarial example that is misclassified as a different class by each model. Specifically, the proposed method creates the adversarial example by substituting a specific word in the original sentence, guided by the confidence scores of each model.
Because the SNLI dataset consists of single-sentence data, the sample size is limited to the length of a single sentence. Figure 7 shows one of the longest sentences in the SNLI dataset. In the figure, it can be observed that the proposed sentence is misrecognized differently by the M_1 and M_2 models when specific words in the original sentence are changed. Thus, the proposed method is generally applicable to the single-sentence data that make up the SNLI dataset.

B. WORD CHANGES
The proposed method creates a dual-targeted adversarial example by changing specific words, selecting the words that can change the classification outcome the most. The adversarial examples are created through small changes, replacing words of high importance with similar words so that the resulting sentence will be misclassified differently by each model.

C. NUMBER OF SYNONYMS
Because the proposed adversarial example is generated by replacing specific words in a word-wise manner, the performance of the proposed method depends on the number of candidate synonyms considered for each replacement. Factors that are affected by the number of synonyms include the accuracy of the two models on the adversarial examples, the number of queries required, and the word change rate. As the number of synonyms increases, the number of replaceable words increases; therefore, the attack success rate of the proposed scheme also increases. The proposed method performs best when the number of synonyms is 50, where the attack success rate of the dual-targeted adversarial example is 87.6%. As the number of synonyms increases, the number of queries required increases, but the word change rate decreases.

D. MODEL CONFIGURATIONS
In the experiment for this study, different models were created with homogeneous architectures but configured with different parameter values. Different parameter values result in different models because the decision boundaries of the two models will be different. When configuring different models, it is important to note that the accuracy on the test data should not be reduced; the test data accuracy of the two models (M_1 and M_2) was 90.4% and 90.1%, respectively, as reported in Section V-A. In addition, the experiment was run with two models constructed with heterogeneous architectures, configured as shown in Table 1. In these two models, the maximum number of position embeddings was 512, the vocabulary size was 30,522 words, the attention dropout was 0.1, the hidden dropout was 0.01, and the initial constant was 0.02. In this version of the experiment, models M_1 and M_2 showed an accuracy of 90.4% and 89.8% on the original test data, and the average attack success rate of the dual-targeted adversarial example was 84.3%. In all cases, whether for models with homogeneous architectures and different parameter values or for models with heterogeneous architectures and different parameter values, the proposed method generates a dual-targeted adversarial example that is misclassified differently by the two models while maintaining their accuracy on the original samples.
An important point of the proposed method is that M_1 and M_2 are different models. Accuracy was used as an indicator of the difference between the two models: for models with the same (homogeneous) architecture, they were considered different models if their accuracies differed after the parameters were changed. The two models were analyzed in terms of homogeneous and heterogeneous structures, which are the criteria for different models, and the experimental results showed similarly high success rates in both cases. Constructing the two models with different structures therefore does not significantly affect the performance of the proposed method.

E. HUMAN PERCEPTION
To humans, the proposed adversarial example has the same meaning as the original sample and presents no problems in terms of grammar. The proposed method uses the POS check to avoid changes to words that would affect basic grammar. The proposed method minimizes the change in meaning between the original sample and the adversarial example by changing a word in the original sample into a similar word using many synonyms.

F. APPLICATIONS
The proposed method can be applied in the fields of international politics or industrial espionage. For example, it could be used to cause misinterpretation by intentionally generating adversarial sentences so that sentences spoken by important international figures are misclassified differently by different countries. In the field of industrial espionage, sentences generated by the proposed method could be used as a covert channel, causing different types of misclassification by different companies and providing correct information to affiliated companies.

G. LIMITATIONS AND FUTURE RESEARCH
Because the proposed method uses a word-wise replacement method, the generation of an adversarial example may be limited if there is no appropriate alternative word that can induce misclassification into different classes by several models.
The proposed method performs a white-box attack, as might be the scenario for a white-hat attack. In a more realistic attack in a real-world setting, it should be possible to attack in an environment in which there is no information available about the model (a black-box attack), as in a black-hat attack; this could be a topic for future work. For a black-box attack, one possible approach is to use a substitute network. The substitute network method first creates a similar network to be used as a substitute, through multiple queries against the black-box target model. An adversarial example generated by a white-box attack against the substitute network created in this way can then be used successfully in a black-box attack against the target model. Another type of black-box attack uses an adversarial example generated from a known local model, which can be effective to a certain degree against an unknown model owing to the property known as transferability.
In addition, methods of leaking information about AI systems in a black-box environment will be an interesting topic for future research. In many government, business, and military situations, attackers likely lack information about the AI systems they want to attack. To obtain information on such a model, methods are needed to leak system information using malicious code, through internal attackers, phishing, and advanced persistent threat (APT) methods. First, an insider attack is initiated by a malicious user who has been granted authorized (i.e., insider) access to a system. From stealing corporate data to spreading malware, insider attacks are known to cause massive damage. Through this attack method, information about the target AI system can be obtained. Second, a phishing attack is a forged communication that appears to come from a trusted source but can compromise any type of data source. Such attacks facilitate access to online accounts and personal data, obtain permission to make modifications, and compromise connected systems such as point-of-sale terminals and order-processing systems. Through this attack method, information about the target AI system can be obtained. Third, an advanced persistent threat (APT) is an attack method that uses malware to gain unauthorized access to a computer network and acts as a covert threat that remains undetected for a long period. Through this method, it is possible to infiltrate the target AI system and obtain relevant information. After obtaining the necessary information about the target AI system through the above-mentioned methods, the attacker can use the proposed method to attack the AI system with a white-box attack.

VII. CONCLUSION
In this paper, I have proposed a method for generating dual-targeted adversarial examples in the text domain. The method creates dual-targeted adversarial examples that will be misclassified as different classes by several models without any change in meaning or grammar as perceived by humans; it does this by replacing words of high importance with synonyms, in contrast to methods used in studies with images. In the experiment, the proposed method generated dual-targeted adversarial examples that had an average attack success rate of 82.2%. In future research, it may be interesting to develop a method for generating adversarial examples based on a new generation method such as generative adversarial nets (GANs) [32]. Another interesting research topic would be methods of defense against the proposed method.

APPENDIX
See Tables 2 and 3.

[Appendix figure: the sixteen cases, each showing the original premise, the original sentence, and the proposed sentence for the models M_1 and M_2.]