Cycle-Consistent Adversarial GAN: the integration of adversarial attack and defense

In deep-learning image classification, adversarial examples, that is, inputs to which small-magnitude perturbations have been deliberately added, can mislead deep neural networks (DNNs) into incorrect results, which means DNNs are vulnerable to them. Different attack and defense strategies have been proposed to better study the mechanisms of deep learning. However, existing work addresses only one aspect, either attack or defense, without considering that attacks and defenses should be interdependent and mutually reinforcing, like the relationship between spears and shields. In this paper, we propose the Cycle-Consistent Adversarial GAN (CycleAdvGAN) to generate adversarial examples; it can learn and approximate the distributions of original instances and adversarial examples. Once the generators $G_A$ and $G_D$ of CycleAdvGAN are trained, $G_A$ can efficiently generate adversarial perturbations for any instance, so that DNNs predict incorrectly, and $G_D$ can recover adversarial examples to clean instances, so that DNNs predict correctly. We apply CycleAdvGAN under semiwhite-box and black-box settings on two public datasets, MNIST and CIFAR10. Through extensive experiments, we show that our method achieves state-of-the-art attack performance and also efficiently improves defense ability, realizing the integration of adversarial attack and defense. In addition, it improves the attack effect even when trained only on the adversarial dataset generated by a single kind of adversarial attack.


Introduction
With their rapid development, deep neural networks (DNNs) have achieved great success in various tasks, including image recognition [1], text processing [2], and speech recognition [3]. Despite this success, DNNs have been shown to be vulnerable to adversarial examples [4]: carefully crafted samples that look similar to natural images but are designed to mislead a pretrained model. On the one hand, adversarial examples pose potential security threats by attacking or misleading practical deep learning applications, for example, mistaking a stop sign for a yield sign [5] in autonomous driving, or a thief for a staff member in face recognition [6]. On the other hand, adversarial examples are also valuable to deep learning models and to machine learning models in general, as they can enhance the robustness of models and provide insights into their strengths, weaknesses, and blind spots [7]. The strategy for generating adversarial examples is to intentionally add imperceptible perturbations to clean instances in order to fool DNNs into making wrong predictions. In recent years, various attack algorithms have been developed to produce adversarial examples in the white-box manner, where the adversary has full access to the classifier and knows the structure and parameters of the given model. A straightforward approach is to change all pixel values simultaneously in the direction of the gradient, as in the fast gradient sign method (FGSM) [8] and its iterative variant, the basic iterative method (BIM) [9]. These methods find perturbations quickly, but at the expense of controllability.
Another family of methods, based on optimization, aims to find the smallest possible perturbation and uses complex line search procedures to find the best perturbation value, such as box-constrained L-BFGS [4] and the Carlini and Wagner attacks [10], but these methods are complicated and slow. A further, innovative approach uses generative adversarial networks (GANs) to generate adversarial examples that look more natural to humans [11], such as AdvGAN [12] and natural GAN [13]. Xiao et al. [12] proposed AdvGAN, which generates adversarial examples with a GAN that learns and approximates the distribution of original instances. Once trained, the feed-forward generator can produce adversarial perturbations efficiently. However, AdvGAN generates perturbations only by adding a loss that makes the target model predict incorrectly, without considering the relationship between different adversarial attack methods in order to find the distribution of perturbations. At the same time, defense algorithms have progressed alongside attack algorithms. So far, there are two main ideas for defending against adversarial attacks. The first, more straightforward one is to make the model more robust by augmenting the training data or adjusting the learning strategy, such as adversarial training, which retrains a neural network to predict correct labels for adversarial examples [8], and defensive distillation [14], which transfers the knowledge of complex networks to simpler networks. The second is a series of detection mechanisms for detecting and rejecting adversarial examples. One important method in the second category uses GANs to defend against adversarial examples [15], with the advantage of learning the latent distribution of perturbations and reconstructing clean samples better. Shen et al. [16] proposed a GAN-based image reconstruction framework that takes adversarial examples and generates clean samples similar to the original samples.
First, the original samples and the adversarial examples are used to train the GAN. After training, both adversarial examples and original samples are first passed through the generator to eliminate adversarial perturbations and are then classified by the target classifier. The authors consider that the misclassification of adversarial examples is mainly caused by intentional, imperceptible pixel-level perturbations of the input image, so it is desirable to have an algorithm that eliminates the adversarial perturbations of the input image, thereby defending against the attack.
Network-based techniques have achieved satisfying performance in both attack and defense, owing to their great power for generating high-quality synthetic data and for detecting the subtle differences that distinguish adversarial from clean examples. However, existing attack methods exhibit low efficacy when attacking black-box models. For example, most existing methods, such as FGSM and optimization methods, cannot successfully attack models in the black-box manner because of their poor transferability [17]. Meanwhile, existing defense methods also have poor transferability, as they can only defend against a certain attack method. In addition, in previous white-box attacks, the adversary needs white-box access to the architecture and parameters of the model at all times.
Moreover, existing studies address only one aspect, either attack or defense, without considering that attacks and defenses should be interdependent and mutually reinforcing, like the relationship between spears and shields. Inspired by the idea of the GAN, in which the generative model is pitted against an adversary, if we train a model consisting of an attacker and a defender that continually fight against each other, both the attack ability and the defense ability can improve. Meanwhile, there are different viewpoints on why adversarial examples exist, owing to the unexplained nature of DNNs [18][19][20][21]. However, it is widely accepted that the linear properties of deep neural networks in high-dimensional space are sufficient to give rise to adversarial attacks [8]. There is also a conjecture that adversarial examples and clean samples follow two independent distributions [22], which makes style transfer between the two kinds of data possible. To use domain adaptation for style transfer, Zhu et al. [23] proposed CycleGAN, which learns to translate an image from a source domain X to a target domain Y in the absence of paired examples. The training procedure requires a set of style images in the same style and a set of target images of similar content. The learned mapping function takes one target image as input and transforms it into the style domain. From the mechanism of generating adversarial examples, since there is a companion relationship between the original sample and the adversarial example, they can be regarded as a sample pair, following different distributions but interrelated. Therefore, from the perspective of image style transfer, such sample pairs can be transformed into each other by domain transfer.
Transferring the original sample into the adversarial example corresponds to the adversarial attack, and transferring the adversarial example back to the original sample corresponds to the adversarial defense. From the perspective of the generation mechanism of adversarial examples, the mutual transformation between adversarial examples and clean samples is realized by adding a specific perturbation, so the two can be considered to follow different but related data distributions. Therefore, by using a GAN to learn the relationship between the data distributions of adversarial examples and clean samples, this study shifts the focus from generating a single adversarial example to the data distribution of adversarial examples, providing a new research direction for efficiently generating adversarial examples with better transferability and for recovering them effectively.
Consequently, using the cycle-consistency idea of CycleGAN [23], we apply a similar paradigm to combine the attacker and the defender so that they promote each other and better learn the latent distributions of adversarial examples and clean instances, which shows that adversarial and clean examples are not twins and also improves transferability. In this paper, we propose training a Cycle-Consistent Adversarial GAN (CycleAdvGAN) to achieve the integration of adversarial attack and defense. Once trained, CycleAdvGAN no longer needs access to the target model, whether it is used for attack or for defense; it can generate perturbations and recover adversarial examples such that the resulting examples are realistic according to a discriminator network.
To evaluate the effectiveness of CycleAdvGAN, we experiment on the MNIST and CIFAR10 datasets with an ensemble of target models. We evaluate the attack strategies in both semiwhite-box and black-box settings. We show that adversarial examples generated by CycleAdvGAN have higher success rates in both semiwhite-box and black-box attacks and that the method is also effective in defense, because attack and defense promote each other and the model better learns the difference between adversarial and clean examples. In summary, we make the following contributions.
(i) We are the first to achieve the integration of adversarial attack and defense, by proposing the Cycle-Consistent Adversarial GAN.
(ii) We show a high success rate of attack as well as defense in the semiwhite-box setting when training only on the adversarial dataset generated by any single kind of adversarial attack.
(iii) We demonstrate strong transferability, regardless of whether the model is used for attack or for defense.
(iv) We indirectly demonstrate that adversarial and clean data are not twins and follow two different distributions.

Materials and Methods
In this section, we first introduce the problem definition, then briefly describe the approaches we used to generate adversarial images, and finally elaborate on the framework, formulation, and network architecture of our proposed Cycle-Consistent Adversarial GAN.

Problem Definition.
Let $A \subseteq \mathbb{R}^n$ be the clean feature space, with $n$ the number of features. Suppose that $(a_i, t_i)$ is the $i$th instance within the clean dataset, where the feature vector $a_i \in A$ is generated according to some unknown distribution $a_i \sim P_A$ and $t_i$ is the corresponding true class label. Let $B \subseteq \mathbb{R}^n$ be the adversarial feature space, with $n$ the number of features. Suppose that $(b_i, l_i)$ is the $i$th instance within the adversarial dataset, where the feature vector $b_i \in B$ is generated according to some unknown distribution $b_i \sim P_B$ and $l_i$ is the corresponding predicted label. The learning model aims to learn mapping functions between the two domains $A$ and $B$, given the training samples $\{a_i\}_{i=1}^{N}$, where $a_i \in A$, within the clean dataset and $\{b_i\}_{i=1}^{M}$, where $b_i \in B$, within the adversarial dataset. We denote the data distributions as $a_i \sim P_A$ and $b_i \sim P_B$. Given an instance $a_i$, the goal of the attack generator ($G_A$) is to generate an adversarial example $b_i$ that is classified as $F(b_i) \neq t_i$ (untargeted attack), where $t_i$ denotes the true label. Given an adversarial example $b_i$, the goal of the defense generator ($G_D$) is to recover $b_i$ to the clean instance $a_i$, which is classified as $F(a_i) = t_i$. In addition, $b_i$ should be close to the original instance $a_i$ in terms of $L_2$ or another distance metric.

The CycleAdvGAN Method.
Figure 1 illustrates the overall architecture of CycleAdvGAN, which includes two mappings, the generator $G_A: A \to B$ and the generator $G_D: B \to A$, and consists of five parts: the generator $G_A$, the generator $G_D$, the discriminator of domain $A$ ($D_A$), the discriminator of domain $B$ ($D_B$), and the target neural network $F$. The generator $G_A$ takes the original instance $a$ as its input and generates a perturbation $G_A(a)$. Then, the generated adversarial example $b_{fake} = a + G_A(a)$ is sent to the discriminator $D_B$, which is used to distinguish the generated data from the original adversarial example $b$. $D_B$ encourages $G_A$ to translate $A$ into outputs indistinguishable from domain $B$. As shown in Figures 1(a) and 1(b), the two parts of the architecture are symmetrical. Accordingly, the generator $G_D$ takes the adversarial example $b$ as its input and generates a perturbation $G_D(b)$. Then, the generated clean instance $a_{fake} = b + G_D(b)$ is sent to the discriminator $D_A$, which is used to distinguish the generated data from the original instance $a$. $D_A$ encourages $G_D$ to translate $B$ into outputs indistinguishable from domain $A$.
To fulfill the goal of fooling the learning model, we perform the white-box attack, where the target model is $F$ in this case. $F$ takes $b_{fake}$ and $a_{fake}$ as its inputs, and its loss is $L_{adv}$, which represents the distance between the prediction and the target class $t$ (targeted attack), or the negative of the distance between the prediction and the ground-truth class (untargeted attack).

Adversarial Loss for $G_A$.
We apply adversarial losses to both mapping functions. For the mapping function $G_A: A \to B$ and its discriminator $D_B$, we express the objective as
$$L_{GAN}(G_A, D_B, A, B) = \mathbb{E}_{b \sim P_B}[\log D_B(b)] + \mathbb{E}_{a \sim P_A}[\log(1 - D_B(a + G_A(a)))],$$
where the generator $G_A$ aims to generate an imperceptible perturbation $G_A(a)$ that, when added to $a$, yields an example that still looks similar to the original instance, while $D_B$ aims to distinguish between generated adversarial examples and real adversarial examples.

Adversarial Loss for $G_D$.
We introduce a similar adversarial loss for the mapping function $G_D: B \to A$ and its discriminator $D_A$. The loss is defined as
$$L_{GAN}(G_D, D_A, B, A) = \mathbb{E}_{a \sim P_A}[\log D_A(a)] + \mathbb{E}_{b \sim P_B}[\log(1 - D_A(b + G_D(b)))],$$
where the generator $G_D$ aims to recover the adversarial example $b$ to a clean example, while $D_A$ aims to distinguish between generated clean examples and real clean examples.

Adversarial Loss for F.
The loss $L_{adv}$ for fooling the target model $F$ in an untargeted attack is built from $L_F$, the loss function (e.g., cross-entropy loss) used to train the original model $F$, together with the target label $l_t$ and the true class $l_c$.
The $L_{adv}$ losses encourage the perturbed image to be misclassified and the recovered instance to be correctly classified.

Total Loss.
We optimize a min-max objective function $\min_{G_A, G_D, F} \max_{D_A, D_B} L$, where the loss $L$ is a weighted combination of the objectives above and $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are the weights that balance the multiple objectives. The next section provides more training details and discusses the appropriate weights.

The Methods of Generating Adversarial Example Datasets.
In this subsection, the common approaches mentioned above that were used to generate the adversarial examples serving as training sets are briefly described.

Fast Gradient Sign Method Attack (FGSM).
Goodfellow et al. proposed a fast method, the fast gradient sign method, to generate adversarial examples [8]. They performed only a one-step gradient update along the direction of the sign of the gradient at each pixel. Their perturbation can be expressed as
$$p = \epsilon \, \mathrm{sign}(\nabla_x J(\theta, x, y)),$$
where $\epsilon$ is the magnitude of the perturbation, $J$ is the training loss, $\theta$ denotes the model parameters, and $y$ is the label of the input $x$. The generated adversarial example $x_{adv}$ is calculated as $x_{adv} = x + p$. This perturbation can be computed by backpropagation. They claimed that the linear part of a high-dimensional deep neural network cannot resist adversarial examples, although the linear behaviour speeds up training. Regularization approaches such as dropout are used in deep neural networks. Pretraining could not improve the robustness of networks.
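To make the one-step update concrete, the following is a minimal sketch of FGSM on a toy logistic-regression classifier, written in plain Python so that the input gradient can be computed in closed form. The model, its weights `w` and `b`, the input `x`, and the clipping of pixels to [0, 1] are illustrative assumptions and are not taken from the experiments in this paper.

```python
import math

def predict(x, w, b):
    """Probability of class 1 under a toy logistic model (illustrative only)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def loss_gradient_wrt_input(x, y, w, b):
    """Gradient of the cross-entropy loss with respect to the input x.

    For a logistic model this is (p - y) * w, so no autograd is needed.
    """
    p = predict(x, w, b)
    return [(p - y) * wi for wi in w]

def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def fgsm(x, y, w, b, eps):
    """One-step FGSM: x_adv = x + eps * sign(grad_x J), clipped to [0, 1]."""
    grad = loss_gradient_wrt_input(x, y, w, b)
    return [min(1.0, max(0.0, xi + eps * sign(g))) for xi, g in zip(x, grad)]

# Hypothetical toy data: three "pixels", true label 1.
w = [2.0, -3.0, 1.0]
b = 0.0
x = [0.6, 0.2, 0.5]
y = 1
x_adv = fgsm(x, y, w, b, eps=0.3)
print("clean p(y=1):", predict(x, w, b))
print("adversarial p(y=1):", predict(x_adv, w, b))
```

Every coordinate moves by exactly $\epsilon$ in the direction that increases the loss, which is why the method is fast but offers little control over the final perturbation.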

Basic Iterative Method (BIM).
Kurakin et al. applied adversarial examples to the physical world [9]. They extended the fast gradient sign method by running a finer optimization (smaller change) for multiple iterations. In each iteration, they clipped pixel values to avoid a large change in each pixel:
$$I_p^{i+1} = \mathrm{Clip}_{\epsilon}\{ I_p^{i} + \alpha \, \mathrm{sign}(\nabla J(\theta, I_p^{i}, l)) \},$$
where $I_p^{i}$ denotes the perturbed image at the $i$th iteration, $\mathrm{Clip}_{\epsilon}$ limits the change of the generated adversarial image in each iteration to at most $\epsilon$, and $\alpha$ determines the step size (normally, $\alpha = 1$). The BIM algorithm starts with $I_p^{0} = I_c$, the clean image, and runs for the number of iterations determined by the formula $[\min(\epsilon + 4, 1.25\epsilon)]$.

Projected Gradient Descent (PGD).
Madry et al. proposed an attack called "projected gradient descent" (PGD), which is used in adversarial training [24]. Their PGD attack consists of initializing the search for an adversarial example at a random point within the allowed norm ball and then running several iterations of the basic iterative method [9] to find an adversarial example. The noisy initial point creates a stronger attack than previous iterative methods such as BIM, and performing adversarial training with this stronger attack makes their defense more successful.
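The relationship between BIM and PGD can be sketched in a few lines: both repeat a small signed-gradient step followed by a projection onto the $L_\infty$ ball of radius $\epsilon$, and PGD differs only in its random starting point. The sketch below reuses a toy logistic model in plain Python; the model, the step size `alpha`, the number of steps, and the [0, 1] pixel range are assumptions chosen for illustration, not the settings used in the experiments.

```python
import math
import random

def logit(x, w, b):
    """Pre-sigmoid score of a toy logistic model (illustrative only)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def grad_wrt_input(x, y, w, b):
    """Closed-form gradient of the cross-entropy loss with respect to x."""
    p = 1.0 / (1.0 + math.exp(-logit(x, w, b)))
    return [(p - y) * wi for wi in w]

def project(x_adv, x, eps):
    """Project onto the L_inf ball of radius eps around x, then onto [0, 1]."""
    boxed = [min(xi + eps, max(xi - eps, xa)) for xa, xi in zip(x_adv, x)]
    return [min(1.0, max(0.0, v)) for v in boxed]

def iterative_attack(x, y, w, b, eps, alpha, steps, rng=None):
    """BIM when rng is None (start at x); PGD when rng is given (random start)."""
    if rng is None:
        x_adv = list(x)
    else:
        x_adv = [xi + rng.uniform(-eps, eps) for xi in x]
    x_adv = project(x_adv, x, eps)
    for _ in range(steps):
        g = grad_wrt_input(x_adv, y, w, b)
        x_adv = [xa + alpha * ((gi > 0) - (gi < 0)) for xa, gi in zip(x_adv, g)]
        x_adv = project(x_adv, x, eps)
    return x_adv

# Hypothetical toy data: three "pixels", true label 1.
w, b = [2.0, -3.0, 1.0], 0.0
x, y = [0.6, 0.2, 0.5], 1
x_bim = iterative_attack(x, y, w, b, eps=0.3, alpha=0.05, steps=20)
x_pgd = iterative_attack(x, y, w, b, eps=0.3, alpha=0.05, steps=20,
                         rng=random.Random(0))
print("BIM logit:", logit(x_bim, w, b), "PGD logit:", logit(x_pgd, w, b))
```

On a linear toy model the random start makes no difference to the final point; its benefit appears on non-convex loss surfaces, where different starting points can reach different local maxima.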

Generator.
This generator network contains two stride-2 convolutions, several residual blocks, and two stride-2 deconvolutions. We employ the ResNet architecture in our generator, which allows low-level information to shortcut across the network, leading to better results. We use four blocks for 28 × 28 MNIST images and four blocks for 32 × 32 CIFAR10 images. The encoder-decoder architecture consists of Encoder: C8-C16-C32 and Decoder: C32-C16-C8, where C followed by a number denotes the number of channels of a layer.

Discriminator.
For the discriminator networks, we use three stride-2 convolutions. The output of the last convolutional layer is fed into a linear layer to generate a 1-dimensional output, followed by a sigmoid function. Our discriminator architecture is C8-C16-C32.
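As a rough check of what three stride-2 convolutions do to the two input sizes, the sketch below traces the feature-map shapes through a C8-C16-C32 stack using the standard convolution output-size formula. The kernel size of 4 and padding of 1 are assumptions made only for this illustration; the text above does not state them, and other choices would give different spatial sizes.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

def discriminator_shapes(size, channels=(8, 16, 32), kernel=4, stride=2, padding=1):
    """Return (channels, height, width) after each stride-2 convolution.

    kernel=4 and padding=1 are hypothetical values, not taken from the paper.
    """
    shapes = []
    for c in channels:
        size = conv_out(size, kernel, stride, padding)
        shapes.append((c, size, size))
    return shapes

def flattened_features(size):
    """Number of inputs the final linear layer would receive under these assumptions."""
    c, h, w = discriminator_shapes(size)[-1]
    return c * h * w

print("MNIST 28x28:", discriminator_shapes(28), flattened_features(28))
print("CIFAR10 32x32:", discriminator_shapes(32), flattened_features(32))
```

Under these assumed hyperparameters, the linear layer that produces the 1-dimensional output would need a different input width for MNIST and for CIFAR10, which is one reason separate discriminators are typically instantiated per dataset.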

Target Model.
For the target model, we trained different models on MNIST [25] and CIFAR10 [26], respectively. For MNIST, in all of our experiments, we generate adversarial examples for two models whose architectures are shown in Table 1. For CIFAR10, we select ResNet-18 [27] and VGG-16 [28] for our experiments. We show the classification accuracy on pristine MNIST and CIFAR10 test data in Table 2. The target model $F$ could be any given deep network with the last two layers accessible (e.g., the softmax layer and the layer before it). These two layers are used as a part of $L_{adv}$. To perform the adversarial attack, the loss $L_{adv}$ encourages the adversarial example $b$ to be misclassified by $F$ and the clean example $a$ to be correctly classified by $F$.

Training Details.
Our code and models will be available upon publication. We apply the loss of Carlini and Wagner [10] as our loss $L_{adv} = \max(\max_{i \neq t} F(x_A)_i - F(x_A)_t, k)$, where $t$ is the true class and $F$ represents the target network in the semiwhite-box setting. We set the confidence $k = 0$ and use Adam as our solver [29]. For $L_{GAN}$, as Zhu et al. [23] did, we replace the negative log-likelihood objective with a least-squares loss. This loss is more stable during training and generates higher-quality results. In particular, for a GAN loss $L_{GAN}(G, D, A, B)$, we train the generator $G$ to minimize $\mathbb{E}_{a \sim P_A}[(D(G(a)) - 1)^2]$ and train the discriminator $D$ to minimize $\mathbb{E}_{b \sim P_B}[(D(b) - 1)^2] + \mathbb{E}_{a \sim P_A}[D(G(a))^2]$.
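The two loss shapes used in training can be written out on plain lists of numbers to show what each term rewards. This is a sketch only: the function names are hypothetical, the discriminator outputs and class scores are passed in directly rather than computed by networks, and the margin loss follows the formula as stated in this section with $t$ the reference class and $k$ the confidence.

```python
def lsgan_generator_loss(d_fake):
    """Least-squares generator loss: mean of (D(G(a)) - 1)^2 over a batch."""
    return sum((d - 1.0) ** 2 for d in d_fake) / len(d_fake)

def lsgan_discriminator_loss(d_real, d_fake):
    """Least-squares discriminator loss: mean (D(b) - 1)^2 plus mean D(G(a))^2."""
    real_term = sum((d - 1.0) ** 2 for d in d_real) / len(d_real)
    fake_term = sum(d ** 2 for d in d_fake) / len(d_fake)
    return real_term + fake_term

def cw_margin_loss(scores, t, k=0.0):
    """Margin loss max(max_{i != t} F(x)_i - F(x)_t, k) for one example.

    With k = 0 the loss is zero once class t outscores every other class.
    """
    best_other = max(s for i, s in enumerate(scores) if i != t)
    return max(best_other - scores[t], k)

# Hypothetical discriminator outputs and class scores.
print(lsgan_generator_loss([0.9, 0.8]))
print(lsgan_discriminator_loss([0.9, 1.0], [0.1, 0.2]))
print(cw_margin_loss([0.5, 0.25, 0.25], t=1))
```

The least-squares terms are bounded below by zero and penalize confident mistakes quadratically, which is one common explanation for the more stable training reported with this loss.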

Implementation Details.
In our experiments, we use PyTorch for the implementation and run it on an NVIDIA Tesla V100 GPU cluster in an NVIDIA DGX Station. We train CycleAdvGAN for 100 epochs with a batch size of 64 and an initial learning rate of 0.01, decreased by 10% every 20 steps.
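The stated schedule can be read as a step decay, sketched below as a small function. Two points are assumptions here: "decreased by 10%" is interpreted as multiplying the rate by 0.9, and "step" is left abstract because the text does not say whether it counts epochs or iterations. In PyTorch, the same shape would correspond to `torch.optim.lr_scheduler.StepLR` with `step_size=20` and `gamma=0.9`.

```python
def learning_rate(step, base_lr=0.01, decay=0.10, every=20):
    """Step-decay schedule: multiply the rate by (1 - decay) once per `every` steps.

    Assumes "decreased by 10%" means a multiplicative factor of 0.9.
    """
    return base_lr * (1.0 - decay) ** (step // every)

for s in (0, 19, 20, 40, 99):
    print(s, learning_rate(s))
```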

Results and Discussion
In this section, we first evaluate CycleAdvGAN in both semiwhite-box and black-box settings on MNIST and CIFAR10. We then apply CycleAdvGAN to generate adversarial examples on different target models and test their attack success rate. We also recover adversarial examples to clean instances and test the recovery success rate, showing that our method achieves higher attack success rates as well as higher recovery success rates. We use classification accuracy to measure both attack and defense performance: lower accuracy indicates better attack performance, and higher accuracy indicates better defense performance. We generate all adversarial examples for the different attack methods under an $L_\infty$ bound of 0.3 on MNIST and 0.03 on CIFAR10.

CycleAdvGAN in Semiwhite-Box Setting.
In the semiwhite-box setting, there is no need to access the original target model after the generator is trained, in contrast to the traditional white-box setting. First, we use different architectures for the target model $F$ on MNIST, and ResNet-18 and VGG-16 on CIFAR10. We apply CycleAdvGAN in the semiwhite-box setting against each model on the MNIST and CIFAR10 datasets. As the classification accuracy of attacking and defending in Figure 2 shows, as the number of training iterations increases, the accuracy under attack on the test set continuously decreases, and the accuracy after defense continuously increases. This shows that the attack generator and the defense generator constantly improve their respective attack and defense capabilities through mutual confrontation during training, so that both the attack effect and the defense effect improve. It further shows that the attacker and the defender can be integrated under this framework. From the semiwhite-box performance (classification accuracy) in Table 3, we can see that CycleAdvGAN is able to generate adversarial instances that attack all models with a high attack success rate compared to other state-of-the-art attacks; for example, the classification accuracy on adversarial examples in the MNIST dataset decreased from 6.12% to 2.54%. This shows that CycleAdvGAN improves the attack effect and has better attack performance than the original attack method after being trained with adversarial examples crafted by only one kind of white-box attack method. Meanwhile, from the semiwhite-box performance (defense with $G_D$) in Table 3, we can also see that CycleAdvGAN is able to recover adversarial examples to clean samples and thus significantly improve classification accuracy; for example, the classification accuracy on the MNIST dataset improved from 6.12% to 98.12%.
This demonstrates the effectiveness of the defense and shows that our proposed CycleAdvGAN can achieve the integration of adversarial attack and defense.
A further analysis of the MNIST results in Table 3 shows that the B architecture has better resistance against adversarial attacks, especially against the FGSM method. This is because dropout [30] was added to the B architecture, which can enhance the robustness of the neural network. Attack ability and defense ability are highly correlated with the capacity of the learning model that generates the adversarial examples. For example, adversarial examples generated with the B architecture show good attack performance on the A architecture, and the A architecture has better classification accuracy after defending with $G_D$.
Because B architecture usually owns more complicated network structure with dropout, B architecture has better capability than A architecture in practice. And our proposed method can greatly increase the attack success rate, because our network learns the potential distribution of adversarial examples.
As is shown in Figure 3, the aggressivity can be added by G A to original samples, and the aggressivity of adversarial examples can be eliminated by G D .

CycleAdvGAN in Black-Box Setting.
In this section, we evaluate the transferability of CycleAdvGAN in black-box attack settings. In black-box attacks, we train CycleAdvGAN on certain target models and optimize the generators accordingly. Once CycleAdvGAN is trained, we generate adversarial examples or recover them through $G_A$ and $G_D$. Tables 4 and 5 show the classification accuracy on the MNIST and CIFAR10 datasets, respectively, when transferring attacks between different classification models. We first train CycleAdvGAN for a source attacked model with a certain attack method and then apply the resulting instances to the target models. For example, in Table 4, the MNIST classification accuracy for examples generated by BIM rises from 28% to 98% after defending with $G_D$. From the results, we can draw the following conclusions.
(i) The transferability of $G_A$ is better than that of the corresponding attack methods.
(ii) After defending with $G_D$, the classification accuracy is significantly improved even though the source and target models are different, which means that defense with our CycleAdvGAN also performs well in the black-box setting.
(iii) From the above two points, our proposed CycleAdvGAN method can be applied effectively in practice.

High Transferability of Adversarial Examples Analysis.
In Table 6, the first three columns give the classification accuracy of adversarial examples generated by a certain attack method after defense with a $G_D$ trained on adversarial datasets generated by other attack methods. The last column gives the classification accuracy of clean samples after defense with a $G_D$ trained on adversarial datasets generated by the different attack methods. The results show that, although CycleAdvGAN is trained with an adversarial dataset crafted by one attack method, it still defends against another attack method, with $G_D$ recovering the adversarial example to a clean example. Meanwhile, feeding clean examples to $G_D$ does not decrease the original classification accuracy of the target model. This result further demonstrates that adversarial and clean data are not twins and follow two different distributions, because CycleAdvGAN can learn a similar adversarial distribution from different adversarial example datasets crafted by different attack methods.
To interpret why CycleAdvGAN demonstrates better transferability, we further examine the update directions given by FGSM, BIM, PGD, and CycleAdvGAN along the iterations. We calculate the cosine similarity [31] of two successive perturbations and show the results in Table 7 for attacks on the MNIST and CIFAR10 datasets. The update direction of CycleAdvGAN is more stable than that of BIM, as indicated by the larger cosine similarity for CycleAdvGAN. Recall that the transferability comes from the fact that the model learns the distribution of adversarial examples rather than a perturbation for a particular sample, resulting in better transferability for black-box attacks. Another interpretation is that the GAN can reconstruct images naturally, which may be helpful for transferability.
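The stability measure used above can be sketched as follows: flatten each perturbation to a vector and compute the cosine similarity between each pair of successive perturbations. The helper names and the example vectors are hypothetical; the convention of returning 0 for a zero-norm perturbation is an assumption made to keep the function total.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two flattened perturbation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for an all-zero perturbation
    return dot / (norm_u * norm_v)

def successive_similarities(perturbations):
    """Cosine similarity between each perturbation and the next one."""
    return [cosine_similarity(p, q)
            for p, q in zip(perturbations, perturbations[1:])]

# Hypothetical perturbations from three consecutive iterations.
steps = [[0.1, -0.1, 0.1], [0.1, -0.1, 0.05], [-0.1, 0.1, 0.1]]
print(successive_similarities(steps))
```

Values close to 1 mean that consecutive updates point in nearly the same direction, which is the sense in which an update direction is called stable here.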

The Better Efficiency of Generating Adversarial Examples.
In general, as shown in Table 8, CycleAdvGAN has obvious advantages on the integration of adversarial attack and defense over other white-box and black-box methods. For instance, regarding computation efficiency, CycleAdvGAN performs much faster than others even including the efficient FGSM in attack, although CycleAdvGAN needs extra training time to train the generator. Besides, FGSM and optimization methods can only perform white-box attack, while CycleAdvGAN is able to attack in semiwhite-box setting.

Conclusions
From the mechanism of generating adversarial examples, since there is a companion relationship between the original sample and the adversarial example, they can be regarded as a sample pair, following different distributions but interrelated. In this paper, we have proposed CycleAdvGAN to generate adversarial examples and recover adversarial examples in a cycle-consistent manner. More importantly, we show that the proposed objective enhances both the attack effect and the defense effect through the integration of adversarial attack and defense. We show that it can improve the attack effect when trained only on the adversarial dataset generated by the corresponding adversarial attack method.
Further, in our CycleAdvGAN framework, once trained, the G A can produce adversarial perturbations and G D can eliminate adversarial perturbation; both done efficiently. In addition, the CycleAdvGAN can work with other attacks or defense strategies, not conflicting with other existing frameworks. It can also perform both semiwhite-box and black-box settings with high attack success rate as well as defense success rate.
More importantly, we demonstrated that adversarial and clean data are not twins and follow two different distributions. Consequently, CycleAdvGAN can better learn their latent distributions, instead of targeting a particular image, and thereby improve transferability. The significant transfer performance achieved by our crafted perturbations can pose a substantial threat to deep-learned systems in terms of black-box attacks. Therefore, this is an important research direction to focus on in order to build reliable deep learning systems.

Data Availability
Detailed information about MNIST and CIFAR10 is provided in previous studies, and the public datasets can be downloaded from https://www.cs.toronto.edu/~kriz/cifar.html and http://yann.lecun.com/exdb/mnist/.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.