Interpolated Adversarial Training: Achieving Robust Neural Networks without Sacrificing Too Much Accuracy

Adversarial robustness has become a central goal in deep learning, both in theory and in practice. However, successful methods for improving adversarial robustness (such as adversarial training) greatly hurt generalization performance on the unperturbed data. This could have a major impact on how adversarial robustness affects real-world systems: many practitioners may opt to forgo robustness if it costs accuracy on unperturbed data. We propose Interpolated Adversarial Training, which employs recently proposed interpolation-based training methods within the framework of adversarial training. On CIFAR-10, adversarial training increases the standard test error (when there is no adversary) from 4.43% to 12.32%, whereas with our Interpolated Adversarial Training we retain adversarial robustness while achieving a standard test error of only 6.45%. With our technique, the relative increase in the standard test error for the robust model is reduced from 178.1% to just 45.5%. Moreover, we provide a mathematical analysis of Interpolated Adversarial Training that confirms its effectiveness and demonstrates its advantages in terms of robustness and generalization.


Introduction
Deep neural networks have been highly successful across a variety of tasks. This success has driven applications in areas where reliability and security are critical, including face recognition (Sharif, Bhagavatula, Bauer, & Reiter, 2017), self-driving cars (Bojarski et al., 2016), health care, and malware detection (LeCun, Bengio, & Hinton, 2015). Security concerns emerge when adversaries of the system stand to benefit from the system performing poorly. Work on adversarial examples (Szegedy, Zaremba, Sutskever, Bruna, Erhan, Goodfellow, & Fergus, 2013) has shown that neural networks are vulnerable to attacks that perturb the data in imperceptible ways. Many defenses have been proposed, but most of them rely on obfuscated gradients (Athalye, Carlini, & Wagner, 2018), giving a false illusion of defense by lowering the quality of the gradient signal without actually improving robustness. Of these defenses, only adversarial training (Kurakin, Goodfellow, & Bengio, 2016b) remained effective after the problem of obfuscated gradients was addressed.
However, adversarial training has a major disadvantage: it drastically reduces the generalization performance of the networks on unperturbed data samples, especially for small networks. For example, Madry, Makelov, Schmidt, Tsipras, and Vladu (2017) report that adding adversarial training to a specific model increases the standard test error from 6.3% to 21.6% on CIFAR-10. This phenomenon makes adversarial training difficult to use in practice. If the tension between performance and security turns out to be irreconcilable, then many systems would either need to perform poorly or accept vulnerability, a situation with serious negative consequences.
Our contribution: We propose augmenting adversarial training with interpolation-based training as a solution to the above problem.
• We demonstrate that our approach does not suffer from the obfuscated gradients problem by performing black-box attacks on models trained with our approach: Section 5.2.
• We perform PGD attacks with a larger number of steps (up to 1000) and larger values of the maximum allowed perturbation/distortion ε, demonstrating that the adversarial robustness of our approach remains at the same level as that of adversarial training: Section 5.3.
• We demonstrate that networks trained with our approach have lower complexity, resulting in improved standard test error: Section 6.
• We mathematically analyze the benefit of the proposed method in terms of robustness and generalization. For robustness, we show that Interpolated Adversarial Training corresponds to approximately minimizing an upper bound of the adversarial loss with additional adversarial perturbations. This explains why models obtained by the proposed method preserve the adversarial robustness and can sometimes further improve the robustness when compared to standard adversarial training. For generalization, we prove a new generalization bound for Interpolated Adversarial Training and analyze the benefits of the proposed method.

Related work
The trade-off between standard test error and adversarial robustness has been studied in Madry et al. (2017), Raghunathan, Xie, Yang, Duchi, and Liang (2019), Tsipras, Santurkar, Engstrom, Turner, and Madry (2018) and Zhang, Yu, Jiao, Xing, Ghaoui, and Jordan (2019a). While Madry et al. (2017), Tsipras et al. (2018) and Zhang et al. (2019a) demonstrate this trade-off empirically, Tsipras et al. (2018) and Zhang et al. (2019a) also demonstrate it theoretically on constructed learning problems. Furthermore, Raghunathan et al. (2019) study this trade-off from the point of view of the statistical properties of the robust objective (Ben-Tal, El Ghaoui, & Nemirovski, 2009) and the dynamics of optimizing a robust objective on a neural network, and suggest that adversarial training requires more data to obtain a lower standard test error. Our results on the SVHN, CIFAR-10, and CIFAR-100 datasets (Section 5.1) also consistently show higher standard test error with PGD adversarial training.
Tsipras et al. (2018) presented data-dependent proofs showing that, on certain artificially constructed distributions, it is impossible for a robust classifier to generalize as well as a non-robust classifier. How this relates to our results is an intriguing question. Our results suggest that the generalization gap between adversarial training and non-robust models can be substantially reduced through better algorithms, but it remains possible that closing this gap entirely on some datasets is impossible. An important question for future work is how much of this generalization gap can be explained in terms of inherent data properties and how much can be addressed through better models.
Neural Architecture Search (Zoph & Le, 2016) was used by Cubuk, Zoph, Schoenholz, and Le (2018) to find architectures which achieve high robustness to PGD attacks as well as better test error on the unperturbed data; a direct comparison to our method is given in Table 2. However, the method of Cubuk et al. (2018) is computationally very expensive, as each experiment requires training thousands of models to search for optimal architectures (9360 child models, each trained for 10 epochs, in Cubuk et al., 2018), whereas our method involves no significant additional computation.
In our work we primarily concern ourselves with adversarial training, but techniques in the research area of provable defenses have also shown a trade-off between robustness and generalization on unperturbed data. For example, the dual network defense of Kolter and Wong (2017) reported a 20.38% standard test error on SVHN for their provably robust convolutional network (most non-robust models are well under 5% test error on SVHN). Wong, Schmidt, Metzen, and Kolter (2018) reported a best standard test accuracy of 29.23% using a convolutional ResNet on CIFAR-10 (most non-robust ResNets have accuracy well over 90%). Our goal here is not to criticize this work, as developing provable defenses is a challenging and important area; rather, it is to show that the problem we explore with Interpolated Adversarial Training (on adversarial-training-type defenses of Madry et al., 2017) is just as severe with provable defenses. Understanding whether the insights developed here carry over to provable defenses could be an interesting area for future work.
Adversarially robust generalization: Another line of research concerns adversarially robust generalization: the performance of adversarially trained networks on adversarial test examples. Schmidt, Santurkar, Tsipras, Talwar, and Madry (2018) observe that a higher sample complexity is needed for better adversarially robust generalization. Yin, Ramchandran, and Bartlett (2018) demonstrate that adversarial training results in higher-complexity models and hence poorer adversarially robust generalization. Furthermore, Schmidt et al. (2018) suggest that adversarially robust generalization requires more data, and Carmon, Raghunathan, Schmidt, Liang, and Duchi (2019) and Zhai et al. (2019) demonstrate that unlabeled data can be used to improve adversarially robust generalization. In contrast, in this work we focus on improving the generalization performance on unperturbed samples (standard test error), while maintaining robustness to unseen adversarial examples at the same level.

The empirical risk minimization framework
Let us consider a general classification task with an underlying data distribution D, which consists of examples x ∈ X and corresponding labels y ∈ Y. The task is to learn a function f : X → Y such that for a given x, f outputs the corresponding y. This can be done by minimizing the risk E_{(x,y)∼D}[L(x, y, θ)], where L(x, y, θ) is a suitable loss function (for instance, the cross-entropy loss) and θ ∈ R^p is the set of parameters of the function f. Since this expectation cannot be computed exactly, a common approach is to minimize the empirical risk (1/N) Σ_{i=1}^{N} L(x_i, y_i, θ), taking into account only a finite number of examples (x_1, y_1), . . . , (x_N, y_N) drawn from the data distribution D.
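As a concrete illustration (a minimal sketch of ours, not code from the paper; a linear model stands in for f and cross-entropy for L), the empirical risk is simply the average loss over the N training samples:

```python
import numpy as np

def cross_entropy(logits, y):
    # numerically stable softmax cross-entropy for integer labels y
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y]

def empirical_risk(theta, X, y):
    # (1/N) * sum_i L(x_i, y_i, theta) for a linear model f(x) = x @ theta
    logits = X @ theta
    return cross_entropy(logits, y).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))          # N = 8 samples, 4 features
y = rng.integers(0, 3, size=8)       # 3 classes
theta = np.zeros((4, 3))             # all-zero parameters -> uniform predictions
risk = empirical_risk(theta, X, y)   # equals log(3) for uniform predictions
```

With all-zero parameters the predicted class distribution is uniform, so the empirical risk equals log(3), matching the definition above term by term.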

Adversarial attacks and robustness
While the empirical risk minimization framework has been very successful and often leads to excellent generalization on unperturbed test examples, it has the significant limitation that it does not guarantee good performance on examples which are carefully perturbed to fool the model (Goodfellow, Shlens, & Szegedy, 2014; Szegedy et al., 2013). That is, the empirical risk minimization framework suffers from a lack of robustness to adversarial attacks. Madry et al. (2017) proposed an optimization view of adversarial robustness, in which the adversarial robustness of a model is defined as a min-max problem. Using this view, the parameters θ of a function f are learned by minimizing ρ(θ) as described in Eq. (1): ρ(θ) = E_{(x,y)∼D} [ max_{δ∈S} L(θ, x + δ, y) ]. (1) S defines a region of points around each example, which is typically selected so that it only contains visually imperceptible perturbations.
Adversarial attacks can be broadly divided into two categories: single-step attacks and multi-step attacks. We evaluated the performance of our model as a defense against the most popular and well-studied adversarial attack from each category. First, we consider the Fast Gradient Sign Method (Goodfellow et al., 2014), which is a single-step attack and can still be effective against many networks. Second, we consider the projected gradient descent attack (Kurakin et al., 2016b), which is a multi-step attack. It is slower than FGSM as it requires many iterations, but it has been shown to be a much stronger attack (Madry et al., 2017). We briefly describe these two attacks as follows. Fast Gradient Sign Method (FGSM). The Fast Gradient Sign Method (Goodfellow et al., 2014) produces ℓ∞-bounded adversaries by following the sign of the gradient-based perturbation: x_adv = x + ε · sign(∇_x L(θ, x, y)). This attack is cheap, since it only requires computing the gradient once, and is often effective against deep networks (Goodfellow et al., 2014; Madry et al., 2017), especially when no adversarial defenses are employed.
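The single FGSM step can be sketched as follows (an illustrative toy of ours, not the paper's code: a linear binary classifier stands in for the network, so the input gradient has a closed form):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(w, x, y):
    # binary cross-entropy of a linear model sigma(w . x)
    p = sigmoid(w @ x)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def input_grad(w, x, y):
    # gradient of the loss with respect to the input x (closed form for a linear model)
    return (sigmoid(w @ x) - y) * w

def fgsm(w, x, y, epsilon):
    # FGSM: one step of size epsilon along the sign of the input gradient
    return x + epsilon * np.sign(input_grad(w, x, y))

w = np.array([1.0, -2.0])
x = np.array([0.5, 0.5])
y = 1.0
x_adv = fgsm(w, x, y, epsilon=0.1)
```

Because the attack moves each coordinate by exactly ±ε, the perturbation is ℓ∞-bounded by ε, and for this linear model the loss at x_adv is never smaller than the clean loss.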
Projected Gradient Descent (PGD). The projected gradient descent attack (Madry et al., 2017), sometimes referred to as FGSM^k, is a multi-step extension of the FGSM attack: x^{t+1} = Π_{x+S}(x^t + α · sign(∇_x L(θ, x^t, y))), initialized with x^0 as the clean input x. S formalizes the manipulative power of the adversary. Π refers to the projection operator, which in this context means projecting the adversarial example back onto the region within an S radius of the original data point after each step of size α of the adversarial attack.
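A minimal sketch of the PGD loop (again a toy of ours on a linear model so the input gradient is exact; α, ε, and the step count are the attack hyperparameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def input_grad(w, x, y):
    # gradient of binary cross-entropy wrt the input, for a linear model sigma(w . x)
    return (sigmoid(w @ x) - y) * w

def pgd(w, x, y, epsilon, alpha, steps):
    x_adv = x.copy()  # x^0 is the clean input
    for _ in range(steps):
        # ascend along the gradient sign with step size alpha ...
        x_adv = x_adv + alpha * np.sign(input_grad(w, x_adv, y))
        # ... then project back onto the l_inf ball of radius epsilon around x
        x_adv = np.clip(x_adv, x - epsilon, x + epsilon)
    return x_adv

w = np.array([1.0, -2.0])
x = np.array([0.5, 0.5])
x_adv = pgd(w, x, y=1.0, epsilon=0.1, alpha=0.04, steps=7)
```

With 7 steps of size 0.04 the cumulative movement would exceed ε = 0.1, so the projection (the `np.clip` line) is what keeps the final example inside the allowed region.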

Gradient obfuscation by adversarial defenses
Many approaches have been proposed as a defense against adversarial attacks. A significant challenge with evaluating defenses against adversarial attacks is that many attacks rely upon a network's gradient. The defense methods which reduce the quality of this gradient, either by making it flatter or noisier can lead to methods which lower the effectiveness of gradient-based attacks, but which are not actually robust to adversarial examples (Athalye, Engstrom, Ilyas, & Kwok, 2017;Papernot, McDaniel, Sinha, & Wellman, 2016). This process, which has been referred to as gradient masking or gradient obfuscation, must be analyzed when studying the strength of an adversarial defense.
One method for examining the extent to which an adversarial defense gives deceptively good results as a result of gradient obfuscation relies on the observation that black-box attacks are a strict subset of white-box attacks, so white-box attacks should always be at least as strong as black-box attacks. If a method reports a much better defense against white-box attacks than against black-box attacks, it suggests that the selected white-box attack is underpowered as a result of gradient obfuscation. Another test for gradient obfuscation is to run an iterative search, such as projected gradient descent (PGD), with an unlimited range for a large number of iterations. If such an attack is not completely successful, it indicates that the model's gradients are not an effective method for searching for adversarial images, and that gradient obfuscation is occurring. We demonstrate successful results with Interpolated Adversarial Training on these sanity checks in Section 5.2. Still another test is to confirm that iterative attacks with small step sizes always outperform single-step attacks with larger step sizes (such as FGSM). If this is not the case, it may suggest that the iterative attack becomes stuck in regions where optimization using gradients is poor due to gradient masking. In all of our experiments with Interpolated Adversarial Training, we found that iterative PGD attacks with smaller step sizes and more iterations were always stronger than FGSM attacks (which take a single large step) against our models, as shown in Tables 2-7.

Adversarial training
Adversarial training (Goodfellow et al., 2014) consists of crafting adversarial examples and using them during training to increase robustness against unseen adversarial examples. To scale adversarial training to large datasets and large models, the adversarial examples are often crafted using fast single-step methods such as FGSM. However, adversarial training with fast single-step methods remains vulnerable to stronger multi-step attacks such as PGD. Thus, in this work, we consider adversarial training with PGD.

Interpolated adversarial training
We propose Interpolated Adversarial Training (IAT), which trains on interpolations of adversarial examples along with interpolations of unperturbed examples. We use the techniques of Mixup (Zhang et al., 2017) and Manifold Mixup as ways of interpolating examples. Learning is performed in the following four steps when training a network with Interpolated Adversarial Training. In the first step, we compute the loss from an unperturbed (non-adversarial) batch, with interpolations based on either Mixup or Manifold Mixup. In the second step, we generate a batch of adversarial examples using an adversarial attack, such as Projected Gradient Descent (PGD) (Madry et al., 2017) or the Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2014). In the third step, we train against these adversarial examples with the original labels, again with interpolations based on either Mixup or Manifold Mixup. In the fourth step, we average the loss from the unperturbed batch and the adversarial batch and update the network parameters using this loss. Note that following Kurakin, Goodfellow, and Bengio (2016a) and Tsipras et al. (2018), we use both the unperturbed and adversarial samples to train the model in Interpolated Adversarial Training, and we do the same in our baseline adversarial training models as well. The detailed algorithm is described in Algorithm Block 1.
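The four steps above can be sketched end to end as follows (a toy numpy version of ours with a linear binary classifier and input-space Mixup; names such as `iat_step` are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(p, y):
    # binary cross-entropy; y may be a soft (interpolated) label in [0, 1]
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def input_grad(w, X, y):
    # per-example gradient of the loss wrt the inputs, for a linear model
    return (sigmoid(X @ w) - y)[:, None] * w

def pgd(w, X, y, eps, alpha, steps):
    X_adv = X.copy()
    for _ in range(steps):
        X_adv = X_adv + alpha * np.sign(input_grad(w, X_adv, y))
        X_adv = np.clip(X_adv, X - eps, X + eps)  # project onto the l_inf ball
    return X_adv

def mixup_loss(w, X, y, lam, perm):
    # interpolate inputs and labels with the same lambda and partner permutation
    X_mix = lam * X + (1 - lam) * X[perm]
    y_mix = lam * y + (1 - lam) * y[perm]
    return bce(sigmoid(X_mix @ w), y_mix).mean()

def iat_step(w, X, y, eps=0.1, alpha=0.05, steps=3, mix_alpha=1.0):
    lam = rng.beta(mix_alpha, mix_alpha)
    perm = rng.permutation(len(X))
    clean_loss = mixup_loss(w, X, y, lam, perm)      # step 1: interpolated clean loss
    X_adv = pgd(w, X, y, eps, alpha, steps)          # step 2: craft adversarial batch
    adv_loss = mixup_loss(w, X_adv, y, lam, perm)    # step 3: interpolated adversarial loss
    return 0.5 * (clean_loss + adv_loss)             # step 4: average the two losses

w = np.array([0.5, -1.0, 0.25])
X = rng.normal(size=(16, 3))
y = rng.integers(0, 2, size=16).astype(float)
loss = iat_step(w, X, y)
```

In a real training loop the returned loss would be differentiated with respect to the network parameters and used for an SGD update, as in Algorithm Block 1.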
As Interpolated Adversarial Training combines adversarial training with either Mixup (Zhang et al., 2017) or Manifold Mixup, we summarize these supporting methods in more detail. The Mixup method (Zhang et al., 2017) consists of drawing a pair of samples from the dataset, (x_i, y_i) ∼ p_D and (x_j, y_j) ∼ p_D, and then taking a random linear interpolation in the input space: x̃ = λx_i + (1 − λ)x_j. This λ is sampled randomly on each update (typically from a Beta distribution). The network f_θ is then run forward on the interpolated input x̃ and trained using the same linear interpolation of the losses: L̃ = λL(f_θ(x̃), y_i) + (1 − λ)L(f_θ(x̃), y_j). Here L refers to a loss function such as cross-entropy.
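Sketched in numpy (our illustration: inputs and one-hot labels are interpolated with the same λ drawn from a Beta distribution; pairing each example with a shuffled copy of the batch is a common way to implement the pair sampling):

```python
import numpy as np

rng = np.random.default_rng(1)

def mixup_batch(X, Y, alpha=1.0):
    # one lambda per batch, drawn from Beta(alpha, alpha)
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(X))          # random partner for each example
    X_mix = lam * X + (1 - lam) * X[perm]   # interpolate inputs
    Y_mix = lam * Y + (1 - lam) * Y[perm]   # interpolate one-hot labels
    return X_mix, Y_mix, lam

X = np.arange(12, dtype=float).reshape(6, 2)
Y = np.eye(3)[[0, 1, 2, 0, 1, 2]]           # one-hot labels for 3 classes
X_mix, Y_mix, lam = mixup_batch(X, Y)
```

Since each mixed label is a convex combination of two one-hot vectors, every row of `Y_mix` still sums to 1, so the interpolated targets remain valid label distributions.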
Algorithm 1: Interpolated Adversarial Training.
for each training batch do
    Compute loss on unperturbed data using Mixup (or Manifold Mixup)
    Run attack (e.g. PGD as in Madry et al., 2017)
    Compute adversarial loss on adversarial samples using Mixup (or Manifold Mixup)
    Update parameters using gradients g (e.g. SGD)
end for
The Manifold Mixup method is closely related to Mixup from a computational perspective, except that the layer at which the interpolation is performed is selected randomly on each training update.
Adversarial training consists of generating adversarial examples and training the model to give these points the original label. For generating these adversarial examples during training, we used the Projected Gradient Descent (PGD) attack, which is also known as iterative FGSM. This attack consists of repeatedly updating an adversarial perturbation by moving in the direction of the sign of the gradient multiplied by some step size, while projecting back onto an ℓ∞ ball by clipping the perturbation to a maximum of ε. Both α, the step size on each iteration, and the number of iterations are hyperparameters of the attack.
Why Interpolated Adversarial Training helps to improve the standard test accuracy: We present two arguments for why Interpolated Adversarial Training can improve standard test accuracy. Increasing the training set size: Raghunathan et al. (2019) have shown that adversarial training could require more training samples to attain a higher standard test accuracy. Mixup (Zhang et al., 2017) and Manifold Mixup can be seen as techniques that increase the effective size of the training set by creating novel training samples. Hence these techniques can be useful in improving standard test accuracy.
Information compression: Shwartz-Ziv and Tishby (2017) and Tishby and Zaslavsky (2015) have shown a relationship between the compression of information in the features learned by deep networks and generalization. Their work relates the degree to which deep networks compress the information in their hidden states to bounds on generalization, with stronger bounds when the networks exhibit stronger compression.
To evaluate the effect of adversarial training on the compression of information in the features, we performed an experiment in which we take the representations learned after training and study how well these frozen representations can predict fixed random labels. If the model compresses the representations well, then it should be harder to fit random labels. In particular, we ran a small 2-layer MLP on top of the learned representations to fit random binary labels. In all cases we trained the model with the random labels for 200 epochs with the same hyperparameters. Results for fitting 10000 randomly labeled examples are reported in Table 1.

Table 1 Soft rank (sum of singular values divided by the largest singular value) of the representations (following the first layer) from models trained with various methods, reported separately per MNIST class. FGSM and PGD refer to models trained with adversarial training. We note that FGSM slightly increases the numerical rank, but PGD (a much stronger attack) often increases it dramatically. These results suggest that adversarial training causes the learned representations to be less compressed, which may be the reason for poor standard test accuracy. At the same time, IAT with Manifold Mixup significantly reduces the model's tendency to learn less compressed features, which may potentially improve standard test accuracy.
To provide further evidence for a difference in compression characteristics, we trained 5-layer fully-connected models on MNIST and considered a bottleneck layer of 30 units directly following the first hidden layer. We then performed singular value decomposition on the per-class representations and examined the spectrum of singular values (Fig. 1 and Table 1). We found that PGD dramatically increased the number of large singular values relative to a baseline model (FGSM was somewhere in between the baseline and PGD).
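The soft rank used in Table 1 (sum of singular values divided by the largest) can be computed directly from a per-class representation matrix; a minimal sketch:

```python
import numpy as np

def soft_rank(H):
    # H: (num_examples, num_units) matrix of hidden representations for one class;
    # soft rank = sum of singular values / largest singular value
    s = np.linalg.svd(H, compute_uv=False)
    return s.sum() / s.max()

# sanity checks: an identity matrix has soft rank equal to its full rank,
# while a rank-1 matrix (all rows identical) has soft rank 1
full = np.eye(5)
rank1 = np.tile(np.array([1.0, 2.0, 3.0]), (4, 1))
```

More compressed (lower-rank) representations concentrate their energy in a few singular values, which the soft rank captures as a smaller number.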

Adversarial robustness
The goal of our experiments is to provide empirical support for our two major assertions: that adversarial training hurts performance on unperturbed data (which is consistent with what has been previously observed in Madry et al., 2017; Tsipras et al., 2018; Zhang et al., 2019a), and that this difference can be reduced with our Interpolated Adversarial Training method. Finally, we want to show that Interpolated Adversarial Training is adversarially robust and does not suffer from gradient obfuscation.
In our experiments we always perform adversarial training using a 7-step PGD attack but we evaluate on a variety of attacks: FGSM, PGD (with a varying number of steps and hyperparameters), the Carlini-Wagner attack (Carlini & Wagner, 2016), and the AutoAttack (Croce & Hein, 2020).

Architecture and Datasets:
We conducted experiments on competitive networks to demonstrate that Interpolated Adversarial Training can improve generalization performance without sacrificing adversarial robustness. We used two architectures: first, the WideResNet architecture (He, Zhang, Ren, & Sun, 2015; Zagoruyko & Komodakis, 2016) used in Madry et al. (2017) for adversarial training²; second, the PreActResNet18 architecture, which is a variant of the residual architecture of He et al. (2015). We used SGD with momentum as the optimizer. We ran the experiments for 200 epochs with an initial learning rate of 0.1, annealed by a factor of 0.1 at epochs 100 and 150. We used a batch size of 64 for all experiments.
We used three benchmark datasets (CIFAR10, CIFAR100 and SVHN), which are commonly used in the adversarial robustness literature.

² While Madry et al. (2017) use the WRN32-10 architecture, we use the standard WRN28-10 architecture, so our results are not directly comparable to theirs.
The CIFAR-10 dataset has ten classes, which include pictures of cars, horses, airplanes and deer. The CIFAR-100 dataset has one hundred classes grouped into 20 superclasses such as people, trees, vehicles, etc. The SVHN dataset consists of 73257 training samples and 26032 test samples each of size 32 × 32. Each example is a close-up image of a house number (the ten classes are the digits from 0-9).
Data Pre-Processing and Hyperparameters: The data augmentation and pre-processing are exactly the same as in Madry et al. (2017). Namely, we use random cropping and horizontal flips for CIFAR10 and CIFAR100. For SVHN, we use random cropping. We use per-image standardization for pre-processing. For adversarial training, we generated the adversarial examples using a PGD adversary: ℓ∞ projected gradient descent with 7 steps of size 2 and ε = 8. For the adversarial attacks, we used an FGSM adversary with ε = 8 and PGD adversaries with 7 steps and 20 steps of size 2 and ε = 8.
In the Interpolated Adversarial Training experiments, we generated the adversarial examples using PGD with the same hyper-parameters as described above. For performing the interpolation, we used either Manifold Mixup with α = 2.0, as suggested by its authors, or Mixup with α = 1.0, as suggested in Zhang et al. (2017). For Manifold Mixup, we performed the interpolation at a layer chosen randomly from among the input layer, the output of the first resblock, and the output of the second resblock, as recommended by its authors.
Results: The results for the CIFAR10, CIFAR100 and SVHN datasets are presented in Tables 2-3, 4-5 and 6-7, respectively. We observe that IAT consistently improves standard test error relative to models using just adversarial training, while maintaining adversarial robustness at the same level. For example, in Table 2, we observe that the baseline model (no adversarial training) has a standard test error of 4.43%, whereas PGD adversarial training increases the standard test error to 12.32%: a relative increase of 178% in standard test error.

Table 2 CIFAR10 results (error in %) for white-box attacks on WideResNet28-10, evaluated on the test data. The rows correspond to the training mechanism and the columns to adversarial attack methods. The upper part of the table consists of training mechanisms that do not employ any explicit adversarial defense. The lower part consists of methods that employ adversarial training as a defense mechanism. For PGD, we used ℓ∞ projected gradient descent with step size α = 2 and ε = 8. For FGSM, we used ε = 8. Our method of Interpolated Adversarial Training improves standard test error in comparison to adversarial training (refer to the first column) and maintains adversarial robustness at the same level as adversarial training. The method of Cubuk et al. (2018) is close to ours in terms of standard test error and adversarial robustness; however, it needs several orders of magnitude more computation (it trains 9360 models) for its neural architecture search.

We also compared IAT against FSAT and TRADES (Zhang, Yu, Jiao, Xing, Ghaoui, & Jordan, 2019b). We found that FSAT and TRADES had higher standard test errors, of 10.49% and 27.5% respectively, than IAT, which had a standard test error of 10.12%. We also compared the robustness of IAT-, FSAT- and TRADES-trained models to PGD (7 steps) and AutoAttack.
We found that the IAT trained model had lower robustness to PGD (7 steps) attack than FSAT and TRADES, with IAT having a PGD (7 steps) error of 55.43%, FSAT having a PGD (7 steps) error of 49.48%, and TRADES having a PGD (7 steps

Transfer attacks
As a sanity check that Interpolated Adversarial Training does not suffer from gradient obfuscation, we performed a transfer attack evaluation on the CIFAR-10 dataset using the PreActResNet18 architecture. In this type of evaluation, the model used to generate the adversarial examples is different from the model used to evaluate the attack. As these transfer attacks do not use the target model's parameters to compute the adversarial examples, they are considered black-box attacks. In our evaluation (Table 8) we found that black-box transfer attacks were always substantially weaker than white-box attacks; hence Interpolated Adversarial Training does not suffer from gradient obfuscation. Additionally, in Table 9, we observe that increasing ε results in a 100% attack success rate, providing further evidence that Interpolated Adversarial Training does not suffer from gradient obfuscation.

Varying the number of iterations and ε for iterative attacks
To further study the robustness of Interpolated Adversarial Training, we studied the effect of changing the number of attack iterations and the perturbation bound ε of the adversarial attack. Some adversarial defenses (Engstrom, Ilyas, & Athalye, 2018) have been found to have increasing vulnerability when exposed to attacks with a large number of iterations. We studied this (Table 10) and found that both adversarial training and Interpolated Adversarial Training have robustness which declines only slightly with an increasing number of steps, with almost no difference between the 100-step attack and the 1000-step attack. Additionally, we varied ε to study whether Interpolated Adversarial Training was more or less vulnerable to attacks with an ε different from what the model was trained on. We found that Interpolated Adversarial Training is somewhat more robust when using a smaller ε and slightly less robust when using a larger ε (Table 9).

Table 8 Transfer attack evaluation of Interpolated Adversarial Training on CIFAR-10, reported in terms of error rate (%). Here we consider three trained models, using normal adversarial training (Adv), IAT with Mixup (IAT-M), and IAT with Manifold Mixup (IAT-MM). In each experiment, we generate adversarial examples using only the model listed in the column and then evaluate these adversarial examples on the target model listed in the row. Note that in all of our experiments white-box attacks (where the attacking and target models are the same) led to stronger attacks than black-box attacks, which is evidence that our approach does not suffer from gradient obfuscation.

Analysis of weighting of loss terms
IAT introduces a hyperparameter for the weighting of the clean loss and the adversarial loss, which by default is set to an even weighting of both terms. We found that weighting the clean loss more improved clean test accuracy, while weighting the adversarial loss more improved adversarial robustness. These results are shown in Table 11.

Theoretical analysis
In this section, we establish mathematical properties of IAT with Mixup. We begin in Section 6.1 with additional notation and then analyze the effect of IAT on adversarial robustness in Section 6.2. Moreover, we discuss the effects of IAT on generalization in Section 6.3 by showing how IAT can reduce overfitting and lead to better generalization behavior. The proofs of all theorems and propositions are presented in Appendix B, using a key lemma proven in Appendix A.

Notation
In order to present our analysis succinctly, we introduce additional notation as follows. Let x̃_{i,j}(λ) = λx_i + (1 − λ)x_j and ỹ_{i,j}(λ) = λy_i + (1 − λ)y_j with λ ∈ [0, 1], and let D_λ represent the Beta distribution Beta(α, β) with hyper-parameters α, β > 0. The standard mixup loss L_c is then the expected loss ℓ(f_θ(x̃_{i,j}(λ)), ỹ_{i,j}(λ)) over pairs of training examples and λ ∼ D_λ. Similarly, the adversarial-mixup loss L_a used in IAT is defined in the same way with adversarially perturbed inputs in place of the clean inputs. Using these two types of losses, the whole IAT loss is defined as (L_c + L_a)/2. In this section, we focus on the following family of loss functions: ℓ(q, y) = h(q) − yq, for some twice differentiable function h. This family includes many commonly used losses, such as the logistic loss and the cross-entropy loss.
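For readability, the definitions above can be written out in full. The display equations were lost in extraction; the following is a reconstruction consistent with the surviving inline terms, not the authors' exact typesetting:

```latex
% interpolated points and labels
\tilde{x}_{i,j}(\lambda) = \lambda x_i + (1-\lambda)\, x_j, \qquad
\tilde{y}_{i,j}(\lambda) = \lambda y_i + (1-\lambda)\, y_j, \qquad
\lambda \sim \mathcal{D}_\lambda = \mathrm{Beta}(\alpha, \beta)

% standard mixup loss: expected loss over training pairs and lambda
L_c(\theta) = \mathbb{E}_{\lambda \sim \mathcal{D}_\lambda}
\left[ \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n}
\ell\!\left( f_\theta(\tilde{x}_{i,j}(\lambda)),\, \tilde{y}_{i,j}(\lambda) \right) \right]

% adversarial-mixup loss: same form, with adversarially perturbed inputs
\tilde{x}^{\mathrm{adv}}_{i,j}(\lambda)
  = \lambda (x_i + \delta_i) + (1-\lambda)(x_j + \delta_j)
\quad \Rightarrow \quad
L_a(\theta) = \mathbb{E}_{\lambda \sim \mathcal{D}_\lambda}
\left[ \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n}
\ell\!\left( f_\theta(\tilde{x}^{\mathrm{adv}}_{i,j}(\lambda)),\, \tilde{y}_{i,j}(\lambda) \right) \right]

% the full IAT objective averages the two losses
L_{\mathrm{IAT}}(\theta) = \tfrac{1}{2} \left( L_c(\theta) + L_a(\theta) \right)
```

Here δ_i denotes the adversarial perturbation applied to x_i, matching the perturbations analyzed in Theorem 1.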

The effect of IAT on robustness
In this subsection, we study how adding Mixup to adversarial training affects the robustness of the model by analyzing the adversarial-mixup loss L_a used in IAT. This subsection focuses on the binary cross-entropy loss, obtained by setting h(z) = log(1 + e^z) with y ∈ {0, 1}, whereas the next section considers a more general setting. We define a set Θ of parameter vectors that contains the set of all parameter vectors with correct classifications of the training points (before Mixup). Therefore, the condition θ ∈ Θ is satisfied when the deep neural network classifies all labels correctly for the training data with perturbations before Mixup. As the training error (although not the training loss) becomes zero in finite time in many practical cases, the condition θ ∈ Θ is satisfied in finite time in many practical cases. Accordingly, we study the effect of IAT on robustness in the regime θ ∈ Θ.
Theorem 1 shows that the adversarial-mixup loss L_a is approximately an upper bound on the adversarial loss with adversarial perturbations x_i → x_i + δ_i + δ_i^mix, where ‖δ_i‖_ρ ≤ ε is the standard adversarial perturbation and ‖δ_i^mix‖_2 ≤ ε_i^mix is the non-standard additional perturbation due to IAT. In other words, IAT approximately minimizes the upper bound of the adversarial loss with an additional adversarial perturbation ‖δ_i^mix‖_2 ≤ ε_i^mix with data-dependent radius ε_i^mix for each i ∈ {1, . . . , n}. Therefore, adding Mixup to adversarial training (i.e., IAT) does not decrease the effect of the original adversarial training on robustness, up to an approximation error of order (1 − λ)³, as discussed below. This is non-trivial because, without Theorem 1, it is uncertain whether or not adding Mixup reduces the effect of adversarial training on robustness. Moreover, Theorem 1 shows that IAT can further improve robustness, depending on the values of the data-dependent radii ε_i^mix, when compared to standard adversarial training without Mixup. These findings are consistent with our experimental observations.

The effect of IAT on generalization
In this subsection, we mathematically analyze the effect of IAT on generalization. We start in Section 6.3.1 with the general setting with arbitrary h and f_θ, and prove a generalization bound for the IAT loss. In Section 6.3.2, we then make assumptions on h and f_θ and study the regularization effects of IAT.

Generalization bounds
The following theorem presents a generalization bound for the IAT loss (L_c + L_a)/2, i.e., an upper bound on the difference between the expected error on unseen data and the IAT loss, E_{x,y}[ℓ(f(x), y)] − (L_c + L_a)/2.

Theorem 2. Let ρ ≥ 1 be a real number and F be a set of maps x → f(x). Assume that |ℓ(q, y) − ℓ(q′, y)| ≤ τ for any q, q′ ∈ {f(x + δ) : f ∈ F, x ∈ X, ‖δ‖_ρ ≤ ε} and y ∈ Y. Then, for any δ > 0, with probability at least 1 − δ over an i.i.d. draw of n samples ((x_i, y_i))_{i=1}^n, the following holds for all maps f ∈ F: there exists a function ϕ : R → R with lim_{q→0} ϕ(q) = 0 such that the stated bound holds, where X := (x_1, . . . , x_n) and X̃ := (x̃_1, . . . , x̃_n).

To understand this generalization bound further, we now compare it with a generalization bound for IAT without Mixup on the terms L_c and L_a. IAT without Mixup is adversarial training combined with standard training, which minimizes the average of the clean and adversarial losses. The following theorem presents the corresponding bound.

Theorem 3. Let ρ ≥ 1 be a real number and F be a set of maps x → f(x). Assume that |ℓ(q, y) − ℓ(q′, y)| ≤ τ for any q, q′ ∈ {f(x + δ) : f ∈ F, x ∈ X, ‖δ‖_ρ ≤ ε} and y ∈ Y. Then, for any δ > 0, with probability at least 1 − δ over an i.i.d. draw of n samples ((x_i, y_i))_{i=1}^n, the stated bound holds for all maps f ∈ F.

By comparing Theorems 2 and 3, we can see that the benefit of IAT with Mixup comes from two mechanisms in terms of generalization. The first mechanism is based on the difference term between the two bounds: if this term is positive, then IAT with Mixup has a better generalization bound than IAT without Mixup (assuming the Rademacher complexity term R_n(ℓ ∘ F) is the same for both methods). The second mechanism is based on the model complexity term R_n(ℓ ∘ F).
As the model complexity term is bounded by the norms of the trained weights (e.g., Bartlett, Foster, & Telgarsky, 2017), it differs between the two training schemes, IAT with Mixup and IAT without Mixup. Accordingly, we study the regularization effects of IAT on the norms of the weights in the next subsection.

Regularization effects
The generalization bounds in the previous subsection contain the model complexity term, which is controlled by the norms of the weights in previous studies (e.g., Bartlett et al., 2017). Accordingly, we now discuss the regularization effects of IAT on the norms of the weights. This subsection considers models satisfying f_θ(x_i) = ∇f_θ(x_i)^⊤ x_i for i = 1, . . . , n. This is satisfied by linear models as well as deep neural networks with ReLU activation functions and max-pooling.
We let y ∈ {0, 1} and h(z) = log(1 + e^z), which makes the loss function the binary cross-entropy loss. Define g to be the logistic function, g(z) = e^z / (1 + e^z). This definition implies that g(z) ∈ (0, 1) for z ∈ R.
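The choice h(z) = log(1 + e^z) recovers exactly the binary cross-entropy on the logistic output, since −y log g(z) − (1 − y) log(1 − g(z)) = log(1 + e^z) − y·z. A quick numerical check of this identity (the helper names are ours):

```python
import numpy as np

def bce(z, y):
    """Binary cross-entropy on the logistic output g(z) = e^z / (1 + e^z)."""
    g = np.exp(z) / (1.0 + np.exp(z))
    return -y * np.log(g) - (1 - y) * np.log(1 - g)

def softplus_form(z, y):
    """Equivalent form h(z) - y*z with h(z) = log(1 + e^z)."""
    return np.log1p(np.exp(z)) - y * z

z = np.linspace(-5.0, 5.0, 11)
gap0 = np.max(np.abs(bce(z, 0.0) - softplus_form(z, 0.0)))
gap1 = np.max(np.abs(bce(z, 1.0) - softplus_form(z, 1.0)))
```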
The following theorem shows that the IAT term has an additional regularization effect on ‖∇f_θ(x_i)‖_2 and ‖∇f_θ(x̃_i)‖_2. This explains the additional regularization effect of the IAT term on the norm of the weights, since ∇f_θ(x_i) = w for linear models and ∇f_θ(x_i) = W_H σ̇_H W_{H−1} σ̇_{H−1} · · · σ̇_1 W_1 for deep neural networks with ReLU and max-pooling.
where lim_{q→0} ϕ(q) = 0. In Theorem 4, C_2 is always strictly positive since g(z) ∈ (0, 1) for all z ∈ R. While C_1 can be negative in general, the following proposition shows that C_1 is also non-negative in the later phase of IAT training. The condition θ ∈ Θ in the proposition is satisfied when the model classifies all labels correctly with margin ζ_i under adversarial perturbations. As the training error (although not the training loss) becomes zero in finite time in many practical cases, and the margin increases after that via the implicit bias of gradient descent (Lyu & Li, 2020), the condition θ ∈ Θ is satisfied in finite time in many practical cases.

Fig. 3. Frobenius and spectral norms of the weight matrices of a 6-layer network. Adversarial training generally makes these norms larger, whereas Interpolated Adversarial Training brings them closer to their values under standard training.

Theorem 4 and Proposition 1 together show that IAT can reduce the norms of the weights when compared to adversarial training. Zhang, Deng, and Kawaguchi (2021) showed that the standard Mixup loss L_c also has a regularization effect on the norm of the weights and thus contributes to reducing the model complexity. Therefore, our result, together with that of the previous study (Zhang et al., 2021), shows the benefit of IAT in terms of reducing the norm of the weights to control the model complexity.
As that study considers only standard Mixup without adversarial training, our result complements it in the understanding of IAT.
To validate this theoretical prediction, we computed the norms of the weights of a 6-layer fully-connected network with 512 hidden units trained on Fashion-MNIST and report the results in Fig. 3. Adversarial training increased the Frobenius norms of all layers and the spectral norms of most layers, whereas IAT avoided or mitigated these increases. This is consistent with our theoretical predictions and suggests that IAT learns lower-complexity classifiers than standard adversarial training.
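The norm comparison in Fig. 3 can be reproduced for any trained network by collecting its weight matrices. A minimal numpy helper (the name `weight_norms` is ours) computes the Frobenius norm and the spectral norm, i.e., the largest singular value, per layer:

```python
import numpy as np

def weight_norms(weights):
    """Per-layer (Frobenius norm, spectral norm) of a list of weight matrices.
    For matrices, np.linalg.norm(W, 2) returns the largest singular value."""
    return [(np.linalg.norm(W, 'fro'), np.linalg.norm(W, 2)) for W in weights]

# Example: for the 4x4 identity, Frobenius = sqrt(4) = 2 and spectral = 1.
fro, spec = weight_norms([np.eye(4)])[0]
```

Comparing these per-layer values for a normally trained, adversarially trained, and IAT-trained model reproduces the kind of comparison shown in Fig. 3.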
To further understand why adversarial training tends to increase the norms, consider the case of linear regression. Each step of (stochastic) gradient descent only adds to w a vector in the column space of X, while the solutions of the linear regression are any w that fits the training data. Thus, (stochastic) gradient descent does not add any unnecessary component to w, implicitly minimizing the norm of the weights. Accordingly, if we initialize w as w_0 ∈ Col(X), then we reach the minimum-norm solution implicitly via (stochastic) gradient descent.
In this context, we can easily see that conducting adversarial training adds vectors v_⊥ ∈ Null(X), breaking this implicit bias and increasing the norm of w. Similarly, in the case of deep neural networks, (stochastic) gradient descent has an implicit bias that restricts the search space of w and hence tends to minimize the norm without unnecessary components (Lyu & Li, 2020; Moroshko, Gunasekar, Woodworth, Lee, Srebro, & Soudry, 2020; Woodworth et al., 2020). Thus, as in the linear case, adversarial training adds extra components via the perturbation and tends to increase the norm of the weights.
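The linear-regression argument can be checked numerically. With more parameters than equations, gradient descent from zero stays in the span of the data (the row space of X below, since rows are samples) and converges to the minimum-norm interpolating solution. Injecting gradient directions outside that span, a crude stand-in for the effect of adversarial perturbations rather than an actual attack, inflates the norm. A sketch under these assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                       # underdetermined: many w fit the data exactly
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def gd(X, y, steps=20000, lr=0.01, perturb=0.0):
    """Gradient descent on 0.5*||Xw - y||^2 from w = 0. Each gradient
    X.T @ (X w - y) lies in the row space of X, so with perturb=0 the iterates
    never leave that span and converge to the minimum-norm solution pinv(X) @ y.
    perturb > 0 adds noise with components outside the span, breaking the bias."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y)
        if perturb:
            grad = grad + perturb * rng.normal(size=w.shape)
        w -= lr * grad
    return w

w_min = np.linalg.pinv(X) @ y      # minimum-norm interpolating solution
w_gd = gd(X, y)                    # plain GD recovers it
w_pert = gd(X, y, perturb=0.05)    # perturbed GD accumulates null-space mass
```

The null-space components of the perturbed run face no restoring force, so they accumulate and strictly enlarge ‖w‖, mirroring how adversarial perturbations push training away from the implicit minimum-norm bias.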
Our results show that the additional regularization effects of IAT mitigate this increase, reducing overfitting and yielding better generalization behavior.

Conclusion
Robustness to adversarial examples is essential for ensuring that machine learning systems are secure and reliable. However, the most effective defense, adversarial training, harms performance on unperturbed data. This has both theoretical and practical significance. As adversarial perturbations are imperceptible (or barely perceptible) to humans, and humans generalize extremely well, it is surprising that adversarial training reduces the model's ability to perform well on unperturbed test data. This degradation in generalization is critically urgent for practitioners whose systems are threatened by adversarial attacks. With current techniques, those wishing to deploy machine learning systems face a severe trade-off between performance on unperturbed data and robustness to adversarial examples, which may mean that security and reliability suffer in important applications. Our work has addressed both of these issues. We proposed to augment adversarial training with interpolation-based training (Zhang et al., 2017). We found that this substantially improves generalization on unperturbed data while preserving adversarial robustness. Our analysis showed why and how the proposed method can improve generalization and preserve adversarial robustness when compared to standard adversarial training.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Appendix A. Key lemma

This appendix provides the key lemma, Lemma 1, which is used to prove the theorems in Appendix B.

Appendix B. Proofs
Using Lemma 1, proven in Appendix A, this appendix provides the complete proofs of Theorems 1, 2, 3, and 4, and Proposition 1.

Proof of Theorem 2. Let
To apply McDiarmid's inequality to ϕ(S), we compute an upper bound on |ϕ(S) − ϕ(S′)|, where S and S′ are two datasets differing in exactly one point of an arbitrary index i_0; i.e., S_i = S′_i for all i ≠ i_0 and S_{i_0} ≠ S′_{i_0}. The bound on ϕ(S′) − ϕ(S) uses the fact that both L_c and L_a have n^2 terms, of which 2n − 1 terms differ for S and S′, each bounded by the constant τ; the bound on ϕ(S) − ϕ(S′) follows similarly. Moreover, by using Lemma 1, there exist functions ϕ′ and ϕ″ with lim_{q→0} ϕ′(q) = 0 and lim_{q→0} ϕ″(q) = 0 such that (B.8) holds. Thus, with the definitions in (B.9), the chain of inequalities follows: the second line follows from the definitions of each term; the third line adds and subtracts (1/n) Σ_{i=1}^n ℓ(f(x_i), y_i) inside the expectation and uses the linearity of expectation; the fourth line uses Jensen's inequality and the convexity of the supremum; the fifth line uses the fact that, for each ξ_i ∈ {−1, +1}, the distribution of ξ_i (ℓ(f(x′_i), y′_i) − ℓ(f(x_i), y_i)) equals that of ℓ(f(x′_i), y′_i) − ℓ(f(x_i), y_i), since S and S′ are drawn i.i.d. from the same distribution; and the sixth line uses the subadditivity of the supremum.

Proof of Theorem 3.
Here, the second line follows from the definitions of each term; the third line adds and subtracts (1/n) Σ_{i=1}^n ℓ(f(x_i), y_i) inside the expectation and uses the linearity of expectation; the fourth line uses Jensen's inequality and the convexity of the supremum; the fifth line uses the fact that, for each ξ_i ∈ {−1, +1}, the distribution of ξ_i (ℓ(f(x′_i), y′_i) − ℓ(f(x_i), y_i)) equals that of ℓ(f(x′_i), y′_i) − ℓ(f(x_i), y_i), since S and S′ are drawn i.i.d. from the same distribution; and the sixth line uses the subadditivity of the supremum.