Adversarial and Random Transformations for Robust Domain Adaptation and Generalization

Data augmentation has been widely used to improve generalization in training deep neural networks. Recent works show that using worst-case transformations or adversarial augmentation strategies can significantly improve accuracy and robustness. However, due to the non-differentiable properties of image transformations, search algorithms such as reinforcement learning or evolution strategies have to be applied, which are not computationally practical for large-scale problems. In this work, we show that simply applying consistency training with random data augmentation yields state-of-the-art results on domain adaptation (DA) and generalization (DG). To further improve accuracy and robustness with adversarial examples, we propose a differentiable adversarial data augmentation method based on spatial transformer networks (STNs). The combined adversarial and random-transformation-based method outperforms the state of the art on multiple DA and DG benchmark datasets. Furthermore, the proposed method shows desirable robustness to corruption, which is validated on commonly used datasets.


Introduction
For modern computer vision applications, we expect a model trained on large-scale datasets to be able to perform uniformly well across various testing scenarios. For example, consider the perception system of a self-driving car; we want it to generalize well across weather conditions and city environments. However, current supervised-learning-based models remain weak when it comes to out-of-distribution generalization [1]. When testing and training data are drawn from different distributions, the model can suffer a significant accuracy drop. This is known as the domain shift problem, which has drawn increasing attention in recent years [1][2][3][4].
Domain adaptation (DA) and domain generalization (DG) are two typical techniques used to address the domain shift problem. DA and DG aim to utilize one or multiple labeled source domains to learn a model that performs well on an unlabeled target domain. The major difference between DA and DG is that DA methods require target data during training, whereas DG methods do not require target data in the training phase. DA can be categorized as supervised, semi-supervised, and unsupervised, depending on the availability of the labels of target data. In this paper, we consider unsupervised DA, which does not require labels of target data. In recent years, many works have been proposed to address either DA or DG problems [3,5]. In this work, we address both DA and DG in a unified framework.
Data augmentation is an effective technique for reducing overfitting and has been widely used in many computer vision tasks to improve the generalization ability [6] of the model. Recent studies show that using worst-case transformations or adversarial augmentation strategies can greatly improve the generalization and robustness of the model [7,8]. However, due to the non-differentiable properties of image transformations, search algorithms such as reinforcement learning [9,10] or evolution strategies [8] have to be applied, which are not computationally practical for large-scale problems. In this work, we are concerned with the effectiveness of data augmentation for DA and DG, especially adversarial data augmentation strategies, without using heavy search-based methods. Motivated by the recent success of RandAugment [11] in improving the generalization of deep learning models and of consistency training in semi-supervised and unsupervised learning [12][13][14], we propose a unified DA and DG method that incorporates consistency training with random data augmentation. The idea is quite simple. When conducting a forward pass in neural networks, we force the randomly augmented and non-augmented pair of training examples to have similar responses by applying a consistency loss. Because consistency training does not require labeled examples, we can apply it to unlabeled target-domain data for domain adaptation training. Consistency training and source-domain supervised training are both within a joint multi-task training framework and can be trained end-to-end. Random augmentation can also be regarded as a method of noise injection, and by applying consistency training to noisy and original examples, the model's generalization ability is expected to improve. Following VAT [15] and UDA [13], we use the KL divergence to compute the consistency loss.
To further improve the accuracy and robustness, we consider employing adversarial augmentations to find worst-case transformations. Our interest is in performing adversarial augmentation for DA and/or DG without using searching-based methods. Most image transformations are non-differentiable, except for a subset of geometric transformations. Inspired by the spatial transformer networks (STNs) of [16], we propose a differentiable adversarial spatial transformer network for both DA and DG. As we will show in the experimental section, the adversarial STN alone achieves promising results on both DA and DG tasks. When combined with random image transformations, it outperforms the state of the art, which is validated on several DA and DG benchmark datasets.
In this work, apart from the cross-domain generalization ability, robustness is also our concern. This is particularly important for real applications when applying a model to unseen domains, which, however, is largely ignored in current DA and DG literature. We evaluate the robustness of our models on CIFAR-10-C [17], which is a robustness benchmark with 15 types of corruptions algorithmically simulated to mimic real-world corruptions. The experimental results show that our proposed method not only reduces the cross-domain accuracy drop but also improves the robustness of the model.
Our contributions can be summarized as follows: (1) We build a unified framework for domain adaptation and domain generalization based on data augmentation and consistency training. (2) We propose an end-to-end differentiable adversarial data-augmentation strategy with spatial transformer networks to improve accuracy and robustness. (3) We show that our proposed methods outperform state-of-the-art DA and DG methods on multiple object recognition datasets. (4) We show that our model is robust to common corruptions and obtains promising results on the CIFAR-10-C robustness benchmark.

Domain Adaptation
Modern domain adaptation methods usually address domain shift by learning domain-invariant features. This can be achieved by minimizing a certain measure of domain variance, such as the Maximum Mean Discrepancy (MMD) [1,18] and fuzzy MMD [19], or by aligning the second-order statistics of the source and target distributions [20,21].
Another line of work uses adversarial learning to learn features that are discriminative in source space and at the same time invariant with respect to domain shift [2,22,23]. In [2], a gradient reverse layer is proposed to achieve domain-adversarial learning by backpropagation. In [22], a method that combines adversarial learning and MMD is proposed.
Ref. [23] outlined a generalized framework for adversarial adaptation and proposed ADDA, which uses an inverted-label GAN loss to enforce domain confusion. In [24], a multi-layer adversarial DA method was proposed, in which a feature-level domain classifier is used to learn a domain-invariant representation while a prediction-level domain classifier is used to reduce domain discrepancy in the decision layer. In [3], CycleGAN [25]-based unpaired image translation is employed to achieve both feature-level and pixel-level adaptation. In [26], the cluster assumption is applied to domain adaptation, and a method called Virtual Adversarial Domain Adaptation (VADA) is proposed. VADA utilizes VAT [15] to enforce classifier consistency within the vicinity of samples. Drop to Adapt [27] also enforces the cluster assumption by leveraging adversarial dropout. In [28], adversarial learning and self-training are combined: an adversarially learned confusion matrix is utilized to correct the pseudo-labels and then align the feature distribution.
Recently, self-supervised-learning-based domain adaptation was proposed [4]. Self-supervised DA integrates a pretext learning task, such as image rotation prediction in the target domain, with the main task in the source domain. Self-supervised DA has shown the capability of learning domain-invariant feature representations [4,29]. In [30], label-consistent contrastive learning is proposed for source-free domain adaptation.

Domain Generalization
Similar to domain adaptation, existing work usually learns domain-invariant features by minimizing the discrepancy between the given multiple source domains, assuming that the source-domain-invariant feature works well for the unknown target domain. Domain-Invariant Component Analysis (DICA) is proposed in [31] to learn an invariant transformation by minimizing the dissimilarity across domains. In [32], a multi-domain reconstruction auto-encoder is proposed to learn domain-invariant features.
Adversarial learning has also been applied in DG. In [33], an MMD-based adversarial autoencoder (AAE) is proposed to align the distributions among different domains and match the aligned distribution to an arbitrary prior distribution. In [34], correlation alignment is combined with adversarial learning to minimize the domain discrepancy. In [35], optimal transport with Wasserstein distance is adopted in the adversarial learning framework to align the marginal feature distribution over all the source domains.
In [5,43], self-supervised DG is proposed by introducing a secondary task to solve a jigsaw puzzle and/or predict image rotation. This auxiliary task helps the network to learn the concepts of spatial correlation while acting as a regularizer for the main task. With this simple model, state-of-the-art domain generalization performance can be achieved.

Data Augmentation
Data augmentation is a widely used trick in training deep neural networks. In visual learning, early data augmentation usually uses a composition of elementary image transformations, including translation, flipping, rotation, stretching, shearing, and adding noise [44]. Recently, more complicated data augmentation approaches have been proposed, such as CutOut [45], Mixup [46], and AugMix [47]. These methods are designed by human experts based on prior knowledge of the task, together with trial and error. To automatically find the best data augmentation method for a specific task, policy-search-based automated data-augmentation approaches have been proposed, such as AutoAugment [9] and Population-Based Augmentation (PBA) [48]. The main drawback of these automated data augmentation approaches is the prohibitively high computational cost. Recently, Ref. [7] improved the computational efficiency of AutoAugment by simultaneously optimizing the target-related objective and the augmentation policy search loss.
Another kind of data augmentation method aims at finding the worst-case transformations and utilizing them to improve the robustness of the learned model. In [49], adversarial data augmentation is employed to generate adversarial examples, which are appended during training to improve the generalization ability. In [8], the authors further proposed searching for worst-case image transformations by random search or evolution-based search. Reinforcement learning is used in [7] to search for adversarial examples, in which RandAugment and worst-case transformation are combined.
Recently, consistency training with data augmentation has been used for improving semi-supervised training [13] and the generalization ability of supervised training [50].
Most related works focus on either domain adaptation or domain generalization, while in this work, we consider designing a general model to address both. Domain-adversarial training is a widely used technique for DA and DG; our work does not follow this mainstream methodology but instead approaches the problem from the perspective of representation learning, e.g., self-supervised learning [4] and consistency learning [29]. For representation learning, data augmentation also plays an important role, as it can reduce model overfitting and improve generalization. However, whether data augmentation can address cross-domain adaptation and generalization problems is still not well explored. In this work, we design a framework that incorporates data augmentation and consistency learning to address both domain adaptation and domain generalization.

The Proposed Approach
In this section, we present the proposed method for domain adaptation and generalization in detail.

Problem Statement
In the domain adaptation and generalization problem, we are given a source domain D_s and a target domain D_t containing samples from two different distributions, P_S and P_T. Denoting by {x_s, ŷ_s} ∈ D_s a labeled source-domain sample and by {x_t} ∈ D_t a target-domain sample without a label, we have x_s ∼ P_S, x_t ∼ P_T, and P_S ≠ P_T. When applying a model trained on the source domain to the target domain, the distribution mismatch can lead to a significant performance drop.
The task of unsupervised domain adaptation is to train a classification model F : x_s → y_s that is able to classify x_t to the corresponding label y_t given {x_s, ŷ_s} and {x_t} as training data. The task of domain generalization, on the other hand, is to train a classification model F : x_s → y_s that is able to classify x_t to the corresponding label y_t given only {x_s, ŷ_s}. The difference between these two tasks is whether {x_t} is involved during training. For both domain adaptation and generalization, we assume there are n_s source domains, where n_s ≥ 1, and a single target domain.
Many works have addressed either domain adaptation or domain generalization. In this work, we propose a unified framework to address both problems. In what follows, we first focus on domain adaptation and introduce the main idea and explain the details of the proposed method. Then, we show how this method can be adapted to domain generalization tasks as well.

Random Image Transformation with Consistency Training
Inspired by a recent line of work [13,29] in semi-supervised learning that incorporates consistency training with unlabeled examples to enforce the smoothness of the model, we propose using image transformation as a method of noise injection and apply consistency training to the noisy and original examples. The overview of the proposed random image transformation with consistency training for domain adaptation is depicted in Figure 1. In this section, we focus on the random image transformation part and leave the adversarial spatial transformer networks to the next section. The main idea can be explained as follows: (1) Given an input image x from either the source or target domain, we compute the output distribution p(y | x) with x and a noisy version p(y | x̃) by applying a random image transformation to x; (2) For domain adaptation, we jointly minimize the classification loss with labeled source-domain samples and a divergence metric D(p(y | x) ‖ p(y | x̃)) with unlabeled source- and target-domain samples, where D is a discrepancy measure between two distributions; (3) For domain generalization, the procedure is similar to (2) but without using any target-domain samples. Our intuition is that, on the one hand, minimizing the consistency loss encourages the model to be insensitive to the noise and improves its generalization ability; on the other hand, consistency training gradually transmits label information from labeled source-domain examples to unlabeled target-domain ones, which improves the domain adaptation ability.
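To make the training signal concrete, here is a minimal NumPy sketch of the consistency term (our own illustration, not the paper's code; the function names are ours). The clean-branch prediction acts as a fixed target, mirroring the stop-gradient described below for the fixed copy of the model parameters.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the class axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q), summed over classes, averaged over the batch.
    return np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1))

def consistency_loss(logits_clean, logits_aug):
    # Prediction on the original image is treated as a fixed target
    # (no gradient flows through it); the augmented branch is trained
    # to match it.
    p_clean = softmax(logits_clean)
    p_aug = softmax(logits_aug)
    return kl_divergence(p_clean, p_aug)
```

In a real PyTorch implementation, the clean branch would additionally be wrapped in a no-gradient context so that only the augmented branch receives gradients.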
The random image transformations we apply are similar to those of RandAugment [11]. Table 1 shows the types of image transformations used in this work. The image transformations are categorized into three groups. The first group is the geometric transformations, including Shear, Translation, Rotation, and Flip. The second group is the color-enhancing transformations, e.g., Solarize, Contrast, etc., and the last group includes other transformations, e.g., CutOut and SamplePairing. Each type of image transformation has a corresponding magnitude, which indicates the strength of the transformation. The magnitude can be either a continuous or a discrete variable. Following [11], we normalize the magnitude to a range from 0 to 10 in order to employ a linear scale of magnitude for each type of transformation. In other words, a value of 10 indicates the maximum scale for a given transformation, while 0 means the minimum scale. Note that these image transformations are commonly used as search policies in recent auto-augmentation literature, such as [7,9,10]. Following [11], we do not use search but instead sample uniformly from the same set of image transformations. Specifically, for each training sample, we uniformly sample N_aug image transformations from Table 1 with the normalized magnitude value M_aug and then apply them to the image sequentially. N_aug and M_aug are hyper-parameters.
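A small sketch of this sampling scheme, with hypothetical magnitude ranges standing in for Table 1 (the actual transform set and ranges follow RandAugment [11]; the names and numbers here are illustrative assumptions):

```python
import random

# Hypothetical per-transform ranges; a value of 10 maps to the
# strongest setting, 0 to the weakest, as described in the text.
MAGNITUDE_RANGES = {
    "Rotate": (0.0, 30.0),    # degrees
    "ShearX": (0.0, 0.3),     # shear factor
    "Contrast": (0.0, 0.9),   # enhancement delta
    "Solarize": (256.0, 0.0), # inverted: higher magnitude = stronger
}

def scale_magnitude(name, m, m_max=10):
    """Linearly map a normalized magnitude m in [0, m_max] to the
    transform-specific range."""
    lo, hi = MAGNITUDE_RANGES[name]
    return lo + (hi - lo) * (m / m_max)

def sample_ops(n_aug=2, m_aug=9, rng=None):
    """Uniformly sample n_aug transforms, each applied at the shared
    normalized magnitude m_aug (applied sequentially in practice)."""
    rng = rng or random.Random()
    names = [rng.choice(list(MAGNITUDE_RANGES)) for _ in range(n_aug)]
    return [(name, scale_magnitude(name, m_aug)) for name in names]
```

With the paper's chosen hyper-parameters, `sample_ops(2, 9)` would draw two transforms per training image at 90% of their maximum strength.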
Following the practice of [11], we sampled N_aug ∈ {1, 2, 3, 5, 10} and M_aug ∈ {3, 6, 9, 12}. We conducted validation experiments on the PACS and VisDA datasets and found that N_aug = 2 and M_aug = 9 obtain the best results; thus, we keep N_aug = 2 and M_aug = 9 in all our experiments. Following VAT [15] and UDA [13], we use the KL divergence to compute the consistency loss. We denote by θ_m the parameters of the classification model. The classification loss with labeled source-domain samples is written as a cross-entropy loss (Equation (1)). The consistency loss for domain adaptation (Equation (2)) is the KL divergence between p̂(y | x) and p(y | x̃), where p̂(y | x) uses a fixed copy of θ_m, which means that the gradient is not propagated through p̂(y | x).
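A plausible explicit form of the two loss terms just described, in our own notation (θ̃_m denotes the fixed copy of θ_m, and x̃ the randomly augmented image), is:

```latex
% Eq. (1): cross-entropy classification loss on labeled source samples
\mathcal{L}_{cls}(\theta_m) =
  -\,\mathbb{E}_{\{x_s,\hat{y}_s\}\in\mathcal{D}_s}
  \sum_{y} \hat{y}_s \log p(y \mid x_s; \theta_m)

% Eq. (2): KL consistency loss on unlabeled source and target samples
\mathcal{L}_{cons}(\theta_m) =
  \mathbb{E}_{x\in\mathcal{D}_s\cup\mathcal{D}_t}\,
  D_{KL}\!\bigl(\hat{p}(y \mid x;\tilde{\theta}_m)\,\big\|\,p(y \mid \tilde{x};\theta_m)\bigr)
```

Since no label appears in Equation (2), it can be evaluated on both source and unlabeled target data, which is what enables the domain adaptation setting.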
As a common underlying assumption in many semi-supervised learning methods, the classifier's decision boundary should not pass through high-density regions of the marginal data distribution [51]. The conditional entropy minimization loss (EntMin) [52] enforces this by encouraging the classifier to output low-entropy predictions on unlabeled data (Equation (3)); EntMin is also combined with VAT in [15] to obtain stronger results. Following [4,5], we apply the conditional entropy minimization loss to the unlabeled target-domain data to minimize the classifier's prediction uncertainty. The full objective of domain adaptation (Equation (4)) is the weighted sum of the classification, consistency, and entropy losses, where λ_c and λ_e are the weight factors for the consistency loss and the conditional entropy minimization loss, respectively. For domain generalization, as no target-domain data are involved during training, the consistency loss of Equation (2) is computed over source samples only (Equation (5)), and the final objective function is the weighted sum of Equation (5) and the classification loss of Equation (1).
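Written out explicitly, a reconstruction of the entropy term and the combined objectives consistent with the description above (our notation, matching Equations (1) and (2)) is:

```latex
% Eq. (3): conditional entropy minimization on unlabeled target samples
\mathcal{L}_{ent}(\theta_m) =
  -\,\mathbb{E}_{x_t\in\mathcal{D}_t}
  \sum_{y} p(y \mid x_t;\theta_m)\log p(y \mid x_t;\theta_m)

% Eq. (4): full domain-adaptation objective
\min_{\theta_m}\;
  \mathcal{L}_{cls} + \lambda_c\,\mathcal{L}_{cons} + \lambda_e\,\mathcal{L}_{ent}

% Eq. (5): consistency loss restricted to source data (DG)
\mathcal{L}_{cons}^{DG}(\theta_m) =
  \mathbb{E}_{x\in\mathcal{D}_s}\,
  D_{KL}\!\bigl(\hat{p}(y \mid x;\tilde{\theta}_m)\,\big\|\,p(y \mid \tilde{x};\theta_m)\bigr)

% Eq. (6): full domain-generalization objective
\min_{\theta_m}\; \mathcal{L}_{cls} + \lambda_c\,\mathcal{L}_{cons}^{DG}
```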

Adversarial Spatial Transformer Networks
The proposed random image transformation with consistency training is a simple and effective method to reduce domain shift. Recent works show that using worst-case transformations or adversarial augmentation strategies can significantly improve the accuracy and robustness of the model [7,8]. However, most of the image transformations in Section 3.2 are non-differentiable, making it difficult to apply gradient-descent-based methods to obtain optimal transformations. To address this problem, searching algorithms such as reinforcement learning [7] or evolution strategies [8] have been employed in recent works, which, however, are computationally expensive and do not guarantee reaching the global optimum. We observe that a subset of the image transformations in Table 1, namely the geometric transformations, are actually differentiable. We therefore build our adversarial geometric transformation on top of the spatial transformer networks (STN) [16], focusing on affine transformations. The STN consists of a localization network, a grid generator, and a differentiable image sampler. The localization network is a convolutional neural network with parameters θ_t, which takes as input an image x and regresses the affine transformation parameters φ. The grid generator takes φ as input and generates the transformed pixel coordinates (Equation (7)), where (ũ, ṽ) are the normalized coordinates in the output image and (u, v) are the normalized source coordinates in the input image, i.e., −1 ≤ ũ, ṽ, u, v ≤ 1. Finally, the differentiable image sampler takes the set of sampling points from the grid generator, along with the input image x, and produces the sampled output image x̃. Bilinear interpolation is used during the sampling process. We denote the STN by T : x → x̃, a differentiable neural network with parameters θ_t that applies an affine transformation to the input image x.
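In the standard STN formulation [16], the grid-generator mapping referenced above can be written as:

```latex
% Eq. (7): affine grid generation (standard STN form)
\begin{pmatrix} u \\ v \end{pmatrix}
= \mathcal{A}_{\varphi}
\begin{pmatrix} \tilde{u} \\ \tilde{v} \\ 1 \end{pmatrix}
= \begin{bmatrix}
    \varphi_{11} & \varphi_{12} & \varphi_{13} \\
    \varphi_{21} & \varphi_{22} & \varphi_{23}
  \end{bmatrix}
\begin{pmatrix} \tilde{u} \\ \tilde{v} \\ 1 \end{pmatrix}
```

Each output-pixel coordinate (ũ, ṽ) is mapped back to a source coordinate (u, v), which is then sampled from the input image by bilinear interpolation, keeping the whole pipeline differentiable in φ.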
The goal of the adversarial geometric transformation is to find the worst-case transformations, which is equivalent to maximizing the objective function in Equation (8). The straightforward way to solve this maximization problem is to apply the gradient reverse trick, i.e., the gradient reversal layer (GRL) of [2], which is popular in domain-adversarial training methods. The GRL has no parameters associated with it. During forward propagation, it acts as an identity transformation; during back-propagation, however, the GRL takes the gradient from the subsequent layer and changes its sign, i.e., multiplies it by −1, before passing it to the preceding layer. Formally, the forward and backward propagation of the GRL can be written as R(x) = x and dR/dx = −I. The loss function of the adversarial spatial transformer for domain adaptation can thus be written as Equation (9), where T is the spatial transformer network with parameters θ_t and R is the gradient reversal layer (GRL) [2]. For domain generalization, the only difference is that only x ∈ D_s is involved in Equation (9). With the adversarial spatial transformer network, the final objective function for domain adaptation is given by Equation (10), and that for domain generalization by Equation (11).
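A conceptual Python sketch of the GRL (illustrative only; in a real training framework this would be a custom autograd function, e.g., a `torch.autograd.Function`):

```python
import numpy as np

class GradientReversalLayer:
    """Conceptual sketch of the GRL used for the adversarial STN.
    Forward pass: identity, R(x) = x.
    Backward pass: flip the gradient sign, dR/dx = -I.
    This standalone version only illustrates the two passes and does
    not hook into any autograd engine."""

    def forward(self, x):
        # Identity transformation in the forward direction.
        return x

    def backward(self, grad_output):
        # Multiply the incoming gradient by -1 before passing it on.
        return -grad_output
```

Placing the GRL between the localization network's output and the shared consistency loss lets a single gradient-descent step minimize the loss with respect to the classifier parameters θ_m while maximizing it with respect to the transformer parameters θ_t, yielding worst-case affine transformations without any separate search procedure.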

Experiments
In this section, we conduct experiments to evaluate the proposed method and compare the results with the state-of-the-art domain adaptation and generalization methods.

Datasets
Our method was evaluated on the following popular domain adaptation and generalization datasets. PACS [53] is a standard dataset for DG. It contains 9991 images collected from Sketchy, Caltech256, TU-Berlin, and Google Images. It has 4 domains (Photo, Art Paintings, Cartoon, and Sketches), and each domain consists of 7 object categories. VisDA (http://ai.bu.edu/visda-2017/, accessed on 10 May 2020) is a simulation-to-real domain-adaptation dataset that has over 280 K images across 12 classes. The synthetic domain contains renderings of 3D models from different angles and under different lighting conditions, and the real domain contains natural images.
To investigate the robustness of the proposed model, we also evaluated it on popular robustness benchmarks, including the following.
CIFAR-10.1 [57] is a new test set of CIFAR-10 with 2000 images and the exact same classes and image dimensionality. Its creation follows the creation process of the original CIFAR-10 paper as closely as possible. The purpose of this dataset is to investigate the distribution shifts present between the two test sets, and the effect on object recognition.
CIFAR-10-C [17] is a robustness benchmark where 15 types of corruption are algorithmically simulated to mimic real-world corruption as much as possible on copies of the CIFAR-10 [58] test set. The 15 types of corruption are from four broad categories: noise, blur, weather, and digital. Each corruption type comes in five levels of severity, with level 5 being the most severe. In this work, we evaluated the models with the level 5 severity.

Experimental Setting
We implemented the proposed method using the PyTorch framework on a single RTX 2080 Ti GPU with 11 GB memory. The Alexnet [59], Resnet-18, and Resnet-50 [60] architectures were used as base networks and initialized with ImageNet [61] pretrained weights.
For training the model, we used an SGD solver with an initial learning rate of 0.001. We trained the model for 60 epochs and decayed the learning rate to 0.0001 after 80% of the training epochs. For training baseline models, we used simple data-augmentation protocols by random cropping, horizontal flipping, and color jittering.
We followed the standard protocol for unsupervised domain adaptation [2,62], where all labeled source domain examples and all unlabeled target domain examples were used for adaptation tasks. We also followed the standard protocol for domain-generalization transfer tasks as per [5], where the target domain examples were unavailable in the training phase. We set three different random seeds and ran each experiment three times. The final result is the average over the three repetitions.
We compared our proposed method with state-of-the-art DA and DG methods. The descriptions of the compared methods are shown in Table 2. In the following, we use Deep All to denote the baseline model trained with all available source-domain examples when all the introduced domain-adaptive components are disabled. For the compared methods in Table 2, we used the results reported in the original papers if the protocol is the same. The compared methods include:
MDD [66] (DA, 2019): adversarial training with margin disparity discrepancy.
Rot [4] (DA, 2019): self-supervised learning by rotation prediction.
RotC [29] (DA, 2020): self-supervised learning with consistency training.
ALDA [28] (DA, 2020): adversarial-learned loss for domain adaptation.
MLADA [24] (DA, 2021): multi-layer adversarial domain adaptation.
GPDA [67] (DA, 2021): geometrical preservation and distribution alignment.

Unsupervised Domain Adaptation
The multi-source domain adaptation results on PACS are reported in Table 3. We followed the settings in [4,5] and trained our model considering three domains as the source datasets and the remaining one as the target. RotC is an improved version of Rot, which applies the consistency loss with the simplest image rotation transformations [29]. We used the open-source code of [65] to produce the results of CDAN and CDAN+E and the open-source code of [66] to produce the results of MDD. Our proposed approach outperforms all baseline methods on all transfer tasks. The last column shows the average accuracy over the four tasks. Our proposed approach outperforms the state-of-the-art CDAN+E [65] by 4.7 percentage points and MDD by 1.4 percentage points.
To investigate the improvement from data augmentation, we added the same type of data augmentation as ours to Deep All, DANN, CDAN, CDAN+E, and MDD, denoted by Deep All (Aug), DANN (Aug), CDAN (Aug), CDAN+E (Aug), and MDD (Aug), respectively. From these results, we can see that data augmentation yields an improvement of 1.2 to 2.6 percentage points for existing domain-adaptation methods. Even with the same type of data augmentation, our proposed method still outperforms these baselines. The results on the Office-Home dataset are reported in Table 4. On the Office-Home dataset, we conducted 12 transfer tasks across four domains in the context of single-source domain adaptation. We achieved state-of-the-art performance on 8 out of 12 transfer tasks. Note that although Office-Home and PACS are related in terms of domain types, the total numbers of categories in Office-Home and PACS are 65 and 7, respectively. From the results, we can see that the proposed method scales well when the number of categories changes from 7 to 65. The average accuracy achieved by our proposed method is 67.6%, which outperforms all the compared methods.

The results on the ImageCLEF-DA dataset are reported in Table 5. As the three domains in ImageCLEF-DA are of equal size, balanced in each category, and visually more similar, there is little room for improvement on this dataset. Even so, our method still outperforms the comparison methods on four out of six transfer tasks. Our method achieves 88.2% average accuracy, outperforming the latest methods, including CDAN+E [65], RotC [29], and MLADA [24]. Table 5. Accuracy (%) on ImageCLEF-DA for unsupervised domain adaptation (Resnet-50). The bold font highlights the best domain adaptation results. I → P indicates that ImageNet ILSVRC 2012 is the source domain and Pascal VOC 2012 is the target domain. The proposed method also obtains strong results on VisDA, as reported in Table 6. It outperforms CDAN+E by 2.6 percentage points. It is important to understand the contribution of the consistency loss. Because RotC combines a simple image rotation transformation with the consistency loss, without using complex data augmentation, it allows us to gauge how much of the improvement comes from the consistency loss. Compared to Rot, which does not use the consistency loss, we can see from the above DA experiments that RotC obtains an improvement of about 0.7 to 3.5 percentage points thanks to the consistency loss.

Domain Generalization
In the context of multi-source domain generalization, we conducted four transfer tasks on the PACS dataset. We compared the performance of our proposed method against several recent domain-generalization methods. We evaluated the method with both Alexnet and Resnet-18 and report the results in Tables 7 and 8. From the results, we can observe that our proposed method achieves state-of-the-art domain generalization performance with both backbone architectures. With Alexnet, our method outperforms the comparison methods on all 4 transfer tasks. The average accuracy of our method outperforms the prior best method, WADG, by around 1.7 percentage points, setting a new state-of-the-art performance. With Resnet-18, the average accuracy of our method is 82.73%, also outperforming the latest existing methods. Table 7. Domain generalization results on PACS (Alexnet). For details about the meaning of columns and the use of bold fonts, see Table 3. As the consistency loss is not mandatory for DG, we replaced the consistency loss with the cross-entropy loss and reran our method. The results are denoted by Ours w/o consis. We can see that the model trained with the cross-entropy loss obtains accuracy similar to ours with the consistency loss. However, the consistency loss is required for DA because of the unlabeled target-domain samples. To keep a unified framework for both DA and DG problems, we used the consistency loss for DG in this work.

To investigate the improvement from pure data augmentation without consistency, we ran JiGen and Deep All with the same type of data augmentation as ours, denoted by JiGen (Aug) and Deep All (Aug). We can see that using pure augmentation, Deep All obtains an improvement of around 0.8 to 1.2 percentage points, and JiGen obtains an improvement of around 1.2 to 1.6 percentage points. Even so, our proposed method still outperforms these baselines. Table 8. Domain generalization results on PACS (Resnet-18). For details about the meaning of columns and the use of bold fonts, see Table 3. We also conducted experiments on the Office-Home and VLCS datasets for multi-source domain generalization. Compared to PACS, these two datasets are more difficult, and most recent works have only obtained small accuracy gains with respect to the Deep All baselines. The results on the Office-Home and VLCS datasets are reported in Tables 9 and 10, respectively. Our proposed method outperforms the compared methods on the four transfer tasks on the Office-Home dataset, and the results on the VLCS dataset show that our method achieves the best or close to the best performance on the four tasks, outperforming the recently proposed methods on average. It is noted that our baseline Deep All has relatively higher accuracy than other baselines. This is because we also add data augmentations such as random cropping, horizontal flipping, and color jittering when training Deep All models. In this case, it is fairer to compare with the proposed method, which incorporates various data augmentation operations. Table 9. Domain generalization results on Office-Home. For details about the meaning of columns and the use of bold fonts, see Table 3. Table 10. Domain generalization results on VLCS. For details about the meaning of columns and the use of bold fonts, see Table 3.

Robustness
Apart from domain adaptation and generalization, we are also interested in the robustness of the learned model. In this part, we evaluate the proposed method on the robustness benchmarks CIFAR-10.1 and CIFAR-10-C. We trained on the standard CIFAR-10 training set and tested on the various corruption datasets, i.e., a single-source domain-generalization setting. Figure 2 shows the testing error on the different datasets. We evaluated different image-transformation strategies and also compared them with the recently proposed methods in [74], i.e., JT and TTT. Following [74], we used the same architecture and hyperparameters across all experiments.
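The CIFAR-10-C protocol averages the test error over all corruption types and severity levels; a small sketch of that bookkeeping (the `evaluate` callback and the corruption list are placeholders of ours, not part of the paper's code):

```python
# CIFAR-10-C ships several corruption types (e.g., gaussian_noise, fog, ...),
# each at 5 severity levels; the reported number is the mean test error.
def mean_corruption_error(evaluate, corruptions, severities=range(1, 6)):
    """`evaluate(corruption, severity)` is assumed to return the test error
    of the trained model on that corrupted copy of the CIFAR-10 test set."""
    errors = [evaluate(c, s) for c in corruptions for s in severities]
    return sum(errors) / len(errors)
```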
The method denoted by baseline refers to the plain ResNet model, which is equivalent to Deep All in the DG setting. JT and TTT are the joint-training and test-time-training methods in [74], respectively. We denote by rnd-all the random image transformations including geometric and color-based transformations, by adv-stn the proposed adversarial STN without random image transformations, and by adv-stn-color the adversarial STN combined with random color-based transformations.
On the left is the standard CIFAR-10 testing dataset, where all the compared methods obtain similar accuracies. On CIFAR-10.1, the testing errors of all these methods increase together, with no significant gap between them. On the CIFAR-10-C corruption datasets, the performance of these methods varies considerably. Our proposed methods show improved accuracies compared to the baseline. adv-stn-color performs better than its variants and also outperforms JT and TTT. It can also be seen that adv-stn even outperforms rnd-all in most cases, although it applies only geometric transformations, which indicates the effectiveness of the proposed adversarial spatial transformations for improving robustness.
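The inner maximization behind adv-stn (climb the loss surface with respect to the transformation parameters rather than the pixels) can be illustrated with a toy 1-D analogue. This is our sketch, not the paper's code: the 2-D affine STN is replaced by a differentiable 1-D shift, and backpropagation by a finite-difference gradient:

```python
import numpy as np

def warp_shift(x, t):
    """Differentiable 1-D stand-in for a spatial transformer: shift signal x
    by a continuous offset t using linear interpolation."""
    idx = np.arange(len(x)) - t
    i0 = np.clip(np.floor(idx).astype(int), 0, len(x) - 1)
    i1 = np.clip(i0 + 1, 0, len(x) - 1)
    w = idx - np.floor(idx)
    return (1 - w) * x[i0] + w * x[i1]

def adversarial_shift(x, loss_fn, steps=10, lr=0.5, eps=1e-3):
    """Inner maximization: gradient *ascent* on the transform parameter t,
    searching for the worst-case transformation of x."""
    t = 0.0
    for _ in range(steps):
        g = (loss_fn(warp_shift(x, t + eps)) - loss_fn(warp_shift(x, t - eps))) / (2 * eps)
        t += lr * g  # ascend the loss, not descend
    return t
```

In the paper's setting the transformation is a 2-D affine warp produced by an STN, so the gradient comes from ordinary backpropagation through the differentiable sampler, and the model is then trained on the resulting worst-case views.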

Ablation Studies and Analysis
Below, we focus on the PACS DG and DA setting for ablation analyses of the proposed method.

Ablation Study on Image Transformation Strategies
In this part, we conducted ablation studies on adversarial and random image transformations. Table 11 shows the ablation studies of domain adaptation on PACS with different image transformation strategies, and Table 12 shows the domain-generalization results. rnd-color and rnd-geo are subsets of rnd-all, where rnd-color refers to color-based transformations and rnd-geo refers to geometric transformations. Please see Table 1 for details of each subset of transformations. For the DA task, when comparing individual transformation strategies, rnd-color obtained the best accuracy, and adv-stn outperformed rnd-geo. The combination of rnd-color + adv-stn also outperformed rnd-color + rnd-geo. However, in this experiment, adv-stn did not further improve rnd-color, which might be due to the limited room for improvement in the baseline method.
In the DG experiment, we reached a similar conclusion: adv-stn outperforms rnd-geo. As the baseline accuracy for DG is far from saturated compared to DA, adv-stn further improved rnd-color, and the combination rnd-color + adv-stn obtained the best accuracy.

Table 11. Ablation studies of domain adaptation on PACS. The first three columns indicate the types of image transformations applied. Each column title in the middle indicates the name of the domain used as the target. We use bold font to highlight the best results.

Ablation Study on the Hyperparameter Settings
In this part, we conducted ablation studies on the hyperparameter settings. The final objective functions of our proposed method for domain adaptation (10) and domain generalization (11) are weighted summations of several terms, with the weighting factors as the hyperparameters. Since the conditional-entropy-minimization loss is widely used in domain adaptation, we fixed its weight λe = 0.1 and conducted a grid test over different settings of λc and λt, which are shared by domain generalization. We tested ten values logarithmically spaced between 10^-2 and 10 for each of λc and λt. For each setting, we ran with three different random seeds and report the mean accuracy. The results of multi-source domain adaptation and domain generalization, taking photo, cartoon, and sketch as the source domains and art_painting as the target domain, are reported in Figure 3; ResNet-18 was used as the base network. From the figures, we can see that as long as λc and λt are not too large, the accuracies are relatively stable, confirming that our proposed method is insensitive to these hyperparameters. However, when λc and λt grow too large, the performance decreases, especially with a small λc and a large λt. The likely reason is that overwhelmingly large weights for the consistency loss and the adversarial spatial-transformer loss, relative to the main classification loss, make the learned features less discriminative for the classification task, resulting in lower accuracy. Moreover, a too-large λt may place excessive emphasis on extreme geometric distortions, which can harm general cross-domain performance.
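For clarity, the weighted objective and the search grid described above look as follows (a sketch; the loss-term names are ours, and the exact forms of Eqs. (10) and (11) are defined in the method section, not reproduced here):

```python
import numpy as np

def total_loss(l_cls, l_consis, l_stn, lam_c, lam_t, l_ent=0.0, lam_e=0.1):
    """Weighted sum of the loss terms: classification + consistency (weight
    lam_c) + adversarial-STN (weight lam_t); the conditional-entropy term
    (weight lam_e, fixed to 0.1 here) is used for the DA objective only."""
    return l_cls + lam_c * l_consis + lam_t * l_stn + lam_e * l_ent

# Ten values logarithmically spaced between 10^-2 and 10, tried for each
# of lam_c and lam_t in the grid test.
grid = np.logspace(-2, 1, 10)
```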

Visualization of Learned Deep Features
To better understand the learned domain-invariant feature representations, we use t-SNE [75] for embedding visualization. We conducted experiments on the transfer task photo, cartoon, sketch → art_painting in both the DA and DG settings and visualized the feature embeddings. Figure 4 shows the visualization for the PACS DA setting, and Figure 5 shows the visualization for the PACS DG setting. In both figures, we visualize category alignment as well as domain alignment. We also compare with the baseline Deep All, which does not apply any adaptation. From the visualization of the embeddings, we can see that the clusters created by our model not only separate the categories but also mix the domains. The visualization from the DG model suggests that our proposed method is able to learn feature representations generalizable to unseen domains. It also implies that the proposed method can effectively learn domain-invariant representations from unlabeled target-domain examples.
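The visualization pipeline itself is standard; a minimal scikit-learn sketch follows (assuming penultimate-layer features have already been extracted; the array shapes, the random features, and the perplexity value are illustrative placeholders):

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder for extracted penultimate-layer features: one row per image,
# with per-image category and domain labels kept alongside for coloring.
rng = np.random.default_rng(0)
features = rng.normal(size=(60, 16))  # e.g., 60 images, 16-d features

# Project to 2-D with t-SNE [75].
emb = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(features)

# The 2-D points would then be plotted twice: colored by category
# (category alignment) and colored by source/target domain (domain
# alignment), as in Figures 4 and 5.
```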

Visualization of Adversarial Examples
To visually examine the adversarial spatial transformations learned by adv-stn, we plot transformed examples observed during training in Figure 6. In the first row, we show the original images with simple random horizontal flipping and jittering augmentation. The second and third rows show images transformed with rnd-all and adv-stn-color, respectively. From the figure, we can see that the proposed adversarial STN does find more difficult image transformations than random augmentation. Training with these adversarial examples greatly improves the generalization ability and robustness of the model.

Conclusions
In this work, we proposed a unified framework for addressing both domain adaptation and generalization problems. Our domain adaptation and generalization methods are built upon random image transformation and consistency training. This simple strategy can be used to obtain promising DA and DG performance on multiple benchmarks. To further improve its performance, we proposed a novel adversarial spatial transformer network that is differentiable and able to find the worst-case image transformation to improve the generalizability and robustness of the model. Experimental results on multiple object recognition DA and DG benchmarks verified the effectiveness of the proposed methods. Additional experiments tested on CIFAR-10.1 and CIFAR-10-C also validated the robustness of the proposed method.