Learning Domain Invariant Representations by Joint Wasserstein Distance Minimization

Domain shifts in the training data are common in practical applications of machine learning; they occur for instance when the data is coming from different sources. Ideally, a ML model should work well independently of these shifts, for example, by learning a domain-invariant representation. However, common ML losses do not give strong guarantees on how consistently the ML model performs for different domains, in particular, whether the model performs well on a domain at the expense of its performance on another domain. In this paper, we build new theoretical foundations for this problem, by contributing a set of mathematical relations between classical losses for supervised ML and the Wasserstein distance in joint space (i.e. representation and output space). We show that classification or regression losses, when combined with a GAN-type discriminator between domains, form an upper-bound to the true Wasserstein distance between domains. This implies a more invariant representation and also more stable prediction performance across domains. Theoretical results are corroborated empirically on several image datasets. Our proposed approach systematically produces the highest minimum classification accuracy across domains, and the most invariant representation.


Introduction
Learning from data that originates from different provenances representing the same physical observations occurs rather commonly, but it is nevertheless a highly challenging endeavor. These multiple data sources may e.g. originate from different users, acquisition devices, or geographical locations; they may encompass batch effects in biology, or they may come from the same measurement devices that are each calibrated differently. Because the source of the data itself is typically not task-relevant, a learned model is therefore required to be invariant across domains. A valid strategy for achieving this is to learn an invariant intermediate representation (illustrated in Figure 2). Furthermore, in certain applications, privacy requirements such as anonymity dictate that the source should not be recoverable from the representation. Hence, building a domain invariant representation can also be a desideratum by itself.
Domain invariance, in some contexts referred to as subpopulation shift [27] or distributional shift [2,18], can be contrasted to two related and well-researched areas: domain adaptation (DA) [59,62] and domain generalization (DG) [14,70]. Domain adaptation is mainly concerned with the model performance on the (unlabeled) target domain, often at the expense of incurring more errors on the (labeled) source domain. Domain generalization, on the other hand, aims to build a ML model that generalizes across all domains, including unseen ones. This generality imposes additional constraints on the solution, which can hamper the careful enforcement of invariance w.r.t. the domains at hand. In comparison, domain invariance (DI), our focus in this paper, considers that the ML model is trained and applied on a finite and given set of domains, and each domain is treated equally. The objective is to learn a model whose performance is well-balanced over the multiple given domains. The differences are highlighted graphically and with equations in Figure 1. Hence, we address a singular and important problem, which has so far received little attention, especially in the context of deep learning models.
In order to address domain invariance, we consider in the present work the Wasserstein distance [47,63], as it characterizes the weak convergence of measures and displays several advantages over e.g. the more common Kullback-Leibler divergence, as discussed in [3,44]. We contribute several bounds relating the Wasserstein distance between the joint distributions of two or more domains to the objective function of practical supervised neural networks. This theoretical basis supports the rigorous learning of domain-invariant classifiers through the incorporation of a GAN-type discriminator between domains (or domain critic) as an auxiliary task. With the proposed theoretical grounding, one can show that (1) the Wasserstein distance between the different domains is systematically reduced as an effect of training, and (2) the prediction performance gap between domains is also reduced as a result.
Furthermore, a significant part of the novelty of our work lies in contributing a formalism, which makes our theory applicable to partially labeled distributions. This allows us in particular to cover both supervised and semi-supervised learning scenarios. While a few other works also addressed the scenario where domains are partially labeled, they focus on the related but distinct problems of domain adaptation [8,38,22] and domain generalization [57].
Our proposed approach is tested empirically on three domain invariance benchmarks: MNIST vs. SVHN, and the multi-domain Office-Caltech and PACS datasets. Results confirm our theoretical analysis; in particular, we find that our approach yields highly invariant representations, and that the latter support predictions that are accurate on all domains, including the most difficult ones. Lastly, we inspect the learned invariant representation using UMAP embeddings [40] and 'explainable AI' (cf. [52,56]). This allows us to visually highlight how the data distributions associated to each domain merge into a single distribution under the effect of the training objective. It also allows us to explore which input features are used to map the data into the desired invariant representation [36]. Interestingly, we find that recognizing and exploiting domain-specific features remains in fact an integral part of the neural network strategy to arrive ultimately at the desired invariant representation.

Figure 1: Visual overview of the differences between domain adaptation, domain invariance and domain generalization in the context of classification. X denotes the input domain, and P denotes the various probability distributions. Domain adaptation learns a classifier that matches the target domain (P_target) using information from the source domain, irrespective of its performance on it (errors circled in red). Domain invariance treats each of the n_d domains equally and aims to build domain invariant representations, and therefore a predictor that works equally well on each of them. Domain generalization addresses the more complex task of building a classifier that performs well on any domain drawn from some distribution D, i.e. minimizing E_{P∼D} E_{(x,y)∼P} L(f(x), y) (including unseen domains, here depicted in gray). This is done potentially at the cost of giving up some accuracy on the few given domains (errors on the two domains of interest are circled in red).

Related work
Significant research in machine learning and statistics has been dedicated to the question of distributional shifts (between training and/or test distributions) [2,11]. This has resulted in a variety of machine learning formulations that can be broadly categorized into domain adaptation, domain generalization, and domain invariance.

Domain Adaptation
Domain Adaptation [5] has been studied due to the fact that in real-world situations, when the source and target distributions differ, for instance by a covariate shift [59,62], models trained on the source distribution perform significantly worse on the target. Domain adaptation has two major well-studied settings: Unsupervised Domain Adaptation and Semi-Supervised Domain Adaptation.
Unsupervised Domain Adaptation considers the situation where the source domain has labels but the target domain has not (cf. [67] for a review). Using the theoretical framework of [5], where the target error is upper-bounded by the error on the source domain, the divergence between the marginal distributions of the two domains, and a constant term, [17,58] propose a domain adaptation technique based on the minimization of this upper-bound. The joint distribution optimal transportation (JDOT) method of [9] is similar to [58], but it aims to minimize the Wasserstein distance between joint distributions. Note that [9] does not use adversarial learning and instead solves the primal form of the optimal transport problem, relying on a single-domain classifier to learn on the target domain with transported source-domain labels. The approach of [17] has been extended to a number of subsequent domain adaptation methods [69,60,65]. Metrics other than the Wasserstein distance can be used to align domains, including the MMD, in works such as [68] that use pseudo-labels for unlabeled data with a manifold regularization, or the Bures-Wasserstein distance (a special case of the Wasserstein distance on normal distributions), such as in [37]. The method of [64], inspired by [55], considers the alignment between domains and the class discriminability simultaneously, and proposes to weight these two terms in the objective in a dynamic manner. Although departing from existing theoretical frameworks, it achieves state-of-the-art empirical performance.
Semi-Supervised Domain Adaptation assumes a setting with a well-labeled source domain and a partially-labeled target domain. Observing that a few target labels can greatly improve task performance in applications such as object detection and image recognition, Semi-Supervised Domain Adaptation has recently attracted attention. The methods in [54,48] correct the classifier's predictions, which are biased towards the large amount of labeled data in the source domain, by using the conditional entropy computed from these predictions. In [34,25,24], the input data is perturbed by a powerful data augmentation (e.g. [10,12]) or adversarial method (e.g. [43]), and the model is then trained so that the predictions for the original input and the perturbed input are consistent. Reference [66] proposes an efficient method for training a model by assigning pseudo one-hot labels to unlabeled target data predicted with high confidence during training. The methods presented above achieve good results in numerical experiments, but do not provide a rigorous theoretical discussion of the generalization error. In particular, [54] minimizes an upper-bound of the target error, but that upper-bound contains the joint minimum error, which cannot be optimized, and therefore it is not guaranteed that the target error will be small after training. An exception is JDIP [7], which conducts a theoretical study with some similarities to ours, however in the case of domain adaptation, and with the classical L2 distance instead of a metric on distributions (the Wasserstein distance, with its advantages). This method builds on linear transformations and kernel models, whereas our approach works alongside more powerful and flexible neural networks.
Note that in both domain adaptation settings, the main goal is to improve generalization performance in the target domain, often at the expense of performance in the source domain.
In our work, we would like to have high performance in all given domains, and this task is better addressed using domain invariance.

Domain Generalization
Domain Generalization [6,45,71] is a very challenging problem that aims to achieve high performance on unseen target domains by learning models from multiple fully-labeled source domains. Domain generalization has received significant attention recently. Reference [31] combines an adversarial loss with a maximum mean discrepancy regularizer in order to extract a representation where domains are aligned. The method of [35] uses two adversarial losses to take advantage of label information in fully-labeled domains: the first loss matches the latent representation for each class, and the second loss reduces the negative effects of differences in class distributions across domains. The method of [14] uses meta-learning in order to extract features that are consistent across domains. Reference [70] starts from an adversarial approach and incorporates a metric learning loss into the classifier in order to improve classification boundaries. Reference [41] introduces a new attention diversification framework based on attention maps, where the latter are trained to produce diversified responses for task-related features and to remove domain-specific features. The approach of [72] mixes instance-level feature statistics across source domains; mixing styles of training data has the effect of creating pseudo-new domains, resulting in increased diversity of training domains and improved generalization capability to unseen domains. The method of [33] can address unsupervised domain adaptation and model adaptation (or source-free unsupervised domain adaptation) as well as domain generalization: it generates adversarial attacks to the extent that the semantic information of the original data is retained, and then learns to reduce the classification loss for those adversarial examples.
A few works have focused on theoretical aspects of Domain Generalization. Reference [32] develops theoretical arguments based on the strong assumption that the distribution of latent variables in any domain is represented by a linear combination of those of the other domains. Reference [1] shows an upper-bound theorem indicating that minimizing the divergence between the source marginal distributions, as in [17,58], can minimize the unseen target error when the target distribution lies in the neighborhood of the convex hull of source distributions. However, it is also known that minimizing the divergence to an extremely small value increases the divergence between the target distribution and the convex hull, which leads to an increase in the upper-bound. Reference [61] derives a tighter upper-bound of the target error than [1]; note, however, that the negative effect resulting from minimizing the divergence remains unresolved.
While domain generalization is a challenging and interesting topic on its own, it differs from the setting we consider in the present paper by requiring the model to generalize well to any new domain. Not only does this make the theoretical analysis significantly harder than in the finite-domain setting, but performance gains on new unseen domains are often obtained at the expense of the existing domains, which are in our case the domains of interest.

Domain Invariance
In contrast to domain adaptation and domain generalization, domain invariance is a comparatively less explored setting. Domain invariance shares some technical similarities with works in distributionally robust optimization [49,16,15]. These works however focus on the optimization problem and its theoretical properties rather than the problem of generalization between different groups or domains. Another related area is subpopulation shifts, which addresses the question of generalization across predefined subgroups (e.g. [53,18,27]). Unlike domain invariance, works on subpopulation shifts focus on building invariance to subgroups of the same domain, often numerous with few samples and small discrepancies, rather than producing invariance to qualitatively different domains. Furthermore, works on subpopulation shifts typically consider that the different subgroups are fully labeled, whereas the domain invariance framework we introduce in our work enables learning with domains that are partially labeled. Due to the limited previous works and the lack of reference methods for domain invariance, our experiments section will resort to ablation studies for comparison.

Domain Invariance and Optimal Transport
Domain invariance can be described as the property of a representation to be indistinguishable with regard to its original domain; in particular, the multiple data distributions projected in representation space should look the same (i.e. have low distance). A recently popular way of measuring the distance between two distributions is the Wasserstein distance. The latter can be interpreted as the cost of transporting the probability mass of one distribution to the other if we follow the optimal transport plan, and it can be formally defined as follows. Let P and Q be two arbitrary probability distributions defined over two measurable metric spaces A and B, and let c be a cost function. Their Wasserstein distance is:

W(P, Q) = inf_{π ∈ Π(P,Q)} E_{(a,b)∼π} [c(a, b)],   (1)

with Π(P, Q) the set of joint distributions (couplings) over A × B whose marginals are P and Q. Hence, we measure the invariance of representations by how low the Wasserstein distance is between the distributions P and Q associated to the two domains. The distributions P and Q correspond respectively to the distributions P_1 and P_2 of Fig. 1. The Wasserstein distance being scale-dependent, we assume that representations of both domains have fixed scale. In comparison to other common alternatives such as the Kullback-Leibler divergence, the Jensen-Shannon divergence, or the Total Variation distance, the Wasserstein distance has the advantage of taking into account the metric of the representation space (via the cost function c(a, b)), instead of looking at pure distributional overlap, and this typically leads to better ML models [44,3].

Computing the Wasserstein distance with Eq. (1) is expensive. Luckily, if we use the metric of our space as a cost function, such as the Euclidean distance c(a, b) = ∥a − b∥_2, we can derive a dual formulation of the 1-Wasserstein distance as follows:

W(P, Q) = sup_{∥φ∥_L ≤ 1} E_P[φ] − E_Q[φ],   (2)

where the supremum is taken over all 1-Lipschitz functions φ, and E_P[φ] is the expected value of the function φ on the distribution P. This formulation replaces an explicit computation of a transport plan by a function to estimate, a task particularly appropriate for neural networks. Recently, several methods have used this approach to learn distributions [44], specifically in the context of Generative Adversarial Networks [3,58].
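As an aside, in one dimension the optimal transport plan simply matches sorted samples, which makes the primal form of Eq. (1) cheap to compute and gives a quick sanity check of the dual intuition: a sample shifted by a constant c has 1-Wasserstein distance |c|, attained by the 1-Lipschitz critic φ(x) = x. A minimal sketch (for illustration only, not part of our method):

```python
import numpy as np

def w1_sorted(xs, ys):
    # Empirical 1-Wasserstein distance between equal-size 1-D samples:
    # the optimal transport plan pairs the i-th smallest of xs
    # with the i-th smallest of ys.
    return float(np.mean(np.abs(np.sort(xs) - np.sort(ys))))

# Shifting a sample by a constant c gives W1 = |c|.
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = a + 2.0
shift_distance = w1_sorted(a, b)  # ≈ 2.0 up to float rounding
```

This closed form only exists in one dimension; in higher dimensions (our representation space) the dual estimate of Eq. (2) with a neural critic is used instead.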
The main constraint lies in the necessity for the function φ, which we will call the discriminator (or critic), to be 1-Lipschitz. A few approaches have been proposed to enforce this constraint, such as gradient clipping [3], gradient penalty [21] and, more recently, spectral normalization [42]. It is however important to note that in practice the set of possible discriminators will be a subset of the 1-Lipschitz continuous functions.
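Of these techniques, spectral normalization is the easiest to illustrate: dividing a weight matrix by its largest singular value (estimated by power iteration) makes the corresponding linear layer 1-Lipschitz. A small numpy sketch of the idea, not the exact procedure of [42]:

```python
import numpy as np

def spectral_norm(W, n_iter=100):
    # Power iteration: estimate the largest singular value of W,
    # which is the Lipschitz constant of the linear map x -> W x.
    rng = np.random.default_rng(0)
    u = rng.normal(size=W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    return float(u @ W @ v)

W = np.array([[3.0, 0.0],
              [0.0, 1.0]])
sigma = spectral_norm(W)   # largest singular value, here 3.0
W_sn = W / sigma           # spectral norm ≈ 1 -> 1-Lipschitz layer
```

Applying this normalization to every layer of the critic bounds the Lipschitz constant of the whole network by the product of the per-layer constants.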

Relating Wasserstein Distance to Supervised Losses
We would like to align the predicting behavior of a ML model on multiple domains following the approach illustrated in Figure 2, i.e. by learning a domain-invariant representation. Specifically, we aim for a representation of data where the distributions associated to the two domains have minimum Wasserstein distance and therefore cannot be distinguished. At the same time, the representation should contain the features that are necessary to solve the given prediction task, e.g. using common supervised loss functions. We focus here on the two-domain case, and refer to Supplementary Note D for the case of three or more domains.
Let us start with some formalities. We denote by X the input space, by Z ⊂ R^d our representation or feature space, and by Y the label or target space. We further denote by Φ : X → Z the feature extractor, and by f : Z → Y the prediction function (e.g. regression; classification). We assume Z and Y to be compact measurable spaces, and we denote by M^1_+(Z × Y) the set of probability distributions defined on their product space. Let P^t, Q^t ∈ M^1_+(Z × Y) be the true probability distributions formed by the two domains we would like to align. When necessary, we add a subscript to these distributions to specify their support.
Similarly to previous works, domain alignment will be measured as the Wasserstein distance W(P^t, Q^t) of samples embedded in feature space, but also including their labels. We contribute by showing that the Wasserstein distance W(P^t, Q^t) can be related formally to common loss functions used in classification or regression, via mathematical inequalities. With these inequalities, one can design practical learning objectives fairly easily, whose minimization not only solves the task at hand, but also implies as a side effect the minimization of the Wasserstein distance between the distributions of the two domains, thereby achieving domain invariance.
From a technical standpoint, our novel approach draws some inspiration from [9] and is based on measure theory, in order to formalize partially labeled distributions and therefore our problem of aligning multiple joint distributions. The individual steps are presented in Sections 4.1-4.2, and we provide an overview of our novel theoretical framework in Figure 3.

Incorporating Semi-Supervised Data
Computation of the true Wasserstein distance W(P^t, Q^t) would require knowledge of the true distributions P^t and Q^t. In practice, we only have a finite sample of these distributions, and the quality with which the Wasserstein distance can be approximated largely depends on the amount of labeled data available. (For high-dimensional tasks, the necessary amount of labels would be overwhelming.) However, in practice, it is common that unlabeled data is available in much larger quantity than labeled data. We consider this semi-labeled scenario where only a fraction of the data is obtained from P^t, Q^t (i.e. labeled). The remaining data are unlabeled and obtained from the marginals P^t_Z, Q^t_Z. By learning an appropriate function f : Z → Y, say, a neural network classifier, that infers labels from features, one can obtain an approximation P^f = (z, f(z))_{z∼P^t_Z} of the true joint distribution P^t. This implies that the distribution we effectively have access to (and draw from) in practice is the mixture:

P = α P^t + (1 − α) P^f,

where α ∈ (0, 1) is the fraction of labeled data. We will refer to P as the inferred distribution. Identically for the second domain, we construct an appropriate function g : Z → Y from which one can predict labels, and which results in another inferred distribution:

Q = β Q^t + (1 − β) Q^g.

Note that the functions f and g need not be identical. Also, β may differ from α, which addresses the case where different domains have different proportions of labeled data. Let the Wasserstein distance's cost function c be the metric on the space Z × Y. By application of the triangle inequality, the following relation can be extracted:

W(P^t, Q^t) ≤ W(P^t, P) + W(P, Q) + W(Q, Q^t).   (3)

In other words, the distance between the inferred distributions W(P, Q), to which we add the inference 'errors', i.e. the distances between true and inferred distributions on each domain, forms an upper-bound to the true distance between distributions. Let us now analyze the error term W(P^t, P). We consider the case of P (and proceed analogously for Q).
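Sampling from the inferred distribution P is straightforward: with probability α draw a true labeled pair, otherwise draw an unlabeled feature and impute its label with f. A toy sketch (all names hypothetical, for illustration only):

```python
import random

def sample_inferred(labeled, unlabeled_z, f, alpha, rng):
    # Draw one point from the mixture P = alpha * P^t + (1 - alpha) * P^f:
    # with probability alpha, return a true labeled pair (z, y) ~ P^t;
    # otherwise take an unlabeled feature z ~ P^t_Z and impute y = f(z).
    if rng.random() < alpha:
        return rng.choice(labeled)      # (z, y) ~ P^t
    z = rng.choice(unlabeled_z)         # z ~ P^t_Z
    return (z, f(z))                    # (z, f(z)) ~ P^f

rng = random.Random(0)
labeled = [(0.2, 1), (0.8, 0)]          # toy (feature, label) pairs
unlabeled = [0.4, 0.6]                  # toy features without labels
f = lambda z: int(z < 0.5)              # toy classifier
sample = sample_inferred(labeled, unlabeled, f, alpha=0.5, rng=rng)
```

In practice, of course, one draws whole minibatches this way rather than single points, and f is the neural classifier being trained.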
Lemma 1. Let K be any probability distribution on Z × Y, and let P = α P^t + (1 − α) P^f be the mixture defined above. Then:

W(K, P) ≤ α W(K, P^t) + (1 − α) W(K, P^f).

In particular, setting K = P^t yields:

W(P^t, P) ≤ (1 − α) W(P^t, P^f).   (4)

A proof can be found in Supplementary Note A. The proof proceeds by first decomposing P into its mixture components, and then applying Jensen's inequality on the Wasserstein dual's supremum. For the special case of Eq. (4), the equality is due to K being an element of the mixture P, the Wasserstein distance of K to that mixture element being consequently zero.
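Lemma 1 is easy to check numerically in one dimension, where the empirical Wasserstein distance is obtained by matching sorted samples. A toy sketch with 1-D distributions standing in for the joints (in fact, in one dimension the bound of Eq. (4) holds with equality, since the 1-D 1-Wasserstein distance equals the L1 distance between CDFs, which is linear in mixtures):

```python
import numpy as np

def w1_sorted(xs, ys):
    # empirical 1-D Wasserstein distance: match sorted samples
    return float(np.mean(np.abs(np.sort(xs) - np.sort(ys))))

rng = np.random.default_rng(0)
n, alpha = 4000, 0.5
pt = rng.normal(0.0, 1.0, n)               # stand-in for P^t
pf = rng.normal(2.0, 1.0, n)               # stand-in for P^f
# samples from the mixture P = alpha * P^t + (1 - alpha) * P^f
mix = np.concatenate([rng.normal(0.0, 1.0, n // 2),
                      rng.normal(2.0, 1.0, n // 2)])
lhs = w1_sorted(pt, mix)                   # W(P^t, P)
rhs = (1 - alpha) * w1_sorted(pt, pf)      # (1 - alpha) * W(P^t, P^f)
# lhs <= rhs up to sampling noise (here both are close to 1)
```

In higher dimensions the inequality is generally strict, but the same decomposition argument applies.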
Finally, combining the results above, that is, by applying the triangle inequality, Lemma 1, and using the symmetry property of the Wasserstein distance, one can obtain another bound on the true Wasserstein distance, where unlike Eq. (3), some mixture components now appear explicitly:

Theorem 1. Given that the Wasserstein distance's cost function c is the metric on the product space Z × Y, we get:

W(P^t, Q^t) ≤ W(P, Q) + (1 − α) W(P^t, P^f) + (1 − β) W(Q^t, Q^g).   (5)

This final formulation will let us relate in Section 4.2 some of the expanded terms, specifically the distances W(P^t, P^f) and W(Q^t, Q^g), to common loss functions used in supervised machine learning.

Connection to Supervised ML Losses
Various loss functions have been proposed for supervised learning. They address the diversity of output types (e.g. class labels; regression targets) and statistical properties of the data (e.g. margin between classes; presence/absence of outliers). Ideally, one would be able to achieve domain invariance while retaining the ability to optimize the most suitable loss function for a given problem.
Let us start with Eq. (5) in Theorem 1, in particular the distance W(P^t, P^f). The latter essentially measures the level of error with which the function f predicts the true labels y. It therefore plays a similar role to common loss functions used for supervised machine learning, and both can be mathematically related. First, the Mean Absolute Error (MAE) commonly used in robust regression can be related to W(P^t, P^f) as follows:

Lemma 2. W(P^t, P^f) ≤ E_{(z,y)∼P^t} ∥f(z) − y∥.   (6)

The proof is given in Supplementary Note A. In the case of classification, a similar result can be provided for the Kullback-Leibler (KL) divergence, which is equivalent to the cross-entropy loss when P_{Y|z} is deterministic:

Lemma 3. Assuming that P^f and P^t admit densities, we then obtain:

W(P^t, P^f) ≤ diam(Z × Y) · sqrt((1/2) KL(P^t ∥ P^f)),   (7)

where KL(P ∥ Q) = −∫_{Z×Y} log (dQ/dP) dP is the Kullback-Leibler divergence, and where diam(·) is the diameter of the space received as input, i.e. the largest distance obtainable in that space.
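The intuition behind the MAE bound in Lemma 2 is that one feasible (generally suboptimal) coupling transports each point (z, y) ∼ P^t to (z, f(z)), moving mass only along the Y coordinate; the Wasserstein infimum over all couplings can only be smaller. Schematically:

```latex
W(P^t, P^f)
  \;\le\; \mathbb{E}_{(z,y)\sim P^t}\,
          c\big((z, y),\, (z, f(z))\big)
  \;=\; \mathbb{E}_{(z,y)\sim P^t} \,\|y - f(z)\|,
```

where feasibility holds because the two marginals of this coupling are P^t and the push-forward of P^t_Z under z ↦ (z, f(z)), which is precisely P^f.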
For a proof, see Supplementary Note A. Lemmas 2 and 3 now relate the Wasserstein distance formulation to loss functions occurring in regression and classification tasks that are easily computable and have the desired statistical properties. Together with the relation shown in Section 4.1, we can now propose a ML formulation that both addresses the prediction task and enforces domain invariance.

Learning a Domain Invariant Neural Network
Consider that the available data consists of examples sampled from both domains, specifically from the distributions P and Q. Under these distributions, part of the data comes with the true labels; for the rest of the data, labels are inferred via the functions f and g respectively. We denote by (X^p, Y^p) the dataset of n examples drawn from the first domain P, and by (X^q, Y^q) the dataset of m examples drawn from the second domain Q. Based on Theorem 1 and Lemmas 2-3, one can define a learning procedure that consists of simultaneously minimizing a supervised loss function L on each domain and the Wasserstein distance W(P, Q) aligning the distributions of the two domains. In the classification setting, the supervised loss for the first domain is defined as:

L(Z^p, Y^p) = (1/n) Σ_{i=1}^{n} KL(Y^p_i ∥ f(Z^p_i)),   (8)

where Y^p_i and f(Z^p_i) are vectors containing the probabilities of each class. A similar loss function can be built for the second domain. Note that the loss is only effective on the examples that come with a true label, because when the label is inferred, we have KL(f(Z^p_i) ∥ f(Z^p_i)) = 0. In a similar fashion, for the regression setting, we define L(Z^p, Y^p) = (1/n) Σ_{i=1}^{n} ∥f(Z^p_i) − Y^p_i∥. For the minimization of the Wasserstein distance W(P, Q), we consider the dual form provided in Eq. (2), specifically an empirical estimate of it:

∆ = (1/n) Σ_{i=1}^{n} φ(Z^p_i, Y^p_i) − (1/m) Σ_{j=1}^{m} φ(Z^q_j, Y^q_j),   (9)

and this forms our domain critic. Finally, one can sum the supervised terms and the domain critic, i.e. Eqs. (8) and (9), and optimize the resulting objective w.r.t.
the functions f, Φ, φ (more precisely, the parameters of the neural networks implementing these functions). This can be formulated as the GAN-like optimization problem:

min_{Φ, f} max_{∥φ∥_L ≤ 1}  λ_p L(Z^p, Y^p) + λ_q L(Z^q, Y^q) + ∆,   (10)

where the classifier terms and the domain critic are in competition. The hyperparameters λ_p and λ_q can either be fixed to 1 − α and 1 − β respectively (and, for classification, multiplied by the domain's diameter) in order to match the theory; or they can be selected heuristically or based on some validation procedure. The Lipschitzness constraint on φ is practically enforced by using one of the regularization techniques mentioned at the end of Section 3. Additionally, a constraint on the scale of the representation or on the Lipschitzness of the classifier f can be added in order to prevent an arbitrary downscaling of the representation, which may cause the Wasserstein distance to artificially go to zero. Lastly, supplementary regularization terms, such as EntMin [20], Virtual Adversarial Training [43], and Virtual Mixup [39], can be added to the objective in order to take further advantage of the unlabeled examples. A visual representation of our model is given in Figure 4. The steps needed to train a domain invariant network are summarized in Algorithm 1.
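For concreteness, one evaluation of an Eq.-(10)-style objective can be sketched as follows. This is a simplified numpy illustration, with a single classifier f shared across domains, cross-entropy in place of the KL term, and no gradient machinery; all function names are hypothetical:

```python
import numpy as np

def objective(zp, yp, lp, zq, yq, lq, f, phi, lam_p, lam_q):
    # Supervised cross-entropy on the *labeled* examples of each domain,
    # plus the dual-form Wasserstein critic on (z, y) pairs.
    def ce(probs, y, labeled):
        # cross-entropy restricted to examples flagged as labeled
        idx = np.where(labeled)[0]
        return -np.mean(np.log(probs[idx, y[idx]] + 1e-12))
    sup = lam_p * ce(f(zp), yp, lp) + lam_q * ce(f(zq), yq, lq)
    # critic term, cf. Eq. (9): E_P[phi(z, y)] - E_Q[phi(z, y)]
    critic = np.mean(phi(zp, yp)) - np.mean(phi(zq, yq))
    return sup + critic  # minimized over f, Phi; maximized over phi
```

In a real implementation, the min-max structure is handled either by alternating updates of the critic and the classifier, or by a gradient reversal layer as in Algorithm 1.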
Data: Semi-supervised datasets for both domains: (X^p, Y^p), (X^q, Y^q)
Input: Untrained Φ, f with parameters θ; hyperparameters λ_p, λ_q
Result: Trained Φ, f
for epochs do
    for batch (x^p, y^p), (x^q, y^q) ∈ (X^p, Y^p), (X^q, Y^q) do
        /* Compute features */
        z^p ← Φ(x^p);  z^q ← Φ(x^q)
        /* Impute instances with missing labels in the batch */
        y^p_i ← f(z^p_i) for all unlabeled i
        y^q_j ← f(z^q_j) for all unlabeled j
        /* Reverse gradient of features for domain discriminator */
        z^p_rev, y^p_rev ← RevGrad(z^p), RevGrad(y^p)
        z^q_rev, y^q_rev ← RevGrad(z^q), RevGrad(y^q)
        /* Compute losses */
        L_total ← λ_p L(z^p, y^p) + λ_q L(z^q, y^q) + ∆(z^p_rev, y^p_rev, z^q_rev, y^q_rev)
        /* Update parameters */
        θ ← θ − γ∇L_total
    end
end

Algorithm 1: Algorithm for training our proposed domain invariant network. The function 'RevGrad' denotes a gradient reversal layer, which leaves the forward pass unchanged but multiplies the gradient by −1 in the backward pass (see e.g. [17]).
We now outline a specific condition on the data distribution which ensures statistical consistency of the estimator.
Remark 1. Let us consider the condition under which Eq. (10) is statistically consistent when sufficiently many labeled examples are observed. For the data distributions, suppose that there exists a function z = Φ^o(x) such that (i) the conditional probabilities satisfy P^t(Y | z) = Q^t(Y | z) for almost every z, and (ii) Φ^o_* P_X(A) = Φ^o_* Q_X(A) for any measurable subset A ⊂ Z, where Φ^o_* is the push-forward with the function x → Φ^o(x). Under the above assumptions, we see that the marginal distributions on Z coincide, and hence the Wasserstein distance estimator ∆ under the population distribution becomes zero. Due to the assumption on the conditional distribution, the optimal classifier f on Z is common to both distributions. Therefore, the optimal classifier with domain-invariant features is obtained by Eq. (10).
To put it more simply, by assuming that there exists a feature map under which the marginal distributions are aligned, and that the conditional distributions are equal almost everywhere on the image of Φ^o(x), optimizing our objective function leads to the optimal classifier.

Generalization Bounds
Interestingly, the Wasserstein distance between the true distributions of the two domains (which we have upper-bounded in Theorem 1) can also be related to the risks of the classifier on the two domains. Let R_{P^t}(f) = E_{(z,y)∼P^t}[L(f(z), y)] be the risk or error of a classifier f. We here develop a result using the joint Wasserstein distance, similar to a previous result obtained by [51] on the distance between marginals.

Theorem 2. Let Z, Y be two compact measurable metric spaces whose product space has dimension d. Let P^t, Q^t ∈ M^1_+(Z × Y) be two joint distributions associated to the two domains, and P̂^t, Q̂^t their empirical counterparts. Let the transport cost function c associated to the optimal transport problem be c(z_1, y_1; z_2, y_2) = ∥(z_1, y_1) − (z_2, y_2)∥_2, the Euclidean distance as the metric on Z × Y, and let L : Y × Y → R_+ be a symmetric κ-Lipschitz loss function. Then for any d′ > d and ψ′ < √2, there exists some constant N_0 depending on d′ such that for any δ > 0 and min(N_P, N_Q) ≥ N_0 max(δ^{−(d′+2)}, 1), with probability at least 1 − δ, for all λ-Lipschitz f the following holds:

R_{P^t}(f) ≤ R_{Q^t}(f) + κ √(λ² + 1) [ W(P̂^t, Q̂^t) + √(2 log(1/δ)/ψ′) (√(1/N_P) + √(1/N_Q)) ].

(A proof is given in Supplementary Note A.) In other words, the empirical Wasserstein distance between the two domains upper-bounds the prediction performance gap between the two domains. In practice, we can therefore expect the optimization of the objective in Eq. (10) not only to reduce the Wasserstein distance between domains (as we have shown in the previous sections), but also to produce a more uniform classification accuracy across domains, and therefore a higher minimum accuracy.
We may also want to compare the joint discriminator W(P, Q) to the more common marginal discriminator W(P_Z, Q_Z). Indeed, in many cases, such as when the conditional distribution is identical between the two domains, the solutions obtained appear to be equivalent. This is true for theoretical reasons we will explore here. Let us first recall a bound on the distance between errors involving a marginal Wasserstein distance.
Theorem 3 (adapted from [58]). Let P^t_Z, Q^t_Z ∈ M^1_+(Z) be two probability measures. Assume the functions f ∈ H are all λ-Lipschitz continuous for some λ. Then for every f ∈ H the following holds:

R_{Q^t_Z}(f) ≤ R_{P^t_Z}(f) + 2λ W(P^t_Z, Q^t_Z) + E,

where E = min_{f* ∈ H} [R_{P^t_Z}(f*) + R_{Q^t_Z}(f*)] is the combined error of the ideal hypothesis f* that minimizes the combined error. The main difference compared with our bound is the presence of E, the combined error of the hypothesis on both domains. Indeed, when the features of the two domains are properly aligned, the bounds obtained with joint or marginal Wasserstein distances are similar. However, when the domains are not properly aligned, usually because the transformation between the two domains is large and labeled samples are lacking, we can have a large E, such as E = 1, which renders the bound very large. Such a case arises, for instance, when both domains have entirely identical samples but with opposite labels. The bound with the joint Wasserstein distance can lead to better-aligned features even with large transformations between domains. We expect to observe similar performances between marginal and joint discriminators in experiments with simple transformations, and larger discrepancies as the transformations get larger.

Experiments
To test whether our proposed approach truly achieves an invariant representation and reduces the performance gap between domains (as predicted by Theorems 1 and 2 respectively), we conduct experiments on three common image classification problems. First, a handwritten digit recognition task where the digit images come from two popular datasets, MNIST [28] and SVHN [46], each of them constituting one domain. Then, we consider the Office-Caltech classification dataset [19], which consists of four domains. Finally, we consider the recent and more complex PACS multi-domain image recognition dataset [30], which also consists of four domains. We describe these multi-domain tasks and the training procedure for our models below. More details are provided in Supplementary Note B.

Data and Models
MNIST and SVHN are two common digit recognition datasets composed of 60000 and 73257 training examples respectively. While MNIST digits are black&white, SVHN digits are colored and have more complex appearances, making them harder to classify. In our MNIST-SVHN two-domain scenario, we simulate partly labeled data by only providing labels for a random subset of examples (1000 per domain for the experiments of Table 1, and 3000 per domain for the experiments of Table 2). The remaining examples are given unlabeled. MNIST images are brought to the SVHN format by scaling and setting each RGB component to the MNIST grayscale value. For the experiments in Tables 1 and 2, the function Φ is implemented by the Conv-Large model from [43]. The model takes as input images of size 32 × 32 × 3. We use small random translations of 2 pixels as well as color jittering as data augmentation.
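The format conversion can be sketched as follows (a minimal numpy sketch; the paper only states "scaling and setting each RGB component to the grayscale value", so the use of zero-padding rather than interpolation for the 28 → 32 resizing is our assumption):

```python
import numpy as np

def mnist_to_svhn_format(img28):
    """Bring a 28x28 grayscale MNIST digit to the 32x32x3 SVHN format:
    pad to 32x32, then copy the gray value into each RGB channel."""
    img32 = np.pad(img28, 2)                       # 28 -> 32 on both axes
    return np.repeat(img32[:, :, None], 3, axis=2)

digit = np.random.default_rng(0).random((28, 28))
rgb = mnist_to_svhn_format(digit)                  # shape (32, 32, 3)
```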
Importantly, for the purpose of evaluating the domain invariance of representations, we would like to stabilize the scale of the representations learned by the different models. Specifically, for the experiments of Table 1 we add a further penalty to the objective: the Wasserstein distance between the distribution of distances in representation space (the histogram of distances of the union distribution of P_Z and Q_Z) and a predefined Gaussian mixture, which we set to be the univariate mixture (1/10) N(5, 2) + (9/10) N(15, 3). The two modes model distances between data points of the same class and of different classes respectively. Since the distribution of distances is a one-dimensional histogram, the Wasserstein distance can be computed analytically [47]. This added constraint ensures a similar scale for the representations extracted by our model and the different baselines. In particular, it ensures that a reduction of the Wasserstein distance in representation space can be reliably interpreted as an increase of domain invariance, and not as a simple rescaling of the representation. We experimented with several metrics and constraints before settling on this one, although it comes with some side effects. By its very definition, it implies that there should be 10 equidistant and equally sized clusters, an assumption that does not hold for all datasets (for instance, SVHN). Moreover, it gives a stronger advantage to the bare methods without any discriminator, since the constraint acts indirectly as one (in the form of a prior).
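As an illustration of this penalty (a simplified sketch of ours: we compare the empirical distance histogram to samples drawn from the Gaussian-mixture prior, whereas the analytic computation of [47] uses quantile functions directly):

```python
import numpy as np
from scipy.stats import wasserstein_distance
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
reps = rng.normal(size=(200, 16))        # stand-in for learned representations

# Histogram of pairwise distances in representation space
dists = pdist(reps)

# Samples from the prior (1/10) N(5, 2) + (9/10) N(15, 3)
second_mode = rng.random(len(dists)) < 0.9
prior = np.where(second_mode,
                 rng.normal(15, 3, len(dists)),
                 rng.normal(5, 2, len(dists)))

# In one dimension the Wasserstein distance has a closed form,
# so this penalty needs no discriminator
penalty = wasserstein_distance(dists, prior)
```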
Our second scenario is based on the Office-Caltech dataset. It is composed of four domains (Amazon, Caltech, DSLR, Webcam) containing pictures of objects found in offices (such as monitors) from different sources, e.g. pictures of a real office, or pictures with white backgrounds from an e-commerce website. There are 10 classes, and a varying number of samples per domain (between 150 and 1100). We use the Decaf6 [13] features with 4096 dimensions, and the Resnet-18 architecture [23]. We train a model on each possible bi-domain task (6 tasks) and average the resulting accuracies per domain.
Our third and last scenario is based on the PACS dataset, which consists of 10000 examples, with 4 domains (Photo, Art, Cartoon, Sketch) and 7 classes. We simulate semi-labeled data by providing labels for only 500 randomly sampled images from each domain and leaving the remaining images unlabeled. The classes and domains are imbalanced, i.e. they contain different numbers of examples. The images are resized to 224 × 224 × 3, and a data augmentation pipeline based on RandAugment [10] is applied. We again use the Resnet-18 architecture. On this dataset, we test domain invariance in a 'one vs. rest' setting.
In all our experiments, the classifier f is a simple 2-layer MLP, and the discriminator φ a 3-layer MLP with spectral-normalized weights [42]. (On the multi-domain PACS, we use one discriminator per domain, computed in a one-vs-rest manner.) The weights (hyperparameters) λ_p and λ_q for each loss term are set to one, as is the discriminator's. Unless mentioned otherwise, the networks are trained for 20 to 50 epochs using the Adam optimizer [26].
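Spectral normalization [42] keeps the discriminator approximately 1-Lipschitz by dividing each weight matrix by an estimate of its largest singular value. A minimal numpy sketch of the underlying computation (actual implementations run one power-iteration step per training update):

```python
import numpy as np

def spectral_normalize(w, n_iter=100):
    """Estimate the top singular value of w by power iteration
    and return the normalized matrix w / sigma."""
    rng = np.random.default_rng(0)
    u = rng.normal(size=w.shape[0])
    for _ in range(n_iter):
        v = w.T @ u
        v /= np.linalg.norm(v)
        u = w @ v
        u /= np.linalg.norm(u)
    sigma = u @ w @ v              # Rayleigh-quotient estimate of sigma_max
    return w / sigma

w = np.random.default_rng(1).normal(size=(32, 16))
w_sn = spectral_normalize(w)       # largest singular value is now ~1
```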

Results and Analysis
As a first experiment, we study the effect of the domain critic proposed in Section 5 on the accuracy of the model and on the Wasserstein distance between the two domains. We consider two baselines for comparison: (1) a simple supervised neural network without a domain critic, and (2) a supervised network where the critic φ is based only on the marginal distributions (as proposed in [58]). These two baselines can be interpreted as an ablation study of our method, where instead of applying the Wasserstein distance to the joint input-label distribution, we apply it first only to the input variables (marginal critic), and then to no variables at all (no critic). For this experiment we do not use any additional losses or regularizers, and simply optimize the classification and domain-alignment terms. We report the Wasserstein distance between the two domains' joint distributions, and the minimum classification accuracy over the two domains; these are the two properties that our domain-invariant network is expected to fulfill (Theorems 1 and 2 respectively). Results are shown in Table 1 and corroborate our theory. In particular, we observe that the Wasserstein distance decreases significantly under the effect of adding a domain critic, especially a joint domain critic that puts more focus on Y, and that the minimum accuracy over the two domains increases. Furthermore, we observe in this experiment that the use of a joint critic also leads to the highest average accuracy across domains.

Accuracy
Independently of the question of domain invariance, unlabeled data has long been leveraged by classical semi-supervised learning approaches. These approaches have proven powerful on data with manifold structure (e.g. [50, 29]). In our next experiment, we test the benefit of domain alignment on models that are already equipped with semi-supervised learning mechanisms. Specifically, we consider a combination of two common semi-supervised techniques, conditional entropy minimization (EntMin) [20] and virtual adversarial training (VAT) [43], which have shown strong empirical performance on numerous tasks. Results are given in Table 2.

We observe that semi-supervised learning on both domains, achieved by a combination of VAT and EntMin, forms a strong baseline. In particular, it achieves the highest performance on MNIST. Our domain-invariant approach, combined with the same techniques, further improves over this strong baseline, reducing the accuracy gap between the two domains and reaching a higher accuracy on the more difficult domain (and also on average). Results improve further when a final supervised fine-tuning step is applied to our model without the discriminator, with the classification loss re-weighted depending on the domain classification error. Note that this fine-tuning step, while improving classification, hampers the domain alignment and therefore the reusability of the features for alternative tasks as well as domain privacy. More details are given in Supplementary Note C.1.
Table 3 displays the average results of our bi-domain experiments on the Office-Caltech dataset. We observe that the joint critic (ours) is always better than the marginal one. We also observe that the joint critic, compared to the absence of a critic, leads to more uniform results across domains (and therefore a higher minimum accuracy), as well as a higher average accuracy. These observations are consistent with our theoretical results. Finally, Table 4 shows prediction performance on the more complex PACS dataset. We test our model on this data in a one-vs-rest setting, so that the model must learn to be invariant between one domain and the three remaining domains. Again, we find that our model produces the best minimum and average accuracy in each scenario. We found that a trade-off may exist between Art and the other domains: although our method performs worse than its competitors on this domain, it leads to domain accuracies more concentrated around the mean, and therefore a higher minimum accuracy. The average accuracy also increases.

Lastly, we would like to reiterate that the problem of domain invariance has received considerably less attention in the context of deep neural networks than the tasks of domain adaptation and domain generalization. Our quantitative results, as well as the multiple baseline results, aim to provide useful reference values for future work on domain invariance.

Visual Insights on Learned Representations
While the results in the section above have quantitatively verified the performance of our proposed domain-invariant network, we would like to also present some qualitative insights.
As a first experiment, we visualize how the representation of the Conv-Large model trained with our proposed approach becomes more task-specific and less domain-dependent throughout training. For this, we take samples from P_Z and Q_Z, join them, and perform a low-dimensional embedding of the resulting distribution via UMAP [40]. Plots before and after training are shown in Figure 5 (left). The visualization suggests that the two domains are strongly separated initially, but under the influence of domain-invariant training, they collapse to the same regions in representation space. As expected, the learned representation also better resolves the different classes after training (here roughly given by the cluster structure). As a second experiment, we present SVHN-like synthetic examples to our domain-invariant network and vary the digit and the colors. Using the Layer-wise Relevance Propagation (LRP) explanation method [4], we then compute the local response of the model for each prediction. The LRP method identifies the contribution of each input pixel to the prediction. These pixel-wise contributions can also be seen as the summands of a linear model, which forms a local interpretable surrogate for the original model. We refer to the weights of this linear model as the 'LRP response map' (details on LRP and how to generate response maps are given in Supplementary Note C.2).
A selection of examples and associated LRP response maps is shown in Figure 5 (right), featuring two digit classes and SVHN-like color variations. Although we would expect style and color to play a marginal role in representation space (our objective has enforced invariance between the colored SVHN domain and the black&white handwritten MNIST domain), recognizing such style and color variations remains an integral part of the neural network's prediction strategy. Indeed, we observe that the model precisely adapts to the input digit by providing domain-specific response maps of corresponding colors. This strategy is therefore instrumental in building the domain-invariant representation.

Conclusion
Real-world data is often heterogeneous, subject to sub-population shifts, or coming from multiple domains. In this work, we have for the first time studied the problem of learning domain-invariant representations as measured by the joint Wasserstein distance. We have created a theoretical framework for semi-supervised domain invariance and have contributed several upper bounds on the Wasserstein distance between joint distributions that link domain invariance to practical learning objectives.
In our benchmark experiments, we find that optimizing the resulting objective leads to high prediction accuracy on both domains while simultaneously achieving high domain invariance, which we also observe qualitatively in low-dimensional embedding visualizations. We have further observed, somewhat counterintuitively, that domain-adversarial training can still result in a model that makes use of domain-specific features in order to arrive at the domain-invariant representation.
Our work allows for several future extensions. For example, it would be interesting to obtain a theoretical connection to other representation learning methods, in particular contrastive learning, which could be integrated into our framework. Furthermore, an extension of our theory to domain generalization could enable further applications and increase our understanding of domain generalization itself.
Overall, our work on domain invariance provides new theoretical insights as well as competitive quantitative results for a number of scenarios and baselines. We believe it thereby constitutes a useful first basis for further research on domain-invariant ML models and their applications.

Supplementary Note A. Proofs of Main Results
In the following, we give the proofs of the main theoretical results presented in the paper. After each formal mathematical proof, we detail in the paragraph below it the steps taken to reach the final result.
W(P, K) = sup_{∥g∥_L ≤ 1} [ ∫ g dP − ∫ g dK ]  (2)
= sup_{∥g∥_L ≤ 1} [ α ∫ g dP^t + (1−α) ∫ g dP^f − ∫ g dK ]  (3)
= sup_{∥g∥_L ≤ 1} [ α ( ∫ g dP^t − ∫ g dK ) + (1−α) ( ∫ g dP^f − ∫ g dK ) ]  (4)
≤ α sup_{∥g∥_L ≤ 1} [ ∫ g dP^t − ∫ g dK ] + (1−α) sup_{∥g∥_L ≤ 1} [ ∫ g dP^f − ∫ g dK ]  (5)
= α W(P^t, K) + (1−α) W(P^f, K)  (6)

We start the proof in Eq. (2) by stating the definition of the dual of the 1-Wasserstein distance (with the cost function being the metric of our space). Because α ∈ (0, 1) and P^t and P^f are finite, we can decompose the integral over the mixture of measures as a mixture of integrals over the elementary measures, as done in Eq. (3). In Eq. (4) we also decompose the second integral into parts weighted by the same α and 1 − α (which sum to one), group them with the corresponding integrals over the other measure, and factor out the weights. In Eq. (5) we upper-bound the supremum of a sum by the sum of the suprema, and pull out the constants. What we obtain is a sum of two Wasserstein distances by their dual definition, which yields Eq. (6) and completes the proof. For the second result, in the case K = P^t, the first term in Eq. (5) vanishes; the supremum bound is then not needed, and we obtain the equality W(P, P^t) = (1 − α) W(P^f, P^t).

arXiv:2106.04923v2 [stat.ML] 21 Aug 2023
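Lemma 1's mixture bound can be checked numerically in one dimension, where scipy provides the closed-form Wasserstein distance (an illustration of ours with arbitrary Gaussian samples):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
alpha = 0.3
pt = rng.normal(0, 1, 500)   # samples from P^t
pf = rng.normal(3, 1, 500)   # samples from P^f
k = rng.normal(1, 2, 500)    # samples from K

# The mixture alpha*P^t + (1-alpha)*P^f as one weighted empirical measure
mix = np.concatenate([pt, pf])
w_mix = np.concatenate([np.full(500, alpha / 500),
                        np.full(500, (1 - alpha) / 500)])

lhs = wasserstein_distance(mix, k, u_weights=w_mix)
rhs = (alpha * wasserstein_distance(pt, k)
       + (1 - alpha) * wasserstein_distance(pf, k))
# lhs <= rhs, as the lemma states
```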
Lemma 3. Assuming that P^f and P^t admit densities, we obtain

W_1(P^t, P^f) ≤ diam(Z × Y) √( KL(P^t ∥ P^f) / 2 ),

where KL(P ∥ Q) = ∫_{Z×Y} log(dP/dQ) dP is the Kullback-Leibler divergence, and diam(·) is the diameter of the space received as input, i.e. the largest distance obtainable in that space.
Proof. To prove this result we rely on an upper bound of the Wasserstein distance by the Kullback-Leibler divergence, obtained by combining two standard bounds. We state this intermediate result here together with a short proof.
Lemma (from [7]). Let P, Q ∈ M^1_+(Z × Y) be two probability distributions on a compact measurable space Z × Y. Then

W_1(P, Q) ≤ diam(Z × Y) · δ(P, Q)   and   δ(P, Q) ≤ √( KL(P ∥ Q) / 2 ),

where δ denotes the total variation distance.

Proof. Combine the bound of the Wasserstein distance by the total variation distance (Theorem 4 of [7]) with the bound of the total variation distance by the Kullback-Leibler divergence given by Pinsker's inequality.
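This chain of bounds can be verified on a small discrete example (our own; support on [0, 1], so diam = 1):

```python
import numpy as np
from scipy.stats import wasserstein_distance
from scipy.special import rel_entr

support = np.linspace(0, 1, 5)
p = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
q = np.array([0.3, 0.3, 0.2, 0.1, 0.1])

w1 = wasserstein_distance(support, support, p, q)
tv = 0.5 * np.abs(p - q).sum()       # total variation distance
kl = rel_entr(p, q).sum()            # KL(P || Q)

# W_1 <= diam * TV and TV <= sqrt(KL / 2), with diam = 1 here
bound = np.sqrt(kl / 2.0)
```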
With this result in hand, we show that under our conditions the Kullback-Leibler divergence between the joint distributions is in fact the expectation (over the marginal distribution P_Z) of the KL divergence between the conditional distributions. Let ρ^t, ρ^f be the densities of P^t and P^f respectively.
KL(P^t ∥ P^f) = ∫_Z ∫_Y ρ^t(z, y) log( ρ^t(z, y) / ρ^f(z, y) ) dy dz  (14)
= ∫_Z ρ^t_Z(z) Σ_{y∈Y} ρ^t(y|z) log( ρ^t_Z(z) ρ^t(y|z) / ( ρ^f_Z(z) ρ^f(y|z) ) ) dz  (15)
= E_{z∼P^t_Z}[ Σ_{y∈Y} ρ^t(y|z) log( ρ^t_Z(z) ρ^t(y|z) / ( ρ^f_Z(z) ρ^f(y|z) ) ) ]  (16)
= E_{z∼P^t_Z}[ Σ_{y∈Y} ρ^t(y|z) log( ρ^t(y|z) / ρ^f(y|z) ) ]  (17)
= E_{z∼P^t_Z}[ KL(P^t(·|z) ∥ P^f(·|z)) ]  (18)

The first line (Eq. (14)) is the definition of the Kullback-Leibler divergence in terms of densities. Eq. (15) applies Fubini's theorem, which allows us to decompose the double integral, and factors the joint density into the product of marginal and conditional; since Y is a discrete space, the integral over Y becomes a sum of probabilities, and ρ^t_Z(z) is pulled out of the sum. Eq. (16) replaces the integral by the expectation, by definition. In Eq. (17), since by definition ρ^t_Z(z) = ρ^f_Z(z), these terms cancel in the fraction. Eq. (18) is again an application of the definition of the KL divergence. Combining the equality obtained in Eq. (18) with the cited lemma completes the proof.
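The equality derived above can be verified on a small discrete example (ours), where both joints share the marginal ρ_Z:

```python
import numpy as np
from scipy.special import rel_entr

rho_z = np.array([0.5, 0.3, 0.2])                        # shared marginal
cond_t = np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]])  # rho^t(y|z)
cond_f = np.array([[0.7, 0.3], [0.5, 0.5], [0.4, 0.6]])  # rho^f(y|z)

joint_t = rho_z[:, None] * cond_t                        # rho^t(z, y)
joint_f = rho_z[:, None] * cond_f                        # rho^f(z, y)

kl_joint = rel_entr(joint_t, joint_f).sum()
kl_cond_expected = (rho_z * rel_entr(cond_t, cond_f).sum(axis=1)).sum()
# kl_joint == kl_cond_expected, as in Eq. (18)
```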
Theorem 1. Given that the Wasserstein distance's cost function c is the metric on the product space Z × Y, we get

W_1(P^t, Q^t) ≤ (1 − α_P) W_1(P^t, P^f) + W_1(P, Q) + (1 − α_Q) W_1(Q^t, Q^f).

Proof.
W_1(P^t, Q^t) ≤ W_1(P^t, P) + W_1(P, Q) + W_1(Q, Q^t)  (20)
= (1 − α_P) W_1(P^t, P^f) + W_1(P, Q) + (1 − α_Q) W_1(Q^t, Q^f)  (21)

Eq. (20) is an application of the triangle inequality. Using the symmetry of the metric (and therefore of the Wasserstein distance) and applying Lemma 1 twice, we obtain Eq. (21).
Theorem 2. Let Z, Y be two compact measurable metric spaces whose product space has dimension d. Let P^t, Q^t ∈ M^1_+(Z × Y) be two joint distributions associated to the two domains, and P̂^t, Q̂^t their empirical counterparts. Let the transport cost function c associated to the optimal transport problem be c(z_1, y_1; z_2, y_2) = ∥(z_1, y_1) − (z_2, y_2)∥_2, i.e. the Euclidean distance as the metric on Z × Y, and let L : Y × Y → R_+ be a symmetric κ-Lipschitz loss function. Then for any d′ > d and ψ′ < √2 there exists some constant N_0 depending on d′ such that for any δ > 0 and min(N_P, N_Q) ≥ N_0 max(δ^{−(d′+2)}, 1), with probability at least 1 − δ, the stated bound holds for all λ-Lipschitz f.

The Conv-Large architecture is composed of an alternation of convolutions, batch normalizations, leaky ReLUs, and max-pooling functions. We follow the strategy proposed in [12] of fusing batch-normalization layers into the parameters of the adjacent convolution layers, so that we arrive at a simplified but functionally equivalent neural network which consists only of max-pooling layers and convolution-leaky-ReLU layers. For the max-pooling layers (a_k = max{(a_j)_j}), we adopt the commonly used winner-take-all redistribution [1], i.e.
we redistribute the relevance R_k to the neuron in the pool that has the maximum activation. For the convolution-leaky-ReLU layers, we extend the LRP-γ rule defined in [12] to account for negative input and output activations. Writing such layers as a generic weighted sum passed through the leaky ReLU, where the subscript 0,j indicates that we sum over all input neurons plus a bias (b_k = w_{0k} with a_0 = 1), and where α ∈ [0, 1] is the leaky-ReLU parameter, we define a rule in which (·)^+ and (·)^− are shortcut notations for max(0, ·) and min(0, ·). Like the original LRP-γ rule in [12], this expanded LRP rule prioritizes input contributions that agree with the neuron output, and γ controls the degree of prioritization. Choosing the parameter γ = 0 makes the LRP procedure equivalent to the simple Gradient × Input method. Choosing γ > 0 adds robustness to the explanation, and such robustness is especially needed in the first layers. Moreover, for the standard ReLU case (α = 0 at each layer), all activations become positive, and the proposed LRP rule reduces to the original LRP-γ rule. In our analysis, we choose the parameter γ = 0.25 for layers 1-17 and γ = 0 for layers 18-35. Lastly, LRP response maps are obtained by rewriting the relevance scores (R_i)_i in the input layer as R_i = x_i c_i and returning the vector (c_i)_i instead of the usual relevance scores.
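For reference, the original LRP-γ rule of [12] for a single linear layer (without bias and with non-negative input activations) can be sketched as follows; the paper's extension to negative activations is not reproduced here. Note the conservation of total relevance from layer to layer:

```python
import numpy as np

def lrp_gamma(a, w, r_out, gamma=0.25):
    """Original LRP-gamma for one linear layer z_k = sum_j a_j w_jk.
    gamma favors positive (excitatory) weight contributions."""
    w_mod = w + gamma * np.clip(w, 0, None)
    z = a @ w_mod                              # modified pre-activations
    z = z + 1e-9 * np.sign(z)                  # stabilize the division
    s = r_out / z
    return a * (w_mod @ s)                     # relevance of the inputs

rng = np.random.default_rng(0)
a = rng.random(8)                              # non-negative activations
w = rng.normal(size=(8, 4))
r_out = rng.random(4)                          # relevance of the outputs
r_in = lrp_gamma(a, w, r_out)
```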

Supplementary Note D. Extension of Theory to More than Two Domains
We consider two or more domains {D_1, …, D_N}, N ≥ 2, in which labeled and unlabeled data are observed. On each domain D_i (i = 1, …, N), we observe labeled data sampled from P^t_i with a proportion α_i ∈ [0, 1], and unlabeled data from the marginal distribution P^t_{i,Z}; by using a function f_i : Z → Y, such as a neural network classifier, we obtain an estimate of the true distribution, P^{f_i}_i = (z, f_i(z))_{z∼P^t_{i,Z}}. The distribution from which we observe samples is therefore the mixture P_i = α_i P^t_i + (1 − α_i) P^{f_i}_i. Note that the classifiers f_i need not be identical, and the proportion α_i may differ across domains D_i. In addition, the overall distribution integrating all domains, P = Σ_{i=1}^N q_i P_i, is obtained with mixture weights q_i satisfying Σ_{i=1}^N q_i = 1. The overall true distribution on all domains is P^t = Σ_{i=1}^N (q_i α_i / Σ_k q_k α_k) P^t_i, and using the f_i we obtain the estimate P^f = Σ_{i=1}^N (q_i (1−α_i) / Σ_k q_k (1−α_k)) P^{f_i}_i, so that P = (Σ_{i=1}^N q_i α_i) P^t + (Σ_{i=1}^N q_i (1−α_i)) P^f.

Lemma 4. Assuming that for all i ∈ {1, …, N}, P^t_i and P^t admit densities, we obtain

W_1(P^t_i, P^t) ≤ Σ_{j=1}^N (q_j α_j / Σ_k q_k α_k) [ (1−α_i) W_1(P^t_i, P^{f_i}_i) + W_1(P_i, P_j) + (1−α_j) W_1(P^t_j, P^{f_j}_j) ].

Proof.
W_1(P^t_i, P^t) = W_1(P^t_i, Σ_{j=1}^N (q_j α_j / Σ_k q_k α_k) P^t_j)
≤ Σ_{j=1}^N (q_j α_j / Σ_k q_k α_k) W_1(P^t_i, P^t_j)
≤ Σ_{j=1}^N (q_j α_j / Σ_k q_k α_k) [ (1−α_i) W_1(P^t_i, P^{f_i}_i) + W_1(P_i, P_j) + (1−α_j) W_1(P^t_j, P^{f_j}_j) ].

Lemma 5. Assuming that for all i ∈ {1, …, N}, P^t_i and P^t admit densities, we obtain

W_1(P^t_i, P^t) ≤ (1−α_i) W_1(P^t_i, P^{f_i}_i) + W_1(P_i, P) + (1 − Σ_{j=1}^N q_j α_j) W_1(P^t, P^f).

Proof. Use the triangle inequality and Lemma 1.

From Lemma 4 and Lemma 5, we can minimize the Wasserstein distance W_1(P_i, P_j) or W_1(P_i, P) instead of the Wasserstein distance W_1(P^t_i, P^t). Computing W_1(P_i, P_j) in Lemma 4 requires N(N−1)/2 discriminators, whereas computing W_1(P_i, P) in Lemma 5 requires only N discriminators. The following theorem is therefore stated assuming that Lemma 5 is used.
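The count of required critics can be illustrated directly (pairwise alignment scales quadratically in the number of domains, alignment to the pooled distribution linearly):

```python
from math import comb

for n_domains in (2, 4, 10):
    pairwise = comb(n_domains, 2)   # one critic per domain pair (Lemma 4)
    pooled = n_domains              # one critic per domain (Lemma 5)
    print(n_domains, pairwise, pooled)
```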

Figure 2: Illustration of the problem of domain invariance in the case of classification. We would like to learn a function Φ that maps the data to a representation in which the domains cannot be differentiated, and from which a domain-invariant classifier f can be built. The invariant representation induced by this model can serve further purposes such as domain privacy or the extraction of domain-related insights. X, Z, Y correspond to the input, representation, and target (label) spaces respectively.

Π(P, Q) := {π ∈ M^1_+(A × B) : P_{A#}π = P and P_{B#}π = Q}, where P_{A#} and P_{B#} are the push-forwards of the projections P_A(a, b) = a and P_B(a, b) = b. This can be loosely interpreted as Π(P, Q) being the set of joint distributions that have marginals P and Q.
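In the discrete case a coupling π is just a matrix, and the push-forward constraints reduce to row and column sums (a quick sanity check of ours using the independent coupling, which always belongs to Π(P, Q)):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # marginal P on A
q = np.array([0.4, 0.6])        # marginal Q on B

pi = np.outer(p, q)             # independent coupling p (x) q

row_marginal = pi.sum(axis=1)   # P_A# pi, should equal p
col_marginal = pi.sum(axis=0)   # P_B# pi, should equal q
```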

Figure 4: Diagram of the proposed machine learning model, which induces a domain-invariant representation through a domain critic.

Figure 5: Left: UMAP visualization of the extracted representation before and after training. Right: Response (extracted using LRP) of the model to various input digits with different styles (colors).

Figure 1: Visualization of the representations learned by the neural networks of Table 2 in the main paper. Representations are embedded using UMAP, and the examples are color-coded by domain (SVHN in blue, MNIST in red).
Figure 3: Visual overview of our theoretical framework. It relates the Wasserstein distance between the joint distributions P^t, Q^t of the two domains (blue) to components of a practical ML objective (red) in two steps. Step 1: the true joint distributions (of which only a small sample is observable) are related via the triangle inequality to the inferred distributions P, Q, in which missing labels are predicted from the features. Step 2: the expanded terms are further upper-bounded by common terms of a ML objective (cross-entropy, mean absolute error (MAE), etc.).

Table 1: Effect of the domain critic on the classification accuracy and on the Wasserstein distance between the two domains in representation space. We use 1000 labels per domain. Best performance is shown in bold. For indicative purposes, the first two rows report the classification accuracy on the individual domains.

Table 2: Evaluation of our method in combination with classical semi-supervised learning regularizers (VAT + EntMin). We use 3000 labels per domain. Best results are in bold.

Table 3: Comparison of our method to a classic marginal domain critic and to the absence of a critic, on the Office-Caltech dataset. We use 200 labels per domain, except for DSLR which uses 100. Accuracy is reported as an average over all bi-domain tasks. Best results overall are in bold.

Table 4: Comparison of our method to a classic marginal domain critic and to the absence of a critic, on the PACS dataset. We use 500 labels per domain. Accuracy is reported as domain vs. rest. Best results overall are in bold.
max_i {R_{P^t_i}(f)} ≤ max_i { R_{P^t}(f) + W_1(P^t, P^t_i) } = R_{P^t}(f) + max_i {W_1(P^t, P^t_i)},
and symmetrically min_i {R_{P^t_i}(f)} ≥ R_{P^t}(f) − max_i {W_1(P^t, P^t_i)}, hence
| max_i {R_{P^t_i}(f)} − min_i {R_{P^t_i}(f)} | ≤ 2 max_i {W_1(P^t, P^t_i)}.  (46)