Barycentric-alignment and reconstruction loss minimization for domain generalization

This paper advances the theory and practice of Domain Generalization (DG) in machine learning. We consider the typical DG setting where the hypothesis is composed of a representation mapping followed by a labeling function. Within this setting, the majority of popular DG methods aim to jointly learn the representation and the labeling functions by minimizing a well-known upper bound for the classification risk in the unseen domain. In practice, however, methods based on this theoretical upper bound ignore a term that cannot be directly optimized due to its dual dependence on both the representation mapping and the unknown optimal labeling function in the unseen domain. To bridge this gap between theory and practice, we introduce a new upper bound that is free of terms having such dual dependence, resulting in a fully optimizable risk upper bound for the unseen domain. Our derivation leverages classical and recent transport inequalities that link optimal transport metrics with information-theoretic measures. Compared to previous bounds, our bound introduces two new terms: (i) the Wasserstein-2 barycenter term that aligns distributions between domains, and (ii) the reconstruction loss term that assesses the quality of representation in reconstructing the original data. Based on this new upper bound, we propose a novel DG algorithm named Wasserstein Barycenter Auto-Encoder (WBAE) that simultaneously minimizes the classification loss, the barycenter loss, and the reconstruction loss. Numerical results demonstrate that the proposed method outperforms current state-of-the-art DG algorithms on several datasets.


INTRODUCTION
Modern machine learning applications often encounter the problem that the training (seen) data and the test (unseen) data have different distributions, which can cause a deterioration in model performance. For example, a model trained on data from one hospital may not work well when the test data is from another hospital [22], a drowsiness driving estimator trained on one group of subjects may not perform well for other subjects [12], or a cognitive workload estimator based on fNIRS (functional near-infrared spectroscopy) measurements may not generalize well across sessions and subjects [32]. Methods that aim to mitigate this problem are broadly classified into two categories, namely Domain Adaptation (DA) [6] and Domain Generalization (DG) [8]. Both DA and DG aim to find a model that can generalize well in scenarios when the training data from the seen domain does not share the same distribution as the test data from the unseen domain. The key difference between DA and DG is that DA allows access to the (unlabeled) unseen domain data during the training process whereas DG does not. This makes DG a more challenging but more practical problem.
To address the problem of DG, and motivated by the seminal theoretical works of [5,6], the hypothesis is typically expressed as the composition of a representation function followed by a labeling function, e.g., see [2,15,29,47], and the representation and labeling functions are learned by minimizing an upper bound for the classification risk in the unseen domain derived in [5,6]. The upper bound consists of three terms: (1) the prediction risk on the mixture of seen domains, (2) the discrepancy or divergence between the data distributions of different domains in the representation space, and (3) a combined risk across all domains that implicitly depends on both the representation mapping and the unknown optimal labeling function of the unseen domain. However, most current approaches disregard this dual dependence and treat the third term (the combined risk) as a constant while developing their algorithms. In fact, the majority of prominent works in DG and DA, such as [21,29,45], are essentially variations of the following strategy: ignore the combined risk term and learn a domain-invariant representation mapping, or align the domains in the representation space, together with learning a common labeling function controlling the prediction loss across the seen domains. However, the combined risk term is, in fact, a function of the representation mapping and should somehow be accounted for within the optimization process. Additional details on the shortcomings of previous upper bounds are provided in Appendix A.
To address these limitations, we revisit the analysis in [5,6] and derive a new upper bound that is free of terms with the dual dependence mentioned above. Our new bound consists of four terms: (1) the prediction risk across seen domains in the input space; (2) the discrepancy/divergence between the induced distributions of seen and unseen domains in the representation space, which can be approximated via the Wasserstein-2 barycenter [38] of seen domains; (3) the reconstruction loss term that measures how well the input can be reconstructed from its representation; and (4) a combined risk term that is independent of the representation mapping and labeling function to be learned. Our new bound differs from previous ones in two aspects. Firstly, it introduces two new terms: (a) the Wasserstein-2 barycenter term for domain alignment and (b) the reconstruction loss term for assessing the quality of representation in reconstructing the original data. We note that the Wasserstein-2 barycenter term for controlling the domain discrepancy in our bound is built in the representation space, which is better aligned with the practical implementation than previous Wasserstein-based bounds that are built in the data space. Secondly, the combined risk in our bound is independent of the representation mapping and thus can be ignored during the optimization. Motivated by these theoretical results, we propose an Auto-Encoder-based model that interacts with the Wasserstein barycenter loss to achieve domain alignment.
The contributions of this work can be summarized as follows: 1. Contributions to Theory: We propose a new upper bound for the risk of the unseen domain using classical and recent transport inequalities that link optimal transport metrics with information-theoretic measures. All terms in our new upper bound are optimizable in practice, which overcomes the limitations of previous works and bridges the gap between previous theory and practice.
2. Contributions to Algorithm Development and Practice: We develop a novel algorithm for domain generalization based on our new upper bound. Our algorithm optimizes a new term that controls the domain discrepancy through the Wasserstein-2 barycenter. Unlike previous Wasserstein distance-based bounds that form the domain discrepancy term in the data space but optimize it in the representation space, our domain discrepancy term is constructed and optimized in the representation space, making our practical implementation better aligned with the theory.
3. Gains over state-of-the-art methods: Our algorithm consistently outperforms other theory-guided methods on the PACS, VLCS, Office-Home, and TerraIncognita datasets, with a noticeable improvement of 1.7–2.8 percentage points on average across all datasets.

Related Work
Our work falls within the DG framework wherein domain-invariant features are learned by decomposing the prediction function into a representation mapping followed by a labeling function. A recent example of this framework is [2], where the authors propose a three-part model consisting of a feature extractor, a classifier, and domain discriminators. The feature extractor learns task-sensitive but domain-invariant features by minimizing the cross-entropy loss with respect to the task label and maximizing the sum of domain discriminator losses. The domain discriminator loss is based on an estimate of the H-divergence between all seen domains [5] and has roots in the works [21,30] on Domain Adaptation. Following a similar idea, the authors of [29] align the representation distributions from different domains by minimizing their Maximum Mean Discrepancy. In [15], the authors adopt a gradient-based episodic training scheme for DG in which the extracted features are driven to simultaneously preserve global class information and local task-related clusters across seen domains by minimizing an alignment loss comprising soft class confusion matrices and a contrastive loss. The authors of [3] propose the Invariant Risk Minimization algorithm to learn features such that the optimal classifiers are matched across domains. In [33], DG is achieved by disentangling style variation across domains from the learned features. Among the large body of works on the DG problem, we regard [3,7,21,30], and [27] as recent exemplars of principled algorithms that are guided by theory and compare their performance with our algorithm's. Our proposed upper bound is based on Wasserstein barycenters. Related to this context are the works [37,39], and [47]. In [47], the pairwise Wasserstein-1 distance [35,38] is used as a measure of domain discrepancy.
Using the dual form of the Wasserstein-1 distance, the feature extractor in [47] minimizes a combination of a cross-entropy loss, a Wasserstein distance loss, and a contrastive loss to achieve DG. The works [37,39] provide upper bounds for the risk of the unseen domain based on the Wasserstein-1 distance. Although they were originally proposed for DA, they can be adapted to the DG set-up. While the bounds from [37,39] share some similarities with ours, their bounds are constructed in the input space and therefore do not explicitly motivate the use of representation functions. By contrast, our proposed upper bound measures the discrepancy of domains in the representation space, which naturally justifies the decomposition of the hypothesis in the practical implementation. A detailed analysis and comparison of the bounds in [37,39] and our proposed bound can be found in Appendix B.
In addition to the domain-invariant feature learning approach, which is the main focus of this paper, there are other noteworthy and emerging directions in domain generalization research. These include data manipulation techniques [9], meta-learning strategies [16], the use of pre-trained models [31], and seeking flat minima [10]. For more details, we refer the reader to the recent survey articles on DG [43,48].

THEORETICAL ANALYSIS AND PROPOSED METHOD
We consider a domain v as a triple (µ^(v), f^(v), g^(v)), consisting of a data distribution µ^(v) over the input space R^d, a representation function f^(v) : R^d → R^d from the input space to the representation space, and a stochastic labeling function g^(v) : R^d → Y from the representation space to the label space. We denote the unseen domain by (µ^(u), f^(u), g^(u)) and the S seen domains by (µ^(s), f^(s), g^(s)), with s = 1, . . . , S.
Let F = {f | f : R^d → R^d} be the set of representation functions, G = {g | g : R^d → Y} the set of stochastic labeling functions, H := G ∘ F the set of hypotheses, with each hypothesis h : R^d → Y obtained by composing a g ∈ G with an f ∈ F, i.e., h = g ∘ f, and D = {ψ | ψ : R^d → R^d} the set of reconstruction functions that map from the representation space back to the input space. In this paper, we limit our theoretical study to binary classification problems, specifically hypothesis functions h such that h : R^d → Y = [0, 1]. Note that a similar set-up is also used in [6], where the hypothesis h occurs non-deterministically and maps a data point to a label between zero and one.
The risk of using a hypothesis h in domain v is then defined by:

R^(v)(h) := E_{x∼µ^(v)}[ℓ(h(x), h^(v)(x))],

where E[·] denotes the expectation, h^(v) = g^(v) ∘ f^(v) is the optimal hypothesis of domain v, and ℓ(·, ·) is a loss function. We make the following assumptions:

A1: The loss function ℓ(·, ·) is non-negative, symmetric, bounded by a finite positive number L, satisfies the triangle inequality, and is Q-Lipschitz continuous, i.e., for any three scalars a, b, c and a positive constant Q, |ℓ(a, b) − ℓ(a, c)| ≤ Q|b − c|.

A2: The optimal hypothesis of the unseen domain, h^(u) = g^(u) ∘ f^(u), is K-Lipschitz continuous. Specifically, we assume that for any two vectors x, x′ ∈ R^d and a positive constant K, |h^(u)(x) − h^(u)(x′)| ≤ K‖x − x′‖₂, where ‖x − x′‖₂ denotes the Euclidean distance between x and x′.
The first four conditions in Assumption A1 can be easily satisfied by any metric or norm truncated by a finite positive number. Concretely, if d(a, b) is a metric, potentially unbounded like the Mean Squared Error (MSE), then loss(a, b) := min(L, d(a, b)), where L is a positive constant, will satisfy the first four conditions in A1. The Lipschitz conditions in A1 and A2 are also widely used in the theory and practice of DG [7,39,44].
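As a quick sanity check, the truncation construction above can be verified numerically. The following sketch uses an illustrative absolute-error metric with L = 1 (names are ours, not from the paper's code) and checks non-negativity, symmetry, boundedness, and the triangle inequality on a few points:

```python
# Sketch: truncating an (unbounded) metric at L yields a bounded loss that keeps
# non-negativity, symmetry, and the triangle inequality (first four conditions
# of A1). The metric and the grid of test points are illustrative.

def truncated_loss(a, b, L=1.0):
    """min(L, |a - b|): the absolute-error metric truncated at L."""
    return min(L, abs(a - b))

pts = [-2.0, -0.5, 0.0, 0.25, 1.75]
for a in pts:
    for b in pts:
        assert truncated_loss(a, b) >= 0                     # non-negative
        assert truncated_loss(a, b) == truncated_loss(b, a)  # symmetric
        assert truncated_loss(a, b) <= 1.0                   # bounded by L
        for c in pts:                                        # triangle inequality
            assert truncated_loss(a, c) <= truncated_loss(a, b) + truncated_loss(b, c)
```

The triangle inequality survives truncation because min(L, d(a, c)) ≤ min(L, d(a, b) + d(b, c)) ≤ min(L, d(a, b)) + min(L, d(b, c)).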
One may find our assumptions bear some similarities with the assumptions in [37] and [39], but there are some fundamental differences. Specifically, we assume that the loss function is non-negative, symmetric, bounded, Lipschitz, and satisfies the triangle inequality, whereas the loss function in [37] is required to be convex, symmetric, bounded, obey the triangle inequality, and satisfy a specific form. We only assume that the optimal hypothesis function on the unseen domain is Lipschitz, whereas [39] requires all hypotheses to be Lipschitz.

Bound for Unseen Domain Risk
Our analysis starts by considering a single seen domain. Lemma 1 below upper bounds the risk R^(u)(h) of a hypothesis h = g ∘ f in the unseen domain u by four terms: (1) the risk of the seen domain s, (2) the L1 distance between the distributions of the data representations from the seen and unseen domains, (3) the reconstruction loss that quantifies how well the representation can reconstruct its original data input, and (4) an intrinsic risk term that is free of h and is intrinsic to the domains and the loss function. We use the notation f#µ^(v) to denote the pushforward of the distribution µ^(v) under the representation function f, i.e., the distribution of f(x) with x ∼ µ^(v).

Lemma 1. Under assumptions A1 and A2, for any hypothesis h = g ∘ f ∈ H and any reconstruction function ψ ∈ D, the following bound holds:

R^(u)(h) ≤ R^(s)(h) + L · d_{L1}(f#µ^(s), f#µ^(u)) + QK · E_{x∼µ^(u)}[‖ψ(f(x)) − x‖₂] + λ^(s,u),

where d_{L1}(·, ·) denotes the L1 distance between distributions in the representation space and λ^(s,u) is a combined-risk term that is independent of h, f, g, and ψ, and intrinsic to the domains and the loss function.

Proof. Please see Appendix C.
In typical DG applications, training data from multiple seen domains are available and can be combined in various ways. Therefore, Lemma 2 below extends Lemma 1 to a convex combination of distributions of multiple seen domains.
The upper bound above relies on the L1 distances between the pushforwards of the seen and unseen distributions. However, accurately estimating L1 distances from samples is hard [5,24]. To overcome this practical limitation, Theorem 1 below upper bounds the L1 distance by the square root of the Wasserstein-2 distance under additional regularity assumptions on the pushforward distributions, yielding the bound in (4); in its statement, ∇ denotes the gradient and ‖·‖₂ denotes the Euclidean norm.

Proof. Please see Appendix E.
One may wonder what conditions would guarantee the regularity of the pushforward distributions. Proposition 2 and Proposition 3 in [36] show that any distribution ν for which E v∼ν v 2 is finite becomes regular when convolved with any regular distribution, including the Gaussian distribution. Since convolution of distributions corresponds to the addition of independent random vectors having those distributions, it is always possible to make the pushforwards regular by adding a small amount of independent spherical Gaussian noise in the representation space.
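This noise-injection step is straightforward to implement. The sketch below (the noise level sigma and the batch shapes are illustrative choices, not values from the paper) adds independent spherical Gaussian noise to a batch of representations, which corresponds to convolving the empirical pushforward with a Gaussian:

```python
import numpy as np

# Sketch: making a pushforward distribution "regular" by adding a small amount
# of independent spherical Gaussian noise in the representation space, i.e.,
# convolving the empirical pushforward with N(0, sigma^2 I).

def smooth_representation(z, sigma=0.1, rng=None):
    """Add N(0, sigma^2 I) noise to each representation vector in the batch z."""
    rng = np.random.default_rng(rng)
    return z + sigma * rng.standard_normal(z.shape)

z = np.zeros((256, 64))                       # toy batch of 64-dim representations
z_smooth = smooth_representation(z, sigma=0.1, rng=0)
```

In practice this can be applied to the encoder output during training, leaving the underlying representations unchanged at test time.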
The upper bound in Theorem 1 consists of four terms: the first term is the sum of the risks on the seen domains; the second term is the Wasserstein distance between the pushforwards of the seen and unseen domains in the representation space; the third term indicates how well the input can be reconstructed from its corresponding representation; and the fourth term is a combined risk that is independent of both the representation function and the labeling function and is intrinsic only to the domains and the loss function.
The form of the upper bound derived above shares some similarities with previous bounds in [6,37,39]. However, it differs from previous bounds in the following important aspects: • Firstly, even though Lemma 1 in [37] and Theorem 1 in [39] employ the Wasserstein distance to capture domain divergence, the corresponding term is constructed in the data space. By contrast, the corresponding term in our bound is constructed in the representation space, which not only provides a theoretical justification for decomposing the hypothesis into a representation mapping and a labeling function, but is also consistent with the algorithm implementation in practice. Moreover, the bounds in [37] and [39] are controlled by the Wasserstein-1 distance, while our upper bound is controlled by the square root of the Wasserstein-2 distance. There are regimes where one bound is tighter than the other, as discussed in Appendix B.
• Secondly, our third term measures how well the input can be reconstructed from its representation. This motivates the use of an encoder-decoder structure in the proposed algorithm in Section 4 to minimize the reconstruction loss. This is a novel component absent from [6,37,39].
• Finally, the last term in our upper bound is independent of both the representation function f and the labeling function g. This contrasts with the previous results in [6], where the last term in their upper bound (see Theorem 1 in [6]) depends on the representation function f . We refer the reader to Appendix A for a detailed comparison.
The bound proposed in Theorem 1 can also be used for the DA problem, where one can access the unseen/target domain data and estimate its distribution. Under the DG setting, however, the second and third terms in (4) are uncontrollable due to the unavailability of the unseen data, leading to an intractable upper bound. This intractability, which cannot be overcome without making additional specific assumptions on the unseen domain, is widely accepted in the literature as a fundamental limitation of all DG methods and analyses.
As a step toward developing a practical algorithm based on our new bound, we decompose both the second term and the third term in (4) into two separate terms, where one term depends only on the unseen distribution and the other depends only on the seen distributions. Corollary 1. Under the setting and notation of Theorem 1, for an arbitrary pushforward distribution f#µ, applying the triangle inequality to the Wasserstein term and to the reconstruction term in (4) yields the bound in (5), in which the domain-discrepancy term splits into a part involving only the seen pushforwards f#µ^(s) and f#µ, and a part involving f#µ and the unseen pushforward f#µ^(u); the reconstruction term splits analogously over the seen and unseen domains. Proof. Please see Appendix G.
Motivated by the bound in Corollary 1, we want to find a suitable representation function f together with a reconstruction function ψ that minimize the second term Σ_{s=1}^S λ^(s) W₂²(f#µ, f#µ^(s)) and the fourth term in (5). Minimizing the second term over the free distribution f#µ leads to finding the Wasserstein-2 barycenter of the distributions of the seen domains in the representation space. Here, we assume a uniform weight λ^(s) = 1/S for all s, since there is no additional information for selecting these weights. For this choice, the Wasserstein-2 barycenter of the pushforward distributions of the seen domains is defined by:

f#µ_barycenter := argmin_ν (1/S) Σ_{s=1}^S W₂²(ν, f#µ^(s)).      (6)

We refer the reader to [1,14] for the definition and properties (existence, uniqueness) of the Wasserstein barycenter.
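As a concrete illustration of the barycenter in (6) (a sanity check, not the algorithm used in the paper), the Wasserstein-2 barycenter of one-dimensional Gaussians with uniform weights has a closed form, since quantile functions average under W₂: the barycenter of N(m_s, σ_s²), s = 1, . . . , S, is N(mean of the m_s, (mean of the σ_s)²):

```python
# Sanity-check sketch: closed-form W2 barycenter of 1-D Gaussians with uniform
# weights. The quantile function of N(m, s^2) is m + s * Phi^{-1}(p); averaging
# quantile functions gives the barycenter N(mean(m_s), mean(s_s)^2).

def gaussian_w2_barycenter_1d(means, stds):
    """Closed-form W2 barycenter of 1-D Gaussians N(m_s, s_s^2), uniform weights."""
    m = sum(means) / len(means)
    s = sum(stds) / len(stds)
    return m, s  # barycenter is N(m, s**2)

m, s = gaussian_w2_barycenter_1d([0.0, 2.0, 4.0], [1.0, 1.0, 4.0])
# m == 2.0, s == 2.0
```

For general empirical measures, as in our setting, no closed form exists and the free-support algorithm of [14] is used instead.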
On the other hand, minimizing the fourth term Σ_{s=1}^S λ^(s) E_{x∼µ^(s)}[‖ψ(f(x)) − x‖₂] in (5) naturally leads to an auto-encoder mechanism. With a slight abuse of notation, we denote the encoder, namely the representation function, by f, and the decoder, namely the reconstruction function, by ψ. The L2 reconstruction loss is then optimized over all seen domains.

Proposed Method
As the last term in (5) of Corollary 1 is independent of both the representation function f and the labeling function g, and the third and fifth terms are intractable due to their dependence on the unseen domain, we focus on designing f, ψ, and g to minimize the first, second, and fourth terms in (5) of Corollary 1.
Following previous works [2,5,6], we optimize the first term by training f together with g using a standard cross-entropy (CE) loss, so that the empirical classification risk on the seen domains is minimized. The classification loss can be written as:

L_c = (1/S) Σ_{s=1}^S E_{x∼μ̂^(s)}[CE(h^(s)(x), g(f(x)))],      (7)

where CE(h^(s)(x), g(f(x))) denotes the cross-entropy loss between the output of the classifier and the ground-truth label of seen domain s. As discussed in Corollary 1, we propose to use the Wasserstein-2 barycenter of the representation distributions of the seen domains to optimize the second term in (5). Specifically, the barycenter loss is defined by:

L_bary = (1/S) Σ_{s=1}^S W₂²(f#μ̂^(s), f#µ_barycenter),      (8)

where f#µ_barycenter, as defined in (6), denotes the Wasserstein barycenter of the pushforward distributions of the seen domains. In contrast to the previous Wasserstein distance-based method [47], where a pairwise Wasserstein distance loss is employed, we motivate the use of the Wasserstein barycenter loss based on our Corollary 1 and demonstrate its ability to enforce domain invariance in the ablation study of Section 5.4. Notably, the barycenter loss (8) only requires computing S Wasserstein distances, whereas using pairwise Wasserstein distances would require S(S − 1)/2 computations. Furthermore, to handle the fourth term in (5), we utilize the auto-encoder structure. Specifically, a decoder ψ : R^d → R^d is adopted, leading to the following reconstruction loss:

L_r = (1/S) Σ_{s=1}^S E_{x∼μ̂^(s)}[‖ψ(f(x)) − x‖₂²].      (9)

From the analysis above, we aim to find a representation function f, a classifier g, and a decoder ψ that optimize the following objective function:

L = L_c + α L_bary + β L_r,      (10)

where the weights α, β > 0 are hyper-parameters. One can observe that the terms in our proposed upper bound are incorporated into the objective function in (10). Specifically, the first term in our objective function aims to determine a good classifier g together with a representation mapping f by minimizing the risk on the seen domains, which corresponds to the first term of the upper bound in (5).
The second term in (10) acts as a domain-alignment tool that minimizes the discrepancy between the seen domains, corresponding to the second term of the proposed bound in (5). Note that although L_bary itself requires solving an optimization problem, we leverage fast computation methods, also discussed in Section 4, to directly estimate this loss without invoking the Kantorovich–Rubinstein dual characterization of the Wasserstein distance [38]. This avoids solving a min-max problem that is often plagued by unstable numerical dynamics. Finally, the third term in the objective function minimizes the mean squared error between the input and its reconstruction over all seen domains, which directly minimizes the fourth term in (5).
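The way the three losses combine into the objective (10) can be sketched as follows; the per-batch loss values and hyper-parameter settings below are placeholders for illustration only, not values from our experiments:

```python
# Illustrative sketch of the overall objective in (10). The loss callables are
# assumed to be computed elsewhere (cross-entropy, barycenter, reconstruction);
# the numeric values and the alpha/beta settings below are placeholders.

def total_objective(l_cls, l_bary, l_rec, alpha=1e-3, beta=1e-3):
    """L = L_c + alpha * L_bary + beta * L_r, as in (10)."""
    return l_cls + alpha * l_bary + beta * l_rec

# e.g., with per-batch loss values already computed:
L = total_objective(l_cls=0.7, l_bary=5.0, l_rec=12.0, alpha=1e-3, beta=1e-3)
```

In a deep-learning framework, L would be a differentiable scalar, so a single backward pass updates the encoder, decoder, and classifier jointly.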

Figure 1: An overview of the proposed algorithm. The top, middle, and bottom branches refer to the reconstruction loss term, the Wasserstein barycenter loss term, and the classification risk (from the seen domains), respectively.
Based on the loss function designed above, we propose an algorithm named Wasserstein Barycenter Auto-Encoder (WBAE). The pseudo code of the WBAE algorithm can be found in Algorithm 1 while its block diagram is shown in Fig. 1.
As shown in the pseudo code, we use an encoder f and a decoder ψ, parameterized by θ_e and θ_d, for feature extraction and reconstruction, respectively. We denote by X^(s) a set of samples from domain s with empirical distribution μ̂^(s), and by x_i^(s) the i-th sample from domain s. The classifier g, parameterized by θ_c, is then applied to the extracted features for label prediction.
The proposed algorithm requires calculating the Wasserstein-2 barycenter and its supporting points.

[Algorithm 1 (WBAE): given input data from the S seen domains, each training iteration computes L_c, L_bary, and L_r on mini-batches, forms L ← L_c + αL_bary + βL_r, and updates θ_e, θ_d, and θ_c by gradient descent; the outputs are the encoder f_θe, the decoder ψ_θd, and the classifier g_θc.]

Here we use an off-the-shelf Python package [20] that implements the free-support Wasserstein barycenter algorithm described in [14]. This algorithm operates in the primal domain and avoids the dual form of Wasserstein distances, which would otherwise turn the problem into an adversarial (min-max) setting that we want to avoid due to its instability. The barycenter loss is approximated via an average Sinkhorn divergence [19] between the seen domains and the estimated barycenter. The Sinkhorn divergence is an unbiased proxy for the Wasserstein distance that leverages entropic regularization [13] for computational efficiency, thereby allowing automatic differentiation to be combined with GPU computation. We incorporate the implementation from [19] into our algorithm for fast gradient computation and denote it as Sinkhorn in Algorithm 1, where ε is the entropic regularization parameter.
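For illustration, a minimal CPU/numpy sketch of the debiased Sinkhorn divergence between uniform empirical measures follows; the GPU-enabled implementation of [19] is what we actually use, and the sample sizes, squared-Euclidean cost, ε, and iteration count below are illustrative assumptions:

```python
import numpy as np

# Minimal sketch of the (debiased) Sinkhorn divergence. Uniform weights and a
# squared-Euclidean cost are assumed; eps is the entropic regularization.

def sinkhorn_cost(x, y, eps=1.0, n_iter=200):
    """Entropic-regularized OT cost between two uniform empirical measures."""
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    a = np.full(len(x), 1.0 / len(x))
    b = np.full(len(y), 1.0 / len(y))
    K = np.exp(-C / eps)                                  # Gibbs kernel
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):                               # Sinkhorn iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]                       # transport plan
    return float((P * C).sum())

def sinkhorn_divergence(x, y, eps=1.0):
    """S_eps(x, y) = OT_eps(x, y) - (OT_eps(x, x) + OT_eps(y, y)) / 2."""
    return (sinkhorn_cost(x, y, eps)
            - 0.5 * sinkhorn_cost(x, x, eps)
            - 0.5 * sinkhorn_cost(y, y, eps))
```

The debiasing terms remove the entropic bias, so the divergence vanishes when the two samples coincide and grows with the discrepancy between them.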

EXPERIMENTS AND RESULTS
The proposed method was evaluated on four benchmark datasets for DG: PACS [28], VLCS [18], Office-Home [42], and TerraIncognita [4], under two different settings: the DomainBed setting [22] and the Stochastic Weight Averaging Densely (SWAD) setting [10]. In the DomainBed setting, we implemented our method with the widely used DomainBed package and compared it with various theory-guided DG algorithms. Additionally, to incorporate recent advances in DG-specific optimization, we conducted separate experiments using the SWAD [10] weight-sampling strategy with the same experimental setting described in [10]. Furthermore, to investigate the impact of the different components of the proposed loss function, we conducted an ablation analysis on the PACS, VLCS, and Office-Home datasets and report the results in Section 5.4.
Example images of the above datasets are shown in Fig. 2, Appendix H.

Methods for Comparison
In this paper, we compare the empirical performance of our method against the state-of-the-art DG methods reported in [22] under the DomainBed setting. Specifically, the competing methods include: • Empirical Risk Minimization (ERM) [41] which aims to minimize the cumulative training error across all seen domains. • Domain-Adversarial Neural Networks (DANN) [21] which is motivated by the theoretical results from [6]. In particular, to minimize the upper bound of the risk in the unseen domain, DANN adopts an adversarial network to enforce that features from different domains are indistinguishable.
• Class-conditional DANN (C-DANN) [30] is a variant of DANN that aims to match the conditional distributions of feature given the label across domains.
• Invariant Risk Minimization (IRM) [3] aims to learn features such that the optimal classifiers applied to these features are matched across domains.
• Risk Extrapolation (VREx) [27] builds on the assumption from [3] that an optimal linear classifier exists across all domains. While IRM specifically seeks the invariant classifier, VREx aims to identify the form of the distribution shift and proposes a variance penalty, leading to robustness against a wider variety of distribution shifts.
• Marginal Transfer Learning (MTL) [7,8] is proposed based on an upper bound for the generalization error under the setting of an Agnostic Generative Model. Specifically, MTL estimates the mean embedding per domain and uses it as a second argument for optimizing the classifier.
• CORrelation ALignment (CORAL) [40] is based on the idea of matching the mean and covariance of feature distributions from different domains.
• Style-Agnostic Networks (SagNet) [33] minimizes the style-induced domain gap by randomizing the style features across domains and training the model mainly on the disentangled content features.
We can categorize the algorithms provided in [22] into two groups: (1) heuristic algorithms, which lack theoretical analysis, and (2) theory-guided algorithms. As the proposed method in this paper falls into the second category, we primarily compare it with the theory-guided methods. Here, ERM acts as the baseline theory-guided model, and DANN, C-DANN, IRM, VREx, and MTL are five state-of-the-art theory-guided algorithms. Besides these six methods, for a complete comparison, we also include the three heuristic algorithms that achieve the best performance on the four evaluated datasets [22]. More specifically, SagNet [33] is the best-performing algorithm for the PACS and TerraIncognita datasets, and CORAL [40] is the best-performing algorithm for both the VLCS and Office-Home datasets. In the SWAD setting, following [10], where CORAL was considered the representative of the previous state-of-the-art methods, we compare our method with both ERM and CORAL, all employing the SWAD strategy. The results for the competing methods above are sourced from [22] and [10].

Experiment Settings
Model Structure: We used the same feature extractor and classifier as in [22] for all four datasets. Specifically, an ImageNet pre-trained ResNet-50 model with the final (softmax) layer removed is used as the feature extractor. The decoder is a stack of 6 ConvTranspose2d layers for all datasets; its detailed structure is described in Table 8, Appendix I. The classifier is a single linear layer with output dimension equal to the number of classes.
Hyper-parameters: In the DomainBed setting, we performed a random search of 20 trials over the joint distribution 10^Uniform[−3.5,−2] for α and 10^Uniform[−3.5,−1.5] for β (see (10)), with the other hyper-parameters (e.g., learning rate, batch size, dropout rate) set to the default values recommended in [22]. In the SWAD setting, following [10], we performed a grid search for α in {10^−3.5, 10^−3, 10^−2.5, 10^−2} and β in {10^−3.5, 10^−3, 10^−2, 10^−1.5}. We chose the value of ε for the Sinkhorn loss (line 7, Algorithm 1) as 20, which is the smallest value that produced stable training. A complete description of hyper-parameter tuning in the SWAD setting and a full list of hyper-parameters can be found in Table 9, Appendix I.

Model Selection: We adopted the commonly used training-domain validation strategy of [22,27] for hyper-parameter tuning and model selection. Specifically, we split the data from each domain into training and validation sets in the proportions 80% and 20%, respectively. During training, we aggregated the training/validation samples from all seen domains to form the overall training/validation sets and selected the model with the highest validation accuracy for testing. All models were trained on a single NVIDIA Tesla V100 16GB GPU. Experiments on each dataset were repeated three times with different random seeds, and the average accuracy together with its standard error is reported.
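The random search described above can be sketched as follows; the seed and helper names are illustrative, not from our experiment scripts:

```python
import random

# Sketch of the DomainBed-setting random search: each of 20 trials draws
# alpha = 10**Uniform[-3.5, -2] and beta = 10**Uniform[-3.5, -1.5].

def sample_trial(rng):
    alpha = 10 ** rng.uniform(-3.5, -2.0)
    beta = 10 ** rng.uniform(-3.5, -1.5)
    return alpha, beta

rng = random.Random(0)                             # arbitrary seed
trials = [sample_trial(rng) for _ in range(20)]    # 20 random-search trials
```

Sampling the exponent uniformly (rather than the value itself) spreads the trials evenly across orders of magnitude, which is the usual choice for multiplicative loss weights.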

Results and Ablation Study
As shown in Table 1, 2, 3, and 4, the proposed method (WBAE) performs comparably or better than the state-of-the-art methods. In particular, WBAE achieves the highest accuracy in three out of the four datasets compared to all methods, with a moderate improvement over all theory-guided methods on all four datasets. Additionally, the proposed method performs equally well as, or slightly better than, the best-performing heuristic DG methods.
Table 1 shows that WBAE outperforms the other theory-guided methods by 0.5 and 1.2 percentage points in the Cartoons (C) and Sketches (S) domains, respectively, and by at least 1 percentage point on average on the PACS dataset. Similarly, Table 2 shows that WBAE achieves a performance gain of at least 0.2 percentage points over all theory-guided comparison methods on the VLCS dataset. The effectiveness of the proposed method is further highlighted in Tables 3 and 4, which present results on the larger and more challenging Office-Home and TerraIncognita datasets. Specifically, compared to all theory-guided methods on Office-Home, WBAE boosts the accuracy by at least 2.3 percentage points on average and by at least 2.2, 3.4, 0.3, and 1.9 percentage points on the individual tasks. On the TerraIncognita dataset, the proposed algorithm again exhibits superiority, outperforming all theory-guided methods by at least 1.2 percentage points, as shown in Table 4. A summary of the evaluation results in the DomainBed setting is reported in Table 5. The proposed method outperforms all theory-guided methods with a noticeable improvement of 1.7–2.8 percentage points on average across all tested datasets. Table 6 presents the results obtained by applying SWAD, a DG-specific optimizer and weight-averaging technique, in combination with our proposed algorithm WBAE. It can be observed that this combination outperforms all comparison methods on all four evaluated datasets, with an average improvement of 0.4 percentage points over the previous best-performing method, CORAL, as reported in [10].
Based on the results above, it is evident that the proposed algorithm has a greater impact on the PACS, Office-Home, and TerraIncognita datasets than on the VLCS dataset. One possible explanation, as also suggested in [46], is that three out of four domains in the VLCS dataset contain a greater proportion of scenery content than object information. Unlike the scenery backgrounds in the TerraIncognita dataset, the scenery content in the VLCS dataset is usually more intricate and sometimes includes multiple objects, making it challenging for the feature extractor to obtain useful object information for the downstream classification.
To study the impact of the different components of the loss function in (10), we conducted an ablation study for WBAE on all datasets except TerraIncognita, due to our limited computational resources. In particular, we consider the following variants of our method: (1) no L_bary: the WBAE loss function without the Wasserstein barycenter term L_bary; (2) no L_r: the WBAE loss function without the reconstruction term L_r. We re-ran all the experiments three times using the same model architectures, hyper-parameter tuning, and validation method. Table 7 reports the performance of the model with different loss terms removed from the original WBAE loss function. It can be observed that removing L_r from the WBAE loss function leads to a decrease in accuracy of 0.5%, 0.4%, and 1.1% points for the PACS, VLCS, and Office-Home datasets, respectively. The performance deterioration is more significant when removing L_bary, leading to a drop of 1.2%, 0.9%, and 3.1% points for the PACS, VLCS, and Office-Home datasets, respectively. Our ablation study demonstrates the importance of the Wasserstein barycenter loss and also highlights the auxiliary role of the reconstruction loss: removing the barycenter loss L_bary results in diminished performance, and a similar, though less pronounced, decrease occurs when the reconstruction loss L_r is removed.
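For concreteness, the objective in (10) and the ablation variants above can be sketched as a weighted sum of the three loss terms. This is a minimal illustration only: the weight names (w_bary, w_rec) and the numeric loss values are placeholders, not the paper's tuned hyper-parameters or actual implementation.

```python
# Minimal sketch of a WBAE-style training objective:
# total loss = classification + barycenter alignment + reconstruction.
# Weight names (w_bary, w_rec) and per-term values are illustrative
# placeholders, not the paper's tuned hyper-parameters.

def wbae_loss(l_cls, l_bary, l_rec, w_bary=1.0, w_rec=1.0):
    """Weighted sum of the three WBAE loss terms."""
    return l_cls + w_bary * l_bary + w_rec * l_rec

# Ablation variants as in Table 7: drop one term by zeroing its weight.
full    = wbae_loss(0.40, 0.25, 0.10)               # full objective
no_bary = wbae_loss(0.40, 0.25, 0.10, w_bary=0.0)   # "no L_bary" variant
no_rec  = wbae_loss(0.40, 0.25, 0.10, w_rec=0.0)    # "no L_r" variant

print(full, no_bary, no_rec)
```

Zeroing a weight reproduces exactly the "no L_bary" and "no L_r" variants studied in the ablation.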

CONCLUSION AND FUTURE WORK
In this paper, we revisited the theory and methods for DG and provided a new upper bound for the risk in the unseen domain. The proposed upper bound contains four terms: (1) the empirical risk of the seen domains in the input space; (2) the discrepancy between the induced representation distribution of seen and unseen domains, which can be further represented by the Wasserstein-2 barycenter of representation in the seen domains; (3) the reconstruction loss term that measures how well the data can be reconstructed from its representation; and (4) a combined risk term. The proposed upper bound provides valuable insights in three aspects. Firstly, we observed that the combined risk term in previous bounds relied on the representation function, which made optimization challenging. By contrast, our combined risk term in the proposed upper bound is a constant with respect to both the representation and the labeling function, making optimization straightforward, thus bridging the previous gap between theory and practice. Secondly, compared with other upper bounds using Wasserstein distance to measure the domain discrepancy, the proposed bound constructs the discrepancy term in the representation space rather than in the data space. This approach offers a theoretical justification for the decomposition of the hypothesis when bounding the risk and for practical implementation when designing the algorithm. Lastly, motivated by the proposed upper bound, our practical algorithm WBAE demonstrates competitive performance over state-of-the-art DG algorithms, validating the usefulness of the proposed theoretical bound for addressing the DG problem. In addition, our bound encourages minimizing the reconstruction loss term, which theoretically supports the use of (nearly) invertible representation mappings in recent works [23,34]. 
In terms of algorithmic and numerical implementation, it should be noted that while our theory-guided method is effective in addressing the DG problem, it may become computationally expensive if one wants to use a larger batch size for a more accurate estimation of the Wasserstein-2 barycenter. To alleviate this constraint, our future work will focus on leveraging the recently proposed large-scale barycenter and mapping estimators [17,26] to enable the calculation of barycenters with a larger number of samples.
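For intuition on what the barycenter computation involves, the 1-D case admits a closed form: the Wasserstein-2 barycenter of equal-size empirical measures is obtained by averaging quantile functions, i.e., matched order statistics. A minimal pure-Python sketch with toy data follows; the actual algorithm operates on learned high-dimensional representations, which is where the computational cost discussed above arises.

```python
# Minimal sketch of a Wasserstein-2 barycenter, restricted to the 1-D case
# where it has a closed form: sorting each sample set and averaging order
# statistics is equivalent to averaging quantile functions. The function
# name and toy data are illustrative; the paper's algorithm works on
# learned representations in higher dimensions.

def w2_barycenter_1d(sample_sets, weights):
    """W2 barycenter of equal-size 1-D empirical measures.

    sample_sets: list of S lists, each holding n samples from one seen domain.
    weights: list of S convex weights (non-negative, summing to 1).
    Returns the n support points of the barycenter.
    """
    sorted_sets = [sorted(s) for s in sample_sets]
    n = len(sorted_sets[0])
    return [sum(w * s[i] for w, s in zip(weights, sorted_sets))
            for i in range(n)]

# Two toy "domain" sample sets; with equal weights the barycenter sits
# halfway between matched order statistics.
bary = w2_barycenter_1d([[0.0, 1.0, 2.0], [4.0, 3.0, 5.0]], [0.5, 0.5])
print(bary)  # [1.5, 2.5, 3.5]
```

In higher dimensions no such closed form exists, which is why the free-support barycenter estimators cited above become attractive for large batch sizes.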

A Limitations of Previous Upper Bounds
First, let us recall that a domain v is defined as a triple (µ^{(v)}, f^{(v)}, g^{(v)}) consisting of a distribution µ^{(v)} on the input x ∈ R^d, a representation function f^{(v)} : R^d → R^{d'} that maps an input x from the input space to its representation z in the representation space, and a stochastic labeling function g^{(v)} : R^{d'} → Y that maps a representation z to a label. We denote the unseen domain by (µ^{(u)}, f^{(u)}, g^{(u)}) and the seen domain by (µ^{(s)}, f^{(s)}, g^{(s)}). Let F and G denote the classes of representation functions and labeling functions, respectively. Here, the label space Y is considered binary. A hypothesis h : R^d → Y is obtained by composing each g ∈ G with each f ∈ F, i.e., h = g ∘ f. Next, we rewrite Theorem 1 in [6] using our notation.
Theorem 1 in [6]. Let f be a fixed representation function from the input space to the representation space and G be a hypothesis space of VC-dimension k. If random labeled samples of size m are generated by applying f to i.i.d. samples from the seen domain, then with probability at least 1 − δ, for every g ∈ G:

R^{(u)}(g) ≤ R̂^{(s)}(g) + √((4/m)(k log(2em/k) + log(4/δ))) + d_H(f_#µ^{(s)}, f_#µ^{(u)}) + λ,

where e is the base of the natural logarithm, d_H is the H-divergence (please see Definition 1 in [5], Definition 2.1 in [45], or Definition 1 in [24]), R^{(u)}(g) = E_{z∼f_#µ^{(u)}}|g(z) − g^{(u)}(z)| denotes the risk in the unseen domain, R^{(s)}(g) = E_{z∼f_#µ^{(s)}}|g(z) − g^{(s)}(z)| and R̂^{(s)}(g) denote the risk in the seen domain and its empirical estimate, respectively, and:

λ = min_{g∈G} [R^{(u)}(g) + R^{(s)}(g)]

is the combined risk.
Although the bound in Theorem 1 of [6] was originally constructed for the domain adaptation problem, it has significantly influenced past and recent works in domain generalization, as discussed in Section 1. To highlight the differences between our work and previous theoretical bounds (Theorem 1 of [6] and Theorem 4.1 of [45]), we provide a detailed comparison below: • Firstly, [6] defines the risk induced by a labeling function g from the representation space to the label space via the disagreement between g and the optimal labeling function g^{(u)}:

R^{(u)}(g) = E_{z∼f_#µ^{(u)}}|g(z) − g^{(u)}(z)|.    (14)

On the other hand, we define the risk induced by a hypothesis h from the input space to the label space via the disagreement between h and the optimal hypothesis h^{(u)} under a general loss function ℓ(·, ·):

R^{(u)}(h) = E_{x∼µ^{(u)}} ℓ(h(x), h^{(u)}(x)).    (15)

Since the empirical risk measures the probability of misclassification of a hypothesis that maps from the input space to the label space, minimizing R^{(u)}(g) does not guarantee minimizing the empirical risk. There are, however, cases where this implication holds; for example, if the representation function f is invertible, i.e., there is a one-to-one mapping between x and z, and the loss function has the form ℓ(a, b) = |a − b|, it is possible to verify that R^{(u)}(g) = R^{(u)}(h).
In general, the representation mapping might not be invertible. For example, consider a representation function f that maps f(x_1) = f(x_2) = z for x_1 ≠ x_2, with corresponding labels y_1 = 0 and y_2 = 1. In this case, the risk defined in (14) will incur a larger error than the risk defined in (15), since g(z) cannot be mapped to both "0" and "1". Hence, the risk defined in (15) describes the empirical risk more precisely. In addition, the risk defined in (14) is a special case of (15) when the representation mapping f is invertible and the loss function satisfies ℓ(a, b) = |a − b|.
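The collapse argument above can be made concrete with a toy computation. The sketch below (all names illustrative, not the paper's notation) builds a representation that merges two differently-labeled inputs and confirms that any labeling function acting on the shared representation must misclassify at least one of them.

```python
# Toy illustration of the non-invertibility argument: a representation f
# that collapses two differently-labeled inputs forces every labeling
# function g acting on z to err on at least one input, so h = g(f(x))
# cannot be perfect. All names here are illustrative.

def f(x):
    return 0  # collapses every input to the same representation z = 0

labels = {1: 0, 2: 1}  # x1 = 1 has label 0, x2 = 2 has label 1, x1 != x2

def errors(g):
    """Number of inputs misclassified by the composed hypothesis g(f(x))."""
    return sum(g(f(x)) != y for x, y in labels.items())

# Whichever label g assigns to the shared representation, one input is wrong.
print(min(errors(lambda z: 0), errors(lambda z: 1)))  # 1
```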
• Secondly, using the setting in [6], for a given hypothesis space, the ideal joint hypothesis g* is defined as the hypothesis that globally minimizes the combined error from the seen and unseen domains [5,6]:

g* = argmin_{g∈G} [R^{(s)}(g) + R^{(u)}(g)].

In other words, this hypothesis should work well in both domains. The error induced by using this ideal joint hypothesis is called the combined risk:

λ = R^{(s)}(g*) + R^{(u)}(g*).

Note that the labeling function g is a mapping from the representation space to the label space; therefore, the ideal labeling function g* depends implicitly on the representation function f, and hence λ depends on f. Simply ignoring this fact and treating λ as a constant may loosen the upper bound. By contrast, our goal is to construct an upper bound whose combined risk term σ^{(u,s)} is independent of both the representation function and the labeling function, as can be seen from Lemma 1 and Theorem 1.
• Finally, it is worth comparing our upper bound with the bound in Theorem 4.1 of [45], which also has a combined risk term free of the choice of the hypothesis class. However, note that the result in Theorem 4.1 of [45] does not consider any representation function f; i.e., their labeling function maps directly from the input space to the label space, while our hypothesis is composed of a representation function from the input space to the representation space followed by a labeling function from the representation space to the label space. Since it is possible to pick a representation function f that maps any input to itself, i.e., f(x) = x, which leads to h = g ∘ f = g, the bound in [45] can be viewed as a special case of our proposed upper bound in Lemma 1.

B Comparison with Upper Bounds in [37] and [39]
The form of the proposed upper bound derived in Theorem 1 shares some similarities with Lemma 1 in [37] and Theorem 1 in [39]; for example, all of them introduce a Wasserstein distance between domain distributions. However, they differ in the following key aspects.
1. The term containing Wasserstein distance in our upper bound is constructed in the representation space, not in the data (ambient) space, which provides a theoretical justification when decomposing the hypothesis into a representation mapping and a labeling function. This is also consistent with the algorithmic implementation in practice.
2. The bounds in Lemma 1 of [37] and Theorem 1 of [39] are controlled by the Wasserstein-1 distance, while our upper bound is controlled by the square root of the Wasserstein-2 distance. There are regimes where one bound is tighter than the other. It is well known that

W_1(µ, ν) ≤ W_2(µ, ν) ≤ √(Diam(f(X)) · W_1(µ, ν)),

where Diam(f(X)) denotes the largest distance between two points in the representation space generated from the input X via the mapping f. Consequently, a sufficient condition for W_2^{1/2}(µ, ν) ≤ W_1(µ, ν) is Diam(f(X)) ≤ W_1^3(µ, ν). In fact, for a given Diam(f(X)), the larger the value of W_1(µ, ν), the higher the chance that this sufficient condition will hold.
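The ordering W_1 ≤ W_2 above can be checked numerically. A minimal sketch for 1-D empirical measures with equal sample counts, where the optimal coupling simply matches order statistics (the sample values are arbitrary illustrations):

```python
# Numeric sanity check of W1 <= W2 for 1-D empirical measures with equal
# sample counts, where the optimal coupling matches order statistics.
# Sample values are arbitrary illustrations.

def w_p(xs, ys, p):
    """p-Wasserstein distance between equal-size 1-D empirical measures."""
    xs, ys = sorted(xs), sorted(ys)
    n = len(xs)
    return (sum(abs(a - b) ** p for a, b in zip(xs, ys)) / n) ** (1.0 / p)

mu = [0.0, 1.0, 5.0]
nu = [2.0, 2.0, 2.0]
w1, w2 = w_p(mu, nu, 1), w_p(mu, nu, 2)
print(w1 <= w2)  # True: W1 <= W2 by Jensen's inequality
```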

C Proof of Lemma 1
Note that in this paper, we assume that any hypothesis function h(·) outputs a value in [0, 1], i.e., h : R^d → [0, 1], and ℓ(·, ·) is a bounded distance metric. In addition, we assume that h^{(u)}(·) is K-Lipschitz continuous and ℓ(·, ·) is Q-Lipschitz continuous. In particular, we assume that for any two vectors x, x' ∈ R^d and any three scalars a, b, and c, the following inequalities hold:

|h^{(u)}(x) − h^{(u)}(x')| ≤ K ‖x − x'‖_2,
|ℓ(a, b) − ℓ(a, c)| ≤ Q |b − c|,

where ‖x − x'‖_2 and |b − c| denote the Euclidean distances between x and x', and between b and c, respectively.
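The two Lipschitz assumptions can be illustrated numerically. A minimal sketch with illustrative stand-ins (h_u(x) = 0.5x for the K-Lipschitz labeling function and the absolute difference for the Q-Lipschitz loss; neither is the paper's learned function):

```python
# Numeric illustration of the Lipschitz assumptions, with illustrative
# stand-ins: h_u(x) = 0.5*x is K-Lipschitz with K = 0.5, and
# loss(a, b) = |a - b| is Q-Lipschitz in its second argument with Q = 1
# (by the reverse triangle inequality).

def h_u(x):
    return 0.5 * x

def loss(a, b):
    return abs(a - b)

K, Q = 0.5, 1.0
pairs = [(-2.0, 3.0), (0.0, 0.25), (1.5, -4.0)]

ok_h = all(abs(h_u(x) - h_u(xp)) <= K * abs(x - xp) for x, xp in pairs)
ok_l = all(abs(loss(a, b) - loss(a, c)) <= Q * abs(b - c)
           for a in (0.0, 0.7, 1.0) for b, c in pairs)
print(ok_h and ok_l)  # True
```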
Lemma 1. Assume that h^{(u)}(·) is K-Lipschitz continuous and ℓ(·, ·) is Q-Lipschitz continuous. Then, for any hypothesis h ∈ H and any function (decoder) ψ : R^{d'} → R^d, the following bound holds:

R^{(u)}(h) ≤ R^{(s)}(h) + √(LQK) W_2^{1/2}(f_#µ^{(u)}, f_#µ^{(s)}) + QK (E_{x∼µ^{(u)}}‖x − ψ(f(x))‖_2 + E_{x∼µ^{(s)}}‖x − ψ(f(x))‖_2) + σ^{(u,s)},

where W_2(·, ·) denotes the Wasserstein-2 distance, L is the upper bound of ℓ(·, ·), and:

σ^{(u,s)} = min{E_{x∼µ^{(u)}} ℓ(h^{(u)}(x), h^{(s)}(x)), E_{x∼µ^{(s)}} ℓ(h^{(u)}(x), h^{(s)}(x))}.

Proof. First, we note that our approach is motivated by the proof of Theorem 1 in [5]. Next, to better demonstrate the relationship between the hypothesis, the input distribution, and the true representation and labeling functions, we use the inner-product notation ⟨·, ·⟩ to denote expectations; specifically, ⟨µ, φ⟩ = E_{x∼µ}[φ(x)] for a distribution µ and a function φ. From the definition of risk,

R^{(u)}(h) = ⟨µ^{(u)}, ℓ(h, h^{(u)})⟩ ≤ ⟨µ^{(u)}, ℓ(h, h^{(s)})⟩ + ⟨µ^{(u)}, ℓ(h^{(s)}, h^{(u)})⟩,    (19)

where the inequality in (19) follows from the triangle inequality ℓ(h, h^{(u)}) ≤ ℓ(h, h^{(s)}) + ℓ(h^{(s)}, h^{(u)}) and the symmetry ℓ(h^{(s)}, h^{(u)}) = ℓ(h^{(u)}, h^{(s)}).
In an analogous fashion, splitting the expectation over µ^{(s)} instead, it is possible to show that:

R^{(u)}(h) ≤ R^{(s)}(h) + ⟨µ^{(s)}, ℓ(h^{(s)}, h^{(u)})⟩ + ⟨µ^{(u)} − µ^{(s)}, ℓ(h, h^{(u)})⟩.    (20)

Next, we bound the third term on the right-hand side of (20). Recall that in this paper any hypothesis function h(·) is assumed to output a scalar value in [0, 1], i.e., h : R^d → [0, 1], and ℓ(·, ·) is a distance metric. With these assumptions, we get (21) due to min{ℓ(a, c), ℓ(a, d)} ≤ ℓ(a, b) ≤ max{ℓ(a, c), ℓ(a, d)} for all b ∈ [c, d], a, b, c, d ∈ R, and the fact that, for the K-Lipschitz function h^{(u)}(·),

|h^{(u)}(x) − h^{(u)}(x')| ≤ K ‖x − x'‖_2.

Next, (22) is due to the Q-Lipschitz continuity of ℓ(·, ·):

|ℓ(a, b) − ℓ(a, c)| ≤ Q |b − c|.

Finally, we get (23) due to h = g ∘ f and f(x) = z, and (24) due to ℓ(·, ·) being bounded by L.

D Proof of Lemma II.2
Apply Lemma 1 S times, once for each of the S seen domains; then, for any hypothesis h ∈ H and function (decoder) ψ : R^{d'} → R^d, a bound of the form (29) holds for each seen domain s. Next, multiplying (29) by its corresponding convex weight λ^{(s)}, for s = 1, 2, . . . , S, and summing up, we obtain (30). Note that ∑_{s=1}^{S} λ^{(s)} = 1; thus, the left-hand side of (30) is R^{(u)}(h), and re-arranging the terms on the right-hand side of (30) completes the proof.

G Proof of Corollary 1
We begin with the second term in the upper bound of Theorem 1. Indeed, for any arbitrary pushforward distribution f_#µ, a chain of inequalities (42)-(45) holds, where (42) is due to the triangle inequality, (44) is due to ∑_{s=1}^{S} λ^{(s)} = 1, and (45) is due to the fact that for any a, b ≥ 0 and 0 < p ≤ 1, (a + b)^p ≤ a^p + b^p.
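The subadditivity fact used in step (45) is easy to verify numerically; a quick sketch over an arbitrary grid of test values:

```python
# Quick numeric check of the subadditivity fact used in step (45):
# for a, b >= 0 and 0 < p <= 1, (a + b)**p <= a**p + b**p.
# The grid of test values below is arbitrary.

def subadditive_holds(a, b, p):
    return (a + b) ** p <= a ** p + b ** p + 1e-12  # tolerance for float error

vals = [0.0, 0.3, 1.0, 2.5, 10.0]
ps = [0.1, 0.5, 1.0]
print(all(subadditive_holds(a, b, p) for a in vals for b in vals for p in ps))
# True
```

Note that the inequality genuinely requires p ≤ 1; for p > 1 it reverses (e.g., (1 + 1)^2 = 4 > 1^2 + 1^2).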

H EXAMPLE IMAGES OF FOUR TESTED DATASETS
Example images of each dataset are shown in Fig. 2.

I ARCHITECTURE AND HYPER-PARAMETERS
• The model structure of the decoder used in all four datasets can be found in Table 8.

J Code Availability
The code used to generate the results and tables is available in the GitHub repository: https://github.com/boyanglyu/DG_via_WB.