Exploiting data diversity in multi-domain federated learning

Federated learning (FL) is an evolving machine learning technique that allows collaborative model training without sharing the original data among participants. In real-world scenarios, data residing at multiple clients are often heterogeneous in terms of resolution, magnification, scanner, or imaging protocol, which makes convergence of the global FL model challenging in collaborative training. Most existing FL methods consider data heterogeneity within one domain by assuming the same data variation at each client site. In this paper, we consider data heterogeneity in FL across different domains of heterogeneous data by raising the problems of domain-shift, class-imbalance, and missing data. We propose a method, multi-domain FL (MDFL), as a solution to heterogeneous training data from multiple domains by training a robust vision transformer model. We use two loss functions, one for correctly predicting class labels and the other for encouraging similarity and dissimilarity over latent features, to optimize the global FL model. We perform various experiments using different convolution-based networks and non-convolutional Transformer architectures on multi-domain datasets. We evaluate the proposed approach on benchmark datasets and compare it with existing FL methods. Our results show that the proposed approach yields a more robust global FL model than the existing methods.


Introduction
Federated learning (FL) is a distributed and collaborative machine learning technique in which multiple clients train a global model while keeping the data at local locations [1]. In centralized training, a model is trained and updated on cumulative train and test data at a single location (i.e. either client or server). However, in FL, the training dataset is decentralized and located on multiple participating clients, where each client only shares its model parameters (i.e. gradients) with the central server rather than sharing the original raw data for training. Moreover, each client trains its local model using its local training data for a given number of local communication rounds, and sends model gradients to the server during the global communication round. After the server update, the revised global model is disseminated to all participants for the subsequent global communication round, and this process iterates for a given number of global rounds. In the recent past, FL was considered a privacy-preserving machine learning approach to train a global model without sharing clients' private data with the server [1]. However, many recent studies have revealed vulnerabilities in FL known as gradient leakage, caused by adversarial networks [2,3]. To address the problem of gradient leakage, different methods such as gradient clipping [4], representation learning [3] and gradient-free optimization [2] have been introduced in the literature. Thus, FL with additional security layers has evolved by enabling decentralized training of multiple participants without sharing confidential data. This characteristic makes FL useful in different areas such as communication networks [5], health care [6][7][8][9][10], organizations [11,12], and smart cities [13], where privacy and confidentiality of sensitive data are crucial for the data owners.
Although FL provides rich opportunities in many fields, there are many research challenges in the implementation of FL for real-world problems. One of the most common problems in FL is data heterogeneity, or different data distributions among multiple domains. Moreover, the imbalanced class problem (i.e. skewed label distribution) in a multi-domain environment is common for real-world data, in which labels are distributed non-uniformly across the classes in a dataset. In real-world data, some classes contain only a few samples, while many others have a large number of samples [14][15][16]. Many methods have been proposed to solve the problem of non-uniform label distribution in collaborative training [15][16][17][18][19][20]. However, most of the existing methods face the challenge of non-independent and identically distributed (non-IID) data, causing uncertainty in fast and complete model convergence [21][22][23]. Such existing methods have achieved good results by solving the problem of data heterogeneity in FL, but most of them focus on a single-domain scenario in which the data splits across participants are taken from the same domain for all participating clients. These existing methods train a model on imbalanced data from different splits of the same domain, and try to generalize on a balanced test dataset.
Instead of focusing on the improvement of FL optimization as most of the existing methods do [24,25], our work is based on multi-domain data heterogeneity, i.e. class-imbalance and domain-shift. Transformer models [26,27] have been used to solve the problem of data heterogeneity in classification tasks [28][29][30][31], demonstrating resilience against heterogeneous data [32,33]. The robustness exhibited by Transformers makes them well-suited for self-supervised learning [34,35], especially for data heterogeneity based on domain and distribution shifts in training data. Thus, in our method, we exploit Transformer architectures for training on heterogeneous data with domain-shifts and diverse distributions across multiple participating domains in FL. Existing studies [32,36] have indicated the superior performance of Transformer models in comparison to the commonly used ResNet [37] and other convolution-based networks [38,39]. The reason for this success lies in the Transformer architecture, whose attention heads provide contextual awareness in image interpretation [36]. As the attention mechanism helps to capture contextual dependencies more effectively, we posit that this property contributes to superior performance in heterogeneous FL. The convergence of Transformers is fast and their global model is suitable for most devices. We compare our results with the existing FL methods, and conclude that vision transformers (ViTs) perform better without additional hyperparameters and training. Therefore, they are appropriate for future research on FL problems.
Moreover, to minimize the inconsistency caused by domain-shift and diversity in multi-domain data distribution, we use two loss functions to optimize and improve the performance of the global model. During training, one loss is computed on the latent feature vectors and the other is calculated from the class logits of the model to optimize the global model in multi-domain FL (MDFL). We evaluate and compare our model with the existing methods using non-IID and heterogeneous data splits with domain-shift from multiple source domains. Experimental results suggest that the proposed method performs better than similar existing methods.
Our main contributions are as follows: (i) We formulate the problem of data heterogeneity based on class-imbalance and domain-shift within and across the domains in FL. (ii) We propose a method, MDFL, which trains a robust Transformer model to improve the performance of the global model trained on heterogeneous data with diverse distribution, class-imbalance and domain-shift from multiple domains in FL. (iii) We use two loss functions: a loss $L_C$ calculated from the class logits of the model to correctly predict the class labels, as given in equation (7), and a loss $L_B$ computed on the latent feature vectors to align the classes across domains, as given in equation (8), to optimize and improve the global model trained on diverse data splits from multiple domains. (iv) We evaluate our method on benchmark datasets by training our model on multi-domain data with diverse distribution and domain-shift in MDFL. Experimental results indicate the better performance of the proposed method.

Related work

Class-imbalance and label distribution
Much research has been done on the class-imbalance problem [15][16][17][18][19][20][40], and different solutions have been proposed, including under-sampling and over-sampling [41,42], reconciliation of the loss function [15,17,43,44], and learning paradigms such as self-supervised learning [16,45], transfer learning [18], ensemble learning [46,47], meta-learning [48], and metric learning [49]. All these methods have been used in a single-domain scenario and use data splits for all participants from the same domain, while we extend the data heterogeneity problem to multiple domains and imbalanced classes in an FL environment.

Multi-domain learning
In multi-domain learning, a model must be adaptive and robust to data from multiple domains containing different label distributions [50], which is similar to transfer learning [51]. The objective of domain adaptation is to learn a model for a single target domain [28,51], while multi-domain learning focuses on the average performance over all source domains and their distributions [52]. The existing methods are based on single-domain data [50,53], and exploit domain-invariant features [52][54][55][56] and multi-task learning [57]. We focus on class-imbalance within and across domains in an FL environment. Our problem is similar to domain generalization, in which a model is trained on multiple domains and generalized to an unseen domain [58]. Most existing methods are based on data augmentation [59,60], domain-invariant features [54,55,61], meta-learning [62,63], and causal relationships [64,65]. These methods are based on a single domain and have not explored the class-imbalance problem within and across domains in the scenario of domain-shift, especially in an FL environment. In this paper, we investigate the effect of data heterogeneity and class-imbalance in an MDFL environment.

FL
FL provides distributed and collaborative training on private data from multiple sources [1]. Two main categories of effective distributed training [66] have evolved: (1) serial FL methods, which train multiple clients in a cyclic and serial manner, such as split training [67] and cyclic weight transfer (CWT) [7], and (2) parallel FL methods, in which the participants train in parallel, such as FedAvg [1]. FL presents the challenge of domain-shift and class-imbalance across participants in training. Such data heterogeneity causes non-guaranteed model convergence and the forgetting problem for cyclic FL methods [7,68,69], and divergence in model weights for parallel FL methods [22][70][71][72].
The FedAvg algorithm [1] has been widely used in different variations, such as FedAVGM [73], which uses server momentum to mitigate the problem of class-imbalance and distribution-shift for each client. It has been used with optimization methods such as matching feature layers [74,75], collaborative replay [76], model distillation [77], and unsupervised contrastive learning [72] to address the problem of heterogeneity in data. It has also been implemented as FedAvg-Share [72] by sharing small chunks of data among participating users, and with an additional proximal term added to the local objective (FedProx), which reduces the potential weight divergence [22].
At the same time, several methods have been presented to solve the problem of catastrophic forgetting in cyclic FL methods. Such methods restrict updates to the weights that are important for historical tasks, known as elastic weight consolidation [78]. These methods implement cyclic weighted objectives to reduce the loss due to the skewness of the label distribution [28], and deep generative replay to mimic data from historical tasks or clients [76,79]. However, most existing methods focus on optimization techniques without inspecting how the model architecture handles domain and distribution shift in the data, which could increase the robustness and performance of the model. In our work, the experimental results are consistent with the hypothesis that an architectural change in the model makes a substantial difference and should be explored alongside optimization methods, which is the main focus of our work.

Transformers
The Transformer architecture, first proposed in [26], was implemented for sequence-to-sequence machine translation, and then for self-supervised natural language processing tasks [34]. Transformers have been widely used in image and video tasks in recent years. For example, self-attention has been applied to the local neighborhoods of the image in [80]. Similarly, global self-attention has been applied to full-size images using ViT [81] for the ImageNet classification task, achieving state-of-the-art performance. Transformers have thus shown a prominent performance increase compared to classical vision networks (i.e. CNNs [37,82]) and language models (i.e. LSTMs [83]), and have attracted interest in understanding the causes of their effectiveness. Many existing methods have shown the effectiveness and robustness of ViTs to severe domain-shifts, perturbations, and occlusions [5,32]. Furthermore, recent methods have demonstrated the effectiveness and suitability of Transformers for multi-modal and heterogeneous data [33,35,84]. Inspired by the above studies, our hypothesis is that ViTs will perform better by adapting to the domain-shift, class-imbalance, and overall heterogeneity of the data in FL. We conducted a considerable number of experiments and give a detailed empirical analysis to validate the hypothesis.

Methodology
In this section, we formulate the problem of data diversity in MDFL using transferability, and demonstrate our approach to minimize the heterogeneity effect of multi-domain data. Moreover, we demonstrate our proposed approach using end-to-end model training with Transformer architectures and the proposed losses in an MDFL environment.

Transferability
To explain the problem of multi-domain data heterogeneity, we visualize the data distribution of different domains with respect to their variations across the domains. In a classification task with a domain space $\mathcal{D} = \{1, 2, 3, \ldots, D\}$, each domain having a label space $\mathcal{C} = \{1, 2, 3, \ldots, C\}$, the data can be represented as a training set. The transferability $\mathrm{Trf}\{(d, c), (d', c')\}$ between two domain-class pairs is based on $E_d$, the Euclidean distance measured by first-order statistics $\mu$ (i.e. mean) in the representation space $z = g(x, \theta)$. Transferability can be visualized as a graph in a 2D Cartesian space by taking the average of $\mathrm{Trf}\{(d, c), (d', c')\}$ as a similarity measure. As a representative case, we plot the transferability graph for PACS [85] data. For each domain-class pair (i.e. $(d, c)$), the mean $\mu_{d,c}$ is estimated from the learned representations and the distance matrix is calculated for the transferability graph as shown in figure 2, where each color represents domain samples, with the circle size indicating the number of samples in that domain. Moreover, the distance among similar classes indicates the Euclidean distance between them. Our objective is to minimize the distance between domain-class pairs in all domains of a dataset.
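As a rough illustration of how such a distance matrix could be computed, the sketch below estimates the mean of each domain-class pair and the Euclidean distance between every two means. It is a simplification under stated assumptions: plain Python lists stand in for the learned representations $z = g(x, \theta)$, the toy feature values are made up, and the helper names `pair_means` and `distance_matrix` are hypothetical rather than taken from the paper's implementation.

```python
import math
from collections import defaultdict


def pair_means(features, domains, labels):
    """Estimate the mean feature vector of every domain-class pair (d, c)."""
    grouped = defaultdict(list)
    for z, d, c in zip(features, domains, labels):
        grouped[(d, c)].append(z)
    return {
        pair: [sum(col) / len(zs) for col in zip(*zs)]
        for pair, zs in grouped.items()
    }


def distance_matrix(means):
    """Euclidean distance between the means of all domain-class pairs."""
    return {
        (p, q): math.dist(means[p], means[q])
        for p in means
        for q in means
    }


# Toy example (made-up features): two domains sharing class 0.
features = [[0.0, 0.0], [0.0, 0.0], [3.0, 4.0], [3.0, 4.0]]
domains = ["photo", "photo", "sketch", "sketch"]
labels = [0, 0, 0, 0]
means = pair_means(features, domains, labels)
dists = distance_matrix(means)
```

In this reading, a small distance between the same class in two different domains suggests that the class is already well aligned across those domains.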

MDFL
As shown in figure 3, there are two main categories of FL based on the gradient merging mechanism for all participating models. (1) CWT, in which an individual local model is trained to become the ultimate global model in a serial and cyclic way for every round of communication. Subsequently, the global model is transferred to the next client for the same process [7]. In this way, each client participates in the training for a given number of local epochs. This process continues until the global model converges or the given number of communication rounds is reached. (2) Federated Averaging (FedAvg), in which a local model is trained on the local data of a participating client by performing stochastic gradient descent. Afterwards, all local models are averaged, and these averaged parameters are broadcast to every participant. The process is repeated until the global model converges or the given number of rounds is reached [1]. In our experiments, we apply the most common parallel FL method, FedAvg [1], as used by most existing methods [11,86,87]. The overall objective of FL is to minimize the loss to achieve the best global model, where $b_i$ represents the batch size for the participant $i \in N$ having model parameters $\theta$, and $s_i \in S$ is the local data of a participant.
Furthermore, we employ both convolution-based networks and non-convolutional Transformer architectures for training and evaluating the model across multi-domain data. We implement a variety of commonly used convolution-based classification models, including our custom model (referred to as CustomNet), LeNet-1 [39], CNN-2 [38], ResNet18 [37], and EfficientNet [88]. These models incorporate convolution and pooling operations within their architectures. CustomNet is similar to the CNN-2 [38] model, except that it contains an additional maxpool layer before each activation layer in the architecture. We also use an adaptive-average pooling layer in each convolution-based network to handle the variation in image size across domains. We exploit ViTs as non-convolutional models, specifically ViT (T) and ViT (S), in our implementation for a fair comparison. In contrast to traditional convolutional models, these architectures do not utilize convolutional layers in their designs.
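The FedAvg round described in this section (local stochastic gradient descent followed by server-side averaging) can be sketched as follows. This is a minimal illustration, not the authors' implementation: model parameters are flat Python lists, the local gradients are made-up numbers, and the averaging is weighted by local sample count as in the standard FedAvg formulation [1], which is an assumption about the exact weighting used here.

```python
def local_sgd_step(params, grads, lr=0.1):
    """One plain SGD step on a client's local copy of the parameters."""
    return [p - lr * g for p, g in zip(params, grads)]


def fedavg(client_params, client_sizes):
    """Average client parameter vectors, weighted by local sample count."""
    total = sum(client_sizes)
    n_params = len(client_params[0])
    return [
        sum(params[j] * size for params, size in zip(client_params, client_sizes))
        / total
        for j in range(n_params)
    ]


# One toy communication round with two clients.
global_params = [1.0, 1.0]
client_grads = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical local gradients
client_sizes = [1, 3]                    # hypothetical local sample counts

# Each client starts from the broadcast global model and trains locally.
local_models = [local_sgd_step(global_params, g) for g in client_grads]
# The server merges the local models and would broadcast the result again.
global_params = fedavg(local_models, client_sizes)
```

In CWT, by contrast, the loop would pass a single model from one client to the next instead of averaging several local copies.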
Transformers use patches of the image, known as tokens, to learn the features. The robustness of their architecture is due to self-attention, which aggregates the image information. Their non-convolutional architecture is based on layers, which can be examined through the average distance among learned weights. When the attention distance and its variability are compared from higher to lower layers, they are almost uniform as the network goes deeper. This ability of Transformers to capture contextual relationships when interpreting an image distinguishes them from convolution-based architectures. If a domain $i$ contains data samples $s_i$, then a domain-adaptive Transformer attention $\mathrm{Att}$ with $Q_i$, $K_i$, and $V_i$ can be represented as follows.

$$\mathrm{Att}(Q_i, K_i, V_i) = \mathrm{softmax}\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}}\right) V_i$$

where $d_k$ represents the dimensionality of the key vectors.
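To make the role of $d_k$ concrete, the following sketch computes attention for small matrices, assuming the standard scaled dot-product form (softmax of the query-key scores divided by $\sqrt{d_k}$, applied to the values). Queries, keys and values are lists of row vectors, and the function names are illustrative only.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors."""
    d_k = len(K[0])  # dimensionality of the key vectors
    output = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return output


# A zero query attends to both tokens equally.
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[2.0, 0.0], [0.0, 2.0]]
uniform = attention([[0.0, 0.0]], K, V)
# A query aligned with the first key puts most weight on the first value.
focused = attention([[10.0, 0.0]], K, V)
```

Because every output row is a weighted sum over all tokens, each patch can draw on context from the whole image, which is the property the text attributes to self-attention.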
For multi-domain data, the aggregated attention is formulated with $\gamma_i$ as the weight of a domain $i$. Moreover, for a domain $i$, if the input data $x_i$ is transformed to an embedding $E_i = \mathrm{Emb}(x_i)$ and the loss $L_i$ for that domain is a function of $(E_i, \theta_i)$, then the loss function $L_C$ (i.e. cross-entropy loss) for MDFL can be represented mathematically with a hyperparameter $\beta$ that controls the domain importance, where $E_i$ is the embedding and $\theta_i$ are the model parameters for a domain $i$. The loss $L_C$ is minimized to optimize and generalize the model trained on multi-domain data with heterogeneous domain and label distribution.

Training loss
For the training of a global model, we combine two loss functions into an overall loss that is minimized during optimization, as given in equation (6), where $\lambda$ is a hyperparameter representing the trade-off between the two losses. $L_C$ denotes the standard cross-entropy loss computed on the final output layer (i.e. class logits), as defined in equation (7), where $y'_n$ is the predicted output for input image $n$ and $b$ is the batch size. In (6), $L_B$ represents the balanced domain-class distribution alignment (BoDA) loss as proposed in [89]. $L_B$ is a special loss introduced to reduce the heterogeneity effect in a dataset. Here, we use $L_B$ to align the classes across domains, and to minimize the negative impact on FL training produced by data diversity from multiple domains. It is formulated in equation (8), where the numerator represents the positive distance (i.e. $E_d$) of domain-class pairs $(d, c)$, which has to be minimized to attract the same classes in the training, while the denominator is a negative distance of domain-class pairs that should be maximized to isolate different classes during training. Therefore, this loss function aims to minimize the distance between domain-class pairs that are similar, while increasing the separation between pairs that are dissimilar. Here, $\mu$ is the first-order statistics estimated from feature representations of domain-class pairs, and the calibration parameter $p^{d',c'}_{d_i,c_i}$ is used to control the transferability from $(d, c)$ to $(d', c')$ depending on their sample size. Moreover, the Euclidean distance is taken over the first-order statistics (i.e. mean).
The overall training setup of the proposed approach is shown in figure 4, where multi-domain data are distributed to clients (i.e. C1 to C12), and a local model generates a feature map which is transformed to feature vectors through global average pooling. The loss $L_B$ is calculated from the feature vectors, and the loss $L_C$ is measured from the class logits, which are the final output after the multi-layer perceptron classifier.
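The interplay of the two losses can be illustrated with a small sketch. Several parts are assumptions made for illustration: the combination is taken to be $L_C + \lambda L_B$, and `alignment_loss` is a simplified BoDA-style ratio that omits the calibration parameter, so it should not be read as the exact loss of [89]. Features and means are plain lists, and all names are hypothetical.

```python
import math


def cross_entropy(probs, targets):
    """Mean negative log-likelihood of the true class over a batch."""
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)


def alignment_loss(features, pairs, means):
    """Simplified BoDA-style loss (assumption: no calibration term).

    For a feature z from pair (d, c), the means of class c in other domains
    are positives; the means of every other pair form the denominator.
    """
    total = 0.0
    for z, (d, c) in zip(features, pairs):
        sims = {
            key: math.exp(-math.dist(z, mu))
            for key, mu in means.items()
            if key != (d, c)
        }
        denom = sum(sims.values())
        positives = [s for (_, c2), s in sims.items() if c2 == c]
        total += -sum(math.log(s / denom) for s in positives) / max(len(positives), 1)
    return total / len(features)


def total_loss(l_c, l_b, lam):
    """Assumed combination of the two losses with trade-off lam."""
    return l_c + lam * l_b


# Aligned case: class 0 has the same mean in domains A and B.
aligned = {("A", 0): [0.0, 0.0], ("B", 0): [0.0, 0.0], ("B", 1): [10.0, 0.0]}
# Misaligned case: class 0 in domain B sits far away, class 1 sits close.
shifted = {("A", 0): [0.0, 0.0], ("B", 0): [10.0, 0.0], ("B", 1): [0.0, 0.0]}
z, pair = [[0.0, 0.0]], [("A", 0)]
loss_aligned = alignment_loss(z, pair, aligned)
loss_shifted = alignment_loss(z, pair, shifted)
```

The alignment term is close to zero when the same class has nearby means across domains and grows when a different class is closer, which matches the attract and isolate behaviour described for the numerator and denominator.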

Experimental results
In our approach, we train Transformer models and support our hypothesis that they produce a superior global FL model compared to traditional convolution-based architectures. Employing Transformer models enhances the optimization process of the FL model. Moreover, these models are robust to heterogeneous data within and across domains (i.e. domain-shift), and must be adaptive to new and unseen-domain data.
We use two loss functions to optimize and further improve the performance of the model. The additional loss is useful for data heterogeneity, especially domain-shift, which is reduced as measured by transferability. The performance of the FL model that incorporates Transformers and uses both loss functions (i.e. $L_C$ and $L_B$) is assessed and contrasted with the performance of the same model using only $L_C$.

Datasets
In the experiments, DIGITS [54][90][91][92] data and PACS [85] data are used with different split categories. In the DIGITS dataset, domains are divided into multiple subdomains (i.e. clients) using a Dirichlet distribution (i.e. Dir(α)). As in [87,93], the dataset is split using a small value of α (i.e. α = 0.5) to distribute the data as non-IID among participants. However, the PACS dataset is used as a multi-domain dataset with one domain as a single client, without further split distribution. Additional information regarding both datasets is provided below.
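One common way to realise a Dir(α) split is to draw, for every class, a vector of client proportions from a Dirichlet distribution and cut that class's samples accordingly. The sketch below follows this recipe with the standard library only; it is an assumption about the procedure rather than the exact code used in [87,93], and the function name `dirichlet_split` is hypothetical.

```python
import random
from collections import defaultdict


def dirichlet_split(labels, n_clients, alpha, seed=0):
    """Assign sample indices to clients with Dir(alpha) class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    clients = [[] for _ in range(n_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # A Dirichlet draw is a vector of Gamma(alpha, 1) draws, normalized.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in range(n_clients)]
        total = sum(gammas)
        # Turn cumulative proportions into cut points over this class.
        cuts, acc = [], 0.0
        for g in gammas[:-1]:
            acc += g / total
            cuts.append(int(round(acc * len(idxs))))
        start = 0
        for client, end in zip(clients, cuts + [len(idxs)]):
            client.extend(idxs[start:end])
            start = end
    return clients


# Toy usage: 200 samples over 10 classes, split across 3 clients.
toy_labels = [i % 10 for i in range(200)]
parts = dirichlet_split(toy_labels, n_clients=3, alpha=0.5, seed=1)
```

With a small α, most of a class tends to land on one or two clients, so some clients end up with minority or missing classes; larger α values give each client a more similar label mix.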

DIGITS dataset
We evaluate the proposed method on the digit classification task. The domains used in the experiments include: (1) MNIST [90] handwritten digits, with 32 × 32 grayscale images, 60 000 samples as the train set and 10 000 as the test set, and (2) the MNIST-M dataset [54]. We split each domain's data into three clients using Dir(α = 0.5) to make them non-IID, as done by other existing methods such as [87,93]. We use a small value of α (i.e. 0.5) to make the data more heterogeneous with imbalanced label distribution across the clients, since a higher value of α leads to a more homogeneous data distribution and vice versa. Moreover, we use 3 of the 4 domains for training. We distribute the data across clients with different label distributions within and across the domains. We use the test set of each domain for its local model evaluation. We evaluate the model by leave-one-domain-out cross-validation on unseen-domain data. Thus, for each domain there are 3 clients, giving a total of 9 clients from 3 different domains that participate to produce a global model, which is evaluated on the 4th unseen domain. We use the test set of the 4th domain as a global test set to evaluate the global model. For the implementation of Transformers, we resize the training and testing data to a larger size of 224 × 224. In each experiment, a local model is optimized using the loss from (6). Moreover, 5 local epochs and 100 communication rounds are used for each data split scenario.
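The leave-one-domain-out protocol with three clients per domain can be enumerated as in the sketch below. The domain list is partly an assumption: MNIST and MNIST-M are named in this section, while SVHN and USPS are taken from the results discussion, and the helper name is hypothetical.

```python
def leave_one_domain_out(domains, clients_per_domain=3):
    """Yield (training clients, held-out test domain) for every fold."""
    for held_out in domains:
        train_clients = [
            (domain, k)
            for domain in domains
            if domain != held_out
            for k in range(clients_per_domain)
        ]
        yield train_clients, held_out


# Assumed domain list for the DIGITS setting (see the note above).
digit_domains = ["MNIST", "MNIST-M", "SVHN", "USPS"]
folds = list(leave_one_domain_out(digit_domains))
```

Each fold trains on 9 clients drawn from 3 domains and keeps the remaining domain's test set as the global test set.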

PACS dataset
In the experiments, the PACS [85] dataset is used to evaluate the robustness of the proposed method. The PACS dataset contains images of natural photos, art paintings, cartoons, and sketches. In each experiment, a domain is assigned to a participant without further splitting into subdomains and clients. Thus, 4 domains (i.e. 4 clients) participate in the PACS scenario. Other hyperparameter settings are the same as those used for the DIGITS dataset. For further visualization and description of PACS data, please see appendix A.2.

Technical details
We implement the proposed method with PyTorch on a Linux operating system (i.e. Ubuntu 22.04 LTS) with an NVIDIA GPU (GeForce RTX 3090) having a memory of 24 GB. Moreover, a CPU (i7-8700) with a memory of 50 GB has been used in the experiments. We follow [66] for the hyperparameter settings used in our experiments, as summarized in table 1.

Evaluation metric
We use accuracy, a common evaluation metric, in our experiments because the evaluation data are balanced, as in other similar methods [22,78,95]. We compare the proposed method with other existing methods based on accuracy.
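For reference, accuracy and its average over held-out domains can be computed as in this short sketch. The predictions, targets and per-domain values are made-up toy numbers, not results from the paper.

```python
def accuracy(predictions, targets):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(int(p == t) for p, t in zip(predictions, targets))
    return correct / len(targets)


def average_accuracy(per_domain):
    """Unweighted mean of the per-domain accuracies."""
    return sum(per_domain.values()) / len(per_domain)


# Toy values for illustration only.
toy_acc = accuracy([1, 2, 3, 4], [1, 2, 0, 4])
toy_avg = average_accuracy({"domain_a": 0.5, "domain_b": 1.0})
```

Plain accuracy is informative here because the evaluation data are balanced; with imbalanced test sets a per-class metric would be a safer choice.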

Comparison with existing methods
We measure the test accuracy of each model trained with leave-one-domain-out validation on heterogeneous data splits from three different domains of the DIGITS dataset. The global model is evaluated on the global test set obtained from the 4th domain, which is not included in training (i.e. leave-one-domain-out). The test accuracy of all convolution-based networks against communication rounds is shown in figure 5. It is observed that in some cases the models converge, but the pretrained ResNet18 and EfficientNet (B5) do not converge and adapt when evaluated on USPS test data. Moreover, CNN-2 and LeNet-1 do not perform well in the case of SVHN data. We also measure the test accuracy of each model evaluated on global test data from each individual domain and compare with existing methods as shown in table 2. It is clear from table 2 that the performance of the convolutional models is comparatively lower, while the non-convolutional ViTs perform better and are more robust against domain-shift and imbalanced distribution in the scenario of heterogeneous data from multiple domains. Moreover, it is also noteworthy that ViT(S) performs better than ViT(T). The performance variance observed between ViT(T) and ViT(S) could stem from their distinct tokenization strategies, with ViT(T) employing token mixing tokenization and ViT(S) utilizing spatial tokenization. Moreover, the effectiveness of tokenization strategies also depends on the dataset characteristics: ViT(T) performs better in the case of complex spatial relationships that cannot be adequately captured by non-overlapping patches, while ViT(S) performs better if the dataset is comprised of spatially correlated features because it preserves spatial information by dividing the image into non-overlapping patches. However, for both DIGITS and PACS data, ViT(T) and ViT(S) perform better than other convolution-based networks in all cases.

Table 2. Average accuracy with non-homogeneous data (α = 0.5) and unseen domain. All results are reported according to leave-one-domain-out settings. These results are also a comparison of the proposed method and other recent FL methods based on the accuracy measured on heterogeneous data from multiple domains. The bold text shows the improved performance of the proposed method using transformer models.
To evaluate the performance of the model on a different dataset, we also use the PACS [85] dataset in the experiments. In this dataset, each domain is assigned to a single participant, and hence 4 participants train on their individual domain data on the basis of leave-one-domain-out settings. The experimental results for PACS training with the convolutional and non-convolutional models selected on the basis of the previous DIGITS results are given in table 3, in which the accuracy of each individual domain is measured and the average accuracy is given in the last column. Table 3 also demonstrates the advantage of ViTs over the convolutional networks trained on the PACS dataset.
We further exploit the additional loss functions given in (6) to optimize the model trained on heterogeneous DIGITS data from multiple domains. The model's performance improves notably when incorporating these additional loss functions, for both convolution-based networks and non-convolutional Transformers, as illustrated in figure 6. This is because of the transferability and distance minimization between domain-class pairs, which eventually improves the overall performance of the trained FL model.
Finally, we measure and compare the average accuracy of each method for both DIGITS and PACS datasets. The accuracy of the global model with and without the additional loss is given in table 4, which shows that ViTs, when used together with the additional loss functions, are robust to heterogeneous data with imbalanced label distribution and domain-shift. Moreover, the additional loss increases the performance of the learned model by decreasing the transferability and Euclidean distance between domain-class pairs of multiple domains.
We compare the proposed method with existing FL methods such as ResNet50-FedAVG (i.e. R50-FedAVG) [78], FedProx [22], and FedAVG-Share [95], as given in table 4. Most of the existing methods use convolution-based networks, and the performance of these methods varies from domain to domain. The comparative results indicate that ViTs exhibit greater robustness when employed with FedAvg aggregation and the inclusion of the additional (i.e. BoDA) loss. They also demonstrate an ability to adapt effectively to new and unseen domains. Moreover, ViTs perform better than existing methods [22,78,95] in the case of data heterogeneity and different distributions caused by domain-shift when data from multiple domains are used for training. Thus, ViTs combined with the losses in equation (6) perform better than existing state-of-the-art methods, and are suitable for MDFL by solving the problems of data heterogeneity, catastrophic forgetting, and domain-shift. When training the FL model in a real-world scenario, a model should adapt to the diversity across domains. Thus, we use data splits from a single domain as well as multiple domains using the leave-one-domain-out training method, so that the test domain does not overlap with the training data. Furthermore, we evaluate the proposed model using unseen-domain data that were not included in the training or validation of the model. Our results show the effectiveness of ViTs and the additional loss used in such real-world scenarios. We solve the problem of data heterogeneity in the scenario of imbalanced class distribution and domain-shift within and across domains.
To evaluate the robustness of the proposed method, we have used two different datasets, DIGITS and PACS, each containing multiple domains. The DIGITS dataset is subdivided into multiple non-IID splits for the participants, while the PACS dataset is distributed to participants with one domain allocated to one client. Moreover, the DIGITS dataset contains digit images with different color, shape and resolution. Likewise, the PACS dataset is also heterogeneous in terms of label distribution and features, containing images of natural photos, sketches, art paintings and cartoons. Thus, both datasets represent the real-world scenario of multiple domains having heterogeneous data with domain-shift and imbalanced label distribution.
To tackle data heterogeneity in MDFL, we have exploited ViTs, which have been used as robust classifiers in different fields. Here, we implement ViTs to address the problem of data diversity due to imbalanced label distribution and domain-shift in FL model training. Moreover, we incorporate a generalization loss into our MDFL approach to mitigate the impact of heterogeneous distribution and domain-shift in the training data. The experimental results indicate that, within our MDFL settings, ViTs outperform the convolution-based architectures when combined with the BoDA loss (i.e. equation (8)).

Conclusion
We train a robust Transformer model for MDFL using data from multiple domains characterized by imbalanced class distribution and domain-shift, to solve the problem of data diversity. Transformer architectures are able to solve the problem of device forgetting and learn features efficiently compared to convolution-based deep networks. However, ViTs typically have a large number of parameters, making communication expensive, especially in FL where models are trained across multiple decentralized devices or servers. Moreover, these models require substantial computational and memory resources during training due to their computational complexity and large number of parameters. As we are not concerned with computational resources in this work, we take advantage of their robustness when training with heterogeneous data. We train the global FL model by optimizing two loss functions, on the latent feature space and on the class logits of the model. We evaluated the proposed method and compared it with the existing methods based on the accuracy of the global model. The proposed method achieves higher accuracy than the compared methods. In addition to addressing optimization challenges, the prevalent issue of data leakage in FL can be addressed in the future by combining the proposed model with state-of-the-art defense methods.

Figure 1. Multi-domain DIGITS data with imbalanced label distribution. Each domain has image data with different color and resolution. Each dataset is split and distributed to clients using a Dir(α) distribution to produce non-IID data. Thus, a model is trained on data from multiple domains with domain-shift and imbalanced label distribution to produce a global model (i.e. equation (2)).
i.e. class-imbalance and domain-shift in the scenario of collaborative training with different participants from different domains. In the proposed multi-domain FL (MDFL) scenario, we use data from different domains, each having the same general categories (i.e. class labels) but different distributions and underlying patterns. Thus, severe domain-shifts and diverse distributions are major challenges for the convergence of the global model in FL. As shown in figure 1, the key challenges in MDFL are as follows:
• Domain-shift due to data heterogeneity within and across the domains. Data from different domains have different variations such as color variation and image resolution.
• Different distribution of labels (i.e. class-imbalance) within and across the domains. Class-imbalance becomes more challenging for the local training of a participant, and eventually for the global model, when the dataset of a single domain is split across multiple clients.
• Minority and missing classes within and across the domains. Thus, some of the participants have minority and missing classes in their local training data.
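The Dir(α) split mentioned in the caption of figure 1 can be illustrated with a short sketch. The code below is not the authors' implementation; it shows one common way to build a label-skewed, non-IID partition, in which each class is divided among clients according to proportions drawn from a Dirichlet distribution. The function name `dirichlet_partition` and its parameters are hypothetical, chosen only for this illustration.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Assign sample indices to clients using per-class Dirichlet proportions.

    Hypothetical helper for illustration; smaller alpha gives a more skewed split.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        # Shuffle the indices of all samples that belong to class c.
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Fraction of class c that each client receives.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        # Turn the fractions into cut points inside the shuffled index array.
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices

if __name__ == "__main__":
    toy_labels = np.repeat(np.arange(10), 100)  # 10 classes, 100 samples each
    parts = dirichlet_partition(toy_labels, num_clients=5, alpha=0.5)
    for k, p in enumerate(parts):
        # Per-client class histogram: zeros correspond to missing classes.
        print(k, np.bincount(toy_labels[p], minlength=10))
```

With a small α, some clients typically receive few or no samples of certain classes, which reproduces the minority-class and missing-class situation listed above.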
denotes the class label, and d_i ∈ D represents a domain in the dataset. To represent the variation between domains and classes, we denote domain samples d and class samples c as domain-class pairs (d, c) as part of the training data, represented as T_{d,c} ⊆ T. When we have

Figure 2. Transferability graph plotted on the basis of similarity between domain-class pairs of the PACS dataset. The X-axis and Y-axis are two dimensions of inter-domain distance (similarity). The gap between similar classes represents the Euclidean distance, and the size of each circle represents the number of samples of that class.

Figure 3. Categories of Federated Learning (FL) representing clients training on heterogeneous data. In the cyclic FL method (right), an individual local model is trained in a cyclic way, while in the parallel FL method (left), all local models are averaged, and the aggregated parameters are broadcast to all the clients for another training round until a given limit is reached.
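The averaging step of the parallel FL method can be sketched as follows. This is a minimal illustration of FedAvg-style aggregation, assuming each client's model is given as a list of NumPy arrays (one per layer) and that clients are weighted by their number of local samples. The weighting scheme and the function name `fedavg` are assumptions made for this example and are not taken from this section of the paper.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Return the sample-size-weighted average of client parameters.

    client_weights: list (one entry per client) of lists of arrays (one per layer).
    client_sizes:   number of local training samples held by each client.
    """
    total = float(sum(client_sizes))
    num_layers = len(client_weights[0])
    return [
        # Weighted sum of the same layer across all clients.
        sum(w[layer] * (n / total) for w, n in zip(client_weights, client_sizes))
        for layer in range(num_layers)
    ]

if __name__ == "__main__":
    # Two toy clients, each with a single one-layer "model".
    client_a = [np.array([1.0, 1.0])]
    client_b = [np.array([3.0, 3.0])]
    global_weights = fedavg([client_a, client_b], client_sizes=[1, 3])
    print(global_weights[0])
```

In a full training loop, the server would broadcast the returned parameters to all clients before the next global round.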

Figure 4. Overall setup of the proposed approach. A local model of each client is trained on multi-domain data with imbalanced label distribution and domain-shift. A loss LB is optimized on the latent feature vectors and LC is optimized using the class logits of the network.

Figure 5. Average test accuracy of each deep network using FedAvg as the aggregation method, for a given global test set. (a) Accuracy of each network evaluated on MNIST global test data. (b) Test accuracy of each network model for the MNIST-M global test set. (c) Accuracy of every convolution-based network evaluated on SVHN global test data. (d) Test accuracy of each model evaluated on USPS test data.

Figure 6. Average test accuracy of each model to compare ViTs using additional loss functions and FedAvg (i.e. aggregation) for a given global test set from (a) MNIST, (b) MNIST-M, (c) SVHN, and (d) USPS.

Figure A1. DIGITS dataset: Heterogeneous datasets from different domains with different colors, resolution, and data distribution (i.e. domain-shift).

Figure Heterogeneous PACS data from different domains with different colors, size, and data distribution (i.e.domain-shift).

Figure A3. PACS dataset: Heterogeneous dataset containing different domains with different features, colors, and domain-shift. The dataset contains painting, cartoon, natural photo, and sketch images. A global model is trained on this divergent data.

Figure A4. Representation of data heterogeneity in the PACS dataset due to class imbalance in multi-domain federated learning. Each domain, represented by a different color, has a different number of samples for a class. Imbalanced class distribution is a challenge for FL when some classes are minority classes that may be missing, especially when dividing data into training and validation splits.

Table 1. Hyperparameter settings and values used in the experiments.

Table 3. Individual and average test accuracy of each domain from the PACS dataset, including art painting, cartoon, natural photo, and sketch images.

Table 4. Overall average accuracy measured for different existing methods and the proposed method using both DIGITS and PACS datasets. Average accuracy is measured for the model with and without the additional loss, given in equation (8). Here, + represents the results when LB is used for the model training on a given dataset.

As given in section 2, most existing methods perform training on data splits from the same domain to improve the optimization of FL methods. However, in real-world data, different domains possess divergent data distributions. For example, in our experiments, we use 4 different domains, which contain different datasets with variation in colors, image resolution, domain-shift, and heterogeneity in data based on label distribution.