FedER: Federated Learning through Experience Replay and Privacy-Preserving Data Synthesis

In the medical field, multi-center collaborations are often sought to yield more generalizable findings by leveraging the heterogeneity of patient and clinical data. However, recent privacy regulations hinder the possibility of sharing data and, consequently, of developing machine learning-based solutions that support diagnosis and prognosis. Federated learning (FL) aims at sidestepping this limitation by bringing AI-based solutions to data owners and only sharing local AI models, or parts thereof, that then need to be aggregated. However, most of the existing federated learning solutions are still in their infancy and show several shortcomings, from the lack of a reliable and effective aggregation scheme able to retain the knowledge learned locally, to weak privacy preservation, as real data may be reconstructed from model updates. Furthermore, the majority of these approaches, especially those dealing with medical data, rely on a centralized distributed learning strategy that poses robustness, scalability and trust issues. In this paper, we present a federated and decentralized learning strategy, FedER, that, exploiting experience replay and generative adversarial concepts, effectively integrates features from local nodes, providing models able to generalize across multiple datasets while maintaining privacy. FedER is tested on two tasks -- tuberculosis and melanoma classification -- using multiple datasets in order to simulate realistic non-i.i.d. medical data scenarios. Results show that our approach achieves performance comparable to standard (non-federated) learning and significantly outperforms state-of-the-art federated methods in their centralized (thus, more favourable) formulation. Code is available at https://github.com/perceivelab/FedER


I. INTRODUCTION
Recent advances in deep learning in the medical imaging domain have shown that, while data-driven approaches represent a powerful and promising tool for supporting physicians' decisions, the availability of large-scale datasets plays a key role in the effectiveness and reliability of the resulting models [1]-[3]. However, the curation of large medical imaging datasets is a complex task: data collection at single institutions is relatively slow, and the integration of historical data may require significant effort to deal with different data formats, storage modalities and acquisition devices; moreover, medical institutions are often reluctant to share their own data, due to privacy concerns. As a consequence, the quality, reliability and generalizability of models trained on local datasets are affected: such models unavoidably suffer from bias and overfitting issues, reducing their ability to address future data distribution shifts [4]. In order to overcome the lack of large-scale datasets, methodological solutions can be adopted: in particular, federated learning [5] encompasses a family of strategies for distributed training over multiple nodes, each with its own private dataset, which typically communicate with a central node by sending local model updates, used to train the main model. In this scenario, no data is explicitly shared between nodes, thus satisfying privacy requirements. However, this family of techniques generally performs well when dataset distributions are approximately i.i.d. and local gradients/models contribute to learning shared features: unfortunately, in practice this hypothesis rarely holds, due to differences in the acquisition and in the clinical nature of data collected by multiple institutions.
Moreover, the presence of a central node, besides representing a single point of failure, requires that all nodes trust it to correctly and fairly treat updates from all sources: indeed, privacy issues arise when transferring local updates to the "semi-honest" central node [6], which might attempt to reconstruct original inputs from gradients or parameter variations [7]-[9]. To address the above limitations, we present FedER, a federated learning approach that, leveraging experience replay from continual learning [10]-[13] and generative models [14]-[16], proposes a principled way to train local models that approximately converge to the same decisions, without the need for a shared model architecture or central coordination. FedER also enforces privacy preservation by transmitting synthetic data generated in a way that obfuscates real data patterns. Specifically, FedER's learning strategy envisages multiple nodes that initially train their local models and a GAN on their own datasets. The GAN is used to generate a privacy-preserving synthesized version of the dataset (buffer). Once local training is completed in a node, its model and the buffer of generated synthetic data are sent to a random node of the network. The receiving node then adapts the incoming model using its own data and the received buffer data, in order to limit the model's forgetting. Data privacy is ensured through a privacy-preserving generative adversarial network (GAN) that employs a specific loss designed to maximize the distance from real data, while keeping a high level of realism and, just as importantly, clinically-consistent features, in order to allow models to be trained effectively.
FedER is tested on two tasks, simulating a non-i.i.d. medical scenario: 1) classification of tuberculosis from X-ray data, using the Montgomery County and Shenzhen Hospital datasets [17]-[19], and 2) melanoma classification using skin images of the ISIC 2019 dataset [20]-[22]. The experimental setting is specifically designed to emulate a realistic medical non-i.i.d. scenario, where each node in the federation uses its own dataset. This is in stark contrast with common procedures where non-i.i.d. distributions are simulated by splitting a single source dataset. Results show how our approach is able to reach performance similar to centralized training on all real data gathered in a single node, while outperforming current state-of-the-art methods, such as FedAvg [23], FedProx [24] and FedBN [25]. Privacy-preserving capabilities are measured quantitatively by evaluating the LPIPS distance [26] between real images and samples generated, respectively, through latent space optimization on a standard GAN and by the proposed approach. Qualitatively, we also show several examples of generated images with their corresponding closest match in the real dataset, demonstrating significant differences that prevent tracing back to the original real distribution.
In summary, the overall contributions of the proposed work are the following:
• We propose a decentralized federated learning strategy, based on continual learning principles, designed for medical imaging data, which outperforms centralized federated learning approaches and yields performance similar to standard (non-federated) training settings. Furthermore, experience replay allows all local node models to converge to the same decisions, thus making the whole approach behave similarly to centralized aggregation models.
• We propose a GAN-based privacy-preserving mechanism that supports synthetic data sharing through a generation technique designed to minimize patient information leakage. This differs from most privacy-preserving techniques based on differential privacy, which degrade performance due to the added noise.
• Most approaches for model aggregation in federated learning employ gradient or parameter averaging. These solutions, however, completely neglect any similarity or dissimilarity between merged features, possibly resulting in interference that harms convergence. FedER, instead, takes feature semantics into account when merging models: if a node receives a model that extracts useful features for the local dataset, these can be readily employed and re-used, without the risk of randomly averaging them with other, less important features.
• We demonstrate that distributing learning at the data level is an effective solution to create models that generalize across multiple datasets, while ensuring privacy.

II. RELATED WORK
Federated Learning (FL) [23] has recently emerged as a family of distributed learning strategies that allow nodes to keep training data private, while supporting the creation of a shared model. In a typical FL setting, a central server sends a model to a set of client nodes; each node fine-tunes the model on its own data, then sends local model updates back to the server; the server aggregates the updates from all nodes into the global model, which is sent back to the nodes iteratively until convergence. Given the constraints existing in the medical domain, especially in terms of data sharing, this domain represents an appropriate test-bench for federated learning methods [27]-[30]. The most straightforward way to aggregate information from multiple nodes is by averaging the local models of all clients, as proposed in FedAvg [23] and FedProx [24]. However, statistical data heterogeneity is an issue, as it may lead to catastrophic forgetting [31], [32]. FedCurv [33] addresses this limitation by adding a penalty term to the loss, driving the local models to a shared optimum. FedMA [34] builds a shared global model in a layer-wise manner by matching and averaging hidden elements with similar feature extraction signatures. Our method differs from existing feature integration approaches in that, instead of averaging model updates or gradients, which can be subject to input reconstruction attacks [7], [8], [35], each node attempts to learn features that perform well on its own dataset while retaining knowledge from other nodes, in a more principled way than parameter averaging. The strategy of fitting the global model to local data is also sought by the recent federated personalized methods. FedBN [25], for instance, keeps batch normalization layers private, while the other model parameters are aggregated by the central node.
However, the presence of a central node that aggregates local updates simplifies the communication protocol when the number of clients is very large (thousands or millions), but introduces several downsides: it represents a single point of failure; it can become a bottleneck when the number of clients increases [36]; and, in general, it may not always be available or desirable in collaborative learning scenarios [31]. In this paper, we deal with decentralized federated learning, in which the central node is replaced by peer-to-peer communication between clients: there is no longer a global shared model as in standard FL, but the communication protocol is designed so that all local models approximately converge to the same solution. Decentralized learning is particularly suitable for application in the medical domain, where the number of nodes (i.e., institutions) is relatively low; however, research is still ongoing, and no effective solutions have been established. In [37], a Bayesian approach is proposed to learn a shared model over a graph of nodes, by aggregating information from local data with the model of each node's one-hop neighbors. A secure weight averaging algorithm is proposed in [38], where model parameters are not shared between nodes, but all converge to the same numerical values (with the disadvantages associated with parameter averaging under non-i.i.d. data distributions). Other approaches implement different communication strategies based on parameter sharing (e.g., decentralized variants of FedAvg [23], [39]). In general, many of the existing solutions do not target, nor are they tested on, the medical domain -- most employ toy datasets, such as MNIST and CIFAR10. A work which is similar in spirit to ours is BrainTorrent [28], where a use case of decentralized learning for MRI brain segmentation is presented. However, like other approaches, it uses simple parameter averaging to integrate features from multiple nodes.

Fig. 1. Overview of FedER. Each node initially trains a privacy-preserving GAN, which is used to sample synthetic data from the local distribution without retaining features that may be used to identify patients. Then, each node iteratively receives the local model and a buffer of synthetic samples from a random node, and fine-tunes the received model on its own private data, using the buffer to prevent forgetting of previously-learned features.

III. METHOD

A. Overview
An overview of FedER is shown in Fig. 1. In this scenario, a federation consists of a set of N peer nodes, each owning a private dataset.
Before the decentralized training algorithm is started, each node internally trains a privacy-preserving generative adversarial network, which is used to generate synthetic samples from its private data distribution. The training objective of the GAN is designed to enforce the constraint that sampled data do not include privacy-sensitive information, while maintaining the clinical features required for successful training.
At each round of decentralized training, each node receives a model and a set of synthetic samples -- a "buffer" -- from a random node in the federation. The input model to the node is fine-tuned on both the private dataset and the buffer, in a way that is reminiscent of experience replay techniques in continual learning (e.g., [13]), in order to learn features that transfer between nodes and that can handle non-i.i.d. distributions. At the end of each round (i.e., after performing several training iterations), the locally-trained model is sent to a randomly-chosen successor node together with a buffer of local synthetic samples, and the whole procedure is repeated.
In this work we specifically address the problem of federated learning for medical image classification; thus, the method is presented with reference to this task, but the whole strategy can be applied to any other task without loss of generality.

B. Privacy-preserving GAN
In the proposed method, nodes exchange both models and data, implementing a knowledge transfer procedure based on experience replay (see Sect. III-C below). Of course, sharing real samples would go against federated learning policies; hence, exchanged samples are generated so that they are representative of the local data, while taking precautions against privacy violations -which may happen, for instance, if the generative model overfits the source dataset.
Formally, we assume that each node n_i, from a set of N nodes, owns a private dataset D_i = {(x_j, y_j)}, where each x_j ∈ X represents a sample in the dataset, and each y_j ∈ Y represents the corresponding target 1 . The local dataset can then be used to train a conditional GAN [15], consisting of a generator G, which synthesizes samples for a given label by modeling P(x|y, z), where z ∈ Z is a random vector sampled from the generation latent space, and a discriminator D, which outputs the probability of an input sample being real, modeling P(real|x, y). The standard GAN formulation introduces a discrimination loss, which trains D to distinguish between real and synthetic samples:

$$\mathcal{L}_D = -\mathbb{E}_{(x,y) \sim \mathcal{D}_i}\left[\log D(x, y)\right] - \mathbb{E}_{z \sim \mathcal{Z},\, y \sim \mathcal{Y}}\left[\log\left(1 - D(G(z, y), y)\right)\right] \quad (1)$$

and a generation loss, which trains G to synthesize samples that appear realistic to the discriminator:

$$\mathcal{L}_G = -\mathbb{E}_{z \sim \mathcal{Z},\, y \sim \mathcal{Y}}\left[\log D(G(z, y), y)\right] \quad (2)$$

While it has been theoretically proven that, at convergence, the distribution learned by the generator matches the original data distribution [32], GAN architectures may unfortunately be subject to training anomalies, including mode collapse and overfitting: as a consequence, the basic GAN formulation may lead to the generation of samples that are near duplicates of the original samples, which would be unacceptable in a federated learning scenario.
In order to mitigate this risk, we introduce a privacy-preserving loss, enforcing the generation of samples that do not retain potentially sensitive information, but still include features that are clinically relevant to the target y of the synthetic sample. In other words, if y encodes generic features for the diagnosis of a certain disease, we want the generator to learn how to synthesize samples, conditioned on y, that exhibit evidence of that disease but cannot be traced back to any of the dataset's samples of the same disease.
To do so, our privacy-preserving loss aims at penalizing the model proportionally to the similarity between pairs of real and synthetic samples. We measure "similarity" by means of the LPIPS metric [26], which has been shown to capture perceptual similarity by calibrating the distance between feature vectors extracted from a pre-trained VGG model [40].
In practice, given a batch of real samples {x_1, ..., x_B} and a batch of synthetic samples {x̃_1, ..., x̃_B}, the privacy-preserving loss term is computed as:

$$\mathcal{L}_{PP} = \frac{1}{B^2} \sum_{i=1}^{B} \sum_{j=1}^{B} d_L\left(x_i, \tilde{x}_j\right) \quad (3)$$

where d_L is the LPIPS distance. Note that, in this formulation, we ignore the y targets associated with each x: we want to prevent the model from generating near-duplicates of real samples in general, regardless of class correspondence. Also, we intentionally employ a pairwise metric on samples, rather than an aggregated metric such as the Fréchet Inception Distance [41], since we want to prevent similarity between samples, not between distributions, which would conflict with the GAN objective. The resulting new loss for the generator is a combination of Eq. 2 and Eq. 3:

$$\mathcal{L}_G' = \mathcal{L}_G - \alpha \mathcal{L}_{PP} \quad (4)$$

where L_PP is sign-reversed as we want to maximize Eq. 3, while α is a hyperparameter used to balance the two terms. The combined effect of the three loss terms -- L_D, L_G, L_PP -- pushes the generator to explore the sample space to match the dataset distribution, while "avoiding" latent space mappings that would project to actual real samples.
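As an illustration, the interaction between the adversarial term and the privacy-preserving term can be sketched in plain Python. Here `lpips_stub` is a stand-in for the learned LPIPS metric (the actual method computes distances on VGG features), and images are flattened into plain lists; all names are ours:

```python
def lpips_stub(a, b):
    # Stand-in perceptual distance: mean absolute difference.
    # FedER uses the learned LPIPS metric on VGG features instead.
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def privacy_preserving_loss(real_batch, fake_batch, dist=lpips_stub):
    # L_PP (Eq. 3): average pairwise distance between real and synthetic
    # samples, computed regardless of class labels.
    total = 0.0
    for x in real_batch:
        for x_hat in fake_batch:
            total += dist(x, x_hat)
    return total / (len(real_batch) * len(fake_batch))

def generator_loss(adv_loss, real_batch, fake_batch, alpha=1.0):
    # Combined objective (Eq. 4): adversarial term minus alpha * L_PP,
    # so minimizing it pushes synthetic samples away from real ones.
    return adv_loss - alpha * privacy_preserving_loss(real_batch, fake_batch)
```

Synthetic batches far from the real batch thus yield a lower (better) generator loss than near-duplicates, which is exactly the pressure the privacy-preserving term exerts.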

C. Federated learning with experience replay
Current approaches for federated learning are mostly based on parameter averaging (e.g., FedAvg), which is, however, a simplistic way to combine knowledge from multiple sources: feature locations are not aligned across different models and may be disrupted by updates before slowly converging to consensus. Hypothetically, two models could learn the same set of features at different locations of the same layer, only to have them cancel each other out when averaged. In a decentralized scenario, this issue is exacerbated by the lack of an entity that enforces global agreement on node features.
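A toy example makes the cancellation effect concrete (the weight vectors below are illustrative, not taken from any actual model):

```python
def average_weights(w1, w2):
    # Naive FedAvg-style aggregation: element-wise mean of parameters.
    return [(a + b) / 2 for a, b in zip(w1, w2)]

# Two nodes learned the same detector, but at swapped positions within
# the layer (the permutation ambiguity of neural networks).
node_a = [1.0, 0.0]   # feature F at unit 0
node_b = [0.0, 1.0]   # the same feature F at unit 1
merged = average_weights(node_a, node_b)
# Both sharply-tuned detectors are blurred into a weaker, ambiguous one.
```

After averaging, `merged` is `[0.5, 0.5]`: neither unit retains the original detector at full strength, even though both nodes had learned it perfectly.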
In our approach, we address this problem by taking inspiration from continual learning strategies [42] that learn how to perform a task on a non-i.i.d. data stream without forgetting previously-learned knowledge: as a consequence, models are encouraged to reuse and adapt features so that they can equally serve the current and previous tasks. Analogously, in the federated learning setting, the objective is to train a global model on disjoint non-i.i.d. data distributions coming from different nodes.
Given these premises, we define a federated learning strategy where a node receives another node's model and surrogate data (generated through our privacy-preserving GAN) -- the "previous task" -- and fine-tunes that model on its own private data -- the "current task" -- while using the received synthetic data as a reference to what should be retained or adapted from the knowledge learned by the previous node. The idea is to build, for each node, a model able to tackle its internal data while not forgetting the data seen in previous nodes/iterations.
We first introduce the terminology used in the method's description. In our approach, we define a set of N tasks T = (T_1, T_2, ..., T_N), where T_i is the task to be solved within node n_i.

Definition 1. Task T_i aims at optimizing a model M_i, parameterized by θ_i, on the dataset D_i residing on node n_i, which cannot be shared with other nodes.

Definition 2. A buffer B_i is a set of synthetic images, drawn from a latent space learned through a generative model G_i using the data D_i available on node n_i.

In the following, we describe our method (whose graphical representation is given in Fig. 1) from the point of view of a single node n_j. At a given round r, training for node n_j can be seen as learning a new task T_j, from dataset D_j, in a continual learning setting, by fine-tuning the incoming model M_i^{r-1} (with parameters θ_i^{r-1}) on D_j and on the incoming buffer B_i, in order to learn T_j while mitigating the forgetting of T_i. Thus, unlike other federated learning approaches, each node does not have its own local model: as the decentralized learning strategy proceeds, a node iteratively receives a model from another node and updates it with local information, while preserving previously-learned knowledge, before sending it to the next node. Formally, the loss function for model M_j^r in node n_j at round r is given as:

$$\mathcal{L}(\theta_j^r) = \mathbb{E}_{(x,y) \sim \mathcal{D}_j}\left[\ell\left(M(x; \theta_j^r), y\right)\right] + \lambda\, \mathbb{E}_{(x,y) \sim B_i}\left[\ell\left(M(x; \theta_j^r), y\right)\right] \quad (5)$$

where ℓ is the classification loss and λ controls the balance between real samples from the local dataset D_j and replayed synthetic samples from node n_i. Note that, for a given n_j, the predecessor node n_i is not fixed: in a practical asynchronous implementation, a node may receive a model and buffer from any random node in the federation at any time, using queues to handle incoming data.
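A minimal sketch of the replay objective in Eq. 5, with a generic per-sample loss standing in for cross-entropy; all names are illustrative:

```python
def feder_loss(model, criterion, real_batch, buffer_batch, lam=0.5):
    # Eq. 5: loss on the node's private data plus a lambda-weighted
    # replay term on the synthetic buffer received from the predecessor.
    local = sum(criterion(model(x), y) for x, y in real_batch) / len(real_batch)
    replay = sum(criterion(model(x), y) for x, y in buffer_batch) / len(buffer_batch)
    return local + lam * replay
```

With `lam=0.5` (the value used in the experiments), the buffer term contributes half as much as the local term, pulling the received model toward the local task while penalizing forgetting of the predecessor's data.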
After optimizing the L(θ r j ) objective through mini-batch gradient descent for a certain number of training iterations, the resulting model M r j (θ r j ), with updated parameters θ r j , is sent to a random node n k of the federation, along with a buffer B j of locally-generated synthetic samples. The number of training rounds/iterations and the size of the buffer are discussed in the next section.
Then, the general federated model M, after all training rounds, is given by the union of all the N node models, i.e., M = M 1 ∪ M 2 ∪ · · · ∪ M N . However, experimental results, reported below in Sect. IV, demonstrate that all models converge to similar decisions, thus each node model can be considered as a general model for the entire network.
To ease the understanding of the whole training strategy we also report the algorithm pseudo-code in Alg. 1.

Algorithm 1: FedER Learning Procedure

Notation: the N nodes are indexed by n_i; E is the number of local epochs for each round; R is the total number of communication rounds between nodes. Each node n_i holds M_i^r, the model for node n_i at round r, and B_i, the synthetic data buffer sampled using G_i.

// Before federated training
for each node n_i ∈ N do
    Train G_i on D_i
    Generate buffer B_i using G_i
    Train M_i^0 on D_i
end

// Federated training
for each round r = 1, 2, ..., R do
    for each node n_j ∈ N do
        Receive M_i^{r-1} and B_i from a random predecessor n_i
        Fine-tune M_i^{r-1} on D_j and B_i for E epochs (Eq. 5), obtaining M_j^r
        Send M_j^r and B_j to a random successor
    end
end
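Under the assumption of synchronous rounds, Alg. 1 can be rendered as a runnable skeleton, where the actual GAN and classifier training routines are placeholders passed in as callables:

```python
import random

def feder_training(nodes, rounds, train_gan, sample_buffer, train_model, finetune):
    """Skeleton of Alg. 1. `nodes` maps node id -> private dataset; the
    four callables are placeholders for the real training routines."""
    # -- Before federated training: per-node GAN, buffer and initial model.
    gans = {i: train_gan(data) for i, data in nodes.items()}
    buffers = {i: sample_buffer(gans[i]) for i in nodes}
    models = {i: train_model(data) for i, data in nodes.items()}
    ids = list(nodes)
    for _ in range(rounds):
        # Random ordering per round defines each node's predecessor.
        order = ids[:]
        random.shuffle(order)
        incoming = {order[k]: order[k - 1] for k in range(len(order))}
        new_models = {}
        for j, i in incoming.items():
            # Node j fine-tunes node i's model on D_j plus buffer B_i.
            new_models[j] = finetune(models[i], nodes[j], buffers[i])
        models = new_models
    return models
```

Because the per-round permutation assigns every model exactly one successor, each model is fine-tuned exactly once per round, which is what lets all node models drift toward similar decisions.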

IV. EXPERIMENTAL RESULTS
We test FedER on two applications simulating real case scenarios with multiple centers holding, and not sharing, their own data: 1) tuberculosis classification from X-ray images using two different datasets, and 2) skin lesion classification with three different datasets. In this section we present the employed benchmarks, the training procedure and report the obtained results to demonstrate the advantages of the proposed approach w.r.t. the state-of-the-art.

A. Datasets
X-ray image datasets for tuberculosis classification. We assume two separate nodes in the federation: one with the Montgomery County X-ray set and another one with the Shenzhen Hospital X-ray set [17]-[19]. The Montgomery set consists of 138 frontal chest X-ray images (80 negatives and 58 positives), captured with a Eureka stationary machine (CR) at 4020×4892 or 4892×4020 pixel resolution. The Shenzhen dataset was collected using a Philips DR Digital Diagnostic system. It includes 662 frontal chest X-ray images (326 negatives and 336 positives), with a variable resolution of approximately 3000×3000 pixels.

Skin lesion classification. We employ the ISIC 2019 challenge dataset, which contains 25,331 skin images belonging to nine different diagnostic categories. In this case, we assume a federation with three nodes, as the provided data come from three different sources: 1) the BCN20000 dataset [20], consisting of 19,424 images of skin lesions captured from 2010 to 2016 at the Hospital Clínic in Barcelona; 2) the HAM10000 dataset [21], which contains 10,015 skin images collected over a period of 20 years from two different sites, the Department of Dermatology at the Medical University of Vienna, Austria, and the skin cancer practice of Cliff Rosendahl in Queensland, Australia; 3) the MSK4 dataset [22], which is anonymous and includes 819 samples. Among all skin lesion classes, we only consider the melanoma class, posing the problem as a binary classification task.

In all tasks and datasets, we adopt 80% of the available data to train both the privacy-preserving GAN and the classification model, while the remaining 20% of each dataset is used as test set. Test sets are also balanced w.r.t. labels, to avoid performance biases due to class imbalance.
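A possible sketch of the per-dataset split described above: a stratified 80/20 partition whose test portion is down-sampled to the minority class. The helper is illustrative, not the paper's exact pipeline:

```python
import random

def split_balanced(samples, train_frac=0.8, seed=0):
    """Stratified train/test split with a label-balanced test set.
    `samples` is a list of (x, y) pairs."""
    rng = random.Random(seed)
    by_label = {}
    for x, y in samples:
        by_label.setdefault(y, []).append((x, y))
    train, test_by_label = [], {}
    for y, items in by_label.items():
        rng.shuffle(items)
        cut = int(len(items) * train_frac)
        train.extend(items[:cut])
        test_by_label[y] = items[cut:]
    # Balance the test set: keep as many samples per class as the
    # least-represented class provides.
    m = min(len(v) for v in test_by_label.values())
    test = [s for v in test_by_label.values() for s in v[:m]]
    return train, test
```

For a two-class dataset with 50 samples per class, this yields 80 training samples and a test set of 10 samples per class.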
For all tested federated methods (including state-of-the-art ones), model selection is carried out through 5-fold cross-validation on the training set, as a grid search on the number of training rounds, the number of epochs per round and the learning rate. For FedProx [24], we also include the µ hyperparameter.
B. Training procedure and metrics

1) Federated training: In all settings, we employ ResNet-18 [43] as classification model, trained by minimizing the cross-entropy loss with mini-batch gradient descent using the Adam optimizer. Mini-batch size is set to 32 and 8 for the Shenzhen and Montgomery datasets, respectively, and to 64 for skin lesion datasets. The learning rate was found, through cross-validation, to be 10^-4. Data augmentation is carried out with random horizontal flips; for skin images we additionally apply random 90-degree rotations. All images are resized to 256×256. The ratio between real and synthetic samples, controlled by λ in Eq. 5, is set to 0.5 for all experiments, i.e., each mini-batch is composed of equal quantities of real and synthetic images. This also ensures that our method performs the same number of optimization steps as other approaches that do not use any synthetic data.
The node federation is trained for R rounds. In our implementation, at each round nodes are randomly ordered to establish each node's predecessor and successor: given our focus on medical applications, we can assume that the number of nodes is low enough that synchronization is not an issue. However, asynchronicity can be achieved by assuming that nodes can store incoming data in a queue: if the distribution of successor nodes is uniform and computation times are similar for all nodes, this is on average equivalent to the synchronous case. The number of rounds R and epochs E for FedER on the tuberculosis and melanoma classification tasks are both set to 100, according to the 5-fold cross-validation results shown in Table I. Buffer size is set to 512 for all experiments.
2) GAN training: We recall that GAN training is carried out before federated learning using training data only, leaving out test samples, as mentioned in Sect. IV-A. Our privacy-preserving GAN employs StyleGAN2-ADA [44] as a backbone, because of its suitability in low-data regimes and its generation capabilities. Training is carried out in two steps: 1) the GAN is initially trained without any privacy-preserving loss, to support the learning of high-quality visual features; 2) afterwards, we enable the privacy-preserving loss and fine-tune the model in order to limit the embedding of patient-specific patterns in the GAN latent space. For classification purposes, GANs are trained in a label-conditioned fashion, with a mini-batch size of 32 and a learning rate of 0.0025 for both the generator and the discriminator. Early-stopping criteria are based on the Fréchet Inception Distance (FID) [41] between real and synthetic distributions: in the first training step, we stop training if FID does not improve for 10,000 iterations; in the second training step, we employ a criterion which stops training if FID increases by a factor of 2.5 w.r.t. the value obtained in the first step. As for the α parameter balancing the privacy-preserving loss, we tested multiple values (0, 0.5, 1, 1.5, 2 and 3) and found that 1 yields the best compromise between image generation quality and pairwise LPIPS distance [26] over all tested datasets. In order to quantitatively evaluate privacy preservation, we also compute the average LPIPS distance between each real image and its closest synthetic sample, obtained by means of latent space projection (described in Sect. IV-D): the higher the LPIPS value, the lower the possibility of reconstructing real images from the generator.
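The two FID-based early-stopping rules can be expressed as simple predicates over the FID evaluation history; the function names and the evaluation interval are our own assumptions, with the thresholds mirroring the values stated above:

```python
def stop_stage1(fid_history, patience=10_000, eval_every=1_000):
    """Stage 1: stop when FID has not improved for `patience` training
    iterations. `fid_history` holds one FID value per evaluation step,
    assumed to be taken every `eval_every` iterations."""
    if not fid_history:
        return False
    best_idx = min(range(len(fid_history)), key=fid_history.__getitem__)
    return (len(fid_history) - 1 - best_idx) * eval_every >= patience

def stop_stage2(current_fid, stage1_fid, factor=2.5):
    # Stage 2 (privacy fine-tuning): stop if FID degrades by `factor`
    # relative to the value reached at the end of stage 1.
    return current_fid >= factor * stage1_fid
```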

C. Federated learning performance
We first evaluate the performance (in terms of classification accuracy) of FedER in the non-i.i.d. setting, and compare it to several centralized baselines, namely:
• Centralized training: all datasets are merged in a single node, where all training happens. In this setting, no federated learning constraints are applied.
• Centralized training with synthetic data only: each node trains a privacy-preserving GAN model and shares a synthetic version of its own data with the central node, where global training is performed. In this case, we aim to assess how much information is retained by synthetic data to support classification.
• Centralized training with synthetic and real data: a combination of the previous two settings, where real and synthetic samples are centrally merged and used for training a global classifier. This scenario measures the contribution of synthetic data as a data augmentation approach.
We also compare FedER against standard training of the local node models, referred to as "Standalone": classification accuracy is computed using local node models on their own data. The results, reported in Table II, show that standalone training appears to be the most favourable scenario. Centralized strategies generally perform worse than standalone training, because of the non-i.i.d. nature of the data. However, when the centralized approach is trained with original data augmented with synthetic samples, its classification accuracy is on par with standalone training, possibly because the learned generative latent spaces tend to smooth the different modes of non-i.i.d. data. FedER, instead, outperforms its centralized counterpart and yields slightly worse performance (1.5 percent points less) than standalone training. Although this may appear, at first glance, as a shortcoming of FedER, we recall that in a federated learning scenario we aim at building a model that, leveraging the multiple data distributions present in the federation, may generalize better, thus addressing possible future data drifts. In order to assess the capability of the trained models to achieve such generalization, we measure decision convergence by evaluating how each local node model performs on the other nodes' datasets. Results are in Table III and show a good average accuracy, with a low standard deviation, for FedER, indicating that each node model performs equally well on its own dataset and on the others (i.e., all node models converge to similar decisions). Conversely, standalone training yields significantly lower accuracy and higher standard deviation than ours, proving to be an unsuitable strategy for the sought generalization properties.
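The decision-convergence measurement can be sketched as a cross-evaluation matrix (every node model on every node test set), summarized by its mean and standard deviation; `evaluate` is a placeholder for the actual accuracy computation:

```python
import statistics

def cross_evaluation(models, test_sets, evaluate):
    """Accuracy of every node model on every node test set. Returns the
    flat matrix plus mean/std; a low std indicates that node models
    converged to similar decisions."""
    accs = [evaluate(m, d) for m in models for d in test_sets]
    return accs, statistics.mean(accs), statistics.pstdev(accs)
```

Fully-converged models yield zero standard deviation, while standalone-style models, accurate only on their own dataset, produce a visibly higher spread.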
We then compare our approach to state-of-the-art federated learning approaches, namely: a) centralized federated methods, FedAvg [23] and FedProx [24], which have been shown to perform generally better than decentralized methods [37], [39], and b) a personalized method, FedBN [25]. As already mentioned, to avoid biased assessment, we use the official code repository of FedBN [25] (https://github.com/med-air/FedBN), and hyperparameter selection on the tested datasets was carried out through grid search on training rounds/epochs, learning rate and µ for FedProx [24], using 5-fold cross-validation as for our approach. Results for the tuberculosis and the melanoma tasks are reported in Table IV and show that FedER outperforms all methods under comparison. Interestingly, FedER's learning strategy does better than: a) centralized methods, FedAvg [23] and FedProx [24], suggesting that experience replay is a more effective feature aggregation approach than naive parameter averaging; b) personalized methods, such as FedBN [25], which affects a limited aspect of feature representation (i.e., input layer distributions), while our approach adapts the entire model to local and remote tasks. The above results suggest that experience replay plays a key role in federated models as a principled way to integrate features coming from different data distributions. To further assess its contribution, we evaluate FedER performance when using buffers of different sizes. Results on the tuberculosis task, measured as mean and standard deviation of the local node models over a given dataset, are shown in Table V and indicate a clear contribution of the buffer in terms of overall performance and models' agreement. Indeed, with no buffer we obtain the lowest average performance and the highest standard deviation. As the buffer is enabled, we observe a performance gain (mainly for the Shenzhen dataset) and a significant drop in standard deviation.
Performance improves as the buffer size increases, although the gain becomes negligible above 512. Since larger buffers result in more data to be shared among nodes, we use a buffer size of 512, as the best trade-off between accuracy and communication costs.
We finally evaluate the capability of FedER to scale with the size of the federated network. We quantify this property in an i.i.d. setting on both the tuberculosis (Shenzhen dataset) and skin lesion classification (BCN dataset) tasks, by equally splitting the available data across multiple nodes.
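The i.i.d. setting above can be obtained by a label-stratified equal split, so that each node's class distribution mirrors the full dataset. A minimal sketch, with a toy balanced dataset in place of Shenzhen/BCN:

```python
import random

def iid_split(labels, n_nodes, seed=0):
    """Equally split sample indices across nodes, stratified by label so
    each node's distribution mirrors the full dataset (i.i.d. setting).
    `labels` is the per-sample class list; returns one index list per node."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    nodes = [[] for _ in range(n_nodes)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for i, idx in enumerate(idxs):
            nodes[i % n_nodes].append(idx)
    return nodes

# Toy binary dataset: 100 samples, balanced classes, split over 4 nodes.
labels = [0] * 50 + [1] * 50
parts = iid_split(labels, n_nodes=4)
print([len(p) for p in parts])  # roughly equal node sizes
```

Round-robin assignment keeps node sizes within one sample per class of each other, which is sufficient for the i.i.d. scalability experiment.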

D. Privacy-preserving performance
In this section we quantify how much information about real samples is retained by our privacy-preserving method, and in particular in the mapping between the latent space and synthetic images. To do so, we employ the projection method proposed in [16]: given a real image x, we find an intermediate latent point w such that the generated image G(w) is most similar to x, by optimizing w to minimize the LPIPS distance [26] between x and G(w).

Fig. 3. Quantitative analysis of privacy-preserving generation. In blue, the histogram of LPIPS distances between real images and the corresponding images obtained through latent space projection using a GAN trained without the proposed privacy-preserving loss. In red, the histogram of LPIPS distances between real images and the closest images generated with the proposed approach.
In practice, for each image of the dataset used for GAN training, we perform backprojection to find its most similar synthetic sample, and measure the LPIPS distance between the original and the projected image. Fig. 3 shows the histograms of the resulting distances on the Shenzhen dataset, using GAN models trained with and without the proposed privacy-preserving loss (both models start from the same w, for fairness). The histograms show that standard GAN training, with no privacy-preserving loss, tends to yield distances close to 0, demonstrating that real images are indeed included in the generator latent space; our model, instead, significantly mitigates this issue by synthesizing samples that are substantially different from the original ones. To qualitatively substantiate these findings, Fig. 4 compares original samples from the Shenzhen dataset with the corresponding projections, generated with and without our privacy-preserving loss. It is easy to notice that samples generated with a traditional GAN highly resemble real data, making it impossible to share such samples, albeit synthetic, in a privacy-safe manner, as they clearly contain patient information. Conversely, comparing real images with the projections obtained from the privacy-preserving GAN confirms the inability of the generator to find latent representations that recover the real images used during training.
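The backprojection procedure is, at its core, gradient descent on the latent point w to minimize a distance between G(w) and the target image. The following self-contained sketch uses a toy linear "generator" and plain squared error as stand-ins for the real GAN and the LPIPS distance, so only the optimization structure is faithful to the method.

```python
# Minimal sketch of latent-space projection in the spirit of [16]:
# optimize w so that G(w) matches a target image x. G and the distance
# are toy stand-ins for the real generator and LPIPS, for illustration.

def G(w):
    # Toy generator: maps a 2-D latent point to a 3-"pixel" image.
    return [2 * w[0], w[0] + w[1], 3 * w[1]]

def dist(a, b):
    # Squared-error stand-in for the LPIPS perceptual distance.
    return sum((u - v) ** 2 for u, v in zip(a, b))

def project(x, w, lr=0.05, steps=2000, eps=1e-4):
    """Gradient descent on w via central finite differences of dist(G(w), x)."""
    for _ in range(steps):
        grad = []
        for i in range(len(w)):
            wp, wm = list(w), list(w)
            wp[i] += eps
            wm[i] -= eps
            grad.append((dist(G(wp), x) - dist(G(wm), x)) / (2 * eps))
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

x = [2.0, 2.0, 3.0]              # target "real image"
w_star = project(x, [0.0, 0.0])  # backprojected latent point
print(round(dist(G(w_star), x), 4))  # residual distance after projection
```

In the paper's evaluation, a near-zero residual for a given real image means the generator can reproduce it, i.e., a privacy leak; the privacy-preserving loss pushes these residuals away from zero.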
Given the high realism of generated samples, we run additional tests by proposing two FedER variants aiming to increase the level of privacy preservation: a) FedER-A: models are not shared among nodes -- only synthetic buffers are sent and received; b) FedER-B: models are trained only on synthetic data, even on local nodes. Fig. 5 shows the internal architecture of each node in the two variants. Results obtained with these alternative privacy-enhanced configurations are provided in Table VI. It can be noted that the FedER-A (i.e., "buffer-only sharing") configuration achieves performance comparable to our standard FedER (82.76 vs 83.41) but, remarkably, it outperforms all existing federated learning methods on the same datasets (compare Table IV with the node performance block in Table VI). The FedER-B (i.e., "synthetic-only training") configuration, instead, performs slightly worse than the other two configurations, but is still on par with existing federated methods.

E. Communication and computational performance
We conclude the experimental analysis by measuring communication and computational costs.
As for communication costs, compared to state-of-the-art approaches, FedER requires the additional transmission of synthetic images between nodes at each round. Table VII reports per-node communication costs for state-of-the-art models (the table reports FedAvg, but the same values apply to FedProx and FedBN) and for FedER, both in its full formulation and in the FedER-A variant, where only buffers of synthetic data are shared. The main cost for state-of-the-art models lies in the transfer of the model, and depends on the specific architecture. As for computational costs, before federated training starts, FedER requires that each node trains a local privacy-preserving GAN off-line; this, however, does not affect online federated learning costs, as it is carried out only once at the very beginning of the whole procedure. Furthermore, we argue that, in the medical domain, the number of institutions in a federation is relatively low, and it is reasonable to assume that nodes can benefit from powerful communication networks and computing infrastructure: thus, the overhead introduced by FedER is tolerable, in light of the methodological advantages and of the performance and generalization capabilities shown by the resulting models.
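A back-of-the-envelope comparison of per-round, per-node communication costs can illustrate the trade-off. All sizes below are assumptions (a ResNet-18-scale model, 256x256 grayscale buffer images), not the figures from Table VII.

```python
# Illustrative per-round, per-node communication cost: state-of-the-art
# methods send model parameters only, FedER sends model + synthetic buffer,
# FedER-A sends the buffer only. Sizes are assumed, not the paper's values.

def megabytes(n_bytes):
    return n_bytes / (1024 ** 2)

model_bytes = 11_700_000 * 4          # ~11.7M float32 parameters (assumed)
buffer_bytes = 512 * 256 * 256 * 1    # 512 grayscale 256x256 images, 1 B/px

costs = {
    "FedAvg/FedProx/FedBN": model_bytes,
    "FedER": model_bytes + buffer_bytes,
    "FedER-A": buffer_bytes,
}
for name, b in costs.items():
    print(f"{name}: {megabytes(b):.1f} MB per round")
```

Under these assumptions, the buffer-only FedER-A variant is actually cheaper per round than parameter-averaging methods, since a 512-image buffer can be smaller than a full set of model weights.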

V. CONCLUSION
In this paper, we propose FedER, a decentralized federated learning framework that replaces traditional parameter averaging with a more principled feature integration approach based on the combination of experience replay and privacy-preserving generative models. In FedER, nodes communicate with each other by sharing local models and buffers of synthetic samples; local model updates are carried out in a way that encourages the reuse and adaptation of features learned by other nodes, thus avoiding the potentially disruptive effects of blind feature averaging. Experimental results show that our method significantly outperforms state-of-the-art centralized approaches in a non-i.i.d. scenario, which is the typical setting in the medical domain. Additionally, quantitative and qualitative analyses show that our privacy-preserving generation approach is able to synthesize samples that are significantly different from real data, while correctly supporting the learning of discriminative features. In the future, we aim at investigating some unexplored properties of our method: for instance, unlike existing methods based on parameter averaging, our approach does not strictly require that all nodes share the same model architecture. Model heterogeneity could therefore be exploited to create a shared ensemble and combine different feature learning capabilities.