Computer Methods and Programs in Biomedicine

Background and Objective: Generalizable and trustworthy deep learning models for PET/CT image segmentation require large heterogeneous multi-institutional datasets. However


Utilizing PET/CT Images for Head and Neck Cancer Management
Positron emission tomography (PET) and computed tomography (CT) play key roles in cancer diagnosis, staging and restaging, monitoring of treatment response, and radiation treatment planning [1]. The complementary metabolic and anatomical information captured by multimodality PET and CT images, respectively, is commonly used for malignant disease detection and for gross tumor volume (GTV) and biological tumor volume (BTV) delineation in radiation therapy (RT) planning [1]. RT plays a major role in the treatment of head and neck (HN) cancer patients and requires GTV delineation. In addition, tumor delineation is an essential step toward semi-quantitative and quantitative analysis of PET images for staging and response assessment of cancer patients. However, the delineation of GTV on PET/CT images is labor intensive, prone to inter/intra-observer variability, and remains a time-consuming process, often requiring switching between PET and CT images [2].
The low resolution and noisy nature of PET images and partial volume effects, on one hand, and the diverse anatomical variability in the HN region and the presence of highly active lymph nodes and the airway lumen, on the other hand, challenge the deployment of semi-automated and fully automated PET segmentation algorithms in the clinic [2]. More recently, deep learning (DL) algorithms have been developed for medical image segmentation and, specifically, PET image segmentation [3]. While the PET signal is essential for developing DL auto-segmentation models for HN patients, anatomical modalities such as CT and MRI are beneficial for their high resolution and help the models identify subtle details and accurately delineate tumor boundaries.

HN Tumor Segmentation from PET/CT images
Andrearczyk et al. [4] proposed fully convolutional 2D and 3D V-Net models for automatically delineating HN tumors and nodal metastases on single- and multi-modality 18F-FDG PET and CT images. Manually segmented ROIs of 202 patients were used as ground truth. They used two approaches for multi-modality modeling: they either fed PET and CT images as multiple input channels or, in a late-fusion approach, averaged the voxel-wise probability outputs of individual PET and CT models. They achieved Dice scores of 0.48, 0.58, and 0.60 for the CT, PET, and late-fusion PET/CT models, respectively. Their model also performed better in 2D than a similar 3D design. Zhao et al. [5] presented a fully convolutional network with auxiliary paths for automatic segmentation of nasopharyngeal carcinoma (NPC) from PET/CT images. They applied their proposed model to 30 patients enrolled from two centers and, with three-fold cross-validation, achieved a mean Dice score of 0.87. Guo et al. [6] proposed a DL GTV segmentation framework based on 3D convolutions with dense connections, applied to multi-modality PET/CT images. They split a dataset of 250 HN patients into 140 patients for training, 35 for validation, and 75 for testing, and compared their proposed model with a 3D U-Net as the reference. Their PET/CT Dense-Net showed superior outcomes compared to the 3D U-Net (Dice 0.73 vs. 0.71) while having fewer parameters to train. The HECKTOR (HEad and neCK TumOR) segmentation challenges were held in 2020-2022 and continue in 2023 to address the segmentation challenge in HN patients using PET/CT images [7,8]. In the second edition (2021) of the HECKTOR challenge, 22 eligible teams participated [7,8]. A total of 325 PET/CT images of HN cancer patients from six centers were split into 224 patients for training and 101 for testing. The models developed by the participants achieved Dice scores ranging from 0.63 to 0.78 and median 95th-percentile Hausdorff distances (HD95) from 6.37 down to 3.09. The winner of the challenge [9] achieved an average DSC of 0.78 and a median HD95 of 3.09.
In CNN-based algorithms, such as encoder-decoder networks, U-Net, and GANs, explicit long-range and global relation modeling is a major challenge because of the locality of convolution operations [10]. This limitation results in weak performance in HN tumor segmentation from PET/CT images, where inter/intra-patient variability is large. Transformers, which have been successfully used in natural language processing (NLP) and machine translation tasks [11], have recently been shown to outperform CNNs in some image processing tasks [10,12]. A number of studies have implemented the transformer architecture in a variety of learning tasks. For instance, the vision transformer, the data-efficient image transformer, and the hierarchical Swin transformer have been successfully used in image classification, image-to-image translation, and image segmentation, respectively [10,12]. More recently, Swin-U-Net [12], a U-Net-like pure transformer, has been proposed for medical image segmentation and was shown to outperform CNN-based counterparts and combinations of CNNs and transformers (Trans-U-Net) [10]. The main challenges with transformers are their need for large training datasets and the large number of trainable parameters in their architectures.

FL in Medical Imaging
DL models developed on single-center datasets face the challenge of model generalizability and perform poorly on unseen data with acquisition, reconstruction, and scanner settings different from those of the training center [13,14]. In centralized model training, data owners are required to pool their data on third-party servers. However, this approach raises ethical and legal concerns, as medical data contain highly sensitive private personal information. Federated learning (FL) has been proposed for distributed training without sharing data between institutions [13,14]. FL algorithms have been applied to medical image analysis for different tasks, including classification, prognostication, and segmentation [13,14]. Dayan et al. [15] built a predictive model called EXAM (electronic medical record (EMR) chest X-ray AI model) for COVID-19 patients using chest X-ray images and FL across 20 centers. They reported 16% and 38% improvements in average area under the curve (AUC) and in generalizability, respectively, compared to single-center models.
Sheller et al. [16] studied the feasibility of FL for brain tumor segmentation using MR images. They reported identical results for FL and centralized training in multi-modal brain tumor segmentation (Dice score of 0.85 vs. 0.86). They also implemented two other collaborative learning approaches, institutional incremental learning (IIL) and cyclic institutional incremental learning (CIIL), which failed to reach FL performance, and concluded that FL outperformed these existing collaborative learning approaches. Bercea et al. [17] proposed unsupervised brain pathology (multiple sclerosis and glioblastoma) segmentation using disentangled FL. Their method disentangles the model parameter space into a shared shape space, under the assumption that the brain's anatomical structure is similar across centers. Using open-source and in-house datasets for model training, they reported a Dice score of 0.38, outperforming an autoencoder (by 42%) and a state-of-the-art (SOTA) FL method (by 11%). Sarma et al. [18] implemented multi-center whole-prostate T2-weighted MR image segmentation using a 3D anisotropic hybrid network. They reported that FL-based models achieve superior and more generalizable performance than single-center models: Dice scores of 0.81, 0.83, and 0.87 were reported for three single-center models, whereas the FL model achieved a Dice score of 0.88. Li et al. [19] developed a privacy-preserving FL framework for brain tumor segmentation, which identified a trade-off between performance and the cost of privacy. Yang et al.
[20] presented a semi-supervised learning-based segmentation of COVID-19 pneumonia using multinational chest CT data from three countries. They reported the effectiveness of the proposed method compared to supervised methods with data sharing. In a recent study [21], an image-to-image translation task for PET image attenuation correction and scatter compensation was performed using deep FL. A dataset from six different centers (50 patients per center) was enrolled, and sequential and parallel FL algorithms were compared with center-based (CeBa) and centralized (CeZe) learning. The FL algorithms achieved higher performance than CeBa and comparable performance to CeZe.
Most recently, Shiri et al. [22] evaluated PET-only image segmentation using FL. Their study enrolled 405 HN cancer patient images from nine different centers. The models were built on cropped PET images using an R2U-Net network. They reported identical performance, with Dice scores of 0.84 ± 0.06 vs. 0.84 ± 0.05 for the FL and centralized approaches, respectively, with no statistically significant differences. In terms of PET parameters, an almost zero relative error (RE%) was reported for both algorithms in SUVmax and SUVpeak, while for SUVmean, RE% values of 6.43 ± 4.72 vs. 6.61 ± 5.42 were reported for the centralized and FL approaches, respectively. Isik-Polat et al. [23] evaluated different aggregation techniques and hyperparameter values for FL in brain tumor segmentation. They reported higher performance for FedAvgM (federated averaging with server momentum) compared to FedAvg and FedNova (normalized averaging). In addition, adaptive epochs resulted in faster convergence and higher performance. They concluded that certain combinations of hyperparameters may result in lower performance, as one parameter may decrease the effectiveness of others. Recently, the Federated Tumor Segmentation (FeTS) challenge, which uses MR images from the BraTS challenge [24], was introduced. FeTS aims to identify optimal weight aggregation and build generalizable models. In a more recent study [25], FL was implemented to build a model for detection of the rare disease glioblastoma using images from 71 sites; the authors reported 33% and 23% improvements in the delineation of the surgically targetable tumor and the complete tumor extent, respectively [25].
Various FL approaches have been developed to address different issues, including data partitioning, communication bottlenecks, data heterogeneity, and privacy; however, there is no one-size-fits-all FL solution that addresses all FL challenges [13,14]. In the current study, we employ different FL approaches for PET/CT image segmentation, each designed to address different issues, and compare them with the centralized benchmark. Considering the pros and cons of each method and clients' preferences, one of these approaches can be implemented to train a generalizable model using multicentric data.
The contributions of this research are as follows:
• We integrate purely attention-based transformers with FL algorithms for PET/CT image segmentation in HN cancer patients.
• We applied the FL framework to PET/CT image segmentation, enabling more generalizable model development in multi-center settings.
• Different FL frameworks are implemented, each addressing different challenges of an FL model, including different learning paradigms, aggregation, robustness, privacy, and communication efficiency.
• A comprehensive comparison is performed between center-based, centralized, and FL frameworks.
• A comprehensive quantitative analysis of PET images is performed toward clinical evaluation of the segmentation algorithms.

PET/CT Data Acquisition and Description
In the current study, we enrolled PET/CT images of 328 histologically proven HN cancer patients from six different centers. The number of included patients (after reviewing all patients' PET and CT images for noise and artifacts in all centers) was 23, 32, 34, 59, 81, and 99 for centers 1 to 6, respectively. The different centers acquired and reconstructed the 18F-FDG PET/CT images using different scanners and protocols. Detailed information about each center's data (demographics, PET and CT image acquisition, and reconstruction) is provided in Table 1, and more information can be found in [26,22,3,27,28,29,30,31,32]. Ethics approval and consent to participate were unnecessary since the study was performed on open-access online datasets. We split the data from each center into a train/validation set (70%/10% of patients; 234/26 patients in total) and a test set (20% of patients; 68 patients in total), with stratification by center.

Manual Image Segmentation and Pre-processing
Manual segmentation of primary tumors, performed separately for each center on PET/CT images, was used as the standard of reference for evaluation. An experienced nuclear medicine physician evaluated all PET/CT segmentations and edited/modified them to correct plausible errors (i.e., missing slices, inclusion of lymph nodes, or inclusion of the airway lumen). PET and CT images were converted to standardized uptake value (SUV) maps and Hounsfield unit (HU) values, respectively. Metal artifacts in CT images were corrected using the iterative metal artifact reduction (iMAR) algorithm [33]. To render the computations tractable while preserving image resolution, all images were cropped to the HN region with the aid of an automatic CT lung segmentation and body contour extractor [34]. Cropped images were subsequently resized to 200 × 200 with an isotropic voxel size of 1 × 1 × 1 mm³. CT images were clipped to the range [−1024, 1200] HU to include all HN tissues and, along with the intact SUV maps, were normalized to the range [0, 1] for model development. All pre- and post-processing steps were fully automated to enable fully automated PET/CT image segmentation in a clinical setting.
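The intensity pre-processing described above can be sketched as follows. The function names and the exact SUV normalization (here, division by the volume maximum) are illustrative assumptions; cropping and resizing would precede these calls in the full pipeline.

```python
import numpy as np

def preprocess_ct(ct_hu: np.ndarray,
                  hu_min: float = -1024.0,
                  hu_max: float = 1200.0) -> np.ndarray:
    """Clip a CT volume (in Hounsfield units) and rescale it to [0, 1]."""
    ct = np.clip(ct_hu, hu_min, hu_max)
    return (ct - hu_min) / (hu_max - hu_min)

def preprocess_pet(pet_suv: np.ndarray) -> np.ndarray:
    """Normalize an SUV map to [0, 1] by its maximum value (assumed convention)."""
    return pet_suv / max(float(pet_suv.max()), 1e-8)
```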

FL Framework
In general, neural network training methods can be categorized into (i) the center-based training framework, (ii) the centralized training framework, (iii) the distributed training framework, (iv) the decentralized training framework, and (v) the FL framework. The main difference between these learning frameworks lies in how the training data are distributed among the various nodes of the network. Below, we briefly review these training frameworks.
Center-Based (CeBa) Training Framework. In the center-based training framework, each party (node) trains its own ML model using its local training dataset, independently of the other centers, and retains full control over the functionality of the model. This training framework suffers from an inability to adapt properly to unseen data.

Centralized (CeZe) Training Framework. In a centralized training framework, the participating parties (nodes) send their local data to a centralized server to build and train a global ML model. That is, all of the training data are stored on a single node (the centralized server), and the other nodes in the network must access these data to train their models. This training framework is the traditional data science pipeline; however, it cannot ensure the privacy and security of the participating data owners.
Distributed Training Framework. In the distributed training framework, participating parties independently train ML models using their local datasets and share their local model updates with a central parameter server, as in FL frameworks (Fig. 1).

Federated Deep Learning Framework
Let θ ∈ R^d denote the parameters of a DL model, and let F(θ) be the overall loss function. Typically, F(θ) is a non-negative real-valued function computed empirically from the available data samples with respect to the model parameters θ. Suppose K data centers (owners) are willing to participate in training a global DL model, and let the k-th center, k ∈ {1, 2, ..., K}, hold a collection of N_k data samples. The local data samples at the k-th center are denoted by D_k = {(x_i, y_i)}_{i=1}^{N_k}, where x_i and y_i are the feature vector and the ground-truth label vector, respectively. Let F_k(θ) denote the local aggregated loss corresponding to θ and all the data samples at the k-th data center (owner). Typically, we take F_k(θ) as

F_k(θ) = (1/N_k) Σ_{(x_i, y_i) ∈ D_k} L(θ; (x_i, y_i)),   (1)

where L(θ; (x_i, y_i)) is the loss of the model parameters θ for sample (x_i, y_i). The distributed learning objective can then be formulated as the minimization problem

min_θ F(θ) = Σ_{k=1}^{K} (N_k / N) F_k(θ),   (2)

where N = Σ_{k=1}^{K} N_k denotes the total number of data samples across the K centers. Once the parameter server (possibly a trusted data center) collects the local gradients from the data centers, it updates the global model parameters using the iterative stochastic gradient descent (SGD) update

θ_{t+1} = θ_t − η Σ_{k=1}^{K} (N_k / N) ∇f_k(θ_t),   (3)

where η is the learning rate and ∇f_k(θ_t) is the average gradient at center k, computed using the local data samples D_k and the current model parameters θ_t. This iterative distributed SGD approach is also known as weighted averaging in the literature.
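The weighted-averaging SGD update of Eq. (3) can be sketched as a single server-side step. The helper name and the representation of local gradients as NumPy arrays are illustrative assumptions.

```python
import numpy as np

def distributed_sgd_step(theta, local_grads, n_samples, lr=0.01):
    """One global update: weight each center's gradient by its data share
    (N_k / N), average, and take an SGD step on the global parameters."""
    n_total = sum(n_samples)
    avg_grad = sum((n_k / n_total) * g for n_k, g in zip(n_samples, local_grads))
    return theta - lr * avg_grad
```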
In FL, the aggregation method refers to the way the models trained on individual nodes are combined to produce a global model. There are different ways to aggregate the models, and the choice of method can affect the accuracy and convergence of the global model. In [36], federated averaging (FedAvg) was proposed to improve communication efficiency over the naïve distributed SGD method. In FedAvg, the server first initializes the global model parameters and shares them with a subset of participating data owners, chosen randomly and independently. Next, each data owner performs several epochs of SGD using its local data samples and sends the updated model back to the server. Finally, the server updates the global model parameters as the weighted average of the received local model parameters. This process is repeated for a number of iterations, and the global model is updated at each iteration. Similar to Eq. (3), the weighting coefficients are proportional to the number of data samples of each data owner. The difference between weighted averaging and federated averaging is that, in the latter, multiple SGD iterations are performed locally before the model updates are sent to the server.
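A minimal FedAvg round might look as follows, with each client's local training abstracted into a gradient function. The names and the simple full-batch local SGD are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def fedavg_round(global_theta, clients, local_steps=5, lr=0.1):
    """One FedAvg round: each client runs several local SGD steps from the
    current global model; the server averages parameters weighted by data size.

    `clients` is a list of (n_samples, grad_fn) pairs, where grad_fn(theta)
    returns the local gradient (a stand-in for epochs over local data).
    """
    n_total = sum(n for n, _ in clients)
    new_theta = np.zeros_like(global_theta)
    for n_k, grad_fn in clients:
        theta_k = global_theta.copy()
        for _ in range(local_steps):            # multiple local SGD iterations
            theta_k -= lr * grad_fn(theta_k)
        new_theta += (n_k / n_total) * theta_k  # weighted parameter average
    return new_theta
```

With quadratic local losses, repeated rounds converge to the data-weighted consensus of the clients' local minimizers.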
Practical FL systems face several challenges, most prominently (i) robustness, (ii) privacy preservation, and (iii) communication efficiency. Below, we briefly review these fundamental challenges as they relate to our research. In previous sections, we introduced the CeBa, CeZe, and FedAvg approaches. In what follows, we introduce the techniques we explore in this work, namely, robust aggregation (RoAg), secure aggregation (SeAg), clipping with the quantile estimator (ClQu), zeroing with the adaptive quantile estimator (ZeQu), Gaussian differentially private federated averaging with adaptive quantile clipping (GDP-AQuCl), and lossy compression (LoCo).

Robustness in FL
The aggregation of the updates from the participating centers during the training phase significantly impacts the learned model's performance. It is desirable to reduce the model's sensitivity to updates corrupted by hardware failures or manipulated by potential adversaries. Robustness in FL refers to the ability of the learning system to perform well despite various challenges, such as malicious attacks, non-i.i.d. data, and communication constraints. Robustness also helps protect the system against malicious attacks: since the training data is distributed among multiple nodes, each training its own model locally, malicious nodes may try to manipulate the training data or the model parameters so that the global model performs poorly. Several techniques can be used to improve the robustness of FL systems, including:
• Robust Aggregation: Robust aggregation is a variant of federated averaging that is designed to be more resistant to malicious attacks.
• Federated Transfer Learning: Federated transfer learning is a technique that involves pre-training a model on a centralized dataset and then fine-tuning the model on decentralized data from multiple nodes. This can help improve the performance of the model in non-i.i.d. settings.
• Outlier Detection and Removal: Outlier detection and removal is a technique that involves identifying and removing data points that are significantly different from the majority of the data in order to improve the performance and robustness of the model.
• Data Perturbation: Data perturbation is a technique that involves adding noise to the training data at each node in order to protect the privacy of the data and improve the robustness of the model.
The standard aggregation scheme in FL, i.e., arithmetic mean aggregation, is not robust to data corruption. One possible solution is to use an approximate geometric median instead of the weighted arithmetic mean to increase robustness to update corruption [37]. Alternative popular solutions include the zeroing and clipping techniques. Zeroing (Ze) refers to replacing the components larger than a predefined threshold with zeros. The main objective of the zeroing approach is to increase the robustness of the whole learning model against data corruption by faulty clients. The most popular zeroing approach in the literature is adaptive zeroing with the quantile estimator. In the clipping (Cl) approach [38], we bound the L2 norm of client updates by projecting larger updates onto the L2 ball of radius C centered at the origin. The clipping function Clip : R^d × R → R^d is defined as

Clip(θ, C) = θ · min(1, C / ‖θ‖_2).   (4)

The hyper-parameter C plays a significant role in the utility of the DL algorithm. If C is set too high, more noise must be added. If C is set too small, the gradient estimate becomes highly biased, since information about the magnitude of the original gradient is lost, which may cause inaccurate training and worse generalization performance.
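The clipping and component-wise zeroing operators described above can be sketched as follows; the function names are illustrative.

```python
import numpy as np

def clip_update(update: np.ndarray, c: float) -> np.ndarray:
    """Project an update onto the L2 ball of radius c; updates with norm
    at most c are left unchanged (Clip(theta, C) = theta * min(1, C/||theta||))."""
    norm = np.linalg.norm(update)
    return update * min(1.0, c / norm) if norm > 0 else update

def zero_update(update: np.ndarray, threshold: float) -> np.ndarray:
    """Replace components whose magnitude exceeds the threshold with zeros."""
    return np.where(np.abs(update) > threshold, 0.0, update)
```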

Privacy Preservation in FL
Privacy preservation in FL refers to the ability of the learning model to protect the privacy of the training data while still allowing effective model training. Since the data centers (owners) avoid transmitting their local data to an external party, FL was initially promoted as a private distributed learning algorithm. However, it has been shown that participants' training data may leak via the communicated model updates or the final shared model [39,40]. To avoid information leakage, the typical solutions are secure aggregation methods [41], such as homomorphic encryption, and differential privacy (DP) mechanisms. Furthermore, aggregation schemes based on averaging are vulnerable to adversarial attacks; e.g., a malicious participant may impose undesired behavior on the global model. Robust aggregation approaches try to address such model integrity attacks [42,37]. The two popular aggregation approaches are federated averaging [36] and secure aggregation [43]. In this research, we compare our results considering both of these approaches.
• Secure Aggregation: Secure aggregation (SeAg) is a method for aggregating models in FL designed to protect the privacy of the training data. In secure aggregation, each node trains its own model using local data and then sends encrypted model parameters to a central server. The server uses a secure aggregation protocol to combine the encrypted model parameters from the nodes and produce a global model, which is then sent back to the nodes for further training. Although SeAg is primarily aimed at protecting the privacy of the training data, it can also improve the robustness of the FL model. The goal of secure aggregation is to prevent the server from observing the individual local updates while still being able to compute their aggregate. It also protects the final model from possible integrity attacks. It is mainly inspired by secure multi-party computation (SMC) protocols [44,45]. In the secure aggregation approach, each participant masks its local model update using pairwise random keys and sends it to the parameter server. Two scenarios can be considered for the parameter server: (i) the honest-but-curious (passive) model and (ii) the active adversary model. In our experiments, we consider the former. In [44], the authors addressed two masking schemes: (i) masking with one-time pads and (ii) double masking. The masking approach with a one-time pad has two shortcomings: (i) it requires quadratic communication overhead, and (ii) it has no tolerance for a participant (data owner) failing to complete the protocol [43]. In our experiments, we use the double-masking approach, described as follows. Let each pair of data owners k, k′ ∈ {1, ..., K}, k ≠ k′, agree on a random seed (vector) s_{k,k′}^t at global iteration t. The pairwise random key s_{k,k′}^t can be generated using a key exchange protocol [46]. Simultaneously, each data center k ∈ {1, ..., K} generates a random seed s_k. Next, data owner k computes a masked version of its local model parameters as

y_k^t = θ_k^t + PRG(s_k) + Σ_{k′: k < k′} PRG(s_{k,k′}^t) − Σ_{k′: k′ < k} PRG(s_{k′,k}^t) (mod R),   (5)

where PRG is a secure pseudo-random generator whose output space is [0, R)^d. Finally, the data owner sends its masked model parameters to the server. Data owner k uses Shamir's ⌈N/2⌉-out-of-N secret sharing protocol [47] to share {s_{k,k′}^t} and s_k with the other data owners. Note that the operations in (5) are carried out in a finite field of integers modulo a prime R, where [0, R) denotes the range of both the model parameters and their summation.
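A toy sketch of the pairwise-masking idea follows: each pair of clients shares a seed, one adds and the other subtracts the derived mask, so the masks cancel when the server sums the masked updates. The self-mask recovery via secret sharing and dropout handling of the full double-masking protocol are omitted, and all names are illustrative.

```python
import numpy as np

def prg(seed: int, d: int, r: int) -> np.ndarray:
    """Deterministic pseudo-random integer mask in [0, r) from a shared seed."""
    return np.random.default_rng(seed).integers(0, r, size=d)

def mask_update(k: int, theta_k: np.ndarray, pair_seeds: dict, r: int) -> np.ndarray:
    """Mask client k's integer-encoded update with pairwise masks (mod r).

    pair_seeds[(i, j)] is the seed agreed between clients i < j; client i adds
    the mask, client j subtracts it, so the masks vanish in the aggregate.
    """
    d = theta_k.size
    masked = theta_k.copy()
    for (i, j), seed in pair_seeds.items():
        if i == k:
            masked = (masked + prg(seed, d, r)) % r
        elif j == k:
            masked = (masked - prg(seed, d, r)) % r
    return masked
```

Summing the masked vectors modulo r recovers the true sum without revealing any individual update.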
• Homomorphic Encryption: In cryptographic methods, which are mostly based on homomorphic encryption, the data owner sends an encrypted version of the data to the parameter server, and the signal processing is performed in the encrypted domain. Homomorphic encryption was initially introduced under the notion of privacy homomorphism in 1978 [48]. Since this seminal work, several homomorphic encryption techniques have been proposed that can process encrypted data with only one kind of operator, e.g., multiplication or addition, for a limited number of times [49]. The first fully homomorphic encryption (FHE) scheme was proposed by Gentry [50]; it allows an unlimited number of arithmetic operations in the encrypted domain. Recently, researchers have applied homomorphic encryption schemes to machine learning models, with possible applications in medicine and biometrics [51,52]. Homomorphic encryption can be used for the aggregation stage of an FL system, as it involves only the addition operation. Alternatively, one can use FHE to train the local models in the encrypted domain. The study and analysis of privacy homomorphism are beyond the scope of this paper.
• Differential Privacy (DP): DP is the most popular context-free notion of privacy, inspired by the stability of likelihood ratios [53,54]. DP adds noise to the model parameters during training and aggregation in order to protect the privacy of the training data and is widely used in deep learning models [38,55,56,57,58]. Informally, a randomized computation over a database D is differentially private if the sensitive data of individuals contributing to D is protected against arbitrary adversaries with query access to D [59]. Although DP is primarily designed to protect the privacy of the training data, it can also improve the robustness of the FL model.
Definition 1. Let ϵ ≥ 0 and 0 ≤ δ ≤ 1. A randomized algorithm M is said to be (ϵ, δ)-differentially private [59] if for any two neighbouring inputs (datasets) D_1 and D_2, its output distributions are (e^ϵ, δ)-close, i.e., for every event E in the output space:

Pr[M(D_1) ∈ E] ≤ e^ϵ · Pr[M(D_2) ∈ E] + δ,

where Pr[M(D_1) ∈ E] denotes the probability of event E under the distribution obtained by running the algorithm M on dataset D_1, ϵ is the privacy budget, and δ denotes the probability of information leakage. δ = 0 corresponds to pure DP, while δ > 0 corresponds to approximate DP; when δ = 0, the (ϵ, δ)-DP mechanism M reduces to an ϵ-DP mechanism.
The intuition behind the definition of DP is that an individual has little incentive to participate in a statistical study, as the individual's data has limited effect on the outcome [60].The Laplace and Gaussian noise mechanisms are the two most widely used practical mechanisms to achieve DP.
The L2 sensitivity of a function f is defined as ψ_f = max_{D_1 ∼ D_2} ‖f(D_1) − f(D_2)‖_2, where D_1 ∼ D_2 denotes that D_1 and D_2 are two neighbouring data sets. The Gaussian noise mechanism is then defined as

M(D) = f(D) + N(0, σ^2 ψ_f^2 · I_d),

where N(0, σ^2 ψ_f^2 · I_d) is a zero-mean multivariate Gaussian noise vector. Using the Gaussian mechanism, each data owner adds Gaussian noise to its local model parameters before forwarding them to the server. The parameter σ is chosen based on ψ_f and δ [38]. Note that the clipping (Cl) approach, described in Sec. 2.3.2, also bounds the L2 sensitivity of the model parameter aggregate with respect to the removal or addition of the data samples of one participant (data owner). Therefore, we can add Gaussian noise to the clipped model parameters to obtain a central DP guarantee. Gaussian noise can be added (i) during local training, (ii) to the aggregated local model parameters before forwarding to the server, or (iii) to the global model parameters at the server side before sharing with the participants. A combination of DP and secure aggregation is employed for medical image FL in [41].
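A sketch of the Gaussian mechanism using the classical calibration σ = √(2 ln(1.25/δ)) · ψ_f / ϵ (valid for ϵ < 1); the function signature is an illustrative assumption, not the paper's implementation.

```python
import numpy as np

def gaussian_mechanism(value: np.ndarray, l2_sensitivity: float,
                       epsilon: float, delta: float, rng=None) -> np.ndarray:
    """Release `value` with (epsilon, delta)-DP by adding Gaussian noise whose
    scale is calibrated to the L2 sensitivity of the released quantity."""
    if rng is None:
        rng = np.random.default_rng()
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * l2_sensitivity / epsilon
    return value + rng.normal(0.0, sigma, size=value.shape)
```

Clipping each client update to norm C first (Eq. (4)) bounds the sensitivity of the aggregate, which is what makes this calibration applicable in FL.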
In [61], the authors proposed a private adaptive strategy for tuning the clipping threshold C so that it tracks a specified quantile of the update-norm distribution, which can be viewed as controlling the clipping probability. In this research, we utilize the Gaussian differentially private federated averaging with adaptive quantile clipping approach of [61], which we refer to as GDP-AQuCl. Moreover, we applied the quantile scheme to the fixed zeroing and fixed clipping approaches described in Sec. 2.3.2, resulting in (i) zeroing with the adaptive quantile estimator (ZeQu) and (ii) clipping with the quantile estimator (ClQu). We compare all of these SOTA approaches in our experiments.
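The quantile tracking in [61] can be sketched as a geometric update that moves the threshold toward the target quantile γ of the client update norms. This simplified, non-private form and the parameter names are assumptions for illustration.

```python
import numpy as np

def update_clip_threshold(c: float, update_norms, gamma: float = 0.5,
                          eta_c: float = 0.2) -> float:
    """Adapt the clipping threshold c toward the gamma-quantile of the
    update-norm distribution: shrink c if too few updates are clipped,
    grow it if too many are clipped."""
    b = float(np.mean([norm <= c for norm in update_norms]))  # unclipped fraction
    return c * np.exp(-eta_c * (b - gamma))
```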

Communication Efficiency in FL
Communication efficiency in FL refers to the ability of the learning model to minimize the amount of communication required among the participating nodes in order to train the global model. In the literature, several strategies have been proposed [62,63,64,65] to optimize communication efficiency compared with the naïve SGD method. The communication between participants (data owners) and the parameter server is a fundamental stage of FL frameworks. Solutions proposed in the literature to reduce communication costs in FL aim to reduce (i) the size of the model parameter updates (via model compression, pruning, quantization, and/or sparsification), (ii) the number of participating data owners, and (iii) the total number of updates performed by each data owner. They are mainly based on lossy compression techniques such as quantization and sparsification. Model compression reduces the size of the model parameters in order to reduce the amount of data transmitted during training and aggregation. Pruning removes redundant or unnecessary connections from the model to reduce its size and the amount of data that needs to be transmitted. Quantization represents the model parameters using a smaller number of bits to reduce the model size and the amount of data that needs to be transmitted.
Lossy Compression (LoCo) Approach. A common solution to reduce communication costs in the FL framework is to apply lossy compression to the global model sent from the server to the participating parties [66,67]. Lossy compression techniques are typically studied through the lens of Shannon's rate-distortion theory, and abundant research has been devoted to designing lossy compression schemes [68,69,70,71]. It is also worth mentioning that lossy compression meets privacy from the lens of information theory [72,73,74,75,76,77,78,79]. In this paper, we use simple probabilistic uniform quantization, parameterized by the number of quantization bits (q) and the compression threshold. For a vector θ = [θ_1, ..., θ_d]^T, we denote its minimum and maximum components by θ_min = min_j θ_j and θ_max = max_j θ_j, respectively. In probabilistic uniform binary (1-bit) quantization, each coordinate θ_j is quantized as

Q(θ_j) = θ_max with probability (θ_j − θ_min)/(θ_max − θ_min), and Q(θ_j) = θ_min otherwise [80].

This stochastic 1-bit uniform quantization generalizes to stochastic q-bit uniform quantization by equally dividing [θ_min, θ_max] into k = 2^q sub-intervals, each of which plays the role of [θ_min, θ_max] in the 1-bit scheme. More precisely, we partition [θ_min, θ_max] using the boundaries

B_k(l) = θ_min + l · (S/k), l ∈ {0, 1, ..., k},

where S satisfies θ_min + S ≥ θ_max. Each coordinate of θ is then assigned stochastically to one of the boundaries: for θ_j ∈ (B_k(l), B_k(l+1)], we quantize

Q(θ_j) = B_k(l+1) with probability (θ_j − B_k(l))/(B_k(l+1) − B_k(l)), and Q(θ_j) = B_k(l) otherwise.

In our experiments, we set q = 8 and consider the natural choice S = θ_max − θ_min.
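The stochastic uniform quantizer described above can be sketched as follows. This sketch uses 2^q quantization levels (i.e., 2^q − 1 sub-intervals), a minor convention choice; the stochastic rounding makes the quantizer unbiased in expectation.

```python
import numpy as np

def stochastic_quantize(theta: np.ndarray, q: int = 8, rng=None) -> np.ndarray:
    """Unbiased stochastic uniform q-bit quantization of a vector: each
    coordinate is rounded to one of its two neighbouring levels with
    probability proportional to its distance from the other level."""
    if rng is None:
        rng = np.random.default_rng()
    t_min, t_max = float(theta.min()), float(theta.max())
    if t_max == t_min:
        return theta.copy()
    k = 2 ** q - 1                                # number of sub-intervals
    scaled = (theta - t_min) / (t_max - t_min) * k
    lower = np.floor(scaled)
    # round up with probability equal to the fractional part (unbiased)
    levels = lower + (rng.random(theta.shape) < (scaled - lower))
    return t_min + levels / k * (t_max - t_min)
```

With q = 1 this reduces exactly to the binary scheme: every coordinate is mapped to θ_min or θ_max.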
A comparison of the different utilized FL algorithm properties is summarized in Table 2.

Deep Neural Network Transformers
In this study, we implemented a purely attention-based transformer without convolutions, inspired by and modified from [81,12]. This follows a very active line of research pioneered by [82], which is motivated by the significant success of transformer architectures [11] in NLP and aims to bring the power of the self-attention mechanism to image- and vision-based applications. The architecture consists of an encoder, a bottleneck block, a decoder, and skip connections [12], and is based primarily on the Swin (Shifted windows) transformer block originally proposed in [81]. The images are first split into non-overlapping patches of dimension 4 × 4, followed by a linear projection that forms the input sequences to the network. The encoder consists of patch-merging blocks for signal downsampling, followed by Swin-transformer blocks responsible for representation learning. This yields a hierarchical representation which, similar to the U-shaped structure of the U-Net, is mirrored by a symmetric decoder consisting of Swin-transformer layers and patch-expanding units. Between the encoder and the decoder, skip connections facilitate signal flow. At the bottom of the encoder, a bottleneck of two consecutive Swin-transformer blocks without up- or down-sampling provides a further connection between the encoder and the decoder.
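The patch-partition and linear-projection step can be sketched as follows (a NumPy illustration in which a hypothetical random matrix stands in for the learned projection; the slice size and embedding dimension are illustrative):

```python
import numpy as np

def patch_embed(img, patch=4, dim=96, rng=None):
    """Split an H x W x C image into non-overlapping patch x patch blocks
    and linearly project each flattened patch to a dim-d token."""
    if rng is None:
        rng = np.random.default_rng(0)
    H, W, C = img.shape
    W_proj = rng.standard_normal((patch * patch * C, dim)) * 0.02  # stand-in for learned weights
    tokens = (img.reshape(H // patch, patch, W // patch, patch, C)
                 .transpose(0, 2, 1, 3, 4)          # group the 4x4 blocks
                 .reshape(-1, patch * patch * C))   # one row per patch
    return tokens @ W_proj                          # (H/4 * W/4, dim) token sequence

seq = patch_embed(np.zeros((144, 144, 2)))  # dual-channel PET/CT slice
print(seq.shape)  # (1296, 96)
```

The resulting token sequence is what the first Swin-transformer stage consumes; patch merging later concatenates neighboring tokens to halve the spatial resolution.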
As an alternative to the traditional sliding-window approach, the Swin-transformer block is based on the idea of shifted windows [81]. A regular partitioning of the patches is used at one layer, while the next layer uses a shifted version of it; this provides connections between the differently shaped windows through self-attention. The Swin-transformer block consists of layer normalization (LN), (shifted-)window multi-head self-attention ((S)W-MSA), a multi-layer perceptron (MLP), and several skip connections, such that:

ẑ^l = W-MSA(LN(z^{l−1})) + z^{l−1}
z^l = MLP(LN(ẑ^l)) + ẑ^l
ẑ^{l+1} = SW-MSA(LN(z^l)) + z^l
z^{l+1} = MLP(LN(ẑ^{l+1})) + ẑ^{l+1}

where ẑ^l and z^l represent the outputs of the (S)W-MSA and MLP modules of the l-th block, respectively, and the self-attention mechanism of [11] is computed as:

Attention(Q, K, V) = SoftMax(QK^T / √d) V

where Q, K, and V ∈ ℝ^{M²×d} denote the query, key, and value matrices, with M² representing the number of patches in a window and d being the dimension of the query/key.
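The attention computation above can be sketched for a single window as follows (a minimal illustration without multi-head splitting or the relative position bias of the full Swin block):

```python
import numpy as np

def softmax(x):
    """Numerically stable row-wise softmax."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def window_attention(Q, K, V):
    """Self-attention over the M^2 patches of one window:
    Attention(Q, K, V) = SoftMax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V
```

Because attention is computed only within each M × M window, the cost scales linearly with the number of windows rather than quadratically with the full slice.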

Training
We evaluated the different frameworks (single-center-based, centralized, and seven FL algorithms) using 68 PET/CT images (20% of each center's local data). Training was performed on axial slices: PET and CT images were fed to the models simultaneously as a dual-channel input with a batch size of 32. During each iteration, a stratified mini-batch approach was used, in which half of each batch contained tumor segmentations and half contained no tumor, to avoid bias during training. All DL models were implemented in the TensorFlow framework. FL algorithms were implemented using TensorFlow Federated (TFF), an open-source framework developed for simulating and implementing different FL algorithms. All networks were trained in a 2D manner with the Adam optimizer, a learning rate starting at 0.001, and a weight decay of 0.0001. Dice loss was used, and models were trained for 300 epochs, with 100 communication rounds in FL.
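The stratified mini-batch construction can be sketched as follows (an illustrative NumPy version; the index arrays and batch size are assumptions, not the paper's data loader):

```python
import numpy as np

def stratified_batch(tumor_idx, background_idx, batch_size=32, rng=None):
    """Draw half of each mini-batch from slices containing tumor and half
    from tumor-free slices, so training sees a balanced signal."""
    if rng is None:
        rng = np.random.default_rng()
    half = batch_size // 2
    batch = np.concatenate([rng.choice(tumor_idx, half, replace=False),
                            rng.choice(background_idx, half, replace=False)])
    rng.shuffle(batch)  # mix the two strata within the batch
    return batch
```

Without this balancing, the abundance of tumor-free axial slices would bias the model toward predicting empty masks.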

Quantitative Evaluation
Different evaluation metrics, including standard quantitative segmentation metrics, image-derived PET metrics, and radiomics features, were considered to evaluate and compare the performance of the different frameworks. Standard quantitative segmentation metrics included the Dice similarity coefficient, Jaccard similarity coefficient, false-negative rate (1 − Sensitivity), false-positive rate (1 − Specificity), mean and standard deviation (SD) of surface distance (mm), as well as the Hausdorff distance and average Hausdorff distance (mm). Image-derived PET metrics for clinical evaluation of the different frameworks, including variants of the standardized uptake value (SUV), namely SUVpeak, SUVmean, SUVmedian, and SUVmax, as well as metabolic tumor volume (MTV) and total lesion glycolysis (TLG = MTV × SUVmean), were also analyzed. For radiomics analysis, we extracted intensity, histogram, and shape radiomics features using the SERA package [83]. All metrics were calculated on the test sets (20% of each center's data).
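Two of these metrics can be sketched as follows (a minimal illustration; `voxel_ml` is an assumed voxel volume in ml, not a value from the paper):

```python
import numpy as np

def dice(pred, gt):
    """Dice similarity coefficient between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())

def tlg(suv, mask, voxel_ml):
    """Total lesion glycolysis: MTV (ml) times SUVmean inside the mask."""
    mtv = mask.sum() * voxel_ml        # metabolic tumor volume in ml
    return mtv * suv[mask].mean()      # TLG = MTV x SUVmean
```

The Jaccard coefficient follows from the same overlap counts (J = |A∩B| / |A∪B|), and the surface-distance metrics additionally require the mask boundaries.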

Statistical Analysis
Percent relative error (RE%) was calculated for PET image metrics with respect to manual segmentation. The Kolmogorov-Smirnov test was used to assess normality and, based on the resulting distributions, the paired Wilcoxon signed-rank test was chosen for evaluation, with p-value < 0.05 defined as the threshold for statistical significance. Comparison of the different models using various metrics (Dice coefficient, Hausdorff distance, and mean surface distance) was performed using the paired Wilcoxon signed-rank test, and all p-values were corrected with the Benjamini-Hochberg procedure. The intra-class correlation (ICC) [84,85] test was performed to assess radiomics feature reproducibility of the different approaches with respect to manual segmentation; we classified radiomics features by ICC value into four groups: poor reproducibility (ICC < 0.40), fair reproducibility (0.40 ≤ ICC < 0.60), good reproducibility (0.60 ≤ ICC < 0.75), and excellent reproducibility (0.75 ≤ ICC ≤ 1.00).
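The multiple-comparison correction and the ICC binning can be sketched as follows (an illustrative implementation of the Benjamini-Hochberg rejection rule and of the reproducibility classes above):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean rejection mask under BH false-discovery control:
    reject the i smallest p-values where p_(i) <= alpha * i / m."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, bool)
    if below.any():
        cutoff = np.max(np.where(below)[0])   # largest index passing the rule
        reject[order[:cutoff + 1]] = True
    return reject

def icc_group(icc):
    """Bin an ICC value into the reproducibility classes used here."""
    if icc < 0.40:
        return "poor"
    if icc < 0.60:
        return "fair"
    if icc < 0.75:
        return "good"
    return "excellent"
```

The BH procedure controls the expected fraction of false discoveries among the rejected hypotheses, which is less conservative than a Bonferroni correction over the many pairwise model comparisons.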

Code and Data Availability
All PET and CT images are available in The Cancer Imaging Archive [26,22,3,27,28,29,30,31,32]. The implementations will be made available in the authors' GitHub repository.

Quantitative Segmentation Metrics
Fig. 4 compares the performance of the different models and approaches; a summary of the results is also presented in Table 3. The centralized (CeZe) and SeAg models showed the best performance in terms of the Dice coefficient (0.80 ± 0.11 versus 0.80 ± 0.11), without any significant difference between the two (p-value > 0.05). In terms of the false-negative rate, ClQu, ZeQu, FedAvg, LoCo, and RoAg achieved the lowest values of 0.14 ± 0.13 (CI95%: 0.11 to 0.17), whereas CeZe showed the lowest false-positive rate of 0.14 ± 0.12 (CI95%: 0.12 to 0.17). In terms of the Hausdorff distance, the GDP-AQuCl method achieved the lowest value of 7.78 ± 5.98 mm (CI95%: 6.35 to 9.2 mm), followed by RoAg with 7.85 ± 6.16 mm (CI95%: 6.39 to 9.31 mm). In terms of the Hausdorff distance (7.78 ± 5.98 mm), maximum surface distance (8.02 ± 5.93 mm), mean surface distance (0.36 ± 0.30 mm), SD of surface distance (0.93 ± 0.81 mm), and average Hausdorff distance (0.40 ± 0.46 mm), GDP-AQuCl outperformed all federated, single-center, and centralized approaches. Statistical analysis showed significant differences (p-value < 0.05) when comparing the single-center-based models with the centralized and federated algorithms for the different quantitative metrics. However, for almost all segmentation metrics, statistical tests showed no significant difference (p-value > 0.05) between the centralized and the different FL approaches, or among the different FL algorithms. Fig. 5 presents a comparison of the different models (p-values) in terms of three metrics.

PET Quantitative Metrics ICC and Reproducibility
Results for the PET quantitative metrics, in terms of RE%, are presented in Table 4 for the different approaches. SUVmax and SUVpeak values of all federated and centralized approaches achieved an RE% of zero; however, for CeBa, RE% values of 1.28 ± 6.92 and 1.53 ± 10.98 were obtained for SUVmax and SUVpeak, respectively. Among the FL approaches, the lowest RE% for SUVmean (−0.87 ± 12.02), SUVmedian (−1.24 ± 15.75), and TLG (9.79 ± 22.09) was achieved by the GDP-AQuCl method. All PET quantitative metrics showed excellent reproducibility in terms of the ICC analysis (ICC > 0.75). Fig. 6 depicts Bland-Altman plots of SUVmean for the different approaches, computed with respect to manual segmentation, showing good agreement between the different approaches and manual segmentation. Fig. 7 presents the ICC values of the different radiomics features for the different algorithms with respect to manual segmentation; as shown in this figure, most features exhibited excellent reproducibility (ICC > 0.75). The CeBa approach had fewer reproducible features (17 radiomics features with ICC < 0.45).

Center-Based Analysis
In addition to the centralized and FL-based frameworks, we also analyzed training and testing on each center's dataset separately; the quantitative metrics for these single-center approaches are presented in Table 4. In the CeBa analysis, results of training and testing on data from the same center are presented. Training on one center and testing on the other sets (datasets from centers not represented in training) showed low generalizability across centers (mean Dice scores of 0.56 to 0.72). Table 5 presents the quantitative PET metrics for the different centers; as seen in the table, all metrics showed high variability across centers. Fig. 8 presents 2D axial views of different patients, in both original and magnified versions of the GTVs, obtained by training on different centers' datasets. This figure illustrates the low segmentation accuracy obtained when training on one center's dataset and testing on another's.

Discussion
PET/CT image segmentation is a crucial step toward quantitative analysis for monitoring treatment response and radiation therapy planning. However, it suffers from a number of challenges due to inherent limitations in image quality and the high anatomical variability of the HN region. Inter-observer variability with average Dice scores of 0.57 in CT, 0.61 in PET, and 0.69 in PET/CT has been reported between different human observers in previous studies [7,86,87]. Various DL algorithms have been developed to address these challenges by automating the segmentation process. Centralized training on data pooled from multiple centers is ideal for building generalizable models; however, this approach faces privacy, security, legal, ethical, and ownership challenges, which can be addressed by training a shared global model using FL. In the current study, we evaluated the performance of different decentralized FL frameworks for multi-institutional PET/CT image co-segmentation. The HECKTOR challenge was organized to address HN tumor segmentation using PET/CT images; since we used different datasets from those used in the HECKTOR challenge, our results are not directly comparable. We implemented seven different FL algorithms in the current study and compared their performance with centralized and single-center-based approaches for HN GTV segmentation from PET/CT images using vision transformers. All FL approaches matched the performance of the centralized learning model with no statistically significant difference. Among the FL algorithms, SeAg and GDP-AQuCl outperformed the others on several quantitative metrics, although there were no statistically significant differences between these FL algorithms. Conversely, the single-center-based models showed low accuracy and generalizability. Of all segmentation frameworks, GDP-AQuCl produced the highest number of reproducible radiomics features; a plausible reason is its lowest surface-distance and Hausdorff-distance values compared to the other frameworks, at an equivalent Dice score. We conclude that collaboration between different centers is crucial for developing generalizable DL models. Notwithstanding the variability in PET scanner models, image acquisition and reconstruction protocols, and dataset sizes across centers, all FL approaches achieved the performance of the centralized learning model with no statistically significant difference.
FL algorithms face some inherent challenges in medical imaging, including data partitioning, data distribution, privacy and security, and the communication and computation capabilities of the infrastructure. Choosing the right data partitioning is an important step toward addressing limited sample sizes, limited feature sizes, or both, resulting in horizontal FL (HFL), vertical FL (VFL), or federated transfer learning (FTL), respectively [13,14]. In the current study, we implemented horizontal FL algorithms, where there is no overlap between data from different centers, while using both PET and CT images for the DL models. The second issue in FL is data distribution, a statistical data heterogeneity challenge arising from the decentralized nature of the datasets, as each center generates its own local data. Because the data are decentralized, their distribution across centers can differ significantly, which is known as non-independent and identically distributed (non-IID) data. Centers equipped with different scanner models, using different image acquisition and processing protocols, and employing different segmentation techniques may produce non-IID data in medical imaging [13,14]. To address heterogeneity in our data, we employed automated preprocessing, including cropping, metal artifact reduction, and resizing to isotropic voxels. In addition, as the sample size of each center differed (from 23 to 99), we used a stratified mini-batch approach during each iteration (half of each batch contained tumor segmentations and the other half did not) to avoid biased training. In our study, all FL approaches matched the performance of the centralized method in PET/CT image segmentation and outperformed the single-center-based approaches. Another issue in FL is privacy and security, as the number of centers could potentially grow to hundreds and even thousands, in which case not all centers can be considered trustworthy parties. Different kinds of attacks, including membership inference and model inversion attacks, could be performed by curious parties to discover whether a specific data sample exists within the training set of another center, or to regenerate training sets from the trained model, respectively [13,14]. These attacks result in the leaking of sensitive information about patients during decentralized
training, which can be a serious concern impeding the adoption of FL techniques in large-scale medical applications. Different methods, such as data perturbation or encryption, can be implemented for data privacy and security. Controlled random noise can be added to samples during training to guarantee differential privacy (DP) [13,14], and encryption can be used during the aggregation process to preserve privacy. Membership inference and model inversion attacks can be addressed by the DP mechanism [13,14]. Other attacks, including data and model poisoning (i.e., adversarial attacks), can be mounted by malicious parties. We implemented DP-based as well as secure FL approaches and showed that both achieved centralized-level performance in PET/CT image segmentation while preserving patient privacy and security against potential attacks.
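The DP noise-addition idea can be sketched as gradient clipping plus Gaussian noise (an illustrative mechanism in the spirit of DP-SGD; the clipping norm and noise multiplier are hypothetical values, not the parameters used in this study):

```python
import numpy as np

def dp_gaussian_update(grad, clip_norm=1.0, noise_mult=1.1, rng=None):
    """Clip a per-client gradient to a fixed L2 norm, then add Gaussian
    noise scaled to that norm; bounding each client's contribution is
    what makes the added noise yield a DP guarantee."""
    if rng is None:
        rng = np.random.default_rng()
    norm = np.linalg.norm(grad)
    clipped = grad * min(1.0, clip_norm / max(norm, 1e-12))
    return clipped + rng.normal(0.0, noise_mult * clip_norm, grad.shape)
```

The privacy budget (ε, δ) then follows from the noise multiplier, the sampling rate, and the number of rounds via a privacy accountant.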
In our study, an experienced nuclear medicine physician evaluated all PET/CT segmentations and edited/modified them to correct plausible errors, which would be unrealistic in a real-world FL setup. This was necessary for our study, however, as the dataset was gathered from an online repository containing a few errors that had to be mitigated before building the models. In a real FL deployment, images and segmentations should be checked and corrected locally; otherwise, such errors could compromise the training process. Another challenge in FL is the statistical variation resulting from image pre-processing at different centers [13,14]. As PET/CT images are in DICOM format, in real-world scenarios the image pre-processing pipeline could be shared with each client to provide pre-processed images with the same settings across centers; we implemented fully automated pre-processing steps in the current study toward reproducible data preparation. One limitation of the current study is that all the analysis was performed on a single server with multiple GPUs treated as different centers; thus, ideal communication between the centers and the parameter server was implicitly assumed. Further studies should consider practical communication bottlenecks for real clinical applications. In addition, we used a limited number of datasets and clients for model development to demonstrate that FL for tumor segmentation can reach the performance of a centralized model; further studies with more data and clients are needed to confirm the effectiveness of FL segmentation models.

Conclusion
FL-based algorithms proved highly effective for HN tumor segmentation in PET/CT images, achieving performance on par with centralized deep learning models. These algorithms enable the training of generalizable PET/CT image segmentation models by providing access to large, diverse datasets from multiple centers without compromising patient privacy or security. This decentralized approach allows the creation of more robust and accurate models, particularly in situations where data sharing among centers is restricted. The use of FL-based algorithms represents a novel approach to HN tumor segmentation in PET/CT images.

Fig. 2
For visual comparison, Fig. 2 depicts examples of 3D-rendered volumes of GTV segmentations from the six clinical centers (columns), with manual segmentation (red) as well as the single-center-based, centralized, and different FL approaches (blue). For a visual comparison of the different approaches with respect to manual segmentation, Fig. 3 presents 2D axial views of different patients, in both original and magnified versions of the GTVs. As shown in these figures, the segmentations provided by the different FL approaches are in good agreement with the centralized and manual segmentations across GTVs of different textures and sizes.

Fig. 4 .
Fig. 4. Comparison of the performance of the different frameworks in terms of the quantitative segmentation metrics: Dice similarity coefficient, Jaccard similarity coefficient, false-negative rate (1 − Sensitivity), false-positive rate (1 − Specificity), mean and standard deviation (SD) of surface distance, as well as Hausdorff distance and average Hausdorff distance.

Fig. 5 .
Fig. 5. Comparison of different models (p-values) in terms of the Dice coefficient, Hausdorff distance, and mean surface distance. Manual segmentation is used as the criterion standard.

Fig. 6 .
Fig. 6. Bland-Altman plots of SUVmean for the different frameworks compared to manual segmentation.

Fig. 8 .
Fig. 8. 2D views of PET/CT segmentations obtained in three different cases. Manual: red; Center 1: green; Center 2: blue; Center 3: brown; Center 4: olive; Center 5: orange; Center 6: cyan. Each colour represents the model built using the corresponding center's training data.

Table 1 .
Summary of the data description, including patient demographics, PET and CT image acquisition, and reconstruction settings for the different centers.

Distributed Training Framework. In a distributed training framework, multiple computing nodes cooperate to build the global model: the training data are divided among the nodes, and each node trains its own model using the data it has access to. Decentralized Training Framework. In a decentralized training framework, there is no central node, and each node trains its own model using its local data. Therefore, there is no server to train a model (as in a centralized training framework) or to aggregate the local model updates (as in a distributed training framework); instead, the computation is distributed across all participating parties. Federated Learning Framework. In an FL framework, the training data remain decentralized and are not shared among the nodes, but the nodes can still collaborate and share their model updates in order to improve the overall performance of the network. In other words, the FL framework builds a centralized model through decentralized model training: the participating parties hold their own data, the ML models are trained independently on the local datasets, each party then sends its model updates to a central server, and the central server aggregates the updates to build a global model. Note that in a distributed training framework [35] we have centralized data that are distributed to computing servers (i.e., workers) for efficient and fast training, whereas in the FL framework we have decentralized data and aim to train a global model with the help of a parameter server. In this research, we implement and compare single-center training, centralized training, and different FL algorithms.
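The server-side aggregation step of the FL framework described above can be sketched as FedAvg-style weighted averaging (a minimal illustration, not the exact aggregation rules of the seven algorithms compared here):

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Server-side FedAvg: average the clients' model parameters,
    weighted by each client's local dataset size."""
    total = sum(client_sizes)
    n_layers = len(client_weights[0])
    return [sum(n / total * w[i] for w, n in zip(client_weights, client_sizes))
            for i in range(n_layers)]

# Two hypothetical clients holding one-layer models
wA = [np.array([1.0, 1.0])]
wB = [np.array([3.0, 3.0])]
print(fedavg([wA, wB], [1, 3]))  # -> [array([2.5, 2.5])]
```

Each communication round then consists of broadcasting the aggregated weights back to the clients, local training, and another aggregation.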

Table 2 .
Summary of the Different FL Algorithm Properties.
Algorithm | Overcomes Client Drifting | Adaptive Learning Rate | Cross-Device Compatible | Robust to Outliers | Communication Efficient | Addresses Client Heterogeneity | Ensures Privacy
ClQu

Table 3 .
Summary of Quantitative Image Segmentation Performance Metrics (Mean ± SD and CI95%) for the different algorithms.

Table 4 .
Summary of Quantitative PET Metrics (Mean ± SD and CI95%) for the different algorithms.

Table 6 .
Summary of Quantitative PET Metrics (Mean ± SD and CI95%) for different training sets by the different centres.