Improving Multiple Sclerosis Lesion Segmentation Across Clinical Sites: A Federated Learning Approach with Noise-Resilient Training

Accurately measuring the evolution of Multiple Sclerosis (MS) with magnetic resonance imaging (MRI) critically informs understanding of disease progression and helps to direct therapeutic strategy. Deep learning models have shown promise for automatically segmenting MS lesions, but the scarcity of accurately annotated data hinders progress in this area. Obtaining sufficient data from a single clinical site is challenging and does not provide the data heterogeneity needed for model robustness. Conversely, the collection of data from multiple sites introduces data privacy concerns and potential label noise due to varying annotation standards. To address this dilemma, we explore the use of the federated learning framework while considering label noise. Our approach enables collaboration among multiple clinical sites without compromising data privacy, under a federated learning paradigm that incorporates a noise-robust training strategy based on label correction. Specifically, we introduce a Decoupled Hard Label Correction (DHLC) strategy that considers the imbalanced distribution and fuzzy boundaries of MS lesions, enabling the correction of false annotations based on prediction confidence. We also introduce a Centrally Enhanced Label Correction (CELC) strategy, which leverages the aggregated central model as a correction teacher for all sites, enhancing the reliability of the correction process. Extensive experiments conducted on two multi-site datasets demonstrate the effectiveness and robustness of our proposed methods, indicating their potential for clinical applications in multi-site collaborations.


Introduction
Multiple sclerosis (MS) is a chronic neuroinflammatory disease that results in cumulative focal and diffuse damage to the brain and spinal cord. Magnetic Resonance Imaging (MRI) based volumetric quantitation of MS brain lesions, i.e., T2 hyper-intense or T1 hypo-intense areas, and their change over time, informs both disease course and response to therapy [1].
In current clinical trial workflows, MS lesions are manually segmented from images acquired over several time points by trained neuroimaging analysts to provide measures of therapeutic efficacy. However, annotating lesions on the several hundred images acquired in each scan is extremely time-consuming and labor-intensive, considering that the burden of MS lesions is highly heterogeneous, ranging from zero to hundreds in lesion number, and from less than 1 ml to more than 15 ml of brain tissue volume affected. Lesions can also be widely distributed throughout the brain, although they preferentially affect the periventricular, juxtacortical, and infratentorial regions. Lack of analysis expertise, time constraints, and logistical problems preclude routine quantitative measurement of MS lesions in clinical practice, which is reliant upon qualitative interpretation of images by radiologists and clinicians.
To facilitate the annotation process and improve the current clinical workflow, both statistical methods and deep learning methods have been explored to automatically segment lesions, among which U-Net-based deep neural networks [2,3] are most widely used in research settings. The lack of large, high-quality multi-center datasets and corresponding ground truth annotations is a critical limitation for training deep neural networks in medical imaging tasks, including MS lesion segmentation. Ethical, governance, privacy, and security constraints are difficult to navigate and preclude the participation of many centers in conventional centralized learning frameworks, which require the aggregation of raw imaging into a single large-scale dataset. Federated Learning (FL), an emerging training framework that does not expose raw data from contributing sites, overcomes many of these obstacles. In FL, only the model weights, trained locally at each participating site, are shared with the central server for aggregation in an iterative training process that captures the distilled information of the site datasets without compromising raw data or private health information [4].
While FL is an effective training framework using scattered multi-center datasets and clean labels in both natural and medical image research [5,6], the impact of inaccurate labels has not been explored for image segmentation tasks in a federated learning environment. A multiplicity of image acquisition protocols and annotation schemes across participating centers generates significant labeling variability that is inevitably exacerbated by significant heterogeneity of MS lesion morphology and appearance [7]. As shown in Figure 1, lesion segmentations performed on the same MRI images by two trained annotators differ from one another and from a consensus generated by the maximum voting of several expert image analysts, which is generally used to delineate "golden" labels for training and evaluation. In our experience, such label noise consists mostly of divergent lesion boundaries and missed lesions.
In this work, we conduct the first exploration of MS lesion segmentation with noisy labels under a federated learning paradigm to achieve a more practical, scalable, and accurate automatic lesion annotation system. In contrast with existing works that employ centralized MS lesion segmentation training with consistent and clean labels [8,9], we consider a scenario in which labels from individual sites participating in a federated training framework are less controlled for annotation quality. To cope with the resulting label noise, we propose a Decoupled Hard Label Correction (DHLC) strategy that takes into consideration the characteristics of MS lesions, i.e., the extreme imbalance in voxel counts between lesions and normal brain tissue, and the relatively fuzzier decision boundaries compared with other segmentation tasks (e.g., natural image segmentation and brain segmentation in MRI images).
Moreover, we introduce a Centrally Enhanced Label Correction (CELC) strategy, in which the aggregation center in the federated training framework helps each participating site maintain a more accurate central model that can be utilized to decrease overfitting of each site model to its local label noise during the site updating process. Here, we experimentally demonstrate that noisy annotations severely degrade the performance of deep models under a federated paradigm, and show that our proposed strategies (i.e., DHLC and CELC) can significantly improve the model's robustness against label noise. In the two multi-site datasets we explored in this study (i.e., the public MSSEG-2016 dataset and an in-house dataset SNAC-MS), our method consistently improves the segmentation performance of the U-Net model trained with severe label noise.
The contributions of our study are four-fold.

Deep Learning based Lesion Segmentation
MS lesion segmentation is based primarily upon altered signal intensity relative to normal white (and gray) matter on MRI sequences such as T1-weighted (T1-w), T2-weighted (T2-w), and T2 fluid-attenuated inversion recovery (T2-FLAIR) images. Deep learning based segmentation methods have shown promising performance.
The earliest applications of deep learning, described in [10] and [11], treated the segmentation problem as a voxel-wise classification problem, in which 3D convolutional neural networks with multiple layers and channels were used to predict the label of each voxel based on its surrounding patches.
Better performance has been achieved using U-Net [2]. For example, in [8], the authors proposed a 3D-patch-based U-Net that includes an encoder with seven convolutional layers and shortcut connections to integrate multi-scale features. The performance of a baseline U-Net has been improved using variants of deep neural networks. In [12], a cascaded architecture was proposed to enhance the network's sensitivity. In [9], the authors introduced the attention module in the U-Net framework, improving the model's performance and interpretability. Our methods are also based on the U-Net model structure as shown in Figure 2.
Prior to the application of deep learning models, imaging data requires pre-processing that is intended to normalize the input images by co-registration [13,14], bias correction [15], and skull stripping (e.g., Brain Extraction Tool from FSL [16]). Where mentioned, these requisite pre-processing steps have also been employed in our experiments.

Federated Learning
The most common usage of federated learning is horizontal federated learning (HFL), in which all sites share the same model structure that deals with different samples with the same data format per site. The trained local site model weights are shared with the central server for aggregation and distribution. Examples of successful applications of deep learning models developed in a federated learning environment include chest x-ray classification for COVID-19 [5] and brain tumor segmentation [6,17]. To the best of our knowledge, the current work represents one of the first studies to use horizontal federated learning for the task of MS lesion segmentation.
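The aggregation step described above can be illustrated with a minimal sketch. It assumes each site's model is a dictionary of NumPy arrays and that the server weights each site by its number of training samples; the function name `fedavg`, the toy parameter dictionaries, and the sample counts are hypothetical and are not taken from the paper.

```python
import numpy as np

def fedavg(site_weights, site_sizes):
    """Average per-site parameter dicts, weighted by local sample counts."""
    total = float(sum(site_sizes))
    keys = site_weights[0].keys()
    return {
        k: sum(w[k] * (n / total) for w, n in zip(site_weights, site_sizes))
        for k in keys
    }

# Three hypothetical sites, each holding one tiny "layer".
sites = [
    {"conv": np.array([1.0, 2.0])},
    {"conv": np.array([3.0, 4.0])},
    {"conv": np.array([5.0, 6.0])},
]
# Equal sample counts reduce the weighted average to a plain mean.
central = fedavg(sites, site_sizes=[5, 5, 5])
```

Only the parameter dictionaries cross the site boundary in this sketch; the images and labels used to produce them never leave the site.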
Current research aims to improve the performance of HFL and deal with more complicated scenarios, including non-independent and non-identically distributed (non-IID) data distributions [18], communication efficiency [19], data security and model inversion [20], and aggregation methods [21]. Among these studies, the problem of noisy labels, a common issue in real-world applications, has not been considered. Here, we propose a novel training strategy to improve the performance of HFL in the scenario of noisy labels.

Noisy Labeling in Segmentation
Recognition of label errors during training by cross-validation can provide an opportunity to remove or correct cases with bad labels or assign lower weights to such cases [22]. Outlier/anomaly detection algorithms can also be used to identify and correct erroneous labels [23]. Alternatively, the data labels can be updated with the classifier during the training process [24]. Similarly, segmentation noise can be detected through an additional network [25,26], and data with noisy labels can then be excluded from, or corrected during, the training process.
Knowledge distillation, in which a teacher network is updated together with a student network, has also proven to be an effective method for dealing with noisy labels. In [27], the teacher model is updated using the exponential moving average (EMA) of the student model, which helps to mitigate the influence of noisy labels. This framework has been used for segmentation in [28], where the student network is trained with a noise-robust dice loss, and the teacher model is updated using EMA. As a special teacher-student network, the co-teaching network [29] maintains two peer networks that are trained simultaneously to independently select data with clean labels. The knowledge each network learns from its selected clean labels is shared with its peer in every iteration. A more complex tri-teaching strategy is proposed in [30] as an extended co-teaching framework, in which the corrected annotations are selected through consensus and differences between any two of the three teacher networks.
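The EMA teacher update mentioned above can be sketched as follows. This is a minimal illustration assuming models are dictionaries of NumPy arrays; the decay value `alpha` and the function name `ema_update` are assumptions for illustration and are not specified in the text.

```python
import numpy as np

def ema_update(teacher, student, alpha=0.99):
    """teacher <- alpha * teacher + (1 - alpha) * student, per parameter."""
    return {k: alpha * teacher[k] + (1.0 - alpha) * student[k] for k in teacher}

# Toy example: the teacher drifts slowly towards a fixed student.
teacher = {"w": np.zeros(2)}
student = {"w": np.ones(2)}
for _ in range(3):
    teacher = ema_update(teacher, student, alpha=0.5)  # 0 -> 0.5 -> 0.75 -> 0.875
```

Because the teacher averages over many student states, a single noisy update to the student changes the teacher only slightly, which is what makes it a steadier source of pseudo labels.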
In the current work, we propose a novel label correction strategy, which we refer to as Decoupled Hard Label Correction (DHLC). This strategy builds upon traditional label correction methods [24,25,26] but focuses on independently adjusting each class with separate hard thresholds. The thresholds are applied to each voxel individually, and a label is corrected only when the prediction of the local teacher model contradicts the given label. This design reflects the observation that most lesion label noise arises at lesion boundaries.
In the context of federated learning, we utilize the central model aggregated from site models to handle variations in noise distribution across sites. In this Centrally Enhanced Label Correction (CELC) method, the aggregated central model serves as the teacher model for generating pseudo labels. Unlike in existing teacher-student network approaches [27,28] and in DHLC, the teacher model in CELC is only updated by the aggregation operation in the federated learning central server. This approach helps to eliminate the bias of each local model towards its local noise and therefore enhances the label correction strategy within the federated learning framework.

Methodology
In this section, we first introduce the baseline method for automated lesion segmentation under a multi-site collaborative and distributed learning paradigm. Then, a robust MS lesion segmentation framework (Figure 3), composed of a U-Net-based lesion segmentation network and our new training strategies, is elaborated to ameliorate the impact of noisy labels on the learning process.

Figure 2: An illustration of the basic U-Net architecture, which is built upon cascaded CNN layers with an encoder-decoder structure to capture both low-level contextual patterns and high-level semantic information. Each orange arrow is a convolutional module, including a convolutional layer ("Conv"), a normalization layer ("Norm"), and an activation layer ("PReLU", which stands for Parametric Rectified Linear Unit). Dual arrows represent the duplicated convolutional modules (referred to as "dual-conv modules"). "C" denotes the number of channels of the feature maps, and the dimensions of the feature maps are controlled by the strides in the convolutional layers and by the upsampling layers (marked with green arrows).

Lesion Segmentation under Federated Learning
Our baseline method for lesion segmentation under the federated learning paradigm does not take into account the presence of noisy labels. Following previous work on MS lesion segmentation [2], we utilize the widely used U-Net architecture as the backbone network for mapping input brain MR images to a high-dimensional representation space and for learning salient patterns.
The U-Net architecture is an extension of typical encoder-decoder convolutional neural networks (CNN) designed to capture both low-level contextual patterns and high-level semantic information. As illustrated in Figure 2, the U-Net comprises a down-sampling path and an up-sampling path, each consisting of convolutional modules and dual-conv modules (represented by dual arrows). The first Conv layers in all the dual-conv modules of the down-sampling path have a stride of 2 (except for the first dual-conv module, which has a stride of 1 to encode detailed image features), which is intended to reduce the resolution of feature maps and increase the receptive field for subsequent CNN operations. Correspondingly, each dual-conv module in the up-sampling path is followed by an up-convolution operation (a 2 × 2 transposed convolution) to expand the resolution of the feature maps. Skip connections are employed to preserve more low-level contextual information in the final representation, by concatenating lower-level feature maps with their corresponding higher-level feature maps. Finally, a 1 × 1 convolution layer is employed to generate the segmentation map from the final feature map. Unlike

Decoupled Hard Label Correction
The U-Net-based lesion segmentation method described above demonstrates promising performance under the federated learning framework when clean labels are available. However, in the presence of noisy labels, additional strategies are needed to train the segmentation model effectively. To this end, we introduce a simple, yet effective, strategy that we refer to as Decoupled Hard Label Correction (DHLC), which aims to implicitly identify and correct possible noisy labels at the voxel level during the local optimization stage.
For an arbitrary input sample $x$ with its corresponding label $y$, a predicted segmentation map generated by U-Net is denoted as $p$. As mentioned in [31], label correction methods (such as [24]) mitigate the influence of noisy labels by adjusting the one-hot label distribution $y$ using the predicted distribution $p$:

$$\hat{y} = (1 - \epsilon)\, y + \epsilon\, p, \tag{1}$$

where $\hat{y}$ is the corrected pseudo label for $x$ and $\epsilon$ is a factor to control the correction strength. This label correction method is intended to generate soft pseudo labels for all samples instead of differentiating between clean labels and noisy labels, which is suitable for, and has shown success in, natural image classification problems [24,32]. For the lesion segmentation task, which is inherently more ambiguous, with relatively fuzzier masks that lack a clear decision boundary between the lesion and background, such a strategy would potentially decrease the model's ability to discriminate between lesion and non-lesion voxels. Instead of using the complex tri-teaching strategy [30] or introducing additional networks [25] to identify noisy labels, we propose to identify and correct noisy labels at the voxel level based on the following rules (as shown in Equation 2): 1) if voxel $i$ is predicted as not being a lesion with a probability higher than a threshold $\tau_0$ but the given label identifies it as a lesion, the label is considered noisy and is corrected to not a lesion (i.e., the background); 2) if voxel $i$ is predicted as a lesion with a probability higher than a threshold $\tau_1$ but the given label does not identify it as a lesion, the label is also considered noisy and is corrected to a lesion; 3) otherwise, the original label is preserved.

$$\hat{y}_i = \begin{cases} 0, & \text{if } p_i^{0} > \tau_0 \text{ and } y_i = 1, \\ 1, & \text{if } p_i^{1} > \tau_1 \text{ and } y_i = 0, \\ y_i, & \text{otherwise,} \end{cases} \tag{2}$$

where $y_i$ and $\hat{y}_i$ are the original label and the updated label for voxel $i$, and $p_i = (p_i^{0}, p_i^{1})$ is the prediction vector of voxel $i$, holding its background and lesion probabilities. In our strategy, the $p_i$ vector is generated by the local teacher network with the weights $\theta_t$, which is updated based on the local student network weights $\theta_s$ using the EMA method mentioned in [27,28].
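The three correction rules can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: it assumes a binary problem in which the background probability is one minus the lesion probability, and the function name `dhlc_correct` and the toy inputs are hypothetical. The default thresholds are the values reported for MSSEG-2016.

```python
import numpy as np

def dhlc_correct(labels, lesion_prob, tau0=0.90, tau1=0.65):
    """Hard, class-decoupled label correction at the voxel level."""
    labels = np.asarray(labels)
    lesion_prob = np.asarray(lesion_prob, dtype=float)
    background_prob = 1.0 - lesion_prob  # assumption: two-class softmax output

    # Rule 1: labelled lesion, but the teacher is confident it is background.
    to_background = (labels == 1) & (background_prob > tau0)
    # Rule 2: labelled background, but the teacher is confident it is lesion.
    to_lesion = (labels == 0) & (lesion_prob > tau1)

    corrected = labels.copy()  # Rule 3: everything else keeps its label.
    corrected[to_background] = 0
    corrected[to_lesion] = 1
    return corrected

labels = np.array([1, 1, 0, 0])
lesion_prob = np.array([0.05, 0.40, 0.70, 0.30])  # hypothetical teacher output
corrected = dhlc_correct(labels, lesion_prob)      # -> [0, 1, 1, 0]
```

Setting `tau0=1.0` makes the first rule unreachable, so lesion labels are never removed and only missing lesion voxels are added.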

Centrally Enhanced Label Correction
While the hard label correction significantly improves the model's robustness against noisy labels, we have observed that the aggregated central model, which is more discriminative, tends to be degraded by local models that are trained with heavy label noise. This is inherently caused by the FedAvg algorithm [4], in which multiple optimization iterations are performed at each site before aggregation and the site models tend to overfit to the label noise, thereby affecting the performance of the aggregated central model and site models, and the robustness of the model when trained to convergence.
To mitigate this issue, we propose using the central model as the teacher model to guide the learning of the student segmentation network (i.e., local models from sites), as shown in Figure 3. This design is motivated by the observation that the central model, which is aggregated from multiple sites, is less prone to overfitting to local label noise compared to the site models that are trained with heavily corrupted labels. Thus, using the central model to correct the label noise is likely to yield more accurate results.
The algorithm process of this Centrally Enhanced Label Correction (CELC) strategy is shown in Algorithm 1. The correction strategy of CELC is the same as that of DHLC in Equation 2, where $p_i$ is the predicted probability distribution of voxel $i$ from the teacher model with weights $\theta_c$ obtained from the aggregation server. In contrast to the teacher-student network design presented in [28], the teacher model in CELC is not updated during the site optimization process. Rather, it is updated by the center as part of the federated learning framework (as shown in Figure 3). Specifically, when the newly aggregated model performs worse than the previous best central model on the validation dataset, the previous best central model is used as the teacher model and shared with sites in the next round.
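The teacher selection rule at the server can be sketched as below. This is a simplified illustration under stated assumptions: models are represented by placeholder strings, the validation scores are made up, ties are resolved in favour of the newer model, and `select_teacher` is a hypothetical helper name.

```python
def select_teacher(new_model, new_score, best_model, best_score):
    """Return the (teacher, score) pair to broadcast for the next round.

    The newly aggregated model replaces the teacher unless it performs
    worse on the central validation set than the best model seen so far.
    """
    if best_model is None or new_score >= best_score:
        return new_model, new_score
    return best_model, best_score

# Hypothetical validation Dice of the aggregated model over four rounds.
round_scores = [0.52, 0.58, 0.55, 0.60]
teacher, best = None, float("-inf")
history = []
for rnd, score in enumerate(round_scores):
    teacher, best = select_teacher(f"central_round_{rnd}", score, teacher, best)
    history.append(teacher)
```

In round 2 of this toy run the aggregated model scores lower than the round 1 model, so the round 1 model remains the teacher until a better aggregate appears.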

Experiments
In this section, we present the empirical analysis of the influence of noisy labels on MS lesion segmentation and the performance of our methods.

Datasets
Two multi-site datasets are employed in our experiments. The first dataset is MSSEG-2016 [33], which was used in the MICCAI MS lesion segmentation challenge in 2016. It consists of 53 MRI images from four different data centers, with labels determined through the consensus of seven expert annotators. The original training data contains 15 cases from three data centers, while the test set contains 38 MRI images, including 30 samples from the training centers and 8 extra cases from an additional center. As the domain difference is not the key aspect of our paper, we excluded the 8 cases from the additional center and followed the original settings of the remaining three sites by assigning 5 training samples to each site. In contrast with the original data split, 2 samples from each site's original test set were extracted to form the validation set, and the remaining 24 samples were used for testing and reporting the results. To simulate varying levels of label corruption across different sites, different strategies were adopted to introduce noise to the training labels. The original clean labels were retained for the first site. For the second and third sites, we introduced corruptions by eroding or dilating, respectively, the lesion masks of 40% of the training samples by two voxels (referred to as "Label Erosion" or "Label Dilation") to mimic the noise caused by inconsistent mask boundaries. Given that MS lesions are typically small in size, dilation or erosion by two voxels can be considered a reasonable form of noise.
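To give a sense of how strongly a two-voxel boundary shift affects a small lesion, the following sketch applies Label Erosion and Label Dilation to a toy 2D mask. It is an illustration only: the 4-connected structuring element, the 2D setting, and the helper names `erode` and `dilate` are assumptions, since the text does not state which morphological operator was used.

```python
import numpy as np

def _neighbours(mask):
    """Return the mask and its four axis-aligned shifts (zero padded)."""
    p = np.pad(mask, 1, constant_values=False)
    return (p[1:-1, 1:-1], p[:-2, 1:-1], p[2:, 1:-1], p[1:-1, :-2], p[1:-1, 2:])

def erode(mask, iterations=2):
    """Binary erosion with a 4-connected (cross) structuring element."""
    out = np.asarray(mask, dtype=bool)
    for _ in range(iterations):
        c, up, down, left, right = _neighbours(out)
        out = c & up & down & left & right
    return out

def dilate(mask, iterations=2):
    """Binary dilation with a 4-connected (cross) structuring element."""
    out = np.asarray(mask, dtype=bool)
    for _ in range(iterations):
        c, up, down, left, right = _neighbours(out)
        out = c | up | down | left | right
    return out

mask = np.zeros((9, 9), dtype=bool)
mask[2:7, 2:7] = True      # a 5 x 5 toy "lesion" of 25 voxels
eroded = erode(mask)       # Label Erosion by two voxels
dilated = dilate(mask)     # Label Dilation by two voxels
print(int(mask.sum()), int(eroded.sum()), int(dilated.sum()))
```

For this toy lesion, two-voxel erosion leaves a single voxel, which illustrates why erosion noise tends to push a model towards missing small lesions.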
The second dataset (SNAC-MS) comprises 123 cases collected internally from two sites, with labels annotated by two professional neuroimaging analysts in parallel, followed by cross-checking by another expert annotator. To avoid the influence of domain differences and non-independent and non-identical distributions across centers, the dataset was divided into four sites regardless of the original sites of the collection, with each site containing 20 samples for training. We reserved 21 cases for testing, and the remaining 22 samples were used as the center validation set. Similar to the first dataset, we retained the clean labels for the first site. For the second and third sites, we applied Label Erosion to 40% or 80% of the training labels, respectively. The training labels in the fourth site were corrupted by randomly removing 40% of the lesions (referred to as "Label Removal") to mimic the noise caused by conservative annotators. Noise in the form of additional false positive lesions was not considered, as false positive lesions are rarely labeled by human annotators according to our experience and to previous research [34], in which specificity is high in diagnosis when MRI images are presented.
All samples in both datasets consist of co-registered FLAIR and T1 images as input. Skull stripping was performed on these images, and all generated brain images were registered to the MNI space [35] using FLIRT [36] for preprocessing and standardization purposes. The performance was reported on the test set with the same pre-processing steps using clean labels.

Implementation Details
For federated learning, all sites utilized the same hyperparameter settings. Inspired by prior studies [28], we adopted a 2D U-Net as the segmentation network, processing slices individually from the original 3D MRI images to accommodate the large size of each FLAIR and T1 image. The U-Net has a depth of 3, i.e., three consecutive dual-conv modules were incorporated in both the encoder and the decoder. The base number of channels is set to 32 (i.e., the "C" shown in Figure 3). The network was optimized with the Adam optimizer with a learning rate of 9.0e-4 for the MSSEG-2016 dataset and 7.0e-4 for the SNAC-MS dataset, with a weight decay of 1.0e-5. The batch size was set to 32 and training ran for a total of 50 epochs, with models aggregated after each epoch. Regarding the label correction strategy, $\{\tau_0, \tau_1\}$ was set to {0.90, 0.65} for the MSSEG-2016 dataset and {1, 0.65} for the SNAC-MS dataset, values that were selected on the validation datasets. Prior to applying the label correction strategy, 10 warm-up epochs were run without label correction to ensure that the generated pseudo labels were of reasonable quality.
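The settings above can be collected in a small configuration object, sketched here for reference. The class name `TrainConfig`, its field names, and the `correction_active` helper are hypothetical; the sketch also assumes that epochs are counted from zero and that the 10 warm-up epochs are part of the 50 training epochs, which the text does not state explicitly.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    learning_rate: float
    tau0: float                 # threshold for lesion -> background correction
    tau1: float                 # threshold for background -> lesion correction
    weight_decay: float = 1.0e-5
    batch_size: int = 32
    epochs: int = 50            # one aggregation round per epoch
    warmup_epochs: int = 10     # no label correction during warm-up
    unet_depth: int = 3
    base_channels: int = 32

    def correction_active(self, epoch: int) -> bool:
        """Label correction starts once the warm-up epochs are finished."""
        return epoch >= self.warmup_epochs

MSSEG_2016 = TrainConfig(learning_rate=9.0e-4, tau0=0.90, tau1=0.65)
SNAC_MS = TrainConfig(learning_rate=7.0e-4, tau0=1.0, tau1=0.65)
```

With `tau0` equal to 1, as in the SNAC-MS setting, no background probability can exceed the threshold, so only the background-to-lesion correction is effectively in use.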

Evaluation Metrics
To quantitatively evaluate and compare the performance of different methods, we employed four evaluation metrics: subject-level Dice Similarity (P-Dice), voxel-level Dice Similarity (V-Dice), and subject-level average Precision and Recall.
where $P$ is the predicted lesion map, $G$ is the ground truth lesion map, and $N$ is the total number of subjects. $|\cdot|$ denotes the sum of the elements within the tensor, and all summations $\sum$ are taken across all test cases. P-Dice and V-Dice are selected as the main figures for reporting performance. In addition, subject-level Precision and Recall are used to reflect the methods' performance on subjects, regardless of individual lesion volumes.
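One plausible reading of these four metrics is sketched below, under the assumption that subject-level scores are computed per subject and then averaged, whereas the voxel-level Dice pools intersections and volumes over all subjects before taking the ratio. The function name `evaluate`, the small stabilising constant `eps`, and the toy masks are illustrative assumptions.

```python
import numpy as np

def evaluate(preds, gts, eps=1e-8):
    """Return P-Dice, V-Dice, Precision and Recall for lists of binary masks."""
    inter = np.array([np.sum(p * g) for p, g in zip(preds, gts)], dtype=float)
    p_sum = np.array([np.sum(p) for p in preds], dtype=float)
    g_sum = np.array([np.sum(g) for g in gts], dtype=float)
    return {
        # Subject-level: compute per subject, then average over subjects.
        "P-Dice": float(np.mean(2.0 * inter / (p_sum + g_sum + eps))),
        # Voxel-level: pool all voxels of all subjects before the ratio.
        "V-Dice": float(2.0 * inter.sum() / (p_sum.sum() + g_sum.sum() + eps)),
        "Precision": float(np.mean(inter / (p_sum + eps))),
        "Recall": float(np.mean(inter / (g_sum + eps))),
    }

# Two toy subjects: one small lesion that is over-segmented, one large
# lesion that is segmented perfectly.
preds = [np.array([1, 1, 0, 0]), np.array([1, 1, 1, 1])]
gts = [np.array([1, 0, 0, 0]), np.array([1, 1, 1, 1])]
scores = evaluate(preds, gts)
```

Under this reading, every subject contributes equally to P-Dice, while V-Dice is dominated by subjects with a large lesion volume, which is why the two values differ in the toy example.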

Influence of Noisy Labels
As there is no prior work on federated MS lesion segmentation with noisy labels, we begin by analyzing the influence of noisy labels on both the centralized learning paradigm and the federated learning paradigm. Under the centralized learning paradigm, the training data from different sites are merged for training, while the validation and test sets remain the same as in the federated learning paradigm. The experimental results are presented in Table 1.
For the MSSEG-2016 dataset, federated training demonstrates similar performance to the centralized training scheme in the absence of label noise, demonstrating the feasibility of learning discriminative MS lesion segmentation models without data sharing across sites to maintain privacy. However, in the presence of label noise, the performance of the federated learning paradigm deteriorated more than that of centralized training. Specifically, centralized training achieves a P-Dice of 0.5886 and a V-Dice of 0.6904, which are 0.0387 and 0.0492 lower than those of the centralized model trained with clean labels. In contrast, federated training with noisy labels only achieves a P-Dice of 0.5157 and a V-Dice of 0.6530, showing decreases of 0.0951 and 0.0723 compared with the federated model trained with clean labels. This deterioration can be attributed to the inherent problem in federated learning, as previously described in [4]. The site models are more prone to overfitting the label noise during their individual updates, which subsequently affects the performance of the averaged model and leads to further performance degradation. Moreover, it is evident that the introduced label corruption significantly decreases the recall in both centralized and federated training, suggesting that many lesions are overlooked by the model. However, the precision of the centralized model, when trained with noisy labels, shows an improvement compared to centralized training with clean labels. This observation indicates that the combined effects of dilation and erosion noise may have balanced out in the centralized training. However, this phenomenon is not observed in the federated learning paradigm.
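The clean-label scores implied by the figures quoted above can be recovered with simple arithmetic. The sketch below uses only the numbers stated in the text for MSSEG-2016; the derived values are computed here and are not read from Table 1, and the variable names are illustrative.

```python
# Scores with noisy labels and the reported drops relative to clean-label
# training, as quoted in the text for MSSEG-2016.
noisy = {
    "centralized": {"P-Dice": 0.5886, "V-Dice": 0.6904},
    "federated": {"P-Dice": 0.5157, "V-Dice": 0.6530},
}
drop = {
    "centralized": {"P-Dice": 0.0387, "V-Dice": 0.0492},
    "federated": {"P-Dice": 0.0951, "V-Dice": 0.0723},
}

# Implied clean-label score = noisy score + reported drop.
implied_clean = {
    s: {m: round(noisy[s][m] + drop[s][m], 4) for m in noisy[s]} for s in noisy
}
# Relative drop with respect to the implied clean-label score.
relative_drop = {
    s: {m: drop[s][m] / implied_clean[s][m] for m in noisy[s]} for s in noisy
}

for setting in noisy:
    for metric in noisy[setting]:
        print(setting, metric, implied_clean[setting][metric],
              f"{100 * relative_drop[setting][metric]:.1f}%")
```

The implied clean-label scores of the two paradigms are close to each other, while the relative drop is larger for federated training on both metrics, consistent with the statement that federated training is affected more by the label noise.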
For the SNAC-MS dataset, the performance also drops when there are label corruptions for both centralized training and federated training. Unlike the observation in the MSSEG-2016 dataset, both training paradigms exhibit improved precision with a sacrifice in recall. This phenomenon may be related to the specific label corruption operations applied in this dataset. Both Label Erosion and Label Removal tend to generate false negatives, biasing the model towards the eroded lesion masks. Such an effect is more obvious in the federated learning paradigm, further demonstrating that it is more susceptible to overfitting to site-specific label noise under standard federated learning settings.

Effectiveness of CELC Strategy
In this section, we evaluate and report the performance of our method (CELC) in handling noisy labels on both the MSSEG-2016 and SNAC-MS datasets. To provide a comprehensive evaluation, we compare our method with four representative approaches in the noisy labeling area:
• Label Smoothing [37] is a baseline strategy for noisy labeling in the natural image classification area, which softens the one-hot class label with a weighted constant during the training process to decrease overfitting.
• Label Correction [31] is similar to Label Smoothing, except that it uses the weighted combination of the predicted label and the original label as the learning target.
• ProSelfLC [31] builds upon the Label Correction method and progressively updates the weight of the prediction in the learning target by considering both the learning time (i.e., passed iterations divided by total iterations) and the prediction entropy.
• PINT [38] is a recent noisy labeling method for medical image segmentation, which estimates both the voxel-level label quality and image-level label quality by measuring the prediction uncertainties with an auxiliary network.
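The difference between the first two baselines can be made concrete with a short sketch. It shows the standard form of label smoothing, which mixes the one-hot label with a uniform distribution, next to a soft label correction that mixes it with the model prediction. The mixing weight `eps`, the function names, and the toy prediction are assumptions for illustration.

```python
import numpy as np

def label_smoothing(one_hot, eps=0.1):
    """Mix the one-hot label with a uniform distribution over the classes."""
    k = one_hot.shape[-1]
    return (1.0 - eps) * one_hot + eps / k

def label_correction(one_hot, pred, eps=0.1):
    """Mix the one-hot label with the model's predicted distribution."""
    return (1.0 - eps) * one_hot + eps * pred

y = np.array([0.0, 1.0])   # one-hot label: (background, lesion) = lesion
p = np.array([0.8, 0.2])   # hypothetical prediction favouring background
smoothed = label_smoothing(y)        # [0.05, 0.95]
corrected = label_correction(y, p)   # [0.08, 0.92]
```

Both targets stay soft and remain dominated by the given label, whereas a hard correction would flip the label outright once the prediction is confident enough.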
The experimental results of our model and these comparison methods are presented in Table 2. As observed, Label Smoothing [37], Label Correction [31], ProSelfLC [31], and PINT [38] do not consistently boost the performance with noisy-labeled training data for both datasets, which reveals the challenge of training with noisy labeled data under the federated learning paradigm and the difficulty of adapting existing noisy labeling methods developed for natural image classification tasks to medical image segmentation scenarios. Specifically, the Label Correction method achieved the highest precision scores in both datasets, but recall was largely sacrificed, leading to suboptimal P-Dice and V-Dice scores. Nevertheless, our method CELC outperformed all methods and significantly improved the performance of the segmentation network, i.e., 0.0925 and 0.0626 performance gain on the MSSEG-2016 dataset and 0.0821 and 0.0711 performance gain on the SNAC-MS dataset in terms of the P-Dice and V-Dice scores.
By comparing with the results in Table 1, we observe that our method CELC achieves a performance close to that of the model trained with clean labels under the federated learning scheme on both datasets, which demonstrates that our CELC strategy is robust to corrupted labels. Figure 4 presents an example visualization of the segmentation results obtained by the basic U-Net model and our method, along with the provided ground-truth label from the SNAC-MS dataset. This visualization reveals that our method achieves reliable segmentation performance despite the presence of severe label noise in the training data. In particular, our method CELC consistently identifies more accurate lesion regions, while the basic U-Net model tends to make incorrect predictions on the lesion boundary and ignore small lesion regions.

Comparison of CELC and DHLC
To assess the individual contributions of the Decoupled Hard Label Correction (DHLC) and the Centrally Enhanced Label Correction (CELC) strategies to the observed performance improvements, we further conducted a comparison study of DHLC and CELC, as presented in Table 2. The variant "U-Net+DHLC" incorporates the hard label correction strategy during the training of the U-Net segmentation model, focusing solely on the site level. As can be observed, "U-Net+DHLC" already achieves promising performance by mitigating the influence of noisy labels on network optimization and preventing the overfitting issues associated with federated learning using noisy labels, as discussed in Section 4.2. However, leveraging global knowledge of the federated learning paradigm further enhances the model's robustness. By utilizing the best central model for label correction, our method (i.e., "U-Net+CELC") achieves additional performance improvements of 0.0184 and 0.0213 on the MSSEG-2016 dataset and 0.004 and 0.0137 on the SNAC-MS dataset in terms of the P-Dice and V-Dice scores when compared to "U-Net+DHLC".

Analysis

Influence of the Correction Threshold
The correction threshold is a critical hyperparameter in our method as it determines whether a given voxel label should be corrected. In this section, we analyze the influence of the correction threshold by varying its value for the training process. Here, we present the analysis using the threshold $\tau_1$ on the SNAC-MS dataset as an example while fixing $\tau_0 = 1.0$ as the control variable. As shown in Table 3, our method achieves stable performance when the threshold is set between 0.55 and 0.75. This demonstrates the robustness of our method to hyperparameters and makes the method more readily applicable in other scenarios. When the threshold is too large (e.g., above 0.8 for $\tau_1$), the performance noticeably drops as the model fails to correct a sufficient number of false annotations.

Influence of the Warm-up Epoch
We also analyzed the influence of another hyperparameter in our method, namely the number of warm-up epochs, as reported in Table 4.

Dealing with Real Annotation Inconsistency
Besides studying the artificial label noise used in the previous experiments, it is also important to evaluate the model's performance on a dataset with real annotations from multiple annotators. We used the MSSEG-2016 dataset as an example, which was annotated by seven annotators in parallel. To represent the label noise, we identified the three annotators from the three centers whose labels diverge the most (i.e., annotator 4 of center 1, annotator 6 of center 7, and annotator 7 of center 8) and conducted the same experiments described in Section 4.3. The results, reported in Table 5, show that both our CELC and DHLC methods outperform the other methods in dealing with annotation variability, highlighting the robustness of our methods within the federated learning framework.
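The text does not state how annotator divergence was measured. As one plausible illustration only, the sketch below ranks annotators by their Dice overlap with a consensus mask, treating the lowest overlap as the most divergent. The helper names `dice` and `rank_by_divergence`, the flattened-mask representation, and the toy data are all assumptions introduced for this example.

```python
def dice(mask_a, mask_b):
    """Dice overlap between two flattened binary masks (lists of 0/1)."""
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a == 1 and b == 1)
    total = sum(mask_a) + sum(mask_b)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


def rank_by_divergence(annotations, consensus):
    """Return annotator ids ordered from most to least divergent.

    annotations: dict mapping annotator id to a flattened binary mask
    consensus  : flattened binary consensus mask
    """
    return sorted(annotations, key=lambda name: dice(annotations[name], consensus))


consensus = [1, 1, 0, 0]
annotations = {
    "A": [1, 1, 0, 0],  # identical to the consensus
    "B": [1, 0, 0, 0],  # underestimates the lesion
    "C": [0, 0, 1, 1],  # no overlap with the consensus
}
print(rank_by_divergence(annotations, consensus))
```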

Conclusion
In this work, we addressed the challenge of multiple sclerosis lesion segmentation with noisy annotations within the federated learning paradigm. Our approach aimed to enhance model robustness in the presence of noisy labels, a situation highly prevalent in this application, and to address the privacy concerns that hinder cross-center collaboration. We conducted the first investigation of this problem in the federated learning setting. To handle the highly imbalanced distribution of lesions and background regions, we proposed a site-level hard label correction method (DHLC). This method treats positive and negative label corrections separately and directly revises the labels of voxels with high-confidence predictions that differ from the annotations. Moreover, to mitigate the risk of overfitting to label noise during site updating in federated learning, we enhanced the label correction by incorporating predictions from the best-performing central model (an approach referred to as CELC). Experimental results on the MSSEG-2016 dataset and our in-house dataset (SNAC-MS) demonstrated the effectiveness and robustness of our proposed methods in combating noisy labels during federated learning.
Although the proposed method is effective, this work still has some limitations that need to be addressed in future research. First, the largest dataset used in this work contains 123 samples distributed across four sites, which is smaller than real application scenarios. It would be useful to evaluate the method on larger-scale datasets with more samples and more sites. Second, while different noise types were considered to simulate annotation differences, the domain gap among sites may also arise from variations in data acquisition devices, which is common in real applications but was not considered in this work. Thus, it is necessary to develop more robust models that are capable of handling both label noise and image contrast differences among sites.
In conclusion, our study contributes to the understanding of multiple sclerosis lesion segmentation with noisy annotations within the federated learning framework. The proposed method demonstrates promising results and opens avenues for further research in federated segmentation tasks without meticulous labels or knowledge of label noise types and levels.

Figure 1: An example of noisy labels from different data centers. Each row represents a single case, and the three columns show the original slices, consensus labels, and noisy labels, respectively. The annotations in red are the consensus of several expert image analysts, and the annotations in blue and green are from different annotators and illustrate underestimation and overestimation of the lesions, respectively. Best viewed in color.

Figure 2: An illustration of the basic U-Net architecture, which is built upon cascaded CNN layers with an encoder-decoder structure to capture both low-level contextual patterns and high-level semantic information. Each orange arrow is a convolutional module, including a convolutional layer ("Conv"), a normalization layer ("Norm"), and an activation layer ("PReLU", which stands for Parametric Rectified Linear Unit). Dual arrows represent duplicated convolutional modules (referred to as the "dual-conv module"). "C" denotes the channels of the feature maps, and the dimensions of the feature maps are controlled by the strides in the convolutional layers and by the upsampling layers (marked with green arrows).
Unlike the traditional deep learning training strategy, training the U-Net within the federated learning framework involves multiple sites (or local nodes, such as hospitals) and a central server that aggregates the local site models. At the beginning of federated training, each site initializes a local U-Net segmentation model with the same architecture and parameters as the central node. During the federated training process, each site optimizes its local model using its site-specific dataset for a certain number of iterations. Subsequently, the site model is uploaded to the central node, where all site models are evenly aggregated to create a more discriminative central model. The merged central model is then distributed back to all sites for further local updates. This process of local optimization, model uploading, model aggregation, and model distribution is performed iteratively until the model converges at each site or the predetermined number of local training epochs is reached.
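The round structure described above (local optimization, upload, even aggregation, redistribution) can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: model parameters are represented as a dict of float lists, and `toy_local_update` is a hypothetical stand-in for real site-level U-Net training.

```python
import copy


def average_site_models(site_weights):
    """Evenly average parameter dicts {name: list of floats} from all sites."""
    n_sites = len(site_weights)
    central = {}
    for name in site_weights[0]:
        n_values = len(site_weights[0][name])
        central[name] = [
            sum(weights[name][i] for weights in site_weights) / n_sites
            for i in range(n_values)
        ]
    return central


def federated_training(central, site_datasets, local_update, rounds):
    """Run a fixed number of aggregation rounds and return the central model."""
    for _ in range(rounds):
        # Each site starts from its own copy of the current central model.
        site_weights = [
            local_update(copy.deepcopy(central), data) for data in site_datasets
        ]
        # The center merges the uploaded site models with equal weight.
        central = average_site_models(site_weights)
    return central


def toy_local_update(weights, data):
    """Hypothetical local step: move each weight halfway to the site's data mean."""
    target = sum(data) / len(data)
    weights["w"] = [w + 0.5 * (target - w) for w in weights["w"]]
    return weights


sites = [[1.0, 1.0], [3.0, 3.0]]
print(federated_training({"w": [0.0]}, sites, toy_local_update, rounds=1))
```

Only model parameters move between the sites and the center in this loop; the site datasets are read solely inside `local_update`, which is the property that lets sites collaborate without sharing images or labels.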

Figure 3: Overall framework for the DHLC (red arrows, including EMA) and CELC (green arrows) strategies under the federated learning architecture. The two network symbols in the figure denote the teacher and student networks, respectively. Best viewed in color. More details about CELC are given in Algorithm 1.

Figure 4: Visualization of different MRI slices. Column one shows the input MRI slices (FLAIR), column two the segmentation results of the basic U-Net model, column three the segmentation results of our method, and column four the ground-truth labels. Best viewed by zooming in on small lesions, e.g., in the fourth to sixth rows.

Algorithm 1
Training and Optimization Pipeline of CELC
Input: the sites that participate in federated learning, each with a labeled (unknown noisy or clean) training dataset; the maximum number of aggregation rounds; the number of training iterations at each site per round; a clean labeled validation set at the center; and the parameters of the teacher model and the student model.
1: Randomly initialize the teacher and student parameters at the center and assign them to all sites
2: for each round, starting from 0 and stopping before the maximum number of aggregation rounds, do
...
6: Sample a batch from the site's training dataset
7: Feed the batch images to the model and get the prediction scores
...
14: Merge the site weights to obtain the central model
15: Update the teacher parameters with the best central model selected using the center validation set
16: Distribute the central model (teacher and student parameters) to each site
17: end for
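The teacher update at the end of each round (steps 14 to 16) can be sketched as below. This is a minimal illustration that assumes "best central model" means the central model with the highest score so far on the clean center validation set. The names `update_teacher` and `validate`, and the toy models carrying a precomputed score, are hypothetical.

```python
import copy


def update_teacher(teacher, best_score, central, validate):
    """Keep the best-scoring central model seen so far as the correction teacher.

    validate: callable returning a validation score (higher is better) for a model.
    Returns the (possibly updated) teacher and the best score so far.
    """
    score = validate(central)
    if score > best_score:
        # The new central model beats every earlier one: promote it to teacher.
        return copy.deepcopy(central), score
    # Otherwise keep the earlier teacher, so a noisy round cannot degrade it.
    return teacher, best_score


# Toy rounds: each central model carries a made-up validation score.
central_models = [
    {"round": 0, "score": 0.60},
    {"round": 1, "score": 0.72},
    {"round": 2, "score": 0.68},
]
teacher, best = None, float("-inf")
for central in central_models:
    teacher, best = update_teacher(teacher, best, central, lambda m: m["score"])
print(teacher["round"], best)
```

Selecting the teacher on a clean validation set means that a round in which aggregation absorbed label noise does not replace a better earlier teacher.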

Table 1
The influence of noisy labels on both the centralized and federated training paradigms. Metrics reported for each dataset (higher is better): P-Dice, V-Dice, Precision, and Recall.

Table 2
Overall comparison of the proposed method and existing noisy-label learning methods for combating label noise. Metrics reported for each dataset (higher is better): P-Dice, V-Dice, Precision, and Recall.

Table 3
Segmentation performance with different positive label correction thresholds (the threshold with subscript 1) on the SNAC-MS dataset using CELC.

Table 4
Lesion segmentation performance of CELC with different warm-up epochs on the SNAC-MS dataset.

As Table 4 shows, the model consistently achieves good performance with different numbers of warm-up epochs, as long as a warm-up is employed. This demonstrates the robustness of our method regardless of the stage at which it is applied during training.

Table 5
Lesion segmentation performance of CELC when trained on a real noisy dataset.