Learning from multiple annotators for medical image segmentation

Highlights

• A novel deep CNN architecture is proposed for jointly learning the expert consensus label and the annotators' labels. The proposed architecture (Fig. 1) consists of two coupled CNNs: one estimates the expert consensus label probabilities, and the other models the characteristics of individual annotators (e.g., a tendency to over-segment, or to mix up different classes) by estimating pixel-wise confusion matrices (CMs) on a per-image basis. Unlike STAPLE [25] and its variants, our method models, and disentangles with deep neural networks, the complex mappings from the input images to the annotator behaviours and to the expert consensus label.

• The parameters of our CNNs are "global variables" optimised across different image samples; this enables the model to robustly disentangle the annotators' mistakes from the expert consensus label based on correlations between similar image samples, even when the number of annotations per image is small (e.g., a single annotation per image). This would not be possible with STAPLE [25] and its variants [5], [8], where the annotators' parameters are estimated on every target image separately.

• This paper extends the preliminary version of our method, presented at the Thirty-fourth Conference on Neural Information Processing Systems (NeurIPS) [30], by extensively evaluating the model on a newly created real-world multiple sclerosis lesion dataset (QSMSC at UCL: Queen Square Multiple Sclerosis Centre at UCL, UK). This dataset contains manual segmentations from 4 different annotators (3 radiologists with different skill levels and 1 expert who generated the expert consensus label). Additionally, we present a comprehensive discussion of our model's potential applications (e.g., estimating annotator quality and annotation quality), future work we plan to explore, and the model's potential limitations.


Introduction
The performance of downstream supervised machine learning models is known to be influenced by substantial inter-reader variability when segmenting anatomical structures in medical images [26]. This issue is especially acute in the medical imaging domain, where labelled data are commonly scarce due to the high cost of annotation. For instance, because of the heterogeneity in lesion location, size, shape, and anatomical variability across patients [29], accurate identification of multiple sclerosis (MS) lesions in MRI is difficult even for experienced experts. Another example [21] shows that glioblastoma (a kind of brain tumour) segmentation had an average inter-reader variability of 74-85%. Segmentation annotations of structures in medical images suffer from substantial annotation variation, which is exacerbated by disparities in bias and level of expertise [18]. As a result, despite the quantity of medical imaging data accumulated over almost two decades of digitisation, the world still lacks access to data with curated labels usable by machine learning [14], necessitating intelligent algorithms that learn robustly from such noisy annotations.
Different pre-processing techniques are often used to curate segmentation annotations by fusing labels from different experts in order to minimise inter-reader differences. The most basic and widely used approach is majority vote, in which the most representative expert opinion is treated as the expert consensus label. In the aggregation of brain tumour segmentation labels, a smarter variant [21] that accounts for class similarity has proven effective.
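As a minimal illustration (a toy numpy sketch with a hand-made 2×2 example rather than real annotations), per-pixel majority voting can be written as:

```python
import numpy as np

def majority_vote(labels):
    """Fuse segmentations from R annotators by per-pixel majority vote.

    labels: integer array of shape (R, H, W) holding class indices.
    Returns an (H, W) array with the most frequent class per pixel
    (ties resolved in favour of the smallest class index).
    """
    n_classes = int(labels.max()) + 1
    # One-hot encode each annotation and sum the votes over annotators.
    votes = np.stack([(labels == c).sum(axis=0) for c in range(n_classes)])
    return votes.argmax(axis=0)

# Three annotators label a 2x2 binary mask; they disagree on two pixels.
ann = np.array([[[0, 1], [1, 1]],
                [[0, 1], [0, 1]],
                [[0, 0], [1, 1]]])
fused = majority_vote(ann)  # -> [[0, 1], [1, 1]]
```

Note that the vote treats all annotators as equally reliable, which is exactly the limitation the STAPLE family of methods addresses.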
However, one major limitation of such approaches is that all experts are presumed to be equally trustworthy. To address this, [25] proposed a label fusion approach called STAPLE. This method explicitly models individual expert reliability and uses that knowledge to "weight" their judgments in the label aggregation step. After demonstrating its superiority over traditional majority-vote pre-processing in various applications, STAPLE has been the go-to label fusion method in the construction of public medical image segmentation datasets, such as the ISLES [27], MSSeg [11], and Gleason'19 [12] datasets. Asman further extended this strategy in [4] by accounting for voxel-wise consensus to solve the issue of annotators' reliability being under-estimated. Another extension [5] was proposed to model an annotator's reliability across different pixels in an image. More recently, STAPLE has been modified in numerous ways to encode information from the underlying images into the label aggregation process in the context of multi-atlas segmentation problems [2,16], where image registration is used to warp segments from labelled images ("atlases") onto a new scan. STEP, which further incorporates the local morphological similarity between atlases and target images [8], is a notable example, and several extensions of this approach, such as [1,6], have subsequently been examined. However, all of the previous label fusion methods share one major limitation: they have no way to integrate information across distinct training images. This severely restricts their scope to situations in which each image has a reasonable number of annotations from multiple experts, which can be prohibitively expensive in practice. Moreover, relatively simplistic functions are used to model the relationship between the observed noisy annotations, the expert consensus label, and the reliability of experts, which may fail to capture the complex characteristics of human annotators.
In this paper, we introduce and fully evaluate a novel end-to-end segmentation approach that predicts the reliability of multiple human annotators and the expert consensus label from noisy labels alone. We use the Morpho-MNIST framework [9] to perform morphometric operations on the MNIST dataset to simulate a variety of annotator types for evaluation. We also demonstrate the potential on several public medical imaging datasets, namely (i) the MS lesion segmentation dataset (ISBI2015) from the ISBI 2015 challenge [7], (ii) the brain tumour segmentation dataset (BraTS) [21], and (iii) the lung nodule segmentation dataset (LIDC-IDRI) [3]. Furthermore, we create a practical MS lesion segmentation dataset with 4 different annotators (3 radiologists with different skill levels and 1 expert who generated the expert consensus label) to evaluate our model's performance on real-world data. Experiments on all datasets demonstrate that our method consistently leads to better segmentation performance than widely adopted label fusion methods and other relevant baselines, especially when the number of available labels per image is low and the degree of annotator disagreement is high. The main contributions of our approach are: (1) A novel deep CNN architecture is proposed for jointly learning the expert consensus label and the annotators' labels. The proposed architecture (Fig. 1) consists of two coupled CNNs: one estimates the expert consensus label probabilities, and the other models the characteristics of individual annotators (e.g., a tendency to over-segment, or to mix up different classes) by estimating pixel-wise confusion matrices (CMs) on a per-image basis. Unlike STAPLE [25] and its variants, our method models, and disentangles with deep neural networks, the complex mappings from the input images to the annotator behaviours and to the expert consensus label.

(2) The parameters of our CNNs are "global variables" optimised across different image samples; this enables the model to robustly disentangle the annotators' mistakes from the expert consensus label based on correlations between similar image samples, even when the number of annotations per image is small (e.g., a single annotation per image). This would not be possible with STAPLE [25] and its variants [5,8], where the annotators' parameters are estimated on every target image separately.

(3) This paper extends the preliminary version of our method, presented at the Thirty-fourth Conference on Neural Information Processing Systems (NeurIPS) [30], by extensively evaluating the model on a newly created real-world multiple sclerosis lesion dataset (QSMSC at UCL: Queen Square Multiple Sclerosis Centre at UCL, UK). This dataset contains manual segmentations from 4 different annotators (3 radiologists with different skill levels and 1 expert who generated the expert consensus label). Additionally, we present a comprehensive discussion of our model's potential applications (e.g., estimating annotator quality and annotation quality), future work we plan to explore, and the model's potential limitations.

Problem set-up
In this work, we look at the problem of developing a supervised segmentation model using noisy labels provided by multiple human annotators. In particular, we explore a situation in which a set of images {x_n ∈ R^{W×H×C}}_{n=1}^{N} (with W, H, C denoting the width, height and channels of the image) is assigned noisy segmentation labels {ỹ_n^{(r)} ∈ Y^{W×H} : r ∈ S(x_n)}_{n=1,...,N} from multiple annotators, where ỹ_n^{(r)} denotes the label from annotator r ∈ {1, ..., R}, S(x_n) denotes the set of all annotators who labelled image x_n, and Y = {1, 2, ..., L} denotes the set of classes.
We suppose that each image x has been annotated by at least one annotator, so the training data comprise the combination of images, noisy labels, and annotator identities (i.e., which label was obtained from whom).
We also emphasise that at inference time the goal is to segment a previously unseen, unlabelled test image, not to fuse multiple available labels as is typically done in multi-atlas segmentation techniques [16].

Probabilistic model and proposed architecture
In this section, we present the probabilistic model of the observed noisy labels from various annotators. Given the input image, we make two key assumptions: (1) annotators are statistically independent, and (2) annotations over different pixels are independent. With these assumptions, the probability of observing the noisy labels {ỹ^{(r)}}_{r∈S(x)} on x factorises as:

p({ỹ^{(r)}}_{r∈S(x)} | x) = ∏_{r∈S(x)} ∏_{w,h} p(ỹ^{(r)}_{wh} | x),    (1)

where ỹ^{(r)}_{wh} ∈ {1, ..., L} denotes the (w, h)-th element of ỹ^{(r)} ∈ Y^{W×H}. The probability of observing each noisy label at each pixel (w, h) is then rewritten as:

p(ỹ^{(r)}_{wh} | x) = ∑_{y_{wh}=1}^{L} p(ỹ^{(r)}_{wh} | y_{wh}, x) · p(y_{wh} | x),    (2)

where p(y_{wh} | x) denotes the expert consensus label distribution over the (w, h)-th pixel in image x, and p(ỹ^{(r)}_{wh} | y_{wh}, x) describes the noisy labelling process by which annotator r corrupts the expert consensus label. In particular, we refer to the L × L matrix whose (i, j)-th element is defined by the second term, a^{(r)}(x, w, h)_{ij} := p(ỹ^{(r)}_{wh} = i | y_{wh} = j, x), as the CM of annotator r at pixel (w, h) in image x.
We present a CNN-based architecture for modelling the different constituents of the joint probability distribution above. It consists of two components: a segmentation network, parametrised by θ, that estimates the expert consensus label probability map p̂_θ(x), and an annotator network, parametrised by φ, that estimates the pixel-wise CMs Â^{(r)}_φ(x) of each annotator r. The product p̂^{(r)}_{θ,φ}(x) := Â^{(r)}_φ(x) • p̂_θ(x) represents the estimated segmentation probability map of the corresponding annotator. Note that here "•" denotes element-wise matrix multiplication in the spatial dimensions W, H. At inference time, we use the output of the segmentation network, p̂_θ(x), to segment test images.
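The composition of the two networks' outputs can be sketched numerically as follows (a toy numpy sketch in which random tensors stand in for the network outputs; the shapes and variable names are our illustration, not the paper's code):

```python
import numpy as np

L = 3        # number of classes
W, H = 4, 4  # toy spatial size

rng = np.random.default_rng(0)

# Stand-in for the segmentation network output p_theta(x):
# consensus class probabilities at every pixel.
p_consensus = rng.random((W, H, L))
p_consensus /= p_consensus.sum(axis=-1, keepdims=True)

# Stand-in for the annotator network output: one L x L confusion
# matrix per pixel, with columns summing to 1 so that
# A[..., i, j] = p(noisy label = i | consensus label = j).
A = rng.random((W, H, L, L))
A /= A.sum(axis=-2, keepdims=True)

# "A . p": element-wise matrix multiplication over the spatial dims,
# giving the annotator's predicted label distribution at every pixel.
p_annotator = np.einsum('whij,whj->whi', A, p_consensus)

# Column-stochastic CMs map distributions to distributions.
assert np.allclose(p_annotator.sum(axis=-1), 1.0)
```

Because each per-pixel CM is column-stochastic, the annotator's predicted map remains a valid probability distribution at every pixel.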

Learning spatial confusion matrices and expert consensus label
Recently, several combined loss functions have been designed and used to solve different problems [13]. In this section, we present the details of how we jointly optimise the parameters of the segmentation network, θ, and the parameters of the annotator network, φ. In short, we use stochastic gradient descent to minimise the negative log-likelihood of the probabilistic model plus a regularisation term. A more detailed description follows.

Given training inputs {x_n}_{n=1}^{N} and noisy labels {ỹ^{(r)}_n} for r = 1, ..., R, we optimise the parameters {θ, φ} by minimising the negative log-likelihood (NLL), −∑_{n=1}^{N} log p({ỹ^{(r)}_n}_{r∈S(x_n)} | x_n). From Eqs. 1 and 2, this optimisation objective equates to the sum of cross-entropy losses between the observed noisy segmentations and the estimated annotator label distributions:

∑_{n=1}^{N} ∑_{r∈S(x_n)} CE( Â^{(r)}_φ(x_n) · p̂_θ(x_n), ỹ^{(r)}_n ).    (3)

Minimising the above encourages each annotator-specific prediction p̂^{(r)}_{θ,φ}(x) to be as close as possible to that annotator's true noisy label distribution p^{(r)}(x). This loss function alone, however, is insufficient to separate the annotation noise from the expert consensus label distribution: there are many pairs of Â^{(r)}_φ(x) and p̂_θ(x) whose product matches the true annotator's distribution p^{(r)}(x) for any input x (e.g., permutations of rows in the CMs). To combat this problem, inspired by [24], which addressed an analogous issue for classification tasks, we add the trace of the estimated CMs to the loss in Eq. 3 as a regularisation term. We thus optimise the combined loss:

∑_{n=1}^{N} ∑_{r∈S(x_n)} [ CE( Â^{(r)}_φ(x_n) · p̂_θ(x_n), ỹ^{(r)}_n ) + λ · tr( Â^{(r)}_φ(x_n) ) ],    (4)

where S(x_n) denotes the set of all annotators who labelled image x_n, tr(A) denotes the trace of matrix A, and λ is a scaling hyperparameter. The mean trace represents the average probability that a randomly selected annotator provides an accurate label. Minimising the trace encourages the estimated annotators to be as unreliable as possible, while minimising the cross-entropy ensures fidelity to the observed noisy annotations. We minimise this combined loss via stochastic gradient descent to learn both {θ, φ}.
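A sketch of the combined cross-entropy-plus-trace objective in numpy follows (the λ value, tensor shapes, and function name are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def combined_loss(A, p_consensus, noisy_labels, lam=0.01):
    """Cross-entropy to each annotator's noisy labels plus a trace penalty.

    A:            (R, W, H, L, L) per-annotator, per-pixel confusion matrices.
    p_consensus:  (W, H, L) estimated consensus probabilities.
    noisy_labels: (R, W, H) integer annotations.
    lam:          weight of the trace regulariser (illustrative value).
    """
    # Annotator label distributions: A @ p at every pixel.
    p_ann = np.einsum('rwhij,whj->rwhi', A, p_consensus)
    # Negative log-likelihood of the observed noisy labels.
    picked = np.take_along_axis(p_ann, noisy_labels[..., None], axis=-1)
    ce = -np.log(np.clip(picked, 1e-12, None)).mean()
    # Trace regulariser: mean diagonal mass of the estimated CMs.
    tr = np.trace(A, axis1=-2, axis2=-1).mean()
    return ce + lam * tr

# Sanity check: identity CMs plus a consensus that matches the labels
# give zero cross-entropy, leaving only the trace term (lam * L).
A = np.broadcast_to(np.eye(2), (1, 1, 1, 2, 2)).copy()
p = np.zeros((1, 1, 2)); p[..., 0] = 1.0
y = np.zeros((1, 1, 1), dtype=int)
loss = combined_loss(A, p, y)  # -> 0.01 * 2 = 0.02
```

In the sanity check the loss reduces to the trace term alone, which is exactly the component the optimiser then tries to shrink.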

Justification for the trace norm
We present a further justification for employing trace regularisation in this section. [24] demonstrated that if the average CM of the annotators is diagonally dominant and the cross-entropy term in the loss function is zero, then minimising the trace of the estimated CMs recovers the true CMs uniquely. However, their result addresses properties of the average CMs of both the annotators and the classifier over the entire population, rather than individual data samples. In the sample-specific regime, we show a comparable but slightly weaker result, which is more relevant here because we estimate the CMs of the annotators on every input image.
First, let us set up the notation. For brevity, for a given input image x ∈ R^{W×H×C} and a fixed pixel (i, j), we denote the true CM of annotator r at that pixel by A^{(r)} and its estimate by Â^{(r)}, and define the weighted averages A* := ∑_r π_r A^{(r)} and Â* := ∑_r π_r Â^{(r)}, where π_r ∈ [0, 1] is the probability that annotator r labels image x. Lastly, as stated earlier, we assume there is a single expert consensus label per image; thus the true L-dimensional probability vector at pixel (i, j) takes the form of a one-hot vector, i.e., p(x) = e_k for, say, class k ∈ {1, ..., L}. Then, the following result motivates the use of the trace regularisation:

Theorem 1. Suppose the annotators' segmentation probabilities are perfectly modelled for the given image, i.e., Â^{(r)} p̂_θ(x) = p^{(r)}(x) for r = 1, ..., R, and that the average true confusion matrix A* at the given pixel and its estimate Â* each have their k-th diagonal entry larger than any other entry in the same row. Then the minimisers of tr(Â*) satisfy Â^{(r)} e_k = A^{(r)} e_k for r = 1, ..., R, and such solutions are unique in the k-th column, where k is the correct pixel class.
The corresponding proof is provided in the supplementary material. The above result shows that if each estimated annotator's distribution Â^{(r)} p̂_θ(x) is very close to the true noisy distribution p^{(r)}(x) (which is encouraged by minimising the cross-entropy loss), and, for a given pixel, the average CM has its k-th diagonal entry larger than any other entry in the same row, then minimising its trace will drive the estimates of the k-th ('correct class') columns of the respective annotators' CMs to match the true values. Although this result is weaker than what was shown in [24] for the population scenario rather than individual samples, the single-ground-truth assumption indicates that the remaining values of the CMs are uniformly equal to 1/L, and therefore it suffices to recover the column of the correct class.
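A tiny numeric illustration of the identifiability argument (our own toy example with L = 2, not from the paper): when the consensus is one-hot at class k, matching the observed noisy distribution pins down only the k-th column of the CM, and the trace term is what discriminates among the remaining ambiguous candidates.

```python
import numpy as np

# Consensus is one-hot at class k = 0, and the observed noisy
# distribution of some annotator at this pixel is q.
p = np.array([1.0, 0.0])
q = np.array([0.7, 0.3])

# Any column-stochastic CM whose first column equals q reproduces q:
A1 = np.column_stack([q, [0.0, 1.0]])   # trace 0.7 + 1.0 = 1.7
A2 = np.column_stack([q, [0.5, 0.5]])   # trace 0.7 + 0.5 = 1.2
assert np.allclose(A1 @ p, q) and np.allclose(A2 @ p, q)

# The cross-entropy term alone cannot tell A1 from A2.  The trace
# regulariser prefers A2, the candidate with less diagonal mass,
# while the k-th column (the one Theorem 1 recovers) stays fixed
# to q in every candidate that explains the observation.
assert np.trace(A2) < np.trace(A1)
```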
We use identity matrices to encourage {Â^{(1)}, ..., Â^{(R)}} to be diagonally dominant, by training the annotator network to maximise the trace for a sufficient number of iterations as a warm-up period. Intuitively, the combination of the trace term and the cross-entropy separates the expert consensus label distribution from the annotation noise by locating the maximum amount of confusion that adequately explains the noisy observations.

Model implementation and optimization
Low-rank Approximation. Low-rank approximation is an effective model compression technique that reduces both parameter storage and computation, although for convolutional neural networks (CNNs), well-known methods such as Tucker or CP decomposition can degrade model accuracy because the decomposed layers hinder training convergence. We note that each spatial CM, Â^{(r)}_φ(x), contains WHL² variables, and calculating the corresponding annotator's prediction p̂^{(r)}_{θ,φ}(x) requires WHL² floating-point operations, potentially incurring a large time/space cost when the number of classes is large. Although this is not the focus of our study (we are concerned with medical imaging applications, for which the number of classes is typically smaller than 10), we also investigate a low-rank (rank = 1) approximation to alleviate this issue where applicable. Analogous to Chandra and Kokkinos's work [10], where a similar approximation was employed for estimating the pairwise terms in a densely connected CRF, we parametrise the spatial CM as a product of two smaller rectangular matrices B^{(r)}_{1,φ} and B^{(r)}_{2,φ} of size W × H × L × l, where l << L. In this case, the annotator network outputs B^{(r)}_{1,φ} and B^{(r)}_{2,φ} for each annotator in lieu of the full CM. Two separate rectangular matrices are used because the confusion matrices are not necessarily symmetric. Such a low-rank approximation reduces the total number of variables from WHL² to 2WHLl, and reduces the number of floating-point operations accordingly.

Training without Sample Bias. Machine learning methods learn model parameters automatically from the training samples, and can thus provide models that satisfy the requirements of various applications. In medical image computing, however, we often deal with longitudinal studies (e.g., our practical MS segmentation data): observational studies in which data are gathered from the same subjects repeatedly over an extended period of time. Sample bias occurs when the training data contain only a limited number of patients from the dataset, and thus do not reflect the realities of the environment in which the machine learning model will run. For example, certain facial recognition systems trained primarily on images of white men show considerably lower accuracy for women and people of other ethnicities.
In order to train our model without sample bias and make it robust under data generalisation, we utilise the product of experts [15] to factor potential sample biases out of the learned model. Our annotator network is first trained with the standard cross-entropy loss on the labels from multiple annotators to discover sample biases in the dataset. We then investigate the biases on which the annotator network relies and show that they match the sample bias identified in the longitudinal dataset. We also follow the training approach in [23], which decomposes into two successive stages: (a) training the annotator network with a standard cross-entropy loss, and (b) training the segmentation network via the product of experts to learn from the CMs of the multiple annotators. The core intuition of this training approach is to encourage the model to learn to predict the true label distribution while taking each annotator's mistakes (CMs) into account; the final goal is to produce the segmentation network. After training, the annotator network is frozen and used only as part of the product of experts, so only the segmentation network receives gradient updates.
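A minimal sketch of the product-of-experts combination step, assuming the standard formulation of [15] (a renormalised product of the experts' per-pixel class distributions; the function and shapes are our illustration, not the paper's code):

```python
import numpy as np

def product_of_experts(dists, eps=1e-12):
    """Combine expert distributions by a normalised element-wise product.

    dists: array of shape (R, ..., L) holding R experts' class
           probability distributions over L classes.
    Returns the renormalised product, shape (..., L).
    """
    # Multiply in log space for numerical stability.
    log_p = np.log(np.clip(dists, eps, None)).sum(axis=0)
    p = np.exp(log_p - log_p.max(axis=-1, keepdims=True))
    return p / p.sum(axis=-1, keepdims=True)

# Two experts that both lean towards class 0 yield a combined
# distribution sharper than either expert alone.
d = np.array([[0.6, 0.4],
              [0.7, 0.3]])
combined = product_of_experts(d)
```

A characteristic property of the product of experts is that any single expert assigning near-zero probability to a class vetoes it, which is what allows the biases an individual expert has learned to be factored out of the combined model.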
Network Details. With the introduction of CNNs in recent years, image segmentation approaches have improved substantially, with CNNs outperforming traditional techniques on both image- and model-based segmentation problems. For the former, the most noticeable breakthrough was the introduction of the U-net for 2D segmentation [22]. Subsequently, different variants of the U-net were proposed, extending the method to 3D, dealing with class imbalance, and making full use of spatial information. In this work, we implement our model on a U-net backbone and evaluate it on both a natural image dataset and medical image datasets, in 2D and 3D versions.
For 2D natural image segmentation tasks, our aim is to study the properties of the proposed model in estimating the expert consensus label from multiple annotators. Working in 2D also makes the theoretical behaviour of our model easier to describe and understand, and is applicable to most datasets. For these reasons, we build our model on a 2D U-net [22] with 4 downsampling stages and channel counts of 32, 64, 128, 256 in the encoder. We also replace the batch normalisation layers with instance normalisation. Apart from the last layer of the U-net decoder, our segmentation and annotator networks share the same parameters. In essence, the overall architecture is implemented as a U-net with multiple output heads: one for expert consensus label prediction and the others for noisy segmentation prediction. The output of the last layer of the segmentation network has c channels, where c is the number of classes.
To deal with the more complicated segmentation problems in the 3D medical imaging community, we also implement a 3D version of our model, using the original implementation with some minor modifications. As in our 2D model, the overall 3D architecture is a U-net with multiple output heads: one for expert consensus label prediction and the others for noisy segmentation prediction. In the symmetric encoder path, each layer contains two 3 × 3 × 3 convolutions, each followed by a rectified linear unit (ReLU), and then a 2 × 2 × 2 max pooling with stride two in each dimension. In the synthesis path, each layer consists of a 2 × 2 × 2 up-convolution with stride two in each dimension, followed by two 3 × 3 × 3 convolutions, each followed by a ReLU. Shortcut connections from layers of equal resolution in the analysis path provide the essential high-resolution features to the synthesis path; these skip connections are joined with a concatenation step. The last layer has a 1 × 1 × 1 kernel and c output channels, followed by a softmax, so that the network outputs a c-channel segmentation probability map.
By default, the output of the last layer for calculating the CMs at each spatial position of the annotator network has L × L channels; when the low-rank approximation is employed, the output of the last layer has 2 × L × l channels. For fair comparison, we adjusted the number of channels and the depth of the U-net backbone of the Probabilistic U-net [20] to match our networks.
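The low-rank parametrisation described above can be sketched as follows (the factor shapes follow the text; how the reconstructed CM is normalised is not specified in the text, so we leave the product unnormalised here):

```python
import numpy as np

W, H, L, l = 8, 8, 10, 1   # rank-1, as investigated in the paper

rng = np.random.default_rng(0)
# The annotator network emits two thin factors per pixel in lieu of
# a full L x L confusion matrix.
B1 = rng.random((W, H, L, l))
B2 = rng.random((W, H, L, l))

# Reconstructed (unnormalised) CM per pixel: B1 @ B2^T.
# Two separate factors are used because CMs need not be symmetric.
A = np.einsum('whik,whjk->whij', B1, B2)
assert A.shape == (W, H, L, L)

# Parameter count drops from W*H*L^2 to 2*W*H*L*l.
full_params = W * H * L * L         # 6400
lowrank_params = 2 * W * H * L * l  # 1280
```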

Dataset description
We evaluate our method on a variety of datasets covering both synthetic and real-world scenarios: 1) for MNIST segmentation and the ISBI2015 MS lesion segmentation challenge dataset [17], we apply morphological operations to generate synthetic noisy labels for binary segmentation tasks; 2) for the BraTS 2019 dataset [21], we apply a similar simulation to create noisy labels for a multi-class segmentation task; 3) we also consider the LIDC-IDRI dataset, which contains multiple annotations per input acquired from different clinical experts, as a practical evaluation; 4) we create a real-world multiple sclerosis lesion dataset with manual segmentations from 4 different annotators to verify our method in a practical setting.

Comparison methods and evaluation metrics
Our experiments are based on the assumption that no expert consensus label is available a priori; hence, we compare our method against multiple label fusion methods. In particular, we consider four label fusion baselines: a) the mean of all of the noisy labels; b) the mode label, taking the "majority vote"; c) label fusion via the original STAPLE method [25]; d) Spatial STAPLE, a more recent extension of STAPLE that accounts for spatial variations in CMs. For the STAPLE and Spatial STAPLE methods, we used a publicly available toolkit. To obtain an upper bound on performance, we also include an oracle model trained directly on the expert consensus label annotations. To test the value of the proposed image-dependent spatial CMs, we also include a "Global CM" model, in which a single CM is learned per annotator but fixed across pixels and images (analogous to [19,24], but for segmentation). Lastly, we compare against the recent Probabilistic U-net [20] as another baseline, which has been shown to capture inter-reader variation accurately.
For evaluation metrics, we use: 1) the root-MSE between the estimated CMs and the real CMs; 2) the Dice coefficient (DICE) between the estimated segmentation p̂_θ(x) and the expert consensus label y_GT, computed per class against the one-hot vector U_c for class c; 3) the generalised energy distance proposed in [20], to measure the quality of the estimated annotators' labels; 4) an incompetence score, defined as the absolute error between the estimated CM and the real CM, to show the learning performance of the annotator CNN; 5) we also evaluate our model with both dense labels (multiple labels per image) and a single label (1 randomly selected label per image) on each dataset, to show robustness to sparse labels.
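For concreteness, a per-class Dice computation on integer label maps can be sketched as follows (a standard formulation of the metric; in practice a smoothing term is often added to avoid division by zero):

```python
import numpy as np

def dice_per_class(pred, target, n_classes):
    """Dice coefficient between two integer label maps, per class."""
    scores = []
    for c in range(n_classes):
        p = pred == c
        t = target == c
        denom = p.sum() + t.sum()
        # 2|P ∩ T| / (|P| + |T|); a class absent from both maps
        # counts as perfect agreement.
        scores.append(2.0 * np.logical_and(p, t).sum() / denom
                      if denom else 1.0)
    return scores

pred   = np.array([[0, 1], [1, 1]])
target = np.array([[0, 1], [0, 1]])
scores = dice_per_class(pred, target, 2)
# Class 1: |P| = 3, |T| = 2, |P ∩ T| = 2, so Dice = 4/5 = 0.8.
```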

Performance on synthetic datasets
MNIST and ISBI2015 Datasets: On both datasets, our proposed model achieves a higher Dice similarity coefficient than STAPLE in the dense label case and, even more prominently, in the single label (i.e., 1 label per image) case (shown in Tables 1 & 2). In addition, our model, both with and without the trace norm, outperforms STAPLE in terms of CM estimation; specifically, we achieve an increase in DICE when the labels are generated by a group of 5 simulated annotators.
BraTS Dataset: For multi-class segmentation, our proposed model achieves a higher Dice similarity coefficient than STAPLE and Spatial STAPLE in both the dense label and single label scenarios (shown in Table 3). In addition, our model outperforms STAPLE in terms of DICE by a large margin of 14.4% on BraTS. In Fig. 4, we visualise the segmentation results and the corresponding annotators' predictions; even in this multi-class segmentation task, our model captures each annotator's annotation characteristics. Here we also show preliminary results for the low-rank approximation of the confusion matrices on the BraTS dataset, precluded from the main text. Table 4 compares the performance of our method with the default implementation and with the rank-1 approximation. The low-rank approximation halves the number of parameters in the CMs and the number of floating-point operations (FLOPs) in computing the annotator predictions, while reasonably retaining performance on both segmentation and CM estimation. We note, however, that the practical gain of this approximation in this task is limited, since the number of classes is only 4, as indicated by the marginal reduction in overall GPU usage for one example. We expect the gain to increase when the number of classes is larger, as shown in Fig. 5.

Table 3
Comparison of segmentation accuracy and error of CM estimation for different methods trained with dense labels and a single label (mean ± standard deviation), respectively. For the BraTS dataset, we present the results for the target class. Numbers in bold indicate the best method, statistically (p < .01) better than the other methods according to paired t-tests on the DICE and CM estimation metrics, respectively. Note that we exclude the Oracle from the model ranking, as it forms a theoretical upper bound on performance, the expert consensus label being known on the training data.

Performance on LIDC-IDRI dataset
In this section, we present our model's performance on the LIDC-IDRI dataset, which has annotation masks generated by 4 radiologists for lesions that they independently detected and considered abnormal. Our model (with trace) outperforms STAPLE in the single label case by a large margin of 18.8%. Since the LIDC dataset does not provide annotator identities, we cannot compute the average CM estimation result for each annotator; we therefore randomly select several samples to visualise the segmentation results and analyse segmentation performance across different consensus groups. Fig. 6 presents three examples of the segmentation results with the corresponding four annotator contours and the consensus. As shown in the figure, our model successfully predicts both the segmentation of the lesions and the variations of each annotator across the different cases. We also measure the inter-reader consensus level by computing the Intersection over Union (IoU) of the multiple annotations, and compare segmentation performance in three subgroups of different consensus levels (low, medium and high).
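The consensus-level grouping can be sketched as the mean pairwise IoU across the annotators' binary masks (the thresholds separating low/medium/high consensus are our illustrative choice, not values from the paper):

```python
import numpy as np

def mean_pairwise_iou(masks):
    """Mean pairwise IoU across R binary masks of shape (R, H, W)."""
    R = len(masks)
    ious = []
    for i in range(R):
        for j in range(i + 1, R):
            inter = np.logical_and(masks[i], masks[j]).sum()
            union = np.logical_or(masks[i], masks[j]).sum()
            # Two empty masks count as perfect agreement.
            ious.append(inter / union if union else 1.0)
    return float(np.mean(ious))

def consensus_group(iou, low=0.33, high=0.66):
    # Illustrative thresholds for low / medium / high consensus.
    return 'low' if iou < low else ('medium' if iou < high else 'high')

# Four annotators agreeing perfectly give IoU 1.0 ("high" consensus).
m = np.ones((4, 2, 2), dtype=bool)
score = mean_pairwise_iou(m)  # -> 1.0
```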
Additionally, as shown in Table 5, our model consistently outperforms the Probabilistic U-net on generalised energy distance across the four different test datasets, indicating that our method captures inter-annotator variation better than the baseline Probabilistic U-net. This result shows that knowing which labels were acquired from whom is useful when modelling the variability in the observed segmentation labels.

Performance on real-World MS dataset
In Fig. 7 and Table 6, we compare the performance of the different methods on our practical dataset. Our model achieves the best segmentation accuracy and Dice similarity coefficient compared with the state-of-the-art deep learning based method and the widely used STAPLE and Spatial STAPLE models. We also show visualisations of each annotator's contours and the consensus, along with the confusion matrices, on our practical dataset in Fig. 8. As shown in the figure, our model successfully predicts both the segmentation of the lesions and the variations of each annotator across the different cases. Meanwhile, the confusion matrices in Fig. 8 illustrate that our model can capture the patterns of mistakes of each annotator. We also note that our model is consistently more accurate than the global CM model, indicating the value of image-dependent pixel-wise CMs.
Furthermore, we show the annotator confidence score for each test example and the corresponding CM error in Fig. 9. For each annotator, when the annotator is not confident in labelling a sample, a lower confidence score is given and the corresponding CM error is higher. If different annotators give the same confidence score for one sample but their incompetence scores differ, the annotator with the lowest incompetence score has the best ability to label the data. For example, annotator 3 labelled the four test examples with high confidence and the learned CMs show the lowest incompetence score, so we can consider annotator 3 to have the best labelling ability. To further verify which annotator has the best annotation ability, we show the correlation between each annotator's confidence score and the corresponding Dice coefficient for each test example in Fig. 10. Annotator 3 also achieves the best Dice coefficient together with the lowest CM incompetence score for each test example. From both figures, we conclude that annotator 3 is best able to label the MS lesions.

Discussion
In this work, we integrate two coupled CNNs into an end-to-end supervised segmentation framework to jointly estimate the reliability of multiple human annotators and the expert consensus label from noisy labels alone, which is applicable to different medical image segmentation tasks. Our method is lightweight and can be trained end-to-end. In the following, we present a comprehensive discussion of the questions of concern in this work and of future extensions of our model, such as potential applications in teaching people how to label image data and in selecting the best annotator from multiple annotators.

Evaluation on 3D multi-class segmentation
For most multi-class image segmentation problems, the number of pixels differs between classes, which potentially leads to less accurate predictions for some classes than for others. Additionally, some image regions are easier to classify (i.e., yield higher segmentation accuracy) than others due to more distinct local image characteristics. In our work, to validate against synthetic noisy labels in multi-class segmentation of 3D medical images, e.g., the BraTS dataset, we choose a target class and treat the other classes as "background". In Tables 3 and 4, we only present the Dice coefficient and the CM estimation for the target class of BraTS images. To validate the segmentation model's performance, we also report the Dice coefficient for all classes and for the entire image, D_total, by judging whether each prediction is correct or incorrect with Eq. 6. Quantitative results from the comparison models are presented in Tables 7 and 8. Our model shows the best performance among all presented methods. Segmentation accuracy improved by roughly 14% and 17% compared to STAPLE and Spatial STAPLE with dense labels and a single label, respectively. This means that our model captures the annotators' characteristics well even with multi-class lesions in the images.
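As a minimal sketch of how these two Dice metrics can be computed (the helper names `dice_per_class` and `total_dice` are illustrative, and we assume the per-pixel correct/incorrect judgement of Eq. 6 reduces to label agreement for hard label maps):

```python
import numpy as np

def dice_per_class(pred, target, num_classes):
    """Per-class Dice between two integer label maps of the same shape."""
    scores = []
    for c in range(num_classes):
        p = (pred == c)
        t = (target == c)
        denom = p.sum() + t.sum()
        # convention: empty class in both maps counts as a perfect match
        scores.append(2.0 * np.logical_and(p, t).sum() / denom if denom > 0 else 1.0)
    return scores

def total_dice(pred, target):
    """Whole-image score: fraction of pixels judged correct (label agreement)."""
    return (pred == target).mean()
```

Average Dice in Tables 7 and 8 would then be `np.mean(dice_per_class(...))`, while Total Dice averages `total_dice` over images.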

Learn Annotator's quality
In the experiments on the practical MS dataset, we measure each annotator's incompetence score of the confusion matrices, the confidence score for the annotation, and the computed Dice coefficient. In Figs. 9 and 10, we plot the correlation of the CM incompetence score with the annotator confidence score and with the Dice coefficient, respectively. This is done for the testing cases, for which we selected the middle 20 slices of each case, annotated by the 3 different annotators. Fig. 9 shows that the more confident the annotator, the smaller the CM incompetence score, so we could already choose annotator 3 as the best one at this stage. Furthermore, we calculated the mean Dice coefficient of the selected slices for each testing case. From the results shown in Fig. 10, annotator 3 still shows the best performance, with the highest Dice coefficient.

Learning with metadata
Metadata is a useful and powerful tool in any data scientist's toolbox, regardless of the model being used. Unfortunately, there is a paucity of quality literature on this topic, and metadata is often overlooked when building an accurate machine learning model. In this work, the metadata can include information about each annotator's experience, fatigue, motivation and concentration. For example, an annotator's experience (e.g., expert, senior, junior) differs for different types of lesions, which require different levels and types of expertise. The annotation experience also affects annotation quality and the availability of a worker base. As for the setup, annotator motivation is also one of the key aspects determining the cost of annotations. In future work, to improve segmentation accuracy, we plan to integrate all available metadata into the proposed method.

Conclusion
We introduced a novel, robust learning method based on CNNs for simultaneously recovering the label noise of multiple annotators and the expert consensus label distribution for supervised segmentation problems. We demonstrated this method on real-world datasets with both synthetic and real-world annotations. Our method is capable of modelling individual annotators and thereby improving robustness against label noise. Experiments have shown that our model achieves considerable improvement over traditional label fusion approaches, including averaging, majority vote, and the widely used STAPLE framework and its spatially varying versions, in terms of both segmentation accuracy and the quality of CM estimation.
One exciting avenue of this research is the application of the annotation models in downstream tasks. Of particular interest is the design of active data collection schemes where the segmentation model is used to select which samples to annotate ("active learning") and the annotator models are used to decide which experts should label them ("active labelling"); extending [28] from simple classification tasks to segmentation remains future work. Another exciting application is the education of inexperienced annotators; the estimated spatial characteristics of segmentation mistakes provide further insight into their annotation behaviours and, as a result, can help them improve the quality of their annotations in the next data acquisition. At the same time, although we have achieved reliable performance in all experiments, it is worth exploring how many training samples are needed, at a minimum, to achieve reliable performance for each annotator. This is also future work we will consider.

Declaration of Competing Interest
The authors certify that they have NO affiliations with or involvement in any organisation or entity with any financial interest (such as honoraria; educational grants; participation in speakers' bureaus; membership, employment, consultancies, stock ownership, or other equity interest; and expert testimony or patent-licensing arrangements), or non-financial interest (such as personal or professional relationships, affiliations, knowledge or beliefs) in the subject matter or materials discussed in this manuscript.
as a Research Scientist and later being promoted to a Project Manager and Principal Key Expert.He led a team developing efficient machine learning methods for many challenging problems in medical image analysis.He is now Director and Distinguished Scientist of Tencent Jarvis Lab, Shenzhen, China, leading the company's initiative on Medical AI.His research interests include medical image analysis, natural language processing, document image analysis, computer vision, and deep learning.


Fig. 1. The schematic pipeline with 3 annotators of different characteristics: over-segmentation, under-segmentation and mixing up two classes. The model consists of two parts: (1) the segmentation network, parametrised by θ, which generates an estimate of the unobserved expert consensus label probabilities, p̂_θ(x); (2) the annotator network, parametrised by φ, which estimates the pixel-wise confusion matrices {A^(r)_φ(x)}_{r=1}^{3} of the annotators for the given input image x. During training, the estimated annotator distributions p̂^(r)_{θ,φ}(x) := A^(r)_φ(x) · p̂_θ(x) are computed, and the parameters {θ, φ} are learned by minimising the sum of their cross-entropy losses with respect to the acquired noisy segmentation labels ỹ^(r), and the trace of the estimated CMs. At test time, the output of the segmentation network, p̂_θ(x), is used to yield the prediction.
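The computation described in the caption can be sketched as follows. This is an illustrative numpy implementation under assumed conventions (consensus probabilities of shape (H, W, L), row-stochastic per-pixel CMs, and a trace-regularisation weight `alpha` whose name is hypothetical); the actual model applies these operations inside the two coupled CNNs:

```python
import numpy as np

def annotator_distribution(p_true, cm):
    """p^(r)(x) = A^(r)(x) . p_theta(x), applied at every pixel.

    p_true: (H, W, L) estimated consensus label probabilities.
    cm:     (H, W, L, L) estimated confusion matrix at every pixel.
    """
    return np.einsum('hwij,hwj->hwi', cm, p_true)

def training_loss(p_true, cms, noisy_labels, alpha=0.1):
    """Sum over annotators of the cross-entropy to their noisy labels,
    plus the trace of the estimated CMs as a regulariser."""
    h, w, _ = p_true.shape
    rows, cols = np.arange(h)[:, None], np.arange(w)[None, :]
    loss = 0.0
    for cm, y in zip(cms, noisy_labels):  # one (CM map, label map) per annotator
        p_r = annotator_distribution(p_true, cm)
        loss += -np.log(p_r[rows, cols, y] + 1e-12).mean()       # cross-entropy
        loss += alpha * np.trace(cm, axis1=-2, axis2=-1).mean()  # trace penalty
    return float(loss)
```

In the real model, `p_true` and `cms` are the outputs of the segmentation and annotator networks, and the loss is back-propagated through both.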

Fig. 3. Segmentation accuracy of different models on the MNIST (a, b) and MS (c, d) datasets for a range of annotation noise (measured in averaged Dice with respect to the expert consensus label).

Fig. 4. The final segmentation of our model on BraTS and visualisations of each annotator network's predictions. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Table 6
Comparison of segmentation accuracy and error of CM estimation for different methods trained with dense labels (mean ± standard deviation). Numbers in bold indicate the best method that is statistically significantly (p < .01) better than the other methods, computed via p values of paired t-tests on the Dice and CM estimation metrics, respectively. Note that we exclude the Oracle from the model ranking as it forms a theoretical upper bound on the performance, where the expert consensus label is known on the training data.

Fig. 5. Comparison of time and space complexity between the default implementation and the low-rank counterparts. (a) compares the number of parameters in the confusion matrices, while (b) shows the number of FLOPs required to compute the annotator predictions (the product between the confusion matrices and the estimated true segmentation probabilities).

Fig. 6. Segmentation results on the LIDC-IDRI dataset and the visualisation of each annotator's contours and the consensus.

Fig. 7. Curves of validation accuracy during training of different models on our practical dataset.

Fig. 8. Visualisation of each annotator's contours and the consensus (red for Annotator 1, yellow for Annotator 2, blue for Annotator 3 and purple for the consensus), and the confusion matrices on our practical dataset (white is true positive, green is false negative, red is false positive and black is true negative; the background label is learned as true negative and false negative). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Fig. 9. The annotator confidence score of each testing example and the corresponding CM incompetence score. We select the middle 20 slices and compute the mean CM error for each example.

Fig. 10. The annotator confidence score of each testing example and the corresponding Dice coefficient. We select the middle 20 slices for each example.
Ling Shao is the Founding CEO and Chief Scientist of the Inception Institute of Artificial Intelligence (IIAI), Abu Dhabi, UAE. He was the initiator of the world's first AI university, the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), for which he served as the Founding Provost and Executive Vice President (2019-2021). He received his B.Eng. degree in Electronic and Information Engineering from the University of Science and Technology of China (USTC), his M.Sc. degree in Medical Image Analysis and his Ph.D. (D.Phil.) degree in Computer Vision from the University of Oxford. Previously, he was Chair Professor (2016-2018) in Computer Vision and Machine Learning at the University of East Anglia, Norwich, UK, Chair Professor (2014-2016) in Computer Vision and Machine Intelligence at Northumbria University, Newcastle upon Tyne, UK, a Senior Lecturer (2009-2014) at the University of Sheffield, UK, and a Senior Scientist (2005-2009) with Philips Research, The Netherlands. His research interests include Computer Vision, Deep Learning, Medical Imaging, and Vision and Language.

Olga Ciccarelli is an NIHR Research Professor of Neurology. Her research programme aims to develop a computer tool that will enable doctors to predict which medicines individuals with multiple sclerosis will respond better to. She studied Medicine at the University of Rome, La Sapienza, Italy, where she completed her specialist training in Neurology in 1999. She then came to London with the Jacqueline Du Pré award from the MSIF. After her PhD in Neuroscience at UCL, she was awarded a Wellcome Trust Advanced Clinical Fellowship. In 2019 she was awarded an NIHR Research Professorship.

Frederik Barkhof leads the Queen Square MS Centre Trial Unit, which is involved in the analysis of multicentre drug trials. He serves on the editorial boards of Radiology, Brain, Multiple Sclerosis Journal, Neuroradiology and Neurology. He is a fellow of the Royal College of Radiology. He was the chairman of the Dutch Society of Neuroradiology and of the MAGNIMS study group for many years. Frederik Barkhof received his MD from VU University, Amsterdam (NL) in 1988 and defended his PhD thesis in 1992, for which he received the Philips Prize for Radiology (1992) and the Lucien Appel Prize for Neuroradiology (1994). Since 2001 he has served as a full Professor in Neuroradiology at the department of Radiology & Nuclear Medicine at Amsterdam University Medical Centers (location VUmc). In 2015 he was appointed full Professor of Neuroradiology at the Queen Square Institute of Neurology and the Centre for Medical Image Computing (CMIC) at University College London (UCL) to translate novel imaging techniques. In 2018 he received the John Dystel Prize from the AAN and NMSS for his unique contributions to MS research. In 2021, he was awarded a Gold Medal by the International Society for Magnetic Resonance in Medicine (ISMRM). His research interests focus on white matter disease, ageing & dementia and glioma.

Daniel Alexander is the Director of the Centre for Medical Image Computing (CMIC) and deputy head of the Computer Science Department at UCL. He leads the Microstructure Imaging Group and the Progression of Neurodegenerative Diseases initiative. He is theme lead for the UCLH Biomedical Research Centre Healthcare Engineering and Imaging theme. He coordinates the Horizon 2020 EuroPOND consortium. His core expertise is in computer science, computational modelling, machine learning, and imaging science. His first degree was a BA in Mathematics from Oxford, completed in 1993. He then studied for an MSc and a PhD in Computer Science at UCL, completing in 1997. After a postdoc at the University of Pennsylvania, he returned to UCL as a lecturer in 2000, and he has been Professor of Imaging Science since 2009.

Table 1
Comparison of segmentation accuracy and error of CM estimation for different methods with dense labels (mean ± standard deviation). Numbers in bold indicate the best method that is statistically significantly (p < .01) better than the other methods, computed via p values of paired t-tests on the Dice and CM estimation metrics, respectively.

Table 2
Comparison of segmentation accuracy and error of CM estimation for different methods with one label per image (mean ± standard deviation). Numbers in bold indicate the best method that is statistically significantly (p < .01) better than the other methods, computed via p values of paired t-tests on the Dice and CM estimation metrics, respectively. We note that 'Naive CNN' is trained on randomly selected annotations for each image.

Table 4
Comparison between the default implementation and the low-rank (rank = 1) approximation on BraTS. GPU memory consumption is estimated for the case with batch size = 1. We report both the total number of variables in the confusion matrices and the number of FLOPs required to compute the annotator predictions.

Table 5
Comparison of Generalised Energy Distance on different datasets (mean ± standard deviation).The distance metric used here is Dice.
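A Dice-based Generalised Energy Distance between model samples S and annotations Y can be computed with the distance d = 1 − Dice as GED² = 2·E[d(S, Y)] − E[d(S, S′)] − E[d(Y, Y′)]. The following is a minimal sketch for binary masks (function names are illustrative, and the convention that two empty masks have distance 0 is an assumption):

```python
import numpy as np

def dice_distance(a, b):
    """d(a, b) = 1 - Dice for binary masks a and b."""
    denom = a.sum() + b.sum()
    return 0.0 if denom == 0 else 1.0 - 2.0 * np.logical_and(a, b).sum() / denom

def generalised_energy_distance(samples, annotations):
    """GED^2 = 2 E[d(S, Y)] - E[d(S, S')] - E[d(Y, Y')] over two sets of masks."""
    cross = np.mean([dice_distance(s, y) for s in samples for y in annotations])
    within_s = np.mean([dice_distance(s, s2) for s in samples for s2 in samples])
    within_y = np.mean([dice_distance(y, y2) for y in annotations for y2 in annotations])
    return 2.0 * cross - within_s - within_y
```

A GED of 0 means the model's sample distribution matches the annotation distribution under this distance; larger values indicate a mismatch.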

Table 7
Dice coefficients for multi-class segmentation results with different comparison models trained with dense labels per image (mean ± standard deviation). Average Dice is the average of each class's Dice coefficient; Total Dice is the average of the entire-image Dice coefficient. Numbers in bold indicate the best method that is statistically significantly (p < .01) better than the other methods, computed via p values of paired t-tests on the Dice metric.

Table 8
Dice coefficients for multi-class segmentation results with different comparison models trained with only one label available per image (mean ± standard deviation). Average Dice is the average of each class's Dice coefficient; Total Dice is the average of the entire-image Dice coefficient. Numbers in bold indicate the best method that is statistically significantly (p < .01) better than the other methods, computed via p values of paired t-tests on the Dice metric.