Deep Gaussian Processes for Classification With Multiple Noisy Annotators. Application to Breast Cancer Tissue Classification

Machine learning (ML) methods often require large volumes of labeled data to achieve meaningful performance. The expertise necessary for labeling data in medical applications like pathology presents a significant challenge in developing clinical-grade tools. Crowdsourcing approaches address this challenge by collecting labels from multiple annotators with varying degrees of expertise. In recent years, multiple methods have been adapted to learn from noisy crowdsourced labels. Among them, Gaussian Processes (GPs) have achieved excellent performance due to their ability to model uncertainty. Deep Gaussian Processes (DGPs) address the limitations of GPs using multiple layers to enable the learning of more complex representations. In this work, we develop Deep Gaussian Processes for Crowdsourcing (DGPCR) to model the crowdsourcing problem with DGPs for the first time. DGPCR models the (unknown) underlying true labels, and the behavior of each annotator is modeled with a confusion matrix among classes. We use end-to-end variational inference to estimate both DGPCR parameters and annotator biases. Using annotations from 25 pathologists and medical trainees, we show that DGPCR is competitive or superior to Scalable Gaussian Processes for Crowdsourcing (SVGPCR) and other state-of-the-art deep-learning crowdsourcing methods for breast cancer classification. Also, we observe that DGPCR with noisy labels obtains better results (F1 = 81.91%) than GPs (F1 = 81.57%) and deep learning methods (F1 = 80.88%) with true labels curated by experts. Finally, we show an improved estimation of annotators’ behavior.


I. INTRODUCTION
The associate editor coordinating the review of this manuscript and approving it for publication was Ines Domingues.
Machine learning (ML) classification algorithms have achieved very promising results in the field of digital pathology [1], [2], [3], [4]. These methods extract knowledge from data that has been previously labeled by an expert pathologist. However, modern ML models require a large amount of labeled data to perform well. Given the enormous workload of pathologists, the labeling process has become one of the most important bottlenecks in real practice [5], [6]. To address this issue, crowdsourcing has emerged as an alternative labeling approach in the last few years. The idea in crowdsourcing is to share the labeling effort among many different annotators who may not be experts and may have different degrees of expertise [7], [8], [9]. Currently, the use of crowdsourcing approaches in the medical field is a topic of significant interest. A multitude of studies have used crowdsourced data to examine a wide range of problems [10], [11], [12].
Crowdsourced labels are inherently noisy, thus ML methods must be adapted to cope with this new scenario. A first approach would be to aggregate the crowdsourced labels to yield a set of noise-free labels (e.g. majority voting) and then use a standard ML classification method. However, as explained in [8], this approach usually performs worse than modeling the confusion of each annotator as part of the training process. In this work, we focus on the latter. Currently, the most successful crowdsourcing approaches are based on deep learning (DL) [13], [14] and Gaussian Processes (GPs) [7], [15], [16]. DL methods provide excellent predictive performance due to their hierarchical architecture that allows for learning complex features [17]. GPs are sound probabilistic methods that excel at uncertainty estimation, which is very valuable in the noisy crowdsourcing scenario [18], [19]. In the ML community, Deep Gaussian Processes (DGPs) represent a state-of-the-art method that leverages the strengths of both DL and GPs. The idea behind DGPs is to build a deep model by stacking various layers of GPs. Therefore, DGPs model flexible and complex functions like deep models while preserving the uncertainty estimation capability of GPs [20], [21].
In this work, we adapt DGPs to learn from crowdsourced labels and apply the new method to breast cancer detection. We call our method DGPCR (Deep Gaussian Processes for Crowdsourcing). To the best of our knowledge, DGPCR is the first extension of DGPs to the crowdsourcing setting in any area of application. DGPCR assumes that there exists an unknown ground truth label for each instance, which is modeled with a DGP. The crowdsourced labels are modeled from such ground truth through a per-annotator confusion matrix. These matrices codify the degree of expertise of each annotator for each class. DGP parameters and confusion matrices are estimated end-to-end by doubly stochastic variational inference [20]. Therefore, in addition to making predictions on previously unseen instances, DGPCR can estimate the reliability of each annotator as well as the ground truth labels.
To better understand the behavior of the novel DGPCR, we first conduct two controlled experiments. First, we use a fully synthetic 1D dataset, simulating data and annotators, to show the effectiveness of the method in a simple crowdsourcing problem. Then, we address a semi-synthetic problem using the well-known MNIST dataset, where we simulated only the crowdsourcing annotations. We consider five synthetic annotators with different known reliabilities, and we check that DGPCR is able to accurately estimate such reliabilities. We also show that DGPCR performance on the test set is superior to its shallow GP-based counterpart.
Then, we apply DGPCR to solve a real-world medical imaging problem. The data used here comes from an international study where pathology experts and non-experts annotated, following a crowdsourcing process, breast cancer tissue regions from the TCGA Breast Cancer cohort [7], [22]. TCGA is the well-known ''Cancer Genome Atlas Program'' [23]. In total, there are 161 rectangular regions of interest (ROI), from which 79607 patches were extracted. These patches are considered as training/testing instances, and features are obtained from them through a deep neural network. We will deal with a multiclass problem in which each patch belongs to one of three classes: tumor, stroma, and immune infiltrate.
The experimental results on this dataset show that DGPCR is competitive or superior to other state-of-the-art crowdsourcing methods based on both DL and GPs. We also show that, as theoretically expected, DGPCR performance is upper bounded by that of DGP-Gold (that is, a DGP trained with true expert labels), and it is lower bounded by that of DGP-MV (that is, a DGP trained with the naive majority voting of the crowdsourced labels). Moreover, DGPCR obtains slightly better results than DL and GPs trained with true expert labels. We also report enhanced estimation of annotator reliabilities (behavior), as well as good performance in the minority class across different training sizes. The reported statistics are illustrated through an insightful visualization of the predictions.
In summary, the main contributions of this paper are:
• We formulate Deep Gaussian Processes for Crowdsourcing, a novel method to integrate the benefits of DL and GPs in crowdsourcing. To the best of our knowledge, this is the first time that DGPs are extended to the crowdsourcing setting in any application domain.
• We illustrate the behavior of the method with controlled experiments. We use a fully synthetic experiment with a 1D dataset (where we simulate the data and the annotators) and a semi-synthetic one using MNIST (where we only simulate the annotators).
• We apply the new method to a real-world problem of histology breast cancer images annotated by medical students. We show promising results compared to state-of-the-art crowdsourcing methods. We also discuss the power of crowdsourcing labeling in medical imaging and future opportunities.
The rest of the paper is organized as follows. Section II reviews the most relevant related work. Section III explains the newly developed method, providing a general overview (Section III-A) as well as details on the probabilistic model (Section III-B) and variational inference (Section III-C). Section IV presents the fully synthetic experiment on a 1D dataset. Section V contains the semi-synthetic experiment with MNIST. Section VI analyzes the real-world problem involving breast cancer images. Finally, Section VII presents the main conclusions and the future outlook.

II. RELATED WORK
To set a richer context for DGPCR, this section reviews the most related approaches used for crowdsourcing. There are two contemporary approaches in the literature: i) combining the noisy labels and using a classification algorithm, and ii) utilizing the multiple labels during the learning process. The first one often involves using a weighted majority vote, with some works using Decision Trees to estimate the annotators' weights [24] and others utilizing label propagation/augmentation techniques to incorporate information from similar instances [25], [26]. The second one treats ground-truth labels as latent variables and maps them to noisy annotations using a confusion matrix per annotator. These confusion matrices encode the annotators' expertise and biases. In this work, we focus on this line of action. In this context, we distinguish two kinds of approaches: non-probabilistic ones (mainly based on deep learning) and probabilistic ones (mainly based on Gaussian Processes).
VOLUME 11, 2023

A. NON-PROBABILISTIC RELATED METHODS
Several crowdsourcing works have focused on how to adapt existing ML methods when multiple annotators label data. Raykar et al. [27] proposed a crowdsourcing classification method based on logistic regression. This method jointly learns the annotators' expertise and a latent classifier. Following an Expectation-Maximization (EM) scheme, they iteratively estimated the annotators' reliability and the classifier's coefficients. This method was applied to prostate cancer classification, where there was a great amount of disagreement between expert pathologists [28]. However, this linear classifier cannot achieve satisfactory performance compared to other ML methods. To overcome this limitation, DL methods have been adapted to the crowdsourcing scenario. AggNet [13] considered a deep neural network (DNN) as the latent classifier, and a probabilistic noise model estimated the annotators' reliability. This method also used EM for the learning process. Later, CrowdLayer [14] also included a DNN as the latent classifier. In this case, the confusion matrix was estimated within the forward pass of the network to model the noisy observations. This characteristic enabled end-to-end training with stochastic gradient descent, leading to better and faster convergence than EM.

B. PROBABILISTIC GAUSSIAN PROCESSES RELATED METHODS
Recently, probabilistic Gaussian Processes reported great performance in several classification problems, including digital pathology, being competitive with DL-based methods [29]. These methods are usually more reliable than deterministic DL methods due to their probabilistic formulation. They are not likely to overfit and generalize well to unseen data. Also, they are well-calibrated. All these properties encouraged their adaptation to the crowdsourcing scenario, where they have achieved very competitive results. Namely, Variational Gaussian Processes for Crowdsourcing (VGPCR) addressed different tasks using variational inference with the mean-field approximation [15]. They showed clear superiority against deterministic crowdsourcing methods. However, the training was not end-to-end. They iteratively updated the coefficients of the GP and the confusion matrices. Also, this method was not scalable. Then, Morales-Álvarez et al. [19] proposed Scalable Gaussian Processes for Crowdsourcing (SVGPCR) overcoming the two main limitations of VGPCR. They performed stochastic variational inference enabling end-to-end learning using stochastic gradient descent and at the same time, they achieved scalability. They applied this method to glitch detection in gravitational waves with great results against various state-of-the-art methods in crowdsourcing. Recently, this method was extended to accommodate the situation in which a small number of expert labels is available concurrently with the labels from less experienced annotators generated by the crowdsourcing process [16]. In the medical imaging field, SVGPCR was applied to breast cancer classification with promising results compared to other related methods [7]. However, no probabilistic deep methods have been proposed for crowdsourcing problems, and this is the gap that our method intends to fill. Table 1 summarizes the main properties of the algorithms reviewed here. 
In the experimental evaluation, we will compare against the most advanced methods in each family, i.e. the deep learning-based AggNet and CrowdLayer and the GP-based SVGPCR.

III. METHODOLOGY
In this section, we introduce the proposed methodology (DGPCR). Section III-A provides a general overview, and Sections III-B and III-C introduce the details of the probabilistic model and the variational inference, respectively. Figure 1 shows the pipeline for DGPCR. The inputs are i) features extracted from a pretrained VGG16, and ii) crowdsourced labels provided by annotators with varying degrees of expertise. DGPCR learns a latent DGP classifier for the ground truth of the instances and a confusion matrix for each annotator. In addition to ground truth predictions, DGPCR can make predictions on the annotator's behavior by combining the latent classifier with the estimated confusion matrices. The confusion matrices are valuable on their own, as they estimate how good each annotator is and which classes they are prone to confuse. This can be further used to enhance the training provided to each annotator. DGPCR is trained end-to-end by maximizing the objective described in eq. (7). Our implementation leverages GPU acceleration through GPflow [30], more specifically GPflow 1.2.0, a library on top of TensorFlow dedicated to GPs. The code is publicly available on GitHub: https://github.com/wizmik12/DGPCR.
FIGURE 1. Pipeline for the proposed method. A1 through A5 represent five crowdsourcing (non-expert) annotators. The input data is given by features extracted from an RGB patch plus non-expert crowdsourced labels for such a patch. With this information, DGPCR estimates a latent DGP classifier and confusion matrices describing each annotator's behavior. In the test stage, DGPCR can predict the expert label for previously unseen patches, as well as the label that each annotator would give for such a patch.

B. PROBABILISTIC MODEL
Let us assume a $K$-class crowdsourcing classification problem. There are $N$ training data points, and $A$ annotators label the instances. We denote the training set as $\mathcal{D} = \{(\mathbf{X}, \mathbf{Y})\} = \{(\mathbf{x}_n, \mathbf{y}_n^a) : n = 1, \dots, N;\ a \in A_n\}$, where $\mathbf{x}_n \in \mathbb{R}^D$ is a vector of features and $\mathbf{y}_n^a$ is the label provided by the $a$-th annotator for the $n$-th instance. $A_n \subseteq \{1, \dots, A\}$ is the set of annotators who labelled the $n$-th instance. Notice that, in general, not all annotators label every instance. We represent the crowdsourced labels $\mathbf{y}_n^a$ with a one-hot encoded vector. That is, if the $a$-th annotator assigns the $k$-th class to the $n$-th instance, then $\mathbf{y}_n^a = \mathbf{e}_k$, a $K$-dimensional vector with all zeros except for the $k$-th position, where there is a one. Figure 2 depicts the probabilistic graphical model for DGPCR, which we describe next.
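As an illustration of this notation, the following minimal Python sketch stores crowdsourced labels as one-hot vectors indexed by instance and annotator. The names `one_hot`, `labels`, and `annotator_sets`, as well as the toy values, are hypothetical and are not taken from the DGPCR code.

```python
def one_hot(k, num_classes):
    """Return e_k: a list of zeros with a one at (0-based) position k."""
    vec = [0] * num_classes
    vec[k] = 1
    return vec


K = 3  # number of classes (toy value)

# labels[n] maps an annotator index a to the one-hot label y_n^a.
# Only annotators in A_n appear as keys, because in general
# not every annotator labels every instance.
labels = [
    {0: one_hot(0, K), 2: one_hot(1, K)},  # instance 0, A_0 = {0, 2}
    {1: one_hot(2, K)},                    # instance 1, A_1 = {1}
]

# Recover the sets A_n from the stored labels.
annotator_sets = [sorted(y_n.keys()) for y_n in labels]
```

A sparse mapping of this kind avoids storing placeholder labels for the annotators who did not see a given instance.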

1) INTRODUCING THE CONFUSION MATRICES
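Inspired by [19], [27], and [31], we assume an (unknown) true label $\mathbf{z}_n \in \{\mathbf{e}_1, \dots, \mathbf{e}_K\}$ for each instance. Then, the crowdsourced labels depend on this true label and on the degree of expertise of each annotator. We model the expertise of each annotator $a$ with a confusion matrix $\mathbf{R}^a = (r^a_{ij})_{1 \le i,j \le K}$. Each element $r^a_{ij} \in [0, 1]$ represents the probability that the $a$-th annotator labels as class $i$ an instance whose real class is $j$. We also assume that every annotator labels the different instances independently. Mathematically, this is given by
$$p(\mathbf{Y}|\mathbf{Z},\mathbf{R}) = \prod_{n=1}^{N}\prod_{a\in A_n} p(\mathbf{y}_n^a|\mathbf{z}_n,\mathbf{R}^a) = \prod_{n=1}^{N}\prod_{a\in A_n} (\mathbf{y}_n^a)^\top \mathbf{R}^a \mathbf{z}_n, \tag{1}$$
where we write $\mathbf{Z}$ for all the $\mathbf{z}_n$'s and $\mathbf{R}$ for all the confusion matrices $\mathbf{R}^a$, $a = 1, \dots, A$. We use prior (independent) Dirichlet distributions for the behavior of annotators, i.e.
$$p(\mathbf{R}) = \prod_{a=1}^{A}\prod_{j=1}^{K} \mathrm{Dir}(\mathbf{r}_j^a|\alpha_{1j}^a,\dots,\alpha_{Kj}^a), \tag{2}$$
where $\mathbf{r}_j^a$ denotes the $j$-th column of $\mathbf{R}^a$.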
This distribution is conjugate to the categorical one in eq. (1), which eases subsequent computations. Also, such a Dirichlet prior can be used to incorporate prior knowledge that may be available for the annotator's behavior. In the default case where there is no prior knowledge, which we will assume in our experiments, we can set $\alpha^a_{ij} = 1$ and we obtain a uniform prior distribution.
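To make the role of the confusion matrix concrete, the sketch below draws a noisy label from the column of a confusion matrix selected by the true class, and draws one column from a Dirichlet prior with all parameters equal to one. It is a minimal illustration under stated assumptions: the matrix `R_a`, the helper names, and the random seed are hypothetical, and the Dirichlet sample uses the standard construction from normalized Gamma draws.

```python
import random


def sample_noisy_label(true_class, confusion, rng):
    """Draw an annotator label given the true class j.

    confusion[i][j] is the probability of reporting class i when the
    true class is j, so column j is a categorical distribution.
    """
    column = [row[true_class] for row in confusion]
    return rng.choices(range(len(column)), weights=column, k=1)[0]


def sample_dirichlet(alphas, rng):
    """Sample from Dir(alphas) by normalizing independent Gamma draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]


# Hypothetical 3-class annotator; every column sums to one.
R_a = [
    [0.8, 0.1, 0.0],
    [0.1, 0.8, 0.0],
    [0.1, 0.1, 1.0],
]

rng = random.Random(0)
# This annotator never errs on class 2, so all draws equal 2.
samples = [sample_noisy_label(2, R_a, rng) for _ in range(100)]
# One column drawn from the uniform prior (all alpha equal to 1).
prior_column = sample_dirichlet([1.0, 1.0, 1.0], rng)
```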

2) MODELING THE UNDERLYING TRUTH WITH A DGP
The true underlying labels $\mathbf{Z}$ are modeled from the input $\mathbf{X}$ with a DGP of $L$ layers [20]. For this, we introduce latent variables $\{\mathbf{F}^l\}_{l=1}^{L}$, where each $\mathbf{F}^l$ follows a GP prior independently across dimensions, with input locations given by the outputs of the previous layer $l-1$. We write $f^l_{n,d}$ for the latent variable of the $n$-th instance in the $d$-th dimension of the $l$-th layer (each layer has $D^l$ units, $d = 1, \dots, D^l$). Since the last layer defines the output and we are considering $K$ classes, we have $D^L = K$. The true label $\mathbf{z}_n$ is defined from the last layer of the DGP $\mathbf{f}^L_{n,:}$ with a multinomial distribution $p(\mathbf{z}_n|\mathbf{f}^L_{n,:})$ that depends on the chosen likelihood. In this paper, we use the popular softmax likelihood. Because of the computational cost of vanilla DGPs, which is $O(N^3)$, we resort to the well-known sparse model [20], [32]. In brief, this approximation introduces $M^{l-1} \ll N$ inducing locations $\tilde{\mathbf{X}}^{l-1}$ at each layer $l$ with inducing values $\mathbf{U}^l$. These values are realizations from the same GP as $\mathbf{F}^l$ and summarize the information contained in the $N$ training points at the $l$-th layer. Mathematically, this is given by the probability distribution
$$p(\mathbf{Z},\{\mathbf{F}^l,\mathbf{U}^l\}_{l=1}^{L}) = \prod_{n=1}^{N} p(\mathbf{z}_n|\mathbf{f}^L_{n,:}) \prod_{l=1}^{L} p(\mathbf{F}^l|\mathbf{U}^l;\mathbf{F}^{l-1},\tilde{\mathbf{X}}^{l-1})\, p(\mathbf{U}^l;\tilde{\mathbf{X}}^{l-1}), \tag{3}$$
where the semicolon notation indicates the inputs of each function. Notice also that we are writing $\mathbf{F}^0$ for the input $\mathbf{X}$.
FIGURE 2. Probabilistic graphical model for an L-layer DGPCR. Dark (resp. light) circles are used for observed (resp. latent) variables.

3) SUMMARIZING THE MODEL
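In total, the joint probabilistic model for DGPCR is given by
$$p(\mathbf{Y},\mathbf{Z},\{\mathbf{F}^l,\mathbf{U}^l\}_{l=1}^{L},\mathbf{R}) = p(\mathbf{Y}|\mathbf{Z},\mathbf{R})\, p(\mathbf{R})\, p(\mathbf{Z},\{\mathbf{F}^l,\mathbf{U}^l\}_{l=1}^{L}), \tag{4}$$
where $p(\mathbf{Y}|\mathbf{Z},\mathbf{R})$, $p(\mathbf{R})$ and $p(\mathbf{Z},\{\mathbf{F}^l,\mathbf{U}^l\}_{l=1}^{L})$ are given by eqs. (1), (2) and (3), respectively. As mentioned before, Figure 2 shows the probabilistic graphical model for DGPCR.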

C. VARIATIONAL INFERENCE

1) MOTIVATION FOR VARIATIONAL INFERENCE
To obtain the posterior distribution over the latent parameters, we need to integrate out $\mathbf{Z}$, $\{\mathbf{F}^l, \mathbf{U}^l\}_{l=1}^{L}$ and $\mathbf{R}$ in eq. (4). Since this is analytically intractable, we resort to doubly stochastic variational inference to approximate the computations [20]. The idea is to propose a parametric posterior $q(\mathbf{Z},\{\mathbf{F}^l,\mathbf{U}^l\}_{l=1}^{L},\mathbf{R})$ that approximates the true posterior $p(\mathbf{Z},\{\mathbf{F}^l,\mathbf{U}^l\}_{l=1}^{L},\mathbf{R}|\mathbf{Y})$. To optimize the parameters of the parametric posterior, the Kullback-Leibler (KL) divergence with respect to the true posterior is minimized. The KL divergence quantifies how different two distributions are; it is always non-negative and vanishes if and only if both distributions coincide, see e.g. [33].
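The properties of the KL divergence mentioned here can be checked numerically on small discrete distributions. The following sketch is only an illustration with hypothetical probability vectors `p` and `q`; it also shows that the divergence is not symmetric in its two arguments.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

kl_pq = kl_divergence(p, q)  # positive, since p and q differ
kl_qp = kl_divergence(q, p)  # also positive, but a different value
kl_pp = kl_divergence(p, p)  # zero, since both distributions coincide
```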

2) THE PROPOSED POSTERIOR DISTRIBUTION
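Here we propose the following factorization for the posterior distribution:
$$q(\mathbf{Z},\{\mathbf{F}^l,\mathbf{U}^l\}_{l=1}^{L},\mathbf{R}) = q(\mathbf{Z}) \left[\prod_{l=1}^{L} p(\mathbf{F}^l|\mathbf{U}^l;\mathbf{F}^{l-1},\tilde{\mathbf{X}}^{l-1})\, q(\mathbf{U}^l)\right] q(\mathbf{R}).$$
The details for each factor are as follows. The distribution on the true labels $\mathbf{Z}$ is given by categorical distributions, $q(\mathbf{Z}) = \prod_{n=1}^{N} \mathbf{z}_n^\top \mathbf{q}_n$. The probability for each instance, $\mathbf{q}_n$, is a variational parameter to be estimated. Namely, $\mathbf{q}_n$ is a $K$-dimensional vector containing the probabilities that the $n$-th instance belongs to each one of the $K$ classes (in particular, all the values in $\mathbf{q}_n$ add up to one). The distribution on $\mathbf{F}^l$ given $\mathbf{U}^l$ is kept equal to the prior conditional $p(\mathbf{F}^l|\mathbf{U}^l;\mathbf{F}^{l-1},\tilde{\mathbf{X}}^{l-1})$. As discussed in previous work [20], [21], this ultimately allows for efficient mini-batch training. For the inducing point distribution, $q(\{\mathbf{U}^l\}_{l=1}^{L})$ is a multivariate Gaussian distribution where we have to estimate the mean vectors $\mathbf{m}^l_d$ and the covariance matrices $\mathbf{S}^l_d$ for each unit $d$ in each layer $l$. Finally, for the confusion matrix distribution, we assume posterior Dirichlet distributions $\mathrm{Dir}(\mathbf{r}^a_j|\tilde{\alpha}^a_{1j}, \dots, \tilde{\alpha}^a_{Kj})$.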
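All the variational parameters which $q(\mathbf{Z}, \{\mathbf{F}^l, \mathbf{U}^l\}_{l=1}^{L}, \mathbf{R})$ depends on are collectively denoted by $\mathbf{V}$, i.e., $\mathbf{V} = \{\mathbf{q}_n, \mathbf{m}^l_d, \mathbf{S}^l_d, \tilde{\boldsymbol{\alpha}}^a_j\}$.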

3) THE RESULTING ELBO AND TRAINING PROCEDURE
Following the variational inference procedure, minimizing the KL divergence is equivalent to maximizing the Evidence Lower Bound (ELBO) [33]. In our case, the ELBO is given by:
$$\begin{aligned}\mathrm{ELBO} ={}& \sum_{n,k}\sum_{a\in A_n} q_{nk}\,\mathbb{E}_{q(\mathbf{r}^a_k)}\left[\log p(\mathbf{y}^a_n|\mathbf{e}_k,\mathbf{r}^a_k)\right] + \sum_{n,k} q_{nk}\,\mathbb{E}_{q(\mathbf{f}^L_{n,:})}\left[\log p(\mathbf{e}_k|\mathbf{f}^L_{n,:})\right] \\ &- \sum_{n,k} q_{nk}\log q_{nk} - \sum_{l=1}^{L}\mathrm{KL}\left(q(\mathbf{U}^l)\,\|\,p(\mathbf{U}^l;\tilde{\mathbf{X}}^{l-1})\right) - \sum_{a,j}\mathrm{KL}\left(q(\mathbf{r}^a_j)\,\|\,p(\mathbf{r}^a_j)\right).\end{aligned} \tag{7}$$
The ELBO is composed of five interpretable terms. The first term encodes fidelity to the noisy observed data. The second term ensures that the DGP predicts well the distribution of the latent ground-truth labels. The third term imposes informativeness on the distribution of the ground-truth labels. The last two terms encode fidelity to the prior distributions on the DGP and the confusion matrices, respectively. Due to the chosen posterior distribution, all these terms (except the second one) can be computed in closed form. For the expectation in the second one, we leverage Monte Carlo samples. The ELBO is maximized w.r.t. the variational parameters $\mathbf{V}$, the inducing point locations $\tilde{\mathbf{X}}$, and the DGP kernel hyperparameters. For clarity, the training process is summarized in Algorithm 1.
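As a hedged illustration of why the first and third terms admit closed forms, the sketch below uses the standard identity for a Dirichlet distribution, E[log r_i] = ψ(α_i) − ψ(Σ_i α_i), where ψ is the digamma function. The helper names, the toy values of `q_n` and `alpha_tilde`, and the pure-Python digamma approximation are assumptions made for this example and do not come from the DGPCR implementation.

```python
import math


def digamma(x):
    """Digamma function for x > 0 via recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:          # shift x upwards using psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def expected_log_lik(q_n, alpha_tilde, observed_class):
    """sum_k q_nk * E_{q(r_k^a)}[log r_{ik}^a] for one label y_n^a = e_i.

    alpha_tilde[i][k] holds the posterior Dirichlet parameters of column k.
    """
    num_classes = len(q_n)
    total = 0.0
    for k in range(num_classes):
        col_sum = sum(alpha_tilde[i][k] for i in range(num_classes))
        total += q_n[k] * (digamma(alpha_tilde[observed_class][k]) - digamma(col_sum))
    return total


def entropy_term(q_n):
    """-sum_k q_nk log q_nk, the contribution of one instance to the third term."""
    return -sum(q * math.log(q) for q in q_n if q > 0)


# Toy values (hypothetical): one instance, one annotator, three classes.
q_n = [0.7, 0.2, 0.1]
alpha_tilde = [[9.0, 1.0, 1.0], [1.0, 9.0, 1.0], [1.0, 1.0, 9.0]]
first_term = expected_log_lik(q_n, alpha_tilde, observed_class=0)
third_term = entropy_term(q_n)
```

The expected log-likelihood is never positive, because it averages logarithms of probabilities.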

4) MAKING PREDICTIONS
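Finally, for a previously unseen $\mathbf{x}_*$, we can predict both its true label and the label that each annotator would assign to it (recall Figure 1). For the former, we must propagate $\mathbf{x}_*$ through the DGP with the estimated parameters. Specifically, we have that the prediction on the last layer is a mixture of Gaussians:
$$q(\mathbf{f}^L_*) \approx \frac{1}{S}\sum_{s=1}^{S} q(\mathbf{f}^L_*|\mathbf{m}^L,\mathbf{S}^L;\mathbf{f}^{(s),L-1}_*,\tilde{\mathbf{X}}^{L-1}),$$
where we use $S$ samples from the posterior. For the latter, we combine the predicted true label with the estimated confusion matrices.
Algorithm 1. Training process of DGPCR.
    for each training iteration do
        Calculate ELBO in eq. (7).
        Update $\mathbf{V}$, $\tilde{\mathbf{X}}$ and the kernel hyperparameters using Adam optimizer.
    end for
    return Optimal model parameters $\mathbf{V}$, $\tilde{\mathbf{X}}$ and kernel hyperparameters.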
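The two prediction steps of this subsection can be sketched as follows. This is a minimal illustration and not the released implementation: it assumes that class probabilities are obtained by averaging the softmax over the S last-layer samples, and that the annotator prediction uses a point estimate of the confusion matrix (for instance, the mean of its posterior Dirichlet). All names and toy values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of real values."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_true_label(latent_samples):
    """Average softmax probabilities over S samples of the last-layer output."""
    num_samples = len(latent_samples)
    num_classes = len(latent_samples[0])
    probs = [softmax(f) for f in latent_samples]
    return [sum(p[k] for p in probs) / num_samples for k in range(num_classes)]


def predict_annotator_label(true_probs, confusion):
    """p(y = e_i | x) = sum_j r_ij * p(z = e_j | x) for one annotator."""
    num_classes = len(true_probs)
    return [
        sum(confusion[i][j] * true_probs[j] for j in range(num_classes))
        for i in range(num_classes)
    ]


# Toy values (hypothetical): S = 2 samples of a 3-class last layer.
latent_samples = [[2.0, 0.0, -1.0], [1.5, 0.5, -0.5]]
confusion = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.2],
    [0.0, 0.1, 0.8],
]
true_probs = predict_true_label(latent_samples)
annotator_probs = predict_annotator_label(true_probs, confusion)
```

Because every column of the confusion matrix sums to one, the annotator prediction is again a valid probability vector.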

IV. A FULLY SYNTHETIC EXAMPLE ON 1D DATASET
This section presents a fully synthetic example. The goal is to show that DGPCR is able to predict the annotators' expertise, leading to high predictive performance. In the next sections, we will use more complex datasets and a wide range of baselines. We first describe the experimental framework in Section IV-A, and then the obtained results in Section IV-B.

A. EXPERIMENTAL FRAMEWORK

1) DATA DESCRIPTION
This experiment uses a 1D synthetic dataset for binary classification. A cosine function produces the labels: the label is 1 where the cosine function is positive and it is 0 where the cosine is negative. We sample 200 points for training and 100 for test uniformly distributed in the interval (−4,4). Figure 3 illustrates the data.
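A dataset of this kind could be generated along the following lines. This is a minimal sketch: the function name `make_cosine_dataset` and the random seed are hypothetical, and the closed interval returned by `uniform` differs from the open interval (−4, 4) only at its endpoints.

```python
import math
import random


def make_cosine_dataset(n_points, rng, low=-4.0, high=4.0):
    """Sample inputs uniformly and label them by the sign of the cosine."""
    xs = [rng.uniform(low, high) for _ in range(n_points)]
    ys = [1 if math.cos(x) > 0 else 0 for x in xs]
    return xs, ys


rng = random.Random(0)
x_train, y_train = make_cosine_dataset(200, rng)
x_test, y_test = make_cosine_dataset(100, rng)
```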

2) DESCRIPTION OF THE ANNOTATORS
We simulate five synthetic annotators with behaviors that one can find in real-world problems. Figure 4 shows the labels provided by each annotator. We define them by their specificity and sensitivity, which are assumed to be the same. Therefore, notice that the three first annotators can be considered as ''experts'' with varying degrees of precision (0.95, 0.9, and 0.6). The fourth annotator is a ''spammer'', as they label randomly regardless of the true label (probability of 0.5 for each class). Notice that the third ''expert'' is just slightly better than the spammer. The behavior of the last annotator is known as ''adversarial'' since they have learned a wrong concept. Namely, in this case, they swap both classes with probability 0.9. That is, their specificity and sensitivity are equal to 0.1. Notice that, whereas no knowledge can be extracted from a ''spammer'' annotator, whose labels are pure noise, very valuable knowledge can be obtained from an ''adversarial'' one, as long as its confusion matrix is correctly identified. This is because annotator 5 produces annotations with systematic errors, in contrast to the random labels of annotator 4.
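The five behaviors can be simulated with a single reliability value per annotator, since sensitivity and specificity are assumed equal. The sketch below is an illustration with hypothetical names and sample sizes; it also shows why the adversarial annotator is informative: flipping its labels yields high agreement with the truth, whereas no transformation helps for the spammer.

```python
import random


def annotate(true_label, reliability, rng):
    """Return the true binary label with probability `reliability`, else flip it."""
    return true_label if rng.random() < reliability else 1 - true_label


# Sensitivity = specificity for: three experts, a spammer, an adversarial annotator.
reliabilities = [0.95, 0.9, 0.6, 0.5, 0.1]

rng = random.Random(0)
true_labels = [rng.randint(0, 1) for _ in range(5000)]
noisy = [[annotate(z, r, rng) for z in true_labels] for r in reliabilities]

# Empirical agreement of each annotator with the true labels.
agreement = [
    sum(y == z for y, z in zip(column, true_labels)) / len(true_labels)
    for column in noisy
]
# Flipping every label of the adversarial annotator recovers a reliable one.
flipped_agreement = 1.0 - agreement[4]
```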

3) EXPERIMENTAL DETAILS
We design a simple DGPCR of 2 layers. We use $M = 64$ inducing points and a batch size of 128. The ELBO is optimized using Adam and a learning rate of $10^{-2}$. We optimize the GP methods through 2,000 iterations. We use a Squared Exponential (SE) kernel. When predicting, we propagate $S = 100$ samples. We trained the method on a CPU. We repeat the experiment 10 times, sampling a different synthetic dataset each time.

B. RESULTS AND DISCUSSION
The DGPCR method achieves an accuracy of 99%±0.66 and a log-loss of 0.0186±0.0204. It is capable of classifying the test set well across the 10 runs. Furthermore, the value of the log-loss reveals that the predicted scores are well-separated for both classes. For a better understanding of the crowdsourcing scenario, we report the predicted reliability of the simulated annotators in Table 2. These results suggest that the method is capable of estimating the three different kinds of annotators and provides an accurate prediction. Notice that the estimation is not exact. This fact is due to the prior distribution of the annotators' reliability, which acts as a regularizer. This characteristic, which arises from the Bayesian framework, is of vital importance in real problems with limited data. This experiment confirms the effectiveness of our proposed method on a fully synthetic problem. We will further validate our method in more complex scenarios in the following sections.

V. AN ILLUSTRATIVE SEMI-SYNTHETIC EXAMPLE ON MNIST
This section focuses on a controlled experiment where we can simulate the behavior of the crowdsourcing annotators, and then check that DGPCR is able to accurately estimate it. We first describe the experimental framework in Section V-A, and then the obtained results in Section V-B.

A. EXPERIMENTAL FRAMEWORK

1) DATA DESCRIPTION
This experiment uses the well-known MNIST database, where the goal is to classify hand-written digits into ten different classes (from 0 to 9).

2) DESCRIPTION OF THE ANNOTATORS
Following the previous experiment, we simulate five synthetic annotators with different paradigmatic behaviors that one can find in real-world problems. The first row in Figure 5 shows the confusion matrices for each annotator. Recall that the element (i, j) of the matrix represents the probability that the annotator labels as class i an instance whose real class is j. Therefore, notice that the three first annotators can be considered as ''experts'' with varying degrees of precision (0.95, 0.9, and 0.5). The fourth annotator is a ''spammer'' one, as they label randomly regardless of the true label (probability of 0.1 for each class, recall that MNIST has 10 classes). The last annotator is ''adversarial'' since they have learned a wrong concept. Namely, in this case, they are confidently classifying the digit 0 as a 5, the 1 as a 6, etc.
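Confusion matrices for the three kinds of annotators could be built as in the sketch below. Two choices here are assumptions that the text does not state: the experts spread their error mass uniformly over the wrong classes, and the adversarial annotator shifts every digit by 5 modulo 10 (the text only gives 0 to 5 and 1 to 6 as examples). The function names are hypothetical.

```python
def expert_matrix(accuracy, num_classes=10):
    """Diagonal entries equal `accuracy`; the rest is spread uniformly (assumption)."""
    off = (1.0 - accuracy) / (num_classes - 1)
    return [
        [accuracy if i == j else off for j in range(num_classes)]
        for i in range(num_classes)
    ]


def spammer_matrix(num_classes=10):
    """Every label is equally likely regardless of the true class."""
    return [[1.0 / num_classes] * num_classes for _ in range(num_classes)]


def adversarial_matrix(shift=5, num_classes=10):
    """True class j is always reported as (j + shift) mod num_classes (assumption)."""
    return [
        [1.0 if i == (j + shift) % num_classes else 0.0 for j in range(num_classes)]
        for i in range(num_classes)
    ]


# Entry [i][j] is the probability of reporting class i when the true class is j.
annotators = [
    expert_matrix(0.95),
    expert_matrix(0.9),
    expert_matrix(0.5),
    spammer_matrix(),
    adversarial_matrix(),
]
```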

3) EXPERIMENTAL DETAILS
In the experimental validation, we try DGPCR with 2 and 3 layers (they will be called DGPCR2 and DGPCR3, respectively). We use $M = 100$ inducing points and a batch size of 1000. The ELBO is optimized using Adam and a learning rate of $10^{-2}$. We optimize the GP methods through 20,000 iterations. The dimensionality of the latent space is 30, and we leverage a SE kernel. When predicting, we propagate $S = 100$ samples. We trained the methods on an NVIDIA TITAN X (Pascal) GPU device with 12 GB of memory.

B. RESULTS AND DISCUSSION
Figure 5 shows the estimations provided by DGPCR2 and DGPCR3 for the annotator's behavior (confusion matrices). More specifically, the depicted values for DGPCR are the expectations of the posterior Dirichlet distributions $q(\mathbf{R}^a)$. We observe a very accurate prediction for all types of annotators. In particular, identifying the spammer annotator allows DGPCR to discard the information provided by them, which is pure noise. Likewise, identifying the adversarial nature of the fifth annotator allows DGPCR to extract knowledge from it. In total, thanks to the accurate prediction of annotators' biases, DGPCR has access to valuable information from the noisy labels, as demonstrated next.

2) DGPCR REACHES A HIGH PREDICTIVE PERFORMANCE ON THE TEST SET
In spite of being trained with noisy labels, Table 3 shows a high test performance for DGPCR. Namely, its predictions are correct 97.82% of the time for $L = 2$ layers, and 98.02% of the time for $L = 3$. The log-loss is also notably low compared to the value obtained by SVGPCR (the shallow counterpart of DGPCR, based on plain GPs instead of DGPs, which is included here as a baseline). Notice that the log-loss is the average negative log-likelihood for the test data (the lower the better). It takes into account the uncertainty of the predictions (unlike the accuracy, which only accounts for the mean of the prediction). For completeness and to numerically support the findings in Figure 5, the last column of Table 3 shows low values for the confusion matrix (CM) error. This error is the mean absolute error between the estimated CM and the true CM for all the annotators. Finally, it is important to stress that DGPCR has no information about the ground-truth label or the annotators' expertise. It automatically estimates the confusion matrices and learns the latent DGP to make new predictions. In the following section, we will see how this method can be applied to a real-world problem of histology breast cancer classification.
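The two metrics defined in this paragraph can be computed as in the following sketch. It is only an illustration: the clipping constant `eps` and the choice of averaging the CM error uniformly over all entries of all annotators are assumptions, since the exact normalization is not specified in the text.

```python
import math


def log_loss(probs, true_classes, eps=1e-15):
    """Average negative log-likelihood of the true classes (lower is better)."""
    total = 0.0
    for p, z in zip(probs, true_classes):
        total -= math.log(max(p[z], eps))  # clip to avoid log(0)
    return total / len(true_classes)


def cm_error(estimated, true):
    """Mean absolute error between estimated and true confusion matrices."""
    diffs = [
        abs(e - t)
        for est_matrix, true_matrix in zip(estimated, true)
        for est_row, true_row in zip(est_matrix, true_matrix)
        for e, t in zip(est_row, true_row)
    ]
    return sum(diffs) / len(diffs)


# Toy values (hypothetical): two test instances and one binary annotator.
probs = [[0.9, 0.1], [0.2, 0.8]]
true_classes = [0, 1]
test_log_loss = log_loss(probs, true_classes)

true_cms = [[[0.9, 0.1], [0.1, 0.9]]]
estimated_cms = [[[0.8, 0.2], [0.2, 0.8]]]
test_cm_error = cm_error(estimated_cms, true_cms)
```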

VI. CLASSIFICATION OF HISTOLOGICAL BREAST CANCER IMAGES
This section is devoted to the real-world application of our method to histological breast cancer images. Specifically, Section VI-A introduces the experimental framework. Then, Section VI-B presents and discusses the main results.

A. EXPERIMENTAL FRAMEWORK

1) DATA DESCRIPTION
We evaluate our methodology on a histopathology image dataset collected from the TCGA Breast Cancer cohort [22]. It contains 161 ROIs from 151 different WSIs (Whole Slide Images) collected in 18 institutes. It originated from an international study where 2 senior pathologists provided expert labels and 20 medical students, who were non-pathologists, provided crowdsourced annotations. The interested reader is referred to [7] for the full details on the annotation protocol. The images have color variations which may degrade the performance of systems [34]. Thus, we apply color normalization [35] to minimize differences among institutes and crop the WSIs in patches of size 224 × 224. We divide the dataset into train and test sets. The test set contains the images annotated by the senior pathologists, whose label is considered the ground truth. The train set contains the crowdsourcing labels provided by the students (a total of 108495 crowdsourcing labels are available). In total, we obtain 75243 patches for training and 4364 for testing. These are patches from three different classes: tumor (train: 37260, test: 2692), stroma (train: 27668, test: 1196) and immune infiltrate (train: 10315, test: 476). This constitutes a moderately imbalanced scenario where immune infiltrate is the minority class. All the methods are tested and assessed on the test set.

2) EXPERIMENTAL DETAILS
We use a pretrained VGG16 to extract features for the proposed DGPCR, recall Figure 1. After the last convolutional layer, we apply average pooling with a 7 × 7 window to reduce the number of features to a vector of 512 components. To maximize the ELBO, recall eq. (7), we use the Adam optimizer with a learning rate of $10^{-2}$. We performed 31,000 iterations. We utilize 100 inducing points for the sparse GPs, and the minibatch size is set to 1000. The latent dimension of the hidden layers is 10. When predicting, we propagate $S = 100$ samples. We implemented the proposed DGPCR in GPflow 1.2.0. We trained the methods on an NVIDIA TITAN X (Pascal) GPU device with 12 GB of memory.
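The pooling step can be illustrated with a small sketch. It assumes that the convolutional output has a spatial size of 7 × 7 with 512 channels, so that a 7 × 7 averaging window covers the whole map and yields one value per channel. The array below is a hypothetical stand-in for a real VGG16 output, and the function name is illustrative.

```python
def global_average_pool(feature_map):
    """Average an H x W x C feature map over its spatial positions."""
    height = len(feature_map)
    width = len(feature_map[0])
    channels = len(feature_map[0][0])
    pooled = [0.0] * channels
    for row in feature_map:
        for cell in row:
            for c, value in enumerate(cell):
                pooled[c] += value
    return [total / (height * width) for total in pooled]


# Hypothetical stand-in for a 7 x 7 x 512 convolutional output:
# channel c holds the constant value c at every spatial position.
feature_map = [[[float(c) for c in range(512)] for _ in range(7)] for _ in range(7)]
features = global_average_pool(feature_map)  # vector of 512 components
```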

B. RESULTS AND DISCUSSION
In this section, we evaluate the performance of DGPCR on the aforementioned breast cancer problem. We analyze five research questions, each discussed in one of the following five subsections.

1) COMPARISON TO STATE-OF-THE-ART CROWDSOURCING METHODS
Here we show that DGPCR performance on the test set is slightly but consistently better than that of other state-of-the-art crowdsourcing methods. Table 4 compares DGPCR (with two, three and four layers) to five state-of-the-art crowdsourcing methods. These are based on deep learning (AggNet [13]; CL-VW, CL-VWB, CL-MW [14]) and GPs (SVGPCR [19]), which are precisely the core components of DGPs, recall also Section II.
In the global results, DGPCR obtains slightly superior performance across different types of metrics. The F1-Score does not consider the uncertainty in the predictions; it is a trade-off between Recall and Precision, which is very relevant in this imbalanced scenario. The log-loss does consider the uncertainty in the predictions (it is the negative log-likelihood of the test data; the lower, the better). The AUC (area under the ROC curve) is a threshold-free metric commonly used in machine learning.
Moreover, in this imbalanced scenario, it is particularly important to analyze the performance in the minority class (immune infiltrate). We observe that DGPCR-4 obtains the best result in the minority class. The closest method (CL-VW) gets significantly worse performance in the other classes (especially in stroma).
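To make the first two metrics concrete, the sketch below computes a macro-averaged F1-Score from hard predictions and the log-loss from predictive probabilities. It is illustrative only: the function names and toy data are hypothetical, macro averaging and normalization of the log-loss by the number of test samples are assumptions, and the numbers are unrelated to Table 4.

```python
import math


def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for k in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == k and p == k)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != k and p == k)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == k and p != k)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


def log_loss(y_true, probs, eps=1e-12):
    """Average negative log-probability assigned to the true class."""
    total = sum(math.log(max(p[t], eps)) for t, p in zip(y_true, probs))
    return -total / len(y_true)


# Toy example with three classes (hypothetical values).
y_true = [0, 0, 1, 2]
probs = [[0.8, 0.1, 0.1], [0.4, 0.5, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]]
y_pred = [max(range(3), key=lambda k: p[k]) for p in probs]

f1 = macro_f1(y_true, y_pred, num_classes=3)
nll = log_loss(y_true, probs)
print(f1, nll)
```

Note that the F1-Score only sees the hard decisions `y_pred`, whereas the log-loss penalizes the second sample for assigning probability 0.4 to the true class; this is the sense in which only the latter accounts for predictive uncertainty.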

2) EVALUATING THE ESTIMATION OF ANNOTATORS' BEHAVIOR
As explained in Section III and illustrated in Section V, DGPCR estimates a per-annotator confusion matrix (CM) that describes the annotator's behavior on the different classes. To evaluate the quality of this estimation, the fourth column of Table 4 shows the CM error for all those methods that estimate an analogous CM. As before, this error is the mean absolute error between the estimated CM and the true CM (which can be approximated based on the true labels provided by expert pathologists). We observe that DGPCR obtains the best result, with a significant difference against the DL-based CL-MW. This can also be visualized in Figure 6, which shows the actual CMs estimated for five different annotators.
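The CM error described above can be sketched as follows for a single annotator. This is a minimal illustration rather than our evaluation code: the helper names are hypothetical, and the row convention (rows index the true class, columns the class reported by the annotator) is an assumption made for the example.

```python
def empirical_confusion_matrix(true_labels, annotator_labels, num_classes):
    """Row-normalized counts: entry [i][j] approximates
    p(annotator reports j | true class is i)."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for t, a in zip(true_labels, annotator_labels):
        counts[t][a] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def confusion_matrix_mae(estimated, reference):
    """Mean absolute difference over all entries of two K x K matrices."""
    diffs = [
        abs(e - r)
        for est_row, ref_row in zip(estimated, reference)
        for e, r in zip(est_row, ref_row)
    ]
    return sum(diffs) / len(diffs)


# Toy two-class example (hypothetical values).
expert = [0, 0, 0, 0, 1, 1, 1, 1]
annotator = [0, 0, 0, 1, 1, 1, 0, 0]
reference_cm = empirical_confusion_matrix(expert, annotator, num_classes=2)
estimated_cm = [[0.8, 0.2], [0.4, 0.6]]  # stands in for a model's estimate

cm_error = confusion_matrix_mae(estimated_cm, reference_cm)
print(reference_cm, cm_error)
```

Here the reference CM is built from samples for which both an expert label and the annotator's label are available, which is how the true CM is approximated in practice.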

3) COMPARISON TO THEORETICAL LOWER AND UPPER BOUNDS
Since DGPCR uses noisy crowdsourced labels, in theory its performance should be upper bounded by a standard DGP trained with expert (GOLD) labels. Analogously, its performance should be (lower) bounded by a standard DGP trained with the naive majority voting (MV) strategy, that is, considering as true label the one that was assigned by most annotators. Indeed, one of our main hypotheses is that, by adapting machine learning methods to the crowdsourcing paradigm, we can overcome naive methods like MV and obtain results that are very close to the ideal (but non-affordable) setting where all the expert labels are available (GOLD). Table 5 shows the F1-Score results of DGP trained under these three different paradigms (when using 2, 3, and 4 layers). This confirms the hypothesized bounds, which reinforces the consistency of the proposed methodology. Moreover, Table 5 includes analogous results for DL and GP. Importantly, notice that DGP with crowdsourced labels obtains better results than GP and DL with gold ones. Here, we use a VGG-16 net for DL.
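For reference, the MV baseline amounts to the following aggregation, shown here as a minimal sketch. The function name is hypothetical, and the tie-breaking rule (smallest class index) is an arbitrary assumption, since ties are not discussed above.

```python
from collections import Counter


def majority_vote(labels_per_sample):
    """Return, for each sample, the class chosen by most annotators.

    labels_per_sample: one list of class indices per sample; samples may be
    labeled by different numbers of annotators.
    """
    aggregated = []
    for votes in labels_per_sample:
        counts = Counter(votes)
        best = max(counts.values())
        # Assumed tie-break: pick the smallest class index among the winners.
        aggregated.append(min(k for k, c in counts.items() if c == best))
    return aggregated


# Toy example (hypothetical annotations): three samples, varying annotators.
crowd_labels = [[0, 0, 1], [2, 1, 2, 2], [1, 0]]
mv_labels = majority_vote(crowd_labels)
print(mv_labels)
```

MV weights every annotator equally, so a systematic bias shared by several annotators propagates directly into the training labels; modeling each annotator with a confusion matrix is what allows crowdsourcing methods to correct for it.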

4) VISUALIZING THE PREDICTIONS
The numerical results obtained so far are well illustrated in Figure 7. This figure focuses on an ROI annotated by all the participants. Notice that the segmentations predicted by SVGPCR and DGPCR-4 are obtained by aggregating the predictions obtained at the patch level.
The first row shows the analyzed ROI, the mask provided by the expert pathologist, and the predictions obtained by SVGPCR and DGPCR-4. In spite of working at the patch level, both methods capture well the structure of the different classes. Notice that, as shown by the previous numerical results, DGPCR-4 is sharper in the minority class (see the blue isles in the green and red areas, which are better captured by DGPCR-4). The second and third rows show the segmentation provided by two crowdsourcing annotators, and the predictions obtained by SVGPCR and DGPCR-4 for those annotators. Again, we observe accurate predictions in general, with DGPCR-4 being finer in the minority class (specifically, see again the blue isle in the red area in the second row; and the blue isle that SVGPCR predicts wrongly in the green area in the third row, top left corner).

TABLE 6. Average and 0.95 confidence interval of macro F1 score with reduced subsets of the training set. Each column refers to a percentage of the original training set size. Every experiment has been repeated three times using different subsets.

FIGURE 8. Average macro F1 score (y-axis) using subsets of the training set (given as a percentage; x-axis). We see that GP-based methods are more robust to small amounts of crowdsourced labeled data. Furthermore, DGPCR methods perform quite well across different sizes, unlocking their full potential when more data is available. However, DL methods fail considerably when data is reduced.

5) ROBUSTNESS TO THE SIZE OF THE TRAINING SET
Finally, we assess the generalization capability and robustness of DGPCR against the lack of labeled data, which is a typical scenario in medical imaging. We consider three different subsets for each size to measure the variability of the performance, reporting the average and 0.95 confidence interval of the three runs. Following Section VI-B1, we compare with SVGPCR and DL methods (for the latter we focus on CL methods, which have obtained better results so far). Figure 8 shows the results graphically. We observe a gap between GP-based and DL-based methods, confirming that probabilistic methods can generalize better even when data is scarce. Shallow SVGPCR is the best with little data, but DGPCR performs reasonably well across different settings. Furthermore, as data increases, DGPCR takes advantage of its more complex architecture to exploit the available data. In conclusion, DGPCR performs well even when training data is reduced, showing how it combines the advantages of both SVGPCR and DL methods. Table 6 shows these results with a 0.95 confidence interval. In addition to the conclusions already drawn, we can observe the stability of the GP-based methods across different training subsets. In general, these methods outperform DL methods for every training size. Specifically, DGPCR outperforms the rest (including the shallow SVGPCR), with non-overlapping confidence intervals, when enough data is available (i.e., more than 25% of the training set).
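One common way to obtain an average and a 0.95 confidence interval from three repetitions is a Student-t interval, sketched below. This is an assumption made for illustration: the exact interval construction is not specified above, the function name is hypothetical, and the scores in the example are invented rather than taken from Table 6.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 2 degrees of freedom,
# i.e. valid only for exactly three runs.
T_CRIT_95_DF2 = 4.303


def mean_and_ci(values, t_crit=T_CRIT_95_DF2):
    """Return the sample mean and the half-width of the confidence interval."""
    mean = statistics.mean(values)
    half_width = t_crit * statistics.stdev(values) / math.sqrt(len(values))
    return mean, half_width


# Hypothetical macro F1 scores from three runs on different training subsets.
runs = [0.80, 0.81, 0.82]
mean, half_width = mean_and_ci(runs)
print(f"{mean:.4f} +/- {half_width:.4f}")
```

With only three runs the critical value is large, so intervals of this kind are wide; two methods with non-overlapping intervals therefore differ by a margin that is large relative to the run-to-run variability.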

VII. CONCLUSION
Crowdsourcing can be an effective approach for generating labeled data at scale for medical applications. ML models trained on crowdsourced data, however, should ideally address the noise introduced by less experienced annotators and the biases of individual annotators. While probabilistic methods can effectively model crowdsourcing, many of these methods cannot learn complex representations required in problems like image classification or segmentation. DGPCR addresses this challenge by combining the advantages of deep learning and probabilistic methods. Specifically, it combines the capabilities of complex function modeling with uncertainty quantification to provide a robust solution to crowdsourcing tasks. This is the first step towards the end-to-end training of deterministic feature extractors and probabilistic classifiers in crowdsourcing scenarios.
Our DGPCR method can infer an estimated ground truth on unseen instances and can generate predictions that reflect the biases of individual annotators. We evaluated DGPCR on MNIST and on a real-world breast cancer classification problem, showing competitive or superior performance to state-of-the-art crowdsourcing methods. The performance of DGPCR trained on noisy labels is similar to that obtained when training with expert labels. DGPCR was compared to alternatives in terms of overall performance, performance on minority classes, robustness to adversarial annotators, and training set size efficiency. While the additional parameters introduced in DGPCR require larger training sets, they produce higher performance in most tasks.
There are still some open questions in crowdsourcing. Future work should address how much labeled data, or how much overlap between experts and non-experts, is needed to obtain satisfactory results. Despite these limitations, this paper opens the door to more robust and competitive classifiers in crowdsourcing scenarios, which are of great interest to the medical imaging community. We consider that data labeled by multiple pathologists are needed to tackle inter-observer variability and individual biases. This approach can lead to a consensus and leverage noisy labels provided by generalists and pathology trainees. This tool can also help train novice pathologists and medical students, boosting their performance. Ultimately, this approach will help to achieve more robust clinical systems in digital pathology.