Hyperbolic Secant representation of the logistic function: Application to probabilistic Multiple Instance Learning for CT intracranial hemorrhage detection

Multiple Instance Learning (MIL) is a weakly supervised paradigm that has been successfully applied to many different scientific areas and is particularly well suited to medical imaging. Probabilistic MIL methods, and more specifically Gaussian Processes (GPs), have achieved excellent results due to their high expressiveness and uncertainty quantification capabilities. One of the most successful GP-based MIL methods, VGPMIL, resorts to a variational bound to handle the intractability of the logistic function. Here, we formulate VGPMIL using Pólya-Gamma random variables. This approach yields the same variational posterior approximations as the original VGPMIL, which is a consequence of the two representations that the Hyperbolic Secant distribution admits. This leads us to propose a general GP-based MIL method that takes different forms by simply leveraging distributions other than the Hyperbolic Secant one. Using the Gamma distribution we arrive at a new approach that obtains competitive or superior predictive performance and efficiency. This is validated in a comprehensive experimental study including one synthetic MIL dataset, two well-known MIL benchmarks, and a real-world medical problem. We expect that this work provides useful ideas beyond MIL that can foster further research in the field.


Introduction
Multiple Instance Learning (MIL) [1] is a type of weakly supervised learning that has become very popular due to the reduced annotation effort it requires. In MIL binary classification [2] the training set consists of instances grouped into bags. Both bags and instances have labels, but we only observe them at the bag level, while instance labels remain unknown. It is assumed that a bag label is positive if and only if the bag contains at least one positive instance. The goal is to achieve a method capable of accurately predicting both bag and instance labels using only bag labels.
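The standard MIL assumption can be stated compactly in code. The following sketch (toy data, not from the paper's experiments) shows how bag labels arise from unobserved instance labels:

```python
def bag_label(instance_labels):
    """Standard MIL assumption: a bag is positive if and only if it
    contains at least one positive instance, i.e. T_b = max of the y_n."""
    return int(max(instance_labels))

# Toy example: three bags of (unobserved) binary instance labels.
bags = [[0, 0, 0], [0, 1, 0], [1, 1, 0]]
labels = [bag_label(b) for b in bags]
print(labels)  # [0, 1, 1]
```

Only the `labels` on the left are observed during training; recovering the per-instance labels from them is what makes the problem weakly supervised.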
The MIL approach has been successfully applied to many different scientific domains [1], being particularly well suited to medical imaging [3].
In this work, we are particularly interested in the problem of IntraCranial Hemorrhage (ICH) detection. ICH is a severe life-threatening emergency with high mortality and morbidity rates, caused by blood leakage inside the brain [4]. Computed Tomography (CT) scans are widely used to diagnose ICH because they are an inexpensive and non-invasive technique for patients.
Each scan is made up of a significant number of slices, each representing a section of the head at a given height. A CT scan is labeled as ICH if at least one of its slices shows evidence of the injury; otherwise, it is normal. To apply a supervised learning approach, radiologists have to manually label every single slice in the dataset [5,6,7], which is a costly and tedious process. In contrast, the MIL approach significantly reduces the radiologists' workload, as only one label per scan is needed.
To learn in the MIL scenario, Deep Learning (DL) methods have become popular in practice due to their ability to deal with highly structured data [3,8,9,10]. The most successful models combine DL architectures with attention mechanisms to weigh the relevance of each instance, see [11]. However, these methods do not model the instance label explicitly (they just have an attention weight for each instance), which hampers the quantification of uncertainty at the instance level. Note that this is essential in MIL, since instance labels are unknown. As a consequence, plenty of attention has been paid to probabilistic MIL methods in recent years. Among them, Gaussian Processes (GPs) have achieved very competitive results [12,13,14,15,16,17], due to their high expressiveness and uncertainty quantification capabilities. Most of these GP-based MIL methods build on the popular VGPMIL [14], which formulates the MIL problem through sparse GPs for classification using the logistic function.
In order to achieve mathematical tractability of the logistic function, VGPMIL introduces a variational bound known as the Jaakkola bound.
Therefore, the training objective becomes a lower bound of the real one. As recently shown in [16], this theoretical approximation degrades the predictive performance in practice. An alternative and exact treatment of the logistic function has recently been introduced in the context of GPs for supervised classification, see [18,19,20]. The idea is to augment the model using Pólya-Gamma variables, obtaining an equivalent and tractable formulation.
However, to the best of our knowledge, these ideas have never been adapted to the MIL scenario.
In this work, we first reformulate the VGPMIL model using Pólya-Gamma random variables. We find that this new model, called PG-VGPMIL, is equivalent to the existing VGPMIL when performing closed-form variational inference updates (they lead to the same update equations). This phenomenon, which was also observed in supervised classification [18,19], finds its justification in the properties of the Hyperbolic Secant density [21].
Thanks to these properties, the logistic observation model admits two representations [22], as a Super Gaussian and as a Gaussian Scale Mixture (GSM), which lead to the same variational optimization objective [18, Theorem 2.1]. We build upon the GSM representation to formulate ψ-VGPMIL, a general model where ψ is a differentiable GSM density. When ψ is the Hyperbolic Secant density, we recover the original VGPMIL. Although ψ-VGPMIL is formulated using GPs, it can be extended to other related probabilistic frameworks, such as Relevance Vector Machines [23,24]. Inspired by the above connection and by the definition of Pólya-Gamma variables [20], we replace the Pólya-Gamma distribution with the Gamma distribution, obtaining a new GSM density and thus a new particularization of ψ-VGPMIL, which we refer to as G-VGPMIL. The proposed algorithm is evaluated through a comprehensive set of experiments involving different datasets, baselines and metrics (both at the instance and bag level).
First, we focus on the G-VGPMIL method in a controlled experiment built around the MNIST dataset. This allows us to understand its behavior in practice and identify its main properties. Second, we utilize two historically important MIL benchmark datasets (MUSK1 and MUSK2). This second experiment again shows that G-VGPMIL improves upon the existing VGPMIL approach in terms of efficiency and performance. Third, we show that G-VGPMIL achieves better results than state-of-the-art methods on the ICH detection problem.
In summary, the main contributions of this work are the following: • We introduce PG-VGPMIL, a new probabilistic MIL model based on Pólya-Gamma random variables. To the best of our knowledge, Pólya-Gamma variables have never been used before in the context of MIL.
We observe that PG-VGPMIL is equivalent to the existing VGPMIL when performing closed-form variational inference updates.
• Building on the theory behind this equivalence, we develop ψ-VGPMIL, a general inference framework for the logistic observation model. New inference models can be obtained using different GSM densities, denoted by ψ. PG-VGPMIL (and hence VGPMIL) becomes a particular case of this framework when ψ is the Hyperbolic Secant density.
• We use the Gamma distribution to obtain a new GSM density, which is used in the aforementioned general framework to obtain G-VGPMIL, a new inference model for MIL.To the best of our knowledge, the Gamma distribution has never been used in the context of MIL.
• The newly proposed G-VGPMIL is compared to state-of-the-art approaches using different metrics and datasets, including a real-world ICH detection problem. The experiments conducted show enhanced predictive performance and efficiency.
The rest of the paper is organized as follows. In Section 2 we introduce the MIL problem and its GP-based formulation. In Section 3 we introduce PG-VGPMIL and investigate its equivalence to VGPMIL. In Section 4 we devise ψ-VGPMIL, a new general framework of which VGPMIL is a particular realization. In Section 5 we introduce our new model G-VGPMIL as another particular realization of ψ-VGPMIL. In Section 6 we empirically validate the newly proposed method. In Section 7 we summarize the main conclusions of this work.

Gaussian Processes for Multiple Instance Learning
The goal of this section is to introduce the MIL problem and how GPs are used to address it.

Variational Gaussian Processes for Multiple Instance Learning
Before considering the GP MIL problem, we briefly summarize the general framework for GP logistic classification [25]. Given a training dataset (X, y), GP classification assumes the existence of a latent function f : R^D → R that determines the class of an instance. The model places a GP prior over this function, p(f) = N(f | 0, K_{XX}), and uses the observation model

p(y | f) = ∏_{n=1}^{N} σ(f_n)^{y_n} (1 − σ(f_n))^{1−y_n},   (1)

where σ(t) = 1/(1 + e^{−t}) is the logistic function and f = [f_1, ..., f_N]^⊤ = [f(x_1), ..., f(x_N)]^⊤ are the realizations of the latent function f over the training set. The matrix K_{XX} is defined as (K_{XX})_{ij} = k(x_i, x_j), where k is a positive-definite kernel function [25]. In this work, we will use the popular Radial Basis Function (RBF) kernel, defined as k(x, x') = v exp(−‖x − x'‖² / (2l²)), where v, l > 0 will be treated as hyperparameters. In the following, we denote λ = {v, l}.
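As a concrete sketch, the kernel matrix can be computed as follows (assuming the common RBF parametrization k(x, x') = v·exp(−‖x − x'‖²/(2l²)); the paper's exact parametrization may differ):

```python
import numpy as np

def rbf_kernel(X1, X2, v=1.0, l=1.0):
    """RBF kernel matrix: v is the variance, l the lengthscale.
    X1: (N, D), X2: (M, D) -> (N, M)."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return v * np.exp(-sq_dists / (2.0 * l ** 2))

X = np.array([[0.0], [1.0]])
K = rbf_kernel(X, X, v=2.0, l=1.0)
# Diagonal equals v; the off-diagonal entry is v * exp(-0.5) for unit distance.
```

The same function computes K_{XX}, K_{XZ} and K_{ZZ} by changing its arguments.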
GP inference requires inverting an N × N matrix related to the kernel.
This has a cost of O(N³), so the full model cannot be used for large datasets.
This has motivated the popularization of sparse GP methods. Following the fully independent training conditional (FITC) approximation [26], we introduce a set of M inducing points Z = [z_1, ..., z_M] and their corresponding outputs u = f(Z). Mirroring the relation between X and f, we have

p(u) = N(u | 0, K_{ZZ}),   (2)

p(f | u) = N(f | K_{XZ} K_{ZZ}^{−1} u, K_{XX} − K_{XZ} K_{ZZ}^{−1} K_{ZX}),   (3)

where (K_{XZ})_{ij} = k(x_i, z_j) and K_{ZX} = K_{XZ}^⊤. This approach reduces the cost to O(M²N).
In the MIL scenario, instead of instance labels {y_n} we have only bag labels T_b. To adapt the previous classifier to this setting, VGPMIL [14] introduces the following expression for the bag label likelihood given the labels of the instances,

p(T_b | {y}_b) = H^{T_b G_b + (1−T_b)(1−G_b)} / (H + 1),   G_b = max {y}_b,   (4)

where H > 0 (set to 100 in [14]). When T_b agrees with G_b, the likelihood is H/(H + 1) (close to one); otherwise it is 1/(H + 1) (a very small number). The complete VGPMIL model is given by the product of the distributions in Eq. (1), (2), (3), and (4).
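As an illustration, the bag observation model can be sketched as follows, assuming the form p(T_b | {y}_b) = H^{T_b·G_b + (1−T_b)(1−G_b)}/(H + 1) with G_b = max{y}_b:

```python
def bag_likelihood(T_b, y_bag, H=100.0):
    """Bag observation model: the likelihood is close to one when the
    bag label T_b agrees with G_b = max(y_bag), and 1/(H+1) otherwise."""
    G_b = max(y_bag)
    return H ** (T_b * G_b + (1 - T_b) * (1 - G_b)) / (H + 1)

p_agree = bag_likelihood(1, [0, 1, 0])     # T_b matches max -> H/(H+1)
p_disagree = bag_likelihood(1, [0, 0, 0])  # mismatch -> 1/(H+1)
```

With H = 100, the two values are 100/101 ≈ 0.99 and 1/101 ≈ 0.01, so the model strongly, but not deterministically, enforces the MIL assumption.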
In order to make predictions in VGPMIL it is necessary to compute the posterior distribution p(u, f, y | T), which is not analytically tractable.
The original work [14] resorts to mean-field variational inference [27] to approximate it by a variational distribution q(u, f, y) = q(u) p(f | u) q(y), with q(y) = ∏_{n=1}^{N} q(y_n), selected by minimizing the Kullback-Leibler (KL) divergence between the variational approximation and the true posterior. The solution for each factor is given by [27, Eq. (10.9)].
Since the logistic function used in Eq. (1) is not conjugate to the Gaussian distribution, it is not possible to obtain tractable expressions for the variational factors. To deal with this, VGPMIL uses the Jaakkola bound [28].
In the next subsection we explore an augmented version of VGPMIL based on Pólya-Gamma random variables [20], in which inference is tractable.

Pólya-Gamma variables for GPMIL
The class of Pólya-Gamma random variables was introduced in [20] to propose a data augmentation approach for stochastic Bayesian inference in models with binomial likelihood. Later, the multinomial distribution was reformulated in terms of Pólya-Gamma variables, considering GPs with multinomial observations [29]. Finally, [19] proposed a stochastic variational approach to GP classification using an augmented model based on Pólya-Gamma variables and inducing points. Next, we consider the Pólya-Gamma trick in the context of MIL GP classification.
The Pólya-Gamma distribution PG(b, c), with b > 0 and c ∈ R, has an important property that is closely related to our problem:

(e^x)^a / (1 + e^x)^b = 2^{−b} e^{κx} ∫_0^∞ e^{−ωx²/2} PG(ω | b, 0) dω,   κ = a − b/2.   (5)

This equality ensures that the following joint density is well defined,

p(y_n, ω_n | f_n) = (1/2) e^{(y_n − 1/2) f_n} e^{−ω_n f_n²/2} PG(ω_n | 1, 0),   (6)

and that marginalizing ω_n recovers the logistic likelihood, p(y_n | f_n) = ∫_0^∞ p(y_n, ω_n | f_n) dω_n. Assuming independence between instances, we have

p(y, ω | f) = 2^{−N} exp((y − (1/2)·1)^⊤ f) exp(−(1/2) f^⊤ diag(ω) f) ∏_{n=1}^{N} PG(ω_n | 1, 0),   (7)

where 1 = [1, ..., 1]^⊤. Note that we are considering an equivalent model in which the logistic function has been removed without introducing any approximation. The augmented VGPMIL model is defined by the product of the distributions in Eq. (2), (3), (4) and (7). We will refer to this model as Pólya-Gamma Variational Gaussian Process Multiple Instance Learning (PG-VGPMIL).
Inference in PG-VGPMIL. We mimic the inference procedure followed by VGPMIL. We approximate the posterior distribution p(u, f, y, ω | T) with a variational distribution q(u, f, y, ω) = q(u) p(f | u) q(y) q(ω), where q(y) = ∏_{n=1}^{N} q(y_n) and q(ω) = ∏_{n=1}^{N} q(ω_n), minimizing the KL divergence between them. Applying [27, Eq. (10.9)], we obtain factors of the form q(u) = N(u | m, S), q(y_n) = Bernoulli(y_n | π_n) and q(ω_n) = PG(ω_n | 1, c_n), where c_n² = E_q[f_n²] and the expectation of the augmenting variable is E_q[ω_n] = θ(c_n) = tanh(c_n/2)/(2c_n). The derivation of these equations can be found in Appendix A. Note that training PG-VGPMIL boils down to iterating over the above equations. If we compare these updates with the ones from VGPMIL, we find that they are identical. We explore this in the following subsection.

On the equivalence between PG-VGPMIL and VGPMIL
We observe that the update equations for PG-VGPMIL are equivalent to those of VGPMIL. In both models, m and π_n have the same expressions, recall [14, Eq. (12), Eq. (14)]. Also, the update for ξ_n in VGPMIL is the same as that for c_n here. Finally, from the identity 1/(1 + e^{−x}) = (tanh(x/2) + 1)/2, we have that Λ = Θ, and so the covariance matrix S is also the same [14, Eq. (11)]. This phenomenon was also observed in the context of supervised classification [18,19], as a consequence of two different representations of the Hyperbolic Secant density, and it generalizes naturally to our MIL setting.
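The identity underlying this equivalence is easy to verify numerically; a quick check, with the logistic function written out explicitly:

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 101)
sigmoid = 1.0 / (1.0 + np.exp(-x))          # logistic function
tanh_form = (np.tanh(x / 2.0) + 1.0) / 2.0  # hyperbolic-tangent rewriting
# The two expressions coincide everywhere; this is what makes the
# Jaakkola-based (Lambda) and Polya-Gamma-based (Theta) matrices equal.
assert np.allclose(sigmoid, tanh_form)
```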
Namely, note that the logistic likelihood can be written as

p(y_n | f_n) = π e^{(y_n − 1/2) f_n} ϕ(f_n),

where ϕ(x) = (2π cosh(x/2))^{−1} is the Hyperbolic Secant density proposed in [21]. This density prevents us from calculating variational updates analytically. It admits two representations: as a Super Gaussian (SG) and as a Gaussian Scale Mixture (GSM), the former being a consequence of the latter [22,30]. The SG representation leads to the Jaakkola bound used in VGPMIL. The GSM representation leads to PG-VGPMIL and is obtained from Eq. (5),

ϕ(x) = ∫_0^∞ N(x | 0, ω^{−1}) (2πω)^{−1/2} PG(ω | 1, 0) dω.   (16)

These representations produce inference schemes that may appear different. However, both approaches optimize the same objective, as demonstrated in [18, Theorem 2.1] for the supervised logistic regression model. This result follows straightforwardly in our setting involving MIL and GPs.
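Both facts used here, that ϕ is a proper density and that the logistic likelihood factorizes through it, can be checked numerically (a sketch; the factorization assumed is σ(x) = π·e^{x/2}·ϕ(x) for y = 1):

```python
import numpy as np

def phi(x):
    """Hyperbolic Secant density: 1 / (2 pi cosh(x/2))."""
    return 1.0 / (2.0 * np.pi * np.cosh(x / 2.0))

# phi integrates to one (Riemann sum on a wide grid; tails are negligible).
xs = np.linspace(-60.0, 60.0, 1_200_001)
total = phi(xs).sum() * (xs[1] - xs[0])

# The logistic likelihood factorizes as sigmoid(x) = pi * exp(x/2) * phi(x).
x = np.linspace(-5.0, 5.0, 101)
sigmoid = 1.0 / (1.0 + np.exp(-x))
factorized = np.pi * np.exp(x / 2.0) * phi(x)
```

The factorization follows from σ(x) = e^{x/2}/(2 cosh(x/2)); both checks pass to numerical precision.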
Besides being responsible for the equivalence between the two approaches presented so far, the Hyperbolic Secant density ϕ plays a fundamental role in the VGPMIL variational updates. Recall that the matrix Θ in Eq. (14) is computed using the function θ, which is completely determined by ϕ, see Appendix A. As pointed out in previous works, one could try to improve the inference procedure by modifying θ [31]. In the following section, we formalize this idea and show that it corresponds to replacing ϕ by a different GSM density.

A general inference framework for the logistic observation model
The GP logistic observation model can be written as

p(y, f, u) = ∏_{n=1}^{N} π e^{(y_n − 1/2) f_n} ϕ(f_n) · p(f | u) p(u),

where p(f | u) p(u) denotes the joint distribution given by equations (2) and (3). To extend this model, we first consider a differentiable density ψ : R → R that admits a GSM representation. Next, we replace ϕ by ψ,

p_ψ(y, f, u) = Z^{−1} ∏_{n=1}^{N} π e^{(y_n − 1/2) f_n} ψ(f_n) · p(f | u) p(u),   (18)

where Z is the normalization constant,

Z = ∫ N(f | 0, K_{XX}) ∏_{n=1}^{N} ψ(f_n) ϕ(f_n)^{−1} df.   (20)

Given that ψ is bounded, the above integral is dominated by a constant times ∫ N(f) ∏_n ϕ(f_n)^{−1} df. As N(f) is a Gaussian and ϕ(f)^{−1} is analytic (as a product of analytic functions), N(f) ϕ(f)^{−1} is integrable [32]. Therefore, the integral in Eq. (20) is finite.
In our new framework, p(y | f) remains the same as in Eq. (1), but the prior distribution over f changes according to

p_ψ(f | u) ∝ p(f | u) ∏_{n=1}^{N} ψ(f_n) ϕ(f_n)^{−1}.   (21)

This means that we no longer place a GP prior on f: the prior is now a Gaussian weighted by the ratio ψ(f)ϕ(f)^{−1}. This new approach increases the flexibility of our model, since we can explore different options for ψ and select the best one for our problem. One way to accomplish this is to consider a parametric family and look for the optimal parameters.
Recall that the original model can be recovered by setting ψ = ϕ, which highlights that the new framework is an extension of the previous one.
To adapt this new framework to MIL, we multiply the joint distribution in Eq. (18) by the bag likelihood defined in Eq. (4). We will refer to this extended model as ψ Variational Gaussian Processes Multiple Instance Learning (ψ-VGPMIL).
Inference in ψ-VGPMIL. We leverage the GSM representation of ψ and perform inference in the augmented model, similarly to PG-VGPMIL.
Thus, we approximate the posterior distribution p(u, f, y | T) with a variational distribution q(u, f, y) = q(u) q(f | u) q(y), where q(f | u) is the Gaussian conditional of Eq. (3) and q(y) = ∏_{n=1}^{N} q(y_n). Our choice of q(f | u) is not optimal, but it allows us to derive tractable expressions for the variational updates. Note that q(f | u) = p(f | u) is a better approximation [33], but yields intractable expressions, since p(f | u) is no longer a Gaussian, see Eq. (21). Our approach could be improved by giving more flexibility to q(f | u), e.g. using normalizing flows [34].
The optimal expressions for q(u) and q(y) are obtained by minimizing the KL divergence between the variational distribution and the posterior. Applying [27, Eq. (10.9)], we obtain updates in which q(f_n) = ∫_{R^M} q(u) p(f_n | u) du and π = [π_1, ..., π_N]^⊤. The expression of θ(c_n) reveals why ψ must be differentiable. As in PG-VGPMIL, θ(c_n) represents the expectation of the augmenting variables. We provide more details on the derivation of these equations in Appendix A. Again, note that if ψ = ϕ, we recover the original VGPMIL model.
Kernel hyperparameter estimation in ψ-VGPMIL. The variational framework allows us to estimate the kernel hyperparameters λ. Following [35], we aim to maximize the ELBO with respect to them. This is computationally equivalent to placing a flat improper prior p(λ) ∝ const and taking the mode of the approximated posterior q(λ) as an estimate.
The objective to be maximized is the ELBO as a function of λ, which we denote J(λ); in it, µ = K_{XZ} K_{ZZ}^{−1} m is the mean of q(f). We optimize J(λ) using gradient ascent for a fixed number of iterations, approximating the intractable expectations it contains using Monte Carlo sampling [36].
Note that Z depends on the kernel hyperparameters through N(f). Since computing log Z requires sampling from N(f), which is costly, we decide to further approximate those samples using Random Fourier Features [37].
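A sketch of the Random Fourier Features idea, using the standard Rahimi-Recht construction for an RBF kernel k(x, x') = v·exp(−‖x − x'‖²/(2l²)) (the paper's exact implementation may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def rff_map(X, n_features, v=1.0, l=1.0):
    """Random feature map z such that z(x) @ z(x') approximates the RBF kernel."""
    D = X.shape[1]
    W = rng.normal(0.0, 1.0 / l, size=(D, n_features))   # spectral samples
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)   # random phases
    return np.sqrt(2.0 * v / n_features) * np.cos(X @ W + b)

X = rng.normal(size=(50, 3))
Z_feat = rff_map(X, n_features=5000)
K_approx = Z_feat @ Z_feat.T
K_exact = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1) / 2.0)
err = np.abs(K_approx - K_exact).max()   # shrinks as n_features grows
```

Approximate samples of f can then be drawn cheaply as f(x) ≈ z(x)^⊤ w with w ∼ N(0, I), avoiding the O(N³) cost of exact Gaussian sampling.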
It is worth mentioning that the hyperparameter estimation procedure was not implemented in the original VGPMIL. Thus, our work strictly generalizes the model proposed in [14].
The training procedure of ψ-VGPMIL is detailed in Algorithm 1. Each iteration updates the variational parameters using Eq. (25)-(28), and then updates the kernel hyperparameters by optimizing the objective in Eq. (29).

Algorithm 1 proceeds as follows: (1) initialize the locations of the inducing points Z and compute the kernel matrices; (2) initialize the components of m and S to random values drawn from N(0, 1), and the components of π to random values drawn from Uniform(0, 1); then iterate the variational updates and the kernel hyperparameter optimization. The output consists of the kernel hyperparameters λ and the distributions q(u) and q(y) = ∏_{n=1}^{N} q(y_n), as in Eq. (23) and (24).

Making predictions in ψ-VGPMIL

Given a new bag of instances, we compute the GP predictive distribution of the latent values f* by integrating p(f* | u) against q(u). To compute the distribution of a test label y*_n, we take the expectation of the logistic likelihood under this predictive distribution, Eq. (31), which allows us to make instance-level predictions having trained the model exclusively with bag labels. As the expectation cannot be calculated in closed form, we approximate it by sampling from the GP predictive distribution,

p(y*_n = 1 | T) ≈ (1/L) ∑_{l=1}^{L} σ(f*_{nl}),   σ(t) = 1/(1 + e^{−t}).

To obtain the bag label, we apply the MIL hypothesis, T* = max_n y*_n. Again, we estimate the resulting expectation using samples from the GP predictive distribution, where {f*_{n1}, ..., f*_{nL}} ∼ p(f*_n | T) for each n ∈ {1, ..., N*}.
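The sampling-based prediction step can be sketched as follows (hypothetical predictive means and variances; the probabilities are Monte Carlo estimates in the spirit described above):

```python
import numpy as np

rng = np.random.default_rng(1)

def predict(mu_star, var_star, L=1000):
    """Monte Carlo instance and bag predictions from the GP predictive
    marginals (mu_star, var_star) of one test bag.
    Instance: p(y*_n = 1) ~ mean of sigmoid over latent samples.
    Bag: MIL hypothesis T* = max_n y*_n, estimated from the same samples."""
    N = len(mu_star)
    f = rng.normal(mu_star[None, :], np.sqrt(var_star)[None, :], size=(L, N))
    p_inst_samples = 1.0 / (1.0 + np.exp(-f))      # (L, N)
    p_instance = p_inst_samples.mean(axis=0)       # per-instance probability
    y = rng.random(p_inst_samples.shape) < p_inst_samples  # Bernoulli draws
    p_bag = y.any(axis=1).mean()                   # P(at least one positive)
    return p_instance, p_bag

mu = np.array([-3.0, -3.0, 3.0])   # hypothetical predictive means
var = np.array([0.1, 0.1, 0.1])
p_instance, p_bag = predict(mu, var)
# The third instance is confidently positive, so the bag is predicted positive.
```

The spread of the sampled probabilities also yields the standard deviations reported in the figures.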

Gamma variables for GP-MIL
As explained in the previous section, new models that generalize VGPMIL can be obtained by replacing the Hyperbolic Secant density with a different density that also admits a GSM representation. In this section, we focus on a concrete realization of ψ-VGPMIL, which uses Gamma variables and will prove to work better in practice.
Our motivation to use Gamma variables is as follows. A Pólya-Gamma variable ω ∼ PG(1, 0) is defined as an infinite weighted sum of independent Gamma variables [20, Definition 1],

ω = (2π²)^{−1} ∑_{m=1}^{∞} g_m / (m − 0.5)²,   g_m ∼ Gamma(1, 1).

We alternatively consider g_m = g ∼ Gamma(1, 1) for every m, which leads to ω = g/4 and, therefore, ω ∼ Gamma(1, 4). Thus, we replace the Pólya-Gamma density in Eq. (16) by a Gamma density with parameters α and β, obtaining

ψ(x) = Z(α, β)^{−1} ∫_0^∞ N(x | 0, ω^{−1}) (2πω)^{−1/2} Gamma(ω | α, β) dω,   (38)

where Z(α, β) is the normalization constant. It is worth mentioning that although this constant can be calculated, it is not needed to carry out the updates in Algorithm 1. Thus, we arrive at a concrete realization of ψ-VGPMIL, which we call Gamma Variational Gaussian Processes Multiple Instance Learning (G-VGPMIL).
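The collapse from the Pólya-Gamma sum to a single Gamma variable preserves the mean, which can be checked numerically (the truncated series below follows [20, Definition 1]):

```python
import numpy as np

# PG(1,0): omega = (1/(2 pi^2)) * sum_m g_m / (m - 0.5)^2, g_m ~ Gamma(1, 1).
# Since E[g_m] = 1, the mean of omega is the weighted sum of the coefficients:
# E[omega] = (1/(2 pi^2)) * sum_m 1/(m - 0.5)^2 = (1/(2 pi^2)) * (pi^2/2) = 1/4.
m = np.arange(1, 200_001)
pg_mean = (1.0 / (2.0 * np.pi ** 2)) * np.sum(1.0 / (m - 0.5) ** 2)

# Collapsed version: g_m = g for all m gives omega = g/4 ~ Gamma(shape=1, rate=4),
# whose mean is also 1/4 -- the two augmentations match in expectation.
gamma_mean = 1.0 / 4.0
```

Intuitively, G-VGPMIL keeps the first moment of the Pólya-Gamma augmentation while gaining the extra flexibility of the (α, β) parameters.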

Results
In this section, we carry out an empirical validation of the newly formulated G-VGPMIL model by means of three different experiments. Table 1 summarizes the five datasets that we use. First, using an MIL version of the well-known MNIST dataset, we perform a controlled and visual experiment to understand the behavior of our method at both the instance and bag levels. Second, we employ two classical benchmark datasets for MIL algorithms, the MUSK1 and MUSK2 datasets. Third, we tackle the real-world medical problem of ICH detection, showing enhanced performance against state-of-the-art approaches. We compare VGPMIL (Algorithm 1 with ψ = ϕ) and G-VGPMIL (Algorithm 1 with ψ as in Eq. (38)). To ensure a fair comparison, both models were trained with identical set-ups, initial parameters, and grid searches in every experiment. Both models have been implemented in JAX, and will be available at https://github.com/Franblueee/psi-VGPMIL upon the acceptance of the paper.
Table 1: Label distribution at the instance and bag level for each dataset considered.

MNIST
Overview. In this section we analyze the behavior of the newly proposed G-VGPMIL in a controlled environment. To this end, we transform the MNIST dataset [38] into an MIL one. This experiment allows us to understand the performance of our method at both the instance and bag level.
The results show that G-VGPMIL performs at least as well as the state-of-the-art VGPMIL [14], while taking less time to complete the training.
Dataset description. The MNIST dataset consists of 70000 images (60000 for training and 10000 for testing) of handwritten digits. We choose the digits 2 and 9 to be the positive class, while the rest of the digits belong to the negative class. This way, a bag is positive if it contains at least a 2 or a 9, and negative otherwise. We randomly group the digits into bags of 10 instances each, ensuring that there is a balanced distribution of positive and negative bags, with each positive bag containing 1 to 4 positive instances.
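A simplified sketch of this bag construction (random grouping only; the paper's version additionally balances positive and negative bags and caps the number of positives per bag at four):

```python
import numpy as np

rng = np.random.default_rng(3)

def make_bags(y_instance, bag_size=10):
    """Group instance labels into random fixed-size bags and derive bag
    labels under the MIL assumption (bag label = max of instance labels)."""
    idx = rng.permutation(len(y_instance))
    n_bags = len(idx) // bag_size
    bags = idx[: n_bags * bag_size].reshape(n_bags, bag_size)
    T = y_instance[bags].max(axis=1)
    return bags, T

# Hypothetical instance labels: digits 2 and 9 form the positive class.
digits = rng.integers(0, 10, size=1000)
y = np.isin(digits, [2, 9]).astype(int)
bags, T = make_bags(y)
```

Only `bags` (instance indices) and `T` (bag labels) are given to the MIL model; `y` is held out for instance-level evaluation.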
The resulting dataset has 8774 positive instances, 61226 negative instances, and 7000 bags. Figure 2 shows two of the generated bags, as well as the predictions obtained by G-VGPMIL. We also show the standard deviation for each prediction, which was calculated from equations (33) and (37).
We consider two versions of this dataset. In the first one, called MNIST RAW, we use all the 28×28 = 784 features of the instances without any kind of preprocessing. In the second one, called MNIST PCA, we reduce the dimensionality of the instances using Principal Component Analysis.
Result 2: instance level results and other metrics. In the previous result, we focused on the Bag AUC metric. Now, we analyze the AUC, Accuracy and F1-score performance at bag and instance levels. This is especially important given the high imbalance of the instance labels (61226 negatives vs 8774 positives, recall Table 1). In Figure 4 we collect the metrics corresponding to the best model in each scenario. These metrics are also available in Tables 4 and 5. In MNIST PCA there are few differences between VGPMIL and G-VGPMIL, while in MNIST RAW the differences are significantly higher in favor of G-VGPMIL, at both the instance and bag level.
Result 2: other bag level metrics. Remember that in the MUSK1 and MUSK2 datasets we do not have the instance level labels (recall Table 1), so we cannot assess the performance at this level. However, we analyze other bag level metrics to ensure that our method performs correctly in all aspects. For MUSK2 the F1 metric is very important given the imbalance between positive bags and negative bags (see Table 1). Figure 6 shows the metrics corresponding to the best model for each of the datasets (also available in Tables 6 and 7). Observe that G-VGPMIL attains better values than VGPMIL, especially in MUSK2. The superiority of G-VGPMIL is consistent with the behavior we observed in MNIST: our approach performs better when dealing with high-dimensional data, which is exactly the case here.

Intracranial hemorrhage detection (RSNA, CQ500)
Overview. So far, we have focused on understanding the behavior of G-VGPMIL (we have compared only against VGPMIL) and we have worked with two relatively easy problems.
Here, we consider a more complex real-world medical problem: detecting ICH from brain CT scans, which requires a more involved feature extraction process, explained below. Moreover, we will provide a wider comparison against various state-of-the-art approaches that have been used for this same task. In summary, we will see that G-VGPMIL achieves very good results in predictive performance, efficiency and stability.
Dataset description. We will use two datasets to train and evaluate our method: the RSNA dataset² and the CQ500 dataset [6]. The latter, which was acquired from different institutions in New Delhi, India, is used as an independent test set.
Preprocessing. We follow the same approach as in [15]. To imitate the way radiologists read CT images, we apply three windows to each CT slice to enhance the display of the brain, blood, and soft tissue. These three windows are concatenated to form a three-dimensional matrix and normalized to [0, 1] (see Figure 7).
² https://www.kaggle.com/competitions/rsna-intracranial-hemorrhage-detection/
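The windowing step can be sketched as follows. The (center, width) values below are typical radiology settings for brain, blood/subdural, and soft-tissue windows, not necessarily those used in [15]:

```python
import numpy as np

def apply_window(hu, center, width):
    """Clip Hounsfield units to a window and normalize to [0, 1]."""
    lo, hi = center - width / 2.0, center + width / 2.0
    return (np.clip(hu, lo, hi) - lo) / (hi - lo)

def three_window(hu_slice):
    """Stack brain / blood / soft-tissue windows into one 3-channel image.
    Window values are illustrative assumptions, not the paper's exact ones."""
    windows = [(40, 80), (80, 200), (40, 380)]
    return np.stack([apply_window(hu_slice, c, w) for c, w in windows], axis=-1)

slice_hu = np.full((512, 512), -1000.0)   # hypothetical air-filled slice
img = three_window(slice_hu)
```

Each channel emphasizes a different tissue range, mimicking how a radiologist toggles display windows while reading a scan.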
Feature extraction for the ICH problem. In Sections 2, 3 and 5 we have discussed different GP-based MIL methods, including our novel G-VGPMIL. Although these methods can be directly applied to tabular data, in the case of medical images (and, more generally, high-dimensional or highly structured data) it is convenient to extract a few meaningful features as a previous step. For the ICH detection problem we use an attention-based CNN that has been employed in previous work [15,17]. The architecture is named AttCNN and corresponds to the composition of three functions; the interested reader may consult [15] for details. The initial kernel hyperparameters are (v, l) = (0.5, D), where D ∈ {8, 32, 128} is the number of features extracted. We perform a grid search considering {50, 100, 200} for the number of inducing points, {0.5, 1.0} for the α hyperparameter, and {1.0, 2.5, 4.0} for the β hyperparameter. We use the same five train-test splits that were used in [15] and [17]. To guide the training process we monitor the AUC metric on a validation subset and halt the training when it has not improved for ten epochs (an epoch is a complete update of the variational parameters). Based on the Bag Accuracy score, we select the best model and report the corresponding cross-validation metrics.

Training and testing on RSNA
As we have already mentioned, we first extract relevant features using AttCNN for D ∈ {8, 32, 128}. With these features, we train the different configurations of VGPMIL and G-VGPMIL. In this section, we report and analyze the results when testing on the RSNA test splits.
Result 2: robustness to other metrics. Once we have examined the behavior of our method using the AUC metric, we now focus on other classification metrics. As before, the metrics corresponding to the best model of each type (according to the Bag AUC metric) are collected in Figure 9.
Notice that we are also considering AttCNN as a baseline. As expected, AttCNN performs much worse than the probabilistic solutions based on GPs. Clearly, G-VGPMIL is the most effective method, being closely followed by VGPMIL. This is also true for each of the values of D we have considered (see Table 8).
Result 3: an example of a scan prediction. Figure 10 shows an example of how our method predicts a positive scan from RSNA. As in Figure 2, the standard deviation is calculated via sampling. Our method assigns a high probability to almost all positive instances. When the presence of hemorrhage is clear, uncertainty levels (standard deviation) are very low.
In slices where there is hardly any visible hemorrhage, the uncertainty is higher. Also, as a considerable number of slices are predicted as positive, the uncertainty at the bag level is almost zero. Note that, in addition to the prediction, the associated uncertainty is a very important piece of information for the user. This exemplifies the type of information that our model can provide to radiologists.
Result 4: comparison with the state of the art. We complete the analysis of the RSNA dataset by comparing our method to other approaches in the literature. In Table 2 we collect information about other methods in the literature: the size of the dataset they use, a brief description of the method, and the Bag AUC obtained when evaluating on their test set.
In our case, we report the metrics obtained when evaluating on RSNA. Our method outperforms the others, although it uses very few scans (only [39] uses fewer) and a relatively simple architecture (compared to 3D CNNs and Autoencoders). Also, observe that the GP-based models (VGPMIL, DGPMIL, G-VGPMIL) obtain the best performance.

Evaluation on CQ500
To finish, we evaluate our recently trained models using an external database, CQ500. We discuss how well our model generalizes to examples never seen before and of a different nature from those it was trained with. The results suggest that the generalization ability of AttCNN is worse than that of the models built upon GPs.
Result 3: comparison with the state of the art. Finally, we compare our approach with other methods using the Bag AUC metric on the CQ500 dataset, see Table 3. Among the methods that use only bag labels to train the models [44,15,17], our method obtains the highest score. It is also highly competitive with those approaches that are trained using the slice labels [45,46]. Notice that using instance-level information turns this problem into a much easier one, but it requires additional effort from the radiologists. Our model is able to compete with those methods in a less demanding manner.

Conclusion
Motivated by the problems associated with the logistic observation model in the GP-based MIL formulation, we have reformulated VGPMIL using Pólya-Gamma random variables. We found that this new model, PG-VGPMIL, leads to the same update equations as VGPMIL. This is a consequence of the two equivalent representations that the Hyperbolic Secant density admits. This reveals that VGPMIL/PG-VGPMIL is a realization of a more general framework, ψ-VGPMIL, which is obtained by replacing the Hyperbolic Secant density with a general GSM density ψ.
An interesting challenge that arises is the choice of a convenient density.
In this work, we have explored the natural choice of the Gamma distribution, arriving at the newly proposed G-VGPMIL. Our experiments show that G-VGPMIL improves upon VGPMIL in terms of predictive performance and training time. G-VGPMIL shows competitive results with fully supervised models, thus closing the gap with them. As the features used to train G-VGPMIL may not be optimal (they have been extracted in two phases), we believe that there is still room for improvement.
Another interesting challenge is the extension of our model to a multiclass problem, where one has to deal with the intractable term that the softmax function introduces. Following the line of our work, we would need to express the partition function of the softmax as a GSM, which is not obvious and requires further investigation.
Finally, we believe that this work provides interesting ideas that can be useful beyond MIL and leaves open questions of both theoretical and practical nature. We hope that our ideas will be useful to the rest of the community and can enhance further research.

2.1. Problem formulation: Multiple Instance Learning

We consider a dataset {(x_n, y_n) : n ∈ {1, …, N}} ⊂ R^D × {0, 1}, which consists of the training instances x_n and their binary labels y_n. We collect this set in a matrix X = [x_1, …, x_N] ∈ R^{D×N} and a vector y = [y_1, …, y_N]^⊤ ∈ {0, 1}^N. In the MIL scenario, the instance labels y_n are not observed. Instead, the instances are grouped into bags and only the maximum of the labels within a bag is observed. Formally, the index set {1, …, N} is partitioned into B non-overlapping bags {Bag_1, …, Bag_B}, and the operator {·}_b refers to the elements in bag b, so {y}_b = {y_i : i ∈ Bag_b}. The operator {·}_{b\n} refers to the elements in bag b except the n-th one, so {y}_{b\n} = {y}_b \ {y_n}. Finally, for a bag b, the only observed label is T_b = max {y}_b. The goal is to predict the label of any new bag, as well as the labels of the instances it contains. Notice that the ICH detection problem presented in the introduction can be cast as an MIL one: a full scan is treated as a bag and its slices as instances. Only bag labels T_b are observed, while slice labels y_n are unobserved. A CT scan presents hemorrhage (positive class) if at least one of its slices shows evidence of hemorrhage; a negative scan contains only normal slices.
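The bag labeling rule above, T_b = max {y}_b, can be sketched in a few lines; the instance labels and bag partition below are illustrative toy data:

```python
import numpy as np

# Illustrative instance labels y_n and a partition of the
# index set {0, ..., N-1} into non-overlapping bags.
y = np.array([0, 0, 1, 0, 0, 0, 1, 1])
bags = [np.array([0, 1, 2]),   # Bag 1
        np.array([3, 4, 5]),   # Bag 2
        np.array([6, 7])]      # Bag 3

# Observed bag labels: T_b is the max of the instance labels in bag b,
# i.e. a bag is positive iff it contains at least one positive instance.
T = np.array([y[idx].max() for idx in bags])
print(T)  # [1 0 1]
```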

Figure 2: MNIST bags and G-VGPMIL predictions (probability of being positive). Positive instances are highlighted with a red frame.

Figure 3: Training time vs AUC on the MNIST dataset.

Figure 4: Bag-level and instance-level performance on MNIST.
our method. The first one was published by the Radiological Society of North America (RSNA)² in 2019. It includes a total of 39750 slices from 1150 patients. The slice (instance) labels are known: there are 5782 abnormal (positive) and 33968 normal (negative) slices. Regarding the scans (bags), there are 483 abnormal (positive) and 667 normal scans. The slices are of size 512 × 512, and the number of slices per scan varies from 24 to 57. The results for this dataset are shown in Subsection 6.3.1.
The second one, CQ500, contains 193317 slices and 491 bags, with labels only at the bag level (205 abnormal and 286 normal). The number of slices per scan varies from 16 to 128. The results for this dataset are shown in Subsection 6.3.2.

Result 1: competitive performance and training time.

Figure 8 shows an AUC vs time plot analogous to those discussed in previous experiments. For each value of the number of features (D) and inducing points, we report the corresponding training time and AUC.

Figure 8: Training time vs AUC on the RSNA dataset.

Figure 9: Results obtained by the best model of each type on the RSNA dataset.

Figure 10: RSNA scan and G-VGPMIL predictions (probability of being positive). Positive slices are highlighted with a red frame.

Result 1: better performance with less training time. Figure 11 shows a plot similar to that of RSNA, now focusing on the Bag AUC in the CQ500 dataset. G-VGPMIL is always above VGPMIL, which indicates better performance. Also, for D ∈ {8, 128}, G-VGPMIL appears to the left, which implies reduced training time.

Figure 11: Training time vs AUC on the CQ500 dataset.

Result 2: other classification metrics. As in the case of RSNA, we report other classification metrics at the bag level in Table 9. For each value of D, G-VGPMIL obtains the best value. In general, the gap between G-VGPMIL and VGPMIL widens as the number of features grows.

Figure 12 shows the metrics of the best model of each type (selected according to Bag AUC). G-VGPMIL and VGPMIL remain competitive (with G-VGPMIL one step ahead), while the top performance of AttCNN drops significantly compared to the behavior observed in RSNA.

Figure 12: Results obtained by the best model of each type on the CQ500 dataset.
[45] … validation of deep learning algorithms for detection of critical findings in head CT scans, Lancet (2018) 2388-2396.
[46] N. T. Nguyen, D. Q. Tran, N. T. Nguyen, H. Q. Nguyen, A CNN-LSTM architecture for detection of intracranial hemorrhage on CT scans, Medical Imaging with Deep Learning (2020).

Table 2: Comparison of different approaches for binary ICH detection. VGPMIL, DGP-MIL and G-VGPMIL results are obtained using the RSNA dataset for training and testing.

Table 3: Performance of different approaches on the CQ500 dataset. VGPMIL, DGPMIL2 and G-VGPMIL use the RSNA dataset for training.