Bayesian Convolutional Neural Network-based Models for Diagnosis of Blood Cancer

ABSTRACT Deep learning methods allow computational models involving multiple processing layers to discover intricate structures in data sets. Image classification is one problem where these methods have proved very useful. Although different approaches have been proposed in the literature, this paper illustrates a successful implementation of a Bayesian Convolutional Neural Network (BCNN)-based classification procedure to classify microscopic images of blood samples (lymphocyte cells) without manual feature extraction. The data set contains 260 microscopic images of cancerous and noncancerous lymphocyte cells. We experiment with different network structures and select the model that returns the lowest error rate in classifying the images. Our models not only produce high accuracy in classifying cancerous and noncancerous lymphocyte cells but also provide useful information about the uncertainty in their predictions.


Introduction
Acute lymphoblastic leukemia (ALL) is a cancer of the lymphoid line of blood cells characterized by the development of large numbers of immature lymphocytes (NCI 2018). An excessive number of lymphocytes hampers the activity of other blood components, i.e., red blood cells, platelets, and other white blood cells, and the disease is typically fatal within weeks or months if left untreated. In 2015 alone, around 876,000 ALL cases were reported worldwide and almost 111,000 patients did not survive, of whom two-thirds were aged below 5 (Vos et al. 2016). To ensure timely treatment (and so increase the chances of recovery), it is vital to detect ALL at an early stage.
In the diagnosis of ALL, one method of initially detecting symptoms of cancerous lymphocyte cells is a complete blood count and, particularly, the inspection of the white blood cells present in peripheral blood samples (NCI 2018). In some cases, hematological analyzers provide quantitative data about hematological parameters, but they are not able to determine patient symptoms. A significant number of lymphoblasts, B lymphocytes, and T lymphocytes in the peripheral blood is a probable indication of leukemia (Hunger, Mullighan, and Longo 2015). If too many lymphocytes are observed, a morphological bone marrow smear analysis is performed by a pathologist using a microscope to determine the manifestation of cancer. Determining whether a lymphocyte in a blood sample is associated with leukemia is an important part of diagnosis, and a proper identification of cancerous lymphocytes therefore assists in a better prognosis. Screening reports on microscopic blood samples produced by a human can be subjective, depending on factors such as the examiner's experience, age, mental state, and exhaustion. The assistance of an automated system could therefore be a vital tool to avoid errors in such critical situations.
Machine learning techniques are now used extensively in various healthcare applications for disease diagnosis and patient monitoring (see, e.g., Angermueller et al. (2016); Butler et al. (2018); Li, Gao, and D'Agostino 2019). In recent years, deep learning approaches have joined hands with classical machine learning techniques and gained considerable attention in improving the state of the art in speech recognition, visual object recognition, object detection, drug discovery, and genomics (see, e.g., Li et al. (2020); Steingrimsson and Morrison (2020); LeCun, Bengio, and Hinton 2015). The convolutional neural network (CNN) is one of the most widely used deep learning mechanisms; it removes the need for manual feature extraction, as features are learned directly from the image and convolved with the input data for classification. One major limitation of the CNN is that it requires huge amounts of data for regularization and quickly overfits on small data sets. However, by placing a probability distribution over the CNN's kernels, the resulting Bayesian Convolutional Neural Network (BCNN) provides an alternative solution that is not only robust to overfitting but also offers uncertainty estimates and can easily learn from small data sets. The advantage of such an approach is not only an effective classification of images but also an estimate of the predictive uncertainty of each prediction made by the model. Thus, a model that can effectively classify the microscopic images with confidence bounds would greatly assist in a better diagnosis of ALL.
During the last few years, researchers have proposed different classification models to facilitate the diagnostic process of ALL. Several attempts have been made to classify lymphoblasts via image classification or feature extraction. In this article, we implement the BCNN approach, originally proposed by Gal and Ghahramani (2016a), and investigate the strength of this model in accurately classifying the infected cells with a low error rate and high certainty. Whereas classical CNNs require huge databases for accurate classification, the Bayesian CNN is far less sensitive to data set size. Moreover, due to its Bayesian structure, the resulting model offers a mathematical framework to reason about model uncertainty and is robust to overfitting. The employed method not only predicts the class of the images but also provides valuable uncertainty information that, to the best of our knowledge, has not been attempted in earlier studies in this area. The intention is to combine the experience and skill set of a doctor with the potential of AI for better ALL diagnosis. Since we are dealing with microscopic images, which can be complex even for an expert, the convolution operations performed by a computer may produce better classification accuracy.

Literature
Research in medical imaging involves the segmentation and classification of images to delineate diseased tissue and areas of interest for different body parts. Image segmentation is usually done in two phases: the first phase involves the detection of unhealthy tissue, while the delineation of different anatomical structures or areas of interest is done in the second phase. Segmenting objects of "interest" from a noisy and complex image is a difficult task and requires extra care because the rest of the analysis revolves around it (Husham et al. 2016). As for classification, neural networks, most particularly the CNN, have recently been among the popular choices due to their strength in robust classification (Iqbal et al. 2018). One such study is Masud, Eldin Rashed, and Hossain (2020), where the authors implemented convolutional neural network-based models to facilitate the diagnostic process of breast cancer using images. In Bardou, Zhang, and Ahmad (2018), CNN-based models are compared with handcrafted-feature-based classification methods (such as the k-NN and SVM) for lung sound classification, and it is reported that the former outperformed the handcrafted-feature-based classifiers. In a similar study by Zang et al. (2020), an optimal convolutional neural network (CNN) was successfully proposed for the early detection of skin cancer.
In a similar spirit, existing studies classifying cancerous and noncancerous lymphocyte cells rely heavily on image analysis techniques and feature extraction methods. Most of these studies involve a two-step procedure: first, features are extracted from the images, and then each image is classified based on these features. In Amin et al. (2015) and MoradiAmin et al. (2016), a computer-based method for the classification of cancerous and noncancerous cells is implemented to conveniently detect acute lymphoblastic leukemia. Microscopic images are obtained from blood and bone marrow smears of patients with and without acute lymphoblastic leukemia. After image preprocessing, cell nuclei are segmented by k-means and fuzzy c-means clustering algorithms. Geometric and statistical features are then extracted from the nuclei and, finally, the cells are classified into cancerous and noncancerous by means of a support vector machine classifier with 10-fold cross-validation.
The counting and classification of blood cells allow for the evaluation and diagnosis of a vast number of diseases, including the detection and classification of hematological diseases such as sickle cell anemia and acute lymphoblastic leukemia (ALL) (see, e.g., Das et al. (2020)). Through an image processing technique, in another study by Putzu, Caocci, and Di Ruberto (2014), a complete and fully automated method for WBC identification and classification using microscopic images is proposed to support the recognition of ALL. In total, 33 images acquired from the same camera under the same illumination conditions were used in the analysis. In a similar study by Singh, Bathla, and Kaur (2016), computer-aided feature extraction, selection, and cell classification methods are implemented to recognize and differentiate normal lymphocytes from abnormal lymphoblast cells in images of peripheral blood smears.
In another study, Mohapatra, Patra, and Satpathy (2014) proposed, via computer-aided screening of ALL, a quantitative microscopic approach toward discriminating lymphoblasts (malignant) from lymphocytes (normal) in stained blood smear and bone marrow samples. The performance of the extracted features was then tested with five standard classifiers, namely the naive Bayes, KNN, MLP, RBFN, and SVM, and the best overall accuracy of 94.73% was achieved with the proposed multiple-classifier system. In a similar spirit, in Rawat et al. (2017), the authors address the problem of segmenting a microscopic blood image into different regions for localization. The localized immature lymphoblast cells are then analyzed via different geometrical, chromatic, and statistical texture features for the nucleus as well as the cytoplasm, together with pattern recognition techniques for subtyping immature acute lymphoblasts. In total, 260 microscopic blood images (i.e., 130 normal and 130 cancerous cells) taken from the ALL-IDB database are used in the study, and the classification is done via the SVM and its variants, the k-nearest neighbor classifier, and the probabilistic neural network, among others. The proposed method performed remarkably well in distinguishing between normal and cancerous cells, with an accuracy of almost 94% as reported in the paper. Similarly, in Das et al. (2020) the extraction of lymphocytes is accomplished by a color-based k-means clustering technique, where features such as shape, texture, and color are extracted from the segmented image, and the SVM with a radial basis function kernel is then employed to classify white blood cells. Another approach that has recently gained attention, especially in the field of (medical) image processing, is transfer learning, because of its superior performance on small databases. More recently, Das and Meher (2021b) (see also Liu et al. (2018)) proposed transfer learning-based models in the feature extraction stage by introducing fully connected layers and/or dropout layers in the ResNet50 architecture.
Since classical CNNs require huge databases for accurate classification, several new solutions have been proposed recently to mitigate this issue. Most particularly, in Das and Meher (2021a) an efficient deep CNN framework is proposed to address the issue of low data dimensionality by introducing depthwise separable convolutions and a linear bottleneck architecture, achieving 97.18% accuracy on the ALL-IDB2 data set. In more recent developments aiming to counter the issue of low-dimensional data, Genovese et al. (2021a) proposed a machine learning-based approach that enhances blood sample images by an adaptive unsharpening method. The method uses image processing techniques and deep learning to normalize the radius of the cell, estimate the focus quality, and adaptively improve the sharpness of the images prior to training and classification, and it obtains a classification accuracy of 96.84%. In another study, Genovese et al. (2021b) proposed a method based on histopathological transfer learning for ALL detection to counter the limited dimensionality of ALL databases. The proposed approach, which attained up to 98% classification accuracy, first trains a CNN on a histopathology database to classify tissue types and then performs fine-tuning on the ALL database to detect the presence of lymphoblasts.

Bayesian convolutional neural networks
Various articles published in recent years have shown that CNNs and other deep learning-based approaches are at the forefront of medical image segmentation and analysis tasks. Stochastic gradient-based optimization (Hinton, Srivastava, and Swersky 2012b) can be seen as the driving force of deep learning and has become the workhorse of these approaches. It is a variant of classical gradient descent in which the "stochasticity" comes into play because a random subset of the measurements is used to compute the gradient at each descent step. Moreover, stochastic gradient-based optimization can deal with the highly nonconvex loss functions that often appear in training deep networks for classification, through its implicit regularization effects. It thereby provides an automatic way of extracting the optimal features used for segmentation and classification tasks. Manual feature engineering is a cumbersome and error-prone process, and the deep learning optimization mechanism relieves the researcher from it. During manual feature engineering, it is very easy to come up with irrelevant or semirelevant features, due to a lack of knowledge or domain expertise, or to design features that cause model overfitting (Iqbal et al. 2018).
In this section, we briefly outline the methodology behind the considered approach. In the CNN, the convolution operation is performed on the input data with a feature detector (kernel). Each feature detector returns a feature map, and each of these maps captures a key feature of a specific data point. Together, the feature maps form a convolutional layer that summarizes the presence of features in an input image. An activation function, such as the Rectified Linear Unit (ReLU), is applied to capture the nonlinear features of the data. The output feature maps are usually sensitive to the location of the features in the input. Down-sampling or pooling techniques address this sensitivity and help reduce redundancy in the features while still grasping the key properties of the data. These multi-dimensional pooled feature maps are then converted into one-dimensional arrays by flattening the pooled layers, and the arrays are fed to fully connected neural networks. Up to this stage, we have transformed an image into a combination of vectors of numbers. Now, we initialize the network by assigning random weights to each connection. The data are fed to the succeeding layer for further transformation, and the network continues this procedure until it reaches the output layer and makes a prediction. It then estimates the error and backpropagates the error information. During backpropagation, the network measures the contribution of the kernels and weights to the loss function by stochastic gradient descent. Finally, the network adjusts the kernels as well as the weights and repeats these steps for a specific number of iterations, minimizing the error function L at each iteration. In Figure 1, the learning mechanism is presented graphically.
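The convolution, ReLU, and pooling operations just described can be sketched in plain numpy. This is an illustration only, not the paper's actual R/Keras implementation; the image and kernel sizes are hypothetical:

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2-D convolution (CNN convention) of a single-channel image."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Rectified Linear Unit: zero out negative activations."""
    return np.maximum(x, 0.0)

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: keep the largest value in each window."""
    oh, ow = fmap.shape[0] // size, fmap.shape[1] // size
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = fmap[i*size:(i+1)*size, j*size:(j+1)*size].max()
    return out

image = np.random.rand(8, 8)     # toy single-channel "image"
kernel = np.random.randn(3, 3)   # one 3x3 feature detector
pooled = max_pool(relu(conv2d(image, kernel)))
print(pooled.shape)              # (3, 3): 8-3+1 = 6, then 6 // 2 = 3
```

A full CNN stacks many such kernels per layer and learns their entries by backpropagation; here the kernel is random purely to show the shape transformations.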
Figure 1. The output layer returns an estimated value of y_i, i.e., ŷ_i. We then estimate the error, backpropagate the error information through the network, and adjust the weights as well as the kernels based on their contribution to the loss function L. The contribution to the loss is determined by mini-batch stochastic gradient descent. We repeat this process for a specific number of epochs/iterations until the loss function L is minimized.

Note that traditional CNNs are prone to overfitting unless a sufficiently big data set is available. However, this problem can be tackled by implementing the Bayesian approach in the CNN setup, originally proposed by Gal and Ghahramani (2016a). With this implementation, the model reduces overfitting even on small data while still providing uncertainty estimates in CNNs (Gal and Ghahramani 2016a). The usual Bayesian NNs offer a probabilistic interpretation of deep learning models by inferring distributions over the models' weights. However, modeling with a prior distribution over the kernels (such as in the context of a CNN) had never been attempted successfully until recently, by Gal and Ghahramani (2016a).
Given the data set X = {x_i} and its corresponding label set Y, in a classical setting the CNN maps the input x_i to the output using the set of weights ω. The resulting model can therefore be seen as a probabilistic model, where the output p(y | x, ω) is a categorical distribution. By placing a prior p(ω) over the kernels, the CNN can be converted into a Bayesian CNN. After setting the prior distribution over the kernels, the posterior distribution becomes

p(ω | X, Y) = p(Y | X, ω) p(ω) / p(Y | X).    (1)

The presence of the normalizing constant p(Y | X) in Equation (1) makes it difficult to estimate the posterior distribution of the kernels, and therefore Variational Inference (VI) is called on for an approximation. With the VI technique, the true posterior is approximated with a rather simpler variational distribution q(ω | θ) by minimizing the Kullback-Leibler (KL) divergence between the two. Upon obtaining the approximate posterior distribution by VI, we can estimate the predictive distribution using the approximate posterior together with Monte Carlo (MC) integration as

p(y* | x*, X, Y) ≈ (1/T) Σ_{t=1}^{T} p(y* | x*, ω_t),    (2)

where ω_t is a sample of parameters drawn from the approximate posterior distribution q(ω | θ). Traditional Bayesian NN models use the Gaussian distribution as the variational distribution, which makes the estimation process computationally expensive due to the high number of model parameters, without contributing to an improvement in model performance (Blundell et al. 2015). To address this, we follow the direction of Gal and Ghahramani (2016a), who instead proposed the Bernoulli distribution as the variational distribution, which requires no additional parameters for approximating the true posterior. This in turn makes the estimation of the posterior distribution less computationally expensive and produces coherent outcomes. Another advantage of Bayesian models is that they offer a mathematical framework to reason about model uncertainty, although this normally comes at an enormous computational cost.
Dropout is a stochastic regularization technique that addresses the problem of overfitting and, therefore, reduces the computational complexity (Hinton et al. 2012a; Srivastava et al. 2014). With the dropout technique, during the training phase some units of the network are randomly ignored, which reduces the number of parameters and leaves the network with fewer parameters to fit the data. With this technique, a unit is kept in the network with probability p or omitted with probability 1 − p (see Figure 2 for an illustration). Consider an NN with L layers and cross-entropy as the loss function. At the ith layer, the weight matrix W_i has dimension K_i × K_{i−1} and the bias vector b_i has dimension K_i. We consider x_i as the input (an independent variable) and y_i as the output (a dependent variable) for i = 1, 2, …, N observations. The cross-entropy function estimates the error, i.e., the difference between y_i and ŷ_i. In general, to prevent the NN from overfitting due to a large number of parameters, we add a regularization term to the loss function. Here, we use a ridge regression or L2 regularization term in all layers, with the regularization parameter λ dictating the magnitude of the regularization (Tikhonov 1963). The optimization objective in an NN with dropout then takes the form

L_dropout := (1/N) Σ_{i=1}^{N} E(y_i, ŷ_i) + λ Σ_{l=1}^{L} (‖W_l‖² + ‖b_l‖²),

where E(y_i, ŷ_i) is the cross-entropy between the observed and predicted outputs. The same dropout mechanism can be connected to the BCNN paradigm. In fact, in Gal and Ghahramani (2016b), the authors show that training dropout networks can be formulated as approximate Bernoulli variational inference in Bayesian NNs. In this way, the implementation of the Bayesian neural network during the training phase reduces to performing dropout after every convolution layer. Due to the computational ease it provides, in our work we utilized dropout to approximate variational inference in the BCNN.

Figure 2. On the left, a fully connected neural network with two hidden layers. On the right, the network after dropping out units at the input layer as well as the hidden layers (red-circled units). The network now has fewer parameters to adapt to the data set, which forces it to learn the relationship between the independent and dependent variables more appropriately. This "appropriate" learning reduces the risk of overfitting.
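The Bernoulli dropout mechanism, and the Monte Carlo forward passes it enables when kept active at test time, can be sketched in numpy. This is a minimal one-layer illustration with hypothetical weights, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p_keep=0.8):
    """Bernoulli dropout: keep each unit with probability p_keep
    (inverted scaling keeps the expected activation unchanged)."""
    mask = rng.random(x.shape) < p_keep
    return x * mask / p_keep

def stochastic_forward(x, W, b, p_keep=0.8):
    """One dropout-perturbed forward pass through a single sigmoid unit."""
    h = dropout(x, p_keep)
    return 1.0 / (1.0 + np.exp(-(h @ W + b)))

x = rng.random(10)                        # toy feature vector
W, b = rng.standard_normal(10), 0.0       # hypothetical learned weights
T = 50                                    # MC forward passes, as in the paper
preds = np.array([stochastic_forward(x, W, b) for _ in range(T)])
print(preds.mean(), preds.std())          # MC estimate and its uncertainty
```

Each pass samples a fresh Bernoulli mask, so the T outputs form a distribution; their mean is the prediction and their spread is the uncertainty estimate.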

Data
In this article, we apply this model to microscopic images of blood samples, acquired with an optical laboratory microscope coupled with a Canon PowerShot G5 camera and sampled for ALL diagnosis. The acute lymphoblastic leukemia image database ALL-IDB is used for the empirical study. ALL-IDB is a public image data set of peripheral blood samples from normal individuals and leukemia patients, and it contains the relative supervised classification and segmentation data. The samples were collected by the experts at the M. Tettamanti Research Center for childhood leukemia and hematological diseases, Monza, Italy. The database has two distinct versions: the first version (ALL-IDB1) contains 108 images with 39,000 blood elements and can be used for testing the segmentation capability of algorithms as well as classification systems and image pre-processing methods. The second version (ALL-IDB2), which contains 260 colored images of lymphocytes, is a collection of cropped areas of interest from normal and blast cells belonging to the ALL-IDB1 data set, so it can be used only for testing the performance of classification systems. In this article, we use the ALL-IDB2 data set for the BCNN experiments. A noncancerous lymphocyte is labeled 0 and a cancerous lymphocyte is labeled 1. All images labeled Y = 0 were obtained from healthy individuals, while images labeled Y = 1 were collected from ALL patients (Scotti (2005); Labati, Piuri, and Scotti 2011).
Sample images of cancerous and noncancerous cells are plotted in Figure 3. The colored images are our input data; more specifically, the pixel values of each image are the values of the independent variables, while the labels Y = 0 and Y = 1 are the classes of the dependent variable. Thus, we classify the images into two classes, training a model to classify each of the lymphocyte cells previously labeled by expert oncologists.
To implement the BCNN approach, the images are split into cancerous and noncancerous cells. Among the 130 cancerous cell images, the first 100 are used in the training set, the next 15 in the validation set, and the last 15 in the test set. The 130 noncancerous cell images are split in the same manner. For validation of the results, the hold-out validation approach is used.
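The per-class 100/15/15 split can be expressed in a few lines of code. The file names below are purely hypothetical placeholders for the 260 ALL-IDB2 images:

```python
def split_class(images):
    """Per-class split used in the paper: first 100 train, next 15 validation,
    last 15 test."""
    return images[:100], images[100:115], images[115:130]

# Hypothetical file-name lists for the two classes (130 images each).
cancerous = [f"cancer_{i:03d}.tif" for i in range(1, 131)]       # label 1
noncancerous = [f"normal_{i:03d}.tif" for i in range(1, 131)]    # label 0

c_train, c_val, c_test = split_class(cancerous)
n_train, n_val, n_test = split_class(noncancerous)

train, val, test = c_train + n_train, c_val + n_val, c_test + n_test
print(len(train), len(val), len(test))   # 200 30 30
```

Combining the two classes yields the 200/30/30 train/validation/test sizes reported in the experimental setup.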

Result
This section summarizes the results obtained during the analysis. Six network structures are presented here and the results are summarized in Table 1. First, we display the accuracy rates returned by the models from each network in our experiments; then, the predictive uncertainty produced by the best network is discussed. Finally, we check all models for overfitting.

Model performance
We experimented with several network architectures to find the model that produces the optimal result. During the training phase, we add a dropout layer after each convolutional layer as well as after each fully connected layer. For consistency, in all experimental models the dropout rate after each convolutional layer is 20%, while it is 50% for all fully connected layers. The reason for a dropout rate of only 20% after the convolutional layers is to avoid losing too much information while implementing the Bayesian CNN: since the pooling operation compresses the data, a high dropout rate after a convolutional layer may put a heavy constraint on the learning process of the models.
We experimented with six network structures: 3 convolutional layers & 4 hidden layers, 3 convolutional layers & 5 hidden layers, 4 convolutional layers & 4 hidden layers, 4 convolutional layers & 5 hidden layers, 5 convolutional layers & 1 hidden layer, and 5 convolutional layers & 2 hidden layers. We trained 10 different models for each of these network structures, implementing dropout in each model. The loss (objective) function deployed during model training was binary cross-entropy, since we are dealing with a two-class classification problem. For gradient descent optimization, we initially experimented with different optimizers, e.g., Adagrad, RMSprop, and Adamax, and finally chose RMSprop. We also experimented with different values of the learning rate hyperparameter of the RMSprop optimizer, such as 0.001, 0.002, and 0.005. We observed that a learning rate of 0.0001 produced better results for our data set, and we therefore kept that rate throughout all models. The other hyperparameter of the RMSprop optimizer, the decay rate, was kept constant at 0.9. In the experimental setup, we trained our models with 200 images, while 30 images were kept for validation and the remaining 30 for the testing phase. Each model was trained for 50 epochs, i.e., we passed the entire training set through the neural network 50 times. We used a mini-batch size of 20 images, so there were 10 steps per epoch for the 200 training images. During training, after each epoch, the network evaluates its learning progress against the validation data set; we shuffled the images within the validation data set after each epoch. After training a model, we evaluated its performance on the test data. The R programming language (RStudio) is used to implement the BCNN on our data set.
TensorFlow, one of the most commonly used frameworks for deep learning, is used to perform the tasks, with Keras as the API on top of TensorFlow (both originally developed by Google). A computer with an Intel Core i5 4570 CPU @ 3.20 GHz and 8.00 GB of single-channel DDR3 RAM is used to run the experiments.
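The RMSprop update rule used above (with decay rate ρ = 0.9 and, for the final models, learning rate 0.0001) keeps a running average of squared gradients and scales each step by it. A minimal sketch on a toy one-parameter problem, not tied to the paper's networks:

```python
import numpy as np

def rmsprop_step(w, grad, cache, lr=1e-4, rho=0.9, eps=1e-7):
    """One RMSprop update: the running average of squared gradients (cache)
    normalizes the step size per parameter."""
    cache = rho * cache + (1.0 - rho) * grad**2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache

# Toy example: minimize f(w) = w^2, whose gradient is 2w.
w, cache = 5.0, 0.0
for _ in range(1000):
    w, cache = rmsprop_step(w, 2.0 * w, cache)
print(w)   # decreases toward the minimum at 0
```

Because the step is normalized by the gradient's running magnitude, the effective step is roughly the learning rate itself, which is why small rates such as 0.0001 still make steady progress.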
The Bayesian CNN was implemented via 50 forward passes through the models while keeping dropout active. We observe that, during test time with MC dropout, the error rate varies from a small to a relatively large quantity depending on the network structure. To obtain a reliable estimate, the process is repeated several times: the MC dropout testing is repeated 10 times for each network, and the mean and standard deviation of the accuracy rates are recorded. Table 1 shows the accuracy rates of the six different BCNN structures over the 10 repetitions. As can be seen, the chosen models provide a reliable accuracy rate of up to almost 94%, which is a remarkable performance. Without any manual feature extraction, which is very common in such studies, all our proposed models classify the images with around 90% accuracy, with the exception of the 5 convolutional & 1 hidden layer model. Some models do produce a high standard deviation, which indicates high uncertainty, but the 3 convolutional & 5 hidden layer and 4 convolutional & 5 hidden layer models are found to be the best, with a standard deviation in accuracy hovering around 2.
Next, we present results on the sensitivity and specificity returned by the models from the six network structures. Table 2 shows the mean sensitivity and specificity, with the corresponding standard deviations, for the 10 models from each of the six network structures. From Table 2, we can again observe that the models with 3 convolutional & 5 hidden layers and 4 convolutional & 5 hidden layers obtained the same specificity, which turns out to be higher than that of the other models. This indicates that models from these two networks predicted the cancerous lymphocytes with high mean accuracy and low standard deviation compared to the other models. For the other models, the mean sensitivity and specificity results are very good but come with high uncertainty, and relying on these models can therefore be misleading. Note that a similar data set has been used in Rawat et al. (2017), Putzu, Caocci, and Di Ruberto (2014), and Genovese et al. (2021a), where features of interest were first extracted from the images through different image processing techniques and then several classifiers were implemented to evaluate the classification accuracy. The overall classification accuracies reported in these studies (94.5% and 93%, respectively) are not much different from what we obtain in this study without involving image processing methods. Here, we rely on the strength of the Bayesian-driven convolutional neural network not only for classification but also to extract features from the images by itself. Considering specificity as a performance measure, our approach outperforms the findings reported in, e.g., Das and Meher (2021a) and Genovese et al. (2021b). Moreover, since the Bayesian approach allows us to evaluate model prediction uncertainty, our contribution provides more reliable estimates than existing studies.
We then move on with the aim of understanding the dynamic structure of the standard deviation of the error rates obtained from the different models. This is important, since a minimum error with consistent variation leads to reliable estimates; it is also useful for understanding a model's capability of making consistent predictions. In Figure 4, we graphically illustrate model performance for all of these networks, plotting the error rates produced by models from the six different networks. Data were passed through the networks 50 times, as depicted on the x-axis, and the process was repeated 10 times, producing the variation on the y-axis. The deep blue dots are the means of the error rates over the 10 iterations, and the error bars show one standard deviation from the mean.
Here, the dynamic structure tells a better story about the variation. The mean error rate is lowest, with a relatively smaller standard deviation, for models with 3 convolutional & 5 hidden layers. The other models, except those trained with 4 convolutional layers & 5 hidden layers, returned relatively higher error rates with very high standard deviations. For the best model structure, 3 convolutional & 5 hidden layers, we briefly provide some technical details. In the first convolution layer, the number of output filters is 32 and the kernel size is 3 × 3, which refers to the width and height of the 2D convolution window. In the max-pooling layer, the pool size is 2 × 2, which refers to the magnitude of downscaling. In the second and third convolution layers, the number of output filters is 64 and the kernel size is 3 × 3; the max-pooling layers are kept the same as before, 2 × 2. Next, in all 5 hidden layers, the number of nodes is 1024. The activation function in the 3 convolution layers and the 5 hidden layers is ReLU, while in the output layer the activation function is sigmoid.
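As a sanity check on this architecture, the feature-map sizes can be traced layer by layer. Note that the input image size and padding mode are not stated above, so the 50 × 50, 'valid'-padding input below is purely an assumption for illustration:

```python
def conv_out(size, kernel=3):
    """Spatial size after a 'valid' convolution with a square kernel."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Spatial size after non-overlapping max pooling (floor division)."""
    return size // pool

size = 50                        # assumed input height/width (hypothetical)
for n_filters in (32, 64, 64):   # three conv layers; the filter count changes
    size = pool_out(conv_out(size))   # the depth, not the spatial size
print(size, 64 * size * size)    # final spatial size and flattened length
```

Under this assumption the trace runs 50 → 48 → 24 → 22 → 11 → 9 → 4, so flattening the final 64 feature maps yields a 1024-long vector feeding the hidden layers.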

Predictive uncertainty
For a binary classification problem, deterministic deep learning models produce a single probability value in favor of a class given an observation. To capture predictive uncertainty, however, one needs a distribution for each prediction rather than a point estimate. The BCNN structure offers such a facility: to obtain the predictive uncertainty for our model, we pass the data through the network 50 times during the testing period. This, in turn, produces a distribution of probabilities for each outcome given the input data. We take the MC dropout estimate to be the average of the 50 probability values for an observation; the variation among these values provides the uncertainty estimate for each prediction made by the model. Following this procedure, predictive uncertainty is estimated for all models from all networks. However, for brevity, in this section we only focus on the models from the networks that returned the lowest error rates. From our experiment, the models with 3 convolutional layers & 5 hidden layers and with 4 convolutional layers & 5 hidden layers produced the lowest error rates among all models, as can be seen in Figure 4. Thus, in the following, we discuss the predictive uncertainty for these two networks, for which the results are presented in Figure 5 and Figure 6.
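The MC dropout procedure described above can be sketched in a few lines. The toy network below (random placeholder weights, an assumed dropout rate of 0.5, and 50 stochastic passes — all illustrative choices, not the fitted model from the paper) shows how keeping dropout active at test time turns one prediction into a distribution whose mean and standard deviation give the point estimate and its uncertainty:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a trained network: one ReLU hidden layer.
# These weights are random placeholders, not fitted values.
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stochastic_forward(x, drop_rate=0.5):
    """One forward pass with dropout kept active at test time."""
    h = np.maximum(x @ W1, 0.0)            # ReLU hidden activations
    mask = rng.random(h.shape) > drop_rate  # fresh dropout mask each pass
    h = h * mask / (1.0 - drop_rate)        # inverted-dropout scaling
    return sigmoid(h @ W2)                  # probability of class "1"

def mc_dropout_predict(x, n_passes=50):
    """Pass the input through the network n_passes times and summarise
    the resulting distribution of predicted probabilities."""
    probs = np.stack([stochastic_forward(x) for _ in range(n_passes)])
    return probs.mean(axis=0), probs.std(axis=0)  # point estimate, uncertainty

x = rng.normal(size=(1, 8))  # one test observation (feature vector)
mean_p, sd_p = mc_dropout_predict(x)
```

Because a fresh dropout mask is drawn on every pass, each of the 50 forward passes effectively uses a different thinned network, which is exactly why the per-image error bars in Figures 5 and 6 are nonzero.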
In the test data, the first 15 images were labeled as 0, i.e., images with no sign of leukemia, and the remaining images were labeled as 1, i.e., images with signs of leukemia. If classified accurately, the predicted probabilities for images labeled as 0 should fall within the light red zone, while those for images labeled as 1 should fall within the light green zone. For instance, if an image from the first 15 images is predicted wrongly, its probability value will fall outside the red zone (top left white space). During test time, a model produced the probability of a lymphocyte cell belonging to class "1" given new input data. For each of the 30 test images, we produced 50 probability values by passing the data 50 times through each network, obtaining a distribution of predicted probabilities for each image. Thus, the orange and green dots are the means of the predicted probability values for classes 0 & 1, respectively, and the error bars show one standard deviation of the predicted probabilities. Note that, due to the dropout, the networks are different each time. The lower and upper bounds for each image are defined by the mean of the predicted probabilities ± one standard deviation from the mean. As can be seen from the graphs through the orange and green dots, most of the time the models correctly classify the images, with a very low misclassification rate.

Overfitting
The curse of overfitting limits the applicability of traditional machine and deep learning approaches in many instances. Therefore, for the reliability of our approach, it is important to identify whether our approach is trapped by this curse or not. To do so, we extend the scope of the analysis in this direction. For each of the six different CNN structures, we evaluated model performance on the validation dataset during training and assessed model performance on the test dataset during testing. When a model gives a relatively low error rate on the validation dataset compared to the test dataset, this indicates overfitting. To identify this, we experimented with all six network structures and repeated each structure 10 times. Thus, in total, we obtained 60 models and, for brevity, one model from each network structure is presented in Figure 7. This figure shows the performance of the models during the training and testing period. On the x-axis, we plot the epochs or iterations, while the y-axis presents the accuracy rate. The test accuracy rate is marked in red while the validation accuracy rate is marked in green. As can be seen immediately from Figure 7, the models based on 3 convolutional layers & 4 hidden layers, 3 convolutional layers & 5 hidden layers, 4 convolutional layers & 4 hidden layers, and 4 convolutional layers & 5 hidden layers present a similar pattern, while for the models based on 5 convolutional layers & 1 hidden layer and 5 convolutional layers & 2 hidden layers, the results are very different.
The validation accuracy rate (green line) for one of the models from the first network (3 convolutional layers & 4 hidden layers), as in Figure 7(a), was below 0.5 when the model started its learning process. After around 35 epochs, the validation accuracy reached 90% and remained constant for the rest of the epochs. During test time, the accuracy remained constant at 90% throughout the 50 Monte Carlo iterations. Both test accuracy and validation accuracy converged after around 35 epochs/iterations and maintained this consistency for the remaining epochs/iterations. Models from 3 convolutional layers & 5 hidden layers (Figure 7(b)) show a more or less similar pattern. Models trained with 4 convolutional layers & 4 hidden layers and 4 convolutional layers & 5 hidden layers, i.e., plots (c) & (d) in Figure 7, demonstrate the same consistency in validation and test accuracy, but here it was achieved at around 50 epochs. This consistency between validation accuracy and test accuracy indicates that the specific model from that network is not overfitting.
For the models based on 5 convolutional layers & 1 hidden layer and 5 convolutional layers & 2 hidden layers, by contrast, the validation accuracy rates exceed the test accuracy after around 15 and 25 epochs/iterations, as depicted in Figure 7(e) and Figure 7(f), respectively. We therefore conclude that these models overfit the data and are not suitable for further analysis.
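The visual criterion used above — flagging overfitting when the validation curve pulls ahead of the test curve and stays there — can be expressed as a small helper. This is a hypothetical heuristic written for illustration (the function name, the `margin` parameter, and the toy curves are our own), not the authors' procedure:

```python
def overfit_epoch(val_acc, test_acc, margin=0.0):
    """Return the first epoch from which validation accuracy stays strictly
    above test accuracy by more than `margin` (a simple overfitting flag),
    or None if the two curves never diverge that way."""
    for epoch in range(len(val_acc)):
        tail = zip(val_acc[epoch:], test_acc[epoch:])
        if all(v > t + margin for v, t in tail):
            return epoch
    return None

# Toy curves mimicking the pattern in Figure 7(e): the test accuracy
# plateaus while the validation accuracy keeps climbing after epoch 15.
val = [0.5 + 0.01 * e for e in range(30)]
tst = [0.5 + 0.01 * min(e, 15) for e in range(30)]
```

On these toy curves the helper flags epoch 16, the first point where validation accuracy permanently exceeds test accuracy; on two identical curves it returns `None`, matching the "no overfitting" verdict for panels (a)–(d).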

Discussion & conclusion
In this work, we have demonstrated how Bayesian CNN models can be implemented to classify cancerous and noncancerous lymphoblast cells and to provide uncertainty estimates with their predictions. It contributes to the current literature on the investigation of leukemia cancer cells, in particular with an in-depth discussion of classification accuracy and prediction uncertainties, and on the application and interpretation of such models for image classification in general. For the reader, we briefly outlined the proposed method and summarized its mechanism. In the empirical study, the predictive models in the BCNN structure are then implemented for ALL image classification. We show how a probabilistic model can be used for image classification with reasonably good classification accuracy, without manual or computer-aided feature extraction. We further obtain uncertainty estimates for the classification of each image, which is usually not discussed in existing studies. To obtain reliable predictions, we take the Bayesian approach in the CNN, which in turn produces a distribution of probable output values given new inputs. The common problem of overfitting is discussed and, through illustration, it has been shown that the chosen models do not overfit the data. This is important to highlight here because existing studies focus mainly on classification accuracy, and not much attention is usually given to such investigations.
Our experimentation with the image dataset for ALL image classification produced reliable results. In total, we developed models in six different network structures and, for each structure, the experiment was repeated 10 times to obtain the variation in the models' predictions. The highest mean accuracy rate was obtained for the model with 3 convolutional layers and 5 hidden layers, with a mean accuracy (specificity) rate of 94% (99.3%) and a standard deviation of 2. Together with the supporting evidence obtained from classification accuracy, prediction uncertainty, and overfitting estimates, we are confident that the models considered in these experiments can be deployed for classifying microscopic lymphocyte cells with more reliability and for better diagnosis of ALL.

Future directions
In the future, we aim to employ different DL architectures and databases with more samples to better assess the data. For small datasets, like the one used in this study, active learning can be implemented for better classification. In active learning, not all data points are required to be labeled before training starts; rather, the algorithm starts with very little data and, during the learning process, the model itself asks the user (a human expert) to label a specific data point if needed. With the Bayesian approach and active learning, one can classify microscopic images of lymphocyte cells and obtain not only the prediction accuracies but also the predictive uncertainty, which is more reliable, while avoiding overfitting at the same time. Moreover, to further reduce the computational cost of the employed approach, the Bayesian CNN can be coupled with techniques such as transfer learning, factorization/decomposition of convolution kernels, and depthwise separable convolutions, together with more up-to-date visualization approaches such as t-SNE. In this article, we stick to the most commonly used performance assessment criteria, such as the accuracy, specificity, and sensitivity.