Confident Classification Using a Hybrid Between Deterministic and Probabilistic Convolutional Neural Networks

Traditional neural networks trained using point-based maximum likelihood estimation are deterministic models and have exhibited near-human performance in many image classification tasks. However, their insistence on representing network parameters with point-estimates renders them incapable of capturing all possible combinations of the weights, which results in a predictor biased towards its initialisation. Most importantly, these deterministic networks are inherently unable to provide any uncertainty estimate for their predictions, which is highly sought after in many critical application areas. On the other hand, Bayesian neural networks place a probability distribution on network weights and give a built-in regularisation effect, making these models able to learn well from small datasets without overfitting. These networks provide a way of generating a posterior distribution which can be used to estimate the model's uncertainty. However, Bayesian estimation is computationally very expensive since it greatly widens the parameter space. This paper proposes a hybrid convolutional neural network which combines the high accuracy of deterministic models with the posterior distribution approximation of Bayesian neural networks. This hybrid architecture is validated on 13 publicly available benchmark classification datasets from a wide range of domains and different modalities like natural scene images, medical images, and time-series. Our results show that the proposed hybrid approach performs better than both deterministic and Bayesian methods in terms of classification accuracy and also provides an estimate of uncertainty for every prediction. We further employ this uncertainty to filter out unconfident predictions and achieve a significant additional gain in accuracy for the remaining predictions.


The associate editor coordinating the review of this manuscript and approving it for publication was Sabu M Thampi.

I. INTRODUCTION
Over the last decade, Convolutional Neural Networks (CNNs) have made phenomenal strides in various classification tasks using a wide array of input modalities. These powerful algorithms have achieved impressive performance, often on par with human experts, in many challenging natural scene image recognition tasks [1]-[3] and even in sensitive and critical application areas like medical image analysis for disease prediction [4]-[8]. These CNNs gained significant attention due to their parameter efficiency, in contrast to other deep learning models like densely connected Multi-Layer Perceptrons (MLPs), resulting in comparatively better generalisation performance. They are particularly powerful in analysing visual modalities like images and videos [9] but have also proved their worth in time-series analysis, where they have been used for classification [10] and anomaly detection [11].
The fundamental principle behind conventional CNNs is to learn the optimal combination of network parameters (weights and biases) that can capture an encoded representation of the input training data. These conventional CNNs use point-estimates to represent network parameters and, although they work astonishingly well in most image recognition tasks, they have a large, insatiable appetite for data [12]. Additionally, the softmax function tips the odds in favour of one class by squashing the classification probabilities of the others. It therefore often produces overly confident predictions even when the network is completely wrong. This compulsive behaviour of traditional point-based neural networks, which are always relentlessly assertive in their predictions, raises serious concerns in many crucial application areas like medical image analysis, security, autonomous driving, financial transactions and IoT (Internet of Things) based human health monitoring. Also, the very nature of these point-based classifiers prevents them from associating uncertainty with their predictions, which is a highly desired characteristic of any AI-based classifier.
Bayesian estimation introduces a probabilistic perspective to neural networks and addresses many shortcomings of traditional point-based neural networks. It represents each parameter with a probability distribution instead of a single point-estimate. As a result, Bayesian neural networks are able to learn effectively from a relatively small amount of data and are thus fairly robust to overfitting [13]. They provide an inherent regularisation effect [14] by constraining the network parameters within a distribution instead of letting them grow without bound. Most importantly, Bayesian inference makes it possible to estimate the network's uncertainty about any prediction. However, a full Bayesian estimation over all network parameters is computationally expensive and finding the true posterior probability is intractable. These limitations are normally addressed by employing techniques like Markov Chain Monte Carlo (MCMC) sampling [15] and Variational Inference (VI) [16], or a combination of the two [17], to approximate the true posterior with a manageable distribution. A CNN trained using Bayesian estimates for network parameters has been shown to lag behind its counterpart trained using point-estimates in terms of classification accuracy [13], [18].
In this paper, we recognise the specific merits of each approach discussed above and combine them into a hybrid training paradigm. This hybrid approach integrates deterministic CNNs, where each parameter assumes only one value, with probability-driven Bayesian CNNs, where each parameter may take any value drawn from a probability distribution characterised by a mean and a standard deviation. This probability distribution is learnt for each parameter during training. The proposed hybrid training method provides an estimate of uncertainty, using a Bayesian classifier, without compromising on classification accuracy, owing to its deterministic feature extractor. It also captures maximum weight configurations from small datasets while still remaining computationally manageable. The proposed approach is tested on 13 different classification datasets including benchmark image datasets, fine-grained medical image datasets and time-series datasets. In terms of classification accuracy, the proposed hybrid method outperforms both fully deterministic and fully Bayesian CNN approaches on 9 of these 13 datasets and gives comparable results on the rest.

A. RELATED WORK
Conventional CNNs have long demonstrated their worth in various image recognition tasks [19] and resurged in popularity in 2012 with AlexNet [20]. They have lately evolved into highly complicated networks spanning thousands of layers [21].
Although applications of Bayesian methods to neural networks have been investigated for many decades [22], it was only after Blundell et al. [23] proposed Bayes by Backprop that training of deep neural networks was made possible using Bayesian estimation. This method of Variational Inference allowed backpropagation of the so-called Evidence Lower BOund (ELBO) loss and regularisation of the weight distributions. A CNN trained using the Bayesian method was recently proposed by Shridhar et al. [18] as a fundamental construct for other network architectures. They used Bayes by Backprop for training a convolutional network and reported comparable results on many benchmark datasets.
Acknowledging the excessive computational cost of Bayesian models, Gal and Ghahramani [24] proposed a Monte Carlo dropout method to approximate Bayesian inference in deep Gaussian processes. The method is equivalent to performing multiple forward passes through the network and taking the average of the results to model the uncertainty of the network. Kwon et al. [25] recognised the importance of uncertainty quantification, especially in the medical domain, and proposed to calculate it by splitting the uncertainty into aleatoric uncertainty, which represents inherent noise in the data, and epistemic uncertainty, which corresponds to the model's uncertainty. Kendall and Gal [26] studied the advantages of modelling epistemic uncertainty as compared to aleatoric uncertainty in deep Bayesian models. Combining deterministic and probabilistic models in various fashions has also been studied for a long time. Tang and Salakhutdinov [27] pointed out that the conditional distribution p(Y|X) does not need to be unimodal, as normally assumed by MLPs, but can also be represented as a multimodal output distribution for many structured prediction problems. They proposed a hybrid Sigmoid Belief Network (SBN) with some stochastic hidden variables and some deterministic hidden variables and achieved superior performance on synthetic and facial expression datasets. Similarly, other neural networks with partially Bayesian parameters have been proposed for regression tasks as an alternative to Gaussian Processes [14], [28], which do not scale well with the number of training samples.
The problem of estimating uncertainty has been addressed in a variety of ways, for example out-of-distribution (OOD) sample detection [29], [30] and density estimation using flow-based models. Normalising flows and autoregressive models have been successfully combined to produce state-of-the-art results in density estimation, via Masked Autoregressive Flows (MAF) [31], and to accelerate state-of-the-art WaveNet-based speech synthesis to 20x faster than real-time [32], via Inverse Autoregressive Flows (IAF) [33]. Huang et al. [34] presented Neural Autoregressive Flows (NAFs) and demonstrated that these models are universal approximators for continuous probability distributions, and that their greater expressivity allows them to better capture multimodal target distributions. Building on their work, Cao et al. [35] proposed the Block Neural Autoregressive Flow, a much more compact universal approximator of density functions in which a bijection is directly modelled using a single feedforward network. Dinh et al. [36] introduced a set of transformations called real-valued Non-Volume Preserving (real NVP) transformations as a tractable and expressive way to model high-dimensional data. Ardizzone et al. [37] extended the real NVP architecture and argued that their proposed Invertible Neural Networks (INNs) are well suited for determining the full posterior parameter distribution conditioned on training data. They noted that alternating backward and forward training passes and accumulating gradients from both sides before updating parameters allows efficient training. Kingma and Dhariwal [38] furthered flow-based generative models [39], which are useful for calculating the exact log-likelihood, performing exact latent-variable inference, and parallelising training and synthesis pipelines. Their Generative flow (Glow) model uses an invertible 1 × 1 convolution and is shown to be capable of efficient and accurate synthesis of large images.

II. PROPOSED HYBRID NEURAL NETWORK
A CNN primarily consists of two main modules: a feature extractor and a classifier. The proposed network consists of a set of convolutional layers trained with point estimates followed by fully-connected layers trained using Bayesian estimates. It provides a trade-off between the high accuracy of deterministic models and the uncertainty estimation of Bayesian models. It also restricts the parameter space of the network as compared to fully Bayesian models because only the classifier part of the network treats its parameters as random variables. Fig. 1 shows a schematic diagram of the hybrid model proposed in this work. The network is initially trained to optimise the parameters of both the convolutional feature extractor and the dense classifier by minimising a loss function L, where the convolutional part of the network is parameterised by W_C and ψ represents the dense layers (forming the classifier) parameterised by W_D.
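One way to write this joint point-estimate objective, consistent with the definitions of L, ψ, W_C and W_D, is sketched below. The symbol φ for the convolutional feature extractor, the training pair (x, y) and the starred optima are notational assumptions introduced here for illustration.

```latex
\[
  \left(W_C^{*},\, W_D^{*}\right)
  \;=\; \operatorname*{arg\,min}_{W_C,\, W_D}\;
  L\!\left(\psi\!\left(\phi(x;\, W_C);\, W_D\right),\, y\right)
\]
```

Here the input x passes through the convolutional part first and the dense classifier second, and both sets of weights are updated together.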
Once the network is trained using point-estimates, we reinitialise the fully connected layers with random variables following a normal distribution and retrain them using Bayesian estimation. The parameters of the convolutional feature extractor are frozen throughout this retraining. This training paradigm allows us to capitalise on the features learned economically by the deterministic convolutional block and to use expensive Bayesian inference only to approximate the posterior distribution, which can then be used for uncertainty estimation. Mathematically, the FC classifier of the hybrid model is learned by minimising L over θ_D with the convolutional parameters W_C held fixed, where the Bayesian layers are learned through Bayes by Backprop and θ_D denotes the distribution of weights. Since the weights are described by a distribution instead of point-wise estimates, L in this case denotes the ELBO loss. The convolutional feature extractor trained with point-estimates learns crisp features of the input data, while the probabilistic classifier allows sampling from the posterior distribution and offers an insight into the network's confidence.
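For reference, the retraining step can be written in the standard Bayes by Backprop form, shown here as a sketch. The variational distribution q, the prior p, the symbol φ for the frozen feature extractor and the starred quantities are notational assumptions and are not taken from the text above.

```latex
\[
  \theta_D^{*}
  \;=\; \operatorname*{arg\,min}_{\theta_D}\;
  \mathrm{KL}\!\left(q(W_D \mid \theta_D)\,\middle\|\,p(W_D)\right)
  \;-\;
  \mathbb{E}_{q(W_D \mid \theta_D)}\!\left[\log p\!\left(y \mid \phi(x;\, W_C^{*}),\, W_D\right)\right]
\]
```

The first term keeps the learned weight distribution close to the prior and the second rewards a good fit to the data. W_C^{*} appears only as a constant because the convolutional weights are frozen.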
After this retraining is finished, we perform inference by passing each test sample through our network a number of times.

Algorithm 1 Uncertainty Estimation
Inputs
    modelOutput: Array containing softmax probabilities of all images for all models
    allPredictions: Array containing class predictions for all images and for all models
    allTargets: Array containing actual targets for all images and for all models
    percentile: A scalar parameter to ascertain uncertain images to ignore
    consensus: A scalar parameter representing the minimum number of confident models needed to reach a certain prediction
Outputs
    certainAccuracy: Accuracy when the model is certain
    uncertainImages: The percentage of uncertain images filtered out
procedure estimateUncertainty
    for each model i in allModels do
        for each image j in allImages do
            differences[i][j] = difference between the top two min-max normalised values of modelOutput[i][j]
        end for
    end for
    threshold = the percentile-th percentile of all values in differences
    Let confPred = 0, uncertain = 0 be new variables
    for each image j in allImages do
        Let confModels = 0 be a new variable
        for each model i in allModels do
            if differences[i][j] > threshold then
                if allPredictions[i][j] == allTargets[i][j] then
                    increment confModels
                end if
            end if
        end for
        if confModels >= consensus then
            increment confPred
        else
            increment uncertain
        end if
    end for
    return confPred/(len(allImages) − uncertain), uncertain/len(allImages)
end procedure
Since the parameters of the last fully-connected layers of the network are sampled from a probability distribution, each pass of the same test sample gives a different prediction. These output predictions are used to draw a posterior distribution and estimate the network's uncertainty. The complete procedure used for this task is given in Algorithm 1.
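Why repeated passes differ can be illustrated with a minimal sketch, assuming a single dense layer with independent Gaussian weights applied to a fixed feature vector from the frozen extractor. The function names, the layer sizes and the values of `mu` and `sigma` are hypothetical, and biases are omitted for brevity.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_dense_forward(features, mu, sigma, rng):
    """One stochastic pass: draw every weight from N(mu, sigma^2), then apply softmax."""
    logits = []
    for mu_row, sigma_row in zip(mu, sigma):
        weights = [rng.gauss(m, s) for m, s in zip(mu_row, sigma_row)]
        logits.append(sum(w * f for w, f in zip(weights, features)))
    return softmax(logits)


def predictive_samples(features, mu, sigma, n_samples=10, seed=0):
    """Repeat the stochastic pass to obtain a set of predictions for one input."""
    rng = random.Random(seed)
    return [sample_dense_forward(features, mu, sigma, rng) for _ in range(n_samples)]


# Hypothetical 3-class classifier over a 2-dimensional frozen feature vector.
mu = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
sigma = [[0.3, 0.3], [0.3, 0.3], [0.3, 0.3]]
samples = predictive_samples([2.0, 0.5], mu, sigma)
mean_probs = [sum(s[c] for s in samples) / len(samples) for c in range(3)]
```

The spread of `samples` around `mean_probs` is what the uncertainty estimate draws on. With every `sigma` entry set to zero the layer collapses to a deterministic one and all passes coincide.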
For uncertainty analysis in Bayesian and hybrid architectures during inference, the algorithm works by sampling 10 classifier models from the Bayesian weight distribution for every test sample and taking their output predictions. This way, instead of a single prediction, we get a set of predictions representing a probability distribution on the network's output. This set of predictions is normalised to the [0, 1] range using min-max normalisation for direct comparison. The predictions for the top two classes are taken and the difference in their values is recorded. After obtaining the normalised differences, we build a distribution of all these differences and use a percentile value (40% in this case) to automatically select a threshold for the measure of uncertainty. The percentile value of 40% is determined heuristically. This parameter can be considered a knob to control how confident the predictions need to be in any given application area. In circumstances where no prediction is deemed better than a wrong prediction (medical diagnosis, for example), this value can be raised to ensure that only the most confident predictions are given by the network. For other, relatively less critical scenarios, this knob can be adjusted accordingly. The underlying assumption of our uncertainty estimation is that if the outputs for two classes are fairly distinct, then the difference between the top two classes should be greater than the threshold and the model is regarded as certain about the prediction; otherwise it is considered uncertain. If a test sample is regarded as certain by more than half of the models (represented by the consensus parameter), using simple majority voting, then it is output as a fairly certain prediction.
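The description above can be sketched in Python as follows. This is one reading of the prose rather than a reference implementation, and it rests on several assumptions: min-max normalisation is applied over all output values at once, the percentile uses the nearest-rank convention, the certainty decision depends only on the top-two margin, the class reported for a certain image is the majority vote among confident models, and correctness is checked only afterwards. The names `percentile_value` and `estimate_uncertainty` are hypothetical.

```python
import math
from collections import Counter


def percentile_value(values, pct):
    """Nearest-rank percentile (an assumed convention; the text names none)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def estimate_uncertainty(model_output, targets, percentile=40, consensus=6):
    """model_output[i][j] holds the class probabilities of sampled model i for image j.

    Returns (accuracy on certain images, fraction of images flagged uncertain).
    """
    # Assumption: min-max normalisation over every output value at once.
    flat = [p for per_model in model_output for probs in per_model for p in probs]
    lo, hi = min(flat), max(flat)
    scale = (hi - lo) or 1.0

    differences, predictions = [], []
    for per_model in model_output:
        diff_row, pred_row = [], []
        for probs in per_model:
            norm = [(p - lo) / scale for p in probs]
            ranked = sorted(range(len(norm)), key=norm.__getitem__, reverse=True)
            diff_row.append(norm[ranked[0]] - norm[ranked[1]])  # top-two margin
            pred_row.append(ranked[0])
        differences.append(diff_row)
        predictions.append(pred_row)

    threshold = percentile_value([d for row in differences for d in row], percentile)

    correct = uncertain = 0
    for j, target in enumerate(targets):
        # Votes from the models whose margin on image j clears the threshold.
        votes = [predictions[i][j] for i in range(len(model_output))
                 if differences[i][j] > threshold]
        if len(votes) >= consensus:
            # Assumption: the certain prediction is the majority class among confident models.
            if Counter(votes).most_common(1)[0][0] == target:
                correct += 1
        else:
            uncertain += 1

    certain = len(targets) - uncertain
    accuracy = correct / certain if certain else float("nan")
    return accuracy, uncertain / len(targets)
```

Raising `percentile` or `consensus` makes the filter stricter, so more images are flagged as uncertain and fewer predictions are returned.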

A. TIME AND SPACE COMPLEXITY ANALYSIS
The proposed hybrid model uses fewer parameters than its Bayesian counterpart, as is evident from Table 1.
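The source of the saving can be illustrated with a small parameter count. This sketch assumes that every Bayesian parameter, biases included, is stored as a mean and a standard deviation, and it uses a hypothetical toy architecture; the layer sizes below are not those of Table 1.

```python
def conv_params(c_in, c_out, kernel):
    """Weights plus biases of one square-kernel convolutional layer."""
    return c_in * c_out * kernel * kernel + c_out


def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def count_parameters(conv_layers, dense_layers):
    """Trainable values under the three training paradigms.

    Assumption: a Bayesian parameter costs two values (mean and standard deviation).
    """
    conv = sum(conv_params(*layer) for layer in conv_layers)
    dense = sum(dense_params(*layer) for layer in dense_layers)
    return {
        "deterministic": conv + dense,
        "bayesian": 2 * (conv + dense),
        "hybrid": conv + 2 * dense,  # only the classifier is Bayesian
    }


# Hypothetical architecture: (in_channels, out_channels, kernel) and (in, out).
counts = count_parameters(
    conv_layers=[(3, 32, 3), (32, 64, 3)],
    dense_layers=[(1024, 128), (128, 10)],
)
```

Under these assumptions the hybrid count always lies between the deterministic and the fully Bayesian counts, and the gap to the Bayesian model grows with the share of parameters held in the convolutional layers.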

III. EXPERIMENTATION
We used 13 datasets of disparate modalities and from diverse areas of application to ascertain the viability of our proposed hybrid CNN architecture. A brief description of all the datasets used and of the overall experimental setup is given below.

VOLUME 8, 2020
A. DATASETS
Table 2 gives an overview of all the datasets used in this work. We picked standard benchmark image datasets, as well as challenging fine-grained medical image classification datasets and many time-series datasets, so that the validity of our approach on a broad range of datasets may be extensively investigated.

1) IMAGE DATASETS
We used two of the most common benchmark datasets, i.e. MNIST [19] and CIFAR-10 [40], and two publicly available medical image datasets, i.e. ORIGA [41] and a subset of the ISIC Archive, to evaluate the performance of our proposed approach. For MNIST and CIFAR-10, the standard pre-defined train and test splits are used. The ORIGA dataset provides clinical ground truth to benchmark segmentation of the optic disc and classification of healthy and glaucomatous images. Since this dataset is very small and no predefined train and test splits are given, we used 5-fold Cross Validation (CV) for this dataset such that in each iteration of CV there are 130 images in the validation fold and 520 images in the training folds. The second dataset of medical images was taken from the 2018 version of the ISIC Archive. It consists of around 24000 clinical and dermoscopic images of skin lesions categorised into 7 classes. Some of the classes in this dataset have as few as 122 images; therefore, we took a subset of the whole data with the three largest classes, namely Benign Keratosis-like Lesions (BKL), Melanoma (MEL), and Melanocytic Nevi (NV), and randomly divided them into training and test sets.
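The 5-fold protocol used for ORIGA can be sketched as follows, assuming the 650 images implied by the 130/520 split. The function name `k_fold_indices`, the shuffling seed and the absence of stratification are assumptions, since the text does not say how folds were drawn.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Strided slicing gives k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        validation = folds[held_out]
        train = [idx for f, fold in enumerate(folds) if f != held_out for idx in fold]
        yield train, validation


splits = list(k_fold_indices(650, k=5))
```

Each image appears in exactly one validation fold, so every image is used for validation once across the five iterations.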

2) TIMESERIES DATASETS
We selected 9 datasets from the UCR archive [42]. The time-series datasets were generated from different sources including device usage, sensor data, ECG, motion sensors, and simulation. Each dataset contains a different number of classes, and the number of observations also varies from dataset to dataset. All datasets are already divided into train and test sets by the publisher.

B. PREPROCESSING
To preprocess the benchmark image datasets (MNIST and CIFAR-10), we used random crops and normalisation by mean subtraction. On the medical image datasets (ORIGA and ISIC Subset), histogram equalisation is applied to enhance contrast and normalise brightness. We also made use of different data augmentation techniques like rotations, flipping, and random crops to increase the dataset size. Note that, in addition to the preprocessed images, the original images are also kept in the dataset. Data augmentation was done with the class ratio in mind, such that the minority class receives more augmentations and more generated copies. Time-series datasets are used without any preprocessing.
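As an illustration of the contrast-enhancement step, the following is a minimal sketch of global histogram equalisation on a single 8-bit channel. The text does not state which variant or colour space was used, so the function name `equalise_histogram` and the per-channel treatment are assumptions.

```python
def equalise_histogram(image, levels=256):
    """Global histogram equalisation of one channel given as rows of integer intensities."""
    flat = [p for row in image for p in row]
    hist = [0] * levels
    for p in flat:
        hist[p] += 1

    # Cumulative distribution of intensities.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)

    cdf_min = min(c for c in cdf if c > 0)
    total = len(flat)
    if total == cdf_min:
        # Constant image: nothing to stretch.
        return [list(row) for row in image]

    # Map each intensity so that the output cumulative distribution is roughly linear.
    lut = [round(max(c - cdf_min, 0) / (total - cdf_min) * (levels - 1)) for c in cdf]
    return [[lut[p] for p in row] for row in image]


# Hypothetical low-contrast 2 x 2 patch.
patch = [[50, 50], [100, 150]]
stretched = equalise_histogram(patch)
```

The darkest occupied intensity maps to 0 and the brightest to 255, which spreads a narrow intensity range over the full scale.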

C. EXPERIMENTAL SETUP AND HYPERPARAMETER SELECTION
All of our image datasets were trained and compared with a similar experimental setup. We used a 5-layer convolutional block as the baseline CNN; however, our experiments with varying depths and breadths of CNN show that the approach scales to more advanced CNN architectures. We trained this CNN using Maximum Likelihood Estimation (MLE) for 60 epochs with a learning rate of 0.001, a weight decay of 5 × 10^−4, and a batch size of 32. For probabilistic models, we used the same setup as described above, but instead of using point estimates we trained the convolutional and fully connected layers with distribution-based weights using Bayes by Backprop for 60 epochs. In our proposed hybrid approach, we employed a fully-connected classifier on top of a frozen convolutional feature extractor, pre-trained using MLE, and fine-tuned the classifier using Bayesian estimation for 60 epochs with similar parameters. The two hyperparameters used in Algorithm 1, i.e. percentile and consensus, can be selected as per use case requirements. In critical application areas, for example medical image diagnosis or stock market prediction, where there is little room for incorrect classification, higher values of these parameters can be selected to ensure that only the most certain predictions are given by the network. In other applications, a relaxed criterion for uncertainty estimation might be acceptable. In our experiments, we used percentile = 40% and a consensus of more than half of the models (i.e. 6 models). These values were selected empirically and they worked well on all 13 datasets of different kinds. It should be emphasised here that, for a given dataset, we used the same underlying architecture (number, width, and depth of convolutional layers and size of dense layers) in all three training paradigms, i.e. fully deterministic, fully Bayesian and Hybrid, to ensure a fair comparison among the three approaches.
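For convenience, the reported settings can be gathered in a small configuration sketch. The class and field names are hypothetical; only the numeric values come from the text, and the optimiser is left out because it is not named.

```python
from dataclasses import dataclass


def majority_consensus(n_models):
    """Smallest number of models that is more than half of n_models."""
    return n_models // 2 + 1


@dataclass(frozen=True)
class TrainingConfig:
    epochs: int = 60
    learning_rate: float = 1e-3
    weight_decay: float = 5e-4
    batch_size: int = 32


@dataclass(frozen=True)
class UncertaintyConfig:
    n_models: int = 10          # classifiers sampled per test sample
    percentile: float = 40.0    # threshold on the top-two margin distribution
    consensus: int = majority_consensus(10)


training = TrainingConfig()
uncertainty = UncertaintyConfig()
```

For a stricter filter in a critical application, one could construct `UncertaintyConfig(percentile=60.0, consensus=8)`; these particular values are only an example and are not taken from the experiments.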
For the time-series modality, we used a CNN with two convolutional layers, each followed by a max pooling layer, for deterministic model analysis. On top of that, two fully connected layers were added as the classifier. For the probabilistic and hybrid approaches, we used the same setting as explained before.
Table 3 summarises the classification accuracies obtained by the traditional fully deterministic CNN, the Bayesian CNN [18] and our proposed hybrid approach. The table shows that the proposed hybrid approach outperforms not only purely Bayesian CNNs but also their deterministic counterparts in 9 out of 13 datasets while giving comparable results on the rest of them. Even when the hybrid approach lagged behind other methods in classification accuracy, the difference was very small and came at no additional cost in terms of time or number of parameters, as shown in Table 1. The results in the Bayesian Accuracy field in Table 3 were generated by our own experiments using the implementation of Shridhar et al. [18] for Bayesian CNNs.
Where the hybrid model misclassified, it associated relatively smaller probability scores with its misclassification than the competing models, which also misclassified but did so with overconfidence. Additionally, in cases where both deterministic and Bayesian models failed to correctly classify an image and the hybrid network succeeded (subfigures (c), (f), and (g)), it predicted very cautiously with reasonable probability scores. The probability scores of the hybrid model were on par with the other two methods for relatively easy examples, as shown in subfigure (a).

A. UNCERTAINTY ESTIMATION
Since the deterministic model does not have an intrinsic ability to estimate uncertainty (although some works like [24], [43] have used deterministic models and applied some postprocessing to get confidence estimates), in this section we focus on Bayesian and Hybrid models only and compare their performance. Since the classifier part of both Bayesian and Hybrid methods is trained using Bayesian estimates, both networks provide a posterior distribution which is used to estimate uncertainty using Algorithm 1. Table 4 compares the accuracies of both training methods before and after using Algorithm 1. In this table, Overall Accuracy refers to the accuracy of the model before applying Algorithm 1, whereas Certain Accuracy refers to the accuracy on the predictions for which the network was certain according to Algorithm 1. When the algorithm is not sure about the prediction, it tags the test sample as uncertain. We can observe that the accuracies of both fully Bayesian and hybrid approaches improved after the uncertainty estimation algorithm was applied. The accuracy of our hybrid approach is higher than that of the fully Bayesian model, especially when it was fairly certain about the predictions.
Fig. 3 shows some examples of images that were considered certain or uncertain by both the Bayesian model (top row) and the hybrid model (bottom row). It is very interesting to observe that the algorithm enabled both models to confidently categorise those images that had a clearly defined optic disc border (a black dotted elliptical boundary is drawn on the images to highlight the disc boundary). In both training approaches, the images where the boundary of the disc had faded, for example because of papilledema (Fig. 3d and Fig. 3h) or optic atrophy (Fig. 3b and Fig. 3f), were filtered out and the model did not predict on these images because of high uncertainty.
Fig. 4 depicts the trade-off between the number of uncertain samples and classification accuracy for both Bayesian and Hybrid models.
We can see from this figure that the accuracy of the networks increases as the percentage of uncertain samples increases. It can be argued from these curves that the positive trend in accuracy with a growing number of uncertain samples arises because difficult samples are passed over by the classifier and predictions are given for easy samples only. However, in many crucial application areas, it is better to abstain from giving a half-baked prediction than to make a potentially costly mistake. In medical image analysis, for instance, such non-compulsive classifiers can reduce the workload of human experts by screening relatively easy disease patterns and allowing the physicians to focus their time and energy only on the most challenging cases.
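A trade-off curve of this kind can be computed with a short sketch. It is a simplified, single-score version and not the percentile-and-consensus procedure of Algorithm 1: it assumes one confidence score per sample (for example the mean top-two margin) and a flag saying whether the prediction was correct. The function name `accuracy_rejection_curve` and the sample values are hypothetical.

```python
def accuracy_rejection_curve(confidences, correct, fractions):
    """Accuracy on the retained samples after rejecting the least confident fraction."""
    # Sample indices ordered from least to most confident.
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    curve = []
    for fraction in fractions:
        n_reject = int(fraction * len(order))
        kept = order[n_reject:]
        accuracy = sum(correct[i] for i in kept) / len(kept) if kept else float("nan")
        curve.append((fraction, accuracy))
    return curve


# Hypothetical scores: the least confident sample is the only wrong one.
scores = [0.1, 0.2, 0.8, 0.9]
is_correct = [0, 1, 1, 1]
curve = accuracy_rejection_curve(scores, is_correct, [0.0, 0.25, 0.5])
```

When low confidence coincides with errors, as in this toy input, accuracy on the retained samples rises as more samples are rejected, which is the pattern described for both models.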

V. CONCLUSION
Practical applications of deep learning based classification models require high accuracy, better generalisation, computational efficiency and an estimate of uncertainty in the model's predictions. These characteristics are not all readily available with either traditional deterministic CNNs or their Bayesian counterparts. Deterministic models, although they provide better accuracies, do not facilitate uncertainty estimation on their own. Bayesian methods, on the other hand, allow explication of the posterior distribution but have a significantly larger number of parameters that require more memory and time for tuning. Therefore, in this work we conceptualised and implemented a hybrid CNN capable of combining some of the merits of deterministic and Bayesian methods in terms of classification accuracy. The proposed method is validated on 13 different datasets and shows promising results. We experimented with different architectures with varying numbers of convolutional and dense layers, and the hybrid training approach consistently performed better than its deterministic and Bayesian counterparts. We anticipate that this work might serve as a proof-of-concept that such hybrid CNN training is worth exploring, since it works noticeably better than its purely deterministic and probabilistic versions while at the same time facilitating estimation of the network's certainty for every prediction. A thorough architecture search and hyper-parameter tuning might be required to increase baseline accuracies for each dataset. However, our experimentation with various data modalities and application areas has shown enough promise to prompt further comprehensive investigation into this training paradigm. Our next logical step in this research would be to combine this hybrid approach with dataset-specific architectures obtained through, for instance, the NAS-Net [3] and ENAS [44] algorithms.