A Guided Neural Network Approach to Predict Early Readmission of Diabetic Patients

Diabetes is a major chronic health problem affecting millions globally. Effective diabetes management can reduce the risk of hospital readmission and the associated financial losses for both the healthcare system and insurance companies. Hospital readmission is a high-priority healthcare quality measure: it reflects inadequacies in the healthcare system, increases healthcare costs and negatively influences hospitals’ reputation. Predicting readmissions early directs attention to patients with a high risk of readmission. There have been some attempts to apply machine learning predictive models such as ensemble learning with Extreme Gradient Boosting (XGBoost), Support Vector Machine (SVM) and Artificial Neural Networks (ANN) to correctly identify whether readmission will happen within 30 days ($< 30$ days), or will happen after 30 days or never ($\ge 30$ days). We propose a new method that guides the ANN through its gradient descent optimizers by distinguishing consistent from inconsistent data in every batch. Our results show an improvement of up to 1.5% in classification accuracy on both the 2-class and 3-class variations of the benchmark dataset when the guided optimizer is used to train the ANN instead of the standard optimizer. Guided ANN is also able to achieve better error convergence than standard ANN.


I. INTRODUCTION
The modern health care industry increasingly involves Artificial Intelligence (AI) in its daily practices. AI has been assisting the health care domain in deciding on appropriate treatment journeys, analyzing medical reports, making informed clinical decisions, detecting diseases early and many other tasks [1]. Over the years, the hospital readmission rate has become an important performance metric for hospitals seeking to minimize the impact on healthcare costs and patient outcomes [2]. To improve the performance of health care services, predicting the readmission of a patient has become very important, and many solutions have been proposed that use AI and machine learning techniques to accomplish this [3], [4], [5].
The associate editor coordinating the review of this manuscript and approving it for publication was Aysegul Ucar.
Hospital readmission means that a patient who had been discharged from a hospital is admitted again within a specified time interval because of the same disease [6]. There could be many causes of readmission: for instance, insufficient medical care was provided when the patient was initially admitted to the hospital, or subsequent care was not followed properly at home after discharge [7]. Keeping readmission rates low is important because they indicate the quality of a hospital's health services, which makes hospital readmission a major concern, especially in the era of the COVID-19 global pandemic [8], [9], [10].
Early detection of patients with a high risk of readmission allows medical professionals to examine them and take corresponding prevention measures; however, it may increase operating costs [11]. Therefore, correct prediction of readmission is necessary to optimize the cost.
Machine learning techniques have proven effective for such medical bioinformatics applications and produce quality predictions [3], [4], [6], [7], [8], [12]. Some prominent state-of-the-art models for predicting readmission due to diabetes or other diseases are Artificial Neural Networks (ANN), Support Vector Machine (SVM) and Random Forest (RF) [6], [13], [14], [15]. These algorithms are known for their robustness and good generalization ability. However, SVM is generally not suitable for large datasets and is sensitive to abnormal data distribution or noise, whereas RF is more capable in such situations [16], [17]. ANN is also sensitive to abnormal data distribution, but its gradient descent process can be strategically maneuvered to compensate for the complex data distribution (also called deterministic noise) [18], [19]. This strategy has been applied to logistic regression in [18] for small datasets, but it has never been applied to ANN or to large datasets. Our approach addresses the underlying problem of gradient descent, which blindly takes the training data without considering the inconsistency present in the data. The diabetes readmission dataset used in our experiments is not easy to train on due to its size and data distribution. Therefore, this article aims to address these problems by providing a new variant of the ANN algorithm based on guided gradient descent to better predict diabetes readmission. This can improve the quality of hospital services, with a long-term effect that will be beneficial for hospitals and their patients [20].
Diabetes is a major health issue worldwide with an estimated 1.5 million deaths directly caused by it [21]. Diabetes is a metabolic disorder that results in hyperglycemia and a plethora of resulting symptoms. Depending on the etiology, it can be classified as diabetes mellitus type I, diabetes mellitus type II, gestational diabetes and other specific types of diabetes [22], [23], [24]. This paper focuses on diabetes mellitus type II, henceforth referred to as diabetes, as it is the most common form of all [23]. Multiple factors increase the risk of a person developing diabetes, including obesity, smoking, sedentary lifestyle, and family history of diabetes. Complications secondary to diabetes make it one of the leading and growing causes of hospital admission and disability [25], [26]. It typically affects 20% of hospital inpatients [27]. In addition, people with diabetes have higher hospital admission rates and longer hospital stays than people without diabetes [28]. The economic cost of diabetes mellitus is therefore enormous. Hence, reducing the rate of admissions secondary to diabetes can significantly reduce its global mortality, morbidity and economic burden.
The rest of the paper is organized as follows. Section II describes the machine learning techniques used in the literature to solve the hospital readmission problem. Section III introduces our proposed ANN model in detail and describes the benchmark dataset for hospital readmission of diabetic patients along with the experimental setup used in this paper. Sections IV and V present the findings and discuss the experimental results. Section VI concludes the paper with some suggestions for future work.

II. BACKGROUND AND RELATED WORK
A. THE STATE-OF-THE-ART MACHINE LEARNING TECHNIQUES AND READMISSION PREDICTION
Some classic methods to determine risk prediction for hospital readmission include rule-based methods, scores and traditional statistical methods such as logistic regression [13]. Recently, machine learning has emerged as a powerful tool for hospital readmission research. These approaches allow researchers to analyze large datasets and identify complex patterns and relationships between variables. By applying machine learning algorithms to electronic health records, researchers can identify risk factors for readmission or simply predict readmission likelihood for individual patients, and develop personalized intervention strategies to reduce readmission risk [13], [29].
Several studies have demonstrated the effectiveness of machine learning in predicting hospital readmissions. Hospital readmission has been studied for various conditions, such as dental, heart, diabetes, pediatric or even COVID-19 [14], [15], [29], [30], [31]. ANN, SVM and RF models are mostly considered the state-of-the-art techniques [13], [14], [15]. For instance, the study in [6] developed three machine learning models (Random Forest, Naive Bayes and Tree Ensemble) for diabetes readmission prediction, of which Random Forest was the best-performing model. Similarly, the studies in [14] and [32] show that ANN and SVM perform best, respectively.

B. ARTIFICIAL NEURAL NETWORK
An Artificial Neural Network (ANN) is a type of machine learning (ML) model that has become popular and helpful in classification, clustering, pattern recognition and prediction in many disciplines [33]. The ANN model is analogous to animal brain cells that process and recognize vast amounts of data to solve problems. An ANN has nodes, analogous to neurons in the brain and also known as perceptrons, that are arranged in layers or vectors. An ANN has three types of layers: the input layer, one or more hidden layers and the output layer, as shown in FIGURE 1, and training happens through the backpropagation technique [34]. The goal of backpropagation is to modify the weights (vectors) so as to train the neural network to map arbitrary inputs to outputs correctly, that is, to learn the weights for all linkages in a multi-layered network. The minimum of the error function in weight space is found using the method of gradient descent. The resulting weights that yield the minimum of the error function are the solution of the learning problem [35]. However, improper optimization techniques may cause the network to remain in a local minimum during training without any improvement toward an optimal solution [36].

C. ENHANCED ANN THROUGH A VARIETY OF GRADIENT DESCENT TECHNIQUES
There have been many attempts to enhance the ANN through a variety of gradient descent techniques such as standard Stochastic Gradient Descent [37], Adam [38], Adagrad [39], Adadelta/RMSprop [40], Momentum [41], etc. These methods have been used in parallel computations for large datasets as well [18], [42], [43]. They search the parameter space for a better convergence rate while minimizing an objective function $J(\theta)$, parameterized by a model's parameters $\theta \in \mathbb{R}^d$, by updating the parameters in the direction opposite to the gradient of the objective function $\nabla_\theta J(\theta)$ with respect to the parameters. The learning rate $\eta$ decides the size of the step to take to reach a (local) minimum [44]. Quite often, mini-batch gradient descent is favored in training large Deep Neural Network (DNN) models, which commonly deal with large datasets, as mini-batch gradient descent allows for fast training time and effective utilization of available computing resources [45]. Mini-batch gradient descent performs an update to the model's parameters $\theta$ for every mini-batch of $n$ training samples:
$\theta = \theta - \eta \cdot \nabla_\theta J(\theta; x^{(i:i+n)}; y^{(i:i+n)})$
While being quite efficient, mini-batch gradient descent can also lead to more stable convergence [44].
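As a minimal sketch of the mini-batch update described above, the following pure-Python example fits a two-parameter linear model by repeatedly stepping against the batch-averaged gradient. The toy data, the learning rate, the batch size and the function name `minibatch_sgd` are illustrative assumptions and are not taken from this paper.

```python
import random


def minibatch_sgd(xs, ys, eta=0.05, batch_size=4, epochs=200, seed=0):
    """Fit y ~ w * x + b with mini-batch gradient descent on mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the batch-averaged squared error w.r.t. w and b.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # theta <- theta - eta * gradient
            w -= eta * grad_w
            b -= eta * grad_b
    return w, b


# Illustrative noiseless data generated from w = 3, b = 1.
xs = [i / 10 for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]
w, b = minibatch_sgd(xs, ys)
```

Each inner-loop pass corresponds to one parameter update on one mini-batch, so the number of updates per epoch is the dataset size divided by the batch size.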
Earlier, we proposed Guided Stochastic Gradient Descent (GSGD) in [37] for logistic regression, a shallow machine learning model, as an improvement over other state-of-the-art variations of gradient descent. GSGD recognizes that inconsistent data is a major factor impacting the gradient descent and convergence of SGD, and it tries to overcome this by temporarily hiding such data in the hope that they will become consistent in later iterations, while GSGD continues to carry out weight updates using the consistent data. Inconsistent data instances are the data instances within the neighborhood $\rho$ that individually perform better while the average error value $\bar{E}_t$ performs worse than the average error value of the previous iteration $\bar{E}_{t-1}$, and vice versa. GSGD has been shown to achieve better convergence and classification accuracy than the canonical SGD and its other variations [37].

A. GUIDED ANN FOR BETTER CONVERGENCE
The original GSGD algorithm in [37] and [18] was designed for shallow machine learning models such as logistic regression. In [46], a variation of GSGD was proposed for Convolutional Neural Networks (CNN), a deep learning model. Based on the tested benchmark datasets, GSGD for CNN achieves better convergence and improves classification accuracy by up to 3% in general when compared to its canonical counterpart.
Our proposed method tries to overcome the deficiency faced by ANN when working with inconsistent datasets by incorporating the GSGD optimizer in an approach similar to that taken in [46]. The original GSGD [37] cannot be used for large datasets and complex models such as ANN because of its expensive verification step, which requires the entire training set to determine the data inconsistency in every iteration. The proposed algorithm uses only a subset of the training data as verification data to achieve better convergence. Additionally, we have successfully backpropagated the gradient generated through the guided approach, so the guided approach now works for a multi-layer perceptron as opposed to the single-layer perceptron of its original form in [37]. Its flowchart is given in FIGURE 2, where the algorithm starts with a random selection of a data instance whose gradient is computed to update the weight vector. After ρ iterations, the weight vector is further refined with consistent neighboring data instances. It is important to note that the value of ρ must be chosen wisely: a very large value would result in the algorithm executing in its original form, with typical gradient computation and weight update, and the GSGD algorithm would have little to no effect in improving its efficiency. For this paper, the value of ρ was selected with Bayesian Hyperparameter Optimization. See Algorithm 1 for the detailed pseudocode of the algorithm.
The original GSGD uses verification data to compute the error of the training data instance at every iteration, which is not applicable to deep learning. The execution of the proposed algorithm begins with the gradient computation and weight update with the learning rate η in the usual manner for the first ρ iterations. The average batch error ($\bar{E}_t$) is computed at every iteration and compared with the average batch error of the previous data batches ($\bar{E}_{t-1}, \bar{E}_{t-2}, \ldots, \bar{E}_{t-\rho}$). At the end of ρ iterations, all data batches performing consistently on weights $W_{t-\rho}$ to $W_t$ are kept in the consistent datastore ψ. Finally, the entire weight vectors are updated with the consistent data batches in ψ. In other words, the algorithm processes all data batches regularly for ρ iterations before reprocessing only the identified consistent data batches.
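The procedure above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' Algorithm 1: a logistic model stands in for the ANN, and a batch is treated as consistent when the update it produced lowered the error on a small verification subset relative to the previous iteration. The helper names (`mean_loss`, `sgd_step`, `guided_epoch`), the toy data and all parameter values are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(w, data):
    """Average binary cross-entropy of a logistic model over (features, label) pairs."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)


def sgd_step(w, batch, eta):
    """One mini-batch gradient step for logistic regression."""
    grad = [0.0] * len(w)
    for x, y in batch:
        err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj / len(batch)
    return [wi - eta * gj for wi, gj in zip(w, grad)]


def guided_epoch(w, batches, verification, eta, rho):
    """Process batches normally; every rho iterations, replay the consistent ones."""
    psi = []  # consistent datastore
    prev_err = mean_loss(w, verification)
    for t, batch in enumerate(batches, start=1):
        w = sgd_step(w, batch, eta)
        err = mean_loss(w, verification)
        # Assumed rule: the batch is consistent if its update lowered the
        # verification error compared with the previous iteration.
        if err < prev_err:
            psi.append(batch)
        prev_err = err
        if t % rho == 0:
            for kept in psi:  # refine the weights with consistent batches only
                w = sgd_step(w, kept, eta)
            psi = []
            prev_err = mean_loss(w, verification)
    return w


# Illustrative, linearly separable toy data with a bias feature.
rng = random.Random(0)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
data = [([1.0, a, b], 1 if a + b > 0 else 0) for a, b in points]
train, verification = data[:160], data[160:]
batches = [train[i:i + 16] for i in range(0, len(train), 16)]
w0 = [0.0, 0.0, 0.0]
w1 = guided_epoch(w0, batches, verification, eta=0.5, rho=5)
```

With a very large `rho` the replay branch is never reached and the loop reduces to ordinary mini-batch training, which matches the remark that ρ must be chosen carefully.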

B. EXPERIMENT DATA
The dataset used in this experiment was originally presented in [47] and was also added to the UCI Machine Learning Repository under the name "Diabetes 130-US hospitals for years 1999-2008 Data Set". The dataset has over 50 features representing patient and hospital outcomes with 3 classes: patients who were readmitted within 30 days of discharge, patients who were readmitted after 30 days of discharge and patients who had no record of readmittance within the 10-year study period. The dataset contained over 100,000 diabetic patient records.
VOLUME 11, 2023
Two datasets were prepared from the original. The first dataset kept the original 3 classes. For the second dataset, we were interested in identifying whether or not a patient will be readmitted within 30 days. Therefore, the "Readmitted after 30 days" class and the "no readmission record" class were combined into one, giving a total of two classes for the second dataset.
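The class merge described above amounts to a simple label mapping. The sketch below assumes the raw target takes the string values '<30', '>30' and 'NO', and uses 'Readmitted' and 'NOT' as the 2-class names; the function name and the sample values are illustrative only.

```python
from collections import Counter

# Assumed raw values of the 3-class readmission target.
THREE_CLASS_LABELS = ("<30", ">30", "NO")


def to_two_class(label):
    """Merge '>30' and 'NO' into a single 'NOT' class; keep '<30' as 'Readmitted'."""
    if label not in THREE_CLASS_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return "Readmitted" if label == "<30" else "NOT"


sample = ["NO", ">30", "<30", "NO", ">30"]  # made-up example labels
two_class = [to_two_class(v) for v in sample]
counts = Counter(two_class)
```

Counting the mapped labels, as done with `Counter` here, is a quick way to see how the merge changes the class balance.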
The datasets were also pre-processed and normalized before conducting the experiment. Records of patients who had died during the data collection period were removed. The pre-processed data contained attributes such as race, gender, age, admission type, time in hospital, number of lab tests performed, HbA1c test result, diagnosis, number of medications, etc.
After pre-processing, the final dataset had 86,555 records, which were split in a ratio of 80:20 into the training set and the test set. The training set was further split 80:20 to obtain the validation set. Since the class distribution was imbalanced (89.14% not readmitted and 10.86% readmitted), a commonly used oversampling technique, the Synthetic Minority Oversampling Technique (SMOTE) [48], was applied to synthesize the minority classes in the training set. SMOTE is also considered a "de facto" standard for pre-processing imbalanced data. SMOTE addresses this problem by randomly generating synthetic samples for the minority class through interpolation, which helps to balance the class distribution and improve the overall performance of machine learning models. SMOTE has been used in many application domains such as finance, healthcare and engineering [49], [50], [51]. The detailed description and pseudocode for SMOTE are available in [48], [51], and [52].
Algorithm 1. Pseudocode for Guided SGD for ANN. Input: training data set examples d, total iterations (T) and neighborhood threshold (ρ). The algorithm can be tweaked to return the best W so far.
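The interpolation step that SMOTE relies on can be illustrated with a short pure-Python sketch. It is a simplified version written for this explanation, not the reference implementation from [48]: each synthetic point is placed at a random position on the segment between a minority sample and one of its k nearest minority neighbours. The function name `smote`, the value of k and the toy points are assumptions.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by neighbour interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base (excluding base), by squared distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Illustrative minority samples on the corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=6, k=2)
```

Because every synthetic point is a convex combination of two existing minority samples, it always lies inside the region already spanned by the minority class.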

C. EXPERIMENTAL SETUP
The ANN architecture for this research is described in TABLE 2. The architecture comprises two hidden layers before the final output layer. After every hidden layer there is a batch normalization layer that deals with the issue of internal covariate shift, which is defined as the change in the distribution of network activations due to the change in network parameters during training [53]. This issue complicates the training of deep neural networks by requiring lower learning rates and careful parameter initialization. Mini-batch normalization makes training artificial neural networks faster and more stable by normalizing the layers' inputs through re-scaling [53]. After the normalization layer comes the activation layer, which uses the sigmoid, a non-linear function, to produce the activations for the succeeding layer.
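To make the ordering of operations in one hidden block concrete, the sketch below applies a linear layer, then batch normalization, then the sigmoid to a small mini-batch in pure Python. It omits the learnable scale and shift parameters of batch normalization for brevity, and the weights, biases and inputs are made-up values rather than the configuration in TABLE 2.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Normalize each feature column of a mini-batch to zero mean and unit variance."""
    n = len(batch)
    cols = list(zip(*batch))
    means = [sum(c) / n for c in cols]
    variances = [sum((v - m) ** 2 for v in c) / n for c, m in zip(cols, means)]
    return [[(v - m) / math.sqrt(var + eps) for v, m, var in zip(row, means, variances)]
            for row in batch]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hidden_block(batch, weights, biases):
    """Linear layer -> batch normalization -> sigmoid activation."""
    linear = [[sum(w * x for w, x in zip(w_row, row)) + b
               for w_row, b in zip(weights, biases)] for row in batch]
    return [[sigmoid(v) for v in row] for row in batch_norm(linear)]


# Illustrative mini-batch of three samples with two features each.
batch = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights = [[0.5, -0.2], [0.1, 0.3]]  # one row of weights per hidden unit
biases = [0.0, 0.1]
out = hidden_block(batch, weights, biases)
```

Normalizing before the sigmoid keeps the pre-activations centered near zero, where the sigmoid has its largest slope.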
The experiment was structured as follows. The ANN-GSGD model was developed and trained in Python using the PyTorch library, as was all the training and testing code. The code is available at https://github.com/ECOLS-research-group/GSGD-ANN. Five popular optimizers (RMSProp, Adagrad, SGD, Adam and Adadelta) were used in the experiment as the canonical optimizers. The guided variants of these optimizers were also developed by incorporating the proposed guided ANN algorithm. The guided variants are referred to as G-RMSProp, G-Adagrad, G-SGD, G-Adam and G-Adadelta in this paper. In the experimentation phase, a total of 30 successive runs were carried out to compare the Guided ANN model with the Canonical ANN model. These 30 successive runs were carried out for both datasets and for each of the optimizers mentioned above.
Bayesian parameter tuning was also carried out to identify the best hyperparameters for the ANN model with the different optimizers. TABLE 3 lists all the hyperparameters that were used in this experiment. All experiments for this research were conducted on a Windows-based machine with an 11th Gen i7 processor and 16 GB of RAM.
To compare the performance of our Guided ANN model against other popular machine learning models, we also used our experimental data to train an SVM classifier and a Random Forest classifier. The performance of these models was evaluated using the same validation dataset used for ANN, and the results are also presented in this paper.

A. STATISTICAL ANALYSIS OF CLASSIFICATION ACCURACY
In this section, we provide experimental results of the canonical optimizers and guided optimizers on the Diabetes Readmission datasets of 2 classes and 3 classes. We also analyze and evaluate the performance of each optimizer against its corresponding guided variant. TABLE 4 reports the best and average classification accuracy values obtained from running the test datasets on guided ANN. It also shows the Area Under the Receiver Operating Characteristics Curve (AUROC) and the standard deviation of the accuracy values obtained from all 30 runs carried out in the experiment. We used the average AUROC for the 3-class problem. The results show that, on average, the guided variants perform better than the canonical versions. For the 2-class dataset, G-Adadelta outperformed the canonical Adadelta by 1.5% and also became one of the best-performing optimizers for the 2-class dataset. Other guided optimizers also managed to marginally outperform their canonical versions. In terms of the best classification accuracy, both guided and canonical optimizers were able to obtain over 89% accuracy, and the guided variant performed best in every case except Adam, where the guided and canonical optimizers obtained the same classification accuracy.
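For readers unfamiliar with the metric, the following sketch computes a binary AUROC as the fraction of positive-negative pairs that the classifier ranks correctly, and averages it over classes in a one-vs-rest fashion. The one-vs-rest macro average is an assumption about how the "average AUROC" could be formed; the paper does not state the exact averaging scheme. All scores and labels below are made up for illustration.

```python
def auroc(scores, labels):
    """Binary AUROC: probability that a positive outranks a negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_ovr_auroc(prob_rows, labels, n_classes):
    """Average of one-vs-rest AUROCs, one per class (assumed averaging scheme)."""
    per_class = []
    for c in range(n_classes):
        scores = [row[c] for row in prob_rows]
        binary = [1 if y == c else 0 for y in labels]
        per_class.append(auroc(scores, binary))
    return sum(per_class) / n_classes


# Illustrative 2-class example: three of the four positive-negative pairs are ordered correctly.
binary_auc = auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])

# Illustrative 3-class example with per-class probabilities for four samples.
probs = [[0.7, 0.2, 0.1], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6], [0.5, 0.3, 0.2]]
multi_auc = macro_ovr_auroc(probs, [0, 1, 2, 0], n_classes=3)
```

The pairwise formulation is quadratic in the number of samples, so it suits small illustrations rather than the full dataset.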
For the 3-class dataset, the classification accuracy results in TABLE 4 show that the guided optimizers outperformed their canonical variants, but only marginally. G-RMSProp (highlighted by †) and RMSProp were the best-performing optimizers overall on the 3-class dataset in terms of mean results. G-RMSProp managed to outperform its canonical variant by 0.6%.
The standard deviation (σ) values in TABLE 4 indicate that the results obtained by the guided optimizers are far less dispersed from the mean than the results obtained by the canonical optimizers. FIGURES 3 and 4 show the classification accuracies obtained by the ANN model throughout the epochs using the different optimizers. The performance of our prominent optimizer for the 2-class dataset, shown in FIGURE 3(a), indicates that the ANN model had started fitting the data well by the 10th epoch, and both G-Adadelta and Adadelta steadily improved the classification accuracy, with G-Adadelta keeping the upper hand until the last epoch.
Furthermore, our prominent optimizer for the 3-class dataset, G-RMSProp, was also able to have its ANN model fit the data quite well. It also improved the classification accuracy more consistently throughout the epochs than its canonical version, as shown in FIGURE 4(d). FIGURES 7 and 8 show the tradeoff between precision and recall at different thresholds. It is evident from the graphs that both classifiers perform best on the "Not Readmitted" class or the "No" class.
B. CONVERGENCE ANALYSIS
FIGURES 5 and 6 highlight the error convergence of the ANN model using different optimizers. Even though the error convergence rate of both guided and standard ANN is O(1/T) [18], the guided approach shows better convergence in a limited time budget. FIGURE 5(a) highlights the error convergence of our prominent 2-class optimizer throughout the epochs. Both G-Adadelta and Adadelta started attaining stable convergence right after epoch eight but G-Adadelta managed to have a faster convergence than Adadelta.
Error convergence on the 3-class dataset showed patterns similar to convergence on the 2-class dataset. G-RMSProp and RMSProp were our prominent optimizers for this dataset, but G-RMSProp had a much faster convergence than RMSProp. TABLE 5 and TABLE 6 show the classification performance of our prominent optimizers on the 2-class and 3-class diabetes readmission datasets as confusion matrices for the validation dataset. The prediction class 'NOT' indicates that the patient will not be readmitted within 30 days and class 'Readmitted' indicates that the patient will be readmitted within 30 days. Both classifiers performed well on the 'NOT' class but not so well on the 'Readmitted' class, which is also the minority class. The classification performance on the 3-class dataset shows similar results. The 3-class classifiers also performed well on the 'NO' class, but both classifiers did not perform well on the '<30' class while performing adequately on the '>30' class. G-RMSProp performed better than RMSProp on the '>30' class.
TABLE 3. Bayesian parameter values for the ANN model where η is the learning rate, λ is the lambda parameter for regularization and ρ is the neighbourhood threshold.

C. CLASSIFICATION REPORT ANALYSIS
It is also important to look into the F1 score, which is the harmonic mean of precision and recall, as the dataset is highly imbalanced. The F1 scores are reported in TABLE 7 on a scale of 0 to 1 for all the classes and averages, with the best results highlighted in bold. If we consider all classes to be equally important for the experiment, then the macro average becomes the ideal average metric. For the 2-class results, the canonical variant managed to outperform the guided variant by 1%, but for the 3-class results, the guided variant outperformed the canonical variant by approximately 1%.

Alternatively, the weighted average is preferred when the support values of the classes need to be considered in calculating the average F1 score. For both our 2-class and 3-class results, the guided variant obtained better weighted average F1 scores than the canonical variants, performing approximately 1% better on both datasets.
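The difference between the macro and weighted averages discussed above can be seen in a short worked example. The sketch below derives per-class F1 scores from a confusion matrix and then forms both averages; the confusion matrix is a made-up, imbalanced 2-class example and does not reproduce the values in TABLES 5 to 7.

```python
def f1_report(confusion):
    """Per-class F1, macro average and support-weighted average.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    f1s, supports = [], []
    for c in range(n):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(n)) - tp
        fn = sum(confusion[c]) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1s.append(f1)
        supports.append(sum(confusion[c]))
    macro = sum(f1s) / n  # every class counts equally
    weighted = sum(f * s for f, s in zip(f1s, supports)) / sum(supports)
    return f1s, macro, weighted


# Illustrative confusion matrix: 100 majority-class and 10 minority-class samples.
f1s, macro, weighted = f1_report([[90, 10], [5, 5]])
```

In this example the minority class has an F1 of 0.4 while the majority class scores about 0.92, so the macro average (about 0.66) is pulled down far more than the weighted average (about 0.88).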
The datasets were also evaluated on SVM and Random Forest classifiers; the results are highlighted in TABLE 8.
The hyperparameter setup for these models was as follows. A linear kernel was used for SVM, gamma was set to "scale" and the regularization parameter C was set to 1. For the Random Forest model, the number of estimators (trees) was set to 100 and the Gini impurity criterion was used to select splits. Random Forest and SVM both showed results similar to ANN. Random Forest is one of the best-performing classifiers for the Diabetes Readmission dataset, as also discussed in [6]. Guided-ANN managed to outperform the Random Forest classifier on all four performance metrics (accuracy, AUROC, macro average F1 score and weighted average F1 score) for the 2-class dataset, while only falling short on the macro average F1 score for the 3-class dataset. Guided-ANN outperformed the SVM classifier on all the performance metrics, and on the 3-class dataset it outperformed SVM on the weighted average F1 score by more than 2%.

V. DISCUSSION
An improvement in the quality of classification has been observed with the introduction of guided stochastic gradient descent for training Artificial Neural Networks. The guided optimizers outperformed their canonical counterparts, by up to 1.5% in classification accuracy on the 2-class dataset and by 0.6% on the 3-class dataset for the best-performing optimizers. Adadelta and its guided variant, and RMSProp and its guided variant, proved to be the most prominent optimizers in the experiment. The confusion matrices also show that the per-class performance of the guided optimizer and its canonical version is quite similar. The low dispersion of classification accuracy, shown by the standard deviation obtained by the guided optimizers, indicates that they can provide consistent classification accuracy.
Since the dataset is highly imbalanced, we also considered the F1 scores from the classification report of the experiment. The guided optimizer generally performed better than its canonical counterpart on both the weighted average F1 scores and the macro average F1 scores, with the exception of the macro average F1 score of Adadelta on the 2-class dataset, which was lower by approximately 1%. The overall experiment shows that the ANN model trained with a guided optimizer performs better than the ANN model trained with canonical optimizers. As expected, the guided approach shows improvement due to its handling of training data: it uses the 'consistent' samples to move the gradient closer to the true gradient in every iteration, as also shown in [37]. The precision-recall curves for the 2-class and 3-class datasets in FIGURE 7 and FIGURE 8, together with the AUROC values in TABLE 4, show that the classes are hard to distinguish from the data distribution. Even in such conditions, the guided approach provides better classification results. We also compared the performance of Guided-ANN against the Random Forest and SVM classifiers, and it managed to outperform them as well.

VI. CONCLUSION
We presented a Guided Artificial Neural Network model that uses the guided stochastic gradient descent algorithm to train the classifier. Our results show that the Guided Artificial Neural Network performs better than its canonical counterpart, obtaining up to 1.5% better classification accuracy on the 2-class and 3-class Diabetes readmission datasets while also achieving faster convergence. Guided ANN also achieves relatively good F1 scores in comparison to the canonical variant on the same datasets. Guided ANN has shown that it can improve the classification accuracy and overall performance of the model in predicting early readmissions of patients due to diabetes. Being able to predict hospital readmissions has various benefits, such as reducing the strain on health care resources and improving patient care [20].
There are several possible future directions from this work, such as the application of GSGD to classification tasks in other medical domains or the application of GSGD to Recurrent Neural Networks (RNN). Since the Diabetes Readmission dataset is highly imbalanced, it will also be interesting to investigate whether other oversampling techniques such as SMOTified-GAN [52] can counter the effects of data imbalance and further improve the Guided ANN classifier model, as opposed to the standard SMOTE technique [48] that was used in this research.

CODE AND DATA
The authors provide Python code and data for extending this work further: https://github.com/ECOLS-research-group/GSGD-ANN