Comprehensive Performance Assessment of Deep Learning Models in Early Prediction and Risk Identification of Chronic Kidney Disease

The incidence of chronic kidney disease (CKD) is rising rapidly around the globe. Asymptomatic CKD is common, and guideline-directed monitoring to predict CKD from various factors is underutilized. Computer-aided diagnosis (CAD) can play a major role in predicting CKD; CAD systems such as deep learning algorithms are pivotal in disease diagnosis due to their high classification accuracy. In this paper, various clinical features of CKD were utilized and seven state-of-the-art deep learning algorithms (ANN, LSTM, GRU, Bidirectional LSTM, Bidirectional GRU, MLP, and Simple RNN) were implemented for the prediction and classification of CKD. The proposed algorithms were applied by extracting and evaluating features using five different approaches from pre-processed and fitted CKD datasets. In this study, we measured accuracy, precision, and recall, and calculated the loss and validation loss in prediction. Further, the study analyzed computation time, prediction ratio, and AUC to evaluate model performance, along with statistical significance tests to compare the models. While classifying CKD, ANN, Simple RNN, and MLP provided high accuracies of 99%, 96%, and 97% respectively, together with a good prediction ratio and reduced computation time. These models outperform traditional data classification techniques by providing superior predictive ability. Subsequently, the study proposes the integration of the best-performing DL models in the IoMT. This proposal will assist predictive analytics to advance CKD prediction by using deep learning more efficiently and effectively. The study is a first fundamental step toward a comprehensive performance assessment of deep learning models for classifying and predicting CKD and its associated risk factors.


I. INTRODUCTION
CKD is one of the most crucial health concerns due to its increasing prevalence globally [1]; it comprises conditions that damage the kidneys slowly and reduce their ability to perform essential functions of the body over a long period. CKD is associated with complications such as renal failure, high blood pressure, anemia, and nerve damage [2]. An estimated 2.2 million people around the world are plagued by renal failure. CKD has affected a large portion of the population in developing countries such as Pakistan, India, Nepal, Bangladesh, Bhutan, Sri Lanka, and Afghanistan [3]. In addition, 750,000 Americans suffer from CKD every year [4]. Multiple risk factors (not limited to), such as a history of renal failure, high blood pressure, or diabetes, require yearly monitoring for any abnormal test results [5]. A few blood tests are commonly used to detect CKD: (i) determining the estimated glomerular filtration rate (eGFR), (ii) verifying the concentration of albumin in the blood and urine, and (iii) measuring the blood urea nitrogen (BUN) index and creatinine (CR) [6]. These circumstances lead to two major concerns: (i) the reliability of the screening tests and (ii) their rising cost. First, there is no conclusive evidence that relying on screening tests alone can help a suspected patient prevent the progression of CKD, because the disease is highly dependent on epidemiology and other clinical features. Second, since CKD has no signs or symptoms in its early stages, diagnostic testing is one of the few ways to determine whether a patient has renal disease. Once diagnosed, patients progress through various stages of CKD leading to end-stage renal disease (ESRD), which requires a kidney transplant or dialysis to save the patient's life (Figure 1) [7]-[9]. Kidney transplants and hemodialysis are very costly, and many patients in developing countries cannot afford these treatments [10].
Health care has been digitalized, resulting in the creation of vast new data sets: electronic medical record (EMR) systems, health insurance claims data, X-rays, lab reports, etc. Despite these vast available data, conventional medical facilities show a limited capacity to predict CKD effectively and accurately. Hence, performing predictive health analytics to harness these data is imperative.
Predictive tools such as machine learning (ML) and deep learning (DL) models/algorithms can be used to overcome the limitations of traditional healthcare management [11]. Application of DL-based diagnosis may reduce unnecessary and invasive procedures and improve the efficacy and sustainability of existing health care practices. Utilizing DL's knowledge discovery capabilities, such as data mining and classification techniques, it is now possible to handle massive and valuable data to improve medical diagnosis and prognosis in decision making [9]. When health care providers combine this information with other data sources, they can create new solutions with the support of predictive analytics for early CKD diagnosis and associated health risks, and even prescriptive analytics for precision medicine. Early detection of CKD, achievable through DL model prediction, can prevent progression to ESRD and subsequently reduce cost. Researchers have used DL models and attained very good performance in classifying chronic diseases such as liver disease [12], heart disease [13], and kidney disease. Moreover, the application of DL algorithms helps to develop a fast-acting, non-invasive, and easily accessible platform comprising various data related to kidney disease. This usage of DL will eventually create a valuable supporting tool for early, accurate, and fast diagnosis of CKD. Given that deep learning offers novel designs and better performance in many domains, we firmly believe that deep learning has much to contribute to the field of CKD. The study gives a detailed explanation of the technical specifics of deep learning (DL) architectures along with a comprehensive performance assessment of the DL methods used to predict CKD. Five sets of feature selection/ranking tools have been utilized and compared to incentivize the application of DL methods. Further, statistical analysis renders the outcome more reliable and effective.
In addition, the study shed some insights into the application to the development of a health monitoring framework that can be used as an IoMT portal based on DL algorithms.

Figure 1. CKD progression in different stages
The goal of the research is to showcase how CKD can be diagnosed efficiently through prediction and classification using DL algorithms. To achieve this goal, seven DL algorithms (ANN, LSTM, GRU, Bidirectional LSTM, Bidirectional GRU, MLP, and Simple RNN) are proposed in the study. These algorithms were then extensively compared based on their accuracy and errors in classifying CKD. Subsequently, prediction ratios and computation times were calculated to evaluate model performance. Further, statistical significance tests were carried out to validate the outcome.
The developed system and models were applied to the CKD dataset, which is publicly available in the UCI machine learning repository [14]. The key contributions of the study are as follows: i) the study involves a comprehensive performance assessment of seven DL models, including Simple RNN, Bidirectional GRU, Bidirectional LSTM, and GRU, in predicting and diagnosing CKD; ii) current data modalities were evaluated through feature selection to establish the utility of feature selection in DL models, an approach that has not been explored extensively before for predicting CKD; iii) the study identified the risk factors associated with CKD whose control can prevent disease progression to the end stage; iv) two statistical significance tests were carried out to establish the reliability of the performance assessment.
The remainder of the paper is structured as follows: Section II examines the related studies performed by researchers using ML and DL algorithms along with gaps in the extant literature. Section III discusses the proposed algorithms along with detailed descriptions. Section IV explains the experimental analysis and results. Further, a detailed discussion along with limitations is presented in Section V. Finally, Section VI summarizes the conclusion of the research.

II. RELATED STUDIES
While examining the extant studies, it is evident that predicting CKD has become a prime interest among researchers. The studies emphasized the utilization of ML and DL algorithms. However, the prominence of using deep learning models casts interest among researchers in recent years.
To monitor and diagnose chronic diseases, machine learning-related techniques have been used [15]. For example, the authors implemented seven machine learning algorithms, including ANN and linear support vector machine (LSVM), to predict CKD. They used the CKD data set from the UCI repository. Three feature selection methods, (i) filter, (ii) wrapper, and (iii) embedded, were used to extract important features, and they obtained the highest accuracy of 98.46% using LSVM [16]. Chen et al. [15] applied three models to the UCI dataset: KNN, SVM, and SIMCA (soft independent modeling of class analogy) to calculate the patient's risk. The SVM and KNN models achieved the highest accuracy of 99.7% [17]. In addition, classification algorithms including Naive Bayes, MLP (Multilayer Perceptron), SVM, J48, and Decision Tree were used to assess the accuracy and effectiveness of CKD classification. The results showed that MLP provided 99.75% accuracy [18].
Notable DL algorithms/classifiers, along with hybrid versions, appear in the extant literature. The primary focus is the fitness of DL methods and their performance in diagnosing CKD. For instance, researchers used a sensor data set, extracted the features, and classified CKD by applying a Convolutional Neural Network-Support Vector Machine (CNN-SVM). The concentration of urea in a saliva sample was measured to detect CKD, and the study showed 96.59% prediction accuracy for the proposed algorithm [19]. Another study utilized a deep convolutional neural network (DNN) to distinguish serum potassium levels in 449,380 patients observed at the Mayo Clinic in Rochester, Minnesota, and the results were subsequently confirmed using retrospective data from Mayo Clinic sites in Minnesota, Florida, and Arizona. The study used ECG to detect hyperkalemia in CKD patients, where the deep learning model detected hyperkalemia with high sensitivity (90%) and an area under the curve (AUC) between 0.853 and 0.90 [20]. In another study, a Heterogeneous Modified Artificial Neural Network (HMANN) was applied to describe the different architectures, colors, and locations of kidney stones, using kidney ultrasound images to detect and segment them; the authors achieved high accuracy (97.50%) and a substantial reduction of the required time [6]. A Deep Neural Network (DNN) classifier was used to predict CKD and its severity level; the model classified CKD with 98.25% accuracy, which was later increased to 99.25% by the PSO (Particle Swarm Optimization) feature selection method [21].
Few studies have utilized the IoMT platform. Notably, a study emphasized an adaptive hybridized deep convolutional neural network (AHDCNN) for the early detection of kidney disease. The CNN-based model was implemented to improve classification accuracy by reducing the feature dimension and showed 97% accuracy; the study used a health monitoring framework as part of the IoMT portal [11]. Another study developed an Ensembling Multi-stage Deep Learning Approach (EMSDLA) to assess tumors in the kidney; for kidney and kidney tumors, the average Dice scores were 0.96 and 0.74 on 90 unseen test cases [22]. The study claimed that the findings can advance tumor segmentation on the IoMT platform. Further, the Adaptive Neuro-fuzzy Inference System (ANFIS) was utilized to help determine chronic renal failure; based on the fuzzy method, ANFIS networks estimated GFR with a high degree of accuracy [23]. Researchers also utilized 10 ResNet models to predict eGFR and 10 XGBoost models to classify CKD; the models provided 85.6% accuracy.
Overall, we can assert that most researchers applied ANN and CNN-based models, including modified ANN and CNN, to predict CKD, but the application of a wider range of DL algorithms, i.e., Recurrent Neural Network, SimpleRNN, LSTM, and GRU, is missing. Likewise, performance evaluation across various DL models is lacking in the extant literature. Reliance on one model or its hybrid edition does not warrant model performance, nor its accuracy in predicting CKD. The risk factors of CKD also need to be detected to prevent CKD progression, yet very few ML and DL studies performed a risk factor analysis of CKD. Hence, the study envisioned filling these gaps to apprehend whether the proposed advanced DL algorithms work efficiently to diagnose or classify CKD, and how their performance can be evaluated. Thus, the study emphasized seven deep learning algorithms, including ANN and MLP, to substantiate comparative performance among the models in classifying CKD, accompanied by a risk factor analysis.

III. METHODOLOGY

Figure 2 explains the research process/model along with the IoMT framework. The details are as follows:

A. DATA RETRIEVAL, DESCRIPTION, and CONCERN
The real-time data was collected from the UCI repository [14].
In this dataset, the number of instances is 400. From the test report analysis, 250 patients are affected by CKD and 150 patients are not. As a result, 62.5% of instances are with CKD and 37.5% without CKD. The dataset contains 25 attributes, of which 11 are numeric and the remaining 14 are nominal (Table 1).
This is a small dataset with a mild imbalance issue. Some concerns therefore exist with the dataset: possible overfitting or generalization problems, imbalance, and noise in the data. Researchers have attempted various strategies/techniques to handle these issues. P. Yang et al. concluded that the ensemble technique is better than a single classifier because it is better at reducing the chance of overfitting [24]. Three feature selection methods (filter, wrapper, and embedded) were adopted for feature selection when utilizing machine learning algorithms [16].

B. DATA PRE-PROCESSING
Missing data in medical data sets pose a threat when utilizing deep learning, as every attribute in medical data is generally believed to have a significant impact on health assessment. Hence, data pre-processing is the strategy utilized to convert the raw data into a clean dataset. It is the basic step in training every DL model/classifier algorithm.
For categorical string data columns, preprocessing was done by converting them into categorical numeric data columns, defined as 0 (negative assertion) or 1 (positive assertion). For example, in the data set, the column "pc" was described as "normal" or "abnormal", which were replaced by 1 and 0 respectively.
We performed multiple imputation (MI) to fill the missing values. The imputation process was based on linear regression for predicting continuous variables and logistic regression for categorical variables. In multiple imputation, missing values in the dataset are imputed n times, where n is usually a small number (from 3 to 10). We applied MI for 10 iterations to generate 10 different datasets. To narrow the data down to a subset with a plausible range of values, we chose the dataset whose variable means and standard deviations were nearest to those of the original dataset. Subsequently, the missing values for the entire data were filled.
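The imputation step above can be sketched as follows; this is a minimal illustration using scikit-learn's IterativeImputer on synthetic numeric data (the CKD table itself and the paper's exact per-variable linear/logistic regressions are assumptions here). Several imputed datasets are drawn, and the one whose means and standard deviations best match the observed data is kept:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[rng.random(X.shape) < 0.1] = np.nan  # inject ~10% missing values

# Draw n imputed datasets (the paper notes n is usually 3-10; we use 5 here).
imputed = []
for seed in range(5):
    imp = IterativeImputer(sample_posterior=True, random_state=seed)
    imputed.append(imp.fit_transform(X))

# Keep the dataset whose column means/stds are closest to the observed data.
obs_mean = np.nanmean(X, axis=0)
obs_std = np.nanstd(X, axis=0)

def distance(Xi):
    return (np.abs(Xi.mean(axis=0) - obs_mean).sum()
            + np.abs(Xi.std(axis=0) - obs_std).sum())

best = min(imputed, key=distance)
assert not np.isnan(best).any()  # all missing values are now filled
```

The regression-based chained imputation inside IterativeImputer plays the role of the linear/logistic models described in the text.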

C. FEATURE SELECTION
It is essential to remove unnecessary features from the dataset before training DL classifiers. Four feature selection methods were used: (1) Correlation-based Feature Selection (CFS), (2) Recursive Feature Elimination (RFE), (3) Lasso Regression, and (4) the Boruta feature selection method. Together with the no-selection baseline, the five feature selection approaches/methods are explained as follows:

Without Feature Selection (WFS)
We considered all features after filling all missing values in the dataset and defined them as WFS datasets.

Wrapper Method - CFS
CFS conducts attribute rankings based on a correlation heuristic assessment function [28]. The function employs an approach that generates two class labels, one associated with the class and one not.

RFE
RFE is a wrapper-type feature selection method. RFE works by exploring various subsets of features in the training dataset and successively eliminating features until the required number remains. RFE utilizes the core of the model, ranks features by significance, removes the least important features, and re-fits the model.
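A minimal sketch of RFE with scikit-learn, on synthetic data standing in for the CKD attributes (the base estimator and the target of 10 surviving features are illustrative assumptions, not the paper's settings):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the 24-attribute CKD table (the real data is on UCI).
X, y = make_classification(n_samples=400, n_features=24, n_informative=8,
                           random_state=42)

# Recursively drop the least important feature until 10 remain.
selector = RFE(LogisticRegression(max_iter=1000),
               n_features_to_select=10, step=1)
selector.fit(X, y)

selected = np.flatnonzero(selector.support_)  # indices of surviving features
print("kept features:", selected)
```

Each elimination round re-fits the wrapped model and drops the feature with the smallest coefficient magnitude, matching the re-fit-and-remove loop described above.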

Embedded Method - Lasso
Lasso Regression is a linear regression technique that shrinks the coefficients of input variables that have no substantial impact on the prediction task. Lasso allows some feature coefficients to go exactly to zero, essentially eliminating those input variables from the model and providing automatic feature selection.
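The coefficient-shrinking behavior can be illustrated with scikit-learn's Lasso on synthetic data (the alpha value and the data are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=24, n_informative=6,
                           random_state=0)
X = StandardScaler().fit_transform(X)  # Lasso is scale-sensitive

# The L1 penalty drives uninformative coefficients exactly to zero.
model = Lasso(alpha=0.05).fit(X, y)
kept = np.flatnonzero(model.coef_ != 0)  # features that survive selection
print(len(kept), "of", X.shape[1], "features kept")
```

The nonzero-coefficient features are the automatically selected subset.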

Boruta
Boruta is a feature selection algorithm. Precisely, it operates as a wrapper algorithm over Random Forest. Boruta uses an all-relevant feature selection approach, collecting all features that are under certain conditions important to the outcome variable.

D.1. ANN
Artificial Neural Network (ANN) is a computational algorithm modeled on the human brain, in which neuron nodes are interconnected like a web. The ANN algorithm can be used for both machine learning and pattern recognition, and it can learn from past or example data for classification and prediction. Figure 3 depicts the generic architecture of ANN: each neuron computes f(w^T x + b), where f is the activation function, w^T the weights, and b the bias term.
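The single-neuron computation f(w^T x + b) can be sketched in a few lines of NumPy (the feature values and weights below are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One artificial neuron: output = f(w^T x + b), as in Figure 3.
def neuron(x, w, b, f=sigmoid):
    return f(w @ x + b)

x = np.array([0.5, -1.2, 3.0])   # example feature vector (hypothetical)
w = np.array([0.4, 0.1, -0.2])   # learned weights (hypothetical)
b = 0.1
y = neuron(x, w, b)
assert 0.0 < y < 1.0  # sigmoid output is a probability-like score
```

A full ANN stacks layers of such neurons, with the output of one layer feeding the next.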

D.2. LSTM
Long short-term memory (LSTM) is a deep learning algorithm that resembles a Recurrent Neural Network (RNN), where connections between nodes form a directed graph along a chronological sequence. LSTM can retain information for a long time and, depending on time series data, can classify, predict, and process data. LSTM retains information with the support of a cell and memory manipulation through gates. Figure 4 describes the generic architecture of LSTM.

D.3. GRU
Gated Recurrent Unit (GRU) is an RNN algorithm that uses hidden states to transfer information. The GRU uses two weight matrices, W and U, applied to the current input and the previous hidden state respectively; the resulting gates decide what information to keep or discard for the output and can be trained to retain information for a long time without discarding it through time. Figure 5 shows the generic architecture/process of GRU [27]. The gate equations are as follows:

Update Gate (z_t): z_t = σ(W^(z) x_t + U^(z) h_(t-1)). Here, x_t is multiplied by its weight W^(z) when it is plugged into the network unit, and h_(t-1) is likewise multiplied by its weight U^(z).

Reset Gate (r_t):
The formula for the reset gate is r_t = σ(W^(r) x_t + U^(r) h_(t-1)). The formula is the same as for the update gate; the main difference lies in the weights and in how the gate is used.
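The two gate equations, plus a standard candidate-state step that the text does not spell out, can be sketched in NumPy for one toy GRU cell (the dimensions and random weights are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
d = 4  # hidden size (toy value)
Wz, Uz = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Wr, Ur = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Wh, Uh = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def gru_step(x_t, h_prev):
    z = sigmoid(Wz @ x_t + Uz @ h_prev)              # update gate z_t
    r = sigmoid(Wr @ x_t + Ur @ h_prev)              # reset gate r_t
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev))  # candidate state
    return (1 - z) * h_prev + z * h_tilde            # new hidden state h_t

h = np.zeros(d)
for _ in range(3):  # run a short toy sequence
    h = gru_step(rng.normal(size=d), h)
assert np.all(np.abs(h) < 1.0)  # tanh-bounded state
```

The update gate blends the old state with the candidate, which is how a GRU can carry information across many timesteps.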

D.4. BIDIRECTIONAL LSTM
When used in sequence classification, bidirectional LSTMs offer an improvement over regular LSTMs. Instead of one LSTM, bidirectional LSTMs train two LSTMs on the input sequence, running the inputs in two directions: past to future and future to past. Figure 6 depicts the generic architecture of Bidirectional LSTM.

D.5. BIDIRECTIONAL GRU
A bidirectional GRU (BiGRU), where GRU stands for gated recurrent unit, is a sequence processing model consisting of two GRUs. It is a bidirectional recurrent neural network (processing inputs in the forward and backward directions) with only input and forget gates. Figure 7 describes the generic architecture of Bidirectional GRU, where z_t is the update gate, r_t the reset gate, h̃_t the new (candidate) memory, and h_t the final memory.

D.6. MLP
The multi-layer perceptron (MLP) algorithm can be used to facilitate supervised learning of binary classifiers through a linear classifier or an algorithm built on it. A perceptron has five primary components: inputs, weights, bias, step function, and weighted summation. The features are fed to the first layer as input; the weights and inputs are then multiplied and summed, and a bias value is added to shift the output function. Figure 8 describes the generic architecture of MLP. The perceptron equation is y = f(Σ_{i=0}^{n} w_i x_i), where x_0 = 1 and w_0 = -θ.
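The perceptron equation with x_0 = 1 and w_0 = -θ can be sketched directly (the AND-gate weights below are a toy example, not from the paper):

```python
import numpy as np

def perceptron(x, w, theta):
    # Prepend x0 = 1 and w0 = -theta so the threshold folds into the sum.
    x = np.concatenate(([1.0], x))
    w = np.concatenate(([-theta], w))
    return 1 if w @ x >= 0 else 0  # step function

# Toy check: weights implementing logical AND with threshold 1.5.
assert perceptron(np.array([1, 1]), np.array([1.0, 1.0]), 1.5) == 1
assert perceptron(np.array([1, 0]), np.array([1.0, 1.0]), 1.5) == 0
```

An MLP stacks layers of such units and replaces the hard step with a differentiable activation so the weights can be trained by backpropagation.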

D.7. SIMPLE RNN
A Simple RNN is a collection of neurons that can work with variable-length sequences. The RNN is considered a state model because of its feedback loop: the state evolves over time through the recurrence relation, and the feedback enters the state with a one-timestep delay. This delayed feedback loop works as a memory, storing information between timesteps. Figure 9 describes the generic architecture of SimpleRNN. The recurrence relation over timesteps is S_k = f(S_(k-1) W_rec + X_k W_x), where S_k represents the state at time k, X_k the input at time k, and W_rec and W_x the recurrent (feedback) and feedforward weight parameters. The current state S_k is computed from the current input X_k and the previous state S_(k-1), and the next state S_(k+1) can in turn be predicted from the current state S_k and the current input X_k.
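The recurrence S_k = f(S_(k-1) W_rec + X_k W_x) can be sketched in NumPy (tanh as f, the toy sizes, and the random weights are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3
W_rec = 0.5 * rng.normal(size=(d, d))  # recurrent (feedback) weights
W_x = rng.normal(size=(d, d))          # input (feedforward) weights

def rnn_step(s_prev, x_k):
    # S_k = tanh(S_(k-1) W_rec + X_k W_x): the one-timestep feedback loop.
    return np.tanh(s_prev @ W_rec + x_k @ W_x)

s = np.zeros(d)
states = []
for k in range(4):  # unroll a short toy sequence
    s = rnn_step(s, rng.normal(size=d))
    states.append(s)
```

Each state depends on the previous one, which is exactly the delayed-feedback memory described above.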

D.8. ADABOOST
AdaBoost is an ensemble learning process (also referred to as "meta-learning") that was first developed to raise the productivity of binary classifiers. AdaBoost uses an iterative tactic: it learns from the errors of weak classifiers and combines them into a strong one. Figure 10 describes the generic architecture/process of AdaBoost [32]. The overall AdaBoost model is summarized as F(x) = Σ_m θ_m f_m(x), where f_m is the m-th weak classifier and θ_m its corresponding weight.
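As a hedged sketch of how a boosted ensemble can surface risk factors (Section IV uses AdaBoost for this), scikit-learn's AdaBoostClassifier exposes feature importances; the synthetic data here stands in for the clinical CKD attributes:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Stand-in data: in the paper the inputs are the clinical CKD attributes.
X, y = make_classification(n_samples=400, n_features=24, n_informative=5,
                           random_state=3)

# F(x) = sum_m theta_m * f_m(x): 50 weak stumps combined by weight.
clf = AdaBoostClassifier(n_estimators=50, random_state=3).fit(X, y)

# Feature importances rank candidate risk factors.
ranked = np.argsort(clf.feature_importances_)[::-1]
print("top-5 features:", ranked[:5])
```

On the real dataset, the top-ranked attributes would be the candidate risk factors.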

D.9. RANDOM FOREST
RF is an ensemble classification and regression method for large datasets that generates decision trees from randomly selected subsets of the training data and returns the output class voted by the individual trees [30]. Even though RF can easily handle thousands of input attributes, there is no need to reduce variables during analysis; RF also provides an estimate of the variables important to the classification.

D.10. MODEL EXECUTION PROCESS
After collecting the data from the UCI repository, the adjusted values of the model parameters were defined. The dataset was then randomly divided into a training set (80%) and a validation/test set (20%). We selected the parameters giving the maximum average performance to build the model. Seven deep learning algorithms were implemented and customized to fit the model to the dataset and find the best fit for the analysis. We applied several input and output layers, different activation functions, and different parameters for model compiling and fitting. For ANN, we used hidden layers where the activation functions Rectified Linear Unit (ReLU) and sigmoid were used in the input/hidden and output layers respectively. For GRU and Bidirectional GRU, 4 DL layers of 50 units each were used with 20% dropout to handle overfitting. Similarly, 2 DL layers of 50 units each were implemented with 20% dropout to handle overfitting in LSTM and Bidirectional LSTM.
For Simple RNN and MLP, we used 2 DL layers with 32 units in the first layer and 20% dropout. Sigmoid and ReLU (activation functions) and Adam and SGD (optimizers) were used for model compiling in ANN and MLP respectively. For the remaining models (LSTM, Bidirectional LSTM, GRU, and Bidirectional GRU), tanh was used as the activation function for compiling and fitting, SGD as the optimizer, MSE (mean squared error) as the loss, and 200 epochs. ReLU as the activation function, RMSProp as the optimizer, and MSE as the error were implemented for SimpleRNN (Table 3). We calculated the accuracy, precision, recall, F1 score, loss, and validation loss of the seven models and visualized model performance through the AUC-ROC curve. Further, the study evaluates the prediction ratio and computation time of the models. The parameter score of the current parameter combination was used to compute the average performance. We trained on the training set for 200 epochs for each model and tested on the testing dataset; the optimum model was selected on the testing/verification set to obtain the prediction result. After that, we compared the models with each other. All processing, visualization, and computation were done on Google Colaboratory using Python. The significance of the comparison among DL models in terms of accuracy was evaluated through the Wilcoxon signed-rank test using R and the Deep Dominance test in Python.
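The execution pipeline (80/20 split, training, accuracy on the held-out set) can be sketched with scikit-learn as a stand-in for the Keras-style models described above; the two-layer ReLU/SGD configuration mirrors the text, but the library, synthetic data, and remaining hyperparameters are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Stand-in for the pre-processed CKD table (400 rows, 24 attributes).
X, y = make_classification(n_samples=400, n_features=24, random_state=7)

# 80% training / 20% validation split, as in the paper.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=7)

# Two hidden ReLU layers, SGD optimizer, up to 200 epochs (max_iter).
clf = MLPClassifier(hidden_layer_sizes=(32, 32), activation="relu",
                    solver="sgd", max_iter=200, random_state=7)
clf.fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
print(f"validation accuracy: {acc:.2f}")
```

The same split-fit-score loop, repeated per model and per feature-selection method, yields the comparison tables in Section IV.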

IV. EXPERIMENTS AND RESULTS
In this paper, we have presented DL model-based prediction of CKD. For classification of the disease, we evaluated the accuracy, precision, recall, F1 score, ROC curve area, loss, and validation loss of the models. The performance results of the seven algorithms are shown for the four feature selection methods, followed by the performance without feature selection. We determined the prediction ratio and computation time of each model for a comprehensive understanding of the models. Finally, statistical significance analyses were performed to evaluate the reliability of the performance. In addition, AdaBoost and perceptron algorithms were applied to find the significant risk factors of CKD.

A1. CONFUSION MATRIX
For classification prediction, it is important to explain the concept of a confusion matrix. A confusion matrix is a 2×2 matrix containing 4 entries: true positive (TP), true negative (TN), false positive (FP), and false negative (FN) (Table 4) [21]. The most widely used prediction performance parameter is accuracy. It measures the proportion of correctly classified instances, Accuracy = (TP + TN) / (TP + TN + FP + FN), and is denoted as a percentage (%); for good classification results, the accuracy should be close to 100%. The F-score is a measure of the testing process's accuracy: precision and recall are used to calculate its harmonic average, F1 = 2 × Precision × Recall / (Precision + Recall).
FPR expresses the likelihood of a particular test incorrectly rejecting the null hypothesis. It is the proportion of negative instances predicted as positive in the dataset, FPR = FP / (FP + TN) (Eq. 17). FNR is the miss rate of the model, describing the proportion of positive instances classified as negative, FNR = FN / (FN + TP) (Eq. 19); this rate is expected to be as close to zero as possible. For good performance, both FPR and FNR should be close to zero. Figure 11 shows the confusion matrix for each model when taking all attributes (WFS) of the data set; the matrix shows the TP, FP, FN, and TN of each model. We calculated the False Positive Rate (FPR) and False Negative Rate (FNR) of each model, and Figure 11 also shows the confusion matrix with FPR and FNR for the 7 DL models on the WFS dataset. ANN showed the lowest FNR while GRU showed the highest FNR of 0.231; MLP and SimpleRNN also provided low FNRs of 0.058 and 0.038 respectively. The confusion matrices of the seven algorithms using the other four feature selection methods are presented in the supplementary material (Figures S1-S4).
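The metrics above follow directly from the four confusion-matrix cells; a small helper (with hypothetical counts) makes the relationships concrete:

```python
def confusion_metrics(tp, tn, fp, fn):
    """Derive the Section IV scores from the four confusion-matrix cells."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # a.k.a. TPR
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)                 # false positive rate
    fnr = fn / (fn + tp)                 # false negative rate / miss rate
    return accuracy, precision, recall, f1, fpr, fnr

# Hypothetical counts for an 80-sample test fold (not the paper's numbers).
acc, prec, rec, f1, fpr, fnr = confusion_metrics(tp=48, tn=28, fp=2, fn=2)
print(round(acc, 3), round(fnr, 3))  # 0.95 0.04
```

Running the helper per model and per feature-selection method reproduces the kind of per-model FNR comparison reported for Figure 11.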

A2. ACCURACY, PRECISION, RECALL, F1 SCORE
Using all features and the features selected through the different methods, the models were fitted on the training dataset, tested on the testing dataset, and used to predict CKD. It was observed that SimpleRNN, MLP, and ANN provided the highest performance in terms of accuracy, precision, recall, AUC, loss, and validation loss across all applied FS methods and WFS. All of these measurements should be as close to 100% as possible, and we considered the classifier with the greatest score to be the best classification algorithm. The average values over 200 iterations were used for the performance assessment (accuracy, precision, recall, and F1 score). Table 7 provides the experimental results of the proposed seven DL models with the different measures: Accuracy, Precision, Recall, F1 Score, AUC, Loss, and Validation loss (errors) [35]. All the models showed 80% to over 90% accuracy, with three models over 95%. Out of all algorithms, ANN showed the highest accuracy of 97%, followed by 96% for both SimpleRNN and MLP; GRU, Bidirectional GRU, and Bidirectional LSTM achieved the same accuracy (85%).

Figure 14 Graphical representation of performance comparison (RFE)
Lasso

Table 9 provides the experimental results of the proposed seven DL classifiers with the Boruta method and reports the different measures: Accuracy, Precision, Recall, F1 Score, AUC, Loss, and Validation loss (errors) [33]. All the models showed 80% to over 90% accuracy, with three models at 96-99%. Out of all algorithms, ANN showed the highest accuracy of 99%; SimpleRNN and MLP showed accuracies of 97% and 96% respectively, followed by LSTM and Bidirectional LSTM at 86% and 88%.

B. RECEIVER OPERATING CHARACTERISTIC (ROC) / AREA UNDER CURVE (AUC)
At different threshold settings, the AUC-ROC curve is plotted to assess the performance of classification algorithms. The Receiver Operating Characteristic (ROC) denotes the probability curve, and the Area Under the Curve (AUC) represents the degree or measure of separability. A higher AUC (close to 1) means better performance in distinguishing whether a patient has the disease or not [34]. Figure 16 shows the AUC curves, where the x- and y-axes represent the FPR (False Positive Rate) and TPR (True Positive Rate) respectively. ANN showed the highest AUC across all feature selection methods and WFS, with values from 0.97 to 0.99 (close to 1), indicating very good performance. SimpleRNN and MLP also showed high AUC scores compared to the other models. For the Lasso method, LSTM, GRU, Bidirectional LSTM, and Bidirectional GRU provided the same AUC of 0.84, the lowest in the analysis.
So, the three DL algorithms ANN, MLP, and SimpleRNN performed better than the other deep learning algorithms across all methods, including WFS with all features.

C. LOSS AND VALIDATION LOSS
Loss is defined as the error that occurs during each iteration (epoch) on the training dataset when predicting the class of CKD. The loss or error calculated on the testing dataset is described as validation loss or val-loss. After completing 200 epochs (the number of iterations), the loss and validation loss were measured for each model. The loss and validation loss of the seven DL models are shown in Figure 17. The highest validation loss was achieved by LSTM in WFS, Lasso, and RFE; GRU in CFS and Bidirectional LSTM in Boruta also showed high loss and validation loss. However, the difference between loss and validation loss was very low for the DL models across all selection methods, suggesting a good fit of the data to the models. Thus, we can conclude that the models were not overfitted to the training dataset (Table 5).

D. PREDICTION RATIO
The algorithm uses a training set of features and the associated outcomes to predict a given result. To improve prediction, the proper selection parameters for the tests of the different classifiers need to be determined. The prediction ratio reports the proportion of correctly classified instances for a given data segment. The study divided the whole dataset into eight segments of 50 instances each. Figure 19 shows the prediction ratio of the seven DL algorithms. For WFS, CFS, and Boruta, ANN, MLP, and SimpleRNN showed a high prediction ratio on the small segments; however, the prediction ratio of MLP dropped to zero on the larger segments (300-400 instances). For the other models, the prediction ratio was high on small segments, then decreased, and finally increased again on the large segments. For Lasso, the prediction ratios of MLP and Bidirectional GRU moved in opposite directions: one dropped as the segment size grew while the other increased to 100% on the largest segments. This analysis shows that SimpleRNN and ANN performed best among the models at predicting CKD. For the remaining models, the prediction ratio varied with the data segment: at 150 and 200 instances it dropped to roughly 60-80% for Bidirectional LSTM, GRU, and LSTM, and increased to around 100% at 400 instances.

Figure 19 Comparison of prediction ratio of DL models
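Since the reported values run up to 100%, the sketch below computes the prediction ratio as the percentage of correctly classified instances in a 50-instance segment (an assumption about the exact formula; the labels are hypothetical):

```python
import numpy as np

def prediction_ratio(y_true, y_pred):
    # Share of correctly classified instances in the segment, as a percentage.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return 100.0 * np.mean(y_true == y_pred)

# Hypothetical 50-instance segment with 46 correct predictions.
y_true = np.array([1] * 30 + [0] * 20)
y_pred = np.array([1] * 30 + [0] * 16 + [1] * 4)
print(prediction_ratio(y_true, y_pred))  # 46/50 correct -> 92.0
```

Evaluating this per 50-instance segment, per model, yields curves like those in Figure 19.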

E. COMPUTATION TIME
Performance was also verified by calculating the computation time of all models. The time needed to predict CKD was measured for each model under the four feature selection methods described earlier as well as without feature selection (WFS). The proposed ANN, MLP, and Simple RNN models showed low computation times of 10 s, 11-13 s, and 21 s respectively, suggesting high performance. Figure 15 demonstrates the computation time of the seven implemented DL models for each method. Bidirectional GRU took the longest: 91 s with RFE and 90.63 s with WFS. With Boruta, LSTM and Bidirectional LSTM completed the prediction or classification in more than 80 s, and GRU took ~65 s (Figure 19).
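Wall-clock computation time of the kind reported above can be measured with a simple timer wrapper. A minimal sketch, with a stand-in numerical workload in place of an actual model's fit/predict cycle:

```python
import time
import numpy as np

def timed_fit(fit_fn):
    # Run a model's fit/predict callable and return its result together with
    # the elapsed wall-clock time in seconds.
    start = time.perf_counter()
    result = fit_fn()
    return result, time.perf_counter() - start

# Stand-in workload (an SVD) in place of training a DL model on the CKD data.
_, elapsed = timed_fit(lambda: np.linalg.svd(np.random.default_rng(2).normal(size=(200, 50))))
print(f"computation time: {elapsed:.3f} s")
```

Using `time.perf_counter()` rather than `time.time()` avoids distortions from system clock adjustments, which matters when comparing models whose runtimes differ by only a few seconds.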

F. STATISTICAL TEST OF SIGNIFICANCE
We performed a statistical test of significance to validate that the findings are real and reliable, and did not occur by chance. To that end, the study used the Wilcoxon signed-rank test and calculated p-values among all models for all feature selection methods, along with WFS, based on accuracy [35]. Table 10 depicts the p-values for pair-wise comparisons of the models on the WFS dataset. The p-values between ANN and four other models (LSTM, Bidirectional LSTM, GRU, Bidirectional GRU) are lower than 0.05 (~0.003-0.004), whereas the p-values between ANN and the two remaining models (SimpleRNN and MLP) are greater than 0.05 (~0.07-0.15). Similar results were observed for SimpleRNN and MLP. Hence, the observed accuracy of ANN, MLP, and Simple RNN is significantly higher than that of the other models, validating the outcome of the previous findings. The Wilcoxon test results of the seven algorithms using the other four feature selection methods are presented in the supplementary material (Tables S1-S4).
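A pair-wise comparison of this kind can be reproduced with `scipy.stats.wilcoxon`. The accuracy vectors below are illustrative placeholders, not the paper's actual per-run measurements; the test itself is the standard Wilcoxon signed-rank test on paired scores.

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical paired accuracy scores from repeated runs of two models.
acc_ann    = np.array([0.99, 0.98, 0.99, 0.97, 0.99, 0.98, 0.99, 0.98])
acc_bilstm = np.array([0.91, 0.90, 0.92, 0.89, 0.93, 0.90, 0.91, 0.92])

# Signed-rank test on the paired differences; a small p-value means the
# accuracy gap is unlikely to have arisen by chance.
stat, p = wilcoxon(acc_ann, acc_bilstm)
print(f"statistic={stat}, p-value={p:.4f}")
if p < 0.05:
    print("accuracy difference is statistically significant at alpha = 0.05")
```

Because the scores are paired per run, the signed-rank test is appropriate here where an unpaired test (e.g., Mann-Whitney U) would discard the pairing information.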

G. DEEP DOMINANCE TEST
Along with the Wilcoxon signed-rank test, the study performed the deep dominance test to compare the performance of the seven algorithms. The test determines "Almost Stochastic Dominance", a measure of stochastic dominance between two algorithms [36]. Each algorithm was compared against every other, and the deep dominance score (ε) was measured on a scale from 0 to 1, where 0 corresponds to perfect stochastic dominance of one algorithm (X) over another (Y) and 1 corresponds to perfect stochastic dominance of Y over X. Table 11 shows the results of the deep dominance test. The deep dominance test results of the seven algorithms using the other four feature selection methods are presented in the supplementary material (Tables S5-S8).
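The intuition behind the ε score can be illustrated with empirical CDFs. The sketch below is a deliberate simplification, assumed for illustration only: it counts the fraction of the support where one score distribution's CDF violates dominance over the other, which is *not* the full Almost Stochastic Order procedure of [36], but conveys why ε near 0 means X dominates Y and ε near 1 means the reverse.

```python
import numpy as np

def dominance_epsilon(x, y):
    # Simplified dominance score: fraction of the pooled support where the
    # empirical CDF of x lies ABOVE that of y, i.e. where x (higher-is-better
    # accuracies) fails to stochastically dominate y. A rough stand-in for the
    # ASO epsilon, not the full deep-dominance test.
    grid = np.union1d(x, y)
    cdf_x = np.searchsorted(np.sort(x), grid, side="right") / len(x)
    cdf_y = np.searchsorted(np.sort(y), grid, side="right") / len(y)
    return float(np.mean(cdf_x > cdf_y))

rng = np.random.default_rng(3)
high = rng.normal(0.98, 0.01, size=50)   # hypothetical ANN accuracy samples
low  = rng.normal(0.90, 0.02, size=50)   # hypothetical GRU accuracy samples
eps = dominance_epsilon(high, low)
print(f"epsilon = {eps:.3f}  (near 0: 'high' dominates 'low')")
```

For the real test, a dedicated implementation of the ASO/deep-dominance procedure should be used; the sketch here only mirrors its 0-to-1 interpretation.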

H. RISK FACTOR ANALYSIS
We identified the risk factors of CKD using Perceptron and AdaBoost classifier models. Figure 16 shows the risk factors of CKD from both models considering all the features (i.e., WFS). After preprocessing and normalization, both models were fitted to the CKD data and the features were ranked by importance in predicting the class of the disease [37]. Perceptron provided a full ranking of the features (the absolute values of the weights were considered), while the AdaBoost classifier ranked 15 important features: hemo, sc, sg, age, bu, rc, bgr, pcv, pe, dm, wc, pot, al, bp, and htn. Random forest identified its top 15 features as hemo, sg, pcv, sc, al, rc, htn, bgr, dm, bu, sod, su, pot, pe, and bp. Comparing the top 15 features of the models, 9 features are common: hemo, sg, pe, al, dm, htn, bp, sc, and pcv (Figure 20). The importance and interrelation of these 9 factors in CKD progression have been substantiated in the medical science and health domains [38].
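The three feature-ranking approaches described above can be sketched with scikit-learn. The feature names follow the paper's abbreviations, but the data below are synthetic (generated so that hemo, sc, and sg drive the label); the rankings are therefore illustrative of the method, not a reproduction of the paper's Figure 16.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import Perceptron

feature_names = ["hemo", "sc", "sg", "age", "bu", "rc", "bgr", "pcv",
                 "pe", "dm", "wc", "pot", "al", "bp", "htn"]
rng = np.random.default_rng(4)
X = rng.normal(size=(400, len(feature_names)))
# Synthetic label driven mainly by hemo, sc, and sg.
y = (X[:, 0] - X[:, 1] + 0.5 * X[:, 2] > 0).astype(int)

# Perceptron: rank by absolute weight, as in the paper.
perc = Perceptron(random_state=0).fit(X, y)
perc_rank = np.argsort(-np.abs(perc.coef_[0]))

# AdaBoost and Random forest: rank by impurity-based feature importances.
ada_rank = np.argsort(-AdaBoostClassifier(random_state=0).fit(X, y).feature_importances_)
rf_rank = np.argsort(-RandomForestClassifier(random_state=0).fit(X, y).feature_importances_)

for name, rank in [("Perceptron", perc_rank), ("AdaBoost", ada_rank), ("RandomForest", rf_rank)]:
    print(name, [feature_names[i] for i in rank[:5]])
```

Intersecting the top-k lists of several rankers, as the paper does, guards against the biases of any single importance measure.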

I. IoMT PLATFORM
Hardware plays a critical role in IoMT. Various hardware can be used to collect real-time data from CKD patients. Data can be collected from a range of medical devices: an ECG (electrocardiogram) provides heart-monitoring data, a fitness tracker provides data such as stress, breathing, and oxygen level, a glucometer provides blood glucose levels, and a blood pressure monitor provides blood pressure readings (Figure 2). All these devices are interconnected via the internet and are responsible for transmitting data to the cloud through connectivity technologies such as networks and gateways. In many instances, data are uploaded periodically through an API (Application Programming Interface) key for a secured connection. These API keys allow the specific devices to access and store data on the IoT platform. Subsequently, deep learning models provide health care professionals with data analytics, reporting, and device-control opportunities through software solutions. The study proposes this approach of utilizing IoMT components to manage CKD data and adopting a replicable application that uses deep learning algorithms to predict CKD.
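The key-authenticated device upload described above might look like the following sketch. The endpoint, header name, and payload schema are hypothetical (no real IoMT platform's API is implied), and the network call itself is shown only as a comment:

```python
import json
import time

# Placeholder credential; a real platform would issue one key per device.
API_KEY = "DEVICE-API-KEY-PLACEHOLDER"

def build_reading(device_id, metrics):
    # Package one sensor reading for transmission to the cloud.
    return {
        "device_id": device_id,
        "timestamp": int(time.time()),
        "metrics": metrics,
    }

reading = build_reading("bp-monitor-01", {"systolic": 128, "diastolic": 84})
headers = {"X-API-Key": API_KEY, "Content-Type": "application/json"}
body = json.dumps(reading)
# An actual upload over HTTPS would then be something like:
#   requests.post("https://iomt.example/api/readings", data=body, headers=headers)
print(body)
```

On the cloud side, such readings would be stored and fed to the trained DL models, closing the loop from device to prediction.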

V. DISCUSSION
The risk of CKD is increasing rapidly, and consequently more people are suffering and dying due to a lack of proper treatment. CKD needs to be identified at an early stage, because late diagnosis leads to severe consequences and treatment becomes highly expensive. DL applications are among the key developments of recent years for supporting medical diagnosis, and these approaches can potentially reduce the cost of CKD treatment (Figure 21). (3) ANN and SimpleRNN showed similar and comparatively high prediction ratios (i.e., 0.98-1.0). Hence, ANN, Simple RNN, and MLP are the best fit for the CKD data; these algorithms can classify the CKD data more accurately and efficiently than the remaining four algorithms.
The medical significance of the selected features plays a major role in understanding model performance. As indicated, five feature selection techniques were adopted in the study. These methods generated different datasets containing different features; the attributes selected by each method are shown in the corresponding table. Using these feature selection methods, the average accuracy among the DL models lies between 85% and 99%, but this does not warrant true performance. A few features, such as bp (blood pressure), sc (serum creatinine), and bgr (random blood glucose), were not selected by Lasso. Similarly, su, bp (blood pressure), and pe (pedal edema) were not selected by RFE. However, the direct and indirect influence of these parameters on CKD progression is well documented. Hence, identifying the interconnections between different factors and selecting the target attributes is imperative. In this sense, the WFS approach (without feature selection) does not eliminate any feature and showed superior performance in terms of accuracy, precision, recall, and prediction ratio across the seven DL models. Therefore, the study strongly recommends that the approach without feature selection be considered for DL models predicting CKD.
Further, the findings were validated by statistical significance analysis for all five feature selection processes. Compared with the other models, significant p-values (<0.05) were obtained for ANN, SimpleRNN, and MLP (Table 10). This Wilcoxon-test result substantiates the finding that ANN, SimpleRNN, and MLP showed superior performance in the case of WFS. In addition, the deep dominance test showed that ANN is better than all the other models. The ε values of the other four models (LSTM, Bidirectional LSTM, GRU, and Bidirectional GRU) are ~1.0 when compared with SimpleRNN and MLP, indicating that these four models are not better than (nbt) SimpleRNN and MLP. Overall, both significance tests yield similar outcomes for the remaining four feature selection methods.
Further, the study identified the risk factors of CKD using the Perceptron, AdaBoost, and Random forest classifier models. These models identified 9 features (Figure 20) as considerable risk factors of CKD. For example, in CKD the kidneys fail to produce enough erythropoietin (EPO), a hormone the body requires to produce red blood cells, which in turn directly relates to hemoglobin: fewer red blood cells indicate a lower level of hemoglobin. Hence, rc (red blood cell count) and hemoglobin (hemo) are crucial parameters for the diagnosis of CKD. Albumin is normally found in the blood, and healthy kidneys filter it so that it is not commonly observed in urine. The presence of albumin in urine indicates that the kidney nephrons are being damaged and losing the ability to retain albumin in the body; an increased amount of albumin in urine therefore indicates CKD [39]. High-risk individuals with diabetes, hypertension, etc. are generally recommended to check for albumin in the urine. The best models and risk factors identified in this study can be implemented in an IoMT platform, enabling remote monitoring of CKD. Therefore, IoMT can be implemented for (1) improved diagnosis and treatment, (2) effective CKD management, and (3) reduced cost.
The study has limitations. The dataset is small, which could make the results less reliable, and it is difficult to find another dataset with more attributes and a higher number of instances. More specifically, collecting data dynamically from an IoMT platform is even more difficult. However, during optimization, overfitting was prevented by customizing the parameters while tracking the error between the training and testing datasets. In this study, we successfully applied several input layers, hidden layers, activation functions, and optimizers in all DL models. These actions resulted in a very low difference between loss and validation loss (Figure 18 and Table 5). Hence, it can be concluded that the models were not overfitted to the training dataset despite having only 400 instances.

VI. CONCLUSION
This paper proposed a methodology utilizing seven deep learning algorithms to detect CKD and identify risk factors, which is crucial for early diagnosis to prevent progression of the disease to its end stage. The study demonstrates a holistic performance assessment of deep learning algorithms on CKD and makes the following contributions to the body of knowledge: (1) the study adopted a scientific data-processing approach to identify and fill the missing values in the CKD dataset, employing linear regression for numerical data and logistic regression for categorical data; (2) five state-of-the-art feature selection processes were adopted and the comparative performance of the seven algorithms was assessed, explaining the utility of including or excluding feature selection in deep learning; (3) statistical significance testing (a combination of the Wilcoxon signed-rank test and the deep dominance test for cross-checking the results) established the reliability of the outcome; and (4) four DL models (Simple RNN, Bidirectional GRU, Bidirectional LSTM, and GRU) were applied to predict CKD for the first time. Thus, this research examines the efficacy of seven DL algorithms for predicting CKD. When comparing the models, ANN, MLP, and Simple RNN showed superior performance, providing accuracies of 99%, 97%, and 96%, respectively, in predicting the disease. In addition, the Perceptron, AdaBoost, and RF models identified 9 attributes as risk factors of CKD. These results can assist the medical community in predicting CKD and its risk factors. Based on this research, we believe that deep learning approaches could be effectively used to translate large amounts of clinical/biomedical data on CKD into improved human health.