Interpretable deep learning for the prediction of ICU admission likelihood and mortality of COVID-19 patients

The global healthcare system is being overburdened by an increasing number of COVID-19 patients. Physicians are having difficulty allocating resources and focusing their attention on high-risk patients, partly due to the difficulty in identifying high-risk patients early. COVID-19 hospitalizations require specialized treatment capabilities and can cause a burden on healthcare resources. Estimating future hospitalization of COVID-19 patients is, therefore, crucial to saving lives. In this paper, an interpretable deep learning model is developed to predict intensive care unit (ICU) admission and mortality of COVID-19 patients. The study comprised of patients from the Stony Brook University Hospital, with patient information such as demographics, comorbidities, symptoms, vital signs, and laboratory tests recorded. The top three predictors of ICU admission were ferritin, diarrhoea, and alamine aminotransferase, and the top predictors for mortality were COPD, ferritin, and myalgia. The proposed model predicted ICU admission with an AUC score of 88.3% and predicted mortality with an AUC score of 96.3%. The proposed model was evaluated against existing model in the literature which achieved an AUC of 72.8% in predicting ICU admission and achieved an AUC of 84.4% in predicting mortality. It can clearly be seen that the model proposed in this paper shows superiority over existing models. The proposed model has the potential to provide tools to frontline doctors to help classify patients in time-bound and resource-limited scenarios.


INTRODUCTION
Coronavirus is a virus family that causes respiratory tract illnesses and diseases that can be lethal in some situations, such as SARS and COVID-19. Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a new type of coronavirus which began spreading in late 2019 in the Chinese province of Hubei, claiming multiple human lives (Li et al., 2020a). The novel coronavirus outbreak was declared a Public Health Emergency of International Concern by the World Health Organization (WHO) in January 2020. The infectious disease caused by the novel coronavirus was given the official title, COVID-19 (Coronavirus Disease 2019) by the WHO in February 2020, and a COVID-19 pandemic was announced in March 2020 by the (World Health Organization; WHO Director General) (Bogoch et al., 2020). Since then, there have been over 170 million cases with many of them being hospitalized. A staggering 3.8 million people died from the disease with the numbers increasing as this paper is being written. Every patient has a different reaction to the virus, with many of them being asymptomatic and a small percentage getting worse rapidly with their organs failing (Leung et al., 2020). The ongoing surge in COVID-19 patients has put a burden on healthcare systems unlike ever before. According to a recent study by Pourhomayoun & Shakibi (2021), once the coronavirus outbreak begins, the healthcare system will be overwhelmed in less than 4 weeks. When a hospitals capacity is exceeded, the death rate rises. The repercussions of an extended stay and increased demand for hospital resources as a result of COVID-19 have been disastrous for health systems around the world, necessitating quick clinical judgments, especially when limited resources are available (Moreira, 2020). COVID-19 infection has been linked to a wide variety of clinical, laboratory, and demographic variables, as described by Rodriguez-Morales et al. (2020) with some of these variables related to an increased risk of critical illness which necessitates admission to an ICU, which may even lead to death. The purpose of this study is to create an interpretable deep-learning algorithm to determine the top predictors of ICU admission and mortality in COVID-19 patients from a vast set of clinical variables collected upon admission. This information will then be used to predict the ICU admission likelihood and mortality of the patients.

RELATED WORK
COVID-19 has been around since late 2019. A lot of studies have gone into different areas to try and find new, useful information about this disease, but just a handful of studies have been done in the hospitalization sector specifically to solve the ICU admission and mortality rate aspect. In this section, we include a detailed account of recent research studies to identify patients more likely to get worse and require ICU care at an early stage. Some well-known prediction approaches can be used for this particular task, and those methods can be divided into two categories: statistical methods, and machine learning methods. Machine learning models have been shown to outperform traditional clinical scoring systems or regression approaches in some situations. Decision tree algorithms or neural networks can detect non-linear relationships between variables which could explain the improved performance. Some of the machine and deep learning approaches used include tree-based algorithms, neural network algorithms, support vector machines, regression algorithms, and the likes.
Manca, Caldiroli & Storti (2020) used a simplified logistic and Gompertz models approach to predict ICU beds and mortality rate for hospital emergency planning in the COVID-19 pandemic for both short term and long term. Their models had two distinct roles to play. They monitored real data and allowed discrimination between models to find the most accurate model, as well as understanding if any unexpected trends emerge. They analyzed their data in two locations, Italy and Lombardy. For the ICU beds dynamic, all the models performed similarly. The Gompertz model performed the best for predicting fatalities in terms of precision and reliability for the whole data, which captures data with 48 features. The stacked ensemble model (AUPRC = 0.807) was the best model generated using autoML. The gradient boost machine and extreme gradient boost models were the two best independent models, with AUPRCs of 0.803 and 0.793, respectively. The deep learning model (AUPRC = 0.73) performed significantly worse than the other models. Li et al. (2020b) used the clinical variables of COVID-19 patients to develop a deep learning model and a risk score system to predict ICU admission and mortality of patients in the hospital. The data consisted of 5,766 patients, with comorbidities, vital signs, symptoms, and laboratory tests, between 7th February, 2020 and 4th May, 2020. AUC score was used to evaluate their models. The deep learning model achieved an AUC of 78% for ICU admission and 84% for mortality with the risk score for ICU admission being 72.8% and 84.8% for mortality. Their model was accurate enough to provide doctors with the tools to stratify patients in limited-resource and time-bound scenarios. Table 1 summarizes the current work being done in this field, the datasets used, the time period carried out in the research and their objectives.
Existing models used in the literature perform very well for their respective purposes, however, they have a downside in that they are difficult to interpret. The model lacks interpretability on which patient attributes it uses when making a decision (ICU admission and mortality). Existing models use various approaches, but the majority of them use neural network models, which are excellent at achieving good results, but their predictions are not traceable. Tracing a prediction back to which features are significant is difficult, and there is no comprehension of how the output is generated. Therefore, this paper proposes the use of an interpretable neural network approach to predict ICU admission likelihood and mortality rate in COVID-19 patients. It employs a deep learning algorithm that can interpret how the model makes decisions and which features the model selects in making the decision. The model has outstanding and comparable results to other

METHOD
This paper proposes a high-performance and interpretable deep tabular learning architecture, TabNet, that exploits the benefits of sequential attention (following in a logical order or sequence) to choose features at each decision step which enables interpretability and more efficient learning as the learning capacity is used for the most salient features from the input parameters. The degree to which a human can comprehend the reason for a decision is known as interpretability. The higher the interpretability of the machine or deep learning model, the easier it is for someone to understand why particular decisions or predictions were made. Although neural network models are known to produce excellent results, they have the drawback of being a black box, which means that their predictions are not traceable. It is difficult to trace a prediction back to which features are important, and there is no understanding of how the output was obtained. The interpretabilty here denotes the ability for the model to interpret its decision and shows the features that are the most important in predicting ICU admission and mortality of COVID-19 patients (Ghiringhelli, 2021). This section starts by describing how the input data has been pre-processed for the proposed learning model. Then the different components of the proposed model, and the steps it takes to arrive at a decision to predict ICU admission and mortality has been discussed. Finally, the evaluation metrics used to evaluate the model is analyzed.

Data preprocessing
It is important to pre-process the data before applying it to a machine-learning algorithm. Many pre-processing techniques were applied with each serving a specific purpose. The various pre-processing steps have been discussed below. Various sampling methods were experimented with which included Adaptive Synthetic (ADASYN), and SMOTE to deal with the imbalance in the class labels. ADASYN (He et al., 2008) is a synthetic data generation algorithm that employs a weighted distribution for distinct minority class examples based on their learning difficulty, with more synthetic data generated for minority class examples that are more difficult to learn compared to minority class examples that are simpler to learn. This is expressed by: where G in is the total number of synthetic data examples for the minority class that must be produced, m l is the minority class, m s is the majority class, the β is used to determine the desired balance level between 0 and 1. SMOTE (Chawla et al., 2002) is a technique for balancing class distribution by replicating minority class examples at random: where x′ is the new generated synthetic data, x is the original data, x k is the kth attribute of the data, and rand represents a random number between 0 and 1. Feature Extraction is a technique for reducing the number of features in a dataset by generating new ones from existing ones. Principal Component Analysis (PCA), Fast Independent Component Analysis (Fast ICA), Factor Analysis, t-Distributed Stochastic Neighbor Embedding t-SNE (t-SNE), and UMAP are the techniques used for the current dataset.
When PCA (Abdi & Williams, 2010) is used, the original data is taken as input and it gives an output of a mix of input features which can better summarize the original data distribution such that its original dimensions are reduced. By looking at pair-wise distances, PCA can maximize variances while minimizing reconstruction error: where x is the input, and y is the output. cov(x, y) is the covariance matrix after which it is transformed to a new subspace which is y = W'x.
FAST ICA (Hyvarinen, 1999) is a linear dimensionality reduction approach that uses the principle of negentropy from maximization of non-gaussian technique as input data and attempts to correctly classify each of them (deleting all the unnecessary noise): where MSE is the mean squared error, and y is the output. Factor analysis (Gorsuch, 2013) is a method for compressing a large number of variables into a smaller number of factors. This method takes the highest common variance from all variables and converts it to a single score.
t-SNE (Wattenberg, Viégas & Johnson, 2016) is a non-linear dimensionality reduction algorithm for high-dimensional data exploration. It converts multi-dimensional data into two or more dimensions that can be visualized by humans: where pi j is the joint probability distribution of the features, and q i j j is the t-distribution of the features. KL is the Kullback-Leiber divergence, P and Q are the distribution in space. UMAP (McInnes, Healy & Melville, 2018) is a general-purpose dimensionality reduction technique that can be used to pre-process the input data for machine learning. The theoretical foundation for UMAP is focused on Riemannian geometry and algebraic topology: where p represents the distance from each ith data point to the nearest jth data point. Before feeding all this information to the TabNet to act on, the dataset needs to be split into a training set and a testing set. The training dataset is used to train the model and the testing dataset is used to evaluate the models performance. The Stratified K-fold (Zeng & Martinez, 2000) cross-validation was implemented which splits the data into "k" portions. In each of "k" iterations, one portion is used as the test set, while the remaining portions are used for training. The fold used here was k = 5 which means that the dataset was divided into 5 folds with each fold being utilized once as a testing set, with the remaining k − 1 folds becoming the training set. This ensures that no value in the training and test sets is over-or under-represented, resulting in a more accurate estimate of performance/error.

Architecture of TabNet model
A TabNet Model (Arik & Pfister, 2019) is proposed to perform prediction of ICU admission and mortality parameters. A schematic diagram of the proposed TabNet deep learning model is presented in Fig. 1.
The TabNet model consists primarily of sequential multi-steps that transfer inputs from one stage to the next. There are three key layers of this model, namely, the Feature Transformer, the Attentive Transformer, and the Mask. The Feature Transformer is made up of four sequential Gated Linear Unit (GLU) decision blocks, which allow the selection of important features for predicting the next decision. The Attentive Transformer provides sparse feature selection that uses sparse-matrix to improve interpretability and learning. The way it does this is by giving importance to the most important features. The mask is then used in conjunction with the Transformer to produce two decision parameters n(d) and n(a) which are then passed on to the next step. n(d) is the output decision that predicts the two classes, namely, ICU admission (yes/no) and Deaths (yes/no). n(a) is the input to the next Attentive Transformer, where the next cycle starts. From the Feature transformer, the output is sent to the Split module. From Fig. 1, the input features are Batch Normalized (BN) and passed to the Feature Transformer, where it goes through four layers of a Fully Connected layer (FC), a Batch Normalization layer (BN), and a Gated Linear Unit (GLU) in that order. The Feature Transformer produces n(d) and n(a). The Feature Transformer comprises two decision steps.
A Fully Connected (FC) layer is a type of layer where every neuron is connected to every other neuron. The output from the FC layer should always be Batch Normalized. Batch normalization is used to transform the input features to a common scale. It can be represented mathematically as: where x represents the input features, μ b denotes the mean of the features and ω 2 denotes the variance.
The fully connected layer is the combination of all the inputs with the weights, which can be represented mathematically as: where x denotes the input features, and W denotes the weights. These operations are done sequentially, starting from Eq. (9), then to Eq. (10) and finally to Eq. (11) The Gated Linear Unit (GLU) is simply the sigmoid of x: The Fully Connected layer performs its operations, then its output is fed into the Batch Normalization layer to perform its operations. Finally, the output from the Batch Normalization is fed into the Gated Linear Unit, all in a sequential manner.
The decision output from the Feature Transformers n(d) is also aggregated and embedded in this form, and a linear mapping is applied to get the final decision. As a result, they are made up of two shared decision steps and two independent decision steps. A residual connection connects the shared steps with the independent steps and they are summed together via the 4 operation, which is a direct summation block.
Since the same input features in the dataset are used in distinct steps, the layers are shared between two decision steps for robust learning. By ensuring that the variation across the network does not change significantly, normalization with a square root of 0.5 helps to stabilize learning which produces the outputs of n(d) and n(a) as mentioned previously.
From the Feature Transformer, and after the Split layer, the Attentive Transformer is applied to determine the various features and their values. The feature importance for that step is combined with the other steps and is made up of four layers: FC, BN, Prior Scales, and Sparse Max in sequential order. The split layer splits the output and obtains p[i − 1], which is then passed through the Fully Connected (FC) layer and the Batch Normalization (BN) layer, whose purpose is to achieve a linear combination of features allowing higher-dimensional and abstract features to be extracted. The output from the BN layer is multiplied using the tensor product ⊗ with the previous decision steps Prior Scale p[i − 1]. The Prior Scale represents the employment of features in previous decision steps. If gamma (γ) is set to 1, all features have the same importance in predicting ICU admission and mortality. Sparse Max is then used to generate M[i]. The process of learning a Mask (Martins & Astudillo, 2016) is then represented by: where the h i represents the summation of the FC and the BN layer, the p[i − 1] represents the division by the split layer in the previous step, and P[i − 1] represents the Prior Scales. By transferring the Euclidean projection onto the probabilistic simplex, Sparse Max encourages sparsity and makes feature selections sparser. Sparse Max implements weight distribution for each feature of each sample and sets the sum of the weights of all features of each sample to 1 (Yoon, Jordon & van der Schaar, 2018), allowing TabNet to employ the most useful features for the model in each decision step. M[i] then updates p[i]: If γ is set closer to 1, the model uses different features at each step; if γ is greater than 1, the model uses the same features in multiple steps. The Sparse matrix is similar to the softmax, except that instead of all features adding up to 1, some will be 0 and others will add up to 1. The Sparse Max is expressed as: Z n i¼1 sparsemaxðxÞ i This makes it possible to choose features on an instance-by-instance basis with various features being considered at various steps. These are then fed into the Mask layer, which aids in the identification of the desired features. The Feature Transformer is applied again, and the resulting output is split to the Attentive Transformer. The split layer divides the output from the feature transformer into two parts which are d[i], and a[i]: where d[i] is used to calculate the final output of the model, and a[i] is used to determine the mask of the next step. ReLU activation is then applied: where f(x) returns 0 if it receives any negative input, but for any positive value x, it returns that value back. The contribution of the ith step to obtain the final result can be expressed as: Reluðd½i; c½iÞ where ϕ b [i] indicates the features that are selected at the ith step.
To map the output dimension, the outputs of all decision steps are summed and passed through a Fully Connected layer. Combining the Masks at various stages necessitates the use of a coefficient that can account for the relative value of each step in the decisionmaking process. The importance of the features can be expressed using the equation: TabNet decision making process TabNet has a feature value output called Masks, which shows whether a feature is selected at a given decision step in the model and can be used to calculate the feature importance. The masks for each input feature are represented by each row, and the column represents a sample from the data set. Brighter colors show a higher value. Consider Fig. 2, where nine features ranging from feat 0 to feat 8 are shown. For the random sample at 3, the first feature is the one being heavily used, hence the brighter colour, and the sample at 6, three features have brighter colours, feature 0, 1 and 8, with 8 being the brightest, signifying the feature 8's output was heavily used for this sample. The mask value for a given sample indicates how significant the corresponding feature is for that sample. Brighter columns indicate the features that contribute a lot to the decision-making process. It can be seen that the majority values for features other than 0, 1, 4, 5, and 8 are close to '0,' indicating that the TabNet model correctly selects the salient features for the output. We can then interpret which features the model selects enhancing the interpretability of the model. With this, the features that contribute to individuals being admitted to the ICU and dying of the COVID-19 disease can be ranked accordingly.

Evaluation metrics
Evaluation metrics are the metrics used to evaluate the model to determine if it is working as well as it should be. The evaluation metrics used are: Confusion matrix describes a classification models output on a collection of test data for which the true values are known. The matrix compares the actual target values to the machine learning models predictions. It includes true positives (TP) which are correctly predicted positive values, indicating that the value of the real class and the value of the predicted class are both yes, True negatives (TN) which correctly estimates negative values, indicating that the real class value is zero and the predicted class value is zero as well. False positives (FP), where the actual class is no but the predicted class is yes. False negatives (FN), where the actual class is yes but the predicted class is no.
Accuracy is the proportion of correctly expected observations to all observations.
Precision is the ratio of correctly predicted positive observations to total predicted positive observations.
Recall is the ratio of correctly expected positive observations to all observations in the actual class. Recall ¼ F1 Score is the weighted average of Precision and Recall. F1Score The various mathematical notations used in this section are shown in Table 2.

EXPERIMENTAL RESULTS
To demonstrate the effectiveness of the method in predicting ICU admission and mortality of patients, the different TabNet model hyperparameters, dimensionality reduction techniques, and oversampling methods have been thoroughly examined and contrasted. The section starts with the description of datasets followed by the statistical analysis, and it concludes with results and discussion.

Description of data sets
There are two data sets used for this analysis, one for the ICU likelihood and the other for the mortality rate. The ICU data set consists of 1,020 individuals with 43 features and the mortality data set consists of 1,106 individuals with 43 features. The features consist of vital signs, laboratory tests, symptoms, and demographics of these individuals. There are two labels associated with the ICU dataset which are, ICU admitted (label 0), and ICU non-admitted (label 1). Similarly, there are two labels associated with the death dataset which are, non-death (label 0), and death (label 1). The datasets are unbalanced with distribution ratios of 75.5:24.5, and 86.1:13.9 for the ICU and mortality datasets, respectively. Table 3 shows the summary of the description of the datasets.

ICU admission
There were more males than females in the study population. The non-Hispanic ethnicity forms the most individuals, and the Caucasian race has the highest number of individuals in the population. Hypertension is the co-morbidity that most individuals in the study population presented, with cancer being the least. The average age of individuals that needed the ICU is higher than those that did not. Regarding the vital signs, heart rate is the sign that showed quite a big difference on average, with the individuals needing ICU having a higher heart rate than their fellow counterparts. Procalcitonin, Ferritin and C-reactive protein are the laboratory findings that showed the biggest average difference, with individuals needing ICU showing higher levels of these. Table 4 shows the demographics, vital signs, comorbidities, and laboratory discoveries of ICU patients and non-ICU patients. For the symptoms, loss of smell and loss of taste were the symptoms that most individuals that got admitted to the ICU acquired whereas fever and cough were the symptoms that the least number of individuals acquired to be admitted to the ICU. Overall,  ICUMice-ICU 1,106-43 1 = death 0 = non-death 86.1:13.9 DEADMice-Mortality 1,020-43 1 = ICU 0 = no-ICU 75.5:24.5  Table 5 shows the relationship between symptoms and ICU admission by looking at the distribution of patients who were admitted to the ICU and whether or not they had a symptom. Certain symptoms had a higher correlation with ICU admission than others and Table 6 gives a summary of this. It is observed that the Shortness of Breath (SOB) feature has the highest correlation with admission to the ICU unit.

Mortality
There were more males than females in the study population. The non-Hispanic ethnicity forms the most individuals and the Caucasian race has the highest number of individuals in the population. Hypertension is the co-morbidity that most individuals in the study population presented, with Asthma being the least. The average age of individuals that died is higher than those that did not. Regarding the vital signs, Respiratory rate is the sign that showed quite a big difference on average, with the individuals that died having a higher respiratory rate compared to the ones who did not die. Procalcitonin, Ferritin, D-dimer, and C-reactive protein are the laboratory findings that showed the biggest average differences, with individuals that died showing higher levels of these. Table 7 shows the demographics, vital signs, comorbidities, and laboratory discoveries of patients that died and the ones who did not die. Loss of smell and loss of taste were the symptoms that most individuals that died acquired, whereas fever and cough were the symptoms that the least number of individuals that died acquired. Overall, at least 50% of individuals acquired a particular symptom before they died. Table 8 shows the relationship between symptoms and mortality by looking at the distribution of patients that died and whether or not they had a symptom.
Certain symptoms had a higher correlation with mortality than others and Table 9 gives a summary of this. It is observed that the headache feature has the highest correlation with death.

Hyperparameters of TabNet
The TabNet model has a considerable number of hyperparameters which can be tuned to improve performance. The TabNet comes with some default parameters which works well, but for certain use cases, different values of certain hyperparameters yield better performances. Table 10 shows the default hyperparameters of the TabNet.
Max epochs are the maximum number of epochs for training. It can be any number equal to or higher than 10. Batch size is the number of examples per batch. The number should preferably be a multiple of two and be greater than 16. The masking function is used for selecting features. Higher values for the width of decision prediction layer gives more capacity to the model with the risk of overfitting. The values typically range from 8 to 64. Patience is the number of consecutive epochs without improvement before performing an early stoppage. If patience is set to 0, then no early stopping will be performed. Momentum for batch normalization typically ranges from 0.01 to 0.4. n shared is the number of shared Gated Linear Units at each step. The usual values range from 1 to 5. n independent is the number of independent Gated Linear Units layers at each step. The usual values range from 1 to 5. Gamma is the coefficient of feature re-usage in the masks. Its values range from 1.0 to 2.0. A value close to 1 will make mask selection less correlated between layers. n steps is the number of steps in the architecture (usually between 3 and 10). lambda sparse is the extra sparsity loss coefficient. The bigger the coefficient, the sparser the model will be in terms of feature selection (Arik & Pfister, 2019).  (450) African American 4.2% (6) 6.9% (61) Asian 6.3% (9) 3.% (33) American Indian 0.7% (2) 0.23% (2) Native Hawaiian 0 0.1% (1) More than one race 0 0.6% (5

Results and analysis
We present a summary of our experimental results and analysis on two categories: ICU Admission and Mortality.

Results of Hyperparameter tuning using TabNet
The hyper parameters of the model have been tuned using various values of each parameter. The final table which has the best hyper parameters in predicting ICU admission for the various metrics are shown here, the rest of the tables for the individual hyper parameters can be seen in the appendix section. In varying the width of decision prediction layer (nd), the value of nd was changed from a range of 2 to 64 to determine the best output. The results were the best when nd was set to 64. In varying number of steps in the architecture (nsteps), the value of nsteps was varied from 3 to 12 to determine the best output. A value of 3 gave the best results. Changing the nsteps to numbers between 8 and 12 showed a slight decrease in performance which indicates that the performance will not be enhanced by increasing the number of steps.
In varying gamma. Changing the gamma shows a very haphazard trend in performance, the best results are given when the gamma is 2.0. Increasing the gamma does not improve the results. Thus, gamma was not increased any further.
In varying number of independent gates (nindependent), the number of independent gates was varied from two to seven. All the nindependent gates obtained similar results converging to the best result with 2. Thus, two independent gates give the best results.
In varying number of shared gates (nshared), the nshared gates was varied from two to seven, with two gates achieving the best results. Increasing the gates did not improve the results.
In varying values of momentum, momentum values of 0.2 and 0.3 displayed the best results, with higher values of momentum producing poorer results.
In varying lambda sparse, the values of lambda sparse was varied from 0.001 to 0.005. The results of the model showed a negative correlation with the value of lambda sparse. The best result was achieved with a value of 0.001.
Two different types of masks were used, the entmax and the sparsemax. It was concluded that the mask type of entmax gives a better result across the board in all the performance metrics.
The number of epochs and stopping condition was also experimented to determine the impact it has on the performance of the TabNet model. The results are generally better when the stopping condition is defined. The best results are achieved with an epoch of 150, and patience greater than 60. The results do not change when the patience is greater than 60.
Regarding the dimensionality reduction methods, five methods were experimented with to check its effect on the performance of the TabNet baseline model. The results using the Fast ICA has the best results with it falling short only on recall to PCA. The most important parameter in the table is the AUC, in which the Fast ICA has a score of 83.6.
Both PCA method and the Fast ICA methods yielded similar scores on the impact of different dimensionality reduction methods on the performance of the best TabNet model. For the AUC parameter, the Fast ICA has the highest score of 86.4. Different oversampling methods on the performance of the TabNet baseline model were experimented with and the results using the ADASYN method has the best outcome in all the measured performance metrics.
The impact of different oversampling methods on the performance of the TabNet Best model was determined. The results using the ADASYN method has the best results in all the measured performance metric.
The tables showing the various experimentations of the individual hyperparameter explained above can be found in the appendix section. Table 11 shows the performance of the TabNet Baseline, and TabNet Best models with FastICA dimensionality reduction, and ADASYN oversampling method. The results of the TabNet Best is the best amongst the other baseline models. Figure 3 shows the trend of the various hyperparameter during the experiments and tuning. Figure 4 further shows the feature importance masks for predicting ICU admission. TabNet features a feature value output called Masks that may be used to quantify feature importance and indicate if a feature is chosen at a particular decision step in the model. Each row represents the masks for each input element and the column represents a sample from the dataset. The brighter the color, the higher the value. In predicting ICU admission, two of the masks are shown as an example in Fig. 4, where the features which the respective masks are paying attention to can be seen in bright colors. The brighter a grid, the more important that particular feature is for the particular Mask. The number of grids lighting up corresponds to the number of features that are being paid attention to by the particular Mask. It can be seen that Mask 0 is paying the most attention to the earlier features, with an emphasis on the 20th feature. Mask 1 is paying the most attention to the later features, with the most attention given to features 32 and 38. The average feature output among all the Masks is used to arrive at the final decision.  Figure 5 shows a graph of features against feature importance. Features are the symptoms that contribute to an individual being admitted to the ICU. Feature importance stands for the importance of the symptom in contributing to the patient being admitted into the ICU, where a larger number indicates a higher contribution to an ICU admission. All features have some importance in determining if a person would be admitted to the ICU. The sum of all the feature importance data points is 1. The top 5 features that contribute greatly to a person needing to be admitted to the ICU were Ferritin, ALT, ckdhx, Diarrhoea and carcinomahx.

Best final model for ICU prediction
The model was analyzed and its output compared with different TabNet configurations based on different feature extractors. The model with the best results has been selected as the final proposed model for ICU prediction. The proposed model is a TabNet model with 150 epochs, 128 batch size, and 60 patience with the number of steps of 2, width of precision of layer of 64, gamma of 1.3, entmax mask type, n independent of 2, momentum of 0.3, lambda sparse of 1e-3, and n shared of 2, using the Fast Independent Component Analysis as the feature extractor, and ADASYN as the sampling technique to balance the imbalanced data. Figures 6 and 7 below show the loss graph, and training and validating accuracy graph of the TabNet in predicting ICU admission.  Figure 6 shows a graph of loss against number of epochs. The losses reduce as the number of epochs increases. At the first epoch, the model has not learned from the data, so the margin of error is big. As the model starts to learn (goes through the epochs), the error reduces and hence the loss reduces. Figure 7 shows a graph of accuracy against number of epochs. Accuracy tends to go higher as the number of epochs increases. At the early stages of the training, the accuracy is low, but as the model begins to learn the patterns of the data, the accuracy increases and reaches a higher value at the end of the epochs (150). Difference between the training accuracy and the testing accuracy is not high, which suggests that the model is not overfitting on the dataset. The precision-recall curve, which demonstrates a trade-off between the recall score (True Positive Rate), and the precision score (Positive Predictive Value), is used for this analysis due to the dataset being imbalanced (there is a large skew in the class distribution). The confusion matrix also gives a sense of the specific number of patients that were correctly classified as needing ICU or not needing it, and the ones that were incorrectly classified. A large area under the curve indicates high recall and precision, with high precision indicating a low false-positive rate and high recall indicating a low false-negative rate. High scores for both indicate that the classifier is producing correct (high precision) results as well as a majority of all positive outcomes (high recall). From Figure 8, it can be seen that there is a large area under the curve, indicating that the model is functioning very well. With an AUC of 88.75%, the model can distinguish between most of the ICU and the No-ICU patients. It can be seen from the confusion matrix in Fig. 9 that the model correctly predicted 61 individuals needing the ICU and 84 individuals not needing the ICU. A total of 18 individuals were incorrectly classified as not needing the ICU when they needed the ICU and 5 individuals as needing the ICU when they did not need the ICU. With the proposed model doing an excellent job in predicting ICU admissions, the proposed model is compared to the models existing in the literature.
It can be seen from Table 12 that the proposed model beats the model reported by Li et al. (2020b), in all metrics which is a clear suggestion that the proposed model is superior.

Mortality
We now present a summary of our experimental results and analysis for the Mortality category.

Results of Hyperparameter tuning using TabNet
Similarly in predicting morality, the hyper parameters of the model have been tuned using various values of each parameter. Figure 10 shows a graph of how varying the various hyper parameter values affect the mortality. In varying the width of decision prediction layer (nd), the value of nd was changed from a range of 2 to 64 to determine the best output. The results were the best when nd was set to 8.
In varying number of steps in the architecture (nsteps), the value of nsteps was varied from 3 to 12 to determine the best output. A value of 3 gave the best results. nsteps of 10 to 12 did not show any improvement in results, which indicates that the performance of the model will not be improved by increasing the number of steps. In varying gamma, the values of gamma was varied from 1.3 to 2.2. The performance of the model shows a very haphazard trend with the changing values. A gamma of 2.0 gives the best results.
In varying number of independent gates (nindependent), the number of independent gates was varied from 2 to 7. The number of gates which gave the best output is 2, and increasing the number of gates decreased the accuracy of the results.
In varying number of shared gates (nshared), the number of shared gates was varied from 2 to 7. The number of shared gates of 2 gave the best results and increasing the number of gates did not improve the results. Although there is a spike in results when the number of shared gates is 5, the performance reduces when it is increased further.
The values of momentum was varied from 0.02 to 0.3, with 0.02 giving the best results. Increasing the value of the momentum gave poorer results.
In varying lambda sparse, the values of lambda sparse was varied from 0.001 to 0.005, with 0.001 achieving the best results. The value of the lambda sparse had a negative correlation with the performance of the model.  The different masktypes used were sparsemax, and entmax. The output using the sparsemax had a better result compared to the entmax.
The number of epochs and stopping condition was also experimented to determine the impact it has on the performance of the TabNet model. The results are generally better when the stopping condition is defined. The best results are achieved with an epoch of 150, and patience greater than 60. The results do not change when the patience is greater than 60.
Regarding the dimensionality reduction methods, five methods were experimented with to check its effect on the performance of the TabNet Baseline model. The results using the Fast ICA has the best results with it falling short only on recall to PCA. The most important parameter in the table is the AUC, in which the PCA has a score of 94.0.  The Fast ICA yielded the best score on the impact of different dimensionality reduction methods on the performance of the best TabNet model. For the AUC parameter, the Fast ICA has the highest score of 86.4. Different oversampling methods on the performance of the TabNet Baseline model were experimented with and the results using the ADASYN method has the best outcome in all the measured performance metrics.
The impact of different oversampling methods on the performance of the TabNet Best model was determined. The results using the ADASYN method has the best results in all the measured performance metric.
The tables showing the various experimentations of the individual hyperparameter explained above can be found in the appendix section. Table 13 shows the performance of the TabNet Baseline, and TabNet Best models with FastICA dimensionality reduction, and ADASYN oversampling method. The results of the TabNet Best is the best amongst the other baseline models. Figure 11 also shows how the TabNet model makes decisions to predict mortality. In predicting mortality, two of the Masks are shown as an example in Fig. 11. The color of the grid determines the importance of the particular feature to the particular Mask. Brightness of the colors correspond to the importance of the feature. It can be seen that Mask 0 is paying attention to a couple of features but most of its attention is on the last feature, and Mask 1 is paying attention to only 4 features, but the most attention is on the 6th and 15th. Again, the average feature output among all the Masks is used to arrive at the final decision. Figure 12 shows a graph of features against feature importance. The features are the symptoms that contribute to an individual dying of COVID-19. Feature importance stands for the importance of the feature to contribute to the death of a person, where a larger number corresponds to a higher contribution to death. All features have some importance in determining if COVID-19 would be fatal to a person. The feature importance of all the features add up to 1. The top 5 features which contributes greatly to a person dying of COVID-19 were COPD, Ferritin, Myalgia, coronary artery diseases, and CRP.

Best final model for mortality prediction
The model was analyzed and its outputs compared with different TabNet configurations based on different feature extractors. The model with the best results was selected. The final model is the TabNet model with 150 epochs, 128 batch size and 60 patience with the number of steps of 3, width of precision of layer of 8, gamma of 1.7, sparsemax mask type, n independent of 2, the momentum of 0.02, lambda sparse of 1e−3 and n shared of 2, using the Fast Independent Component Analysis as the feature extractor and ADASYN as the sampling technique to balance the imbalanced data. Figures 13 and 14 below show the loss graph, and training and validating accuracy graph of the TabNet in predicting Mortality. Figure 13 shows a graph of loss against number of epochs. Losses reduce as the number of epochs increases. At the first epoch, the model has not learned a lot from the data, so the margin of error is big. As the model is learning (going through the epochs), the error reduces and hence the loss reduces. Figure 14 shows a graph of accuracy against number of epochs. The accuracy tends to increase as the number of epochs increases. At the early stages of the training, the accuracy is low, but as the model begins to learn the patterns of the data, the accuracy increases and reaches a higher value at the end of the epochs (150). The difference between the training accuracy and testing accuracy is not high which suggests that the model does not overfit on the dataset.
In predicting mortality also, the ROC curve which demonstrates a trade-off between the true positive rate (TPR) and the false positive rate (FPR) is plotted due to the imbalance of the dataset. The confusion matrix also gives a sense of the specific number of patients that were correctly classified as dying or not dying, and the ones that were incorrectly classified. A large area under the curve indicates high recall and precision, with high precision indicating a low false-positive rate and high recall indicating a low false-negative rate. High scores for both indicate that the classifier is producing correct (high precision) results as well as a majority of all positive outcomes (high recall). From Figure 15, it can be seen that there is a large area under the curve, indicating that the model is functioning well. With the best performing model here also achieving an AUC of 96.30%, the model can distinguish between most of the patients who died, and the patients that did not die. It can be seen from the confusion matrix in Fig. 16 that the model correctly predicted 89 individuals who died from the virus and 78 individuals who did not die from the virus. A total of 7 individuals were incorrectly classified as not dying from the virus when they died, and 1 individual was classified as dead when the individual did not die from the virus.
The proposed model does an excellent job in predicting mortality. Next, the proposed model will be compared with the baseline models existing in the literature.
It can be seen from Table 14 that the proposed model beats the model reported by Li et al. (2020b) in all metrics, which is a clear suggestion that the proposed model is superior.

DISCUSSION
The main purpose of this study was to develop a deep learning model to predict mortality rate and ICU admission likelihood of patients with COVID-19, and to determine which Finding 2 In predicting ICU admission likelihood, the TabNet model depicted that Ferritin, ALT, and Cxdhx were the top 3 predictors of a patient needing ICU admission after contracting COVID 19, and COPD, ferritin, and Myalgia were the top 3 predictors of a patient dying from the COVID-19 disease.
To derive these results, a dataset was chosen. The data in itself was very unbalanced since most individuals who had COVID actually were not rushed to the ICU and certainly did not die. The dataset was pre-processed. The dataset was experimented with 2 sampling techniques which were the ADASYN, and SMOTE technique. Thus, we exclude the oversampling using SMOTE and focus on the balancing using ADASYN. Table 15 shows the output of the model with varying width of decision prediction layer (nd). The value of nd was changed from a range of 2 to 64 to determine the best output. The results were the best when nd was set to 64. Table 16 shows the output of the model with varying number of steps in the architecture (nsteps). The value of nsteps was varied from 3 to 12 to determine the best output. A value of 3 gives the best results. Changing the nsteps to numbers between 8 and 12 showed a slight decrease in performance which indicates that the performance will not be enhanced by increasing the number of steps. Table 17 shows the output of the model with varying gamma. Changing the gamma shows a very haphazard trend in performance. The best results are given when the gamma is 2.0. Increasing the gamma does not improve the results. Thus, gamma was not increased any further.  Table 18 shows the output of the model with varying number of independent gates (nindependent). The number of independent gates was varied from 2 to 7. All the nindependent gates obtained similar results converging to the best result with 2. Thus, 2 independent gates give the best results. Table 19 gives the output of the model with varying number of shared gates (nshared). The nshared gates was varied from 2 to 7, with 2 gates achieving the best results. Increasing the gates did not improve the results.  Table 20 presents the output of the model with varying values of momentum. Momentum values of 0.2, and 0.3 displayed the best results, with higher values of momentum producing poorer results. Table 21 gives the output of the model with varying lambda sparse. The values of lambda sparse was varied from 0.001 to 0.005. The results of the model showed a negative correlation with the value of lambda sparse. The best result was achieved with a value of 0.001. Table 22 gives the output of the model with different types of masks used. It can be concluded from the table that the mask type of entmax gives a better result across the board in all the performance metrics. Table 23 shows the impact that number of epochs and stopping condition has on the performance of the TabNet architecture. The results are generally better when the stopping condition is defined. The best results are achieved with an epoch of 150, and patience greater than 60. The results do not change when the patience is greater than 60. Table 24 shows the impact of different dimensionality reduction methods on the performance of the TabNet baseline model. The results using the Fast ICA has the best results with it falling short only on recall to PCA. The most important parameter in the table is the AUC, in which the Fast ICA has a score of 83.6. The Fast ICA technique yielded the best results, achieving a 2.58% difference over the next highest technique (PCA), in predicting ICU admissions, which can be seen in Table 25. Table 26 shows the impact of different oversampling methods on the performance of the TabNet Baseline model. The results using the ADASYN method has the best outcome in all the measured performance metrics. The ADASYN gave the best results, achieving a 6.48% difference over the SMOTE technique, which can be seen in Table 27 in predicting ICU admission. Table 28 shows the output of the model with varying width of decision prediction layer (nd). The value of nd was changed from a range of 2 to 64 to determine the best output. The results were the best when nd was set to 64. Table 29 shows the output of the model with varying number of steps in the architecture (nsteps). The value of nsteps was varied from 3 to 12 to determine the best output. A value of 3 gives the best results. Changing the nsteps to numbers between 8 and 12 showed a slight decrease in performance which indicates that the performance will not be enhanced by increasing the number of steps. Table 30 gives the output of the model with varying values of gamma. The values of gamma was varied from 1.3 to 2.2. The performance of the model shows a very haphazard trend with the changing values. A gamma of 2.0 gives the best results, and increasing the gamma any further did not improve the results. Table 31 gives the output of the model with varying number of independent gates (nindependent). The number of independent gates was varied from 2 to 7. The number of gates which gave the best output is 2, and increasing the number of gates decreased the accuracy of the results. Table 32 gives the output of the model with varying number of shared gates (nshared). The number of shared gates was varied from 2 to 7. The number of shared gates of 2 gave the best results and increasing the number of gates did not improve the results. Although there is a spike in results when the number of shared gates is 5, the performance reduces when it is increased further.   Table 33 gives the output of the model with varying vales of momentum. The value of momentum was varied from 0.02 to 0.3, with 0.02 giving the best results. Increasing the value of the momentum gave poorer results. Table 34 gives the output of the model with varying values of lambda sparse. The values of lambda sparse was varied from 0.001 to 0.005, with 0.001 achieving the best results. The value of the lambda sparse had a negative correlation with the performance of the model. Table 35 gives the output of the model with different mask types. The different masktypes used were sparsemax, and entmax. The output using the sparsemax had a better result compared to the entmax. Table 36 shows the impact that number of epochs and stopping condition has on the performance of the TabNet architecture. The results are always better when a stopping condition is defined. The best results are achieved with an epoch of 150, and patience greater than 60. The results do not change when the patience is greater than 60. Table 37 shows the impact of different dimensionality reduction methods on the performance of the TabNet baseline model. The results using the PCA is the best with it  Table 38 shows the impact of different dimensionality reduction method on the performance of the best TabNet model. The results are better across all the measured performance metrics when the Fast ICA is used.  Table 39 shows the impact of different oversampling methods on the performance of the TabNet Baseline model. The results using the ADASYN method has the best results in all the metrics except for precision where the SMOTE method is better. Table 40 shows the impact of different oversampling methods on the performance of the TabNet Best model. The results using the ADASYN method has the best results in all the measured performance metric.
Five different dimensionality techniques were also experimented with to improve the results. The other dimensionality reduction techniques are excluded, and the Fast ICA technique is concentrated upon to achieve the final results.
The various hyperparameters were also tuned, and the best results of each was combined with the dimensionality technique and then oversampled to obtain the final result for all metrics. Results achieved were, an AUC of 88.3%, F1 score of 89.7%, the accuracy of 88.7%, recall of 93.3%, and precision of 86.4% for predicting ICU. In predicting mortality, results of 96.3% AUC, 95.8% F1 score, accuracy of 96.0%, recall of 99.8%, and precision of 91.8% were obtained. The reason why the results in predicting mortality achieves higher performances than the one in predicting ICU admission could be because sometimes individuals that need ICU admission, do not get the opportunity due to lack of beds available at that time because of large volumes of individuals present at the hospital needing the same resources. In the case of mortality, when an individual dies, the individual dies, there is no middle ground, so it is relatively easier to distinguish mortality than ICU admission.
A confusion matrix was constructed to show specifics, where there were more false positives than false negatives in both determining ICU admission and mortality. The reason for more false positives than false negatives could be because doctors have to make a quick and instant guess as to which patient needs the ICU at that time by simply looking at the physical conditions of the patient present. Due to the lack of time and resources, they depend on only those physical symptoms to make a decision, so patients who can deteriorate quickly due to underlying illnesses or other factors are often overlooked for admission to the ICU simply because they do not show physical deterioration at the time of decision making. The process by which the proposed model makes decisions to determine which features are most important was also determined. The model uses Masks which shows the features they were paying the most attention to in the heat map, which can be seen here in Figs. 4 and 11. This was then used to construct the global feature importance graph, which is easier to understand, where the longer the bar, the more importance it has in determining if a patient with COVID-19 is likely to be sent to the ICU or if the patient is likely to die from the disease.
The findings from our model suggesting the most important features in predicting ICU admission and mortality, has been supported by other literature's, these are shown below.
Ferritin is the symptom of the patient which is the most important in determining if the patient needs ICU or not. Ferritin represents how much iron is contained in the body, and if a ferritin test reveals a lower-than-normal ferritin level in the blood, this may indicate that the body's iron stores are low. This is a high indication of iron deficiency which can cause anaemia (Dinevari et al., 2021). Ferritin levels were found to be elevated upon hospital admission and throughout the hospital stay in patients admitted to the ICU by COVID-19. In comparison to individuals with less severe COVID-19, ferritin levels in the peripheral blood of patients with severe COVID-19 were shown to be higher. As a result, serum ferritin levels were found to be closely linked to the severity of COVID19 (Dahan et al., 2020). Early analysis of ferritin levels in patients with COVID-19 might effectively predict the disease severity (Bozkurt et al., 2021). The magnitude of inflammation present at admission of COVID-19 patients, represented by high ferritin levels, is predictive of in-hospital mortality (Lino et al., 2021). Studies indicate that Chronic obstructive pulmonary disease (COPD) is the symptom that shows the most importance in predicting mortality among COVID-19 patients. This COPD is a chronic inflammatory lung condition in which the lungs' airflow is impeded. Breathing difficulties, cough, mucus (sputum) production, and wheezing are all symptoms. Since COVID-19 is a disease that affects the respiratory system, it makes sense that a disease like COPD which also affects the lungs could have devasting effects on a patient who contracts COVID-19 (Gerayeli et al., 2021). Patients with Chronic Obstructive Pulmonary Disease (COPD) have a higher prevalence of coronary ischemia and other factors that put them at risk for COVID-19-related complications. The results of this study confirm a higher incidence of COVID-19 in COPD patients and higher rates of hospital admissions (Graziani et al., 2020). While COPD was present in only a few percentage of patients, it was associated with higher rates of mortality (Venkata & Kiernan, 2020). It can be observed that the top 3 highest predictors for mortality and ICU admission are different, which indicates that there are some features which can be seen in the later and more advanced stages of COVID (at the time of death). For example, Pardhan et al. (2021) observed that COPD is reported more often than asthma, suggesting that physicians in Sweden considered COPD to be a better predictor than asthma for detecting severe COVID-19 cases.
It can also be seen that shortness of breath had a high correlation with ICU admission but it was not among the top predictors for predicting ICU admission. This is because correlation looks at only the linear relationship between that feature and the target without considering other features. For example, a single feature can have a low correlation, but when combined with other features, it can offer a high predictive power, as in the case of the COPD.
The limitations of this study are that the sample size is small, with only about 1,000 patients included in the study. The study was restricted to patients at Stony Brook University Hospital and conducted between 7 February, 2020 to 4 May, 2020.

CONCLUSION
This paper proposes a tabular, interpretable deep learning model to predict ICU admission likelihood and mortality of COVID-19 patients. The proposed model achieves this by employing a sequential attention mechanism that selects the features at each step of the decision-making process based on a sparse selection of the most important features such as patient demographics, vital signs, comorbidities, and laboratory discoveries.
ADASYN was used to balance the data sets, Fast ICA to extract useful features, and all the various hyperparameters tuned to improve results. The proposed model achieves an AUC of 88.3% for predicting ICU admission likelihood which beats the 72.8% reported in the literature. The proposed model also achieves an AUC of 96.3% for predicting mortality rate which beats the 84.4% reported in the literature. The most important patient attributes for predicting ICU admission and mortality were also determined to give a clear indication of which attributes contribute the most to a patient needing ICU and a patient dying from COVID-19, where these claims were also backed up previous studies as well. The information from the model can be used to assist medical personnel globally by helping direct the limited healthcare resources in the right direction, in prioritizing patients, and to provide tools for front-line doctors to help classify patients in time-bound and resource-limited scenarios.
For future work, the study can be extended to include a lot more patients over a longer time frame from several hospitals. The proposed method can be combined with other machine learning methods for improved results. This study could be extended to include more diseases, allowing the healthcare system to respond more quickly in the event of an outbreak or pandemic.