Predicting Mortality in Hospitalized COVID-19 Patients in Zambia: An Application of Machine Learning

The coronavirus disease 2019 (COVID-19) has wreaked havoc globally, resulting in millions of cases and deaths. The objective of this study was to predict mortality in hospitalized COVID-19 patients in Zambia using machine learning (ML) methods based on factors that have been shown to be predictive of mortality and thereby improve pandemic preparedness. This research employed seven powerful ML models that included decision tree (DT), random forest (RF), support vector machines (SVM), logistic regression (LR), Naïve Bayes (NB), gradient boosting (GB), and XGBoost (XGB). These classifiers were trained on 1,433 hospitalized COVID-19 patients from various health facilities in Zambia. The performances achieved by these models were checked using accuracy, recall, F1-Score, area under the receiver operating characteristic curve (ROC_AUC), area under the precision-recall curve (PRC_AUC), and other metrics. The best-performing model was the XGB which had an accuracy of 92.3%, recall of 94.2%, F1-Score of 92.4%, and ROC_AUC of 97.5%. The pairwise Mann–Whitney U-test analysis showed that the second-best model (GB) and the third-best model (RF) did not perform significantly worse than the best model (XGB) and had the following: GB had an accuracy of 91.7%, recall of 94.2%, F1-Score of 91.9%, and ROC_AUC of 97.1%. RF had an accuracy of 90.8%, recall of 93.6%, F1-Score of 91.0%, and ROC_AUC of 96.8%. Other models showed similar results for the same metrics checked. The study successfully derived and validated the selected ML models and predicted mortality effectively with reasonably high performance in the stated metrics. The feature importance analysis found that knowledge of underlying health conditions about patients' hospital length of stay (LOS), white blood cell count, age, and other factors can help healthcare providers offer lifesaving services on time, improve pandemic preparedness, and decongest health facilities in Zambia and other countries with similar settings.


Introduction
Infectious diseases have always shaped the world in many ways, from changing the rules that govern daily life to restricting movement and travel, at times disrupting daily life to the point of bringing the entire world to a standstill. This has been very evident in the COVID-19 pandemic, which has claimed millions of lives since its outbreak [1]. This study [2] focuses on COVID-19 mortality in Zambia and how predicting mortality can improve public health preparedness and save lives. COVID-19 is caused by the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2). The COVID-19 pandemic has challenged how the field of public health handles typical infectious diseases and how it conducts research. At the time of this writing and according to the data reported by the Johns Hopkins University Center for Systems Science and Engineering [3], the pandemic had already affected the global population with some 610,200,000 cases and 6,500,000 deaths; in Africa, about 12,060,000 cases and 256,000 deaths; and in Zambia, over 333,000 cases and 4,000 deaths [4]. The situation became extremely overwhelming and attracted the attention of researchers from various fields of the research community.
Zambia has experienced surging COVID-19 cases and mortalities on a national scale. This has heavily overwhelmed local communities and especially public health facilities, which have proven to be ill-prepared since the start of the pandemic. One of the major challenges Zambia faced was pandemic unpreparedness, whereas preparedness has been shown to be an essential factor in the fight to control any pandemic [5]. Failure to predict COVID-19 mortality in the patients at greatest risk contributed to this unpreparedness, which in turn caused improper prioritization, underestimation, and underallocation of funds towards the government's pandemic response plan [6].
Many research studies have been done on the COVID-19 pandemic so far using both traditional statistical methods and ML techniques [7]. There have been a few past studies that used ML algorithms for COVID-19 mortality classification. A study that compared two prediction models based on statistical and computational ML algorithms to predict mortality in COVID-19 hospitalized patients [8] found that, between a conventional statistical method (LR) and an ML method (artificial neural network, ANN) validated on 16 features against a sample of 482 laboratory-confirmed COVID-19 hospitalized patients, the ANN achieved the best performance with an ROC_AUC of 90%. However, despite the high performance, the authors of the study acknowledged the limitations associated with having used only two ML algorithms and having conducted the study at a single center on merely 482 participants, which affected the generalizability of their findings. The authors also acknowledged that there were no efforts to address the misclassification bias that may have been introduced by the class imbalance between the 382 (79.25%) who recovered and the 100 (20.74%) who succumbed to the disease, in which case the Synthetic Minority Oversampling Technique (SMOTE) should have been applied. Another study conducted on 1710 hospitalized COVID-19 patients developed and evaluated several ANNs to predict the mortality risk in hospitalized COVID-19 patients. The backpropagation artificial neural network (BP-ANN) was the best model and achieved an ROC_AUC of 88.8%. For this study, the authors acknowledged the limitations presented by the single-center nature of their selected dataset and the use of only two ANN algorithms in different configurations.
Although this research was not focused on proposing totally new methods and procedures, there are a few components that represent the novelty of our study in addressing certain gaps identified in past studies. In order to improve the generalizability of our findings, we aimed to target a much larger study sample that included participants from multiple healthcare facilities. Most studies have not implemented several ML methods simultaneously and have thus recommended the use of several ML methods in order to have a clear picture of how these algorithms perform when compared with each other. In addition, the procedure for picking the best model in most of the studies reviewed simply picks the ML model with the highest value in the metric being considered, and there are no follow-up attempts to determine whether the difference observed visually between two competing algorithms is actually statistically significant. To address this concern, this study sought to develop and validate several ML algorithms that included the following: (1) decision tree (DT), (2) random forest (RF), (3) support vector machines (SVM), (4) logistic regression (LR), (5) Naïve Bayes (NB), (6) gradient boosting (GB), and (7) XGBoost (XGB). These algorithms were implemented simultaneously, after which the procedure for selecting the best model used pairwise comparisons of each model against all other models for the various metrics used, as explained in the post hoc analysis section of the Materials and Methods section. This helped in determining whether the differences observed visually between two competing models were actually statistically significant. This also made it possible for this study to have a statistical basis for proposing and recommending multiple ML algorithms as alternatives to the top performing model in situations where there were no statistically significant differences observed between the best model and the second-best model, something that is hardly done in ML research.
This study was conducted in order to help Zambia's healthcare system prepare for current and future pandemics and sought to predict mortality in hospitalized COVID-19 patients using ML. It employed a special form of ML called supervised ML [9]. ML was chosen for this study for a number of reasons. Progress in computer science and technology has made the application of ML in public health research more common today. As has been observed, ML models have been preferred in situations involving extremely dynamic datasets, automation, and greater computing abilities [10,11]. This study thus sought to develop an ML pipeline that supports automation, reusability, and reproducibility. ML algorithms have also been shown in a number of studies [12] to perform strongly, as these models continue to improve as more data become available over time. Another advantage that favored the use of ML in this research was that, while most conventional statistical methods are proficient at detecting simpler univariate and multivariate associations, it often requires more sophisticated ML algorithms to detect complex interactions and heterogeneous feature associations in which different unspecified subgroups of instances in the data may have distinct underlying feature associations with outcome [13].
This research is intended to answer two fundamental questions:

Literature Review
This section presents the review of literature published in various journals on COVID-19 mortality. The literature considered was retrieved from the MEDLINE database using the PubMed online search engine. For each research paper reviewed, the focus was on the study design and setting, study sample, study purpose, methods, and main results.
The first part of the literature review presents papers that have addressed factors that contribute to severe COVID-19 and mortality [11,14,15]. The second focuses on studies that attempted to predict mortality in COVID-19 patients using ML methods and the associated performances for various evaluation metrics. The final part of our literature review addresses a few studies that have compared ML models with conventional statistical models in order to appreciate why ML models were chosen for this study.
First and foremost, there have been a number of studies that have described predictors of severe COVID-19, which could probably be in the causal pathway leading to mortality. In a study entitled "Risk factors for mortality in critically ill patients with COVID-19 in Huanggang, China: A single-center multivariate pattern analysis" [16], a group of researchers outlined multiple risk factors that led to severe COVID-19 and even death in a number of extreme cases. The paper observed 192 critically ill COVID-19 patients, of whom 142 survived and 50 died in hospital. After data were compared between survivors and nonsurvivors and a multivariate pattern analysis was performed to identify possible risk factors for COVID-19 mortality, several factors were identified. These included age, duration (time from illness onset to admission), Barthel index score, and laboratory examination indicators including C-reactive protein, white blood cell count, platelet count, fibrin degradation products, oxygenation index, lymphocyte count, and D-dimer. In another study ("COVID-19 mortality risk assessment: An international multicenter study"), Bertsimas et al. [17] addressed many more risk factors of severe COVID-19 and mortality including age, sex, heart rate, heart disease, diabetes, chronic kidney disease, cardiac dysrhythmias, and a few other features. These features were derived from a population of 3,062 COVID-19 patients. The mortality rate was 26.84%. In comparison to survivors, nonsurvivors were older with a median age of 80, whereas survivors had a median age of 64. Of the nonsurvivors, men were 67.2% while women were only 58.4%. It was also reported that the prevalence of comorbidities such as cardiac dysrhythmias, chronic kidney disease, and diabetes was higher in the nonsurvivor population than in the survivor population (9.61%, 4.21%, and 15.92% versus 5.56%, 1.74%, and 11.42%, respectively).
Across these studies, with their varying study settings and study samples, a few features have appeared in multiple studies. These are age, sex, hospitalization, pneumonia, acute respiratory distress syndrome, HIV, TB, malignancy, diabetes, cardiac disease, hypertension, chronic pulmonary disease (CPD), chronic kidney disease (CKD), and chronic lung disease. These features were thus targeted in this study.
Next, studies that attempted to predict mortality in COVID-19 patients were reviewed. Josephus et al. [18] conducted a study on 114 Indonesian COVID-19 patients, and the objective of the study was to make mortality predictions on COVID-19 patients with nonmedical features. The study used a single LR model which achieved an accuracy of more than 90%; further analysis found that age was the most important predictor of a patient's mortality. The author recommended a larger study sample as only 114 patients were used. It was also noted that no other ML methods were included against which comparisons could have been made in order to choose the best model. In a different study conducted in China involving a cohort of 2,160 participants analysed retrospectively, Gao et al. [19] built an ensemble mortality risk prediction model for COVID-19 using four ML methods including LR, SVM, Gradient-Boosted-DT, and ANN. The results found that the ensemble model achieved an ROC_AUC of 0.9621 (95% CI: 0.9464-0.9778). Some of the limitations acknowledged by the authors included the fact that participants were primarily local residents of Wuhan, China; the authors recommended investigating the predictive performance of the ML models in other regions and ethnicities and evaluating the prognostic implications of the ensemble ML model in prospective cohorts rather than the retrospective cohorts used in their study. In another retrospective study in South Korea involving 3,524 patients, Das et al. [20] sought to predict mortality among confirmed COVID-19 patients using machine learning. Of the five ML algorithms (LR, SVM, KNN, RF, and GB) used, the results showed that LR was the best model and achieved an ROC_AUC of 83.0%. There were a number of limitations reported by the authors, including unavailability of crucial clinical information on symptoms and risk factors.
A major setback reported by the authors of this study was the reuse, for validation, of a subset of data that was also included in the cross-validation process. This may have led to overfitting of the models to the available data. We also noticed that despite the extreme class imbalance in their dataset, which contained 3,529 (97.9%) cases and 74 (2.1%) deaths, there were no efforts to address the potential misclassification bias that may have been introduced by this imbalance, in which case balancing the outcome classes using SMOTE would have been necessary. In a much bigger multinational cross-sectional study involving a huge sample of 2,670,000 participants from 146 countries, Pourhomayoun and Shakibi [21] designed and developed several ML models (SVM, ANN, RF, DT, LR, and K-Nearest Neighbor (KNN)) to determine the health risk of patients with COVID-19. The study results found the best model to be the RF, which achieved an ROC_AUC of 94.0%. This was a high-quality study with a huge study sample; however, the performance was not exceptional. This could have been due to various confounding variables and other complex feature interactions that may have crept into the study due to the huge differences in population characteristics across national or regional borders; the results could therefore have been stratified by region, grouping countries with similar population characteristics.
Some studies have compared ML models with conventional statistical models in prediction problems, in which ML models were preferred to conventional statistical models. One such study, entitled "Comparison of Conventional Statistical Methods with Machine Learning in Medicine: Diagnosis, Drug Development, and Treatment" [22], was a narrative review whose aim was to offer an expert perspective on the comparison of traditional statistical methods with ML, and their corresponding advantages and limitations in medicine, with a specific focus on the integration between the two approaches and its application to illness detection, drug development, and treatment. It compared the usefulness and limitations of traditional statistical methods and ML when applied to the medical field. This study recommended choosing the method that best meets the requirements and best solves the problem at hand. It also recommended a hybrid approach that integrates both ML and traditional approaches if doing so can add beneficial results to the study.
The current review of the literature suggests that ML has not been fully utilized in medical research despite the advantages associated with its use. Moreover, the newness of ML models and their heavy reliance on programming skills have added to their complexity and hindered most researchers from using ML methods where they ought to be used. This has limited the application of ML models. Reproducibility and consistency have always been the anchors of evidence-based medical research; however, the way in which most ML research studies have been documented has made it harder to reproduce ML methods. Thus, this study aimed to address a number of issues identified in the various studies reviewed. These issues involved the use of larger study samples from various locations to improve generalizability of findings, the implementation of several ML methods from which the best model should be chosen, the use of SMOTE where extreme class imbalance between outcome categories makes it necessary in order to reduce misclassification bias, and the use of a statistical procedure for selecting the best model that performs significantly better than other models. The ML methods used in this study were intended to improve the reusability of the ML pipelines built, in order to allow others to apply similar methods to similar classification problems.

Materials and Methods
This section discusses the various methods used in the study, which are the design and setting of the study, data analysis, and the type of ethical approval obtained.
This study followed the standard guidelines for ML research outlined by Luo et al. [23] in the paper "Guidelines for Developing and Reporting Machine Learning Predictive Models in Biomedical Research: A Multidisciplinary View." A conceptual framework, displayed in Figure 1, was developed to show the outcome (COVID-19 mortality) and the various features that are predictive of mortality.
To further expand the research conceptual framework, a more detailed visual graphic of the machine learning modelling steps implemented in this research was adapted from Urbanowicz et al. [13] and is shown in Figure 2.

Study Design.
The outcome of interest and the exposures in this study were analysed simultaneously, and study participants were selected only based on relevance to the study objectives and not on the status of the outcome or exposures. This qualified the study to use an analytical cross-sectional design as recommended by Wang and Cheng [24].

Study Population and Study Setting.
This study was conducted in Zambia, which is estimated to have a population of about 18 million, with the majority of the people (98%) estimated to be under 65 years of age [25,26]. This is an important observation since age is an essential predictor of COVID-19 mortality. The study population targeted included all confirmed cases of COVID-19 that were hospitalized in various health facilities in Zambia from March 2020 to October 2021. The data used were from the Zambia National Public Health Institute (ZNPHI), which houses the combined datasets from various health institutions that were selected by the Ministry of Health to be COVID-19 centres across the country.

Measurement Variables.
The measurement variables used in this study were chosen based on recent studies [16,17] that have shown that COVID-19 in the presence of a number of comorbidities is more likely to lead to mortality. Thus, the features chosen included age, diabetes, tuberculosis, and other underlying conditions, as listed in Table 1.

Eligibility Criteria.
This research targeted the data collected by ZNPHI from various health facilities in Zambia, for which all confirmed COVID-19 cases hospitalized from March 2020 to October 2021 were eligible to be included in the study. However, pregnant women were excluded from the study because vaccination status was one of the variables and there was no acceptable vaccine for pregnant women in Zambia at the time this research was conducted. Other excluded cases involved records that had too many missing variables, for which the application of multiple imputation would have simply added extra noise to the dataset.

Handling of Missing Data.
Since the study applied ML models that do not allow missing values in the dataset, missing values needed to be imputed for the models to run. The Supplementary Material ("Figure S2: Dataset missingness map") contains details on the level of missingness associated with each of the features used. Multiple imputation by chained equations (MICE) [27] was the method used to impute missing values, using the mice package in R [28].
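The study performed this step with the mice package in R. As a rough Python analogue only, the sketch below uses scikit-learn's experimental IterativeImputer, which follows the same chained-equations idea; it is a substitute technique, not the authors' code. The toy data and the meaning of its three columns are assumptions made for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates the experimental API)
from sklearn.impute import IterativeImputer

# Toy data (hypothetical): columns stand in for age, LOS and white blood cell count.
rng = np.random.default_rng(0)
X_full = rng.normal(loc=[50.0, 4.0, 7.0], scale=[16.0, 2.0, 3.0], size=(200, 3))

# Knock out roughly 10% of the entries at random to mimic missingness.
X_missing = X_full.copy()
mask = rng.random(X_full.shape) < 0.10
X_missing[mask] = np.nan

# Draw several completed datasets, as multiple imputation does, by sampling
# from the posterior of each chained regression with a different seed.
n_imputations = 5
completed = [
    IterativeImputer(sample_posterior=True, max_iter=10, random_state=seed).fit_transform(X_missing)
    for seed in range(n_imputations)
]

# Observed entries are left untouched; only the missing cells are filled.
X_imputed = completed[0]
```

Each element of `completed` is one fully imputed dataset; analyses can be run on each and the results pooled, which is the usual reason for drawing more than one imputation.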

Handling of Bias.
It was also noted that there was a large class imbalance between the proportion of patients who recovered and the proportion of patients who died, as shown in Figure 3. Extreme class imbalance has been widely reported by many ML experts to have the potential to introduce misclassification bias or type II error [29,30]. This prompted the use of the Synthetic Minority Oversampling Technique (SMOTE) [31,32] to balance the label classes in the dataset. Results from the imbalanced dataset were then compared to the results from the SMOTE-balanced dataset in order to check whether balancing classes really helped the ML classifiers in reducing the type II error.
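The core idea of SMOTE can be sketched in a few lines: each synthetic minority record is placed at a random point on the segment joining a minority record to one of its nearest minority neighbours. The function below is a minimal NumPy illustration under that definition, not the implementation used in the study; the function name, the toy data, and the 90/10 class split are assumptions.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating between a chosen
    minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)              # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)        # which minority sample to start from
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                 # position along the connecting segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy imbalanced data (hypothetical): 90 recovered (0) versus 10 died (1).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (90, 2)), rng.normal(2, 1, (10, 2))])
y = np.array([0] * 90 + [1] * 10)

# Generate enough synthetic minority rows to equalise the two classes.
X_min = X[y == 1]
X_syn = smote_oversample(X_min, n_new=(y == 0).sum() - (y == 1).sum())
X_bal = np.vstack([X, X_syn])
y_bal = np.concatenate([y, np.ones(len(X_syn), dtype=int)])
```

In practice a maintained implementation, such as the SMOTE class in the imbalanced-learn library, would normally be preferred over a hand-written version like this one.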

Data Analysis.
This section describes the various software packages used in this study, the classification models used, and the performance evaluation metrics employed. The data analysis used all the data that were made available by ZNPHI, involving 1,433 COVID-19 hospitalized patients.

Statistical Software Packages Used.
The Python programming language version 3.8.0 [33] and its libraries scikit-learn version 1.1.0 [34] and XGBoost [35] were used in ML model development. The dataset was split into training and test sets in the ratio 80:20 using a 5-fold cross-validation strategy, which has been shown to be sufficient in assessing the generalization ability of ML models [42]. The ML models used were optimized for performance using the various scikit-learn and XGBoost hyperparameter tunings [43,44]. The ML models were all trained and tested on the same dataset, after which the performance evaluation metrics were assessed to identify the best-performing model. Before running the candidate models, a Pearson correlation analysis of all pairs of features was conducted to identify potentially redundant or highly correlated features, followed by a univariate analysis of the association between the outcome and individual features, in which categorical features were analysed using the chi-square test of independence and numerical features were analysed using the Mann-Whitney U test; this helped in consolidating the feature importance analysis that followed.
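A 5-fold scheme yields an 80:20 split in every fold, since each fold holds out one fifth of the records for testing. The sketch below illustrates this with scikit-learn on synthetic data; the dataset, the choice of a gradient boosting classifier with default hyperparameters, and the use of stratified folds are assumptions for illustration, not the study's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the patient data (hypothetical; 14 features).
X, y = make_classification(n_samples=500, n_features=14, n_informative=6, random_state=0)

# Five folds: each iteration trains on about 80% and tests on about 20%.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in cv.split(X, y)]

# Score one candidate model on every fold using ROC AUC.
model = GradientBoostingClassifier(random_state=0)  # default hyperparameters (assumption)
cv_auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
mean_auc = cv_auc.mean()
```

The per-fold scores in `cv_auc`, rather than only their mean, are what a later statistical comparison between models would operate on.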

The Decision Tree (DT) Algorithm.
The DT model [45] is a type of supervised ML algorithm used in classification problems in which the model follows a set of if-else conditions to either visualise the data or classify it in accordance with the possible outcomes presented. This study implemented the categorical variable DT during the mortality classification process. The model used the decision tree classifier from the scikit-learn library with the hyperparameter tunings shown in the Supplementary Material for "ML Models Optimization Hyperparameter" in Table S1 [46].

The Random Forest (RF) Algorithm.
The RF model [47,48] is an ensemble learning method that combines many DTs and averages them to make a final decision. This produces a more complex and powerful classifier. The RF model uses the random forest classifier from the scikit-learn library and is implemented with the hyperparameter attributes shown in the Supplementary Material for "ML Models Optimization Hyperparameter" in Table S2 [49].
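The relationship between a single tree and a forest can be checked directly in scikit-learn: the forest's predicted class probabilities are the mean of the probabilities predicted by its individual trees. The sketch below uses synthetic data and arbitrary hyperparameters (the tuned values in Tables S1 and S2 are not reproduced here), so it should be read as an illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical).
X, y = make_classification(n_samples=300, n_features=14, n_informative=6, random_state=0)

# A single shallow tree: a short chain of if-else splits.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# A forest of 25 trees, each grown on a bootstrap sample of the data.
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# The forest's class probabilities equal the average over its member trees.
forest_proba = forest.predict_proba(X[:5])
mean_of_trees = np.mean([t.predict_proba(X[:5]) for t in forest.estimators_], axis=0)
```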

The Support Vector Machine (SVM) Algorithm.
The SVM [50] is a classification algorithm in which each data point is plotted in an n-dimensional space, where n is the number of features, and the two classes are separated by the hyperplane that best differentiates them; the support vectors are the data points lying closest to that hyperplane. The SVM algorithm performs classification by using the SVC (support vector classifier) from the scikit-learn library. The SVC separates the data into their classes using the hyperparameters shown in the Supplementary Material for "ML Models Optimization Hyperparameter" in Table S3 [51].

The Logistic Regression (LR) Algorithm.
The LR model can be defined as an ML algorithm applied to classification problems that uses the concept of probability in predictive analysis: the sigmoid function $\sigma(z) = 1/(1 + e^{-z})$ maps predicted values to probabilities ranging from 0 to 1, and a logistic cost function penalises the model for every wrong prediction so that training works towards reducing those misclassification errors [52]. The LR model is a linear model that uses the logistic regression classifier from the scikit-learn library with the hyperparameter attributes presented in the Supplementary Material for "ML Models Optimization Hyperparameter" in Table S4 [53].
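A small worked example may help make the sigmoid mapping and the logistic cost concrete. The coefficients below are invented purely for illustration and are not estimates from the study.

```python
import math

def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y_true, p):
    """Logistic (cross-entropy) cost for one observation with label 0 or 1."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

# Hypothetical coefficients: intercept plus weights for age and LOS.
intercept, w_age, w_los = -3.0, 0.05, -0.20

# Linear score for a 60-year-old patient with a 2-day stay, then its probability.
z = intercept + w_age * 60 + w_los * 2
p_death = sigmoid(z)

# The cost is larger when the less likely outcome turns out to be the true one.
loss_if_died = log_loss(1, p_death)
loss_if_recovered = log_loss(0, p_death)
```

Here the predicted probability of death is below 0.5, so the model is penalised more heavily if the patient actually died than if the patient recovered.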

The Naïve Bayes (NB) Algorithm.
The NB model [54] is a classification method that uses the popular Bayesian method of prior likelihood in the implementation of classification. It is based on Bayes' theorem, which states that if an outcome event is partitioned into k nonintersecting (mutually exclusive) categories $B_1, B_2, \ldots, B_k$, then the probability of the $i$th event $B_i$ happening given an event $A$ is given by the following equation:

$$P(B_i \mid A) = \frac{P(A \mid B_i)\,P(B_i)}{\sum_{j=1}^{k} P(A \mid B_j)\,P(B_j)}$$

Classification by the NB algorithm was implemented using the Gaussian NB classifier (Gaussian Naive Bayes) from the scikit-learn library, under the assumption that the likelihoods of features are Gaussian, such that the parameters $\sigma_y$ and $\mu_y$ are estimated using the method of maximum likelihood. Since the NB classifier is naturally less

Performance Evaluation Metrics.
The metrics used to evaluate the performance of models in this study were accuracy, recall (sensitivity), and specificity. In order to get a clearer picture of the models' performance that is free from bias arising from the imbalance between classes in the dataset, the analysis of the areas under the ROC and precision-recall curves was prioritized. To supplement the use of accuracy, the F1-score was used to optimize the trade-off between precision and recall [58].
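The threshold-based metrics named above all derive from the four cells of a confusion matrix. The sketch below computes them from hypothetical counts (not the study's results), treating death as the positive class.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute threshold-based metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn)            # sensitivity: deaths correctly flagged
    specificity = tn / (tn + fp)       # recoveries correctly flagged
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return {
        "accuracy": accuracy,
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }

# Hypothetical counts: 100 deaths and 100 recoveries in a test set.
metrics = classification_metrics(tp=80, fp=10, tn=90, fn=20)
```

With these counts, accuracy is 0.85, recall is 0.80, and specificity is 0.90, while the F1-score (about 0.84) sits between precision and recall.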

Post Hoc Analysis.
Once the results had been obtained, a procedure for determining the best model was proposed that goes beyond simply picking the ML model with the highest value in the metric being considered. The best model was determined by first conducting nonparametric statistical analyses to compare the averages of the performance evaluation metrics for every pair of the ML models used. Secondly, an analysis was done to determine which of the ML models had evaluation metrics that yielded significant findings in the Kruskal-Wallis one-way analysis of variance (p value ≤ 0.05). Finally, those models were then run through follow-up pairwise Mann-Whitney U-tests comparing all possible pairs of the seven ML models used, to identify the existence of a significant difference in performance. Thus, for each metric assessed, the seven algorithms used resulted in 21 possible pairwise ML model combinations (computed from $C(7, 2) = 7!/((7 - 2)! \times 2!) = 21$). The best model was then picked based on the existence of a statistically significant difference between a number of competing models. If multiple outstanding models are competing and the pairwise Mann-Whitney U-tests do not show the existence of a statistically significant difference, then choosing the model whose metric has the highest value as the best model should be accompanied by the argument that, in the event that the top model could not be implemented, the other competing models should be used with the same confidence as though they were the best model.
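The procedure described above can be sketched with SciPy: an omnibus Kruskal-Wallis test across the seven models, followed by Mann-Whitney U-tests on all 21 pairs. The scores below are randomly generated placeholders standing in for repeated evaluations of each model; they are not the study's results, and the number of evaluations per model is an assumption.

```python
from itertools import combinations

import numpy as np
from scipy.stats import kruskal, mannwhitneyu

models = ["DT", "RF", "SVM", "LR", "NB", "GB", "XGB"]

# Hypothetical ROC AUC values from 10 repeated evaluations per model.
rng = np.random.default_rng(0)
centres = dict(zip(models, [0.88, 0.96, 0.90, 0.89, 0.87, 0.97, 0.97]))
scores = {m: np.clip(rng.normal(centres[m], 0.01, size=10), 0, 1) for m in models}

# Omnibus test: is there any difference among the seven models?
h_stat, p_overall = kruskal(*scores.values())

# Follow-up: two-sided Mann-Whitney U-test for each of the C(7, 2) = 21 pairs.
pairs = list(combinations(models, 2))
pairwise_p = {}
if p_overall <= 0.05:
    for a, b in pairs:
        _, p = mannwhitneyu(scores[a], scores[b], alternative="two-sided")
        pairwise_p[(a, b)] = p

# Pairs whose difference is not significant can be treated as interchangeable.
not_different = [pair for pair, p in pairwise_p.items() if p > 0.05]
```

Any pair that lands in `not_different` corresponds to the situation described in the text, where a runner-up model could be recommended with the same confidence as the top model.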

Results
This section presents the various findings of this study. The summary statistics presented describe the characteristics of the data used, followed by the results of the feature importance analysis and the results of the classification models presented with their performance evaluation metrics. Applying SMOTE to the imbalanced dataset produced the balanced dataset shown in Figure 4. Table 2 presents a summary of numerical features in the form of averages and medians with their respective standard deviations (SD) and interquartile ranges (IQR) where appropriate. Table 3 presents a summary of categorical features with their respective proportions in percentages.

Sample Characteristics.
The results in Table 2 describe the numerical features of the study participants, which all showed a strong significant association with COVID-19 mortality. The median number of days spent in the hospital (LOS) for patients who recovered and those who died was 4.0 (IQR = 5.0) days and 2.0 (IQR = 2.0) days, respectively. For age, as expected, the mean age of those who died was as high as 56.6 (SD = 17.1) years, whereas the mean age of those who survived was 49.9 (SD = 16.1) years. It can also be seen that the median white blood cell count for patients who died was 7.8 (IQR = 8.5) cells per μL, whereas patients who recovered recorded 6.7 (IQR = 4.6) cells per μL. The results in Table 3 also describe eleven categorical features of the study participants. Five of these features (diabetes, hypertension, wave, ward, and CPD) showed a strong significant association with COVID-19 mortality, whereas the other features
Global Health, Epidemiology and Genomics

Feature Importance Analysis.
The results of a feature importance analysis in Figure 5 show both the mutual information scores and the MultiSURF scores. The mutual information score highly ranked LOS and white blood cell count with an approximate score of 0.188. Other relatively important features in order of decreasing importance included diabetes, sex, age, wave, and hypertension. The MultiSURF scores, on the other hand, showed which of the important features were given maximum priority and which were given the least priority. The first priority was given to LOS with a relatively high score of 0.12, whereas the second priority was given to the features hypertension, diabetes, sex, HIV, white blood cell count, wave, and age (in descending order of importance). On the other hand, chronic kidney disease (CKD), alcohol intake, tuberculosis, and admission ward were not prioritized. Figure 6 presents the normalized compound feature importance plot in the form of stacked bar graphs. The size of the portion of the bar for each ML model represents the proportional contribution of that model to the total magnitude of importance that each feature was given. In harmony with the mutual information scores and the MultiSURF scores, the normalized compound feature importance plot for the seven algorithms used also confirmed that LOS stood out as the most influential feature with a score of almost 2.00. This was followed by an approximate score of 0.70 for age, white blood cell count, and wave.
The results of the feature importance analysis complemented the results of the univariate feature analysis and guided the removal of some features that had little influence on the classification of mortality.
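For intuition about what a mutual information score measures, the sketch below computes it from its definition for two discrete variables: it is zero when a feature carries no information about the outcome and grows as the feature becomes more informative. The toy records are invented, and this plug-in estimate for discrete data is an illustration only; it is not necessarily the estimator used to produce Figure 5.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # Sum p(x, y) * log( p(x, y) / (p(x) * p(y)) ) over observed combinations.
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )

# Hypothetical toy records for eight patients (1 = yes, 0 = no).
died     = [1, 1, 0, 0, 0, 0, 1, 1]
diabetes = [1, 1, 1, 0, 0, 0, 0, 1]   # tends to agree with the outcome
sex      = [0, 1, 0, 1, 0, 1, 0, 1]   # evenly spread across both outcomes

mi_diabetes = mutual_information(diabetes, died)
mi_sex = mutual_information(sex, died)
```

In this toy example the diabetes indicator scores about 0.13 nats while the sex indicator scores zero, so a ranking by mutual information would place diabetes first.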

Performance of Classification Models.
The results of the seven ML models used in this study are now presented for both the imbalanced and the balanced mortality classes. The performance of models that used all features is also compared with that of models that used only the selected important features.
First, the ROC_AUC results of the ML models are displayed in Figure 7. For the dataset with imbalanced classes, the ML models performed relatively well, with ROC_AUC values ranging from 0.743 to 0.816; LR was the best model and DT the worst. With the same hyperparameter settings, ROC_AUC improved significantly for all seven models when the mortality classes were balanced using SMOTE, with values ranging from 0.869 to 0.974; XGB was the best model and NB the worst.
Second, the PRC_AUC results of the ML models are presented in Figure 8. On this relatively unbiased metric, all seven models performed poorly for the dataset with imbalanced classes. PRC_AUC values ranged from 0.269 to 0.365; RF was the best model and NB the worst. With the same hyperparameter settings for all models, PRC_AUC improved markedly for the dataset in which the mortality classes were balanced using SMOTE, with values ranging from 0.860 to 0.973. For the balanced dataset, XGB was the best model and NB the worst.
Third, the ROC and PRC plots made it clear that balancing the mortality classes using SMOTE improved the performance of all models used. Having adopted the dataset with balanced classes as the better choice for removing bias, the study then sought to determine whether all fourteen features assumed to be predictive of COVID-19 mortality were helping the models perform better. This led to the removal of features that were less important and less predictive of mortality, as shown earlier by the mutual information, MultiSURF, and normalized feature importance scores. A series of trials resulted in the removal of five less influential features: smoking, alcohol, chronic pulmonary disease (CPD), chronic kidney disease (CKD), and TB.
The ROC_AUC results of models with all fourteen features, compared with those of models with only the selected features, are presented in Figure 9. The models that used selected features left out the five less influential features (smoking, alcohol, CPD, CKD, and TB).
It can be clearly seen from Figure 9 that there are no significant differences in the performance of the seven models when the ROC_AUC results for all features are compared with those for the selected features. This invoked the law of parsimony, which favours the model with fewer features.
Finally, the ML classifiers were evaluated using various metrics, including accuracy, recall (sensitivity), specificity, precision, ROC_AUC, and PRC_AUC, as presented in Table 4. The seven ML models are presented in descending order, from the best-performing to the worst-performing model: XGB, GB, RF, SVM, DT, LR, and NB.
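For reference, the threshold-based metrics listed above can all be derived from the four cells of a confusion matrix. The sketch below shows the standard definitions; the function name `classification_metrics` and the counts are hypothetical and are not taken from Table 4.

```python
def classification_metrics(tp, fp, tn, fn):
    """Threshold-based metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)       # sensitivity: deaths correctly flagged
    specificity = tn / (tn + fp)  # recoveries correctly flagged
    precision = tp / (tp + fp)    # positive predictive value (PPV)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts for illustration only; not values from the study.
metrics = classification_metrics(tp=90, fp=10, tn=85, fn=15)
```

ROC_AUC and PRC_AUC are not included here because they summarize performance over all decision thresholds rather than at a single one.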
The Kruskal-Wallis one-way analysis of variance of the performance metric results across the ML models yielded significant results, which justified a follow-up pairwise Mann-Whitney U-test for each metric. To determine the best of the seven ML models, this study concentrated on comparing the ROC_AUCs and checking whether a significant difference existed between each pair, since similar results were observed for the other evaluation metrics.
The pairwise Mann-Whitney U-test analysis of the ROC_AUC results showed that, although the average ROC_AUC across algorithms was 93.3%, NB, LR, and DT performed significantly worse (p value ≤ 0.05) than the other ML models used. SVM performed significantly better than NB, LR, and DT but still significantly worse than the top three models (RF, GB, and XGB). As presented in Table 4, among the top three models, the best was XGB, with an ROC_AUC of 98.2% for all features and 97.5% for selected features. It was followed by GB, with an ROC_AUC of 97.6% for all features and 97.1% for selected features, and by RF in third place, with an ROC_AUC of 96.9% for all features and 96.8% for selected features. The pairwise Mann-Whitney U-test analysis of the top three models showed no significant difference between the best-performing model (XGB) and the second-best model (GB), nor between XGB and the third-best model (RF).
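The pairwise step of this comparison can be sketched as follows. This is a simplified illustration and not the study's code: the function name `mann_whitney_u` is hypothetical, the per-fold ROC_AUC values are invented, and the p-value uses the normal approximation without tie or continuity correction, which is crude for samples this small. A statistics library would normally be used instead.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U, p). No tie or continuity correction is applied.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based rank shared by tied values
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return u, p


# Hypothetical per-fold ROC_AUC values; not results from the study.
xgb = [0.975, 0.978, 0.972, 0.976, 0.974]
gb = [0.971, 0.977, 0.969, 0.973, 0.970]
nb = [0.869, 0.872, 0.865, 0.871, 0.868]

u_gb, p_gb = mann_whitney_u(xgb, gb)  # overlapping scores
u_nb, p_nb = mann_whitney_u(xgb, nb)  # fully separated scores
```

With these invented values, the XGB and GB scores overlap and the test does not reject equality at the 0.05 level, whereas the XGB and NB scores are fully separated (U = 0) and the difference is significant.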

Discussion
This section discusses the results just presented and offers interpretations of the findings. A brief summary of the findings is presented first, followed by a discussion of the important features that strongly influenced patients' susceptibility to mortality. Finally, the performance evaluation metrics of the ML models, which underpin the quality of the predictions made, are discussed.

Summary of Findings.
This study aimed to apply supervised ML models to predict mortality in hospitalized COVID-19 patients in Zambia by deriving and validating seven ML models for mortality prediction on Zambia's COVID-19 dataset. The study successfully performed internal validation on the dataset and identified features that proved to be predictive of mortality. Hospital length of stay and white blood cell count were found to help effectively in determining mortality; knowledge of patients' ages and diabetes status was also found to be reasonably useful. The study then quantified the influence that predictive features have on the final mortality outcome among hospitalized COVID-19 patients. The findings showed that the features can be ranked in order of decreasing importance, starting with hospital length of stay as the most influential feature, followed by age, wave, diabetes, hypertension, and sex. The performance of the ML models was then checked to identify the model that fitted the data best. The XGB model outperformed all other models on the evaluation metrics used, with an ROC_AUC of 97.5%. It was followed by the GB model, which had an ROC_AUC of 97.1% and did not perform significantly worse than the best model, whereas the worst-performing model (NB) still had a reasonably good ROC_AUC of 86.9%. The XGB model thus fitted the dataset better than the other models and was recommended in this study.

Feature Importance.
The feature importance analysis used three methods: the mutual information score, the MultiSURF score, and the normalized compound feature importance plot. All three methods gave consistent findings about the features that were most important and predictive of COVID-19 mortality. The most predictive feature was hospital length of stay, followed by white blood cell count. These two features greatly influenced how the ML models classified the mortality status of a COVID-19 patient. Other influential factors, in order of decreasing importance, included age, wave, diabetes, hypertension, and sex. These findings imply that if healthcare providers know the factors adding to a patient's length of hospitalization, along with the patient's age, sex, type of variant (represented by the variable wave), and whether the patient is diabetic or hypertensive, then they can better estimate the possibility of a COVID-19 case deteriorating into severe disease or mortality. This knowledge can also help government agencies responsible for public health to secure enough funding for measures that prioritise the healthcare of hospitalized COVID-19 patients at the highest risk of mortality in Zambia. It can also be applied in other countries with a setting similar to Zambia's.

ML Model Performance.
This discussion focuses on the results of ML models that were run on selected features, since the conditions under which a parsimonious model should be preferred were satisfied. First, the application of SMOTE to balance the classes in the dataset was essential and significantly improved the performance of the ML models across all evaluation metrics used. This was most evident in precision (PPV), for which most of the ML models fared poorly. For the dataset with imbalanced mortality classes, the two worst-performing models were DT, with a precision of 19.1% and 19.8% for all features and selected features, respectively, and SVM, with a precision of 19.1% and 20.2% for all features and selected features, respectively. After the mortality classes were balanced using SMOTE, the performance of the ML classifiers improved significantly: DT recorded a precision of 87.4% and 86.4% for all features and selected features, respectively, whereas SVM recorded a precision of 87.4% and 85.6% for all features and selected features, respectively. This study thus recommends the use of SMOTE in ML classification problems in which class imbalances are large enough to introduce potential misclassification bias.
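The core idea behind SMOTE, creating synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours, can be sketched as below. This is a simplified illustration under stated assumptions and not the study's implementation: the function name `smote_like_oversample`, the neighbour count `k`, and the two-feature data are all hypothetical.

```python
import random


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of `base` by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)),
        )[:k]
        neighbour = rng.choice(neighbours)
        weight = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(
            tuple(a + weight * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Hypothetical scaled two-feature minority samples; not study data.
minority = [(0.2, 0.7), (0.3, 0.8), (0.25, 0.6), (0.4, 0.9)]
majority_size = 10  # hypothetical size of the majority class
new_points = smote_like_oversample(minority, n_new=majority_size - len(minority))
balanced_minority = minority + new_points
```

Because each synthetic point lies on the line segment between two existing minority samples, the oversampled class stays within the region already occupied by the observed minority cases.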
All the ML models used in this study achieved reasonably high performance compared with the other studies presented in the Literature Review section. As presented in Table 4, the top three ML models for the balanced dataset using selected features were XGB, GB, and RF. The other ML models (SVM, DT, LR, and NB) also achieved similar results, although these were significantly lower than those of the top three models, as observed from the pairwise Mann-Whitney U-test analysis.
The post hoc analysis established that the best-performing model in this study, the XGB classifier, did not differ significantly from the second-best model, GB, or the third-best model, RF, since neither GB nor RF performed significantly worse than XGB. This implies that the top three models of our study (XGB, GB, and RF) are all well suited to the dataset used and can thus be recommended for similar classification problems in which high performance is sought.
The reasonably high performance recorded by the ML algorithms used can greatly help future modelling of COVID-19 data. Since all seven ML models performed reasonably well, future modelling of COVID-19 mortality may have to seriously consider them, with special attention given to the XGB model as the most effective in predicting mortality for hospitalized COVID-19 patients. Other models that may have to be considered are the GB and the SVM models. The application of these ML models may support effective and accurate prediction of mortality from COVID-19 and other similar health conditions, which may greatly help in the control of both current and future pandemics.

Comparison of Findings with Other Studies.
The findings of this study were consistent with other studies, such as those presented in the literature review. Current literature indicates that factors such as age, diabetes, hypertension, sex, and HIV are predictive of COVID-19 mortality. This was evident in the findings of this study, where LOS, age, white blood cell count, and type of variant (wave) were shown to be influential in classifying the mortality status of the participants. Furthermore, as other studies have shown, ML models can be very powerful in modelling how factors associated with COVID-19 mortality help classify the health outcome of hospitalized patients. Under proper conditions and with the right hyperparameter settings, ML models can achieve high values for accuracy, precision, ROC_AUC, PRC_AUC, and other metrics, as observed in this study, although it is not unusual to record poor results for some models if the data do not fit such models well.

Interesting Findings.
This study also yielded some interesting findings, discussed in this section. Few studies have reported the LOS of admitted patients as an important feature in classification problems of COVID-19 mortality. This could be because the variable LOS is rarely collected, since it changes for every day a patient remains admitted to a health facility. Surprisingly, LOS was the most important variable in the dataset used, and this was observed for all seven algorithms validated. The second most important feature was the white blood cell count. This also came as a surprise, as the literature review indicated that it has not been frequently used in classification models. The rare use of white blood cell count also seems to be associated with how rarely the variable is collected.
The feature "wave" was deliberately chosen to represent the type of COVID-19 variant in circulation and was also shown to be predictive of COVID-19 mortality. The feature "ward" was also predictive of mortality. On the other hand, the features smoking, alcohol, chronic pulmonary disease (CPD), chronic kidney disease (CKD), and TB were not shown to be important, and removing them did not significantly affect the performance of the ML models.

External Validity of Findings.
The methods implemented in this study and the results found may be effectively applied to various study settings other than the Zambian setting in which this study was conducted.
The participants selected for this study, as described in the eligibility criteria, comprised every hospitalized, confirmed COVID-19 case, with the sole exception of pregnant women. Participants thus had varied individual traits that were characteristic of the various health facilities in Zambia from which they came. This led to a reasonably large study sample that was highly inclusive, representative, and free from potential sources of sampling bias, which in turn added to the external validity of the study. The generalization ability of the ML models was also strengthened by the use of the 5-fold cross-validation strategy, as recommended by Berrar [42]. This study also strictly followed the ML methodologies, standards, and guidelines proposed by Luo et al. [23]. Any researcher can therefore apply our methods to reproduce our findings in a similar study setting by reusing our ML pipeline code, available on the Open Science Framework through the links provided in the supplementary materials section.
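The 5-fold cross-validation strategy mentioned above can be sketched as a simple index splitter. The stratification by outcome class shown here is an assumption made for illustration (the text above specifies only 5-fold cross-validation), and the function name `stratified_k_fold` and the labels are hypothetical.

```python
import random


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs with each class spread over k folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin into the folds.
        for position, i in enumerate(indices):
            folds[position % k].append(i)
    for f in range(k):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train_idx, test_idx


# Hypothetical labels: 10 deaths (1) and 40 recoveries (0); not study data.
labels = [1] * 10 + [0] * 40
splits = list(stratified_k_fold(labels, k=5))
```

Each sample appears in exactly one test fold, so every model is evaluated on data it was not trained on in that round, and the five fold-level scores can then be averaged or compared statistically.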

Strengths and Limitations of Study.
The high performance obtained from the ML models used can be attributed to the quality of the methods and their conformity to ML guidelines, methodological procedures, and conventions. This section discusses some of the strengths and limitations of our study.
This study used proven methodological procedures and well-documented guidelines, such as those recommended by Urbanowicz et al. [13], for the various hyperparameters proposed after a number of trials and simulations. The level of automation of the ML pipeline created for this study makes our ML algorithms almost completely reproducible in similar settings, given a dataset. This may greatly help similar studies that need to reproduce the results presented or employ similar methods in another study setting. Because the study sample was large and participants came from various health facilities in Zambia, the findings of our study are more generalizable than those of other studies. Despite the large class imbalance observed in the dataset, the use of SMOTE significantly reduced misclassification bias and increased the performance of the ML models. Another strength of our study was the use of multiple ML models and of a statistical procedure for selecting the best-performing model.
It is also important to weigh the limitations of our study, of which there were two major ones. The first was the relatively high percentage of missingness (18%), as shown in the Supplementary Material ("Figure S1: dataset missingness map"). Although the MICE procedure was used to handle missing values, it has been shown that imputing a dataset with a high percentage of missingness may introduce noise into the dataset. Similar studies might therefore record performance improvements if a dataset with a lower percentage of missingness were used. The second limitation was that most Zambian health facilities lack effective screening and diagnostic test equipment, which hinders the collection of well-known clinical features that have been shown to be predictive of COVID-19 mortality. Similar studies that seek to reproduce our findings should include the clinical features that were missing in our study to improve the quality and reliability of the results.

Conclusion
Predicting mortality in hospitalized COVID-19 patients using factors that influence the severity of the health condition is an essential undertaking in public health and epidemiology. In conclusion, as other studies have shown, the classification models XGB, GB, RF, SVM, DT, LR, and NB achieved the primary objective of this study: using the features collected from patients, they predicted mortality in 1,433 hospitalized patients in Zambia with reasonably high values of accuracy, recall (sensitivity), specificity, precision, F1-Score, ROC_AUC, and PRC_AUC. The findings, if put to use, have the potential to improve preparedness in health facilities, the prioritization of funds, and healthcare to save the lives of COVID-19 patients at the greatest risk of mortality.
The seven ML models were successfully derived and validated, and all achieved sufficiently high performance. The XGB classifier, chosen as the best-performing model, performed well in our classification problem and should be strongly considered for classification problems in similar settings. GB and RF can also be effective alternatives to XGB for similar studies. Many factors were shown to influence the susceptibility of hospitalized COVID-19 patients to mortality. LOS and white blood cell count strongly influenced the classification process, while other factors such as age, sex, hypertension, diabetes, and ward also showed noticeable influence in determining the mortality outcome. This implies that healthcare providers should be fully aware of the underlying health conditions of their patients in order to offer lifesaving services that may help both to improve preparedness and to decongest health facilities.

Recommendations for Public Health Practice and Further Research.
Having stressed the importance of factors that are predictive of COVID-19 mortality, we strongly recommend that health facilities where COVID-19 patients are admitted carefully and accurately keep track of each patient's LOS and also collect patients' white blood cell counts, in addition to the other routine variables discussed in this study. There should be sustained prioritization of admitted patients identified as having the greatest risk of mortality, and vaccination should be encouraged as soon as it is necessary. Because of the drawbacks associated with the interpretability of ML models [59], this study also recommends that similar studies try a hybrid approach that uses both ML and conventional statistical classification methods. Such an approach would yield more interpretable results that go beyond identifying features as important to describing the nature of their influence on the classification problem, that is, whether the predictive features identified increased or reduced mortality, and by how much. This would combine the advantages of both methods: high performance and interpretable findings.
To add to the body of knowledge and consolidate the findings of this study, especially the interesting findings stated, we strongly recommend studies that aim to reproduce these findings in another study setting. The success of such studies would help establish the interesting findings of this study as reproducible and reliable.

Data Availability
Data are not publicly available; however, they may be made available if the data request is approved by ZNPHI.

Disclosure
A preprint version of this paper has previously been published on AfricArXiv [1], a pan-African open access preprint repository hosted by the Center for Open Science.

Conflicts of Interest
The authors declare that they have no conflicts of interest.