A predictive model, and predictors of under-five child malaria prevalence in Ghana: How do LASSO, Ridge and Elastic net regression approaches compare?

Highlights
• One out of four under-five children (U5C) tested positive for malaria in Ghana.
• U5C older than 24 months or those severely anaemic were more likely to test positive for malaria.
• U5C residing in households without electricity were more likely to test positive for malaria.
• U5C residing in a rural area were more likely to test positive for malaria.
• U5C residing in the poorest household were more likely to test positive for malaria.


Introduction
Childhood malaria remains one of the leading causes of under-five morbidity and mortality in Sub-Saharan Africa (SSA) (Maitland, 2016; Camponovo et al., 2017). Malaria is known to cause haemolysis of red blood cells (erythrocytes) coupled with the formation of abnormal red blood cells (dyserythropoiesis), both of which culminate in the development of anaemia in children (White, 2018). Complications of malaria have unfavourable clinical outcomes with a significant case-fatality rate (Aponte et al., 1999). Therefore, childhood malaria has been taken seriously by clinicians and policymakers over the years.
Substantial global policy initiatives have been implemented since the early 2000s to curb the burden of malaria in SSA. An example is the United States President's Malaria Initiative (PMI), which was launched in 2005 and led to increased availability of insecticide-treated mosquito nets (ITNs), antimalarial treatments, rapid diagnostic tests, and indoor residual spraying, amongst others. The PMI has led to a significant reduction in under-five mortality in SSA (Jakubowski et al., 2017). Following the success of the "for a malaria-free world 2008-2015 initiative", the Roll Back Malaria Partnership outlined an action plan dubbed "Action and Investment to Defeat Malaria (AIM) 2016-2030" (Partnership and Action, 2015). The alignment of the timeframe of AIM with that of the Sustainable Development Goals (SDGs) underscores the need to address under-five malaria to ensure the realization of SDG 3. Nevertheless, malaria continues to be a significant cause of childhood deaths in SSA, threatening to derail progress towards SDG target 3.2, which seeks to reduce under-five mortality to at least as low as 25 per 1000 live births by 2030.
The potential adverse outcomes of childhood malaria underscore the need for early detection and identification of high-risk populations. Researchers have over the years used a variety of predictive modelling approaches to identify high-risk populations. These have included correlation studies, standard linear and logistic regression models, Poisson regression, non-linear models, autoregressive integrated moving average (ARIMA) models, and spatial mapping approaches (Zhou et al., 2004; Wangdi et al., 2010; Bi et al., 2003; Craig et al., 2004; Weiss et al., 2019; Millar et al., 2018; Yankson et al., 2019). These predictive modelling approaches are largely limited by the number of covariates that can be fitted and are usually subject to the intuition of the researcher. For conditions such as malaria, which is influenced by a range of physical, climatic, and social factors, machine learning models provide the opportunity to fit many covariates to identify high-risk populations. Wang et al. (2019) demonstrated the superiority of ensemble algorithms in predicting malaria in China using secondary health data. However, there is a paucity of literature in the Ghanaian context utilizing machine learning algorithms to predict malaria in children under five. This study sought to fill this gap by using LASSO, Ridge, and Elastic Net regression models to build a predictive model of malaria prevalence in children under five years in Ghana.

Design, data collection, and study sample
We analyzed the data on children under five from the 2019 Ghana Malaria Indicator Survey (GMIS) (Ghana Statistical Service (GSS), 2019). The GMIS is based on a two-stage sampling design. The sampling was based on ten administrative regions. Each region was divided into urban and rural areas, resulting in twenty sampling strata. Enumeration areas (EAs) were sampled from each stratum. In the first stage, 200 EAs (97 in urban areas and 103 in rural areas) were selected with probability proportional to EA size (Ghana Statistical Service (GSS), 2019). In the second stage of selection, a fixed number of 30 households was selected from each cluster to make up a total sample size of 6,000 households (Ghana Statistical Service (GSS), 2019). A total of 5,181 women aged 15-49 (representing a 98.8% response rate) who were either permanent residents of the selected households or visitors who stayed in the household the night before the survey were eligible to be interviewed (Ghana Statistical Service (GSS), 2019). With the parent's or guardian's consent, children aged 6-59 months were tested for anaemia and malaria infection (Ghana Statistical Service (GSS), 2019). The biomarker dataset has malaria RDT results on 2,867 children under five in Ghana.

Outcome variable
The outcome variable is whether a child tested positive for malaria on a rapid diagnostic test (RDT). The malaria test for children under five was conducted by taking a drop of blood and testing it with the SD BIOLINE Malaria Ag P.f RDT kit. This test kit produces a result in 15 minutes (Ghana Statistical Service (GSS), 2019). The SD BIOLINE P.f RDT tests for one antigen, histidine-rich protein II (HRP-II), specific to Plasmodium falciparum (Pf), the major cause of malaria in Ghana (Ghana Statistical Service (GSS), 2019).

Explanatory variables
The selection of explanatory variables was informed by a literature search and their availability in the dataset. These variables include the following: child age, number of under-five children in a household, has mosquito bed net for sleeping, sex of household head, sex of a household member, dwelling sprayed against mosquito last 12 months, household wealth, child-anaemia status, has electricity in the household, has a television in the household, place of residence, the region of residence, number of children who slept under mosquito bed net previous night, insecticide-treated net available in the household, and number of household members.

Statistical analyses
We describe the characteristics of the study sample using frequencies and percentages. A chi-square test of independence was performed between the outcome and each explanatory variable. We used the Least Absolute Shrinkage and Selection Operator (LASSO), Ridge, and Elastic Net regression methods to identify variables to build the best-fitting predictive model of malaria prevalence in Ghana. For LASSO, an alpha value of 1 was used, and for Ridge an alpha value of 0. Given that the alpha value for Elastic Net lies strictly between zero and one (i.e., 0 < alpha < 1), we estimated it by maximum likelihood; the resulting value was 0.4186508, based on 5-fold cross-validation repeated five times using the 'caret' package. We estimated the minimum lambda (i.e., the lambda with the lowest mean squared error (MSE)) for LASSO, Ridge, and Elastic Net via maximum likelihood estimation under k-fold cross-validation. The 'glmnet' package was used to select features for all models under the machine learning approaches.
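The resampling scheme described above (5-fold cross-validation repeated five times) can be sketched as follows. This is a minimal illustration in Python rather than the R 'caret' workflow used in the study; the function name `repeated_kfold` and the seed are hypothetical, and model fitting is deliberately left out.

```python
import random

def repeated_kfold(n_samples, n_folds=5, n_repeats=5, seed=2019):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)  # hypothetical seed, chosen only for reproducibility
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random partition for every repeat
        for fold in range(n_folds):
            test_idx = indices[fold::n_folds]  # every n_folds-th shuffled index
            held_out = set(test_idx)
            train_idx = [i for i in indices if i not in held_out]
            yield repeat, fold, train_idx, test_idx

# 2867 children with RDT results, as reported for the study sample.
resamples = list(repeated_kfold(2867))
print(len(resamples))  # 25 train/test partitions
```

In each of the 25 partitions, a candidate pair of alpha and lambda values would be fitted on the training indices and scored on the held-out indices; averaging those scores is what allows one candidate to be preferred over another.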
Let $Y$ be the malaria indicator. We set the binary response $Y_i = 1$ if the $i$-th child had malaria and $Y_i = 0$ otherwise, and assume $\pi_i$ to be the probability that a given child $i$ had malaria. Thus, our model formulation for the multivariable binary logistic regression for predicting under-five malaria status is:
$$
\log\left(\frac{\pi_i}{1-\pi_i}\right) = x_i^{\top}\beta,
$$
where $x_i$ is a vector of predictors and $\beta$ is a vector of regression coefficients for the predictors in the model. We extend this model to incorporate the regularization parameters for the LASSO, Ridge, and Elastic Net models. After fitting the model to the full dataset, we split the data into 80% training and 20% validation sets. We then fit models to these data and evaluate their predictive ability via the area under the receiver operating characteristic (ROC) curve (AUC). To examine any evidence of multicollinearity, we employed the generalized variance inflation factor (GVIF) (Hair et al., 2018; Fox and Monette, 1992), with a GVIF value below 10 considered acceptable (Hair et al., 2018). The goodness of fit of the model was tested using the Hosmer and Lemeshow goodness of fit (GOF) test. The fit was also examined using McFadden's $R^2$; a model with a value between 0.2 and 0.4 is considered an excellent fit. All analyses were performed in the R freeware version 4.0.2 (Core-Team R, 2019).
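To make the extension to regularized models concrete, the penalized objective can be written as below. This is a sketch in the parameterization documented for the 'glmnet' package, not an equation printed in the paper. The symbols $n$ (number of children), $p$ (number of coefficients), $y_i$ (observed value of $Y_i$), $\lambda$ (penalty weight), and the separately written intercept $\beta_0$ (left unpenalized) are notation introduced here for illustration.

$$
\min_{\beta_0,\,\beta}\; -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i\,(\beta_0 + x_i^{\top}\beta) - \log\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big) \Big] \;+\; \lambda\Big[\frac{1-\alpha}{2}\sum_{j=1}^{p}\beta_j^{2} + \alpha\sum_{j=1}^{p}|\beta_j|\Big]
$$

Under this form, alpha = 1 leaves only the absolute-value term (LASSO), alpha = 0 leaves only the squared term (Ridge), and an intermediate value such as the reported 0.4186508 mixes the two (Elastic Net). Only the absolute-value term can shrink coefficients exactly to zero, which is why LASSO and Elastic Net can drop features while Ridge retains all of them.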

Ethical consideration
We obtained permission to use the 2019 GMIS data from the DHS MEASURE Program; the data are freely available after a simple registration and access request at the following address: https://dhsprogram.com/data/dataset_admin/index.cfm. Their report indicates that the protocol for the 2019 GMIS was approved by the Ghana Health Service Ethical Review Committee and ICF's Institutional Review Board (Ghana Statistical Service (GSS), 2019).

Results
In the sample, one out of four children tested positive for malaria (25.04%) (see Table 1). The factors that were significantly associated with malaria among children include child age, number of under-five children, has mosquito bed net for sleeping, under-five children who slept under mosquito bed net last night, sex of household, household wealth, anaemia level, has electricity in household, has television in the household, place of residence, the region of residence, number of children who slept under mosquito bed net previous night, insecticide-treated net, and number of household members (see Table 1).

Feature selection to build the predictors of malaria prevalence model
LASSO, Ridge, and Elastic Net regressions were used for feature selection to build a predictive model of malaria prevalence (Table 2), and the binomial deviance versus log(lambda) plots are presented as Fig. 1. The variables included in each of the feature selection models were: child age, number of under-five children in a household, has mosquito bed net for sleeping, sex of household head, sex of a household member, dwelling sprayed against mosquito last 12 months, household wealth, child-anaemia status, has electricity in the household, has a television in the household, place of residence, the region of residence, number of children who slept under mosquito bed net previous night, insecticide-treated net available in the household, and number of household members. Fig. 1 displays the cross-validation error according to the log of the regularization parameter (lambda). The left dashed black vertical line indicates the optimal value of lambda, which is the one that minimizes the prediction error (i.e., binomial deviance). This lambda value is expected to provide the most accurate model. For example, the top plot in Fig. 1 indicates that a log(lambda) of approximately -5.7 minimizes the prediction error, with 11 features selected.
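The rule behind the left vertical line in such a plot, picking the lambda with the smallest mean cross-validated deviance, can be sketched in a few lines. This is an illustrative Python sketch, not the 'glmnet' implementation; the function name `select_lambda_min` is hypothetical and the deviance values are made up purely for demonstration.

```python
import math

def select_lambda_min(lambdas, cv_deviance):
    """Return the lambda whose mean cross-validated deviance is smallest."""
    if len(lambdas) != len(cv_deviance):
        raise ValueError("lambdas and cv_deviance must have the same length")
    best = min(range(len(lambdas)), key=lambda i: cv_deviance[i])
    return lambdas[best]

# Made-up illustration values, NOT the study's cross-validation results.
log_lambdas = [-8.0, -7.0, -6.0, -5.0, -4.0, -3.0]
lambdas = [math.exp(v) for v in log_lambdas]
cv_deviance = [0.905, 0.902, 0.899, 0.901, 0.930, 1.010]

lambda_min = select_lambda_min(lambdas, cv_deviance)
print(round(math.log(lambda_min), 1))  # log(lambda) at the minimum of the curve
```

Larger lambda values shrink more coefficients to zero, so the number of retained features printed along the top of such a plot falls as log(lambda) increases.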

Predictive ability of the feature selected models of LASSO, RIDGE, and Elastic Net
We built three logit models, each with the features selected by the LASSO, Ridge, and Elastic Net regressions. The logit models based on features selected by LASSO, Ridge, and Elastic Net contained eleven, fifteen, and thirteen features, respectively. All the models explained about 20% of the variability in malaria prevalence in Ghana, with the same area under the receiver operating characteristic curve (AUROC = 81.20%), indicating that the models were good at predicting malaria prevalence in this group of children (Table 3, Fig. 2). Based on the principle of parsimony, the LASSO regression is preferred because it contains the smallest number of predictors and has the smallest prediction error. We also present the root mean square error (RMSE, i.e., prediction error) as a performance indicator for our models, based on the cross-validation estimates obtained. The best model is the one with the lowest prediction error. Here again, the LASSO model (RMSE = 0.9489, SD = 0.0202) performed relatively better than the Ridge (RMSE = 1.0366, SD = 0.0172) and Elastic Net (RMSE = 0.9531, SD = 0.0190) models (Table 3), supporting the choice of the LASSO model. Thus, only the results of the logit model with LASSO-selected features were interpreted.
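For readers less familiar with the AUROC, it can be read as the probability that a randomly chosen malaria-positive child receives a higher predicted risk than a randomly chosen malaria-negative child. The sketch below computes it directly from that definition in Python; the function name `auroc` is hypothetical, and the labels and scores are made up for illustration rather than taken from the study.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels (1 = RDT positive) and predicted probabilities, for illustration only.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.81, 0.10, 0.35, 0.40, 0.05, 0.62, 0.22, 0.35]
print(auroc(labels, scores))  # 0.9
```

On this reading, an AUROC of 81.20% means that in roughly four out of five such positive-negative pairs the model ranks the positive child higher.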
We further examined our final (i.e., LASSO) model to detect any presence of multicollinearity using the generalized variance inflation factor (GVIF). All the estimates of GVIFs are below 3, suggesting that there is no evidence of multicollinearity. The Hosmer and Lemeshow goodness of fit (GOF) test reveals no evidence of lack of fit ($\chi^2_8$ = 13.6, p-value = 0.0939). Also, McFadden's $R^2$ of 0.23 revealed an excellent fit for our final model.
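The two fit measures used here can be sketched from their standard textbook definitions. The Python code below is an illustration under stated assumptions, not the R routines used in the study: the function names are hypothetical, the Hosmer-Lemeshow grouping uses ten equal-sized risk groups (which gives the 8 degrees of freedom seen in the reported statistic), and the data are simulated.

```python
import math
import random

def log_likelihood(y, p):
    """Bernoulli log-likelihood of outcomes y under predicted probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi) for yi, pi in zip(y, p))

def mcfadden_r2(y, p):
    """McFadden's R^2: 1 - logL(model) / logL(intercept-only model)."""
    base_rate = sum(y) / len(y)
    null_ll = log_likelihood(y, [base_rate] * len(y))
    return 1.0 - log_likelihood(y, p) / null_ll

def hosmer_lemeshow(y, p, groups=10):
    """Hosmer-Lemeshow chi-square statistic and its degrees of freedom (groups - 2)."""
    ranked = sorted(zip(p, y))  # order observations by predicted risk
    n = len(ranked)
    stat = 0.0
    for g in range(groups):
        chunk = ranked[g * n // groups:(g + 1) * n // groups]
        expected = sum(pi for pi, _ in chunk)   # expected number of positives
        observed = sum(yi for _, yi in chunk)   # observed number of positives
        size = len(chunk)
        stat += (observed - expected) ** 2 / (expected * (1.0 - expected / size))
    return stat, groups - 2

# Simulated probabilities and outcomes, NOT the study data.
rng = random.Random(0)
p = [rng.uniform(0.05, 0.60) for _ in range(500)]
y = [1 if rng.random() < pi else 0 for pi in p]

stat, df = hosmer_lemeshow(y, p)
print(round(mcfadden_r2(y, p), 3), round(stat, 2), df)
```

A small Hosmer-Lemeshow statistic relative to its chi-square reference distribution means observed and expected counts agree within each risk group, which is what a non-significant p-value such as 0.0939 reflects.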

Evaluation of model fit on training and validation datasets
We tested our final preferred model (i.e., LASSO) on the training dataset using the Hosmer and Lemeshow goodness of fit (GOF) test. We did not observe any evidence of lack of fit ($\chi^2_8$ = 11.8, p-value = 0.1627). Also, McFadden's $R^2$ values of 0.25 and 0.21 for the training and validation dataset models, respectively, indicate an excellent fit for our model. The AUC values of the fitted model for the training and validation datasets are 82.3% and 79.5%, respectively (Fig. 3), indicating good predictive ability for both. We tested for any difference in predictive performance between the fitted models for the training and validation sets by comparing their ROC curves. Both DeLong's test (D = 1.1993, p-value = 0.2308) and the bootstrap test (D = 1.197, p-value = 0.2313) for the two ROC curves suggest that there is no evidence of a significant difference in the predictive performance of these models.
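One way a bootstrap comparison of two independent ROC curves can be constructed is sketched below in Python. It is an assumption-laden illustration, not the R routine used in the study: the statistic D is taken as the AUC difference divided by the standard deviation of bootstrap differences and referred to a standard normal distribution; resampling is plain (not stratified); all function names are hypothetical; and the data are simulated.

```python
import math
import random

def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties = 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def resample(labels, scores, rng):
    """Bootstrap sample of (label, score) pairs, redrawn until both classes appear."""
    n = len(labels)
    while True:
        idx = [rng.randrange(n) for _ in range(n)]
        lab = [labels[i] for i in idx]
        if 0 < sum(lab) < n:
            return lab, [scores[i] for i in idx]

def bootstrap_auc_test(train, valid, n_boot=500, seed=1):
    """Return (D, two-sided p-value) for the difference between two independent AUROCs."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        diffs.append(auroc(*resample(*train, rng)) - auroc(*resample(*valid, rng)))
    mean = sum(diffs) / n_boot
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n_boot - 1))
    d_stat = (auroc(*train) - auroc(*valid)) / sd
    p_value = math.erfc(abs(d_stat) / math.sqrt(2.0))  # two-sided normal tail area
    return d_stat, p_value

def simulate(n, rng):
    """Made-up scores and outcomes; NOT the study data."""
    scores = [rng.uniform(0.02, 0.90) for _ in range(n)]
    labels = [1 if rng.random() < s else 0 for s in scores]
    return labels, scores

rng = random.Random(42)
train = simulate(240, rng)
valid = simulate(60, rng)
d_stat, p_value = bootstrap_auc_test(train, valid, n_boot=200)
print(round(d_stat, 3), round(p_value, 3))
```

As a consistency check on the normal-tail conversion, `math.erfc(1.197 / math.sqrt(2))` evaluates to about 0.231, in line with the bootstrap p-value of 0.2313 reported for D = 1.197.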

Regressors of malaria prevalence in Ghana
Malaria prevalence in Ghana was regressed on the following factors: child age, number of under-five children in a household, sex of household head, dwelling sprayed against mosquito last 12 months, household wealth, child-anaemia status, has electricity in households, place of residence, the region of residence, number of children who slept under mosquito bed net previous night, and insecticide-treated net available in the household.
The factors that were significantly related to the outcome were child age, household wealth, child anaemia status, presence of electricity in household, place of residence, and region of residence. The adjusted odds ratios are reported in Table 4.

Discussion
This study finds that in 2019, one out of four children tested positive for malaria (25.04%), with considerable malaria prevalence across the different age groups of children under five years. Our results also showed a good predictive ability of our fitted models (AUROC = 81.20%) to predict under-five malaria prevalence. Factors that were significantly associated with malaria prevalence in Ghana included child age, household wealth, child anaemia status, presence of electricity in household, place of residence, and region of residence.
We found that children older than 24 months were more likely to test positive for malaria. This finding may be attributable to multiple reasons. One plausible explanation is the age-related decline in malaria antibodies acquired from the mother during pregnancy. Although there is no consensus in the literature on the effect of maternally acquired immunity in protecting against childhood malaria (Riley et al., 2001, 1998), the assumption is that children in malaria-endemic areas such as Ghana acquire malaria antibodies from their mothers while in the womb, but this immunity wanes gradually as the child grows. This, coupled with low utilization of insecticide-treated nets among children older than 24 months (Nkoka et al., 2019) due to prioritization of ITN access for younger siblings, may explain our observation that children older than 24 months are more likely to test positive for malaria. Moreover, mothers tend to give much younger index children more attention than children aged 24 months or more; hence the latter are more likely to be exposed to mosquito bites. Other studies have also reported a significant association between age and malaria infection in children in Ghana (Nyarko and Cobblah, 2014; Orish et al., 2015; Chilanga et al., 2020). In a retrospective study at the Western Regional Hospital in Ghana, Orish et al. (2015) noted that the prevalence of malaria was instead higher among younger children. This difference is understandable: although the community prevalence of malaria may actually be higher for older children, the health-seeking behaviours of parents plausibly prioritise younger children with malaria for treatment. We also found that children in households of at least middle wealth had a lower likelihood of testing positive for malaria.
This finding can be explained by the fact that children from wealthy households are more likely to live in affluent neighbourhoods with good drainage systems and clean environments that reduce the breeding of mosquitoes, thus decreasing the likelihood of mosquito bites and malaria (Dickinson et al., 2012). Moreover, parents/guardians of children from wealthy households are more likely to be able to afford the purchase and use of ITNs (Dickinson et al., 2012; Ruyange et al., 2016), hence reducing the likelihood of malaria in children from wealthy families. This finding corroborates the findings of other studies in Ghana (Nyarko and Cobblah, 2014; Afoakwah et al., 2018) and other African countries (West et al., 2013), which all reported a lower burden of malaria among children from wealthy households. The study found that children who were not severely anaemic or not anaemic at all had a lower likelihood of testing positive for malaria. The association between anaemia and malaria in SSA has been well documented in the literature (McCuskee et al., 2014). This finding can be related to the haemolytic effect of malaria on red blood cells, which causes anaemia (White, 2018). This likely explains why mildly anaemic and non-anaemic children were less likely to test positive for malaria.
The study also found that children in households with electricity had a lower likelihood of testing positive for malaria. Access to electricity can be understood as a proxy for wealth status, access to other social amenities, and socioeconomic status (Worrall et al., 2003). The assumption is that access to electricity, as a proxy for socioeconomic status, creates protective conditions such as access to ITNs and a clean place of residence, which reduce the likelihood of malaria (Dickinson et al., 2012; Worrall et al., 2003). A more direct plausible explanation is that the use of electrically operated equipment such as fans within households with electricity can reduce mosquito bites. Nevertheless, some literature on the association between access to electricity and malaria prevalence reports contrary findings, in which access to electricity is positively associated with malaria prevalence (Tasciotti, 2017). Tasciotti (2017), for example, opined that access to electricity instead increases malaria vector density in households, which supports the view that mosquitoes are attracted by light. This, coupled with the fact that members of households with electricity are likely to spend more time outdoors in the evening, increases their risk of mosquito bites (Tasciotti, 2017).
Our study also revealed that children in rural areas had a higher likelihood of testing positive for malaria. This finding supports the assumption that urbanization is protective against malaria in sub-Saharan Africa (Hay et al., 2005). Besides, our findings agree with the results of Afoakwah et al. (2018), who reported that rural children had a higher burden of malaria in Ghana. This may be explained by the fact that poverty is common in rural areas, coupled with poor housing and environmental conditions that promote the breeding of mosquitoes. Our findings reflect the need to prioritise rural areas in malaria prevention policies.
We also found that having had dwelling areas sprayed against mosquitoes in the twelve months before the survey was not protective against malaria. Yearly spraying appears not to offer much protection since mosquitoes breed virtually throughout the year, although the breeding rate and vector burden may be higher in the rainy season (Dery et al., 2010). In contrast, some studies have reported a significant protective effect of household spraying when the effect was assessed over a shorter duration of 6 months (Afoakwah et al., 2018; Belete and Roro, 2016; Hamusse et al., 2012). For example, Belete and Roro (2016) reported that spraying of the house environment in the last 6 months offers protection from malaria. Moreover, Hamusse et al. (2012) showed that indoor residual spraying was effective in protecting against malaria within 6 months of the initial spraying. This underscores the need for continuous spraying at shorter intervals, such as every 4-6 months, as yearly spraying appears not to be sufficient to prevent malaria.
The study found that, compared to children in the Western region (high rainforest ecological zone), their counterparts in the Greater Accra (Coastal Savannah), Ashanti (semi-deciduous rainforest), Northern (Guinea Savannah), Upper East (Sudan Savannah), and Upper West (Guinea Savannah) regions had a lower likelihood of malaria. This can be explained by the fact that the high rainforest ecological zone of the Western region receives abundant rain compared to the other ecological zones. Rainfall is known to be associated with high densities of malaria vectors (Dery et al., 2010). With lower rainfall in the other regions, children living there are less likely to have malaria than their counterparts in the high rainforest ecological zone of the Western region.

Strengths and limitations
We have demonstrated the usefulness of machine learning techniques in predictive modelling for malaria in Ghana, with good discriminative performance as seen in this study. The preliminary identification of variables for the final modelling using the LASSO, Ridge, and Elastic Net methods was less dependent on the researcher's intuition. The use of machine learning was also possible because a large, nationally representative dataset was used. Because we used nationally representative cross-sectional data, our findings can be generalized to children from other similar countries. Also, the big data approach to malaria modelling has additional benefits with regard to scalability and transferability to other settings with comparable data. Although our machine learning modelling appeared to have good predictive ability, the results are dependent on the data used in development and validation. Larger datasets than the one we used would perhaps produce better-trained models. Finally, the associations observed in this study do not imply causality.

Conclusion
In summary, our study investigated the utility of machine learning approaches for predictive modelling of malaria prevalence among children under five years. The results provided proof of concept and identified that age of the child, household wealth, place of residence, region of residence, anaemia status, and access to electricity were significantly predictive of malaria prevalence. The results (AUROC = 81.20%) show that our models perform well at predicting under-five malaria prevalence. Besides identifying high-risk populations for cost-effective interventions, our study should serve as encouragement for malaria researchers in Ghana who are interested in machine learning and big data approaches to modelling malaria prevalence.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.