Exploring the determinants of under-five mortality and morbidity from infectious diseases in Cambodia—a traditional and machine learning approach

Cambodia has made progress in reducing the under-five mortality rate and the burden of infectious diseases among children over the last decades. However, the determinants of child mortality and morbidity in Cambodia are not well understood, and no recent analysis has been conducted to investigate possible determinants. We applied a multivariable logistic regression model and a conditional random forest to explore possible determinants of under-five mortality and under-five child morbidity from infectious diseases using the most recent Demographic and Health Survey, conducted in 2021–2022. Our findings show that the majority (58%) of under-five deaths occurred during the neonatal period. Contraceptive use of the mother was associated with lower odds of under-five mortality (0.51 [95% CI 0.32–0.80], p-value 0.003), while being born fourth or later was associated with increased odds (3.25 [95% CI 1.09–9.66], p-value 0.034). An improved household water source and a higher household wealth quintile were associated with lower odds of infectious disease, while living in the Great Lake or Coastal region was associated with increased odds. The odds ratios were consistent with the results from the conditional random forest. The study showcases how closely related child mortality and morbidity due to infectious disease are to broader social development in Cambodia and the importance of accelerating progress in many sectors to end preventable child mortality and morbidity.

lower birth weight and living in rural areas had a higher risk of mortality while those born to mothers who use contraceptives had a lower risk of mortality.
Neonatal disorders and infectious diseases continue to cause the most disability-adjusted life years for children under five years in Cambodia according to the Global Burden of Disease estimates 9. For children aged 5-14 years, the causes of morbidity are more varied but are primarily non-communicable diseases and injuries 9. The mortality rate among under-five children from lower respiratory infections in Cambodia has declined by more than 80% since 1990, mainly due to increased vaccination coverage, lower household air pollution, and better nutritional status of children 10. Nonetheless, lower respiratory infections are still the leading infectious cause of death and morbidity, with diarrhea as the second 9. The possible drivers of lower respiratory infections in children in Cambodia have not been studied. However, a study by Vong et al. 11 using parametric statistical analysis of 2014 DHS data showed that lack of water and sanitation facilities and maternal unemployment were associated with a higher risk of diarrhea, while older maternal age was associated with a lower risk of diarrhea.
As described in the above studies on child mortality and morbidity in Cambodia 7,8,11, as well as in other global estimates of determinants of child health 12,13, standard linear regression, logistic regression or a mix thereof has been the only statistical approach used. These methods might be limited when the data to be analyzed have a high degree of correlation or random noise, or do not follow assumptions of normality. Machine learning, however, has been shown to provide complementary evidence on the determinants of child health in low- and middle-income countries. For instance, Bizzago et al. 14 used a random forest with data from household surveys in 27 countries to assess the most important determinants of under-five mortality, while Methund et al. 15 showed how a logistic classifier machine learning algorithm could be used to explore determinants of infectious diseases in children from a multiple indicator cluster survey.
Overall, Cambodia has shown a continued reduction in the under-five mortality rate and child morbidity from infectious diseases since 2015. However, no study has investigated child mortality and morbidity data after 2015 or applied machine learning, which has emerged as a useful approach to complement more traditional parametric statistical analyses. Hence, the aim of this study was to explore factors that might be associated with under-five mortality and child morbidity from infectious diseases using the most recent DHS conducted in Cambodia.

Data source
This study is based on quantitative data derived from the Cambodian DHS (CDHS), a nationally representative household survey that collects a wide range of data from demographics to maternal and child health. The first round of the CDHS was conducted in 1998, and the survey has been repeated approximately every fifth year since then, with the latest survey in 2021-2022. The multi-stage stratified sampling technique and specifics on the structured questionnaire are presented extensively elsewhere 16; the sampling unit for the survey was households. The unit of analysis in our study was under-five children in the CDHS conducted in 2021-2022.

Outcomes and possible predictor variables
The primary outcome was defined as a child dying before their fifth birthday in the last five years preceding the study. The secondary outcome was defined as a child under the age of five years having fever, acute lower respiratory disease, or diarrhea during the last two weeks preceding the survey. Possible variables that could be associated with the outcome were identified through established frameworks for understanding determinants of child mortality and morbidity 17, previous studies in Cambodia 7,8,11 and CDHS data information 16. A descriptive analysis of the identified variables is presented with weighted counts, accounting for the cluster and sampling design. Of the identified variables, those with less than 30% missing data and where data were captured for all under-five children were used to analyze the primary and secondary outcomes further. This led our multivariable models to include the following variables: twin, birth order of child, previous birth interval of mother, mother's age at birth, contraceptive use of mother, mother's highest educational level, the number of births in the last five years of the mother, drinking water source, sanitation facility, cooking fuel, electricity, household wealth quintile, household type, geographical region, and health insurance. A detailed description of the variables, including necessary recoding from the CDHS dataset, is included in the Supplementary Material (Table S1). Lastly, children with missing data for any of the variables included in the models were excluded from the dataset.
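The variable screening and complete-case step described above can be illustrated with a minimal Python sketch. It is not the authors' code (the analysis was carried out in R); the function name `screen_and_filter`, the toy records and the column names are hypothetical, and only the 30% missingness threshold is implemented, not the additional requirement that a variable be captured for all under-five children.

```python
def screen_and_filter(rows, variables, max_missing=0.30):
    """Keep variables whose share of missing values is below max_missing,
    then keep only rows that are complete on the retained variables."""
    n = len(rows)
    kept = [
        v for v in variables
        if sum(1 for r in rows if r.get(v) is None) / n < max_missing
    ]
    complete = [r for r in rows if all(r.get(v) is not None for v in kept)]
    return kept, complete


# Hypothetical toy records; None marks a missing value.
rows = [
    {"twin": 0, "birth_order": 1, "birth_weight": None},
    {"twin": 0, "birth_order": None, "birth_weight": None},
    {"twin": 1, "birth_order": 3, "birth_weight": 2.9},
    {"twin": 0, "birth_order": 2, "birth_weight": None},
]
kept, complete = screen_and_filter(rows, ["twin", "birth_order", "birth_weight"])
print(kept)           # variables passing the missingness screen
print(len(complete))  # number of complete cases on those variables
```

In this toy example, `birth_weight` is missing for three of four records and is therefore screened out, after which the one record lacking `birth_order` is dropped.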

Statistical analysis
First, the neonatal (from birth to 28 days of life), infant (birth to one year of age) and under-five mortality rates per 1000 live births and their 95% confidence intervals were calculated with the Jackknife variance estimator 18, in line with the established DHS method 19. Secondly, survey-weighted univariable and multivariable generalized linear models with a binomial family (logit link) were used to conduct statistical inference on the primary and secondary outcome probability, with robust standard errors clustered at the CDHS cluster level while taking into account the strata 19,20. Unadjusted odds ratios and 95% confidence intervals were estimated for all variables considered, while adjusted odds ratios and 95% confidence intervals were estimated for the variables included in the respective multivariable models. Large-sample two-sided Wald-type statistical tests of the hypothesis that the odds ratio for each predictor was equal to one (no association) were conducted with the type I error fixed at 5%.
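As an illustration of the Wald-type inference described above, the following Python sketch converts a log-odds coefficient and its standard error into an odds ratio, a 95% confidence interval and a two-sided p-value under a large-sample normal approximation. The function name `wald_odds_ratio` and the input values are hypothetical and are not taken from the fitted models; software for survey data may use a design-based reference distribution rather than the normal, so results from such software can differ slightly.

```python
import math


def wald_odds_ratio(beta, se, z_crit=1.959964):
    """Odds ratio, 95% Wald CI and two-sided p-value from a log-odds
    coefficient (beta) and its standard error (se)."""
    odds_ratio = math.exp(beta)
    lower = math.exp(beta - z_crit * se)
    upper = math.exp(beta + z_crit * se)
    z = beta / se
    # Two-sided normal p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return odds_ratio, lower, upper, p_value


# Hypothetical coefficient and standard error, for illustration only.
odds_ratio, lower, upper, p_value = wald_odds_ratio(beta=-0.67, se=0.23)
print(f"OR {odds_ratio:.2f} [95% CI {lower:.2f}-{upper:.2f}], p = {p_value:.3f}")
```

The null hypothesis of an odds ratio equal to one corresponds to a coefficient of zero on the log-odds scale, which is why the interval is built symmetrically around the coefficient and then exponentiated.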
A classification random forest machine learning algorithm was applied to identify additional possible predictors and to complement the statistical inference provided by the multivariable logistic regression. In brief, a random forest is a supervised ensemble learning algorithm that combines individual decision trees into a forest 21. From the original sample, several bootstrap samples are drawn, and an unpruned classification tree is fit to each bootstrap sample. The variable selection for each split in the classification tree is conducted only from a small random subset of predictor variables. In the traditional application of random forest, the split is decided based on the Gini split criterion; however, this can lead to decision trees preferring variables with more categories 22,23. Given the many different categories present in the data, we used a split based on the conditional inference framework provided by Hothorn et al. 24 and built upon by Strobl et al. 25, which provides unbiased classification decision trees. From the complete forest, the status of the response variable is predicted as an average or majority vote of the predictions of all trees. As such, the algorithm adjusts for the instability of the individual decision trees.
In our study, we are not interested in constructing a prediction model, but rather in understanding which of the variables included in the model are most important. Interpreting variable importance from machine learning algorithms can be tricky; however, for most datasets and aims, permutation importance provides a robust assessment of variable importance 26. In short, by randomly permuting the predictor variable X_j, its original association with the response Y is broken. When the permuted variable X_j and the remaining unpermuted predictor variables are used to predict the response, the prediction accuracy (i.e., the number of observations classified correctly) decreases substantially if the original variable X_j was associated with the response. Thus, a reasonable measure of variable importance is the difference in prediction accuracy before and after permuting X_j. One important advantage of permutation variable importance is that the measure covers both the non-linear impact of each variable on the prediction accuracy and the non-linear multivariable interactions with other predictor variables. In our analysis, the conditional random forest was implemented with default settings, and each observation was linked with its household weight to account for the complex survey design. To assess variable importance, conditional permutation importance was averaged over ten permutations with the threshold level set at a p-value of < 0.05. For details on the statistical properties of conditional decision trees, random forests based on such trees, and permutation importance, we refer the reader to Debeer and Strobl 27.
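The permutation idea described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the procedure used in the study: it computes the ordinary (marginal) permutation importance rather than the conditional permutation importance, a fixed toy decision rule stands in for a fitted forest, and the data, function names and seeds are hypothetical.

```python
import random


def accuracy(predict, X, y):
    """Share of observations classified correctly."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, column, n_permutations=10, seed=0):
    """Mean drop in accuracy after shuffling one predictor column."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    drops = []
    for _ in range(n_permutations):
        shuffled = [row[column] for row in X]
        rng.shuffle(shuffled)  # breaks the link between X_j and Y
        X_perm = [
            row[:column] + [value] + row[column + 1:]
            for row, value in zip(X, shuffled)
        ]
        drops.append(baseline - accuracy(predict, X_perm, y))
    return sum(drops) / n_permutations


def predict(row):
    """Toy stand-in for a fitted model: uses only the first predictor."""
    return int(row[0] > 0.5)


# Hypothetical data: the response depends on column 0; column 1 is pure noise.
data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]

importance = [permutation_importance(predict, X, y, j) for j in range(2)]
print(importance)
```

Shuffling the informative column lowers the accuracy markedly, whereas shuffling the noise column leaves the predictions unchanged, so its importance is zero.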
The data management and analyses were conducted in R (version 4.1.1) 28. Child mortality rates were calculated with the chmort function from the DHS.rates package 18, the complex survey design was accounted for with the svydesign function, and the survey-weighted generalized linear models were constructed with a binomial family through the svyglm function from the survey package 20. The random forest was created with the cforest function from the party package 29, and permutation importance was calculated with the permimp function from the permimp package 30.
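To give an intuition for the jackknife variance estimation used for the mortality rates, the following Python sketch applies a delete-one-cluster jackknife to a simple ratio. It is a simplification under explicit assumptions: the rate is taken as deaths per 1000 births pooled over clusters, whereas the DHS mortality rates are built from a synthetic cohort life table, and the cluster counts and function names are hypothetical.

```python
import math


def rate_per_1000(clusters):
    """Pooled rate per 1000 births from (deaths, births) pairs per cluster."""
    deaths = sum(d for d, _ in clusters)
    births = sum(b for _, b in clusters)
    return 1000.0 * deaths / births


def jackknife_se(clusters):
    """Delete-one-cluster jackknife standard error of the pooled rate."""
    k = len(clusters)
    r = rate_per_1000(clusters)
    pseudo_values = []
    for i in range(k):
        # Rate re-estimated with cluster i left out
        r_without_i = rate_per_1000(clusters[:i] + clusters[i + 1:])
        pseudo_values.append(k * r - (k - 1) * r_without_i)
    variance = sum((r_i - r) ** 2 for r_i in pseudo_values) / (k * (k - 1))
    return math.sqrt(variance)


# Hypothetical (deaths, births) counts for four sampling clusters.
clusters = [(1, 120), (3, 150), (0, 90), (2, 140)]
rate = rate_per_1000(clusters)
se = jackknife_se(clusters)
print(f"{rate:.1f} per 1000 (95% CI {rate - 1.96 * se:.1f}-{rate + 1.96 * se:.1f})")
```

Clusters that all share the same rate yield a standard error of zero, while greater between-cluster variation widens the interval.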

Ethical approval
The survey used in this study was approved by the ICF Institutional Review Board and gained ethical approval from the relevant institutional review board in Cambodia. Informed consent was obtained from all participants. All analyses were performed in accordance with relevant guidelines and regulations.

Results
The CDHS included 8153 children under five years, and over the five years before the end of the survey, the reported neonatal mortality rate was 8.40 (95% CI 5.81-10.9) per 1000 live births, the infant mortality rate 12.7 (95% CI 9.51-15.8) and the under-five mortality rate 19.3 (95% CI 12.3-25.3). In total, 114 (1.4%) children died before their fifth birthday, with the majority of deaths (N = 66, 58%) occurring during the neonatal period. During the survey, 1321 (17%) children had the secondary outcome of fever, acute lower respiratory disease or diarrhea. An overview of the characteristics of the population is provided in Table 1, while univariable analyses of the variables in Table 1 and the primary and secondary outcomes are available in the Supplementary Material (Table S2). There were no major differences between male and female children, with the exception of a higher proportion of male than female children being stunted (11% versus 8.4%).
For the outcome of under-five mortality, the logistic regression (Fig. 1) indicates that being born fourth or later was associated with significantly increased odds of mortality (3.25 [95% CI 1.09-9.66], p-value 0.034). A similar tendency was noted for being born third, being a twin, living in a rural household and living in a geographical region outside of Phnom Penh. On the other hand, if the mother used some form of contraception, there were significantly lower odds of mortality (0.51 [95% CI 0.32-0.80], p-value 0.003). Although not statistically significant, being born female and the mother having any type of education also indicated lower odds of mortality. The permuted variable importance of the random forest (Fig. 2) shows contraceptive use to be of the highest importance to the model, followed by birth order, previous birth interval, household wealth quintile, highest educational level of the mother, sex of the child, births in the last five years, geographical region, mother's age at birth and type of cooking fuel, while the remaining variables were deemed not important for the model.
For the outcome of infectious disease, the multivariable logistic regression results (Fig. 3) indicate significantly reduced odds for children living in households with an improved water source (0.69 [95% CI 0.52-0.91], p-value 0.01) and for children in the middle (0.57 [95% CI 0.38-0.87], p-value 0.01), richer (0.59 [95% CI 0.37-0.94], p-value 0.028) or richest (0.42 [95% CI 0.20-0.89], p-value 0.024) wealth quintiles. None of the child-specific variables had a statistically significant association with the infectious disease outcome. There were significantly increased odds if the child lived in the Coastal (2.30 [95% CI 1.05-5.01], p-value 0.036) or Great Lake (2.77 [95% CI 1.27-6.03], p-value 0.01) geographical regions. For the random forest (Fig. 4), the most important variables were deemed to be household wealth quintile, water source, geographical region and highest educational level of the mother. The remaining variables were also important for the model, except for the number of births in the last five years, whether the child was a twin, and the mother's age at birth.

Discussion
In this study examining the possible determinants of child mortality and child morbidity from infectious diseases in Cambodia in 2021-2022, we show a continued decline of the under-five mortality rate, with the majority of under-five deaths occurring during the neonatal period, and that infectious diseases contribute a significant morbidity burden. Across the traditional multivariable logistic regression and the machine learning analysis, variables that were significantly associated with the outcomes, such as contraceptive use and household wealth quintile, also had a relatively high permuted variable importance in the random forest. Indeed, household wealth quintile, highest educational level of the mother and previous birth order seemed in the random forest model to be important for both primary and secondary outcomes, indicating that there are similar determinants of under-five child mortality and child morbidity from infectious diseases in Cambodia.
In our study, we found that contraceptive use was significantly associated with reduced odds of under-five mortality, and in the study population roughly two thirds of the mothers used some type of contraception. This is in line with the analysis by Um and Heng based on the 2014 CDHS 8, which also found contraceptive use to be associated with lower under-five mortality, albeit with a tendency towards a somewhat lower odds ratio than what we found (0.51 [95% CI 0.32-0.80] in our study versus 0.30 [95% CI 0.18-0.52]). Even though the multivariable models included variables that to some extent account for the mother's agency and empowerment, such as education level, it is likely that the association between contraceptive use of the mother and odds of under-five mortality suffers from confounding. It has previously been shown that contraceptive use is closely linked to how empowered a woman is within the household 31 and that a higher attained education level of the woman 32 improves the likelihood of contraceptive use in Cambodia. Contraceptive use allows mothers to space and plan pregnancies, leading to a lower risk of unwanted pregnancies, and has been shown to reduce infant mortality rates 33. In Cambodia, women often do not have full autonomy of choice when it comes to contraceptive method, and face cultural and practical barriers to accessing modern reversible contraceptive methods 34. Given the lack of data on health service seeking patterns in our model, contraceptive use might also be indicative of health literacy and health seeking behavior 32 among mothers, which could serve as a protective factor against under-five mortality. Being born fourth or later was in our analysis associated with significantly increased odds of under-five mortality; however, it should be noted that the confidence interval is quite broad. Similarly, the birth interval was deemed third most important by the random forest. Neither birth order nor birth interval had previously been identified as associated with under-five mortality in Cambodia 8,11, and this finding might reflect the changing demography of Cambodia over time with the relatively fewer poorer households having more children. Beyond Cambodia, having a high birth order has been found to be associated with increased risk of under-five mortality in low- and middle-income settings 35,36 and with increased risk of undernutrition 37,38, and even in high-income settings the effects of birth order continue throughout the life course 39.
Similar to global burden of disease estimates 9, our results indicate a significant burden of infectious disease among children under five years in Cambodia. When exploring determinants of infectious disease among children under five, we found that households with an improved water source had lower odds of infectious disease. This is in line with the findings from CDHS 2014 by Vong et al. 11, which showed a higher risk of diarrhea among children living in households with an unimproved water source. The link between water quality and risk of infectious disease among children is well established in low- and middle-income countries 40. With climate change leading to increased precipitation in many settings, the importance of household access to quality water and sanitation to protect children from infectious disease has become clear 41. Unsurprisingly, we found that children living in households belonging to the higher wealth quintiles (middle, richer or richest) had substantially lower odds of infectious disease. With improved but unequal living standards and economic growth over the last two decades, Cambodia has experienced a shift in under-five mortality and morbidity, with neonatal mortality driving under-five mortality and infectious disease among children primarily affecting poor and vulnerable households 42-45.
Children living in the Coastal (Kampot, Kep, Koh Kong and Preah Sihanouk) or Great Lake (Banteay Meanchey, Battambang, Kampong Chhnang, Kampong Thom, Pursat and Siemreap) regions had higher odds of infectious disease. The prevalence of infectious disease pathogens among children in different regions of Cambodia is not known; however, children living in proximity to water bodies in these regions might be more exposed to the spread of infectious disease pathogens 46. Additionally, these regions are prone to cyclones and flooding 47, which might further increase transmission of infectious diseases. Overall, the results from our study depict how already vulnerable children in certain geographies are more at risk from infectious disease in Cambodia. We find that under-five mortality and morbidity due to infectious diseases are associated with characteristics of the mother and household. Specifically, empowering women and promoting safe contraceptive use and family planning programs might further reduce under-five mortality in Cambodia. Since 2019, the Ministry of Health in Cambodia has implemented a cash transfer scheme for pregnant women from families with an IDPoor card 5 to further improve maternal and child health outcomes in Cambodia, including reducing mortality. Women are eligible to receive three stages of support: 10 USD for every antenatal care visit up to four visits, an additional one-time payment of 50 USD for new mothers after delivery in a health facility, and 10 USD for each post-delivery check-up for themselves and their children, up to ten times, until their children are two years old 48. Additionally, investing in quality water and sanitation systems for all, along with recognizing the health disparities between households, should be key when designing public health programs targeting infectious diseases. There is a lack of district-level data on the prevalence of different infectious diseases affecting children; developing local surveillance systems and making the data publicly available should be prioritized. Our study showcases how child mortality and morbidity from infectious disease are linked to many sectors beyond the health sector, and that a random forest analysis can complement traditional statistical approaches to illuminate factors that might be influencing outcomes but are not fully captured by traditional statistical methods 49. Acting on synergies and handling tradeoffs between sectors is key to putting child health at the center of sustainable development 50. Multisectoral programs that tackle multiple vulnerabilities, such as IDPoor 5, hold promise to further accelerate progress, which will be necessary if Cambodia is to reach the Sustainable Development Goal target 3.2 of ending preventable deaths of newborns and children under five years of age.
The DHS surveys follow a stringent data collection process and have provided high-quality data on sociodemographic factors for more than three decades 51. However, analysis based on DHS data has its limitations. First, even though the CDHS follows a highly standardized approach, the respondents in the survey might be subject to recall or omission bias, which could skew the results. Secondly, for information regarding the prenatal, delivery and postnatal period, the questionnaire only includes data for children below three years of age, so all children aged three years or above are missing data on these variables. Additionally, among these variables there was a significant amount of missing data for children who died, making it methodologically questionable to construct multivariable models for this sub-group alone. Moreover, the lack of available variables representing the mother's empowerment or health service utilization limits our understanding of the associations found. Thirdly, it is not possible to assess the underlying infectious pathogen causing diarrhea, cough or fever, which limits the possibility of deciphering pathogen-specific determinants of under-five infectious disease morbidity. Fourthly, when it comes to the statistical analysis, random forest cannot fully incorporate the complex survey design structure of the data, even though survey weights can be included to mitigate this problem. An important consideration is that, given the different assumptions of traditional logistic regression and machine learning algorithms such as random forest, comparing the findings between the logistic regression and the random forest should be done with caution, particularly since random forest cannot provide inferential estimates or the direction of the association. Although we present a variable importance measure, the direction of the relationship is not incorporated, and it is not possible to untangle why the random forest deems certain variables unimportant or important. In our study, we apply both methods in a complementary manner in order to explore possible associations rather than to compare the approaches or assert causality. For instance, the random forest might provide a more nuanced view of associations and a starting point for further detailed exploration of variables that were seen as important by the random forest but not independently associated with the outcome in the logistic regression. Altogether, the strengths and limitations of this study reflect the complexities of trying to assert real-world associations in a data- and analysis-limited environment.

Conclusion
The majority of under-five deaths in Cambodia occurred during the neonatal period, and under-five mortality was significantly associated with contraceptive use of the mother and the birth order of the child. Child morbidity due to infectious disease was associated with water source, household wealth quintile and geographical region. The findings showcase how closely related child mortality and morbidity due to infectious disease are to broader social development in Cambodia and the importance of accelerating progress to end preventable child mortality and morbidity.

Fig. 1. Multivariable logistic regression for the primary outcome of under-five mortality. Black color indicates a statistically significant association.

Fig. 2. Permutation importance for variables in the model for the primary outcome of under-five mortality in Cambodia, ranked from most important to unimportant. *These variables had a value below zero, indicating that the variables were not deemed important for the machine learning algorithm.

Fig. 3. Multivariable logistic regression for the outcome of fever, acute lower respiratory disease or diarrhea any time in the two weeks preceding the survey. Black color indicates a statistically significant association.

Fig. 4. Permutation importance for variables in the model for the outcome of fever, acute lower respiratory disease or diarrhea any time in the two weeks preceding the survey, ranked from most important to unimportant. *These variables had a value below zero, indicating that the variables were not deemed important for the machine learning algorithm.

Table 1. Characteristics of the study population by sex. All variables are weighted according to household survey weight, while taking into consideration the sampling cluster and strata. *These variables had data from all under-five children and below 30% missing data and were included in the multivariable model. Regions: Phnom Penh is the capital city; the Plain region consists of Kampong Cham, Kandal, Prey Veng, Svay Rieng, and Takeo; the Great Lake region includes Banteay Meanchey, Battambang, Kampong Chhnang, Kampong Thom, Pursat, and Siemreap; the Coastal region has Kampot, Kep, Koh Kong, and Preah Sihanouk; and the Mountain/Plateau region consists of Kampong Speu, Kratie, Preah Vihear, Ratanak Kiri, Mondul Kiri, Stung Treng, Oddar Meanchey, and Pailin. Improved water sources include direct water, piped wells, and covered dug wells. Improved sanitation facilities include a toilet or latrine connected with sewage or septic tanks.