Factors of acute respiratory infection among under-five children across sub-Saharan African countries using machine learning approaches

Symptoms of acute respiratory infections (ARIs) among under-five children are a global health challenge. We aimed to train and evaluate ten machine learning (ML) classification approaches in predicting symptoms of ARIs reported by mothers among children younger than 5 years in sub-Saharan African (sSA) countries. We used the most recent (2012–2022) nationally representative Demographic and Health Surveys data of 33 sSA countries. The air pollution covariates, such as global annual surface particulate matter (PM2.5) and nitrogen dioxide, available in the form of raster images, were obtained from the National Aeronautics and Space Administration (NASA). Machine learning algorithms (MLA) were used to predict the symptoms of ARIs among under-five children. We randomly split the dataset into two parts: 80% was used to train the models, and the remaining 20% was used to test the trained models. Model performance was evaluated using sensitivity, specificity, accuracy, and the area under the receiver operating characteristic curve (AUC). A total of 327,507 under-five children were included in the study. About 7.10, 4.19, 20.61, and 21.02% of children had reported symptoms of ARI, severe ARI, cough, and fever, respectively, in the 2 weeks preceding the survey. The prevalence of ARI was highest in Mozambique (15.3%), Uganda (15.05%), Togo (14.27%), and Namibia (13.65%), whereas Uganda (40.10%), Burundi (38.18%), Zimbabwe (36.95%), and Namibia (31.2%) had the highest prevalence of cough. The random forest feature importance plot revealed that spatial location (longitude, latitude), particulate matter, land surface temperature, nitrogen dioxide, and the number of cattle in the houses are the most important features in predicting the symptoms of ARIs among under-five children in sSA. The random forest (RF) algorithm was selected as the best ML model (AUC = 0.77, accuracy = 0.72) to predict the symptoms of ARIs among children under five.
The MLA performed well in predicting the symptoms of ARIs and associated predictors among under-five children across the sSA countries. Random forest MLA was identified as the best classifier to be employed for the prediction of the symptoms of ARI among under-five children.

was 73/1000 and 9/1000 live births respectively 1,5, i.e. the under-five death rate in the African region was almost eight times higher than that in the European region. Previous studies have reported that symptoms of ARIs in under-five children are directly related to the population's environmental, socioeconomic, and cultural variables 2,6–10. Moreover, air pollution disproportionately affects under-five children residing in low- and middle-income countries (LMICs), including sSA. More than 89% of deaths due to air pollution occurred in LMICs, mainly in Africa and Asia 11. Africa accounts for the highest excess mortality from ambient air pollution among under-five children, to which ARIs were suggested as a potential contributor 11,12. It is reported that 92% of the world's population lives in areas where the air quality index (AQI) limit is exceeded (> 100, AQI near 100 is usually considered safe) 13, and about 4.2 million people die every year from diseases due to air pollution. Under-five children are at greater risk than other population groups from many of the adverse health effects of air pollution, mainly due to a combination of physiological, environmental, and behavioral factors. Children spend most of their time outside engaging in physical activities and playing, they breathe air closer to the ground, where some air pollutants are at a higher concentration, and they have a higher breathing rate than adults, all of which increase their risk of exposure 14–16.
Previous studies attempted to identify the determinant factors of ARIs among under-five children 2,6-12 using linear and non-linear regression models. To the best of our knowledge, only a few previous studies 17–20 have applied machine learning algorithms to predict ARIs among under-five children using air pollution factors. So far, these machine learning algorithms have not been extensively applied to the available cross-sectional datasets in low- and middle-income countries (LMICs). Hence, we applied machine learning (ML) algorithms to investigate the effects of air pollutants (such as particulate matter (PM2.5) and nitrogen dioxide (NO2)), climate factors (temperature, land surface temperature, wet day), health-related information, and socio-demographic factors. Furthermore, a generic prediction framework is lacking for reliable assessment of the symptoms of respiratory infections among children under 5 years using a large-scale dataset and machine learning algorithms (MLA). To the best of our knowledge, this is the first study that employed different ML techniques to select and identify the risk factors associated with symptoms of ARIs in sSA countries. The MLA approach ranks the features according to their importance, considers the selected risk factors (features) simultaneously in an unbiased manner, and identifies the patterns of information that are crucial for making a prediction. The objective of this study was twofold: first, to reveal the possible features for determining ARIs among children, and second, to explore machine learning algorithms by considering the best possible features for predicting ARIs among children in sub-Saharan African countries.

Data sources and variables
The data for this study came from two sources. The first is the Demographic and Health Survey (DHS), which is described in detail at https://dhsprogram.com. Data from 33 sSA countries (Fig. 1), including the global positioning system (GPS) coordinates (latitude and longitude) of household clusters, were available (Table 1). In the DHS, multistage sampling was used to select the sample for each survey in the countries included in this analysis. In the first stage, clusters (enumeration areas, EAs) were selected from the list of EAs created in the most recent population census of each country; in the second stage, households were sampled systematically within each selected EA. From the selected households, women aged 15–49 years were selected for an in-depth interview 21. Moreover, the geographical covariates were extracted from the DHS site and were linked to the original individual DHS datasets through the cluster identification number (ID).
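The linkage step described above, in which cluster-level geographical covariates are attached to individual records through the cluster ID, can be sketched as follows. This is a minimal illustration in Python rather than the authors' R workflow; the field names (`child_id`, `cluster_id`, `pm25`) and all values are hypothetical and are not taken from the DHS files.

```python
# Hypothetical child-level records, each carrying its survey cluster ID.
children = [
    {"child_id": 1, "cluster_id": 101, "cough": 1},
    {"child_id": 2, "cluster_id": 101, "cough": 0},
    {"child_id": 3, "cluster_id": 102, "cough": 1},
]

# Hypothetical cluster-level geospatial covariates keyed by cluster ID.
cluster_covariates = {
    101: {"latitude": -1.95, "longitude": 30.06, "pm25": 38.2},
    102: {"latitude": -2.10, "longitude": 29.90, "pm25": 41.7},
}


def link_by_cluster(records, covariates):
    """Attach cluster-level covariates to every record in that cluster."""
    linked = []
    for rec in records:
        merged = dict(rec)  # copy so the input records are not modified
        # A record whose cluster has no covariate row keeps only its own fields.
        merged.update(covariates.get(rec["cluster_id"], {}))
        linked.append(merged)
    return linked


linked = link_by_cluster(children, cluster_covariates)
```

Because the covariates are measured at the cluster level, every child in the same cluster receives identical values, which is worth keeping in mind when interpreting feature importance for spatial variables.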

Outcome variables
To measure the symptoms of respiratory infections, mothers/caregivers were asked whether each of their under-five children had experienced symptoms of ARI (cough, short rapid breaths, or difficulty breathing) and fever within the 2 weeks before the DHS survey; each symptom was recorded as a binary outcome (yes, no). ARI was defined as a history of illness in the 2 weeks preceding the survey with cough accompanied by short, rapid breaths (breathing faster than usual) or difficulty breathing 23, and severe ARI (SARI) was defined as ARI with fever 24.
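As an illustration of these definitions, the sketch below derives the four binary outcomes from the reported symptoms. It assumes that the ARI definition reads as "cough and (rapid breathing or difficulty breathing)"; that reading, the function name, and the argument names are assumptions made for this example and are not DHS variable names.

```python
def classify_symptoms(cough, rapid_breathing, difficulty_breathing, fever):
    """Derive binary outcome indicators (1 = yes, 0 = no) for one child.

    Assumption: ARI = cough AND (rapid breathing OR difficulty breathing);
    SARI = ARI AND fever.
    """
    has_ari = bool(cough) and (bool(rapid_breathing) or bool(difficulty_breathing))
    has_sari = has_ari and bool(fever)
    return {
        "ari": int(has_ari),
        "sari": int(has_sari),
        "cough": int(bool(cough)),
        "fever": int(bool(fever)),
    }


# A child with cough, rapid breathing, and fever meets both definitions.
example = classify_symptoms(cough=1, rapid_breathing=1, difficulty_breathing=0, fever=1)
```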

Features (independent variables)
The independent variables were extracted based on a review of the literature 3,5–7,9,25,26. The variables included in the analysis are summarized in the following framework (Fig. 2).

Model building
Machine learning algorithms such as Logistic Regression (LR) 27, Ridge regression 28, Least Absolute Shrinkage and Selection Operator (LASSO) regression 29, Elastic Net 30,31, Decision trees 32, K-Nearest Neighbors (KNN) 33, Naïve Bayes 32,34,35, Random Forest (RF) 31,36, Bagged tree 37, Boosting 37, and Artificial Neural Network (ANN) 38,39 were included in the analysis. All statistical analyses were performed using R software version 4.3.1 for Windows (R Development Core Team). The dataset was split with the function createDataPartition in the R caret package, which uses stratified random sampling on the outcome so that the class distribution is preserved in both the training and test partitions, reducing bias in the data distribution.
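The idea behind a stratified split can be sketched in a few lines. The example below is a plain-Python illustration of the technique, not a reimplementation of caret's createDataPartition; the function name, the seed, and the toy labels are assumptions for this example.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.8, seed=42):
    """Split indices into train/test sets, sampling within each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random order within the class
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Toy outcome with 10% positives, loosely mimicking a rare symptom.
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels)
```

Sampling within each class keeps the share of positive cases the same in both partitions, which matters when the outcome is rare, as with ARI symptoms here.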

Logistic regression (LR)
LR is a widely applied statistical model for binary classification problems. Let $y_i$ be the response variable for the $i$th child, assumed to follow the Bernoulli distribution, taking the value 1 with probability $\pi_i = P(y_i = 1 \mid x_i)$, where $x_i = (x_{1i}, \ldots, x_{pi})^{T}$ is the $i$th child's covariate vector, and the value 0 with probability $1 - \pi_i$. Then the logistic regression model with the logit link function can be given as:

$$\operatorname{logit}(\pi_i) = \log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + x_i^{T}\beta \qquad (1)$$

where $\beta_0$ is the intercept term, and $\beta = (\beta_1, \ldots, \beta_p)^{T}$ is a $p \times 1$ vector of estimated regression parameters on the logit scale. When there are many features (high dimensionality), the traditional LR model has a few limitations: overfitting, multicollinearity, and computational difficulties. To address these problems, we used regularization, which imposes a penalty on the GLM parameters to shrink them toward zero 27–31,40.
The ridge regression ($L_2$ regularization, which shrinks the coefficients of correlated covariates towards each other) is obtained by maximizing the log-likelihood with a penalty applied to all the parameters except the constant (intercept) 27,28. The penalized likelihood formulation for ridge regression is given by

$$\ell_{\text{ridge}}(\beta_0, \beta) = \sum_{i=1}^{n}\left[y_i \log \pi_i + (1 - y_i)\log(1 - \pi_i)\right] - \lambda \sum_{j=1}^{p} \beta_j^{2} \qquad (2)$$

where $n$ is the number of children and $\lambda \ge 0$ is the regularization parameter. When $\lambda$ is too large ($\lambda \to \infty$), the coefficients of all the parameters tend to zero, but when $\lambda = 0$, ridge regression is equal to the traditional approach. The goal is to search for an optimal value between these two extremes.
The LASSO regression uses the $L_1$ penalty for variable selection and shrinkage. If $\lambda$ is large enough, it forces some coefficients to be exactly zero, which yields a smaller number of predictors 29. The function for the lasso regression is given by

$$\ell_{\text{lasso}}(\beta_0, \beta) = \sum_{i=1}^{n}\left[y_i \log \pi_i + (1 - y_i)\log(1 - \pi_i)\right] - \lambda \sum_{j=1}^{p} |\beta_j| \qquad (3)$$

The optimal regularization parameter ($\lambda$) was determined using n-fold cross-validation. The larger the $\lambda$ value, the stronger the effect of regularization on the number of covariates (features) in the model and their respective coefficients 31,41,42. Thus, variables with non-zero estimates are considered important covariates for the outcome variable of interest.
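To make the two penalties concrete, the sketch below evaluates the Bernoulli log-likelihood of a logistic model and subtracts a ridge or LASSO penalty from it. It is a plain-Python illustration under the definitions given above (intercept left unpenalized); the function names and toy data are assumptions for this example, and the code evaluates the objective only, it does not fit a model or reproduce the glmnet implementation.

```python
import math


def log_likelihood(beta0, beta, X, y):
    """Bernoulli log-likelihood of a logistic model with a logit link."""
    total = 0.0
    for xi, yi in zip(X, y):
        eta = beta0 + sum(b * x for b, x in zip(beta, xi))  # linear predictor
        pi = 1.0 / (1.0 + math.exp(-eta))                   # inverse logit
        total += yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
    return total


def penalized_log_likelihood(beta0, beta, X, y, lam, penalty):
    """Log-likelihood minus lam times the penalty; the intercept is not penalized."""
    if penalty == "ridge":
        size = sum(b * b for b in beta)    # squared L2 norm
    elif penalty == "lasso":
        size = sum(abs(b) for b in beta)   # L1 norm
    else:
        raise ValueError("penalty must be 'ridge' or 'lasso'")
    return log_likelihood(beta0, beta, X, y) - lam * size


# Toy data: three children, two features.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1, 0, 1]
beta = [1.0, -2.0]
plain = log_likelihood(0.0, beta, X, y)
ridge = penalized_log_likelihood(0.0, beta, X, y, lam=0.5, penalty="ridge")
lasso = penalized_log_likelihood(0.0, beta, X, y, lam=0.5, penalty="lasso")
```

With lam = 0 both objectives reduce to the unpenalized log-likelihood, and increasing lam makes large coefficients progressively more costly.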
The elastic net regularization is a combination of the ridge ($L_2$) and lasso ($L_1$) penalties 30,31. This method can effectively control for correlated features and also shrink the coefficients of non-informative features to zero 30,31,40,43. The elastic net regression is given by

$$\ell_{\text{enet}}(\beta_0, \beta) = \sum_{i=1}^{n}\left[y_i \log \pi_i + (1 - y_i)\log(1 - \pi_i)\right] - \lambda\left[(1 - \alpha)\sum_{j=1}^{p} \beta_j^{2} + \alpha \sum_{j=1}^{p} |\beta_j|\right] \qquad (4)$$

All the GLM regularizations were implemented in R using the glmnet package 44. In this paper, we trained the generalized linear model (GLM) estimators with $\alpha$ values from the set {0, 0.5, 1}, where $\alpha$ = 0, 0.5, and 1 refer to the ridge, elastic net, and lasso penalties, respectively 30,31,40.

www.nature.com/scientificreports/

Random forest (RF)
RF is a popular supervised ML approach in applied statistics because of its applicability to both classification and regression 45–47. It is also used for variable screening and dimension reduction 48–50. It is a tree-based technique in which several decision trees are constructed from random sets of covariates and used to predict an outcome label for a subset of samples. It builds multiple trees (called the forest), and the decision is based on the majority vote over all the trees in the forest. This model is also used to select important features 45–47,51.
The Gini Importance analysis was conducted through random forest ML approaches to identify the features that have the most impact on the likelihood of developing symptoms of respiratory infections among under-five children in sSA countries.
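Gini importance is built from the decrease in Gini impurity produced by each split, accumulated per feature over all trees in the forest. The sketch below shows only that building block for a binary outcome; it is an illustration in plain Python with hypothetical function names, not the routine used by the R random forest implementation.

```python
def gini_impurity(labels):
    """Gini impurity of a node holding binary labels coded 0/1."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def gini_decrease(parent, left, right):
    """Impurity decrease obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted_children = (
        len(left) / n * gini_impurity(left)
        + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


# A split that separates the classes perfectly removes all impurity.
perfect = gini_decrease([0, 0, 1, 1], [0, 0], [1, 1])
# A split that leaves both children as mixed as the parent removes none.
useless = gini_decrease([0, 0, 1, 1], [0, 1], [0, 1])
```

A feature that repeatedly yields large decreases across many trees ends up with a high importance score.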

Naïve Bayesian (NB)
NB is a collection of ML classification algorithms built on Bayes' theorem. These algorithms rest on two basic assumptions: the first is that every pair of features being classified is independent of each other (hence "naïve"), and the second is that each feature makes an independent and equal contribution to the outcome 32,34,35. For a binary outcome variable, a Bernoulli Naïve Bayesian algorithm is appropriate and is given as

$$P(y \mid X) = \frac{P(X \mid y)\,P(y)}{P(X)} \qquad (5)$$

where $X$ is the vector of covariates, $P(X)$ is the predictors' prior probability, $P(y)$ is the probability before evidence is seen (the prior), and $P(X \mid y)$ is known as the likelihood.

Decision trees (DT)
The given dataset is repeatedly split into increasingly similar groups based on the variable that maximizes the similarity of the resulting groups 32. The nodes of the DT normally have multiple levels, where the topmost or first node is known as the root node. Predictions and classifications are made by evaluating a new individual according to the established criteria. The DT classifier was constructed using the R package rpart, and the classification and regression tree (CART) algorithm was applied to build binary trees.

Figure 3 below shows the research workflow. Before performing any statistical analysis, the data were preprocessed, which was followed by feature selection. Data management included checks for missing values, outliers, and illogical values. The missing value imputation process was carried out iteratively until 100% completeness of all variables was achieved. Specifically, we checked the missing values in the dataset. Records with missing values were excluded from the analysis when the missingness for a variable included in the study was less than 10%. When missingness was greater than 10%, mean imputation for continuous variables and mode imputation for categorical variables were used to fill in the missing values.

The three-step approach consisted of feature selection, model comparison, and selection of the best ML models and interpretation. The random forest, which is one of the common approaches to identifying important features 46,47,50–52, was used. It generated 1000 trees and used the Gini criterion to compute the importance of each feature; the second quartile (median) of the importance scores was taken as the cut-off point for selecting important features. Only the symptoms of ARI were used as the outcome (dependent, target) variable for the machine learning part. To assess the performance of the ML classifiers, we randomly split the dataset into training (80%) and testing (20%) datasets. The performance of the ML models was evaluated using sensitivity, specificity, the area under the curve, and accuracy 31,41,42,53–56, which were calculated using the observed data as the gold standard.
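The preprocessing and feature selection rules described above (mean or mode imputation, then keeping features whose importance exceeds the median) can be sketched as follows. This is a minimal Python illustration; the function names, feature names, and importance scores are hypothetical and do not reproduce the values in the study.

```python
from statistics import mean, median, mode


def impute(values, kind):
    """Fill missing entries (None) with the mean (continuous) or mode (categorical)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if kind == "continuous" else mode(observed)
    return [fill if v is None else v for v in values]


def select_above_median(importance):
    """Keep the features whose importance score is strictly above the median."""
    cutoff = median(importance.values())
    return [name for name, score in importance.items() if score > cutoff]


temperature = impute([1.0, None, 3.0], "continuous")
fuel = impute(["wood", None, "wood", "gas"], "categorical")

# Hypothetical importance scores for four features.
scores = {"pm25": 90.0, "lst": 60.0, "sex": 10.0, "mother_age": 25.0}
selected = select_above_median(scores)
```

Using the median as the cut-off retains roughly half of the candidate features by construction, so the number kept depends on how many features enter the ranking rather than on any absolute importance level.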
After constructing the ML models, sensitivity, specificity, accuracy, and area under the curve (AUC) were calculated to test the performance. The AUC gives an aggregated value that can be interpreted as the probability that a classifier ranks a randomly chosen positive case above a randomly chosen negative case 54,57. The AUC of the receiver operating characteristic (ROC) curve was averaged over 10 cross-validation folds (ten repeats) 54; this procedure partitions the original sample into ten disjoint subsets, uses nine of those subsets in the training process, and then makes predictions on the remaining subset. The ROC curve is a graphical display of the diagnostic capability of a binary classifier, plotting the false positive rate (FPR, 1 − specificity) on the horizontal axis against the sensitivity (true positive rate, TPR) on the vertical axis. Classifiers whose ROC curves lie closer to the top-left corner perform more reliably; on this basis, the RF model was the most accurate in distinguishing the symptoms of respiratory infections among children under 5 years. The identified best-fit model was then used to predict the respiratory symptoms in another dataset, known as the test dataset 31,41,42,53–55.
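The evaluation measures named above can be computed directly from predicted labels and predicted scores. The sketch below is a plain-Python illustration with hypothetical toy data; the AUC is computed through its pairwise-ranking interpretation (ties counted as one half), which is equivalent to the area under the ROC curve but is not necessarily how the R packages used in the study compute it.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity, and accuracy from the confusion matrix."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


def auc(y_true, scores):
    """Probability that a random positive scores higher than a random negative."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy example: five children, predicted probabilities thresholded at 0.5.
y_true = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
metrics = classification_metrics(y_true, y_pred)
auc_value = auc(y_true, scores)
```

Sensitivity, specificity, and accuracy depend on the chosen threshold (0.5 here), whereas the AUC summarizes performance over all thresholds.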

Compliance with ethics guidelines
The protocol for the sub-Saharan DHS was approved by the Humanities and Social Sciences Research Ethics Committee (HSSREC/00005776/2023) of the University of KwaZulu-Natal. The authors obtained permission from the Demographic and Health Survey (DHS) program to download and use the data for this analysis, and the need for informed consent was waived.

Results
Table 1 presents the prevalence of symptoms of respiratory infections among under-five children from 33 sSA countries. A total of 327,507 under-five children were included in the study. The overall prevalence of symptoms of ARI, SARI, cough, and fever for all countries was 7.10, 4.19, 20.61, and 21.02%, respectively. However, there are inequalities in the symptoms of respiratory infections among under-five children across sSA countries (Table 1, Fig. 4). The preliminary analysis for symptoms of ARI using a generalized linear model (logistic regression), with the relative importance values reported separately for socio-demographic, geospatial, health and nutrition, and environmental covariates, is summarized in Table 2. Among the socio-demographic variables, age of the mother, place of residence, and media exposure; among the health- and nutrition-related features, breastfeeding, nutrition status (stunting, wasting, and underweight), and dietary diversity; and among the geospatial covariates, enhanced vegetation index, aridity, wet day, and minimum temperature were positive predictors of the symptoms of ARIs. Additionally, environmental features (source of drinking water and toilet facility), air pollution features (fuel type, cooking place, PM2.5), and spatial location (longitude, latitude) were statistically significantly associated with the symptoms of ARI among under-five children in sSA countries (Table 2).

Features with a relative importance score larger than the second quartile (20.3) were selected as important and were used in the subsequent machine learning models. As a result, 21 features were retained for the subsequent analysis. As shown in Fig. 5, the top features with strong influences on the symptoms of ARI among under-five children in sSA countries were air pollutants and climatic factors: household air pollution and air pollutants such as particulate matter (PM2.5), cooking indoors and outdoors, nitrogen dioxide, and types of fuel. Among the geospatial/climate variables, spatial location (longitude, latitude), land surface temperature (LST), enhanced vegetation index (EVI), cattle, maximum/minimum temperature, aridity, and wet days had a relative importance score greater than the second quartile (20.3%). Among the socio-demographic features, only the mother's age and the sex of the child were selected, and among the health-related features, only diarrhea status and vitamin A supplementation were selected for the ML models predicting the symptoms of ARIs among under-five children across sSA countries. Finally, the proposed ML models, namely GLM (logistic regression), Ridge, LASSO, Elastic net, ANN, KNN, Boosting, Naïve Bayes, DT, RF, and Bagged Trees, were fitted on the selected features to classify the symptoms of ARIs of under-five children in sSA countries (Fig. 5).
The supervised machine learning models were evaluated on a randomly sampled 20% of the dataset held out as a test sample (Table 3). Table 3 shows that there is no substantial difference in the accuracies of the different MLAs in predicting the symptoms of ARI among under-five children in sSA countries. The highest model performance was obtained by Random Forest, Boosting, ANN, and Bagged trees, with AUCs of 0.77, 0.76, 0.74, and 0.74, respectively. The lowest model performance was observed for DT and NB, with AUCs of 0.68 and 0.70, respectively (Table 3, Supplementary Fig. S1).

Discussion
This study presents a full statistical analysis of covariates associated with ARIs among under-five children in sub-Saharan African countries, employing both descriptive data exploration and advanced machine learning algorithms. This study highlights a large variation in the country-level prevalence of symptoms of ARIs among under-five children. Previous literature revealed that the prevalence of ARIs varies from country to country 6–8,58 and from district to district within the same country 7,58–60.
One of the aims of this study was to apply ML algorithms to identify the key determinants (features) of ARIs among under-five children using a large dataset across sub-Saharan African countries. This is the first study to demonstrate the implementation of ML algorithms for predicting acute respiratory infection rates in sSA countries. The results showcase the superior predictive capability of the MLA compared with conventional statistical techniques in identifying features linked to ARIs. This is not surprising, since MLA has been shown to outperform traditional statistical models in several fields 61–64. We employed several ML techniques to assess their predictive capabilities. Evaluating the performance of these techniques, we found that all of them achieved AUC values above the chance level (0.5). Using machine learning algorithms (MLA), our analysis of the multi-country DHS datasets strongly indicated an association of air pollution and environmental variables with the symptoms of ARI among children in sSA countries. In our study, PM2.5 was the most influential variable increasing the risk of ARI, together with NO2. Both PM2.5 and NO2 have been associated with the occurrence of respiratory infections 11,12,16,65. Specifically, studies using the support vector machine algorithm 66,67 have previously shown that ARI is associated with NO2. Previous researchers applied parametric linear models and semi-parametric and generalized additive models 68–71 to determine the effects of air pollutants on symptoms of respiratory infections. To the best of our knowledge, few studies have used machine learning models to determine the association between air pollutants and human health 72–75, and none have used ML models to determine the effects of air pollutants on children's symptoms of respiratory infections across the sub-Saharan region.

In this study, climate factors, such as temperature, wet day, and spatial location (longitude, latitude), were among the top features associated with the symptoms of respiratory infections. This is consistent with previous studies 76–79 showing that temperature affects the occurrence of the symptoms of ARIs. Nowadays, with the availability of large health-related data repositories (such as electronic medical records) and advances in computing power, classical statistical analysis is being combined with advanced machine learning algorithms to predict and classify target variables (outcomes) 80–82. Feature selection and feature relevance become prominent, especially in datasets with many features (independent variables) 37,52,81–83. The RF approach has also been used for feature selection in previous studies 46,47,52,74. Using this approach, we found that the most important features were particulate matter, age of the mother, spatial location (longitude, latitude), land surface temperature, enhanced vegetation index, nitrogen dioxide, aridity, wet day, and temperature, among others; similar results were obtained in previous studies 6–8,58,84–86. In this study, all the ML classification approaches achieved greater accuracy in predicting the symptoms of ARI than traditional models like GLM, which is in line with studies on other target variables 46,47,52,74,75,87. The study used large nationally representative datasets of 33 sSA countries in examining and selecting the important features to diagnose the symptoms of ARIs. This large dataset made it possible to apply high-level ML approaches that support the accuracy of the findings.

However, this study has some limitations. Firstly, we considered only the most recent DHS dataset for each country, and hence we did not model the variables over time. Secondly, the data are cross-sectional, so we can only draw conclusions about statistical association (not causality). Thirdly, the surveys were conducted in different years, so the comparison of prevalence by country may mislead readers. Lastly, even though the random forest machine learning method is commonly used for feature selection, other methods may prioritize features differently. Therefore, our future focus will be to include temporal effects to draw inferences over time and possibly about causality.

Conclusion
The present study assessed the performance of various supervised machine learning algorithms for the prediction of symptoms of respiratory infections using data from DHS and NASA sources. Before the feature selection process, our dataset contained a total of 51 features and 327,507 under-five children. Feature selection is essential for the classification and prediction of target variables. Using the random forest approach, the ranking of the contributions of the features was determined with the average Gini importance method, and only 21 features were retained for the ML models. Particulate matter (PM2.5), age of the mother, spatial location (longitude, latitude), land surface temperature, enhanced vegetation index, nitrogen dioxide, aridity, wet day, and temperature were found to be the most important predictors of symptoms of ARI among children in sSA countries. The selected features have scores greater than the second quartile (median), which was used as a rule of thumb for dimension reduction of features. The present study attempted to identify the best ML algorithms for the prediction of symptoms of ARI using nationwide cross-sectional data from 33 sSA countries. The performances of these ML models were compared using statistical measures such as sensitivity, specificity, accuracy, and AUC. Air pollution is a leading cause of symptoms of respiratory infections (fever, cough, ARI, and SARI) among children and adults. In addition, the ML algorithms are more accurate for the prediction of the symptoms, and this result may apply to other target variables for large datasets. The findings of this study established the potential of ML techniques in predicting the presence of ARI among under-five children across sSA countries. This opens up opportunities for the development of automated screening tools and decision support systems that may assist the concerned bodies in diagnosing and managing ARIs among under-five children in the region. Moreover, spatial location (longitude, latitude) is one of the influential features in predicting and diagnosing symptoms of ARIs; hence, if a spatial model is integrated with the ML models, it is possible to identify and flag the under-five children who are most at risk, so that data-driven interventions can be targeted to the communities where those children live.

Figure 1. Eligible sub-Saharan African countries included in the study; we created the map using ArcGIS.

Figure 2. Conceptual framework for features description.

Figure 3. Overview flow chart of the machine learning algorithms used for predicting U5C respiratory infections/symptoms.

Figure 4. Proportion of under-five children with different acute respiratory infections and symptoms across sSA countries.

Figure 5. Feature importance scores based on random forest approach.

Table 1. Selection of study participants from 33 sSA countries with recent DHS reports from 2012 to 2022. PSU, Primary Sampling Unit; U5C, under-five children; DHS, Demographic and Health Survey; sSA, sub-Saharan Africa.

The number of under-five children across the DHS waves for each country and the prevalence of symptoms of respiratory infections among U5C in sSA
Columns: Survey countries; Survey year; Weighted sample; Percent; Children with symptoms of: ARI, n (%); SARI, n (%); Cough, n (%); Fever, n (%)

Table 2. Preliminary analysis of the effects of different variables on the outcome variables and the relative importance of each of the features on the target variable.

Table 3. The performance of the prediction models based on different classifications using a test dataset with 95% CI. GLM, generalized linear model; ANN, artificial neural network; KNN, K-nearest neighbor; NB, Naïve Bayes; RF, Random Forest; DT, decision tree; AUC, area under curve; CI, confidence interval. The selected machine learning algorithm is shown in bold.