Drinking water accessibility typologies in low- and middle-income countries

We present a data-driven typology framework for understanding patterns in drinking water accessibility across low- and middle-income countries. Further, we obtain novel typology-specific insights regarding the relationships between possible explanatory variables and typology outcomes. First, we conducted exploratory factor analysis to obtain a smaller set of interpretable factors from the initial set of 17 drinking water accessibility indicators from 73 countries. The resulting seven factors summarize the key drivers for water accessibility, and also serve as a vehicle for framing discussions on country outcomes. We clustered the countries based on their seven-dimensional water accessibility factor scores, referring to the resulting three clusters as ‘typologies,’ namely, Decentralized, Centralized and Hybrid. The typologies serve as a vehicle for analyzing water accessibility among countries with similar patterns, in contrast with geographically-based approaches. Finally, we fitted a decision tree classifier to analyze relationships between a country’s typology membership and socioeconomic, geographic and transportation explanatory variables. We found that private car ownership, population density and per-capita gross domestic product are most relevant in predicting a country’s drinking water accessibility typology.


Introduction
The proportion of the world's population using safely managed drinking water services has been increasing from 70% in 2015, when the tracking of progress toward the United Nations sustainable development goals (SDGs) began (WHO/UNICEF JMP 2021). Yet, as of 2020, about 2 billion people (26% of the global population) still lacked access to safely managed drinking water services, which are defined as access to an improved water source that is located on-premises, available when needed, and free from contamination (WHO/UNICEF JMP 2021). While remarkable progress has been made in increasing access to safe, reliable, and accessible drinking water supplies, recent reports have highlighted continued inequality in progress toward meeting the SDGs within and among countries (Yang et al 2013, Fuller et al 2016, Gain et al 2016, Jia et al 2016, Local Burden of Disease WaSH Collaborators 2020, WHO/UNICEF JMP 2021). Notably, among low- and middle-income countries, which account for 84% of the global population, the share of people without access to safely managed drinking water is much higher at 42% (WHO/UNICEF JMP 2021). Given the size and needs of this fast-growing bloc of low- and middle-income countries, it is critical to analyze their various potential water accessibility pathways in order to highlight opportunities for targeting institutional support and investments for achieving universal access to safe and sustainable drinking water.
Clustering approaches have been previously used to group countries, regions or cities to better understand patterns in infrastructure provision (Onda et al 2014, Pawlak and Kołodziejczak 2020), sustainable mobility (Oke et al 2015, 2019), urban planning (Reese 2006, Moody et al 2019), energy consumption (Creutzig et al 2015) and even dietary composition (Bonhommeau et al 2013). With regard to water and sanitation accessibility in particular, Onda et al (2014) performed a cluster analysis of countries to identify commonalities among countries with differing institutional sector arrangements and levels of drinking water and sanitation services. However, most of the analyses and support for country-level water access have focused on country groupings according to geography, e.g. physical proximity to one another, or indicators of economic or health metrics (Gain et al 2016, Luh and Bartram 2016, Munamati et al 2016, Cassivi et al 2018, Local Burden of Disease WaSH Collaborators 2020). While these efforts have highlighted disparities in access and progress within and among countries, they can hide even more significant disparities in access between neighboring countries and thus neglect potential opportunities to promote knowledge-sharing between countries with similar contexts across regions.
Research from both global and local case studies has noted the importance of distance and accessibility to improved water sources (Cassivi et al 2019). Improved drinking water sources are those that are likely to provide a better quality of water that is safe and accessible, including piped water, boreholes or tubewells, protected dug wells or springs, rainwater, or bottled/trucked water within 30 min (in round trip/queuing collection time). While these definitions are a proxy for water safety and accessibility, the introduction of the SDGs refined the indicator of a safely managed source to emphasize accessibility and availability of the water source through the 'located on-premises' and 'available when needed' characterizations. On-premises and nearby water sources are associated with better health outcomes, and water fetching time has long been understood to have important implications for health and gender disparities (Sorenson et al 2011, Pickering and Davis 2012, Graham et al 2016, Overbo et al 2016, Cassivi et al 2018, Local Burden of Disease WaSH Collaborators 2020). Available modes of transportation (motorized/non-motorized) would also influence the collection time for off-premises sources (Cassivi et al 2018). Nevertheless, no studies to our knowledge have explicitly examined associations between water accessibility and means of transportation, along with other socioeconomic and geographical indicators. In particular, the interactions between access to civil infrastructure systems (e.g. drinking water and transportation) are relatively unexplored, although such systems can play an important role in water accessibility.
To address these gaps, we present a data-driven approach that clusters low- and middle-income countries based on their drinking water accessibility indicators. These indicators include the water source used, collection times and proximity. From these, we obtain the key factors that may govern water accessibility in these countries. Based on the factors, we obtain typologies which facilitate analyses of country-level commonalities and patterns without relying on traditional geographic, economic, or health indicators to group the countries. We then use these novel typologies to explore associations with national economic and environmental variables and with household-level transportation-related variables. The results offer new insights and perspectives into associations with household water access that can highlight varying paths that could be taken to achieve universal drinking water access and help facilitate cross-border knowledge sharing between geographic regions. We expect that our analyses will serve as a resource for other researchers and policymakers in targeted efforts to investigate and improve access to drinking water globally.

Method
To uncover and understand the patterns in drinking water accessibility among low- and middle-income countries, we developed a framework (figure 1) based on three methods: exploratory factor analysis (EFA), cluster analysis, and decision tree classification. Using data on 17 indicators of drinking water accessibility from nationally representative surveys from 73 countries over 20 years, we first conducted an EFA to discover the key drivers of water accessibility. Via EFA, we reduced the dimensionality of the accessibility indicators into seven interpretable factors representing the underlying structure of the dataset. We then conducted clustering analysis on these factors to obtain the most distinct groups of countries (i.e. typologies) with similar measurements across the factors. This analysis yielded three typologies of drinking water accessibility. Finally, we fitted a decision tree classification model to predict the typology of a given country based on a set of 13 potentially relevant socioeconomic and geographic explanatory variables. The decision tree revealed that three of these variables were most relevant in explaining or predicting a country's typology classification. The decision tree could also serve to predict the typology of out-of-sample countries.

Data
We obtained the data for this study from the demographic health survey (DHS) program repository (ICF 2012) and from the world development indicator database (World Bank Group 2021). The standard DHS surveys collect nationally representative data from large samples (5000 to 30 000 households) every five years, and the program currently includes data from over 400 surveys conducted in 90 countries. We restricted observation years to the period from 2000 to 2020 in order to capture the most current trends. By limiting the sample to countries with at least 50% coverage in the variables of interest, we obtained the final dataset comprising 73 low- and middle-income countries. Altogether, 17 survey-years were represented in the final sample. We considered two categories of data: drinking water accessibility indicators and explanatory variables.
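The 50% coverage rule can be illustrated with a minimal sketch. The record layout (a dictionary mapping each country to a list of indicator values, with None marking a missing value), the function name and the toy data are all assumptions made for illustration; they do not reproduce the actual data pipeline.

```python
def filter_by_coverage(records, min_coverage=0.5):
    """Keep countries whose share of non-missing values is at least min_coverage."""
    kept = {}
    for country, values in records.items():
        if not values:
            continue  # no variables recorded at all: nothing to assess
        coverage = sum(v is not None for v in values) / len(values)
        if coverage >= min_coverage:
            kept[country] = values
    return kept


# Hypothetical toy data: None marks a missing indicator value.
toy = {
    "A": [12.0, None, 30.5, 8.1],  # 75% coverage
    "B": [None, None, None, 4.2],  # 25% coverage
    "C": [None, 55.0, None, 1.0],  # 50% coverage
}
selected = filter_by_coverage(toy)
print(sorted(selected))
```

In this toy example, countries "A" and "C" meet the threshold, while "B" is excluded.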

Water accessibility indicators
We identified 17 relevant drinking water accessibility variables related to household use of a primary drinking water source. Table 1 describes the indicators, including the survey questions that define the three types collected. (Further details on the survey questions and indicators are available in SI table 1.) Their distributions across the 73 low- and middle-income countries are shown via histograms in figure 2 (an alternative version of this figure with different horizontal axis scales is provided in SI figure 1 for more clarity). Each observation is the percentage of households that use a given source of water as their main drinking source or spend more/less than 30 min to obtain water in a given day. The patterns across the water accessibility indicators formed the basis of the typology analysis of countries.

Explanatory variables
We hypothesized that socioeconomic, demographic, mobility and geographic characteristics could be relevant for predicting a country's water accessibility typology. To test this hypothesis, we gathered 13 explanatory variables on the 73 countries in the dataset from the most recently measured years. We then used these variables in fitting a model to classify a country into one of the typologies resulting from our analysis. The explanatory variables include population, per-capita gross domestic product (GDP), mobility-mode ownership rates (cars, motorcycles, bicycles, etc), and the Gini coefficient (a measure of inequality), among others. We provide the complete list and corresponding sources in table 2. (An alternative version of this table is provided in SI table 2, which maps the variables to the exact names used by the World Bank.) Figure 3 shows pairwise scatter plots, correlation coefficients and histograms of the variables. Low correlations indicate that the corresponding variables have little linear association with each other. In addition, the statistical significance of each coefficient (i.e. the strength of evidence for a linear relationship between the two variables) is indicated by the number of asterisks, as explained in the caption. Observing the histograms, we notice that the majority of these variables skew right. Notably, however, the urban population rate tends toward a normal distribution.

Factor analysis
Factor analysis is an exploratory analytic approach used to reduce a multidimensional dataset to a smaller number of interpretable dimensions, i.e. factors, that explain a reasonable amount of variance in the data (Thurstone 1931, Cattell 1965). Applying this method to our dataset of 17 drinking water accessibility variables measured across 73 countries, we obtained what can be considered as the underlying factors of water accessibility. In conducting the analysis, we needed to determine the best method of factor extraction and the optimal number of factors to extract from the data. The maximum likelihood method (Rubin and Thayer 1982) provided the best fit for our dataset (see SI table 3). We selected the optimal number of factors based on the objectives of fitness, parsimony and interpretability. The scree plot (figure 4) shows the eigenvalues of the correlation matrix of the data. Various scree-plot heuristics (Cattell 1966) indicate that 6 and 7 factors were potential candidates. We further considered a number of error metrics (table 3), which indicated that 7 factors provided a better fit than 6. No further interpretative advantages were observed at 8 factors or greater. Thus, we extracted 7 factors, reducing the dimensionality of the dataset from 17 to 7 by transforming it using the loading matrix, which indicates the contribution of each original feature in a given factor. (Further statistics for the extracted factors are provided in SI table 4. Loadings are given in SI table 5 and correlations in SI figure 2.)
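As a rough illustration of how a scree plot is built, the sketch below computes the eigenvalues of a correlation matrix with a plain Jacobi rotation scheme and counts how many exceed 1. The "eigenvalue greater than 1" rule (often called the Kaiser criterion) is one common heuristic and is an assumption here; the text does not state which heuristics were applied. The function names and the toy matrix are hypothetical.

```python
import math


def jacobi_eigenvalues(matrix, max_sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations, largest first."""
    n = len(matrix)
    a = [list(row) for row in matrix]  # work on a copy
    for _ in range(max_sweeps):
        off_diagonal = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off_diagonal < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Rotation angle chosen to zero out the (p, q) entry.
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


def count_above_one(eigenvalues):
    """Number of eigenvalues greater than 1 (assumed heuristic, see lead-in)."""
    return sum(ev > 1.0 for ev in eigenvalues)


# Hypothetical 3 x 3 correlation matrix, for illustration only.
toy_corr = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
scree = jacobi_eigenvalues(toy_corr)
print(scree, count_above_one(scree))
```

Plotting the sorted eigenvalues against their rank gives the scree plot; because the input is a correlation matrix, the eigenvalues sum to the number of variables.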

Cluster analysis
We employed cluster analysis to uncover distinct groups of countries in the dataset with the most similar set of outcomes across the 7 factors. Cluster analysis is an unsupervised learning technique, i.e. there is no ground truth (no predefined grouping) against which we can select the best solution. Thus, we selected the best clustering method and the optimal number of clusters by assessing quality indices that measure the distinctiveness of the clusters in each of the solutions, along with other objectives (Lee 1981).
We compared solutions from the following hierarchical agglomerative clustering algorithms: single linkage, complete linkage, Ward method, and the method of averages (Vijaya and Batra 2019). We also compared results from the k-means clustering algorithm (MacQueen 1967). Among these, the Ward method resulted in the best-fit clustering. We computed 30 quality indices for each of the clustering algorithms considered and used a consensus approach as a first step to determine candidates for the optimal cluster number. In figure 5, we plot the frequencies of the optimal cluster number given by each of the 30 fitness metrics for the Ward method. By consensus, seven clusters was optimal based on the fitness metrics, while three clusters was second best. However, we also considered parsimony, interpretability and dimensionality as further objectives in making the final selection (Jain et al 1999). Based on these, we selected three clusters as the best fit for our dataset, given our relatively smaller sample size and the need for interpretable clusters.
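The two steps described above, agglomerative clustering with the Ward criterion and a consensus vote over quality indices, can be sketched as follows. This is a naive standard-library illustration, not the implementation used in the study; the function names, the toy points and the vote counts are hypothetical. At each step, the Ward criterion merges the pair of clusters whose union gives the smallest increase in the within-cluster sum of squares.

```python
from collections import Counter


def ward_cost(size_a, centroid_a, size_b, centroid_b):
    """Increase in within-cluster sum of squares if clusters a and b are merged."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(centroid_a, centroid_b))
    return size_a * size_b / (size_a + size_b) * sq_dist


def ward_clusters(points, n_clusters):
    """Naive agglomerative clustering using the Ward merge criterion."""
    # Each cluster is stored as (member indices, centroid).
    clusters = [([i], list(p)) for i, p in enumerate(points)]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = ward_cost(len(clusters[i][0]), clusters[i][1],
                                 len(clusters[j][0]), clusters[j][1])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        members_i, centroid_i = clusters[i]
        members_j, centroid_j = clusters[j]
        n_i, n_j = len(members_i), len(members_j)
        merged = [(n_i * a + n_j * b) / (n_i + n_j)
                  for a, b in zip(centroid_i, centroid_j)]
        clusters[i] = (members_i + members_j, merged)
        del clusters[j]
    return [sorted(members) for members, _ in clusters]


def consensus_k(votes):
    """Rank candidate cluster numbers by how many quality indices chose them."""
    return Counter(votes).most_common()


# Hypothetical two-dimensional points forming three obvious groups.
toy_points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
print(ward_clusters(toy_points, 3))

# Hypothetical votes: the cluster number preferred by each quality index.
print(consensus_k([7, 7, 7, 3, 3, 2]))
```

The consensus ranking only proposes candidates; as noted above, parsimony and interpretability can still justify choosing the runner-up.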

Decision tree model
While the factor analysis and clustering steps are useful for generating the typologies and thus providing a framework for explaining overarching patterns in drinking water accessibility outcomes among countries, they do not explain what characteristics (socioeconomic, geographical and demographic) of each country may predict its typology assignment. Therefore, to extract the relevant variables and obtain key typology predictors, we fitted a decision tree model using the 13 explanatory variables described in table 2 and figure 3.
Decision trees use a recursive splitting procedure to partition a multivariate dataset into boxes that provide the lowest error in predicting the class label of the observations in the corresponding partition (Kotsiantis 2013). In this case, we have three class labels (corresponding to the typologies). In fitting a decision tree, we seek to find the best stopping point (number of splitting nodes), as we could partition the data all the way up to each individual observation, resulting in leaf/end nodes equal in number to the total observations in the training data set. However, this would result in an overfitted tree. The best number of splitting nodes is determined by a cost-complexity parameter, whose optimal value can be found via cross-validation (estimation results are given in SI figure 4 and SI table 8). Based on this, the best number of splitting nodes was four, resulting in five leaf nodes in the fitted tree.
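A single step of the recursive splitting procedure can be sketched as below. The sketch assumes Gini impurity as the node error measure, which is a common choice for classification trees but is not confirmed by the text; the function names and toy values are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a collection of class labels (0 for a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Best threshold for the rule `value < threshold` on one variable.

    Returns (threshold, weighted impurity of the two child nodes), or
    (None, parent impurity) if no split improves on the parent node.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical values cannot be separated
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical car ownership rates (%) and typology labels.
toy_values = [2, 5, 8, 15, 20, 30]
toy_labels = ["D", "D", "D", "C", "C", "C"]
print(best_split(toy_values, toy_labels))
```

A full tree repeats this search over all variables within each resulting partition. Cost-complexity pruning then penalizes the number of leaf nodes, so that splits yielding only small error reductions are removed.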
The fitted tree can be used as a predictive tool to assign a country outside the current dataset to one of the three typologies we obtained. More important, however, is the ability to determine the most relevant class predictors by determining the variables used in the splitting process. In the model we estimated here, only 3 of the 13 explanatory variables were actually used in constructing the tree. Furthermore, we can rank the importance of each of the relevant variables based on a measure of how much a given variable impacted the error at each node.

Factors
We extracted 7 interpretable factors from the 17 water accessibility indicators obtained from 73 low- and middle-income countries. We assigned descriptive names to each factor: Far Spring, Far Well, Nearby Improved, Nearby Surface, Piped Indoors, Piped Outdoors, and Vended. The factor loadings (shown in figure 6) indicate the contribution of respective variables to each of the factors. The variables Protected Well and Unprotected Well provide the greatest positive contributions to the Far Well factor, with loadings of 0.9 and 0.7, respectively. The Protected Spring and Unprotected Spring variables have similar respective loading strengths on the Far Spring factor. The Nearby (Improved and Surface) factors are so named because both have a significant positive loading from the T < 30 min variable. Nearby Surface, in addition, features strong positive loadings from both the Surface and Unprotected Spring variables. The Piped Indoors factor's maximum positive loading is attributed to the Piped (Dwelling) variable, while its most negative loading (−1) is due to the Borehole variable. In contrast, the factor Piped Outdoors receives its greatest positive loadings from Piped (Yard) (0.87) and Public Tap (0.52), while its most significant negative loading is from Piped (Dwelling). Finally, the Vended factor is characterized by a loading of 0.82 from the Bottled variable, followed by a loading of 0.42 from the Tanker Cart variable. (For reference, all the factor loadings are provided in SI table 5.)

Typologies
Using the 7 factors to compute similarity, we grouped the countries into three clusters based on the Ward method, as described in the previous section. The dendrogram corresponding to the selected solution is shown in figure 7. Throughout the rest of the paper, we refer to these clusters as drinking water accessibility typologies. A map indicating the geographical distribution of each typology is shown in figure 8. Following the analyses described in the remainder of this subsection, we named the typologies Centralized (22 countries), Decentralized (37 countries) and Hybrid (14 countries). However, we include these labels in both figures 7 and 8 for ease of reference.
The average factor scores in each typology (scaled between 0 and 1) are shown in figure 9. From this spider plot, we see that the Centralized typology has the highest Piped Indoors factor score average. In the Decentralized typology, the dominant factors are Nearby Surface, Far Well and Nearby Improved. In the case of the Hybrid typology, the highest average scores are from the Nearby Improved and Vended factors. We describe each typology in detail in the following subsections. For further analyses and characterization, we provide distributions of the factors (figure 10) and water accessibility indicators (figure 11) grouped by typology.
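The scaled typology averages behind a spider plot of this kind can be computed as in the sketch below. It assumes min-max scaling applied to each factor separately across the typology averages; the text states only that the scores are scaled between 0 and 1, so the exact scaling is an assumption. Function names and toy scores are hypothetical.

```python
def typology_means(scores, labels):
    """Average each factor score within each typology."""
    sums, counts = {}, {}
    for row, label in zip(scores, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for k, value in enumerate(row):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def min_max_scale(means):
    """Rescale each factor to [0, 1] across typologies (assumed scaling)."""
    typologies = list(means)
    n_factors = len(means[typologies[0]])
    scaled = {t: [] for t in typologies}
    for k in range(n_factors):
        column = [means[t][k] for t in typologies]
        lo, hi = min(column), max(column)
        for t in typologies:
            scaled[t].append(0.0 if hi == lo else (means[t][k] - lo) / (hi - lo))
    return scaled


# Hypothetical two-factor scores for three countries.
toy_scores = [[1.0, -1.0], [3.0, -3.0], [0.0, 2.0]]
toy_labels = ["Centralized", "Centralized", "Decentralized"]
print(min_max_scale(typology_means(toy_scores, toy_labels)))
```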

Centralized
The Centralized typology consists of 22 countries across Africa, Asia, Europe and South America. Example country members include Colombia, Kyrgyz Republic and Namibia, which is the most representative, as it is closest to the centroid. (Rankings of all countries relative to their typology centroids are given in SI table 7, while the centroids are plotted in SI figure 3.) This typology has the highest median scores for the Piped factors (see figure 10). With regard to the water accessibility indicators, the Centralized typology has the highest median values for the On-Premises, Piped (Dwelling) and Piped (Yard) indicators (figure 11). Thus, Centralized countries have the most developed water infrastructure in the sample, given the dominance of the availability of water piped to home dwellings.
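The notion of a most representative country can be made concrete with a short sketch that ranks the members of one typology by their distance to the typology centroid. Euclidean distance in factor-score space is assumed here, and the function names and toy scores are hypothetical.

```python
import math


def centroid(rows):
    """Component-wise mean of a list of equal-length score vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def rank_by_representativeness(scores_by_country):
    """Country names ordered from closest to farthest from the centroid."""
    center = centroid(list(scores_by_country.values()))
    distance = {name: math.dist(row, center)
                for name, row in scores_by_country.items()}
    return sorted(distance, key=distance.get)


# Hypothetical factor scores for three countries in one typology.
toy = {"X": [0.0, 0.0], "Y": [1.0, 1.0], "Z": [4.0, 4.0]}
print(rank_by_representativeness(toy))
```

The first name in the returned list plays the role of the most representative member of the typology.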

Hybrid
Consisting of 14 countries mostly in Central America and South Asia, the Hybrids are so named due to the dual dominance of the Piped Indoors and Vended factors (figure 10). Notable examples in this typology include India, the Philippines, and Turkey. The Dominican Republic, however, is the most representative member as it lies closest to the centroid of the factor scores (SI table 7). While countries in this typology have moderate centralized water supply systems, they depend significantly on transported solutions including bottled and tanker sources. This is further highlighted by the fact that of the three typologies, Hybrid has the highest median On-Premises value, yet considerably lower median values for the Piped variables compared to those of the Centralized typology (figure 11).

Decentralized
This typology consists of 37 countries, largely in sub-Saharan Africa (such as Congo DR, Kenya, Nigeria and Zimbabwe), and a few in Asia (Afghanistan, Myanmar and Papua New Guinea). Gambia is the most representative country in this typology. In comparison with the other two typologies, Decentralized has the lowest Piped Indoors factor score median. In contrast, it is dominated by the Nearby Improved, Far Well and Nearby Surface factors (figure 10). The distributions of the water accessibility indicator measurements within this typology also follow a similar pattern (figure 11). Piped and vended sources are less significant in the Decentralized typology, while off-premises sources (borehole, well, public tap) are more important. The collection time indicators are also most significant in this typology, indicating the greater frequency of off-premises access.

Relevant explanatory variables for typology prediction
The estimated decision tree model (shown in figure 12) revealed that of the 13 explanatory variables considered, only three were relevant: car ownership, population density and per-capita GDP. These can be considered as the key predictors of a country's drinking water typology. In figure 12, each box represents a node in the decision tree and is labeled based on the majority typology among the observations in the respective node. The second row of numbers in each box indicates the proportions of observations in each respective typology (Centralized, Decentralized, Hybrid) at that node. The percentage in the third row of each box represents the proportion of observations in the corresponding node relative to the total number in the dataset.
The decision tree indicates that household car ownership is the most important predictor of a country's typology in the given sample. We see from the tree that Centralized countries are largely those that have an average car ownership rate greater than 12% and a population density less than 104 per sq. km. Conversely, a country with car ownership less than 12% and a population density less than 250 per sq. km is most likely Decentralized. A Hybrid country can be identified as either a dense, low-car-ownership nation with a per-capita GDP greater than $1459, or a high-car-ownership (⩾12%) but medium-density (between 104 and 250 per sq. km) country. The decision tree also provides importance rankings for each of these variables (figure 13).
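The rules stated above can be written as a small function. This is a sketch of the rules as described in the text, not the fitted model itself: the function and argument names are hypothetical, the handling of values exactly at a threshold is an assumption, and combinations that the text does not describe return None rather than a guessed label.

```python
def predict_typology(car_ownership_pct, density_per_sq_km, gdp_per_capita_usd):
    """Apply the stated decision rules; returns None where the text is silent."""
    if car_ownership_pct >= 12:  # high car ownership (boundary handling assumed)
        if density_per_sq_km < 104:
            return "Centralized"
        if density_per_sq_km < 250:
            return "Hybrid"  # medium density, between 104 and 250 per sq. km
        return None  # high ownership with density of 250 or more: not described
    if density_per_sq_km < 250:
        return "Decentralized"
    if gdp_per_capita_usd > 1459:
        return "Hybrid"  # dense, low car ownership, higher per-capita GDP
    return None  # dense, low car ownership, lower per-capita GDP: not described


# Hypothetical inputs, for illustration only.
print(predict_typology(20, 50, 3000))
print(predict_typology(5, 100, 800))
print(predict_typology(5, 400, 2000))
```

Per-capita GDP only matters on the dense, low-car-ownership branch, which is why it is ignored in the other branches of the function.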
We note that the decision tree only provides evidence that these explanatory variables can be used to infer, explain or predict the typologies. It does not, however, provide any evidence of causation. Thus, the variables cannot be considered as or used as levers to alter drinking water accessibility outcomes in a given country. Nevertheless, further analyses of these variables with respect to the typologies and factors can yield important insights and lessons that can guide sustainable and localized policy-making efforts in reaching water accessibility goals.

Exploration of relationships between factors and explanatory variables
To obtain further explanatory insights into the typologies, we observe the scores of each of the seven drinking water accessibility factors with respect to the three relevant explanatory variables obtained from the decision tree model. The results are shown in figure 14. In each facet of the plot, we fitted a local regression (locally estimated scatterplot smoothing) to highlight the prevailing trends of the factors against car ownership, population density, per-capita GDP and the Gini coefficient. The relationships between these explanatory variables and the seven water accessibility factors, within the context of the typologies, are discussed below.
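For readers unfamiliar with local regression, the sketch below evaluates a simplified smoother at a single point: a weighted straight-line fit using tricube weights over the nearest fraction of the data. The span (frac), the local linear degree and the absence of robustness iterations are all assumptions; the text does not report the settings used. The function name and toy data are hypothetical.

```python
def loess_point(x0, xs, ys, frac=0.75):
    """Locally weighted linear fit evaluated at x0, using tricube weights."""
    n = len(xs)
    k = min(n, max(3, int(round(frac * n))))  # number of neighbours in the window
    distances = sorted(abs(x - x0) for x in xs)
    h = distances[k - 1] or 1e-12  # window half-width; guard against zero
    weights = [(1.0 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
    total = sum(weights)
    if total == 0.0:
        return sum(ys) / n  # degenerate window: fall back to the plain mean
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return mean_y + slope * (x0 - mean_x)


# Hypothetical data lying exactly on the line y = 2x + 1.
toy_x = list(range(10))
toy_y = [2 * x + 1 for x in toy_x]
smoothed = [loess_point(x, toy_x, toy_y, frac=0.5) for x in toy_x]
print(smoothed)
```

Evaluating the function over a grid of x values traces out the smoothed trend line; on exactly linear data such as this toy example, the smoother reproduces the line.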

Car ownership
Most of the countries in the dataset had a relatively low household car ownership rate, with only a handful exceeding 20%. First, we find that on average, countries' Far Well, Far Spring and Piped Outdoors factor scores decrease as car ownership increases. Similarly, the Nearby Surface score has a negative relationship with car ownership, implying that high car ownership is a predictor of lower usage of surface water sources. Car ownership, however, has no explanatory power for a country's Vended score; as noted earlier, the Hybrids tend to score higher on this factor. In contrast, the Piped Indoors factor has a positive relationship with car ownership. Bangladesh is an outlier with the lowest car ownership and Piped Indoors score.

Population density
Population density also significantly explains the outcome of the countries on all the accessibility factors except Piped Outdoors, Vended and Nearby Improved. With regard to Far Well, there is a slightly negative relationship to population density. The relationship is, however, more pronounced for the Piped Indoors factor, indicating that denser low- and middle-income countries are less likely to have water piped into dwellings. We would expect that the need for centralized systems is greater in denser countries, yet most of them tend to be Decentralized or Hybrid. When we consider the access of water from nearby surface sources, there is an inflection point between one and two orders of magnitude. The bigger takeaway, however, is that almost all of the countries with a positive Nearby Surface score are in the Decentralized typology.

Per-capita GDP
We observe that the Far Well and Far Spring scores decrease rapidly as per-capita GDP increases. This trend implies that the average income of a country is a good indicator of its dependence on decentralized water sources. Conversely, Piped Indoors largely increases with per-capita GDP, but begins to trend downward beyond $6250. Armenia and Egypt are two outliers that out-perform in terms of their access to indoor piped water in spite of their relatively modest per-capita GDP. The Piped Outdoors factor, however, has a negative relationship with per-capita GDP. Another factor that has a strong negative relationship with per-capita GDP is Nearby Surface. Angola appears to be an outlier in this trend, however, with a modest per-capita GDP greater than most of the other Decentralized countries. Yet, its high factor score underscores its inclusion in this typology.

Conclusion
This study applied a data-driven approach to analyze current patterns of drinking water accessibility among a group of low- and middle-income countries with the goal of highlighting various paths to safe water access and opportunities for knowledge-sharing across regions. We considered 17 indicators of drinking water accessibility from 73 low- and middle-income countries, which measured the proportion of households in each country that obtained water from a given source and the proportion of households within or beyond 30 min of the water source. Data were obtained between the years 2000 and 2020, in order to focus on the most recent trends. From these indicators, an exploratory factor analysis yielded seven key drivers of drinking water accessibility. Using these 7 factors, we conducted a clustering analysis that resulted in three distinct typologies, namely: Centralized, Hybrid and Decentralized. The existence of the Hybrid typology observed here is particularly of note, as more recent research highlights both the opportunities and risks of use of multiple sources (Elliott et al 2019, Daly et al 2021). Much of this work has focused on household use of multiple sources, but not on government or utility provision of multiple sources. However, this research highlights that such hybrid approaches are a feature of water access across many contexts and likely deserve further attention. Furthermore, we investigated the variables that were most relevant for predicting a country's typology. We assembled a set of 13 candidate explanatory variables, guided by a research hypothesis that demographic, economic, transportation, and geographical characteristics were important in explaining the access to drinking water in a country. We fitted a decision tree classification model for typology prediction based on these variables to test this hypothesis.
Using cross-validation to select the optimal parameters for the decision tree, we found that only three variables were required to classify a country into one of the three typologies. In order of importance, these are: household private car ownership, population density and per-capita GDP. While some of these relationships highlight wealth, which unsurprisingly relates to water accessibility, they also identify novel relationships worth further examination. This includes the consideration of realistic water accessibility options when considering population density (e.g. low population density may limit the type of water infrastructure that is viable, including piped water).
The typology framework can be a powerful tool for researchers and policymakers alike toward planning and innovation for improving drinking water outcomes in low- and middle-income countries. Given the similarity of accessibility among countries within each typology, lessons learned from research or intervention efforts from representative countries in a given typology could be used to guide similar efforts in a country within the same typology. Thus, the typologies can be used as a vehicle for knowledge sharing and policy learning. Typology-specific trends can also be readily identified and investigated further at the microdata level using case-study countries. Lastly, the decision tree model may also be employed to classify a country not previously analyzed in this study, with only three variables required. This can also aid policy efforts, as decision-makers can determine the accessibility profile of a country with minimal information.
With further data and research, this work may ultimately serve as a foundation to facilitate investigations into the factors or policies that govern how countries may transition from one typology to another.

Data availability statement
The data that support the findings of this study are openly available at the following URL/DOI: https://github.com/narslab/water-accessibility (Chung and Oke 2023).