Dual Water Choices: The Assessment of the Influential Factors on Water Sources Choices Using Unsupervised Machine Learning Market Basket Analysis

An unsupervised machine learning model based on association rules, known as market basket analysis, is proposed in this study to analyze the influence of various socio-economic factors on the choice of water source. Data on 51 socio-economic factors collected from 295 individuals living in 65 households in Ambo city in the Oromia region of Ethiopia were used for this purpose. The results revealed that (i) 64% of the families preferred multiple water sources (i.e., public tap and river water), (ii) the water was collected by females in 92% of the households, and (iii) the majority of people preferred bathing and laundering in the river (support = 32% and confidence = 87%). Direct utilization of river water is not a preferable choice for the user since it may lead to severe health issues, and bathing and laundering in the river cause water pollution. Education and monthly income have a significant impact on the choice of water sources. Local management authorities can improve sanitation and public health management using the results obtained in the study. The paper gives only a glimpse of the important factors that should be considered for improving the way of life in the underdeveloped areas of the world using advanced machine learning techniques.


I. INTRODUCTION
Unsustainable human activities have caused water pollution and degradation of safe water sources across the globe [1], [2]. Decreasing freshwater resources and increasing water demand to meet the growing population and economic demand have caused water scarcity even in many water-rich regions. About 2.4 billion of the global population is under water stress, which would increase to 9.6 billion in 2050 [3], [4]. The scarcity will be more in developing countries due to inefficiency in water resources management.
Despite water scarcity, the availability of low-cost water purification technologies and changes in government policies to fulfil the Sustainable Development Goal of ensuring clean water and sanitation for all have made safe water available to a larger share of the population in recent years [5]. A study [6] estimated that 16% more of the population had access to improved water sources in 2015 compared to 1990. However, improved provision of safe water alone cannot ensure access to clean water. There are many social and economic factors, such as education and behavior, that influence the choice of water sources and the use of safe water. For example, people with better economic ability have more access to freshwater. Education makes people more aware of safe water and enhances their willingness to get access to safe water. Reference [7] showed that the choice of water source is significantly affected by place of residence, geopolitical zone, education, wealth index, ethnicity, access to electricity and gender. Therefore, it is very important to consider these factors in the design and implementation of water services programs. Quantification of the relative influence of different social and economic factors on safe water use is also important for sustainable water resources management, ensuring social equity in water access, protection of public health, and improvement of the quality of life.
The associate editor coordinating the review of this manuscript and approving it for publication was Xujie Li.
Association of safe water use with different socioeconomic factors like health, education, gender, distance to the water source, size of family and so on has been established by many researchers [8], [9]. The previous studies have used multinomial logistic regression [10], generalized linear models [11], conditional logit model [12] and univariate analysis [13]. Table 1 presents a brief survey of the previous research studies done to identify the factors affecting household water access in developing countries using various techniques. All the reported literature studies were conducted using classical statistical models.
The statistical models provide a numerical measurement of association by testing the hypothesis of an association between two variables [14]. Such statistical approaches are applicable when the two variables of interest are known, or a hypothesis is already defined. However, such a condition is not valid when prior knowledge of the relationship among the variables is not known. Besides, such methods are limited to numerical and ordinal data. Therefore, a reliable method is needed to unwrap the correlation in a dataset with a wide range of data types [15], [16].
This study aims to use an unsupervised machine learning model based on association rules, also known as market basket analysis. It is an efficient analytical methodology to analyze and optimize the choice and behavior of the customer [17]. The main advantage of unsupervised machine learning is the potential to solve complex problems using the capacity of intelligent mimicking [18], [20]. It can identify frequently occurring patterns, correlations and associations in the dataset with the help of three thresholds: support, confidence and lift [21]-[23].
The goal of the current research is to use a machine learning method to establish the relationship between the daily water requirement of households, the available water supply facilities, the choices of water supply and how these choices affect the consumer with regard to several elements, such as education, job, daily time invested, household income, the household member responsible for the chores, awareness, and the willingness to maintain sanitation and hygiene. The paper aims to highlight the condition of developing countries that lack awareness and affordability of proper water supply. The methodology used for the assessment of the influence of socio-economic factors on water use can be replicated for a reliable analysis of socio-economic interactions with natural resources.

II. FACTOR INFLUENCING WATER SOURCE SELECTION

A. SOCIO-ECONOMIC CHARACTERISTIC
Water sources and their utilization are holistically affected by socio-economic factors. The choice of water source depends on the distance from the source, way of access, data availability, ethnicity group of the area, status of the family, education-water access relationship and so on [24], [25]. Besides inequalities in water access, water policies for predominantly poor and economically disadvantaged rural settlements are greatly affected by socio-economic variables in developing countries. Hence, it can be considered the key indicator of water sources utilization [26], [27].

B. PRICING
Water consumption and water price are related to each other. The increase in water price significantly affects water consumption, specifically in low-income households where water bill surges affect the monthly budget [28]. The developing and underdeveloped countries commonly consist of high population density, and raised tariffs might undesirably affect the financial health of households [29]. In specific cases, the pricing can change as per the seasonal variation to minimize the consumption and enhance the water accessibility as per WHO guidelines [30].

C. COLLECTION TIME
Research studies have found that the choice of water source is likely influenced by household characteristics and distance to the water source [38], [39]. Since women spend most of the time on water collection, changes in the distance to the water source affect women and children the most [40]. The collection time is also affected by the intended use, since water for drinking and cooking needs to be of better quality, even if obtaining it increases the travel time. In addition, the water infrastructure and water policies significantly change household behavior regarding the effort put into water collection, which may depend on proximity, pricing, quality, accessibility of the source and the geographical structure of the area [12].

D. MULTIPLE SOURCES
The choice of multiple water sources is affected by the distance from the water source, the quality of water required for activities such as bathing, washing and cooking, and conflict among the people. The study in [41] reports that the factors influencing the multiple uses of water sources are water services, water supply scheme, technology-system design, water quality and quantity, collection distance and time. Another study [42] showed that users preferred the river and the public tap depending on the quality, improved access to the water source and the effort of water collection (time taken, the volume of water transported each time, frequency of filling).

E. IMPACT OF COMMUNITY WELLBEING
Water availability and community health are essential parameters to measure community wellbeing. Reference [43] reported in the study that small-scale community water supply significantly affects hygiene behaviors and daily life. The factors considered were water source from the river to groundwater, collection time, household water consumption and income.

F. HEALTH
The impact of water on health is well known but, according to the studies, still neglected. Poor water supply or an unprotected water source can cause acute infectious diarrhea, trachoma, ascariasis, hookworm infection, dracunculiasis, schistosomiasis and other diseases from heavy metal exposure. Without access to piped water, a household is 4.8% more prone to infant mortality from diarrhea and other water-related diseases [44]-[46]. It was also reported that an awareness program alongside water supply improvement is necessary to improve health and sanitation and to decrease infant mortality [47]-[49].

G. EDUCATION
The relation between education and water is not yet well established; however, most research supports such a relation [50]. A positive impact of educational campaigns on water savings and conservation has been noticed in water-scarce countries [51]. Similarly, a study in China reported that children (especially girls) in rural areas attend more school days if they get treated water in their households [52]. Educated parents showed more interest in easy access to water and in hygienic maintenance for their children's health [16], [53].

H. INCOME
Higher income tends to increase water demand more than it improves the hygienic use of water; however, child involvement in water fetching remains the same [16], [54]. Research has reported little or no relation between the duration of water access and income [55]. Another study shows that income is affected by the education level of the head of the household, assets and lifestyle preferences at home, including water-related decisions. The income structure significantly affects the willingness to pay the water bill, inter-sectoral water transfer, and user-specific water demand and consumption response [56].

III. CASE STUDY AND DATA EXPLANATION

A. STUDY AREA
Ethiopia experienced a severe water shortage due to inadequate rainfall in the 2017 rainy season, which led to catastrophic agricultural and socio-economic losses [57]. Recently, 10.5 million Ethiopians need humanitarian assistance due to clean water shortage, as per the report published in August 2017 by the Government of Ethiopia [58].
Ambo is a town in the Oromia region, 120 km from the capital, Addis Ababa. Among the 180 woredas of the West Shoa Zone, Oromia Regional State of Ethiopia, Ambo is one of the important locations where water insecurity is at its peak, affecting social life and health [59]. Ambo woreda has six kebeles (the smallest administrative unit of Ethiopia, similar to a ward), namely Ambo01, Ambo02, Ambo03, Senkale Foris, Kisose odo and Awaro. Kerchelle masa of Ambo 01 has an insufficient water supply due to human behavior, the literacy rate, and insufficient water supply infrastructure. The people of the selected study area speak Afan Oromo as their mother tongue. The study area in Ambo is displayed in Figure 1.
The study area falls within the economically water-stressed zone and relies only on surface water sources, especially river water. Besides, the groundwater has a high fluoride content, which makes it difficult to use directly [60]. The study considered the groundwater source, but none of the users reported any use. Therefore, it is not discussed in the results.

B. DATA COLLECTION
A structured questionnaire, developed with the local language in mind, was administered to the interviewees. The questionnaire was translated into Oromo and then back-translated to English. The households were selected randomly within a 1 km distance from the river water source. The sample consists of 295 individuals living in 65 households. The selected area is rural: Kerchelle masa village, situated near the Taltelle river in the Ambo region. Ethiopia is one of the underdeveloped countries that continuously need humanitarian assistance due to poor economic conditions [57]. Moreover, a severe shortage of water due to inadequate rainfall in the rainy season has led to catastrophic agricultural and socio-economic losses, which added more stress to the present condition [58]. The study site is a village far from the city, where awareness and education levels are very low. This is one of the major reasons that gathering data was one of the most difficult issues in this study. The selected area was one of the nearest villages, but the issue the study focuses on is the same for many such villages. Thus, this study aims to highlight the problems related to the water supply system of such villages all over Ethiopia and tries to draw the attention required to improve the situation. This study found that a dataset of this size can be used for such a case study to highlight the water supply problems and their influencing factors. Other research studies have also achieved their goals with small sample sizes, such as n = 100 [36] and n = 40 [31].
The dependent variable used is the water source, and the independent variables for the study area were chosen based on public water-related awareness and significant factors influencing wellbeing. A major focus was that factors such as education, job and awareness matter a great deal when choosing the water source. The aim of considering many parameters was to evaluate and point out the issues that can improve the living conditions of the local people. Even though water was our primary concern, many key factors influence it, and if at least one of these key points can be improved, a major change can be observed. The independent variables were household members (number of male adults, number of female adults, number of male children and number of female children), education level (uneducated, below 10th standard, below 12th standard and graduate), number of members employed, monthly income in Ethiopian Birr (ETB), daily water requirement of the household in litres, water collector information (sex and age), water collected each trip in litres, number of times water is collected, total time taken for water collection in minutes, water collection method (manual and transportation), water bill paid per month in ETB, water sources selected for different household activities (bathing, washing and cooking), water collection time (morning and evening), seasonal water quality variation (summer and rainy), seasonal water collection efficiency (difficult and easy), water interruption in days, household water treatment, cleaning frequency of the water storage container (daily, never, weekly), hygiene information (use of soap before handling water, during bathing, after defecation), defecation in the river, toilet at home, and health information (diarrhea, common cold and other diseases).

IV. APPLIED MACHINE LEARNING AND STATISTICAL APPROACHES

A. ASSOCIATION RULE/MARKET BASKET ANALYSIS
Association rule mining is a rule-based machine learning methodology in which highly confident associations among multiple variables are calculated. It is a rule-centered technique that has exhibited high accuracy [61]. This tool has been used in different fields of science and engineering but has been applied in only a limited number of studies on the analysis of water uses [62]-[64]. The best association is selected based on statistical outcomes such as a high count and high values of confidence, lift and support. The Apriori algorithm is suited to character variables and leads to better performance [65].
Given a set of transactions, where each transaction is a set of items, an association rule states that transactions in the dataset containing X also tend to contain Y, expressed as equation (1):

$$X \Rightarrow Y \tag{1}$$

where X and Y are sets of items. The support metric measures how often the item set occurs among the transactions and is used to select the best rules for further analysis. Support is the fraction of the total number of transactions that contain the items of both X and Y, and can be expressed as equation (2):

$$\mathrm{Support}(X \Rightarrow Y) = \frac{\text{number of transactions containing } X \cup Y}{\text{total number of transactions}} = P(X \cup Y) \tag{2}$$

Another parameter, which measures how often the consequent occurs together with the antecedent, is known as confidence. It is the probability that the consequent occurs given that the antecedent occurs, and can be expressed as equation (3):

$$\mathrm{Confidence}(X \Rightarrow Y) = \frac{P(X \cup Y)}{P(X)} = P(Y \mid X) \tag{3}$$

The confidence value is sometimes high even though the items are most likely unrelated. To overcome such a situation, lift is introduced as the third metric, which controls for the frequency of the consequent while measuring the conditional probability of occurrence of {Y} given {X}, and can be expressed as equation (4) [66]:

$$\mathrm{Lift}(X \Rightarrow Y) = \frac{P(Y \mid X)}{P(Y)} = \frac{P(X \cup Y)}{P(X)\,P(Y)} \tag{4}$$

Support and confidence reflect the usefulness and certainty of the identified rules. R version 3.6.1 was used, with the packages "arules" and "arulesViz" for association estimation and visualization, respectively.
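The three metrics described above can be illustrated with a minimal Python sketch. This is not the authors' R "arules" workflow; the function names and the four toy households are hypothetical and chosen only to make the arithmetic easy to follow. The sketch assumes the antecedent and consequent each occur in at least one transaction, so no division by zero arises.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, lhs, rhs):
    """Estimate of P(rhs | lhs): support of the union divided by support of lhs."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)


def lift(transactions, lhs, rhs):
    """Confidence of lhs => rhs divided by the support of the consequent."""
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)


# Hypothetical toy data: each household is the set of items that are TRUE for it.
households = [
    frozenset({"river", "bathing_in_river"}),
    frozenset({"river", "bathing_in_river", "public_tap"}),
    frozenset({"public_tap"}),
    frozenset({"river", "public_tap"}),
]

rule_support = support(households, {"river", "bathing_in_river"})
rule_confidence = confidence(households, {"river"}, {"bathing_in_river"})
rule_lift = lift(households, {"river"}, {"bathing_in_river"})
print(rule_support, rule_confidence, rule_lift)
```

A lift above 1 indicates that the antecedent and consequent occur together more often than would be expected if they were independent, which is why lift is used alongside confidence when ranking rules.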

1) BASIC CONCEPT
The process occurs in two steps: (i) finding frequent item sets and (ii) generating strong association rules based on support and confidence. Let J = {i_1, i_2, ..., i_n} be a set of items (itemset). Let D, the task-relevant data, be a set of database transactions where each transaction T is a set of items such that T ⊆ J. Each transaction is associated with an identifier, known as the transaction ID (TID). Let X be a set of items; a transaction T is said to contain X if and only if X ⊆ T. An association rule is an implication of the form X ⇒ Y, where X ⊂ J, Y ⊂ J, and X ∩ Y = ∅. The rule X ⇒ Y holds in the transaction set D with support S, where S is the percentage of transactions in D that contain X ∪ Y (that is, both X and Y); this is taken as the probability P(X ∪ Y). The rule X ⇒ Y has confidence C in the transaction set D if C is the percentage of the transactions in D containing X that also contain Y; this is taken as the conditional probability P(Y|X). That is, Support(X ⇒ Y) = P(X ∪ Y) and Confidence(X ⇒ Y) = P(Y|X) [67]. Figure 2 presents the general flowchart of the association rule process.
Pearson's correlation analysis is widely used in multiple sectors of science due to its accuracy in describing the relationship between variables [68], [69]. It measures the linear dependence between two variables x and y:

$$r = \frac{\sum (x - m_x)(y - m_y)}{\sqrt{\sum (x - m_x)^2 \, \sum (y - m_y)^2}}$$

where x and y are the variables, and m_x and m_y are the means of x and y, respectively.
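The first of the two steps, finding frequent item sets, can be sketched as a level-wise Apriori search. The following Python sketch is illustrative only and is not the "arules" implementation; the function name and the toy transactions are hypothetical. It relies on the Apriori property that every subset of a frequent item set must itself be frequent, which is used to prune candidates before counting.

```python
from itertools import combinations


def apriori_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    n = len(transactions)
    # Level 1 candidates: every distinct item as a 1-itemset.
    all_items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in all_items]
    frequent = {}
    k = 1
    while candidates:
        # Count each candidate and keep those that meet the threshold.
        level = {}
        for cand in candidates:
            supp = sum(1 for t in transactions if cand <= t) / n
            if supp >= min_support:
                level[cand] = supp
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset (the Apriori property).
        k += 1
        joined = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent


# Hypothetical toy transactions (items that are TRUE for each household).
toy = [
    {"river", "public_tap", "female_collector"},
    {"river", "public_tap"},
    {"river", "female_collector"},
    {"public_tap", "female_collector"},
    {"river", "public_tap", "female_collector"},
]
frequent = apriori_frequent_itemsets(toy, min_support=0.5)
for itemset, supp in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), supp)
```

In this toy example the three single items and the three pairs are frequent, while the three-item set appears in only two of five transactions and is dropped.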

B. MODELING DEVELOPMENT
The dataset was first transformed into a transaction database so that it could be subjected to the selected analytical method. The data frame was divided according to the information from the questionnaire prepared for the study. The data frame consists of 51 variables and 65 observations. Out of the 51 variables, 28 were assessed using association rules, 6 using Pearson's correlation analysis, and the others were analyzed in general terms. Of these, the interviewer's number and the interviewer's general information were irrelevant for the analysis, and the third variable was the distance from the river, which was found to average 385 m (min-max: 300-950 m). The remaining variables were interpreted using general data analysis such as the mean and min-max analysis. The selected 28 variables were converted into a logical (binary) data set (i.e., TRUE or FALSE) as part of data preprocessing. The main transaction database (i.e., 28 selected variables and 65 observations) was divided further to better understand the relationships among the variables and to handle the large number of rules generated by the program. The application was established separately for two different water sources: (i) public tap and (ii) river. The water sources were coupled with four scenarios considering the dual water sources, i.e., public tap and river, along with the other variables presented in Figures 3 and 4. All 65 observations are plotted for the two water source choices, public tap water in Figure 3 and river water in Figure 4.
[...] frequency: Daily, (vi) Cleaning frequency: Never, (vii) clean hand before handling water, (viii) use of soap.
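The conversion of survey answers into logical (TRUE/FALSE) columns can be sketched as follows. This is a minimal Python illustration under assumed field names and values; the actual questionnaire coding used in the study is not given in the text, so `source`, `collector` and `cleaning` are hypothetical.

```python
def binarize(records):
    """Turn categorical survey records into logical columns.

    Each distinct "field=value" pair becomes one column, and each record
    becomes a row of booleans indicating which pairs are TRUE for it.
    """
    encoded = [{f"{field}={value}" for field, value in r.items()} for r in records]
    columns = sorted(set().union(*encoded))
    matrix = [[col in row for col in columns] for row in encoded]
    return columns, matrix


# Hypothetical survey answers for three households.
records = [
    {"source": "river", "collector": "female", "cleaning": "weekly"},
    {"source": "public_tap", "collector": "female", "cleaning": "daily"},
    {"source": "river", "collector": "male", "cleaning": "never"},
]
columns, matrix = binarize(records)
print(columns)
for row in matrix:
    print(row)
```

Each row of the resulting matrix corresponds to one transaction, and the columns that are TRUE in that row are the items of the transaction.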
Scenario D: Seven variables: (i) Water Source: Public tap/River, (ii) bathing in river, (iii) Use soap during bathing, (iv) defecation at river, (v) after defecation use of soap, (vi) toilet at home, (vii) common cold.
The first step in finding association rules is searching for frequent item sets. The Apriori algorithm available in the "arules" package in RStudio was used to find frequent item sets [70]. The transaction databases were specified with the following thresholds: (a) minimum support of 20% and minimum confidence of 70%; (b) minimum support of 90% and minimum confidence of 95%; (c) minimum support of 40% and minimum confidence of 70%; (d) minimum support of 50% and minimum confidence of 70%. Different minimum support and confidence values were used for rule formation to reduce the number of rules and the computational time. The application found the association rules for (a), (c) and (d) in around 20-50 seconds since there were fewer rules. In dataset (b), there were more than 1000 rules, which took several minutes; this number was later reduced using high minimum support and confidence. Further, for better management and interpretation, the "head" function was used to retain the top 50 rules sorted in descending order of lift. The results are presented in Table 2, where the LHS (left-hand side) is called the antecedent and the RHS (right-hand side) is called the consequent, along with the support, confidence, lift and count of each rule. The principle is that "IF" the antecedent item set {X} is present, "THEN" the consequent item set {Y} will also be present, and this is supported by the three measures: minimum support, minimum confidence and lift. The number of rules produced from the 8 simulations was quite high; thus, the most relevant relations were sorted out. Table 2 presents only the 44 most pertinent associations among the variables.
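The thresholding and ranking procedure described above (minimum support, minimum confidence, then the top rules by lift) can be sketched in Python. This is a brute-force illustration, not the "arules" implementation: the function name, the `max_len` limit, the restriction to single-item consequents and the toy data are all assumptions made here for brevity. It assumes `min_support` is greater than zero.

```python
from itertools import combinations


def mine_rules(transactions, min_support, min_confidence, top_n=50, max_len=3):
    """Enumerate rules lhs => rhs (single-item rhs) and rank them by lift."""
    n = len(transactions)

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({item for t in transactions for item in t})
    rules = []
    for size in range(2, max_len + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            c = count(itemset)
            if c / n < min_support:
                continue  # the whole itemset is not frequent enough
            for rhs in combo:
                lhs = itemset - {rhs}
                conf = c / count(lhs)
                if conf >= min_confidence:
                    rules.append({
                        "lhs": lhs,
                        "rhs": rhs,
                        "support": c / n,
                        "confidence": conf,
                        "lift": conf / (count(frozenset([rhs])) / n),
                        "count": c,
                    })
    # Keep the top_n rules in descending order of lift.
    rules.sort(key=lambda r: r["lift"], reverse=True)
    return rules[:top_n]


# Hypothetical toy transactions.
toy = [
    {"river", "bathing_in_river"},
    {"river", "bathing_in_river"},
    {"river", "public_tap"},
    {"public_tap", "washing_at_tap"},
    {"public_tap", "washing_at_tap"},
]
rules = mine_rules(toy, min_support=0.4, min_confidence=0.7)
for r in rules:
    print(sorted(r["lhs"]), "=>", r["rhs"], r["support"], r["confidence"], r["lift"], r["count"])
```

Raising the minimum support or confidence shrinks the rule list, which mirrors the reason given in the text for using stricter thresholds on the dataset that initially produced more than 1000 rules.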

V. RESULT AND DISCUSSION
The results show that washing, bathing, and the choice of public water sources are highly associated (support = 36%, confidence = 75%). Villagers preferred river water for bathing and washing (support = 32%, confidence = 87%). The survey shows that households using a single source all (100%) opted for the public tap, while 64% used both water sources.
Female water collectors with an education level below the 10th standard used the river for bathing (support = 23%, confidence = 75%), which shows that the public tap is used less for bathing; however, the public tap is preferred for washing (support = 49%, confidence = 100%). The uneducated and educated families opted for a female collector with support of 20% and confidence of 100%, whereas families with basic education below the 10th standard showed a stronger tendency towards the same choice (support = 55%, confidence = 100%, count = 36). The study finds that the responsibility for collection falls on females rather than males, with 92% of families making this choice. Most households collect water in both the evening and the morning (support = 90%, confidence = 100%), and thus collectors have to devote more time.
The water quality during the rainy season is good, but collection becomes difficult, and interruption and evening collection are also affected (support = 55%, confidence = 87%). In this case, the river source is chosen to manage the lack of water. The study also reveals that tap water quality remains good (support = 95%, confidence = 100%, count = 62) but is highly related to regular interruption of the water supply (support = 96%, confidence = 96%). The average water supply interruption is 5.8 days, with a minimum of 2 days and a maximum of 14 days. The frequent interruptions can be due to (i) limited water supply, (ii) the rainy season affecting the water treatment plant of the study area by increasing the sediment load, which leads to temporary shutdowns, and (iii) poor maintenance by the municipal corporation, which leads to frequent leakage, broken pipes and blockages.
On hygiene and sanitation, the results show that cleaning the container weekly is the most frequent choice (support = 58%, confidence = 100%), whereas most users cleaned their hands before handling water (support = 93%, confidence = 100%). However, a few of them, who also used river water, never cleaned the container (support = 23%, confidence = 100%, count = 15). Furthermore, only 27% of the families cleaned their hands after defecation, yet 72% used soap while bathing. It was also noted that 73% chose the river water source for bathing and 49% for washing, whereas 100% of users chose tap water for cooking. 27% of households do not have a toilet at home and 27% defecate in the river. These preferences for different water sources can be due to the following reasons: (i) users prefer good quality water for consumption, (ii) to reduce the water bill, (iii) to reduce the time and amount of water collection from the river or public tap, and (iv) frequent cases of interruption. The users also never apply any kind of home treatment, whatever the source may be.
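The support and count values reported together in this section can be cross-checked against the 65 households of the transaction database, since the count of a rule divided by the number of transactions should reproduce its support. The short Python sketch below is an illustrative consistency check only; the rule labels are paraphrases, and the tolerance of one percentage point is an assumption that allows for rounding in the reported figures.

```python
N_HOUSEHOLDS = 65  # observations in the transaction database

# (paraphrased rule, reported support, reported count)
reported = [
    ("education below 10th standard and female collector", 0.55, 36),
    ("tap water quality remains good", 0.95, 62),
    ("container never cleaned and river water used", 0.23, 15),
]


def consistent(support, count, n, tol=0.01):
    """True if count / n matches the reported support within `tol`."""
    return abs(count / n - support) <= tol


checks = {label: consistent(s, c, N_HOUSEHOLDS) for label, s, c in reported}
for label, ok in checks.items():
    print(label, "->", "consistent" if ok else "inconsistent")
```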
The 5 integer variables were analyzed by correlation analysis. Figure 5 shows the scatter plots of the variables together with Pearson's correlation. If the number of family members is smaller, the daily requirement is lower, less time is spent on water collection, and the bill paid per month also decreases. The water bill increases with family income. The high positive correlation shows that an increased water requirement per day is accompanied by increased collection time. The monthly water bill is highly correlated with the daily water requirement but negatively related to the time spent on water collection. The correlation coefficient values of the five variables are presented in Figure 5.
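Pearson's correlation coefficient between two of the numeric variables can be computed as in the following Python sketch. The function name and the sample values are hypothetical and are not taken from the survey data; they only illustrate how a perfectly linear increasing or decreasing relationship yields coefficients of +1 and -1.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations from the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


# Hypothetical values for four households.
family_size = [3, 4, 5, 6]
daily_requirement_litres = [60, 80, 100, 120]
collection_minutes = [50, 40, 30, 20]

r_positive = pearson_r(family_size, daily_requirement_litres)
r_negative = pearson_r(family_size, collection_minutes)
print(r_positive, r_negative)
```

The function assumes that neither variable is constant; a constant variable has zero variance and the coefficient is then undefined.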
By simple factor analysis, the age of the water collectors was divided into three groups: (i) 11-22: 45%, (ii) 22-33: 40% and (iii) 33-44: 14%. The average time spent was found to be 44 minutes per day; 90% of users collect water twice a day, and all the users collect water manually. The survey also estimates that the average income of the selected families is 2327 ETB, and the factor analysis reveals that the monthly income is (i) less than 1000 ETB for 56% of users, (ii) between 1000 and 2000 ETB for 18%, (iii) between 2000 and 4000 ETB for 9%, (iv) between 4000 and 6000 ETB for 3%, (v) between 6000 and 8000 ETB for 1.5%, and (vi) between 8000 and 10000 ETB for 14%. This shows that 74% of the users are from low-income backgrounds, which can explain why most have chosen dual water sources; paying water bills is a financial burden, since the average water bill is 62 ETB per month. Considering the socio-economic structure of the study area, income is bound to affect the water choices. The education data disclose that 20% of the study population is uneducated, 55% have an education level below the 10th standard, 3% below the 12th standard, and only 21% are graduates. Moreover, the majority of the low-income families also have an education level below the 10th standard. Regarding health issues, the most common diseases are diarrhea (75%), influenza (98%) and other seasonal water-related diseases like malaria and cholera (20%).

VI. CONCLUSION
The study aims to establish a relationship between the choice of water sources and the factors influencing the choice. The application of unsupervised machine learning techniques reveals that water is a key need of life, and safe choices of water sources are very important. The market basket analysis was able to associate the user's everyday life with the need for water. None of the users in the study has a direct water supply; they thus depend on the public and river water supply, and there is a coinciding relationship between income, education, water awareness, water security, and health. Ethiopia has a predominantly young populace, and in the study area, 47% were adults and 53% were children. However, the adult population of the study mostly consisted of youngsters below 25. Furthermore, females are responsible for the water collection, spending 44 minutes per day on travelling and 63 ETB per month. Considering the socio-economic background, the water expenditure is quite high, given that 56% of families earn less than 1000 ETB per month. It is not surprising that many have chosen dual water sources to cope with the expenditure and the monthly interruption, which may vary from 2 to 14 days. 73% of people preferred bathing with river water, and 49% preferred washing with it. 27% of the users defecate in the river, which creates serious sanitation and hygiene issues. This also reveals that river water gets polluted due to public use. The time spent on water collection also affects education and working hours. Education is the path to overall development, including awareness of the right choices, since none of the users was familiar with either treated water or treatment techniques and their benefits. In addition, 23% of users do not use soap after defecation. This exposes the lack of awareness about WASH (water, sanitation and hygiene).
This can be the reason for the health issues commonly reported by the respondents, which were diarrhea (75%), influenza (98%) and other seasonal water-related diseases like malaria and cholera (20%).
The study also concluded that there is no bore well or tube well in the area, which means that neither the government nor the local population exploits groundwater. The lack of water supply to each household shows a lack of water economic structures and government intervention in the study area [71]. It is safe to say that a reliable household water supply can be very helpful in improving the everyday quality of life of the villagers. However, the study could not obtain detailed information about the pollutants in the river, the point sources of pollution, the water-related diseases in the area and the number of children missing school. In the future, in-depth study and analysis should be done for a better understanding of the impact of water on human wellbeing.
The study makes the following recommendations: (i) piped water service should be extended to the households as it is the most reliable water source, (ii) water-related infrastructure needs to be improved and expanded, (iii) water reservoirs and tanks should be installed to overcome the regular interruption issue, (iv) in case of contamination during water distribution, household water treatment and safe storage should be encouraged, and (v) water awareness among the population should be raised more effectively to ensure water security and health. Furthermore, feature engineering and hyper-parameter tuning could be applied to minimize the noise as well as to improve the estimation accuracy, as reported in several studies [20], [72]-[74].
FIRAOL FITUMA received the bachelor's degree from the Department of Civil Engineering, Institute of Technology, Ambo University, Ethiopia. He developed soft skills while working on several water management projects for the local community in collaboration with international and national non-governmental organizations. His hard work and dedication have helped refine the management guidelines. Several of his recommendations have been reviewed in authority meetings, and some of them have been adopted.
TRAN MINH TUNG received the master's degree in civil engineering from the Ho Chi Minh City University of Technology, Vietnam, and the Ph.D. degree in civil engineering from the University of Wollongong, Australia. He is currently a Lecturer and the Dean of the Faculty of Civil Engineering, Ton Duc Thang University, Vietnam. His majors include sustainability construction, structure engineering, environmental engineering, and climate. In addition, he has an excellent expertise in machine learning and advanced data analytics. He has published over 30 articles in international journals. He was awarded the Peter Schmidt Memorial Award for the best performance in postgraduate research, in 2013. In addition, he won the scientific projects at Ton Duc Thang University.
VOLUME 9, 2021