Investigation of the Soan River Water Quality Using Multivariate Statistical Approach

Evaluating river water quality is critical because rivers are subject to pollution and to variations of natural or anthropogenic origin. For the Soan River (Pakistan), seven sampling sites were selected in the urban area of Rawalpindi/Islamabad, and 18 major chemical parameters were examined over two seasons, i.e., premonsoon and postmonsoon 2019. Multivariate statistical approaches such as the Pearson correlation coefficient, cluster analysis (CA), and principal component analysis (PCA) were used to evaluate the water quality of the Soan River based on temporal and spatial patterns. The PCA results show that only two principal components explained 92.46% of the total variance in the premonsoon season and 93.11% in the postmonsoon season. The PCA and CA made it possible to extract and recognize the origins of the factors responsible for water quality variations during the year 2019. The sampling stations were grouped into specific clusters on the basis of the spatiotemporal pattern of water quality data. Dissolved oxygen (DO), biochemical oxygen demand (BOD), chemical oxygen demand (COD), turbidity, and total suspended solids (TSS) are among the parameters contributing most to the variations in water quality, indicating that the water quality of the Soan River deteriorates gradually as it passes through the urban areas, receiving domestic and industrial wastewater from the outfalls. This study indicates that the adopted methodology can be utilized effectively for river water quality management.


Introduction
Rivers convey water and minerals to territories all around the earth. They have a significant influence on the hydrological cycle and serve as drainage outlets for runoff. Unsafe drinking water and inadequate sanitary conditions are associated with infectious diseases such as cholera, diarrhea, dysentery, and polio which significantly affect human health. According to the World Health Organization (WHO), at least two billion people worldwide use drinking water from a contaminated source containing human excreta [1]. In the last decades, research has revealed that developing countries have been able to control waterborne diseases to some extent through the development of sustainable natural resources, especially water resources, out of concern for the improvement of public health. However, challenges to achieving this workable goal remain, due to the increase in demand for water and the reduction of water availability resulting from escalating population growth and increased financial stability [2].
Over recent decades, many developing and developed countries have faced an alarming degradation of freshwater resources, mainly due to rapid regional development, especially in underdeveloped countries, and to agricultural runoff from fields containing fertilizers and pesticides [3,4]. The use of agricultural chemical manures and municipal water influences the water quality significantly [5,6].
As the quality of water is important for human health and ecological systems, its deteriorating quality in developing countries has caused detrimental health impacts such as diarrhea, poor oral hygiene, hepatitis, anemia, and dental caries [7]. Research has revealed that the quality of existing water resources has been deteriorating rapidly with ever increasing urbanization, massive industrialization, agricultural expansion, inadequate management of domestic, industrial and urban wastes, eroded soil, and transportation infrastructures in recent decades [8][9][10]. Giri and Singh [11] reported that the escalation of urbanization and industrial development near water tributaries has led to river degradation, water effluence, and disintegration of natural ecological integrity. In most developing countries, the massive contamination of tributaries due to the point source pollution containing harmful chemicals is a major issue and has been widely observed [12,13]. The utilization of water has been escalating at more than double the increase in population in the recent era. Globally, approximately six billion people are impounding 54% of all the open freshwater contained in streams, lakes, and natural springs [14].
Waterways play a significant role in incorporating and arranging the landscape and in influencing the environmental condition. They are the prime variables in controlling the global water cycle and behave as active elements which carry nutrients and contaminants. In Pakistan, the increasing pace of urbanization and industrialization has caused the severe decline of flora and fauna in the river basin due to the mismanagement of water reservoirs. In particular, untreated sewage, agricultural runoff, and industrial effluents have resulted in a considerable deterioration in water quality. The population residing near rivers has to use this water for many purposes including domestic uses and drinking. Similarly, the quality of surface water is declining due to human development, industrialization, agricultural inputs, transportation, urbanization, animal and human excretions, and domestic waste. Variation in the quality and amount of stream water as a result of characteristic and anthropogenic impacts is generally observed in different waterways throughout the world [15].
Water quality assessment is the most widely accepted way forward for identification of biological issues pertaining to river environment, and fluctuations in the environmental key indicators due to seasonal and time variation are quite helpful for analysis of the contamination level [16].
According to a report published by the Asia Foundation [17], almost 95% of the water in Pakistan is utilized for agronomic practices, which support the 60% of its population engaged in agriculture and livestock, while 80% of the exports depend upon agribusiness. Despite the fact that Pakistan has the world's largest glaciers, it is nevertheless counted among the world's 36 most water-stressed countries. Due to an escalation in population growth as well as to strained relations with the neighboring country concerning transboundary issues of water management, the water situation in the country has become critical, both with respect to its impending waterfront and sustainability, as the demand for water is expected to exceed supply.
The release of sewage water into the water bodies is the primary source of water pollution, whereas the secondary source of contamination is the discharge of toxic chemicals from industrial effluents and agricultural nonpoint sources into the water bodies [18]. A low dissolved oxygen problem in the River Ravi (Pakistan) was identified in [19,20] and attributed to the discharge of very high biochemical oxygen demand (BOD) loadings from several wastewater outfalls and surface drains. These studies also identified concentrations of unionized ammonia (NH₃) higher than its permissible toxicity levels. Fecal coliforms were also found to be higher than the recreational and irrigation water quality standards [21]. In particular, in low flow conditions, the water quality of the river does not meet the desired standards for any of its beneficial uses, including fishery, irrigation, recreation, and drinking.
People frequently suffer from waterborne diseases because of using contaminated drinking water. About 30% of all reported diseases and 40% of deaths in Pakistan are attributed to fecal contamination of drinking water [22]. In Pakistan, the main causes of waterborne diseases are contaminated drinking water, lack of sanitary facilities at treatment plants, and release of industrial effluents and domestic sewage into the water bodies [23].
The PCA approach was successfully utilized to interpret a large and complex data matrix for the investigation of water quality of the Açude da Macela reservoir for consumption of the local community and agriculture purposes [24]. Different multivariate integrated techniques were applied to classify the various sampling sites according to spatial and temporal characteristics for the investigation of water quality of Qiandao Lake in China. The results obtained from this study reveal that CA, discriminant analysis (DA), and PCA are valuable tools for water resource management [25].
The Soan River is among those waterways where housing developments have been established along the banks. According to [26] about 5.5 million residents of Rawalpindi/Islamabad discharge municipal sewage into the Soan River. Due to encroachment along the riverbanks, the ecosystem and surface runoff have drastically affected the pollution levels of the river. Due to the inadequate drainage of wastewater into the river, an excessive increase of heavy metals and nutrients in river water above the national standards has been observed [26].
The aim of this study was to investigate the possible major pollution sources affecting water quality in the urban area of the Soan River by using multivariate statistical approaches through Pearson's correlation, CA, and PCA with the objective of providing support for environmental executives in order to enable them to make better decisions regarding action plans.

Description of Study Area.
The Soan River is a minor tributary of the Indus River. The origin of the Soan River is the southwestern range of the Murree hills, which are located at longitude 71°45′ to 73°35′ E and latitude 32°45′ to 33°55′ N; the river passes through the Rawalpindi, Attock, and Jehlum districts of Punjab (Pakistan). The total length of the Soan River is 272 km, and it drains a total catchment area of 11,085 km². Nallah Lai, a drainage channel which originates from the area of the Margalla hills, ultimately falls into this river, draining wastewater into it near the Soan bus bay. This river provides its water to the Simly Dam, which is located in Islamabad and is the largest drinking water reservoir for the Islamabad Capital Territory and its adjacent areas. Agricultural lands are also located along the river, and the use of herbicides and insecticides is common in these farming areas [27]. The main sources of water in the Soan River basin are precipitation, streamflow, and snowmelt in the Murree region. However, this basin is water deficient, as most of the area is covered and underlain by rocks of Tertiary age. The Ling stream from Lithar and Kahuta joins the Soan River near Sihala before Kaak bridge. The Kaak bridge was constructed over this river at the Islamabad Highway. Two more streams join the Soan River near this bridge: the Korang River and the Lei stream. Flowing in a southwest direction, the Soan River joins the Indus River about 16 km upstream of Kalabagh [27,28]. The mean temperature of the river ranges from a minimum of 3°C in winter to a maximum of 45°C in summer. In summer, the river flow decreases, whereas in the monsoon season (July-September), heavy rain results in flash flooding, which causes increased runoff. The river passing through residential societies is contaminated by the dumping sites of the surrounding area.
Furthermore, construction waste is also poured into the river and has been since the formation of private housing societies nearby. There are also some industries constructed along the left bank of the river, and these directly pour their effluents into the river without any treatment [29]. This surface water also recharges the ground water aquifer of the region. The average annual rainfall in the study area varies from 250 to 1800 mm from north to southwest. The maximum rainfall occurs during July-September, whereas the minimum takes place in the months of November-December [30].

Sample Collection and Chemical Analysis.
The most critical and polluted urban reach of the Soan River, from downstream of Simly Dam (Chakian Bridge), Islamabad, to Bahria Town Phase-VIII, Rawalpindi, was identified by a reconnaissance survey, and seven sampling sites were selected according to natural conditions, industrial zones, and human activities along the river banks. The water samples were collected for two seasons (premonsoon in the month of May 2019 and postmonsoon in the month of November 2019) from a depth of 0.3 m in clean and sterilized plastic sample bottles of 1.5 L. The water samples collected from the field were placed in an insulated cool box, transported as soon as possible to the PCRWR Water Quality Lab Islamabad, and stored at 4°C until they were analyzed [31]. The water quality parameters, their units, and methods of analysis are described in Table 1. The locations of the selected sites, shown in Figure 1, were recorded (latitudinal and longitudinal position) using a handheld GPS (Global Positioning System) unit.

Pearson Correlation Matrix.
The Pearson correlation matrix is widely employed to evaluate the relationship between the variables of any data set. The correlation coefficient lies between -1 and 1: values close to 1 or -1 indicate a strong positive or negative relationship between two parameters, and a value of zero indicates no correlation between them, at a significance level of p < 0.05 (Kumar et al., 2006). Pearson's correlation matrix measures the strength, direction, and probability of the linear association between two interval or ratio variables.
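As a minimal sketch of the coefficient described above, the following Python function computes Pearson's r for two variables. The station values are hypothetical and serve only to illustrate the calculation; they are not measurements from this study.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of cross-products and sums of squares of the deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical DO and BOD values (ppm) at seven stations, for illustration only
do_values = [7.0, 6.8, 6.5, 4.0, 3.8, 1.5, 1.2]
bod_values = [3.0, 3.5, 4.0, 15.0, 16.0, 30.0, 32.0]

r = pearson_r(do_values, bod_values)
print(f"r = {r:.3f}")  # a value close to -1 means a strong negative linear relationship
```

Applying the same function to every pair of parameters yields the full correlation matrix.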

Cluster Analysis (CA).
CA enables the researcher to arrange the water samples into groups with high internal homogeneity (within clusters) and high external heterogeneity (between clusters). In CA, a tree-like diagram known as a dendrogram is created, and it summarizes the clusters by significantly decreasing the dimensions of the primary dataset. Euclidean distance is widely used to measure the dissimilarity between sampling stations, which are characterized by the differences present in their data.
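The grouping step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure run in SPSS for this study: the merge rule is Ward's criterion on squared Euclidean distance, the variables are z-score standardized first, and the station values are hypothetical.

```python
from itertools import combinations
import math

def zscore_columns(rows):
    """Standardize each column to zero mean and unit (population) standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b are merged."""
    ca = [sum(col) / len(a) for col in zip(*a)]
    cb = [sum(col) / len(b) for col in zip(*b)]
    dist2 = sum((p - q) ** 2 for p, q in zip(ca, cb))  # squared Euclidean distance
    return len(a) * len(b) / (len(a) + len(b)) * dist2

def ward_clusters(names, rows, k):
    """Merge the cheapest pair of clusters until k remain; return sorted name lists."""
    clusters = [([n], [r]) for n, r in zip(names, rows)]
    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: ward_cost(clusters[p[0]][1], clusters[p[1]][1]))
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return sorted(sorted(members) for members, _ in clusters)

# Hypothetical [DO, BOD] values (ppm) per station, for illustration only
stations = ["S-1", "S-2", "S-3", "S-4", "S-5", "S-6", "S-7"]
values = [[7.0, 3.0], [6.8, 3.5], [6.5, 4.0], [4.0, 15.0],
          [3.8, 16.0], [1.5, 30.0], [1.2, 32.0]]

groups = ward_clusters(stations, zscore_columns(values), k=3)
print(groups)
```

Recording the merge cost at each step, rather than stopping at a fixed k, gives the heights needed to draw a dendrogram.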
The CA approach with the Ward linkage method was used to classify samples in an investigation of the water quality characteristics of the Mahitsy Commune, Central Highland of Madagascar [32].

2.5. Principal Component Analysis (PCA).
PCA is a dimension reduction method which reduces a large number of variables to a smaller number of factors. It is normally used to structure data and to obtain qualitative information about possible contamination causes. An eigenvalue in the PCA approach shows the significance of a factor, and the factors having eigenvalues greater than or equal to 1 are considered the most significant, as endorsed by [33]. A scree plot is a diagram whose Y-axis represents the eigenvalues and whose X-axis shows the number of factors. The point where the gradient of the curve clearly levels off (the elbow) shows the number of factors to be retained.
According to the literature review, the PCA has been widely used to determine latent contamination causes, without changing the original characteristics of the variables and by reducing the dimensions of data with negligible loss of primary data and by grouping multiple variables with respect to their geography [8,34,35,36].
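The eigenvalue rule can be illustrated with a short, self-contained Python sketch. It assumes PCA on the correlation matrix, obtains the eigenvalues with a cyclic Jacobi rotation routine written here for illustration (statistical packages use their own solvers), and uses hypothetical station data rather than the values measured in this study.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def jacobi_eigenvalues(matrix, max_sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations, largest first."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Rotation angle that zeroes a[p][q], restricted to |theta| <= pi/4
                num, den = 2 * a[p][q], a[q][q] - a[p][p]
                if den < 0:
                    num, den = -num, -den
                theta = 0.5 * math.atan2(num, den)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # update columns p and q (A <- A J)
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # update rows p and q (A <- J^T A)
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)

# Hypothetical values at seven stations, for illustration only
columns = {
    "EC": [450, 470, 500, 820, 860, 900, 940],
    "TDS": [290, 300, 325, 530, 560, 585, 610],
    "BOD": [3.0, 3.5, 4.0, 15.0, 16.0, 30.0, 32.0],
    "Turbidity": [12, 30, 18, 25, 40, 22, 35],
}
data = list(columns.values())
corr = [[pearson_r(x, y) for y in data] for x in data]

eigenvalues = jacobi_eigenvalues(corr)
explained = [100 * ev / len(eigenvalues) for ev in eigenvalues]  # % of total variance
retained = [ev for ev in eigenvalues if ev >= 1.0]  # eigenvalue >= 1 rule

for ev, pct in zip(eigenvalues, explained):
    print(f"eigenvalue = {ev:.3f}, explained variance = {pct:.2f}%")
print(f"components retained: {len(retained)}")
```

Because the eigenvalues of a correlation matrix sum to the number of variables, each eigenvalue divided by that number is the share of total variance explained by its component. Plotting the sorted eigenvalues against their rank gives the scree plot.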

Data Processing and Statistical Analyses of Water Quality Parameters.
To evaluate the collected river water quality dataset consistently at different sampling stations, multivariate statistical techniques were employed through Pearson's correlation, CA, and PCA. All statistical computations in the present study were performed using Microsoft Excel 2016 and IBM SPSS 22.
Descriptive statistics (mean, range, and standard deviation) of the selected water quality parameters at the seven stations of the Soan River, collected during the premonsoon and postmonsoon seasons (2019), are provided in Table 2.
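For reference, such summary statistics can be computed as in the following Python sketch. The values are hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption, since the text does not state which form was reported.

```python
import statistics

def describe(values):
    """Return the mean, minimum, maximum, and sample standard deviation."""
    return {
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
        "sd": statistics.stdev(values),  # sample SD (n - 1); an assumption here
    }

# Hypothetical DO values (ppm) at seven stations for one season, for illustration only
do_values = [7.0, 6.8, 6.5, 4.0, 3.8, 1.5, 1.2]

summary = describe(do_values)
print(f"mean = {summary['mean']:.2f}, range = {summary['min']}-{summary['max']}, "
      f"SD = {summary['sd']:.2f}")
```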
The values of the measured parameters EC, pH, alkalinity as CaCO₃, HCO₃, Ca, Cl⁻, hardness, Mg, Na, K, SO₄, NO₃, and TDS are generally within the recommended range at all stations during both seasons, but there is a significant change in the measured values of DO, BOD, COD, and turbidity.
A lower concentration of DO was observed during both seasons at the downstream stations S-4 to S-7 than at the upstream stations S-1 and S-2. Large amounts of organic waste from domestic and industrial wastewater enter the Soan River; this waste contains plenty of biodegradable material, which increases the oxygen demand and steadily depletes DO. A lower concentration of DO indicates river pollution. Figure 2(a) shows that the impact of the wastewater discharge on the DO sag extends to the subsequent stations, i.e., S-6 and S-7. Also, from Figure 2(a), it is evident that the DO concentration is higher in the month of May (i.e., premonsoon) due to the impact of early rainfall events, while the DO levels are low, with the exception of S-4, in postmonsoon samples due to the dry weather flow conditions that prevail right after the monsoon in the study area. Figure 2(b) shows higher BOD values against the lower DO values at downstream stations due to industrial and densely populated areas. A low BOD concentration indicates that the river water is free from biological pollution [37], whereas high BOD is harmful for aquatic life as it depletes the DO level [38]. Observed COD concentrations show a similar pattern. The high COD value (i.e., 2 times the BOD) at S-5 points towards the possibility of industrial wastewater discharge that contains nonbiodegradable organics [39]. TSS and turbidity levels are fairly correlated in Figures 2(d) and 2(e), and their higher values in postmonsoon samples indicate the presence of urban and agricultural runoff with higher sediment loads.

Spatiotemporal Variance of Water Quality Parameters by Cluster Analysis (CA).
According to Ashfaq et al. [30], rainfall in the Soan Basin shows high spatial and temporal variation, with about 70% falling during the monsoon season (July to September) and the remaining 30% in the spring and winter seasons. The hierarchical CA technique was used for the assessment of similarity among the sampling sites. A dendrogram was plotted based upon the chemical analysis of all selected sites that grouped these sites into three and two clusters during premonsoon and postmonsoon seasons, respectively, as shown in Figures 3(a) and 3(b).

International Journal of Photoenergy
In the premonsoon season, Cluster 1 contains the stations S-1, S-2, and S-3. The primary land uses around these stations include agriculture and newly developed neighborhoods with fewer residents. Therefore, the pollution levels were found to be lower than at the other stations; this is also evident from the high DO levels for these stations in Figure 2(a). Cluster 2 comprises S-4 and S-5, which lie in the developed part of the city where the river receives large volumes of domestic and industrial wastewater. High pollution levels during the premonsoon, with high BOD and COD and low DO levels, can also be seen in Figure 2. The land use is the same in the catchment area of these two stations. Other sources of contamination in this part of the river (S-4 and S-5) are open dumping of solid waste from slaughterhouses and some low-income residential neighborhoods that lack a proper solid waste collection and disposal mechanism. Cluster 3 consists of two highly polluted stations at the downstream end of the river, i.e., S-6 and S-7. There are three main reasons for such high pollution (also see Figure 3) at these stations in the premonsoon monitoring: (i) at S-6, the river water is completely mixed with the industrial wastewater discharge from S-5, and the meandering river morphology between S-5 and S-6 justifies this mixing length during the relatively higher velocities of the premonsoon period; (ii) S-6 is the confluence point of Nallah Lai, which discharges large volumes of untreated municipal wastewater from Rawalpindi city into the Soan River; and (iii) the continuing meandering river morphology covers a mixing length and shows high pollution levels at S-7.
In the postmonsoon period, the river primarily behaves like a sewage drain flowing at very low velocity. Therefore, only two clusters are formed for this monitoring survey, and the cluster configuration differs due to the low flow regime in the river. Cluster 1 contains five stations, i.e., S-1, S-2, S-3, S-4, and S-7. The primary reason for having these stations in a single cluster is the similar DO and COD values obtained at these stations (see Figures 2(a) and 2(c)). The relatively higher DO levels at S-4 and S-7 (i.e., >3 ppm) can be attributed to sampling errors, as the samples were obtained from the river bank, where complete mixing of the wastewater could not have been achieved yet. Cluster 2 contains S-5 and S-6, which share the characteristics already discussed above. The primary reason for having different clusters is the changing flow pattern between the two seasons, which resulted in velocity variations, changing mixing lengths, and the absence of nonpoint sources during the postmonsoon period.
The results reveal that the CA technique is a useful approach for the classification of river water at temporal and spatial scales. It can therefore guide the selection of critical sampling points and help minimize the costs incurred in future monitoring strategies.

Pearson's Correlation Matrix.
Pearson's correlation matrix was prepared, and the correlation coefficient (r) was applied to check the possible relationship among the chemical parameters of the water quality for the Soan River. According to guidelines established by Cohen [40], there is a strong correlation if the value of the Pearson coefficient (r) is 0.75 ≤ r < 1. Several significant strong positive correlations and a few negative correlations were found among the parameters, as presented in Tables 3 and 4. In the premonsoon season, most of the parameters show a significant positive correlation, as given in Table 3:
(i) EC with alkalinity, HCO₃, hardness, Mg, K, SO₄, and TDS (r > 0.9)
(ii) Turbidity with NO₃, COD, and TSS (r > 0.9)
(iii) Alkalinity with HCO₃, hardness, Mg, K, SO₄, and TDS (r > 0.9)
(iv) HCO₃ with hardness, Mg, K, SO₄, and TDS (r > 0.9)
(v) Cl⁻ with hardness, Mg, K, and SO₄ (r > 0.9)
Alkalinity in water exists due to the presence of bicarbonate (HCO₃⁻), carbonate (CO₃²⁻), and hydroxide (OH⁻) of elements such as Ca, Mg, Na, K, and NH₃, while hardness is primarily associated with the presence of carbonates and bicarbonates of Ca and Mg. TDS commonly contains inorganic salts having cations (Ca, Mg, K, and Na) and anions (CO₃, HCO₃, Cl, and SO₄). Therefore, TDS is highly correlated with alkalinity, hardness, cations, and anions. EC is essentially a measure of TDS and so was found to be highly correlated with all these parameters, with an r value higher than 0.9. Similarly, turbidity is a measure of TSS. In raw wastewater, COD contains both the soluble and particulate forms of organic matter. Therefore, turbidity was found to be highly correlated with TSS and COD. NO₃ values are very small (i.e., 3.8 ppm in Table 1) in untreated wastewater due to the absence of the nitrification process. So, the main reason for this correlation is that the NO₃ values follow a trend similar to that of TSS, turbidity, and COD.
COD was found to be highly correlated with BOD and can be used as a monitoring parameter to avoid the long 5-day BOD test. All the other high correlations reflect the relationships between alkalinity, hardness, TDS, cations, and anions that have already been described above. TDS and EC are quick tests and can be used to monitor the rest of the parameters using regression equations. The correlation was detected as strongly negative between Na and DO, and between DO and COD and BOD (r > −0.9). The former makes clear sense as there is essentially no relationship between Na and DO in the river water. DO should have a strong correlation with BOD; the reason for not obtaining a strong correlation between these parameters can be attributed to limited data and sampling errors. No significant correlation of pH was observed with any element in the river water, while there is a slightly negative correlation with other parameters like BOD, COD, TSS, NO₃, and turbidity. The analysis of the results shows that these water quality parameters cannot be controlled by a single specific factor because they have different pollution sources. Overall, the divergent results reveal that common sources are not always linked with significant correlations among the parameters.
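The idea of estimating a slow test from a quick one through a regression equation can be sketched as follows. This is a minimal ordinary least-squares example with hypothetical COD and BOD values; no regression coefficients are reported in this study, so the fitted numbers carry no meaning beyond illustration.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical COD and BOD values (ppm) at seven stations, for illustration only
cod = [8.0, 9.0, 11.0, 32.0, 35.0, 62.0, 66.0]
bod = [3.0, 3.5, 4.0, 15.0, 16.0, 30.0, 32.0]

intercept, slope = linear_fit(cod, bod)
estimated_bod = intercept + slope * 40.0  # BOD estimate for a new COD reading of 40 ppm
print(f"BOD = {intercept:.2f} + {slope:.3f} * COD; estimate at COD 40: {estimated_bod:.1f}")
```

Such an equation is only as reliable as the underlying correlation, so it would need to be refitted and validated on a larger dataset before use in monitoring.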
In Table 4, the postmonsoon results reveal that most of the parameters show a significant positive correlation, as follows:
(i) EC with alkalinity, HCO₃, hardness, K, Cl⁻, Ca, Na, SO₄, COD, and BOD (r > 0.9)
(ii) EC with TDS, which has a strong correlation (r = 1)
(iii) TSS with turbidity and NO₃ (r > 0.9)
(iv) Alkalinity with HCO₃, hardness, K, Na, SO₄, COD, and TDS (r > 0.9)
(v) HCO₃ with Ca, Cl⁻, K, Na, hardness, K, SO₄, COD, and TDS (r > 0.9)
It is pertinent to mention here that a significant negative correlation of DO and TSS with all other parameters, except for turbidity and pH, was observed due to the low flow regime in the river. The discussion on the premonsoon period above explains the reasons for the strong correlations between hardness, alkalinity, TDS, EC, cations, and anions for the postmonsoon sampling as well. Similar to the premonsoon period, the negative correlation between DO and BOD (and COD) during postmonsoon is due to data limitations and sampling errors. For the remaining parameters, this finding validates that there is practically no correlation between DO and other parameters, except TSS, which is sometimes organic in nature and contributes to sediment oxygen demand. However, the samples in the present study were not obtained from the bottom, so no such correlation can be established.

3.4. Principal Component Analysis (PCA).
The guidelines developed by Kaiser [41] for the PCA technique were followed to determine the significant relationships among the selected sites in order to identify the physicochemical properties of the water quality parameters; the factor loadings were calculated and classified as "strong," "moderate," and "weak" corresponding to absolute loading values of >0.75, 0.50-0.75, and 0.30-0.50, respectively. In order to understand the underlying data structure, scree plots are used to decide how many principal components to retain based on the eigenvalues. The scree plot also provides the percentage variance explained by each component and indicates how the principal components were obtained [42].
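The loading classes above translate directly into a small helper, sketched here in Python. The "negligible" label for absolute loadings below 0.30 is an assumption added for completeness, and the example loadings are hypothetical rather than taken from Table 5.

```python
def classify_loading(loading):
    """Classify a factor loading by its absolute value."""
    magnitude = abs(loading)
    if magnitude > 0.75:
        return "strong"
    if magnitude >= 0.50:
        return "moderate"
    if magnitude >= 0.30:
        return "weak"
    return "negligible"  # label assumed here for |loading| < 0.30

# Hypothetical loadings on one component, for illustration only
example_loadings = {"EC": 0.97, "TDS": 0.96, "pH": -0.42, "DO": -0.81, "Na": 0.55}

for parameter, value in example_loadings.items():
    print(f"{parameter}: {value:+.2f} -> {classify_loading(value)}")
```

The sign of a loading is kept for interpretation (for example, a negative loading of DO), while only its magnitude determines the class.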
As shown in Figures 4(a) and 4(b), there is a definite change of slope after the 3rd eigenvalue; only two components were retained. Loading values and the explained variance of the two retained PCs are shown in Table 5. Only two PCs have eigenvalues > 1 (Kaiser criterion); together they explain 92.46% of the total variance in the water quality dataset during the premonsoon season and 93.11% during the postmonsoon season. The components having eigenvalues lower than 1 were ignored due to their low significance [43].
In the premonsoon season, PC-1 accounted for 60.35% of the total variance, with high positive loadings of EC, alkalinity, HCO₃, Ca, Cl⁻, hardness, Mg, K, SO₄, and TDS. This factor reflects biogenic and anthropogenic (urban wastewater) pollution sources. PC-2 contributes 32.11% of the total variance and comprises strong positive loadings on turbidity, NO₃, BOD, COD, and TSS, whereas there is a negative loading on pH and DO. From Table 5, it is clear that there is a negative loading of DO during both seasons, which may be indicative of a high bacterial load and an excess amount of BOD (untreated sewage, organic discharges). The results obtained from PCA identified the major pollution sources of the study area, which are mainly natural processes and anthropogenic activities. The discharge of untreated urban wastewater into the river constitutes the major anthropogenic point source of contamination. The nonpoint source, which also contributes immensely to Soan River water contamination, is agricultural activities and livestock farming. The PCA plots of premonsoon and postmonsoon are shown in Figures 5(a) and 5(b).

Conclusion
In the present study, multivariate statistical techniques were used for the evaluation of pollution variations and source apportionment of water contamination in the Soan River. The hierarchical CA technique was used for the assessment of similarity among the sampling sites. A dendrogram was plotted based upon the chemical analysis of all selected sites that grouped these sites into three and two statistically significant clusters during premonsoon and postmonsoon seasons, respectively, as shown in Figures 3(a) and 3(b). The PCA and CA made it possible to extract and recognize the origins of the factors responsible for water quality variations during the year 2019. DO, BOD, COD, turbidity, and TSS were identified as the most significant parameters contributing to water quality, indicating that the water quality of the Soan River deteriorates gradually from upstream to downstream as it passes through the urban areas, receiving domestic and industrial wastewater from the outfalls. The results for DO, COD, TSS, and BOD show that this wastewater requires proper treatment before disposal to reduce these indicators of pollution. Moreover, mismanagement causes solid waste to move down into the drainage system, which results in sewer line blockage and consequently in the spread of various types of disease-causing organisms.
The sustainable integrated water resource management of rivers should be planned through various approaches such as the treatment of wastewater released from industrial zones and municipal locations. It is recommended that a long-term and continued monitoring program be established to predict the water quality of the Soan River. Water quality can be improved by the selection of landfill sites to dump the municipal solid waste, adequate treatment of urban and industrial wastewater before its release into the river, and improvement of agronomic practices in this area. Furthermore, installing an automatic wireless-sensor networking system for river water quality assessment and forecasting may support the environmental protection authority in the implementation of appropriate action plans.

Data Availability
The data used to support the findings of this study are included in the article.

Conflicts of Interest
The authors declare that they have no conflict of interest regarding the publication of this paper.