Assessment of Surface Water Quality using Multivariate Statistical Techniques: A Case Study in China

To interpret the surface water quality of the drinking water sources of Yancheng city, China (the Tongyu River and Mangshe River), 18 water quality parameters were selected, and data collected at 9 sampling sites from 2010 to 2015 were analyzed using multivariate statistical techniques, including cluster analysis (CA), principal component analysis (PCA), and factor analysis (FA). Hierarchical CA classified the sampling sites into three clusters based on their similarities, representing relatively low, moderate, and relatively high pollution sites. PCA/FA identified six latent factors that accounted for 75.39% of the total variance, representing the influences of organic pollution, fecal pollution, biochemical reactions, nutrients, domestic sewage, and natural factors, respectively. Pollution source analysis showed that Sites 1, 2, and 3 were almost completely unaffected by the various pollution sources; Sites 4 and 5 were polluted by industrial and domestic discharge; Sites 6, 7, and 8 were polluted by point and nonpoint sources from industrial activity, agriculture, and domestic drainage; and Site 9 was severely polluted by untreated domestic discharge from nearby inhabitants. The results verified that multivariate statistical techniques are useful, and may be necessary, for analyzing and interpreting large, complex surface water quality databases, which could help managers optimize action plans to control drinking water quality.

*Corresponding author: Wang Y, School of Energy and Environment, Southeast University, Nanjing, Jiangsu, China, Tel: +86-13776415656; E-mail: wangyumin@seu.edu.cn
Received July 24, 2018; Accepted August 07, 2018; Published August 20, 2018
Citation: Wang Y, Zhu G, Yu R (2018) Assessment of Surface Water Quality using Multivariate Statistical Techniques: A Case Study in China. Irrigat Drainage Sys Eng 7: 214. doi: 10.4172/2168-9768.1000214
Copyright: © 2018 Wang Y, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.


Introduction
Water quality has deteriorated greatly worldwide in recent decades, driven by both natural processes (precipitation rate, weathering processes, and soil erosion) and anthropogenic effects associated with excessive exploitation of water resources and untreated discharge of municipal and industrial wastewater [1-7]. Governments have monitored the temporal and spatial variations in surface water quality for years in order to prevent pollution of surface water bodies. However, long-term monitoring datasets are large, with complex matrices comprising numerous physicochemical parameters. It is therefore often difficult for planners to extract meaningful information from these datasets, identify significant parameters, and apportion pollution sources [7-9]. Multivariate statistical techniques such as cluster analysis (CA), principal component analysis (PCA), and factor analysis (FA) can be used to inspect complex datasets, evaluate water quality, and assess pollution sources. In recent years, a number of studies have applied different multivariate statistical techniques in water quality assessments to optimize monitoring networks and to select representative water quality parameters without losing meaningful information [5,6,10-12].
In this study, data for 18 water quality parameters were collected from 2010 to 2015 at 9 sampling stations in Yancheng city, China. Multivariate statistical methods (i.e., CA, PCA, and FA) were applied to analyze the water quality data. First, the 9 sampling stations were grouped according to their similarities and dissimilarities by means of CA. Second, the complex water quality datasets were analyzed to extract latent water quality factors using PCA and FA. Finally, the effects of possible pollution sources on water quality were identified.

Study area
Yancheng city (32°51′-34°12′N, 119°34′-120°27′E) is an eastern coastal district in the center of Jiangsu Province, China, with a population of more than 8 million. It is bordered by the Yellow Sea to the east, and is adjacent to Yangzhou and Huai'an cities to the west, Lianyungang city to the north, and Nantong and Taizhou cities to the south. The district covers an area of about 14,983 km², including 48.54 km² in urban districts, while the remaining area is divided into nine counties, cities, and zones: Dongtai city, Dafeng city, Xiangshui County, Binhai County, Funing County, Jianhu County, Sheyang County, Tinghu Zone, and Yandu Zone. The drinking water of Yancheng city is supplied by the Mangshe River and Tongyu River, which receive pollutants from domestic sewage, agricultural runoff, aquaculture wastewater, and industrial effluent. The Mangshe River originates in Dazong Lake and discharges into the East Sea, with a total length of nearly 50 km. The Tongyu River has a total length of 415 km, originating from the Chang Jiang River and ending in Lianyungang city. The middle reach of the Tongyu River runs through Yancheng city, with a length of 183.6 km and a mean flow rate of about 100 m³/s. At Wuyou Port on the Tongyu River, surface water is severely polluted by domestic wastewater from nearby settlements. Along the rivers there are nine monitoring sites (Fenghuang Bridge, Qinnanxi Bridge, Dazong Lake, Dongtai Bridge, Baiju Bridge, Shuini Bridge, Xin-Gou, Datuan Bridge, and Wuyou Bridge) (Figure 1). The water quality parameters included water temperature (T), pH, dissolved oxygen (DO), chemical oxygen demand (CODCr), and 5-day biochemical oxygen demand (BOD₅), among others (APHA, 1998). All the water quality parameters are expressed in mg/L, except temperature (°C), pH, turbidity (NTU), fecal coliforms (CFU/100 mL), and Escherichia coli (CFU/100 mL). The statistical summary of the water quality parameters sampled at the nine monitoring sites is shown in Table 1.

Multivariate statistical methods
Multivariate techniques including CA, PCA, and FA can reduce the dimensions of the data to enhance the quality of the analysis. Before performing CA and PCA/FA, the datasets were standardized through z-scale transformation to avoid misclassification. Standardization evens out the influence of variables with different variance ranges and eliminates the effects of different units among variables. All mathematical and statistical computations were performed using SPSS ver. 19.0 for Windows 7.
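The z-scale transformation described above can be illustrated with a minimal Python sketch. The study itself used SPSS; the function name `zscore` and the sample values below are hypothetical and serve only to show the calculation, in which each parameter is centered on its mean and divided by its sample standard deviation.

```python
import numpy as np

def zscore(data):
    """Standardize each column (one water quality parameter per column)
    to mean 0 and sample standard deviation 1."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    std = data.std(axis=0, ddof=1)  # sample standard deviation
    return (data - mean) / std

# Hypothetical values, not taken from the study:
# columns are DO (mg/L), turbidity (NTU), fecal coliforms (CFU/100 mL)
samples = np.array([
    [8.1, 10.0, 120.0],
    [6.4, 16.0, 480.0],
    [5.0, 31.0, 900.0],
])
z = zscore(samples)
```

After the transformation every column has the same scale, so a parameter measured in CFU/100 mL no longer dominates one measured in mg/L when distances or correlations are computed.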
CA: CA is a multivariate technique whose primary purpose is to assemble objects with respect to predetermined selection criteria, resulting in high internal (within-cluster) homogeneity and high external (between-cluster) heterogeneity. Hierarchical agglomerative clustering is the most common approach; it yields intuitive similarity relationships between any one sample and the entire dataset, and the result can be displayed graphically as a dendrogram [3,13,14]. Dendrograms provide a visual summary of the clustering process and present a picture of the groups and their proximity with a dramatic reduction in the dimensionality of the original data [3]. Euclidean distance is usually adopted to show the similarity between two samples, and can represent the difference between the analytical values from the samples [3,15].
In this study, the spatial variability of water quality was determined by hierarchical agglomerative CA on normalized datasets using Ward's method. The linkage distance was standardized as (Dlink/Dmax) × 100, i.e., the linkage distance of a given merge divided by the maximum linkage distance and multiplied by 100 [3,6,14,16].
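As a rough illustration of the clustering step, the sketch below applies Ward's method with Euclidean distances using SciPy and rescales the linkage distances to (Dlink/Dmax) × 100. It is not the authors' SPSS procedure: the function `ward_clusters`, the 20% cut-off default, and the two-dimensional coordinates are assumptions chosen to produce three well-separated synthetic groups.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def ward_clusters(z_data, threshold_pct=20.0):
    """Cluster rows with Ward's method and cut the tree at a given
    percentage of the maximum linkage distance."""
    links = linkage(z_data, method="ward", metric="euclidean")
    d_max = links[:, 2].max()
    scaled = links[:, 2] / d_max * 100.0  # (Dlink/Dmax) x 100 for each merge
    # Rows whose cophenetic distance does not exceed the cut-off share a label.
    labels = fcluster(links, t=threshold_pct / 100.0 * d_max, criterion="distance")
    return labels, scaled

# Synthetic, hypothetical coordinates for nine "sites" (not the study's data)
sites = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1], [4.9, 5.0],
    [10.0, 0.0],
])
labels, scaled = ward_clusters(sites)
```

Passing `links` to `scipy.cluster.hierarchy.dendrogram` would draw the corresponding tree; the rescaled distances make cut-off levels comparable between datasets.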
PCA/FA: PCA forms principal components (PCs), which are linear combinations of the original variables, in order to transform the original set of inter-correlated variables into new, uncorrelated variables [6,17,18]. PCA focuses on the information from the most meaningful parameters, which reduces the original dataset with the least loss of information [14,18]. PCA provides an objective way of describing the variation in the data as concisely as possible. As a result, a small number of factors can explain approximately the same amount of information as the much larger set of original observations. FA follows PCA and further simplifies the data structure by rotating the axes defined in the PCA, which reduces the contribution of less significant variables. According to well-established rules, such as varimax rotation, new variables called varifactors (VFs) are constructed [7,14,19,20]. The difference between PCs and VFs is that PCs are linear combinations of observable variables (in this case, water quality variables), while VFs include unobservable, hypothetical, and latent variables [3,16,18]. In this paper, the VFs affecting river water quality were identified from large datasets using PCA/FA to distinguish possible pollution sources of sampling sites in the study area.
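The PCA and rotation steps can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the SPSS routine used in the study: components are extracted from the correlation matrix, those with eigenvalues greater than 1 are retained, and a plain varimax rotation (without Kaiser normalization) is applied. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def pca_loadings(data):
    """Return loadings of the PCs with eigenvalue > 1, plus all eigenvalues
    of the correlation matrix sorted in descending order."""
    corr = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1.0
    loadings = eigvecs[:, keep] * np.sqrt(eigvals[keep])
    return loadings, eigvals

def varimax(loadings, max_iter=100, tol=1e-6):
    """Orthogonally rotate a loading matrix to maximize the varimax criterion."""
    p, k = loadings.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        col_ss = np.diag((rotated ** 2).sum(axis=0))
        u, s, vt = np.linalg.svd(loadings.T @ (rotated ** 3 - rotated @ col_ss / p))
        rotation = u @ vt
        new_criterion = s.sum()
        if new_criterion < criterion * (1.0 + tol):
            break  # no meaningful improvement
        criterion = new_criterion
    return loadings @ rotation

# Synthetic data: two hypothetical latent sources driving six variables
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
data = latent @ mixing + 0.3 * rng.normal(size=(200, 6))

loadings, eigvals = pca_loadings(data)
rotated = varimax(loadings)
explained = eigvals[eigvals > 1.0].sum() / eigvals.sum()
```

Because the rotation is orthogonal, the communality of each variable and the total variance explained by the retained components are unchanged; only the distribution of the loadings across the rotated factors differs.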

Results and Discussions
Spatial similarity and site grouping
Spatial CA was applied to detect similar groups among the sampling sites, and the results are presented as a dendrogram (Figure 2). The nine sampling sites were grouped into three statistically related clusters at (Dlink/Dmax) × 100 < 20. The results indicated that group A includes Sites 1, 2, and 3, located in the upstream region of the Mangshe River; group B comprises Sites 4, 5, 6, 7, and 8, situated on the Tongyu River and its tributary; and group C consists of Site 9, also located on the Tongyu River. The three groups corresponded to relatively low pollution sites (Sites 1, 2, and 3), moderate pollution sites (Sites 4, 5, 6, 7, and 8), and a relatively high pollution site (Site 9), respectively. The classification was considered statistically significant because sites within the same group had similar natural and anthropogenic backgrounds. In group A, Sites 1, 2, and 3 received pollution from domestic and industrial wastewater discharged into the Mangshe River. In group B, Sites 4, 5, 6, 7, and 8 were situated in the middle reaches of the Tongyu River and received pollution from upstream sources, including domestic drainage and industrial wastewater. Finally, Site 9 (group C) received industrial pollution, domestic wastewater, and slaughter wastewater that drained into Wuyou Port, where concentrations of some water quality parameters were high, including CODCr (33.9 mg/L) and F. coli (1155 CFU/100 mL), while others, such as DO (4.3 mg/L), were very low. The results indicate that hierarchical CA is a reliable tool for classifying surface water, making it possible to design a monitoring strategy that optimizes the number of sampling sites and reduces the related monitoring costs. For example, in the present study, the monitoring network could be reduced to one (or more) sampling site from each of groups A, B, and C to perform rapid assessments of water quality.

Data structure analysis
Correlation analysis: A correlation analysis covering all sampling stations was performed, and the results are shown in Table 2. The results indicated that T correlates with SO₄²⁻, as also reported in other literature [21]; pH correlates with NH₄⁺, since ammonia is pH-dependent [22,23]; and TP has a negative relationship with TN, which was attributed to the rivers studied receiving the same pollution sources [24].
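A pairwise correlation table of this kind can be computed as in the following minimal sketch, which assumes Pearson correlation coefficients (the coefficient type is an assumption here). The helper `correlation_pairs`, the placeholder parameter names, and the values are hypothetical and do not reproduce Table 2.

```python
import numpy as np

def correlation_pairs(data, names):
    """Return the Pearson correlation coefficient for every pair of columns."""
    corr = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    pairs = {}
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            pairs[(names[i], names[j])] = float(corr[i, j])
    return pairs

# Hypothetical measurements: X2 rises with X1, X3 falls as X1 rises
values = np.array([
    [1.0, 2.0, 9.0],
    [2.0, 4.1, 7.0],
    [3.0, 5.9, 5.2],
    [4.0, 8.2, 2.9],
])
pairs = correlation_pairs(values, ["X1", "X2", "X3"])
```

Coefficients near +1 or -1 flag parameter pairs that vary together or in opposition, which is the kind of relationship the paragraph above reads from the correlation table.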

Box plots of water quality parameters: Box plots of the individual water quality parameters, showing the spatial variations among the three clusters from CA, are presented in Figure 3. For a given parameter, the water quality data of the sites in the same cluster were combined. The median concentration is shown by the line across the box. The first and third quartile values are shown at the bottom and top of the box. The lowest and highest observations are indicated by the vertical lines extending from the bottom and top of the box.
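The quantities drawn in each box plot can be computed per cluster as sketched below. The function `box_stats`, the group labels, and the concentrations are hypothetical; quartiles use NumPy's default linear interpolation, which may differ slightly from the convention of the plotting software used for Figure 3.

```python
import numpy as np

def box_stats(values):
    """Five-number summary used to draw one box: lowest observation,
    first quartile, median, third quartile, highest observation."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return {
        "min": float(values.min()),
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "max": float(values.max()),
    }

# Hypothetical concentrations pooled by cluster (not the study's data)
by_group = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [2.0, 4.0, 6.0, 8.0, 10.0],
}
summary = {group: box_stats(vals) for group, vals in by_group.items()}
```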
From Figure 3a and 3b, the group C box plots of CODCr, SO₄²⁻, Cl⁻, NO₂⁻, and F. coli were the highest, while those of DO and TN were the lowest. The reason is that group C corresponds to Site 9, located at Wuyou Bridge on the Tongyu River, which receives large quantities of industrial wastewater, domestic sewage, and wastewater from a pig slaughterhouse. The box plot of DO concentration in Figure 3c showed a decreasing trend in the order group A > group B > group C, which supports the results obtained from CA. In the box plot of CODCr concentration shown in Figure 3d, group C was the highest and group B was the lowest, probably due to self-purification of the Tongyu River. The box plots of pH and temperature showed only minor differences among the CA groups. In addition, the box plots of turbidity, alkalinity, SO₄²⁻, F⁻, and NH₄⁺ also showed small differences among groups.
PCA/FA analysis: PCA is an effective pattern recognition technique used to interpret the variance of a large dataset of inter-correlated variables with a smaller set of independent components. Kaiser-Meyer-Olkin (KMO) and Bartlett's sphericity tests were performed on the parameter correlation matrix to examine the validity of the PCA. The KMO value was 0.586 and the Bartlett's sphericity statistic was 816.104, with a significance level of 0, indicating that PCA was useful for data reduction and that significant relationships were present among the variables. PCA was applied to the standardized dataset to identify the latent factors. The aim of this analysis was primarily to create an entirely new, smaller set of factors compared to the original dataset. The PCA revealed six PCs with eigenvalues > 1 that explained about 75.39% of the total variance (Figure 4). FA was then performed to reduce the contribution of less important variables and simplify the data structure resulting from the PCA. A varimax rotation of the PCs yielded six VFs with eigenvalues > 1 that explained about 75.39% of the total variance (Table 3). As shown by the factor-loading matrix, the first VF (VF1), which explained 14.09% of the total variance, had a strong correlation with CODCr and a moderate correlation with SO₄²⁻, nitrate, and TDS. Therefore, VF1 represented organic pollution from industrial point sources. The second VF (VF2), which explained 13.69% of the total variance, was correlated heavily with turbidity, E. coli, and F. coli, representing fecal pollution from domestic point sources. The third VF (VF3), which explained 13.56% of the total variance, had positive loadings on DO and BOD₅ and a negative loading on T, representing biochemical processes in the river and illustrating the fact that BOD₅ is degraded by the consumption of DO. The fourth VF (VF4) had significant loadings on TP, alkalinity, and, to a lesser degree, TN, representing nutrient nonpoint sources from rainfall, agricultural runoff, atmospheric deposition, and livestock breeding. The fifth VF (VF5) had significant loadings on pH and NH₄-N due to pollution from untreated nonpoint domestic discharge. The sixth VF (VF6) had positive loadings on fluoride and chloride, suggestive of the effects of natural factors such as soil leaching and weathering.
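The two suitability checks mentioned above can be sketched with their textbook formulas. This is an illustration, not the SPSS output of the study: Bartlett's statistic is computed as -(n - 1 - (2p + 5)/6) ln|R| with p(p - 1)/2 degrees of freedom, and the overall KMO measure compares squared correlations with squared partial correlations. The function names and the synthetic data are hypothetical.

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(data):
    """Bartlett's test that the correlation matrix is an identity matrix."""
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    _, logdet = np.linalg.slogdet(corr)
    statistic = -(n - 1 - (2 * p + 5) / 6.0) * logdet
    dof = p * (p - 1) / 2
    return statistic, chi2.sf(statistic, dof)

def kmo(data):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    corr = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    inv = np.linalg.inv(corr)
    # Partial correlations obtained from the inverse correlation matrix
    partial = -inv / np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    off_diag = ~np.eye(corr.shape[0], dtype=bool)
    r2 = (corr[off_diag] ** 2).sum()
    a2 = (partial[off_diag] ** 2).sum()
    return r2 / (r2 + a2)

# Synthetic data: four variables sharing one hypothetical common source
rng = np.random.default_rng(1)
common = rng.normal(size=(200, 1))
data = common + 0.5 * rng.normal(size=(200, 4))

statistic, p_value = bartlett_sphericity(data)
kmo_value = kmo(data)
```

A small p-value from Bartlett's test indicates that the variables are correlated enough for PCA to be meaningful, while KMO values closer to 1 indicate that the correlations are more compact and better suited to factoring.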

Source apportionment
The main pollution sources of the rivers were urban, agricultural, industrial, and domestic wastewater. The scores of the six VFs for each sampling site are plotted in Figure 5 to show differences in the pollution sources at the sampling sites. Higher VF scores were related to factors with greater influences on a sampling site. The results indicated that pollution sources differed greatly among the sampling sites. First, Sites 1, 2, and 3 (group A) had higher VF3 scores and lower VF1, VF2, VF4, VF5, and VF6 scores (except for VF6 at Site 3), indicating that they were polluted mainly by organic pollutants and were not markedly affected by nonpoint sources of domestic and agricultural wastewater. Site 3 had a higher VF6 score, which illustrates that Dazong Lake receives rainfall runoff containing fluoride and chloride, as well as seepage of surrounding groundwater, while its VF2, VF4, and VF5 scores were low, which indicates that Dazong Lake was almost unaffected by domestic and fecal pollution. Second, Sites 4, 5, 6, 7, and 8 (group B) had higher VF1 and VF5 scores, indicating that they were mainly affected by organic industrial pollution and untreated domestic discharge. Among these sites, Sites 4 and 5 had similar characteristics, with higher VF2 scores and moderate VF3 scores, which indicates that they also received fecal pollution from a nearby pig slaughterhouse. Sites 6, 7, and 8 had higher VF4 and VF6 scores, revealing that they were polluted by agricultural drainage, livestock breeding, and nearby rainfall. Although the five sites were grouped into one cluster, their pollution sources differed greatly. Finally, Site 9 (group C) was significantly affected by VF5 and moderately by VF2, illustrating that it was severely polluted by domestic drainage and also received industrial wastewater. From the above discussion, PCA/FA proved to be a reliable tool for distinguishing sources of pollution among sampling sites.
This technique could be used to inform policies of pollution source control. In addition, it could be used to strengthen government initiatives to improve the water quality of drinking water sources.

Conclusions
In this paper, multivariate statistical techniques were applied to analyze surface water quality data from nine sampling sites in Yancheng, China. The spatial variations of surface water quality were classified, and the pollution sources of the sampling sites were identified. The results obtained by hierarchical CA indicated that the nine sampling sites fell into three groups: group A (relatively low pollution sites) contained Sites 1-3, group B (moderate pollution sites) included Sites 4-8, and group C (relatively high pollution) included Site 9 alone. Through PCA/FA, six latent factors were obtained, which explained 75.39% of the total variance and represented organic pollution, fecal pollution, biochemical reactions, nutrients, domestic sewage, and natural factors, respectively. In addition, the pollution sources of the different sampling sites were analyzed according to the scores of the six VFs. The results illustrated that Sites 1 and 2 were not greatly affected by nonpoint source pollution; Site 3 (Dazong Lake) was influenced by surrounding rainfall runoff and groundwater seepage; Sites 4 and 5 were affected by fecal pollution; Sites 6-8 were polluted by point and nonpoint sources from industrial activity, agricultural runoff, and domestic drainage; and Site 9 was severely polluted by untreated domestic discharge from nearby residents. Based on these results, the sewage systems near Site 9 should be modified and improved by local managers as quickly as possible. The results show that multivariate statistical techniques are useful for analyzing and interpreting complex water quality datasets, as well as for identifying pollution sources so that governments can make effective policies. In addition, these methods can be applied by river managers to support scientific strategies to improve drinking water quality.