Evaluating Surface Water Nitrogen Pollution via Visual Clustering in Megacity Chengdu

Abstract: The current standards used for nitrogen pollution evaluation are lacking, and scientific classification methods are needed for nitrogen pollution to improve water quality management capabilities. This study addresses the important issue of assessing surface water nitrogen pollution by utilizing two advanced multivariate statistical techniques: self-organizing maps (SOMs) obtained using the K-means algorithm and the Hasse diagram technique (HDT). The research targets of this study are the rivers of the megacity Chengdu, China. Samples were collected on a monthly basis in 2017–2020 from different sites along the rivers, and their nitrogen pollution parameters were determined. The grouping of nitrogen pollution parameters and the clustering of sampling events using SOMs facilitate the preprocessing required for the HDT, wherein clusters are ordered according to the pre-clustered water sampling events. The results indicate that nitrogen pollution in the Chengdu River Basin, which is prominent and mainly driven by nitrate nitrogen, can be categorized into five levels. The nitrogen pollution in Tuo River is serious. Although the degree of ammonia nitrogen pollution in Jin River is higher, the pollution range is smaller. Furthermore, these results were evaluated by the SOMs and HDT to be clear and reliable. Overall, these findings can provide a basis for local environmental legislation.


Introduction
Theoretical and experimental advances in the assessment of water environment quality, a well-known indicator of the degree of pollution, are vital for protecting the aquatic ecological environment [1,2]. Early evaluations of water environment pollution relied mainly on qualitative descriptions of water. Over the years, an extensive understanding of the physical, chemical, and biological effects of the water environment has been obtained using several water quality evaluation methods, such as index evaluation [3,4], fuzzy mathematics theory [5], grey system theory [6], multivariate statistical analyses [7][8][9][10], and artificial neural networks [11][12][13]. Owing to the rising pressure on water quality management objectives, there is an urgent need to analyze data and extract important information; however, this has become difficult as the volume of historical monitoring data and automatic station data has grown. Accordingly, scientific and efficient water pollution assessment methods are needed, and the research and application of artificial neural networks and the Hasse diagram technique (HDT) have become a development trend.
Self-organizing maps (SOMs) were first proposed by the Finnish scholar Kohonen in 1982 [14]. As a nonlinear method, SOMs have the advantages of autonomy and inclusiveness. However, because SOM clustering results do not allow the individual clusters to be compared and ranked, their practical applicability for environmental management is limited. HDT, named after the German mathematician Helmut Hasse, is a method based on partially ordered set theory that retains the important elements in the evaluation and decision-making processes [15,16]. This method requires only the weight order of the evaluation indexes, thus circumventing the numerical weighting needed in other water quality evaluation methods. However, HDT is highly intolerant to 'noise' and therefore places high demands on data preprocessing. Although SOMs and HDT have been used together for river pollution assessments, the information obtained has been insufficient. Li et al. [17] used the two methods only independently to evaluate water pollution, and limited information could be interpreted from the complex Hasse diagrams. Meanwhile, Voyslavov et al. [18,19] and Liu et al. [20] used SOMs only for parameter grouping, and the equivalence class division of samples still relied on local surface water quality standards.
According to most global standards, only the total nitrogen (TN) concentration of rivers is limited; these standards lack concentration requirements for the other nitrogen forms. According to the surface water quality standard in China (GB3838-2002), river water is evaluated only using NH3-N, whereas lakes and reservoirs are evaluated using TN and NH3-N. Although the mass concentration of NO3−-N is limited in drinking water (10 mg/L in China), it exhibits a wide range. Traditional analytical methods offer a more qualitative description, which is insufficient for evaluating nitrogen pollution in rivers.
In the absence of standards, this study used the SOM and HDT techniques to explore the characteristics of regional nitrogen pollution and to classify the river water pollution in Chengdu. No river water quality standard was used as a reference except for the NH3-N concentration. Instead, the SOM was used to divide both the parameters and the samples into equivalence classes, thereby eliminating the need for manual classification and completing the 'noise reduction' processing of the data. A concise and clear Hasse diagram was then obtained, and the nitrogen pollution of the samples was ranked. Based on the results of the two combined methods, the spatial and temporal distribution patterns of the elements of the large data set were determined. Overall, the advantages of both SOMs and HDT were exploited, while their shortcomings were addressed.
The study aims to offer chemometric expertise for comprehensively evaluating the nitrogen pollution in the river waters of Chengdu and provide a basis for local environmental legislation.

Study Area
The Yangtze River is China's 'mother river', and the Yangtze River Economic Belt is a major engine of China's development [21]. Chengdu is the nearest megacity to the Yangtze River Basin, and its water quality directly affects the economic development and water safety in the lower reaches of the Yangtze River. It is located between 30°05′ N and 102°54′ E, has a population of 20.9 million, and covers an area of 14,335 km². Furthermore, it lies within the subtropical humid monsoon climate zone, with an annual rainfall of 800–1400 mm and an average annual temperature of 15.2–16.6 °C [22]. Land use in Chengdu has three characteristics: first, land types are diverse; second, plains account for 40.1% of the city area; third, the land reclamation index (38.2%) is higher than the national average (10.1%). The area of construction land in the Jin River Basin is 483.56 km², which is higher than that in the Jinma River Basin and the Tuo River Basin. The areas of agricultural land in the Jinma River Basin and the Tuo River Basin are 3333.25 km² and 4749.67 km², respectively, which are significantly higher than that in the Jin River Basin.
Chengdu straddles two water systems: the Min River and the Tuo River. The Min River, which was once considered the Yangtze River's main tributary, is divided into the Jinma River Basin and the Jin River Basin at the Dujiangyan Fish Mouth (part of a famous ancient water project). Since ancient times, the Fish Mouth has provided a steady flow of water to the Jin River throughout the year, thus facilitating agricultural irrigation and preventing floods. Excess water flows toward the Jinma River, which is mainly used for flood discharge. Although the Tuo River has its own water system, it actually draws water from the Min River. Notably, the Jinma, Jin, and Tuo River Basins account for 44.43%, 15.94%, and 39.63% of the total watershed area, respectively [22].

Sample Collection
This study used 75 sampling points (Figure 1) in the Chengdu River Basin, and 891 annual average values were collected between 2017 and 2020.

SOMs
SOMs are a neural network model used for exploring and visualizing high-dimensional environmental data sets. In this study, K-means clustering of the SOM nodes was used, and the final number of clusters was generated automatically according to the minimum Davies–Bouldin index (DBI) [19,20]. The method provides information on the distribution of each variable in the data by outputting variable planes. Furthermore, the SOM analysis outputs the unified distance matrix (U-matrix), which displays the distances between nodes, reveals the clustering structure of the map, and yields the classification results of all nodes. The U-matrix differs from a variable plane in that it includes the information of all variables of the samples. The SOM clustering analysis was conducted using the SOM Toolbox 2.0 in MATLAB 2021b.
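The selection of the number of clusters by the minimum DBI can be illustrated with a minimal sketch. It computes the Davies–Bouldin index for candidate partitions and keeps the partition with the smallest value. The function name, the toy prototype vectors, and the candidate labelings are hypothetical and chosen only for illustration; in practice the labelings would come from running K-means for each candidate number of clusters, and the SOM Toolbox implementation used in the study may differ in detail.

```python
import math

def davies_bouldin(points, labels):
    """Davies-Bouldin index of a partition (lower means better separated)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    # Centroid of each cluster.
    centroids = {lab: [sum(c) / len(ps) for c in zip(*ps)]
                 for lab, ps in clusters.items()}
    # Scatter: mean Euclidean distance of the members to their centroid.
    scatter = {lab: sum(math.dist(p, centroids[lab]) for p in ps) / len(ps)
               for lab, ps in clusters.items()}
    labs = list(clusters)
    total = 0.0
    for i in labs:
        # Worst-case similarity of cluster i to any other cluster.
        total += max((scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
                     for j in labs if j != i)
    return total / len(labs)

# Hypothetical 2-D prototype vectors (stand-ins for SOM node weights).
prototypes = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
# Hypothetical candidate partitions for k = 2 and k = 3.
candidates = {
    2: [0, 0, 0, 0, 1, 1],
    3: [0, 0, 1, 1, 2, 2],
}
scores = {k: davies_bouldin(prototypes, labs) for k, labs in candidates.items()}
best_k = min(scores, key=scores.get)  # number of clusters with the minimum DBI
print(best_k, scores)
```

The sketch requires at least two clusters with distinct centroids; it does not perform the K-means step itself.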

HDT
HDT is a graphical technique for representing finite partially ordered sets (posets). According to the research results of Voyslavov et al. [18,19] and the user manual of Decision Analysis by Ranking Techniques (DART) [24], the steps required for HDT clustering are briefly explained as follows.
First, the weight order of the index parameters is determined. The entropy weight is calculated as follows [25]: (1) For $n$ samples and $m$ indicators, $X_{ij}$ is the value of the $i$th sample corresponding to the $j$th index.
(4) Calculate the entropy weight $w_j$ of the $j$th indicator. Thus, $W = (w_1, w_2, w_3, \ldots, w_m)$ is obtained, with $\sum_{j=1}^{m} w_j = 1$.
Second, the Hasse matrix is obtained using HDT. The objects in the set $E$, which contains the sampling data of the research period, are ranked based on variables such as the selected water quality parameters; this set of variables is called the Information Basis (IB). The processed data matrix $Q(N \times R)$ contains $N$ objects and $R$ variables. $y_r(x)$ represents the numerical value of the $r$th variable for object $x$, and the $y_r$ are the variables according to which the objects are ranked. Two objects $s$ and $t$ are comparable if either $y_r(s) \le y_r(t)$ for all $y_r \in \mathrm{IB}$ (then $s \le t$) or $y_r(s) \ge y_r(t)$ for all $y_r \in \mathrm{IB}$ (then $s \ge t$). If even one variable is ordered in the opposite direction to the others, the objects $s$ and $t$ cannot be compared. The Hasse matrix, from which the partially ordered set and the relations between objects can easily be derived, is constructed from these pairwise comparisons.
Finally, the Hasse diagram is drawn according to the Hasse matrix. If $s \le t$ and there is no object $a$ in $E$ for which $s \le a \le t$ ($a \ne s \wedge a \ne t$), $s$ is covered by $t$. The order relations in the Hasse matrix are represented in the Hasse diagram, which is constructed as follows:
a. Each object or equivalence class is represented by a circle with an identifier. Equivalent elements are different objects that have the same values for all variables in IB.
b. If there is a coverage relationship, the corresponding objects are connected by lines and the representative elements can be compared.
c. If $s \le t$, s is drawn either above or below t; all the relation lines follow the same direction convention.
d. If $s \le t$ and $t \le z$, then $s \le z$ by transitivity; no line is drawn between s and z, because this relation can be read from the lines connecting s to t and t to z.
e. If neither $s \le t$ nor $t \le s$ holds, s and t are not comparable and are not connected by a line.
Elements that are not covered by any other object are termed 'maximal elements', and elements that do not cover any other object are termed 'minimal elements'. Meanwhile, a 'chain' is a set of mutually comparable objects, and an 'anti-chain' is a set of mutually incomparable objects, such as the objects at the same level; the height of the diagram corresponds to the longest chain, and its width corresponds to the largest anti-chain.
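The order relation, the cover relation, and the maximal and minimal elements described above can be sketched in a few lines. This is a minimal illustration and not the DART implementation; the function names and the four toy objects with three variables each are hypothetical, and the sketch assumes that equivalent objects (identical values for all variables) have already been merged into equivalence classes.

```python
def leq(a, b):
    """Product order: a <= b when every variable of a is <= that of b."""
    return all(x <= y for x, y in zip(a, b))

def hasse(objects):
    """Return (cover relations, maximal elements, minimal elements).

    `objects` maps a name to a tuple of variable values; equal tuples are
    assumed to have been merged into one equivalence class beforehand.
    """
    names = list(objects)
    # Strict order s < t for every comparable pair.
    below = {(s, t) for s in names for t in names
             if s != t and objects[s] != objects[t] and leq(objects[s], objects[t])}
    # t covers s when s < t and no third object a satisfies s < a < t.
    covers = {(s, t) for (s, t) in below
              if not any((s, a) in below and (a, t) in below for a in names)}
    maximal = sorted(n for n in names if not any((n, t) in below for t in names))
    minimal = sorted(n for n in names if not any((s, n) in below for s in names))
    return covers, maximal, minimal

# Hypothetical objects with three variables each (not data from the study).
toy = {"A": (1, 1, 1), "B": (2, 1, 2), "C": (1, 3, 1), "D": (3, 3, 3)}
covers, maximal, minimal = hasse(toy)
print(sorted(covers), maximal, minimal)
```

In this toy example, B and C are incomparable because one variable is ordered in the opposite direction to the others, and no line is drawn from A to D because that relation follows by transitivity.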
Since HDT is not tolerant to 'noise', preprocessing steps are extremely important. In this study, SOMs were used to preprocess the data, and HDT was implemented using the DART software [26].

Determining the SOM Clustering Structure
In this study, the multi-year averages of 75 monitoring sections for 12 months (a total of 891 samples) were used as the data set. According to the minimum number of nodes in the competitive layer (5 × INT(√N)), the number of neurons in the SOM was set to 150, and statistical calculations were performed according to the data analysis method in Section 2.3.1. Figure 2a shows the U-matrix of the input data set and visualizes all the parameters. The U-matrix reflects the distances between neurons and thus reveals the clustering structure of the SOM. The attribute value of each index parameter for each neuron is expressed by color depth. Neurons with higher TN and NH3-N values were located in the upper and middle parts of the SOM, and neurons with higher NO3−-N, NO2−-N, and DON values were located in the lower right part of the SOM. Figure 2 shows that some neurons were not only polluted by NH3-N, but also by NO

Evaluation Index Selection
The plane ordering of the water quality parameters is shown in Figure 2b, which depicts the position, distance, and color of each parameter on the map. Three distinct groups can be observed: the first group includes NH3-N, the second comprises TN, and the third contains NO3−-N, NO2−-N, and DON. The variable planes of the parameters in the third group are highly consistent, indicating a significant correlation between them. NO3−-N is the main form of nitrogen in river water and is more representative than NO2−-N and DON; thus, NO3−-N was chosen to represent this group. The TN, NH3-N, and NO3−-N parameters exhibit distinct distributions and therefore provide different information about the objects in the data set. Accordingly, TN, NH3-N, and NO3−-N were selected as the evaluation indexes for the HDT-based assessment of nitrogen pollution in water.

SOM Clustering Results
In this study, 891 objects were distributed in 142 neurons, and 8 neurons were not filled with objects (Figure 3d). Finally, the data samples were divided into 8 clustering categories (Figure 3a), denoted as Ci (i = 1, 2, ..., 8). Different cluster categories in Figure 3b correspond to distinct color partitions, with the corresponding number representing the cluster category (i). Figure 3c indicates the corresponding neurons in the different clustering categories. Neurons numbered 1 to 150 are filled in order from left to right and from top to bottom. Figure 3d shows the number of samples contained in each neuron. For example, C1 contains 11 neurons (119, 120, 13, 133, 134, 135, 146, 147, 148, 149, and 150) and a total of 78 samples.

Determining the Data Set Equivalence Class and Evaluation Index Weight Ranking
To reduce irrelevant differences between objects, each filled SOM node was first treated as an equivalence class, so that the 891 objects are represented by 142 neurons. These neurons were then divided into 8 categories according to the water quality characteristics of the nodes, and these categories were used as the final equivalence classes for the HDT clustering analysis. When dividing the data set into equivalence classes, the weight ranking of the evaluation indicators must be considered. According to the selection results of the evaluation indicators in Section 3.1.2 and the methods described in Section 2.3.2, the weights of the evaluation indicators were calculated (Table 1).
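As an illustration of how indicator weights of this kind can be obtained, the following sketch implements one common textbook form of the entropy weight method: min-max normalization, proportions, information entropy, and normalized weights. The exact normalization used in the study is not recoverable from the text, so the four steps, the function name, and the toy data matrix are assumptions made for illustration only.

```python
import math

def entropy_weights(X):
    """Entropy weights for a data matrix X (n samples as rows, m indicators as columns)."""
    n, m = len(X), len(X[0])
    diversification = []
    for j in range(m):
        col = [row[j] for row in X]
        lo, hi = min(col), max(col)
        # Step 1: min-max normalization (assumes the column is not constant).
        norm = [(v - lo) / (hi - lo) for v in col]
        # Step 2: share of each sample in the indicator total.
        total = sum(norm)
        p = [v / total for v in norm]
        # Step 3: information entropy, with 0 * ln(0) taken as 0.
        e = -sum(q * math.log(q) for q in p if q > 0) / math.log(n)
        diversification.append(1.0 - e)
    # Step 4: each weight is the normalized degree of diversification.
    s = sum(diversification)
    return [d / s for d in diversification]

# Hypothetical data: 4 samples, 2 indicators (not values from the study).
X = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [4.0, 6.0]]
weights = entropy_weights(X)
print(weights)
```

An indicator whose values are concentrated in a few samples has low entropy and receives a larger weight; only the resulting order of the weights is needed for HDT.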

HDT Clustering Ranking
The preprocessed data sets and evaluation indicators were input into the DART software, which output the Hasse diagram (Figure 4). The input objects are divided into five levels (clean, generally clean, lightly polluted, moderately polluted, and heavily polluted), with the maximal elements C1 and C8 and the minimal element C6. There is no connecting line between the adjacent elements C4 and C7 or between C3 and C1, which indicates that at least one evaluation index is ordered in the opposite direction for each pair. There are connecting lines between adjacent elements such as C7 and C1 as well as C2 and C3, indicating that the attribute values of all evaluation indexes increase together. The final sample clustering results are shown in Table 2, and the attribute values of the evaluation indexes for each cluster are shown in Table 3.
The order relations among the nitrogen properties (mass concentrations) can be analyzed by determining the relationships between different elements. Figure 4 shows the elements (Ci) in each level of the Hasse diagram. C8 and C3 represent heavy NH3-N pollution, while C1, C7, and C8 represent heavy NO3−-N pollution. The nitrogen attribute values of C6 and C5 were low. Nitrogen pollution gradually increased from Level 1 to Level 5; however, not every nitrogen attribute value of an element increased with a rise in level (Table 3). Specifically, Level 1 contains C6, whose nitrogen attribute values are low. Level 2 contains C5, whose nitrogen attribute values are higher than those at Level 1. Level 3 contains C2 and C4, whose nitrogen attribute values are higher than those at Level 2. Level 4 includes C3 and C7; C3 shows higher TN and NH3-N values, C7 has higher NO3−-N values, and C3 exhibits higher nitrogen attribute values than the elements at Level 3. However, C7 exceeds only C2 at Level 3, and its NH3-N value is lower than that of C4.
Level 5 contains C1 and C8, which exhibit high nitrogen attribute values. The NO3−-N pollution of C1 is dominant, while the NH3-N pollution of C8 is more prominent. C1 is ordered above only C7 at Level 4. The NH3-N attribute of C3 at Level 4 is higher than that of C1, while the nitrogen attribute values of C8 are higher than those of all elements at Level 1 to Level 4.
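The assignment of clusters to levels can be illustrated with a short sketch in which the level of an element is one more than the highest level among the elements it covers. The list of cover relations below is an interpretation of the description in this section and not a transcription of Figure 4; in particular, placing C5 below both C2 and C4 is inferred from the statement that Level 3 exceeds Level 2. The function and variable names are hypothetical.

```python
from functools import lru_cache

# Cover relations (lower element, upper element), as interpreted from the text.
COVERS = [
    ("C6", "C5"),
    ("C5", "C2"), ("C5", "C4"),   # inferred: Level 3 exceeds Level 2
    ("C2", "C3"), ("C4", "C3"),
    ("C2", "C7"),
    ("C7", "C1"),
    ("C3", "C8"), ("C7", "C8"),
]

def levels(covers):
    """Level of each element = number of elements in the longest chain ending at it."""
    lower = {}
    for lo, up in covers:
        lower.setdefault(up, []).append(lo)
        lower.setdefault(lo, [])

    @lru_cache(maxsize=None)
    def level(x):
        # Minimal elements have no lower neighbours and sit at level 1.
        return 1 + max((level(y) for y in lower[x]), default=0)

    return {x: level(x) for x in lower}

result = levels(COVERS)
for name in sorted(result, key=lambda k: (result[k], k)):
    print(name, result[name])
```

Under these assumed relations, the sketch places C6 at Level 1, C5 at Level 2, C2 and C4 at Level 3, C3 and C7 at Level 4, and C1 and C8 at Level 5, which agrees with the levels described above.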

Comprehensive Evaluation of Nitrogen Pollution
The nitrogen pollution of rivers in Chengdu, which is mainly driven by nitrate nitrogen, is concentrated in the middle and lower reaches. Figure 5 shows the number and proportion of samples in the hierarchical clustering results for the high and low water periods as well as for the upper, middle, and lower reaches. Nitrogen pollution varied more markedly among the upper, middle, and lower reaches than between the high and low water periods. Samples that were moderately and heavily polluted accounted for 30.1% of the total samples, indicating that nitrogen pollution is still prominent. With increasing pollution level, the proportion of dry season samples increased to 57.0%, the proportion of upstream samples decreased significantly, and the proportion of downstream samples increased significantly. The upstream samples that were moderately and heavily polluted accounted for only 14.9% of the total samples, whereas the proportion of downstream samples was 85.1%. Samples subjected to NH3-N pollution were dominant in the middle reaches, and there were no upstream samples. Meanwhile, there were more than twice as many dry season samples as wet season samples. For NO3−-N pollution, the proportion of downstream samples was approximately 50%, and the numbers of C1 samples in the wet and dry seasons were similar. C7 had more samples in the wet season than in the dry season, whereas the opposite was observed for C8 because of its significant NH3-N pollution. Samples subjected to low-level nitrogen pollution were mainly observed in the middle and upper reaches, and the numbers of samples in the wet and dry seasons were equivalent. Overall, the nitrogen attribute values of most samples were low, and the proportion of samples affected by NO3−-N (25.4%) was much greater than that affected by NH3-N (9.2%).
The nitrogen pollution characteristics of the three basins differed. The degree of nitrogen pollution in the Tuo River Basin was greater than that in the other two basins: the proportion of clean samples there was only 1.0%, and that of heavily polluted samples was 32.0%. Meanwhile, the proportions of clean samples in the Jinma and Jin River Basins were 51.0% and 40.3%, respectively. The proportions of samples affected by NO3−-N and NH3-N were 47.4% and 14.5% in the Tuo River Basin, 16.0% and 8.0% in the Jinma River Basin, and 20.0% and 6.7% in the Jin River Basin, respectively. The range of NH3-N pollution in the Jin River Basin was small, with all the affected samples located in the middle reaches, but the degree of pollution was high (3.11 ± 1.50 mg/L).

Advantages and Disadvantages of SOMs and HDT Technology
Studies have shown that the spatial and temporal distributions of various nitrogen forms in the region are complex, and the conclusions drawn by traditional single evaluation methods are often not accurate enough. Using the self-organized information provided by SOMs, numerous samples can be preliminarily clustered. Although the results provide a qualitative evaluation of water quality, a definite ranking of pollution levels cannot be obtained. HDT, in turn, can elucidate ranking relationships between clusters, is not restricted by national water quality standards, and can be used for water quality evaluation under any standard. The preprocessing of data by SOMs addresses, to some extent, the problem of HDT being intolerant to 'noise'. Thus, the nitrogen pollution evaluation conducted using SOMs and HDT is clear and reliable.
Previous studies have concluded that by utilizing both SOMs and HDT, surface water pollution can be evaluated visually. Tsakovski et al. [12], Liu et al. [20], and Voyslavov et al. [18,19] used the two techniques in combination to analyze the temporal and spatial characteristics of surface water pollution in Struma River, Mudan River, and Maritsa River, respectively. However, all these studies relied on local surface water standards for manual grading. In contrast, the present study employed SOMs and HDT to perform a visual nitrogen pollution evaluation without utilizing any water standards. The results elucidated the spatial and temporal characteristics of nitrogen pollution in rivers, while providing another method for formulating water quality standards to better serve local water environment management.
Although the proposed method clearly exhibits advantages for evaluating surface water monitoring results, this study judges its reliability based only on the consistency of the results. Since it utilizes only the spatial and temporal analysis results of nitrogen forms, substantive evidence is lacking. The water quality evaluation parameters include only nitrogen-related indicators; although there is a significant correlation between these parameters, they also deviate to some extent in what they characterize. Furthermore, when using the DART software for HDT analysis, it is still necessary to set the equivalence classes of samples manually, which is not ideal.

Conclusions
Nitrogen pollution in the rivers of Chengdu, which can be divided into five levels, is prominent and mainly driven by nitrate nitrogen. Controlling nitrate nitrogen pollution is therefore key to further improving the water environment quality. Nitrogen pollution is more prominent in the Tuo River Basin than in the other two basins. Meanwhile, the range of ammonia nitrogen pollution in the Jin River Basin is small, but the degree of pollution is high. The evaluation results obtained using SOMs and HDT are consistent with the actual situation, and thus the method can be used for evaluating nitrogen pollution in other rivers.
Furthermore, the evaluation of nitrogen pollution in river waters based on SOM and HDT is not restricted by water quality standards. The proposed method can be used for visual clustering and sorting, with the output results being clear and reliable. In the future, the credibility of this method can be improved and the software application development can be optimized to reduce manual operation, which will help promote its practical applicability for environmental management.

Data Availability Statement:
The data presented in this study are available on request from the corresponding authors.