Prediction of hydrological and water quality data based on granular-ball rough set and k-nearest neighbor analysis

  • Limei Dong,

    Roles Conceptualization, Data curation, Resources

    Affiliation Upper Changjiang River Bureau of Hydrological and Water Resources Survey, Chongqing, China

  • Xinyu Zuo,

    Roles Conceptualization, Writing – original draft

    Affiliation Upper Changjiang River Bureau of Hydrological and Water Resources Survey, Chongqing, China

  • Yiping Xiong

    Roles Software, Writing – review & editing

    870255728@qq.com

    Affiliation College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing, China

Abstract

Hydrological and water quality datasets usually encompass a large number of characteristic variables, but not all of these significantly influence analytical outcomes. Selecting feature variables with rich information content and removing redundant features can therefore both improve analysis efficiency and reduce model complexity. This paper introduces the granular-ball rough set algorithm for feature variable selection and combines it with the k-nearest neighbor method and the back propagation network to analyze hydrological and water quality data, thus promoting overall and fused inspection. The results of hydrological water quality data analysis show that the proposed method produces better results than a standalone k-nearest neighbor regressor.

1. Introduction

Algal blooms are a natural ecological phenomenon in water bodies and are caused solely by algae [1]. A water bloom (or harmful algal bloom) occurs when large numbers of algae grow, reproduce or accumulate to a certain concentration in a water body that is rich in nutrients and has environmental conditions, such as temperature, light, climate and hydrology, that favor the growth or aggregation of algae [2]. Algae are a common biological component of water, but their excessive growth causes problems. Excessive nutrients, especially phosphorus and nitrogen, are the main causes of algal blooms [3]. In freshwater bodies, the main dominant algae species causing blooms are Aphanizomenon, Microalgae, Anabaena, Empty Chlorella or Cryptococcus. Blooms formed by different algae species differ in color; the main colors are green, red and brown. The species and quantity distribution of algae have obvious regional and seasonal characteristics [4, 5], because the composition of phytoplankton in water is closely related to the physical and chemical properties of the water, and different algae adapt to different properties. For example, warm-loving species prefer water with a higher temperature, and acid-resistant species prefer water with lower pH values. Therefore, the algae species that bloom differ with the external conditions [6].

At present, the process and mechanism of freshwater blooms are not fully understood. Most experts believe that an algal bloom is a short-term massive reproduction and growth of algae, and divide its occurrence into an incubation period (the algae begin to reproduce rapidly), a peak period (the number of algae reaches its peak) and a dissipation period (the number of algae decreases significantly). Recently, experts from the Chinese Academy of Sciences have proposed that the “outbreak” of cyanobacterial blooms is only the apparent phenomenon caused by the floating and accumulation of algae over a certain period of time, that is, a change of spatial position rather than a great change of biomass [7]. However, the authors believe that the occurrence of water bloom is the result of the interaction of these two processes. Algal blooms have been reported in many lakes [8–13], reservoirs [14–19] and rivers [20, 21]. Therefore, water bloom is a major problem that has been or will be faced by most water bodies in China. When algal blooms occur, the concentration of algal toxins in water increases significantly. Three main chemical structures of algal toxins are detected in freshwater blooms: cyclic peptides, alkaloids and lipopolysaccharides [22]. Among the most important of these toxins are the microcystins, a group of toxic compounds that are widely studied because of their high toxicity and wide distribution. They are secondary metabolites produced by some strains or species of the cyanobacterial genera Microcystis, Anabaena, Oscillatoria and Nostoc. Freshwater blooms are not the only serious problem: algal blooms in coastal waters (often referred to as red tides) cannot be ignored either. Red tides have been reported in the main coastal waters of China, such as the Bohai Sea [23, 24], the Yellow Sea [25, 26], the East China Sea [27–29], the Taiwan Strait [30, 31], and the South China Sea [32–34].

The analysis, modeling and prediction of hydrology and water quality are of positive significance for the prediction and prevention of water bloom. In 1983, Okada and Aiba [35] proposed a model to simulate the vertical spatial distribution of Microcystis biomass in calm water. The model simulates the emergence and disappearance of Microcystis blooms on the surface of eutrophic water by determining the vertical and spatial distribution of photosynthesis rate and buoyancy. The model does not consider the competition between freshwater plankton and the food chain. In 1984, Reynolds [36] constructed a nomogram to explain the biomass changes of algal populations using algal population biomass data related to seasonal and lake nutritional status (in this study, the biomass data of 49 different algal populations in 12 temperate lakes were used to construct a nomogram). SK Jayaraman et al. (2015) [37] used a multi-level direct search optimization technique with steady-state convergence criteria to predict the growth of microalgae. However, this method studied a closed environment and was not suitable for hydrological and water quality analysis. The study of Pyo J C et al. (2021) [38] showed a pan-Arctic result obtained by combining a new submarine algae sea ice model (SIMBA) with a three-dimensional sea ice-ocean model. The model is evaluated based on data collected during an onboard activity in the mid-eastern Arctic in the summer of 2012. Algae blooms are caused by light and show latitude dependence. Snow and ice also play a key role in the growth of ice algae. These studies can be used for auxiliary analysis of freshwater algae research.

The inductive modeling method uses regression analysis and/or correlation analysis to extract knowledge and patterns from a large amount of historical data, constructs an overall model of the system, and uses that model to predict system behavior (with fewer components to explain system behavior). Sakamoto (1966) [39] and later Dillon & Rigler (1975) [40] constructed the first empirical data model, which used total phosphorus data to predict the amount of chlorophyll (chlorophyll-a) in lakes (chlorophyll quantity can be regarded as an estimate of total algae biomass). In 1984, Whitehead & Hornberger [41] applied the time series method to construct a model to predict the change of chlorophyll content in the Thames River. In 1994, Recknagel et al. [42] used the fuzzy logic method to model and predict the change of chlorophyll content in lakes and reservoirs. Many scholars have used regression analysis to study the relationship between environmental variables and red tides [43–48]. In addition, fuzzy logic, a mathematical method commonly used for modeling and prediction of complex systems [49, 50], has also been used for modeling and predicting algal blooms [50–52]. In recent years, machine learning-based methods have been used to predict and evaluate the content of various algae [51, 52]. Among them, k-nearest neighbor (kNN) is one of the most basic and simple machine learning algorithms [53, 54]. It has also been used for inter-annual remote sensing monitoring of eelgrass beds [55], algae toxicity identification [56] and water quality assessment [57, 58].

With the rise of neural network applications in the middle and late 1990s, some scholars began to apply neural networks to algae ecological modeling. Because the back propagation (BP) network algorithm is mature and easy to use, it is the most widely used in modeling and prediction research. Recknagel (1997) [59] used a three-layer feedforward neural network to construct a cascade-structured neural network model to model and predict the biomass of five algae species in Lake Kasumigaura, Japan. The results show that the inductive type of empirical model has good predictive ability for complex lake and marsh phenomena, but how to extract useful knowledge from the neural network weights to explain the occurrence mechanism of algal blooms is worthy of further research. In addition, Recknagel et al. [59] used the BP neural network to conduct predictive modeling research on a large amount of historical data from Lake Kasumigaura and Lake Biwa in Japan, Lake Tuusulanjaervi in Finland, and the Darling River in Australia. The prediction results show that the artificial neural network method can effectively predict complex nonlinear algal bloom phenomena in freshwater ecosystems under different environmental conditions. Although it basically reflects the peaks and troughs of algal population biomass, the error relative to the actual observations is relatively large. Wei et al. (2001) [60] also applied BP neural network modeling to algae ecological modeling in Lake Kasumigaura, Japan, and could predict the occurrence time and intensity changes of Microcystis blooms well. Maier and Dandy (1997) [61] and Maier et al. (1998) applied the BP network to the algae ecological modeling of rivers, using historical data on climate, hydrology, nutrient levels and other factors in the Murray River in Australia to predict changes in Anabaena biomass. The BP neural network is still the mainstream of current algae ecological modeling, but most methods still use all characteristic attributes of the data and lack correlation analysis between features.

Most existing methods for modeling and predicting algae-based biomass involve all attributes of the data in decision-making, and some redundant attributes may affect the analysis of results. The heuristic attribute reduction algorithm based on rough set theory can effectively reduce the time complexity of high-dimensional problems [62–64]. It can also detect algae more efficiently and accurately. The latest research shows that the granular-ball rough set (GBRS) [65], which is based on granular-ball computing [66], inherits the robustness and adaptability of granular-ball computing. This method can handle discrete and continuous data stably and efficiently, and its feature selection is reported to be even better than that of state-of-the-art methods. Therefore, we consider combining GBRS and regression methods to analyze hydrological and water quality data. The main contributions are as follows:

  • This paper uses GBRS to remove redundant features in water quality data and select feature variables with rich information content for data analysis. The Pearson correlation coefficient method is also used for feature correlation analysis and compared with the GBRS method.
  • The results after feature selection are input into kNN and BP neural networks for regression prediction.
  • The hydrology and water quality data analysis results show that combining the GBRS method with kNN to analyze hydrology and water quality data produces better results.

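The rest of this paper is organized as follows. Section 2 introduces the materials and background methods. Section 3 describes the proposed method. Experimental results and analysis are presented in Section 4. Section 5 discusses the findings and limitations, and Section 6 presents our conclusions and future work.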

2. Materials and methods

2.1 Granular-ball rough set

Human cognition possesses a cognitive mechanism known as “global priority” [67], which enables the processing of information input based on coarse-grained details, thus providing adaptive multi-granularity descriptive capabilities. Building upon this theory, Wang and Xia [66] proposed multi-granularity granular-ball computing, which uses granular-balls to cover sample points. This approach replaces individual sample points with granular-balls as inputs, significantly reducing the number of required training samples. Additionally, the coarse-grained nature of granular-balls ensures that they are less susceptible to the influence of fine-grained sample points, thereby enhancing the algorithm’s robustness. Each granular-ball GB = {xi, i = 1, 2, ⋯, N} can be expressed using only two features: its center c and radius r, applicable across any dimension. The expression is as follows [66]:

$$c = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad r = \frac{1}{N}\sum_{i=1}^{N} \lVert x_i - c \rVert \tag{1}$$

where N denotes the number of samples in the granular-ball. The size of the radius in Eq (1) indicates the granularity of the ball. Larger radii result in fewer granular-balls, indicating a coarser level of granularity. More efficient granular-ball computation contributes to improved algorithm robustness. Currently, this method has been successfully employed in the field of rough sets. While the Pawlak rough set utilizes equivalence classes for knowledge representation, it cannot handle continuous data. Conversely, neighborhood rough sets can address continuous data, but encounter the challenge of “heterogeneous transmission,” hindering knowledge representation. To overcome these limitations, Xia et al. [65] introduced granular-balls into rough set theory, proposing granular-ball rough sets (GBRS). This framework allows for the processing of continuous data while utilizing equivalence classes for knowledge representation. The specific models are defined and described as follows.

Rough set theory is a key component of granular computing. Pawlak rough set and neighborhood rough set are two important models within rough set theory. Pawlak rough set represents knowledge through equivalence classes but cannot handle continuous data. On the other hand, neighborhood rough set can handle continuous data but suffers from the “heterogeneous transmission” problem, which hinders knowledge representation. Granular-balls can simultaneously represent continuous data and perform equivalence class partitioning. Therefore, Xia et al. [65] introduced granular-ball into rough set theory and established the concept of granular-ball rough set, which is a unified learning model of Pawlak rough set and neighborhood rough set. Granular-ball rough set refers to the rough set based on granular-ball computing, the model is shown as follows.

Definition 1. [65] Let U = {x1, x2, ⋯, xn} be a non-empty finite set in real space. ∀xi ∈ U, a granular-ball GBj is defined as:

$$GB_j = \{\, x_i \mid \lVert x_i - c_j \rVert \le r_j \,\} \tag{2}$$

where cj and rj denote the center and the radius of GBj, respectively. Obviously, the larger the radius rj of the granular-ball, the coarser the granularity, and vice versa.
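As an illustration of Definition 1, the following minimal sketch computes the two descriptors of a granular-ball and tests membership. It assumes that the center is the mean of the covered samples and that the radius is their mean Euclidean distance to the center, a convention used in granular-ball computing; the function names `granular_ball` and `in_ball` and the sample coordinates are hypothetical and chosen only for this example.

```python
import numpy as np

def granular_ball(points):
    """Return (center, radius) of a granular-ball covering `points`.

    Assumption: the radius is the mean Euclidean distance to the center.
    """
    points = np.asarray(points, dtype=float)
    center = points.mean(axis=0)
    radius = np.linalg.norm(points - center, axis=1).mean()
    return center, radius

def in_ball(x, center, radius):
    """A sample belongs to the ball if its distance to the center is at most the radius."""
    distance = np.linalg.norm(np.asarray(x, dtype=float) - center)
    return bool(distance <= radius)

# Four illustrative samples at the corners of a square (made-up values).
samples = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
center, radius = granular_ball(samples)
inside = in_ball([1.0, 1.0], center, radius)   # the center itself
outside = in_ball([3.0, 3.0], center, radius)  # a point far from the ball
```

A larger radius covers more samples with a single ball, which corresponds to the coarser granularity described above.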

Definition 2. [65] Let 〈U, A, V, f〉 be an information system, where U is the set of objects, A and V denote the set of all attributes and the set of attribute values respectively, and f denotes a mapping function f : U × A → V. ∀x, y ∈ U and B ⊆ A, the indiscernible granular-ball relation INDGB(B) of the attribute subset B is defined as: (3) where a is an attribute of B. If (x, y) ∈ INDGB(B), then x and y are indiscernible according to attribute set B, denoted as xy. In the granular-ball rough set, INDGB(B) denotes an equivalence relation on U, which creates a partition of U, denoted as U/GB(B). An element [x]GB(B) = {y ∈ U | (x, y) ∈ INDGB(B)} of U/GB(B) is an equivalence class generated by granular-ball computing.

Definition 3. [65] Let 〈U, A, V, f〉 be an information system. For ∀a ∈ B, GBRB denotes a corresponding relation on U. ∀X ⊆ U, the upper and lower approximations of X based on attribute set B can be described as follows: (4) (5)

2.2 Pearson correlation coefficient

The Pearson correlation coefficient is a simple method that can help understand the relationship between features and response variables. It measures the linear correlation between variables. The value range of the result is [-1, 1]: “-1” and “+1” indicate a complete negative correlation and a complete positive correlation respectively, and 0 indicates no linear correlation. The Pearson correlation coefficient is fast and easy to calculate and is often the first analysis performed after obtaining the data (after cleaning and feature extraction). It is defined as follows:

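If two sets of data X = {X1, X2, ⋯, Xn} and Y = {Y1, Y2, ⋯, Yn} are population data (such as census results), then the population mean is expressed as [68]

$$\mu_X = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad \mu_Y = \frac{1}{n}\sum_{i=1}^{n} Y_i \tag{6}$$

The population covariance is

$$\mathrm{Cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n} (X_i - \mu_X)(Y_i - \mu_Y) \tag{7}$$

The population Pearson correlation coefficient can be obtained by

$$\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} \tag{8}$$

where σX (sigma X) is the standard deviation of X, and σY is defined analogously:

$$\sigma_X = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (X_i - \mu_X)^2} \tag{9}$$

It can be proved that |ρXY| ≤ 1, with |ρXY| = 1 if and only if Y = aX + b for some constants a ≠ 0 and b.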

Observing the formula of the population Pearson correlation coefficient, we find that it can be seen as eliminating the dimensional influence of the two variables, that is, as the covariance of X and Y after standardization. Therefore, we can use the Pearson correlation coefficient to measure the degree of linear correlation between the two variables. As shown in S1 Fig, the correlation between two variables x and y can be easily judged by drawing a scatter plot. The Pearson correlation coefficient can be used accurately only when the following conditions are met: first, there is a linear relationship between the two variables, and they are continuous data; second, the population distribution of each variable is normal, or a near-normal unimodal distribution; finally, the observations of the two variables are paired, and each pair of observations is independent of the others. By inspecting the plot and verifying these conditions, the linear relationship between the two variables can be assessed.
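The population formulas in this subsection can be checked numerically with a few lines of NumPy. The sketch below is only an illustration: the function name `population_pearson` and the toy data are made up for the example, and the 1/n normalization follows the population (not sample) definitions used here.

```python
import numpy as np

def population_pearson(x, y):
    """Population Pearson correlation: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mu_x, mu_y = x.mean(), y.mean()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))      # population covariance
    sigma_x = np.sqrt(np.mean((x - mu_x) ** 2))    # population standard deviation of x
    sigma_y = np.sqrt(np.mean((y - mu_y) ** 2))    # population standard deviation of y
    return float(cov_xy / (sigma_x * sigma_y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_increasing = [2.0 * v + 1.0 for v in x]    # exact linear relation, positive slope
y_decreasing = [-3.0 * v + 4.0 for v in x]   # exact linear relation, negative slope
y_mixed = [2.0, 1.0, 4.0, 3.0, 6.0]          # noisy, roughly increasing

rho_pos = population_pearson(x, y_increasing)
rho_neg = population_pearson(x, y_decreasing)
rho_mixed = population_pearson(x, y_mixed)
```

The two exactly linear cases reach the bounds +1 and -1, while the noisy series falls strictly inside the interval. Because the normalization constant cancels in the ratio, the result agrees with `np.corrcoef`.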

2.3 kNN regressor

According to the given distance measurement method (generally Euclidean distance), the k sample points closest to x are found in the training set T = {(xi, yi), i = 1, 2, …, N}, and the set of these k sample points is recorded as Nk(x). In kNN classification, the category y of x is determined within Nk(x) by a classification decision rule (such as majority voting):

$$y = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j)$$

where cj ranges over the categories and I is an indicator function that equals 1 when yi = cj and 0 otherwise. The basic idea of the kNN regression algorithm is to select the k nearest training samples according to the Euclidean distance between samples, and to take the mean or a weighted mean of the k nearest neighbor samples as the regression value. The specific steps of the kNN regression algorithm are as follows:

(1) There are n training samples, expressed as X = (X1, X2, …, Xn), and each training sample can be expressed as Xi = (xi1, xi2, …, xid, yi), i = 1, 2, …, n. The Euclidean distance D between the training sample Xi and a test sample Xt = (xt1, xt2, …, xtd, yt), which is not one of the training samples, can be expressed as [69]

$$D(X_i, X_t) = \sqrt{\sum_{m=1}^{d} (x_{im} - x_{tm})^2} \tag{10}$$

where xim and xtm are the observed values of the mth auxiliary variable (the variable used to build the model), and yi and yt are the observed values of the monitoring variable.

(2) Calculate the Euclidean distance between the test sample and all training samples according to Eq (10) and find the K nearest neighbor samples.

(3) Calculate the estimated value of the monitoring variable for the test sample Xt as the mean over the K nearest neighbors:

$$\hat{y}_t = \frac{1}{K}\sum_{i=1}^{K} y_i \tag{11}$$

where yi here denotes the observed value of the ith nearest neighbor.
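The three steps of kNN regression can be sketched as follows. This is a minimal illustration that uses the unweighted mean of the neighbors; the function name `knn_regress` and the toy training data are assumptions made for the example and are not taken from the experiments.

```python
import numpy as np

def knn_regress(X_train, y_train, x_test, k=3):
    """Predict the target of x_test as the mean target of its k nearest training samples."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_test = np.asarray(x_test, dtype=float)
    if not 1 <= k <= len(X_train):
        raise ValueError("k must be between 1 and the number of training samples")
    # Steps 1-2: Euclidean distance from the test sample to every training sample.
    distances = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    nearest = np.argsort(distances, kind="stable")[:k]
    # Step 3: unweighted mean of the neighbours' observed values.
    return float(y_train[nearest].mean())

# One auxiliary variable; the last sample is far from the query point.
X_train = [[0.0], [1.0], [2.0], [10.0]]
y_train = [0.0, 1.0, 2.0, 10.0]
prediction_k3 = knn_regress(X_train, y_train, [1.2], k=3)
prediction_k1 = knn_regress(X_train, y_train, [1.2], k=1)
```

For the query 1.2, the three nearest samples have targets 1, 2 and 0, so the k = 3 prediction is their mean, 1.0; the distant sample with target 10 does not contribute.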

2.4 BP neural network

The BP neural network is a multi-layer feedforward network trained according to the error back propagation algorithm. It is one of the most widely used neural network models. It is a supervised learning algorithm with strong adaptive, self-learning and non-linear mapping capabilities. It can better solve the problems of small data, poor information and uncertainty, and is not restricted by non-linear models. The BP neural network can learn and store a large number of input-output pattern mapping relationships without the mathematical equations describing the mapping being specified in advance. Its learning rule is to use the steepest descent method to continuously adjust the weights and thresholds of the network through backpropagation so as to minimize the sum of squared errors of the network. The multi-layer perceptron obtains the final output a(L) of the network through layer-by-layer information transmission. The whole network can be viewed as a composite function φ(x; ω, b). It takes the vector x as the input a(0) of the first layer, and the output of layer L is used as the output of the whole function:

$$a^{(L)} = \phi(x; \omega, b) \tag{12}$$

where ω and b denote the connection weights and biases of all layers in the network. The error backpropagation algorithm corrects the ω and b of each neuron by assigning a share of the error to each hidden layer neuron so that the error signal tends to be minimized. This paper selects the multi-layer BP neural network model. Fig 1 is the algorithm flow chart of the BP neural network.
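To make the forward pass and the error backpropagation concrete, the following is a minimal sketch of a network with one hidden layer trained by gradient descent. It is not the configuration used in the experiments: the tanh activation, the hidden size, the learning rate, the use of the mean (rather than the sum) of squared errors, and the toy target y = x² are all assumptions chosen for illustration.

```python
import numpy as np

def train_bp(X, y, hidden=8, lr=0.05, epochs=2000, seed=0):
    """Train a one-hidden-layer network with backpropagation; return (predict, losses)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    W1 = rng.normal(0.0, 0.5, size=(X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.5, size=(hidden, 1))
    b2 = np.zeros(1)
    losses = []
    for _ in range(epochs):
        # Forward pass: the input is a(0); the hidden layer uses tanh, the output is linear.
        a1 = np.tanh(X @ W1 + b1)
        out = a1 @ W2 + b2
        err = out - y
        losses.append(float(np.mean(err ** 2)))
        # Backward pass: propagate the error from the output layer to the hidden layer.
        d_out = 2.0 * err / len(X)
        d_W2 = a1.T @ d_out
        d_b2 = d_out.sum(axis=0)
        d_z1 = (d_out @ W2.T) * (1.0 - a1 ** 2)
        d_W1 = X.T @ d_z1
        d_b1 = d_z1.sum(axis=0)
        # Steepest-descent update of the weights and biases.
        W1 -= lr * d_W1
        b1 -= lr * d_b1
        W2 -= lr * d_W2
        b2 -= lr * d_b2

    def predict(X_new):
        X_new = np.asarray(X_new, dtype=float)
        return (np.tanh(X_new @ W1 + b1) @ W2 + b2).ravel()

    return predict, losses

X_demo = np.linspace(-1.0, 1.0, 20).reshape(-1, 1)
y_demo = (X_demo ** 2).ravel()
predict, losses = train_bp(X_demo, y_demo)
fitted = predict(X_demo)
```

The list `losses` records the training error per epoch, so comparing its first and last entries shows how far the weight corrections have reduced the error.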

3. Method

In order to analyze the correlation between data features, we first combine the Pearson correlation coefficient with the kNN algorithm or BP neural network. The flow chart of the method is shown in Fig 2. First, the correlation coefficients between the condition attributes and the label attribute are calculated and sorted. After deleting the attributes corresponding to the largest coefficients, the kNN algorithm or BP neural network is executed on the remaining condition attributes, and the accuracy of the algorithm is calculated. This process is repeated in turn, and the attribute set corresponding to the highest accuracy is taken as the optimal attribute set after feature selection.

Fig 2. Flow chart of prediction method based on Pearson correlation coefficient and kNN regressor.

https://doi.org/10.1371/journal.pone.0298664.g002
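The correlation-ranked search of Fig 2 can be sketched as a loop that removes one attribute at a time and keeps the subset with the lowest error. The sketch makes several assumptions that the text does not fix: attributes are ranked by the absolute Pearson coefficient and the one with the smallest absolute coefficient is removed first (a common convention; the opposite ranking would only change the sort order), the error is a leave-one-out RMSE of a mean-of-neighbors kNN regressor, and the data are synthetic. The function names are hypothetical.

```python
import numpy as np

def loo_knn_rmse(X, y, k):
    """Leave-one-out RMSE of a mean-of-k-neighbours regressor."""
    n = len(X)
    errors = []
    for t in range(n):
        mask = np.arange(n) != t
        d = np.sqrt(((X[mask] - X[t]) ** 2).sum(axis=1))
        nearest = np.argsort(d, kind="stable")[:k]
        errors.append(y[mask][nearest].mean() - y[t])
    return float(np.sqrt(np.mean(np.square(errors))))

def pearson_knn_select(X, y, k=3):
    """Rank attributes by |r| with the label, drop them one by one, keep the best subset."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    order = [int(j) for j in np.argsort(-r, kind="stable")]  # strongest first
    best_features, best_rmse = None, np.inf
    while order:
        rmse = loo_knn_rmse(X[:, order], y, k)
        if rmse < best_rmse:
            best_features, best_rmse = list(order), rmse
        order.pop()  # assumption: remove the currently weakest attribute
    return best_features, best_rmse

# Synthetic data: only attribute 0 carries information about the label.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
y_demo = 3.0 * X_demo[:, 0] + 0.1 * rng.normal(size=40)
features, rmse = pearson_knn_select(X_demo, y_demo, k=3)
rmse_all = loo_knn_rmse(X_demo, y_demo, 3)
```

Because the first candidate is the full attribute set, the selected subset can never score worse than kNN on all attributes under the same error measure.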

Although the Pearson correlation coefficient method can achieve feature selection, it has to search for the optimal accuracy by repeatedly running the kNN or BP neural network on the results of the feature correlation analysis, which is relatively complicated. We therefore consider introducing the GBRS method to select data features directly, remove redundant attributes, and then process the data through regression methods. The flow chart of this method is shown in Fig 3. First, initialize the attribute set, add conditional attributes in sequence, and regenerate the granular-balls. If the number of samples covered by the granular-balls increases, retain the conditional attribute and repeat the above steps. Otherwise, output the conditional attribute set and calculate the prediction accuracy of the kNN or BP neural network under these attributes. The GBRS can simultaneously represent both the Pawlak rough set and the neighborhood rough set (NRS), so it can handle continuous data and also use equivalence classes for knowledge representation. It combines the robustness and adaptability of granular-ball computing, which enables the flexible fitting of different data distributions using granular-balls with various radii. Thus, it prevents the propagation of heterogeneity caused by the overlap between the positive-region neighborhoods of different labels in the NRS. In addition, because coarse-grained granular-balls are used instead of points as the input, it can achieve higher efficiency than point-based feature selection.

Fig 3. Flow chart of prediction method based on GBRS and kNN regressor.

https://doi.org/10.1371/journal.pone.0298664.g003
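The control flow of Fig 3 can be illustrated by a forward pass over the conditional attributes in which an attribute is retained only if it increases a coverage score. The sketch below is not the GBRS algorithm of [65]: generating granular-balls is beyond a short example, so the score `nn_consistent_count` (the number of samples whose nearest neighbor has the same label) is a hypothetical stand-in for the number of samples covered by granular-balls, and it assumes discrete labels. Skipping a non-improving attribute and continuing with the next one is also one possible reading of the flow chart.

```python
import numpy as np

def nn_consistent_count(X_sub, y):
    """Stand-in coverage score: samples whose nearest other sample shares their label."""
    count = 0
    for i in range(len(y)):
        d = np.sqrt(((X_sub - X_sub[i]) ** 2).sum(axis=1))
        d[i] = np.inf  # exclude the sample itself
        if y[int(np.argmin(d))] == y[i]:
            count += 1
    return count

def forward_select(X, y, coverage=nn_consistent_count):
    """Add attributes in sequence; keep one only if the coverage score increases."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    selected, best = [], 0
    for attr in range(X.shape[1]):
        candidate = selected + [attr]
        score = coverage(X[:, candidate], y)
        if score > best:
            selected, best = candidate, score
    return selected

# Attribute 0 separates the two labels; attribute 1 mixes them (made-up values).
X_demo = [[0.0, 5.0], [0.1, 1.0], [0.2, 9.0],
          [1.0, 5.1], [1.1, 1.1], [1.2, 9.1]]
y_demo = [0, 0, 0, 1, 1, 1]
selected = forward_select(X_demo, y_demo)
```

In this toy example only attribute 0 is retained, because adding attribute 1 pulls samples of different labels together and lowers the score. A real GBRS implementation would replace `nn_consistent_count` with the size of the positive region computed from granular-balls.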

4. Results

4.1 Sample pretreatment

The water level and flow data used in this experiment are measured data from the hydrological station of the Hydrology and Water Resources Survey Bureau of the Upper Yangtze River, and the water quality and ecological data are obtained from the daily measurement of the Water Environment Monitoring Center of the Upper Yangtze River. Data cannot be shared publicly because of privacy. Data are available from the Hydrological Bureau of Yangtze River Conservancy Commission for access to confidential data. Contact information: syzuoxy@cjh.com.cn. The data set was measured from January 2015 to November 2021. Before 2019, it was monitored monthly, and after 2020, it was monitored quarterly.

4.2 Evaluating indicator

In order to analyze the performance of the prediction model, a reasonable evaluation system needs to be established. At present, the commonly used approach is to evaluate the model based on the relationship between the output value of the prediction model and the actual monitoring value. In this paper, the root mean square error (RMSE) is used as the main evaluation index, and the MAE, MAPE and R2 are additionally reported for the BP neural network.

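The root mean square error (RMSE), the mean absolute error (MAE), the mean absolute percentage error (MAPE) and the R-Square (R2) are defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{k}\sum_{i=1}^{k} (T_i - T'_i)^2} \tag{13}$$

$$\mathrm{MAE} = \frac{1}{k}\sum_{i=1}^{k} \lvert T_i - T'_i \rvert \tag{14}$$

$$\mathrm{MAPE} = \frac{100\%}{k}\sum_{i=1}^{k} \left\lvert \frac{T_i - T'_i}{T_i} \right\rvert \tag{15}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{k} (T_i - T'_i)^2}{\sum_{i=1}^{k} (T_i - \bar{T})^2} \tag{16}$$

in the formula:

k---- Sample size;

T′---- Model predictions;

T̄---- Mean value of the actual measurements;

T---- Actual measuring.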

In the above formula, RMSE mainly characterizes the absolute deviation of the predicted value from the actual monitoring value.
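The four indicators can be computed directly from paired actual and predicted values. The sketch below uses the standard definitions of RMSE, MAE, MAPE (in percent) and R2; the function name `regression_metrics` and the numbers are made up for illustration and are not experimental results.

```python
import numpy as np

def regression_metrics(actual, predicted):
    """Return RMSE, MAE, MAPE (percent) and R2 for paired actual and predicted values."""
    t = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    err = t - p
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    mape = float(np.mean(np.abs(err / t)) * 100.0)  # requires non-zero actual values
    r2 = float(1.0 - np.sum(err ** 2) / np.sum((t - t.mean()) ** 2))
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "R2": r2}

# Illustrative values only: every prediction is off by 0.5.
metrics = regression_metrics([2.0, 4.0, 6.0, 8.0], [2.5, 3.5, 6.5, 7.5])
```

Here RMSE and MAE both equal 0.5 because all errors have the same magnitude, whereas MAPE weights the same absolute error more heavily for small actual values, which matters when algae quantities are close to zero.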

4.3 Experimental result analysis

The root mean square error is used to compare the prediction method based on the Pearson correlation coefficient combined with the kNN regressor (PK) against the standalone kNN method. The two methods were used to detect changes in the chlorophyll quantity of Cryptophyta (Cry), Chlorophyta (Chl), Euglenophyta (Eug), Cyanophyta (Cya), Pyrrophyta (Pyr) and Bacillariophyta (Bac).

Table 1 shows the root mean square error results of the PK algorithm on the data set Cryptophyta. The root mean square error is smallest when the number of features is 2 and k = 3. Therefore, in Table 2, the best PK result over the different numbers of features is compared with the result of kNN. The best results are marked in bold. The results show that the PK algorithm is equal to or better than the plain kNN algorithm. This is expected by construction: the PK result is the optimum selected after comparison with the kNN algorithm, so it cannot be worse than kNN.

Table 2. Comparison of root mean square error between PK and kNN.

https://doi.org/10.1371/journal.pone.0298664.t002

Tables 3 and 4 show the root mean square errors of the kNN prediction model and the BP neural network regression model, respectively, when the Pearson correlation coefficient (PK/PBP) and the granular-ball rough set (GBRSK/GBRSBP) are used for feature selection. To demonstrate the feasibility and effectiveness of GBRSK/GBRSBP, we compare it against PK/PBP. In Tables 3 and 4, the experiment predicts with the optimal attribute set obtained by granular-ball rough set feature selection, and the Pearson correlation coefficient method selects the same number of attributes. As the experimental results show, in most cases the prediction error obtained by using the granular-ball rough set for feature selection is smaller.

Table 3. Comparison of root mean square error between PK and GBRSK.

https://doi.org/10.1371/journal.pone.0298664.t003

The experimental results under different values of k are shown in Table 3, where the number of features is the size of the feature set when the GBRS feature selection result is optimal. Table 4 reports the RMSE, MAE, MAPE and R2 of the BP neural network predictions. As can be seen from Table 3, except for the 4th data set, the prediction error of GBRSK is lower than that of PK in most cases, and the prediction error on the 3rd data set is the same for both algorithms. From Table 4, it can be seen that, except for the sixth data set, the RMSE obtained by the GBRS algorithm is lower than that of the Pearson correlation coefficient algorithm. In addition, according to the MAE, MAPE and R2 results, GBRSBP has higher prediction accuracy than PBP in most cases. The best result is marked in bold. The results indicate that the GBRS algorithm is superior to the Pearson correlation coefficient algorithm.

GBRSK/GBRSBP obtains this advantage because it can flexibly fit the data distribution using granular-balls with various radii, which is better than relying on a fixed measure such as the Pearson correlation coefficient. Therefore, GBRSK/GBRSBP can achieve better results than PK/PBP.

5. Discussion

In this paper, a hydrological and water quality prediction model is constructed by combining the granular-ball rough set and the kNN regressor. It provides a better method for the analysis of hydrological and water quality data than the kNN regressor alone. The model is applicable to the prediction of various algae blooms. The experimental results show that, when the granular-ball rough set is combined with the kNN regressor, the attribute set corresponding to the highest accuracy is the optimal attribute set after feature selection. Feature selection plays an important role in the analysis of hydrology and water quality. Its main functions include the following aspects. Firstly, by selecting the most relevant features, the accuracy and efficiency of modeling and analysis can be enhanced. By excluding redundant and irrelevant features, the complexity of the model can be reduced, minimizing the risk of overfitting and improving the performance of predictive and classification models. Secondly, hydrology and water quality data often contain a large number of characteristic variables, which may contain noise, missing values and incomplete data. Through feature selection, data processing can be simplified, the workload of data cleaning and filling can be reduced, and data quality and reliability can be improved. Finally, feature selection reduces the number of features that need to be collected and measured, saving time and cost in field investigations and experiments. By selecting the most representative and informative features, efficient data collection and analysis can be achieved, improving the efficiency of research. In summary, the combination of granular-ball rough set analysis and the kNN regressor can help us better understand the water environment and establish accurate and efficient models to evaluate hydrological water quality.

Despite these encouraging advances, our approach has certain limitations. Combining the granular-ball rough set and the kNN regressor for hydrology and water quality analysis has some defects: in some cases GBRS does not achieve higher prediction accuracy than the baselines. The reason may be that the quality of the granular-balls is not good enough. In terms of theoretical applicability, granular-ball computing primarily relies on the multi-granularity partitioning of the sample space. When the sample size is small, especially when the number of samples is close to the sample dimension, the sample distribution is already very sparse and is approximately linearly separable in classification. In this situation, further division by granular-ball computing is of little significance. At present, granular-ball computing is therefore not suitable for processing data sets whose dimension and sample size are of similar magnitude. In addition, the K value in the kNN regressor needs to be determined in advance. Choosing an inappropriate K value may lead to a model that is too simple or too complex, thereby affecting the accuracy of prediction. In order to overcome these shortcomings, we consider using more complex machine learning algorithms, such as Support Vector Regression (SVR) and Decision Tree Regression (DTR), which handle high-dimensional sparse data well. In addition, rational selection of features and adjustment of model parameters are also key factors in improving the accuracy of hydrological and water quality analysis.

6. Conclusions

To improve the analysis of water quality data, this paper proposes first applying GBRS to perform feature selection on the data set and then feeding the selected features into the kNN regressor and a BP neural network, improving the accuracy and reliability of hydrological and water quality analysis. Because the kNN regressor predicts from the feature values of neighboring samples, it can capture the changing patterns of water quality indicators. The analysis of hydrological water quality data indicates that combining GBRS with the kNN regressor yields better results than a standalone kNN regressor. However, granular-ball computing is not suitable for high-dimensional, small-sample data sets, so further research is needed.

Supporting information

S1 Fig. Scatter plots illustrating correlation coefficients from -1 to 1.

https://doi.org/10.1371/journal.pone.0298664.s001

(TIF)

References

  1. Sellner K G, Doucette G J, Kirkpatrick G J. Harmful algal blooms: causes, impacts and detection. Journal of Industrial Microbiology and Biotechnology, 2003, 30: 383–406. pmid:12898390
  2. Hallegraeff G M. Harmful algal blooms: a global overview. Manual on harmful marine microalgae, 2003, 33: 1–22.
  3. Zhang P, Peng C, Zhang J, et al. Long-term harmful algal blooms and nutrients patterns affected by climate change and anthropogenic pressures in the Zhanjiang Bay, China. Frontiers in Marine Science, 2022, 9.
  4. Choat J H, Schiel D R. Patterns of distribution and abundance of large brown algae and invertebrate herbivores in subtidal regions of northern New Zealand. Journal of Experimental Marine Biology and Ecology, 1982, 60(2-3): 129–162.
  5. Smetacek V S. Role of sinking in diatom life-history cycles: ecological, evolutionary and geological significance. Marine Biology, 1985, 84: 239–251.
  6. Gu H, Wu Y, Lü S, et al. Emerging harmful algal bloom species over the last four decades in China. Harmful Algae, 2022, 111: 102059. pmid:35016757
  7. Zhang Y, Qin B, Zhu G, et al. Profound changes in the physical environment of Lake Taihu from 25 years of long‐term observations: Implications for algal bloom outbreaks and aquatic macrophyte loss. Water Resources Research, 2018, 54(7): 4319–4331.
  8. Binding C E, Zeng C, Pizzolato L, et al. Reporting on the status, trends, and drivers of algal blooms on Lake of the Woods using satellite-derived bloom indices (2002–2021). Journal of Great Lakes Research, 2023, 49(1): 32–43.
  9. Kang L, Zhu G, Zhu M, et al. Bloom-induced internal release controlling phosphorus dynamics in large shallow eutrophic Lake Taihu, China. Environmental Research, 2023, 116251. pmid:37245569
  10. Luo J, Ni G, Zhang Y, et al. A new technique for quantifying algal bloom, floating/emergent and submerged vegetation in eutrophic shallow lakes using Landsat imagery. Remote Sensing of Environment, 2023, 287: 113480.
  11. Abkar L, Taylor A, Stoddart A, et al. Microbiome and hydraulic performance changes of drinking water biofilters during disruptive events-media replacement, lake diatom bloom, and chlorination. Environmental Science: Water Research & Technology, 2023, 9(3): 723–735.
  12. Schweitzer‐Natan O, Ofek‐Lalzar M, Sher D, et al. The microbial community spatially varies during a Microcystis bloom event in Lake Kinneret. Freshwater Biology, 2023, 68(2): 349–363.
  13. Wang S, Zhang X, Wang C, et al. Multivariable integrated risk assessment for cyanobacterial blooms in eutrophic lakes and its spatiotemporal characteristics. Water Research, 2023, 228: 119367. pmid:36417795
  14. Painter K J, Venkiteswaran J J, Baulch H M. Blooms and flows: Effects of variable hydrology and management on reservoir water quality. Ecosphere, 2023, 14(3): e4472.
  15. Song Y. Hydrodynamic impacts on algal blooms in reservoirs and bloom mitigation using reservoir operation strategies: A review. Journal of Hydrology, 2023, 129375.
  16. Summers E J, Ryder J L. A critical review of operational strategies for the management of harmful algal blooms (HABs) in Inland reservoirs. Journal of Environmental Management, 2023, 330: 117141. pmid:36603251
  17. Song Y, Chen M, Li J, et al. Can selective withdrawal control algal blooms in reservoirs? The underlying hydrodynamic mechanism. Journal of Cleaner Production, 2023, 394: 136358.
  18. Li Y X, Deng K K, Lin G J, et al. Effects of physiologic activities of plankton on CO2 flux in the Three Gorges Reservoir after rainfall during algal blooms. Environmental Research, 2023, 216: 114649. pmid:36309212
  19. Song Y, You L, Chen M, et al. Key hydrodynamic principles for controlling algal blooms using emergency reservoir operation strategies. Journal of Environmental Management, 2023, 325: 116470. pmid:36244283
  20. Kruk C, Segura A, Piñeiro G, et al. Rise of toxic cyanobacterial blooms is promoted by agricultural intensification in the basin of a large subtropical river of South America. Global Change Biology, 2023, 29(7): 1774–1790. pmid:36607161
  21. Phlips E J, Badylak S, Mathews A L, et al. Algal blooms in a river-dominated estuary and nearshore region of Florida, USA: the influence of regulated discharges from water control structures on hydrologic and nutrient conditions. Hydrobiologia, 2023, 1–27.
  22. Howard M D A, Smith J, Caron D A, et al. Integrative monitoring strategy for marine and freshwater harmful algal blooms and toxins across the freshwater‐to‐marine continuum. Integrated Environmental Assessment and Management, 2023, 19(3): 586–604. pmid:35748667
  23. Li X Y, Yu R C, Richardson A J, et al. Marked shifts of harmful algal blooms in the Bohai Sea linked with combined impacts of environmental changes. Harmful Algae, 2023, 121: 102370. pmid:36639187
  24. Xiao R, Zhao Z, Guo J, et al. Variations of dissolved inorganic nutrients and their influences on harmful algal blooms in Bohai Sea over the past thirteen years. Estuarine, Coastal and Shelf Science, 2023, 287: 108335.
  25. Cao J, Liu J, Zhao S, et al. Advances in the research on micropropagules and their role in green tide outbreaks in the Southern Yellow Sea. Marine Pollution Bulletin, 2023, 188: 114710. pmid:36860024
  26. Hu Y, Yu F, Chen Z, et al. Two near-inertial peaks in antiphase controlled by stratification and tides in the Yellow Sea. Frontiers in Marine Science, 2023, 9: 1081869.
  27. Mu B, Qin B, Yuan S, et al. PIRT: A Physics-Informed Red Tide Deep Learning Forecast Model Considering Causal-Inferred Predictors Selection. IEEE Geoscience and Remote Sensing Letters, 2023, 20: 1–5.
  28. Yang H F, Zhu Q Y, Liu J A, et al. Historic changes in nutrient fluxes from the Yangtze River to the sea: Recent response to catchment regulation and potential linkage to maritime red tides. Journal of Hydrology, 2023, 617: 129024.
  29. Liu Z, Jiao S, Liu X, et al. Two-Dimensional Numerical Simulation of Tide and Tidal Current of Eight Major Tidal Constituents in the Bohai, Yellow, and East China Seas. Remote Sensing, 2023, 15(15): 3735.
  30. Wan L, Caruso G, Cao X, et al. Microbial Response to Coastal-Offshore Gradients in Taiwan Straits: Community Metabolism and Total Prokaryotic Abundance as Potential Proxies. Microbial Ecology, 2023, 85(4): 1253–1264. pmid:35581504
  31. Zahir M, Balaji-Prasath B, Su Y P, et al. The dynamics of red Noctiluca scintillans in the coastal aquaculture areas of Southeast China. Environmental Geochemistry and Health, 2023, 1–18. pmid:37027084
  32. Li R, Liang Z, Hou L, et al. Revealing the impacts of human activity on the aquatic environment of the Pearl River Estuary, South China, based on sedimentary nutrient records. Journal of Cleaner Production, 2023, 385: 135749.
  33. Kim Y J, Kim W, Im J, et al. Atmospheric-correction-free red tide quantification algorithm for GOCI based on machine learning combined with a radiative transfer simulation. ISPRS Journal of Photogrammetry and Remote Sensing, 2023, 199: 197–213.
  34. Xin M, Sun X, Xie L P, et al. A historical overview of water quality in the coastal seas of China. Frontiers in Marine Science, 2023, 10: 1203232.
  35. Okada M, Aiba S. Simulation of water-bloom in a eutrophic lake—III. Modeling the vertical migration and growth of Microcystis aeruginosa. Water Research, 1983, 17(8): 883–893.
  36. Reynolds C S, Wiseman S W, Clarke M J O. Growth- and loss-rate responses of phytoplankton to intermittent artificial mixing and their potential application to the control of planktonic algal biomass. Journal of Applied Ecology, 1984, 11–39.
  37. Jayaraman S K, Rhinehart R R. Modeling and optimization of algae growth. Industrial & Engineering Chemistry Research, 2015, 54(33): 8063–8071.
  38. Pyo J C, Cho K H, Kim K, et al. Cyanobacteria cell prediction using interpretable deep learning model with observed, numerical, and sensing data assemblage. Water Research, 2021, 203: 117483. pmid:34384949
  39. Sakamoto Y, Ishiguro M, Kitagawa G. Akaike information criterion statistics. Dordrecht, The Netherlands: D. Reidel, 1986, 81(10.5555): 26853.
  40. Dillon P J, Rigler F H. A simple method for predicting the capacity of a lake for development based on lake trophic status. Journal of the Fisheries Board of Canada, 1975, 32(9): 1519–1531.
  41. Whitehead P G, Hornberger G M. Modelling algal behaviour in the River Thames. Water Research, 1984, 18(8): 945–953.
  42. Recknagel F, Petzoldt T, Jaeke O, et al. Hybrid expert system DELAQUA—a toolkit for water quality control of lakes and reservoirs. Ecological Modelling, 1994, 71(1–3): 17–36.
  43. Liu X, Zhang C, Geng R, et al. Are oil spills enhancing outbreaks of red tides in the Chinese coastal waters from 1973 to 2017? Environmental Science and Pollution Research, 2021, 28(40): 56473–56479. pmid:34057633
  44. Xiao X, Li C, Huang H, et al. Inhibition effect of natural flavonoids on red tide alga Phaeocystis globosa and its quantitative structure-activity relationship. Environmental Science and Pollution Research, 2019, 26: 23763–23776. pmid:31209750
  45. Li X Y, Yu R C, Geng H X, et al. Increasing dominance of dinoflagellate red tides in the coastal waters of Yellow Sea, China. Marine Pollution Bulletin, 2021, 168: 112439. pmid:33993042
  46. Foley A M, Stacy B A, Schueller P, et al. Assessing Karenia brevis red tide as a mortality factor of sea turtles in Florida, USA. Diseases of Aquatic Organisms, 2019, 132(2): 109–124. pmid:30628577
  47. Tominack S A, Coffey K Z, Yoskowitz D, et al. An assessment of trends in the frequency and duration of Karenia brevis red tide blooms on the South Texas coast (western Gulf of Mexico). PLoS One, 2020, 15(9): e0239309. pmid:32946494
  48. Lee M S, Park K A, Micheli F. Derivation of red tide index and density using geostationary ocean color imager (GOCI) data. Remote Sensing, 2021, 13(2): 298.
  49. Wang L, Xie Y, Xu J, et al. Prediction method of cyanobacterial blooms spatial-temporal sequence based on deep belief network and fuzzy expert system. Journal of Intelligent & Fuzzy Systems, 2020, 38(2): 1487–1498.
  50. Bi S, Li Y, Xu J, et al. Optical classification of inland waters based on an improved Fuzzy C-Means method. Optics Express, 2019, 27(24): 34838–34856. pmid:31878664
  51. Cairo C, Barbosa C, Lobo F, et al. Hybrid chlorophyll-a algorithm for assessing trophic states of a tropical Brazilian reservoir based on MSI/Sentinel-2 data. Remote Sensing, 2019, 12(1): 40.
  52. Lei F, Yu Y, Zhang D, et al. Water remote sensing eutrophication inversion algorithm based on multilayer convolutional neural network. Journal of Intelligent & Fuzzy Systems, 2020, 39(4): 5319–5327.
  53. Molares-Ulloa A, Rivero D, Ruiz J G, et al. Hybrid machine learning techniques in the management of harmful algal blooms impact. Computers and Electronics in Agriculture, 2023, 211: 107988.
  54. Grekov A N, Kabanov A A, Vyshkvarkova E V, et al. Anomaly Detection in Biological Early Warning Systems Using Unsupervised Machine Learning. Sensors, 2023, 23(5): 2687. pmid:36904891
  55. Lemieux P, Lalumière C, Fugaru N, et al. Assessment of pixel-oriented k-NN machine learning algorithm performance for the interannual remote sensing monitoring of eelgrass beds at the mouth of the Romaine. Environmental Monitoring and Assessment, 2023, 195(8): 939. pmid:37436485
  56. Bu X, Liu K, Liu J, et al. A Harmful Algal Bloom Detection Model Combining Moderate Resolution Imaging Spectroradiometer Multi-Factor and Meteorological Heterogeneous Data. Sustainability, 2023, 15(21): 15386.
  57. Rao W, Qian X, Fan Y, et al. A soft sensor for simulating algal cell density based on dynamic response to environmental changes in a eutrophic shallow lake. Science of The Total Environment, 2023, 868: 161543. pmid:36640876
  58. Krtolica I, Savić D, Bajić B, et al. Machine Learning for Water Quality Assessment Based on Macrophyte Presence. Sustainability, 2022, 15(1): 522.
  59. Recknagel F, French M, Harkonen P, et al. Artificial neural network approach for modelling and prediction of algal blooms. Ecological Modelling, 1997, 96(1–3): 11–28.
  60. Wei B, Sugiura N, Maekawa T. Use of artificial neural network in the prediction of algal blooms. Water Research, 2001, 35(8): 2022–2028. pmid:11337850
  61. Maier H R, Dandy G C. Determining inputs for neural network models of multivariate time series. Computer‐Aided Civil and Infrastructure Engineering, 1997, 12(5): 353–368.
  62. Pan Y, Xu W, Ran Q. An incremental approach to feature selection using the weighted dominance-based neighborhood rough sets. International Journal of Machine Learning and Cybernetics, 2023, 14(4): 1217–1233.
  63. Chen D, Zhao S, Zhang L, et al. Sample pair selection for attribute reduction with rough set. IEEE Transactions on Knowledge and Data Engineering, 2012, 24(11): 2080–2093.
  64. Ji X, Peng J H, Zhao P, et al. Extended rough sets model based on fuzzy granular ball and its attribute reduction. Information Sciences, 2023, 640: 119071.
  65. Xia S, Wang C, Wang G, et al. GBRS: A Unified Granular-Ball Learning Model of Pawlak Rough Set and Neighborhood Rough Set. IEEE Transactions on Neural Networks and Learning Systems, 2023. pmid:37943647
  66. Xia S, Liu Y, Ding X, et al. Granular ball computing classifiers for efficient, scalable and robust learning. Information Sciences, 2019, 483: 136–152.
  67. Wang G. DGCC: data-driven granular cognitive computing. Granular Computing, 2017, 2(4): 343–355.
  68. Cohen I, Huang Y, Chen J, et al. Pearson correlation coefficient. Noise reduction in speech processing, 2009: 1–4.
  69. Guo G, Wang H, Bell D, et al. KNN model-based approach in classification. In: On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE: OTM Confederated International Conferences, CoopIS, DOA, and ODBASE 2003, Catania, Sicily, Italy, November 3–7, 2003. Proceedings. Springer Berlin Heidelberg, 2003: 986–996.