A method of water resources accounting based on deep clustering and attention mechanism under the background of integration of public health data and environmental economy

Water resource accounting constitutes a fundamental approach for implementing sophisticated management of basin water resources. The quality of water plays a pivotal role in determining the liabilities associated with these resources. Evaluating the quality of water facilitates the computation of water resource liabilities during the accounting process. Traditional accounting methods rely on manual sorting and data analysis, which necessitate significant human effort. In order to address this issue, we leverage the remarkable feature extraction capabilities of convolutional operations to construct neural networks. Moreover, we introduce the self-attention mechanism module to propose an unsupervised deep clustering method. This method offers assistance in accounting tasks by automatically classifying the debt levels of water resources in distinct regions, thereby facilitating comprehensive water resource accounting. The methodology presented in this article underwent verification using three datasets: the United States Postal Service (USPS), Heterogeneity Human Activity Recognition (HHAR), and Association for Computing Machinery (ACM). The evaluation of Accuracy rate (ACC), Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI) metrics yielded favorable results, surpassing those of K-means clustering, hierarchical clustering, and Density-based constraint extension (DCE). Specifically, the mean values of the evaluation metrics across the three datasets were 0.8474, 0.7582, and 0.7295, respectively.


INTRODUCTION
Water resources accounting is the statistical method of water resources in a certain basin to carry out statistics of various indicators, which is of great significance for water resources management and environmental protection (Li & Qian, 2018).With the rapid development of industry and economy, water pollution has become an increasingly important social problem (Zamora-Ledezma et al., 2021).Nowadays, people pay more attention to integrating public health and safety as well as the environmental economy.In the accounting of water resources, water resources liabilities are the indicators that can best there is a lack of global information for the classification of water resources liabilities requiring complete feature classification.
To solve the above problems, we propose an auxiliary method of water resources accounting based on deep clustering and attention mechanism (Ajay et al., 2022).This method is mainly aimed at the key part of water resource liability accounting in the context of public health data and environmental economic integration.In this article, the index of water resource liability is constructed by feature, and then a deep clustering algorithm is constructed by convolutional operation, and an improved Non-Local module is inserted.Proposed method can automatically cluster the debt levels of water resources, thus facilitating the subsequent debt analysis.This automatic method can greatly save the manpower in the classification of liabilities, and at the same time has high accuracy.Therefore, the proposed method is verified on several common datasets containing complex data, and compared with some traditional and deep learning clustering methods.Experiments show that our method has better performance and has certain application value in the field of accounting.

RELATED WORK Deep clustering methods
Deep clustering method is an end-to-end unsupervised clustering method based on deep learning technology.Compared with the supervised deep learning method, unsupervised learning is suitable for applications that lack sufficient prior knowledge.Besides, it is difficult to label categories of the supervised deep learning method manually (Guo et al., 2022).Water resources accounting data is large and complex, and it is difficult to obtain high quality labels through labeling.In similar difficult to label work, predecessors have achieved unsupervised problem solving through different deep clustering methods.Denny & Spirling (2018) proves that clustering algorithm can reduce the dependence on annotations and turn the method that needs strong supervision algorithm implementation into weak supervision implementation.Instead of turning strong supervision into weak supervision, Iakovidis & Georgakopoulos (2018) analyzed data-driven phenotypes of patients with heart failure using simple deep clustering method, echocardiography and electronic health record data.However, the volume of health data is small in other areas, Ulloa & Wehner (2017) used the deep clustering method to visually supervise the representation learning of large unlabeled data sets, and realized short-term traffic prediction and applied it to the intelligent transportation system.These deep learning clustering methods can automatically and intelligently complete some data analysis work.However, compared with these work, in the water resources accounting work under the background of public health data and environment integration, the feature matrix composed of data is more integrated, and more global information is needed to increase the accuracy of clustering, so as to more accurately grade the water resources liabilities in various regions.

Attention mechanism
An important significance of attention mechanism in the field of vision is to give global information to the feature map, and the receptive field of convolution operation is expanded to the whole map.In the field of accounting, the feature matrix generated by the original data has the same organizational structure as the image in the visual field, which also requires the addition of global information.Han & Zheng (2019) built an end-to-end model using self-attention mechanism and convolutional neural network to solve machine reading and question answering problems.The attention mechanism method used in the model is integrated into convolution operation, so as to focus more on difficult samples to improve the performance of the model.Wei & Dohan (2018) proposed a self-attention generation countermeasure network that allows attention driven remote dependency modeling of image generation tasks.This is consistent with our concept of adding global information to the network.Zhang & Goodfellow (2019) proposed a non-local module, which can be directly inserted into the convolutional neural network to collect rich global information for the network.As a separate module, Non-local can well improve the performance of the model and is widely used in supervised methods.However, this module has the problems of large computation and redundant information.These problems make it difficult for non-local to act on larger characteristic matrices.The water resources liability characteristic matrix has a large scale.If you want to use non-local to fuse the global information, you need to improve its structure (Wang & Girshick, 2018).

METHODS
To automate and intelligently conduct water resource accounting within the context of integrating public health data and environmental economy, we propose a methodology aimed at aiding in the assessment of water resource liabilities.The schematic representation of the overall methodology can be observed in Fig. 1.Initially, water samples are collected, and several water quality evaluation indicators, which directly impact water resource liabilities, are generated.Subsequently, water quality indicator tables are constructed.The subsequent stage involves the creation of features based on the generated table, resulting in the generation of a feature matrix, which is subsequently normalized.The third phase entails the development of a deep clustering model, followed by model evaluation and unsupervised clustering of the constructed features.Ultimately, the outcomes of deep clustering are utilized to evaluate the debt level of water resources across various regions.Based on the debt level of water resources, distinct accounting

Feature structure
We selected several water quality indicators that have a greater impact on water resource liabilities as the evaluation criteria for the classification of water resource liabilities: chlorophyll (Chl.a), pH value, transparency (SD), dissolved oxygen (DO), total oxygen (TN) and total phosphorus (TP).After collecting water samples from a certain area and detecting the original data, it is necessary to establish the original data table .Table 1 shows the sample data format of multiple samples in a certain region.Each column of the table represents the data of a different sample, and each row represents a different water quality index of a sample.The input format of the neural network is in the form of eigenvectors.In order to adapt the original data to the needs of the deep clustering model input, we use the data in the table to generate a feature matrix, as shown in Formula (1).The matrix encompasses a sequential incorporation of data into the sample data table, with each characteristic matrix representing the water quality attributes specific to a given geographical region.It is noteworthy that the number of samples surpasses the mere five exemplified in the formula when a sufficient quantity of samples is procured.In order to facilitate gradient calculation within the neural network framework, the feature matrix undergoes a normalization process.The normalization of the feature matrix serves to eradicate the potential influence arising from variations in units and scales among the distinct features, thereby ensuring equitable treatment of each dimension's feature.Consequently, this normalization procedure manifests an enhanced convergence velocity and ultimate accuracy for the model.The complete feature building process is shown in Fig. 2. Note: Table 1 shows the sample data format of multiple samples in a certain region.The input format of the neural network is in the form of eigenvectors.
Zhou ( 2023 The self-attention mechanism module, which we have incorporated, is positioned between the first and third convolutional layers.This module serves the purpose of incorporating global information into the feature extraction process.A comprehensive explanation of the Self-Attention module will be provided in the subsequent section.Following the completion of the feature extraction process, the model expands the extracted features into a one-dimensional long vector, with the second dimension set to 1.This vector then enters the fully connected layer to obtain a low-dimensional representation of the features.Subsequently, the acquired features are fed into the final clustering layer, where the clustering algorithm carries out the ultimate classification.Diverging from supervised deep learning models, our unsupervised deep clustering algorithm employs the reconstruction error function as the loss function, as depicted in Eq. (2).
(2) At the last clustering layer, we use the classical K-means algorithm to perform the final clustering.First, we use the soft assignment based on t distribution to calculate the similarity s ij between the position feature representation z i and the cluster center point u j .
The calculation formula is 2 , where d ij ¼ jjz i À u i jj 2 2 is the Euclidean distance between the feature representation z j and the cluster center point u j , and a > 0 is the degree of freedom of the t distribution; In addition, the super parameter others increases the similarity between the nearest feature z i and the cluster center u j .The probability value obtained after s ij normalization is used as the soft assignment of clustering, as shown in Formula (3): Secondly, auxiliary distribution q ij is constructed from soft distribution p ij to make auxiliary distribution p ij more reliable, as shown in Formula (4): where b is an exponential power.Two probability distributions p i and q i are obtained from Eqs.
(3) and (4).When the probability value q i of a class cluster in q ic is greater than 1=K, the probability value of the corresponding p ic must be greater than q ic , and the corresponding probability of other positions of p i is less than q i .This shows that auxiliary distribution p i improves the probability of belonging to a certain cluster, while reducing the probability of belonging to other clusters, and improves the reliability of clustering.The method described in this section mainly uses convolution operations for feature extraction, and then unsupervised clustering of the extracted representations.In addition, the traditional K-means clustering is improved to make it have higher confidence.This model can automatically grade water debt data.

Improved non-local self-attention module
Although convolution operation has strong feature extraction ability, its receptive field is limited.In order to add global information to the feature extraction process, that is, to associate the elements in the feature matrix with all other elements, we introduce the selfattention module.The original non-local module is plagued by large number of parameters and difficulty in calculation.In order to solve this problem, we have made some improvements to the original structure to improve the operation efficiency.The overall structure of the self-attention module is shown in Fig. 4.
The specific process consists of the following steps: Step 1: Initially, the input characteristic matrix undergoes a convolution operation to derive key, query, and value matrices.Each of these matrices captures distinct aspects of the input data.
Step 2: Subsequently, the maximum average pooling method is employed to sample the aforementioned matrices, effectively eliminating redundant information and reducing the parameter count.This technique enhances the feature extraction capability of the model.
Step 3: The key and query matrices are pointwise multiplied to yield a weight matrix.This matrix represents the importance or relevance of different elements within the input data.
Step 4: The weight matrix is then normalized using the softmax function, resulting in weight coefficients.This normalization step ensures that the coefficients sum up to 1 and facilitates proper weighting of the different elements.
Step 5: The weight coefficients are multiplied element-wise with the value matrix, and channel attention is applied to the corresponding positions across all channels of each characteristic matrix.This step allows for refining and highlighting relevant features within the input data.
Step 6: Finally, the resulting characteristic matrix, which embodies the refined and attended features, is outputted as the final representation of the data.Among them, the amount of calculation for sampling key, query and value matrices is greatly reduced.Because redundant information affecting accuracy is removed at the same time, experiments show that this operation makes the module have a certain amount of improvement in performance.

EXPERIMENTS AND ANALYSIS
In order to evaluate the clustering effect of the proposed method, we use some common data sets to evaluate (https://zenodo.org/record/7924095#.ZFyzJHZByNc): (1) The USPS (2) The Heterogeneous Human Activity Recognition (HHAR) dataset contains 10,299 sensor records from smartphones and smartwatches.All samples were divided into six categories of human activities, including riding, sitting, standing, walking, going up and down stairs.
(3) ACM is from the article network of ACM dataset.If two articles are written by the same author, there is an edge between them.The article features a bag of keywords.We selected the articles published in KDD, SIGMOD, SIGCOM and MobiCOMM and divided them into three categories (database, wireless communication and data mining) according to their research fields.All experiments were completed in the following environments: Intel (R) Xeon (R) Bronze 3204 CPU @ 1.90 GHz, 32 GB RAM, GPU Tesla V100, CentOS Linux release 7.6.1810.
In the experiment, Adam algorithm is used for optimization, Dynamic Decay Learning Rate and Batch Normalization are used.

Evaluation metrics
For clustering results, we use common centralized evaluation indicators to measure the performance of the algorithm: accuracy rate (ACC), regular mutual information (NMI), and RAND index (ARI).
(1) The accuracy rate (ACC) is used to represent the correctly classified samples.
where d is an indicator function; l i is the clustering index of x i obtained by the algorithm; y i is the true label of x i ; map is a transformation function, which maps each cluster label l i to a category as a whole, making the predicted label closer to the real label as a whole.This process can be obtained through Kuhn Munkres.
(2) Normalized mutual information (NMI) is another commonly used measure of clustering effect, which is used to represent the regularization amount of real label information that can be obtained after knowing the predicted clustering results.

Comparison of different methods
We choose several existing clustering methods to compare with our proposed methods: (1) K-means, a classical clustering method based on original data.(2) Hierarchical clustering, tree clustering structure, classical algorithm.(3) DEC, a deep clustering method, designed a clustering target to guide the learning of data representation.See Table 2 for the evaluation indicators and comparison of several methods.It can be seen that our method greatly surpasses the two traditional clustering methods in performance, which depends on the feature extraction ability of convolutional neural network and the global information collection of attention mechanism.Compared with the DEC method of deep learning clustering method, the proposed method has also achieved better performance in the three data sets.The experimental results confirm the effectiveness and practical value of the proposed method.The proposed method can make considerable contributions to the accounting of water resources under the background of public data security and environmental economy integration.

Clustering visual analysis
In order to more intuitively reflect the performance advantages of our proposed methods, we have made visual analysis on different data sets for the four methods used for comparison.The visual analysis results are shown in Fig. 5.It can be seen that the clustering method proposed by us is superior to both the traditional clustering method and the deep clustering algorithm DEC.The clustering effect of our proposed method is better, the embedding is more obvious, the overlap is less, and the clustering of each group of nodes is better.

Water resource liability rating results of deep clustering
In our study, we chose a water sample from Lugu Lake as the initial material for clustering analysis.The clustering results obtained are presented in Table 3.Based on the information derived from the water samples, an automated evaluation of the overall water quality of a given region can be performed, enabling the determination of its water resource liability level.The obtained results serve as evidence that the methodology proposed by our research team makes a significant contribution to the assessment of water resources.Moreover, it substantially reduces the time required for manual analysis and evaluation, offering an efficient and time-saving approach.We selected Lugu Lake water sample as the initial clustering material, and the clustering results are shown in Table 3.
According to the water sample information, the overall water quality of a region can be automatically evaluated to obtain its water resource liability level.

Figure 1
Figure 1 Overall flow of water resource liability analysis with deep clustering.Full-size  DOI: 10.7717/peerj-cs.1571/fig-1 ), PeerJ Comput.Sci., DOI 10.7717/peerj-cs.15715/14 Clustering model of deep learning with attention mechanism In order to improve the accuracy of clustering and reasonably classify the calculation level of water resources liabilities, we propose a deep clustering model with attention mechanism.The model learns by self-monitoring, extracts the features of the original data with convolution operation, and classifies the water resource liability level of a region by integrating these features.The overall structure of the deep clustering model is shown in Fig. 3.The model has four convolution layers for feature extraction.

Figure 4
Figure 4 Structure diagram of self-attention module.Full-size  DOI: 10.7717/peerj-cs.1571/fig-4 L correspond to real tags and predictive tags.H is the information entropy function and ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi HðYÞHðLÞ p is the regularization l term.(3) The adjusted rand index (ARI) is used to measure the coincidence of two data distributions.where, n represents the total number of samples, C represents the correct classification, and K represents the predictive clustering results.a is defined as the number of instance pairs that are divided into the same category in C and the same cluster in K. b is defined as the number of instance pairs divided into different categories in C and different clusters in K.The value range of ARI is [−1,1].The larger the value, the better the clustering effect.

Table 1
Raw form of water quality.

Table 2
Comparison of clustering performance between our method and existing methods.Which depends on the feature extraction ability of convolutional neural network and the global information collection of attention mechanism.Compared with the DEC method of deep learning clustering method, the proposed method has also achieved better performance in the three data sets.

Table 3
Clustering results of water resources liability levels. Note: