Machine learning-based anomaly detection of groundwater microdynamics: case study of Chengdu, China

Detection of subsurface hydrodynamic anomalies plays a significant role in groundwater resource management and environmental monitoring. In this paper, based on data from the groundwater level, atmospheric pressure, and precipitation in the Chengdu area of China, a method for detecting outliers considering the factors affecting groundwater levels is proposed. By analyzing the factors affecting groundwater levels in the monitoring site and eliminating them, simplified groundwater data is obtained. Applying sl-Pauta (self-learning-based Pauta), iForest (Isolated Forest), OCSVM (One-Class SVM), and KNN to synthetic data with known outliers, testing and evaluating the effectiveness of 4 technologies. Finally, the four methods are applied to the detection of outliers in simplified groundwater levels. The results show that in the detection of outliers in synthesized data, the OCSVM method has the best detection performance, with a precision rate of 88.89%, a recall rate of 91.43%, an F1 score of 90.14%, and an AUC value of 95.66%. In the detection of outliers in simplified groundwater levels, a qualitative analysis of the displacement data within the field of view indicates that the outlier detection performance of iForest and OCSVM is better than that of KNN. The proposed method for considering the factors affecting groundwater levels can improve the efficiency and accuracy of detecting outliers in groundwater level data.


Materials and method
Overall framework.As shown in Fig. 1, a detection model was established through the following three steps.
Firstly, a dataset containing noise and abnormal values was created and analyzed using four algorithms: sl-Pauta, iForest, OCSVM, and KNN.The accuracy of the detection algorithm was determined through performance evaluation.Secondly, real-time monitoring data was collected and factors affecting the scene were analyzed using Fourier transformation and spectral analysis to determine the frequency of relevant factors.The relevant factors were then removed using inverse convolution and filtering methods to obtain simplified water level data.Finally, the four algorithms were used to detect abnormal values in the processed data, and the performance of different techniques was qualitatively compared using displacement data using professional knowledge.
Model algorithms.Self-learning-based Pauta criterion method.The Pauta criterion is a statistical method for detecting outliers in sample data.When applying it to the detection of abnormal water levels in underground mines, the accuracy of using a certain section of water level data to detect overall data outliers is limited.Therefore, based on the basic principles of the Pauta criterion, an automatic learning function is added www.nature.com/scientificreports/by automatically moving down one position every time a water level data point is detected.The algorithm is shown in Fig. 2, where the Train is water level data used to calculate the mean x and standard deviation σ , and the Test points to the water level data that needs to be detected.Every time the water level of a point is detected, the training segment and testing segment move forward by one position as a whole, allowing the Pauta criterion to achieve automatic learning, which includes the function of adding the latest running results to the mean and standard deviation in real time, resulting in more accurate detection results.The specific process of the algorithm can be found in Appendix 1.1.www.nature.com/scientificreports/Isolated forest algorithm.The Isolation Forest (iForest) algorithm is an unsupervised anomaly detection method based on random binary trees, which is suitable for continuous data.The algorithm defines an anomaly as "easily isolated anomalous values", which are points that are sparsely distributed and far away from highdensity populations in the feature space.In the feature space, areas with sparse distribution indicate that the probability of occurring the event is very low, so it is inferred that data points distributed in sparse regions are anomalous values.The iForest algorithm consists of K trees, each tree is a binary tree.The algorithm randomly selects a subset of samples as the root node, then randomly selects a feature as the cutting point to generate a binary tree.It continuously constructs new leaf nodes until the maximum depth of the binary tree or the final leaf node has only one data point, which generates the completed isolated tree.During the training process, each iTree is randomly selected and independently generated.The more iTrees are used, the more stable the algorithm is.The evaluation formula for anomalous values is as follows: In the formula, c(ψ) is the average path length of samples with a given number of ψ , which is used to standardize the path length h(x) of sample x , h(x) is the height of x in each tree.If s is closer to 1, the possibility of it being an anomalous point is higher.If s is much smaller than 0.5, it is definitely not an anomalous point.If s is around 0.5, there is no possibility of there being anomalous points in the dataset.
One-Class SVM.One-Class SVM was proposed by Bernhard Schölkopf et al. in 2000, and the method is often used to monitor abnormal behavior in data 51 .The basic idea is to compute a hypersphere with the smallest radius in a sample and to include all samples inside this hypersphere.When this hypersphere is used to classify the dataset, the samples that fall inside the hypersphere are the first class (normal values) and the samples that fall outside the hypersphere are the second class (abnormal values).In this method, the kernel and scalar parameters need to be selected to determine the decision boundary and the parameter optimization problem.OCSVM separates a region of hyperspace with more data points close to each other from another data point with less density by computing a decision boundary, and treats these points as anomalies.
The function of the optimization problem is shown in Eqs.(3-5): The determination function of the decision boundary is shown in Eq. ( 6): (1) www.nature.com/scientificreports/KNN.The KNN algorithm is based on the distance for data segmentation and outlier detection 52 .The basic logic is to calculate the average distance between each sample point and its nearest K samples in turn and then use the calculated distance to compare it with a threshold, and if it is greater than the threshold, it is considered an outlier.Use the Euclidean distance d xy formula to calculate the distance between the current point and each point in the data set., sort the points in increasing order of distance, select the k points with the smallest distance from the current point; calculate the distance between the current point and its k neighbors, and take one of the mean, median, or maximum value as the outlier.
Algorithm evaluation.The algorithmic evaluation of outlier detection is considered a dichotomous problem, where the labels can be divided into 0 and 1, and a mixture matrix is constructed (Table 1), which acts as the true case and is listed as the detection result.TP: true value is 1, predicted value is 1, prediction is correct; FP: true value is 0, predicted value is 1, prediction is wrong; FN: true value is 1, predicted value is 0, prediction is wrong; TN: true value is 0, the predicted value is also 0, the prediction is correct.
The evaluation of the results can be judged by precision rate, recall rate, F1 score, and ROC curve.Precision is the probability of the actual positive samples among all the predicted positive samples (the proportion of samples with a prediction of 1 and a true label of 1), which is the prediction result, as shown in Eq. (6).Recall is the probability of the actual positive samples being predicted as positive samples (Eq.7).The false positive rate (FPR) represents the probability of misclassifying a negative case as a positive case (Eq.8).The F1 score is equalized to the evaluation metric as a summed average of the precision rate and recall rate (Eq.9).The ROC curve consists of a series of points (Fig. 3), and each point on the curve corresponds to the classification result when the classifier takes a certain threshold, with the true rate (TPR) as the vertical coordinate and the false positive rate (FPR) as the horizontal coordinates, the points where the threshold value changes move in the direction of the arrow in the graph.In the ROC graph, the closer the curve is to the point (0,1), the better the classification effect, that is, the steeper the curve, the better the evaluation result.Since the ROC curve lacks numerical representation, the area enclosed by the ROC curve and the x-axis (AUC, Area Under Curve) is usually used to evaluate the merit of the result, that is, the closer the AUC is to 1, the better the evaluation result: (7)   Groundwater data were obtained by RST vibrating wire piezometer, the vibrating wire transducer output frequency signal, this frequency is proportional to the pressure applied to the vibrating wire piezometer and is not affected by wire impedance or contact resistance.These sensors can be installed in boreholes or driven into soft ground, the precision of the sensor is 1 mm, and the acquisition frequency is set to 1 time/h.A total of 10,813 well water level monitoring data were selected for research in two time periods from 0:00 on August 20, 2018 to 16:00 on February 18, 2019, and from 16:00 on May 10, 2019 to 23:00 on December 31, 2019.(part of the data from February 18, 19 to May 10, 19 was missing, so this time period was not studied), the well water level variation curve is shown in (Fig. 5a).
The barometric data are obtained from the Baro-Diver barometer, which is placed in the automatic data acquisition device next to the monitoring wells, and the atmospheric pressure in the study area is monitored in real-time at a frequency of 1 time/hour.Since the unit of atmospheric pressure and the unit of water level need to be unified when calculating the atmospheric pressure efficiency later, in order to put the atmospheric pressure and water level together, "mm water column" is used to represent the atmospheric pressure in this paper (Fig. 5b).
The precipitation data were collected by the tipping bucket type automated real-time precipitation monitoring point, which provides a real-time dynamic recording function for the frequency and intensity of precipitation in the area.The transmission interval of monitoring data is 1 time/h, and the precipitation data from August 1, 2018, to December 31, 2019, are finally obtained (Fig. 5c).According to the precipitation monitoring results, it can be visualized that the precipitation in the landslide area is mainly concentrated in the flood season from May to October, and the accumulated precipitation amounts to 1500 mm during the monitoring time.
The earth tide monitoring data were obtained from the OSG-066 superconducting gravimeter located in Lijiang at the Innovation Academy for Precision Measurement Science and Technology, Chinese Academy of Sciences (CAS), and further converted using Tsoft software 53 , the earth tide data from Lijiang were converted to data from the study area, and the time interval for obtaining the data was 1 h.The earth tide variation curve is shown in (Fig. 5d).

Data processing.
Kalman filtering method is used to eliminate the effect of precipitation in groundwater levels.Digital filtering is a method to highlight the effective frequency band waves by using the difference in spectral characteristics to suppress the interference waves in some frequency bands.The basic principle is to filter the irrelevant frequency bands by using the different frequencies of the desired information and the irrelevant information.
Rasmussen and Crawford in 1997 used the inverse convolution regression method to estimate the barometric response function using water level-barometric pressure observations and, through the response function, to calculate the corrected head 54 .The inverse convolution regression method can be used to eliminate both the barometric effect in water level and the effect of earth tide on water level, and it is ideal for use in unconfined aquifers.The air pressure response function using water level data and air pressure observation data: that is: where �W(t) , �B(t) are the values of the change in water level and atmospheric pressure at moment t , �B(t − i) is the value of the change in barometric pressure at moment t − 1 , α(i) is the unit response coefficient at lag i , m is the maximum lag time.
The unit response coefficient α is found using least squares linear regression.once the value of α(j) is found, the stepwise barometric response coefficient A(i) is calculated by summing over the unit response coefficients: where W * t is the correction variable for each observation from m to n in t , m is the maximum lag time, n is the number of total observations in the dataset.
The elimination of the earth tide is the same as the elimination of the barometric effect, using least squares linear regression to find the unit response coefficient γ .After finding the value of γ (γ ) , the stepwise earth tide response coefficient A(γ ) is calculated by summing over the stepwise response: ( 14) Using the response function, calculate the earth tide correction for each observation: The observation water level minus the atmospheric pressure correction value is the observation well water level after the influence of positive atmospheric pressure; the observation water level minus the earth tide correction value is the observation well water level after the correction of the influence of earth tide.The atmospheric pressure and earth tide factors are also taken into account to obtain the water level correction Eq. ( 18).
Where γ (γ ) is the unit corresponding coefficient of the earth tide response, �ET(t − γ) is the value of the change in earth tide at moment t − γ.

Analysis of the main influencing factors of groundwater level. The variation of various elements
of groundwater over time under natural and human factors is called groundwater level dynamics.Based on the formation mechanism of groundwater level dynamics, the influencing factors of groundwater level are divided into two categories: macroscopic and microscopic dynamics (Fig. 6): among them, the dynamic changes of water level caused by the increase or decrease of water in the aquifer are called macro dynamics, such as precipitation recharge, artificial mining, and water injection.For closed aquifers and semi-closed aquifers, these effects often have a long process, showing a trend change of half a month, a few months, or even a few years, which can be analyzed and calculated by the groundwater dynamics method.The stress change of the aquifer leads to the change of the pore water pressure of the aquifer, and the dynamic change of the water level caused by it is called microscopic dynamics.This variation puts emphasis on the dynamic properties and mechanics of the water table during stress-strain processes in aqueous media.
The monitoring well is located in the eastern mounding area of Fenggan Canal (name of the river) (Fig. 4b).affected by the landslide, the residents have moved to the west side of Fenggan Canal, there are no residents living on the slope, and after the site visit survey, there are no residents on the slope for pumping and draining activities in recent years.The nearest railroad is about 3.3 km away from the study area in a straight line, and according to the study of the ground vibration pattern around the area caused by the train operation 55 , the operation of the railroad will not affect the fluctuation of the water level of the well.The river level of Feng Gan Canal is stable all year round, and there is no significant rise or fall in water level, and the elevation of the river level is 398 m, and the elevation of monitoring wells is 528 m, with a relative height difference of 130 m and a distance of 760 m, therefore, the change of river level will not affect the change of well water level near the slip zone soil on the slope body; the formation thickness of the fourth system of monitoring wells in the area is about 8 m, and the area is a subtropical humid monsoon climate zone with abundant precipitation.There is no fault rupture in the area, so there is no geological condition for seismic activity, and there is no historical record of destructive earthquakes in the area, so there is no impact of earthquakes on groundwater during the monitoring measurement period.After the above analysis, the changes in the well water level in the study area are mainly influenced by atmospheric precipitation, barometer, earth tide, and landslides.

Elimination of the main influencing factors of groundwater level.
There are obvious daily, semidiurnal waves, and 1/3 waves in barometer and earth tide.Although the nature of the forces and the mode of action of the earth tide effect and the barometric effect of the well water level are different, both tidal changes can cause tidal changes in the well water level 56 , while the action of barometer on water level has obvious hysteresis 57 .The inverse convolution regression method in the elimination of the barometer effect in the water level can consider the first dozen hours of barometer changes on the water level, so the barometer effect elimination using this method, using the same method to eliminate the earth tide effect in the water level, and finally get the water level change curve after the elimination of barometer, earth tide (Fig. 7), the water level time series graph of the elimination results through the Fourier transform for spectral analysis.The spectrogram shows (Fig. 8) that the water level peaks at 1 cpd, 2 cpd and 3 cpd have all been weakened, and the peak at 2 cpd (half-day cycle) has been weakened most obviously, indicating that the pressure on the water level, solid tide factors have been effectively eliminated.www.nature.com/scientificreports/Through the statistical analysis of the precipitation frequency in the study area (Fig. 5 c), the precipitation frequency in the area is 0.1-0.5 cpd (circle per day, the number of daily cycles), for the groundwater level data after eliminating the influence of air pressure effect and solid tide effect, select the band-stop filter of 0.1-0.5 cpd to eliminate the influence of precipitation, and obtain the water level change curve after eliminating the influence  www.nature.com/scientificreports/ of air pressure, solid tide and precipitation, as shown in Fig. 9, from the results, the groundwater level change after eliminating precipitation becomes more smooth and stable.

Results and discussion
The 4 techniques are first evaluated on a synthetic dataset with known outliers.This is because the performance of different outlier detection algorithms cannot be evaluated if using real field data, since real production data is unlabeled and has no information about the type and number of outliers.Therefore, synthetic data with known outliers are used to quantitatively compare the performance of different algorithms.The hyperparameters of each technique are optimized to maximize its performance and select the best one.The selected techniques are then applied and evaluated on real field data..

Evaluation of anomaly detection techniques on synthetic datasets. Water level time series data
from February to July 2019 were created using R software, the data volume is 3650, the frequency is 1 h, 1.5% white noise is added to the water level data, and the mutated points in the data were treated as outliers, a total of 35 outliers were filtered to constitute the final synthetic data set (Fig. 10).From Fig. 10, it can be seen that there are few outliers in the data and there is no clear boundary between normal and abnormal values, it is difficult to find the location of the outliers directly by visual inspection.Therefore, the above four machine learning techniques are used to detect outliers on synthetic data and evaluate the detection effect.It can be seen from the figure that all four methods detected the location of the anomalies to some extent (Fig. 11), but all four methods failed to mark all 35 anomalies accurately.The detection results of the four methods are given in Table 2.The sl-Pauta method detected 33 outlier results, of which 27 outliers were accurately identified, the number of normal values designated as outliers was 6, and 8 outliers were not detected.The iForest method detected 38 outlier results.Among them, 31 anomalies were accurately identified, the number of normal values designated as anomalies was 7, and the number of undetected anomalies was 4. The OCSVM method detected 36 anomalies, among which 32 anomalies were accurately identified, the number of normal values designated as anomalies was 4, and the number of undetected anomalies was 3. The KNN method detected 36 anomalies, among which 29 anomalies were accurately identified, the number of normal values designated as anomalies was 7, and the number of undetected anomalies was 6.In order to more intuitively evaluate the detection ability of the four methods, Precision, Recall, F1 score and AUC value were used as evaluation indexes (when the number of abnormal and normal values differed greatly, the evaluation effect of the accurary index would be greatly different in the case of unbalanced samples, so the index was not used in the article).The final evaluation results were calculated (Table 3), among which, the evaluation results of ROC curve method are shown in Fig. 12, and the results showed that all four methods performed well in terms of detection effect, and the OCSVM method had an precision rate of 88.89%, a recall rate of 91.43%, an F1 score of 90.14%, and an AUC value of 95.66%, which is the best detection effect; iForest and KNN methods have better detection effect than sl-Pauta, but the performance indexes of these two methods are low compared with OCSVM method.www.nature.com/scientificreports/Evaluation of anomaly detection techniques on processed datasets.Four detection methods in synthetic data are achieved a good detection effect, the four methods will be used for field real-time data detection for adaptability verification, because the field real-time monitoring data can not be marked in advance, and can not be used in common methods to evaluate the detection effect.Therefore, firstly, combined with the site, the site's real-time monitoring data were analyzed and factors eliminated to obtain simplified water level data (Fig. 9), then, the simplified water level data was imported into four methods for abnormal value detection, and the results were inserted into the original water level.The different algorithms were repeated to confirm whether they were truly abnormal values.In addition, the displacement data within the site was compared, and a qualitative analysis of the performance of the four methods was evaluated based on professional knowledge.The data were missing from February 19 to May 19, so the data were divided into two segments for testing from August 20, 2018, to February 18, 2019, and from May 10, 2019, to December 31, 2019.
The detection results show that a total of 42 water level anomalies were identified by the sl-Pauta method during the time period of August 20, 2018 to February 18, 2019 (Fig. 13a), and the distribution of the anomalies on the time series shows that the anomalous values are mainly concentrated in the time period of 9.3-9.15 in 2018.A total of 44 anomalies were identified by the iForest algorithm (Fig. 13b), and the distribution of the anomalies on the time series shows that the anomalies are mainly concentrated in the time period of 9.7-9.12 in 2018.A total of 43 water level anomalies were identified by the OCSVM algorithm (Fig. 13c), and the distribution of the anomalies in the time series shows that the anomalous values are mainly concentrated in the time periods 9.6-9.10 in 2018 and 2019.2.6.A total of 43 water level anomalies were identified by the KNN algorithm (Fig. 13d), and the distribution of the anomalies in the time series shows that the anomalous values are mainly concentrated in the time period of 9.8.-9.10.2018.Through the visualization results of the data and the repeated testing results of four methods, it can be determined that the abnormal point main concentrated area in September 6-12, 2018.
A total of 41 water level anomalies were identified by the sl-Pauta method in the time period of May 10, 2019-December 31, 2019 (Fig. 14a), and the distribution of the anomalies on the time series shows that the anomalies are mainly concentrated in the time period of 7.28-7.30in 2019.A total of 57 water level anomalies were identified by the iForest algorithm (Fig. 14b), and the distribution of the anomalies on the time series shows that the anomalies are mainly concentrated in the time periods of 5.15-5.19,7.29-8.3and 8.8-8.11 in 2019.A total of 55 water level anomalies were identified by the OCSVM algorithm (Fig. 14c), and the distribution of the anomalies on the time series shows that the anomalous values are mainly concentrated in the three time  Interpretation of anomaly detection.Drawing the cumulative displacement-time curve of the landslide in the study area based on the characteristics that the displacement change law of different deformation stages of the slope will be reflected in the water level (Fig. 15) 58,59 , and then test the outlier detection effect of the four methods by analyzing the relationship between displacement changes and outliers.When drawing the displacement-time curve of the slope, the dimensions used by different scholars are inconsistent, which leads to some defects in the displacement tangent angle method.In this paper, the improved tangent angle method proposed by XU Qiang's team 60 was used to draw the T-t curve of the displacement time series during the research period (Fig. 16).The results of the T-t curve showed that there were two landslide events during the study period.The maximum tangent angle of landslide event a was 83°; the maximum tangent angle of landslide event b was 82°, and the landslides reached the alert level at 2:00 on September 8, 2018, and 2:00 on July 31, 2019, respectively.The mechanism of the role of groundwater in destabilizing landslides indicates that when the slope  www.nature.com/scientificreports/begins to slide, the stress on the rock and soil body gradually increases.The pores of the impermeable layer/ weak pervious layer contain microcracks that close due to the compaction of the rock and soil body, resulting in a decrease in the porosity of the groundwater-bearing rock and soil body.As the porosity of the groundwaterbearing rock and soil body continues to decrease, the static water pressure within the system increases, causing water in the groundwater-bearing layer to exchange with the water in the wells only, resulting in an increase in the water level in the wells.After the slope has slid, the stress gradually decreases, and the static water pressure within the rock and soil body decreases.With the acceleration of the slope's slide, the static water pressure within the rock and soil body also decreases rapidly, allowing the groundwater-bearing layer to resume its permeability and restore the water-pressure exchange within the well-bearing system, resulting in a decrease in water level and a return to the state before the slide occurred 61,62 .Comparing the detection results of water level anomalies, the water level detection results of the four methods all show that there are concentrated abnormalities in the water levels of 9.6-9.12 in 2018 and 7.28-8.3 in 2019.Therefore, it can be judged that the four methods have good effects in the outlier detection of water level time series.Further qualitatively comparing the characteristics  Using four methods to detect abnormal water levels in raw data (Fig. 17), the detection results in both time periods were not satisfactory, mainly showing: during the same period, there was some overlap in the results of iForest and OCSVM, but there was no obvious clustering of abnormal values detected by the four methods overall.It is difficult to accurately determine abnormal data due to changes in external environmental conditions that can cause changes in data structure, which can affect the decision-making of algorithms.Compared to the results of simplified water level data detection (Figs. 13, 14) and the displacement T-t curve (Fig. 16), it can be seen that the groundwater detection method considering data structure has some advantages, providing a new idea for future groundwater abnormality detection.

Conclusions
This article proposes a groundwater abnormal value detection method that considers the data structure, and implements and evaluates the performance of four machine learning algorithms.The results show that for the synthesized data set, OCSVM achieves the best detection performance, while iForest and KNN are good candidates for abnormal value detection.The detection technology ofsl-Pauta performs poorly.Secondly, the spectral analysis of water level shows that Kalman filter and inverse convolution regression can effectively remove the effects of rainfall, atmospheric pressure, and solid tide on water level, simplifying the factors affecting water level.Finally, the four technologies are used for simplified water level abnormal value detection, and the qualitative evaluation of detection effectiveness is conducted through the soil displacement-time curve of the This study was primarily based on limited data and has not yet studied how this method can perform a series of abnormal value monitoring tasks on a wider range of data.Future work can collect more time series data from around the world to evaluate the advantages and disadvantages of this method against all available abnormal value detection algorithms currently available.However, the results of this study indicate that it may prove to be a valuable and useful supplement to existing abnormal value detection algorithms in this field.

Figure 4 .
Figure 4. Study area and monitoring equipment ((a) is the geographic location of the study area; (b) is an overview of the study area site; (c) is a photo of the monitoring well site; (d) is regular on-site maintenance of equipment and review of data; (e) is a sketch of the monitoring well layout).

Figure 5 .
Figure 5. Basic data of the study area ((a) is water level elevation data; (b) is barometric data; (c) is antecedent precipitation data; (d) is earth tide data)).

Figure 6 .
Figure 6.Analysis of the main influencing factors of groundwater level.

Figure 7 .
Figure 7. Water level change curve after eliminating barometer and earth tide.

Figure 8 .
Figure 8. Water level spectrum after eliminating barometer and earth tide.

Figure 10 .
Figure 10.Synthetic data set, red dots are added outliers, and black lines are synthetic data.

Figure 11 .
Figure 11.Outlier detection results for synthetic data.

Figure 12 .
Figure 12.ROC curve evaluation results of the four algorithms.

Figure 17 .
Figure 17.The plot of outlier detection results for raw data (a-d represents the detection results for the time period from August 2018 to February 2019, while e-h represents the detection results for the period from May 2019 to December 2019.)

Table 2 .
Test results of sl-Pauta, iForest, OCSVM, and KNN on the raw synthetic dataset.TP number of accurately detected abnormal values, FP number of normal values defined as abnormal, FN number of abnormal values defined as normal, TN number of normal values defined as normal.