A Real-Time Detection Method for Abnormal Data of Internet of Things Sensors Based on Mobile Edge Computing

Traditional algorithms for anomaly detection in sensor data usually focus only on the continuity of single-source data and ignore the spatiotemporal correlation between multisource data, which reduces detection accuracy. In addition, because sensor data are growing rapidly, centralized cloud computing platforms cannot meet the real-time detection needs of large-scale abnormal data. To solve these problems, a real-time detection method for abnormal data of IoT sensors based on edge computing is proposed. Firstly, sensor data are represented as time series, and the K-nearest neighbor (KNN) algorithm is used to detect outliers and isolated groups in the time series data stream. Secondly, an improved DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm is proposed that considers the spatiotemporal correlation between multisource data. Its parameters are set according to the sample characteristics in the window, which overcomes the slow convergence caused by global parameters and large samples, and it makes full use of data correlation to complete anomaly detection. Moreover, this paper proposes a distributed anomaly detection model for sensor data based on edge computing. It performs data processing on computing resources as close to the data source as possible, which improves the overall efficiency of data processing. Finally, simulation results show that the proposed method has higher computational efficiency and detection accuracy than traditional methods and is feasible.


Introduction
In recent years, with the continuous development and integration of technologies such as the Internet, IoT, and cloud computing, a large number of sensor devices have been widely used in different fields such as power systems and thermal systems [1,2]. Usually, sensors collect data at a certain frequency and send data to corresponding data receivers. The data receiver receives one or more sets of observations in a strict order; these observation data are basically time series data [3]. Time series data accurately records the real-time changes of a certain parameter and reflects trends and change law within a certain time range. Therefore, the time series data collected by sensor devices are not only an important data source for data visualization but also the basis of data mining (such as classification, prediction, clustering, and association) [4][5][6]. However, there will always be some abnormalities in the data collection and transmission processes of sensor equipment, such as error codes and sensor failures in actual data collection scenarios [7,8]. Thus, in order to provide high-quality source data for subsequent data mining research, it is very necessary to effectively identify outliers in sensor data from the perspective of time series data analysis.
According to the characteristics of wireless sensor networks (WSNs), anomaly detection methods are divided into statistics-based, classification-based, clustering-based, and neighbor-based methods. Single-sensor data streams usually use the time correlation of data for anomaly detection, and many applications rely on statistical analysis and nearest neighbor distance. Multisensor data streams have both time and space correlation, and cluster-based methods are usually used for detection; however, the time correlation of sensors is often ignored in the clustering process. For example, reference [9] proposed a new anomaly detection algorithm for time series data by constructing a distributed recursive computing strategy and a KNN quick selection strategy. Reference [10] proposed a clustering algorithm that used local parameters for unbalanced data to detect abnormal data. Reference [11] applied the K-means algorithm to cluster analysis of iris data with 5 attributes. Compared with traditional methods, outlier removal clustering (ORC) technology achieved better results. Reference [12] used spatiotemporal (ST) correlation and detected outliers by calculating the cross-correlation between sensor data streams. However, traditional detection algorithms mainly focus on the sequence continuity of single-source sensor data and ignore the correlation between multisource sensor data. In addition, it should be emphasized that current common methods for detecting and processing anomalies in sensing data use relatively mature cloud computing models and common big data processing products: the data obtained by various data collection devices are transmitted directly to the cloud computing center for processing and storage, and the powerful computing capacity of the cloud computing center is then used to complete the corresponding anomaly detection and data cleaning work [13].
Although there may not be a clear correlation between time series data from different sensors, their inherent characteristics may be highly correlated. If these data are uploaded to the data center for feature extraction, they place heavy computational pressure on the data center. Since underlying devices, including sensors, have certain computing capabilities, hidden features can be extracted on these devices. To improve the detection accuracy of abnormal sensor data, a real-time detection method for abnormal data based on edge computing is proposed to better meet the real-time detection requirements of large-scale abnormal data. The parameters use the nonmutation characteristics of time series data, and the improved DBSCAN algorithm uses the spatial correlation characteristics of multidimensional data in the KNN algorithm. The proposed algorithm fully considers data relevance and can effectively mine the potential relationships in the data. Furthermore, a distributed anomaly detection model for sensor data based on edge computing is proposed to process data on computing resources as close to the data source as possible. Experiments show that the proposed algorithm improves both detection accuracy and overall data processing efficiency.

Sensor Data Modeling Based on Time Series
Definition 1. The multisource data collected and transmitted by sensors can be expressed as the following time series data:
$$TD_M = \{D_1, D_2, \ldots, D_i, \ldots, D_M\}$$
For a single data source $D_i$, a sliding window (SW) is introduced to store part of the data of $D_i$. The length of SW is denoted $Length_{SW}$.
Definition 2. According to some known correlations in the multisource data set $TD_M = \{D_1, D_2, \ldots, D_i, \ldots, D_M\}$, $TD_M$ is combined and transformed as necessary to obtain a new time series (denoted $TD'_K$), whose parameters are entered into the related parameter set $\Omega_K = \{D'_1, D'_2, \ldots, D'_K\}$. Then, abnormal data can be found by detecting the linear correlation of $\Omega_K$ in the data correlation detection (DCD) process. According to Definition 1, the correlation of $TD_M$ can be evaluated from the linear correlation of $TD_M$ within the same SW. Because $TD_M$ may have no linear correlation, or only a nonlinear one, it is necessary to convert $TD_M$ into a multisource signal $TD'_K$ with linear correlation characteristics for subsequent DCD processing. Three kinds of correlation exist:
(1) Basic Correlation. It can also be called linear correlation. Taking the power system as an example, a binary time series $TD_2 = \{P, f\}$ composed of generator active power output $P$ and grid frequency $f$ is defined. Due to the droop characteristic of the power system, $P$ and $f$ satisfy a binary linear correlation. Thus, let $\Omega_2 = \{P, f\}$ be the binary related parameter set. Similarly, if ternary time series data $TD_3 = \{D_1, D_2, D_3\}$ satisfies ternary linear correlation, the corresponding related parameter set is $\Omega_3 = \{D_1, D_2, D_3\}$.
(2) Combination Correlation. This means that no single time series $D_i$ in a given $TD_M$ is linearly correlated, but a linear correlation appears after the $D_i$ are combined. Taking the thermal system as an example, a ternary time series $TD_3 = \{Q, W, t\}$ composed of thermal power $Q$, instantaneous temperature observation value $W$, and time $t$ is defined. According to the basic heat power theorem in thermodynamics, there is a positive linear relationship between $Q$ and the temperature change rate $\dot{W}$ (i.e., $\Delta W / \Delta t$). In other words, the combined pair $\{Q, \Delta W / \Delta t\}$ is linearly correlated.
(3) Nonlinear Correlation. That is, in a given $TD_M$, there is no basic correlation or combination correlation in the time series, but there is a nonlinear correlation, such as an exponential, hyperbolic, or polynomial model. For example, the flow optimization coefficients $m$ and $b$ of a radiator in a thermal system satisfy the hyperbolic relationship $m = 1/(1 + b)$. Another example is that the kinetic energy $E$ and the angular velocity $\omega$ of the generator rotor in the power system satisfy the polynomial relationship $E = J\omega^2/2$. Through suitable data transformations, nonlinear correlation models can be converted into linear correlation models.
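The conversion from a nonlinear to a linear correlation model can be illustrated with the two relationships quoted above. The following is a minimal sketch, not the paper's implementation: the inertia value `J`, the sample values, and the helper `pearson` are assumptions chosen for illustration. It checks that substituting $u = \omega^2$ (polynomial case) or taking $1/m$ (hyperbolic case) yields a pair of series whose Pearson correlation is essentially 1.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Hypothetical rotor data following E = J * omega^2 / 2 (J is an assumed value).
J = 2.0
omega = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
energy = [J * w ** 2 / 2 for w in omega]

# The raw pair (omega, E) is nonlinear; the pair (omega^2, E) is linear.
r_raw = pearson(omega, energy)
r_lin = pearson([w ** 2 for w in omega], energy)

# Hypothetical radiator data following m = 1 / (1 + b); then 1/m = 1 + b is linear in b.
b = [0.5, 1.0, 1.5, 2.0, 3.0]
m = [1 / (1 + v) for v in b]
r_hyp = pearson(b, [1 / v for v in m])
```

After such a transformation, the transformed series can be treated like a basic (linear) correlation in the later detection step.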

Proposed Real-Time Detection Algorithm for Abnormal Sensor Data
Due to the huge amount of sensor data in IoT, a traditional centralized cloud computing framework may be inefficient at running detection algorithms when computing resources are limited. Therefore, a detection framework based on edge computing is proposed to detect abnormal sensor data.

Real-Time Detection Framework of Sensor Data Based on Edge Computing.
The linear growth of centralized cloud computing power has been unable to keep up with the rapid growth of data processing needs for edge devices [14]. Besides, from a technical or economic point of view, it is unlikely that the ever-growing edge data can be concentrated in one or more data computing centers to complete the corresponding computing tasks.
In the edge computing framework, computing tasks are assigned to many distributed devices with certain computing capabilities [15][16][17][18]. Therefore, computational efficiency can be improved while the performance requirements of the computing equipment are reduced. An edge computing architecture is therefore built to detect abnormal sensor data in real time. As shown in Figure 1, the corresponding edge layer data node is established near the sensor data collection terminals to complete the detection task of related data while receiving sensor data.
The edge computing function is the core function of this system. In this paper, the main content of edge computing is abnormal data detection, estimation, and correction; other edge computing tasks can be added to this functional module according to actual needs. The edge computing module implements sequence generation, anomaly detection, and correction. First, the configuration information is retrieved from the database, and the parameters that require edge processing are recorded. The data to be processed are then divided into different sequences according to these parameters to facilitate subsequent processing. Finally, the anomaly detection and estimation correction algorithms are run on the corresponding sensor data sequences, abnormal data are marked, and estimated correction values are added.
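The three functions of the edge computing module (sequence generation, anomaly detection, and correction) can be sketched as a small pipeline. This is only an illustration under stated assumptions: the record layout `(param, timestamp, value)`, the function names, and the median-based detector and correction are hypothetical stand-ins, not the detection algorithm of this paper.

```python
from collections import defaultdict
from statistics import median

def generate_sequences(records, configured_params):
    """Split raw (param, timestamp, value) records into one
    time-ordered sequence per configured parameter."""
    sequences = defaultdict(list)
    for param, ts, value in records:
        if param in configured_params:  # only parameters listed in the configuration
            sequences[param].append((ts, value))
    for seq in sequences.values():
        seq.sort()  # order by timestamp
    return dict(sequences)

def detect_and_correct(sequence, threshold):
    """Placeholder detector (assumption): flag values far from the sequence
    median and attach the median as the estimated correction value."""
    center = median(v for _, v in sequence)
    out = []
    for ts, value in sequence:
        abnormal = abs(value - center) > threshold
        out.append({"ts": ts, "value": value, "abnormal": abnormal,
                    "corrected": center if abnormal else value})
    return out

def edge_process(records, configured_params, threshold):
    """Run sequence generation, then detection and correction per parameter."""
    sequences = generate_sequences(records, configured_params)
    return {p: detect_and_correct(s, threshold) for p, s in sequences.items()}

# Hypothetical records arriving at an edge node.
records = [("temp", 2, 20.5), ("temp", 1, 20.0), ("flow", 1, 3.0),
           ("temp", 3, 95.0), ("temp", 4, 20.2), ("other", 1, 0.0)]
result = edge_process(records, {"temp", "flow"}, threshold=10.0)
```

In a real deployment the placeholder detector would be replaced by the KNN and improved DBSCAN steps described in the following subsections.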

Outlier and Isolated Group Detection Based on KNN Algorithm.
The basic rule of the KNN algorithm is to find the $c$ nearest neighbors of a sample among $N$ samples (where $1 \le c < N$). When $c = 1$, the KNN problem is equivalent to the nearest neighbor problem. $c_i \in \mathbb{N}$ represents the number of sensor data belonging to category $i$. In general, the category of a sensor datum is decided by the voting principle, and $c$ is set to an odd number to avoid ties caused by equal votes. Under the voting principle, the sampled data $x$ is assigned to the category $i$ with the largest $g_i(x)$, where $g_i(x)$ is the number of sensor data belonging to category $i$. The steps of the KNN algorithm are as follows: (1) Distance Calculation. For a given sensing data set, calculate the distance between each pair of objects in the training set. In this paper, the Euclidean distance is used.
For two $n$-dimensional vectors $a(x_{11}, x_{12}, \ldots, x_{1n})$ and $b(x_{21}, x_{22}, \ldots, x_{2n})$, the Euclidean distance is given as follows:
$$d_{12} = \sqrt{\sum_{k=1}^{n} (x_{1k} - x_{2k})^2}$$
where $d_{12}$ is the Euclidean distance between sensor data $a$ and $b$. The test objects are then classified according to the majority category of their $c$ nearest neighbors.
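The distance calculation and the voting principle can be sketched in a few lines. This is a minimal illustration, assuming a labeled training set of hypothetical two-attribute readings; the function names and the labels "normal" and "abnormal" are not from the paper.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(sample, training, c):
    """Assign `sample` to the majority category among its c nearest
    neighbors; `training` is a list of (vector, label) pairs."""
    neighbors = sorted(training, key=lambda item: euclidean(sample, item[0]))[:c]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical labeled readings (two attributes each).
training = [((20.0, 3.0), "normal"), ((21.0, 3.1), "normal"),
            ((19.5, 2.9), "normal"), ((80.0, 9.0), "abnormal"),
            ((82.0, 9.5), "abnormal")]

label_a = knn_classify((20.4, 3.0), training, c=3)
label_b = knn_classify((81.0, 9.2), training, c=3)
```

Using an odd `c`, as the text recommends, avoids ties when there are two categories.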

Abnormal Sensor Data Detection Based on DBSCAN Algorithm.
The basic DBSCAN algorithm uses the globally unique parameters Eps and MinPts to achieve clustering. Inspired by the use of local parameters for clustering in reference [10], this paper proposes a method that partitions the data with an SW and uses local parameters to achieve density clustering of small-sample data. The algorithm flow is shown in Figure 2.
The clustering process consists of three parts, namely, parameter update, clustering, and anomaly detection.
During the parameter update process, the size of the clustering window is set and the average distance difference between attributes in the window is calculated. K is taken as the number of points in the neighborhood (MinPts), and the Euclidean distance between attributes is taken as the radius Eps, to ensure that the data clustering is correct in a single case. In the corresponding formula, $y_i$ is the average distance difference of attributes and $M$ is the size of SW.
The DBSCAN algorithm is used in the clustering process. In view of the inconsistency between attributes, weights are assigned to each attribute to reduce the impact on the clustering effect. The weight $W_{XY}$ is calculated by the correlation coefficient, and the formula is as follows:
$$W_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$
where $\mathrm{Cov}(X, Y)$ is the covariance of attribute $X$ and attribute $Y$, $\mathrm{Var}(X)$ is the variance of attribute $X$, and $\mathrm{Var}(Y)$ is the variance of attribute $Y$.
The anomaly detection process analyzes the clustering results. In the clustering process, an object marked as an abnormal point for the first time is recorded as a candidate abnormal point, and its abnormal score (initially 0) is incremented by 1. The candidate abnormal points enter the next cycle, continue clustering, and have their abnormal scores updated. If the abnormal score $S$ is equal to the number of clustering rounds $C$ (the number of clustering rounds is the inverse of the SW overlap rate), the object is marked as an abnormal point; otherwise, it is a normal point.
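The interplay of local parameters, per-window clustering, and the abnormal score can be sketched as follows. This is a simplified illustration, not the paper's algorithm: the rule in `local_eps` (mean distance to the K-th nearest neighbor in the window) is an assumed stand-in for the Eps formula, attribute weighting is omitted, and a point is confirmed as abnormal when it is flagged in every window that covers it. `dbscan_noise` uses the standard DBSCAN definition of noise (a point that is neither a core point nor within Eps of a core point).

```python
import math

def dbscan_noise(points, eps, min_pts):
    """Indices of DBSCAN noise points: not core, and no core point within eps."""
    n = len(points)
    neigh = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
             for i in range(n)]
    core = {i for i in range(n) if len(neigh[i]) >= min_pts}
    return {i for i in range(n)
            if i not in core and not any(j in core for j in neigh[i])}

def local_eps(window, k):
    """Assumed local radius: mean distance to the k-th nearest neighbor."""
    kth = []
    for p in window:
        d = sorted(math.dist(p, q) for q in window)
        kth.append(d[k])  # d[0] is the distance of the point to itself
    return sum(kth) / len(kth)

def detect_anomalies(points, w, x, min_pts):
    """Slide a window of size w with step w/x, cluster each window with
    local parameters, and accumulate an abnormal score per point."""
    step = max(1, w // x)
    score = [0] * len(points)   # times a point was flagged as noise
    rounds = [0] * len(points)  # windows (clustering rounds) covering a point
    for start in range(0, len(points) - w + 1, step):
        window = points[start:start + w]
        noise = dbscan_noise(window, local_eps(window, min_pts), min_pts)
        for offset in range(w):
            rounds[start + offset] += 1
            if offset in noise:
                score[start + offset] += 1
    return [i for i in range(len(points))
            if rounds[i] > 0 and score[i] == rounds[i]]

# Hypothetical single-attribute stream with one spike at index 6.
stream = [(20.0,), (21.0,), (22.0,), (21.0,), (20.0,), (22.0,),
          (50.0,), (21.0,), (20.0,), (22.0,), (21.0,), (20.0,)]
anomalies = detect_anomalies(stream, w=6, x=2, min_pts=3)
```

Requiring a point to be flagged in every covering window is what filters out candidates that look abnormal in only one local context.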
According to Definition 1, the multisource sensor data set is represented as a time series $TD_M = \{D_1, D_2, \ldots, D_M\}$. When multisource sensor data enter SW, the observations whose number equals $Length_{SW}$ are selected to form a new multisource sensor data set $TD'_M$. According to Definition 2, the part of the time series set in $TD'_M$ with known correlation is represented as $TD'_{sub}$, which can be combined or transformed into a new time series $TD'_K$ with linear correlation. Then, the parameters in $TD'_K$ are entered into the set $\Omega_K$. Since there is usually a certain correlation between the data collected by sensors, this correlation can be used to determine whether the sensor data are abnormal within a certain time range.
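One simple way to use the correlation inside a window is to compute the linear correlation of a related parameter pair per SW and flag windows where it breaks down. The sketch below is an assumption-laden illustration rather than the paper's DCD procedure: the non-overlapping windows, the threshold `r_min`, the function names, and the sample data are all hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def dcd_flag_windows(xs, ys, length_sw, r_min):
    """Start indices of windows whose |Pearson r| falls below r_min."""
    flagged = []
    for start in range(0, len(xs) - length_sw + 1, length_sw):
        r = pearson(xs[start:start + length_sw], ys[start:start + length_sw])
        if abs(r) < r_min:
            flagged.append(start)
    return flagged

# Hypothetical linearly related pair with one corrupted reading at index 7.
x = [float(v) for v in range(1, 11)]
y = [2 * v + 1 for v in x]
y[7] = 0.0
flagged = dcd_flag_windows(x, y, length_sw=5, r_min=0.9)
```

A flagged window only localizes the problem to a time range; identifying the individual abnormal point within it still requires the clustering step.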

Algorithm Performance Analysis.
The time complexity of the DBSCAN algorithm is dominated by the time required to find the points in the Eps-radius neighborhood of each point, and it is $O(n^2)$ in the worst case. The improved DBSCAN algorithm uses an SW to divide the data, so its time frequency depends on the algorithm input scale $n$, the SW size $w$ (so $w^2$ is a constant), and a constant $x$ that sets the sliding step to $w/x$. Thus, the time complexity $T_2(n)$ of the improved DBSCAN algorithm is linear in $n$. The time complexity of the proposed algorithm is the sum of the time complexities of KNN and the improved DBSCAN clustering algorithm. It can be seen that the time complexity of the anomaly detection algorithm increases linearly, so as the number of processed data samples increases, its time efficiency is higher than that of the basic DBSCAN algorithm.
The hardware environment includes a 1 TB SSD. The software environment is ZooKeeper V3.4.8, JDK v1.8.0, and Storm V1.0.0. All algorithms run on CentOS 6.5. The experimental data come from the urban heating system in Baohe District, Hefei, with 18031 users and 400 buildings. In the data collection process, the transmitters of each building collect the heating data of each room and send them to the corresponding receivers.
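Returning to the complexity analysis of the improved DBSCAN algorithm, the linear bound can be motivated by a simple counting argument. The expression below is a sketch built only from the stated definitions of $n$, $w$, and $x$; the window count and the per-window cost are assumptions and are not the paper's original equations.

```latex
T_2(n) \;\approx\;
\underbrace{\frac{n}{w/x}}_{\text{number of windows}}
\cdot
\underbrace{O(w^2)}_{\text{worst-case cost per window}}
\;=\; O(x\,w\,n) \;=\; O(n)
\quad (w,\ x \text{ constant}),
\qquad
T_{\text{basic}}(n) = O(n^2).
```

Because $w$ and $x$ are fixed, doubling the input roughly doubles the number of windows, whereas the worst-case cost of the basic algorithm on the whole data set quadruples.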

Execution Efficiency of Cloud Computing and Edge Computing Platforms.
The algorithm is implemented on a centralized cloud computing platform [19] and a distributed edge computing platform. Table 1 shows the average processing time for each step. As can be seen from Table 1, because cloud computing must transmit a large amount of data, the corresponding bandwidth pressure is also greater. Therefore, the transmission delay of cloud computing is longer than that of edge computing. Moreover, despite the relatively strong computing power of cloud computing platforms, the time required for each step using cloud computing is longer by 211.3 ms and 101.1 ms, respectively. Therefore, the computational efficiency of edge computing is higher.
This is because the paper proposes an anomaly detection model for sensor data based on edge computing and uses the big data processing idea of edge computing to process the corresponding data as much as possible on computing resources close to the data source. This improves the overall efficiency of data processing while reducing the pressure on network transmission bandwidth.

Detection Results Analysis.
In this subsection, the performance of the proposed algorithm is evaluated from two aspects by experimental results.
(1) Using the anomaly detection algorithm proposed in this paper, local detection is performed on sensor data including accumulated heat, thermal power, accumulated temperature, flow, and temperature difference. The number of selected sensor data is 500000. The test results of the proposed algorithm are shown in Table 2. It can be seen from Table 2 that there are 2650 abnormal P, W, or (P, W) records and 1997 abnormal S, V, or (S, V) records. $AD_{sum}$ represents the total number of abnormal sensor data, and $AD_{cor}$ represents the number of abnormal sensor data successfully detected. Thus, the detection accuracy (expressed as $AD_{pre}$) can be calculated as follows:
$$AD_{pre} = \frac{AD_{cor}}{AD_{sum}} \times 100\%$$
It can be seen from Table 2 that the basic DBSCAN algorithm finds 2320 of the 2650 abnormal records (P/W/(P, W)) and 1802 of the 1997 abnormal records (S/V/(S, V)). The detection accuracy rates are 87.5% and 90.2%, respectively. The improved DBSCAN algorithm proposed in this paper detects 2530 of the 2650 abnormal records (P/W/(P, W)) and 1909 of the 1997 abnormal records (S/V/(S, V)). The detection accuracy rates are 95.5% and 96.0%, respectively. It can be seen that the improved DBSCAN algorithm proposed in this paper effectively improves the detection of abnormal data.
(2) To verify the superiority of the proposed method, the methods in reference [10], reference [11], and reference [12] are selected as benchmarks. A total of 3,433,756 sensor records are selected from the data of the past two years, and the proposed algorithm and the three benchmark methods are used to detect anomalies. Figure 3 shows the average detection accuracy.
It can be seen from Figure 3 that, compared with the methods in reference [10], reference [11], and reference [12], the detection accuracy of the proposed method is higher by 1.91%, 2.04%, and 2.7%, respectively, reaching 96.4%.
This is because the benchmark methods do not effectively use the correlation between multisource time series to accurately assess the change trend.
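The detection accuracy $AD_{pre}$ reported for Table 2 can be recomputed from the counts quoted in the text. The snippet below is a small check, with the function name and dictionary keys chosen for illustration. Note that the last pair (1909 of 1997) evaluates to about 95.6%, slightly below the 96.0% stated in the text.

```python
def detection_accuracy(ad_cor, ad_sum):
    """AD_pre = AD_cor / AD_sum, expressed as a percentage."""
    return 100.0 * ad_cor / ad_sum

# (detected, total) abnormal-record counts quoted in the text for Table 2.
reported = {
    "basic DBSCAN, P/W": (2320, 2650),
    "basic DBSCAN, S/V": (1802, 1997),
    "improved DBSCAN, P/W": (2530, 2650),
    "improved DBSCAN, S/V": (1909, 1997),
}

for name, (cor, total) in reported.items():
    print(f"{name}: {detection_accuracy(cor, total):.1f}%")
```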
This paper extracts the detection results of the sensor data in the data set for statistics. Among them, there are 15 point anomalies, 49 cluster anomalies, and 13 correlation anomalies. The detection rate is 97.8% and the false alarm rate is 2.2%. To show more clearly how abnormal points are marked, the first 180 sample points of the air temperature data are plotted; the detection results are shown in Figure 4. The abnormalities in the air temperature data occur within a short period of time (samples are collected at 10-minute intervals), and all of them are caused by errors.
To further verify the time efficiency of the proposed algorithm, its running time is compared with that of the benchmark methods on datasets of different sizes; the results are shown in Figure 5.
It can be seen from the figure that the running time of the method in reference [12] increases the fastest, and the running time of the proposed method and the method in reference [10] increases slowly. When the dataset reaches 8440 hours, that is, when the number of data points reaches 337760, the running time of the proposed method and the method in reference [12] is far less than the running time of the methods in reference [10] and reference [11]. Combined with the previous analysis of time complexity, it can be seen that the improved DBSCAN algorithm takes advantage of the spatial correlation characteristics of multidimensional data, fully considers data relevance, and effectively mines the potential relationships in the data. Therefore, when the sample size increases to a certain extent, the running time of the proposed method is lower than that of several comparison algorithms. In summary, the proposed method can be used for anomaly detection of multisensor data streams and is feasible.

Conclusion
This paper proposes a real-time detection method for abnormal data of IoT sensors based on edge computing, which combines the ST correlation of sensor data streams with ideas from the nearest neighbor and clustering algorithms. The method optimizes parameters according to the characteristics of environmental data and overcomes the problems of a fixed nearest neighbor distance threshold, global clustering parameters, and slow convergence, which improves anomaly detection efficiency. For correlated multisensor data streams, its performance meets the needs of the current application environment. Simulation results show that the proposed method has higher computational efficiency and detection accuracy than traditional methods and is feasible. However, limited to the author's level, the algorithm in this paper still ...

Data Availability
The data included in this paper are available without any restriction.