Data Sensitivity Measurement and Classification Model of Power IOT based on Information Entropy and BP Neural Network

To address the privacy protection problems caused by massive data sharing in the construction of the power Internet of things (IOT), a data sensitivity measurement and classification model based on information entropy and BP neural network is proposed. First, a recognition matching algorithm identifies the sensitivity level of each attribute in the dataset. Next, information entropy is used to determine the weights of the attributes within each sensitivity level, from which the sensitivity measurement value of each record is calculated. Finally, the BP neural network outputs the data classification results. The experimental results show that the model measures and classifies data accurately, with a low misjudgment rate and error rate.


Introduction
With the comprehensive promotion of the power Internet of things (IOT) and the orderly construction of data centres, the integration of diverse data and extensive cross-business data sharing have become common. Ensuring data privacy while providing efficient data algorithm models is therefore particularly important. Since international and domestic network security laws and regulations were issued, various industries have launched research on data security and privacy protection.
Many scholars have proposed effective privacy protection models and methods based on data perturbation, data anonymization and other strategies [1][2][3][4]. In practical applications, however, these methods are still restricted by the variety of privacy data types and the complexity of privacy application scenarios. Because privacy is an abstract concept, it is difficult to define clearly, which makes privacy data hard to identify [5]. Hence, data privacy measurement has attracted considerable research. Li et al. [6] proposed a privacy metric that calculates the distribution value of sensitive attributes under the k-anonymity model. Gkountouna et al. [7], also building on the principle of anonymity, constructed a predictive binary tree graph to measure privacy risk.
At present, research on encryption, desensitization, leakage prevention and other aspects of power data security has been carried out, but data sensitivity classification for the power IOT using natural language processing technology has not yet received attention. Power IOT data is huge in volume and is usually classified manually, with low efficiency. To address the problem that the sensitive data of the power IOT cannot be automatically recognized and classified, this paper proposes a data sensitivity measurement and classification model for structured datasets. Based on a data sensitivity classification word library, a recognition matching algorithm identifies the sensitivity level of each attribute: top-secret, confidential, secret, classified or public. The sensitivity weights of the attributes in the dataset are then calculated using Shannon information entropy, and the sensitivity measurement values of the records are calculated, which together constitute the sensitivity measurement vector of each record. Finally, the classification results of the records in the dataset are output through the BP neural network, realizing automatic and intelligent data classification.

Information entropy
Information entropy [8] was proposed by C. E. Shannon and derives from thermodynamics. In thermodynamics, thermal entropy is a physical quantity that represents the degree of disorder of molecular states, while Shannon used the concept of information entropy to describe the uncertainty of information sources.
Suppose that there are $n$ states in a system $S$, denoted $s_1, s_2, \cdots, s_n$, and that $p_i$ ($i = 1, 2, \cdots, n$) denotes the probability of $s_i$ appearing in system $S$. The Shannon entropy of system $S$ is defined as:

$$H(S) = -\sum_{i=1}^{n} p_i \log p_i \qquad (1)$$

where $0 \le p_i \le 1$ and $\sum_{i=1}^{n} p_i = 1$. When $p_i = 0$, $0 \log 0 = 0$. According to Shannon's theory of information entropy, the greater the information entropy is, the higher the degree of disorder of the information and the less information it contains; the smaller the entropy is, the lower the degree of disorder and the more information it contains.
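As a small illustration of the entropy definition above, the following sketch computes the Shannon entropy of a discrete distribution and applies the convention that a zero-probability state contributes nothing. The function name `shannon_entropy` and the choice of base 2 are assumptions made for this example; the text does not fix a logarithm base.

```python
import math

def shannon_entropy(probabilities, base=2.0):
    """Return H = -sum(p_i * log(p_i)), treating 0 * log 0 as 0."""
    if any(p < 0 or p > 1 for p in probabilities):
        raise ValueError("each probability must lie in [0, 1]")
    if not math.isclose(sum(probabilities), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # States with p = 0 are skipped, which implements 0 * log 0 = 0.
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)

uniform = shannon_entropy([0.25, 0.25, 0.25, 0.25])  # four equally likely states
skewed = shannon_entropy([0.97, 0.01, 0.01, 0.01])   # one dominant state
certain = shannon_entropy([1.0, 0.0, 0.0, 0.0])      # a single possible state
```

The uniform distribution gives the largest entropy of the three and the certain distribution gives zero, which matches the reading of entropy as a measure of disorder.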

BP neural network
The BP neural network [9] is a multilayer feedforward network trained by error back-propagation. The basic idea of the algorithm is to use gradient search to minimize the mean square error between the actual and expected output values of the network. The network is divided into an input layer, a hidden layer and an output layer. Every neuron is connected to the neurons in the next layer, but there are no connections between neurons in the same layer. A BP neural network with four input-layer neurons, five hidden-layer neurons and three output-layer neurons is shown in Figure 1. The advantage of the BP neural network is that it can learn and store a large number of input-output relations without requiring the mathematical relationship to be specified in advance. Training consists of two processes: forward signal propagation and error back-propagation. In forward propagation, the input signal acts on the output nodes through the hidden layer and generates the output signal through nonlinear transformation. If the actual output does not match the expected output, the process switches to error back-propagation, in which the output error is transmitted back through the hidden layer towards the input layer and apportioned to all units in each layer. The error signal obtained for each layer is used as the basis for adjusting the weights of its units.
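The two processes described above can be sketched in plain Python for a network with the 4-5-3 layout of Figure 1. This is a minimal sketch under several assumptions that are not from the paper: bias terms are omitted, both layers use the logistic sigmoid, the loss is half the squared error, the learning rate is 0.5, and the single training sample is made up.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_hidden, w_out):
    """Forward propagation: input -> hidden -> output (no bias terms)."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    output = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_out]
    return hidden, output

def train_step(x, target, w_hidden, w_out, lr=0.5):
    """One forward pass and one back-propagation update.

    Returns the error 0.5 * sum((o - t)^2) measured before the update.
    """
    hidden, output = forward(x, w_hidden, w_out)
    # Error signal of each output unit (derivative of the loss through the sigmoid).
    delta_out = [(o - t) * o * (1 - o) for o, t in zip(output, target)]
    # Error signal of each hidden unit: output errors sent back through w_out.
    delta_hidden = [
        h * (1 - h) * sum(delta_out[k] * w_out[k][j] for k in range(len(w_out)))
        for j, h in enumerate(hidden)
    ]
    # Gradient-descent weight adjustments, layer by layer.
    for k, row in enumerate(w_out):
        for j in range(len(row)):
            row[j] -= lr * delta_out[k] * hidden[j]
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] -= lr * delta_hidden[j] * x[i]
    return 0.5 * sum((o - t) ** 2 for o, t in zip(output, target))

rng = random.Random(0)
n_in, n_hidden, n_out = 4, 5, 3  # layer sizes of the network in Figure 1
w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]

sample, target = [0.1, 0.9, 0.3, 0.7], [1.0, 0.0, 0.0]  # made-up training pair
errors = [train_step(sample, target, w_hidden, w_out) for _ in range(200)]
```

Comparing `errors[0]` with `errors[-1]` shows the error shrinking as the weights are adjusted. A practical network would add bias terms and train on many samples.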

Construction of the model for sensitivity measurement and data classification
Power IOT datasets contain large amounts of data of complex types, which carry considerable information value, but how to measure the sensitivity of this information is not clearly defined. Many factors affect the determination of data sensitivity classification, such as the diversity and relevance of the application scenarios of the information contained in each record, as well as differences in subjective factors. This paper therefore proposes a data sensitivity measurement and classification model based on information entropy and BP neural network that does not require the weights of the sensitivity measurement elements to be set in advance. The basic framework consists of three models, as shown in Figure 2.
The first is the attribute sensitivity level determination model. Based on the data sensitivity classification library, it uses the minimum edit distance algorithm to identify the sensitivity level of each field attribute (top-secret, confidential, secret, classified or public), so as to obtain the sensitivity level distribution of the attributes in the records.
The second is the sensitivity measurement model. Because attributes with the same sensitivity level do not all have the same degree of sensitivity in each record, information entropy is used to determine the weight of each attribute within a sensitivity level. The sensitivity measurement value of each record under each sensitivity level is then calculated, and these values constitute the sensitivity measurement vector of the record.
The third is the BP data classification model. Drawing on the self-learning and storage ability of the BP neural network, a training dataset is extracted to train the network and establish the data classification model. Finally, the classification results of the records to be tested are obtained through the BP neural network.

Determination of attribute sensitivity level
In order to classify data sensitivity accurately, a data security classification library is established according to the provisions of the Data Classification Regulation of China State Grid Corporation. The library lists the words belonging to each of the five sensitivity levels: top-secret, confidential, secret, classified and public. On this basis, an attribute recognition matching algorithm based on the minimum edit distance is proposed, which matches each attribute string in the dataset against the set of strings in the word library to determine the sensitivity level of the attribute.
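For two strings to be matched, the index string $X = x_1, x_2, \cdots, x_m$ and the target string $Y = y_1, y_2, \cdots, y_n$, the minimum edit distance (MED) between the two strings is defined as the minimum cost of converting one string into the other, given three basic operations (insertion, deletion and substitution) and their associated costs.
A two-dimensional penalty matrix $D[0, \cdots, m][0, \cdots, n]$ is used to hold the edit distance values, where $m$ and $n$ are the lengths of the strings $X$ and $Y$. The MED between $X$ and $Y$ is computed as follows:
Step1. Initialize the penalty matrix $D$ with $D(0, 0) = 0$.
Step2. Initialize the first-row elements of the penalty matrix: $D(0, j) = D(0, j-1) + \mathrm{ins}(y_j)$, $j = 1, \cdots, n$.
Step3. Initialize the first-column elements of the penalty matrix: $D(i, 0) = D(i-1, 0) + \mathrm{del}(x_i)$, $i = 1, \cdots, m$.
Step4. Fill in the remaining elements: $D(i, j) = \min\{D(i-1, j-1) + \mathrm{sub}(x_i, y_j),\ D(i, j-1) + \mathrm{ins}(y_j),\ D(i-1, j) + \mathrm{del}(x_i)\}$.
Here $\mathrm{sub}(x_i, y_j)$ is the penalty of substituting the element $x_i$ in $X$ with the element $y_j$ in $Y$ (zero when the two elements are equal), $\mathrm{ins}(y_j)$ is the penalty of inserting the element $y_j$ in $Y$, $\mathrm{del}(x_i)$ is the penalty of deleting the element $x_i$ in $X$, and $D(m, n)$ is the minimum penalty of converting the string $X$ to the string $Y$. By comparing the minimum penalty with a threshold, the matching between the two strings is completed.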
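The matching procedure can be sketched as follows: a standard dynamic-programming edit distance with configurable insertion, deletion and substitution penalties, followed by a threshold comparison against a word library. The unit penalties, the threshold of 2, the lower-casing of strings and the library entries are all assumptions for illustration and are not taken from the paper or from the State Grid regulation.

```python
def min_edit_distance(source, target, sub_cost=1, ins_cost=1, del_cost=1):
    """Minimum penalty of converting `source` into `target`."""
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]  # penalty matrix, d[0][0] = 0
    for j in range(1, n + 1):                  # first row: insertions only
        d[0][j] = d[0][j - 1] + ins_cost
    for i in range(1, m + 1):                  # first column: deletions only
        d[i][0] = d[i - 1][0] + del_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = source[i - 1] == target[j - 1]
            d[i][j] = min(
                d[i - 1][j - 1] + (0 if same else sub_cost),  # substitute or keep
                d[i][j - 1] + ins_cost,                       # insert target[j-1]
                d[i - 1][j] + del_cost,                       # delete source[i-1]
            )
    return d[m][n]

def match_level(attribute, word_library, threshold=2):
    """Return the level of the closest library word, or None if no word
    is within `threshold` edits of the attribute name."""
    best_level, best_cost = None, threshold + 1
    for level, words in word_library.items():
        for word in words:
            cost = min_edit_distance(attribute.lower(), word.lower())
            if cost < best_cost:
                best_level, best_cost = level, cost
    return best_level

# Hypothetical library entries, for illustration only.
library = {
    "confidential": ["customer_id", "bank_account"],
    "public": ["station_name", "voltage_level"],
}
```

For example, `match_level("customerid", library)` is one insertion away from `customer_id` and so resolves to the confidential level, while an unrelated attribute name such as `temperature` exceeds the threshold for every word and returns `None`.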

Data sensitivity measurement
In this paper, an information entropy measurement matrix is established for each attribute sensitivity level of the records in the dataset. Suppose the top-secret level contains $m$ attributes; the sensitivity measurement results for $n$ records are then calculated from an attribute information entropy measurement matrix $X$. The steps are as follows.
Step1. The attribute information measurement matrix $X = (x_{ij})_{n \times m}$ is established for the $n$ regularized records, in which $m$ attributes belong to the given sensitivity level. The element $x_{ij}$ of $X$ is the normalized value of the $j$th attribute in the $i$th record, and takes the value 0 or 1.
Step4. Calculate the attribute weights $w_j$ ($j = 1, \cdots, m$), and from them the sensitivity measurement value $s_i$ of a single record under the sensitivity level. The higher the value of $s_i$, the higher the sensitivity of the record under that sensitivity level.
Step5. Repeat steps 1-4 to calculate the sensitivity measurement value of a single record under each of the five sensitivity levels (top-secret, confidential, secret, classified and public). The sensitivity measurement vector of the record is then generated from these five values.
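One common way to derive weights from information entropy is the entropy weight method, sketched below for a 0/1 attribute matrix of a single sensitivity level. This is an assumption about the general approach rather than a reproduction of the paper's exact formulas: each column is turned into a distribution over records, its entropy is normalized by the logarithm of the record count, the weight is proportional to one minus that entropy, and a record's score is the weighted sum of its attribute values. The handling of empty columns and the equal-weight fallback are also choices made for this sketch.

```python
import math

def entropy_weights(matrix):
    """Entropy weight method on an n x m matrix of 0/1 attribute values."""
    n, m = len(matrix), len(matrix[0])
    diversity = []
    for j in range(m):
        column = [row[j] for row in matrix]
        total = sum(column)
        if total == 0 or n < 2:
            diversity.append(0.0)  # attribute never occurs: give it no weight
            continue
        probs = [value / total for value in column]
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        diversity.append(max(0.0, 1.0 - entropy))
    norm = sum(diversity)
    if norm == 0:
        return [1.0 / m] * m       # no informative attribute: equal weights
    return [d / norm for d in diversity]

def sensitivity_scores(matrix):
    """Weighted sum of attribute values for each record."""
    weights = entropy_weights(matrix)
    return [sum(w * x for w, x in zip(weights, row)) for row in matrix]

# Four records and three attributes of one sensitivity level (made-up values).
records = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 1],
    [1, 0, 1],
]
weights = entropy_weights(records)
scores = sensitivity_scores(records)
```

In this example the first attribute appears in every record and therefore receives a weight of zero, while the attribute that appears in only one record receives the largest weight, so the first record gets the highest score.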

Data classification
In this paper, the final classification result of privacy data is obtained with the BP neural network. The input layer has five nodes, corresponding to the number of sensitivity levels. The output layer also has five nodes, each with an output value of 0 or 1. An output classified as top-secret is {1,0,0,0,0}, and one classified as public is {0,0,0,0,1}; in this way, the positions of the output vector correspond to the five privacy levels. The hyperbolic tangent sigmoid function and the logarithmic sigmoid function are used as the transfer functions of the hidden layer and the output layer, respectively.
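The output encoding described above can be illustrated with a short sketch. The ordering of the levels from top-secret to public follows the two examples given in the text; decoding a real-valued network output by taking the strongest node is an assumption of this sketch, as are the names `LEVELS`, `encode_level` and `decode_output`.

```python
LEVELS = ["top-secret", "confidential", "secret", "classified", "public"]

def encode_level(level):
    """Return the five-element 0/1 target vector for a sensitivity level."""
    if level not in LEVELS:
        raise ValueError(f"unknown sensitivity level: {level}")
    return [1 if name == level else 0 for name in LEVELS]

def decode_output(outputs):
    """Map a five-element network output to the level whose node is strongest."""
    if len(outputs) != len(LEVELS):
        raise ValueError("expected one output per sensitivity level")
    strongest = max(range(len(outputs)), key=outputs.__getitem__)
    return LEVELS[strongest]
```

For instance, `encode_level("top-secret")` yields `[1, 0, 0, 0, 0]`, and a raw output such as `[0.02, 0.10, 0.90, 0.05, 0.00]` decodes to the secret level.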
When training the BP neural network, 20% of the records in the training dataset are randomly selected, their sensitivity measurement vectors are obtained with the information-entropy-based sensitivity measurement method of Section 3.2, and the training samples are formed after normalization. The specific training process is shown in Figure 2. The training target is reached at the 131st training iteration, which meets the needs of the model.

Training of BP neural network
The testing of the model focused on data classification with the BP neural network, covering a training test of the neural network and an accuracy test of the classification. During testing, 5000 data records were used as the training dataset, with a maximum of 1000 training iterations, a learning rate of 0.1 and a target error of 0.000001. The number of hidden layer nodes was preliminarily selected according to the formula $h = \sqrt{n_i + n_o} + a$, where $n_i$ and $n_o$ are the numbers of input nodes and output nodes respectively, and $a$ is an adjustment parameter. The final number of hidden layer nodes was nine. The error curve of the training process is shown in Figure 3.
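The empirical rule for the hidden layer size (square root of the sum of input and output node counts, plus an adjustment parameter) can be worked through for the five-input, five-output network used here. The range of 1 to 10 for the adjustment parameter is a commonly quoted convention and is an assumption of this sketch, as are the rounding to the nearest integer and the function name.

```python
import math

def hidden_node_candidates(n_inputs, n_outputs, adjustments=range(1, 11)):
    """Candidate hidden-layer sizes from sqrt(n_inputs + n_outputs) + a."""
    base = math.sqrt(n_inputs + n_outputs)
    return [round(base + a) for a in adjustments]

candidates = hidden_node_candidates(5, 5)  # sqrt(10) is about 3.16
```

With five inputs and five outputs the candidates run from 4 to 13, and the nine hidden nodes finally chosen correspond to an adjustment parameter of about 6.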

Classification accuracy experiments
To test the accuracy of the model's data sensitivity classification, the error rate of misjudgment [10] is used to measure the deviation between the output of the model and the actual classification results. In the formula for this error rate, $N$ is the number of records used to calculate the error rate of misjudgment, $\ln$ is the natural logarithm, and $y_i$ and $\hat{y}_i$ are the actual sensitivity level and the output of the model for the record with index $i$, with $|y_i - \hat{y}_i| \in \{0, 1, 2, 3, 4\}$.
The testing dataset included 500 records to verify the classification accuracy of the model. The test results are shown in Table 1. They show that the overall accuracy of the data classification reaches 96.4%.
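The level deviations that enter the evaluation can be tallied with a few lines of code. This sketch only counts the absolute differences between actual and predicted levels and derives the plain classification accuracy; it does not implement the error rate of misjudgment from [10]. Coding the five levels as the integers 0 to 4 and the sample values are assumptions made for illustration.

```python
from collections import Counter

def level_deviations(actual, predicted):
    """Count how often each absolute level deviation (0 to 4) occurs."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return Counter(abs(a - p) for a, p in zip(actual, predicted))

def accuracy(actual, predicted):
    """Share of records whose predicted level equals the actual level."""
    return level_deviations(actual, predicted)[0] / len(actual)

# Made-up levels for eight records; one record is off by a single level.
actual_levels = [0, 1, 2, 3, 4, 4, 2, 1]
predicted_levels = [0, 1, 2, 3, 4, 3, 2, 1]
deviations = level_deviations(actual_levels, predicted_levels)
share_correct = accuracy(actual_levels, predicted_levels)
```

Here seven of the eight records are classified exactly and one deviates by a single level, giving an accuracy of 0.875.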

Conclusion
To address the problem that the sensitive data of the power IOT cannot be automatically recognized and classified, an intelligent and automatic data sensitivity measurement and classification model based on information entropy and BP neural network is proposed. The model can effectively identify and classify the sensitivity of structured datasets, and it is also suitable for large-scale structured datasets. In future work, we will explore algorithms and models for adaptive privacy protection of large-scale data based on sensitive data identification and classification.