Detection Method of Data Integrity in Network Storage Based on Symmetrical Difference

In order to enhance the recall and precision of data integrity detection, a method for detecting network storage data integrity based on symmetric difference was proposed. Through a fully automatic image annotation system, crawler technology was used to capture images and related text information. Feature analysis of the text data was achieved through automatic word segmentation, part-of-speech tagging, and Chinese word segmentation. Feature extraction of the image data was realized with the symmetrical difference algorithm, supplemented by background subtraction. On the basis of data collection and feature extraction, sentry data segments were introduced and randomly selected to detect data integrity. Combined with a data security accountability scheme, the trusted third party was taken as the core: an online state judgment was made for each user operation, and credentials that cannot be denied by either party were generated, thereby preventing the verifier from providing false validation results. Experimental results show that the proposed method has a high precision rate, a high recall rate, and strong reliability.


Introduction
In recent years, cloud computing has become a new shared infrastructure based on the network. Building on the Internet, virtualization, and other technologies, it combines a large number of system pools and other resources to provide users with a series of convenient services [1]. Cloud computing offers a convenient shared pool of computing resources, elastic resources, safe and controllable data, cost savings, unified management, and low-cost fault tolerance, so cloud services have become more and more popular with users. With the rapid development of information technology, the traditional data storage mode cannot meet the new needs and challenges [2]. Cloud storage has attracted great attention due to its low cost and high efficiency and, as a new storage mode, has drawn increasing interest with its rapid popularization. How to ensure the correctness and integrity of data files stored in cloud servers is one of the key issues in the development of cloud storage technology [3]. Given the importance of data integrity detection, many excellent research results have been obtained in this field.
One study [4] proposed a method for security data integrity detection. The algorithm performs cross-validation by establishing a dual-evidence mode consisting of integrity verification evidence and untrusted detection evidence: the integrity verification evidence is used to detect data integrity, while the untrusted detection evidence determines the correctness of the verification results. In addition, the reliability of the verification results is ensured by constructing a detection tree. However, this method has poor recall. Literature [5] proposed a method for data integrity detection in big data storage, which cross-checks the integrity of the stored data with a two-factor verification method and constructs a check tree to ensure the reliability of the verification.

Data Acquisition Based on Web Crawler

The web crawler system maintains a URL table containing some original URLs. Based on these URLs, the robot downloads the corresponding pages, extracts new URLs from them, and adds these to the URL table. The robot repeats this process until the URL queue is empty.
On this basis, the basic framework of the web crawler is shown in Figure 1 (source: author's own conception, adapted from Wu Libing). The working process of the crawler is as follows. Import the seeds (the initial URL list), the agents, and the templates (regular expressions constructed from different web page features). Each agent is matched to a seed URL, and the seeds are put into the crawling queue as the initial URL set. A waiting URL is taken from the queue and enters the crawling state, after which the web source file corresponding to the URL is downloaded. The regular expressions in the template object array and the downloaded page source file are passed to the web page analysis module for matching, so that structured information is obtained. Important links contained in the structured information, such as the URL of the next page of the post list and the URLs of the posts, are put back into the URL queue to wait for crawling. Information required by users, such as the post name, the poster, the number of replies, and the reply content, is stored in the information table.
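The crawl loop described above can be sketched as follows (a minimal illustration; the function names, queue structure, and page fields are hypothetical, not taken from the original system):

```python
import re
from collections import deque

def crawl(seeds, templates, download, max_records=100):
    """Sketch of the crawler loop: seeds are the initial URLs; templates
    are regular expressions that extract structured fields from page source."""
    queue = deque(seeds)           # crawling queue initialised with the seed URLs
    seen = set(seeds)
    records = []                   # the "information table"
    while queue and len(records) < max_records:
        url = queue.popleft()      # take a waiting URL, enter the crawling state
        html = download(url)       # download the web source file for this URL
        for name, pattern in templates.items():
            for match in re.findall(pattern, html):
                if name == "link":             # links go back into the URL queue
                    if match not in seen:
                        seen.add(match)
                        queue.append(match)
                else:                          # other fields go to the info table
                    records.append((url, name, match))
    return records
```

A toy run with two in-memory "pages" shows links being re-queued while titles land in the information table.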
Extraction of images and related text: based on a study of the pages, the text information related to an image mainly includes the following. (1) The texts around the image in the page. Most of them are long and contain a lot of semantic information; during page analysis they are mostly related to the page structure, such as adjacent texts in the same row or column of a table. When an image exists as an illustration, the surrounding words make a limited contribution to the annotation of the image.
(2) File name, title, or description information of image are usually concise phrases or words, which have strong generalization ability.
(3) The title of the image's link page. The content of the image's web page is highly relevant to the image, and the titles used to summarize the content of hyperlinked web pages are also related to the image semantics.
The image semantics mainly come from the analysis of the related text. Firstly, any English in the related text is translated into Chinese, and then automatic word segmentation and part-of-speech tagging are carried out.
When the system receives a large amount of text information through the crawler program, the next problem is how to extract keywords from the documents. One obvious difference between Chinese and English is that in Chinese text there is no natural separator between Chinese characters or words. Meanwhile, the number of characters in a Chinese word is variable, the collocation is flexible, and the semantics are diverse. Most words are composed of two or more Chinese characters and are written continuously, which makes Chinese understanding and keyword extraction more difficult.
Generally, users may query in the form of words or single characters, so the system should label images with shorter words or single characters as far as possible. When the annotation platform receives long texts, it must first divide each whole sentence into smaller vocabulary units and then process them through the keyword selection module [8].
In Chinese word segmentation, it is necessary to consider time complexity on the premise of accuracy [9]. A simple approach is the forward maximum matching method. The basic idea is to build a dictionary in advance, extract a word string of preset length from the long sentences of natural language, and compare it with the dictionary. If the string belongs to the dictionary, it is regarded as a meaningful word; a separator is then used to split it off and output it. Otherwise, the word string is shortened and the dictionary is searched again. Finally, the window moves backward and the above steps are repeated. The basic description of this algorithm is shown in Figure 2 (source: author's own conception, adapted from Wang Ruilei).
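A minimal sketch of the forward maximum matching idea described above (the dictionary contents and the maximum window length are illustrative assumptions):

```python
def forward_max_match(text, dictionary, max_len=4):
    """Forward maximum matching: try the longest window first and shrink it
    until the chunk is in the dictionary (a single character always passes)."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if size == 1 or chunk in dictionary:
                words.append(chunk)   # split off the matched word
                i += size             # move the window backward past it
                break
    return words
```

For example, with the toy dictionary {"数据", "完整性", "检测"}, the sentence "数据完整性检测" is segmented into those three words.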
With the distributed crawler as the core, an automatic image download program is implemented; the main flow chart is shown in Figure 3. It mainly includes the web source file downloading module, the regular matching module, the next-page URL construction and image URL construction module, and the image download module, whose entry parameter is the image URL.

Data Feature Extraction Based on Symmetrical Difference
In the above process of data acquisition, automatic word segmentation, part-of-speech tagging, and Chinese word segmentation are used to achieve the feature analysis of the text data. Then, the symmetrical difference algorithm is taken as the main method and background subtraction as the auxiliary method, so that image data feature extraction and preliminary recognition of data integrity are achieved, thereby improving the precision rate of data integrity detection. The symmetrical difference algorithm can remove the influence of the background revealed by the motion and draw the contour of the moving object accurately [10,11]. The basic algorithm is as follows: let the source images of three consecutive frames in the video sequence be f_{k−1}(x, y), f_k(x, y), and f_{k+1}(x, y). The absolute-difference gray-scale images of the two pairs of adjacent source images are calculated respectively, namely d_{(k−1,k)}(x, y) and d_{(k,k+1)}(x, y):

d_{(k−1,k)}(x, y) = W(x, y) ∗ |f_k(x, y) − f_{k−1}(x, y)|
d_{(k,k+1)}(x, y) = W(x, y) ∗ |f_{k+1}(x, y) − f_k(x, y)|
where W is a window function to suppress noise. Because mean filtering blurs the image and loses edge information, a median filtering function with a 3 × 3 window is chosen to suppress the noise. The binary images b_{(k−1,k)}(x, y) and b_{(k,k+1)}(x, y) are obtained by thresholding the difference images with a threshold T, where a pixel is set to 1 if its difference value exceeds T and to 0 otherwise. The symmetrical difference result S(x, y) is then obtained by a logical AND of b_{(k−1,k)}(x, y) and b_{(k,k+1)}(x, y) at each pixel position:

S(x, y) = b_{(k−1,k)}(x, y) ∧ b_{(k,k+1)}(x, y)

The basic idea of background subtraction is to subtract the current image from a background image stored in advance or acquired in real time. A pixel whose difference is greater than a certain threshold is regarded as a point on the moving target; otherwise, it is considered a background point. This is well suited to detecting moving targets when the background image changes little over time. The ideal background B(x, y) = ∪_{s=0}^{n−1} B_{s,s+1} is obtained on the basis of the given n frame images, in which ∪ denotes the image splice operator and B_{s,s+1} denotes the common background of frame s and frame s + 1.
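The thresholding and logical-AND steps above can be sketched as follows (a simplified NumPy illustration: the 3 × 3 median-filter window is omitted and the threshold T = 25 is an assumed constant):

```python
import numpy as np

def symmetric_difference(f_prev, f_cur, f_next, T=25):
    """Binarize the two adjacent-frame absolute differences and AND them,
    keeping only pixels that changed in both frame pairs."""
    d1 = np.abs(f_cur.astype(int) - f_prev.astype(int))   # d_(k-1,k)
    d2 = np.abs(f_next.astype(int) - f_cur.astype(int))   # d_(k,k+1)
    b1 = d1 > T                                           # b_(k-1,k)
    b2 = d2 > T                                           # b_(k,k+1)
    return np.logical_and(b1, b2)                         # S(x, y)
```

With a single bright pixel moving across three synthetic frames, only its position in the middle frame survives the AND, which is why the symmetric difference suppresses the background revealed behind the object.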
Formula (4) is used to judge the attribution of sub-block B_K(s, j). The moving object detection algorithm based on background subtraction and symmetrical difference can accurately extract and update the background model even when the image contains several moving targets.

Data Integrity Detection and Accountability Mechanism
Based on the data collection and feature extraction above, randomly selected sentry data segments are introduced to achieve the final data integrity detection. Combined with the data security accountability scheme, the trusted third party is taken as the core: an online state judgment is made for each user operation, and credentials that cannot be denied by either party are generated, so as to ensure the reliability of the audit and accountability of the trusted third party when the cloud is not trusted. In addition, this prevents the verifier from providing false verification results.
This scheme randomly selects sentry data segments to detect data integrity. Because a sentry data segment contains the selected sentry information, effective data, and other sentry data, the scheme allows the detector to carry out an unlimited number of detections, thereby improving the recall rate of data integrity detection, without risking the sentry leakage otherwise caused by repeated tests.
Next, the scheme flow is introduced in terms of the data preprocessing stage, the challenge initiation stage, and the detection and verification stage.

Data Preprocessing Stage
The original data blocking: firstly, the network storage user runs the key generation algorithm KeyGen(p_u) → (sKey_u, pKey_u), which generates the public key sKey_u and the private key pKey_u held by the user; the private key pKey_u is kept secret. After that, the user uses the public key sKey_u to encrypt the original data file: FileEncrypt(F, sKey_u) → F′. Then, the file blocking algorithm FileBlock(F′, m) → {F_1, ..., F_m} divides F′ into m blocks of equal size. Finally, each data block in {F_1, ..., F_m} is divided into n blocks, so that a data block matrix F consisting of m column vectors is obtained.

Erasure code: the network storage user applies an erasure-correction code to the obtained data. The Vandermonde matrix B of the erasure-correction code is a matrix with k rows and n columns (k > n). Its model, shown in Formula (5), has row i equal to (1, a_i, a_i^2, ..., a_i^{n−1}):

B =
[ 1  a_1  a_1^2  ...  a_1^{n−1} ]
[ 1  a_2  a_2^2  ...  a_2^{n−1} ]
[ 1  a_3  a_3^2  ...  a_3^{n−1} ]
[ ...                           ]
[ 1  a_k  a_k^2  ...  a_k^{n−1} ]

The encoded data matrix is obtained by m matrix multiplications: each data block set {F_1i, ..., F_ni} is multiplied, in vector form, by the Vandermonde matrix B, and the resulting products constitute the data matrix. According to the Vandermonde matrix B with k rows and n columns, this error-correction code carries (k − n) × m redundant data blocks. In other words, for a data vector (a_1i, a_2i, ..., a_ki)^T, when at most k − n data blocks are lost, the entire data vector can be recovered completely.
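A hedged numerical sketch of Vandermonde-based erasure coding (computed over the real numbers for readability; practical schemes work over a finite field such as GF(2^8), and the evaluation points a_i = 1, 2, ..., k are an assumption):

```python
import numpy as np

def vandermonde(k, n):
    """k x n Vandermonde matrix B with rows (1, a_i, a_i^2, ..., a_i^(n-1))."""
    points = np.arange(1, k + 1, dtype=float)   # assumed evaluation points a_i
    return np.vander(points, n, increasing=True)

def encode(data, k):
    """Multiply the length-n data vector by B to get k encoded symbols."""
    return vandermonde(k, len(data)) @ np.asarray(data, dtype=float)

def recover(symbols, rows, n):
    """Any n surviving symbols determine the data: solve the n x n subsystem
    formed by the corresponding rows of B (invertible for distinct a_i)."""
    B = vandermonde(max(rows) + 1, n)
    return np.linalg.solve(B[list(rows)], np.asarray(symbols, dtype=float))
```

With k = 5 and n = 3, up to k − n = 2 encoded symbols may be lost and the original vector is still recovered exactly.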
Sentry position generation: after the data matrix is generated, the network storage user sets the number of sentries to be placed in each data block, l (l ≥ 2). Meanwhile, the user uses the sentry insertion position generation algorithm GuardPosition(i, j, l, pKey_u) → (P_{i,j,1}, ..., P_{i,j,w}, ..., P_{i,j,l}) to calculate the position array, where P_{i,j,w} denotes the position of the w-th sentry to be inserted into data block a_ij. Let a′_ij denote the data block after the sentry data are inserted; its bit length is len(a′_ij) = len(a_ij) + q × l, where q is the bit length of each sentry. Thus, the proportion of actual effective data in data block a′_ij to the total stored data is len(a_ij) / (len(a_ij) + q × l).
The sentry insertion position generation algorithm GuardPosition(i, j, l, pKey_u) → (P_{i,j,1}, ..., P_{i,j,w}, ..., P_{i,j,l}) selects a random number and hashes it modulo the bit length of a_ij, namely P_{i,j,w} = H_w(ran(i, j) ∥ pKey_u ∥ User_id) mod len(a_ij). The function ran(i, j) denotes the random number generated from i and j, User_id denotes the unique ID of the network storage user, and H(·) is the hash function. Because P_{i,j,w} is generated from a random number, the set of sentry positions must be reordered by position, yielding the ordered sequence of sentry insertion positions [12,13].
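A possible sketch of this position generation step. SHA-256 is assumed for the hash H, H_w is modelled as w-fold iterated hashing, and the "∥" concatenation uses explicit separators; all three are illustrative choices, not specified in the text:

```python
import hashlib

def guard_positions(ran_ij, pkey, user_id, l, block_len):
    """P_{i,j,w} = H_w(ran(i,j) || pKey_u || User_id) mod len(a_ij),
    returned as an ordered sequence of insertion positions."""
    digest = f"{ran_ij}|{pkey}|{user_id}".encode()
    positions = []
    for _ in range(l):
        digest = hashlib.sha256(digest).digest()          # iterate H_w
        positions.append(int.from_bytes(digest[:8], "big") % block_len)
    return sorted(positions)                              # reorder by position
```

Because the hash chain is deterministic given ran(i, j), pKey_u, and User_id, the trusted third party can regenerate exactly the same positions during verification.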
Sentry data generation: the network storage user uses the array (P_{i,j,1}, ..., P_{i,j,w}, ..., P_{i,j,l}) of ordered sentry insertion positions and the preset sentry length q to calculate the sentry G_{i,j,w} at position P_{i,j,w}: GuardGen(i, j, P_{i,j,w}, q, pKey_u) → G_{i,j,w}.
The sentry generation algorithm GuardGen(i, j, P_{i,j,w}, q, pKey_u) → G_{i,j,w} takes q bits of the result H_w(ran(i, j) ∥ pKey_u ∥ User_id) as a binary 0/1 string. At this stage, the network storage user must keep pKey_u and ran(i, j) secret and store them locally; they are not disclosed to any other party during the data preprocessing stage.
Sentry insertion: after generating the sentries, the network storage user uses the sentry insertion algorithm to insert the sentry set G_{i,j,w} into data block a_ij. When the sentry positions (P_{i,j,1}, ..., P_{i,j,w}, ..., P_{i,j,l}) are calculated, the fact that each earlier insertion shifts the positions of the sentries behind it is not considered. Therefore, when GuardInsert is executed, the original position P_{i,j,w} must be moved back by (w − 1) sentry lengths: after the transformation P′_{i,j,w} = P_{i,j,w} + (w − 1) × q, G_{i,j,w} is inserted into the data block matrix A = (a′_ij), and the data matrix is then uploaded to the network server.

Upload parameters to the trusted third party: the network storage user uses the public key sKey_t of the trusted third party to encrypt the parameters n, m, k, l, q, {ran(i, j)}_{1≤i<k, 1≤j<m}, len(a_ij), and pKey_u used in the data preprocessing stage: sec = ParaEncrypt(n, m, k, l, q, {ran(i, j)}, len(a_ij), pKey_u). The result is stored at the trusted third party. In subsequent data integrity detection, once authorized by the network storage user, the trusted third party can directly decrypt sec and, according to the recovered parameters, send the challenge to the cloud.
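The position shift P′_{i,j,w} = P_{i,j,w} + (w − 1) × q can be illustrated with a byte-level toy example (the real scheme operates on bit strings; the block and guard contents are hypothetical):

```python
def insert_guards(block: bytes, positions, guards):
    """Insert each guard at its pre-insertion position, shifted by the q
    bytes added by every earlier insertion: P' = P + (w - 1) * q."""
    out = bytearray(block)
    q = len(guards[0])                     # all sentries share length q
    for w, (p, g) in enumerate(zip(sorted(positions), guards)):
        shifted = p + w * q                # w is 0-based here, (w - 1) in the paper
        out[shifted:shifted] = g           # splice the sentry into the block
    return bytes(out)
```

Inserting two 2-byte guards at pre-insertion offsets 1 and 3 of b"ABCDEF" places the second guard at shifted offset 5, exactly as the transformation prescribes.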

Challenge Initiation Stage
Cloud storage users put forward a detection request. When cloud storage users need to check the integrity of data files stored in the cloud server, they send a detection request to the trusted third party, which then performs the data integrity detection on their behalf.
Detection parameter analysis: after receiving the detection request from the user, the trusted third party first checks the user rights, so as to judge whether the user has read permission for the data file. If the applicant has read permission, the trusted third party decrypts the preprocessing parameter set sec to obtain n, m, k, l, q, {ran(i, j)}_{1≤i<k, 1≤j<m}, len(a_ij), and pKey_u. The random generation algorithm RanPos(k, m, r) → [(i_1, j_1), ..., (i_r, j_r)] is then performed, where k is the number of rows and m the number of columns of the matrix A = (a′_ij), and r denotes the number of data blocks a′_ij that the trusted third party plans to detect. The value of r is set by the trusted third party according to the detection intensity and detection environment: for comprehensive data integrity detection, r can be large; for periodic detection, moderate; and if the current network fluctuates greatly or the hardware and software environment is limited, r can be reduced [14,15].
The output I = [(i_1, j_1), ..., (i_r, j_r)] of algorithm RanPos gives the subscripts of the selected detection data blocks. Then, the trusted third party uses the array I and the algorithm GuardPosition(i, j, l, pKey_u) → (P_{i,j,1}, ..., P_{i,j,w}, ..., P_{i,j,l}) to generate the r × l sentry positions of the r data blocks a′_ij, (i, j) ∈ I. Finally, the formula P′_{i,j,w} = P_{i,j,w} + (w − 1) × q is used to calculate the real positions of the l sentries in each data block a′_ij, (i, j) ∈ I.
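A minimal sketch of RanPos(k, m, r); the optional seed parameter is an added assumption for reproducibility:

```python
import random

def ran_pos(k, m, r, seed=None):
    """Pick r distinct subscripts (i, j) from the k x m data block matrix."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(k) for j in range(m)]
    return rng.sample(cells, r)   # r distinct detection targets
```

Sampling without replacement guarantees that the r challenged blocks are all different, so each challenge probes r distinct regions of the stored matrix.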
The generation of the detection interval: the trusted third party determines the data range by selecting data intervals. That is to say, a data interval [x_c, y_c], 1 ≤ c < r, is selected from each data block a′_ij for the final detection of data integrity. Each data block a′_ij corresponds to one interval; the lengths of the r intervals are the same, but their positions differ within the corresponding a′_ij (see Figure 4).
The selection of the interval length y_c − x_c is similar to the selection of the number r of detection data blocks, and can be determined by the trusted third party through the detection intensity and detection environment. Meanwhile, the minimum length of the interval [x_c, y_c] should be greater than len(a′_ij)/l. Because len(a′_ij) = len(a_ij) + q × l, len(a′_ij)/l must be greater than the sentry length q. That is to say, the detection interval must be longer than the sentry, so that the reliability of data integrity detection is guaranteed while the sentry leakage caused by multiple detections is avoided [16-18].
In order to ensure that the data interval [x_c, y_c], 1 ≤ c < r, contains sentry information, the trusted third party randomly selects r random numbers {u_e}, 1 ≤ e < r, from the integer interval [1, l] and then, for each data block a′_ij corresponding to I = [(i_1, j_1), ..., (i_r, j_r)], extracts the sentry position corresponding to the random number u_e from the sentry position set (P_{i,j,1}, ..., P_{i,j,w}, ..., P_{i,j,l}).
That is the aggregation {a′_{i_1,j_1}: P_{i_1,j_1,u_1}, ..., a′_{i_w,j_w}: P_{i_w,j_w,u_w}, ..., a′_{i_r,j_r}: P_{i_r,j_r,u_r}} (see Figure 5). Let the length of the detection interval set by the trusted third party be γ. In order to place the sentry G_{i_w,j_w,u_w} corresponding to this aggregation in the detection interval [x_c, y_c], 1 ≤ c < r, the distance dis = random(0, δ) × γ, 0 < δ ≤ 1, between P_{i_w,j_w,u_w} and the left vertex of the interval is defined, where δ denotes a basic threshold randomly generated for dis. As δ approaches 1, the probability that dis is close to γ increases; P_{i_w,j_w,u_w} then lies farther from the left vertex of the interval, and the probability that the sentry is only partially contained in the interval rises.
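The placement rule dis = random(0, δ) × γ can be sketched as follows (a hypothetical helper; clamping the interval at the block boundaries is an added assumption):

```python
import random

def detection_interval(p, gamma, delta, block_len, rng=random):
    """Place an interval of length gamma so that its left vertex lies
    dis = random(0, delta) * gamma before sentry position p."""
    dis = int(rng.uniform(0, delta) * gamma)   # distance to the left vertex
    x = max(0, p - dis)                        # left vertex, clamped at 0
    y = min(block_len - 1, x + gamma)          # right vertex, clamped at end
    return x, y
```

With a small δ the sentry position p always falls inside [x, y]; as δ grows toward 1, the sentry drifts toward the right edge and may be cut off, matching the partial-sentry cases of Figure 6.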
In the detection interval [x_w, y_w] of data block a′_{i_w,j_w}, the position of sentry G_{i_w,j_w,u_w} is uncertain, so the number of sentries in an interval may also be uncertain, and some sentries may be only partially included [19,20]. The possible situations are shown in Figure 6. Case (1) shows that sentry G_{i_w,j_w,u_w} is located completely within the interval [x_w, y_w], namely x_w ≤ P_{i_w,j_w,u_w} < P_{i_w,j_w,u_w} + q − 1 < y_w.
Case (2) shows that only partial data of sentry G_{i_w,j_w,u_w} is located in the interval [x_w, y_w], namely x_w < P_{i_w,j_w,u_w} ≤ y_w < P_{i_w,j_w,u_w} + q − 1. Due to the uncertainty of the interval length γ, the interval may also include several other sentries in addition to G_{i_w,j_w,u_w}, as shown in cases (3) and (4).

Start challenge request: the trusted third party encrypts the array I = [(i_1, j_1), ..., (i_r, j_r)] of serial numbers of the detected data blocks and the array Q = [(x_1, y_1), ..., (x_r, y_r)] composed of the detection intervals [x_w, y_w] corresponding to the serial numbers (i_w, j_w) with the cloud public key sKey_c, and then sends them to the remote server as the challenge request chal.

Detection and Verification Stage
Evidence generation: after receiving the challenge request chal, the server uses its private key pKey_c to decrypt chal, and then uses the decrypted I and Q to perform the evidence generation algorithm GenProof(I, Q) → {(i, j, Pro)}_{(i,j)∈I}, outputting the credentials Pro that are submitted to the trusted third party for data integrity detection. Finally, the public key sKey_t of the trusted third party is used to encrypt Pro, which is returned to the trusted third party.
Evidence verification: after receiving the detection credentials Pro sent from the cloud, the trusted third party uses the evidence validation algorithm CheckProof(I, Q, P_{i,j,w}, q, pKey_u, Pro) → {"Su", "Fa"} to judge whether the data file is complete. The detailed process is as follows: the trusted third party locates the θ sentry positions included in the detection interval [x_w, y_w] according to the sentry position array (P_{i_w,j_w,1}, ..., P_{i_w,j_w,z}, ..., P_{i_w,j_w,l}) of data block a′_{i_w,j_w}, namely x_w + 1 − q ≤ P_{i,j,m} < y_w, 1 ≤ m < θ. Then, the trusted third party recalculates the detailed sentry information with the sentry generation algorithm GuardGen(i, j, P_{i,j,w}, q, pKey_u) → G_{i,j,w} and compares it with the corresponding positions in the detection evidence Pro, noting that a sentry may appear only partially in the detection area. If all r data segments in Pro pass the comparison, the data has not been modified or damaged; otherwise, the data stored in the remote server is incomplete. Finally, the trusted third party submits the data integrity detection result to the cloud storage user.
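A toy sketch of the comparison step in CheckProof. Segment matching is simplified to full substring containment (the real scheme also accepts partial sentries at interval edges), and the return tokens follow the {"Su", "Fa"} convention above:

```python
def check_proof(proof_segments, expected_guards):
    """Return "Su" only if every returned data segment contains its
    regenerated sentry data; any mismatch yields "Fa"."""
    for segment, guard in zip(proof_segments, expected_guards):
        if guard not in segment:
            return "Fa"          # some sentry was modified or missing
    return "Su"
```

Because the sentries are regenerated locally from secrets the cloud never sees, a server that has lost or altered the challenged regions cannot forge matching segments.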
(4) Data security accountability based on a trusted third party. With the popularization and development of cloud storage, people pay more and more attention to the security of cloud data. When the data stored in the cloud is illegally modified, neither the user nor the cloud can provide cogent credentials to divide the responsibility. Therefore, a data security accountability scheme based on a trusted third party is proposed. This scheme takes the trusted third party as the core, carries out an online status judgment for each user operation, and generates credentials that cannot be denied by either party, so as to ensure the reliability of the audit and accountability of the trusted third party when the cloud is not trusted.
A trusted third-party accountability system needs to provide the following functions: any operation on cloud data is recorded by both the trusted third party and the cloud; when disputes occur, the trusted third party can accurately identify the responsible party and determine the responsibility; and the certificate used to judge the responsible party has non-repudiation. Figure 7 shows the framework of the data security accountability scheme based on the trusted third party.
In Figure 7, the user operates on cloud files through a browser or another client. Each time users log in to the cloud through the browser or client, they obtain a temporary token from the trusted third party. A file key version is formed after each user operation. Whenever the user and the cloud operate on the data, an accountability voucher is generated and saved in the trusted third party's voucher table. When accountability is demanded or a dispute between the two parties occurs, the trusted third party can judge the responsibility based on the cloud file operation records and the local vouchers.

Results
To verify the performance of the network storage data integrity detection method based on symmetric difference, the experimental environment was one PC running Windows 10, with 16 GB of installed memory (RAM), an AMD Ryzen Threadripper 2990WX @ 3.5 GHz processor, and a 250 GB disk with a rotational speed of 5400 RPM; the implementation used the Java language. In this experiment, text data and image data were randomly crawled from the network storage database by the above method, and the feasibility of the proposed method was verified by the precision rate of data integrity detection. Figure 8 shows the precision rate of data integrity detection. As Figure 8 shows, the overall precision rate of the proposed method is high, indicating good performance. The method performs feature analysis of text data through automatic word segmentation, part-of-speech tagging, and Chinese word segmentation. Based on the symmetrical difference algorithm and background subtraction, feature extraction of image data is realized. Meanwhile, preliminary recognition of data integrity is achieved, which improves the precision rate of data integrity detection.
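The two image-processing operations named above can be illustrated with a minimal sketch. This is a textbook rendering of three-frame symmetric differencing and background subtraction on grayscale frames represented as nested lists, under assumed parameter names (`thresh` etc.); the paper's actual implementation is not specified.

```python
def symmetric_difference(prev, curr, nxt, thresh=25):
    """Three-frame symmetric difference: a pixel is marked as changed
    only if it differs between (prev, curr) AND between (curr, nxt),
    which suppresses noise that appears in a single frame pair."""
    h, w = len(curr), len(curr[0])
    mask = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            d1 = abs(curr[y][x] - prev[y][x]) > thresh
            d2 = abs(nxt[y][x] - curr[y][x]) > thresh
            mask[y][x] = 1 if (d1 and d2) else 0
    return mask

def background_subtraction(frame, background, thresh=25):
    """Classic background subtraction: mark every pixel that deviates
    from a reference background frame by more than the threshold."""
    h, w = len(frame), len(frame[0])
    return [[1 if abs(frame[y][x] - background[y][x]) > thresh else 0
             for x in range(w)] for y in range(h)]
```

In the combined scheme described in the paper, the symmetric-difference mask would be the primary feature and the background-subtraction mask the supplement; how the two masks are fused is not detailed in the text.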
Three different methods are used to further verify data integrity. In this paper, the availability of network storage data is verified, with the following results.
Analysis of Figure 9 shows that availability differs among the methods under different data volumes. When the data volume is 5 GB, the data availability rate of the method in literature [4] is 72%, that of the method in literature [5] is 82%, and that of the proposed method is 94%. The availability rate of the proposed method is significantly higher than that of the other two methods, and its data integrity is better.

Discussion
The recall rate of data integrity detection is used as the discussion index to analyze the recall performance of the proposed method. The results are shown in Figure 10.
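Since precision and recall are the two evaluation indices used throughout this section, their computation can be made explicit. This is the standard definition applied to integrity detection (with hypothetical block identifiers), not code from the paper.

```python
def precision_recall(detected, actual_corrupted):
    """Precision = correctly flagged blocks / all flagged blocks;
    recall = correctly flagged blocks / all truly corrupted blocks."""
    detected, actual = set(detected), set(actual_corrupted)
    tp = len(detected & actual)  # true positives
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(actual) if actual else 0.0
    return precision, recall
```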

Figure 10 shows that the proposed method uses randomly selected sentry data segments to detect data integrity. Because the sentry data segments contain not only the selected sentry information but also effective data and other sentry data, the scheme supports unlimited detection by the detecting party, which improves the recall rate of data integrity detection.
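The sentry mechanism can be sketched as follows. This is a minimal hash-based illustration under assumed names (`embed_sentries`, `check_integrity`, the seed-as-secret convention); the paper's sentry segments carry more structure than bare digests, but the challenge-response idea of randomly sampling hidden positions is the same.

```python
import hashlib
import random

def embed_sentries(data: bytes, num_sentries: int, seg_len: int, seed: int):
    """Record digests of randomly chosen data segments ('sentries').
    The seed is the verifier's secret, so the storage server cannot
    predict which segments will be challenged."""
    rng = random.Random(seed)
    positions = [rng.randrange(0, len(data) - seg_len)
                 for _ in range(num_sentries)]
    digests = [hashlib.sha256(data[p:p + seg_len]).hexdigest()
               for p in positions]
    return positions, digests

def check_integrity(stored: bytes, positions, digests, seg_len: int,
                    sample: int, seed: int) -> bool:
    """Challenge a random subset of sentries; any digest mismatch
    means the stored copy was modified or lost data."""
    rng = random.Random(seed)
    idx = rng.sample(range(len(positions)), sample)
    return all(
        hashlib.sha256(stored[positions[i]:positions[i] + seg_len]).hexdigest()
        == digests[i]
        for i in idx)
```

Because each challenge samples a fresh subset of sentries, the verifier can repeat the check indefinitely without exhausting its secrets, which is the property the recall analysis above relies on.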
In order to further verify the performance of this method, the coding times of three different methods are compared, with the following results.
Analysis of Figure 11 shows that the methods have different encoding times; the shorter the encoding time, the higher the efficiency. When the amount of stored data is 14 MB, the coding time of the method in literature [4] is 8 s, that of the method in literature [5] is 12 s, and that of the proposed method is 5 s. The coding time of the proposed method is the shortest, so its efficiency is the highest.

Conclusions
In order to solve the problem of network data security and integrity, a method to detect network storage data integrity based on symmetric difference is put forward. Based on data acquisition, feature analysis, and feature extraction, the sentry data segment is introduced to detect data integrity. The research results of this paper can be summarized as follows:
(1) Through in-depth study of data feature extraction and data integrity technology, this paper proposes a method of network storage data integrity detection based on symmetric difference, in which the integrity of network storage data is detected with the help of an automatic image annotation system. For data recovery, two rounds of coding of the original file provide strong anti-corruption ability: even in the face of large-area and high-frequency file damage, the original file can still be recovered with a high recovery rate, providing high data availability. The method also performs well each time the data recovery algorithm is called.
(2) This paper presents an efficient data recovery algorithm. After the faulty storage nodes are located, the faulty data on these nodes must be recovered. The algorithm ensures data recovery with low communication overhead and a high recovery rate, providing better security for network storage data.
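As a concrete illustration of coding-based recovery, the sketch below uses single-parity XOR striping, the simplest erasure code. This is an assumed stand-in: the paper's two-round coding scheme is not specified, and a real deployment tolerating multiple simultaneous node failures would use a stronger code such as Reed-Solomon.

```python
def make_parity(stripes):
    """Compute an XOR parity stripe over equal-length data stripes,
    to be stored on a separate node."""
    parity = bytearray(len(stripes[0]))
    for s in stripes:
        for i, b in enumerate(s):
            parity[i] ^= b
    return bytes(parity)

def recover_stripe(surviving, parity):
    """Rebuild a single lost stripe by XORing the surviving stripes
    with the parity; only the survivors and the parity stripe need to
    be transferred, keeping communication overhead low."""
    lost = bytearray(parity)
    for s in surviving:
        for i, b in enumerate(s):
            lost[i] ^= b
    return bytes(lost)
```

With n data nodes and one parity node, any single failed node can be rebuilt from the other n stripes, which is the kind of low-overhead recovery the conclusion refers to.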
(3) The integrity detection method proposed in this paper supports not only static data detection, but also dynamic data detection.
In this paper, the integrity detection method of network storage data based on symmetric difference is studied, and phased research results are obtained. However, owing to limited time and energy and the interrelated problems involved, the work in this paper inevitably has shortcomings that need further improvement and further research in the future. With the continuous growth of data volume, the development and expansion of the trusted third party becomes the next problem to be solved, and fault tolerance is also a very important research focus.
Author Contributions: This paper studies the integrity of network storage data based on symmetric difference. Based on a complete automatic image annotation system, web crawler technology is used to capture images and related text information. Feature analysis of text data is based on automatic word segmentation, part-of-speech tagging, and Chinese word segmentation. The symmetric difference algorithm, supplemented by background subtraction, is mainly used to achieve image data feature extraction. With a trusted third party as the core, online status judgment is conducted for each user operation, and credentials that neither party can deny are generated, effectively preventing the verifier from providing false or falsified verification results. The experimental results show that the proposed method has higher precision and recall and is more reliable. The methodology, concepts, experimental analysis, and so on of this article were completed by the author, X.D., who read and approved the final manuscript. All authors have read and agreed to the published version of the manuscript.