A reversible database watermarking method with low distortion

Abstract: This paper proposes a low-distortion reversible database watermarking method based on histogram gaps, motivated by the large gaps that appear in the histograms of integer database data. The method first calculates the tolerance of each attribute column that contains only integer data and obtains the prediction errors from that tolerance. The database tuples are then randomly grouped according to the watermark bits to be embedded, and a histogram is constructed from the prediction errors. Finally, the histogram correction rule is used to find the peak bin of the histogram, the number of consecutive non-zero prediction errors on the left and right sides of the peak is obtained, and the histogram is shifted on the side with fewer non-zero prediction errors to embed the watermark. Experiments on the public Forest Cover Type Dataset (FCTD) show that, compared with the existing GAHSW method, which also considers distortion, the proposed method significantly reduces the number of histogram columns shifted during embedding, greatly reduces the changes to the carrier data, and effectively reduces the data distortion caused by watermark embedding.


Introduction
As an important technology for protecting database security [1], database watermarking was first introduced and applied to database copyright protection by the IBM Almaden Research Center in 2002 [2]. Since then, a variety of database watermarking techniques have emerged, such as relational database watermarking based on special tag tuples [3,4], digital fingerprint watermarking based on the block idea of multimedia watermarking [5], and relational database watermarking based on optimization techniques [6]. These methods may permanently distort the original host data when the watermark is embedded. If the distortion ratio is small, the usability of the database is not affected, but a large distortion ratio is unacceptable in some fields (such as finance, medicine, law and the military). Reducing data distortion and recovering the original data are therefore important concerns for researchers. To address these problems, researchers have proposed reversible database watermarking [7][8][9][10], a technique that hides the watermark information in the original data and recovers the original data without loss after the watermark is extracted. With a reversible database watermarking method, data can be recovered after insertion, deletion and modification attacks. However, not every legitimate user has the right to recover the data, so the lower the distortion of the watermarked data that such users work with, the better. For this reason, a reversible database watermarking method with low distortion is proposed to effectively reduce the data distortion, and research on such methods is of practical significance.
Reversible database watermarking methods can be divided into methods with data distortion and methods without data distortion. In the following, we mainly introduce robust watermarking methods with data distortion. Reference [7] proposes a reversible database watermarking method based on histogram shifting. References [11,12] use difference expansion watermarking (DEW) to restore the original database and realize a reversible watermarking mechanism for relational databases. Reference [13] groups the tuples and then uses DEW to embed the watermark. With the GADEW method, reference [14] combines a genetic algorithm (GA) with DEW and proposes a robust reversible database watermarking solution that improves the capacity of DEW while keeping distortion at a certain level. Reference [15] proposes Prediction-Error Expansion Watermarking (PEEW), a reversible database watermarking method that is more resistant to attacks. For numerical databases, reference [16] proposes a Robust and Reversible Watermarking (RRW) method that effectively resists insertion, deletion and modification attacks and better protects data quality. In [17], the authors propose "A New Robust Approach for Reversible Database Watermarking with Distortion Control", abbreviated as GAHSW (Genetic Algorithm and Histogram Shifting Watermarking), for numerical relational data. The method combines GA with a newly proposed Histogram Shifting of prediction error Watermarking (HSW) method to minimize distortion and improve the robustness of database watermarking. GAHSW uses GA to optimize the grouping and then embeds one watermark bit in each group with HSW, which ensures the robustness of the watermarked database. Compared with previous methods, it not only significantly improves the embedding capacity and reduces data distortion, but also achieves higher robustness. However, since the whole histogram [18][19][20][21][22][23][24][25] is shifted when the watermark is embedded, the data distortion is still serious.
To solve the above problems, we propose a low-distortion reversible database watermarking method using histogram gaps, abbreviated as HGW (Histogram-Gaps based Watermarking). The traditional histogram shifting method is improved as follows. Since the histogram contains gaps, the method finds the vacant position that introduces the minimum distortion and uses it to determine the shift direction and the shift distance. It then moves only the necessary columns of the histogram, which reduces the number of columns shifted when the watermark is embedded. The modification of the carrier database is therefore lowered, which meets the need to reduce data distortion.
This paper consists of four parts. Section 2 elaborates the basic ideas and main steps of the proposed method. Section 3 presents the experimental verification and an analysis of the results. Section 4 gives conclusions and proposes future research directions.

The proposed method
Step 1. Data preprocessing.
Step 1.1. Select multiple integer columns from the database that can identify the features of things, and then sort the selected columns in ascending order.
Step 2. Grouping tuples. A secret grouping key $K_s$ is chosen at random and the tuples in the database are divided into a set of $N_g$ non-overlapping groups, where the value of $N_g$ is determined by the number of watermark bits to be embedded. Eq. (2.2) is used to determine the group in which each tuple is located.
In Eq. (2.2), $n_u$ is the group number, '|' is the concatenation operation, $H()$ is a hash function, and its parameters are the secret grouping key $K_s$ and the primary key $t_u.PK$ of the tuple.
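The keyed grouping of Step 2 can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the nested form $n_u = H(K_s | H(K_s | t_u.PK)) \bmod N_g$ that is common in the database watermarking literature, and it assumes SHA-256 as the hash function. The names `group_index` and `group_tuples` and the key string are hypothetical.

```python
import hashlib


def group_index(secret_key: str, primary_key: str, num_groups: int) -> int:
    """Map one tuple to a group number in [0, num_groups).

    Assumed form: n_u = H(K_s | H(K_s | t_u.PK)) mod N_g, with H = SHA-256.
    """
    inner = hashlib.sha256((secret_key + primary_key).encode("utf-8")).hexdigest()
    outer = hashlib.sha256((secret_key + inner).encode("utf-8")).hexdigest()
    return int(outer, 16) % num_groups


def group_tuples(primary_keys, secret_key: str, num_groups: int):
    """Partition primary keys into num_groups non-overlapping groups."""
    groups = {g: [] for g in range(num_groups)}
    for pk in primary_keys:
        groups[group_index(secret_key, str(pk), num_groups)].append(pk)
    return groups


# Hypothetical example: 1000 tuples with synthetic integer keys, 24 groups.
groups = group_tuples(range(1000), "hypothetical-secret", 24)
```

Because the group number depends only on the key and the primary key, the same partition can be rebuilt at extraction time regardless of the order in which the tuples are stored.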
Step 3.1. Calculate $p_e$ and $p_h$ with Eq. (2.3) and Eq. (2.4), respectively. In Eq. (2.3) and Eq. (2.4), $y$ is the value located in the $j$-th column of a certain tuple, $p_e$ is the prediction error corresponding to $y$, and $p_h = |p_e|$ is the absolute value of the prediction error of the original database.
Step 3.2. Construct a histogram of $p_h$ for each group, mark the most frequent $p_h$ in the $i$-th group as $p_i$ (i.e., the peak bin of the histogram), and store $p_i$ in the array pa.
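Steps 3.1 and 3.2 can be illustrated with a short sketch. It assumes that the prediction error is the difference between a value and a predicted value `y_hat` supplied by the caller (how the prediction is derived from the column tolerance is not modelled), and it breaks ties between equally frequent bins toward the smaller bin. Both choices, and the name `peak_bin`, are assumptions made for illustration.

```python
from collections import Counter


def peak_bin(values, y_hat):
    """Return (histogram, peak) for the values of one group.

    p_e = y - y_hat is an assumed form of the prediction error;
    p_h = |p_e| follows the definition in the text.
    """
    abs_errors = [abs(y - y_hat) for y in values]
    hist = Counter(abs_errors)
    # Most frequent p_h; ties go to the smaller bin (an assumption).
    peak = min(hist, key=lambda b: (-hist[b], b))
    return hist, peak


# Hypothetical group of seven integer values with predicted value 10.
hist, p_i = peak_bin([10, 12, 12, 13, 17, 12, 8], y_hat=10)
```

In this toy group the absolute errors are 0, 2, 2, 3, 7, 2, 2, so the bin 2 is the peak and bins 1, 4, 5 and 6 are gaps.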
Related research has found that the histograms of discrete database data contain gaps. Take the right shift as an example: it is not necessary to move all the histogram columns on the right side of $p_i$. It is enough to clear one vacant position beside $p_i$ and then embed the watermark at $p_i$.
Step 4.1. In the $i$-th group, the method starts from $p_i$ and searches to the left and to the right for the first $p_h$ whose frequency is zero. The corresponding positions are recorded as $p_{iL}$ and $p_{iR}$, the sum of the heights of the rectangles from $p_{iL}$ to $p_i$ is recorded as $hs_{iL}$, and the sum of the heights of the rectangles from $p_i$ to $p_{iR}$ is recorded as $hs_{iR}$. The distances from the position of $p_i$ to $p_{iL}$ and to $p_{iR}$ are then calculated with Eq. (2.5), in which $d_{iL}$ is the distance from the position of $p_i$ to $p_{iL}$ and $d_{iR}$ is the distance from the position of $p_i$ to $p_{iR}$.
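The gap search of Step 4.1 can be sketched as below. The sketch assumes that bins are non-negative integers (since $p_h$ is an absolute value), that the height sums include the peak bin on both sides, that the distances are plain bin differences, and that a tie between the two sums is resolved toward the right. The function name `find_gaps` and the returned dictionary keys are hypothetical.

```python
def find_gaps(hist, peak):
    """Find the nearest empty bin on each side of the peak.

    hist maps non-negative integer bins to counts. Returns the gap
    positions, the height sums between gap and peak (peak included),
    the distances, and the side with the smaller height sum.
    """
    def scan(step):
        pos = peak + step
        while pos >= 0 and hist.get(pos, 0) > 0:
            pos += step
        if pos < 0:
            return None, None, None  # no empty bin on this side
        lo, hi = min(pos, peak), max(pos, peak)
        height_sum = sum(hist.get(b, 0) for b in range(lo, hi + 1))
        return pos, height_sum, abs(peak - pos)

    p_left, hs_left, d_left = scan(-1)
    p_right, hs_right, d_right = scan(+1)
    # Prefer the side that moves fewer values; ties go right (assumption).
    if hs_left is None or hs_right <= hs_left:
        side = "right"
    else:
        side = "left"
    return {"p_iL": p_left, "hs_iL": hs_left, "d_iL": d_left,
            "p_iR": p_right, "hs_iR": hs_right, "d_iR": d_right,
            "side": side}


# Hypothetical histogram with peak bin 3 and empty bins at 0 and 5.
gaps = find_gaps({1: 2, 2: 1, 3: 5, 4: 1, 6: 2}, peak=3)
```

Here the left gap is at bin 0 with a height sum of 8, the right gap is at bin 5 with a height sum of 6, so the right side would be chosen.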
Step 4.2. Shifting the histogram and embedding the watermark. Eq. (2.6) and Eq. (2.7) are obtained according to the result of comparing $hs_{iL}$ with $hs_{iR}$.
If $p_i = 1$, Eq. (2.8) and Eq. (2.9) should be the only choice. This reduces the data distortion caused by the shift.
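The idea of shifting only up to the nearest gap can be sketched for the right-shift case. This is one possible reading of the step and not a reproduction of Eq. (2.6) to Eq. (2.11): it works directly on the absolute prediction errors of one group, embeds the same bit in every peak value of the group, and assumes that the peak and gap positions are available at extraction time. The names `embed_right` and `extract_right` are hypothetical.

```python
def embed_right(errors, peak, gap, bit):
    """Embed one bit by shifting only the bins between peak and gap."""
    marked = []
    for e in errors:
        if peak < e < gap:
            marked.append(e + 1)    # moves into the vacant position
        elif e == peak:
            marked.append(e + bit)  # the peak carries the watermark bit
        else:
            marked.append(e)        # bins beyond the gap are untouched
    return marked


def extract_right(marked, peak, gap):
    """Recover the bit by majority vote and restore the original errors."""
    votes = [e - peak for e in marked if e in (peak, peak + 1)]
    bit = int(sum(votes) * 2 > len(votes))
    restored = []
    for e in marked:
        if e == peak + 1:
            restored.append(peak)
        elif peak + 1 < e <= gap:
            restored.append(e - 1)
        else:
            restored.append(e)
    return bit, restored


# Hypothetical group: peak bin 3, nearest empty bin on the right is 5.
errors = [1, 1, 2, 3, 3, 3, 3, 3, 4, 6, 6]
marked = embed_right(errors, peak=3, gap=5, bit=1)
bit, restored = extract_right(marked, peak=3, gap=5)
```

In this example the two values in bin 6 are never modified, whereas a shift of the whole right side of the histogram would have changed them as well.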
Step 5. Watermark extraction and data recovery.
Step 5.1. The method selects the columns in which the watermark is embedded and sorts them. The selection and sorting methods are the same as in Step 1. Using the secret grouping key generated in the watermark embedding stage, the tuples are divided into non-overlapping groups by Eq. (2.2). $p_e$ and $p_h$ are calculated with Eq. (2.12) and Eq. (2.13), respectively, and a histogram of $p_h$ is constructed for each group.
In Eq. (2.12), $y$ is any value in the watermarked database $D_W$, $\hat{y}$ is the calculated tolerance of the column containing $y$, and $p_e$ is the prediction error corresponding to $y$. After the histogram is constructed, the method scans the $p_e$ values one by one, judges their positions in the histogram, extracts the watermark and restores the database.
Step 5.2. Watermark extraction. Each watermark bit is extracted by using the distance and direction of the shift in each histogram stored in the array pa. The method performs this detection on all tuples of the same group, separately records the number of '0' and '1' bits detected in the group, and then uses a majority voting mechanism to determine the final watermark bit of the group: the bit value detected more often is taken as the final detected watermark bit.
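The majority vote of Step 5.2 can be written in a few lines. This sketch assumes that each group yields a list of detected bits (0 or 1) and that a tie is resolved as 0, a rule the text does not specify. The names `majority_bit` and `recover_watermark` are hypothetical.

```python
def majority_bit(detected_bits):
    """Return the bit value detected more often in one group."""
    ones = sum(detected_bits)
    zeros = len(detected_bits) - ones
    return 1 if ones > zeros else 0  # ties fall back to 0 (assumption)


def recover_watermark(bits_per_group):
    """Apply the majority vote to every group, in group order."""
    return [majority_bit(bits) for bits in bits_per_group]


# Hypothetical detections for three groups after a mild attack.
watermark = recover_watermark([[1, 1, 0, 1], [0, 0, 0, 1, 0], [1, 0]])
```

Voting over all tuples of a group is what lets a bit survive when some of the tuples carrying it are deleted or modified.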

Experimental results and analysis
Experiments were conducted on a workstation with an Intel Core i5 CPU at 2.40 GHz and 8 GB of RAM, on which the method was implemented and tested.
The test database is the Forest Cover Type dataset provided by the University of California. The dataset contains 581,012 tuples and 54 attributes. In this paper, 10 integer-type columns are selected for the experiments, and the method is compared with the existing GAHSW reversible watermarking method. For the sake of comparison, as in the GAHSW experiments in [17], a synthetic column is generated as the primary key. All of the following experiments embed watermarks under the same conditions.

Statistical distortion analysis
In this section, the data distortion rate (DDR) is used to evaluate the distortion that HGW and GAHSW cause in the database after the watermark is embedded. The DDR is calculated as follows:
$$DDR = \frac{T_{dis}}{TD} \qquad (3.1)$$
In Eq. (3.1), $T_{dis}$ is the total amount of distorted data and $TD$ is the total amount of data in the database. The larger the DDR, the greater the distortion of the data, and the smaller the DDR, the smaller the distortion. The experiments verify that the distortion rate of HGW is much smaller than that of GAHSW. The distortion rates of HGW and GAHSW were compared for watermark lengths (numbers of groups) of 24, 48 and 72. For each watermark length, databases of 1000, 1200, 1400, 1600, 1800 and 2000 tuples were tested, and the results are plotted in Figures 2, 3 and 4. In each figure, the vertical axis is the DDR, where a DDR of zero means that the data of the database is not distorted, and the horizontal axis is the number of tuples. Since both HGW and GAHSW rely on stochastic optimization, each distortion rate is the average of 10 runs of the corresponding method. As can be seen from Figures 2, 3 and 4, when the same watermark information is embedded in the same database, the DDR produced by HGW is less than 0.7% and much lower than that of GAHSW.
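The DDR can be computed with a short helper. This sketch assumes that the amount of distorted data is the number of cells whose value differs between the original and the watermarked table, and that the total amount of data is the number of cells. The function name and the sample rows are hypothetical.

```python
def data_distortion_rate(original, watermarked):
    """Fraction of cells that differ between two equally shaped tables."""
    changed = sum(
        a != b
        for row_o, row_w in zip(original, watermarked)
        for a, b in zip(row_o, row_w)
    )
    total = sum(len(row) for row in original)
    return changed / total


# Hypothetical three-tuple, two-column table with one modified cell.
original = [[2596, 51], [2590, 56], [2804, 139]]
watermarked = [[2596, 51], [2591, 56], [2804, 139]]
ddr = data_distortion_rate(original, watermarked)
```

With one changed cell out of six, the sketch gives a DDR of one sixth, or about 16.7%.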

Histogram shifting quantity analysis
In this section, the total amount of data distorted by the shift (excluding the distortion caused by embedding the watermark bits) is used to evaluate the distortion that HGW and GAHSW cause in the database after the watermark is embedded. The two methods were compared for watermark lengths (numbers of groups) of 24, 48 and 72. For each watermark length, databases of 1000, 1200, 1400, 1600, 1800 and 2000 tuples were tested, and the results are plotted in Figures 5, 6 and 7. In each figure, the vertical axis is the total amount of distorted data, where zero means that the shift does not distort any data in the database, and the horizontal axis is the number of tuples. Since both HGW and GAHSW rely on stochastic optimization, each total is the average of 10 runs of the corresponding method. Figures 5, 6 and 7 show that when the same watermark information is embedded in the same database, the total amount of data distorted by the shift in HGW is much lower than in GAHSW: it ranges from a minimum of 1.5 to a maximum of 7, so the HGW curve almost overlaps with the horizontal axis.

Robustness analysis
In this section, the robustness of the HGW watermarking method under well-known database attacks is reported. For convenience, the comparison sets the length of the watermark (the number of groups) to 48, as in the tests in [17]. Three types of attacks are tested: insertion, deletion and modification. Robustness is evaluated with the bit error rate (BER), that is, the ratio of the number of erroneously extracted bits to the number of embedded watermark bits. The BER is calculated as follows:
$$BER = \frac{\sum_{i=1}^{N_g} \left( w_i \oplus w_i^{det} \right)}{N_g} \qquad (3.2)$$
In Eq. (3.2), $w_i$ is the embedded watermark bit, $w_i^{det}$ is the detected watermark bit, and $N_g$ is the number of embedded watermark bits (the number of groups). The lower the BER, the higher the robustness of the watermark. The experiments verify that the robustness of HGW is comparable to that of GAHSW. In the following, we conduct attack experiments under the best and the worst situations, present the watermark detection and data recovery results, and compare them. Acting as an attacker, we insert, delete and modify 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80% and 90% of the data, and the results are plotted in Figures 8 and 9. The vertical axis in each figure is the BER, the ratio of watermark bits that were not successfully detected; a BER close to zero means that the watermark is correctly detected from the database. The horizontal axis is the percentage (%) of tuples in the database changed by the attack. Since HGW relies on random optimization, the BER of the watermark extraction is the average of 10 runs of the method.
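The BER described above can be computed as in the following sketch, which assumes that the embedded and detected watermarks are equal-length sequences of 0/1 integers. The function name and the sample bits are hypothetical.

```python
def bit_error_rate(embedded, detected):
    """Ratio of wrongly detected bits to embedded bits."""
    if len(embedded) != len(detected):
        raise ValueError("watermarks must have the same length")
    errors = sum(w ^ d for w, d in zip(embedded, detected))  # XOR marks a mismatch
    return errors / len(embedded)


# Hypothetical 4-bit watermark in which two bits were detected wrongly.
ber = bit_error_rate([1, 0, 1, 1], [1, 1, 1, 0])
```

A BER near 0.5 corresponds to detection that is no better than guessing, which is a useful reference point when reading the BER values reported under attack.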
Both GAHSW and HGW are robust to insertion attacks: the BER under an insertion attack is always zero no matter how many tuples are inserted, so no comparison diagram is drawn for insertion attacks. Figure 8 and Figure 9 show that the robustness of GAHSW and HGW is roughly equivalent under deletion and modification attacks. As the attack strength increases, the BER of the watermark extracted by the two methods also increases. The experiments also show that HGW and GAHSW achieve a better watermark detection rate under these attacks than the other methods.
A deletion attack randomly deletes tuples to destroy the watermark. Figure 8 shows the BER of the watermark extracted by the HGW, GAHSW, PEEW, DEW and GADEW methods after a deletion attack. When the database suffers a severe deletion attack, for example when 90% of the tuples are deleted, DEW, GADEW and PEEW extract the watermark with BER values of 0.814, 0.955 and 0.915, respectively. In contrast, the BER of the watermark extracted by HGW is 0.472 and that of GAHSW is 0.477, so HGW can still recover at least half of the watermark. If most of the tuples that contain the watermark are deleted, the watermark cannot be recovered. Deletion attacks have the greatest impact on database watermarks.
A modification attack randomly modifies attribute values in the database. Figure 9 shows the BER of the watermark extracted by the HGW, GAHSW, PEEW, DEW and GADEW methods after a modification attack. The BER of the extracted watermark increases as the amount of modified data increases. When 90% of the tuples in the database are modified, DEW, GADEW and PEEW extract the watermark with BER values of 0.434, 0.584 and 0.526, respectively. In contrast, the BER of the watermark extracted by HGW is 0.383 and that of GAHSW is 0.411, so HGW can still recover at least half of the watermark. If the watermark contained in most tuples is modified, the affected watermark bits are difficult to extract from the remaining unaffected data.
In general, compared with GAHSW, the HGW method produces a lower BER when the same watermark information is embedded in the same database. It is worth noting that the HGW method sorts the attribute columns alphabetically by name before the watermark is embedded and extracted, and groups the tuples according to their primary keys. Therefore, rearranging the attribute columns or tuples has no effect on watermark detection.

Conclusions and future work
To reduce the data distortion rate of the carrier, this article proposes a low-distortion reversible database watermarking method based on histogram gaps. The method uses the vacancies provided by the histogram gaps to reduce the number of columns that are shifted when the watermark is embedded, so the modification of the carrier database is reduced, which meets the need to reduce data distortion. HGW significantly reduces the number of histogram columns shifted during embedding, greatly reduces the changes to the carrier data, and effectively reduces the data distortion caused by watermark embedding. The comparison between HGW and GAHSW shows that HGW is superior to GAHSW in reducing both the amount of histogram shifting and the data distortion. When there is no attack, the watermark extraction error rate of HGW is better than that of GAHSW. Under attack, the bit error rate of watermark extraction of HGW is nearly equal to that of GAHSW. Building on the present method, our future work is to propose better methods that further reduce the total amount of distorted data.

Figure 2. The comparison of data distortion rates caused by 24-bit watermarks.

Figure 3. The comparison of data distortion rates caused by 48-bit watermarks.

Figure 4. The comparison of data distortion rates caused by 72-bit watermarks.

Figure 5. The comparison of the total amount of data distortion caused by the shift with a 24-bit watermark.

Figure 6. The comparison of the total amount of data distortion caused by the shift with a 48-bit watermark.

Figure 7. The comparison of the total amount of data distortion caused by the shift with a 72-bit watermark.

Figure 9. Comparison of the watermark extraction BER of HGW with GAHSW, PEEW, DEW and GADEW after modification attacks.
In Eq. (2.6) and Eq. (2.7), $p_h$ is the absolute value of the new prediction error of the watermarked database and $p_e$ is the new prediction error of the watermarked database. Depending on the situation, the histogram shifting and watermark embedding are finally carried out with Eq. (2.8), Eq. (2.9), Eq. (2.10) and Eq. (2.11). In addition, two special situations should be noted. If $p_i = 0$, Eq. (2.8) is the only one that should be used. If $p_i = 1$, Eq. (2.8) and Eq. (2.9) should be the only choice.