Enhancing Feature Point-Based Video Watermarking against Geometric Attacks with Template

As Internet and communication technologies have developed rapidly, online video content has become easier to distribute and use, which has led to serious copyright infringement. Video watermarking is a viable solution for protecting the copyright of digital video, but resisting geometric attacks remains a significant challenge. Although feature point-based watermarking algorithms are expected to be highly resistant to these attacks, they are sensitive to errors in locating the feature regions, which lowers the watermark extraction accuracy. To solve this issue, we introduce a template to improve the location accuracy of feature point-based watermarking. Furthermore, we present a scene change-based frame allocation method that embeds the template and the watermark in different frames, eliminating their mutual interference and improving the performance of the proposed algorithm. The experimental results show that our algorithm outperforms state-of-the-art methods in robustness against geometric attacks at comparable imperceptibility.


Introduction
The rapid development of the Internet [1,2] has made it easy to disseminate and use digital media assets such as images, video, and audio, and copyright infringement has grown as a result. Watermarking offers a solution for copyright tracking and verification: it imperceptibly embeds a watermark containing copyright information into digital content. When a copyright dispute arises, the watermark is extracted to establish the ownership of the creator. This paper focuses on video watermarking [3][4][5][6][7][8][9][10][11][12], which employs video as its carrier.
During transmission over the Internet, the watermarked digital video may suffer several deliberate or unintentional attacks. Common types include signal processing attacks (adding noise, filtering, transcoding, and so on), frame adjustment attacks (frame insertion, frame dropping, and so on), and geometric attacks (scaling, cropping, and so on). Signal processing attacks reduce the watermark energy, frame adjustment attacks change the number and relative placement of the frames carrying the watermark, and geometric attacks desynchronize the position of the watermark between embedding and extraction. Most existing video watermarking methods are robust against signal processing and frame adjustment attacks, but they are usually vulnerable to geometric attacks.
Feature point-based watermarking extracts feature points from images or frames and uses them to locate nonoverlapping local regions, called feature regions, where the watermark is inserted. By exploiting the invariance of the feature points, the feature regions can be kept as unaltered as possible before and after geometric attacks. The scale information and the spatial coordinate of the associated feature point determine the location and the size of each feature region; thus, the robustness of the feature regions depends largely on the stability of the scale information and the spatial coordinate. However, it is difficult to reproduce the selected locations exactly after geometric attacks, which causes errors in locating the feature regions and a serious decline in the watermark extraction accuracy.
To improve the location accuracy, we introduce the template [22][23][24][25], which can recover geometrically distorted frames and help the feature points locate the feature regions. To reproduce the feature regions in an image that has already been recovered by the template, the scale information of the feature points is no longer required; only the spatial coordinate is needed. As a result, the error in locating the feature regions can be significantly reduced.
The template is embedded in an image before the watermark is embedded. In the extraction procedure, it is extracted from the possibly damaged watermarked image to acquire the affine transform parameters. The damaged image is then restored to its original shape, which synchronizes the watermark position between embedding and extraction. However, in existing studies the watermark and the template are embedded in the same image, so they interfere with each other, and the algorithm cannot achieve ideal performance.
In summary, this paper contributes to the ongoing studies in the following two ways:
(1) A scene change-based frame allocation strategy is presented. This strategy assigns different frames to the watermark and the template, eliminating their mutual interference.
(2) A video watermarking scheme against geometric attacks is designed by combining the feature points with the template. The template helps the feature points locate the feature regions, which decreases the locating errors and increases the robustness.
After a series of tests on real-world datasets, we find that our approach outperforms the state-of-the-art approaches in terms of robustness while maintaining good imperceptibility.

Related Work
Research on video watermarking dates back to 1994. Matsui et al. [12] consider the video as a collection of sequential images in a specified order and then embed the watermark in these images. A complete video watermarking framework addresses three issues: frame selection for embedding and extraction, embedding region determination, and embedding and extraction scheme design. Frame selection is a task unique to video watermarking, which distinguishes it from the watermarking of other digital media; it consists of selecting all frames [12] or a subset of frames (such as I-frames [6,7], keyframes [8,9], and scene change frames [10,11]). Embedding region determination and scheme design are both performed on the selected frames, so image watermarking algorithms can be used as references. Based on the embedding regions, existing watermarking algorithms can be classified as global or local. Global watermarking algorithms embed into all the pixels of an image or a frame, which makes them vulnerable to cropping attacks. Local watermarking algorithms usually select the embedding regions by exploiting the invariance of feature points, so that these regions stay roughly constant before and after attacks. Furthermore, global algorithms have worse imperceptibility than local ones, and the correctness of watermark extraction in local algorithms depends on the precision with which the embedding regions are located. Quantization [5-7, 26, 27] and spread spectrum [28,29] are two common types of embedding and extraction schemes. Quantization uses different quantizers to quantize the original carrier data into different index intervals, and during extraction the watermark information is recovered from the index interval to which the quantized data belongs. Spread spectrum uses the orthogonality of the codebook vectors to embed the watermark into the host signal.
Quantization is easy to implement and has low algorithmic complexity, but it has difficulty resisting scaling attacks. Spread spectrum is strongly robust against scaling attacks; however, it suffers from host signal interference.
Feature point-based image watermarking takes local feature points as reference points and uses them to locate nonoverlapping feature regions, into which the watermarks are embedded. Tang and Hang [13] utilize the Harris detector to extract feature points and divide the picture into a collection of nonintersecting triangles for watermark insertion by Delaunay tessellation. The drawback of this method is that if the feature points retrieved from the original and attacked pictures do not match, the triangle groups used for watermark embedding and extraction will differ, and the extraction fails. Tang and Hang [14] determine the feature points by using a feature extraction approach called Mexican Hat wavelet scale interaction, and the watermark is embedded in the normalized circular regions centered on these points. Furthermore, several algorithms select local geometric invariant feature points such as the scale-invariant feature transform (SIFT) [15][16][17][18][19], the Speeded Up Robust Feature (SURF) [20], and KAZE [21] to locate the feature regions, using the spatial coordinate and the scale information of these points. By modifying the pixels in the spatial domain, Lee et al. [16] embed the watermark into the circular patches centered on the chosen SIFT feature points. Zhang et al. [20] present a watermarking scheme against RST distortion based on SURF and embed the watermark by using the odd-even quantization technique. Liu et al. [21] repeatedly embed the watermarks into the significant bitplanes of the KAZE feature regions by modifying their histograms. In summary, the general framework of the feature point-based watermarking algorithms is as follows.
Wireless Communications and Mobile Computing
(i) The watermark embedding process
Step 1. Extract the feature points from the image or the frame.
Step 2. Select a particular number of feature points based on specified criteria to locate the nonoverlapping feature regions. The regions are typically square or circular; a circular region is determined as follows:
$$(x - t_1)^2 + (y - t_2)^2 = (ks)^2, \tag{1}$$
where $(t_1, t_2)$ is the spatial coordinate of the selected feature point, $s$ represents the scale information, which is approximately proportional to the scaling factor, and $k$ is a magnification factor to control the radius of the feature regions.
Step 3. Embed the same watermark into these determined feature regions repeatedly, using a specific watermark embedding method. The watermarked image or frame is generated.
(ii) The watermark extraction process
Obtain the feature regions of the watermarked image or frame in the same manner as in the embedding process, and then repeatedly extract the watermarks from these regions using the extraction method that corresponds to the embedding method described above. The ownership is proven if the watermark can be identified in at least one region. However, neither $(t_1, t_2)$ nor $s$ can be reproduced exactly during extraction, resulting in feature region desynchronization.
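The region-selection step of this general framework can be sketched as follows. This is a minimal illustration, assuming circular regions of radius k·s and a greedy strongest-first rule for resolving overlaps; the `FeaturePoint` type, its `strength` field, and `select_regions` are hypothetical names, and the cited algorithms differ in their exact selection criteria.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class FeaturePoint:
    t1: float        # x coordinate of the feature point
    t2: float        # y coordinate of the feature point
    s: float         # scale reported by the detector
    strength: float  # detector response (assumed ranking criterion)


def select_regions(points, k, width, height, max_regions):
    """Greedily pick non-overlapping circular regions of radius k * s."""
    chosen = []
    for p in sorted(points, key=lambda q: q.strength, reverse=True):
        r = k * p.s
        # Discard regions that would cross the image boundary.
        if p.t1 - r < 0 or p.t2 - r < 0 or p.t1 + r > width or p.t2 + r > height:
            continue
        # Two circles overlap when their centres are closer than the sum of radii.
        if any(hypot(p.t1 - c.t1, p.t2 - c.t2) < r + k * c.s for c in chosen):
            continue
        chosen.append(p)
        if len(chosen) == max_regions:
            break
    return chosen
```

Because both the centre and the radius k·s come from the detector, any drift in either value after an attack moves the region, which is the desynchronization problem described above.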
Many watermarking algorithms insert a template in order to recover from geometric attacks. The templates mostly take the following forms: Pereira et al. [22,23] utilize numerous discrete points placed at a specific distance on two straight lines as the template points; Qi et al. [24] use two straight lines as the template; Tokar et al. [25] suggest embedding square templates in the intermediate frequency transform domain of the images. These methods are resistant to geometric attacks such as scaling and cropping. However, the template and the watermark are both contained in the same image and interfere with each other. As a consequence, it is difficult for template-based watermarking algorithms to attain optimal results in terms of imperceptibility and robustness.

A Scene Change-Based Frame Allocation Strategy
Because only one image is available as the carrier in typical template-based image watermarking, the template and the watermark must be embedded into the same image, which causes mutual interference. A video, on the other hand, can be seen as a series of images, so it offers multiple carriers for embedding. Based on this, we can embed the template and the watermark into different frames. Embedding the template changes all of the pixel values of a frame, which may decrease the imperceptibility. Hence, the template should be embedded in frames to which the human eye is insensitive, and the scene change frame fulfils this criterion. The scene change frame is the initial frame of each scene in a video. It passes so quickly during playback that the embedded information is difficult to discover, and using it as a reference point to locate the watermarked frames can significantly increase the extraction efficiency [10]. Based on these advantages, the scene change-based frame allocation strategy shown in Figure 1 is proposed: the scene change frames are chosen to embed the template, and the N frames following each scene change frame are chosen to embed the watermark, where N is an empirical value.
Moreover, if the current frame is a scene change frame, the correlation coefficient between its histogram and that of the preceding frame will not exceed an empirical threshold. So, we extract the scene change frames based on the correlation coefficient. Denote the correlation coefficient as $d(H(x), H(x+1))$; it is calculated as follows:
$$d(H(x), H(x+1)) = \frac{\mathrm{cov}(H(x), H(x+1))}{\sqrt{D(x)\, D(x+1)}}, \tag{2}$$
where $H(x)$ is the histogram of the Y component in the $x$-th frame, $\mathrm{cov}(H(x), H(x+1))$ is the covariance between $H(x)$ and $H(x+1)$, $D(x)$ is the variance of $H(x)$, and $D(x+1)$ is the variance of $H(x+1)$. If $d(H(x), H(x+1))$ does not exceed a threshold denoted as Thre, then the frame corresponding to $H(x+1)$ is regarded as a scene change frame.
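The allocation strategy can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, frames are assumed to be 8-bit Y planes, the first frame of the video is not flagged because it has no predecessor, and a frame that is itself a scene change is assumed to stay a template frame even if it falls within the N frames after an earlier scene change.

```python
import numpy as np


def luma_histogram(y_frame, bins=256):
    """Histogram H(x) of an 8-bit Y (luma) frame."""
    hist, _ = np.histogram(y_frame, bins=bins, range=(0, 256))
    return hist.astype(np.float64)


def histogram_correlation(h_a, h_b):
    """d = cov(h_a, h_b) / sqrt(D(h_a) * D(h_b))."""
    da = h_a - h_a.mean()
    db = h_b - h_b.mean()
    denom = np.sqrt((da * da).sum() * (db * db).sum())
    # Assumption: two flat histograms are treated as perfectly correlated.
    return 1.0 if denom == 0 else float((da * db).sum() / denom)


def allocate_frames(y_frames, thre=0.6, n=10):
    """Return (template_frames, watermark_frames) as lists of frame indices."""
    hists = [luma_histogram(f) for f in y_frames]
    # Frame i + 1 is a scene change when its correlation with frame i is low.
    template_frames = [
        i + 1
        for i in range(len(hists) - 1)
        if histogram_correlation(hists[i], hists[i + 1]) <= thre
    ]
    watermark_frames = []
    for c in template_frames:
        for j in range(c + 1, min(c + 1 + n, len(y_frames))):
            if j not in template_frames and j not in watermark_frames:
                watermark_frames.append(j)
    return template_frames, watermark_frames
```

The default values `thre=0.6` and `n=10` mirror the Thre and N settings reported in the experimental setup.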

The Proposed Video Watermarking Scheme
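This section focuses on the video watermarking scheme based on the feature points and the template, which includes the embedding and extraction processes.
The template relies on the behavior of the discrete Fourier transform (DFT) under linear transforms. Suppose a linear transform $A$ is applied to a frame in the spatial domain:
$$\begin{pmatrix} x' \\ y' \end{pmatrix} = A \begin{pmatrix} x \\ y \end{pmatrix}, \qquad A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}.$$
Its inverse transform is
$$A^{-1} = \frac{1}{a_{11} a_{22} - a_{12} a_{21}} \begin{pmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{pmatrix}.$$
Then, the following transform is carried out in the DFT domain accordingly:
$$\begin{pmatrix} u' \\ v' \end{pmatrix} = (A^{-1})^{T} \begin{pmatrix} u \\ v \end{pmatrix},$$
where $(u, v)$ denotes the frequency coordinates.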
Thus, by detecting the linear transform of the template in the DFT domain, the corresponding transform in the spatial domain can be deduced.
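Two special cases illustrate this relationship. They are worked examples added for clarity, with $A$ denoting the spatial-domain linear transform and $(u, v)$ the DFT coordinates. For uniform scaling by a factor $K$ and for rotation by an angle $\theta$:

```latex
% Uniform scaling by K: frequencies contract by 1/K.
A = \begin{pmatrix} K & 0 \\ 0 & K \end{pmatrix}
\;\Longrightarrow\;
(A^{-1})^{T} = \begin{pmatrix} 1/K & 0 \\ 0 & 1/K \end{pmatrix},
\qquad
\begin{pmatrix} u' \\ v' \end{pmatrix} = \frac{1}{K} \begin{pmatrix} u \\ v \end{pmatrix}.

% Rotation by \theta: A is orthogonal, so (A^{-1})^{T} = A.
A = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}
\;\Longrightarrow\;
(A^{-1})^{T} = A .
```

Under uniform scaling, a template point on a 45-degree radial line therefore stays on that line and only its distance from the origin changes, while under rotation the spectrum rotates by the same angle as the frame.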
The template embedding process proceeds as follows.
Step 1. Perform scene change detection on the host video, according to Equation (2); obtain the Y components of the scene change frames.
Step 2. For each obtained Y component, pad it with zeros to a size of 3000 × 3000 and then apply the Fast Fourier transform (FFT) to it.
Step 3. Choose two radial lines at an angle of 45 degrees to the coordinate axes in the DFT domain (fixing the angle at 45 degrees simplifies the extraction process because only the abscissa of each template point needs to be matched). Pick seven points in the middle frequency band $[f_1, f_2]$ of each line, with an interval of 11 pixels between adjacent points (refer to Figure 3). Seven points must also be selected at the symmetrical positions so that the inverse transform yields real values.
Step 4. For each selected point $(x_i, y_i)$, $i = 1, \ldots, 7$, embed the template by modifying its magnitude, where $\mathrm{MagT}(x_i, y_i)$ is the magnitude of the $i$-th point containing the template and $\mathrm{LocalMean}(x_i, y_i)$ is the local mean magnitude around the point.
Step 5. Apply the inverse FFT to the Y component containing the template, and remove the padding to restore the original size.
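The steps above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the paper's exact rule: the target magnitude (local mean plus `alpha` times the standard deviation of the magnitude spectrum) is an assumed form, the local mean is taken over the 120 neighbours in an 11 × 11 window, and `embed_template` and its parameters are hypothetical names. The mirror coefficient is set to the complex conjugate so that the spectrum stays conjugate-symmetric and the inverse transform is real.

```python
import numpy as np


def embed_template(y, points, alpha, pad_size, window=5):
    """Raise DFT magnitudes at `points`; return (frame, max imaginary residue).

    points: (u, v) integer array indices with window <= u, v < pad_size // 2.
    """
    h, w = y.shape
    padded = np.zeros((pad_size, pad_size), dtype=np.float64)
    padded[:h, :w] = y                      # zero-pad to pad_size x pad_size
    spectrum = np.fft.fft2(padded)
    mag = np.abs(spectrum)
    std = mag.std()
    for u, v in points:
        local = mag[u - window:u + window + 1, v - window:v + window + 1]
        # Mean of the neighbours only (120 values when window == 5).
        local_mean = (local.sum() - mag[u, v]) / (local.size - 1)
        # Assumed rule: target magnitude = local mean + alpha * spectrum std.
        target = local_mean + alpha * std
        phase = np.exp(1j * np.angle(spectrum[u, v]))
        spectrum[u, v] = target * phase
        # Symmetric point: keeps the spectrum conjugate-symmetric.
        spectrum[-u, -v] = np.conj(spectrum[u, v])
    restored = np.fft.ifft2(spectrum)
    # Remove the padding to return to the original frame size.
    return restored.real[:h, :w], float(np.abs(restored.imag).max())
```

In the paper's setting `pad_size` would be 3000 and the points would lie on the two 45-degree lines inside the band [f1, f2]; the second return value is only a sanity check that the imaginary part is numerically zero.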

Watermark Embedding.
We use the QRCode with a size of W × W as the original watermark, due to its high decoding reliability and strong error correction [11]. The detailed steps of the watermark embedding process are presented below.
Step 1. The scene change frames are extracted from the host video, according to Equation (2), and then, the next N frames of each scene change frame are selected for watermark embedding.
Step 2. Extract the SURF feature points from the Y component of each selected frame, and construct an R × R feature region centered on each point; eliminate the points whose feature regions exceed the frame boundary; if regions overlap, keep the point with the higher intensity.
Step 3. Select the first three points with the highest intensity among the remaining feature points, and the watermark will be repeatedly embedded in the feature regions corresponding to these points.
Step 4. Segment each selected feature region into W × W blocks; for each block, calculate the DC coefficient of the discrete cosine transform based on Equation (8):
$$DC_{ij} = \frac{1}{\sqrt{B_w B_h}} \sum_{x=0}^{B_w - 1} \sum_{y=0}^{B_h - 1} I_{ij}(x, y), \tag{8}$$
where $DC_{ij}$ is the DC coefficient of the block located in the $i$-th row and the $j$-th column of the feature region, $B_w$ and $B_h$ are the width and the height of the block, and $I_{ij}(x, y)$ is the block in the $i$-th row and $j$-th column of the feature region.
Step 5. Embed one bit of the watermark into $DC_{ij}$ according to Equation (9),
where $x = \mathrm{round}(DC_{ij}/\delta)$, $\mathrm{round}(\cdot)$ is the rounding function, $DC^{*}_{ij}$ is the corresponding value after embedding the watermark into $DC_{ij}$, and $\delta$ is the quantization step that balances imperceptibility and robustness.
Step 6. Obtain the watermarked block $I^{*}_{ij}(x, y)$ by the following equation [24]:
$$I^{*}_{ij}(x, y) = I_{ij}(x, y) + \frac{DC^{*}_{ij} - DC_{ij}}{\sqrt{B_w B_h}}. \tag{10}$$
When all the blocks in the current frame are embedded with the watermark, the same operation is performed on the next frame.
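A per-block sketch of Steps 4 to 6 is given below. The DC computation and the uniform pixel shift follow directly from the definition of the DC coefficient of an orthonormal DCT; the quantization rule, however, is an assumption chosen so that the bit can be read back as the parity of floor(DC/δ), and it is not claimed to match Equation (9). The names `block_dc` and `embed_bit` are hypothetical, and pixel clipping to the valid range is omitted.

```python
import numpy as np


def block_dc(block):
    """DC coefficient of the orthonormal 2-D DCT: sum / sqrt(B_w * B_h)."""
    b_h, b_w = block.shape
    return float(block.sum() / np.sqrt(b_w * b_h))


def embed_bit(block, bit, delta):
    """Embed one bit by moving the block's DC to the centre of a parity cell."""
    dc = block_dc(block)
    q = int(np.floor(dc / delta))
    if q % 2 != bit:
        # Assumed rule: move to whichever neighbouring cell is closer.
        q += 1 if dc - q * delta >= delta / 2 else -1
    dc_star = (q + 0.5) * delta
    b_h, b_w = block.shape
    # Shifting every pixel by the same amount changes only the DC term.
    return block + (dc_star - dc) / np.sqrt(b_w * b_h)
```

Placing the new DC value at the centre of a quantization cell leaves a margin of δ/2 on either side, so a larger δ tolerates more distortion at the cost of a larger pixel shift.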

Extraction Process.
The template is extracted before the watermark so that watermarked videos that may have been damaged by geometric attacks can be recovered; the watermark is then extracted from the recovered videos. The whole extraction process is shown in Figure 4. The template extraction proceeds as follows.
Step 1. Perform scene change detection on the watermarked video, according to Equation (2); obtain the Y components of the scene change frames.
Step 2. For each obtained Y component, pad it with zeros to a size of 3000 × 3000 and then apply the FFT to it.
Step 3. For the points on the two radial lines at an angle of 45 degrees to the coordinate axes of the DFT domain, extract all the local peak points $(px_j, py_j)$. The detection formula uses the following quantities: $j$ is the index of the peak points, $\mathrm{MagT}(px_j, py_j)$ is the magnitude of the point $(px_j, py_j)$, $\mathrm{LocalMean}(px_j, py_j)$ is the average magnitude of the 120 pixels adjacent to $(px_j, py_j)$, $\mathrm{Std}'$ is the standard deviation of the whole DFT spectrum of the selected frame, and $\beta$ is the detection strength.
Figure 3: The template embedding.

Step 4. If at least 4 points match the points in the original template line, a matching line is considered to be found. The matching rule relates $px_j$, the abscissa of the extracted peak point, to $x_i$, the abscissa of the original template point, through $K$, the scaling factor between 0.4 and 1.5, and thresh, an empirical threshold.
Step 5. Perform statistical analysis on the whole results to get the final matching result.
Step 6. According to the matching result of the template, the geometric attack correction is performed on the attacked video.
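The matching in Step 4 can be sketched as a search over candidate scaling factors. The exact matching rule is not reproduced here; this sketch assumes the form |px_j − K·x_i| ≤ thresh and a simple grid search over K, and `match_template_line` and its parameters are hypothetical names.

```python
def match_template_line(peak_xs, template_xs, thresh=0.75, min_matches=4,
                        k_min=0.4, k_max=1.5, k_step=0.001):
    """Return (best_k, matches), or None if fewer than min_matches points agree."""
    best = None
    steps = int(round((k_max - k_min) / k_step))
    for n in range(steps + 1):
        k = k_min + n * k_step
        # A template point counts as matched when some peak lies within
        # `thresh` of its scaled abscissa (assumed form of the matching rule).
        matches = sum(
            1 for x in template_xs
            if any(abs(px - k * x) <= thresh for px in peak_xs)
        )
        if matches >= min_matches and (best is None or matches > best[1]):
            best = (k, matches)
    return best
```

For example, with template abscissas 400, 411, ..., 466 and peaks detected at half those values, the search settles on K close to 0.5 with all seven points matched, while unrelated peaks yield no match. Requiring only four of the seven points makes the match tolerant of peaks that are lost to the attack.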

Watermark Extraction.
The detailed steps of the watermark extraction are proposed below.
Step 1. Perform scene change detection on the watermarked video which is recovered by the template, based on (2), and then select the next N frames of each scene change frame for watermark extraction.
Step 2. Extract the SURF feature points from each selected frame, construct an R × R feature region centered on each point, and remove the points whose regions cross the frame boundary.
Step 3. Select the first ten points with the highest intensity, and extract the watermark from the feature regions corresponding to the selected points and their eight neighborhoods, according to the following equation:
$$w'_{ij} = \mathrm{mod}\left(\mathrm{floor}\left(DC'_{ij} / \delta\right),\, 2\right), \tag{13}$$
where $DC'_{ij}$ is the DC coefficient of the image block in the $i$-th row and $j$-th column of the feature region, $\mathrm{floor}(\cdot)$ is the function of rounding down, $w'_{ij}$ is the watermark bit extracted from the image block located in the $i$-th row and $j$-th column of the feature region, and $\delta$ is the same quantization step as in the watermark embedding process.
Furthermore, the high decoding reliability of the QRCode is used to determine if the watermark is effectively extracted, which implies that when a QRCode can be successfully decoded by the decoder, the error rate of decoding is near zero. As a result, if the QRCode extracted from at least one feature region can be successfully decoded, the likelihood of treating it as the embedded watermark is close to 100 percent, and the ownership is established [11].
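Reading the bits of one feature region can be sketched in vectorized form. This assumes that each bit is the parity of floor(DC′/δ), with the DC coefficient computed as the block sum divided by the square root of the block area; `extract_region_bits` is a hypothetical name, and decoding the resulting W × W matrix as a QRCode is left to an external decoder.

```python
import numpy as np


def extract_region_bits(region, w, delta):
    """Return a (w, w) array of bits read from the block DC coefficients."""
    r_h, r_w = region.shape
    b_h, b_w = r_h // w, r_w // w
    # View the region as a w x w grid of (b_h, b_w) blocks.
    blocks = region[:b_h * w, :b_w * w].reshape(w, b_h, w, b_w)
    # Sum each block's pixels, then scale to the orthonormal-DCT DC value.
    dc = blocks.sum(axis=(1, 3)) / np.sqrt(b_h * b_w)
    return np.floor(dc / delta).astype(int) % 2
```

With the reported settings (R = 200, W = 25), each block is 8 × 8 pixels, and the function returns a 25 × 25 bit matrix per feature region.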

Experimental Evaluation
We evaluate the performance of the proposed video watermarking in this section. The experimental setup is introduced in Section 5.1. Section 5.2 verifies the effectiveness of the scene change-based frame allocation strategy. Section 5.3 compares the robustness between our algorithm and the state-of-the-art methods.

Experimental Setup.
Test set. The test set includes 50 1080p (1920 × 1080) and 50 720p (1280 × 720) videos. They are all in the MP4 format, with frame rates between 23.98 and 30 frames per second and durations ranging from 90 to 360 s.
Environments. The experiments were performed on a PC with 16 GB RAM and 3.4 GHz Intel Core i7 CPU, running on 64-bit Windows 10. The simulation software was Visual Studio 2010 with FFmpeg 2.1 and OpenCV 2.4.9.
Parameters. The number N of frames following every scene change frame in which the watermark is embedded is set to 10. The threshold of scene change Thre is set to 0.6. The middle frequency band $[f_1, f_2]$ is set to [400, 478]. The template embedding strength α is set to 30. The size of the QRCode to be embedded, W × W, is selected as 25 × 25. The size of the feature region R × R is set to 200 × 200. The quantization step δ is set to 60. The match threshold of the template thresh is set to 0.75. The detection strength β is set to 0.05.
Evaluation indexes. We evaluate the imperceptibility and the robustness of the algorithms by using the mean peak signal-to-noise ratio (MPSNR) and the byte error rate (BER), respectively. The larger the MPSNR is, the better the imperceptibility is. On the contrary, the smaller the BER is, the better the robustness is. They are calculated by Equations (14) and (15), respectively:
$$\mathrm{MPSNR} = \frac{1}{I} \sum_{i=1}^{I} 10 \log_{10} \frac{255^2 \cdot M \cdot N}{\sum_{x=1}^{M} \sum_{y=1}^{N} \left(F_i(x, y) - F_{iw}(x, y)\right)^2}, \tag{14}$$
where $I$ is the number of the watermarked frames, $M$ and $N$ are the width and the height of the video, and $F_i$ and $F_{iw}$ are the $i$-th original frame and its corresponding watermarked frame.
$$\mathrm{BER} = \frac{N(\mathrm{errorBytes})}{N(\mathrm{totalBytes})}, \tag{15}$$
where $N(\mathrm{errorBytes})$ is the number of bytes that differ between the extracted watermark and the original watermark and $N(\mathrm{totalBytes})$ is the number of bytes of the original watermark.
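Both indexes are straightforward to compute. The sketch below is illustrative: it assumes 8-bit frames (peak value 255), treats each frame as a single array, and reports the BER as a fraction rather than a percentage; the function names are hypothetical.

```python
import numpy as np


def mpsnr(originals, watermarked, peak=255.0):
    """Mean PSNR in dB over pairs of original and watermarked frames."""
    values = []
    for f, fw in zip(originals, watermarked):
        diff = np.asarray(f, np.float64) - np.asarray(fw, np.float64)
        mse = np.mean(diff ** 2)
        # Identical frames have infinite PSNR.
        values.append(np.inf if mse == 0 else 10.0 * np.log10(peak ** 2 / mse))
    return float(np.mean(values))


def byte_error_rate(extracted, original):
    """Fraction of bytes that differ between two equal-length byte strings."""
    if len(extracted) != len(original):
        raise ValueError("watermarks must have the same length")
    errors = sum(a != b for a, b in zip(extracted, original))
    return errors / len(original)
```

As a quick check, two frames that differ by exactly one grey level at every pixel have a mean squared error of 1 and therefore a PSNR of 10·log10(255²) ≈ 48.13 dB.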
The types of geometric attacks. The main geometric attacks considered are padding, cropping, scaling, shielding, scaling and cropping, scaling and padding, and scaling and shielding.

Verifying the Effectiveness of the Scene Change-Based Frame Allocation Strategy.
This subsection verifies the effectiveness of the proposed scene change-based frame allocation strategy by comparing it with the typical template and watermark allocation strategy (TTWAS) [22][23][24][25]. For a fair comparison, the two strategies differ only in how the template and the watermark are allocated, and all other settings are the same. We evaluate the effectiveness through the imperceptibility and the robustness.
The imperceptibility comparison results are shown in Table 1; our strategy improves the imperceptibility. With the TTWAS, the template and the watermark must be embedded in the same frame. In our proposed strategy, they are embedded in different frames, so the amount of embedded data per frame is reduced and the imperceptibility improves.
Furthermore, the robustness of the two strategies against geometric attacks is compared in Table 2, where "-" indicates that watermark extraction fails. When the embedding strength of the two strategies is equal, set as in Section 5.1, TTWAS cannot successfully withstand all of the attacks in the table, while our approach retains a certain robustness. By embedding the template and the watermark in different frames, our strategy effectively eliminates the mutual interference that exists in TTWAS.

Comparisons of Robustness against Geometric Attacks.
This subsection compares the ability of our algorithm and the baseline to recover the embedded watermarks under different geometric attacks. The typical and widely used location method of feature point-based watermarking [17,18] is selected as the baseline for the proposed video watermarking algorithm and is named TFPA (typical feature point-based algorithm).
The comparison results are shown in Table 3. In most cases, our algorithm outperforms the TFPA, due to introducing the template to assist the location of the feature points. After the video frames subjected to geometric attacks are restored, it is no longer necessary to use the scale information of the feature points when determining the feature regions, which reduces the locating error and improves the accuracy of the watermark extraction. However, our algorithm performs a bit worse against enlarging scaling, because the embedded templates may occasionally be undetected.

Conclusions
In this paper, a feature point-based video watermarking algorithm combined with a template is presented. Because a video is made up of a series of images, it offers more hosts for embedding information. Based on this, a scene change-based frame allocation approach is proposed, which embeds the template and the watermark into different frames. This strategy enhances the imperceptibility and effectively avoids the mutual interference between the watermark and the template. Furthermore, the insertion of the template reduces the location error of the feature points, improving the robustness against geometric attacks. The experimental results indicate that, in most cases, our algorithm outperforms the state-of-the-art approach. In future work, we will improve the robustness of our algorithm against more geometric attacks so that it can be applied to realistic scenarios.

Data Availability
The videos used to support the findings of this paper are subject to privacy or copyright, so that they cannot be shared.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Acknowledgments
This work was supported by the National Key R&D Program of China (2020YFB1406900) and the Key R&D Program of Shanxi (201903D421007). It was also the research achievement of the Key Laboratory of Digital Rights Services, which is one of the National Science and Standardization Key Labs for Press and Publication Industry.