Zero-watermarking Algorithm for Audio and Video Matching Verification

Abstract: To meet the needs of tamper detection and copyright identification for audio and video matching, this paper proposes a zero-watermarking algorithm for audio and video matching verification. The algorithm segments the audio and video into small time units, generates a video-frame feature matrix based on NSCT, DCT, and SVD, and generates a sound watermark based on DWT, K-means, and related methods. The zero-watermark combines video, audio, and copyright information. Experimental results show that the zero-watermark generated by this algorithm not only achieves highly accurate matching detection and localization of audio and video, but also resists common single and combined attacks such as noise, scaling, rotation, frame attacks, and format conversion, demonstrating good robustness.


Introduction
With the development of multimedia and AI technology, videos can easily be accessed, copied, tampered with, and disseminated; some of this activity infringes the interests of copyright holders or enables personal attacks, and some causes political disturbances [1,2]. The authenticity of videos on the Internet is often questioned, and a crisis of trust spreads through society [3][4][5]. Although digital watermarking technology can confirm and protect the copyright of audio or video, it cannot detect whether the audio and video match, nor can it indicate the mismatched part of the media where audio or video tampering occurs. Research on audio and video matching and tamper-proofing is therefore of great significance for verifying information authenticity. This paper presents a zero-watermarking algorithm that can be used for audio and video matching and local tamper localization. The algorithm generates a zero-watermark stream from audio and video segments, which can not only detect audio-video matching tampering but also localize it. Experiments with a variety of attacks and tampering show that the proposed algorithm can detect and locate audio and video tampering and has high robustness.

Audio and video matching detection framework based on zero-watermarking
In this paper, audio and video matching is detected on the basis of the generated zero-watermark; the framework is shown in Figure 1. The algorithm first segments the audio and video, and each segment generates a zero-watermark integrating audio and video features; the whole audio and video stream thus forms a zero-watermark stream, as shown in Figure 1(a). The matching detection process is shown in Figure 1(b). A zero-watermark is formed from the audio and video segments with the same method as in Figure 1(a) and compared with the zero-watermark held by the copyright center. If the correlation coefficient between the two is greater than the threshold, the audio segment and the video segment match; if it is less than the threshold, they do not match, and one of them has been tampered with. Because the zero-watermark comparison is performed segment by segment, tampering can be localized, with an accuracy equal to the segment length. In addition to detecting and localizing audio-video matching, the method can also be used for traditional copyright determination.
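The segment-by-segment comparison described above can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the `nc` helper is a standard normalized cross-correlation formulation of our own, and the 0.8 threshold is the value used later in the experimental section.

```python
import numpy as np

def nc(w1, w2):
    """Normalized cross-correlation between two binary watermarks (our formulation)."""
    w1 = w1.astype(float).ravel()
    w2 = w2.astype(float).ravel()
    return float(np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2) + 1e-12))

def match_segments(extracted, registered, threshold=0.8):
    """Compare zero-watermarks segment by segment; a segment whose NC falls
    below the threshold is flagged as mismatched (i.e., tampered)."""
    results = []
    for i, (w_ext, w_reg) in enumerate(zip(extracted, registered)):
        results.append((i, nc(w_ext, w_reg) >= threshold))
    return results
```

Because each segment is checked independently, a mismatch pinpoints tampering to within one segment length, as described above.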
(a) Zero-watermarking stream generation process integrating audio and video features.

Matching zero-watermarking generation
The zero-watermark generation algorithm for one segment of audio and video in Figure 1(a) is shown in Figure 2. Key frames are extracted from the video segment, and a feature matrix of the key-frame images is generated based on NSCT (Nonsubsampled Contourlet Transform), DCT, and SVD; for the audio segment, a copyright watermark is incorporated to generate a sound watermark. The feature matrix and the sound watermark are XORed to form the zero-watermark of the segment. The zero-watermark not only contains copyright information but also incorporates the audio and video features of the segment, so it supports traditional copyright protection as well as matching detection and localization. Among the four modules of the algorithm (key frame extraction, video feature matrix generation, encrypted sound watermark generation, and zero-watermark generation), key frame extraction is at the front end; it reduces video redundancy, simplifies the zero-watermarking process, and reduces the number of zero-watermarks. Video feature matrix generation extracts representative, stable features from the frame images to increase the robustness of the watermark. The encrypted sound watermark not only contains the sound information but also hides it, which increases the security of the watermark on the one hand and provides a basis for matching detection on the other. The zero-watermark generation part combines image features and audio features to ensure the final algorithm's functionality and performance.
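The XOR combination of the video feature matrix and the sound watermark can be sketched as follows; function names are ours, and both inputs are assumed to be 32×32 binary matrices as in the paper. Because XOR is its own inverse, the same operation recovers the sound watermark during detection.

```python
import numpy as np

def build_zero_watermark(feature_matrix, sound_watermark):
    """XOR the binary video feature matrix with the encrypted sound watermark
    to form the zero-watermark of one segment (both assumed 32x32 binary)."""
    assert feature_matrix.shape == sound_watermark.shape
    return np.bitwise_xor(feature_matrix.astype(np.uint8),
                          sound_watermark.astype(np.uint8))

def recover_sound_watermark(zero_watermark, feature_matrix):
    """XOR is self-inverse: zero-watermark XOR feature matrix returns the
    encrypted sound watermark, which detection can then compare against."""
    return np.bitwise_xor(zero_watermark, feature_matrix.astype(np.uint8))
```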

Key frame feature matrix generation
There is considerable redundancy between video frames; selecting key frames reduces both the redundancy and the number of zero-watermarks. This paper uses the Euclidean distance of the frame difference to select key frames and saves the frame numbers of the key frames as the secret key. Based on the key frames, the feature matrix of the video segment is constructed using NSCT, DCT, SVD, and related methods. The features contained in the matrix have good stability and robustness, and combining them with Zernike invariant moments strengthens the extracted features' resistance to rotation attacks. The detailed steps are as follows: 1. Calculate the Zernike moments of the key-frame image and save them for rotation correction. 2. After normalizing the size of the key-frame image, transfer it from the RGB space to the YCoCg space and decompose it into the Y, Co, and Cg components. We found that the Co component suffers less loss under image compression than the Y and Cg components. To improve the watermark's resistance to compression attacks and its overall robustness, this paper embeds the watermark in the Co component of the image.
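The key-frame selection and the color-space step can be sketched as follows. The distance threshold and function names are our illustrative choices; the YCoCg coefficients are the standard RGB-to-YCoCg transform, which matches the component names used above.

```python
import numpy as np

def rgb_to_ycocg(img):
    """Standard RGB -> YCoCg conversion for a float HxWx3 image."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    y  =  0.25 * r + 0.5 * g + 0.25 * b
    co =  0.50 * r            - 0.50 * b   # the component the watermark is embedded in
    cg = -0.25 * r + 0.5 * g - 0.25 * b
    return y, co, cg

def select_key_frames(frames, threshold):
    """Keep a frame as a key frame when its Euclidean distance to the previous
    key frame exceeds `threshold` (threshold choice is ours, for illustration).
    The returned frame indices are what the paper saves as the secret key."""
    keys = [0]
    for i in range(1, len(frames)):
        d = np.linalg.norm(frames[i].astype(float) - frames[keys[-1]].astype(float))
        if d > threshold:
            keys.append(i)
    return keys
```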
3. Perform a two-level NSCT on the Co component, apply the DCT to its low-frequency subband, and divide the result into 32×32 non-overlapping sub-blocks, each denoted $T_{i,j}$. 4. Perform SVD on each non-overlapping sub-block $T_{i,j}$ to obtain a diagonal matrix, as shown in formula (1):

$$T_{i,j} = U_{i,j} S_{i,j} V_{i,j}^{T}, \quad i, j = 1, 2, \ldots, 32 \qquad (1)$$
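A sketch of this feature-matrix step follows. It is an approximation under stated assumptions: NSCT has no standard Python implementation, so a 2-D DCT of the channel stands in for the DCT of the NSCT low-frequency subband, and the binarization rule (largest singular value of each sub-block compared with the mean over all sub-blocks) is a common zero-watermarking choice of ours, since the paper does not state formula (1)'s binarization in full.

```python
import numpy as np
from scipy.fft import dct

def block_svd_features(channel, blocks=32):
    """Illustrative feature matrix: 2-D DCT of the channel (standing in for the
    NSCT low-frequency subband), split into blocks x blocks sub-blocks; each
    sub-block is SVD-decomposed and its largest singular value is binarized
    against the mean over all sub-blocks."""
    c = dct(dct(channel, axis=0, norm='ortho'), axis=1, norm='ortho')
    h, w = c.shape[0] // blocks * blocks, c.shape[1] // blocks * blocks
    c = c[:h, :w]
    bh, bw = h // blocks, w // blocks
    sigma = np.empty((blocks, blocks))
    for i in range(blocks):
        for j in range(blocks):
            sub = c[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            sigma[i, j] = np.linalg.svd(sub, compute_uv=False)[0]
    return (sigma > sigma.mean()).astype(np.uint8)  # 32x32 binary feature matrix
```

The largest singular value is used because it is the most stable component of each sub-block under common signal-processing attacks, which is why SVD-based features are popular in robust watermarking.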

Copyrighted sound watermark generation
For the audio segment, DWT and K-means algorithms are used to extract sound features, and at the same time a copyright watermark containing the copyright information is incorporated into the sound features to obtain the copyrighted sound watermark. The detailed steps are as follows: Step 1: Perform a two-level DWT on the audio segment to obtain its low-frequency wavelet coefficients LL.
Step 2: For the coefficients LL, calculate the first moment μ and the second moment σ of each wavelet coefficient.
Step 3: Using μ and σ as features, perform K-means coding to obtain a one-dimensional binary sound feature matrix, and obtain the two-dimensional sound feature matrix V by raising the dimension.
Step 4: Incorporate the copyright watermark B into the sound feature matrix V to obtain the copyrighted sound watermark W.
Step 5: Encrypt W with the Logistic chaotic encryption algorithm to obtain the encrypted copyrighted sound watermark w.
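Steps 1-3 can be sketched as follows. This is an illustrative sketch under assumptions: a Haar wavelet stands in for the paper's unspecified wavelet, the windowing of LL into bit positions and the iteration count are our choices, and the 2-cluster K-means is a minimal NumPy implementation rather than a library call.

```python
import numpy as np

def haar_dwt_level(x):
    """One level of the Haar DWT: approximation and detail coefficients."""
    x = x[: len(x) // 2 * 2]
    a = (x[0::2] + x[1::2]) / np.sqrt(2)
    d = (x[0::2] - x[1::2]) / np.sqrt(2)
    return a, d

def sound_feature_matrix(audio, n_bits=1024, iters=20):
    """Two-level Haar DWT, per-window first/second moments of the low-frequency
    coefficients LL, then 2-cluster K-means to binarize them into a square
    feature matrix (32x32 when n_bits=1024)."""
    ll, _ = haar_dwt_level(audio)
    ll, _ = haar_dwt_level(ll)               # two-level low-frequency coefficients
    win = max(1, len(ll) // n_bits)
    windows = [ll[i * win:(i + 1) * win] for i in range(n_bits)]
    feats = np.array([[w.mean(), w.var()] for w in windows if len(w)])
    # minimal 2-means on (first moment, second moment) features
    centers = feats[[0, len(feats) // 2]].copy()
    for _ in range(iters):
        labels = np.argmin(((feats[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = feats[labels == k].mean(axis=0)
    side = int(np.sqrt(len(labels)))
    return labels[: side * side].reshape(side, side).astype(np.uint8)
```

Steps 4-5 (incorporating the copyright watermark B and Logistic encryption) then operate on this binary matrix.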

Zernike moments of key frame images
The Zernike moment is a radial moment based on the orthogonal Zernike polynomials. Because the amplitude of the ZM (Zernike Moment) remains constant under image rotation, with only the phase changing, it is widely applied to image rotation estimation and feature extraction with excellent results [35].
Assuming the image is expressed in polar coordinates as $f(\rho, \theta)$, the $n$-order, $m$-repetition Zernike moment is:

$$A_{nm} = \frac{n+1}{\pi} \iint_{\rho \le 1} f(\rho, \theta)\, V_{nm}^{*}(\rho, \theta)\, \rho \, d\rho \, d\theta$$

where $n$ is a non-negative integer, $m$ is an integer satisfying $|m| \le n$ with $n - |m|$ even; $V_{nm}$ is the $m$-fold Zernike moment polynomial of order $n$, and $V_{nm}^{*}$ is the conjugate of $V_{nm}$. $V_{nm}$ can be expressed as:

$$V_{nm}(\rho, \theta) = R_{nm}(\rho)\, e^{jm\theta}$$

The Zernike moment of an image can then be expressed by its amplitude $|A_{nm}|$ and phase $\arg(A_{nm})$, where $\arg(\cdot)$ denotes the argument.
Assuming that $A_{nm}$ is the Zernike moment before the image is rotated and $A'_{nm}$ is the Zernike moment after rotation by angle $\alpha$, we have $A'_{nm} = A_{nm} e^{-jm\alpha}$, so the rotation angle is solved as:

$$\alpha = \frac{\arg(A_{nm}) - \arg(A'_{nm})}{m}$$

The Zernike moments of the key-frame images are saved and used for rotation correction in the detection and identification stage to improve the algorithm's resistance to rotation. Figure 3 shows the NC (Normalized Cross-Correlation) values of the watermark extracted without Zernike rotation correction and with Zernike moment rotation correction added. The NC value of the watermark obtained after Zernike moment rotation correction is higher than that of the uncorrected watermark, which effectively improves the algorithm's resistance to rotation attacks, and the increase in NC is more obvious under large rotation attacks.
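The phase-based angle recovery can be illustrated with a synthetic moment; the numeric values below are arbitrary examples, not data from the paper, and the recovery is unique only up to a $2\pi/m$ ambiguity.

```python
import numpy as np

def rotation_angle(a_nm, a_nm_rot, m):
    """Recover the rotation angle from the phase difference of an m-fold
    Zernike moment before (a_nm) and after (a_nm_rot) rotation, using
    A'_nm = A_nm * exp(-j*m*alpha)."""
    return (np.angle(a_nm) - np.angle(a_nm_rot)) / m

# Usage: simulate a 0.3 rad rotation on a synthetic moment with m = 2.
a = 1.5 * np.exp(1j * 0.7)          # arbitrary example moment
a_rot = a * np.exp(-1j * 2 * 0.3)   # moment after rotating the image by 0.3 rad
alpha = rotation_angle(a, a_rot, 2)  # recovers 0.3 (up to a 2*pi/m ambiguity)
```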

Matching detection process based on zero-watermarking
The zero-watermark detection algorithm, shown in Figure 4, is the inverse of the audio-video matching zero-watermark generation algorithm. For matching detection, the audio and video stream to be verified is decoded and segmented, and a zero-watermark is generated for each segment with the support of the information stored by the third party. The similarity between the zero-watermark of the audio and video to be verified and the zero-watermark retrieved from the third party is then calculated. If it is greater than the threshold, the audio segment and video segment match; if it is less than the threshold, they do not match, and one of them has been replaced or tampered with.

Experimental results and analysis
The experiments in this article were carried out on the Matlab R2014b platform. The copyright watermark is the words "Technology University" at a size of 32×32. In the experiments, the Logistic chaotic encryption parameter is 4, and the audio and video are segmented in units of 1 s. Regarding the segment length: considering the stability of audio and video features, the speed of zero-watermark generation, the minimization of occupied resources, and the accuracy of matching detection, comprehensive analysis and experiments led us to segment the audio and video with 1 s as the time unit. In this way, stable features of the audio and video segments can be effectively extracted to quickly construct an optimized zero-watermark, and tampering with small audio or video segments within the whole stream can also be accurately detected. The video used in the following experiments is H.264-encoded (including audio), with a frame size of 1080×1920, a duration of 30 seconds, and a frame rate of 25 fps; the audio stream has a 32 kHz sampling rate, 16-bit quantization, and two channels. According to the proposed algorithm, the stream is divided into 30 audio and video segments, and 49 key frames are extracted in total. For this audio and video, the algorithm takes less than 180 s to generate the zero-watermark stream and less than 230 s to perform zero-watermark matching detection.
In the experiments, the NC value is used as the objective criterion for watermark robustness, and the PSNR (Peak Signal-to-Noise Ratio) is used to measure the difference between two images. The larger the NC value, the better the extracted watermark and the stronger the algorithm's robustness. A smaller PSNR value indicates a stronger attack and greater damage. Audio-video matching is determined by the normalized NC value with the threshold set to 0.8: when NC is greater than or equal to 0.8, the audio and video match; when NC is less than 0.8, the audio or video has been tampered with.
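The PSNR measure used here is the standard definition; a minimal sketch, assuming 8-bit images:

```python
import numpy as np

def psnr(original, attacked, peak=255.0):
    """Peak signal-to-noise ratio in dB between an original and an attacked
    image; lower PSNR means a stronger attack on the frame."""
    mse = np.mean((original.astype(float) - attacked.astype(float)) ** 2)
    if mse == 0:
        return float('inf')
    return 10.0 * np.log10(peak ** 2 / mse)
```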

Audio and video matching and anti-tampering test
One function of the zero-watermark in this paper is to detect and locate audio-video matching. Based on the above video, we randomly replaced video frames and audio in different periods, and then used the proposed method to detect and locate audio and video matching, respectively. The results are shown in Figure 5. For the tampered video segments 4, 7, 10, 12, 17, 23, 27, and 30, the detected zero-watermark NC values are below the set threshold, so the segments are judged mismatched; for the tampered audio segments 5, 12, 18, 20, 22, 23, and 30, the detected zero-watermark NC values are likewise below the threshold, and the detection result is a mismatch. Both test results are correct. The main function of the algorithm, detection of audio-video matching and tamper-proofing, is therefore realized through the cooperation of the system's modules and the generated zero-watermark, and the algorithm correctly reports "mismatch" whether small fragments of the video or of the audio are tampered with.
(a) Test results of matching detection after video tampering.
(b) Test results of matching detection after audio tampering.

Algorithm security test
In this paper, the watermark image is scrambled by Logistic chaos [36] to hide the watermark information. Only a party that knows the zero-watermarking algorithm together with the encryption method and its parameters can correctly extract the watermark information, which increases the security of the algorithm and the watermark.
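A Logistic-map scrambling step can be sketched as follows. This is an illustrative construction, not the paper's exact scheme: the key stream is derived by thresholding the Logistic map $x_{n+1} = \mu x_n (1 - x_n)$ with $\mu = 4$ (the parameter value from the experimental section), the initial value `x0` is a made-up example secret, and XOR makes the same call serve as both encryption and decryption.

```python
import numpy as np

def logistic_scramble(watermark, x0=0.37, mu=4.0):
    """Encrypt/decrypt a binary watermark by XOR with a key stream generated
    from the Logistic map x_{n+1} = mu*x*(1-x); x0 is the secret initial
    value (example value, assumed). XOR is self-inverse, so calling this
    twice with the same x0 restores the original watermark."""
    bits = watermark.astype(np.uint8).ravel()
    key = np.empty(bits.size, dtype=np.uint8)
    x = x0
    for i in range(bits.size):
        x = mu * x * (1.0 - x)
        key[i] = 1 if x > 0.5 else 0
    return np.bitwise_xor(bits, key).reshape(watermark.shape)
```

Without the correct `x0` and `mu`, the key stream differs and the recovered watermark is noise, which is what gives the scheme its security.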

Algorithm robustness test
The following experiments test the robustness of the watermarking algorithm under single and combined attacks. Single attacks include Gaussian noise, salt-and-pepper noise, cropping, scaling, rotation, frame averaging, frame reorganization, and format conversion; combined attacks apply more than one single attack simultaneously. Robustness is judged by the PSNR and NC values. The results for single attacks are shown in Table 1. Each attack is applied to every frame of the video; the PSNR mean and variance are computed over all attacked frames, and the NC mean and variance over all attacked watermarks. From the results in the table, the algorithm can still effectively extract the watermark after an attack. Even when the PSNR of the image is about 20 dB, the NC value under most attacks is above 0.98 with very low variance, indicating that the algorithm is robust. The experimental data show that, with all modules working together, the algorithm has clear advantages in robustness and attack resistance, especially against noise, filtering, and compression attacks; for rotation and cropping attacks the NC value is above 0.96, which still indicates good resistance. We also varied the attack parameters over a relatively large range for several kinds of attacks and conducted repeated experiments to characterize the algorithm more comprehensively. The results are as follows. 1) Noise attack. Gaussian noise and salt-and-pepper noise of different intensities were used, with the noise intensity increased in steps of 0.01 over the range 0-0.1; the results are shown in Figure 6.
Figure 6 shows the NC value of each extracted watermark as a scatter diagram, in which the horizontal line marks the mean, together with the mean and variance of NC under the two kinds of noise. As the noise intensity increases, the fluctuation range of the NC values grows, but the mean NC value remains above 0.97, indicating strong anti-noise ability. Under salt-and-pepper noise the mean NC value is above 0.98, indicating that the algorithm is even more robust to salt-and-pepper noise. 2) JPEG compression attack. The quality factor reflects image quality after JPEG compression: a larger value means higher image quality and a weaker attack. We tested the robustness of the algorithm with the quality factor increased in steps of 10 over the range 10-90; the results are shown in Figure 7. As the quality factor increases, the NC values of the watermarks extracted from multiple key frames become more concentrated and their fluctuation range gradually decreases, indicating that the algorithm resists JPEG compression attacks very well. 3) Scaling attack. The video was scaled by factors of 1/8, 1/4, 1/2, 2, 4, and 6; the results are shown in Figure 8. The algorithm is strongly robust to scaling: the extracted watermark NC value is about 0.98, and under magnification attacks the NC value is close to 1. 4) Rotation attack. Rotation attacks were carried out over the range 0-180° at intervals of 15°; the results are shown in Figure 9. Even under large rotation attacks, the mean NC of the extracted watermark is above 0.9.
5) Gaussian filtering attack. It can be seen from Figure 9 that as the filter window size and smoothing scale increase, the NC of the watermark decreases, but its mean remains around 0.96; the algorithm is robust to Gaussian filtering. 6) Cropping attack. Attack experiments with 1/20, 1/16, and 1/8 four-corner cropping and 1/16 center cropping were performed on the video, and the results are shown in Figure 9. Because the algorithm extracts key-frame features when generating watermarks, large-scale cropping attacks destroy many key-frame features and seriously reduce the watermark NC value. Although the algorithm is somewhat less robust against cropping attacks, the mean NC is above the matching detection threshold, so matching detection is unaffected. 7) Combined attack. Combined attack experiments were conducted, mainly JPEG compression + cropping, rotation + Gaussian filtering, and format conversion + other attacks; the results are shown in Figure 10. The algorithm also shows good robustness to combined attacks, with a mean NC value of 0.9 or more. By comparison, the algorithm is more robust to Gaussian filtering than to cropping, and more robust to JPEG compression than to rotation. For format conversion combined with other attacks, the results show that different video frames have different sensitivities to different attacks, so attack resistance is frame-dependent; overall, however, the NC value remains relatively high and can still be used for matching detection.

Comparison and analysis with other algorithms
We also compared the proposed algorithm with the algorithms in literature [24] and literature [37]. Literature [24] proposed a robust video watermarking algorithm combining the discrete cosine transform and the discrete wavelet transform. Literature [37] uses SIFT (Scale Invariant Feature Transform) feature extraction for correction. The main difference between the proposed algorithm and these is the use of a sound watermark and a zero-watermark. Literature [38] proposed a robust non-blind video watermarking technique based on DWT and QR decomposition. Literature [39] proposed a redundancy-based discrete wavelet method. In this part of the experiment, the classic test videos foreman_cif and bus_cif were selected: the resolution is 352×288, the frame rate is 29 fps, the duration of bus_cif is 5 seconds, and the duration of foreman_cif is 10 seconds. The watermark used in literature [24,37,38,39] is a 32×32 binary image, and the watermark used in this algorithm is a 32×32 binary watermark based on the audio and the copyright image. The comparative results are shown in Table 2. The proposed algorithm has good robustness under rotation, scaling, and cropping attacks, and is especially strong for rotation and cropping.

Conclusions
In this paper, aiming at tamper detection for audio-video matching together with watermark invisibility and robustness, a zero-watermarking algorithm for audio and video matching verification and fine-grained localization is proposed. The algorithm generates a zero-watermark stream fused with audio and video features in units of audio and video segments; the zero-watermark carries audio and video information simultaneously. It can be used not only for traditional copyright determination but also for audio and video tampering detection and localization. Whether a video or audio segment has been tampered with can be detected from the zero-watermark, which overcomes the problem that a watermark formed over the whole media can only be used for copyright identification and cannot accurately detect and locate small audio or video tampering. A variety of attack and tampering experiments were carried out on the proposed algorithm. The results show that it can detect and locate tampering of audio and video segments with high precision, and that it has high robustness, resisting most common single and combined attacks.