An Optimized English Text Watermarking Method Based on Natural Language Processing Techniques

Abstract: This paper proposes RTADZWA (Reliable Text Analysis and Digital Zero-Watermarking Approach), a text-analysis-based approach for transferring and receiving authentic English text via the Internet. RTADZWA uses the second-order alphanumeric mechanism of the hidden Markov model as a natural language processing technique to analyze the English text and extract features of the interrelationships between its contents. The extracted features are used as watermark information, which is later validated against the attacked English text to detect any tampering. RTADZWA integrates text analysis and text zero-watermarking techniques to improve on the performance, accuracy, capacity, and robustness of previously proposed approaches. The approach embeds and detects the watermark logically, without altering the original text document. RTADZWA has been implemented in PHP using the VS Code IDE. Experimental and simulation results on standard datasets of varying lengths show that the proposed approach achieves high robustness and better tampering detection accuracy under common random insertion, reorder, and deletion attacks. Comparison with the baseline approaches HNLPZWA and UZWAMW also shows the advantages of the proposed approach.


Introduction
For the research community, the reliability and security of text data exchanged over the Internet is a most promising and challenging field. In communication technologies, content authentication and automated integrity verification of text in different languages and formats are of great significance. Numerous applications, for instance e-banking and e-commerce, make information transfer via the Internet especially challenging. Much of the digital media transferred over the Internet is in text form, which is highly susceptible to tampering during online transmission in terms of content, structure, grammar, and semantics. During the transfer process, malicious attackers can tamper with such digital content and thus change it [1].
Many algorithms and techniques are available for information security purposes such as content authentication, integrity verification, tampering detection, owner identification, access control, and copyright protection.
To overcome these issues, steganography and automated watermarking methods are commonly used. Digital watermarking (DWM) is a technique in which watermark information is inserted into digital material of various kinds, such as text, binary images, audio, and video [2,3]. A fine-grained text watermarking procedure has been proposed based on replacing white spaces and Latin symbols with homoglyph characters [4].
Several conventional methods and solutions for text watermarking have been proposed [5,6] and categorized into different classes such as linguistic, structural, image-based, and format-based [7]. To insert the watermark information into the document, most of these solutions require certain modifications to the original digital text. Zero-watermarking is a newer technique with smart algorithms that embeds the watermark information without any alteration to the original digital material. This technique can also be used to generate watermark data from the contents of a given digital text [1,7-9].
Limited research has focused on appropriate solutions to verify the credibility of critical digital media online [10-12]. The verification of digital text and the identification of fraud have received great attention in research. In addition, text watermarking studies have concentrated on copyright protection in the last decade. However, less attention has been paid to integrity verification, tampering detection, and content authentication, because text content is natural-language dependent [13].
Proposing the most appropriate approaches and strategies for different formats and materials, especially in the Arabic and English languages, is the most common challenge in this area [14,15]. Therefore, content authentication, integrity verification, and tampering detection of sensitive text are major issues in different systems and need critical solutions.
Some instances of such sensitive digital text content are the Arabic interactive Holy Qur'an, online eChecks, tests, and marks. Arabic alphabet characteristics such as diacritics, lengthened letters, and extra symbols make it simple to change the key meaning of the text by making basic changes such as modifying diacritic arrangements [16]. The most popular soft computing and natural language processing (NLP) technique that supports text analysis is the hidden Markov model (HMM).
The author proposes a reliable approach known as RTADZWA (Reliable Text Analysis and Digital Zero-Watermarking Approach) for transferring and receiving authentic English text via the Internet. The proposed approach is based on the second-order alphanumeric mechanism of the Markov model for content authentication and tampering detection of English text transmitted via the Internet. It combines zero-watermarking with the Markov model as an NLP technique. In this approach, the second-order alphanumeric mechanism is used for text analysis in order to extract the interrelationships between the contents of the given English text and to generate a watermark key. The generated watermark is embedded logically in the original English text without any modification to it or effect on its size. After the text is transmitted via the Internet, the embedded watermark is used to detect any tampering with the received English text and to determine whether it is authentic.
The primary objective of the RTADZWA method is to achieve high accuracy of content authentication and sensitive detection of tampering attacks in English text, which has gained great importance and needs more security and protection via the Internet.
The remainder of the article is structured as follows. Section 2 reviews the existing work. Section 3 discusses the proposed approach (RTADZWA). The simulation and implementation are provided in Section 4, the results are discussed in Section 5, and Section 6 concludes the article.

Related Work
According to the processing domain of NLP and text watermarking, the existing text watermarking methods and solutions reviewed in this paper are classified into linguistic, structural, and zero-watermarking methods [1,7,13].

Linguistic Methods
Natural language is the foundation of linguistic approaches to text watermarking. These methods embed the watermark by applying changes to the semantic and syntactic nature of the plain text [1].
To enhance the capacity and imperceptibility of Arabic text watermarking, a method based on the available word spaces has been suggested [17]. In this method, each word space is used to hide a binary bit 0 or 1, which physically modifies the original text.
A text steganography technique was proposed to hide information in Arabic text [18]. This approach exploits the presence of Arabic diacritics (Harakat) such as Kasra, Fatha, and Damma, and uses reversed Fatha to hide the message.
An invisible Kashida-mark watermarking method [19], based on frequently recurring characters, was proposed for document security and authentication. The method relies on a predetermined watermark key, with a Kashida inserted for a bit 1 and omitted for a bit 0.
A text steganography method was proposed that uses Kashida extensions, depending on the 'moon' and 'sun' characters, to hide data in Arabic digital content [20]. In this method, Kashida characters are placed alongside Arabic characters to decide which hidden secret bits are carried by specific characters. Four cases are included: moon characters representing '00'; sun characters representing '01'; sun characters representing '10'; and moon characters representing '11'.

Structural Methods
Structural methods are content dependent: the structure of the original text is altered to hide the watermark data.
A text steganographic approach [21] based on multilingual Unicode characters has been suggested to hide information in English scripts by using Unicode counterparts of English letters from other languages. Thirteen letters of the English alphabet were chosen for this approach. Two bits are embedded at a time: the ASCII code is used to embed 00, while multilingual Unicode characters are used to embed 01, 10, and 11. A text watermarking algorithm based on Unicode extended characters has been used to secure textual content against malicious attacks [22]. The algorithm consists of three main steps: watermark generation, embedding, and extraction. Watermark embedding relies on predefined coding tables, while scrambling strategies are used in generation and extraction to keep the watermarking key safe.
A method addressing substitution attacks, focused on preserving the position of words in the text document, has been proposed [23]. This method depends on manipulating word transitions in the text document. Text-based watermarking approaches have also been suggested for authenticating Chinese text documents based on a combination of sentence properties [24,25]. The method proceeds as follows: first, the Chinese text is split into a set of sentences, and the semantic code of each word is obtained. The distribution of semantic codes influences sentence entropy.

Zero-Watermarking Methods
A zero-watermarking method that relies on the Hurst exponent and the nullity of the frames has been proposed to preserve a person's privacy [26]. For watermark embedding, both are determined in order to evaluate the unvoiced frames. The approach integrates an individual's identity without introducing any noticeable distortion in medical speech signals.
A zero-watermarking method was proposed to resolve security issues of English text documents, such as content verification and copyright protection [27]. A zero-watermarking approach based on a Markov model has been suggested for content authentication of English text [28,29]. In this approach, the probability characteristics of the English text are extracted and stored as secure watermark information, which is later used to confirm the validity of the attacked text document. The approach provides security against common text attacks, with a watermark distortion rate greater than one for all known attacks. For copyright protection of English text, a conventional watermarking approach [30] based on the occurrence rate of ASCII non-vowel letters and words has been suggested.

The Proposed Approach
This paper proposes a novel reliable approach that integrates NLP and text zero-watermarking techniques, in which there is no need to embed extra information such as a watermark key or to perform any modification to the original text. The second-order alphanumeric mechanism of the Markov model is used as an NLP technique to analyze the contents of the English text and extract the interrelationship features of these contents.
The main contributions of our approach, RTADZWA, can be summarized as follows:
• Unlike previous work, in which watermarking affects the text, its content, and its size, RTADZWA embeds the watermark logically without any effect on the text, content, or size.
• In RTADZWA, watermarking does not need any external information, because the watermark key is produced by analyzing the text and extracting the relationships within the content itself, which then serve as the watermark.
• RTADZWA is highly sensitive to any simple modification of the text and its meaning in English text, which is known as complex text.
The three contributions mentioned above are found to some extent only in images but not in text. This is the vital point concerning the contribution of this paper.
• In addition, RTADZWA can effectively determine the place where tampering occurred. This feature can be considered an advantage over the hash function method.

Watermark Generation and Embedding Procedure
This subsection involves three sub-procedures: pre-processing, text analysis and watermark generation, and watermark embedding, as illustrated in Fig. 1.

Pre-processing Procedure
Pre-processing of the original English text is one of the key steps in both the watermark generation and extraction processes. It converts capital letters to small letters and removes extra spaces and newlines, and it directly influences tampering detection accuracy and watermark robustness. The original English text (OET) is required as input for the pre-processing procedure.
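The pre-processing step can be sketched in a few lines. The paper's implementation is in PHP; the Python below is only an illustration, and the function name `preprocess` is hypothetical. It assumes that "extra spaces and newlines" means any run of whitespace is collapsed to a single space.

```python
import re

def preprocess(oet: str) -> str:
    """Return the pre-processed English text (PET) for an original text (OET).

    Assumption: runs of spaces, tabs, and newlines collapse to one space,
    and leading/trailing whitespace is dropped.
    """
    lowered = oet.lower()                 # capital letters -> small letters
    collapsed = re.sub(r"\s+", " ", lowered)
    return collapsed.strip()

pet = preprocess("The  Quick\nBrown   Fox ")
print(pet)  # the quick brown fox
```

Because the same function is applied before generation and before extraction, differences in letter case or spacing alone would not be reported as tampering under this sketch.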

Text Analysis and Watermark Generation Procedure
This procedure includes two subprocesses: building the Markov matrix, and text analysis with watermark generation.
• Building the Markov matrix is the starting point of the English text analysis and watermark generation process using the Markov model. A Markov matrix that represents the possible states and transitions available in a given text is constructed without repetitions. In the RTADZWA approach, each unique pair of alphanumeric characters within a given English text represents a present state, and each unique alphanumeric character that follows it represents a transition in the Markov matrix. While building the Markov matrix, the proposed algorithm initializes all transition values to zero so that these cells can later keep track of the number of times that the i th pair of alphanumeric characters is followed by the j th alphanumeric character within the given English text document.
The construction of the Markov matrix is performed as shown in Algorithm 1.
Algorithm 1: Algorithm of building the Markov matrix using RTADZWA
where OET is the original English text, PET is the pre-processed English text, a2_mm is the states and transitions matrix with zero values in all cells, ps refers to the current state, and ns refers to the next state.
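As a rough illustration of the matrix-building step, the following Python sketch creates a zero-initialized states and transitions matrix from a pre-processed text. It is not the paper's PHP code. It assumes that non-alphanumeric characters (including spaces) are skipped when forming pairs, and the function name `build_markov_matrix` is hypothetical; `a2_mm`, `ps`, and `ns` follow the legend of Algorithm 1.

```python
def build_markov_matrix(pet):
    """Build a2_mm: one row per unique pair of alphanumerics (state),
    one column per unique alphanumeric (transition), all cells zero."""
    # Assumption: only letters and digits take part in states and transitions.
    symbols = [ch for ch in pet if ch.isalnum()]
    states = sorted({symbols[i] + symbols[i + 1] for i in range(len(symbols) - 1)})
    transitions = sorted(set(symbols))
    return {ps: {ns: 0 for ns in transitions} for ps in states}

a2_mm = build_markov_matrix("the brown fox")
print(len(a2_mm), "states x", len(a2_mm["th"]), "transitions")
```

A nested dictionary is used here only for readability; a dense two-dimensional array indexed by i and j would serve equally well.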
• Text analysis and watermark generation: after the Markov matrix is constructed, natural language processing and text analysis are performed to find the interrelationships between the contents of the given English text and to generate the watermark patterns. In this algorithm, the number of appearances of each possible next-state transition for each current state (pair of alphanumeric characters) is calculated and recorded as transition probabilities by Eq. (1).

a2_MM[ps][ns]
where n is the total number of states and i is the index of the current state (a pair of alphanumeric characters).
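The transition counting described above can be sketched as follows, applied to the sample sentence used in this section. This is an illustrative Python sketch, not the paper's implementation. It assumes that non-alphanumeric characters are skipped, and, because the exact form of Eq. (1) is not reproduced here, the row normalization in `to_probabilities` is an assumption. Both function names are hypothetical.

```python
from collections import Counter, defaultdict

def count_transitions(text):
    """Count how often each pair of alphanumerics (ps) is followed by ns."""
    symbols = [ch for ch in text.lower() if ch.isalnum()]  # assumption: skip spaces/punctuation
    a2_mm = defaultdict(Counter)
    for i in range(len(symbols) - 2):
        ps = symbols[i] + symbols[i + 1]   # current state
        ns = symbols[i + 2]                # next state
        a2_mm[ps][ns] += 1
    return a2_mm

def to_probabilities(a2_mm):
    """Assumed normalization: divide each count by its row total."""
    return {ps: {ns: c / sum(row.values()) for ns, c in row.items()}
            for ps, row in a2_mm.items()}

sample = ("The quick brown fox jumps over the brown fox who is slow "
          "jumps over the brown fox who is dead.")
counts = count_transitions(sample)
print(dict(counts["ow"]))                # {'n': 3, 'j': 1}
print(to_probabilities(counts)["ow"])    # {'n': 0.75, 'j': 0.25}
```

In this sample the state "ow" is followed by "n" in each of the three occurrences of "brown" and by "j" once (in "slow jumps"), which is what the zero-initialized cells are meant to record.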
The following English sample text demonstrates how the transitions from the current state to the next state are derived.
"The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead." When the second-order alphanumeric mechanism of the hidden Markov model is used, each pair of alphanumeric characters is a present state. Text processing is done as the text is read, and the interrelationship between the current and the next states is computed. The available transitions for the above English sample text are shown in Fig. 2.

Watermark Embedding Procedure
In this method, watermark embedding takes place logically, without any need to change the original text. Using the features extracted from the given English text, the watermark key is embedded logically by identifying all non-zero values in the Markov chain matrix. All these non-zero values are sequentially concatenated to form the original watermark pattern a2_WMP_O, as defined in Eq. (2) and Fig. 4.

$$\mathrm{a2\_WMP}_O \mathrel{\&}= \mathrm{a2\_mm}[ps][ns], \quad \text{for all } i, j \text{ with non-zero values in a2\_Markov\_matrix} \tag{2}$$
The algorithm of the watermark embedding procedure using the RTADZWA approach is formally presented and implemented as shown in Algorithm 3.

Algorithm 3: Algorithm of watermark embedding using RTADZWA
where a2_WMP_O is the original watermark pattern.
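A minimal sketch of forming the watermark pattern from the non-zero cells is given below. It is illustrative Python rather than the paper's PHP code. The traversal order (states and transitions in sorted order) and the representation of the pattern as a list of counts are assumptions, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def count_transitions(pet):
    """Second-order transition counts over the alphanumerics of pet."""
    symbols = [ch for ch in pet if ch.isalnum()]
    a2_mm = defaultdict(Counter)
    for i in range(len(symbols) - 2):
        a2_mm[symbols[i] + symbols[i + 1]][symbols[i + 2]] += 1
    return a2_mm

def watermark_pattern(a2_mm):
    """Collect all non-zero cells, row by row, into one sequence.

    Assumption: rows (states) and columns (transitions) are visited in
    sorted order so the same text always yields the same pattern.
    """
    return [a2_mm[ps][ns]
            for ps in sorted(a2_mm)
            for ns in sorted(a2_mm[ps])
            if a2_mm[ps][ns] != 0]

a2_wmp_o = watermark_pattern(count_transitions("abcabcabd"))
print(a2_wmp_o)  # [2, 1, 2, 2]
```

Nothing is written into the text itself: the pattern is derived from the text and kept separately, which is what makes the embedding logical rather than physical.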

Watermark Extracting and Detecting Procedures
This procedure consists of two key algorithms: watermark extraction and watermark detection. The attacked watermark pattern a2_EWM_A is extracted from the received pre-processed attacked English text (AET_P) and matched with a2_WMP_O by the detection algorithm. AET_P is required as input to run this algorithm. Hence, the watermark generation algorithm must be performed on AET_P to obtain its watermark pattern, as presented in Fig. 5.

Watermark Extraction Procedure
AET_P is provided as input to this algorithm, and a2_EWM_A is its core output, as presented in Algorithm 4.

Algorithm 4: Algorithm of watermark extraction based RTADZWA
where AET_P is the pre-processed attacked English text and a2_EWM_A is the attacked watermark pattern.

Algorithm of Watermark Detecting
a2_EWM_A and a2_WMP_O are the inputs needed for this algorithm to run. The core output of this algorithm is the status of the given English text, which can be authentic or tampered. The watermark detection process is performed in two sub-steps:
• Primary matching of a2_WMP_O and a2_EWM_A is performed. If the two watermark patterns are identical, the message "English text contents is authentic and no tampering occurred" is shown. Otherwise, the message "This English text document is tampered and not authentic" is shown, and the process continues to the next step.
• Secondary matching is performed by matching the transition status of each state across the entire generated watermark pattern. That is, each transition of a2_EWM_A is compared with the corresponding transition of a2_WMP_O, as given by Eqs. (3) and (4), for all states i and transitions j, where a2_PMR_T is the tampering detection accuracy rate at the transition level and a2_PMR_S is the tampering detection accuracy rate at the state level (0 < a2_PMR_S <= 100).
The weight of every state in the Markov matrix is then determined from the matching rate of that state, as given in Eq. (5).

$$\mathrm{a2\_Sw} = \frac{\mathrm{a2\_PMR}_S(i) \times \text{Transitions frequency}(i)}{\text{total number of transitions}} \tag{5}$$

where a2_PMR_S(i) is the total matching value at the i th state level.
The final a2_PMR of AET_P and AET is computed by Eq. (6).
The watermark distortion rate, denoted a2_WDR, represents the amount of tampering attacks on the contents of the English text and is calculated by Eq. (7).
The watermark detection algorithm is formally presented and applied as shown in Algorithm 5.
The results of the watermark extraction and detection method are illustrated in Fig. 6.
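The two-step detection can be sketched as follows. This Python sketch is illustrative only and is not the paper's Algorithm 5. Only the state weighting follows Eq. (5); since Eqs. (3), (4), (6), and (7) are not reproduced in this text, the following are assumptions made for the example: the transition-level rate is taken as the min/max ratio of the two counts, the state-level rate as the mean of its transition rates, the final rate as the sum of the state weights, and the distortion rate as 100 minus the final rate. All function names are hypothetical.

```python
from collections import Counter, defaultdict

def analyze(text):
    """Second-order transition counts over the alphanumerics of text."""
    symbols = [ch for ch in text.lower() if ch.isalnum()]
    mm = defaultdict(Counter)
    for i in range(len(symbols) - 2):
        mm[symbols[i] + symbols[i + 1]][symbols[i + 2]] += 1
    return mm

def detect(original_text, attacked_text):
    """Return (authentic, a2_pmr, a2_wdr) under the stated assumptions."""
    wmp_o, ewm_a = analyze(original_text), analyze(attacked_text)
    if wmp_o == ewm_a:                       # primary matching
        return True, 100.0, 0.0
    total = sum(sum(row.values()) for row in wmp_o.values())
    a2_pmr = 0.0
    for ps, row_o in wmp_o.items():          # secondary matching, state by state
        row_a = ewm_a.get(ps, {})
        keys = set(row_o) | set(row_a)
        # Assumed transition-level rate: min/max ratio of the two counts.
        rates = [100.0 * min(row_o.get(k, 0), row_a.get(k, 0))
                 / max(row_o.get(k, 0), row_a.get(k, 0)) for k in keys]
        pmr_s = sum(rates) / len(rates)      # assumed state-level rate
        # State weight as in Eq. (5); summing the weights is an assumption.
        a2_pmr += pmr_s * sum(row_o.values()) / total
    return False, a2_pmr, 100.0 - a2_pmr     # assumed distortion rate

print(detect("abcabcabd", "abcabcabd"))
print(detect("abcabcabd", "abcabd"))
```

Because the rates are computed per state, the states with low matching rates also indicate where in the text the tampering is likely to have occurred.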

Implementation and Simulation
A variety of implementation and simulation experiments are conducted to test the tampering detection accuracy of RTADZWA. This section outlines the implementation and experimental environment, the experimental conditions, the standard dataset experimental scenarios, and the discussion.

Simulation and Implementation Environment
Self-developed software was used to evaluate and assess the efficiency of RTADZWA. The RTADZWA implementation environment is: CPU: Intel Core i7-4650U/2.3 GHz, RAM: 8.0 GB, Windows 10 64-bit, and the PHP programming language with the VS Code IDE.
Algorithm 5: Algorithm of watermark detection based on RTADZWA
Figure 6: Results of watermark extraction and detection using RTADZWA

RTADZWA Simulation and Experiment Findings
The performance of RTADZWA refers to the accuracy rate of tampering detection of illegal attacks.

RTADZWA Experiment Under Small (10%) Attack Volumes
Tampering detection accuracy results of RTADZWA under 10% of attack volume of all attacks against all dataset sizes are graphically illustrated in Fig. 7. These results are discussed below.

RTADZWA Experiment Under Mid (20%) Attack Volumes
As observed from the results shown in Fig. 8, under a 20% attack volume RTADZWA gives the best performance in all deletion attack scenarios, consistent with the results under the 10% attack volume.

RTADZWA Experiment Under High (50%) Attack Volumes
As observed from the results shown in Fig. 9, under a 50% attack volume RTADZWA gives the best performance in all deletion attack scenarios for the very small and middle-sized datasets.
However, for the small and large datasets, RTADZWA is more sensitive under reorder attacks.

RTADZWA Simulation and Experiment Findings Under All Attack Volumes
To evaluate the performance of RTADZWA, many experimental scenarios are performed, as shown in Tab. 1, for all forms of attacks and their volumes. The results in Tab. 1 and Fig. 10 indicate that the RTADZWA approach sensitively detects tampering under all attacks that may have been carried out on the structure, semantics, and syntax of the English text content. Comparing tampering detection across attack types, the results show that detection is most sensitive under the insertion attack in all attack volume scenarios.

Baseline Approaches
The performance results of RTADZWA are critically analysed and compared with those of the baseline approaches UZWAMW and HNLPZWA. Their effect is discussed under the major factors, i.e., dataset size, attack types, and attack volumes, to find which approach gives the best performance. The baseline approaches and their objectives are stated in Tab. 2.

Results of Dataset Impact
This section tests the impact of various dataset sizes on watermark reliability against all forms of attacks within their multiple volumes. Tab. 3 compares that effect for RTADZWA with the HNLPZWA and UZWAMW approaches. The comparative results shown in Tab. 3 and Fig. 11 reflect the performance of the RTADZWA approach. The results show that, in the proposed RTADZWA approach, the highest effects of dataset size that lead to the best performance are ordered as ASST, ALST, AMST, and AHMST, respectively. This means that performance increased with increasing text length and decreased with decreasing text length. In addition, the results show that the RTADZWA approach outperforms both the HNLPZWA and UZWAMW approaches in terms of watermark robustness under all scenarios of dataset sizes.

Results of Attack Type Impact
Tab. 4 compares the effect of the different attack types on the tampering detection accuracy of the RTADZWA, HNLPZWA, and UZWAMW approaches against all dataset scales and all attack volume scenarios. In all cases of attack types, a low effect is detected under the insertion attack by RTADZWA and the baseline HNLPZWA and UZWAMW approaches, because deletion and reorder attacks represent both insertion and deletion tampering at the same time. In general, the comparative results shown in Tab. 4 and illustrated in Fig. 12 show that RTADZWA outperforms the baseline HNLPZWA and UZWAMW approaches, with a high performance rate and a low effect of attack types. This means that the proposed RTADZWA approach is strongly recommended and applicable for content authentication and tampering detection of English text under all attack types.

Results of Attack Rates Impact
Tab. 5 compares the effects of multiple attack volumes on tampering detection performance across all dataset sizes and volume scenarios. The comparison is performed using the RTADZWA, HNLPZWA, and UZWAMW approaches. Tab. 5 and Fig. 13 show how the performance is influenced by low, mid, and high attack volumes. In cases of mid and high attack volumes, a very high effect is detected by the baseline HNLPZWA and UZWAMW approaches. However, a very low effect is detected by the proposed RTADZWA approach. In Fig. 11, it can be seen that if the attack volume increases, the tampering detection accuracy also increases. RTADZWA outperforms the baseline HNLPZWA and UZWAMW approaches in all scenarios of low, mid, and high volumes of all attacks. This means that the RTADZWA approach is strongly recommended for content authentication and tampering detection of English text under all volumes of all attacks.

Conclusion
Based on the second-order alphanumeric mechanism of the hidden Markov model, a novel hybrid approach of natural language processing and zero-watermarking, abbreviated as RTADZWA, has been developed. The core aim of the proposed RTADZWA is content authentication and tampering detection of English text transmitted via the Internet. The RTADZWA approach is implemented in the PHP programming language using the VS Code IDE. The simulation and experiments are performed on various standard datasets under different volumes of insertion, deletion, and reorder attacks. The RTADZWA approach has been compared with the HNLPZWA and UZWAMW approaches. The comparison results show that RTADZWA outperforms the baseline HNLPZWA and UZWAMW approaches in terms of general performance, which represents watermark capacity, watermark robustness, and tampering detection accuracy, under all scenarios of all attack types and volumes.
For future work, the author intends to improve the performance using higher orders of the alphanumeric mechanism of the Markov model.