A Smart English Text Zero-Watermarking Approach Based on Third-Level Order and Word Mechanism of Markov Model

Text information depends principally on natural language, and improving the security and reliability of text exchanged over the Internet has become one of the most difficult challenges that researchers face. Content authentication and tampering detection of digital content are now major concerns in communication and information exchange over the Internet. This paper proposes an intelligent text zero-watermarking approach, SETZWMWMM (Smart English Text Zero-Watermarking Approach Based on Third-Level Order and Word Mechanism of Markov Model), for content authentication and tampering detection of English text. SETZWMWMM embeds and detects the watermark logically, without altering the original English text document. Based on the Hidden Markov Model (HMM), the third-level order of the word mechanism is used to analyze the interrelationships between the contents of a given English text. The extracted features are used as watermark information and integrated with digital zero-watermarking techniques. To detect eventual tampering, SETZWMWMM was implemented and validated with attacked English text. Experiments were performed on four datasets of varying lengths under insertion, reorder, and deletion attacks applied at multiple random locations. The experimental results show that the method is more sensitive and efficient than the compared methods for all of these tampering attacks, with high tampering detection accuracy.

tamper with these digital contents during the transfer process, and the modified content can then lead to incorrect decisions [Nurul, Amirrudin, Lip et al. (2018)]. Several information security solutions have been proposed for many purposes, including encryption, data hiding, copyright protection, integrity verification, and unauthorized access control [Fan, Huang and Hsu (2011)]. Digital watermarking is one of the most common information-hiding techniques for purposes such as content authentication and copyright protection: even if the watermarked media is altered, the original media remains protected and its ownership can still be proven. Digital watermarking uses a specific algorithm to hide information by embedding it in digital images, audio, video, or text [Singh and Chadha (2013); Kaur and Sharma (2016); Kaur and Sharma (2017)]. In the last decade, text has been the most common form of digital media transferred among Internet applications. Nevertheless, limited research has focused on text solutions, because text is natural-language dependent and it is difficult to hide security information in it, unlike images, audio, and video, where security information can be hidden in pixels, waves, and frames respectively [Al-Maweri, Ali, Adnan et al. (2015); Tayan, Kabir and Alginahi (2014)]. The main challenge in this area is developing appropriate methods to hide information in sensitive text content without modifying it [Hakak, Amirrudin, Tayan et al. (2017)]. The digital Holy Qur'an in Arabic, eChecks, and online exams and marking are examples of such sensitive digital text content. Various features of the Arabic alphabet, such as diacritics, extended letters, and other Arabic symbols, make it easy to change the main meaning of text content through simple modifications such as rearranging diacritics [Dhiman and Singh (2016); Hakak, Kamsin, Tayan et al. (2017); Khizar, Abid, Mansoor et al. (2018)].
The Hidden Markov model (HMM) is one of the most common techniques in natural language processing (NLP) and is used to analyze text and extract text features. In this paper, the author presents an intelligent hybrid approach, SETZWMWMM (Smart English Text Zero-Watermarking Approach Based on Third-Level Order and Word Mechanism of Markov Model), for content authentication and tampering detection of English text transmitted via the Internet. The proposed approach is based on the third-level order of the word mechanism of the Markov model. It consists of a model in which a zero-watermarking technique and the Markov model, as an NLP technique, operate together. In this approach, the third-level order of the word mechanism is used for text analysis in order to extract the interrelationships between the contents of the given English text and to generate a watermark key. The generated watermark is embedded logically in the original English text, without any modification to the text or effect on its size. After the text has been transmitted via the Internet, the embedded watermark is used to detect any tampering with the received English text and to determine whether it is authentic. The major objective of SETZWMWMM is to achieve highly accurate content authentication and sensitive tampering detection for English text, which has gained great importance and needs more security and protection on the Internet. The main contributions of SETZWMWMM are as follows.
• A hybrid text zero-watermarking and NLP approach has been developed for content authentication and tampering detection of digital English content. This problem has received little attention in the literature, mainly because English text is natural-language dependent and because hiding watermark information in text is complex: text offers no locations in which to hide it, unlike pixels in images, waves in audio, and frames in video.


• Modern techniques of information security and NLP have been integrated to improve tampering detection accuracy and to embed the watermark logically, without any need to modify the original text.
• Resilience against random attacks on text has been improved, and watermark distortion is detected under attacks of varying volume and nature.
• The approach has been developed without prior assumptions about the size, structure, or contents of an English text document, which may include character sets, numbers, and special symbols.
• Watermark capacity has been reduced and the external watermark key has been eliminated, because the key is generated as a result of the text analysis process.
• The author compares SETZWMWMM with baseline approaches using a self-developed program and extensive experiments on various English dataset scenarios under the main text attacks and attack volumes.
By studying and analyzing the results, the author observed that SETZWMWMM outperforms the baseline approaches in terms of tampering detection accuracy. The rest of the paper has five core sections. Section 2 provides a literature review of related work. Section 3 presents SETZWMWMM. Section 4 describes the implementation, simulation, and experiments. Section 5 presents the comparison and discussion of results, and Section 6 concludes the article.

Related work
In the literature, several text watermarking approaches and methods have been proposed for various information security purposes. In this paper, the author briefly reviews the most common classes of text watermarking methods, which are linguistic-based watermarking, structural-based watermarking, and zero-watermarking methods [Nurul, Amirrudin, Lip et al. (2018); Sameeka and Kalpesh (2018)].

Linguistic-based techniques
Linguistic-based text watermarking methods are natural-language-based techniques that work by modifying the semantic and syntactic nature of plain text in order to embed the watermark key [Nurul, Amirrudin, Lip et al. (2018); Chen, Ma and Lu (2016)]. In linguistic and semantic-based approaches, information is hidden by manipulating words and using them as a watermark key through methods such as synonym substitution, typos, noun-verb transformations, and text-meaning representational strings [Chen, Ma and Lu (2016)]. One syntactic-based method was proposed in Reem et al. [Reem and Lamiaa (2018)]. It uses open word spaces to improve the capacity of Arabic text: each word space is used to hide the binary bit 0 or 1, which involves physically modifying the original text. Another syntactic-based method, presented in [Mujtaba and Asadullah (2015)] for copyright protection, considers the presence of Harakat (diacritics, i.e., Fat-ha, Kasra, and Damma) in the Arabic language and reverses the Fat-ha to hide the message. Other English text watermarking methods use Unicode characters to hide the watermark information within English scripts. In these methods, the ASCII code is used to embed 00, and multilingual Unicode characters are used to embed 01, 10, and 11 [Abdul, Wesam and Dhamyaa (2013)].

Structural-based techniques
Structural-based text watermarking approaches are based on content structure and alter the features or structure of the text to embed the watermark information [Nasr addin, Wan and Abdul (2016)]. This includes modifying general formatting features of the original text to hide the watermark key, such as the locations of letters or words, the writing style, the repetition of some letters, or other features of the text [Kaur (2015); Alotaibi and Elrefaei (2015)]. One of the early structure-based approaches changes the locations of words in the text [Bashardoost, Rahim and Saba (2017)]. Other structural-based approaches were proposed in Liu et al. [Liu, Zhu and Xin (2015); Zhu, Xiang, Song et al. (2016)] for content authentication of Chinese text by merging the properties of sentences and calculating their entropy. In these approaches, the Chinese text is divided into sets of short sentences and the semantic code of each word is obtained. The entropy is then calculated from the frequency of the semantic codes, and the weight of each sentence is found from sentence features such as entropy, length, relevance, and a weight function. The extracted features are used to generate the watermark from the verbs and nouns of the high-weight sentences.

Zero-watermark-based techniques
Zero-watermark-based text approaches rely on text features: the watermark key is generated from the text content itself. This means that several text features must be obtained, extracted, and used as watermark information. Several techniques and solutions based on text features have been proposed, using features such as the number of words, sentences, or letters, the first letter of each word, and the appearance frequency of non-vowel ASCII letters and words [Milad (2018)]. One of the available text zero-watermarking approaches, presented in Milad et al. [Milad (2018)], hides the watermarking information within social media content and validates it later in terms of accuracy and reliability. Another text zero-watermarking technique was proposed in Khizar et al. [Khizar, Abid, Mansoor et al. (2018)] to validate the integrity of text data over the Internet of Things. The watermark information is generated from text features such as text size, data appearance frequency, and time of data capture. The generated watermark has to be embedded logically in the original content before transmission. In Zulfiqar et al. [Zulfiqar, Shamim, Ghulam et al. (2018)], a zero-watermarking method was developed for individual privacy protection based on measures such as the Hurst exponent and the zero crossings of speech signals. The individual's identity has to be obtained so that it can be embedded as a watermark key. For copyright protection of English text, zero-watermarking methods have been proposed in Tayan et al. [Tayan, Yasser and Muhammed (2014); Hanaa and Maisa'a (2016)] that use the appearance frequency of non-vowel ASCII letters and words. As for solutions that combine other techniques with zero-watermarking, Mokhtar et al. [Mokhtar, Fadl and Fahd (2014); Fahd, Adnan and Kulkarni (2014)] use natural language processing together with zero-watermarking for content authentication.
These methods extract text features to obtain the probability properties of the text and use them as a watermark key.

The proposed approach
This paper proposes a novel intelligent approach that integrates zero-watermarking with the Hidden Markov model as an NLP technique, so that there is no need to embed extra information such as a watermark key or to modify the original text in any way. The third-level order of the word mechanism of the Markov model is used as the NLP technique to analyze the contents of English text and extract the interrelationship features of those contents. The author makes the following assumptions in SETZWMWMM:
• Unlike previous work, in which watermarking affects the text, its content, and its size, SETZWMWMM embeds the watermark logically, without any effect on the text, content, or size.
• In SETZWMWMM, watermarking does not need any external information, because the watermark key is produced by analyzing the text, extracting the relationships within the content itself, and using them as the watermark.
• SETZWMWMM is highly sensitive to even simple modifications of the English text or its meaning. The three properties mentioned above have so far been found mainly in image watermarking, not in text watermarking, which is the key point of this paper's contribution.
• In addition, SETZWMWMM can effectively determine where tampering has occurred. This feature can be considered an advantage over hash function methods.
SETZWMWMM performs two main processes, illustrated in Figs. 1 and 2: the text analysis and watermark generation process, and the watermark extraction and detection process. The following subsections explain the watermark generation and extraction processes in detail.

Text analysis and watermark generation process
Pre-processing should be performed before watermark generation and embedding in order to remove any extra spaces and new lines from the given English text. The pre-processed original English text document (PET) is the required input of the watermark generation and embedding algorithm. The output of this algorithm is the original watermark pattern (W3_WMO). The generated watermark is stored in the watermark database alongside basic information about the English text document, such as the author name, document size and identity, and last modified date. This process includes three main sub-algorithms: the pre-processing and Markov matrix building algorithm, the text analysis algorithm, and the watermark generation algorithm.

Pre-processing and building a Markov matrix algorithm
The original English text (OET) is the required input of the pre-processing step, which removes extra spaces and new lines. Building a Markov matrix is the starting point of the English text analysis and watermark generation process using the Markov model. A Markov matrix that represents the possible states and transitions available in a given text is constructed without repetitions. In this approach, each unique triple of words within a given English text represents a present state, and each unique word represents a transition in the Markov matrix. While building the Markov matrix, the proposed algorithm initializes all transition values to zero, so that these cells can later keep track of the number of times that the i-th triple of words is followed by the j-th word within the given English text document. The pre-processing and Markov matrix building algorithm executes as presented in Algorithm 1.

Algorithm 1: Pre-processing and building Markov matrix algorithm of SETZWMWMM
where OET is the original English text, PET is the pre-processed English text, w3_mm is the matrix of states and transitions with zero values in all cells, ps refers to the current state, and ns refers to the next state. According to the above, a method is presented to construct a two-dimensional matrix of Markov states and transitions named w3_mm[ ][ ], which represents the backbone of the Markov model for English text analysis. The length of the w3_mm[ ][ ] matrix of SETZWMWMM is dynamic: the number of states varies with the content of a given English text and is equal to the number of unique triples of words.
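The pre-processing and matrix-building steps described above can be sketched as follows. This is a minimal illustration rather than the paper's Algorithm 1: the function names `preprocess` and `build_markov_matrix`, the whitespace-based tokenization, and the nested-dictionary (sparse) representation of w3_mm are all assumptions made for the example.

```python
import re


def preprocess(oet: str) -> str:
    """Collapse extra spaces and new lines into single spaces (OET -> PET)."""
    return re.sub(r"\s+", " ", oet).strip()


def build_markov_matrix(pet: str) -> dict:
    """Build w3_mm with every observed transition initialized to zero.

    Keys are present states (unique triples of words); values map each
    word that follows the triple to a zero counter. Only observed
    (state, next word) pairs are stored, which is an assumption of this
    sketch rather than a full states-by-words matrix.
    """
    words = pet.split(" ") if pet else []
    w3_mm = {}
    for k in range(len(words) - 3):
        ps = tuple(words[k:k + 3])  # present state: a triple of words
        ns = words[k + 3]           # next state: the word that follows
        w3_mm.setdefault(ps, {})[ns] = 0
    return w3_mm


if __name__ == "__main__":
    pet = preprocess("The  quick\nbrown fox   jumps")
    print(pet)
    print(build_markov_matrix(pet))
```

Because states are stored as dictionary keys, repeated triples collapse into a single state automatically, which mirrors the requirement that the matrix is built without repetitions.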

Text analysis algorithm
After the Markov matrix has been constructed, the NLP text analysis process is performed to find the interrelationships between the contents of the given English text and to generate watermark patterns. The following sample of English text illustrates how a present state transitions to its next states.

"The quick brown fox jumps over the brown fox who is slow jumps over the brown fox who is dead."
With the third-level order of the word mechanism of the Hidden Markov model, every unique triple of words is a present state. Text analysis proceeds as the text is read, to obtain the interrelationship between each present state and its next states. Fig. 3 illustrates the available transitions of the above sample text and the results of text analysis. For example, if "fox who is" is the present state, the available next transitions are "slow" and "dead". The author now presents a method to construct a two-dimensional matrix of Markov states and transitions named w3_mm[i][j], which represents the backbone of the Markov model for English text analysis. The length of the w3_mm[i][j] matrix of SETZWMWMM is dynamic, because the number of states varies with the content of a given English text. In this algorithm, the number of appearances of each possible next-state transition with a non-zero value is calculated for each current state (triple of words) and recorded as the transition probabilities by Eq. (1).

$$w3\_mm[i][j] = \text{total number of transitions}[i][j], \quad i, j = 1, 2, \ldots, n \qquad (1)$$

where n is the total number of states, i is the i-th current state, and j is the j-th next-state transition. Let PET be the pre-processed text; w3_mm[ps][ns] is the Markov matrix that stores the number of times that the i-th triple of words (present state) is followed by the j-th word (next-state transition) in the given English text. The text analysis algorithm is presented formally and executes as illustrated in Algorithm 2.

Algorithm 2: Text analysis algorithm of SETZWMWMM
where w3_mm[ps][ns] refers to the initial matrix of the Markov model with zero values, pw refers to the previous word, and nw refers to the next word. The results of the text analysis algorithm based on the third-level order of the word mechanism of the Markov model are illustrated in Fig. 4.

Figure 4: Text analysis process of the given English text sample using SETZWMWMM
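To make the transition counting concrete, the sketch below applies it to the sample sentence used above. It is an illustration under stated assumptions, not the paper's Algorithm 2: the function name `analyze_text` is hypothetical, words are split on whitespace, matching is case-sensitive, and punctuation stays attached to its word (so the final word is "dead." rather than "dead").

```python
def analyze_text(pet: str) -> dict:
    """Count how often each triple of words is followed by each next word.

    Returns w3_mm as {present_state: {next_word: count}}.
    """
    words = pet.split()
    w3_mm = {}
    for k in range(len(words) - 3):
        ps = tuple(words[k:k + 3])  # present state (triple of words)
        nw = words[k + 3]           # next word (transition)
        row = w3_mm.setdefault(ps, {})
        row[nw] = row.get(nw, 0) + 1
    return w3_mm


sample = (
    "The quick brown fox jumps over the brown fox who is slow "
    "jumps over the brown fox who is dead."
)
w3_mm = analyze_text(sample)

if __name__ == "__main__":
    # The state "fox who is" has two possible transitions.
    print(w3_mm[("fox", "who", "is")])
    # The state "jumps over the" is followed by "brown" twice.
    print(w3_mm[("jumps", "over", "the")])
```

Under these tokenization assumptions the sample yields 12 unique states; "fox who is" is followed once each by "slow" and "dead.", and "jumps over the" is followed by "brown" twice.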

Watermark generation algorithm
After English text analysis has been performed and the probability features have been extracted, the watermark key is generated by finding all non-zero values in the Markov matrix. These non-zero values are concatenated sequentially to generate the original watermark pattern W3_WMO, as given in Eq. (2).

$$W3\_WMO \mathrel{\&}= w3\_mm[i][j], \quad \text{for all } i, j \text{ with non-zero values in } w3\_mm \qquad (2)$$

where &= denotes sequential concatenation.

Figure 5: The generated original watermark patterns W3_WMO in decimal form using SETZWMWMM

The generated watermark W3_WMO is stored in the WM database alongside the basic information of the English text document. The generated sequential watermark patterns are then digested using the MD5 hash algorithm to obtain a secure watermark form and to reduce the capacity of the watermark information. The digest is denoted W3_DWMPO, as given in Eq. (3).

$$W3\_DWMPO = \mathrm{MD5}(W3\_WMO) \qquad (3)$$
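The generation step can be sketched as below: the non-zero counts are concatenated in order and the result is digested with MD5 from Python's standard `hashlib` module. The function name `generate_watermark`, the traversal order (states and transitions in first-appearance order), and the decimal string encoding of each count are assumptions of this example, since the text does not specify them.

```python
import hashlib


def generate_watermark(w3_mm: dict) -> tuple:
    """Return (W3_WMO, W3_DWMPO) for a matrix {state: {next_word: count}}."""
    parts = []
    for ps in w3_mm:                       # states in insertion order
        for count in w3_mm[ps].values():   # transitions of this state
            if count != 0:                 # keep only non-zero values
                parts.append(str(count))
    w3_wmo = "".join(parts)                # sequential concatenation
    w3_dwmpo = hashlib.md5(w3_wmo.encode("utf-8")).hexdigest()
    return w3_wmo, w3_dwmpo


if __name__ == "__main__":
    # Hypothetical two-state matrix used only for illustration.
    example = {
        ("fox", "who", "is"): {"slow": 1, "dead.": 1},
        ("jumps", "over", "the"): {"brown": 2, "red": 0},
    }
    pattern, digest = generate_watermark(example)
    print(pattern, digest)
```

One design point worth noting: concatenating decimal counts without a separator means that, for example, the counts (1, 12) and (11, 2) produce the same pattern, so an implementation that needs to distinguish them would have to add a delimiter or fixed-width encoding.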

Watermark extraction and detection process
Before tampering in the pre-processed attacked English text (AETP) can be detected, the attacked watermark patterns (W3_EWMA) must be generated, and SETZWMWMM must calculate the pattern matching rate and the watermark distortion in order to detect any tampering and authenticate the given content.
This process involves two core algorithms: watermark extraction and watermark detection. W3_EWMA is extracted from the received AETP and matched with W3_WMO by the detection algorithm. AETP is provided as the input of the proposed watermark extraction algorithm, which performs the same process as the watermark generation algorithm to obtain the watermark pattern for AETP.

Watermark extraction algorithm
AETP is the main input required to run this algorithm, and its output is W3_EWMA. The watermark extraction algorithm is presented formally and executes as illustrated in Algorithm 4.

Algorithm 4: Watermark extraction algorithm of SETZWMWMM
where AETP refers to the pre-processed attacked English text, and W3_EWMA refers to the attacked watermark patterns.

Watermark detection algorithm
W3_EWMA and W3_WMO are the main inputs of the watermark detection algorithm. Its output is a notification of whether the English text document is authentic or tampered. Detection of the extracted watermark is performed in two main phases:
• Primary matching compares W3_WMO and W3_EWMA as whole patterns. If the two patterns are identical, the alert "This English text is authentic and no tampering occurred" is shown. Otherwise, the notification is "This English text is tampered", and detection continues to the next phase.
• Secondary matching compares the transitions of each state across the whole generated pattern. That is, the W3_EWMA value of each state transition is compared with the equivalent transition of W3_WMO, as given by Eqs. (4) and (5).
Eq. (4) is evaluated for all states i and transitions j, where:
- W3_PMR_T represents the pattern matching rate at the transition level (0 < W3_PMR_T <= 1).
- i, j refer to the indexes of states and transitions respectively, with i = 0, ..., total number of non-zero states, and j = 0, ..., total number of non-zero transitions in the given English text.
- W3_WMO refers to the original watermark value at the transition level.
- W3_EWMA refers to the attacked watermark value at the transition level.
In Eq. (5), n is the total number of non-zero transitions of each state represented in the matrix of the Markov model, and W3_PMR_S is the pattern matching rate at the state level (0 < W3_PMR_S <= 1). After the pattern matching rate of every state has been produced, the weight of each state among all states in the Markov matrix is found by Eq. (6).
The final W3_PMR of OETP and AETP is calculated by Eq. (7), where N is the total number of non-zero values in w3_mm. The watermark distortion rate, denoted W3_WDR, represents the amount of tampering that occurred in the attacked English text and is calculated by Eq. (8).
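The two-phase detection can be illustrated with the sketch below. It is not a reproduction of Eqs. (4) to (8), whose exact forms are not legible in this text. Instead it makes three clearly labeled assumptions: the transition-level rate is taken as min/max of the original and attacked counts, the final rate is the mean of the transition-level rates over the N non-zero values of the original matrix, and the distortion rate is one minus the matching rate. The names `transition_rate` and `detect` are hypothetical.

```python
def transition_rate(orig: int, attacked: int) -> float:
    """Assumed transition-level matching rate: ratio of the smaller count
    to the larger one, so identical counts give 1.0 and a missing
    transition gives 0.0."""
    if max(orig, attacked) == 0:
        return 1.0
    return min(orig, attacked) / max(orig, attacked)


def detect(w3_wmo: dict, w3_ewma: dict) -> tuple:
    """Return (status, matching_rate, distortion_rate).

    Both arguments are matrices of the form {state: {next_word: count}}.
    """
    # Phase 1: primary matching of the whole patterns.
    if w3_wmo == w3_ewma:
        return "authentic", 1.0, 0.0

    # Phase 2: secondary matching, transition by transition.
    total, n = 0.0, 0
    for ps, transitions in w3_wmo.items():
        attacked_row = w3_ewma.get(ps, {})
        for nw, count in transitions.items():
            if count == 0:
                continue  # only non-zero original values are compared
            total += transition_rate(count, attacked_row.get(nw, 0))
            n += 1
    pmr = total / n if n else 0.0
    return "tampered", pmr, 1.0 - pmr


if __name__ == "__main__":
    original = {("a", "b", "c"): {"d": 2}, ("b", "c", "d"): {"e": 1}}
    attacked = {("a", "b", "c"): {"d": 1}}
    print(detect(original, original))
    print(detect(original, attacked))
```

Because the comparison is made per state and per transition, the states whose rate drops below 1 indicate where in the text the tampering occurred, which is the localization property the approach claims over a plain hash comparison.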

Implementation and simulation
To evaluate the tampering detection accuracy of SETZWMWMM, several simulation and experiment scenarios were performed. This section describes the implementation, the simulation and experimental environment, the experiment parameters, the experimental scenarios on standard datasets, and a discussion of the results.

Implementation environment and setup
A self-developed program was used to test and evaluate the performance of SETZWMWMM. The implementation environment of SETZWMWMM is as follows: CPU: Intel Core i7-4650U/2.3 GHz, RAM: 8.0 GB, Windows 10 (64-bit), PHP programming language with the VS Code IDE.

Simulation and experimental parameters
A series of experiments and simulation scenarios of SETZWMWMM were conducted using standard datasets of different sizes. The experiments and simulation scenarios were performed under predefined attacks, applied with their volumes at multiple random locations in these datasets. The experimental and simulation parameters and their associated values are given in Tab. 1.

Performance metrics
Tampering detection accuracy refers to the performance of SETZWMWMM, which is evaluated using the following metrics:
• Tampering detection accuracy (W3_PMR and W3_WDR) under very low volume (5%), low volume (10%), mid volume (20%), and high volume (50%) of all addressed attacks, with all scenarios of English dataset sizes.
• Desired tampering detection accuracy values close to 100%.
• Accuracy evaluation of tampering detection under all attacks with various volumes.
• Comparison of the tampering detection accuracy of SETZWMWMM with other approaches, and evaluation of the effects of dataset size, attack type, and attack volume on tampering detection accuracy.

Simulation, experiments and results discussion with SETZWMWMM
In this subsection, the author evaluates the tampering detection accuracy of SETZWMWMM.
The character set covers all English characters, spaces, numbers, and special symbols. Experiments are conducted on datasets of different sizes and under various kinds of attacks with their rates, as identified above in Tab. 1.
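The kind of attack scenario described above can be simulated with a short sketch such as the following. It assumes that attack volume is measured as a fraction of the number of words and that attacks are applied word by word at uniformly random locations; the function name `attack`, the placeholder inserted word, and the use of a seeded `random.Random` for reproducibility are all illustrative choices, not details taken from the paper.

```python
import random


def attack(words: list, kind: str, volume: float, rng: random.Random) -> list:
    """Return a tampered copy of `words`.

    kind   : "insertion", "deletion", or "reorder"
    volume : fraction of the word count to tamper with (e.g. 0.05 for 5%)
    """
    out = list(words)  # never modify the caller's list
    if not out:
        return out
    k = max(1, int(len(out) * volume))  # number of tampering operations
    if kind == "insertion":
        for _ in range(k):
            out.insert(rng.randrange(len(out) + 1), "INSERTED")
    elif kind == "deletion":
        for _ in range(min(k, len(out))):
            del out[rng.randrange(len(out))]
    elif kind == "reorder":
        if len(out) >= 2:
            for _ in range(k):
                i, j = rng.sample(range(len(out)), 2)
                out[i], out[j] = out[j], out[i]  # swap two random words
    else:
        raise ValueError(f"unknown attack kind: {kind}")
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    text = "the quick brown fox jumps over the lazy dog again".split()
    for kind in ("insertion", "deletion", "reorder"):
        print(kind, " ".join(attack(text, kind, 0.2, rng)))
```

Running the same attack over the four volume levels (5%, 10%, 20%, 50%) and passing the result through the extraction and detection steps gives one matching-rate measurement per scenario.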

Comparison and result discussion
The tampering detection accuracy results are critically analyzed. This section presents an effect study and a comparison between SETZWMWMM and the baseline approaches RACAAT and ZWAFWMMM. It also discusses how the results are affected by the major factors, i.e., dataset size, attack type, and attack volume.

Baseline approaches
Tampering detection accuracy of SETZWMWMM is compared with RACAAT (Robust Approach for Content Authentication of Arabic Text) and ZWAFWMMM (Zero-Watermarking Approach based on Fourth level order of Arabic Word Mechanism of Markov Model) [Fahd, Khalid and Nadhem (2020)]. Comparison is performed under all performance metrics to find which approach gives the best accuracy of tampering detection. Baseline approaches and their working parameters are stated in Tab. 3.

Comparison and results study of attack type effect
Tab. 4 compares the effect of the different attack types on the tampering detection accuracy of the SETZWMWMM, ZWAFWMMM, and RACAAT approaches across all dataset sizes and all attack volume scenarios. Tab. 4 and Fig. 8 show how the tampering detection accuracy of the three approaches is influenced by the type of tampering attack. In all cases of insertion, deletion, and reorder attacks, SETZWMWMM outperforms ZWAFWMMM and RACAAT with a high rate of tampering detection accuracy. This means that SETZWMWMM is strongly recommended and applicable for content authentication and tampering detection of English text under all attack types.

Comparison and results study of attack volume effect
Tab. 5 compares the effect of the different attack volumes on tampering detection accuracy across all dataset sizes and all attack volume scenarios. The comparison covers the SETZWMWMM, ZWAFWMMM, and RACAAT approaches. Tab. 5 and Fig. 9 show how the tampering detection accuracy is influenced by low, mid, and high attack volumes. For all three approaches, the tampering detection accuracy increases as the attack volume increases and decreases as the attack volume decreases. SETZWMWMM outperforms both ZWAFWMMM and RACAAT in terms of tampering detection accuracy in all scenarios of low, mid, and high volumes of all attacks. This means that SETZWMWMM is strongly recommended and applicable for content authentication and tampering detection of English text documents under all volumes of all attacks.

Comparison and results study of dataset size effect
In this subsection, the author evaluates the effect of the different dataset sizes on watermark robustness against all attack types under their different volumes. Tab. 6 compares that effect for the SETZWMWMM, ZWAFWMMM, and RACAAT approaches. The comparative results shown in Fig. 10 reflect the tampering detection accuracy of SETZWMWMM. The results show that, for the proposed SETZWMWMM approach, the dataset sizes that lead to the best tampering detection accuracy under insertion and deletion attacks are ordered as ESST, ELST, EMST, and EHMST, respectively. However, the order differs in the case of reorder attacks. This means that the tampering detection accuracy increases as the document size decreases and decreases as the document size increases. On the other hand, the results show that SETZWMWMM outperforms both ZWAFWMMM and RACAAT in terms of tampering detection accuracy under all scenarios of mid and large dataset sizes (ESST, EMST, EHMST, and ELST).

Conclusion
Based on the third-level order and word mechanism of the Hidden Markov model, a novel hybrid approach combining NLP and zero-watermarking, abbreviated as SETZWMWMM, has been developed for content authentication and tampering detection of English text transmitted via the Internet. SETZWMWMM combines a zero-watermarking technique with NLP techniques for text analysis in order to find the interrelationships between the contents of a given English text and to generate the watermark key. The generated watermark is embedded logically in the original English text, without modifying the text or affecting its size. After the text has been transmitted via the Internet, the embedded watermark is used to detect any tampering with the received English text and to determine whether it is authentic. SETZWMWMM is implemented in PHP using the VS Code IDE. The simulation and experiments were performed on various standard datasets under different volumes of insertion, deletion, and reorder attacks. SETZWMWMM has been compared with the ZWAFWMMM and RACAAT approaches. The comparison results show that SETZWMWMM outperforms ZWAFWMMM and RACAAT in terms of overall tampering detection accuracy under all attack types and volumes. In future work, the author intends to improve the tampering detection accuracy using other techniques and mechanisms.