A Hybrid Intelligent Approach for Content Authentication and Tampering Detection of Arabic Text Transmitted via Internet

In this paper, a hybrid intelligent text zero-watermarking approach is proposed that integrates text zero-watermarking with a hidden Markov model, used as a natural language processing technique, for the content authentication and tampering detection of Arabic text. The proposed approach is known as the Second order of Alphanumeric Mechanism of Markov model and Zero-Watermarking Approach (SAMMZWA). The second-level order of the alphanumeric mechanism, based on the hidden Markov model, is integrated with text zero-watermarking techniques to improve the overall performance and tampering detection accuracy of the proposed approach. SAMMZWA embeds and detects the watermark logically, without altering the original text document. The extracted features are used as watermark information and integrated with digital zero-watermarking techniques. To detect eventual tampering, SAMMZWA has been implemented and validated with attacked Arabic text. Experiments were performed on four datasets of varying lengths under insertion, reorder, and deletion attacks at multiple random locations. The experimental results show that the method is more sensitive to all kinds of tampering attacks, and detects tampering with higher accuracy, than the compared methods.

protection whenever, if any altering is made in the watermarked media, the original media is still protected and proves its ownership [4]. DW uses a specific algorithm to hide information by embedding it in digital images, audio, video, or text [5]. Several traditional text watermarking methods and solutions have been proposed and classified into categories such as structure-based, linguistic-based, binary-image-based, and format-based [6]. Most of these solutions require some modification or transformation of the original digital text in order to embed the watermark information. Zero-watermarking is a modern technique that can be used with smart algorithms to embed the watermark information without any modification of the original digital content [7,8]. The main challenge in this area is developing appropriate methods to hide information in sensitive text content without modifying it [9]. In the last decade, text has been the most common form of digital media transferred among internet applications. Nevertheless, limited research has focused on text solutions, because text is natural-language dependent and it is difficult to hide security information in it, unlike images, audio, and video, in which security information can be hidden in pixels, waves, and frames, respectively [10]. The digital Holy Qur'an in Arabic, eChecks, and online exams and marking are some examples of such sensitive digital text content. Various features of the Arabic alphabet, such as diacritics, extended letters, and other Arabic symbols, make it easy to change the meaning of text content through simple modifications such as changing the arrangement of diacritics [11][12][13][14]. The hidden Markov model (HMM) is one of the most common techniques in natural language processing (NLP) and is used for text analysis and text feature extraction.
In this paper, the authors present an intelligent hybrid approach for content authentication and tampering detection of Arabic text, called SAMMZWA. The proposed technique combines a Markov model and zero-watermarking. The second order of the alphanumeric mechanism of the Markov model is used for text analysis in order to extract the interrelationships among the contents of the given Arabic text, from which the watermark key is generated. The generated watermark is logically embedded in the original Arabic text without modifying it. After transmission, the embedded watermark is used to detect any tampering with the received Arabic text and to ensure the authenticity of the transmitted text. The objective of the SAMMZWA approach is to achieve high accuracy of content authentication and sensitive detection of tampering attacks on Arabic text.
The rest of the paper has five more sections. Section 2 provides a literature review of the related work. Section 3 presents SAMMZWA. Section 4 describes the implementation, simulation, and experimental details. Section 5 describes the comparison and results discussion, and Section 6 offers conclusions.

Literature Review
In the literature, several text watermarking approaches and methods have been proposed for various information security purposes. In this paper, the authors briefly review the most common classes of text watermarking methods: linguistic-based watermarking, structural-based watermarking, and zero-watermarking [15].

Linguistic-Based Methods
Linguistic-based text watermarking methods are natural-language-based techniques that work by modifying the semantic and syntactic nature of plain text in order to embed the watermark key. In linguistic and semantic-based approaches, information is hidden by manipulating words and using them as the watermark key, through techniques such as synonym substitution, typos, noun-verb substitution, and text-meaning representation strings [16].
One syntactic-based method is proposed in [17]. It uses open word spaces to improve the capacity of Arabic text: each inter-word space is used to hide a binary bit, 0 or 1, which physically modifies the original text. Another syntactic-based method, presented in [18] for copyright protection, considers the existence of Harakat (diacritics, i.e., Fat-ha, Kasra, and Damma) in the Arabic language and reverses the Fat-ha for message hiding. An English text watermarking method also makes use of Unicode characters to hide the watermark information within English scripts. In that method, ASCII characters are used to embed the bit pair 00, whereas multilingual Unicode characters are used to embed 01, 10, and 11 [19].

Structural-Based Methods
Structural-based text watermarking approaches are based on content structure: they alter the features or structure of the text to embed the watermark information [20]. This includes modifying general formatting features of the original text to hide the watermark key, such as the locations of letters or words, the writing style, or the repetition of some letters [21,22]. One of the early structure-based approaches changes the locations of words in the text [23]. Other structural-based approaches, proposed in [24,25] for content authentication of Chinese text, merge the properties of sentences and calculate their entropy. In these approaches, the Chinese text is divided into sets of small sentences and the semantic code of each word is obtained. The entropy is then calculated from the frequency of the semantic codes, and the weight of each sentence is found from sentence features such as entropy, length, relevance, and a weight function. The extracted features are used to generate the watermark from the verbs and nouns of the high-weight sentences.

Zero Watermark-Based Methods
Zero-watermark-based text approaches rely on text features: the watermark key is generated from the text content itself. This means that several text features have to be obtained, extracted, and used as watermark information. Several techniques and solutions have been proposed based on text features, including the number of words, sentences, or letters, the first letter of each word, and the appearance frequency of non-vowel ASCII letters and words [26][27][28][29][30][31][32][33][34]. One text zero-watermarking approach, presented in [26], hides the watermark information within social media content and validates it later in terms of accuracy and reliability. Another text zero-watermarking technique is proposed in [27] to validate the integrity of text data over the Internet of Things. The watermark information is generated from text features such as text size, data appearance frequency, and time of data capture, and is embedded logically in the original content before transmission. In [28], a text zero-watermarking method is developed for individual privacy protection based on measures such as the Hurst exponent and the zero crossings of speech signals. The individual's identity has to be obtained in order to embed it as the watermark key. For copyright protection of English text, text zero-watermarking methods are proposed in [29,30] that use the appearance frequency of non-vowel ASCII letters and words.
Regarding solutions that combine zero-watermarking with other techniques, the methods presented in [31,32] use natural language processing and zero-watermarking for content authentication. These methods extract text features to obtain the probability properties of the text and use them as the watermark key. Reference [33] presents a spatial-domain technique for copyright protection, data security, and content authentication of multimedia images. A robust method based on geometric features is presented in [34] to improve capacity and watermark robustness.

The Proposed Approach
This paper proposes a novel, reliable approach that integrates NLP and text zero-watermarking techniques, in which there is no need to embed extra information such as a watermark key, or even to modify the original text. The hybrid solution adds very little overhead in terms of time complexity, while leaving attackers no way to figure out the working mechanism of the algorithm. The second-level order of the alphanumeric mechanism of the Markov model is used as the NLP technique to analyze the contents of the Arabic text and extract the interrelationship features of these contents. The main contributions of the SAMMZWA approach can be summarized as follows. First, unlike previous work, in which watermarking affects the text, its content, and its size, SAMMZWA embeds the watermark logically without any effect on the text, content, or size. Second, watermarking in SAMMZWA does not need any external information, because the watermark key is produced by analyzing the text and extracting the relationships within the content itself. Third, SAMMZWA is highly sensitive to any simple modification of the text and its meaning in Arabic, which is known to be a complex script, including the Arabic symbols that can change the meaning of an Arabic word. The three contributions mentioned above have so far been found, to some extent, only for images and not for text. This is the vital point concerning the contribution of this paper. In addition, SAMMZWA can effectively determine the location where tampering occurred, which can be considered an advantage over hash-function methods. SAMMZWA has been implemented, simulated using several standard datasets, and compared with baseline approaches under all performance metrics to find which approach gives the best tampering detection accuracy.
Simulation and comparison results prove the accuracy and effectiveness of SAMMZWA approach in terms of tampering detection of unauthorized attacks. Baseline approaches and their execution parameters are presented later in Section 5.1.
The following subsections explain in detail two main processes that should be performed in SAMMZWA, namely watermark generation and embedding process, and watermark extraction and detection process.

Watermark Generation and Embedding Process
The three main sub-processes included in this process are pre-processing, Arabic text analysis and watermark generation, and watermark embedding as illustrated in Fig. 1.

Pre-Processing Process
Pre-processing of the original Arabic text is a key step for the Arabic text analysis and watermark generation process. It removes extra spaces and new lines, and it directly influences the tampering detection accuracy. The original Arabic text (OAT) is required as input to the pre-processing process.
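The pre-processing step can be illustrated with a minimal Python sketch. It assumes that "extra spaces and new lines" means collapsing every run of whitespace into a single space and trimming both ends; the function name `preprocess` is hypothetical and is not taken from the authors' PHP implementation.

```python
import re


def preprocess(oat: str) -> str:
    """Return the pre-processed Arabic text (PAT) for an original text (OAT).

    Assumption: every run of whitespace (spaces, tabs, new lines) is
    collapsed into one space, and leading/trailing whitespace is dropped.
    """
    return re.sub(r"\s+", " ", oat).strip()


# Usage: the same sentence with irregular spacing and blank lines.
pat = preprocess("  بسم   الله \n\n الرحمن  ")
print(pat)
```

Normalizing whitespace before analysis keeps purely cosmetic differences in spacing from changing the extracted states.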

Text Analysis and Watermark Generation Process
This process includes two sub-processes: building the Markov matrix, and text analysis and watermark generation. Building a Markov matrix is the starting point of the Arabic text analysis and watermark generation process using the Markov model. A Markov matrix that represents the possible states and transitions available in a given text is constructed without repetitions. In the SAMMZWA approach, each unique pair of alphanumerics in the given Arabic text represents a present state, and each unique alphanumeric represents a transition in the Markov matrix. While building the Markov matrix, the proposed algorithm initializes all transition values to zero, so that these cells can later keep track of the number of times that the i-th unique pair of alphanumerics is followed by the j-th single alphanumeric in the given Arabic text.
The pre-processing and Markov matrix building algorithm is executed as presented in Algorithm 1,
where OAT is the original Arabic text, PAT is the pre-processed Arabic text, a2_mm is the states and transitions matrix with zero values in all cells, ps refers to the present state, and ns refers to the next state.
According to this algorithm, a method is presented to construct a two-dimensional matrix of Markov states and transitions named a2_mm[i][j], which represents the backbone of the Markov model.
The size of a2_mm[i][j] is dynamic, since it depends on the content and size of the given text. The matrix rows refer to the states, and their number equals the total number of unique pairs of alphanumerics in the given text. The matrix columns refer to the transitions, and their number is fixed at sixty-two possible transitions (twenty-eight letters of the Arabic alphabet, the space character, ten digits "0-9", and twenty-four specific symbols, i.e., . ' " , ; : ? ! / \ @ $ & % * + - = > < [ ]).
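Since the body of Algorithm 1 is not reproduced in the text, the construction of the zero-initialized matrix can only be sketched under assumptions. In the sketch below, the rows are the unique character pairs of the pre-processed text in order of first appearance, and the columns are passed in as a list. Here the column list is derived from the text itself for brevity, whereas the paper uses a fixed transition set. The name `build_markov_matrix` is hypothetical.

```python
def build_markov_matrix(pat, transitions):
    """Build the list of states and a zero-filled states x transitions matrix.

    pat         -- pre-processed text
    transitions -- list of single characters used as column labels
                   (assumption: supplied by the caller)
    """
    states = []
    seen = set()
    # Every pair of adjacent characters is a candidate present state;
    # duplicates are skipped so the matrix has no repeated rows.
    for k in range(len(pat) - 1):
        pair = pat[k:k + 2]
        if pair not in seen:
            seen.add(pair)
            states.append(pair)
    # All cells start at zero and are filled later by the text analysis.
    a2_mm = [[0] * len(transitions) for _ in states]
    return states, a2_mm


# Usage on a tiny sample; columns are the sorted unique characters.
sample = "قال الله"
states, a2_mm = build_markov_matrix(sample, sorted(set(sample)))
print(len(states), "states x", len(a2_mm[0]), "transitions")
```

The number of rows grows with the text, while the number of columns depends only on the chosen transition set, which matches the description of a dynamic matrix with a fixed column count.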
Text analysis and watermark generation process: after the Markov matrix is constructed, natural language processing and text analysis are performed to find the interrelationships between the contents of the given Arabic text and to generate the watermark patterns. In this sub-process, the number of appearances of each possible next transition for each present state (pair of alphanumerics) is calculated and recorded as transition probabilities by Eq. (1),
where trans is the total number of possible transitions, n is the total number of possible states, i is the i-th present state (pair of alphanumerics), and j is the j-th next transition.
Algorithm 1: Pre-processing and building Markov algorithm of SAMMZWA
The following example of an Arabic text sample describes the mechanism of the transition process of present state to other next states.
When using the second level order of alphanumeric mechanism of HMM, every unique pair of alphanumeric is a present state. Text analysis is processed as the text is read to obtain the interrelationship between present state and the next states as illustrated in Fig. 2.
Text analysis is performed using the HMM to extract the features and find the interrelationships between the contents of the given text. For the sample text, 39 unique present states and their 61 possible transitions are represented, as illustrated in Fig. 3.
It is possible to have two or more non-zero values in a row. For instance, assume that "ال" is a present state (a pair of alphanumerics), and the available next transitions are " ". Ten transitions are available in the given Arabic text sample, and the "ث" transition is repeated three times.
The algorithm of text analysis and watermark generation based on the second-level order of the alphanumeric mechanism of the Markov model proceeds as illustrated in Fig. 4. The Arabic text analysis and watermark generation algorithm is presented formally and executed as illustrated in Algorithm 2,
where ps refers to the state of a unique pair of alphanumerics, and ns represents the next state.
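The counting step described above can be sketched as follows. This is an illustration under assumptions rather than the authors' Algorithm 2: the matrix is stored as a nested dictionary instead of a fixed-width array, raw counts are kept rather than normalized probabilities, and the name `count_transitions` is hypothetical.

```python
def count_transitions(pat):
    """Count how often each character pair (ps) is followed by each character (ns).

    Returns a nested dict {ps: {ns: count}}; only observed (non-zero)
    transitions are stored, in order of first appearance.
    """
    a2_mm = {}
    # Slide a window of three characters: two for the state, one for the transition.
    for k in range(len(pat) - 2):
        ps = pat[k:k + 2]
        ns = pat[k + 2]
        row = a2_mm.setdefault(ps, {})
        row[ns] = row.get(ns, 0) + 1
    return a2_mm


# Usage: the pair "ab" is followed once by "c" and once by "d".
print(count_transitions("abcabd"))
```

A state that is followed by several different characters ends up with several non-zero entries in its row, which is the situation described for the "ال" state.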

Watermark Embedding Process
In the proposed SAMMZWA approach, the watermark is embedded logically (digitally) into the given text, without physically placing the watermark inside the text and without altering the original text. This is achieved by extracting the features of the given text and generating the watermark key: all non-zero values in the Markov matrix are found and concatenated sequentially to generate the original watermark pattern a2_WMP_O, as given in Eq. (2) and illustrated in Fig. 5. The watermark embedding process based on the second-level order of the alphanumeric mechanism of the Markov model is presented formally and executed as illustrated in Algorithm 3.
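The collection of non-zero values into a watermark pattern can be sketched as below. Eq. (2) and Algorithm 3 are not reproduced in the text, so the layout is an assumption: the pattern is kept as a mapping from each state to the list of its non-zero values in column order, and states without any transition are left out. The name `generate_watermark` is hypothetical.

```python
def generate_watermark(states, a2_mm):
    """Collect the non-zero cells of the Markov matrix, row by row.

    states -- list of present states (row labels)
    a2_mm  -- list of rows, each a list of transition counts
    Returns {state: [non-zero values in column order]}.
    """
    a2_wmp_o = {}
    for ps, row in zip(states, a2_mm):
        non_zero = [value for value in row if value != 0]
        if non_zero:  # assumption: rows with no transition carry no information
            a2_wmp_o[ps] = non_zero
    return a2_wmp_o


# Usage on a hand-written 2 x 3 matrix.
print(generate_watermark(["ab", "bc"], [[0, 2, 1], [0, 0, 0]]))
```

Because the pattern is derived from the text and stored separately, nothing is written into the text itself, which is what makes the embedding logical rather than physical.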

Watermark Extraction and Detection Process
While the watermark key is kept secret and ready for the detection and verification process, the original watermarked text is shared with others via the Internet. To verify the authenticity of the attacked text (AAT_P), it is collected together with its watermark key a2_EWM_A and compared with the original watermarked text (PAT) and its watermark key a2_WMP_O.
Two sub-processes are involved in this process, which are watermark extraction and watermark detection as illustrated in Fig. 6.

Watermark Extraction Algorithm
The pre-processed attacked Arabic text (AAT_P) is the required input to this sub-process, and the attacked watermark pattern (a2_WMP_A) is its output, as illustrated in Algorithm 4.

Watermark Detection Algorithm
This process aims to verify the authenticity of the attacked text (AAT_P) and to report whether it is authentic or tampered. This is achieved by comparing a2_WMP_A with a2_WMP_O at both the state level and the transition level, as follows. State-level matching is the default matching, which compares the whole patterns a2_WMP_O and a2_WMP_A. The result of this matching is TRUE or FALSE. TRUE indicates that the text is authentic and that no tampering has occurred. FALSE indicates that tampering has been detected, in which case the process continues to transition-level matching. In transition-level matching, each transition in a2_WMP_A is compared with the equivalent transition in a2_WMP_O, as given by Eqs. (3) and (4).
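The two matching levels can be sketched as follows. Eqs. (3) and (4) are not reproduced in the text, so the transition-level comparison here is a simplified stand-in that only reports which positions differ. The pattern layout (a mapping from state to a list of transition values) and the name `detect` are assumptions.

```python
def detect(a2_wmp_o, a2_wmp_a):
    """Compare original and attacked watermark patterns.

    Returns (authentic, mismatches), where mismatches is a list of
    (state, transition_index) positions of the original pattern that
    are missing or different in the attacked pattern.
    """
    # State-level matching: compare the whole patterns at once.
    if a2_wmp_o == a2_wmp_a:
        return True, []
    # Transition-level matching: locate the differing transitions.
    mismatches = []
    for ps, original in a2_wmp_o.items():
        attacked = a2_wmp_a.get(ps, [])
        for t, value in enumerate(original):
            if t >= len(attacked) or attacked[t] != value:
                mismatches.append((ps, t))
    return False, mismatches


# Usage: one changed transition and one missing state.
print(detect({"ab": [2, 1], "bc": [1]}, {"ab": [2, 3]}))
```

Reporting the mismatching positions, rather than a single TRUE or FALSE, is what allows the location of tampering to be narrowed down. States that appear only in the attacked pattern make the state-level check fail but are not listed by this sketch.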
The weight of each state is calculated as given in Eq. (5). The final a2_PMR of AET_P and OAT_P is calculated by Eq. (6),
where N is the total number of non-zero values in a2_mm. The distortion rate of the watermark pattern, denoted by a2_WDR, refers to the detected tampering rate and is calculated by Eq. (7).
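Eqs. (5) to (7) are not reproduced in the text, so the following sketch uses a deliberately simple stand-in and should not be read as the authors' formulas. It assumes that the matching rate is the fraction of the N non-zero transitions of the original pattern that are found unchanged in the attacked pattern, and that the distortion rate is its complement. The state weighting of Eq. (5) is omitted, and the function names are hypothetical.

```python
def matching_rate(a2_wmp_o, a2_wmp_a):
    """Fraction of original transitions that are unchanged (stand-in for a2_PMR)."""
    n = sum(len(values) for values in a2_wmp_o.values())  # N: non-zero values
    if n == 0:
        return 1.0  # nothing to compare, treat as fully matching
    matched = 0
    for ps, original in a2_wmp_o.items():
        attacked = a2_wmp_a.get(ps, [])
        # zip stops at the shorter list, so missing transitions count as unmatched
        matched += sum(1 for o, a in zip(original, attacked) if o == a)
    return matched / n


def distortion_rate(a2_wmp_o, a2_wmp_a):
    """Assumed complement of the matching rate (stand-in for a2_WDR)."""
    return 1.0 - matching_rate(a2_wmp_o, a2_wmp_a)


# Usage: two of the four original transitions survive the attack.
original = {"ab": [2, 1], "bc": [1], "cd": [4]}
attacked = {"ab": [2, 3], "bc": [1]}
print(matching_rate(original, attacked), distortion_rate(original, attacked))
```

Under this assumption, an untouched text gives a matching rate of 1 and a distortion rate of 0, and every altered or missing transition moves the two values apart.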
The results of watermark extraction and detection process of the given sample of Arabic text using SAMMZWA are illustrated in Fig. 7.
As shown in Fig. 7, TP1 represents the first non-zero transition in the given text, TP2 represents the second transition, and so on. Some states have only one transition, which is shown in TP1. However, some states have more than one transition, which are represented in TP1, TP2, and so on, such as the "قف" and "ال" states.

Implementation and Simulation
To evaluate the tampering detection accuracy of SAMMZWA, several simulations and experiments were performed using a self-developed program, various standard datasets, and predefined tampering attacks with various attack volumes, as explained in the following subsections.

Implementation Environment and Setup
A self-developed program was written to test and evaluate the performance of SAMMZWA. The implementation environment of SAMMZWA is: CPU: Intel Core i7-4650U/2.3 GHz, RAM: 8.0 GB, Windows 10 (64-bit), and the PHP programming language with the VS Code IDE.

Simulation and Experimental Parameters
Tab. 1 shows the experimental and simulation parameters, and their associated values, that were used to perform the experiments with the proposed SAMMZWA approach.

Performance Metrics
Tampering detection accuracy refers to the performance of the SAMMZWA approach, which is evaluated using the following metrics. The first is the tampering detection accuracy (a2_PMR and a2_WDR) under all of the attack types and volumes mentioned, where the desired tampering detection accuracy values are close to 100%. The second is the comparison and evaluation of the effect of dataset size, attack type, and attack volume on tampering detection accuracy, using the proposed SAMMZWA approach and the HNLPZWA and ZWAFWMMM baseline approaches.

Simulation, Experiments and Results Discussion with SAMMZWA
In this subsection, the authors evaluate the tampering detection accuracy of SAMMZWA. The letter set covers all Arabic letters, spaces, numbers, and special symbols. Experiments were conducted on datasets of different sizes and under various kinds of attacks with the rates identified in Tab. 1. The simulation and experimental results are shown in tabular form in Tab. 2 and illustrated graphically in Fig. 8.
Tab. 2 and Fig. 8 show that, with SAMMZWA, the highest tampering detection is obtained under the reorder attack for both the large (50%) and very low (5%) attack volumes, because a reorder attack amounts to an insertion and a deletion at the same time. For the mid attack volume, the highest tampering detection is obtained under the deletion attack. This means that SAMMZWA gives its best detection accuracy, and is highly sensitive to tampering, under both deletion and reorder attacks in all scenarios of attack volume.
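The attacks used in such experiments can be simulated with a sketch like the one below. The paper does not state the unit of tampering or how the locations are drawn, so several assumptions are made: attacks operate on whole words, the attack volume is a fraction of the word count, inserted words are copies of existing words, and a reorder is a swap of two randomly chosen words. The function name `attack` and its parameters are hypothetical.

```python
import random


def attack(text, kind, volume, seed=0):
    """Apply a random insertion, deletion, or reorder attack to a text.

    kind   -- "insertion", "deletion", or "reorder"
    volume -- fraction of the words to affect, e.g. 0.05 or 0.5
    seed   -- seed of the private random generator, for repeatable runs
    """
    rng = random.Random(seed)
    words = text.split(" ")
    k = max(1, round(len(words) * volume))  # number of tampering operations
    if kind == "deletion":
        drop = set(rng.sample(range(len(words)), k))
        words = [w for i, w in enumerate(words) if i not in drop]
    elif kind == "insertion":
        for _ in range(k):
            words.insert(rng.randrange(len(words) + 1), rng.choice(words))
    elif kind == "reorder":
        for _ in range(k):
            i, j = rng.sample(range(len(words)), 2)
            words[i], words[j] = words[j], words[i]
    else:
        raise ValueError("unknown attack kind: " + kind)
    return " ".join(words)


# Usage: a 50% deletion attack on a ten-word text.
sample = "w0 w1 w2 w3 w4 w5 w6 w7 w8 w9"
print(attack(sample, "deletion", 0.5))
```

Fixing the seed makes each attacked text reproducible, so that different approaches can be compared on exactly the same tampered inputs.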

Comparison and Result Discussion
In this section, the tampering detection accuracy results of SAMMZWA and the baseline approaches ZWAFWMMM and HNLPZWA are critically analyzed and compared, and the effect of the major factors, i.e., dataset size, attack type, and attack volume, is discussed.

Baseline Approaches
Tampering detection accuracy of SAMMZWA is compared with HNLPZWA (an intelligent hybrid of natural language processing and zero-watermarking approach) and ZWAFWMMM (Zero-Watermarking Approach based on Fourth level order of Arabic Word Mechanism of Markov Model) [35]. Comparison is performed under all performance metrics mentioned above in Sub Section 4.3 to find which approach gives the best accuracy of tampering detection.

Comparison of SAMMZWA with ZWAFWMMM and HNLPZWA Approaches
This subsection compares the tampering detection accuracy of SAMMZWA with that of the ZWAFWMMM and HNLPZWA approaches and studies the effect of the core factors, namely dataset size, attack type, and attack volume.

Comparison and Results Study of Dataset Size Effect
In this subsection, the authors evaluate the effect of the different dataset sizes on tampering detection accuracy against all attack types under their different volumes. Fig. 9 shows a comparison of that effect using SAMMZWA along with the ZWAFWMMM and HNLPZWA approaches.
As shown in the summary of the comparative results in Fig. 9, when the SAMMZWA approach is applied, the dataset sizes that lead to the best tampering detection accuracy are ordered as ASST, AMST, AHMST, and ALST, respectively. This means that tampering detection accuracy increases as document size decreases, and decreases as document size increases. The results also show that the SAMMZWA approach outperforms both the ZWAFWMMM and HNLPZWA approaches in terms of tampering detection accuracy under all scenarios of dataset size.
Fig. 10 shows a comparison of the effect of the different attack types on tampering detection accuracy across all dataset sizes and all scenarios of attack volume. The comparison was performed using SAMMZWA with the ZWAFWMMM and HNLPZWA approaches. As shown in Fig. 10, the SAMMZWA approach outperforms both ZWAFWMMM and HNLPZWA in terms of tampering detection accuracy in all scenarios of deletion and rephrasing attacks. However, the ZWAFWMMM and HNLPZWA approaches outperform SAMMZWA in the case of the insertion attack. This means that the SAMMZWA approach is strongly recommended and applicable for content authentication and tampering detection of Arabic text documents transmitted via the internet in all cases of deletion and rephrasing attacks.
Fig. 11 shows a comparison of the effect of the different attack volumes on tampering detection accuracy across all dataset sizes and all scenarios of attack volume. The comparison was performed using SAMMZWA with the ZWAFWMMM and HNLPZWA approaches.

Comparison and Results Study of Attack Volume Effect
As shown in Fig. 11, the SAMMZWA approach outperforms ZWAFWMMM and HNLPZWA approaches in terms of tampering detection accuracy in all scenarios of low, mid and high volumes of all attacks. This means that SAMMZWA approach is strongly recommended and applicable for content authentication and tampering detection of Arabic text documents under all volumes of all attack types.

Conclusion
In this paper, the SAMMZWA approach has been proposed by integrating zero-watermarking and natural language processing techniques for content authentication and tampering detection of Arabic text transmitted via the internet. SAMMZWA was implemented as a self-developed PHP program in the VS Code IDE, and simulations and experiments were carried out using various standard datasets under different volumes of insertion, deletion, and rephrasing attacks. The SAMMZWA approach has been compared with the ZWAFWMMM and HNLPZWA approaches. The comparison results show that SAMMZWA outperforms ZWAFWMMM and HNLPZWA in terms of tampering detection accuracy under deletion and reorder attacks. SAMMZWA is an efficient approach, although it is designed only for all scenarios of insertion attack. For future work, the authors will consider detection accuracy under all scenarios of deletion and rephrasing attacks. Moreover, the authors also intend to evaluate the tampering detection accuracy using a higher-level order of the alphanumeric mechanism of the Markov model.
Figure 11: A comparison of attack volume effect on tampering detection accuracy using SAMMZWA with ZWAFWMMM and HNLPZWA approaches