Arabic Text Steganography Based on Deep Learning Methods

Steganography is one of the oldest methods for securely transferring secret information between two parties without raising suspicion. Recently, Artificial Intelligence (AI) has become simpler to use and more widely adopted. Since the emergence of natural language processing (NLP), building language models using deep learning has become more common. Because of the importance of concealing secret information in delivered messages, AI techniques and NLP algorithms were employed to conceal secret information within a text cover. The Arabic language was chosen because of its large vocabulary and range of linguistic meanings, and because of one of its most significant features, Arabic poetry. This study proposes a new way to hide secret data inside newly generated Arabic poetry, using a Long Short-Term Memory (LSTM) model trained on existing Arabic poetic texts and a database of Arab poets from the ancient and modern eras, to increase storage capacity by 45 percent. A Baudot code algorithm is used, and the secret data is hidden at the level of letters rather than words; together these increase the linguistic accuracy and the volume of secret data hidden within the generated poetry and eliminate the drawbacks found in previous studies.

mation trolls significantly impact computer network users' fears, data and information cannot be transmitted smoothly, for fear that a third party will intercept them and profit from them. Therefore, the security of data and information transmission has become one of the most important sciences and topics of interest to all Internet users [2].

Steganography is one of the oldest methods for concealing information and sending it without danger of its secret content being discovered, since the content is contained inside a cover that serves as the carrier of the secret text. It is classified, on the basis of the type of cover that carries the message, into four categories: image cover, audio cover, video cover, and text cover [3]. A video cover is a good choice because every second comprises 24 frames at a moderate rate; it is used to hide highly confidential, large data with excellent efficiency [4], and some use the video cover for watermarks that protect the copyrights of digital videos [5]. Text cover is considered one of the oldest methods used to conceal confidential texts and information. The first recorded use of the term steganography was in 1499 by Johannes Trithemius in his book Steganographia, a treatise on encryption and concealment disguised as a book on magic [6]. Text cover is one of the most difficult ways to hide information compared to the others (image, audio, and video), because it lacks their large storage capacity.

The writing method and alphabetic letters utilized in this study are based on one of the most extensively used scripts in the world, the Arabic alphabet.

The associate editor coordinating the review of this manuscript and approving it for publication was Jerry Chun-Wei Lin.
There are over 31 living languages that use the same Arabic script, such as Persian, Urdu, Ottoman Turkish, and so on.

languages are used because they contain the characteristics of multiple characters, which help in the process of hiding important information within the text cover. Among these characteristics are the dotted letters that characterize these two languages, whereas English has only two dotted letters, i and j [9].

1.1. Some studies use the text as a cover to conceal a secret text based on the dots of the letters. The dots are displaced by producing a new font in which they are shifted up or down by a small amount, 1/300 of an inch. For letters with more than one dot, all dots are displaced simultaneously. The secret text is converted to binary and hidden bit by bit: if the hidden bit has the value 1, the dots of the chosen letter are moved, but if the hidden bit has the value 0, the letter remains unchanged.

1.2. This study is in line with the previous research on using letter dots to hide the secret text inside the cover text, but it hides 2 bits in each hiding step by moving the letter dots horizontally and vertically. No change is made to the letter if the bits are 00. For the bits 01, a slight horizontal space is added between the dots of the same letter. For the bits 10, the letter dots are moved vertically by a slight space. For the bits 11, the dots are moved in both directions (horizontal and vertical

a value of 0, a kashida is added, and when it has a value of 1, no modification is made [18], [19].

6. Pseudo connection character: Persian and Arabic letters are distinctive in that each letter has four different written forms depending on its position in the word. A letter that appears on its own has a different shape than when it appears at the beginning, middle, or end of a word, and each of those positions has its own shape as well, as seen in Table (5):

6.1. The Persian and Arabic languages need the Zero-Width Joiner (ZWJ), which is used to connect letters with each other in complex texts and has the code point U+200D. They also need the Zero-Width Non-Joiner (ZWNJ), which is used to separate letters and has the code point U+200C. Neither has a visible effect on the letters it is placed between, and each is a non-printing character. This feature can be used to hide confidential data inside the text cover: the text to be hidden is converted into binary, and each bit is tested in turn. If the bit has a value of 1, a ZWJ is inserted between the current letter and the one that follows. If the bit has a value of 0, nothing is changed in the text cover. This process continues until all the bits are hidden [27], [28].

but if it carries a value of 0, a pseudo-space will be added after the non-dotted letter [33].

(7). After the secret text is converted into binary, if the bit to be hidden has a value of 1 and the current letter is from group A, preceded by a letter from group A, one kashida is inserted between them. The same process is used when the current letter is from group B, preceded by a letter from group B.
But when the current letter is from group AB, preceded by a letter from group A, B, or AB, it is replaced with the same letter from the group (ISO-8859-6). If the secret bit equals 0, no change is made [34].

7.4. In the Arabic language, letters are divided according to their pronunciation into two types, Solar and Lunar letters, as shown in Table (8). Researchers took advantage of this feature to hide confidential data inside the Arabic text cover. The method combines decoration (diacritics) with changing the code value of the letters. One researcher uses two types of hiding. In the first type, to hide a secret bit of 1, he takes a word that begins with the two letters ( ) followed by a Solar letter and substitutes the independent letter ( ) with the same letter from another Unicode code point. To hide a secret bit of 0, he searches for a word that begins with the two letters ( ) followed by a Lunar letter and replaces the independent letter ( ) with the same letter under another Unicode code point.

As for the second type, the researcher hides two bits at a time. To hide the bits 00, he searches for a word that begins with the two letters ( ) followed by a Lunar letter, replaces the letter ( ) with another Unicode code point, and adds the decoration (Fatha ) to the Lunar letter. To hide the bits 01, he searches for a word that begins with the two letters ( ) followed by a Lunar letter, replaces the letter ( ) with another Unicode code point, and adds any decoration except the Fatha ( ). To hide the bits 10, he searches for a word that begins with the two letters ( ) followed by a Solar let-

VOLUME 10, 2022

Letter frequency is simply the average number of times each letter of the alphabet appears in a written language. Letter frequency analysis goes back to the Arab mathematician Al-Kindi, who formally developed the method to break ciphers. It gained importance in Europe with the development of movable type, where one must estimate how much type is required for each letter. Letter frequency analysis is also a basic method of language identification used by linguists. It is particularly useful in determining whether an unknown writing system is alphabetic, syllabic, or ideographic.

The use of letter frequencies and frequency analysis plays a fundamental role in ciphers and many puzzle games, including Hangman, Scrabble, and the TV game show Wheel of Fortune. One of the earliest descriptions in classical literature of applying knowledge of English letter frequency to solving a cipher is found in Edgar Allan Poe's famous story The Gold-Bug, in which the method is successfully applied to decipher a message giving the whereabouts of a treasure hidden by Captain Kidd [9].

The frequency of characters in text has been studied for use in cryptanalysis, and in frequency analysis in particular, since the method was formally developed. (Ciphers breakable by this technique date back at least to Julius Caesar's Caesar cipher, suggesting that the method may have been explored in classical times.)

The "first twelve" characters make up about 80% of the total usage. The "first eight" characters make up about 65% of the total usage. Letter frequency as a function of rank can be fitted by many rank functions, with the Cocho/Beta rank function being the best. Another rank function without an adjustable free parameter also fits the letter frequency distribution reasonably well, as shown in Fig (1).
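The ranking and cumulative-share figures discussed above can be reproduced for any corpus with a few lines of Python. The following is a minimal sketch, not the procedure used in this study; the function names `letter_frequencies` and `top_k_share` are hypothetical, and the only assumption about the input is that it is a Unicode string.

```python
from collections import Counter


def letter_frequencies(text):
    """Return (letter, relative frequency) pairs, most frequent first."""
    # Keep alphabetic characters only; digits, spaces and punctuation are ignored.
    letters = [ch for ch in text.lower() if ch.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return [(ch, n / total) for ch, n in counts.most_common()]


def top_k_share(freqs, k):
    """Share of total usage covered by the k most frequent letters."""
    return sum(share for _, share in freqs[:k])


if __name__ == "__main__":
    sample = "frequency analysis ranks the letters of a text by how often they occur"
    freqs = letter_frequencies(sample)
    print(freqs[:5])
    print(top_k_share(freqs, 8), top_k_share(freqs, 12))
```

Running the same functions on a large Arabic or English corpus would give the "first eight" and "first twelve" shares for that corpus; the short sample string here is only for demonstration and its shares are not representative.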
2. The second step is to update the old cell state C_{t−1} to the new cell state C_t. The first step has already decided how much of C_{t−1} to keep or forget. The old state is scaled accordingly, and the product of the input gate i_t and the candidate values C̃_t is then added to it, so i_t and C̃_t must be computed first, as shown in Equations 4 and 5.
Now we can calculate the cell state output, as shown in Equation (6).
3. The final step is to determine what output is required based on the filtered cell state. A sigmoid layer decides which parts of the cell state to output and which to ignore. The cell state is passed through tanh to map its values into the interval (−1, +1) and is then multiplied by the output of the sigmoid gate, so that only the required parts are output, as shown in Equations 7 and 8.
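The equations referred to in steps 2 and 3 are not reproduced in this text. For orientation, the standard LSTM formulation is given below; the mapping to equation numbers 4 to 8 is an assumption based on the order of the prose, and the symbols W, b, x_t, h_t and f_t (the forget-gate output from the first step) follow the usual convention rather than anything stated here.

```latex
\begin{align}
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \tag{4} \\
\tilde{C}_t &= \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) \tag{5} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{6} \\
o_t &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \tag{7} \\
h_t &= o_t \odot \tanh(C_t) \tag{8}
\end{align}
```

Here \sigma is the logistic sigmoid and \odot denotes element-wise multiplication.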
To build the LSTM model, we use the Sequential model with an Embedding layer, an LSTM layer, and a Dense layer, and then train the model.

The embedding layer is defined as the first hidden layer of the network. It must specify three arguments: the input dimension, the output dimension, and the input length.

LSTM Layer: First, we provide the number of nodes in the hidden layers within the LSTM cell. We will use 128 hidden layer units, as shown in Fig (3).

from the first phase one word at a time, allowing each word a possibility to be learned from the 100 words that preceded it. To make predictions with the Keras LSTM model, we first start with a seed sequence of words as input, generate the next word, and then update the seed sequence by adding the generated word to the end and trimming off the first word. This process is repeated for as long as we want to generate new words, for example for a sequence of 1000 words. The LSTM network has 2 layers with 128 nodes per layer, followed by a 128-node dense layer. The results are then passed through the softmax function; training uses a batch size of 16 and 20 epochs. The seed words are entered to generate the new words, as shown in Figure (5).

The third phase is one of the most important phases of the system, in which a cover text carrying hidden data is generated. Two groups of letters carrying the hidden bits are defined: the first set (e, r, o, n, l, u) represents a bit with value 1, and the second set (a, i, t, s, c, d) represents a bit with value 0. After the secret text to be hidden is entered, it is compressed and encoded with a 5-bit Baudot code to reduce its size by 45%. Then each generated word is tested letter by letter. A letter is accepted if it belongs to the set representing 1 and the bit to be hidden has value 1, or if it belongs to the set representing 0 and the bit to be hidden has value 0. A letter that belongs to neither group is considered neutral and is used without comparison with the bit to be hidden. But if the tested letter and the bit to be hidden do not match, the word is deleted and a new word is generated. This process is repeated until all the secret text is hidden inside the cover text [45], as shown in Figure (6).

Poetry was an Arab means of communication in the pre-Islamic era, and a tribe used to celebrate when one of its sons was a talented poet. Poetry was used among Arabs to raise the status of one tribe and degrade another. In the early days of Islam, poetry was one of the means of defending the message of Islam against the polytheists of Quraysh. During the Umayyad and Abbasid eras, poetry was also a means for the conflicting political and intellectual groups to communicate their opinions and defend their principles in the face of their opponents.

Thus, Arab poetry had a prominent role in literary, intellectual, and political life. Poetry developed along with the Arab and Islamic peoples and with their relations with other peoples. New forms emerged, such as descriptive poetry, political poetry, mystical poetry, social and national poetry, and modern contemporary poetry, which differ in substance, style, and language, as well as in meters (weights), rhymes, and other factors.

All these features made Arabic poetry widely circulated among people, so it is used here to carry the secret text. The Arabic letters were divided into two groups based on their frequency in the Qur'an texts. Each group has 9 letters, and the groups have equal frequencies. The first group carries a bit of value 0, and the second group represents a bit of value 1. The remaining 10 letters are considered neutral, completing the 28 letters of the Arabic alphabet.

When words are generated, they are tested one by one. If a word has fewer than four letters, it is added to the cover text without comparison against the bits of the secret text. When the generated word contains more than three letters, its letters are compared sequentially with the sequence of bits to be hidden. If they match, the word is added to the cover text; if they do not match, the word is excluded and a new word is generated. This process continues until all the secret bits have been hidden, as shown in Algorithm (1).
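The word-testing loop can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not Algorithm (1) itself: the Arabic letter groups are not listed in this text, so the English sets given for the third phase, (e, r, o, n, l, u) for 1 and (a, i, t, s, c, d) for 0, stand in for them; the `candidates` iterable stands in for words produced by the LSTM generator; and the names `word_bits`, `embed`, and `extract` are hypothetical. A word that would carry more bits than remain to be hidden is treated as a mismatch, which is a simplification chosen so that extraction recovers exactly the hidden bits.

```python
ONE_LETTERS = set("eronlu")   # letters that carry bit 1 (English sets from the text)
ZERO_LETTERS = set("aitscd")  # letters that carry bit 0
MIN_CODED_LEN = 4             # words shorter than this are never compared


def word_bits(word):
    """Bits carried by a word: one per non-neutral letter, none for short words."""
    if len(word) < MIN_CODED_LEN:
        return ""
    bits = []
    for ch in word:
        if ch in ONE_LETTERS:
            bits.append("1")
        elif ch in ZERO_LETTERS:
            bits.append("0")
        # any other letter is neutral and carries no bit
    return "".join(bits)


def embed(secret_bits, candidates):
    """Keep candidate words whose letters match the next secret bits in order."""
    cover, pos = [], 0
    for word in candidates:
        if pos == len(secret_bits):
            break  # every bit has been hidden
        bits = word_bits(word)
        # Accept only if the word's bits equal the next len(bits) secret bits.
        if secret_bits.startswith(bits, pos):
            cover.append(word)
            pos += len(bits)
        # otherwise the word is discarded and the next candidate is tried
    if pos != len(secret_bits):
        raise ValueError("candidate words ran out before all bits were hidden")
    return cover


def extract(cover):
    """Recover the hidden bit string from the cover words."""
    return "".join(word_bits(word) for word in cover)


if __name__ == "__main__":
    words = ["moon", "the", "jump", "bath", "hymn", "glow"]
    cover = embed("1001", words)
    print(cover, extract(cover))
```

In this run "moon" is rejected because it would carry 111, "the" passes through as a short word, and "jump", "bath", and "hymn" carry 1, 00, and 1 respectively. With a real generator, a rejected word would be replaced by a freshly generated one rather than by the next item of a fixed list.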