A Robust Text Coverless Information Hiding Based on Multi-Index Method

Recently, researchers have shown that coverless information hiding technology can effectively resist existing steganalysis tools. However, the robustness of existing coverless text information hiding methods is generally poor. To solve this problem, we propose a robust text coverless information hiding method based on multi-index. Firstly, the sender segments the secret information into several keywords. Secondly, we transform the keywords into keyword IDs through the word index table and introduce a random increment factor to control the keyword order. Then, we search the big data text for all texts containing the keyword IDs and use the robust text search algorithm to find multiple texts. Finally, these texts are converted into mixed indexes and sent to the receiver. The receiver disassembles the received indexes through the index construction protocol and uses the random increment factor to extract the secret information. Experimental results show that this method improves the concealment and security of secret information and has strong robustness compared with state-of-the-art methods.

encoding, extensive data, small space occupation, and frequent use, so the text has good concealment and superior research value. At the same time, text information hiding technology has attracted the attention and interest of many researchers because of its great value in wireless transmission [5], secret communication, copyright issues, and other aspects. Besides, with the development of text information hiding and automatic text generation technology [6], text steganalysis technology is also constantly developing, which has brought a serious threat to information hiding.
As a natural carrier, text is produced in trillions of items every day, which makes it exceptionally suitable as an information hiding carrier. There are two main types of coverless text information hiding methods: the search method [7] and the generation method [8,9]. Both are developing rapidly, but they are limited by the development of natural language processing technology. When the secret information is long, the search method is complex to implement, and the generation method may suffer from semantic ambiguity, sentence failure, poor readability, and other problems. Moreover, both have insufficient embedding capacity. Zhang et al. [10] proposed building a text database and using the word level and frequency of the secret information to match in the database and find an appropriate text to send. Although this method does not need to modify the carrier, which reduces the possibility of being attacked, the embedding rate still needs improvement. Lu et al. [11] proposed a coverless question camouflage method combined with random codes, which uses secret information to generate a camouflaged form of an exam question. This method avoids the direct transmission of secret information, reduces the possibility of being discovered, and improves hiding capacity. Mo et al. [12] embed information in HTML documents by inserting invisible characters (such as spaces and tabs) in web pages, but detection resistance is poor. Synonym replacement [13,14] improves the hiding capacity and success rate and is currently the best semantic-based coverless information hiding algorithm. Zhao et al. [15] proposed using high-frequency function words in Chinese to hide information, which has a low hiding success rate. Liu et al. [16] proposed extracting all the components of Chinese characters and increasing the information hiding capacity by using part of speech to hide the number of keywords. Long et al. [17] proposed a text coverless information hiding method based on word2vec, which uses word2vec to obtain similar keywords. When text retrieval fails, a keyword can be replaced by a similar keyword, which increases the hiding success rate and slightly increases the hiding capacity. Long et al. [18] also proposed using Web text to hide information, but this approach is significantly unstable: it depends on real-time web pages, so its hiding success rate is volatile. Although researchers have proposed many information hiding methods, their security and robustness still struggle to meet actual needs.
To solve the above problems, we propose a robust text coverless information hiding method based on multi-index. The main contributions of this work can be summarized as follows:
1. We propose a multi-index secret information transmission method. In this method, a piece of secret information can generate multiple indexes. Even if a third party breaks one index or part of the carrier is lost, the receiver can still extract the secret information from the remaining indexes, which significantly improves robustness.
2. The secret information can be extracted by recombining its multiple groups. We use random increment factors to control the order of the keywords, so the secret information can be extracted accurately.
3. We use a multi-index robust method to extract multiple sets of secret information. Whether the carrier has been attacked can be judged by evaluating the consistency of the groups of secret information, which improves security.
Related Work

HanLP and TF-IDF
Recently, image and text processing technologies have developed rapidly [19][20][21][22]. How to accurately segment sentences into words has been a research hotspot in natural language processing [23]. Word segmentation is the process of recombining consecutive character sequences into word sequences according to certain specifications. To promote natural language processing in production environments, HanLP was proposed, a Java toolkit composed of a series of models and algorithms. It features a clear architecture, an up-to-date corpus, complete functions, customizability, and efficient performance. HanLP's word segmentation rate can reach 20 million words per second in extreme speed mode.
After text segmentation, it is often necessary to analyze the words in the text. In natural language processing, the most commonly used methods are word frequency statistics and TF-IDF (term frequency-inverse document frequency) feature extraction. TF-IDF is a weighting technique widely used in information retrieval and text mining. As a statistical method, TF-IDF evaluates the importance of a word to a document within a document set or corpus. The core idea is that if a word or phrase appears with a high frequency (TF) in one article and rarely appears in other articles, it has good discriminating power and is suitable for classification.
The text segmentation and the calculation of the TF-IDF features are essential, mainly for preparing the subsequent topic model clustering.
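As a rough illustration of the TF-IDF weighting described above, the following minimal sketch computes one common variant (relative term frequency times the logarithm of the inverse document frequency). The function name `tf_idf` and the toy documents are assumptions made for illustration; the paper does not state which TF-IDF variant or smoothing it uses.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: tf-idf} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            # Relative term frequency times log inverse document frequency.
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return scores


# Toy, already segmented documents (illustrative only).
docs = [
    ["text", "hiding", "text", "index"],
    ["image", "hiding", "index"],
    ["text", "topic", "index"],
]
weights = tf_idf(docs)
```

A word that occurs in every document, such as "index" here, receives a weight of zero, while a word that is frequent in one document and rarer elsewhere receives a higher weight.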

Text Topic Distribution
LDA (Latent Dirichlet Allocation) is a document topic generation model, also known as a three-layer Bayesian probability model, with a three-layer structure of documents, topics, and words. LDA has achieved great success in text topic clustering and mining by introducing hyperparameters that control the model parameters at the feature word layer, text collection layer, and topic layer [24]. In recent years, scholars have begun to apply the LDA topic model on big data platforms. Spark is one of the leading big data platforms, and its distributed in-memory architecture makes it run 5 to 150 times faster than traditional Hadoop. The Spark platform provides LDA topic model clustering based on two implementations, online and EM. The EM-based LDA topic clustering method relies on the graph computing module in Spark and is suitable for parallel computing on a cluster. Fig. 1 shows a schematic diagram of EM LDA topic clustering on the Spark platform. The primary process is to collect the data source, clean and segment the text on the Spark platform, calculate the TF-IDF features of the words in the text, input these features into the LDA topic model for training, and finally obtain the text topic distribution.

The Proposed Multi-Index Robust Approach
In this section, we detail the proposed multi-index robust mechanism, which is mainly composed of six parts:
1) Robust text coverless information hiding framework based on multi-index.
2) Keyword sequence control.
3) Text search containing keywords.
4) Multi-index robust text search algorithm.
5) Robust information hiding algorithm.
6) Robust information extraction algorithm.
The notation used in this paper is shown in Tab. 1. Fig. 2 shows the proposed framework, which comprises six steps:
1) Construct the codebook and preprocess the text.
2) Segment the secret information into keywords.
3) Convert the keywords into keyword ids and introduce a random increment factor to control the keyword sequence.
4) Find all texts containing the keyword ids in the codebook.
5) Find multiple sets of indexes through the robust text search algorithm and send them to the receiver.
6) The receiver disassembles the indexes through the index construction protocol and extracts the secret information by deduplicating and ordering with the increment factor.

Robust Text Coverless Information Hiding Framework Based on Multi-Index Method
The method requires establishing a text-topic distribution index, a global word index, and a text-word TF-IDF codebook. The word index comprises all the words in the text database, their word frequency ranking, and the corresponding word frequency. The text index includes the topic clustering distribution of the text and the text tag number, which is used to label the text containing the secret information.

Tab. 1: Notation
TW(id): Find all texts containing the id
label: Best hidden text label
FI(label): Convert label to text topic distribution
FTF(id, label): Find the word TF-IDF value and word frequency according to word id and label
bws{}: The best text collection containing secret information
Tws{}: Secret keyword collection contained in the ciphertext
Mts{}: All text collections containing keywords

Keyword Sequence Control
In this paper, a random increment factor is introduced to extract the secret information. Its primary functions are as follows:
1. The multi-index robust method adopted in this paper extracts multiple sets of secret information. The random increment factor is used to deduplicate these multiple, unordered sets.
2. Since the secret information extracted in the first step is out of order, the increment factor is used to reorganize the keywords in the correct order.
An example of random increment factors is shown in Fig. 3. Suppose we need to hide the secret information "文本无载体信息隐藏." First, we segment the secret information into "文本, 无, 载体, 信息, 隐藏", and the generated random increment factors are "11, 34, 55, 65, 236". We use four sets of indexes to hide the secret information. The secret information extracted from index one is "信息无隐藏载体," from index two "文本信息无," from index three "隐藏信息文本," and from index four "载体文本." Each keyword carries its random increment factor. The keywords are sorted and deduplicated by random increment factor, and the secret information finally obtained is "文本无载体信息隐藏." In addition, double-layer random control is used in the random increment control mechanism to ensure better randomness. Each random number must be larger than the one before it. The algorithm is as follows: initialize the random number R, take the remainder of R, and then generate the corresponding random increment factor. Details are given in Algorithm 1.
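The deduplication and reordering in the example above can be sketched as follows. This is a simplified illustration, not the paper's Algorithm 1: `increment_factors` is a hypothetical single-layer generator that only guarantees strictly increasing values (the paper uses double-layer random control), and `recover` is an assumed helper name.

```python
import random


def increment_factors(count, step=10, seed=None):
    """Strictly increasing random factors, one per keyword (hypothetical scheme)."""
    rng = random.Random(seed)
    factors, last = [], 0
    for _ in range(count):
        last += rng.randint(1, step)  # always move forward by at least 1
        factors.append(last)
    return factors


def recover(indexes):
    """Merge (factor, keyword) pairs from all indexes, deduplicate, and sort."""
    by_factor = {}
    for index in indexes:
        for factor, keyword in index:
            by_factor[factor] = keyword  # duplicates collapse onto one key
    return "".join(by_factor[f] for f in sorted(by_factor))


# Keywords and factors taken from the example in the text.
tagged = {"文本": 11, "无": 34, "载体": 55, "信息": 65, "隐藏": 236}


def tag(words):
    """Attach the increment factor to each keyword."""
    return [(tagged[w], w) for w in words]


indexes = [
    tag(["信息", "无", "隐藏", "载体"]),  # index one
    tag(["文本", "信息", "无"]),          # index two
    tag(["隐藏", "信息", "文本"]),        # index three
    tag(["载体", "文本"]),                # index four
]
secret = recover(indexes)
```

Because every keyword keeps the same factor in every index, a dictionary keyed by factor removes the duplicates, and sorting the keys restores the original keyword order.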

Text Search with Keywords
To improve the robustness of text coverless information hiding, it is necessary to query all texts containing the secret keywords. The steps are as follows: 1) Segment the secret information into keywords. 2) Convert the keywords into keyword ids. 3) Find all texts containing the keyword ids in the codebook.
The secret information I is segmented into keywords, as shown in Eq. (1):
$$I = \{w_1, w_2, \ldots, w_k\} \tag{1}$$
where $w_i$ $(1 \le i \le k)$ represents a keyword segmented using HanLP.
We convert the segmented keywords into keyword ids through the word index CV, as shown in Eq. (2):
$$w_{id-i} = CV(w_i) \tag{2}$$
where $w_{id-i}$ represents the word id.
Algorithm 2 is the text search algorithm designed in this paper. It mainly consists of the following steps: loop through all the keywords; in each iteration, find all the texts that contain the current secret keyword and, at the same time, generate a random increment number for that keyword. Finally, return all texts containing secret keywords.

Algorithm 1 (continued)
  Case 0: generate a random number $r_0 \in [1, (q+1) \times 10]$
  Case 1: generate a random number $r_1 \in [(q+1) \times 10, (q+2) \times 10]$
  …
  Case N-1: generate a random number $r_{N-1} \in [(q+N-1) \times 10, (q+N) \times 10]$
  $w_i = R + r_q$
  Return $w_i$
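The text search described in this section can be sketched with an inverted index, as below. This is a minimal illustration under assumptions: the dictionary `word_index` stands in for the word index CV, the inverted index lookup stands in for TW, and the text labels and word ids are made up. The paper's codebook on Spark is not reproduced here.

```python
def build_inverted_index(texts):
    """Map each word id to the set of text labels that contain it."""
    inverted = {}
    for label, word_ids in texts.items():
        for word_id in word_ids:
            inverted.setdefault(word_id, set()).add(label)
    return inverted


def find_texts(keywords, word_index, inverted):
    """Return, for each keyword, the labels of all texts containing it."""
    matches = {}
    for word in keywords:
        word_id = word_index[word]                    # plays the role of CV(w_i)
        matches[word] = inverted.get(word_id, set())  # plays the role of TW(id)
    return matches


# Hypothetical word index and text database (labels mapped to word ids).
word_index = {"文本": 0, "无": 1, "载体": 2}
texts = {"t1": [0, 1], "t2": [1, 2], "t3": [0, 2]}

inverted = build_inverted_index(texts)
mts = find_texts(["文本", "载体"], word_index, inverted)
```

Building the inverted index once makes each keyword lookup a single dictionary access instead of a scan over every text.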

Robust Multi-index Search Algorithm
The multi-index robust algorithm is implemented based on a greedy algorithm. It is an optimization process over all the ciphertexts found: each time, the best hidden text is selected so that the number of hidden texts is as small as possible. The algorithm traverses all texts to find the one that contains the most secret keywords; a comparison is made for every text visited to determine which text contains the most secret information so far, and the loop continues until all the keywords are found. The multi-index robust algorithm designed in this paper is shown in Algorithm 3.

Algorithm 2: Find the texts that contain secret information keywords
Input: $I = \{w_1, w_2, \ldots, w_k\}$
Output: all texts that contain secret information keywords: Mts
  All text collections containing keywords: Mts = null
  for i = 1 to length(I) do
    Convert the keyword to its word id: $w_{id-i} = CV(w_i)$
    Generate a random increment factor for each $w_i$
    Text collections containing the word: $Mts_i = TW(w_{id-i})$
    Append to all text collection: Mts.append($Mts_i$)
  end for
  return all texts that contain secret information keywords: Mts

Robust Information Hiding Algorithm
After finding the multi-index text based on the above method, this section focuses on the robust information hiding algorithm. This method mainly includes the following steps:
1) According to Eq. (1), the secret information I is segmented into several keywords $w_i$.
2) Each keyword $w_i$ is converted into the corresponding keyword id according to Eq. (2), and a random increment factor is generated for each keyword to ensure that the receiver can extract the keywords in order.
3) Find all text collections containing the keywords, defined as all_text, as shown in Eq. (3).
4) Find multiple sets of best texts according to the multi-index robust text search algorithm, defined as best_text. For best_text, the secret keywords contained in each corresponding text can be constructed, which are recorded as SCRET_WORDS. Then use Algorithm 3 to find the best hidden text label, text_label.
5) According to the text index codebook, the best text_label is converted into Spare, a multi-group text topic index distribution, as shown in Eq. (4).
6) Find the corresponding word frequency and TF-IDF feature according to the text-word TF-IDF codebook, as shown in Eq. (5), where $word_{tf}$ represents the keyword TF-IDF value and $word_{count}$ represents the word frequency.
7) The TF-IDF feature index consists of $word_{count}$, $word_{tf}$, and the random increment number of each keyword, and is recorded as IDFIndex. Generate multiple sets of IDFIndex and Spare and send them to the receiver.

Algorithm 3: Find the best-hidden text
Input: $I = \{w_1, w_2, \ldots, w_k\}$
Output: multiple best texts that contain secret messages: bws
  Multi-index collection to return: bws = {}
  for i ← 1 to n  // n is the number of index sets
    Keywords still to be covered by this index: rest = I
    Set of texts with the most keywords: $bts_i$ = null
    while rest != null:
      Current best hidden text: bts = null
      Keywords covered by the current best text: wsc = null
      for text, words in Tws:
        Take the intersection: covered = rest ∩ words
        if length(covered) > length(wsc):
          bts = text
          wsc = covered
      Remove included words: rest -= wsc
      Add to the best text collection: $bts_i$.add(bts)
    Add to multi-index collection: bws.append($bts_i$)
    Remove the best text collection found: Tws = Tws.remove($bts_i$)
  return multiple best texts that contain secret messages: bws
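The greedy selection in Algorithm 3 can be sketched in Python as follows. This is an illustrative reading, not the authors' implementation: the names `best_texts` and `multi_index` and the toy data are assumptions, each index works on its own copy of the keyword set, and the loop stops early if no remaining text covers the keywords that are left.

```python
def best_texts(keywords, text_words):
    """Greedy cover: repeatedly pick the text holding the most uncovered keywords."""
    remaining = set(keywords)
    chosen = []
    while remaining:
        label, covered = None, set()
        for text, words in text_words.items():
            hit = remaining & words
            if len(hit) > len(covered):
                label, covered = text, hit
        if label is None:  # nothing covers what is left; stop instead of looping
            break
        chosen.append(label)
        remaining -= covered
    return chosen


def multi_index(keywords, text_words, n):
    """Build up to n disjoint sets of covering texts, one set per index."""
    pool = dict(text_words)
    result = []
    for _ in range(n):
        chosen = best_texts(keywords, pool)
        if not chosen:
            break
        result.append(chosen)
        for label in chosen:  # texts already used are removed from the pool
            del pool[label]
    return result


# Toy data: text label mapped to the secret keywords it contains.
text_words = {
    "t1": {"a", "b", "c"},
    "t2": {"c", "d"},
    "t3": {"a", "b"},
    "t4": {"c", "d"},
    "t5": {"d"},
}
bws = multi_index({"a", "b", "c", "d"}, text_words, n=2)
```

With this toy data the first index is built from t1 and t2 and the second from t3 and t4, so losing the texts of one index still leaves a complete second copy of the keywords.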

Robust Information Extraction Algorithm
The receiver disassembles the indexes according to the index construction protocol and then deduplicates and sorts the keywords to extract the secret information. The steps of the robust information extraction algorithm are as follows:
1) Disassemble the index. The receiver disassembles the mixed index containing multiple sets of secret information to obtain Spare and IDFIndex.
2) Obtain the hidden text label. Using Eq. (4), the receiver obtains the text_label of the hidden text based on the topic distribution index.
3) Obtain the topic distribution. Using Eq. (5), convert the text_label to $word_{count}$ and $word_{tf}$.
4) Get the keyword. The receiver obtains the keyword id through $word_{count}$ and $word_{tf}$, then converts it into a keyword.
5) Reorganize and extract the information. Finally, extract the secret information through the random increment factor.

Experimental Environment
In the experiment, the corpus is mainly from the Chinese corpus of Sogou Lab, which is divided into six categories: social, sports, tourism, education, culture, and military. In addition, multiple sets of experimental results verify the method proposed in this article. The experiment adopts a distributed structure: the development environment is IntelliJ IDEA on a personal PC, the codebook is placed on the two computing nodes of Spark, and the work runs on the personal PC.

The Evaluation Indicator
In this paper, the experiment follows the algorithm of Long et al. [18]. The test data are the 120 texts provided by reference [18]. We divide these texts into groups of 1k to 6k, 120 in total, with words ranging from 1 to 2000. The text carrier comes from the Sogou Lab news data set.
1) Hiding success rate. Defined as the ratio of the successfully hidden information to the secret information and denoted by $P_i$, as shown in Eq. (6):
$$P_i = \frac{x}{X} \tag{6}$$
where x represents the actual number of hidden characters and X represents the number of characters that need to be hidden.
2) Average hiding success rate. Because this paper focuses on improving robustness, 5%, 10%, 15%, 20%, 25%, 30%, and 35% of the carriers are randomly lost to test the recovery rate of the secret information, and the average value is calculated for each loss case. The average of all $P_i$ is used as the average hiding success rate, denoted by $P_r$, as shown in Eq. (7):
$$P_r = \frac{\sum_{i=1}^{120} P_i}{120}, \quad i = 1, 2, \ldots, 120 \tag{7}$$
where $P_i$ represents the hiding success rate of each text.
3) Overall hiding success rate. This paper sets seven loss cases, obtains the average hiding success rate for each of them, and calculates the mean. SR denotes the overall hiding success rate, as shown in Eq. (8):
$$SR = \frac{\sum_{i=1}^{n_r} P_{ri}}{n_r}, \quad i = 1, 2, 3, \ldots, n_r \tag{8}$$
where $P_{ri}$ represents the average hiding success rate when 5%, 10%, 15%, 20%, 25%, 30%, and 35% of the carrier is lost, respectively, and $n_r$ is the number of loss cases. For example, $P_{r1}$ corresponds to a 5% carrier loss and $P_{r2}$ to a 10% carrier loss.
IASC, 2021, vol.29, no.3
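The three indicators in Eqs. (6)-(8) are plain averages, which the following minimal sketch makes explicit. The function names and the character counts are made up for illustration and are not the paper's data.

```python
def hiding_success_rate(hidden, required):
    """P_i: share of the required characters that were actually hidden (Eq. 6)."""
    return hidden / required


def average_rate(rates):
    """P_r: mean of P_i over all test texts for one loss case (Eq. 7)."""
    return sum(rates) / len(rates)


def overall_rate(per_loss_rates):
    """SR: mean of P_r over all carrier-loss cases (Eq. 8)."""
    return sum(per_loss_rates) / len(per_loss_rates)


# Made-up (hidden, required) counts for three texts under two loss cases.
loss_5 = [hiding_success_rate(x, X) for x, X in [(95, 100), (190, 200), (300, 300)]]
loss_10 = [hiding_success_rate(x, X) for x, X in [(90, 100), (180, 200), (300, 300)]]

p_r = [average_rate(loss_5), average_rate(loss_10)]
sr = overall_rate(p_r)
```

In the paper's setting, `average_rate` would run over 120 texts and `overall_rate` over the seven loss cases.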

Robustness Comparison
To make the experiment more convincing, we calculate the average of the 120 text hiding success rates when the carrier randomly loses 5%, 10%, 15%, 20%, 25%, 30%, and 35%. We also compared this method with the reference method [18] (Long's method). The result is shown in Fig. 4.
From Fig. 4, we can see that the hiding success rate of this paper is much higher than that of reference [18] (Long's method) under different carrier deletion rates, and it is also relatively stable. As the carrier deletion rate increases, the hiding success rate of reference [18] decreases quickly, while that of this method remains basically stable. Because we adopt a multi-index robust mechanism, multiple sets of secret information texts are searched, and the best text for each search is different. Therefore, the secret information can still be extracted well even if part of the carrier is lost, achieving strong robustness.
We also compared this method with reference [18] in terms of the hiding success rate of each text. Fig. 5 compares the two methods on the 120 texts of 1k to 6k under different carrier deletion rates, with each text randomly losing 5%, 10%, 15%, 20%, 25%, 30%, and 35% of its carrier. Panels (a) to (g) correspond to carrier losses of 5% to 35%; panel (a) shows the case where the carrier randomly loses 5%. We can see that the hiding success rate of reference [18] is not high even when the carrier deletion rate is minimal, and it becomes much worse as the carrier deletion rate increases. In contrast, the hiding success rate of this paper is very stable, which shows that the proposed method has strong robustness. According to Eqs. (6)-(8), the overall hiding success rate of this paper over the seven loss cases reaches 94.89%.
We also compared the length of secret information hidden by this method and by reference [18] when each text randomly loses 5%, 10%, 15%, 20%, 25%, 30%, and 35% of the text carrier. In Fig. 6, panels (a) to (g) correspond to carrier losses of 5% to 35%; panel (a) shows the case where the carrier randomly loses 5%. The figure shows that the method of reference [18] has inferior stability, while our method is very stable. As the carrier loss rate increases, the number of hidden words in this paper fluctuates slightly, but overall it performs much better than reference [18]. From panel (a) to (g), the more of the carrier is lost, the shorter the secret information that reference [18] can hide, while this paper maintains a stable state, which shows that the proposed method is very robust. Panel (h) averages the results over the random losses of 5%, 10%, 15%, 20%, 25%, 30%, and 35% for each text and compares the length of the secret information hidden in each experiment with the actual length of the secret information. We can see that, within a certain range of carrier loss, the number of Chinese characters successfully hidden by this method does not change greatly as the length of the secret information changes, whereas that of reference [18] (Long's method) changes a lot. The 120 texts are 120 secret messages. In the figures, "Long's method" denotes the length of secret information that can be successfully hidden under random carrier loss using the method of reference [18], and "Our method" denotes the length that can be successfully hidden under random carrier loss using the method proposed in this paper.

Security Analysis
Since this paper uses multiple indexes, setting three sets of indexes makes it possible to extract three sets of secret information. We can determine whether the carrier has been tampered with by comparing the three groups of extracted secret information, thereby ensuring the security of the confidential information.
For example, we can extract three sets of secret information after we hide the secret information "中天杯 第九届中国上海苏州河城市龙舟国际邀请赛开赛比赛共邀请境内外支龙舟队参赛." We can compare the consistency of the secret information to determine whether the carrier has been tampered with. Fig. 7 shows the three sets of secret information extracted by three sets of indexes. We can see that the three groups of extracted secret information are entirely consistent, so it can be determined that the carrier has not been modified, thereby ensuring the security of the secret information.
At the same time, the secret information is converted into easy-to-express numbers during information hiding. The text index includes the topic cluster distribution of the text and the text tag number used to label the text containing secret information, so the transmitted indexes are all numbers. Even if an index is stolen during transmission, it is not easy for the thief to interpret it, because the index consists only of abstract numbers. Therefore, the proposed method has strong security while ensuring robustness.
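The consistency comparison described above can be sketched as follows. The paper only states that the extracted groups are compared; returning the majority message when the groups disagree is an added assumption for illustration, and `check_consistency` is a hypothetical helper name.

```python
from collections import Counter


def check_consistency(groups):
    """Return (most common message, tampered flag) for several extracted copies."""
    counts = Counter(groups)
    message, votes = counts.most_common(1)[0]
    tampered = votes != len(groups)  # any disagreement is treated as suspicious
    return message, tampered


# Three identical copies: the carrier is considered unmodified.
clean = ["文本无载体信息隐藏"] * 3
message, tampered = check_consistency(clean)

# One copy differs: the carrier is flagged as possibly tampered with.
altered = ["文本无载体信息隐藏", "文本无载体信息隐藏", "文本无信息隐藏"]
message2, tampered2 = check_consistency(altered)
```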

Secret Hiding Rate Comparison
We also tested the hiding rate for secret information of different lengths. We selected 1k to 6k texts (20 texts in each group) as the secret information and compared the transmission rate of this method with that of reference [18], as shown in Tab. 3.
It can be seen from Tab. 3 that the hiding rate of this paper is higher than that of reference [18]. Because this paper uses multiple indexes to hide multiple sets of secret information, the throughput is relatively larger, so the hiding rate is higher. However, as the length of the secret information increases, the transmission rate of this paper decreases because the time to find the carrier becomes longer.

Conclusion
In this paper, we proposed a robust text coverless information hiding method based on the multi-index method. In this method, the sender sends multiple indexes to the receiver to achieve better robustness. Experiments show that the method can accurately extract the secret information even if part of the carrier is lost during transmission, and its robustness is greatly improved. Since the original carrier is not modified, the method can resist attacks from various steganalysis tools. Moreover, multiple sets of indexes can be used to extract multiple sets of secret information. By comparing whether the extracted sets are consistent, we can determine whether the carrier has been tampered with, which further improves the quality of the extracted secret information. However, the method still has the problem that a small number of proper nouns, such as person names and place names, cannot be hidden, so the hiding success rate does not reach 100%. This will be further optimized in follow-up work.