A novel linguistic steganalysis method for hybrid steganographic texts

Most existing linguistic steganalysis methods focus on detecting steganographic texts generated by embedding secret information into a single type of text medium using a single steganographic algorithm. In practical applications, however, a large number of steganographic texts may be hybrid ones, generated by embedding secret information into different types of text media using different steganographic algorithms. In this paper, inspired by transfer learning, a novel linguistic steganalysis method is proposed to detect hybrid steganographic texts. The proposed method first uses the pre-trained BERT language model to obtain initial context-dependent word representations. The extracted features are then fed into an attention-enhanced Bidirectional Long Short-Term Memory (Bi-LSTM) network to obtain the final contextual representations of sentences. The experimental results show that the proposed method satisfies practical application demands better than the existing linguistic steganalysis methods.


Introduction
Steganography is the science of embedding secret information into common carriers such as text [1][2], image [3] and audio [4]. Due to the frequent use of text in daily life, text steganography has attracted extensive attention. Text steganography can be mainly divided into two categories: modification-based steganography [6][7][8][9] and generation-based steganography [10][11][12]. The former embeds secret information mainly by modifying cover texts, while the latter usually automatically generates steganographic texts according to the secret information.
As a counter technique to text steganography, linguistic steganalysis aims to detect whether texts contain secret information. Most traditional linguistic steganalysis methods are based on machine learning technology [13][14][15]. These methods rely on manually extracted features, which are often simplistic and lack generality. Recently, some linguistic steganalysis methods based on deep learning technology have been proposed [16][17][18][19][20]. For example, Wen et al. [18] utilized a convolutional neural network (CNN) to extract local correlation features of different granularities. Yang et al. [19] employed the Bidirectional Long Short-Term Memory (Bi-LSTM) network to extract long-distance contextual dependency information. Niu et al. [20] combined the advantages of Bi-LSTM and CNN to extract both global semantic features and local features.
Although the existing linguistic steganalysis methods based on deep learning technology have made great progress compared with the traditional ones, most of them focus on detecting steganographic texts generated by embedding secret information into a single type of text medium using a single steganographic algorithm. In practical applications, however, the bulk of steganographic texts may be hybrid steganographic texts, namely, texts generated by embedding secret information into different types of text media using different steganographic algorithms.
In light of the above, a novel linguistic steganalysis method is proposed in this paper. Firstly, motivated by transfer learning, BERT [21] is introduced to obtain initial context-dependent word representations. By using BERT, the proposed method can extract the correlations between words from different representation subspaces. Furthermore, Bi-LSTM and an attention mechanism are adopted to fuse the contextual information and obtain the final representations of sentences. The experimental results demonstrate the advantages of the proposed model in meeting practical application requirements.
The rest of the paper is organized as follows. Section 2 introduces the proposed method in detail. The next section presents the experimental settings and analyzes the experimental results. Finally, the conclusions are summarized in Section 4.

The proposed method
The proposed method can be divided into three parts: (1) obtaining initial word representations; (2) fusing contextual information and obtaining the final representations of sentences; (3) predicting whether the input sentences contain secret information. For clarity, the overall structure of the proposed method is shown in figure 1.
For linguistic steganalysis, the main criterion for judging whether texts contain secret information is the difference in semantic spatial distribution between cover and stego texts. Usually, the correlations between words in a sentence are destroyed after embedding secret information, which leads to changes in the semantic spatial distribution. The main components of BERT are Transformer encoders [22], which mainly depend on multi-head self-attention. For each input sentence, multi-head self-attention can directly capture the correlations between words from different representation subspaces, which is very helpful for improving the performance of linguistic steganalysis.
In addition, the conditional probability of sentences also plays an important role in linguistic steganalysis. Assume a sentence $S$ of length $n$, that is, $S = \{w_1, w_2, \cdots, w_n\}$, where $w_i$ denotes the $i$-th word in $S$. The conditional probability of $S$ can be expressed as:
$$p(S) = \prod_{i=1}^{n} p(w_i \mid w_1, w_2, \cdots, w_{i-1}).$$
Once a sentence is embedded with secret information, the probability distribution of the entire sentence will inevitably be affected. As a variant of RNN, LSTM can not only capture the changes of the semantic space caused by the destruction of the conditional probability of texts, but also alleviate the gradient vanishing problem. After the Bi-LSTM, an attention mechanism is applied to automatically capture clues that are important for linguistic steganalysis.
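The chain-rule decomposition above can be illustrated with a small, self-contained sketch; the word probabilities here are hypothetical toy values for demonstration only, not from any trained language model:

```python
import math

def sentence_log_prob(sentence, cond_prob):
    """Score a sentence with the chain rule:
    log p(S) = sum_i log p(w_i | w_1, ..., w_{i-1})."""
    log_p = 0.0
    for i, word in enumerate(sentence):
        # cond_prob(word, history) -> p(word | history)
        log_p += math.log(cond_prob(word, sentence[:i]))
    return log_p

# Hypothetical history-independent probabilities, for illustration only.
toy_probs = {"the": 0.5, "cat": 0.3, "sat": 0.2}
sentence = ["the", "cat", "sat"]
lp = sentence_log_prob(sentence, lambda w, h: toy_probs[w])
```

A steganalyzer exploits the fact that embedding secret bits tends to shift such sentence-level probabilities away from those of natural cover texts.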
Finally, a fully connected layer with dropout is used to prevent overfitting, and a softmax function judges whether the input sentences are embedded with secret information.
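The pipeline described above (BERT embeddings, Bi-LSTM, attention pooling, fully connected classifier) can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' implementation: the class name, the additive attention scorer, and the use of pre-computed BERT outputs as input are all assumptions.

```python
import torch
import torch.nn as nn

class StegClassifier(nn.Module):
    """Illustrative sketch: BERT features -> Bi-LSTM -> attention -> FC."""
    def __init__(self, bert_hidden=768, lstm_hidden=256, num_classes=2):
        super().__init__()
        # In the full method the inputs come from "BERT-Base, uncased";
        # here they are assumed pre-computed as a (batch, seq_len, 768) tensor.
        self.bilstm = nn.LSTM(bert_hidden, lstm_hidden,
                              batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * lstm_hidden, 1)  # per-token attention score
        self.dropout = nn.Dropout(0.5)             # dropout rate from the paper
        self.fc = nn.Linear(2 * lstm_hidden, num_classes)

    def forward(self, bert_out):
        h, _ = self.bilstm(bert_out)                  # (B, T, 2*lstm_hidden)
        weights = torch.softmax(self.attn(h), dim=1)  # (B, T, 1), sums to 1 over T
        sent = (weights * h).sum(dim=1)               # attention-pooled sentence vector
        return self.fc(self.dropout(sent))            # logits; softmax applied at inference

model = StegClassifier()
logits = model(torch.randn(4, 32, 768))  # 4 sentences, 32 tokens each
```

In training, these logits would be fed to a cross-entropy loss with the AdamW optimizer and learning rate given in the experimental settings.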

Experimental datasets and settings
In the experiments, we construct multiple hybrid steganographic text datasets using two steganographic algorithms, Tina-Fang [1] and T-Lex [2], both classic and widely used. Tina-Fang is a generation-based steganographic algorithm which can generate steganographic texts with different embedding rates. T-Lex is a modification-based steganographic algorithm which embeds secret information into texts based on synonym substitution. We use Tina-Fang to embed a secret bitstream into three common types of text media (IMDB [23], News [24] and Twitter [25]), respectively. Similarly, we use T-Lex to embed a secret bitstream into these three types of text media, respectively. The steganographic texts generated by the two steganographic algorithms are then used to construct the hybrid steganographic text datasets in different ways, which are introduced in the corresponding subsections below. Each dataset is divided into training, validation and test sets according to the ratio of 72:8:20. Moreover, three representative linguistic steganalysis methods, LS-CNN [18], TS-BIRNN [19] and R_BILSTM_C [20], are chosen as comparison methods. Accuracy (Acc), precision (P) and recall (R) are used as the evaluation indicators.
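The 72:8:20 split can be sketched with a short helper; the shuffling seed and function name are illustrative, not taken from the paper:

```python
import random

def split_dataset(pairs, seed=0):
    """Shuffle cover-stego pairs and split them 72:8:20
    into train/validation/test sets (ratio from the paper)."""
    rng = random.Random(seed)
    pairs = list(pairs)
    rng.shuffle(pairs)
    n = len(pairs)
    n_train, n_val = int(0.72 * n), int(0.08 * n)
    return (pairs[:n_train],                     # 72% training
            pairs[n_train:n_train + n_val],      # 8% validation
            pairs[n_train + n_val:])             # 20% test

# e.g. a 6000-pair dataset yields 4320 / 480 / 1200 pairs.
train, val, test = split_dataset(range(6000))
```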
For the pre-trained BERT model, we choose the "BERT-Base, uncased" version, which contains 12 Transformer encoder blocks. The hidden size is 768 and the number of self-attention heads is 12. Some hyper-parameters of the proposed method are set as follows: the hidden size in Bi-LSTM is 256, the nonlinear activation function is GELU and the optimizer is AdamW. The batch size is 16, the dropout rate is 0.5, and the initial learning rate is 5e-5.

Experimental results
(1) The first experiment verifies the performance of the proposed method when text media are hybrid. Specifically, we mix the steganographic texts of the various text media generated by Tina-Fang with bpw=1,2,3,4,5, respectively, and likewise mix the steganographic texts of the various text media generated by T-Lex. This yields six datasets, each containing three types of text media and 6000 cover-stego text pairs. The experimental results are shown in table 1, where "TF" denotes the Tina-Fang steganographic algorithm.
(2) The second experiment evaluates the performance of the proposed method when embedding rates are hybrid. Specifically, for each medium, the steganographic texts generated by Tina-Fang at various embedding rates (bpw=1,2,3,4,5) are mixed. This yields three datasets, each containing 10000 cover-stego text pairs. The experimental results are shown in table 2.
(3) The third experiment tests the performance of the proposed method when steganographic algorithms are hybrid. Specifically, for each medium, we mix the steganographic texts generated by Tina-Fang at various embedding rates (bpw=1,2,3,4,5) with the steganographic texts generated by T-Lex. This yields three datasets, each containing one type of text medium, two steganographic algorithms and a total of 20000 cover-stego text pairs. The experimental results are shown in table 3.

Experimental analysis
Firstly, from the above experiments, it can be observed that the proposed method achieves state-of-the-art performance in various hybrid scenarios, including hybrid text media, hybrid embedding rates and hybrid steganographic algorithms. This demonstrates that, regardless of which steganographic algorithm is used and which type of text medium carries the secret information, the features extracted by the proposed method are sufficiently comprehensive to expose the distribution changes in the semantic space before and after steganography, so that the proposed method can accurately distinguish stego texts from a large number of normal-looking texts.
Besides, it can be found from table 2 that the detection performance of each method on long texts like News and IMDB is better than that on short texts like Twitter. This is because when the cover texts are short, the steganographic algorithm can generate stego texts with high naturalness, which makes the semantic spatial distributions of the cover and stego texts more similar. The cover and stego texts are therefore more difficult to distinguish.

Conclusion
Motivated by transfer learning, a novel linguistic steganalysis method has been proposed in this paper for hybrid steganographic texts. Multiple hybrid steganographic text datasets were constructed to verify the performance of the proposed method. The experimental results show that the proposed method achieves state-of-the-art performance, which demonstrates that it can be applied well in practice. In future work, we will explore the performance of linguistic steganalysis when the text media, embedding rates and steganographic algorithms are all mixed together.