General Steganalysis Method of Compressed Speech Under Different Standards

Analysis-by-synthesis linear predictive coding (AbS-LPC) is widely used in a variety of low-bit-rate speech codecs. Most current steganalysis methods for AbS-LPC low-bit-rate compressed speech steganography are designed for a specific coding standard or category of steganography methods, and thus lack generalization capability. In this paper, a general steganalysis method for detecting steganographies in low-bit-rate compressed speech under different standards is proposed. First, the code-element matrices corresponding to different coding standards are concatenated to obtain a synthetic code-element matrix, which is mapped into an intermediate feature representation using pre-trained dictionaries. Then, bidirectional long short-term memory is employed to capture long-term contextual correlations. Finally, a code-element affinity attention mechanism is used to capture the global inter-frame context, and a fully connected structure is used to generate the prediction result. Experimental results show that the proposed method is effective and outperforms the comparison methods in detecting steganographies in cross-standard low-bit-rate compressed speech.


Introduction
Data hiding is a technique of embedding secrets into digital media imperceptibly, and different types of media data are considered for steganography, including image [1,2], text [3,4], and video [5,6]. In recent years, with the continuous growth of network bandwidth and the enhancement of network convergence, network streaming media services for communication have undergone unprecedented development. Since Voice over Internet Protocol (VoIP) technology [7,8] has been widely used for real-time communication, it has become an excellent carrier for transmitting secret information over the Internet. VoIP steganography is a means of imperceptibly embedding secret information into VoIP-based cover speech. There are many VoIP speech codecs, including G.711, G.723.1, G.726, G.728, G.729, internet Low Bitrate Codec (iLBC), and the Adaptive Multi-Rate (AMR) codec. Most of them, including G.723.1, G.729, AMR, and iLBC, are low-bit-rate speech codecs that use analysis-by-synthesis linear predictive coding (AbS-LPC) [9].
At present, most methods of speech steganography utilize AbS-LPC low-bit-rate speech codecs to embed secret information for covert communication. Therefore, it is essential to develop a powerful steganalysis method to analyze low-bit-rate speech streams.
Information-hiding methods based on low-bit-rate speech streams can be divided into three categories according to the embedding position: the first category uses a pitch synthesis filter for information hiding [10][11][12][13][14][15][16], the second uses an LPC synthesis filter [17][18][19][20][21][22], and the third embeds information by directly modifying the values of some code elements in the compressed speech stream [23][24][25][26][27][28][29][30].

Figure 1: Difference between different levels of general steganalysis methods: (a) non-general steganalysis method; (b) C1-level and (c) C2-level general steganalysis methods

The existing steganalysis methods for AbS-LPC low-bit-rate compressed speech steganography are designed for a specific coding standard or category of steganography methods, and thus lack generalization capacity. When general steganalysis is required, it is complex and time-consuming to enumerate all the steganalysis methods that correspond to the steganographic methods, which makes it difficult to meet the requirements of practical applications. In this paper, the generality of steganalysis algorithms is divided into two levels: one is generality across different steganography algorithms under the same compression standard, and the other is generality across steganography algorithms under different standards. For interpreting the idea of the proposed method, the first is referred to as C1 and the second as C2. A C1-level general steganalysis algorithm can effectively detect different information-hiding algorithms (e.g., quantization index modulation [31]) under the same standard, such as G.729. A C2-level general steganalysis method can detect different information-hiding algorithms under an arbitrary standard.
For example, to achieve general steganalysis of different coding standards with non-general detection methods, multiple steganalysis methods must be used jointly for the different coding standards and steganography methods, as shown in Fig. 1a. As demonstrated in Fig. 1b, different methods must still be combined across coding standards when using C1-level steganalysis methods. As shown in Fig. 1c, only one C2-level detection method is needed. Obviously, the ideal steganalysis method achieves C2-level generality, which is the research focus of this paper.
Since speech signals are encoded by different encoding standards, the number of code elements (CEs) and their meanings differ considerably. Therefore, it is unrealistic to perform C2-level general steganalysis directly on the original compressed speech stream. In this paper, the compressed speech streams of different coding standards are first converted into an intermediate feature representation. Then, a classification network based on a CE affinity attention mechanism is built to accomplish steganalysis.

Proposed Method
The architecture of the proposed steganalysis method is illustrated in Fig. 2.

Intermediate Feature Representation
Assuming that one must detect m types of coding standards at the same time, the CE matrix X_i corresponding to the ith coding standard can be expressed as

$$X_i = \begin{bmatrix} x^i_{1,1} & x^i_{1,2} & \cdots & x^i_{1,N_i} \\ \vdots & \vdots & \ddots & \vdots \\ x^i_{T,1} & x^i_{T,2} & \cdots & x^i_{T,N_i} \end{bmatrix}, \tag{1}$$

where N_i is the number of CEs in a frame corresponding to the ith coding standard, and x^i_{T,N_i} is the value of the N_i-th CE in frame T. To detect different coding standards at the same time, the CE matrices corresponding to the m coding standards are concatenated to obtain a synthetic CE matrix X:

$$X = \begin{bmatrix} X_1 & X_2 & \cdots & X_m \end{bmatrix} = \begin{bmatrix} x^1_{1,1} & \cdots & x^1_{1,N_1} & \cdots & x^m_{1,1} & \cdots & x^m_{1,N_m} \\ \vdots & \ddots & \vdots & \ddots & \vdots & \ddots & \vdots \\ x^1_{T,1} & \cdots & x^1_{T,N_1} & \cdots & x^m_{T,1} & \cdots & x^m_{T,N_m} \end{bmatrix}, \tag{2}$$

where x^m_{T,N_m} is the value of the N_m-th CE in frame T corresponding to the mth coding standard.

To convert the values of CEs into a form that is easy for the neural network to use, one-hot coding is utilized to map each CE into a feature vector. For a CE that occupies n bits, its coded value ranges over 0 to 2^n − 1. In one-hot encoding, a vector of length 2^n is used to represent this CE. If the coded value of this CE is u, the one-hot representation can be denoted as

$$e_u = (e_0, e_1, \ldots, e_{2^n-1}), \tag{3}$$

where

$$e_k = \begin{cases} 1, & k = u \\ 0, & k \neq u \end{cases}. \tag{4}$$

After one-hot coding, a group of independent CE one-hot representations is obtained; these are aggregated in the order of the original CEs to form a long feature vector, called a multi-hot vector. This process is called multi-hot coding. For a frame containing M code elements, the length of the corresponding multi-hot vector is

$$L = \sum_{i=1}^{M} 2^{d_i}, \tag{5}$$

where d_i denotes the number of bits that the ith CE occupies.
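The multi-hot coding described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the paper's implementation; the function names (`one_hot`, `multi_hot`) and the toy bit widths are assumptions.

```python
from typing import List

def one_hot(value: int, bits: int) -> List[int]:
    """One-hot encode a CE occupying `bits` bits: a vector of length 2**bits."""
    vec = [0] * (2 ** bits)
    vec[value] = 1
    return vec

def multi_hot(frame_values: List[int], ce_bits: List[int]) -> List[int]:
    """Concatenate the one-hot vectors of all CEs in a frame (multi-hot coding)."""
    out: List[int] = []
    for value, bits in zip(frame_values, ce_bits):
        out.extend(one_hot(value, bits))
    return out

# Toy frame with three CEs occupying 2, 3, and 1 bits:
vec = multi_hot([1, 5, 0], [2, 3, 1])
assert len(vec) == 2**2 + 2**3 + 2**1  # L = sum of 2**d_i = 14
assert sum(vec) == 3                   # exactly one 1 per CE
```

Note that the resulting vector is extremely sparse, which motivates both the frequency-count reduction and the dictionary embedding discussed next.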
However, some CEs occupy many bits, which greatly increases the model's computation: when one-hot coding is applied to them, the one-hot vectors become very long. This explosive growth in dimensionality is unaffordable, so dimensionality reduction is needed. Therefore, a frequency-count method is employed on these CEs; experiments show that it is a simple but very effective coding method. Specifically, for each CE that occupies more than 8 bits, the occurrence frequency of each of its coded values is counted. The coded values are then arranged in order of frequency, and the first 255 values are selected and encoded as 0-254. All other coded values are encoded as 255. In this way, the coded values of all CEs can be mapped into the range 0-255.
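The frequency-count reduction can be sketched as follows. This is a hypothetical sketch: `build_frequency_map`, `remap`, and the sample stream are assumptions, and `keep` is set small for illustration (the paper uses 255).

```python
from collections import Counter

def build_frequency_map(values, keep=255):
    """Rank coded values by occurrence frequency and map the `keep` most
    frequent ones to 0..keep-1; rarer values will collapse to `keep`."""
    counts = Counter(values)
    return {v: rank for rank, (v, _) in enumerate(counts.most_common(keep))}

def remap(value, mapping, overflow=255):
    """Encode a coded value via the frequency map; rare values become `overflow`."""
    return mapping.get(value, overflow)

# Toy stream of coded values observed for one wide (>8-bit) CE:
stream = [7, 7, 7, 3, 3, 9]
mapping = build_frequency_map(stream, keep=2)
assert remap(7, mapping, overflow=2) == 0  # most frequent value -> 0
assert remap(9, mapping, overflow=2) == 2  # rare value collapses to overflow
```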
A sparse representation R can be obtained by applying multi-hot encoding. However, a sparse representation brings additional computational cost to the model, which is unfavorable for the real-time requirements of steganalysis. Inspired by natural language processing tasks, an embedding method is therefore introduced: a dictionary is built for each CE to convert the sparse multi-hot slices into a more compact intermediate feature representation.
The parameters of the dictionaries are randomly initialized from a normal distribution. The goal is an embedding representation that is robust to different embedding rates. A large dataset consisting of different stego data and cover data is then built to pre-train the dictionaries. At the pre-training stage, a two-layer bidirectional long short-term memory (Bi-LSTM) [32] and a fully connected layer are used, followed by a sigmoid activation function. The dictionaries and the training network are trained together, and the dictionaries are fixed once training is done. Before the steganalysis network classifies an input sample, the matrix R is converted into the embedding matrix E based on the trained dictionaries.
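The per-CE dictionaries amount to one embedding table per code element, indexed by the CE's coded value. Below is a minimal NumPy sketch assuming an embedding dimension of 4 and toy bit widths; in the paper the tables are pre-trained jointly with the Bi-LSTM, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_dictionaries(ce_bits, embed_dim):
    """One normally initialized table per CE: 2**d_i rows, embed_dim columns."""
    return [rng.standard_normal((2 ** bits, embed_dim)) for bits in ce_bits]

def embed_frame(frame_values, dictionaries):
    """Replace each CE's sparse one-hot slice with its dense dictionary row."""
    return np.concatenate([d[v] for v, d in zip(frame_values, dictionaries)])

dicts = build_dictionaries([2, 3, 8], embed_dim=4)
row = embed_frame([1, 5, 200], dicts)
assert row.shape == (3 * 4,)  # compact vs. multi-hot length 2**2 + 2**3 + 2**8 = 268
```

Stacking one such row per frame yields the embedding matrix E that the steganalysis network consumes.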

Steganalysis Network
Since the front and back frames of a speech sample influence each other, a two-layer Bi-LSTM is first employed to capture long-term contextual correlations of E and generate a better representation of the frame vectors. However, Bi-LSTM can only capture long-range dependencies, which lack local CE information. Inferring inter-frame context from local CE information can capture both intra- and inter-frame relationships simultaneously, which is very important for low-bit-rate compressed speech steganalysis. Global context information is useful for extracting a wide range of inter-frame dependencies and providing a comprehensive understanding of the entire input speech sequence, while local CE information plays a key role in understanding the secret information embedded at different CE positions. Based on this observation, the CE affinity attention module is proposed, which adaptively infers the global context information between frames under the guidance of the codeword affinity representation.
The architecture of the CE affinity attention module is illustrated in Fig. 2. It consists of two branches: the first branch is used to calculate the local affinity attention vector, and the second deals with the feature representation y at a single scale. Moreover, the second branch determines the amount of information contained in the local affinity vectors. Both branches will be described in detail below.
In this paper, the output features calculated by the Bi-LSTM are denoted O ∈ ℝ^{T×S}, where T indicates the number of frames of the input data and S the feature dimension. In the first branch, a global average pooling operation is first applied to O to obtain the global information representation g(O), which expresses the global inter-frame information:

$$g(O_i) = \frac{1}{S} \sum_{j=1}^{S} o_{i,j}, \tag{6}$$

where o_{i,j} denotes the feature value at the jth position of the ith frame. Then, a frame-wise multiplication between the global information g(O_i) and the input features O is employed to obtain a new global-guided feature representation Õ:

$$\tilde{O}_i = g(O_i) \cdot O_i, \quad i = 1, 2, \ldots, T. \tag{7}$$

After Õ is obtained, it is reshaped, and a one-dimensional (1D) convolution (kernel size 3, stride 1) followed by a ReLU activation function converts the global-guided feature representation into an affinity vector A ∈ ℝ^{M×T}, where M denotes the area size of codeword affinity. In the second branch, adaptive average pooling and a 1D convolution are first applied to the input feature O to obtain y. Then, y is reshaped to size M × T to match the affinity vector. A and y are multiplied together and the result reshaped to obtain an adaptive context matrix z_M, which includes local codeword information and global inter-frame correlations:

$$z_i = \sum_{j=1}^{M} a_j \, y_{j,i}, \quad i = 1, 2, \ldots, T, \tag{8}$$

where a_j ∈ A indicates the affinity factor.
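The two core tensor operations of the module, frame-wise global average pooling with global-guided multiplication and the affinity-weighted aggregation, can be sketched in NumPy. The 1D convolutions and adaptive pooling that produce A and y are omitted; the shapes and function names here are assumptions for illustration.

```python
import numpy as np

def global_guided(O):
    """g(O_i) = (1/S) * sum_j o_{i,j}, then frame-wise multiplication with O."""
    g = O.mean(axis=1, keepdims=True)  # (T, 1): global inter-frame information
    return g * O                       # (T, S): global-guided features O~

def affinity_context(A, y):
    """Weight the pooled features y (M x T) by affinity factors A (M x T)
    and sum over the affinity area M, giving one context value per frame."""
    return (A * y).sum(axis=0)         # (T,)

T, S, M = 6, 8, 4
O = np.random.default_rng(1).standard_normal((T, S))
assert global_guided(O).shape == (T, S)
assert affinity_context(np.ones((M, T)), np.ones((M, T))).shape == (T,)
```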
To endow the features used for classification with both long-range dependencies and global inter-frame context, the features output by the Bi-LSTM and the codeword affinity module are integrated to form a more powerful feature representation. These features are input to a classification module consisting of two fully connected layers and a sigmoid activation function, which produces a prediction probability p that determines whether a hidden message exists in the input speech sequence:

$$\text{Result} = \begin{cases} \text{cover}, & p < 0.5 \\ \text{stego}, & p \geq 0.5 \end{cases}. \tag{9}$$
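The decision rule is a plain threshold on the sigmoid output. A self-contained sketch, where `logit` stands in for the output of the last fully connected layer (an assumption for illustration):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def classify(logit: float) -> str:
    """Threshold the prediction probability p at 0.5: stego iff p >= 0.5."""
    p = sigmoid(logit)
    return "stego" if p >= 0.5 else "cover"

assert classify(-2.0) == "cover"
assert classify(3.0) == "stego"
```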

Experiments
Seven thousand speech segments were collected from the Internet, covering seven human voice categories, to form the speech database. Each category contains 1,000 speech segments. The seven categories are Chinese man, Chinese woman, English man, English woman, French, German, and Japanese, and each contains samples from more than five individuals. The duration of each speech segment is 10 s, and each segment is formatted as a mono PCM file with an 8,000-Hz sampling rate and 16-bit quantization. The speech segments in each category are divided into a training dataset and a testing dataset at a 4:1 ratio. The training dataset is used to adjust the model's parameters, and the testing dataset is used to evaluate model performance. The G.723.1 (6.3 kbit/s) and G.729 codecs are used to evaluate the performance of the proposed method.
Both the training and testing stages were executed on a GeForce GTX 2080 graphics processing unit with 11 GB of graphics memory. PyTorch was used to implement the model and algorithm. During training, Adam was used as the optimizer with a learning rate of 1 × 10^-4, and cross-entropy was chosen as the loss function. The maximum number of training epochs was 200, and the batch size was 16.
As mentioned above, three main categories of steganography methods exist for AbS-LPC low-bit-rate compressed speech. To comprehensively test the performance of the proposed model, a representative method [15,17,24] was chosen from each category. For simplicity, the chosen methods are denoted "ACL" [15], "CNV" [17], and "HYF" [24]. It should be noted that the ACL and HYF methods were designed for the G.723.1 standard and the CNV method for the G.729 standard; all three methods were used for steganography under the G.723.1 standard.
To the best of our knowledge, no general method has been designed for the detection of steganographies in cross-standard AbS-LPC low-bit-rate compressed speech. The MFCC-based steganalysis method [33] can, in theory, detect any type of steganography based on the decoded audio/speech data; in this sense, it can be considered general as well. Besides, Hu et al. [34] proposed an SFFN-based general steganalysis method for specific coding standards. In the present paper, these methods are used as comparison algorithms to evaluate the proposed method.
The embedding rate is defined as the ratio of the number of embedded bits to the total embedding capacity. Experiments on the three steganography methods for the G.723.1 standard were conducted under five different embedding rates (20%-100%). The experimental results are shown in Tab. 1. For ACL, the detection accuracy of the MFCC method is only 51.58% when the embedding rate is 20%, slightly better than a random guess. By comparison, the detection accuracy of the proposed method is 98.96%, far exceeding that of the MFCC method. However, the detection accuracy of SFFN reaches 99.54%, 0.58% higher than that of the proposed method. When the embedding rate is 40% or above, both SFFN and the proposed method achieve a detection accuracy of 100%. For HYF and CNV, when the embedding rate is 20%, the detection accuracies of the proposed method are 35.73% and 37.26% higher, respectively, than those of MFCC, whereas the detection accuracies of SFFN are only 8.48% and 12% higher, respectively. When the embedding rate is 80% or above, SFFN achieves detection accuracies greater than 95%, an accuracy the proposed method already reaches at an embedding rate of only 20%.

Since the ACL and HYF methods are designed for the G.723.1 standard, only the CNV method is used for steganography under the G.729 standard. Experiments on the CNV method were conducted under five different embedding rates (20%-100%). The experimental results are shown in Tab. 2, from which it can be seen that the proposed method performs better than MFCC and SFFN at all embedding rates. When the embedding rate is 20%, the detection accuracy of the proposed method is 32.73% higher than that of MFCC and 6.74% higher than that of SFFN. When the embedding rate is 80% or above, SFFN achieves detection accuracies greater than 99%, which the proposed method already reaches at an embedding rate of only 40%.
In summary, the proposed method achieves the best results at all embedding rates under the G.723.1 and G.729 standards, except for ACL steganography under G.723.1 at a 20% embedding rate, where its accuracy is 0.58% lower than that of SFFN. The experimental results indicate that the proposed steganalysis method is effective for detecting steganographies in cross-standard low-bit-rate compressed speech.

Conclusions
In this paper, a general method for detecting steganographies in cross-standard low-bit-rate compressed speech based on an intermediate feature representation is proposed. To detect multiple coding standards at the same time, the code element (CE) matrices corresponding to m coding standards are first concatenated to obtain a synthetic CE matrix. Then, one-hot coding is utilized to convert this matrix into a form that the neural network can easily use. Inspired by ideas from natural language processing, dictionaries are built for each CE to transform the sparse vectors into more compact intermediate features. These features are fed into the steganalysis network to obtain the final classification result. Experimental results indicate the superiority of the proposed method in accuracy and performance.