F3SNet: A Four-Step Strategy for QIM Steganalysis of Compressed Speech Based on Hierarchical Attention Network

Traditional machine learning-based steganalysis methods for compressed speech have achieved great success in the field of communication security. However, previous studies lacked a mathematical description and model of the correlation between codewords, and there is still room for improvement in steganalysis of small-sized samples and samples with low embedding rates. To address this challenge, we use Bayesian networks to measure different types of correlations between codewords in linear prediction coding and present F3SNet, a four-step strategy (Embedding, Encoding, Attention and Classification) for quantization index modulation (QIM) steganalysis of compressed speech based on a Hierarchical Attention Network. Embedding converts codewords into dense numerical vectors, Encoding uses the memory characteristics of LSTM to retain more information by distributing it among all its vectors, and Attention further determines which vectors have a greater impact on the final classification result. To evaluate the performance of F3SNet, we make a comprehensive comparison of F3SNet with existing steganalysis methods. Experimental results show that F3SNet surpasses the state-of-the-art methods, particularly for small-sized and low embedding rate samples.


I Introduction
As an effective way to secretly transfer information over the Internet, steganography uses the redundancy of digital carriers to embed secret information. In recent years, due to the pervasiveness of streaming media technologies, VoIP steganography and its countermeasures have become one of the hot topics in information hiding [1,2,3].
In many VoIP applications, especially over band-limited channels and wireless links, speech coders such as G.723.1, G.729, Adaptive Multi-Rate (AMR), and Enhanced Full Rate (EFR) have become essential components of mobile and wireless communication. How to exploit the redundancy in the encoding process to achieve steganography is a new research hotspot. Several methods that embed secret messages into the bitstream during the encoding process have been proposed, such as quantization index modulation (QIM) steganography [4,5,6], fixed codebook (FCB) steganography [7,8,9] and pitch modulation (PM) steganography [10,11].
As the counterpart of steganography, steganalysis not only ensures that steganography is not maliciously abused, but is also a key technique for evaluating the performance of steganography algorithms. Machine learning algorithms, especially the Support Vector Machine (SVM), have been widely used in steganalysis of both traditional media and VoIP streams. For QIM steganography, S. Li et al. proposed a variety of detection methods [12,13]. In [12], they presented a statistical model to extract quantitative feature vectors of the index distribution characteristics (IDC). In another work, S. Li et al. [13] further presented a model called the quantization codeword correlation network (QCCN) to quantify the correlation characteristics of the vertices in the correlation network. For FCB steganography, Miao et al. [14] first presented a Markov Transition Probabilities (MTP) based detection method and an entropy-based detection method to detect steganography in compressed speech. To improve performance, Ren et al. [15] used the statistical probability of the Same Pulse Position (SPP) in the same track to accurately distinguish covers from stegos. For PM steganography, Q. Liu et al. [16] extracted the statistics of the high-frequency spectrum and the mel-cepstrum coefficients of the second-order derivative for detecting audio steganography. S. Li et al. [17] proposed a network model to quantify the correlation characteristics of the adaptive codebook. Undoubtedly, steganalysis of compressed speech based on machine learning has made great progress. However, the methods mentioned above face some challenges.
Firstly, as steganography becomes more sophisticated [7,8,18], the statistical features extracted for steganalysis are evolving from low-dimensional and simple to high-dimensional and complex [19]. Secondly, information hiding technology is gradually developing towards randomization and fine granularity. That is, within the allowable range of carrier distortion, secret information is first divided into small segments, and then carriers of different lengths are randomly selected to achieve fine-grained steganography with different embedding rates. However, most existing steganalysis methods do not perform well, especially for small-sized and low embedding rate samples [14,15]. Fortunately, the emergence of neural networks (NNs) has brought hope of dealing with these challenges.
In 2018, Lin et al. [20] first introduced neural networks to the steganalysis of compressed speech. They proposed a Recurrent Neural Network (RNN)-based steganalysis model (RNN-SM) to detect the disparities in codeword correlations caused by QIM steganography. In 2019, Chen et al. [38] proposed a steganalytic scheme combining RNN and Convolutional Neural Network (CNN) for FCB steganography. In 2019 and 2020, Hao et al. [21,22] successively proposed a hierarchical representation network and a multi-head attention-based network to extract correlation features for QIM steganalysis. However, sequence coding based on CNN or RNN is still a local coding method that models only the local dependency of the input information. The literature [23] argues that the attention mechanism can completely replace LSTM and convolutional neural networks. However, whether this mechanism is always effective is still an open problem.
In this paper, to avoid information loss, the encoder keeps much more information by distributing it among all its vectors. Moreover, the attention mechanism decides which vectors deserve more attention and relieves the encoder of the burden of compressing the input into a single fixed-length vector, which allows much more information to be kept. In practice, the combination of RNN and attention mechanism has proved quite beneficial for weak-signal processing tasks such as steganalysis.
In summary, this work makes the following contributions:
1. We first use a Bayesian network to establish a framework for uncertainty knowledge expression and reasoning, and then calculate the link strength between different nodes as a measure of the strength of the codeword correlation. This quantitative analysis serves as an essential step towards effective detection using a deep learning framework.
2. We present F3SNet, a four-step QIM steganalysis method based on a hierarchical attention network. Through the four-step strategy, we encode the numerical codeword vectors into multiple memory vectors, select the set of vectors that have the greatest impact on the classification result to prevent information overload, and finally achieve efficient steganography classification, even in difficult cases such as small sample size and low embedding rate.
3. To evaluate the performance of F3SNet, we perform comprehensive experiments on detection accuracy (ACC), false positive rate (FPR), and false negative rate (FNR) of the algorithm under different lengths and different embedding rates. Furthermore, we compare F3SNet with several existing algorithms, such as IDC [12], QCCN [13], RNN-SM [20] and FCEM [22] methods under different embedding rates and different lengths. The experimental results show that our algorithm is superior to other state-of-the-art algorithms.
The rest of the paper is structured as follows. Section II reviews related work on existing steganography and steganalysis of compressed speech. Section III provides an overview of linear prediction analysis and QIM steganography. Section IV discusses codeword correlations using Bayesian networks. Section V details the design and implementation of F3SNet, followed by experiments and discussions in Section VI. Finally, we conclude the paper and discuss future work in Section VII.

II RELATED WORK
In 2010, Ding et al. [24] used the histogram features of the pulse position parameter to train an SVM classifier to distinguish cover and stego speech. In 2011, Huang et al. [25] employed second detection and regression analysis not only to detect the hidden message but also to estimate the length of the embedded message. However, their approach is relatively specialized. Li et al. [12] designed statistical models to extract the quantitative feature vectors of these characteristics for detecting QIM steganography using an SVM classifier. Furthermore, Li et al. [13] built a QCCN model, extracted feature vectors from split quantization codewords, and then trained a high-performance SVM classifier.
In addition, for FCB steganography, Miao et al. [14] used the Markov property of speech parameters to propose a detection method based on MTP and entropy in 2013. Ren et al. [15] proposed an AMR steganalysis algorithm based on the probability of the same pulse position in the same track in 2015. For better performance, in 2016, Tian et al. [26] characterized AMR speech by exploiting the statistical properties of pulse pairs and presented a steganalysis method for AMR speech based on a multi-dimensional feature selection mechanism. For pitch modulation steganography, Li et al. [17] proposed a network model to quantify the correlation characteristics of the adaptive codebook. An SVM classifier was used in the above three papers.
In recent years, with the application of different types of deep learning, many novel algorithms have been proposed for steganalysis of image, audio and video [27,28,29,30,31]. Compared with conventional steganalytic methods based on handcrafted features, the algorithms based on deep learning can significantly improve detection performance. In 2015, Qian et al. [32] proposed a customized CNN for image steganalysis. The model could capture the complex dependencies in images and achieve better detection performance than the Spatial Rich Model (SRM). Xu et al. [33,34] proposed a CNN architecture that is more suitable for image steganalysis, and enhanced it by improving the statistical model in the subsequent layers and preventing overfitting. Ye et al. [35] proposed a CNN-based image steganalysis method that uses an activation function called the truncated linear unit (TLU), and improved the steganalysis ability by incorporating knowledge of the selection channel. In 2016, Paulin et al. [36] presented an audio steganalysis method using deep belief networks (DBN). Compared with SVM and Gaussian mixture models (GMM), the proposed DBN-based steganalysis method achieved higher classification accuracy. In 2017, Chen et al. [37] designed a novel CNN to detect audio steganography in the time domain. However, due to different signal characteristics, these algorithms are difficult to apply directly to compressed speech.
In 2018, Lin et al. [20] proposed a codeword correlation model based on RNN. They used a supervised learning framework to train RNN-SM. Experiments showed that RNN-SM achieved better detection results even for short samples or low embedding rates. In 2019, Chen et al. [38] proposed a steganalytic scheme combining RNN and CNN. They utilized an RNN to extract higher-level contextual representations of FCBs and a CNN to fuse spatial-temporal features for steganalysis. Experimental results validated that their method outperforms the existing state-of-the-art methods. In 2019 and 2020, Hao et al. [21,22] successively proposed a hierarchical representation network and a multi-head attention-based network to extract correlation features for QIM steganalysis. Both methods significantly improve on the previous best results, especially for short samples and samples with low embedding rates. Inspired by their work, we propose a new model called F3SNet, based on a hierarchical attention network, to model the spatial and temporal characteristics of the quantization indices in LPC and further improve the accuracy of steganography detection.

A Linear Prediction Analysis
As the basis of low-rate speech coding, linear predictive analysis (LPA) uses the correlation of the speech signal to approximate the sample value at the current moment with a linear combination of several past speech samples. Linear predictive coding (LPC) is mainly divided into three processes: LPA, line spectrum pair (LSP) analysis and vector quantization (VQ). First, the speech signal can be regarded as the output produced by an input sequence µ(n) exciting an all-pole system H(z). The transfer function of the system is
$$H(z) = \frac{G}{1 - \sum_{i=1}^{p} \alpha_i z^{-i}},$$
where $G$ is a constant, $p$ is the order of the model, and $\alpha_i$ is a real number. The $p$ prediction coefficients form a $p$-dimensional vector, which is the vector of linear prediction coefficients.
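To make the idea of linear prediction concrete, the following minimal sketch approximates each sample by a linear combination of the p previous samples. The function name `linear_predict` and the sign convention (prediction equals the sum of α_i times s(n − i)) are assumptions made for illustration, and the coefficients are chosen by hand rather than estimated from speech.

```python
def linear_predict(samples, alpha):
    """Predict samples[n] from the p previous samples, for n = p .. len(samples) - 1."""
    p = len(alpha)
    return [
        sum(alpha[i] * samples[n - 1 - i] for i in range(p))
        for n in range(p, len(samples))
    ]


# A straight-line ramp satisfies s(n) = 2*s(n-1) - s(n-2), so an order-2
# predictor with these (hand-picked) coefficients leaves no residual.
ramp = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = linear_predict(ramp, alpha=[2.0, -1.0])
residual = [s - s_hat for s, s_hat in zip(ramp[2:], predicted)]
```

In a real coder the coefficients are estimated per frame so that the residual energy is minimized; that estimation step is outside the scope of this sketch.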
However, the LPC coefficients fluctuate greatly, and an error in a single LPC coefficient affects the entire frequency domain. Therefore, the LPC coefficients are not suitable for direct quantization and need to be further transformed into line spectrum frequency (LSF) parameters. To further balance bit rate and quantization accuracy, vector quantization is used to search the codebook for the codeword vector $\vec{C}_k$ that is closest, under a given distance measure, to the vector $\vec{p}$ to be quantized, and the index $k$ of that codeword vector is taken as the quantization result.

B QIM STEGANOGRAPHY
The intrinsic essence of QIM steganography is that there is redundancy in the quantization codebook, and the sub-optimal codebook parameters caused by steganography have little impact on the speech quality.
Chen et al. [39] first proposed QIM as a steganography method suitable for static digital carriers such as image, text, audio and video. Assume that the secret information to be transmitted comes from the set $S = \{s_k \mid 1 \le k \le n\}$ and that the sender wants to hide the secret information $s_k$. First, the codebook $D$ is divided into $n$ disjoint subsets $C = \{c_k \mid 1 \le k \le n\}$. Then the sender establishes the mapping $f : s_k \to c_k$. For the input vector $X$ to be quantized, the codeword closest to $X$ is searched only in the sub-codebook $f(s_k)$. The receiver extracts the secret information by checking which part of the codebook the codeword belongs to.
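The embedding and extraction steps described above can be sketched as follows. This is a toy illustration under stated assumptions: the four-entry codebook, the hand-made partition, the squared Euclidean distance, and the names `qim_embed` and `qim_extract` are all hypothetical and are not taken from [39] or from any codec.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def qim_embed(x, secret, codebook, partition):
    """Quantize x using only the sub-codebook assigned to `secret`."""
    candidates = partition[secret]
    return min(candidates, key=lambda k: squared_distance(codebook[k], x))


def qim_extract(index, partition):
    """Recover the secret symbol from the sub-codebook that contains `index`."""
    for secret, subset in partition.items():
        if index in subset:
            return secret
    raise ValueError("index does not belong to any sub-codebook")


# Toy codebook split into two disjoint subsets, one per secret bit.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
partition = {0: [0, 3], 1: [1, 2]}

x = (0.9, 0.2)  # input vector to be quantized
stego_indices = [qim_embed(x, bit, codebook, partition) for bit in (0, 1)]
recovered = [qim_extract(k, partition) for k in stego_indices]
```

Here the unconstrained nearest codeword to `x` is index 1, so embedding bit 0 forces the sub-optimal index 3; this is the kind of extra quantization distortion that steganalysis tries to detect.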
In 2009, Xiao et al. [4] combined the QIM method with VQ in the encoding process of compressed speech, and proposed a novel steganography algorithm based on complementary neighbor vertices (CNV). Given $N$ codewords, each of which is $m$-dimensional, Xiao et al. used graph theory to establish a graph $G(V, E)$ in the code space, where the vertex $V_i$ is the $i$-th codeword in the codebook. Each edge represents a certain relationship between codewords, and the weight of an edge is defined as the Euclidean distance between the two codewords it connects. Xiao et al. gave a graph construction algorithm and proved that the graph is two-colorable. In the process, vertices of the same color are assigned to the same subset. The coloring operations are repeated until all vertices have been assigned, yielding the partitioned subsets of the codebook. Finally, each codeword lies in the opposite part to its nearest neighbor. Suppose $X$ is the input value to be quantized. In this case, the additional quantization distortion caused by CNV steganography can be given.
It can be proved that the algorithm minimizes signal distortion and significantly improves the undetectability and robustness of CNV steganography. This paper targets steganalysis of the CNV algorithm.

IV CODEWORD CORRELATION MODELING AND ANALYSIS
In order to fully describe the correlation between codewords in LPC, we use a Bayesian network (BN) to model the codewords, and then use the model to analyze the correlation. A BN can be represented as a 2-tuple $\langle G, \theta \rangle$, where $G = (V, E)$ denotes a directed acyclic graph and $\theta$ denotes a set of conditional probabilities, called network parameters.
Suppose there are $S$ frames, each of which contains $N$ codewords. $V$ and $E$ represent the set of vertices and the set of edges in the directed graph $G$, respectively. Once the vertices and edges of $G$ are determined, the network parameters $\theta$ can be computed to characterize the dependencies between the vertices, with each node $\Lambda_i$ conditioned on its set of parent nodes $V_i$.

The construction of a Bayesian network includes structure learning and parameter learning, and parameter learning depends on structure learning. Structure learning refers to finding, for a given data set $D = \{D_1, D_2, \cdots, D_n\}$, the network structure that fits the data as closely as possible. In this paper, the K2 algorithm based on Bayesian scoring rules is used to find the network with the largest probability under a given data set. According to Bayes' formula,
$$P(G \mid D) = \frac{P(G)\,P(D \mid G)}{P(D)},$$
where $P(G)$ is the prior knowledge of the network structure $G$. Since the data set $D$ is known information and $P(D)$ is independent of the network structure, we have
$$\arg\max_{G} P(G \mid D) = \arg\max_{G} P(G)\,P(D \mid G). \tag{7}$$
Because the logarithm is monotonically increasing, maximizing $P(G)P(D \mid G)$ is equivalent to maximizing $\log P(G) + \log P(D \mid G)$, so the Bayesian score is defined as
$$\mathrm{Score}(G \mid D) = \log P(G) + \log P(D \mid G).$$
Assume that the prior distribution of the parameters $\Theta$ is a Dirichlet distribution. Let $r_i$ be the number of values of the $i$-th variable, $q_i$ the number of possible values of the parent nodes of the $i$-th variable, $m_{ijk}$ the number of samples in which the $i$-th node takes its $k$-th value while its parent nodes take their $j$-th value, and $\alpha_{ijk}$ a hyper-parameter, with $\alpha_{ij*} = \sum_k \alpha_{ijk}$ and $m_{ij*} = \sum_k m_{ijk}$. Then
$$P(D \mid G) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij*})}{\Gamma(\alpha_{ij*} + m_{ij*})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + m_{ijk})}{\Gamma(\alpha_{ijk})},$$
where $\Gamma(\cdot)$ is the gamma function and $n$ is the number of variables. It has been shown that the K2 algorithm can recover the Bayesian network almost exactly when the node ordering is completely correct.
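As a hedged illustration of the Dirichlet-based score described above, the sketch below evaluates the contribution of a single node to log P(D|G) from its count table m_ijk. It assumes a uniform hyper-parameter α_ijk = 1 (the setting usually associated with the K2 metric), ignores the structure prior log P(G), and uses a hypothetical function name; it is not the toolbox implementation used in the experiments.

```python
from math import lgamma


def log_family_score(counts, alpha=1.0):
    """Log-score of one node given its parents.

    counts[j][k] is m_ijk: the number of samples in which the node takes its
    k-th value while its parents take their j-th configuration.
    alpha is the (assumed uniform) Dirichlet hyper-parameter alpha_ijk.
    """
    score = 0.0
    for row in counts:
        alpha_ij = alpha * len(row)   # alpha_ij* = sum over k of alpha_ijk
        m_ij = sum(row)               # m_ij*     = sum over k of m_ijk
        score += lgamma(alpha_ij) - lgamma(alpha_ij + m_ij)
        for m_ijk in row:
            score += lgamma(alpha + m_ijk) - lgamma(alpha)
    return score


# Binary node whose value copies a binary parent: 10 samples per configuration.
with_parent = log_family_score([[10, 0], [0, 10]])
# The same 20 samples scored as if the node had no parent at all.
without_parent = log_family_score([[10, 10]])
# Smallest non-trivial case, equal to -log(6) under alpha = 1.
tiny = log_family_score([[1, 1]])
```

The structure with the edge scores higher (`with_parent > without_parent`), which is the comparison a greedy search such as K2 repeats when deciding whether to add a parent to a node.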
In order to verify the effectiveness of the BN, we select a 40-second speech segment, compress it with a G.729 vocoder, and then extract 4000 sets of quantized codewords. In the experiment, we construct the BN with 9 vertices and then perform parameter learning. The network structure learned with the above K2 algorithm is shown in Fig. 1. The intra-frame codeword correlation is mainly reflected between codewords $l_1$ and $l_2$ and between codewords $l_1$ and $l_3$, and the inter-frame correlation is mainly reflected in the first codewords of two consecutive frames. How, then, can the link strength between different codewords in this network be measured and visualized? For that purpose, Imme [40] proposed a measurement method for discrete Bayesian networks based on mutual information and conditional mutual information. In this method, $X$ and $Z$ are both parent nodes of $Y$, $P(y \mid x, z)$ is given by the conditional probability table of $y$ given $x$ and $z$, and the link strength is defined from these quantities, where $\#(X)$ denotes the number of discrete states of $X$, and so on. Conveniently, the LinkStrength package has been implemented for the Bayes Net Toolbox (BNT) in MATLAB. The package provides functions to calculate and visualize entropy, connection strength and link strength for discrete Bayesian networks.
For simplicity, we only use link strength in this paper. Fig. 2 shows the Blind Average Link Strength. In the link strength graph, the value of each link strength is indicated by the number next to the arrow. As Fig. 2 indicates, most links are quite strong. In particular, the link strengths between the first codewords of two consecutive frames are 3.472 and 3.582, respectively, which are the two largest values. This demonstrates that the correlation between consecutive frames is the strongest. Next, it can be observed that in each of the three consecutive frames, the link strength between the first and third codewords is greater than that between the first and second codewords. For example, in the first frame, the former value is 1.996 and the latter is 1.953, a difference of 0.043 (about 2.2 % higher). This implies that the correlation between the first and third codewords is stronger than that between the first and second codewords. Furthermore, the absence of links between other vertices does not mean that there are no correlations between them; those correlations are simply too weak and were pruned by the learned model. Of course, the weak links can be measured by manually adding the link relationship to the graph.
As can be seen, correlations between codewords in LPC are complex, and a novel method is needed to improve on traditional detection methods. Steganalysis based on deep learning can automatically extract the intrinsic features of the carrier, avoiding the complexity of establishing an explicit model. Therefore, we propose a steganalysis method that combines the advantages of RNN and the attention mechanism.

Figure 3: The Model Based on Hierarchical Attention Network.

V QIM STEGANALYSIS BASED ON F3SNet
We can now formally present F3SNet, an architecture based on a hierarchical attention network. The structure is shown in Fig. 3. It includes an embedding layer, a multi-layer attention layer and a classifier. The multi-layer attention layer adopts a two-layer structure and includes a single codeword encoder, a codeword attention layer, a codeword sequence encoder, and a codeword sequence attention layer.
The steganography classification is briefly summarized as follows. An input array is fed in, and the codeword vectors and codeword sequence matrices are obtained. The codeword vectors are taken as the input and sent to the first attention layer. The compressed vector representations of the codewords are provided by the LSTM, and then the important vectors that reflect the correlation of the codewords are extracted by the attention mechanism. Simultaneously, the codeword sequence matrices enter the second attention layer. After the same operation, a sequence-level representation that summarizes all the information in the entire speech sample is obtained. Finally, the obtained representations are used as classification features by a fully connected network to achieve steganography classification. For the convenience of verification, we choose Keras as the steganalysis framework. Below we describe the details of the different components.

A Input
As we know, speech has a hierarchical structure similar to that of a document, which can be divided into different sentences, and each sentence contains a corresponding number of words. As a result, one speech can be divided into codeword sequences and codewords. Each codeword sequence and codeword contain unique information. To fully mine this information, we use a hierarchical attention network to model the structure of the quantized codewords. Here, two types of input data with different shapes are required.
Assume that there are $S$ frames in a given speech sample of duration $L$ (in seconds). We extract the codeword indices and pack all indices of a speech sample into a vector $X$ of size $S \times 3$. $X_1$ is the first-layer input, with the format
$$X_1 = [l_{00}, l_{10}, l_{20}, l_{01}, l_{11}, l_{21}, \cdots, l_{0(S-1)}, l_{1(S-1)}, l_{2(S-1)}], \tag{12}$$
where $l_{ij}$ ($0 \le i \le 2$, $0 \le j \le S-1$) denotes the $i$-th index in the $j$-th frame. For the second-layer input, we take the speech of length $L$ as a unit and pack the codeword indices of the $S$ frames into a matrix
$$X_2 = \begin{bmatrix} l_{00} & l_{01} & \cdots & l_{0(S-1)} \\ l_{10} & l_{11} & \cdots & l_{1(S-1)} \\ l_{20} & l_{21} & \cdots & l_{2(S-1)} \end{bmatrix}.$$
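The two input layouts can be sketched in a few lines. The function name `pack_inputs` and the representation of a sample as a list of per-frame index triples are assumptions for illustration; only the ordering of the indices follows the definitions of X_1 and X_2 above.

```python
def pack_inputs(frames):
    """Build the first- and second-layer inputs from per-frame codeword indices.

    frames is a list of S triples (l_0j, l_1j, l_2j), one triple per frame j.
    """
    # X_1: frame-by-frame flattening, i.e. l_00, l_10, l_20, l_01, l_11, ...
    x1 = [index for frame in frames for index in frame]
    # X_2: a 3 x S matrix whose i-th row holds the i-th index of every frame.
    x2 = [[frame[i] for frame in frames] for i in range(3)]
    return x1, x2


# Two frames with made-up index values.
x1, x2 = pack_inputs([(10, 20, 30), (11, 21, 31)])
```

For this two-frame example, `x1` has length S × 3 = 6 and `x2` has three rows of length S = 2.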

B Embed
The embedding layer is used as the first hidden layer in our model; it converts the quantized codeword index sequence (QIS) into a fixed-size vector sequence. Through the embedding layer, a continuous, distributed QIS representation is obtained that can effectively characterize the correlations between different codewords. In principle, a two-dimensional tensor with shape (batchsize, S × 3) is fed into the embedding layer, and its entries are used as indices to select rows of the internal trainable weight matrix $W_{Max\_num \times D}$, where $D$ represents the output dimension of the embedding layer. In our experiment, the matrix $W_{Max\_num \times D}$ is initialized randomly, regarded as part of the deep learning model, and updated during the model learning process. After multiple epochs, the correlations between codewords are expressed in the learned weights. Using these learned weights, the final output is a three-dimensional tensor with shape (batchsize, S × 3, D), which is the encoded representation.
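The index-to-vector lookup performed by the embedding layer can be illustrated without any deep learning framework. The sketch below is an assumption-laden stand-in, not the Keras layer itself: the names `make_embedding` and `embed`, the uniform initialization range, and the toy sizes are all hypothetical, and no training is performed.

```python
import random


def make_embedding(max_num, dim, seed=0):
    """Randomly initialized weight matrix W with shape (max_num, dim)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(max_num)]


def embed(batch, weights):
    """Map a (batchsize, S*3) array of indices to (batchsize, S*3, D) vectors."""
    return [[weights[index] for index in sequence] for sequence in batch]


W = make_embedding(max_num=8, dim=4)      # toy sizes, not the paper's values
batch = [[1, 5, 1], [0, 0, 7]]            # two sequences of three indices each
encoded = embed(batch, W)
```

Every occurrence of the same index selects the same row of `W`, so the layer's output changes only when training updates that row.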
As can be seen in Section C, the comparison between Model #1 and #4 shows that the embedding layer can significantly improve the classification accuracy.

C Encode
The embedding layer is followed by the LSTM coding layer. The LSTM processes the encoded sequence from left to right through three gates (forget gate, input gate, output gate), and returns an ordered list of hidden states $\{h_1, h_2, \cdots, h_T\}$ as well as an ordered list of output vectors $\{y_1, y_2, \cdots, y_T\}$. As shown in Fig. 4, the LSTM cell remembers values over arbitrary time intervals while the three gates regulate the flow of information into and out of the cell.
There are eight groups of parameters to be learned in the LSTM network: the weight matrices and corresponding bias terms of the three gates and of the cell state. The parameters are defined as follows: forget gate weight matrix $W_f$ and its bias term $b_f$, input gate weight matrix $W_i$ and its bias term $b_i$, output gate weight matrix $W_o$ and its bias term $b_o$, and cell state weight matrix $W_c$ and its bias term $b_c$. For clarity, the four weight matrices are further subdivided into $W_{if}$, $W_{hf}$, $W_{ii}$, $W_{hi}$, $W_{io}$, $W_{ho}$, $W_{ic}$ and $W_{hc}$. Taking the forget gate as an example, we describe how the control factor, which determines how much memory is retained, is calculated. In each LSTM cell, the two weight matrices of the forget gate are the input weights ($W_{if}$) and the hidden node feedback weights ($W_{hf}$). First, the network output $h_{t-1}$ at time $t-1$ is combined with the current network input $x_t$ and linearly transformed to obtain $u_f^T$:
$$u_f^T = W_{if}\,x_t + W_{hf}\,h_{t-1} + b_f.$$
Then $u_f^T$ is mapped to the range 0 to 1 by a nonlinear activation function to obtain the control factor $f_t$ of the forget gate:
$$f_t = \sigma(u_f^T),$$
where $\sigma$ denotes the sigmoid function. In a similar way, the control factor $i_t$ of the input gate and the control factor $o_t$ of the output gate can be calculated. At each time step $t$, the LSTM cell outputs two vectors, the memory $c_t$ of the current block and the output $h_t$ of the current block, i.e.
$$c_t = f_t \cdot c_{t-1} + i_t \cdot f(W_{ic}\,x_t + W_{hc}\,h_{t-1} + b_c), \qquad h_t = o_t \cdot f(c_t),$$
where $f(\cdot)$ represents the activation function (either ReLU or Tanh is used in the LSTM cell) and the symbol '$\cdot$' means element-wise multiplication. Finally, the LSTM gives an output sequence of dimension $L \times P \times Q$ ($Q = M \times D$), where $L$ is the length of the samples, $P$ is the batch size, $M$ is the hidden size and $D$ is the network direction ($D = 1$ indicates a unidirectional network; $D = 2$ indicates a bidirectional network). In this work, the output vectors $H^{LSTM}_T = [h_1, \cdots, h_T]$ of the LSTM layer further serve as input for the attention layer.
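A single LSTM time step built from the eight parameter groups named above can be sketched in plain Python. This is an illustrative sketch under assumptions: it uses the standard LSTM update (sigmoid gates, Tanh as the activation f), plain lists instead of tensors, and hypothetical helper names; it is not the Keras implementation used in the experiments.

```python
import math


def matvec(W, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gate(W_in, W_hid, b, x_t, h_prev, act):
    """act(W_in x_t + W_hid h_prev + b), applied element by element."""
    u = [a + c + d for a, c, d in zip(matvec(W_in, x_t), matvec(W_hid, h_prev), b)]
    return [act(v) for v in u]


def lstm_step(x_t, h_prev, c_prev, p, act=math.tanh):
    """One LSTM time step; p holds the weight matrices and bias terms."""
    f_t = gate(p["W_if"], p["W_hf"], p["b_f"], x_t, h_prev, sigmoid)  # forget gate
    i_t = gate(p["W_ii"], p["W_hi"], p["b_i"], x_t, h_prev, sigmoid)  # input gate
    o_t = gate(p["W_io"], p["W_ho"], p["b_o"], x_t, h_prev, sigmoid)  # output gate
    g_t = gate(p["W_ic"], p["W_hc"], p["b_c"], x_t, h_prev, act)      # candidate memory
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    h_t = [o * act(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


# With all-zero parameters every gate equals 0.5 and the candidate memory is 0,
# so the cell simply halves its previous memory.
params = {}
for name in ("f", "i", "o", "c"):
    params["W_i" + name] = zeros(2, 3)   # input weights: hidden size 2, input size 3
    params["W_h" + name] = zeros(2, 2)   # hidden feedback weights
    params["b_" + name] = [0.0, 0.0]
h_t, c_t = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -2.0], params)
```

The all-zero example is only a sanity check of the gating arithmetic; trained weights would make each gate depend on the current input and the previous hidden state.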

D Attend
As mentioned above, the encoder is able to keep much more information by distributing it among all its vectors. However, not all vectors contribute equally to the final classification. Hence, an attention mechanism is introduced to extract the vectors that are important to steganalysis and to aggregate the representations of those informative vectors into the feature vectors. Attention can be divided into two steps: the first calculates the attention distribution based on all input information, and the second calculates the weighted average of the input information based on the attention distribution.
Let $u_t$ denote the output of a dense layer applied to the encoder output $h_t$, where $W$ is the parameter matrix of the dense layer. The attention distribution is then derived by comparing the output $u_t$ of the dense layer with a trainable context vector $u$ and normalizing with a softmax,
$$\alpha_t = \frac{\exp(u_t^{\top} u)}{\sum_{t'=1}^{T} \exp(u_{t'}^{\top} u)}.$$
Using the scaled dot-product model, the scoring function is $s(u_t, u_n) = \frac{u_t^{\top} u_n}{\sqrt{D}}$, where $D$ is the dimension of the input vector. Let $\alpha_{nj}$ represent the weight that the $n$-th output assigns to the $j$-th input. For each position $n$, the output vector $h_n$ is the weighted average of the input vectors under these weights, where $n, t \in [1, T]$ are the positions in the output and input vector sequences. Finally, the output vector sequence $H^{ATT} = [h_1, \cdots, h_T]$ containing the most information is obtained, which is used as the classification feature for steganalysis.
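The two attention steps (attention distribution, then weighted average) can be sketched for the context-vector variant as follows. The Tanh activation and bias in the dense layer, the function names, and the toy shapes are assumptions for illustration; the scaled dot-product variant mentioned above would replace the score computation but keep the same softmax and weighted-average structure.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(H, W, b, u):
    """Context-vector attention over T encoder outputs.

    H: T vectors of size Q (encoder outputs h_t)
    W, b: dense-layer parameters (A x Q matrix and bias of size A); the tanh
          activation is an assumption of this sketch
    u: trainable context vector of size A
    """
    U = [
        [math.tanh(sum(w * h for w, h in zip(row, h_t)) + b_k) for row, b_k in zip(W, b)]
        for h_t in H
    ]
    scores = [sum(a * c for a, c in zip(u_t, u)) for u_t in U]
    alpha = softmax(scores)                       # step 1: attention distribution
    pooled = [                                    # step 2: weighted average of inputs
        sum(alpha[t] * H[t][q] for t in range(len(H)))
        for q in range(len(H[0]))
    ]
    return alpha, pooled


# With zero dense weights all scores tie, so attention is uniform and the
# pooled vector is the plain mean of the encoder outputs.
H = [[1.0, 2.0], [3.0, 4.0]]
alpha, pooled = attention_pool(H, W=[[0.0, 0.0], [0.0, 0.0]], b=[0.0, 0.0], u=[1.0, 1.0])
```

Once the dense layer and context vector are trained, the weights stop being uniform and the positions most indicative of embedding receive larger `alpha` values.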

E Classify
Finally, after several neural network layers, high-level reasoning in F3SNet is done via a fully connected (FC) classifier, shown in Fig. 3. The FC layer calculates the probability that the speech sample belongs to the normal set or the stego set according to the weights obtained by training. No matter how many FC layers are passed, the classifier is still regarded as a linear transformation, which implements the conversion from the $P \times Q$ feature matrix to the $P \times 2$ classification result matrix. Assume that the parameters in the FC layers of our network, namely the weights and bias terms, are denoted by $W_F$ (size $2 \times Q$) and $b_F$ (size 2), respectively. Note that each batch of samples shares the same set of parameters. Denoting the $P \times Q$ feature matrix by $Z$, the output array $y$ (size $P \times 2$) can be calculated as
$$y = \sigma(Z\,W_F^{\top} + b_F),$$
where $\sigma$ is the sigmoid function.
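The mapping from a P × Q feature matrix to a P × 2 result matrix can be sketched as below. The symbol Z for the feature matrix, the function name `fc_classify`, and the toy weights are assumptions for illustration; only the shapes of W_F and b_F and the sigmoid output follow the description above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fc_classify(Z, W_F, b_F):
    """Apply y = sigmoid(Z W_F^T + b_F) row by row.

    Z: P x Q feature matrix, W_F: 2 x Q weights, b_F: bias of size 2.
    """
    return [
        [
            sigmoid(sum(w * z for w, z in zip(W_F[c], row)) + b_F[c])
            for c in range(len(W_F))
        ]
        for row in Z
    ]


# Toy example with P = 2 samples and Q = 3 features.
Z = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]
W_F = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
b_F = [0.0, 0.0]
y = fc_classify(Z, W_F, b_F)
```

All P rows are multiplied by the same `W_F` and `b_F`, which reflects the statement that each batch of samples shares one set of parameters.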

A Experimental Setup
As is well known, there is no public database for speech steganography and steganalysis, and almost all of the literature uses self-generated speech samples for experimentation. To facilitate the comparison of algorithm performance, we use the speech sample set published by Lin et al. on GitHub (https://github.com/fjxmlzn/RNN-SM/). We divide the original samples into equal-length 5-second segments, and then convert the audio into PCM format (8 kHz sampling rate, 16 bits per sample, stereo) using Cool Edit Pro 2.1. Finally, a cover database with a total of 5120 different speech samples was established.
As described in Section B, one steganography method is involved in the experiment, namely CNV steganography [4]. For each sample in the cover database, randomly generated secret bits are embedded into the cover speech. The actual number of embedded bits depends on the sample length and the embedding rate. Sample length and embedding rate also have a direct impact on the detection accuracy of the proposed steganalysis algorithm. The normal signals are assigned to the negative category and the stego signals to the positive category, and samples are drawn from both categories to construct the training set and the test set. To evaluate the performance of F3SNet, three statistical indicators are used to measure its classification efficacy: false positive rate (FPR), false negative rate (FNR), and accuracy (ACC).
Firstly, in order to evaluate the effect of different sample lengths on the performance of F3SNet, we use sample lengths of 0.1 s, 0.2 s, 0.4 s, 0.6 s, 0.8 s, 1 s, 2 s, 4 s and 5 s with 20 % and 40 % embedding rates, respectively. As mentioned before, many existing algorithms have good detection accuracy for large-sized samples, but they do not perform well for small-sized samples. Therefore, we focus on how well F3SNet performs for small-sized samples.
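To give a sense of how little data the shortest samples contain, the sketch below converts each sample length listed above into a number of frames and codeword indices. It assumes the 10 ms frame duration of G.729 (consistent with the 4000 codeword sets extracted from a 40-second segment earlier) and three indices per frame, matching the S × 3 input layout; the names are hypothetical.

```python
FRAME_MS = 10            # G.729 frame duration in milliseconds
INDICES_PER_FRAME = 3    # quantization indices per frame (S x 3 layout)


def sample_shape(length_s):
    """Return (number of frames S, number of codeword indices) for a sample."""
    frames = round(length_s * 1000) // FRAME_MS
    return frames, frames * INDICES_PER_FRAME


lengths = [0.1, 0.2, 0.4, 0.6, 0.8, 1, 2, 4, 5]
shapes = {length: sample_shape(length) for length in lengths}
```

A 0.1 s sample therefore provides only 10 frames (30 indices) to the detector, which is why short samples are the hard case.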
Then, to evaluate the effectiveness of F3SNet at different embedding rates, the normal signals and the stego signals with different embedding rates (ER) are grouped. The embedding rates for the G.729 encoder are chosen to be 100 %, 80 %, 60 %, 40 %, 20 % and 10 %. At the same time, we focus on the performance of F3SNet for small-sized samples, so the sample length is set to 0.2 s and 1 s in the experiment. A separate steganography dataset is built for each embedding rate.
Third, as described above, researchers have successively developed a variety of steganalysis methods for steganography in compressed speech. Typical algorithms are IDC [12], QCCN [13], RNN-SM [20], and FCEM [22]. Below we compare the performance of these state-of-the-art algorithms with that of F3SNet at different sample lengths and embedding rates.

B Determining Hyper-Parameters of F3SNet
For deep learning models, it is first necessary to determine the hyperparameters used for training. In our model, these include the output dimension of the embedding layer, the number of LSTM hidden units, the number of recurrent LSTM layers, the dropout rate, the batch size, the number of epochs, and so on. All of these hyperparameters are determined by cross-validation on the training set and the validation set. Taking into account classification accuracy and training time, we collect a total of 102,400 speech samples with a length of 1 s (cut from the above database) and divide them into a training set and a validation set in a 7 : 3 ratio. The Adam optimizer is used for model training, with the learning rate left at its default value.
In our implementation, the programs run on a single GPU of a deep learning server that has an "Intel (R) Xeon (R) CPU E5-2620 V4 @ 2.10GHZ", 64 GB of memory, and 4 NVIDIA GeForce GTX 2080Ti GPUs. The memory size and processing power of each GPU are 11 GB and 11.3 TFLOPS in double precision, respectively, which is normally enough to accommodate most deep learning architectures. Based on the GPU server resources in our lab, the final parameters are as follows: the batch size is 128, the dimension of the embedding layer is 100, the number of hidden units is 100 for the word LSTM and 50 for the sentence LSTM, the number of recurrent LSTM layers is 1, dropout = 0.5, and dropout_recurrent = 0.
It is worth mentioning that the current parameter values are not necessarily optimal, and one may find a more balanced point of accuracy and time cost through experiments.
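As a minimal sketch, the reported settings and the 7 : 3 split can be written down as follows. The dictionary keys, the function name, and the fixed random seed are illustrative assumptions and do not come from the authors' code. Only the numeric values are taken from the text.

```python
import random

# Values reported in the text; the key names are hypothetical.
HPARAMS = {
    "embedding_dim": 100,
    "word_lstm_units": 100,
    "sentence_lstm_units": 50,
    "lstm_layers": 1,
    "dropout": 0.5,
    "recurrent_dropout": 0.0,
    "batch_size": 128,
    "optimizer": "adam",  # learning rate left at the optimizer default
}


def split_indices(n_samples, train_parts=7, val_parts=3, seed=0):
    """Shuffle sample indices and split them train:val = train_parts:val_parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Integer arithmetic avoids floating-point rounding at the cut point.
    cut = n_samples * train_parts // (train_parts + val_parts)
    return indices[:cut], indices[cut:]


train_idx, val_idx = split_indices(102_400)

# Number of mini-batches per epoch (ceiling division).
steps_per_epoch = -(-len(train_idx) // HPARAMS["batch_size"])
print(len(train_idx), len(val_idx), steps_per_epoch)
```

With 102,400 samples this gives 71,680 training and 30,720 validation samples, which at a batch size of 128 is 560 mini-batches per epoch.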

C Comparison of Different Network Models
Different models have different learning capabilities. Generally speaking, the more complex the model, the stronger its learning capability, but the greater the resource overhead. Here we use classification accuracy and training time as evaluation metrics to compare six types of models, as shown in Table 1. Model #1 is F3SNet, which uses a hierarchical attention model. Models #2, #3, and #4 are variants obtained by modifying the proposed model: model #2 uses only a single-layer attention structure, model #3 does not use an LSTM layer, and model #4 does not use an embedding layer. In addition, models #5 and #6 are two previously proposed deep learning models [22,20], which are included for comparison.
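To make the role of a single attention layer concrete, the following sketch pools a sequence of vectors into one vector using additive attention in the form commonly used in hierarchical attention networks. It is an illustration under that assumption and may differ in detail from the attention layer in F3SNet. The parameter names (W, b, u_context) and the toy numbers are hypothetical.

```python
import math


def attention_pool(hidden_states, W, b, u_context):
    """Pool T vectors into one vector with additive attention.

    hidden_states: list of T vectors, each of length d
    W: d_a x d weight matrix, b: bias of length d_a
    u_context: context vector of length d_a
    Returns (pooled_vector, attention_weights).
    """
    scores = []
    for h in hidden_states:
        # u = tanh(W h + b)
        u = [math.tanh(sum(w * x for w, x in zip(row, h)) + b_i)
             for row, b_i in zip(W, b)]
        # Relevance score: dot product with the context vector.
        scores.append(sum(u_i * c_i for u_i, c_i in zip(u, u_context)))

    # Numerically stable softmax over the T scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the input vectors.
    dim = len(hidden_states[0])
    pooled = [sum(a * h[k] for a, h in zip(weights, hidden_states))
              for k in range(dim)]
    return pooled, weights


# Toy example with made-up numbers.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
u_ctx = [1.0, 1.0]
pooled, weights = attention_pool(H, W, b, u_ctx)
print(weights)
```

A hierarchical model applies such a pooling step twice, once within each lower-level unit and once across the resulting unit vectors, whereas a single-layer variant applies it only once.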
For the classification accuracy metric, the experiment uses 1 s speech samples, and the embedding rate starts from 0.1 and increases in steps of 10 %. After 10 iterations, the maximum accuracy is plotted on the Y-axis against the embedding rate on the X-axis, which gives Fig. 6. As the embedding rate increases, the classification accuracy of all models improves significantly, and the accuracy of F3SNet is the best at all embedding rates, which shows that the model has good steganographic feature learning capabilities. However, it can be seen from Fig. 7 that the training time of model #1 is relatively long, which is the price paid for the improved accuracy. In some applications, time overhead is an "acceptable metric" and accuracy is a "satisficing metric": the classifier just has to be "good enough" on this metric. Our model can be applied in these situations.

D Performance Testing
[...] significantly better than the state-of-the-art algorithms. In addition, for each fixed embedding rate, the detection accuracy increases with the sample length: the longer the sample, the higher the detection accuracy. When the sample length is increased to 5 s, the detection accuracy of the proposed algorithm at the above two embedding rates reaches 95.46 % and 99.9 %, respectively. Furthermore, the gain is not uniform across the range of lengths, and most of the improvement occurs for short samples. Taking the embedding rate of 40 % as an example, increasing the sample length from 0.1 s to 1 s raises the detection accuracy by 10.65 %, whereas increasing it from 1 s to 5 s raises it by only 5.27 %.
From another angle, we can make some observations about FNR and FPR. Regardless of the embedding rate, the FNR at each sample length is significantly greater than the FPR, which shows that the missed detection rate of our algorithm is higher than its false alarm rate. Therefore, the algorithm is suitable for application environments that do not place strict requirements on the missed detection rate, such as online real-time detection.

D.2 Test Results at Different Embedding Rates
This experiment evaluates the performance of F3SNet with fixed sample lengths and different embedding rates. The results are shown in Table 3. They show a positive relationship between the detection accuracy and the embedding rate (ER in Table 3). For samples with a length of 0.2 s, the detection accuracy is 62.3 % when the embedding rate is 10 %, rises to 87.11 % at an embedding rate of 40 %, and reaches 98.88 % at an embedding rate of 100 %.
At the same time, for fixed-length samples, increasing a low embedding rate by a given amount produces a clear increase in accuracy, but once the embedding rate reaches a certain value, further increases yield only small gains. For example, for a 0.2 s sample, as the embedding rate goes from 20 % to 100 % in steps of 20 %, the successive increases in detection accuracy are 12.4, 7.27, 2.84, and 1.66 percentage points. In addition, two conclusions can be drawn from the horizontal comparison of different sample lengths. First, the longer the sample, the higher the detection accuracy. Second, the lower the embedding rate, the more significantly the detection accuracy increases when the sample length increases by a given amount. This conclusion is consistent with the first experiment.
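The diminishing gains can be checked with a short calculation. The sketch below assumes that the quoted increases are percentage-point differences between consecutive embedding rates for 0.2 s samples. It starts from the 87.11 % reported at a 40 % embedding rate. The values at 60 % and 80 % are derived here from the quoted increases and are not read from Table 3.

```python
# Accuracies (in %) reported in the text for 0.2 s samples.
acc_40 = 87.11
acc_100_reported = 98.88

# Quoted increases for the steps 20->40, 40->60, 60->80, 80->100.
gain_20_to_40 = 12.4
later_gains = [7.27, 2.84, 1.66]

# Reconstruct the curve at embedding rates 20, 40, 60, 80, 100 %.
acc_20 = acc_40 - gain_20_to_40
curve = [acc_20, acc_40]
for gain in later_gains:
    curve.append(curve[-1] + gain)
curve = [round(a, 2) for a in curve]

for rate, acc in zip([20, 40, 60, 80, 100], curve):
    print(f"ER {rate:3d} %: {acc:.2f} %")
```

Under this assumption the reconstructed endpoint matches the 98.88 % reported at a 100 % embedding rate, and each successive 20 % step adds less than the previous one.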

D.3 Comparison with Existing Algorithms
First, we compare the detection accuracy of the various algorithms for different sample lengths (0.2 s, 0.4 s, 0.6 s, 0.8 s, 1 s, 2 s) at embedding rates of 20 %, 40 %, and 60 %. The results are shown in Fig. 8, Fig. 9 and Fig. 10. As the sample length increases, the detection accuracy of all compared algorithms keeps increasing, and FNR and FPR keep decreasing, despite occasional fluctuations. In addition, according to the performance curves in Fig. 8, the five algorithms can be divided into three performance ranges: the traditional machine learning-based detection algorithms IDC and QCCN perform poorly, RNN-SM is in the middle, and FCEM and F3SNet perform best. Among all the algorithms, F3SNet performs best. On average, F3SNet leads RNN-SM by about 15.41 % and FCEM by about 2.48 %. Furthermore, two conclusions can be drawn from the longitudinal comparison of the three graphs. First, at an embedding rate of 20 %, ACC, FPR, and FNR fluctuate significantly, indicating that steganography detection is less reliable at this rate and is susceptible to noise. Second, when the sample length is fixed, the higher the embedding rate, the higher the detection accuracy and the lower the FPR and FNR. For example, with a fixed length of 0.2 s, the accuracy of F3SNet is about 74 % at an embedding rate of 20 % and increases to about 87 % at 40 %.
Second, to further evaluate the performance of F3SNet, we compare the detection accuracy of the different algorithms at different embedding rates (10 %, 20 %, 40 %, 60 %, 80 %, and 100 %). Here we select three sample lengths, 0.2 s, 0.8 s, and 2 s, for the experiment. The results are presented in Fig. 11, Fig. 12 and Fig. 13.
The results also show that as the embedding rate increases, the detection accuracy of all algorithms increases, and F3SNet has the best performance among all algorithms. For example, when the embedding rate is 20 %, the detection accuracy of F3SNet reaches 74.71 %, whereas the other algorithms reach 63 %, 62.2 %, 64.95 %, and 72.31 %, respectively. Therefore, IDC, QCCN and RNN-SM can hardly achieve effective detection in this case. When the embedding rate increases to 40 %, the detection accuracy of F3SNet is 87.11 %, which shows that F3SNet is already a detector with excellent detection capability in this case.

VII CONCLUSION AND FUTURE WORK
In this paper, we focus on how to use a hierarchical attention network (HAN) to detect the disparities in the correlation of LPC coefficients before and after steganography. First, to demonstrate the existence and complexity of the correlation, we perform Bayesian network modeling on the quantized codeword indices and then calculate the link strength between different nodes as a measure of the strength of the codeword correlation. Then, we propose a four-step strategy for QIM steganalysis based on HAN, which can automatically extract the features reflecting the correlation. In the proposed model, the LSTM layer and the attention layer are the two core components. The former captures possible dependencies in the codebook structure through its memory of time series, and the latter further determines which vectors have a greater impact on the final classification result, thereby effectively avoiding information overload. Experimental results show that even for speech with a length of 0.2 s, F3SNet can effectively detect QIM steganography at an embedding rate of 20 % and outperforms RNN-SM by an average of 15.4 %. In the future, we will study the application of HAN to the detection of other steganography methods for compressed speech.