Speech Bandwidth Extension Using DWT-FFT-Based Data Hiding

A novel transform-domain speech bandwidth extension algorithm is proposed to transmit information about the missing speech frequencies over a hidden channel, i


Introduction
The majority of telephone connections are restricted to speech frequencies below 4 kHz, termed narrowband (NB), which causes the characteristic sound of telephony speech. To improve speech quality, speech frequencies up to 8 kHz, called wideband (WB), are desired. However, the required modifications to today's telephone network infrastructure are expensive and time-consuming, and they have turned out to be the main obstacle to introducing high-quality speech transmission in existing networks [1].
An alternative approach to enhance the quality of the received NB signal is artificial bandwidth extension (ABE) [2], where the bandwidth of the NB signal is artificially extended at the receiving end. The source-filter model divides bandwidth extension (BE) into excitation signal extension and WB spectral envelope estimation. Several approaches for excitation signal extension can be found in [2], [3], and several approaches for WB spectral envelope estimation can be found in [3][4][5][6]. However, ABE techniques do not provide adequately stable WB speech quality in all circumstances [7].
A newer solution to this problem is to communicate the information about the missing speech frequencies (MSFs) over a hidden channel, i.e., the related information is hidden within the NB signal using data hiding techniques [1]. Several state-of-the-art techniques for speech BE based on data hiding are available. A speech BE method has been proposed in [8], where the encoded spectral envelope parameters (SEPs) of the MSFs in the 4 to 8 kHz range, called the high-band (HIB) signal, are hidden in the NB signal to produce a composite NB (CNB) signal. An improvement over this method has been proposed in [9], where the HIB signal is encoded more efficiently by phonetic classification to generate a high-quality WB signal. A method for the enhancement of telephone speech quality has been proposed in [10], where the SEPs of the HIB signal are inserted into the least significant bits of the bit stream of the NB signal. A method for speech BE has been proposed in [11], where a quantization-based data hiding technique is employed, through which reliable transmission is attained over many typical telephony channels. In [12], the audible components of the HIB signal are embedded into the hidden channel, and it was shown that the hidden data can be reliably recovered. A speech BE method has been proposed in [13] based on insertion of the pitch-scaled frequencies of the HIB signal into the 3.4 to 4 kHz range. A method based on joint coding and data hiding (JCD) has been reported in [14] for extending the bandwidth of an NB signal. A WB signal of high quality is reconstructed in [15], [16] using the JCD technique.
Speech BE algorithms using data hiding should provide a high-quality CNB signal and a high-quality reconstructed wideband (RWB) signal. These algorithms should also be robust enough to withstand channel and quantization noises. However, many of the methods discussed above [8][9][10][11][12][13][14][15][16] fail to provide a high-quality CNB signal and RWB signal, and they are not robust against channel and quantization noises. A novel speech BE algorithm using data hiding is therefore required to enhance the CNB and RWB signal quality and to handle channel and quantization noises efficiently.
A speech steganography method has been proposed in [17]. It employs the DWT-FFT-based data hiding (DWTFFTBDH) technique to embed the parameters of the secret speech signal in the detailed wavelet coefficients of the host speech signal without degrading the quality of the host signal. It was found that this approach produces a stego speech signal that is indistinguishable from the host speech, while the secret speech signal can be recovered without any degradation in quality.
A novel speech BE algorithm using the DWTFFTBDH technique [17] is proposed to insert the encoded SEPs of the HIB signal into the detailed wavelet coefficients of the NB signal. The hidden data is retrieved at the receiving end to produce a high-quality WB signal. Furthermore, the proposed method is compatible with conventional NB terminal equipment, e.g., a plain ordinary telephone set (POTS). In other words, conventional NB receivers can still access the NB speech properly without additional hardware, while a customized receiver is able to extract the embedded information and provide a WB signal with much better quality.
The telephone network channel effects, such as quantization and channel noises, are incorporated in this paper. More elaborate proposals aiming at speech BE, such as [8] and [9], account for quantization noise, but the impact of channel noise has not been evaluated. The present work considers the code division multiple access (CDMA) technique for recovering the hidden data, as it is claimed to be robust against channel and quantization noises. In particular, each data bit to be embedded into the NB signal is spread out by multiplying it with a specific spreading sequence. The spread signals are then added up to form the hidden data. The hidden data can be reliably recovered because of the low cross-correlation between spreading sequences (Hadamard codes are employed in this work).
The paper is organized as follows. In Sec. 2, DWTFFTBDH technique for BE is introduced. Section 3 deals with the novel speech BE algorithm using DWTFFTBDH technique. The subjective and objective analyses are discussed in Sec. 4. Section 5 gives a conclusion.

DWT-FFT-Based Data Hiding Technique for BE
To hide the HIB signal $Y_{eb}(n)$ within the NB signal $Y_{nb}(n)$, the discrete wavelet transform (DWT) is first applied to $Y_{nb}(n)$ to decompose it into detailed and approximation coefficients. The fast Fourier transform (FFT) is then applied to the detailed coefficients to compute their spectrum, from which the magnitude spectrum $Y_{nb}(k)$ and the phase spectrum are calculated. Assume that $Y_{eb}(n)$ is encoded into a sequence of data bits, i.e., $C_s \in \{-1, 1\}$, $s = 0, 1, \ldots, S-1$, where $S$ denotes the total number of bits.
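The analysis step above can be illustrated with a minimal Python sketch. It assumes a one-level Haar wavelet (the wavelet family is not specified in this section) and uses a direct DFT in place of an FFT for brevity; all function names are hypothetical and chosen only for illustration.

```python
import cmath
import math

def haar_dwt(x):
    """One-level Haar DWT of an even-length sequence -> (approx, detail).
    Haar is an assumption; the paper does not name the wavelet here."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt."""
    s = math.sqrt(2.0)
    x = []
    for a, d in zip(approx, detail):
        x.extend([(a + d) / s, (a - d) / s])
    return x

def dft(x):
    """Direct O(N^2) DFT; an FFT computes the same values faster."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    """Inverse DFT, keeping the real part (the input spectrum is assumed
    to be conjugate-symmetric, as it is for a real signal)."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def analyse(frame):
    """DWT, then DFT of the detail coefficients -> magnitude and phase."""
    approx, detail = haar_dwt(frame)
    spec = dft(detail)
    magnitude = [abs(c) for c in spec]
    phase = [cmath.phase(c) for c in spec]
    return approx, magnitude, phase

def synthesise(approx, magnitude, phase):
    """Rebuild the time-domain frame from magnitude, phase and approximation."""
    spec = [cmath.rect(m, p) for m, p in zip(magnitude, phase)]
    return haar_idwt(approx, idft(spec))
```

When the magnitude spectrum is left unmodified, `synthesise(*analyse(frame))` returns the original frame up to rounding error, which is the property the data hiding scheme relies on for the untouched coefficients.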
Each data bit to be embedded is spread by multiplying it with a specific pseudo-noise (PN) code, i.e., $C_s q_s$. The length of the PN code $q_s$ is $S$. Adding all of these spreading vectors produces the hidden data, given by

$$E = \sum_{s=0}^{S-1} C_s q_s.$$

The hidden data $E$ are inserted into the last $L$ elements of the first half of $Y_{nb}(k)$ [17], which results in a modified magnitude spectrum $Y_{nb}^{1}(k)$. The insertion uses $E_l$, the $l$th component of $E$, together with a scalar that enhances the quality of the CNB signal; an appropriate value of this scalar is found from $G_{nb}$, the energy of $Y_{nb}(k)$. These changes result in a CNB signal whose spectrum combines the modified magnitude spectrum $Y_{nb}^{1}(k)$ with the phase spectrum. The CNB signal spectrum is converted back to the time domain by applying an inverse FFT and then an inverse DWT. The resulting CNB signal $Y_{nb}^{1}(n)$ is transmitted over the telephone network channel to the receiver, and the channel introduces channel and quantization noises. Let $\hat{Y}_{nb}^{1}(n)$ denote the received signal, i.e., $\hat{Y}_{nb}^{1}(n) = Y_{nb}^{1}(n) + er$, where $er$ denotes the combination of channel and quantization noises. A conventional phone terminal treats $\hat{Y}_{nb}^{1}(n)$ as an ordinary signal. The quality of $Y_{nb}(n)$ is not considerably degraded since the perceived differences between $Y_{nb}(n)$ and $Y_{nb}^{1}(n)$ are very small.

To recover the hidden data $\hat{Y}_{eb}(n)$, the receiver applies the DWT to $\hat{Y}_{nb}^{1}(n)$ and then the FFT to the detailed coefficients, followed by calculation of the magnitude spectrum and phase spectrum. The hidden data are then recovered from the magnitude spectrum of $\hat{Y}_{nb}^{1}(n)$ [17]. Denoting the recovered hidden data by $\hat{E}$, the data bits are decoded by employing a multiuser detector [18]. That is,

$$\hat{C}_s = \operatorname{sign}\left(q_s^{T}\hat{E}\right), \quad s = 0, 1, \ldots, S-1. \tag{7}$$

In a noise-free environment, $\hat{E} = E$. Substituting it into (7), we have

$$\hat{C}_s = \operatorname{sign}\left(q_s^{T}\sum_{g=0}^{S-1} C_g q_g\right).$$

The PN sequences are orthogonal. That is,

$$q_s^{T} q_g = 0,$$

where $s \neq g$. Therefore,

$$\hat{C}_s = \operatorname{sign}\left(C_s\, q_s^{T} q_s\right) = C_s.$$

This demonstrates that the parameters of $Y_{eb}(n)$ can be effectively retrieved by using the CDMA technique.
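The spreading and detection steps can be sketched in Python as follows. This is a minimal illustration, assuming Hadamard codes built with the Sylvester construction and a simple matched-filter detector; the helper names (`hadamard`, `spread`, `despread`) are hypothetical, and the scaling and insertion into the magnitude spectrum are deliberately left out.

```python
def hadamard(n):
    """Sylvester-construction Hadamard matrix; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h

def spread(bits, codes):
    """Hidden data: sum over s of C_s * q_s, with each C_s in {-1, +1}."""
    length = len(codes[0])
    return [sum(c * q[l] for c, q in zip(bits, codes)) for l in range(length)]

def despread(received, codes, n_bits):
    """Matched-filter detection: sign of the correlation with each code."""
    decoded = []
    for q in codes[:n_bits]:
        corr = sum(a * b for a, b in zip(q, received))
        decoded.append(1 if corr >= 0 else -1)
    return decoded

# Example: eight bits spread with length-16 codes and recovered noise-free.
codes = hadamard(16)
bits = [1, -1, -1, 1, 1, 1, -1, 1]
hidden = spread(bits, codes)
recovered = despread(hidden, codes, len(bits))
```

Because distinct rows of a Hadamard matrix have zero inner product, each correlation isolates a single bit, and any positive scaling of the hidden data leaves the decoded signs unchanged.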
Figure 1 shows the proposed transmitter. Initially, WB speech $Y_{wb}(n)$ sampled at 16 kHz is split into a low-band signal and a HIB signal by a low-pass filter (LPF) and a high-pass filter (HPF), respectively, where the low-band signal contains speech information between 0 and 4 kHz and the HIB signal contains speech information between 4 kHz and 8 kHz. The NB signal $Y_{nb}(n)$ is then produced by decimating the LPF output by a factor of two. The output of the HPF is shifted to the frequency range of the NB spectrum and then decimated to produce an upper-band (UB) signal $Y_{eb}(n)$.

Transmitter
The number of parameters that represent $Y_{eb}(n)$ must be minimized so that the HIB signal can be embedded imperceptibly into the NB signal. Here, linear predictive (LP) analysis [19] is utilized to fulfil this objective. The LP coefficients are computed from $Y_{eb}(n)$ using the Levinson-Durbin algorithm [19] and are then converted to line spectral frequencies (LSFs), since a minor change in the LP coefficients results in distortions while reconstructing $Y_{eb}(n)$. Also, the gain of $Y_{eb}(n)$ has to be embedded to avoid over-estimation [20]. Thus, the relative gain is calculated as $g_r = g_{eb}/g_{nb}$ and combined with the LSFs to produce a representation vector of $Y_{eb}(n)$, i.e., $C = [\mathrm{lsf}_1, \mathrm{lsf}_2, \ldots, \mathrm{lsf}_{10}, g_r]$. $C$ is quantized to the closest entry of a vector quantization (VQ) codebook that is generated by the fuzzy c-means (FCM) algorithm [21]. The binary representation of the entry index, i.e., $(c_0 c_1 c_2 \ldots c_{S-1})$, is then hidden within the NB signal using the DWTFFTBDH technique to provide a composite NB signal $Y_{nb}^{1}(n)$ that can be transmitted over the telephone network channel to the receiver.
The parameters of the excitation signal are not embedded, which minimizes the number of parameters to be embedded; this is possible because the human ear is not sensitive to excitation signal distortions above 3400 Hz [22]. Therefore, the UB signal excitation is estimated from the NB signal at the receiver while the reconstruction performance is maintained.
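The LP analysis and index encoding steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the Levinson-Durbin recursion takes a precomputed autocorrelation sequence, the LSF conversion and the FCM codebook training are omitted, and the mapping of index bit 0 to -1 and bit 1 to +1 is an assumption made here for compatibility with the spreading step. All function names are hypothetical.

```python
def levinson_durbin(r, order):
    """Solve the LP normal equations from autocorrelation values r[0..order].
    Returns (a, err) where A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order
    and err is the final prediction error energy."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)
    return a, err

def nearest_codeword(vector, codebook):
    """Index of the codebook entry with the smallest squared distance."""
    def dist(entry):
        return sum((v - e) ** 2 for v, e in zip(vector, entry))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def index_to_bits(index, n_bits):
    """Binary representation of the index, MSB first, mapped to {-1, +1}
    (assumed mapping: 0 -> -1, 1 -> +1)."""
    return [1 if (index >> (n_bits - 1 - b)) & 1 else -1 for b in range(n_bits)]
```

For a first-order autoregressive autocorrelation such as `[1.0, 0.5, 0.25, 0.125]`, the recursion returns a single non-zero predictor coefficient, which is a convenient sanity check before applying it to speech frames.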
A synchronization sequence such as 1111….111 is inserted after every frame of $Y_{nb}^{1}(n)$ to accomplish frame synchronization [23] between the transmitter and receiver. The arrival of a new frame of $Y_{nb}^{1}(n)$ is indicated by the reception of a certain number of consecutive identical waveforms (the synchronization sequence) at the receiver.

Figure 2 shows the proposed receiver. The entry index is recovered by the proposed DWTFFTBDH technique, and the corresponding quantized LSFs are then retrieved from the VQ codebook. The LP coefficients are constructed from the retrieved LSFs. Meanwhile, $\hat{Y}_{nb}^{1}(n)$ is inverse filtered using its own LP coefficients to obtain an NB residual signal, and the residual signal is extended to produce a UB excitation signal. $\hat{Y}_{eb}(n)$ is synthesized by exciting the synthesis filter described by the retrieved LP coefficients with the UB excitation signal. At this point, the sampling rate of both $\hat{Y}_{nb}^{1}(n)$ and $\hat{Y}_{eb}(n)$ is 8 kHz. Both signals are interpolated by a factor of two to reach the sampling rate of the WB signal. $Y_{eb}^{1}(n)$ denotes the interpolated $\hat{Y}_{eb}(n)$, which lies in 0 to 4 kHz and is shifted to 4 to 8 kHz. The interpolated composite NB signal $Y_{nb}^{11}(n)$ and the restored signal $Y_{eb}^{1}(n)$ are summed to produce a WB signal $Y_{wb}^{1}(n)$ of high quality.
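The inverse filtering and synthesis filtering used at the receiver can be sketched in Python as follows. This is a minimal illustration, assuming LP coefficients in the form A(z) = 1 + a[1] z^-1 + ... with zero initial filter state; the residual extension and the interpolation stages are not shown, and the function names are hypothetical.

```python
def inverse_filter(x, a):
    """LP residual e[n] = sum_j a[j] * x[n - j], with a[0] = 1.
    This applies the FIR analysis filter A(z)."""
    return [sum(a[j] * x[n - j] for j in range(len(a)) if n - j >= 0)
            for n in range(len(x))]

def synthesis_filter(e, a):
    """All-pole synthesis 1/A(z): y[n] = e[n] - sum_{j>=1} a[j] * y[n - j]."""
    y = []
    for n in range(len(e)):
        acc = e[n] - sum(a[j] * y[n - j]
                         for j in range(1, len(a)) if n - j >= 0)
        y.append(acc)
    return y

# Example: filtering a short signal and resynthesising it with the same A(z).
a = [1.0, -0.9, 0.2]
x = [0.0, 1.0, 0.5, -0.3, 0.8, -1.2, 0.1, 0.4]
residual = inverse_filter(x, a)
rebuilt = synthesis_filter(residual, a)
```

Exciting the synthesis filter with the residual obtained from the same coefficients returns the original signal; in the receiver, the filter is instead driven by the extended UB excitation and described by the LP coefficients retrieved from the hidden data.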

Evaluation
Two aspects need to be considered for the quality evaluation of the proposed system. First, a good WB quality must be guaranteed for the customized receiver. Second, for conventional NB receivers, the NB speech quality must not be degraded even after embedding the encoded spectral envelope parameters of the HIB signal into the detailed wavelet coefficients of the NB signal.

Twelve sentences spoken by 20 speakers, 10 men and 10 women (two hundred forty speech utterances altogether), are taken from the TIMIT corpus [24] for the performance evaluation. The NB signal is segmented into frames of 20-ms length with a 10-ms overlap between frames and is processed on a frame-by-frame basis. The performance of the proposed BE algorithm is assessed with subjective and objective tests.

The methods compared with the proposed method are: BE of telephony speech by data hiding [8], speech BE by data hiding and phonetic classification [9], an audio watermark-based speech BE [10] and steganographic WB telephony using NB speech codecs [15]. In the analysis, these are denoted, respectively, as conventional speech BE using data hiding (CSBUDH), conventional speech BE using data hiding and phonetic classification (CSBUDHAPC), conventional speech BE using bit stream data hiding (CSBUBSDH) and conventional speech BE using watermark transmitted side information (CSBUWTSI). The telephony channel model used in this paper is the additive white Gaussian noise (AWGN) channel model.
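The frame segmentation described above (20-ms frames with a 10-ms overlap) can be sketched in Python as follows. This is a minimal illustration, assuming the 8-kHz NB sampling rate and discarding any incomplete trailing frame; the function name and its parameters are hypothetical.

```python
def split_frames(signal, fs=8000, frame_ms=20, hop_ms=10):
    """Split a signal into overlapping frames.
    With the defaults, each frame has 160 samples and starts 80 samples
    after the previous one; a trailing partial frame is dropped (assumption)."""
    frame_len = fs * frame_ms // 1000
    hop = fs * hop_ms // 1000
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]

# Example: 400 samples (50 ms at 8 kHz) yield four overlapping frames.
frames = split_frames(list(range(400)))
```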

Subjective Listening Test Results
The subjective listening tests were carried out on one hundred speech utterances drawn randomly from the two hundred forty speech utterances. Here, the perceptual transparency (PET) is evaluated using a mean opinion score (MOS) test [8], [9]. A listening test is also conducted to compare the WB, CNB and RWB signals [12]. The speech quality obtained with the proposed method and the conventional speech BE methods [8][9][10][15] is further assessed using the absolute category rating (ACR) listening test [34] recommended by the International Telecommunication Union (ITU-T). Speech BE methods using different data hiding techniques [8], [9], [27][28][29][30][31] have likewise used the MOS test to evaluate perceptual transparency. These tests were conducted in a quiet environment using headphones. Twenty subjects participated in each test.

Perceptual Transparency
The information should be transparently hidden by the proposed method. That is, $Y_{nb}^{1}(n)$ and $Y_{nb}(n)$ should be subjectively indistinguishable. High PET means low noticeable NB signal degradation. PET should remain high even after embedding the encoded spectral envelope parameters of the HIB signal into the detailed wavelet coefficients of the NB signal. PET is evaluated using the MOS test [8], [9]. Subjects participating in the test compare $Y_{nb}(n)$ and $Y_{nb}^{1}(n)$ and provide their opinions in terms of the MOS scale presented in Tab. 1. Table 2 lists the averaged mean opinion scores for the traditional methods [8][9][10][15] and the proposed method. As seen in Tab. 2, the proposed method shows a distinct PET advantage over the traditional methods [8][9][10][15].

Subjective Comparison of Original WB Speech, Composite NB Speech and Reconstructed WB Speech
A subjective listening test has been conducted in order to compare the performance of the proposed technique with the existing techniques. WB speech is denoted as I, and CNB speech and RWB speech are denoted as II and III, respectively. The subjects compare pairs of speech samples taken from I to III and rate whether the first sample of the pair sounded better, worse or similar in relation to the second sample. Table 3(a) lists the results of comparing I with II and III, and Tab. 3(b) lists the results of comparing II with III. The number of subjects giving each rating is presented in the table. Table 3(a) confirms the consistent preference for original WB speech over the CNB speech of the traditional methods [8][9][10][15] and the proposed method. A clearly improved RWB signal quality of the proposed method over the traditional methods is also observed from Tab. 3(a). Table 3(b) shows a clear RWB signal quality advantage of the proposed method over CNB speech.

ITU-T Test Results
The speech samples used in the listening test were taken from the TIMIT database. One hundred sentences were taken for evaluating the performance of the conventional methods [8][9][10][15] and the proposed method. Since the main application of the speech BE technique is in mobile communications, the listening test samples were prepared so that they simulated speech transmitted over a cellular telephone network. The test samples were high-pass filtered with the mobile station input (MSIN) filter, which approximates the input response of a mobile station, and the sound level of each test sample was normalized to 26 dB below overload [35]. These pre-processed test samples were then downsampled to the 8-kHz sampling rate and used as the NB signal for the conventional speech BE methods [8][9][10][15] and the proposed method.
The ACR test was conducted to evaluate the quality of the bandwidth-extended speech signal generated by the conventional speech BE methods [8][9][10][15] and the proposed method. The listeners were asked to rate the quality of the speech samples on the scale: 5 (excellent), 4 (good), 3 (fair), 2 (poor), 1 (bad). The test was conducted in a quiet environment using headphones. Twenty subjects participated in the test. MOS values for the conventional speech BE methods [8][9][10][15] and the proposed method are presented in Tab. 4. A clearly improved reconstructed WB signal quality of the proposed method over the traditional methods is observed from Tab. 4.

Objective Test Results
The obtained WB speech quality is rated using the log spectral distortion (LSD) measure [8], [9] and the ITU-T WB perceptual evaluation of speech quality (WB-PESQ) measure [33]. The perceptual transparency is rated using the NB perceptual evaluation of speech quality (NB-PESQ) measure [25]. The bit error rate (BER) is used to evaluate the robustness of the hidden data against quantization and channel noises. Speech BE methods using different data hiding techniques [11], [13], [27], [29], [30] have likewise used the NB-PESQ measure to evaluate perceptual transparency.

Comparison of Original and Reconstructed HIB Speech
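The perceptual similarity between the original and reconstructed HIB signals is evaluated using the LSD measure, which is given by

$$\mathrm{LSD} = \sqrt{\frac{1}{2\pi}\int_{-\pi}^{\pi}\left(20\log_{10}\frac{g_p}{\left|a_s(e^{j\omega})\right|} - 20\log_{10}\frac{\hat{g}_p}{\left|\hat{a}_s(e^{j\omega})\right|}\right)^{2} d\omega},$$

where $g_p$ is the gain of the original HIB signal, $1/a_s(e^{j\omega})$ is the spectral envelope of the original HIB signal, $\hat{g}_p$ is the gain of the reconstructed HIB signal and $1/\hat{a}_s(e^{j\omega})$ is the spectral envelope of the reconstructed HIB signal. A smaller LSD value indicates better quality of the reconstructed HIB signal.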
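A numerical evaluation of the LSD measure can be sketched in Python as follows. This is a minimal illustration, assuming the common root-mean-square form of the log spectral distortion between two all-pole envelopes and approximating the integral over frequency by a uniform grid; the function names and the grid size are hypothetical choices.

```python
import cmath
import math

def envelope_db(gain, a, w):
    """Envelope level in dB at frequency w: 20*log10(gain / |A(e^{jw})|),
    where A(z) = sum_j a[j] z^-j and a[0] = 1."""
    A = sum(c * cmath.exp(-1j * w * j) for j, c in enumerate(a))
    return 20.0 * math.log10(gain / abs(A))

def lsd(gain, a, gain_hat, a_hat, n_points=512):
    """Root-mean-square dB difference between the original and reconstructed
    envelopes, averaged over a uniform grid on [-pi, pi)."""
    total = 0.0
    for i in range(n_points):
        w = -math.pi + 2.0 * math.pi * i / n_points
        d = envelope_db(gain, a, w) - envelope_db(gain_hat, a_hat, w)
        total += d * d
    return math.sqrt(total / n_points)
```

Identical gains and envelopes give an LSD of zero, and a pure gain mismatch by a factor of two with identical envelopes gives a constant difference of 20 log10(2) dB, which is a quick way to check an implementation.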

Perceptual Transparency
The NB-PESQ measure is used to rate PET by comparing $Y_{nb}(n)$ with $Y_{nb}^{1}(n)$. The NB-PESQ scale ranges from -0.5 for the worst PET up to 4.5 for the best PET. A clearly improved PET of the proposed technique over the traditional techniques is observed from Tab. 6.

Robustness of Hidden Information
The effect of noise corruption is considered now. AWGN is added to the CNB signal $Y_{nb}^{1}(n)$, with the signal-to-noise ratio (SNR) ranging from 15 to 35 dB [26]. The robustness of the proposed technique is evaluated using the BER. The length of the PN code is 16. A smaller BER value indicates better quality of the RWB signal. The obtained BER values for SNRs ranging from 15 to 35 dB are below $3.28 \times 10^{-5}$, which confirms the better RWB signal quality.
The BER value obtained after applying μ-law coding to $Y_{nb}^{1}(n)$ is $1.32 \times 10^{-5}$, which confirms the better RWB signal quality.
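A BER measurement of this kind can be sketched in Python as follows. This is a simplified illustration rather than the evaluation setup used in the paper: it assumes that the Gaussian noise is added directly to the spread hidden data instead of to the time-domain CNB signal, it defines the SNR relative to the power of the hidden data, and it uses length-16 Hadamard codes with a matched-filter detector. The function names, the number of frames and the seed are hypothetical.

```python
import math
import random

def hadamard(n):
    """Sylvester-construction Hadamard matrix; n must be a power of two."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h

def simulate_ber(snr_db, n_bits=16, code_len=16, n_frames=200, seed=0):
    """Fraction of wrongly decoded bits when Gaussian noise at the given SNR
    is added to the spread hidden data (simplifying assumption)."""
    rng = random.Random(seed)
    codes = hadamard(code_len)[:n_bits]
    errors = 0
    for _ in range(n_frames):
        bits = [rng.choice((-1, 1)) for _ in range(n_bits)]
        hidden = [sum(c * q[l] for c, q in zip(bits, codes))
                  for l in range(code_len)]
        power = sum(v * v for v in hidden) / code_len
        sigma = math.sqrt(power / 10.0 ** (snr_db / 10.0))
        noisy = [v + rng.gauss(0.0, sigma) for v in hidden]
        for c, q in zip(bits, codes):
            corr = sum(a * b for a, b in zip(q, noisy))
            errors += (1 if corr >= 0 else -1) != c
    return errors / (n_frames * n_bits)
```

The despreading correlation sums the noise over all 16 chips while the desired bit is weighted by the full code energy, which is why bit errors only become frequent at SNRs far below the 15 to 35 dB range considered here.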
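The quantization noise introduced by μ-law coding can be illustrated with a minimal Python sketch. It assumes the continuous μ-law companding curve with μ = 255 followed by uniform 8-bit quantization, which approximates but is not bit-exact with the segmented ITU-T G.711 codec; the function names are hypothetical.

```python
import math

MU = 255.0  # companding parameter (assumed; typical for 8-bit telephony)

def mu_compress(x):
    """Map a sample x in [-1, 1] through the mu-law compression curve."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_expand(y):
    """Inverse of mu_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

def mu_law_codec(x, bits=8):
    """Compress, quantise uniformly to 2**bits levels, then expand.
    The difference between input and output is the quantization noise."""
    levels = 2 ** bits
    y = mu_compress(x)
    q = round((y + 1.0) / 2.0 * (levels - 1))
    return mu_expand(q / (levels - 1) * 2.0 - 1.0)
```

Passing each sample of the composite signal through `mu_law_codec` before data recovery gives a simple way to study how this quantization noise affects the decoded bits.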

WB Speech Quality
The WB-PESQ measure [33] is used to evaluate the quality of the reconstructed WB speech $Y_{wb}^{1}(n)$ by providing the original WB speech $Y_{wb}(n)$ and the reconstructed WB speech $Y_{wb}^{1}(n)$ as inputs. Table 7 lists the averaged WB-PESQ scores for the traditional methods [8][9][10][15] and the proposed method. A clear quality improvement of the proposed method over the traditional methods [8][9][10][15] is observed from the average WB-PESQ scores shown in Tab. 7.

Conclusion
A novel speech BE algorithm using the DWTFFTBDH technique is proposed in this paper. The encoded spectral envelope parameters of the HIB signal are hidden in the detailed coefficients of the NB signal. The hidden data is retrieved to produce a high-quality WB signal at the receiving end. The proposed technique proved to be a robust solution for BE of NB speech signals. Evaluation results confirm the improved wideband performance of the proposed technique over the traditional speech BE techniques.