A Packet Loss Concealment Technique Improving Quality of Service for Wideband Speech Coding in Wireless Sensor Networks

A packet loss concealment (PLC) algorithm is proposed to improve the quality of decoded speech when packet losses occur in a wireless sensor network. The proposed algorithm is mainly based on artificial bandwidth extension (ABE) from narrowband to wideband. It consists of three main functions: packet loss concealment in the narrowband, ABE in the modified discrete cosine transform (MDCT) domain, and smoothing of wideband MDCT coefficients with those of the last good frame. The performance of the proposed PLC algorithm is evaluated by replacing the PLC algorithm employed in the ITU-T Recommendation G.729.1. The experimental results show that the proposed PLC algorithm provides significantly better speech quality than the PLC in the ITU-T G.729.1.


Introduction
There have been rapid developments in wireless sensor networks (WSNs) owing to recent advances in devices such as ultralow-power microcontrollers and short-range transceivers [1]. WSN technology is used in a wide range of applications such as environmental monitoring, human tracking, biomedical research, military surveillance, and multimedia transmission [2,3]. This paper addresses issues regarding sensors used for multimedia transmission, called wireless multimedia sensors (WMSs) [4,5]. These sensors deal with multimedia data such as image, video, speech, and audio. Multimedia sensor nodes have resource constraints such as low battery capacity, limited storage space, and limited computing power. Many multimedia sensor nodes focus on transmitting speech data over WSNs. In such cases, the sensor nodes are connected by wireless local area network (WLAN) links and communicate using the real-time transport protocol over the user datagram protocol (RTP/UDP).
Packet loss rate increases in this type of transmission because of increased network congestion [6,7]. In addition, depending on the network resources, the possibility of burst packet losses also increases, which potentially results in severe quality degradation of the reconstructed speech [8].
Most speech coders in use today are based on telephone-bandwidth narrowband speech, nominally limited to about 300-3,400 Hz at a sampling rate of 8 kHz. In order to improve speech quality in voice services, wideband speech coders have been developed for smoothly migrating from narrowband to wideband quality. They operate with a bandwidth of 50-7,000 Hz at a sampling rate of 16 kHz. For example, ITU-T Recommendation G.729.1, a scalable wideband speech coder, improves the quality of speech by encoding the frequency bands ignored by the narrowband speech coder, ITU-T Recommendation G.729. Encoding wideband speech using ITU-T Recommendation G.729.1 is performed by two different operations on the low band and the high band, in the time and frequency domains, respectively. When a frame loss occurs, the low-band and high-band packet loss concealment (PLC) algorithms work separately. The low-band PLC algorithm reconstructs the excitation and spectral parameters of the lost frame from the last good frame, and the high-band PLC algorithm reconstructs the spectral parameters, such as modified discrete cosine transform (MDCT) coefficients, of the lost frame from the last good frame [9].
International Journal of Distributed Sensor Networks
Several PLC methods have been proposed to reduce the speech quality degradation caused by packet loss [7,10]. The PLC algorithm proposed in [7] was developed to improve narrowband speech quality by estimating the excitation using comfort noise and multiple codebooks. A technique based on the resynchronization of the glottal pulses in the low band was also proposed in [10] and was subsequently embedded into ITU-T Recommendation G.729.1 as the low-band PLC algorithm. However, the high-band PLC algorithm of ITU-T Recommendation G.729.1 replaces the spectral parameters in the MDCT domain with those of the previous frame [10,11]. In this case, the high-band signal is reconstructed without regard to the low-band signal for the lost frames. The speech quality could be improved if the PLC algorithm estimated the high-band signal by taking into account the reconstructed low-band signal for the lost frames. Therefore, this paper proposes an artificial bandwidth extension (ABE)-based PLC algorithm for high-band signal reconstruction in order to improve the quality of decoded speech under packet loss conditions in a WSN. The proposed PLC algorithm is mainly composed of three functions: PLC in the narrowband, ABE in the MDCT domain, and smoothing of the wideband MDCT coefficients using those of the last good frame. The ABE algorithm performs different operations for the 4-4.6 kHz and 4.6-7 kHz bands. It reconstructs the MDCT coefficients of the 4-4.6 kHz band using the harmonic spectral band replication and correlation-based replication approaches. On the other hand, the MDCT coefficients for the 4.6-7 kHz band are obtained by spectral folding [12]. The performance of the proposed PLC algorithm is evaluated by implementing it in the G.729.1 decoder, and it is compared with that of the PLC algorithm employed in the ITU-T Recommendation G.729.1 decoder.
The remainder of this paper is organized as follows. Following this introduction, Section 2 discusses the PLC algorithm currently employed in the G.729.1 decoder. Section 3 proposes an ABE-based PLC algorithm that can also be applied to the ITU-T Recommendation G.729.1 decoder. Section 4 evaluates the performance of the proposed ABE-based PLC algorithm. Finally, this paper is concluded in Section 5.

Conventional PLC Algorithm
PLC algorithms can be classified into sender-based and receiver-based algorithms, depending on the position where the PLC algorithm works [13,14]. The sender-based algorithms try to prevent packet errors by using error-robust transmission methods or by including error correction data. The lost speech packets are retransmitted or the sequential speech packets are interleaved to avoid burst losses. Moreover, the speech packets are transmitted with forward error correction (FEC) code or redundant data, which are used to recover the lost speech signals at the receiver. In addition, robust header compression (ROHC) provides a robust speech streaming method at the transmission protocol layer by reducing the overhead due to protocol headers [15]. On the other hand, the receiver-based algorithms conceal lost speech signals by using the speech signal characteristics. The lost speech signals are replaced with silence, noise, or previously reconstructed speech signals. Alternatively, lost speech signals can be reconstructed by interpolating the previous and next good speech signals [16]. In practice, the parameters of a lost frame are estimated by extrapolating the parameters of a previous good frame, since waiting for the next good frame would add delay.
Figure 1 shows a block diagram of the PLC algorithm employed in the ITU-T Recommendation G.729.1 decoder [17]. The PLC algorithm is composed of low-band and high-band PLC modules. It reconstructs the speech signal of a lost frame based on the speech parameters correctly received in the last good frame, where the speech parameters are the excitation in the low band and the MDCT coefficients in the high band. In the low-band PLC module, the excitation of the lost frame is replaced with that obtained from the last good frame, and the energy of the reconstructed excitation is gradually decayed. In addition, a synthesis filter for the lost frame is reconstructed using the linear predictive coding (LPC) coefficients from the last good frame, and the pitch period of the lost frame is estimated as the integer part of the pitch period of the last good frame.
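The low-band concealment described above can be illustrated with a minimal Python sketch. It is not the G.729.1 implementation: the function name `conceal_low_band` and the per-frame attenuation factor `decay` are hypothetical, and the actual decoder applies its own attenuation rules.

```python
def conceal_low_band(last_excitation, last_pitch, n_lost, decay=0.9):
    """Repeat the last good excitation for n_lost frames with a decaying gain.

    decay is a hypothetical per-frame attenuation factor chosen for
    illustration; it is not a value taken from G.729.1.
    """
    # The pitch period of the lost frames is the integer part of the
    # pitch period of the last good frame.
    pitch = int(last_pitch)
    frames = []
    gain = 1.0
    for _ in range(n_lost):
        gain *= decay  # attenuate a little more for every consecutive loss
        frames.append([gain * e for e in last_excitation])
    return pitch, frames


pitch, frames = conceal_low_band([1.0, -2.0], 57.75, n_lost=2, decay=0.5)
print(pitch, frames)
```

The gain is accumulated across consecutive lost frames so that a long burst fades toward silence rather than repeating the same excitation at full level.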
In the high-band PLC module, the high-band signal is reconstructed by time-domain bandwidth extension (TDBWE) that convolves the excitation generated from the low-band PLC module with a spectral envelope estimated from the high-band energy parameters of the last good frame. Then, an MDCT is applied to the high-band signal, and subsequently, the MDCT coefficients corresponding to 7-8 kHz are set to zero. Next, an inverse MDCT (IMDCT) is applied to the modified MDCT coefficients in order to obtain the high-band signal. Finally, the reconstructed wideband signal of the lost frame is obtained by quadrature mirror filter (QMF) synthesis using both the low-band signal and the high-band signal that are reconstructed by the low-band PLC and high-band PLC modules, respectively.

Proposed ABE-Based PLC Algorithm
Figure 2 shows a block diagram of the proposed ABE-based PLC algorithm. When a frame loss occurs, the low-band PLC module reconstructs the low-band speech signal of the lost frame, $s_l(n)$. Simultaneously, the high-band signal is reconstructed by extending the glottal pulse using the high-band spectral envelope of the last good frame, which is denoted by $S_h(k)$ in Figure 2. Next, the high-band signal is obtained by applying an ABE algorithm to extend the low-band signal in the MDCT domain, resulting in $S_{abe}(k)$. Subsequently, $\hat{S}_h(k)$ is obtained by smoothing $S_{abe}(k)$ with $S_h(k)$. By applying an IMDCT to $\hat{S}_h(k)$, a time-domain high-band signal, $\hat{s}_h(n)$, is obtained. Finally, the wideband signal is reconstructed from $s_l(n)$ and $\hat{s}_h(n)$ by using QMF synthesis. In the following subsections, the ABE algorithm and the spectral smoothing method are described in detail.

Artificial Bandwidth Extension (ABE)
ABE is used to generate the high-band MDCT coefficients, $S_{abe}(k)$, as shown in Figure 3. In this paper, the frame size, $N$, is set to 160. For a given set of low-band MDCT coefficients, the high-band MDCT coefficients are obtained in different ways depending on the frequency band. The high band is divided into two frequency bands: 4-4.6 kHz and 4.6-7 kHz.
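As a small worked example of the band layout, the sketch below maps band edges to high-band MDCT bin indices. It assumes that the 160 coefficients of a frame span the 4 kHz wide high band uniformly, that is, 25 Hz per bin, which is consistent with the index ranges $24 \le k < 60$ and $60 \le k < 120$ used later in this section. The helper name `high_band_bin` is hypothetical.

```python
FRAME_SIZE = 160        # N, number of MDCT coefficients per band and frame
BAND_WIDTH_HZ = 4000.0  # the high band spans 4-8 kHz
BIN_HZ = BAND_WIDTH_HZ / FRAME_SIZE  # 25 Hz per MDCT bin (assumed uniform)


def high_band_bin(freq_hz):
    """Index of the high-band MDCT bin whose lower edge is freq_hz."""
    return round((freq_hz - 4000.0) / BIN_HZ)


for edge in (4000, 4600, 5500, 7000):
    print(edge, "Hz -> bin", high_band_bin(edge))
```

Under this assumption the 4-4.6 kHz band occupies 24 bins, the 4.6-5.5 kHz band 36 bins, and the 5.5-7 kHz band 60 bins.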
First, for the frequency band of 4.6-7 kHz, the MDCT coefficients are initially generated by a spectral folding operation, defined in (1) in terms of $S_l(k)$, the low-band MDCT coefficient in the $k$th frequency bin. The spectral folding in (1) generates the high-band coefficients by mirroring $S_l(k)$, where the range of $k$ corresponds to the high band from 4.6 to 7 kHz. However, the spectral folding tends to create an unnaturally prominent harmonic structure at high frequencies, resulting in audible distortion. To mitigate this, the high band of 4.6-7 kHz is further split into subbands of 4.6-5.5 kHz and 5.5-7 kHz.
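The mirroring operation can be sketched as follows. This is a minimal illustration, assuming that folding reflects the low band about the 4 kHz band edge so that high-band bin $k$ takes the value of low-band bin $N - 1 - k$; the exact index convention of (1) is an assumption, and the function name `spectral_fold` is hypothetical.

```python
def spectral_fold(low_band, start, stop):
    """Mirror low-band MDCT coefficients into high-band bins start..stop-1.

    Assumes reflection about the band edge: high-band bin k is filled
    from low-band bin N - 1 - k, where N = len(low_band).
    """
    n = len(low_band)
    return [low_band[n - 1 - k] for k in range(start, stop)]


# Use the bin index as a stand-in coefficient value to make the mapping visible.
low_band = list(range(160))
folded = spectral_fold(low_band, 24, 120)  # bins covering 4.6-7 kHz
print(len(folded), folded[0], folded[-1])
```

With this convention, the coefficients closest to 4 kHz in the low band land closest to 4 kHz in the high band, which is what a mirror image about the band edge implies.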
For the frequency band of 5.5-7 kHz, the folded coefficients are smoothed as in (2), where the smoothed coefficient at $k = 59$ is set equal to the folded coefficient at $k = 59$. In addition, $\mathrm{sgn}(x) = 1$ if $x \ge 0$, but $\mathrm{sgn}(x) = -1$ otherwise.
For the frequency band of 4-4.6 kHz, the low-band MDCT coefficients are grouped into 20 subbands, where each subband is composed of eight MDCT coefficients. The energy of the $b$th subband, $E(b)$, is defined in (3). By using $E(b)$ in (3), each low-band MDCT coefficient is normalized as in (4), which yields the $k$th normalized low-band MDCT coefficient, $\bar{S}_l(k)$. Next, the MDCT coefficients for this frequency band are obtained differently depending on the voicing characteristics of narrowband speech. Each frame is classified as either a voiced or an unvoiced frame by using a spectral tilt parameter, which is identical to the first reflection coefficient, $k_1$, from the ITU-T G.729.1 decoder. If the spectral tilt of the current frame is greater than a predefined threshold, then the frame is declared a voiced frame; otherwise, it is declared an unvoiced frame.
For a voiced frame, the harmonic characteristics of the low band should be maintained in the high band [18]. The harmonic spectral band replication approach determines the harmonic period as $\Delta_v = 2N/T$, where $N$ is the frame size and $T$ indicates the pitch period obtained from the ITU-T Recommendation G.729.1 decoder. Then, by using $\bar{S}_l(k)$ in (4), the $k$th MDCT coefficient of this band is expressed as in (5), where $\mathrm{mod}(x, y)$ is the modulus operation defined as $\mathrm{mod}(x, y) = x \mathbin{\%} y$, and $\lfloor x \rfloor$ denotes the largest integer less than or equal to $x$. In (5), $k$ varies from 0 to 23, which corresponds to the frequency band 4 to 4.6 kHz.
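The subband normalization and the harmonic period can be illustrated with a short sketch. It assumes that the subband energy is the sum of squared coefficients and that each coefficient is divided by the square root of its subband energy; these forms are assumptions rather than the exact definitions of (3) and (4). The function names are hypothetical.

```python
import math


def subband_energies(coeffs, band_size=8):
    """Energy of each subband, assumed to be the sum of squared coefficients."""
    return [sum(c * c for c in coeffs[i:i + band_size])
            for i in range(0, len(coeffs), band_size)]


def normalize_by_subband(coeffs, band_size=8):
    """Divide each coefficient by the root energy of its subband (assumed form)."""
    energies = subband_energies(coeffs, band_size)
    out = []
    for k, c in enumerate(coeffs):
        e = energies[k // band_size]
        out.append(c / math.sqrt(e) if e > 0.0 else 0.0)  # guard silent subbands
    return out


def harmonic_period_bins(frame_size, pitch_period):
    """Harmonic spacing in MDCT bins: 2N / T."""
    return 2.0 * frame_size / pitch_period


print(harmonic_period_bins(160, 40))
```

For example, a pitch period of 40 samples at 8 kHz corresponds to a 200 Hz fundamental, which is eight bins of 25 Hz each and agrees with $2N/T = 8$ for $N = 160$.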
On the other hand, for an unvoiced frame, a correlation-based replication approach is used to patch the high-band MDCT coefficients. Thus, the optimal shift, $\Delta_{uv}$, which maximizes the autocorrelation [19,20] between the normalized low-band MDCT coefficients, $\bar{S}_l(k)$, is determined as $\Delta_{uv} = \arg\max_{0 \le m \le 3N/4} \mathrm{corr}(\bar{S}_l(k), \bar{S}_l(k+m))$ (6), where $3N/4$ is the maximum shift range. Note that the search range is limited between zero and $3N/4$ in order to find an optimal shift in the frequency band of 3-4 kHz. In (6), $\mathrm{corr}(\bar{S}_l(k), \bar{S}_l(k+m))$ is defined in (7). Therefore, the $k$th MDCT coefficient that is most correlated with $\bar{S}_l(k)$ is obtained as in (8).
It is important to avoid an abrupt change at the boundary between the low band and the high band. This is achieved by adjusting the coefficients obtained from (5) or (8) for $0 \le k < 24$ so that the energy of the frequency band of 4-4.6 kHz changes smoothly when compared to the low-frequency band of 3.4-4 kHz [21]. The allowable energy for the $b$th high-band subband, $E_h(b)$, is defined in (9) from the subband energy $E(b)$ in (3) by using a scale factor that mitigates the abrupt energy change; the scale factor is set to 1.25 in this paper. Note also that the range of the subband index is associated with the frequency band of 4-4.6 kHz. Next, each high-band MDCT coefficient in the frequency band of 4-4.6 kHz is modified as in (10). Finally, the extended MDCT coefficients, $\tilde{S}_h(k)$, are obtained in (11) by concatenating the MDCT coefficients obtained from (10), (1), and (2) for $0 \le k < 24$, $24 \le k < 60$, and $60 \le k < 120$, respectively.
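The shift search for unvoiced frames can be sketched as below. Since the correlation measure is not reproduced here, the sketch assumes a normalized autocorrelation over the overlapping part of the sequence; it also starts the search at a shift of one to skip the trivial zero shift. Both choices, and the function names, are assumptions made for illustration.

```python
import math


def normalized_corr(x, shift):
    """Normalized correlation between x[k] and x[k + shift] over their overlap."""
    n = len(x) - shift
    a = x[:n]
    b = x[shift:shift + n]
    num = sum(p * q for p, q in zip(a, b))
    den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return num / den if den > 0.0 else 0.0


def best_shift(x, min_shift, max_shift):
    """Smallest shift in [min_shift, max_shift] with the highest correlation."""
    best, best_corr = min_shift, float("-inf")
    for m in range(min_shift, max_shift + 1):
        c = normalized_corr(x, m)
        if c > best_corr:  # strict comparison keeps the smallest maximizer
            best, best_corr = m, c
    return best


# A spectrum with a peak every five bins should yield a shift of five.
x = [1.0, 0.0, 0.0, 0.0, 0.0] * 8
print(best_shift(x, 1, 20))
```

Keeping the smallest maximizing shift is a design choice of this sketch; it avoids picking a multiple of the underlying period when several shifts correlate equally well.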
The extended MDCT coefficients in (11), $\tilde{S}_h(k)$, exhibit an excessively fine structure at high frequencies, which results in musical noise. Therefore, they should be smoothed. This is done by applying a shaping function to $\tilde{S}_h(k)$, where a cubic spline interpolation with a not-a-knot condition is used for the shaping function, with four control points at 4, 5, 6, and 7 kHz set to 0, −6, −12, and −18 dB, respectively [22]. Consequently, the extended MDCT coefficients are further modified as $S_{abe}(k) = 10^{0.05 G(k)} \tilde{S}_h(k)$, $0 \le k < 120$ (12), where $G(k)$ is the value in dB obtained after applying the spline function.
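The shaping step can be sketched without a spline library. A not-a-knot cubic spline through exactly four points reduces to the single cubic polynomial that interpolates them, and because the four control points given above lie on a straight line, the resulting shaping curve is a linear roll-off of 6 dB per kHz. The sketch assumes 25 Hz bins with bin $k$ located at $4000 + 25k$ Hz; the function names are hypothetical.

```python
BIN_HZ = 25.0  # assumed MDCT bin spacing in the high band


def shaping_gain_db(k):
    """Shaping gain in dB for high-band bin k.

    The control points (4, 0), (5, -6), (6, -12), (7, -18) in (kHz, dB) are
    collinear, so the interpolating curve is a line of -6 dB per kHz.
    """
    freq_khz = 4.0 + k * BIN_HZ / 1000.0
    return -6.0 * (freq_khz - 4.0)


def apply_shaping(coeffs):
    """Scale each extended coefficient by 10 ** (0.05 * gain_db)."""
    return [10.0 ** (0.05 * shaping_gain_db(k)) * c for k, c in enumerate(coeffs)]


shaped = apply_shaping([1.0] * 120)
print(shaped[0], shaped[40], shaped[119])
```

With different control values the points would generally not be collinear, and a general cubic interpolation would be needed instead of the closed form used here.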

Reconstruction of High-Band MDCT Coefficients for a Lost Frame
As mentioned earlier, the proposed ABE-based PLC algorithm reconstructs the high-band signal from the low-band signal and is mainly composed of three modules: low-band PLC, ABE in the MDCT domain, and smoothing of the wideband MDCT coefficients using those of the last good frame. The high-band PLC module in the ITU-T Recommendation G.729.1 decoder utilizes the high-band energy of the last good frame regardless of the signal class characteristics such as voiced, unvoiced, and transition periods. In contrast, in the proposed ABE-based PLC algorithm, the high-band MDCT coefficient, $S_{abe}(k)$, is smoothed with the high-band MDCT coefficient, $S_h(k)$, that is obtained from the high-band PLC module in the ITU-T G.729.1 decoder. In other words, the smoothed high-band MDCT coefficient, $\hat{S}_h(k)$, is obtained as $\hat{S}_h(k) = (S_h(k) + S_{abe}(k)) \cdot \mathrm{sgn}(S_h(k))$, $0 \le k < 120$.
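Next, the smoothed high-band MDCT coefficients, $\hat{S}_h(k)$, are transformed into the time domain by applying an IMDCT, as shown in Figure 2, which yields the high-band signal $\hat{s}_h(n)$. Finally, the low-band signal, $s_l(n)$, and $\hat{s}_h(n)$ are combined by QMF synthesis using a 64-tap filter to reconstruct the decoded wideband speech for the lost frame.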

Performance Evaluation
The effectiveness of the proposed ABE-based PLC algorithm is demonstrated by comparing its performance with that of the PLC algorithm employed in the ITU-T Recommendation G.729.1 decoder, which is referred to as G.729.1-PLC. For comparison, eight audio files (three male voice files, three female voice files, and two music files) were excerpted from the sound quality assessment material (SQAM) database [23]. Since the files were originally recorded in stereo at a sampling rate of 44.1 kHz, the right channel signal of each file was downsampled to 16 kHz. In addition, two different packet loss conditions, random and burst packet losses, were simulated. Packet loss rates of 10, 20, and 30% were generated by the Gilbert-Elliott model defined in ITU-T Recommendation G.191 [24]. To simulate burst packet loss conditions, the burstiness of the packet losses was set to 0.99, where the mean and maximum consecutive packet losses were measured as 1.9 and 5.6 frames, respectively.
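A loss pattern of this kind can be generated with a two-state Markov chain. The sketch below is a simplified Gilbert-type model in which every frame is lost in the bad state and none is lost in the good state. It does not reproduce the parameterization of the ITU-T G.191 tool; the function name and the transition probabilities `p_good_to_bad` and `p_bad_to_good` are hypothetical.

```python
import random


def gilbert_loss_pattern(n_frames, p_good_to_bad, p_bad_to_good, seed=0):
    """Return a list of booleans, True where a frame is lost.

    Simplified two-state model: frames are lost only in the bad state.
    The transition probabilities are illustrative parameters, not the
    loss-rate and burstiness parameters of the G.191 software tool.
    """
    rng = random.Random(seed)  # seeded for reproducible loss patterns
    bad = False
    pattern = []
    for _ in range(n_frames):
        if bad:
            if rng.random() < p_bad_to_good:
                bad = False
        elif rng.random() < p_good_to_bad:
            bad = True
        pattern.append(bad)
    return pattern


pattern = gilbert_loss_pattern(1000, 0.1, 0.5)
print(sum(pattern) / len(pattern))
```

For this chain the long-run loss rate is `p_good_to_bad / (p_good_to_bad + p_bad_to_good)` and the mean burst length is `1 / p_bad_to_good`, so lowering `p_bad_to_good` produces burstier losses.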
First, the log spectral distortion (LSD) [25] was measured between the original and decoded signals. It is computed from the $k$th spectral components of the original signal and of the decoded signal. In order to obtain the LSD, an $N$-point discrete Fourier transform was applied to both signals, and the distortion was summed from $N/4$ to $N/2$ because only the spectral components of the high band were compared. Tables 1 and 2 compare the LSD between the proposed PLC and the G.729.1-PLC algorithms at packet loss rates of 10, 20, and 30% under random and burst packet loss conditions for the speech and music files, respectively. The tables show that the proposed PLC algorithm provides smaller LSDs than G.729.1-PLC for all packet loss conditions.
Second, the waveforms decoded by the different PLC algorithms were compared, as shown in Figure 4. The decoded signal obtained by the proposed PLC algorithm (Figure 4(e)) is closer to the decoded signal without any loss (Figure 4(b)) than the decoded signal obtained by G.729.1-PLC (Figure 4(d)) for a given packet error pattern (Figure 4(c)). Additionally, Figure 5 compares the spectrograms of the signals decoded by the different PLC algorithms. As shown in Figure 5, the spectrogram of the decoded signal obtained by the proposed PLC algorithm (Figure 5(d)) was more similar in the high band to that of the decoded signal without any loss (Figure 5(b)) than the spectrogram of the decoded signal obtained by G.729.1-PLC (Figure 5(c)).
Third, an A-B preference listening test was performed to evaluate the subjective quality. The audio data used for the test consisted of six speech files (three male and three female voices) and two music files. All the files were processed under random and burst packet loss conditions by G.729.1-PLC and the proposed PLC algorithm, respectively. Seven people with no auditory diseases participated in the test. Audio files processed by the G.729.1-PLC and the proposed PLC algorithm were presented to the participants, and they were asked to choose their preference. Tables 3 and 4 show the test results for the speech and music data, respectively. Note that if a participant could not distinguish the file processed by the proposed PLC from that processed by G.729.1-PLC, then "No Diff." was selected. Tables 3 and 4 show that the speech and music signals decoded by the proposed PLC algorithm were preferred over those decoded by G.729.1-PLC.
Next, in order to demonstrate the effectiveness of the proposed PLC algorithm, a multiple stimuli with hidden reference and anchor (MUSHRA) test [26] was performed as a subjective listening test. For the MUSHRA test, two anchors with cut-off frequencies of 7 and 3.4 kHz were prepared. Seven people with no auditory diseases also participated in this test. Each participant was presented with the eight stimuli and was asked to rate the audio quality from 0 to 100. Figure 6 compares the MUSHRA scores, where each column corresponds to the opinion score averaged over seven listeners and eight audio files. Note that the vertical line on the top of each bar denotes the standard deviation of the opinion score. As shown in Figure 6, the proposed PLC algorithm achieved an average score of 39, which was higher than that of the G.729.1-PLC algorithm.
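The high-band LSD used in the objective evaluation above can be sketched as follows. The sketch assumes a common form of the measure, the root mean square over bins $N/4$ to $N/2$ of the difference between the log power spectra in dB; this form, the small floor `eps`, and the function names are assumptions and may differ from the definition in [25].

```python
import cmath
import math


def dft_power(x):
    """Power spectrum |X(k)|^2 from a direct N-point DFT (fine for small N)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n)]


def high_band_lsd(original, decoded, eps=1e-12):
    """RMS log spectral difference in dB over bins N/4..N/2 (assumed form)."""
    n = len(original)
    p_ref = dft_power(original)
    p_dec = dft_power(decoded)
    bins = range(n // 4, n // 2 + 1)  # upper half of the spectrum up to Nyquist
    sq = [(10.0 * math.log10((p_ref[k] + eps) / (p_dec[k] + eps))) ** 2
          for k in bins]
    return math.sqrt(sum(sq) / len(sq))


impulse = [1.0] + [0.0] * 15
print(high_band_lsd(impulse, [2.0 * v for v in impulse]))
```

At a sampling rate of 16 kHz, bins $N/4$ to $N/2$ cover 4-8 kHz, so the measure reflects only the band reconstructed by the high-band PLC.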
Finally, in order to show how much more effective the proposed PLC algorithm was in comparison to the G.729.1-PLC algorithm, a paired $t$-test [27] was performed using their MUSHRA scores. Assuming that the differences in MUSHRA scores followed a normal distribution, the test statistic had a $t$-distribution with $(n-1)$ degrees of freedom [28], where $n$ was the number of stimuli for the MUSHRA test (seven listeners and eight audio files); thus, $n = 56$. The test statistic was given by $t = \bar{d} \cdot \sqrt{n} / s_d$, where $\bar{d}$ and $s_d$ are the sample mean and the sample standard deviation of the differences of MUSHRA scores, respectively.
For the paired $t$-test, the test statistic must be greater than $t_{0.05}$ if the two methods are significantly different. According to the mathematical table [28], $t_{0.05} = 1.674$ when $n = 56$ at a confidence level of 95%. The test statistic was 3.20, which implied that the audio quality of the signals processed by the proposed PLC was significantly better than that of G.729.1-PLC.
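The test statistic can be computed directly from paired scores, as in the sketch below. The function name `paired_t_statistic` and the example scores are hypothetical and are not the MUSHRA data of this paper.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic: mean(d) * sqrt(n) / stdev(d), with d = a - b."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # statistics.stdev is the sample standard deviation (n - 1 denominator).
    return statistics.mean(diffs) * math.sqrt(n) / statistics.stdev(diffs)


# Illustrative scores only.
t = paired_t_statistic([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
print(t)
```

The resulting value is then compared with the one-sided critical value for $n - 1$ degrees of freedom taken from a $t$-table.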

Conclusion
In this paper, a packet loss concealment (PLC) algorithm has been proposed to improve the decoded signal quality when frame erasures or packet losses occur in wireless sensor networks. The proposed PLC algorithm was based on artificial bandwidth extension (ABE) from the low band to the high band in the MDCT domain. The performance of the proposed PLC algorithm was evaluated by replacing the PLC algorithm currently employed in the ITU-T Recommendation G.729.1 decoder, G.729.1-PLC, under random and burst packet loss conditions at packet loss rates of 10, 20, and 30%. The comparisons were made based on log spectral distortion (LSD), waveform/spectrogram comparison, an A-B preference test, a MUSHRA test, and a paired $t$-test. The comparisons showed that the proposed PLC algorithm provided better quality of decoded speech and music signals than G.729.1-PLC for all the simulated packet loss conditions.