Speech Waveform Compression Using Robust Adaptive Voice Activity Detection for Nonstationary Noise

Voice activity detection (VAD) is crucial in a wide variety of speech applications. However, almost all existing VAD algorithms suffer from the nonstationarity of both speech and noise. To combat this difficulty, we propose a new voice activity detector, which is based on the Mel-energy features and an adaptive threshold related to the signal-to-noise ratio (SNR) estimates. In this paper, we first justify the robustness of the Bayes classifier using the Mel-energy features over that using the Fourier spectral features in various noise environments. Then, we design an algorithm using the dynamic Mel-energy estimator and the adaptive threshold, which depends on the SNR estimates. In addition, a realignment scheme is incorporated to correct the sparse-and-spurious noise estimates. Numerous simulations are carried out to evaluate the performance of our proposed VAD method, and comparisons are made with two existing representative schemes, namely, the VAD using the likelihood ratio test with Fourier spectral energy features and that based on the enhanced time-frequency parameters. Three types of noise, namely, white noise (stationary), babble noise (nonstationary), and vehicular noise (nonstationary), were artificially added for our experiments. Our proposed VAD algorithm significantly outperforms the other existing methods, as illustrated by the corresponding receiver operating characteristic (ROC) curves. Finally, we demonstrate one of the major applications, namely, speech waveform compression, associated with our new robust VAD scheme and quantify its effectiveness in terms of compression efficiency.


INTRODUCTION
Nowadays, speech processing techniques can be applied in a wide variety of communication devices such as cellular handsets, Internet search engines, and call-in telephony services. Despite its constant growth, voice activity detection, one of the important speech processing techniques, is still intriguing to many researchers [1]. A voice activity detector is a preprocessing component for speech recognition systems, isolated word boundary detection systems, cellular phones, and speech enhancement systems. A voice activity detection (VAD) algorithm is designed to distinguish speech from background noise among short-time frames. The importance of the VAD system to speech processing applications can be easily found in the existing literature. In the early developed VAD algorithms, the features were extracted from the short-time energy, zero-crossing rates [2], linear predictive coding coefficients [3], and cepstral coefficients [4]. Recently, the Mel-energy features [5], the wavelet transforms [6], the correlation coefficients [7], and the likelihood ratios [8, 9] have been adopted as the primary features for VAD techniques. Among all of the adopted features for VAD, the Mel-spectra have been shown to be very promising in the previous VAD method using the threshold based on the minimum subband Mel energy [5]. According to several experiments in [5], the Mel-spectral features lead to the most robust VAD performance compared with almost all other features. However, comprehensive studies of the Mel-spectral features for VAD cannot be found in the existing literature.
In most of the aforementioned techniques, the feature extraction is followed by the threshold detection. In realistic environments, there exist nonstationary noises such as babble noise and vehicular noise. Therefore, static thresholds, which depend only on the information extracted from the first few frames, would cause numerous classification errors [7]. Hence, dynamic or adaptive thresholds were proposed to combat the problem of nonstationary noises [5]. Nevertheless, how to appropriately adjust the threshold dynamically is still very challenging [5, 7]. In this paper, we propose a new adaptive threshold, which depends on the signal-to-noise ratio (SNR) estimates and is derived from dynamically updated speech and noise information. Consequently, it leads to much better VAD performance in the presence of nonstationary noise. Furthermore, we extend our new VAD technique to the application of speech waveform compression, which can be used in voice communication and storage [10].
In this paper, we first justify the advantages of the Mel-energy features via the Bayes hypothesis analysis instead of the psychophysical conjectures in the existing literature [5]. We then propose a new robust VAD algorithm, which is based on the Mel-energy features and the adaptive threshold detection. Such a dynamic threshold can be derived from the SNR estimates in [11]. The rest of this paper is organized as follows. The time-frequency features of the speech signals are studied and analyzed in Section 2. In Section 3, we introduce our new robust VAD scheme using the Mel-energy features and the SNR-based adaptive threshold. The simulation results demonstrating the effectiveness of our new VAD algorithm and the speech compression performance are both presented in Section 4. The concluding remarks are finally drawn in Section 5.

COMPARATIVE STUDIES FOR TIME-FREQUENCY FEATURES
The features play a key role in all voice activity detectors. The ambiguity due to unreliable features, especially in conditions of low SNRs and/or nonstationary background noises, frequently causes misdetection [1]. According to a few simulations in [5], robust speech detection can be achieved using the Mel-spectral features compared to the Fourier spectral features. The existing literature provides an explanation based simply on the auditory psychophysics that the human ears perceive acoustic waves along a nonlinear scale in the frequency domain, which forms the Mel filter bank [5]. In addition, dimensionality reduction, without a tradeoff in performance, can be achieved for speech detection and recognition using the Mel-spectral features. In the subsequent sections, we formulate the Fourier spectral and the Mel-spectral features for VAD. We then compare the VAD performances using these two features via the Bayes hypothesis analysis to show the effectiveness of the Mel-spectral features.

Fourier spectral features
Fourier spectral features are obtained from the short-time Fourier transform. A primary Fourier spectral feature, the short-time Fourier energy |x_freq[n, k]|², is defined as [12]

x_freq[n, k] = Σ_{m=0}^{N−1} x(n + m) w(m) e^{−j2πkm/N}, (1)

where x(n) is the discrete-time speech signal, w(m) is the window sequence, and N is the window size. According to (1), it is noted that |x_freq[n, k]|² is a double-indexed function with time index n and frequency index k. Usually, the short-time framed Fourier energy E_FT[n′, k] will be collected for n = 0, Δn, 2Δn, …, n′Δn, … and n′ ∈ Z⁺ ∪ {0}, such that

E_FT[n′, k] = |x_freq[n′Δn, k]|², (2)

where Δn > 0 is the frame advance step size.
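As a concrete sketch of (1) and (2), the framed Fourier energy can be computed with NumPy as follows. The function and variable names are ours, not from the paper; a 256-sample Hamming-windowed frame with a 128-sample advance step is assumed for illustration.

```python
import numpy as np

def short_time_fourier_energy(x, frame_len=256, step=128):
    """Framed short-time Fourier energy E_FT[n', k] with a Hamming window."""
    w = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // step
    E = np.empty((n_frames, frame_len // 2 + 1))
    for n in range(n_frames):
        frame = x[n * step : n * step + frame_len] * w
        E[n] = np.abs(np.fft.rfft(frame)) ** 2  # energy of each frequency bin
    return E

# Example: a 1 kHz tone sampled at 16 kHz concentrates its energy near bin 16
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)
E = short_time_fourier_energy(x)
print(E.shape)          # → (124, 129): (number of frames, frequency bins)
print(np.argmax(E[0]))  # → 16, since 1000 Hz / (16000 Hz / 256) = 16
```

Only the nonnegative-frequency bins are kept (`rfft`), since the energy of a real signal is symmetric in k.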

Mel-spectral features
The Mel-spectral features can be acquired through the weighted Fourier spectral features via the Mel filter bank, which is a uniformly spaced filter bank on a nonlinearly warped frequency scale, known as the Mel scale, as illustrated in Figure 1. The relationship between the Mel-scale frequency f_mel and the conventional frequency f_con (in Hz) is given by [5]

f_mel = 2595 log₁₀(1 + f_con/700). (3)

Without loss of generality, we choose a set of 20-band Mel filters throughout this paper as illustrated in Figure 1. As depicted in Figure 1, the squared magnitude response of the ith Mel filter, |H_i(k)|², specifies the individual weighting factor for the kth frequency component of the Fourier spectra [5]. According to (2) and (3), the short-time framed Mel energy E_mel(n′, i) is given by

E_mel(n′, i) = Σ_k |H_i(k)|² E_FT[n′, k], (5)

where i is the Mel-filter index. It is noted that the Mel-filter index i ranges from 1 to 20.
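A minimal sketch of the Mel mapping in (3) and a 20-band triangular filter bank follows; the exact band-edge placement of [5] is not reproduced here, so the construction below (triangular filters with edges uniformly spaced on the Mel scale) is the common textbook one.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters=20, n_fft=256, fs=16000):
    """Triangular filters uniformly spaced on the Mel scale."""
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2), n_filters + 2)
    bin_edges = np.floor((n_fft + 1) * mel_to_hz(mel_edges) / fs).astype(int)
    H = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, ctr, hi = bin_edges[i], bin_edges[i + 1], bin_edges[i + 2]
        H[i, lo:ctr] = (np.arange(lo, ctr) - lo) / max(ctr - lo, 1)  # rising edge
        H[i, ctr:hi] = (hi - np.arange(ctr, hi)) / max(hi - ctr, 1)  # falling edge
    return H

# Each row of H is one filter's weighting over the 129 Fourier bins;
# the Mel energy of a frame is then E_mel(n', i) = sum_k H[i, k] * E_FT[n', k].
H = mel_filter_bank()
print(H.shape)  # → (20, 129)
```

Note that `hz_to_mel` and `mel_to_hz` are exact inverses, so the filter centers land where (3) prescribes.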

Comparative studies between Fourier and Mel-spectral analyses for speech
In this section, we provide the simulation results for speech/noise classification to justify the advantage of the Mel-energy features over the Fourier energy features extracted on a frame-by-frame basis. We establish an optimal Bayes classifier [13] to evaluate the effectiveness of the extracted features under the assumption that the entire set of feature vectors is drawn from multidimensional Gaussian processes. The general framework of the Bayes classifiers is shown in Figure 2. Given the speech data, the framed Fourier and Mel energies are acquired to establish the corresponding Bayes classifiers, and the ground truth (the true speech/noise frame labels) is applied to determine the optimal threshold. Then, the outcomes of each classifier are compared with the ground truth.
The Bayes classifiers can be constructed as follows. Each feature vector is of dimension d (d = N for Fourier spectral features, d = 20 for Mel-spectral features). Let us denote the feature vector in an arbitrary frame as X ∈ R^d. Thus, the speech/noise frame classification becomes a binary Bayes hypothesis test, where H_s: X ∈ ω_s (X is extracted from a frame in the presence of both speech and noise) and H_n: X ∈ ω_n (X is extracted from a frame in the presence of noise only) are the two corresponding hypotheses. Let the two a priori probabilities be P(ω_s) and P(ω_n), respectively. Then, the a posteriori probabilities are given by [13]

P(ω_s | X) = P(X | ω_s) P(ω_s) / P(X), (6)

where P(X) = P(X | ω_s)P(ω_s) + P(X | ω_n)P(ω_n) is a common factor associated with the nonparametric probability. The conditional probabilities are given by

P(X | ω_s) = (2π)^{−d/2} |Σ_s|^{−1/2} exp[−(1/2)(X − μ_s)ᵀ Σ_s^{−1} (X − μ_s)], (7)

P(X | ω_n) = (2π)^{−d/2} |Σ_n|^{−1/2} exp[−(1/2)(X − μ_n)ᵀ Σ_n^{−1} (X − μ_n)], (8)

where μ_s, μ_n are the mean vectors and Σ_s, Σ_n are the covariance matrices associated with the extracted feature vectors X for Hypotheses H_s and H_n, respectively. According to (6), the Bayes classifier depends on P(X | ω_s), P(X | ω_n), P(ω_s), and P(ω_n) only. The ground truth can be utilized to determine P(ω_s) and P(ω_n). In addition, the ground truth can also be utilized to determine the logarithms of the conditional probabilities, such that

ln P(X | ω_i) = −(d/2) ln(2π) − (1/2) ln|Σ_i| − (1/2)(X − μ_i)ᵀ Σ_i^{−1} (X − μ_i), i ∈ {s, n}. (9)

According to (6)–(9), the Bayes decision rule is given by

decide H_s if g_s(X) ≥ g_n(X); decide H_n otherwise, (10)

where the discriminant functions are defined as

g_i(X) = ln P(X | ω_i) + ln P(ω_i), i ∈ {s, n}. (11)

For quantificational convenience, we artificially add noise to the clean speech (SNR = 5 dB) and carry out the Bayes classification as given by (10). The speech data are randomly picked from the TIMIT database (three male and three female speakers) [14], while the white and the babble noises are taken from the NOISEX-92 database [15]. The ground truth comes from the speech frame labels specified in the TIMIT database.
The outcomes of the Bayes classifiers, in terms of the percentages of classification errors for both Mel-spectral features and Fourier spectral features, are presented in Table 1. According to Table 1, the Mel-spectral features lead to much better speech/noise frame detection performance than the Fourier spectral features in both white and babble noises.
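The Gaussian Bayes decision rule of (6)–(11) can be sketched as follows on synthetic two-dimensional features; the data here are toy stand-ins for the framed energies, not TIMIT frames.

```python
import numpy as np

def fit_gaussian(X):
    """ML estimates of the mean vector and covariance matrix for one class."""
    return X.mean(axis=0), np.cov(X, rowvar=False)

def discriminant(x, mu, sigma, prior):
    """g_i(x) = ln P(x | w_i) + ln P(w_i) for a Gaussian class model."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    return (-0.5 * d * np.log(2 * np.pi) - 0.5 * logdet
            - 0.5 * diff @ np.linalg.solve(sigma, diff) + np.log(prior))

# Toy example: two well-separated 2-D Gaussian classes ("speech" vs "noise")
rng = np.random.default_rng(0)
Xs = rng.normal(5.0, 1.0, size=(500, 2))  # stand-in for speech-frame features
Xn = rng.normal(0.0, 1.0, size=(500, 2))  # stand-in for noise-frame features
(mu_s, S_s), (mu_n, S_n) = fit_gaussian(Xs), fit_gaussian(Xn)
x = np.array([4.5, 5.2])  # a test frame near the speech cluster
label = ("speech" if discriminant(x, mu_s, S_s, 0.5) > discriminant(x, mu_n, S_n, 0.5)
         else "noise")
print(label)  # → speech
```

The priors are set to 0.5 here for simplicity; in the paper they are estimated from the ground-truth frame labels.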

NOVEL VOICE DETECTION USING MEL-SPECTRA AND ADAPTIVE THRESHOLDS
In this section, we introduce the feature extraction and the adaptive threshold determination and thereby, propose the new Mel energy-based adaptive-threshold voice activity detector (ME VAD). Feature extraction and threshold adaptation are presented in Sections 3.1 and 3.2, respectively. A new realignment mechanism to improve the classification accuracy is also addressed in Section 3.3.

Mel-energy feature extraction
Consider a time-domain noisy speech signal, which is divided into overlapping frames of 256 samples each. In addition, each frame is weighted by a Hamming window of the same size. Initially, the short-time framed Fourier energy E_FT[n′, k] can be determined using (2). The short-time framed Mel energy E_mel(n′, i) for the ith frequency band in the n′th frame can be calculated using (5). Thus, for the n′th frame, the robust energy indicator I(n′) can be calculated as [5]

I(n′) = Σ_{i=1}^{20} E_mel(n′, i). (12)

For example, I(n′) is depicted for a speech signal corrupted by the babble noise at SNR = 5 dB in Figure 3.

Adaptive threshold determination
We propose a new "two-threshold" scheme for better speech/noise classification. Since the first N_noise frames consist of noise only, we can determine the first threshold, or the a priori threshold η_apr(n′), for the frame n′, as

η_apr(n′) = E_n + α [E_max(n′) − E_n], (13)

where E_n ≡ (1/N_noise) Σ_{m=1}^{N_noise} I(m) denotes the average energy of the first few noise frames, E_max(n′) = max_{1≤m≤n′} {I(m)}, and 0 < α < 1 is a weighting factor. Accordingly, the a priori decision rule is given by

decide the n′th frame as speech if I(n′) > η_apr(n′); otherwise as noise. (14)

After the first N_noise noise frames, the classified noise energy indicator N(n′) = I(n′), if the n′th frame is a noise frame according to (14), is stored in a noise energy buffer, while S(n′) = I(n′) is stored in a speech energy buffer if such a frame is a speech frame instead. Once a sufficient collection of both N(n′) and S(n′) is available, the SNR estimation can be achieved using [11]. Next, we introduce the procedure for estimating the temporal SNR and the a posteriori threshold η_aps(n′) as follows. The temporal speech energy estimate S̄(n′) can be obtained as

S̄(n′) = Σ_{m=n′−B_s+1}^{n′} S(m), (15)

where B_s and B_n specify the speech buffer and noise buffer sizes, respectively. Thereby, the temporal SNR can be calculated as

SNR(n′) ≡ 10 log₁₀ [ S̄(n′) / Σ_{m=n′−B_n+1}^{n′} N(m) ]. (16)

According to (16), we can estimate the temporal noise energy N̄(n′) as

N̄(n′) = γ S̄(n′) 10^{−SNR(n′)/10}, (17)

where γ is the control parameter. Thus, the a posteriori threshold η_aps(n′) can be determined as

η_aps(n′) = N̄(n′). (18)

The a posteriori speech/noise frame classification can be achieved using

decide the n′th frame as speech if I(n′) > η_aps(n′); otherwise as noise. (19)

According to (13) and (18), the two thresholds are dynamically adapted and therefore they can track the nonstationarity. Finally, these speech/noise frame labels are sent through a realignment mechanism to remove sparse occurrences of speech/noise as described in the next section.
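The two-threshold adaptation described above can be sketched end to end as follows. The parameter values (`alpha`, `gamma`, the buffer size) and the exact buffer policy are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def adaptive_vad(I, n_noise=10, alpha=0.2, gamma=1.5, buf=20):
    """Sketch of the two-threshold scheme over an energy-indicator sequence I."""
    labels = np.zeros(len(I), dtype=bool)  # True = speech frame
    E_n = np.mean(I[:n_noise])             # average energy of leading noise frames
    speech_buf, noise_buf = [], list(I[:n_noise])
    for n in range(n_noise, len(I)):
        E_max = np.max(I[: n + 1])
        eta_apr = E_n + alpha * (E_max - E_n)       # a priori threshold
        if len(speech_buf) >= buf:                  # enough data: SNR-based threshold
            S_bar = np.sum(speech_buf[-buf:])
            N_sum = np.sum(noise_buf[-buf:])
            snr = 10.0 * np.log10(S_bar / N_sum)
            eta = gamma * S_bar * 10.0 ** (-snr / 10.0)  # a posteriori threshold
        else:
            eta = eta_apr
        labels[n] = I[n] > eta
        (speech_buf if labels[n] else noise_buf).append(I[n])
    return labels

# Synthetic indicator: low-energy noise with one high-energy "speech burst"
I = np.concatenate([np.full(30, 1.0), np.full(30, 50.0), np.full(30, 1.0)])
labels = adaptive_vad(I)
print(labels[40], labels[80])  # → True False
```

The first `n_noise` frames are treated as noise by construction, matching the paper's assumption that the recording opens with noise only.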

Realignment
Maleh and Kabal concluded that hangover schemes were not effective in correcting isolated VAD errors (i.e., a speech frame among a sequence of noise frames or vice versa) and hence proposed an isolated error correction mechanism (IECM) [16]. Our realignment mechanism is based on a similar approach, using majority voting to reassign the labels resulting from (19) to be consistent with the majority over every 5 to 7 successive frames. The complete ME VAD algorithm is presented in Figure 4.
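A minimal majority-voting realignment in this spirit can be written as follows; a 5-frame voting window is assumed for illustration.

```python
import numpy as np

def realign(labels, window=5):
    """Majority-vote smoothing of 0/1 frame labels over a sliding window."""
    labels = np.asarray(labels, dtype=int)
    out = labels.copy()
    half = window // 2
    for n in range(len(labels)):
        lo, hi = max(0, n - half), min(len(labels), n + half + 1)
        out[n] = int(labels[lo:hi].sum() * 2 > (hi - lo))  # local majority
    return out

# An isolated "speech" frame inside a run of noise frames is corrected
raw = [0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1]
print(realign(raw).tolist())  # → [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

Voting on each frame's neighborhood removes isolated flips while leaving sustained speech runs intact.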

SIMULATIONS
Here, we provide the performance comparison for different noise environments at various noise levels. Three major VAD algorithms are carried out for the simulations, namely, our ME VAD method, the statistical model-based voice activity detector (SM VAD) [8], and the enhanced time-frequency-based robust word boundary detection algorithm (ETF VAD) [5]. The signals in the experiments were drawn from the TIMIT database [14] and corrupted by the nonstationary babble and vehicular noises from the NOISEX-92 database [15]. The sampling frequency was set to 16 kHz. Three male and three female speakers were selected from each of three different US regions (New England, Southern, and Western) [14]. Thirty-six data files were generated; each data file under test consisted of three different speech utterances from the same speaker concatenated together with pauses in between and added with the aforementioned noise samples at SNR = 5 dB and SNR = 15 dB. The simulation results are given next.

Receiver operating characteristics (ROC) for VAD
The receiver operating characteristic (ROC) curve of a detector is often utilized to illustrate the overall detection performance [17]. The corresponding ROC curves for the aforementioned VAD algorithms are depicted in Figures 5 and 6 for SNR = 5 dB and SNR = 15 dB, respectively. It can be observed that the performance of ME VAD (our algorithm) is almost always better than that of the ETF VAD and SM VAD methods for the nonstationary babble and vehicular noises. For the critical false detection rates of 2% and 4%, the correct detection rates are tabulated in Tables 2 and 3 for SNR = 5 dB and SNR = 15 dB, respectively. According to Figures 5 and 6 as well as Tables 2 and 3, our ME VAD algorithm consistently achieves the highest correct detection rates among the three methods.
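Each point on such a ROC curve pairs a correct detection rate (speech frames labeled as speech) with a false detection rate (noise frames labeled as speech). A minimal computation of one operating point from frame labels, using hypothetical labels for illustration:

```python
import numpy as np

def roc_point(predicted, truth):
    """Correct- and false-detection rates for one VAD operating point."""
    predicted = np.asarray(predicted, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    correct = np.mean(predicted[truth])   # fraction of speech frames detected
    false = np.mean(predicted[~truth])    # fraction of noise frames misdetected
    return correct, false

truth     = [1, 1, 1, 1, 0, 0, 0, 0]  # ground-truth frame labels
predicted = [1, 1, 1, 0, 1, 0, 0, 0]  # one detector's output
c, f = roc_point(predicted, truth)
print(c, f)  # → 0.75 0.25
```

Sweeping the detector's threshold and recomputing this pair traces out the full ROC curve.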

Speech waveform compression
In addition, we develop a speech waveform compression scheme using the ME VAD outcomes. This algorithm identifies the beginnings and the ends of the pauses in a speech signal and then deletes the detected noise frames for waveform compression. The speech reconstruction (decompression) can be performed by reinserting the pauses at a desired noise level [18]. The performance of any compression scheme is generally measured in terms of compression efficiency and playback quality [19]. However, the compression efficiency measure might be misleading since it dramatically varies among data and strictly depends on the durations of the silence periods. Therefore, in this paper, we define a new compression efficiency measure C_d as

C_d ≡ (the number of detected noise frames / the number of total frames) × 100 (%). (20)

On the other hand, the nature of the original speech data can be characterized by the actual noise percentage measure C_a, that is,

C_a ≡ (the number of actual noise frames / the number of total frames) × 100 (%). (21)
The compression efficiency curves for different speech data files (added with babble noise at SNR = 5 dB), as depicted in Figure 7, illustrate the relationship between C_d and the ground-truth C_a. It is obvious that the optimal compression is achieved only when C_d = C_a. Thus, the closer the C_d versus C_a curve of each scheme to the straight line C_d = C_a, illustrated as "Optimum" in Figure 7, the better the compression performance. According to Figure 7, our ME VAD-based speech compression method significantly outperforms the others.
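The two measures reduce to simple frame counts; a quick sketch, assuming a hypothetical 1000-frame file with 200 truly silent frames of which the VAD flags 180:

```python
def compression_measures(detected_noise, actual_noise, total_frames):
    """C_d and C_a in percent; optimal compression is achieved when C_d = C_a."""
    c_d = 100.0 * detected_noise / total_frames  # detected-noise percentage
    c_a = 100.0 * actual_noise / total_frames    # actual-noise percentage
    return c_d, c_a

c_d, c_a = compression_measures(180, 200, 1000)
print(c_d, c_a)  # → 18.0 20.0
```

Here C_d < C_a means some silence is retained (under-compression), while C_d > C_a would mean speech frames were wrongly dropped.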

CONCLUSION
In this paper, we study the advantage of speech/noise detection using the Mel-spectral features over the Fourier spectral features via the Bayes hypothesis analysis. Then, we design a robust speech/noise voice activity detection algorithm using the adaptive a priori and a posteriori thresholds incorporated with a realignment scheme. Moreover, we can extend this new voice activity detection method to establish a new speech waveform compressor for voice communications. Simulation results show that our new voice activity detection algorithm and the corresponding speech compression scheme greatly outperform other existing methods, especially in the low signal-to-noise ratio environments of babble and vehicular noises.