Speech/Nonspeech Detection Using Minimal Walsh Basis Functions

This paper presents a new method to detect the speech and nonspeech components of a given noisy signal. The original noisy speech signal is first modified by combining binary Walsh basis functions with an analysis-synthesis scheme. From the modified signals, the speech components are distinguished from the nonspeech components using a simple decision scheme. The minimal number of Walsh basis functions to be applied is determined using singular value decomposition (SVD). The main advantages of the proposed method are low computational complexity, fewer parameters to be adjusted, and simple implementation. The use of Walsh basis functions makes the proposed algorithm efficiently applicable in real-world situations where processing time is crucial. Simulation results indicate that the proposed algorithm achieves high speech and nonspeech detection rates while maintaining a low error rate under different noisy conditions.


INTRODUCTION
Speech/nonspeech detection is the task of discriminating the noise-only frames of a signal from its noisy speech frames. In the literature, this process is usually known as voice activity detection (VAD), and it is an important problem in many areas of speech processing such as real-time noise reduction for speech enhancement, speech recognition, digital hearing aids, and modern telecommunication systems. In multimedia communications, silence compression algorithms are usually applied to reduce the average transmission rate during silence periods of speech. These compression algorithms are also based on speech/silence detection, and they allow the speech channel to be shared with other information so that the channel capacity can be improved. Furthermore, VAD is an essential component in variable rate speech coders to achieve efficient bandwidth reduction without speech quality degradation. Several methods that trade off accuracy, delay, perceptual quality, and computational complexity have been proposed in the literature to deal with the problem of speech/nonspeech detection.
A silence compression speech communication system with VAD was standardized by ITU-T Recommendation G.729 [1, Annex B]. It uses a feature vector consisting of four parameters: full-band energy, low-band energy, zero-crossing rate, and a spectral measure for the multiboundary decisions. Fourteen boundary decisions are defined based on the difference between each parameter and its respective long-term average. The initial voice activity decision for each frame is set to 1 if one of these multiboundary decisions in the space of the four difference measures is true. The final decision is made by smoothing the initial decision in four stages (i.e., a hangover scheme). A voice detection algorithm based on a pattern recognition approach and fuzzy logic was proposed for wireless communications in noisy environments [2]. This algorithm uses the same acoustic parameters adopted by G.729 for feature extraction.
Another standardized VAD is that of the ETSI speech coder for the GSM cellular communication system [3]. Based on spectral estimation and periodicity detection, this adaptive multirate (AMR) speech coder specifies two options for the VAD to be used in discontinuous transmission (DTX) mode. For applications like mobile phones and packet networks, DTX mode is usually required for lower bit-rate speech coding. In AMR Option 1, the input signal is divided into subbands and the signal level in each band is calculated. The VAD decision is made using the outputs of the pitch detection, tone detection, and complex signal analysis modules together with the signal level. A hangover scheme is also applied before the final decision is made.
In AMR Option 2, the input signal is first converted into the frequency domain using the discrete Fourier transform (DFT). Then, the VAD decision is made based on the channel energy estimator, channel SNR estimator, spectral deviation estimator, background noise estimator, peak-to-average-ratio module, and voice metric calculation module.
Apart from the above voice activity detection methods, most of which are based on the parameters of speech, model-based VADs have been introduced recently. Formulating the problem of speech pause detection in terms of statistical decision theory, two detectors based on maximum a posteriori probability (MAP) and the Neyman-Pearson test were described in [4]. A Gaussian statistical model, which assumes that the discrete Fourier transform coefficients of speech and noise are asymptotically independent Gaussian random variables, was proposed in [5,6]. Assuming the distributions of the speech and noise signals to be Laplacian and Gaussian, respectively, the authors in [7] developed a soft voice activity detector by decomposing the speech signal into discrete cosine transform (DCT) components.
Noise is a well-known factor which degrades the quality and intelligibility of speech in many application areas. To reduce the noise level without affecting the quality of speech signals, a noise reduction algorithm is usually employed. Spectral subtraction is a widely used approach in practical noise suppression schemes. This scheme usually estimates the noise characteristics from the nonspeech intervals of the signal. Therefore, identification of nonspeech periods is an important and sensitive part of existing noise reduction schemes. In this context, the accuracy and reliability of a VAD become critical in determining the performance of the noise reduction algorithm. Most papers reporting on noise reduction refer to speech pause detection when dealing with the problem of noise estimation. Speech pause detectors are a very sensitive and often limiting part of systems for the reduction of noise in speech [8].
A speech pause detection algorithm based on an autocorrelation voicing detector was developed in [9]. The algorithm was designed for a real-time system and implemented on a DSP platform for speech enhancement in hearing aids. An adaptive Karhunen-Loève transform (KLT) tracking-based algorithm was also proposed for the enhancement of speech degraded by additive colored noise [10]. An algorithm which detects speech pauses by tracking the dynamics of the signal's temporal power envelope was proposed in [8]. Sometimes, detection algorithms were designed for specific applications such as noise suppression [11] and wideband coding [12]. Voice activity detection algorithms for cellular networks in the presence of babble noise and vehicular noise were presented in [13], adopting the approach used in the European digital mobile cellular standard [14]. Combining the geometrically adaptive energy threshold (GAET) method and the least-square periodicity estimator (LSPE), conversational speech is separated from silence in [15]. A fuzzy polarity correlation function is also applied to determine speech sections and background noise in a telephone network environment [16].
In this paper, a method to discriminate the active and inactive periods of speech signals corrupted by noise of unknown type and unknown level is presented. It is assumed that the inactive segments can be short as well as long (i.e., while some active segments are located very close together, others may be separated by longer periods). The proposed speech/nonspeech detection algorithm takes advantage of the simplicity of the binary Walsh transform. First, the signal to be classified is modified using binary Walsh basis functions. The minimal number of basis functions to be applied is determined using a technique for the selection of the wavelet decomposition at natural scale [17]. Using the statistics of the modified signals, which are highly informative about the characteristics of noisy speech frames as well as noise-only frames, classification is performed with a decision scheme.
Unlike other VAD methods, in which the decision is made on a frame-by-frame basis, the proposed method directly obtains sets of consecutive frames as speech and nonspeech segments. The effectiveness of the proposed method is evaluated objectively on different types of noise with varying SNRs using the criteria of error rate, speech/nonspeech detection rates, and false alarm rate. ROC analyses are presented to compare the proposed method with the standardized algorithms: G.729 and AMR Options 1 and 2. Experimental results show that the detection accuracy of the proposed algorithm is high for both speech and nonspeech frames regardless of the noise level.

PROPOSED ALGORITHM
The block diagram of the proposed speech/nonspeech detection algorithm based on binary Walsh basis functions is depicted in Figure 1. First, the signal is represented using FFTs. These representations are then modified by Walsh basis functions before reconstruction. The number of basis functions to be applied is determined using SVD. Finally, speech/nonspeech periods are detected from the modified signals using a decision scheme. Details of the algorithm are explained in the following sections.

Modification of signal
The noisy input signal is reconstructed as a modified sequence based on an analysis/synthesis scheme described in [18]. First, the input signal $x(n)$, sampled at 8 kHz, is multiplied by a Hanning window to yield successive windowed segments $x_s(n)$. These windowed segments are transformed into the spectral domain using FFTs of size 128. In this manner, a time-varying spectrum $X_s(n,k) = |X_s(n,k)|e^{j\phi(n,k)}$ with $n = 0, 1, \ldots, N-1$ and $k = 0, 1, \ldots, N-1$ is computed for each windowed segment. Here, $X_s(n,k)$ denotes the spectral component of the noisy input signal at frequency index $k$ and time index $n$. Before synthesis, each $s$th windowed segment is modified as the weighted sum of the magnitude $|X_s(n,k)|$ using binary Walsh basis functions. Using basis functions reduces the number of parameters needed to track the variations between active and inactive regions of the noisy signal. In this context, SVD is used to determine the minimal number of Walsh basis functions to be applied. The detailed procedure for identifying the minimal number of Walsh basis functions is described in the next section. Applying the $i$th basis function $\varphi_i$, a modified sequence $y_s(n)$ for each windowed segment is obtained as in (1). All $S$ modified segments are then concatenated to produce an output signal $y(n)$ that shows the time-varying magnitude response, as in (2).
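The analysis step above can be sketched as follows. This is only an illustration under stated assumptions: the function name `weighted_magnitude` and the 50% segment overlap (`hop=64`) are hypothetical, the basis function is applied to all 128 FFT bins, and the sketch returns one weighted sum per segment instead of the fully synthesized sequence of (1) and (2).

```python
import numpy as np

def weighted_magnitude(x, basis, n_fft=128, hop=64):
    """Per-segment weighted sum of the FFT magnitude spectrum.

    x     : 1-D noisy signal (8 kHz sampling, as in the text)
    basis : length-n_fft vector of +1/-1 weights (a Walsh basis function)
    hop   : segment shift in samples (assumption: 50% overlap)
    """
    x = np.asarray(x, dtype=float)
    basis = np.asarray(basis, dtype=float)
    if basis.shape != (n_fft,):
        raise ValueError("basis must have exactly n_fft entries")
    window = np.hanning(n_fft)
    n_seg = max(0, 1 + (len(x) - n_fft) // hop)
    out = np.empty(n_seg)
    for s in range(n_seg):
        # Window the segment, then keep only the magnitude spectrum.
        seg = x[s * hop : s * hop + n_fft] * window
        mag = np.abs(np.fft.fft(seg, n_fft))
        # Weighted sum of the magnitudes with the +1/-1 basis weights.
        out[s] = np.dot(mag, basis)
    return out
```

With `basis = np.ones(128)` the output is the plain sum of magnitudes in each segment, so silent segments give values near zero and active segments give larger values.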

Determination of minimal Walsh basis functions
The Walsh transform is a matrix consisting of a complete orthogonal function set having only two values, +1 and −1, over their definition intervals. The motivation for using the Walsh transform rather than other transforms is its computational simplicity, which gives a realistic processing time. The Walsh function of order $N$ can be represented as in (3), where $u = 0, 1, \ldots, N-1$, $N = 2^q$, and $b_i(x)$ is the $i$th bit value of $x$. In this context, the Walsh functions are arranged in sequency order, that is, by the number of zero crossings of the Walsh function per definition interval, to obtain a set of basis functions. The number of zero crossings increases with the order of the basis functions. It is very important to select the proper basis functions so that variations between the dynamics of speech and nonspeech can be captured more precisely. A method to select the global natural scale in the discrete wavelet transform [17] is adopted to determine the required number of basis functions. This method adaptively detects the optimal scale using SVD while the decomposition is being carried out. Consider an input noisy speech signal $x$ of length $V$, and let $y_d(\nu)$ be its modified sequence obtained by applying the basis function of order $d$ in (1) and (2).
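The sequency ordering described above can be illustrated with a short sketch. It does not use the bit-product expression of (3); instead it assumes the common construction of building a Sylvester-type Hadamard matrix and sorting its rows by their number of sign changes. The function name `walsh_sequency` is hypothetical.

```python
import numpy as np

def walsh_sequency(n):
    """Return an n x n matrix whose rows are Walsh functions in sequency order."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    # Sylvester construction of a Hadamard matrix (natural order).
    h = np.array([[1]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    # Sequency = number of sign changes (zero crossings) along each row.
    sign_changes = np.count_nonzero(np.diff(h, axis=1), axis=1)
    return h[np.argsort(sign_changes)]
```

Row `d` of the result has exactly `d` sign changes, so row 0 is the constant 0-order function and higher rows oscillate progressively faster.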
The modified sequences $\{y_d(\nu)\}_{d=0}^{D-1}$ can be represented in a matrix $P$ of size $D \times V$. To determine the order of the basis functions with dominant eigenvalues, the SVD of the matrix $P$ is calculated adaptively, starting with the first two orders (i.e., $\varphi_0$ and $\varphi_1$) and then adding the higher orders.
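A rough sketch of the adaptive SVD step is given below. The text does not state the exact criterion used to declare an eigenvalue dominant, so the sketch only reports, for each number of rows from 2 to $D$, the share of the total energy carried by the leading singular value; the function name `leading_share_by_order` and this energy-share measure are assumptions.

```python
import numpy as np

def leading_share_by_order(p):
    """Energy share of the largest singular value as rows of P are added.

    p : D x V matrix whose row d is the modified sequence y_d.
    Returns one value for each d = 2, ..., D (first two orders, then more).
    """
    p = np.asarray(p, dtype=float)
    shares = []
    for d in range(2, p.shape[0] + 1):
        s = np.linalg.svd(p[:d], compute_uv=False)
        energy = s ** 2
        total = energy.sum()
        # Guard against an all-zero matrix.
        shares.append(energy[0] / total if total > 0 else 0.0)
    return shares
```

When the rows are nearly proportional to each other the share stays close to 1, while mutually orthogonal rows of equal energy give a share of 1/d.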
To determine the number of basis functions to be applied, we studied the probability distributions of the basis function orders as a function of SNR. In this analysis, speech signals from the TIDIGITS database spoken by male and female speakers were used. Any long interword silences were removed first. Silence segments of different sizes were then introduced to obtain varying intervals between active regions. To generate the noisy signals, white Gaussian noise was artificially added at SNR levels of 20 dB, 10 dB, 5 dB, and 0 dB. Here, SNR is defined as in (4), where $s$ is speech, $v$ is noise, and $N_s$ and $N_v$ are the lengths of the speech and noise signals, respectively. Figure 2 displays the probability of occurrence of a basis function order, termed coverage, for changing levels of SNR. It is observed that the dominant eigenvalue is located only within the first few basis functions. In particular, the minimal order for highly noisy signals at 5 dB and 0 dB is found to be 1. For the signals at the high SNRs of 20 dB and 10 dB and for the clean signals, the dominant eigenvalue is found when the order of the basis function is 3.
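The noise-mixing step can be sketched as follows. The sketch assumes that the SNR is the ratio of the average power of the speech (normalized by $N_s$) to the average power of the noise (normalized by $N_v$), expressed in decibels; this reading is consistent with the symbols named above but is an assumption, and the function names `snr_db` and `add_white_noise` are hypothetical.

```python
import numpy as np

def snr_db(s, v):
    """SNR in dB as the ratio of average speech power to average noise power."""
    s = np.asarray(s, dtype=float)
    v = np.asarray(v, dtype=float)
    return 10.0 * np.log10(np.mean(s ** 2) / np.mean(v ** 2))

def add_white_noise(s, target_snr_db, rng=None):
    """Add white Gaussian noise scaled so that snr_db(s, v) equals the target."""
    rng = np.random.default_rng() if rng is None else rng
    s = np.asarray(s, dtype=float)
    v = rng.standard_normal(len(s))
    # Scale the noise so its average power matches the requested SNR.
    gain = np.sqrt(np.mean(s ** 2) / (np.mean(v ** 2) * 10.0 ** (target_snr_db / 10.0)))
    v = gain * v
    return s + v, v
```

For example, `add_white_noise(s, 0.0)` returns a mixture in which the noise has the same average power as the speech.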
Hence, the lower-order basis functions of the Walsh transform matrix are highly informative and should be used in the modification process. Moreover, it is found that higher-order coefficients carry less weight in terms of their magnitude, and a large Walsh kernel may not be easy to interpret [19]. In practice, it is not possible to obtain any a priori information about the noise level and noise type. Hence, the proposed algorithm sets the minimal order of basis functions $N_{\min}$ to 3 throughout the experiments. In the original algorithm [17], the optimal scale is defined as the average of the details from the first level to the natural scale, the level associated with the dominant eigenvalues. However, this averaging may introduce a clipping effect for signals with a low speech level. To avoid this effect, a shifting operator which swaps the right and left halves of the basis function coefficients is applied first. Then a good estimate of the binary Walsh basis function at the dominant eigenvalue is defined as in (5), where $N_{\min} = 3$ is the largest order associated with the most prominent eigenvalues and CS(·) is the shifting operator. This new basis function $\psi$ provides a sharper representation and more discriminating features. It is also found that the distinction between noisy speech periods and noise-only components with narrow intervals becomes more apparent in the modified sequence obtained using $\psi$.
For length $N$, the function $\psi$ consists of 1's for $n = 0, \ldots, N/2 - 1$, followed by −1's for $n = N/2, \ldots, 3N/4 - 1$ and 1's for $n = 3N/4, \ldots, N - 1$, where $n$ is the sample index. Substituting the values of $\psi$ in (1) yields (6). To compare $\psi$ with $\varphi_0$, we replace $\varphi_i$ with $\varphi_0$ and rewrite (1) as (7). Using (7), the difference between the "short-term area under the magnitude spectrum" for the noisy speech case and the noise-only case (especially for white Gaussian noise) will be smaller because the sum is taken over the whole 0-4 kHz frequency band. Based on the expressions in (6) and (7), the discrimination between speech and nonspeech segments will be higher when using $\psi$ than when using $\varphi_0$.
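The piecewise description of $\psi$ translates directly into code. The sketch below builds $\psi$ for a given length and shows how a $\psi$-weighted sum of magnitudes differs from the unweighted sum obtained with $\varphi_0$; the function name `psi` is hypothetical, and the weighted sum stands in for (6) and (7), which are not reproduced here.

```python
import numpy as np

def psi(n):
    """+1 on [0, n/2), -1 on [n/2, 3n/4), +1 on [3n/4, n)."""
    if n < 4 or n % 4:
        raise ValueError("n must be a positive multiple of 4")
    out = np.ones(n, dtype=int)
    out[n // 2 : 3 * n // 4] = -1
    return out

def psi_weighted_sum(mag):
    """Weighted sum of a magnitude vector using psi as the +1/-1 weights."""
    mag = np.asarray(mag, dtype=float)
    return float(np.dot(mag, psi(len(mag))))
```

Because only the third quarter of the indices carries a weight of −1, the $\psi$-weighted sum equals the total sum minus twice the sum over that quarter, so it responds to how the magnitude is distributed across the bins and not only to its overall level.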
To demonstrate the effectiveness of the proposed modification, an example is shown in Figures 3-5. A clean signal is shown in Figure 3. The modified versions of this signal in white Gaussian noise at 5 dB SNR, obtained using the 0-order basis function $\varphi_0$ and the estimated basis function $\psi$, are shown in Figures 4 and 5, respectively. It is observed from Figures 4 and 5 that the modified signal $y_m$ obtained using $\psi$ discriminates better between speech and nonspeech frames owing to its deeper and sharper representation. The function $\psi$ appears to capture more effectively the intrasegment variation between noisy speech segments and noise-only segments of narrow interval.

Decision scheme
First, the 0-order basis function $\varphi_0$ is used to produce a modified sequence $y_0(\nu)$ that carries the global information of the original noisy signal. This modified sequence is used as a reference or pilot signal, as in telecommunications, where a pilot signal is usually transmitted over a communication system for supervisory, control, or reference purposes. Another modified signal, $y_m(\nu)$, which carries the local characteristics, is formed using the new basis function $\psi$. From this sequence, the locations and durations of the speech active and inactive periods can be captured more precisely. In this way, the approximate locations of active and inactive frames are first determined from the modified signal $y_0(\nu)$. Then, the accuracy of these reference decisions is improved by using the second modified signal $y_m(\nu)$, which contains the detailed information. Using the reconstructed signals $y_0$ and $y_m$, the detection procedure can be described as follows.
(i) Extract two sequences of local minima, $\{\alpha_{0i}\}_{i=1}^{L}$ and $\{\alpha_{mi}\}_{i=1}^{L}$, where $L$ is the number of frames, from every 4 ms frame of $y_0(\nu)$ and $y_m(\nu)$, assuming that the initial 200 ms consists of a noise-only period.
(ii) Set thresholds $\tau_0$ and $\tau_m$ for each minima sequence using simple statistics as $\tau_0 = \mu_0 - \kappa\delta_0$ and $\tau_m = \mu_m - \kappa\delta_m$, where $\mu_0$ and $\delta_0$ are the mean and the standard deviation of the first set of local minima, $\mu_m$ and $\delta_m$ are those of the second set of local minima, and $\kappa$ is a positive value. After experimenting with the modified waveforms for a number of clean as well as noisy speech data, $\kappa$ is set to 0.75.
(iii) Declare a frame as an inactive frame if either $\alpha_{0i} < \tau_0$ or $\alpha_{mi} < \tau_m$. In this way, the inactive frame indices are obtained from $y_0(\nu)$ and $y_m(\nu)$ as the sets $R = \{r_1, r_2, \ldots, r_P\}$ and $T$, respectively.
(iv) Combine the two initial boundary decisions as $C = R \cap T$, where $C = \{c_1, c_2, \ldots, c_J\}$ is the set of elements common to $R$ and $T$. Taking the members of $C$ as the indices of the inactive frames, the final decision for detecting speech and nonspeech frames is obtained.
Here, we decide that inactive frames exist whenever some or all of the prominent local minima obtained from the first modified signal $y_0(\nu)$ coincide with the local minima found from the second modified signal $y_m(\nu)$. Detected frames whose local minima are not obtained from both modified sequences $y_0(\nu)$ and $y_m(\nu)$ are discarded as outliers.
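Steps (i) to (iv) can be sketched as follows. Several details are assumptions made for illustration: the local minimum of a frame is taken as the smallest sample in each non-overlapping 32-sample frame (4 ms at 8 kHz), the statistics are computed over all frames, the initial 200 ms noise-only assumption is not used explicitly, and the function names are hypothetical.

```python
import numpy as np

def frame_minima(y, frame_len=32):
    """Minimum of each non-overlapping frame (32 samples = 4 ms at 8 kHz)."""
    y = np.asarray(y, dtype=float)
    n_frames = len(y) // frame_len
    return y[: n_frames * frame_len].reshape(n_frames, frame_len).min(axis=1)

def candidate_inactive(y, kappa=0.75, frame_len=32):
    """Frame indices whose minimum falls below tau = mean - kappa * std."""
    alpha = frame_minima(y, frame_len)
    tau = alpha.mean() - kappa * alpha.std()
    return np.flatnonzero(alpha < tau)

def inactive_frames(y0, ym, kappa=0.75, frame_len=32):
    """Final inactive-frame indices C: those common to R (from y0) and T (from ym)."""
    r = candidate_inactive(y0, kappa, frame_len)
    t = candidate_inactive(ym, kappa, frame_len)
    return np.intersect1d(r, t)
```

A frame flagged in only one of the two modified sequences is dropped by the intersection, which corresponds to discarding outliers as described above.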

EXPERIMENTAL RESULTS AND COMPARISON
In this section, the results and an objective evaluation of the proposed method are presented. The detection result for a noisy speech signal is illustrated in Figure 6, where the signal is embedded in white Gaussian noise at 0 dB SNR. The results obtained by the proposed detection scheme are shown together with the manually determined actual speech and nonspeech periods. It is seen that the detection accuracy is high for both speech and nonspeech periods, so the proposed algorithm achieves a good performance level.

Evaluation data
To evaluate the efficiency of the proposed method, its performance was compared with that of the G.729 VAD and AMR Options 1 and 2. For this comparison, speech signals from 11 speakers of the TIDIGITS database were extracted. Three signals from each of these male and female speakers were concatenated to generate signals 8 s to 11 s long. Silence or pause segments of varying intervals were then inserted between the active segments as described in Section 2.2. The test sequences consist of nearly 70% active speech components and 30% inactive components. Silence segments of very short as well as long durations are included in the test sequences. For the reference decisions, the active and inactive frames of all clean signals were marked manually. Five types of noise (white Gaussian, babble, car, street, and train) were added to the original signals at SNRs of 20 dB, 10 dB, and 0 dB.

Performance evaluation
As performance criteria, the speech detection rate, nonspeech detection rate, and error rate were employed. The speech and nonspeech detection rates are defined as the ratio of the correctly classified speech frames to the total number of speech frames and the ratio of the correctly classified nonspeech frames to the total number of nonspeech frames, respectively. The error rate is defined as the ratio of the incorrectly classified frames to the total number of frames.
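The three criteria defined above can be computed from frame labels as in the following sketch. The function name `detection_rates` and the convention that `True` marks a speech frame are assumptions made for illustration.

```python
import numpy as np

def detection_rates(reference, decision):
    """Return (speech detection rate, nonspeech detection rate, error rate).

    reference, decision : per-frame labels, True (or 1) for speech frames.
    Assumes the reference contains at least one frame of each class.
    """
    ref = np.asarray(reference, dtype=bool)
    dec = np.asarray(decision, dtype=bool)
    if ref.shape != dec.shape:
        raise ValueError("reference and decision must have the same length")
    speech_rate = float(np.mean(dec[ref]))        # correct speech / all speech
    nonspeech_rate = float(np.mean(~dec[~ref]))   # correct nonspeech / all nonspeech
    error_rate = float(np.mean(dec != ref))       # misclassified / all frames
    return speech_rate, nonspeech_rate, error_rate
```

For instance, with four reference speech frames of which three are detected and two reference nonspeech frames of which one is detected, the rates are 0.75, 0.5, and 2/6.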
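Table 1 compares the speech detection rates, nonspeech detection rates, and error rates of the proposed method with those of the standard methods for all noise types. The proposed binary Walsh transform based method consistently detects the speech frames at an almost constant rate regardless of noise type and level. In terms of nonspeech detection rates, G.729 is the worst, with an accuracy of less than 20% most of the time. Although AMR1 and AMR2 yield better detection rates than G.729, the proposed method is found to be the best one in the problem of nonspeech detection for all noise conditions. Moreover, the proposed method detects both speech and nonspeech frames with the lowest error probabilities at all SNR levels in all environments.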
The average speech detection rates, nonspeech detection rates, and error rates of the proposed method, ITU G.729, and AMR Options 1 and 2 in five background noises (white, babble, car, street, and train) at SNRs ranging from 20 dB to 0 dB are compared in Figures 7, 8, and 9. The average speech detection rate of the proposed method is nearly constant across SNRs of 20 dB, 10 dB, and 0 dB, with respective values of 88.74%, 88.66%, and 85.99%. Although the speech detection rates of the standardized methods are high at 20 dB, their performance decreases with decreasing SNR. In terms of nonspeech detection rates, G.729 yields the lowest rates, followed by AMR1. The nonspeech detection rates of the proposed algorithm are the highest, although AMR2 achieves improved rates over G.729 and AMR1. The proposed method achieves the lowest error rates (10.01%, 12.04%, and 19.05%) for SNRs from 20 dB down to 0 dB. The error rates of AMR2 are found to depend on the noise level, although it offers moderate nonspeech detection rates compared with G.729 and AMR1.

Computational considerations
The proposed algorithm is implemented in Matlab whereas the other algorithms are implemented using C.

RECEIVER OPERATING CHARACTERISTICS ANALYSIS
In this section, the detectability and discriminability of the proposed method are verified by receiver operating characteristic (ROC) analysis (Figures 10, 11, and 12). In signal detection, the relationship between detection and false alarm probabilities is often characterized by ROC curves. Only the subset of the speech database in car noise, as described in Section 3, is used in this ROC analysis. Figures 10, 11, and 12 show the results of the ROC analysis at 20 dB, 10 dB, and 0 dB SNR. For each noise level, the nonspeech hit rate (nonspeech detection rate) and the false alarm rate (1 − speech detection rate) are determined for the proposed method, G.729, ETSI AMR1, and AMR2. The operating points of G.729, AMR1, and AMR2 shift to the right in the ROC plane with decreasing SNR. However, the operating point of the proposed method maintains an almost constant false alarm rate. The false alarm rate of AMR2 increases with decreasing SNR, although its nonspeech hit rate becomes higher. Among the standard VADs, G.729 maintains the lowest false alarm rates in most cases. However, it also has poor nonspeech hit rates at all SNR levels. Under noisier conditions, the nonspeech detectability of AMR2 is better than that of AMR1. The proposed method significantly improves the nonspeech hit rate over the other methods, with a nearly constant false alarm rate across changing environments. For a given nonspeech hit rate, the proposed scheme detects the signal with the lowest false alarm rate. In addition, for a given false alarm rate, the highest nonspeech hit rate is obtained by the proposed method. From this objective evaluation, it can be concluded that the proposed method discriminates between speech and noise better than the standardized methods.

CONCLUSION
In this paper, the problem of speech/nonspeech detection in the presence of noise is addressed. A method based on binary Walsh functions is developed. The basic idea is to reconstruct the noisy speech signal as modified sequences from which speech and nonspeech frames are detected. The main advantage of this method is its very low computational complexity. The Walsh basis functions make the proposed algorithm efficient, simple, and fast to implement, with fewer parameters to be optimized. Thus the algorithm is applicable in practical situations where processing time is critical. Experimental results indicate that the proposed method detects speech as well as nonspeech frames with lower error rates across different types of noise with varying SNRs. ROC analysis also shows that the proposed method consistently outperforms G.729, AMR1, and AMR2 in terms of discriminability between speech and noise. Since the computational complexity of the algorithm is relatively low, the algorithm can be applied in areas such as real-time noise cancellation systems and noise reduction for the enhancement of speech signals.

Figure 1: Block diagram of the proposed algorithm.

Figure 2: The distribution of the order of basis functions for the signals from clean to 0 dB.

Figure 7: Performance comparison for average nonspeech detection rate of the proposed method and standard VADs (G.729, AMR1, and AMR2) in different backgrounds with varying SNRs.

Figure 8: Performance comparison for average speech detection rate of the proposed method and standard VADs (G.729, AMR1, and AMR2) in different backgrounds with varying SNRs.

Figure 9: Performance comparison for average error rate of the proposed method and standard VADs (G.729, AMR1, and AMR2) in different backgrounds with varying SNRs.

Table 1: Comparison of speech detection rates, nonspeech detection rates, and error rates of the proposed method to standard methods (G.729, AMR1, and AMR2) for different levels of SNRs in various noisy environments.

Figure 12: Receiver operating characteristic analysis for proposed method, ITU G.729, AMR1, and AMR2 at 0 dB with car noise.