A SEMI-CONTINUOUS STATE TRANSITION PROBABILITY HMM-BASED VOICE ACTIVITY DETECTION

In this paper we introduce an efficient Hidden Markov Model-based Voice Activity Detection (VAD) algorithm with time-variant state transition probabilities in the underlying Markov chain. The transition probabilities vary in an exponential charge/discharge scheme and are softly merged with state conditional likelihood into a final VAD decision. Working in the domain of ITU-T G.729 parameters with no additional cost for feature extraction, the proposed algorithm significantly outperforms G.729 Annex B VAD while providing a balanced tradeoff between clipping and false detection errors. The performance compares very favorably with Adaptive MultiRate VAD, phase 2 (AMR2).


INTRODUCTION
Actual speech activity normally occupies 60% of the time in a regular conversation over a telecommunication system [1]. Voice Activity Detection (VAD) enables reallocating system resources during periods of speech absence. In modern telecommunication systems, VAD, in conjunction with Comfort Noise Generator (CNG) and Discontinuous Transmission (DTX) modules, plays an important role in enhancing the utilization of system resources.
VAD distinguishes between speech and non-speech frames in the presence of background noise. In general, VAD errors fall into two main types: clipping errors and false detection errors. Clipping errors occur when a speech frame is misclassified as a noise frame, which is intolerable in speech encoders because of its effect on speech intelligibility. False detection errors occur when a noise frame is misclassified as a speech frame. Echo cancellation systems are normally sensitive to this type of error because it results in incorrect parameter adaptation.
In this paper, we focus on voice activity detection for one of the popular communications standards, namely G.729. This voice coding standard was introduced by the International Telecommunication Union (ITU) along with a recommended VAD algorithm in G.729 Annex B [3] and was tested by Rockwell International in [1]. The G.729B VAD is based on a simple piecewise linear decision boundary between a set of differential parameters and their respective long-term values. The advantage of the G.729B VAD is that it works in the parameter domain of the underlying coder with no extra load for feature extraction. However, the performance of the G.729B VAD is lower than that of many other VAD algorithms. Fuzzy logic VADs (FVAD) [3] have recently been introduced for the G.729 environment. FVAD provides improvements of 43% and 25% in clipping and false detection errors, respectively, compared with the G.729B VAD.
We continue in the same direction and introduce a Hidden Markov Model (HMM)-based VAD algorithm that works in the domain of the G.729 parameters and provides a balanced improvement over the traditional G.729B VAD with minimal additional complexity. We compare the performance of the proposed VAD with that of the G.729B VAD. We also compare it with the popular Adaptive Multi-Rate, option 2 (AMR2) VAD [6], although the latter works in the FFT domain, which is different from the G.729 feature domain.
The proposed VAD softly merges the state-conditional likelihood that the frame parameters are speech or noise (irrespective of past frames) with a dynamic behavioral model across consecutive frames. It requires no prior offline learning, as opposed to FVAD.
The structure of the proposed VAD system is given in Section 2, while the proposed algorithm is described in Section 3. The performance of the proposed VAD is studied and compared with the G.729B VAD and with the Adaptive MultiRate VAD, phase 2 (AMR2) in Section 4, and a summary is given in Section 5.

THE STRUCTURE OF THE PROPOSED VAD
Modern VAD algorithms, in general, consist of two major parts. The main part produces a preliminary decision on whether the current frame is a speech or a non-speech frame. This preliminary decision depends on the difference between the characteristics of speech and noise in a certain domain, using a certain criterion of comparison. Being far from ideal, the main part of the VAD does not always provide the correct decision, e.g. clipping may happen in areas of change from noise to speech and vice versa. In order to compensate for this shortcoming, the second part of the VAD modifies the preliminary decision based on the previous decision(s). For example, some VAD algorithms use a discrete Markov chain, while others change the current frame status to speech if the preliminary decision of the previous frame is speech, regardless of the current frame characteristics. This part of the VAD is often known as the hangover scheme.
Applying a hangover scheme reduces the clipping error rate at the expense of an increase in the false detection error rate. A hangover scheme is acceptable as long as the overall performance is improved.
In the proposed VAD we adopt a semi-continuous state-transition probability HMM-based algorithm. The structure of the HMM provides an integrated probabilistic framework where the main VAD stage and the hangover stage are softly combined. One decision is produced per frame based on the interaction between the two system components, namely the hidden layer and the observation layer. As a rough analogy, the state transition layer serves as a dynamic hangover while the observation layer takes care of the comparison of the frame features.

The state transition layer (hidden layer)
The proposed model assumes two states, $S_0$ and $S_1$, representing the noise and speech frames, respectively. The probability of being in a certain state given the immediately preceding state is defined by a state transition matrix $A = \{a_{ij}\}$, where $a_{ij}$ is the probability of a state transition from state $S_i$ to state $S_j$, subject to the constraint $\sum_j a_{ij} = 1$ for every $i$.
To reflect the higher likelihood of remaining in the same state, $a_{00}$ and $a_{11}$ are expected to be generally larger than $a_{01}$ and $a_{10}$, respectively. The transition probability from the speech state to the noise state, $a_{10}$, is more important for a communication system VAD than the transition probability from the noise state to the speech state, $a_{01}$.
Incorrect transitions from the speech state to the noise state should be discouraged in order to avoid misclassifying parts of speech, e.g. speech offsets, as an outcome of the noise state. We adopt a dynamic scheme in which the probability of making such a transition, $a_{10}$, decreases from the beginning of a phrase down to a limit $a_{10\min}$. In other words, $a_{10}$ decreases with the time spent continuously in the speech state, provided that the conditional probability that the current frame $x_t$ is produced by state $S_1$, $b_1(x_t)$, is higher than the conditional probability that it is produced by state $S_0$, $b_0(x_t)$. Otherwise, $a_{10}$ gradually returns to its idle value $a_{10\max}$.
This form of continuous transition probability HMM (CHMM) has a time-varying transition matrix whose entries are set by functions $f_{ij}(t)$ that follow this exponential charge/discharge behavior with time constants $\tau_i$. In the definition of $f_{ij}(t)$, $t_i$ is the time index of the frame where the condition $b_i(x_t) > b_j(x_t)$ was first met in the most recent segment, $t_i'$ is the time index of the frame where the condition $b_i(x_t) < b_j(x_t)$ was first met in the most recent segment, and $b_i(x_t)$ is the conditional probability that the $t$-th frame, whose parameter set is $x_t$, is generated by state $S_i$, i.e., $b_i(x_t) = p(x_t \mid S_i)$.
For simplicity, $\tau_0$ is set to infinity while $a_{01\max}$, $a_{10\max}$ and $\tau_1$ are set to 0.1, reducing the number of free parameters in the system while maintaining emphasis on transitions from the speech state. Thus, $a_{10\min}$ becomes the system parameter that controls the system bias for or against speech. A bias factor $\beta$ is defined as $\beta = -\log(a_{10\min})$, subject to the constraint $\beta > 0$. In our simulation, we set the bias factor $\beta$ to an arbitrary value of 10. Note that the higher the bias factor $\beta$, the more difficult it is to leave the speech state, i.e. less clipping and more false speech detection may result.
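The charge/discharge behavior of the speech-to-noise transition probability described above can be sketched as a first-order recursion. This is only an illustration under stated assumptions: the recursive form, the function names `a10_min_from_bias` and `next_a10`, the use of the natural logarithm in the bias factor, and the reading of the time constant as seconds with a 10 ms frame period are all hypothetical and not taken from the paper.

```python
import math


def a10_min_from_bias(beta):
    """Lower limit of a10 for a bias factor beta = -log(a10_min).

    Assumption: the logarithm is natural; the text does not state the base.
    """
    if beta <= 0:
        raise ValueError("the bias factor must be positive")
    return math.exp(-beta)


def next_a10(a10, speech_more_likely, a10_min, a10_max=0.1,
             tau1=0.1, frame_period=0.01):
    """One-frame update of the speech-to-noise transition probability.

    Assumed first-order exponential form: a10 decays toward a10_min while
    b1(x_t) > b0(x_t) and recovers toward a10_max otherwise.
    tau1 is read as a time constant in seconds (assumption).
    """
    target = a10_min if speech_more_likely else a10_max
    decay = math.exp(-frame_period / tau1)
    return target + (a10 - target) * decay


if __name__ == "__main__":
    a10_min = a10_min_from_bias(10.0)
    a10 = 0.1  # idle value a10_max
    for frame in range(5):
        a10 = next_a10(a10, True, a10_min)
        print(frame, a10)
```

With this form, a long run of speech-like frames drives the transition probability toward its lower limit, which makes leaving the speech state progressively harder, and a run of noise-like frames restores the idle value.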
Setting $\tau_0$ to infinity results in constant $a_{00}$ and $a_{01}$, so that only the second row of $A$, i.e. $a_{10}$ and $a_{11}$, varies with time. The proposed model is thus a semi-continuous transition probability HMM. This should not be confused with the semi-continuous HMM, where the term "semi-continuous" refers to the probability density function of the HMM.
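As an illustration of the resulting structure, the following sketch builds a two-state transition matrix whose first row is constant and whose second row follows the current value of the speech-to-noise probability. The function name `transition_matrix` is hypothetical, and fixing the noise-to-speech probability at 0.1 is an assumption based on the stated value of its upper limit.

```python
def transition_matrix(a10, a01=0.1):
    """Two-state transition matrix with a constant first row.

    Row 0 is the noise state, row 1 is the speech state.
    Assumption: a01 stays at 0.1; only a10 changes from frame to frame.
    """
    for value in (a10, a01):
        if not 0.0 <= value <= 1.0:
            raise ValueError("transition probabilities must lie in [0, 1]")
    return [[1.0 - a01, a01],
            [a10, 1.0 - a10]]


if __name__ == "__main__":
    for row in transition_matrix(0.02):
        print(row, sum(row))
```

Each row sums to one by construction, so the stochastic constraint on the matrix holds for any admissible value of the time-varying entry.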

The observation layer
The observation layer is the part of the system that computes the likelihood of a frame being a speech or a noise frame given a certain state. This conditional likelihood is estimated from a distribution associated with each state, which takes the form of a Probability Density Function (PDF) for continuous-probability HMMs.
A state PDF is normally approximated by a weighted sum of a set of prototype distributions. For simplicity, we approximate each state PDF in the proposed HMM by a single p-dimensional multivariate distribution. We adopt the generalized multivariate Gaussian distribution of [4], Eq. (5), with $\kappa = 0.5$ for the Laplacian case. In Eq. (5), $\Gamma(\cdot)$ is the Gamma function, $p$ is the size of the feature vector $x$, and $\Sigma$ is a non-negative definite $p \times p$ matrix computed from $\mathrm{cov}(x)$, the covariance matrix of $x$.
The choice of a Laplacian distribution to represent the state PDFs is motivated by our statistical observations on a set of 32000 frames from voice streams of two male and two female speakers [5].
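A minimal sketch of such a state density is given below. It assumes that the family in [4] is the multivariate power exponential family, in which an exponent of 1 gives the Gaussian and 0.5 gives a Laplacian-type density, and it further assumes a diagonal scale matrix to stay dependency-free. Both assumptions, as well as the names `scale_from_variances` and `log_density`, are illustrative and not confirmed by the paper.

```python
import math


def scale_from_variances(variances, kappa=0.5):
    """Diagonal scale entries from per-feature variances.

    Assumed relation (power exponential family):
    Sigma = p*Gamma(p/(2k)) / (2**(1/k) * Gamma((p+2)/(2k))) * cov(x).
    """
    p = len(variances)
    log_c = (math.log(p) + math.lgamma(p / (2 * kappa))
             - math.log(2) / kappa - math.lgamma((p + 2) / (2 * kappa)))
    c = math.exp(log_c)
    return [c * v for v in variances]


def log_density(x, mean, sigma_diag, kappa=0.5):
    """Log-density of the assumed family with a diagonal scale matrix."""
    p = len(x)
    # Quadratic form (x - mean)^T Sigma^{-1} (x - mean) for diagonal Sigma.
    q = sum((xi - mi) ** 2 / si for xi, mi, si in zip(x, mean, sigma_diag))
    log_norm = (math.log(p) + math.lgamma(p / 2)
                - (p / 2) * math.log(math.pi)
                - math.lgamma(1 + p / (2 * kappa))
                - (1 + p / (2 * kappa)) * math.log(2)
                - 0.5 * sum(math.log(s) for s in sigma_diag))
    return log_norm - 0.5 * q ** kappa


if __name__ == "__main__":
    sigma = scale_from_variances([8.0, 8.0])
    print(log_density([0.5, -0.5], [0.0, 0.0], sigma))
```

Working in the log domain avoids underflow when the feature vector is far from the state mean, which matters when the two state likelihoods are compared frame by frame.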

THE PROPOSED ALGORITHM
An initial estimate of the noise state PDF is obtained from the first 16 frames. The initial parameters of the speech state PDF are assumed to be the same except for the variance, which is assumed to be 10 times larger than that of the noise state PDF. This assumption, which is important to compensate for the absence of prior information about speech statistics, seems acceptable over a wide range of SNR (down to 0 dB). However, it is expected to have a negative impact on the system performance at extremely low SNR levels (-5 dB and below), because at such low SNR the background noise variance becomes extremely large, invalidating the assumption that the noise variance is 0.1 of the speech variance.
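The initialization described above can be sketched as follows. The per-feature (diagonal) variance, the unbiased variance estimator and the name `init_state_pdfs` are assumptions made for illustration; the 16-frame window and the factor of 10 come from the text.

```python
def init_state_pdfs(frames, n_init=16, variance_ratio=10.0):
    """Initial noise and speech state parameters from the first frames.

    frames: list of feature vectors (lists of floats), one per frame.
    Assumption: variances are estimated per feature (diagonal covariance).
    """
    init = frames[:n_init]
    n = len(init)
    if n < 2:
        raise ValueError("need at least two frames to estimate a variance")
    p = len(init[0])
    mean = [sum(f[d] for f in init) / n for d in range(p)]
    var = [sum((f[d] - mean[d]) ** 2 for f in init) / (n - 1)
           for d in range(p)]
    noise = {"mean": mean, "var": var}
    # Speech starts from the same mean with a variance 10 times larger.
    speech = {"mean": list(mean), "var": [variance_ratio * v for v in var]}
    return noise, speech


if __name__ == "__main__":
    demo = [[float(i), 2.0 * i] for i in range(20)]
    noise, speech = init_state_pdfs(demo)
    print(noise)
    print(speech)
```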
The VAD flag of a frame is set to 1 if the likelihood of the speech state is larger than or equal to the likelihood of the noise state at that frame, and is set to 0 otherwise. The likelihood of a state $S_j$ is the joint likelihood that $S_j$ generates frame $t$, whose feature vector is $x_t$, together with the frame sequence up to time $t$. In its definition, $q_t$ is the effective state at the $t$-th frame, $t_0$ is the number of frames used to initialize the state PDFs, and $T$ is the total number of frames in the stream.
In order to improve the estimation of the PDF parameters and to compensate for the (presumably) slowly varying changes in the speech environment, we adopt an adjustment scheme by which the parameters of the state PDFs are updated with an adjustment factor $\rho = 1/n^{(j)}$, where $n^{(j)}$ is the number of past visits to state $S_j$.
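One way to read the decision rule above is as a step of the standard HMM forward recursion with per-frame normalization. The sketch below follows that reading; it is an assumption rather than the paper's exact formula, and the name `vad_step` is hypothetical.

```python
def vad_step(alpha_prev, A, b):
    """One frame of a normalized two-state forward recursion.

    alpha_prev: [noise, speech] state likelihoods from the previous frame.
    A: 2x2 transition matrix for the current frame (rows sum to 1).
    b: [b0(x_t), b1(x_t)] state-conditional likelihoods of the frame.
    Returns the normalized likelihoods and the VAD flag (1 = speech).
    """
    alpha = [(alpha_prev[0] * A[0][j] + alpha_prev[1] * A[1][j]) * b[j]
             for j in range(2)]
    total = alpha[0] + alpha[1]
    if total <= 0.0:
        raise ValueError("both state likelihoods are zero")
    alpha = [a / total for a in alpha]
    flag = 1 if alpha[1] >= alpha[0] else 0
    return alpha, flag


if __name__ == "__main__":
    A = [[0.9, 0.1], [0.001, 0.999]]
    # The frame looks slightly more like noise, but the previous state
    # was speech, so the transition layer keeps the flag at 1.
    print(vad_step([0.0, 1.0], A, [2.0, 1.0]))
```

The usage example shows the hangover-like effect of the transition layer: a frame that is marginally more noise-like does not immediately end a speech segment when the probability of leaving the speech state is small.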
Small values of ρ are better from a stability point of view but result in slower adjustment. We note that this adjustment scheme may not be highly robust at large values of ρ, where error accumulation may result from wrong decisions. This argument is particularly important in low-performance VAD conditions (e.g. very low SNR), where the correct detection rate is lower than 50%. In order to ensure stability at the beginning of the call, where the number of visits to both states is small, we limit the adjustment factor ρ to 0.1%. The complexity of the proposed algorithm is about three times that of the G.729B VAD, which is very small compared with the overall G.729 encoder complexity.
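The adjustment scheme with a limited adjustment factor could be sketched as below. The exponential-forgetting form of the mean and variance updates, the per-feature variance, the reading of the 0.1% limit as an upper bound, and the name `update_state` are assumptions made for illustration.

```python
def update_state(mean, var, x, n_visits, rho_max=0.001):
    """Update the parameters of the state that was just visited.

    mean, var: current per-feature mean and variance of the state PDF.
    x: feature vector of the current frame.
    n_visits: number of past visits to this state (at least 1).
    Assumption: rho = 1/n_visits, capped at rho_max.
    """
    if n_visits < 1:
        raise ValueError("n_visits must be at least 1")
    rho = min(1.0 / n_visits, rho_max)
    new_mean = [(1.0 - rho) * m + rho * xi for m, xi in zip(mean, x)]
    # The squared deviation is taken around the mean before the update.
    new_var = [(1.0 - rho) * v + rho * (xi - m) ** 2
               for v, m, xi in zip(var, mean, x)]
    return new_mean, new_var, rho


if __name__ == "__main__":
    print(update_state([0.0], [1.0], [10.0], n_visits=1))
```

With the cap in place, a single outlying frame early in the call moves the parameters only slightly, which limits the effect of an early wrong decision.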

RESULTS AND DISCUSSION
The proposed VAD works on top of the G.729 encoder and is applied to a set of 12 voice streams (about 96 seconds) from 4 different speakers: two males and two females, with 3 streams per speaker, from [5]. The G.729 encoder runs at 100 frames/s (80 samples/frame) and provides the values of energy, low-band energy, zero crossing rate, and ten Line Spectral Frequencies (LSFs) for each frame. The voice streams are corrupted by three different types of background noise (white noise, babble noise and car noise) at different average SNR levels between 20 dB and 0 dB. Table 1 shows a comparison between the performance of the proposed HMM VAD and the Adaptive MultiRate VAD, phase 2 (AMR2) [6] against the performance of the ITU G.729B VAD. The performance is evaluated in terms of the probability of clipping, Pc, and the probability of false detection, Pe, where:
- Pe is the ratio of the number of noise frames that are mistakenly classified as speech to the total number of noise frames.
- Pc is the ratio of the number of speech frames that are mistakenly classified as noise to the total number of speech frames.
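The two error measures defined above can be computed from a reference labeling and the VAD output as in the following sketch. The function name `vad_error_rates` and the 0/1 label convention (1 for speech) are assumptions, and the labels in the usage example are made up for illustration.

```python
def vad_error_rates(reference, decisions):
    """Return (Pc, Pe) from per-frame reference labels and VAD flags.

    reference, decisions: sequences of 0 (noise) and 1 (speech).
    """
    if len(reference) != len(decisions):
        raise ValueError("label sequences must have equal length")
    pairs = list(zip(reference, decisions))
    n_speech = sum(1 for r, _ in pairs if r == 1)
    n_noise = len(pairs) - n_speech
    clipped = sum(1 for r, d in pairs if r == 1 and d == 0)
    false_detections = sum(1 for r, d in pairs if r == 0 and d == 1)
    pc = clipped / n_speech if n_speech else 0.0
    pe = false_detections / n_noise if n_noise else 0.0
    return pc, pe


if __name__ == "__main__":
    ref = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    out = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
    print(vad_error_rates(ref, out))
```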
In general, the AMR2 VAD provides the lowest clipping rate of the three algorithms (a 93.02% improvement over the G.729B VAD). This comes at the cost of a higher false detection rate (42.37% average degradation), especially in the case of babble noise. In contrast, the proposed HMM VAD provides a balanced, yet significant, improvement over G.729B in both clipping rate and false detection rate: 72.21% and 72.37%, respectively.
We note that the improvement of the proposed system in the false detection rate is better than its improvement in the clipping rate in the case of white noise. This is because the noise is more stationary and thus easier to track. On the other hand, in the case of car noise the improvement in the clipping rate is better than the improvement in the false detection rate because the noise is less stationary.

SUMMARY
In this paper, we propose an efficient VAD algorithm that works with G.729-compliant encoders in their parameter domain with minimal additional computational load for feature extraction. The proposed VAD is based on a semi-continuous state transition probability HMM with a Laplacian observation layer and needs no offline learning. It provides a significant improvement over the G.729B VAD, with a good balance between the reduction in clipping rate and the reduction in false detection rate.

Table 1. The performance of the proposed HMM VAD and the AMR2 VAD against the performance of the G.729B VAD. The performance is evaluated in terms of:
- the probability of clipping, Pc,
- the probability of false detection, Pe,
- the improvement in Pc, which is given by $-(P_c|_{\mathrm{AMR2/HMM}} - P_c|_{\mathrm{G.729}}) \times 100 / P_c|_{\mathrm{G.729}}$, and
- the improvement in Pe, which is given by $-(P_e|_{\mathrm{AMR2/HMM}} - P_e|_{\mathrm{G.729}}) \times 100 / P_e|_{\mathrm{G.729}}$.
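The improvement measure in the caption can be evaluated as in the sketch below. The function name `improvement_percent` is hypothetical and the numbers in the usage example are made up; they are not results from the paper.

```python
def improvement_percent(p_candidate, p_reference):
    """Relative improvement of an error probability over a reference VAD.

    Positive values mean the candidate has a lower error probability
    than the reference; negative values mean a degradation.
    """
    if p_reference <= 0:
        raise ValueError("the reference probability must be positive")
    return -(p_candidate - p_reference) * 100.0 / p_reference


if __name__ == "__main__":
    # Hypothetical values: the candidate clips 2% of speech frames,
    # the reference clips 8%.
    print(improvement_percent(0.02, 0.08))
```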