Short communication

A deep architecture for audio-visual voice activity detection in the presence of transients☆
Introduction
Voice activity detection is the problem of segmenting a given speech signal into sections that contain speech and sections that contain only noise and interference. It is an essential component of many modern speech-based systems, such as those for speech and speaker recognition, speech enhancement, emotion recognition, and dominant speaker identification. We consider a multimodal setting, in which speech is captured by a microphone while a video camera is pointed at the face of the desired speaker. The multimodal setting is especially useful in difficult acoustic environments, where the audio signal is measured in the presence of high levels of acoustic noise and transient interferences, such as keyboard tapping and hammering [1], [2]. The video signal is completely invariant to the acoustic environment and is nowadays widely available in devices such as smartphones and laptops. Therefore, proper incorporation of the video signal significantly improves voice detection, as we show in this paper.
In silent acoustic environments, speech segments in a given signal are successfully distinguished from silence segments using methods based on simple acoustic features, such as the zero-crossing rate and energy values in short time intervals [3], [4], [5]. However, the performance of these methods deteriorates significantly in the presence of noise, even at moderate signal-to-noise ratios (SNRs). Another group of methods assumes statistical models for the noisy signal and focuses on estimating the model parameters; for example, the variances of speech and noise can be estimated by tracking the variations of the noisy signal over time [6], [7], [8], [9]. The main drawback of such methods is that they cannot properly model highly non-stationary noise and transient interferences, which are the main focus of this study. The spectrum of a transient often varies rapidly over time, as does the spectrum of speech, and as a result the two are not properly distinguished [2].
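To make this classical approach concrete, the following is a minimal sketch of an energy/zero-crossing detector; the frame sizes and thresholds are illustrative assumptions rather than values from any of the cited methods, and, as noted above, such a rule degrades quickly once noise is present.

```python
import numpy as np

def simple_vad(x, fs, frame_len=0.025, hop=0.010,
               energy_thresh=1e-3, zcr_thresh=0.25):
    """Label each frame as speech (True) or non-speech (False)
    using short-time energy and zero-crossing rate (ZCR)."""
    n, h = int(frame_len * fs), int(hop * fs)
    labels = []
    for start in range(0, len(x) - n + 1, h):
        frame = x[start:start + n]
        energy = np.mean(frame ** 2)                        # short-time energy
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2  # crossings per sample
        # Crude rule: speech has relatively high energy and moderate ZCR;
        # both thresholds here are assumed values for illustration only.
        labels.append(energy > energy_thresh and zcr < zcr_thresh)
    return np.array(labels)
```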
More recent studies address voice activity detection from a machine learning point of view, in which the goal is to classify segments of the noisy signal into speech and non-speech classes [10], [11]. Learning-based methods learn implicit models from training data instead of assuming explicit distributions for the noisy signal. A family of models particularly relevant to this paper is deep neural networks, which have gained popularity in recent years across a variety of machine learning tasks. These models use multiple hidden layers to learn useful signal representations, and their potential for voice activity detection has been partially explored in recent studies. Zhang and Wu [12] proposed using a deep belief network to learn an underlying representation of a speech signal from predefined acoustic features; the new representation is then fed into a linear classifier for speech detection. Mendelev et al. [13] introduced a multi-layer perceptron network for speech detection and proposed improving its robustness to noise using the "dropout" technique [14]. Despite the improved performance, the network in [13] classifies each time frame independently, thus ignoring temporal relations between segments of the signal. The studies presented in [15], [16], [17], [18] use a recurrent neural network (RNN) to naturally exploit temporal information by incorporating previous inputs into the voice detection decision. However, these methods still struggle with frames that contain both speech and transients. Since transients are characterized by fast variations in time and high energy, they often appear more dominant than speech; frames containing only transients therefore look similar to frames containing both transients and speech, and are wrongly detected as speech frames.
A different line of studies suggests improving the robustness of speech detection to noise and transients by incorporating a video signal, which is invariant to the acoustic environment. Often, the video captures the mouth region of the speaker and is represented by specifically designed features that model the shape and movement of the mouth in each frame. Examples of such features are the height and width of the mouth [19], [20], key points and intensity levels extracted from the mouth region [21], [22], [23], [24], and motion vectors [25], [26].
Two common approaches to the fusion of audio and video signals exist in the literature, termed early and late fusion [27], [28]. In early fusion, video and audio features are concatenated into a single feature vector and processed as single-modal data [29]. In late fusion, measures of speech presence and absence are constructed separately from each modality and then combined using statistical models [30], [31]. Dov et al. [32], [33], for example, proposed obtaining separate low-dimensional representations of the audio and video signals using diffusion maps; the two modalities are then fused by a combination of speech presence measures based on spatial and temporal relations between samples of the signal in the low-dimensional domain.
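The sketch below illustrates the two fusion schemes in the abstract; the convex weighting used for late fusion is an assumed, simple combination rule, not the exact statistical models of [30], [31].

```python
import numpy as np

def early_fusion(audio_feat, video_feat):
    # Early fusion: concatenate per-frame audio and video features into a
    # single vector, then process the result as single-modal data.
    return np.concatenate([audio_feat, video_feat], axis=-1)

def late_fusion(p_speech_audio, p_speech_video, w=0.5):
    # Late fusion: combine per-modality speech-presence scores; a convex
    # combination (the weight w is an assumption) is one simple instance.
    return w * p_speech_audio + (1.0 - w) * p_speech_video
```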
In this paper, we propose a deep neural network architecture for audio-visual voice activity detection. The architecture is based on specifically designed auto-encoders providing an underlying representation of the signal, in which simultaneous data from audio and video modalities are fused in order to reduce the effect of transients. The new representation is incorporated into an RNN, which, in turn, is trained for speech presence/absence classification by incorporating temporal relations between samples of the signal in the new representation. The classification is performed in a frame-by-frame manner without temporal delay, which makes the proposed deep architecture suitable for online applications.
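The exact design is given in Section 3; as a rough sketch of the overall flow only, the hypothetical PyTorch module below fuses the two modalities into an encoded representation and classifies each frame causally with a unidirectional RNN (the layer sizes, single-layer encoder, and LSTM cell are our assumptions, not the paper's design).

```python
import torch
import torch.nn as nn

class AVVoiceDetector(nn.Module):
    """Sketch: fused audio-visual encoding followed by a recurrent classifier."""
    def __init__(self, audio_dim, video_dim, hidden_dim=128, rnn_dim=64):
        super().__init__()
        # Encoder mapping concatenated audio-video features to a fused code.
        self.encoder = nn.Sequential(
            nn.Linear(audio_dim + video_dim, hidden_dim), nn.Sigmoid())
        # Unidirectional RNN: each frame is classified without temporal delay.
        self.rnn = nn.LSTM(hidden_dim, rnn_dim, batch_first=True)
        self.classifier = nn.Linear(rnn_dim, 1)  # speech-presence logit per frame

    def forward(self, audio, video):
        # audio: (batch, frames, audio_dim); video: (batch, frames, video_dim)
        z = self.encoder(torch.cat([audio, video], dim=-1))
        h, _ = self.rnn(z)
        return self.classifier(h).squeeze(-1)  # (batch, frames) logits
```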
The proposed deep architecture is evaluated in the presence of highly non-stationary noises and transient interferences. Experimental results show improved performance of the proposed architecture compared to single-modal approaches that exploit only the audio or video signals, thus demonstrating the advantage of audio-video data fusion. In addition, we show that the proposed architecture outperforms competing multimodal detectors.
The remainder of the paper is organized as follows. In Section 2, we formulate the problem. In Section 3, we introduce the proposed architecture. In Section 4, we demonstrate the performance of the proposed deep architecture for voice activity detection. Finally, in Section 5, we draw conclusions and offer some directions for future research.
Problem formulation
We consider a speech signal recorded simultaneously via a single microphone and a video camera pointed at a front-facing speaker. The video signal comprises the mouth region of the speaker, and it is aligned to the audio signal by a proper selection of the frame length and overlap of the audio signal, as described in Section 4. Let a_n ∈ ℝ^A and v_n ∈ ℝ^V be feature representations of the nth frame of the clean audio and video signals, respectively, where A and V are the respective numbers of features. Similarly to …
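For illustration, one simple way to obtain such alignment is to set the audio hop to one video-frame period, so that exactly one audio frame is produced per video frame; the 8 kHz audio rate, 25 fps video rate, and 50% overlap below are assumptions for the example, not the settings of Section 4.

```python
fs = 8000            # assumed audio sampling rate (Hz)
fps = 25             # assumed video frame rate
hop = fs // fps      # 320 samples: one audio hop per video frame
frame_len = 2 * hop  # 50% overlap between consecutive audio frames

def audio_frames(x):
    """Yield one audio frame per video frame; frame n spans the same
    time interval as video frame n (up to the chosen overlap)."""
    for start in range(0, len(x) - frame_len + 1, hop):
        yield x[start:start + frame_len]
```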
Review of auto-encoders
The proposed deep architecture is based on obtaining a transient-reducing representation of the signal via auto-encoders, which are briefly reviewed in this subsection for the sake of completeness [34]. An auto-encoder is a feed-forward neural network with input and output layers of the same size, whose values we denote by x and x̂, respectively. They are connected by one hidden layer h, such that the input layer x is mapped into the hidden layer through an affine mapping followed by a nonlinearity: h = σ(Wx + b), where W is a weight matrix, b is a bias vector, and σ(·) is an elementwise nonlinear activation (e.g., a sigmoid).
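A minimal PyTorch sketch of such an auto-encoder follows; the sigmoid activation and mean-squared reconstruction loss are common choices assumed here, not necessarily those of [34].

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Single-hidden-layer auto-encoder realizing h = sigma(Wx + b)."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.enc = nn.Linear(dim, hidden_dim)  # W, b
        self.dec = nn.Linear(hidden_dim, dim)  # maps h back to the input size
        self.act = nn.Sigmoid()                # assumed nonlinearity

    def forward(self, x):
        h = self.act(self.enc(x))              # hidden representation
        x_hat = self.act(self.dec(h))          # reconstruction of x
        return x_hat, h

# Training minimizes a reconstruction loss, e.g. nn.MSELoss()(x_hat, x).
```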
Dataset
We evaluate the proposed deep architecture for voice activity detection using the dataset presented in [32]. The dataset comprises audio-visual sequences of 11 speakers reading aloud an article chosen from the web while making natural pauses every few sentences, so the intervals of speech and non-speech range from several hundred milliseconds to several seconds in length. The video signal comprises a bounding box around the mouth region of the speaker, cropped from the original recording, and it is of …
Conclusions and future work
We have proposed a deep architecture for speech detection, based on specifically designed auto-encoders that provide a new representation of the audio-visual signal in which the effect of transients is reduced. The new representation is fed into a deep RNN, trained in a supervised manner to generate voice activity detection while exploiting the differences in dynamics between speech and transients. Experimental results have demonstrated that the proposed architecture outperforms competing …
References (46)
- Efficient voice activity detection algorithms using long-term speech information. Speech Commun. (2004)
- Voice activity detection based on statistical models and machine learning approaches. Comput. Speech Lang. (2010)
- Robust visual speakingness detection using bi-level HMM. Pattern Recognit. (2012)
- Diffusion maps. Appl. Comput. Harmonic Anal. (2006)
- Voice activity detection in presence of transients using the scattering transform. Proc. 28th Convention of the Electrical & Electronics Engineers in Israel (IEEEI) (2014)
- Kernel method for voice activity detection in the presence of transients. IEEE/ACM Trans. Audio, Speech Lang. Process. (2016)
- An autocorrelation pitch detector and voicing decision with confidence measures developed for noise-corrupted speech. IEEE Trans. Signal Process. (1991)
- A robust algorithm for word boundary detection in the presence of noise. IEEE Trans. Speech Audio Process. (1994)
- A comparative study of speech detection methods. Proc. 5th European Conference on Speech Communication and Technology (EUROSPEECH) (1997)
- Enhanced voice activity detection using acoustic event detection and classification. IEEE Trans. Consumer Electron. (2011)
- Voice activity detection based on multiple statistical models. IEEE Trans. Signal Process.
- A statistical model-based voice activity detection. IEEE Signal Process. Lett.
- Maximum margin clustering based statistical VAD with multiple observation compound feature. IEEE Signal Process. Lett.
- Deep belief networks based voice activity detection. IEEE Trans. Audio, Speech Lang. Process.
- Robust voice activity detection with deep maxout neural networks. Mod. Appl. Sci.
- Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res.
- Singing voice detection with deep recurrent neural networks. Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- Speech recognition with deep recurrent neural networks. Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- Recurrent neural networks for voice activity detection. Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- Voice activity detection based on noise-immunity recurrent neural networks. Int. J. Adv. Comput. Technol. (IJACT)
- An analysis of visual speech information applied to voice activity detection. Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- A study of lip movements during spontaneous dialog and its application to voice activity detection. J. Acoustical Soc. Am.
- Two novel visual voice activity detectors based on appearance models and retinal filtering. Proc. 15th European Signal Processing Conference (EUSIPCO)
☆ This research was supported by the Israel Science Foundation (grant no. 576/16).