Signal Processing

Volume 142, January 2018, Pages 69-74

Short communication
A deep architecture for audio-visual voice activity detection in the presence of transients

https://doi.org/10.1016/j.sigpro.2017.07.006

Highlights

  • A deep architecture for audio-visual voice activity detection is proposed.

  • Specifically designed auto-encoders fuse audio and video while reducing interferences.

  • Incorporated into an RNN, the deep architecture outperforms recent detectors.

Abstract

We address the problem of voice activity detection in difficult acoustic environments including high levels of noise and transients, which are common in real life scenarios. We consider a multimodal setting, in which the speech signal is captured by a microphone, and a video camera is pointed at the face of the desired speaker. Accordingly, speech detection translates to the question of how to properly fuse the audio and video signals, which we address within the framework of deep learning. Specifically, we present a neural network architecture based on a variant of auto-encoders, which combines the two modalities, and provides a new representation of the signal, in which the effect of interferences is reduced. To further encode differences between the dynamics of speech and interfering transients, the signal, in this new representation, is fed into a recurrent neural network, which is trained in a supervised manner for speech detection. Experimental results demonstrate improved performance of the proposed deep architecture compared to competing multimodal detectors.

Introduction

Voice activity detection is a segmentation problem of a given speech signal into sections that contain speech and sections that contain only noise and interferences. It constitutes an essential part in many modern speech-based systems such as those for speech and speaker recognition, speech enhancement, emotion recognition and dominant speaker identification. We consider a multimodal setting, in which speech is captured by a microphone, and a video camera is pointed at the face of the desired speaker. The multimodal setting is especially useful in difficult acoustic environments, where the audio signal is measured in the presence of high levels of acoustic noise and transient interferences, such as keyboard tapping and hammering [1], [2]. The video signal is completely invariant to the acoustic environment, and nowadays, it is widely available in devices such as smart-phones and laptops. Therefore, proper incorporation of the video signal significantly improves voice detection, as we show in this paper.

In silent acoustic environments, speech segments in a given signal are successfully distinguished from silence segments using methods based on simple acoustic features such as the zero-crossing rate and short-time energy [3], [4], [5]. However, the performance of these methods deteriorates significantly in the presence of noise, even at moderate signal-to-noise ratios (SNRs). Another group of methods assumes statistical models for the noisy signal and focuses on estimating the model parameters; for example, the variances of speech and noise can be estimated by tracking the variations of the noisy signal over time [6], [7], [8], [9]. The main drawback of such methods is that they cannot properly model highly non-stationary noise and transient interferences, which are the main focus of this study. The spectrum of transients often varies rapidly over time, as does the spectrum of speech, and as a result the two are not properly distinguished [2].
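To make the classical feature-based approach concrete, the sketch below labels frames by thresholding short-time energy and the zero-crossing rate. The frame sizes and thresholds are illustrative assumptions rather than values taken from the cited works, and, as noted above, such a detector degrades quickly in noise.

```python
# Minimal sketch of a classical energy/zero-crossing-rate voice activity detector.
# Frame length, hop and thresholds are illustrative assumptions only.
import numpy as np

def simple_vad(signal, fs, frame_len=0.025, hop=0.010,
               energy_thresh=1e-3, zcr_thresh=0.25):
    """Label each frame as speech (1) or non-speech (0) from short-time
    energy and zero-crossing rate; reliable only in low-noise conditions."""
    n = int(frame_len * fs)
    h = int(hop * fs)
    labels = []
    for start in range(0, len(signal) - n, h):
        frame = signal[start:start + n]
        energy = np.mean(frame ** 2)
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
        # High energy and moderate ZCR suggest voiced speech.
        labels.append(int(energy > energy_thresh and zcr < zcr_thresh))
    return np.array(labels)
```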

More recent studies address voice activity detection from a machine learning point of view, in which the goal is to classify segments of the noisy signal into speech and non-speech classes [10], [11]. Learning-based methods learn implicit models from training data instead of assuming explicit distributions for the noisy signal. A particular family of models, relevant to this paper, is deep neural networks, which have gained popularity in recent years in a variety of machine learning tasks. These models use multiple hidden layers to learn useful signal representations, and their potential for voice activity detection has been partially exploited in recent studies. Zhang and Wu [12] proposed using a deep-belief network to learn an underlying representation of a speech signal from predefined acoustic features; the new representation is then fed into a linear classifier for speech detection. Mendelev et al. [13] introduced a multi-layer perceptron network for speech detection and proposed to improve its robustness to noise using the "Dropout" technique [14]. Despite the improved performance, the network in [13] classifies each time frame independently, thus ignoring temporal relations between segments of the signal. The studies presented in [15], [16], [17], [18] propose using a recurrent neural network (RNN) to naturally exploit temporal information by incorporating previous inputs into the voice detection decision. These methods, however, still struggle in frames that contain both speech and transients. Since transients are characterized by fast variations in time and high energy, they often appear more dominant than speech; as a result, frames containing only transients resemble frames containing both transients and speech, and are wrongly detected as speech frames.
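As an illustration of the recurrent approach of [15], [16], [17], [18], the sketch below classifies each frame of a feature sequence with a GRU followed by a linear layer. The layer sizes and the choice of a GRU are assumptions made for illustration and do not reproduce any specific cited model.

```python
# Sketch of a frame-wise recurrent speech/non-speech classifier.
# Feature dimension and hidden size are illustrative assumptions.
import torch
import torch.nn as nn

class RnnVad(nn.Module):
    def __init__(self, n_features=40, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        # x: (batch, time, n_features) sequence of per-frame acoustic features.
        out, _ = self.rnn(x)                  # one hidden state per frame
        return torch.sigmoid(self.head(out))  # per-frame speech probability

# Usage: probs = RnnVad()(torch.randn(8, 200, 40))  # -> shape (8, 200, 1)
```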

A different school of studies suggests improving the robustness of speech detection to noise and transients by incorporating a video signal, which is invariant to the acoustic environment. Often, the video captures the mouth region of the speakers, and it is represented by specifically designed features, which model the shape and movement of the mouth in each frame. Examples of such features are the height and the width of the mouth [19], [20], key-points and intensity levels extracted from the region of the mouth [21], [22], [23], [24], and motion vectors [25], [26].
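For concreteness, a minimal sketch of the geometric mouth features of [19], [20] is given below; the key-point convention is an assumption, and practical systems typically rely on a dedicated facial landmark detector to obtain the mouth key-points.

```python
# Sketch of geometric mouth features (width and height of the mouth opening).
# The landmark layout is an illustrative assumption.
import numpy as np

def mouth_geometry(landmarks):
    """landmarks: (K, 2) array of mouth key-points (x, y) for one video frame.
    Returns the width and height of the mouth region in pixels."""
    xs, ys = landmarks[:, 0], landmarks[:, 1]
    width = xs.max() - xs.min()    # horizontal extent of the lips
    height = ys.max() - ys.min()   # vertical extent of the lips
    return np.array([width, height])
```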

Two common approaches exist in the literature concerning the fusion of audio and video signals, termed early and late fusion [27], [28]. In early fusion, video and audio features are concatenated into a single feature vector and processed as single-modal data [29]. In late fusion, measures of speech presence and absence are constructed separately from each modality, and then combined using statistical models [30], [31]. Dov et al. [32], [33], for example, proposed to obtain separate low dimensional representations of the audio and video signals using diffusion maps. The two modalities are then fused by a combination of speech presence measures, which are based on spatial and temporal relations between samples of the signal in the low dimensional domain.
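The two fusion schemes can be summarized by the sketch below, in which early fusion concatenates per-frame features before classification and late fusion combines per-modality speech-presence scores. The classifiers and the convex combination rule are illustrative assumptions, not the schemes of any specific cited work.

```python
# Sketch contrasting early and late audio-visual fusion for speech detection.
import numpy as np

def early_fusion(audio_feat, video_feat, classifier):
    """Concatenate per-frame audio and video features and classify them jointly."""
    joint = np.concatenate([audio_feat, video_feat], axis=-1)
    return classifier(joint)

def late_fusion(audio_feat, video_feat, audio_clf, video_clf, w=0.5):
    """Classify each modality separately, then combine the speech-presence
    scores (here with a simple convex combination)."""
    p_audio = audio_clf(audio_feat)
    p_video = video_clf(video_feat)
    return w * p_audio + (1.0 - w) * p_video
```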

In this paper, we propose a deep neural network architecture for audio-visual voice activity detection. The architecture is based on specifically designed auto-encoders providing an underlying representation of the signal, in which simultaneous data from audio and video modalities are fused in order to reduce the effect of transients. The new representation is incorporated into an RNN, which, in turn, is trained for speech presence/absence classification by incorporating temporal relations between samples of the signal in the new representation. The classification is performed in a frame-by-frame manner without temporal delay, which makes the proposed deep architecture suitable for online applications.
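A minimal sketch of such a pipeline is given below. It follows the structure described above, with a fusing encoder followed by a causal recurrent classifier, but the layer sizes, activation functions and training details are assumptions made for illustration and do not reproduce the exact architecture proposed in this paper.

```python
# Sketch of an audio-visual VAD pipeline: a fusing encoder followed by an RNN.
# All dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class AudioVisualVad(nn.Module):
    def __init__(self, a_dim=40, v_dim=20, code=32, hidden=64):
        super().__init__()
        # Encoder of a fusion auto-encoder: maps the concatenated
        # audio-video frame to a joint low-dimensional code.
        self.encoder = nn.Sequential(nn.Linear(a_dim + v_dim, code), nn.Sigmoid())
        # Recurrent classifier over the sequence of codes (frame-by-frame).
        self.rnn = nn.GRU(code, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, audio, video):
        # audio: (batch, time, a_dim), video: (batch, time, v_dim)
        z = self.encoder(torch.cat([audio, video], dim=-1))
        out, _ = self.rnn(z)
        return torch.sigmoid(self.head(out))  # per-frame speech probability
```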

The proposed deep architecture is evaluated in the presence of highly non-stationary noises and transient interferences. Experimental results show improved performance of the proposed architecture compared to single-modal approaches that exploit only the audio or video signals, thus demonstrating the advantage of audio-video data fusion. In addition, we show that the proposed architecture outperforms competing multimodal detectors.

The remainder of the paper is organized as follows. In Section 2, we formulate the problem. In Section 3, we introduce the proposed architecture. In Section 4, we demonstrate the performance of the proposed deep architecture for voice activity detection. Finally, in Section 5, we draw conclusions and offer some directions for future research.

Section snippets

Problem formulation

We consider a speech signal simultaneously recorded via a single microphone and a video camera pointed at a front-facing speaker. The video signal comprises the mouth region of the speaker. It is aligned to the audio signal by a proper selection of the frame length and the overlap of the audio signal, as described in Section 4. Let $a_n \in \mathbb{R}^A$ and $v_n \in \mathbb{R}^V$ be feature representations of the $n$th frame of the clean audio and video signals, respectively, where $A$ and $V$ are the numbers of features. Similarly to
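As an illustration of this alignment, the sketch below frames the audio signal with a hop equal to one video-frame period, so that the $n$th audio frame is paired with the $n$th video frame. The sampling rate, video frame rate and window length are illustrative assumptions rather than the settings used in Section 4.

```python
# Sketch of audio framing aligned to the video frame rate.
# Sampling rate, frame rate and window length are illustrative assumptions.
import numpy as np

def align_audio_to_video(audio, fs=16000, video_fps=25, frame_len=0.025):
    """Slice the audio into frames whose hop equals one video-frame period,
    so audio frame n corresponds to video frame n."""
    hop = fs // video_fps                  # audio samples per video frame
    win = int(frame_len * fs)              # audio analysis window
    n_frames = (len(audio) - win) // hop + 1
    return np.stack([audio[i * hop:i * hop + win] for i in range(n_frames)])
```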

Review of autoencoders

The proposed deep architecture is based on obtaining a transient-reducing representation of the signal via auto-encoders, which are briefly reviewed in this subsection for the sake of completeness [34]. An auto-encoder is a feed-forward neural network with input and output layers of the same size, which we denote by $x \in \mathbb{R}^D$ and $y \in \mathbb{R}^D$, respectively. They are connected by one hidden layer $h \in \mathbb{R}^M$, such that the input layer $x$ is mapped into the hidden layer $h$ through an affine mapping followed by a nonlinearity $\sigma$: $h = \sigma(Wx + b)$
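A minimal auto-encoder sketch corresponding to this review is given below; the layer sizes and the mean-squared reconstruction loss are assumptions made for illustration.

```python
# Sketch of a plain auto-encoder: sigmoid encoder h = sigma(Wx + b) and an
# affine decoder trained to reconstruct the input. Sizes are assumptions.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, d=60, m=30):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(d, m), nn.Sigmoid())  # h = sigma(Wx + b)
        self.decode = nn.Linear(m, d)                               # y = W'h + b'

    def forward(self, x):
        return self.decode(self.encode(x))

# Training sketch: minimize the reconstruction error ||y - x||^2, e.g.
# model = AutoEncoder(); loss = nn.MSELoss()(model(x), x); loss.backward()
```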

Dataset

We evaluate the proposed deep architecture for voice activity detection using the dataset presented in [32]. The dataset includes audio-visual sequences of 11 speakers reading aloud an article chosen from the web, while making natural pauses every few sentences. Thus, the intervals of speech and non-speech range from several hundred milliseconds to several seconds in length. The video signal comprises a bounding box around the mouth region of the speaker, cropped from the original recording, and it is of

Conclusions and future work

We have proposed a deep architecture for speech detection, based on specifically designed auto-encoders that provide a new representation of the audio-visual signal, in which the effect of transients is reduced. The new representation is fed into a deep RNN, trained in a supervised manner for voice activity detection while exploiting the differences in the dynamics between speech and the transients. Experimental results have demonstrated that the proposed architecture outperforms competing

References (46)

  • J. Ramírez et al., Efficient voice activity detection algorithms using long-term speech information, Speech Commun. (2004)

  • J.W. Shin et al., Voice activity detection based on statistical models and machine learning approaches, Comput. Speech Lang. (2010)

  • P. Tiawongsombat et al., Robust visual speakingness detection using bi-level HMM, Pattern Recognit. (2012)

  • R.R. Coifman et al., Diffusion maps, Appl. Comput. Harmonic Anal. (2006)

  • D. Dov et al., Voice activity detection in presence of transients using the scattering transform, Proc. 28th Convention of the Electrical & Electronics Engineers in Israel (IEEEI) (2014)

  • D. Dov et al., Kernel method for voice activity detection in the presence of transients, IEEE/ACM Trans. Audio, Speech Lang. Process. (2016)

  • D.A. Krubsack et al., An autocorrelation pitch detector and voicing decision with confidence measures developed for noise-corrupted speech, IEEE Trans. Signal Process. (1991)

  • J.-C. Junqua et al., A robust algorithm for word boundary detection in the presence of noise, IEEE Trans. Speech Audio Process. (1994)

  • S. Van Gerven et al., A comparative study of speech detection methods, Proc. 5th European Conference on Speech Communication and Technology (EUROSPEECH) (1997)

  • N. Cho et al., Enhanced voice activity detection using acoustic event detection and classification, IEEE Trans. Consumer Electron. (2011)

  • J.-H. Chang et al., Voice activity detection based on multiple statistical models, IEEE Trans. Signal Process. (2006)

  • J. Sohn et al., A statistical model-based voice activity detection, IEEE Signal Process. Lett. (1999)

  • J. Wu et al., Maximum margin clustering based statistical VAD with multiple observation compound feature, IEEE Signal Process. Lett. (2011)

  • X.-L. Zhang et al., Deep belief networks based voice activity detection, IEEE Trans. Audio, Speech Lang. Process. (2013)

  • V.S. Mendelev et al., Robust voice activity detection with deep maxout neural networks, Mod. Appl. Sci. (2015)

  • N. Srivastava et al., Dropout: a simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. (2014)

  • S. Leglaive et al., Singing voice detection with deep recurrent neural networks, Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2015)

  • A. Graves et al., Speech recognition with deep recurrent neural networks, Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2013)

  • T. Hughes et al., Recurrent neural networks for voice activity detection, Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2013)

  • W.-T. Hong et al., Voice activity detection based on noise-immunity recurrent neural networks, Int. J. Adv. Comput. Technol. (IJACT) (2013)

  • D. Sodoyer et al., An analysis of visual speech information applied to voice activity detection, Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2006)

  • D. Sodoyer et al., A study of lip movements during spontaneous dialog and its application to voice activity detection, J. Acoustical Soc. Am. (2009)

  • A. Aubrey et al., Two novel visual voice activity detectors based on appearance models and retinal filtering, Proc. 15th European Signal Processing Conference (EUSIPCO) (2007)

    This research was supported by the Israel Science Foundation (grant no. 576/16).
