The design and collection of COSINE, a multi-microphone in situ speech corpus recorded in noisy environments

https://doi.org/10.1016/j.csl.2010.12.003

Abstract

We present an overview of the data collection and transcription efforts for the COnversational Speech In Noisy Environments (COSINE) corpus. The corpus is a set of multi-party conversations recorded in real world environments with background noise. It can be used to train noise-robust speech recognition systems or to develop speech de-noising algorithms. We explain the motivation for creating such a corpus, and describe the resulting audio recordings and transcriptions that comprise the corpus. These high quality recordings were captured in situ on a custom wearable recording system, whose design and construction are also described. Seven synchronized audio channels are captured for each wearer: four channels from a far-field microphone array, plus a close-talking microphone, a monophonic far-field microphone, and a throat microphone. This corpus thus creates many possibilities for speech algorithm research.

Highlights

► We present the COnversational Speech In Noisy Environments (COSINE) corpus.
► COSINE consists of multi-party conversations recorded in noisy environments.
► The recordings were captured in situ on a custom wearable recording system.
► Seven separate heterogeneous synchronized audio channels have been captured.
► The corpus is useful for noise-robust ASR and speech de-noising algorithms.

Introduction

In many applications, practical automatic speech recognition (ASR) systems must be robust to the presence of background noise in the environment. These applications are numerous, and include dictation software; speech-based human–computer interfaces; speech recognition of telephone or air traffic control conversations; speech recognition or keyword search of TV or radio programs; and voice commands used by automobile drivers, soldiers, firefighters, law enforcement officials, or disabled individuals to interact with assistive devices.

Two types of effects must be overcome when training a speech recognition system that must work in noisy environments: the presence of additive or convolutional noise as well as reverberation, and the effects of the noisy environment on the nature of the speech (e.g., the Lombard effect (Lombard, 1911, Junqua et al., 1999)). The methods used to mitigate these effects fall into the following main categories (Gong, 1995):

1. Perform noise cancellation or reduction on the audio signal before passing it into the speech recognizer.

2. Use noise-robust feature extraction methods to gain performance improvements over standard MFCC features (Guo et al., 2007). For example, mean/variance normalization, feature smoothing (Chen, 2004), and a variety of other feature cleaning/enhancement techniques show improvement over standard MFCC/PLP features (Islam et al., 2006, Lim et al., 2008, Yu et al., 2008); a minimal sketch of one such technique is given after this list.

3. Train the acoustic model on a combination of clean and noisy speech, or use speech recorded in the desired noisy environment to adapt an existing acoustic model. The performance of a system trained on audio with noise conditions matched to the audio being recognized is likely to be an upper bound on the performance of model compensation schemes (acoustic model adaptation) (Gong, 1995). The use of training audio that exhibits the Lombard effect has also been shown to improve the performance of speech recognition systems (Lippmann et al., 1987).

4. Use models that explicitly account for the relationship between clean and noisy speech and manage this relationship in a computationally efficient way, such as uncertainty decoding (Droppo et al., 1999), which can work with missing variables representing clean data; such models are also very effective.

5. Develop new model topologies or structures that might have inherent “immunity” to noise. These might be models that are trained or structured discriminatively, and by the nature of the task they are asked to perform (discrimination) they might be less susceptible to noise.
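
As an illustration of the feature-level techniques in item 2, the following is a minimal sketch of per-utterance cepstral mean and variance normalization (CMVN) applied to an MFCC matrix. The function name and the frames-by-coefficients array layout are our own illustrative assumptions rather than part of any particular toolkit.

```python
import numpy as np

def cmvn(mfcc: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-utterance cepstral mean/variance normalization.

    mfcc: array of shape (num_frames, num_coefficients), one row per frame.
    Normalizing each coefficient track to zero mean and unit variance reduces
    the effect of stationary channel offsets and slowly varying noise.
    """
    mean = mfcc.mean(axis=0, keepdims=True)
    std = mfcc.std(axis=0, keepdims=True)
    return (mfcc - mean) / (std + eps)

# Example on stand-in features (13 MFCCs over 300 frames of random data).
features = np.random.randn(300, 13) * 5.0 + 2.0
normalized = cmvn(features)
assert np.allclose(normalized.mean(axis=0), 0.0, atol=1e-6)
```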

This list is by no means exhaustive, and it is quite likely that some combination of two or more of the above methods can be used to produce a system that improves on any system relying on only one of them. In most cases, however, it is useful to have corpora collected in noisy environments, which is the main topic of this paper.

Many speech corpora have been developed for studying algorithmic improvements that provide ASR performance increases; however, few of them provide ideal training data for systems that must recognize conversational speech in noisy environments. For example, Broadcast News (Graff et al., 1997) contains mostly read speech, and TIMIT (Garofolo et al., 1993) consists of read speech with no noise. NTIMIT (Jankowski et al., 1990) is a version of TIMIT created by passing its recordings through telephone channels. Limited-vocabulary corpora exist, such as NOIZEUS (Hu and Loizou, 2007) and the AURORA databases, which contain recordings of spoken digits (and other small- to medium-vocabulary settings) that are clean, recorded in noisy environments, and/or often artificially distorted (by additive noise and simulations of rooms and telephone networks), as well as SPINE (Schmidt-Nielsen et al., 2002), in which background noise was played on a speaker in the recording booth. Other corpora have been designed to capture the Lombard effect, including UT-Scope (Varadarajan et al., 2006) and the Albayzin Spanish-language corpus (Moreno et al., 1993). The ICSI (Janin et al., 2003) and AMI (Carletta et al., 2006) meeting corpora contain microphone array recordings of multi-party conversations in indoor environments. Several in-car corpora have been created, with multi-microphone recordings of limited-vocabulary speech in noisy environments. These include AVICAR (Lee et al., 2004) and the CIAIR Japanese corpus (Kawaguchi et al., 2000), which also includes dialog recordings. There are also databases which capture the effects of specific types of distortion, such as telephone channels in Switchboard (Godfrey et al., 1992). Additionally, some multi-modal corpora exist (including AVICAR and the AMI corpus), that allow the combination of, say, audio and visual information.

Our goal was to create a corpus that brings together many of the elements that make each of these corpora useful: i.e., the presence of various levels and types of background noise, simultaneous recordings of Lombard speech on channels that both capture and reject the background noise, spontaneous multi-person conversations, and synchronized multi-microphone recordings (including a microphone array) of each conversation participant. Many considerations motivated the design of the recording hardware and data collection practices. The corpus contains multi-party conversations about everyday topics. Additionally, the speech is recorded in true noisy environments, rather than having the noise added later or piped in as background noise through speakers. In fact, the participants are actually walking around within the real noisy environment in which the speech is recorded, and are potentially being affected by all of the distractions that such a noisy environment might entail. These noisy environments range in both noise type and intensity, and include a wide range of indoor and outdoor noise sources such as crowds, vehicles, and wind at a variety of SNRs. To achieve this, we have custom-designed a portable recording device that allows the multi-track speech to be recorded in situ, rather than making the recordings in a studio, which would affect the speakers’ comfort, behavior, and speech patterns. Of course, any speech that compromises the privacy of the speakers is deleted from the corpus, but the fact that there is such speech indicates that the speakers engage in comfortable, natural, and highly disfluent speaking styles, conversations, and vocalic patterns. We therefore believe that our corpus, due to its inherent in situ nature, provides a unique perspective on aspects of human speech production when spoken in real-world noisy environments, and also on the acoustic properties of speech and noise when it is collected in such real-world environments.

Our resulting “COSINE” corpus was first introduced in (Stupakov et al., 2009); the present paper gives a much more detailed description of every aspect of the creation of the corpus. The paper is organized as follows: in Section 2, we discuss the design and construction of our custom portable wearable recorders. In Section 3, we describe the recording sessions during which the corpus was recorded. Section 4 explains the word-interior annotation method that was used to mark words during transcription, and the nature of the transcription is described in detail in Section 5. Section 6 contains details about the public release of the corpus. In Section 7, we conclude with a discussion of potential applications for the resulting recordings.

Section snippets

Portable wearable multi-channel recording system

We designed the hardware component of our speech recording system to be portable, light-weight, unobtrusive, and comfortable. Given the requirements of the corpus mentioned in the previous section, this involved evaluating different trade-offs between microphone placement and audio quality. In order for the audio in the corpus to represent a variety of scenarios, we chose to record the speech with several different types of microphones, which cover a broad range of comfort, audio quality, and

Recording sessions

Paid volunteers (approved for human subjects research) participated in multi-person recording sessions that lasted between 45 min and 1.5 h. The breakdown of the number of people per session is: 2 people: 13%, 3 people: 19%, 4 people: 42%, 5 people: 3.5%, 6 people: 19%, and 7 people: 3.5%. Even numbers of participants were favored because people tend to talk more when they are able to pair up. After putting on the recording devices, the volunteers were asked to walk to various noisy locations, and to talk about anything

Word-interior annotation

An important consideration for the corpus was the extent to which the data would be labeled. To expedite the transcription process, three methods were evaluated (a sketch contrasting them follows the list):

1. fully labeled (FL)—transcribers mark the precise beginnings and ends of words,

2. sequence-labeled (SL)—transcribers mark the beginning and end of a phrase and then transcribe only the sequence of words, and

3. a technique introduced in (Subramanya and Bilmes, 2007) called partially labeled (PL)—the word sequence is transcribed, and an
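
To make the contrast between these three schemes concrete, the following is a minimal sketch of the kind of record each scheme produces. The field names, the invented timestamps, and the representation of the partially labeled (PL) scheme as a single time point inside each word are our own illustrative assumptions based on the description of word-interior annotation, not the corpus's actual file format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FullyLabeledWord:       # FL: precise start and end of every word
    word: str
    start_s: float
    end_s: float

@dataclass
class SequenceLabeledPhrase:  # SL: phrase boundaries plus the word sequence only
    start_s: float
    end_s: float
    words: List[str]

@dataclass
class PartiallyLabeledWord:   # PL (illustrative): one time point inside each word
    word: str
    interior_s: float

# The same short utterance under each scheme (all times are invented).
fl = [FullyLabeledWord("hello", 0.10, 0.48), FullyLabeledWord("there", 0.52, 0.90)]
sl = SequenceLabeledPhrase(0.10, 0.90, ["hello", "there"])
pl = [PartiallyLabeledWord("hello", 0.30), PartiallyLabeledWord("there", 0.70)]
```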

Orthographic transcription

Using the Praat program (Boersma, 2001), a three-pass transcription of the corpus was done by a group of twelve 3rd and 4th year Linguistics undergraduate students at the University of Washington. All transcribers are native English speakers.
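
Because the transcriptions were produced with Praat, they can be stored as TextGrid files. The following is a minimal sketch, assuming the long ("ooTextFile") TextGrid format, that extracts the labeled interval texts with a simple regular expression; the file path is a placeholder and this is not the corpus's own tooling.

```python
import re

def interval_texts(textgrid_path):
    """Extract the text of every labeled interval from a long-format Praat TextGrid."""
    # Note: some Praat versions write UTF-16; adjust the encoding if needed.
    with open(textgrid_path, encoding="utf-8") as f:
        contents = f.read()
    # Long-format TextGrids contain lines of the form:  text = "hello there"
    texts = re.findall(r'text\s*=\s*"(.*)"', contents)
    return [t for t in texts if t.strip()]  # drop empty (silence) intervals

# Example usage (placeholder path):
# print(interval_texts("session_example.TextGrid"))
```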

To ensure consistency of transcription, an annotation guide was created for the transcribers to follow. A wiki was also established, allowing the transcribers to standardize their transcriptions and to learn from each other. The wiki contained discussions

Release

The final release of the corpus contains all of the recorded audio (excepting any privacy-related deletions), at a 44.1 kHz sampling rate and a bit depth of 16 bits, and stored in the FLAC compressed lossless audio format; the transcriptions; and all non-privacy-sensitive subject information. The release is accompanied by a thorough set of documentation which includes detailed statistics about the speakers, the recordings, and the transcriptions; two suggested train/dev/test set subdivisions
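
Since the release stores the audio as 44.1 kHz, 16-bit FLAC, a typical first step for ASR experiments is to decode a file and resample it to a front-end's expected rate. The following is a minimal sketch using the soundfile and scipy packages; the file name, the choice of channel, and the 16 kHz target rate are illustrative assumptions, not prescriptions from the corpus documentation.

```python
import soundfile as sf
from scipy.signal import resample_poly

# Decode a FLAC recording (placeholder file name, not an actual corpus path).
audio, rate = sf.read("session_example.flac")  # (num_samples,) or (num_samples, num_channels)

# Many ASR front-ends expect 16 kHz mono: keep one channel and resample 44.1 kHz -> 16 kHz.
if audio.ndim == 2:
    audio = audio[:, 0]  # e.g., the first of the synchronized channels
audio_16k = resample_poly(audio, up=160, down=441)

print(rate, audio.shape, audio_16k.shape)
```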

Conclusions

We expect that the COSINE corpus could be a unique and valuable tool for the speech and language community. Its annotations comprise word-level transcriptions of multi-party in situ conversational speech, including word-interior markings. Each speaker has been recorded simultaneously on seven different channels, each with its own noise content and channel distortion, representing the varying conditions of real-world microphone types and placements. As the speech has been recorded in situ, there are no

Acknowledgments

This material is based upon work supported in part by DARPA's ASSIST Program (contract number NBCH-C-05-0137) and an ONR MURI grant (No. N000140510388). The authors would like to express their thanks to the transcribers who made this work possible: Naomi Bancroft, Eric Braun, Dutch Hixenbaugh, Alexander Keane, Brent Nelson, Kellen Michael Paisley, Justina Rompogren, and Yeon-Hee Yim, with special thanks to the four transcribers who also did additional quality screening of the data: Min Amodio,

References (36)

  • Y. Gong. Speech recognition in noisy environments: a survey. Speech Communication (1995).
  • Y. Hu et al. Subjective evaluation and comparison of speech enhancement algorithms. Speech Communication (2007).
  • Aurora speech recognition experimental framework....
  • P. Boersma. Praat, a system for doing phonetics by computer. Glot International (2001).
  • Carnegie Mellon University, 2008. cmudict0.7a, https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/cmudict/cmudict0....
  • J. Carletta et al. The AMI meeting corpus: a pre-announcement. Lecture Notes in Computer Science (2006).
  • C. Chen, 2004. Noise robustness in automatic speech recognition. Ph.D. thesis, University of...
  • J. Droppo et al. Uncertainty decoding with SPLICE for noise robust speech recognition.
  • FLAC—Free Lossless Audio Codec, v1.1....
  • J.S. Garofolo et al. DARPA TIMIT Acoustic Phonetic Continuous Speech Corpus CDROM (1993).
  • J. Godfrey et al. SWITCHBOARD: telephone speech corpus for research and development.
  • D. Graff et al. The 1996 Broadcast News speech and language-model corpus.
  • W. Guo et al. An auditory neural feature extraction method for robust speech recognition.
  • D. Heckerman et al. Models and selection criteria for regression and classification.
  • M. Islam et al. An improved mel-Wiener filter for mel-LPC based speech recognition.
  • C. Jankowski. NTIMIT: a phonetically balanced, continuous speech, telephone bandwidth speech database.
  • A. Janin et al. The ICSI meeting corpus.
  • J.-C. Junqua et al. The Lombard effect: a reflex to better communicate with others in noise.

    This paper has been recommended for acceptance by Thomas Hain.
