Robust speech interaction in motorcycle environment

https://doi.org/10.1016/j.eswa.2009.07.011

Abstract

Aiming at robust spoken dialogue interaction in the motorcycle environment, we investigate various configurations of a speech front-end, which consists of speech pre-processing, speech enhancement and speech recognition components. These components are implemented as agents in the Olympus/RavenClaw framework, which is the core of a multimodal dialogue interaction interface of a wearable solution for information support of the motorcycle police force on the move. In the present effort, aiming at optimizing the speech recognition performance, different experimental setups are considered for the speech front-end. The practical value of various speech enhancement techniques is assessed and, after analysis of their performance, a collaborative scheme is proposed. In this collaborative scheme, independent speech enhancement channels operate in parallel on a common input and their outputs are fed to a multithread speech recognition component. The outcome of the speech recognition process is post-processed by an appropriate fusion technique, which contributes to a more accurate interpretation of the input. Investigating various fusion algorithms, we identified AdaBoost.M1 as the one performing best. Utilizing the collaborative fusion scheme based on the AdaBoost.M1 algorithm, a significant improvement of the overall speech recognition performance was achieved: accuracy gains of 8.0% in word recognition rate and 5.48% in correctly recognized words, compared to the performance of the best speech enhancement channel alone. The advance offered in the present work reaches beyond the specifics of the present application and can benefit spoken interfaces operating in non-stationary noise environments.

Introduction

Human–computer interaction has a long history, during which various interfaces were developed, currently aiming to provide more natural interaction modes, such as 3D gesture or speech, that mimic human-to-human communication. Achieving naturalness involves progress from command- or menu-based (system-driven) to user-driven dialogue management and system intelligence, allowing adaptation to environment changes and user preferences. Technological advances in the Internet protocol (IP) telephony domain have led to increased interest in providing access to the large domain of web applications over the phone (Tsai, 2006). Personal assistant-based dialogue systems, which offer higher comfort for end-users, were developed for the needs of various applications (Möller, 2004, Paraiso and Barthes, 2006). Activities that were traditionally performed in an office or at home, i.e. in a well-controlled environment, are now supported by mobile and embedded technologies and have migrated outdoors. In this new environment there is an increased demand for services providing efficiency, combined with high comfort and safety, taking into account that parallel activities, such as driving a car or a motorcycle, are performed most of the time. On the road, driver distraction can become a significant problem, so highly efficient human–machine interfaces are required. In order to meet both comfort and safety requirements, new technologies need to be introduced into the car environment, enabling drivers to interact with mobile systems and services in an easy, risk-free way.

Spoken language dialogue systems considerably improve the safety and user-friendliness of human–machine interfaces, due to their similarity to conversation with another human, a parallel activity which the driver is used to and which allows him to concentrate on the main activity, the driving itself. Driving quality, stress and strain situations, and user acceptance when using speech and manual commands to acquire information on the route have been studied previously (Gartner, Konig, & Wittig, 2001), and the results have shown that, with speech input, the feeling of being distracted from driving is smaller and road safety is improved, especially in the case of complex tasks. Moreover, assessment of user requirements for multimodal interfaces in a car environment has shown that when the car is moving the system should switch to the “speech-only” interaction mode, as any other safety risks (i.e. driver distraction from the driving task by gesture input or graphical output) must be avoided (Berton, Buhler, & Minker, 2006).

Commercial use of speech recognition in the context of human–computer interaction has intensified in the last decade, with major applications in “informative” systems: inquiries about time schedules for trains or movies, information regarding services and products, bank account or card balance inquiries, etc. Spoken communication, even between humans, is highly affected by noise pollution (Zahaeeruddin & Jain, 2008). Moreover, the performance of speech recognition systems, although reliable enough to support speaker and device independence in controlled environments, degrades substantially when they are used on the road in a mobile environment. There are various types and sources of noise interfering with the speech signal, ranging from the acoustic environment (vibrations, road/fan/wind noise, engine noise, traffic, etc.) to changes in the speaker’s voice due to task stress, distributed attention, increased cognitive load, etc. In the integration of speech-based interfaces within vehicle environments, research is conducted in two directions: (i) addition of front-end speech enhancement systems to improve the quality of the recorded signal, and (ii) training the speech models of the recognizer engine on noisy, real-life speech databases.

Speech recognition in the car environment started in the early 1990s with combinations of basic hidden Markov model (HMM) recognizers with front-end noise suppression, environmental noise adaptation and multi-channel concepts (Hansen and Clements, 1991, Lockwood and Boundy, 1992). Preliminary speech/noise detection combined with front-end speech enhancement methods for robust speech recognition has shown promising results and currently benefits from the suppression of interfering signals by using a microphone array, which enables both spatial and temporal measurements (Visser, Otsuka, & Lee, 2003). The advantages of multi-channel speech enhancement can be successfully applied to the car environment, while in the motorcycle environment research is focused on one-channel speech enhancement. After more than three decades of advances on the one-channel speech enhancement problem, four distinct families of algorithms seem to have predominated in the literature: (i) the spectral subtractive algorithms (Kamath & Loizou, 2002), (ii) the statistical model-based approaches (Ephraim and Malah, 1985, Hu and Loizou, 2004, Loizou, 2005), (iii) the signal subspace approaches (Hu and Loizou, 2003, Jabloun and Champagne, 2003), and (iv) the enhancement approaches based on a special type of filtering (Gannot, Burshtein, & Weinstein, 1998).
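
To make the first of these families concrete, a minimal magnitude spectral subtraction routine is sketched below. The frame length, hop size, noise-estimation window and spectral floor are illustrative assumptions and the routine does not reproduce any specific algorithm evaluated in this work; it only shows the principle of estimating a noise spectrum from leading speech-free frames and subtracting it frame by frame.

    import numpy as np

    def spectral_subtraction(noisy, noise_frames=10, frame_len=512, hop=256, floor=0.02):
        """Enhance a 1-D float waveform by subtracting an average noise magnitude
        spectrum estimated from the first `noise_frames` (assumed speech-free) frames."""
        window = np.hanning(frame_len)
        # Average magnitude spectrum of the leading noise-only frames.
        noise_mag = np.mean(
            [np.abs(np.fft.rfft(window * noisy[i * hop:i * hop + frame_len]))
             for i in range(noise_frames)], axis=0)

        enhanced = np.zeros(len(noisy))
        n_frames = (len(noisy) - frame_len) // hop + 1
        for i in range(n_frames):
            frame = window * noisy[i * hop:i * hop + frame_len]
            spec = np.fft.rfft(frame)
            mag, phase = np.abs(spec), np.angle(spec)
            # Subtract the noise estimate; a small spectral floor limits musical noise.
            clean_mag = np.maximum(mag - noise_mag, floor * noise_mag)
            # Overlap-add the reconstructed frame (50% overlap, Hann window).
            enhanced[i * hop:i * hop + frame_len] += np.fft.irfft(clean_mag * np.exp(1j * phase))
        return enhanced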

The accuracy of the speech recognition task is greatly improved by using suitably trained speech models for the recognizer engine. Sufficient noise scenarios from the application domain should be included in the training phase to improve performance. Dedicated speech corpora have been designed, recorded and annotated, starting with the car environment and, more recently, extending to the motorcycle environment. A European initiative supporting the development of corpora for training and testing multilingual speech recognition applications in the car environment started in 1998 with the SPEECHDAT-CAR project (Moreno et al., 2000). The databases developed are designed to include a phonetically balanced corpus to train generic speech recognition systems and an application corpus, providing enough data to adapt speaker-independent recognition systems to the automotive environment. A total of 10 languages are supported, with recordings from at least 300 speakers for each language and seven characteristic environments (low speed, high speed, with audio equipment on, etc.). The CU-Move corpus consists of five domains, including digit strings, route navigation expressions, street and location sentences, phonetically balanced sentences and a route navigation dialogue in a human Wizard-of-Oz-like scenario, considering a total of 500 speakers from the United States of America and natural conversational interaction (Hansen et al., 2003). Research on human–computer interaction in the car environment has evolved to the multimodal mode (audio and visual), and an adequate audio-visual corpus has been developed in the AVICAR database (Lee, Hasegawa-Johnson, & Goudeseune, 2004), using a multi-sensory array of eight microphones and four video cameras. For the motorcycle environment, the SmartWeb motorbike corpus has been designed for a dialogue system dealing with open domains (Kaiser, Mogele, & Shiel, 2006). Recently, a domain-specific (police domain) database, dealing with the extreme conditions of the motorcycle environment, has been developed in the MoveOn project (Winkler et al., 2008). In the latter, the focus is on the specificity of the domain, where the cognitive load is quite high and accurate recognition of commands in the context of a template-driven dialogue in the motorcycle environment is a high priority.

In developing the speech interface of the MoveOn system, certain research challenges must be overcome in order to achieve reliable and natural voice interaction in the motorcycle environment: a zero-distraction interaction system is required for people who are moving on the road (“on the move”, “eyes-busy” and “hands-busy”) and are not able to interact through a visual/tactile interface, such as a screen or a button pad, for safety reasons. The target users that will benefit from the application of interest discussed here are police force motorcyclists and motorcycle drivers at large.

In the present work, we report on a challenging research and development effort for optimizing the speech recognition accuracy of the MoveOn system’s speech front-end. This development is based on a collaborative scheme, which relies on a number of speech enhancement channels and a multithread automatic speech recognition component. The speech pre-processing, speech enhancement, speech recognition and data fusion components discussed in the following sections are implemented as interactive agents in the Olympus/RavenClaw framework, which is the core of the multimodal dialogue interaction system. The present work can be viewed as a natural continuation of an earlier study (Ntalampiras, Ganchev, Potamitis, & Fakotakis, 2008), where eight speech enhancement algorithms were evaluated in terms of objective and subjective speech quality on the motorcycle speech database, referred to as the MoveOn speech and noise database (Winkler et al., 2008). That earlier study provided useful indications of the potential usefulness of various speech enhancement algorithms and their performance in the target environmental conditions, and assisted us in selecting the best-performing candidate speech enhancement algorithms for the present work. However, since the usefulness of these speech enhancement algorithms for improving speech recognition performance is not known, and cannot be judged directly from the results of the aforementioned objective evaluation, we were motivated to implement an optimized system that takes advantage of multiple well-performing algorithms, instead of simply relying on the top performer in terms of perceptual quality.

Thus, in contrast to Ntalampiras et al. (2008), in the present study the performance of the various speech enhancement schemes on the MoveOn speech and noise database is assessed by ranking their effect on speech recognition performance. Moreover, we demonstrate how the speech recognition performance can be boosted further by proper fusion of the outputs of several parallel speech enhancement channels, which are equipped with different speech enhancement techniques and correspondingly adapted channel-specific acoustic models of the speech recognizer. The benefit of such a collaborative scheme is experimentally demonstrated in terms of improved word recognition rate (WRR) and higher rates of correctly recognized words (CRW).
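
The structure of such a collaborative scheme can be summarized in the following sketch. The channel abstraction, the confidence-based features and the fusion_predict callable are illustrative assumptions introduced here for clarity; in the actual system the corresponding components are implemented as Olympus/RavenClaw agents and the fusion model is trained with the AdaBoost.M1 algorithm, as described in the following sections.

    from dataclasses import dataclass
    from typing import Callable, List, Sequence

    @dataclass
    class Channel:
        """One speech enhancement channel paired with its channel-adapted recognizer."""
        name: str
        enhance: Callable      # waveform -> enhanced waveform
        recognize: Callable    # enhanced waveform -> (word hypothesis, confidence score)

    def collaborative_front_end(waveform, channels: Sequence[Channel],
                                fusion_predict: Callable) -> str:
        """Run every channel on the same input and let a trained fusion model
        decide which channel's hypothesis to forward to the dialogue manager."""
        hypotheses: List[str] = []
        confidences: List[float] = []
        for ch in channels:
            enhanced = ch.enhance(waveform)      # channel-specific speech enhancement
            hyp, conf = ch.recognize(enhanced)   # recognizer with channel-adapted acoustic models
            hypotheses.append(hyp)
            confidences.append(conf)
        # The fusion model (e.g. a boosted classifier trained offline on the
        # per-channel confidence scores) returns the index of the channel to trust.
        best_channel = fusion_predict(confidences)
        return hypotheses[best_channel]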

The remaining sections of this article are organized as follows: In Section 2 we introduce the MoveOn application, outline the architecture of the multimodal dialogue interaction system, and specify the requirements for the speech front-end. In Section 3 we discuss the implementation of the collaborative scheme for the speech front-end, which is composed of multiple parallel speech enhancement channels whose outputs are fused, in an attempt to improve the overall speech recognition performance. In Section 4 we detail the experimental setup, and in Section 5 we present the experimental results. Finally, Section 6 offers discussion and concluding remarks.

Section snippets

System architecture

In this section we briefly introduce the MoveOn application, the main design solution and the generic functionality requirements for the speech front-end.

Speech front-end

The speech front-end mentioned in Section 2.3 considers single-channel speech recognition. However, the optimized speech front-end system, implemented for the specific environment and presented here, is based on a collaborative scheme which consists of four types of building blocks: speech pre-processing, speech enhancement, speech recognition and fusion agents. As Fig. 2 presents, in the proposed composite scheme the speech front-end is designed to function as a parallel structure. In the

Experimental setup

The speech front-end proposed in Section 3 was evaluated in different configurations: single-channel or multiple-channel speech enhancement, different configuration settings of the speech recognition engine, and various fusion methods. In the following, the speech data, the settings of the experimental setup, the fusion algorithms and the experimental protocol of the present evaluation are presented.

Experimental results

In this section the experimental results are presented, considering, in a first step, the performance of the individual speech enhancement algorithms and, in a second step, the performance of the collaborative fusion scheme applied to the individual speech enhancement channels. The performance of the front-end speech enhancement is measured in terms of averaged word recognition rate (WRR) per speech utterance.
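
Assuming WRR denotes word accuracy (penalizing insertions) and CRW the percentage of correctly recognized words (ignoring insertions), both scores can be computed from a standard word-level edit-distance alignment, as sketched below; the exact scoring tool used in the experiments is not detailed here.

    def align_counts(ref, hyp):
        """Word-level edit-distance alignment; returns (substitutions, deletions, insertions)."""
        # d[i][j] holds (total_cost, subs, dels, ins) for ref[:i] versus hyp[:j].
        d = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        d[0][0] = (0, 0, 0, 0)
        for i in range(len(ref) + 1):
            for j in range(len(hyp) + 1):
                if i == 0 and j == 0:
                    continue
                cands = []
                if i > 0 and j > 0:
                    c, s, dl, ins = d[i - 1][j - 1]
                    sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                    cands.append((c + sub, s + sub, dl, ins))   # match / substitution
                if i > 0:
                    c, s, dl, ins = d[i - 1][j]
                    cands.append((c + 1, s, dl + 1, ins))       # deletion
                if j > 0:
                    c, s, dl, ins = d[i][j - 1]
                    cands.append((c + 1, s, dl, ins + 1))       # insertion
                d[i][j] = min(cands)                            # lowest total cost wins
        return d[-1][-1][1:]

    def wrr_and_crw(ref_words, hyp_words):
        """Word recognition rate (accuracy) and correctly recognized words, in percent."""
        subs, dels, ins = align_counts(ref_words, hyp_words)
        n = len(ref_words)
        return 100.0 * (n - subs - dels - ins) / n, 100.0 * (n - subs - dels) / n

For example, a hypothesis with one inserted word against a three-word reference yields a WRR of 66.7% but a CRW of 100%.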

The performance results for each individual speech enhancement method in the motorcycle

Discussion and conclusion

In the present contribution we studied a collaborative scheme for speech recognition, which is based on a multi-channel speech enhancement agent implementing a different speech enhancement technique per channel. The outputs of the individual speech enhancement channels are fed to a multithread speech recognition agent. The speech recognition outcomes corresponding to the individual speech enhancement channels are post-processed by utilizing a fusion method that is trained to predict the best

Acknowledgement

The research leading to these results was financially supported by the MoveOn project (IST-2005-034753), under the [European Community’s] Sixth Framework Programme.

References (42)

  • Breiman, L. (1996). Bagging predictors. Machine Learning.
  • Clarkson, P. R., & Rosenfeld, R. (1997). Statistical language modeling using the CMU-Cambridge toolkit. In Proceedings...
  • Ephraim, Y., et al. (1985). Speech enhancement using a minimum mean square error log-spectral amplitude estimator. IEEE Transactions on Acoustics, Speech, Signal Processing.
  • Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Proceedings of the 13th...
  • Gannot, S., et al. (1998). Iterative and sequential Kalman filter-based speech enhancement algorithms. IEEE Transactions on SAP.
  • Gartner, U., Konig, W., & Wittig, T. (2001). Evaluation of manual vs. speech input when using a driver information...
  • Gauvain, J. L., et al. (1994). Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Transactions on SAP.
  • Hansen, J. H. L., et al. (1991). Constrained iterative speech enhancement with application to speech recognition. IEEE Transactions on ASSP.
  • Hansen, J. H. L., Zhang, X., Akbacak, M., Yapanel, U., Pellom, B., & Ward, W. (2003). CU-Move: Advances in in-vehicle...
  • Hoge, H., Draxler, C., Van den Heuvel, H., Johansen, F. T., Sanders, E., & Tropf, H. S. (1999). SpeechDat multilingual...
  • Hu, Y., et al. (2003). A generalized subspace approach for enhancing speech corrupted with colored noise. IEEE Transactions on SAP.