Large vocabulary continuous speech recognition of Broadcast News – The Philips/RWTH approach

https://doi.org/10.1016/S0167-6393(01)00062-0

Abstract

Automatic speech recognition of real-life broadcast news (BN) data (the Hub-4 task) has become a challenging research topic in recent years. This paper summarizes our key efforts to build a large vocabulary continuous speech recognition system for the heterogeneous BN task without incurring undue complexity or computational cost. These key efforts included:

  • automatic segmentation of the audio signal into speech utterances;

  • efficient one-pass trigram decoding using look-ahead techniques;

  • optimal log-linear interpolation of a variety of acoustic and language models using discriminative model combination (DMC);

  • handling short-range and weak longer-range correlations in natural speech and language by the use of phrases and of distance-language models;

  • improving the acoustic modeling by a robust feature extraction, channel normalization, adaptation techniques as well as automatic script selection and verification.

The starting point of the system development was the Philips 64k-NAB word-internal triphone trigram system. On the speaker-independent but microphone-dependent NAB task (transcription of read newspaper texts) we obtained a word error rate of about 10%. Now, at the conclusion of the system development, we have arrived at Philips at a DMC-interpolated, phrase-based, cross-word pentaphone 4-gram system. This system transcribes BN data with an overall word error rate of about 17%.

Zusammenfassung

Automatic speech recognition of current news broadcasts (the “Hub-4”, or broadcast news, task) has become an important research topic in recent years. This paper summarizes the main points of our work in building a large vocabulary continuous speech recognition system for the heterogeneous broadcast news task, in which we have tried to keep the complexity and computational cost of the system as low as possible. Among other things, we focused on the following goals:

  • Automatic segmentation of the audio signal into speech utterances;

  • Efficient one-pass trigram search using look-ahead techniques;

  • Optimal log-linear interpolation of a number of acoustic and language models by means of discriminative model combination (DMC);

  • Handling of short-range and weak long-range correlations in natural speech and language through the use of phrases and of distance language models;

  • Improvement of the acoustic modeling through robust feature extraction, channel normalization and adaptation techniques, as well as through automatic script selection and script verification.

The starting point of our system development was the Philips 64k-NAB word-internal triphone trigram system. On the speaker-independent but microphone-dependent NAB task (transcription of read newspaper texts) we achieved a word error rate of about 10%. The development work concluded with the construction of a DMC-interpolated, phrase-based, cross-word pentaphone 4-gram system. This system transcribes news broadcasts with an overall word error rate of about 17%.

Résumé

The automatic transcription of broadcast radio and television news (the task designated “Hub-4”) has been the subject of intense research in recent years. This paper presents the main lines of our effort to build a continuous speech recognition system capable of handling the heterogeneous signal of news broadcasts without incurring excessive complexity or computational resources. Our efforts focused on the following points:

  • The automatic segmentation of the audio signal into a sequence of spoken passages;

  • Fast one-pass decoding integrating a trigram model with a look-ahead technique;

  • The optimal log-linear interpolation of a variety of acoustic and language models by means of discriminative model combination (DMC);

  • Capturing short-range and, more weakly, long-range linguistic correlations by means of word groupings (phrases) and so-called “distance” language models;

  • Improved acoustic modeling through robust feature extraction combined with channel normalization, adaptation of the phonetic models, and the selection and verification of the training-corpus scripts.

Our starting point was the Philips “NAB-64k” system, based on within-word triphones and trigram models. On the NAB task, which involves transcribing articles read over a known microphone, this speaker-independent system achieves an average word error rate of 10%. At the conclusion of this work, we had developed a system that combines, via DMC, within-word and cross-word phonetic models, pentaphones, word groupings, and language models up to order 4. This system transcribes broadcast news with an overall word error rate of about 17%.

Introduction

Past speech recognition research has focused mainly on the decoding of high quality speech in quiet environments. Recently, however, the focus has shifted to speech found in the “real world”. One source of real-world speech is audio recordings of radio and television broadcast news (BN). Compared to previous work in automatic speech recognition, the BN task poses the following additional research problems:

  • Unknown sentence boundaries.

  • Diverse and rapidly changing acoustic environment. Typical degradations of the speech signal are introduced by background music, noise, interfering speakers as well as by changes between studio and telephone channels. Furthermore, regional dialects or accents of non-native speakers have to be considered.

  • Real-life speaking styles (spontaneous speech) as well as unknown speaker turns. Speaking styles range from carefully read speech to free and spontaneous conversation.

  • Natural language. Difficulties arise from unpredictable changes of topics of the BN as well as from spontaneous reactions in free conversations.

This paper summarizes our approach in dealing with these challenges and describes the system we developed between 1997 and 1998.

Section snippets

Overview

The system architecture of the Philips/RWTH Hub-4 system is depicted in Fig. 1. The system consists of three decoding stages: segmentation, one-pass trigram decoding and discriminative model combination (DMC). The task of the segmentation stage is to handle the problem of unknown sentence boundaries. It transforms the continuous BN audio stream into a sequence of spoken utterances (segments), which are similar to sentences. Identification of acoustic channel bandwidth, gender and speaker cluster

Automatic segmentation into “sentences”

In most transcription tasks, the boundaries of the utterances are known, and the background acoustic conditions of the utterances are fixed. Further information may also be available, such as gender or channel information. Using the given information, models can easily be adapted to the conditions at hand. A BN transcription system may receive, for example, a complete 3 h input stream. In this stream, one encounters, for example, telephone speech, speech in noisy “real life” surroundings,
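The change-detection criterion behind a BIC segmenter (see also the Conclusions) can be pictured as follows: a segment is split at time t if modeling the two halves with separate Gaussians fits the data better than a single Gaussian, even after a model-complexity penalty. The one-dimensional sketch below is purely illustrative; real segmenters operate on cepstral feature vectors with full-covariance models.

```python
import math

def delta_bic(x, t, lam=1.0):
    """Delta-BIC for splitting the 1-D feature sequence x at index t.

    Positive values favor a segment boundary (speaker/channel change) at t:
    two Gaussians fit the halves better than one Gaussian fits the whole,
    even after subtracting the model-complexity penalty."""
    def var(seg):
        m = sum(seg) / len(seg)
        return sum((v - m) ** 2 for v in seg) / len(seg)
    n, n1, n2 = len(x), t, len(x) - t
    # Log-likelihood gain of the two-Gaussian model over the single Gaussian.
    gain = 0.5 * (n * math.log(var(x))
                  - n1 * math.log(var(x[:t]))
                  - n2 * math.log(var(x[t:])))
    # Penalty: extra parameters (mean + variance in 1-D) times 0.5*log(n).
    penalty = lam * 0.5 * 2.0 * math.log(n)
    return gain - penalty

# Two clearly different regimes; Delta-BIC should peak near the true change at t=20.
x = [0.0, 0.1, -0.1, 0.05, -0.05] * 4 + [5.0, 5.2, 4.8, 5.1, 4.9] * 4
```

Scanning t over the stream and splitting wherever the criterion is positive (and maximal) yields the segment boundaries.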

Efficient one-pass trigram decoding

As in most other Hub-4 systems, a 64k-word trigram recognizer coupled with triphone models is applied in the early decoding stages to the speech utterances obtained from the segmenter (Section 3). Longer linguistic and acoustic contexts can also be handled, though, in later stages when the search is restricted to a word lattice (Section 5). The prime decoding task thus consists in performing a first “robust” search that fulfills the requirements of a trigram language model and
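The language-model look-ahead used in such a one-pass search can be sketched with a toy example: each state s of the pronunciation prefix tree is annotated with an anticipated LM probability π_v(s), the best probability of any word still reachable below s, so that pruning can use language-model evidence before a word end is reached. The tree, words, and probabilities below are invented for illustration; the actual system uses a 64k lexicon and trigram scores.

```python
def lm_lookahead(tree, lm_prob):
    """Annotate each prefix-tree state s with pi(s): the best LM probability
    of any word whose pronunciation passes through s.

    tree maps a node to a list of (child, word) arcs, where word is the
    vocabulary entry ending on that arc (None for internal arcs)."""
    pi = {}
    def best(node):
        scores = [0.0]
        for child, word in tree.get(node, []):
            if word is not None:
                scores.append(lm_prob[word])
            scores.append(best(child))
        pi[node] = max(scores)
        return pi[node]
    best("root")
    return pi

# Toy lexicon {to, tree} sharing the prefix "t"; probabilities are invented.
tree = {
    "root": [("t", None)],
    "t":    [("to", "to"), ("tr", None)],
    "tr":   [("tree", "tree")],
}
pi = lm_lookahead(tree, {"to": 0.5, "tree": 0.1})
```

A path through the "tr" subtree can thus be pruned early, since no word below it scores better than 0.1.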

Discriminative model combination

During the course of an evaluation, state-of-the-art speech recognition systems use multiple acoustic and language model sets of increasing complexity to obtain the lowest possible WER. A multi-pass decoding strategy is the typical way to incorporate multiple model sets into the decoder; the Hub-4 sites used five or more decoding passes in their evaluation systems. In a multi-pass decoding setup, the various model sets are applied in a predefined order for successive improvement
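The log-linear combination at the heart of DMC can be written in a few lines: each model contributes its probability raised to a weight λ_i, and the product is renormalized by Z. In DMC proper, the weights are trained discriminatively to minimize the word error count E(Λ); in the sketch below they are simply fixed by hand, and the distributions are toy values.

```python
import math

def log_linear_combine(model_probs, weights):
    """Log-linear interpolation: p(w) = (1/Z) * prod_i p_i(w)**lambda_i.

    In DMC the weights lambda_i are trained discriminatively to minimize
    the word error count; here they are simply given."""
    words = list(model_probs[0])
    log_unnorm = {
        w: sum(lam * math.log(p[w]) for lam, p in zip(weights, model_probs))
        for w in words
    }
    z = sum(math.exp(s) for s in log_unnorm.values())  # normalization term Z
    return {w: math.exp(s) / z for w, s in log_unnorm.items()}

# Two toy word distributions; weights (1, 0) would recover the first model exactly.
p_acoustic = {"a": 0.9, "b": 0.1}
p_language = {"a": 0.5, "b": 0.5}
combined = log_linear_combine([p_acoustic, p_language], [0.5, 0.5])
```

Because only the ratios of the weighted log-probabilities matter after normalization, this combination can be applied directly to lattice rescoring without renormalizing each component model.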

Building phrase-based distance language models

Natural language created by humans is correlated. Using a particular word influences not only the word immediately following, but words up to roughly 1000 positions later (Peters and Klakow, 1999). These correlations therefore have to be captured as well as possible while keeping the required resources and the number of parameters small. In the course of the past few years, new methods have been developed for Hub-4 serving this purpose: the use of phrases, consisting of several consecutive strongly correlated
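One way to picture the phrase-building step is to score adjacent word pairs by how much more often they co-occur than their unigram frequencies predict, and to merge high-scoring pairs into single tokens for the language model. The scoring function and thresholds below are illustrative choices, not the paper's exact criterion.

```python
from collections import Counter

def find_phrases(corpus, min_count=2):
    """Candidate phrases: adjacent word pairs occurring noticeably more often
    than their unigram frequencies predict (a pointwise association ratio).
    Score and threshold are illustrative, not the paper's criterion."""
    words = corpus.split()
    uni = Counter(words)
    bi = Counter(zip(words, words[1:]))
    n = len(words)
    phrases = []
    for (u, v), c in bi.items():
        if c >= min_count:
            # Ratio of observed pair probability to the independence baseline.
            ratio = (c / n) / ((uni[u] / n) * (uni[v] / n))
            if ratio > 1.0:
                phrases.append((u + "_" + v, ratio))
    return sorted(phrases, key=lambda t: -t[1])

phrases = find_phrases("new york is big . new york is old .")
```

Treating a merged pair such as "new_york" as one vocabulary entry lets an n-gram of order n effectively span more than n surface words.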

Feature extraction, normalization and speaker adaptation

Mel-frequency cepstral coefficients (MFCC) (Davis and Mermelstein, 1980) are probably the most popular features for speech recognition. Nevertheless, there is still active research on superior speech representations. Much effort is devoted to exploiting physiological and psychoacoustic findings about human perception. As an example, Hermansky (1990) has extended linear prediction analysis to perceptual linear prediction (PLP) by introducing concepts from
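A small illustration of the perceptual motivation behind MFCC: the mel scale warps frequency so that filter bandwidths grow with frequency, mimicking the ear's decreasing resolution. The sketch below computes the center frequencies of a triangular mel filterbank; the filter count and band edges are illustrative parameter choices.

```python
import math

def hz_to_mel(f):
    """O'Shaughnessy mel-scale formula commonly used in MFCC front ends."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_centers(n_filters=24, f_low=0.0, f_high=8000.0):
    """Center frequencies (Hz) of triangular filters equally spaced in mel."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    return [mel_to_hz(lo + (hi - lo) * (i + 1) / (n_filters + 1))
            for i in range(n_filters)]

centers = mel_filter_centers()
```

An MFCC front end then takes the log energies of these filters applied to a short-time spectrum and decorrelates them with a discrete cosine transform.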

Conclusions

A brief summary of our findings is listed below:

Segmenter: Two approaches were investigated for the automatic segmentation of the continuous BN audio stream: (1) a phoneme decoder and (2) a GMM–BIC segmenter. The GMM–BIC segmenter provides better results; the loss in word accuracy from automatic segmentation, compared to manual segmentation, is about 5% relative.

One-pass decoder: One-pass trigram decoding compares favorably with a two-pass strategy, the overall

Summary of symbols

    pΛ(w|h): probability of word w given history h and parameter set Λ

    ZΛ(h): normalization term

    pi: probability model i

    λi: weight of model i in a model combination

    NW: size of the vocabulary W

    Wj: word j in the word history

    P(w|u,v): probability of word w given predecessor words u, v

    H(u,v): hash index for word pair u, v

    MW: hash constant

    πv(s): anticipated language model probability for state s and predecessor word v

    pΛ(k|x): posterior probability of class k given background information x

    E(Λ): word error count

    x: background information (e.g., the acoustic observation)
References (39)

  • S. Ortmanns et al.

    A word graph algorithm for large vocabulary continuous speech recognition

    Comput. Speech Language

    (1997)
  • F. Alleva et al.

    Improvements on the pronunciation prefix tree search organization

  • X. Aubert

    One pass crossword decoding for large vocabularies based on a lexical tree search organization

  • X. Aubert et al.

    Large vocabulary continuous speech recognition using word graphs

  • X. Aubert et al.

    Large vocabulary continuous speech recognition of Wall Street Journal corpus

  • P. Beyerlein

    Discriminative model combination

  • P. Beyerlein et al.

    Modelling and decoding of crossword context dependent phones in the Philips large vocabulary continuous speech recognition system

  • P. Beyerlein et al.

    Automatic transcription of English broadcast news

  • P. Beyerlein et al.

    The Philips/RWTH system for transcription of Broadcast News

  • S. Chen et al.

    Speaker, environment and channel change detection and clustering via the Bayesian information criterion

  • J.N. Darroch et al.

    Generalized iterative scaling for log linear models

    Annals Math. Stat.

    (1972)
  • S.B. Davis et al.

    Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences

    IEEE T-ASSP

    (1980)
  • J.G. Fiscus

    A post-processing system to yield reduced word error rates: recognizer output voting error reduction (ROVER)

  • R. Haeb-Umbach et al.

    An investigation of cepstral parameterisations for large vocabulary speech recognition

  • T. Hain et al.

    Segment generation and clustering in the HTK Broadcast News transcription system

  • M. Harris et al.
  • H. Hermansky

    Perceptual linear predictive (PLP) analysis of speech

    J. Acoust. Soc. Am.

    (1990)
  • H. Jin et al.

    Automatic speaker clustering

  • D. Klakow

    Language-model optimization by mapping of corpora
