Large vocabulary continuous speech recognition of Broadcast News – The Philips/RWTH approach
Introduction
Past speech recognition research has focused mainly on decoding high-quality speech recorded in quiet environments. Recently, however, the focus has shifted to speech found in the “real world”. One such source of real-world speech is audio recordings from radio and television broadcast news (BN). Compared with previous work on automatic speech recognition, the BN task poses the following additional research problems:
- Unknown sentence boundaries.
- Diverse and rapidly changing acoustic environments. Typical degradations of the speech signal are introduced by background music, noise, interfering speakers, and changes between studio and telephone channels. Furthermore, regional dialects and the accents of non-native speakers have to be considered.
- Real-life speaking styles (spontaneous speech) and unknown speaker turns. Speaking styles range from carefully read speech to free and spontaneous conversation.
- Natural language. Difficulties arise from unpredictable topic changes in the BN as well as from spontaneous reactions in free conversations.
Section snippets
Overview
The system architecture of the Philips/RWTH Hub-4 system is shown in Fig. 1. The system consists of three decoding stages: segmentation, one-pass trigram decoding and discriminative model combination (DMC). The task of the segmentation stage is to handle the problem of unknown sentence boundaries: it transforms the continuous BN audio stream into a sequence of spoken utterances (segments), which are similar to sentences. Identification of acoustic channel bandwidth, gender and speaker cluster …
Automatic segmentation into “sentences”
In most transcription tasks, the boundaries of the utterances are known and the background acoustic conditions are fixed. Further information, such as gender or channel information, may also be available. Using this information, models can easily be adapted to the conditions at hand. A BN transcription system, in contrast, may receive a complete 3 h input stream. In this stream one encounters, for example, telephone speech, speech in noisy “real life” surroundings, …
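The GMM–BIC segmenter evaluated in this work detects acoustic change points by asking whether a window of feature frames is better modeled by one Gaussian or by two Gaussians split at a candidate boundary, penalized by the Bayesian information criterion. The following is a minimal sketch of that test, not the actual Philips/RWTH implementation; the function names and the penalty weight `lam` are illustrative:

```python
import numpy as np

def delta_bic(X, i, lam=1.0):
    """Delta-BIC for a candidate change point i in the frame sequence X (N x d).

    Positive values favor the two-segment (change) hypothesis.
    """
    N, d = X.shape

    def logdet_cov(Y):
        _, ld = np.linalg.slogdet(np.cov(Y, rowvar=False))
        return ld

    # Penalty: extra parameters of the second Gaussian (mean + covariance).
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(N)
    return (0.5 * N * logdet_cov(X)
            - 0.5 * i * logdet_cov(X[:i])
            - 0.5 * (N - i) * logdet_cov(X[i:])
            - penalty)

def find_change_point(X, margin=5, lam=1.0):
    """Return the best candidate split index, or None if no Delta-BIC > 0."""
    scores = [(delta_bic(X, i, lam), i) for i in range(margin, len(X) - margin)]
    best, i = max(scores)
    return i if best > 0 else None
```

In a full segmenter this test is applied in a growing or sliding window over the whole stream, and detected segments are then classified (bandwidth, gender) and clustered by speaker.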
Efficient one-pass trigram decoding
Like most other Hub-4 systems, ours applies a 64k-word trigram language model coupled with triphone acoustic models in the early decoding stages to the speech utterances obtained from the segmenter (Section 3). Longer linguistic and acoustic contexts can be handled in later stages, when the search is restricted to a word lattice (Section 5). The prime decoding task thus consists of performing a first “robust” search that fulfills the requirements of a trigram language model and …
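One-pass decoding with a lexical prefix tree relies on language model look-ahead: each tree state s is assigned the best LM probability πv(s) of any word reachable from it given predecessor word v, so the LM score can influence pruning before a word end is reached. A minimal sketch of computing such look-ahead values, with hypothetical `TreeNode`/`build_tree` helpers and unigram-style probabilities standing in for the real bigram/trigram scores:

```python
class TreeNode:
    """Node of a lexical (pronunciation) prefix tree."""
    def __init__(self):
        self.children = {}   # phoneme -> TreeNode
        self.words = []      # words whose pronunciation ends at this node
        self.pi = 0.0        # look-ahead value, filled by lm_lookahead

def build_tree(lexicon):
    """lexicon: word -> list of phonemes."""
    root = TreeNode()
    for word, phones in lexicon.items():
        node = root
        for ph in phones:
            node = node.children.setdefault(ph, TreeNode())
        node.words.append(word)
    return root

def lm_lookahead(node, lm_prob):
    """pi(s): best LM probability over all words reachable from this node."""
    best = max((lm_prob[w] for w in node.words), default=0.0)
    for child in node.children.values():
        best = max(best, lm_lookahead(child, lm_prob))
    node.pi = best
    return best
```

During search, a hypothesis in state s is scored with its acoustic score times πv(s); when a word end is reached, the anticipated probability is replaced by the exact trigram probability.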
Discriminative model combination
During the course of an evaluation, state-of-the-art speech recognition systems use multiple acoustic and language model sets of increasing complexity to obtain the lowest possible WER. A multi-pass decoding strategy is the typical way to incorporate multiple model sets into the decoder; the Hub-4 sites used five or more decoding passes in their evaluation systems. In a multi-pass decoding setup, the various model sets are applied in a predefined order for successive improvement …
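DMC instead combines all model sets log-linearly, pΛ(w|h) = ZΛ(h)⁻¹ ∏ᵢ pᵢ(w|h)^λᵢ, where the weights λᵢ are trained discriminatively to minimize the word error count E(Λ). A toy sketch of the combination step for one history h (weight training omitted; the function name is illustrative):

```python
import math

def dmc_combine(model_probs, weights):
    """Log-linear combination p_Lambda(w|h) = prod_i p_i(w|h)^lambda_i / Z_Lambda(h).

    model_probs: list of dicts, each mapping word -> p_i(w|h) for a fixed history h.
    weights:     lambda_i, one weight per model.
    """
    vocab = model_probs[0].keys()
    scores = {w: math.exp(sum(lam * math.log(p[w])
                              for lam, p in zip(weights, model_probs)))
              for w in vocab}
    Z = sum(scores.values())        # normalization term Z_Lambda(h)
    return {w: s / Z for w, s in scores.items()}
```

With weight vector (1, 0, …, 0) the combination degenerates to the first model alone; the discriminative training moves the weights away from such corners toward the error-minimizing mixture.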
Building phrase-based distance language models
Natural language produced by humans is correlated: a particular word influences not only the word immediately following it, but up to the next 1000 words (Peters and Klakow, 1999). These correlations have to be captured as well as possible to reduce the resources needed and to minimize the number of parameters. Over the past few years, new methods serving this purpose have been developed for Hub-4: the use of phrases, consisting of several consecutive strongly correlated …
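A distance-d bigram captures such longer-range correlations by modeling p_d(w_n | w_{n−d}), i.e. conditioning on the word d positions back while skipping the intervening words. A toy, unsmoothed count-based sketch (real distance models are smoothed and interpolated with the ordinary n-gram model; the function names are illustrative):

```python
from collections import Counter

def distance_bigram_counts(tokens, d):
    """Counts for a distance-d bigram model p_d(w_n | w_{n-d})."""
    pairs = Counter(zip(tokens[:-d], tokens[d:]))  # (w_{n-d}, w_n) pairs
    ctx = Counter(tokens[:-d])                     # context occurrences
    return pairs, ctx

def distance_bigram_prob(pairs, ctx, u, w):
    """Maximum-likelihood estimate of p_d(w | u)."""
    return pairs[(u, w)] / ctx[u] if ctx[u] else 0.0
```

Phrase-based variants apply the same idea after first merging strongly correlated consecutive words into single phrase tokens, which shortens the effective distances the model must bridge.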
Feature extraction, normalization and speaker adaptation
Mel-frequency cepstral coefficients (MFCC) (Davis and Mermelstein, 1980) are probably the most popular features for speech recognition. Nevertheless, there is still active research into superior speech representations. Much effort is devoted to exploiting physiological and psychoacoustic findings about human perception. As an example, Hermansky (1990) extended linear prediction analysis to perceptual linear prediction (PLP) by introducing concepts from …
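The standard MFCC pipeline consists of framing and windowing, a power spectrum, a triangular mel-spaced filterbank, a logarithm, and a DCT to decorrelate the filter outputs. A compact NumPy sketch under common parameter choices (25 ms frames, 10 ms shift at 16 kHz); the constants are illustrative, not the ones used in the Philips/RWTH front end:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters equally spaced on the mel scale."""
    mels = np.linspace(0.0, hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc(signal, sr=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_ceps=13):
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)               # framing + window
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft    # power spectrum
    energies = power @ mel_filterbank(n_filters, n_fft, sr).T
    log_e = np.log(energies + 1e-10)                           # log mel energies
    # DCT-II decorrelates the log energies into cepstral coefficients.
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps),
                                  np.arange(n_filters) + 0.5) / n_filters)
    return log_e @ dct.T
```

Normalization steps such as cepstral mean subtraction and vocal tract length normalization, discussed in this section, are applied on top of these raw coefficients.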
Conclusions
A brief summary of our findings is listed below:
• Segmenter: Two approaches were investigated for the automatic segmentation of the continuous BN audio stream: (1) a phoneme decoder and (2) a GMM–BIC segmenter. The GMM–BIC segmenter provides better results; the loss of word accuracy with automatic segmentation, compared to manual segmentation, is about 5% relative.
• One-pass decoder: One-pass trigram decoding compares favorably with a two-pass strategy, the overall …
Summary of symbols
- pΛ(w|h): probability of word w given history h and parameter set Λ
- ZΛ(h): normalization term
- pi: probability model i
- λi: weight of model i in a model combination
- NW: size of the vocabulary
- Wj: word j in the word history
- P(w|u,v): probability of word w given predecessor words u, v
- H(u,v): hash index for the word pair u, v
- MW: hash constant
- πv(s): anticipated language model probability for state s and predecessor word v
- pΛ(k|x): posterior probability of class k given background information x
- E(Λ): word error count
- x: background information
References (39)
- et al., A word graph algorithm for large vocabulary continuous speech recognition, Comput. Speech Language (1997)
- et al., Improvements on the pronunciation prefix tree search organization
- One pass crossword decoding for large vocabularies based on a lexical tree search organization
- et al., Large vocabulary continuous speech recognition using word graphs
- et al., Large vocabulary continuous speech recognition of Wall Street Journal corpus
- Discriminative model combination
- et al., Modelling and decoding of crossword context dependent phones in the Philips large vocabulary continuous speech recognition system
- et al., Automatic transcription of English broadcast news
- et al., The Philips/RWTH system for transcription of Broadcast News
- et al., Speaker, environment and channel change detection and clustering via the Bayesian information criterion