Tone-enhanced generalized character posterior probability (GCPP) for Cantonese LVCSR

https://doi.org/10.1016/j.csl.2007.12.003

Abstract

Tone-enhanced generalized character posterior probability (GCPP), a generalized form of posterior probability at the subword (Chinese character) level, is proposed as a rescoring metric for improving Cantonese LVCSR performance. GCPP is computed from the tone score together with the corresponding acoustic and language model scores. The tone score is output by a supra-tone model, which characterizes not only the tone contour of a single syllable but also that of adjacent ones, and which significantly outperforms other conventional tone models. The search network is constructed by first converting the original word graph to a restructured word graph, then to a character graph and, finally, to a character confusion network (CCN). Based upon tone-enhanced GCPP, either the character error rate (CER) is minimized or the GCPP product is maximized over a chosen graph. Experimental results show that the tone-enhanced GCPP can reduce the character error rate by up to 15.1% relative.

Introduction

Most HMM-based speech recognizers search for the word string (sentence) hypothesis that yields the maximum a posteriori (MAP) probability. Under the MAP criterion, the expected number of misrecognized sentences is minimized. However, word error rate (WER), rather than sentence error rate, is more universally accepted in the speech recognition community as the sole objective performance measure of an LVCSR system. Thus, in such cases it would be more appropriate to use a cost function $\lambda(w_1^M, \hat{w}_1^N)$ to weight each sentence error. Here, we use $w_1^M = w_1, w_2, \ldots, w_M$ and $\hat{w}_1^N = \hat{w}_1, \hat{w}_2, \ldots, \hat{w}_N$ to represent the hypothesized word string and the candidate word string, respectively. The expected cost, or risk, associated with selecting $w_1^M$ is defined as

$$R(w_1^M \mid x_1^T) = E\left[\lambda(w_1^M, \hat{w}_1^N)\right] = \sum_{\hat{w}_1^N} \lambda(w_1^M, \hat{w}_1^N)\, p(\hat{w}_1^N \mid x_1^T) \tag{1}$$

where $p(\hat{w}_1^N \mid x_1^T)$ is the posterior probability of the word string $\hat{w}_1^N$ given the acoustic observations $x_1^T = x_1, x_2, \ldots, x_T$. The recognition decision can be based on minimizing the expected cost, i.e.

$$w_1^M = \arg\min_{M,\, w_1^M} \sum_{\hat{w}_1^N} \lambda(w_1^M, \hat{w}_1^N)\, p(\hat{w}_1^N \mid x_1^T) \tag{2}$$

MAP is in fact a special case of the minimum expected cost decision in which a cost of 0 is assigned to two completely matched word strings and a cost of 1 otherwise. Eq. (2) can then be rewritten as

$$w_1^M = \arg\min_{M,\, w_1^M} \sum_{\hat{w}_1^N \neq w_1^M} p(\hat{w}_1^N \mid x_1^T) = \arg\max_{M,\, w_1^M} p(w_1^M \mid x_1^T) \tag{3}$$

MAP-based decoding has been widely adopted in speech recognition because the sentence with the maximum posterior probability can be found efficiently by using Bayes' rule and Viterbi search.
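The reduction of the minimum expected cost decision to MAP can be spelled out in one step. The sketch below assumes only the 0-1 cost described above and that the posteriors sum to one over all candidate word strings:

```latex
\lambda(w_1^M, \hat{w}_1^N) =
\begin{cases}
0, & \hat{w}_1^N = w_1^M \\
1, & \text{otherwise}
\end{cases}
\quad\Longrightarrow\quad
R(w_1^M \mid x_1^T)
  = \sum_{\hat{w}_1^N \neq w_1^M} p(\hat{w}_1^N \mid x_1^T)
  = 1 - p(w_1^M \mid x_1^T)
```

Minimizing the risk over $w_1^M$ is therefore the same as maximizing the posterior $p(w_1^M \mid x_1^T)$.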

Many studies have been done on how to train a recognizer or perform search in recognition to optimize such a measure. For example, the Levenshtein (string edit) distance between two word strings $w_1^M$ and $\hat{w}_1^N$ can be used as the cost function $\lambda(w_1^M, \hat{w}_1^N)$ to minimize the expected word error rate, and it has been proposed as the optimal search criterion for speech recognition (Stolcke et al., 1997, Mangu et al., 2000, Evermann and Woodland, 2000, Goel and Byrne, 2000). Estimation of word posterior probability and determination of the sentence with minimum expected word error were investigated for N-best output (Stolcke et al., 1997). They were also applied to a word graph (Mangu et al., 2000), where multiple string alignment instead of pairwise string alignment was adopted. In Goel and Byrne (2000), the minimum Bayes-risk (MBR) approach, which uses a more general cost function based on word error measurement, is implemented to rescore N-best lists and to perform A* search over the word lattice. In addition, confidence measures at the word level were used for rescoring (Wessel et al., 2000, Fetter et al., 1996, Neti et al., 1997).
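To make the contrast between the two criteria concrete, the following is a minimal Python sketch of minimum expected edit distance decoding over an N-best list. It is not the implementation of any of the cited works: the function names, the toy list `NBEST` and its posteriors are all hypothetical, and the N-best list is assumed to stand in for the full hypothesis space.

```python
from typing import List, Sequence, Tuple


def levenshtein(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Edit distance with unit cost for substitution, insertion and deletion."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]


def map_decode(nbest: List[Tuple[List[str], float]]) -> List[str]:
    """Pick the hypothesis with the highest posterior (0-1 cost)."""
    return max(nbest, key=lambda item: item[1])[0]


def mbr_decode(nbest: List[Tuple[List[str], float]]) -> List[str]:
    """Pick the hypothesis with the lowest expected edit distance.

    Posteriors are renormalized over the list (an assumption of this sketch).
    """
    total = sum(p for _, p in nbest)
    best, best_risk = None, float("inf")
    for cand, _ in nbest:
        risk = sum(levenshtein(cand, other) * p / total for other, p in nbest)
        if risk < best_risk:
            best, best_risk = cand, risk
    return best


# Hypothetical 3-best list: (word string, posterior probability).
NBEST = [
    (["a", "b", "c"], 0.4),
    (["a", "x", "c"], 0.3),
    (["a", "x", "d"], 0.3),
]
```

On this toy list, `map_decode(NBEST)` returns the first string because it has the largest posterior, while `mbr_decode(NBEST)` returns the second, whose expected edit distance (0.7) is lower than that of the first (0.9) and the third (1.1).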

Posterior probability assesses quantitatively the correctness of recognition results. It can be computed at the sentence, word or subword (e.g. syllable or character) level. There have been numerous studies on its estimation and applications (Weintraub, 1995, Wessel et al., 2001). Generalized posterior probability (Soong et al., 2004a) tries to address the various modeling discrepancies and numerical issues in computing the posterior probability. It is designed to incorporate automatically trained optimal weights to equalize the different dynamic ranges of the acoustic and language models, segmentation ambiguities, etc. It attempts to configure the most appropriate posterior probabilities for different recognition or verification tasks. Its effectiveness has been demonstrated in verification of recognition outputs under both clean and noisy conditions (Soong et al., 2004b, Lo et al., 2004).
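The idea of weighting the acoustic and language model scores before normalizing can be illustrated with a small Python sketch. This is only an illustration under stated assumptions, not the formulation of Soong et al. (2004a): the hypothesis space is a short list of time-aligned strings, the exponents `alpha` and `beta` are fixed by hand rather than trained, segmentation ambiguity is handled by a simple time-overlap test, and every name and number is hypothetical.

```python
import math
from typing import List, NamedTuple, Tuple


class Hypothesis(NamedTuple):
    words: List[Tuple[str, int, int]]  # (unit, start frame, end frame)
    acoustic_logp: float               # log acoustic likelihood of the string
    lm_logp: float                     # log language model probability


def generalized_posterior(target: Tuple[str, int, int],
                          hyps: List[Hypothesis],
                          alpha: float = 1.0,
                          beta: float = 1.0) -> float:
    """Weighted posterior of `target` = (unit, start, end) over `hyps`.

    A hypothesis supports the target if it contains the same unit in a
    time interval that overlaps [start, end).
    """
    unit, start, end = target

    def supports(h: Hypothesis) -> bool:
        return any(w == unit and s < end and start < e for w, s, e in h.words)

    # Weighted log scores; subtracting the maximum avoids underflow in exp().
    scores = [alpha * h.acoustic_logp + beta * h.lm_logp for h in hyps]
    shift = max(scores)
    weights = [math.exp(s - shift) for s in scores]
    return sum(w for w, h in zip(weights, hyps) if supports(h)) / sum(weights)


# Hypothetical time-aligned hypotheses with made-up scores.
HYPS = [
    Hypothesis([("nei", 0, 10), ("hou", 10, 20)], -100.0, -2.0),
    Hypothesis([("nei", 0, 11), ("hou", 11, 20)], -101.0, -2.0),
    Hypothesis([("lei", 0, 10), ("hou", 10, 20)], -100.0, -4.0),
]
```

Calling `generalized_posterior(("nei", 0, 10), HYPS)` pools the first two hypotheses, which differ only in segmentation. Lowering `alpha` flattens the influence of the acoustic scores, whose dynamic range is typically much larger than that of the language model.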

Cantonese, a popular Southern Chinese dialect, is a syllabically paced, tonal language in which tones are lexical. The basic written unit of Cantonese is the Chinese character, which is shared among many Chinese dialects, including the official spoken language, Mandarin or “Putonghua”, in China. Each character is pronounced as a tonal monosyllable, which has a relatively simple (C)–V–(C) structure and a more stable duration than other speech units in Chinese. The character, a subword unit in Chinese, also plays an important role in both the morphology and the phonology of Chinese languages. Most morphemes consist of a single character. In written Chinese, except for the occasional punctuation marks, there is no delimiter (like a blank space) between two adjacent characters. As a result, the definition of a word in Chinese is somewhat vague, and the final performance of Chinese LVCSR is usually measured by character error rate (CER) rather than word error rate.

There have been numerous studies on automatic tone recognition for Chinese ASR. Approaches to the subject fall into two major categories, namely, embedded tone modeling and explicit tone recognition. In embedded tone modeling, tone-related features such as F0 (the fundamental frequency) are included as extra components in the short-time feature vectors and consequently the acoustic models become tone-dependent (Chen et al., 1997, Huang and Seide, 2000, Wong and Chang, 2001, Wang et al., 2006). In this way, tone recognition is done as an integral part of the existing ASR framework. On the other hand, in explicit tone recognition, tones are independently recognized in parallel to the recognition of phonetic units. The results of tone and phonetic recognition are then combined in a post-processing stage (Lee et al., 2002, Lin et al., 1996) or integrated back into a global search process (Seide and Wang, 2000, Cao et al., 2000).

In this paper, firstly, we propose a novel method of supra-tone modeling for Cantonese tone recognition; then we extend word level generalized posterior probability (Soong et al., 2004a) to Chinese character level; finally, we use tone-enhanced generalized character posterior probability (GCPP) as a rescoring metric for Cantonese LVCSR. Each supra-tone model characterizes the F0 contour of two or three tones in succession. The tone sequence of a continuous utterance is formed as an overlapped concatenation of supra-tone units. GCPP is computed in a restructured word graph by incorporating the supra-tone models. Two improved search approaches based on GCPP, either minimizing character error rate (CER) or maximizing GCPP product, will be presented.
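The overlapped concatenation of supra-tone units can be pictured with a short Python sketch. It assumes that units are taken with a sliding window that advances one syllable at a time, which the text does not state explicitly. The function name `supra_tone_units` and the example tone labels are hypothetical.

```python
from typing import List, Sequence, Tuple


def supra_tone_units(tones: Sequence[int], span: int = 2) -> List[Tuple[int, ...]]:
    """Split a tone sequence into overlapping units of `span` successive tones.

    span=2 gives two-tone units and span=3 gives three-tone units. Adjacent
    units share span - 1 tones. A sequence shorter than `span` yields no units.
    """
    if span < 1:
        raise ValueError("span must be at least 1")
    return [tuple(tones[i:i + span]) for i in range(len(tones) - span + 1)]


# Hypothetical tone labels (1 to 6) for a four-syllable utterance.
UTTERANCE_TONES = [1, 6, 2, 3]
TWO_TONE_UNITS = supra_tone_units(UTTERANCE_TONES, span=2)
THREE_TONE_UNITS = supra_tone_units(UTTERANCE_TONES, span=3)
```

For the four tones above, the two-tone units are `(1, 6)`, `(6, 2)` and `(2, 3)`, so every tone except the first and last is covered by two units.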

The rest of the paper is organized as follows. A brief introduction to the Cantonese dialect is given in Section 2. Cantonese tone modeling is introduced in Section 3. Tone-enhanced generalized character posterior probability (GCPP) and GCPP-based rescoring are described in Sections 4 and 5, respectively. In Section 6, experimental results are presented to demonstrate the effectiveness of the proposed method. Section 7 concludes the paper.

The Cantonese dialect

Cantonese, a popular Chinese dialect, is the mother tongue of tens of millions of people living in Southern China, Hong Kong and overseas. Like Mandarin (Putonghua), Cantonese is a monosyllabic and tonal language.

Cantonese tone modeling

The six tones of Cantonese can be roughly categorized as level tones or rising tones, according to the shapes of tone contours. This is unlike Mandarin, in which all four basic tones have distinctive contour shapes, namely, high-level, mid-rising, falling–rising and high-falling (Xu, 1997). Discrimination between the Cantonese tones relies more on the heights than on the shapes of the pitch contours. Bauer and Benedict (1997) pointed out that the height of a tone is not an absolute

Generalized character posterior probability (GCPP)

The generalized character posterior probabilities are estimated in restructured word graphs and enhanced by supra-tone models.

GCPP-based rescoring

GCPP provides a quantitative estimate for the correctness of recognized characters. It is more appropriate as a performance metric since the performance of Chinese LVCSR is usually measured by CER. Here, two improved search criteria based on GCPP are investigated.
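The two criteria named in the abstract, minimizing CER and maximizing the GCPP product, can be sketched in Python as follows. This is a hedged illustration rather than the authors' algorithm. It assumes that a character confusion network is a list of slots mapping candidate characters to GCPP values, that picking the top candidate in each slot is what minimizes the expected character errors (treating slots as independent), and that a character graph is a list of edges whose integer node ids are topologically ordered. All names and values are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Edge = Tuple[int, int, str, float]  # (from node, to node, character, GCPP)


def min_cer_decode(ccn: List[Dict[str, float]]) -> List[str]:
    """Pick the highest-GCPP character in every confusion network slot."""
    return [max(slot, key=slot.get) for slot in ccn]


def max_product_path(edges: List[Edge], start: int, end: int) -> Tuple[List[str], float]:
    """Find the path from `start` to `end` with the largest GCPP product.

    Assumes every edge goes from a lower to a higher node id, so visiting
    edges in sorted order is a valid dynamic programming order.
    """
    best: Dict[int, Tuple[float, List[str]]] = {start: (0.0, [])}
    for src, dst, char, gcpp in sorted(edges):
        if src not in best or gcpp <= 0.0:
            continue  # unreachable source node or zero-probability edge
        score = best[src][0] + math.log(gcpp)  # sum of logs = log of product
        if dst not in best or score > best[dst][0]:
            best[dst] = (score, best[src][1] + [char])
    if end not in best:
        raise ValueError("end node is not reachable from start node")
    log_product, chars = best[end]
    return chars, math.exp(log_product)


# Hypothetical two-slot confusion network and a matching character graph.
CCN = [{"A": 0.6, "B": 0.4}, {"C": 0.7, "D": 0.3}]
GRAPH = [
    (0, 1, "A", 0.6), (0, 1, "B", 0.4),
    (1, 2, "C", 0.7), (1, 2, "D", 0.3),
    (0, 2, "E", 0.3),
]
```

Here both criteria select the characters A and C. In the graph, the two-character path wins because its product (about 0.42) exceeds the 0.3 of the single-edge path E. In general the product criterion depends on how many characters a path contains, which is one reason the two criteria can disagree on graphs with paths of different lengths.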

Speech database and baseline system

The speech corpus used in the experiments is CUSent™, which was collected at the DSP & Speech Technology Laboratory of the Chinese University of Hong Kong (CUHK) (CUCorpora: Cantonese Spoken Language Resources, 2001). It is a continuous Cantonese speech corpus. Its contents are given in Table 1.

The baseline LVCSR system, named CURec, was also developed by the same research group at CUHK (Choi et al., 2000). It uses context-dependent syllable Initial/Final models. The acoustic feature vector

Conclusions

GCPP is proposed as a search metric for improving Cantonese LVCSR performance. For each hypothesized character, tone-enhanced GCPP is computed by incorporating the tone model score with the corresponding acoustic and language model scores in a restructured word graph, which not only contains more string hypotheses than a typical N-best list but can also recover some good but prematurely pruned string hypotheses. It is shown that our two GCPP-based rescoring approaches can reduce CER of

References (43)

  • Chen, C.J., Gopinath, R.A., Monkowski, M.D., Picheny, M.A., Shen, K., 1997. New methods in continuous Mandarin speech...
  • Choi, W.N., Wong, Y.W., Lee, Tan., Ching, P.C., 2000. Lexical tree decoding with a class-based language model for...
  • CUCorpora: Cantonese Spoken Language Resources, 2001....
  • Evermann, G., Woodland, P.C., 2000. Posterior probability decoding: confidence estimation and system combination. In:...
  • Fetter, P., Dandurand, F., Brietzmann, P.R., 1996. Word graph rescoring using confidence measures. In: Proceedings of...
  • Hashimoto, O.-K.Y., 1972. Studies in Yue Dialects 1: Phonology of Cantonese.
  • Hirose, K., Zhang, J.S., 1999. Tone recognition of Chinese continuous speech using tone critical segments. In:...
  • Kong Hong, 1997. Linguistic Society of Hong Kong (LSHK).
  • Huang, H., Seide, F., 2000. Pitch tracking and tone features for Mandarin speech recognition. In: Proceedings of the...
  • Huang, X., et al., 2001. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development.
  • Lee, Tan, et al., 1995. Tone recognition of isolated Cantonese syllables. IEEE Trans. Speech Audio Process.