Tone-enhanced generalized character posterior probability (GCPP) for Cantonese LVCSR

doi:10.1016/j.csl.2007.12.003

Computer Speech & Language

Volume 22, Issue 4, October 2008, Pages 360-373

https://doi.org/10.1016/j.csl.2007.12.003

Abstract

Tone-enhanced generalized character posterior probability (GCPP), a generalized form of posterior probability at the subword (Chinese character) level, is proposed as a rescoring metric for improving Cantonese LVCSR performance. GCPP is computed from the tone score together with the corresponding acoustic and language model scores. The tone score is produced by a supra-tone model, which characterizes not only the tone contour of a single syllable but also those of adjacent syllables, and which significantly outperforms conventional tone models. The search network is constructed by first converting the original word graph to a restructured word graph, then to a character graph and, finally, to a character confusion network (CCN). Based on tone-enhanced GCPP, either the character error rate (CER) is minimized or the GCPP product is maximized over a chosen graph. Experimental results show that tone-enhanced GCPP can reduce the character error rate by up to 15.1% relative.

Introduction

Most HMM-based speech recognizers search for the word string (sentence) hypothesis that yields the maximum a posteriori (MAP) probability. Under the MAP criterion, the expected number of misrecognized sentences is minimized. However, word error rate (WER), rather than sentence error rate, is more universally accepted in the speech recognition community as the sole objective performance measure of an LVCSR system. In such cases it would therefore be more appropriate to use a cost function $\lambda (w_{1}^{M}, {\hat{w}}_{1}^{N})$ to weight each sentence error. Here, we use $w_{1}^{M} = w_{1}, w_{2}, \dots, w_{M}$ and ${\hat{w}}_{1}^{N} = {\hat{w}}_{1}, {\hat{w}}_{2}, \dots, {\hat{w}}_{N}$ to represent the hypothesized word string and the candidate word string, respectively. The expected cost or risk associated with selecting $w_{1}^{M}$ is defined as

$$R (w_{1}^{M} | x_{1}^{T}) = E [\lambda (w_{1}^{M}, {\hat{w}}_{1}^{N})] = \sum_{{\hat{w}}_{1}^{N}} \lambda (w_{1}^{M}, {\hat{w}}_{1}^{N}) \, p ({\hat{w}}_{1}^{N} | x_{1}^{T}) \tag{1}$$

where $p ({\hat{w}}_{1}^{N} | x_{1}^{T})$ is the posterior probability of the word string ${\hat{w}}_{1}^{N}$ given the acoustic observations $x_{1}^{T} = x_{1}, x_{2}, \dots, x_{T}$. The recognition decision can be based on minimizing the expected cost, i.e.

$$w_{1}^{* M} = \arg \min_{M, w_{1}^{M}} \sum_{{\hat{w}}_{1}^{N}} \lambda (w_{1}^{M}, {\hat{w}}_{1}^{N}) \, p ({\hat{w}}_{1}^{N} | x_{1}^{T}) \tag{2}$$

MAP is in fact a special case of the minimum expected cost decision in which a cost of 0 is assigned to two completely matched word strings and a cost of 1 otherwise. With this cost, Eq. (2) can be rewritten as

$$w_{1}^{* M} = \arg \min_{M, w_{1}^{M}} \sum_{{\hat{w}}_{1}^{N} \neq w_{1}^{M}} p ({\hat{w}}_{1}^{N} | x_{1}^{T}) = \arg \min_{M, w_{1}^{M}} \left[ 1 - p (w_{1}^{M} | x_{1}^{T}) \right] = \arg \max_{M, w_{1}^{M}} p (w_{1}^{M} | x_{1}^{T}) \tag{3}$$

MAP-based decoding has been widely adopted in speech recognition because the sentence with maximum posterior probability can be found efficiently by using Bayes' rule and Viterbi search.
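To make the relationship between the minimum expected cost decision and MAP concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a toy candidate list with made-up posteriors and restricts the search to the strings in that list; the names `posteriors`, `zero_one_cost`, `expected_risk` and `min_risk_decision` are hypothetical.

```python
from typing import Callable, Dict, Tuple

WordString = Tuple[str, ...]


def zero_one_cost(w: WordString, w_hat: WordString) -> float:
    """Cost of 0 for two completely matched word strings, 1 otherwise."""
    return 0.0 if w == w_hat else 1.0


def expected_risk(
    w: WordString,
    posteriors: Dict[WordString, float],
    cost: Callable[[WordString, WordString], float],
) -> float:
    """Expected cost of selecting w: sum of cost(w, w_hat) * p(w_hat | x)."""
    return sum(cost(w, w_hat) * p for w_hat, p in posteriors.items())


def min_risk_decision(
    posteriors: Dict[WordString, float],
    cost: Callable[[WordString, WordString], float],
) -> WordString:
    """Pick the candidate with the smallest expected cost.

    Assumption: the search space is limited to the listed candidates.
    """
    return min(posteriors, key=lambda w: expected_risk(w, posteriors, cost))


# Toy posteriors p(w_hat | x); the values are invented for illustration.
posteriors = {
    ("a", "b", "c"): 0.5,
    ("a", "b", "d"): 0.3,
    ("a", "e", "d"): 0.2,
}

map_hypothesis = max(posteriors, key=posteriors.get)
mbr_hypothesis = min_risk_decision(posteriors, zero_one_cost)
```

Under the 0/1 cost, the expected risk of a candidate is one minus its posterior, so `mbr_hypothesis` and `map_hypothesis` coincide in this example.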

Many studies have investigated how to train a recognizer, or how to perform the recognition search, so as to optimize such a measure. For example, when the cost function $\lambda (w_{1}^{M}, {\hat{w}}_{1}^{N})$ is the Levenshtein (string edit) distance between the two word strings $w_{1}^{M}$ and ${\hat{w}}_{1}^{N}$, minimizing the expected cost minimizes the expected word error rate, and this was proposed as the optimal search criterion for speech recognition (Stolcke et al., 1997, Mangu et al., 2000, Evermann and Woodland, 2000, Goel and Byrne, 2000). Estimation of word posterior probability and determination of the sentence with minimum expected word error were investigated for N-best output (Stolcke et al., 1997). They were also applied to a word graph (Mangu et al., 2000), where multiple string alignment instead of pairwise string alignment was adopted. In Goel and Byrne (2000), the minimum Bayes-risk (MBR) approach, which uses a more general cost function based on word error measurement, is implemented both for N-best list rescoring and for $A^{*}$ search over the word lattice. In addition, confidence measures at the word level were used for rescoring (Wessel et al., 2000, Fetter et al., 1996, Neti et al., 1997).
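As an illustration of the Levenshtein cost on an N-best list, here is a minimal Python sketch under stated assumptions: the posteriors are invented, the candidate space is limited to the list itself, and the function names `levenshtein` and `min_expected_word_error` are hypothetical. It is not the procedure of any of the cited works, which operate on real N-best lists, word graphs or lattices.

```python
from typing import Dict, Sequence, Tuple

WordString = Tuple[str, ...]


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """String edit distance with unit costs for substitution, insertion, deletion."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,             # delete x
                    curr[j - 1] + 1,         # insert y
                    prev[j - 1] + (x != y),  # substitute or match
                )
            )
        prev = curr
    return prev[-1]


def min_expected_word_error(posteriors: Dict[WordString, float]) -> WordString:
    """Return the listed candidate with the smallest expected edit distance."""

    def risk(w: WordString) -> float:
        return sum(levenshtein(w, w_hat) * p for w_hat, p in posteriors.items())

    return min(posteriors, key=risk)


# Toy N-best list: the MAP string shares no words with the other two,
# while the other two differ from each other in a single word.
posteriors = {
    ("x", "y", "z"): 0.40,
    ("a", "b", "c"): 0.35,
    ("a", "b", "d"): 0.25,
}

map_hypothesis = max(posteriors, key=posteriors.get)
mbr_hypothesis = min_expected_word_error(posteriors)
```

In this toy case the MAP choice is `("x", "y", "z")`, whereas the minimum expected word error choice is `("a", "b", "c")`, because the latter is close to most of the remaining probability mass.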

Posterior probability assesses quantitatively the correctness of recognition results. It can be computed at the sentence, word or subword (e.g. syllable or character) level. There have been numerous studies on its estimation and applications (Weintraub, 1995, Wessel et al., 2001). Generalized posterior probability (Soong et al., 2004a) tries to address the various modeling discrepancies and numerical issues in computing the posterior probability. It is designed to incorporate automatically trained optimal weights to equalize the different dynamic ranges of the acoustic and language models, segmentation ambiguities, etc. It attempts to configure the most appropriate posterior probabilities for different recognition or verification tasks. Its effectiveness has been demonstrated in verification of recognition outputs under both clean and noisy conditions (Soong et al., 2004b, Lo et al., 2004).

Cantonese, a popular Southern Chinese dialect, is a syllabically paced, tonal language in which tones are lexical. The basic written unit of Cantonese is the Chinese character, which is shared among many Chinese dialects, including the official spoken language of China, Mandarin or “Putonghua”. Each character is pronounced as a tonal monosyllable, which has a relatively simple (C)–V–(C) structure and a more stable duration than other speech units in Chinese. The character, a subword unit in Chinese, also plays an important role in both the morphology and the phonology of Chinese languages. Most morphemes consist of a single character. In written Chinese, except for occasional punctuation marks, there is no delimiter (like a blank space) between two adjacent characters. As a result, the definition of a word in Chinese is somewhat vague, and the final performance of Chinese LVCSR is usually measured by character error rate (CER) rather than word error rate.

There have been numerous studies on automatic tone recognition for Chinese ASR. Approaches to the subject fall into two major categories, namely, embedded tone modeling and explicit tone recognition. In embedded tone modeling, tone-related features such as F0 (the fundamental frequency) are included as extra components in the short-time feature vectors and consequently the acoustic models become tone-dependent (Chen et al., 1997, Huang and Seide, 2000, Wong and Chang, 2001, Wang et al., 2006). In this way, tone recognition is done as an integral part of the existing ASR framework. On the other hand, in explicit tone recognition, tones are independently recognized in parallel to the recognition of phonetic units. The results of tone and phonetic recognition are then combined in a post-processing stage (Lee et al., 2002, Lin et al., 1996) or integrated back into a global search process (Seide and Wang, 2000, Cao et al., 2000).

In this paper, firstly, we propose a novel method of supra-tone modeling for Cantonese tone recognition; then we extend word level generalized posterior probability (Soong et al., 2004a) to Chinese character level; finally, we use tone-enhanced generalized character posterior probability (GCPP) as a rescoring metric for Cantonese LVCSR. Each supra-tone model characterizes the F0 contour of two or three tones in succession. The tone sequence of a continuous utterance is formed as an overlapped concatenation of supra-tone units. GCPP is computed in a restructured word graph by incorporating the supra-tone models. Two improved search approaches based on GCPP, either minimizing character error rate (CER) or maximizing GCPP product, will be presented.
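The overlapped concatenation of supra-tone units can be pictured with a short Python sketch. It assumes that consecutive units are shifted by exactly one syllable and that tones are given as integer labels; both are illustrative assumptions, and the function name `supra_tone_units` is hypothetical. The sketch only enumerates the units and does not model F0 contours.

```python
from typing import List, Sequence, Tuple


def supra_tone_units(tones: Sequence[int], span: int = 2) -> List[Tuple[int, ...]]:
    """Split a tone sequence into overlapping units of `span` successive tones.

    Assumption: adjacent units overlap by span - 1 tones (a shift of one syllable).
    """
    if span < 1:
        raise ValueError("span must be at least 1")
    return [tuple(tones[i:i + span]) for i in range(len(tones) - span + 1)]


# Hypothetical four-syllable utterance with tone labels 1, 6, 2, 3.
utterance_tones = [1, 6, 2, 3]
di_tone_units = supra_tone_units(utterance_tones, span=2)
tri_tone_units = supra_tone_units(utterance_tones, span=3)
```

With these assumptions, each tone in the interior of the utterance is covered by more than one unit, which is what lets a unit capture the contour of a syllable together with that of its neighbours.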

The rest of the paper is organized as follows. A brief introduction to the Cantonese dialect is given in Section 2. Cantonese tone modeling is introduced in Section 3. Tone-enhanced generalized character posterior probability (GCPP) and GCPP-based rescoring are described in Sections 4 and 5, respectively. In Section 6, experimental results are presented to demonstrate the effectiveness of the proposed method. Section 7 concludes the paper.

Section snippets

The Cantonese dialect

Cantonese, a popular Chinese dialect, is the mother tongue of tens of millions of people living in Southern China, Hong Kong and overseas. Like Mandarin (Putonghua), Cantonese is a monosyllabic and tonal language.

Cantonese tone modeling

The six tones of Cantonese can be roughly categorized as level tones or rising tones, according to the shapes of their tone contours. This is unlike Mandarin, in which all four basic tones have distinctive contour shapes, namely, high-level, mid-rising, falling–rising and high-falling (Xu, 1997). Discrimination between the Cantonese tones relies more on the heights than on the shapes of the pitch contours. Bauer and Benedict (1997) pointed out that the height of a tone is not an absolute

Generalized character posterior probability (GCPP)

The generalized character posterior probabilities are estimated in restructured word graphs and enhanced by supra-tone models.
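The snippet above does not give the GCPP formula, so the following Python sketch is only one plausible reading of how tone, acoustic and language model scores could be combined into a character-level posterior. It makes several loud assumptions: hypotheses come from a flat list rather than a restructured word graph, characters are aligned by position index rather than by time, the scores are log-domain values invented for illustration, and the weights `alpha`, `beta` and `gamma` as well as all names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Hypothesis:
    chars: Tuple[str, ...]
    acoustic: float  # acoustic model log-score (invented value)
    lm: float        # language model log-score (invented value)
    tone: float      # tone model log-score (invented value)


def character_posterior(
    hyps: List[Hypothesis],
    char: str,
    position: int,
    alpha: float = 1.0,
    beta: float = 1.0,
    gamma: float = 1.0,
) -> float:
    """Share of weighted hypothesis mass that has `char` at `position`.

    Assumption: alpha, beta, gamma scale the acoustic, language model and
    tone log-scores before they are summed and exponentiated.
    """
    if not hyps:
        raise ValueError("at least one hypothesis is required")
    logs = [alpha * h.acoustic + beta * h.lm + gamma * h.tone for h in hyps]
    shift = max(logs)  # subtract the maximum for numerical stability
    weights = [math.exp(s - shift) for s in logs]
    matched = sum(
        w
        for h, w in zip(hyps, weights)
        if position < len(h.chars) and h.chars[position] == char
    )
    return matched / sum(weights)


# "A", "B", "C", "D" are placeholders standing in for Chinese characters.
hyps = [
    Hypothesis(("A", "B"), acoustic=-10.0, lm=-2.0, tone=-1.0),
    Hypothesis(("A", "C"), acoustic=-10.0, lm=-2.0, tone=-3.0),
    Hypothesis(("D", "C"), acoustic=-11.0, lm=-2.0, tone=-1.0),
]

without_tone = character_posterior(hyps, "B", position=1, gamma=0.0)
with_tone = character_posterior(hyps, "B", position=1, gamma=1.0)
```

In this toy setup, setting `gamma` to zero ignores the tone score, and turning it on raises the posterior of `"B"` because the competing hypotheses that disagree at that position include one with a poor tone score.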

GCPP-based rescoring

GCPP provides a quantitative estimate for the correctness of recognized characters. It is more appropriate as a performance metric since the performance of Chinese LVCSR is usually measured by CER. Here, two improved search criteria based on GCPP are investigated.
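The two criteria can be sketched in Python as follows. This is an illustrative simplification, not the paper's search procedure: the confusion network is a plain list of slots mapping candidate characters to made-up GCPP values, the character graph is a small acyclic graph whose integer node ids increase along every edge, deletions are not modeled, and the names `consensus_decode` and `best_path_by_gcpp_product` are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Edge = Tuple[int, int, str, float]  # (from_node, to_node, character, GCPP)


def consensus_decode(ccn: List[Dict[str, float]]) -> List[str]:
    """Pick the highest-GCPP character in each slot of a confusion network.

    Assumption: slots are treated independently, so this choice minimizes
    the expected number of character errors slot by slot.
    """
    return [max(slot, key=slot.get) for slot in ccn]


def best_path_by_gcpp_product(edges: List[Edge], start: int, end: int) -> List[str]:
    """Find the path from start to end that maximizes the product of GCPPs.

    Assumption: from_node < to_node on every edge, so sorting the edges by
    from_node visits nodes in topological order.
    """
    best: Dict[int, Tuple[float, List[str]]] = {start: (0.0, [])}
    for u, v, ch, p in sorted(edges):
        if u not in best or p <= 0.0:
            continue  # unreachable source node or unusable score
        cand = best[u][0] + math.log(p)  # sum of logs = log of product
        if v not in best or cand > best[v][0]:
            best[v] = (cand, best[u][1] + [ch])
    if end not in best:
        raise ValueError("end node is not reachable from start node")
    return best[end][1]


# Placeholder characters with invented GCPP values.
ccn = [{"A": 0.7, "D": 0.3}, {"B": 0.6, "C": 0.4}]
edges = [
    (0, 1, "A", 0.7),
    (0, 1, "D", 0.3),
    (1, 2, "B", 0.6),
    (1, 2, "C", 0.4),
]

min_cer_result = consensus_decode(ccn)
max_product_result = best_path_by_gcpp_product(edges, start=0, end=2)
```

On this small example both criteria return the same character string; on a graph where paths differ in length or where not every combination of characters is a valid path, the two results can diverge.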

Speech database and baseline system

The speech corpus used in the experiments is CUSent™, a continuous Cantonese speech corpus collected at the DSP & Speech Technology Laboratory of the Chinese University of Hong Kong (CUHK) (CUCorpora: Cantonese Spoken Language Resources, 2001). Its contents are given in Table 1.

The baseline LVCSR system, named CURec, was also developed by the same research group at CUHK (Choi et al., 2000). It uses context-dependent syllable Initial/Final models. The acoustic feature vector

Conclusions

GCPP is proposed as a search metric for improving Cantonese LVCSR performance. For each hypothesized character, tone-enhanced GCPP is computed by incorporating the tone model score with the corresponding acoustic and language model scores in a restructured word graph, which not only contains more string hypotheses than a typical N-best list but also can recover some good but prematurely pruned string hypotheses. It is shown that our two GCPP-based rescoring approaches can reduce CER of

References (43)

V. Goel et al. Minimum Bayes-risk automatic speech recognition. Comp. Speech Lang. (2000)

C.-H. Lin et al. Frameworks for recognition of Mandarin syllables with tones using sub-syllabic units. J. Speech Commun. (1996)

L. Mangu et al. Finding consensus in speech recognition: word error minimization and other applications of confusion networks. Comp. Speech Lang. (2000)

S. Ortmanns et al. A word graph algorithm for large vocabulary continuous speech recognition. Comp. Speech Lang. (1997)

G. Peng et al. Tone recognition of continuous Cantonese speech based on support vector machines. J. Speech Commun. (2005)

Y. Xu. Contextual tonal variation on Mandarin. J. Phonetics (1997)

R.S. Bauer et al. Modern Cantonese phonology (1997)

Cao, Y., Deng, Y., Zhang, H., Huang, T., Xu, B., 2000. Decision-tree based Mandarin tone model and its application to...

Y.R. Chao. A system of tone letters. Le Maître Phonétique (1930)

S.H. Chen et al. Tone recognition of continuous Mandarin speech based on neural networks. IEEE Trans. Speech Audio Process. (1995)

Chen, C.J., Gopinath, R.A., Monkowski, M.D., Picheny, M.A., Shen, K., 1997. New methods in continuous Mandarin speech...

Choi, W.N., Wong, Y.W., Lee, Tan., Ching, P.C., 2000. Lexical tree decoding with a class-based language model for...

CUCorpora: Cantonese Spoken Language Resources, 2001....

Evermann, G., Woodland, P.C., 2000. Posterior probability decoding: confidence estimation and system combination. In:...

Fetter, P., Dandurand, F., Brietzmann, P.R., 1996. Word graph rescoring using confidence measures. In: Proceedings of...

O.-K.Y. Hashimoto. Studies in Yue Dialects 1: Phonology of Cantonese (1972)

Hirose, K., Zhang, J.S., 1999. Tone recognition of Chinese continuous speech using tone critical segments. In:...

Kong Hong. Linguistic Society of Hong Kong (LSHK) (1997)

Huang, H., Seide, F., 2000. Pitch tracking and tone features for Mandarin speech recognition. In: Proceedings of the...

X. Huang et al. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development (2001)

Tan Lee et al. Tone recognition of isolated Cantonese syllables. IEEE Trans. Speech Audio Process. (1995)

