Tone-enhanced generalized character posterior probability (GCPP) for Cantonese LVCSR
Introduction
Most HMM-based speech recognizers search for the word string (sentence) hypothesis that yields the maximum a posteriori (MAP) probability. Under the MAP criterion, the expected number of misrecognized sentences is minimized. However, word error rate (WER), rather than sentence error rate, is more widely accepted in the speech recognition community as the sole objective performance measure of an LVCSR system. In such cases it is more appropriate to use a cost function to weight each sentence error. Here, we use $W$ and $W'$ to represent the hypothesized word string and the candidate word string, respectively. The expected cost, or risk, associated with selecting $W$ is defined as
$$R(W) = \sum_{W'} C(W, W')\, P(W' \mid X) \qquad (1)$$
where $C(W, W')$ is the cost function and $P(W' \mid X)$ is the posterior probability of word string $W'$ given the acoustic observations $X$. The recognition decision can be based on minimizing the expected cost, i.e.
$$\hat{W} = \arg\min_{W} \sum_{W'} C(W, W')\, P(W' \mid X) \qquad (2)$$
The MAP decision is in fact a special case of the minimum expected cost decision in which a cost of 0 is assigned to two completely matched word strings and a cost of 1 otherwise. Eq. (2) can then be rewritten as
$$\hat{W} = \arg\max_{W} P(W \mid X) \qquad (3)$$
MAP-based decoding has been widely adopted in speech recognition because the sentence with maximum posterior probability can be found efficiently by using Bayes' rule and Viterbi search.
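The step from the minimum expected cost rule to the MAP rule can be written out explicitly. The notation below is assumed here for illustration: $W$ is the hypothesized word string, $W'$ a candidate word string, $X$ the acoustic observations, $C$ the cost function and $R$ the expected cost.

```latex
\begin{align*}
C(W, W') &= \begin{cases} 0, & W = W' \\ 1, & \text{otherwise} \end{cases} \\
R(W) &= \sum_{W'} C(W, W')\, P(W' \mid X)
      = \sum_{W' \neq W} P(W' \mid X)
      = 1 - P(W \mid X) \\
\arg\min_{W} R(W) &= \arg\max_{W} P(W \mid X)
\end{align*}
```

With the 0/1 cost, the risk of a hypothesis is simply the posterior mass assigned to every other string, so minimizing the risk and maximizing the posterior select the same sentence.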
Many studies have been done on how to train a recognizer or perform search in recognition to optimize such a measure. For example, the Levenshtein (string edit) distance between two word strings can be used as the cost function to minimize the expected word error rate, and it was proposed as the optimal search criterion for speech recognition (Stolcke et al., 1997, Mangu et al., 2000, Evermann and Woodland, 2000, Goel and Byrne, 2000). Estimation of word posterior probability and determination of the sentence with minimum expected word error were investigated for N-best output (Stolcke et al., 1997). They were also applied to a word graph (Mangu et al., 2000), where multiple string alignment instead of pairwise string alignment was adopted. In Goel and Byrne (2000), the minimum Bayes-risk (MBR) approach, which uses a more general cost function based on word error measurement, is implemented to rescore N-best lists and to perform A* search over the word lattice. In addition, confidence measures at the word level were used for rescoring (Wessel et al., 2000, Fetter et al., 1996, Neti et al., 1997).
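As a rough illustration of minimum expected cost decoding over an N-best list, the sketch below normalizes hypothesis scores into posteriors and picks the hypothesis with the lowest expected Levenshtein distance to the other hypotheses. The function names, the toy hypotheses and their scores are all hypothetical, and this is not the implementation used in any of the cited works.

```python
import math

def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (r != h)))    # substitution or match
        prev = curr
    return prev[-1]

def zero_one(a, b):
    """Sentence-level 0/1 cost: 0 for identical strings, 1 otherwise."""
    return 0 if a == b else 1

def nbest_posteriors(log_scores):
    """Normalize log scores over the N-best list into posteriors."""
    m = max(log_scores)
    exps = [math.exp(s - m) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]

def mbr_decode(nbest, log_scores, cost=edit_distance):
    """Return the hypothesis with minimum expected cost, and that cost."""
    post = nbest_posteriors(log_scores)
    risks = [sum(p * cost(cand, hyp) for cand, p in zip(nbest, post))
             for hyp in nbest]
    best = min(range(len(nbest)), key=risks.__getitem__)
    return nbest[best], risks[best]

# Toy N-best list (assumed data): the top hypothesis shares no word
# with the two lower-ranked hypotheses, which differ in one word.
nbest = [["x", "y", "z"], ["a", "b", "c"], ["a", "b", "d"]]
log_scores = [math.log(0.40), math.log(0.35), math.log(0.25)]

map_choice, _ = mbr_decode(nbest, log_scores, cost=zero_one)
mbr_choice, _ = mbr_decode(nbest, log_scores, cost=edit_distance)
```

In this toy case the 0/1 cost selects the hypothesis with the highest posterior, while the edit-distance cost selects the second hypothesis because it lies close to the remaining posterior mass.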
Posterior probability quantitatively assesses the correctness of recognition results. It can be computed at the sentence, word or subword (e.g. syllable or character) level. There have been numerous studies on its estimation and applications (Weintraub, 1995, Wessel et al., 2001). Generalized posterior probability (Soong et al., 2004a) tries to address the various modeling discrepancies and numerical issues in computing the posterior probability. It is designed to incorporate automatically trained optimal weights to equalize the different dynamic ranges of the acoustic and language models, segmentation ambiguities, etc. It attempts to configure the most appropriate posterior probabilities for different recognition or verification tasks. Its effectiveness has been demonstrated in verification of recognition outputs under both clean and noisy conditions (Soong et al., 2004b, Lo et al., 2004).
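The idea of weighting acoustic and language model scores and relaxing the time registration can be sketched over a small set of time-aligned hypotheses. Everything here is an assumption made for illustration: the `Arc` record, the exponent weights `alpha` and `beta`, their default values, and the rule that any time overlap counts as a match. The exact formulation in Soong et al. (2004a) may differ.

```python
import math
from typing import NamedTuple

class Arc(NamedTuple):
    word: str
    start: int   # first frame of the word
    end: int     # last frame of the word (inclusive)
    ac: float    # acoustic log-likelihood
    lm: float    # language-model log-probability

def path_score(path, alpha, beta):
    """Weighted log score of one hypothesis (a list of arcs)."""
    return sum(alpha * a.ac + beta * a.lm for a in path)

def generalized_posterior(paths, word, start, end, alpha=0.05, beta=1.0):
    """Posterior mass of all paths containing `word` in an arc that
    overlaps the frame interval [start, end]."""
    scores = [path_score(p, alpha, beta) for p in paths]
    m = max(scores)                      # subtract the max for stability
    weights = [math.exp(s - m) for s in scores]
    hit = sum(w for p, w in zip(paths, weights)
              if any(a.word == word and a.start <= end and start <= a.end
                     for a in p))
    return hit / sum(weights)

# Two equally scored hypotheses (assumed data) that agree on the first
# word, with slightly different boundaries, and disagree on the second.
paths = [
    [Arc("a", 0, 9, -100.0, -1.0), Arc("b", 10, 19, -100.0, -1.0)],
    [Arc("a", 0, 8, -100.0, -1.0), Arc("c", 9, 19, -100.0, -1.0)],
]
p_a = generalized_posterior(paths, "a", 0, 9)
p_b = generalized_posterior(paths, "b", 10, 19)
```

Because the overlap test tolerates the one-frame boundary difference, both hypotheses contribute to the posterior of "a", whereas "b" receives only the mass of the single path that contains it.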
Cantonese, a popular Southern Chinese dialect, is a syllabically paced, tonal language in which tones are lexical. The basic written unit of Cantonese is the Chinese character, which is shared among many Chinese dialects, including the official spoken language of China, Mandarin or “Putonghua”. Each character is pronounced as a tonal monosyllable, which has a relatively simple (C)–V–(C) structure and a more stable duration than other speech units in Chinese. The character, a subword unit in Chinese, also plays an important role in both the morphology and the phonology of Chinese languages. Most morphemes consist of a single character. In written Chinese, except for occasional punctuation marks, there is no delimiter (such as a blank space) between two adjacent characters. As a result, the definition of a word in Chinese is somewhat vague, and the final performance of Chinese LVCSR is usually measured by character error rate (CER) rather than word error rate.
There have been numerous studies on automatic tone recognition for Chinese ASR. Approaches to the subject fall into two major categories, namely, embedded tone modeling and explicit tone recognition. In embedded tone modeling, tone-related features such as F0 (the fundamental frequency) are included as extra components in the short-time feature vectors and consequently the acoustic models become tone-dependent (Chen et al., 1997, Huang and Seide, 2000, Wong and Chang, 2001, Wang et al., 2006). In this way, tone recognition is done as an integral part of the existing ASR framework. On the other hand, in explicit tone recognition, tones are independently recognized in parallel to the recognition of phonetic units. The results of tone and phonetic recognition are then combined in a post-processing stage (Lee et al., 2002, Lin et al., 1996) or integrated back into a global search process (Seide and Wang, 2000, Cao et al., 2000).
In this paper, firstly, we propose a novel method of supra-tone modeling for Cantonese tone recognition; then we extend word level generalized posterior probability (Soong et al., 2004a) to Chinese character level; finally, we use tone-enhanced generalized character posterior probability (GCPP) as a rescoring metric for Cantonese LVCSR. Each supra-tone model characterizes the F0 contour of two or three tones in succession. The tone sequence of a continuous utterance is formed as an overlapped concatenation of supra-tone units. GCPP is computed in a restructured word graph by incorporating the supra-tone models. Two improved search approaches based on GCPP, either minimizing character error rate (CER) or maximizing GCPP product, will be presented.
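The overlapped concatenation of supra-tone units can be pictured with a small sketch. The function names and the use of integer tone labels are assumptions made for illustration; the sketch covers only the segmentation of a tone sequence into units, not the F0 modeling itself.

```python
def supra_tone_units(tones, span=2):
    """Split a tone sequence into overlapping units of `span` successive
    tones (span=2 for di-tone units, span=3 for tri-tone units)."""
    if span < 1:
        raise ValueError("span must be at least 1")
    return [tuple(tones[i:i + span]) for i in range(len(tones) - span + 1)]

def merge_units(units):
    """Rebuild the tone sequence from consecutive overlapping units."""
    if not units:
        return []
    tones = list(units[0])
    for unit in units[1:]:
        tones.append(unit[-1])   # each later unit adds one new tone
    return tones

# Hypothetical four-syllable utterance with tone labels from 1 to 6.
tones = [2, 5, 1, 6]
di_tone = supra_tone_units(tones, span=2)
tri_tone = supra_tone_units(tones, span=3)
```

Each tone other than the first and last appears in more than one unit, which is what lets neighbouring units constrain one another when the sequence is decoded.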
The rest of the paper is organized as follows. A brief introduction to the Cantonese dialect is given in Section 2. Cantonese tone modeling is introduced in Section 3. Tone-enhanced generalized character posterior probability (GCPP) and GCPP-based rescoring are described in Sections 4 and 5, respectively. In Section 6, experimental results are presented to demonstrate the effectiveness of the proposed method. Section 7 concludes the paper.
The Cantonese dialect
Cantonese, a popular Chinese dialect, is the mother tongue of tens of millions of people living in Southern China, Hong Kong and overseas. Like Mandarin (Putonghua), Cantonese is a monosyllabic and tonal language.
Cantonese tone modeling
The six tones of Cantonese can be roughly categorized as level tones or rising tones, according to the shapes of tone contours. This is unlike Mandarin, in which all four basic tones have distinctive contour shapes, namely, high-level, mid-rising, falling–rising and high-falling (Xu, 1997). Discrimination between the Cantonese tones relies more on the heights than on the shapes of the pitch contours. Bauer and Benedict (1997) pointed out that the height of a tone is not an absolute
Generalized character posterior probability (GCPP)
The generalized character posterior probabilities are estimated in restructured word graphs and enhanced by supra-tone models.
GCPP-based rescoring
GCPP provides a quantitative estimate for the correctness of recognized characters. It is more appropriate as a performance metric since the performance of Chinese LVCSR is usually measured by CER. Here, two improved search criteria based on GCPP are investigated.
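The difference between the two kinds of criterion can be sketched on a toy example. This is only an illustration under stated assumptions: each GCPP is read as the probability that its character is correct, the minimum-CER criterion is approximated by the expected number of wrong characters, and the candidates have the same length. The function names, candidate labels and posterior values are hypothetical, and the excerpt does not give the exact form of either criterion.

```python
import math

def log_gcpp_product(gcpps, floor=1e-12):
    """Log of the product of per-character posteriors; higher is better."""
    return sum(math.log(max(p, floor)) for p in gcpps)

def expected_char_errors(gcpps):
    """Expected number of wrong characters, reading each GCPP as P(correct)."""
    return sum(1.0 - p for p in gcpps)

def rescore(candidates, criterion="product"):
    """Pick the best (characters, gcpps) candidate under the criterion."""
    if criterion == "product":
        best = max(candidates, key=lambda c: log_gcpp_product(c[1]))
    elif criterion == "min_errors":
        best = min(candidates, key=lambda c: expected_char_errors(c[1]))
    else:
        raise ValueError(f"unknown criterion: {criterion}")
    return best[0]

# Two equal-length candidate character strings (placeholder labels)
# with assumed per-character posteriors.
candidates = [
    (("c1", "c2", "c3"), [0.9, 0.9, 0.2]),
    (("c1", "c4", "c5"), [0.6, 0.6, 0.6]),
]
by_product = rescore(candidates, "product")
by_errors = rescore(candidates, "min_errors")
```

The product criterion penalizes the single low-confidence character heavily and prefers the second string, while the expected-error criterion accepts one likely error in exchange for two confident characters and prefers the first.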
Speech database and baseline system
The speech corpus used in the experiments is CUSent™, a continuous Cantonese speech corpus collected at the DSP & Speech Technology Laboratory of the Chinese University of Hong Kong (CUHK) (CUCorpora: Cantonese Spoken Language Resources, 2001). Its contents are summarized in Table 1.
The baseline LVCSR system, named CURec, was also developed by the same research group at CUHK (Choi et al., 2000). It uses context-dependent syllable Initial/Final models. The acoustic feature vector
Conclusions
GCPP is proposed as a search metric for improving Cantonese LVCSR performance. For each hypothesized character, tone-enhanced GCPP is computed by incorporating the tone model score with the corresponding acoustic and language model scores in a restructured word graph, which not only contains more string hypotheses than a typical N-best list but can also recover some good but prematurely pruned string hypotheses. It is shown that our two GCPP-based rescoring approaches can reduce the CER of
References (43)
- et al. Minimum Bayes-risk automatic speech recognition. Comp. Speech Lang. (2000)
- et al. Frameworks for recognition of Mandarin syllables with tones using sub-syllabic units. J. Speech Commun. (1996)
- et al. Finding consensus in speech recognition: word error minimization and other applications of confusion networks. Comp. Speech Lang. (2000)
- et al. A word graph algorithm for large vocabulary continuous speech recognition. Comp. Speech Lang. (1997)
- et al. Tone recognition of continuous Cantonese speech based on support vector machines. J. Speech Commun. (2005)
- Contextual tonal variation on Mandarin. J. Phonetics (1997)
- et al. Modern Cantonese phonology (1997)
- Cao, Y., Deng, Y., Zhang, H., Huang, T., Xu, B., 2000. Decision-tree based Mandarin tone model and its application to...
- A system of tone letters. Le Maître Phonétique (1930)
- et al. Tone recognition of continuous Mandarin speech based on neural networks. IEEE Trans. Speech Audio Process. (1995)
- Studies in Yue Dialects 1: Phonology of Cantonese
- Linguistic Society of Hong Kong (LSHK)
- Spoken Language Processing: A Guide to Theory, Algorithm, and System Development
- Tone recognition of isolated Cantonese syllables. IEEE Trans. Speech Audio Process.