This paper presents an automatic phrase boundary labeling method for speech synthesis database annotation using context-dependent hidden Markov models (CD-HMMs) and n-gram prior distributions. At the training stage, CD-HMMs are built to describe the conditional distribution of acoustic features given the phonetic labels and phrase boundaries. In addition, n-gram models are estimated to represent the prior distributions of the phrase boundaries to be predicted. At the decoding stage, the CD-HMMs and n-gram models are combined to predict the phrase boundaries by Viterbi decoding under the maximum a posteriori (MAP) criterion. In our experiments, compared with the method using only acoustic models, the proposed method utilizing context-dependent bigram prior distributions improved the F-score of phrase boundary labeling from 72.2% to 79.6% on the Boston University Radio News Corpus (BURNC) and from 69.6% to 81.0% on the Blizzard Challenge 2007 database.
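The decoding step described above can be illustrated with a minimal sketch. It is not the authors' implementation and makes several simplifying assumptions: each word juncture is reduced to one precomputed acoustic log-likelihood per boundary label (standing in for the CD-HMM scores), the prior is a plain bigram over boundary labels rather than the context-dependent bigram used in the paper, and the names `viterbi_map`, `acoustic_loglik`, `log_prior_init`, and `log_prior_trans` are hypothetical.

```python
import math


def viterbi_map(acoustic_loglik, log_prior_init, log_prior_trans):
    """Return the MAP boundary-label sequence under a bigram prior.

    acoustic_loglik[t][b]: log p(o_t | b) at juncture t (assumed precomputed).
    log_prior_init[b]:     log P(b_0 = b).
    log_prior_trans[a][b]: log P(b_t = b | b_{t-1} = a).
    """
    if not acoustic_loglik:
        return []
    n_labels = len(log_prior_init)
    # delta[b]: best log score of any label path that ends in label b.
    delta = [log_prior_init[b] + acoustic_loglik[0][b] for b in range(n_labels)]
    backptrs = []
    for t in range(1, len(acoustic_loglik)):
        new_delta, ptr = [], []
        for b in range(n_labels):
            # Best predecessor label for ending in b at juncture t.
            best_prev = max(range(n_labels),
                            key=lambda a: delta[a] + log_prior_trans[a][b])
            ptr.append(best_prev)
            new_delta.append(delta[best_prev]
                             + log_prior_trans[best_prev][b]
                             + acoustic_loglik[t][b])
        delta = new_delta
        backptrs.append(ptr)
    # Backtrace from the best final label.
    path = [max(range(n_labels), key=lambda b: delta[b])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Toy example (made-up numbers): label 0 = no boundary, label 1 = boundary.
log = math.log
acoustic = [[log(0.2), log(0.8)], [log(0.45), log(0.55)], [log(0.7), log(0.3)]]
init = [log(0.7), log(0.3)]
trans = [[log(0.7), log(0.3)], [log(0.9), log(0.1)]]  # back-to-back boundaries are unlikely
uniform = [[log(0.5), log(0.5)], [log(0.5), log(0.5)]]

with_prior = viterbi_map(acoustic, init, trans)
acoustic_only = viterbi_map(acoustic, [log(0.5), log(0.5)], uniform)
```

In this toy example the acoustic scores weakly favor a boundary at the second juncture, but the bigram prior penalizes two consecutive boundaries, so the decoded sequence changes there. With a uniform prior the decoder reduces to picking the acoustically best label at each juncture.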
Cite as: Chen, Q., Ling, Z.-H., Yang, C.-Y., Dai, L.-R. (2015) Automatic phrase boundary labeling of speech synthesis database using context-dependent HMMs and n-gram prior distributions. Proc. Interspeech 2015, 1581-1585, doi: 10.21437/Interspeech.2015-367
@inproceedings{chen15i_interspeech,
  author={Qian Chen and Zhen-Hua Ling and Chen-Yu Yang and Li-Rong Dai},
  title={{Automatic phrase boundary labeling of speech synthesis database using context-dependent HMMs and n-gram prior distributions}},
  year=2015,
  booktitle={Proc. Interspeech 2015},
  pages={1581--1585},
  doi={10.21437/Interspeech.2015-367}
}