ISCA Archive Eurospeech 2003

Unit selection in concatenative TTS synthesis systems based on mel filter bank amplitudes and phonetic context

T. Lambert, Andrew P. Breen, Barry Eggleton, Stephen J. Cox, Ben P. Milner

In concatenative text-to-speech (TTS) synthesis systems, unit selection aims to reduce the number of concatenation points in the synthesized speech and to make the joins as smooth as possible. This research considers the synthesis of completely new utterances from non-uniform units, whereby the most appropriate units, according to acoustic and phonetic criteria, are selected from the many similar candidates in a speech database. A Viterbi-style algorithm dynamically selects the most suitable units from a large speech database by considering concatenation and target costs. Concatenation costs are derived from mel filter bank amplitudes, whereas target costs are defined in terms of the phonemic and phonetic properties of the required units. Within-subjects and between-subjects ANOVA [9] evaluation of listeners' scores showed that the TTS system with this method of unit selection was preferred in 52% of test sentences.
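The Viterbi-style search described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact cost formulas, so everything specific here is an assumption made for illustration: the concatenation cost is taken as the Euclidean distance between mel filter bank amplitude vectors at the join, the target cost as a count of mismatched phonetic context features, and the names `target_cost`, `concat_cost`, `select_units`, and the dictionary keys `context`, `mel_start`, `mel_end` are hypothetical.

```python
import math

def target_cost(required, candidate):
    # Assumed form: number of phonetic context features that do not match.
    return float(sum(r != c for r, c in zip(required, candidate["context"])))

def concat_cost(left, right):
    # Assumed form: Euclidean distance between the mel filter bank amplitudes
    # at the end of the left unit and at the start of the right unit.
    return math.dist(left["mel_end"], right["mel_start"])

def select_units(targets, candidates, w_target=1.0, w_concat=1.0):
    """Return (path, total_cost), where path[t] indexes into candidates[t].

    targets:    list of required context tuples, one per position
    candidates: list of candidate lists, one list per position
    """
    # Initialise with the target costs of the first position.
    best = [[w_target * target_cost(targets[0], c) for c in candidates[0]]]
    back = [[-1] * len(candidates[0])]

    # Forward pass: keep the cheapest predecessor for each candidate.
    for t in range(1, len(targets)):
        row, ptr = [], []
        for cand in candidates[t]:
            scores = [
                best[t - 1][j] + w_concat * concat_cost(prev, cand)
                for j, prev in enumerate(candidates[t - 1])
            ]
            j_best = min(range(len(scores)), key=scores.__getitem__)
            row.append(scores[j_best] + w_target * target_cost(targets[t], cand))
            ptr.append(j_best)
        best.append(row)
        back.append(ptr)

    # Backward pass: trace the lowest-cost path from the final position.
    k = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][k]
    path = [k]
    for t in range(len(targets) - 1, 0, -1):
        k = back[t][k]
        path.append(k)
    path.reverse()
    return path, total
```

The weights `w_target` and `w_concat` are likewise illustrative; a real system would tune the balance between the two costs, and would typically prune the candidate lists to keep the search tractable on a large database.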


doi: 10.21437/Eurospeech.2003-117

Cite as: Lambert, T., Breen, A.P., Eggleton, B., Cox, S.J., Milner, B.P. (2003) Unit selection in concatenative TTS synthesis systems based on mel filter bank amplitudes and phonetic context. Proc. 8th European Conference on Speech Communication and Technology (Eurospeech 2003), 273-276, doi: 10.21437/Eurospeech.2003-117

@inproceedings{lambert03_eurospeech,
  author={T. Lambert and Andrew P. Breen and Barry Eggleton and Stephen J. Cox and Ben P. Milner},
  title={{Unit selection in concatenative TTS synthesis systems based on mel filter bank amplitudes and phonetic context}},
  year=2003,
  booktitle={Proc. 8th European Conference on Speech Communication and Technology (Eurospeech 2003)},
  pages={273--276},
  doi={10.21437/Eurospeech.2003-117}
}