Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction

Kim, Minchan; Jeong, Myeonghun; Choi, Byoung Jin; Kim, Semin; Lee, Joun Yeop; Kim, Nam Soo

Electrical Engineering and Systems Science > Audio and Speech Processing

arXiv:2401.01498 (eess)

[Submitted on 3 Jan 2024]

Abstract: We propose a novel text-to-speech (TTS) framework centered around a neural transducer. Our approach divides the TTS pipeline into two stages, semantic-level sequence-to-sequence (seq2seq) modeling and fine-grained acoustic modeling, connected by discrete semantic tokens obtained from wav2vec2.0 embeddings. For robust and efficient alignment modeling, we employ a neural transducer, named the token transducer, for semantic token prediction, benefiting from its hard monotonic alignment constraints. Subsequently, a non-autoregressive (NAR) speech generator efficiently synthesizes waveforms from these semantic tokens. Additionally, reference speech controls the temporal dynamics and acoustic conditions at each stage. This decoupled framework reduces the training complexity of TTS while allowing the first stage to focus on semantic modeling and the second on acoustic modeling. Our experimental results on zero-shot adaptive TTS demonstrate that our model surpasses the baseline in speech quality and speaker similarity, both objectively and subjectively. We also examine the inference speed and prosody control capabilities of our approach, highlighting the potential of neural transducers in TTS frameworks.
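
The hard monotonic alignment mentioned in the abstract can be illustrated with a toy greedy transducer decoding loop. This is only a sketch of the general transducer idea, not the authors' token transducer: the names `greedy_transducer_decode`, `make_toy_joint`, and `BLANK`, the per-unit emission cap, and the stand-in token ids are all assumptions made for illustration. A real joint network would score symbols from encoder and prediction-network states; here a hand-written function plays that role.

```python
from typing import Callable, List

BLANK = -1  # hypothetical id for the transducer's blank symbol


def greedy_transducer_decode(
    num_text_units: int,
    joint: Callable[[int, List[int]], int],
    max_tokens_per_unit: int = 10,
) -> List[List[int]]:
    """Group the emitted semantic tokens by the text position they align to.

    The text index t only ever moves forward, and it moves exactly when the
    joint function returns BLANK. That is the hard monotonic constraint.
    """
    alignment: List[List[int]] = []
    history: List[int] = []  # all tokens emitted so far (prediction context)
    for t in range(num_text_units):
        emitted: List[int] = []
        while len(emitted) < max_tokens_per_unit:  # safety cap (assumption)
            symbol = joint(t, history)
            if symbol == BLANK:
                break  # advance to the next text unit
            emitted.append(symbol)
            history.append(symbol)
        alignment.append(emitted)
    return alignment


def make_toy_joint(durations: List[int]) -> Callable[[int, List[int]], int]:
    """Build a stand-in joint function that emits durations[t] tokens at step t."""
    def joint(t: int, history: List[int]) -> int:
        token = 100 + t  # stand-in semantic token id, unique to position t
        if history.count(token) < durations[t]:
            return token
        return BLANK
    return joint


if __name__ == "__main__":
    alignment = greedy_transducer_decode(3, make_toy_joint([1, 2, 1]))
    semantic_tokens = [tok for group in alignment for tok in group]
    print(alignment)
    print(semantic_tokens)
```

In a two-stage setup like the one described, the flattened `semantic_tokens` sequence would be what the second-stage speech generator consumes; that stage is not sketched here.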

Comments:	This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
Subjects:	Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD)
Cite as:	arXiv:2401.01498 [eess.AS]
	(or arXiv:2401.01498v1 [eess.AS] for this version)
	https://doi.org/10.48550/arXiv.2401.01498

Submission history

From: Minchan Kim
[v1] Wed, 3 Jan 2024 02:03:36 UTC (3,443 KB)
