Adversarial Learning of Intermediate Acoustic Feature for End-to-End Lightweight Text-to-Speech

Yoon, Hyungchan; Um, Seyun; Kim, Changwhan; Kang, Hong-Goo

doi:10.21437/Interspeech.2023-1571

arXiv:2204.02172 (cs)

[Submitted on 5 Apr 2022 (v1), last revised 28 Aug 2023 (this version, v2)]

Authors:Hyungchan Yoon, Seyun Um, Changwhan Kim, Hong-Goo Kang

Abstract: To simplify the generation process, several text-to-speech (TTS) systems implicitly learn intermediate latent representations instead of relying on predefined features (e.g., mel-spectrograms). However, their generation quality is unsatisfactory because these representations lack speech variances. In this paper, we improve TTS performance by adding prosody embeddings to the latent representations. During training, we extract reference prosody embeddings from mel-spectrograms; during inference, we estimate these embeddings from text using generative adversarial networks (GANs). The prosody embeddings have complex distributions due to the dynamic nature of speech, and GANs let us estimate them quickly and reliably. We also show that the prosody embeddings work as efficient features for learning a robust alignment between text and acoustic features. In comparative experiments, our proposed model surpasses several publicly available models while using fewer parameters and less computational complexity.
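The training/inference split described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say which adversarial objective is used, so a least-squares GAN objective is assumed here purely for illustration, and every function name below is hypothetical.

```python
# Illustrative sketch only. The least-squares GAN objective and all names
# here are assumptions; the abstract does not specify these details.

def select_prosody_embedding(training, reference_embedding, predicted_embedding):
    """Pick the prosody embedding that is added to the latent representation.

    Training: the reference embedding extracted from the mel-spectrogram.
    Inference: the embedding estimated from text by the GAN generator.
    """
    return reference_embedding if training else predicted_embedding


def lsgan_discriminator_loss(real_scores, fake_scores):
    """Assumed least-squares loss: push scores of reference embeddings
    toward 1 and scores of text-predicted embeddings toward 0."""
    real_term = sum((s - 1.0) ** 2 for s in real_scores) / len(real_scores)
    fake_term = sum(s ** 2 for s in fake_scores) / len(fake_scores)
    return 0.5 * (real_term + fake_term)


def lsgan_generator_loss(fake_scores):
    """Assumed least-squares loss for the text-side predictor: push the
    discriminator's scores for predicted embeddings toward 1."""
    return 0.5 * sum((s - 1.0) ** 2 for s in fake_scores) / len(fake_scores)
```

In this sketch the discriminator loss is zero only when reference embeddings score 1 and predicted embeddings score 0, while the generator loss is zero only when predicted embeddings score 1, which is the opposing pressure that adversarial training relies on.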

Comments:	INTERSPEECH 2023
Subjects:	Sound (cs.SD); Audio and Speech Processing (eess.AS)
MSC classes:	68T07 (Primary) 68T50, 68T99 (Secondary)
ACM classes:	I.2.7; I.2.6
Cite as:	arXiv:2204.02172 [cs.SD]
	(or arXiv:2204.02172v2 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2204.02172
Related DOI:	https://doi.org/10.21437/Interspeech.2023-1571

Submission history

From: Hyungchan Yoon
[v1] Tue, 5 Apr 2022 12:58:47 UTC (1,734 KB)
[v2] Mon, 28 Aug 2023 13:45:26 UTC (492 KB)
