The First Vietnamese FOSD-Tacotron-2-based Text-to-Speech Model Dataset

Recent voicebot applications make use of both speech-to-text and text-to-speech (TTS) generation techniques. Generating a voice response to a given utterance requires a TTS engine. Recently developed TTS engines are shifting towards end-to-end approaches built on models such as Tacotron, Tacotron-2, WaveNet, and WaveGlow. These approaches let a TTS service provider focus on developing training and validation datasets of labelled texts and recorded speech, instead of designing an entirely new model that outperforms the others, which is time-consuming and costly. In this context, this work introduces the first Vietnamese FPT Open Speech Data (FOSD)-Tacotron-2-based TTS model dataset. The dataset comprises a configuration file in *.json format; training and validation text input files (in *.csv format); a 225,000-step checkpoint of the trained model; and several sample generated audio files. The published dataset can serve as a baseline for benchmarking newly developed TTS models and engines. In addition, it opens a new TTS research optimization problem: how to effectively generate speech from text given a black-box trained TTS model and its training and validation input texts.


Specifications
Subject
Computer Science

Specific subject area
Artificial Intelligence; Human-Computer Interaction; Information Systems

Type of data
Trained model checkpoint (up to 225,000 steps) plus input training and validation datasets.

How data were acquired
The model was trained using the Mozilla TTS repository available at [1] and a subset (comprising 23,000 training sentences and 1,900 validation sentences) of the more than 25,000 sentences in the FPT Open Speech Data available at [2].
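The source does not say how the 23,000 training and 1,900 validation sentences were selected from the full corpus. As a minimal sketch, assuming a seeded random shuffle (an assumption, not the authors' documented procedure), a disjoint split of that size could be produced as follows. The function name `split_sentences` is hypothetical.

```python
import random


def split_sentences(rows, n_train=23_000, n_val=1_900, seed=0):
    """Shuffle rows and carve out disjoint training and validation subsets.

    The sizes default to the counts reported for this dataset; the random
    selection itself is an assumption made for illustration.
    """
    if n_train + n_val > len(rows):
        raise ValueError("not enough rows for the requested split")
    shuffled = list(rows)                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)      # seeded, so the split is repeatable
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    return train, val
```

Rows left over after both subsets are filled are simply unused, which matches the description of a subset drawn from more than 25,000 sentences.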

Description of data collection
This is the first FPT Open Speech Data (FOSD) and Tacotron-2-based Text-to-Speech Model Dataset for Vietnamese.

Value of the Data
• These data are useful for benchmarking against other Vietnamese TTS models or engines.
• Since the input texts for training and validation are provided, the data open a new research optimization problem: how to effectively generate speech from text given a black-box trained TTS model and its training and validation input texts.
• These data are useful for research related to natural language processing, natural language generation, and Vietnamese TTS applications, especially those using Artificial Intelligence and Machine Learning techniques such as Tacotron, Tacotron-2, WaveGlow, and WaveNet.
• Those who benefit from these data include, but are not limited to, researchers, research scientists, students and hobbyists in the aforementioned areas, companies working on Vietnamese TTS, and automatic call centres.

Data Description
This is the first FPT Open Speech Data (FOSD) and Tacotron-2-based Text-to-Speech Model Dataset for Vietnamese. It comprises:
• A configuration file in *.json format;
• Training and validation text input files (in *.csv format);
• A trained model (checkpoint file, after 225,000 steps);
• Sample generated audio files.
This dataset is useful for research related to TTS and its applications, text processing and especially TTS output optimization given a set of predefined input texts.
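A minimal sketch of how the configuration and text input files might be inspected with the Python standard library. The column layout and the `|` delimiter are assumptions (the source only states the *.json and *.csv formats), and the helper names are hypothetical. If the configuration file contains comments, they would have to be stripped before `json.load` can parse it.

```python
import csv
import json


def load_config(path):
    """Parse a configuration file, assuming it is plain JSON without comments."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def load_text_rows(path, delimiter="|"):
    """Read the rows of a delimited text input file, skipping empty lines.

    The delimiter is an assumption; pass "," if the files are comma-separated.
    Quote handling is disabled so quotation marks inside sentences are kept.
    """
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        return [row for row in reader if row]
```

Counting the rows returned for the training and validation files is a quick way to check them against the reported 23,000 and 1,900 sentences.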

Experimental Design, Materials, and Methods
The following describes the experimental design, materials, and methods.
• Step 5: Generate audio files using the trained model with the Vietnamese texts given in the file de_sentences.txt.
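The batch generation in Step 5 can be sketched as a loop over the sentence file. This is an illustration only: `synthesize` is a hypothetical callable standing in for the trained model (it is not part of any named library), assumed here to return 16-bit mono PCM bytes, and the 22,050 Hz sample rate and output file names are likewise assumptions.

```python
import wave
from pathlib import Path


def read_sentences(path):
    """Return the non-empty lines of a UTF-8 text file, one sentence per line."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def write_wav(path, pcm_bytes, sample_rate):
    """Write 16-bit mono PCM bytes to a WAV file."""
    with wave.open(str(path), "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)          # 2 bytes per sample = 16-bit audio
        w.setframerate(sample_rate)
        w.writeframes(pcm_bytes)


def generate_all(sentences_path, out_dir, synthesize, sample_rate=22050):
    """Synthesize every sentence in the file and save one WAV file per sentence."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for i, text in enumerate(read_sentences(sentences_path), start=1):
        pcm = synthesize(text)     # hypothetical model wrapper supplied by the caller
        target = out_dir / f"sample_{i:03d}.wav"
        write_wav(target, pcm, sample_rate)
        outputs.append(target)
    return outputs
```

Passing the model in as a callable keeps the loop independent of how the checkpoint is loaded, so the same code can be reused when benchmarking a different TTS engine on the same sentences.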

Declaration of Competing Interest
Copyright 2018 FPT Corporation Permission is hereby granted, free of charge, non-exclusive, worldwide, irrevocable, to any person obtaining a copy of this data or software and associated documentation files (the "Data or Software"), to deal in the Data or Software without restriction, including without limitation the rights to use, copy, modify, remix, transform, merge, build upon, publish, distribute and redistribute, sublicense, and/or sell copies of the Data or Software, for any purpose, even commercially, and to permit persons to whom the Data or Software is furnished to do so, subject to the following conditions: The above copyright notice, and this per-