TTS-Portuguese Corpus: a corpus for speech synthesis in Brazilian Portuguese

Speech provides a natural way for human-computer interaction. In particular, speech synthesis systems are popular in applications such as personal assistants, GPS navigation, screen readers and accessibility tools. However, not all languages are at the same level in terms of resources and systems for speech synthesis. This work creates publicly available resources for Brazilian Portuguese in the form of a novel dataset along with deep learning models for end-to-end speech synthesis. The dataset contains 10.5 hours of speech from a single speaker, on which a Tacotron 2 model with the RTISI-LA vocoder presented the best performance, achieving a 4.03 MOS value. The obtained results are comparable to related works covering the English language and to the state of the art in Portuguese.


Introduction
Speech synthesis systems have received a lot of attention in recent years due to the great advance provided by the use of deep learning, which allowed the popularization of virtual assistants, such as Amazon Alexa [1], Google Home [2] and Apple Siri [3].
According to [4], traditional speech synthesis systems are not easy to develop, because they are typically composed of many specific modules, such as a text analyzer, a grapheme-to-phoneme converter, a duration estimator and an acoustic model. In summary, given an input text, the text analyzer converts dates, currency symbols, abbreviations, acronyms and numbers into the standard forms in which they should be pronounced or read by the system, i.e. it carries out text normalization and tackles problems like homographs; then, with the normalized text, the grapheme-to-phoneme converter maps graphemes to phonemes. In turn, the duration estimator estimates the duration of each phoneme. Finally, the acoustic model receives the phoneme sequence, the prosodic information about phoneme segment lengths and the F0 contour, and computes the speech signal [5,6]. Several acoustic models have been proposed, such as the classical formant model [7], the Linear Prediction Coefficients (LPC) model, and the Pitch Synchronous Overlap and Add (PSOLA) models [8] widely used in TTS engines like the Microsoft Speech API. In addition, Hidden Markov Model (HMM) based synthesis is still a topic of research [9,10,11], as well as a variety of unit selection models [12,13].
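The text-analyzer stage described above can be illustrated with a minimal, hypothetical normalizer; the abbreviation and digit tables below are illustrative only and do not come from any real system:

```python
import re

# Hypothetical, simplified text-analyzer stage: expand a few
# abbreviations and spell out digits before grapheme-to-phoneme
# conversion. Real normalizers verbalize full numbers, dates,
# currency values and resolve homographs.
ABBREVIATIONS = {"Dr.": "Doutor", "Sra.": "Senhora"}
DIGITS = {"0": "zero", "1": "um", "2": "dois", "3": "três",
          "4": "quatro", "5": "cinco", "6": "seis",
          "7": "sete", "8": "oito", "9": "nove"}

def normalize(text: str) -> str:
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Naively spell out each digit individually.
    text = re.sub(r"\d", lambda m: " " + DIGITS[m.group()] + " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Dr. Silva chegou às 9 horas."))
# → Doutor Silva chegou às nove horas.
```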
Deep Learning [14] makes it possible to integrate all processing steps into a single model and connect them directly from the input text to the synthesized audio output, which is referred to as end-to-end learning. While neural models are sometimes criticized as difficult to interpret, several end-to-end trained speech synthesis systems [15,16,17,4,18,19,20] were shown to be able to estimate spectrograms from text inputs with promising performance. Due to the sequential nature of text and audio data, recurrent units were the standard building blocks for speech synthesis, as in Tacotron 1 and 2 [16,17]. In addition, convolutional layers showed good performance while reducing computational costs, as implemented in the Deep Voice 3 and Deep Convolutional Text To Speech (DCTTS) methods [4,18].
Models based on deep learning require a large amount of data for training; therefore, languages with few available resources are at a disadvantage. For this reason, most current TTS models are designed for the English language [17,18,20,19], which has many open resources. In this work we propose to address this problem for the Portuguese language. Although there are some public synthesis datasets for European Portuguese [21], their small amount of speech, approximately 100 minutes, makes the training of models based on deep learning unfeasible. In addition, simultaneously with this work, two good-quality datasets for automatic speech recognition in Portuguese were released. The CETUC [22] dataset, made publicly available by [23], has approximately 145 hours of speech from 100 speakers. In this dataset, each speaker uttered a thousand phonetically balanced sentences extracted from journalistic texts; on average each speaker spoke for 1.45 hours. The Multilingual LibriSpeech (MLS) [24] dataset is derived from LibriVox audiobooks and consists of speech in 8 languages, including Portuguese. For Portuguese, the authors provided approximately 130 hours from 54 speakers, an average of 2.40 hours of speech per speaker. Although the quality of both datasets is good, both were made available at a sampling rate of 16 kHz and have no punctuation in their texts, making them difficult to apply to speech synthesis. In addition, the amount of speech per speaker in both datasets is low, making it difficult to derive a single-speaker dataset with a large vocabulary for single-speaker speech synthesis. For example, the LJ Speech [25] dataset, which is derived from audiobooks and is one of the most popular open datasets for single-speaker speech synthesis in English, has approximately 24 hours of speech.
In this article, we compare TTS models available in the literature for a language with few open resources for speech synthesis. The experiments were carried out in Brazilian Portuguese and focused on single-speaker TTS. For this, we created a new public dataset comprising 10.5 hours of speech. Our contributions are twofold: (i) a new publicly available dataset with more than 10 hours of speech recorded by a native speaker of Brazilian Portuguese; (ii) an experimental analysis comparing two publicly available TTS models in the Portuguese language. In addition, our results and discussions shed light on the matter of training end-to-end methods for a non-English language, in particular Portuguese, and we make available the first public dataset and trained model for this language.
This work is organized as follows. Section 2 presents related work on speech synthesis. Section 3 describes our novel audio dataset. Section 4 details the models and experiments performed. Section 5 compares and discusses the results. Finally, Section 6 presents conclusions of this work and future work.

Speech Synthesis Approaches
With the advent of deep learning, speech synthesis systems have evolved greatly and are still being intensively studied. Models based on recurrent neural networks such as Tacotron [16], Tacotron 2 [17], Deep Voice 1 [26] and Deep Voice 2 [27] have gained prominence, but since these models use recurrent layers they have high computational costs. This has led to the development of fully convolutional models such as DCTTS [4] and Deep Voice 3 [18], which sought to reduce computational cost while maintaining good synthesis quality. [18] proposed a fully convolutional model for speech synthesis and compared three different vocoders: Griffin-Lim [28], the WORLD vocoder [29] and WaveNet [30]. Their results indicated that the WaveNet neural vocoder produced the most natural waveforms; however, WORLD was recommended due to its better runtime, even though WaveNet had better quality. The authors further compared the proposed model (Deep Voice 3) with the Tacotron [16] and Deep Voice 2 [27] models. [4] proposed the DCTTS model, a fully convolutional model consisting of two neural networks. The first, called Text2Mel (text to Mel spectrogram), aims to generate a Mel spectrogram from an input text; the second, the Spectrogram Super-resolution Network (SSRN), converts a Mel spectrogram to the STFT (Short-Time Fourier Transform) spectrogram [31]. DCTTS consists only of convolutional layers and uses dilated convolutions [32,33] to take long-range contextual information into account. DCTTS uses the RTISI-LA (Real-Time Iterative Spectrogram Inversion with Look-Ahead) vocoder [34], an adaptation of the Griffin-Lim vocoder [28] that increases synthesis speed by slightly sacrificing the quality of the generated audio.
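The phase-reconstruction idea behind Griffin-Lim (and, by extension, RTISI-LA) can be sketched in a few lines; this is a simplified illustration using SciPy's STFT, not the implementation used in the works above:

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(magnitude, n_iter=32, nperseg=512):
    """Minimal Griffin-Lim sketch: recover a waveform from an
    STFT magnitude by iteratively re-estimating the phase."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        # Inverse STFT with the current phase estimate...
        _, signal = istft(magnitude * phase, nperseg=nperseg)
        # ...then keep only the phase of the re-analyzed signal,
        # discarding its magnitude in favor of the target one.
        _, _, spec = stft(signal, nperseg=nperseg)
        phase = np.exp(1j * np.angle(spec))
    _, signal = istft(magnitude * phase, nperseg=nperseg)
    return signal
```

RTISI-LA speeds this process up by reconstructing frame by frame with a small look-ahead buffer instead of iterating over the whole spectrogram.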
Tacotron 1 [16] proposes the use of a single deep neural network trained end-to-end. The model includes an encoder and a decoder, uses an attention mechanism [35], and also includes a post-processing module. It employs convolutional filters, skip connections [36] and Gated Recurrent Unit (GRU) [37] cells. Tacotron also uses the Griffin-Lim [28] algorithm to convert the STFT spectrogram to the waveform.
Tacotron 2 [17] combines Tacotron 1 with a modified WaveNet vocoder [38]. Tacotron 2 is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to Mel spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. The authors also demonstrated that using Mel spectrograms as the conditioning input to WaveNet, instead of linguistic features, allows a significant reduction in the size of the WaveNet architecture.
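The Mel spectrogram used as WaveNet's conditioning input is simply the power spectrogram projected through a bank of triangular Mel-spaced filters. A minimal sketch of such a filterbank follows (standard HTK-style Mel formula; the parameter values are illustrative, not those of any particular implementation):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=80, n_fft=1024, sr=22050):
    """Triangular filters mapping |STFT|^2 bins to Mel bands.
    Multiplying this (n_mels, n_fft//2+1) matrix by a power
    spectrogram yields the Mel spectrogram."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):      # rising slope
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):     # falling slope
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb
```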

TTS-Portuguese Corpus
Portuguese is a language with few publicly available resources for speech synthesis. For Brazilian Portuguese, to the best of our knowledge there is no public dataset with a large amount of good-quality speech available for speech synthesis. Although there are some public speech datasets for European Portuguese, for example [21], that work has a small amount of speech, approximately 100 minutes, which is normally not enough for training deep learning models. On the other hand, [39] explored the training of a deep-learning-based model with an in-house dataset, called SSF1, which has approximately 14 hours of speech in European Portuguese. Therefore, given the lack of an open dataset with a large amount of good-quality speech for speech synthesis in Brazilian Portuguese, we propose the TTS-Portuguese Corpus.
To create the TTS-Portuguese Corpus, public domain texts were used. Initially, seeking to reach a large vocabulary, we extracted articles from the Highlights sections of Wikipedia for all knowledge areas. After this extraction, we separated the articles into sentences (considering textual punctuation) and randomly selected sentences from this corpus during the recording. In addition, we used 20 sets of phonetically balanced sentences, each set containing 10 sentences, proposed by [40]. Finally, in order to increase the number of questions and introduce more expressive speech, we extracted sentences from Chatterbot-corpus1, a corpus originally created for the construction of chatbots. In this way, we sought both to cover a large vocabulary, bringing in words from different areas, and to obtain a more expressive speech representation through the questions and answers of a chatbot dataset.
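The sentence-separation step can be sketched with a simple punctuation-based splitter; this is a simplified illustration that ignores abbreviations and other edge cases a real pipeline must handle:

```python
import re

# Hypothetical sketch of sentence splitting: break extracted
# article text on ., ! and ?, keeping the punctuation mark
# since it carries prosodic information for TTS.
def split_sentences(text: str) -> list:
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(split_sentences("O Brasil é grande. Qual a capital? É Brasília!"))
# → ['O Brasil é grande.', 'Qual a capital?', 'É Brasília!']
```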
The recording was made by a male native speaker of Brazilian Portuguese, a non-professional, in a quiet environment but without acoustic isolation, due to difficulties in accessing studios. All audio was recorded at a sampling frequency of 48 kHz with 32-bit resolution.
In the dataset, each audio file has its respective textual transcription (phonetic transcription is not provided). The final dataset consists of a total of 71,358 words spoken by the speaker, 13,311 of which are unique, resulting in 3,632 audio files and totaling 10 hours and 28 minutes of speech. Audio files range in length from 0.67 to 50.08 seconds. Figure 1 shows two histograms regarding the number of words and the duration of each file.
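Statistics such as those above can be derived directly from the transcription files. A hypothetical sketch, assuming a list of transcripts and per-file durations in seconds (the two example utterances are made up):

```python
# Illustrative inputs: one transcript and one duration per audio file.
transcripts = ["Bom dia, tudo bem?", "Tudo bem, bom dia!"]
durations = [1.8, 2.1]

# Tokenize, strip punctuation and lowercase before counting.
words = [w.strip(",.!?").lower() for t in transcripts for w in t.split()]
stats = {
    "total_words": len(words),
    "unique_words": len(set(words)),
    "total_hours": sum(durations) / 3600.0,
}
print(stats)
```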
To compare the TTS-Portuguese Corpus with datasets used in the literature for speech synthesis, we chose the LJ Speech [25] dataset, one of the most widely used publicly available datasets for training single-speaker models in the English language. Additionally, we present the statistics for the SSF1 dataset, a corpus of European Portuguese explored in the work of [39]. Table 1 shows the language, duration, sampling rate and percentage of interrogative, exclamatory and declarative sentences in the LJ Speech, SSF1 and TTS-Portuguese datasets.
The TTS-Portuguese Corpus has fewer hours of speech than the others: 14 hours less than LJ Speech and 4 hours less than SSF1.
The sampling rate of 22 kHz is widely used in the training of TTS models based on deep learning. However, some works like [17] use a sampling rate of 24 kHz. In addition, [41] showed that it is possible to obtain a 44 kHz TTS model by training the NU-GAN model on a dataset sampled at 44 kHz.
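Downsampling the 48 kHz recordings to a 22.05 kHz training rate is a standard polyphase resampling step. A sketch using SciPy (the rational factor 147/320 follows from 22050/48000; the sine signal is a stand-in for real audio):

```python
import numpy as np
from scipy.signal import resample_poly

# Downsample from 48 kHz to 22.05 kHz: 22050/48000 = 147/320.
sr_in, sr_out = 48000, 22050
t = np.arange(sr_in) / sr_in                 # 1 second of audio
audio_48k = np.sin(2 * np.pi * 440.0 * t)    # placeholder signal
audio_22k = resample_poly(audio_48k, up=147, down=320)
print(len(audio_22k))   # → 22050
```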

Experiments
To evaluate the quality of the TTS-Portuguese Corpus in practice, we explored speech synthesis using prominent models from the literature: DCTTS [4] and Tacotron 2 [17].
Here, we compare the DCTTS and Tacotron 2 models. To keep our results reproducible, we used open-source implementations and tried to replicate related works as faithfully as possible. In cases where hyper-parameters were not specified, we optimized them empirically for our dataset. We used the following implementations: DCTTS provided by [42] and Tacotron 2 provided by [43].
For all experiments, to speed up training, we initialized the model using the weights of a model pre-trained on English with the LJ Speech dataset. We also used RTISI-LA [34] as the vocoder, which is a variation of the Griffin-Lim [28] vocoder.
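A common way to implement this kind of warm start, sketched here with plain dictionaries of arrays (the tensor names are hypothetical; this is not the exact procedure of either implementation), is to copy only the pretrained tensors whose names and shapes match, leaving mismatched ones, such as a character embedding over a different symbol set, at their fresh initialization:

```python
import numpy as np

def warm_start(new_state, pretrained_state):
    """Copy pretrained tensors into new_state when both the
    name and the shape match; return the merged state and the
    list of copied tensor names."""
    copied = []
    for name, tensor in pretrained_state.items():
        if name in new_state and new_state[name].shape == tensor.shape:
            new_state[name] = tensor
            copied.append(name)
    return new_state, copied

# Hypothetical example: the encoder weights transfer, but the
# character embedding does not (different symbol-set size).
new = {"encoder.w": np.zeros((4, 4)), "embedding.w": np.zeros((100, 8))}
pre = {"encoder.w": np.ones((4, 4)), "embedding.w": np.zeros((60, 8))}
merged, copied = warm_start(new, pre)
print(copied)   # → ['encoder.w']
```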
Although the acquisition avoided external noise as much as possible, the audio files were not recorded in a studio setting; therefore, some noise may be present in part of the files. To reduce interference with our analysis, we applied RNNoise [44] to all audio files. RNNoise is based on recurrent neural networks, more specifically Gated Recurrent Units [45], and has demonstrated good performance for noise suppression.
We report two experiments:
• Experiment 1: this experiment explores the DCTTS [4] model, for which we use the implementation provided by [42]. As reported in the DCTTS article, the model receives the text directly as input, so no phonetic transcription is used. As previously mentioned, the original DCTTS paper does not describe any normalization, so for the model to converge we tested different normalization options and decided to use, in all layers, 5% dropout and layer normalization [46]. We did not use a fixed learning rate as described in the original article; instead, we used a starting learning rate of 0.001, decayed using Noam's learning rate decay scheme [47].
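Noam's scheme [47] warms the learning rate up linearly and then decays it with the inverse square root of the step. A minimal sketch follows; d_model and warmup_steps are illustrative values, not necessarily those used in our training:

```python
# Noam learning-rate schedule: lr rises linearly for
# warmup_steps, peaks, then decays proportionally to
# 1/sqrt(step).
def noam_lr(step, d_model=256, warmup_steps=4000, scale=1.0):
    step = max(step, 1)
    return scale * d_model ** -0.5 * min(step ** -0.5,
                                         step * warmup_steps ** -1.5)
```

The peak occurs exactly at `step == warmup_steps`, where the two branches of the `min` coincide.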
• Experiment 2: this experiment explores the Tacotron 2 [17] model, for which we use the Mozilla TTS implementation [43]. This model receives a phonetic transcription as input instead of raw text. To perform the phonetic transcription we use the Phonemizer4 library, which supports 121 languages and accents.
In experiment 1, the two parts of the model are trained separately. The first part, called Text2Mel, is responsible for generating a Mel spectrogram from the input text; it was trained using a composition of the following loss functions: binary cross-entropy, L1 [14] and guided attention loss [4]. The second part, called SSRN, is responsible for transforming a Mel spectrogram into the complete STFT spectrogram, applying super-resolution in the process; its loss function is composed of the L1 and binary cross-entropy functions.
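The guided attention loss [4] penalizes attention mass that falls far from the near-diagonal text-audio alignment. A sketch of the weight matrix and loss (g is the sharpness hyper-parameter; 0.2 follows the DCTTS paper):

```python
import numpy as np

def guided_attention_loss(A, g=0.2):
    """A is the (N, T) attention matrix over N text positions
    and T spectrogram frames. W is ~0 near the diagonal
    n/N == t/T and grows toward 1 away from it, so off-diagonal
    attention is penalized."""
    N, T = A.shape
    n = np.arange(N)[:, None] / N
    t = np.arange(T)[None, :] / T
    W = 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g ** 2))
    return float(np.mean(A * W))
```

A perfectly diagonal alignment incurs zero loss, while an anti-diagonal one is heavily penalized, which encourages monotonic alignments early in training.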
In experiment 2, no guided attention is used; therefore, the loss function does not include the attention cost. Since the network is trained end-to-end, the loss depends on the output of two network modules. The first module converts text into a Mel spectrogram. The second module is an SSRN-like module called CBHG (1-D Convolution Bank Highway Network Bidirectional Gated Recurrent Unit). Table 2 shows the hardware specifications of the equipment used for model training. Experiment 1 was trained on computer 2, while experiment 2 was performed using computer 1. Table 3 presents the training data for the experiments. The metrics presented in the table are the number of training steps and the time required for training. It is important to note that experiment 1 is trained in two phases, both reported in the table: Text2Mel and SSRN.

Results and Discussion
To compare and analyze our results we used the Mean Opinion Score (MOS), calculated following the work of [48]. To calculate the MOS, evaluators were asked to assess the naturalness of generated sentences on a five-point scale (1: Bad, 2: Poor, 3: Fair, 4: Good, 5: Excellent). We chose 20 phonetically balanced sentences [40] not seen during training, so that our analysis has good phonetic coverage. These sentences were synthesized for each of our experiments. In addition, 20 samples with the pronunciation of these sentences by the original speaker were added as ground truth. Each sample was evaluated by 20 native evaluators. Our models, synthesized audios, corpus and an interactive demo are publicly available5. Table 4 presents the MOS values, with their respective 95% confidence intervals, for our experiments and for the best experiment of [39], which can also be seen in Figure 2. The results of the main analysis indicate that experiment 2 (Mozilla TTS) presented the best MOS value (4.02). According to [48], this value indicates good audio quality, with a barely perceptible, but not annoying, distortion. On the other hand, experiment 1 presented a MOS of 3.03, indicating a perceptible and slightly annoying distortion in the audios.
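MOS values with 95% confidence intervals typically take the form mean rating ± 1.96 standard errors. A sketch with made-up scores (we are not reproducing the exact CI procedure of every cited work; this is illustrative):

```python
import math

def mos_with_ci(scores):
    """Return (MOS, half-width of the 95% CI) for a list of
    1-5 ratings, using a normal approximation with the sample
    standard deviation."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    half = 1.96 * math.sqrt(var / n)
    return mean, half

# Made-up ratings from 10 evaluators for one sample.
mos, ci = mos_with_ci([4, 5, 4, 3, 4, 5, 4, 4, 3, 4])
print(f"{mos:.2f} ±{ci:.2f}")   # → 4.00 ±0.41
```

More evaluators shrink the interval as 1/sqrt(n), which is consistent with the discussion of evaluator counts below.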
With respect to previous results in the English language, [17] (Tacotron 2) can be compared to our experiment 2. The authors trained their model on an in-house US English dataset (24.6 hours), reaching a MOS of 4.52 ±0.06. Considering the confidence intervals, our model reaches 4.29 in the best case and 3.75 in the worst case. Therefore, our model has a slightly lower MOS, which can be explained by the fact that [17] uses the WaveNet vocoder, which achieves higher quality than the RTISI-LA/Griffin-Lim vocoder, as shown in [18]. It is also possible to compare our results with related works in Portuguese. The current state of the art (SOTA) in Portuguese [39] achieved a MOS score of 3.82 ±0.69 when training Tacotron 2 on the in-house SSF1 dataset. Considering the confidence intervals, the model of [39] can achieve a MOS of 4.51 in the best case and 3.13 in the worst case. On the other hand, as previously discussed, our best model can reach 4.29 and 3.75 in the best and worst cases, respectively. These values are compatible, since the authors of [39] used the WaveNet neural vocoder, which generates speech of higher quality. In addition, our confidence intervals are shorter, which may indicate that our evaluators agreed more during the evaluation. Moreover, [39] used only 8 evaluators in their MOS analysis, while in this work we used 20; the number of evaluators can also have an impact on the confidence intervals.
Comparing the Ground truth for the SSF1 dataset and TTS-Portuguese Corpus, we can see that the TTS-Portuguese Corpus can vary from 4.87 to 4.55 in the best and worst cases, respectively. On the other hand, the MOS for the SSF1 dataset reported by [39] ranges from 5.02 (a value above 5 can be justified by rounding) to 3.82. Considering this MOS analysis, the two datasets are comparable in terms of quality and naturalness.

Conclusions and Future Work
This work presented an open dataset, as well as the training of two deep-learning-based speech synthesizers, applied to the Brazilian Portuguese language. The dataset is publicly available and contains approximately 10.5 hours of speech.
We found that it is possible to train a good-quality speech synthesizer for Portuguese using our dataset, reaching a 4.02 MOS value. Our best results were based on the Tacotron 2 model. We obtained MOS scores comparable to the SOTA paper exploring the use of deep learning for the Portuguese language [39], which used an in-house dataset. In addition, our results are also comparable to works in the literature that used the English language.
To the best of our knowledge, this is the first publicly available single-speaker synthesis dataset for the language. Similarly, the trained models are a contribution to the Portuguese language, which has few open-access models based on deep learning.

Acknowledgements
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPU used in part of the experiments presented in this research.