ICE-Talk 2: Interface for Controllable Expressive TTS with perceptual assessment tool

In this paper, we present open-source tools¹ that facilitate the use of controllable TTS systems in experiments, towards the democratization of TTS systems across domains. ICE-Talk is a web-based GUI that allows the use of a TTS system with controllable parameters via a text field and a clickable 2D plot. It enables the study of latent spaces for controllable TTS. A tool to design perceptual experiments is also provided. Its use consists of three steps: pre-synthesizing samples covering the 2D plot that represents the controllable dimensions, including this interface inside a question template, and integrating it in a Mechanical Turk-like system called turkle.


Introduction and motivations
Speech synthesis is an important component of Human-Robot Interaction. However, as of today, expressiveness in speech generated by Text-to-Speech (TTS) systems is under-explored in such interactions. The reason is the difficulty of accessing the variables controlling speech expressiveness in a deep learning-based TTS system [1].
To tackle this problem, we propose a tool that allows the control of these variables through a graphical interface, thus contributing to the democratization of the use of Deep Learning (DL)-based TTS systems.
This interface allows control over the synthesis parameters of a DL-based model through its latent space, directly and intuitively, in a graphical way. It therefore enables several interesting applications and experiments, such as listening tests for the evaluation of such systems, thanks to easy prototyping of experiments. Indeed, in speech synthesis, it is well known that objective measures of quality can be misleading because they do not always correlate well with subjective perception. There exist tools to evaluate the naturalness of synthesized speech with subjective tests. However, the field of Controllable Expressive Speech synthesis needs experiments and protocols to assess the controllability of such systems.
The code (and data) in this article has been certified as Reproducible by Code Ocean (https://codeocean.com/). More information on the Reproducibility Badge Initiative is available at https://www.elsevier.com/physical-sciences-and-engineering/computer-science/journals.
* Corresponding author. E-mail address: noe.tits@alumni.umons.ac.be (N. Tits).
¹ https://github.com/noetits/ICE-Talk.

Related work
As of today, there are some open-source web interfaces allowing the use of DL TTS models.² They let a user write text that is sent to the model and return the synthesized speech as an audio object that can be played. The text is therefore the only control variable that can be accessed.
Recently, an interface³ that accepts an audio sample to provide speaker characteristics for TTS was developed, based on the research of the Tacotron team [2,3]. It allows a user to select a reference audio file and synthesize speech from text imitating the voice of the reference. It is, however, not possible to interact with a latent space representing acoustic variability. ICE-Talk [4] provides a web interface capable of visualizing and exploring a space of voice expressiveness and synthesizing the corresponding expressive speech; it is a proof of concept based on [5]. However, the controllability of such a system is difficult to assess and requires the design of perceptual experiments in which a user has to solve a task that measures this controllability.
In this paper, we present an extended version of ICE-Talk that makes it possible to study the controllability of a Controllable Expressive TTS system. It integrates the interface inside a questionnaire template, making it possible to build perceptual experiments in which a user interacts with ICE-Talk.
To make the model available as a web service and to communicate text, audio and style information between the web interface and the TTS model, the Falcon web framework⁴ is used. Falcon bridges the gap between Python code and a web interface, allowing the use of Deep Learning frameworks through a web application (see Fig. 2).
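The exchange of text and style between the web page and the model can be sketched as follows. This is a minimal illustration and not the ICE-Talk code: it uses a plain WSGI callable from the Python standard library instead of Falcon's own API (Falcon's standard application object is itself a WSGI callable), and the route name `/synthesize`, the JSON field names and the `synthesize_stub` function are all hypothetical.

```python
import io
import json


def synthesize_stub(text, style):
    """Hypothetical stand-in for the TTS model; returns the name of a wav file."""
    return "out_%d_%d.wav" % (len(text), len(style))


def app(environ, start_response):
    """Plain WSGI callable: read a JSON request, call the model, answer in JSON."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    request = json.loads(environ["wsgi.input"].read(length).decode("utf-8"))
    wav_name = synthesize_stub(request["text"], request["style"])
    body = json.dumps({"audio": wav_name}).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]


def post_json(wsgi_app, payload):
    """Call the app in-process, the way a WSGI server would for a POST request."""
    raw = json.dumps(payload).encode("utf-8")
    environ = {
        "REQUEST_METHOD": "POST",
        "PATH_INFO": "/synthesize",  # hypothetical route name
        "CONTENT_LENGTH": str(len(raw)),
        "wsgi.input": io.BytesIO(raw),
    }
    status = []
    chunks = wsgi_app(environ, lambda s, headers: status.append(s))
    return status[0], json.loads(b"".join(chunks).decode("utf-8"))


# The web page would send the typed text and the selected 8D style vector.
status, reply = post_json(app, {"text": "Hello", "style": [0.0] * 8})
```

In a real deployment the stub would be replaced by the call to the TTS model, and the returned file name would be used by the page to load the audio.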

Controllable DL-based TTS
We use a modified version of Deep Convolutional Text-to-Speech (DCTTS) [6], a state-of-the-art Deep Learning sequence-to-sequence (seq2seq) model, with controllable expressiveness through a latent space designed to represent variations in voice style, as described in [5].
A TTS seq2seq model typically consists of an encoder-decoder structure. Text is encoded as a latent representation that is then decoded with an attention-based decoder to predict a mel-spectrogram, which is later inverted to an audio waveform.
In [5], to obtain a voice style representation for controllable expressiveness, a mel-spectrogram encoder is added. It consists of a stack of 1D convolutional layers followed by an average pooling, which yields an 8D encoding vector. The pooling ensures that the encoding contains only time-invariant information. It can thus capture statistics of prosody, such as average pitch and average speaking rate, but not the evolution of pitch over time.
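The effect of the pooling step can be illustrated with a small sketch. It does not reproduce the convolutional stack of [5]; it assumes that some encoder has already produced a sequence of 8D frames (random toy values here) and only shows that averaging over frames discards their temporal order.

```python
import math
import random


def average_pool(frames):
    """Average a sequence of equal-length feature vectors over the time axis."""
    n_frames = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / n_frames for d in range(dim)]


# Hypothetical encoder output: 50 time frames of 8 features each (toy values).
rng = random.Random(0)
frames = [[rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(50)]

style = average_pool(frames)

# Shuffling the frames changes their temporal order but not their statistics.
shuffled = frames[:]
rng.shuffle(shuffled)
style_shuffled = average_pool(shuffled)

# The two encodings agree up to floating-point rounding.
same = all(math.isclose(a, b, abs_tol=1e-9)
           for a, b in zip(style, style_shuffled))
```

Because the pooled vector is the same for any ordering of the frames, it can describe averages such as mean pitch but cannot describe how pitch moves over the utterance.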

Web interface
The interface contains a 2D representation of a latent space, which is an internal representation of the data distribution learned by the network. This 2D representation is obtained via a dimensionality reduction applied to the high-dimensional latent space of the system. The interface also contains a text box for the system's input and an audio player for the system's output.
³ https://github.com/CorentinJ/Real-Time-Voice-Cloning.
⁴ https://falcon.readthedocs.io/en/stable/.
The latent space represents the distribution of some controlling parameters of the output speech (the expressiveness, for instance) and is obtained after training. By writing a text and clicking on a point in the 2D space, the user generates an audio signal with the parameter values corresponding to the clicked point. The web interface, implemented in HTML5 and JavaScript, calls the service.
There are several possibilities for dimensionality reduction: UMAP, PCA or t-SNE. The mouse click is detected with JavaScript in pixel coordinates and mapped to the reduced data space.
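The mapping from pixel coordinates to the reduced data space can be sketched as below. The interface performs this step in JavaScript; the sketch is written in Python for consistency with the rest of the tool, and it assumes an axis-aligned plot with linear scales. The function name and the `plot_box` and `data_bounds` parameters are hypothetical.

```python
def pixel_to_data(px, py, plot_box, data_bounds):
    """Map a click in pixel coordinates to coordinates in the reduced 2D space.

    plot_box    = (left, top, width, height) of the plot area, in pixels
    data_bounds = (x_min, x_max, y_min, y_max) of the reduced data
    """
    left, top, width, height = plot_box
    x_min, x_max, y_min, y_max = data_bounds
    # Fraction of the way across the plot, from 0.0 to 1.0.
    fx = (px - left) / width
    # Pixel y grows downwards while the data y axis grows upwards, so flip it.
    fy = 1.0 - (py - top) / height
    return (x_min + fx * (x_max - x_min), y_min + fy * (y_max - y_min))


# A 400x400 pixel plot whose top-left corner is at pixel (50, 20).
centre = pixel_to_data(250, 220, (50, 20, 400, 400), (-10.0, 10.0, -5.0, 5.0))
```

With these toy values, a click in the middle of the plot maps to the middle of the data range on both axes.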
Then, Nearest Neighbour regression is used to compute the 2D data point, and a lookup table gives the corresponding 8D point of the latent space. The text and the 8D vector are fed to the model, which generates the sentence and saves it to a wav file. The wav file is then served and played as an HTML5 audio object.
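The lookup from a clicked position to an 8D latent vector can be sketched as follows. This is a simplification that assumes a single nearest neighbour and a Euclidean distance; the actual regression settings used by ICE-Talk are not stated here, and the function name and toy data are hypothetical.

```python
import math


def nearest_latent(click_xy, points_2d, latents_8d):
    """Return the 8D latent vector whose 2D projection is closest to the click."""
    def distance(i):
        return math.hypot(click_xy[0] - points_2d[i][0],
                          click_xy[1] - points_2d[i][1])

    best_index = min(range(len(points_2d)), key=distance)
    return latents_8d[best_index]


# Toy lookup table: row i of points_2d is the 2D projection of row i of latents_8d.
points_2d = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
latents_8d = [[0.0] * 8, [1.0] * 8, [2.0] * 8]

# The click is closest to the second 2D point, so its 8D vector is selected.
style = nearest_latent((0.9, 0.2), points_2d, latents_8d)
```

Keeping the 2D projections and the 8D vectors in two parallel lists is what makes the lookup a simple index operation once the nearest 2D point is known.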

Perceptual assessment tool
To study the controllability of a Controllable Expressive TTS system, we provide a tool to design perceptual tests.
First, we provide a Python script that pre-synthesizes a set of predefined sentences at points covering the 2D latent space, obtained by discretizing the space into a set of points.
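The discretization step can be sketched as a regular grid over the bounding box of the 2D space. This is an assumption about how the points might be chosen, not a description of the provided script; the grid size, the bounds and the example sentences are hypothetical.

```python
def grid_points(x_min, x_max, y_min, y_max, n_x, n_y):
    """Return n_x * n_y points evenly covering the rectangle, edges included."""
    if n_x < 2 or n_y < 2:
        raise ValueError("need at least two points per axis")
    xs = [x_min + i * (x_max - x_min) / (n_x - 1) for i in range(n_x)]
    ys = [y_min + j * (y_max - y_min) / (n_y - 1) for j in range(n_y)]
    # Row by row: all x positions for the first y, then for the second y, etc.
    return [(x, y) for y in ys for x in xs]


sentences = ["Hello there.", "How are you?"]  # hypothetical test sentences
points = grid_points(-1.0, 1.0, -1.0, 1.0, 5, 5)

# One synthesis job per (sentence, grid point) pair.
jobs = [(s_id, p_id, text, point)
        for s_id, text in enumerate(sentences)
        for p_id, point in enumerate(points)]
```

Each job would then be passed to the model once, so that the questionnaire only has to play back stored wav files instead of running the model for every click.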
An interface can then be used to play the pre-synthesized samples corresponding to the different regions of the space. A demonstration is available from the GitHub repository.⁵ We provide a question template, depicted in Fig. 3, that includes this interface and that can be integrated inside turkle,⁶ an open-source web server equivalent to Amazon's Mechanical Turk that one can host on a server or run on a local computer. This template presents a reference audio sample that the user should find by exploring the 2D space and clicking in it.
It is then possible to ask participants to use the 2D interface to produce the same expressiveness as in a given reference. We assume that if a participant is able to locate in the space the expressiveness corresponding to the reference, then they are able to use this interface to find the expressiveness they have in mind.

Conclusions and future works
We presented an extension of ICE-Talk, a proof of concept of the research results of [5]. The extension is a tool for building perceptual experiments in which a user interacts with ICE-Talk.
This tool will enable the study and assessment of the controllability of Controllable Expressive TTS systems, as well as of how participants behave and feel when using the system.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.