Quality prediction of synthesized speech based on tensor structured EEG signals

This study investigates quality prediction methods for synthesized speech using EEG. Training a predictive model using EEG is challenging due to a small number of training trials, a low signal-to-noise ratio, and a high correlation among independent variables. When a predictive model is trained with a machine learning algorithm, the features extracted from multi-channel EEG signals are usually organized as a vector, and their structure is ignored even though the signals are highly structured. This study predicts the subjective rating scores of synthesized speech, including overall impression, valence, and arousal, by creating tensor-structured features instead of vectorized ones to exploit the structure of the features. We extracted various features and constructed a tensor feature that maintained their structure. Vectorized and tensorial features were used to predict the rating scales, and the experimental results showed that prediction with tensorial features achieved better predictive performance. Among the features, those in the alpha and beta bands were particularly effective for prediction, which agrees with previous neurophysiological studies.


Introduction
Text-to-Speech (TTS) systems, which convert written text into speech, are becoming more widely implemented in mobile phones, car navigation systems, and other consumer electronics. Such systems play a critical role in many applications because speech is the most fundamental and easiest communication tool for human beings. Therefore, synthesized speech must sound natural for good machine-to-human communication.
In the first approach, subjective rating, the two most common aspects for quality judgment are naturalness and intelligibility. Naturalness describes how close synthesized speech is to human speech, and intelligibility reflects how well the speech content can be heard. The former is usually measured by a mean opinion score (MOS) test [1], and the latter is gauged by semantically unpredictable sentences (SUS) [3]. In addition, valence and arousal are often used to evaluate the subjective impressions of speech [11,13,15,16] and to model emotions [17][18][19][20]. Valence reflects a positive or a negative emotion. Arousal reflects the degree of intensity or activation. In a MOS test, subjects listen to speech and rate its relative perceived quality on a scale, for example, "excellent," "good," "fair," "poor," "bad." Then the scores are averaged across subjects. This is a well-established method for which references on how to perform it are available [2], making it the only standard way to evaluate the naturalness of synthesized speech. However, the appropriateness of subjective ratings has not been fully proven because high inter- and intra-subject inconsistencies are often observed in the ratings, resulting in poor reproducibility [21].
In the second approach, speech quality is automatically evaluated at the signal level by software that takes a speech file as input and outputs the estimated speech quality. Advantages of these methods include complete reproducibility and lower time consumption once such software is developed. However, their appropriateness is difficult to prove because the exact relationship between the acoustic features and the quality of speech perceived by a listener is not well understood [21]. In fact, speech quality must be evaluated not only physically but also psychologically because it is commonly defined as the result of an assessment in which a listener compares his/her perceptions with expectations [22,23].
Last, quality estimation methods that measure the physiological responses of a listener are emerging [24]. Even though these methods have not been established yet, they are worth investigating because physiological signals can be recorded automatically and continuously to provide insight into a listener's cognitive states without the interruptions caused by directly asking him/her to answer questions. Among existing non-invasive physiological response measures, electroencephalography (EEG) has especially great potential for estimating a listener's perceived speech quality for the following reasons. EEG can be recorded at a higher temporal resolution, e.g., in the millisecond range, than hemodynamic measures, including functional magnetic resonance imaging (fMRI) and functional near-infrared spectroscopy (fNIRS), both of which analyze changes in blood flow that inherently take a few seconds before a brain response can be recorded. Temporal resolution is important for evaluating speech quality since the temporal structure of speech largely affects its perceived quality. In addition, EEG recording equipment is relatively small, less expensive than other brain recording equipment, and can even be wireless, which allows us to use it in daily environments, whereas fMRI and magnetoencephalography (MEG) can only be used in experimental rooms because of their lack of portability. Measuring physiological responses to speech in daily environments is critical because speech is everywhere. Despite the above advantages, the main disadvantage of physiological measures is the difficulty of data gathering. The amount of data that can be collected from a subject is limited for practical and ethical reasons. Conducting experiments is usually time-consuming and labor-intensive. In addition, physiological data are generally noisy and easily contaminated by artifacts.
Furthermore, multi-channel EEG signals are usually highly correlated with each other, which makes the features extracted from them less informative than their high dimensionality suggests. These aspects of EEG (limited amount of data, noise, high correlation, and high dimensionality) complicate training a predictive model with EEG data and require sophisticated dimension reduction or regularization techniques [25].
Existing research has analyzed EEG responses to speech stimuli using event-related potentials (ERPs), which are time-locked voltage changes in response to external or internal events and are usually visualized and quantified after synchronous averaging of multiple epochs [7][8][9]. By definition, measuring an ERP requires the instantaneous time-locking points at which an event occurs, which complicates the use of ERPs if stimulus onsets are gradual or unclear [26]. Therefore, ERPs are not suitable for our purpose of predicting the perceived quality of speech whose length exceeds a second because it is usually unclear which time points affect a listener's perceived quality. Other research used power spectral density [14,27] and its difference between EEG channels [11,13] at multiple frequency bands. Neuroscience studies reported that EEG spectral changes in distinct regions and between hemispheres are related to emotions [28][29][30][31]. Other studies used EEG phase synchronization between EEG channel pairs and found a correlation with emotions [32,33].
The purpose of this research is to predict the perceived quality of synthesized speech using only EEG. Interest is growing in machine learning algorithms that take input/output data structured as tensors [34][35][36]. We investigated such tensor-structured features in this study because EEG signals can have structure in time, frequency, space, experimental condition, and other modalities.

Materials
We used the PhySyQX data set [10], which consists of speech files, their subjective rating scores from 21 subjects, and EEG signals from the same subjects recorded while they listened to the speech. The data recording protocol was approved by the INRS Research Ethics Office, and participants gave informed consent for their participation and to make their data anonymous and freely available online. The details of the data set and the experimental procedures are available in [10]. We obtained it by an e-mail request.

Speech stimuli
The speech stimuli presented to the subjects in the data set consist of speech collected from four humans and seven commercially available TTS systems. From each human and each TTS system, four English sentences were collected, whose durations ranged from 13 to 22 seconds. The 44 human and synthesized speeches were presented to each subject in random order.

Experimental procedure
The experiment's timeline is shown in Fig 1. A 15-second rest period was provided before each stimulus presentation. Each presentation was followed by a subjective rating period during which the subjects evaluated the speech to which they had just listened. The subjective rating scales used in this study are shown in Table 1 and include overall impression (MOS), valence (VAL), and arousal (ARL). MOS was evaluated on a 5-point scale and the others on a 9-point scale using the self-assessment manikin [37].

EEG recording and preprocessing
EEG data were recorded throughout the experiment with 64 scalp channels. The sampling rate was 512 Hz, which was down-sampled to 256 Hz. All the channels were placed on the scalp according to the modified 10/20 system [38]. Some channels were removed from further analysis because they were noisy. A band-pass filter between 0.5 and 50 Hz was applied to all the data, followed by an independent component analysis based semi-automatic artifact removal technique using the ADJUST toolbox [39]. After this preprocessing, the EEG signal of each subject was cut into 44 epochs corresponding to the stimulus listening periods.

Feature extraction
All features were extracted in five frequency bands from a channel or a channel pair. The frequency bands are delta (δ: 1-4 Hz), theta (θ: 4-8 Hz), alpha (α: 8-12 Hz), beta (β: 12-30 Hz), and gamma (γ: 30-45 Hz). Let $x_{n,m}(f_k)$ denote the Fourier transform at frequency $f_k$ of the $n$-th trial recorded by the $m$-th channel. Estimators of the power spectrum density and the phase spectrum, denoted by $p_{n,m}(f_k)$ and $h_{n,m}(f_k)$, can be calculated using the periodogram method as follows:
$$p_{n,m}(f_k) = \frac{1}{T}\,\lvert x_{n,m}(f_k)\rvert^{2}, \qquad h_{n,m}(f_k) = \mathrm{angle}\big(x_{n,m}(f_k)\big),$$
where $T$ is the number of time samples within a trial. Then, we averaged the power spectrum density over the frequency bins within the range of each frequency band to define the channel-based features $\mathrm{PSD}_n(m, f)$ as follows:
$$\mathrm{PSD}_n(m, f) = \frac{1}{\lvert D_f\rvert} \sum_{k \in D_f} p_{n,m}(f_k),$$
where $D_f$ is the index set of the frequency bins included in the range of the $f$-th frequency band and $\lvert D_f\rvert$ is the number of elements in $D_f$. The channel-pair-based features $\mathrm{PWD}_n$ and $\mathrm{PHD}_n$ are likewise defined for each channel pair from the averaged power spectrum density and the phase spectrum, respectively. If $M$ EEG channels and $F$ frequency bands are used ($F = 5$ in this study), $I = F(M(M - 1) + M)$ features are calculated. The feature matrix $X$ can be expressed as:
$$X = [x(1), x(2), \ldots, x(N)]^{\top} \in \mathbb{R}^{N \times I},$$
where $N$ is the number of training trials and $x(n)$ is the feature vector of the $n$-th trial, which contains all the features $\mathrm{PSD}_n$, $\mathrm{PWD}_n$, and $\mathrm{PHD}_n$. To exploit the structure of the features, we organized them as a tensor $\underline{X} \in \mathbb{R}^{N \times M \times M \times F}$, whose modes correspond to trials, the first channel, the second channel, and frequency bands. The feature matrix and tensor are depicted in Fig 2.
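As an illustration of the channel-based features described above, the following sketch computes the periodogram, the phase spectrum, and the band-averaged power for one trial with NumPy. It is a minimal sketch under stated assumptions, not the implementation used in this study: the function names `band_features` and `n_features`, the half-open band edges, and the synthetic two-channel epoch are hypothetical, and the 256 Hz default is taken from the down-sampled recordings.

```python
import numpy as np

# Frequency bands in Hz, as listed in the text.
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 12),
         "beta": (12, 30), "gamma": (30, 45)}


def band_features(epoch, fs=256.0):
    """Return band-averaged power (M x F) and the phase spectrum (M x K).

    epoch: array of shape (M channels, T samples) holding one trial.
    """
    n_samples = epoch.shape[1]
    spectrum = np.fft.rfft(epoch, axis=1)             # x_{n,m}(f_k)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)    # f_k in Hz
    power = np.abs(spectrum) ** 2 / n_samples         # periodogram p_{n,m}(f_k)
    phase = np.angle(spectrum)                        # h_{n,m}(f_k)
    psd = np.empty((epoch.shape[0], len(BANDS)))
    for j, (low, high) in enumerate(BANDS.values()):
        # Index set D_f; treating each band as [low, high) is an assumption.
        in_band = (freqs >= low) & (freqs < high)
        psd[:, j] = power[:, in_band].mean(axis=1)
    return psd, phase


def n_features(n_channels, n_bands=5):
    """Feature count I = F(M(M - 1) + M) stated in the text."""
    return n_bands * (n_channels * (n_channels - 1) + n_channels)


# Synthetic epoch (an assumption for illustration): a 10 Hz sine on
# channel 0 and a 20 Hz sine on channel 1, four seconds long.
fs = 256.0
time = np.arange(int(4 * fs)) / fs
epoch = np.vstack([np.sin(2 * np.pi * 10 * time),
                   np.sin(2 * np.pi * 20 * time)])
psd, phase = band_features(epoch, fs)
```

In this toy epoch the band-averaged power of channel 0 peaks in the alpha band and that of channel 1 in the beta band. If all 64 channels were retained, `n_features(64)` would evaluate to 20480, which shows how quickly the feature dimension outgrows the 44 trials available per subject.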

Regression analysis
We used higher-order partial least squares (HOPLS) [34] and standard partial least squares (PLS) [40,41], both of which perform dimension reduction and regression simultaneously. The former is a natural extension of the latter so that tensor-format features can be used. Let $Y$ denote the response matrix that contains the response variables of all training trials:
$$Y = [y(1), y(2), \ldots, y(N)]^{\top} \in \mathbb{R}^{N \times J},$$
where $y(n)$ is the $J = 3$ dimensional response vector of the $n$-th trial. All response variables were normalized to have zero mean and unit variance. PLS performs a simultaneous decomposition of $X$ and $Y$ to find common latent variables $t_r \in \mathbb{R}^{N}$ as:
$$X = \sum_{r=1}^{R_1} t_r p_r^{\top} + E, \qquad Y = \sum_{r=1}^{R_1} t_r q_r^{\top} + F,$$
where $p_r$ and $q_r$ are the loading vectors, $E$ and $F$ are the residual matrices, and $R_1$ is the number of components. HOPLS can be formulated similarly as the problem of finding latent variables as follows:
$$\underline{X} = \sum_{r=1}^{R_2} \underline{G}_r \times_1 t_r \times_2 P_r^{(1)} \times_3 P_r^{(2)} \times_4 P_r^{(3)} + \underline{E}, \qquad Y = \sum_{r=1}^{R_2} t_r q_r^{\top} + V,$$
where $\underline{G}_r \in \mathbb{R}^{1 \times L_1 \times L_2 \times L_3}$ is the core tensor, $\underline{E}$ and $V$ are the residuals, $R_2$ is the number of components, and $\times_k$ denotes the $k$-mode product [42]. $P_r^{(k)}$ is the loading matrix of the $r$-th component for the $k$-th feature mode, and $L_k$ is the number of $k$-mode loadings.
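The way a single HOPLS component combines a core tensor, a latent vector, and mode-wise loading matrices can be made concrete with a short NumPy sketch of the k-mode product. This is only an illustration under assumptions: the helper `mode_product`, the channel count M = 6, and the loading counts are hypothetical, random arrays stand in for fitted quantities, and modes are 0-indexed in the code whereas the text counts them from 1.

```python
import numpy as np


def mode_product(tensor, matrix, mode):
    """Multiply `tensor` along axis `mode` by `matrix` (k-mode product).

    matrix has shape (J, I_k) with I_k == tensor.shape[mode]; the result
    has size J along that axis and is unchanged along the others.
    """
    # tensordot contracts the matrix columns with the chosen tensor axis and
    # puts the new axis first, so it is moved back into position afterwards.
    moved = np.tensordot(matrix, tensor, axes=(1, mode))
    return np.moveaxis(moved, 0, mode)


rng = np.random.default_rng(0)
N, M, F = 44, 6, 5        # trials, channels (hypothetical M), frequency bands
L1, L2, L3 = 3, 3, 2      # hypothetical numbers of loadings per feature mode

core = rng.standard_normal((1, L1, L2, L3))   # core tensor of one component
t = rng.standard_normal((N, 1))               # latent vector as a column
P1 = rng.standard_normal((M, L1))             # channel-1 loadings
P2 = rng.standard_normal((M, L2))             # channel-2 loadings
P3 = rng.standard_normal((F, L3))             # frequency-band loadings

# One component: core x_1 t x_2 P1 x_3 P2 x_4 P3 (1-indexed modes in the text).
term = core
for mode, factor in enumerate((t, P1, P2, P3)):
    term = mode_product(term, factor, mode)
```

The resulting `term` has the same N × M × M × F shape as the feature tensor, so a sum of R_2 such terms plus a residual can approximate it while each component is described by far fewer parameters than a vectorized loading would need.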
If data are plentiful, which is rare in EEG studies, the best approach for training and evaluating the performance of a predictive model is to randomly divide the dataset into three parts: training, validation, and test sets, which are respectively used to train a model, tune hyperparameters or select a model, and evaluate the generalization error [43]. However, since the amount of data in this study is too small for such an ideal protocol, we instead used leave-one-out cross-validation for each subject. The hyperparameter $R_1$ of PLS was varied from 1 to 43. For HOPLS, the numbers of loadings of the channel-1 mode $L_1$ and the channel-2 mode $L_2$ ranged from 1 to 7, and the number of loadings of the frequency-band mode $L_3$ and the number of components $R_2$ ranged from 1 to 5. The results of the models that achieved the best performance are reported in Results.
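The leave-one-out protocol with a sweep over the number of PLS components can be sketched as follows. This is a hedged illustration, not the code used in this study: `pls_fit` is a compact deflation-based PLS written for this example, `loo_rmse` and the synthetic data are hypothetical, the error is the root mean squared error defined under Evaluation metrics, and the normalization of the responses is omitted for brevity.

```python
import numpy as np


def pls_fit(X, Y, n_components):
    """Fit PLS regression by successive deflation; return (B, x_mean, y_mean)."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xr, Yr = X - x_mean, Y - y_mean
    W, P, Q = [], [], []
    for _ in range(n_components):
        # Weight vector: dominant left singular vector of the cross-covariance.
        w = np.linalg.svd(Xr.T @ Yr, full_matrices=False)[0][:, 0]
        t = Xr @ w                         # latent variable t_r
        tt = t @ t
        p = Xr.T @ t / tt                  # X loading
        q = Yr.T @ t / tt                  # Y loading
        Xr = Xr - np.outer(t, p)           # deflate both blocks
        Yr = Yr - np.outer(t, q)
        W.append(w)
        P.append(p)
        Q.append(q)
    W, P, Q = (np.column_stack(cols) for cols in (W, P, Q))
    B = W @ np.linalg.solve(P.T @ W, Q.T)  # regression coefficients
    return B, x_mean, y_mean


def loo_rmse(X, Y, n_components):
    """Leave-one-out RMSE per response variable for a given component count."""
    n_trials = X.shape[0]
    preds = np.empty_like(Y, dtype=float)
    for i in range(n_trials):
        train = np.arange(n_trials) != i   # hold out trial i
        B, x_mean, y_mean = pls_fit(X[train], Y[train], n_components)
        preds[i] = (X[i] - x_mean) @ B + y_mean
    return np.sqrt(np.mean((preds - Y) ** 2, axis=0))


# Synthetic stand-in for one subject: 44 trials, 8 features, 3 responses.
rng = np.random.default_rng(0)
X = rng.standard_normal((44, 8))
Y = X @ rng.standard_normal((8, 3)) + 0.01 * rng.standard_normal((44, 3))

scores = {r: loo_rmse(X, Y, r) for r in range(1, 9)}
best_r = min(scores, key=lambda r: scores[r].mean())
```

With 44 trials per subject, each fold trains on 43 trials, which is why the number of PLS components can range up to 43 in the actual setting; the toy data above only have 8 features, so the sweep stops at 8.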

Evaluation metrics
Root mean squared error (RMSE) was used to quantify the predictive performance of the regression models for each subject. It is formulated as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(\hat{y}_i - y_i\right)^{2}},$$
where $N$ is the number of test samples, $\hat{y}_i$ is the predicted value for the $i$-th test sample, and $y_i$ is the actual value.

Results
Table 2 summarizes the RMSE and the numbers of latent factors identified by PLS and HOPLS. Predictions with tensorial features generally made smaller errors than those with vectorized features for all the rating scales. Fig 3 reports the one hundred features that contributed most to the predictions, where feature contributions were calculated by taking the

Discussion
Channel-pair-based features (PWD and PHD) contributed more to the predictions than channel-based ones (PSD), which agrees with a previous study [31]. This suggests the importance of considering scalp EEG dynamics between brain regions and indicates that graph theory based features or functional connectivity analysis can be effective [45,46]. The importance of spectral differences in caudality (DCAU) between the anterior and posterior [12,47] or the fronto-posterior brain regions [31], as well as the lateral (left-right) spectral difference (DLAT), has been documented [28,30]. In this study, both DLAT and DCAU contributed to the predictions (Fig 4), although their effectiveness depended on the subject. Quality prediction models were trained independently for each subject in this study because emotion regulation reportedly depends on the individual [48]. The commonality across subjects of the channels/channel pairs that greatly contributed to the predictions was actually rather small (Fig 4). Therefore, creating subject-independent features is an interesting direction for future work. However, note that the alpha and beta bands commonly contributed to the predictions, whereas the effective channels/channel pairs differed across subjects. The alpha and beta bands contributed more to the predictions than the other frequency bands, which is in line with previous neurophysiological studies. The relationship between alpha band asymmetry and withdrawal or disengagement from a stimulus or negative valence has been well documented in response to a variety of stimuli, including pictures [49,50], music [31,47,51], movies [52], and speech [11,13]. The beta band, which contributed the most to the ARL predictions, is reportedly associated with arousal and emotional experiences [53,54].
Gupta et al. [13] predicted MOS values using the same data set that we used in this study. Their study used a simple linear regression model with not only EEG but also speech features. They reported an RMSE of 0.117, which is much lower than that of our model and suggests that speech features are much more informative than EEG features for predicting subjective quality ratings.
Although we predicted the response values of MOS, VAL, and ARL, other perceptual dimensions were also proposed recently to model emotions or perceived quality of experience [55,56], which should be investigated in future research.
Neither previous work nor our current study advocates that physiological assessment methods of speech quality should replace subjective rating methods or signal analysis methods because, as stated in Introduction, each method has its own advantages and disadvantages and they can complement each other.
Several open questions remain. First, features were extracted and constructed as tensors as described in Feature extraction and Regression analysis, but other features and other ways of constructing tensors are also possible. For example, if time-frequency analysis is employed, time frames can be treated as one of the tensor modes. Second, this study analyzed the overall quality of each speech stimulus, which was longer than ten seconds. However, particular parts of a speech stimulus may affect its overall perceived quality much more strongly than others. Therefore, analysis methods that identify such parts need to be studied.

Conclusion
This study predicted the subjective quality ratings of synthesized speech solely based on EEG. We created vectorized and tensorial features for the regression that include channel-based and channel-pair-based features at multiple frequency bands. The experimental results showed that tensorial features predicted the subjective ratings more effectively than vectorized ones, and the trained predictive models were neurophysiologically plausible.