Automatic Text-Independent Artifact Detection, Localization, and Classification in Synthetic Speech

The paper describes experiments with statistical approaches to automatic detection, localization, and classification of the basic types of artifacts in the synthetic speech produced by the Czech text-to-speech system using the unit selection method. The first experiment is aimed at artifact detection by the analysis of variances (ANOVA) and hypothesis testing. The second experiment is focused on localization of the detected artifacts by the Gaussian mixture models (GMM). Finally, the developed open-set artifact classifier is described. The influence of the feature vector length and structure on the resulting artifact detection accuracy is analyzed together with other factors affecting the stability of the artifact detection process. Further investigations have shown a relatively great influence of the number of mixtures and the type of the covariance matrix on the artifact classification error rate as well as on the computational complexity. The obtained experimental results confirm the functionality of the artifact detector based on the ANOVA and hypothesis tests, and of the GMM-based artifact localizer and classifier. The described statistical approaches represent alternatives to the standard listening tests and the manual labeling of the artifacts.


Introduction
The synthetic speech produced by text-to-speech (TTS) systems is increasingly used to make dialogue management in human-machine interaction more effective. People involved in such a dialogue usually demand high quality, naturalness, and intelligibility of the generated synthetic speech. Various speech synthesis techniques may be implemented in TTS systems. The most widely used one is the corpus-based speech synthesis using the unit selection (USEL) [1], i.e. selection of the largest suitable segments from the natural speech according to various phonetic, prosodic, and positional criteria, commonly known as the target cost. These speech segments should be smoothly concatenated by minimizing the concatenation cost [2], [3]. However, any concatenation point may become a source of an audible artifact in the finally generated speech [4]. Apart from a wrong description of the natural original speech (such as wrong annotation and/or segmentation [5]), the most dominant causes of the artifacts are related mainly to discontinuities of the fundamental frequency in the voiced speech [6]. Other causes of serious artifacts include time inconsistencies or spectral mismatches at the concatenation points [7]. In the process of the TTS system development, all these artifacts must be identified by evaluation methods working without any human interaction. In such an objective method, the automatic speech recognition system yields the final evaluation in the form of a recognition score. Here, the Gaussian mixture models (GMM) [8] are mostly used. In general, the automatic artifact detection, localization, and classification can help in the whole process of the TTS system creation. This holds especially for the artifacts caused by wrong annotation or those found in an already generated synthetic sentence. If their location is known, they can be eliminated in the post-processing or directly during the unit selection as part of the concatenation cost.
This work was motivated mainly by the aim of finding an alternative objective approach to the standard listening tests for detection and localization of the artifacts in the synthetic speech. This is important in cases where the listening test is rather time consuming and relatively difficult due to small audible differences. In addition, the main disadvantages of the human evaluation lie in its subjectivity, lack of reproducibility (different results for repeated tests, even from the same subjects), and dependence on ambient conditions. On the other hand, the main advantages of the automatic evaluation system are its function without human interaction and the possibility of direct numerical matching of the obtained results using an objective comparison criterion.
The paper describes three basic experiments with the developed automatic speech artifact detector, localizer, and classifier. The functionality of this system is verified and its optimal settings are found. The evaluated objective results are compared with those obtained by the listening test as a subjective rating method.

Method
Our previous experience with the TTS system based on the USEL synthesis method has shown the appearance of six basic types of speech artifacts [9], [10]. In principle, the proposed automatic artifact detection, localization, and classification system consists of three parts:
1. artifact detection based on the analysis of variances (ANOVA) described in the previous paper [9],
2. artifact localization developed in concordance with its first stage dealt with in [10], where the position of the GMM score maximum coincides with the location of the artifact inside the tested sentence,
3. artifact type classification, also based on the GMM approach.
The function of the automatic system begins with the analysis of the tested input sentence. Then, the speech spectral and prosodic features are determined and subsequently applied in the ANOVA detector block, making a decision whether a speech artifact is present or not. In this step, the database of the clean synthetic speech (DB CLEAN) and the database of the synthetic speech with artifacts (DB ARTF) are used (see the block diagram in Fig. 1). If the sentence is marked as having an artifact, other types of spectral and prosodic parameters are used for artifact localization using the trained GMM models of the starting/ending parts and the bodies of the artifacts (database DB ARTF2). Once the artifact is localized, the nearest region of interest (ROI) is determined and the united GMM models of the starting, ending, and body parts are used for the final classification of the artifact type (see the example in Fig. 2).

Determination of Speech Spectral Features and Prosodic Parameters
The speech artifact detection method begins with listening to the speech signal and its evaluation using standard audio software or a program system dedicated to speech processing, e.g. Praat [11]. After detection of an audible artifact by repeated listening, the next step is visual evaluation of the speech signal. In this way, the original speech material is selected and prepared for building the basic speech feature databases DB CLEAN and DB ARTF.

Fig. 2. Demonstration of differences in speech signals: the clean sentence "send04g" (a), the sentence with an artifact "send04b" and its ROI (b), detail of the clean signal in the ROI (c), detailed part in the artifact neighborhood with the determined start/end locations (d).
Spectral features like the mel-frequency cepstral coefficients together with the energy and the prosodic parameters are mostly used in the GMM-based speaker identification or verification [12], [13]. Among other spectral properties, e.g. the first five formants can be used in psychological stress detection in speech [14]. In our experiments the features differ for the ANOVA detection and the GMM localization/classification of the artifacts. In general, three types of speech features can be determined: 1. supra-segmental (prosodic) parameters: speech signal energy calculated from the first cepstral coefficient c 0 (En c0) or from the autocorrelation coefficient r 0 (En r0), differential F0 microintonation (F0 DIFF), jitter, shimmer, zero-crossing period (L ZCR), and zero-crossing frequency (F ZCR); 2. basic spectral features; 3. supplementary spectral features.
The determined speech features are structured as vectors with the length N SF and stored in the DB CLEAN and DB ARTF databases separately for the male and female voices. The DB ARTF2 database comprises the separate speech features determined from the start, end, and body parts of the artifacts for localization and classification. For the creation and training of the GMMs, the representative statistical values (mean, median, relative maximum, relative minimum, skewness, kurtosis, etc.) are calculated from the original speech features.
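The reduction of a per-frame feature contour to such representative statistical values can be sketched as follows. This is a minimal illustration in Python; the function name and the exact set of statistics are assumptions based on the list above:

```python
import numpy as np
from scipy import stats

def summarize_feature(contour):
    """Collapse a per-frame feature contour into representative statistical
    values for GMM training: mean, median, relative maximum/minimum
    (distance of the extremes from the mean), skewness, and kurtosis."""
    contour = np.asarray(contour, dtype=float)
    mean = contour.mean()
    return np.array([
        mean,
        np.median(contour),
        contour.max() - mean,    # relative maximum
        contour.min() - mean,    # relative minimum
        stats.skew(contour),
        stats.kurtosis(contour),
    ])
```

One such summary vector would then be computed per feature and per sentence (or per artifact part) before being stored in the databases.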
To obtain the relevant speech features from the DB CLEAN and DB ARTF databases, the criterion of mutual independence between the synthetic speech with and without artifacts is applied. The final value of the mutual independence for every feature and every category is evaluated using three parameters: 1. the relative RMS distance D RMSrel between the histograms of features extracted from the DB CLEAN and DB ARTF, 2. the absolute distance between group means D 12 after the multiple comparison of the group means applied to the ANOVA statistical results, 3. the hypothesis probability resulting from the Wilcoxon test [15] or the Mann-Whitney U test [16], comparing whether two samples come from identical distributions with equal medians or not.
For all three parameters, the features are sorted in such a way that the higher the index, the lower the mutual independence. The parameter quantifying the mutual independence of the databases for every feature (MUTI SF) is represented by the resulting mean position in the category, calculated as

MUTI SF = (cw 1 · cr 1 + cw 2 · cr 2 + cw 3 · cr 3) / (cw 1 + cw 2 + cw 3),   (1)

where cr 1 is the sorted position by the criterion D RMSrel, cr 2 by D 12, cr 3 by the hypothesis probability, cw 1-3 are individual weights depending on the importance of the criterion, and N SFC is the number of features in the category (the maximum possible position). If the null hypothesis cannot be rejected for any feature, it is penalized by the highest index of the sorted vector N SFC. The features are selected by two rules: exclusion of the features with very small null hypothesis probabilities, and elimination of those with small RMS distances between the speech with and without artifacts. This feature separation process is performed with the speech material comprising sentences spoken by all speakers.
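A minimal sketch of this feature-ranking step is given below. Equal criterion weights, 20-bin histograms, and the Mann-Whitney U test (one of the two tests named above) are assumptions, since the text does not fix these choices:

```python
import numpy as np
from scipy.stats import mannwhitneyu

def rank_features(clean, artf, weights=(1.0, 1.0, 1.0), alpha=0.05):
    """Rank features by mutual independence between DB CLEAN and DB ARTF.
    clean, artf: arrays of shape (n_sentences, n_features).
    Returns MUTI_SF per feature: lower value = higher mutual independence."""
    n_feat = clean.shape[1]
    d_rms, d_12, pvals = [], [], []
    for j in range(n_feat):
        c, a = clean[:, j], artf[:, j]
        # relative RMS distance between histograms on a common bin grid
        lo, hi = min(c.min(), a.min()), max(c.max(), a.max())
        hc, _ = np.histogram(c, bins=20, range=(lo, hi), density=True)
        ha, _ = np.histogram(a, bins=20, range=(lo, hi), density=True)
        d_rms.append(np.sqrt(np.mean((hc - ha) ** 2)))
        # absolute distance between group means
        d_12.append(abs(c.mean() - a.mean()))
        # hypothesis probability of equal medians
        pvals.append(mannwhitneyu(c, a, alternative="two-sided").pvalue)
    # sorted positions: higher index = lower mutual independence
    cr1 = np.argsort(np.argsort(-np.asarray(d_rms)))   # large distance -> low index
    cr2 = np.argsort(np.argsort(-np.asarray(d_12)))
    cr3 = np.argsort(np.argsort(np.asarray(pvals)))    # small p-value -> low index
    cr3[np.asarray(pvals) > alpha] = n_feat - 1        # penalize non-rejected nulls
    cw = np.asarray(weights)
    return (cw[0] * cr1 + cw[1] * cr2 + cw[2] * cr3) / cw.sum()
```

A feature whose clean and artifact distributions differ strongly ends up with a small MUTI SF value and is kept for the detector.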

Artifact Detection Based on ANOVA
The first step of our speech artifact identification experiment based on the ANOVA analysis is focused on testing whether there is a common mean of speech features from several groups. Besides the ANOVA F-test, giving the ratio of variances between and within groups, we use the Ansari-Bradley probability test, specifying whether two distributions are the same or differ in their variances. For a chosen significance level, the resulting logical value "0" denotes that the null hypothesis cannot be rejected and the value "1" indicates that it can be rejected. The overall structure of the method can be seen in Fig. 3.
The speech spectral properties and prosodic parameters obtained during the analysis of the tested sentence are used to calculate the corresponding basic statistical parameters and the occurrence distributions of the feature values. Further, they are processed by the one-way ANOVA analysis. The distances between the means of the groups are visualized using the multiple comparisons of groups (see Fig. 4). The minimum absolute value of the group distance is found from among the distances between the group means:
- D T1: the tested sentence and the clean sentence,
- D T2: the tested sentence and the one with an artifact,
- D 12: the clean sentence and the one with an artifact.
For each of N SF speech features these results yield the decision about the tested sentence (clean/artifact), and the Ansari-Bradley test between probability distributions gives the probability and the logical output value (0/1).
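The per-feature decision step can be illustrated with SciPy's implementations of the one-way ANOVA F-test and the Ansari-Bradley test. The decision rule below (choosing the nearer group mean) is a simplified sketch of the detector, not its exact logic:

```python
import numpy as np
from scipy.stats import f_oneway, ansari

def detect_artifact(test_feat, clean_feat, artf_feat, alpha=0.05):
    """Clean/artifact decision for one speech feature.
    Compares the tested sentence's feature values against the DB CLEAN and
    DB ARTF groups. Returns (is_artifact, reject_null, p_anova)."""
    # one-way ANOVA over the three groups (common-mean test)
    _, p_anova = f_oneway(test_feat, clean_feat, artf_feat)
    # distances between the group means
    d_t1 = abs(np.mean(test_feat) - np.mean(clean_feat))  # tested vs clean
    d_t2 = abs(np.mean(test_feat) - np.mean(artf_feat))   # tested vs artifact
    # Ansari-Bradley: do tested and clean groups share the same dispersion?
    _, p_ab = ansari(test_feat, clean_feat)
    reject_null = int(p_ab < alpha)     # logical value "1" = null rejected
    is_artifact = d_t2 < d_t1           # nearer to the artifact group
    return is_artifact, reject_null, p_anova
```

In the full detector, such a decision is produced for each of the N SF features and the per-feature outputs are then combined.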

GMM-based Artifact Localization
The GMMs represent a linear combination of multiple Gaussian probability distribution functions of the input data vector. For their creation it is necessary to determine the covariance matrix, the vector of means, and the weights from the input training data. In general, spherical, diagonal, or full covariance matrices may be used. If the elements of the feature vectors are correlated, their number must be relatively high and satisfactory approximation can be achieved only with the full covariance matrix. However, in the speaker identification tasks, the diagonal covariance matrix is used due to its lower computational complexity. The maximum likelihood function of the GMM is found by the expectation-maximization iteration algorithm. It is controlled by the number of mixtures N MIX and the number of iterations. The classifier returns the probability that the tested utterance belongs to the GMM model. In the standard GMM classifier, the resulting class is given by the maximum overall probability of all the obtained scores corresponding to K output classes. Here, only one output class is defined and the GMM classifier processes N feature vectors corresponding to N frames of the tested sentence.
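Using an off-the-shelf GMM implementation, the training and frame-by-frame scoring described above can be sketched as follows. Random data stand in for real feature vectors; the 16-dimensional vectors and the diagonal covariance matrix follow the text, while the mixture count and sample sizes here are arbitrary:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 16))   # training feature vectors

# diagonal covariance keeps the computational complexity low,
# as in the speaker identification tasks mentioned in the text
gmm = GaussianMixture(n_components=8, covariance_type="diag",
                      max_iter=100, random_state=0)
gmm.fit(train)                                  # EM iteration algorithm

frames = rng.normal(0.0, 1.0, size=(40, 16))    # N frames of a tested sentence
scores = gmm.score_samples(frames)              # log-likelihood per frame
```

The resulting score vector (one value per frame) is exactly the quantity the localizer below searches for maxima in.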
The main idea of the proposed localization method is based on the assumption of a correlation between the position of the artifact and the score maximum in the vector of normalized scores obtained by comparison between the currently tested speech frame and the trained GMM model (see Fig. 5). Three types of GMM models of the artifacts are created and trained for each voice:
a) starting part: speech signal in the left margin frame of the artifact and ±i frames in its neighborhood,
b) ending part: speech signal in the right margin frame of the artifact and ±i frames in its neighborhood,
c) body of the artifact: speech signal spanning from the starting to the ending frame.
In the classification phase, the input feature vectors are compared with these 3 trained GMM models to get 3 output vectors of normalized scores. For final localization of the artifact position the first 3 maxima are evaluated by logical matching with the predefined rules (see Fig. 6) covering the situations when the localization algorithm might fail -the starting frame position must precede the ending one, the artifact body must lie between the start and the end, etc. If one of these conditions is not fulfilled, the position will be assigned to the 2nd or the 3rd determined score maximum. Only one artifact within the tested sentence can be found by this approach and the artifact presence must be confirmed by another detection method, e.g. ANOVA-based approach.
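The matching of the first three score maxima against the ordering rules might be sketched as follows. This is a simplified illustration; the real system works with normalized scores and may apply further rules:

```python
import numpy as np

def localize_artifact(score_start, score_end, score_body):
    """Pick the artifact position from three vectors of per-frame GMM
    scores (start / end / body models). The start must precede the end
    and the body maximum must lie between them; otherwise the 2nd and
    3rd score maxima are tried. Returns (start_frame, end_frame) or None."""
    def top3(v):
        # indices of the three largest scores, best first
        return [int(i) for i in np.argsort(v)[::-1][:3]]
    for s in top3(score_start):
        for e in top3(score_end):
            if s >= e:
                continue                    # start must precede end
            for b in top3(score_body):
                if s <= b <= e:             # body lies inside [start, end]
                    return s, e
    return None                             # no consistent hypothesis found
```

As in the described system, only one artifact per sentence can be found this way, so the presence of the artifact must already have been confirmed by the detector.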

Classification of Artifact Types
The artifact types 1-5 occur relatively often, so the corresponding changes of prosodic and spectral parameters may be defined appropriately and the classification can be carried out with a relatively high precision. In the last class, the amount of reference data is not sufficient for the GMM model creation and training due to a different context of an artifact each time it appears. Therefore, the GMM-based classifier must be created in the open set with the 6th class containing all artifacts that had not been classified as the types 1-5.
The last part of our experiment begins with training of the united GMM models on speech signals of the start, body, and end of the artifact with ±i frames in the left/right vicinity of ROI. In the classification phase, the input feature vectors from the tested sentence are compared in parallel with 3 trained GMMs to obtain 3 output vectors of the normalized scores. These output scores are analyzed to determine the maximum overall probability in the discriminator block performing basic classification to one of M output classes assigned to each of the processed speech feature vectors (see Fig. 7). Next, the class distribution based on histograms is constructed and the maximum occurrence is determined. The final classification block works with M+1 output classes -the virtual class is added to the basic closed set of M artifact types to create the open-set artifact identifier. The classification strategy is based on the consideration that when the class distribution has no dominant class, the whole tested sentence finally belongs to the 6th class. Practically, the maximum occurrence is compared with the threshold Tresh 0 given as a ratio between the number of the currently processed frames and the number of the basic classes (M).
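The open-set decision over the per-frame classification results can be sketched as below. The function name and the input encoding (one class label per processed frame) are assumptions; the threshold rule with the 20 % margin follows the text:

```python
import numpy as np

def classify_open_set(frame_classes, n_classes=5, margin=1.2):
    """Open-set sentence-level decision from per-frame class labels
    (1..n_classes). If no class occurs more often than
    margin * Tresh_0, where Tresh_0 = n_frames / n_classes, the sentence
    is assigned to the extra virtual class (n_classes + 1)."""
    frame_classes = np.asarray(frame_classes)
    # occurrence histogram over the basic classes 1..n_classes
    counts = np.bincount(frame_classes, minlength=n_classes + 1)[1:]
    thresh = margin * len(frame_classes) / n_classes   # 1.2 * Tresh_0
    if counts.max() > thresh:
        return int(np.argmax(counts)) + 1              # dominant artifact type
    return n_classes + 1                               # virtual 6th class
```

With a flat class distribution the maximum occurrence stays below the threshold, so the sentence falls into the 6th class, exactly as the classification strategy above requires.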

Material, Experiments, and Results
Three basic comparison experiments were performed within the research described in this paper: the first one is the verification of the functionality of the ANOVA-based artifact detector using the synthetic speech produced by the Czech TTS system. The second experiment compares the automatically localized artifact positions obtained by the GMM-based classifier with the manually determined ones. The third experiment consists in testing and verifying the proposed automatic GMM-based artifact type classifier.
The correctness of the selection of the ROI with the artifact inside the tested sentence was checked for its influence on the accuracy and the stability of the classification results. In the auxiliary experiments we analyzed the influence of different types of speech spectral features and supra-segmental parameters on the resulting artifact detection accuracy. Next, the localization accuracy was analyzed using the artifact position relative error (APE rel) and then compared with regard to the number of used GMM mixture components. Furthermore, the dependence of the error rate of artifact classification (ERAC) on the number of GMM components and on the method of the covariance matrix calculation was analyzed. Finally, the computational complexity (CPU processing time) was evaluated with the aim of finding the critical parts of the proposed algorithms and subsequently optimizing them for real-time processing.

Material and Processing Conditions
The artifact detection, localization, and classification experiments use the synthetic speech produced by the Czech TTS system implementing the USEL synthesis method [17], [18], [19]. The main speech corpus was divided into two parallel groups of 40 declarative sentences of the male/female voices. The first group comprises the sentences without any audible artifact, designated as "clean"; the second group consists of the same sentences produced by the same male and female TTS voices with just one speech artifact in each sentence. All the sentences with durations of 2.5 to 5 s were sampled at 16 kHz. The derived database of ROIs of the artifacts was used for training of the GMM models to classify the artifact types. Independence of the male/female voices during the training and the testing was achieved by the k-fold cross-validation of the data. The groups of sentences were divided in the ratio of 3:1, i.e. three for the training and one for the testing/classification. Due to a limited number of sentences with "real" artifacts occurring during the TTS synthesis, the classical cross-validation data selection could not be used in the GMM-based artifact localization. Therefore, for the testing in the localization experiment, another 20 + 20 sentences with artifacts were derived by cutting or adding a sentence part, using a signal from another sentence, etc., to change the position of the artifact in the sentence.
For the determination of the MUTI SF values, 25 different types of speech features were tested: 10 prosodic parameters, 10 basic spectral features, and 5 supplementary spectral features (see the detailed results for the prosodic features in Fig. 8a, the basic spectral features in Fig. 8b, and the values for the supplementary spectral ones in Tab. 1). Due to the statistical similarity between the "clean" and "artifact" groups, the 10 features with the lowest mutual independence were omitted, so 15 speech features were used for the ANOVA-based detection. In accordance with the previous research [10], the basic classification of artifacts in the speech utilizes 6 feature sets of 9 items (Tab. 2). The influence of different numbers of features with high mutual independence between the synthetic speech with and without artifacts was analyzed for three feature vector lengths: 5, 9, 15 (PN5, PN9, PN15). The shortest one, PN5, consists of the features with the five smallest MUTI SF values: En r0, HNR, F0 DIFF, jitter, shimmer. The second one, PN9, also includes the features with the MUTI SF value below the threshold, containing the features of the set P3 for the male voice and P4 for the female voice. The extended vector PN15 consists of the features with MUTI SF < N SFC and h = 1: En c0, En r0, HNR, S centr, S spread, S skew, S tilt, SFM, SHE, F0 DIFF, L ZCR, F ZCR, F 1/F 2, J abs, AP rel. According to the results published in [10], [20], the length of the input data vector for GMM training and testing was set to 16.
The objective ANOVA-based evaluation was performed separately for each gender. The resulting artifact detection accuracy was calculated from the number X a of correctly identified artifact/clean sentences and the total number N u of sentences as (X a / N u) × 100 [%]. The artifact neighborhood before its beginning and after its end was set to ±11 frames in correlation with [10]. The artifact position relative error APE rel in frames was calculated as the average of the absolute position errors of the starting and the ending parts APE ABSstart, APE ABSend in every sentence as

APE rel = (APE ABSstart + APE ABSend) / (2 · w O),   (2)

where w O is the frame shift for analysis chosen as one fourth of the frame length in samples. To determine the dominant class inside the open-set classification, the threshold was set experimentally to 1.2 × Tresh 0 (i.e. adding 20 % to the basic level given by the calculated P/M ratio). In all cases, the ROI was selected manually for further comparison and evaluation of the ERAC, calculated from the number X C of the sentences with the correctly determined artifact class and the total number N T of the tested sentences as

ERAC = ((N T - X C) / N T) × 100 [%].   (3)

The described speech signal processing was realized in the Matlab environment (ver. 2012a), and the basic functions of the Nabney "Netlab" pattern analysis toolbox [21] were used in the GMM classifier. The computational complexity was determined using an UltraBook with the following configuration: Intel(R) Core i5-4200U processor at 2.30 GHz, 8 GB RAM, and Windows 10 (64-bit) OS.
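As a quick reference, the evaluation measures described in this subsection can be written as small helpers. The exact algebraic form of ERAC is reconstructed from the verbal description and is an assumption:

```python
def detection_accuracy(x_a, n_u):
    """Artifact detection accuracy in percent: (X_a / N_u) * 100,
    where X_a is the number of correctly identified artifact/clean
    sentences and N_u the total number of sentences."""
    return 100.0 * x_a / n_u

def erac(x_c, n_t):
    """Error rate of artifact classification in percent, from the number
    X_C of sentences with the correctly determined artifact class out of
    N_T tested sentences (reconstructed form, assumed complement of the
    classification accuracy)."""
    return 100.0 * (n_t - x_c) / n_t
```

For example, 38 correct decisions out of 40 sentences give a 95 % detection accuracy, and 36 correctly classified sentences out of 40 give an ERAC of 10 %.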

ANOVA-Based Artifact Detection
For verification of the functionality of the ANOVA-based artifact detector, the subjective artifact determination was performed using the conventional listening test "Synthetic speech quality evaluation - male / female voice" in the automated internet application located on the web page http://www.lef.um.savba.sk/scripts/itstposl2.dll. It was accessible from June 15 to 30, 2014, and then the results were processed. Twenty-one listeners (5 women, 16 men) took part in this subjective evaluation consisting of 42 listening tests (21 male, 21 female voices). This internet application in the form of an MS ISAPI/NSAPI DLL script runs on the server PC and communicates with the user within the framework of the HTTP protocol by means of HTML pages. The complete test consists of 10 evaluation sets with random selection of sentences. For each sentence there is a choice of three possibilities: "clean - without artifact", "with an artifact", or "other - cannot be recognized". The resulting confusion matrix of the results for the male/female voices is shown in Tab. 3; a comparison of the artifact detection accuracy based on ANOVA and the listening test can be seen in Fig. 9. The results obtained from the performed listening tests show a principally high success rate in the subjective evaluation of the synthetic speech artifacts. The best results are achieved in the case of the male voice (approx. 95%) in comparison with the accuracy of 89% for the female voice.

In this part of the experiment, the following two auxiliary investigations were performed: 1. the effect of the feature vector composition (sets P0-P5) on the clean/artifact detection accuracy (Fig. 10); 2. the effect of the number of used features on the mean clean/artifact speech detection accuracy (Tab. 4). Figure 10 shows that, for the male voice, the highest accuracy (94%) was achieved for the set P3 consisting of all three feature categories.
The best accuracy (84%) for the female voice corresponds to a different mix of these three feature categories (set P4), which, on the contrary, was almost the worst for the male voice. Generally, the artifact detection in the sentences was more successful for the male voice than for the female one.
This might be caused by a failure to find the proper features for classification (in principle, the female voice shows higher variability on the supra-segmental as well as on the spectral level). The second auxiliary ANOVA-based experiment documents that the accuracy is greatly affected by the limitation of the speech feature vector length N SF. Higher error rates were produced for numbers of features lower than 9; however, increasing to 15 features has no adequate impact on the artifact detection accuracy, as documented in Tab. 4.

GMM-Based Artifact Localization
The second basic experiment compares the automatically localized artifact positions obtained by the GMM-based classifier. The positions determined manually by the Praat program and by listening were used to calculate APE rel. In addition, two auxiliary experiments were realized with the aim to cover: 1. the effect of the number of mixtures N MIX = {16, 32, 48, 64, 128} during GMM training on APE rel (Fig. 11); 2. the computational complexity, i.e. the CPU times of the GMM training and classification phases for different numbers of mixtures (Fig. 12).

The obtained results in this part of the experiment document the proper functioning of the developed GMM-based artifact localizer. The analysis has shown a principal impact of different numbers of mixtures on the localization precision (see the bar-graph comparison in Fig. 11). For this reason, a suboptimum of 32/48 mixtures was finally applied. The complexity of computation depends on the number of applied mixtures only in the creation and training of the GMMs, not in the localization/classification. The bar-graph in Fig. 12 shows that the computation time is 10 times higher for 128 mixtures than for 16 mixtures; however, unexpectedly, according to Fig. 11 the accuracy of fixing the artifact position decreases for more than 32/48 mixtures in the case of the male/female voice.

GMM-Based Artifact Classification
The third experiment consists in testing and verifying whether the proposed automatic GMM-based classifier of artifacts is principally correct and produces a sufficiently low ERAC; see the best results in the form of the confusion matrices in Tab. 5 and 6 for the male/female voices. Then the influence of the selection of the ROI with an artifact in the tested sentence was compared (see the results in Tab.). The obtained results show that, if the ROI is not set and the whole sentence is analyzed, the error rate will be unacceptable, especially for the virtual 6th class.

Conclusions
From the main point of view, the task of finding an alternative to the standard listening tests was fulfilled. The proposed and tested artifact detection, localization, and classification methods are functional and produce results comparable with those obtained manually. The determined time durations of the performed tests and the listeners' feedback document differences between male and female listeners and their different approaches to the evaluation task: the female evaluators try to do it more carefully than the male ones, resulting in a paradox that their results are in practice worse than those of the male evaluators. At this point the "subjectivity" of the used method is well-founded, and it also supports our aim to find objective evaluation methods.
At present, only one male and one female voice are implemented in the tested Czech TTS system working with the USEL-based synthesis method [2], [6], [7]. Therefore, the speech features determined from the synthesized sentences are not actually gender-dependent (male/female voice), but speaker-dependent (according to the voices used for building the TTS inventory). Collecting the databases of speech features from the synthesized sentences with and without artifacts was very difficult and time consuming. Therefore, at present only a small number of sentences were processed for usage in the ANOVA artifact automatic detection experiment. The proper choice of the speech features used in the input feature vector is very important. However, the choice of the optimal feature set for the artifact detection is not universal: different feature sets had to be used for the male and female voices. Generally, the detection accuracy depends, first of all, on the elimination of statistical "similarities" between the clean/artifact groups and the group of features from the tested sentence. In the case of the developed GMM-based artifact localizer, the auxiliary analysis has shown a considerable impact of different numbers of mixtures on the localization precision. Next, the principal influence of the accurate setting of the ROI on the precision of the artifact type classification was confirmed. If the ROI with the artifact is set incorrectly, the output error rate rapidly increases up to 100%, making the whole artifact detection, localization, and classification system useless. The presented artifact detection system processes only one artifact in an analyzed sentence. Two or more artifacts in one sentence could be found by dividing the speech signal into two parts and performing independent artifact detection and localization in each part.
This step could be repeated several times; however, it is limited by the minimum time duration of the processed speech signal necessary for proper ANOVA analysis [9].
There are two imperfections which should be remedied in the near future to increase the performance and accuracy of the whole developed artifact detection, localization, and classification system. The first drawback lies in the fact that mistakes caused by inappropriate segment duration are not treated as a special group of artifacts although they represent a considerable part of the errors. At present, they are included in the 6th group, but it could be worth distinguishing between the wrong segment length and the incorrect element. The second drawback stems from a relatively limited speech corpus of 40 sentences. A larger database with a sufficient number of sentences must be built to integrate all recognized speech artifacts produced by the TTS system based on the USEL synthesis. Finally, the results of the computational complexity in the Matlab environment indicate the need for some optimization and implementation in a higher programming language for real-time processing.