HIDDEN MARKOV MODEL BASED SPEECH SYNTHESIS SYSTEM IN SLOVAK LANGUAGE WITH SPEAKER INTERPOLATION

This paper describes the first experiments with speaker interpolation in a Slovak speech synthesis system. Interpolation provides an approach to voice characteristic conversion for hidden Markov model based text-to-speech synthesis. The main idea of this technique is to synthesize artificial speech with unseen and untrained speaker characteristics by interpolating between existing sets of pretrained models. This technique makes it possible to create new voices without adding data to the training procedure. This is a major advantage especially for low-resource languages such as Slovak, where it is often difficult to obtain the necessary amount of data. The obtained results show that the characteristics of the synthesized speech change from one male speaker to another according to the interpolation ratio, so that Slovak voices with new characteristics can be created.


INTRODUCTION
Speech synthesis systems are today represented mainly by computer systems that convert input text into an output audio file containing speech. In practice, the use of such systems often meets with users' reluctance to communicate with the device, because the synthetic speech sounds unnatural and creates a barrier in communication [1]. That is why research in this field is oriented towards developing new advanced methods and improving them, so that the output speech of these systems is as similar to human speech as possible. Hidden Markov model (HMM) based speech synthesis is one of the most progressive approaches to converting written text into speech [2]. Its strength lies particularly in its high flexibility: voices can be converted quite easily by adaptation [3], interpolation [4] or, for example, the eigenvoice technique [5]. These techniques are possible because hidden Markov models can be modified mathematically to obtain the desired versions.
Interpolation makes it possible to synthesize speech with unseen and untrained speaker characteristics by interpolating the HMM parameters among several pretrained speakers' HMM sets. The characteristics of the synthesized speech can be changed gradually from one speaker to another by means of the interpolation ratio between the HMM sets. The major advantage of this approach is that no further data are necessary, which makes it an effective way to create new voices, especially for low-resource languages such as Slovak, where it is often difficult to obtain the necessary amount of data.
The major contribution of this work is an extension of the existing sets of models for Slovak HMM-based speech synthesis. It is very important to have as many models as possible, particularly in the case of low-resource languages such as Slovak. This paper is organized as follows: Section 2 describes the HMM-based text-to-speech system with interpolation. Section 3 describes the speaker interpolation. Section 4 presents the experiments and results, and the conclusions are given at the end of the paper.

HMM-BASED SPEECH SYNTHESIS SYSTEM WITH INTERPOLATION
A block diagram of the HMM-based speech synthesis system is shown in Figure 1. The system consists of three parts: training, interpolation and synthesis [4]. The training data are divided into sub-databases, each recorded by a single speaker.
The training procedure is carried out for each sub-database separately, and this process results in a set of individual HMMs. In the HMM-based speech synthesis method, each HMM is a left-to-right model in which each output vector is composed of two components: the spectrum part, represented by the mel-cepstral coefficients and their delta and delta-delta coefficients, and the excitation part, represented by the excitation parameters and their corresponding delta and delta-delta dynamic features.
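The composition of the output vector can be illustrated with a minimal sketch that appends delta and delta-delta features to a sequence of static coefficients. The window coefficients below are an assumption (a choice commonly seen in HMM-based synthesis toolkits); the paper does not state which windows were used, and the names `apply_window` and `build_observation` are hypothetical.

```python
import numpy as np

# Assumed regression windows; the paper does not specify them.
DELTA_WIN = np.array([-0.5, 0.0, 0.5])
DELTA2_WIN = np.array([1.0, -2.0, 1.0])

def apply_window(static, window):
    """Apply a 3-tap window along time to a (frames, dims) array.

    The first and last frames are repeated so the output keeps the same length.
    """
    padded = np.pad(static, ((1, 1), (0, 0)), mode="edge")
    return (window[0] * padded[:-2]
            + window[1] * padded[1:-1]
            + window[2] * padded[2:])

def build_observation(static):
    """Stack static, delta and delta-delta features into one vector per frame."""
    delta = apply_window(static, DELTA_WIN)
    delta2 = apply_window(static, DELTA2_WIN)
    return np.hstack([static, delta, delta2])

# Toy "mel-cepstra": 5 frames with 2 coefficients each (illustrative values only).
mcep = np.arange(10, dtype=float).reshape(5, 2)
obs = build_observation(mcep)
print(obs.shape)  # (5, 6)
```

The same construction would apply to the excitation stream, with the excitation parameters in place of the mel-cepstral coefficients.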
The second part of the system is the interpolation step. The main task of this stage is to interpolate the representative HMM sets: a new HMM set is generated by interpolating between the representative speakers' HMM sets with an arbitrary interpolation ratio. The following section describes the interpolation process in more detail.
The synthesis part consists of two main components. The first is the text analyser, which converts a given text into a contextual label sequence. The second comprises several blocks responsible for parameter and duration generation from the HMMs and for excitation generation based on the generated excitation parameters. The vectors of mel-cepstral coefficients and the logarithmic values of the fundamental frequency are generated from the obtained HMM sequence, and the speech waveform is synthesized from these vectors by the speech synthesis filter, which is the last block of the synthesis stage. A conventional system works as a mel-cepstral vocoder with a simple impulse train as the excitation signal, where a sequence of periodic pulses and white noise is used together with a mel-log spectrum approximation (MLSA) filter [6]. Recently, several high-quality vocoders with more advanced excitation have been implemented in the system. Such methods include, e.g., MELP (Mixed Excitation Linear Prediction) [7], HSM (Harmonic/Stochastic Model) [8], an excitation model based on modeling of residues [9], STRAIGHT (Speech Transformation and Representation using Adaptive Interpolation of Weighted Spectrum) [10] and AHOcoder [11].
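The conventional pulse/noise excitation can be sketched as follows: voiced frames (F0 > 0) receive periodic pulses and unvoiced frames receive white noise. This is only an illustration under assumed settings; the sampling rate, frame shift and the function name `simple_excitation` are hypothetical and not taken from the paper, and the MLSA filtering that would follow is not shown.

```python
import numpy as np

def simple_excitation(f0_per_frame, sample_rate=16000, frame_shift=80, seed=0):
    """Pulse train for voiced frames (f0 > 0), white noise for unvoiced frames.

    sample_rate and frame_shift are assumed values for illustration.
    """
    rng = np.random.default_rng(seed)
    excitation = np.zeros(len(f0_per_frame) * frame_shift)
    next_pulse = 0.0  # position (in samples) of the next pitch pulse
    for i, f0 in enumerate(f0_per_frame):
        start, end = i * frame_shift, (i + 1) * frame_shift
        if f0 > 0:
            period = sample_rate / f0
            next_pulse = max(next_pulse, start)
            while next_pulse < end:
                # Scale the pulse so its average power matches unit-variance noise.
                excitation[int(next_pulse)] = np.sqrt(period)
                next_pulse += period
        else:
            excitation[start:end] = rng.standard_normal(frame_shift)
            next_pulse = end
    return excitation

voiced = simple_excitation([100.0, 100.0, 100.0, 100.0])  # 100 Hz -> one pulse per 160 samples
unvoiced = simple_excitation([0.0, 0.0])
print(np.count_nonzero(voiced), len(unvoiced))
```

In a complete vocoder this signal would be passed through the synthesis filter controlled by the generated mel-cepstral coefficients.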
Recently, HMM-based speech synthesis has been reported for many languages, including large languages such as Mandarin Chinese [12] and Spanish [13]; the flexibility in the development of these systems has also enabled the integration of smaller languages such as Thai [14], Slovenian [15] and Slovak.

SPEAKER INTERPOLATION
Let the specific speakers who enter the training procedure be denoted $S_1, S_2, \ldots, S_N$, let the HMM sets pertaining to them be denoted $\lambda_1, \lambda_2, \ldots, \lambda_N$, and let the target speaker $S$ be represented by the set of models $\lambda$. The distance between the target speaker $S$ and each specific speaker $S_k$ can then be measured by the Kullback information measure $I(\lambda, \lambda_k)$ between $\lambda$ and $\lambda_k$ [16]. When interpolating between the $N$ HMM sets $\lambda_1, \lambda_2, \ldots, \lambda_N$, the interpolated HMM set $\lambda$ can be determined as the set that minimizes a cost function of these distances. If each HMM state has a single Gaussian output probability density, it is only necessary to interpolate the $N$ Gaussian probability density functions (pdfs), and the Kullback information measure can be rewritten in terms of their parameters. Figure 2 shows the spatial view of the interpolation between the specific speakers and the target speaker.
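As a worked sketch of the single-Gaussian case, let the $k$-th speaker's state pdf be $p_k = \mathcal{N}(\mu_k, U_k)$ of dimension $d$, let the interpolated pdf be $p = \mathcal{N}(\mu, U)$, and let the interpolation ratio be given by weights $a_k$ that sum to one. These symbol names are assumptions for illustration, as is the use of the directed Kullback-Leibler divergence $D(p \,\|\, p_k)$ as the distance; the measure used in [16] may differ (for example, a symmetric form), in which case the expressions below would change.

```latex
\begin{align}
\varepsilon(\mu, U) &= \sum_{k=1}^{N} a_k \, D\big(p \,\|\, p_k\big), \qquad \sum_{k=1}^{N} a_k = 1, \\
D\big(p \,\|\, p_k\big) &= \tfrac{1}{2}\Big[ \operatorname{tr}\big(U_k^{-1} U\big) - d
  + (\mu - \mu_k)^{\top} U_k^{-1} (\mu - \mu_k) + \ln\frac{\det U_k}{\det U} \Big], \\
\frac{\partial \varepsilon}{\partial U} = 0 \;&\Rightarrow\; U = \Big( \sum_{k=1}^{N} a_k U_k^{-1} \Big)^{-1}, \\
\frac{\partial \varepsilon}{\partial \mu} = 0 \;&\Rightarrow\; \mu = U \sum_{k=1}^{N} a_k U_k^{-1} \mu_k .
\end{align}
```

Under this assumption, when all covariances are equal the result reduces to a plain weighted average of the means, $\mu = \sum_k a_k \mu_k$, so moving the weights moves the interpolated pdf along the line between the speakers.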

Fig. 2 Spatial view of the interpolation of HMM models
The use of interpolation in HMM-based speech synthesis thus enables a smooth change of the speaker characteristics by means of the interpolation ratio between the speakers [17]. It is also possible to change smoothly from one speaker to another, or to change the level of emotion in emotional speech synthesis.
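The smooth change with the interpolation ratio can be illustrated numerically for a single Gaussian state pdf. The sketch below assumes that the interpolated pdf minimizes the weighted sum of directed Kullback-Leibler divergences to the speakers' pdfs, which gives a precision-weighted closed form; this choice, the function name `interpolate_gaussians`, the toy parameters and the ratios are all illustrative assumptions, not values or methods confirmed by the paper.

```python
import numpy as np

def interpolate_gaussians(means, covs, weights):
    """Interpolate N Gaussians by minimizing the weighted sum of KL(p || p_k).

    Assumption (not taken from the paper): the directed KL divergence is used,
    which yields a precision-weighted closed-form solution.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize the interpolation ratio
    precisions = [np.linalg.inv(np.asarray(c, dtype=float)) for c in covs]
    precision = sum(a * p for a, p in zip(weights, precisions))
    cov = np.linalg.inv(precision)
    mean = cov @ sum(a * (p @ np.asarray(m, dtype=float))
                     for a, p, m in zip(weights, precisions, means))
    return mean, cov

# Toy state pdfs for two hypothetical speakers (illustrative values only).
mean_a, cov_a = np.array([0.0, 1.0]), np.diag([1.0, 1.0])
mean_b, cov_b = np.array([2.0, 3.0]), np.diag([1.0, 1.0])

# Arbitrary example ratios; they are not the ratios used in the experiments.
for ratio in [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]:
    mean, cov = interpolate_gaussians([mean_a, mean_b], [cov_a, cov_b], ratio)
    print(ratio, mean)
```

With equal covariances the interpolated mean moves linearly from one speaker's mean to the other's as the ratio changes; with unequal covariances, the speaker with the smaller variance pulls the mean more strongly.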

EXPERIMENTS
In total, two previously trained HMM-based voices, marked as MSM and ADM, were used and evaluated for the Slovak language using the speaker interpolation technique [18]. These systems use previously developed modules for Slovak text analysis together with the proposed language-dependent context clustering. For these experiments, 5-state left-to-right models together with the conventional MLSA filter approach were used. Five different types of synthesized speech were generated: two from the pretrained HMM sets and three from newly created interpolated sets. The interpolation ratio was set as follows:

ISSN 1335-8243 (print) © 2015 FEI TUKE ISSN 1338-3957 (online), www.aei.tuke.sk

ABX listening test
The newly created interpolated samples were evaluated by subjective listening tests. The ABX listening assessment was used for this purpose. In this test, five new sentences that were not part of the training data were evaluated. The basic idea of the test is that three speech samples are used in the evaluation procedure. The speech samples A and B are synthesized from the original HMM sets of the voices without interpolation. In our case, samples A and B correspond to the sentences generated by the MSM and ADM systems. These two samples serve as references in the evaluation procedure. The assessed sample, marked as X, is one of the interpolated synthesized speech samples with a particular interpolation ratio. The role of the participants is to judge whether sample X is closer to sample A or to sample B. In our experiments, a total of 13 participants evaluated the five submitted samples with different interpolation ratios. Figure 3 shows the experimental results of the ABX listening test.
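The scoring of such a test can be sketched in a few lines: for each interpolated sample X, the share of judgements "closer to A" and "closer to B" is computed over all listeners and sentences. The function name `abx_preference` and the responses below are hypothetical and do not reproduce data from the experiments.

```python
from collections import Counter

def abx_preference(responses):
    """Return the percentage of 'A' and 'B' answers in a list of ABX judgements."""
    counts = Counter(responses)
    total = counts["A"] + counts["B"]
    if total == 0:
        raise ValueError("no valid responses")
    return {label: 100.0 * counts[label] / total for label in ("A", "B")}

# Hypothetical judgements for one interpolated sample X (not data from the paper).
responses = ["A", "A", "B", "A", "B", "B", "B", "A"]
rates = abx_preference(responses)
print(rates)  # {'A': 50.0, 'B': 50.0}
```

Plotting the "closer to B" rate against the interpolation ratio gives the kind of curve summarized in an ABX result figure.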

Fig. 3 Results of ABX listening test
The horizontal axis represents the rate at which speech samples from the newly created interpolated HMM models were judged to be closer to the HMM models of the ADM speaker, and vice versa. It is evident that the interpolated synthesized samples faithfully reflect the particular interpolation ratio. As can be seen, the largest deviation occurs when the interpolation ratio is equal to ) 0 , . This deviation can still be considered small; it could be caused, for example, by inattention of the participants. The results for the other interpolation ratios can be considered very good.

Comparison of the generated fundamental frequency
Figure 4 shows the generated fundamental frequency of the interpolated speech samples. The five plots from top to bottom correspond to the interpolation ratios between the two previously described sets of HMM models. The same Slovak sentence "Syntéza slovenčiny" was synthesized in each sample. An overall decrease of the frequency is evident as the interpolation ratio changes from one speaker to the other. The difference is especially noticeable at the beginning of the frequency contour, where two gaps were formed. In general, the fundamental frequency interpolation can be considered appropriate.

Comparison of the generated spectra
The generated spectra of the synthesized interpolated speech are shown in Figures 5 and 6. The same sentence as for the fundamental frequency was used, and only the frequency range from 0 to 8 kHz is shown, because significant differences between the speech samples occurred only there.
Figure 5 shows the overall spectrum of the synthesized samples, where it is evident that the amplitude changes smoothly from one speaker to the other.
Figure 6 shows the change of the spectrum over time, with the spectrum displayed for each frame.

CONCLUSIONS
This paper presents the first experiments with speaker interpolation for a Slovak HMM-based speech synthesis system. In total, three new sets of models were created with different interpolation ratios between the two previously trained sets of models.
The obtained results showed the potential of new voice creation with this technology. Interpolation provides an efficient way to create new Slovak voices by combining pretrained HMM sets, which is a great advantage in developing and expanding Slovak text-to-speech systems.

Fig. 1 Block diagram of speech synthesis system with interpolation

The training part consists of the training of the sets of HMMs [6]. Its main task is the extraction of the spectral and excitation parameters from the speech databases as well as the training of the HMMs. In the case of interpolation-based training, the speech database consists of multiple sub-databases, each of which was recorded by a single speaker.

Table 1 shows the description of the MSM and ADM voices.

Table 1 Description of the initial HMM-based Slovak voices