Text-To-Speech Synthesizer for English, Hindi and Marathi Spoken Signals

This paper proposes a model of a Text-To-Speech (TTS) engine for the Marathi, Hindi and English languages. Characters and their representations are analyzed and synthesized with the help of the TTS engine, which produces spoken utterances from text. A concatenative approach based on linguistic rules is applied. To test the artificial voice generation, an analysis of prosody and Mean Opinion Score (MOS) tests were performed. A cepstral pitch detection algorithm was used to extract the fundamental frequency, and the mean and standard deviation were computed on the pitch readings of each spoken signal. MOS is a subjective test that depends on listeners who know any of the three languages. Scores were collected on two distinct MOS parameters: (1) listening quality, for which the result fell between fair and good; and (2) naturalness, for which the score fell between pleasant and slightly pleasant. Overall, satisfactory results were obtained for all three languages.


INTRODUCTION
Many primary students have major difficulties in learning and in hearing correctly. Many educational organizations use state-of-the-art facilities with several technologies. For any organization, technology helps to improve two aspects of education: the written form and the spoken form.

The written form concerns character shapes, and the spoken form is based on voice signals [1,2]. Voice and linguistic forms are being applied in the primary education domain. Linguistics is the scientific study of language, covering sounds, characters, words and letter-to-sound rules. Voice output is a pivotal part of correct speech generation. Some of these aspects involve a text and voice interface, and the text-to-speech synthesizer is one model of such an interface. An effective text-to-speech model tries to generate a synthetic voice as close as possible to the human voice. Clarity of the generated voice plays a vital role in conveying the message correctly [3][4][5].
For many years, text-to-speech synthesis has been considered a richer domain than other fields of speech processing. A TTS model converts text into voice form in a specific language. Before voice generation, the process is concerned with prosody, specifically the suprasegmental type of prosody. It involves three parts: duration of the speech, intonation level and pitch detection. These parts of prosody are fundamental needs of a TTS synthesizer. TTS models can be implemented with a number of techniques borrowed from speech processing, and many present techniques are designed using concatenative speech synthesis. The present work is based on a concatenative approach using linguistic rules. It focuses on any type of character and its representation in three unrelated languages. Ordinary students can easily understand the expression of the speaker, and primary students are able to improve basic skills such as learning and writing and to build their confidence. A major contribution of the TTS engine is its application for blind students [5][6][7].
The major objective of the TTS system is to convert text into voice signals. The received text can be any type of character with its representation in the three languages. Artificial voice signals free of noise are generated in the Marathi, Hindi and English languages.
The paper is organized as follows. The next section discusses past efforts on text-to-speech (TTS) synthesis systems. The third section explains the characteristics of the Marathi, Hindi and English languages. The fourth section proposes a speech synthesis engine for Marathi, Hindi and English text. The experimental work and result analysis are described in the fifth section. Finally, the sixth section gives the conclusion.

PAST EFFORTS OF TTS-SYSTEM
Numerous researchers have worked on the text-to-speech (TTS) synthesis domain in recent decades. A TTS synthesizer is an engine that produces sound artificially, and TTS models have been implemented for several languages. Katsunobu Fushikida et al. [1] developed a Japanese text-to-speech synthesis system for the personal computer, using formant parameters based on synthesis-by-rule. Richard Sprat [2] introduced a model of multilingual text-to-speech (TTS) that worked on morphological rules, numeral-expansion rules and phonological rules. Speech technologies are widely believed to make a useful contribution. Norbert Pachler [3] presented one of the most useful contributions for teaching and learning processes in foreign languages; the system was based on the concatenative approach. Gerry Kennedy [4] described the benefits of speech synthesis software. The software would benefit persons of any age who are visually impaired, primary students with learning difficulties, and mainstream students who enjoy hearing what they have written. Bhuvana Narasimhan et al. [5] proposed schwa-deletion in a Hindi text-to-speech system based on concatenative synthesis. Diemo Schwarz [6] designed a concatenative technique for musical sound synthesis that worked with a large sound corpus. With advanced computing devices, huge speech corpora have been used for naturalness and intelligibility. Hwai-Tsu et al. [7] described a Mandarin Text-To-Speech (TTS) synthesizer using the concatenative technique, for which 408 syllabic utterances were created. Usha Goswami [8] showed that synthetic phonics can be the preferred approach for young English learners. P. Zervas [9] presented a statistical evaluation of a prosodic database for Greek speech synthesis; the Greek speech was segmented into phones with a phone recognizer based on the HTK platform. Naim R. Tyson et al. [10] implemented a new concept in which the phenomenon of schwa-deletion in Hindi was handled in the pronunciation component of a concatenative text-to-speech system. Pamela Chaudhur et al. [11] developed a Telugu Text-To-Speech system using concatenative speech synthesis and carried out a mean opinion score test for intelligibility. Arun Soman et al. [12] proposed a corpus-driven Malayalam text-to-speech (TTS) system based on the concatenative synthesis approach for naturalness and intelligibility. Kwan Min Lee et al. [13] investigated the effect of student choice on social responses to computer-generated speech. Matej Rojc et al. [14] proposed a corpus-based text-to-speech synthesis system based on a unit selection algorithm for synthesized speech quality. Lakshmi Sahu et al. [15] presented a corpus-driven text-to-speech system for the Indian languages Hindi, Telugu and Kannada, which used human spoken voice. Catalin Ungurean et al. [16] designed an improved preprocessing unit for a high-quality Romanian TTS system, which contributes substantially to the clearness and natural sound of synthesized text. Madiha Jalil et al. [17] surveyed text-to-speech synthesis techniques, highlighting the digital signal processing component. Concatenative synthesis is easier to implement than rule-based synthesis because there is no need to establish speech production rules. Shreekanth T et al. [18] presented concatenative speech synthesis using phonemes, di-phones and allophones; syllables were the basic unit for the Hindi language. Vivek Hanumante et al. [19] aimed at converting English text into multilingual speech output using an Android application, which assisted people lacking the power of speech and non-native speakers. Mukta Gahlawant et al. [20] proposed a natural speech synthesizer using a hybrid approach.
In a natural synthesizer, the dynamic nature of human speech is very difficult to mimic with respect to two speech factors: intelligibility, meaning how easily the output is understood, and naturalness, meaning how closely the output sounds match the human voice. Theophile K. Dagba et al. [21] described a unit selection speech synthesis system for the Fon language, based on phonetic rules, post-lexical rules and letter-to-sound rules for automatic phonetic transcription of the received text. Sunil S. Nimbhore et al. [22] developed a first-level implementation of a natural-sounding speech synthesizer for the Marathi language using English script. It used the concatenative synthesis approach and evaluated formant frequencies, which are one part of prosodic analysis. Fiona S. Baker [23] examined the facts of text-to-speech systems, used with 24 innovative-English-speaking college students who were experiencing reading difficulties in their freshman year of college. Soumya Priyadarsini Panda et al. [24] proposed a module for text-to-speech synthesis in the Indian languages Hindi, Odia and Bengali using a pronunciation rule. The rule was based on the concatenation approach for a TTS model that takes the memory requirement into account. Smita P. Kawachale et al. [25] illustrated a new approach and methodology that helped to reduce the database size for syllable-based concatenative speech synthesis; the model worked systematically on natural human speech. We [26] attempted to design a TTS synthesis system for Marathi numerals based on a rule-based approach; the system benefits anyone who is able to understand the synthesized number. Saleh M. Abu-Soud [27] developed a new multilingual text-to-speech system based on inductive learning. The system is composed of three phases (analysis, learning and synthesis) and produced the correct phonemes with high accuracy. John O. R. Aoga [28] presented the integration of the Yoruba language into MaryTTS. The synthesis system used the unit-selection technique with a speech corpus containing 2415 different sentences. German Bordel et al. [29] dealt with either very long audio tracks or acoustically inaccurate text transcripts, presenting a probabilistic kernel based on phonetic knowledge using a large corpus of speech. Shinnosuke Takamichi et al. [30] introduced statistical parametric speech synthesis including text-to-speech and voice conversion. This system worked on Hidden Markov Model (HMM) and Gaussian Mixture Model (GMM) based voice conversion to improve synthetic speech quality. Generating a natural voice remains an uphill task in the speech synthesis field [10,16,18].

CHARACTERISTIC OF THE MARATHI, HINDI AND ENGLISH LANGUAGES
Marathi is used by approximately 90 million people in the Maharashtra and Goa states of India. It is used officially in most of the government and private sector in Maharashtra, and in the schools of Maharashtra state it is studied as the primary language [26]. In Marathi and Hindi there are five categories of vowels: short, long, conjunct, nasal and visarg, as shown in Table 1. A vowel is a sound produced by the vocal cords. A short vowel (V) makes a short sound in a word or syllable, and a long vowel makes a long sound, in both languages. A conjunct vowel is produced by a combination of short and long vowels. A nasal vowel is generated at a low tone with air passing through both the nose and the mouth. A visarg vowel is uttered as a voiceless sound after a vowel [18]. Marathi and Hindi have 33 and 34 consonants respectively, but the letter 'ळ' (la) is absent in Hindi. Table 2 shows the place of articulation and the manner of articulation, which stop air from moving out of the mouth; the places are ordered by successively more forward positions of the tongue. All consonants are organized into eight groups by place of articulation: guttural, palatal, cerebral, dental, labial, semivowel, sibilant and other letters. The five categories within each group are: (U) unaspirated unvoiced, (A) aspirated unvoiced, (U) unaspirated voiced, (A) aspirated voiced, and nasal. The nasal consonants of the first two groups are never written alone; usually they are conjoined with another consonant of their corresponding group [22][23][24].
English is a world language for communication among people, and it uses the Roman script in written form. Nowadays both Marathi and English are used in schools, institutes and other divisions for the convenience of students. All English sounds are divided into two broad categories: vowels and consonants. In the production of vowels, air comes out freely through the mouth. The 26 letters of the alphabet are classified into five vowels and 21 consonants, as shown in Tables 3 and 4. In phonetics, a vowel is a sound in spoken language, and a consonant is articulated with complete or partial closure of the vocal tract. Consonants involve friction when spoken, mostly through the position of the tongue against the lips, teeth and roof of the mouth. Consonants may be voiced or unvoiced; voiced consonants are pronounced with vibration of the vocal cords. Consonants can be classified according to place of articulation as follows. Labial is a combination of bilabial and labio-dental. Bilabial sounds are produced by the two lips, e.g. /P, B, M, W/. Labio-dental sounds, similar to labial, are produced by the lower lip against the upper teeth, e.g. /F, V/. Dental sounds are articulated by the tip of the tongue against the upper teeth, e.g. /T, D/. Alveolar sounds are articulated by the blade of the tongue against the teeth-ridge, e.g. /S, Z, N, L, R/. Palatal sounds are articulated by the front of the tongue against the hard palate, e.g. /J/. Velar sounds are articulated by the back of the tongue against the soft palate, e.g. /K, G, X/. Glottal sounds are produced by an obstruction or narrowing between the vocal cords, e.g. /H/. The other letters are /C, K, Q/. The classification by place of articulation is based on the manner of articulation. All letters of the Marathi, Hindi and English languages are used for speech analysis and synthesis [24][25][26][27].

PROPOSED SPEECH SYNTHESIS SYSTEM FOR MARATHI, HINDI AND ENGLISH TEXT
A TTS synthesizer is a system that electronically produces human-like sound. Text in the three distinct languages is recognized in one of two scripts: Marathi and Hindi are identified in the Devnagari script, and the Roman script supports English. To generate the audio form, the text is analyzed through prosody as shown in Fig. 1. A concatenative synthesis technique is used to concatenate the sound units.

Procedure of Text and Sound Selection Based on Concatenative Approach
For the text and sound selection engine, personal computers now have enough processing speed and memory capacity to serve as TTS platforms. The concatenative approach plays a key role in implementing the text-to-speech model. The model handles any type of character with its representation in three unrelated languages, using the linguistic process shown in Fig. 2. The procedure of text and sound selection involves two broad kinds of processing: linguistic processing and the producible audio form. Linguistic processing operates on syllables, which are constructed in various forms such as C, V, CV, VC, CCV, VVC and so on, where C denotes a consonant and V a vowel. The producible audio form [2][3][4][25] is an integral sub-component of the speech synthesis system; using the concatenative technique, it performs its role well.
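The syllable forms named above (C, V, CV, VC and so on) can be illustrated with a minimal sketch for Roman-script input. It assumes the five-vowel, 21-consonant letter classification used for English in this paper; the function name `cv_pattern` is hypothetical and is not part of the described engine.

```python
VOWELS = set("aeiou")  # the five English vowel letters; every other letter counts as a consonant


def cv_pattern(word: str) -> str:
    """Map each letter of a Roman-script word to 'V' (vowel) or 'C' (consonant)."""
    return "".join("V" if ch in VOWELS else "C" for ch in word.lower() if ch.isalpha())


print(cv_pattern("apple"))  # VCCCV
print(cv_pattern("go"))     # CV
```

A Devnagari version would need a script-specific vowel and consonant table, which is outside the scope of this sketch.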
The concatenative approach divides the input into tokens or units. This approach provides high intelligibility and close-to-natural sound. All vowels and consonants with their representations in the Roman or Devnagari scripts are stored as information, which is classified into two sections: text and voice. The received text can be available in Unicode or in phonetic form, but the phonetic form is used for the implementation. The text section is classified into natural and non-natural tags for each language. To analyze the natural tag, the characters in the three languages are normalized [6,18]. All characters are retrieved from a text dictionary, and the set of syllables is matched against the text in each language.
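Matching the received text against the units of a text dictionary can be sketched as follows. The paper does not state the matching strategy, so greedy longest-match segmentation is an assumption here, and the dictionary contents and the name `segment` are hypothetical.

```python
def segment(text, units):
    """Split `text` into dictionary units, always taking the longest unit that fits."""
    max_len = max(len(u) for u in units)
    pieces, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in units:
                pieces.append(piece)
                i += size
                break
        else:
            raise KeyError(f"no dictionary unit covers {text[i:]!r}")
    return pieces


# Hypothetical text dictionary in phonetic (Roman) form.
TEXT_DICTIONARY = {"na", "ma", "te", "s", "n", "a"}
print(segment("namaste", TEXT_DICTIONARY))  # ['na', 'ma', 's', 'te']
```

Greedy matching is simple but can fail on inputs where a shorter first unit would have allowed a full segmentation; a real system might backtrack or use the linguistic rules described in this section.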

Fig. 2. Architecture of the TTS engine using Concatenative-based approach for Marathi, Hindi and English Languages
The text and sound selection uses a context-sensitive rewrite rule of the form

$$A \rightarrow d \; / \; B \,\_\, C$$

where A is translated to d when it is preceded by B and followed by C. In implemented systems, the contexts B and C can be of any length for characters and their representations.
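The context-sensitive rewrite rule described above (A becomes d when preceded by B and followed by C) can be sketched on plain strings as below. This is an illustrative assumption about how such a rule might be applied, not the engine's actual implementation; the function name `apply_rule` and the example rule are hypothetical.

```python
def apply_rule(text, target, replacement, left, right):
    """Rewrite `target` as `replacement` only where it is preceded by `left`
    and followed by `right`. Contexts are checked against the original text."""
    if not target:
        raise ValueError("target must be a non-empty string")
    out, i = [], 0
    while i < len(text):
        if (text.startswith(target, i)
                and text[:i].endswith(left)
                and text.startswith(right, i + len(target))):
            out.append(replacement)
            i += len(target)
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


# Hypothetical rule: b -> d / a _ c
print(apply_rule("abc xbc", "b", "d", "a", "c"))  # adc xbc
```

An empty `left` or `right` string acts as "any context", since every position is preceded and followed by the empty string.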
A clean voice is a fundamental need for listening, but ordinary voice is recorded in a noisy atmosphere. Reducing the noise in the original voice signals is an uphill task in speech processing. Noise reduction can damage ordinary sound signals, but the spectral subtraction method does not affect the voice signals, so it is used with the following acoustic settings for quality speech: frame length, 30 ms; acoustic parameter, FFT spectrum (256 points); frame shift, 10 ms; sampling frequency, 22 kHz (22,050 Hz); analysis window, Hamming window [26].
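The framing parameters listed above can be made concrete with a short sketch that cuts a signal into 30 ms Hamming-windowed frames with a 10 ms shift at 22,050 Hz. It illustrates only the framing step, not spectral subtraction itself; truncating fractional sample counts and the helper names `hamming` and `frames` are assumptions.

```python
import math

SAMPLE_RATE = 22050  # Hz
FRAME_MS = 30        # frame length in milliseconds
SHIFT_MS = 10        # frame shift in milliseconds


def hamming(n):
    """Hamming window of length n (n > 1)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def frames(signal, sample_rate=SAMPLE_RATE):
    """Yield overlapping, Hamming-windowed frames of `signal`."""
    size = int(sample_rate * FRAME_MS / 1000)  # 661 samples at 22,050 Hz
    step = int(sample_rate * SHIFT_MS / 1000)  # 220 samples at 22,050 Hz
    window = hamming(size)
    for start in range(0, len(signal) - size + 1, step):
        yield [s * w for s, w in zip(signal[start:start + size], window)]


one_second = [0.0] * SAMPLE_RATE
print(sum(1 for _ in frames(one_second)))  # 98 frames in one second of audio
```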
Each text unit is matched with a quality sound unit retrieved from a speech dictionary. All selected voice units are merged into one speech unit by the concatenation process. The complete unit is sent to the audio system, whose main task is to generate the voice.
The audio system converts digital signals into analog signals, since human beings can perceive only analog signals [22,26,30].
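The unit lookup and merging step can be sketched as a dictionary lookup followed by concatenation of sample sequences. The speech dictionary below is hypothetical and holds only a few made-up sample values per unit; real entries would be full recordings.

```python
def concatenate(units, speech_dictionary):
    """Join the recorded samples of each unit, in order, into one utterance."""
    utterance = []
    for unit in units:
        utterance.extend(speech_dictionary[unit])  # KeyError if a unit was never recorded
    return utterance


# Hypothetical speech dictionary: unit -> recorded samples (shortened for illustration).
SPEECH_DICTIONARY = {"na": [10, 20], "ma": [30], "s": [40, 50], "te": [60]}
print(concatenate(["na", "ma", "s", "te"], SPEECH_DICTIONARY))  # [10, 20, 30, 40, 50, 60]
```

Plain concatenation can leave audible discontinuities at unit boundaries; smoothing at the joins is a common refinement that this sketch does not attempt.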

Enriched Prosody
Prosody is the heart of a TTS model. With the help of prosody, a speech corpus can be created and the prosodic features of speech signals extracted. Speech synthesis requires a speaker. The speaker may be either professional or non-professional, but should know the Marathi, Hindi and English languages, so a single non-professional speaker is sufficient for recording the sound signals. The sound information is in three different languages, and all sounds are recorded in a noisy environment. To enhance the speech dictionary, the speech signals go through the analysis of prosody shown in Fig. 3. Enriched prosody includes three parts: period of the sound signals, pitch detection of the spoken form and intensity of the spoken signals. All spoken forms in the three unrelated languages are stored in the speech dictionary. This section focuses on the suprasegmental aspects of prosody, namely the period of sound signals and pitch detection [5,10].

Period of Sound Signals
The period of a sound signal is found from the following equation:

$$\text{period} = \frac{l}{F}$$

where period is the total duration of the sound signal, l is the sampled information of the sound (its number of samples), and F is the actual sampling frequency of the sound signal [26].
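The period computation described above amounts to dividing the number of samples by the sampling frequency. A minimal sketch, with the hypothetical helper name `duration_seconds`:

```python
def duration_seconds(num_samples: int, sample_rate: float) -> float:
    """Total duration of a signal: number of samples divided by sampling frequency."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return num_samples / sample_rate


print(duration_seconds(22050, 22050))  # 1.0 (one second of audio at 22,050 Hz)
```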

Pitch Detection Technique
In speech processing, pitch, or fundamental frequency (F0), relates to the note of the voice. It is determined by the frequency of vibration of the vocal cords. For pitch detection, the cepstral technique is implemented. The cepstral technique has potential problems, which can be avoided by selecting frames from the phones when the phones contain unexpected signals [5]. The cepstral coefficients are found by the following equation:

$$c(n) = \mathcal{F}^{-1}\left\{ \log \left| \mathcal{F}\{x(n)\} \right| \right\}$$

where x(n) is the speech frame and $\mathcal{F}$ denotes the Fourier transform. The cepstral domain represents the frequency content of the logarithmic magnitude spectrum of a signal: the cepstrum is formed by taking the inverse FFT of the log magnitude spectrum (for a real signal this differs from a forward FFT only by a scale factor). The pitch can be estimated by picking the peak of the resulting signal within a certain range. A peak in the cepstrum indicates that the signal is a linear combination of multiples of the pitch, or fundamental frequency (F0). The pitch period is found from the index of the coefficient at which the peak occurs. The cepstral method makes it possible to detect the pitch whether a signal contains periodic, aperiodic or silent regions. Pitch is detected in a specific range of 140-340 Hz for a male or a female speaker [8][9][10].
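The cepstral peak-picking procedure described above can be sketched in pure Python as follows. To stay self-contained, the sketch uses a naive O(N^2) discrete Fourier transform where a real system would use an FFT library; the Hamming window, the small constant added before the logarithm, and the names `dft` and `cepstral_pitch` are assumptions rather than details from the paper.

```python
import cmath
import math


def dft(values, bins=None):
    """Naive discrete Fourier transform; returns only the requested output bins."""
    n = len(values)
    twiddle = [cmath.exp(-2j * math.pi * k / n) for k in range(n)]
    bins = range(n) if bins is None else bins
    return [sum(v * twiddle[(k * i) % n] for i, v in enumerate(values)) for k in bins]


def cepstral_pitch(frame, sample_rate, f_min=140.0, f_max=340.0):
    """Estimate F0 of one frame from the peak of its real cepstrum in [f_min, f_max]."""
    n = len(frame)
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    spectrum = dft([s * w for s, w in zip(frame, window)])
    log_mag = [math.log(abs(c) + 1e-10) for c in spectrum]  # small offset avoids log(0)
    q_lo = max(1, int(sample_rate / f_max))        # shortest pitch period, in samples
    q_hi = min(int(sample_rate / f_min), n - 1)    # longest pitch period, in samples
    quefrencies = range(q_lo, q_hi + 1)
    # The log-magnitude spectrum of a real frame is real and even, so the real part
    # of a forward DFT divided by n equals the inverse DFT (the real cepstrum).
    cepstrum = [c.real / n for c in dft(log_mag, quefrencies)]
    peak_quefrency = max(zip(cepstrum, quefrencies))[1]
    return sample_rate / peak_quefrency


# Synthetic test frame: a 30 ms impulse train with a 110-sample period at 22,050 Hz.
frame = [1.0 if i % 110 == 0 else 0.0 for i in range(661)]
print(round(cepstral_pitch(frame, 22050), 1))
```

Restricting the search to the quefrency range that corresponds to 140-340 Hz mirrors the pitch range stated in the text and keeps low-quefrency vocal tract components from being mistaken for the pitch peak.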

Recording and Storing of Information
Recording and storing of the information is part of data acquisition. The data acquisition covers vowels and consonants and is divided into two areas: a text dictionary and a speech dictionary. As no standardized text corpus for the Marathi, Hindi and English languages is available, the text dictionary was created for the primary education domain. A special document was designed for collecting the information, which was gathered manually from primary school books. The overall size of the text dictionary for the three unrelated languages is 4,910 syllables. The second area, the speech dictionary, was acquired with the standard PRAAT tool. The sound information is any type of vowel or consonant with its representation in the three languages. The sampling frequency of each sound sample is 22 kHz (22,050 Hz). Two further acoustic settings are a mono channel and the WAV file format, used for analysis and synthesis. A speech synthesizer needs only one male or female speaker for implementation; for this system, a single male speaker was used. The speaker should be familiar with the Marathi, Hindi and English languages. All characters with their representations were spoken by a male speaker in the age group 25 to 35, in a noisy environment, in a 10x12 room at the School of Computer Sciences, North Maharashtra University, Jalgaon [1,7,12]. Speech signals are divided into three categories of sound: normal, noise and clean signals.

Fig. 4. Design of the Cepstral Pitch Detection Technique
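The storage settings described above (22,050 Hz sampling frequency, one mono channel, WAV format) can be reproduced with Python's standard `wave` module. The 16-bit sample width is an assumption, since the paper does not state the bit depth, and `write_mono_wav` is a hypothetical helper; the recordings themselves were made with PRAAT, not with this code.

```python
import io
import struct
import wave

SAMPLE_RATE = 22050  # Hz


def write_mono_wav(samples, target, sample_rate=SAMPLE_RATE):
    """Write integer samples to `target` (a path or binary file object) as mono PCM."""
    with wave.open(target, "wb") as wav:
        wav.setnchannels(1)   # mono channel
        wav.setsampwidth(2)   # 16-bit samples: an assumption, not given in the paper
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))


# Round trip through an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
write_mono_wav([0, 1000, -1000, 0], buffer)
buffer.seek(0)
with wave.open(buffer, "rb") as wav:
    print(wav.getnchannels(), wav.getframerate(), wav.getnframes())  # 1 22050 4
```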

Mean Opinion Score Value
In voice communication, the quality of the output voice is subjective, and measuring it is a challenging issue. A numerical method of expressing voice quality is called the MOS (Mean Opinion Score). The MOS value is therefore assessed using two parameters, each expressed in the range from 0 to 5 as per Table 6.
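A MOS value is the arithmetic mean of the individual listener ratings for one parameter. The sketch below assumes the 0 to 5 scale stated above; the ratings are made-up illustrative values, not data from this study, and the name `mean_opinion_score` is hypothetical.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average a list of listener ratings on the 0-5 scale, rounded to two decimals."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(not 0 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie between 0 and 5")
    return round(mean(ratings), 2)


# Hypothetical ratings from five listeners for the two parameters.
listening_quality = [4, 3, 4, 5, 3]
pleasantness = [3, 3, 4, 4, 3]
print(mean_opinion_score(listening_quality))  # 3.8
print(mean_opinion_score(pleasantness))       # 3.4
```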

EXPERIMENTAL WORK AND RESULTS ANALYSIS
The TTS model is designed to convert characters with their representations into sound form. The speech synthesizer was subjected to two tests: an analysis of prosody test and a MOS test. The first test is expressed in auditory terms, as pitch detection and duration of the speech signals. The second test is the MOS (Mean Opinion Score), which depends on subjective analysis.

Analysis of Prosody Test
In the field of speech synthesis there are a number of open questions. One of the most important is how to generate artificial sound. When input text data is fed to a system, the machine can generate the synthesized sound. The analysis of prosody test therefore aims at evaluating the prosody of individual synthesized voice signals, and all synthesized sounds were passed through this test. The test involves two distinct tasks: pitch detection and period of the spoken form. The period of the spoken form contributes to the prosodic features. In pitch detection, numerical values are extracted from the synthesized clean sound in the Marathi, Hindi and English languages. Pitch detection based on the cepstral technique is shown graphically in Fig. 5 and Fig. 6. Fig. 5 (A) and Fig. 6 (A) show the time-domain waveforms, plotted as amplitude against time. Pitch estimation of the noise-free voice signals is shown as pitch tracking waveforms in Fig. 5 (B) and Fig. 6 (B). It is not practical to show all figures, so only one English word and one Marathi word are shown. The results of the test are given in Tables 8, 9, 10, 11, 12 and 13. The pitch frequency is computed by the cepstral pitch detection algorithm, and all pitch readings are reported as mean and standard deviation.
The tracked pitch values fall within a specific range: the measured pitch lies between 240-300 Hz for the mean and 100-120 Hz for the standard deviation. Some pitch readings of each utterance have to be neglected because various unexpected signals are close to zero hertz. The pitch frequency is thus computed for the three separate languages. The evaluation of the test depends on the prosodic features.
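The summary statistics described above can be sketched as follows: readings close to zero hertz are discarded, then the mean and standard deviation of the remaining pitch values are computed. The 1 Hz rejection threshold and the use of the sample (rather than population) standard deviation are assumptions, and the readings are made-up illustrative values.

```python
from statistics import mean, stdev


def pitch_statistics(readings, floor_hz=1.0):
    """Mean and sample standard deviation of pitch readings above `floor_hz`."""
    kept = [f for f in readings if f > floor_hz]  # drop readings close to 0 Hz
    if len(kept) < 2:
        raise ValueError("need at least two usable pitch readings")
    return mean(kept), stdev(kept)


# Hypothetical pitch track in Hz; zeros mark frames where no pitch was found.
track = [0.0, 250.0, 270.0, 0.0, 260.0]
print(pitch_statistics(track))  # (260.0, 10.0)
```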

Fig. 5. English Word "Apple" MALE Voice A] Speech Waveform Noise-Free B] Pitch Tracking

Mean Opinion Score Test
The second test of the TTS model is the MOS (Mean Opinion Score). MOS is a subjective listening test based on input received from listeners. A subjective question is put to all listeners, with six rating levels for listening quality and, similarly, for pleasantness. The MOS procedure was performed on any type of character and its representation in the Marathi, Hindi and English languages.
The listeners follow the instructions of the MOS section to judge the listening quality. Two MOS parameters were examined for the generation of human-like sound. Individual feedback was taken from 43 listeners, and the MOS was averaged for each language. The scores are reported in Tables 14 and 15. For Hindi vowels, the MOS was 3.57 for listening rate and 3.51 for pleasant rate, given by 11 listeners (ML-05 and FL-06), as shown in Table 16. For Hindi consonants, the mean MOS for listening rate and pleasant rate was 3.92 and 3.52 respectively, as shown in Table 17.
For English vowels, the mean MOS was 4.11 for listening rate and 2.99 for pleasant rate, given by 13 listeners (ML-07 and FL-06), as shown in Table 18. For English consonants, the mean MOS for listening rate and pleasant rate was 3.99 (close to good listening quality) and 3.69 (between pleasant and slightly pleasant) respectively, as shown in Table 19.

FUTURE SCOPE
In future work, the speech synthesizer will be extended to various number systems in the Marathi, Hindi and English languages. The spoken samples will be recorded by speakers in a noisy environment, and the prosody analysis of the voice signals will be evaluated by machine.