Text Normalization for Telugu Text-to-Speech Synthesis

Most areas related to language and speech technology, directly or indirectly, require handling of unrestricted text, and Text-to-speech systems directly need to work on real text. To build a natural sounding speech synthesis system, it is essential that the text processing component produce an appropriate sequence of phonemic units corresponding to an arbitrary input text. A novel approach is used, where the input text is tokenized, and classification is done based on token type. The token sense disambiguation is achieved by the semantic nature of the language and then the expansion rules are applied to get the normalized text. However, for Telugu language not much work is done on text normalization. In this paper we discuss our efforts for designing a rule based system to achieve text normalization in the context of building Telugu text-to-speech system.


INTRODUCTION
The objective of the text processing component [1,2] is to process the given input text and produce the written form (orthographic form) of the text into the spoken form.This orthographic form is realized by the speech generation component either by synthesis from parameters or by selection of a unit from a large speech corpus.For natural sounding speech synthesis [3,4], it is essential that the text processing component produce an appropriate sequence of orthographic units corresponding to an arbitrary input text.The input to a TTS system is the raw text [6,7] as available in news websites, blogs, documents etc which contain the required text in font-encodings, native scripts and non-standard words [5] such as addresses, numbers, currency etc.The majority of the issues are associated in building a TTS for a new language is associated with handling of real-world text.Current state-of-art TTS system which are existing for well researched languages like English and other popular languages use rich set of linguistic resources such as word-sense disambiguation [8], morphological analyzer [9], Part-of-Speech tagging, letter-to-sound rules, syllabification, stresspatterns in one form or the other to build a text processing component of a TTS system.Processing the given text to readable [10] form consists in expanding abbreviations, converting names, numbers, acronyms, dates etc. into their spoken form.In real text, many non-standard representations of words appear, for e.g., numbers, abbreviations, acronyms, currency, dates, URLs.All these non-standard representations must typically be normalized to standard words before synthesis.For example, number 30 has to be expanded to "muppai" ముప్ ప ై a cardinal number or to "muppai rUpAyalu" , ముప్ ప ై రుపయలు or muppai nimishalu" , ముప్ ప ై నిమిషలు .This paper, presents the ongoing efforts made to identify the various non-standard representations of the words in unrestricted real Telugu text and ways of converting them to standard words.A novel approach to text processing is proposed, wherein the first level we combine tokenization and initial token classification as one stage, followed by a second level of token sense disambiguation.Finally the standard word representation is achieved using the expansion rules and the look up table (database).

Nature and Format of Telugu Text
Telugu is a South-Central Dravidian language predominantly spoken in the South Indian state of Andhra Pradesh where it is an official language.One of the four classical languages of India, Telugu ranks third by the number of native speakers in India (74 million), thirteenth in the Ethnologue list of most-spoken languages worldwide.It is one of the twenty-two scheduled languages of the Republic of India.Telugu has several features of Sanskrit that have subsequently been lost in Sanskrit's daughter languages such as Hindi and Bengali, especially in the pronunciation of some vowels and consonants.The scripts in Indian languages are stored in digital computers in ISCII, UNICODE and in transliteration schemes of various fonts.The input text could be any of these formats.

PROPOSED MODEL FOR TEXT NORMALIZATION
The architecture of text normalization (see Figure 1) consists of various modules like tokenization, token classification, token sense disambiguation and standard word generation.The proposed system is designed with a set of hand crafted rules which are required at each phase.The modularization is made keeping in view the nature of Telugu language.These modules convert the non standard representations of words in the sentences into standard representations by applying the rules at each phase.The following description gives the detailed explanation of each module.

Tokenization and Token Classification
A token is a string of characters, categorized according to the rules as a symbol.Tokens are defined by regular expressions, which are understood by a lexical analyzer generator.Tokenization and token classification are performed using a lexical analyzer.Tokenization is the process of breaking up a stream of text into words, phrases, symbols, or other meaningful elements called tokens.Each token is separated by spaces or entries in the separation list and is classified using the various classifications.
Initially an input text is taken and it is chunked into sentences based on the sentence delimiter.Whitespace is the most commonly used delimiter between words and is extensively employed for tokenization.But using whitespace as the only delimiter has a number of shortcomings: a token type which allows the occurrence of whitespace within the token will not be recognized as a single token, but split up into two or more tokens.Further, an important shortcoming is that every token so obtained will then have to go through a token identification process that identifies its token type/category.This approach might not be feasible for some languages like Chinese and Japanese which do not use any form of white space between words.In the proposed approach, tokenization and initial token classification/identification are achieved in a single step.

Token Sense Disambiguation
Once the tokens are extracted from the input text, the category of each token is identified by the lexical analyzer itself when there is no ambiguity arising from the formats of token.In case of ambiguity, the token with the possible token categories is given as output to facilitate further disambiguation.Disambiguation is generally handled by hand-crafted context-dependent rules which are difficult to write, maintain and adapt to new domain.Identification of token category involves high degree of ambiguity.The process of resolving the conflicts that arise when a single term is ambiguous is called token sense disambiguation.These conflicts are resolved using disambiguation rules.Token sense disambiguation can be mapped to general homograph disambiguation using decision-list based data-driven techniques.When dealing with unrestricted text input ambiguity exists among cardinal, ordinal, decimal numbers and time formats, etc.To remove this ambiguity a set of rules are framed and implemented to achieve token sense disambiguation.

Standard Word Generation
Standard word generation is accomplished by using expansion rules and look-up table.After identifying and disambiguating the token categories, expansion of non standard words is done by a combination of rules (e.g., for expanding numbers, currency, dates) and look-up tables (e.g., for abbreviations and acronyms).Token-To-Word rules are written, which are specific for each token type, and for each format within a token type.Finally all the non-standard representations are converted to standard words, or normalized which would then be processed in various applications.

IMPLEMENTATION OF THE SYSTEM
To explain in detail about the working of all the modules in the system an input text is taken as given below which includes various categories of input token types like time, money and normal string literals.

Tokenization
The given input will be divided into tokens by using the lexical analyzer.White space delimiter is used to tokenize the words of the text input.

Token Classification
In this phase, the tokens are classified based on their types and the tokens that are in the standard form does not require any kind of classification.Only స., 5:30, 50రూ, 4.50 tokens are classified as they are not in standard form.The first two tokens (స., 5:30) are given to time group, third token (50రూ,) is given to currency, and fourth token (4.50) is given to decimal numbers.

Token Sense Disambiguation
In the first three tokens, there is no ambiguity.So they are passed to next phase.Whereas the fourth token 4.50 in the above phase is considered as a decimal number and the expected output will be four point five zero, but the context in which it is used represents money.So the output should be in the form four rupees fifty paisa.Thus there is a necessity to frame the disambiguation rules to avoid ambiguity when classifying the tokens.

Standard Word Generation
During this phase the abbreviated tokens will refer to the look up table and the remaining tokens will refer to the O c t 5 , 2 0 1 3 expansion rules, based on this normalized text is generated.

Cardinal Numbers
A Cardinal Number is a number that gives the count of something, such as one, two, three, four, five.Cardinal numbers are also known as "counting numbers," because they represent the quantity.Cardinal numbers (Table 1) may appear in the input text to give the count of any unit like population, money, etc,.

Ordinal Numbers
An Ordinal Number is a number that tells rank or the position of something in a list. 1 st , 2 nd , 3 rd , 4 th , 5 th etc.When an ordinal number (Table 2) occurs it should not be spoken as a cardinal number, for example number "3" should be spoken as "third" but not "three".

Decimal Numbers
Decimal numbers consist of a point or dot which separates the whole number part from the fractional part of the number.In this kind of inputs there is a possibility of ambiguity due to the context in which the decimal number is used.When dealing with decimal numbers the fractional part need to be expanded differently depending on the context it is used.Few examples (Table 3) are mentioned which lead to word sense disambiguation depending on the context it is used.

Phone Numbers
A telephone number or phone number is a sequence of digits.Here each individual digit needs to be extracted and its number name has to be spoken rather than treating it as a number that represents currency or ordinal or cardinal number.
An example is shown in Table 4.

Date Formats
There are various date formats like dd/mm/yy, mm/dd/yy, yy/dd/mm where dd represents day, mm represents month and y/yy represents year.Day, month and year can be delimited by either using "/" or "-".In Telugu language (Table 5) the date can be give as month name followed by day and year.

Time
Telugu language uses a set of representations (Table 6) to mention the time depending on whether it is morning/evening or day/night.Time of a day is written in the 24-hour notation or 12-hour notation in the form hh:mm (for example 01:23) or hh:mm:ss (for example, 01:23:45), where hh is the number of hours, mm is the number of minutes and ss is the number of seconds.

Currency
The representation of currency in Telugu is given by using "రూ", any numeric value followed by this symbol indicates that the given numeric value is rupees and therefore it should be expanded using the currency notation.In Telugu language (Table 7) we first speak out the numeric value followed by the word "రూపయలు".Here the expansion should not be like phone numbers it should contain the currency notations like crores, lakhs, thousands, etc.

Abbreviations and Acronyms
An abbreviation is a shortening by any method; like taking a letter or group of letters taken from a word or phrase.An acronym is an abbreviation formed from the initial components in a phrase or word.Few examples (Table 8) are mentioned where, in the real text we find "Dr" but we speak out as "Doctor".

Address
The address consists of text which is to be read as it appears and few numbers like house number or flat number or a pin code which need to read accordingly.The expansion of house number or flat number is done using money notation whereas the pin code is represented as individual digits like phone number.

Percentages
A percentage is a number or ratio as a fraction of 100.It is denoted using the percent sign, "%".A percentage can be a single number or can range between two numbers, which is represented using "-".Then the input text containing (Table 10) such numeric"s like "20 -30%" should be read out as "twenty to thirty percent" not as "twenty minus thirty".

Symbols
When preprocessing the input text many symbols may appear which may need to be represented accordingly.Special care has to be taken to handle the different symbols (Table 11) that can be seen during text processing.

Coverage Analysis
The performance of the system (Table 12) is tested on the data obtained from different sources (web).The data is collected from various domains like news articles, sports, stock market related information, politics, business, health and etc.The test data was manually prepared in such a way that it consisted of all categories of tokens.The data collected consisted of 3250 lines with 28600 words.O c t 5 , 2 0 1 3

CONCLUSIONS
This paper presents the need for text to be preprocessed before it is handed to any synthesizer.The real unrestricted text may contain many different non standard formats, which need to be converted to a standard representation.The token types handled are Numbers (cardinal, ordinal and decimal), Phone Numbers, Time, Date and Year, Symbols, URL"s, Websites, Addresses and Literal Strings.A rule based system is designed to achieve word sense disambiguation and the standard word representation.An effort is made to identify most of the non-standard words that can occur in the unrestricted text and rules are designed, implemented and tested to a great extent.However for languages like Telugu, where research is in progress does not have enough linguistic resources, it involves several complexities starting from accumulation of text corpora in digital and processable format.Linguistic components are not available in such rich fashion for all languages of the world.