Wavelet LPC with Neural Network for Spoken Arabic Digits Recognition System

The crucial problem for Arabic recognition systems is the existence of several dialects of the Arabic language, particularly those with sound variations, which leads to low recognition rates. In this research paper the authors present a dialect-independent, highly effective wavelet transform (WT) based Arabic digits classifier. The proposed system can be divided into two main blocks: feature extraction, which combines the wavelet transform with linear prediction coding (LPC), and classification by a probabilistic neural network (PNN). The proposed classifier provided a high recognition rate, reaching up to 100% in some cases.


INTRODUCTION
These days interactive voice response systems are increasingly and widely used, especially for speaker-independent recognition of given vocabularies conveyed over the telephone network or a microphone, as investigated by [1]. The recent growth of activity in the mobile communication domain has inaugurated a new era of opportunities for speech recognition applications, including digits and sentences, in text-to-speech and speech-to-text conversion, as well as in many other computer applications. The English language has achieved immense success here and forms the major part of the interest. On the other hand, Arabic speech recognition has attracted little attention due to the language's various dialects and several alphabet forms.
The major works that study speech recognition in the Arabic language deal with the morphological structure [1][2][3] or with phonetic features, in order to recognize the distinct Arabic phonemes (pharyngeal, geminate and emphatic consonants) [4,5] and to discuss their implications for a larger-vocabulary speech system. This opens an interesting field for researchers, insofar as recognition systems dedicated to spoken isolated words or continuous speech are not extensively explored; only a few examples are improved upon in this research paper. A derivative scheme, named the Concurrent GRNN, implemented for accurate Arabic phoneme identification in order to automate intensity- and formant-based feature extraction, was studied in [6]. The validation tests, expressed in terms of recognition rate obtained with noise-free speech signals, reached up to 93.37%. An isolated-word speech recognizer using an RNN was investigated in [7]. The achieved accuracy was 94.5% in speaker-independent mode and 99.5% in speaker-dependent mode. Several Arabic speech recognition systems were discussed in [8].
The Fuzzy C-Means method has been added to the traditional ANN/HMM speech recognizer using RASTA-PLP feature vectors, giving a Word Error Rate (WER) over 14.4%. With the same approach, a method using data fusion gave a WER of 0.8% [9]. However, this method was tested only on one personal corpus, and the authors indicated that the obtained improvement required three neural networks working in parallel. Another hybrid alternative was proposed by [9], where the Support Vector Machine (SVM) and the K-nearest neighbor (KNN) classifier were substituted for the ANN in the traditional hybrid system; a better recognition rate was achieved as a result, but it did not exceed 92.72% for KNN/HMM and 90.62% for SVM/HMM.
A new algorithm to recognize isolated utterances of some Arabic words was presented in [10], where the digits from zero to ten were presented and compared. For feature extraction, transformation and recognition, the algorithm of minimal eigenvalues of Toeplitz matrices, along with other methods of speech processing and recognition, was used to achieve a more accurate recognition rate in speaker-independent mode. The success rate obtained in the presented experiments was almost ideal and exceeded 98% in many cases. A hybrid method was applied to Arabic digit recognition by [11].
In the literature, neural networks have been used to identify features of the Arabic language, such as emphasis, gemination and the relevant vowel lengthening [7]. This was studied using ANNs and other techniques [12], where many systems and configurations were considered, including time-delay neural networks (TDNNs). Bearing in mind that ANNs were used to recognize the 10 Malay digits [13], Saeed and Nammous (2005a) proposed a heuristic method of Arabic digit recognition using the Probabilistic Neural Network (PNN). The use of a neural network recognizer with a nonparametric activation function constitutes a promising solution for increasing the performance of speech recognition systems, particularly in the case of the Arabic language. [14,15] demonstrated the advantages of the GRNN speech recognizer over the MLP and the HMM in a quiet environment. However, the method investigated by the authors is applicable only in a quiet environment; in extremely noisy environments, the recognition performance degrades considerably. Robustness to noise is essential for professional use of recognition systems, particularly in the mobile network context [16,17]. Many studies have been conducted along this track [18,19]. Numerous pre-processing techniques have been developed to reduce or eliminate noise effects in the speech before it reaches the recognizer. Enhancement procedures like spectral subtraction [20,21] remove ambient surrounding noise. Transmission effects are reduced using equalization techniques such as cepstral normalization and adaptive filtering [22,23].

THE ARABIC LANGUAGE
Arabic is one of the most widely spoken languages in the world, with an estimated 350 million speakers across 22 Arab countries. Arabic is a Semitic language, characterized by the existence of particular consonants like pharyngeal, glottal and emphatic consonants. Furthermore, it presents some phonetic and morpho-syntactic particularities. The morpho-syntactic structures are built around patterned roots (CVCVCV, CVCCVC, etc.) [24].
The Arabic alphabet consists of 28 letters that can be extended to a set of 90 by additional shapes, marks and vowels [24]. The 28 letters represent the consonants and long vowels such as ‫ى‬ and ‫ٱ‬ (both pronounced /a:/), ‫ي‬ (pronounced /i:/), and ‫و‬ (pronounced /u:/). The short vowels and certain other phonetic information, such as consonant doubling (shadda), are not represented by letters but by diacritics. A diacritic is a short stroke placed above or below the consonant. Table 1 shows the complete set of Arabic diacritics. We split the Arabic diacritics into three sets: short vowels, doubled case endings and syllabification marks. Short vowels are written as symbols either above or below the letter in diacritized text, and dropped altogether in text without diacritics. There are three short vowels: fatha, which represents the /a/ sound and is an oblique dash over a letter; damma, which represents the /u/ sound and has the shape of a comma over a letter; and kasra, which represents the /i/ sound and is an oblique dash under a letter (see Table 1) [24]. Many aspects of the Arabic language, such as its phonology and syntax, do not present difficulty for automatic speech recognition. Standard, language-independent techniques for acoustic and pronunciation modeling, such as context-dependent phones, can easily be applied to model the acoustic-phonetic properties of Arabic. The most difficult problems in developing high-accuracy speech recognition systems for Arabic are the predominance of non-diacritized text material, the enormous dialectal variety and the morphological complexity.
The principal problem of the dialectal variety is the current lack of training data for conversational Arabic, whereas MSA data can readily be acquired from various media sources. Finally, morphological complexity is known to present serious problems for speech recognition. A high degree of affixation, derivation, etc. contributes to an explosion of distinct word forms, making it difficult, if not impossible, to robustly estimate language model probabilities. Rich morphology also leads to elevated out-of-vocabulary rates and bigger search spaces during decoding, thus slowing down the recognition process [3].

Arabic Digits
Arabic digits from zero to nine are polysyllabic words, except for the first one, zero, which is a monosyllabic word [2]. Table 2 shows the 10 Arabic digits along with their pronunciation, signals and number of syllables [7].

Table 2. Arabic digits in different dialects: Modern Standard, Egyptian, Jordanian and Palestinian (fragment)

'Eight': thā-mā-nê-yah / tā-mā-n-yah / thā-mā-n-yeh / tā-mā-n-yeh
'Two': ?ith-nān / te-nān / ?ith-nen / ?it-nān

Compared to other languages, Arabic digits are much longer: they comprise two to four syllables, while French, English and Mandarin digits have one or two syllables. Arabic digits can be considered representative elements of the language, because more than half of the phonemes of the Arabic language appear in the 10 digits. The fricative and plosive consonants are dominant and are characterized by the presence of noise in the high-frequency band of the spectrum. In fact, these consonants are easily corrupted by noise sources; therefore, speech recognition systems usually fail to identify them in adverse conditions [24].

Similarities between Arabic Digits
The similarity between Arabic digits, in terms of pronunciation and signal morphology, may lead to a high recognition confusion rate [7]. In this research paper we present some of these Arabic digit similarities:
- When digit 0 is examined against digit 1, we observe that the second phonemes of digits 0 and 1 are the vowels /i/ and /a:/ respectively, and their spectrograms show high similarity. The Power Spectral Density (PSD) of the two digits contains some common maximum peaks (see Fig. 1). An overlap between these phonemes may occur, hence causing a misleading match between these digits.
- The similarities between digits 0 and 2 are very small, as is evident when their spectrograms and PSDs are studied. This is also confirmed by the results of the digit recognition system, except for noise-contaminated digits.
- Examining digits 1 and 2, we find a large dissimilarity, especially in the second parts of their spectrograms. Digit 2 has a long vowel in its second syllable, and that syllable both starts and ends with the nasal phoneme /n/.
- The spectrograms of digits 1 and 3 contain large similarities. The PSDs of the two digits have two common core peaks at 40 and 10 on the frequency scale (see Fig. 1). Digit recognition systems therefore often produce confusions between them [7,24].
- Digits 1 and 4 have the same penultimate phoneme, the short vowel /a/. There are moderately common peaks in their PSD curves, but only small spectrogram similarities.
- There is little similarity between digits 2 and 3 in the number and type of syllables.
- Digits 3 and 8 have high pronunciation similarity; the sounds /h/ and /a/ are the first and second phonemes in both digits (i.e., the first syllable of both digits is exactly the same).
- There is a similarity between digits 4 and 5 in the last two phonemes: /a/ and /h/ are the final two phonemes in both digits. The second phonemes of each are also the same.
- There is a large pronunciation dissimilarity between digits 4 and 6. Digit 6 consists mostly of unvoiced consonants, namely /s/ and /t/ (twice), while digit 4 consists mostly of voiced phonemes, namely vowels and the consonants /r/, /b/ and /?/. There is low similarity between these digits in terms of PSD and spectrogram, and no recognition system confusions occur.
- Digits 4 and 7 have high similarity in terms of pronunciation but are different in terms of PSD and spectrogram.
- Digits 6 and 7 have identical syllable patterns, CVC-CVC, and the same first phoneme, /s/, but are different in terms of PSD and spectrogram.

Wavelet Packet Transform Feature Extraction Method
In order to achieve the best results in this work, the speech signal was decomposed by the wavelet packet transform (WPT), starting from the common form of the equivalent low-pass discrete-time speech signal:

x(t) = Σ_m X_m p(t − mT)   (1)

where X_m is the sequence of discrete speech samples, obtained from the data acquisition stage; p(t) is the signal pulse, which represents the signal design problem whenever there is a bandwidth restriction on the channel; and T is the sampling time. The processing of the speech is achieved by treating p(t − mT) as φ(t − mT), a scaling function of the wavelet packet, so that the finite set of orthogonal subspaces can be constructed as defined in [25][26][27] and the signal model (1) can be customized accordingly. The resulting speech signal model in equation (3) is the basic form of the wavelet packet transform used in signal decomposition: the signal is carried by orthogonal functions, which form a wavelet packet composition in the space W_{2^N}^0. We may also use the discrete wavelet packet transform (DWPT) procedure; for a given tree structure, the function ψ_l^n in equation (7) is called the constituent terminal function. For this research paper the tree used consists of two stages: three high-pass nodes and three low-pass nodes.
The wavelet packet is used to extract additional features in order to guarantee a higher recognition rate. In this work, WPT is applied at the feature extraction stage, but the raw coefficients are not suitable for classification due to their great length (for example, a speech signal of 35582 samples grows to 71166 coefficients after WPT decomposition at level two). Thus, we have to seek a better representation of the speech features. A good survey was conducted: [28] proposed a method to calculate the entropy value of the wavelet norm in digital modulation recognition. In the biomedical field, [27] presented a combination of a genetic algorithm and the wavelet packet transform for pathological evaluation, where the energy features are determined from a group of wavelet packet coefficients. [29] proposed a robust speech recognition scheme for noisy environments that uses wavelet-based energy as a threshold for denoising estimation. In [30], the energy indexes of the WP were proposed for speaker identification. Sure entropy is calculated for the waveforms at the terminal node signals obtained from the DWT [31] for speaker identification. [32,28] proposed a feature extraction method for speaker recognition based on a combination of three entropy types (sure, logarithmic energy and norm). In this paper we use LPC coefficients obtained from the WP tree nodes to construct the digit feature vector used for digit identification.
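The level-two WPT decomposition and the data growth described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: a Haar filter pair is assumed here, since the paper does not state the mother wavelet used.

```python
import numpy as np

# Haar analysis filters (an assumption for illustration; the paper does not
# specify the mother wavelet)
LO = np.array([1.0, 1.0]) / np.sqrt(2.0)   # low-pass (scaling) filter
HI = np.array([1.0, -1.0]) / np.sqrt(2.0)  # high-pass (wavelet) filter

def analysis_step(x, flt):
    """One filter-bank branch: filter, then downsample by 2."""
    return np.convolve(x, flt)[::2]

def wpt(x, levels):
    """Full wavelet packet tree: dict mapping level -> list of node arrays."""
    tree = {0: [np.asarray(x, dtype=float)]}
    for lvl in range(1, levels + 1):
        nodes = []
        for parent in tree[lvl - 1]:
            nodes.append(analysis_step(parent, LO))  # approximation child
            nodes.append(analysis_step(parent, HI))  # detail child
        tree[lvl] = nodes
    return tree

rng = np.random.default_rng(0)
x = rng.standard_normal(35582)            # stand-in for a speech signal
tree = wpt(x, levels=2)
# Keeping the nodes of both levels roughly doubles the data volume,
# matching the 35582 -> 71166 example above (up to filter-boundary samples).
total = sum(len(n) for lvl in (1, 2) for n in tree[lvl])
print(len(tree[2]), total)
```

This also makes concrete why a more compact representation (such as the LPC-based features used below) is needed before classification.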

Discrete Wavelet Transform Feature Extraction Method
The DWT represents an arbitrary square-integrable function as a superposition of a family of basis functions, the wavelet functions. A family of wavelet basis functions can be produced by translating and dilating the mother wavelet [33,34]. The DWT coefficients can be generated by taking the inner product between the original signal and the wavelet functions. Since the wavelet functions are translated and dilated versions of each other, a simpler algorithm, known as Mallat's pyramid tree algorithm, has been proposed (see Fig. 2) [33]. The DWT can be utilized as the multi-resolution decomposition of a sequence: it takes a length-N sequence a(n) as the input and produces a length-N sequence as the output.
The output has N/2 values at the highest resolution (level 1), N/4 values at the next resolution (level 2), and so on down to level m for N = 2^m. As described by the Mallat pyramid algorithm (Fig. 1), the DWT coefficients of each stage are computed from the previous one as follows [35]:

W_j^L(p) = Σ_n W_{j−1}^L(n) h(n − 2p),    W_j^H(p) = Σ_n W_{j−1}^L(n) g(n − 2p)

where W_j^H(p) is the pth wavelet coefficient at the jth stage, and h(n), g(n) are the dilation coefficients relating to the scaling and wavelet functions, respectively. The WT was proposed for recognition by [31]. In [36] and [32], the use of the DWT for speech recognition is proposed instead of the discrete cosine transform (DCT), since it has good time and frequency resolution and solves the problem of high-frequency artifacts introduced by abrupt changes at window boundaries. Features based on the DWT and WPT were chosen to evaluate the effectiveness of the selected features for speaker identification [35]. [37,38] stated that using a DWT approximation sub-signal over several levels, instead of the original signal, performed well against AWGN, particularly at levels 3 and 4, in a text-independent speaker identification system. Therefore, we use LPCC coefficients obtained from the DWT tree nodes to construct the digit feature vector used for text-independent digit recognition.
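The Mallat pyramid recursion above can be sketched as follows, again assuming a Haar filter pair for h(n) and g(n) (the paper's actual filters are not specified here):

```python
import numpy as np

# Haar filter pair (assumed for illustration)
h = np.array([1.0, 1.0]) / np.sqrt(2.0)   # scaling (low-pass) dilation coefficients
g = np.array([1.0, -1.0]) / np.sqrt(2.0)  # wavelet (high-pass) dilation coefficients

def dwt(x, levels):
    """Mallat pyramid: each stage splits the running approximation into a
    coarser approximation (via h) and detail coefficients (via g)."""
    a = np.asarray(x, dtype=float)
    details = []
    for _ in range(levels):
        details.append(np.convolve(a, g)[1::2])  # W_j^H: details at stage j
        a = np.convolve(a, h)[1::2]              # W_j^L: approximation at stage j
    return a, details

rng = np.random.default_rng(1)
x = rng.standard_normal(1024)
a5, ds = dwt(x, levels=5)
print([len(d) for d in ds], len(a5))  # N/2, N/4, ..., N/32 values per level
```

With an orthonormal filter pair, the decomposition preserves the signal energy, which is why wavelet-domain energies and entropies are usable as features.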

Average Framing LPC Feature Extraction Method
Before the feature extraction stage, the speech data are processed by a silence removal algorithm, followed by a pre-processing step in which the speech signals are normalized to make them comparable regardless of differences in magnitude, because the distribution of these magnitudes is closely related to the speakers' volume. To achieve this, the signals are normalized using the following formula [39]:

S_N(i) = (S(i) − μ) / σ

where S(i) is the ith element of the signal S; μ and σ are the mean and standard deviation of the vector S, respectively; and S_N(i) is the ith element of the signal series S_N after normalization.
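The normalization formula can be sketched directly (the toy signal here is hypothetical):

```python
import numpy as np

def normalize(s):
    """Zero-mean, unit-variance normalization: S_N(i) = (S(i) - mu) / sigma."""
    s = np.asarray(s, dtype=float)
    return (s - s.mean()) / s.std()

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # toy stand-in signal
y = normalize(x)
print(y.mean(), y.std())  # mean ~0, standard deviation ~1
```

After this step, two recordings of the same digit at different volumes yield comparable sample distributions.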
The LPC method is not a new technique for modeling the vocal tract parameters of speech. It was developed in the 1960s by [40] and is still used in recent papers for vocal tract modeling. The reason is that representing a speaker by modeled vocal tract parameters yields a data size well suited to speech compression over a digital channel [39]. In this paper, a modified LPC coefficients approach is suggested to reduce the size of the feature vectors. The proposed wavelet average framing LPC (AFLPC) extracts the features from Z frames of each WT speech sub-signal, where Z is the number of considered frames (each frame of 20 ms duration) for the qth WT sub-signal u_q(t). The average of the LPC coefficients calculated over the Z frames of u_q(t) is used as the wavelet sub-signal feature vector aflpc_q, and the feature vector of the whole speech signal is represented as:

AFLPC = [aflpc_1, aflpc_2, ..., aflpc_Q]   (12)

In this paper we use the AFLPC taken from the WP at level two (AFWP) and from the DWT at level 5 (AFDWT).
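The AFLPC idea can be sketched as follows. This is a minimal illustration under stated assumptions: 20 ms frames at 8000 Hz, a hypothetical LPC order of 12 (the paper does not fix the order at this point), and an LPC step that solves the Yule-Walker equations directly rather than by the faster Levinson-Durbin recursion.

```python
import numpy as np

def lpc(frame, order):
    """LPC via the autocorrelation method: solve the Yule-Walker system
    R a = r directly (Levinson-Durbin is the classical fast solver)."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

def aflpc(sub_signal, fs=8000, frame_ms=20, order=12):
    """Average framing LPC: the mean of the per-frame LPC vectors of one
    WT sub-signal gives that sub-signal's feature vector aflpc_q."""
    n = fs * frame_ms // 1000                      # 160 samples per 20 ms frame
    frames = [sub_signal[i:i + n] for i in range(0, len(sub_signal) - n + 1, n)]
    return np.mean([lpc(f, order) for f in frames], axis=0)

rng = np.random.default_rng(2)
u_q = rng.standard_normal(8000)                    # stand-in for one WT sub-signal
v = aflpc(u_q)
print(v.shape)  # one averaged LPC vector per sub-signal
```

Concatenating one such vector per wavelet sub-signal gives the AFLPC feature vector of equation (12), at a fraction of the raw coefficient length.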

Proposed Probabilistic Neural Networks Algorithm
Ganchev [39][40][41][42][43] proposed a PNN with Mel-frequency cepstral coefficients for text-independent recognition. Although many researchers have presented enhanced versions of the original PNN, which are either more economical or exhibit appreciably better performance, for simplicity of exposition we adopted the original PNN for the classification task (see Fig. 3). The proposed algorithm is denoted by PNN and depends on the following construction:

Net = PNN(X, P, SP)

where X is a 180 × 24 matrix of 24 input speaker feature vectors (patterns) of 180 average framing LPC coefficients each, the method denoted above by AFLPC, taken from the DWT or WP sub-signals for net training.

Fig. 3. Structure of the original probabilistic neural network
The SP parameter is the spread of the radial basis functions. We use an SP value of one because that is a typical distance between the input vectors. If SP approaches zero, the network acts as a nearest-neighbor classifier; as SP becomes larger, the designed network takes several nearby design vectors into account. We create a two-layer network. The first layer has radial basis transfer function (RB) neurons, as shown in Fig. 4,

and calculates its weighted inputs with the Euclidean distance (ED),

Fig. 4. Radial basis transfer function
and it computes its net input with a net product function, which combines the layer's weighted inputs and biases. The second layer has competitive transfer function neurons (see Fig. 5) and calculates its weighted input with a dot product weight function, a weight function that applies weights to an input to obtain weighted inputs. The net input function (called NETSUM) computes the layer's net input by combining its weighted inputs and biases. Only the first layer has biases. The PNN sets the first layer weights to X', and the first layer biases are all set to 0.8326/SP, resulting in radial basis functions that cross 0.5 at weighted inputs of ±SP. The second layer weights are set to P.
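The two-layer network described above can be sketched as follows. The data here are hypothetical; only the structure follows the description: a radial basis layer with bias 0.8326/SP applied to Euclidean distances, followed by a competitive per-class summation layer.

```python
import numpy as np

def pnn_classify(X_train, y_train, x, sp=1.0):
    """Minimal PNN in the form described above: the first layer applies the
    radial basis radbas(n) = exp(-n^2) to the Euclidean distance scaled by
    the bias 0.8326/sp (0.8326 = sqrt(ln 2), so the response crosses 0.5 at
    distance sp); the second, competitive layer sums responses per class."""
    b = 0.8326 / sp
    d = np.linalg.norm(X_train - x, axis=1)             # ED to every training pattern
    a1 = np.exp(-(b * d) ** 2)                          # radial basis layer output
    classes = np.unique(y_train)
    scores = [a1[y_train == c].sum() for c in classes]  # competitive layer
    return classes[int(np.argmax(scores))]

# Hypothetical toy data: two well-separated clusters standing in for digits
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 0.1, (5, 2)), rng.normal(3.0, 0.1, (5, 2))])
y = np.array([0] * 5 + [1] * 5)
print(pnn_classify(X, y, np.array([2.9, 3.1])))  # classified with cluster 1
```

With sp near zero the Gaussian responses become spikes and the argmax reduces to the nearest training pattern, which is the nearest-neighbor behavior noted above.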

Fig. 5. Competitive transfer function
Now, to test the network on a new feature vector (an outside, previously unseen pattern) for identification, the network simulation is performed.

EXPERIMENTAL RESULTS AND DISCUSSION
Speech signals were recorded via a PC sound card with a sampling frequency of 8000 Hz. The Arabic digits from zero to nine were recorded three times each in three Arabic dialects: Egyptian Arabic Dialect (EAD), Jordanian Arabic Dialect (JAD) and Palestinian Arabic Dialect (PAD), tabulated in Table 2. Three females (aged 20 to 30 years) and 11 males (aged 20 to 50 years) participated in the speech digit recording. The recording was done in a university office environment.
Our study of the dialect-independent classification system's robustness to noise is conducted through several experiments covering the considered aspects. Real noise, represented by restaurant noise, is also investigated. All experiments, as well as the comparison investigation, are performed in a common environment; the same testing and training signals are used when performing comparisons. Table 3 also shows the confused digit recognitions for certain cases: confusion is seen for Arabic digit 3, which was mixed with digits 1, 7 and 8, so a lower recognition rate was obtained for digit 3.

Experiment 1
A large confusion between digits 7 and 3 was found because of the similarities between these two digits, as shown in Fig. 2. In terms of recognition rate in the dialect-independent system, according to our implementation results, the Alotaibi classifier [7] achieves considerable results of about 80%, which is lower than the proposed classifier.
In the next experiment, two conventional methods were investigated for comparison: the Fast Fourier transform with a feed-forward back-propagation network (FFTNN) [34], and a time-frequency representation with a feed-forward back-propagation network (TFNN) [44]. The proposed method achieved superior performance, as tabulated in Table 6.

CONCLUSION
A wavelet-based feature extraction method has been proposed. This approach addresses the dialect-independent and speaker-independent difficulties. A probabilistic neural network is utilized in the classification part of the proposed classifier. The proposed classifier achieved a high recognition rate of up to 100% in some cases, with an average rate reaching 93% for 450 tested digit signals. The method's performance was tested in a noisy environment by adding WGN as well as restaurant noise. The comparison between the WP and the DWT was investigated; according to the results contained in Table 5, the DWT has superior performance. In terms of recognition rate in the dialect-independent system, according to our implementation results, the Alotaibi classifier achieves considerable results of about 80%, lower than the proposed classifier. A comparison with conventional methods was also studied. The reason for this success is the utilization of sophisticated feature extraction based on the wavelet transform in conjunction with LPC. The DWT outperformed the WPT in terms of recognition rate. Using the wavelet transform in the feature extraction procedure enhanced the results significantly, surpassing the conventional methods; the reason is the possibility of extracting features across the different wavelet levels.