INVESTIGATION OF SPEECH DISFLUENCIES CLASSIFICATION ON DIFFERENT THRESHOLD SELECTION TECHNIQUES USING ENERGY FEATURE EXTRACTION

Filled pause and elongation are two types of speech disfluency that are frequently misclassified and therefore need more suitable acoustical features to be classified correctly. This work concentrates on developing an accurate and robust energy feature extraction for modelling filled pause and elongation by investigating different energy features based on the local maxima points of the speech energy. Method: In this paper, we extract peak values from each frame of a voiced signal by implementing different thresholding techniques to classify filled pause and elongation. These energy features are evaluated using a statistical naïve Bayes classifier to assess their contribution to the classification process. Various samples of sustained syllables and filled pauses of spontaneous speech were extracted from the Malaysian Parliamentary Debate Database of the year 2008. We performed F-measure evaluation to investigate the significant differences in mean of filled pause and elongation samples. Results: The proposed LM-E improves the classification, with F-measures of up to 71% for filled pause and 75% for elongation. Conclusion: The best accuracies achieved in both filled pause and elongation classification vary depending on the thresholding technique applied during the extraction of the local maxima of the speech energy. The most effective thresholding technique is our proposed technique, which uses an adaptive height as the threshold for extracting the local maxima of the speech energy (LM-E).


1. Introduction
Over the past decades, Automatic Speech Recognition (ASR) systems have offered invaluable contributions to various fields. The benefits of ASR can be clearly seen in read and planned speech, as speech is the main tool in daily communication and has been used in many applications (Zapata and Kirkedal, 2015). However, developing an ASR system becomes more challenging for natural speech due to the occurrence of disfluencies such as filled pause. Studies have reported that filled pauses degrade ASR performance because they interrupt the fluency of speech, increase ASR complexity, and cause confusion to machine-based recognition devices (Singh et al., 2012). This problem becomes pertinent when a vowel sound of a normal word is spoken relatively long at any position in an utterance, both within a word and between words. This occurrence, known as elongation, causes a normal word to be falsely detected as filled pause because elongation and filled pause share similar acoustical feature patterns (Kaushik et al., 2010).
Several established studies on filled pause detection have classified filled pause and elongation into the same disfluency class (Audhkhasi et al., 2009). However, classifying filled pause and elongation into the same disfluency class can affect ASR performance, as eliminating normal words from recognition may modify the intended context of a speech and lead to inaccurate transcription. According to (Kaushik et al., 2010), filled pause and elongation cause transcription problems in ASR. Many approaches have been used to separate filled pause and elongation. The most common is to extract the acoustical features of the filled pause for use in the classification. Various acoustical features have been used to model filled pause, such as energy, fundamental frequency, Mel-frequency cepstral coefficients (MFCC) and formant frequency. Among the well-established acoustical features, fundamental frequency is the most widely used, as can be found in (Gabrea et al., 2000; Goto et al., 1999; Audhkhasi et al., 2009; Kaushik et al., 2010). Fundamental frequency is associated with energy, as confirmed by (Rosenberg & Hirschberg, 2006) in their work, where energy is used to classify words as pitch-accented or non-accented. However, when conventional energy extraction is used, accurate modelling of filled pause and elongation cannot be achieved, as seen in (Li et al., 2010). Therefore, this paper addresses the exploitation of speech energy as a feature to accurately model filled pause and elongation.
Energy has been widely used in filled pause research (Garg & Ward, 2006; Li et al., 2008; Stouten et al., 2006). Its use can be found in filled pause studies of different languages such as Mandarin, European Portuguese and English. Since filled pause and elongation are language specific (Yusof et al., 2008), the reported performance of energy differs. It was shown in (Stouten et al., 2006) that energy is unable to differentiate filled pause and elongation in European Portuguese because both exhibit the same pattern of energy stability. In contrast, in (Li et al., 2008), energy together with MFCC and F0 showed promising classification performance for Mandarin filled pause. These studies suggest that combining a suitable feature with energy can improve classification compared with energy alone.
The energy of speech may be measured using several techniques such as log energy, sum of squared energy and sum of absolute energy. Generally, all of these techniques compute the sum of energy over each short frame. They are suitable and beneficial for speech involving normal words. However, the sum of energy cannot sufficiently represent filled pause, especially when filled pause needs to be differentiated from elongation. According to (Stouten et al., 2006), the current means of representing energy is not able to separate filled pause and elongation well in the Portuguese language due to their similar energy characteristics. The energy parameter is customarily used in endpoint detection but is not limited to it; it is also beneficial in consonant and vowel detection (Izzad et al., 2013). However, the sum of energy calculated from a short-time speech frame is unable to detect the energy variation between the consonant and the vowel in an elongation. These researchers concluded that there are difficulties in differentiating filled pause and elongation into two separate classes. Therefore, further work is needed to investigate and select a suitable energy feature extraction technique for this purpose. Rigorous energy feature selection research for representing filled pause and elongation remains hard to find. This research therefore aims to identify the most suitable energy characteristic of filled pause and elongation, and to construct a classification model that is able to discriminate filled pause and elongation into their own separate classes.

Methodology
The methods of this research are divided into several stages. The first stage is the development of the filled pause and elongation datasets. The filled pause dataset (i.e. FP_DATA) and the elongation dataset (i.e. ELO_DATA) are then subjected to a pre-processing stage, which is a combination of established procedures in speech analysis. The output of the speech pre-processing is passed to the energy feature extraction stage to obtain the energy feature representation of the speech. The selected energy feature vectors are then fed into the classification stage to classify the speech disfluencies into filled pause or elongation. The last stage evaluates the classifier performance based on several measurements. Overall, this research uses Matlab, Wavesurfer and the R statistical software for speech processing and analysis. Each stage is elaborated in the subsequent sections.

Dataset Development
The raw data used in this research is taken from the Malaysia Parliamentary Debate Database of the year 2008. The data collection process starts with the conversion of the video files to audio format using a freeware video-to-audio converter; the resulting files are named MPHD.wav. The MPHD video recording collection comprises 51 video files. Each video file contains a morning and an evening session conducted within eight to thirteen hours and is accompanied by a text transcription. The video quality is analysed file by file to select the best match between video and text transcription. Out of the 51 video files, only 22 files are suitable for further processing: they are not corrupted, have no missing sound and match the transcriptions (text files) perfectly. These 22 audio (.wav) files contain 1 074 072 words in approximately 214 814 sentences. Only seven audio (.wav) files are randomly chosen and used to extract the Malay filled pauses and elongations. The quantitative information of the randomly chosen files is tabulated in Table 1. Examples of sentences that contain filled pause, normal words and elongation are presented in Figure 1 and Figure 2.
In the figures, the filled pause is marked with a dashed oval, the normal word with a dashed rectangle and the elongation with a dashed square. Silence is transcribed as sil in the transcription pane above the speech waveform. Each segmented sentence is labelled following the rule "S (number of sentence) F/M (gender) T (topic number)", and each segmented isolated filled pause and elongation is labelled by the sentence number followed by the number of filled pauses. For example, the sentence in Figure 2 is labelled S53M5T03, and the corresponding filled pause and elongation of the sentence are F53 and E53. Subsequently, to gather different sets of filled pause and elongation data, all sentences are manually segmented for further use in this research. A total of 3000 isolated filled pauses, comprising 2400 'aaa', 450 'eee' and 150 'emm', are collected and named FP_DATA. Meanwhile, a total of 3000 elongations are named ELO_DATA. Each ELO_DATA item is a segmented syllable that is elongated by the speaker. To obtain accurate endpoint segments, voice activity detection (VAD) techniques are applied to both datasets (FP_DATA and ELO_DATA), which together consist of 6000 manually segmented speech segments. Furthermore, the datasets have been verified by a linguistic expert (Dr. Norizah Ardi, Pusat Pengajian Bahasa UiTM Shah Alam) to confirm that the collection contains only filled pauses and elongated word segments.

Pre-processing
Pre-processing is one of the main parts of the ASR process (Deng et al., 2018). All the speech data used in this research are pre-processed for the purpose of feature extraction. The pre-processing stage, which is vital in any speech processing research, comprises amplitude normalization, pre-emphasis, framing, windowing and voice activity detection. Each of these processes is discussed in the following subsections.

Amplitude Normalization
The raw speech data is a collection of speech uttered by different speakers; thus, the amplitude and energy vary. This variation in speakers' speech energy can cause errors or an unstable classification rate if the feature vector is extracted directly. Therefore, the purpose of amplitude normalization is to ensure that the level of the energy is standardized or similarly calibrated. In this research, the z-score normalization technique is adopted: the speech amplitude is normalized to have zero mean and a standard deviation of one. Because speakers' volumes vary, the amplitude needs to be normalized before the next process so that volume does not become a performance degradation factor.
The normalization steps are as follows:
i. The mean $\mu$ of the speech vector $x$ is computed.
ii. The standard deviation $\sigma$ of the speech vector $x$ is computed.
iii. The mean and standard deviation calculated in steps (i) and (ii) are used to calculate the normalized speech vector as in Eq. (1):
$$z(x) = \frac{x - \mu}{\sigma} \tag{1}$$
where $x$ is the speech vector.
The normalization effect is evaluated by calculating the mean amplitudes of the speech samples (3000 FP and 3000 ELO). The variances of the mean amplitudes before and after amplitude normalization are compared in Table 2. The results show that the mean amplitude variance after normalization is smaller than before normalization. A smaller variance indicates that the difference in normalized amplitude among the filled pauses and elongations is minimal. As stated earlier, amplitude normalization is important to ensure that the energy of the speech samples lies within the same range. The normalized speech signal, z(x), is used as input to the pre-emphasis stage.
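The three normalization steps can be sketched in a few lines. This is an illustrative Python sketch only (the research itself reports using Matlab and R); the function name `zscore_normalize` is hypothetical, and the use of the population standard deviation rather than the sample form is an assumption, since the text does not say which is used.

```python
import math


def zscore_normalize(x):
    """Return a z-score normalized copy of a speech vector (list of floats)."""
    n = len(x)
    mean = sum(x) / n  # step i: mean of the speech vector
    # Step ii: standard deviation. The population form (divide by n) is an
    # assumption; the sample form (divide by n - 1) is equally plausible.
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    if std == 0.0:
        raise ValueError("a constant signal cannot be z-score normalized")
    # Step iii: subtract the mean and divide by the standard deviation.
    return [(v - mean) / std for v in x]


speech = [0.2, -0.1, 0.4, 0.0, -0.5]  # toy amplitudes, not real data
z = zscore_normalize(speech)
print(z)
```

After normalization the vector has zero mean and unit standard deviation, regardless of how loudly the original utterance was spoken.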

Pre-Emphasis
Generally, digitized speech waveforms contain additive noise and have a high spectral dynamic range. For example, low energy is found in the high-frequency spectrum of speech, while high energy is found in the low-frequency spectrum. For that reason, a process called pre-emphasis is performed on the normalized speech to flatten the speech spectrum and to emphasize the high-frequency part of the speech signal that was suppressed by the human sound production mechanism. For example, the vowels in filled pauses and elongations have high energy (Kitamaya et al., 2003) and may be pronounced at a lower frequency. Therefore, the higher-frequency content needs to be boosted for better acoustical feature representation. The most extensively used pre-emphasis digital high-pass filter is defined in Eq. (2):
$$x_{prem}(n) = x(n) - A\,x(n-1) \tag{2}$$
where $x(n)$ is the value of the normalized input signal at discrete time step $n$ and $A$ is a constant normally set between 0.9 and 1. In this research, the value of 0.95 is chosen as A. Various pre-emphasis constants are used in the literature. A constant of 0.95 was used in (Verkhodanova & Shapranov, 2014), while in (Murakami & Mizuguchi, 2010) the pre-emphasis constant is set to 0.97. However, according to (Abbas et al., 2013), the typical value of the pre-emphasis constant is 0.95. A low-frequency signal is one with slow time variation, so its adjacent samples have similar numerical values. In Eq. (2), the subtraction removes the part of each sample that did not change relative to its adjacent sample, thereby retaining the high-frequency components. The output signal of the pre-emphasis process, $x_{prem}(n)$, is then passed to the framing stage.
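The effect of the first-order difference in Eq. (2) is easy to see on two toy signals. The sketch below is illustrative Python, not the authors' implementation; the function name `pre_emphasis` is hypothetical, and passing the first sample through unchanged is an assumption because the text does not state how the boundary sample is handled.

```python
def pre_emphasis(x, a=0.95):
    """Apply x_prem(n) = x(n) - a * x(n - 1) to a list of samples."""
    if not x:
        return []
    # The first sample has no predecessor; keeping it unchanged is an assumption.
    out = [x[0]]
    for n in range(1, len(x)):
        out.append(x[n] - a * x[n - 1])
    return out


slow = [1.0, 1.0, 1.0, 1.0]    # slowly varying (low-frequency) toy signal
fast = [1.0, -1.0, 1.0, -1.0]  # rapidly alternating (high-frequency) toy signal
slow_out = pre_emphasis(slow)
fast_out = pre_emphasis(fast)
print(slow_out)
print(fast_out)
```

With A = 0.95, the slowly varying signal is reduced to a small residual after the first sample, whereas the alternating signal is amplified, which is the high-pass behaviour described above.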

Framing
A speech signal is non-stationary and non-periodic over a longer duration; its statistical properties are not constant over time. In practice, however, speech is considered stationary and quasi-periodic within a frame of 20 ms to 30 ms (Ganaphaty, 2012). Thus, the non-stationary speech signal is treated as stationary by means of framing. Framing is the process of blocking the speech signal into frames of N samples, with adjacent frames separated by M samples, i.e. each frame is shifted by M samples from the adjacent frame. The spectral features estimated from frame to frame will be smooth if the shift is small. The shift is important to ensure that the speech frames overlap. Without overlap between adjacent frames, information at the frame boundaries may be lost.
The general equation for frame blocking is written in Eq. (3), where the $l$-th speech frame is denoted $S_l$ and the length of the entire speech signal is denoted $L$:
$$S_l(n) = x_{prem}(lM + n), \quad 0 \le n \le N - 1, \quad 0 \le l \le \left\lfloor \frac{L - N}{M} \right\rfloor \tag{3}$$
In this research, the frame size is set to 20 ms (320 points), and frames overlap by 10 ms (160 points). A frame shift of 10 ms with a short frame of 20 ms is commonly chosen in speech processing research (Rosenberg & Hirschberg, 2006). The overlap is important to ensure a smooth transition of the estimated parameters between frames.
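A minimal Python sketch of frame blocking with the stated frame size and shift follows. The function name `frame_signal` is hypothetical, and dropping trailing samples that do not fill a complete frame is an assumption; zero-padding the last frame would be an equally reasonable choice. A frame of 320 points lasting 20 ms implies a 16 kHz sampling rate.

```python
def frame_signal(x, frame_len=320, frame_shift=160):
    """Block x into frames of frame_len samples, shifted by frame_shift samples."""
    if frame_len <= 0 or frame_shift <= 0:
        raise ValueError("frame_len and frame_shift must be positive")
    frames = []
    start = 0
    # Trailing samples that do not fill a whole frame are dropped (assumption).
    while start + frame_len <= len(x):
        frames.append(x[start:start + frame_len])
        start += frame_shift
    return frames


# One second of a toy ramp signal at 16 kHz (320 points per 20 ms frame).
signal = [float(i) for i in range(16000)]
frames = frame_signal(signal)
print(len(frames), len(frames[0]))
```

Because the shift is half the frame length, every sample except those in the first and last half-frames appears in exactly two frames.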

Windowing
Windowing is done to reduce the discontinuities of the speech signal at the edges of each frame by applying a tapered window to each frame. For each framed speech signal, the window function tapers the beginning and the end of the frame. For a window w(n), the windowed signal is defined as in Eq. (4):
$$x_w(n) = x(n)\,w(n), \quad 0 \le n \le N - 1 \tag{4}$$
where $x(n)$ is the framed speech signal and $N$ is the frame length. The Hamming window is the most commonly used windowing function applied to each speech frame and is described in Eq. (5). It also provides better frequency resolution as it minimizes signal discontinuity.
$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N - 1}\right), \quad 0 \le n \le N - 1 \tag{5}$$
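The Hamming window can be computed directly from its standard definition. The following Python sketch is illustrative; `hamming` and `apply_window` are hypothetical helper names, and the coefficients 0.54 and 0.46 are those of the standard Hamming window rather than values quoted from the text.

```python
import math


def hamming(n_points):
    """Standard Hamming window: w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    if n_points < 2:
        return [1.0] * n_points
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def apply_window(frame):
    """Multiply a frame sample-by-sample with a Hamming window of equal length."""
    window = hamming(len(frame))
    return [s * w for s, w in zip(frame, window)]


w = hamming(320)                     # window for one 20 ms frame of 320 points
windowed = apply_window([1.0] * 320)
print(w[0], max(w))
```

The window tapers each frame toward 0.08 at both edges while leaving the centre almost untouched, which is what reduces the edge discontinuities.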

Energy feature extraction
In general, the energy representation of each speech sample is obtained with the standard method (Jalil et al., 2013), that is, by calculating the sum of the energy of each short speech frame as in Eq. (6).
The next step is to calculate the standard deviation over the whole speech segment to measure the stability of the energy. The energy standard deviation of filled pause is expected to be small (Stouten et al., 2006), as filled pauses are presumed to be more stable. An example of the energy of a filled pause and an elongation is shown in Figure 3. In filled pause research, energy is an important feature. Several acoustical features that were previously tested in filled pause classification, such as fundamental frequency and spectral envelope, are correlated with energy (Rosenberg & Hirschberg, 2006). Generally, the energy of filled pause is stable and constant, as shown in (Garg & Ward, 2006). However, the standard method of energy measurement is not able to represent the transition between consonant and vowel in the elongation, which is named expressive intonation in this research. Therefore, another way of exploiting the energy of the speech, namely using the local information of the speech energy, needs to be investigated. This is explored and discussed in the next subsection.
For each frame duration tested (i.e. 10 ms, 20 ms and 40 ms), the standard deviations of the energy produced by elongation are 56, 105 and 182, which are lower than the corresponding standard deviations of the filled pause energy (i.e. 70, 121 and 241). The distribution of the energy values of both filled pause and elongation is shown in Figure 5. Figure 5 shows that the energy representations (energy standard deviations) of filled pause and elongation overlap. This indicates that filled pause and elongation cannot be differentiated by using energy alone as the feature.
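The standard energy feature described in this section can be sketched as follows. Since Eq. (6) is not reproduced in the text, the sketch assumes the common sum-of-squares definition of short-time energy; the function names and the toy frames are hypothetical and are not taken from FP_DATA or ELO_DATA.

```python
import statistics


def short_time_energy(frames):
    """Energy of each frame, assuming the sum-of-squares definition."""
    return [sum(s * s for s in frame) for frame in frames]


def energy_std(frames):
    """Standard deviation of the per-frame energy over a whole segment."""
    return statistics.pstdev(short_time_energy(frames))


# Toy segments: one with constant energy, one whose energy grows frame by frame.
steady = [[1.0, -1.0, 1.0, -1.0] for _ in range(5)]
rising = [[0.1 * k] * 4 for k in range(1, 6)]

print(energy_std(steady))  # constant energy gives zero standard deviation
print(energy_std(rising))
```

A single standard deviation per segment summarizes how stable the energy is, but it discards where in the segment the variation occurs, which is one way to read the overlap reported in Figure 5.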

Proposed Speech Energy Extraction using Local Maxima
Several techniques of local maxima extraction have been proposed previously. Basically, these techniques depend on the selection of the threshold parameter. One technique uses the distance between peaks as the threshold (Schwartzman et al., 2011). It is implemented by assigning a minimum peak as a threshold; a point is marked as a local maximum if it is the highest peak among the descending peak data. The other technique uses a minimum height (Bertot et al., 2014) as the threshold. In this technique, the peak is detected from first-order difference information: a peak occurs when the trend changes from upward to downward, i.e. where the difference changes from a streak of positives and zeros to negative. Both techniques were applied in this research. According to (Schwartzman et al., 2011), these techniques are only applicable when the noise is stationary and isotropic. However, it is well known that speech is non-stationary, and the amplitude values, represented by volume or energy, vary greatly and are thus not isotropic. Therefore, the aim of the proposed energy extraction based on local maxima is to optimize the local maxima selection in each speech segment.
Speech energy is closely related to the amplitude of the speech (Izzad et al., 2013). Instead of calculating the total energy of each frame, this research measures the energy stability of the speech based on the amplitude transition from one frame to another. To measure the amplitude transition, this research proposes manipulating the local maxima points of the speech. We introduce an adaptive local maxima threshold selection technique that directly compares one peak point with another, using a threshold adapted to their height difference, and stores the selected peaks in a matrix for further processing. In this proposed method, different adjustable positive scalar values are tested as the threshold to find the most suitable parameter.
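The first-order difference rule with a minimum-height threshold can be sketched as below. This illustrates only the baseline peak detection described above, not the proposed LM-E; the function name `find_local_maxima`, the `min_height` parameter and the toy energy contour are hypothetical, and reporting the last sample of a plateau as the peak is an assumption.

```python
def find_local_maxima(x, min_height=None):
    """Return indices where the first-order difference changes from a run of
    positives and zeros (containing at least one positive) to a negative."""
    peaks = []
    rising = False
    for i in range(1, len(x)):
        diff = x[i] - x[i - 1]
        if diff > 0:
            rising = True
        elif diff < 0:
            # The trend turns downward: the previous sample is a peak if the
            # signal had been rising and the peak clears the height threshold.
            if rising and (min_height is None or x[i - 1] >= min_height):
                peaks.append(i - 1)
            rising = False
        # diff == 0 keeps the current trend, so a plateau counts as one peak
        # located at its last sample (assumption).
    return peaks


energy = [0.0, 2.0, 1.0, 3.0, 3.0, 0.5, 1.0, 0.2]  # toy energy contour
all_peaks = find_local_maxima(energy)
tall_peaks = find_local_maxima(energy, min_height=1.5)
print(all_peaks, tall_peaks)
```

A fixed `min_height` discards the small peak at index 6 in this toy contour; how that threshold is chosen is exactly the parameter on which the extraction techniques differ.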

Classification
The classification stage follows once the energy feature vectors have been collected from the energy feature extraction stage. In this research, a simple naïve Bayes classifier is used to evaluate the performance of the extracted features in representing filled pause and elongation. The overall steps can be visualized as in Figure 6. The process of feature classification is described as follows:
i. The classifier learns the conditional probability of the attributes X (acoustical feature values) given the class label C (FP or ELO) from the training data.
ii. The classification is performed by applying Bayes' rule to compute the probability of C given the particular feature X.
iii. The class of the feature X is predicted by the highest posterior probability.
([...] et al., 1995). The class label of the disfluency is determined by using Bayes' theorem as in Eq. (7):
$$p(c \mid x) = \frac{p(x \mid c)\,p(c)}{p(x)} \tag{7}$$
where the prior probability $p(c)$ is set to 0.5 for each class, since the numbers of filled pause and elongation samples are equal. To validate the classifier, 10-fold cross-validation is used. Cross-validation (CV) is the most common validation technique in recent use (Elkan, 2012; Qin et al., 2012). Several techniques are applied in CV, such as leave-one-out and k-fold CV. In (Stouten et al., 2006), leave-one-out cross-validation is applied with a total of 186 iterations: in each iteration, one sequence of data is taken out as test data while the rest is used for training, and the process is repeated 186 times. However, this method is quite time consuming for a larger dataset. A large dataset consisting of 1076 samples has been evaluated with 10-fold CV in order to test the classifier's performance (Elkan, 2012). That study found the classifier's performance to be comparable with the previous work by (Bouckaert, 2004). In (Murakami & Mizuguchi, 2010), cross-validation is likewise used. In the 10-fold CV, the total data of filled pause and elongation are divided into 10 equal folds. The process is executed 10 times, with a different fold used for testing in each iteration. The classification is then evaluated using several measures: accuracy, F-measure, precision and recall.
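The classification and validation procedure can be sketched end to end on synthetic data. This is a minimal Python sketch under stated assumptions: the text does not specify the likelihood model, so a Gaussian likelihood for a single scalar feature is assumed; all function names are hypothetical; and the toy feature values are randomly generated, not drawn from FP_DATA or ELO_DATA.

```python
import math
import random
import statistics


def fit(samples):
    """Estimate a per-class mean and standard deviation from (feature, label) pairs."""
    model = {}
    for label in {lab for _, lab in samples}:
        values = [x for x, lab in samples if lab == label]
        # A small floor keeps the standard deviation strictly positive.
        model[label] = (statistics.mean(values),
                        max(statistics.pstdev(values), 1e-6))
    return model


def log_gaussian(x, mean, std):
    """Log of the Gaussian density, used as the assumed form of p(x|c)."""
    return -math.log(std * math.sqrt(2.0 * math.pi)) - 0.5 * ((x - mean) / std) ** 2


def predict(model, x, prior=0.5):
    """Pick the class with the highest posterior; p(x) is common to both
    classes, so comparing log p(x|c) + log p(c) is sufficient."""
    return max(model, key=lambda c: log_gaussian(x, *model[c]) + math.log(prior))


def cross_validate(samples, k=10, seed=0):
    """k-fold CV: train on k - 1 folds, test on the held-out fold, k times."""
    data = samples[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    accuracies = []
    for i in range(k):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = fit(train)
        correct = sum(predict(model, x) == lab for x, lab in folds[i])
        accuracies.append(correct / len(folds[i]))
    return accuracies


# Synthetic one-dimensional feature values (illustrative only).
rng = random.Random(1)
toy = ([(rng.gauss(0.4, 0.1), "FP") for _ in range(100)]
       + [(rng.gauss(0.9, 0.1), "ELO") for _ in range(100)])
acc = cross_validate(toy)
print(sum(acc) / len(acc))
```

With equal priors of 0.5 the prior term cancels in the comparison, so the decision reduces to choosing the class under which the observed feature value is more likely.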

Results and Discussion
To verify the validity of the energy features extracted from the MPHD database in classifying filled pauses and elongations, various experiments were performed.
Each energy feature was evaluated individually using 10-fold cross-validation. The feature classification performance is measured using precision, recall, F-measure and accuracy. The precision and recall rates are needed to obtain the F-measure. Recall is the number of filled pauses (or elongations) that are correctly classified out of all actual filled pauses (or elongations). Precision is the number of filled pauses (or elongations) that are correctly classified out of all samples assigned to that class. The F-measure is the harmonic mean of precision and recall. Accuracy shows the overall performance, i.e. the number of filled pauses and elongations that are correctly classified out of all filled pauses and elongations. These measurements for both the conventional short-time energy (STE) and the proposed LM-E are shown in Table 3. The results show that LM-E outperforms the well-established STE. Overall, the accuracy of the energy feature increased from 67% to 74%, an increase of 7 percentage points, when adaptive thresholding is introduced. LM-E achieved higher recall and precision rates, at above 68%, for both filled pause and elongation compared to STE. The highest F-measure for filled pause is achieved by LM-E at 71%, followed by STE at 63%. For elongation, LM-E achieved the higher F-measure at 75%, followed by STE at 70%. This shows that the proposed LM-E represents elongation better than STE. The accuracy for each fold of the 10-fold CV for both LM-E and STE is shown in Figure 8. For the proposed LM-E, the accuracy difference between folds is small, at only 3.69. This indicates that LM-E is consistent in representing filled pauses and elongations. The lowest accuracy of the proposed LM-E is 68%, seen in the 7th fold. Most of the speech data in the 7th fold is from the DR20080528 and DR20080828 datasets.
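The four measures can be computed per class from the true and predicted labels. The Python sketch below uses the standard definitions; the function name `class_metrics` and the six toy labels are hypothetical and do not correspond to the values in Table 3.

```python
def class_metrics(true_labels, predicted_labels, positive):
    """Return (precision, recall, F-measure, accuracy) for the class `positive`."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # The F-measure is the harmonic mean of precision and recall.
    if precision + recall:
        f_measure = 2.0 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return precision, recall, f_measure, accuracy


truth = ["FP", "FP", "FP", "ELO", "ELO", "ELO"]
pred = ["FP", "FP", "ELO", "ELO", "ELO", "FP"]
precision, recall, f_measure, accuracy = class_metrics(truth, pred, positive="FP")
print(precision, recall, f_measure, accuracy)
```

Precision and recall are class specific, so they are reported separately for filled pause and elongation, whereas accuracy is a single figure over both classes.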
Examples of misclassified ELO and FP samples are randomly taken from the DR20080828 dataset. The LM-E standard deviations for the ELO samples (ELO07.wav and ELO06.wav) and the FP samples (FP11.wav and FP107.wav) are 0.684, 0.378, 0.937 and 0.9828, respectively. The LM-E standard deviations of these ELO samples are lower than those of the FP samples, although it is the FP standard deviation that is expected to be small. In speech production, the transition from consonant to vowel causes acoustic changes within the transition (Doellinger et al., 2011). According to (Doellinger et al., 2011), the transition from consonant to vowel is due to the interval between the release burst and the onset of laryngeal pulsing. The transition from consonant to vowel in the Malay language dataset produces a unique phenomenon, named expressive intonation in this research. The graphical representation of the consonant to vowel transition is shown in Figure 9. Since there is no significant transition from consonant to vowel in the elongations depicted in Figure 9, a lower LM-E standard deviation is derived. Thus, the standard deviation does not meet the acoustical rules of LM-E for elongation, and these elongations are misclassified as filled pause.
Some of the elongations start with a voiced consonant (i.e. /ga/, /da/ and /ni/) or an unvoiced consonant (i.e. /pi/, /tu/ and /ke/). There are also elongations uttered with a semivowel (i.e. /ya/, /wa/). There is no significant amplitude transition from consonant to vowel in many of the elongations of the 7th fold, which causes a lower LM-E standard deviation. An elongation in the form of a semivowel is hard for LM-E to classify correctly. Most of these elongations cannot be correctly classified using LM-E, as the energy of the semivowel and the energy of the vowel of the filled pause do not differ significantly.
As stated earlier, LM-E is associated with the speech energy (STE). Therefore, this research compares the performance of these two speech energy characteristics in differentiating filled pause and elongation. Since a filled pause is an unvaried pronunciation of phonemes, its energy is constant. The consistency of the energy is reflected in a lower STE standard deviation (Stouten et al., 2006). In other words, the STE standard deviation for filled pause is lower than that for elongation. However, LM-E, which is an exploitation of the speech energy, managed to differentiate elongation better than STE.

Conclusion
This research concludes that exploiting the well-established STE through the proposed LM-E produces better classification accuracy for FP and ELO. Future research is expected to produce a more robust energy feature, or other acoustical features that are more suitable, especially for overcoming the problem of semivowel detection in elongation. A more efficient algorithm could also be constructed to reduce the computation time.

Figure 1. A complete sentence with only filled pause (Malay sentence id S169M9T04: Pesakit aaa, buah pinggang)
Figure 3. Example of STE measurements on elongation
The proposed energy extraction technique is Local Maxima of the Speech Energy (LM-E). The detailed steps of the proposed LM-E are as follows:
Step 1: Find the minimum peak m_p of all the peaks in the speech.
Step 2: Set m_p as the first threshold.
Step 3: Iterate the process to the next consecutive point in the speech n

Figure 6. Classification process
Let x be a specific feature with assigned values x1, x2, x3, …, xn, and let C be the class with assigned values C1, C2, C3, …, Cn. The Bayes classifier enables the computation of the posterior probability, where:
c = the disfluency class (c = FP for filled pause, c = ELO for elongation)
x = acoustical feature
p(c) = prior probability
p(x|c) = conditional probability
p(c|x) = posterior probability
In (Murakami & Mizuguchi, 2010), two stages of classifier validation are done. The first stage uses a standard training and testing data partition with different data division ratios, while the second stage uses cross-validation. This research chooses the cross-validation method to test the accuracy of the model. In the 10-fold CV technique, nine folds are used to train the classifier, and the one fold that is held out is then used to test the classifier. The process of dividing the data for k-fold CV is as follows:
Input: training set S, integer constant k
Procedure:
Partition S into k equal-sized subsets S_1, …, S_k
For i = 1 to k, let T = S \ S_i
Run the learning algorithm (Bayes classifier) with T as the training set
Test the resulting classifier on S_i

Figure 8. Accuracy of LM-E and STE for each fold of the 10-fold CV

Figure 9. Consonant to vowel transition in elongation /da/
According to (Espy, 1986), the similar acoustical pattern between semivowels and vowels makes the detection of semivowels a challenging task. In summary, the causes of misclassification by LM-E are:
I. A low speaking volume causes an inaccurate LM-E representation of filled pause.
II. Filled pause is uttered in an emotional state of mind such as anger, happiness or doubt, producing expressive intonation in the filled pause utterances. The filled pause is therefore misclassified as elongation, as it possesses characteristics similar to elongation.
III. An insignificant transition from consonant to vowel in elongation causes a low LM-E standard deviation.

Table 1. Quantitative information of selected MPHD files

Table 2. Mean amplitude variance due to normalization