BY-NC-ND 3.0 license Open Access Published by De Gruyter June 12, 2015

Gaussian Mixture Model Based Classification of Stuttering Dysfluencies

  • P. Mahesha and D. S. Vinod

Abstract

The classification of dysfluencies is one of the important steps in the objective measurement of the stuttering disorder. In this work, the focus is on investigating the applicability of the automatic speaker recognition (ASR) approach to stuttering dysfluency recognition. The system designed for this task relies on the Gaussian mixture model (GMM), which is the most widely used probabilistic modeling technique in ASR. The GMM parameters are estimated from Mel frequency cepstral coefficients (MFCCs). This statistical speaker-modeling technique represents the fundamental characteristic sounds of the speech signal. Using this model, we build a dysfluency recognizer that is capable of recognizing dysfluencies irrespective of the speaker and of what is being said. The performance of the system is evaluated for different types of dysfluencies, namely syllable repetition, word repetition, prolongation, and interjection, using speech samples from the University College London Archive of Stuttered Speech (UCLASS).

MSC 2010: 68T01; 68T10; 68T27

1 Introduction

There has been growing interest in computerized techniques for pathological voice recognition in recent years [1–3, 5, 6, 23, 29, 33]. Speech signal processing techniques have proved to be an excellent tool for speech disorder detection. A number of researchers have investigated the application of automatic speaker recognition (ASR) for people with speech and language impairments [1–3, 6, 23]. The literature [6, 9] reveals that ASR has the potential to help people with voice disorders such as dysarthria and hearing impairment interact with a computer. However, the application of ASR to other speech and fluency disorders is documented to a smaller extent [9]. This paper examines the potential application of ASR to the detection of stuttering dysfluencies.

Stuttering is an abnormal speaking behavior characterized by certain types of dysfluencies, called stuttering events, such as repetition of a syllable or word, prolongation, interjection, and involuntary silent pauses or blocks in communication [13–15, 37]. Some examples of dysfluencies are given below:

  • Syllable repetition (SR): The speaker may repeat a sound or syllable at the beginning of the word. For example: The baby had the s-s-s-soup.

  • Word repetition (WR): The speaker may repeat the whole word. For example: The baby-baby had the soup.

  • Prolongation (P): The speaker may prolong the sounds or syllables. For example: The baaaaby had the soup.

  • Interjection (I): These are extra sounds, syllables, or words that add no meaning to the message. For example: The baby um um had uh the soup.

Stuttering can make communication with other people difficult, and this can affect a person’s quality of life. Approximately 1% of the world’s population, across all age groups, is affected by this disorder [5]. Males are affected two to five times more often than females [15, 37].

The assessment of stuttering operates on many levels, which can be divided into several targets such as frequency, type, duration, and severity. In this respect, counting dysfluency events is a common and reliable measure that can be used for various purposes [4, 18]. It helps speech–language pathologists (SLPs) plan an appropriate treatment program and also serves as a snapshot measure of improvement during treatment.

In traditional stuttering assessment, SLPs identify and count the occurrences of dysfluencies manually by listening to or transcribing the speech. The results of these assessments are inconsistent across SLPs and clinics, because they depend on subjective opinion, each clinic uses slightly different definitions of stuttering events, and mistakes occur in the counts [5, 38]. As a result, these traditional methods are time consuming, subjective, inconsistent, and error-prone.

Hence, it would be appropriate to establish an objective method of detecting particular types of dysfluencies based on acoustical characteristics of speech signal and pattern recognition techniques.

2 Related Work

This section presents an overview of various approaches found in the literature on stuttering dysfluency recognition. Over the past several years, considerable research has been carried out on the automatic detection and classification of stuttering dysfluencies [1–3, 10, 23, 24, 32, 34, 35]. Most of it focuses on acoustic analysis, parametric and nonparametric feature extraction, and statistical approaches [10, 24, 32, 34, 35].

To provide an objective and consistent measurement, several stuttering dysfluency recognition systems have been proposed. Researchers have employed various classifiers such as the artificial neural network (ANN), hidden Markov model (HMM), support vector machine (SVM), linear discriminant analysis (LDA), and k-nearest neighbor (k-NN) to recognize dysfluencies.

ANNs have been used for dysfluency recognition by many authors [5, 23, 13, 14, 31]. Howell et al. introduced the first study on the recognition of repetition and prolongation dysfluencies using ANNs [13, 14]. An autocorrelation envelope parameter was used as the input vector to the ANN [13]. Another study presented a two-stage procedure for the automatic recognition of dysfluencies employing spectral measures, fragmentation measures, duration, and energy features [14]. An experiment on automatic detection of dysfluent events based on rough sets and ANNs was presented by Czyzewski et al. [5]; the task was based on the detection of stop gaps and syllable repetitions and on discerning vowel prolongations. Ravikumar et al. [23] proposed automatic detection of syllable repetition for objective assessment of stuttering dysfluencies based on Mel frequency cepstral coefficient (MFCC) features and perceptron decision logic. Swietlicka et al. [31] applied a Kohonen network to recognize repetition and prolongation dysfluencies. Speech samples were analyzed using 21 digital 1/3-octave filters with center frequencies between 100 Hz and 10 kHz, and these parameters were used as input to the network.

HMMs have also been employed to recognize repetition and prolongation dysfluencies [32, 34, 35]. Wisniewski et al. [34, 35] proposed two techniques for the automatic detection of prolonged fricative phonemes using MFCC features and HMM classification. Tian-Swee et al. [32] presented a stuttering recognition method employing HMMs to evaluate stuttering in children.

Further, SVM classifiers have been used for dysfluency recognition [10, 24]. In 2009, Ravikumar et al. [24] attempted to improve their previous results [23] using an SVM with MFCC features. Fook et al. [10] presented a comparison of three feature extraction methods, namely MFCC, linear predictive coding (LPC), and perceptual linear prediction (PLP), for the classification of repetition and prolongation dysfluencies. Three different classifiers (k-NN, LDA, and SVM) were employed and compared in that study.

Chee et al. [2, 3] proposed dysfluency recognition systems using k-NN and LDA classifiers with MFCC and linear prediction cepstral coefficient (LPCC) features. Hariharan et al. [11] proposed three feature extraction methods, namely LPC, LPCC, and weighted linear prediction cepstral coefficients (WLPCCs), and employed two classifiers, k-NN and LDA.

Recently, Palfy [21] demonstrated the applicability of transforming speech into symbolic sequences for searching repeated patterns, using an SVM classifier, and obtained a best accuracy of 98%. Although that work reports high accuracy on the same database (University College London Archive of Stuttered Speech [UCLASS]), the results are not easily comparable, owing to the lack of uniformity in the selection of speech samples and the preparation of speech segments for experimentation.

Table 1 summarizes the related research chronologically in terms of database size, features employed, classifier used, and accuracy obtained. Analysis of the literature shows that only two types of dysfluencies, repetition and prolongation, have been considered in the experiments. However, stuttering is also characterized by other groups of dysfluencies, as listed in the Introduction. Hence, in this work, we are motivated to consider four types of dysfluencies: syllable repetition, word repetition, prolongation, and interjection. Further, the effectiveness of the Gaussian mixture model (GMM)–MFCC framework is evaluated for categorizing these dysfluencies.

Table 1

Summary of Previous Research Works on Stuttering Recognition.

First author | Year | Database | Features | Classifier | Accuracy
Howell [13] | 1995 | 12 speakers | Envelope parameter | ANNs | 80%
Howell [14] | 1997 | 12 speakers | Duration, energy peaks | ANNs | 78.01%
Czyzewski [5] | 2003 | 6 fluent samples + 6 stop-gap samples | Formant frequencies and their amplitudes | ANNs | 78.1%
Wiśniewski [34] | 2007 | 38 samples | MFCC | HMMs | 80%
Tian-Swee [32] | 2007 | 35 samples | MFCC | HMMs | 96%
Ravikumar [23] | 2008 | 10 speakers | MFCC | Perceptron | 83%
Ravikumar [24] | 2009 | 15 speakers | MFCC | SVM | 94.35%
Świetlicka [31] | 2009 | 8 speakers | Spectral measure (FFT12) | MLP, RBF | 88.1%, 94.90%
Chee [3] | 2009 | 10 samples from UCLASS | MFCC | k-NN, LDA | 90.91%
Chee [2] | 2009 | 10 samples from UCLASS | LPCC | k-NN, LDA | 89.77%
Wiśniewski [35] | 2010 | 2 speakers | MFCC, signal energy | HMMs | 80%
Ai [1] | 2012 | 39 samples from UCLASS | MFCC | k-NN, LDA | 92.55%, 88.82%
Ai [1] | 2012 | 39 samples from UCLASS | LPCC | k-NN, LDA | 94.51%, 90%
Hariharan [11] | 2012 | 39 samples from UCLASS | LPC, LPCC, WLPCC | k-NN | 92.16%, 96.47%, 97.45%
Hariharan [11] | 2012 | 39 samples from UCLASS | LPC, LPCC, WLPCC | LDA | 94.90%, 97.06%, 98.04%
Fook [10] | 2013 | 39 samples from UCLASS | LPCC, WLPCC, MFCC | SVM | 95%
Palfy [21] | 2014 | 16 samples from UCLASS | MFCC | SVM | 98.00%

3 Establishing the Acoustic Model

An effective acoustic model needs to be derived from dysfluent speech characteristics, which will enable the system to discriminate among different dysfluencies. The human vocal tract system produces different sounds by varying its shape, and these different sounds have different frequencies. To understand these frequency characteristics, we analyze the power spectral density (PSD) of different dysfluencies. As the human vocal tract can be modeled as an all-pole filter, the Yule–Walker parametric spectral estimation technique is employed to calculate PSDs [30].

The PSD estimates of four different dysfluencies are plotted in Figure 1. We notice that the peaks in the PSD remain consistent for a particular dysfluency but differ between dysfluencies. This implies that acoustic models can be obtained from spectral features. The MFCC spectral features, among the most widely employed features in speaker recognition, are used to build the acoustic model.
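To make this analysis step concrete, the following Python sketch estimates an all-pole (AR) power spectrum of a speech segment by solving the Yule–Walker equations. It is an illustration only: the model order, FFT length, sampling rate, and the placeholder file name are assumptions, as the paper does not report its exact settings.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import freqz

def yule_walker_psd(x, order=12, nfft=512, fs=16000):
    """Estimate an all-pole (AR) power spectrum via the Yule-Walker equations."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Biased autocorrelation estimates r[0] ... r[order]
    r = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(order + 1)])
    # Solve the Toeplitz system R a = -r[1:] for the AR coefficients a_1 ... a_p
    a = solve_toeplitz(r[:-1], -r[1:])
    a = np.concatenate(([1.0], a))          # denominator polynomial A(z)
    sigma2 = r[0] + np.dot(a[1:], r[1:])    # driving-noise variance
    # AR spectrum: sigma2 / |A(e^{j omega})|^2, evaluated with freqz (h = 1/A)
    w, h = freqz([1.0], a, worN=nfft, fs=fs)
    return w, sigma2 * np.abs(h) ** 2

# Example usage (hypothetical file name, not part of the UCLASS layout):
# x, fs = soundfile.read("prolongation_segment.wav")
# freqs, psd = yule_walker_psd(x, order=12, fs=fs)
```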

Figure 1: PSD Estimate of Four Different Dysfluencies.

4 Applicability of GMM for Stuttering Dysfluency Recognition

A GMM is a parametric probability density function represented as a weighted sum of Gaussian component densities [27]. GMM modeling and classification techniques are widely recognized as the state of the art in ASR systems [26, 28]. Their popularity stems from their capability of representing a large class of sample distributions. In addition, their probabilistic framework integrates directly with speech recognition systems. The success of speaker recognition motivated us to adopt the GMM framework as a baseline method for stuttering dysfluency recognition.

Though the proposed work is motivated from the ASR system, there are some major differences between the two systems as given below:

  • In the ASR, the model corresponds to a speaker, whereas in the proposed system, it corresponds to a particular group of dysfluencies.

  • In dysfluency recognition, the samples used for training and testing are different, whereas in the ASR, the two sets are similar.

  • In the ASR, the goal is to identify the test sample as one of the enrolled speakers. Here, the objective is to classify the test sample into one of four types of dysfluencies: syllable repetition, word repetition, prolongation, and interjection.

5 Experimental Data

We conducted experiments using the UCLASS stuttering speech database created by Howell et al. [12] and Davis et al. [8]. This database has been designed for research and clinical purposes to investigate the language and speech behavior of speakers who stutter. It contains 107 reading recordings contributed by 43 different speakers. In this study, 50 samples are selected from the UCLASS.

The four types of dysfluencies, namely syllable repetition, word repetition, prolongation, and interjection, are identified and segmented manually by listening to the speech samples. We prepared a total of 200 dysfluent segments, 50 of each of the four types. Of these segments, 80% are used for training and the remaining 20% for testing. Table 2 shows the gender, age range, and number of samples obtained from the UCLASS, as well as the total number of dysfluent segments prepared.

Table 2

Age, Gender and Dysfluent Segments selected from Subset of UCLASS Database.

Gender (number) | Age range (years) | Dysfluent speech segments | Training | Testing
Male (40) | 10–20 | 160 | 128 | 32
Female (10) | 10–20 | 40 | 32 | 8
Total | | 200 | 160 | 40

6 Methodology

This study aims at investigating the applicability of the GMM to dysfluent speech recognition. The general block diagram of the dysfluency classification system is shown in Figure 2.

Figure 2: General Block Diagram of the Dysfluency Classification System.

Basically, the recognition system consists of two stages, namely the training phase and the testing (recognition) phase. The feature extraction task is common to both stages. It is usually preceded by signal preprocessing to compensate for the high-frequency part of the spectrum that is suppressed during the sound production mechanism [22]. In the training phase, the system extracts feature parameters from the input dysfluent speech signal, and these are used to build representative dysfluency models.

The modeling phase plays a crucial role as it is intended to capture the patterns found in the feature vectors extracted from the dysfluency utterance. The objective of feature extraction and modeling is to determine the uniqueness in the dysfluency characteristics that helps in discriminating dysfluencies more conveniently.

During the testing phase, feature extraction is performed on an unknown dysfluency (the test sample). These vectors are then passed to the classifier, which performs a pattern-matching task: it determines the closest-matching dysfluency model by computing the similarity between the test sample feature vectors and the available dysfluency models. This results in classifying the test sample into one of the four types of dysfluencies, that is, syllable repetition, word repetition, prolongation, or interjection.

6.1 Signal Preprocessing

The preprocessing is carried out using a pre-emphasis filter to remove the effects of lip radiation and to prevent lower-frequency components from dominating the signal. Normally, a first-order high-pass filter is used to accomplish this step [22]. The filter equation is given below:

y(n) = x(n) − a·x(n − 1),   0.9 ≤ a ≤ 1.0   (1)

where x(n) and y(n) are the values of the input and output signals at discrete time step n, and a is the pre-emphasis coefficient.
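As a minimal illustration, equation (1) can be applied to a signal as follows; the coefficient a = 0.97 is an assumed value within the stated range, not one prescribed by the paper.

```python
import numpy as np

def pre_emphasis(x, a=0.97):
    """Apply y(n) = x(n) - a*x(n-1) from equation (1).
    a = 0.97 is an assumed common choice within the 0.9-1.0 range."""
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - a * x[:-1])
```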

6.2 Feature Extraction

The MFCC was introduced in the context of speech recognition in 1980 by Davis and Mermelstein [7]. It is currently a well-established and widely used acoustic feature for speaker modeling and recognition. It produces a multidimensional feature vector for every frame of speech. The MFCC is computed by partitioning the speech spectrum with a bank of overlapping triangular Mel frequency filters. This ability to partition the speech spectrum makes MFCCs appropriate for quantifying potential disorders of speech articulation. The choice of MFCC for evaluating dysfluencies in speech is supported more by empirical evidence than by theoretical reasoning. In this work, different numbers of parameters are considered: 12 MFCC parameters, 12 δ parameters (first derivatives of the MFCC), 12 δδ parameters (second derivatives of the MFCC), and one average spectral energy parameter.

Feature extraction is performed based on short-time window analysis with frames of length 20 ms and a step size of 10 ms. The MFCC parameters are calculated by mapping the voiced speech spectrum onto the Mel frequency scale. The mapping between frequency in Hz and the corresponding subjective pitch in Mels is shown in equation (2).

f_mel = 2595 log₁₀(1 + f/700)   (2)

where fmel is the subjective pitch in Mels corresponding to f, which is the actual frequency in Hz.
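For reference, equation (2) translates directly into code; for example, 1000 Hz maps to approximately 1000 Mel.

```python
import numpy as np

def hz_to_mel(f_hz):
    """Subjective pitch in Mels for a frequency in Hz, following equation (2)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

# hz_to_mel(1000.0) is approximately 1000 Mel
```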

This Mel frequency mapping is realized by multiplying the magnitude spectrum of a preprocessed frame by the magnitudes of the triangular filters in the Mel filterbank. This is followed by log-compression of the sub-band energies of the Mel-scale filters. Finally, the discrete cosine transform (DCT) is applied to the log energies obtained from the triangular bandpass filters to obtain the L Mel-scale cepstral coefficients. The DCT is given by

C_m = Σ_{k=1}^{N} cos[m(k − 1/2)π/N] E_k,   for m = 1, 2, …, L   (3)

where N is the number of triangular bandpass filters, L is the number of cepstral coefficients, and E_k is the log energy output of the kth filter.

The original sequence of feature vectors constitutes the static features. The dynamic behavior of the speech signal can be captured by computing the temporal derivatives of the MFCC parameters. The δ (first derivative of the cepstral features) and δδ (second derivative of the cepstral features) are referred to as the velocity and acceleration parameters, respectively [20]. They are obtained by differentiating the static features. The first derivative is calculated using the following equation:

d_t = Σ_{n=1}^{N} n (c_{t+n} − c_{t−n}) / (2 Σ_{n=1}^{N} n²)   (4)

where d_t is the δ coefficient at frame t, computed in terms of the static coefficients c_{t+n} and c_{t−n}, and N is the size of the window. The δδ coefficients are computed in the same manner from the δs.
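A sketch of this feature extraction pipeline is given below using the librosa library purely as an illustration; the paper does not state which toolkit was used. The 20 ms window and 10 ms step follow the text, while the FFT size and number of Mel filters are librosa defaults or assumptions.

```python
import numpy as np
import librosa

def extract_features(y, sr, n_mfcc=13):
    """Frame-level features: 13 MFCCs per 20 ms window with a 10 ms step,
    plus delta and delta-delta coefficients (39 values per frame)."""
    frame = int(0.020 * sr)     # 20 ms analysis window
    hop = int(0.010 * sr)       # 10 ms step size
    # Pre-emphasis as in Section 6.1 would normally be applied to y beforehand.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame, hop_length=hop, win_length=frame)
    delta = librosa.feature.delta(mfcc)            # velocity parameters
    delta2 = librosa.feature.delta(mfcc, order=2)  # acceleration parameters
    return np.vstack([mfcc, delta, delta2]).T      # shape: (n_frames, 39)
```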

6.3 The GMM Dysfluency Modeling

This study uses the GMM-based approach to model the dysfluencies. The adaptation of this model to dysfluency recognition is described in the following section.

The GMM method iteratively estimates a weighted sum of multivariate Gaussian densities [28]. The weighted sum of M component Gaussian densities is given by [26]:

p(x) = Σ_{i=1}^{M} w_i b_i(x)   (5)

where x is a D-dimensional feature vector, b_i(x), i = 1, …, M, are the component densities, and w_i, i = 1, …, M, are the mixture weights. Each component density is a D-variate Gaussian function of the form

b_i(x) = (1 / ((2π)^(D/2) |Σ_i|^(1/2))) exp{−½ (x − μ_i)′ Σ_i⁻¹ (x − μ_i)}   (6)

with mean vector μ_i and covariance matrix Σ_i. The mixture weights satisfy the constraints 0 ≤ w_i ≤ 1 and Σ_{i=1}^{M} w_i = 1.

The complete Gaussian mixture model is parameterized by the mean vectors, covariance matrices, and mixture weights from all component densities [26].

Each dysfluency is identified by a GMM λi, which is completely parameterized by its mixture weights, mean, and covariance matrices. These density parameters are collectively represented by notation [28]:

λ_i = {w_i, μ_i, Σ_i},   i = 1, …, M   (7)

For dysfluency identification, each dysfluency is represented by a GMM and is referred to by its model λ. In the identification step, λ is used as the model of the dysfluency. The training phase corresponds to establishing the appropriate λ for each type of dysfluency.

There are two important motivations in using the Gaussian mixture densities for dysfluency recognition. First, the component densities of a mixture model together represent a set of acoustic classes [27, 28]. Stuttering dysfluency can be interpreted as an acoustic space, which is characterized by a set of acoustic classes. They contain relevant phonetic characteristics such as vowels, nasals, and consonants. These acoustic classes reflect some general vocal tract configurations that are useful for characterizing stuttering dysfluency. The mean μi represents the spectral shape of the ith acoustic class. The covariance matrix Σi represents the variations of the average spectral shape.

The second motivation is that the GMM can attain accurate approximations of arbitrarily shaped densities, which makes dysfluency recognition more efficient than other approaches.

The GMMs are trained separately on the data of each dysfluency class using the expectation-maximization (EM) algorithm [20].
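The following sketch trains one GMM per dysfluency class with scikit-learn, whose GaussianMixture estimator fits its parameters by EM. The diagonal covariance choice mirrors the element-wise variance update in equation (11), but the data layout, regularization, and variable names are assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

DYSFLUENCY_TYPES = ["syllable_repetition", "word_repetition",
                    "prolongation", "interjection"]

def train_models(train_features, n_components=64):
    """Fit one GMM per dysfluency class by EM.

    train_features: dict mapping a class name to an (n_frames, 39) array of
    feature vectors pooled from all training segments of that class.
    """
    models = {}
    for name in DYSFLUENCY_TYPES:
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag",  # per-element variances, cf. eq. (11)
                              max_iter=100, reg_covar=1e-4, random_state=0)
        gmm.fit(train_features[name])
        models[name] = gmm
    return models
```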

6.3.1 Maximum Likelihood Parameter Estimation

Given a distribution of feature vectors, the goal of the training phase is to estimate the model λ that best matches the distribution of the training vectors. A number of techniques are available for estimating the parameters of a GMM [27]; we use maximum likelihood (ML) estimation, which is the most popular and well-established method [19].

The ML approach aims at finding the model parameters λ = {w_i, μ_i, Σ_i}, i = 1, …, M, which maximize the likelihood of the GMM given the training data. For a sequence of T training vectors X = {x_1, …, x_T}, the GMM likelihood can be written as:

p(X|λ) = Π_{t=1}^{T} p(x_t|λ)   (8)

The objective is to obtain the ML parameter estimates. The process is an iterative calculation known as the expectation-maximization (EM) algorithm [20]. It can be used to find the values for the parameters in the model λ that maximizes the likelihood function. The basic idea of the EM algorithm is as follows [19, 27]:

  • Start with an initial model λ to estimate a new model λ̅ such that p(X|λ̅) ≥ p(X|λ).

  • The new model then becomes the initial model for the next step and so on.

  • The model is recalculated iteratively using the previous step to estimate the actual one.

  • The process continues until convergence, that is, until the parameters of λ reach stable values.

  • The number of iterations is typically around 10, which is generally sufficient for the algorithm to reach the convergence threshold.

On each EM iteration, the parameter updates guarantee a monotonic increase in the model’s likelihood. The update formulas are as follows.

Mixture weights

w̄_i = (1/T) Σ_{t=1}^{T} p(i|x_t, λ)   (9)

Mean

μ̄_i = Σ_{t=1}^{T} p(i|x_t, λ) x_t / Σ_{t=1}^{T} p(i|x_t, λ)   (10)

Variance

σ̄_i² = Σ_{t=1}^{T} p(i|x_t, λ) x_t² / Σ_{t=1}^{T} p(i|x_t, λ) − μ̄_i²   (11)

where σ_i², x_t, and μ_i refer to arbitrary (scalar) elements of the vectors σ_i², x_t, and μ_i, respectively [27].

The final step of ML estimation is to obtain the a posteriori probability for each feature vector. From equation (5) and Bayes’s rule, the a posteriori probability for acoustic class i is

p(i|x_t, λ) = w_i b_i(x_t) / Σ_{k=1}^{M} w_k b_k(x_t)   (12)

The EM algorithm iteratively refines the model parameter estimates to maximize the likelihood that the model matches the distribution of the training data, and the parameters converge to a final solution within a few iterations. The speed of convergence of EM is influenced by the overlap of the mixture components [25, 36]. It is also observed that imbalanced mixture coefficients require more iterations to converge, whereas for balanced mixing coefficients, EM converges to nearly accurate parameters in fewer iterations.
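For illustration only, a single EM iteration for a diagonal-covariance GMM can be written directly from equations (9)–(12). This numpy sketch is not the authors' code; it omits initialization and convergence checks, and all names are hypothetical.

```python
import numpy as np

def em_step(X, w, mu, var):
    """One EM iteration for a diagonal-covariance GMM (eqs. (9)-(12)).

    X: (T, D) training vectors; w: (M,) mixture weights; mu, var: (M, D).
    """
    T, _ = X.shape
    # E-step: log component densities b_i(x_t), then posteriors p(i | x_t, lambda)
    log_b = -0.5 * (np.log(2.0 * np.pi * var).sum(axis=1)[None, :]
                    + (((X[:, None, :] - mu[None, :, :]) ** 2) / var[None, :, :]).sum(axis=2))
    log_num = np.log(w)[None, :] + log_b            # numerator of eq. (12), in the log domain
    log_num -= log_num.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    post = np.exp(log_num)
    post /= post.sum(axis=1, keepdims=True)         # eq. (12)
    # M-step
    Nk = post.sum(axis=0)                           # effective count per component
    w_new = Nk / T                                  # eq. (9)
    mu_new = (post.T @ X) / Nk[:, None]             # eq. (10)
    var_new = (post.T @ X ** 2) / Nk[:, None] - mu_new ** 2   # eq. (11)
    return w_new, mu_new, np.maximum(var_new, 1e-6) # floor the variances
```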

In an effort to improve the performance and the efficiency of the conventional GMMs, the authors of [16, 17] have proposed the fuzzy Gaussian mixture models (FGMMs), which are inspired from the mechanism of the fuzzy C-means (FCMs). They have experimentally shown that their methods have better fitting performance and faster convergence speed compared to conventional GMMs.

6.4 Dysfluency Identification Phase

This phase assigns a given test dysfluency to one of the four classes: syllable repetition, word repetition, prolongation, or interjection. For each unknown test dysfluency, the same feature extraction method is applied, and the resulting features are compared against each model in the training database. Each comparison yields a likelihood, and the model with the highest score identifies the unknown dysfluency.

For a reference set of S dysfluency classes {1, 2, …, S}, the dysfluency models are represented by the GMMs {λ_1, λ_2, …, λ_S}. The idea is to identify the dysfluency model that has the maximum a posteriori probability for a given observation sequence. In other words, for a given set of n test feature vectors, a set of n acoustic classes is extracted, which provides a set of n a posteriori probabilities, and the maximum value identifies the dysfluency. The unknown dysfluency is, therefore, represented by:

Ŝ = argmax_{1 ≤ k ≤ S} P(λ_k|X) = argmax_{1 ≤ k ≤ S} [p(X|λ_k) Pr(λ_k) / p(X)]   (13)

where the second equality follows from Bayes’s rule. The equation can be simplified by assuming that all dysfluencies are equally likely (i.e. Pr(λ_k) = 1/S), so Pr(λ_k) is constant and can be neglected. Moreover, p(X) is the same for all dysfluency models and can also be neglected; the classification rule then simplifies to

Ŝ = argmax_{1 ≤ k ≤ S} p(X|λ_k)   (14)

Representing X as a set of feature vectors X = {x_1, x_2, …, x_T} and using logarithms, the unknown dysfluency is identified by:

Ŝ = argmax_{1 ≤ k ≤ S} Σ_{t=1}^{T} log p(x_t|λ_k)   (15)

where p(x_t|λ_k) is given in equation (5).
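Continuing the earlier scikit-learn sketch, the decision rule of equation (15) reduces to summing the frame log-likelihoods under each trained model and taking the argmax; names such as classify_segment are illustrative assumptions.

```python
def classify_segment(test_features, models):
    """Assign a test segment to the model with the highest summed frame
    log-likelihood, i.e. the argmax in equation (15).

    test_features: (n_frames, 39) array; models: dict of fitted GaussianMixture objects.
    """
    scores = {name: gmm.score_samples(test_features).sum()  # sum_t log p(x_t | lambda_k)
              for name, gmm in models.items()}
    return max(scores, key=scores.get)
```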

7 Results and Discussions

This section presents the experimental results of the classification of the four types of stuttering dysfluencies based on the GMM. The dysfluency segments are prepared as discussed in Section 5. The segmented data samples are divided into two sets, one for training and the other for testing. The distribution of speech segments between training and testing is presented in Table 2.

In the GMM, the most important and difficult task is to decide the number of mixtures needed to model a subject effectively. There is no hard and fast rule for choosing the number of mixture components [21]. The objective is to determine the minimum number of components required to adequately model a subject for good classification. Selecting too few mixture components does not effectively model the distinguishing characteristics of a subject, whereas choosing too many can lower performance and leads to excessive computational complexity in both training and classification. The performance of the GMM with 1, 2, 4, 6, 8, 16, 32, and 64 component densities is investigated. As an example, Figure 3 shows the classification results for syllable repetition. The graph plots the percent classification against the number of Gaussian components for 13, 26, and 39 MFCC parameters. The results show that the dysfluency classification system performs better with higher numbers of MFCC parameters and mixture components. The GMM is able to attain accurate approximations of arbitrarily shaped densities, which makes dysfluency recognition efficient.

Figure 3: Dysfluency Classification Performance for Different Mixtures and Parameters.

The results of the dysfluency classification experiments conducted using the MFCC and GMM are presented in Table 3. With 13 MFCC features, the lowest and highest average accuracies of 71.55% and 81.25% are achieved for model orders 8 and 64, respectively. In all the experiments, 64 mixture components gave the highest classification results. Figure 4 shows the results of the dysfluency classification for 64 mixture components with the three different MFCC parameter sets.

Table 3

GMM Classification Performance for Different Parameters and Model Order.

MFCC features | Model order | SR | WR | P | I | Average classification accuracy
13 | 8 | 63.5 | 72.8 | 72.3 | 77.6 | 71.55
13 | 16 | 74.1 | 73.5 | 74.0 | 79.5 | 75.28
13 | 32 | 78.6 | 76.1 | 75.1 | 83.4 | 78.3
13 | 64 | 81.2 | 79.2 | 80.3 | 84.3 | 81.25
26 | 8 | 81.9 | 80.1 | 82.3 | 86.3 | 82.65
26 | 16 | 85.6 | 83.5 | 83.1 | 89.2 | 85.35
26 | 32 | 87.8 | 85.6 | 84.0 | 91.6 | 87.25
26 | 64 | 90.6 | 88.0 | 86.5 | 92.7 | 89.45
39 | 8 | 90.1 | 87.6 | 91.8 | 93.4 | 90.73
39 | 16 | 91.8 | 91.3 | 93.0 | 94.4 | 92.66
39 | 32 | 94.3 | 93.6 | 94.6 | 97.2 | 94.96
39 | 64 | 96.4 | 95.0 | 95.7 | 98.6 | 96.43

SR, syllable repetition; WR, word repetition; P, prolongation; I, interjection.

Figure 4: Classification of Dysfluencies for Different MFCC Coefficients.

As can be seen in Table 3, the highest recognition accuracies of 96.40%, 95.00%, 95.70%, and 98.60% are achieved for SR, WR, P, and I, respectively. The best average dysfluency classification accuracy of 96.43% is achieved for 39 MFCC parameters and model order 64. This is the highest classification accuracy compared with the earlier research works indicated in Table 4, and it confirms that the GMM–MFCC framework provides excellent performance in classifying various dysfluencies.

Table 4

Some of the Previous Works Pertaining to the Recognition of Dysfluencies.

Author | Database size | No. of segments prepared | Types of dysfluencies considered | Feature | Classifier | Accuracy (%)
Ravikumar [23] 2008 | 10 | – | Repetition | MFCC | Perceptron | 83.00
Ravikumar [24] 2009 | 15 | – | Repetition | MFCC | SVM | 94.35
Chee [3] 2009 | 10 UCLASS | – | Repetition, prolongation | MFCC | k-NN, LDA | 90.91
Chee [2] 2009 | 10 UCLASS | – | Repetition, prolongation | LPCC | k-NN, LDA | 89.77
Ai [1] 2012 | 39 UCLASS | – | Repetition, prolongation | MFCC, LPCC | k-NN, LDA | 92.55, 94.51
Hariharan [11] 2012 | 39 UCLASS | – | Repetition, prolongation | LPC, LPCC, WLPCC | k-NN, LDA | 92.16, 96.47, 98.40
Fook [10] 2013 | 39 UCLASS | – | Repetition, prolongation | LPC, LPCC, WLPCC | SVM | 95.00
Palfy [21] 2014 | 16 UCLASS | 20 | Repetition, prolongation | MFCC | SVM | 98.00
Current study | 50 UCLASS | 200 | SR, WR, prolongation, interjection | MFCC, ΔMFCC, ΔΔMFCC | GMM | 96.43

The comparison between this study and previous works in terms of the recognition of dysfluencies, database size, and types of stuttered events considered is detailed in Table 4. The selection and preparation of the different types of dysfluent segments in this study, together with their recognition accuracies, are indicated in Table 5. It can be observed that previous works concentrated on two types of dysfluencies, repetition and prolongation, whereas this work considers four important dysfluencies that have a major impact on stuttering assessment. Further, in most of the previous works [1, 2, 10, 11, 21, 30, 35], there is no clear indication of the dysfluent segments used or of the individual recognition accuracies.

Table 5

Results of Proposed Methods Pertaining to the Recognition of Dysfluencies.

Database size | Features | Classifier | Dysfluency | Segments prepared | Accuracy (%)
50 UCLASS | MFCC, ΔMFCC, ΔΔMFCC | GMM | Syllable repetition | 50 | 96.40
50 UCLASS | MFCC, ΔMFCC, ΔΔMFCC | GMM | Word repetition | 50 | 95.00
50 UCLASS | MFCC, ΔMFCC, ΔΔMFCC | GMM | Prolongation | 50 | 95.70
50 UCLASS | MFCC, ΔMFCC, ΔΔMFCC | GMM | Interjection | 50 | 98.60

8 Conclusions

The classification of dysfluencies can be viewed as a pattern recognition problem. Hence, statistical methods are an essential tool for discriminating the different categories of dysfluencies. This work focused on introducing and evaluating the GMM, the state of the art in the ASR domain, for classifying stuttering dysfluencies. The GMM is specifically investigated for the classification of four types of dysfluencies using different model orders and MFCC parameter sets.

The best average classification accuracy of 96.43% was attained for 64 mixture components and 39 MFCC coefficients. The results indicate that the overall performance of the GMM for dysfluency classification is quite promising. In the future, the potential of GMM modeling will be investigated with an expanded dysfluency database.

It is evident that the accuracy of the system is enhanced by increasing the number of MFCC parameters and GMM mixture components; a sufficient number of mixture components allows an optimal representation of the data. Future work will focus on exploring the application of the FGMMs proposed in [17] to improve the performance and efficiency of the dysfluency classification system.


Corresponding author: P. Mahesha, Department of Computer Science and Engineering, S.J. College of Engineering, Mysore, India, e-mail:

Bibliography

[1] O. C. Ai, M. Hariharan, S. Yaacob and L. S. Chee, Classification of speech dysfluencies with MFCC and LPCC features, Expert Syst. Appl. 39 (2012), 2157–2165. doi:10.1016/j.eswa.2011.07.065.

[2] L. S. Chee, O. C. Ai, M. Hariharan and S. Yaacob, Automatic detection of prolongations and repetitions using LPCC, in: Proc. of the International Conference for Technical Postgraduates (TECHPOS), pp. 1–4, 2009. doi:10.1109/TECHPOS.2009.5412080.

[3] L. S. Chee, O. C. Ai, M. Hariharan and S. Yaacob, MFCC based recognition of repetitions and prolongations in stuttered speech using k-NN and LDA, in: Proc. of the IEEE Student Conference on Research and Development (SCOReD), pp. 146–149, 2009. doi:10.1109/SCORED.2009.5443210.

[4] A. K. Cordes, P. F. Ingham and J. C. Ingham, Time-interval analysis of interjudge and intrajudge agreement for stuttering event judgments, J. Speech Hear. Res. 35 (1992), 483–494. doi:10.1044/jshr.3503.483.

[5] A. Czyzewski, A. Kaczmarek and B. Kostek, Intelligent processing of stuttered speech, J. Intell. Inf. Syst. 21 (2003), 143–171. doi:10.1023/A:1024710532716.

[6] J. R. Dalton and C. Q. Peterson, The use of voice recognition as a control interface for word processing, Occup. Ther. Health Care 11 (1997), 75–81. doi:10.1080/J003v11n01_05.

[7] S. Davis and P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition, IEEE Trans. Acoust. Speech Signal Process. 28 (1980), 357–366. doi:10.1109/TASSP.1980.1163420.

[8] S. Davis, P. Howell and J. Bartrip, The UCLASS archive of stuttered speech, J. Speech Lang. Hear. Res. 52 (2009), 556–569. doi:10.1044/1092-4388(2009/07-0129).

[9] L. J. Ferrier, N. Jarell, T. Carpenter and C. Shane, A case study of a dysarthric speaker using the DragonDictate speech recognition system, J. Comput. User. Speech Hear. 8 (1992), 33–52.

[10] C. Y. Fook, H. Muthusamy, L. S. Chee, S. B. Yaacob and A. H. B. Adom, Comparison of speech parameterization techniques for the classification of speech disfluencies, Turk. J. Elec. Eng. Comput. Sci. 21 (2013), 1983–1994. doi:10.3906/elk-1112-84.

[11] M. Hariharan, L. S. Chee, O. C. Ai and S. Yaacob, Classification of speech dysfluencies using LPC based parameterization techniques, J. Med. Syst. 36 (2012), 1821–1830. doi:10.1007/s10916-010-9641-6.

[12] P. Howell and M. Huckvale, Facilities to assist people to research into stammered speech, Stammering Research: An On-line Journal Published by the British Stammering Association, pp. 130–242, 2004.

[13] P. Howell, S. Sackin and K. Glenn, Automatic recognition of repetitions and prolongations in stuttered speech, in: Proc. of the First World Congress on Fluency Disorders, pp. 372–374, 1995.

[14] P. Howell, S. Sackin and K. Glenn, Development of a two-stage procedure for the automatic recognition of dysfluencies in the speech of children who stutter: II. ANN recognition of repetitions and prolongations with supplied word segment markers, J. Speech Lang. Hear. Res. 40 (1997), 1085–1096. doi:10.1044/jslhr.4005.1085.

[15] W. Johnson, R. M. Boehmler, W. G. Dahlstrom, F. L. Darley, L. D. Goodstein, J. A. Kools, J. N. Neeley, W. F. Prather, D. Sherman, C. G. Thurman, W. D. Trotter, D. Williams and M. A. Young, The Onset of Stuttering: Research Findings and Implications, University of Minnesota Press, Minneapolis, 1959.

[16] Z. Ju and H. Liu, A unified fuzzy framework for human hand motion recognition, IEEE Trans. Fuzzy Syst. 19 (2011), 901–913. doi:10.1109/TFUZZ.2011.2150756.

[17] Z. Ju and H. Liu, Fuzzy Gaussian mixture models, Pattern Recognit. 45 (2012), 1146–1158. doi:10.1016/j.patcog.2011.08.028.

[18] D. Kully and E. Boberg, An investigation of interclinic agreement in the identification of fluent and stuttered syllables, J. Fluency Disord. 13 (1988), 309–318. doi:10.1016/0094-730X(88)90001-0.

[19] A. P. Dempster, N. M. Laird and D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm (with discussion), J. R. Stat. Soc. B 39 (1977), 1–38.

[20] Y. Liu, M. Russell and M. Carey, The role of dynamic features in text-dependent and independent speaker verification, in: Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, pp. 669–672, 2006.

[21] J. Palfy, Analysis of dysfluencies by computational intelligence, Inf. Sci. Technol. Bull. ACM Slovakia 6 (2014), 1–15.

[22] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice Hall Signal Processing Series, Pearson Education with Dorling Kindersley, India, 1993.

[23] K. M. Ravikumar, B. Reddy, R. Rajagopal and H. C. Nagaraj, Automatic detection of syllable repetition in read speech for objective assessment of stuttered disfluencies, in: Proc. of World Academy of Science, Engineering and Technology, pp. 270–273, 2008.

[24] K. M. Ravikumar, R. Rajagopal and H. C. Nagaraj, An approach for objective assessment of stuttered speech using MFCC features, Digital Signal Process. 9 (2009), 19–24.

[25] R. A. Redner and H. F. Walker, Mixture densities, maximum likelihood and the EM algorithm, SIAM Rev. 26 (1984), 195–239. doi:10.1137/1026034.

[26] D. A. Reynolds, Speaker identification and verification using Gaussian mixture speaker models, Speech Commun. 17 (1995), 91–108. doi:10.1016/0167-6393(95)00009-D.

[27] D. A. Reynolds, Gaussian mixture models, MIT Lincoln Laboratory, 244 Wood St., Lexington, MA 02140, USA.

[28] D. A. Reynolds, R. C. Rose and M. J. T. Smith, PC-based TMS320C30 implementation of the Gaussian mixture model text-independent speaker recognition system, in: Proc. of the International Conference on Signal Processing Applications and Technology, Boston, USA, pp. 967–973, 1992.

[29] R. T. Ritchings, M. McGillion and C. J. Moore, Pathological voice quality assessment using artificial neural networks, Med. Eng. Phys. 24 (2002), 561–564. doi:10.1016/S1350-4533(02)00064-4.

[30] P. Stoica and R. Moses, Spectral Analysis of Signals, Prentice Hall, Upper Saddle River, New Jersey, 2005.

[31] I. Świetlicka, W. Kuniszyk-Jóźkowiak and E. Smołka, Artificial neural networks in the disabled speech analysis, in: Computer Recognition Systems, vol. 57, Springer, Berlin/Heidelberg, pp. 347–354, 2009. doi:10.1007/978-3-540-93905-4_41.

[32] T. Tian-Swee, L. Helbin, A. K. Ariff, T. Chee-Ming and S. H. Salleh, Application of Malay speech technology in Malay Speech Therapy Assistance Tools, in: Proc. of the International Conference on Intelligent and Advanced Systems (ICIAS), pp. 330–334, 2007.

[33] J. Wang and C. Jo, Vocal folds disorder detection using pattern recognition methods, in: Proc. of the 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 3253–3256, 2007.

[34] M. Wiśniewski, W. Kuniszyk-Jóźkowiak, E. Smołka and W. Suszyński, Automatic detection of prolonged fricative phonemes with the hidden Markov models approach, J. Med. Inform. Technol. 11 (2007), 293–297.

[35] M. Wiśniewski, W. Kuniszyk-Jóźkowiak, E. Smołka and W. Suszyński, Improved approach to automatic detection of speech disorders based on the hidden Markov models approach, J. Med. Inform. Technol. 15 (2010), 145–152.

[36] C. F. Wu, On the convergence properties of the EM algorithm, Ann. Stat. 11 (1983), 95–103. doi:10.1214/aos/1176346060.

[37] E. Yairi and R. Curlee, The clinical-research connection in early childhood stuttering, Am. J. Speech Lang. Pathol. 6 (1997), 85–86. doi:10.1044/1058-0360.0604.85.

[38] J. S. Yaruss, Clinical measurement of stuttering behaviors, Contemp. Issues Commun. Sci. Disord. 24 (1997), 33–44. doi:10.1044/cicsd_24_S_27.

Received: 2014-7-11
Published Online: 2015-6-12
Published in Print: 2016-7-1

©2016 by De Gruyter

This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
