A comprehensive survey of automatic dysarthric speech recognition

ABSTRACT


INTRODUCTION
Dysarthria is a speech disorder caused by weakness in the muscles used for speech production, or by an inability to control them. It frequently produces slow or slurred speech that is difficult to understand. Dysarthria can result from a neurological disorder, weakness of the throat or tongue muscles, or facial paralysis [1], [2]. The muscles used for speech production are controlled by the brain and nervous system, and dysarthria mostly arises from damage affecting this control. Dysarthria is grouped into developmental and acquired dysarthria. Developmental dysarthria, normally found in children, occurs due to brain damage before or during birth. Acquired dysarthria generally occurs due to brain damage in adulthood or later in life, for example from brain tumors, stroke, head injury, motor neuron disease, or Parkinson's disease [3]-[5].
The term "dysarthria" refers to a variety of neurological speech abnormalities caused by injury to the central or peripheral nervous system. Reduced stress, slow speaking rate, hypernasality, muscular stiffness, spasticity, monopitch, and a limited range of speech movements are all signs of dysarthric speech. It can affect the subglottal, laryngeal, and articulatory systems, which can make speech production difficult. Stroke, Parkinson's disease, and cerebral palsy are the most common causes of motor speech difficulties. Improving human-machine interaction for persons with dysarthria is reported to be increasingly important for boosting overall wellness and independence. Physical impairments are common in people with dysarthria, making standard input methods (typing and touch screens) difficult to use [6], [7]. Traditionally, a language or speech therapist diagnosed dysarthria by asking people to read passages aloud, recite numbers or weekdays, make various sounds, or talk about a familiar topic. The performance of these traditional techniques is limited by factors such as inadequate expert knowledge, tiredness, and fatigue. Dysarthria may affect phonation, breathing, prosody, articulation, resonance, and lip movement, and it shows large variation in speech intelligibility; the extent of this variation may depend on the degree of nervous system damage. The typical symptoms of dysarthria are listed in Figure 1. Because of articulatory difficulties, there is no uniformity in articulation; pronunciation changes and speaking pace slows as a result of exhaustion. All of these characteristics impair the dysarthric speaker's intelligibility (the degree to which others can understand their speech) and limit verbal interactions, reducing their quality of life [8], [9].
The classification system helps to narrow down the dimensions of perceptual analysis of dysarthric speech. The classification of dysarthric speech is given in Figure 2. Most clinicians find it useful for correcting or reducing the deficits found in dysarthric speech production. Normal speakers typically communicate at rates between 150 and 200 words per minute; their speech is clear, timely, and contextually relevant. Speakers with severe impairments communicate at fewer than 15 words per minute, and this reduction in communication rate has implications for both quantity and quality. People suffering from dysarthria are often also physically challenged, making it difficult for them to handle conventional keyboard or mouse interfaces. Dysarthric speakers have difficulty contributing enough speech data samples; some tire quickly, which may lead to distress, and they often fail to utter certain sounds, resulting in phonetic variation [10], [11]. The generalized process of dysarthric speech recognition (DSR), shown in Figure 3, encompasses pre-processing, feature representation, classification, and DSR. The pre-processing phase carries out primary processing of the dysarthric speech to improve the quality of the features and the performance of the classifiers; it encompasses framing, cropping, speech separation, noise suppression, windowing, normalization, speech enhancement, and data augmentation. Dysarthric speech contains various reverberations, silent regions, stops, and a wide variety of pitch and signal energy, which motivates speech enhancement to improve DSR effectiveness. Feature extraction is an important phase for collecting the distinctive and unique characteristics of normal and dysarthric speech. The features are generally grouped into spectral, prosodic, voice quality, and Teager energy operator features.
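The framing, windowing, and normalization steps of the pre-processing phase can be sketched in plain numpy as follows; the 25 ms/10 ms frame geometry and the pre-emphasis coefficient are illustrative defaults, not values specified by the surveyed systems:

```python
import numpy as np

def preprocess(signal, frame_len=400, frame_shift=160, alpha=0.97):
    """Pre-emphasis, peak normalization, framing, and Hamming windowing.
    frame_len=400 / frame_shift=160 correspond to 25 ms / 10 ms at 16 kHz."""
    # Pre-emphasis boosts high frequencies: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Peak normalization reduces loudness variability across speakers
    emphasized = emphasized / (np.max(np.abs(emphasized)) + 1e-8)
    # Slice the utterance into overlapping frames
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    frames = np.stack([emphasized[i * frame_shift : i * frame_shift + frame_len]
                       for i in range(n_frames)])
    # Hamming window on each frame reduces spectral leakage
    return frames * np.hamming(frame_len)

x = np.random.randn(16000)          # one second of audio at 16 kHz
frames = preprocess(x)
print(frames.shape)                  # (98, 400)
```

Each row of the output is one windowed frame, ready for feature extraction.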
Traditional machine learning (ML) based DSR consists of feature extraction followed by classification, whereas deep learning (DL) may not require explicit feature extraction, since DL models typically combine hidden feature extraction layers with a classification layer. However, many hybrid DL algorithms use traditional features as input to boost speech intelligibility, feature representation, and DSR accuracy. Various DSR strategies have been presented over the last two decades; this section gives a quick overview of recent approaches. Voice tremor has been quantified using phonation parameters that characterize disordered voice, such as jitter and fundamental frequency [9], [12]. To avoid the gender and acoustic environment dependence of these parameters, a pitch period entropy-based evaluation was developed [13]. Hypophonia has also been described using fluctuation of energy and short-time energy [14]. The Teager-Kaiser energy operator, which provides a measure of speech intensity, is utilized to adjust for signal frequency [15]. To explore the influence on articulatory dynamics and speech intelligibility, acoustic cues based on the first three formants and their respective bandwidths can be studied [16]. Vowel space area (VSA) has been investigated for assessing speech intelligibility [17]. A support vector machine (SVM) classifier was used to distinguish dysarthric speech from healthy speech using a collection of glottal and openSMILE characteristics [18]. Gurugubelli and Vuppala [19] investigated analytic phase characteristics generated from voice signals using the single frequency filtering (SFF) approach. Audio descriptor features used for determining musical instrument timbre were combined with an artificial neural network (ANN) model to classify dysarthric speech severity levels [20]; multi-tapered spectral estimation was used to extract these audio descriptor features for dysarthria classification.
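Two of these measures can be illustrated with a minimal numpy sketch; the formulas below are the standard textbook definitions of the Teager-Kaiser energy operator and local jitter, not the exact implementations of the cited works:

```python
import numpy as np

def teager_kaiser(x):
    """Teager-Kaiser energy operator: psi[n] = x[n]^2 - x[n-1] * x[n+1]."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def local_jitter(periods):
    """Local jitter: mean absolute difference between consecutive pitch
    periods, normalized by the mean period."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

# For a sinusoid A*sin(w*n), the TKEO is exactly A^2 * sin(w)^2
psi = teager_kaiser(np.sin(0.1 * np.arange(1000)))
print(np.allclose(psi, np.sin(0.1) ** 2))            # True

# Pitch periods (in samples) of four hypothetical glottal cycles
print(round(local_jitter([100, 102, 99, 101]), 4))   # 0.0232
```

A healthy sustained vowel yields near-constant pitch periods and hence low jitter, while tremulous dysarthric phonation produces larger period-to-period fluctuations.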
Johnson et al. [21] evaluated recognition performance for dysarthric speech using automatic speech recognition (ASR) systems based on Gaussian mixture model (GMM) hidden Markov models (HMMs) and SVMs [22]. The experimental results showed that the HMM-based model may provide robustness against large-scale word-length variances, while the SVM-based model can alleviate the effect of deletion or reduction of consonants. Rudzicz [23] investigated GMM-HMM, conditional random field, SVM, and ANN acoustic models [24]; the results showed that the ANNs provided higher accuracy. In [25], multiple features such as Gammatone energy (GFE), modified group delay function cepstrum (MGDFC), and Stockwell features were presented for isolated-word DSR, with decision-level fusion using a vector quantization (VQ) classifier. A speech enhancement scheme was used to minimize distortions and improve speech intelligibility, resulting in a word error rate (WER) of 4% for dysarthric subjects with 6% intelligibility. Qatab and Mustafa [26] used spectral, cepstral, voice quality, prosodic, and overall speech features along with SVM, ANN, linear discriminant analysis (LDA), classification and regression tree (CART), Naïve Bayes (NB), and random forest (RF) classifiers for DSR. Seven feature selection algorithms were applied to select the dominant features: conditional information feature extraction (CIFE), double input symmetrical relevance (DISR), interaction capping (ICAP), conditional mutual information maximization (CMIM), conditional redundancy (Condred), joint mutual information (JMI), and relief. The combination of RF and relief feature selection achieved an average ranking score of 4.88. Janbakhshi et al. [27] presented singular value decomposition (SVD) for the spectro-temporal representation of dysarthric speech and temporal Grassmann discriminant analysis (T-GDA) for DSR.
This approach outperformed the traditional mel frequency cepstral coefficient (MFCC)-SVM based DSR. Subspace-based learning shows superior discrimination between normal and dysarthric speech, and the temporal subspace gives enhanced performance compared with the spectral subspace.
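The subspace comparison idea can be sketched with plain numpy: the SVD extracts a low-dimensional temporal subspace from a spectrogram, and two subspaces are compared through their principal angles. This is a simplified stand-in for the T-GDA of [27]; the spectrogram sizes and subspace dimension below are arbitrary:

```python
import numpy as np

def temporal_subspace(spectrogram, k=3):
    """k-dimensional temporal subspace of a (freq x time) spectrogram via
    SVD; the leading right singular vectors span the temporal subspace."""
    _, _, vt = np.linalg.svd(spectrogram, full_matrices=False)
    return vt[:k].T                           # (time, k) orthonormal basis

def grassmann_distance(a, b):
    """Distance between two subspaces from their principal angles,
    obtained as arccos of the singular values of A^T B."""
    s = np.linalg.svd(a.T @ b, compute_uv=False)
    angles = np.arccos(np.clip(s, -1.0, 1.0))
    return np.linalg.norm(angles)

rng = np.random.default_rng(0)
s1 = rng.random((64, 40))                     # two hypothetical spectrograms
s2 = rng.random((64, 40))
u1, u2 = temporal_subspace(s1), temporal_subspace(s2)
print(round(grassmann_distance(u1, u1), 6))   # 0.0 -- identical subspaces
print(grassmann_distance(u1, u2) > 0)         # True
```

Utterances whose temporal subspaces lie close to those of healthy speech would be classified as normal, and distant ones as dysarthric.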
Recently, DL technology has been widely used in many voice-based automation systems and has proven able to provide better performance than conventional ML based methods [28], [29]. Fathima et al. [30] applied a multilingual time delay neural network (TDNN) system that combined acoustic modeling and language-specific information to increase ASR performance; the TDNN-based ASR system achieved a suitable WER of 16.07%. Yue et al. [31] investigated a convolutional and light gated recurrent unit (LiGRU) based multi-spectral acoustic model for DSR. It used speed perturbation for data augmentation to mitigate the data scarcity problem, giving 11% and 40.6% WER for normal and dysarthric speech, respectively. Yue et al. [32] developed a multi-stream acoustic model based on a convolutional neural network (CNN), LiGRU, and fully connected multilayer perceptron (MLP) with an optimal fusion technique for DSR. The proposed model provided a WER of 4.6% for data pre-processed using electromagnetic articulography (EMA); the EMA pre-processing includes a Butterworth filter for measurement noise minimization and down-sampling for synchronization with the MFCC features.
Data scarcity is a major obstacle in DSR. Soleymanpour et al. [33] proposed a text to speech (TTS) synthesizer based on the FastSpeech model for data augmentation. The augmented data, fed to a deep neural network (DNN)-HMM with a light bidirectional GRU, gave a WER improvement of 12.2% over the baseline model. Traditional data augmentation approaches mainly focus on temporal variations of the signal while the spectral envelope remains the same. Liu et al. [34] presented vocal tract length perturbation (VTLP), tempo perturbation, and speed perturbation for data augmentation, covering temporal as well as spectral transformations of the dysarthric speech signal; their DNN and neural architecture search (NAS) based DSR provides WERs of 25.21% and 5.4% on the UASpeech and CUHK datasets, respectively. Shahamiri [35] used a voicegram to capture the correlation between phonemes and dysarthric speech, with a visual data augmentation model to minimize the data scarcity problem. The spatial-convolutional neural network (S-CNN) provides an accuracy of 67% on the UASpeech dataset, but it sometimes suffers from the vanishing gradient problem and gives poor results for moderate dysarthria. Speech intelligibility is heavily affected by the time-domain variance of dysarthric speech and background noise. Lin et al. [36] suggested that DL based voice conversion (DVC) using the phonetic posteriorgram (PPG) provides stable performance compared with DVCmel under noisy conditions.
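Speed perturbation, the simplest of these augmentations, amounts to resampling the waveform by a rate factor and can be sketched with numpy; the rates 0.9/1.0/1.1 are common choices in the ASR literature, not necessarily those used by the cited works:

```python
import numpy as np

def speed_perturb(signal, rate):
    """Speed perturbation: resample the waveform by `rate`.
    rate < 1 slows the utterance (longer output), rate > 1 speeds it up;
    both duration and pitch are scaled, unlike tempo perturbation,
    which stretches time while preserving pitch."""
    n_out = int(len(signal) / rate)
    old_idx = np.arange(len(signal))
    new_idx = np.linspace(0, len(signal) - 1, n_out)
    # Linear interpolation onto the stretched/compressed index grid
    return np.interp(new_idx, old_idx, signal)

x = np.random.randn(16000)               # 1 s of audio at 16 kHz
augmented = [speed_perturb(x, r) for r in (0.9, 1.0, 1.1)]
print([len(a) for a in augmented])       # [17777, 16000, 14545]
```

Each dysarthric utterance thus yields several plausible variants, multiplying the effective training set size.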
Kodrasi and Bourlard [37] suggested that spectro-temporal sparsity measured by the Gini index provided better performance than shimmer, jitter, fundamental frequency, harmonics to noise ratio (HNR), and MFCC for DSR; it was also observed that spectral sparsity outperforms temporal sparsity. Kodrasi [38] used a CNN for learning the temporal spectral characteristics obtained from the temporal envelope and fine structure (TEFS). The TEFS outperformed the traditional short-time Fourier transform (STFT) based spectrogram: TEFS-CNN provides 85.72% accuracy for DSR whereas STFT-CNN provides 69.76%. Chandrashekar et al. [39] investigated a time-frequency CNN for capturing the temporal as well as spectral properties of dysarthric speech. The spectro-temporal properties of the speech signals were obtained using the STFT, spectrograms from SFF, and the constant Q-transform (CQT). DSR performance showed higher accuracy for female subjects than for male subjects, and the training data deficiency resulted in a class imbalance problem. The time-frequency CNN captures the spectro-temporal variation of dysarthric speech better and shows significant improvement in DSR accuracy over traditional ANNs [40]. Fritsch and Doss [41] presented a recurrent neural network (RNN) based binary classifier and a CNN based multi-feature classifier, which showed high correlation for synthesized speech generated using TTS. This paper presents a comprehensive survey of distinct ML-based and DL-based DSR systems. It focuses on the DSR methodology, comprising enhancement, data augmentation, feature extraction, feature selection, and classification techniques, and it analyzes the datasets, experimental results, and performance metrics to depict the merits, demerits, and challenges of present DSR systems.
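The STFT log-magnitude spectrogram that such CNNs take as input can be sketched in a few lines of numpy; the window and hop sizes below are illustrative defaults, not the settings of the cited experiments:

```python
import numpy as np

def stft_spectrogram(signal, n_fft=512, hop=160):
    """Log-magnitude STFT spectrogram: Hann-windowed frames followed by a
    real FFT, giving a (n_fft // 2 + 1) x n_frames time-frequency image."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)).T   # (freq, time)
    return np.log(spec + 1e-8)                     # log compression

x = np.random.randn(16000)      # 1 s of audio at 16 kHz
s = stft_spectrogram(x)
print(s.shape)                  # (257, 97)
```

The resulting 2-D image is fed to the CNN exactly like a grayscale picture, letting convolutional filters learn local spectro-temporal patterns.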
Additionally, the performance of various ML and DL based DSR schemes is evaluated on the UASpeech dataset, and the results are analyzed using accuracy, recall, precision, and F1-score. The rest of the paper is structured as follows: section 2 elaborates the research method, section 3 gives the detailed results and findings, and section 4 concludes the paper and paves the way for future enhancement through the future scope.

RESEARCH METHOD
The process of the proposed analysis of different feature extraction and classification techniques for DSR is illustrated in Figure 4. The proposed system uses pre-emphasis filtering with a moving average filter to minimize noise and normalize the speech, diminishing the irregularities present in the speech signal. Various DL based DSR schemes such as DNN, DCNN, LSTM, and DCNN-LSTM are evaluated for DSR on the UASpeech dataset, as given in Figure 5. The five-layered 1-D DNN gives 85% accuracy for raw speech and 87.5% accuracy for 39 MFCC coefficients, which comprise 13 MFCC coefficients, 13 delta coefficients, and 13 delta-delta coefficients representing the spectral variation over the speech frames. It provides 89.45% and 90.56% accuracy for 2-D representations of the speech signal using the CQT and MFCC spectrograms, respectively. It is noted that the 2-D representation provides better spectral and spatial representation of the speech signal and helps to improve accuracy over the 1-D representation. Further, a five-layered DCNN is used, which encompasses convolution, batch normalization, and maximum pooling at every layer, with 32, 64, 96, 128, and 256 filters in the first to fifth layers. The DCNN provides 86.60% accuracy for raw speech and 88.80% accuracy for 39 MFCC coefficients, and 90.10% and 91% accuracy for the CQT and MFCC spectrograms. Afterward, an LSTM with five layers is employed for representing the temporal characteristics of the dysarthric signal, giving 85%, 86.20%, 87%, and 88.50% accuracy for raw speech+LSTM, MFCC coefficients+LSTM, CQT spectrogram+LSTM, and MFCC spectrogram+LSTM, respectively. The DCNN achieves the best spectral representation but lacks a time-domain representation of the signal.
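The 39-dimensional feature vector mentioned above is built by stacking the 13 static MFCCs with their first and second temporal derivatives. A numpy sketch of the standard delta regression formula follows; the MFCC matrix below is random, standing in for real features:

```python
import numpy as np

def delta(feat, n=2):
    """Delta coefficients via the standard regression formula over a
    window of +/- n frames; edge frames are padded by repetition."""
    denom = 2 * sum(i * i for i in range(1, n + 1))
    padded = np.pad(feat, ((n, n), (0, 0)), mode="edge")
    return sum(i * (padded[n + i : n + i + len(feat)] -
                    padded[n - i : n - i + len(feat)])
               for i in range(1, n + 1)) / denom

mfcc = np.random.randn(120, 13)        # hypothetical 13-dim MFCCs per frame
d1 = delta(mfcc)                        # 13 delta coefficients
d2 = delta(d1)                          # 13 delta-delta coefficients
feat39 = np.hstack([mfcc, d1, d2])      # 39-dim vector per frame
print(feat39.shape)                     # (120, 39)
```

The deltas capture the frame-to-frame spectral dynamics that static MFCCs miss, which is why the 39-dimensional vector outperforms raw speech in the reported results.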
To improve the time-domain characteristics, the LSTM is combined with the DCNN, fusing the frequency-domain and time-domain characteristics of the speech signal for DSR. The DCNN-LSTM provides 88.20% accuracy for raw speech and 89.20% accuracy for the 39 MFCC coefficients (13 MFCC, 13 delta, and 13 delta-delta coefficients representing the spectral variation over the speech frames). It provides 91.5% and 93% accuracy for the 2-D representations of the speech signal using the CQT and MFCC spectrograms, respectively.

CONCLUSION
Thus, this article presents a survey of DSR based on various ML and DL approaches, covering the methodology, databases, evaluation metrics, advantages, disadvantages, and findings of each study. It is observed that DL techniques outperform traditional ML techniques because of their superior feature representation; DL approaches are less dependent on handcrafted features than traditional ML based approaches. The experimental results show that the DL based DSR schemes outperform the ML based DSR schemes and provide better feature representation than traditional handcrafted features. The performance of the DL framework is better for the 2-D representation of the speech signal than for the 1-D signal because of its higher representation capability in the spectral and spatial domains. Also, the combination of DCNN and LSTM is superior to DNN, DCNN, and LSTM alone, as it has better feature representation capability in both the spectral and temporal domains. Database generation is a challenging task because of the unavailability of proper resources and proper ground truth. DSR remains very challenging due to variability in speech intelligibility caused by attributes such as language, age, gender, region, and noise.