Measuring the intelligibility of dysarthric speech through automatic speech recognition in a pluricentric language

Speech intelligibility is an essential though complex construct for evaluating dysarthric speech. Various procedures can be used to measure speech intelligibility, most of which are based on subjective ratings assigned by experts. Since these procedures are subjective and laborious, automatic speech recognition (ASR) has been proposed to obtain objective metrics of intelligibility. Although promising results have been reported, ASR for dysarthric speech generally requires large amounts of data consisting of recorded and annotated speech. In the present study, we explored the possibility of using dysarthric speech resources from the dominant language variety to improve the performance of ASR systems on the dysarthric speech of the non-dominant variety of the same pluricentric language. Dutch is used as an example of a pluricentric language, with Netherlandic Dutch considered the dominant and Flemish Dutch the non-dominant variety. The performance of ASR is evaluated by using two types of intelligibility metrics: orthographic transcriptions and global intelligibility assessments, both obtained from experts. Overall, the results show that dysarthric speech data from the dominant language variety can contribute to improving automatic transcriptions and to developing objective, automatic global measures of speech intelligibility only when no data from the non-dominant variety are available for training ASR models.


Introduction
Patients suffering from dysarthria, a motor speech disorder caused by neurological injury such as Parkinson's disease or stroke, experience difficulties in speech production that can lead to reduced speech intelligibility. Intensive speech therapy can be provided to limit the loss in speech intelligibility or even achieve some improvement. Important instruments in this process are reliable and valid measurements of speech intelligibility, both to establish a diagnosis and to evaluate therapy effectiveness. So far, the usual metrics for measuring speech intelligibility in research and clinical practice are based on subjective judgements obtained from human raters. Because these metrics not only contain an element of subjectivity but are also rather time-consuming, researchers have been looking for more objective metrics based on acoustic measurements, possibly obtained semi-automatically or automatically. One of the technologies that has been used for this purpose is Automatic Speech Recognition (ASR) (Rosen and Yampolsky, 2000). Although promising results have been obtained, it is clear that applying ASR for this specific purpose requires large amounts of data (Keshet, 2018) consisting of recordings of patients' speech with annotations such as orthographic transcriptions. However, obtaining such data is both ethically challenging and practically laborious. By ethically challenging, we refer to the difficulties in acquiring ethical approval and in finding enough people willing to be recorded for academic research. These difficulties are more severe for languages with relatively limited resources, such as Dutch, which in general have fewer speech resources for training and testing ASR algorithms than a large language like English.
In fact, both Dutch and English are pluricentric languages, defined as languages with distinct varieties belonging to different countries. These varieties can be characterized as dominant and non-dominant according to their power asymmetries (Norrby et al., 2020). An interesting course of action that has been proposed for improving speech recognition in less-resourced languages with multiple varieties, such as pluricentric languages, is to employ speech resources from different varieties. This course of action is even more relevant for non-dominant varieties. For instance, Dutch is a pluricentric language with two standard language varieties: Netherlandic Dutch, spoken in the Netherlands, and Flemish Dutch, spoken in Flanders, Belgium. These two varieties have lexical differences, but the most significant differences lie in their phonetics (Verhoeven, 2005; Van de Velde et al., 2010), a distinction comparable to that in many other pluricentric languages. The pluricentric character of Dutch plays a major role in the language policy developed by the Netherlands and Flanders. 1 It has also been an important pillar in national and bi-national initiatives aimed at strengthening the digital infrastructure for the Dutch language, such as the Spoken Dutch Corpus (CGN; Oostdijk, 2000), the BLARK (Strik et al., 2002) and STEVIN (Spyns and Odijk, 2013). Thanks to these balanced programmes, both varieties of Dutch have developed important language and speech technology resources that can be used for both research and development. However, for conducting ASR-based research on dysarthric speech, the amounts of speech data with corresponding annotations are still insufficient. A complicating factor is that ASR performance on dysarthric speech is notoriously lower than on non-dysarthric speech. To address this problem, Yilmaz et al. (2016b) studied whether combining resources of Netherlandic Dutch and Flemish non-dysarthric speech would help improve ASR performance on Flemish dysarthric speech. The results showed that combining resources improved ASR recognition of Flemish dysarthric speech when comparing the ASR outputs to the prompts. These promising results open up opportunities for further research into dedicated ASR technology for the diagnosis and therapy of patients affected by speech disorders that reduce their speech intelligibility.
Currently, several languages can be defined as pluricentric, and resources for speech technology are available, to a limited extent, for these languages. However, studies exploring whether existing resources from different varieties of a pluricentric language can be employed to the benefit of recognizing speech from other varieties are few and far between. Some studies did employ resources of different varieties of a pluricentric language, for instance, to study accented speech recognition. Winata et al. (2020) performed a cross-accented English speech recognition task as a benchmark for measuring the ability of a model to adapt to unseen accents using the CommonVoice corpus. This corpus contains English read speech from 16 different areas, such as Africa, Australia, Canada, England, Hong Kong, India, and the United States. In their study, the corpus was split into training and test sets under two settings: (1) mixed-region and (2) cross-region. The focus was on evaluating a new approach, model-agnostic meta-learning (MAML), to develop a robust speech recognition system. Arsikere et al. (2019) studied a simple phone mapping approach to English multi-dialect acoustic modelling. In addition to evaluating their new approach, they discussed the gains from using resources of multiple varieties of English. The ASR model trained on four resources, i.e., American, Australian, Indian and British English, performed better not only for accents with smaller amounts of training data but also for those contributing more than half of the total training data.
Some other studies also used corpora from different varieties of a pluricentric language in training and test sets for dysarthric speech recognition. For instance, Librispeech, a corpus of American English non-dysarthric speech, was used as the training set, while TORGO, a corpus of Canadian English dysarthric speech, was used as the test set (Lin et al., 2020; Yue et al., 2020). However, these studies focused more on employing new approaches, developing models with different architectures, and on the difference between non-dysarthric and dysarthric speech than on the added value of using speech resources from different varieties of the same pluricentric language.
It is important to note that the majority of studies on ASR for dysarthric speech recognition did not employ pluricentric language resources. Some of these studies are cross-lingual, in the sense that the ASR models are trained on dysarthric speech of one language and then tested on dysarthric speech of another language (Takashima et al., 2019a). This is different from studies employing pluricentric language resources, where the speech data for training and testing ASR models belong to the same language although they originate from different varieties. Some studies are cross-speaker-type, in which ASR models are trained on non-dysarthric speech and tested on dysarthric speech (Haderlein et al., 2011; Le et al., 2016; Middag et al., 2008b; Wang et al., 2021). Others combine these two types of approach (Bhat and Strik, 2020; Takashima et al., 2019b). Many of these studies showed moderate correlations between ASR-based feature vectors and subjective intelligibility measures (Le et al., 2016; Middag et al., 2008b; Van Nuffelen et al., 2009). For example, Middag et al. (2008b) reported on training ASR systems to generate different features and then feeding different combinations of these features into an intelligibility prediction model to generate intelligibility scores automatically. High correlations were reported for combinations of different types of pathological speech and for a specific pathology type, 0.86 and 0.90, respectively. Haderlein et al. (2011) reported a human-machine correlation of r = 0.85 when testing on German dysarthric speech with an ASR system trained on German non-dysarthric speech. However, it is difficult for speech-language pathologists in clinical practice to interpret such correlations, which are calculated between ASR-based feature vectors and subjective intelligibility measures. Speech-language pathologists lack the background knowledge to interpret feature sets such as those based on Mel-Frequency Cepstral Coefficients (MFCCs). In addition, different studies may use different sets of features, making it difficult to compare results across studies. Moreover, these ASR models were trained and tested on the speech of only one language variety (Haderlein et al., 2011; Middag et al., 2008b) or used specifically designed data, e.g., the word lists used in the Dutch Intelligibility Assessment (DIA; Middag et al., 2008a). Such specifically designed data require a substantial amount of human effort to prepare, may not be available for other languages or for varieties of a different pluricentric language, and are less ecologically valid. As described above, most of these studies focused on training specific acoustic models, adapting non-dysarthric speech models to dysarthric speech, and identifying alternative feature sets that are less dependent on mismatched acoustic models. In any case, they did not consider resources from varieties of a pluricentric language, which might benefit the recognition of dysarthric speech or its relation to speech intelligibility.
In the present paper, we investigated whether speech corpora pertaining to the dominant variety of a pluricentric language can be used to develop objective, automatic measures of speech intelligibility of dysarthric speech in the non-dominant variety. We use dominant in the sense of having more speakers and larger resources. To broaden the scope of the paper, we abstracted away from the specific scenario in which speech resources are, to a certain extent, available for both varieties, the dominant and the non-dominant one. We hypothesized a scenario in which specific speech resources are not available for the non-dominant variety, so that speech resources from the dominant variety are used instead. Subsequently, we compared the results obtained in this hypothetical scenario with those obtained in the realistic scenario in which specific speech resources from the non-dominant variety are actually available. As mentioned above, it is important to note that in the case of pathological speech in general, and dysarthric speech in particular, we are always dealing with relatively small amounts of speech data.
In line with research on the intelligibility of dysarthric speech, different measures can be obtained at different levels of granularity through different measurement methods. An important distinction is that between rating-based measures, such as a global measure, and transcription-based measures through orthographic transcriptions, a sort of verbatim representation of the speech produced by a patient. These measures have both advantages and disadvantages that we will not discuss here. Suffice it to say that both can be very useful for diagnosis and therapy. This means that for both types of measures, it is interesting to investigate to what extent speech intelligibility can be measured automatically by employing ASR technology. In turn, it is meaningful to investigate to what extent speech data of different varieties of the same pluricentric language can be usefully employed to develop such objective, automatic measures of speech intelligibility.
The Dutch language is used here as an example of a pluricentric language since several dysarthric speech resources are available for its dominant variety, Netherlandic Dutch, and its non-dominant variety, Flemish Dutch, albeit of limited size. Accordingly, we address the following two research questions:

RQ1: To what extent can Netherlandic Dutch dysarthric speech data contribute to improving automatic transcriptions of Flemish dysarthric speech?

RQ2: To what extent can Netherlandic Dutch dysarthric speech data contribute to developing objective, automatic global measures of speech intelligibility for Flemish dysarthric speech?

It is important to note that this paper evaluates ASR models by using two types of intelligibility metrics: orthographic transcriptions and global intelligibility assessments, both obtained from experts. To address our first research question, the target references used in the ASR systems were orthographic transcriptions obtained from experts, including the pronunciation errors they transcribed. To address our second research question, canonical words were used as target references, and the results were compared with the global intelligibility assessments, which were also collected from experts.

Experimental design
For each research question, we investigated two different scenarios: (a) a hypothetical scenario in which Flemish dysarthric speech data are not available for training the ASR models, so that one has to resort to dysarthric speech data of Netherlandic Dutch; (b) a realistic scenario in which some Flemish dysarthric speech data are available and can be used for training. In each scenario, we trained two ASR models on different data, with and without Netherlandic dysarthric speech, and compared their performance. For all four ASR models, we used the same language model, trained on the combined Flemish and Netherlandic Dutch text materials of the CGN corpus (Oostdijk, 2000). Detailed information on the training and test sets for each ASR model in the two scenarios is presented in Table 1.

Speech data
We describe the speech corpora from which we selected the speech data for the training sets (FN, ND and FD-train) and the test set (FD-test). Speech segments containing a non-speech sound produced by the speaker or containing incomprehensible words were excluded from both the training and test sets.
For Flemish non-dysarthric speech (FN), we used the read speech components of Flemish speech in the CGN corpus (Oostdijk, 2000). This is a Netherlandic Dutch-Flemish speech corpus containing representative collections of contemporary standard Dutch as spoken by non-dysarthric adults in the Netherlands and Flanders. The corpus comprises 14 components, such as face-to-face conversations, interviews, telephone conversations, lectures, and read speech. We selected only the read speech components in order to have speech material comparable to the dysarthric speech in the test set. The read speech is based on an open set of texts, resulting in recordings of varying length. This read speech also has less background noise than the other components and is thus rather clean for training ASR models, and it covers a large number of speakers. The duration of the selected training data from CGN is 6 h 42 min.
For Netherlandic Dutch dysarthric speech (ND), we combined three datasets: the EST dataset and two CHASING datasets. The EST dataset (Yılmaz et al., 2016a) contains Dutch dysarthric speech collected as part of the e-health-based speech therapy (EST) research program (Beijer, 2012). The collected data were annotated according to a common protocol to create a principled dysarthric speech corpus. It contains dysarthric speech from 16 patients: ten had Parkinson's Disease (PD), four had had a Cerebral Vascular Accident (CVA), one had suffered Traumatic Brain Injury (TBI), and one was affected by dysarthria due to a birth defect. These patients were aged from 34 to 75 years with a median of 54.5 years. The level of dysarthria varied from mild to moderate, with seven patients at mild, eight at moderate, and one at moderate-severe level. The speech tasks presented to the patients consisted of word and sentence lists of varying linguistic complexity. The database includes 12 Semantically Unpredictable Sentences (SUSs) with 6- and 13-word declarative sentences, 12 interrogative sentences of 6 words each, 13 Plomp and Mimpen sentences (Plomp and Mimpen, 1979), which are short, simple and representative of conversational speech, 5 short texts, 30 sentences with /t/, /p/ and /k/ in initial position and an unstressed syllable, 15 sentences with /a/, /e/ and /o/ in unstressed syllables, productions of the 3 individual vowels /a/, /e/ and /o/, 15 bisyllabic words with /t/, /p/ and /k/ in initial position and an unstressed syllable, and 25 words with alternating vowel-consonant combinations (CVC, CVCVCC, etc.). The duration of the selected training data from EST is 5 h 56 min.
The two CHASING datasets were collected in research aimed at developing a serious game for speech disorder treatment. 2 The first one, the CHASING01 dataset (Yılmaz et al., 2017), contains speech of five male patients who participated in speech training experiments and were tested at six different times during the treatment. The second one contains speech of five male and three female patients who were tested at three different times during the treatment. These patients were aged from 53 to 75 with a median of 63.5 years; ten of them had PD and three had had a CVA. For each test time, utterances of the following material were collected: 12 SUSs, 30 /p/, /t/, /k/ sentences in which the first syllable of the last word is unstressed and starts with /p/, /t/ or /k/, 15 vowel sentences with the vowels /a/, /e/ and /o/ in stressed syllables, pronunciations of the isolated vowels /a/, /o/ and /e/, 15 words with /p/, /t/, /k/ in word-initial position and in an unstressed syllable, and the "appeltaarttekst" ("apple cake recipe" in English) in five parts. The total duration of the selected training data from these two datasets is 15 h 16 min. Note that the two CHASING datasets did not report severity level information. However, the speakers in the two CHASING datasets were of similar age to those in the EST dataset and had the same diseases as some of the EST speakers.
For the Flemish dysarthric speech (FD), we used speech from the COPAS corpus (Middag, 2012) of 49 speakers who read the text 'Papa en Marloes' (PM, 'Papa and Marloes' in English), a commonly used narrative for assessing the intelligibility of pathological speech. We selected four of the nine sentences of the PM text: (1) "Papa en Marloes staan op het station." (in English "Papa and Marloes are at the station."), (2) "Marloes kijkt naar links." (in English "Marloes looks to the left."), (3) "In de verte ziet ze de trein al aankomen." (in English "In the distance she can see the train coming."), (4) "Het is al vijf over drie dus het duurt nog vier minuten." (in English "It is already five past three so it will take another four minutes."). These four sentences were selected following our previous study (Xue et al., 2021), in which they were chosen because they vary in length and contain the corner vowels /a:/, /u/ and /i/. For the training set (FD-train), we used speech of 23 speakers aged from 11 to 78 with a median of 44.5 years. These speakers varied in severity of dysarthria from mild to moderate, with 16 speakers at mild level, six at moderate level, and one speaker whose severity level was unknown. The total duration of FD-train is 6.4 min of speech. For the test set (FD-test), we used speech of 36 speakers aged from 8 to 85 with a median of 46.5 years. FD-test includes 10 non-dysarthric speakers and 12 mild, 9 moderate and 5 severe dysarthric speakers. The total duration of FD-test is 9 min of speech. Note that FD-test was built based on our previous study (Xue et al., 2021), which was conducted to collect various subjective intelligibility measures from experts; these intelligibility measures were used to evaluate our ASR models, as explained below in Section 2.3. Moreover, to ensure that all four ASR models were speaker-independent, the dysarthric speakers in FD-train and FD-test were different. This means that, excluding the 26 dysarthric speakers in the FD-test set based on Xue et al. (2021), the remaining 23 of the total 49 dysarthric speakers were used in the FD-train set.

Evaluation
To evaluate the performance of the ASR models, we calculated the word error rate (WER) between the ASR outputs and the target references. Specifically, for the first research question, the human orthographic transcriptions at word level were used as the target references to compute WER, and the distribution of the WERs over different utterances for different severity levels was analyzed. For the second research question, the prompts of the utterances were used as the target references, and the correlations between the WERs and the subjective global intelligibility measures were analyzed.
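As an illustration, WER can be computed from a Levenshtein alignment of the reference and hypothesis word sequences. The sketch below is a minimal, hypothetical helper (scoring toolkits such as Kaldi's report the same substitution/deletion/insertion breakdown, which we do not reproduce here):

```python
def wer(reference, hypothesis):
    """Return the word error rate (%) between two word sequences.

    Minimal dynamic-programming sketch: dp[i][j] holds the edit distance
    between the first i reference words and the first j hypothesis words.
    """
    ref, hyp = reference.split(), hypothesis.split()
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub,            # substitution or match
                           dp[i - 1][j] + 1,  # deletion
                           dp[i][j - 1] + 1)  # insertion
    return 100.0 * dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, against the reference "Marloes kijkt naar links", a hypothesis missing one of the four words yields a WER of 25%.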
The human orthographic transcriptions and the subjective global intelligibility measures for FD-test were collected from five speech-language pathologists acting as expert raters in our previous study (Xue et al., 2021). The global intelligibility measures were collected through a Visual Analogue Scale (VAS) ranging from 0 (not intelligible) to 100 (intelligible) with tick marks at every ten points (e.g., 10, 20, 30, etc.). The interrater reliabilities of the word accuracy (AcW) computed from the transcriptions and of the global measures from the VAS are 0.83 and 0.93, respectively. The construct validity, measured as the correlation between these two subjective intelligibility measures, is 0.75. The high interrater reliabilities and the validity show that it is legitimate to use these subjective intelligibility measures as references to evaluate the ASR models' performance and to address the research questions.

Implementation details of training
The ASR experiments were performed using the Kaldi ASR toolkit (Povey et al., 2011). A standard feature extraction scheme was used, applying Hamming windowing with a frame length of 25 ms and a frame shift of 10 ms. A conventional context-dependent GMM-HMM system with 20k Gaussians and 2500 triphone states was trained on the 39-dimensional MFCC features, including the deltas and delta-deltas. We also trained a GMM-HMM system on LDA-MLLT features, followed by speaker adaptive training (SAT) using fMLLR features. This system was used to obtain the state alignments required for training the DNN described below.
The DNNs, with 6 hidden layers and 1024 sigmoid units per hidden layer, were trained on the 40-dimensional log-Mel filterbank features with the deltas and delta-deltas. The DNN training was done by mini-batch Stochastic Gradient Descent with a minibatch size of 256 and the default initial learning rate of 0.0015 in the first training stage. A trigram language model was trained on the text described in Section 2.1.
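For illustration, the log-Mel filterbank extraction named above can be sketched in numpy. The frame length (25 ms), frame shift (10 ms) and 40 mel bands mirror the setup described in this section; the FFT size of 512 and the helper itself are assumptions, and the sketch omits details such as pre-emphasis and dithering that a full Kaldi front end includes:

```python
import numpy as np

def log_mel_filterbank(signal, sr=16000, n_mels=40,
                       frame_len_ms=25, frame_shift_ms=10, n_fft=512):
    """Minimal log-Mel filterbank sketch: Hamming-windowed frames,
    power spectrum, triangular mel filters, log compression."""
    frame_len = int(sr * frame_len_ms / 1000)     # 400 samples at 16 kHz
    frame_shift = int(sr * frame_shift_ms / 1000)  # 160 samples at 16 kHz
    window = np.hamming(frame_len)
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    frames = np.stack([signal[i * frame_shift:i * frame_shift + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # mel scale and its inverse
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # n_mels triangular filters equally spaced on the mel scale up to Nyquist
    mel_pts = np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * inv_mel(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return np.log(power @ fbank.T + 1e-10)  # (n_frames, n_mels)
```

One second of 16 kHz audio yields 98 frames of 40 log-Mel coefficients under these settings.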

Statistical analyses
To compare the performance of the different ASR models, we applied statistical analyses using the stats (R Core Team, 2020), base (R Core Team, 2020), car (Fox and Weisberg, 2019) and ggplot2 (Wickham, 2016) packages in RStudio (RStudio Team, 2020) with R version 4.0.2 (R Core Team, 2020). Specifically, to address the first research question, we calculated the mean and standard deviation of the WER scores computed between ASR outputs and human orthographic transcriptions for the four ASR models per utterance and per speakers' severity level. We applied t-tests to study the differences in WER scores between the models and computed their effect sizes based on Cohen's d. We also made a boxplot of WER by severity level. To address the second research question, we first calculated the correlation coefficients between the ASR outputs (WER) and the subjective global intelligibility measures at utterance and speaker level, as well as the significance of the correlations. The ASR outputs (WER) were logit transformed, as WER is a proportional measure. The logit transform was calculated as the log of the proportion divided by one minus the proportion, where the proportion is WER divided by 100. Note that WER values of 0 were set to 1 and values of 100 to 99 before the transform, as it is not possible to logit transform 0 and 100. To further study the relationship between perceptual and computed intelligibility scores, we applied regression analysis (Bocklet et al., 2012; Middag et al., 2008b; Van Nuffelen et al., 2009). We visualized the results at speaker level in scatterplots. We studied the residuals of the linear regression for each severity level in the four ASR models and applied Levene's test to the four models to test whether the variances of the residuals in the four severity groups of speakers differed.
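The logit transform described above can be sketched as follows; `logit_wer` is a hypothetical helper name, and the clamping of 0 to 1 and 100 to 99 mirrors the procedure applied before transformation:

```python
import numpy as np

def logit_wer(wer_percent):
    """Logit-transform WER (%): clamp 0 -> 1 and 100 -> 99 to keep the
    proportion strictly inside (0, 1), then take log(p / (1 - p))."""
    w = np.clip(np.asarray(wer_percent, dtype=float), 1.0, 99.0)
    p = w / 100.0
    return np.log(p / (1.0 - p))
```

A WER of 50% maps to a logit of 0, and the extremes 0 and 100 map to the logits of 1% and 99%, respectively.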

RQ1: To what extent can Netherlandic Dutch dysarthric speech data contribute to improving automatic transcriptions of Flemish dysarthric speech?
The mean and standard deviation (SD) of WER and of the numbers of errors, i.e., substitutions, deletions, and insertions, for the four ASR models are presented in Table 2. The values of the t-tests (df = 719) between all model pairs were as follows: the WER of FN was greater than those of the other three models with t values of 9.14, 24.18, and 23.45, respectively; the WER of FN+ND was greater than those of the other two models with t values of 19.78 and 19.55, respectively; the WER of FN+FD-train was greater than that of FN+FD-train+ND with a t value

Table 2
Mean (standard deviation) of word error rate (WER) and of the numbers of errors, i.e., substitutions, deletions, and insertions, computed between ASR outputs and human orthographic transcriptions (lower is better). All WER differences between all model pairs were significant (t-test, p < 0.05).

As can be seen in Table 2, the mean WER values decrease substantially from FN to FN+ND and to FN+FD-train, with a particularly large drop from FN+ND to FN+FD-train. FN+FD-train+ND performs slightly worse than FN+FD-train; although the difference was significant, it was small according to its effect size (Cohen's d = 0.04). The WER differences between the models seem to be reflected best in the substitutions. We still need to ascertain whether these WER reductions reflect a better performance of the ASR models in recognizing the words that speakers actually said.
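The effect size reported above is Cohen's d with a pooled standard deviation; the helper below is a hypothetical sketch of that standard formula:

```python
import math

def cohens_d(x, y):
    """Cohen's d between two samples, using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)  # sample variance of x
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)  # sample variance of y
    pooled_sd = math.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    return (mx - my) / pooled_sd
```

By the usual rule of thumb, d around 0.2 counts as small, so d = 0.04 indicates a negligible difference despite statistical significance.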
Since the interrater reliability of the word accuracy based on the human orthographic transcriptions, as mentioned in Section 2.3, was very high, we did not plot the WERs across the raters. Rather, we explored the model differences in WERs over the different severity levels for the four ASR models. The results are given in Fig. 1.
As can be seen in Fig. 1, the models perform differently across severity levels. The recurrent pattern is that the medians in the boxplots rise with increasing severity level. At the same time, we see smaller boxes for FN+FD-train and FN+FD-train+ND, especially for the mild and moderate levels, and lower WER scores for all severity levels. These results show that FN+FD-train and FN+FD-train+ND had less overlap between the severity levels than FN and FN+ND.

RQ2: To what extent can Netherlandic Dutch dysarthric speech data contribute to developing objective, automatic global measures of speech intelligibility for Flemish dysarthric speech?
As shown in Table 3, the magnitudes of the correlation coefficients gradually increase from FN to FN+FD-train+ND at both utterance and speaker level, although FN+FD-train+ND showed a slightly lower correlation coefficient than FN+FD-train at speaker level. The correlations at speaker level for the last two models in Table 3 are very high. All correlation coefficients were significant (p < .05).
The scatterplots in Fig. 2 show that the observed data points for FN+FD-train and FN+FD-train+ND are generally closer to the regression lines than those for FN and FN+ND, especially for non-dysarthric and mildly dysarthric speakers. This reflects the higher correlations of FN+FD-train and FN+FD-train+ND. For moderately and severely dysarthric speakers, we observed slightly larger distances between the predicted and observed scores. Fig. 2 also clearly shows one larger overestimation and one larger underestimation of WER on the basis of the VAS score in the moderate group.
We further investigated the residuals of the regression analyses for each severity level in the four ASR models. The results are shown in Table 4. Larger absolute mean values indicate that the speakers are further away from the regression lines in Fig. 2. We also applied Levene's test to the four models to test whether the variances of the residuals differed between the four groups of speakers with varying severity levels. We observed no significant differences for model FN. Significant differences were found for the other three models (p = .018 for FN+ND; p = .040 for the last two models). These results indicate that the models are less successful in predicting WER from VAS for dysarthric speakers, especially in the moderate and severe groups. This effect is also visible in Fig. 2, as mentioned above. Nevertheless, FN+ND matches the two successfully trained models, i.e., FN+FD-train and FN+FD-train+ND, better than FN does.
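Levene's test, as provided by the car package used here, is essentially an ANOVA on absolute deviations of the residuals from their group means. A minimal, hypothetical sketch of its W statistic follows (the p-value is then read off an F(k-1, N-k) distribution, which we omit):

```python
def levene_w(*groups):
    """Levene's test statistic W (centered on group means), computed over
    k groups of observations; larger W means more unequal group variances."""
    k = len(groups)
    # absolute deviations of each observation from its group mean
    z = [[abs(x - sum(g) / len(g)) for x in g] for g in groups]
    n = [len(g) for g in groups]
    N = sum(n)
    zbar_i = [sum(zi) / len(zi) for zi in z]      # per-group mean deviation
    zbar = sum(sum(zi) for zi in z) / N           # overall mean deviation
    num = (N - k) * sum(ni * (zb - zbar) ** 2 for ni, zb in zip(n, zbar_i))
    den = (k - 1) * sum(sum((x - zb) ** 2 for x in zi)
                        for zi, zb in zip(z, zbar_i))
    return num / den
```

Groups with identical spread give W = 0, while a group with a much larger spread inflates W.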

Discussion
In this paper, we have investigated whether speech resources pertaining to the dominant variety of a pluricentric language are useful for obtaining objective, automatic, speech technology-based intelligibility measures for dysarthric speech of the non-dominant variety. Our study focused on two types of intelligibility metrics that are often employed in dysarthric speech research, orthographic transcription and a global measure, as addressed by our two research questions. For each research question, a hypothetical and a realistic scenario are discussed separately.

RQ1: To what extent can Netherlandic Dutch dysarthric speech data contribute to improving automatic transcriptions of Flemish dysarthric speech?
Fig. 1. Boxplots for WER computed between ASR outputs and human orthographic transcriptions in the four ASR models per speakers' severity level. All WER differences between all model pairs were significant (t-test, p < 0.05). Best viewed in color.

The results presented in Section 3 indicate that in the hypothetical scenario, adding the ND speech data leads to a lower word error rate. This means that the automatic transcriptions are more in line with those made by expert transcribers. Improvements are observed for all utterances and all severity levels, as well as for non-dysarthric speech. These results are plausible. Dysarthric speech deviates considerably from non-dysarthric speech. When dysarthric speech data from the non-dominant variety are not available for training, adding dysarthric speech of the dominant variety helps because this added speech is similar to the test speech of the non-dominant variety. Similar results can also be found for accented speech in Winata et al. (2020), where the performance of ASR models on unseen accented speech, e.g., English speech in Wales, was better when other accented speech was added for pre-training.
In the realistic scenario, we see that the FN+FD-train and FN+FD-train+ND models perform comparably in all cases. This seems to indicate that once some dysarthric speech of the non-dominant variety is available for training, adding dysarthric speech of the dominant variety does not make a significant contribution. However, the marginal differences in results between these two models might have to do with a possible mismatch between the two types of dysarthric speech over and above the fact that they are from different varieties. In this specific case, there is a speaker mismatch between ND and FD-test. The ND data were recorded from speakers aged between 34 and 75, while the FD-test set also contains young children. Besides, the ND data contain speakers who are mildly to moderately dysarthric, while the FD-test set also contains non-dysarthric and severely dysarthric speakers. Another explanation could be that solely adding FD-train, which is similar to FD-test in terms of language variety and text of the recordings, already led to a large boost. Therefore, further adding ND, which is very different from FD-test, does not contribute much useful information. This is confirmed by the fact that the FN+FD-train model performs better than the FN+ND model.
Overall, the results show that adding Netherlandic Dutch dysarthric speech can help obtain better automatic transcriptions of Flemish dysarthric speech if no Flemish dysarthric speech data are available for training. Note that Yilmaz et al. (2016b) also studied ASR models for Flemish dysarthric speech, but added Netherlandic Dutch non-dysarthric rather than dysarthric speech for training. Also, they evaluated their ASR models by comparing the automatic transcriptions with prompts, whereas we evaluated ours by comparing the automatic transcriptions with orthographic transcriptions obtained from human experts.
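The transcription-based metric in RQ1 is standard word error rate. As a minimal sketch, WER can be computed as a word-level Levenshtein distance normalized by the length of the reference; the Dutch sentences below are invented examples, not items from the corpus.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words
    # (substitutions, deletions and insertions all cost 1).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# The reference here is the human orthographic transcription, not the prompt.
print(wer("de hond loopt in de tuin",
          "de hond loop in tuin"))  # 2 edits over 6 reference words
```

Note that, as in the study, taking the expert orthographic transcription (including the speaker's actual errors) as the reference measures agreement with human listeners rather than with the intended text.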

RQ2: To what extent can Netherlandic Dutch dysarthric speech data contribute to developing objective, automatic global measures of speech intelligibility for Flemish dysarthric speech?
In the hypothetical scenario, as described in Section 3, the correlations between WER and the subjective global intelligibility measure (VAS) for the FN+ND model are clearly higher than those for the FN model at both utterance and speaker levels. Similarly, the scatterplots, as well as the means and standard deviations of the residuals of the regression analyses, show that the FN+ND model performs better at speaker level. This seems to be in line with the finding for RQ1 in the same scenario. The FN+ND model yields outputs that are more similar to the transcriptions by expert transcribers for all severity levels as compared to the FN model.
In the realistic scenario, the correlations for the FN+FD-train+ND model are higher than those for the FN+FD-train model at utterance level. The correlations at speaker level are comparable in the two models when combining all speakers. Similarly, the scatterplots, as well as the means and standard deviations of the residuals of the regression analyses, show that the two models perform similarly at speaker level for speakers at all severity levels and for non-dysarthric speakers. The marginal difference between the two models might be due to the mismatch between the ND and the FD-test sets, as described above, concerning speaker, text of the recordings and language variety. In addition, these two models both outperform the FN and the FN+ND models, especially for non-dysarthric and mildly dysarthric speakers, since the added FD-train set is very similar to the FD-test set. Many studies have reported very high correlations between ASR outputs and subjective measures of intelligibility of dysarthric speech, but only when the models were specifically designed or when the training and test sets had similar distributions (Kim et al., 2015; Middag et al., 2008b; Van Nuffelen et al., 2009). For example, Middag et al. (2008b) used ASR systems to align the speech and generate phonemic and/or phonological features, and used these features to predict phoneme intelligibility for speakers through a prediction model. Both the ASR systems and the prediction model were trained and tested on data from COPAS. Their study targeted only the word lists that were specifically designed for assessing the intelligibility of phonemes, i.e., the DIA task. Their other study (Van Nuffelen et al., 2009) used a similar system as described above, with ASR systems trained on CGN read speech and the CoCGN corpus (Demuynck et al., 1997) together with the prompts. Although their training set also included speech data from different varieties of Dutch, the training data were very similar to those in the test set. The individual contribution of each of the two varieties was not studied. Our study did not aim to explore a new feature or model structure. Rather, we chose a very common design and implementation of the ASR models and explored the influence of using speech resources in the training set that are of a different variety from those in the test set.
In general, adding Netherlandic Dutch dysarthric speech can contribute to developing objective, automatic global measures of speech intelligibility for non-dysarthric and dysarthric Flemish speech when no Flemish dysarthric speech data are available for training. Moreover, we studied ASR models by using orthographic transcriptions including errors as target references. This is different from typical robust ASR systems, where canonical words are used as target references. By applying such analyses, we were able to explore the possibility of replacing human transcribers with ASR models, resulting in a fully automatic assessment of intelligibility for dysarthric speech. Again, as discussed above, further exploration is needed of the condition in which the Netherlandic Dutch dysarthric speech in the training set and the Flemish dysarthric speech in the test set are matched in terms of speaker and text. Note that Dutch is in the very specific situation that speech resources are available for both varieties, the dominant and the non-dominant one. In reality, this situation may not hold true for other pluricentric languages, which is exactly the hypothetical scenario we discussed. Therefore, our findings in this study provide important insights for these pluricentric languages. It is feasible to use speech resources from the dominant variety to improve dysarthric speech recognition for the non-dominant variety, and especially to improve the correlation with global intelligibility measures of dysarthric speech.

Conclusion
In this paper, we investigated the potential contribution of dysarthric speech resources from the dominant variety of a pluricentric language as additional training data for improving automatic speech recognition of dysarthric speech of the non-dominant variety. The performance of automatic speech recognition was evaluated by two types of intelligibility metrics: orthographic transcriptions and global intelligibility assessments, both obtained from experts. Overall, the results show that adding dysarthric speech of the dominant variety, i.e., Netherlandic Dutch dysarthric speech, can help obtain better automatic transcriptions and global intelligibility measures for dysarthric speech of the non-dominant variety, i.e., Flemish Dutch dysarthric speech. However, this positive effect is overpowered by other factors when dysarthric speech resources from the non-dominant variety are available. In the future, it is worth exploring the condition in which the Netherlandic Dutch dysarthric speech in the training set and the Flemish dysarthric speech in the test set are matched in terms of speaker and text. Given the high correlation between the ASR outcomes and the subjective intelligibility measures, further research is also needed to develop ASR models as a tool to estimate and diagnose the intelligibility of dysarthric speech.
Interestingly, exploiting pluricentric language resources is not only applicable to assessing the intelligibility of dysarthric speech but also to other domains of speech research, particularly when resources are scarce or asymmetric, with much larger speech resources available in the dominant variety. For example, it may help improve the robustness of ASR for non-native speech recognition in the non-dominant variety. Another domain is aphasic speech. However, the success of a pluricentric approach depends on the linguistic differences between the varieties of the pluricentric languages involved. As we pointed out earlier, phonetic differences are a main component distinguishing Netherlandic and Flemish Dutch.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

(1) To what extent can Netherlandic Dutch dysarthric speech data contribute to improving automatic transcriptions of Flemish dysarthric and non-dysarthric speech?
(2) To what extent can Netherlandic Dutch dysarthric speech data contribute to developing objective, automatic global measures of speech intelligibility for Flemish dysarthric and non-dysarthric speech?

Fig. 2 .
Fig. 2. Scatterplots with regression lines for the subjective global intelligibility measures, obtained through a Visual Analogue Scale (VAS), and WER (with logit transform) at speaker level for the four ASR models, with different colors for the severity levels. Best viewed in color.

Table 1
Detailed information on the training and test sets used for each ASR model in the two scenarios. FN = Flemish non-dysarthric speech, ND = Netherlandic dysarthric speech, FD = Flemish dysarthric speech. The FD speech data were split into a training and a test part.

Table 3
Magnitudes of the correlation coefficients between ASR outputs (with logit transform) and the subjective global intelligibility measures at utterance and speaker levels. All correlation coefficients were significant (p < .05).

Table 4
Mean and Standard Deviation (SD) of residuals of a regression analysis for each severity level in the four ASR models.