Exploring Native and Non-Native English Child Speech Recognition With Whisper

Modern end-to-end Automatic Speech Recognition (ASR) systems struggle to recognise children's speech. This challenge stems from the high acoustic variability in children's voices and the scarcity of child speech training data, particularly for accented or low-resource languages. This study focuses on improving the performance of ASR on native and non-native English child speech using publicly available datasets. We evaluate how the large-scale Whisper models (trained on large amounts of adult speech data) perform on child speech. In addition, we perform finetuning experiments using different child speech datasets to investigate the performance of Whisper ASR on non-native English-speaking children's speech. Our findings indicate relative Word Error Rate (WER) improvements ranging from 29% to 89% over previous benchmarks on the same datasets. Notably, these gains were achieved by finetuning with only a 10% sample of unseen non-native datasets. These results demonstrate the potential of Whisper for improving ASR in a low-resource scenario for non-native child speech.


I. INTRODUCTION
While ASR performance for adult speech has improved in recent years due to the availability of large-scale transcribed speech corpora and the development of end-to-end (E2E) attention-based acoustic models [1], [2], [3], [4], the same benefits have not been extended to the child speech domain due to a lack of available transcribed child audio data. The acoustic variability in children's speech, caused by developmental changes of the vocal tract coupled with the child's limited linguistic and phonetic knowledge, affects the performance of ASR systems for this age group [5], [6], [7], [8]. Furthermore, the scarcity of data for ASR training in the child speech domain is an acute problem, as acquiring and annotating such data is a complex and resource-intensive task [9].
Recent developments in transfer learning have shown promising results in ASR, especially in recognizing speech from low-resource languages [10], [11], [12]. A key strategy involves finetuning an acoustic model that leverages frame-level acoustic representations derived from self-supervised models like wav2vec2 [3], [13], which were initially trained on vast amounts of unlabeled adult speech data using a masking objective. This has proven effective for downstream speech recognition applications with small amounts of labelled data. However, the self-supervised learning (SSL) training procedure for ASR is less effective under domain shift [14]: performance drops when the model encounters data that significantly differs from the training set, such as non-native child speech, making accurate recognition difficult.
Supervised transfer learning has also emerged as a promising solution to this problem. It adapts features learned from adult speech to enhance child speech recognition [15], [16], [17], [18]. Additionally, audio augmentation techniques, which expand the training dataset [19], [20], [21], have also been effective in boosting ASR performance for child speech. Recent work on ASR for non-native child speech has also explored transfer learning as a way to make significant improvements [15], [19], [20], [21]. For instance, the use of a pretrained transformer model for transfer learning has been investigated to better adapt to non-native children's speech [18]. Moreover, there have been notable strides in supervised learning approaches that show potential for child speech recognition [16], [22], [23]. Previous studies [24], [25] have demonstrated that training models across multiple datasets using supervised learning methods can enhance the model's ability to generalize to new, unseen datasets. This broad approach to training suggests a pathway towards more robust and adaptable ASR systems capable of handling the complexities of unseen child speech.
Given the low-resource nature of child speech and the limited datasets available for research use, this study opted to utilize the recent state-of-the-art (SOTA) Whisper [4] approach. Whisper addresses the challenges of weakly supervised speech recognition by training on large amounts of labelled adult audio data in a supervised manner. It has shown impressive performance on low-resource languages due to its multitask learning objectives and the use of multilingual datasets for training [4]. This research aims to investigate whether Whisper's multilingual training approach can enhance ASR performance in the particularly challenging area of low-resource child speech. First, we evaluate the performance of the original pretrained Whisper models on different native and non-native English child speech datasets. Since Whisper learns speech representations from a large number of multilingual audio datasets, we also adapt these models to non-native English child speech datasets through further finetuning.
The primary contribution of this paper lies in adapting and finetuning the Whisper model for child speech recognition. While finetuning large transformer models on small datasets is a well-established practice, our study goes beyond this by focusing on the unique characteristics and challenges associated with non-native child speech data. Through carefully designed experiments, we demonstrate the effectiveness of the Whisper transformer model and underscore the practical implications of our research, such as the promising applications of the finetuned model in real-world scenarios. Child speech recognition has wide-ranging applications in the education, healthcare, and accessibility domains. The main contributions of the paper are the following:
• Demonstrates significant performance improvements on non-native English child speech datasets.
• Showcases Whisper's ability to adapt effectively to diverse child speech datasets through finetuning.
• Demonstrates Whisper's resistance to catastrophic forgetting, maintaining performance on adult speech while improving child speech recognition.
• Provides insightful analysis and discussion of the outcomes derived from Whisper finetuning.

II. METHODOLOGY
In this work, the Whisper [4] model is used, showcasing the benefits of large-scale weakly supervised pretraining for improved ASR performance. Its training data comprises up to 680,000 hours of labelled audio, of which 117,000 hours cover 96 non-English languages and 125,000 hours are X→en translation data.

A. WHISPER ARCHITECTURE
The architecture of the Whisper model (see Figure 1) is based on an encoder-decoder transformer, which uses 80-channel log-Mel spectrograms as input. The encoder consists of two convolution layers with a kernel size of 3, sinusoidal positional encoding, and a stacked set of transformer blocks. The decoder uses learned positional embeddings and the same number of transformer blocks as the encoder. The model uses a byte-level BPE text tokenizer [26] for the English-only models and refits the vocabulary for the multilingual models to avoid excessive fragmentation in other languages. A multitask training format is used, where models are trained to perform various speech-processing tasks using a single decoder. Multitask training is done by conditioning the decoder on a sequence of input tokens that specify the task and desired output format. These tasks include multilingual speech recognition, spoken language detection, speech translation, and voice activity detection.
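As a concrete illustration, the pretrained encoder-decoder checkpoints and their input representation can be inspected with the Hugging Face Transformers library. The snippet below is a minimal sketch, not part of our training pipeline; it assumes the publicly released 'openai/whisper-tiny' checkpoint and uses one second of silence as a stand-in waveform.

```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")

# The feature extractor converts raw 16 kHz audio into the 80-channel
# log-Mel spectrogram that the encoder expects.
audio = torch.zeros(16000)  # one second of silence as a placeholder waveform
inputs = processor(audio.numpy(), sampling_rate=16000, return_tensors="pt")
print(inputs.input_features.shape)  # (1, 80, 3000): 80 Mel bins over a padded 30 s window

# Encoder and decoder have the same number of transformer blocks.
print(model.config.encoder_layers, model.config.decoder_layers)
```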

B. TRAINING DETAILS
The models were optimized using AdamW [27] and gradient norm clipping [28], with a linear learning rate decay after a warmup over the first 2048 updates. The pretrained Whisper models are categorized by size, namely: tiny, base, small, medium, and large (see Table 1). There are two versions of each model: one trained with multilingual data and one using only English data (indicated by '.en' in the name). We provide initial results without finetuning for all the available pretrained models, select the best-performing models, and apply finetuning to those. Architectural hyperparameter details can be found in Table 1.
Whisper is trained using a large amount of multilingual speech data, including low-resource languages, and we aimed to investigate whether this multilingual-focused ASR model could be used to improve performance on non-native child speech. We performed finetuning on the final layer of the pretrained Whisper models using paired child audio and transcripts for up to 4000 epochs [4], with a learning rate of 1e-05 and a linear learning rate scheduler.
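For reference, this configuration maps naturally onto the Hugging Face Seq2SeqTrainer, whose default optimizer is AdamW. The sketch below is illustrative only: the dataset objects, data collator, batch size, and warmup length are hypothetical placeholders rather than our exact experimental setup.

```python
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-child-finetuned",
    learning_rate=1e-5,              # learning rate used in our finetuning runs
    lr_scheduler_type="linear",      # linear learning rate decay
    warmup_steps=500,                # placeholder warmup length
    max_grad_norm=1.0,               # gradient norm clipping
    per_device_train_batch_size=16,  # placeholder batch size
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,                     # a pretrained WhisperForConditionalGeneration instance
    args=training_args,
    train_dataset=train_dataset,     # hypothetical prepared child speech training split
    eval_dataset=eval_dataset,       # hypothetical held-out split
    data_collator=data_collator,     # hypothetical padding collator for Whisper features
)
trainer.train()
```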

C. DECODING IN WHISPER
The Whisper ASR system employs several decoding strategies during inference [4], and these strategies are executed up to six times. The goal is to select the best transcription based on heuristics and the decoding strategies' performance.
1. Beam search with 5 beams: This strategy uses beam search, a common technique in ASR systems. It explores multiple hypotheses (in this case, five) and selects the one with the highest log probability as the final transcription. This approach favours more probable sequences.
2. Greedy decoding with best-of-5 sampling: Greedy decoding selects the most likely token at each step, while sampling introduces randomness. The system uses a sampling temperature schedule (0.0, 0.2, 0.4, 0.6, 0.8, 1.0) for successive attempts. Lower temperatures make the sampling more deterministic, while higher temperatures allow more randomness in token selection. This strategy explores a range of sampling behaviours to find the most suitable transcription.
These decoding strategies are applied to enhance transcription quality, particularly in situations where the model may be less certain, such as when background noise or other challenging audio conditions are present. The impact of these decoding strategies can vary across different datasets, as noted in the Whisper paper [4], but collectively they help improve transcription accuracy and reliability by considering both the model's confidence and the compression characteristics of the transcribed text. We do not use any external language models, since we intended to keep the decoding technique identical to the original implementation by the Whisper authors [4] and concentrate on the recognition capabilities of the finetuned acoustic models.
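A minimal sketch of this decoding configuration, assuming the openai-whisper reference package and a hypothetical input file, is shown below; the beam size, best-of value, and temperature fallback schedule mirror the settings described above.

```python
import whisper

model = whisper.load_model("medium")
result = model.transcribe(
    "child_utterance.wav",                       # hypothetical audio file
    beam_size=5,                                 # beam search with 5 beams at temperature 0
    best_of=5,                                   # best-of-5 sampling when temperature > 0
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # fallback schedule for uncertain segments
)
print(result["text"])
```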

III. CORPUS DESCRIPTION
The authors of Whisper do not explicitly list the training datasets used for pretraining [4]. We used the following child speech datasets for our finetuning and testing experiments: the MyST Corpus [29], PFSTAR Corpus [30], CMU_Kids Corpus [31], and Speechocean762 [32]. LibriTTS [33] was the only adult dataset used in the experiments, and only during inference.

A. DATASET CLEANING AND DESCRIPTION
Each dataset was cleaned according to the Whisper authors' text standardization guidelines [4]. Abbreviations, punctuation, white spaces, non-linguistic symbols, and other non-alphanumeric characters were removed from the transcripts, and all characters were changed to lowercase. All audio data was converted to 16-bit mono with a 16 kHz sampling rate and saved as '.wav' files, while the transcriptions were saved as '.txt' files (a minimal cleaning sketch is given after the dataset descriptions below). The child-data-specific cleaning methodology was kept consistent with [34]. Given the low-resource nature of non-native child speech datasets, we opted to split the available data into 80% for testing and 20% for training. Allocating a larger proportion of the data for testing helped obtain more objective results. The datasets used are described below:
1) LibriTTS [33] is a multispeaker English adult speech dataset. The 'dev-clean' subset of LibriTTS, with 9 hours of audio, is used as the representative adult speech test set for our finetuned models.
2) My Science Tutor (MyST) Corpus [29] is an American English child speech dataset containing over 393 hours of audio, of which 197 hours are fully transcribed. We use the cleaned version of this dataset (as described in [34]), with 65 hours of speech divided into two subsets: 55 hours for training, called 'MyST_train', and 10 hours for testing, called 'MyST_test'.
3) PFSTAR Corpus [30] contains a collection of words spoken by native British English children as well as non-native English child speech from Swedish, German, and Italian speakers. The cleaned PFSTAR British dataset (as described in [34]) contains a total of 12 hours of usable audio, divided into 10 hours for training, called 'PF_br_train', and 2 hours for testing, called 'PF_br_test'.
The PFSTAR Swedish subset contains 1.27 hours of English child speech with Swedish accents. It is divided into 1.01 hours (80%) for testing and 0.24 hours (20%) for training ('PF_sw_test' and 'PF_sw_train', respectively).
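The following is a minimal sketch, not our exact cleaning pipeline, of the text and audio standardization steps described at the start of this subsection: lowercasing, stripping punctuation and non-alphanumeric symbols, and converting audio to 16-bit, 16 kHz mono WAV. The file paths and helper names are illustrative placeholders.

```python
import re
import librosa
import soundfile as sf

def normalize_transcript(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^a-z0-9' ]+", " ", text)   # drop punctuation and non-alphanumeric symbols
    return re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace

def convert_audio(in_path: str, out_path: str) -> None:
    audio, _ = librosa.load(in_path, sr=16000, mono=True)  # resample to 16 kHz mono
    sf.write(out_path, audio, 16000, subtype="PCM_16")      # save as 16-bit '.wav'

print(normalize_transcript("Hello, World!  It's THREE o'clock."))  # hello world it's three o'clock
```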

B. DATASET USAGE IN TRAINING AND TESTING
MyST, PFSTAR British, and CMU_Kids are considered native English child speech datasets, while PFSTAR (Swedish, German, and Italian) and Speechocean762 are non-native English child speech datasets in this study.The dataset division into training and testing categories is presented in Table 2.
Due to the limited volume of non-native data, we consolidated the non-native training datasets described earlier into two distinct subsets for the purpose of finetuning (a sketch of this consolidation follows):
Non_Native_10 (NN_10): This subset comprises half of the data from the selected non-native training sets (PF_sw_train, PF_ge_train, PF_it_train, and SO_train) and represents 10% of the overall non-native data pool.
Non_Native_20 (NN_20): This subset encompasses the entire range of the non-native training datasets mentioned, including PF_sw_train, PF_ge_train, PF_it_train, and SO_train in their entirety, and constitutes 20% of the total non-native dataset.
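As one possible construction of these subsets (assuming the Hugging Face datasets library and hypothetical prepared Dataset objects for each training split), NN_20 concatenates the full non-native sets while NN_10 samples half of each:

```python
from datasets import concatenate_datasets

non_native_sets = [pf_sw_train, pf_ge_train, pf_it_train, so_train]  # hypothetical Dataset objects

# NN_20: the full non-native training pool (~20% of the non-native data).
nn_20 = concatenate_datasets(non_native_sets)

# NN_10: half of each set, sampled with a fixed seed for reproducibility (~10%).
halves = [d.shuffle(seed=42).select(range(len(d) // 2)) for d in non_native_sets]
nn_10 = concatenate_datasets(halves)
```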

IV. CODEBASE AND EXPERIMENTS
A. CODEBASE
The Whisper finetuning codebase used for implementing our initial testing and subsequent finetuning is available online.1 Our trained Whisper models are openly available on the Hugging Face platform.2 Information regarding the checkpoints, model parameters, learning rates, training curves, dataset availability, and access to the cleaned datasets is available on our GitHub repository.3 We followed the same finetuning approach as in our earlier work with the Whisper model [35]. This study is essentially a continuation of our previous research, where we specialized Whisper models for recognizing children's speech and compared them with the wav2vec2 self-supervised approach on the same distribution of datasets.
Nine sets of experiments were conducted, organized into groups A, B, C, D, E, F, G, H, and I as detailed below. Table 3 shows the Word Error Rates (WERs) obtained from these experiments; WER is a standard metric for evaluating ASR system performance. It quantifies the error rate in recognizing spoken words against a reference transcript, calculated by summing substitution, deletion, and insertion errors and dividing by the total word count in the reference. A lower WER indicates better performance.
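Formally, WER = (S + D + I) / N, where S, D, and I denote the numbers of substituted, deleted, and inserted words relative to the reference and N is the number of words in the reference transcript.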
Group A served as the baseline, comprising tests on the original Whisper models without finetuning; this establishes the benchmark performance. The remaining groups (B, C, D, E, F, G, H, and I) focused on the three top-performing models from Group A, which were finetuned using different distributions of child speech training datasets as detailed in Table 2. Various experiments were conducted to finetune the ASR models using different combinations of child speech datasets. The objective was to identify the optimal combinations of child audio training data that would result in the lowest WERs on diverse test datasets. Additionally, different data distributions were employed to determine which datasets were complementary and to identify those that hindered the improvement of the ASR model.
In Group B, the models were finetuned using the MyST_train dataset, which was selected for the initial finetuning experiments as it is the largest available child speech dataset. Subsequently, in the Group C and Group D experiments, the next largest datasets, namely CMU_train and PF_br_train, were added alongside MyST_train: Group C models used MyST_train and CMU_train, while Group D models used MyST_train and PF_br_train. For the Group E finetuning experiments, all three datasets, namely MyST_train, CMU_train, and PF_br_train, were used collectively. The remaining experiments focused on finetuning using different distributions of non-native English child speech datasets (NN_10 and NN_20) to study the performance of the finetuned models on the test datasets. Thus, in Group F the models were finetuned using the MyST_train, PF_br_train, and NN_10 datasets, while in Group G the models were finetuned using the MyST_train, PF_br_train, and NN_20 datasets. Finally, we wanted to assess ASR performance using the complete set of available datasets; therefore, Group H models were finetuned using the MyST_train, CMU_train, PF_br_train, and NN_10 datasets, and Group I models were finetuned with the MyST_train, CMU_train, PF_br_train, and NN_20 datasets.

V. RESULTS AND DISCUSSIONS
A. MAIN RESULTS FROM GROUP EXPERIMENTS
The results obtained from these experiments are presented in Table 3, and the lowest WERs are highlighted in bold.

1) GROUP A
The results from Group A report the WERs of the pretrained Whisper models across the various speech datasets outlined in Table 2. The findings show that the smaller models, including Tiny, Base, and Small, generally exhibit higher WERs than the larger models, namely Medium and Large, as documented in Table 3. This trend suggests that the larger models, due to their increased size, possess a greater capacity for generalization, thereby enhancing speech recognition accuracy. When comparing models of equivalent size, it was observed that English-only models outperform their multilingual counterparts. This indicates that models trained specifically on language-focused datasets exhibit improved performance for those particular languages. Based on these insights, the models demonstrating the most robust performance, specifically the 'Medium', 'Medium.en', and 'Large-V2' models, were chosen for subsequent finetuning experiments.

2) GROUP B
The finetuning of models with the MyST_train dataset in Group B resulted in notable enhancements in ASR performance across all test datasets, with the sole exception of the CMU_test dataset. This could indicate a mismatch between the characteristics of the CMU_test data and the training data used for finetuning, possibly due to accent, dialect, or speech complexity differences not adequately covered by the MyST_train dataset. The 'medium' model showed a marked reduction in WER across various datasets, including a significant drop from the baseline figures in Group A, highlighting the model's improved adaptability to different speech patterns post-finetuning. Overall, the average performance across the Group B models illustrates the tangible benefits of finetuning on child speech, with improvements evident in lower WERs for a majority of the test scenarios.

3) GROUP C
In Group C, the finetuning process incorporated the CMU_train dataset alongside MyST_train, aiming to investigate its impact on reducing the WER for CMU_test. Following this finetuning, there was a notable decrease in the WER for the CMU_test dataset, to as low as 2.32. However, this adjustment resulted in increased WERs across all other test datasets. Interestingly, the performance on the dev-clean dataset, which represents adult speech, remained unchanged. These outcomes hint at acoustic similarities between the CMU_Kids dataset and adult speech, evidenced by the stable performance on adult speech and the increased WERs on child speech datasets. This also suggests that the CMU_Kids dataset, while beneficial for targeting the specificities of CMU_test, may not align well with the acoustic properties of other child speech datasets. The disparity in WERs could also be linked to the inherent differences in domain or unique acoustic features present in the non-native test datasets, which are not sufficiently represented in the CMU_train dataset. Furthermore, the presence of low-quality audio within the CMU_train dataset might have adversely affected the model's performance, particularly evident in the heightened WERs observed on the PF_sw_test, PF_ge_test, PF_it_test, and SO_test datasets. This indicates that while targeted finetuning can enhance performance on specific datasets, it also underscores the challenge of balancing improvements across diverse speech datasets, especially when dealing with varying audio quality and distinct acoustic characteristics.

4) GROUP D
The decision to incorporate the PF_br_train dataset, a British English child speech dataset, into the finetuning process for Group D was influenced by the observed increase in WER across nearly all test datasets (except CMU_test) following the inclusion of CMU_train in Group C. This shift also aimed to assess the impact of PF_br_train on model performance across various speech recognition tasks. The results from the Group D finetuning demonstrate a marked improvement across all non-native child speech test datasets, with WER decreasing for all tests except CMU_test. This suggests that the PF_br_train dataset's characteristics are more closely aligned with the acoustic properties required for effective recognition of non-native child speech, enhancing the models' performance significantly. The lower WERs in Group D can be attributed to the complementary nature of the PF_br_train dataset.

5) GROUP E
In the Group E experiments, we used a combination of MyST_train, CMU_train, and PF_br_train. In comparison to the results from Groups C and D, there is a performance increase on all the seen datasets; however, performance degradation can be observed on the non-native datasets. This confirms that the CMU_train dataset had a negative impact on the performance on the non-native English test datasets. Notably, the WERs for CMU_test and PF_br_test dropped to 1.86 and 3.10, respectively, nearing human-level accuracy. These results indicate that having a similar distribution of data improves performance on both seen and unseen child speech datasets.

However, this improvement also points to a potential limitation: while the models became more proficient with data similar to their training set, their ability to generalize across diverse linguistic backgrounds weakened. The results underline the critical balance needed in selecting training datasets that perform well across a broad spectrum of speech recognition tasks.

6) GROUP F
In the Group F experiments, the finetuning included NN_10 along with the Group D training datasets. On comparing Groups D and F, the addition of this small amount of non-native speech resulted in significant improvements in performance on all non-native child speech test datasets, while the performance on the other test datasets remained unchanged. The addition of NN_10 also led to a decrease in the WER on CMU_test. This demonstrates that Whisper finetuning can enhance ASR performance on non-native child speech in a low-resource scenario and can be extended to other multi-accented non-native child speech.

7) GROUP G
The Group G finetuning experiments substituted NN_10 with NN_20, compared to Group F. This adjustment led to additional improvements across all non-native test datasets, with WERs dropping by 1-3% for each test dataset relative to the results from Group F. This further shows the benefit of including more extensive non-native speech data in finetuning to enhance ASR performance on non-native child speech.

8) GROUP H
In the Group H experiments, the finetuning included NN_10 along with the Group E training datasets. CMU_train was included in the finetuning to examine its impact when used in conjunction with the non-native datasets. Looking at the average WERs, no significant difference between Groups F and H can be observed, except for the CMU_test WER, which was expected. Surprisingly, adding CMU_train to the finetuning did not impact the performance on the non-native test datasets in this group.

9) GROUP I
In the Group I finetuning experiments, the NN_20 training data was included instead of the NN_10 data used in Group H. This resulted in further improvements, with WERs decreasing by 1-3% on all test datasets compared to Group H. Furthermore, it can be noted that the inclusion of CMU_train in the finetuning process did not have any noticeable effect on performance on the non-native test datasets within this group.

B. DISCUSSION
The findings of this study offer significant insights into the feasibility of finetuning the Whisper model on different combinations of child audio data to improve child speech recognition in both native and non-native English accents. This section delves deeper into the nuances of model performance, emphasizing the influence of model size, the role of dataset-specific finetuning, and the model's capability to adapt to diverse linguistic environments. The analysis also addresses the challenges in accent recognition and evaluates the Whisper model's resilience to catastrophic forgetting, highlighting its potential in the evolving field of speech recognition technology.
1. Model Size: Smaller models (e.g., Tiny, Small) tend to have higher WERs than larger models (e.g., Medium, Large-V2). This suggests that larger models have a better capacity to capture and represent speech patterns, which leads to lower WERs at the inference stage. It can also be seen that there is only a 1-2% WER difference between the medium, medium.en, and large-V2 models, suggesting an upper limit to the generalizability gained by increasing model size.
2. No Finetuning: In the Group A experiments, the models generally had low WERs on the American English test datasets, such as MyST_test and CMU_test, without any finetuning, compared with their results on the other test datasets. This implies that American-accented child speech has acoustic properties similar to adult speech.
3. Generalization: The models trained on the MyST_train dataset (Groups B through I) generally exhibit good generalization to other test scenarios, exhibiting relatively low WERs. This suggests that the finetuned models can effectively adapt to diverse speech recognition tasks even when trained with a single child speech dataset.
4. Finetuning Impact: Finetuning the models on specific datasets (Groups B through I) consistently leads to improved performance compared to the models without finetuning (Group A). This highlights the importance of adapting models to domain-specific data for better speech recognition.
5. Dataset Contribution: Among the finetuning datasets, the PF_br_train dataset (used in Groups D, F, and G) consistently provides the most significant improvements in WERs across various test scenarios. This indicates that incorporating a dataset with diverse linguistic features can greatly benefit the model's performance.
6. Limited Impact of the CMU_train Dataset: Finetuning with the CMU_train dataset (Groups C and E) shows relatively smaller improvements on the test datasets compared with PF_br_train. This suggests that the CMU_train dataset might not capture linguistic features as effectively as the PF_br_train dataset.
7. Additional Data Impact: The inclusion of additional multi-accented non-native training data, represented by NN_10 and NN_20 (used in Groups F, G, H, and I), yields substantial improvements in WERs on the non-native child speech test datasets compared with Groups B, C, D, and E. This implies that additional non-native training data, even in amounts as low as 10% of the unseen dataset, can improve the ASR system's performance.
8. Language and Accent: In our experiments, various child speech accents were used. Among the accented speech, British, Italian, and American accents were easier to improve on for child ASR tasks. Other accents, notably German, still posed challenges in ASR accuracy, although small improvements were still seen.
9. Catastrophic Forgetting: The WER on adult speech (LibriTTS 'dev-clean') remained in the range of 4-6% across all finetuning experiments. This shows that Whisper does not suffer from the catastrophic forgetting problem [36], which occurs when a model retrained on a dataset different from its original training data suffers a significant reduction in performance on data from the original training domain. The Whisper models were able to retain a similar WER on adult speech while also improving the WER for child speech. This may be attributed to the careful training strategies, architectural design, and regularization techniques used in Whisper [4].

C. COMPARISON WITH PREVIOUS SOTA RESULTS
Table 4 compares the results we obtained on the various test sets with previously reported results in the literature.
Our results show significant improvements over previously reported results. It is important to note, however, that prior researchers employed varied methodologies for data cleaning, and in the absence of a uniform standard for this process, a direct comparison cannot be provided. Consequently, we include these comparisons primarily to illustrate the effectiveness of our methodology on its own merits, rather than as a direct benchmark against previous work. This approach allows us to highlight the significant improvements our research contributes to the field while acknowledging the methodological differences that exist in data preprocessing practices. We report relative WER improvements of 29.7% on the MyST_test, 41.5% on the PF_br_test, 89.1% on the CMU_test, and 85.1% on the PF_sw_test datasets. During our research, other similar studies involving Whisper finetuning were also conducted, which utilized different volumes of the MyST dataset for finetuning and testing. The WERs from these studies are also presented in Table 4 (marked in blue) to allow a comparison of Whisper finetuning with varying volumes of the same child speech dataset.
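Here, relative improvement denotes the reduction in WER relative to the previously reported value, i.e., (WER_previous - WER_ours) / WER_previous, expressed as a percentage.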

VI. CONCLUSION
This study aims to enhance the performance of ASR on the native and non-native English child speech datasets available for research use. Whisper models, pretrained with large amounts of data, were selected as the basis of the experimental studies conducted as part of this work. Our work adapts these pretrained Whisper models to non-native child speech using finetuning. The effectiveness of Whisper finetuning on ASR performance was studied through various experimental combinations of datasets. It was observed that Whisper finetuning improves ASR performance on non-native English children's speech, a low-resource domain. Additionally, our approach outperforms previously reported results on the non-native child speech datasets used in this paper while using only 10% of these datasets during finetuning. The best WERs on Swedish-, German-, Italian-, and Chinese-accented non-native English child speech are reported in this paper. Using child speech data with different linguistic features can benefit overall ASR performance. German-accented speech was the most challenging for ASR, while British and American English speech was the least challenging. It was also observed that Whisper does not suffer from catastrophic forgetting when finetuned on new datasets.
For future work, we aim to include more training datasets from other low-resource languages in finetuning to further improve on these baseline results. The influence of external language models at the decoding stage on non-native child speech will also be examined. We also intend to conduct a three-way experimental analysis of the wav2vec2 [3], Whisper [4], and Conformer [2] models to study the strengths and limitations associated with each model when working with child speech.


TABLE 1. Architecture parameters for Whisper.

TABLE 2. Datasets used for training and testing.

TABLE 3. WER for the original and finetuned Whisper models over the different child speech test datasets used in this paper.

TABLE 4. Comparison of previously reported WER results with our results.