Exploring task-diverse meta-learning on Tibetan multi-dialect speech recognition

The disparities in phonetics and corpuses across the three major dialects of Tibetan exacerbate the difficulty of a single task model for one dialect to accommodate other different dialects. To address this issue, this paper proposes task-diverse meta-learning. Our model can acquire more comprehensive and robust features, facilitating its adaptation to the variations among different dialects. This study uses Tibetan dialect ID recognition and Tibetan speaker recognition as the source tasks for meta-learning, which aims to augment the ability of the model to discriminate variations and differences among different dialects. Consequently, the model’s performance in Tibetan multi-dialect speech recognition tasks is enhanced. The experimental results show that task-diverse meta-learning leads to improved performance in Tibetan multi-dialect speech recognition. This demonstrates the effectiveness and applicability of task-diverse meta-learning, thereby contributing to the advancement of speech recognition techniques in multi-dialect environments


Introduction
During the long process of historical development, Tibetan language has gradually evolved into three major dialects: the Ü-Tsang dialect, the Amdo dialect, and the Kham dialect.Among these, the Ü-Tsang dialect holds the status of the standard Tibetan, thereby garnering more attention and research in the domain of speech recognition [1][2][3][4].Conversely, the research and exploration of the other two dialects are relatively limited.Presently, the open speech data for the three Tibetan dialects amounts to approximately 80-100 h, whereas major languages like Mandarin Chinese and English boast vast corpus surpassing 10,000 h.As a result, the resources for Tibetan speech remain woefully inadequate.Because notable discrepancies in research progress have occurred across the three primary dialects, how to effectively perform speech recognition of Tibetan multi-dialect speech becomes a challenging problem.
In the case of exceedingly scarce speech data, the efficacy of pre-training and fine-tuning techniques exhibits a modest impact.Furthermore, as opposed to the Ü-Tsang dialect, the standard corpus of the Amdo and Kham dialects is inadequate.Consequently, in the domain of speech recognition of the Amdo and Kham dialects, fine-tuning the model with a very small amount of target data does not lead to more desirable results.So, it is necessary to explore new method aimed at enhancing the performance of recognition models.Meta-learning is usually used as a potential solution to the question of "How to make models more efficient for new data?".Meta-learning allows models to fit new data rapidly or with a few steps, which is compatible with the challenges encountered in the domain of low-resource language speech recognition [5][6][7].
Finn et al. [8] proposed a meta-learning algorithm known as model-agnostic meta-learning (MAML), which can apply diverse learning scenarios, including classification, regression, and reinforcement learning.The distinctive feature of MAML lies in its model-agnostic nature.It can apply to any gradient descent-trained model.The experiment results show that MAML has remarkable capabilities, particularly excelling in the domain of smallsample classification, where it has the highest performance.Furthermore, this algorithm has also achieved great outcomes in the domain of small-sample regression tasks, further proving its effectiveness.Gu et al. [9] used meta-learning on the neural machine translation for the low-resource language.In their work, they used a comprehensive framework that encompassed eighteen different languages as source tasks.To overcome the inherent challenges posed by input-output disparities across different languages, they used a universal vocabulary representation.Moreover, they selected five different languages from the source task as target tasks to evaluate the effectiveness of their meta-learning approach.The experimental results showed the superiority of the metalearning method over transfer learning.The findings highlighted the remarkable performance gains achieved through meta-learning, thereby emphasizing its potential as a potent technique for enhancing low-resource language translation tasks.Huang et al. [10] used metalearning as a training algorithm in the domain of multispeaker speech synthesis.This approach allows the model to be quickly adapted to speech synthesis, even when confronted with limited data for each speaker.Through a series of experiments, the results showed the capabilities of meta-learning in synthesizing highly authentic speech from a relatively smaller number of samples.Furthermore, the speed of adaptability was fast, facilitating rapid adjustments to individual speaker characteristics.These findings underscore the efficacy of meta-learning as a powerful method in multi-speaker speech synthesis.When the current better speech emotion recognition model is fed with multi-language data, the model's recognition performance is poor [11].Training corpus is a big problem when dealing with low-resource languages.The authors solved this problem by introducing metalearning.The experimental results show that the model with the addition of meta-learning performs better, and meta-learning is more advantageous when dealing with multi-language problems.
The aforementioned studies collectively highlight the feasibility and efficacy of adopting meta-learning methods as a means to tackle challenges associated with limited data.However, a noteworthy observation across these studies is that the source tasks and target tasks are similar, thereby falling to fully capitalize on the inherent advantages of meta-learning.In this paper, we aim to expand the task types during source task selection.In addition to incorporating existing high-resource language speech recognition tasks, we introduce novel dimensions in the selection process, including Tibetan dialect ID recognition and Tibetan speaker recognition.It takes full advantage of meta-learning's ability to learn generalized knowledge from multiple tasks.This paper's contributions are twofold.First, we integrate dialect identification (ID) recognition and speaker recognition into the meta-training framework of metalearning.Previous meta-learning research predominantly focused on aligning meta-training tasks with the target task, overlooking the holistic properties inherent to meta-learning.This expansion of meta-training encompasses additional dimensions crucial for comprehensive meta-learning, thereby enriching the meta-learning process and advancing its efficacy.The second contribution lies in using multi-task pre-training as an initialization step for meta-learning.This method facilitates a more favorable starting point for the meta-learning process, enhancing its efficiency and effectiveness.By leveraging the insights gained from multi-task pre-training, the meta-learning framework is equipped with a robust foundation, thereby optimizing its performance and potential for subsequent task adaptation.
The structure of this paper is as follows: Sect. 2 introduces related work, including the multi-task technique and the current research status of meta-learning for Tibetan speech recognition.Section 3 introduces the meta-learning method and our multi-task meta-learning.Section 4 describes the experimental data, setting, and results.Section 5 offers a discussion and summary of this work.

Related work
The development of Tibetan dialect speech recognition has been hindered by a lack of linguistic knowledge and corpus data.Presently, due to widely use of the Ü-Tsang dialect, more linguistic and phonological research has been carried out and more research results have been achieved.In contrast, the Amdo and Kham dialects suffer from a considerable scarcity of linguistic knowledge and corpus data, which severely diminishes the feasibility of constructing speech recognition systems for these dialects.However, multi-task and multi-language learning are common methods for enhancing the performance of multi-dialect language recognition, particularly for lowresource languages.
Kannan et al. [12] addressed the problem of crosslanguage training data imbalance by constructing an end-to-end multi-language speech recognition system.The experimental results showed that the end-to-end multi-language model had a lower error rate than the mono-language model.For the Japanese multi-dialect, Imaizumi et al. [13] used multi-task learning for dialect ID recognition and dialect recognition based on an endto-end model.The experimental results showed that the recognition results of the multi-task learning model were better than those of the traditional single-task model.In the work [14], Siddharth Dalmia et al. showed that multitask learning can improve the recognition performance of low-resource languages.Zhao et al. [15] used the multi-task learning method with shared hard-parameter to build a multi-task WaveNet-CTC model.The experimental results show that the multi-task model has high recognition accuracy for three Tibetan dialects, surpassing the single-dialect baseline model.Dan et al. [16] used a soft-parameter shared multi-task learning method to build a multi-task Transformer model.The experimental results show that the recognition accuracy of this model is higher than that of a single-dialect model and a hardparameter shared multi-dialect model.However, multitask learning may lead to interference among tasks.The weights of each task need be carefully designed.This in turn will increase the difficulty of training the model.But meta-learning is unbiased towards all tasks and does not weigh tasks.It just improves the learning ability of the model through different tasks.
The core idea of meta-learning is to make machine learning models better adapt to new tasks or new domains with a few labeled samples by learning how to learn [17][18][19].By pre-training the model's initial parameters on different tasks, these initial parameters can learn some internal features that are more suitable for the delivery.When the model encounters a new task, the loss function of the new task will be more sensitive to the initial parameters, so that even a small amount of gradient update can make the model quickly converge, to achieve fast learning.In 2020, Hsu et al. [20] proposed to use meta-learning for low-resource automatic speech recognition.Their experiment uses six high-resource languages as source languages and four low-resource languages as target languages.The experimental results show that meta-learning performs better than pre-training on all target languages with different combinations of source languages.At present, there are relatively little researches on Tibetan speech recognition based on metalearning.Dhanva Eledath et al. [21] used model-agnostic meta-learning to train the Conformer model, leading to enhanced speech recognition accuracy across various low-resource English accents.Qin et al. [22] used metalearning based on fine-grained modeling units for speech recognition of Lhasa-Tibetan.The experimental results show that meta-learning is better than multi-language pre-training in the low-resource language Tibetan.So meta-learning has a unique advantage for low-resource languages.
Multi-task learning achieves performance gains for each of these individual tasks.However, the source and target tasks in the above meta-learning are speech recognition tasks.This way does not fully utilize the characteristic of meta-learning to acquire general knowledge from multiple tasks and the advantages of multi-task learning.Therefore, in this paper, the source tasks include not only speech recognition tasks but also two different types of tasks: Tibetan speaker recognition and Tibetan dialect ID recognition.

Meta-learning
Unlike multi-task learning, meta-learning places a strong emphasis on a model's learning capacity rather than its performance on each specific task.Meanwhile, by enabling the model to swiftly adapt to new tasks, meta-learning has a great advantage over pre-training, particularly in facing low-resource multi-dialect.The adaptability of the model is greatly enhanced, enabling it to respond more rapidly and effectively to new problems.The learning process maximizes the sensitivity of the loss function to the model's parameters for each new task.As the sensitivity increases, even minute adjustments to the parameters trigger substantial feedback in the loss function, thereby facilitating the model's rapid and accurate parameter adjustments in response to these heightened feedback signals.This mechanism enables the model to fine-tune itself efficiently, ultimately leading to superior performance on various tasks, especially in low-resource and diverse linguistic contexts.
Meta-learning can be divided into two main processes: the meta-training process and the adaptive process.The meta-training process is the most essential feature that distinguishes meta-learning from traditional machine learning.The training unit of traditional machine learning is a single data, while the training unit of meta-training is a single task.Each task has a support set and a query set.The meta-training process is shown in Fig. 1.
T = {T 1 , T 2 , . . ., T n } represents the set of training tasks.The task T k is selected from the set, for which the support set is D tr k and the query set is D te k .The learning rate is η ′ , then the parameter θ ′ k in the task T k is updated in the way shown in Eq. (1): At this point, the overall parameters of the meta-learning are updated with the D te k of the task T k .The meta- learning rate is η .The parameters of the meta-learning are updated in the way shown in Eq. ( 2): (1) The update of meta-learning consists of an outer loop and an inner loop.The inner loop is the parameter update within each task.The outer loop is the parameter update performed between source tasks, i.e., the parameter update of meta-learning.Once the source tasks are all trained, the model is fine-tuned with the support set of the target task and the model is then tested on the query set.

Task-diverse meta-learning
Meta-learning can help the model make better use of the small amount of data, allowing the model to quickly adapt to changes in different dialects and speakers.Through meta-learning, the model can adapt and adjust the parameters of the model faster when it receives a new dialect, thus improving the overall recognition performance.The key idea of meta-learning is to enable the model to learn quickly on a new task through patterns and knowledge learned from different tasks.One of its strengths is very suitable for low-resource speech recognition.However, existing source task choices of metalearning for Tibetan speech recognition are all speech recognition tasks, which do not fully utilize the inherent advantages of meta-learning.
Transfer learning aims to capitalize on previously acquired knowledge and experiences to enhance performance in novel tasks or domains.However, the efficacy of transfer learning hinges on the degree of similarity between the source and target tasks.When faced with substantial disparities between tasks, the applicability of learned knowledge to the new task diminishes, thereby posing challenges to effective knowledge transfer.So, meta-learning and transfer learning diverge in selecting source data.Transfer learning focuses on utilizing source data that closely mirrors the characteristics of the target data, fostering a more seamless transition during the training process.This similarity between the source and target datasets enhances the effectiveness of the training, as the model can leverage existing (2) knowledge to learn new tasks more efficiently.In contrast, meta-learning operates without stringent requirements regarding source and target data similarity.Instead, it embraces a diverse array of source data, aiming to identify the most suitable parameters for rapid adaptation to varying tasks and contexts.This flexibility allows meta-learning algorithms to explore a wide range of data types and structures, enabling the model to generalize better across different scenarios.While transfer learning relies on source data similarity for effective training, meta-learning thrives on diversity, leveraging various data sources and skillfully applied parameters to optimize adaptation and mitigate overfitting.

Pre-training initialization for task-diverse meta-learning
The initial parameters of a deep learning model have an important impact on the performance and convergence speed of the model.Inappropriate initial parameters may lead to problems such as the model not converging properly, falling into local optimal solutions, vanishing or exploding gradients, etc.It affects the training effect and generalization ability of the model.So, the initial parameters of the model have a vital influence on the final effect of the model.In previous research related to meta-learning, the initial parameters of the model before meta-training are generally obtained by random initialization.The disadvantage of random initialization is that the parameters may be inappropriate.Random initialization does not ensure that each neuron receives the appropriate initial weights, which may result in consistently smaller or larger outputs for some neurons, making the model's performance ability insufficient.Generally, pre-training involves the initial training of a model on a large-scale dataset, followed by fine-tuning on the target task.This method circumvents the need to train the model from scratch and significantly reduces the demand for labeled data.

Experimental data
The experimental datasets in this paper are obtained from the Open Speech and Language Resources (OpenSLR), as shown in Table 1.Since Tibetan is a low-resource language, the existing corpus is limited.However, there are a large scale of Chinese and English corpus, so we use more Chinese and English data than Tibetan data for meta-learning.
The first dataset comes from the Aishell-1 Chinese dataset [23].It is a Chinese speech corpus covering 11 fields such as smart homes, autonomous driving, and industrial production.We select 80,604 sentences as the training set with a total duration of 100.42 h and 4551 sentences as the test set with a total duration of 5.67 h.The audio files are converted into Windows Audio Volume (WAV) format with 16 KHz sampling rate and 16-bit quantization accuracy.
The second dataset is the LibriSpeech English dataset [24].We select 28,539 sentences as the training set with a total duration of 100.59 h and 1530 sentences as the test set with a total duration of 5.39 h.The audio files are converted into Windows Audio Volume (WAV) format with 16 KHz sampling rate and 16-bit quantization accuracy.
The third dataset is the TIBMD@MUC Tibetan dataset [25].It is a multi-dialect Tibetan dataset.To compare the results with the works of [15,16], the same data were used for the experiments.So, we select 20594 Ü-Tsang sentences as the training set with a total duration of 23.74 h and 2575 Ü-Tsang sentences as the test set with a total duration of 2.98 h.There are 33 Ü-Tsang dialect native speakers including 13 males and 20 females.We select 17286 Amdo sentences as the training set with a total duration of 18.83 h and 2161 Amdo sentences as the test set with a total duration of 2.36 h.There are 19 Amdo dialect native speakers including 10 males and 9 females.We select 2376 Kham sentences as the training set with a total duration of 2.61 h and 269 Kham sentences as the test set with a total duration of 0.33 h.There are 10 Kham dialect native speakers including 7 males and 3 females.
The audio files are converted into Windows Audio Volume (WAV) format with 16KHz sampling rate and 16-bit quantization accuracy.

Experimental settings
The recognition model is based on the Conformer model [26], an architecture that has achieved remarkable results in the field of speech recognition in recent years.It combines the self-attention mechanism and the convolutional neural network.It has strong capabilities of modeling and context understanding.For low-resource languages, the introduction of the Conformer model can help the model to better capture the features and structures in the speech signal, thus achieving more accurate recognition.
Training and testing on a server equipped with the NVIDIA Quadro RTX 4000 were done.The Conformer model is built on the Pytorch framework, and it is trained with 80 epochs.The batch size is 10.For optimization, the Adam algorithm with gradient clipping is used.The learning rate is η ′ = 0.005; the meta-learning rate is η = 0.01.The label-smoothing technology is used during training, and the smoothing parameter α = 0.1.
Our model is not limited to speech recognition in the selection of source tasks for meta-learning and makes full use of the mutual benefit relationship between multiple tasks by introducing Tibetan speaker identification and Tibetan dialect ID recognition.The structure of the taskdiverse meta-learning method is shown in Fig. 2.This comprehensive method enhances the system's ability to understand and model diverse speech data.Through dialect ID recognition, the model can distinguish the differences among different dialects and perform more targeted training and optimization for a specific dialect.Through speaker recognition, the model can learn the characteristics of different speakers.The model can be better trained to accommodate differences between dialects and speakers.These additional source tasks help provide more speech features and contextual information to further improve the accuracy and robustness of the recognition.
We choose to use the pre-trained model as the initialization for meta-learning, which provides a set of highquality initial parameters for the model.It greatly reduces the randomness of the parameters and provides a clear direction for subsequent model training.The structure of the model is shown in Fig. 3. Our model realizes the meta-learning Tibetan multi-dialect speech recognition with pre-training initialization.
In our work, we choose English word recognition, Chinese character recognition, and Chinese pinyin recognition as the pre-training tasks.Here, we do not need to fine-tune the pre-trained model on the target language; just finetune the meta model after the meta-training is over.This pre-training method can provide a clear direction for the  (3) Character Error Rate = Insertions + Substitutions + Deletions Total Words × 100% The meta-learning tasks include English word recognition, Chinese character recognition, Tibetan speaker recognition, and Tibetan dialect ID recognition.This can enrich the variety of source tasks.The meta model can get

Experimental results and analysis
In Table 2, the experimental results are shown for the comparison of several methods on the Tibetan multi-dialect dataset.We compare our Task-Diverse Meta-Learning Conformer without pre-training with the Hard-MTL Transformer [15], the Single-Dialect Transformer, Multi-Dialect Transformer, Soft-MTL Transformer, and Adaptive Soft-MTL Transformer [16].Meanwhile, we also compare the pre-training and fine-tuning Conformer, which is first pre-trained in Chinese and English languages, followed by fine-tuning on the Tibetan multidialect dataset.The Task-Diverse Meta-Learning without pre-training is our model.
From Table 2, we can see that the Single-Dialect Transformer has the highest error rates among all three dialects.The effect of multi-task learning is obviously better than single-task learning, and the mutually beneficial relationships between multiple tasks benefits each task.The results of our proposed method are relatively good, but not the best, so we need to continue to explore it with pre-training process.
Previous meta-learning was initialized by random initialization.To avoid the drawbacks associated with random initialization, we initialize the model by using pre-training.The model is firstly trained on the Chinese and English, and then the trained model parameters are used as meta-learning initial parameters instead of the previous random initialization.The error rates of the two models on the Tibetan multiple dialects are shown in Table 3.
In Table 3, compared with Pre-Training Fine-Tuning Conformer, the experimental results show that the contribution of meta-learning to Tibetan multi-dialect speech recognition is greater than that of pre-training.The task-diverse meta-learning with pre-training obviously has better recognition rates, with a 1.33% reduction in the error rate for the Ü-Tsang dialect, a 0.24% reduction in the error rate for the Amdo dialect, and a 3.37% reduction in the error rate for the Kham dialect.After pre-training, initialization can improve the training efficiency of the model instead of blindly random initialization, thus adapting to new tasks faster.In comparison with previous models, our proposed task-diverse metalearning after pre-training achieves the best results on both Ü-Tsang and Kham dialects.On the Amdo dialect, our method is slightly worse than Adaptive Soft-MTL Transformer in cER.
Adaptive Soft-MTL Transformer is good at dealing with tasks with small differences, while meta-learning is more skilled at learning general knowledge from different tasks.Multi-task learning typically requires models to handle multiple tasks simultaneously, which can lead to interferences between tasks or require large amounts of data to balance the individual tasks.Meta-learning, on the other hand, focuses on training the model to learn how to learn compared to Adaptive Soft-MTL Transformer so that it can adapt to new tasks or new domains more quickly.So, our proposed model is generally better than Adaptive Soft-MTL Transformer.

Conclusion
This work aims to explore the application of task-diverse meta-learning methods to multi-dialect speech recognition in Tibetan.Traditional speech recognition models often face challenges of multi-dialect situations because there are significant differences between each dialect.These differences make it difficult for single-task models to adapt to the features of different dialects.To solve this problem, we introduce the idea of task-diverse metalearning.Multi-task meta-learning can help the model  learn more general and robust representations, enabling the model to better adapt to the differences between dialects.By introducing Tibetan dialect ID recognition and Tibetan speaker recognition tasks as source tasks for meta-learning, we force the model to focus on the commonalities and differences between dialects, thus improving the model's performance on the Tibetan multi-dialect speech recognition task.Task-diverse meta-learning is better able to utilize the knowledge of the source task to guide the learning of the target task.This task-diverse meta-learning method provides the model with richer training signals and enhances the model's understanding of the variations and differences between dialects.
In addition, we also compared our approach with hardparameter-sharing multi-task learning and soft-parameter-sharing multi-task learning models.The experimental results show that our method has significant improvement in the task of Tibetan multi-dialect speech recognition.Thus, our study provides an innovative method to solve the problem of multi-dialect speech recognition and lays the foundation for further improvement of speech recognition techniques in multi-dialect environments.

Fig. 1
Fig. 1 The meta-training process following meta-learning and improve the training efficiency of the model.The metric for experimental evaluation in this paper is Tibetan character error rate (CER).The method of calculating CER is shown in Eq. (3).

Fig. 2 Fig. 3
Fig. 2 The structure of task-diverse meta-learning.The inputs to the model are Fbank features.The initial parameters are randomly initialized.The four source tasks are Tibetan Dialect ID Recognition, Tibetan Speaker Recognition, Chinese Speech Recognition, and English Speech Recognition, respectively

Table 1
The statistics of dataset

Table 2
The error rate of Tibetan multi-dialect speech recognitionThe bold values represent the lowest error rate

Table 3
The error rates of task-diverse meta-learning with and without pre-trainingThe bold values represent the lowest error rate